Linear Regression and Planet Spacing

Ever wonder how big data could be applied to the study of our solar system? One data scientist did and used linear regression to get to an interesting conclusion.

A while back I wrote about how planets are evenly spaced on a log scale. I made a bunch of plots, based on our solar system and the extrasolar systems with the most planets, and noted that they're all roughly straight lines. Here's the plot for our solar system, including dwarf planets, with distance on a logarithmic scale.

This post is a quick follow-up to that one. You can quantify how straight the lines are by using linear regression and comparing the actual spacing with the spacing given by the best straight line. Here, I'm regressing the log of the distance of each planet from its star on the planet's ordinal position.
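As a rough illustration of that regression, here is a minimal sketch for the eight major planets only. The distances are rounded semi-major axes in AU that I am supplying as an assumption; this is not the data set behind the table below (which also includes dwarf planets), so the numbers it prints will not match the table. The function names are hypothetical, and the sketch computes only the slope, intercept, and adjusted R-squared, not p-values.

```python
import math

# Rounded semi-major axes in AU, Mercury through Neptune.
# Assumption for illustration only; not the author's data set.
DISTANCES_AU = [0.387, 0.723, 1.000, 1.524, 5.203, 9.537, 19.19, 30.07]


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def adjusted_r_squared(xs, ys, slope, intercept):
    """Adjusted R-squared for a model with a single predictor."""
    n = len(xs)
    mean_y = sum(ys) / n
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r2 = 1 - ss_res / ss_tot
    # One predictor, so the residual degrees of freedom are n - 2.
    return 1 - (1 - r2) * (n - 1) / (n - 2)


# Ordinal position: 1 for the innermost planet, 2 for the next, and so on.
positions = list(range(1, len(DISTANCES_AU) + 1))
log_distances = [math.log10(d) for d in DISTANCES_AU]

slope, intercept = fit_line(positions, log_distances)
adj_r2 = adjusted_r_squared(positions, log_distances, slope, intercept)

print(f"slope       = {slope:.4f}")
print(f"intercept   = {intercept:.4f}")
print(f"adjusted R2 = {adj_r2:.4f}")
```

A slope on the log10 scale translates into a constant ratio between successive distances of roughly 10 raised to the slope, which is what "evenly spaced on a log scale" means.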

NB: I am only using regression output as a measure of goodness of fit. I am not interpreting anything as a probability.

|-----------+-----------+----------+-----------|
| System    | Adjusted  | Slope    | Intercept |
|           | R-squared | p-value  | p-value   |
|-----------+-----------+----------+-----------|
| home      |    0.9943 | 1.29e-11 |  2.84e-08 |
| kepler90  |    0.9571 | 1.58e-05 |  1.29e-06 |
| hd10180   |    0.9655 | 1.41e-06 |  2.03e-07 |
| hr8832    |    0.9444 | 1.60e-04 |  5.57e-05 |
| trappist1 |    0.9932 | 8.30e-07 |  1.09e-09 |
| kepler11  |    0.9530 | 5.38e-04 |  2.00e-05 |
| hd40307   |    0.9691 | 2.30e-04 |  1.77e-05 |
| kepler20  |    0.9939 | 8.83e-06 |  3.36e-07 |
| hd34445   |    0.9679 | 2.50e-04 |  4.64e-04 |
|-----------+-----------+----------+-----------|

R² is typically interpreted as how much of the variation in the data is explained by the model. In the table above, the smallest value of adjusted R² is 0.9444, or about 94%.

p-values are commonly, and wrongly, understood to be the probability of a model assumption being incorrect. As I said above, I'm completely avoiding any interpretation of p-values as the probability of anything, only noting that small values are consistent with a good fit.

Journals commonly, and wrongly, are willing to assume that anything with a p-value less than 0.05 is probably true. Some are saying the cutoff should be 0.005. There are problems with using any p-value cutoff, but I don't want to get into that here. I'm only saying that small p-values are typically seen as evidence that a model fits, and the values above are orders of magnitude smaller than what journals consider acceptable evidence.

When I posted my article about planet spacing, I got some heated feedback saying that this isn't exact, that it's unscientific, etc. I thought that was strange. I never said it was exact, only that it was a rough pattern. And although it's not exact, it would be hard to find empirical studies of anything with such a good fit. If you held economics or psychology, for example, to the same standards of evidence, there wouldn't be much left.

This pattern is known as the Titius-Bode law. I stumbled upon it by making some plots. I assumed from the beginning that someone else must have done the same exercise and that the pattern had a name, but I didn't know that name until later.

Someone sent me a paper that analyzes the data on extrasolar planets and Bode's law, something much more sophisticated than the crude sketch above, but, unfortunately, I can't find it this morning. I don't recall what they did. Maybe they fit a hierarchical model where each system has its own slope and intercept.

One criticism has been that by regressing against planet order, you automatically get a monotone function. That's true, but you do get a much better fit on a log scale than on a linear scale in any case. You might look at just the relative planet spacings without reference to order.
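To make the log-versus-linear comparison concrete, here is a small sketch that fits a line to the raw distances and to their logs and compares the two R-squared values. It then prints the ratios of consecutive distances, which look at relative spacing without regressing on order. As before, the rounded distances for the eight major planets are my assumption rather than the data used for the table, and `r_squared` is a hypothetical helper name.

```python
import math

# Rounded semi-major axes in AU, Mercury through Neptune.
# Assumption for illustration only; not the author's data set.
DISTANCES_AU = [0.387, 0.723, 1.000, 1.524, 5.203, 9.537, 19.19, 30.07]


def r_squared(xs, ys):
    """R-squared of the least-squares line through (xs, ys).

    With a single predictor this equals the squared correlation
    between xs and ys.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy ** 2 / (sxx * syy)


positions = list(range(1, len(DISTANCES_AU) + 1))

# Both series are monotone in planet order, so both give a positive fit;
# the question is which scale is closer to a straight line.
r2_linear = r_squared(positions, DISTANCES_AU)
r2_log = r_squared(positions, [math.log(d) for d in DISTANCES_AU])

print(f"R2 on linear scale: {r2_linear:.4f}")
print(f"R2 on log scale:    {r2_log:.4f}")

# Relative spacing without reference to order: the ratio of each distance
# to the one before it. Even spacing on a log scale means these ratios
# are roughly constant.
ratios = [outer / inner for inner, outer in zip(DISTANCES_AU, DISTANCES_AU[1:])]
print("consecutive ratios:", [round(r, 2) for r in ratios])
```

The choice of logarithm base does not affect R-squared, since changing the base only rescales the response by a constant.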

Published at DZone with permission of
