On the Interpretation of a Regression Model

Who's to say which variable in a regression model implies causality? We explore this question while examining a regression model based on temperature data.

Yesterday, NaytaData (aka @NaytaData) posted a nice graph on Reddit showing bicycle traffic and mean air temperature, per day, in Helsinki, Finland.

I found that graph interesting, so I asked for the data (NaytaData kindly sent them to me tonight).

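library(ggplot2)
# Cyclists per day as a function of mean temperature (span is used by the loess smoother)
ggplot(df, aes(meanTemp, cyclists)) +
  geom_point() +
  geom_smooth(span = 0.3)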

But as someone mentioned on Twitter, the interpretation is somewhat trivial: people get out on their bikes when the weather is nice. The hotter it is, the more cyclists there are on the road. Note that this reading is a causal one...

But actually, we can also visualize the data as follows, as suggested by Antoine Chambert-Loir:

# Same data with the axes swapped: mean temperature as a function of cyclists
ggplot(df, aes(cyclists, meanTemp)) +
  geom_point() +
  geom_smooth(span = 0.3)

The interpretation would then be that the more cyclists there are on the road, the hotter it is. Why not consider this causal interpretation here? Maybe cyclists go so fast, or sweat so much, that they increase the temperature...

Of course, this is the standard (recurrent) "correlation is not causality" discussion. But in regression models, we like to tell a story, to pretend that we have some sort of causal explanation, even though we do not prove it. Here, we know that the first story is more credible than the second one, but how do we know that? To go further, how can we use machine learning techniques to prove causal relationships? How could a machine choose between the first and the second story?
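
To make the last question concrete, here is a minimal sketch of why the quality of the fit alone cannot pick a direction, at least for a plain linear regression. Everything in it is an assumption made for illustration: the data are simulated (they are not NaytaData's data), only the column names `meanTemp` and `cyclists` are borrowed from the plots, and `lm()` replaces the smoothers used in the graphs.

```r
# Simulated stand-in for the Helsinki data (NOT the real data set):
# column names match the plots, but all numbers are made up.
set.seed(1)
n <- 365
meanTemp <- runif(n, min = -15, max = 25)
cyclists <- 2000 + 150 * meanTemp + rnorm(n, sd = 800)
df_sim <- data.frame(meanTemp = meanTemp, cyclists = cyclists)

# Story 1: temperature "explains" the number of cyclists.
fit_1 <- lm(cyclists ~ meanTemp, data = df_sim)
# Story 2: the number of cyclists "explains" temperature.
fit_2 <- lm(meanTemp ~ cyclists, data = df_sim)

r2_1 <- summary(fit_1)$r.squared
r2_2 <- summary(fit_2)$r.squared
r <- cor(df_sim$meanTemp, df_sim$cyclists)

# Both directions have the same R^2, equal to the squared correlation.
same_r2 <- isTRUE(all.equal(r2_1, r2_2))
r2_is_r_squared <- isTRUE(all.equal(r2_1, r^2))

# The product of the two slopes is also the squared correlation.
slope_product <- unname(coef(fit_1)["meanTemp"] * coef(fit_2)["cyclists"])
slopes_match <- isTRUE(all.equal(slope_product, r^2))

print(c(same_r2 = same_r2,
        r2_is_r_squared = r2_is_r_squared,
        slopes_match = slopes_match))
```

In this linear setting, the two stories fit the data equally well, so preferring the first one has to come from something outside the regression, such as what we already know about weather and cyclists.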

Published at DZone with permission of Arthur Charpentier, DZone MVB. See the original article here.
