On the Interpretation of a Regression Model

Who's to say which variable in a regression model implies causality? We explore this question while examining a regression model based on temperature data.


Yesterday, NaytaData (aka @NaytaData) posted a nice graph on Reddit showing daily bicycle traffic and mean air temperature in Helsinki, Finland.

I found that graph interesting, so I asked for the data (NaytaData kindly sent it to me tonight).

library(ggplot2)

# Daily cyclist counts against mean air temperature, with a smoothed trend
ggplot(df, aes(meanTemp, cyclists)) +
  geom_point() +
  geom_smooth(span = 0.3)

But as someone mentioned on Twitter, the interpretation is somewhat trivial: people get out on their bikes when the weather is nice. The hotter it is, the more cyclists there are on the road. This reading is a causal one: temperature drives the number of cyclists.

But actually, we can also visualize the data as follows, as suggested by Antoine Chambert-Loir:

# Same data with the axes swapped
ggplot(df, aes(cyclists, meanTemp)) +
  geom_point() +
  geom_smooth(span = 0.3)

The interpretation would then be that the more cyclists there are on the road, the hotter it is. Why not consider this causal interpretation? Perhaps cyclists go so fast, or sweat so much, that they raise the temperature...

Of course, this is the standard (recurrent) "correlation is not causality" discussion. But in regression models we like to tell a story, to pretend that we have some sort of causal account, without ever proving it. Here, we know that the first story is more credible than the second, but how do we know that? To go further, how can we use machine learning techniques to prove causal relationships? How could a machine choose between the first and the second story?
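To see why a machine cannot pick a story from goodness of fit alone, here is a minimal sketch. It rests on several assumptions: the data are simulated (they are not NaytaData's measurements, and the coefficients used to generate them are arbitrary), and it fits a plain linear model with lm() instead of the smoother used in the plots above. Only the column names meanTemp and cyclists are borrowed from the plots.

```r
# Simulated stand-in for the Helsinki data (NOT the real measurements).
# The generating coefficients below are arbitrary, chosen for illustration only.
set.seed(1)
n <- 365
meanTemp <- runif(n, min = -15, max = 25)
cyclists <- 3000 + 250 * meanTemp + rnorm(n, sd = 1500)
df_sim <- data.frame(meanTemp = meanTemp, cyclists = cyclists)

# Story 1: temperature "explains" the number of cyclists.
fit_1 <- lm(cyclists ~ meanTemp, data = df_sim)
# Story 2: the number of cyclists "explains" temperature.
fit_2 <- lm(meanTemp ~ cyclists, data = df_sim)

r2_1 <- summary(fit_1)$r.squared
r2_2 <- summary(fit_2)$r.squared
r <- cor(df_sim$meanTemp, df_sim$cyclists)

# Product of the two fitted slopes.
slope_product <- unname(coef(fit_1)["meanTemp"] * coef(fit_2)["cyclists"])

# Both directions share the same R^2, which is the squared correlation,
# and the product of the two slopes equals that same quantity.
stopifnot(isTRUE(all.equal(r2_1, r2_2)))
stopifnot(isTRUE(all.equal(r2_1, r^2)))
stopifnot(isTRUE(all.equal(slope_product, r^2)))
```

In a simple linear regression with an intercept, the R² is symmetric in the two variables, so this measure of fit carries no information about which story is the right one. Any preference for the first story has to come from outside the fitted model.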


Published at DZone with permission of

