On the Interpretation of a Regression Model

Who's to say which variable in a regression model implies causality? We explore this question while examining a regression model based on temperature data.

Yesterday, NaytaData (aka @NaytaData) posted a nice graph on Reddit showing daily bicycle traffic and mean air temperature in Helsinki, Finland.

I found that graph interesting, so I asked for the data (NaytaData kindly sent them to me tonight).

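# Load the daily cyclist counts and mean temperatures
df <- read.csv("cyclistsTempHKI.csv")
library(ggplot2)
# Cyclists as a function of temperature, with a smoothed trend
# (span sets the smoothness of the default loess fit)
ggplot(df, aes(meanTemp, cyclists)) +
  geom_point() +
  geom_smooth(span = 0.3)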

But as someone mentioned on Twitter, the interpretation is somewhat trivial: people get out on their bikes when the weather is nice. The hotter it is, the more cyclists there are on the road. That reading is implicitly causal: temperature drives the number of cyclists.

But actually, we can also visualize the data as follows, as suggested by Antoine Chambert-Loir:

# Same data with the axes swapped: temperature as a function of cyclists
ggplot(df, aes(cyclists, meanTemp)) +
  geom_point() +
  geom_smooth(span = 0.3)

The interpretation would then be that the more cyclists there are on the road, the hotter it is. Why not consider this causal interpretation here? Perhaps cyclists go so fast, or sweat so much, that they raise the temperature...

Of course, this is the standard (recurrent) "correlation is not causality" discussion. But in regression models we like to tell a story and pretend that it is a causal one, even though we do not prove it. Here, we know that the first story is more credible than the second, but how do we know that? To go further, how could we use machine learning techniques to establish causal relationships? How could a machine choose between the first and the second story?
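To make the symmetry concrete, here is a minimal sketch of why a simple linear fit cannot settle the question by itself. It uses simulated data as a stand-in for the Helsinki file (the numbers and the `sim` data frame are made up for illustration; only the column names `meanTemp` and `cyclists` come from the code above), and it swaps the loess smoother for a plain `lm` fit, which is an assumption made to keep the comparison simple.

```r
# Simulated stand-in for the real data (all values are hypothetical)
set.seed(1)
n <- 365
meanTemp <- runif(n, min = -15, max = 25)                # made-up daily mean temperature
cyclists <- 3000 + 250 * meanTemp + rnorm(n, sd = 1500)  # made-up daily counts
sim <- data.frame(meanTemp, cyclists)

# Fit the regression in both directions
fit_xy <- lm(cyclists ~ meanTemp, data = sim)  # "temperature explains cyclists"
fit_yx <- lm(meanTemp ~ cyclists, data = sim)  # "cyclists explain temperature"

r2_xy <- summary(fit_xy)$r.squared
r2_yx <- summary(fit_yx)$r.squared
r <- cor(sim$meanTemp, sim$cyclists)

# Both directions give the same R^2, equal to the squared correlation
all.equal(r2_xy, r2_yx)
all.equal(r2_xy, r^2)

# The product of the two slopes is also equal to r^2
all.equal(unname(coef(fit_xy)["meanTemp"] * coef(fit_yx)["cyclists"]), r^2)
```

Each comparison should return TRUE: in a simple linear regression, goodness of fit is the same whichever variable is placed on the left-hand side, so it gives a machine no basis for preferring one story over the other.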

Topics:
big data, regression model, big data analytics

Published at DZone with permission of
