Big Data's Causation and Correlation Issue

If Big Data came in a box, it would be stamped, “Warning: Correlation does not imply causation” on the side. There’s a common thread among Big Data stories, often told as exciting tales of wonder, that correlation somehow approximates causation. It sometimes gets expressed in oblique arguments that more data is better and in stories of the search for the perfect algorithm.

It isn’t that simple, as we wrote a while back: in some cases, like when choosing wine, small data actually matters far more than big. It can come down to simply whether the wine buyer likes wine with heavy tannins or not…so much for bouquet, texture and fruit.

Causation versus correlation

If you’re new to this area, I should explain that causation means A causes B, whereas correlation means only that A and B tend to be observed together. These are very different things when it comes to Big Data, but the difference often gets glossed over or ignored. Whether correlation is “good enough” to act on without knowing the cause depends entirely on the problem being solved and the risks of being wrong.
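To make the distinction concrete, here is a minimal Python sketch, not taken from any real dataset. It assumes a hypothetical hidden driver that feeds both A and B; the variable names, noise levels and sample size are all made up for illustration. A and B never influence each other, yet they come out strongly correlated.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical hidden driver (say, daily temperature) that nobody logged.
hidden = [rng.gauss(0, 1) for _ in range(5000)]

# A and B each depend on the hidden driver plus their own noise.
# Neither appears in the other's formula, so neither causes the other.
a = [h + rng.gauss(0, 0.5) for h in hidden]
b = [h + rng.gauss(0, 0.5) for h in hidden]

r_ab = pearson(a, b)
print(f"correlation(A, B) = {r_ab:.2f}")
```

With these assumed noise levels the correlation lands near 0.8, even though changing A would do nothing to B. An algorithm that only sees A and B has no way to know that.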

Is correlation good enough? It depends…

Gil Press, writing in Forbes, explains this idea very well in his review of the recently published and widely discussed book Big Data: A Revolution that Will Transform How We Live, Work, and Think:

“For many everyday needs, knowing what not why is good enough.” The book is full of such examples from making better diagnostic decisions when caring for premature babies to which flavor Pop-Tarts to stock at the front of the Walmart store before a hurricane. Big data can help answer these questions, but they never required “knowing why.” Big data analysis can be about correlations OR causation—it all depends, as it has always been, on what question we are asking, what problem we are solving, and what goal we are trying to achieve.

Going off the road

Gil isn’t the only one making this distinction. Algorithms by themselves don’t tell you what data means and without human input or direction (in the form of hypotheses or data discrimination), can actually steer understanding of data in the wrong direction. Data science, it seems, requires a healthy sense of skepticism.

This is exactly what makes data scientists so hard to find…it isn’t about the ‘big-ness’ of data or the algorithm’s perfection. It is about knowing a great deal about the data so that the true meaning can be coaxed out, not squeezed out by a mindless process.

It will disappoint many to hear that there isn’t always an expensive, industrial solution to ever larger amounts of data. Instead, it often comes down to having great governance of data so that metadata (data about data) can be fully understood and taken into account.

Examples of correlation versus causation

Getting it wrong can be expensive, as a Freakonomics example of mistaking correlation for causation shows. The State of Illinois almost sent books to every child in the state because studies showed that books in the home correlated with higher test scores. Later studies showed that children from homes with many books did better even if they never read them, leading researchers to correct their assumptions: homes where parents buy books tend to be environments where learning is encouraged and rewarded. Correlation versus causation in plain view. Illinois didn’t have money to waste going in the wrong direction and neither does today’s enterprise.
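The books story can be sketched as a small simulation. Everything here is an assumption for illustration, not data from the studies: a hidden "engaged parents" flag, the probabilities of owning books, and the score distributions are invented. In this toy model, test scores depend only on parental engagement, never on the books themselves.

```python
import random
from statistics import mean

rng = random.Random(7)  # fixed seed so the run is repeatable

# Each child is (engaged_parents, has_books, test_score). All numbers are made up.
children = []
for _ in range(20000):
    engaged = rng.random() < 0.5                          # hidden factor
    has_books = rng.random() < (0.8 if engaged else 0.2)  # engaged parents buy more books
    score = rng.gauss(70 if engaged else 60, 5)           # score ignores books entirely
    children.append((engaged, has_books, score))

def score_gap(rows):
    """Mean score of children with books minus mean score of those without."""
    with_books = [s for _, b, s in rows if b]
    without_books = [s for _, b, s in rows if not b]
    return mean(with_books) - mean(without_books)

# Looking at everyone, books appear to "raise" scores by several points.
overall_gap = score_gap(children)

# Comparing like with like (same parental engagement), the gap disappears.
gap_engaged = score_gap([c for c in children if c[0]])
gap_not_engaged = score_gap([c for c in children if not c[0]])

print(f"gap across all children:       {overall_gap:.1f}")
print(f"gap within engaged homes:      {gap_engaged:.1f}")
print(f"gap within less engaged homes: {gap_not_engaged:.1f}")
```

Across all children the gap is several points, but within each group it is close to zero. In this model, mailing books to every home would change nothing, which is the kind of mistake Illinois nearly paid for. It also shows why knowing the data matters: the comparison only works if someone thought to record parental engagement in the first place.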

A simple explanation

Khan Academy, probably the best broad-based learning site on the Internet, has a great video lesson on the difference between correlation and causation that is a good reminder of the limitations of data.

Published at DZone with permission of Christopher Taylor, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
