If Big Data came in a box, it would be stamped, “Warning: Correlation does not imply causation” on the side. There’s a common thread among Big Data stories, often told as exciting tales of wonder, that correlation somehow approximates causation. It sometimes gets expressed in oblique arguments that more data is better and in stories of the search for the perfect algorithm.
It isn’t that simple, as we wrote a while back: in some cases, like when choosing wine, small data actually matters far more than big. It can come down to simply whether the wine buyer likes wine with heavy tannins or not…so much for bouquet, texture and fruit.
Causation versus correlation
If you’re new to this area, I should explain that causation means A causes B, whereas correlation means only that A and B tend to be observed together. These are very different things when it comes to Big Data, but the difference often gets glossed over or ignored. Whether correlation is “good enough” to act on without knowing the cause depends entirely on the problem being solved and the risks of being wrong.
Is correlation good enough? It depends…
“For many everyday needs, knowing what not why is good enough.” The book is full of such examples from making better diagnostic decisions when caring for premature babies to which flavor Pop-Tarts to stock at the front of the Walmart store before a hurricane. Big data can help answer these questions, but they never required “knowing why.” Big data analysis can be about correlations OR causation—it all depends, as it has always been, on what question we are asking, what problem we are solving, and what goal we are trying to achieve.
Gil isn’t the only one making this distinction. Algorithms by themselves don’t tell you what data means and without human input or direction (in the form of hypotheses or data discrimination), can actually steer understanding of data in the wrong direction. Data science, it seems, requires a healthy sense of skepticism.
This is exactly what makes data scientists so hard to find…it isn’t about the ‘big-ness’ of data or the algorithm’s perfection. It is about knowing a great deal about the data so that the true meaning can be coaxed out, not squeezed out by a mindless process.
It will disappoint many to hear that there isn’t always an expensive, industrial solution to ever larger amounts of data. Instead, it often comes down to having great governance of data so that metadata (data about data) can be fully understood and taken into account.
Examples of correlation versus causation
Getting it wrong can be expensive. Freakonomics offers an example of mistaking correlation for causation that almost led the State of Illinois to send books to every child in the state, because studies showed that books in the home correlated with higher test scores. Later studies showed that children from homes with many books did better even if they never read them, leading researchers to correct their assumptions: homes where parents buy books tend to have an environment where learning is encouraged and rewarded. Correlation versus causation in plain view. Illinois didn’t have money to waste going in the wrong direction and neither does today’s enterprise.
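To make the books example concrete, here is a rough simulation sketch. Everything in it is invented for illustration: the variable names (`encouragement`, `books`, `score`), the coefficients, and the noise levels are assumptions, not figures from Freakonomics or any study. A hidden "encouraging home" factor drives both the number of books and the test score, while books themselves have no effect on scores at all.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def simulate(n=20000, seed=0):
    """Generate hypothetical households (all numbers are made up)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        # Hidden confounder: how much the home encourages learning.
        encouragement = rng.gauss(0, 1)
        # Encouraging homes tend to own more books...
        books = 50 + 20 * encouragement + rng.gauss(0, 10)
        # ...and their children tend to score higher.
        # Note that `books` does not appear in this formula at all.
        score = 70 + 8 * encouragement + rng.gauss(0, 5)
        rows.append((encouragement, books, score))
    return rows


rows = simulate()

# Correlation across all households: books and scores move together.
overall = pearson([b for _, b, _ in rows], [s for _, _, s in rows])

# Compare only households with nearly the same level of encouragement.
similar = [(b, s) for e, b, s in rows if abs(e) < 0.1]
within = pearson([b for b, _ in similar], [s for _, s in similar])

print(f"all households:        r = {overall:.2f}")
print(f"similar homes only:    r = {within:.2f}")
```

Across all households the correlation between books and scores is strong, yet among homes with a similar level of encouragement it nearly vanishes. In this toy world, mailing books to every child would change nothing, which is the trap Illinois nearly fell into.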
A simple explanation
Khan Academy, probably the best broad-based learning site on the Internet, has this great video lesson on the difference between correlation and causation that is a great reminder of the limitations of data: