“Does Smoking Cause Cancer?”
We have heard that many smokers have lung cancer. However, can we mathematically show that smoking causes cancer?
We can look at cancer patients and check how many of them smoke. We can look at smokers and check whether they develop cancer. Let’s assume that both answers come up as 100%. That is, hypothetically, we can see a 1–1 relationship between smoking and cancer.
Ok great, can we claim that smoking causes cancer? Apparently, it is not easy to make that claim. Let’s assume that there is a gene that causes cancer and also makes people like to smoke. If that is the case, we will see the same 1–1 relationship between cancer and smoking. In this scenario, however, cancer is caused by the gene. That means there may be an innocent explanation for the 1–1 relationship we saw between cancer and smoking.
This example illustrates two concepts from statistics that play a key role in Data Science and Big Data: correlation and causality. Correlation means that two readings move together (e.g. smoking and cancer), while causality means one is the cause of the other. The key difference is that if there is causality, removing the first will change or remove the second. That is not the case with mere correlation.
Correlation Does Not Mean Causation!
This difference is critical when deciding how to react to an observation. If there is causality between A and B, then A is responsible for B. We might decide to punish A in some way, or we might decide to control A. However, correlation alone does not warrant such actions.
For example, as described in the post The Blagojevich Upside, the state of Illinois found that having books at home is highly correlated with better test scores, even if the kids have not read them. So they decided to distribute books. In retrospect, we can easily find a common cause: having books at home could be an indicator of how studious the parents are, which helps produce better scores. Sending books home, however, is unlikely to change anything.
You see correlation without causality when there is a common cause that drives both readings. This is a common theme in such cases. You can find a detailed discussion of causality in the talk “Challenges in Causality” by Isabelle Guyon.
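To make the common-cause idea concrete, here is a minimal Python sketch. Everything in it is made up for illustration: a hypothetical “gene” drives both smoking and cancer, and smoking never enters the cancer equation at all, yet the two variables come out strongly correlated.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Hypothetical common cause: a "gene" carried by 30% of the population.
gene = rng.random(n) < 0.3

# Gene carriers tend to smoke AND tend to develop cancer,
# independently of each other given the gene.
smokes = gene & (rng.random(n) < 0.9)
cancer = gene & (rng.random(n) < 0.9)

# Smoking never appears in the cancer equation above, yet the
# observed correlation between the two is strong.
corr = np.corrcoef(smokes, cancer)[0, 1]
print(f"correlation between smoking and cancer: {corr:.2f}")

# "Intervening" on smoking (forcing everyone to quit) would leave
# the cancer rate untouched in this toy model, because the gene,
# not smoking, drives cancer here.
print(f"cancer rate: {cancer.mean():.3f}")
```

In this toy model, an observational study would see smokers getting cancer at a far higher rate than non-smokers, which is exactly why observation alone cannot separate the two explanations.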
Great, so how can we show causality? Causality is measured through randomized experiments (a.k.a. randomized trials or A/B tests). A randomized experiment selects samples and randomly breaks them into two groups, called the control and the variation. Then we apply the cause (e.g. send a book home) to the variation group only and measure the effect (e.g. test scores). Finally, we establish causality by comparing the effect in the control and variation groups. This is how medications are tested.
To be precise, if the error bars for the two groups do not overlap, then there is causality. Check https://www.optimizely.com/ab-testing/ for more details.
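As a rough sketch of that comparison, here is a Python example (all the numbers are invented): we simulate test scores for a control group and a variation group that received the treatment, compute an approximate 95% confidence interval for each mean, and check whether the two intervals (the error bars) overlap.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented test scores: the "variation" group received the treatment
# (e.g. a book sent home) and, in this simulation, scores higher.
control   = rng.normal(70, 10, 500)
variation = rng.normal(75, 10, 500)

def ci95(x):
    """Approximate 95% confidence interval for the mean (normal approximation)."""
    m = x.mean()
    half = 1.96 * x.std(ddof=1) / np.sqrt(len(x))
    return m - half, m + half

lo_c, hi_c = ci95(control)
lo_v, hi_v = ci95(variation)

# Non-overlapping intervals suggest a real effect of the treatment.
overlap = not (hi_c < lo_v or hi_v < lo_c)
print(f"control:   [{lo_c:.1f}, {hi_c:.1f}]")
print(f"variation: [{lo_v:.1f}, {hi_v:.1f}]")
print("effect detected" if not overlap else "inconclusive")
```

The non-overlap check is a conservative shorthand; in practice you would use a proper significance test (e.g. a two-sample t-test), but the intuition is the same.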
However, that is not always practical. For example, if you want to prove that smoking causes cancer, you need to first select a population, place people randomly into two groups, make half of them smoke, and make sure the other half does not smoke. Then wait for 50 years and compare.
Did you see the catch? It is not good enough to compare smokers and non-smokers, as there may be a common cause, like a gene, that drives both. To prove causality, you need to randomly pick people and ask some of them to smoke. Well, that is not ethical, so this experiment can never be done. Actually, this argument has been used before (e.g. https://en.wikipedia.org/wiki/A_Frank_Statement).
This can get funnier. If you want to prove that greenhouse gases cause global warming, you need to find another copy of Earth, apply greenhouse gases to it, and wait a few hundred years!
To summarize, causality can sometimes be very hard to prove, and you really need to differentiate between correlation and causality.
Most big data datasets are observational data collected from the real world; hence, there is no control group. Therefore, most of the time, all you can show is correlation, and it is very hard to prove causality.
There are two common reactions to this problem. The first: “Big data guys do not understand what they are doing. It is stupid to try to draw conclusions without a randomized experiment.” I find this view lazy.
Obviously, there is a lot of interesting knowledge in observational data. If we can find a way to use it, that will let us apply these techniques in many more applications. We need to figure out a way to use it and stop complaining. If current statistics does not know how to do it, we need to find a way.
The second: “Forget causality! Correlation is enough.” I find this view blind. Playing the ostrich does not make the problem go away. This kind of crude generalization makes people do stupid things and can limit the adoption of Big Data technologies.
We need to find the middle ground!
The answer depends on what we are going to do with the data. For example, if we are just going to recommend a product based on the data, chances are that correlation is enough. However, if we are making a life-changing decision or a major policy decision, we might need causality.
Let us investigate both types of cases.
Correlation is enough when the stakes are low, or when we can later verify our decision. Following are a few examples:
- When the stakes are low (e.g. marketing, recommendations) — when showing an advertisement or recommending a product to buy, one has more freedom to make an error.
- As a starting point for an investigation — correlation is never enough to prove someone is guilty; however, it can show us useful places to start digging.
- Sometimes it is hard to know which things are connected, but easy to verify the quality given a choice. For example, if you are trying to match candidates to a job or decide on good dating pairs, correlation might be enough. In both cases, given a pair, there are good ways to verify the fit.
There are other cases where causality is crucial. Following are a few examples.
- Find a cause for disease
- Policy decisions (Would a $15 minimum wage be better? Would free health care be better?)
- When the stakes are too high (Shutting down a company, passing a verdict in court, sending a book to each kid in the state)
- When we are acting on the decision (Firing an employee)
Even in these cases, correlation can be useful for finding good experiments to run. You can identify factors that are correlated and design experiments to test causality, which reduces the number of experiments you need to run. In the book example, the state could have run an experiment by selecting a population, sending a book to half of them, and comparing the outcomes.
In some cases, you can build your system to inherently run experiments that let you measure causality. Google is famous for A/B testing every small thing, down to the placement of a button and the shade of a color. When they roll out a new feature, they select a population, roll out the feature to only part of it, and compare the two groups.
In all of these cases, correlation is pretty useful. However, the key is to make sure that the decision makers understand the difference when they act on the results.
Causality can be a pretty hard thing to prove. Since most big data is observational, often we can only show correlation, not causality. If we mix up the two, we can end up doing stupid things.
The most important thing is to have a clear understanding of the difference at the point when we act on decisions. Sometimes, when the stakes are low, correlation might be enough. In other cases, it is best to run an experiment to verify our claims. Finally, some systems might warrant building experiments into the system itself, letting you draw strong causality conclusions. Choose wisely!
Original Post from my Medium account: https://medium.com/@srinathperera/understanding-causality-and-big-data-complexities-challenges-and-tradeoffs-db6755e8e220#.ca4j2smy3