
# Simpson's Paradox and Data Sampling


### Learn how to avoid Simpson's paradox in data sampling so that you don't end up with a conclusion about an intervention effect that's the opposite of the correct inference.

· Big Data Zone ·

E.H. Simpson first described the phenomenon of Simpson's paradox in 1951. The actual name "Simpson's paradox" was introduced by Colin R. Blyth in 1972. Blyth mentioned that:

G.W. Haggstrom pointed out that Simpson's paradox is the simplest form of the false correlation paradox in which the domain of x is divided into short intervals, on each of which y is a linear function of x with large negative slope, but these short line segments get progressively higher to the right, so that over the whole domain of x, the variable y is practically a linear function of x with large positive slope.
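Haggstrom's picture can be reproduced with a small synthetic data set. The sketch below is illustrative only: the five intervals, the within-interval slope of -5, and the step of 10 between segments are all made-up values, and `ols_slope` is a helper written for this example.

```python
def ols_slope(xs, ys):
    """Least-squares slope of y regressed on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical data: five short intervals of x. Within each interval y falls
# with slope -5, but every segment starts 10 units higher than the previous one.
segments = []
for k in range(5):
    xs = [k + i / 10 for i in range(10)]        # x in [k, k + 0.9]
    ys = [10 * k - 5 * (x - k) for x in xs]     # steep negative slope
    segments.append((xs, ys))

within_slopes = [ols_slope(xs, ys) for xs, ys in segments]

# Pool all segments and fit a single line over the whole domain of x.
all_x = [x for xs, _ in segments for x in xs]
all_y = [y for _, ys in segments for y in ys]
pooled_slope = ols_slope(all_x, all_y)

print("within-interval slopes:", [round(s, 3) for s in within_slopes])
print("pooled slope:", round(pooled_slope, 3))
```

Every within-interval slope is negative, yet the pooled fit has a positive slope, because the interval a point belongs to (the ignored variable) drives the overall trend.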

Simpson's paradox arises from the combination of an ignored confounding variable and a disproportionate allocation of the variable, and it can lead to a conclusion about an intervention effect that is the opposite of the correct inference (hence, a paradox). Simpson demonstrated how differential analyses of contingency tables (i.e. analysis in which the confounding variable is excluded or included) can lead to different conclusions. (The topic of interactions in contingency tables dates back over eight decades. Early works include the ones from Bartlett, Norton, Lancaster, Darroch, Lewis, Whittemore, and Davis.)

Mathematically, Simpson's paradox is the following:

It is possible to have P(A|B) < P(A|B') and, at the same time, both P(A|BC) ≥ P(A|B'C) and P(A|BC') ≥ P(A|B'C').

The paradox rests upon the dependence (interaction) of B and C. If B and C were independent, then P(C|B) = P(C|B') and the weights in the two decompositions below would be identical, so the paradox could not arise (the weights are in curly brackets):

P(A|B) = {P(C|B)} P(A|BC) + {P(C'|B)} P(A|BC')

P(A|B') = {P(C|B')} P(A|B'C) + {P(C'|B')} P(A|B'C')
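The weighted decomposition can be checked numerically. The following is a minimal sketch using illustrative counts that are not taken from this article; the labels `B`, `B'`, `C`, `C'` mirror the notation above, and exact fractions are used so that the identities hold without rounding.

```python
from fractions import Fraction

# Illustrative counts (an assumption, not data from the article):
# counts[(b, c)] = (number of trials in which A occurred, total trials)
counts = {
    ("B", "C"): (81, 87),
    ("B", "C'"): (192, 263),
    ("B'", "C"): (234, 270),
    ("B'", "C'"): (55, 80),
}

def p_a_given(b, c):
    """P(A | b, c) from the raw counts."""
    hits, trials = counts[(b, c)]
    return Fraction(hits, trials)

def weight(c, b):
    """P(c | b): the share of group b's trials that fall in stratum c."""
    total = sum(n for (bb, _), (_, n) in counts.items() if bb == b)
    return Fraction(counts[(b, c)][1], total)

def p_a_given_b(b):
    """P(A | b) as the weighted sum over both strata."""
    return sum(weight(c, b) * p_a_given(b, c) for c in ("C", "C'"))

for c in ("C", "C'"):
    print(c, float(p_a_given("B", c)), ">=", float(p_a_given("B'", c)))
print("pooled", float(p_a_given_b("B")), "<", float(p_a_given_b("B'")))
print("weights", float(weight("C", "B")), "vs", float(weight("C", "B'")))
```

B does better within each stratum but worse overall, because P(C|B) is about 0.25 while P(C|B') is about 0.77: the two groups weight the strata very differently.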

The extreme form of Simpson's paradox is given by the following:

Subject to the conditions P(A|BC) ≥ γ P(A|B'C) and P(A|BC') ≥ γ P(A|B'C') with γ ≥ 1, it is possible to have P(A|B) ≅ 0 and P(A|B') ≅ 1/γ.
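A quick numeric check of the extreme form, read as "B is at least γ times better than B' within each stratum, yet P(A|B) ≅ 0 while P(A|B') ≅ 1/γ". All probabilities below are hypothetical values chosen for illustration, with γ = 2.

```python
from fractions import Fraction as F

gamma = 2

# Hypothetical conditional probabilities P(A | b, c).
p_a = {
    ("B", "C"): F(1),           ("B'", "C"): F(1, 2),
    ("B", "C'"): F(2, 1000),    ("B'", "C'"): F(1, 1000),
}

# Hypothetical stratum weights P(C | b): B almost never falls in C,
# while B' almost always does.
p_c = {"B": F(1, 1000), "B'": F(999, 1000)}

def marginal(b):
    """P(A | b) via the law of total probability over C and C'."""
    return p_c[b] * p_a[(b, "C")] + (1 - p_c[b]) * p_a[(b, "C'")]

# Within each stratum, B beats B' by a factor of at least gamma.
for c in ("C", "C'"):
    assert p_a[("B", c)] >= gamma * p_a[("B'", c)]

print("P(A|B)  =", float(marginal("B")))    # close to 0
print("P(A|B') =", float(marginal("B'")))   # close to 1/gamma
```

The gap comes entirely from the weights: B's trials sit almost wholly in the stratum where A is rare, and those of B' in the stratum where A is common.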

Simpson's paradox has been extensively studied in a variety of fields such as, but not limited to, statistics, medicine, cognitive sciences, and social sciences. In the context of operations, we discussed Simpson's paradox in an earlier blog post and a research paper.

In their 1981 paper, Lindley and Novick argued the following about statistical inference:

Standard procedures concentrate on the data and tend to ignore the connection with the case to which the inference is to be applied. ... This connection can be established using either de Finetti's idea of exchangeability or Fisher's concept of subpopulation.

Using causal calculus, Pearl showed the "resolution" of Simpson's paradox.

As noted several times earlier in our blog series, marrying context to statistical analysis is key to extracting actionable insights. For instance, the plots below show the webpage response time for two different AT&T offerings. The first plot corresponds to performance on desktop, whereas the second corresponds to performance on mobile.

From the desktop plot, we note that the average (and the median) performance of U-Verse is worse than that of its counterpart. More importantly, U-Verse experiences over 2.5x as many >10 sec spikes as its counterpart. This difference in performance would be masked in an aggregated view. On deeper analysis, we find that geography is the confounding factor. Concretely, on dissecting the performance between east/west coast and mid-west in the U.S., we noted that the difference in performance between U-Verse and its counterpart disappeared.

From the mobile plot, we note that the average (and the median) performance of U-Verse is better than that of its counterpart (unlike in the desktop case). Further, U-Verse experiences less than 0.74x as many >10 sec spikes as its counterpart. This difference in performance would likewise be masked in an aggregated view. As in the desktop case, the difference in performance between U-Verse and its counterpart disappeared on dissecting the performance between east/west coast and mid-west in the U.S.

Akin to Part 1, let's analyze the impact of downsampling the webpage response time by a factor of two in the aforementioned cases.

Comparing the downsampled desktop plot with the original desktop plot (without downsampling), we note that although U-Verse's average/median performance is still worse, the ratio of the number of >10 sec spikes dropped from over 2.5x to 2.3x. Downsampling thus artificially boosts the apparent worst-case performance.

Comparing the downsampled mobile plot with the original mobile plot (without downsampling), we note that although U-Verse's average/median performance is still better, the ratio of the number of >10 sec spikes dropped from 0.74x to 0.29x. As in the desktop case, downsampling artificially boosts the apparent worst-case performance.
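To see why downsampling can distort spike counts, consider a deliberately contrived sketch. The two series, the spike positions, and the helper names (`make_series`, `count_spikes`, `downsample`) are all hypothetical and are not the measurements discussed above; the only assumption carried over is the >10 sec spike threshold.

```python
def make_series(n, spike_indices, base=2.0, spike=12.0):
    """Response times in seconds: a flat baseline with isolated spikes."""
    return [spike if i in spike_indices else base for i in range(n)]

def count_spikes(samples, threshold=10.0):
    """Number of samples strictly above the threshold."""
    return sum(1 for s in samples if s > threshold)

def downsample(samples, factor=2, offset=0):
    """Keep every `factor`-th sample, starting at `offset`."""
    return samples[offset::factor]

# Hypothetical series: x has 10 spikes, mostly at odd positions;
# y has 4 spikes, all at even positions.
x = make_series(40, {1, 3, 5, 7, 8, 11, 13, 15, 19, 21})
y = make_series(40, {0, 2, 6, 10})

full_ratio = count_spikes(x) / count_spikes(y)
half_ratio = count_spikes(downsample(x)) / count_spikes(downsample(y))

print("full data:   ", count_spikes(x), "vs", count_spikes(y), "ratio", full_ratio)
print("downsampled: ", count_spikes(downsample(x)), "vs",
      count_spikes(downsample(y)), "ratio", half_ratio)
```

Because spikes are short-lived, which ones survive depends on where they fall relative to the retained samples. Real data will rarely be this lopsided, but the same mechanism shifts the spike ratio whenever spikes are not evenly spread across the kept and dropped samples.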

The key takeaway from the above is that, to avoid Simpson's paradox, one should judiciously dissect operational performance along different dimensions. (Drawing conclusions from high-level aggregate metrics alone can be misleading.) This, in turn, helps avoid wild goose chases. That said, the subsamples obtained by slicing along different dimensions should be large enough to ensure high statistical power in the subsequent analysis.
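A minimal sketch of this kind of slicing, assuming hypothetical records of the form (service, region, response time). The services `X` and `Y`, the regions, and the `min_n` cutoff are placeholders for illustration; in particular, `min_n` is a crude size guard and not a substitute for a proper power calculation.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical measurements: (service, region, response time in seconds).
# Both services perform identically within a region, but X is measured
# mostly in the slower region and Y mostly in the faster one.
records = (
    [("X", "coastal", 2.0)] * 10 + [("X", "midwest", 6.0)] * 30
    + [("Y", "coastal", 2.0)] * 30 + [("Y", "midwest", 6.0)] * 10
)

def slice_means(records, key, min_n=5):
    """Mean response time per slice; None when a slice has fewer than min_n samples."""
    groups = defaultdict(list)
    for service, region, rt in records:
        groups[key(service, region)].append(rt)
    return {k: mean(v) if len(v) >= min_n else None for k, v in groups.items()}

aggregate = slice_means(records, key=lambda s, r: s)
by_region = slice_means(records, key=lambda s, r: (s, r))

print("aggregate:", aggregate)
print("by region:", by_region)
```

In aggregate, X looks slower than Y, yet within each region the two are identical: region is the confounder. Raising `min_n` (for example to 20) marks the 10-sample slices as `None`, which flags them as too small to draw conclusions from.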

Topics:
big data, sampling, tutorial, simpson's paradox


Published at DZone with permission of
