The Empty Middle: Why No One Is Average
Average is sometimes treated as a synonym for common, but it shouldn't be.
In 1945, a Cleveland newspaper held a contest to find the woman whose measurements were closest to average. This average was based on a study of 15,000 women by Dr. Robert Dickinson and embodied in a statue called Norma by Abram Belskie. Out of 3,864 contestants, no one was average on all nine factors, and fewer than 40 were close to average on five factors. The story of Norma and the Cleveland contest is told in Todd Rose’s book The End of Average.
People are not completely described by a small set of numbers. We’re much more complicated than that. But even in systems that are well described by a few numbers, the region around the average can be nearly empty. I’ll explain why that’s true in general, then look back at the Norma example.
Suppose you have N points, each described by n independent, standard normal random variables. That is, each point has the form (x1, x2, x3, …, xn) where each xi is independent with a normal distribution with mean 0 and variance 1. The expected value of each coordinate is 0, so you might expect that most points are piled up near the origin (0, 0, 0, …, 0). In fact most points are in a spherical shell around the origin. Specifically, as n becomes larger, most of the points will be in a thin shell at a distance of roughly √n from the origin. (More details here.)
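A quick way to see the shell is to simulate it. The sketch below is not from the original analysis; the helper name shell_summary, the sample size, and the seed are all illustrative assumptions. It draws standard normal points in several dimensions and compares the average distance from the origin with √n.

```python
import numpy as np

def shell_summary(n, N=10_000, seed=0):
    """Return (mean distance, std of distance, sqrt(n)) for N standard normal points in n dimensions."""
    rng = np.random.default_rng(seed)          # illustrative seed, chosen arbitrarily
    points = rng.standard_normal(size=(N, n))  # each row is one point
    dist = np.linalg.norm(points, axis=1)      # Euclidean distance of each point from the origin
    return dist.mean(), dist.std(), np.sqrt(n)

for n in (1, 9, 100, 1000):
    mean, std, root_n = shell_summary(n)
    print(f"n={n:5d}  mean distance={mean:8.3f}  std={std:.3f}  sqrt(n)={root_n:8.3f}")
```

If the model holds, the mean distance should track √n while the spread of the distances stays roughly constant, so the shell gets thinner relative to its radius as n grows.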
In the contest above, n = 9, and so we expect most contestants to be about a distance of 3 from average when we normalize each of the factors being measured, i.e. we subtract the mean so that each factor has mean 0, and we divide each by its standard deviation so the standard deviation is 1 on each factor.
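The normalization step can be sketched in a few lines. The raw measurements here are made up purely for illustration (the means and standard deviations in true_means and true_sds are hypothetical, not taken from the Norma study); the point is only to show subtracting the mean, dividing by the standard deviation, and then measuring distance from average.

```python
import numpy as np

rng = np.random.default_rng(1)  # illustrative seed

# Hypothetical raw measurements: 3864 contestants, 9 factors,
# each factor with its own made-up mean and standard deviation.
true_means = np.linspace(20, 100, 9)
true_sds = np.linspace(1, 5, 9)
raw = rng.normal(loc=true_means, scale=true_sds, size=(3864, 9))

# Normalize each factor (column): mean 0 and standard deviation 1.
z = (raw - raw.mean(axis=0)) / raw.std(axis=0)

# Distance of each contestant from the average, in normalized units.
dist = np.linalg.norm(z, axis=1)

print("mean distance from average:", dist.mean())
print("fraction within distance 1:", np.mean(dist < 1))
```

Under these assumptions the typical distance should come out near 3, and very few simulated contestants should land within one unit of the average.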
We’ve made several simplifying assumptions. For example, we’ve assumed independence, though presumably some of the factors measured in the contest were correlated. There’s also a selection bias: presumably women who knew they were far from average would not have entered the contest. But we’ll run with our simplified model just to see how it behaves in a simulation.
import numpy as np

# Winning criterion: minimum Euclidean distance from the origin
def euclidean_norm(x):
    return np.linalg.norm(x)

# Winning criterion: min-max, i.e. the smallest largest deviation
def max_norm(x):
    return max(abs(x))

n = 9
N = 3864

# Set the seed before drawing any random numbers so results can be reproduced
np.random.seed(42)

# Simulated normalized measurements of contestants
M = np.random.normal(size=(N, n))

euclid = np.empty(N)
maxdev = np.empty(N)
for i in range(N):
    euclid[i] = euclidean_norm(M[i,:])
    maxdev[i] = max_norm(M[i,:])

# Index of the winner under each criterion
w1 = euclid.argmin()
w2 = maxdev.argmin()

print( M[w1,:] )
print( euclidean_norm(M[w1,:]) )
print( M[w2,:] )
print( max_norm(M[w2,:]) )
There are two different winners, depending on how we decide the winner. Using the Euclidean distance to the origin, the winner in this simulation was contestant 3306. Her normalized measurements were
[ 0.1807, 0.6128, -0.0532, 0.2491, -0.2634, 0.2196, 0.0068, -0.1164, -0.0740]
corresponding to a Euclidean distance of 0.7808.
If we judge the winner to be the one whose largest deviation from average is the smallest, the winner is contestant 1916. Her normalized measurements were
[-0.3757, 0.4301, -0.4510, 0.2139, 0.0130, -0.2504, -0.1190, -0.3065, -0.4593]
with the largest deviation being the last, 0.4593.
By either measure, the contestant closest to the average deviated significantly from the average in at least one dimension.
Published at DZone with permission of John Cook, DZone MVB. See the original article here.