Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

The Empty Middle: Why No One Is Average

DZone's Guide to

The Empty Middle: Why No One Is Average

Average is sometimes treated as a synonym for common, but it shouldn't be.

· Big Data Zone ·
Free Resource

Hortonworks Sandbox for HDP and HDF is your chance to get started on learning, developing, testing and trying out new features. Each download comes preconfigured with interactive tutorials, sample data and developments from the Apache community.

In 1945, a Cleveland newspaper held a contest to find the woman whose measurements were closest to average. This average was based on a study of 15,000 women by Dr. Robert Dickinson and embodied in a statue called Norma by Abram Belskie. Out of 3,864 contestants, no one was average on all nine factors, and fewer than 40 were close to average on five factors. The story of Norma and the Cleveland contest is told in Todd Rose’s book The End of Average.

People are not completely described by a small set of numbers. We’re much more complicated than that. But even in systems that are well described by a few numbers, the region around the average can be nearly empty. I’ll explain why that’s true in general, then look back at the Norma example.

General theory

Suppose you have N points, each described by n independent, standard normal random variables. That is, each point has the form (x1, x2, x2, …, xn) where each xi is independent with a normal distribution with mean 0 and variance 1. The expected value of each coordinate is 0, so you might expect that most points are piled up near the origin (0, 0, 0, …, 0). In fact most points are in spherical shell around the origin. Specifically, as n becomes larger, most of the points will be in a thin shell with distance √n from the origin. (More details here.)

Simulated contest

In the contest above, n = 9, and so we expect most contestants to be about a distance of 3 from average when we normalize each of the factors being measured, i.e. we subtract the mean so that each factor has mean 0, and we divide each by its standard deviation so the standard deviation is 1 on each factor.

We’ve made several simplifying assumptions. For example, we’ve assumed independence, though presumably some of the factors measured in the contest were correlated. There’s also a selection bias: presumably women who knew they were far from average would not have entered the contest. But we’ll run with our simplified model just to see how it behaves in a simulation.

import numpy as np

# Winning critera: minimum Euclidean distance
def euclidean_norm(x):
 return np.linalg.norm(x)

# Winning criteria: min-max
def max_norm(x):
 return max(abs(x))

n = 9
N = 3864

# Simulated normalized measurements of contestants 
M = np.random.normal(size=(N, n))

# Setting a seed in case we want to reproduce results
np.random.seed(42)

euclid = np.empty(N)
maxdev = np.empty(N)
for i in range(N):
 euclid[i] = euclidean_norm(M[i,:])
 maxdev[i] = max_norm(M[i,:])

w1 = euclid.argmin()
w2 = maxdev.argmin()

print( M[w1,:] )
print( euclidean_norm(M[w1,:]) )
print( M[w2,:] )
print( max_norm(M[w2,:]) )

There are two different winners, depending on how we decide the winner. Using the Euclidean distance to the origin, the winner in this simulation was contestant 3306. Her normalized measurements were

[ 0.1807, 0.6128, -0.0532, 0.2491, -0.2634, 0.2196, 0.0068, -0.1164, -0.0740]

corresponding to a Euclidean distance of 0.7808.

If we judge the winner to be the one whose largest deviation from average is the smallest, the winner is contestant 1916. Her normalized measurements were

[-0.3757, 0.4301, -0.4510, 0.2139, 0.0130, -0.2504, -0.1190, -0.3065, -0.4593]

with the largest deviation being the last, 0.4593.

By either measure, the contestant closest to the average deviated significantly from the average in at least one dimension.


Hortonworks Community Connection (HCC) is an online collaboration destination for developers, DevOps, customers and partners to get answers to questions, collaborate on technical articles and share code examples from GitHub.  Join the discussion.

Topics:
statistics ,simulation ,contest ,data science

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}