Over a million developers have joined DZone.

Improving on Chebyshev’s Inequality

What are some ways to make inequality equations become more tight and accurate?

· Big Data Zone

Learn how you can maximize big data in the cloud with Apache Hadoop. Download this eBook now. Brought to you in partnership with Hortonworks.

Chebyshev’s inequality says that the probability of a random variable being more than k standard deviations away from its mean is less than 1/k2. In symbols:

P(|X - E(X)| \geq k\sigma) \leq \frac{1}{k^2}

This inequality is very general, but also very weak. It assumes very little about the random variable X but it also gives a loose bound. If we assume slightly more, namely that X has a unimodal distribution, then we have a tighter bound, the Vysochanskiĭ-Petunin inequality.

P(|X - E(X)| \geq k\sigma) \leq \frac{4}{9k^2}

However, the Vysochanskiĭ-Petunin inequality does require that k be larger than √(8/3). In exchange for the assumption of unimodality and the restriction on k we get to reduce our upper bound by more than half.

While tighter than Chebyshev’s inequality, the stronger inequality is still very general. We can usually do much better if we can say more about the distribution family. For example, suppose X has a uniform distribution. What is the probability that X is more than two standard deviations from its mean? Zero, because two standard deviations puts you outside the interval the uniform is defined on!

Among familiar distributions, when is the Vysochanskiĭ-Petunin inequality most accurate? That depends, of course, on what distributions you consider familiar, and what value of k you use. Let’s look at normal, exponential, and Pareto. These were chosen because they have thin, medium, and thick tails. We’ll also throw in the double exponential, because it has the same tail thickness as exponential but is symmetric. We’ll let k be 2 and 3.

Distribution familyP(|X – E(X)| > 2σ)V-P estimateP(|X – E(X)| > 3σ)V-P estimate
Uniform0.00000.11110.00000.0484
Normal0.04550.11110.00270.0484
Exponential0.04980.11110.01830.0484
Pareto0.02770.11110.01560.0484
Double exponential0.05910.11110.01440.0484

A normal random variable is more than 2 standard deviations away from its mean with probability 0.0455, compared to the Vysochanskiĭ-Petunin bound of 1/9 = 0.1111. A normal random variable is more than 3 standard deviations away from its mean with probability 0.0027, compared to the bound of 4/81 = 0.0484.

An exponential random variable with mean μ also has standard deviation μ, so the only way it could be more than 2μ from its mean is to be 3μ from 0. So an exponential is more that 2 standard deviations from its mean with probability exp(-3) = 0.0498, and more than 3 standard deviations with probability exp(-4) = 0.0183.

We’ll set the minimum value of our Pareto random variable to be 1. As with the exponential, the Pareto cannot be 2 standard deviations less than its mean, so we look at the probability of it being more than 2 greater than its mean. The shape parameter α must be bigger than 2 for for the variance to exist. The probability of our random variable being more than k standard deviations away from its mean works out to ((α-1)/((k-1)α))α and is largest as α converges down toward 2. The limiting values for k equal to 2 and 3 are 1/36 = 0.0277 and 1/64 = 0.0156 respectively. Of our examples, the Pareto distribution comes closest to the Vysochanskiĭ-Petunin bounds, but doesn’t come that close.

The double exponential, also know as Laplace, has the highest probability of any of our examples of being two standard deviations from its mean, but this probability is still less than half of the Vysochanskiĭ-Petunin bound. The limit of the Pareto distribution has the highest probability of being three standard deviations from its mean, but stil less than one-third of the Vysochanskiĭ-Petunin bound.

Generic bounds are useful, especially in theoretical calculations, but it’s usually possible to do much better with specific distributions.

Hortonworks DataFlow is an integrated platform that makes data ingestion fast, easy, and secure. Download the white paper now.  Brought to you in partnership with Hortonworks

Topics:
mathematical programming

Published at DZone with permission of John Cook, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

The best of DZone straight to your inbox.

SEE AN EXAMPLE
Please provide a valid email address.

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.
Subscribe

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}