Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

DZone's Guide to

# Why Isn't Everything Normally Distributed?

· Big Data Zone ·
Free Resource

Comment (0)

Save
{{ articles[0].views | formatCount}} Views

The open source HPCC Systems platform is a proven, easy to use solution for managing data at scale. Visit our Easy Guide to learn more about this completely free platform, test drive some code in the online Playground, and get started today.

Adult heights follow a Gaussian, a.k.a. normal, distribution [1]. The usual explanation is that many factors go into determining one’s height, and the net effect of many separate causes is approximately normal because of the central limit theorem.

If that’s the case, why aren’t more phenomena normally distributed? Someone asked me this morning specifically about phenotypes with many genetic inputs.

The central limit theorem says that the sum of many independent, additive effects is approximately normally distributed [2]. Genes are more digital than analog, and do not produce independent, additive effects. For example, the effects of dominant and recessive genes act more like max and min than addition. Genes do not appear independently—if you have some genes, you’re more likely to have certain other genes—nor do they act independently—some genes determine how other genes are expressed.

Height is influenced by environmental effects as well as genetic effects, such as nutrition, and these environmental effects may be more additive or independent than genetic effects.

Incidentally, if effects are independent but multiplicative rather than additive, the result may be approximately log-normal rather than normal.

Fine print:

[1] Men’s heights follow a normal distribution, and so do women’s. Adults not sorted by sex follow a mixture distribution as described here and so the distribution is flatter on top than a normal. It gets even more complicated when you considered that there are slightly more women than men in the world. And as with many phenomena, the normal distribution is a better description near the middle than at the extremes.

[2] There are many variations on the central limit theorem. The classical CLT requires that the random variables in the sum be identically distributed as well, though that isn’t so important here.

Managing data at scale doesn’t have to be hard. Find out how the completely free, open source HPCC Systems platform makes it easier to update, easier to program, easier to integrate data, and easier to manage clusters. Download and get started today.

Topics:

Comment (0)

Save
{{ articles[0].views | formatCount}} Views

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.