
Why Isn't Everything Normally Distributed?


Adult heights follow a Gaussian, a.k.a. normal, distribution [1]. The usual explanation is that many factors go into determining one’s height, and the net effect of many separate causes is approximately normal because of the central limit theorem.

If that’s the case, why aren’t more phenomena normally distributed? Someone asked me this morning specifically about phenotypes with many genetic inputs.

The central limit theorem says that the sum of many independent, additive effects is approximately normally distributed [2]. Genes are more digital than analog, and do not produce independent, additive effects. For example, the effects of dominant and recessive genes act more like max and min than addition. Genes do not appear independently: if you have some genes, you’re more likely to have certain other genes. Nor do they act independently: some genes determine how other genes are expressed.
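To make the contrast between adding and taking a max concrete, here is a minimal simulation sketch. It is not a model of genetics: the uniform "effects", the count of 50 effects, and the helper names `skewness` and `simulate` are all assumptions chosen for illustration. It combines the same independent effects two ways and reports the sample skewness, which should be near zero for a roughly normal result.

```python
import random
import statistics

def skewness(xs):
    """Sample skewness: third standardized moment, 0 for a symmetric distribution."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return statistics.fmean(((x - m) / s) ** 3 for x in xs)

def simulate(combine, n_effects=50, n_samples=20000, seed=0):
    """Draw n_samples outcomes, each combining n_effects independent uniform(0, 1) effects."""
    rng = random.Random(seed)
    return [combine(rng.random() for _ in range(n_effects)) for _ in range(n_samples)]

# Same independent effects, combined by addition versus by taking the maximum.
sum_skew = skewness(simulate(sum))
max_skew = skewness(simulate(max))

print(f"skewness of sums:   {sum_skew:.3f}")
print(f"skewness of maxima: {max_skew:.3f}")
```

Under these assumptions the sums should come out close to symmetric, while the maxima pile up near the upper bound and are strongly skewed, so independence alone is not enough to get a normal distribution.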

Height is influenced by environmental effects, such as nutrition, as well as genetic effects, and these environmental effects may be more additive or independent than genetic effects.

Incidentally, if effects are independent but multiplicative rather than additive, the result may be approximately log-normal rather than normal.
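The multiplicative case can be sketched the same way. The factors below, uniform on (0.8, 1.2), and the count of 50 factors are assumptions picked only for illustration. The logarithm of a product is a sum of logs, so the central limit theorem applies to the log of the product rather than to the product itself.

```python
import math
import random
import statistics

def skewness(xs):
    """Sample skewness: third standardized moment, 0 for a symmetric distribution."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return statistics.fmean(((x - m) / s) ** 3 for x in xs)

rng = random.Random(0)
n_factors, n_samples = 50, 20000

# Each outcome is the product of independent positive factors (hypothetical values).
products = [
    math.prod(rng.uniform(0.8, 1.2) for _ in range(n_factors))
    for _ in range(n_samples)
]
logs = [math.log(p) for p in products]

product_skew = skewness(products)  # right-skewed, as a log-normal would be
log_skew = skewness(logs)          # close to symmetric, as a normal would be

print(f"skewness of products:      {product_skew:.3f}")
print(f"skewness of log(products): {log_skew:.3f}")
```

In this sketch the products have a long right tail, while their logarithms look roughly symmetric, which is what an approximately log-normal result would show.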


Fine print:

[1] Men’s heights follow a normal distribution, and so do women’s. Adults not sorted by sex follow a mixture distribution as described here and so the distribution is flatter on top than a normal. It gets even more complicated when you consider that there are slightly more women than men in the world. And as with many phenomena, the normal distribution is a better description near the middle than at the extremes.
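As a rough check of the "flatter on top" remark, the sketch below compares a 50/50 mixture of two normals with a single normal that has the same mean and variance. The means of 70 and 64 inches and the common standard deviation of 3 inches are hypothetical round numbers, not measured data, and the equal weighting ignores the sex-ratio complication mentioned above.

```python
from statistics import NormalDist

# Hypothetical parameters in inches, chosen only for illustration.
men = NormalDist(mu=70, sigma=3)
women = NormalDist(mu=64, sigma=3)

def mixture_pdf(x):
    """Density of a 50/50 mixture of the two groups."""
    return 0.5 * men.pdf(x) + 0.5 * women.pdf(x)

# A single normal matching the mixture's mean and variance.
# For equal weights and equal sigmas, the mixture variance is
# sigma^2 plus the squared half-distance between the two means.
center = (men.mean + women.mean) / 2
half_gap = (men.mean - women.mean) / 2
matched = NormalDist(mu=center, sigma=(men.variance + half_gap**2) ** 0.5)

# Fraction of the central density that remains two inches from the center.
mixture_ratio = mixture_pdf(center + 2) / mixture_pdf(center)
normal_ratio = matched.pdf(center + 2) / matched.pdf(center)

print(f"mixture density ratio:        {mixture_ratio:.3f}")
print(f"matched normal density ratio: {normal_ratio:.3f}")
```

With these assumed numbers the mixture's density barely drops two inches from the center, while the matched normal's drops noticeably more, which is one way to read "flatter on top".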

[2] There are many variations on the central limit theorem. The classical CLT requires that the random variables in the sum be identically distributed as well, though that isn’t so important here.


Published at DZone with permission of
