Fitting a Triangular Distribution

Sometimes you only need a rough fit to some data and a triangular distribution will do. As the name implies, this is a distribution whose density function graph is a triangle. The triangle is determined by its base, running between points a and b, and a point c somewhere in between where the altitude intersects the base. (c is called the foot of the altitude.) The height of the triangle is whatever it needs to be for the area to equal 1 since we want the triangle to be a probability density.
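To make the geometry concrete, here is a minimal Python sketch of such a density. The function name triangular_pdf and its argument order are illustrative choices, not something from the original post. It assumes a < b and that c lies in [a, b]; the height 2/(b – a) follows from requiring the area (1/2) × base × height to equal 1.

```python
def triangular_pdf(x, a, b, c):
    """Density of a triangular distribution with base [a, b] and foot c.

    Hypothetical helper for illustration; assumes a < b and a <= c <= b.
    """
    if not (a < b and a <= c <= b):
        raise ValueError("need a < b and a <= c <= b")
    # Area of a triangle is (1/2) * base * height, so height = 2 / (b - a).
    height = 2.0 / (b - a)
    if x < a or x > b:
        return 0.0
    if x < c:
        # Rising side: linear from 0 at a up to the peak at c.
        return height * (x - a) / (c - a)
    if x > c:
        # Falling side: linear from the peak at c down to 0 at b.
        return height * (b - x) / (b - c)
    return height  # x == c, the peak


print(triangular_pdf(0.5, 0.0, 1.0, 0.5))  # 2.0
```

Note that the density is zero at a and b themselves (unless c coincides with one of them), which is why you might want an interval slightly wider than the data.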

One way to fit a triangular distribution to data would be to set a to the minimum value and b to the maximum value. You could pick a and b to be the smallest and largest possible values, if these are known. Otherwise you could use the smallest and largest values in the data, or make the interval a little larger if you want the density to be positive at the extreme data values.

How do you pick c? One approach would be to pick it so the resulting distribution has the same mean as the data. The triangular distribution has mean

(a + b + c)/3

so you could simply solve for c to match the sample mean.
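Solving (a + b + c)/3 = mean for c gives c = 3 × mean – a – b. A small Python sketch of this, assuming a and b default to the sample minimum and maximum (the function name fit_c_by_mean is hypothetical):

```python
def fit_c_by_mean(data, a=None, b=None):
    """Choose c so the triangular distribution's mean equals the sample mean.

    Hypothetical helper for illustration. If a or b is not supplied,
    the sample minimum or maximum is used.
    """
    a = min(data) if a is None else a
    b = max(data) if b is None else b
    mean = sum(data) / len(data)
    # Invert mean = (a + b + c) / 3.
    c = 3 * mean - a - b
    # c only defines a valid density if it lands inside [a, b], which
    # happens exactly when the mean is in [(2a + b)/3, (a + 2b)/3].
    if not a <= c <= b:
        raise ValueError("sample mean too extreme for a triangular fit on [a, b]")
    return a, b, c


# Sample mean is 4, so c = 3*4 - 0 - 10 = 2.
print(fit_c_by_mean([0, 2, 4, 4, 10]))
```

The range check is an addition here: since c must lie between a and b, the mean of a triangular distribution is confined to the middle third of the interval, so a strongly skewed sample can make this approach fail too.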

Another approach would be to pick c so that the resulting distribution has the same median as the data. This approach is more interesting because it cannot always be done.

Suppose your sample median is m. You can always find a point c so that half the area of the triangle lies to the left of a vertical line drawn through m. However, this might require the foot c to be to the left or the right of the base [a, b]. In that case the resulting triangle is obtuse, and so the sides of the triangle do not form the graph of a function.

For the triangle to give us the graph of a density function, c must be in the interval [a, b]. Such a density has a median in the range

[b – (b – a)/√2, a + (b – a)/√2].

If the sample median m is in this range, then we can solve for c so that the distribution has median m. The solution is

c = b – 2(b – m)² / (b – a)

if m < (a + b)/2 and

c = a + 2(a – m)² / (b – a)

otherwise.
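The median-matching recipe can be sketched in Python as follows. The function name fit_c_by_median is hypothetical, and raising an error when m falls outside the feasible range is one possible design choice; the formulas themselves are the ones given above.

```python
import math


def fit_c_by_median(m, a, b):
    """Choose c so a triangular distribution on [a, b] has median m.

    Hypothetical helper for illustration; assumes a < b.
    """
    width = b - a
    # Feasible medians: [b - (b - a)/sqrt(2), a + (b - a)/sqrt(2)].
    lo = b - width / math.sqrt(2)
    hi = a + width / math.sqrt(2)
    if not lo <= m <= hi:
        raise ValueError("median outside the range a triangular density allows")
    if m < (a + b) / 2:
        c = b - 2 * (b - m) ** 2 / width
    else:
        c = a + 2 * (a - m) ** 2 / width
    # Guard against rounding pushing c just outside [a, b] at the endpoints.
    return min(max(c, a), b)


# With a = 0, b = 1 and sample median 0.4, c comes out at roughly 0.28.
print(fit_c_by_median(0.4, 0.0, 1.0))
```

In practice m would come from the data, for example via statistics.median(data) from the standard library. If the error is raised, widening [a, b] or falling back to matching the mean are possible alternatives.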

Published at DZone with permission of John Cook, DZone MVB. See the original article here.
