Over a million developers have joined DZone.

Some Basic Statistics

· Big Data Zone

Learn how you can maximize big data in the cloud with Apache Hadoop. Download this eBook now. Brought to you in partnership with Hortonworks.

 I've always been fascinated by the essential statistical algorithms. While there are numerous statistical libraries, the simple measures of central tendency (mean, media, mode, standard deviation) have some interesting features.

Well.  Interesting to me.

First, some basics.

def s0( samples ):
    return len(samples) # sum(x**0 for x in samples)

def s1( samples ):
    return sum(samples) # sum(x**1 for x in samples)

def s2( samples ):
    return sum( x**2 for x in samples )

Why define these three nearly useless functions? It's the cool factor of how they're so elegantly related.

Once we have these, though, the definitions of mean and standard deviation become simple and kind of cool.

def mean( samples ):
    return s1(samples)/s0(samples)

def stdev( samples ):
    N= s0(samples)
    return math.sqrt((s2(samples)/N)-(s1(samples)/N)**2)

It's not much, but it seems quite elegant. Ideally, these functions could work from iterables instead of sequence objects, but that's impractical in Python. We must work with a materialized sequence even if we replace len(X) with sum(1 for _ in X).

The next stage of coolness is the following version of Pearson correlation. It involves a little helper function to normalize samples.

def z( x, μ_x, σ_x ):
    return (x-μ_x)/σ_x

Yes, we're using Python 3 and Unicode variable names.

Here's the correlation function.

def corr( sample1, sample2 ):
    μ_1, σ_1 = mean(sample1), stdev(sample1)
    μ_2, σ_2 = mean(sample2), stdev(sample2)
    z_1 = (z(x, μ_1, σ_1) for x in sample1)
    z_2 = (z(x, μ_2, σ_2) for x in sample2)
    r = sum( zx1*zx2 for zx1, zx2 in zip(z_1, z_2) )/len(sample1)
    return r

I was looking for something else when I stumbled on this "sum of products of normalized samples" version of correlation. How cool is that? The more text-book versions of this involve lots of sigmas and are pretty bulky-looking. This, on the other hand, is really tidy.

Finally, here's least-squares linear regression.

def linest( x_list, y_list ):
    r_xy= corr( x_list, y_list )
    μ_x, σ_x= mean(x_list), stdev(x_list)
    μ_y, σ_y= mean(y_list), stdev(y_list)
    beta= r_xy * σ_y/σ_x
    alpha= μ_y - beta*μ_x
    return alpha, beta

This, too, was buried at the end of the Wikipedia article. But it was such an elegant formulation for least squares based on correlation. And it leads to a tidy piece of programming. Very tidy.

I haven't taken the time to actually measure the performance of these functions and compare them with more commonly used versions.

But I like the way the Python fits well with the underlying math.

Not shown: The doctest tests for these functions. You can locate sample data and insert your own doctests. It's not difficult.

Hortonworks DataFlow is an integrated platform that makes data ingestion fast, easy, and secure. Download the white paper now.  Brought to you in partnership with Hortonworks


Published at DZone with permission of Steven Lott, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

The best of DZone straight to your inbox.

Please provide a valid email address.

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}