Some Basic Statistics
Some Basic Statistics
Join the DZone community and get the full member experience.Join For Free
Hortonworks Sandbox for HDP and HDF is your chance to get started on learning, developing, testing and trying out new features. Each download comes preconfigured with interactive tutorials, sample data and developments from the Apache community.
I've always been fascinated by the essential statistical algorithms. While there are numerous statistical libraries, the simple measures of central tendency (mean, media, mode, standard deviation) have some interesting features.
Well. Interesting to me.
First, some basics.
def s0( samples ): return len(samples) # sum(x**0 for x in samples) def s1( samples ): return sum(samples) # sum(x**1 for x in samples) def s2( samples ): return sum( x**2 for x in samples )
Why define these three nearly useless functions? It's the cool factor of how they're so elegantly related.
Once we have these, though, the definitions of mean and standard deviation become simple and kind of cool.
def mean( samples ): return s1(samples)/s0(samples) def stdev( samples ): N= s0(samples) return math.sqrt((s2(samples)/N)-(s1(samples)/N)**2)
It's not much, but it seems quite elegant. Ideally, these functions could work from iterables instead of sequence objects, but that's impractical in Python. We must work with a materialized sequence even if we replace len(X) with sum(1 for _ in X).
The next stage of coolness is the following version of Pearson correlation. It involves a little helper function to normalize samples.
def z( x, μ_x, σ_x ): return (x-μ_x)/σ_x
Yes, we're using Python 3 and Unicode variable names.
Here's the correlation function.
def corr( sample1, sample2 ): μ_1, σ_1 = mean(sample1), stdev(sample1) μ_2, σ_2 = mean(sample2), stdev(sample2) z_1 = (z(x, μ_1, σ_1) for x in sample1) z_2 = (z(x, μ_2, σ_2) for x in sample2) r = sum( zx1*zx2 for zx1, zx2 in zip(z_1, z_2) )/len(sample1) return r
I was looking for something else when I stumbled on this "sum of products of normalized samples" version of correlation. How cool is that? The more text-book versions of this involve lots of sigmas and are pretty bulky-looking. This, on the other hand, is really tidy.
Finally, here's least-squares linear regression.
def linest( x_list, y_list ): r_xy= corr( x_list, y_list ) μ_x, σ_x= mean(x_list), stdev(x_list) μ_y, σ_y= mean(y_list), stdev(y_list) beta= r_xy * σ_y/σ_x alpha= μ_y - beta*μ_x return alpha, beta
This, too, was buried at the end of the Wikipedia article. But it was such an elegant formulation for least squares based on correlation. And it leads to a tidy piece of programming. Very tidy.
I haven't taken the time to actually measure the performance of these functions and compare them with more commonly used versions.
But I like the way the Python fits well with the underlying math.
Not shown: The doctest tests for these functions. You can locate sample data and insert your own doctests. It's not difficult.
Published at DZone with permission of Steven Lott , DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.