Why We Should Report More Than Just the Mean
Did you mean that mean? Many people's understanding of the mean is below average.
Numbers without context are of very limited use. So it's a good thing that articles in newspapers and reports in the wider world will often compare the figures they relay to the (mean) average. But very often that simply isn't enough to gauge what the data being reported really tells us. There's an old "joke" about a statistician who drowned in a lake with an average depth of a few inches (the precise average depth seems to vary depending on who is telling the joke). Joking aside, over-simplifying by just reporting or comparing with an average really can be highly misleading.
At the time of writing, the White Moose Café in Dublin in the Republic of Ireland has a rating of 3.8 stars (out of 5) on Facebook. From just this number, without looking at the distribution of scores, you might take that to mean something like "People generally think this is a good café which could perhaps make a few improvements to bump it above 4 stars". In fact, the establishment has well over seven thousand reviews but only 42 reviewers gave it a 2-star, 3-star or 4-star rating! The overwhelming majority of ratings are either 1 or 5 stars. This rather extreme example of polarized opinions is the result of a disagreement between the proprietor and a vegan customer that led initially to a bombardment of negative reviews from many further vegans and a subsequent backlash from meat-eaters. It's safe to say most of the reviewers have never been to the café. (You can find out much more about this story here.) The average rating doesn't give us any hint of the underlying story.
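To see how a 3.8 average can hide a split like this, here is a minimal sketch using made-up rating counts. The numbers are hypothetical, chosen only so that 42 ratings fall in the 2 to 4 star range and the mean lands near 3.8; they are not the café's actual figures, and the second "consensus" café is invented purely for comparison.

```python
from statistics import mean, median, mode

# Hypothetical counts per star rating (NOT the real Facebook figures):
# almost everything is 1 or 5 stars, only 42 ratings sit in between.
polarized = {1: 2100, 2: 14, 3: 14, 4: 14, 5: 4900}

# A second, equally hypothetical café with a similar mean but broad agreement.
consensus = {1: 100, 2: 300, 3: 1500, 4: 4142, 5: 1000}


def expand(counts):
    """Turn {stars: count} into a flat list of individual ratings."""
    return [stars for stars, n in counts.items() for _ in range(n)]


for name, counts in [("polarized", polarized), ("consensus", consensus)]:
    ratings = expand(counts)
    # How many reviewers gave a middling score of 2, 3 or 4 stars?
    middle = sum(n for stars, n in counts.items() if stars in (2, 3, 4))
    print(
        f"{name}: mean={mean(ratings):.1f} median={median(ratings)} "
        f"mode={mode(ratings)} ratings of 2-4 stars={middle}"
    )
```

Both invented cafés have a mean that rounds to 3.8, yet the median, the mode and the number of middling ratings are very different.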
So hopefully you can see why it's a good idea to go beyond just reporting (mean) averages or comparing one result to the average. We have plenty of other descriptive statistics that can tell us something more about the distribution of a set of results: median, mode, standard deviation, variance, skew, kurtosis, range, interquartile range... But frequently the best option is to visualize the results. Facebook does actually do this with its review system, as the screenshot below shows:
A classic example illustrating the need for visualization is Anscombe's quartet: a set of four small datasets of paired x and y values. All four datasets have identical mean (9) and variance (11) in the x variable and almost identical mean (~7.5) and variance (~4.12) in the y variable. The correlation coefficient for each dataset is also the same (0.82) to two decimal places. Actually plotting the data as a simple set of scatter plots highlights that the four datasets are, in fact, very different.
Perhaps most surprisingly, the linear regression lines for each set are (almost) the same. This is a case of garbage in, garbage out; if you try to fit a straight line to show how one variable affects another and the relationship is not even close to linear, then don't expect your line to be even remotely representative of your data. Of course, we're not particularly good at absorbing and interpreting large amounts of data in tabular form, so the fact that set II isn't linear may not be entirely obvious in, say, a spreadsheet. Plot your data before trying to fit it!
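The summary statistics and fitted lines described above can be checked with a short script. This is a minimal sketch using the published values of Anscombe's quartet and only the standard library; the helper names `pearson_r` and `least_squares` are my own, not from any particular package.

```python
from statistics import mean, variance

# Anscombe's quartet. Sets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson_r(xs, ys):
    """Pearson correlation coefficient of paired values."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def least_squares(xs, ys):
    """Slope and intercept of the ordinary least-squares line y = a + b*x."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


summary = {}
for name, (xs, ys) in quartet.items():
    slope, intercept = least_squares(xs, ys)
    summary[name] = {
        "mean_x": mean(xs), "var_x": variance(xs),  # sample variance (n - 1)
        "mean_y": mean(ys), "var_y": variance(ys),
        "r": pearson_r(xs, ys), "slope": slope, "intercept": intercept,
    }
    s = summary[name]
    print(
        f"{name:>3}: mean_x={s['mean_x']:.2f} var_x={s['var_x']:.2f} "
        f"mean_y={s['mean_y']:.2f} var_y={s['var_y']:.2f} "
        f"r={s['r']:.2f} fit: y = {s['intercept']:.2f} + {s['slope']:.2f}x"
    )
```

Every row of the output is practically the same, which is exactly why the numbers alone cannot tell the four sets apart.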
Scatter plots are the obvious choice for paired datasets like Anscombe's. The one-dimensional equivalent is the strip plot. Let's just use Anscombe's y values as a quick example:
The strip plots nicely highlight the presence of outliers in Set III and Set IV and show that the bulk of the data points lie between 5 and 10 for all sets.
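For readers without a plotting library to hand, a rough text-only strip plot can be sketched in a few lines. This is only an illustration of the idea, not how the original figure was produced; the axis range (0 to 14), the row width and the helper name `strip_row` are arbitrary choices of mine.

```python
# Anscombe's y values for the four sets.
y_values = {
    "I": [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
    "II": [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
    "III": [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
    "IV": [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
}

LO, HI, WIDTH = 0, 14, 57  # 57 columns over 14 units: 4 columns per unit of y


def strip_row(values, lo=LO, hi=HI, width=WIDTH):
    """Render values as '|' marks on one text line spanning lo..hi."""
    row = ["."] * width
    for v in values:
        col = round((v - lo) / (hi - lo) * (width - 1))
        row[col] = "|"  # points sharing a column collapse into a single mark
    return "".join(row)


for name, ys in y_values.items():
    print(f"{name:>3} {strip_row(ys)}")

# Axis labels every 2 units, i.e. every 8 columns.
print("    " + "".join(f"{tick:<8}" for tick in range(LO, HI + 1, 2)).rstrip())
```

Even in this crude form, the lone marks far to the right in the rows for Set III and Set IV stand out from the rest.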
Strip plots often work well when there is only a modest number of data points for each set. With larger datasets things quickly become overcrowded. One could try to get around this by giving each point a random vertical offset to clear things up a bit, essentially adding jitter to a non-existent second variable, but a more common alternative is to bin the data and create histograms. Below, for example, is a histogram made from 300,000 data points generated by a specific continuous random number generator.
Picking an appropriate bin width is important. Given that the above figure shows continuous data, you may be able to tell that the bin width used is unnecessarily wide. Instead of using bins one unit wide, we can decrease the width to, say, 0.1 units.
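The effect of changing the bin width can be tried out with a small sketch like the one below. The data here is a stand-in: it comes from Python's `random.gauss` with an arbitrary seed and a smaller sample than the figures use, and the helper `bin_counts` is a name I made up for illustration.

```python
import math
import random
from collections import Counter


def bin_counts(values, width):
    """Count values per bin; bin k covers the interval [k*width, (k+1)*width)."""
    # Floating-point division can misplace values sitting exactly on a bin
    # edge, which is acceptable for a sketch like this one.
    return Counter(math.floor(v / width) for v in values)


# Stand-in data (an assumption, not the article's own generator or sample).
rng = random.Random(0)
samples = [rng.gauss(15, 2) for _ in range(30_000)]

coarse = bin_counts(samples, 1.0)  # bins one unit wide
fine = bin_counts(samples, 0.1)    # bins a tenth of a unit wide

# Crude text histogram of the coarse bins: one '#' per 100 points.
for k in sorted(coarse):
    print(f"{k:>3} {'#' * (coarse[k] // 100)}")

print(f"{len(coarse)} bins at width 1, {len(fine)} bins at width 0.1")
```

The same points spread over roughly ten times as many bins, so the finer histogram traces the shape of the distribution much more smoothly, at the cost of noisier counts in each individual bin.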
Hopefully this makes it more obvious that the random number generator was pulling numbers from a normal distribution. The mean of the specific distribution was 15 and the standard deviation 2. In the next example numbers are drawn from a different normal distribution.
The normal distribution in this case has the same mean as the previous example (15), but the standard deviation is much bigger (5). This means that the probability of getting a number below 8 or above 22 is much, much higher than for the previous example. But there's no way of telling that if you just quote the mean.
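How much higher? As a quick check, the tail probabilities for the two distributions can be computed with `statistics.NormalDist` from the Python standard library (available from Python 3.8). The helper name `outside` is my own.

```python
from statistics import NormalDist


def outside(dist, low, high):
    """Probability that a draw from dist falls below low or above high."""
    return dist.cdf(low) + (1 - dist.cdf(high))


narrow = NormalDist(mu=15, sigma=2)  # first example: mean 15, std dev 2
wide = NormalDist(mu=15, sigma=5)    # second example: mean 15, std dev 5

p_narrow = outside(narrow, 8, 22)
p_wide = outside(wide, 8, 22)

print(f"std dev 2: P(x < 8 or x > 22) = {p_narrow:.6f}")
print(f"std dev 5: P(x < 8 or x > 22) = {p_wide:.6f}")
print(f"ratio: {p_wide / p_narrow:.0f}x")
```

With a standard deviation of 2 the chance is roughly 0.05%; with a standard deviation of 5 it is roughly 16%, several hundred times larger, despite the identical mean.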
Published at DZone with permission of Tim Brock, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.