There Are No Outliers

Matt Briggs’ comment on outliers in his post Tyranny of the mean:

Coontz used the word “outliers”. There are no such things. There can be mismeasured data, i.e. incorrect data, say when you tried to measure air temperature but your thermometer fell into boiling water. Or there can be errors in recording the data; transposition and such forth. But excluding mistakes, and the numbers you meant to measure are the numbers you meant to measure, there are no outliers. There are only measurements which do not accord with your theory about the thing of interest.

Emphasis added.

I have a slight quibble with this description of outliers. Some people use the term to mean legitimate extreme values, and some use the term to mean values that “didn’t really happen” in some sense. I assume Matt is criticizing the latter. For example, Michael Jordan’s athletic ability is an outlier in the former sense. He’s only an outlier in the latter sense if someone decides he “doesn’t count” in some context.

A few weeks ago I said this about outliers:

When you reject a data point as an outlier, you’re saying that the point is unlikely to occur again, despite the fact that you’ve already seen it. This puts you in the curious position of believing that some values you have not seen are more likely than one of the values you have in fact seen.

Sometimes you have to exclude a data point because you believe it is far more likely to be a mistake than an accurate measurement. You may also decide that an extreme value is legitimate, but that you wish to exclude it from your model. The latter should be done with fear and trembling, or at least with an explicit disclaimer.
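One way to make the "explicit disclaimer" concrete is to report a summary both with and without the excluded point, and to carry the excluded values along in the result. The sketch below is only an illustration under stated assumptions: the `summarize` helper and the temperature readings are hypothetical, not taken from either post, and the value 100.0 stands in for the thermometer that fell into boiling water.

```python
from statistics import mean, median


def summarize(values, excluded=()):
    """Summarize data and record any excluded points alongside the result."""
    excluded = list(excluded)
    kept = list(values)
    for x in excluded:
        # Raises ValueError if x was never observed, so you can only
        # exclude points that are actually in the data.
        kept.remove(x)
    return {
        "n_kept": len(kept),
        "mean": mean(kept),
        "median": median(kept),
        "excluded": excluded,  # the disclaimer: nothing is dropped silently
    }


# Hypothetical air-temperature readings in degrees Celsius.
readings = [21.0, 22.0, 20.0, 23.0, 100.0]

full = summarize(readings)
trimmed = summarize(readings, excluded=[100.0])

print("all points:  ", full)
print("one excluded:", trimmed)
```

Reporting both summaries lets the reader see how much the conclusion depends on the decision to exclude, whichever reason you had for making it.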

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.
