Outliers and Kettlebells

When you reject a data point as an outlier, you’re saying that the point is unlikely to occur again, despite the fact that you’ve already seen it. This puts you in the curious position of believing that some values you have not seen are more likely than one of the values you have in fact seen.

Maybe you believe that you did not actually see the outlier. If you’re looking at a set of human heights, and one of the values is 61 feet, it is more plausible that you’ve seen a transcription error than that you’ve encountered a person an order of magnitude taller than average.
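As a rough illustration of that kind of sanity check, here is a minimal Python sketch that separates values that could be real from values that are more plausibly recording errors. The bounds, the function name `split_by_plausibility`, and the sample data are all assumptions made up for this example. They are not from any dataset, and a real analysis would need cutoffs justified by domain knowledge.

```python
# Hypothetical plausibility bounds for human height, in feet.
# These cutoffs are assumptions for illustration only.
MIN_PLAUSIBLE_FT = 1.5
MAX_PLAUSIBLE_FT = 9.0


def split_by_plausibility(heights_ft):
    """Separate values that could be real from likely recording errors."""
    plausible, suspect = [], []
    for h in heights_ft:
        if MIN_PLAUSIBLE_FT <= h <= MAX_PLAUSIBLE_FT:
            plausible.append(h)
        else:
            suspect.append(h)
    return plausible, suspect


# Made-up sample; the 61 might be a 6.1 with a misplaced decimal point.
heights = [5.4, 5.9, 6.1, 61, 5.7]
plausible, suspect = split_by_plausibility(heights)
print("plausible:", plausible)
print("suspect:", suspect)
```

Note that this only flags values you do not believe you actually saw. It says nothing about a value that is real but surprising, which is the harder case.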

But if you believe that a data point is real yet unlikely to recur, you are placing more weight on subjective belief than on data, which may or may not be appropriate.

Here’s a personal example. This weekend I bought a kettlebell. As I was waiting in line to check out, I struck up a conversation with the man in line behind me. His right leg was in a cast and resting on a scooter. He told me that he broke his foot in two places by dropping a kettlebell on it! My immediate thought was that this was a fluke, an outlier. My second thought was that according to the only data I have, kettlebells are quite dangerous.

Perhaps the rational decision would have been to leave the store immediately, but I bought the kettlebell anyway. Still, the fellow behind me made an impression. I will think of him every time I work out with the kettlebell and be more careful than I would have been otherwise. Kettlebells are probably more dangerous than I’d like to believe, but so is a sedentary life.


Published at DZone with permission of John Cook, DZone MVB.
