Etsy Engineer: "Whom the Gods Would Destroy, They First Give Real-Time Analytics"

Prognosticating analysts suggest that 2013 will be the year of real-time analytics. But Dan McKinley, Principal Engineer at Etsy.com, suggests we all hold on a sec. "Whom the gods would destroy," he writes, "they first give real-time analytics."  

...There are many ways to screw yourself with real-time analytics. I will endeavor to list a few.

The first and most fundamental way is to disregard statistical significance testing entirely. This is a rookie mistake, but it's one that's made all of the time. Let's say you're testing a text change for a link on your website. Being an impatient person, you decide to do this over the course of an hour. You observe that 20 people in bucket A clicked, but 30 in bucket B clicked. Satisfied, and eager to move on, you choose bucket B. There are probably thousands of people doing this right now, and they're getting away with it.

This is a mistake because there's no measurement of how likely it is that the observation (20 clicks vs. 30 clicks) was due to chance. Suppose that we weren't measuring text on hyperlinks, but instead we were measuring two quarters to see if there was any difference between the two when flipped. As we flip, we could see a large gap between the number of heads received with either quarter. But since we're talking about quarters, it's more natural to suspect that that difference might be due to chance. Significance testing lets us ascertain how likely it is that this is the case.
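To make the 20-versus-30 example concrete, here is a minimal sketch of one common significance check, a pooled two-proportion z-test, written with only the Python standard library. The excerpt does not say how many visitors each bucket received, so the figure of 1,000 per bucket is purely an assumption for illustration, as are the function and variable names; this is not the method McKinley's article prescribes.

```python
import math


def two_proportion_z_test(clicks_a, visitors_a, clicks_b, visitors_b):
    """Return (z, two-sided p-value) for a pooled two-proportion z-test."""
    rate_a = clicks_a / visitors_a
    rate_b = clicks_b / visitors_b
    # Under the null hypothesis both buckets share one click rate.
    pooled = (clicks_a + clicks_b) / (visitors_a + visitors_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided tail probability of a standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # ASSUMPTION: 1,000 visitors per bucket; the article gives only click counts.
    z, p = two_proportion_z_test(20, 1000, 30, 1000)
    print(f"z = {z:.2f}, p = {p:.3f}")
```

Under that assumed traffic, the p-value comes out at roughly 0.15, well above the conventional 0.05 threshold, so a gap of 20 versus 30 clicks could easily be due to chance.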

A subtler error is to do significance testing, but to halt the experiment as soon as significance is measured. This is always a bad idea, and the problem is exacerbated by trying to make decisions far too quickly. Funny business with timeframes can coerce most A/B tests into statistical significance.
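The effect of halting as soon as significance appears can be sketched with a small simulation. The code below runs A/A experiments, in which both buckets have the same true click rate, and compares how often a test looks significant at any of several interim checks against how often it is significant at the single planned endpoint. All parameters (click rate, batch size, number of checks, number of runs, seed) are assumptions chosen for illustration, and the pooled two-proportion z-test is just one reasonable choice of test.

```python
import math
import random


def p_value(clicks_a, clicks_b, n):
    """Two-sided p-value of a pooled z-test with n visitors in each bucket."""
    pooled = (clicks_a + clicks_b) / (2 * n)
    std_err = math.sqrt(pooled * (1 - pooled) * 2 / n)
    if std_err == 0:
        return 1.0  # no variation observed, so no evidence of a difference
    z = (clicks_b - clicks_a) / n / std_err
    return math.erfc(abs(z) / math.sqrt(2))


def simulate(runs=400, peeks=20, batch=100, rate=0.1, alpha=0.05, seed=1):
    """Return (false-positive rate with peeking, false-positive rate at the end)."""
    rng = random.Random(seed)
    crossed_at_some_peek = 0
    significant_at_end = 0
    for _ in range(runs):
        clicks_a = clicks_b = visitors = 0
        crossed = False
        for _ in range(peeks):
            # Both buckets draw from the SAME click rate: any "win" is noise.
            clicks_a += sum(rng.random() < rate for _ in range(batch))
            clicks_b += sum(rng.random() < rate for _ in range(batch))
            visitors += batch
            p = p_value(clicks_a, clicks_b, visitors)
            if p < alpha:
                crossed = True  # an impatient experimenter would stop here
        crossed_at_some_peek += crossed
        significant_at_end += p < alpha  # p from the final, planned check
    return crossed_at_some_peek / runs, significant_at_end / runs


if __name__ == "__main__":
    peeking, fixed = simulate()
    print(f"false positives when stopping at first significance: {peeking:.1%}")
    print(f"false positives when testing once at the end:        {fixed:.1%}")
```

Because the final check is also one of the peeks, the peeking rate can never be lower than the fixed-horizon rate, and with repeated looks it is typically several times higher than the nominal 5%.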

McKinley's piece is not a jeremiad against real-time analytics tools, but rather an appeal to use the right tools mindfully, without regard to buzz cycles. The full article is well worth a read.
