Etsy Engineer: "Whom the Gods Would Destroy, They First Give Real-Time Analytics"

Prognosticating analysts suggest that 2013 will be the year of real-time analytics. But Dan McKinley, Principal Engineer at Etsy.com, suggests we all hold on a sec. "Whom the gods would destroy," he writes, "they first give real-time analytics."

...There are many ways to screw yourself with real-time analytics. I will endeavor to list a few.

The first and most fundamental way is to disregard statistical significance testing entirely. This is a rookie mistake, but it's one that's made all of the time. Let's say you're testing a text change for a link on your website. Being an impatient person, you decide to do this over the course of an hour. You observe that 20 people in bucket A clicked, but 30 in bucket B clicked. Satisfied, and eager to move on, you choose bucket B. There are probably thousands of people doing this right now, and they're getting away with it.

This is a mistake because there's no measurement of how likely it is that the observation (20 clicks vs. 30 clicks) was due to chance. Suppose that we weren't measuring text on hyperlinks, but instead we were measuring two quarters to see if there was any difference between the two when flipped. As we flip, we could see a large gap between the number of heads received with either quarter. But since we're talking about quarters, it's more natural to suspect that that difference might be due to chance. Significance testing lets us ascertain how likely it is that this is the case.
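As an illustration of the point in the quoted passage (this sketch is not from McKinley's article), here is one way to check the 20-versus-30 observation with a pooled two-proportion z-test. The excerpt does not say how many visitors each bucket received, so the figure of 1,000 per bucket is purely an assumption, as is the function name `two_proportion_z_test`.

```python
import math

def two_proportion_z_test(clicks_a, n_a, clicks_b, n_b):
    """Return (z, two-sided p-value) for a pooled two-proportion z-test."""
    rate_a = clicks_a / n_a
    rate_b = clicks_b / n_b
    # Under the null hypothesis both buckets share one click rate.
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided tail probability of a standard normal variable.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# ASSUMPTION: 1,000 visitors per bucket; the source gives only the click counts.
z, p = two_proportion_z_test(20, 1000, 30, 1000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

Under that assumed traffic, the p-value comes out at roughly 0.15, well above the conventional 0.05 threshold, so a 20-to-30 split of this size could plausibly be chance.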

A subtler error is to do significance testing, but to halt the experiment as soon as significance is measured. This is always a bad idea, and the problem is exacerbated by trying to make decisions far too quickly. Funny business with timeframes can coerce most A/B tests into statistical significance.
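The early-stopping problem can be sketched with a small simulation (again an illustration, not code from McKinley's article). Both buckets are given the same true click rate, so any "significant" result is a false positive. The simulation compares stopping at the first significant look against testing once at the end. All parameters (400 experiments, 10 looks, 100 visitors per bucket per look, a 10% click rate) and the names `p_value` and `simulate` are assumptions chosen for the example.

```python
import math
import random

def p_value(clicks_a, n_a, clicks_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0  # no variation observed, so no evidence of a difference
    z = (clicks_b / n_b - clicks_a / n_a) / std_err
    return math.erfc(abs(z) / math.sqrt(2))

def simulate(n_experiments=400, looks=10, batch=100, rate=0.1,
             alpha=0.05, seed=0):
    """Return (false-positive rate with peeking, rate with one final test)."""
    rng = random.Random(seed)
    hit_with_peeking = 0
    hit_at_end = 0
    for _experiment in range(n_experiments):
        clicks_a = clicks_b = visitors = 0
        ever_significant = False
        p = 1.0
        for _look in range(looks):
            # Both buckets share the same true rate: an A/A test.
            clicks_a += sum(rng.random() < rate for _ in range(batch))
            clicks_b += sum(rng.random() < rate for _ in range(batch))
            visitors += batch
            p = p_value(clicks_a, visitors, clicks_b, visitors)
            if p < alpha:
                ever_significant = True  # a peeker would have stopped here
        hit_with_peeking += ever_significant
        hit_at_end += p < alpha
    return hit_with_peeking / n_experiments, hit_at_end / n_experiments

peek_rate, fixed_rate = simulate()
print(f"false positives with peeking: {peek_rate:.1%}")
print(f"false positives with one final test: {fixed_rate:.1%}")
```

With settings like these, the peeking rate typically lands well above the nominal 5%, while the single final test stays close to it. Fixing the sample size before the experiment starts is the simplest guard.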

McKinley's post is not a jeremiad against real-time analytics tools, but rather an appeal to use the right tools mindfully, without regard to buzz cycles. The full article is well worth a read.



