Over a million developers have joined DZone.

Etsy Engineer: "Whom the Gods Would Destroy, They First Give Real-Time Analytics"

DZone's Guide to

Etsy Engineer: "Whom the Gods Would Destroy, They First Give Real-Time Analytics"

· Big Data Zone
Free Resource

Access NoSQL and Big Data through SQL using standard drivers (ODBC, JDBC, ADO.NET). Free Download 

Prognosticating  analysts suggest that 2013 will be the year of real-time analytics. But Dan McKinley, Principal Engineer at Etsy.com, suggests we all hold on a sec. "Whom the gods would destroy," he writes, "they first give real-time analytics."  

...There are many ways to screw yourself with real-time analytics. I will endeavor to list a few.

The first and most fundamental way is to disregard statistical significance testing entirely. This is a rookie mistake, but it's one that's made all of the time. Let's say you're testing a text change for a link on your website. Being an impatient person, you decide to do this over the course of an hour. You observe that 20 people in bucket A clicked, but 30 in bucket B clicked. Satisfied, and eager to move on, you choose bucket B. There are probably thousands of people doing this right now, and they're getting away with it.

This is a mistake because there's no measurement of how likely it is that the observation (20 clicks vs. 30 clicks) was due to chance. Suppose that we weren't measuring text on hyperlinks, but instead we were measuring two quarters to see if there was any difference between the two when flipped. As we flip, we could see a large gap between the number of heads received with either quarter. But since we're talking about quarters, it's more natural to suspect that that difference might be due to chance. Significance testing lets us ascertain how likely it is that this is the case.

A subtler error is to do significance testing, but to halt the experiment as soon as significance is measured. This is always a bad idea, and the problem is exacerbated by trying to make decisions far too quickly. Funny business with timeframes can coerce most A/B tests into statistical significance.

It's not a jeremiad against real-time analytics tools, but rather an appeal to use the right tools mindfully, without respect to buzz cycles. The full article is well worth a read.

The fastest databases need the fastest drivers - learn how you can leverage CData Drivers for high performance NoSQL & Big Data Access.


Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}