Sampling Sucks: Why to Collect, Keep, and Use All of Your User Data

Here is the case for keeping all of your data, with examples from Amazon.

There’s a myth in the performance industry that you don’t need to collect 100% of your user data. There’s another myth that you don’t need to keep it. Why are these myths so prevalent? Because there’s also a truth in our industry: only about 5% of the data that gets collected ever gets used in a meaningful way.

From our very beginnings at SOASTA back in 2006, our strategy has been to collect all of the data, all of the time, and keep it forever. Why? It’s not because we’re that guy who likes to show off his 600-horsepower BMW that never does anything more than highway driving. It’s because our goal is to extract meaningful insights from the other 95% of your data.

I’m going to get to how we do that in a minute, but first, I want to talk about the yellow elephant in the room…

Be Like Amazon

There’s a reason why Amazon is Amazon and everyone else isn’t — Amazon does digital transformation supremely well. They continuously monitor, measure, collect, keep, and exploit information about every user experience. They use that data to continuously improve what they’re doing. They know that when they improve the user experience, they improve the overall performance of their business.

What’s important to know about Amazon is that they have taken a very different approach to analytics than anybody else. What they’re doing — and how they’re doing it — is worth studying. Amazon’s CTO, Werner Vogels, has been giving a talk since 2011 called Data Without Limits, in which he describes (using other companies as examples) Amazon’s approach to analytics. Werner’s overarching point is that you need to collect and keep as much high-quality user data as possible. As he says, “Bigger is better. The more data you collect, the more fine-grained you can do your analysis.” He also adds the caveat, “The quality of data is much more important than the amount of data that you have.”

I firmly agree on both counts. So why aren’t more organizations collecting and using not just more data, but more high-quality data?

Why You're Probably Still Using a Dated Approach to Data Collection (and Why You Should Stop)

Consider a web property that gets 10 million page views a day. That’s roughly 3.65 billion page views a year and, at well over a hundred resources per page, about half a trillion page resources that were components of those page views. That’s a tremendous amount of data to store.
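To make that arithmetic concrete, here is a minimal sketch. The 10 million page views a day comes from the example above; the figure of 137 resources per page view is an assumption, back-calculated so the total lands near half a trillion, not a measured value.

```python
# 10 million page views per day is the figure from the example above.
PAGE_VIEWS_PER_DAY = 10_000_000

# Assumption: average number of resources (images, scripts, CSS, etc.)
# per page view, chosen so the yearly total is close to half a trillion.
RESOURCES_PER_PAGE_VIEW = 137

page_views_per_year = PAGE_VIEWS_PER_DAY * 365
resources_per_year = page_views_per_year * RESOURCES_PER_PAGE_VIEW

print(f"{page_views_per_year:,} page views per year")
print(f"{resources_per_year:,} page resources per year")
```

Swap in your own traffic and resource counts to estimate how many records a full, unsampled collection would need to hold.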

Traditional analytics dashboard systems (such as the ones that many of you are probably using) were built in the early 2000s, before the cloud. They were designed to sample and aggregate. The primary design rationale behind that was to keep the cost of compute and storage for that system inside somebody’s budget.

But today, because of AWS and all the other cloud companies — including Google and Microsoft and IBM — we’ve had an amazing leap forward both in terms of capabilities and in terms of the economics of being able to do this. (I spend about $9K a month to store a massive amount of data. It can be done.)

Five Types of Performance Analysis That Can Be Done Only by Using 100% of Your Data

At SOASTA we’ve collected more than 425 billion user experience beacons. We keep all the data from every one of them. Here’s why.

1. Predictive Analytics

The future can never be predicted with 100% accuracy, but it can be modeled with a high degree of certainty using past performance data and sophisticated predictive analytics. The more data available, the more accurate the prediction will be.

For example, our mPulse solution uses a nonlinear regression of a log-normal distribution to model sessions across load times. This model, which most closely resembles true user behavior, creates the histogram in the graphic below. When the sliders in the “What-If” dashboards are adjusted, the model and histogram adjust accordingly and calculate a new conversion rate, from which revenue is then derived.

In this case, revenue is simply the change in conversions multiplied by the average order amount. Of course, it will vary with the absolute number of sessions, which may change over the course of a data set.

[Figure: predictive analytics dashboard]
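The two ideas in this section, a log-normal model of load times and revenue as the change in conversions times the average order amount, can be sketched roughly as follows. This is not how mPulse is implemented: instead of a nonlinear regression it fits a normal distribution to the logarithm of each load time, and the conversion rates are typed in by hand rather than derived from the model. Every function name and number here is hypothetical.

```python
import math
from statistics import NormalDist


def fit_log_normal(load_times_s):
    """Fit a log-normal by fitting a normal distribution to log(load time)."""
    return NormalDist.from_samples([math.log(t) for t in load_times_s])


def share_of_sessions_faster_than(model, seconds):
    """Fraction of sessions the model expects to load in under `seconds`."""
    return model.cdf(math.log(seconds))


def revenue_change(sessions, baseline_rate, what_if_rate, average_order_amount):
    """Revenue delta = change in conversions x average order amount."""
    change_in_conversions = sessions * (what_if_rate - baseline_rate)
    return change_in_conversions * average_order_amount


# Hypothetical inputs, for illustration only.
load_times_s = [1.2, 1.8, 2.1, 2.6, 3.0, 3.4, 4.1, 5.5, 7.9]
model = fit_log_normal(load_times_s)
print(f"Sessions under 3 s: {share_of_sessions_faster_than(model, 3.0):.0%}")

delta = revenue_change(
    sessions=100_000,
    baseline_rate=0.020,   # assumed current conversion rate
    what_if_rate=0.022,    # assumed rate after a speed-up
    average_order_amount=80.0,
)
print(f"Estimated revenue change: ${delta:,.0f}")
```

Note how the result scales directly with the session count, which is why the estimate shifts as the number of sessions in the data set changes.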

Read: How does the mPulse What-If Dashboard work? And why does your business need it?

2. Machine Learning for Real-time Campaign Monitoring

As I travel around and meet with prospects and customers, one of the biggest problems I hear about is that a very large percentage of their marketing campaigns fail at launch, and they don’t see it until it’s too late. Marketing analytics systems do not give you the information to determine that the $5 million email campaign you just launched is failing. You get to find out tomorrow. That sucks, frankly.

One of the primary use cases for machine learning is to create context. You run all of your traffic through machine learning to establish what’s normal, expressed as tolerance bands around each metric. Then you can plot, in real time, user behavior against those tolerance bands. You fire alerts only when what happens with that metric, activity, or revenue achievement goes out of bounds. And you fix problems today, when it matters.
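As a rough illustration of the tolerance-band idea, the sketch below swaps the machine learning model for a much simpler statistical baseline: the mean of past observations plus or minus a few standard deviations. The function names, the multiplier `k`, and the revenue figures are all assumptions made for the example.

```python
from statistics import mean, stdev


def tolerance_band(history, k=3.0):
    """Return (low, high) bounds: mean +/- k sample standard deviations."""
    center = mean(history)
    spread = stdev(history)
    return center - k * spread, center + k * spread


def out_of_bounds(value, band):
    """True when the observed value falls outside the tolerance band."""
    low, high = band
    return value < low or value > high


# Hypothetical revenue per minute at this time of day on previous days.
history = [1000, 1040, 980, 1010, 970, 1020, 990]
band = tolerance_band(history)

# Two hypothetical live readings: one normal, one from a failing campaign.
for observed in (1005, 620):
    if out_of_bounds(observed, band):
        print(f"ALERT: {observed} is outside {band[0]:.0f}..{band[1]:.0f}")
```

A real system would learn separate bands per metric, time of day, and day of week, which is where having the full history rather than a sample pays off.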

[Figure: Web performance: tolerance bands and machine learning]

Read: How SOASTA and Google used machine learning to predict bounce rate and conversions

3. Third-party Resource Analytics

For most of you, more than half of the resources on your site are coming from third parties. I would tell you straight up, if you aren’t managing third-party performance, you’re not managing performance at all.

If you’re relying on data samples to tell you how your third parties are performing, that’s like watching a bunch of movie stills and thinking you’re watching the movie. Those samples are not going to give you any meaningful insight into:

  • micro-outages, localized performance issues, and other third-party glitches;
  • how a third party performed historically on your site; and
  • how the performance or non-performance of those third parties affected your business.
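To see how easily a sample can skip over a micro-outage, consider this small simulation. Everything in it is made up for illustration: 10,000 beacons for one third-party script, a 60-beacon outage window, the load times, and a systematic 1-in-100 sample (used instead of a random one so the result is reproducible).

```python
def third_party_beacons(total=10_000, outage=range(4_201, 4_261)):
    """Yield (beacon_id, load_time_ms) with a short simulated outage."""
    for beacon_id in range(total):
        # During the outage the third party takes 8 s; otherwise 120 ms.
        yield beacon_id, 8_000 if beacon_id in outage else 120


def slow_count(beacons, threshold_ms=2_000):
    """Count beacons whose third-party load time exceeds the threshold."""
    return sum(1 for _, ms in beacons if ms > threshold_ms)


all_beacons = list(third_party_beacons())
sampled = all_beacons[::100]  # keep 1 beacon in every 100

print("Slow beacons in full data:", slow_count(all_beacons))
print("Slow beacons in 1% sample:", slow_count(sampled))
```

Here the full data set shows all 60 affected users, while the sample shows none. A random 1% sample would expect to catch fewer than one of those 60 beacons, so the outage would still be easy to miss.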

Read: 10 pro tips for managing the performance of your third-party scripts

4. Advanced Analytics

There are a lot more cool things you can do, so I’m going to pick just one of them to illustrate: Conversion Impact Score. We calculate this by taking all of your page views, reconstituting them as sessions, and determining which sessions converted. Then we rank the page groups in your traffic in order of their importance to a conversion and plot the results on a graph like this:

[Figure: conversion impact score]

Why do we do this? Because not everything is important, and not everything needs to be optimized. In most engineering teams, if you give the team a goal that says, “Please optimize our site. Please cut our page load times in half,” they will do it, but they will do it in the easiest way possible, and they will do it in a way that might not be the most impactful to your users and your business.

One of the things that you’ll notice in the chart above is that the four most important page groups on this website are:

  • browse (i.e., product and category pages),
  • SKU,
  • home page, and
  • search.

Those blue bars show that these four page groups have the highest Conversion Impact Scores. The green line shows their load times — and they’re all fairly slow.

This scoring — which can only be done by collecting all your user data — lets your performance engineering team focus on those four page groups that matter, not all the other page groups that don’t.
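The steps described in this section (reconstitute page views as sessions, find the sessions that converted, rank page groups) can be sketched as below. The actual Conversion Impact Score formula is not given here, so this uses a simple stand-in: the share of converting sessions that touched each page group. The beacon layout, the `order-confirmation` marker, and the function names are all hypothetical.

```python
from collections import defaultdict

# Assumption: a session counts as converted if it reached this page group.
CONVERSION_GROUP = "order-confirmation"


def sessions_from_page_views(page_views):
    """Group (session_id, page_group) beacons into one set per session."""
    sessions = defaultdict(set)
    for session_id, page_group in page_views:
        sessions[session_id].add(page_group)
    return sessions


def rank_page_groups(page_views):
    """Rank page groups by the share of converting sessions that used them."""
    sessions = sessions_from_page_views(page_views)
    converted = [g for g in sessions.values() if CONVERSION_GROUP in g]
    counts = defaultdict(int)
    for groups in converted:
        for group in groups - {CONVERSION_GROUP}:
            counts[group] += 1
    scores = {group: n / len(converted) for group, n in counts.items()}
    # Highest score first; ties broken by name for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical page-view beacons for four sessions.
page_views = [
    ("s1", "home"), ("s1", "browse"), ("s1", "SKU"), ("s1", "order-confirmation"),
    ("s2", "search"), ("s2", "browse"), ("s2", "order-confirmation"),
    ("s3", "home"), ("s3", "about"),
    ("s4", "browse"), ("s4", "SKU"),
]

for page_group, score in rank_page_groups(page_views):
    print(f"{page_group}: {score:.2f}")
```

Because the ranking is built from whole sessions, dropping page views through sampling would break sessions apart and distort the scores.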

5. Analytics That Haven’t Been Invented Yet

The last reason to collect and keep all of your data is to deal with uncertainty. You don’t know what questions you’re going to want to ask in the future. You don’t know what algorithms are going to be invented, or that you might implement, that will end up being very valuable to your company. When those questions and algorithms come up, do you want to be forced to wait until you’ve amassed enough data to answer or deploy them?

What's at Stake If You Don't Think Like Amazon?

Let’s talk numbers. Last year, retail e-commerce sales were $1.6 trillion worldwide. This market is expected to more than double to $3.6 trillion by 2019. The US digital ad marketplace is at about $70 billion and is headed to $100 billion by 2021.

More People Use Amazon to Do Product Searches Than Use Google.

Today, when customers begin a product search, 50% of them start with Amazon. If you’re a retailer, that should scare the hell out of you. Heck, even Google should be taken aback by this stat.

People Who Love Your Product Are Not Loyal to Your Website.

If they can buy it through Amazon Marketplace, they will. Millennials — the single largest demographic today — care more about price and customer experience than they care about brand loyalty. And Amazon Prime — which is used by one out of five Amazon.com users — makes it unbelievably easy, fast, and cost-effective for consumers to use it for all their shopping. I’m a big user of Amazon Prime for exactly those reasons.

Retailers Aren't the Only Ones Who Should Be Sweating

If you’re in the media business, look behind you. Netflix is still the leader in streaming video, but Amazon Prime Video is catching up in one crucial area: customer satisfaction. On the hardware end of things, last year Apple TV sales dropped behind Amazon Fire sales. And on the side, Jeff Bezos has transformed The Washington Post.

Ditto music, photo processing, and the nascent voice computing market. No matter what business you’re in — traditional or emerging — it’s fairly safe to say that Amazon has its eye on it.

There's a Lot at Stake Here

Depending on your ability to make the digital transition, your business may or may not be here five years from now.

Published at DZone with permission of
