This post comes from Bill Kayser at the New Relic blog.
How well is your web application servicing your customers right now?
That’s a common starting point when using an APM tool like New Relic. To put it more succinctly:
How are you doing?
There’s plenty of evidence and opinion to suggest that watching your average response time won’t give you anything near a complete answer to this question, which leaves many customers choosing to rely on the Apdex measurement instead. So what does Apdex really represent? Let’s take a closer look.
Start by considering what it really is you want to know from looking at a chart when you ask the “how am I doing?” question. Let’s assume you are focused on the user experience and, in particular, page load times as seen from the user. So you could start with a chart showing the page load time of every request processed by your server over some representative sample period. Here is a scatter plot of response times for one of our key transactions in New Relic:
This does give us all the information we want: every page request for a 20-minute period. The typical experience looks to be around 1.0 to 1.2 seconds, where the center of mass is. But there are a lot of outliers, and probably many more that are off the chart and that we are missing completely. Is it really helpful? How are you doing?
Let’s try a histogram. This will show us a count of page load times within discrete buckets, giving us a better sense of how many outliers there are relative to the center of mass. Here’s the same data as above:
This feels a little more digestible. You can see the range of response times starting at around 100 ms and going up to eight seconds. You can see that 3.6 percent of requests are completely off the chart. Does it really tell you how you are doing? Seems like you are getting there, but let’s try to add a little more information. Below is the same histogram but with markers to indicate the mean response time (green line), 95th percentile response time (dashed line) and the median or 50th percentile (red line). Also indicated are the middle two quartiles, the red region, representing the response times that fall between the 25th and 75th percentiles.
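To make those markers concrete, here is a minimal sketch of how the mean, the median, the middle two quartiles and the 95th percentile could be computed with Python's standard library. The `load_times` values are made up for illustration and are not the data behind the charts in this post.

```python
import statistics

# Hypothetical page load times in seconds (illustrative only, not the post's data).
load_times = [0.4, 0.9, 1.0, 1.1, 1.1, 1.2, 1.3, 1.6, 2.4, 9.0]

mean = statistics.mean(load_times)
median = statistics.median(load_times)

# quantiles(n=4) returns three cut points: the 25th, 50th and 75th percentiles.
q1, _, q3 = statistics.quantiles(load_times, n=4, method="inclusive")

# quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
p95 = statistics.quantiles(load_times, n=20, method="inclusive")[-1]

print(f"mean={mean:.2f}s median={median:.2f}s")
print(f"middle two quartiles: {q1:.2f}s to {q3:.2f}s, 95th percentile: {p95:.2f}s")
```

In this made-up sample a single nine-second request pulls the mean above the whole middle half of the distribution, which is the kind of skew the markers on the histogram are meant to expose.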
So now we’re getting somewhere. There’s a little more shape to our data. We have a much better sense of how significant those slowest loading pages are, and how prevalent they are. Most of our users are in the 1.2 - 2.4 second range, even though our average response time is 2.5 seconds. About 25 percent of our users have completely intolerable response times, more than seven seconds.
How are you doing? Not very well.
At least we have a much more informed position from the data on this chart than we would if we erased everything but the green line. In fact, we could probably use even more information. To understand our user experience we’d probably want to know if this distribution is consistent across all our users, or if there are vast differences among different browser versions or regions. Unfortunately the data we have at hand is often limited in the number of annotated attributes, so it’s hard to correlate.
Worse yet, we may not even have access to the event-level data needed to build charts like the ones above. Histograms and scatter plots require storing a lot of data. Even just a couple of percentile measurements like the median and the 95th percentile can be tricky when you have to aggregate and resample data. So, what you are often left with is just average response time.
Here’s where Apdex can fill the gap. Apdex is like a histogram with just three buckets: Satisfy, Tolerate and Fail. The buckets represent requests that are satisfying the user, requests that users just tolerate, and requests that fail to meet the users’ expectations completely. The bucket boundaries are set at T and 4T, where T is a response time threshold you choose in advance: requests up to T fall in Satisfy, requests between T and 4T fall in Tolerate, and anything slower than 4T falls in Fail. Here’s what that looks like if you shade the Apdex buckets in our histogram:
I chose 1500 ms as a T value. The yellow region is the pages under 1500 ms. The red region goes from 1500 ms to 4T, or six seconds. The black region is the failed pages, those that took longer than six seconds. The Apdex score is a formula based on the count of pages falling into each of these three regions. You take the number of satisfying requests plus half the number of requests in the middle area, and divide by the total count. For this T value, our score is 0.7.
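The formula is short enough to sketch in a few lines of Python. This is only an illustration: the `apdex` function name and the `samples` list are hypothetical, and treating a request of exactly T as satisfying (and exactly 4T as tolerated) is an assumption about the bucket edges.

```python
def apdex(load_times, t):
    """Return the Apdex score for load times and a threshold t, both in seconds."""
    if not load_times:
        raise ValueError("need at least one sample")
    # Requests at or under T satisfy the user (boundary handling is an assumption).
    satisfied = sum(1 for x in load_times if x <= t)
    # Requests between T and 4T are merely tolerated and count for half.
    tolerated = sum(1 for x in load_times if t < x <= 4 * t)
    # Anything slower than 4T has failed and contributes nothing to the numerator.
    return (satisfied + tolerated / 2) / len(load_times)


# Hypothetical sample, not the data behind the charts in this post.
samples = [0.5, 1.0, 1.2, 1.4, 2.0, 3.0, 5.5, 6.5, 8.0, 12.0]
print(f"Apdex at T=1.5s: {apdex(samples, 1.5):.2f}")
```

With this sample, four requests satisfy, three are tolerated and three fail, so the score works out to (4 + 3/2) / 10.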
If you move the T value up or down, the regions adjust and so does your score. If I move the T value down to one second, the regions all shift left and my Apdex score goes from 0.7 to 0.49.
With T at one second, we are basically treating every request above four seconds as a failure. It doesn’t matter if it’s four seconds or 40 seconds–the result is the same: we failed. We are graded harshly for requests longer than one second, and anything under one second is viewed as serving our users. The difference between 800 ms and 1,000 ms is not important to us because it’s probably not noticeable by our users.
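A small experiment shows both effects: the score moving with T, and the score ignoring how slow a failed request actually was. This is a rough sketch using a hypothetical `apdex` helper and made-up samples, with requests of exactly T counted as satisfying.

```python
def apdex(load_times, t):
    """Apdex score: (satisfied + tolerated / 2) / total, for threshold t in seconds."""
    satisfied = sum(1 for x in load_times if x <= t)
    tolerated = sum(1 for x in load_times if t < x <= 4 * t)
    return (satisfied + tolerated / 2) / len(load_times)


# Hypothetical load times in seconds (illustrative only).
samples = [0.8, 1.0, 1.3, 1.4, 2.5, 3.5, 4.5, 5.0, 7.0, 40.0]

# Lowering T shifts every bucket left and lowers the score.
for t in (2.0, 1.5, 1.0):
    print(f"T={t:.1f}s -> Apdex {apdex(samples, t):.2f}")

# With T at one second, everything above four seconds is a failure.
# Making those requests far slower does not change the score at all.
much_slower = [x if x <= 4.0 else 40.0 for x in samples]
print(apdex(samples, 1.0) == apdex(much_slower, 1.0))
```

The 0.8 and 1.0 second requests land in the same bucket at T = 1.0, which mirrors the point that the difference between 800 ms and 1,000 ms does not affect the grade.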
So when you ask "how well are you serving your customers right now?" this isn’t a bad place to start. Pick a T value that characterizes your expectations for your site. We are pretty comfortable with the description based on a value of T = one second, so we’ll use that. Now we can use the score, multiplied by 100, to answer the question of how well we are serving our customers on a scale from 0 to 100. You’ll be in a much better place than you would be with an answer that simply states the average response time is 2 seconds.
How are we doing? We are at a 49 on a scale of zero to 100. Not very good at all.
The other nice thing about an Apdex score is that it allows you to answer the question quickly for an array of web transactions or applications. You can set a T value individually for each key transaction or application, see a list of scores, and quickly identify where you need to focus for improvement. Looking at a column of response times isn’t really going to help, since those times might mean different things for each individual transaction or application. Is a response time of two seconds good or bad? It probably depends on a number of different assumptions. Many of those assumptions can be encapsulated in a well-considered T value.
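But the most important thing about an Apdex score is that, unlike histograms and percentiles, it can easily and inexpensively be collected and resampled. All it needs is three counters per time period, and counters from adjacent periods can simply be added together.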
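Here is a minimal sketch of why resampling is cheap, assuming each time period stores only its three bucket counts. The function names and the per-minute samples are hypothetical, and requests of exactly T are assumed to count as satisfying.

```python
def bucket_counts(load_times, t):
    """Return (satisfied, tolerated, failed) counts for threshold t in seconds."""
    satisfied = sum(1 for x in load_times if x <= t)
    tolerated = sum(1 for x in load_times if t < x <= 4 * t)
    failed = len(load_times) - satisfied - tolerated
    return (satisfied, tolerated, failed)


def merge(a, b):
    """Combine the counts of two periods by adding them bucket by bucket."""
    return tuple(x + y for x, y in zip(a, b))


def score(counts):
    """Apdex score from bucket counts alone; raw load times are not needed."""
    satisfied, tolerated, failed = counts
    return (satisfied + tolerated / 2) / (satisfied + tolerated + failed)


T = 1.0  # threshold in seconds
minute_1 = [0.5, 0.9, 1.2, 3.0]            # hypothetical samples
minute_2 = [0.7, 2.0, 5.0, 6.0, 0.8, 0.6]  # hypothetical samples

merged = merge(bucket_counts(minute_1, T), bucket_counts(minute_2, T))
print(merged, f"{score(merged):.2f}")

# The merged counts match what the raw data for both minutes would have given.
print(merged == bucket_counts(minute_1 + minute_2, T))
```

A median or a 95th percentile cannot be combined this way: the percentile for two minutes is not, in general, recoverable from the two per-minute percentiles.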
Note: The charts and data used in this post are available for browsing using a tool I developed for experimentation with different visualizations called Marlowe, available on GitHub.
The Problem with Averages, David Heinemeier Hansson
What the Mean Really Means, Brendan Gregg
What Do You Mean: Revisiting Statistics for Web Response Time Measurements, David M Ciemiewicz (2001)
Marlowe data exploration tool