Effective APM: Find and Fix the Things That Matter
While the feature lists of APM tools grow, organizations keep having performance issues—ones they’re unable to pinpoint and resolve even after months of effort. But there are some common underlying themes as to why these issues can be so hard to catch.
Jon C. Hodgson is an APM subject matter expert for Riverbed Technology who has helped hundreds of organizations around the world optimize the reliability and performance of their mission-critical applications. When he’s not obsessing about how to make things perform faster, he enjoys digging things up with his tractor at his home in Missouri.
Over the past 20 years as an application performance specialist, I’ve witnessed APM evolve dramatically from its roots in simplistic server monitoring to a discipline with impressive (but now ubiquitous) capabilities such as code instrumentation, multi-tier transaction tracing, and end-user experience monitoring. Although the feature lists of APM tools continue to grow, organizations continue to have major performance issues that they’re unable to pinpoint and resolve even after months of effort. In helping to solve these problems, I noticed common themes as to why they eluded detection and resolution for so long.
The quality of the APM data is the number one reason why performance problems go unsolved. All tools claim to collect metrics about the environment and trace end-user transactions, but the way this data is captured, stored, and displayed ultimately dictates the value that data provides in detecting the presence of an issue, or accurately identifying its root cause. Many tools are fundamentally flawed in this regard.
The number two reason is the methodology of the troubleshooter. Even in cases where high-quality data exists, if you don’t ask the right questions or look at that data the right way, you may not realize the true severity of an issue, or you may be blind to it altogether. In the worst cases you may mislead yourself into futilely chasing what I call a “Performance Phantom”—an issue that appears to be a root cause, but in actuality is a symptom of a larger issue.
Let’s consider a common case that illustrates why these matter. Businesses want to ensure that their end users are happy so they can maximize productivity, loyalty, profits, etc. To that end they will often ask for KPIs to help them determine if key parts of an application are meeting SLAs, asking questions like “What’s the response time of MyAccount.aspx?”
The answer is often provided by an APM tool in a report or business dashboard with a singular value like:
The value above is from a sample dataset I will use for the remainder of this article. That value represents the average of 10,000 calls to MyAccount.aspx over a 4-hour period. Here’s a snippet of a log showing those calls:
If you really think about it, you’ll realize how ludicrous the initial question was in the first place. A singular value will never convey the range of experience for all of those users. There are actually over 10,000 answers to the question: one for each individual call, and others for subsets of calls like user type, location, etc. If you really want to know if ALL of your users are happy with ALL of their interactions with your application, you have to consider each user interaction as individually as possible, and beware the Flaw of Averages.
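As a rough illustration of how an average can hide unhappy users, here is a minimal Python sketch. The response times are synthetic values invented for this example (they are not the article's dataset), and the 1-second threshold is an assumed SLA.

```python
import statistics

# Synthetic response times in seconds (made-up values, not the article's data):
# 900 fast calls and 100 very slow ones.
response_times = [0.2] * 900 + [8.0] * 100

mean = statistics.mean(response_times)
median = statistics.median(response_times)

# Share of calls that miss an assumed 1-second SLA.
slow_share = sum(t > 1.0 for t in response_times) / len(response_times)

print(f"mean={mean:.2f}s  median={median:.2f}s  slower than 1s: {slow_share:.0%}")
```

In this made-up case the mean sits just under the assumed SLA even though one call in ten takes 8 seconds, which is the kind of detail a single reported value cannot show.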
In this classic example, a statistician tried to cross a river that was, on average, 3 feet deep. Unfortunately, since he could not swim, the maximum value of his life became zero:
SOURCE: The Flaw of Averages: Why We Underestimate Risk in the Face of Uncertainty by Sam L. Savage, with illustrations by Jeff Danziger – flawofaverages.com - used with permission.
A common alternative to the singular value is a time series chart. Here we see the same data trended over the 4-hour period, revealing that responses were much faster at the beginning and end, with a worst-case response time of 25 seconds in the middle:
Although this 1-minute granularity chart has 240x more information than the singular answer, it still suffers from the Flaw of Averages. The same data at 15-second granularity tells a different story:
We see much more volatility in response times, with a worst case almost double what the previous chart suggested. As granularity improves, you’ll get a more realistic understanding of the experience of your end users. If you consider that SLAs may be less than a second, you’ll realize how inadequate even 15-second granular data is.
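The effect of granularity can be sketched with a few lines of Python. The data below is synthetic (a 1-second baseline with an assumed 10-second stall), and `bucket_averages` is a hypothetical helper written for this illustration, not part of any APM tool.

```python
def bucket_averages(samples, bucket_seconds):
    """Average consecutive per-second samples into fixed-size buckets."""
    averages = []
    for start in range(0, len(samples), bucket_seconds):
        chunk = samples[start:start + bucket_seconds]
        averages.append(sum(chunk) / len(chunk))
    return averages

# Synthetic per-second response times: 1 s baseline for two minutes,
# with a 10-second stall (31 s responses) starting at second 30.
per_second = [1.0] * 120
for second in range(30, 40):
    per_second[second] = 31.0

worst_1min = max(bucket_averages(per_second, 60))
worst_15s = max(bucket_averages(per_second, 15))
worst_1s = max(per_second)

print(worst_1min, worst_15s, worst_1s)  # 6.0 21.0 31.0
```

The same stall looks like a 6-second worst case at 1-minute granularity, 21 seconds at 15-second granularity, and 31 seconds when every second is kept.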
Many apps are plagued by periodic saturation of resources that only last for a second, but cause significant increases in response time during that second. Here’s an example with five 1-second spikes in a 15-minute period:
An APM tool will only catch the spikes if it coincidentally samples during the exact seconds the spikes occur in. If your tool samples every 15 seconds, you might be surprised at how low the odds are that it will catch those spikes. Statistically there’s a 71% chance it won’t see ANY of the spikes, so you wouldn’t even know this behavior was occurring:
There’s roughly a 25% chance it will catch exactly 1 of the 5 spikes:
Here’s where your jaw will drop: There is a 1 in 759,375 chance (0.0001%) that it will catch all 5 spikes!
So even at a seemingly good 15-second granularity, there’s almost no chance at all that you’d have an accurate understanding of this behavior. I often see coarse data granularity as the reason why organizations, even those with highly rated APM tools, are blind to these sorts of recurring issues. They don’t even know the problem exists, so they don’t even attempt to solve it.
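These odds can be reproduced with a simple binomial model. The sketch below assumes one instantaneous sample every 15 seconds and five 1-second spikes that occur at independent, uniformly random moments, so each spike is seen with probability 1/15. That independence is a modeling assumption made here, and `catch_probability` is a hypothetical helper name.

```python
from math import comb

def catch_probability(caught, spikes=5, interval_seconds=15):
    """Probability that exactly `caught` of `spikes` 1-second spikes are sampled.

    Assumes one sample per interval and independent, uniformly timed spikes,
    so each spike coincides with a sample with probability 1 / interval_seconds.
    """
    p = 1 / interval_seconds
    return comb(spikes, caught) * p**caught * (1 - p) ** (spikes - caught)

for caught in range(6):
    print(f"exactly {caught} of 5 spikes caught: {catch_probability(caught):.6%}")
```

Under this model the chance of seeing none of the spikes is (14/15)^5, about 71%; the chance of seeing exactly one is 5 × (1/15) × (14/15)^4, about 25%; and the chance of seeing all five is (1/15)^5, which is 1 in 759,375.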
Now let’s get back to the previous MyAccount.aspx example. I could show you how much better a 1-second sampled chart tells the story, but even that wouldn’t tell the full story. Other statistics like min/max, percentiles, standard deviation, and histograms help reveal anomalies, but they too only paint a partial picture. The best option is to not sample at all. Capture everything. All transactions, all the time, down to the method and SQL level. With the right APM tool this is possible even in production under heavy loads.
But capturing that data is only half the battle, as you need to store that data in full detail and be able to nimbly analyze hundreds of thousands of transactions at once. Your APM tool needs to leverage Big Data to make sense of all that information and tell the complete story accurately. Here’s our sample dataset as only Big Data can show it:
For 10,000 transactions you have 10,000 different answers to the initial question “What’s the response time of MyAccount.aspx?”—this is a much different story than the simple line charts suggested. But even more importantly, you have the details as to why each of those 10,000 behaved the way they did:
For each individual transaction you can see what method or SQL is causing the majority of the delay. You can see multi-tier maps for each transaction independently, so if there is a certain pathway that’s causing issues, it won’t be hidden by a single one-size-fits-none application map. You can even get call-tree details for each transaction to provide the details developers need to solve the issue.
Big Data will allow you to filter out transactions with particular characteristics, and reveal clusters of behavior masked by aggregated line charts. By filtering out all the transactions that didn’t contain exceptions, we see that there are 4 different sub-behaviors of the application:
The top 3 bands of response time are due to timeouts for 3 different dependencies: a Web Service, a Database, and the Authentication service. The bottom band is due to a catastrophic failure where the transactions failed before they even initialized, resulting in ultra-fast response times which would never be caught by sampling just the slowest transactions.
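The filtering step described above can be sketched roughly as follows. The transaction records, field names, exception names, and timings are all hypothetical and chosen only to mimic the four bands; a real APM tool would expose this data in its own format.

```python
from collections import defaultdict

# Hypothetical transaction records; every field and value here is illustrative.
transactions = [
    {"id": 1, "seconds": 0.4, "exception": None},
    {"id": 2, "seconds": 30.1, "exception": "WebServiceTimeout"},
    {"id": 3, "seconds": 15.0, "exception": "DatabaseTimeout"},
    {"id": 4, "seconds": 0.01, "exception": "InitializationFailure"},
    {"id": 5, "seconds": 30.0, "exception": "WebServiceTimeout"},
    {"id": 6, "seconds": 5.0, "exception": "AuthenticationTimeout"},
    {"id": 7, "seconds": 0.6, "exception": None},
]

def bands_by_exception(records):
    """Keep only transactions with an exception and group their response times."""
    bands = defaultdict(list)
    for record in records:
        if record["exception"] is not None:
            bands[record["exception"]].append(record["seconds"])
    return dict(bands)

for exception, times in sorted(bands_by_exception(transactions).items()):
    print(f"{exception}: n={len(times)} min={min(times)}s max={max(times)}s")
```

Note that the near-instant `InitializationFailure` record only shows up because every transaction was kept; a filter that retained just the slowest calls would have dropped it.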
Just as there isn’t a singular answer to the question “What’s the response time?” there isn’t a singular answer to “Why is it slow?”—which translates to “What are the different things we need to fix to improve performance?”
Since I’ve been using a sample dataset, I want to prove that this concept isn’t just academic. Here are some real-world examples where Big Data revealed patterns of behavior that were previously hidden by other tools:
The horizontal lines represent timeouts. The vertical lines are microbursts after stalls. The diagonal lines are client or server side queuing depending on the direction. The ramps-beneath-ramps are compound issues. You will NEVER see patterns like these in line charts. If you’ve never seen patterns like these, then you’ve never seen an accurate representation of your data.
As I mentioned earlier, even with the best data, if you ask the wrong questions you’ll get the wrong answer. It’s very common for troubleshooters to ask “Why are the slowest transactions slow?” but quite often this isn’t the reason why the application is slow overall. In our sample dataset, Big Data reveals that there isn’t a consistent reason for slowness across the slowest transactions:
This is a clear indication of the “Performance Phantoms” I referred to earlier, where some environmental issue like garbage collection or hypervisor over-commitment causes delays in whatever pieces of code happen to be running at the same time. Trying to optimize these methods will waste countless hours with little reward. You can never solve a root cause by trying to fix the symptom.
The best way to make overarching improvements to application performance is to leverage Big Data to identify the overarching reasons for delay. Here we see a consistent reason for delay in this subset of transactions:
Method C is the overall largest contributor to delays, and individual transactions confirm that consistent root cause. Focusing on this one method will yield the greatest benefit for the least effort.
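To make the difference between the two questions concrete, here is a minimal sketch with made-up timings. The method names and numbers are hypothetical; the point is only that the worst method in the slowest transaction and the largest contributor across all transactions need not be the same.

```python
from collections import Counter

# Hypothetical per-transaction timings: method name -> seconds spent in it.
transactions = [
    {"A": 0.1, "C": 0.9},
    {"B": 0.2, "C": 1.1},
    {"C": 1.0, "D": 0.1},
    {"A": 3.0, "C": 1.0},  # slowest transaction: A stalled once (e.g. a GC pause)
    {"B": 0.1, "C": 0.8},
]

def total_time_by_method(records):
    """Sum the time spent in each method across every transaction."""
    totals = Counter()
    for timings in records:
        totals.update(timings)  # adds the values per method name
    return totals

# "Why are the slowest transactions slow?"
slowest = max(transactions, key=lambda timings: sum(timings.values()))
worst_in_slowest = max(slowest, key=slowest.get)

# "What contributes the most delay overall?"
worst_overall, total_seconds = total_time_by_method(transactions).most_common(1)[0]

print(f"worst method in slowest transaction: {worst_in_slowest}")
print(f"largest overall contributor: {worst_overall} ({total_seconds:.1f}s total)")
```

In this toy data the slowest transaction points at method A, while the aggregate view points at method C, which is never the worst offender in any single dramatic outlier but adds delay to every transaction.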
I worked with a large bank that had a major performance issue in one of its key pages. Using legacy APM tools, they identified the slowest methods in the slowest transactions, but even after optimizing them, performance issues persisted. They repeated this process for months to no avail. Once they leveraged Big Data APM, in one day they were able to identify a little method that on average took 53ms, but ran so frequently it wound up being the largest contributor to delay. Optimizing that single method improved the response time of 7 million transactions per day by 95%, and reduced total processing time by 2,000 hours per day. This is not a corner case. Issues of this magnitude are very common, and hidden in plain sight, but with the right data and methodology they are easily revealed.
I challenge you to scrutinize your current tools to make sure they’re capturing the right data in the right way. If your data is blind to an issue, or misrepresents it, then you’ll fail before you even begin. Once you have the right data, I encourage you to step out of your comfort zone of just looking at averages and line charts, and harness the power that Big Data provides. Sift through the noise, identify the patterns in your application’s behavior, and learn to distinguish inconsistent symptoms from consistent root causes. Be the hero that identifies the one little thing that yields hours of improvement for millions of users.