I’m an operations guy. I’ve been one for over 15 years. From the time when I was a Systems Administrator I was always intrigued by application performance and jumped at every opportunity to try and figure out a performance problem. All of that experience has taught me that there is one aspect of troubleshooting that makes the biggest difference in the most cases.
My Charts Will Save The Day
Before I jump right in with that single most important lesson learned I want to tell the story that set me on my path to learning this lesson. I was sitting at my desk one day when I got called into a P1 issue (also called Sev 1, customers were impacted by application problems) for an application that had components on some of my servers. This application had many distributed components like most of the applications at this particular company did. I knew I was prepared for this moment since I had installed OS monitoring that gave me charts on every metric I was interested in and I had a long history of these charts (daily dating back for months).
Simply put, I was confident I had the data I needed to solve the problem. So I joined the 20+ other people on the conference call, listened to hear what the problem was and what had already been done, and began digging through my mountains of graphs. Within the first 30 minutes of pouring over my never ending streams of data I realized that I had no clue where any of the data points should be for each metric at any given time. I had no reference point to decipher good data points from bad data points. “No problem!” I thought to myself. I have months of this data just waiting for me to look at and determine what’s gone wrong.
Now I don’t know if you’ve ever tried to manually compare graphs to each other but I can tell you that comparing 2 charts that represent 2 metrics on 2 different days is pretty easy. Comparing ~50 daily charts to multiple days or weeks in history is a nightmare that consumes a tremendous amount of time. This was the Hell I had resigned myself to when I made that fateful statement in my head “No problem!”.
Skip ahead a few hours. I’ve been flipping between multiple workbooks in Excel to try and visually identify where the charts are different. I’ve been doing this for hours. Click-flip, click-flip, click-flip, click-flip… My eyes are strained and my head is throbbing. I want the pain to end but I’m a performance geek that doesn’t give up. I’ve looked at so many charts by now that I can no longer remember why I was zeroing in on a particular metric in the first place. I’m starting to think my initial confidence was a bit misguided. I slow start banging my head on my desk in frustration.
From Hours To Seconds
Isn’t this one of the most commonly asked questions in any troubleshooting scenario? “What changed?” It’s also one of the toughest questions to answer in a short amount of time. If you want to resolve problems in minutes you need to know the answer to this question immediately. So that leads me to the most important lesson I ever learned about solving performance problems. I need something that will tell me exactly what has changed at any given moment in time.
I need a system that tracks my metrics, automatically baselines their normal behavior, and can tell me when these metrics have deviated from their baselines and by how much. Ideally I want this all in context of the problem that has been identified either by an alert or an end user calling in a trouble ticket (I’d rather know about the problem before a customer calls though).
Thankfully today this type of system does exist. Within AppDynamics Pro, every metric is automatically baselined and a candidate for alerting based upon deviation from that baseline. By default all business transactions are classified as slow or very slow based upon how much they deviate from their historic baselines but this is only the tip of the iceberg. The really cool feature is available after you drill down into a business transaction. Take a look at the screen grab below. This grab was taken from a single “Product Search” business transaction that was slow. Notice we are in the “Node Problems” area. I’ve requested that the software automatically find any JVM metrics that have deviated higher than their baseline during the time of this slow transaction. The charts on the right side of the screen are the resulting data set in descending order of most highly deviated to least highly deviated.
Whoa… we just answered the “What changed?” question in 30 seconds instead of manually doing hours of analysis. I wish I had this functionality years ago. It would have saved me countless hours and countless forehead bruises. We veterans of the performance wars now have a bigger gun in the battle to restore performance faster. Leave the manual analysis and correlation to the rookies and click here to start your free trial of AppDynamics Pro right now so you can test this out for yourself.