When performance and load testing, it’s good practice to analyze the data methodically. Large amounts of data can be confusing, and you might lose valuable information if your analysis isn’t systematic. So what’s the right way to go about it?
This is the last post in a three-part series of blog posts on performance engineering best practices, based on my 17 years in the industry.
In the first part, we covered the difference between Performance Engineering and Performance Reporting, why there is no replacement for human performance engineers, and three best practices: identify tier-based engineering transactions, monitor KPIs cleverly, and reduce the number of transactions you analyze.
In the second part, we went over three more best practices: wait for your test to complete before analyzing, run every test three times, and ramp up your load properly.
This time, we will go over my final best practices for you.
7. Compare Test Results to the “Perfect Graph”
Knowing what a perfectly scalable application looks like allows you to spot anomalies quickly. So study that architectural diagram or whiteboard that shows what should happen in a perfectly scalable application and compare it to your test results.
Answer these questions: What should happen? What doesn’t happen? The answer to these questions is where you need to focus your attention.
For example, as the user load increases, you should see an increase in the web server’s requests per second, a dip in the web server machine’s CPU idle, an increase in the app server’s active sessions, a decrease in free worker threads, a decrease in the app server’s OS CPU idle, a decrease in free DB thread pool connections, an increase in the DB’s queries per second, a decrease in the DB machine’s CPU idle, and so on. You get the picture! Is that what you see in your test results as well?
By using the power of visualization, you can drastically reduce investigation time by quickly spotting a condition that does not represent a scalable application.
8. Look for KPI Trends and Plateaus to Identify Bottlenecks
As resources are reused or freed (as with JVM garbage collection or thread pools), there will be dips and rises in the KPI values. Concentrate on the trends of the values and don’t get caught up in the deviations. Use your analytical eye to see the forest for the trees and determine the trend. You have already proven that each of your KPIs tracks with the increase in workload, so there are no real worries about chasing red herrings here. Just concentrate on the bigger picture: the trends.
A solid technique that usually gives me great success in identifying the very first bottleneck is to graph the minimum response times from the frontend KPIs. Increase the granularity and identify the first increase from the lowest value. The minimum response time doesn’t deviate much, because once a particular resource saturates, the floor is simply no longer achievable. It’s pretty precise. Pinpoint the elapsed time at which this behavior first occurred.
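As a sketch, that floor-lift check can be automated over exported per-interval minimums. The 10% tolerance and the sample data here are made-up illustrations, not recommended values:

```python
def first_floor_lift(elapsed, min_rt, tolerance=0.10):
    """Return the elapsed time at which the per-interval minimum response
    time first rises more than `tolerance` (fractional) above its floor,
    or None if it never lifts off."""
    floor = min(min_rt)
    for t, v in zip(elapsed, min_rt):
        if v > floor * (1 + tolerance):
            return t
    return None


# Hypothetical samples: elapsed seconds and minimum response time (s).
elapsed = [0, 60, 120, 180, 240]
min_rt = [0.20, 0.21, 0.20, 0.35, 0.50]
print(first_floor_lift(elapsed, min_rt))  # → 180
```

Here the floor of 0.20s first becomes unachievable at 180 seconds into the test, which is the elapsed time to carry into the plateau analysis below.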
Know this: TPS, or hits per second, plateaus as the deployment approaches the first bottleneck. Response times degrade, or increase, after the bottleneck. Error rates are cascading symptoms.
Now your job is simply to identify the first-occurring plateau in the monitored hit-rate KPIs that precedes the minimum response time degradation. (Here’s why I was adamant about collecting three monitored metrics per sustained load: a single data point shows up as a peak in a graph; three data points give you a plateau. Plateaus are gold mines.) Use the elapsed time of the load test. The first-occurring plateau in a hit rate indicates a limitation in throughput.
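A sketch of plateau detection over exported hit-rate samples might look like this; the window width of three matches the three metrics per sustained load, while the 5% tolerance and the data are hypothetical:

```python
def first_plateau(elapsed, hit_rate, width=3, rel_tol=0.05):
    """Return the elapsed time where the first plateau begins: the first
    run of `width` consecutive samples whose spread stays within
    `rel_tol` of the window mean. Returns None if nothing flattens."""
    for i in range(len(hit_rate) - width + 1):
        window = hit_rate[i:i + width]
        mean = sum(window) / width
        if mean and (max(window) - min(window)) / mean <= rel_tol:
            return elapsed[i]
    return None


# Hypothetical ramp: hits/sec climbs steadily, then flattens near 300.
elapsed = [0, 60, 120, 180, 240, 300, 360, 420]
hits = [100, 150, 200, 250, 300, 305, 300, 302]
print(first_plateau(elapsed, hits))  # → 240
```

The hit rate stops climbing at 240 seconds despite the still-increasing user load, which is the throughput limitation to chase.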
Once the server with the limitation is located, graph out all of that server’s free resources. A free resource doesn’t need to be absolutely depleted in order to affect performance.
The first plateau will indicate a root cause: either a soft or a hard limitation. Soft limitations are configurable (e.g., maximum thread pool size). Hard limitations are hardware (e.g., CPU). I must say, the vast majority of the bottlenecks I have uncovered are SOFT limitations, and no amount of hardware will fix a soft limitation. Tuning can increase scalability, but it’s a balancing act: you want to tune to increase throughput without saturating the hardware. Tuning is both an art and a science.
It’s the alleviation of soft limitations that allows applications to efficiently scale UP and OUT in cloud deployments, saving companies significant operating expenses. I recommend load testing for peak load conditions and noting which resources have spun up to accommodate the workload. Then, dedicate those resources to your deployment. Pay for them now, and only use the elastic cloud for surges beyond the anticipated peak load.
Again, here is why it is important to isolate the first-occurring KPI plateau. Don’t stop at the first plateau you stumble upon and declare victory, because it could be just a symptom, not a root cause. A premature conclusion will cost you hours of wasted time in configuration changes and retesting, only to see the degradation happen in the same time frame, meaning the load is encountering the same bottleneck.
Important: if you have two or more KPIs that plateau at nearly the same time, you can usually see which plateau occurred first by overlaying the KPI graphs for a clearer visualization. If not, design a new load test that slows the ramp as it approaches the same peak capacity load. Slowing it down will allow the collection of more data points, and this will make the results clearer.
9. Don’t Lose Sight of Engineering Transactions
Remember those engineering transaction scripts from part 1? These can also be gold mines in uncovering scalability issues. Sometimes, if I don’t have backend monitoring, I’ll just rely on the data from these scripts alone. But together with monitoring, they tell a very accurate performance story.
These engineering transactions execute at a sampling rate, so graph them in correlation with the user load. I usually name my transactions according to the tier they reach: for example, WEB, APP, MESSAGING, DB.
Use your analysis skills to see which engineering transaction starts to degrade first. Both the hit rate and the response times will tell you where you need to concentrate your efforts.
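Putting this together, here is a sketch that reports which tier’s engineering transaction lifts off its floor first. The tier names follow the naming convention above; the tolerance and response-time samples are hypothetical:

```python
def first_degrading_tier(elapsed, tier_rts, tolerance=0.15):
    """Return (tier, elapsed_time) for the tier-based engineering
    transaction whose response time first rises above its floor by
    more than `tolerance`; tier_rts maps tier name -> samples."""
    first = None
    for tier, samples in tier_rts.items():
        floor = min(samples)
        for t, v in zip(elapsed, samples):
            if v > floor * (1 + tolerance):
                if first is None or t < first[1]:
                    first = (tier, t)
                break
    return first


# Hypothetical per-tier sampled response times (seconds).
elapsed = [0, 60, 120, 180, 240]
tiers = {
    "WEB": [0.05, 0.05, 0.05, 0.06, 0.09],
    "APP": [0.30, 0.31, 0.30, 0.45, 0.70],
    "DB":  [0.10, 0.10, 0.16, 0.25, 0.40],
}
print(first_degrading_tier(elapsed, tiers))  # → ('DB', 120)
```

In this made-up run, the DB-reaching transaction degrades first, at 120 seconds, so that is where the investigation effort should concentrate.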
10. Increase Granularity for Better Clarity
Granularity is vital at both the KPI monitoring interval and the visualization analysis. Often, if a test runs for a long time, the load tool will graph the results at a coarser interval. In effect, the tool is aggregating data and presenting only averaged samples in its graphs.
Aggregated data is not optimal for analysis. I recommend you analyze the raw, absolute data in order to understand the scalability limitations. Coarser granularity does make the graphs look cleaner and is good for reporting to upper management, but for us performance engineers, those cleaner graphs actually skew the results.
To make this clearer: a real plateau consists of multiple data points, but at the coarser resolution those plateaus get disguised as peaks. A peak hides the essential pattern, which is a plateau.
Simply changing the data resolution (e.g., from 256 seconds to 15 seconds) will drastically change the graph’s appearance. Presto, peaks become plateaus. Yes, the graph will look a lot busier, but you aren’t interested in the noise; you need to squint and see the trend.
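A sketch of what that aggregation does to a plateau, using a toy downsampler. The bucket sizes borrow the 256-second and 15-second intervals from the example above; the hit-rate data is made up:

```python
def downsample(elapsed, values, bucket):
    """Average raw samples into `bucket`-second buckets, mimicking the
    aggregation a load tool applies when graphing a long test run."""
    buckets = {}
    for t, v in zip(elapsed, values):
        buckets.setdefault(t // bucket, []).append(v)
    return [(b * bucket, sum(vs) / len(vs)) for b, vs in sorted(buckets.items())]


# Hypothetical 15-second hit-rate samples: a climb, a four-point plateau
# near 300, then a drop as errors cascade.
elapsed = list(range(0, 180, 15))            # 12 samples over 3 minutes
hits = [100, 150, 200, 250, 300, 301, 299, 300, 250, 200, 150, 100]

print(len(downsample(elapsed, hits, 15)))    # 12 points: plateau visible
print(len(downsample(elapsed, hits, 256)))   # 1 point: plateau erased
```

At 15-second buckets the four-point plateau survives; at a 256-second bucket the whole run collapses into a single averaged point, exactly the peak-instead-of-plateau trap described above.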
If the tool can’t bring the interval down to what you need, export the raw data and create your own graph. Yes, this is a manual process, but would you rather spend your precious time chasing red herrings? I didn’t think so.
Hint: run longer tests. Everyone is in a hurry to do a day’s work in an hour, but don’t be. Make the analysis easier by slowing the ramp and collecting more KPI data points.
The most important result from any performance project is isolating and exposing the resource that limits scalability. Your job is not done until you have achieved this goal. Even if the application currently scales to the target workload, identify the next bottleneck and put it on the radar, even if there is no current need to eliminate it. This practice will save valuable time in the future as the workload increases.
Lastly, performance testing is an iterative process. With every new build or environment change, a performance test is warranted, so test early and often. In this series, I mostly described capacity planning. But even if you are a seasoned performance engineer and you suspect a root-cause bottleneck behind high response times in production, you still need a load tool and a performance test harness to baseline and prove that your tuning solves the scalability issue before changing production configurations.
Whether you are load testing every build, a new application deployment, a new feature, new infrastructure, or new architecture, absolutely any change introduces risk. Therefore, it requires methodical performance testing to mitigate that risk.