Methodical Approach to Performance Troubleshooting Cloud APIs
Let's learn more about performance troubleshooting.
This article is a step-by-step guide laying out a methodology for solving performance problems with backend APIs or services. Using the right tools, we can get to the root cause of performance issues in cloud services and add monitoring around them.
Application Performance Monitoring
This article illustrates performance troubleshooting of a high-latency issue using an APM (Application Performance Monitoring) tool called New Relic. APM tools are a class of software products that help monitor and troubleshoot availability and performance issues in production applications. The product space is mature (it has existed for more than two decades) and is covered regularly in analyst reports such as Gartner's.
There are close competitors like AppDynamics and CA APM in this space, so there are plenty of choices. APM tools provide aggregated statistics such as p99 response times, throughput, slow queries, and even application graphs, among other analytics focused on understanding the performance and availability of services. They do this by instrumenting the JVM (Java or Scala) or CLR (.NET) at runtime using hooks and timers, and serving the aggregated analytics from a cloud SaaS backend (on-prem versions are available as well). The features are richer than those of a basic profiler, such as YourKit for Java. For instance, APM tools can follow transactions across processes and provide insight into which service in a graph of services is disrupting business functionality. They can also automatically pinpoint the slowest servlet calls in your application (by understanding the servlet API spec). With that introduction, we will move to the troubleshooting sequence for a recent performance issue we ran into with the Plantronics Cloud APIs, which I lead on the engineering side.
Like any other REST API out there, the APIs we designed allow query parameters and paging, so that large analytics datasets (on the order of thousands to a million records) can be queried from our backend reliably and performantly. Our analytics database is MongoDB, which handles the write throughput we require reasonably well. One of our recently launched API integrations was with Microsoft Power BI, which allows our Plantronics Manager Pro customers to pull our analytics into Microsoft's cloud BI and visualization service. During development, when we tested with small to medium-sized datasets (fewer than 500K records), the first-time load into the Power BI service (for all API calls to finish) took about 20 minutes, which was reasonable per our product requirement for one year's worth of data.
The reality in production was different! When we launched the Power BI app and tried it with our first large internal production tenant, the app kept loading for hours and could not finish. The development team was under pressure to understand the root cause and fix the latency issues in order to deliver a great customer experience.
With that background, let us jump into the performance troubleshooting methodology we followed to get to the root cause. This is, in general, how I approach any performance issue. Let's get started.
Step 1: Define Your Performance Criterion
You might think this should be obvious. Trust me: unless engineering and product are clear on what performance criterion must be met for a good customer experience, and which corner cases you are willing to cut, you are shooting at an unclear target. In our case, we needed a load time for our Power BI app of less than an hour, even for our largest tenants. Extrapolating from that criterion, we could not afford more than about 250-300 ms of latency per API request with a page size of 20. You have to be this specific. Next, some simple math: the tenant we were testing on had 8,889 pages (8,889 × 20 total elements). At 3 seconds per page, 3 s × 8,889 pages comes to about 7.4 hours. Phew! That was not going to cut it for us.
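The back-of-the-envelope math above can be checked in a few lines. The numbers are from this article; only the variable names are mine:

```java
public class LatencyBudget {
    public static void main(String[] args) {
        int pages = 8889;                // pages in the large test tenant
        int pageSize = 20;               // elements per API call
        double observedSecPerPage = 3.0; // measured latency per page
        double targetSecPerPage = 0.3;   // ~300 ms budget per page

        double observedHours = pages * observedSecPerPage / 3600.0;
        double targetMinutes = pages * targetSecPerPage / 60.0;

        System.out.printf("Total elements: %d%n", pages * pageSize);          // 177780
        System.out.printf("Observed load time: %.1f hours%n", observedHours); // ~7.4
        System.out.printf("Budgeted load time: %.1f minutes%n", targetMinutes); // ~44.4
    }
}
```

At 300 ms per page, the same tenant loads in roughly 44 minutes, which is under the one-hour requirement.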
Step 2: Isolate Top Services That Break Your Criterion
Our initial tests showed API calls taking about three seconds each when loading the Power BI dashboard.
The next step is to isolate the services that consume the majority of the latency (80 percent, or 90 percent, in your case). Here is where an APM tool comes in handy.
In our case, we knew which APIs were being called by the Power BI app, so we asked our internal QA team to document how long it takes each API to page through all the datasets of a large tenant. Based on those tests, we were able to isolate the APIs with the largest datasets.
Step 3: Triage and Find the Performance Issue
This is the first area to investigate: is the JVM process struggling with garbage collection, or starved of CPU cycles? Is the application memory bound or CPU bound with its current allocation? In our case, when we ran simulated API tests, we saw symptoms of memory exhaustion, which New Relic highlights pretty well.
Area of the Problem
We also noticed the allocated heap was 1.3 GB, which was less than what the infrastructure folks had intended. And the GC algorithm in use was MarkSweep, an older-generation collector that runs on the application threads and can pause the application. So, we increased the heap to 4 GB and switched to ConcurrentMarkSweep, which performs most of its collection work without pausing application threads. At this stage, we had found a memory allocation issue and a potential GC algorithm improvement. We took care of both and saw a drastic improvement in the response times and availability of the service.
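On a HotSpot JVM, the two changes above boil down to startup flags along these lines (a sketch only; the exact flags live in your service's startup script, and on JDK 9+ you would reach for G1 or ZGC instead, since CMS is deprecated):

```shell
# Raise the heap to 4 GB and use the (mostly) concurrent CMS collector
JAVA_OPTS="$JAVA_OPTS -Xms4g -Xmx4g -XX:+UseConcMarkSweepGC"
# Optional: log GC activity so the improvement is visible over time
JAVA_OPTS="$JAVA_OPTS -verbose:gc -XX:+PrintGCDetails"
```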
Profile deeper if the issue is in application code. Once you have isolated the services causing the majority of the bottleneck, dig into where the slowdown is really happening. New Relic allows custom instrumentation (adding methods you suspect, beyond what is instrumented by default, without restarting the application). In our case, we found a count operation that calls into the MongoDB count API for the incoming query range, and a significant amount of transaction time was being spent in that block of code. In the transaction latency breakdown from New Relic, the countForQuery method accounted for about 463 ms of the 3 seconds we had seen for the callSummary APIs. So, we had significant scope to improve, and we added a cache to hold the count instead of counting the total elements on every API paging call.

Once we added the cache (populate it the first time an API call comes in, and read from it on subsequent calls with the same URI + query params), our tests showed the improvement: the top transactions no longer included the countForQuery method. We had fixed that part of the latency! Still, our callSummary API was not doing that great. Notice that countForQuery is no longer among the top time-consuming methods.
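The count cache described above can be sketched as follows. This is illustrative only, not the actual Plantronics code: the class and method names are mine, and the expensive MongoDB count is passed in as a function so the sketch stays self-contained.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.ToLongFunction;

// Hypothetical sketch of the count cache: the first request for a given
// URI + query-parameter combination computes the count (in the real service,
// a MongoDB count query); subsequent pages of the same query reuse the
// cached value instead of re-counting.
public class CountCache {
    private final Map<String, Long> cache = new ConcurrentHashMap<>();

    public long countForQuery(String uriWithQueryParams,
                              ToLongFunction<String> expensiveCount) {
        // computeIfAbsent runs the expensive count only on a cache miss
        return cache.computeIfAbsent(uriWithQueryParams,
                expensiveCount::applyAsLong);
    }

    public static void main(String[] args) {
        CountCache cache = new CountCache();
        long n = cache.countForQuery("/api/callSummary?page=1",
                q -> 177780L); // stand-in for the real MongoDB count call
        System.out.println(n); // prints 177780
    }
}
```

A real implementation would also need an eviction or TTL policy so the cached count does not drift too far from the live data.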
The next observation was that after a specific page (in this case, the 2,054th), the API calls started throwing gateway timeouts (504s), meaning their latency exceeded two minutes, our load balancer's timeout setting. That pointed to a correlation between page number and latency: page 2,000 takes much longer than page 1.
This is when you have to look at all the infrastructure components in the path of the cloud API call's transaction: databases, message brokers, and so on. In our case, we suspected MongoDB's performance and resource allocation, and indeed the CloudWatch metrics for the MongoDB (EC2) servers showed CPU issues.
Either the CPU compute allocated to the MongoDB servers was insufficient, or we were issuing heavy or inefficient MongoDB API calls that created more load than usual.
Based on those inferences, we wondered whether indexes were the issue. But since our API calls are very open-ended in terms of the query patterns coming from Power BI, we concluded there was not much improvement we could squeeze out of additional indexes. That ruled indexes out.
We then looked at MongoDB's execution plans for the queries issued by the API calls and noticed that the SKIP stage took the longest time (more than 2 seconds). That gave us a clue! When an API call requests page 200 with a count of 20, MongoDB has to skip the preceding 199 pages of documents to reach the 200th. Great! That looked like a potential root cause, and some research online led us to others who had run into the same problem and suggested a similar workaround.
That gave us another improvement to implement: when a request comes in for page 100, we cache the ObjectId of the last element of the result set; when the request for page 101 comes in, we query for ObjectId > prevResultSetObjectId with a limit of the page size, instead of skipping. So, we went ahead and implemented this solution.
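The idea is range-based ("keyset") pagination: because ObjectIds are monotonically increasing, an indexed range query can replace the skip. In MongoDB terms the query becomes roughly `db.collection.find({ _id: { $gt: prevLastId } }).limit(pageSize)`, which walks the `_id` index instead of skipping N documents. A self-contained sketch of the two strategies, using longs as a stand-in for ObjectIds (names and numbers are illustrative):

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.LongStream;

public class KeysetPagination {
    // Skip-based paging: cost grows with the page number, because the
    // first page*size elements must be walked and discarded.
    static List<Long> pageBySkip(List<Long> sortedIds, int page, int size) {
        return sortedIds.stream().skip((long) page * size).limit(size)
                .collect(Collectors.toList());
    }

    // Keyset paging: cost is independent of how deep the page is, because
    // an indexed "> prevLastId" lookup jumps straight to the start.
    static List<Long> pageAfter(List<Long> sortedIds, long prevLastId, int size) {
        return sortedIds.stream().filter(id -> id > prevLastId).limit(size)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Long> ids = LongStream.range(0, 100).boxed()
                .collect(Collectors.toList());
        // Page 3 (zero-based) of size 20, both ways: ids 60..79
        List<Long> bySkip = pageBySkip(ids, 3, 20);
        List<Long> byKey = pageAfter(ids, 59L, 20);
        System.out.println(bySkip.equals(byKey)); // prints true
    }
}
```

In an in-memory list both run in similar time, but against MongoDB the difference is dramatic at page 2,000, since the server no longer scans and discards 40,000 documents per request.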
Step 4: Implement Fix and Regress
This is probably the easiest and most relaxed step in the sequence, but it is an important one. At this point, you have a good understanding of the root cause of the performance issue and how to reproduce it. Now, implement the solution and regression test the heck out of the code changes, observing the improvements with the APM tool. Also, add a performance regression suite that locks in the known performance characteristics of your API, so that the next line of code that changes does not break your expected performance!
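One lightweight way to lock in such a characteristic is a guard that fails a test when an operation exceeds its latency budget. A minimal sketch; the helper name and the 300 ms budget are my own illustration, and a real suite would warm up, run many iterations, and assert on percentiles against a test environment rather than a single in-process run:

```java
public class LatencyGuard {
    // Runs the operation once and fails if it exceeds the latency budget.
    static long assertUnderBudget(Runnable operation, long budgetMillis) {
        long start = System.nanoTime();
        operation.run();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        if (elapsedMs > budgetMillis) {
            throw new AssertionError("latency " + elapsedMs
                    + " ms exceeds budget " + budgetMillis + " ms");
        }
        return elapsedMs;
    }

    public static void main(String[] args) {
        // Example: guard a trivial in-memory operation with a 300 ms budget,
        // mirroring the ~300 ms per-page criterion from Step 1.
        long ms = assertUnderBudget(() -> {
            long sum = 0;
            for (int i = 0; i < 1000; i++) sum += i;
        }, 300);
        System.out.println("OK in " + ms + " ms");
    }
}
```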
Here are those happy graphs:
Hope this article helps give some structure to one of the most challenging and chaotic problems of software engineering!