In this post, I tell the story of how I tuned one of our own pages that was performing extremely badly under certain conditions. I will walk you through how I identified the main issue, what I considered doing, what I changes I made, what the results were, and what I might do next. Here it goes.
The Application Dashboard page was loading slowly for one of our customers. I was the one who worked to improve the performance of this page which is rendered by a Rails controller action, called ApplicationController#show. Now, the 99% case is a customer with 100 or fewer agents reporting for an application at any one time. The customer I’m looking at had thousands of agents reporting. All of a sudden, code tuned for 10 to 100 agents performed terribly. What an interesting challenge!
What’s Our Baseline To Improve On?
In New Relic, I found one of the customer’s Transaction Traces from the server-side:
Wow! 458 seconds! That’s a lot of time spent server-side. That’s so slow that even the browser has given up on us long before we’re done rendering the page. I’ve got to fix that.
Where’s the Time Going?
I noticed that 72% of the time spent was in Timeslice#find, the routine that collects our time-series data. Thankfully, New Relic lets me drill down to see what was going on in our Rails controller. When I did, I saw two significant Timeslice#find nodes in the Transaction Trace.
What’s the Code Doing?
I saw two very slow database queries here. I clicked on the ‘View Database Query’ button for both of the queries and I got access to their backtraces. Here’s one of the backtraces:
Why’s the Code Doing That?
But what were we using this data for, I wondered? What were we trying to achieve? I dug into the code and it was clear that we were querying for this data in order to modify how the page is rendered. If the application we’re monitoring is a Solr service, we render the page differently. The code discovers this by asking the following question: Of all the agents that reported for the time window the user is looking at, are any configured to instrument a Solr service? This is a very expensive query, especially when dealing with thousands of agents. So, what could I do about it?
What Options Do I Have?
I considered the options:
- I was surprised to discover that we were inadvertently querying for the same information twice. Both backtraces have the same RealAgent#ids_in_time_window call in common. I could have simply cached the result to remove the second query. The main problem with caching is that it creates bumps for the user. It’s no good if your application is very fast 99% of the time, and 1% of the time you cache-miss and the application performs poorly.
- The query is so slow because we were limiting the scope of our query to active agents in the user’s time window. While very accurate, it can be very expensive. I could have traded correctness for speed and ignored the time window completely. I noticed that the third largest culprit in the Transaction Trace, the SQL – SELECT component, was performing a related query. Once I eliminated the first and second largest culprits I needed to resolve the SQL – SELECT. The SQL – SELECT component answered the following question: Are any of the agents configured to instrument a Solr service? I wanted to eliminate having to query for this too, if I can.
- The result of the ultimate question, ‘Are you a Solr application?’ was actually fairly static. Again, I could have cached the result. Applications typically don’t oscillate between being a Solr application and a typical application. Right now we’re recomputing this information per request. Instead, we could incrementally pay this cost and precompute this as new data comes in. It’s a trade-off but it will buy us a big performance boost! This is the solution I pursued.
Make the Code Change.
I investigated, and the part of the system that collects incoming metric data from agents already maintains tags based on the metrics being reported, so we already knew whether a customer’s application is using Solr or not. I changed the code to refer to this metadata instead, easy! At this point, it was important not to gloss over the trade-off, though. Doing this means once an application indirectly declares itself as being a Solr application, it never goes back to being a non-Solr application. I accepted this was the trade-off in exchange for the performance gains.
Measure the Improvement.
The page is now performing an order of a magnitude faster. Compare our first sample of 468 seconds to the new sample of 43 seconds!
Where Do I Go From Here?
Even though I’ve dealt with the issue that our first Transaction Trace gave us, there’s more I can do. Here’s how I think I might proceed:
- The RealAgent#find_by_sql needs to be dealt with. The majority of the GC execution is directly attributable to this call. We’re using Rails 2, so we can’t make use of ActiveRecord’s pluck. So there’s opportunity here for improvement. Even better, I think we can eliminate the need to look these up from the database entirely.
- UIBaseHelper.applications_dropdown appears to have some expensive sorting logic that’s causing the CPU to burn. I should take a look.
- It appears the SQL – SELECT component is determining whether the application uses a database, much in the same way as the Solr check was being made. I suspect this can be computed ahead of time as well.
- On the Application Dashboard page we display Traced Errors and Transaction Traces in separate portions of the page, which explains why TracedError#find and TransactionTrace#find both show up here. One option I have is to defer loading and rendering these two portions of the page by requesting them via AJAX.
In my performance tuning, I avoided the temptation to speculate and made informed decisions based on real production data. Here, I started out with a performance issue under specific circumstances that didn’t reflect the normal performance characteristics of our website. Then, using New Relic, I narrowed down the problem to the server-side and pinpointed the bottleneck. After careful consideration, I decided on a trade-off that subtly changes the behavior of our site by taking advantage of pre-existing metadata in the system. This trade-off resulted in an order of magnitude performance improvement, and after I deployed the change I was able to verify the performance gain in production. I feel satisfied and armed with my deep application visibility provided by New Relic, and I went on to identify additional areas for performance tuning.