If you missed Part I, read it here!
Beginning the Exploration
As I explained previously, implicit failures often lead to false starts. In my case, they included (just a partial list):
- Inspecting Linux metrics collected by SAR, such as CPU, disk, network, paging. Nothing jumped out except for high context switching during experiments.
- Inspecting HDFS logs for HDFS I/O write latency and HDFS I/O load distribution across all nodes. (As Cloudera Search is built on top of HDFS, any I/O performance issue in HDFS can impact Solr performance.)
- Observing the GC. During experiments a small GC spike was observed; apparently a large amount of memory recycled during some short period of time. However, GC overhead and pause time were low, so GC was unlikely to be the cause of the bottleneck.
Despite these false starts, I still had a feeling that some piece of the Solr indexing code path needed improvement. The little GC spike stood out; it was true that the memory unhelpfully recycling wasted some CPU, but that alone should not result in a performance problem because there were still plenty of CPU cycles available when Solr hit the bottleneck. Could this memory recycling be causing problems other than burning some CPU cycles, maybe in combination with something else we didn’t know yet?
I decided to explore memory recycling more deeply. In particular, memory allocation is interesting because that’s where most CPU cycles are spent. The JVM exposes a few metrics about memory usage through JMX, which is tracked by Cloudera Manager. The closest metric is JVM heap usage; tracking heap usage can provide a sense of how memory is allocated. However, in this case, memory was recycled at a very high rate so the high amount of memory released canceled out the memory allocated, thus implying roughly stable numbers during the memory recycle period.
Thus, the team decided to define a new metric that would better quantify the rate of memory allocation. We call it the Solr memory allocation estimation:
Solr memory allocation est. = sum of memory allocated in every second
Unlike JVM heap usage, this metric doesn’t take any memory released into account. It’s calculated based on the JVM GC log, and although it’s just an estimate (the log doesn’t provide all the data required to calculate the metric precisely), the result is pretty good in practice. The metric is implemented via a custom ETL pipeline built on top of a data processing API used internally by the Cloudera Search team; after data processing, the metric is visualized in a custom Grafana dashboard. (Because off-the-shelf performance tools or logs often don’t give us the exact information we need, we’ve built and routinely use a suite of custom-built ones.)
The following is a visualization of Solr memory allocation (in KB/second) observed during testing. As you can see, during a short period of time, Solr allocated as much as 9GB of memory per second. And from the GC log, we knew that most memory allocated was released almost immediately.
The second observation from the chart is how dramatically memory allocation can change in a short period of time. That’s an example of the dynamic nature of non-fatal failures.
What could cause such a huge memory allocation spike? The initial hypothesis was that the Apache Lucene library inside Solr was doing a segment merge. However, it was unclear how a segment merge could result in such massive memory allocation.
Although this direction turned out to be a dead end, the observation was interesting. I felt there could be a better chance for a breakthrough if we did more observations from different angles.
The visualization of Solr memory recycling not only told us how much memory was recycled, but also when it happened. It became clear that if we used the timing information to narrow the profiler data obtained previously, we might get a breakthrough.
Indeed, we did: after further testing, we learned that during the short period of time when Solr recycled that huge amount of memory, most of it went to Solr’s Block Cache.
Cloudera Search uses HDFS for more robust, scalable index file storage than local disk, and Block Cache is used to cache HDFS blocks when accessing indexes to improve performance. (It provides functionality similar to the Linux filesystem cache for HDFS.) In a typical setting, Block Cache uses off-heap memory to store the cached blocks. In JVM heap, it stores only metadata used to track the blocks stored in off-heap memory.
Two questions come out of this observation:
- Why was Block Cache recycling such a huge amount of memory at some seemingly random point? Block Cache only stores metadata on heap. Therefore, one would expect a small amount of heap memory allocation events for Block Cache.
- Was memory recycling related to the performance bottleneck?
While a Solr committer was trying to figure out Question 1, I focused on Question 2. Looking at some early analysis that implied lock contention in indexing, another hypothesis came to me: that the memory recycling might be happening inside the critical section of a lock. The exact combination of lock contention and CPU cycles wasted on memory recycling could be slowing down Solr significantly.
If that were true, the stack of these memory-allocation events would be inside one of the lock critical sections. To verify, we needed to collect the stack of each suspicious memory allocation event, aggregate the stacks, and compare the source code to see if it matched any lock critical section. The answer was yes; the profiler revealed that heavy lock contention in Solr NRT indexing caused by excessive Block Cache-related memory allocation in a critical section (see profiler screenshot below)!
Solution & Results
Now that the root cause of the lock contention was clear, it was time to collaborate with the Solr committers on the Cloudera Search team for the solution. Soon, Mark Miller proposed the idea that there is no need to use Block Cache to cache data used only once or if data is used just for merging; preventing Block Cache usage in those situations would significantly reduce Block Cache-related memory allocations. Fortunately, Lucene is designed with performance in mind and provides a hint indicating when data is read once or used for merging. This work resulted in SOLR-8856.
In addition, other issues, such as SOLR-8857 and LUCENE-7115, were uncovered and fixed during this analysis. Some Linux and Tomcat network tuning suggestions were also communicated to Cloudera Support engineers and will be published in a future release of the Cloudera Search user guide.
As shown below, with these changes in place, a ~50% gain in throughput and ~50% drop in latency is now observed in our nightly performance testing. Although those results don’t include the gain from cascading effects because a system under test is carefully isolated in the performance lab, good isolation is also rare in real production systems. Therefore, customers are likely to observe even more performance improvement in their production systems due to a reduction in cascading effects.
We’ve also found that fluctuations in throughput, latency, and unevenness in load distribution are also now greatly reduced; system reliability is also improved because a weak point has been removed. So far, customer experiences have been consistent with these results.
I hope that this post serves as a good example of some of the nuances involved in performance analysis, including:
- The difficulties in diagnosing implicit failures
- Analysis of a non-fatal failure that is highly transitional
- How the use of scientific method, off-the-shelf tools, and custom-built tools are complementary
- How false starts can spark the imagination and lead to new productive directions
- The value of persistence, especially in the face of frustration
- How collaboration with upstream committers is necessary for permanent results
Furthermore, at Cloudera, the efforts to improve Solr performance never end. Although this particular issue was resolved, it has revealed more opportunities for optimization that have a place on the Cloudera roadmap. We’ll describe some of those in future posts as the work is completed.