Apache Solr Near Real Time (NRT) Search allows Solr users to search documents indexed just seconds ago. It’s a critical feature in many real-time analytics applications. As Solr indexes more and more documents in near
However, recently the Cloudera Search team found that Solr NRT indexing throughput often hit a bottleneck even when there are plenty of CPU, disk, and network resources available. Latency was average, in the hundreds of milliseconds range. Considering that Solr NRT indexing is a mainly machine-to-machine operation, without a human waiting for indexing to complete, that latency range was actually fairly good.
Furthermore, some customers reported other issues under heavy Solr NRT indexing workloads, such as connection resets, that could be “cascading” performance effects as a result of the throughput bottleneck. This behavior changed little even with different indexing batch sizes, numbers of indexing threads, or numbers of shards and replicas involved.
In this series, I’ll describe how we identified the cause of this bottleneck via scientific method for performance analysis and custom tools, designed a permanent fix in partnership with Solr committers, and the lessons learned for doing performance analysis overall.
Problem Statement and Initial Approach
In addition to the throughput bottleneck, in some quick experiments by the Cloudera Search team in response to customer reports, Solr NRT showed seemingly random fluctuation in indexing throughput and latency under stable load (see below)—whereas when the load was not evenly distributed across all nodes, a sharding algorithm worked just as well. Based on this testing, it appeared that the Solr node with higher load was more likely to break before the system reached its limit.
Based on this testing, lock contention, which usually results in a performance bottleneck and underutilized resources, was our first “suspect.” We knew that using a commercial Java profiler, such as Yourkit, JProfiler and Java Flight Recorder, would help easily identify locks and determine how much time threads spend waiting on them. Meanwhile, the team had built custom infrastructure that allows one to run experiments with a profiler attached via a single command-line parameter.
In my own testing, the profiler data indeed revealed some contention particularly related to
HdfsUpdateLog locks, leading to long thread wait time. Although promisingly, this result corresponded somewhat to the description in SOLR-6820, nothing actionable resulted from the experiment.
Next, with some help from Solr committers, I tried to find any code blocks that could be optimized to minimize the problem. Unfortunately, this path was also a dead end due to the complexity of the code around these locks.
Another typical approach I took was to scan all logs for any explicit errors or warning messages. Unfortunately, any errors or warnings found were either irrelevant or likely cascading effects.
It was beginning to look like we faced a case of implicit failure, which is a failure without explicit data points that suggest its cause. I was possibly on the wrong path.
Performance Analysis in Complex Systems
Although these initial results were frustrating, performance analysis typically begins this way because the way forward is often unclear.
At this point, it’s helpful to review some fundamentals:
- As many performance issues are implicit failures, which can’t be associated with an explicit data point such as an error or warning message, it can be hard to pick the right direction in which to start. In practice, engineers often have to make several “false starts.”
- Many performance issues are non-fatal failures that have no single static state in which the failure is represented. Thus, engineers have to analyze the system dynamically, while it’s running.
- When cascading effects are present—meaning a non-performant component causes other components to work in a non-performant way—they can lead to dead ends. Furthermore, identifying root causes from cascading effects is often not trivial; it requires engineers to untangle complex relationships across components.
- Multiple performance issues can be present in a running system, and it can be difficult to prioritize them. (“The real task isn’t finding an issue, it’s identifying the issue or issues that matter the most,” says Brendan Gregg in his book Systems Performance.) Therefore, in addition to exploration, engineers have to evaluate the impact of every issue found and the potential cost of the fix to decide if it is the issue “that matters.”
To summarize, as DTrace co-inventor (and Joyent CTO) Bryan Cantrill has pointed out, “Problems that are both implicit and non-fatal represent the most time-consuming, most difficult problems to debug because the system must be understood against its will.”
In addition to these general performance challenges, lock-contention optimization requires a thorough understanding of the code involved and close collaboration with the upstream community to design and deploy a permanent fix.
So, I had my work cut out for me. It was time to explore some other options.
Stay tuned for Part II, wherein Justin begins the exploration, gets results, and develops his conclusion.