Putting an Apache Lucene Index in RAM with Zing JVM
The Performance Zone is brought to you in partnership with New Relic. Quickly learn how to use Docker and containers in general to create packaged images for easy management, testing, and deployment of software.
RAM has become very affordable recently, so for high-traffic sites the performance gains from holding the entire index in RAM should quickly pay for the up-front hardware cost.
The obvious approach is to load the index into Lucene's RAMDirectory, right?
Unfortunately, this class is known to put a heavy load on the garbage collector (GC): each file is naively held as a List of byte fragments (there are open Jira issues to address this but they haven't been committed yet). It also has unnecessary synchronization. If the application is updating the index (not just searching), another challenge is how to persist ongoing changes from RAMDirectory back to disk. Startup is much slower as the index must first be loaded into RAM. Given these problems, Lucene developers generally recommend using RAMDirectory only for small indices or for testing purposes, and otherwise trusting the operating system to manage RAM by using MMapDirectory (see Uwe's excellent post for more details).
While there are open issues to improve RAMDirectory (LUCENE-4123 and LUCENE-3659), they haven't been committed and many users simply use RAMDirectory anyway.
Recently I heard about the Zing JVM, from Azul, which provides a pauseless garbage collector even for very large heaps. In theory the high GC load of RAMDirectory should not be a problem for Zing. Let's test it! But first, a quick digression on the importance of measuring search response time of all requests.
Search response time percentiles
Normally our performance testing scripts (luceneutil) measure the average response time, discarding outliers. We do this because we are typically testing an algorithmic change and want to see the effect of that change, ignoring confounding effects due to unrelated pauses from the the OS, IO systems, or GC, etc.
But for a real search application what matters is the total response time of every search request. Search is a fundamentally interactive process: the user sits and waits for the results to come back and then iterates from there. If even 1% of the searches take too long to respond that's a serious problem! Users are impatient and will quickly move on to your competitors.
So I made some improvements to luceneutil, including separating out a load testing client (sendTasks.py) that records the response time of all queries, as well as scripts (loadGraph.py and responseTimeGraph.py) to generate the resulting response-time graphs, and a top-level responseTimeTests.py to run a series of tests at increasing loads (queries/sec), automatically stopping once the load is clearly beyond the server's total capacity. As a nice side effect, this also determines the true capacity (max throughput) of the server rather than requiring an approximate measure by extrapolating from the average search latency.
Queries are sent according to a Poisson distribution, to better model the arrival times of real searches, and the client is thread-less so that if you are testing at 200 queries/sec and the server suddenly pauses for 5 seconds then there will be 1000 queries queued up once it wakes up again (this fixes an unfortunately common bug in load testers that dedicate one thread per simulated client).
The client can run (via password-less ssh) on a separate machine; this is important because if the server machine itself (not just the JVM) is experiencing system-wide pauses (e.g. due to heavy swapping) it can cause pauses in the load testing client which will skew the results. Ideally the client runs on an otherwise idle machine and experiences no pauses. The client even disables Python's cyclic garbage collector to further reduce the chance of pauses.
To test Zing, I first indexed the full Wikipedia English database (as of 5/2/2012), totalling 28.78 GB plain text across 33.3 M 1 KB sized documents, including stored fields and term vectors, so I could highlight the hits using FastVectorHighlighter. The resulting index was 78 GB. For each test, the server loads the entire index into RAMDirectory and then the client sends the top 500 hardest (highest document frequency) TermQuerys, including stop words, to the server. At the same time, the server re-indexes documents (updateDocument) at the rate of 100/sec (~100 KB/sec), and reopens the searcher once per second.
Each test ran for an hour, discarding results for the first 5 minutes to allow for warmup. Max heap is 140 GB (-Xmx 140G). I also tested MMapDirectory, with max heap of 4 GB, as a baseline. The machine has 32 cores (64 with hyper-threading) and 512 GB of RAM, and the server ran with 40 threads.
I tested at varying load rates (queries/sec), from 50 (very easy) to 275 (too hard), and then plotted the resulting response time percentiles for different configurations. The default Oracle GC (Parallel) was clearly horribly slow (10s of seconds collection time) so I didn't include it. The experimental garbage first (G1) collector was even slower starting up (took 6 hours to load the index into RAMDirectory, vs 900 seconds for Concurrent Mark/Sweep (CMS)), and then the queries hit > 100 second latencies, so I also left it out (this was surprising as G1 is targeted towards large heaps). The three configurations I did test were CMS at its defaults settings, with MMapDirectory as the baseline, and both CMS and Zing with RAMDirectory.
At the lightest load (50 QPS), Zing does a good job maintaining low worst-case response time, while CMS shows long worst case response times, even with MMapDirectory:
To see the net capacity of each configuration, I plotted the 99% response time, across different load rates:
From this it's clear that the peak throughput for CMS + MMap was somewhere between 100 and 150 queries/sec, while the RAMDirectory based indices were somewhere between 225 and 250 queries/second. This is an impressive performance gain! It's also interesting because in separately testing RAMDirectory vs MMapDirectory I usually only see minor gains when measuring average query latency.
Plotting the same graph, without CMS + MMapDirectory and removing the 275 queries/second point (since it's over-capacity):
Zing remains incredibly flat at the 99% percentile, while CMS has high response times already at 100 QPS. At 225 queries/sec load, the highest near-capacity rate, for just CMS and Zing on RAMDirectory:
The pause times for CMS are worse than they were at 50 QPS: already at the 95% percentile the response times are too slow (4479 milli-seconds).
It's clear from these tests that Zing really has a very low-pause garbage collector, even at high loads, while managing a 140 GB max heap with 78 GB Lucene index loaded into RAMDirectory. Furthermore, applications can expect a substantial increase in max throughput (around 2X faster in this case) and need not fear using RAMDirectory even for large indices, if they can run under Zing.
Note that Azul has just made the Zing JVM freely available to open-source developers, so now we all can run our own experiments, add Zing into the JVM rotation for our builds, etc.
Next I will test the new DirectPostingsFormat which holds all postings in RAM in simple arrays (no compression) for fast search performance. It requires even more RAM than this test, but gives even faster search performance!