265% indexing speedup with Lucene's concurrent flushing
(Click through the image to see details about what changed on dates A, B, C and D).
Previously we were around 102 GB of plain text per hour, and now it's about 270 GB/hour. That's a 265% jump! Lucene now indexes all of Wikipedia's 23.2 GB (English) export in 5 minutes and 10 seconds.
How did this happen? Concurrent flushing.
That new feature, having lived on a branch for quite some time, undergoing many fun iterations, was finally merged back to trunk about a week ago.
Before concurrent flushing, whenever IndexWriter needed to flush a new segment, it would stop all indexing threads and hijack one thread to perform the rather compute intensive flush. This was a nasty bottleneck on computers with highly concurrent hardware; flushing was inherently single threaded. I previously described the problem here.
But with concurrent flushing, each thread freely flushes its own segment even while other threads continue indexing. No more bottleneck!
Note that there are two separate jumps in the graph. The first jump, the day concurrent flushing landed (labelled as B on the graph), shows the improvement while using only 6 threads and 512 MB RAM buffer during indexing. Those settings resulted in the fastest indexing rate before concurrent flushing.
The second jump (labelled as D on the graph) happened when I increased the indexing threads to 20 and dropped the RAM buffer to 350 MB, giving the fastest indexing rate after concurrent flushing.
One nice side effect of concurrent flushing is that you can now use RAM buffers well over 2.1 GB, as long as you use multiple threads. Curiously, I found that larger RAM buffers slow down overall indexing rate. This might be because of the discontinuity when closing IndexWriter, when we must wait for all the RAM buffers to be written to disk. It would be better to measure steady state indexing rate, while indexing an effectively infinite content source, and ignoring the startup and ending transients; I suspect if I measured that instead, we'd see gains from larger RAM buffers, but this is just speculation at this point.
There were some very challenging changes required to make concurrent flushing work, especially around how IndexWriter handles buffered deletes. Simon Willnauer does a great job describing these changes here and here. Concurrency is tricky!
Remember this change only helps you if you have concurrent hardware, you use enough threads for indexing and there's no other bottleneck (for example, in the content source that provides the documents). Also, if your IO system can't keep up then it will bottleneck your CPU concurrency. The nightly benchmark runs on a computer with 12 real (24 with hyperthreading) cores and a fast (OCZ Vertex 3) solid-state disk. Finally, this feature is not yet released: it was committed to Lucene's trunk, which will eventually be released as 4.0.