Lucene's indexing is fast!
Wikipedia periodically exports all of the content on their site, providing a nice corpus for performance testing. I downloaded their most recent English XML export:
it uncompresses to a healthy 21 GB of plain text! Then I fully indexed
this with Lucene's current trunk (to be 4.0): it took 13 minutes and 9
seconds, or 95.8 GB/hour -- not bad!
Here are the details: I first pre-process the XML file into a single-line file, where each doc's title, date, and body are written as one line, and then index from that file, so that I measure "pure" indexing cost (a sketch of that conversion follows the alg below). Note that a real app would likely have a higher document-creation cost here, perhaps having to pull documents from a remote database or from separate files, or run filters to extract text from PDFs or MS Office docs. I use Lucene's contrib/benchmark package to do the indexing; here's the alg I used:
    analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer

    content.source = org.apache.lucene.benchmark.byTask.feeds.LineDocSource
    docs.file = /lucene/enwiki-20100904-pages-articles.txt

    doc.stored = true
    doc.term.vector = false
    doc.tokenized = false
    doc.body.stored = false
    doc.body.tokenized = true

    log.step.AddDoc=10000

    directory=FSDirectory
    compound=false
    ram.flush.mb = 256

    work.dir=/lucene/indices/enwiki

    content.source.forever = false

    CreateIndex

    { "BuildIndex"
      [ { "AddDocs" AddDoc > : * ] : 6
      - CloseIndex
    }

    RepSumByPrefRound BuildIndex
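An alg like this is driven by contrib/benchmark's Benchmark class; one way to run it (the exact invocation may vary by version) is from contrib/benchmark with ant run-task -Dtask.alg=<your .alg file>. As for the pre-processing step, contrib/benchmark can also produce the single-line file itself (EnwikiContentSource feeding WriteLineDocTask). Below is a rough standalone sketch of the same conversion, purely illustrative -- the class name, the tab-separated title/date/body layout, and the StAX parsing are choices made for the sketch, not the exact tool used:

    import java.io.*;
    import javax.xml.stream.*;

    public class WikiToLineFile {
      public static void main(String[] args) throws Exception {
        XMLStreamReader in = XMLInputFactory.newInstance().createXMLStreamReader(
            new BufferedInputStream(new FileInputStream(args[0])));
        Writer out = new BufferedWriter(new OutputStreamWriter(
            new FileOutputStream(args[1]), "UTF-8"));
        StringBuilder title = new StringBuilder();
        StringBuilder date = new StringBuilder();
        StringBuilder body = new StringBuilder();
        StringBuilder cur = null;  // buffer for the element being read, if any
        while (in.hasNext()) {
          switch (in.next()) {
            case XMLStreamConstants.START_ELEMENT:
              String tag = in.getLocalName();
              if (tag.equals("page")) {  // reset per-document state
                title.setLength(0); date.setLength(0); body.setLength(0);
              }
              cur = tag.equals("title") ? title
                  : tag.equals("timestamp") ? date
                  : tag.equals("text") ? body : null;
              break;
            case XMLStreamConstants.CHARACTERS:
              if (cur != null) cur.append(in.getText());
              break;
            case XMLStreamConstants.END_ELEMENT:
              if (in.getLocalName().equals("page") && title.length() > 0) {
                // tabs/newlines in the body would break the one-line-per-doc format
                out.write(title + "\t" + date + "\t"
                    + body.toString().replaceAll("[\\t\\n\\r]+", " ") + "\n");
              }
              cur = null;
              break;
          }
        }
        out.close();
      }
    }

This only has to run once per dump, e.g. java WikiToLineFile enwiki-20100904-pages-articles.xml enwiki-20100904-pages-articles.txt.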
There is no field truncation taking place, since this is now disabled by default -- every token in every Wikipedia article is being indexed. I tokenize the body field but don't store it, and I store the title and date fields but don't tokenize them. I use StandardAnalyzer, and I include the time to close the index, which means IndexWriter waits for any running background merges to complete. The index has only 4 fields -- title, date, body, and docid.
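In plain Lucene code (3.x-style API; trunk's API was in flux at the time), that field setup amounts to roughly the sketch below. Treat it as an interpretation, not the benchmark's internals -- in particular, NOT_ANALYZED for title and date is a reading of doc.tokenized = false above:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    class WikiDocFactory {
      // title/date/docid: stored, not tokenized; body: tokenized, not stored
      static Document makeDoc(String docid, String title, String date, String body) {
        Document doc = new Document();
        doc.add(new Field("docid", docid, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("title", title, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("date", date, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("body", body, Field.Store.NO, Field.Index.ANALYZED));
        return doc;
      }
    }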
I've done a few things to speed up the indexing:
- Increase IndexWriter's RAM buffer from the default 16 MB to 256 MB
- Run with 6 threads
- Disable compound file format
- Reuse document/field instances (contrib/benchmark does this by default)
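Concretely, the first three items look something like the following against the 3.x-style API -- a minimal sketch, assuming the paths from the alg above and the tab-separated line layout from the conversion sketch. contrib/benchmark does all of this for you, including the per-thread Document/Field reuse that this sketch omits:

    import java.io.*;
    import java.util.concurrent.*;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class IndexLineFile {
      public static void main(String[] args) throws Exception {
        final IndexWriter writer = new IndexWriter(
            FSDirectory.open(new File("/lucene/indices/enwiki")),
            new StandardAnalyzer(Version.LUCENE_30),
            IndexWriter.MaxFieldLength.UNLIMITED);  // no field truncation
        writer.setRAMBufferSizeMB(256);             // up from the 16 MB default
        writer.setUseCompoundFile(false);           // skip the compound-file repack

        // 6 indexing threads sharing one IndexWriter (addDocument is thread-safe)
        final BlockingQueue<String> queue = new LinkedBlockingQueue<String>(1000);
        ExecutorService pool = Executors.newFixedThreadPool(6);
        for (int i = 0; i < 6; i++) {
          pool.execute(new Runnable() {
            public void run() {
              try {
                String line;
                while (!(line = queue.take()).isEmpty()) {  // empty line = stop
                  writer.addDocument(lineToDoc(line));
                }
              } catch (Exception e) {
                throw new RuntimeException(e);
              }
            }
          });
        }

        BufferedReader in = new BufferedReader(new InputStreamReader(
            new FileInputStream("/lucene/enwiki-20100904-pages-articles.txt"), "UTF-8"));
        for (String line = in.readLine(); line != null; line = in.readLine()) {
          queue.put(line);
        }
        in.close();
        for (int i = 0; i < 6; i++) {
          queue.put("");  // one poison pill per thread
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
        writer.close();  // waits for running background merges to finish
      }

      // parse "title TAB date TAB body" into the fields described above
      static Document lineToDoc(String line) {
        String[] p = line.split("\t", 3);
        Document doc = new Document();
        doc.add(new Field("title", p[0], Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("date", p[1], Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("body", p[2], Field.Store.NO, Field.Index.ANALYZED));
        return doc;
      }
    }

The blocking queue simply decouples file reading from analysis; one shared IndexWriter is fine because its addDocument method is designed to be called concurrently.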
Lucene's wiki describes additional steps you can take to speed up indexing.
Both the source lines file and the index are on an Intel X25-M SSD, and I'm running on a modern machine with dual Xeon X5680s, overclocked to 4.0 GHz, 12 GB RAM, and Fedora Linux. Java is 64-bit 1.6.0_21-b06, and I run with java -server -Xmx2g -Xms2g. I could certainly give it more RAM, but it's not really needed. The resulting index is 6.9 GB.
Out of curiosity, I made a small change to contrib/benchmark to print the ingest rate over time, measured over a rolling 100-second window (the resulting chart appears in the original article).
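The change itself was just rate-over-a-window bookkeeping in contrib/benchmark's reporting; here is a minimal sketch of the idea (hypothetical class, not the actual patch):

    import java.util.concurrent.atomic.AtomicLong;

    // Indexing threads call docAdded(); a reporter thread calls report()
    // every 100 seconds to print the ingest rate over that window.
    class IngestRateMeter {
      private final AtomicLong count = new AtomicLong();
      private long lastNS = System.nanoTime();
      private long lastCount = 0;

      void docAdded() {
        count.incrementAndGet();
      }

      synchronized void report() {
        long nowNS = System.nanoTime();
        long nowCount = count.get();
        double sec = (nowNS - lastNS) / 1e9;
        System.out.printf("%.1f docs/sec over last %.0f sec%n",
            (nowCount - lastCount) / sec, sec);
        lastNS = nowNS;
        lastCount = nowCount;
      }
    }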
Note that in that chart, a large part (slightly over half!) of the time the ingest rate is 0; this is not good! This happens because the flushing process, which writes a new segment when the RAM buffer is full, is single-threaded and blocks all indexing while it's running. This is a known issue, and is actively being addressed under LUCENE-2324.
Flushing is CPU intensive -- decoding and re-encoding the great many vInts is costly. Computers usually have big write caches these days, so IO shouldn't be the bottleneck. With LUCENE-2324, each indexing thread state will flush its own segment, privately, which will let us make full use of concurrency across both the CPUs and the IO system. Once this is fixed, Lucene should be able to fully saturate the hardware, i.e., whichever of CPU or IO is the bottleneck in your context will gate your ingest rate. Then maybe we can double this already fast ingest rate!
Published at DZone with permission of Michael McCandless, DZone MVB. See the original article here.