
Lucene Gets Concurrent Deletes and Updates!




Long ago, Lucene could only use a single thread to write new segments to disk. The actual indexing of documents, which is the costly process of inverting incoming documents into in-memory segment data structures, could run with multiple threads, but back then, the process of writing those in-memory indices to Lucene segments was single threaded. 

We fixed that more than six years ago now, yielding big indexing throughput gains on concurrent hardware.

Today, hardware has only become even more concurrent, and we've finally done the same thing for processing deleted documents and updating doc values.

This change, in time for Lucene's next major release (7.0), shows a 53% indexing throughput speedup when updating whole documents, and a 7.4X-8.6X speedup when updating doc values, on a private test corpus using highly concurrent hardware (an i3.16xlarge EC2 instance).

Buffering vs. Applying

When you ask Lucene's IndexWriter to delete a document or update a document (which is an atomic delete-then-add), or to update a doc-values field for a document, you pass it a Term, typically against a primary-key field like id, that identifies which document to update. But IndexWriter does not perform the deletion right away. Instead, it buffers up all such deletions and updates, and only applies them in bulk once they are using too much RAM, or you refresh your near-real-time reader or call commit, or a merge needs to kick off.
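To make the buffering concrete, here is a minimal sketch of that public API in action; the index path and the id, body, and views field names are illustrative:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

public class BufferedUpdatesDemo {
  public static void main(String[] args) throws Exception {
    try (FSDirectory dir = FSDirectory.open(Paths.get("/tmp/demo-index"));
         IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {

      Document doc = new Document();
      doc.add(new StringField("id", "42", Field.Store.YES));
      doc.add(new TextField("body", "original text", Field.Store.NO));
      doc.add(new NumericDocValuesField("views", 1L));
      writer.addDocument(doc);

      // Atomic delete-then-add, identified by the primary-key term; the
      // deletion is buffered, not applied immediately:
      Document updated = new Document();
      updated.add(new StringField("id", "42", Field.Store.YES));
      updated.add(new TextField("body", "revised text", Field.Store.NO));
      updated.add(new NumericDocValuesField("views", 2L));
      writer.updateDocument(new Term("id", "42"), updated);

      // Update a single doc-values field without re-indexing the document;
      // also buffered until a flush, refresh, commit, or merge applies it:
      writer.updateNumericDocValue(new Term("id", "42"), "views", 3L);

      // A plain delete by term is buffered the same way:
      writer.deleteDocuments(new Term("id", "42"));

      writer.commit();  // forces the buffered deletes/updates to be resolved and applied
    }
  }
}
```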

The process of resolving those terms to actual Lucene document ids is quite costly, as Lucene must visit all segments and perform a primary-key lookup for each term. Performing lookups in batches provides some efficiency: because we sort the terms in Unicode order, we can do a single sequential scan through each segment's terms dictionary and postings.
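Conceptually, that batched resolution looks something like the simplified sketch below. This is not Lucene's actual internal code (the real logic lives in internal classes such as BufferedUpdatesStream), and the id primary-key field name is assumed:

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

import java.util.List;
import java.util.TreeSet;

// Simplified illustration of resolving a batch of primary-key terms to
// per-segment doc ids; Lucene's real implementation is considerably
// more involved.
public final class BatchedLookup {

  public static void resolve(IndexReader reader, List<BytesRef> ids) throws Exception {
    // Sorting the terms lets each segment's terms dictionary be scanned
    // roughly sequentially instead of seeking randomly:
    TreeSet<BytesRef> sorted = new TreeSet<>(ids);

    for (LeafReaderContext leaf : reader.leaves()) {  // visit every segment
      Terms terms = leaf.reader().terms("id");
      if (terms == null) continue;
      TermsEnum termsEnum = terms.iterator();
      PostingsEnum postings = null;
      for (BytesRef id : sorted) {
        // seekExact can often skip the disk seek entirely when the terms
        // index proves the term cannot exist in this segment:
        if (termsEnum.seekExact(id)) {
          postings = termsEnum.postings(postings, PostingsEnum.NONE);
          int docID = postings.nextDoc();  // primary key: at most one live doc
          // ... record (segment, docID) as affected by this delete/update
        }
      }
    }
  }
}
```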

We have also optimized primary-key lookups and the buffering of deletes and updates quite a bit over time, with issues like LUCENE-6161, LUCENE-2897, LUCENE-2680, and LUCENE-3342. Our fast BlockTree terms dictionary can sometimes save a disk seek for each segment if it can tell from the finite state transducer terms index that the requested term cannot possibly exist in that segment.

Still, as fast as we have made this code, only one thread is allowed to run it at a time, and for update-heavy workloads, that one thread can become a major bottleneck. We've seen users asking about this in the past because, while the deletes are being resolved, it looks as if IndexWriter is hung, since nothing else is happening. The larger your indexing buffer, the longer the hang.

Of course, if you are simply appending new documents to your Lucene index, never updating previously indexed documents, a common use-case these days with the broad adoption of Lucene for log analytics, then none of this matters to you! 

Concurrency Is Hard

With this change, IndexWriter still buffers deletes and updates into packets, but whereas before each packet was also buffered for later single-threaded application, IndexWriter now immediately resolves the deletes and updates in that packet to the affected documents using the current indexing thread. You therefore gain as much concurrency as the number of indexing threads you send through IndexWriter.
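For example, an update-heavy workload might drive IndexWriter from a thread pool, as in the rough sketch below; the thread count, field name, and document contents are invented for illustration, but IndexWriter is safe to share across threads like this:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// With the new code, each indexing thread also helps resolve the buffered
// delete/update packets, so more threads means more resolution concurrency.
public final class ConcurrentUpdates {

  public static void run(IndexWriter writer, int numThreads, int updatesPerThread) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(numThreads);
    for (int t = 0; t < numThreads; t++) {
      final int thread = t;
      pool.execute(() -> {
        try {
          for (int i = 0; i < updatesPerThread; i++) {
            String id = "doc-" + thread + "-" + i;
            Document doc = new Document();
            doc.add(new StringField("id", id, Field.Store.YES));
            // IndexWriter is thread-safe; updates from all threads are
            // buffered into packets and now also resolved concurrently:
            writer.updateDocument(new Term("id", id), doc);
          }
        } catch (Exception e) {
          throw new RuntimeException(e);
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS);
  }
}
```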

The change was overly difficult because of IndexWriter's terribly complex concurrency, a technical debt I am now convinced we need to address head-on by somehow refactoring IndexWriter. This class is challenging to implement since it must handle so many complex and costly concurrent operations: ongoing indexing, deletes, and updates; refreshing new readers; writing new segment files; committing changes to disk; merging segments and adding indexes. There are numerous locks, not just IndexWriter's monitor lock but also those in many other internal classes, that make it easy to accidentally trigger a deadlock today. Patches welcome!

The original change also led to some cryptic test failures thanks to our extensive randomized tests, which we are working through for 7.0. 

That complex concurrency, unfortunately, prevented me from making the final step of deletes and updates fully concurrent: writing the new segment files. This file writing takes the in-memory resolved doc ids and writes a new per-segment bitset (for deleted documents) or a whole new doc-values column per field (for doc-values updates).

This is typically a fast operation, except for large indices, where a whole column of doc-values updates can be sizable. But since we must do this for every segment that has affected documents, doing it single-threaded is definitely still a small bottleneck, so it would be nice, once we succeed in simplifying IndexWriter's concurrency, to make these file writes concurrent as well.
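As a purely illustrative sketch of why that last step should parallelize well: the per-segment writes are independent of one another, so they could in principle be fanned out to a thread pool. SegmentWriteTask below is an invented stand-in, not a Lucene class:

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch of the remaining single-threaded step and how it
// might be made concurrent once IndexWriter's locking is simplified.
public final class ParallelSegmentWrites {

  // Invented stand-in for "write one segment's new deleted-docs bitset
  // or doc-values column"; not a real Lucene type.
  interface SegmentWriteTask extends Callable<Void> {}

  static void writeSequentially(List<SegmentWriteTask> tasks) throws Exception {
    for (SegmentWriteTask task : tasks) {
      task.call();  // today: one segment's files are written at a time
    }
  }

  static void writeConcurrently(List<SegmentWriteTask> tasks, int threads) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      pool.invokeAll(tasks);  // each segment's files written on its own thread
    } finally {
      pool.shutdown();
    }
  }
}
```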


Topics:
database, lucene, concurrency, buffering, applying

Published at DZone with permission of Michael McCandless, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
