DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports Events Over 2 million developers have joined DZone. Join Today! Thanks for visiting DZone today,
Edit Profile Manage Email Subscriptions Moderation Admin Console How to Post to DZone Article Submission Guidelines
View Profile
Sign Out
Refcards
Trend Reports
Events
Zones
Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Partner Zones AWS Cloud
by AWS Developer Relations
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Partner Zones
AWS Cloud
by AWS Developer Relations

Trending

  • Developers Are Scaling Faster Than Ever: Here’s How Security Can Keep Up
  • Multi-Stream Joins With SQL
  • WireMock: The Ridiculously Easy Way (For Spring Microservices)
  • Testing, Monitoring, and Data Observability: What’s the Difference?
  1. DZone
  2. Data Engineering
  3. Databases
  4. Lucene Gets Concurrent Deletes and Updates!

Lucene Gets Concurrent Deletes and Updates!

Today, hardware has only become even more concurrent, and we've finally done the same thing for processing deleted documents and updating doc values!

Michael Mccandless user avatar by
Michael Mccandless
·
Jul. 05, 17 · News
Like (2)
Save
Tweet
Share
3.51K Views

Join the DZone community and get the full member experience.

Join For Free

Long ago, Lucene could only use a single thread to write new segments to disk. The actual indexing of documents, which is the costly process of inverting incoming documents into in-memory segment data structures, could run with multiple threads, but back then, the process of writing those in-memory indices to Lucene segments was single threaded. 

We fixed that more than six years ago now, yielding big indexing throughput gains on concurrent hardware. 

Today, hardware has only become even more concurrent, and we've finally done the same thing for processing deleted documents and updating doc values! 

This change, in time for Lucene's next major release (7.0), shows a 53% indexing throughput speed-up when updating whole documents, and a 7.4X-8.6X speedup when updating doc values, on a private test corpus using highly concurrent hardware (an i3.16xlarge EC2 instance). 

Buffering vs. Applying

When you ask Lucene's IndexWriter to delete a document or update a document (which is an atomic delete and then add), or to update a doc-values field for a document, you pass it a Term, typically against a primary key field like id , that identifies which document to update. But IndexWriter does not perform the deletion right away. Instead, it buffers up all such deletions and updates, and only finally applies them in bulk once they are using too much RAM, or you refresh your near-real-time reader or call commit, or a merge needs to kick off. 

The process of resolving those terms to actual Lucene document ids is quite costly as Lucene must visit all segments and perform a primary key lookup for each term. Performing lookups in batches provides some efficiency because we sort the terms in Unicode order so we can do a single sequential scan through each segment's terms dictionary and postings. 

We have also optimized primary key lookups and the buffering of deletes and updates quite a bit over time, with issues like LUCENE-6161, LUCENE-2897, LUCENE-2680, and LUCENE-3342. Our fast BlockTree terms dictionary can sometimes save a disk seek for each segment if it can tell from the finite state transducer terms index that the requested term cannot possibly exist in this segment.

Still, as fast as we have made this code, only one thread is allowed to run it at a time, and for update-heavy workloads, that one thread can become a major bottleneck. We've seen users asking about this in the past, because while the deletes are being resolved it looks as if IndexWriter is hung since nothing else is happening. The larger your indexing buffer the longer the hang. 

Of course, if you are simply appending new documents to your Lucene index, never updating previously indexed documents, a common use-case these days with the broad adoption of Lucene for log analytics, then none of this matters to you! 

Concurrency Is Hard

With this change, IndexWriter still buffers deletes and updates into packets, but whereas before, when each packet was also buffered for later single-threaded application, instead IndexWriter now immediately resolves the deletes and updates in that packet to the affected documents using the current indexing thread. So you gain as much concurrency as indexing threads you are sending through IndexWriter. 

The change was overly difficult because of IndexWriter 's terribly complex concurrency, a technical debt I am now convinced we need to address head-on by somehow refactoring IndexWriter. This class is challenging to implement since it must handle so many complex and costly concurrent operations: ongoing indexing, deletes and updates; refreshing new readers; writing new segment files; committing changes to disk; merging segments and adding indexes. There are numerous locks, not justIndexWriter 's monitor lock, but also many other internal classes, that make it easy to accidentally trigger a deadlock today. Patches welcome! 

The original change also led to some cryptic test failures thanks to our extensive randomized tests, which we are working through for 7.0. 

That complex concurrency, unfortunately, prevented me from making the final step of deletes and updates fully concurrent: writing the new segment files. This file writing takes the in-memory resolved doc ids and writes a new per-segment bitset, for deleted documents, or a whole new doc values column per field, for doc values updates. 

This is typically a fast operation, except for large indices where a whole column of doc-values updates could be sizable. But since we must do this for every segment that has affected documents, doing this single threaded is definitely still a small bottleneck, so it would be nice, once we succeed in simplifying IndexWriter's concurrency, to also make our file writes concurrent.

Lucene Document Relational database

Published at DZone with permission of Michael Mccandless, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Trending

  • Developers Are Scaling Faster Than Ever: Here’s How Security Can Keep Up
  • Multi-Stream Joins With SQL
  • WireMock: The Ridiculously Easy Way (For Spring Microservices)
  • Testing, Monitoring, and Data Observability: What’s the Difference?

Comments

Partner Resources

X

ABOUT US

  • About DZone
  • Send feedback
  • Careers
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 600 Park Offices Drive
  • Suite 300
  • Durham, NC 27709
  • support@dzone.com

Let's be friends: