The Dark Sides of Lucene

by Oren Eini · Apr. 17, 14

I’ve been using Lucene for the past six or seven years, and after my last post, I thought it would be a good idea to talk a bit about the kinds of things that it doesn’t do well. We’ve been using it extensively in RavenDB for the past 5 years, and I think that I have a pretty good understanding of it. We used to have one of the Lucene.NET committers working at Hibernating Rhinos, so I have a high level of confidence that I’m not just stupidly using it improperly, too.

Probably the part that caused us the most pain with Lucene was the fact that it isn’t transactional. That is, it is quite easy to get into situations where the indexes are corrupted. That makes it… challenging to use in a database that needs to ensure consistency. The problem is that this is really not a use case that Lucene is well suited for. In order to ensure that data is saved, we have to commit often. But in order to ensure good performance, we want to commit less often, and then we will lose the changes if we crash. For that matter, Lucene doesn’t make any attempt to actually flush the data properly, relying on the OS to do that, so a system crash can cause you to lose data even though you “committed” it.
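To make the flushing point concrete, here is a minimal Python sketch (not Lucene code) of the difference between handing bytes to the OS and asking the OS to push them to the device. The function name `durable_write` and the file name are made up for illustration.

```python
import os
import tempfile

def durable_write(path: str, data: bytes) -> None:
    """Write data and ask the OS to push it to stable storage before returning."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()              # moves bytes from the process buffer to the OS page cache
        os.fsync(f.fileno())   # asks the OS to write the page cache out to the device

with tempfile.TemporaryDirectory() as tmp:
    segment_path = os.path.join(tmp, "segment.bin")  # hypothetical file name
    durable_write(segment_path, b"committed index data")
    with open(segment_path, "rb") as f:
        stored = f.read()
```

Without the `os.fsync` call the write still appears to succeed, but the data may only live in the OS cache, which is exactly the window in which a system crash loses a "committed" change. The price is that every durable commit waits for the disk, which is why committing often hurts performance.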

Fun times, I can tell you that.

Next, we have the issue of what Lucene calls updates. Updates in Lucene are actually just delete/add, and they don’t maintain the same document id (more on that later). Because of that, you usually have to have an additional field in the index that serves as your primary key, and you handle updates by first deleting and then adding. That is quite strange, to be fair, and it means that you can’t “extend” an index entry; you have to build it from scratch every time.
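A toy model of that update pattern, as a sketch only: `ToyIndex`, its methods, and the `key` field are hypothetical names, and nothing here is Lucene's actual API. The internal document id is simply the position in an append-only list.

```python
class ToyIndex:
    """Append-only toy index: internal doc ids are positions in a list."""

    def __init__(self):
        self.docs = []        # position in this list is the internal doc id
        self.deleted = set()  # internal ids that have been marked as deleted

    def add(self, fields: dict) -> int:
        self.docs.append(fields)
        return len(self.docs) - 1

    def delete_by_key(self, key: str) -> None:
        # Find live entries by the application-level primary key field.
        for doc_id, fields in enumerate(self.docs):
            if fields["key"] == key and doc_id not in self.deleted:
                self.deleted.add(doc_id)

    def update(self, fields: dict) -> int:
        # An "update" is a delete of the old entry plus an add of a complete new one.
        self.delete_by_key(fields["key"])
        return self.add(fields)

index = ToyIndex()
first_id = index.add({"key": "users/1", "name": "alice"})
second_id = index.update({"key": "users/1", "name": "alicia"})
```

After the update, the same logical document lives under a different internal id, and the caller had to supply every field again because there is no way to patch the existing entry.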

Speaking of this, let us talk a bit about deletes. Ignoring for the moment the absolutely horrendous decision to do deletes through the reader, let us talk about how they are actually done. Deletes are recorded in a separate file, and that means that the moment you have any deletes (or, as I mentioned, updates), all the internal statistics are wrong. We run into this quite often with RavenDB when we are doing things like facets or suggestions. For example, if you request a suggestion for a user name, it will happily give you suggestions for deleted users, even though we deleted them in Lucene.
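The effect can be reproduced with a very small model. This is a sketch under the assumption that suggestions are served from term data that is not rewritten when a delete is recorded; all names below are illustrative and none of this is Lucene code.

```python
from collections import Counter

docs = ["alice", "alina", "bob"]   # user names, one per internal doc id
deleted = set()                    # deletes are recorded separately from the term data
term_counts = Counter(docs)        # term statistics built when the docs were written

def delete(doc_id: int) -> None:
    # Only the separate delete list changes; term_counts is left untouched.
    deleted.add(doc_id)

def suggest(prefix: str) -> list:
    # Suggestions come from the term data, which still lists deleted terms.
    return sorted(t for t in term_counts if t.startswith(prefix))

def live_names(prefix: str) -> list:
    # What a user would expect: names from documents that are not deleted.
    return sorted(
        name for doc_id, name in enumerate(docs)
        if doc_id not in deleted and name.startswith(prefix)
    )

delete(1)               # remove "alina"
stale = suggest("al")   # still contains "alina"
live = live_names("al") # only "alice"
```

The gap between `stale` and `live` is the kind of thing that ends up in a bug report: the delete was accepted, yet the deleted user keeps showing up until the term data is rebuilt.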

Those stale entries go away eventually, when Lucene is ready to optimize the index by merging all the files, but in the meantime, it makes for interesting bug reports.

Speaking of merging, that is another common issue that you have to deal with. In order to ensure optimal performance, you have to be on top of the merge policy. This results in some interesting issues. For RavenDB’s purposes, we do a writer commit after every indexing batch. That means that if you are writing to RavenDB slowly enough, we do a commit after every document write. That results in a lot of segments, and the merge policy has to do a lot of merges. The problem here is that merges have two distinct costs associated with them.

First, and obviously, you are going to need to write (again) all of the documents in all of the segments you are merging. That is very similar to doing merges in LevelDB (indeed, in general Lucene’s file format is remarkably similar to SST). Next, and arguably more interesting / problematic from our point of view, is the fact that it also kills all of the caches. Let me try to explain. Lucene uses a lot of caches to speed things up; in fact, most of the sorting is done by using the caches, for example. That works really well when we are querying normally, because segments are immutable, which makes for great caching. But on a merge, not only have we just invalidated all of our caches, we now need to read, again, all of the data that we just wrote, so we can use it. That can be… costly. And both things can introduce stalls into the system.
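Both costs show up even in a tiny model. The sketch below assumes caches are kept per segment and thrown away when the segment disappears; the segment names and helper functions are hypothetical, and this is not how Lucene's caches are actually implemented.

```python
segments = {"seg_a": [3, 1], "seg_b": [2]}  # segment name -> field values (toy data)
sort_cache = {}                             # per-segment cache of sorted values
loads = []                                  # records every cache miss

def sorted_values(name: str) -> list:
    if name not in sort_cache:
        loads.append(name)                  # cache miss: the segment data is read again
        sort_cache[name] = sorted(segments[name])
    return sort_cache[name]

def merge(new_name: str, *names: str) -> None:
    # Cost 1: every document of the merged segments is written again.
    segments[new_name] = [value for n in names for value in segments.pop(n)]
    # Cost 2: cache entries for the old segments are now useless.
    for n in names:
        sort_cache.pop(n, None)

for name in list(segments):
    sorted_values(name)                     # warm the caches
for name in list(segments):
    sorted_values(name)                     # served from cache, no new loads
merge("seg_c", "seg_a", "seg_b")
merged_sorted = sorted_values("seg_c")      # miss again: everything is re-read
```

Immutable segments make the second round of queries free, but the merge resets that: the first query against the merged segment pays the full load cost once more.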

The major problem externally with merges is that the document ids change, and that means that you cannot rely on them. It would be much easier if you could send an id out into the world, and get it back later and do something with it, but that isn’t possible with Lucene.
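Here is a small sketch of why an id handed out before a merge cannot be trusted afterwards. It assumes ids are positions in the index and that a merge drops deleted documents and compacts the rest; `merge_segments` is a made-up helper, not a Lucene function.

```python
def merge_segments(segments: list, deleted: set) -> list:
    """Merge segments into one, dropping deleted docs; ids become new positions."""
    merged = []
    global_id = 0
    for segment in segments:
        for doc in segment:
            if global_id not in deleted:
                merged.append(doc)
            global_id += 1
    return merged

segments = [["users/1", "users/2"], ["users/3"]]
before = {doc: i for i, doc in enumerate(d for s in segments for d in s)}
after_merge = merge_segments(segments, deleted={0})
after = {doc: i for i, doc in enumerate(after_merge)}
```

Before the merge, id 1 means "users/2". After it, id 1 means "users/3", so anyone who kept the old id would silently get the wrong document. That is why the primary key has to live in a field of your own.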

Next, and not really an operational issue like the rest, Lucene’s multi-threaded behavior is… a hammer to an egg, in most cases. By that I mean code like this:

I mean, it is certainly functional, but it is pretty ugly.

Now, don’t get me wrong, I think that Lucene is pretty neat. But there are some really dark corners there. For example, the actual searching: go ahead and try to find where that is done in Lucene. It is very easy to get lost between all of the different aspects: weights, sorters, queries and various enumerators. For fun, a lot of that runs at hard-to-predict times, making the actual query run time interesting to try to figure out.

As a good example, let us take the simplest possible query, TermQuery. Go ahead, try to find where it is actually doing the query for matching terms in this code: https://github.com/apache/lucene/blob/lucene_2_1/src/java/org/apache/lucene/search/termquery.java

That actually happens here: https://github.com/apache/lucene/blob/lucene_2_1/src/java/org/apache/lucene/search/termscorer.java#l79 , and it is effectively a side effect of calling reader.termDocs(term), which limits the matches to only those with the same term. Trying to track down where exactly things happen can be… interesting.
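The shape of that "matching as a side effect" can be shown with a toy inverted index. This is only a sketch of the idea: `postings`, `term_docs` and `term_scorer` are stand-in names, the score of 1.0 is a placeholder, and none of it mirrors Lucene's real classes.

```python
postings = {                 # term -> ascending list of matching doc ids (toy data)
    "lucene": [0, 2, 5],
    "raven": [1, 2],
}

def term_docs(term: str):
    """Toy stand-in for enumerating a term's postings; yields only matching doc ids."""
    yield from postings.get(term, [])

def term_scorer(term: str) -> list:
    # There is no explicit "does this doc match?" check anywhere in the scorer.
    # The filtering already happened when we asked for this term's postings.
    return [(doc_id, 1.0) for doc_id in term_docs(term)]

hits = term_scorer("lucene")
```

If you read `term_scorer` looking for the comparison against the term, you will not find one, which is the same experience as reading the scorer code linked above.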

Anyway, this post is getting too long, and I want to get back to figuring out how Lucene does its thing without dwelling too much in the dark…



Published at DZone with permission of Oren Eini, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
