Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Lucene's In-memory Terms Dictionary, Thanks to Google Summer of Code

DZone's Guide to

Lucene's In-memory Terms Dictionary, Thanks to Google Summer of Code

· Big Data Zone
Free Resource

Effortlessly power IoT, predictive analytics, and machine learning applications with an elastic, resilient data infrastructure. Learn how with Mesosphere DC/OS.

Last year, Han Jiang's Google Summer of Code project was a big success: he created a new (now, default) postings format for substantially faster searches, along with smaller indices.

This summer, Han was at it again, with a new Google Summer of Code project with Lucene: he created a new terms dictionary holding all terms and their metadata in memory as an FST.

In fact, he created two new terms dictionary implementations. The first, FSTTermsWriter/Reader, hold all terms and metadata in a single in-memory FST, while the second, FSTOrdTermsWriter/Reader, does the same but also supports retrieving the ordinal for a term (TermsEnum.ord()) and looking up a term given its ordinal (TermsEnum.seekExact(long ord)). The second one also uses this ord internally so that the FST is more compact, while all metadata is stored outside of the FST, referenced by ord.

Like the default BlockTree terms dictionary, these new terms dictionaries accept any PostingsBaseFormat so you can separately plug in whichever format you want to encode/decode the postings.

Han also improved the PostingsBaseFormat API so that there is now a cleaner separation of how terms and their metadata are encoded, as opposed to how postings are encoded; PostingsWriterBase.encodeTerm and PostingsReaderBase.decodeTerm now handle encoding and decoding any term metadata required by the postings format, abstracting away how the long[]/byte[] were persisted by the terms dictionary. Previously, this line was annoyingly blurry.

Unfortunately, while the performance for primary key lookups is substantially faster, other queries like WildcardQuery are slower; see LUCENE-3069 for details. Fortunately, using PerFieldPostingsFormat, you are free to pick and choose which fields (e.g., your "id" field) should use the new terms dictionary.

For now, this feature is trunk-only (eventually Lucene 5.0).

Thank you, Han, and thank you, Google!

Learn to design and build better data-rich applications with this free eBook from O’Reilly. Brought to you by Mesosphere DC/OS.

Topics:

Published at DZone with permission of Michael Mccandless, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

THE DZONE NEWSLETTER

Dev Resources & Solutions Straight to Your Inbox

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.

X

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}