Building a New Lucene Postings Format
As of 4.0, Lucene has switched to a new pluggable codec architecture, giving the application full control over the on-disk format of all index files. We have a nice collection of builtin codec components, and developers can create their own, such as this recent example using a Redis back-end to hold updatable fields. This is an important change since it removes the previous sizable barriers to innovating on Lucene's index formats.
A codec is actually a collection of formats, one for each part of the index. For example, StoredFieldsFormat handles stored fields, NormsFormat handles norms, etc. There are eight formats in total, and a codec could simply be a new mix of pre-existing formats, or perhaps you create your own TermVectorsFormat and otherwise use all the formats from the Lucene40 codec.

The trickiest format to create is PostingsFormat, which provides read/write access to all postings (fields, terms, documents, frequencies, positions, offsets, payloads). Part of the challenge is that it has a large API surface area. But there are also complexities such as skipping, reuse, conditional use of different values in the enumeration (frequencies, positions, payloads, offsets), partial consumption of the enumeration, etc. These challenges unfortunately make it easy for bugs to sneak in, but an awesome way to ferret out all the bugs is to leverage Lucene's extensive randomized tests: run all tests with your new postings format selected (be sure to first register your new postings format). If your new postings format has a bug, tests will most likely fail.

However, when a test does fail, it's a lot of work to dig into the specific failure to understand what went wrong, and some tests are more challenging than others. My favorite is the innocently named TestBasics! Furthermore, it would be nice to develop the postings format iteratively: first get only documents working, then add freqs, positions, payloads, offsets, etc. Yet we have no way to run only the subset of tests that don't require positions, for example. So today you have to code up everything before iterating. Net/net, our tests are not a great fit for the early iterations when developing a new postings format.
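To make the "skipping" and "partial consumption" complexities mentioned above a little more concrete, here is a minimal Java sketch of a toy, in-memory postings enumeration. ToyDocsEnum, ToyDocsEnumDemo, and collectFrom are hypothetical names for illustration, not Lucene classes. The nextDoc/advance/NO_MORE_DOCS contract is only loosely modeled on Lucene's DocIdSetIterator, and a real postings format would decode blocks and skip data from disk rather than scan an array.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy stand-in for a postings enumeration over one term; not Lucene's API. */
final class ToyDocsEnum {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final int[] docs;   // sorted, unique doc IDs
    private final int[] freqs;  // freqs[i] is the term frequency in docs[i]
    private int idx = -1;       // -1 means "not positioned yet"

    ToyDocsEnum(int[] docs, int[] freqs) {
        this.docs = docs;
        this.freqs = freqs;
    }

    /** Current doc ID: -1 before the first call, NO_MORE_DOCS once exhausted. */
    int docID() {
        if (idx < 0) {
            return -1;
        }
        return idx < docs.length ? docs[idx] : NO_MORE_DOCS;
    }

    /** Moves to the next doc; keeps returning NO_MORE_DOCS once exhausted. */
    int nextDoc() {
        if (idx < docs.length) {
            idx++;
        }
        return docID();
    }

    /** Moves forward to the first doc ID >= target (assumption: target > current doc). */
    int advance(int target) {
        // Linear scan for simplicity; a real format would consult skip data here.
        int doc;
        do {
            doc = nextDoc();
        } while (doc < target);
        return doc;
    }

    /** Only valid while positioned on a real doc. */
    int freq() {
        return freqs[idx];
    }
}

final class ToyDocsEnumDemo {
    /** Skips to target, then consumes the rest of the enumeration. */
    static List<Integer> collectFrom(ToyDocsEnum e, int target) {
        List<Integer> out = new ArrayList<>();
        for (int doc = e.advance(target); doc != ToyDocsEnum.NO_MORE_DOCS; doc = e.nextDoc()) {
            out.add(doc);
        }
        return out;
    }
}
```

Even in this tiny model there are several states to get right (unpositioned, positioned, exhausted), and callers may mix nextDoc and advance or stop partway through, which hints at why a full postings format has so many places for bugs to hide.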
I recently created a new postings format, BlockPostingsFormat, which will hopefully be more efficient than the Sep codec at using fixed int block encodings. I did this to support Han Jiang's Google Summer of Code project to add a useful int block postings format to Lucene.

So, I took the opportunity to make early-stage iterations easier when developing a new postings format by creating a new test, TestPostingsFormat.
It has layers of testing (documents, +freqs, +positions, +payloads, +offsets) that you can incrementally enable as you iterate, as well as different test options (skipping or not, reuse or not, stop visiting documents and/or positions early, one or more threads, etc.). When you turn on verbose output, the test prints clear details of everything it indexed and what exactly it's testing, so a failure is easy to debug. I'm very happy with the results: I found this to be a much more productive way to create a new postings format.

The goal of this test is to be so thorough that if it passes with your postings format then all of Lucene's tests should pass. If ever we find that's not the case then I consider that a bug in TestPostingsFormat! (Who tests the tester?)

If you find yourself creating a new postings format, I strongly suggest using the new TestPostingsFormat during early development to get your postings format off the ground. Once it's passing, run all tests with your new postings format, and if something fails please let us know so we can fix TestPostingsFormat.
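The idea of testing in incrementally enabled layers can be sketched in a few lines of Java. This is not the actual TestPostingsFormat source: Layer, ToyPosting, and LayeredPostingsCheck are hypothetical names, and the sketch compares in-memory values rather than reading an index. For brevity it only checks the first three layers; payloads and offsets would follow the same pattern.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical feature layers, ordered so each one implies all earlier ones. */
enum Layer { DOCS, FREQS, POSITIONS, PAYLOADS, OFFSETS }

/** Expected or actual postings for one document of one term (toy model). */
final class ToyPosting {
    final int doc;
    final int freq;
    final int[] positions;

    ToyPosting(int doc, int freq, int... positions) {
        this.doc = doc;
        this.freq = freq;
        this.positions = positions;
    }
}

final class LayeredPostingsCheck {
    /**
     * Compares actual against expected, checking only layers up to maxLayer.
     * Returns human-readable mismatches; an empty list means the check passed.
     * Note: PAYLOADS and OFFSETS are accepted but not checked in this sketch.
     */
    static List<String> verify(List<ToyPosting> expected, List<ToyPosting> actual, Layer maxLayer) {
        List<String> errors = new ArrayList<>();
        if (expected.size() != actual.size()) {
            errors.add("doc count: expected " + expected.size() + " but got " + actual.size());
            return errors;
        }
        for (int i = 0; i < expected.size(); i++) {
            ToyPosting e = expected.get(i);
            ToyPosting a = actual.get(i);
            if (e.doc != a.doc) {
                errors.add("doc[" + i + "]: expected " + e.doc + " but got " + a.doc);
                continue; // deeper layers are meaningless if the doc itself is wrong
            }
            if (maxLayer.compareTo(Layer.FREQS) >= 0 && e.freq != a.freq) {
                errors.add("doc " + e.doc + " freq: expected " + e.freq + " but got " + a.freq);
            }
            if (maxLayer.compareTo(Layer.POSITIONS) >= 0 && !Arrays.equals(e.positions, a.positions)) {
                errors.add("doc " + e.doc + " positions: expected " + Arrays.toString(e.positions)
                        + " but got " + Arrays.toString(a.positions));
            }
        }
        return errors;
    }
}
```

With a check like this, a half-finished format that only gets doc IDs right passes at Layer.DOCS and starts reporting precise, readable mismatches as soon as you raise the layer, which is the workflow the test is meant to support.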
Published at DZone with permission of Michael Mccandless, DZone MVB. See the original article here.