As of 4.0, Lucene has switched to a new pluggable codec architecture, giving the application full control over the on-disk format of all index files. We have a nice collection of built-in codec components, and developers can create their own, such as this recent example back-end to hold updatable fields. This is an important change since it removes the previous sizable barriers to innovating on Lucene's index formats.
A codec is actually a collection of formats, one for each part of the index. For example, one format handles stored fields, another handles norms, etc. There are eight formats in total, and a codec could simply be a new mix of pre-existing formats, or perhaps you create your own format for one part of the index and otherwise use all the formats from an existing codec.
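To make the "collection of formats" idea concrete, here is a minimal toy model in plain Java. It is a sketch under stated assumptions, not Lucene's API: `IndexPart`, `ToyFormat`, `ToyCodec` and `ToyCodecDemo` are all hypothetical names, and only three of the eight index parts are modeled.

```java
import java.util.EnumMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical model, not Lucene's API: a few of the index parts a codec covers. */
enum IndexPart { STORED_FIELDS, NORMS, POSTINGS }

/** Hypothetical stand-in for a format; a real format reads and writes index files. */
interface ToyFormat {
    String name();
}

/** A codec here is simply one format per index part. */
class ToyCodec {
    private final Map<IndexPart, ToyFormat> formats = new EnumMap<>(IndexPart.class);

    ToyCodec(Map<IndexPart, ToyFormat> formats) {
        for (IndexPart part : IndexPart.values()) {
            ToyFormat format = formats.get(part);
            if (format == null) {
                throw new IllegalArgumentException("missing format for " + part);
            }
            this.formats.put(part, format);
        }
    }

    ToyFormat formatFor(IndexPart part) {
        return formats.get(part);
    }

    /** Returns a new codec that reuses every format except the one being replaced. */
    ToyCodec with(IndexPart part, ToyFormat replacement) {
        Map<IndexPart, ToyFormat> copy = new EnumMap<>(formats);
        copy.put(part, replacement);
        return new ToyCodec(copy);
    }
}

class ToyCodecDemo {
    /** Builds a codec whose formats are all placeholders named "default-...". */
    static ToyCodec defaultCodec() {
        Map<IndexPart, ToyFormat> formats = new EnumMap<>(IndexPart.class);
        for (IndexPart part : IndexPart.values()) {
            String name = "default-" + part.name().toLowerCase(Locale.ROOT);
            formats.put(part, () -> name);
        }
        return new ToyCodec(formats);
    }

    public static void main(String[] args) {
        // Swap in a custom postings format and keep everything else as-is.
        ToyCodec custom = defaultCodec().with(IndexPart.POSTINGS, () -> "my-postings");
        for (IndexPart part : IndexPart.values()) {
            System.out.println(part + " -> " + custom.formatFor(part).name());
        }
    }
}
```

The `with` method returns a new codec rather than mutating the old one, so the original mix of formats stays usable alongside the experimental one.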
The trickiest format to create is the postings format, which provides read/write access to all postings (fields, terms, documents, frequencies, positions, offsets, payloads). Part of the challenge is that it has a large API surface area. But there are also complexities such as skipping, reuse, conditional use of different values in the enumeration (frequencies, positions, payloads, offsets), partial consumption of the enumeration, etc. These challenges unfortunately make it easy for bugs to sneak in, but an awesome way to ferret out all the bugs is to leverage Lucene's extensive tests: run all tests with your new postings format (be sure to first register your new postings format). If your new postings format has a bug, tests will most likely fail.
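Skipping is a good example of where bugs hide. As a rough illustration, the sketch below checks a skipping implementation against plain step-by-step scanning over the same doc IDs. It is a toy in plain Java, not Lucene's API: `ToyDocsEnum` and `SkippingCheck` are hypothetical names, and the binary search merely stands in for whatever skip structure a real postings format would use.

```java
/** Toy enumeration over sorted doc IDs; a hypothetical stand-in, not Lucene's API. */
class ToyDocsEnum {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final int[] docs; // sorted ascending, no duplicates
    private int upto = -1;    // index of the current doc, -1 before the first call

    ToyDocsEnum(int[] sortedDocs) {
        this.docs = sortedDocs;
    }

    /** Steps to the next doc ID, or returns NO_MORE_DOCS when exhausted. */
    int nextDoc() {
        if (upto < docs.length) {
            upto++;
        }
        return upto < docs.length ? docs[upto] : NO_MORE_DOCS;
    }

    /** Moves past the current doc to the first doc ID >= target, via binary search. */
    int advance(int target) {
        int lo = upto + 1;
        int hi = docs.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (docs[mid] < target) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        upto = lo;
        return upto < docs.length ? docs[upto] : NO_MORE_DOCS;
    }
}

class SkippingCheck {
    /** True if advance() lands on the same docs as repeated nextDoc() calls would. */
    static boolean skippingMatchesScanning(int[] docs, int[] targets) {
        ToyDocsEnum skipper = new ToyDocsEnum(docs);
        ToyDocsEnum scanner = new ToyDocsEnum(docs);
        for (int target : targets) {
            int skipped = skipper.advance(target);
            int scanned;
            do {
                scanned = scanner.nextDoc();
            } while (scanned < target);
            if (skipped != scanned) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int[] docs = {1, 4, 7, 10, 30};
        int[] targets = {3, 7, 8, 100};
        System.out.println(skippingMatchesScanning(docs, targets));
    }
}
```

The slow scanner acts as the trusted baseline, so any disagreement points at the skipping code.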
However, when a test does fail, it's a lot of work to dig into the specific failure to understand what went wrong, and some tests are more challenging than others (my favorite has a thoroughly innocent name!). Furthermore, it would be nice to develop the postings format iteratively: first get only documents working, then add freqs, positions, payloads, offsets, etc. Yet we have no way to run only the subset of tests that don't require positions, for example. So today you have to code up everything before iterating. Net/net, our tests are not a great fit for the early iterations when developing a new postings format.
I created a new postings format, which will hopefully be more efficient than an existing codec at using fixed int block encodings. I did this to support Han Jiang's Google Summer of Code project to add a useful int block postings format to Lucene.
So, I took the opportunity to make early-stage iterations easier while developing a new postings format by creating a new test. It has layers of testing (documents, +freqs, +positions, +payloads, +offsets) that you can incrementally enable as you iterate, as well as different test options (skipping or not, reuse or not, stop visiting documents and/or positions early, one or more threads, etc.). When you turn on verbose output, the test prints clear details of everything it indexed and what exactly it's testing so a failure is easy to debug. I'm very happy with the results: I found this to be a much more productive way to create a new postings format.
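The layering idea can be sketched in a few lines of plain Java. This is only an illustration of the approach, not the actual Lucene test: `Layer`, `ToyPosting`, `LayeredPostingsCheck` and `LayeredDemo` are hypothetical names, only the first three layers are modeled (payloads and offsets would follow the same pattern), and the `verbose` flag is a simple stand-in for the test's verbose output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Layers of checking, lowest first; each layer includes the ones before it. */
enum Layer { DOCS, FREQS, POSITIONS }

/** Hypothetical posting for one document: doc ID, term frequency, positions. */
class ToyPosting {
    final int docID;
    final int freq;
    final int[] positions;

    ToyPosting(int docID, int freq, int... positions) {
        this.docID = docID;
        this.freq = freq;
        this.positions = positions;
    }
}

class LayeredPostingsCheck {
    /** Compares actual against expected, ignoring every layer above maxLayer. */
    static List<String> check(List<ToyPosting> expected, List<ToyPosting> actual,
                              Layer maxLayer, boolean verbose) {
        List<String> failures = new ArrayList<>();
        if (expected.size() != actual.size()) {
            failures.add("expected " + expected.size() + " docs but got " + actual.size());
            return failures;
        }
        for (int i = 0; i < expected.size(); i++) {
            ToyPosting e = expected.get(i);
            ToyPosting a = actual.get(i);
            if (verbose) {
                System.out.println("doc " + e.docID + ": checking up to " + maxLayer);
            }
            if (e.docID != a.docID) {
                failures.add("entry " + i + ": expected docID " + e.docID + " but got " + a.docID);
            }
            if (maxLayer.compareTo(Layer.FREQS) >= 0 && e.freq != a.freq) {
                failures.add("doc " + e.docID + ": expected freq " + e.freq + " but got " + a.freq);
            }
            if (maxLayer.compareTo(Layer.POSITIONS) >= 0
                    && !Arrays.equals(e.positions, a.positions)) {
                failures.add("doc " + e.docID + ": expected positions "
                        + Arrays.toString(e.positions) + " but got " + Arrays.toString(a.positions));
            }
        }
        return failures;
    }
}

class LayeredDemo {
    /** What was indexed: two docs with freqs and positions. */
    static List<ToyPosting> expected() {
        return Arrays.asList(new ToyPosting(0, 2, 3, 9), new ToyPosting(5, 1, 4));
    }

    /** What an early-stage format might read back: doc IDs only. */
    static List<ToyPosting> docsOnly() {
        return Arrays.asList(new ToyPosting(0, 0), new ToyPosting(5, 0));
    }

    public static void main(String[] args) {
        // Passes while only documents are implemented...
        System.out.println(LayeredPostingsCheck.check(expected(), docsOnly(), Layer.DOCS, true));
        // ...and reports mismatches once the freqs layer is switched on.
        System.out.println(LayeredPostingsCheck.check(expected(), docsOnly(), Layer.FREQS, true));
    }
}
```

Because the check returns descriptive messages instead of stopping at the first mismatch, one run shows everything that is still missing at the current layer.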
The goal of this test is to be so thorough that if it passes with your postings format then all of Lucene's tests should pass. If ever we find that's not the case then I consider that a bug in the new test! (Who tests the tester?)
If you find yourself creating a new postings format I strongly suggest using the new test during early development to get your postings format off the ground. Once it's passing, run all tests with your new postings format, and if something fails please let us know so we can fix the test.