Build Your Own Lucene Codec!

I’ve been having a lot of fun hacking on a Lucene Codec lately. My hope is to create a Lucene storage layer based on FoundationDB – a new distributed and transactional key-value store. It’s a fun opportunity to learn about both FoundationDB and low-level Lucene details.

But before we get into all that fun technical stuff, there’s some work we need to do. Our goal is to get our first codec, MyCodec, to work! Here’s the source code:

public class MyCodec extends FilterCodec {
    public MyCodec() {
        super("MyCodec", new Lucene42Codec());
    }
}

Great! Done. Well, not quite. A real codec looks more like this:

public final class SimpleTextCodec extends Codec {
  // pretend there’s private vars here

  public PostingsFormat postingsFormat() {
    return postings;
  }

  public StoredFieldsFormat storedFieldsFormat() {
    return storedFields;
  }

  public TermVectorsFormat termVectorsFormat() {
    return vectorsFormat;
  }

  public FieldInfosFormat fieldInfosFormat() {
    return fieldInfosFormat;
  }

  public SegmentInfoFormat segmentInfoFormat() {
    return segmentInfos;
  }

  public NormsFormat normsFormat() {
    return normsFormat;
  }

  public LiveDocsFormat liveDocsFormat() {
    return liveDocs;
  }

  public DocValuesFormat docValuesFormat() {
    return dvFormat;
  }
}

This example is a bit more explicit. This is Mike McCandless’s SimpleText codec, a great codec to browse for educational purposes. In this codec, every component is customized.

Typically you only want to implement a subset of the codec’s components. Lucene provides a convenient base class called “FilterCodec” for this. You can customize whatever pieces you’d like, delegating the rest to another codec. For example, if we wanted a custom storage implementation for only term vectors, we could override like so:

import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.TermVectorsFormat;
import org.apache.lucene.codecs.lucene42.Lucene42Codec;

public class MyCodec extends FilterCodec {
    private final TermVectorsFormat myTermVectorsFormat;

    public MyCodec() {
        super("MyCodec", new Lucene42Codec());
        myTermVectorsFormat = new MyTermVectorsFormat();
    }

    // Use custom TermVectorsFormat, default everything else to Lucene42Codec
    @Override
    public TermVectorsFormat termVectorsFormat() {
        return myTermVectorsFormat;
    }
}

Here, we’re delegating to Lucene42Codec for everything except our special TermVectorsFormat.
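If the delegation idea feels a bit magical, here is a tiny stand-alone sketch of the pattern. To be clear, none of these types are Lucene’s: ToyCodec, ToyDefaultCodec, ToyFilterCodec and MyToyCodec are hypothetical names made up for illustration, and the “formats” are just strings so the example runs without Lucene on the classpath.

```java
public class DelegationDemo {
    /** Hypothetical stand-in for a codec; each method names the component in use. */
    public interface ToyCodec {
        String postingsFormat();
        String termVectorsFormat();
    }

    /** Plays the role of the default codec that we delegate to. */
    public static class ToyDefaultCodec implements ToyCodec {
        @Override public String postingsFormat() { return "default postings"; }
        @Override public String termVectorsFormat() { return "default term vectors"; }
    }

    /** Forwards every call to the delegate; subclasses override only what they change. */
    public abstract static class ToyFilterCodec implements ToyCodec {
        protected final ToyCodec delegate;

        protected ToyFilterCodec(ToyCodec delegate) {
            this.delegate = delegate;
        }

        @Override public String postingsFormat() { return delegate.postingsFormat(); }
        @Override public String termVectorsFormat() { return delegate.termVectorsFormat(); }
    }

    /** Customizes term vectors only; everything else falls through to the delegate. */
    public static class MyToyCodec extends ToyFilterCodec {
        public MyToyCodec() {
            super(new ToyDefaultCodec());
        }

        @Override public String termVectorsFormat() { return "custom term vectors"; }
    }

    public static void main(String[] args) {
        ToyCodec codec = new MyToyCodec();
        System.out.println(codec.postingsFormat());     // answered by the delegate
        System.out.println(codec.termVectorsFormat());  // answered by our override
    }
}
```

The point of the pattern is that the subclass stays small: it only mentions the one piece it cares about.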

Each of these individual formats is a separate piece responsible for serializing to a backing store during indexing and deserializing back into memory at read time. For many of the formats, though, it’s a bit more than that. For example, for the postings format, the format responsible for storing the inverted index, it’s vital to be able to iterate the inverted index efficiently. The inverted index must be stored in such a way that we can easily iterate all the indexed fields, then all the terms indexed into each field, and then the documents that contain that term in that field, along with their term frequencies and positions.
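To make that iteration order concrete, here is a toy in-memory model of it. This is only a sketch under my own assumptions and not Lucene’s postings API: ToyPostings and its methods are hypothetical, tokenization is a plain whitespace split, and everything lives in nested sorted maps (field, then term, then document id, then positions).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Toy postings: field -> term -> docId -> positions, with every level kept sorted. */
public class ToyPostings {
    private final Map<String, TreeMap<String, TreeMap<Integer, List<Integer>>>> fields =
            new TreeMap<>();

    /** Splits the text on whitespace and records the position of each term. */
    public void addDocument(int docId, String field, String text) {
        String[] tokens = text.trim().toLowerCase(Locale.ROOT).split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            fields.computeIfAbsent(field, f -> new TreeMap<>())
                  .computeIfAbsent(tokens[pos], t -> new TreeMap<>())
                  .computeIfAbsent(docId, d -> new ArrayList<>())
                  .add(pos);
        }
    }

    /** Walks fields, then terms, then documents, reporting frequency and positions. */
    public List<String> dump() {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, TreeMap<String, TreeMap<Integer, List<Integer>>>> field
                : fields.entrySet()) {
            for (Map.Entry<String, TreeMap<Integer, List<Integer>>> term
                    : field.getValue().entrySet()) {
                for (Map.Entry<Integer, List<Integer>> doc : term.getValue().entrySet()) {
                    List<Integer> positions = doc.getValue();
                    lines.add(field.getKey() + ":" + term.getKey()
                            + " doc=" + doc.getKey()
                            + " freq=" + positions.size()
                            + " positions=" + positions);
                }
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        ToyPostings index = new ToyPostings();
        index.addDocument(0, "body", "lucene codecs are fun");
        index.addDocument(1, "body", "fun fun codecs");
        index.dump().forEach(System.out::println);
    }
}
```

Running it prints one line per field/term/document combination, for example “body:fun doc=1 freq=2 positions=[0, 1]”. A real postings format has to support the same walk, only against whatever backing store it serializes to.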

You’ll find similar constraints as you implement the interfaces of the other pieces of the codec. Each of these pieces is a topic in its own right and deserves its own blog article. For now, I encourage you to explore the Javadocs to see what might be fun to customize on the Lucene backend! Before I leave you to the Javadocs, though, it’s important to tackle a few bits of plumbing: building and running Lucene’s tests against your codec.

Plumbing! Unit Tests & More

Let’s take care of a bit of plumbing. How do we set up a project for a codec? How do we run Lucene’s tests against our implementation to confirm Solr/Lucene will function with our changes?

Using Maven to set up the project is fairly straightforward. Luckily, I’ve created a Lucene Codec hello world project on GitHub to get you started. It captures the Maven project setup, so feel free to fork it and skip the first two steps below. You’ll still need to read on to learn how to run the Lucene tests against your codec.

First, we’ll start by creating a straightforward Maven project with a pom that depends on lucene-core at the version your codec targets.

Second, we’ll need to publish via our META-INF/services directory that we have a class that implements the Codec interface. This advertises our codec to Lucene through Java’s service provider mechanism, which is how Lucene looks codecs up by name. Under src/main/resources/services create a file called org.apache.lucene.codecs.Codec. The file should contain a single line with the fully qualified name (package plus class name) of your Codec class.
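If you haven’t used META-INF/services before, the sketch below shows the general mechanism with the JDK’s own java.util.ServiceLoader. It is an illustration under assumptions, not Lucene’s loader: ToyCodec, MyToyCodec, forName and loaderWithServicesFile are hypothetical names, and the provider file is written to a temp directory at runtime only so the example is self-contained. In a real project the file is a static resource in your jar, and the Lucene-side lookup is Codec.forName.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ServiceLoader;

public class SpiDemo {
    /** Hypothetical stand-in for Lucene's Codec class. */
    public interface ToyCodec {
        String getName();
    }

    /** Hypothetical provider; it needs a public no-arg constructor to be instantiated. */
    public static class MyToyCodec implements ToyCodec {
        public MyToyCodec() {}

        @Override public String getName() { return "MyCodec"; }
    }

    /** Looks a provider up by name among everything registered for ToyCodec. */
    public static ToyCodec forName(String name, ClassLoader loader) {
        for (ToyCodec codec : ServiceLoader.load(ToyCodec.class, loader)) {
            if (codec.getName().equals(name)) {
                return codec;
            }
        }
        throw new IllegalArgumentException("No codec registered under name '" + name + "'");
    }

    /** Writes a provider file into a temp directory and returns a loader that can see it. */
    public static ClassLoader loaderWithServicesFile() {
        try {
            Path root = Files.createTempDirectory("spi-demo");
            Path services = Files.createDirectories(root.resolve("META-INF").resolve("services"));
            // File name: the service type's name. Content: one line naming the provider class.
            Files.write(services.resolve(ToyCodec.class.getName()),
                    MyToyCodec.class.getName().getBytes(StandardCharsets.UTF_8));
            return new URLClassLoader(new URL[] { root.toUri().toURL() },
                    SpiDemo.class.getClassLoader());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ClassLoader loader = loaderWithServicesFile();
        System.out.println(forName("MyCodec", loader).getClass().getName());
    }
}
```

The takeaway is that nothing registers the provider in code. The file on the classpath is the whole registration, which is why a missing or misnamed file means the codec can’t be found.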


We’ll also need to tell Maven to copy this file into target/META-INF by declaring it in the pom as a resource to be copied into the target folder.


Third, pull down the full Lucene/Solr source tree. Let’s test our codec from the command line!

Package your codec into a jar:

mvn package

Under the Lucene source tree, run a single Lucene test. Pass the codec to use with the -Dtests.codec argument, and pass the jar you just packaged with the -lib argument. This proves that Lucene can find and load your codec. If Lucene can’t find or load your codec, you’ll get an appropriate error right away.

C:\solr\solr-4.3.0\lucene>ant -Dtestcase=TestSegmentTermDocs -Dtests.codec=MyCodec -lib "C:\path\to\target\codec-1.0-SNAPSHOT.jar" test

Now run all the tests!

C:\solr\solr-4.3.0\lucene>ant -Dtests.codec=MyCodec -lib "C:\path\to\target\codec-1.0-SNAPSHOT.jar" test

Fourth, it’s naturally going to be convenient to debug Lucene unit tests running our codec in Eclipse. Here’s what we need to do:

  1. Load your codec and the Lucene source code into Eclipse.
  2. Create a new JUnit debug configuration
  3. Select the radio button for “Run all the tests in the selected project, package, or folder”
  4. Enter the folder /path/to/lucene/core/src/test
  5. Select the JUnit 4 test runner
  6. In the Arguments tab, for VM arguments specify:

    -ea -Dtests.codec=MyCodec

  7. In the “Classpath” tab, make sure both your codec project and the Solr/Lucene projects are selected

Now you should be able to launch this debug configuration and go to town! Go forth and make some awesome codecs! Let us know about the codec you’re working on; we’d love to hear about it!


Published at DZone with permission of Doug Turnbull, DZone MVB. See the original article here.
