Lucene's SimpleText codec
Join the DZone community and get the full member experience.
Join For FreeInspired by this question
on the Lucene user's list, I created a new codec in Lucene called the
SimpleText codec. The best ideas come from the user's lists!
This is of course only available in Lucene's current trunk, to be eventually released as the next major release (4.0). Flexible indexing makes is easy to swap in different codecs to do the actual writing and reading of postings data to/from the index, and we have several fun codecs already available and more on the way...
Unlike all other codecs, which save the postings data in compact binary files, this codec writes all postings to a single human-readable text file, like this:
The codec is read/write, and fully functional. All of Lucene's unit tests pass (slowly) with this codec (which, by the way, is an awesome way to test your own codecs).
Note that the performance of SimpleText is quite poor, as expected! For example, there is no terms index for fast seeking to a specific term, no skipping data for fast seeking within a posting list, some operations require linear scanning, etc. So don't use this one in production!
But it should be useful for transparency, debugging, learning, teaching or anyone who is simply just curious about what exactly Lucene stores in its inverted index.
This is of course only available in Lucene's current trunk, to be eventually released as the next major release (4.0). Flexible indexing makes is easy to swap in different codecs to do the actual writing and reading of postings data to/from the index, and we have several fun codecs already available and more on the way...
Unlike all other codecs, which save the postings data in compact binary files, this codec writes all postings to a single human-readable text file, like this:
field contents
term file
doc 0
pos 5
term is
doc 0
pos 1
term second
doc 0
pos 3
term test
doc 0
pos 4
term the
doc 0
pos 2
term this
doc 0
pos 0
END
The codec is read/write, and fully functional. All of Lucene's unit tests pass (slowly) with this codec (which, by the way, is an awesome way to test your own codecs).
Note that the performance of SimpleText is quite poor, as expected! For example, there is no terms index for fast seeking to a specific term, no skipping data for fast seeking within a posting list, some operations require linear scanning, etc. So don't use this one in production!
But it should be useful for transparency, debugging, learning, teaching or anyone who is simply just curious about what exactly Lucene stores in its inverted index.
Lucene
Published at DZone with permission of Michael Mccandless, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.
Trending
-
Top 10 Engineering KPIs Technical Leaders Should Know
-
Reactive Programming
-
Building a Flask Web Application With Docker: A Step-by-Step Guide
-
5 Key Concepts for MQTT Broker in Sparkplug Specification
Comments