Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Anatomy of a Lucene Tokenizer

DZone's Guide to

Anatomy of a Lucene Tokenizer

· Big Data Zone
Free Resource

Effortlessly power IoT, predictive analytics, and machine learning applications with an elastic, resilient data infrastructure. Learn how with Mesosphere DC/OS.

A term is the unit of search in Lucene. A Lucene document comprises of a set of terms. Tokenization means splitting up a string into tokens, or terms.

A Lucene Tokenizer is what both Lucene (and correspondingly, Solr) uses to tokenize text.

To implement a custom Tokenizer, you extend org.apache.lucene.analysis.Tokenizer.

The only method you need to implement is public boolean incrementToken(). incrementToken returns false for EOF, true otherwise.

Tokenizers generally take a Reader input in the constructor, which is the source to be tokenized.

With each invocation of incrementToken(), the Tokenizer is expected to return new tokens, by setting the values of TermAttributes. This happens by adding TermAttributes to the superclass, usually as fields in the Tokenizer. e.g.

public class MyCustomTokenizer extends Tokenizer {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

Here, a CharTermAttribute is added to the superclass. A CharTermAttribute stored the term text.

Here's one way to set the value of the term text in incrementToken().

public boolean incrementToken() {
      if(done) return false;
      done = true;
      int upto = 0;
      char[] buffer = new char[512];
      while (true) {
        final int length = input.read(buffer, upto, buffer.length - upto); // input is the reader set in the ctor
        if (length == -1) break;
        upto += length;
        sb.append(buffer);
      }
      termAtt.append(sb.toString());
      return true;
}

And that's pretty much all you need to start writing custom Lucene tokenizers!

Learn to design and build better data-rich applications with this free eBook from O’Reilly. Brought to you by Mesosphere DC/OS.

Topics:

Published at DZone with permission of Kelvin Tan. See the original article here.

Opinions expressed by DZone contributors are their own.

THE DZONE NEWSLETTER

Dev Resources & Solutions Straight to Your Inbox

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.

X

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}