Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Tokenizing Second- and Top-Level Domains for a URL in Lucene and Solr

DZone's Guide to

Tokenizing Second- and Top-Level Domains for a URL in Lucene and Solr

· Big Data Zone
Free Resource

Read about the story and evolution of the data warehouse, as well as the major challenges that these systems are currenly facing.  Brought to you in partnership with Panopoly.

In my previous post, I described how to extract second- and top-level domains from a URL in Java.

Now, I'll build a Lucene Tokenizer out of it, and a Solr TokenizerFactory class.

DomainTokenizer doesn't do anything really fancy. It first returns the hostname as the first token, then the 2nd-level domain as the second token, and the top-level domain as the last token.

e.g. given the URL http://www.supermind.org, it'll return

www.supermind.org
.supermind.org
.org

Doing so allows you to quickly return all documents in the Lucene or Solr index matching the second-level domain or the TLD.

package org.supermind.solr.analysis;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

import java.io.IOException;
import java.io.Reader;
import java.net.URL;

public class DomainTokenizer extends Tokenizer {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

  public static final int STATE_UNINITIALIZED = -1;
  public static final int STATE_INITIALIZED = 0;
  public static final int STATE_2LD = 1;
  public static final int STATE_TLD = 2;
  public static final int STATE_DONE = 3;

  private int state = STATE_UNINITIALIZED;

  private URL url = null;
  private SecondLDExtractor extractor;
  private boolean index2LD;
  private boolean indexTLD;

  public DomainTokenizer(Reader input, SecondLDExtractor extractor, boolean index2LD, boolean indexTLD) {
    super(input);
    this.extractor = extractor;
    this.index2LD = index2LD;
    this.indexTLD = indexTLD;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (state == STATE_DONE) return false;

    clearAttributes();
    if (this.url == null) {
      state = STATE_INITIALIZED;

      StringBuilder sb = new StringBuilder();
      int upto = 0;
      char[] buffer = new char[512];
      while (true) {
        final int length = input.read(buffer, upto, buffer.length - upto);
        if (length == -1) break;
        upto += length;
        sb.append(buffer);
      }
      this.url = new URL(sb.toString());
      if (!index2LD && !indexTLD) state = STATE_DONE;
      termAtt.append(url.getHost());
      return true;
    } else if (index2LD && state < STATE_2LD) {
      state = STATE_2LD;
      String twold = extractor.extract2LD(url.getHost());
      termAtt.append("."+twold);
      return true;
    } else if (indexTLD && state < STATE_TLD) {
      state = STATE_TLD;
      String tld = extractor.extractTLD(url.getHost());
      termAtt.append(tld);
      return true;
    }
    state = STATE_DONE;
    return false;
  }
}

and here's the corresponding Solr TokenizerFactory.

package org.supermind.solr.analysis;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.solr.analysis.BaseTokenizerFactory;

import java.io.Reader;
import java.util.Map;

public class DomainTokenizerFactory extends BaseTokenizerFactory {
  private SecondLDExtractor extractor;
  private boolean index2LD;
  private boolean indexTLD;

  @Override
  public void init(Map<String, String> args) {
    super.init(args);
    assureMatchVersion();
    index2LD = getBoolean("index2LD", true);
    indexTLD = getBoolean("indexTLD", true);
    if (index2LD || indexTLD) {
      initTLDExtractor();
    }
  }

  private void initTLDExtractor() {
    extractor = new SecondLDExtractor();
    extractor.init();
  }

  public Tokenizer create(Reader input) {
    DomainTokenizer tokenizer = new DomainTokenizer(input, extractor, index2LD, indexTLD);
    return tokenizer;
  }
}

Here's a sample fieldType definition.

<fieldType name="domain" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="org.supermind.solr.analysis.DomainTokenizerFactory"/>
      </analyzer>
</fieldType>

Read about the story and evolution of the data warehouse, as well as the major challenges that these systems are currenly facing.  Brought to you in partnership with Panopoly.

Topics:

Published at DZone with permission of Kelvin Tan. See the original article here.

Opinions expressed by DZone contributors are their own.

THE DZONE NEWSLETTER

Dev Resources & Solutions Straight to Your Inbox

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.

X

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}