Over a million developers have joined DZone.

Simplistic noun-phrase chunking with POS tags in Java

· Java Zone

Discover how AppDynamics steps in to upgrade your performance game and prevent your enterprise from these top 10 Java performance problems, brought to you in partnership with AppDynamics.

I needed to extract Noun-Phrases from text. The way this is generally done is using Part-of-Speech (POS) tags. OpenNLP has a both a POS-tagger as well as a Noun-Phrase chunker. However, it's really really really slow!

I decided to look into alternatives, and chanced upon QTag.

QTag is a "freely available, language independent POS-Tagger. It is implemented in Java, and has been successfully tested on Mac OS X, Linux, and Windows."

It's waaay faster than OpenNLP for POS-tagging, though I haven't done any benchmarks as to a accuracy.

Here's my really simplistic but adequate implementation of noun-phrase chunking using QTag.

private Qtag qt;
  public static List<String> chunkQtag(String str) throws IOException {
    List<String> result = new ArrayList<String>();
    if (qt == null) {
      qt = new Qtag("lib/english");
      qt.setOutputFormat(2);
    }

    String[] split = str.split("\n");
    for (String line : split) {
      String s = qt.tagLine(line, true);
      String lastTag = null;
      String lastToken = null;
      StringBuilder accum = new StringBuilder();
      for (String token : s.split("\n")) {
        String[] s2 = token.split("\t");
        if (s2.length < 2) continue;
        String tag = s2[1];

        if (tag.equals("JJ")
            || tag.startsWith("NN")
            || tag.startsWith("??")
            || (lastTag != null && lastTag.startsWith("NN") && s2[0].equalsIgnoreCase("of"))
            || (lastToken != null && lastToken.equalsIgnoreCase("of") && s2[0].equalsIgnoreCase("the"))
            ) {
          accum.append(s2[0]).append("-");
        } else {
          if (accum.length() > 0) {
            accum.deleteCharAt(accum.length() - 1);
            result.add(accum.toString());
            accum = new StringBuilder();
          }
        }
        lastTag = tag;
        lastToken = s2[0];
      }
      if (accum.length() > 0) {
        accum.deleteCharAt(accum.length() - 1);
        result.add(accum.toString());
      }
    }
    return result;
  }

The method returns a list of noun phrases. 

The Java Zone is brought to you in partnership with AppDynamics. AppDynamics helps you gain the fundamentals behind application performance, and implement best practices so you can proactively analyze and act on performance problems as they arise, and more specifically with your Java applications. Start a Free Trial.

Topics:

Published at DZone with permission of Kelvin Tan .

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}