
Entity recognition with Scala and Stanford NLP Named Entity Recognizer


The following sample extracts the text of a court case from a PDF and attempts to recognize names and locations using the Named Entity Recognizer from Stanford NLP. As the sample output shows, it is fairly good at finding nouns, but it does not always identify the type of each noun correctly.

The entities I'd really like to see for this kind of document are different (companies, law firms, lawyers, etc.), but this test is good enough. The bundled classifiers let you choose among three sets of recognizable entity types: {Location, Person, Organization}, {Location, Person, Organization, Misc}, and {Time, Location, Organization, Person, Money, Percent, Date}. Extracting the PDF text and classifying it takes about five seconds in total.
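To make the choice between those label sets concrete, here is a minimal sketch that pairs each set with one of the model file names used in the listing further down. The pairing is inferred from the "3class", "4class", and "7class" parts of the file names, the upper-case label spellings follow the tags in the sample output, and `LabelSets` and `smallestCovering` are hypothetical names invented for this illustration, not part of Stanford NER.

```scala
object LabelSets {
  // Assumed pairing of model file -> label set, based on the class counts in the file names.
  val byClassifier: Map[String, Set[String]] = Map(
    "english.all.3class.distsim.crf.ser.gz" ->
      Set("LOCATION", "PERSON", "ORGANIZATION"),
    "english.conll.4class.distsim.crf.ser.gz" ->
      Set("LOCATION", "PERSON", "ORGANIZATION", "MISC"),
    "english.muc.7class.distsim.crf.ser.gz" ->
      Set("TIME", "LOCATION", "ORGANIZATION", "PERSON", "MONEY", "PERCENT", "DATE")
  )

  // Hypothetical helper: the model with the fewest labels that still covers every wanted label.
  def smallestCovering(wanted: Set[String]): Option[String] =
    byClassifier.toSeq
      .filter { case (_, labels) => wanted.subsetOf(labels) }
      .sortBy { case (_, labels) => labels.size }
      .headOption
      .map { case (fileName, _) => fileName }
}
```

Note that no single set covers both Misc and the date or money types, so a request for both returns `None` in this sketch.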

For this text, switching between these label sets sometimes changed the label assigned to the same noun: one time it's a person, another time an organization, and so on. One improvement might be to run several classifiers and let them vote. The classifier also sometimes drops words: if a subject is listed with a first, middle, and last name, it sometimes picks up just two of them. I've noticed similar issues with company names.
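The voting idea could look roughly like the sketch below. It is not something I have run against the real classifiers: it assumes each classifier has already produced one label per token over the same tokenization, and `LabelVote`, `vote`, and `voteAll` are hypothetical names. Resolving ties to "O" (no entity) is an arbitrary choice made for this sketch.

```scala
object LabelVote {
  // Majority vote over the labels several classifiers gave a single token.
  // An empty input or a tie for first place yields "O" (no entity).
  def vote(labels: Seq[String]): String =
    if (labels.isEmpty) "O"
    else {
      val counts = labels.groupBy(identity).map { case (label, hits) => (label, hits.size) }
      val best = counts.values.max
      val winners = counts.collect { case (label, n) if n == best => label }
      if (winners.size == 1) winners.head else "O"
    }

  // perClassifier(k)(i) is the label classifier k assigned to token i.
  // Every inner sequence must have the same length, or transpose will throw.
  def voteAll(perClassifier: Seq[Seq[String]]): Seq[String] =
    perClassifier.transpose.map(vote)
}
```

One caveat with this design: a label that only one model can produce, such as a date or money type, will always be outvoted when the other two models disagree with it, so plain majority voting suits the shared Location, Person, and Organization labels best.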

import org.apache.tika.parser.pdf._
import org.apache.tika.metadata._
import org.apache.tika.parser._
import java.io._
import org.xml.sax._
import edu.stanford.nlp.ie.crf.CRFClassifier
import edu.stanford.nlp.ling.CoreAnnotations
 
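// SAX handler that accumulates the text Tika extracts from the PDF.
object pdfHandler extends ContentHandler {
  val contents: StringBuffer = new StringBuffer()

  // Append only the slice of the buffer that belongs to this event;
  // the rest of ch may hold unrelated or stale characters.
  def characters(ch: Array[Char], start: Int, length: Int): Unit = {
    contents.append(ch, start, length)
  }

  // The remaining callbacks are required by ContentHandler but unused here.
  def endDocument(): Unit = {}

  def endElement(uri: String, localName: String, qName: String): Unit = {}

  def endPrefixMapping(prefix: String): Unit = {}

  def ignorableWhitespace(ch: Array[Char], start: Int, length: Int): Unit = {}

  def processingInstruction(target: String, data: String): Unit = {}

  def setDocumentLocator(locator: Locator): Unit = {}

  def skippedEntity(name: String): Unit = {}

  def startDocument(): Unit = {}

  def startElement(uri: String, localName: String, qName: String, atts: Attributes): Unit = {}

  def startPrefixMapping(prefix: String, uri: String): Unit = {}
}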
 
object pdf extends App {
  // Path to the court opinion PDF; adjust for your machine.
  val file = """e:\data\11-1285_i4dk.pdf"""

  val parser: PDFParser = new PDFParser()

  val stream: InputStream = new FileInputStream(file)
  val metadata: Metadata = new Metadata()
  val context: ParseContext = new ParseContext()

  // Tika pushes the extracted text into pdfHandler as SAX events.
  try {
    parser.parse(stream, pdfHandler, metadata, context)
  } finally {
    stream.close()
  }

  val contents: String = pdfHandler.contents.toString()
  println(contents)

  // Models shipped with the Stanford NER download (3, 4, and 7 entity classes).
  val src = "stanford-ner-2013-04-04/classifiers/"
  val classifier1 = "english.all.3class.distsim.crf.ser.gz"
  val classifier2 = "english.conll.4class.distsim.crf.ser.gz"
  val classifier3 = "english.muc.7class.distsim.crf.ser.gz"

  // Swap in classifier2 or classifier3 to try a different label set.
  val serializedClassifier = src + classifier1

  val classifier = CRFClassifier.getClassifierNoExceptions(serializedClassifier)
  val out = classifier.classify(contents)

  // Returns true for a real entity label: "O" means "outside any entity",
  // and "" is the start-of-sentence sentinel used below.
  def isEntity(label: String): Boolean = label != "O" && label != ""

  var words = 0
  for (i <- 0 until out.size()) {
    val sentence = out.get(i)
    var oldWordClass = ""

    for (j <- 0 until sentence.size()) {
      val word = sentence.get(j)
      val wordClass = word.get(classOf[CoreAnnotations.AnswerAnnotation]) + ""

      // When the label changes, close the previous entity and open the new one.
      if (oldWordClass != wordClass) {
        if (isEntity(oldWordClass)) print("[/" + oldWordClass + "]")
        if (isEntity(wordClass)) print("[" + wordClass + "]")
      }

      oldWordClass = wordClass

      words = words + 1
      print(word)
      print(" ")

      // Wrap the output after every eleven tokens.
      if (words > 10) {
        words = 0
        println(" ")
      }
    }

    // Close an entity that runs to the end of the sentence.
    if (isEntity(oldWordClass)) print("[/" + oldWordClass + "]")
  }
}

11-1285 [ORGANIZATION]US Airways , Inc. [/ORGANIZATION]v.
[PERSON]McCutchen [/PERSON]-LRB- 4\/16\/13 -RRB- 1 -LRB-
Slip Opinion -RRB- OCTOBER TERM ,
2012 Syllabus NOTE : Where it
is feasible , a syllabus -LRB-
headnote -RRB- will be released ,
as isbeing done in connection with
this case , at the time
the opinion is issued . The
syllabus constitutes no part of the
opinion of the Court but has
beenprepared by the Reporter of Decisions
for the convenience of the reader
. See [LOCATION]United States [/LOCATION]v. [ORGANIZATION]Detroit
Timber & Lumber Co. [/ORGANIZATION], 200
U. S. 321 , 337 .
SUPREME COURT OF THE [ORGANIZATION]UNITED STATES
Syllabus US AIRWAYS [/ORGANIZATION], INC. ,
IN ITS CAPACITY AS FIDUCIARY AND
PLAN ADMINISTRATOR OF THE [LOCATION]US [/LOCATION]AIRWAYS
, INC. . EMPLOYEE BENEFITS PLAN
v. [PERSON]MCCUTCHEN [/PERSON]ET AL. . CERTIORARI
TO THE [ORGANIZATION]UNITED STATES [/ORGANIZATION]COURT OF
APPEALS FOR THE THIRD CIRCUIT No.
11 -- 1285 . Argued November
27 , 2012 -- Decided April
16 , 2013 The health benefits
plan established by petitioner [ORGANIZATION]US Airways
[/ORGANIZATION]paid $ 66,866 in medical expenses
for injuries suffered by respondentMcCutchen ,
a [ORGANIZATION]US Airways [/ORGANIZATION]employee , in
a car accident caused by athird
party . The plan entitled [ORGANIZATION]US
Airways [/ORGANIZATION]to reimbursement if
[PERSON]McCutchen [/PERSON]
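The listing above only prints the tags inline. To get the entities out as data instead, the same "group consecutive tokens with the same label" rule can be written as a fold. This is a minimal sketch, assuming the input is a sequence of (token, label) pairs such as each word and its AnswerAnnotation from the loop; `EntitySpans`, `Entity`, and `collect` are hypothetical names for this illustration.

```scala
object EntitySpans {
  final case class Entity(label: String, text: String)

  // Groups consecutive tokens that share a non-"O" label into one entity.
  // tagged holds (token, label) pairs in document order.
  def collect(tagged: Seq[(String, String)]): List[Entity] = {
    val start = (List.empty[Entity], Option.empty[Entity])
    val (done, open) = tagged.foldLeft(start) {
      case ((acc, current), (token, label)) =>
        current match {
          // Same label as the open entity: extend it with this token.
          case Some(e) if e.label == label =>
            (acc, Some(e.copy(text = e.text + " " + token)))
          // Label changed: close the open entity (if any), maybe open a new one.
          case _ =>
            val closed = current.toList ::: acc
            val next =
              if (label == "O" || label.isEmpty) None
              else Some(Entity(label, token))
            (closed, next)
        }
    }
    // Entities were prepended, so close the last one and restore document order.
    (open.toList ::: done).reverse
  }
}
```

For example, `EntitySpans.collect(Seq(("US", "ORGANIZATION"), ("Airways", "ORGANIZATION"), ("v.", "O"), ("McCutchen", "PERSON")))` yields an ORGANIZATION entity "US Airways" followed by a PERSON entity "McCutchen". Like the printing loop, this merges two adjacent entities of the same type into one, since the labels carry no begin/inside markers.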



Published at DZone with permission of Gary Sieling, DZone MVB. See the original article here.
