Using OpenNLP for Named-Entity-Recognition in Scala

OpenNLP is a great, openly licensed alternative to Stanford NLP that can be used from Scala for custom Named Entity Recognition. This post walks through a detailed example of training and using a model.

A common challenge in Natural Language Processing (NLP) is Named Entity Recognition (NER): the process of extracting specific pieces of data from a body of text, commonly people, places and organizations (for example, extracting the names of all the people mentioned in a Wikipedia article). NER has been tackled many times over the evolution of NLP, from dictionary-based approaches, to rule-based systems, to statistical models and, more recently, neural networks.

Whilst there have been recent attempts to crack the problem without it, the crux of the issue is that for any approach to learn, it needs a large corpus of marked-up training data. There are some marked-up corpora available, but the problem is still quite domain specific, so a model trained on the WSJ data might not perform particularly well against your own data, and finding a set of 100,000 marked-up sentences is no easy feat. There are approaches that tackle this by generating training data, but it can be hard to generate truly representative data, so this approach always risks over-fitting to the generated data.

Having previously looked at Stanford's NLP library for some sentiment analysis, this time I am looking at using the OpenNLP library. Stanford's library is often referred to as the benchmark for several NLP problems. However, these benchmarks are always against the data it was trained on, so out of the box we likely won't get amazing results against a custom dataset. Further to this, the Stanford library is licensed under the GPL, which makes it harder to use in any kind of commercial/startup setting. The OpenNLP library has been around for several years, and one of its strengths is its API: it is documented well enough to get up and running quickly, and it is very extensible.

Training a Custom NER

Once again, for this exercise we are going back to the BBC recipe archive for the source data — we are going to try and train an OpenNLP model that can identify ingredients.

To train the model we need some example sentences. The OpenNLP documentation recommends at least 15,000 marked-up sentences, so for this I annotated a bunch of the recipe steps and ended up with somewhere in the region of 45,000 sentences.

Bring a large pan of salted water to the boil, then add the <START:ingredient> cauliflower <END> and cook for two minutes.


As you can see in the above example, the marked-up sentences are quite straightforward: we just wrap the ingredient in the tags as above. Note, however, that the word must be padded by a space on either side inside the tags, otherwise it will fail!
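
The spacing rule is easy to get wrong when annotating by hand, so it can help to generate the tags programmatically. The following is a minimal sketch, not the script used for this post: `TrainingDataFormat`, `annotate` and the ingredient set are hypothetical names, and it only matches single whitespace-separated words against a known list.

```scala
object TrainingDataFormat {
  // Hypothetical helper: wraps each known ingredient in OpenNLP name-finder
  // tags, always keeping a space on both sides of the tagged word.
  def annotate(sentence: String, ingredients: Set[String]): String =
    sentence.split("\\s+").filter(_.nonEmpty).map { token =>
      // Split off trailing punctuation so that "butter," still matches "butter"
      val word = token.reverse.dropWhile(c => !c.isLetterOrDigit).reverse
      val trailing = token.drop(word.length)
      if (ingredients.contains(word.toLowerCase)) {
        val tagged = s"<START:ingredient> $word <END>"
        // Keep punctuation as its own token so the <END> tag stays space-padded
        if (trailing.isEmpty) tagged else s"$tagged $trailing"
      } else token
    }.mkString(" ")
}
```

For example, `TrainingDataFormat.annotate("Add the cauliflower and cook.", Set("cauliflower"))` returns `Add the <START:ingredient> cauliflower <END> and cook.` A simple lookup like this will miss multi-word ingredients and anything not in the list, so generated data still needs a manual check.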

Once we have our training data, we can easily set up some code to feed it in and train our model:

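import java.io.{BufferedOutputStream, File, FileOutputStream}
import java.nio.charset.StandardCharsets

import opennlp.tools.ml.maxent.quasinewton.QNTrainer
import opennlp.tools.namefind.{NameFinderME, NameSampleDataStream, TokenNameFinderFactory, TokenNameFinderModel}
import opennlp.tools.util.{MarkableFileInputStreamFactory, ObjectStream, PlainTextByLineStream, TrainingParameters}

def trainModel(): Unit = {
  // Read the annotated sentences, one per line
  // (OpenNLP 1.6 and later accept an InputStreamFactory here)
  val lineStream: ObjectStream[String] = new PlainTextByLineStream(
    new MarkableFileInputStreamFactory(new File("src/main/resources/trainingdata.txt")), StandardCharsets.UTF_8)
  val sampleStream = new NameSampleDataStream(lineStream)

  // Train a name finder using the quasi-Newton maximum entropy trainer
  val model: TokenNameFinderModel =
    try {
      val params = TrainingParameters.defaultParams()
      params.put(TrainingParameters.ALGORITHM_PARAM, QNTrainer.MAXENT_QN_VALUE)
      NameFinderME.train("en", "food", sampleStream, params, new TokenNameFinderFactory())
    } finally {
      sampleStream.close()
    }

  // Write the trained model to disk so it can be reloaded later
  val modelOut = new BufferedOutputStream(new FileOutputStream("src/main/resources/en-ingredients-finder.bin"))
  try {
    model.serialize(modelOut)
  } finally {
    modelOut.close()
  }
}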

This is a very simple example that does not always follow engineering best practices, but you get the idea of what's going on. We open a stream over our training data set, ask the Maximum Entropy name finder class (NameFinderME) to train a model, and then write that model to disk for future use.

When we want to use the model, we can simply load it back into the OpenNLP Name Finder class and use that to parse the input text we want to check:

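import java.io.FileInputStream

// Load the model written by trainModel(); sampleRecipe is the recipe text already split into tokens (Array[String])
val modelIn = new FileInputStream("src/main/resources/en-ingredients-finder.bin")
val model = new TokenNameFinderModel(modelIn)
val nameFinder = new NameFinderME(model)
// find returns one Span per match, indexing into the token array
val matches = nameFinder.find(sampleRecipe)
matches.foreach { m =>
  sampleRecipe.slice(m.getStart, m.getEnd).foreach(println(_))
}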
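
Note that the name finder works on tokens rather than raw text, and each match is a span of token indices whose end is exclusive. The loop above prints each token of a match on its own line, so a multi-word ingredient would be split up. Below is a small sketch of joining the tokens back into phrases. `TokenSpan` is a hypothetical stand-in for OpenNLP's `Span` (only its start and end are used) so that the snippet runs without the library, and the whitespace split is a simplification of a real tokenizer.

```scala
object SpanPhrases {
  // Hypothetical stand-in for opennlp.tools.util.Span: token indices, end exclusive
  final case class TokenSpan(start: Int, end: Int)

  // Naive whitespace tokenization; a real pipeline would use a proper tokenizer
  def tokenize(text: String): Array[String] =
    text.trim.split("\\s+").filter(_.nonEmpty)

  // Join the tokens covered by each span into a single phrase
  def phrases(tokens: Array[String], spans: Seq[TokenSpan]): Seq[String] =
    spans.map(s => tokens.slice(s.start, s.end).mkString(" "))
}
```

With the tokens of "Melt the unsalted butter gently", `SpanPhrases.phrases(tokens, Seq(SpanPhrases.TokenSpan(2, 4)))` yields the single phrase "unsalted butter". When working with the real library, `Span.spansToStrings(matches, sampleRecipe)` should do the same job.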


So, once I had created some training data in the required format and trained a model, I wanted to see how well it had actually worked. Obviously, I don't want to run it against one of the original recipes, as they were used to train the model, so I selected this recipe for rosemary-caramel millionaire shortbread to see how it performed. Here are the ingredients it found:
  • butter
  • sugar
  • rosemary
  • caramel
  • shortbread

All in all, pretty good. It missed some ingredients, but that's to be expected given the training data was created in about 20 minutes by manipulating the original recipe set with some Groovy, and it did well in not returning false positives.
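
To put numbers on "missed some ingredients but no false positives", the found ingredients can be compared with a hand-checked list using precision and recall. This is a sketch rather than part of the original project: `NerScore` is a hypothetical name, and any expected list you feed it has to come from checking the recipe by hand.

```scala
object NerScore {
  // Precision: share of found items that are correct.
  // Recall: share of expected items that were found.
  final case class Score(precision: Double, recall: Double)

  def score(found: Set[String], expected: Set[String]): Score = {
    val truePositives = found.intersect(expected).size.toDouble
    // Guard against division by zero when either set is empty
    val precision = if (found.isEmpty) 0.0 else truePositives / found.size
    val recall = if (expected.isEmpty) 0.0 else truePositives / expected.size
    Score(precision, recall)
  }
}
```

For instance, if a hand-checked list held eight ingredients and the five found above were all among them, this would give a precision of 1.0 and a recall of 0.625. The figure of eight is purely illustrative.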

In conclusion, if you have a decent training set, or have the means to generate some data with a decent range, you can get some pretty good results using the library. As usual, the code for the project is on GitHub (although it is little more than the code shown in this post).

Topics:
opennlp, natural language processing, big data, scala

Published at DZone with permission of Rob Hinds, DZone MVB. See the original article here.
