Using OpenNLP for Named-Entity-Recognition in Scala

OpenNLP is a great alternative to Stanford's NLP library: it is liberally licensed and easy to use and extend from Scala. This tutorial walks through a detailed example of training and using a custom Named Entity Recognition model.

Rob Hinds · Nov. 11, 16 · Tutorial

A common challenge in Natural Language Processing (NLP) is Named Entity Recognition (NER): the process of extracting specific pieces of data from a body of text, commonly people, places, and organizations (for example, extracting the names of all people mentioned in a Wikipedia article). NER is a problem that has been tackled many times over the evolution of NLP, from dictionary-based approaches, to rule-based approaches, to statistical models, and more recently using neural nets.

Whilst there have been recent attempts to crack the problem without it, the crux of the issue is that for any of these approaches to learn, it needs a large corpus of marked-up training data. There are some marked-up corpora available, but the problem is still quite domain specific, so training on WSJ data might not perform particularly well against your own domain's data, and finding a set of 100,000 marked-up sentences is no easy feat. One way to tackle this is to generate training data, but it can be hard to generate truly representative data, so this approach always risks over-fitting to the generated data.

Having previously looked at Stanford's NLP library for some sentiment analysis, this time I am looking at the OpenNLP library. Stanford's library is often referred to as the benchmark for several NLP problems; however, these benchmarks are always against the data it is trained for, so out of the box we likely won't get amazing results against a custom dataset. Further to this, the Stanford library is licensed under the GPL, which makes it harder to use in any kind of commercial/startup setting. The OpenNLP library has been around for several years, and one of its strengths is its API: it's well documented, easy to get up and running with, and very extendable.
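If you want to follow along, the library is published on Maven Central under org.apache.opennlp. A minimal sbt dependency along these lines should be enough for the examples below (the version shown is only an assumption of roughly what was current at the time of writing; use whichever release you are on):

// build.sbt - pull in OpenNLP from Maven Central (version is an assumption)
libraryDependencies += "org.apache.opennlp" % "opennlp-tools" % "1.6.0"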

Training a Custom NER

Once again, for this exercise we are going back to the BBC recipe archive for the source data — we are going to try and train an OpenNLP model that can identify ingredients.

To train the model we need some example sentences; the OpenNLP documentation recommends at least 15,000 marked-up sentences to train a model that performs well, so for this I annotated a bunch of the recipe steps and ended up with somewhere in the region of 45,000 sentences.

Bring a large pan of salted water to the boil, then add the <START:ingredient> cauliflower <END> and cook for two minutes.


As you can see in the above example, the marked-up sentences are quite straightforward: we just wrap the ingredient in the tags as above (although note that if the word itself isn't padded by a space on either side inside the tags, training will fail!).
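The annotation for this post was done with a quick script over the original recipe data (see the note about Groovy later on). Purely as an illustration of the idea, and not the actual script used, something like the following sketch could bootstrap marked-up sentences from recipe steps where the ingredient list is already known; recipeSteps and knownIngredients here are hypothetical inputs:

// Rough sketch only: wrap known ingredient words in <START:ingredient> ... <END> tags.
// recipeSteps and knownIngredients are hypothetical inputs, not from the original project.
def annotate(sentence: String, ingredients: Set[String]): String =
  sentence.split(" ").map { token =>
    // strip punctuation so "cauliflower," still matches "cauliflower"
    val word = token.replaceAll("""[^\p{L}\p{N}]""", "").toLowerCase
    if (ingredients.contains(word)) s"<START:ingredient> $token <END>" else token
  }.mkString(" ")

val knownIngredients = Set("cauliflower", "butter", "sugar")
val recipeSteps = Seq("Bring a large pan of salted water to the boil, then add the cauliflower and cook for two minutes.")
recipeSteps.map(annotate(_, knownIngredients)).foreach(println)

Real data needs more care than this (multi-word ingredients, plurals, and so on), which is partly why hand-checking the generated sentences is still worthwhile.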

Once we have our training data, we can easily set up some code to feed it in and train our model:

import java.io.{BufferedOutputStream, FileInputStream, FileOutputStream}
import java.nio.charset.Charset

import opennlp.tools.ml.maxent.quasinewton.QNTrainer
import opennlp.tools.namefind.{NameFinderME, NameSampleDataStream, TokenNameFinderFactory, TokenNameFinderModel}
import opennlp.tools.util.{ObjectStream, PlainTextByLineStream, TrainingParameters}

def trainModel(): TokenNameFinderModel = {

    // read the annotated sentences (one per line) from the training file
    val charset = Charset.forName("UTF-8")
    val lineStream: ObjectStream[String] =
      new PlainTextByLineStream(new FileInputStream("src/main/resources/trainingdata.txt"), charset)
    val sampleStream = new NameSampleDataStream(lineStream)

    // train a maximum entropy model (quasi-Newton trainer) for the "food" entity type
    val model =
      try {
        val params = TrainingParameters.defaultParams()
        params.put(TrainingParameters.ALGORITHM_PARAM, QNTrainer.MAXENT_QN_VALUE)
        NameFinderME.train("en", "food", sampleStream, params, new TokenNameFinderFactory())
      } finally {
        sampleStream.close()
      }

    // serialise the trained model to disk for future use
    val modelOut = new BufferedOutputStream(new FileOutputStream("src/main/resources/en-ingredients-finder.bin"))
    try {
      model.serialize(modelOut)
    } finally {
      modelOut.close()
    }

    model
  }

This is a very simple example of how you can do it, not always paying attention to engineering best practices, but you get the idea of what's going on: we get an input stream of our training data set, then instantiate the maximum entropy name finder class and ask it to train a model, which we can then write to disk for future use.

When we want to use the model, we can simply load it back into the OpenNLP Name Finder class and use that to parse the input text we want to check:

// load the trained model back from disk (imports as in the training example)
val modelIn = new FileInputStream("src/main/resources/en-ingredients-finder.bin")
val model = new TokenNameFinderModel(modelIn)
val nameFinder = new NameFinderME(model)

// find() expects a tokenised sentence (this sample sentence is just illustrative);
// each returned Span gives the token range of a detected ingredient
val sampleRecipe = "Melt the butter and sugar in a pan over a low heat".split(" ")
val matches = nameFinder.find(sampleRecipe)
matches.foreach { m =>
  sampleRecipe.slice(m.getStart, m.getEnd).foreach(println)
}


So, once I had created some training data in the required format and trained a model, I wanted to see how well it had actually worked. Obviously, I didn't want to run it against one of the original recipes, as they were used to train the model, so I selected this recipe for rosemary-caramel millionaire shortbread to see how it performed. Here are the ingredients it found:
  • butter
  • sugar
  • rosemary
  • caramel
  • shortbread

All in all, pretty good. It missed some ingredients, but given that the training data was created in about 20 minutes by just manipulating the original recipe set with some Groovy, that's to be expected really, and it did well in not returning false positives.
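If false positives did start creeping in, one option (not something used in this write-up) would be to look at the confidence OpenNLP assigns to each span via NameFinderME's probs method and drop low-scoring matches. A minimal sketch, reusing the nameFinder and sampleRecipe from above, with an arbitrary 0.7 threshold:

// sketch: keep only spans the model is reasonably confident about (0.7 is an arbitrary threshold)
val spans = nameFinder.find(sampleRecipe)
val confident = spans.zip(nameFinder.probs(spans)).collect {
  case (span, prob) if prob > 0.7 => sampleRecipe.slice(span.getStart, span.getEnd).mkString(" ")
}
confident.foreach(println)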

In conclusion, if you have a decent training set, or have the means to generate some data with a decent range, you can get some pretty good results using the library. As usual, the code for the project is on GitHub (although it is little more than the code shown in this post).

Named-entity recognition, OpenNLP, NLP, Data (computing), Scala (programming language)

Published at DZone with permission of Rob Hinds, DZone MVB. See the original article here.
