OpenNLP Named Entity Recognition
Learn about named entity recognition to extract a named entity from a text with OpenNLP in a Java project using pre-trained model files.
In this article, we will discuss how to extract named entities from text using Apache OpenNLP. We will create a sample Maven-based Java project and configure OpenNLP in it. We will use the pre-trained model files en-ner-location.bin, en-ner-person.bin, and en-ner-organization.bin, which OpenNLP provides for this purpose.
Named-entity recognition (NER) is a subtask of information extraction that seeks to locate named entities in text and classify them into predefined categories such as names of persons, organizations, and locations, time expressions, quantities, monetary values, and percentages.
The first requirement for getting started with NER is to download the required model files from here. I downloaded en-ner-location.bin, en-ner-person.bin, and en-ner-token.bin and kept them in my local workspace under the /resources folder.
The next step is to add the Maven dependency required for this setup.
<dependency>
    <groupId>org.apache.opennlp</groupId>
    <artifactId>opennlp-tools</artifactId>
    <version>1.8.1</version>
</dependency>
OpenNLP provides a common way to detect all of these named entities. First, we load the pre-trained model file and use it to instantiate a TokenNameFinderModel object.
Let's get started with en-ner-person.bin.
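// Load the person model from the classpath; try-with-resources closes the stream afterwards
TokenNameFinderModel model;
try (InputStream inputStream = getClass().getResourceAsStream("/en-ner-person.bin")) {
    model = new TokenNameFinderModel(inputStream);
}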
After the model is loaded, we instantiate the NameFinderME class and use its find() method to locate the entities. This method takes the tokens of a text as input, so we first need to tokenize the text. You can visit my other post about OpenNLP tokenization to learn more about tokenization. The following example extracts person names from tokens.
NameFinderME nameFinder = new NameFinderME(model);
// tokenize() is a helper that splits the paragraph into an array of tokens
String[] tokens = tokenize(paragraph);
Span[] nameSpans = nameFinder.find(tokens);
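The tokenize(paragraph) helper is not shown in the snippet above. As a rough illustration of what it has to return, here is a minimal stand-in that uses only the JDK. It is an assumption for demonstration purposes and not OpenNLP's tokenizer: the class name SimpleTokenizeSketch and the regular expression are hypothetical, and a real project would more likely use one of the tokenizers that ship with OpenNLP.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the article's tokenize(paragraph) helper.
// It is NOT OpenNLP's tokenizer: it only separates word characters from punctuation.
class SimpleTokenizeSketch {

    // A token is either a run of word characters or a single punctuation mark.
    private static final Pattern TOKEN = Pattern.compile("\\w+|[^\\w\\s]");

    static String[] tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // Prints one token per line: Mike, Smith, lives, in, Boston, .
        for (String token : tokenize("Mike Smith lives in Boston.")) {
            System.out.println(token);
        }
    }
}
```

The important point is the shape of the result: a String[] in which each word and each punctuation mark is its own element, because the name finder reports entities as index ranges into this array.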
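The find() method of NameFinderME returns an array of Span objects. Each span holds the start index (inclusive) and end index (exclusive) of one entity in the token array, so to get the actual text of the named entity we need to read each span in a loop. The following example reads each span and extracts the named entity.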
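for (Span s : nameSpans) {
    // A span covers tokens [start, end), so join all of them to get the full name
    // (requires import java.util.Arrays)
    System.out.println(String.join(" ", Arrays.copyOfRange(tokens, s.getStart(), s.getEnd())));
}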
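To make the span indexing concrete, here is a small sketch that runs without OpenNLP or any model file. TokenSpan is a hypothetical stand-in for OpenNLP's Span that keeps only a start index (inclusive) and an end index (exclusive), and the span values are written by hand rather than produced by a model.

```java
import java.util.Arrays;

// TokenSpan is a hypothetical stand-in for OpenNLP's Span, holding only
// a start index (inclusive) and an end index (exclusive) into the token array.
class SpanTextSketch {

    static final class TokenSpan {
        final int start; // index of the first token in the entity
        final int end;   // index one past the last token in the entity

        TokenSpan(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    // Joins every token covered by the span into a single string.
    static String coveredText(String[] tokens, TokenSpan span) {
        return String.join(" ", Arrays.copyOfRange(tokens, span.start, span.end));
    }

    public static void main(String[] args) {
        String[] tokens = {"Mike", "Smith", "lives", "in", "Boston", "."};
        TokenSpan person = new TokenSpan(0, 2); // hand-written span, not model output
        System.out.println(coveredText(tokens, person)); // prints "Mike Smith"
    }
}
```

A name that spans two tokens is only recovered in full when every index from start up to, but not including, end is read. With the real library, the static Span.spansToStrings(spans, tokens) helper should give the same joined strings for a whole array of spans.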
This prints each person name found in the text, if there is any. Similarly, we can load en-ner-location.bin or en-ner-organization.bin and follow the same approach to extract location and organization names from any text.
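When several models are run over the same tokens, it can be handy to collect the results by entity type. The sketch below shows one possible way to do that with plain JDK collections. TypedSpan is a hypothetical stand-in for a span that carries a type label, and the labels "person" and "location" as well as the span values are assumptions written by hand, not output from the OpenNLP models.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class EntitiesByTypeSketch {

    // Hypothetical stand-in for a typed span: token range [start, end) plus a label.
    static final class TypedSpan {
        final int start;
        final int end;
        final String type;

        TypedSpan(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    // Groups the text of each span under its type, keeping first-seen order of types.
    static Map<String, List<String>> groupByType(String[] tokens, List<TypedSpan> spans) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (TypedSpan span : spans) {
            String text = String.join(" ", Arrays.copyOfRange(tokens, span.start, span.end));
            result.computeIfAbsent(span.type, key -> new ArrayList<>()).add(text);
        }
        return result;
    }

    public static void main(String[] args) {
        String[] tokens = {"Mike", "Smith", "lives", "in", "Boston", "."};
        // Hand-written spans standing in for the output of two different models.
        List<TypedSpan> spans = Arrays.asList(
                new TypedSpan(0, 2, "person"),
                new TypedSpan(4, 5, "location"));
        System.out.println(groupByType(tokens, spans)); // prints {person=[Mike Smith], location=[Boston]}
    }
}
```

In a real project, the list of spans would be filled by calling find() on one NameFinderME instance per model and adding each result to the same list.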
This article covered named entity recognition using OpenNLP in a Java project. In the next article, we will look into named entity recognition using Stanford NLP.
Published at DZone with permission of Dhiraj Ray. See the original article here.