How to Use the Apache OpenNLP POS Tagger
The Apache OpenNLP POS tagger marks up the words in a text with their parts of speech, a common step in natural language processing (NLP). Read on to learn how to use it!
As per Wikipedia, POS tagging is "the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context, i.e. its relationship with adjacent and related words in a phrase, sentence, or paragraph. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc."
To begin, the text is tokenized: it is divided into tokens, and the tagger then labels each token with its part of speech for further processing. Tagging is a basic pre-processing step for text retrieval and text indexing. You can see an Apache OpenNLP tokenization example here.
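To make the tokenize-then-tag idea concrete, here is a toy sketch in plain Java. It is not how OpenNLP works internally: a real tagger uses a trained statistical model, while this example looks words up in a tiny hand-written word list. The class name ToyPosPipeline, the word list, and the NN fallback are all assumptions made up for illustration.

```java
import java.util.Map;

// Toy illustration only: a real tagger such as OpenNLP's uses a trained
// statistical model, not a fixed word list.
class ToyPosPipeline {

    // Hypothetical mini-lexicon mapping lower-cased words to Penn Treebank style tags.
    static final Map<String, String> LEXICON = Map.of(
            "she", "PRP",
            "loves", "VBZ",
            "is", "VBZ",
            "from", "IN",
            "and", "CC");

    // Step 1: split the raw text into tokens on runs of whitespace.
    static String[] tokenize(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    // Step 2: assign one tag per token; unknown words fall back to NN (noun).
    static String[] tag(String[] tokens) {
        String[] tags = new String[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            tags[i] = LEXICON.getOrDefault(tokens[i].toLowerCase(), "NN");
        }
        return tags;
    }

    public static void main(String[] args) {
        String[] tokens = tokenize("she loves coding");
        String[] tags = tag(tokens);
        for (int i = 0; i < tokens.length; i++) {
            System.out.print(tags[i] + ":" + tokens[i] + " ");
        }
        // prints: PRP:she VBZ:loves NN:coding
    }
}
```

The point of the sketch is the shape of the pipeline: tokens go in as an array, and an array of tags with the same length comes out, so tags[i] always belongs to tokens[i].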
To get started with OpenNLP tagging, first include the following dependency in the pom.xml file.
<dependency>
    <groupId>org.apache.opennlp</groupId>
    <artifactId>opennlp-tools</artifactId>
    <version>1.8.1</version>
</dependency>
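OpenNLP provides a pre-trained English model called en-pos-maxent.bin for POS tagging. Before tagging any text, we first load this model. The code below reads it from the classpath, so the file must be available there (for example, under src/main/resources in a Maven project). The following lines of code load the model.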
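// Requires java.io.IOException, java.io.InputStream,
// opennlp.tools.postag.POSModel and opennlp.tools.postag.POSTaggerME
private POSModel model;
private POSTaggerME tagger;

public void initialize() {
    // try-with-resources closes the stream once the model has been read
    try (InputStream modelStream = getClass().getResourceAsStream("/en-pos-maxent.bin")) {
        if (modelStream == null) {
            throw new IOException("en-pos-maxent.bin not found on the classpath");
        }
        model = new POSModel(modelStream);
        tagger = new POSTaggerME(model);
    } catch (IOException e) {
        System.out.println(e.getMessage());
    }
}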
After the tagger is initialized, we tokenize the sentence and apply tags to the resulting tokens. Here is an example:
public void tag(String sentence) {
    // Load the model only once; initialize() leaves tagger null if loading failed
    if (tagger == null) {
        initialize();
    }
    if (tagger == null) {
        return;
    }
    // Requires opennlp.tools.tokenize.WhitespaceTokenizer.
    // Split on whitespace; punctuation stays attached to the neighbouring word
    String[] tokens = WhitespaceTokenizer.INSTANCE.tokenize(sentence);
    // tags[i] is the part-of-speech tag for tokens[i]
    String[] tags = tagger.tag(tokens);
    for (int i = 0; i < tokens.length; i++) {
        System.out.print(tags[i] + ":" + tokens[i] + " ");
    }
}
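One thing to keep in mind about the tag method is that WhitespaceTokenizer splits only on whitespace, so a punctuation mark stays glued to the word before it. OpenNLP ships other tokenizers (such as SimpleTokenizer and the model-based TokenizerME) that handle this. The plain-Java sketch below only illustrates the idea of detaching trailing punctuation before tagging. The class name PunctuationSplitter and the set of punctuation marks are assumptions, and this is not OpenNLP's own algorithm.

```java
import java.util.ArrayList;
import java.util.List;

class PunctuationSplitter {

    // Splits on whitespace, then detaches one trailing punctuation mark
    // (. , ! ? ; :) from each token so it becomes a token of its own.
    static List<String> tokenize(String sentence) {
        List<String> tokens = new ArrayList<>();
        for (String raw : sentence.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue; // only happens for an empty or blank sentence
            }
            char last = raw.charAt(raw.length() - 1);
            if (raw.length() > 1 && ".,!?;:".indexOf(last) >= 0) {
                tokens.add(raw.substring(0, raw.length() - 1));
                tokens.add(String.valueOf(last));
            } else {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("She loves coding."));
        // prints: [She, loves, coding, .]
    }
}
```

With tokens like these, the word and the period would each receive their own tag instead of sharing one.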
For a sentence like "Otri is from Mars and she loves coding.", the tag method prints output similar to the following, with each token shown as tag:word.
NNP:Otri VBZ:is IN:from NNP:Mars CC:and PRP:she VBZ:loves .:coding.
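The tags in this output appear to follow the Penn Treebank tag set. As a reading aid, the plain-Java sketch below turns a tag:word line into short descriptions. The class name TagGlossary is hypothetical, the glossary covers only the tags seen in the sample output, and the parsing assumes the tag itself contains no colon.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class TagGlossary {

    // Descriptions for the Penn Treebank tags that appear in the sample output.
    static final Map<String, String> DESCRIPTIONS = Map.of(
            "NNP", "proper noun, singular",
            "VBZ", "verb, 3rd person singular present",
            "IN", "preposition or subordinating conjunction",
            "CC", "coordinating conjunction",
            "PRP", "personal pronoun",
            ".", "sentence-final punctuation");

    // Parses "TAG:word TAG:word ..." into one "word -> description" line per token.
    static List<String> explain(String taggedLine) {
        List<String> lines = new ArrayList<>();
        for (String pair : taggedLine.trim().split("\\s+")) {
            int colon = pair.indexOf(':');
            if (colon <= 0) {
                continue; // skip anything that is not in TAG:word form
            }
            String tag = pair.substring(0, colon);
            String word = pair.substring(colon + 1);
            lines.add(word + " -> " + DESCRIPTIONS.getOrDefault(tag, "unknown tag " + tag));
        }
        return lines;
    }

    public static void main(String[] args) {
        explain("NNP:Otri VBZ:is IN:from").forEach(System.out::println);
        // prints:
        // Otri -> proper noun, singular
        // is -> verb, 3rd person singular present
        // from -> preposition or subordinating conjunction
    }
}
```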
And that's it! Next time, we'll look at the Stanford NLP POS tagger with Maven.
Published at DZone with permission of Dhiraj Ray. See the original article here.