A Guide to Natural Language Processing (Part 5)

The NLP libraries in this article can be used for multiple purposes, so let's get started with learning about all of them!

Be sure to check out Part 1, Part 2, Part 3, and Part 4 before reading the final post in this series!

Understanding Documents

Yup, we're still talking about understanding documents! Let's specifically talk about libraries this time.

The Best Libraries Available

The following libraries can be used for multiple purposes, so we have organized this section by library name. Most of them are written in Python or Java.

Apache OpenNLP

The Apache OpenNLP library is a machine learning-based toolkit for the processing of natural language text. It supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution. These tasks are usually required to build more advanced text processing services. OpenNLP also includes maximum entropy and perceptron-based machine learning.

Apache OpenNLP is a Java library with excellent documentation that can fulfill most of the tasks we have just discussed, except for sentiment analysis and translation. The developers provide language models for a few languages in addition to English; the most notable are German, Spanish, and Portuguese.

The Classical Language Toolkit

The Classical Language Toolkit (CLTK) offers natural language processing (NLP) support for the languages of Ancient, Classical, and Medieval Eurasia. Greek and Latin functionality are currently most complete.

As the name implies, the major feature of the Classical Language Toolkit is its support for classical (ancient) languages, such as Greek and Latin. It has basic NLP tools, such as a lemmatizer, but also indispensable tools for working with ancient languages, such as transliteration support, and more specialized features like clausulae analysis. It has good documentation, and it is your only choice for ancient languages.

FreeLing

FreeLing is a C++ library providing language analysis functionalities (morphological analysis, named entity detection, PoS-tagging, parsing, Word Sense Disambiguation, Semantic Role Labelling, etc.) for a variety of languages (English, Spanish, Portuguese, Italian, French, German, Russian, Catalan, Galician, Croatian, Slovene, among others).

It is a library with good documentation and even a demo. It supports many languages usually excluded by other tools, but it is released under the Affero GPL, which is probably the least user-friendly license ever conceived.

Moses

Moses is a statistical machine translation system that allows you to automatically train translation models for any language pair. All you need is a collection of translated texts (parallel corpus). Once you have a trained model, an efficient search algorithm quickly finds the highest probability translation among the exponential number of choices.

The only thing to add is that the system is written in C++ and there is ample documentation.

NLTK

NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.

Natural Language Toolkit (NLTK) is probably the best-known NLP library for Python. The library can accomplish many tasks in different ways (that is, using different algorithms). It even has good documentation (if you include the freely available book).

Simply put, it is the standard library for NLP research. One issue that some people have with it is exactly that: it is designed for research and educational purposes. If there are ten ways to do something, NLTK lets you choose among them all. The intended user is a person with a deep understanding of NLP.

TextBlob is a library that builds upon NLTK (and Pattern) to simplify processing of textual data. The library also provides translation, but it does not implement it directly: it is simply an interface for Google Translate.

Pattern

Pattern is the most peculiar software in our collection because it is a collection of Python libraries for web mining. It has support for data mining from services such as Google and Twitter (that is, it provides functions to search Google or Twitter directly), an HTML parser, and many other things. Among these things, there is natural language processing for English and a few other languages, including German, Spanish, French, and Italian, though English support is more advanced than the rest.

The pattern.en module contains a fast part-of-speech tagger for English (identifies nouns, adjectives, verbs, etc. in a sentence), sentiment analysis, tools for English verb conjugation and noun singularization and pluralization, and a WordNet interface.

The modules for the other languages support only POS tagging.

Polyglot

Polyglot is a set of NLP libraries for many natural languages in Python. It looks great, although it has little documentation.

It supports fewer languages for the more advanced tasks, such as POS tagging (16) or named entity recognition (40). However, for sentiment analysis and language identification, it can work with more than a hundred of them.

Sentiment and Sentiment

Sentiment is a JavaScript (Node.js) library for sentiment analysis. The library relies on AFINN (a collection of English words with an associated emotional value) and a similar database for emoji. These databases associate each word or emoji with a positive or negative value that indicates a positive or negative sentiment. For example, the word joy has a score of 3, while sad has -2.

The code for the library itself is quite trivial, but it works, and it is easy to use.

var sentiment = require('sentiment');

var r1 = sentiment('Cats are stupid.');
console.dir(r1);        // Score: -2, Comparative: -0.666

var r2 = sentiment('Cats are totally amazing!');
console.dir(r2);        // Score: 4, Comparative: 1
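To make the scoring scheme described above more concrete, here is a minimal sketch of lexicon-based scoring. It is not the library's actual code: the `lexicon`, `tokenize`, and `scoreText` names are hypothetical, and the four-word lexicon is only an illustrative stand-in for AFINN, with the values for stupid and amazing inferred from the example output above. It assumes that the comparative value is the score divided by the number of tokens, which is consistent with the numbers shown.

```javascript
// Tiny illustrative lexicon in the style of AFINN: word -> integer valence.
// The real word list is far larger; these four entries are an assumption
// made for this sketch.
const lexicon = new Map([
  ['joy', 3],
  ['sad', -2],
  ['stupid', -2],
  ['amazing', 4],
]);

// Lowercase the text and split it into word tokens, dropping punctuation.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z']+/g) || [];
}

// Sum the valence of every known token; unknown words contribute 0.
// "comparative" normalizes the score by the number of tokens.
function scoreText(text) {
  const tokens = tokenize(text);
  const score = tokens.reduce((sum, t) => sum + (lexicon.get(t) || 0), 0);
  const comparative = tokens.length > 0 ? score / tokens.length : 0;
  return { score, comparative, tokens };
}

console.log(scoreText('Cats are stupid.'));
console.log(scoreText('Cats are totally amazing!'));
```

A lookup table like this is easy to understand and extend, but it scores each word in isolation, so a phrase such as "not amazing" would still come out positive in this sketch.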

There is also a Python library called sentiment. Its documentation is minimal, although a paper and a demo are also available. The paper mentions that:

We have explored different methods of improving the accuracy of a Naive Bayes classifier for sentiment analysis. We observed that a combination of methods like negation handling, word n-grams and feature selection by mutual information results in a significant improvement in accuracy.

This means that it can be a good starting point for understanding how to build your own sentiment analysis library.
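As an illustration of one technique the paper lists, the following is a minimal sketch of a Naive Bayes classifier with simple negation handling. It is not the paper's implementation: word n-grams and feature selection by mutual information are left out, and the `NEGATIONS` list, the `features` function, the `NaiveBayes` class, and the four training sentences are all hypothetical choices made for this example. The idea is that every word following a negation is rewritten with a `not_` prefix until the next punctuation mark, so that "good" and "not good" become different features.

```javascript
// Words that open a negated scope (a small, hypothetical list).
const NEGATIONS = new Set(['not', 'no', 'never', "don't", "isn't"]);

// Tokenize the text and prefix every word inside a negated scope with
// "not_". A punctuation mark closes the scope.
function features(text) {
  const parts = text.toLowerCase().match(/[a-z']+|[.,!?;]/g) || [];
  const out = [];
  let negated = false;
  for (const p of parts) {
    if (/^[.,!?;]$/.test(p)) { negated = false; continue; }
    if (NEGATIONS.has(p)) { negated = true; continue; }
    out.push(negated ? 'not_' + p : p);
  }
  return out;
}

// Multinomial Naive Bayes with add-one (Laplace) smoothing.
class NaiveBayes {
  constructor() {
    this.docCounts = new Map();  // label -> number of training documents
    this.wordCounts = new Map(); // label -> Map(feature -> count)
    this.totals = new Map();     // label -> total number of features seen
    this.vocab = new Set();      // all features seen during training
  }

  train(text, label) {
    this.docCounts.set(label, (this.docCounts.get(label) || 0) + 1);
    if (!this.wordCounts.has(label)) this.wordCounts.set(label, new Map());
    const counts = this.wordCounts.get(label);
    for (const f of features(text)) {
      counts.set(f, (counts.get(f) || 0) + 1);
      this.totals.set(label, (this.totals.get(label) || 0) + 1);
      this.vocab.add(f);
    }
  }

  // Return the label with the highest log-probability for the text.
  classify(text) {
    const feats = features(text);
    const nDocs = [...this.docCounts.values()].reduce((a, b) => a + b, 0);
    let best = null;
    let bestLogP = -Infinity;
    for (const [label, n] of this.docCounts) {
      const counts = this.wordCounts.get(label);
      const total = this.totals.get(label) || 0;
      let logP = Math.log(n / nDocs); // class prior
      for (const f of feats) {
        logP += Math.log(((counts.get(f) || 0) + 1) / (total + this.vocab.size));
      }
      if (logP > bestLogP) { best = label; bestLogP = logP; }
    }
    return best;
  }
}

// Toy training data, invented for this example.
const nb = new NaiveBayes();
nb.train('I love this great movie', 'pos');
nb.train('A good and fun film', 'pos');
nb.train('I hate this bad movie', 'neg');
nb.train('This film is not good', 'neg');

console.log(features('This is not good, but fun!'));
console.log(nb.classify('It is not good.'));
console.log(nb.classify('A good film'));
```

With so little training data the result is only a toy, but it shows why negation handling matters: without the `not_` prefix, "not good" would share the feature "good" with the positive examples.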

spaCy: Industrial-Strength Natural Language Processing in Python

The library spaCy claims to be a much more efficient, ready-for-the-real-world, and easy-to-use library than NLTK. In practical terms, it has two advantages over NLTK:

  1. Better performance.

  2. It does not ask you to pick, among many algorithms, the one you think is best. Instead, it chooses the best one for each task. While fewer choices might seem bad, it can actually be a good thing if you have no idea what the algorithms do and would have to learn them before making a decision.

Overall, it is a library that supports most of the basic tasks we mentioned (that is, things like named entity recognition, POS tagging, and dependency parsing, but not translation) with great code-first documentation.

Textacy is a library built on top of spaCy for higher-level NLP tasks. Basically, it simplifies some things, including features for cleaning data or managing it better.

The Stanford Natural Language Processing Group Software

The Stanford NLP Group makes some of our natural language processing software available to everyone! We provide statistical NLP, deep learning NLP, and rule-based NLP tools for major computational linguistics problems, which can be incorporated into applications with human language technology needs. These packages are widely used in industry, academia, and government.

The Stanford NLP Group creates and supports many great tools that cover all the purposes we have just mentioned. The only thing missing is sentiment analysis. The most notable packages are CoreNLP and Parser. The parser can be seen in action in a web demo. CoreNLP is a combination of several tools, including the parser.

The tools are all in Java. The parser supports a few languages: English, Chinese, Arabic, Spanish, etc. The only downside is that the tools are licensed under the GPL. Commercial licensing is available for proprietary software.

Excluded Software

We think that the libraries we chose are the best ones for parsing, or processing, natural languages. However, we excluded some other interesting software that is usually mentioned, like CogCompNLP or GATE, for several reasons:

  • It might have little to no documentation.

  • It might have a purely educational or otherwise non-standard license.

  • It might not be designed for developers, but for end-users.

Summary

In this series, we have seen many ways to deal with a document in a natural language to get the information you need from it. Most of them tried to find smart ways to bypass the complex task of parsing natural language. Although natural languages are hard to parse, it is still possible to do so if you use the libraries available.

Essentially, when dealing with natural languages, hacking a solution is the suggested way of doing things, since nobody can figure out how to do it properly.

Where it was possible, we explained the algorithms that you can use. For the most advanced tasks, this would have been impractical, so we just pointed at ready-to-use libraries. In any case, if you think we missed something, be it a subject or an important library, please contact us.
