We're still talking about understanding documents here. Let's continue!
You can apply the same techniques employed to create summaries to other tasks, particularly the more advanced, semantics-based ones. Note that creating a single summary for many documents is itself a distinct task: you have to take into account the different lengths of the documents and avoid repetition, among other things.
A natural application is the identification of similar documents: if you can devise a method to identify the most meaningful sentences of one document, you can also compare the meaning of two documents.
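The article does not prescribe a specific comparison method; one simple option is to represent each document as a bag-of-words vector and measure the cosine similarity between the vectors. The sketch below (pure Python, invented example data) illustrates the idea:

```python
from collections import Counter
from math import sqrt

def similarity(doc_a, doc_b):
    """Cosine similarity between the bag-of-words vectors of two documents.
    Returns 1.0 for identical word distributions, 0.0 for no shared words."""
    a = Counter(doc_a.lower().split())
    b = Counter(doc_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print(similarity("the cat sat", "the cat ran"))
```

A real system would first keep only the most meaningful sentences (or weight words by TF-IDF) instead of comparing raw word counts.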
Another objective that shares these techniques is information retrieval. In short, if a user searches for one word, say car, you could use some of these techniques to also find documents containing automobile.
Finally, there is topic modeling, which consists of finding the topics of a collection of documents. In simple terms, it means grouping together words with similar themes. It relies on statistical methods more complex than the ones used for the creation of summaries. The current state of the art is based on a method called Latent Dirichlet Allocation (LDA).
Gensim is a very popular, production-ready library that implements many such applications. Naturally, it is written in Python.
Mallet is a Java library mainly designed for topic modeling.
Most computer languages are easy to parse. The same is not true for natural languages. There are approaches that give good results, but ultimately this is still an open area of research. Fundamentally, the issue is that in a natural language, parsing a sentence (i.e., analyzing its syntax) and understanding its meaning are interconnected. A subject, a verb, a noun, or an adverb are all just words, and most words that can act as a subject can also act as an object.
In practical terms, this means that there are no ready-to-use libraries that are good for every use you can think of. We present some libraries that can be used for restricted tasks, such as recognizing parts of speech, which can also be useful to improve other methods, like the ones for creating summaries.
There is also the frustrating fact that a lot of this software is made by academic researchers, which means it can easily be abandoned in favor of another approach, or lack documentation. You cannot really use work-in-progress, badly maintained software for anything productive. Especially if you care about a language other than English, you might find yourself looking at a nice working demo that was written ten years ago by somebody with no contact information and no open-source code available.
You Need Data
To achieve any kind of result with parsing, or generally with extracting information from a natural language document, you need a lot of data to train the algorithms. Such a collection of data is called a corpus. For a system that uses statistical or machine learning techniques, you might just need a lot of real-world data, possibly divided into the proper groups (e.g., Wikipedia articles divided by category).
However, if you are using a smart system, you might need this corpus to be manually constructed or annotated (e.g., the word dog is a noun that has these X possible meanings). A smart system is one that tries to imitate human understanding, or at least one that uses a process that can be followed by humans. For instance, a parser that relies on a grammar with rules such as phrase > subject verb (a phrase is made of a subject and a verb), but that also defines several classes of verbs that humans would not normally use (e.g., verbs related to motion).
In these cases, the corpus often uses a custom format and is built for specific needs. For example, a system that can answer geographical questions about the United States uses information stored in a Prolog format. The natural consequence is that even generally available information, such as dictionary data, can be incompatible between different programs.
On the other hand, there are also databases so valuable that many programs are built around them. WordNet is an example of such a database. It is a lexical database that links groups of words with similar meanings (i.e., synonyms) to their associated definitions. It thus works as both a dictionary and a thesaurus. The original version is for English, but it has inspired similar databases for other languages.
What You Can Do
We have presented some of the practical challenges of building your own library to understand text. And we have not even mentioned all the issues related to the ambiguity of human languages. So, differently from what we did in past sections, we are just going to explain what you can do. We are not going to explain the algorithms used to realize these tasks, both because there is no space and because, without the necessary data, they would be worthless. Instead, in the following paragraphs, we introduce the most used libraries that you can employ to achieve what you need.
Named-entity recognition basically means finding the entities mentioned in a document. For example, in the phrase John Smith is going to Italy, it should identify John Smith and Italy as entities. It should also be able to correctly keep track of them across different documents.
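To make the task concrete, here is a deliberately naive entity spotter that just collects maximal runs of capitalized words. Real systems (Stanford NER, spaCy, and similar) use trained statistical models instead; this is only a sketch of the input/output shape of the task.

```python
import re

def naive_entities(text):
    """Very naive entity spotter: maximal runs of capitalized words.
    A trained NER model replaces this heuristic in any real system."""
    candidates = re.findall(r"(?:[A-Z][a-z]+ )*[A-Z][a-z]+", text)
    # drop lone sentence-initial words to cut obvious false positives
    return [c for c in candidates if " " in c or not text.startswith(c)]

print(naive_entities("John Smith is going to Italy"))
```

The heuristic already fails on lowercase entities, acronyms, and sentence-initial names, which is precisely why statistical models dominate here.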
Sentiment analysis classifies the sentiment expressed by a phrase. In the most basic terms, it means understanding whether a phrase makes a positive or a negative statement. A naive Bayes classifier can suffice for this level of understanding. It works in a similar way to a spam filter: it divides messages into two categories (e.g., spam and non-spam), relying on the probability of each word appearing in either category.
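The mechanics of that classifier can be sketched in a few lines. This is a minimal add-one-smoothed naive Bayes over word counts, with an invented four-document training set; production systems would use a library and far more data.

```python
from collections import Counter
from math import log

class NaiveBayes:
    """Minimal naive Bayes text classifier with add-one smoothing,
    the same idea a spam filter uses. A sketch, not production code."""

    def fit(self, docs, labels):
        self.labels = set(labels)
        self.label_counts = Counter(labels)
        self.word_counts = {label: Counter() for label in self.labels}
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, doc):
        def log_score(label):
            total = sum(self.word_counts[label].values())
            # log prior probability of the category
            score = log(self.label_counts[label] / sum(self.label_counts.values()))
            # log likelihood of each word under the category (add-one smoothed)
            for w in doc.lower().split():
                score += log((self.word_counts[label][w] + 1)
                             / (total + len(self.vocab)))
            return score
        return max(self.labels, key=log_score)

clf = NaiveBayes().fit(
    ["great movie wonderful", "awful boring movie",
     "wonderful acting", "boring plot awful"],
    ["pos", "neg", "pos", "neg"])
print(clf.predict("great wonderful"))  # leans positive on this toy data
```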
An alternative is to manually associate an emotional value with each word: for example, a negative value (say, between -10 and 0) for catastrophic and a positive one (say, between 0 and 10) for nice.
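Scoring a phrase with such a lexicon is then a matter of summing the values of the words it contains. The values below are invented; real valence lexicons such as AFINN follow the same scheme.

```python
# Hypothetical valence lexicon; values are invented for illustration.
VALENCE = {"catastrophic": -8, "bad": -3, "nice": 4, "wonderful": 8}

def sentiment_score(text):
    """Sum the valence of every known word; the sign gives the polarity."""
    return sum(VALENCE.get(w, 0) for w in text.lower().split())

print(sentiment_score("a nice day"))          # positive
print(sentiment_score("catastrophic and bad"))  # negative
```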
If you need a subtler evaluation, you have to resort to machine learning techniques.
Parts of Speech Tagging
Parts-of-speech tagging (usually abbreviated as POS tagging) means identifying and labelling the different parts of speech (i.e., determining which words are nouns, verbs, adjectives, etc.). While it is an integral part of parsing, it can also be used to simplify other tasks. For instance, it can be used in the creation of summaries to simplify the sentences chosen for the summary (e.g., by removing subordinate clauses).
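The input/output shape of a tagger is easy to show. The toy lexicon-lookup tagger below is only a sketch with invented entries; real taggers use statistical models trained on annotated corpora such as the Penn Treebank, whose tag names (NN, VBZ, DT, ...) are borrowed here.

```python
# Toy tag lexicon, invented for illustration; real taggers are trained models.
TAGS = {"the": "DT", "dog": "NN", "cat": "NN",
        "runs": "VBZ", "quickly": "RB"}

def naive_tagger(sentence):
    """Assign each word its lexicon tag, defaulting to noun (NN)."""
    return [(w, TAGS.get(w.lower(), "NN")) for w in sentence.split()]

print(naive_tagger("the dog runs"))
```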
A lemmatizer returns the lemma for a given word and part-of-speech tag. Basically, it gives the corresponding dictionary form of a word. In some ways, it can be considered an advanced form of stemmer, and it can be used for similar purposes; namely, it can ensure that all the different forms of a word are correctly linked to the same concept.
For instance, it can transform all instances of cats into cat for search purposes. However, it can also distinguish between run as in the verb to run and run as in the noun synonym of a jog.
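This is why a lemmatizer takes the POS tag as a second input: the same surface word maps to different lemmas depending on its role. The lookup table below is a hand-made toy; a real lemmatizer (e.g., NLTK's WordNet-based one) backs this with a full lexical database.

```python
# Toy lemma table keyed by (word, POS); invented entries for illustration.
LEMMAS = {
    ("cats", "NOUN"): "cat",
    ("ran", "VERB"): "run",
    ("running", "VERB"): "run",
    ("runs", "NOUN"): "run",   # the noun "run", as in a jog
}

def lemmatize(word, pos):
    """Return the dictionary form of `word` for the given part of speech."""
    return LEMMAS.get((word.lower(), pos), word.lower())

print(lemmatize("cats", "NOUN"))
print(lemmatize("Running", "VERB"))
```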
Parts-of-speech tagging can be considered the equivalent of lexing for natural languages. Chunking, also known as shallow parsing, is a step above parts-of-speech tagging, but one below full parsing. It connects parts of speech into higher units of meaning, for example complements. Imagine the phrase John always wins our matches of Russian roulette:
A POS-tagger identifies that Russian is an adjective and roulette a noun
A chunker groups together (of) Russian roulette as a complement or two related parts of speech
The chunker might work to produce units that are going to be used by a parser. It can also work independently, for example to help in named-entity recognition.
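A minimal chunker can be written as a pass over (word, tag) pairs that groups adjacent adjectives and nouns into noun-phrase chunks. This is a sketch of the idea only; real chunkers (e.g., NLTK's RegexpParser) use richer grammars over the tags.

```python
def chunk_noun_phrases(tagged):
    """Group maximal runs of adjectives (JJ) and nouns (NN, NNS) into
    noun-phrase chunks. `tagged` is a list of (word, tag) pairs,
    such as the output of a POS-tagger."""
    chunks, current = [], []
    for word, tag in tagged:
        if tag in ("JJ", "NN", "NNS"):   # adjective or noun: extend the chunk
            current.append(word)
        else:                            # anything else closes the chunk
            if current:
                chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks

tagged = [("matches", "NNS"), ("of", "IN"),
          ("Russian", "JJ"), ("roulette", "NN")]
print(chunk_noun_phrases(tagged))
```

On the fragment above it yields the chunks "matches" and "Russian roulette", matching the grouping described for the example phrase.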
The end result is the same as for computer languages: a parse tree. The process, though, is quite different: it might start with a probabilistic grammar, or even with no grammar at all, and it usually continues with heavy use of probabilities and statistical methods.
The following is a parse tree created by the Stanford Parser (which we are going to see later) for the phrase My dog likes hunting cats and people. Groups of letters such as NP indicate parts of speech or complements.
(ROOT (S (NP (PRP$ My) (NN dog)) (VP (VBZ likes) (NP (NN hunting) (NNS cats) (CC and) (NNS people)))))
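If you need to consume this output programmatically, the bracketed format is a simple s-expression, and a short recursive reader turns it into nested Python lists. This is a sketch for well-formed input only; the Stanford tools also offer structured output formats of their own.

```python
def parse_sexpr(text):
    """Parse a bracketed parse tree, as printed by the Stanford Parser,
    into nested Python lists. Assumes balanced, well-formed input."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()

    def read(pos):
        if tokens[pos] == "(":
            node, pos = [], pos + 1
            while tokens[pos] != ")":
                child, pos = read(pos)
                node.append(child)
            return node, pos + 1          # skip the closing paren
        return tokens[pos], pos + 1       # a leaf: tag or word

    tree, _ = read(0)
    return tree

print(parse_sexpr("(NP (PRP$ My) (NN dog))"))
```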
The current best methods for automatic machine translation rely on machine learning. The good news is that this means you just need a great number of documents in the languages you care about, without any annotation. Typical sources of such texts are Wikipedia and the official documentation of the European Union (which requires documents to be translated into all the official languages of the Union).
As anybody who has tried Google Translate or Bing Translator can attest, the results are generally good enough for understanding, but still often a bit off. They cannot substitute for a human translator.