A Guide to Natural Language Processing (Part 3)
A Guide to Natural Language Processing (Part 3)
In this part of the series, we'll learn about generating a summary of a document, sentiment analysis, parsing a document written in a natural language, and more.
Join the DZone community and get the full member experience.Join For Free
This section contains more advanced libraries to understand documents. We use the term somewhat loosely; we talk about how a computer can extract or manage the content of a document beyond simple manipulation of words and characters.
We are going to see how you can:
Generate a summary of a document (i.e. an algorithmic answer to the question, What is this article about?)
Sentiment analysis (Does this document contain a positive or negative opinion?)
Parse a document written in a natural language
Translate a document in another language
For the methods listed in the previous sections, you could build a library yourself with reasonable effort. From now on, it will get harder. That is because they might require a vast amount of annotated data (i.e. a vocabulary having each word with the corresponding part of speech) or rely on complex machine learning methods. So, we will mostly suggest using libraries.
This is an area with many open problems and active research, so you could find most libraries in Python, a language adopted by the research community, though you could find the occasional research-ready library in another language.
A final introductory note is that statistics and machine learning are the current kings of natural language processing. So, there is probably somebody trying to use TensorFlow to accomplish each of these tasks (i.e. deep news summarization). You might try that, too, if you take into account a considerable amount of time for research.
Generation of Summaries
The creation of a summary, or a headline, to correctly represent the meaning of a document is achievable with several methods. Some of them rely on information retrieval techniques, while others are more advanced. The theory is also divided into two strategies: extracting sentences or parts thereof from the original text, generating abstract summaries.
The second strategy it is still an open area of research, so we will concentrate on the first one.
SumBasic is a method that relies on the probability of individual words being present in a sentence to determine the most representative sentence:
First, you have to account the number of times a word appears in the whole document. With that, you calculate the probability of each word appearing in the document. For instance, if the word appears 5 times and the document has 525 words, its probability is 5/525.
You calculate a weight for each sentence that is the average of the probabilities of all the words in the sentence. For example, if a sentence contains three words with probability 3/525, 5/525, and 10/525, the weight would be 6/525.
Finally, you score the sentences by multiplying the highest probability word of each sentence with its weight. For example, a sentence with a weight of 0.1 and whose best word has a probability of 0.5 would score 0.1 * 0. 5 = 0.05, while another with weight 0.2 and a word with probability 0.4 would score 0.2 * 0.4 = 0.08.
Having found the best sentence, you recalculate the probabilities for each word in the chosen sentence. You recalculate the probabilities as if the chosen sentence was removed from the document. The idea is that the included sentence already contains a part of the whole meaning of the document. So, that part becomes less important — and this helps to avoid excessive repetition. You repeat the process until you reach the needed summary length.
This technique is quite simple. It does not require to have a database of documents to build a general probability of a word appearing in any document. You just need to calculate the probabilities in each input document. However, for this to work you have to exclude what is called stopwords. These are common words present in most documents, such as the or is. Otherwise, you might include meaningless sentences that include lots of them. You could also include stemming to improve the results.
The approach based on frequencies is an old and popular one, because it is generally effective and simple to implement. SumBasic is good enough that is frequently used as a baseline in the literature. However, there are even simpler methods. For example, Open Text Summarizer is a 2003 library that uses an even simpler approach. Basically you count the frequency of each word, then you exclude the common English words (e.g., the, is) and finally you calculate the score of a sentence according to the frequencies of the word it contains.
Graph-Based Methods: TextRank
There are more complex methods of calculating the relevance of the individual sentences. A couple of them take inspiration from PageRank — they are called LexRank and TextRank. They both rely on the relationship between different sentences to obtain a more sophisticated measurement of the importance of sentences, but they differ in the way they calculate the similarity of sentences.
PageRank measures the importance of a document according to the importance of other documents that links to it. The importance of each document, and thus each link, is computed recursively until a balance is reached.
TextRank works on the same principle: The relationship between elements can be used to understand the importance of each individual element. TextRank actually uses a more complex formula than the original PageRank algorithm because a link can be only present or not, while textual connections might be partially present. For instance, you might calculate that two sentences containing different words with the same stem (i.e. cat and cats both have cat as their stem) are only partially related.
The original paper describes a generic approach, rather than a specific method. In fact, it also describes two applications: keyword extraction and summarization. The key differences are:
The units you choose as a foundation of the relationship.
The way you calculate the connection and its strength.
For instance, you might choose as units n-grams of words or whole phrases. N-grams of words are sequences of n words, computed the same way you do k-gram for characters. So, for the phrase dogs are better than cats, there are these 3-grams:
dogs are better
are better than
better than cats
Phrases might create weighted links according to how similar they are. Or they might simply create links according to the position they are (i.e. a phrase might link to the previous and following one). The method works the same.
TextRank for Sentence Extraction
TextRank for extracting phrases uses as a unit whole sentences, and as a similarity measure, the number of words in common between them. So, if two phrases contain the words tornado, data, and center, they are more similar than if they contain only two common words. The similarity is normalized based on the length of the phrases to avoid the issue of having longer phrases having higher similarity than shorter ones.
The words used for the similarity measure could be stemmed. Stopwords are usually excluded by the calculation. A further improvement could be to also exclude verbs, although that might be complicated if you do not already have a way to identify the parts of speech.
LexRank differs mainly because as a similarity measure it uses a standard TF-IDF (Term Frequency—Inverse Document Frequency). Basically, with TF-IDF, the value of individual words is first weighted according to how frequently they appear in all documents and in each specific document. For example, if you are summarizing articles for a car magazine, there will be a lot of occurrences of the word car in every document. So, the word car would be of little relevance for each document. However, the word explosion would appear in few documents (hopefully), so it will matter more in each document it appears.
Latent Semantic Analysis
The methods we have seen so far have a weakness: they do not take into account semantics. This weakness is evident when you consider that there are words that have similar meanings (i.e. synonyms) and that most words can have different meaning depending on the context (i.e. polysemy). Latent semantic analysis attempt to overcome these issues.
The expression latent semantic analysis describes a technique more than a specific method — a technique that could be used whenever you need to represent the meaning of words. It can be used for summarization but also for search purposes to find words like the query of the user. For instance, if the user searches for happiness, a search library using LSA could also return results for joy.
A Simple Description
The specific mathematical formulas are a bit complex and involve matrices and operations on them. However, the founding idea is quite simple: words with similar meaning will appear in similar parts of a text. So you start with a normal TF-IDF matrix. Such a matrix contains nothing else than the frequencies of individual words, both inside a specific document and in all the documents evaluated.
The problem is that we want to find a relation between words that do not necessarily appear together. For example, imagine that different documents contain phrases containing the words joy and happiness together other words cookie or chocolate. The words do not appear in the same sentence, but they appear in the same document. One document contains a certain number of such phrases: a dog create happiness and dogs bring joy to children. In this document, LSA should be able to find a connection between joy and happiness through their mutual connection with dog.
The connection is build based on the frequency the words appear together or with related words in the whole set of documents. This allows building connection even in a sentence or document where they do not appear together. So, if joy and happiness appear frequently with dog, LSA would associate the specific document with the words (joy, happiness) and dog.
Basically, this technique will transform the original matrix from one linking each term with its frequency, into one with a (weighted) combination of terms linked to each document.
The issue is that there are a lot of words, and combinations thereof, so you need to make a lot of calculation and simplifications. And that is where the complex math is needed.
Once you have this matrix, the world is your oyster. That is to say, you could use this measurement of meaning in any number of ways. For instance, you could find the most relevant phrase and then find the phrases with are most close to it, using a graph-based method.
Text summarization and singular value decomposition describe one way to find the best sentences. The Python library
sumy offers an implementation.
Other Methods and Libraries
The creation of summaries is a fertile area of research with many valid methods already devised. In fact, much more than the ones we have described here. They vary for the approaches and the objective they are designed for. For example, some are created specifically to provide an answer to a question of the user, others to summarize multiple documents, etc.
You can read a brief taxonomy of other methods in Automatic Text Summarization (PDF). The Python library sumy, that we have already mentioned, implements several methods, though not necessarily the ones mentioned in the paper.
Classifier4J (Java), NClassifier (C#), and Summarize (Python) implement a Bayes classifier in an algorithm described as such:
In order to summarize a document, this algorithm first determines the frequencies of the words in the document. It then splits the document into a series of sentences. Then, it creates a summary by including the first sentence that includes each of the most frequent words. Finally, summary’s sentences are reordered to reflect that of those in the original document. — Summarize.py
These projects that implement a Bayes classifier are all dead, but they are useful to understand how the method could be implemented.
That's it for Part 3! Next time, we'll talk about other uses of LSA, parsing documents, and more.
Published at DZone with permission of Gabriele Tomassetti , DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.