A Guide to Natural Language Processing (Part 1)
Introduce yourself to the world of natural language processing by learning about some basic algorithms for stemming and splitting words automatically.
Join the DZone community and get the full member experience.Join For Free
Natural language processing (NLP) comprises a set of techniques that can be used to achieve many different objectives. Take a look at the following table to figure out which technique can solve your particular problem.
WHAT YOU NEED
WHERE TO LOOK
Grouping similar words for search
Finding words with the same meaning for search
Generating realistic names
Understanding how much time does it take to read a text
Understanding how difficult to read is a text
Identifying the language of a text
Generating a summary of a text
Finding similar documents
Identifying entities (i.e. cities, people) in a text
Understanding the attitude expressed in a text
Translating a text
We are going to talk about parsing in the general sense of analyzing a document and extracting its meaning. So, we are going to talk about the actual parsing of natural languages, but we will spend most of the time on other techniques. When it comes to understanding programming languages, parsing is the way to go. However, you can pick specific alternatives for natural languages. In other words, we are mostly going to talk about what you would use instead of parsing to accomplish your goals.
For instance, if you wanted to find all
for statements in a programming language file, you would parse it and then count the number of
fors. Instead, you are probably going to use something like stemming to find all mentions of cats in a natural language document.
This is necessary because the theory behind the parsing of natural languages might be the same one that is behind the parsing of programming languages; however, the practice is very dissimilar. In fact, you are not going to build a parser for a natural language — that is, unless you work in artificial intelligence or as a researcher, and even in those cases, you are rarely going to use one. Rather, you are going to find an algorithm that works as a simplified model of the document and that can only solve your specific problem.
In short, you are going to find tricks to avoid to actually having to parse a natural language. That is why this area of computer science is usually called natural language processing rather than natural language parsing.
Algorithms That Require Data
We are going to see specific solutions to each problem. Mind you, these specific solutions can be quite complex themselves. The more advanced they are, the less they rely on simple algorithms. Usually, they need a vast database of data about the language. A logical consequence of this is that it is rarely easy to adopt a tool for one language to be used for another one. Or rather, the tool might work with few adaptations, but to build the database would require a lot of investment. So, for example, you would probably find a ready-to-use tool to create a summary of an English text, but maybe not one for an Italian one.
For this reason, in this series, we concentrate mostly on English language tools. Although we mention whether these tools work for other languages, you do not need to know the theoretical differences between languages, such as the number of genders or cases they have. However, you should be aware that the more different a language is from English, the harder it will be to apply these techniques or tools to it.
For example, you should not expect to find tools that can work with Chinese (or rather the Chinese writing system). It is not necessarily that these languages are harder to understand programmatically, but there might be less research on them or the methods might be completely different from the ones adopted for English.
The Structure of This Guide
This article is organized according to the tasks we want to accomplish — which means that the tools and explanation are grouped according to the task they are used for. For instance, there is a section about measuring the properties of a text, such as its difficulty. They are also generally in ascending order of difficulty — it is easier to classify words than entire documents. We start with simple information retrieval techniques and we end in the proper field of natural language processing.
We think this is the most useful way to provide the information you need: If you need to do X, we directly show the methods and tools you can use.
With the expression classifying words, we include techniques and libraries that group words together.
Grouping Similar Words
We are going to talk about two methods that can group together similar words for the purpose of information retrieval. Basically, these are methods used to find documents with the words we care about from a pool of documents. That is useful because if a user searches for documents containing the word "friend," they are probably equally interested in documents containing "friends" and possibly "friended" and "friendship."
So, to be clear, in this section, we are not going to talk about methods to group semantically connected words, such as identifying all pets or all English towns.
The two methods are stemming and splitting words into groups of characters. The algorithms for the former are language-dependent, while the ones for the latter are not. We are going to examine each of them in separate paragraphs.
Stemming is the process of finding the stem, or root, of a word. In this context, the stem is not necessarily the morphological root according to linguists. So, it is not the form of a word that you would find, say, in a vocabulary. For example, an algorithm may produce the stem "consol" for the word "consoling," while in a vocabulary, as a root, you would find "console."
A typical application of stemming is grouping together all instances of words with the same stem for usage in a search library. So, if a user searches for documents containing "friend" they can also find ones with "friends" or "friended."
Porter Stemming Algorithm
Let’s talk about an algorithm that removes suffixes to find the stem: the effective and widely used Porter Stemming Algorithm. The algorithm was originally created by Martin Porter for English. There are also Porter-based/inspired algorithms for other languages, such as French or Russian. You can find all of them on the website Snowball. Snowball is a simple language to describe stemming algorithms, but the algorithms are also described in plain English.
A complete description of the algorithm is beyond the scope of this guide. However, its foundation is easy to grasp. Fundamentally, the algorithm divides a word into regions and then replaces or removes certain suffixes if they are completely contained in the said region. So, for example, the Porter2 (i.e. the updated version) algorithm, states that:
R1 is the region after the first non-vowel following a vowel, or the end of the word if there is no such non-vowel.
And then that you should replace "-tional" with "-tion" if it is found inside R1.
The word "confrontational" has as R1 region "-frontational"
"-tional" is completely contained in its R1
"Confrontational" becomes "confrontation"
Porter Stemmer is purely algorithmic; it does not rely on an external database or computed rules (i.e. rules created according to a training database). This is a great advantage because it makes it predictable and easy to implement. The disadvantage is that it cannot handle exceptional cases and known mistakes cannot be easily solved. For example, the algorithm creates the same stem for university and universal.
Porter Stemmer is not perfect — but it is simple, effective, and easy to implement. For a language like English, a stemmer can be realized by any competent developer. So, there are many out there for all notable programming languages and we are not going to list them here.
Typical Issues With Other Languages
Most languages that are somewhat close to English, like German or even Romance languages, are generally easy to stem. Actually, the creation of the algorithm itself is complex and requires great knowledge of the language. However, once somebody has done the hard work of creating an algorithm, implementing one is easy.
In stemming, there are many problems with two kinds of languages you will usually encounter. The first kind is agglutinative languages. Setting aside the linguistic meaning of the expression, the issue is that agglutinative languages pile up prefixes and suffixes on the root of a word.
In particular, Turkish is problematic because is both an agglutinative language and a concatenative one, which means that basically, in Turkish, a word can represent a whole English sentence. This makes hard to develop a stemming algorithm for Turkish, but it also makes it less useful. That is because if you stem a Turkish word, you might end up with one stem for each sentence, so you lose a lot of information.
The second kind of issue is related to language with no clearly defined words. Chinese is the prime example of a language that has no alphabet, but only symbols that represent concepts. So, stemming has no meaning for Chinese. Even determining the boundaries of concepts is hard. The problem of dividing a text in its component words is called word segmentation. With English, you can find the boundaries of words just by looking at whitespace or punctuation. There are no such things in Chinese.
An alternative method to group together similar words relies on splitting them. The foundation of this method is taking apart words into the sequence of characters. These characters are called k-grams, but they are also known as n-grams characters (n-grams might also indicate groups of words). The sequence of characters is built in a sliding manner, advancing by one character at each step, starting and ending with a special symbol that indicates the boundaries of the word. For example, the 3-grams for happy are:
With the symbol $ used to indicate the beginning and the end of the word.
The exact method used for search is beyond the scope of this article. In general terms, you apply the same process to the search term(s) and then compare the occurrences of the k-grams of the input with the one of the words in the documents. Usually, you apply a statistical coefficient, like the Jaccard coefficient, to determine how much similar the words have to be to be grouped together (i.e. how many grams have to have in common). For example, with a high coefficient, you might group together cat and cats or divide cat and catty.
It is important to note a couple of things: the order of the k-grams and spelling mistakes. The order of the k-grams does not matter; in theory, you could have completely different words that happen to have the same k-grams. In practice, this does not happen. This method is imprecise, which means that it can also protect from spelling mistakes of the user. For example, even if the user input "locamotive" instead of "locomotive," it will probably still show the correct results. That is because 7 of 10 3-grams match. Exact matches would rank higher, but the word "locamotive" does not exist and so it probably has no matches.
Limits and Effectiveness
The great advantage of this technique is that it is not just purely algorithmic and very simple but it also works with all languages. You do not need to build k-grams for English differently from the ones for French. You just take apart the words in the same way. Although, it is important to note that the effectiveness is in the details — you have to pick the right number of k to have the best results.
The ideal number depends on the average length of the word in the language. It should be lower or equal than that. Different languages might have different values, but in general, you can get away with four or five. You will not have the absolute best results with only one choice, but it will work.
The disadvantage is that it looks extremely stupid, let’s face it: It so simple that it should not work. But it actually does, it works well if not better than stemming (PDF). It is shamelessly effective, and it has many other uses. We are going to see one right now.
The general case of generating fake words that look like real words is hard and of limited use. You could create phrases for a fake language, but that is pretty much it. However, it is possible to programmatically create realistic fake names for use in games or for any world-building need.
There are several feasible methods. The easiest works roughly like this:
Create a database of names of the same kind you want to generate (i.e. Roman names, Space Lizards Names, etc.).
Divide the input names in k-grams (i.e. 3-grams of Mark -> $ma – mar – ark – rk$).
Associate a probability to the k-grams: The more frequently they appear in the original database, the higher the chance they appear in the generated name.
Generate the new names!
There are many variants. For instance, you could combine a different number of k-grams for special purposes (i.e. all names start with a 2-gram but end in a 4-gram).
You could also improve the soundness of the generated names simply by looking at the probabilities of the sequences appearing in a certain order. For example, if you randomly start with "ar" the following syllable might be more likely "th" than "or."
This is the end of Part 1! In Part 2, we'll talk about classifying documents. In further posts, we'll discuss understanding documents, parsing documents, sentiment analysis, NLP libraries, and more. Stay tuned!
Published at DZone with permission of Gabriele Tomassetti, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.