What is NLTK?
Natural Language Toolkit (NLTK) is a leading platform for building Python programs to work with human language data (Natural Language Processing). It is accompanied by a book that explains the underlying concepts behind the language processing tasks supported by the toolkit. NLTK is intended to support research and teaching in NLP or closely related areas, including empirical linguistics, cognitive science, artificial intelligence, information retrieval, and machine learning.
The library includes:
- Lexical analysis: word and text tokenizers
- n-grams and collocations
- Part-of-speech tagging
- Tree models and text chunkers for capturing sentence structure
- Named-entity recognition
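To give a feel for the first two items, here is a minimal plain-Python sketch of tokenization and bigram (n-gram with n = 2) extraction. It runs without any NLTK data and the sample sentence is made up; NLTK's own `nltk.word_tokenize` and `nltk.bigrams` do this (and much more) for real text.

```python
# Plain-Python sketch of tokenization and bigram extraction.
# NLTK's tokenizers also handle punctuation, which this naive version does not.
def simple_tokenize(text):
    """Naive whitespace tokenizer."""
    return text.split()

def bigrams(tokens):
    """Pair each token with its successor: the n=2 case of an n-gram."""
    return list(zip(tokens, tokens[1:]))

tokens = simple_tokenize("In the beginning God created the heaven and the earth")
print(bigrams(tokens)[:3])
```

Collocations are then just bigrams that occur together more often than chance would predict; NLTK scores them for you.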
Download and Install
1. You can install NLTK on Windows with `pip install nltk`.
2. Once NLTK is installed, start the Python interpreter and download the data required for the rest of this tutorial:
```python
import nltk
nltk.download()
```
The full collection consists of about 30 compressed files requiring roughly 100 MB of disk space. If disk space or network bandwidth is a concern, you can download only the packages you need.
Once the data is downloaded to your machine, you can load some of it from the Python interpreter:

```python
from nltk.book import *
```
Basic Operations on Text
```python
from __future__ import division
from nltk.book import *

# Enter the name of a text to find out about it
print text3

# Length of the text from start to finish, counting the words
# and punctuation symbols that appear
print 'Length of text: ' + str(len(text3))

# The vocabulary of a text is just the set of distinct tokens it uses
# print sorted(set(text3))
print 'Number of distinct tokens: ' + str(len(set(text3)))

# Lexical richness of the text
def lexical_richness(text):
    return len(set(text)) / len(text)

# Percentage of the text taken up by a specific word
def percentage(word, text):
    return 100 * text.count(word) / len(text)

print 'Lexical richness of the text: ' + str(lexical_richness(text3))
print 'Percentage: ' + str(percentage('God', text3))
```
We will use 'text3', "The Book of Genesis", to try out NLTK's features. The code sample above shows:
- The name of the text
- The length of the text from beginning to end, counting the words and punctuation symbols that appear
- The number of distinct tokens in the text. (A token is the technical name for a sequence of characters treated as a unit; the vocabulary of a text is the set of tokens it uses, since in a set all duplicates are collapsed together.)
- A measure of the lexical richness of the text (the number of distinct words divided by the total number of words)
- How often a word occurs in the text (what percentage of the text is taken up by a specific word)
In Python 2, start the script with `from __future__ import division` so that `/` performs true (floating-point) division rather than integer division.
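Since these measures only need a list of tokens, we can check the arithmetic on a tiny made-up token list (not from text3) without loading any NLTK data:

```python
def lexical_richness(tokens):
    # Distinct tokens divided by total tokens: 1.0 means no word repeats.
    return len(set(tokens)) / float(len(tokens))

def percentage(word, tokens):
    # Share of the text taken up by one word, as a percentage.
    return 100.0 * tokens.count(word) / len(tokens)

tokens = ["the", "dog", "chased", "the", "cat"]
print(lexical_richness(tokens))   # 4 distinct / 5 total = 0.8
print(percentage("the", tokens))  # 2 of 5 tokens = 40.0
```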
Output of the above code snippet:
- count(word) - counts the occurrences of the word in the text
- concordance(word) - shows every occurrence of the given word, together with some context
- similar(word) - lists words that appear in similar contexts to the given word
- common_contexts([word1, word2]) - shows contexts that are shared by two or more words
```python
from nltk.book import *

# Name of the text
print text3

# Count occurrences of a word in the text
print "===Count==="
print text3.count("Adam")

# concordance() shows every occurrence of a given word, together with some context.
# Here we search for 'Adam' in The Book of Genesis.
print "===Concordance==="
text3.concordance("Adam")

# Words that appear in similar contexts to 'Adam'
print "===Similar==="
text3.similar("Adam")

# Contexts shared by two or more words
print "===Common Contexts==="
text3.common_contexts(["Adam", "Noah"])
```
Output of the code sample:
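Under the hood, a concordance just finds each occurrence of the target word and shows a window of surrounding tokens. The toy version below is purely illustrative (its name and the sample tokens are made up); NLTK's `concordance()` is far more polished.

```python
def toy_concordance(word, tokens, window=2):
    """Return each occurrence of word with `window` tokens of context on each side."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == word:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            hits.append(" ".join(left + [tok] + right))
    return hits

tokens = "and Adam called his wife Eve and Adam knew Eve".split()
for line in toy_concordance("Adam", tokens):
    print(line)
```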
Now we will plot where particular words occur across the text: "God", "Adam", "Eve", "Noah", "Abram", "Sarah", "Joseph", "Shem", "Isaac".
```python
text3.dispersion_plot(["God", "Adam", "Eve", "Noah", "Abram", "Sarah", "Joseph", "Shem", "Isaac"])
```
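A dispersion plot simply marks the token offset at which each target word occurs, with one row per word. Computing those offsets needs no plotting library; the sketch below uses a made-up token list to show the idea:

```python
def word_offsets(words, tokens):
    """Map each target word to the positions where it appears in the token stream."""
    return {w: [i for i, tok in enumerate(tokens) if tok == w] for w in words}

tokens = "God created Adam and God created Eve".split()
print(word_offsets(["God", "Adam", "Eve"], tokens))
# {'God': [0, 4], 'Adam': [2], 'Eve': [6]}
```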
Reference: Bird, Steven; Klein, Ewan; Loper, Edward (2009). Natural Language Processing with Python. O'Reilly Media Inc. ISBN 0-596-51649-5.