
Natural Language Toolkit (NLTK) Sample and Tutorial: Part 1



What is NLTK?

Natural Language Toolkit (NLTK) is a leading platform for building Python programs to work with human language data (Natural Language Processing). It is accompanied by a book that explains the underlying concepts behind the language processing tasks supported by the toolkit. NLTK is intended to support research and teaching in NLP or closely related areas, including empirical linguistics, cognitive science, artificial intelligence, information retrieval, and machine learning.

The library provides:

  • Lexical analysis: word and text tokenization
  • n-grams and collocations
  • Part-of-speech tagging
  • Tree models and text chunkers for capturing sentence structure
  • Named-entity recognition
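As a taste of the first item, tokenization splits raw text into words and punctuation symbols. NLTK ships real tokenizers for this; the deliberately crude regex version below is only a sketch to illustrate the idea, not NLTK's implementation:

```python
import re

def crude_tokenize(text):
    # Split into runs of word characters or single punctuation marks.
    # Real tokenizers handle contractions, abbreviations, etc.;
    # this only illustrates the concept.
    return re.findall(r"\w+|[^\w\s]", text)

print(crude_tokenize("Hello, world!"))
# ['Hello', ',', 'world', '!']
```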

Download and Install

1. Install NLTK. On Windows (and other platforms), pip install nltk is the standard route.

2. Once NLTK is installed, start up the Python interpreter and download the data required for the rest of the tutorial:

import nltk
nltk.download()





The full collection consists of about 30 compressed files requiring roughly 100 MB of disk space. If disk space or network bandwidth is a concern, you can select and download only the packages you need.


Once the data is downloaded to your machine, you can load some of it using the Python interpreter.

from nltk.book import *





Basic Operations on Text

from __future__ import division
from nltk.book import *

# Enter the name to find out about the text
print text3

# Length of the text from start to finish, in terms of the words
# and punctuation symbols that appear
print 'Length of Text: ' + str(len(text3))

# The vocabulary is just the set of tokens
#print sorted(set(text3))
print 'Length of Token Set: ' + str(len(set(text3)))

# Lexical richness of the text
def lexical_richness(text):
    return len(set(text)) / len(text)

# Percentage of the text taken up by a specific word
def percentage(word, text):
    return 100 * text.count(word) / len(text)

print 'Lexical richness of the text: ' + str(lexical_richness(text3))
print 'Percentage: ' + str(percentage('God', text3))

Now we will pick text3, "The Book of Genesis," to try out NLTK's features. The code sample above shows:


  • The name of the text
  • The length of the text from beginning to end
  • The token count of the text. (A token is the technical name for a sequence of characters. The set of a text's tokens collapses all duplicates, giving its vocabulary.)
  • A measure of the lexical richness of the text (the number of distinct words divided by the total number of words)
  • How often a word occurs in the text (the percentage of the text taken up by a specific word)
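The last two measures are plain arithmetic, so they can be tried on any list of tokens without loading the NLTK corpora. A minimal Python 3 sketch (the sample tokens list here is made up for illustration):

```python
def lexical_richness(tokens):
    # Distinct word types divided by total number of tokens.
    return len(set(tokens)) / len(tokens)

def percentage(word, tokens):
    # Share of the text taken up by one specific word.
    return 100 * tokens.count(word) / len(tokens)

tokens = ["in", "the", "beginning", "god", "created",
          "the", "heaven", "and", "the", "earth"]
print(lexical_richness(tokens))   # 0.8 (8 distinct types / 10 tokens)
print(percentage("the", tokens))  # 30.0
```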

Note
In Python 2, start the script with from __future__ import division so that / performs true (floating-point) division.

Output of the above code snippet:




Searching Text

  • count(word) - counts the occurrences of the word in the text
  • concordance(word) - shows every occurrence of a given word, together with some context
  • similar(word) - shows other words that appear in a similar range of contexts
  • common_contexts([words]) - shows contexts shared by two or more words
from nltk.book import *

# Name of the text
print text3

# Count occurrences of the word in the text
print "===Count==="
print text3.count("Adam")

# concordance() shows every occurrence of a given word, together
# with some context. Here 'Adam' is searched in 'The Book of Genesis'.
# (concordance() prints its results itself, so no print is needed.)
print "===Concordance==="
text3.concordance("Adam")

# Other words that appear in a similar range of contexts
print "===Similar==="
text3.similar("Adam")

# Contexts shared by two or more words
print "===Common Contexts==="
text3.common_contexts(["Adam", "Noah"])

Output of the code sample:


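Under the hood, a concordance view is essentially a windowed search over the token list. The toy function below is not NLTK's implementation (which also aligns results in fixed-width columns), but it shows the idea:

```python
def toy_concordance(tokens, word, width=3):
    # Collect each occurrence of `word` together with up to
    # `width` tokens of context on either side.
    lines = []
    for i, tok in enumerate(tokens):
        if tok == word:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append(" ".join(left + [tok] + right))
    return lines

tokens = "and adam called his wife eve because adam knew her".split()
for line in toy_concordance(tokens, "adam"):
    print(line)
```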


Now let's plot how certain words are distributed over the text, such as "God", "Adam", "Eve", "Noah", "Abram", "Sarah", "Joseph", "Shem", and "Isaac".

text3.dispersion_plot(["God","Adam", "Eve", "Noah", "Abram","Sarah", "Joseph", "Shem", "Isaac"])



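The dispersion plot is driven by nothing more than the offsets at which each word occurs in the token stream. If matplotlib is unavailable, those offsets can be computed directly with a few lines of plain Python (a toy sketch, not NLTK's code; the sample tokens are made up):

```python
def word_offsets(tokens, targets):
    # Map each target word to the list of positions where it occurs;
    # these offsets are exactly what the dispersion plot draws as ticks.
    offsets = {w: [] for w in targets}
    for i, tok in enumerate(tokens):
        if tok in offsets:
            offsets[tok].append(i)
    return offsets

tokens = "god created adam and god created eve and god rested".split()
print(word_offsets(tokens, ["god", "adam", "eve"]))
# {'god': [0, 4, 8], 'adam': [2], 'eve': [6]}
```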


References


[1] Bird, Steven; Klein, Ewan; Loper, Edward (2009). Natural Language Processing with Python. O'Reilly Media Inc. ISBN 0-596-51649-5.


Topics:
python ,bigdata ,big data ,nlp ,natural language processing ,ir ,nltk

Published at DZone with permission of Madhuka Udantha, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
