Using "Natural": An NLP Module for node.js

Like most node modules, "natural" is packaged as an NPM and can be installed from the command line with node.js.

By Christopher Umbel · Mar. 27, 2012 · Tutorial · 63.1K Views

Whether it is for Twitter sentiment analysis or for solving search problems, natural language processing (NLP) has become the fulcrum of much of my hobby work in recent years. Initially I found myself relying on the Natural Language Toolkit (NLTK), a rich library of NLP algorithms for Python. The NLTK is simply fantastic: a true one-stop-NLP-shop that's widely adopted, well documented, and open source. Certainly I had to learn what the algorithms did and how they fit together, but for the most part the hard work was done for me. It was a very productive situation, to be sure!

Last year, however, brought a new platform to my hobby work: node.js. My, node and its community were young, but maturing rapidly.

When the need for natural language facilities arose, I found the pickings pretty slim. I have to be honest: that's *exactly* what I was hoping for, an opportunity to sink my teeth into the algorithms themselves and contribute them back to a young, but growing, community.

Thus I began work on "natural", a module of base natural language processing algorithms for node.js. The idea was loosely based on the Python NLTK in that all algorithms are in the same package. Initially I didn't think "natural" could be as complete as the NLTK, but as my own understanding, as well as community contributions, picked up, I've become much more hopeful. Also, merging with Rob Ellis's node-nltools back in August of 2011 strengthened "natural" further by rapidly bringing new algorithms and features into the fold.

As of version 0.1.5, Rob, other contributors, and I have managed to put the following feature list together:

  • Stemming
    • Porter
    • Lancaster
  • Phonetic
    • SoundEx
    • Metaphone
    • Double Metaphone
  • Classification
    • Naive Bayes
    • Logistic Regression
  • String Distance
    • Levenshtein (thanks Sid Nallu)
    • Jaro-Winkler (thanks Adam Phillabaum)
    • Dice's Coefficient (thanks John Crepezzi)
  • Tokenization
    • Treebank
    • Word
    • Word-Punctuation
  • Inflection
    • Numeric
    • Nouns Singular/Pluralization
    • Present-tense verb Singular/Pluralization
  • tf*idf
  • n-grams
  • WordNet

I'll not cover every single module and feature in this article, but will instead outline the most commonly used and most mature.

Installing

Like most node modules, "natural" is packaged as an NPM and can be installed from the command line as such:

npm install natural

If you want to install from source (or contribute, for that matter) it can be found here on GitHub.

Stemming

The first class of algorithms I'd like to outline is stemming. Stemming is the process of reducing a word to a root (not necessarily the morphological root). In other words, the idea is to boil all conjugations, tenses, and forms down to a single root word. That root may not end up looking exactly like the English root, but it should be close enough for comparison.

Stemming is a typical step in preparing text for use by other algorithms, such as classification, or for storage, such as full-text indexing. Both the Lancaster and Porter algorithms are supported as of 0.1.5. Here's a basic example of stemming a word with a Porter stemmer.

var natural = require('natural'),
    stemmer = natural.PorterStemmer;

var stem = stemmer.stem('stems');
console.log(stem);
stem = stemmer.stem('stemming');
console.log(stem);
stem = stemmer.stem('stemmed');
console.log(stem);
stem = stemmer.stem('stem');
console.log(stem);

Above I simply required-up the main "natural" module and grabbed the PorterStemmer sub-module from within. The stem function takes an arbitrary string and returns the stem. The above code returns the following output:

stem
stem
stem
stem
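For intuition about what a stemmer is doing, the reductions above can be approximated with naive suffix stripping. This is only an illustrative sketch in plain JavaScript, not natural's implementation; the real Porter algorithm applies several ordered phases of rules with conditions on what remains of the word:

```javascript
// Illustrative only: strip a common suffix, then undouble a trailing
// consonant pair ('stemm' -> 'stem'). The real Porter stemmer is far
// more careful than this.
function naiveStem(word) {
  var stem = word.toLowerCase();
  var suffixes = ['ing', 'ed', 'es', 's'];
  for (var i = 0; i < suffixes.length; i++) {
    var suffix = suffixes[i];
    if (stem.length - suffix.length >= 3 && stem.slice(-suffix.length) === suffix) {
      stem = stem.slice(0, -suffix.length);
      break;
    }
  }
  if (/([^aeiou])\1$/.test(stem)) stem = stem.slice(0, -1);
  return stem;
}

console.log(naiveStem('stemming')); // stem
console.log(naiveStem('stemmed'));  // stem
```

Even this crude version maps the four inputs above to the same root, which is all comparison-oriented consumers care about.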

For convenience, stemmers can patch String with methods to simplify the process: call the attach method, and String objects will then have a stem method.

stemmer.attach();
stem = 'stemming'.stem();
console.log(stem);

It's very possible you'd be interested in stemming a string composed of many words, perhaps an entire document. The attach method provides a tokenizeAndStem method to accomplish this. It breaks the owning string up into an array of strings, one for each word, and stems them all. For example:

var stems = 'stems returned'.tokenizeAndStem();
console.log(stems);

produces the output:

[ 'stem', 'return' ]

Note that the tokenizeAndStem method will, by default, omit certain words considered irrelevant (stop words) from the returned array. To instruct the stemmer not to omit stop words, pass true in to tokenizeAndStem for the keepStops parameter. Consider:

console.log('i stemmed words.'.tokenizeAndStem());
console.log('i stemmed words.'.tokenizeAndStem(true));

outputting:

[ 'stem', 'word' ]
[ 'i', 'stem', 'word' ]

All of the code above would also work with a Lancaster stemmer by requiring the LancasterStemmer module instead, like:

var natural = require('natural'),
    stemmer = natural.LancasterStemmer;

Of course, the actual stems produced could differ depending on the algorithm chosen. The Lancaster stemmer tends to be a bit more aggressive, resulting in roots that look less like their English equivalents, but it will likely perform better.

Phonetics

Phonetic algorithms are also provided to determine what words sound like and compare them accordingly. The old (and I mean pre-electronic computers old... like 1918 old) SoundEx and the more modern Metaphone/Double Metaphone algorithms are supported as of 0.1.5.

The following example compares the string "phonetics" and the intentional misspelling "fonetix" and determines that they sound alike according to the Metaphone module. The same pattern could be applied to the DoubleMetaphone or SoundEx modules.

var natural = require('natural'),
    phonetic = natural.Metaphone;

var wordA = 'phonetics';
var wordB = 'fonetix';

if(phonetic.compare(wordA, wordB))
    console.log('they sound alike!');

The raw code the phonetic algorithm produces can be retrieved with the process method:

var phoneticCode = phonetic.process('phonetics');
console.log(phoneticCode);

resulting in:

FNTKS

Like the stemming implementations, the phonetic modules have an attach method that patches String with shortcut methods, most notably soundsLike for comparison:

phonetic.attach();

if(wordA.soundsLike(wordB))
    console.log('they sound alike!');

attach also patches in phonetics and tokenizeAndPhoneticize methods to retrieve the phonetic codes for a single word and an entire corpus, respectively.

console.log('phonetics'.phonetics());
console.log('phonetics rock'.tokenizeAndPhoneticize());

which outputs:

FNTKS
[ 'FNTKS', 'RK' ]

The above code could also use SoundEx by substituting the following in for the require.

var natural = require('natural'),
    phonetic = natural.SoundEx;

Note that SoundEx and Metaphone may have trouble with non-English words, but Double Metaphone should have some degree of success with many other languages.
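To get a feel for how such phonetic codes are built, here is a toy SoundEx in plain JavaScript. This is a simplified sketch, not natural's implementation; in particular it ignores SoundEx's special rule for 'h' and 'w' between consonants of the same class:

```javascript
// Toy SoundEx: keep the first letter, map remaining consonants to digit
// classes, skip vowels, collapse adjacent duplicate codes, pad/truncate
// to four characters. Simplified relative to the official algorithm.
function soundex(word) {
  var codes = { b: '1', f: '1', p: '1', v: '1',
                c: '2', g: '2', j: '2', k: '2', q: '2', s: '2', x: '2', z: '2',
                d: '3', t: '3', l: '4', m: '5', n: '5', r: '6' };
  var letters = word.toLowerCase().split('');
  var result = word.charAt(0).toUpperCase();
  var prev = codes[letters[0]] || '';

  for (var i = 1; i < letters.length; i++) {
    var code = codes[letters[i]] || '';
    if (code && code !== prev) result += code; // drop vowels and duplicates
    prev = code;
  }

  return (result + '000').slice(0, 4);
}

console.log(soundex('Robert')); // R163
console.log(soundex('Rupert')); // R163
```

"Robert" and "Rupert" collapse to the same code, which is exactly how a compare method like the one above decides two words sound alike.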

tf*idf

tf*idf weights can be used to judge how important a given word is to a given document in a broader corpus (collection of documents). There are two components to a tf*idf weight: the term frequency and the inverse document frequency. To guarantee that a frequently used, albeit semantically less important, word doesn't gain too much favor, you'll want to ensure you have many documents in your TfIdf instance.
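In its classic textbook form, the weight for term t in document d is tf(t, d) * log(N / df(t)), where N is the number of documents and df(t) is how many contain t. Here's a minimal plain-JavaScript sketch of that formula; natural's internal weighting may differ slightly, so don't expect identical numbers from the library:

```javascript
// Classic tf*idf: term frequency times log of (total docs / docs with the term).
// Illustrative sketch only; natural's exact weighting may not match.
function tfidf(term, doc, corpus) {
  var tf = doc.filter(function(w) { return w === term; }).length;
  var df = corpus.filter(function(d) { return d.indexOf(term) !== -1; }).length;
  if (tf === 0 || df === 0) return 0;
  return tf * Math.log(corpus.length / df);
}

var corpus = [
  ['i', 'code', 'in', 'c'],
  ['i', 'code', 'in', 'ruby'],
  ['i', 'code', 'in', 'ruby', 'and', 'node', 'but', 'node', 'more', 'often']
];

// "node" appears twice in the last document and in no others,
// so it scores 2 * ln(3 / 1) there and 0 elsewhere.
console.log(tfidf('node', corpus[2], corpus));
console.log(tfidf('node', corpus[0], corpus)); // 0
```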

Consider the following code, which adds a few documents to a corpus and then determines how important the words "ruby" and "node" are to them.

var natural = require('natural'),
    TfIdf = natural.TfIdf,
    tfidf = new TfIdf();

tfidf.addDocument('i code in c.');
tfidf.addDocument('i code in ruby.');
tfidf.addDocument('i code in ruby and node, but node more often.');
tfidf.addDocument('this document is about natural, written in node');
tfidf.addDocument('i code in fortran.');

console.log('node --------------------------------');
tfidf.tfidfs('node', function(i, measure) {
    console.log('document #' + i + ' is ' + measure);
});

console.log('ruby --------------------------------');
tfidf.tfidfs('ruby', function(i, measure) {
    console.log('document #' + i + ' is ' + measure);
});

The previous code will output the tf*idf weights for "node" and "ruby". The higher the weight, the more important the word is to the document.

node --------------------------------
document #0 is 0
document #1 is 0
document #2 is 3.347952867143343
document #3 is 1.6739764335716716
document #4 is 0
ruby --------------------------------
document #0 is 0
document #1 is 1.6739764335716716
document #2 is 1.6739764335716716
document #3 is 0
document #4 is 0

Additionally, you can measure a word against a single document.

console.log(tfidf.tfidf('node', 0 /* document index */));
console.log(tfidf.tfidf('node', 1));

You can also get a list of all terms in a document ordered by their importance.

tfidf.listTerms(4 /* document index */).forEach(function(item) {
    console.log(item.term + ': ' + item.tfidf);
});

yielding:

fortran: 1.7047480922384253
code: 1.6486586255873816

Inflection

Basic inflectors are in place to convert nouns between plural and singular forms and to turn integers into string counters (e.g. '1st', '2nd', '3rd', '4th', etc.).

The following example converts the word "radius" into its plural form "radii".

var natural = require('natural'),
    nounInflector = new natural.NounInflector();

var plural = nounInflector.pluralize('radius');
console.log(plural);

Singularization follows the same pattern, as is illustrated in the following example, which converts the word "beers" to its singular form, "beer".

var singular = nounInflector.singularize('beers');
console.log(singular);

Just like the stemming and phonetic modules, an attach method is provided to patch String with shortcut methods.

nounInflector.attach();
console.log('radius'.pluralizeNoun());
console.log('beers'.singularizeNoun()); 

A NounInflector instance can do custom conversions if you provide expressions via the addPlural and addSingular methods. Because these conversions aren't always symmetric (sometimes more patterns may be required to singularize forms than to pluralize them), there needn't be a one-to-one relationship between addPlural and addSingular calls.

nounInflector.addPlural(/(code|ware)/i, '$1z');
nounInflector.addSingular(/(code|ware)z/i, '$1');

console.log('code'.pluralizeNoun());
console.log('ware'.pluralizeNoun());

console.log('codez'.singularizeNoun());
console.log('warez'.singularizeNoun());

which would result in:

codez
warez
code
ware

Here's an example of using the CountInflector module to produce string counters for integers.

var natural = require('natural'),
    countInflector = natural.CountInflector;

console.log(countInflector.nth(1));
console.log(countInflector.nth(2));
console.log(countInflector.nth(3));
console.log(countInflector.nth(4));
console.log(countInflector.nth(10));
console.log(countInflector.nth(11));
console.log(countInflector.nth(12));
console.log(countInflector.nth(13));
console.log(countInflector.nth(100));
console.log(countInflector.nth(101));
console.log(countInflector.nth(102));
console.log(countInflector.nth(103));
console.log(countInflector.nth(110));
console.log(countInflector.nth(111));
console.log(countInflector.nth(112));
console.log(countInflector.nth(113));

producing:

1st
2nd
3rd
4th
10th
11th
12th
13th
100th
101st
102nd
103rd
110th
111th
112th
113th

Classification

Classification is currently supported by the Naive Bayes and logistic regression algorithms, although natural's Naive Bayes implementation is the more mature of the two. You can use them for tasks like spam detection and sentiment analysis.

There are two fundamental steps involved in using a classifier: training and classification.

The following example takes care of the first step by requiring-up the classifier and training it with data. Naturally, this is only a sample. To do any production tasks, you'd want many more training documents (hundreds per class, depending on their size).

var natural = require('natural'),
    classifier = new natural.BayesClassifier();
classifier.addDocument("my unit-tests failed.", 'software');
classifier.addDocument("tried the program, but it was buggy.", 'software');
classifier.addDocument("the drive has a 2TB capacity.", 'hardware');
classifier.addDocument("i need a new power supply.", 'hardware');
classifier.train();

By default the classifier will tokenize the corpus and stem it with a PorterStemmer. You can use a LancasterStemmer by passing it in to the BayesClassifier constructor as such:

var natural = require('natural'),
    stemmer = natural.LancasterStemmer,
    classifier = new natural.BayesClassifier(stemmer);

With the classifier trained, it can now classify documents via the classify method:

console.log(classifier.classify('did the tests pass?'));
console.log(classifier.classify('did you buy a new drive?'));

resulting in the output:

software
hardware
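To demystify what the classifier is doing, here is a minimal multinomial Naive Bayes sketch in plain JavaScript with add-one (Laplace) smoothing. This illustrates the general technique, not natural's actual implementation, and the TinyBayes name is made up for the example:

```javascript
// Minimal multinomial Naive Bayes: score each label by
// log prior + sum of smoothed log token likelihoods.
function TinyBayes() {
  this.wordCounts = {}; // label -> token -> count
  this.docCounts = {};  // label -> number of training documents
  this.totalDocs = 0;
  this.vocab = {};      // set of all tokens seen
  this.vocabSize = 0;
}

TinyBayes.prototype.addDocument = function(tokens, label) {
  this.totalDocs++;
  this.docCounts[label] = (this.docCounts[label] || 0) + 1;
  this.wordCounts[label] = this.wordCounts[label] || {};
  for (var i = 0; i < tokens.length; i++) {
    var t = tokens[i];
    this.wordCounts[label][t] = (this.wordCounts[label][t] || 0) + 1;
    if (!this.vocab[t]) { this.vocab[t] = true; this.vocabSize++; }
  }
};

TinyBayes.prototype.classify = function(tokens) {
  var best = null, bestScore = -Infinity;
  for (var label in this.docCounts) {
    var counts = this.wordCounts[label];
    var total = 0;
    for (var t in counts) total += counts[t];

    var score = Math.log(this.docCounts[label] / this.totalDocs); // prior
    for (var i = 0; i < tokens.length; i++) {
      // add-one smoothing keeps unseen tokens from zeroing the score
      score += Math.log(((counts[tokens[i]] || 0) + 1) / (total + this.vocabSize));
    }
    if (score > bestScore) { bestScore = score; best = label; }
  }
  return best;
};

var nb = new TinyBayes();
nb.addDocument(['unit', 'test'], 'software');
nb.addDocument(['bug', 'program'], 'software');
nb.addDocument(['drive', 'capacity'], 'hardware');
nb.addDocument(['power', 'supply'], 'hardware');

console.log(nb.classify(['bug', 'unit'])); // software
console.log(nb.classify(['drive']));       // hardware
```

The smoothing step is why the real classifier can still say something sensible about documents containing words it never saw in training.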

Similarly, the classifier can be trained on arrays rather than strings, bypassing tokenization and stemming. This allows the consumer to perform custom tokenization and stemming, if any at all, which is especially useful if the corpus is not in English.

classifier.addDocument(['unit', 'test'], 'software');
classifier.addDocument(['bug', 'program'], 'software');
classifier.addDocument(['drive', 'capacity'], 'hardware');
classifier.addDocument(['power', 'supply'], 'hardware');

classifier.train();

It's possible to persist and recall the results of training via the save method:

var natural = require('natural'),
    classifier = new natural.BayesClassifier();

classifier.addDocument(['unit', 'test'], 'software');
classifier.addDocument(['bug', 'program'], 'software');
classifier.addDocument(['drive', 'capacity'], 'hardware');
classifier.addDocument(['power', 'supply'], 'hardware');

classifier.train();

classifier.save('classifier.json', function(err, classifier) {
    // the classifier is saved to the classifier.json file!
});

The training could then be recalled later with the load method:

var natural = require('natural'),
    classifier = new natural.BayesClassifier();

natural.BayesClassifier.load('classifier.json', null, function(err, classifier) {
    console.log(classifier.classify('did the tests pass?'));
});

Note that substituting LogisticRegressionClassifier for BayesClassifier should generally work as a drop-in replacement.

n-grams

n-grams are essentially the destructuring of a sentence into overlapping, contiguous lists of size n, and they are useful for building probabilistic language models. In this case the n-grams are composed of words, but outside of "natural", or even natural language processing, they could be lists of other countable objects.

Consider the following examples, which illustrate the production of trigrams (n-grams of length 3), bigrams (n-grams of length 2), and arbitrary n-grams using the trigrams, bigrams, and ngrams functions, respectively.

var natural = require('natural'),
    NGrams = natural.NGrams;
console.log(NGrams.trigrams('some other words here'));
console.log(NGrams.trigrams(['some',  'other', 'words',  'here']));

both of which produce:

[ [ 'some', 'other', 'words' ], [ 'other', 'words', 'here' ] ]

console.log(NGrams.bigrams('some words here'));
console.log(NGrams.bigrams(['some',  'words',  'here']));

both of which produce:

[ [ 'some', 'words' ], [ 'words', 'here' ] ]

console.log(NGrams.ngrams('some other words here for you', 4));

which outputs:

[ [ 'some', 'other', 'words', 'here' ], [ 'other', 'words', 'here', 'for' ], [ 'words', 'here', 'for', 'you' ] ]
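Under the hood, producing n-grams is just sliding a window of size n across the token list. A generic plain-JavaScript sketch (not natural's implementation):

```javascript
// Slide a window of size n over the tokens, emitting each window.
function ngrams(tokens, n) {
  var result = [];
  for (var i = 0; i + n <= tokens.length; i++) {
    result.push(tokens.slice(i, i + n));
  }
  return result;
}

console.log(ngrams(['some', 'other', 'words', 'here'], 3));
// [ [ 'some', 'other', 'words' ], [ 'other', 'words', 'here' ] ]
```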

String Distance

"natural" supplies Dice's coefficient, Levenshtein distance, and Jaro-Winkler distance algorithms for determining string similarity. These algorithms are concerned with orthographic (spelling) similarity, not necessarily phonetics.

Each algorithm produces a number indicating its perception of similarity, but each is determined differently and they can even move in opposite directions. For instance, the more dissimilar two strings are, the greater the Levenshtein distance, whereas Jaro-Winkler scores two totally dissimilar strings at 0 and identical strings at 1.

The following example shows each algorithm's perception of the difference between the words "execution" and "intention".

var natural = require('natural');

console.log(natural.JaroWinklerDistance('execution', 'intention'));
console.log(natural.LevenshteinDistance('execution', 'intention'));
console.log(natural.DiceCoefficient('execution', 'intention'));

resulting in the output:

0.48148148148148145
8
0.375

Now consider totally identical strings.

var natural = require('natural');

console.log(natural.JaroWinklerDistance('same', 'same'));
console.log(natural.LevenshteinDistance('same', 'same'));
console.log(natural.DiceCoefficient('same', 'same'));

which yields:

1
0
1
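For intuition, Levenshtein distance is computed with a classic dynamic-programming recurrence; here's a plain-JavaScript sketch. Note that implementations vary in how they cost a substitution: with unit cost, 'execution' vs. 'intention' scores 5, while natural's 8 above suggests it charges a substitution as a delete plus an insert:

```javascript
// dp[i][j] = minimum edits turning the first i chars of a into the
// first j chars of b, with unit-cost substitutions.
function levenshtein(a, b) {
  var dp = [];
  for (var i = 0; i <= a.length; i++) {
    dp.push([i]); // column 0: delete i characters
    for (var j = 1; j <= b.length; j++) {
      if (i === 0) {
        dp[0][j] = j; // row 0: insert j characters
      } else {
        var cost = a.charAt(i - 1) === b.charAt(j - 1) ? 0 : 1;
        dp[i][j] = Math.min(dp[i - 1][j] + 1,         // deletion
                            dp[i][j - 1] + 1,         // insertion
                            dp[i - 1][j - 1] + cost); // substitution
      }
    }
  }
  return dp[a.length][b.length];
}

console.log(levenshtein('kitten', 'sitting')); // 3
console.log(levenshtein('same', 'same'));      // 0
```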

Conclusion and Roadmap

Well, that was a summary of a sizable portion of "natural". Many of the algorithms have additional parameters that can be used to tweak their operation, and a few modules weren't represented at all, but the official README can help fill that gap.

There's still plenty in store for "natural". While the current plan is certainly not limited to the following points, these are indeed slated for at least some kind of attention by fall 2012.

  • Non-English-specific stemming algorithms
  • Pure JavaScript version
  • Maximum entropy classifier
  • Clustering algorithms (k-means in development)
  • Part of speech tagging
  • Punkt sentence segmentation

With the exception of k-means, which is near completion, I'd love community help on nearly every one! To either help out or follow along, check out the GitHub repository.


Published at DZone with permission of Christopher Umbel, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
