Sentiment Shoot-Out: Part I

You can use different sentiment analysis libraries depending on your needs. Read on for a look at how several of them compare on performance and accuracy.

Let's test out various sentiment frameworks.

Performance and Accuracy

If anyone has numbers for Deep Learning-based sentiment analysis or other frameworks, let me know. Comment here. Thanks! Also, let me know if you have a large dataset you want to offer, and we can use it for Part 2 with the numbers.

Round 'Em Up

For NLP, mostly I want to do two things:

Entity Recognition 

This involves people, facilities, organizations, locations, products, events, works of art, languages, groups, dates, times, percentages, and money, as well as quantities, ordinals, and cardinals.

Sentiment Analysis

What are people saying, and do they like it or not?

These two features are very useful as part of real-time stream processing of social media, email, log, and semi-structured document data. I can use both of them on Twitter data ingested via Apache NiFi or Apache Spark. Don't confuse text entity recognition with the image recognition we looked at previously with TensorFlow. You can certainly add that to your flow as well, but it works on images, not text.
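To make that concrete, here is a minimal sketch of the per-record work a streaming flow would do, combining entity recognition and sentiment scoring on a single incoming tweet with two of the libraries covered below (the sample tweet text is my own):

import spacy
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Load the English model once, outside the per-record path (loading is slow)
nlp = spacy.load('en')
# Requires the VADER lexicon: run nltk.download('vader_lexicon') once first
sid = SentimentIntensityAnalyzer()

tweet = u"Apache NiFi makes streaming ingest from Twitter painless."
doc = nlp(tweet)
entities = [(ent.text, ent.label_) for ent in doc.ents]
compound = sid.polarity_scores(tweet)['compound']
print(entities, compound)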

My debate with sentiment analysis is: do you keep to really general labels (like neutral, negative, or positive), or do you get more detailed (like Stanford CoreNLP, which has multiple levels of each)?
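Both styles can usually be derived from the same underlying score, so this is really a labeling decision. A minimal sketch, with thresholds that are my own rather than from any particular library:

def coarse(score):
    # Three general buckets for a score in [-1, 1]
    if score < 0:
        return 'Negative'
    if score > 0:
        return 'Positive'
    return 'Neutral'

def fine(score):
    # Five buckets, closer to Stanford CoreNLP-style granularity
    if score <= -0.6:
        return 'Very Negative'
    if score <= -0.2:
        return 'Negative'
    if score < 0.2:
        return 'Neutral'
    if score < 0.6:
        return 'Positive'
    return 'Very Positive'

print(coarse(-0.75), fine(-0.75))  # Negative Very Negative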

There are a lot of libraries available for NLP and sentiment analysis. The first two decisions are:

  1. Do you want to run JVM programs, which are good for Hadoop MR, Apache Spark, Apache Storm, enterprise applications, Spring applications, microservices, NiFi processors, Hive UDFs, and Pig UDFs, and which support multiple programming languages (e.g., Java and Scala)?

  2. Or do you want to run on Python, which is already well-known by many data scientists and engineers, is simple to prototype with no compiling, is very easy to call from NiFi and scripts, and has a ton of great Deep Learning libraries and interfaces?

Python Libraries

Like most things in Python, you can use pip to install them. You will need a Python 2.7 or 3.x environment with pip set up to install and use the libraries I have looked at. spaCy requires NumPy, as do many of the others.

spaCy:

pip install -U spacy
python -m spacy.en.download all

Downloading parsing model
Downloading...
Downloaded 532.28MB 100.00% 9.59MB/s eta 0s
archive.gz checksum/md5 OK
Model successfully installed to /usr/lib64/python2.7/site-packages/spacy/data
Downloading GloVe vectors
Downloading...
Downloaded 708.08MB 100.00% 19.38MB/s eta 0s
archive.gz checksum/md5 OK
Model successfully installed to /usr/lib64/python2.7/site-packages/spacy/data

After you install the package, you need to download the text and models used by the tool (that is what the second command above does).

import spacy

nlp = spacy.load('en')
doc5 = nlp(u"Timothy Spann is studying at Princeton University in New Jersey.")

# Named Entity Recognizer (NER): iterate the recognized entity spans
for ent in doc5.ents:
    print(ent, ent.label, ent.label_)
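For the sentence above, the output should look roughly like this; I am omitting the numeric ent.label column because the integer IDs vary with the model version:

Timothy Spann PERSON
Princeton University ORG
New Jersey GPE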

spaCy is new and pretty fast and does some cool NER stuff.

NLTK:

import sys

from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Requires the VADER lexicon: run nltk.download('vader_lexicon') once first
sid = SentimentIntensityAnalyzer()
ss = sid.polarity_scores(sys.argv[1])
if ss['compound'] == 0.00:
    print('Neutral')
elif ss['compound'] < 0.00:
    print('Negative')
else:
    print('Positive')

NLTK is usually my go-to Python library. It's pretty quick, very stable, and standard. As you can see, the code to work with it is trivial and can be called from shell scripts, NiFi, Cron, and other streams.
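For instance, saved as a script (the file name here is hypothetical), it takes the text as a command-line argument:

python vader_sentiment.py "This library is fast and the results look great."
Positive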

Another NLTK option:

import sys

from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()
ss = sid.polarity_scores(sys.argv[1])
print('Compound {0} Negative {1} Neutral {2} Positive {3}'
      .format(ss['compound'], ss['neg'], ss['neu'], ss['pos']))

NLTK does sentiment analysis very easily as shown above. It runs fairly quickly, so you can call this in a stream without too much overhead.
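If you want to measure that overhead yourself before wiring it into a stream, here is a quick timing sketch (the sample text and loop count are my choices):

import timeit

from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Reuse one analyzer; constructing it is the slow part, scoring is cheap
sid = SentimentIntensityAnalyzer()
text = "The flow is fast and the results are great."

# Total seconds for 1,000 calls happens to equal milliseconds per call
elapsed = timeit.timeit(lambda: sid.polarity_scores(text), number=1000)
print('{0:.3f} ms per call'.format(elapsed))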

TextBlob:

from textblob import TextBlob

b = TextBlob("Spellin iz vaerry haerd to do. I do not like this spelling product at all it is terrible and I am very mad.")
print(b.correct())
print(b.sentiment)
print(b.sentiment.polarity)

python tb.py
Spelling in very heard to do. I do not like this spelling product at all it is terrible and I am very mad.
Sentiment(polarity=-0.90625, subjectivity=1.0)
-0.90625

TextBlob is a nice library that does sentiment analysis and other useful text processing like language translation and spell checking.
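It also splits text into sentences, each carrying its own polarity and subjectivity, which is handy when a document mixes opinions. A small sketch (the sample text is mine):

from textblob import TextBlob

b = TextBlob("The install was painless. The documentation is thin.")
for sentence in b.sentences:
    # Each Sentence has its own (polarity, subjectivity) pair
    print(sentence, sentence.sentiment)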

The install will look familiar.

sudo pip install -U textblob
sudo python -m textblob.download_corpora

JVM

Natural Language Processing for JVM languages (NLP4J) is one option. I have not tried this one yet.

Apache OpenNLP

This one is very widely used and is an Apache project, which makes the licensing ideal for most users. I have a long example in a separate article on Apache OpenNLP.

StanfordNLP

I love StanfordNLP. It works very well, integrates into a Twitter processing flow, and is very accurate. The only issue for many is that it is GPL-licensed, which for many use cases will require purchasing a commercial license. It is very easy to use Stanford CoreNLP from Java, Scala, NiFi, and Spark. Stanford NLP has been around forever and is super solid. I have built a NiFi processor to work with it, and it returns almost instantly when analyzing tweets.

import java.util.Properties

import org.apache.log4j.{Level, Logger}
import org.apache.phoenix.spark._
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.serializer.KryoSerializer
import org.apache.spark.sql._

import com.vader.SentimentAnalyzer
import edu.stanford.nlp.ling.CoreAnnotations
import edu.stanford.nlp.neural.rnn.RNNCoreAnnotations
import edu.stanford.nlp.pipeline.StanfordCoreNLP
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations

import scala.collection.JavaConversions._
import scala.collection.mutable.ListBuffer

case class Tweet(coordinates: String, geo: String, handle: String, hashtags: String, language: String,
                 location: String, msg: String, time: String, tweet_id: String, unixtime: String,
                 user_name: String, tag: String, profile_image_url: String, source: String,
                 place: String, friends_count: String, followers_count: String, retweet_count: String,
                 time_zone: String, sentiment: String, stanfordSentiment: String)

// convert() and nlpProps are defined elsewhere in the surrounding processor
val message = convert(anyMessage)
val pipeline = new StanfordCoreNLP(nlpProps)
val annotation = pipeline.process(message)
var sentiments: ListBuffer[Double] = ListBuffer()
var sizes: ListBuffer[Int] = ListBuffer()

var longest = 0
var mainSentiment = 0

// Score each sentence (0-4 scale) and remember the sentiment of the longest one
for (sentence <- annotation.get(classOf[CoreAnnotations.SentencesAnnotation])) {
  val tree = sentence.get(classOf[SentimentCoreAnnotations.AnnotatedTree])
  val sentiment = RNNCoreAnnotations.getPredictedClass(tree)
  val partText = sentence.toString

  if (partText.length() > longest) {
    mainSentiment = sentiment
    longest = partText.length()
  }

  sentiments += sentiment.toDouble
  sizes += partText.length
}

val averageSentiment: Double = {
  if (sentiments.nonEmpty) sentiments.sum / sentiments.size
  else -1
}

// Weight each sentence's score by its length so long sentences count for more
val weightedSentiments = (sentiments, sizes).zipped.map((sentiment, size) => sentiment * size)
var weightedSentiment = weightedSentiments.sum / sizes.fold(0)(_ + _)

if (sentiments.isEmpty) {
  mainSentiment = -1
  weightedSentiment = -1
}

// Map the weighted score onto the sentiment labels below
weightedSentiment match {
  case s if s <= 0.0 => NOT_UNDERSTOOD
  case s if s < 1.0 => VERY_NEGATIVE
  case s if s < 2.0 => NEGATIVE
  case s if s < 3.0 => NEUTRAL
  case s if s < 4.0 => POSITIVE
  case s if s < 5.0 => VERY_POSITIVE
  case _ => NOT_UNDERSTOOD
}

trait SENTIMENT_TYPE
case object VERY_NEGATIVE extends SENTIMENT_TYPE
case object NEGATIVE extends SENTIMENT_TYPE
case object NEUTRAL extends SENTIMENT_TYPE
case object POSITIVE extends SENTIMENT_TYPE
case object VERY_POSITIVE extends SENTIMENT_TYPE
case object NOT_UNDERSTOOD extends SENTIMENT_TYPE

Summary

Do you have to use just one of these libraries? Of course not. I use different ones depending on my needs. Licensing, performance, accuracy on your dataset, programming language choice, enterprise environment, the volume of data, your corpus, the human language involved, and many other factors come into play. One size does not fit all. If you have sophisticated data scientists and strong Machine Learning pipelines, you may want to pick one and build up your own custom models and corpus.
