Simple Sentiment Analysis With NLP

Look at a simple application of sentiment analysis using Natural Language Processing techniques.

By Emrah Mete · Dec. 04, 18 · Tutorial

In this article, I will develop a simple application of sentiment analysis using natural language processing techniques.

Following the developments in artificial intelligence, the number of applications built for natural language processing (NLP) is growing day by day. NLP applications let us build systems that work faster and more accurately by reducing the manual effort needed in many jobs. Common examples of applications developed with NLP include the following:

  • Text classification (e.g., spam detection)
  • Sentiment analysis
  • Author recognition
  • Machine translation
  • Chatbots

Sentiment analysis is one of the most common applications of natural language processing. With sentiment analysis, we can determine the emotion with which a text was written.

With the widespread use of social media, the need to analyze the content people share is increasing day by day. Given the volume of data coming through social media, doing this manually is quite difficult. Therefore, the need for applications that can quickly detect and respond to the positive or negative comments people write keeps growing. In this article, we will develop a baseline model for simple sentiment analysis.

First, some information about the data set on which we will perform sentiment analysis.

Data Set Name: Sentiment Labelled Sentences Data Set

Data Set Source: UCI Machine Learning Repository

Data Set Info: This data set was created from user reviews collected from three different websites (Amazon, Yelp, IMDb) and consists of restaurant, film, and product reviews. Each record is labeled with one of two classes: 1 (positive) or 0 (negative).
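Each line in the raw files holds a sentence and its label separated by a tab, roughly like this (illustrative lines, not actual records from the data set):

Great phone, the battery lasts for days.	1
The food was cold and the service was slow.	0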

We will create a sentiment analysis model using the data set we have given above.

We will build the machine learning model in the Python programming language using the scikit-learn (sklearn) and NLTK libraries.

Now we can go to the writing part of our code.

First, let's import the libraries we will use.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
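Note that NLTK ships its tokenizer models and stop word list as separate downloads; if they are not present, word_tokenize and stopwords.words will raise a LookupError. A one-time download fixes this:

import nltk
nltk.download('punkt')      # tokenizer models used by word_tokenize
nltk.download('stopwords')  # English stop word list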

Now let's load and view our data set.

#Amazon Data
input_file = "../data/amazon_cells_labelled.txt"
amazon = pd.read_csv(input_file,delimiter='\t',header=None)
amazon.columns = ['Sentence','Class']

#Yelp Data
input_file = "../data/yelp_labelled.txt"
yelp = pd.read_csv(input_file,delimiter='\t',header=None)
yelp.columns = ['Sentence','Class']

#Imdb Data
input_file = "../data/imdb_labelled.txt"
imdb = pd.read_csv(input_file,delimiter='\t',header=None)
imdb.columns = ['Sentence','Class']


#combine all data sets (ignore_index gives every row a unique index)
data = pd.concat([amazon, yelp, imdb], ignore_index=True)
data['index'] = data.index

data

We have loaded and viewed the data. Now let's look at some statistics about it.

#Total Count of Each Category
pd.set_option('display.width', 4000)
pd.set_option('display.max_rows', 1000)
distOfDetails = data.groupby(by='Class', as_index=False).agg({'index': pd.Series.nunique}).sort_values(by='index', ascending=False)
distOfDetails.columns =['Class', 'COUNT']
print(distOfDetails)

#Distribution of All Categories
plt.pie(distOfDetails['COUNT'],autopct='%1.0f%%',shadow=True, startangle=360)
plt.show()

As you can see, the data set is very balanced. There are almost equal numbers of positive and negative classes.

Now, before using the data set in the model, let's clean up the text (preprocessing).

#Text Preprocessing
columns = ['index', 'Class', 'Sentence']

#lowercase strings
data['Sentence'] = data['Sentence'].str.lower()

#remove email addresses
data['Sentence'] = data['Sentence'].replace('[a-zA-Z0-9-_.]+@[a-zA-Z0-9-_.]+', '', regex=True)

#remove IP addresses
data['Sentence'] = data['Sentence'].replace('((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)(\.|$)){4}', '', regex=True)

#remove punctuation and special characters
data['Sentence'] = data['Sentence'].str.replace('[^\w\s]', '', regex=True)

#remove numbers
data['Sentence'] = data['Sentence'].replace('\d', '', regex=True)

#remove stop words (build the set once; per-row membership tests are then fast)
stop_words = set(stopwords.words('english'))
rows = []
for index, row in data.iterrows():
    word_tokens = word_tokenize(row['Sentence'])
    filtered_sentence = [w for w in word_tokens if w not in stop_words]
    rows.append({"index": row['index'], "Class": row['Class'], "Sentence": " ".join(filtered_sentence)})

data = pd.DataFrame(rows, columns=columns)

The data is now cleaned and ready for use in the model. Before we build the model, let's split the data set into test (10%) and training (90%) sets.

X_train, X_test, y_train, y_test = train_test_split(data['Sentence'].values.astype('U'),data['Class'].values.astype('int32'), test_size=0.10, random_state=0)
classes  = data['Class'].unique()

Now we can create our model using the training data. I will use TF-IDF as the vectorizer and the Stochastic Gradient Descent (SGD) algorithm as the classifier. These methods and their parameters were chosen using a grid search (grid search itself is not covered in this article).
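For readers curious what such a search might look like, here is a minimal sketch using scikit-learn's GridSearchCV (the pipeline and parameter grid below are illustrative, not the exact grid used for this article):

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier

#illustrative search space; a real search can be much larger
pipeline = Pipeline([('tfidf', TfidfVectorizer()), ('clf', SGDClassifier())])
param_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'tfidf__max_df': [0.5, 0.75, 1.0],
    'clf__alpha': [1e-4, 1e-5],
    'clf__penalty': ['l2', 'elasticnet'],
}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)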

from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier


#grid search result
vectorizer = TfidfVectorizer(analyzer='word',ngram_range=(1,2), max_features=50000,max_df=0.5,use_idf=True, norm='l2') 
counts = vectorizer.fit_transform(X_train)
vocab = vectorizer.vocabulary_
classifier = SGDClassifier(alpha=1e-05,max_iter=50,penalty='elasticnet')
targets = y_train
classifier = classifier.fit(counts, targets)
example_counts = vectorizer.transform(X_test)
predictions = classifier.predict(example_counts)
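Before the formal evaluation, a quick sanity check on a couple of made-up sentences (illustrative inputs, not from the data set) shows how the fitted vectorizer and classifier are used together:

#quick sanity check on new, made-up sentences
samples = ["this product is wonderful", "worst purchase I have ever made"]
print(classifier.predict(vectorizer.transform(samples)))  #expect something like [1 0]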

Our model is trained. Now let's test it with the test data and examine the accuracy, precision, recall, and F1 results.
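As a reminder, for a given class: precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 is their harmonic mean, 2 * precision * recall / (precision + recall), where TP, FP, and FN are the true positives, false positives, and false negatives.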

from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import classification_report

#Model Evaluation
acc = accuracy_score(y_test, predictions, normalize=True)
precision = precision_score(y_test, predictions, average=None, labels=classes)
recall = recall_score(y_test, predictions, average=None, labels=classes)

print('Model Accuracy:%.2f'%acc)
print(classification_report(y_test, predictions))


Model Accuracy:0.83
             precision    recall  f1-score   support

          0       0.83      0.84      0.84       139
          1       0.84      0.82      0.83       136

avg / total       0.83      0.83      0.83       275



As we can see, our model achieved 83% accuracy. Now let's look at the confusion matrix, where we can see more clearly how accurate our predictions are.

#source: https://www.kaggle.com/grfiv4/plot-a-confusion-matrix
import itertools
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    plt.imshow(cm, interpolation='nearest', cmap=cmap, aspect='auto')
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, predictions, labels=classes)
np.set_printoptions(precision=2)

class_names = range(1, classes.size + 1)

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix, without normalization')
plt.show()

#map each class label to the index used on the plot axes
classInfo = pd.DataFrame({'Category': classes, 'Index': range(1, classes.size + 1)})
classInfo

With this study, we have developed a natural language processing project. As I said at the beginning of the article, our model is a baseline model. The aim of this article was to build an application that can serve as an introduction to natural language processing. I hope you have found it useful.


Opinions expressed by DZone contributors are their own.
