A Developer's Guide to Sentiment Analysis With Naive Bayes and Python

Learn sentiment analysis with Python and Scikit-learn using Naive Bayes. Build, train, and evaluate a text classifier for real-world applications.

Soumya Banerjee

Nov. 06, 25 · Tutorial

Sentiment analysis is a powerful tool for understanding customer feedback, social media comments, and product reviews. It allows us to programmatically determine whether a piece of text is positive, negative, or neutral. While complex models like Transformers (e.g., BERT) often grab the headlines, the classic Multinomial Naive Bayes classifier remains a surprisingly effective, efficient, and interpretable baseline, especially for text-based tasks.

In this guide, we'll walk through a complete sentiment analysis project using Python and Scikit-learn. We'll cover:

Why Naive Bayes is a great starting point for text
Exploratory Data Analysis (EDA) to understand our dataset
Data preprocessing to clean and prepare the text
Vectorization using TF-IDF
Model training using a Scikit-learn Pipeline
Performance evaluation with code for Confusion Matrix, ROC-AUC, and Precision-Recall curves
Making predictions on new, unseen data

Why Naive Bayes for Text?

Before we dive in, why choose Naive Bayes?

Speed: It's incredibly fast to train, even on large datasets.
Efficiency: It requires a relatively small amount of training data to produce a decent result.
Works well with high dimensions: Text classification is a high-dimensional problem (one dimension for every unique word in your vocabulary). Naive Bayes handles this "wide" data gracefully.

It's called "naive" because it makes a "naive" assumption: that, given the sentiment class, the presence of one word in a document is independent of the presence of all other words. This is obviously false (the word "New" is highly dependent on the word "York"), yet the model often works remarkably well in practice.
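
To make the independence assumption concrete, here is a minimal from-scratch sketch of how a multinomial Naive Bayes score is put together: a class prior plus one smoothed per-word likelihood for every word, added up in log space. This is not Scikit-learn's implementation; the toy documents, the function names, and the `alpha=1.0` smoothing value are assumptions made for illustration.

```python
import math
from collections import Counter

# Toy, already-tokenized training documents (hypothetical)
train = [
    (["great", "movie"], 1),
    (["great", "acting"], 1),
    (["bad", "movie"], 0),
    (["bad", "plot"], 0),
]

vocab = sorted({word for words, _ in train for word in words})

def log_posterior(words, label, alpha=1.0):
    """Unnormalized log P(label | words) under the naive independence assumption."""
    docs = [doc for doc, y in train if y == label]
    prior = len(docs) / len(train)
    counts = Counter(word for doc in docs for word in doc)
    total = sum(counts.values())
    score = math.log(prior)
    for word in words:
        # Each word contributes on its own; no word looks at its neighbours
        likelihood = (counts[word] + alpha) / (total + alpha * len(vocab))
        score += math.log(likelihood)
    return score

def predict(words):
    """Pick the label with the highest unnormalized log posterior."""
    return max((0, 1), key=lambda label: log_posterior(words, label))

print(predict(["great", "plot"]))
print(predict(["bad", "movie"]))
```

Because every word is scored separately, training boils down to counting, which is where the speed comes from.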

Step 1: Setup and Exploratory Data Analysis (EDA)

Before writing any machine learning code, we must first understand our data. For this project, we'll simulate a movie review dataset with pre-labeled positive (1) and negative (0) reviews.

First, let's get our imports and data ready.

Python

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from wordcloud import WordCloud

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    accuracy_score, 
    confusion_matrix, 
    classification_report, 
    RocCurveDisplay, 
    PrecisionRecallDisplay
)

# --- Download NLTK assets ---
# (Only need to run this once)
# nltk.download('stopwords')
# nltk.download('wordnet')
# nltk.download('omw-1.4') # For WordNet

# --- 1. Simulate our DataFrame ---
# In a real project, you'd use pd.read_csv() here
data = {
    'review': [
        "This movie was absolutely fantastic! The acting was superb.",
        "I loved this film. The story was compelling and beautiful.",
        "A truly great and moving picture. Highly recommend.",
        "What a wonderful movie. I'll watch it again.",
        "The best film I've seen all year.",
        "Completely boring. I fell asleep halfway through.",
        "A terrible plot and awful acting. Do not recommend.",
        "This was a bad movie. Just plain bad.",
        "I hated it. The end was a letdown.",
        "The characters were flat and the story was predictable."
    ],
    'sentiment': [1, 1, 1, 1, 1, 0, 0, 0, 0, 0] # 1 = Positive, 0 = Negative
}
df = pd.DataFrame(data)
  

Now, let's explore.

1. Check Class Balance

Is our dataset balanced? An imbalanced dataset (e.g., 90% positive, 10% negative) can mislead our model.

Python

sns.countplot(x='sentiment', data=df)
plt.title('Class Distribution (0=Negative, 1=Positive)')
plt.show()

Our tiny dataset is perfectly balanced, which is ideal.
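
A plot is handy, but it also helps to have the numbers. The sketch below uses only the standard library and a plain list of labels (the same ten labels as our toy dataset) to compute the class shares and the accuracy of a "always predict the majority class" baseline. With the DataFrame from above, `df['sentiment'].value_counts(normalize=True)` should give the same shares.

```python
from collections import Counter

# Same labels as the toy dataset (1 = positive, 0 = negative)
sentiment = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

counts = Counter(sentiment)
shares = {label: n / len(sentiment) for label, n in counts.items()}

# Accuracy of a model that always predicts the most common label
majority_baseline = max(shares.values())

print(shares)
print(f"Majority-class baseline accuracy: {majority_baseline:.2f}")
```

Any model we train should beat this baseline; on a 90/10 dataset the baseline alone would already reach 0.90 accuracy.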

2. Visualize With Word Clouds

Word clouds show the most frequent words for each sentiment.

Python

positive_text = ' '.join(df[df['sentiment'] == 1]['review'])
negative_text = ' '.join(df[df['sentiment'] == 0]['review'])

# Positive Word Cloud
wc_positive = WordCloud(width=800, height=400, background_color='white').generate(positive_text)
plt.figure(figsize=(10, 5))
plt.imshow(wc_positive, interpolation='bilinear')
plt.title('Most Frequent Words in Positive Reviews')
plt.axis('off')
plt.show()
   

Positive reviews are dominated by words like "fantastic," "great," "loved," "superb," and "beautiful."

Python

# Negative Word Cloud
wc_negative = WordCloud(width=800, height=400, background_color='black', colormap='Reds').generate(negative_text)
plt.figure(figsize=(10, 5))
plt.imshow(wc_negative, interpolation='bilinear')
plt.title('Most Frequent Words in Negative Reviews')
plt.axis('off')
plt.show()
   

Negative reviews feature "terrible," "awful," "boring," "bad," and "hated."
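
Word clouds are easy to read but hard to quantify. As a rough numeric complement, the sketch below counts raw tokens in the negative reviews with `collections.Counter`. The `top_words` helper and its simple regex tokenizer are assumptions for illustration, not part of the `wordcloud` library.

```python
import re
from collections import Counter

negative_reviews = [
    "Completely boring. I fell asleep halfway through.",
    "A terrible plot and awful acting. Do not recommend.",
    "This was a bad movie. Just plain bad.",
    "I hated it. The end was a letdown.",
    "The characters were flat and the story was predictable.",
]

def top_words(texts, n=5):
    """Return the n most common lowercase word tokens across all texts."""
    tokens = re.findall(r"[a-z']+", " ".join(texts).lower())
    return Counter(tokens).most_common(n)

print(top_words(negative_reviews))
```

Without any filtering, the top of the list is taken up by words like "a", "was", and "the". `WordCloud` hides this because it applies its own built-in stop word list by default, which is a good reminder of why we remove stop words ourselves in Step 2.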

3. Analyze Review Lengths

Is there a correlation between the length of a review and its sentiment?

Python

df['review_length'] = df['review'].apply(len)
sns.histplot(data=df, x='review_length', hue='sentiment', multiple='stack', bins=20)
plt.title('Distribution of Review Lengths by Sentiment')
plt.show()

In our simple data, there's no clear pattern, but in a larger dataset, you might find (for example) that negative reviews are often shorter and more "to the point."
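
If you prefer a number to a histogram, a per-class average is a quick check. This standard-library sketch uses four of our toy reviews and compares character counts with word counts; the `mean_length` and `word_count` helpers are hypothetical names. With the DataFrame, `df.groupby('sentiment')['review_length'].mean()` is the pandas equivalent for character counts.

```python
from statistics import mean

reviews = [
    ("The best film I've seen all year.", 1),
    ("What a wonderful movie. I'll watch it again.", 1),
    ("I hated it. The end was a letdown.", 0),
    ("This was a bad movie. Just plain bad.", 0),
]

def word_count(text):
    """Length measured in whitespace-separated words instead of characters."""
    return len(text.split())

def mean_length(label, unit=len):
    """Average review length for one sentiment label, measured with `unit`."""
    return mean(unit(text) for text, y in reviews if y == label)

for label in (0, 1):
    print(label, mean_length(label), mean_length(label, unit=word_count))
```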

Step 2: Data Preprocessing — Cleaning Our Text

Raw text is messy. To make it usable, we need to clean it. We'll write a single function to do this.

Convert to lowercase: Ensures "Movie" and "movie" are treated as the same word.
Remove punctuation and numbers: These characters generally don't add sentiment value.
Remove stop words: Eliminate common words like "the," "a," and "is" that don't carry sentiment.
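Lemmatization: Reduce words to their dictionary form (e.g., "movies" becomes "movie"). This groups related words together. Note that WordNetLemmatizer treats every word as a noun unless told otherwise, so verb forms are only reduced ("running" becomes "run," "was" becomes "be") when you pass pos='v'.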

Python

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    # 1. Convert to lowercase
    text = text.lower()
    
    # 2. Remove punctuation and numbers
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    
    # 3. Tokenize (split into words)
    words = text.split()
    
    # 4. Remove stop words and lemmatize
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    
    # 5. Join back into a string
    return ' '.join(words)

# Let's see the 'before and after'
print(f"Original: {df['review'][0]}")
print(f"Cleaned:  {preprocess_text(df['review'][0])}")

# Apply this function to our entire 'review' column
df['cleaned_review'] = df['review'].apply(preprocess_text)
   

Output:

Plain Text

Original: This movie was absolutely fantastic! The acting was superb.
Cleaned:  movie absolutely fantastic acting superb
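
One caveat worth checking before moving on: NLTK's English stop word list includes negations such as "not" and "no", so a review like "Do not recommend." is reduced to "recommend". The sketch below reproduces the effect with a small hand-written stop word set (an assumption, standing in for the NLTK list so the example runs without downloads) and shows one possible workaround, which is to keep negation words.

```python
import re

# Small illustrative stop word set (assumption: a stand-in for NLTK's longer list,
# which also contains "not" and "no")
STOP_WORDS = {"a", "and", "do", "the", "this", "was", "not", "no"}
NEGATIONS = {"not", "no"}

def clean(text, keep_negations=False):
    """Lowercase, strip non-letters, and drop stop words (optionally keeping negations)."""
    stop_words = STOP_WORDS - NEGATIONS if keep_negations else STOP_WORDS
    words = re.sub(r"[^a-z\s]", "", text.lower()).split()
    return " ".join(w for w in words if w not in stop_words)

print(clean("Do not recommend."))                       # recommend
print(clean("Do not recommend.", keep_negations=True))  # not recommend
```

In our own function, the equivalent change would be to subtract the negation words from `stop_words` before filtering. Combined with bigrams in the vectorizer, "not recommend" can then become a feature of its own.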

Step 3: Building and Training the Naive Bayes Model

Now we convert the cleaned text into numbers that our model can understand.

Vectorization (TF-IDF)

We'll use TF-IDF (Term Frequency-Inverse Document Frequency).

Term frequency (TF): How often a word appears in a single document (a review).
Inverse document frequency (IDF): How rare a word is across all documents.

This technique gives a high score to words that are frequent in one review but rare in all other reviews. This helps the model find unique, sentiment-bearing words.
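
To see where those scores come from, here is a small hand computation. It follows the formula Scikit-learn documents for TfidfVectorizer's default settings (smoothed IDF, then L2 normalization per document); the three toy documents and the helper names are assumptions for illustration.

```python
import math

# Toy, already-tokenized documents (hypothetical)
docs = [
    ["great", "movie"],
    ["bad", "movie"],
    ["great", "great", "acting"],
]

def idf(term):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    n = len(docs)
    df = sum(term in doc for doc in docs)
    return math.log((1 + n) / (1 + df)) + 1

def tfidf(doc):
    """Raw term count times IDF, then L2-normalized within the document."""
    weights = {term: doc.count(term) * idf(term) for term in set(doc)}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

for doc in docs:
    print({term: round(w, 3) for term, w in sorted(tfidf(doc).items())})
```

In the second document, "bad" (which appears in one document) ends up with a higher weight than "movie" (which appears in two), which is exactly the behavior described above.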

Using a `Pipeline` (The Right Way)

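The best practice in Scikit-learn is to use a Pipeline. A pipeline chains our steps (vectorizer and model) into one object. This prevents data leakage: when we call fit on the pipeline, the vectorizer learns its vocabulary and IDF weights from the training data only.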

Python

# 1. Define our features (X) and target (y)
X = df['cleaned_review']
y = df['sentiment']

# 2. Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 3. Create the Scikit-learn Pipeline
# This pipeline will:
# 1. Apply TfidfVectorizer
# 2. Train a MultinomialNB model
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=5000, ngram_range=(1, 2))),
    ('model', MultinomialNB())
])

# 4. Train the model
# We just call .fit() on the pipeline!
pipeline.fit(X_train, y_train)

# 5. Make predictions
y_pred = pipeline.predict(X_test)
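
One property of the pipeline worth verifying yourself: calling fit on it fits the vectorizer on the training texts only, so nothing from the test set ends up in the vocabulary. The self-contained sketch below uses a separate `demo` pipeline and made-up texts (both assumptions) so it can be run on its own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

train_texts = ["great movie", "wonderful acting", "bad movie", "awful plot"]
train_labels = [1, 1, 0, 0]
test_texts = ["superb film"]  # words that never occur in the training texts

demo = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("model", MultinomialNB()),
])
demo.fit(train_texts, train_labels)

# The fitted vectorizer is reachable through the step name we gave it
vocabulary = demo.named_steps["tfidf"].vocabulary_
print(sorted(vocabulary))
print("superb" in vocabulary)

# predict() only transforms with the already-fitted vectorizer; nothing is re-learned
print(demo.predict(test_texts))
```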
   

Step 4: Evaluating Model Performance

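How did our model do? We compare its predictions on the X_test data with the true labels in y_test. Keep in mind that with ten reviews and test_size=0.2, the test set contains only two reviews, so the numbers here illustrate the workflow rather than the model's real quality.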

Classification Report and Confusion Matrix

Accuracy: Overall percentage of correct predictions. (Use with caution on imbalanced data!)
Precision: Of all reviews we predicted as positive, how many were actually positive?
Recall: Of all actual positive reviews, how many did we find?
F1-score: The harmonic mean of Precision and Recall. A great all-around metric.
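
The four definitions above can be worked out by hand from the counts of true/false positives and negatives. The labels and predictions in this sketch are hypothetical values chosen only to illustrate the formulas; they are not results from our model.

```python
# Hypothetical labels and predictions, chosen only to illustrate the formulas
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_hat = [1, 1, 1, 0, 1, 0, 0, 0]

pairs = list(zip(y_true, y_hat))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # 3 true positives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # 1 false positive
fn = sum(t == 1 and p == 0 for t, p in pairs)  # 1 false negative
tn = sum(t == 0 and p == 0 for t, p in pairs)  # 3 true negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```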

Python

# 1. Accuracy
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
print("-" * 40)

# 2. Classification Report
print(classification_report(y_test, y_pred))
print("-" * 40)

# 3. Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Predicted Negative', 'Predicted Positive'],
            yticklabels=['Actual Negative', 'Actual Positive'])
plt.title('Confusion Matrix')
plt.show()
   

ROC Curve (Receiver Operating Characteristic)

The ROC Curve plots the True Positive Rate against the False Positive Rate.

A curve that hugs the top-left corner (AUC = 1.0) indicates a perfect model.
A diagonal line (AUC = 0.5) is a model that's no better than random guessing.
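
A useful way to read the AUC number: it is the probability that a randomly chosen positive review gets a higher score than a randomly chosen negative one. The sketch below computes it that way by brute force. The `auc_by_ranking` helper and the score lists are assumptions for illustration; in practice you would rely on Scikit-learn's metrics.

```python
def auc_by_ranking(y_true, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win.
    """
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 1, 0, 0, 0]
perfect = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]   # every positive outranks every negative
mixed = [0.9, 0.4, 0.7, 0.6, 0.2, 0.1]     # one positive is ranked below a negative
constant = [0.5] * 6                        # no ranking information at all

print(auc_by_ranking(y_true, perfect))
print(auc_by_ranking(y_true, mixed))
print(auc_by_ranking(y_true, constant))
```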

Python

# Plot ROC Curve
RocCurveDisplay.from_estimator(pipeline, X_test, y_test)
plt.title('ROC Curve for Naive Bayes Classifier')
plt.plot([0, 1], [0, 1], 'r--', label='Random Guess')
plt.legend()
plt.show()
   

Precision-Recall Curve

This curve is particularly useful when dealing with imbalanced datasets, as it focuses on the performance of the positive class.
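
A quick piece of arithmetic shows why. The counts below are hypothetical, describing an imbalanced test set with 10 positive and 1,000 negative reviews: the ROC axes look healthy, while precision reveals that most positive predictions are wrong.

```python
# Hypothetical counts for an imbalanced test set: 10 positives, 1000 negatives
tp, fn = 8, 2
fp, tn = 50, 950

true_positive_rate = tp / (tp + fn)    # recall, the y-axis of the ROC curve
false_positive_rate = fp / (fp + tn)   # the x-axis of the ROC curve
precision = tp / (tp + fp)             # the y-axis of the precision-recall curve

print(f"TPR={true_positive_rate:.2f} FPR={false_positive_rate:.2f} precision={precision:.2f}")
```

The false positive rate stays low because it is divided by the large number of negatives, whereas precision compares false positives directly with the handful of true positives.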

Python

# Plot Precision-Recall Curve
PrecisionRecallDisplay.from_estimator(pipeline, X_test, y_test)
plt.title('Precision-Recall Curve')
plt.show()

Step 5: Predicting on New Data

The best part! Let's use our trained pipeline to predict sentiment on new, raw text. Note that the pipeline contains only the TF-IDF vectorizer and the model; the cleaning from Step 2 happened outside of it, so we call preprocess_text ourselves before handing the text to the pipeline.

Python

def predict_sentiment(text):
    # 1. Clean the raw text exactly as we cleaned the training data
    #    (preprocess_text is not part of the pipeline)
    cleaned = preprocess_text(text)
    # 2. The pipeline TF-IDF vectorizes the cleaned text and predicts
    prediction = pipeline.predict([cleaned])[0]
    probability = pipeline.predict_proba([cleaned])[0]

    if prediction == 1:
        return f"Positive (Confidence: {probability[1]:.2f})"
    else:
        return f"Negative (Confidence: {probability[0]:.2f})"

# Try it out
print(predict_sentiment("This was the best movie I have ever seen!"))
print(predict_sentiment("The acting was stiff and the plot was just awful."))
   

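Example output (illustrative; the exact confidence values depend on the training data, and a model trained on this tiny toy dataset will typically be far less certain):

Plain Text

Positive (Confidence: 0.95)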
Negative (Confidence: 0.99)
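
It's also worth knowing what happens when a new review shares no words with the training vocabulary. In that case the TF-IDF vector is all zeros and the model can only fall back on the class priors. The sketch below demonstrates this with a separate `demo` pipeline and made-up texts (both assumptions), so it runs independently of the code above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

demo = Pipeline([("tfidf", TfidfVectorizer()), ("model", MultinomialNB())])
demo.fit(["great movie", "wonderful acting", "bad movie", "awful plot"], [1, 1, 0, 0])

# None of these tokens were seen during training, so every feature is zero
proba = demo.predict_proba(["zzz qqq"])[0]
print(proba)
```

With two training reviews per class, both probabilities come out at 0.5. A "confidence" near the class prior is therefore a hint that the model recognized little or nothing in the input.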

Conclusion

We've successfully built an end-to-end sentiment analysis classifier. We started by exploring our text data, then moved on to cleaning the text, building a robust Pipeline, training a Naive Bayes model, and finally, evaluating its performance with industry-standard metrics.

This process provides a solid baseline that can be applied to a wide range of text classification problems. From here, you could experiment with different vectorizers (like CountVectorizer), tune model hyperparameters (like alpha for Naive Bayes), or use this model's performance as a benchmark to see if more complex models (like Logistic Regression or Transformers) provide a significant lift.
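
As a starting point for the tuning idea, here is a minimal sketch of a grid search over alpha using GridSearchCV. The eight short texts, the candidate alpha values, and `cv=2` are assumptions picked so the example runs quickly; with real data you would use more folds and a wider grid.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical, already-cleaned reviews
texts = [
    "fantastic superb acting", "loved compelling beautiful story",
    "great moving picture", "wonderful movie watch again",
    "boring fell asleep", "terrible plot awful acting",
    "bad movie plain bad", "hated letdown",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = Pipeline([("tfidf", TfidfVectorizer()), ("model", MultinomialNB())])

# "model__alpha" addresses the alpha parameter of the pipeline step named "model"
param_grid = {"model__alpha": [0.1, 0.5, 1.0]}
search = GridSearchCV(pipe, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.2f}")
```

Because the search wraps the whole pipeline, the vectorizer is re-fit inside every fold, so the cross-validation scores are not contaminated by held-out text.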
