Spam filtering with a Naive Bayes Classifier in R

Giorgio Sironi · Feb. 16, 12

In machine learning, a classifier is a function that assigns one of a few available categories to an element of a population. For example, a spam filter is a classifier that assigns the label "spam" or "not spam" to every possible mail (of course with a measurable margin of error: no classifier is perfect).

One of the simplest classifiers you can build is the Bayes classifier, in its naive version. I won't go into the mathematical derivation of the classifier, but I will show you a full example of how to detect spam in R.

First iteration: training set and priors

First, we will build an even simpler classifier that outputs a fixed category, to gain familiarity with the data and the R data structures we will use.
Every classifier needs a training set, which comprises some samples of the population belonging to the categories we want to distinguish between. Each category sample is a list of mails, and each mail is in turn represented as a vector of words: in this approach we do not need more information than the presence or absence of a word in a mail. For simplicity, we also do not count repetitions.

# training set
spam <- list(c("buy", "drugs", "online", "from", "our", "pharma"),
            c("buy", "insurance", "at", "low", "prices"))
legitimate <- list(c("newsletter", "from", "your", "favorite", "website"),
                   c("I", "was", "writing", "for", "php", "advice"),
                   c("good", "article", "on", "php"))

The actual training analyzes the training set and outputs a series of parameters which will be used by the classifier. In this first iteration, we are only estimating the frequency of the two categories, without looking at words:

# training
categories <- 2
priors <- c()
total <- length(spam) + length(legitimate)
priors[1] <- length(spam) / total       # relative frequency of spam mails
priors[2] <- length(legitimate) / total # relative frequency of legitimate mails

These are the estimated priors:

> priors
[1] 0.4 0.6

which means that, prior to any further analysis, a mail has a 0.4 probability of being spam and a 0.6 probability of being legitimate (according to our training set).

The classifier itself scores a mail against every category and chooses the one with the maximum score. The score function is very simple in this case, since it just returns the prior (the relative frequency) of the category.

# classifier
score <- function (test_mail, category) {
    priors[category]
}

classify <- function(test_mail) {
    scores <- c()
    for (i in 1:categories) {
        scores[i] <- score(test_mail, i)
    }
    print(scores)
    # return the index of the category (or categories) with the highest score
    which(scores == max(scores))
}

Of course such a fixed classifier is not useful; let's try it out:

# validation set
print(classify(c("php", "object", "oriented")))
print(classify(c("free", "drugs")))
print(classify(c("r", "article")))
[1] 2
[1] 2
[1] 2


Second iteration: features

Priors are good at blindly estimating the probability of an event: a category with a 0.001 prior shouldn't be taken into consideration as much as one with a 0.1 prior. If you're detecting a user's nationality, you won't routinely assign "New Zealand" or "Madagascar" as the category, but more likely "Canada" or "USA".

However, evidence can shift the balance: if you're reading Maori in the Accept-Language header, or finding "Auckland" in the user's search queries, you're much more inclined to assign the sample to New Zealand, even if it is relatively rare to encounter such a visitor.

The same reasoning can be derived formally by applying Bayes' theorem to the maximization of the probability that a sample belongs to each of the available categories; that is, we look for the category with maximum P(C|X), where X is the sample. Under the naive assumption that words appear independently of each other, this is equivalent to maximizing P(C)*P(f1|C)*...*P(fn|C) over C:

  • C can be either the spam or the legitimate mail category in our example.
  • P(C) is the prior, the frequency of spam or legitimate mails.
  • P(fi|C) is the probability of the word fi appearing in a spam or legitimate mail.
  • The whole expression is our score function for category C (a worked numeric example follows this list).
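
As a quick sanity check of this product, here is the score of the mail "free drugs" against the spam category, computed by hand from the prior estimated above and the smoothed word frequencies that the training step below will produce:

# score("free drugs", spam) = P(spam) * P("free"|spam) * P("drugs"|spam)
prior_spam   <- 0.4   # estimated earlier from the training set
p_free_spam  <- 0.5   # "free" never appears in the spam samples: smoothed zero-occurrence value
p_drugs_spam <- 1.0   # "drugs" appears in one of the two spam mails (smoothed)
prior_spam * p_free_spam * p_drugs_spam   # 0.2, the spam score printed in the validation section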

This score function is not a probability (who cares?) because it lacks a normalization factor; it is also not comparable across mails, so we can only compare the scores a mail gets in the two categories in order to assign it, but we cannot use the value any further. This is the new function:

score <- function (test_mail, category) {
    score <- priors[category]
    categoryFeatures <- features[[category]]
    for (word in test_mail) {
        if (contains(categoryFeatures, word)) {
            # multiply by the smoothed frequency of the word in this category
            score <- score * categoryFeatures[[word]]
        } else {
            # a word never seen in this category contributes a small fixed factor
            score <- score * zeroOccurrences[[category]]
        }
    }
    return(score)
}

With zeroOccurrences[[category]] we denote a small value chosen for P(fi|category) when fi is a word that did not appear in the training set for that category. For all the other words, features[[category]][[word]] denotes P(word|category):

> features[[1]]
$buy
[1] 1.5
$drugs
[1] 1
$online
[1] 1
$from
[1] 1
> features[[2]]
$writing
[1] 0.6666667
$`for`
[1] 0.6666667
$php
[1] 1
$advice
[1] 0.6666667

Actually, these are not probabilities anymore, due to the smoothing applied in their estimation:

features <- list()
zeroOccurrences <- list()
for (category in 1:categories) {
    categoryFeatures <- list()
    # additive smoothing: one conceptual extra appearance per word,
    # normalized by the number of mails in this category
    singleOccurrence <- 1 / length(training[[category]])
    zeroOccurrences[[category]] <- singleOccurrence
    for (sampleMail in training[[category]]) {
        for (word in sampleMail) {
            if (contains(categoryFeatures, word)) {
                categoryFeatures[[word]] <- categoryFeatures[[word]] + singleOccurrence
            } else {
                categoryFeatures[[word]] <- zeroOccurrences[[category]] + singleOccurrence
            }
        }
    }
    features[[category]] <- categoryFeatures
}

We estimate features[[category]][[word]] as the frequency with which word appears at least once in the mails of the training set for category. However, some words do not appear at all in the training set ("buy" in the legitimate set, for example). We cannot assign probability zero to these words, since that would drive the score to 0 without considering all the other words: for example, a mail containing "do you want" would get a score of 0 in both categories, because neither sample contains the word "you".

Thus we apply additive smoothing: to each part of the training set we add a conceptual mail containing all possible words. A word that never appears gets 1 appearance, while a word appearing N times is counted N+1 times.

Note that, given the different sizes of the category samples, these frequencies are normalized by dividing by length(training[[category]]).
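
As a quick check of the smoothing arithmetic, here is how the value features[[1]]$buy = 1.5 shown above comes about:

# "buy" appears in both of the 2 spam mails: count it 2 + 1 times (the extra
# appearance comes from the conceptual mail), then divide by the 2 spam mails
(2 + 1) / length(spam)   # 1.5, matching features[[1]]$buy
# a word that never appears in spam gets the zero-occurrence value instead
(0 + 1) / length(spam)   # 0.5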

Validation

Validation would usually be a more scientific process, but just to get a feel for how our code works we can run the classifier on some new samples:

# validation set
print(classify(c("php", "object", "oriented")))
print(classify(c("free", "drugs")))
print(classify(c("r", "article")))
[1] 0.05000000 0.06666667
[1] 2
[1] 0.20000000 0.06666667
[1] 1
[1] 0.1000000 0.1333333
[1] 2

The odd rows also show the scores for the two categories. We see "php object oriented" is assigned to legitimate, with a score of 0.066 over 0.050; "free drugs" is assigned to spam with a score of 0.200 over 0.066; and "r article" is assigned to legitimate with a score of 0.133 over 0.100. We would need a new annotated sample set to perform a proper validation, but you can see from these examples that we got the basics of the algorithm working correctly.
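
A minimal sketch of what that more systematic check could look like, assuming a small labelled hold-out set (the validation data below is hypothetical):

# hypothetical labelled hold-out set: each entry pairs a mail with its true category
validation <- list(
    list(mail = c("cheap", "drugs", "online"), category = 1),   # spam
    list(mail = c("php", "advice", "please"),  category = 2)    # legitimate
)
correct <- sum(sapply(validation, function(sample) {
    # %in% also copes with ties, where classify() returns more than one category
    sample$category %in% classify(sample$mail)
}))
correct / length(validation)   # fraction of correctly classified mails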
