Stanford NLP and Java 9: Creating an Email Spam Filter

The goal of this article is to use Stanford NLP and Java 9 to create a spam filter that scans all incoming emails and moves any that look like spam to a separate spam folder.

In this article, I would like to illustrate how Stanford's Natural Language Processing and Java 9 can be used to create a spam filter for an email account.

The goal is to scan every incoming message and move any message that contains spam content to a spam folder.

First, we download the following:

  1. Stanford NLP 3.9.1
  2. Jsoup 1.11.2
  3. Java EE (we will use the JavaMail API to connect to and manipulate the email account)
  4. Eclipse Oxygen 4.7.2

Next, we create a project called EmailSpamFilter in Eclipse Oxygen. We will create an application with the following architecture:

[Figure: application architecture diagram]

This is similar to the MVC pattern, but instead of a model and a view, we have EmailListener and MessageNLP. In the source code, the class that represents MessageNLP is EmailTextClassifier. The controller carries out all the orchestration, which keeps the concerns of the other two classes cleanly separated.

The project structure is as follows:

[Figure: project structure]

EmailController will run in an infinite loop, reading the inbox for new emails at a given interval. Here, I have set it to five seconds. For personal use, one can set the interval to be much larger, like every five hours.

[Figure: EmailController source code]
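
The polling loop described above might look roughly like the following. This is a minimal sketch, not the project's actual code: the MailSource and SpamClassifier interfaces, the PollingController class, and the pollOnce and run methods are hypothetical stand-ins for EmailListener, EmailTextClassifier, and EmailController.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-in for EmailListener.
interface MailSource {
    List<String> getEmails();          // bodies of the new emails
    void moveToSpam(String emailBody); // move one email to the spam folder
}

// Hypothetical stand-in for EmailTextClassifier.
interface SpamClassifier {
    boolean isSpam(String emailBody);
}

final class PollingController {
    private final MailSource source;
    private final SpamClassifier classifier;

    PollingController(MailSource source, SpamClassifier classifier) {
        this.source = source;
        this.classifier = classifier;
    }

    // One pass: fetch new mail, keep what the classifier flags, move it.
    // Returns the number of emails moved.
    int pollOnce() {
        List<String> spam = source.getEmails().stream()
                .filter(classifier::isSpam)
                .collect(Collectors.toList());
        spam.forEach(source::moveToSpam);
        return spam.size();
    }

    // Repeat pollOnce at a fixed interval until the thread is interrupted.
    void run(long intervalMillis) {
        while (!Thread.currentThread().isInterrupted()) {
            pollOnce();
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // restore the flag so the loop exits
            }
        }
    }
}
```

Calling run(5_000) would poll every five seconds. Keeping a single pass in pollOnce means the fetching and classifying logic can be exercised without starting the loop.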

Note that since we are using Java 8 or 9, the “stream” can be changed to a parallel stream, which may improve throughput on multicore systems. The beauty is that a threading or concurrency model can be layered on top of the controller, because it delegates the actual work to the EmailTextClassifier and EmailListener classes.
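
As a rough illustration of that switch, the sketch below filters the same list once with stream() and once with parallelStream(). The ParallelScan class and its looksLikeSpam check are hypothetical placeholders for the real classifier call.

```java
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

final class ParallelScan {
    // Hypothetical placeholder for the real classifier call.
    static boolean looksLikeSpam(String body) {
        return body.toLowerCase(Locale.ROOT).contains("sale on");
    }

    // Sequential version: emails are checked one after another.
    static List<String> spamSequential(List<String> bodies) {
        return bodies.stream()
                .filter(ParallelScan::looksLikeSpam)
                .collect(Collectors.toList());
    }

    // Parallel version: only the stream factory method changes.
    static List<String> spamParallel(List<String> bodies) {
        return bodies.parallelStream()
                .filter(ParallelScan::looksLikeSpam)
                .collect(Collectors.toList());
    }
}
```

Both methods return the matches in the original list order. The parallel version is only safe if the classifier is thread-safe, and for a handful of emails the coordination overhead can outweigh any gain.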

Next, we train our application to detect spam. To do this, we will implement Named Entity Recognition (NER). All we need is the Sentence class, edu.stanford.nlp.simple.Sentence.

[Figure: EmailTextClassifier source code]

The commented-out code illustrates an extension to the mailLanguageClassifier in which we process the email subject, body, and text attachments. We can then pass these around as a list of string triples and use flatMap to create spamEmails. In the example, I analyze only the text in the email body.
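
One way that extension could be shaped is sketched below. It is an assumption about the design, not the commented-out code itself: EmailParts, EmailPartsScan, texts, allTexts, and the isSpam predicate are all hypothetical names.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical holder for the three text sources of one email.
final class EmailParts {
    final String subject;
    final String body;
    final List<String> textAttachments;

    EmailParts(String subject, String body, List<String> textAttachments) {
        this.subject = subject;
        this.body = body;
        this.textAttachments = textAttachments;
    }

    // Every piece of text in this email as one stream.
    Stream<String> texts() {
        return Stream.concat(Stream.of(subject, body), textAttachments.stream());
    }
}

final class EmailPartsScan {
    // flatMap flattens the per-email streams into a single list of texts.
    static List<String> allTexts(List<EmailParts> emails) {
        return emails.stream()
                .flatMap(EmailParts::texts)
                .collect(Collectors.toList());
    }

    // An email counts as spam if any of its texts is flagged by the predicate.
    static List<EmailParts> spamEmails(List<EmailParts> emails, Predicate<String> isSpam) {
        return emails.stream()
                .filter(e -> e.texts().anyMatch(isSpam))
                .collect(Collectors.toList());
    }
}
```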

The emailspamfilter_ner.txt file contains the spam items that we will look out for in emails. Here is an example:

[Figure: sample contents of emailspamfilter_ner.txt]

If I get any mail containing "Buy peanuts" or "Sale on biscuits", I can classify it as spam. Note that this can be extended with the use of Stanford’s NLC, which you can train to look out for certain phrases or words. In addition, you could just look for shorter entries like "Sale on" or "Get discounted". As you receive more emails that you don't like, you can add more entries to this list, and over time your spam filter catches more.
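
To make the growing-list idea concrete, here is a simplified sketch that uses plain case-insensitive substring matching instead of Stanford NER. It assumes the list holds one phrase per line, which may not match the real format of emailspamfilter_ner.txt, and the PhraseList class is a hypothetical name.

```java
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Simplified stand-in for the NER lookup: plain substring matching.
final class PhraseList {
    private final List<String> phrases;

    // Assumes one phrase per line; blank lines are ignored.
    PhraseList(List<String> lines) {
        this.phrases = lines.stream()
                .map(String::trim)
                .filter(line -> !line.isEmpty())
                .map(line -> line.toLowerCase(Locale.ROOT))
                .collect(Collectors.toList());
    }

    // True if the text contains any listed phrase, ignoring case.
    boolean matches(String text) {
        String lower = text.toLowerCase(Locale.ROOT);
        return phrases.stream().anyMatch(lower::contains);
    }
}
```

The list could be loaded with Files.readAllLines(Paths.get("emailspamfilter_ner.txt")), so adding a phrase to the file is enough to extend the filter on the next start.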

EmailListener contains the method getEmails, which retrieves all new emails. Then, the controller sends them to EmailTextClassifier. The second important method moves the emails to spam. If the email is spam, it will then be moved from the inbox folder to a spam folder that I call MySpam (I created a new folder called MySpam in my Gmail inbox for this article).

[Figure: EmailListener source code]

Note that not all the code in EmailListener uses Java 8 or 9 features. This is because I reused some methods that were written for Java 6. However, moveEmailToSpamFolder uses Optional, which was introduced in Java 8.
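
For readers new to Optional, the sketch below shows the general pattern of looking something up that may be absent and handling both cases without a null check. It is not the body of moveEmailToSpamFolder: FolderLookup, findFolder, and describeMove are hypothetical, and folder names are modeled as plain strings rather than JavaMail folders.

```java
import java.util.List;
import java.util.Optional;

final class FolderLookup {
    // Returns the matching folder name, or an empty Optional if none exists.
    static Optional<String> findFolder(List<String> folderNames, String wanted) {
        return folderNames.stream()
                .filter(name -> name.equalsIgnoreCase(wanted))
                .findFirst();
    }

    // Handles both outcomes without an explicit null check.
    static String describeMove(List<String> folderNames, String wanted) {
        return findFolder(folderNames, wanted)
                .map(name -> "moving to " + name)
                .orElse("folder " + wanted + " not found, leaving email in inbox");
    }
}
```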

The code can all be found on GitHub in the repository.

Happy coding!
