Stanford NLP and Java 9: Creating an Email Spam Filter
The goal of this article is to use Stanford NLP and Java 9 to create a spam filter that will scan all incoming emails and send them to a separate spam folder.
In this article, I would like to illustrate how Stanford's Natural Language Processing and Java 9 can be used to create a spam filter for an email account.
The goal is for all incoming messages to be scanned, and if they contain any spam information, they will be moved to a spam folder.
First, we download the following:
- Stanford NLP 3.9.1
- Jsoup-1.11.2
- JavaEE (we will use the JavaMail API for the connection and manipulation of the email account)
- Eclipse Oxygen 4.7.2
Next, we create a project called EmailSpamFilter in Eclipse Oxygen. We will create an application with the following architecture:
This is similar to the MVC pattern, but instead of a model and view, we have EmailListener and MessageNLP. In the source code below, the class that represents MessageNLP is EmailTextClassifier. The controller enables pure separation of concerns and carries out all the orchestration.
The project structure is as follows:
EmailController will run in an infinite loop, reading the inbox for new emails at a given interval. Here, I have set it to five seconds. For personal use, one can set the interval to be much larger, like every five hours.
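The polling behaviour described above can be sketched with only the JDK. This is a minimal illustration, not the article's EmailController: it swaps the hand-written infinite loop for a ScheduledExecutorService, and the names PollingSketch, pollOnce, and start, as well as the use of plain strings for messages, are assumptions made for the example.

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Hypothetical sketch: none of these names come from the article's source code.
final class PollingSketch {

    /** Runs one polling pass: fetch new messages, hand each to the handler, return the count. */
    static int pollOnce(Supplier<List<String>> inbox, Consumer<String> handler) {
        List<String> messages = inbox.get();
        messages.forEach(handler);
        return messages.size();
    }

    /** Repeats pollOnce with a fixed delay between passes; the caller shuts the scheduler down. */
    static ScheduledExecutorService start(Supplier<List<String>> inbox,
                                          Consumer<String> handler,
                                          long interval, TimeUnit unit) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(() -> {
            try {
                pollOnce(inbox, handler);
            } catch (RuntimeException e) {
                // An uncaught exception would cancel all later runs, so log and carry on.
                System.err.println("Poll failed: " + e.getMessage());
            }
        }, 0, interval, unit);
        return scheduler;
    }
}
```

Calling `PollingSketch.start(inbox, handler, 5, TimeUnit.SECONDS)` would correspond to the five-second interval used here, and `TimeUnit.HOURS` to the longer interval suggested for personal use.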
Note that since we are using Java 8 or 9, the “stream” can be changed to a parallel stream for optimized performance when using multicore systems. The beauty is that a threading or concurrency model can be superimposed on the controller, as it delegates functionality to the EmailTextClassifier and EmailListener classes.
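As a rough illustration of the switch to a parallel stream, the sketch below filters message bodies with parallelStream(). The class and method names are hypothetical, and the predicate stands in for whatever check the classifier performs; it must be stateless and thread-safe for the parallel version to be correct.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical sketch: the predicate is a placeholder for the real classifier.
final class ParallelClassifySketch {

    /** Returns the bodies flagged by isSpam, evaluated across multiple cores. */
    static List<String> findSpam(List<String> bodies, Predicate<String> isSpam) {
        return bodies.parallelStream()          // was: bodies.stream()
                .filter(isSpam)
                .collect(Collectors.toList()); // encounter order of the source list is preserved
    }

    public static void main(String[] args) {
        List<String> bodies = List.of("Buy peanuts today", "Lunch at noon?", "Sale on biscuits");
        System.out.println(findSpam(bodies, b -> b.contains("peanuts") || b.contains("biscuits")));
    }
}
```

For a handful of emails per poll, the sequential stream is likely just as fast; the parallel version mainly pays off when each check is expensive or the batch is large.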
Next, we train our application to be able to detect spam. In order to do this, we will implement Named Entity Recognition (NER). All we need is the Sentence class from the edu.stanford.nlp.simple package.
The commented-out code illustrates an extension to the mailLanguageClassifier where we can process the email subject, body, and text attachments. We can then pass these around as a list of string triples and use flatMap to create spamEmails. In the example, I just analyze the text in the email body.
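One way the subject, body, and attachment extension could look is sketched below. It is not the commented-out code from the repository: the EmailText holder, the parts and allParts helpers, and the string predicate are all assumptions introduced for the example, and only the name spamEmails is borrowed from the text.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch: EmailText is an illustrative holder, not a class from the article.
final class EmailPartsSketch {

    static final class EmailText {
        final String subject;
        final String body;
        final List<String> attachments; // text attachments only

        EmailText(String subject, String body, List<String> attachments) {
            this.subject = subject;
            this.body = body;
            this.attachments = attachments;
        }

        /** All text belonging to this email as one stream. */
        Stream<String> parts() {
            return Stream.concat(Stream.of(subject, body), attachments.stream());
        }
    }

    /** Flattens every email into its individual text parts. */
    static List<String> allParts(List<EmailText> emails) {
        return emails.stream().flatMap(EmailText::parts).collect(Collectors.toList());
    }

    /** Keeps the emails where at least one part is flagged by isSpamText. */
    static List<EmailText> spamEmails(List<EmailText> emails, Predicate<String> isSpamText) {
        return emails.stream()
                .filter(email -> email.parts().anyMatch(isSpamText))
                .collect(Collectors.toList());
    }
}
```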
The emailspamfilter_ner.txt file contains the spam items that we will look out for in emails. Here is an example:
If I get any mail containing Buy peanuts or Sale on biscuits, I can classify it as spam. Note that this can be extended with the use of Stanford’s NLC, where you can train it to look out for certain phrases or words. In addition, you could just look for shorter entries like Sale on or Get discounted. Also, as you get more emails that you don't like, you can add more entries to this list, and as time goes on, your spam filter becomes more intelligent.
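To make the idea of a growing phrase list concrete, here is a plain-Java stand-in that does not use Stanford NLP at all. It assumes each line of the list is a tab-separated phrase and label, which the article does not confirm, and the label SPAM, the class name, and both method names are hypothetical. It also does simple case-insensitive substring matching, which is cruder than token-level NER.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-in: substring matching instead of Stanford's NER pipeline.
final class PhraseListSketch {

    /** Parses "phrase<TAB>label" lines (an assumed format); malformed or blank lines are skipped. */
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> phrases = new LinkedHashMap<>();
        for (String line : lines) {
            String[] cols = line.split("\t");
            if (cols.length < 2 || cols[0].trim().isEmpty()) {
                continue;
            }
            phrases.put(cols[0].trim().toLowerCase(Locale.ROOT), cols[1].trim());
        }
        return phrases;
    }

    /** Returns the first listed phrase found in the text, ignoring case. */
    static Optional<String> firstMatch(String text, Map<String, String> phrases) {
        String haystack = text.toLowerCase(Locale.ROOT);
        return phrases.keySet().stream().filter(haystack::contains).findFirst();
    }

    public static void main(String[] args) {
        Map<String, String> phrases = parse(List.of("Buy peanuts\tSPAM", "Sale on biscuits\tSPAM"));
        System.out.println(firstMatch("Huge SALE ON BISCUITS this week", phrases).isPresent());
    }
}
```

Adding a new unwanted phrase is then a one-line change to the list, which matches the incremental approach described above.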
EmailListener contains the method getEmails, which retrieves all new emails. Then, the controller sends them to EmailTextClassifier. The second important method moves the emails to spam. If the email is spam, it will then be moved from the inbox folder to a spam folder that I call MySpam (I created a new folder called MySpam in my Gmail inbox for this article).
Note that not all the code in EmailListener utilizes Java 8 or 9’s capabilities. This is because some of the methods were written for Java 6, and I reused them as they were. However, moveEmailToSpamFolder makes use of Optional, which was introduced in Java 8.
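The sketch below shows the general shape of using Optional for a step like locating the spam folder before a move. It is not the article's moveEmailToSpamFolder: it works on plain folder names rather than JavaMail objects, and the class and method names are hypothetical.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical sketch: folder names are plain strings here, not JavaMail Folder objects.
final class OptionalFolderSketch {

    /** Looks up a folder by name; an empty Optional means it does not exist. */
    static Optional<String> findFolder(List<String> folderNames, String wanted) {
        return folderNames.stream()
                .filter(name -> name.equalsIgnoreCase(wanted))
                .findFirst();
    }

    /** Describes what would happen to a spam email, without any null checks. */
    static String describeMove(List<String> folderNames, String wanted) {
        return findFolder(folderNames, wanted)
                .map(name -> "moved to " + name)
                .orElse("left in inbox: no folder named " + wanted);
    }

    public static void main(String[] args) {
        System.out.println(describeMove(List.of("INBOX", "MySpam"), "MySpam"));
        System.out.println(describeMove(List.of("INBOX"), "MySpam"));
    }
}
```

The point of the Optional is that a missing folder becomes an explicit branch instead of a null that has to be remembered and checked.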
The code can all be found on GitHub in the repository.
Happy coding!