Stanford NLP and Java 9: Creating an Email Spam Filter
The goal of this article is to use Stanford NLP and Java 9 to create a spam filter that scans all incoming emails and moves any spam to a separate spam folder.
In this article, I would like to illustrate how Stanford's Natural Language Processing and Java 9 can be used to create a spam filter for an email account.
The goal is for all incoming messages to be scanned, and for any message that contains spam content to be moved to a spam folder.
First, we download the following:
- Stanford NLP 3.9.1
- JavaEE (we will use the JavaMail API for the connection and manipulation of the email account)
- Eclipse Oxygen 4.7.2
Next, we create a project called EmailSpamFilter in Eclipse Oxygen. We will create an application with the following architecture:
This is similar to the MVC pattern, but instead of a model and view, we have EmailListener and MessageNLP. In the source code below, the class that represents MessageNLP is EmailTextClassifier. The controller enables a clean separation of concerns and carries out all the orchestration.
The project structure is as follows:
EmailController will run in an infinite loop, reading the inbox for new emails at a given interval. Here, I have set it to five seconds. For personal use, one can set the interval to be much larger, like every five hours.
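The polling behaviour described above can be sketched with a ScheduledExecutorService from the JDK rather than a hand-written infinite loop. This is only an illustration under assumptions, not the article's controller: InboxPoller and the checkInbox task are hypothetical names, and the real task would read the inbox and hand new messages to the classifier.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper (not from the article): runs an inbox check at a fixed interval.
final class InboxPoller {

    private InboxPoller() {
    }

    // Starts polling and returns the scheduler so the caller can shut it down later.
    static ScheduledExecutorService start(Runnable checkInbox, long interval, TimeUnit unit) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        Runnable guarded = () -> {
            try {
                checkInbox.run();
            } catch (RuntimeException e) {
                // An uncaught exception would cancel all future runs, so report it and carry on.
                System.err.println("Inbox check failed: " + e.getMessage());
            }
        };
        // The first run starts immediately; each later run waits `interval` after the previous one ends.
        scheduler.scheduleWithFixedDelay(guarded, 0, interval, unit);
        return scheduler;
    }
}
```

A call such as InboxPoller.start(task, 5, TimeUnit.SECONDS) would match the five-second interval, and changing the arguments to 5 and TimeUnit.HOURS gives the longer interval suggested for personal use.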
Note that since we are using Java 8 or 9, the “stream” can be changed to a parallel stream for better performance on multicore systems. A further benefit is that a threading or concurrency model can be superimposed on the controller, since it only orchestrates and delegates the actual work to the other classes.
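As a rough illustration of the stream-to-parallel-stream change, the sketch below filters a list of message bodies with parallelStream(). The class ParallelSpamScan and the isSpam predicate are hypothetical stand-ins for the article's classifier call, and the predicate is assumed to be stateless and thread-safe.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical helper (not from the article): filters message bodies with a caller-supplied spam test.
final class ParallelSpamScan {

    private ParallelSpamScan() {
    }

    // Returns the bodies the predicate flags as spam, in their original order.
    static List<String> findSpam(List<String> bodies, Predicate<String> isSpam) {
        return bodies.parallelStream()   // the sequential version would call bodies.stream()
                .filter(isSpam)
                .collect(Collectors.toList());
    }
}
```

Whether the parallel version is actually faster depends on how many messages arrive per poll and how expensive each classification is, so it is worth measuring before switching.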
Next, we train our application to detect spam. To do this, we will use Named Entity Recognition (NER). All we need is a Sentence from the package edu.stanford.nlp.simple.
The commented-out code illustrates an extension to the mailLanguageClassifier where we can process the email subject, body, and text attachments. We can then pass this around as a list of triple strings by using flatMaps to create spamEmails. In the example, I just analyze the text in the email body.
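One possible reading of the extension described above is sketched here: each email carries three strings (subject, body, attachment text), and flatMap flattens them into a single stream that is then filtered. EmailText, its fields, and the isSpam predicate are hypothetical names chosen for this illustration, not taken from the article's source.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical holder (not from the article) for the three text parts of one email.
final class EmailText {
    final String subject;
    final String body;
    final String attachmentText;

    EmailText(String subject, String body, String attachmentText) {
        this.subject = subject;
        this.body = body;
        this.attachmentText = attachmentText;
    }

    // Flattens every email into its three parts and keeps the parts flagged as spam.
    static List<String> spamParts(List<EmailText> emails, Predicate<String> isSpam) {
        return emails.stream()
                .flatMap(e -> Stream.of(e.subject, e.body, e.attachmentText))
                .filter(isSpam)
                .collect(Collectors.toList());
    }
}
```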
The emailspamfilter_ner.txt file contains the spam items that we will look out for in emails. Here is an example:
If I get any mail containing "Buy peanuts" or "Sale on biscuits", I can classify it as spam. Note that this can be extended with the use of Stanford’s NLC, where you can train it to look out for certain phrases or words. In addition, you could just look for shorter NERs like "Sale on" or "Get discounted". As you receive more emails that you don't like, you can add more NERs to this list, so the spam filter catches more over time.
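To make the idea of a growing phrase list concrete, here is a minimal sketch that uses plain case-insensitive substring matching as a stand-in for Stanford NER. It does not call Stanford NLP at all. The class SpamPhraseMatcher is hypothetical, and the file layout is an assumption: each non-blank line is taken to hold a phrase, optionally followed by a tab and a label.

```java
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Plain substring matching, used here as a stand-in for Stanford NER (hypothetical class).
final class SpamPhraseMatcher {
    private final List<String> phrases;

    // Assumption: each line is "phrase" or "phrase<TAB>label"; only the phrase is kept.
    SpamPhraseMatcher(List<String> lines) {
        this.phrases = lines.stream()
                .map(line -> line.split("\t", 2)[0].trim().toLowerCase(Locale.ROOT))
                .filter(phrase -> !phrase.isEmpty())
                .collect(Collectors.toList());
    }

    // True if the text contains any listed phrase, ignoring case.
    boolean isSpam(String text) {
        String lower = text.toLowerCase(Locale.ROOT);
        return phrases.stream().anyMatch(lower::contains);
    }
}
```

The lines could be loaded with Files.readAllLines(Paths.get("emailspamfilter_ner.txt")), so adding a new phrase to the file is enough to extend the filter.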
EmailListener contains the method getEmails, which retrieves all new emails. Then, the controller sends them to EmailTextClassifier. The second important method moves the emails to spam. If the email is spam, it will then be moved from the inbox folder to a spam folder that I call MySpam (I created a new folder called MySpam in my Gmail inbox for this article).
Note that not all the code in EmailListener uses Java 8 or 9 features. Some of the methods date from code I wrote for Java 6, which I reused here. However, moveEmailToSpamFolder uses Optional, which was introduced in Java 8.
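The role Optional can play when moving a message is sketched below with an in-memory mailbox. This is not the article's moveEmailToSpamFolder and it does not use JavaMail: InMemoryMailbox and its methods are hypothetical, and the point is only that a missing folder is represented as an empty Optional instead of null.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// In-memory stand-in for a mailbox (hypothetical); a real one would wrap JavaMail folders.
final class InMemoryMailbox {
    private final Map<String, List<String>> folders = new HashMap<>();

    void createFolder(String name) {
        folders.putIfAbsent(name, new ArrayList<>());
    }

    void deliver(String folder, String message) {
        createFolder(folder);
        folders.get(folder).add(message);
    }

    // Wraps the lookup so callers never handle a null folder.
    Optional<List<String>> findFolder(String name) {
        return Optional.ofNullable(folders.get(name));
    }

    // Moves the message only if both folders exist and the source folder holds it.
    boolean moveToSpam(String message, String from, String spamFolder) {
        return findFolder(from)
                .flatMap(source -> findFolder(spamFolder)
                        .map(target -> source.remove(message) && target.add(message)))
                .orElse(false);
    }
}
```

With this shape, a call such as moveToSpam(message, "INBOX", "MySpam") simply returns false when the MySpam folder has not been created yet, rather than throwing a NullPointerException.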