You’ve Got Mail… and It’s a SPAM!

This article outlines the impact of spam and how emerging machine-learning techniques can address it, based on our journey in this domain.

Ramesh Manickavel

Celina Daisy .J

May. 22, 23 · Analysis

When Celina John finalized her college project, “Spam Classification through Machine Learning Algorithms,” we did not expect the scope of the work to be so large. Because the domain is constantly evolving, we had to firm up the scope and key deliverables.

As technology grows, spam is growing rapidly across all electronic communication channels, be it email, short message service (SMS), or social media. However unethical and illegal the practice is, the number of spammers increases day by day, and they send unsolicited, unwanted, and often malicious messages in bulk to a large number of recipients.

We realized that it’s no longer just a question of spam or not spam. The sections below cover why spam matters and how supervised, unsupervised, and reinforcement learning approaches can help address it.

Why Should We Bother About Spam?

Spam affects nearly every area, including finance, security, healthcare, advertising, and business. Spam emails are more than junk mail: they can cause real harm when we inadvertently engage with them.

The intent of spam has gone beyond a business opportunity. These days, spam can be anything that tries to steal valuable information, money, or credibility.

Financial Impact

Money theft has become a serious issue: spammers devise a variety of scams to take credentials and other valuable assets without the user's knowledge. This leads to a further chain of consequences, described below.

Impact on Security

Security is about more than protecting your important credentials; it is also about protecting your privacy and valuable possessions. Spammers impersonate legitimate entities to infiltrate your computer, and attackers constantly change their techniques to deceive potential victims.

Psychological Impact

Receiving too many spam messages that request sensitive or important information can cause stress and depression, which in turn can harm a person's physical health. Spam with pornographic content is another significant problem, and it can also threaten a person's reputation in society.

Machine Learning Approaches and Algorithms

Traditionally, spam detection has relied on supervised learning. However, as technology grows, classifying spam is becoming increasingly complex. Hence, it is important to understand how the different types of learning apply to spam.

Supervised Machine Learning Algorithm

In supervised learning, the machine learns under supervision: a model is trained on a labeled dataset, that is, one where the target answer is already known, and then predicts labels for new data. Put simply, supervised learning is like a student learning new things with the help of a teacher, where the teacher's supervision helps the student understand the domain. It is an effective method because the expert's guidance keeps you from getting lost. In summary, supervised learning is training the machine with previously known information.

For spam classification, we can use different machine learning models to classify each email as spam or ham (not spam).

Logistic Regression

Logistic regression is a machine learning model used to classify data; it is a discriminative model. It is a statistical model that estimates the probability that an event occurs. You can use it to classify emails as spam or non-spam based on the spamicity (a measure of how likely a message is to be spam) of the words they contain. It is mostly used for binary classification. Even when the model classifies the training data well, it may be overfitting, so it should be checked on data it has not seen.
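As a rough illustration of the idea, the sketch below trains a logistic regression spam classifier from scratch on a binary bag-of-words representation. The six training messages, the hyperparameters, and the function names (`featurize`, `train`, `spam_probability`) are assumptions made up for this example, not the setup used in the project. The small L2 penalty is one common way to reduce the overfitting risk mentioned above.

```python
import math

# Tiny hand-made corpus (hypothetical examples, not real data): 1 = spam, 0 = ham.
TRAIN = [
    ("win free prize now", 1),
    ("free money claim prize", 1),
    ("cheap loans win money", 1),
    ("meeting agenda for monday", 0),
    ("lunch with the project team", 0),
    ("please review the project report", 0),
]

VOCAB = sorted({word for text, _ in TRAIN for word in text.split()})


def featurize(text):
    """Binary bag-of-words vector over VOCAB; unknown words are ignored."""
    words = set(text.lower().split())
    return [1.0 if word in words else 0.0 for word in VOCAB]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(data, epochs=300, lr=0.5, l2=0.01):
    """Batch gradient descent on the L2-regularized log loss."""
    weights = [0.0] * len(VOCAB)
    bias = 0.0
    rows = [(featurize(text), label) for text, label in data]
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in rows:
            # Prediction error for this message: p(spam) - true label.
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        weights = [w - lr * (g / n + l2 * w) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def spam_probability(text, weights, bias):
    x = featurize(text)
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


weights, bias = train(TRAIN)
print(spam_probability("claim your free prize", weights, bias))
print(spam_probability("project meeting on monday", weights, bias))
```

A message is labeled spam when the returned probability exceeds a chosen threshold, typically 0.5. The learned weight of each word plays the role of its spamicity.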

Naïve Bayes

Naïve Bayes is one of the most commonly used machine learning models, for both multiclass and binary classification. It classifies a message based on probability, using how likely each word is to appear in spam. Why is it called naïve if it is based on Bayes' theorem? Because each variable or attribute in the model is assumed to be independent of every other variable, given the class. This assumption is known as class conditional independence.

To appreciate why the model is so naïve, suppose that you enjoy both cupcakes and chili oil ramen. Given these two facts, Mr. Naïve will conclude that you like chili oil ramen cupcakes. It sounds strange, doesn't it? This is the weakness of the independence assumption, and knowledge of your field helps you work around it. Naïve Bayes is used primarily for classification problems.
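The independence assumption can be seen directly in code. The sketch below is a minimal multinomial Naïve Bayes classifier with add-one (Laplace) smoothing, written from scratch; the training messages and function names are hypothetical and chosen only for illustration. Each word contributes its own log-probability to the score, with no regard for which other words appear alongside it.

```python
import math
from collections import Counter

# Hypothetical training messages: 1 = spam, 0 = ham.
TRAIN = [
    ("win free prize now", 1),
    ("free money claim prize", 1),
    ("cheap loans win money", 1),
    ("meeting agenda for monday", 0),
    ("lunch with the project team", 0),
    ("please review the project report", 0),
]


def train_naive_bayes(data):
    """Count words per class and messages per class."""
    word_counts = {0: Counter(), 1: Counter()}
    doc_counts = Counter()
    for text, label in data:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, doc_counts, vocab


def log_posterior(text, label, model):
    """Unnormalized log P(label | text) under the naive assumption."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    total_words = sum(word_counts[label].values())
    score = math.log(doc_counts[label] / total_docs)  # class prior
    for word in text.lower().split():
        if word not in vocab:
            continue  # skip words never seen in training
        # Add-one smoothing keeps unseen (word, class) pairs from zeroing the score.
        p = (word_counts[label][word] + 1) / (total_words + len(vocab))
        score += math.log(p)  # independence: per-word terms simply add up
    return score


def classify(text, model):
    spam_score = log_posterior(text, 1, model)
    ham_score = log_posterior(text, 0, model)
    return "spam" if spam_score > ham_score else "ham"


model = train_naive_bayes(TRAIN)
print(classify("free prize money", model))
print(classify("project report for the team", model))
```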

Support Vector Machine

The support vector machine (SVM) is one of the most commonly used machine learning models, for both classification and regression. As the name suggests, the model draws a boundary through n-dimensional data using the points that lie closest to it from each class, for example, the emails that are hardest to tell apart when judging spamicity. These points are the support vectors.

The decision boundary that the model generates is called a hyperplane. The model can work with data in n dimensions, where each dimension is a feature of the email. Put simply, the model looks at the data along every feature at once and chooses the boundary that separates the classes with the widest margin.
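To make the hyperplane concrete, here is a minimal sketch of a linear SVM trained with subgradient descent on the hinge loss. The two features per email (number of links, number of spammy words), the data points, and the hyperparameters are all assumptions for illustration. A production system would more likely use a library implementation, possibly with a non-linear kernel.

```python
# Hypothetical 2-D features per email: (number of links, number of "spammy" words).
# Labels: +1 = spam, -1 = ham.
DATA = [
    ((5.0, 4.0), 1), ((6.0, 3.0), 1), ((4.0, 5.0), 1),
    ((0.0, 1.0), -1), ((1.0, 0.0), -1), ((1.0, 1.0), -1),
]


def decision_value(x, w, b):
    """Signed score of x relative to the hyperplane w . x + b = 0."""
    return w[0] * x[0] + w[1] * x[1] + b


def train_linear_svm(data, epochs=500, lr=0.01, reg=0.01):
    """Per-sample subgradient descent on reg/2 * |w|^2 + hinge loss."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for x, y in data:
            if y * decision_value(x, w, b) < 1:
                # Point is inside the margin or misclassified: hinge loss is active.
                w = [wi - lr * (reg * wi - y * xi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Point is safely outside the margin: only the regularizer acts.
                w = [wi - lr * reg * wi for wi in w]
    return w, b


def predict(x, w, b):
    return 1 if decision_value(x, w, b) >= 0 else -1


w, b = train_linear_svm(DATA)
print(predict((7.0, 6.0), w, b))  # many links and spammy words
print(predict((0.0, 0.0), w, b))  # no links, no spammy words
```

The training points whose margin ends up close to 1 are the support vectors: moving any other point slightly would not change the learned boundary.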

Unsupervised Machine Learning Algorithms

Unsupervised learning trains a machine on information that is neither classified nor labeled, allowing the algorithm to act on that information without guidance. It helps group unsorted information according to similarities, patterns, and differences without any labeled training data.

Clustering

These algorithms group similar data points based on some similarity metric, such as distance or density. We can apply clustering to group similar emails together based on their content and other features such as sender, subject, and attachments. Spam emails often have similar characteristics such as keywords, URLs, and email addresses. By clustering these emails together, we can identify groups of messages that are likely to be spam.
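As a small illustration, the sketch below groups emails by word overlap using a simple single-pass ("leader") clustering with Jaccard similarity. This is a deliberately basic stand-in for algorithms such as k-means or DBSCAN, and the sample emails and the 0.3 threshold are assumptions for the example.

```python
def jaccard(a, b):
    """Jaccard similarity of two token sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_emails(emails, threshold=0.3):
    """Assign each email to the first cluster whose leader is similar enough.

    The leader of a cluster is the token set of its first member.
    Returns a list of clusters, each a list of indices into `emails`.
    """
    clusters = []  # list of (leader_tokens, member_indices)
    for index, text in enumerate(emails):
        tokens = set(text.lower().split())
        for leader_tokens, members in clusters:
            if jaccard(tokens, leader_tokens) >= threshold:
                members.append(index)
                break
        else:
            # No existing cluster is close enough: start a new one.
            clusters.append((tokens, [index]))
    return [members for _, members in clusters]


# Hypothetical, unlabeled messages.
EMAILS = [
    "win a free prize now",
    "claim your free prize now",
    "team meeting moved to monday",
    "free prize waiting claim now",
    "monday team meeting agenda",
]

print(cluster_emails(EMAILS))
```

No labels are used here. Deciding that a given cluster is spam is a separate step, for example by inspecting a few of its members.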

Topic Modeling

Topic modeling algorithms identify common themes and topics in the emails. Spam emails often have similar themes such as promotions, scams, or phishing attempts. By identifying these topics, we can flag emails that are likely to be spam.

Rule-Based Systems

Rule-based systems use a set of predefined rules to identify spam. These rules can be based on known patterns or characteristics of spam emails such as certain keywords or phrases, specific email addresses or domains, or other attributes. Rule-based systems can be effective, but they require frequent updates to keep up with evolving spam tactics.
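A rule-based filter can be sketched in a few lines. The rules below (a blocked sender domain, a handful of phrases, and a link count) and the two-hit threshold are hypothetical values picked for illustration; a real rule set would be larger and maintained over time.

```python
import re

# All of these values are illustrative assumptions, not a recommended rule set.
BLOCKED_DOMAINS = {"promo.example"}
SPAM_PHRASES = re.compile(r"\b(free prize|act now|you have won)\b", re.IGNORECASE)
URL = re.compile(r"https?://\S+")


def matched_rules(email):
    """Return the names of all rules that fire for an email dict."""
    hits = []
    sender_domain = email["sender"].rsplit("@", 1)[-1].lower()
    if sender_domain in BLOCKED_DOMAINS:
        hits.append("blocked sender domain")
    if SPAM_PHRASES.search(email["subject"] + " " + email["body"]):
        hits.append("spam phrase")
    if len(URL.findall(email["body"])) >= 3:
        hits.append("too many links")
    return hits


def is_spam(email, min_hits=2):
    """Flag the email when at least `min_hits` rules fire."""
    return len(matched_rules(email)) >= min_hits


suspicious = {
    "sender": "deals@promo.example",
    "subject": "You have won",
    "body": "Act now: http://a.example http://b.example http://c.example",
}
ordinary = {
    "sender": "alice@company.example",
    "subject": "Agenda",
    "body": "See http://wiki.example/agenda before the meeting",
}

print(matched_rules(suspicious), is_spam(suspicious))
print(matched_rules(ordinary), is_spam(ordinary))
```

Returning the list of matched rules, rather than only a yes/no answer, makes each decision easy to explain to the user.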

Anomaly Detection

Anomaly detection algorithms identify unusual patterns in the data that do not fit with the normal pattern. Spam emails often have features that are unusual or abnormal, such as a high number of links, unusual formatting, or a mismatch between the sender and subject line. Anomaly detection algorithms can flag emails with these features as potential spam.
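One simple way to flag such outliers is a z-score test on a single feature. The sketch below learns the normal range of link counts from a baseline of ordinary emails and flags values that deviate strongly from it. The baseline numbers and the threshold of 3 standard deviations are assumptions for illustration.

```python
from statistics import mean, pstdev


def fit_baseline(values):
    """Summarize normal behavior as (mean, standard deviation)."""
    return mean(values), pstdev(values)


def is_anomalous(value, baseline, z_threshold=3.0):
    """Flag a value whose z-score against the baseline exceeds the threshold."""
    mu, sigma = baseline
    if sigma == 0:
        # No spread in the baseline: anything different is unusual.
        return value != mu
    return abs(value - mu) / sigma > z_threshold


# Hypothetical link counts observed in ordinary (non-spam) emails.
NORMAL_LINK_COUNTS = [0, 1, 2, 1, 0, 1, 2, 1]
baseline = fit_baseline(NORMAL_LINK_COUNTS)

print(is_anomalous(12, baseline))  # an email with 12 links
print(is_anomalous(2, baseline))   # an email with 2 links
```

The same test can be applied to other numeric features, such as message length or the number of recipients.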

We can also combine multiple approaches, such as rule-based systems and anomaly detection, to classify spam effectively.

Reinforcement Learning Algorithms

Reinforcement learning is a feedback-based machine learning technique in which an agent learns to behave in an environment by performing actions and observing their results. For example, if an email is classified correctly, the agent receives a reward of +1, and if it is classified incorrectly, the agent receives a penalty of -1.

Markov Decision Process

Although the Markov Decision Process (MDP) is primarily used for optimization problems, a spam filter agent can be modeled as an MDP in which the states represent the content/body of an email and the actions represent whether to classify the email as spam or not spam. The agent receives a reward for each correctly classified email (spam as spam and not spam as not spam) and a penalty for each incorrectly classified email (spam as not spam and not spam as spam). The objective of the agent is to maximize the cumulative reward over time.
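The pieces of this formulation can be written down directly. In the sketch below, the `Email` class, the sample inbox, and the keyword policy are hypothetical; the +1/-1 reward follows the scheme described above.

```python
from dataclasses import dataclass

ACTIONS = ("spam", "not spam")


@dataclass(frozen=True)
class Email:
    body: str        # the state the agent observes
    true_label: str  # ground truth, used only to compute the reward


def reward(email, action):
    """+1 for a correct classification, -1 for an incorrect one."""
    return 1 if action == email.true_label else -1


def keyword_policy(email):
    """A deliberately simple, hypothetical policy: flag emails containing 'prize'."""
    return "spam" if "prize" in email.body.lower() else "not spam"


def cumulative_reward(policy, emails):
    """Total reward a policy collects over a sequence of emails."""
    return sum(reward(email, policy(email)) for email in emails)


INBOX = [
    Email("Claim your free prize now", "spam"),
    Email("Project meeting moved to Monday", "not spam"),
    Email("Cheap loans approved instantly", "spam"),
]

total = cumulative_reward(keyword_policy, INBOX)
print(total)
```

The keyword policy misses the third email, so it collects less than the maximum possible reward. A learning agent tries to find a policy that does better.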

Q-Learning

Finally, how about a model-free algorithm? Q-learning comes to the rescue: it learns an action-value function that estimates the expected reward of taking each action in each state, and the best action in a state is the one with the highest value. It does not require a model of the environment, and it can handle problems with stochastic transitions and rewards without requiring adaptations. Since spam behavior is hard to capture in a fixed model, Q-learning may be a suitable algorithm for effective classification.

To train the spam filter agent using Q-learning, we can define the state, action, and reward function as follows. The state space can include features such as the email subject, sender, recipient, and message content. The action space includes two possible actions: classify the email as spam or not spam. The reward function can be defined as mentioned earlier: a reward for correct classification and a penalty for incorrect classification.
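The definitions above can be turned into a small tabular Q-learning sketch. Several simplifying assumptions are made here: the state is reduced to two hand-picked boolean features, the training emails are made up, and each email is treated as a one-step episode, so the bootstrapped term of the Q-learning update (gamma times the best next-state value) is zero. A fuller treatment would model longer episodes and richer states.

```python
import random

ACTIONS = ("spam", "not spam")
SPAM_WORDS = {"free", "prize", "winner"}  # hypothetical keyword list


def to_state(body):
    """Reduce an email body to a small discrete state (an assumed featurization)."""
    words = body.lower().split()
    has_spam_word = any(word in SPAM_WORDS for word in words)
    has_link = any(word.startswith("http") for word in words)
    return (has_spam_word, has_link)


# Hypothetical labeled emails used to compute rewards during training.
TRAIN = [
    ("free prize at http://offers.example", "spam"),
    ("you are a winner claim now", "spam"),
    ("minutes from the project meeting", "not spam"),
    ("see the report at http://intranet.example/report", "not spam"),
    ("lunch on friday", "not spam"),
]


def train_q_table(data, epochs=50, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = {}  # state -> {action: estimated value}
    for _ in range(epochs):
        for body, label in data:
            state = to_state(body)
            values = q.setdefault(state, {a: 0.0 for a in ACTIONS})
            # Epsilon-greedy: explore occasionally, otherwise take the best action.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=values.get)
            r = 1.0 if action == label else -1.0
            next_max = 0.0  # terminal transition: one email is one episode
            values[action] += alpha * (r + gamma * next_max - values[action])
    return q


def classify(body, q):
    """Greedy action for the email's state; unseen states default to 'not spam'."""
    values = q.get(to_state(body))
    if values is None:
        return "not spam"
    return max(ACTIONS, key=values.get)


q_table = train_q_table(TRAIN)
print(classify("free prize http://x.example", q_table))
print(classify("lunch meeting tomorrow", q_table))
```

Because the agent learns only from rewards, the same loop could keep running in production, with user feedback such as "mark as spam" clicks supplying the reward signal.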

Conclusion

Combining machine learning algorithms with rule engines can give end users recommendations on how to identify spam messages, so that they can ignore them or take appropriate action. Since the challenges associated with spam are constantly evolving, we should take advantage of reinforcement learning.

Technology, like any good thing, always comes with a caveat. We need to be careful about certain aspects when applying machine learning algorithms to spam. A few of them are listed below:

Privacy and confidentiality: At every stage of applying machine learning algorithms, we need to ensure that the confidentiality and privacy of the data are maintained. User consent should be obtained where applicable.
Transparency: The algorithm should be transparent and explainable to end users, so that they can understand how it works, how it makes decisions, and what data it uses, while privacy and confidentiality are maintained. This is one main reason to rely more on explainable models.

Opinions expressed by DZone contributors are their own.
