Fake news has become a huge issue in our digitally-connected world and it is no longer limited to little squabbles — fake news spreads like wildfire and is impacting millions of people every day.
How do you deal with such a sensitive issue? Countless articles are being churned out every day on the internet — how do you tell real from fake? It's not as easy as turning to a simple fact-checker which is typically built on a story-by-story basis. As developers, can we turn to machine learning?
In this series, we will look at two approaches to predicting whether a given article is fake. In this first article, we take a more traditional supervised approach: we train a model on labeled data and use the Twilio WhatsApp API to run inference against it. In the next article, we will see how we can use advanced pre-trained NLP models like BERT, GPT-2, XLNet, Grover, etc., to achieve our goal.
Let's start with understanding a bit of background.
What Is Fake News?
According to 30seconds.org:
"Fake news" is a term used to refer to fabricated news. Fake news is an invention -- a lie created out of nothing -- that takes the appearance of real news with the aim of deceiving people. This is what is important to remember: the information is false, but it seems true.
According to Wikipedia:
"Fake news (also known as junk news, pseudo-news, or hoax news) is a form of news consisting of deliberate disinformation or hoaxes spread via traditional news media (print and broadcast) or online social media."
The use of the web as a medium for consuming information is increasing daily. The amount of information posted on social media at any moment is enormous, which makes validating its truthfulness a challenge. A major motivation for this work is that, on average, 62% of US adults rely on social media as their main source of news. The quality of the news circulating on social media has declined substantially over the years.
Fake news is generated intentionally, often by unknown sources, and there are existing methodologies for individually validating users' trustworthiness, the truthfulness of the news, and user engagement on social media. But analyzing these features individually fails to capture the overall picture of news credibility. Hence, combining this auxiliary information with the news content to measure credibility is a promising direction. There are also techniques that analyze users' writing style to classify news content, but these methods have their own outliers and error rates.
We will be building a WhatsApp-based service that accepts news headlines from the user and predicts whether the given news is fake. To follow along, you will need:
- A Twilio account — sign up for a free one here
- A Twilio WhatsApp Sandbox — configure one here
- Set up your Python and Flask developer environment — Make sure you have Python 3 downloaded as well as ngrok
Now we know what fake news is and why it's a major issue. Let's jump into building a solution to fight this problem. We will be using the LIAR dataset by William Yang Wang, which he introduced in his research paper titled "Liar, Liar Pants on Fire": A New Benchmark Dataset for Fake News Detection.
The original dataset comes with the following columns:
- Column 1: the ID of the statement ([ID].json)
- Column 2: the label
- Column 3: the statement
- Column 4: the subject(s)
- Column 5: the speaker
- Column 6: the speaker's job title
- Column 7: the state info
- Column 8: the party affiliation
- Column 9-13: the total credit history count, including the current statement
  - 9: barely true counts
  - 10: false counts
  - 11: half true counts
  - 12: mostly true counts
  - 13: pants on fire counts
- Column 14: the context (venue / location of the speech or statement)
For simplicity, we have converted it to a two-column format:
- Column 1: Statement (News headline or text)
- Column 2: Label (Label class contains: True, False)
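A conversion along these lines can be sketched with Python's standard csv module. This is an illustration under assumptions, not the script used to produce the modified dataset: the article does not say how the six LIAR labels were collapsed into two classes, so the grouping below is a guess, and the two sample rows are made up.

```python
import csv
import io

# ASSUMPTION: how the six LIAR labels map onto True/False is not stated in
# the article; this grouping is only one plausible choice.
TRUE_LABELS = {"true", "mostly-true", "half-true"}
FALSE_LABELS = {"false", "barely-true", "pants-fire"}


def to_two_columns(tsv_file):
    """Yield (statement, label) pairs from a LIAR-style tab-separated stream."""
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 3:
            continue  # skip malformed rows
        label = row[1].strip().lower()   # column 2: the label
        statement = row[2].strip()       # column 3: the statement
        if label in TRUE_LABELS:
            yield statement, True
        elif label in FALSE_LABELS:
            yield statement, False
        # rows with an unrecognized label are dropped


# Two invented rows in the LIAR layout (only the first columns are shown).
sample = io.StringIO(
    "1.json\tmostly-true\tAn invented statement about taxes.\ttaxes\n"
    "2.json\tpants-fire\tAnother invented statement.\thealth\n"
)
rows = list(to_two_columns(sample))
print(rows)
```

In practice you would pass an open file handle for the downloaded TSV file instead of the in-memory sample.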
You can find the modified dataset here. Now that we have a dataset, let's start building a machine learning model.
Step 1: Preprocessing
Data preprocessing is the process of preparing raw data and making it suitable for a machine learning model. It is the first, and a crucial, step in building a model. Real-world data rarely arrives clean and well formatted, so before doing anything else with it we need to clean it and put it into a consistent format.
preprocessing.py contains all the preprocessing functions needed to process the input documents and texts. First, we read the train, test, and validation data files, then perform preprocessing such as tokenizing and stemming. We also perform some exploratory data analysis, such as checking the distribution of the response variable, and data quality checks, such as looking for null or missing values.
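To make these steps concrete, here is a minimal standard-library sketch. It is not the contents of preprocessing.py: the function names are hypothetical, the sample rows are made up, and `crude_stem` is a deliberately simple suffix stripper standing in for a real stemmer (for example, one from NLTK).

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(token):
    """Strip a few common English suffixes (toy stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters of the word remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize a statement and stem every token."""
    return [crude_stem(token) for token in tokenize(text)]


def quality_report(rows):
    """Return (number of empty statements, label distribution)."""
    missing = sum(1 for statement, _ in rows if not statement or not statement.strip())
    distribution = Counter(label for _, label in rows)
    return missing, distribution


# Invented (statement, label) rows, including one with a missing statement.
rows = [
    ("Taxes were raised twice", True),
    ("", False),
    ("Crime is rising", False),
]
tokens = preprocess(rows[0][0])
missing, distribution = quality_report(rows)
print(tokens, missing, dict(distribution))
```

The label distribution shows whether the classes are balanced, and the missing-statement count tells you how many rows to drop or repair before training.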
Step 2: Feature Selection
For feature selection, we have used methods like simple bag-of-words and n-grams, followed by term-frequency weighting such as tf-idf. We have also experimented with word2vec and POS tagging to extract features, though neither is used at this point in the project.
We are looking at the following features:
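As a rough illustration of these feature types, the sketch below builds bag-of-words, n-gram, and tf-idf representations with scikit-learn's `CountVectorizer` and `TfidfVectorizer`. The three statements are made up, and the n-gram range of (1, 2) is an assumption rather than the project's actual setting.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Invented statements, used only to show the shape of each representation.
statements = [
    "taxes went up last year",
    "taxes went down last year",
    "crime went up last year",
]

# Bag-of-words: one column per distinct word, values are raw counts.
bow = CountVectorizer()
bow_matrix = bow.fit_transform(statements)

# N-grams: unigrams plus bigrams, so short phrases become features too.
ngram = CountVectorizer(ngram_range=(1, 2))
ngram_matrix = ngram.fit_transform(statements)

# tf-idf: same vocabulary as above, but terms that occur in many
# statements receive a lower weight than rarer terms.
tfidf = TfidfVectorizer(ngram_range=(1, 2))
tfidf_matrix = tfidf.fit_transform(statements)

print(bow_matrix.shape, ngram_matrix.shape, tfidf_matrix.shape)
```

Each matrix has one row per statement. In the tf-idf matrix, a word such as "year" that appears in every statement ends up with a lower weight than "taxes", which appears in only two.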
Step 3: Classification
Here we build all the classifiers for fake news detection. The extracted features are fed into different classifiers: we have used Naive Bayes, Logistic Regression, Linear SVM, Stochastic Gradient Descent, and Random Forest classifiers from sklearn. Each of the extracted feature sets was used with every classifier. After fitting each model, we compared the F1 scores and checked the confusion matrices.
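The comparison loop can be sketched as follows. This is a minimal example rather than the project's training code: the statements are invented toy data, labels are encoded as 1 (true) and 0 (fake) for simplicity, and the hyperparameters shown are assumptions.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Invented toy data; the real project trains on the LIAR statements.
train_x = [
    "the city budget grew by three percent last year",
    "unemployment fell in the last quarter",
    "the state raised the minimum wage in january",
    "the senate passed the transport bill on tuesday",
    "aliens secretly run the national tax office",
    "the moon landing was filmed in a basement",
    "drinking bleach cures every known disease",
    "the president is secretly a robot",
]
train_y = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = true, 0 = fake
test_x = [
    "the state budget grew last year",
    "a robot secretly runs the senate",
    "the minimum wage bill passed in january",
    "bleach cures the moon disease",
]
test_y = [1, 0, 1, 0]

classifiers = {
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Linear SVM": LinearSVC(),
    "SGD": SGDClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

results = {}
for name, classifier in classifiers.items():
    # Every classifier gets the same tf-idf features via a pipeline.
    model = Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
        ("clf", classifier),
    ])
    model.fit(train_x, train_y)
    predicted = model.predict(test_x)
    results[name] = {
        "f1": f1_score(test_y, predicted),
        "confusion": confusion_matrix(test_y, predicted, labels=[0, 1]),
    }

for name, scores in results.items():
    print(name, round(scores["f1"], 2), scores["confusion"].tolist())
```

With so little data the scores themselves are meaningless; the point is the structure, where swapping the vectorizer or the classifier is a one-line change.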
After fitting all the classifiers, the two best-performing models were selected as candidates for fake news classification. We tuned their parameters using GridSearchCV and chose the best-performing parameters for each. Finally, the selected model was used for fake news detection, reporting the probability of truth. In addition, we extracted the top 50 features from our tf-idf vectorizer to see which words are most important in each class. We also used precision-recall and learning curves to see how the training and test sets perform as we increase the amount of data given to our classifiers.
Step 4: Prediction
Our final, best-performing classifier was Logistic Regression, which was then saved to disk as final_model.sav. Once you clone this repository, the model will be copied to your machine and used by the prediction.py file to classify fake news. It takes a news article as input from the user, runs it through the model, and shows the user the final classification along with the probability of truth.
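The save, load, and predict cycle might look like the sketch below. It assumes the model is a scikit-learn pipeline serialized with pickle, which the article does not state, and it trains a tiny stand-in model on invented data so the example is self-contained. The `predict_statement` helper is hypothetical and is not the code in prediction.py.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny stand-in model trained on invented statements.
texts = [
    "the city budget grew by three percent last year",
    "unemployment fell in the last quarter",
    "aliens secretly run the national tax office",
    "the moon landing was filmed in a basement",
]
labels = [True, True, False, False]

model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(texts, labels)

# Save to a temporary directory so the sketch leaves no files behind.
# ASSUMPTION: pickle is used for serialization.
model_path = os.path.join(tempfile.mkdtemp(), "final_model.sav")
with open(model_path, "wb") as handle:
    pickle.dump(model, handle)


def predict_statement(statement, path=model_path):
    """Load the saved model and return (label, probability of truth)."""
    with open(path, "rb") as handle:
        loaded = pickle.load(handle)
    label = loaded.predict([statement])[0]
    # predict_proba columns follow classes_, so look up the True column.
    truth_index = list(loaded.classes_).index(True)
    probability = loaded.predict_proba([statement])[0][truth_index]
    return bool(label), float(probability)


label, probability = predict_statement("the budget grew last year")
print(label, round(probability, 2))
```

Only unpickle model files you trust, since loading a pickle can execute arbitrary code.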
Step 5: Integrating Twilio WhatsApp API
We have to write code that accepts a news article headline or text from the Twilio WhatsApp API and passes it to our model for prediction. For this, we will use a Python Flask API server. [You can follow a similar process for the SMS API as well.]
The following script will do that:
Now you have to generate an endpoint which can be accessed using Twilio WhatsApp Sandbox.
Your Flask app will need to be visible from the web so Twilio can send requests to it. Ngrok lets us do this. With it installed, open a new terminal tab in the directory your code is in and run:
ngrok http 5000
Grab that ngrok URL to configure the Twilio WhatsApp Sandbox. Now let's try this on WhatsApp, either on the Sandbox if you just want to test, or on your main WhatsApp Sender number if you have one provisioned. The screenshot below shows the Sandbox page:
And we’re good to go! Let’s test our application on WhatsApp! We can send some news headlines or facts to this sandbox and get predictions in return if everything works as expected.
Hurray! You wanna try this? Complete code is available on GitHub.
This was a very basic implementation with limited data, but I really hope it is enough to give you an idea of the cool things you can do with machine learning and Twilio. You can try to tweak this project and use various datasets to build something cooler! So what are you planning to build? Tell me in the comments or hit me up on Twitter with your ideas, and I will be happy to collaborate!
In the next part, we will see how we can use advanced pre-trained NLP models like BERT, GPT-2, XLNet, Grover, etc. to achieve our goal!