Fake News’ Foe: Machine Learning and Twilio

We will be building a WhatsApp-based bot service that accepts news headlines from the user and predicts whether the given news is fake.

Fake news has become a huge issue in our digitally-connected world and it is no longer limited to little squabbles — fake news spreads like wildfire and is impacting millions of people every day.

How do you deal with such a sensitive issue? Countless articles are being churned out every day on the internet — how do you tell real from fake? It's not as easy as turning to a simple fact-checker which is typically built on a story-by-story basis. As developers, can we turn to machine learning?

In this series, we will look at two approaches to predicting whether a given article is fake. In this first article, we take a more traditional supervised approach: we train a model on labeled data and use the Twilio WhatsApp API to run inference against it. In the next article, we will see how advanced pre-trained NLP models like BERT, GPT-2, XLNet, and Grover can be used to achieve the same goal.

Let's start with understanding a bit of background.

What Is Fake News?

According to 30seconds.org:

"Fake news" is a term used to refer to fabricated news. Fake news is an invention -- a lie created out of nothing -- that takes the appearance of real news with the aim of deceiving people. This is what is important to remember: the information is false, but it seems true.

According to Wikipedia:

"Fake news (also known as junk news, pseudo-news, or hoax news) is a form of news consisting of deliberate disinformation or hoaxes spread via traditional news media (print and broadcast) or online social media."

The use of the web as a medium for consuming information is increasing daily. The amount of information posted to social media at any moment is enormous, which makes it hard to validate its truthfulness. A main motivation for this work is that, on average, 62% of US adults rely on social media as their main source of news. Meanwhile, the quality of the news circulating on social media has declined substantially over the years.

Fake news is generated intentionally, often by unknown sources. Methodologies already exist to validate, individually, the trustworthiness of users, the truthfulness of the news, and user engagement on social media. But analyzing these features in isolation does not give a holistic measure of news credibility. Hence, combining this auxiliary information with the news content itself is a promising route. There are also techniques that classify news content by analyzing the users' writing style, but these methods have their own outliers and error rates.

Aim

We will be building a WhatsApp-based service that accepts news headlines from the user and predicts whether the given news is fake.

Requirements

Let's Build

Now we know what fake news is and why it's a major issue. Let's jump into building a solution to fight this problem. We will be using the LIAR dataset by William Yang Wang, which he introduced in his research paper "Liar, Liar Pants on Fire": A New Benchmark Dataset for Fake News Detection.

The original dataset comes with the following columns:

  • Column 1: the ID of the statement ([ID].json)
  • Column 2: the label
  • Column 3: the statement
  • Column 4: the subject(s)
  • Column 5: the speaker
  • Column 6: the speaker's job title
  • Column 7: the state info
  • Column 8: the party affiliation
  • Column 9-13: the total credit history count, including the current statement
    • 9: barely true counts
    • 10: false counts
    • 11: half true counts
    • 12: mostly true counts
    • 13: pants on fire counts
  • Column 14: the context (venue / location of the speech or statement)

For simplicity, we have converted it to a two-column format:

  • Column 1: Statement (News headline or text)
  • Column 2: Label (Label class contains: True, False)
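
The article does not show how the conversion was done, so the following is only a sketch of one way it might look, using Python's standard csv module. The grouping of LIAR's six labels into True and False is an assumption, as are the function name to_two_columns and the made-up sample row.

```python
import csv
import io

# Assumed mapping from LIAR's six labels to the two classes used here.
TRUE_LABELS = {"true", "mostly-true", "half-true"}
FALSE_LABELS = {"false", "barely-true", "pants-fire"}


def to_two_columns(tsv_in, csv_out):
    """Read 14-column LIAR rows and write (Statement, Label) rows."""
    reader = csv.reader(tsv_in, delimiter="\t", quoting=csv.QUOTE_NONE)
    writer = csv.writer(csv_out)
    writer.writerow(["Statement", "Label"])
    for row in reader:
        if len(row) < 3:
            continue  # skip malformed lines
        label, statement = row[1], row[2]
        if label in TRUE_LABELS:
            writer.writerow([statement, "True"])
        elif label in FALSE_LABELS:
            writer.writerow([statement, "False"])


# Usage with an in-memory sample row (made up for illustration).
sample = ("1.json\tfalse\tTaxes went up.\ttaxes\tjane-doe"
          "\t\t\t\t0\t0\t0\t0\t0\ta speech\n")
out = io.StringIO()
to_two_columns(io.StringIO(sample), out)
print(out.getvalue())
```

Where to draw the line between the two classes (for example, whether half-true counts as True) is a design choice that changes the class balance, so it is worth checking against the modified dataset itself.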

You can find the modified dataset here. Now that we have a dataset, let's start building a machine learning model.

Step 1: Preprocessing

Data preprocessing is the process of preparing raw data and making it suitable for a machine learning model. It is the first, and a crucial, step in creating one. In a machine learning project, we rarely come across data that is already clean and well formatted, so before doing anything else with the data we have to clean it and put it into a consistent format. That is what the preprocessing tasks are for.

The file preprocessing.py contains all the preprocessing functions needed to process the input documents and texts. First, we read the train, test, and validation data files, then perform preprocessing such as tokenizing and stemming. We also perform some exploratory data analysis, such as looking at the distribution of the response variable, and data quality checks for null or missing values.

```python
from nltk.corpus import stopwords as nltk_stopwords
from nltk.stem import SnowballStemmer

# Assumed setup: the snippet uses these two names without defining them.
# The stop-word list requires nltk.download('stopwords') once.
eng_stemmer = SnowballStemmer('english')
stopwords = set(nltk_stopwords.words('english'))


# Stemming
def stem_tokens(tokens, stemmer):
    """Return the stem of every token."""
    return [stemmer.stem(token) for token in tokens]


# Process the data
def process_data(data, exclude_stopword=True, stem=True):
    """Lowercase a list of tokens, then optionally stem and drop stop words."""
    tokens = [w.lower() for w in data]
    if stem:
        tokens = stem_tokens(tokens, eng_stemmer)
    if exclude_stopword:
        tokens = [w for w in tokens if w not in stopwords]
    return tokens


# Creating n-grams
# Unigram
def create_unigram(words):
    assert isinstance(words, list)
    return words


# Bigram
def create_bigrams(words):
    """Join each word with its right-hand neighbour; fall back to unigrams."""
    assert isinstance(words, list)
    skip = 0  # number of words that may be skipped between the pair
    join_str = " "
    n = len(words)
    if n > 1:
        lst = []
        for i in range(n - 1):
            for k in range(1, skip + 2):
                if i + k < n:
                    lst.append(join_str.join([words[i], words[i + k]]))
    else:
        # Set it as unigram
        lst = create_unigram(words)
    return lst
```
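
The tokenizing step mentioned above is not part of the snippet, so here is a small standard-library sketch of how a raw headline might be turned into tokens and bigrams. The regular expression, the tiny stop-word set, and the helper names tokenize and ngrams are assumptions for illustration; the project itself relies on NLTK.

```python
import re

# Tiny stand-in stop-word list (assumption); the article's code uses NLTK's.
STOP_WORDS = {"the", "a", "an", "is", "of", "to", "in", "on"}


def tokenize(text):
    """Split text into lowercase word tokens (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens, n):
    """Return the list of n-grams, each joined by a single space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


headline = "The unemployment rate is at a record low"
tokens = [t for t in tokenize(headline) if t not in STOP_WORDS]
print(tokens)             # ['unemployment', 'rate', 'at', 'record', 'low']
print(ngrams(tokens, 2))  # ['unemployment rate', 'rate at', 'at record', 'record low']
```

Unlike create_bigrams, this ngrams helper returns an empty list for a single-word input instead of falling back to unigrams.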



Step 2: Feature Selection

For feature selection, we have used methods like simple bag-of-words and n-grams, followed by term-frequency weighting such as tf-idf. We have also written code to extract word2vec and POS-tagging features, though these have not been used at this point in the project.
We are looking at the following features:

```python
def features(sentence, index):
    """ sentence: [w1, w2, ...], index: the index of the word """
    return {
        'word': sentence[index],
        'is_first': index == 0,
        'is_last': index == len(sentence) - 1,
        'is_capitalized': sentence[index][0].upper() == sentence[index][0],
        'is_all_caps': sentence[index].upper() == sentence[index],
        'is_all_lower': sentence[index].lower() == sentence[index],
        'prefix-1': sentence[index][0],
        'prefix-2': sentence[index][:2],
        'prefix-3': sentence[index][:3],
        'suffix-1': sentence[index][-1],
        'suffix-2': sentence[index][-2:],
        'suffix-3': sentence[index][-3:],
        'prev_word': '' if index == 0 else sentence[index - 1],
        'next_word': '' if index == len(sentence) - 1 else sentence[index + 1],
        'has_hyphen': '-' in sentence[index],
        'is_numeric': sentence[index].isdigit(),
        'capitals_inside': sentence[index][1:].lower() != sentence[index][1:]
    }
```
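
The function above works on individual words. The tf-idf weighting mentioned at the start of this step can be illustrated separately with a small standard-library sketch. It assumes the plain textbook formula (term frequency times log of inverse document frequency) and a hypothetical helper name tf_idf; library vectorizers such as scikit-learn's use smoothed variants, so their numbers will differ.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Uses tf = count / len(doc) and idf = log(N / df), an assumed
    textbook variant rather than any particular library's formula.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Three made-up tokenized headlines.
docs = [
    ["taxes", "went", "up"],
    ["taxes", "went", "down"],
    ["crime", "went", "down"],
]
weights = tf_idf(docs)
print(weights[0])
```

A term that appears in every document, like "went" here, gets a weight of zero, while a term unique to one document, like "up", gets the highest weight. That is the property that makes tf-idf more informative than raw counts.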



Step 3: Classification

Here we have built all the classifiers for fake news detection. The extracted features are fed into different classifiers: we have used Naive Bayes, Logistic Regression, Linear SVM, Stochastic Gradient Descent, and Random Forest classifiers from sklearn. Each of the extracted feature sets was used with all of the classifiers. After fitting each model, we compared the F1 scores and checked the confusion matrices.

```text
n-grams & tfidf confusion matrix and F1 scores

#Naive bayes
 [841 3647]
 [427 5325]
 f1-Score: 0.723262051071

#Logistic regression
 [1617 2871]
 [1097 4655]
 f1-Score: 0.70113000531

#svm
 [2016 2472]
 [1524 4228]
 f1-Score: 0.67909201429

#sgdclassifier
 [  10 4478]
 [  13 5739]
 f1-Score: 0.718731637053

#random forest
 [1979 2509]
 [1630 4122]
 f1-Score: 0.665720333284
```
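
To make the link between these matrices and the F1 scores concrete, the sketch below recomputes F1 from each matrix. It assumes the layout used by scikit-learn's confusion_matrix (rows are actual classes, columns are predicted classes) and treats the second class as the positive one; the article does not state either, and the helper name f1_from_confusion is made up.

```python
def f1_from_confusion(matrix):
    """F1 for the second class of a 2x2 matrix laid out as [[tn, fp], [fn, tp]]."""
    (tn, fp), (fn, tp) = matrix
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Matrices and scores copied from the results listed above.
reported = {
    "Naive Bayes": ([[841, 3647], [427, 5325]], 0.723262051071),
    "Logistic regression": ([[1617, 2871], [1097, 4655]], 0.70113000531),
    "SVM": ([[2016, 2472], [1524, 4228]], 0.67909201429),
    "SGD classifier": ([[10, 4478], [13, 5739]], 0.718731637053),
    "Random forest": ([[1979, 2509], [1630, 4122]], 0.665720333284),
}

for name, (matrix, score) in reported.items():
    computed = f1_from_confusion(matrix)
    print(f"{name}: computed {computed:.4f}, reported {score:.4f}")
```

Under these assumptions the computed values land within about 0.001 of the reported ones. The small gaps would be consistent with the reported scores being averaged over cross-validation folds, though that is a guess. Note also that the SGD classifier puts almost every example in the second column, so a high F1 on one class can hide a model that rarely predicts the other class.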



After fitting all the classifiers, the two best-performing models were selected as candidate models for fake news classification. We performed parameter tuning on these candidates using the GridSearchCV method and chose the best-performing parameters for each. Finally, the selected model was used for fake news detection, reporting the probability of truth. In addition, we extracted the top 50 features from our tf-idf vectorizer to see which words are most important in each class. We also used precision-recall and learning curves to see how the training and test sets perform as we increase the amount of data given to the classifiers.

Step 4: Prediction

Our final and best-performing classifier was Logistic Regression, which was then saved to disk under the name final_model.sav. Once you clone the repository, this model will be copied to your machine and will be used by the prediction.py file to classify fake news. It takes a news article as input from the user, then uses the model to produce the final classification, which is shown to the user along with the probability of truth.

```python
import pickle


def detecting_fake_news(var):
    # Retrieve the best model for the prediction call
    with open('final_model.sav', 'rb') as model_file:
        load_model = pickle.load(model_file)
    prediction = load_model.predict([var])
    prob = load_model.predict_proba([var])

    return prediction, prob
```



Step 5: Integrating Twilio WhatsApp API

We have to write code that accepts a news article headline or text from the Twilio WhatsApp API and passes it to our model for prediction. For this, we will use a Python Flask API server. (You can follow a similar process for the SMS API as well.)

The following script will do that:

```python
from flask import Flask, request
import prediction
from twilio.twiml.messaging_response import MessagingResponse

app = Flask(__name__)


@app.route('/sms', methods=['POST'])
def sms():
    resp = MessagingResponse()
    inbMsg = request.values.get('Body')
    pred, confidence = prediction.detecting_fake_news(inbMsg)

    resp.message(
        f'The news headline you entered is {pred[0]!r} and corresponds to {confidence[0][1]!r}.')
    return str(resp)


if __name__ == '__main__':
    app.run()
```
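
The reply text in the script exposes the raw label and probability. As an optional refinement, a small formatting helper could turn them into a friendlier message. This is only a sketch: format_reply is a hypothetical helper, and it assumes that confidence[0][1] is the probability of the True class, which depends on the order of the saved model's classes.

```python
def format_reply(label, prob_true):
    """Build a human-readable reply from a predicted label and P(true).

    Assumes the label prints as 'True' or 'False' and that prob_true
    is the model's probability for the True class.
    """
    verdict = "likely true" if str(label) == "True" else "likely fake"
    return (f"This headline looks {verdict} "
            f"(estimated probability of being true: {prob_true:.0%}).")


print(format_reply("False", 0.2))
```

Inside the sms() handler it would be called as format_reply(pred[0], confidence[0][1]) and the result passed to resp.message().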



Now you have to expose an endpoint that the Twilio WhatsApp Sandbox can reach.

Your Flask app will need to be visible from the web so Twilio can send requests to it. Ngrok lets us do this. With it installed, start the Flask app from the directory your code is in, then run ngrok http 5000 in a new terminal tab.

Grab that ngrok URL to configure the Twilio WhatsApp Sandbox (either the Sandbox if you want to do testing, or your main WhatsApp Sender number if you have one provisioned). On the Sandbox configuration page, enter the ngrok URL followed by /sms, the route defined in the script above, as the URL to call when a message comes in.

And we’re good to go! Let’s test our application on WhatsApp! We can send some news headlines or facts to this sandbox and get predictions in return if everything works as expected.

Hurray! You wanna try this? Complete code is available on GitHub.

What's Next?

This was a very basic implementation with limited data, but I really hope this will be sufficient to give you an idea about the cool things you can do with machine learning and Twilio. You can try to tweak this project and use various datasets to build something cooler! So what are you planning to build? Tell me in the comments or hit me up on Twitter with your ideas, and I will be happy to collaborate!

In the next part, we will see how we can use advanced pre-trained NLP models like BERT, GPT-2, XLNet, and Grover to achieve our goal!

References

  1. "Liar, Liar Pants on Fire": A New Benchmark Dataset for Fake News Detection
  2. Twilio WhatsApp API
  3. Fake News Detection with LIAR Dataset
  4. What is Fake News?
  5. FEVER: a large-scale dataset for Fact Extraction and VERification

Published at DZone with permission of Jayesh Bapu Ahire , DZone MVB. See the original article here.
