Applying NLP to Tweets With Python
Learn how to use natural language processing to analyze the tweets of four popular Indian journalists in order to get a quantified view of their political standing.
This post is about using Python code for applying basic NLP (natural language processing) techniques on tweets.
The four Twitter users whose tweets I analyzed are Nidhi Razdan, Rupa Subramanya, Shubhrastha, and Rana Ayyub. They are among the very few media persons I like to hear or read, for I find them insightful and interesting in how they present their information and arguments. Most of the other journalists I've looked at are either not insightful or simply boring. In terms of political inclination, Nidhi and Rana are left of center, whereas Rupa and Shubhrastha are to the right.
The analysis has a limited purpose: it's intended to put a set of numbers around tweets, sort of getting a basic quantified view of these journalists' tweets.
By "basic" NLP, I mean applying the following techniques:
- Word density: This is the simplest one; it calculates words per tweet.
- Lexical diversity: This is an interesting statistic, which Matthew A. Russell explains in Mining the Social Web:
"Lexical diversity is defined as the number of unique words divided by the number of total words in a corpus; by definition, a lexical diversity of 1.0 would mean that all words in a corpus were unique, while a lexical diversity that approaches 0.0 implies more duplicate words. In the Twitter sphere, lexical diversity might be interpreted in a similar fashion if comparing two Twitter users, but it might also suggest a lot about the relative diversity of overall content being discussed, as might be the case with someone who talks only about technology versus someone who talks about a much wider range of topics."
- Top words: The words used most frequently. My program prints the top five.
- Popularity: The sum of the number of retweets and likes received by a tweet. My program prints the top five.
- Sentiment: A piece of text can be classified as positive, neutral, or negative. My program calculates the number and percentage of positive, neutral, and negative tweets.
- Clustering: Clustering divides a set of entities into a fixed number of groups whose membership is not determined beforehand but emerges as the data is processed. I used k-means, an unsupervised machine learning algorithm that partitions n data points into k clusters based on some measure of similarity. My program uses k=5, so it divides the tweets into five clusters (topics) and prints ten words from each cluster as an indicative sample.
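To make the k-means idea concrete, here is a minimal pure-Python sketch on 2-D points. This is only an illustration of the algorithm itself, not the program's actual implementation (which vectorizes tweets with gensim first); the point data and function signature are mine:

```python
import random

def kmeans(points, k, iterations=20, seed=42):
    """Minimal k-means: assign each point to its nearest centroid,
    then recompute each centroid as the mean of its cluster."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Squared Euclidean distance to each centroid
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Move each centroid to the mean of its assigned points
        centroids = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return clusters

# Two obvious groups of points should end up in separate clusters.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
clusters = kmeans(points, k=2)
```

With tweets, each point is a vector representing a tweet instead of a 2-D coordinate, but the assign-and-recompute loop is the same.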
I wrote the code in Python. The program has a few helper functions that are used by the more functional routines:
clean_text_and_tokenize: This function takes a line (string) as input, cleans it, and returns the words as a list. Cleaning consists of removing hyperlinks, punctuation marks, and stop words, and lemmatizing the remaining words. Stop words are words like a and an that we don't want to be part of the analysis. Lemmatizing is the process of replacing a word with its base form.
clean_tweet: This function takes a line (string), gets the clean words by calling clean_text_and_tokenize, and returns a string by joining the cleaned words.
getCleanedWords: This function takes a list of lines (strings), cleans each line, and returns all words from all the lines.
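A stdlib-only sketch of these three helpers follows. The real program uses NLTK's stop-word list and WordNet lemmatizer; the tiny stop-word set and the one-line lemmatizer stub below are simplified stand-ins:

```python
import re
import string

# Tiny stand-in for NLTK's stop-word list
STOP_WORDS = {"a", "an", "and", "the", "is", "this", "to", "of"}

def lemmatize(word):
    # Stand-in for a real lemmatizer such as nltk.stem.WordNetLemmatizer;
    # here we only strip a trailing "s"
    return word[:-1] if word.endswith("s") else word

def clean_text_and_tokenize(line):
    """Remove hyperlinks and punctuation, drop stop words, lemmatize the rest."""
    line = re.sub(r"https?://\S+", "", line)                # strip hyperlinks
    line = line.translate(str.maketrans("", "", string.punctuation))
    words = [w.lower() for w in line.split()]
    return [lemmatize(w) for w in words if w not in STOP_WORDS]

def clean_tweet(line):
    """Return the cleaned tweet as a single string."""
    return " ".join(clean_text_and_tokenize(line))

def getCleanedWords(lines):
    """Flatten cleaned words from all lines into one list."""
    return [w for line in lines for w in clean_text_and_tokenize(line)]
```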
The key functional routines are:
lexical_diversity: This function takes a list of words and returns the number of unique words divided by the total number of words.
average_words: This function takes an array of strings, splits each into words, and returns the average number of words per string.
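These two statistics are essentially one-liners. A sketch (the signatures are my guesses from the descriptions above):

```python
def lexical_diversity(words):
    """Unique words divided by total words; 1.0 means every word is unique."""
    return len(set(words)) / len(words) if words else 0.0

def average_words(tweets):
    """Mean number of words per tweet."""
    total = sum(len(t.split()) for t in tweets)
    return total / len(tweets) if tweets else 0.0
```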
top_words: This function takes in a list of words, stores the frequency of each word, and returns the most frequently used words. If the top number is not passed as an argument, it defaults to five.
popular_tweets: This function adds the retweet count and like count of every tweet to calculate its popularity. It uses a priority queue to identify the most popular tweets. If the top number is not passed as an argument, it defaults to five.
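Both routines map naturally onto the standard library: collections.Counter for word frequencies and heapq for the popularity selection. A sketch, where the dict keys for each tweet are my assumptions about the record layout:

```python
import heapq
from collections import Counter

def top_words(words, top_n=5):
    """Return the top_n most frequent words with their counts."""
    return Counter(words).most_common(top_n)

def popular_tweets(tweets, top_n=5):
    """Rank tweets by retweets + likes using a heap-based selection.
    Each tweet is assumed to be a dict with 'text', 'retweets', and 'likes' keys."""
    return heapq.nlargest(top_n, tweets, key=lambda t: t["retweets"] + t["likes"])
```

heapq.nlargest keeps only top_n candidates in memory, which is why a priority queue is a good fit when scanning thousands of tweets.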
sentiment_analysis_basic: This function uses the sentiment method of the TextBlob library to calculate the polarity of a tweet. The tweet is classified as positive, neutral, or negative depending on whether the polarity is greater than, equal to, or less than zero.
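In the program, each polarity comes from TextBlob(tweet).sentiment.polarity; the sign-based classification itself is simple. The sketch below takes precomputed polarity values so it runs without TextBlob installed:

```python
def classify_sentiments(polarities):
    """Count positive (> 0), neutral (== 0), and negative (< 0) polarity
    scores and report each bucket's count and percentage."""
    counts = {"positive": 0, "neutral": 0, "negative": 0}
    for p in polarities:
        if p > 0:
            counts["positive"] += 1
        elif p < 0:
            counts["negative"] += 1
        else:
            counts["neutral"] += 1
    total = len(polarities) or 1  # avoid division by zero on empty input
    return {k: (v, 100.0 * v / total) for k, v in counts.items()}
```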
clusterTweetsKmeans: This function uses the gensim library to create a vector model from the cleaned tweets. After training the model, it invokes the KMeans routine of the sklearn library to cluster the tweets into five topics.
The code is available in my GitHub repository python-misc. The input to the program is a file named <twitter_user>.csv. This file has to be generated first by running the excellent Exporter.py program from the GitHub repository GetOldTweets-python. The ultra-cool feature of this module is that you don't have to register an app on twitter.com or embed authorization tokens and passwords in the code.
For this article, I have fetched tweets from 01-Jan-2015 to 25-Sep-2017. To get the CSV file of tweets from @Nidhi, the command is:
$ python Exporter.py --username "Nidhi" --since 2015-01-01 --until 2017-09-25
Exporter.py creates a file named output_got.csv, which I renamed to Nidhi.csv. The command to run my program is:
$ python tweets_analysis Nidhi
The program opens the CSV file and reads all the records into a list of strings, skipping the first line, which is the header. It then calls the functional routines one by one. The output generated for the run on Nidhi.csv is below:
Total no. of tweets: 3120
Average number of words per tweet = 10.4330128205
Lexical diversity = 0.252425418385
+-------+-------+
| Words | Count |
+-------+-------+
| thank |   320 |
| india |   162 |
| say   |   124 |
| yes   |   109 |
| also  |   101 |
+-------+-------+
Printing top five tweets:
1. I don't know who killed Gauri Lankesh. But I do see who is celebrating her death and vilifying her. Popularity = 17679 Link = https://twitter.com/Nidhi/status/905431561985773570
2. A message to those in the media who are still independent and do their job by fearlessly asking questions. We won't be intimidated https://twitter.com/pti_news/status/871593196953849856 ... Popularity = 10653 Link = https://twitter.com/Nidhi/status/871595543041941504
3. It's now fairly clear demonetisation was a purely political move. Brilliant actually. Economy got hit but hey, U.P. was won Popularity = 9018 Link = https://twitter.com/Nidhi/status/902877626334756864
5. Honoured to present my book 'Left,Right &Centre,The Idea of India' to the President @RashtrapatiBhvn @PenguinIndia pic.twitter.com/m6MrQHmhNr Popularity = 7233 Link = https://twitter.com/Nidhi/status/886984933507448833
The sentiment analysis boils down to:
No. of positive tweets = 1043; percentage = 33.4294871795
No. of neutral tweets = 1616; percentage = 51.7948717949
No. of negative tweets = 461; percentage = 14.7756410256
Topic 1 has words: income tax department sends notice harsh mander institute via httweets
Topic 2 has words: anyone bjp condemned language today actually first one anything else
Topic 3 has words: wonder took long life short live fruit covered story well
Topic 4 has words: lol sigh never according yes saying mention press cog corner
Topic 5 has words: hiv but thank actually thank thank sephora actually french thank
I have captured the output of the runs against the four files in this Google sheet.
Rupa is the most prolific, averaging about 45 tweets per day, whereas the most popular tweet is from Nidhi Razdan. Shubhrastha uses the most words per tweet among the four. The highest lexical diversity is Nidhi's, indicative of a larger vocabulary. Rupa's value is very low, probably because the denominator (her total word count) is very high, given how much she tweets. The highest positive sentiment is from Rupa and the highest negative sentiment is from Shubhrastha, both right-leaning. Sentiment neutrality is lowest in Rupa's tweets, indicative of her taking a stand most of the time.
Program Improvement and Enhancement
For the lexical diversity calculation, we should perhaps compare an equal number of tweets from each user.
Sentiment analysis can be done with a more advanced algorithm like Naive Bayes; that would require a corpus of pre-classified tweets, the training data, as it is technically called, preferably from Indian users.
Once we have a larger dataset of Twitter analyses, this program could be used to classify a Twitter user's political orientation as left, center, or right by analyzing their tweets. This could be done either by comparing their tweets with a political-ideology corpus or by measuring similarity with one of the already-analyzed Twitter users.
Just showing the words in a cluster is not meaningful. I need to experiment with the number of clusters and analyze each cluster separately to derive some semantic meaning. Topic clustering can also be done with a probabilistic algorithm like LDA.
Published at DZone with permission of Mahboob Hussain , DZone MVB. See the original article here.