Mahendra Singh Dhoni announced on Wednesday that he was stepping down as India's limited-overs captain. As expected, the decision sent his fans, fellow cricketers, and experts into a tizzy. Many congratulated Dhoni on his glorious career as captain, with cricket legend Sachin Tendulkar leading from the front, saying “it’s a day to celebrate his successful career and respect the decision.”
Afterwards, we (BDCoE Lab) decided to analyze the tweets about the announcement and gauge the sentiment surrounding it using Hadoop and R.
Data Collection: In the first stage, we fetched data from the Twitter service using Apache Flume.
Store Data in HDFS: Flume wrote the Twitter JSON data to HDFS.
Apache Hive: We used Hive to transform the data into a formatted dataset for the data science process.
Data Science Using R
- Word Frequencies: A common task in text mining is to look at word frequencies.
word_tweets_dhoni %>%
  count(word, sort = TRUE) %>%
  filter(n > 3000) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_bar(stat = "identity") +
  xlab(NULL) +
  coord_flip()
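The post does not show how `word_tweets_dhoni` was built. The following is a minimal sketch of one way to get a one-word-per-row table with tidytext, assuming the Hive output was loaded into a data frame with a `text` column. The `tweets_sample` data and its column names are hypothetical stand-ins, not the actual dataset.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for the tweets exported from Hive; the real
# column names are not given in the post.
tweets_sample <- tibble(
  id   = 1:3,
  text = c("Thank you captain Dhoni",
           "Dhoni steps down as captain",
           "A great captain and a great finisher")
)

# One row per word; unnest_tokens() lowercases the text and strips
# punctuation by default.
word_tweets_sample <- tweets_sample %>%
  unnest_tokens(word, text)

# Same counting step as in the post, without the plot or the
# n > 3000 threshold (the sample is far too small for that).
word_counts <- word_tweets_sample %>%
  count(word, sort = TRUE)

print(word_counts)
```

On the full dataset, a frequency threshold such as `n > 3000` keeps the bar chart readable by dropping the long tail of rare words.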
- Generate a WordCloud: An image composed of words used in a particular text or subject, in which the size of each word indicates its frequency or importance.
library(wordcloud)

word_tweets_dhoni %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 200))
# Comparison cloud: positive vs. negative words from the Bing lexicon
library(reshape2)  # provides acast()

word_tweets_dhoni %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("#F8766D", "#00BFC4"), max.words = 200)
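To make the reshaping step before `comparison.cloud()` more concrete, here is a small sketch of the join, count, and `acast()` sequence on toy data. The `toy_words` table and the three-word `toy_lexicon` are hypothetical. The lexicon only mimics the `word` and `sentiment` columns of `get_sentiments("bing")`.

```r
library(dplyr)
library(reshape2)

# Hypothetical tokenized tweets (one word per row).
toy_words <- tibble(
  word = c("great", "sad", "great", "captain", "miss", "sad", "great")
)

# Hypothetical lexicon with the same columns as the Bing lexicon.
toy_lexicon <- tibble(
  word      = c("great", "sad", "miss"),
  sentiment = c("positive", "negative", "negative")
)

# inner_join() drops words that are not in the lexicon ("captain" here);
# acast() then spreads the counts into a word-by-sentiment matrix,
# filling missing combinations with 0.
sentiment_matrix <- toy_words %>%
  inner_join(toy_lexicon, by = "word") %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0)

print(sentiment_matrix)
```

The resulting matrix has one row per word and one column per sentiment, which is the input shape `comparison.cloud()` works with.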
- Combinations of words using n-grams: Using bigrams to provide context in sentiment analysis.
Words Preceded By "Captain"
captain_words <- tweets_dhoni_bigrams_separated %>%
  filter(word1 == "captain") %>%
  inner_join(AFINN, by = c(word2 = "word")) %>%
  count(word2, score, sort = TRUE) %>%
  ungroup()

# Axis labels are set before coord_flip(): x holds the words, y the score
captain_words %>%
  mutate(contribution = n * score) %>%
  arrange(desc(abs(contribution))) %>%
  head(20) %>%
  mutate(word2 = reorder(word2, contribution)) %>%
  ggplot(aes(word2, n * score, fill = n * score > 0)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  xlab("Words preceded by \"captain\"") +
  ylab("Sentiment score * # of occurrences") +
  coord_flip()
Words Preceded By Negation
negation_words <- c("not", "no", "never", "without", "like")

negated_words <- tweets_dhoni_bigrams_separated %>%
  filter(word1 %in% negation_words) %>%
  inner_join(AFINN, by = c(word2 = "word")) %>%
  count(word1, word2, score, sort = TRUE) %>%
  ungroup()

# One facet per preceding word, top 10 words by absolute contribution
negated_words %>%
  mutate(contribution = n * score) %>%
  mutate(word2 = reorder(word2, contribution)) %>%
  group_by(word1) %>%
  top_n(10, abs(contribution)) %>%
  ggplot(aes(word2, contribution, fill = n * score > 0)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  facet_wrap(~ word1, scales = "free") +
  xlab("Words preceded by negation") +
  ylab("Sentiment score * # of occurrences") +
  coord_flip()
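The bigram table `tweets_dhoni_bigrams_separated` and the `AFINN` object are used above without being defined in the post. As a rough sketch, they could be produced along these lines, assuming a `text` column of raw tweets. The sample tweets and the four-word `toy_afinn` lexicon are hypothetical. The lexicon only imitates the `word`/`score` layout the post relies on (recent tidytext releases name the AFINN column `value` instead of `score`).

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical raw tweets.
tweets_sample <- tibble(
  text = c("not good news for fans",
           "never happy about this",
           "not good at all",
           "no doubt a great captain")
)

# Tokenize into bigrams, then split each bigram into its two words.
bigrams_separated <- tweets_sample %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ")

# Hypothetical lexicon in the word/score layout used in the post.
toy_afinn <- tibble(
  word  = c("good", "happy", "doubt", "great"),
  score = c(3, 3, -1, 3)
)

# Score the second word of every bigram that starts with a negation.
# contribution = occurrences * score shows how much each negated word
# would skew a naive word-by-word sentiment total.
negated <- bigrams_separated %>%
  filter(word1 %in% c("not", "no", "never", "without")) %>%
  inner_join(toy_afinn, by = c(word2 = "word")) %>%
  count(word1, word2, score, sort = TRUE) %>%
  mutate(contribution = n * score)

print(negated)
```

In this sample, "not good" appears twice, so a unigram analysis would count it as strongly positive even though the phrase is negative. That is the misreading the bigram view is meant to expose.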
Visualizing a Network of Bigrams With igraph
tweets_dhoni_bigrams_counts <- tweets_dhoni_bigrams_filtered %>%
  count(word1, word2, sort = TRUE)

library(igraph)

tweets_dhoni_bigrams_graph <- tweets_dhoni_bigrams_counts %>%
  filter(n > 500 & n < 3000) %>%
  graph_from_data_frame()

tweets_dhoni_bigrams_graph

library(ggraph)
set.seed(2016)

a <- grid::arrow(type = "closed", length = unit(.15, "inches"))

ggraph(tweets_dhoni_bigrams_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE, arrow = a) +
  geom_node_point(color = "lightblue", size = 5) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()
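For readers unfamiliar with `graph_from_data_frame()`, this small sketch shows what it does with a bigram count table: the first two columns become directed edges and any remaining columns become edge attributes. The `toy_counts` values are hypothetical and only mirror the `word1`, `word2`, `n` layout used above.

```r
library(dplyr)
library(igraph)

# Hypothetical bigram counts in the same layout as the post's table.
toy_counts <- data.frame(
  word1 = c("captain", "cool", "thank", "step"),
  word2 = c("cool", "dhoni", "you", "down"),
  n     = c(900, 700, 600, 100)
)

# Keep mid-frequency bigrams, as in the post, then build a directed
# graph: word1 -> word2, with n stored on each edge.
toy_graph <- toy_counts %>%
  filter(n > 500 & n < 3000) %>%
  graph_from_data_frame()

print(toy_graph)
print(E(toy_graph)$n)
```

Because "cool" appears as both a second and a first word, the graph chains captain, cool, and dhoni together. This chaining is how longer phrases become visible in the network plot.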