Text Mining: Twitter Reacts to M. S. Dhoni's Decision

A look at the feelings behind the resignation of an Indian cricket team captain, using Hadoop and R.

On Wednesday, Mahendra Singh Dhoni announced that he was stepping down as captain of India's limited-overs cricket teams. As expected, the decision sent his fans, fellow cricketers, and experts into a tizzy. Many congratulated Dhoni on his glorious career as captain, with cricket legend Sachin Tendulkar leading from the front, saying “it’s a day to celebrate his successful career and respect the decision.”

After the announcement, we (BDCoE Lab) decided to analyze the tweets and gauge the sentiment surrounding it using Hadoop and R.

Data Collection: In the first stage, we fetched data from the Twitter service and stored it in HDFS using Apache Flume.

Store Data in HDFS: The Twitter JSON data was stored in HDFS.

Apache Hive: We used Hive to transform the data into a formatted dataset for the data science process.

Data Science Using R 

  • Word Frequencies: A common task in text mining is to look at word frequencies.
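library(dplyr)
library(ggplot2)

# Plot every word that appears more than 3,000 times, ordered by frequency.
word_tweets_dhoni %>%
  count(word, sort = TRUE) %>%
  filter(n > 3000) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_bar(stat = "identity") +
  xlab(NULL) +
  # Assumed final layer: horizontal bars keep the word labels readable.
  coord_flip()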
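
The snippet above starts from word_tweets_dhoni, a table with one word per row, without showing where it comes from. A minimal sketch of how such a table might be built with tidytext follows. The tweets_dhoni data frame and its id and text columns are hypothetical stand-ins for the Hive output, not the authors' actual schema.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for the Hive output: one row per tweet.
tweets_dhoni <- tibble(
  id   = 1:3,
  text = c("Thank you captain Dhoni",
           "Dhoni the best captain ever",
           "Respect the decision, captain")
)

# One row per word; unnest_tokens() lowercases text and strips punctuation by default.
word_tweets_dhoni <- tweets_dhoni %>%
  unnest_tokens(word, text)

# Word frequencies, most common first.
word_counts <- word_tweets_dhoni %>%
  count(word, sort = TRUE)

print(word_counts)
```

On real data the same two steps apply unchanged; only the source of tweets_dhoni differs.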

  • Generate a WordCloud: An image composed of words used in a particular text or subject, in which the size of each word indicates its frequency or importance.
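library(tidytext)   # stop_words
library(wordcloud)
# Drop common stop words, then size each remaining word by its count.
word_tweets_dhoni %>%
  anti_join(stop_words, by = "word") %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 200))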

Sentiment Wordcloud

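library(reshape2)  # acast()

# Split the cloud by Bing sentiment: one color for negative words, one for positive.
word_tweets_dhoni %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("#F8766D", "#00BFC4"),
                   max.words = 200)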
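
The acast() step is what lets comparison.cloud() compare the two sentiments: it reshapes the long table of counts into a matrix with one row per word and one column per sentiment. A small sketch of that reshaping, using made-up words and counts rather than the real tweet data:

```r
library(reshape2)

# Toy counts in the same shape as count(word, sentiment); values are illustrative only.
sentiment_counts <- data.frame(
  word      = c("legend", "miss", "sad"),
  sentiment = c("positive", "negative", "negative"),
  n         = c(5, 2, 1)
)

# Rows become words, columns become sentiments; missing combinations are filled with 0.
sentiment_matrix <- acast(sentiment_counts, word ~ sentiment,
                          value.var = "n", fill = 0)

print(sentiment_matrix)
```

Each column of the matrix is then drawn as its own half of the comparison cloud.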

  • Combinations of words using n-grams: Using bigrams to provide context in sentiment analysis. 
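
The bigram snippets that follow rely on two tables, tweets_dhoni_bigrams_separated and tweets_dhoni_bigrams_filtered, whose construction is not shown. A minimal sketch of one way to produce them with tidytext and tidyr, assuming a hypothetical tweets_dhoni data frame with a text column:

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical input: one row per tweet.
tweets_dhoni <- tibble(
  text = c("not happy captain cool",
           "captain cool never fails")
)

# Tokenize into overlapping pairs of consecutive words.
tweets_dhoni_bigrams <- tweets_dhoni %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

# Split each bigram into its first and second word.
tweets_dhoni_bigrams_separated <- tweets_dhoni_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")

# Drop pairs in which either word is a stop word.
tweets_dhoni_bigrams_filtered <- tweets_dhoni_bigrams_separated %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word)

print(tweets_dhoni_bigrams_separated)
```

The separated table keeps stop words such as "not", which the negation analysis needs, while the filtered table is the better input for the network plot.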

Words Preceded By "Captain"

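# Assumed definition of AFINN, which the snippets use without defining it.
# Recent tidytext releases name the score column `value` instead of `score`.
AFINN <- get_sentiments("afinn")

captain_words <- tweets_dhoni_bigrams_separated %>%
  filter(word1 == "captain") %>%
  inner_join(AFINN, by = c(word2 = "word")) %>%
  count(word2, score, sort = TRUE)

# Show the 20 words whose score times frequency is largest in magnitude.
captain_words %>%
  mutate(contribution = n * score) %>%
  arrange(desc(abs(contribution))) %>%
  head(20) %>%
  mutate(word2 = reorder(word2, contribution)) %>%
  ggplot(aes(word2, contribution, fill = contribution > 0)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  xlab("Words preceded by \"captain\"") +
  ylab("Sentiment score * number of occurrences") +
  # Assumed final layer: horizontal bars keep the word labels readable.
  coord_flip()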

Words Preceded By Negation

negation_words <- c("not", "no", "never", "without", "like")

negated_words <- tweets_dhoni_bigrams_separated %>%
  filter(word1 %in% negation_words) %>%
  inner_join(AFINN, by = c(word2 = "word")) %>%
  count(word1, word2, score, sort = TRUE)

# One panel per preceding word, showing the ten strongest contributions in each.
negated_words %>%
  mutate(contribution = n * score) %>%
  mutate(word2 = reorder(word2, contribution)) %>%
  group_by(word1) %>%
  top_n(10, abs(contribution)) %>%
  ggplot(aes(word2, contribution, fill = contribution > 0)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  facet_wrap(~ word1, scales = "free") +
  xlab("Words preceded by negation") +
  ylab("Sentiment score * number of occurrences") +
  # Assumed final layer: horizontal bars keep the word labels readable.
  coord_flip()
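
To make the contribution measure concrete: a word that follows a negation is scored by a word-level lexicon as if the negation were absent, so score times count estimates how far that word pushes the overall sentiment in the wrong direction. The sketch below works through the arithmetic on a handful of hypothetical bigrams with a made-up lexicon in AFINN's shape, so it runs without downloading the real one.

```r
library(dplyr)

# Hypothetical bigrams, already split into word1 and word2.
bigrams <- tibble(
  word1 = c("not", "not", "never", "captain", "not"),
  word2 = c("happy", "happy", "forget", "cool", "good")
)

# Toy lexicon shaped like AFINN; these scores are invented for illustration.
toy_lexicon <- tibble(
  word  = c("happy", "good", "forget"),
  score = c(3, 3, -1)
)

negation_words <- c("not", "never")

# Keep only negated pairs, attach scores, and weight each score by its frequency.
negated <- bigrams %>%
  filter(word1 %in% negation_words) %>%
  inner_join(toy_lexicon, by = c(word2 = "word")) %>%
  count(word1, word2, score, sort = TRUE) %>%
  mutate(contribution = n * score)

print(negated)
```

Here "not happy" occurs twice with a score of 3, so a word-level analysis would overstate positive sentiment by 6 for that pair alone.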

Visualizing a Network of Bigrams With igraph

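library(igraph)
library(ggraph)

tweets_dhoni_bigrams_counts <- tweets_dhoni_bigrams_filtered %>%
  count(word1, word2, sort = TRUE)

# Keep mid-frequency bigrams and turn them into a directed graph.
# graph_from_data_frame() is an assumed completion of the truncated pipeline.
tweets_dhoni_bigrams_graph <- tweets_dhoni_bigrams_counts %>%
  filter(n > 500 & n < 3000) %>%
  graph_from_data_frame()

# Closed arrowheads show the direction from the first word to the second.
a <- grid::arrow(type = "closed", length = unit(.15, "inches"))

ggraph(tweets_dhoni_bigrams_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE, arrow = a) +
  geom_node_point(color = "lightblue", size = 5) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  # Assumed final layer: drop axes and background, which carry no meaning for a network.
  theme_void()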
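
For readers new to igraph, the following sketch shows how a table of bigram counts might become a graph: the first two columns are read as the start and end of each edge, and any remaining columns (here n) become edge attributes. The counts are hypothetical, and the thresholds simply mirror the filter used above.

```r
library(dplyr)
library(igraph)

# Hypothetical bigram counts; the real ones come from the tweet data.
bigram_counts <- tibble(
  word1 = c("captain", "captain", "thank"),
  word2 = c("cool", "dhoni", "you"),
  n     = c(900, 700, 100)
)

# Same frequency window as above: drop rare pairs and extremely common ones.
bigram_graph <- bigram_counts %>%
  filter(n > 500 & n < 3000) %>%
  graph_from_data_frame()

print(vcount(bigram_graph))  # number of distinct words kept
print(ecount(bigram_graph))  # number of bigram edges kept
print(E(bigram_graph)$n)     # counts carried along as an edge attribute
```

Because n travels with each edge, ggraph can map it to edge_alpha so that more frequent pairs are drawn darker.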


Published at DZone with permission of
