
Text Mining: Twitter Reacts to M. S. Dhoni's Decision

A look at the feelings behind the resignation of an Indian cricket team captain, using Hadoop and R.

Mahendra Singh Dhoni announced on Wednesday that he was stepping down as India's limited-overs cricket captain. As expected, his decision sent his fans, other cricketers, and experts into a tizzy. Many congratulated Dhoni on his glorious career as captain, with cricket legend Sachin Tendulkar leading from the front, saying, “It’s a day to celebrate his successful career and respect the decision.”

Afterwards, we (BDCoE Lab) decided to analyze the tweets and look at the sentiment surrounding the announcement using Hadoop and R.


Data Collection: In the first stage, we fetched data from the Twitter service using Apache Flume and stored it in HDFS.

Store Data in HDFS: The Twitter JSON data was stored in HDFS.

Apache Hive: We used Hive to transform the data into a formatted dataset for the data science process.

Data Science Using R 

  • Word Frequencies: A common task in text mining is to look at word frequencies.
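library(dplyr)
library(ggplot2)

# Count each word, keep those used more than 3,000 times, and plot them
word_tweets_dhoni %>%
  count(word, sort = TRUE) %>%
  filter(n > 3000) %>%
  mutate(word = reorder(word, n)) %>%   # order the bars by frequency
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()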

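The word_tweets_dhoni table is assumed to hold one word per row. A minimal sketch of how such a table might be built with tidytext follows; the tweets_dhoni data frame, its columns, and the sample tweets are hypothetical stand-ins for the Hive output, not the data we actually used.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for the Hive output: one row per tweet
tweets_dhoni <- tibble(
  id   = 1:3,
  text = c("Dhoni steps down as captain",
           "Thank you captain Dhoni",
           "Respect the decision of a great captain")
)

# One row per word; unnest_tokens() lowercases the text and
# strips punctuation by default
word_tweets_dhoni <- tweets_dhoni %>%
  unnest_tokens(word, text)

# Word frequencies, most common first
word_tweets_dhoni %>%
  count(word, sort = TRUE)
```

Keeping one token per row is what lets the later steps use plain dplyr verbs such as count() and the join functions.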

  • Generate a WordCloud: An image composed of words used in a particular text or subject, in which the size of each word indicates its frequency or importance.
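library(dplyr)
library(tidytext)   # provides the stop_words table
library(wordcloud)

# Drop common stop words, then size each word by its frequency
word_tweets_dhoni %>%
  anti_join(stop_words, by = "word") %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 200))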


Sentiment Wordcloud

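library(reshape2)   # provides acast()

# Split the cloud into negative and positive words using the Bing lexicon
word_tweets_dhoni %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("#F8766D", "#00BFC4"),
                   max.words = 200)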

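To make the lexicon join concrete, here is a small sketch of how the Bing lexicon labels words before they reach the cloud. The toy_words table is made-up data for illustration only.

```r
library(dplyr)
library(tidytext)

# Toy word list standing in for word_tweets_dhoni (hypothetical data)
toy_words <- tibble(word = c("good", "good", "bad", "captain", "good"))

# Keep only the words found in the Bing lexicon and tally them by sentiment;
# words the lexicon does not know (here "captain") drop out of the join
toy_sentiment <- toy_words %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(sentiment)

toy_sentiment
```

Because inner_join() discards anything missing from the lexicon, the sentiment cloud only ever shows words that Bing labels as positive or negative.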

  • Combinations of words using n-grams: Using bigrams to provide context in sentiment analysis. 

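The bigram tables used in the rest of this section (tweets_dhoni_bigrams_separated and tweets_dhoni_bigrams_filtered) can be built roughly as sketched here. This assumes a hypothetical tweets_dhoni data frame with a text column; the sample tweets are made up.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical sample tweets
tweets_dhoni <- tibble(
  text = c("Thank you captain cool",
           "Never forget captain Dhoni")
)

# Tokenize into overlapping pairs of consecutive words
tweets_dhoni_bigrams <- tweets_dhoni %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

# Split each bigram into its first and second word
tweets_dhoni_bigrams_separated <- tweets_dhoni_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")

# Drop bigrams in which either word is a stop word
tweets_dhoni_bigrams_filtered <- tweets_dhoni_bigrams_separated %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word)

tweets_dhoni_bigrams_separated
```

Separating the two words into their own columns is what makes filters such as word1 == "captain" possible.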
Words Preceded By "Captain"

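# AFINN scores each word from -5 (most negative) to 5 (most positive).
# Recent tidytext releases name this column `value` rather than `score`.
AFINN <- get_sentiments("afinn")

captain_words <- tweets_dhoni_bigrams_separated %>%
  filter(word1 == "captain") %>%
  inner_join(AFINN, by = c(word2 = "word")) %>%
  count(word2, score, sort = TRUE) %>%
  ungroup()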


captain_words %>%
  mutate(contribution = n * score) %>%
  arrange(desc(abs(contribution))) %>%
  head(20) %>%
  mutate(word2 = reorder(word2, contribution)) %>%
  ggplot(aes(word2, contribution, fill = contribution > 0)) +
  geom_col(show.legend = FALSE) +
  xlab("Words preceded by \"captain\"") +
  ylab("Sentiment score * number of occurrences") +
  coord_flip()


Words Preceded By Negation

# "like" is not a negation, so its facet shows ordinary (non-reversed) usage
negation_words <- c("not", "no", "never", "without", "like")


negated_words <- tweets_dhoni_bigrams_separated %>%
  filter(word1 %in% negation_words) %>%
  inner_join(AFINN, by = c(word2 = "word")) %>%
  count(word1, word2, score, sort = TRUE) %>%
  ungroup()

negated_words %>%
  mutate(contribution = n * score) %>%
  mutate(word2 = reorder(word2, contribution)) %>%
  group_by(word1) %>%
  top_n(10, abs(contribution)) %>%
  ggplot(aes(word2, contribution, fill = contribution > 0)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ word1, scales = "free") +
  xlab("Words preceded by negation") +
  ylab("Sentiment score * number of occurrences") +
  coord_flip()

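The contribution column measures how much each negated word pushes a word-level sentiment total in the wrong direction. A minimal worked sketch, using made-up counts and scores rather than values from our data:

```r
library(dplyr)

# Made-up counts and scores, for illustration only
negated_toy <- tibble(
  word1 = c("not", "not", "never"),
  word2 = c("good", "bad", "forget"),
  score = c(3, -3, -1),
  n     = c(10, 4, 7)
)

negated_toy <- negated_toy %>%
  mutate(
    contribution = n * score,     # what a word-by-word analysis would add
    corrected    = -contribution  # the same amount with the sign reversed
  )

negated_toy

# Net effect before and after reversing the sign for negated words
sum(negated_toy$contribution)
sum(negated_toy$corrected)
```

In this toy case "not good" is counted as positive ten times, so a word-by-word total overstates positive sentiment. Simply flipping the sign is a crude correction, shown here only to illustrate the direction of the error.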

Visualizing a Network of Bigrams With igraph

tweets_dhoni_bigrams_counts <- tweets_dhoni_bigrams_filtered %>% 
  count(word1, word2, sort = TRUE)

library(igraph)

tweets_dhoni_bigrams_graph <- tweets_dhoni_bigrams_counts %>%
  filter(n > 500 & n < 3000) %>%
  graph_from_data_frame()

tweets_dhoni_bigrams_graph

library(ggraph)

set.seed(2016)

a <- grid::arrow(type = "closed", length = unit(.15, "inches"))

ggraph(tweets_dhoni_bigrams_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE, arrow = a) +
  geom_node_point(color = "lightblue", size = 5) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()

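To see what graph_from_data_frame() builds from a bigram count table, here is a small sketch on made-up counts. The first two columns become the "from" and "to" nodes, and any remaining columns (here n) are stored as edge attributes.

```r
library(igraph)

# Made-up bigram counts, for illustration only
toy_counts <- data.frame(
  word1 = c("captain", "captain", "ms"),
  word2 = c("cool", "dhoni", "dhoni"),
  n     = c(900, 2500, 1200)
)

# Directed graph: one edge per bigram, pointing from word1 to word2
toy_graph <- graph_from_data_frame(toy_counts)

vcount(toy_graph)        # number of distinct words (nodes)
ecount(toy_graph)        # number of bigrams (edges)
E(toy_graph)$n           # counts carried along as an edge attribute
is_directed(toy_graph)
```

The edge attribute n is what the edge_alpha aesthetic in the ggraph call maps to, so more frequent bigrams are drawn as darker arrows.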

Thanks....!!

Topics:
text analysis, big data, hadoop, r

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.
