
How to Collect Big Data Sets From Twitter

In this post, we learn how to collect big data sets from Twitter and prepare them for more accurate sentiment analysis.


In this post, you'll learn how to collect data from Twitter, one of the biggest sources of big data sets.

You'll also need to set up a Hadoop cluster with HDFS to store the multi-format data you'll gather from Twitter. Though we'll focus on this one platform, you can obtain more accuracy if you collect data from other channels as well.
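Later in this post, we'll write the collected tweets to HDFS from R. A minimal connectivity check, assuming the rhdfs package from the RHadoop project (and a HADOOP_CMD environment variable pointing at your hadoop binary), might look like this:

library(rhdfs)   # R bridge to HDFS from the RHadoop project
hdfs.init()      # connects using the HADOOP_CMD environment variable
hdfs.ls("/")     # list the HDFS root to confirm the cluster is reachable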

Twitter offers several APIs that let you retrieve and manipulate tweets on demand. To get started, you'll have to implement Twitter's authentication framework, OAuth. Once implementation and authentication are done, you are good to go ahead and grab tweets as you wish.

Using R or RStudio to Retrieve Data

In order to retrieve tweets successfully, you'll have to install a few packages on your Ubuntu system.

Start by installing these packages:

  • libcurl4-gnutls-dev
  • libcurl4-nss-dev
  • libcurl4-openssl-dev
  • r-base r-base-dev
  • r-cran-rjson
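As a minimal sketch of that step (assuming apt-get on Ubuntu; note that the three libcurl4-*-dev variants typically conflict with one another, so installing just one of them is usually enough):

sudo apt-get install libcurl4-openssl-dev r-base r-base-dev r-cran-rjson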

The next step is to start your R console and run the commands below. This will install the Twitter access packages on your system:

install.packages("twitteR")
install.packages("ROAuth")
install.packages("RCurl")

Open your R workspace, clear it, and load the following libraries:

rm(list=ls())
library(twitteR)
library(ROAuth)
library(RCurl)

Once all of these steps have completed successfully, proceed to Twitter authentication using the following R script:

download.file(url="http://curl.haxx.se/ca/cacert.pem", destfile="cacert.pem")

requestURL <- "https://api.twitter.com/oauth/request_token"
accessURL <- "https://api.twitter.com/oauth/access_token"
authURL <- "https://api.twitter.com/oauth/authorize"

# substitute your own app credentials from Twitter
consumerKey <- "myConsumerKeyFromTwitter"
consumerSecret <- "myConsumerSecretFromTwitter"

myCred <- OAuthFactory$new(consumerKey=consumerKey,
                           consumerSecret=consumerSecret,
                           requestURL=requestURL,
                           accessURL=accessURL,
                           authURL=authURL)

accessToken <- "myAccessTokenFromTwitter"
accessSecret <- "myAccessSecretFromTwitter"

setup_twitter_oauth(consumerKey, consumerSecret, accessToken, accessSecret)
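Before saving, the credential object has to complete the OAuth handshake. A minimal sketch of that step, assuming ROAuth's PIN-based flow, could look like this:

myCred$handshake(cainfo="cacert.pem")   # prints an authorization URL; enter the PIN it gives you
save(myCred, file="twitter authentication.Rdata")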

Repeat these steps until they complete without errors, and save the authentication object as 'twitter authentication.Rdata.' Now start R, load the saved authentication file into the session, and register the Twitter OAuth credentials. The register call returns 'TRUE' when there is no error, which means you're all set to move further into the process.

To extract targeted tweets, you'll have to define two variables: the first is the search string, a hashtag or mention; the second is the number of tweets you plan to extract. If you wish to limit the tweets to English, use this code:

load("twitter authentication.Rdata")

registerTwitterOAuth(cred)

search.string <- "#nba"

no.of.tweets <- 100

tweets <- searchTwitter(search.string, n=no.of.tweets, cainfo="cacert.pem", lang="en")
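searchTwitter() returns a list of status objects; if you'd rather work with a flat table, twitteR's twListToDF() converts the list to a data frame (a quick sketch using the variables above):

tweets_df <- twListToDF(tweets)   # one row per tweet: text, created, screenName, etc.
head(tweets_df$text)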

The code to fetch tweets for a particular keyword, where the since argument takes a date string in "YYYY-MM-DD" form:

govt_sentiment_data <- searchTwitter("#keyWord", since={last_date_pulled})

There's also a streaming option that collects tweets automatically over a fixed time window; it comes from the streamR package and writes incoming tweets to a JSON file, which you then parse back into a data frame. If you plan on using this method, swap the previous code for this:

library(streamR)   # filterStream() and parseTweets() come from streamR
filterStream(file.name="tweets_rstats.json", track="#keyWord",
             timeout=3600, oauth=myCred)
govt_sentiment_data <- parseTweets("tweets_rstats.json")

The process will leave you with loads of tweets, some useful and some not. You can use the code below for data cleansing.

# extract the text of each tweet
govt_sentiment_data_txt = govt_sentiment_data$text

# remove retweet entities
govt_sentiment_data_txt = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", govt_sentiment_data_txt)

# remove @-mentions
govt_sentiment_data_txt = gsub("@\\w+", "", govt_sentiment_data_txt)

# remove punctuation
govt_sentiment_data_txt = gsub("[[:punct:]]", "", govt_sentiment_data_txt)

# remove numbers
govt_sentiment_data_txt = gsub("[[:digit:]]", "", govt_sentiment_data_txt)

# remove html links
govt_sentiment_data_txt = gsub("http\\w+", "", govt_sentiment_data_txt)

# collapse runs of whitespace and trim leading/trailing spaces
govt_sentiment_data_txt = gsub("[ \t]{2,}", " ", govt_sentiment_data_txt)
govt_sentiment_data_txt = gsub("^\\s+|\\s+$", "", govt_sentiment_data_txt)

# drop any remaining unexpected characters
govt_sentiment_data_txt = gsub("[^0-9a-zA-Z ,./?><:;'~`!@#&*']", "", govt_sentiment_data_txt)
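Cleansing can leave some entries empty. As an optional extra step using the same variable, you can drop them before storage:

# drop tweets that became empty after cleansing
govt_sentiment_data_txt <- govt_sentiment_data_txt[nchar(govt_sentiment_data_txt) > 0]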

Save the clean data to your storage/HDFS using the rhdfs package:

library(rhdfs)   # hdfs.* functions come from RHadoop's rhdfs package
hdfs.init()

hdfsFile <- hdfs.file("/tmp/govt_sentiment_data.txt", "w")
hdfs.write(govt_sentiment_data_txt, hdfsFile)
hdfs.close(hdfsFile)

# keep a local copy of the cleansed text as well
write(govt_sentiment_data_txt, "govt_sentiment_data.txt")

That's it! There you have your big data from Twitter. You can also collect data from other sources and perform an even more accurate sentiment analysis.
