Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Gathering and Extracting from Tweets with R

DZone's Guide to

Gathering and Extracting from Tweets with R

· Big Data Zone ·
Free Resource

The open source HPCC Systems platform is a proven, easy to use solution for managing data at scale. Visit our Easy Guide to learn more about this completely free platform, test drive some code in the online Playground, and get started today.

Yesterday evening, I wanted to play with Twitter to see which websites I was using as references in my Tweets. I wanted to create a Top Four list.

The first problem I encountered was that installing twitteR on Ubuntu is not that simple! You have to properly install RCurl … but before you install the package in R, it is necessary to run the following line in a terminal:

$ sudo apt-get install 
  libcurl4-gnutls-dev
then, launch R:
$ R
and then you can run the standard:
> install.packages("RCurl")
and install finally the package of interest,
> install.packages("twitteR")
Then, the second problem I had was that  twitteR has been updated recently because of Twitter’s new API. Now, you should register on Twitter’s developers webpage, get an ID and a password, then use it in the following function (I did change both of them below, so if you try to run the following code, you will probably get an error message):
> library(twitteR)
> cred <- getTwitterOAuth("ikzCtYif9Rwoood45w","rsCCifp99kw5sJfKfOUhhwyVmPl9A")
> registerTwitterOAuth(cred)
[1] TRUE
> T <- userTimeline('freakonometrics',n=5000)
you should also go to this webpage and enter a PIN that you are given.
To enable the connection, please direct your web browser to:

http://api.twitter.com/oauth/authorize?oauth_token=cQaDmxGe...

When complete, record the PIN given to you and provide it here:
It is a pain in a**, trust me. Anyway, I was able to run it. I could then have the list with all my (recent) Tweets:

> T <- userTimeline('freakonometrics',n=5000)

Now, my (third) problem was extracting the URLs of references from my Tweets. The second Tweet of the list was:

But when you look at the text, you see:

> T[[2]]
[1] "freakonometrics: [textmining] \"How a Computer Program Helped Reveal J. K. 
Rowling as Author of A Cuckoos Calling\" http://t.co/wdmBGL8cmj by @garethideas"
So, what I get is not the URL used in my Tweet, but a shortcut to the URLs from http://t.co/. Thankfully, @ 3wen (as always) has been able to help me with the following functions:
> extraire <- function(entree,motif){
+	res <- regexec(motif,entree)
+	if(length(res[[1]])==2){
+		debut <- (res[[1]])[2]
+		fin <- debut+(attr(res[[1]],"match.length"))[2]-1
+		return(substr(entree,debut,fin))
+	}else return(NA)}
> unshorten <- function(url){
+	uri <- getURL(url, header=TRUE, nobody=TRUE, followlocation=FALSE, 
+       cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
+	res <- try(extraire(uri,"\r\nlocation: (.*?)\r\nserver"))
+	return(res)}

Now, if we use those functions, we can get the true URL:

> url <- "http://t.co/wdmBGL8cmj"
> unshorten(url)
[1] http://www.scientificamerican.com/article.cfm?id=how-a-computer-program-helped-show..
Now I can play with my list to extract the URLs and the address of the website:
> exturl <- function(i){
+ text_tw <- T_text[i]
+ locunshort2 <- NULL
+ indtext <- which(substr(unlist(strsplit(text_tw, " ")),1,4)=="http")
+ if(length(indtext)>0){
+ loc <- unlist(strsplit(text_tw, " "))[indtext]
+ locunshort=unshorten(loc)
+ if(is.na(locunshort)==FALSE){
+ locunshort2 <- unlist(strsplit(locunshort, "/"))[3]}}
+ return(locunshort2)}
Using apply with this function, and my list, and counting using a simple table() function, I can see that my Top Four (out of over 900 Tweets) reference websites are the following:
Nice, isn’t it?

Managing data at scale doesn’t have to be hard. Find out how the completely free, open source HPCC Systems platform makes it easier to update, easier to program, easier to integrate data, and easier to manage clusters. Download and get started today.

Topics:

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}