
Gathering and Extracting from Tweets with R



Yesterday evening, I wanted to play with Twitter to see which websites I was using as references in my Tweets. I wanted to create a Top Four list.

The first problem I encountered was that installing twitteR on Ubuntu is not that simple! You have to install RCurl properly, and before you can install that package in R, you need its system dependency (the libcurl development headers). Run the following in a terminal:

$ sudo apt-get install 

then launch R:

$ R

and run the standard:

> install.packages("RCurl")

and finally install the package of interest:

> install.packages("twitteR")
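If the apt-get line above looks incomplete: the dependency RCurl needs is the libcurl development headers. The exact Ubuntu package name varies by release, so treat the following as a guess to verify locally (for instance with apt-cache search libcurl):

```shell
# Hypothetical package name -- check 'apt-cache search libcurl' on your release
sudo apt-get install libcurl4-openssl-dev   # or: libcurl4-gnutls-dev
```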
Then, the second problem I had: twitteR was updated recently because of Twitter’s new API. You now have to register on Twitter’s developers webpage, get a consumer key and secret, and use them in the following function (I changed both below, so if you try to run this code, you will probably get an error message):
> library(twitteR)
> cred <- getTwitterOAuth("ikzCtYif9Rwoood45w","rsCCifp99kw5sJfKfOUhhwyVmPl9A")
> registerTwitterOAuth(cred)
[1] TRUE
During the OAuth handshake, R sends you to a webpage where you are given a PIN, which you then enter back in the console:

To enable the connection, please direct your web browser to:

When complete, record the PIN given to you and provide it here:

It is a pain in a**, trust me. Anyway, I was able to run it. I could then get the list of all my (recent) Tweets:

> T <- userTimeline('freakonometrics',n=5000)

Now, my third problem was extracting the URLs of the references from my Tweets. The second Tweet in the list contained a link, but when you look at the raw text, you see:

> T[[2]]
[1] "freakonometrics: [textmining] \"How a Computer Program Helped Reveal J. K. 
Rowling as Author of A Cuckoos Calling\" http://t.co/wdmBGL8cmj by @garethideas"
So what I get is not the URL I used in my Tweet, but a shortened http://t.co/ redirect. Thankfully, @3wen (as always) was able to help me with the following functions:
> extraire <- function(entree, motif){
+   # return the first capture group of 'motif' found in 'entree' (NA if no match)
+   res <- regexec(motif, entree)
+   if(length(res[[1]]) == 2){
+     debut <- (res[[1]])[2]
+     fin <- debut + (attr(res[[1]], "match.length"))[2] - 1
+     return(substr(entree, debut, fin))
+   } else return(NA)}
> unshorten <- function(url){
+   # fetch only the headers, without following the redirect,
+   # and read the target URL off the "location:" field
+   uri <- getURL(url, header=TRUE, nobody=TRUE, followlocation=FALSE,
+     cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
+   res <- try(extraire(uri, "\r\nlocation: (.*?)\r\nserver"))
+   return(res)}
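As a quick sanity check of extraire() on its own, here is a standalone sketch with a made-up header string shaped like what getURL() returns:

```r
# Re-define the post's extraire() helper so this snippet runs on its own
extraire <- function(entree, motif){
  res <- regexec(motif, entree)
  if(length(res[[1]]) == 2){
    debut <- (res[[1]])[2]
    fin <- debut + (attr(res[[1]], "match.length"))[2] - 1
    return(substr(entree, debut, fin))
  } else return(NA)
}

# A made-up HTTP header, mimicking the response for a t.co link
hdr <- "\r\nlocation: http://example.com/article\r\nserver: tsa"
extraire(hdr, "\r\nlocation: (.*?)\r\nserver")
# [1] "http://example.com/article"
```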

Now, if we use those functions, we can get the true URL:

> url <- "http://t.co/wdmBGL8cmj"
> unshorten(url)
[1] http://www.scientificamerican.com/article.cfm?id=how-a-computer-program-helped-show..
Now I can play with my list to extract the URLs and the websites’ addresses. First I collect the text of each Tweet into T_text, then look for the tokens starting with http:
> T_text <- sapply(T, function(x) x$getText())
> exturl <- function(i){
+   text_tw <- T_text[i]
+   locunshort2 <- NULL
+   # positions of the words starting with "http"
+   indtext <- which(substr(unlist(strsplit(text_tw, " ")),1,4)=="http")
+   if(length(indtext)>0){
+     loc <- unlist(strsplit(text_tw, " "))[indtext]
+     locunshort <- unshorten(loc)
+     if(!is.na(locunshort)){
+       # keep only the domain: the third element after splitting on "/"
+       locunshort2 <- unlist(strsplit(locunshort, "/"))[3]}}
+   return(locunshort2)}
Applying this function over my list with sapply, and counting with a simple table() function, I can see that my Top Four reference websites (out of over 900 Tweets) are the following:
Nice, isn’t it?
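For completeness, here is a standalone sketch of that last counting step, with made-up domain names standing in for the output of sapply(1:length(T), exturl) on the live Twitter data:

```r
# Made-up domains standing in for the real extraction results
L <- c("freakonometrics.hypotheses.org", "www.scientificamerican.com",
       "freakonometrics.hypotheses.org", NA, "arxiv.org",
       "freakonometrics.hypotheses.org", "www.scientificamerican.com")
counts <- sort(table(unlist(L)), decreasing = TRUE)  # table() drops the NAs
head(counts, 4)   # the most-referenced websites
```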



Published at DZone with permission of Arthur Charpentier, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
