
Gathering and Extracting from Tweets with R


Yesterday evening, I wanted to play with Twitter to see which websites I use most often as references in my Tweets, and to build a Top Four list.

The first problem I encountered was that installing twitteR on Ubuntu is not that simple! You have to properly install RCurl first, and before you can install that package in R, you need the libcurl development headers (on Ubuntu, this is typically the libcurl4-gnutls-dev package). So run the following line in a terminal:

$ sudo apt-get install libcurl4-gnutls-dev

then launch R:

$ R

and run the standard:

> install.packages("RCurl")

and finally install the package of interest:

> install.packages("twitteR")
Then, the second problem I had was that twitteR had been updated recently because of Twitter's new API. Now, you have to register on Twitter's developers webpage, get a consumer key and a consumer secret (an ID and a password, essentially), and use them in the following function (I did change both of them below, so if you try to run the following code, you will probably get an error message):
> library(twitteR)
> cred <- getTwitterOAuth("ikzCtYif9Rwoood45w","rsCCifp99kw5sJfKfOUhhwyVmPl9A")
> registerTwitterOAuth(cred)
[1] TRUE
You are also asked to open a given webpage in your browser and enter the PIN displayed there; the console prompts you with:

To enable the connection, please direct your web browser to:

When complete, record the PIN given to you and provide it here:
It is a pain in the a**, trust me. Anyway, I was able to run it, and I could then get the list of all my (recent) Tweets:

> T <- userTimeline('freakonometrics',n=5000)

Now, my (third) problem was extracting the reference URLs from my Tweets. For instance, when you look at the text of the second Tweet of the list, you see:

> T[[2]]
[1] "freakonometrics: [textmining] \"How a Computer Program Helped Reveal J. K. 
Rowling as Author of A Cuckoos Calling\" http://t.co/wdmBGL8cmj by @garethideas"
So, what I get is not the URL used in my Tweet, but a shortened http://t.co/ link. Thankfully, @3wen (as always) was able to help me with the following two functions:
> extraire <- function(entree,motif){
+   # return the first capture group of 'motif' found in 'entree' (NA if no match)
+   res <- regexec(motif,entree)
+   if(length(res[[1]])==2){
+     debut <- (res[[1]])[2]
+     fin <- debut+(attr(res[[1]],"match.length"))[2]-1
+     return(substr(entree,debut,fin))
+   }else return(NA)}
> unshorten <- function(url){
+   # fetch the headers only (HEAD request), without following the redirect,
+   # then read the target URL from the 'location' field
+   uri <- getURL(url, header=TRUE, nobody=TRUE, followlocation=FALSE,
+                 cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
+   res <- try(extraire(uri,"\r\nlocation: (.*?)\r\nserver"))
+   return(res)}
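To see what extraire() actually does, here is a toy call on a hand-made header string (the header content below is made up for illustration):

> h <- "HTTP/1.1 301 Moved\r\nlocation: http://example.com/article\r\nserver: tsa_b\r\n"
> extraire(h, "\r\nlocation: (.*?)\r\nserver")
[1] "http://example.com/article"

The trick in unshorten() is the combination of nobody=TRUE (only the headers are fetched) and followlocation=FALSE (the redirect is not followed), so that t.co's 'location' header, which contains the original URL, is still visible in the response.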

Now, if we use those functions, we can get the true URL:

> url <- "http://t.co/wdmBGL8cmj"
> unshorten(url)
[1] "http://www.scientificamerican.com/article.cfm?id=how-a-computer-program-helped-show.."
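The extraction function below works on the text of the Tweets, so I first put those texts in a simple vector; T_text is just my name for it (twitteR's status objects expose their text via a getText() method):

> T_text <- sapply(T, function(x) x$getText())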
Now I can play with my list to extract, for each Tweet, the URL and the address of the website:

> exturl <- function(i){
+   text_tw <- T_text[i]
+   locunshort2 <- NULL
+   # locate the words starting with "http" in the Tweet
+   indtext <- which(substr(unlist(strsplit(text_tw, " ")),1,4)=="http")
+   if(length(indtext)>0){
+     loc <- unlist(strsplit(text_tw, " "))[indtext]
+     locunshort <- unshorten(loc)
+     if(!is.na(locunshort)){
+       # keep only the domain name, i.e. the third chunk of the unshortened URL
+       locunshort2 <- unlist(strsplit(locunshort, "/"))[3]}}
+   return(locunshort2)}
Using sapply() with this function on my list, and counting with a simple table(), I can see my Top Four reference websites (out of over 900 Tweets).
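The apply-and-count step looks something like this (a sketch; sites is just my name for the vector of extracted domains):

> sites <- unlist(sapply(1:length(T_text), exturl))
> sort(table(sites), decreasing=TRUE)[1:4]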
Nice, isn’t it?
