
Gathering and Extracting from Tweets with R

Yesterday evening, I wanted to play with Twitter to see which websites I was using as references in my Tweets. I wanted to create a Top Four list.

The first problem I encountered was that installing twitteR on Ubuntu is not that simple! You have to properly install RCurl, and before you can install that package in R, you need the libcurl development headers. Run the following line in a terminal:

$ sudo apt-get install libcurl4-gnutls-dev
then, launch R:
$ R
and then you can run the standard:
> install.packages("RCurl")
and install finally the package of interest,
> install.packages("twitteR")
Then, the second problem I had was that twitteR had been updated recently because of Twitter’s new API. Now, you have to register on Twitter’s developers webpage, get an ID and a password (a consumer key and secret), and use them in the following function (I changed both of them below, so if you try to run the following code, you will probably get an error message):
> library(twitteR)
> cred <- getTwitterOAuth("ikzCtYif9Rwoood45w","rsCCifp99kw5sJfKfOUhhwyVmPl9A")
During this step, you are sent to a webpage and given a PIN that you must enter back in the console:
To enable the connection, please direct your web browser to:

When complete, record the PIN given to you and provide it here:
Once the PIN has been entered, the credentials can be registered:
> registerTwitterOAuth(cred)
[1] TRUE
It is a pain in the a**, trust me. Anyway, I was able to run it, and I could then retrieve the list of all my (recent) Tweets:

> T <- userTimeline('freakonometrics',n=5000)

Now, my (third) problem was extracting the URLs of the references from my Tweets. Take the second Tweet of the list: when you look at its text, you see:

> T[[2]]
[1] "freakonometrics: [textmining] \"How a Computer Program Helped Reveal J. K. 
Rowling as Author of A Cuckoos Calling\" http://t.co/wdmBGL8cmj by @garethideas"
So, what I get is not the URL used in my Tweet, but a shortened http://t.co/ link. Thankfully, @3wen (as always) was able to help me with the following functions (they rely on RCurl's getURL(), so run library(RCurl) first):
> extraire <- function(entree,motif){
+	# return the first captured group of 'motif' found in 'entree', or NA
+	res <- regexec(motif,entree)
+	if(length(res[[1]])==2){
+		debut <- (res[[1]])[2]
+		fin <- debut+(attr(res[[1]],"match.length"))[2]-1
+		return(substr(entree,debut,fin))
+	}else return(NA)}
> unshorten <- function(url){
+	# fetch only the headers of the short URL, without following the
+	# redirect, then extract the Location header (the true destination)
+	uri <- getURL(url, header=TRUE, nobody=TRUE, followlocation=FALSE,
+       cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
+	res <- try(extraire(uri,"\r\nlocation: (.*?)\r\nserver"))
+	return(res)}
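As a side note, and not part of the original workflow: recent versions of base R (3.2.0 and later) ship curlGetHeaders(), so an RCurl-free unshortener can be sketched as well. This is a minimal sketch, assuming the shortener announces the redirect in a Location header:

```r
# Sketch of an RCurl-free alternative using base R's curlGetHeaders()
# (available since R 3.2.0): fetch the headers of the short URL without
# following the redirect, then pull out the Location line.
unshorten_base <- function(url){
  h <- curlGetHeaders(url, redirect = FALSE)
  loc <- grep("^[Ll]ocation: ", h, value = TRUE)
  if(length(loc) == 0) return(NA)
  # strip the header name and the trailing CRLF
  sub("^[Ll]ocation: ", "", sub("\r?\n$", "", loc[1]))
}
```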

Now, if we use those functions, we can get the true URL:

> url <- "http://t.co/wdmBGL8cmj"
> unshorten(url)
[1] http://www.scientificamerican.com/article.cfm?id=how-a-computer-program-helped-show..
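One detail worth spelling out: T is a list of status objects, so the raw text has to be extracted first into the T_text vector used below. A minimal sketch, assuming the getText() method exposed by twitteR's status objects:

```r
# Build the vector of tweet texts from the list returned by
# userTimeline(); each status object exposes its text through
# the getText() method.
T_text <- sapply(T, function(status) status$getText())
```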
Now I can play with my list of Tweets to extract the URLs and keep only the websites' addresses:
> exturl <- function(i){
+	text_tw <- T_text[i]
+	locunshort2 <- NULL
+	# words starting with "http" are (shortened) links
+	indtext <- which(substr(unlist(strsplit(text_tw, " ")),1,4)=="http")
+	if(length(indtext)>0){
+		loc <- unlist(strsplit(text_tw, " "))[indtext]
+		locunshort <- unshorten(loc)
+		if(!is.na(locunshort)){
+			# keep only the domain name (third element of the split URL)
+			locunshort2 <- unlist(strsplit(locunshort, "/"))[3]}}
+	return(locunshort2)}
Applying this function to my whole list, and counting occurrences with a simple table() call, I can see that my Top Four reference websites (out of over 900 Tweets) are the following:
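That apply-and-count step might look like the following sketch (assuming T_text and the exturl() function above are defined; the actual ranking of course depends on your own Tweets):

```r
# Extract one website per Tweet -- NULLs from Tweets without links
# are dropped by unlist() -- then count and rank the domains.
sites <- unlist(sapply(seq_along(T_text), exturl))
rev(sort(table(sites)))[1:4]   # the Top Four reference websites
```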
Nice, isn’t it?



Published at DZone with permission of Arthur Charpentier, DZone MVB. See the original article here.

