Gathering and Extracting from Tweets with R

Yesterday evening, I wanted to play with Twitter to see which websites I was using as references in my Tweets. I wanted to create a Top Four list.

The first problem I encountered was that installing twitteR on Ubuntu is not that simple! You have to properly install RCurl … but before you install the package in R, it is necessary to run the following line in a terminal:

$ sudo apt-get install 
then, launch R:
$ R
and then you can run the standard:
> install.packages("RCurl")
and finally install the package of interest:
> install.packages("twitteR")
Then, the second problem I had was that twitteR has been updated recently because of Twitter’s new API. Now, you have to register on Twitter’s developers webpage, get an ID and a password (the consumer key and secret of your application), then pass them to the following function (I changed both of them below, so if you try to run the following code, you will probably get an error message):
> library(twitteR)
> cred <- getTwitterOAuth("ikzCtYif9Rwoood45w","rsCCifp99kw5sJfKfOUhhwyVmPl9A")
> registerTwitterOAuth(cred)
[1] TRUE
> T <- userTimeline('freakonometrics',n=5000)
You will also be asked to open a webpage in your browser and enter the PIN it gives you:
To enable the connection, please direct your web browser to:


When complete, record the PIN given to you and provide it here:
It is a pain in the a**, trust me. Anyway, I was able to run it. I could then get the list of all my (recent) Tweets:

> T <- userTimeline('freakonometrics',n=5000)

Now, my (third) problem was extracting the URLs of references from my Tweets. The second Tweet of the list was:

But when you look at the text, you see:

> T[[2]]
[1] "freakonometrics: [textmining] \"How a Computer Program Helped Reveal J. K. Rowling as Author of A Cuckoos Calling\" http://t.co/wdmBGL8cmj by @garethideas"
So, what I get is not the URL used in my Tweet, but a shortened URL from http://t.co/ that redirects to it. Thankfully, @3wen (as always) has been able to help me with the following functions:
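> library(RCurl)   # provides getURL()
> # return the first captured group of the pattern motif in entree, or NA
> extraire <- function(entree,motif){
+   res <- regexec(motif,entree)
+   # res[[1]] holds the full match and one captured group when the pattern matches
+   if(length(res[[1]])==2){
+     debut <- (res[[1]])[2]
+     fin <- debut+(attr(res[[1]],"match.length"))[2]-1
+     return(substr(entree,debut,fin))
+   }else return(NA)}
> # request only the headers of a short URL, without following the redirect,
> # and read the target from the "location" header
> unshorten <- function(url){
+   uri <- getURL(url, header=TRUE, nobody=TRUE, followlocation=FALSE, 
+       cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
+   res <- try(extraire(uri,"\r\nlocation: (.*?)\r\nserver"))
+   return(res)}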
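
The pattern above expects a lowercase "location" header immediately followed by a "server" header, which depends on how the server answers. As a minimal sketch of a more tolerant variant, the helper below (extract_location is a hypothetical name, and the header block is made up for illustration) splits the raw headers into lines and looks for the location header in any case. It works on a plain string, so it can be tried without any network access:

```r
# Hypothetical helper (not from the original post): pull the Location
# header out of a raw HTTP header block, whatever its case or position.
extract_location <- function(headers) {
  # Header lines are separated by CRLF; split them into individual lines
  lines <- unlist(strsplit(headers, "\r\n", fixed = TRUE))
  # Header names are case-insensitive, so match "location:" in any case
  hit <- grep("^location:", lines, ignore.case = TRUE, value = TRUE)
  if (length(hit) == 0) return(NA_character_)
  # Drop the header name and surrounding whitespace, keep the first match
  trimws(sub("^location:", "", hit[1], ignore.case = TRUE))
}

# Made-up header block, for illustration only
sample_headers <- paste(
  "HTTP/1.1 301 Moved Permanently",
  "Content-Length: 0",
  "Location: http://www.example.com/some/article",
  "Server: example",
  "", sep = "\r\n")

extract_location(sample_headers)
```

Something like this could be applied to the string returned by getURL(..., header=TRUE, nobody=TRUE) in place of the regular expression, under the assumption that the redirect is announced by a Location header.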

Now, if we use those functions, we can get the true URL:

> url <- "http://t.co/wdmBGL8cmj"
> unshorten(url)
[1] http://www.scientificamerican.com/article.cfm?id=how-a-computer-program-helped-show..
Now I can play with my list to extract the URLs and the address of the website:
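> # T_text is assumed to hold the text of the Tweets as a character vector,
> # e.g. T_text <- sapply(T, function(x) x$getText())
> exturl <- function(i){
+   text_tw <- T_text[i]
+   locunshort2 <- NULL
+   # split the Tweet into words and keep those starting with "http"
+   mots <- unlist(strsplit(text_tw, " "))
+   indtext <- which(substr(mots,1,4)=="http")
+   if(length(indtext)>0){
+     # only the first URL is unshortened, so that is.na() tests a single value
+     loc <- mots[indtext[1]]
+     locunshort <- unshorten(loc)
+     if(is.na(locunshort)==FALSE){
+       # "http://host/path" split on "/" gives "http:", "", "host", ...
+       locunshort2 <- unlist(strsplit(locunshort, "/"))[3]}}
+   return(locunshort2)}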
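
Splitting on spaces misses a URL that is glued to punctuation or followed by a line break. As an alternative sketch (find_urls is a hypothetical helper, not part of the original code), base R's regmatches() and gregexpr() can pull every http or https link out of a Tweet, assuming a URL is simply a run of non-space characters after the scheme:

```r
# Hypothetical helper: return every http(s) URL found in a piece of text.
find_urls <- function(text) {
  # A URL is taken to be "http://" or "https://" followed by
  # any run of non-whitespace characters
  regmatches(text, gregexpr("https?://[^[:space:]]+", text))[[1]]
}

# Text of the Tweet quoted earlier in the post
tweet <- "[textmining] \"How a Computer Program Helped Reveal J. K. Rowling as Author of A Cuckoos Calling\" http://t.co/wdmBGL8cmj by @garethideas"
find_urls(tweet)
```

When the text contains no link, the function returns an empty character vector, so its length can be tested before calling unshorten().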
Using apply with this function and my list, and counting with a simple table() function, I can see that the Top Four websites I reference (out of over 900 Tweets) are the following:
Nice, isn’t it?
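
The counting step can be sketched on made-up data. The URLs below are placeholders, not the ones from my Tweets; the idea is only to show how table(), sort() and head() give a ranking of websites:

```r
# Made-up URLs, for illustration only
urls <- c("http://www.example.com/a", "http://www.example.org/b",
          "http://www.example.com/c", "http://blog.example.net/d",
          "http://www.example.com/e", "http://www.example.org/f")

# The host is the third piece when "http://host/path" is split on "/"
hosts <- sapply(strsplit(urls, "/"), function(p) p[3])

# Count each host, sort by decreasing frequency, keep at most four entries
top_sites <- head(sort(table(hosts), decreasing = TRUE), 4)
top_sites
```

With real data, hosts would be the vector of websites returned by exturl for each Tweet, with the NULL results dropped.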


Published at DZone with permission of

Opinions expressed by DZone contributors are their own.
