
An Automatic Code to Extract Tweets (and Produce the "Somewhere Else" Review)


A few weeks ago, I asked in a post the (simple) question "dear reader, who are you?", just to learn more about the readers of my blog. I found the answers extremely interesting (even if, to be honest, I was expecting more of them, to start a more serious sociological study of my readership). One interesting point was that many readers come for the "somewhere else" posts, which are reviews of interesting posts and articles found on the internet. The links I share there actually come from my tweets. I keep a backup of my tweets on my blog, and that is usually where I go when I want to find some article, or graph, or map I have in mind, that I have seen somewhere (but usually cannot remember where). But most of the time, writing those posts feels tedious, because there is nothing new: it is simply a copy and paste of my tweets.

And this afternoon, @tomroud asked how those posts were written: was there an automatic procedure, or was I doing it manually? Until tonight, I was doing it manually. But since it sounded like a nice little challenge, I tried to write some code that generates a simple list of my tweets, which I can then use to produce a post.

Nevertheless, there are still two problems I cannot fix with code:

  • in my "somewhere else" posts, there is a language distinction, with posts and articles in English first, and then those in French. Unfortunately, I could not find a function that detects the language of a tweet. I remember that we had been trying, with @3wen, to write such a code, but I could not find it... I guess @3wen has a first draft, so if we can find it, I will upload it on my blog (or he will upload it on his); a possible workaround is sketched right after this list
  • in my posts, I include the picture, if any. This part will still be done manually, because it is much more difficult (but I guess it is possible... a very first step is sketched at the end of this post)
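
Regarding the first point, while waiting to find that draft, one option (my assumption here, not the code we had been writing with @3wen) could be the textcat package, which identifies languages from character n-gram profiles,

# rough sketch with the textcat package (not the code mentioned above):
# n-gram based language identification; note that very short tweets
# will often be misclassified
library(textcat)
tweets <- c("an interesting paper on extreme value theory",
            "un billet passionnant sur la théorie des valeurs extrêmes")
textcat(tweets)
# should return something like "english" "french"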

Now, before starting, we will need some functions from an old post, to convert Twitter's shortened URLs into real ones,

library(RCurl)

# extract the first capture group of the pattern 'motif' from the string 'entree'
extraire <- function(entree, motif){
  res <- regexec(motif, entree)
  if(length(res[[1]]) == 2){
    debut <- (res[[1]])[2]
    fin <- debut + (attr(res[[1]], "match.length"))[2] - 1
    return(substr(entree, debut, fin))
  } else return(NA)
}

# resolve a shortened url (e.g. t.co) by reading the 'location' header
# of the redirect, without actually following it
unshorten <- function(url){
  uri <- getURL(url, header = TRUE, nobody = TRUE, followlocation = FALSE,
                cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
  res <- try(extraire(uri, "\r\nlocation: (.*?)\r\nserver"))
  return(res)
}
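
Just to illustrate (the shortened code below is made up, so do not expect this exact link to resolve),

# hypothetical example: the t.co code is made up for illustration
unshorten("http://t.co/AbCd1234")
# should return the expanded url hidden in the 'location' header,
# e.g. "http://some.blog/some-post"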

Now, let us consider the following code. The first step, of course, is to run some lines that will allow me to use Twitter's API,

require(twitteR)

# endpoints and credentials for Twitter's OAuth handshake
reqURL    <- "https://api.twitter.com/oauth/request_token"
accessURL <- "https://api.twitter.com/oauth/access_token"
authURL   <- "https://api.twitter.com/oauth/authorize"
apiKey    <- "yourAPIkey"
apiSecret <- "yourAPIsecret"

twitCred <- OAuthFactory$new(consumerKey = apiKey, consumerSecret = apiSecret,
                             requestURL = reqURL, accessURL = accessURL,
                             authURL = authURL)

# opens the authorization page and asks for the PIN
twitCred$handshake(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))

registerTwitterOAuth(twitCred)

Then, I need to be cautious because some of my tweets are in French, and some weird symbols might appear,

Sys.setlocale("LC_CTYPE","fr_FR.UTF-8")
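
If some weird symbols remain, a small helper (my suggestion, not part of the original procedure) can force the tweets to UTF-8, replacing any byte that cannot be converted,

# suggestion, not in the original procedure: force text to UTF-8,
# replacing unconvertible bytes by their hex codes
clean_text <- function(x) iconv(x, from = "", to = "UTF-8", sub = "byte")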

Now I can write my function,

somewhere_else <- function(){

  # grab my most recent tweets (the API caps the number actually returned)
  tweets_freak <- searchTwitter("from:@freakonometrics", n = 500)
  save(tweets_freak, file = "somewhere_else.RData")

  # flatten the list of status objects and keep only the text
  tweets_freak_df <- do.call("rbind", lapply(tweets_freak, as.data.frame))
  text_tweets_freak <- tweets_freak_df$text

  # drop replies, i.e. tweets starting with "@"
  tweets_freak_message <- text_tweets_freak[which(substr(text_tweets_freak, 1, 1) != "@")]

  # keep only tweets posted since the previous "Somewhere else" post
  SE <- which(substr(tweets_freak_message, 1, 15) == "\"Somewhere else")
  first_SE <- SE[1]
  tweets_freak <- tweets_freak_message[1:(first_SE - 1)]

  substitute_id <- function(x){
    # replace each "@" by "http://twitter.com/", so that mentions become
    # urls too, then split the tweet on "http"
    split_x <- strsplit(x, "@")[[1]]
    x_id <- paste(split_x, collapse = "http://twitter.com/", sep = "")
    split_x_id <- strsplit(x_id, "http")
    n <- length(split_x_id[[1]])
    tweet_x <- strsplit(split_x_id[[1]], " ")

    if(n == 1) rt <- x_id  # no link and no mention: keep the tweet as is
    if(n > 1){
      for(i in 2:n){
        url <- tweet_x[[i]][1]
        # strip a trailing punctuation mark, if any, before unshortening
        split <- FALSE
        if(substr(url, nchar(url), nchar(url)) %in% c(":", ",", ";", ")", "(")) split <- TRUE
        if(split == FALSE) unshort_url <- unshorten(paste("http", url, sep = ""))
        if(split == TRUE)  unshort_url <- unshorten(paste("http", substr(url, 1, nchar(url) - 1), sep = ""))
        # is it a (reconstructed) twitter.com link, i.e. a mention?
        tweet <- FALSE
        if(substr(url, 4, 10) == "twitter") tweet <- TRUE
        # wrap urls in <a> tags; for mentions, display "@username" instead
        if((split == FALSE) & (tweet == FALSE)) tweet_x_2 <- c("<a href=\"", unshort_url, "\">", unshort_url, "</a>")
        if((split == TRUE)  & (tweet == FALSE)) tweet_x_2 <- c("<a href=\"", unshort_url, "\">", unshort_url, "</a>", substr(url, nchar(url), nchar(url)))
        if((split == FALSE) & (tweet == TRUE))  tweet_x_2 <- c("<a href=\"", unshort_url, "\">@", substr(unshort_url, 21, nchar(unshort_url)), "</a>")
        if((split == TRUE)  & (tweet == TRUE))  tweet_x_2 <- c("<a href=\"", unshort_url, "\">@", substr(unshort_url, 21, nchar(unshort_url)), "</a>", substr(url, nchar(url), nchar(url)))
        tweet_x[[i]] <- c(tweet_x_2, tweet_x[[i]][-1])
      }
      rt <- paste("<li>", paste(unlist(tweet_x), collapse = " "), "</li>", sep = "")
    }
    return(rt)
  }

  # apply the substitution to each tweet and dump the html list to a file
  tweets_freak_sub <- lapply(tweets_freak, substitute_id)
  write.table(unlist(tweets_freak_sub), file = "tweets_somewhere_else.txt",
              quote = FALSE, row.names = FALSE)

  cat("Number of tweets.....", length(tweets_freak_sub), "\n")
  cat("File.................", paste(getwd(), "tweets_somewhere_else.txt", sep = "/"), "\n")
  cat("Done\n")
}

The first tricky part was to recognize the names mentioned in my tweets (since some of them are retweets). The second one was to create an HTML link each time there is a link (I did not take hashtags into account here).
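
Just to make the transformation more concrete, here is what substitute_id is supposed to do on a made-up tweet (the handle, the shortened link, and the resolved url below are all hypothetical),

# hypothetical input: the handle and the t.co code are made up
x <- "RT @someuser an interesting paper on copulas http://t.co/AbCd123"
# substitute_id(x) should then return something like
# <li> RT <a href="https://twitter.com/someuser">@someuser</a> an
# interesting paper on copulas
# <a href="http://some.blog/copulas">http://some.blog/copulas</a> </li>

If I run it, I get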

> somewhere_else()
Number of tweets..... 72 
File.... /home/arthur/tweets_somewhere_else.txt 
Done
Warning message:
In doRppAPICall("search/tweets", n, params = params, retryOnRateLimit = retryOnRateLimit,  :
  500 tweets were requested but the API can only return 191

If I make a copy and paste from the text file, I get the list of my most recent links, rendered as an HTML list, which makes sense, because those are indeed my most recent posts. I will still have to spend some time to include pictures, graphs, maps, videos, etc., but that function should save me some time!
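
For the pictures, a very first step (again, a rough sketch, not something the function above uses) could be to extract the pic.twitter.com links with a base R regular expression, to flag the tweets whose picture has to be added by hand,

# rough sketch: extract pic.twitter.com links from a tweet's text
# (those links point to the tweet page, not to the image file itself)
extract_pics <- function(x){
  m <- gregexpr("pic\\.twitter\\.com/[A-Za-z0-9]+", x)
  regmatches(x, m)[[1]]
}
# e.g. extract_pics("nice map pic.twitter.com/AbCd123")
# returns "pic.twitter.com/AbCd123"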


Published at DZone with permission of Arthur Charpentier, DZone MVB.