Streaming Data From Twitter for Analysis in Spark
'Tis the season of NFL football, and one way to capture excitement is Twitter data. I did this using StreamSets Data Collector.
Happy New Year! Our first blog entry of 2018 is a guest post from Josh Janzen, a data scientist based in Minnesota. Josh wanted to ingest tweets referencing NFL games into Spark, then run some analysis to look for a correlation between Twitter activity and game winners. Josh originally posted this entry on his personal blog and kindly allowed us to repost it. Over to you, Josh.
'Tis the season of NFL football, and one way to capture excitement is Twitter data. I've tinkered around with Twitter's Developer API before, but this time I wanted to use a streaming product I've heard good things about: StreamSets Data Collector.
Once I had the semi-raw tweet data, I wanted to analyze it using Spark. I chose Spark because the distributed nature of the RDD is well suited to large amounts of data (and I wasn't sure how much I'd be getting).
My idea was to do a count of tweets for a particular team/game and see if the volume of tweets would predict whether that team actually wins or loses the game.
Data Collection Process
I have done a little work with the Twitter Developer API in the past, which I had used from Python to parse the tweets as they arrived. I found this process very simple, so I was a little apprehensive about bringing StreamSets into the mix. Still, getting to know a scalable ETL and streaming tool like StreamSets seemed like a good idea.
To get started, I did some Google searches on streaming Twitter data with StreamSets and found a very well put-together tutorial. It looked promising, so I felt confident enough to download the StreamSets application on my Mac and install it. It was a 145 MB ZIP download that extracted as a Java project.
After starting it from the terminal, I was able to connect to it on localhost through my web browser, which I appreciated.
To connect to the Twitter API via the StreamSets HTTP Client origin, I had to define the Resource URL. Instead of getting all the tweets available, I decided to filter for only tweets with "nfl" in the tweet text or a hashtag. Also note that the Twitter API returns a randomly sampled real-time subset of tweets, not every tweet. I was planning on doing all other filtering and counting in Spark later, but I'm sure some more of that ETL could also have been done in StreamSets.
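As a rough illustration of what such a Resource URL can look like, here is a small Python sketch that builds a filtered-stream URL. The post does not say which URL was actually entered, so the v1.1 statuses/filter endpoint and the track parameter below are assumptions, and build_resource_url is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Assumption: the v1.1 filtered-stream endpoint; the original post does not
# say which Resource URL was entered in the HTTP Client origin.
STREAM_ENDPOINT = "https://stream.twitter.com/1.1/statuses/filter.json"


def build_resource_url(keyword):
    """Return the endpoint with a `track` query parameter for one keyword."""
    return STREAM_ENDPOINT + "?" + urlencode({"track": keyword})


resource_url = build_resource_url("nfl")
print(resource_url)
```

Using urlencode rather than string concatenation keeps the URL valid if the keyword ever contains spaces or other reserved characters.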
As for the credentials to connect to Twitter, I had to enter four values: Consumer Key, Consumer Secret, Token, and Token Secret. At this point, as a test, I used the StreamSets UI to connect the HTTP Client origin to a Local FileSystem destination and ran the pipeline.
I reviewed a few lines of raw tweet output in a text editor and an online JSON viewer. I decided that I didn't need all the JSON fields, so I added a Field Remover processor to my pipeline between the HTTP Client and the Local FileSystem destination. I went with more fields rather than fewer, as I didn't know exactly what I'd need in Spark. The fields I decided to keep were: create date/time, userId, tweet text, username, user location, user timezone, hashtags, retweet status, retweet count, and location. After running, it looked good!
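To make the idea of trimming each record concrete, here is a plain-Python sketch of keeping only a whitelist of fields. It is not the StreamSets Field Remover itself. The key names assume Twitter's v1.1 tweet JSON (the post lists the kept fields only by description), and keep_only is a hypothetical helper.

```python
import json

# Assumption: top-level field names follow Twitter's v1.1 tweet JSON; the
# post lists the kept fields only by description, not by exact name.
KEEP_FIELDS = {"created_at", "text", "user", "entities",
               "retweeted", "retweet_count", "place"}


def keep_only(record, keep=KEEP_FIELDS):
    """Return a copy of the record containing only the whitelisted keys."""
    return {key: value for key, value in record.items() if key in keep}


# A made-up record standing in for one line of raw stream output.
raw_line = json.dumps({
    "created_at": "Sun Dec 03 18:05:12 +0000 2017",
    "text": "Kickoff! #nfl",
    "source": "web",          # not in the whitelist, so it is dropped
    "favorite_count": 0,      # not in the whitelist, so it is dropped
    "retweet_count": 3,
})
trimmed = keep_only(json.loads(raw_line))
print(sorted(trimmed))
```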
As it was NFL week 13 (Thursday 11/30 - Sunday 12/03), I decided to run the pipeline on the Thursday game as a test. I collected plenty of data (around 5k tweets) relating to the NFL for the three hours from 7 PM to 10 PM. I took this as a good sign that there would be plenty of data to capture on Sunday, which was the day I intended to analyze for the project.
[Figure: final pipeline diagram, HTTP Client to Field Remover to Local FileSystem]
On Sunday, 12/3, I started the pipeline at 11:59 AM and ran it until about 7:15 PM that day. Running during that window gave me the option of analyzing both the noon and 3 PM games.
After I stopped the pipeline, I had nine folders of data (one folder for each hour, which was the default setting in the StreamSets Local FileSystem destination; the first folder covered only one minute, and the last folder, representing 7 PM, was also very small). The size of all the Sunday tweets was about 52 MB.
Before diving into Spark, I wanted to get an idea of the number of tweets in my data for validation purposes. Using the terminal, I ran wc -l filename for the 12 PM and 3 PM hours. The line counts were 3,145 and 4,110. Since I had about seven full hours, I expected my data in Spark to contain about 20k - 25k tweets.
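For anyone without wc at hand, the same sanity check can be sketched in Python. This assumes one tweet per line, and count_lines is a hypothetical helper. The sample file is created on the fly only to make the snippet self-contained.

```python
import os
import tempfile


def count_lines(path):
    """Count lines in a file; matches `wc -l` when the file ends in a newline."""
    with open(path, "rb") as handle:
        return sum(1 for _ in handle)


with tempfile.TemporaryDirectory() as folder:
    # Stand-in for one hourly output file with three records.
    sample = os.path.join(folder, "tweets.txt")
    with open(sample, "w") as handle:
        handle.write('{"text": "a"}\n{"text": "b"}\n{"text": "c"}\n')
    line_count = count_lines(sample)

print(line_count)
```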
Spark Processing and Validation
I had the data on a local drive on the cluster, so I copied it to HDFS for Spark to access. After starting the Spark shell, I read the data using the HDFS path followed by /*. However, a count of the tweets came out very low. It turned out that I needed an additional /* to reach all the subdirectories. With that in place, a count on the RDD came out to 20,202, which agreed with the 20k - 25k tweets I had estimated from the wc -l check on my local machine.
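The effect of the extra /* can be illustrated on a local filesystem, where each /* descends exactly one directory level. This sketch uses Python's glob module with a made-up folder layout. It does not reproduce Spark's or HDFS's exact path handling, only the general idea that a pattern that is one level too shallow never reaches the data files.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as base:
    # Hypothetical layout: one folder per hour, each holding one output file.
    for hour in ("2017-12-03-12", "2017-12-03-13"):
        os.makedirs(os.path.join(base, hour))
        with open(os.path.join(base, hour, "tweets.txt"), "w") as handle:
            handle.write("{}\n")

    # One wildcard matches the hour folders; two wildcards match their contents.
    one_level = glob.glob(os.path.join(base, "*"))
    two_levels = glob.glob(os.path.join(base, "*", "*"))

    one_level_files = [p for p in one_level if os.path.isfile(p)]
    two_level_files = [p for p in two_levels if os.path.isfile(p)]

print(len(one_level_files), len(two_level_files))
```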
Next came what I was really after: counting the tweets posted during a game that mentioned a particular team. I decided to break the dataset into two RDDs. The first mapped each tweet to just its "hour." The second mapped each tweet to its "text."
The final data structure needed to combine the two RDDs so I could count across specific hours and tweets containing the team name. I decided on a tuple. Then I simply filtered the tuples by the hours of a game and the team name. For example, for the Vikings/Rams game (which started at noon), the filter would keep hours representing noon, 1 PM, and 2 PM, and tweet text containing "vikings" or "rams."
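The hour/text tuple and the filter can be sketched in plain Python. This is not the Spark job from the post: it builds the tuples in a single pass rather than combining two RDDs, it assumes the v1.1 created_at timestamp format, and it uses the hour exactly as it appears in the timestamp, with no time zone conversion. The helper names and sample tweets are made up.

```python
from datetime import datetime

# Assumption: timestamps look like Twitter's v1.1 `created_at` field.
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def to_hour_text(tweet):
    """Map a tweet dict to an (hour, lowercased text) tuple."""
    created = datetime.strptime(tweet["created_at"], TWITTER_TIME_FORMAT)
    return (created.hour, tweet["text"].lower())


def count_mentions(tweets, hours, team):
    """Count tweets whose hour is in `hours` and whose text contains `team`."""
    pairs = map(to_hour_text, tweets)
    return sum(1 for hour, text in pairs if hour in hours and team in text)


sample_tweets = [
    {"created_at": "Sun Dec 03 12:05:00 +0000 2017", "text": "Go Vikings! #nfl"},
    {"created_at": "Sun Dec 03 13:30:00 +0000 2017", "text": "Rams looking sharp #nfl"},
    {"created_at": "Sun Dec 03 14:59:00 +0000 2017", "text": "vikings defense #NFL"},
    {"created_at": "Sun Dec 03 15:10:00 +0000 2017", "text": "Vikings win! #nfl"},
]

game_hours = {12, 13, 14}  # noon, 1 PM, 2 PM
vikings_count = count_mentions(sample_tweets, game_hours, "vikings")
rams_count = count_mentions(sample_tweets, game_hours, "rams")
print(vikings_count, rams_count)
```

Lowercasing the text before matching means "Vikings" and "vikings" are counted alike. The last sample tweet is excluded because its hour falls outside the game window.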
I had to repeat this process for each team in the noon games, of which there were seven. At this point, I decided to build a JAR and submit the job via spark-submit. The shell script that ran the JAR on the cluster took three inputs: the input data location, the output data location, and the team name. This sped up gathering the tweet count for each team, as I only had to update the team name in the shell script and run it right from the cluster.
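The three inputs could be handled along these lines. This is a Python sketch of the argument handling only, not the JAR or shell script from the post. The argument names, the parse_job_args helper, and the example paths are all hypothetical.

```python
import argparse


def parse_job_args(argv):
    """Parse the three job inputs: input location, output location, team name."""
    parser = argparse.ArgumentParser(description="Count tweets mentioning a team")
    parser.add_argument("input_path", help="location of the input data")
    parser.add_argument("output_path", help="location for the output")
    parser.add_argument("team", help="team name to search for")
    args = parser.parse_args(argv)
    args.team = args.team.lower()  # match against lowercased tweet text
    return args


# Hypothetical paths; only the team name changes between runs.
job = parse_job_args(["/data/tweets/*/*", "/data/out/vikings", "Vikings"])
print(job.input_path, job.output_path, job.team)
```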
I was making the assumption that the noon game would run from noon to 3 PM.
I used text editors for writing my code. On my Windows machine, it was Notepad++.
Of the seven games played at noon, the winner had more tweets than the loser in four. I don't think that is significant enough to say the tweet activity predicted the outcome, but it is interesting nonetheless.
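A quick way to see why four of seven is not convincing is to ask how often a coin flip would do at least as well. The sketch below computes that one-sided binomial probability, under the assumption that each game is an independent 50/50 guess. It is a back-of-the-envelope check, not part of the original analysis.

```python
from math import comb


def prob_at_least(successes, trials, p=0.5):
    """Probability of at least `successes` hits in `trials` independent tries."""
    return sum(comb(trials, k) * p**k * (1 - p)**(trials - k)
               for k in range(successes, trials + 1))


p_value = prob_at_least(4, 7)
print(p_value)  # prints 0.5
```

In other words, pure guessing would match or beat four correct picks out of seven half of the time.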
I have used a Hadoop cluster many times over the past three years. From a data science perspective, it's really not the greatest tool, due to the effort needed to move data and the lack of built-in statistical and visualization tools. Going forward, for similar work, I would look into something like Cloudera's Data Science Workbench. However, I'm a firm believer in knowing how to perform all functions through the command line, so this project further enhanced my skill set.
Published at DZone with permission of Josh Janzen, DZone MVB. See the original article here.