Who’s the most popular Game of Thrones character? Which Game of Thrones family rules the social scene? How bloodthirsty are viewers of the show? Big data tells all. I used the ELK Stack to analyze tweets about Game of Thrones to see what viewers are actually saying and feeling about the series (sneak peek: Ramsay Bolton is not as hated as you’d expect!).
The ELK Stack (Elasticsearch, Logstash and Kibana) is quickly becoming the world’s most popular open source log analysis platform. A lesser-known fact is that the stack can be used for a whole lot more than log analysis. While managing, analyzing and visualizing logs is the main use case, ELK is increasingly being used for business intelligence, SEO, and log-driven development as well.
This article demonstrates one such use case by showing how to use the Twitter API to stream Twitter data into the ELK Stack hosted by Logz.io using Fluentd, an open source data collector. I chose Logz.io for convenience, but if you’re using the open source ELK, you can follow the same workflow outlined here with some configuration tweaks.
Step 1: Creating a Twitter App
To establish a connection with Twitter and extract data, we will need Twitter API keys. To get your hands on these keys, you will first need to create a Twitter app. Go to the Twitter apps page, and create a new app.
You will need to enter a name, description and website URL for the app. Don’t worry about the particulars; your entries here will not affect how the data is shipped into Elasticsearch.
Once created, open the app’s Keys and Access Tokens tab, and click the button at the bottom of the page to generate a new access token.
Keep this page open in your browser, as we will need the keys and tokens shown there when setting up the feed in Fluentd.
Step 2: Installing Fluentd
Fluentd is an open source data collector developed at Treasure Data that acts as a unified logging layer between input sources and output services.
Fluentd is easy to install, has a light footprint and has a fully pluggable architecture. In the world of ELK, Fluentd acts as a log collector — aggregating logs, parsing them, and forwarding them on to Elasticsearch. As such, Fluentd is often compared to Logstash, which has similar traits and functions (here’s a detailed comparison between the two).
The stable distribution package of Fluentd is called ‘td-agent’. To install it, use this cURL command (this command is for Ubuntu 14.04; if you’re using a different Linux distribution, see the Fluentd installation documentation for the matching script):
$ curl -L https://toolbelt.treasuredata.com/sh/install-ubuntu-trusty-td-agent2.sh | sh
The command will automatically install Fluentd and start the daemon. To make sure all is running as expected, run:
$ sudo /etc/init.d/td-agent status
If all is OK, you should get this output:
* td-agent is running
Step 3: Installing the Logz.io and Twitter Plugins
Our next step is to install the Logz.io and Twitter plugins for Fluentd using the gem command bundled with td-agent.
To install the Logz.io plugin, run:
$ sudo /opt/td-agent/usr/sbin/td-agent-gem install fluent-plugin-logzio
Next, to install the Twitter plugin for Fluentd, run the following commands:
$ sudo apt-get install build-essential
$ sudo /usr/sbin/td-agent-gem install fluent-plugin-twitter
$ sudo /etc/init.d/td-agent restart
Restarting td-agent: * td-agent
Step 4: Configuring Fluentd
The next step is to configure Fluentd to forward the Twitter data into the ELK Stack hosted by Logz.io. Open the Fluentd configuration file:
$ sudo vi /etc/td-agent/td-agent.conf
Remove the existing configuration, and add the following source (enter your Twitter API keys):
<source>
  @type twitter
  consumer_key <Twitter consumer key>
  consumer_secret <Twitter consumer secret>
  oauth_token <Twitter access token>
  oauth_token_secret <Twitter access token secret>
  tag input.twitter
  keyword 'GameOfThrones,Stark,Lannister,Arryn,Greyjoy,Baratheon,Bolton,Targaryen,Tully,Martell,Tyrell,Frey'
  timeline tracking
  output_format flat
</source>
I decided to track mentions of the Great Families in Game of Thrones, but you can of course track any keywords you like. Just make sure you don’t break the syntax, as the Fluentd configuration file is case sensitive.
Spoiler tip! If you’re a Game of Thrones fan, DO NOT use these keywords if you haven’t seen the latest episode!
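To get a feel for which tweets a keyword list like the one above would capture, here is a rough Python sketch of keyword tracking. It assumes Twitter's documented track behavior for single-word keywords (commas act as OR, and matching ignores case and works on whole tokens). The function names and the regex tokenizer are illustrative only; they are not part of the Fluentd plugin or the Twitter API, and multi-word phrases are not handled.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (a simplification)."""
    return set(re.findall(r"\w+", text.lower()))

def matched_keywords(text, keyword_config):
    """Return the tracked keywords found in text.

    keyword_config is a comma-separated string, as in the Fluentd source block.
    Only single-word keywords are supported in this sketch.
    """
    tokens = tokenize(text)
    keywords = [k.strip() for k in keyword_config.split(",") if k.strip()]
    return [k for k in keywords if k.lower() in tokens]

# Hypothetical sample tweets, for illustration only
tracked = "GameOfThrones,Stark,Lannister"
print(matched_keywords("A #Lannister always pays his debts", tracked))
print(matched_keywords("Starkly different opinions", tracked))
```

The second sample matches nothing because whole-token matching does not treat "Starkly" as a mention of "Stark", which is worth keeping in mind when choosing short keywords.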
Next, we’re going to define Logz.io as a “match” (the Fluentd term for an output destination):
<match **.**>
  type logzio_buffered
  endpoint_url https://listener.logz.io:8071?token=<token>&type=twitter
  output_include_time true
  output_include_tags true
  buffer_type file
  buffer_path /tmp/buffer
  flush_interval 10s
  buffer_chunk_limit 1m
</match>
Fine-tune this configuration as follows:
Token - replace the token placeholder with your Logz.io token (it can be found in the Logz.io Settings section).
Path to buffer - enter a path to a folder in your file system that you have full permissions for (e.g. /tmp/buffer). The buffer file helps to aggregate logs together and ship them in bulk.
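The idea of aggregating records and shipping them in bulk can be pictured with a toy model in Python. This is only a sketch of size- and time-based flushing, not Fluentd's actual implementation (which flushes on its own timer and persists chunks to the buffer path). `ChunkBuffer`, `ship`, and the explicit `now` argument are hypothetical names introduced for illustration.

```python
import json

class ChunkBuffer:
    """Toy buffer: collects records and ships them as one newline-delimited payload."""

    def __init__(self, chunk_limit_bytes, flush_interval_s, ship):
        self.chunk_limit_bytes = chunk_limit_bytes  # loosely mirrors buffer_chunk_limit
        self.flush_interval_s = flush_interval_s    # loosely mirrors flush_interval
        self.ship = ship                            # callable that receives the bulk payload
        self.lines = []
        self.size = 0
        self.last_flush = 0.0

    def append(self, record, now):
        """Add a record; flush if the chunk is full or the interval has elapsed."""
        line = json.dumps(record)
        self.lines.append(line)
        self.size += len(line.encode("utf-8")) + 1  # +1 for the newline separator
        too_big = self.size >= self.chunk_limit_bytes
        too_old = now - self.last_flush >= self.flush_interval_s
        if too_big or too_old:
            self.flush(now)

    def flush(self, now):
        """Ship everything collected so far in a single call, then reset."""
        if self.lines:
            self.ship("\n".join(self.lines))
        self.lines, self.size, self.last_flush = [], 0, now

# Usage: three records arrive over ten seconds and leave as one bulk payload
shipped = []
buf = ChunkBuffer(1024 * 1024, 10, shipped.append)
buf.append({"text": "a"}, now=1.0)
buf.append({"text": "b"}, now=5.0)
buf.append({"text": "c"}, now=10.0)
print(len(shipped))  # 1
```

Passing the clock in as `now` keeps the sketch deterministic and easy to experiment with.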
If you’re using the open source ELK Stack, your ‘match’ configuration in the Fluentd configuration file would look something like this (the elasticsearch output type is provided by the fluent-plugin-elasticsearch plugin):
<match **>
  @type elasticsearch
  logstash_format true
  host <hostname>
  port 9200
</match>
To apply the new configuration, restart Fluentd:
$ sudo /etc/init.d/td-agent restart
Restarting td-agent: * td-agent
Step 5: Analyzing the Data
Open Kibana (integrated in this case within the Logz.io user interface), and you should see a stream of data arriving from Twitter. The first message informs you that the streaming API has started tracking your selected keywords.
Given enough time, the feed will grow to provide you with a great source of information to pull from.
It’s important to point out that the data ingested into ELK via the Twitter API is not always complete. Some fields have null values, and the values of others depend on how the original tweets were composed. For example, the ‘coordinates.coordinates’ field is populated only for tweets from users who enabled Twitter’s location feature.
To analyze the data, start by adding some fields to the message list. Useful fields to add are the ‘text’ field (the actual tweet text), the ‘user_location’ field (the location defined in the user profile), and the ‘user_followers_count’ field (the number of followers, which can be useful for measuring reach and impact).
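If you export tweet records for offline analysis, the same three fields can be handled with a few lines of Python. The sketch below assumes each record is a flat dictionary using the field names mentioned above. The sample records, the "unknown" default, and the function names are made up for illustration.

```python
def summarize(tweet):
    """Pick the three fields discussed above, tolerating missing or null values."""
    return {
        "text": tweet.get("text") or "",
        "user_location": tweet.get("user_location") or "unknown",
        "user_followers_count": tweet.get("user_followers_count") or 0,
    }

def total_reach(tweets):
    """Sum follower counts as a crude upper bound on potential reach."""
    return sum(summarize(t)["user_followers_count"] for t in tweets)

# Hypothetical records: one has a null location, the other lacks some fields entirely
sample = [
    {"text": "Winter is coming", "user_location": None, "user_followers_count": 120},
    {"text": "Hodor", "user_followers_count": None},
]
print(total_reach(sample))  # 120
```

Summing followers counts the same person more than once when audiences overlap, so the result is best treated as a rough ceiling rather than a true reach figure.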
Query Elasticsearch for the information you’re interested in. For example, you can use a regex within a field-level search to look for a specific string within the ‘text’ field.
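As a minimal sketch of such a search, the Python below builds an Elasticsearch `query_string` request body that wraps a regex in forward slashes, which is the Lucene query syntax Kibana's search bar also accepts. The pattern `targ.*` and the helper name are examples only. Keep in mind that the regex is matched against indexed terms, which for an analyzed text field are typically individual lowercased words, so results depend on your mapping.

```python
import json

def regex_field_query(field, pattern):
    """Build a query_string request body that runs a regex against one field."""
    escaped = pattern.replace("/", "\\/")  # forward slashes delimit the regex
    return {"query": {"query_string": {"query": f"{field}:/{escaped}/"}}}

# The "query" string printed here (text:/targ.*/) can also be typed into Kibana's search bar
print(json.dumps(regex_field_query("text", "targ.*"), indent=2))
```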
Step 6: Visualizing the Data
Next, let’s visualize the data to see trends and analyze correlations.
Great Family Mentions
As an example, we’re going to create a new pie chart visualization that shows how many times each Great Family is mentioned.
Open the Visualize tab in Kibana, and select the Pie Chart visualization type. As the search source for the visualization, select From a new search.
The configuration for this visualization is pretty straightforward: it uses filters containing the various family names to cross reference the entire pool of data.
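Under the hood, a filter-based pie chart corresponds roughly to an Elasticsearch `filters` aggregation with one named bucket per family. The Python sketch below builds such a request body. It assumes Elasticsearch 2.x or later (where queries can be used directly as filters) and that the tweet text lives in a field called `text`. The function name and the `families` bucket name are illustrative, and this is not necessarily the exact request Kibana generates.

```python
import json

def family_mentions_agg(families, field="text"):
    """Build a search body that counts documents matching each family name."""
    filters = {name: {"match": {field: name}} for name in families}
    return {
        "size": 0,  # only the bucket counts are needed, not the matching tweets
        "aggs": {"families": {"filters": {"filters": filters}}},
    }

body = family_mentions_agg(["Stark", "Lannister", "Targaryen"])
print(json.dumps(body, indent=2))
```

Each bucket in the response would carry a `doc_count`, which is the number the pie chart slices are drawn from.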
Great Family Mentions Over Time
Another interesting way to identify trends is to visualize mentions over time. To do this, we’re going to select the Line Chart visualization type.
Note that the X axis in this case contains both a timestamp aggregation and a Split Lines bucket using the family names as filters.
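The combination of a timestamp aggregation and per-family filters can be sketched as a `date_histogram` with a nested `filters` aggregation. The Python below builds one possible request body. The `@timestamp` field name, the one-hour bucket size, and the function name are assumptions for illustration. The interval parameter is named `fixed_interval` in recent Elasticsearch versions and `interval` in older ones, so it is left configurable here.

```python
import json

def mentions_over_time_agg(families, interval="1h", field="text",
                           time_field="@timestamp", interval_key="fixed_interval"):
    """Build a search body that buckets tweets by time, then by family name."""
    filters = {name: {"match": {field: name}} for name in families}
    return {
        "size": 0,
        "aggs": {
            "over_time": {
                # One bucket per time interval (the X axis)
                "date_histogram": {"field": time_field, interval_key: interval},
                # Inside each time bucket, one sub-bucket per family (the split lines)
                "aggs": {"families": {"filters": {"filters": filters}}},
            }
        },
    }

print(json.dumps(mentions_over_time_agg(["Stark", "Bolton"]), indent=2))
```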
Which character is mentioned the most? To determine this, we’re going to create a simple pie chart using the entire pool of tweets as a base, and filters with the names of the leading characters in the series.
Putting It All Together
These are just simple examples of what can be done with your Twitter data in Kibana. Once you have a series of visualizations, open the Dashboard tab and create your own dashboard to get a comprehensive view of your data.
Big data analysis is one of the biggest technological trends of our time. More and more tools are being introduced that make it possible to consume huge data sets and draw inferences from them. The open source ELK Stack is one such tool, and this article shows a creative example of how the stack can be used to consume and analyze any type of data.