
Analyzing Twitter Data With Apache Storm, Power BI, and Azure SQL Server


Apache Storm is a reliable way to process unbounded streams of data. Learn how to use it to categorize Twitter data as negative, neutral, or positive.


When we need to process streams of real-time data, Storm is a great contender. Examples of streaming data include consumer clicks and navigation on a website, IIS or user logs, IoT data, and social network activity. All of these scenarios call for real-time data processing, and Apache Storm can process unbounded streams of data in real time.

The term "unbounded" describes streams of data with no start or end; the processing is continuous and happens in real time. Twitter is a good example: Twitter data is continuous, has no start or end time, and is produced in real time by millions of Twitter users around the world.

Apache Storm is an open-source, distributed, fault-tolerant system, but since we are using it as part of an Azure HDInsight cluster, there are charges for usage. A Storm cluster is organized as master and slaves: the master is the Nimbus node, the slaves are the Supervisor nodes, and ZooKeeper coordinates the distributed processes. The Nimbus node uploads computations for execution, distributes work across the cluster, and launches and monitors the workers.

ZooKeeper coordinates the Storm cluster, as depicted in Figure 1.1. The Supervisor communicates with Nimbus through ZooKeeper and starts and stops workers based on signals from Nimbus. Each worker process running on a Supervisor node is isolated and has its own Java Virtual Machine. For .NET components, an SCPHost.exe process is created per executor and runs its tasks on its own threads. To run a C# topology on a Linux Storm cluster, the project must target .NET Framework 4.5 or later and use version 0.10.0.6 or later of the Microsoft.SCP.Net.SDK NuGet package.

Figure 1.1: Apache Storm cluster nodes.

A Storm topology consumes streams of data and processes them in complex ways, repartitioning the streams between stages. The topology consists of spouts and bolts. Spouts emit the Twitter data as tuples to bolts.

Figure 1.2: Sample Storm topology.

Tuples can be emitted to a single bolt or to multiple bolts, as shown in Figure 1.2, and a single bolt can receive data from multiple spouts as well. A bolt can aggregate, filter, sort, or log the data, and it can pass tuples on to downstream bolts for further processing. With this background on Storm, we can now look at the implementation of Twitter sentiment analysis.
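
To make the bolt's role concrete, here is a minimal, hypothetical SCP.NET bolt sketch. The class name and the logging are illustrative only; the project's real bolts, TweetRankBolt and AzureSqlBolt, appear later in this article.

using System;
using System.Collections.Generic;
using Microsoft.SCP;

// Hypothetical bolt that simply logs every tuple it receives from an upstream spout.
public class LogBolt : ISCPBolt
{
    private readonly Context context;

    public LogBolt(Context context)
    {
        this.context = context;
    }

    // Factory method used when the bolt is registered with the TopologyBuilder.
    public static LogBolt Get(Context context, Dictionary<string, Object> parms)
    {
        return new LogBolt(context);
    }

    // Called once per incoming tuple; a real bolt would aggregate, filter,
    // sort, or forward the data to a downstream bolt here.
    public void Execute(SCPTuple tuple)
    {
        Context.Logger.Info("Received tuple: " + tuple.GetValue(0));
    }
}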

Provisioning HDInsight Storm Cluster

Log into the Microsoft Azure portal. Click New, and in the Data + Analytics menu, click HDInsight.

In the New HDInsight Cluster section, enter the following settings, and then click Create:

  • Cluster name: Enter a unique name (and write it down)

  • Subscription: Select your Azure subscription

  • Cluster type: Storm

  • Cluster OS: Linux

  • Version: Choose the latest version of Storm available

  • Cluster tier: Standard

  • Credentials: Configure the following settings:

    • Cluster login username: Enter a username of your choice (note it)

    • Cluster login password: Enter and confirm a strong password (note it)

    • Remote desktop username: Enter another username of your choice (note it)

    • Remote desktop password: Use the same password as the cluster login password

  • Resource group: Create a new resource group (note it)

  • Storage: Select the existing storage account you created in the first procedure and set the default container to the cluster name you specified above.

  • Applications: None

  • Cluster size: Configure the following settings:

    • Number of supervisor nodes: 2

    • Supervisor node size: Leave the default size selected

    • Nimbus node size: Leave the default size selected

    • ZooKeeper node size: Leave the default size selected

  • Advanced settings: None

Click Notifications to confirm the start of deployment. It takes roughly 20-30 minutes to create a cluster, so move along to the next step.

Note: Your Azure subscription will be charged as soon as the HDInsight cluster is running. When you are done, delete the resource group created for this project to avoid further charges.

Provision an Azure SQL Server

Log into the Microsoft Azure portal. Click New, and in the Databases menu, click SQL Database. Enter the following settings:

  • Database name: A unique name (note it)

  • Resource group: Use the existing resource group previously created

  • Select source: Blank database

  • Server: Create a new server with admin username and password (note it)

  • Want to use elastic pool: No

  • Pricing tier: Standard S2: 50 DTU and 200GB

Click Create. Once the Azure SQL server is created, open the SQL database blade, click Set server firewall, and make sure Allow access to Azure services is on. This allows the HDInsight Storm cluster to access the SQL database table.

Create a table to store Twitter data in the Azure SQL database:

CREATE TABLE Tweets(
    [Id] NVARCHAR(256) PRIMARY KEY,
    [Text] NVARCHAR(256),
    [RetweetCount] BIGINT,
    [FavoriteCount] BIGINT,
    [Score] BIGINT,
    [Sentimenttype] INT,
    [Createddate] DATETIME,
    [TimeStamp] DATETIME);
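
Later, AzureSqlBolt writes the ranked tweets into this table. The sketch below shows what a single-row insert could look like with ADO.NET; the connection-string placeholders and the TweetWriter helper are illustrative, not the project's actual code.

using System;
using System.Data.SqlClient;

// Illustrative helper that inserts one row into the Tweets table created above.
public static class TweetWriter
{
    // Placeholder connection string; substitute your server, database, and credentials.
    private const string ConnectionString =
        "Server=tcp:<your-server>.database.windows.net,1433;Database=<your-database>;" +
        "User ID=<admin-user>;Password=<password>;Encrypt=True;";

    public static void Save(string id, string text, long retweetCount, long favoriteCount,
                            long score, int sentimentType, DateTime createdDate)
    {
        const string sql =
            @"INSERT INTO Tweets ([Id], [Text], [RetweetCount], [FavoriteCount],
                                  [Score], [Sentimenttype], [Createddate], [TimeStamp])
              VALUES (@Id, @Text, @RetweetCount, @FavoriteCount,
                      @Score, @Sentimenttype, @Createddate, @TimeStamp)";

        using (var connection = new SqlConnection(ConnectionString))
        using (var command = new SqlCommand(sql, connection))
        {
            command.Parameters.AddWithValue("@Id", id);
            command.Parameters.AddWithValue("@Text", text);
            command.Parameters.AddWithValue("@RetweetCount", retweetCount);
            command.Parameters.AddWithValue("@FavoriteCount", favoriteCount);
            command.Parameters.AddWithValue("@Score", score);
            command.Parameters.AddWithValue("@Sentimenttype", sentimentType);
            command.Parameters.AddWithValue("@Createddate", createdDate);
            command.Parameters.AddWithValue("@TimeStamp", DateTime.UtcNow);

            connection.Open();
            command.ExecuteNonQuery();
        }
    }
}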

Next, create a Twitter app. Once the application is created, navigate to Keys and Access tokens to find details like the consumer key, consumer secret, access token, and access token secret, as shown in Figure 4.1. Note these details for the next step.

Figure 4.1: Twitter app—keys and access token.
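
These four values are what the spout needs to authenticate against Twitter. A minimal sketch of a settings class to hold them is shown below; the property names match how the spout's TweetStream() method (shown later) uses them, but the real project may load the values from configuration instead.

// Simple holder for the credentials noted above; TweetStream() reads these properties.
public class AuthInfo
{
    public string ConsumerKey { get; set; }
    public string ConsumerSecret { get; set; }
    public string AccessToken { get; set; }
    public string AccessTokenSecret { get; set; }
}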

Next, create a Storm topology application in Visual Studio, making sure to select the latest .NET Framework (4.6 or above).

Figure 5.1: Storm application in Visual Studio.

Now update the Microsoft.SCP.Net.SDK NuGet package to v1.0.0.3 and install the TweetinviAPI NuGet package, version 1.2.0.1 or later. Tweetinvi is a C# library that makes it easy to access the Twitter REST and streaming APIs. The project contains one spout and two bolts. TwitterReaderSpout spawns a thread that pulls Twitter data through Tweetinvi, using the keys and tokens noted in the previous step, as shown below:

private void TweetStream(AuthInfo auth)
{
    // Authenticate with the consumer key/secret and access token/secret
    // noted when the Twitter app was created.
    Auth.SetUserCredentials(auth.ConsumerKey, auth.ConsumerSecret,
                            auth.AccessToken, auth.AccessTokenSecret);

    // Open a sample stream and keep only English tweets.
    var stream = Tweetinvi.Stream.CreateSampleStream();
    stream.AddTweetLanguageFilter(LanguageFilter.English);

    // Queue the original tweet behind each retweet; NextTuple() drains this queue.
    stream.TweetReceived += (s, e) =>
    {
        if (e.Tweet.IsRetweet)
            queue.Enqueue(e.Tweet.RetweetedTweet);
    };

    stream.StartStream();
}
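
The spout bridges Tweetinvi's push-based events into Storm's pull-based NextTuple() loop through a shared queue, and it starts the blocking stream on a background thread. A hypothetical sketch of that wiring is shown below; the actual TwitterReaderSpout on GitHub may structure this differently.

using System.Collections.Generic;
using System.Threading.Tasks;
using Tweetinvi.Models; // ITweet (older Tweetinvi versions keep this interface in another namespace)

public partial class TwitterReaderSpout
{
    // Shared buffer between the streaming thread and NextTuple(); the real project
    // may guard it with a lock or use a concurrent collection instead.
    private readonly Queue<ITweet> queue = new Queue<ITweet>();

    private void StartStreaming(AuthInfo auth)
    {
        // StartStream() blocks, so run it on a background thread and let
        // NextTuple() drain the queue on Storm's worker thread.
        Task.Run(() => TweetStream(auth));
    }
}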

The NextTuple() function in TwitterReaderSpout wraps each tweet in a serializable TwitterFeed object and emits it to the bolts:

// Dequeue the next tweet gathered by the streaming thread and cache it
// against its sequence id so it can be replayed if the tuple fails.
var tweet = queue.Dequeue();
cache.Add(seqId, tweet);
// Emit the tweet, wrapped in a TwitterFeed, on the default stream.
this.context.Emit(Constants.DEFAULT_STREAM_ID,
    new Values(new TwitterFeed(tweet)), seqId++);

There are two bolts: TweetRankBolt and AzureSqlBolt. TweetRankBolt figures out the top 10 tweets by score (retweet count + favorite count) within a 10-second window and emits them to AzureSqlBolt, which stores them in the Azure SQL database table.

// Build the topology: one spout feeding TweetRankBolt, which feeds AzureSqlBolt.
TopologyBuilder topoBuilder = new TopologyBuilder(
    "TwitterSentimentAnalysis" + DateTime.Now.ToString("yyyyMMddHHmmss"));

topoBuilder.SetSpout("TwitterReaderSpout", TwitterReaderSpout.Get,
    new Dictionary<string, List<string>>()
    {
        { Constants.DEFAULT_STREAM_ID, TwitterReaderSpout.OutputSchemaName }
    }, 1, true);

// Deliver a tick tuple to TweetRankBolt every 10 seconds to close each ranking window.
var boltConfig = new StormConfig();
boltConfig.Set("topology.tick.tuple.freq.secs", "10");

topoBuilder.SetBolt("TweetRankBolt", TweetRankBolt.Get,
    new Dictionary<string, List<string>>()
    {
        { "TWEETRANK_STREAM", TweetRankBolt.OutputSchemaName }
    }, 1, true)
    .shuffleGrouping("TwitterReaderSpout")
    .addConfigurations(boltConfig);

topoBuilder.SetBolt("AzureSqlBolt", AzureSqlBolt.Get,
    new Dictionary<string, List<string>>(), 1)
    .shuffleGrouping("TweetRankBolt", "TWEETRANK_STREAM");
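
Inside TweetRankBolt, the windowed ranking could be implemented roughly as in the sketch below. This is not the exact code from the GitHub project: the tick-tuple check, the TwitterFeed property names (Id, RetweetCount, FavoriteCount), and the emit call are assumptions made for illustration.

using System.Collections.Generic;
using System.Linq;
using Microsoft.SCP;

// Simplified, hypothetical sketch of a windowed ranking bolt.
public class RankingBoltSketch : ISCPBolt
{
    private readonly Context context;
    private readonly Dictionary<string, TwitterFeed> window = new Dictionary<string, TwitterFeed>();

    public RankingBoltSketch(Context context)
    {
        this.context = context;
    }

    public static RankingBoltSketch Get(Context context, Dictionary<string, object> parms)
    {
        return new RankingBoltSketch(context);
    }

    public void Execute(SCPTuple tuple)
    {
        // Tick tuples arrive every 10 seconds (topology.tick.tuple.freq.secs above);
        // assumed here to arrive on the built-in "__tick" stream.
        if (tuple.GetSourceStreamId() == "__tick")
        {
            // Close the window: emit the 10 highest-scoring tweets and start over.
            var top10 = window.Values
                .OrderByDescending(t => t.RetweetCount + t.FavoriteCount)
                .Take(10);
            foreach (var tweet in top10)
            {
                this.context.Emit("TWEETRANK_STREAM", new Values(tweet));
            }
            window.Clear();
        }
        else
        {
            // Regular tuple from the spout: keep the latest copy of each tweet.
            var feed = (TwitterFeed)tuple.GetValue(0);
            window[feed.Id] = feed;
        }
    }
}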

To classify the tweeted text as positive, neutral, or negative, use Sentiment140, a service whose machine learning models classify tweet text by sentiment. The API is called over REST and returns JSON:

private int GetSentimentType(string textToAnalyze)
{
    try
    {
        // Call the Sentiment140 classify endpoint with the URL-encoded tweet text.
        string url = string.Format(
            "http://www.sentiment140.com/api/classify?text={0}",
            System.Web.HttpUtility.UrlEncode(textToAnalyze, Encoding.UTF8));

        var response = HttpWebRequest.Create(url).GetResponse();
        using (var sr = new StreamReader(response.GetResponseStream()))
        {
            // The response is a single line of JSON; extract the polarity value.
            var line = sr.ReadLine();
            var jobj = JObject.Parse(line);
            int polarity = jobj.SelectToken("results", true)
                               .SelectToken("polarity", true)
                               .Value<int>();
            switch (polarity)
            {
                case 0: return 0;  // negative
                case 4: return 1;  // positive
                default: return 3; // neutral
            }
        }
    }
    catch
    {
        return 4; // error: not classified
    }
}

The source code is available on GitHub. Once the application is submitted to the Storm cluster, you will notice that the counts are updated when you refresh, as shown in Figure 5.2.

Figure 5.2: Storm topology view.

Next, check the logs of the spouts and bolts. For example, to check a bolt's logs, click the TweetRankBolt box, then click the port to view its logs. Use the logs to find and fix any errors in the code, as shown in Figure 5.3. Alternatively, the logs can be viewed through the HDInsight cluster URL using the login details provided during cluster creation.

Figure 5.3: Bolt execution logs.

Power BI can be used to visualize, explore, and monitor the data stored in the Azure SQL database in near real time. Add your client IP to the Azure SQL server's firewall settings so that the Power BI client can connect. Once the connection details are provided, Power BI discovers the schema and lists the tables and columns, with their types, in the field list. You can then explore the data by selecting charts, dragging individual columns onto the canvas, changing filters and data types, and generating queries back to the source.

Figure 5.4: Sentiment analysis graphs using Power BI.

Figure 5.4 shows a pie chart, a donut chart, and a treemap depicting the totals of the logged Twitter sentiments. Each color denotes positive (1), negative (0), or neutral (3) sentiment. The charts show that Sentiment140 classifies most tweets as neutral and the fewest as positive.

Conclusion

We have seen that Apache Storm is a reliable way to process unbounded streams of data from Twitter. It offers a low-latency pipeline and is fault tolerant and resilient, provided failures are handled in code. Running it on an Azure HDInsight cluster provides the scaling behavior of the managed Hadoop ecosystem and makes Storm accessible to C# developers. The source code is available on GitHub.


