Real-Time Streaming Pattern: Preprocessing for Sentiment Analysis

Sentiment analysis is often used by data scientists to gauge how their organization is viewed on the web. See how to create your own sentiment analysis pattern.

Introduction

I am starting a series of posts looking at a variety of data processing patterns used to build real-time stream processing applications, the use cases that the patterns relate to, and how you would go about implementing them within Wallaroo.

These posts will help you understand the data processing use cases that Wallaroo is best designed to handle and how you can go about building them right away.

I will be looking at the Wallaroo application builder, the part of your application that hooks into the Wallaroo framework, and some of the business logic of the pattern.

Pattern: Preprocessing

Preprocessing involves the transformation of the messages in your data pipeline.

A variety of operations can occur, including:

  • Removing attributes from a message.
  • Adding or enhancing attributes in a message.
  • Filtering out entire messages from a pipeline.
  • Splitting messages to be processed by multiple pipelines.
  • Combining multiple pipelines into a new pipeline.
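
To make the first three operations concrete, here is a minimal sketch in plain Python. It does not use Wallaroo at all; the message shape (a dict with "user", "text", and "geo" keys) and every function name are assumptions made up for illustration.

```python
# Hypothetical message shape: a dict with "user", "text", and "geo" keys.

def remove_attributes(message, unwanted=("geo",)):
    """Return a copy of the message without the unwanted keys."""
    return {k: v for k, v in message.items() if k not in unwanted}


def enrich(message):
    """Return a copy of the message with a derived word_count attribute."""
    enriched = dict(message)
    enriched["word_count"] = len(message["text"].split())
    return enriched


def keep(message, keyword="nike"):
    """Filter predicate: True if the message text mentions the keyword."""
    return keyword in message["text"].lower()


def preprocess(messages):
    """Lazily filter, strip, and enrich a stream of messages."""
    for message in messages:
        if keep(message):
            yield enrich(remove_attributes(message))
```

Each function takes a message and returns a new one without touching shared state, which is what makes steps like these easy to chain and to run in parallel.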

Use Cases

It's not unusual to see the preprocessing pattern used in many use cases and combined with other patterns.

One excellent example use case is removing stop words for sentiment analysis.

Sentiment analysis is used by data scientists to look at a piece of text and determine whether the underlying sentiment is positive or negative.

Let's say you are monitoring tweets that mention Nike. You want to know how people are feeling about Nike. Are the tweets and comments generally positive or negative? "Nike has some great new looks this season" would be a positive sentiment. Each piece of text is given a score, and you can add up the score over some time, say the last hour, to get an overall sentiment score.
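
As a rough sketch of the "add up the score over the last hour" idea, the class below keeps a running total over a sliding time window. The class name, the use of plain numeric timestamps in seconds, and the assumption that events arrive in time order are all mine, not something prescribed by Wallaroo.

```python
from collections import deque


class WindowedSentiment:
    """Running sum of the scores seen within the last window_seconds.

    Assumes timestamps are numbers (seconds) that never go backwards.
    """

    def __init__(self, window_seconds=3600):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, score) pairs, oldest first
        self.total = 0.0

    def add(self, timestamp, score):
        """Record a score and return the total for the current window."""
        self.events.append((timestamp, score))
        self.total += score
        self._expire(timestamp)
        return self.total

    def _expire(self, now):
        """Drop events that are a full window old or older."""
        cutoff = now - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            _, old_score = self.events.popleft()
            self.total -= old_score
```

A positive total suggests the last hour of tweets leaned positive, a negative total the opposite.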

We will assume that filtering has already happened upstream from our sentiment processor, so we only receive Nike-related tweets. The next step before doing the actual sentiment analysis is to remove stop words. Stop words are words that are meaningless to the underlying sentiment analysis and therefore can and should be removed before the sentiment algorithm runs.

Stop word lists are customized to the specifics of the use case but would include words like "are, aren't, as, at, be, because, been, has, I'm, it's, on, some, the, this."

In our example message above, "@Nike has some great new looks this season" would become "@Nike great new looks season" after stop word preprocessing. This isn't the easiest text for humans to read, but it is just how our sentiment algorithm wants to see it!
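
One possible shape for the stop word step is sketched below in plain Python. The stop word set is just the sample list from this post, and splitting on whitespace is a simplifying assumption; a real implementation would likely use a proper tokenizer and a list tuned to the use case.

```python
# Sample stop words taken from the list in this post (lower-cased).
STOP_WORDS = {
    "are", "aren't", "as", "at", "be", "because", "been",
    "has", "i'm", "it's", "on", "some", "the", "this",
}


def remove_stop_words(text):
    """Return the text with stop words removed, keeping word order.

    Words are compared in lower case with surrounding punctuation
    stripped, but the words that survive are returned unchanged.
    """
    kept = [
        word for word in text.split()
        if word.lower().strip('.,!?;:"') not in STOP_WORDS
    ]
    return " ".join(kept)
```

Calling remove_stop_words("@Nike has some great new looks this season") returns "@Nike great new looks season", matching the example in the text.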

Wallaroo Application Builder

Overview

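# Name the pipeline and read tweets from a TCP source.
ab.new_pipeline(
    "Sentiment Analysis",
    wallaroo.TCPSourceConfig(order_host, order_port, order_decoder)
)
# Stateless step: strip stop words, in parallel across workers.
ab.to_parallel(remove_stop_words)
# Stateful step: tally sentiment scores in the "Sentiment" state object.
ab.to_stateful(sum_sentiment, Sentiment, "Sentiment")
# Send results to a TCP sink, then finish the application definition.
ab.to_sink(wallaroo.TCPSinkConfig(out_host, out_port, encoder))
return ab.build()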

ab.new_pipeline(
    "Sentiment Analysis",
    wallaroo.TCPSourceConfig(order_host, order_port, order_decoder)
)

This defines the Wallaroo pipeline: its name, "Sentiment Analysis", and the source of the data. In this example, we are receiving tweet messages over TCP.
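
The source config is handed a decoder, which this post does not show. As a sketch of what decoding could involve, the code below assumes each tweet arrives as a 4-byte big-endian length header followed by UTF-8 text. That framing, and the function names, are assumptions for illustration; the way Wallaroo itself registers a decoder is not shown here.

```python
import struct

HEADER_FORMAT = ">I"  # assumed framing: 4-byte big-endian payload length
HEADER_LENGTH = struct.calcsize(HEADER_FORMAT)


def frame(text):
    """Encode text as a length-prefixed frame (handy for testing)."""
    payload = text.encode("utf-8")
    return struct.pack(HEADER_FORMAT, len(payload)) + payload


def decode_frames(buffer):
    """Split a byte buffer into decoded messages.

    Returns (messages, leftover), where leftover holds the bytes of any
    incomplete trailing frame so they can be retried when more data arrives.
    """
    messages = []
    offset = 0
    while len(buffer) - offset >= HEADER_LENGTH:
        (length,) = struct.unpack_from(HEADER_FORMAT, buffer, offset)
        start = offset + HEADER_LENGTH
        end = start + length
        if end > len(buffer):
            break  # payload not fully received yet
        messages.append(buffer[start:end].decode("utf-8"))
        offset = end
    return messages, buffer[offset:]
```

Length-prefixed framing is a common choice over TCP because the receiver always knows where one message ends and the next begins.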

ab.to_parallel(remove_stop_words) 

Our first processing step is to call a function called "remove_stop_words." This function would look at the text of the tweet and strip out any instances of the stop words from our list, then send the processed message to the next step in the pipeline.

In this example, I am using the Wallaroo to_parallel method to add a computation to the pipeline. The remove_stop_words function does not need to save anything to state, and the computation can run in parallel across all available workers.

ab.to_stateful(sum_sentiment, Sentiment, "Sentiment") 

to_stateful adds a stateful, non-parallel computation to the pipeline. The sum_sentiment function tallies up the sentiment scores and stores the totals in the "Sentiment" state object. You could use a Python library like NLTK to accomplish the sentiment analysis.
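
To show the shape of the stateful step without depending on Wallaroo or NLTK, here is a standalone sketch. The tiny word lexicon is a made-up stand-in for a real sentiment library, and the (text, state) signature and return value are assumptions; check the Wallaroo documentation for the exact conventions a state computation must follow.

```python
class Sentiment:
    """State object: running score total and number of tweets seen."""

    def __init__(self):
        self.total = 0
        self.count = 0


# Hypothetical toy lexicon standing in for a real sentiment library.
LEXICON = {"great": 1, "love": 1, "good": 1,
           "bad": -1, "terrible": -1, "hate": -1}


def score(text):
    """Sum the lexicon values of the words in the text (unknown words = 0)."""
    return sum(LEXICON.get(word.lower(), 0) for word in text.split())


def sum_sentiment(text, state):
    """Fold one tweet's score into the state and return the new total."""
    state.total += score(text)
    state.count += 1
    return state.total
```

Because every tweet updates the same Sentiment object, this step has to see messages one at a time, which is why it cannot be parallelized the way the stop word step can.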

    ab.to_sink(wallaroo.TCPSinkConfig(out_host, out_port, encoder))
    return ab.build()

to_sink specifies the endpoint of the pipeline, where the messages are sent. build() tells Wallaroo that there are no more steps for this application.
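
The sink config takes an encoder, which turns each result into bytes for the wire. The sketch below writes one JSON object per line; both the output format and the assumption that the result is a dict with "total" and "count" keys are illustrative choices, not something this post or Wallaroo prescribes.

```python
import json


def encoder(result):
    """Serialize a result dict as one line of UTF-8 JSON.

    sort_keys keeps the output stable, which makes downstream
    comparison and testing easier.
    """
    return (json.dumps(result, sort_keys=True) + "\n").encode("utf-8")
```

Newline-delimited JSON is easy to inspect with standard tools, though a binary, length-prefixed format may be a better fit for high-volume streams.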

Conclusion

The preprocessing pattern is one of the most commonly used patterns that you will come across when building your streaming application. As you can see, Wallaroo's lightweight API gives you the ability to construct your data processing pipeline and run whatever application logic you need to power your application.

Give Wallaroo a try

We hope that this post has piqued your interest in Wallaroo!

If you are just getting started, we recommend you try our Docker image, which allows you to get Wallaroo up and running in only a few minutes.

Topics:
big data, sentiment analysis, real-time streaming

Published at DZone with permission of
