You Might Have a Streaming Data Problem If...

Any data processing job where you operate on a single item at a time is a candidate to be processed by a stream processing engine.

When processing data, we often categorize a job as either a batch or a streaming job. However, this is a bit of a false dichotomy. In this post, I'll explore how we ordinarily define batch and stream processing, and show how some tasks that we think of as batch jobs can be seen as a subset of stream processing.

Batch Processing

The definition of batch processing has changed over time. In current conventional usage, it means applying some data transformation to a finite set of data. The finite nature of the data means that the job has a beginning and an end.

An example would be, "I want to process all the logs that my web servers generated yesterday."

Here, we have a finite set of data: "All the logs my web servers generated yesterday." I can start processing those logs and eventually, that task will finish.

Stream Processing

In contrast to batch processing, stream processing commonly refers to data transformations that are done over an infinite set of data. Or, more realistically, once a stream processing system is turned on, there's no fixed end. It will keep processing data as it arrives until the job is turned off because it's no longer needed.

The Missing "How"

While helpful for grounding conversations, these definitions overlook an essential aspect of batch and stream processing: how the data is processed.

One of the characteristics of stream processing that is implicit in our definition is that we engage in "item-at-a-time" processing. What is item-at-a-time processing? It means that we process each piece of data, each item, as it arrives. An auction site that takes action for each bid as it arrives is practicing item-at-a-time processing. It's handling each bid-placed event as it arrives, one bid at a time.

What's interesting is that many "batch processing" jobs are also "item-at-a-time" and are good candidates for processing by an "item-at-a-time engine." In other words, a lot of batch jobs, each with a finite set of data that has a start and an end, make good stream processing jobs.

Let's take a look at a ubiquitous batch processing example: log file analysis.

Log File Analysis

Log file analysis is commonly thought of as a batch processing job because we have a finite, fixed set of data. We have some log files covering a specific period of time that we want to process.

Log file analysis often involves taking each line of the log, examining it for different features, and updating aggregations of those features, e.g. getting the geographic distribution of visitors over a given time frame or ranking the most popular pages on a website.

Note that I said our log analysis involves "taking each line of the log." Almost all log analysis is item-at-a-time. Instead of thinking of the problem as "take a bunch of files and process them," we can think of it as "process a stream of entries from logs."
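To make the "stream of entries" framing concrete, here is a minimal Python sketch of a page-popularity count that only ever looks at one log line at a time. The log format (`<ip> <method> <path> <status>`) and the function names are assumptions made up for illustration; real access logs are messier.

```python
from collections import Counter
from typing import Iterable, Optional


def parse_path(line: str) -> Optional[str]:
    """Return the requested path from a simplified log line, or None.

    Assumed (hypothetical) format: '<ip> <method> <path> <status>'.
    """
    parts = line.split()
    return parts[2] if len(parts) == 4 else None


def count_pages(lines: Iterable[str]) -> Counter:
    """Update a running page count, one log entry at a time."""
    counts: Counter = Counter()
    for line in lines:  # works for a list, an open file, or a live feed
        path = parse_path(line)
        if path is not None:  # skip lines that do not match the format
            counts[path] += 1
    return counts


sample_log = [
    "10.0.0.1 GET /home 200",
    "10.0.0.2 GET /pricing 200",
    "10.0.0.1 GET /home 200",
    "malformed line",
]
page_counts = count_pages(sample_log)
print(page_counts.most_common(1))  # [('/home', 2)]
```

Nothing in `count_pages` depends on the input being a finished file. It accepts any iterable of lines, so the same loop could consume yesterday's logs or entries as they are written.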

Recommendation Engine

Recommendation engines are often item-at-a-time systems, as well. A company wants to recommend a product to its customers. The goal is to get them interested in products they might not be aware of.

Once again, there's a finite dataset: in this case, a list of customers that we want to generate recommendations for.

Every so often, perhaps daily or weekly, the company needs to generate new product recommendations that will be mailed to its customers. This fits under our earlier definition of batch processing: processing over a finite dataset with a beginning and an end.

However, recommendations are done "per customer." That's "item-at-a-time." Recommendations are generated for each customer independent of the suggestions for any other customer. That's a streaming problem.
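The independence of each customer's recommendations can be sketched in a few lines of Python. The catalog, the customer record shape, and the "favorite category" rule below are all hypothetical stand-ins; a real recommender would use a far richer model. The point is only that each customer is handled without reference to any other.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical catalog: category -> products in that category.
CATALOG: Dict[str, List[str]] = {
    "books": ["novel", "atlas"],
    "music": ["vinyl", "headphones"],
}


def recommend(customer: Dict) -> List[str]:
    """Suggest products from the customer's favorite category
    that the customer has not already purchased."""
    owned = set(customer["purchased"])
    candidates = CATALOG.get(customer["favorite"], [])
    return [product for product in candidates if product not in owned]


def recommendations(customers: Iterable[Dict]) -> Iterator[Tuple[str, List[str]]]:
    """Yield (customer id, suggestions) one customer at a time."""
    for customer in customers:
        yield customer["id"], recommend(customer)


customers = [
    {"id": "c1", "favorite": "books", "purchased": ["novel"]},
    {"id": "c2", "favorite": "music", "purchased": []},
]
results = dict(recommendations(customers))
print(results["c1"])  # ['atlas']
```

Because `recommend` needs only a single customer record, it could be called from a nightly loop over the whole customer list or from a handler that fires when one customer shows up.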

OK, so there are batch problems that are also streaming problems. Why would we want to use a stream processing engine instead of a batch processing engine? There are many reasons to consider a stream processing engine, but I want to focus on one: extracting value from your data sooner than you could with a batch engine.

Unlock the Value of Your Data Sooner With Stream Processing

Let's take our product recommendation engine. If we treat it as a streaming problem, we can start generating recommendations in real time, with no need to wait for daily or weekly emails. We can create and update recommendations on the fly. We can generate them while a customer is on the website, while we already have their attention. Stream processing can take the data we have and unlock its value sooner, with no need to wait for the next batch to run.

How about log analysis? What's the value of real-time log analysis? Well, that depends on you and your company. Off the top of my head, threat analysis immediately comes to mind. If your logs can be analyzed to look for malicious behavior, would you rather do that every day, every hour, or as it is happening?

Wrapping Up

The simple definitions that we started with present batch and stream processing as a binary choice: a job is either batch or streaming. A couple of examples have shown that it isn't that simple. Many tasks that we think of as batch jobs process their data one item at a time, and that makes them stream processing jobs as well.

Any batch job that loads and processes its dataset an item at a time can be treated as a special case of stream processing. Our finite, bounded batch dataset is merely a window of time extracted from the theoretically infinite dataset of a stream processing job.
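One way to picture the "window" idea is with Python iterators. In this sketch, `event_stream` is a made-up stand-in for a never-ending source, and squaring each event is a placeholder transformation. The same item-at-a-time function runs over a finite batch cut from the stream and over the stream itself.

```python
import itertools
from typing import Iterable, Iterator


def event_stream() -> Iterator[int]:
    """Hypothetical unbounded source: yields event IDs 1, 2, 3, ... forever."""
    return itertools.count(start=1)


def process(events: Iterable[int]) -> Iterator[int]:
    """Item-at-a-time transformation (placeholder logic: square each event)."""
    for event in events:
        yield event * event


# "Batch": a finite window cut out of the unbounded stream.
batch = list(itertools.islice(event_stream(), 5))
batch_results = list(process(batch))

# "Streaming": the same function applied to the unbounded source.
# We stop after five results only so that the example terminates.
stream_results = list(itertools.islice(process(event_stream()), 5))

print(batch_results == stream_results)  # True
```

The processing code cannot tell the two cases apart. Whether the input ends is a property of the source, not of the per-item logic.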

Any data processing job where you operate on a single item at a time is a candidate to be processed by a stream processing engine. Using a stream processing engine has many advantages, including the ability to extract value from your data in real time.

Published at DZone with permission of
