
Understanding Batch, Microbatch, and Stream Processing

Batch processing is for cases where having the most up-to-date data is not important. Stream processing is for cases that require live interaction and real-time responsiveness.

By Jon Bock · Apr. 16, 2018 · Opinion

One of the fundamental distinctions in data processing centers on when newly received data is processed. This distinction has important implications for the architecture and design of both the systems that process data and the applications that use and depend on those systems. In this article, we'll take a closer look at the differences and some of their implications.

Most simply stated, the fundamental distinction is whether each piece of new data is processed when it arrives or at a later time as part of a set of new data. That distinction divides processing into two categories: batch processing and stream processing.

Batch Processing

In batch processing, newly arriving data elements are collected into a group. The whole group is then processed at a future time (as a batch, hence the term “batch processing”). Exactly when each group is processed can be determined in a number of ways — for example, it can be based on a scheduled time interval (e.g. every five minutes, process whatever new data has been collected) or on some triggered condition (e.g. process the group as soon as it contains five data elements or as soon as it has more than 1MB of data).

[Figure: Batch processing with a time-based batch interval]
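To make the triggering logic concrete, here is a minimal sketch in plain Python; the BatchCollector class and process_batch callback are illustrative names, not taken from any particular framework:

    import time

    class BatchCollector:
        """Collect incoming elements and process them as a group when a
        count, size, or time threshold is reached."""

        def __init__(self, process_batch, max_elements=5,
                     max_bytes=1_000_000, max_interval_secs=300):
            self.process_batch = process_batch          # callback for a full batch
            self.max_elements = max_elements            # e.g., five data elements
            self.max_bytes = max_bytes                  # e.g., more than 1MB of data
            self.max_interval_secs = max_interval_secs  # e.g., every five minutes
            self.buffer = []
            self.buffered_bytes = 0
            self.last_flush = time.monotonic()

        def add(self, element: bytes):
            self.buffer.append(element)
            self.buffered_bytes += len(element)
            if (len(self.buffer) >= self.max_elements
                    or self.buffered_bytes >= self.max_bytes
                    or time.monotonic() - self.last_flush >= self.max_interval_secs):
                self.flush()

        def flush(self):
            if self.buffer:
                self.process_batch(self.buffer)  # the whole group is processed at once
            self.buffer = []
            self.buffered_bytes = 0
            self.last_flush = time.monotonic()

Note that this sketch only checks the time threshold when new data arrives; a production system would also run a timer so the interval trigger fires even during quiet periods.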

By way of analogy, batch processing is like your friend (you certainly know someone like this) who takes a load of laundry out of the dryer and simply tosses everything into a drawer, only sorting and organizing it once it becomes too hard to find anything. This person avoids the work of sorting after each load, but at the cost of spending a lot of time searching through the drawer whenever they need something, and eventually having to spend significant time separating clothes, matching socks, and so on once finding things becomes too difficult.

Historically, the vast majority of data processing technologies have been designed for batch processing. Traditional data warehouses and Hadoop are two common examples of systems focused on batch processing.

The term “microbatch” is frequently used to describe scenarios where batches are small and/or processed at short intervals. Even though processing may happen as often as once every few minutes, data is still processed a batch at a time. Spark Streaming is an example of a system designed to support microbatch processing.
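As a concrete illustration, here is a minimal PySpark sketch using Spark's newer Structured Streaming API, where the microbatch cadence is made explicit through a processing-time trigger (the built-in rate source simply generates test rows; a local Spark installation is assumed):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("microbatch-demo").getOrCreate()

    # The built-in "rate" source continuously generates timestamped test rows.
    events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

    # Data arrives continuously, but it is still processed one microbatch
    # at a time, here once per minute.
    query = (events.writeStream
             .format("console")
             .trigger(processingTime="1 minute")
             .start())
    query.awaitTermination()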

Stream Processing

In stream processing, each new piece of data is processed when it arrives. Unlike in batch processing, there is no waiting until the next batch interval: data is processed as individual pieces rather than a batch at a time.

[Figure: Stream processing]

Although each new piece of data is processed individually, many stream processing systems also support “window” operations that let processing reference data arriving within a specified interval before and/or after the current piece of data.
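Here is a minimal sketch of per-event processing with a sliding time window in plain Python; the class and method names are illustrative and not tied to Storm, Heron, or any other engine:

    import time
    from collections import deque

    class WindowedStreamProcessor:
        """Process each event immediately on arrival, keeping a sliding
        window of the events seen in the last `window_secs` seconds."""

        def __init__(self, window_secs=60):
            self.window_secs = window_secs
            self.window = deque()  # (arrival_time, event) pairs

        def on_event(self, event):
            now = time.monotonic()
            self.window.append((now, event))
            # Evict events that have aged out of the window.
            while self.window and now - self.window[0][0] > self.window_secs:
                self.window.popleft()
            # Process the new event right away, with the window as context.
            self.process(event, [e for _, e in self.window])

        def process(self, event, recent_events):
            print(f"event={event!r}, events in window: {len(recent_events)}")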

Carrying forward our analogy, a stream processing approach to organizing laundry would sort, match, and organize laundry as it is taken out of the dryer. In this approach, a bit more work is done up front for each load, but there is no longer any need to go back and reorganize the entire drawer later, and no time is wasted searching through the drawer for buried pieces of clothing.

A growing number of systems are designed for stream processing, including Apache Storm and Apache Heron. These systems are frequently deployed to support near-real-time processing of events.

Implications of the Differences Between Batch and Stream Processing

Although it may seem that the differences between stream processing and batch processing, especially microbatch processing, are just a matter of a small difference in timing, they have fundamental implications for both the architecture of data processing systems and the applications that use them.

Systems for stream processing are designed to respond to data as it arrives. That requires them to implement an event-driven architecture, one in which the system's internal workflow continuously monitors for new data and dispatches processing as soon as that data is received. By contrast, the internal workflow in a batch processing system only checks for new data periodically, and only processes that data when the next batch window occurs.
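The architectural contrast can be reduced to two toy loops, sketched here with Python's standard-library queue (handle and handle_batch are hypothetical placeholders for real processing logic):

    import queue
    import time

    events = queue.Queue()

    def handle(event):            # placeholder for real per-event logic
        print("processing", event)

    def handle_batch(batch):      # placeholder for real batch logic
        print("processing batch of", len(batch), "events")

    def stream_worker():
        # Event-driven: block until data arrives, then dispatch immediately.
        while True:
            event = events.get()  # wakes as soon as an event is available
            handle(event)

    def batch_worker(interval_secs=300):
        # Batch: wake periodically and process whatever has accumulated.
        while True:
            time.sleep(interval_secs)
            pending = []
            while not events.empty():
                pending.append(events.get_nowait())
            if pending:
                handle_batch(pending)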

The difference between stream and batch processing is also significant for applications. Applications built for batch processing by definition process data with a delay, and in a data pipeline with multiple steps, those delays accumulate. In addition, the lag between the arrival of new data and its processing varies with the time until the next batch window: from almost nothing for data that arrives just before a batch runs, to nearly the full interval between batches for data that arrives just after a batch has started. As a result, batch processing applications (and their users) cannot rely on consistent response times and need to accommodate both that inconsistency and the greater latency.
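A quick back-of-the-envelope calculation illustrates how those delays compound, using the five-minute example interval from earlier and an assumed three-stage pipeline:

    BATCH_INTERVAL_SECS = 300  # the five-minute example interval
    PIPELINE_STAGES = 3        # assumed number of batch steps in the pipeline

    best_case = 0  # data lands just before each batch runs
    worst_case = BATCH_INTERVAL_SECS * PIPELINE_STAGES  # data just misses every batch

    print(f"end-to-end lag: {best_case}s to {worst_case}s "
          f"(up to {worst_case / 60:.0f} minutes, before any processing time)")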

Batch processing is generally appropriate for use cases where having the most up-to-date data is not important and where tolerance for slower response time is higher. For example, offline analysis of historical data to compute results or identify correlations is a common batch processing use case.

Stream processing, on the other hand, is necessary for use cases that require live interaction and real-time responsiveness. Financial transaction processing, real-time fraud detection, and real-time pricing are examples best suited to stream processing.


Published at DZone with permission of Jon Bock. See the original article here.

Opinions expressed by DZone contributors are their own.
