Batch Processing vs. Stream Processing

Batch tasks are best used for performing aggregate functions on your data. Stream tasks are best used for cases where low latency is integral to the operation.

By Margo Schaedel · Apr. 20, 18 · Analysis

If you've read DevRel Katy Farmer's stellar post, Kapacitor and Continuous Queries: How to Decide Which Tool You Need, then you know that when our community talks, we listen. So, in alignment with that view and in honor of our very own Kapacitor Koala, let's tackle another common community issue that has come to our attention: when should we use batch processing versus stream processing in our Kapacitor tasks?

[Image: Our famous Kapacitor Koala.]

Now, if you have no idea what Kapacitor is, I recommend doing a little light reading on it to get up to speed. Kapacitor, the final component of our TICK Stack, offers capabilities such as data transformation, downsampling, and alerting. It uses its own DSL, TICKscript, which lets you define tasks that are then executed on your data; essentially, it processes your data for you.

Here's where it gets tricky, though: How do you choose whether to process your data as a batch task or streaming task?

Batch Tasks

Let's discuss batch tasks first. A batch is a collection of data points that have been grouped together within a specific time interval. Another term often used for this is a window of data. When running a batch task, Kapacitor queries InfluxDB periodically, thereby avoiding having to buffer much of your data in RAM. There are several cases where batch processing is the way to go:

  • Performing aggregate functions such as finding the mean, maximum, or minimum of a set interval of data.
  • Cases where alerting doesn't need to run on every single data point (since state changes will probably not happen that often). You don't want to be inundated with alerts!
  • Downsampling your data, which takes a large collection of data points and retains only the most significant data (so you can still view overall trends in the data).
  • Cases where a little extra latency won't severely impact your operation.
  • Cases with a very high-throughput InfluxDB instance, since Kapacitor cannot process data as quickly as it can be written to InfluxDB (this occurs more frequently with InfluxEnterprise clusters).
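
To make the idea of a window concrete, here is a minimal sketch in plain Python, not TICKscript and not Kapacitor's actual implementation. It groups points into fixed time windows and computes the mean, minimum, and maximum of each one, which is also the essence of downsampling. The function name `aggregate_windows`, the 60-second window, and the sample readings are all assumptions made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def aggregate_windows(points, window_seconds):
    """Group (timestamp, value) points into fixed windows and summarize each."""
    windows = defaultdict(list)
    for timestamp, value in points:
        # Align each timestamp to the start of the window it falls in.
        window_start = timestamp - (timestamp % window_seconds)
        windows[window_start].append(value)
    # One summary row per window replaces many raw points (downsampling).
    return {
        start: {"mean": mean(values), "min": min(values), "max": max(values)}
        for start, values in sorted(windows.items())
    }


# Hypothetical readings: (seconds since start, percent CPU idle).
points = [(0, 90.0), (20, 80.0), (40, 70.0), (60, 50.0), (80, 30.0)]
summary = aggregate_windows(points, window_seconds=60)
print(summary)
```

Five raw points collapse into two summary rows here, one per window. A real batch task would get each window by querying InfluxDB on a schedule rather than holding the points in memory.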

Stream Tasks

On the other side, we have stream tasks. Stream tasks create subscriptions to InfluxDB so that every data point written to InfluxDB is also written to Kapacitor. Note, though, that stream tasks use a high percentage of available memory, so memory availability is a key factor to take into consideration. Here's where stream processing is ideal:

  • If you want to transform each individual data point in real-time (technically, this could also be run with a batch process but there's latency to consider).
  • Cases where the lowest possible latency is paramount to the operation. If alerts need to be triggered immediately, for example, running a stream task will ensure the least possible delay.
  • Cases in which InfluxDB is handling high-volume query load and you may want to alleviate some of the query pressure from InfluxDB.
  • Cases where window membership needs to be predictable. Stream tasks understand time by the data's timestamps, so there are no race conditions over whether a given point will make it into a window. With batch tasks, on the other hand, it is possible for a data point to arrive late and be left out of its relevant window.
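
As a rough illustration of per-point processing, the sketch below (again plain Python, not TICKscript) evaluates every point as it arrives and emits an alert the moment the state changes, instead of waiting for a window to close. The name `stream_alerts`, the threshold of 10, and the sample readings are hypothetical.

```python
def stream_alerts(points, threshold):
    """Yield an alert as soon as a point flips the OK/CRITICAL state."""
    state = "OK"
    for timestamp, value in points:
        # Each point is checked individually, with no waiting on a window.
        new_state = "CRITICAL" if value < threshold else "OK"
        if new_state != state:
            state = new_state
            yield (timestamp, state, value)


# Hypothetical readings: (seconds since start, percent CPU idle).
readings = [(0, 90.0), (10, 85.0), (20, 8.0), (30, 5.0), (40, 60.0)]
alerts = list(stream_alerts(readings, threshold=10.0))
print(alerts)
```

Because the generator handles one point at a time, the alert for the reading at 20 seconds can fire before any later point has arrived. Emitting only on state changes also keeps the number of alerts down, even though every point is inspected.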

Another advantage some might see in writing stream tasks is the ease of defining the task using only Kapacitor's TICKscript, without having to write queries for InfluxDB. If you are comfortable with writing both, however, it's probably in your best interest to go with batch processing most of the time, since it uses a lot less memory. An additional factor to consider is that Kapacitor is not limited to use with InfluxDB. For example, if you want to send data straight from Telegraf to Kapacitor, that will have to be done as a stream task.

Key Takeaways

  • Batch tasks query InfluxDB periodically and use limited memory but can place additional query load on InfluxDB.
  • Batch tasks are best used for performing aggregate functions on your data, downsampling, and processing large temporal windows of data.
  • Stream tasks subscribe to writes from InfluxDB, placing additional write load on Kapacitor, but can reduce query load on InfluxDB.
  • Stream tasks are best used for cases where low latency is integral to the operation.

When our community talks, we listen. We'd love to hear how your batch and stream tasks are going! Send us your comments, questions, issues, and blog ideas on our community site and feel free to reach out to us on Twitter: @InfluxDB or @mschae16.

Published at DZone with permission of Margo Schaedel, DZone MVB. See the original article here.
