4 Valuable Resources on Stream Processing

This article is an overview of the main types of stream processing and their frameworks.

By Itamar Weiss · Jul. 08, 16 · Opinion

Stream processing is a programming paradigm in which an application receives a sequence of data, treats it as a collection of individual elements, or datapoints, and processes each datapoint on its own rather than grouping them together. Each datapoint is handled as it arrives, independently of the other datapoints, unlike batch processing, where datapoints are usually buffered and processed together, in bulk. Stream processors have therefore become an important building block of real-time applications: they make it possible to act on event data as it arrives, giving users access to the real-time state of a system and its data rather than to periodic snapshots of it.
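To make the contrast concrete, here is a minimal Python sketch (not any particular framework's API) of the same per-event handler driven in a streaming fashion versus a batching fashion:

```python
def process(event):
    # Hypothetical per-event handler: act on a single datapoint.
    print(f"handled {event}")

def run_stream(source):
    # Stream processing: handle each datapoint the moment it arrives,
    # independently of every other datapoint.
    for event in source:
        process(event)

def run_batch(source, batch_size=1000):
    # Batch processing: buffer datapoints and process them together, in bulk.
    buffer = []
    for event in source:
        buffer.append(event)
        if len(buffer) >= batch_size:
            for buffered in buffer:
                process(buffered)
            buffer.clear()
    for buffered in buffer:  # flush the final partial batch
        process(buffered)
```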

Alooma's main product is a data pipeline that allows our users to send or pull data from multiple sources, perform computations on the data (e.g., to handle schema changes, clean corrupt data, etc.), and load it into a data warehouse. Stream processing gives our pipeline much lower latency than a batch processing approach would. Stateful stream processing enriches the kinds of computations our users can perform on the stream. For example, by counting events with certain attributes over time windows, it is possible to build real-time dashboards of the data in the stream without ever loading the data into a data warehouse.
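As an illustration of that kind of windowed aggregation (a generic sketch, not Alooma's actual pipeline; the event fields are assumptions), counting events by type over one-minute tumbling windows takes only a small amount of state:

```python
from collections import Counter, defaultdict

# window start time (epoch seconds) -> count of events per type
windows = defaultdict(Counter)

def on_event(event):
    # Assumed event shape: {"timestamp": 1467968400.5, "type": "click"}
    window_start = int(event["timestamp"] // 60) * 60
    windows[window_start][event["type"]] += 1

# A real-time dashboard can read `windows` directly, without the events
# ever being loaded into a data warehouse.
```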

Stateless vs. Stateful Stream Processing

In stateless stream processing, the way each event is handled is completely independent of the preceding events. Given an event, the stream processor will treat it exactly the same way every time, no matter what data arrived beforehand.
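For example, a stateless processor might do nothing more than reshape each event (a hedged sketch with assumed field names):

```python
def reshape(event):
    # Stateless: the output depends only on this one event, never on
    # anything that arrived before it, so the same input always yields
    # the same output.
    return {
        "user_id": event["user_id"],
        "path": event["url"].lower().rstrip("/"),
    }
```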

Stateful stream processing means that a "state" is shared between events, so past events can influence the way current events are processed. This state can usually also be queried from outside the stream processing system. For example, stateful systems can keep track of user sessions (aggregating events coming from the same session and emitting session-level metrics only when the session ends), perform aggregated counts (e.g., the number of errors in every time window), and more.
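A minimal sketch of the sessions example (the event fields and the 30-minute inactivity timeout are assumptions; a real system would also need to expire idle sessions):

```python
SESSION_TIMEOUT = 30 * 60  # seconds of inactivity that close a session

# Shared state: user_id -> {"count": ..., "last_seen": ...}
sessions = {}

def track_session(event):
    # Assumed event shape: {"user_id": "u1", "timestamp": 1467968400.5}
    user, ts = event["user_id"], event["timestamp"]
    session = sessions.get(user)
    if session and ts - session["last_seen"] > SESSION_TIMEOUT:
        # The session has ended: emit a session-level metric, then reset.
        print(f"user {user}: session with {session['count']} events")
        session = None
    if session is None:
        session = {"count": 0, "last_seen": ts}
        sessions[user] = session
    session["count"] += 1
    session["last_seen"] = ts
```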

Stateless stream processing is easy to scale up because, by definition, events are processed independently. The stream can be processed by multiple identical processors, with simple load balancing between them. When the system needs to handle a higher throughput of events, you simply launch more processors.
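In code, scaling out a stateless stream can be as simple as round-robin dispatch over a pool of identical workers (a toy sketch; in practice the load balancer and the workers are separate processes or machines):

```python
from itertools import cycle

def make_worker(name):
    # Identical stateless workers: any of them can handle any event.
    return lambda event: print(f"{name} processed {event}")

# Scaling up to a higher throughput just means adding workers to the pool.
workers = [make_worker(f"worker-{i}") for i in range(3)]
next_worker = cycle(workers)

def dispatch(event):
    # Simple round-robin load balancing between identical processors.
    next(next_worker)(event)
```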

The Challenge of Stateful Stream Processing

Stateful stream processing is much more difficult to scale up, because the different workers need to share the state. A simple solution would be to use an external store (such as a database), but then the performance of the external store limits the performance of your stream processing. Another option is to partition the stream: instead of sending events to processors at random, you route each event to a processor according to some attribute of the event.

For example, in the sessions use case, one could send all events from the same user to the same processor. This way, each processor can handle its own state, significantly improving performance. However, this approach means you now have multiple states instead of one, so querying that state from outside of the stream processing engine becomes more complex.
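A sketch of that kind of key-based partitioning (the worker count and routing key are assumptions; real systems usually delegate this to the message broker, e.g. via partitioned topics):

```python
import hashlib

NUM_WORKERS = 4

def partition(event):
    # Hash the user ID so every event from the same user lands on the
    # same worker, which can then keep that user's session state locally.
    key = event["user_id"].encode("utf-8")
    return int(hashlib.md5(key).hexdigest(), 16) % NUM_WORKERS
```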


Published at DZone with permission of Itamar Weiss, DZone MVB. See the original article here.
