Stream Data Processing: The Next 'Big Thing' in Big Data

With a couple of open source projects advertising streaming engines (Flink, Beam, and Apex), we decided to jump in and test one of them out for our data lake customers.

by Siddharth Agarwal · Aug. 25, 2016

Stream data processing seems to be the next ‘big thing’ in Big Data. With a couple of open source projects advertising streaming engines (Flink, Beam, and Apex), we decided to jump in and test one of them out for our data lake customers. Flink seems to be the most mature of the three, having just announced its 1.0.0 release.

Use Case

We want to stream stock prices and compute real-time metrics to report to the user. In this case, our metric is the 5-minute moving price average. To connect this back to a real-world use case: traders sometimes use moving averages (albeit over longer intervals) to get a sense of whether a security is currently underpriced or overpriced.
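To spell the metric out (a sketch; the post does not define it formally): with one price observation per minute, the 5-minute moving average at minute \(t\) is just the mean of the last five prices:

\[ \mathrm{MA}_t = \frac{1}{5}\sum_{i=0}^{4} p_{t-i} \]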

One of the biggest obstacles in financial analysis is finding free sources of data. Fortunately, Google provides a URL for each security that we can hit every minute to get an updated JSON document that includes the current price.

Why Not Spark Streaming?

A line has been drawn in the sand between the stream processing frameworks named previously and Spark Streaming. Behind the scenes, Spark Streaming is really a batch-processing system that appears as a streaming system. It provides the illusion of stream processing by creating ‘micro-batches’ across time that are processed individually. One drawback is that this sometimes splits related data across batches, creating issues when running analysis on data that spans multiple batches. However, for time-agnostic use cases, this might not be an issue at all.

When calculating moving price averages with the batch approach, if one of our prices is delayed for some reason, our moving average for that timeframe would be slightly skewed. However, with a true streaming system like Flink, the order in which events are received does not matter, since operations are performed based on the event time carried in each record.
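To make that concrete, here is a minimal sketch of attaching event-time semantics to a stream of the comma-delimited "time,price" records described later in this post. The class names come from a later Flink release than the 1.0.0 version discussed here, and the 30-second out-of-orderness bound is an arbitrary assumption:

```java
import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;

public class EventTimeSketch {
    // Attach event-time semantics to an existing stream of "time,price"
    // strings: the timestamp embedded in each record drives windowing,
    // so a record arriving late (within the allowed bound) still lands
    // in the window it belongs to and does not skew the average.
    static DataStream<String> withEventTime(DataStream<String> quotes) {
        return quotes.assignTimestampsAndWatermarks(
            WatermarkStrategy
                .<String>forBoundedOutOfOrderness(Duration.ofSeconds(30))
                .withTimestampAssigner((line, ts) ->
                    Long.parseLong(line.split(",")[0])));
    }
}
```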

Pipeline

The Google endpoint serves a JSON object that is updated every minute with properties such as stock price, time, and volume. For our purposes, we only need the stock price and the time. At this point, we have a couple of options for how to send this over to Flink. Flink ships with several built-in ‘stream connectors’, including Kafka, Flume, and Twitter, and it can also receive data from a socket on a specific port. We went with Kafka, but another option would be to push new data to a Netcat process listening on a specific port.
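For comparison, the Netcat route is about as small as a Flink source gets. A sketch, assuming records arrive one per line on a local port (host and port are placeholders):

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SocketSourceSketch {
    public static void main(String[] args) throws Exception {
        // Read newline-delimited records from a socket fed by, e.g.,
        // `nc -lk 9999` running on the same machine.
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<String> lines = env.socketTextStream("localhost", 9999);
        lines.print();
        env.execute("socket source sketch");
    }
}
```

With this job running, `nc -lk 9999` on the same machine lets you type records in by hand, which is handy for quick experiments.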

The Kafka producer program polls the specified Google endpoint every minute and reads the JSON. It then pulls out the fields we need, concatenates them with a comma delimiter, and pushes the result out to a topic. For multiple stock tickers, we could add a key that would allow Kafka to partition the topic by ticker name (‘GOOG’, ‘AAPL’, etc.).
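What that program might look like on the producer side: a minimal sketch with the endpoint polling and JSON parsing elided. The topic name "stock-prices", the broker address, and the hard-coded field values are all placeholders, not details from the original article:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class QuoteProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // In the real program these fields come from the polled JSON;
            // they are hard-coded here to keep the sketch self-contained.
            String ticker = "GOOG";
            long time = System.currentTimeMillis();
            double price = 772.15;

            // The ticker is the key, so Kafka can partition the topic by symbol.
            producer.send(new ProducerRecord<>(
                "stock-prices", ticker, time + "," + price));
        }
    }
}
```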

Flink is watching this Kafka topic, so once the values are pushed in, we can receive them and start working with them.

Flink

[Image: Flink code for the streaming job (FlinkBlog.png)]

The Flink code above first defines a Kafka stream connector, which gives us a DataStream object that Flink transformations can act on. The first transformation splits each received record into a time and a price. We then define 5-minute windows, over which we average the prices received, and return this value to the user.
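The original snippet survives only as a screenshot, so below is a hypothetical reconstruction that follows the description above. It targets a later Flink DataStream API than the 1.0.0 release this post discusses; the class names (FlinkKafkaConsumer, WatermarkStrategy), topic name, and broker address are assumptions carried over from the producer sketch:

```java
import java.util.Properties;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class MovingAverageJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder
        props.setProperty("group.id", "flink-moving-average");    // placeholder

        // Consume the comma-delimited "time,price" records from the
        // (assumed) "stock-prices" topic, using the embedded timestamp
        // as event time.
        DataStream<String> raw = env
            .addSource(new FlinkKafkaConsumer<>(
                "stock-prices", new SimpleStringSchema(), props))
            .assignTimestampsAndWatermarks(WatermarkStrategy
                .<String>forMonotonousTimestamps()
                .withTimestampAssigner((line, ts) ->
                    Long.parseLong(line.split(",")[0])));

        raw
            // Keep (price, 1) so a single reduce can track sum and count.
            .map(line -> Tuple2.of(Double.parseDouble(line.split(",")[1]), 1))
            .returns(Types.TUPLE(Types.DOUBLE, Types.INT))
            // A single ticker here, so no keyBy: 5-minute event-time
            // windows over the whole stream.
            .windowAll(TumblingEventTimeWindows.of(Time.minutes(5)))
            .reduce((a, b) -> Tuple2.of(a.f0 + b.f0, a.f1 + b.f1))
            // Average = price sum / count.
            .map(t -> t.f0 / t.f1)
            .returns(Types.DOUBLE)
            .print();

        env.execute("5-minute moving price average");
    }
}
```

Tumbling windows match the description of discrete 5-minute windows; a sliding window assigner could be swapped in if overlapping moving averages were wanted.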

Conclusion

The above exercise was useful in getting familiar with the mechanics of stream processing using Flink. At Zaloni, we are seeing a growing number of business use cases that call for low-latency processing of unbounded data sets, so we expect to see more widespread adoption of stream-processing frameworks. Maturity, or the lack of it, might have been a fair criticism in the past; that no longer seems to be an issue.

Tags: Big Data, Stream Processing, Data Processing, Kafka

Published at DZone with permission of Siddharth Agarwal. See the original article here.

Opinions expressed by DZone contributors are their own.
