
Stream Data Processing: The Next 'Big Thing' in Big Data

With a couple of open source projects advertising streaming engines - Flink, Beam, and Apex - we decided to jump in and test one of them out for our data lake customers.



Stream data processing seems to be the next ‘big thing’ in Big Data. With a couple of open source projects advertising streaming engines - Flink, Beam, and Apex - we decided to jump in and test one of them out for our data lake customers. Flink seems to be the most mature in the segment, having just announced its 1.0.0 release.

Use Case

We want to stream stock prices and develop real-time metrics to report to the user. In this case, our metric is the 5-minute moving price average. To connect this back to a real-world use case, traders sometimes use moving averages to get a sense of whether a security is currently underpriced or overpriced, albeit over larger time intervals.

One of the biggest obstacles in financial analysis is finding free sources of data. Fortunately, Google provides a URL for each security that we can hit every minute to get an updated JSON payload that includes the current price.
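As an illustration, a quote payload from that endpoint looked roughly like the following (abbreviated and annotated here for readability; the field names are an assumption based on the since-retired endpoint, which also prefixed its response with //):

```
// [
  {
    "t": "GOOG",                        // ticker
    "e": "NASDAQ",                      // exchange
    "l": "726.82",                      // last trade price
    "lt_dts": "2016-04-01T16:00:02Z"    // last trade time
  }
]
```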

Why Not Spark Streaming?

A line has been drawn in the sand between the stream processing frameworks named above and Spark Streaming. Behind the scenes, Spark Streaming is really a batch-processing system that appears as a streaming system. It provides the illusion of stream processing by creating ‘micro-batches’ across time that are processed individually. One drawback is that this sometimes splits related data across batches, creating issues when running analysis on data that spans multiple batches. However, for time-agnostic use cases, this might not be an issue at all.

When calculating moving price averages with a micro-batch system, if one of our prices is delayed for some reason, our moving average for that timeframe would be slightly skewed. With a true streaming system like Flink, however, the order in which events are received does not matter, since we perform operations based on the event time.

Pipeline

The Google endpoint exposes a JSON object that is updated every minute with properties such as the stock price, time, volume, etc. For our purposes, we only need the stock price and the time. At this point, we have a couple of options for how to send this over to Flink. Flink has several built-in ‘stream connectors’, including Kafka, Flume, and Twitter, and it can also read data from a socket on a specific port. We went with Kafka, but another option would be to push new data through a Netcat process listening on a specific port.
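Had we taken the socket route, the Flink side would be nearly a one-liner. A minimal sketch, assuming a local Netcat listener (host, port, and class name are placeholders):

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SocketIngestExample {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Start a listener first, e.g. `nc -lk 9999`, and paste
        // "time,price" lines into it to simulate the feed.
        DataStream<String> lines = env.socketTextStream("localhost", 9999);

        lines.print();
        env.execute("Socket ingest example");
    }
}
```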

The Kafka producer polls the specified Google endpoint every minute and reads the JSON. It then pulls out the fields we need, concatenates them with a comma delimiter, and pushes the result out to a topic. For multiple stock tickers, we could add a key that would allow Kafka to partition the topic by ticker name (‘GOOG’, ‘AAPL’, etc.).
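A minimal sketch of such a poller, using the standard kafka-clients producer API; the topic name, ticker, endpoint URL, and JSON field names here are illustrative assumptions rather than the original code:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class StockPricePoller {

    // Illustrative quote URL; the actual endpoint and its JSON schema
    // depend on the data source you use.
    private static final String QUOTE_URL =
            "https://finance.google.com/finance/info?q=NASDAQ:GOOG";

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            while (true) {
                String json = fetch(QUOTE_URL);
                // A real implementation would use a JSON library; a regex
                // keeps this sketch dependency-free. "lt_dts" (trade time)
                // and "l" (last price) are assumed field names.
                String time = extract(json, "lt_dts");
                String price = extract(json, "l");
                // Key the record by ticker so Kafka can partition the
                // topic when we poll multiple symbols.
                producer.send(new ProducerRecord<>("stock-prices", "GOOG",
                        time + "," + price));
                Thread.sleep(60_000); // poll once a minute
            }
        }
    }

    private static String fetch(String url) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            StringBuilder body = new StringBuilder();
            for (String line; (line = in.readLine()) != null; ) {
                body.append(line);
            }
            return body.toString();
        }
    }

    private static String extract(String json, String field) {
        Matcher m = Pattern.compile("\"" + field + "\"\\s*:\\s*\"([^\"]*)\"")
                .matcher(json);
        return m.find() ? m.group(1) : "";
    }
}
```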

Flink is watching this Kafka topic, so once the values are pushed in, we can receive them and start working with them.

Flink

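The original post showed the Flink job as a screenshot (FlinkBlog.png). A minimal reconstruction of the job it describes, assuming Flink 1.0's Kafka 0.8 connector and the comma-delimited "time,price" records from the poller above:

```java
import java.util.Properties;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer08;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class MovingAverageJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Kafka connection details are placeholders; adjust for your cluster.
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("zookeeper.connect", "localhost:2181");
        props.setProperty("group.id", "stock-average");

        // Each Kafka record is a "time,price" string produced by the poller.
        DataStream<String> raw = env.addSource(
                new FlinkKafkaConsumer08<>("stock-prices", new SimpleStringSchema(), props));

        // Parse each record into a (price, 1) pair so a single window
        // reduce can accumulate the running sum and count together.
        DataStream<Tuple2<Double, Integer>> prices = raw
                .map(new MapFunction<String, Tuple2<Double, Integer>>() {
                    @Override
                    public Tuple2<Double, Integer> map(String record) {
                        String[] fields = record.split(",");
                        return new Tuple2<>(Double.parseDouble(fields[1]), 1);
                    }
                });

        // 5-minute tumbling windows over the whole (single-ticker) stream.
        // A sliding variant, timeWindowAll(Time.minutes(5), Time.minutes(1)),
        // would give a classic moving average. This sketch uses the default
        // processing-time windows; an event-time version would also assign
        // timestamps from each record's time field.
        prices.timeWindowAll(Time.minutes(5))
                .reduce(new ReduceFunction<Tuple2<Double, Integer>>() {
                    @Override
                    public Tuple2<Double, Integer> reduce(Tuple2<Double, Integer> a,
                                                          Tuple2<Double, Integer> b) {
                        return new Tuple2<>(a.f0 + b.f0, a.f1 + b.f1);
                    }
                })
                .map(new MapFunction<Tuple2<Double, Integer>, Double>() {
                    @Override
                    public Double map(Tuple2<Double, Integer> sumAndCount) {
                        return sumAndCount.f0 / sumAndCount.f1;
                    }
                })
                .print();

        env.execute("5-minute moving price average");
    }
}
```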

The above Flink code first defines a Kafka stream connector, which gives us a DataStream object that Flink transformations can act on. The first transformation splits each received record into its time and price fields. We then define 5-minute windows over which we average the prices received, and return that value to the user.

Conclusion

The above exercise was useful in getting familiar with the mechanics of stream processing using Flink. At Zaloni, we are seeing a growing number of business use cases that call for low-latency processing of unbounded data sets, so we expect to see more widespread adoption of stream-processing frameworks. Maturity, or the lack of it, might have been a fair criticism in the past; that no longer seems to be an issue.



Published at DZone with permission of Siddharth Agarwal, DZone MVB.
