What Is Stream Processing? A Gentle Introduction
What Is Stream Processing?
Stream processing is a technology that lets users query continuous data streams and detect conditions quickly, within a small window of time after the data arrives. The detection period varies from a few milliseconds to minutes.
For example, with stream processing, you can query a data stream coming from a temperature sensor and receive an alert when the temperature reaches the freezing point.
It is also known by names such as real-time analytics, streaming analytics, complex event processing, real-time streaming analytics, and event processing. Although some of these terms historically had differences, tools have now converged under the term stream processing. (See this Quora Question for a list of tools.)
It is one of the big data technologies, popularized by Apache Storm, and now there are many contenders.
Why Is Stream Processing Needed?
Big data established the value of insights derived from processing data. Not all such insights are created equal: some have much higher value shortly after something has happened, and that value diminishes very fast with time. Stream processing targets such scenarios. The key strength of stream processing is that it can provide insights faster, often within milliseconds to seconds.
Stream processing was introduced and popularized as a technology that, like Hadoop, processes big data, but gives you results faster.
The following are some secondary reasons for using stream processing.
- Some data naturally comes as a never-ending stream of events. To do batch processing, you need to store it, stop data collection at some point, and process the data. Then you have to do the next batch and worry about aggregating across multiple batches. In contrast, streaming handles never-ending data streams gracefully and naturally. You can detect patterns, inspect results, look at multiple levels of focus, and easily look at data from multiple streams simultaneously.
- Stream processing naturally fits time series data and detecting patterns over time. For example, if you are trying to detect the length of a web session in a never-ending stream (this is an example of trying to detect a sequence), it is very hard to do with batches, as some sessions will fall into two batches. Stream processing handles this easily. If you take a step back and think about it, most continuous data series are time series data; for example, almost all IoT data are time series data. Hence, it makes sense to use a programming model that fits them naturally.
- Batch processing lets the data build up and tries to process it all at once, while stream processing processes data as it comes in, spreading the processing over time. As a result, stream processing can work with a lot less hardware than batch processing. Furthermore, stream processing enables approximate query processing via systematic load shedding, so it fits naturally into use cases where approximate answers are sufficient.
- Sometimes, data is so large that it is not even possible to store it. Stream processing lets you handle firehose-scale data and retain only the useful bits.
- Finally, a lot of streaming data is already available (e.g. customer transactions, activities, website visits), and it will grow faster with IoT use cases (all kinds of sensors). Streaming is a much more natural model for thinking about and programming those use cases.
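To make the session-length point above concrete, here is a minimal single-pass sketch in Python (the 30-second timeout, event format, and function name are illustrative, not from any particular stream processor). It keeps only per-user state for the currently open session, so the stream itself is never stored — exactly the property batch processing lacks:

```python
SESSION_TIMEOUT = 30  # seconds of inactivity that ends a session (hypothetical value)

def session_lengths(events):
    """Single-pass session detection over a stream of (user, timestamp) events.

    Yields (user, session_length_seconds) whenever a gap longer than
    SESSION_TIMEOUT closes a session; still-open sessions are flushed
    at the end. Only per-user state for the open session is kept.
    """
    start = {}  # user -> start time of the currently open session
    last = {}   # user -> timestamp of the most recent event
    for user, ts in events:
        if user in last and ts - last[user] > SESSION_TIMEOUT:
            yield user, last[user] - start[user]  # close the old session
            start[user] = ts                      # open a new one
        elif user not in last:
            start[user] = ts
        last[user] = ts
    for user in start:                            # flush still-open sessions
        yield user, last[user] - start[user]

# Example: user "a" has two sessions (0-10 and 100-105), user "b" one (2-4).
stream = [("a", 0), ("b", 2), ("b", 4), ("a", 10), ("a", 100), ("a", 105)]
print(sorted(session_lengths(stream)))  # [('a', 5), ('a', 10), ('b', 2)]
```

Note that a session spanning what would have been two batches poses no problem here, because the state simply carries forward as events arrive.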
However, streaming is not a tool for all use cases. One good rule of thumb is that if processing needs multiple passes through the full data or requires random access (think of a graph dataset), then it is tricky with streaming. One big missing use case in streaming is training machine learning models. On the other hand, if processing can be done with a single pass over the data or has temporal locality (processing tends to access recent data), then it is a good fit for streaming.
How to Do Stream Processing?
If you want to build an app that handles streaming data and makes real-time decisions, you can either use a tool or build it yourself. The answer depends on how much complexity you plan to handle, how much you want to scale, how much reliability and fault tolerance you need, etc.
If you want to build the app yourself, place events in a message broker topic (e.g. ActiveMQ, RabbitMQ, or Kafka), write code to receive events from topics in the broker (they become your stream), and then publish results back to the broker. Such code is called an actor.
However, instead of coding the above scenario from scratch, you can save time by using a stream processor. An event stream processor lets you write logic for each actor, wire the actors up, and hook the edges up to the data source(s). You can either send events directly to the stream processor or send them via a broker.
An event stream processor will do the hard work of collecting data, delivering it to each actor, making sure the actors run in the right order, collecting results, scaling if the load is high, and handling failures. Examples include Storm, Flink, and Samza. If you would like to build the app this way, please check out their respective user guides.
Since 2016, a new idea called Streaming SQL has emerged. We call a language that enables users to write SQL-like queries against streaming data a Streaming SQL language. Many Streaming SQL languages are on the rise.
- Projects such as WSO2 Stream Processor and SQLStreams have supported SQL for more than five years.
- Apache Storm added support for Streaming SQL in 2016.
- Apache Flink added support for Streaming SQL in 2016.
- Apache Kafka added support for SQL (which they called KSQL) in 2017.
- Apache Samza added support for SQL in 2017.
With Streaming SQL languages, developers can rapidly incorporate streaming queries into their apps. By 2018, most stream processors supported processing data via a Streaming SQL language.
Let’s understand how SQL maps to streams. A stream is table data on the move: think of a never-ending table where new data appears as time goes on. One record or row in a stream is called an event, but it has a schema and behaves just like a database row. To understand these ideas, Tyler Akidau’s talk at Strata is a great resource.
The first thing to understand about Streaming SQL is that it replaces tables with streams.
When you write SQL queries, you query data stored in a database. In contrast, when you write a Streaming SQL query, you write it over data that exists now as well as data that will arrive in the future. Hence, Streaming SQL queries never end. Isn't that a problem? No: it works because the output of those queries is also a stream. An event is placed in the output stream as soon as it matches, so output events are available right away.
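For instance, a filter query that alerts on overheated boilers might look like the following, sketched here in a Siddhi-like Streaming SQL dialect (the stream names, attribute names, and threshold are illustrative, not taken from any specific product):

```sql
-- Illustrative streaming filter: every incoming BoilerStream event whose
-- temperature exceeds the threshold is emitted into AlertStream immediately.
from BoilerStream[temperature > 350]
select boilerId, temperature
insert into AlertStream;
```

The query runs forever; each matching input event produces one output event with no batching delay.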
A stream represents all events that can come through a logical channel, and it never ends. For example, if we have a temperature sensor in a boiler, we can represent the sensor's output as a stream. Classical SQL ingests data stored in a database table, processes it, and writes the results to a database table. A streaming filter query, in contrast, ingests a stream of data as it arrives and produces a stream of data as output. For example, assume the boiler stream contains one event every ten minutes; the filter query will produce an event in the result stream immediately whenever an incoming event matches the filter.
So, you can build your app as follows: send events to the stream processor, either directly or via a broker; write the streaming part of the app using Streaming SQL; and finally, configure the stream processor to act on the results, either by invoking a service when the stream processor triggers or by publishing events to a broker topic and listening to that topic.
There are many stream processors available. (See Quora Question: What are the best stream processing solutions out there?) I would recommend the one I have helped build, WSO2 Stream Processor (WSO2 SP). It can ingest data from Kafka, HTTP requests, and message brokers, and you can query data streams using a Streaming SQL language. WSO2 SP is open source under the Apache license. With just two commodity servers, it can provide high availability and handle 100K+ TPS throughput. It can scale up to millions of TPS on top of Kafka.
Who Is Using Stream Processing?
In general, stream processing is useful in use cases where we can detect a problem and have a reasonable response that improves the outcome. It also plays a key role in a data-driven organization.
The following are some of the use cases.
- Algorithmic trading, stock market surveillance
- Smart patient care
- Monitoring a production line
- Supply chain optimizations
- Intrusion, surveillance, and fraud detection (e.g. Uber)
- Most smart device applications (e.g. smart cars, smart homes... this list goes on)
- Smart grids (e.g. load prediction and outlier plug detection; see Smart Grids: 4 billion events, throughput in the range of 100Ks)
- Geofencing, vehicle, and wildlife tracking (e.g. TFL London)
- Sports analytics: augment sports with real-time analytics (this is work we did with a real football game)
- Context-aware promotions and advertising
- Computer system and network monitoring
- Traffic monitoring
- Predictive maintenance
- Geospatial data processing
For more discussions about how to use stream processing, please refer to 13 Stream Processing Patterns for Building Streaming and Real-Time Applications.
Hope this was useful. If you enjoyed this post, you might also like Stream Processing 101: From SQL to Streaming SQL and Patterns for Streaming Realtime Analytics.
I write at https://medium.com/@srinathperera.
Published at DZone with permission of Srinath Perera, DZone MVB. See the original article here.