Towards a Unified Data Processing Framework: Batch as a Special Case of Streaming With Apache Flink
Two devs take a look at Apache Flink, an OSS project that gives you the ability to model batch processing, real-time data processing, and event-driven applications.
The Apache Flink project has long followed the philosophy of taking a unified approach to batch and stream data processing, building on the core paradigm of “continuous processing of unbounded data streams.” If you think about it, offline processing of bounded data sets naturally fits this paradigm: these are just streams of recorded data that happen to end at some point in time.
Flink is not alone in this: other open source projects, such as Apache Beam, also embrace “streaming first, with batch as a special case of streaming.” This philosophy has often been cited as a powerful way to greatly reduce the complexity of data infrastructures by building data applications that generalize across real-time and offline processing.
Figure 1. Batch as a special case of streaming: processing bounded and unbounded data streams in Apache Flink.
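To make the idea concrete, here is a minimal sketch in plain Python (not Flink code). The function name `running_sum` and the sample inputs are our own illustrative assumptions. It shows one piece of processing logic applied unchanged to a bounded input, which simply ends, and to an unbounded one, which never does.

```python
from itertools import count, islice
from typing import Iterable, Iterator


def running_sum(stream: Iterable[int]) -> Iterator[int]:
    """Emit the running total after every record; never assumes the input ends."""
    total = 0
    for record in stream:
        total += record
        yield total


# Bounded input: a recorded data set. The stream simply ends.
bounded_results = list(running_sum([3, 1, 4]))

# Unbounded input: count(1) never ends, so we only look at the first five results.
unbounded_results = list(islice(running_sum(count(1)), 5))

print(bounded_results)    # [3, 4, 8]
print(unbounded_results)  # [1, 3, 6, 10, 15]
```

The logic does not need to know which kind of input it received. What a runtime can do differently for the bounded case is the subject of the rest of this article.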
Why Are There Still Batch Processors?
Being able to process batch as a special case of streaming does not mean that any stream processor is now the right tool for your batch processing use cases, or that batch processors were rendered obsolete. In fact:
- Pure stream processing systems are inefficient and slow at batch processing workloads. No one would consider it a good idea to use a stream processor that ingests data from a message queue to analyze large amounts of static data.
- Unified APIs often delegate to different runtimes depending on whether the data is continuous or bounded. For example, the implementations of the batch and streaming runtime of Google Cloud Dataflow are different, in order to tune the desired performance and resilience for each case.
- Apache Flink has a streaming API that can run bounded and unbounded use cases, but still offers a separate DataSet API and runtime stack that is faster for batch use cases.
Does this then mean that the fundamental paradigm of a unified API for batch and streaming data processing is...plain wrong? The answer is simple: nothing is wrong there. But, while batch processors were specifically built for that special purpose, competitively handling batch processing use cases in a unified runtime means that certain characteristics of that “special case” need to be exploited.
Batch on Top of a Streaming Runtime
As the original creators of Flink, we have always believed that it is possible to have a runtime that is state-of-the-art for stream processing and batch processing use cases simultaneously; a runtime that is streaming-first, but can exploit just the right amount of special properties of bounded streams to be as fast for batch use cases as dedicated batch processors.
Figure 2. Apache Flink supports both streaming and batch analytics applications.
This is the unique approach of Apache Flink: it has a network stack that supports both low-latency, high-throughput streaming data exchanges and high-throughput batch shuffles. Although Flink has streaming runtime operators to continuously process unbounded data, it also has specialized operators for bounded inputs that are used when you choose the DataSet API or select the batch environment in the Table API. Because of that, Flink has demonstrated some pretty impressive batch processing performance from the get-go.
But...What Is Missing Still?
While we have made significant progress over the years, we still have a few steps to take in order to evolve Flink into a system for truly unified, state-of-the-art stream and batch processing. We will be introducing a few more enhancements including the following features that will be key to realizing our vision:
- A truly unified runtime operator stack. Currently, the bounded and unbounded operators have a different data consumption and threading model and do not mix and match. In a unified stack, streaming operators that continuously ingest data from all inputs to ensure low processing latencies are the foundation. However, when operating on bounded data, the API or the SQL query optimizer can also select operators that are optimized for high throughput and not low latency. The optimizer can pick, for example, a hybrid-hash join operator that first consumes one (bounded) input stream entirely before reading the second input stream.
- Exploiting bounded streams to reduce the scope of fault tolerance. When input data is bounded, it is possible to completely buffer data during shuffles (in memory or on disk). In case of a failure, this data can be replayed. Buffering shuffled data makes recovery more fine-grained and thus much more efficient.
- Exploiting bounded stream operator properties for scheduling. By definition, a continuous unbounded streaming application needs all operators running at the same time. An application on bounded data can schedule operations one after another, depending on how the operators consume data — for example: first, build a hash table from one input, then probe the hash table from the other input. Smart scheduling of operators can significantly improve the resource utilization and efficiency.
- Subsuming the DataSet API by the DataStream API. The DataStream API will be extended with the concept of bounded streams and operations that fully subsume the DataSet API. We plan to deprecate and eventually remove the DataSet API.
- Improving performance and coverage for batch SQL. SQL is the de facto standard data language and, to be competitive with the best-in-class batch engines, Flink needs to cover more SQL features and deliver better query execution performance. While the core data plane in Flink is already very efficient, the speed of SQL execution ultimately also depends on the query optimizer, high-performance operator implementations, and efficient code generation.
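The difference between a latency-oriented streaming join and a throughput-oriented join on bounded input can be sketched in a few lines of plain Python. This is an illustration under our own assumptions, not Flink's operator code: the function names and sample data are hypothetical, and the second function is a simple in-memory build-then-probe join rather than a full hybrid-hash join, which would also spill partitions to disk.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Row = Tuple[str, str]            # (join key, payload)
Joined = Tuple[str, str, str]    # (join key, left payload, right payload)


def symmetric_hash_join(events: Iterable[Tuple[str, Row]]) -> Iterator[Joined]:
    """Streaming style: records from both sides arrive interleaved.

    Each record is stored in its own side's table and immediately probed
    against the other side's table, so state is kept for both inputs.
    """
    left: Dict[str, List[str]] = defaultdict(list)
    right: Dict[str, List[str]] = defaultdict(list)
    for side, (key, value) in events:
        if side == "L":
            left[key].append(value)
            for other in right.get(key, []):
                yield (key, value, other)
        else:
            right[key].append(value)
            for other in left.get(key, []):
                yield (key, other, value)


def build_probe_hash_join(build: Iterable[Row], probe: Iterable[Row]) -> Iterator[Joined]:
    """Batch style: only valid when the build input is bounded.

    The build side is consumed entirely first; the probe side is then
    streamed through without being stored.
    """
    table: Dict[str, List[str]] = defaultdict(list)
    for key, value in build:
        table[key].append(value)
    for key, value in probe:
        for built in table.get(key, []):
            yield (key, built, value)


users = [("u1", "Ada"), ("u2", "Bob")]
orders = [("u1", "book"), ("u2", "pen"), ("u1", "lamp")]

# The same records, arriving interleaved as they would on live streams.
interleaved = [("R", orders[0]), ("L", users[0]), ("R", orders[1]),
               ("L", users[1]), ("R", orders[2])]

streaming_result = sorted(symmetric_hash_join(interleaved))
batch_result = sorted(build_probe_hash_join(users, orders))
print(streaming_result == batch_result)  # True
```

Both variants produce the same joined rows. The build-then-probe variant keeps state for only one input, but it can start emitting results only after that input has ended, which is exactly the property a bounded stream provides.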
Blink is a fork of Apache Flink, originally created inside Alibaba to improve the framework’s behavior for internal use cases. Recently, Alibaba contributed its fork, Blink, back to the Apache Flink project. Blink adds a series of significant improvements and integrations, many of which overlap with the features mentioned above as key to a unified streaming and batch system. Given the sheer amount of modifications introduced with Blink, the community has drafted a merge plan to ensure a smooth, non-disruptive integration of the contributed code. This plan focuses initially on improving the batch processing features of Flink, considering that the SQL query processor is the component that evolved the most compared to the latest Flink master branch.
- Unified Stream Operators. Blink extends the Flink streaming runtime operator model to support selective reading from different inputs while keeping the push model for very low latency. This control over the inputs makes it possible to support algorithms like hybrid-hash joins on the same operator and threading model as continuous symmetric joins through RocksDB, and it forms the basis for future features like “Side Inputs.”
- Table API and SQL Query Processor. While Flink translates queries either into DataSet or DataStream programs, depending on the characteristics of their inputs, Blink translates queries to a data flow of stream operators. These stream operators are more aggressively chained, and the common data structures — sorters, hash tables — and serializers are extended to go even further, operating on binary data and saving serialization overhead. It also adds a wider variety of runtime operators for common SQL operations, such as semi- and anti-joins, and many more optimization rules to the SQL query optimizer, including, but not limited to, join reordering.
- Improved Scheduling and Failure Recovery. Blink implements several improvements for task scheduling and fault tolerance. The scheduling strategies use resources better by exploiting how the operators process their input data. The failover strategies become more fine-grained along the boundaries of persistent shuffles. A failed JobManager can be replaced without restarting a running application.
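The idea of failing over along the boundaries of persistent shuffles can be sketched as a toy model in plain Python. Everything here is an assumption made for illustration (`run_pipeline`, `shuffle_store`, the simulated failure); it is not how Flink or Blink implement recovery. Each stage's output is persisted at a boundary, so a retry re-executes only the stages behind the last intact result.

```python
from typing import Callable, Dict, List

Stage = Callable[[List[int]], List[int]]


def run_pipeline(source: List[int], stages: List[Stage],
                 shuffle_store: Dict[int, List[int]],
                 executions: List[int]) -> List[int]:
    """Run stages in order, persisting each stage's output in shuffle_store.

    A stage whose output is already persisted is skipped, which models
    restarting only the part of the job behind the last intact shuffle.
    """
    data = source
    for index, stage in enumerate(stages):
        if index in shuffle_store:
            data = shuffle_store[index]   # reuse buffered shuffle data
            continue
        data = stage(data)                # may raise to simulate a task failure
        executions[index] += 1
        shuffle_store[index] = data       # persist the result at the boundary
    return data


failures_left = {"count": 1}


def flaky_sum(values: List[int]) -> List[int]:
    """Final stage that fails once before succeeding (simulated failure)."""
    if failures_left["count"] > 0:
        failures_left["count"] -= 1
        raise RuntimeError("simulated task failure")
    return [sum(values)]


stages: List[Stage] = [lambda xs: [x * 2 for x in xs], sorted, flaky_sum]
store: Dict[int, List[int]] = {}
executions = [0, 0, 0]

try:
    result = run_pipeline([3, 1, 2], stages, store, executions)
except RuntimeError:
    # Retry: the first two stages are served from the persisted shuffle data.
    result = run_pipeline([3, 1, 2], stages, store, executions)

print(result)      # [12]
print(executions)  # [1, 1, 1]
```

Without the persisted intermediate results, the retry would have to recompute every stage from the source. With them, each of the first two stages runs exactly once despite the failure.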
Picking up speed from Alibaba’s contribution of Blink, the open source project is taking the next step toward a unified runtime and toward becoming a stream processor that can compete neck and neck with dedicated batch processing systems on their home turf: Online Analytical Processing (OLAP) and SQL. If you would like to dive deeper into this subject and follow the progress of the project, we encourage you to subscribe to the Apache Flink Development Mailing List.
With its ability to model batch processing, real-time data processing, and event-driven applications in the exact same way, while offering high performance and consistency, we believe that stream processing with Apache Flink will become the foundation of the data processing stack of the future.
Published at DZone with permission of Fabian Hueske. See the original article here.