Apache Flink is an open-source platform from the Apache Software Foundation for large-scale distributed stream and batch data processing. It provides data distribution, communication, and fault tolerance for distributed computations over data streams, and it integrates with other open-source big data processing tools for data input and output as well as deployment. Apache Flink provides several APIs for creating real-time streaming applications on the Flink engine:
- DataStream API for unbounded streams, embedded in Java and Scala
- DataSet API for static data, embedded in Java, Scala, and Python
- Table API with an SQL-like expression language, embedded in Java and Scala
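Flink's own APIs live in Java, Scala, and Python and require the Flink runtime, so as a language-neutral illustration, here is a minimal plain-Python sketch (not the real Flink API; the `MiniStream` class is hypothetical) that mimics the fluent, DataStream-style transformations the list above refers to:

```python
# Hypothetical, Flink-inspired sketch (NOT the real Flink API): a toy,
# in-memory stand-in for a stream with DataStream-style fluent operators.

class MiniStream:
    """A toy in-memory stand-in for a stream of records."""

    def __init__(self, records):
        self.records = list(records)

    def map(self, fn):
        # Apply fn to every record, analogous to DataStream.map.
        return MiniStream(fn(r) for r in self.records)

    def filter(self, predicate):
        # Keep only records matching the predicate.
        return MiniStream(r for r in self.records if predicate(r))

    def collect(self):
        # Materialize results (a testing sink, not a production pattern).
        return self.records


words = MiniStream(["flink", "spark", "flink", "storm"])
result = (words
          .map(lambda w: (w, 1))
          .filter(lambda kv: kv[0] == "flink")
          .collect())
print(result)  # [('flink', 1), ('flink', 1)]
```

In the real DataStream API the same fluent style applies, but the pipeline is lazily built and executed by the Flink engine rather than eagerly in memory.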
In this article, we will see how Apache Flink delivers high-performance, low-latency streaming, which can be considered a new wave in real-time streaming application development.
High Performance & Low-Latency Streaming
If you are already familiar with Apache Spark, you have probably noticed that Spark Streaming's micro-batch architecture is only near real time (NRT) in action. Stream processing in Apache Flink, by contrast, is genuinely real time: events are processed as they arrive rather than in small batches. Hence, at its core Apache Flink is a high-throughput, low-latency stream processing framework that also supports batch processing.
From a technical point of view, high throughput rates and low latency can be achieved via Flink's data streaming runtime with minimal configuration and effort, as shown in Figure 1. Flink supports stream processing and windowing with event time semantics (ETS), which makes it easy to compute over streams where events arrive out of order or delayed.
Figure 1: High throughput and low-latency streaming with Apache Flink (image credit: https://flink.apache.org/features.html#streaming)
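To see why event time semantics makes out-of-order streams easy to handle, here is an illustrative plain-Python sketch (not Flink code; the function name is invented for this example). Each event is assigned to a tumbling window based on its own event timestamp, so arrival order does not change the result:

```python
# Illustrative sketch (NOT Flink code): tumbling event-time windows computed
# from each event's own timestamp, so out-of-order arrival does not change
# the result -- the core idea behind event time semantics.

from collections import defaultdict

def tumbling_window_counts(events, window_size):
    """events: iterable of (event_time, value) pairs in *arrival* order."""
    windows = defaultdict(int)
    for event_time, _value in events:
        # Assign the event to the window its event time belongs to,
        # regardless of when it actually arrived.
        window_start = (event_time // window_size) * window_size
        windows[window_start] += 1
    return dict(windows)

# Events arrive out of order (timestamps 12, 3, 7, 15), yet each lands in
# the correct 10-second window.
arrived = [(12, "a"), (3, "b"), (7, "c"), (15, "d")]
print(tumbling_window_counts(arrived, window_size=10))
# {10: 2, 0: 2}
```

Real Flink additionally uses watermarks to decide when a window's results can be emitted despite possible late events; this sketch omits that machinery.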
Apache Flink: A New Wave in Stream Processing
As already mentioned, Flink provides highly flexible streaming windows for its continuous streaming model, so that batch and real-time streaming can be integrated into one system.
Highly Flexible Streaming Windows for Continuous Streaming Model
Flink supports windows over time, counts, or sessions, as well as data-driven windows. Windows can be customized with flexible triggering conditions to support sophisticated streaming patterns. Data streaming applications are executed with continuous, long-lived operators. Unlike Spark, Flink's streaming runtime has natural flow control: when a data sink is slow, backpressure propagates upstream and throttles faster sources.
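Session windows are the least familiar of the window types above, so here is a hedged plain-Python sketch (not Flink code; the function is invented for illustration) of the underlying idea: a session closes whenever the gap between consecutive events exceeds a configured threshold.

```python
# Hypothetical sketch (plain Python, NOT Flink): grouping event timestamps
# into session windows -- a new session starts whenever the gap to the
# previous event exceeds `gap`.

def session_windows(timestamps, gap):
    sessions = []
    current = []
    for t in sorted(timestamps):
        if current and t - current[-1] > gap:
            sessions.append(current)  # gap exceeded: close the session
            current = []
        current.append(t)
    if current:
        sessions.append(current)      # close the final open session
    return sessions

print(session_windows([1, 2, 3, 10, 11, 30], gap=5))
# [[1, 2, 3], [10, 11], [30]]
```

In Flink, the same grouping is expressed declaratively (session windows with a gap), and custom triggers decide when a window's contents are evaluated.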
Batch and Streaming in One System
Apache Flink provides a single runtime for both streaming and batch processing. As a result, one common runtime serves data streaming applications and batch processing applications alike, as shown in Figure 2, where batch processing applications run efficiently as special cases of stream processing applications.
Most importantly, Flink's DataStream API supports functional transformations on data streams, with user-defined state and flexible windows. Flink's DataSet API, in turn, lets you write elegant, type-safe, and maintainable code in Java or Scala. Furthermore, it supports a wide range of data types beyond key/value pairs, along with a wealth of operators.
Figure 2: Unified Runtime for Batch and Stream Data Analysis (image credit: https://flink.apache.org/features.html#streaming)
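The "batch as a special case of streaming" idea can be made concrete with a small plain-Python sketch (not Flink code; the function is invented for illustration): the same aggregation logic serves both a streaming job, which emits a running result per element, and a batch job, which is just the final result over the bounded input.

```python
# Sketch of the "batch is a bounded stream" idea (NOT Flink code): one piece
# of aggregation logic serves both the streaming view (a running result per
# element) and the batch view (one final result over the bounded input).

def running_sums(stream):
    """Streaming view: emit the aggregate after every element."""
    total = 0
    for x in stream:
        total += x
        yield total

batch_input = [1, 2, 3, 4]               # a bounded "batch" data set
stream_results = list(running_sums(batch_input))
batch_result = stream_results[-1]        # batch = final state of the bounded stream

print(stream_results)  # [1, 3, 6, 10]
print(batch_result)    # 10
```

A unified runtime exploits exactly this relationship: once bounded input is modeled as a stream that ends, batch programs need no separate execution engine.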
Flink and the Streaming Fault-Tolerance
For streaming application development, Flink takes a novel approach: it draws periodic snapshots of the streaming dataflow state and uses them for recovery. This mechanism is both efficient and flexible. For batch processing programs, Flink remembers the program's sequence of transformations and can restart failed jobs from the point of failure.
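To make the snapshot-based recovery idea tangible, here is a toy plain-Python illustration (not Flink's actual implementation; the function and checkpoint interval are invented for this example): operator state is checkpointed periodically, and after a failure the job restores the latest snapshot and replays only the events that came after it.

```python
# Toy illustration (NOT Flink's implementation) of recovery from periodic
# state snapshots: checkpoint a running sum every `interval` events, then
# restore from a checkpoint and replay only the remaining events.

def process(events, interval):
    state, checkpoints = 0, []
    for i, e in enumerate(events, start=1):
        state += e
        if i % interval == 0:
            checkpoints.append((i, state))  # periodic snapshot of operator state
    return state, checkpoints

events = [5, 1, 4, 2, 7, 3]
final_state, checkpoints = process(events, interval=2)

# Simulated failure after event 5: restore the snapshot taken at event 4
# and replay only events 5..6 instead of reprocessing the whole stream.
resume_pos, restored_state = checkpoints[1]      # snapshot (4, 12)
recovered = restored_state + sum(events[resume_pos:])
print(recovered)  # 22, identical to processing the full stream
```

Flink's real mechanism (asynchronous barrier snapshotting) coordinates snapshots across distributed operators without stopping the stream, which this single-operator toy necessarily glosses over.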
Flink and the Broad Integration with Other Big Data Technologies
Flink integrates with many other projects in the open-source data processing ecosystem to make streaming analytics much easier. Flink runs on YARN, works with the Hadoop Distributed File System (HDFS), can consume stream data from Kafka, can execute Hadoop program code, and connects to various other data storage systems, as shown in Figure 3.
Since Flink comes with its own runtime rather than building on top of MapReduce, it can be considered an alternative to Hadoop's MapReduce component, and consequently it can work independently of the Hadoop ecosystem. More interestingly, Flink can also access HDFS for reading and writing data, and it can use Hadoop's next-generation resource manager, YARN, to provision cluster resources. Because most Flink users store their data in HDFS, Flink already ships the libraries required to access it.
Figure 3: Integration support in Flink with other big data technologies (image credit: https://flink.apache.org/features.html#streaming)
Since Apache Flink is a relatively new technology and less familiar than Apache Spark, production deployments and wide acceptance will take more time. However, it can be viewed as the 4G of real-time stream processing frameworks.
In my next article, I will show how to package an Apache Flink streaming application with all its required dependencies for wider portability, so that a Flink job can be executed in a platform-independent manner.