DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports Events Over 2 million developers have joined DZone. Join Today! Thanks for visiting DZone today,
Edit Profile Manage Email Subscriptions Moderation Admin Console How to Post to DZone Article Submission Guidelines
View Profile
Sign Out
Refcards
Trend Reports
Events
Zones
Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
  1. DZone
  2. Data Engineering
  3. Big Data
  4. Apache Flink: A New Wave to Real-time Stream Processing

Apache Flink: A New Wave to Real-time Stream Processing

Let's see how Apache Flink provides high performance and low-latency streaming that can be considered as a new wave in the real-time streaming application development.

Md. Rezaul Karim user avatar by
Md. Rezaul Karim
·
Jan. 02, 17 · Opinion
Like (7)
Save
Tweet
Share
11.92K Views

Join the DZone community and get the full member experience.

Join For Free

Apache Flink is an open source platform from Apache Software Foundation for large-scale distributed stream and batch data processing that provides data distribution, communication, and fault tolerance for distributed computations over data streams. Flink can be integrated with other open-source and big data processing tools both for data input and output as well as deployment. Apache Flink provides several APIs for creating real-time streaming applications by utilizing Flink engine as follows:

  1. DataStream API for unbounded streams embedded in Java and Scala
  2. DataSet API for static data embedded in Java, Scala, and Python,
  3. Table API with an SQL-like expression language embedded in Java and Scala.

In this article, we will see how Apache Flink provides high performance and low-latency streaming that can be considered as a new wave in the real-time streaming application development. 

High Performance & Low-Latency Streaming

If you are already familiar with Apache Spark then you probably experienced one issue with Spark streaming which is a micro-batch architecture is near real-time (NRT) in action. On the contrary, stream processing in Apache Flink is simply real time. Hence,  the primitive concept of Apache Flink is the high-throughput and low-latency stream processing framework which also supports batch processing. 

From the technical point of view, high throughput rates and low latency can be achieved via Flink's data streaming runtime with minimal configuration and effort as shown in Figure 1. Flink supports stream processing and windowing with event time semantics (ETS), which makes it easy to compute over streams where events arrive out of order, and where events may arrive delayed. 

Figure 1: High throuphput and low-latency streaming with Apache FlinkFigure 1: High throughput and low-latency streaming with Apache Flink (image credit: https://flink.apache.org/features.html#streaming)

Apache Flink: A New Wave in Stream Processing

As already mentioned that Flink provides highly flexible streaming window for the continuous streaming model so that both the batch and the real-time streaming can be integrated into one system. 

Highly Flexible Streaming Windows for Continuous Streaming Model

Flink supports windows over time, count, or sessions, as well as data-driven windows. Windows can be customized with flexible triggering conditions, to support sophisticated streaming patterns. Data streaming applications are executed with continuous and long-lived operators. Other than the Spark, Flink's streaming runtime has the natural flow control for supporting slow data sinks backpressure for the faster sources.

Batch and Streaming in One System

Apache Flink provides a single runtime for the streaming and as well batch processing. As a result, one common runtime is utilized for data streaming applications and batch processing applications as shown in Figure 2. Where the batch processing applications run efficiently as special cases of stream processing applications.

Most importantly, the DataStream API of Flink supports functional transformations on data streams, with the user-defined state, and flexible windows. Secondly, Flink's DataSet  API lets you write beautiful type-safe and maintainable code in Java or Scala. Furthermore, it supports a wide range of data types beyond key/value pairs and a wealth of operators.

Image title

Figure 2: Unified Runtime for Batch and Stream Data Analysis (image credit: https://flink.apache.org/features.html#streaming)

Flink and the Streaming Fault-Tolerance

For streaming application development, Flink has a novel approach to drawing periodic snapshots of the streaming data flow state and use those for recovery. This mechanism is both efficient and flexible. For batch processing programs Flink remembers the program’s sequence of transformations and can restart failed jobs from the point of the failure.

Flink and the Broad Integration with Other Big Data Technologies

Flink is integrated with many other open-source data processing ecosystem to make the streaming analytics much easier. Flink runs on YARN, works with Hadoop’s distributed file system (HDFS), streams data can be fetched from Kafka, can execute Hadoop program code, and connects to various other data storage systems as shown in Figure 3.

Since Flink comes with its own runtime rather than building on top of MapReduce and it is a data processing system,  it also can be considered an alternative to Hadoop’s MapReduce component. Consequently, it can work independently of the Hadoop ecosystem. More interestingly, Flink can also access HDFS for reading and writing the data. Furthermore, Hadoop’s next-generation resource manager (e.g.YARN) to provision cluster resources. Since most Flink users are using Hadoop HDFS to store their data, Flink already ships the required libraries to access HDFS. 

Figure: Integration support in Flink with other big data technologies

Figure 3: Integration support in Flink with other big data technologies (image credit: https://flink.apache.org/features.html#streaming)

Conclusion

Since, Apache Flink is relatively a newer technology and less familiar compared to Apache Spark, the production deployments, and wide acceptance will take more time. However, it can be viewed as the 4G in the real-time stream processing framework.

In my next article, I will show how to packaging an Apache Flink streaming application with all the required dependencies to achieve wider portability so that Flink job can be executed platform independent manner. 

Stream processing Apache Flink Big data

Opinions expressed by DZone contributors are their own.

Popular on DZone

  • Kotlin Is More Fun Than Java And This Is a Big Deal
  • A Complete Guide to AngularJS Testing
  • How Observability Is Redefining Developer Roles
  • Unlocking the Power of Polymorphism in JavaScript: A Deep Dive

Comments

Partner Resources

X

ABOUT US

  • About DZone
  • Send feedback
  • Careers
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 600 Park Offices Drive
  • Suite 300
  • Durham, NC 27709
  • support@dzone.com
  • +1 (919) 678-0300

Let's be friends: