If you look at this image with a list of big data tools, it may seem that all possible niches in this field are already occupied. With so much competition, it should be very tough to come up with groundbreaking technology.
Apache Flink creators have a different thought about this. It started as a research project called Stratosphere. Stratosphere was forked, and this fork became what we know as Apache Flink. In 2014, it was accepted as an Apache Incubator project, and just a few months later, it became a top-level Apache project. At the time of this writing, the project has almost twelve thousand commits and more than 300 contributors.
Why is there so much attention? This is because Apache Flink was called a new generation big data processing framework and has enough innovations under its belt to replace Apache Spark and become the new de-facto tool for batch and stream processing.
Should you switch to Apache Flink? Should you stick with Apache Spark for a while? Or is Apache Flink just a new gimmick? This article will attempt to give you answers to these and other questions.
Unless you have been living under a rock for the last couple of years, you have heard about Apache Spark. It looks like every modern system that does any kind data processing is using Apache Spark in one way or another.
For a long time, Spark was the latest and greatest tool in this area. It delivered some impressive features comparing to its predecessors such as:
- Impressive speed: It is ten times faster than Hadoop if data is processed on a disk and up to 100 times faster if data is processed in memory.
- Simpler directed acyclic graph model: Instead of defining your data processing jobs using rigid MapReduce framework Spark allows defining a graph of tasks that can implement complex data processing algorithms
- Stream processing: With the advent of new technologies such as the Internet of Things, it is not enough to simply to process a huge amount of data. Now, we need processing a huge amount of data as it arrives in real time. This is why Apache Spark has introduced stream processing that allows processing a potentially infinite stream of data.
- Rich set of libraries: In addition to its core features, Apache Spark provides powerful libraries for machine learning, graph processing, and performing SQL queries.
To get a better idea of how you write applications with Apache Spark, let's take a look at how you can implement a simple word count application that would count how many times each word was used in a text document:
// Read file val file = sc.textFile("file/path") val wordCount = file // Extract words from every line .flatMap(line => line.split(" ")) // Convert words to pairs .map(word => (word, 1)) // Count how many times each word was used .reduceByKey(_ + _)
If you know Scala, this code should seem straightforward and is similar to working with regular collections. First, we read a list of lines from a file located in file/path". This file can be either a local file or a file in HDFS or S3.
Then, every line is split into a list of words using the
flatMap method that simply splits a string by the space symbol. Then, to implement the word counting, we use the
map method to convert every word into a pair where the first element of the pair is a word from the input text and the second element is simply a number one.
The last step simply counts how many times each word was used by summing up numbers for all pairs for the same word.
Apache Spark seems like a great and versatile tool. But what does Apache Flink brings to the table?
At first glance, there does not seem to be many differences. The architecture diagram looks very similar:
If you take a look at the code example for the word count application for Apache Flink, you would see that there is almost no difference:
val file = env.readTextFile("file/path") val counts = file .flatMap(line => line.split(" ")) .map(word => (word, 1)) .groupBy(0) .sum(1)
Few notable differences, is that in this case we need to use the
readTextFile method instead of the
textFile method and that we need to use a pair of methods:
sum instead of
So what is all the fuss about? Apache Flink may not have any visible differences on the outside, but it definitely has enough innovations, to become the next generation data processing tool. Here are just some of them:
- Implements actual streaming processing: When you process a stream in Apache Spark, it treats it as many small batch problems, hence making stream processing a special case. Apache Flink, in contrast, treats batch processing as a special and does not use micro batching.
- Better support for cyclical and iterative processing: Flink provides some additional operations that allow implementing cycles in your streaming application and algorithms that need to perform several iterations on batch data.
- Custom memory management: Apache Flink is a Java application, but it does not rely entirely on JVM garbage collector. It implements custom memory manager that stores data to process in byte arrays. This allows reducing the load on a garbage collector and increased performance. You can read about it in this blog post.
- Lower latency and higher throughput: Multiple tests done by third parties suggest that Apache Flink has lower latency and higher throughput than its competitors.
- Powerful windows operators: When you need to process a stream of data in most cases you need to apply a function to a finite group of elements in a stream. For example, you may need to count how many clicks your application has received in each five-minute interval, or you may want to know what was the most popular tweet on Twitter in each ten-minute interval. While Spark supports some of these use-cases, Apache Flink provides a vastly more powerful set of operators for stream processing.
- Implements lightweight distributed snapshots: This allows Apache Flink to provide low overhead and only-once processing guarantees in stream processing, without using micro batching as Spark does.
So, you are working on a new project, and you need to pick a software for it. What should ypi use? Spark? Flink?
Of course, there is no right or wrong answer here. If you need to do complex stream processing, then I would recommend using Apache Flink. It has better support for stream processing and some significant improvements.
If you don't need bleeding-edge stream processing features and want to stay on the safe side, it may be better to stick with Apache Spark. It is a more mature project it has a bigger user base, more training materials, and more third-party libraries. But keep in mind that Apache Flink is closing this gap by the minute. More and more projects are choosing Apache Flink as it becomes a more mature project.
If on the other hand, you like to experiment with the latest technology, you definitely need to give Apache Flink a shot.
Does all this mean that Apache Spark is obsolete and in a couple of years we all are going to use Apache Flink? The answer may surprise you. While Flink has some impressive features, Spark is not staying the same. For example, Apache Spark introduced custom memory management in 2015 with the release of project Tungsten, and since then, it has been adding features that were first introduced by Apache Flink. The winner is not decided yet.
In the upcoming blog posts I will write more about how you can use Apache Flink for batch and stream processing, so stay tuned!