
Kafka Stream (KStream) vs Apache Flink

By Preetdeep Kumar · Apr. 07, 2020 · Analysis


Overview

Two of the most popular and fastest-growing frameworks for stream processing are Apache Flink (since 2015) and Kafka's Streams API (since 2016, in Kafka v0.10). Both are open-source Apache projects and are quickly replacing Spark Streaming, the traditional leader in this space.

In this article, I will share the key differences between these two methods of stream processing, with code examples. A few articles on this topic cover the high-level differences, such as [1], [2], and [3], but not much information is available through code examples.

In this post, I will take a simple problem, provide code for it in both frameworks, and compare them. Before we start with the code, the following are my observations from when I started learning KStream.

[Image: Kafka Stream vs Flink comparison]

Example 1

The following are the steps in this example:

  1. Read a stream of numbers from a Kafka topic. The numbers are produced as strings surrounded by "[" and "]", and all records are produced with the same key (a producer sketch follows this list).
  2. Define a tumbling window of five seconds.
  3. Reduce (append the numbers as they arrive).
  4. Print to the console.
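
For reference, here is a minimal sketch of the kind of test-data producer these steps assume; the broker address, class name, and fixed key are assumptions, and any producer writing bracketed strings under one key would do:

Java

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class NumberProducer
{
    public static void main(String[] args) throws Exception
    {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props))
        {
            for (int i = 0; i < 100; i++)
            {
                // Every record shares the same key; values look like "[7]".
                producer.send(new ProducerRecord<>("Topic-IN", "key", "[" + i + "]"));
                Thread.sleep(1000); // spread records across 5-second windows
            }
        }
    }
}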

Kafka Stream Code

Java
static String TOPIC_IN = "Topic-IN";

final StreamsBuilder builder = new StreamsBuilder();

// Read from the input topic, group by key, apply a 5-second
// tumbling window, and concatenate the values as they arrive.
builder
    .stream(TOPIC_IN, Consumed.with(Serdes.String(), Serdes.String()))
    .groupByKey()
    .windowedBy(TimeWindows.of(Duration.ofSeconds(5)))
    .reduce((value1, value2) -> value1 + value2)
    .toStream()
    .print(Printed.toSysOut());

Topology topology = builder.build();
System.out.println(topology.describe());

final KafkaStreams streams = new KafkaStreams(topology, props);
streams.start();
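
Both Kafka Stream snippets in this article assume a pre-built props object. A minimal sketch, assuming a local broker (the application id is a hypothetical placeholder):

Java

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "kstream-vs-flink-demo"); // hypothetical id
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");     // assumed broker
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());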


Apache Flink Code

Java
static String TOPIC_IN = "Topic-IN";

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// MySchema deserializes both the key and the value of each record.
FlinkKafkaConsumer<KafkaRecord> kafkaConsumer = new FlinkKafkaConsumer<>(TOPIC_IN, new MySchema(), props);
kafkaConsumer.setStartFromLatest();

DataStream<KafkaRecord> stream = env.addSource(kafkaConsumer);

// Process all records in 5-second windows without keying the stream.
stream
    .timeWindowAll(Time.seconds(5))
    .reduce(new ReduceFunction<KafkaRecord>()
    {
        KafkaRecord result = new KafkaRecord();

        @Override
        public KafkaRecord reduce(KafkaRecord record1, KafkaRecord record2) throws Exception
        {
            result.key = record1.key;
            result.value = record1.value + record2.value;
            return result;
        }
    })
    .print();

System.out.println(env.getExecutionPlan());

env.execute();
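
The Flink snippets likewise assume a props object for the Kafka consumer; a minimal sketch (broker address and group id are assumptions):

Java

Properties props = new Properties();
props.setProperty("bootstrap.servers", "localhost:9092"); // assumed broker
props.setProperty("group.id", "flink-kstream-demo");      // hypothetical consumer group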


Differences Observed After Running Both

  1. You can't use window() without groupByKey() in Kafka Stream, whereas Flink provides the timeWindowAll() method to process all records in a stream without a key.
  2. Kafka Stream reads a record and its key by default, but Flink needs a custom implementation of KafkaDeserializationSchema<T> to read both key and value. If you are not interested in the key, you can pass new SimpleStringSchema() as the second parameter to the FlinkKafkaConsumer<> constructor. The implementation of MySchema is available on GitHub (a sketch follows this list).
  3. You can print the pipeline topology from both frameworks, which helps in optimizing your code. In addition to a JSON dump, Flink also provides a web app to visualize the topology: https://flink.apache.org/visualizer/.
  4. In Kafka Stream, I can print results to the console only after calling toStream(), whereas Flink can print them directly.
  5. Finally, Kafka Stream took 15+ seconds to print the results to the console, while Flink printed them immediately. This looks a bit odd since it adds an extra delay for developers; the delay likely comes from Kafka Stream buffering aggregation results in its state-store cache and forwarding them downstream only on commit, so lowering cache.max.bytes.buffering or commit.interval.ms makes output appear sooner.
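
The article links MySchema on GitHub. For context, here is a minimal sketch of what such a KafkaDeserializationSchema might look like, assuming a simple KafkaRecord POJO whose key, value, and timestamp fields match the ones used in the snippets above (in a real project the two classes would live in separate files):

Java

import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.streaming.connectors.kafka.KafkaDeserializationSchema;
import org.apache.kafka.clients.consumer.ConsumerRecord;

// Hypothetical POJO matching the fields used in the examples.
public class KafkaRecord
{
    public String key;
    public String value;
    public long timestamp;
}

public class MySchema implements KafkaDeserializationSchema<KafkaRecord>
{
    @Override
    public boolean isEndOfStream(KafkaRecord nextElement)
    {
        return false; // the stream never ends
    }

    @Override
    public KafkaRecord deserialize(ConsumerRecord<byte[], byte[]> consumerRecord) throws Exception
    {
        // Unlike SimpleStringSchema, this keeps the key and the timestamp.
        KafkaRecord record = new KafkaRecord();
        record.key = consumerRecord.key() == null ? null : new String(consumerRecord.key());
        record.value = new String(consumerRecord.value());
        record.timestamp = consumerRecord.timestamp();
        return record;
    }

    @Override
    public TypeInformation<KafkaRecord> getProducedType()
    {
        return TypeInformation.of(KafkaRecord.class);
    }
}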

Example 2

The following are the steps in this example:

  1. Read a stream of numbers from a Kafka topic. The numbers are produced as strings surrounded by "[" and "]", and all records are produced with the same key.
  2. Define a tumbling window of five seconds.
  3. Define a grace period of 500 ms to allow late arrivals.
  4. Reduce (append the numbers as they arrive).
  5. Send the result to another Kafka topic.

Kafka Stream Code

Java
static String TOPIC_IN = "Topic-IN";
static String TOPIC_OUT = "Topic-OUT";

final StreamsBuilder builder = new StreamsBuilder();

// Same pipeline as Example 1, plus a 500 ms grace period for late
// arrivals; results go to an output topic instead of the console.
builder
    .stream(TOPIC_IN, Consumed.with(Serdes.String(), Serdes.String()))
    .groupByKey()
    .windowedBy(TimeWindows.of(Duration.ofSeconds(5)).grace(Duration.ofMillis(500)))
    .reduce((value1, value2) -> value1 + value2)
    .toStream((windowedKey, value) -> windowedKey.key()) // unwrap the windowed key for the String serde
    .to(TOPIC_OUT);

Topology topology = builder.build();
final KafkaStreams streams = new KafkaStreams(topology, props);
streams.start();


Flink Code

Java
static String TOPIC_IN = "Topic-IN";
static String TOPIC_OUT = "Topic-OUT";

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

FlinkKafkaConsumer<KafkaRecord> kafkaConsumer = new FlinkKafkaConsumer<>(TOPIC_IN, new MySchema(), props);

// Use the timestamp stored in each Kafka record as the event time.
kafkaConsumer.assignTimestampsAndWatermarks(new AscendingTimestampExtractor<KafkaRecord>()
{
    @Override
    public long extractAscendingTimestamp(KafkaRecord record)
    {
        return record.timestamp;
    }
});

// Define a Kafka producer using the Flink API (prodProps holds the
// producer configuration, e.g. bootstrap.servers).
KafkaSerializationSchema<String> serializationSchema =
        (value, timestamp) -> new ProducerRecord<byte[], byte[]>(TOPIC_OUT, value.getBytes());

FlinkKafkaProducer<String> kafkaProducer =
        new FlinkKafkaProducer<String>(TOPIC_OUT,
                                       serializationSchema,
                                       prodProps,
                                       Semantic.EXACTLY_ONCE);

DataStream<KafkaRecord> stream = env.addSource(kafkaConsumer);

stream
    .keyBy(record -> record.key)
    .timeWindow(Time.seconds(5))
    .allowedLateness(Time.milliseconds(500))
    .reduce(new ReduceFunction<KafkaRecord>() // reduce on KafkaRecord, not String
    {
        @Override
        public KafkaRecord reduce(KafkaRecord record1, KafkaRecord record2) throws Exception
        {
            KafkaRecord result = new KafkaRecord();
            result.key = record1.key;
            result.value = record1.value + record2.value;
            return result;
        }
    })
    .map(record -> record.value) // the producer sink expects String values
    .addSink(kafkaProducer);

env.execute();


Differences Observed After Running Both 

1. Due to its native integration with Kafka, it was very easy to define this pipeline in KStream as opposed to Flink.

2. In Flink, I had to define both a Consumer and a Producer, which adds extra code.

3. KStream automatically uses the timestamp present in the record (set when it was inserted into Kafka), whereas Flink needs this information from the developer. I think Flink's Kafka connector can be improved in the future so that developers can write less code.

4. Handling late arrivals is easier in KStream as compared to Flink, but note that Flink also provides a side-output stream for late arrivals, which is not available in Kafka Stream (see the sketch after this list).

5. Finally, after running both, I observed that Kafka Stream took some extra seconds to write to the output topic, while Flink sent data to the output topic the moment the results of a time window were computed.
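
For completeness, here is a minimal sketch of the side-output mechanism mentioned in point 4, reusing stream and KafkaRecord from the example above (the tag name and the decision to simply print late records are assumptions):

Java

// Tag identifying the side output that carries late records.
final OutputTag<KafkaRecord> lateTag = new OutputTag<KafkaRecord>("late-records") {};

SingleOutputStreamOperator<KafkaRecord> windowed = stream
    .keyBy(record -> record.key)
    .timeWindow(Time.seconds(5))
    .allowedLateness(Time.milliseconds(500))
    .sideOutputLateData(lateTag) // records arriving after window end + lateness go here
    .reduce(new ReduceFunction<KafkaRecord>()
    {
        @Override
        public KafkaRecord reduce(KafkaRecord record1, KafkaRecord record2) throws Exception
        {
            KafkaRecord result = new KafkaRecord();
            result.key = record1.key;
            result.value = record1.value + record2.value;
            return result;
        }
    });

// Late records can be inspected or routed to their own sink.
windowed.getSideOutput(lateTag).print();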

Conclusion

  • If your project is tightly coupled with Kafka for both source and sink, then the KStream API is a better choice. However, you need to manage and operate the elasticity of KStream apps yourself.
  • Flink is a complete streaming computation system that supports high availability, fault tolerance, self-monitoring, and a variety of deployment modes.
  • Due to its built-in support for multiple third-party sources and sinks, Flink is more useful for such projects. It can be easily customized to support custom data sources.
  • Flink has a richer API compared to Kafka Stream and supports batch processing, complex event processing (CEP), FlinkML, and Gelly (for graph processing).