Refcard #254

Apache Kafka Essentials

Building Scalable and Reliable Stream Processing Applications With Apache Kafka and Flink

Due to growing demands for high-performance analytics, modern businesses consider data capabilities critical to daily operations. Enter Apache Kafka — a streaming engine for collecting, caching, and processing high volumes of data in real time. The distributed event store typically serves as a central data hub that accepts all enterprise data for aggregation, transformation, enrichment, ingestion, and analysis. The data can then be used for continuous processing or fed into other systems and applications in real time. In this Refcard update, we'll explore the fundamentals of Apache Kafka and delve into how Apache Flink complements Kafka as a powerful stream processing framework.


Written By

Sudip Sengupta
Technical Writer, Javelynn
Table of Contents
► Introduction ► About Apache Kafka ► Apache Kafka Fundamentals ► Extending Apache Kafka With Flink ► Scaling Data Streams ► Conclusion
Section 1

Introduction

With the changing tech landscape, several trends have emerged in recent years. First, with the increasing adoption of cloud-native frameworks, microservices architectures are now a preferred choice for application development. These loosely coupled microservices enable enhanced scalability, fault isolation, and rapid deployment across hybrid cloud environments. Second, the diversity, volume, and velocity of the data that an enterprise wants to collect for decision making continue to grow. Additionally, enterprises increasingly need to make real-time decisions on the data they collect. As a result, high-performance data analytics has become central to rapid decision making, predictive analysis, and strategic planning in modern businesses.

Apache Kafka is a streaming engine for collecting, caching, and processing high volumes of data in real time. The distributed event store typically serves as a central data hub that accepts all enterprise data for aggregation, transformation, enrichment, ingestion, and analysis. The data can then be used for continuous processing or fed into other systems and applications in real time.

In this Refcard, we'll explore the fundamentals of Apache Kafka and delve into how Apache Flink complements Kafka as a powerful stream processing framework.

Section 2

About Apache Kafka

Kafka was originally developed at LinkedIn in 2010, open sourced in early 2011, and graduated to a top-level project of the Apache Software Foundation in 2012. The platform comprises three main components: Pub/Sub, Kafka Connect, and Kafka Streams. The role of each component is summarized in the table below:

| Component | Role |
|---|---|
| Pub/Sub | Storing and delivering data efficiently and reliably at scale |
| Kafka Connect | Integrating Kafka with external data sources and data sinks |
| Kafka Streams | Processing data in Kafka in real time |

The main benefits of Kafka are:

  • High throughput – Kafka delivers high throughput, with clusters capable of handling terabytes of data per second.
  • High availability – Redundancy is ensured through data replication across multiple brokers, allowing the system to tolerate failures without data loss.
  • High scalability – New servers can be added over time to scale out the system.
  • Easy integration – Seamless integration with external data sources or data sinks.
  • Real-time processing – Kafka Streams provides a native stream processing library for building real-time applications.
  • Fault isolation – Kafka's distributed architecture ensures that failures in one part of the system don't impact others.
  • Client library availability – Kafka allows for event stream processing in multiple coding languages.

While Kafka Streams has traditionally been a popular choice for real-time data processing within Kafka, the industry is shifting toward alternative stream processing frameworks like Apache Flink. We'll delve into this trend in more detail in the upcoming sections.

Section 3

Apache Kafka Fundamentals

Apache Kafka runs in distributed clusters, with each cluster node referred to as a broker. Producers and consumers are clients that write event data to and read event data from brokers, respectively, while Kafka Connect integrates Kafka with external systems. All of these components rely on a durable publish-subscribe messaging ecosystem to enable the instant exchange of event data between servers, processes, and applications.


Figure 1: Apache Kafka as a central real-time hub

Pub/Sub in Apache Kafka 

The first component in Kafka deals with the production and consumption of the data. The following table describes a few key concepts in Kafka:

| Concept | Description |
|---|---|
| Topic | Defines a logical name for producing and consuming records |
| Partition | Defines a non-overlapping subset of records within a topic |
| Offset | A unique sequential number assigned to each record within a topic partition |
| Record | Contains a key, value, timestamp, and list of headers |
| Broker | Server where records are stored; multiple brokers can be used to form a cluster |

Figure 2 depicts a topic with two partitions. Partition 0 has 5 records, with offsets from 0 to 4, and partition 1 has 4 records, with offsets from 0 to 3.


Figure 2: Partitions in a topic

The following code snippet shows how to produce records to the topic, "test", using the Java API:

Java

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer",
   "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer",
   "org.apache.kafka.common.serialization.StringSerializer");
Producer<String, String> producer = new KafkaProducer<>(props);
producer.send(
    new ProducerRecord<String, String>("test", "key", "value"));
In the example above, both the key and value are strings, so we are using a StringSerializer. It's possible to customize the serializer when types become more complex. The following code snippet shows how to consume records with key and value strings in Java:
 
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "test-group"); // consumer group ID (required for subscribe/commit)
props.put("key.deserializer",
   "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer",
   "org.apache.kafka.common.serialization.StringDeserializer");
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Arrays.asList("test"));
while (true) {
   ConsumerRecords<String, String> records = consumer.poll(100);
   for (ConsumerRecord<String, String> record : records)
      System.out.printf("offset=%d, key=%s, value=%s%n",
         record.offset(), record.key(), record.value());
   consumer.commitSync();
}

Records within a partition are always delivered to the consumer in offset order. By saving the offset of the last consumed record from each partition, the consumer can resume from where it left off after a restart. In the example above, we use the commitSync() API to save the offsets explicitly after consuming a batch of records. Users can also save the offsets automatically by setting the property, enable.auto.commit, to true.
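
If offsets are committed automatically instead, only the consumer configuration changes, not the poll loop. Below is a minimal sketch of that variant; the group ID and commit interval shown are illustrative assumptions rather than recommended settings:

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "test-group");             // any consumer group name
props.put("enable.auto.commit", "true");          // commit offsets in the background
props.put("auto.commit.interval.ms", "5000");     // roughly every five seconds
props.put("key.deserializer",
   "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer",
   "org.apache.kafka.common.serialization.StringDeserializer");

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Arrays.asList("test"));
// No commitSync() calls are needed in the poll loop; offsets are committed periodically.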

A record in Kafka is not removed from the broker immediately after it is consumed. Instead, it is retained according to a configured retention policy. The following are two common retention policies:

  • log.retention.hours – number of hours to keep a record on the broker
  • log.retention.bytes – maximum size of records retained in a partition
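
The two settings above are broker-wide defaults set in server.properties. Retention can also be tuned per topic through the Kafka AdminClient; the sketch below sets the per-topic counterpart, retention.ms, on the test topic. The seven-day value is purely illustrative:

import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class RetentionConfigExample {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Per-topic counterparts of the broker-level settings above:
            // retention.ms (time-based) and retention.bytes (size-based)
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "test");
            AlterConfigOp setRetention = new AlterConfigOp(
                    new ConfigEntry("retention.ms", "604800000"), // keep records for 7 days
                    AlterConfigOp.OpType.SET);

            Map<ConfigResource, Collection<AlterConfigOp>> updates = new HashMap<>();
            updates.put(topic, Collections.singletonList(setRetention));
            admin.incrementalAlterConfigs(updates).all().get();
        }
    }
}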

While Kafka Streams could be used to process these consumed records, Apache Flink provides a compelling alternative with a richer set of features. Flink excels in state management and windowing operations, and it offers more flexible deployment options compared to Kafka Streams.

Here's an example of how to consume data from the "test" topic using Apache Flink:

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class FlinkKafkaConsumerExample {

    public static void main(String[] args) throws Exception {
        // Set up the streaming execution environment
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Kafka connection settings for the consumer
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "localhost:9092");
        properties.setProperty("group.id", "flink-consumer-group");

        // Configure the Kafka consumer
        FlinkKafkaConsumer<String> consumer = new FlinkKafkaConsumer<>(
                "test", new SimpleStringSchema(), properties);
        consumer.setStartFromEarliest(); // Read from the beginning of the topic

        // Create a DataStream from the Kafka consumer
        DataStream<String> stream = env.addSource(consumer);

        // Process the stream (e.g., print the messages)
        stream.print();

        // Execute the Flink program
        env.execute("Flink Kafka Consumer Example");
    }
}

Note that in contrast to the Kafka consumer code, which focuses on basic consumption, the above example using Flink demonstrates a more streamlined approach to creating a data stream from Kafka. It is Flink's DataStream API that provides a powerful foundation for building complex stream processing pipelines.

Kafka Connect 

The second component is Kafka Connect, a framework that makes it easy to stream data between Kafka and other systems. The framework uses connectors as pre-built components to integrate Kafka with various external systems for handling data transfer. Users can deploy a Connect cluster and run various connectors to import data from different sources into Kafka (through Source Connectors) and export data from Kafka further (through Sink Connectors) to storage platforms such as HDFS, S3, or Elasticsearch. 

The benefits of using Kafka Connect are:

  • Parallelism and fault tolerance
  • Avoidance of ad hoc code by reusing existing connectors
  • Built-in offset and configuration management

Quick Start for Kafka Connect 

The following steps show how to run the existing file connector in standalone mode to copy the content from a source file to a destination file via Kafka:

1. Prepare some data in a source file:

 
> echo -e "hello\nworld" > test.txt

2. Start a file source and a file sink connector:

 
> bin/connect-standalone.sh config/connect-standalone.properties \
    config/connect-file-source.properties config/connect-file-sink.properties

3. Verify the data in the destination file:

 
> more test.sink.txt
hello
world

4. Verify the data in Kafka:

 
> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 \
    --topic connect-test --from-beginning
{"schema":{"type":"string","optional":false},"payload":"hello"}
{"schema":{"type":"string","optional":false},"payload":"world"}

In the example above, the data in the source file, test.txt, is first streamed into a Kafka topic, connect-test, through a file source connector. The records in connect-test are then streamed into the destination file, test.sink.txt. If a new line is added to test.txt, it will show up immediately in test.sink.txt. Note that we achieve this by running two connectors without writing any custom code.
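
For reference, the file connectors used here are configured through simple properties files that ship with Kafka; the source configuration looks roughly like the following (exact contents may vary slightly by Kafka version):

name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=test.txt
topic=connect-test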

Connectors are powerful tools that allow for integration of Apache Kafka into many other systems. There are many open-source and commercially supported options for integrating Apache Kafka — both at the connector layer as well as through an integration services layer — that can provide much more flexibility in message transformation.

Transformations in Connect 

Connect is primarily designed to stream data between systems as-is, whereas Kafka Streams is designed to perform complex transformations once the data is in Kafka. That said, Kafka Connect provides a mechanism used to perform simple transformations per record. The following example shows how to enable a couple of transformations in the file source connector:

1. Add the following lines to connect-file-source.properties:

 
transforms=MakeMap, InsertSource
transforms.MakeMap.type=org.apache.kafka.connect.transforms.HoistField$Value
transforms.MakeMap.field=line
transforms.InsertSource.type=org.apache.kafka.connect.transforms.InsertField$Value
transforms.InsertSource.static.field=data_source
transforms.InsertSource.static.value=test-file-source

2. Start a file source connector:

 
> bin/connect-standalone.sh config/connect-standalone.properties \
    config/connect-file-source.properties

3. Verify the data in Kafka:

 
> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 \
    --topic connect-test --from-beginning
{"line":"hello","data_source":"test-file-source"}
{"line":"world","data_source":"test-file-source"}

In step one above, we add two transformations, MakeMap and InsertSource, which are implemented by the classes HoistField$Value and InsertField$Value, respectively. The first wraps each input string in a field named line. The second adds an additional field, data_source, that indicates the name of the source file. After applying the transformation logic, the data in the input file is transformed to the output shown in step three. More complex transformations are better expressed with a stream processing library; for example, the following word count is implemented with the Kafka Streams API:

 
final Serde<String> stringSerde = Serdes.String();
final Serde<Long> longSerde = Serdes.Long();
StreamsBuilder builder = new StreamsBuilder();
// build a stream from an input topic
KStream<String, String> source = builder.stream(
   "streams-plaintext-input",
   Consumed.with(stringSerde, stringSerde));
KTable<String, Long> counts = source
   .flatMapValues(value -> Arrays.asList(value.toLowerCase().split(" ")))
   .groupBy((key, value) -> value)
   .count();
// convert the output to another topic
counts.toStream().to("streams-wordcount-output",
   Produced.with(stringSerde, longSerde));

While Kafka Streams could be used for further processing, Flink provides a richer API and a broader ecosystem for implementing complex transformations on this data. It's also important to understand why Flink is gaining traction. Flink is a true stream processing engine that handles events individually as they arrive, enabling low-latency operations and real-time results. It also excels at state management, allowing you to maintain and query application state for tasks like sessionization and windowing.
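
As a rough illustration of the sessionization use case mentioned above, the sketch below groups a hypothetical stream of (user, clicks, eventTimeMillis) tuples into per-user session windows. The field layout, ten-minute gap, and sample data are assumptions made purely for the example; in practice, the input would come from a Kafka source:

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class SessionizationSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical (userId, clicks, eventTimeMillis) records with event-time timestamps
        DataStream<Tuple3<String, Integer, Long>> clicks = env
                .fromElements(
                        Tuple3.of("alice", 1, 1_000L),
                        Tuple3.of("alice", 1, 2_000L),
                        Tuple3.of("bob", 1, 5_000L))
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy.<Tuple3<String, Integer, Long>>forMonotonousTimestamps()
                                .withTimestampAssigner((event, ts) -> event.f2));

        // A session per user closes after 10 minutes with no further events
        DataStream<Tuple3<String, Integer, Long>> sessions = clicks
                .keyBy(event -> event.f0)
                .window(EventTimeSessionWindows.withGap(Time.minutes(10)))
                .sum(1);

        sessions.print();
        env.execute("Sessionization Sketch");
    }
}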

Connect REST API 

In production, Kafka Connect usually runs in distributed mode, where it is managed through a REST API that allows you to perform actions such as:

  • Monitoring connector status
  • Creating new connectors
  • Updating connector configurations

The following table lists the common APIs, and a minimal Java client sketch follows the table. See the Apache Kafka documentation for more information.

| Connect REST API | Action |
|---|---|
| GET /connectors | Return a list of active connectors |
| POST /connectors | Create a new connector |
| GET /connectors/{name} | Get information for the connector |
| GET /connectors/{name}/config | Get configuration parameters for the connector |
| PUT /connectors/{name}/config | Update configuration parameters for the connector |
| GET /connectors/{name}/status | Get the current status of the connector |
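
Because these endpoints are plain HTTP, they can be exercised from any HTTP client. The sketch below uses Java's built-in HttpClient against Connect's default REST port (8083); the connector name local-file-source is an assumption based on the stock file source configuration:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ConnectRestClient {

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // List active connectors (GET /connectors)
        HttpRequest list = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8083/connectors"))
                .GET()
                .build();
        System.out.println(client.send(list, HttpResponse.BodyHandlers.ofString()).body());

        // Check the status of a single connector (GET /connectors/{name}/status)
        HttpRequest status = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8083/connectors/local-file-source/status"))
                .GET()
                .build();
        System.out.println(client.send(status, HttpResponse.BodyHandlers.ofString()).body());
    }
}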

Flink 

Kafka offers the capability to perform stream processing on data stored within its topics. While Kafka Streams is a traditional, native library for this purpose, Apache Flink has emerged as a popular choice due to its advanced features and performance benefits.

Flink offers several advantages over Kafka Streams for stream processing, including:

  • High performance and scalability
  • Rich API for complex transformations
  • Support for stateful operations
  • Exactly-once semantics for data accuracy
  • Flexible deployment options
 
For example, here is a complete word count application built on Flink's DataStream API:

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class WordCount {

    public static void main(String[] args) throws Exception {
        // Set up the streaming execution environment
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Create a DataStream from the input source (e.g., Kafka topic)
        DataStream<String> text = env.fromElements("hello world", "flink is great"); // Replace with Kafka source

        DataStream<Tuple2<String, Integer>> counts =
                // Split the lines into words
                text.flatMap(new Tokenizer())
                // Group by word and count
                .keyBy(value -> value.f0)
                .sum(1);

        // Print the results
        counts.print();

        // Execute the program
        env.execute("Word Count Example");
    }

    public static final class Tokenizer implements FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
            String[] tokens = value.toLowerCase().split("\\W+");
            for (String token : tokens) {
                if (token.length() > 0) {
                    out.collect(new Tuple2<>(token, 1));
                }
            }
        }
    }
}


The above Flink code snippet reads a stream of text, splits it into individual words, and then counts the occurrences of each word. To run this Flink example:

  1. Download and install Apache Flink.
  2. Package the Flink application: Bundle the WordCount.java code into a JAR file.
  3. Submit the JAR file to the Flink cluster using the flink run command. (You'll need to configure the Flink application to read from and write to the appropriate Kafka topics.)

Stream Processing with Flink: DataStreams and Tables 

Apache Flink provides two core abstractions for working with streaming data: DataStreams and Tables. These abstractions offer different ways to represent and process streaming data, catering to various use cases and programming styles. 

  • A DataStream represents a continuous flow of data records. Each record is treated as an individual event, similar to how records are handled in a KStream. DataStreams are well suited for scenarios where you need to process events in the order they arrive and perform operations such as filtering, mapping, and windowing. 
  • A Table represents a dynamic collection of records with a schema. Unlike DataStreams, Tables focus on the relational aspect of data, allowing you to express transformations using SQL-like queries. Tables are ideal for scenarios where you need to perform aggregations, joins, and other relational operations on streaming data. 

Key differences are summarized in the table below:

| Feature | DataStream | Table |
|---|---|---|
| Data model | Unbounded sequence of individual records | Dynamic table with rows and columns |
| Processing | Record-at-a-time processing | Relational operations (e.g., aggregations, joins) |
| Use cases | Event processing, real-time analytics | Data analysis, continuous queries |

Flink allows you to seamlessly switch between DataStreams and Tables, enabling you to leverage the strengths of both abstractions within your stream processing applications. Using DataStreams, you can process each event individually to track user sessions or detect anomalies. On the other hand, with Tables, you can aggregate events to calculate metrics such as the number of active users or the average session duration. 

Let's consider a stream of events with a key and a numeric value to understand their differences better: 

| Key | Value |
|---|---|
| "k1" | 2 |
| "k1" | 5 |

When processed as a DataStream, both records are treated as independent events. If we were to sum the values, the result would be 7 (2 + 5).

When processed as a Table, the second record with key "k1" is treated as an update to the existing "k1" record. Therefore, if we were to sum the values in the table, the result would be 5 (the latest value associated with "k1").   
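
To make switching between the two abstractions concrete, here is a minimal sketch that registers the (key, value) stream above as a table and aggregates it with SQL. It assumes a recent Flink version (1.14+) with the Table API bridge (flink-table-api-java-bridge) on the classpath; names such as "events" are illustrative:

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

public class DataStreamTableBridge {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // A DataStream of (key, value) events, as in the example above
        DataStream<Tuple2<String, Integer>> events =
                env.fromElements(Tuple2.of("k1", 2), Tuple2.of("k1", 5));

        // Register the stream as a table and aggregate it with SQL
        tEnv.createTemporaryView("events", events);
        Table totals = tEnv.sqlQuery(
                "SELECT f0, SUM(f1) AS total FROM events GROUP BY f0");

        // Convert the continuously updated result back to a DataStream
        DataStream<Row> updates = tEnv.toChangelogStream(totals);
        updates.print();

        env.execute("DataStream and Table Bridge Example");
    }
}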

Flink DataStream API and Operators 

Flink's DataStream API provides a rich set of operators for transforming and processing streaming data. These operators allow you to perform various operations on DataStreams, such as filtering, mapping, joining, and aggregating. Here are some commonly used Flink DataStream operators:

1. filter(FilterFunction): Create a new DataStream consisting of all records of this stream that satisfy the given predicate. Example:

 
DataStream<Integer> input = env.fromElements(1, 2, 3, 4, 5, 6, 7);
DataStream<Integer> output = input.filter(value -> value > 5); // Output: 6, 7

2. map(MapFunction): Transform each record of the input stream into a new record in the output stream. Example:

 
DataStream<Integer> input = env.fromElements(1, 2, 3);
DataStream<Integer> output = input.map(value -> value * 2); // Output: 2, 4, 6

3. keyBy(KeySelector): Group the records by a key. This is a crucial step for performing aggregations and other stateful operations. Example:

 
DataStream<Tuple2<String, Integer>> input = env.fromElements(
        Tuple2.of("a", 1),
        Tuple2.of("b", 2),
        Tuple2.of("a", 3));
KeyedStream<Tuple2<String, Integer>, String> keyedStream =
        input.keyBy(value -> value.f0); // Key by the first element (String)

4. reduce(ReduceFunction): Combine the values of records in a keyed stream. Example:

 
KeyedStream<Tuple2<String, Integer>, String> keyedStream = // ... (keyed stream as in the previous example)
DataStream<Tuple2<String, Integer>> output = keyedStream.reduce(
        (value1, value2) -> Tuple2.of(value1.f0, value1.f1 + value2.f1)
); // Output: ("a", 4), ("b", 2)

5. window(WindowAssigner): Group records into windows for performing aggregations or other operations over a specific time period or number of events. Example:

 
KeyedStream<Tuple2<String, Integer>, String> keyedStream = // ... (keyed stream)
DataStream<Tuple2<String, Integer>> output = keyedStream
        .window(TumblingEventTimeWindows.of(Time.seconds(5)))
        .sum(1); // Sum the second element (Integer) within 5-second windows

6. join(DataStream): Join records from two DataStreams based on their keys, using where() and equalTo() to select the keys and a window to bound the join. Example:

 
DataStream<Tuple2<String, Integer>> stream1 = // ...
DataStream<Tuple2<String, String>> stream2 = // ...
DataStream<Tuple3<String, Integer, String>> output = stream1
        .join(stream2)
        .where(value -> value.f0)   // Key selector for stream1
        .equalTo(value -> value.f0) // Key selector for stream2
        .window(TumblingEventTimeWindows.of(Time.seconds(10)))
        .apply(
            (value1, value2) -> Tuple3.of(value1.f0, value1.f1, value2.f1)
        ); // Join within 10-second windows

The above is just a small selection of the many operators available in Flink's DataStream API. Flink also provides operators for connecting to various data sources and sinks, performing stateful operations, and handling event time. Additional details on the respective set of operations and examples can be found in the Apache Flink DataStream API documentation.

State Management and Fault Tolerance in Flink 

While processing data in real time, a Flink application often needs to maintain state. State is information that is relevant to the processing of events, such as aggregations, windowed data, or machine learning models. For example, in a word count application, the state would be the current count of each word encountered.

As a key feature over KStreams, Flink excels at state management, providing efficient ways to store, access, and update this state. It offers various state back ends to suit different needs, such as in-memory state for high performance, file-system-based state for durability, and RocksDB state for handling large state sizes.
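
As a small illustration of Flink-managed keyed state, the sketch below keeps a running count per key inside a RichFlatMapFunction. The class and state names are illustrative; it would be applied to a keyed stream, e.g., stream.keyBy(word -> word).flatMap(new RunningCount()):

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Keeps a running count per key in Flink-managed keyed state
public class RunningCount extends RichFlatMapFunction<String, Tuple2<String, Long>> {

    private transient ValueState<Long> count;

    @Override
    public void open(Configuration parameters) {
        count = getRuntimeContext().getState(
                new ValueStateDescriptor<>("count", Types.LONG));
    }

    @Override
    public void flatMap(String word, Collector<Tuple2<String, Long>> out) throws Exception {
        Long current = count.value();           // null on first access for a key
        long updated = (current == null ? 0L : current) + 1;
        count.update(updated);                   // persisted in the configured state back end
        out.collect(Tuple2.of(word, updated));
    }
}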


Figure 3: RocksDB in a Flink cluster node

Image adapted from Apache Flink docs

To mitigate potential failures in the system, Flink guarantees exactly-once processing semantics. As a result, each record in the stream will be processed exactly once, and the results will be consistent, regardless of any disruptions. A typical way of achieving this is through a combination of techniques, including checkpointing (periodically saving the state of the application), robust state management, and transactional sinks (ensuring that output data is written only once). 
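
A minimal sketch of enabling these mechanisms in a Flink job is shown below, assuming Flink 1.14 or later. The checkpoint interval, storage path, and choice of the RocksDB back end are illustrative, and the RocksDB back end requires the flink-statebackend-rocksdb dependency:

import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointingSetup {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Snapshot all operator state every 10 seconds with exactly-once guarantees
        env.enableCheckpointing(10_000, CheckpointingMode.EXACTLY_ONCE);

        // Keep large keyed state in RocksDB on local disk
        env.setStateBackend(new EmbeddedRocksDBStateBackend());

        // Write checkpoint data to a durable location
        env.getCheckpointConfig().setCheckpointStorage("file:///tmp/flink-checkpoints");

        // A trivial pipeline so the job has something to run
        env.fromElements("hello", "world").print();

        env.execute("Checkpointing Setup Example");
    }
}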

Governing data in motion is often a shared responsibility between developers, DataOps, and business development teams. Flink provides built-in observability metrics and monitoring capabilities that allow you to track key metrics such as throughput (number of records processed per second), latency (time taken to process a record), and resource utilization (CPU, memory). You can also integrate Flink with external monitoring tools like Prometheus and Grafana for more comprehensive monitoring and visualization.

It is important to note that Flink, on its own, doesn't include a built-in event portal to catalog event streams and visualize the complete topology of data pipelines. However, it offers features and integration points that enable you to achieve this functionality. To understand the flow of data, there are various tools and APIs for tracking data lineage to trace the origin and transformations of data as it moves through your pipelines. If you detect an anomaly in the output data, data lineage can help you pinpoint the source of the issue and the transformations applied. For example, you can use OpenLineage as a standard for capturing and sharing lineage information, or Marquez as a lineage tracking system that can be integrated with Flink.

Section 4

Extending Apache Kafka With Flink

While Kafka and Flink provide a powerful foundation for event streaming and stream processing, you might need to extend their capabilities to support specific requirements of modern data-driven applications. Here are some approaches to consider:

  • Low-code stream processing – To reduce manual effort, consider low-code approaches that require minimal to no coding for routine tasks. Flink SQL, for example, lets you express transformations and analysis as SQL queries, minimizing the need for complex Java or Scala code (see the sketch after this list). The approach is particularly useful for common tasks like filtering, aggregations, and joins.
  • Linking on-prem and cloud-based clusters – As your streaming infrastructure grows, you might need to connect Flink and Kafka deployments across different environments, such as on-premises data centers and multiple cloud providers. Consider solutions like an event mesh to create a unified fabric for seamless data flow and interoperability across these environments.
  • Filtering data streams – While Flink provides a rich set of operators for filtering and transforming data, you might encounter scenarios that require more specialized or complex logic. In such cases, you can leverage Flink's extensibility to implement custom functions or integrate with external libraries to achieve the desired functionality.
  • Data sharing and replication – To ensure high availability, disaster recovery, or data synchronization across different Kafka clusters, consider using tools and techniques for data replication. MirrorMaker 2 is a Kafka-native tool that can replicate topics between clusters and supports active/active topologies for more advanced replication scenarios.
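
As an example of the low-code option referenced above, the sketch below declares a Kafka topic as a table and runs a continuous SQL aggregation over it. The topic name, schema, and group ID are assumptions, and the Kafka SQL connector jar must be on the classpath:

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class FlinkSqlKafkaExample {

    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // Declare a Kafka topic as a table (requires the Kafka SQL connector jar)
        tEnv.executeSql(
                "CREATE TABLE pageviews (" +
                "  user_id STRING," +
                "  url STRING," +
                "  ts TIMESTAMP(3)" +
                ") WITH (" +
                "  'connector' = 'kafka'," +
                "  'topic' = 'pageviews'," +
                "  'properties.bootstrap.servers' = 'localhost:9092'," +
                "  'properties.group.id' = 'flink-sql-demo'," +
                "  'scan.startup.mode' = 'earliest-offset'," +
                "  'format' = 'json'" +
                ")");

        // A continuous aggregation expressed entirely in SQL
        tEnv.executeSql(
                "SELECT user_id, COUNT(*) AS views FROM pageviews GROUP BY user_id")
            .print();
    }
}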


Figure 4: Hybrid/multi-cloud deployment with event mesh

For more information on low-code stream processing options, including Flink SQL access to Apache Kafka, please see the additional resources below.

Section 5

Scaling Data Streams

With the increasing popularity of real-time stream processing and the rise of event-driven architectures, a number of alternatives have started to gain traction for real-time data distribution. Apache Kafka is the flavor of choice for distributed high-volume data streaming; however, many implementations commonly struggle with building solutions at scale when the application's requirements go beyond a single data center or single location.

While Apache Kafka is purpose-built for real-time data distribution and stream processing, it may not fit the requirements of every enterprise application due to its constraints with architecture, throughput, and amount of data ingested. Some alternatives to Apache Kafka include:

  • Apache Pulsar – a distributed messaging and streaming platform that offers low latency, high throughput, and built-in geo-replication across clusters.
  • Amazon Kinesis – an AWS streaming service that supports Java, Android, .NET, and Go SDKs.
  • Apache Spark – a unified, open-source analytics engine for large-scale data processing that supports stream processing through Structured Streaming and is widely used for ML and AI workloads.
  • Google Cloud Pub/Sub – a scalable message queue and stream analytics platform with enterprise support for event-driven applications.
  • Redpanda – a Kafka-API-compatible streaming platform designed as a drop-in replacement for Kafka, targeting higher throughput, lower latency, and improved resource efficiency.

For more information on comparisons between Apache Kafka and other data distribution solutions, please see the additional resources below.

Section 6

Conclusion

Apache Kafka has become the de facto standard for high-performance, distributed data streaming. It has a large and growing community of developers, corporations, and applications that are supporting, maintaining, and leveraging it. Flink complements Kafka by providing a robust and scalable platform for performing complex stream processing tasks. If you are building an event-driven architecture or looking for a way to stream data in real time, Apache Kafka is a clear leader in providing a proven, robust platform for enabling stream processing and enterprise communication.

Additional resources:

  • Apache Kafka Documentation – http://kafka.apache.org/documentation/
  • Apache NiFi website – http://nifi.apache.org/
  • "Real-Time Stock Processing With Apache NiFi and Apache Kafka, Part 1" – https://dzone.com/articles/real-time-stock-processing-with-apache-nifi-and-ap
  • Apache Kafka Summit website – http://kafka-summit.org/
  • Apache Kafka Mirroring and Replication – http://cwiki.apache.org/confluence/display/KAFKA/KIP-382%3A+MirrorMaker+2.0
  • Apache Flink Documentation – https://flink.apache.org/documentation/
  • Real-time sessionization pipeline using Apache Flink – https://github.com/vraja2/flink-sessionization
  • Open-Source Data Management Practices and Patterns Refcard – https://dzone.com/refcardz/open-source-data-management-practices-and-patterns
