Introduction to Apache Kafka
A developer gives an introduction to how Apache Kafka works followed by a quick tutorial on creating a cluster in Apache Kafka.
What Is Apache Kafka?
Apache Kafka is a distributed streaming system that lets you publish and subscribe to streams of records. Viewed another way, it is an enterprise messaging system. It is very fast, horizontally scalable, and fault-tolerant. Kafka has four core APIs:
Producer API: This API allows clients to connect to the Kafka servers running in a cluster and publish streams of records to one or more Kafka topics.
Consumer API: This API allows clients to connect to the Kafka servers running in a cluster and consume streams of records from one or more Kafka topics.
Streams API: This API allows clients to act as stream processors, consuming streams from one or more topics and producing streams to other output topics. This allows for the transformation of the input and output streams.
Connector API: This API allows you to write reusable producer and consumer code. For example, if we want to read data from an RDBMS, publish it to a topic, and then consume it from the topic and write it to another RDBMS, the Connector API lets us build reusable source and sink connector components for various data sources.
What Is Kafka Used For?
Kafka is used for the following use cases:
Enterprise Messaging
Kafka can serve as an enterprise messaging system, decoupling the source and target systems that exchange data. Compared to JMS, Kafka provides high throughput through partitions and fault tolerance through replication.
Web Activity Tracking
This lets you track the user journey events on a website for analytics and offline data processing.
Log Aggregation
Kafka can collect logs from various systems, especially in distributed environments with microservice-based architectures where the systems are deployed on many hosts. We need to aggregate the logs from these systems and make them available in a central place for analysis. For a deeper dive, I recommend the following article on distributed logging architecture where Kafka is used: https://smarttechie.org/2017/07/31/distributed-logging-architecture-for-micro-services/
What Is a Broker?
An instance in a Kafka cluster is called a broker. If you connect to any one broker in a Kafka cluster, you can access the entire cluster. The broker instance we connect to in order to access the cluster is also known as a bootstrap server. Each broker is identified by a numeric id in the cluster. Three brokers is a good number to start a Kafka cluster with, but there are clusters with hundreds of brokers in them.
What Is a Topic?
A topic is a logical name to which records are published. Internally, a topic is divided into partitions to which the data is written. These partitions are distributed across the brokers in the cluster. For example, if a topic has three partitions and the cluster has three brokers, each broker holds one partition. Data published to a partition is append-only, with the offset incremented for each record.
Below are a few points to remember while working with partitions.
- Topics are identified by their name. We can have many topics in a cluster.
- The order of the messages is maintained at the partition level, not across the topic.
- Once data is written to a partition, it cannot be overridden. This is called immutability.
- Messages in partitions are stored with a key, value, and timestamp. Kafka ensures that messages with the same key are published to the same partition.
- Each partition has a leader in the cluster, which serves the read/write operations for that partition.
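The same-key, same-partition guarantee can be sketched with a toy partitioner. This is not Kafka's actual implementation (the default producer hashes the serialized key with murmur2), and the names here are made up for illustration:

```python
NUM_PARTITIONS = 3

def partition_for(key: bytes, num_partitions: int = NUM_PARTITIONS) -> int:
    # Any deterministic hash of the key preserves the guarantee:
    # the same key always maps to the same partition.
    return sum(key) % num_partitions

records = [(b"user-42", "login"), (b"user-7", "click"), (b"user-42", "logout")]
placements = [(key.decode(), partition_for(key)) for key, _ in records]
print(placements)  # both "user-42" records land on the same partition
```

Because placement is a pure function of the key, per-key ordering is preserved even though ordering across the whole topic is not.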
In the above example, I have created a topic with three partitions and a replication factor of 3. In this case, as the cluster has three brokers, the three partitions are evenly distributed, and the replicas of each partition are replicated to the other two brokers. As the replication factor is 3, there is no data loss even if two brokers go down. Always keep the replication factor greater than 1 and less than or equal to the number of brokers in the cluster. You cannot create a topic with a replication factor greater than the number of brokers in the cluster.
In the above diagram, for each partition there is a leader (glowing partition), and the other in-sync replicas (grayed-out partitions) are followers. For partition 0, broker-1 is the leader, while broker-2 and broker-3 are the followers. All reads/writes to partition 0 go to broker-1, and the data is copied to broker-2 and broker-3.
Now let us create a Kafka cluster with three brokers by following the below steps.
Create a Kafka Cluster
Download the latest version of Apache Kafka. In this example, I am using 1.0. Extract the archive and move into the extracted folder. Start Zookeeper.
Zookeeper is the coordination service to manage the brokers and leader election for partitions, and it alerts Kafka when changes are made to the topic or brokers (e.g., deleted, created, etc.). In this example, I have started only one Zookeeper instance. In production environments, we should have more Zookeeper instances to manage fail-over. Without Zookeeper, Kafka clusters cannot work.
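Kafka ships with a default Zookeeper configuration, so a single local instance can be started from the Kafka root folder (paths assume the stock distribution layout):

```shell
# Start a single local Zookeeper instance on its default port (2181),
# using the zookeeper.properties file bundled with Kafka.
bin/zookeeper-server-start.sh config/zookeeper.properties
```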
Now start the Kafka brokers. In this example, we are going to start three brokers. Go to the config folder under the Kafka root folder, copy the server.properties file three times, and name the copies server_1.properties, server_2.properties, and server_3.properties. Change the below properties in those files.
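Each copy needs a unique broker id, listener port, and log directory. The exact ports and paths below are only an illustration; any free ports and writable directories will do:

```properties
# server_1.properties (repeat with broker.id=2/port 9093 and
# broker.id=3/port 9094 in server_2.properties and server_3.properties)
broker.id=1
listeners=PLAINTEXT://localhost:9092
log.dirs=/tmp/kafka-logs-1
```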
Now run the three brokers with the below commands.
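Assuming the three property files above are in the config folder, each broker can be started from the Kafka root folder (in its own terminal, since the script runs in the foreground):

```shell
bin/kafka-server-start.sh config/server_1.properties
bin/kafka-server-start.sh config/server_2.properties
bin/kafka-server-start.sh config/server_3.properties
```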
Create a topic with the below command.
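With Kafka 1.0, topics are created by pointing kafka-topics.sh at Zookeeper. The topic name here (my-first-topic) is just an example:

```shell
bin/kafka-topics.sh --create --zookeeper localhost:2181 \
  --replication-factor 3 --partitions 3 --topic my-first-topic
```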
Produce some messages to the topic created in the above step by using Kafka's console producer. To do this, mention any one of the broker addresses in the console producer. That will be the bootstrap server to gain access to the entire cluster.
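For example, using any one broker's address as the bootstrap (the topic name follows the earlier example):

```shell
# Type messages and press Enter to send; Ctrl+C to exit.
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic my-first-topic
```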
Consume the messages using Kafka's console consumer. For Kafka consumers, mention any one of the broker addresses as a bootstrap server. Remember, while reading the messages you may not see the order, as the order is maintained at the partition level, not at the topic level.
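A consumer against the same example topic would look like this:

```shell
# --from-beginning replays the topic from the earliest retained offset.
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic my-first-topic --from-beginning
```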
If you want, you can describe the topic to see how partitions are distributed, as well the leaders of each partition, using the below command.
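Continuing with the example topic name, the describe command looks like this:

```shell
# Lists each partition with its leader, replica set, and in-sync replicas (ISR).
bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic my-first-topic
```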
In the above description, broker-1 is the leader for partition 0, and broker-1, broker-2, and broker-3 each hold replicas of every partition.
In the next article, we will see the producer and consumer Java API. Until then, happy messaging!
Published at DZone with permission of Siva Prasad Rao Janapati, DZone MVB. See the original article here.