Kafka Consumer Overview
A developer gives a technical overview of consumers in the Apache Kafka platform, and how they allow devs to work with concurrent data consumption applications.
Join the DZone community and get the full member experience.Join For Free
This article is a continuation of Part 1 - Kafka Technical Overview, Part 2 - Kafka Producer Overview and Part 3 - Kafka Producer Delivery Semantics articles. Let's look into Kafka consumer group, consumer, and protocol used in detail.
Like a Kafka Producer that optimizes writes to Kafka, a Consumer is used for optimal consumption of Kafka data. The primary role of a Kafka consumer is to take Kafka connection and consumer properties to read records from the appropriate Kafka broker. Complexities of concurrent application consumption, offset management, delivery semantics, and a lot more are taken care of by Consumer APIs.
Some of the consumer properties in the bootstrap servers are:
enable.auto.commit, and many more. We will discuss some of these properties later in the next part of the article series.
Role of Kafka Consumers
Multiple applications can consume records from the same Kafka topic, as shown in the diagram below. Each application that consumes data from Kafka gets it’s own copy and can read at its own speed. In other words, offsets consumed by one application could be different from another application. Kafka keeps tracks of the offsets consumed by each application in an internal
Consumer Group and Consumer
Each application consuming data from Kafka is treated as a consumer group. For example, if two applications are consuming the same topic from Kafka, then, internally, Kafka creates two consumer groups. Each consumer group can have one or more consumers. If a topic has three partitions and an application consumes it, then a consumer group would be created and a consumer in the consumer group will consume all partitions of the topic. The diagram below depicts a consumer group with a single consumer.
When an application wants to increase the speed of processing and process partitions in parallel then it can add more consumers to the consumer group. Kafka takes care of keeping track of offsets consumed per consumer in a consumer group, rebalancing consumers in the consumer group when a consumer is added or removed and lot more.
When there are multiple consumers in a consumer group, each consumer in the group is assigned one or more partitions. Each consumer in the group will process records in parallel from each leader partition of the brokers. A consumer can read from more than one partition.
When consumers in a consumer group are more than partitions in a topic then over-allocated consumers in the consumer group will be unused.
When you have multiple topics and multiple applications consuming the data, consumer groups and consumers of Kafka will look similar to the diagram shown below.
Coordinator and Leader Discovery
In order to manage the handshake between Kafka and the application that forms the consumer group and consumer, a coordinator on the Kafka side and a leader (one of the consumers in the consumer group) is elected. The first consumer that initiates the process is automatically elected as leader in the consumer group. As explained in the diagram below, for a consumer to join a consumer group, the following handshake processes take place:
- Find coordinator
- Join group
- Sync group
- Leave group
In order to create or join a group, a consumer has to first find the coordinator on the Kafka side that manages the consumer group. The consumer makes a “find coordinator” request to one of the bootstrap servers. If a coordinator already doesn’t exist it’s identified based on a hashing formula and returned as a response to “find coordinator” request.
Once the coordinator is identified, the consumer makes a “join group” request to the coordinator. The coordinator returns the consumer group leader and metadata details. If a leader already doesn’t exist then the first consumer of the group is elected as leader. Consuming application can also control the leader elected by the coordinator node.
After leader details are received for the join group request, the consumer makes a “Sync group” request to the coordinator. This request triggers the rebalancing process across consumers in the consumer group, as the partitions assigned to the consumers will change after the “sync group” request.
All consumers in the consumer group will receive updated partition assignments that they need to consume when a consumer is added/removed or “sync group” request is sent. Data consumption by all consumers in the consumer group will be halted until the rebalance process is complete.
Each consumer in the consumer group periodically sends a heartbeat signal to its group coordinator. In the case of heartbeat timeout, the consumer is considered lost and rebalancing is initiated by the coordinator.
A consumer can choose to leave the group anytime by sending a “leave group” request. The coordinator will acknowledge the request and initiate a rebalance. In case the leader node leaves the group, a new leader is elected from the group and a rebalance is initiated.
As explained in Part 1of this series, “partitions” are units of parallelism. As consumers in a consumer group are limited by the partition in a topic, it’s very important to decide you partitions based on the SLA and scale your consumers accordingly. Consumer offsets are managed and stored by Kafka in an internal
__consumer_offset topic. Each consumer in a consumer group follows the find coordinator, join group, sync group, heartbeat, and leave group protocols. In the next article in this series, we'll look into Kafka consumer properties and delivery semantics.
Opinions expressed by DZone contributors are their own.