Kafka Producer Overview
In this post, we take a look into producers in Apache Kafka and what they allow developers and data scientists to do.
This article is a continuation of Part 1, 'Kafka Technical Overview.' In Part 2 of the series, let's look into the details of how a Kafka producer works and important configurations.
The primary role of a Kafka producer is to take producer properties and records as inputs and write the records to an appropriate Kafka broker. Producers serialize, partition, compress, and load-balance data across brokers based on partitions.
Some of the producer properties are bootstrap.servers, acks, batch.size, linger.ms, key.serializer, value.serializer, and many more. We will discuss some of these properties later in this article.
A message that should be written to Kafka is referred to as a producer record. A producer record contains the name of the topic it should be written to and the value of the record. Other fields like partition, timestamp, and key are optional.
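The required and optional fields of a producer record can be sketched as a simple data class. This is an illustration of the record's shape, not the actual client API; the field names follow the description above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ProducerRecord:
    """Sketch of a Kafka producer record: topic and value are required;
    key, partition, and timestamp are optional."""
    topic: str
    value: bytes
    key: Optional[bytes] = None
    partition: Optional[int] = None
    timestamp: Optional[int] = None

# A minimal record: only topic and value are set.
record = ProducerRecord(topic="orders", value=b"order-123")
```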
Broker and Metadata Discovery
Any broker in a Kafka cluster can act as a bootstrap server. Generally, a list of bootstrap servers is passed instead of just one server. At least two bootstrap servers are recommended.
In order to send a producer record to the appropriate broker, the producer first establishes a connection to one of the bootstrap servers. The bootstrap server returns a list of all the brokers available in the cluster, along with metadata details like topics, partitions, replication factors, and so on. Based on this list of brokers and metadata, the producer identifies the leader broker that hosts the leader partition for the producer record and writes the record to that broker.
The diagram below shows the workflow of a producer.

The workflow of a producer involves five important steps:
- Serialize
- Partition
- Compress
- Accumulate records
- Group by broker and send
Serialize
In this step, the producer record gets serialized based on the serializers passed to the producer. Both the key and the value are serialized using the configured serializers. Commonly used serializers include the string, ByteArray, and ByteBuffer serializers.
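A string serializer is essentially a text-to-bytes encoding step. The sketch below mimics what Kafka's StringSerializer does (UTF-8 encoding); the function name is illustrative.

```python
def serialize_string(value: str) -> bytes:
    """String serializer sketch: encode text to UTF-8 bytes,
    as Kafka's StringSerializer does by default."""
    return value.encode("utf-8")

# Key and value are serialized independently with their own serializers.
key_bytes = serialize_string("user-42")
value_bytes = serialize_string("clicked checkout")
```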
Partition
In this step, the producer decides which of the topic's partitions the record should be written to. By default, the murmur2 algorithm is used for partitioning: it generates a hash of the key that is passed, and the partition is chosen based on that hash. If no key is passed, partitions are chosen in a round-robin fashion.
It's important to understand that by using the same key for a set of records, Kafka ensures those messages are written to the same partition in the order received, for a given number of partitions. If you want to retain the order of messages received, it's important to use an appropriate key for them. Custom partitioners can also be passed to the producer to control which partition a message is written to.
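The partitioning logic above can be sketched in a few lines. Note the hedge: Kafka uses murmur2, but this sketch substitutes the stdlib's zlib.crc32 purely for illustration; the point is that a keyed record hashes deterministically to one partition, while keyless records rotate round-robin.

```python
import zlib
from itertools import count

_round_robin = count()  # shared counter for keyless records

def choose_partition(key: bytes, num_partitions: int) -> int:
    """Pick a partition for a record. A keyed record is hashed so the
    same key always lands on the same partition (Kafka uses murmur2;
    zlib.crc32 is used here only for illustration). A keyless record
    falls back to round-robin."""
    if key is not None:
        return zlib.crc32(key) % num_partitions
    return next(_round_robin) % num_partitions
```

Because the hash is deterministic, records sharing a key preserve their relative order within one partition.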
Compress
In this step, the producer record is compressed before it's written to the record accumulator. By default, compression is not enabled in a Kafka producer. The supported compression types are gzip, snappy, lz4, and, in newer Kafka versions, zstd.
Compression enables faster transfer not only from producer to broker but also during replication. Compression helps achieve better throughput, lower latency, and better disk utilization.
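The payoff of compression is easy to see on repetitive payloads such as JSON events. The sketch below uses the stdlib's gzip (one of the codecs Kafka supports) on a synthetic batch; the payload is invented for illustration.

```python
import gzip

# A batch of repetitive JSON-like events compresses very well.
payload = b'{"event": "page_view", "page": "/home"}' * 100
compressed = gzip.compress(payload)

# Fewer bytes on the wire means better throughput per request.
savings = 1 - len(compressed) / len(payload)
```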
Accumulate Records
In this step, records are accumulated in a buffer per partition of a topic and grouped into batches based on the producer's batch.size property. Each partition of a topic gets a separate accumulator/buffer.
Group by Broker and Send
In this step, the batches in the record accumulator are grouped by the broker to which they are to be sent. The records in a batch are sent based on the batch.size and linger.ms properties: a batch is sent either when the defined batch size is reached or when the defined linger time has elapsed, whichever comes first.
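The batch-size-or-linger-time condition can be sketched with a toy accumulator. The class and method names are illustrative, not Kafka's internal API; the point is only the readiness check: a batch is ready once it holds batch_size records or linger_ms has elapsed since its first record.

```python
import time
from collections import defaultdict

class RecordAccumulator:
    """Toy accumulator: one buffer per (topic, partition). A batch is
    ready to send once it reaches batch_size records or linger_ms has
    elapsed since the first record was appended."""
    def __init__(self, batch_size: int = 3, linger_ms: int = 100):
        self.batch_size = batch_size
        self.linger_ms = linger_ms
        self.buffers = defaultdict(list)
        self.first_append = {}

    def append(self, topic: str, partition: int, record: bytes) -> None:
        tp = (topic, partition)
        if tp not in self.first_append:
            self.first_append[tp] = time.monotonic()
        self.buffers[tp].append(record)

    def ready(self, topic: str, partition: int) -> bool:
        tp = (topic, partition)
        start = self.first_append.get(tp, time.monotonic())
        elapsed_ms = (time.monotonic() - start) * 1000
        return len(self.buffers[tp]) >= self.batch_size or elapsed_ms >= self.linger_ms
```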
Duplicate Message Detection
Producers may send a duplicate message when a message was committed by Kafka but the acknowledgment never reached the producer, for example due to a network failure. From Kafka 0.11 on, to avoid duplicate messages in this scenario, Kafka tracks each message by its producer ID and sequence number. When a message arrives with the same producer ID and sequence number as an already-committed message, Kafka treats it as a duplicate and does not commit it again, but it still sends an acknowledgment back to the producer so the producer can treat the message as sent.
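The broker-side dedup logic described above can be sketched as follows. This is a simplified model, not Kafka's implementation: it tracks the highest sequence number committed per producer ID and acknowledges, but does not re-commit, a resend.

```python
class Broker:
    """Toy broker-side dedup: remember the highest sequence number
    committed per producer ID; a resend of an already-committed
    sequence is acknowledged but not appended to the log again."""
    def __init__(self):
        self.last_seq = {}  # producer_id -> highest committed sequence
        self.log = []

    def receive(self, producer_id: str, seq: int, message: bytes) -> str:
        if seq <= self.last_seq.get(producer_id, -1):
            return "ack"  # duplicate: ack so the producer moves on
        self.last_seq[producer_id] = seq
        self.log.append(message)
        return "ack"
```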
A Few Other Producer Properties
- buffer.memory – manages the buffer memory allocated to the producer.
- retries – the number of times to retry sending a message. The default is 0. Retries may cause out-of-order messages.
- max.in.flight.requests.per.connection – the number of messages that can be sent without waiting for an acknowledgment. The default is 5. Set this to 1 to avoid out-of-order messages caused by retries.
- max.request.size – the maximum size of a message. The default is 1 MB.
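Putting the properties together, a configuration might look like the sketch below. The keys are the Java client's property names; the values are hypothetical tuning choices, not recommendations.

```python
# Illustrative producer configuration. Keys are Kafka's Java client
# property names; the values here are example choices, not defaults
# except where noted.
producer_config = {
    "bootstrap.servers": "broker1:9092,broker2:9092",  # at least two recommended
    "key.serializer": "org.apache.kafka.common.serialization.StringSerializer",
    "value.serializer": "org.apache.kafka.common.serialization.StringSerializer",
    "acks": "all",
    "buffer.memory": 33554432,   # 32 MB (the default)
    "retries": 3,
    "max.in.flight.requests.per.connection": 1,  # preserve ordering under retries
    "max.request.size": 1048576,  # 1 MB (the default)
}
```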
Based on the producer workflow and producer properties, tune the configuration to achieve the desired results. In particular, focus on the properties below:
- batch.size – the batch size, in bytes, per request.
- linger.ms – the time to wait before sending the current batch.
- compression.type – the compression applied to messages.
In Part 3 of this series, we'll look at Kafka producer delivery semantics and how to tune some of the producer properties to achieve our desired results.