Best Practices for Scaling Kafka-Based Workloads
Kafka is a widely adopted technology with a rich set of features and capabilities. This article explains best practices for Kafka producer and consumer configurations.
Apache Kafka is known for its ability to process a huge quantity of events in real time. However, to handle millions of events, we need to follow certain best practices while implementing both Kafka producer services and consumer services.
Before you start using Kafka in your projects, let's understand when to use it:
- High-volume event streams. When your application/service generates a continuous stream of events such as user activity, website clicks, sensor readings, logs, or stock market updates, Kafka's ability to handle large volumes with low latency is very useful.
- Real-time analytics. Kafka is especially helpful for building real-time data processing pipelines, where data needs to be processed as soon as it arrives. It allows you to stream data to engines like Kafka Streams, Apache Spark, or Apache Flink for immediate insights and for stream or batch processing.
- Decoupling applications. Acting as a central message hub, Kafka can decouple different parts of an application, enabling independent development and scaling of services and encouraging a clear separation of responsibilities.
- Data integration across systems. When integrating distributed systems, Kafka can efficiently transfer data between different applications across teams/projects, acting as a reliable data broker.
Key Differences from Other Queuing Systems
Below are the key differences between Apache Kafka and queuing systems like ActiveMQ, ZeroMQ, and VerneMQ:
Persistent Storage
Kafka stores events in a distributed commit log, allowing you to replay data at any time and keeping data durable even in the case of system/node failures, unlike some traditional message queues that rely on in-memory storage such as Redis.
Partitioning
Topic data is partitioned across brokers, enabling parallel processing of large data streams and high throughput. Each consumer thread can connect to an individual partition, promoting horizontal scalability.
Consumer Groups
Multiple consumer groups can subscribe to the same topic, each tracking its own offsets within a partition, which allows different teams to consume and process the same data for different purposes. Some examples are:
- The ML team consumes user activity events to detect suspicious activity
- The recommendation team uses the same events to build recommendations
- The ads team uses them to generate relevant advertisements
Kafka Producer Best Practices
Batch Size and Linger Time
By configuring batch.size and linger.ms, you can increase the throughput of your Kafka producer. batch.size is the maximum size of a batch in bytes; the producer will try to fill a batch up to this size before sending it to the brokers. linger.ms determines the maximum time in milliseconds that the producer will wait for additional messages to be added to the batch before sending it.
Tuning batch.size and linger.ms significantly helps the performance of the system by controlling how much data is accumulated before it is sent, allowing for better throughput and reduced latency when dealing with large volumes of data. It can also introduce slight delays depending on the chosen values. In particular, a large batch size combined with an appropriate linger.ms can optimize data transfer efficiency.
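As an illustration, below is a minimal sketch of a Java producer using these two settings; the broker address, topic name, and the 64 KB / 20 ms values are assumptions to be tuned against your own workload.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class BatchingProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // Accumulate up to 64 KB per partition before sending (default is 16 KB).
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 64 * 1024);
        // Wait up to 20 ms for a batch to fill before sending (default is 0).
        props.put(ProducerConfig.LINGER_MS_CONFIG, 20);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Hypothetical topic and payload.
            producer.send(new ProducerRecord<>("user-activity-events", "user-123", "page_view"));
        }
    }
}
```

Note that the producer sends a batch as soon as either limit is reached, so a larger batch.size only pays off if linger.ms gives the batch time to fill.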
Compression
Another way to increase throughput is to enable compression through the compression.type configuration. The producer can compress data with gzip, snappy, or lz4 before sending it to the brokers. For large data volumes, this trades a small amount of compression overhead for better network efficiency; it saves bandwidth and increases the throughput of the system. Additionally, by setting the appropriate key and value serializers, you can ensure data is serialized in a format compatible with your consumers.
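As a sketch, compression and serializers can be layered onto the same producer properties; the lz4 choice and the helper class/method names here are illustrative, not prescriptive.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

final class CompressionSettings {
    // Hypothetical helper: adds compression and serializers to an existing producer config.
    static void apply(Properties props) {
        // Compress batches before they leave the producer; "gzip" and "snappy" are also valid choices.
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");
        // Serialize keys and values in a format your consumers can read back.
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
    }
}
```

Compression is applied per batch, so it pairs naturally with the batching settings above.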
Retries and Idempotency
To ensure the reliability of the Kafka producer, you should enable retries and idempotency. By configuring retries
, the producer can automatically resend any batch of data that does not get ack
by the broker within a specified number of tries.
Acknowledgments
The acks configuration controls the level of acknowledgment required from the broker before a message is considered sent successfully. By choosing the right acks level, you can control your application's reliability. Below are the accepted values for this configuration.
- 0 – fastest, but no guarantee of message delivery.
- 1 – message is acknowledged once it's written to the leader broker, providing basic reliability.
- all – message is considered delivered only when all replicas have acknowledged it, ensuring high durability.
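Here is a hedged sketch that ties the retry, idempotency, and acknowledgment settings together; the specific values (including the delivery.timeout.ms bound on retries) are assumptions, and note that idempotence requires acks=all.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

final class ReliabilitySettings {
    // Hypothetical helper: reliability-focused settings for an existing producer config.
    static void apply(Properties props) {
        // Resend unacknowledged batches; a high retry count with a bounded delivery timeout
        // is usually safer than a small fixed number of attempts.
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 120_000); // give up after 2 minutes (assumed bound)
        // Idempotency prevents retries from writing duplicate records.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
        // Wait for all in-sync replicas to acknowledge each write.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
    }
}
```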
Configuration Tuning Based on Workload
You should track metrics like message send rate, batch size, and error rates to identify performance bottlenecks. Regularly review and adjust producer settings as your features or data change.
Kafka Consumer Best Practices
Consumer Groups
Every Kafka consumer should belong to a consumer group; a consumer group can contain one or more consumers. By adding more consumers to the group, you can scale out to read from all partitions, allowing you to process a huge volume of data. The group.id configuration identifies the consumer group to which a consumer belongs, allowing for load balancing across multiple consumers consuming from the same topic. The best practice is to use meaningful group IDs to easily identify consumer groups within your application.
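For illustration only, a meaningful group ID might look like the following; the service name is hypothetical.

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

final class GroupIdSettings {
    // Hypothetical helper: a descriptive group.id makes the owning service easy to identify.
    static void apply(Properties props) {
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "recommendation-service-user-activity");
    }
}
```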
Offset Committing
You can control when your application commits offsets, which helps to avoid data loss. There are two ways to commit offsets: automatic and manual. For high-throughput applications, you should consider manual commits for better control, as shown in the sketch after the list below.
- auto.offset.reset – defines what to do when a consumer starts consuming a topic with no committed offsets (e.g., a new topic or a consumer joining a group for the first time). Options include earliest (read from the beginning), latest (read from the end), or none (throw an error). Choose earliest for most use cases to avoid missing data when a new consumer joins a group. This controls how a consumer starts consuming data, ensuring proper behavior when a consumer is restarted or added to a group.
- enable.auto.commit – controls whether offsets are committed automatically at a periodic interval. Set it to false for most production scenarios where you need high reliability, and manually commit offsets within your application logic for stronger processing guarantees. This gives you more control over when data is considered processed.
- auto.commit.interval.ms – the interval in milliseconds at which offsets are automatically committed when enable.auto.commit is set to true. Adjust it based on your application's processing time to avoid data loss due to unexpected failures.
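Below is a minimal sketch of a consumer that disables auto-commit and commits offsets manually after processing each batch; the broker address, topic, group ID, and poll timeout are assumptions.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ManualCommitConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "recommendation-service");   // hypothetical group ID
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest"); // start from the beginning when no offsets exist
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);     // commit manually after processing

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("user-activity-events")); // hypothetical topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // your application logic
                }
                // Commit only after the whole batch has been processed, so a crash
                // mid-batch leads to reprocessing rather than data loss.
                consumer.commitSync();
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.printf("offset=%d key=%s value=%s%n", record.offset(), record.key(), record.value());
    }
}
```

Committing after the batch is processed means a failure mid-batch leads to reprocessing rather than data loss, which is usually the right trade-off for reliability-focused workloads.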
Fetch Size and Max Poll Records
To control the amount of data retrieved in each request, configure fetch.min.bytes and max.poll.records. Increasing these values can improve the throughput of your application while reducing CPU usage and the number of calls made to brokers.
- fetch.min.bytes – the minimum number of bytes the broker should return for a fetch request. Set it high enough to prevent small, frequent requests and unnecessary network calls, but not so high that it delays delivery when traffic is low. It helps optimize network efficiency.
- fetch.max.bytes – the maximum number of bytes to pull from a broker in a single fetch request. Adjust it based on available memory to avoid overloading consumer workers; it caps the amount of data retrieved in a single poll and prevents memory issues.
- max.poll.interval.ms – the maximum delay allowed between calls to poll(); if the consumer does not poll within this time, it is considered failed and its partitions are reassigned. Set it high enough to cover your processing time so that slow processing does not trigger unnecessary rebalances or consumer hangs/lags. (Sometimes, Kubernetes pods may even restart if liveness probes are impacted by such stalls.)
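A sketch of these throughput-oriented consumer settings follows; the values shown are assumptions (several match the client defaults) and should be adjusted to your message sizes and processing time.

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

final class FetchTuningSettings {
    // Hypothetical helper: throughput-oriented fetch and poll settings for an existing consumer config.
    static void apply(Properties props) {
        props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 50_000);           // wait for ~50 KB before the broker responds
        props.put(ConsumerConfig.FETCH_MAX_BYTES_CONFIG, 50 * 1024 * 1024); // cap a single fetch at 50 MB
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 500);             // records handed to the app per poll()
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 300_000);     // allow up to 5 minutes between poll() calls
    }
}
```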
Partition Assignment
This is the strategy used to assign partitions (partition.assignment.strategy) to consumers within a group (e.g., range, roundrobin). Use range for most scenarios to evenly distribute partitions across consumers. This enables balanced load distribution among consumers in a group.
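As a final sketch, the strategy is configured by assignor class name; RangeAssignor shown here corresponds to the range strategy mentioned above, and RoundRobinAssignor is the roundrobin alternative.

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.RangeAssignor;

final class AssignmentSettings {
    // Hypothetical helper: choose how partitions are assigned within the consumer group.
    static void apply(Properties props) {
        // RangeAssignor implements the "range" strategy; RoundRobinAssignor is the "roundrobin" alternative.
        props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG, RangeAssignor.class.getName());
    }
}
```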
Here are some important considerations before using Kafka:
- Complexity. Implementing Kafka requires a deeper understanding of distributed systems concepts like partitioning and offset management due to its advanced features and configurations.
- Monitoring and management. Implementing monitoring and Kafka cluster management is important to ensure high availability and performance.
- Security. Implementing robust security practices to protect sensitive data flowing through the Kafka topics is also important.
Implementing these best practices can help you scale your Kafka-based applications to handle millions/billions of events. However, it's important to remember that the optimal configuration can vary based on the specific requirements of your application.