Best Practices for Scaling Kafka-Based Workloads

Kafka is a widely adopted technology with a rich set of features and capabilities. This article explains best practices for Kafka producer and consumer configurations.

By Narendra Lakshmana gowda · Feb. 06, 25 · Analysis

Apache Kafka is known for its ability to process huge volumes of events in real time. However, to handle millions of events, we need to follow certain best practices when implementing both Kafka producer services and consumer services.

Before you start using Kafka in your projects, let's understand when to use it:

  • High-volume event streams. When your application/service generates a continuous stream of events such as user activity, website clicks, sensor readings, logs, or stock market updates, Kafka's ability to handle large volumes with low latency is very useful.
  • Real-time analytics. Kafka is especially helpful for building real-time data processing pipelines, where data needs to be processed as soon as it arrives. It allows you to stream data to analytics engines like Kafka Streams, Apache Spark, or Apache Flink for immediate insights and stream or batch processing.
  • Decoupling applications. Acting as a central message hub, Kafka can decouple different parts of an application, enabling independent development and scaling of services and encouraging a clean separation of responsibilities.
  • Data integration across systems. When integrating distributed systems, Kafka can efficiently transfer data between different applications across teams/projects, acting as a reliable data broker.

Key Differences from Other Queuing Systems 

Below is how Apache Kafka differs from systems like ActiveMQ, ZeroMQ, and VerneMQ:

Persistent Storage

Kafka stores events in a distributed log, allowing data to be replayed at any time and persisted even in the case of system or node failures, unlike some traditional message queues, which might rely on in-memory storage such as Redis.

Partitioning

Topics are divided into partitions spread across brokers, enabling parallel processing of large data streams and high throughput. Each consumer thread can read from its own partitions, promoting horizontal scalability.

Consumer Groups

Multiple consumer groups can subscribe to the same topic, each tracking its own offsets within a partition, which allows different teams to consume and process the same data independently for different purposes. Some examples are:

  • User activity consumed by ML teams to detect suspicious activity
  • Recommendation team to build recommendations
  • Ads team to generate relevant advertisements

Kafka Producer Best Practices

Batch Size and Linger Time

By configuring batch.size and linger.ms, you can increase the throughput of your Kafka producer. batch.size is the maximum size of a batch in bytes; the producer will try to fill a batch up to this size before sending it to the brokers. 

linger.ms determines the maximum time in milliseconds that the producer will wait for additional messages to be added to a batch before sending it. 

Tuning batch.size and linger.ms significantly helps system performance by controlling how much data is accumulated before it is sent, allowing for better throughput and fewer network round trips when dealing with large volumes of data, though it can also introduce slight delays depending on the chosen values. In particular, a larger batch size combined with an appropriate linger.ms can optimize data transfer efficiency.
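As a reference point, here is a minimal, self-contained producer sketch using the Java client that shows how these two settings are applied. The broker address, topic name, and the specific batch.size/linger.ms values are illustrative assumptions, not tuned recommendations.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class BatchingProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 64 * 1024); // accumulate up to 64 KB per partition batch (example value)
        props.put(ProducerConfig.LINGER_MS_CONFIG, 10);         // wait up to 10 ms for a batch to fill (example value)

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("user-activity", "user-123", "page-view")); // hypothetical topic and record
        }
    }
}

The producer snippets in the following sections assume they extend this same props object.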

Compression

Another way to increase throughput is to enable compression through the compression.type configuration. The producer can compress data with gzip, snappy, or lz4 before sending it to the brokers. For large data volumes, this trades a small amount of CPU overhead for better network efficiency; it saves bandwidth and increases the overall throughput of the system. Additionally, by setting appropriate key and value serializers, you can ensure data is serialized in a format compatible with your consumers.
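A sketch of the relevant property, added to the producer configuration above (which already sets the key/value serializers); lz4 is shown as one reasonable choice, not a prescribed codec:

// Added to the producer Properties from the previous sketch.
props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4"); // or "gzip" / "snappy", depending on CPU vs. bandwidth trade-offs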

Retries and Idempotency

To ensure the reliability of the Kafka producer, you should enable retries and idempotency. By configuring retries, the producer can automatically resend any batch of data that is not acknowledged by the broker, up to a specified number of attempts, and enabling idempotence ensures those retries do not create duplicate records.
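A sketch of the corresponding properties, again layered onto the producer configuration above; the timeout value is an illustrative assumption:

// Added to the producer Properties from the earlier sketch.
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);     // broker de-duplicates retried batches
props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);   // keep retrying until delivery.timeout.ms expires
props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 120_000); // overall upper bound on send + retries (example: 2 minutes)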

Acknowledgments

This configuration controls the level of acknowledgment required from the broker before considering a message sent successfully. By choosing the right acks level, you can control your application's reliability. The accepted values are listed below, followed by a short configuration sketch.

  • 0 – fastest, but no guarantee of message delivery.
  • 1 – message is acknowledged once it's written to the leader broker, providing basic reliability.
  • all – message is considered delivered only when all replicas have acknowledged it, ensuring high durability.
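The property itself looks like this in the same producer setup; "all" is shown because it pairs naturally with idempotence, but the right level depends on your durability and latency requirements:

// Added to the producer Properties from the earlier sketch.
props.put(ProducerConfig.ACKS_CONFIG, "all"); // "0" or "1" trade durability for lower latency; idempotence requires "all"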

Configuration Tuning Based on Workload

You should track metrics such as the message send rate, average batch size, and error rate to identify performance bottlenecks, and regularly revisit producer settings as your features or data change.
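One lightweight way to inspect these numbers is the client's built-in metrics API; the metric names below (record-send-rate, batch-size-avg, record-error-rate) are standard producer metrics in the Java client, but verify them against your client version, and in production export them through JMX or your metrics stack instead of printing them.

// Prints a few producer metrics from the producer created in the earlier sketch.
producer.metrics().forEach((name, metric) -> {
    String n = name.name();
    if (n.equals("record-send-rate") || n.equals("batch-size-avg") || n.equals("record-error-rate")) {
        System.out.printf("%s (%s) = %s%n", n, name.group(), metric.metricValue());
    }
});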

Kafka Consumer Best Practices

Consumer Groups

Every Kafka consumer should belong to a consumer group; a consumer group can contain one or more consumers. By creating more consumers in the group, you can scale up to read from all partitions, allowing you to process a huge volume of data. The group.id configuration helps identify the consumer group to which the consumer belongs, allowing for load balancing across multiple consumers consuming from the same topic. The best practice is to use meaningful group IDs to easily identify consumer groups within your application. 
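A minimal consumer sketch with a descriptive group.id; the broker address, group name, and topic are placeholder assumptions:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class UserActivityConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "fraud-detection-service"); // meaningful, easily identifiable group ID
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("user-activity"));   // hypothetical topic
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("partition=%d offset=%d value=%s%n",
                        record.partition(), record.offset(), record.value());
            }
        }
    }
}

Running more instances of this consumer with the same group.id spreads the topic's partitions across them. The consumer snippets in the following sections assume they extend this same setup.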

Offset Committing

You can control when your application commits offsets, which helps avoid data loss. There are two ways to commit offsets: automatic and manual. For high-throughput applications, you should consider manual commits for better control; the relevant settings are listed below, followed by a short sketch of the manual-commit pattern.

  • auto.offset.reset – defines what to do when a consumer starts consuming a topic with no committed offsets (e.g., a new topic or a consumer joining a group for the first time). Options are earliest (read from the beginning), latest (read from the end), or none (throw an error). Choose earliest for most use cases to avoid missing data when a new consumer joins a group.
  • enable.auto.commit – controls whether offsets are committed automatically at a fixed interval. Generally, we set this to false in production scenarios that need strong reliability and commit offsets manually within the application logic, so a record is only marked as processed after it has actually been handled. This gives you much finer control over offset commits and data processing.
  • auto.commit.interval.ms – interval in milliseconds at which offsets are automatically committed if enable.auto.commit is set to true. Modify based on your application's processing time to avoid data loss due to unexpected failure.
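A rough sketch of the manual-commit pattern, extending the consumer setup from the previous example; processRecord is a hypothetical placeholder for your application logic:

// Added to the consumer Properties from the earlier sketch.
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest"); // start from the beginning when no committed offset exists
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);     // we commit manually below

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
    for (ConsumerRecord<String, String> record : records) {
        processRecord(record); // hypothetical application logic
    }
    if (!records.isEmpty()) {
        consumer.commitSync(); // offsets are committed only after the whole batch has been processed
    }
}

This pattern gives at-least-once delivery: if the process crashes before commitSync, the uncommitted records are redelivered and reprocessed.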

Fetch Size and Max Poll Records

To control how much data is retrieved in each request, configure fetch.min.bytes and max.poll.records. Increasing these values can improve the throughput of your application while reducing CPU usage and the number of calls made to the brokers; a configuration sketch follows the list below.

  • fetch.min.bytes – the minimum amount of data the broker should return for a fetch request. Raising it slightly lets the broker wait until enough data has accumulated, which optimizes network efficiency by preventing many small, frequent requests, but setting it too high adds latency.
  • fetch.max.bytes – the maximum amount of data returned for a single fetch request. Adjust it based on available memory to avoid overloading consumer workers; capping how much data is retrieved at once prevents memory issues.
  • max.poll.interval.ms – the maximum delay allowed between calls to poll() before the consumer is considered failed and its partitions are reassigned. Set it high enough to cover your per-batch processing time so healthy consumers are not kicked out of the group, but low enough that a genuinely stuck consumer is detected. (Sometimes, K8s pods may restart if liveness probes are impacted by long processing pauses.)
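A sketch of these settings layered onto the consumer configuration above; the numbers are illustrative assumptions to tune against your message sizes and processing time:

// Added to the consumer Properties from the earlier sketch.
props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 1024);             // let the broker accumulate ~1 KB before responding
props.put(ConsumerConfig.FETCH_MAX_BYTES_CONFIG, 50 * 1024 * 1024); // cap a single fetch response at ~50 MB
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 500);             // records handed back per poll() call
props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 300_000);     // allow up to 5 minutes between poll() calls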

Partition Assignment

This is the strategy used to assign partitions (partition.assignment.strategy) to consumers within a group (e.g., range, roundrobin). Use range for most scenarios to evenly distribute partitions across consumers. This enables balanced load distribution among consumers in a group.
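In the Java client, this property takes fully qualified assignor class names; RangeAssignor corresponds to the range strategy mentioned above, and RoundRobinAssignor or CooperativeStickyAssignor are alternatives worth evaluating:

// Added to the consumer Properties from the earlier sketch.
props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
        "org.apache.kafka.clients.consumer.RangeAssignor"); // or RoundRobinAssignor / CooperativeStickyAssignor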

Here are some important considerations before using Kafka:

  • Complexity. Implementing Kafka requires a deeper understanding of distributed systems concepts like partitioning and offset management due to its advanced features and configurations.
  • Monitoring and management. Implementing monitoring and Kafka cluster management is important to ensure high availability and performance.
  • Security. Implementing robust security practices to protect sensitive data flowing through the Kafka topics is also important.

Implementing these best practices can help you scale your Kafka-based applications to handle millions/billions of events. However, it's important to remember that the optimal configuration can vary based on the specific requirements of your application.
