Kafka Internals: Topics and Partitions

We take a look under the hood of Apache Kafka to better understand how this popular framework uses topics and partitions.

By Arun Lingala · Apr. 24, 19 · Analysis

Let's start by discussing how messages are stored in Kafka. When it comes to storage in Kafka, two terms always come up: topic and partition.

Topic

A topic is a logical grouping of Partitions.

Partition

A partition is the actual storage unit for Kafka messages; it can be thought of as an ordered, append-only message queue. The number of partitions per topic is set when the topic is created. Messages in a partition are split into multiple segments, which makes it easier to find a message by its offset. The default segment size is large, 1 GB, and is configurable (`log.segment.bytes` on the broker, `segment.bytes` per topic). Each segment is composed of the following files:

  1. Log: stores the messages themselves.
  2. Index: stores message offsets and their starting positions in the log file.
  3. Timeindex: maps timestamps to offsets; not relevant to this discussion.

Let’s imagine there are six messages in a partition (offsets 0 through 5) and that the segment size is configured so that a segment holds only three messages (for the sake of explanation). The partition then contains the following segments:

  • Segment 00 contains 00.log, 00.index and 00.timeindex files (offsets 0 to 2)
  • Segment 03 contains 03.log, 03.index and 03.timeindex files (offsets 3 to 5)
  • Segment 06 contains 06.log, 06.index and 06.timeindex files (the active segment, where the next message, offset 6, is written)

The segment name indicates the offset of the first message in the segment.
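The naming rule can be sketched in a few lines of Python. This is only an illustration of the example above, not broker code: the function names are made up for this sketch, and it assumes a segment holds a fixed number of messages and that a new segment is opened as soon as the previous one fills up. On disk, Kafka zero-pads the base offset in the file name to 20 digits, which the sketch mirrors.

```python
def segment_base_offsets(message_count, messages_per_segment):
    """Base offsets of all segments, including the active one.

    Assumes (as in the article's example) that offsets start at 0 and a
    new segment is opened as soon as the previous one is full.
    """
    # The next offset to be written is message_count, so cover 0..message_count.
    return list(range(0, message_count + 1, messages_per_segment))


def segment_files(base_offset):
    """File names of one segment; the stem is the zero-padded base offset."""
    stem = f"{base_offset:020d}"
    return [f"{stem}.log", f"{stem}.index", f"{stem}.timeindex"]


# Six messages, three per segment: segments start at offsets 0, 3 and 6.
for base in segment_base_offsets(6, 3):
    print(base, segment_files(base))
```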

Sample log file:

Starting offset: 0

offset: 0 position: 0 CreateTime: 1533443377944 isvalid: true keysize: -1 valuesize: 11 producerId: -1 headerKeys: [] payload: Hello World
offset: 1 position: 79 CreateTime: 1533462689974 isvalid: true keysize: -1 valuesize: 6 producerId: -1 headerKeys: [] payload: intuit


Sample index file:

offset: 0 position: 0
offset: 2 position: 79


Let’s discuss the time complexity of finding a message in a topic, given its partition and offset.

| Step | Complexity | How |
| --- | --- | --- |
| Find partition | O(1) | The broker locates the partition directly from its name (topic name plus partition number). |
| Find segment in partition | O(log₂ SN), where SN is the number of segments in the partition | The segment's log file name is the offset of its first message, so the right segment for a given offset can be found with a binary search over the segment names. |
| Find message in segment | O(log₂ MN), where MN is the number of entries in the index file (at most the number of messages in the log file) | The index file lists message offsets and their byte positions in the log file in ascending offset order, so the offset can be located with a binary search. In practice the index is sparse (an entry is added roughly every `index.interval.bytes` of log data, 4096 bytes by default), so the search yields the nearest preceding entry and the broker scans forward in the log from there. |

So, the total complexity is O(1) + O(log₂ SN) + O(log₂ MN).
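The two binary searches can be sketched with Python's standard `bisect` module. This is a simplified model rather than Kafka's implementation: `find_segment` and `find_index_entry` are hypothetical names, segments are represented only by their sorted base offsets, and an index is a sorted list of `(offset, position)` pairs. The positions below reuse the values from the sample log file above.

```python
from bisect import bisect_right


def find_segment(base_offsets, target):
    """Return the base offset of the segment that holds `target`.

    `base_offsets` must be sorted ascending; the search is O(log SN).
    """
    i = bisect_right(base_offsets, target) - 1
    if i < 0:
        raise KeyError(f"offset {target} is before the first segment")
    return base_offsets[i]


def find_index_entry(index_entries, target):
    """Return the (offset, position) entry with the largest offset <= target.

    `index_entries` must be sorted by offset; the search is O(log MN).
    With a sparse index the caller would scan the log forward from the
    returned position until it reaches `target`.
    """
    # (target, inf) sorts after every entry whose offset equals target.
    i = bisect_right(index_entries, (target, float("inf"))) - 1
    if i < 0:
        raise KeyError(f"offset {target} is before the first index entry")
    return index_entries[i]


segments = [0, 3, 6]              # base offsets from the example above
index_00 = [(0, 0), (1, 79)]      # (offset, byte position) for segment 00

base = find_segment(segments, 1)
print(base, find_index_entry(index_00, 1))
```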

Replication

A topic's replication factor is set when the topic is created. Assume there are two brokers in a cluster, and a topic, `freblogg`, is created with a replication factor of 2, so every partition of the topic has two copies, one on each broker.

Among the replicas of a partition, one is the `leader`, and the rest are `followers` that serve as backups. By default, consumers read only from the leader replica (Kafka 2.4 and later can optionally let consumers fetch from followers). The leader and followers of a partition never reside on the same broker; otherwise a single broker failure would take out the partition and its backup together. Followers continuously fetch from the leader, and those that have caught up form the in-sync replica set (ISR). When a leader goes down, the cluster controller elects a new leader from the in-sync followers. A topic is distributed across the cluster because its partitions, and their replicas, reside on different brokers.
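To make the leader/follower layout concrete, here is a toy placement model in Python. It is an assumption-laden sketch, not Kafka's actual assignment or election algorithm: `assign_replicas` and `fail_over` are hypothetical helpers, replicas are placed round-robin, every follower is treated as in sync, and the broker names and the partition count of 2 are invented for the example.

```python
def assign_replicas(num_partitions, brokers, replication_factor):
    """Place each partition's replicas on distinct brokers, round-robin."""
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed the number of brokers")
    assignment = {}
    for p in range(num_partitions):
        # Consecutive brokers (wrapping around) are always distinct here.
        replicas = [brokers[(p + r) % len(brokers)] for r in range(replication_factor)]
        assignment[p] = {"leader": replicas[0], "followers": replicas[1:]}
    return assignment


def fail_over(state, failed_broker):
    """Return the partition state after `failed_broker` goes down."""
    survivors = [b for b in state["followers"] if b != failed_broker]
    if state["leader"] != failed_broker:
        return {"leader": state["leader"], "followers": survivors}
    if not survivors:
        raise RuntimeError("no follower available to take over as leader")
    # Promote the first surviving follower (assumed to be in sync).
    return {"leader": survivors[0], "followers": survivors[1:]}


layout = assign_replicas(2, ["broker-1", "broker-2"], replication_factor=2)
print(layout)
print(fail_over(layout[0], "broker-1"))
```

With two brokers and a replication factor of 2, each broker leads one partition and follows the other, so losing either broker still leaves a full copy of the topic.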

Parallelism With Partitions

Within a consumer group, Kafka lets only one consumer read from a given partition, which guarantees that the partition's messages are consumed in order. Note that this ordering holds per partition only; the order of message consumption is not guaranteed at the topic level. To increase consumption parallelism, increase the number of partitions and spawn consumers accordingly; consumers beyond the partition count sit idle.
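The one-consumer-per-partition rule can be illustrated with a small assignment function. This is a simplified round-robin sketch, not one of Kafka's built-in assignors; `assign_partitions` and the consumer names are hypothetical.

```python
def assign_partitions(partitions, consumers):
    """Give every partition to exactly one consumer of the group."""
    if not consumers:
        raise ValueError("a consumer group needs at least one consumer")
    assignment = {c: [] for c in consumers}
    for i, partition in enumerate(sorted(partitions)):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


# Three partitions, two consumers: one consumer reads two partitions.
print(assign_partitions([0, 1, 2], ["c1", "c2"]))
# Two partitions, three consumers: the third consumer has nothing to read.
print(assign_partitions([0, 1], ["c1", "c2", "c3"]))
```

The second call shows why adding consumers alone does not help: parallelism is capped by the number of partitions.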
