Life Beyond Kafka With Apache Pulsar

Moving on — finding love after Kafka.

By Avaro Santos Andres · Oct. 04, 19 · Presentation

During all my years as a Solution Architect, I have built many streaming architectures, such as real-time data ETL, reactive microservices, log collection, and even AI-driven services, all using Kafka as a core part of their architecture. Kafka is a proven stream-processing platform used for many years at companies like LinkedIn, Microsoft, and Netflix. In many cases Kafka works very well, supports large amounts of data, and has a good community. Because of that, Kafka is used for many streaming scenarios.

However, due to the design of Kafka, all of my projects using Kafka have suffered from similar problems:

  • High latency.
  • Poor scalability.
  • Difficulty supporting a global architecture.
  • High OpEx.

Latency and Throughput

Latency, or the delay before a transfer of data begins, can be a nightmare for anyone working with data-intensive applications. As IoT-enabled applications, such as autonomous vehicles and even industrial inspection, become commonplace, the data generated from sensors will become too demanding for existing architectures.

Maintaining low latency while keeping up with ever-growing throughput requirements becomes a big challenge. As a result, data takes longer to move from devices to data centers, and the user experience degrades sharply.

Apache Pulsar shows notable improvements in both latency and throughput in comparison to Kafka. Pulsar delivers approximately 2.5 times the throughput and 40% lower latency than Kafka (*). Those differences are huge, and in critical systems they can mean success or failure.

There are many techniques that Pulsar uses to improve performance. The most important technique is used to handle tailing reads. In a scenario where readers are only interested in the most recent data, the readers are served from an in-memory cache in the serving layer (the Pulsar brokers), and only catch-up readers end up having to be served from the storage layer (Apache BookKeeper). This approach is key to improving the latency and throughput compared to systems such as Kafka.
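The tailing-read idea can be illustrated with a small toy model. This is only a sketch under stated assumptions: `ToyBroker`, its `cache_size` parameter, and the counters are hypothetical names invented for illustration, and a plain dictionary stands in for the storage layer. It is not how Pulsar is implemented, only a picture of which reads stay in memory.

```python
from collections import OrderedDict


class ToyBroker:
    """Toy serving layer with a bounded in-memory cache of recent entries."""

    def __init__(self, cache_size):
        self.cache_size = cache_size
        self.cache = OrderedDict()  # entry id -> payload, newest entries only
        self.storage = {}           # stands in for the storage layer (bookies)
        self.cache_hits = 0
        self.storage_reads = 0
        self.next_id = 0

    def publish(self, payload):
        entry_id = self.next_id
        self.next_id += 1
        self.storage[entry_id] = payload    # durable copy
        self.cache[entry_id] = payload      # recent copy kept in memory
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)  # evict the oldest cached entry
        return entry_id

    def read(self, entry_id):
        if entry_id in self.cache:
            self.cache_hits += 1            # tailing read: served from memory
            return self.cache[entry_id]
        self.storage_reads += 1             # catch-up read: goes to storage
        return self.storage[entry_id]


broker = ToyBroker(cache_size=3)
for i in range(10):
    broker.publish(f"event-{i}")

tailing = broker.read(9)   # a reader following the tip of the topic
catch_up = broker.read(0)  # a reader that has fallen far behind
print(broker.cache_hits, broker.storage_reads)
```

In this toy, only the reader that has fallen behind touches storage; readers at the tip never leave the broker's memory.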

If you want to dig deeper, Chris Bartholomew recently wrote a very good article benchmarking the latency of Apache Pulsar against Kafka.

Scalability Issues

Imagine you have thousands or millions of devices sending data to your data lake. This data must be managed with speed, security, and reliability. In addition, for legal reasons you must partition data by country, device, and city. These requirements seem reasonable, and in 2019, stream-processing platforms must be able to deal with them.

But how well do they cope? Kafka is not known to work well with thousands of topics and partitions, even when the data volume is modest. You can see how complicated it can be to try to solve performance challenges in these scenarios.

Fortunately, Pulsar is designed to serve over a million topics in a cluster. The key to scaling the number of topics is how data is stored. In Kafka, data for a topic is stored in dedicated files and directories, and as a result, Kafka has trouble scaling because I/O is scattered across the disk as these files are flushed from the page cache to disk periodically. In contrast, Pulsar stores data in bookies (BookKeeper servers), where messages from different topics are aggregated, sorted, and stored in large files and then indexed. With this design, Pulsar is able to scale to millions of topics.
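The difference between the two storage layouts can be sketched with in-memory lists standing in for files. The function names and the topic count are assumptions made for this illustration, and the sorting step that the text mentions is left out, so this is a simplified picture rather than the actual on-disk format of either system.

```python
from collections import defaultdict


def per_topic_layout(messages):
    """One append-only 'file' per topic: writes are spread over many files."""
    files = defaultdict(list)
    for topic, payload in messages:
        files[topic].append(payload)
    return files


def interleaved_layout(messages):
    """One shared log for all topics, plus an index of positions per topic."""
    log = []
    index = defaultdict(list)
    for topic, payload in messages:
        index[topic].append(len(log))  # where this entry lands in the shared log
        log.append(payload)
    return log, index


def read_topic(log, index, topic):
    """Recover a single topic from the shared log by following its index."""
    return [log[pos] for pos in index[topic]]


# 5,000 small messages spread over 1,000 hypothetical topics.
messages = [(f"topic-{i % 1000}", f"m{i}") for i in range(5000)]

files = per_topic_layout(messages)
log, index = interleaved_layout(messages)

print(len(files))  # number of separate files receiving writes
print(len(log))    # entries appended sequentially to one shared log
```

Both layouts return the same messages for any given topic. What changes is how many files receive writes: one per topic in the first case, a single sequential log in the second.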

Global Architectures

Another common error in many projects I have participated in is the limited scope of their initial design. When you begin to design the architecture, you are often focused on the ROI for the first year and on local impact. However, when future expansion to new countries becomes mandatory, you are often forced to expand that same infrastructure to new regions without a global architecture design.

Kafka brokers are designed to work together in a network in a single region or even availability zone. So, there is no easy way to work with a multi-datacenter architecture. In contrast, geo-replication is an out-of-the-box feature in Pulsar. Global clusters can be configured at the namespace level to replicate data among any number of clusters. Additionally, Pulsar’s multi-tenancy feature makes it possible to stand up one cluster for an enterprise while still providing isolation of data storage.
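To make the idea of replication configured per namespace concrete, here is a toy model. `ToyCluster`, `ToyNamespace`, and the tenant and namespace names are all hypothetical, and the copy is done synchronously for simplicity. It is a sketch of the concept, not of Pulsar's replication mechanism.

```python
class ToyCluster:
    """Stands in for one regional cluster; holds messages per topic."""

    def __init__(self, name):
        self.name = name
        self.topics = {}

    def append(self, topic, message):
        self.topics.setdefault(topic, []).append(message)


class ToyNamespace:
    """Hypothetical namespace whose policy lists the clusters to replicate to."""

    def __init__(self, name, replication_clusters):
        self.name = name
        self.replication_clusters = list(replication_clusters)

    def publish(self, origin, topic, message):
        if origin not in self.replication_clusters:
            raise ValueError(f"{origin.name} does not serve namespace {self.name}")
        full_topic = f"{self.name}/{topic}"
        # Assumption: the toy copies to every listed cluster immediately.
        for cluster in self.replication_clusters:
            cluster.append(full_topic, message)


eu, us, asia = ToyCluster("eu"), ToyCluster("us"), ToyCluster("asia")
global_ns = ToyNamespace("acme/global-events", [eu, us, asia])
local_ns = ToyNamespace("acme/eu-only", [eu])

global_ns.publish(eu, "orders", "order-1")     # replicated to all three clusters
local_ns.publish(eu, "invoices", "invoice-1")  # stays in the EU cluster

print(sorted(eu.topics))
print(sorted(us.topics))
```

The point of the sketch is that the replication scope is a property of the namespace, so data that must stay in one country and data that must be global can share the same set of clusters.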

OpEx

Working in Agile projects, it is desirable to begin with fewer features and incrementally add new ones so that the project is not overwhelmed by so many services that must be coded, tested and maintained. In infrastructure there is a similar scenario. First, we have a small Kafka cluster that is enough for our current volume of data. In the following months, more and more customers arrive and the cluster can manage them by adding new partitions.

However, there will come a point when a new server must be added to the cluster, and then I not only have to change the configuration but also re-balance the existing topics. This is one example of how operational expenditure grows with a Kafka-based architecture.

Happily for us, Pulsar’s layered architecture and stateless brokers help make zero downtime in these cases possible. When a new broker is added to the cluster, it is immediately available for writes and reads and does not spend any time re-balancing data across the cluster.
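A rough way to see why stateless brokers matter is to count how much data would have to be copied when a broker is added. The round-robin assignment, the topic sizes, and the function names below are assumptions chosen for illustration; real load balancers are more sophisticated.

```python
def assign(topics, brokers):
    """Round-robin ownership map (topic -> broker). This is metadata only."""
    return {t: brokers[i % len(brokers)] for i, t in enumerate(topics)}


def bytes_moved_if_data_is_local(before, after, topic_sizes):
    """If brokers keep topic data on local disk, each ownership change copies it."""
    return sum(size for t, size in topic_sizes.items() if before[t] != after[t])


def bytes_moved_if_brokers_are_stateless(before, after, topic_sizes):
    """If data lives in a separate storage layer, ownership changes copy nothing."""
    return 0


topics = [f"topic-{i}" for i in range(6)]
topic_sizes = {t: 100 for t in topics}  # hypothetical size of each topic

before = assign(topics, ["broker-0", "broker-1"])
after = assign(topics, ["broker-0", "broker-1", "broker-2"])

print(bytes_moved_if_data_is_local(before, after, topic_sizes))
print(bytes_moved_if_brokers_are_stateless(before, after, topic_sizes))
```

In the stateless case, adding a broker only changes the ownership map, which is why the new broker can serve reads and writes immediately.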

From the perspective of data storage (bookies), when a new bookie is added to the cluster, re-balancing of data based on the replication configuration takes place behind the scenes, without any impact on the cluster. Finally, Pulsar can be easily deployed in Kubernetes clusters, either in managed clusters on Google Kubernetes Engine or Amazon Web Services or in custom clusters. Easy installation and easy maintenance, both of which Pulsar delivers, are exactly what we are looking for.

Final Thoughts

Apache Pulsar is a powerful stream-processing platform that has been able to learn from the weaknesses of previous systems. Its layered architecture is complemented by a number of great out-of-the-box features, including geo-replication, multi-tenancy, zero rebalancing downtime, unified queuing and streaming, TLS-based authentication/authorization, proxy, and durability. Compared to other platforms, Pulsar can give you the ultimate tools to deliver successful projects.

Ready to Pulsar!


Further Reading

  • 5 Courses to Learn Apache Kafka in 2019.
  • Kafka Architecture.


(*) Benchmark performed by OpenMessaging Benchmark, a Linux Foundation project.
