Comparing Pulsar and Kafka From a CTO’s Point of View
How a CTO would make the decision between Kafka and Pulsar.
When upper management assesses a new technology, they view it from a different perspective than middle management, architects, or data engineers. Upper management is not just looking at benchmarks and feature lists; they’re looking for long-term viability and how it gives their company a clear competitive edge. In addition, they’re optimizing for time to market and costs.
As the Managing Director for Big Data Institute, technology assessment is a key part of my role. We help companies identify and adopt the best technologies for their business needs. Our clients especially appreciate our vendor-neutral approach.
In this post, I compare Apache Pulsar and Apache Kafka from a CTO’s perspective. We won’t make this comparison in a vacuum because we shouldn’t be making technology choices without looking at use cases. Instead, we’ll look at each technology through some real-world and common use cases, including a simple messaging use case, a complex messaging use case, and an advanced messaging use case. This will allow us to better understand the Pulsar and Kafka trade-offs.
Simple Messaging Company
Let’s imagine our company needs a simple messaging system. It gets messages from point A to B and we aren’t going to do any replication. The new messaging technology is our company’s first foray into messaging systems.
The data architects have told us there isn’t any clear advantage between Pulsar and Kafka for this use case. They’ve done their homework and deeply understand the business case. The team doesn’t believe these use cases will grow in the near future.
For a simple messaging use case such as this, I’d agree that the pros and cons of each system balance out. The decision between the two technologies from a purely technical perspective is a tie so the comparison becomes solely one of cost. How much will it cost to operate? How much will it cost to train my staff? I’d be looking at using one of the Kafka or Pulsar ‘as a service’ providers to start comparing costs. For this stage, leveraging an ‘as a service’ provider for the selected message platform will cut down on operations costs and the cost to train my staff on the cluster operations. For Kafka, I’m looking at Confluent Cloud, MSK (AWS), or Event Hubs using the Kafka API (Azure). For Pulsar, I’m looking at StreamNative Cloud.
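Before comparing quotes, it helps to normalize each provider’s pricing to a single unit, such as cost per million messages. The sketch below uses placeholder numbers, not real vendor prices; the point is the shape of the comparison, not the figures.

```python
# Hypothetical cost-per-unit comparison across managed providers.
# All prices are illustrative placeholders, not real quotes.

def cost_per_million_messages(monthly_fee, throughput_fee_per_million,
                              millions_per_month):
    """Blend the fixed monthly fee into a per-million-message cost."""
    total = monthly_fee + throughput_fee_per_million * millions_per_month
    return total / millions_per_month

# Placeholder quotes for two 'as a service' offerings at our volume.
quotes = {
    "provider_a": cost_per_million_messages(1000.0, 0.30, 5000),
    "provider_b": cost_per_million_messages(2500.0, 0.10, 5000),
}

cheapest = min(quotes, key=quotes.get)
print(cheapest, round(quotes[cheapest], 4))
```

Re-running the same comparison at projected future volumes shows whether the cheapest provider today stays cheapest as usage grows.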
We hedge our bets and have the team use the Kafka API. Several technologies support Kafka’s API or line protocol on a broker that isn’t Kafka, either because the broker speaks Kafka’s line protocol directly or because API compatibility can be achieved by dropping in new client libraries. Using the Kafka API therefore gives us the most backend options. For example, Pulsar can serve as the backend by recompiling against Pulsar’s Kafka-compatible API, or by having Pulsar speak Kafka’s line protocol directly via Kafka-on-Pulsar (KoP).
We create a cost-per-unit comparison and go with the most cost-effective option. Hedging our bets with the Kafka API allows us to move between backends with only a QA requalification. It also insulates us from community or technology fallout, where a technology becomes less popular or loses support.
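The hedge works by keeping application code on the Kafka client API and isolating the backend choice in configuration. A minimal sketch, with hypothetical backend names and endpoints:

```python
# Sketch: application code stays on the Kafka API; only configuration
# knows which backend we're talking to. Endpoints are hypothetical.

BACKENDS = {
    # Plain Apache Kafka (or a managed Kafka service).
    "kafka": {"bootstrap_servers": "kafka.example.com:9092"},
    # Pulsar speaking Kafka's line protocol via KoP looks like a Kafka broker.
    "pulsar-kop": {"bootstrap_servers": "pulsar.example.com:9092"},
}

def producer_config(backend: str) -> dict:
    """Application code asks for config by backend name; nothing else changes."""
    cfg = dict(BACKENDS[backend])
    cfg["acks"] = "all"  # identical client settings regardless of backend
    return cfg

# Switching backends becomes a config change plus a QA requalification,
# not a rewrite: e.g. KafkaProducer(**producer_config("pulsar-kop")).
print(producer_config("kafka"))
```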
Complex Messaging Company
Let’s imagine our company needs complex messaging. We need geo-replication because we have data all over the world. We have been using messaging systems and are generally familiar with the complexity of real-time systems. We see the limitations of our current messaging systems, and we want something that handles advanced delivery of messages and other complex messaging features.
The data architects have told us each technology has its own advantages. They’ve talked to all of the stakeholders and the business side to understand current and future needs and believe that future use cases and data volume will grow over time.
In this case, we don’t have a clear winner between Pulsar and Kafka. We have to dig deeper into the various facets of the use case to make the right decision.
As we talk to Kafka vendors about our geo-replication needs, we find the options are either proprietary or bolted on. The proprietary replication solutions are built-in but costly. The open-source solution (MirrorMaker) is really a data-copying tool, and because it isn’t built-in, it creates operational overhead.
As we talk to Pulsar vendors about geo-replication, we find out it’s built-in, open source, and supports complex replication strategies. We know that as our replication strategy becomes more complex, Pulsar already supports the advanced replication strategies.
We decide Pulsar is the clear winner on geo-replication.
As part of our move to a new messaging platform, we want to handle new use cases. The data architects have been diligent in finding and understanding them. In the current system, any processing errors have to be retried manually by re-producing the message. We’d also want a way to delay the send of a message. Finally, the current system lacks strong schema enforcement, and we experience the pain of having different teams with different schema implementations.
We start by looking at Kafka. We see that Kafka lacks a built-in dead letter queue and any message processing failure has to be manually or programmatically retried. It lacks any built-in mechanism to delay sending messages, which would have to be done with heavy workarounds. Also, Kafka lacks a built-in schema enforcement mechanism. As a result, there are many different implementations of schema registries from other vendors.
We look at Pulsar. It has a built-in dead letter queue. If a message fails to be processed with a negative acknowledgment, Pulsar can automatically retry the message a certain number of times. Pulsar also has a built-in mechanism to delay the send of messages for a given amount of time. Pulsar considers schema to be a first-class citizen and has a built-in schema registry. The API support for schema is built-in too.
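The retry-then-dead-letter behavior can be sketched with a minimal simulation. The function below is illustrative, not Pulsar’s implementation; in Pulsar’s Java client the real controls are `DeadLetterPolicy.builder().maxRedeliverCount(n)` on the consumer, `consumer.negativeAcknowledge(msg)` to trigger redelivery, and `producer.newMessage().deliverAfter(delay, unit)` for delayed delivery.

```python
# Rough simulation of dead-letter semantics: a negatively acknowledged
# message is redelivered until it exhausts the max redelivery count,
# then it is routed to the dead letter topic instead.
import time

def on_negative_ack(redelivery_count: int, max_redeliver_count: int) -> str:
    if redelivery_count < max_redeliver_count:
        return "redeliver"        # broker retries the message
    return "dead-letter-topic"    # broker moves it to the DLQ

# A message that keeps failing, with a max redelivery count of 3:
outcomes = [on_negative_ack(n, 3) for n in range(5)]
print(outcomes)

# Delayed delivery: the producer attaches a deliver-at time and the
# broker holds the message until then (deliverAfter in the Java API).
deliver_at = time.time() + 600  # visible to consumers in 10 minutes
```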
We decide Pulsar is the clear winner on complex messaging.
Advanced Delivery of Messages
As we get deeper into the architecture, we find that data on the same topic needs to be delivered round-robin to some consumers for even resource usage, and in order to other consumers that need ordering guarantees.
We see that Kafka doesn’t support changing how data on a topic is delivered to consumers; once a producer writes a message to a topic, the delivery semantics can’t vary per consumer. That forces us either to duplicate the data into two topics and deal with the operational issues, or to make every consumer ordered and over-provision the consumers handling the round-robin workload to keep up.
Looking at Pulsar, we see that it supports delivering data on the same topic both in order and round-robin because delivery is controlled per subscription: an exclusive or failover subscription receives messages in order, while a shared subscription receives them round-robin. Pulsar’s brokers let us do precisely what we want without workarounds.
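The two delivery styles on one topic can be sketched as follows. This is a toy model of subscription behavior, not Pulsar’s implementation: the exclusive subscription sees every message in order, while the shared subscription fans messages out round-robin across its consumers.

```python
# Toy model: one stream of messages, two subscriptions with different
# delivery styles attached to the same topic.

def dispatch(messages, ordered_consumer, shared_consumers):
    for i, msg in enumerate(messages):
        # Exclusive-style subscription: single consumer, strict order.
        ordered_consumer.append(msg)
        # Shared-style subscription: round-robin across consumers.
        shared_consumers[i % len(shared_consumers)].append(msg)

ordered = []
workers = [[], [], []]
dispatch(list(range(6)), ordered, workers)
print(ordered)   # [0, 1, 2, 3, 4, 5]
print(workers)   # [[0, 3], [1, 4], [2, 5]]
```

The ordered consumer keeps the full sequence while the three shared workers split the load evenly, which is exactly the pair of guarantees the use case needs from a single topic.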
We decide Pulsar is the clear winner on the advanced delivery of messages.
Deployments and Community
To round out our comparison, we have to compare Pulsar and Kafka on the number of deployments and the overall community.
Looking at the competitive landscape, we see there is more vendor support for Kafka and more companies are selling and supporting Kafka products. When we look at the open source community, we see both Kafka and Pulsar have a vibrant community; however, Kafka has a larger community.
As we look at the companies using Kafka and Pulsar at a large scale, we see both technologies are being used at scale with large companies in production. Kafka boasts more companies using it in production.
From a training perspective, there are more people with Kafka experience; however, the data engineering team believes that a person with Kafka skills could learn Pulsar without difficulty.
We decide Kafka is the clear winner in terms of support and community, with Pulsar picking up steam.
The decision for this use case is not easy as there are distinct advantages and disadvantages for each technology.
Pulsar has a ways to catch up on the community and deployments, and Kafka has a ways to catch up on features.
To start the discussion, we need to understand what the company values most in a technology. Are we extremely conservative on our technology adoption, or are we less conservative? We may have had good luck adopting new open source technologies in the past, which would motivate us to use Pulsar.
If we choose Kafka, we also have to think about the ramifications of going back to the business sponsor and saying we can’t handle their use cases. Even after paying a large amount for geo-replication licenses, the use case may not be achievable. The team could end up spending a lot of time, even months, writing, perfecting, and testing workarounds.
If we choose Pulsar, we can go back to the business sponsor and say we can handle everything. The team will take less time to implement because all of the features are already there and tested.
In this case, we cannot hedge our bets with the Kafka API with a Pulsar backend because the Kafka API doesn’t have the extra features we need. We have to use the Pulsar API for everything or make many workarounds for Kafka.
Our decision: we choose Pulsar, prioritize handling the business requests, and focus the team on writing code rather than workarounds, while keeping an eye on the community and vendor landscape.
Our conservative decision: we go with Kafka and accept that we may not be able to achieve some use cases. For the use cases we can approximate, we go with workarounds. We look back at our project timelines and add more time for anticipated workarounds. We check with our operations team to ensure we can handle the additional operational overhead these workarounds will bring.
Advanced Messaging Company
Let’s imagine our company is already using a variety of messaging and queuing systems. From an operational, architectural, and overhead perspective, we see the need to move to a single system. Also, we want to reduce operating costs.
The data architects have told us that both Kafka and Pulsar have their own advantages for this use case. The architects have talked to all of the stakeholders and the business side to understand current and future desires.
Queueing and Messaging
A core part of our pain exists in our RabbitMQ systems. We’re pushing too many messages through the system and RabbitMQ is not able to keep up with the demand. We’ve done several workarounds with our RabbitMQ code to buffer messages in memory, and we keep standing up new clusters to handle the load. We need a system that can handle the scale of messages we need to send through it instead of doing workarounds.
As the data architects go through the use cases, they see the need for both messaging and queuing. The need for RabbitMQ-style work doesn’t go away, and we’ll need better technology for messaging.
Kafka works well for messaging and can handle the scale, but it doesn’t address the queueing side. There are some workarounds the team could do, but they leave us unable to accomplish our goal of a single system. To handle the queueing use cases, we’d need both a Kafka cluster and RabbitMQ clusters. The Kafka clusters would act mostly as a buffer to prevent overloading the RabbitMQ clusters. And Kafka lacks built-in support for RabbitMQ, so we’d have to go with a vendor or write our own code to move data between Kafka and RabbitMQ.
Pulsar can handle both queueing and messaging within the same cluster and can support the scale we need. Pulsar would allow consolidating all messaging and queueing use cases into a single cluster. We also have the option to continue using our RabbitMQ code: there is a RabbitMQ connector in Pulsar, and StreamNative also has an AMQP handler that plugs directly into the broker and is Apache-licensed.
If we don’t want to reuse our RabbitMQ code, we can use the Pulsar API and have all of the same queueing-style functionality. Depending on how well-factored the code is, this could be a simple change programmatically, and we’d have to run extensive QA.
We decide Pulsar is the clear winner on queueing and messaging.
Tiered Storage

The data architects analyzed the data usage and found that 99.99% of the messages are not read again after being consumed the first time. However, they decided to remain conservative and retain the messages for seven days. Even though we’re storing data for seven days, we don’t want our operational costs to explode. Tiered storage allows us to better manage costs by keeping some data locally and offloading the rest to S3 for cheaper long-term retention.
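A back-of-envelope calculation shows why offloading matters. The ingest volume and per-GB prices below are illustrative placeholders, not vendor quotes; the point is the ratio between broker-attached storage and object storage.

```python
# Back-of-envelope retention cost with tiered storage.
# All numbers are illustrative assumptions, not quotes.

GB_PER_DAY = 500      # assumed daily ingest retained
LOCAL_PRICE = 0.10    # $/GB-month, broker-attached replicated disks (assumed)
S3_PRICE = 0.023      # $/GB-month, rough S3 standard list price

def monthly_cost(hot_days, total_days=7):
    """Keep `hot_days` of data on broker disks, offload the rest to S3."""
    hot = GB_PER_DAY * hot_days * LOCAL_PRICE
    cold = GB_PER_DAY * (total_days - hot_days) * S3_PRICE
    return hot + cold

all_local = monthly_cost(hot_days=7)   # everything on broker disks
tiered = monthly_cost(hot_days=1)      # 1 day hot, 6 days offloaded
print(round(all_local, 2), round(tiered, 2))
```

Under these assumptions, offloading six of the seven retained days cuts the storage bill by roughly two-thirds, and the gap widens as retention grows.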
Kafka is working on tiered storage, but the Apache Kafka project hasn’t released the feature yet. There are vendors who offer it closed-source, but we are not sure whether it is production-ready.
In Pulsar, tiered storage is built-in and has been supported for a while. The feature is production-ready and is being used in production by companies.
We decide Pulsar is the clear winner on tiered storage with Kafka catching up.
Number of Topics

Because we use many topics to break up data, we need the next system to handle creating large numbers of topics. Our data architects believe that we’ll need 100,000 topics initially and that we will grow to 500,000 topics over time.
A Kafka cluster is limited in the number of partitions it can support, and each topic needs at least one partition. Work is going into supporting more topics on Kafka, but these efforts haven’t been released. Additionally, Kafka lacks namespaces and multi-tenancy, so all 100,000 topics would sit together in one namespace, with no way to segment resources by topic.
While some companies leverage multiple Kafka clusters in order to accommodate more topics and to segment resources, this approach increases costs and removes the option of using only a single cluster.
With Pulsar there is support for millions of topics. The functionality is already released and being used in production. To keep the operational overhead down, Pulsar supports namespaces and multi-tenancy. This allows us to have resource quotas set for each topic.
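Pulsar’s tenant and namespace structure is visible in the topic name itself, which follows the form `persistent://tenant/namespace/topic`; quotas and isolation are scoped to the tenant and namespace segments. The tenant and namespace values below are hypothetical examples.

```python
# Pulsar topic names encode tenant and namespace, the levels at which
# resource quotas and isolation are applied:
#   persistent://tenant/namespace/topic

def parse_topic(name: str) -> dict:
    scheme, rest = name.split("://")
    tenant, namespace, topic = rest.split("/", 2)
    return {"scheme": scheme, "tenant": tenant,
            "namespace": namespace, "topic": topic}

# Hypothetical tenant/namespace layout for one of our teams.
t = parse_topic("persistent://payments/eu/transactions")
print(t["tenant"], t["namespace"], t["topic"])
```

With 100,000 topics, grouping them by tenant and namespace is what keeps quota management tractable instead of configuring each topic individually.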
We decide Pulsar is the clear winner on topics.
Routing

Because of our company background with RabbitMQ, much of our design revolved around having the brokers route messages into topics for us. We have several inbound topics that are routed or fanned out into many different topics depending on the data in the message. For example, we have a single topic for the entire world that the RabbitMQ broker breaks up into per-country topics.
The data architects looked at how they would use a single topic for the entire world in each system. They found that the downstream consumers weren’t designed to handle the increased load of receiving every message, deserializing it, inspecting it, and throwing most of it away, and that it would take too much effort to change every downstream system.
We start looking at Kafka and see that the general recommendation is to have everything in a single topic. We’ve already realized that won’t work because of the increased filtering load on our consumers and on our cluster, so we start looking at workarounds. The workaround is to scale the consumers horizontally to read the global topic and process it. The options for those consumers would be writing our own custom consumer/producer, writing a Kafka Streams program, or using the proprietary KSQL.
We look at Pulsar and see that it supports using Pulsar Functions or custom consumer/producer for routing. We would read the global topic and save out the data to a separate country-specific topic. By having separate topics, the consumers can subscribe to just the topics they need and only receive relevant information.
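The routing step itself is small; the sketch below shows the kind of logic that would run inside a Pulsar Function or a custom consumer/producer. The topic names and message shape are hypothetical.

```python
# Sketch of per-country routing: read the global topic, write each
# message to a country-specific topic so downstream consumers subscribe
# only to the countries they care about. Topic names are hypothetical;
# in Pulsar this logic could live inside a Pulsar Function.

def route(message: dict) -> str:
    country = message.get("country", "unknown").lower()
    return f"persistent://public/default/events-{country}"

# Simulate routing a few messages from the global topic.
routed = {}
for msg in [{"country": "DE", "v": 1}, {"country": "US", "v": 2},
            {"country": "DE", "v": 3}]:
    routed.setdefault(route(msg), []).append(msg["v"])

print(routed)
```

Each downstream system then consumes only its own country topic, so none of them has to deserialize and discard the rest of the world’s traffic.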
We decide Pulsar is the clear winner on routing.
This decision comes down to time. Do we have the time for Kafka to catch up with Pulsar? Do we have the time for our data engineers to implement the workarounds for Kafka? Waiting will cost the company on missed opportunities and delaying new use cases. These waits could have direct business implications.
Our decision: we go with Pulsar.
Our decision with extra time: we delay our new architecture. We check to see if Kafka has caught up with Pulsar in six months. If it has, we check to see if anyone is running these new features in production and what they’re saying about the stability. If we do not see the right outcomes with Kafka, we go with Pulsar.
Well, you’ve made it this far or just skimmed down here. Each one of the examples above is a real-world use case I’ve worked on. Hopefully, following the framework above will help you to make better-informed technology assessments based on your use case.