Apache Kafka is a message queue implemented as a distributed commit log. From the producer’s point of view, it logs events into channels, and Kafka holds on to those messages while consumers process them.
Unlike a traditional “dumb” message queue, Kafka lets consumers keep track of which messages have been read. That removes the burden of maintaining reader state from the Kafka servers, instead placing that responsibility on the consumer. The upside of this shift is that consumers can jump back and forth in time, allowing an out-of-sync consumer to catch up without losing any data.
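To make the idea of consumer-held state concrete, here is a minimal in-memory sketch. It is not Kafka's API; `CommitLog`, `Consumer`, `poll`, and `seek` are hypothetical names chosen for illustration. The log only appends and serves reads by offset, while each consumer remembers its own position.

```python
class CommitLog:
    """Append-only in-memory log; an illustrative stand-in, not a Kafka client."""

    def __init__(self):
        self._entries = []

    def append(self, message):
        """Store a message and return its offset (its position in the log)."""
        self._entries.append(message)
        return len(self._entries) - 1

    def read(self, offset, max_messages=10):
        """Return up to max_messages starting at offset; no reader state is kept."""
        return self._entries[offset:offset + max_messages]


class Consumer:
    """Tracks its own position; the log never learns who has read what."""

    def __init__(self, log):
        self.log = log
        self.offset = 0  # next offset this consumer will read

    def poll(self, max_messages=10):
        batch = self.log.read(self.offset, max_messages)
        self.offset += len(batch)
        return batch

    def seek(self, offset):
        """Jump backward or forward in the log."""
        self.offset = offset


log = CommitLog()
for i in range(5):
    log.append(f"event-{i}")

consumer = Consumer(log)
first = consumer.poll(max_messages=3)     # event-0, event-1, event-2
consumer.seek(1)                          # rewind to replay older messages
replayed = consumer.poll(max_messages=2)  # event-1, event-2 again
```

Because the offset lives in the consumer, rewinding is just an assignment; the log itself is unchanged by reads.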
Kafka’s model of distributed commit logging has several distinct advantages over traditional queue or pub/sub-based messaging systems:
- Message ordering is retained in the presence of parallel consumers; losing it is a common issue in traditional queues. This is the atomic broadcast problem.
- The ability to read old messages, even the state of messages old enough to have been discarded due to compaction.
- No need for the message broker to deal with acking (tracking acknowledgments), which requires a lot of state management, including locks and timeouts.
The performance and reliability of Kafka present a compelling alternative to both queue and pub/sub-based messaging. The internet is an extremely unreliable place where laptops go to sleep and mobile devices drop service. Systems are vastly more robust with reliable messaging that allows consumers to catch up when they get out of sync. Even better, log services provide much more than ephemeral messaging systems.
Combining Messaging Models
Messaging traditionally has two models: queuing and publish-subscribe. In a queue, a pool of consumers may read from a server and each message goes to one of them; in publish-subscribe, the message is broadcast to all consumers. Kafka offers a single consumer abstraction that generalizes both of these: the consumer group.
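The following is a simplified sketch of how a consumer group can cover both models. It assumes a topic split into partitions and a plain round-robin assignment; the function names `assign_partitions` and `deliver` are hypothetical, and Kafka's real assignment strategies are more involved. Within a group each message goes to exactly one member (queue behavior), while every group sees the full stream (publish-subscribe behavior).

```python
from collections import defaultdict


def assign_partitions(partitions, members):
    """Spread partitions over group members round-robin (simplified assumption)."""
    assignment = defaultdict(list)
    for i, partition in enumerate(partitions):
        assignment[members[i % len(members)]].append(partition)
    return dict(assignment)


def deliver(topic, groups):
    """Return {group: {member: [messages]}} for a topic of {partition: [messages]}."""
    result = {}
    for group, members in groups.items():
        assignment = assign_partitions(sorted(topic), members)
        result[group] = {
            member: [msg for p in parts for msg in topic[p]]
            for member, parts in assignment.items()
        }
    return result


topic = {0: ["a0", "a1"], 1: ["b0", "b1"]}
groups = {
    "workers": ["w1", "w2"],  # two members share the work, like a queue
    "audit": ["x1"],          # a separate group receives everything, like pub/sub
}
received = deliver(topic, groups)
# received["workers"] == {"w1": ["a0", "a1"], "w2": ["b0", "b1"]}
# received["audit"]   == {"x1": ["a0", "a1", "b0", "b1"]}
```

Putting all consumers in one group gives queue semantics; giving each consumer its own group gives broadcast semantics.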
Here are a few of the popular use cases for centralized logging services and Kafka in particular:
- Messaging. As a replacement for message queues that has stronger guarantees about throughput, partitioning, replication, and fault tolerance.
- Metrics. Aggregating statistics from distributed applications into a central persistent location.
- Replication. Streaming time-ordered data is the foundation of replication in most RDBMSs; it applies equally to replicating between databases.
- Warehousing. As the initial data aggregation stage in an extract, transform, load (ETL) pipeline.
Any situation where an immutable sequence of event data can be both produced and consumed is a contender for using Kafka.
Making Use of Distributed Logging
Consider a simple concrete example where a primary cluster of application servers has a database, a collection of caching servers, and a websocket broker that manages connections with web clients. A set of Kafka topics provides a way to centralize updates to all of the peripheral systems, and even to know exactly how up-to-date the data is within each system.
As the central database changes, log entries are written to a corresponding topic. The cache servers and websocket broker can then consume log entries at their own pace, transforming the data as needed and passing it on to their clients when ready.
The log also acts as a buffer that makes data production asynchronous from data consumption. This is important for a lot of reasons, but particularly when there are multiple subscribers that may consume at different rates. This means a subscribing system can crash or go down for maintenance and catch up when it comes back.
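As a rough illustration of this buffering, the sketch below models the cache and websocket subscribers from the example reading one shared log at different rates. It is an in-memory toy, not Kafka client code; `produce`, `consume`, and `lag` are hypothetical helpers. Lag here is simply the number of log entries a subscriber has not yet read, which is also how one can tell how up-to-date each system is.

```python
log = []                                # shared append-only log
offsets = {"cache": 0, "websocket": 0}  # each subscriber's next offset to read


def produce(message):
    """Append to the log without waiting for any subscriber."""
    log.append(message)


def consume(name, max_messages):
    """Read the next batch for one subscriber and advance only its offset."""
    start = offsets[name]
    batch = log[start:start + max_messages]
    offsets[name] = start + len(batch)
    return batch


def lag(name):
    """How many entries the subscriber still has to read."""
    return len(log) - offsets[name]


for i in range(6):
    produce(f"update-{i}")

consume("cache", 6)      # fast subscriber keeps up
consume("websocket", 2)  # slow subscriber, or one that went down early
lag_before = {name: lag(name) for name in offsets}

# The websocket broker comes back and drains the backlog at its own pace.
while lag("websocket") > 0:
    consume("websocket", 2)
lag_after = {name: lag(name) for name in offsets}
```

The producer never blocks on the slow subscriber, and nothing is lost while it is behind.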
Consumer systems, or services, only know about the log and particular partitions. Consumers don’t know any details about the system where the messages originated: whether the data was dumped from a SQL database, a key-value store, or was generated by an ephemeral event source. The origin doesn’t matter, which keeps the consumer and producer topologies entirely separate. New consumer systems can be added or removed with no change to the log, the topics, or Kafka itself.
Notice that this discussion is around logging, rather than messaging, queues, or pub/sub. That is because the semantics of a distributed log are different and much more practical when implementing critical services like data replication.
Combined with topics, logs can act as a messaging system with durability guarantees and strong ordering semantics. Kafka is a single service playing many roles simultaneously, without bending or compromising.
Where the Kafka Dragons Are
Kafka definitely isn’t perfect. You have to run a cluster of nodes for fault-tolerant replication, increasing the minimum load on ops. It is also difficult to integrate with if you’re not in Java land. There are no mature client implementations for Ruby, Go, or Node. Its reliance on the JVM and on ZooKeeper makes it heavyweight when compared to a nimble pub/sub provider like Redis.
Kafka works wonderfully when the bottleneck is the broker — when there are a lot of lightweight messages being passed between systems, and the broker is having trouble keeping up. Messaging, analytics, and logs are great examples where Kafka’s data model works really well. In situations where a smaller number of large messages may take seconds or minutes to process, such as part of a job management system, Kafka isn’t optimal.
As with everything in software, it all depends on your use case. If reliability and persistence aren’t deciding factors, and you instead favor simplicity, Kafka may not be the answer. However, if you have an internet-scale system pushing large amounts of data around where performance is a concern, consider Kafka and distributed logging.