This is the first post in a blog series dedicated to Apache Kafka and its usage for solving problems in the big data domain. Using a hands-on approach and exploring the performance characteristics and limits of Kafka-based big data solutions, the series will draw parallels with road racing. The reason for this is twofold. First, racing is a highly technical discipline where every tiny detail has to be mastered in order to win a race, and in a highly distributed and scalable system that deals with high data volume and/or velocity, the situation is pretty much the same. Second, well, it's catchy and fun.
Kafka is a performant implementation of the general concept of a distributed commit log and, as such, can be used wherever a log is required or helps to implement an efficient solution. As it turns out, the commit log abstraction is present in the internals of distributed systems and can be applied in many more cases than is apparent at first. With Kafka, this simple structure, which has been known and used for many years, is going through its renaissance.
To understand the nature of the distributed commit log, one can start by reading Jay Kreps' famous post.
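As a rough illustration of the commit log abstraction, here is a minimal in-memory sketch in Python. The `CommitLog` class and its method names are hypothetical and chosen for this example only; they are not Kafka's API, and there is no distribution, replication, or disk persistence here. The sketch shows just the two core properties: records are appended and addressed by offset, and reading does not remove them.

```python
class CommitLog:
    """Toy append-only log; names are illustrative, not Kafka's API."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Append a record and return its offset (its position in the log)."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records records starting at offset; nothing is removed."""
        return self._records[offset:offset + max_records]


log = CommitLog()
first = log.append("lap-1")
second = log.append("lap-2")

# Two independent readers keep their own offsets; reading does not consume data.
reader_a = log.read(0)
reader_b = log.read(1)
```

Because each reader tracks its own offset, any number of readers can re-read the same data at their own pace.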
Kafka Use Case Patterns
Kafka is built from the ground up as a distributed system, natively handling replication, fault-tolerance, and partitioning. Kafka does a good job of persistence. The data in Kafka is always persisted and can be re-read. Kafka should be observed as a cluster, not just a collection of individual brokers. Such an approach has a substantial impact on everything from how you manage it to how Kafka-based applications behave.
The following examples are use case patterns rather than individual use cases. They are an attempt to group some of the most prominent use cases according to the place and function that the Kafka-based parts have in the system architecture.
Kafka for Input/Output plumbing
In many instances, there is a need to build an application or a system that needs to accept a large number of input messages. For example, there is a fire hose of external events, usually coming from multiple sources, that need to be accepted and processed in a reliable and performant way. The part of the system architecture responsible for this is usually said to implement the data ingestion or ingress path.
Figure 1: Using Kafka as ingress layer
The most recognized examples of this are applications from the IoT domain. One of the primary requirements of these applications is to accept messages coming from a large number of devices in real time. Each message is usually small (e.g. up to a few kilobytes), but messages can vary significantly in format, frequency, and number.
In fact, I would argue that practically any modern, large-scale application needs a well-implemented ingestion part. There are several reasons for this claim.
Another reason for having an ingestion part is the need to simplify the application implementation. In practice, this means that input messages are not processed directly by the handlers of the application's business logic. Instead of having handlers that accept input data, delegate sub-requests to other internal services, and send the response back, the whole process is decomposed into steps by applying the command query responsibility segregation (CQRS) pattern. The first step, command acceptance, becomes the ingestion part of the system.
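The decomposition into a command acceptance step and a separate processing step can be sketched as follows. This is a simplified, single-process illustration: the plain list `command_log` stands in for a Kafka topic, and the function names and the account/amount commands are assumptions made up for the example.

```python
command_log = []   # stands in for a Kafka topic (assumption)
balances = {}      # read model, updated asynchronously from the log


def accept_command(command):
    """Ingestion step: only record the command and acknowledge it."""
    command_log.append(command)
    return len(command_log) - 1  # offset used as an acknowledgement


def process_commands(from_offset):
    """Processing step: apply logged commands to the read model."""
    for command in command_log[from_offset:]:
        account = command["account"]
        balances[account] = balances.get(account, 0) + command["amount"]
    return len(command_log)  # next offset to process


def query_balance(account):
    """Query side: reads the model and never touches the command path."""
    return balances.get(account, 0)


ack1 = accept_command({"account": "a1", "amount": 50})
ack2 = accept_command({"account": "a1", "amount": -20})
before = query_balance("a1")       # commands accepted, not yet processed
next_offset = process_commands(0)
after = query_balance("a1")        # read model has caught up
```

The acceptance step stays trivial and fast, while the business logic runs later and at its own pace against the logged commands.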
Figure 2: Using Kafka as egress layer
Kafka as Data Backbone
Modern and complex systems usually have to cope with large amounts of data but at the same time provide scalability, short downtime, and failure resilience, while their architecture has to remain flexible in order to support an easy evolution (for example, new business requirements and the application of new technologies).
One intrinsic characteristic of such systems is that the same data is used through many access patterns. For example, one process accesses data in a serial manner (as time series) while another one needs to index data and access it randomly. This implies that different technologies need to be used at the same time to support different access patterns. An example of this would be the usage of different databases: a relational database (e.g. Postgres) for ad hoc BI, a NoSQL store (e.g. Apache Cassandra) for time series, and Elasticsearch for indexing and random access. Having multiple storage subsystems that operate on the same data and keeping them all in sync is a tough challenge.
The next important characteristic is that future data handling strategies usually cannot all be foreseen and, therefore, cannot be planned at the beginning. New data handling techniques, algorithms, and/or business use cases appear after the system architecture has already been implemented and data has been accepted and saved. So the original data/information has to be preserved, and there has to be a mechanism that allows data reprocessing.
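These two characteristics, many access patterns over the same data and reprocessing for strategies added later, can be illustrated with a small sketch. The list `events` stands in for a retained Kafka topic, and the sensor data and function names are invented for the example. Each view is derived from the same unchanged log, so a new one can be added later simply by replaying the log from the start.

```python
from collections import defaultdict

# Original data, kept unchanged (stands in for a retained Kafka topic).
events = [
    {"ts": 1, "sensor": "s1", "value": 10},
    {"ts": 2, "sensor": "s2", "value": 7},
    {"ts": 3, "sensor": "s1", "value": 14},
]


def build_time_series(log):
    """Serial access pattern: values ordered by timestamp."""
    return [(e["ts"], e["value"]) for e in sorted(log, key=lambda e: e["ts"])]


def build_index(log):
    """Random access pattern: look up values by sensor id."""
    index = defaultdict(list)
    for e in log:
        index[e["sensor"]].append(e["value"])
    return dict(index)


def build_max_per_sensor(log):
    """A strategy added later; it replays the same log from the start."""
    result = {}
    for e in log:
        result[e["sensor"]] = max(result.get(e["sensor"], e["value"]), e["value"])
    return result


time_series = build_time_series(events)
index = build_index(events)
max_per_sensor = build_max_per_sensor(events)
```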
The third characteristic is that, over time, new system components are developed, using new technologies. They have to be tested and used to replace old versions without breaking the production system. In case of an error, a rollback path has to be ensured. Moreover, all of this is usually expected without service downtime.
The system layer that is capable of supporting the above demands is usually called a Central Data Pipeline. The Single Source of Truth (SSOT) pattern can also be implemented using such an approach. Arguably, it can also be seen as a variant of a Data Lake (Data Pond?) implementation.
Figure 3: Using Kafka as Central Data Pipeline
Apache Kafka was originally created at LinkedIn as an implementation of a central data pipeline. That is why Kafka is a natural fit for this kind of use case.
If a new data strategy is to be introduced, again, the new process implementation has to be connected to the respective Kafka topics, without having any impact on the existing infrastructure while replaying data and applying new algorithms.
The same applies to introducing new versions of the existing systems. They can run in parallel with the existing infrastructure in a test phase and later on, if everything goes well, they can take over from the old implementation.
Another issue is related to scaling. Kafka splits logical topics into partitions, so partitions are the basic unit of scaling. One logical topic can be spread across many Kafka brokers, where each broker handles one or more topic partitions, but a single partition is served by only one broker (its leader) at a time. This sets a scaling limit: from the standpoint of scaling out a single topic, there is no point in having more brokers than topic partitions. So solid capacity planning is needed. Repartitioning data is certainly possible, and using Kafka topics and data reprocessing can be helpful for that, but it is still a costly operation.
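The scaling limit described above can be shown with a small sketch. Both functions are hypothetical helpers written for this example: `partition_for` uses CRC32 as a stand-in for a stable key hash (real Kafka clients use their own hash function), and `assign_partitions` uses a plain round-robin placement that ignores replication.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a key to a partition with a stable hash (CRC32 here, as an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(num_partitions, brokers):
    """Round-robin placement: each partition gets exactly one broker."""
    assignment = {broker: [] for broker in brokers}
    for partition in range(num_partitions):
        assignment[brokers[partition % len(brokers)]].append(partition)
    return assignment


# Four partitions spread evenly over two brokers.
balanced = assign_partitions(4, ["b1", "b2"])

# Two partitions on three brokers: one broker has nothing to do for this topic.
oversized = assign_partitions(2, ["b1", "b2", "b3"])
idle = [broker for broker, parts in oversized.items() if not parts]
```

Since the partition of a record is derived from its key and the partition count, changing the number of partitions changes where keys land, which is one reason repartitioning is costly.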
Kafka to bind them all
Figure 4: Using Kafka to decouple microservices
The concept of durable Kafka topics with guaranteed message delivery semantics is a perfect fit for this use case. A Kafka topic can be used as a message pipe that connects two or more microservices in different topologies, combining publish/subscribe and fan-out message patterns.
Replicated Kafka topics, combined with the fan-out message pattern, can be used to scale the system capacity (throughput) simply by increasing the number of microservice instances.
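A toy model of this delivery pattern is sketched below, assuming Kafka's consumer-group model: every group of microservice instances sees all messages of a topic, while inside a group the topic's partitions are divided among the instances. The `deliver` function, the service names, and the round-robin split are illustrative assumptions, not Kafka's actual assignment logic.

```python
def deliver(partitions, groups):
    """Simulate delivery of a partitioned topic to several consumer groups.

    partitions: list of message lists, one list per partition
    groups:     {group name: [instance names]}
    Returns {group: {instance: [messages]}}.
    """
    result = {}
    for group, instances in groups.items():
        per_instance = {name: [] for name in instances}
        for p, messages in enumerate(partitions):
            # Each partition is owned by exactly one instance within a group.
            owner = instances[p % len(instances)]
            per_instance[owner].extend(messages)
        result[group] = per_instance
    return result


partitions = [["m0", "m3"], ["m1", "m4"], ["m2", "m5"]]
out = deliver(
    partitions,
    {"billing": ["billing-1"], "email": ["email-1", "email-2", "email-3"]},
)
```

In this example the single billing instance processes all six messages, while each of the three email instances processes only two, which is how adding instances raises throughput.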
Moreover, Kafka topics with log compaction and/or the event-sourcing pattern can be used efficiently to save the internal state of microservices. The state of one microservice, or even of a whole subtree of connected microservices, can be recreated by replaying messages.
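The idea of rebuilding state from a compacted topic can be sketched in a few lines. This is a simplified model: the `compact` and `rebuild_state` functions and the cart data are invented for the example, and real Kafka compaction has further details (such as delete markers and segments that are not yet compacted) that are ignored here.

```python
def compact(log):
    """Keep only the latest record per key, preserving the order of the survivors."""
    latest_offset = {key: offset for offset, (key, _) in enumerate(log)}
    return [
        record
        for offset, record in enumerate(log)
        if latest_offset[record[0]] == offset
    ]


def rebuild_state(log):
    """Recreate a key/value state by replaying records in order."""
    state = {}
    for key, value in log:
        state[key] = value
    return state


changelog = [
    ("cart-1", 1),
    ("cart-2", 5),
    ("cart-1", 3),
    ("cart-2", 6),
    ("cart-3", 2),
]
compacted = compact(changelog)

# Replaying the full log and replaying the compacted log give the same state.
state_full = rebuild_state(changelog)
state_compacted = rebuild_state(compacted)
```

Compaction shortens the replay without changing the resulting state, which is what makes it attractive for storing the latest state per key.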
Kafka to stream it all
Stream processing of real-time events has become an essential part of modern applications. The need to react to real-life events in real time (e.g. reacting quickly to user actions or responding to changes in system metrics) is what makes stream processing so important.
Figure 5: Using Kafka for stream processing applications
From its foundation, Kafka has been designed to support distributed stream processing as a layer on top of its core primitives. Basically, Kafka provides the necessary infrastructure (logistics, if you will) for implementing data streams that can be processed in a distributed and scalable manner, providing a highly available, performant, and fault-tolerant solution. This is the reason why it is used with so many popular stream-processing frameworks and libraries such as Apache Samza, Apache Flink, Apache Spark Streaming, and Apache Storm. There is also a Kafka-native library (or API), Kafka Streams, that is part of the Apache Kafka project.
Although the end-to-end latency in Kafka-based solutions is relatively low, for example in comparison to Amazon Kinesis, Kafka is not a solution for really low latency scenarios where an event has to be processed within a few milliseconds. Typical end-to-end latency ranges from a few dozen up to a few hundred milliseconds.