An Introduction to Apache Kafka
If you're new to the world of data science, check out this great primer to the Apache Kafka framework and learn how to utilize its features.
Join the DZone community and get the full member experience.Join For Free
Once upon a time, a computer program was a monolith. At its simplest, at least one argument and a bit of data went in, and from the same system, deterministically, a result came out.
Things became more complicated when programs needed to talk to each other and pass data back and forth. In such cases, data was addressed to the receiving app, who then theoretically received it, did something to it, and either displayed it locally or maybe passed the result back to the sender.
Things continued to evolve with systems becoming distributed amongst billions of machines and users. Events like status updates, logins and presence, along with data about uploads and other user activity needed to be handled in real-time across systems that spanned the planet via a constellation of users. Suddenly, there was a need for those billions of point applications (and other systems) to communicate with each other.
Imagine the complexity of such a system: masses of unique, distributed implementations, scattered across continents, each with unique needs, and all trying to exchange data — but only the relevant stuff, willy nilly. Would each one have a distinct point-to-point exchange mechanism? Would each transport mechanism be different for the type of data being exchanged?
If so, this would be problematic. How would you build, manage, and maintain such a system? After all, the order of complexity of a system with 1 billion apps communicating point to point is a product: n*n. In other words, 1 quintillion, or 1018. So technically, that’s the possible upper limit of how many separate communication protocols and implementations could be needed for 1 billion point apps to communicate 1:1.
Surely there’s a simpler way. What if communication were built around a common interface? And, better still, what if applications could subscribe to and receive only the events they wanted?
Enter the pub-sub mechanism.
Earlier Attempts at Messaging
But first, let’s look at some early efforts to address this scalability problem.
An earlier endeavor for loosely-coupled messaging involved the Java Messaging Service that emerged in 1998. Unlike more tightly-coupled protocols (like TCP network sockets, or RMI with JMS), an intermediary component enabled software elements to speak to each other indirectly, either via a publish-and-subscribe, or a point-to-point model. As it turns out, JMS offered many features which would be core to Apache Kafka later on, such as producer, consumer, message, queue, and topic.
Both Pivotal’s RabbitMQ and Apache ActiveMQ eventually became provider implementations of JMS, with the former implementing Advanced Message Queuing Protocol (AMQP) and compatibility with a range of languages, and the latter offering “Enterprise features” — including multi-tenancy.
But even these solutions came up short in some cases. For example, RabbitMQ stores messages in DRAM until the DRAM is completely consumed, at which point messages are written to disk, severely impacting performance.
Also, the routing logic of AMQP can be fairly complicated as opposed to Apache Kafka. For instance, each consumer simply decides which messages to read in Kafka.
In addition to message routing simplicity, there are places where developers and DevOps staff prefer Apache Kafka for its high throughput, scalability, performance, and durability; although, developers still swear by all three systems for various reasons.
So, in order to understand what makes Apache Kafka special, let’s look at what it is.
What Is Apache Kafka?
Introduced by Jay Kreps, Jun Rao, and Neha Narkhede at LinkedIn around 2010, Apache Kafka is a distributed, partitioned, replicated commit log service. It provides all of the functionality of a messaging system, but with a distinctive design.
Kafka’s scalable, fault-tolerant, publish-subscribe messaging platform is nearly ubiquitous across vast internet properties like Spotify, LinkedIn, Square, and Twitter. Running on clusters of broker servers, partitioned topics are distributed as append-only logs across the cluster and can be consumed by software consumers.
This design provides for fault tolerance, completely eliminates complicated producer-side routing rules, and enables the sort of massive scale leading Apache Kafka to supplement or completely replace JMS and other messaging systems.
What Apache Kafka Does
Apache Kafka passes messages via a publish-subscribe model, where software components called producers append events to distributed logs called topics (conceptually a category-named data feed to which records are appended).
Consumers are configured to separately consume from these topics by offset (the record number in the topic). This latter idea — the notion that consumers decide what they will consume — removes the complexity of having to configure complicated routing rules into the producer or other components of the system.
Source: Martin Kleppmann’s Kafka Summit 2018 Presentation, Is Kafka a Database?
Some Common Kafka Features
Central to Apache Kafka’s architecture is the commit log, an append-only, time-ordered sequence of records. Logs map to Kafka topics, and are distributed via partitions.
A producer publishes, or appends, to the end of the log, and consumers subscribe or read the log starting from a specified offset, from left to right. An offset describes the position in the log containing a given record.
Also known as “write-ahead logs,” “transaction logs,” or “commit logs,” these items serve one purpose — to specify what happened and when. When log entries contain their own unique timestamp, independent of any physical clock, distributed systems are better able to consume their specified topic data asynchronously.
Some Prominent Use Cases
Although the use of logs as a mechanism for topic organization and data subscription seem to have evolved by chance and may seem counterintuitive, the publish-subscribe mechanism, at speed and scale, works beautifully for supporting all kinds of asynchronous messaging, data flow, and real-time handling.
Kafka was originally used to ingest event data (like logins and unique page views) from a website at low latency. Today, Apache Kafka is used for a variety of functions to communicate much of the messaging and real-time information we see on the platform today.
Aside from messages, Apache Kafka handles things like status updates, user messages, presence, various internal workings, and so forth. However, Kafka typically works as part of a bigger system, often a data pipeline. For instance, OVO Energy uses Aiven Kafka to power their operational process, data science applications, and collect all necessary data for reporting and analytics. OVO Energy complements Kafka with Redis, for a hyper-fast message queuing solution, and then monitors the lot with InfluxDB and Grafana.
Another example is Kyyti, a mobility-as-a-service app. They implemented Aiven Kafka and PostgreSQL to reduce their data pipeline APIs from over 10 to only a few while handling massive data loads.
Lastly, OptioPay uses Aiven Kafka, PostgreSQL, and Elasticsearch to streamline stateful service operation and migration, and deploy on Microsoft Azure Germany: the only cloud they’re allowed to store and process personal data in. With Aiven’s cloud flexibility and management, OptioPay spends less time on ops, while remaining PSD2 and GDPR compliant.
Some Common Shortcomings of Apache Kafka
Apache Kafka does many things to make it more efficient and scalable over other publish-subscribe message implementations. For example, Kafka keeps no indices of the messages its topics contain, even when those topics are distributed across partitions.
And, as mentioned earlier, since consumers merely specify offsets from which their data reads begin, there is no need for confusing, producer-side, or common routing rules. However, there are many potential pain points, shortcomings, and gotchas to Apache Kafka.
For starters, the simplicity in consumption rules for routing traffic can theoretically create an extra load — and performance impact — from redundant or superfluous data being pushed throughout the system. On the flip side, consumption rules can be misconfigured so that NO data is consumed, and therefore, distributed components miss important inputs.
Apache Kafka makes no deletes to the log prior to the timeout configured in the
TTL (time-to-live) setting; this is done via the appropriate
log.retention.*setting in your Kafka broker
.properties file. If logs become too large, there is a potential performance/latency impact as writes invoke disk swapping. On the other hand, if logs are deleted too soon, consumers miss inputs, as above.
Kafka can stream messages efficiently within kernel IO, which is more efficient than buffering messages in user space. However, when accessing the kernel, there is always a potential to introduce potentially catastrophic faults that can bring down all or part of your messaging infrastructure.
To make things even more confusing, topics can have no consumers. Why would you want to do this? Perhaps to view certain logs for debugging purposes while building a system up or when considering records for which a feed may be needed later on. If that were the case, any debugging options like these need to be deactivated in order for your system to run optimally.
Long story short: all of the above parameters, and many more, need to be tweaked and configured. And how does one know an optimal configuration for a system without first subjecting it to production level loads first, without affecting your business? It’s a catch-22.
The Schema Registry: To Use or Not to Use?
Is your system going to grow and evolve? If so, you may want to consider using the Confluent Schema Registry, an excellent feature that allows you to decouple the systems you integrate via Kafka.
Schema Registry can be used for recording the data schema utilized by producers, and again by consumers to decode and validate data consistent with these rules. Note that the schema registry is available as a feature of the “free forever” Confluent offering and not the distribution available directly from apache.org.
But here’s the catch: if you want to move forward, you’re likely tied to Confluent’s distribution (and should have probably started with it in the first place). And, moving forward, Confluent will require users to use this proprietary tool to manage schemas. This alone makes forward compatibility a challenge.
The Setup Quagmire
There are also many options to setup Apache Kafka, and these vary widely depending on whether you are using Confluent’s free or paid version (running natively or on Docker) or the distro available from apache.org.
Just be sure that your version, installation type, and payment model match your future intentions (assuming you know what those are, already), or you may find yourself rebuilding your system from scratch!
The Aiven Kafka Solution
One way to get around all of this complexity is to run your Kafka as a hosted service. Building, testing, and maintaining your own Kafka clusters will eat up developer resources that could be used elsewhere while introducing risk.
To deal with schema management challenges in the future, Aiven has introduced their own tool, Karapace: it’s a 1:1 replacement of schema registry that is lighter, faster, and more stable. Just like Confluent’s proprietary tool, Aiven’s Open Source Karapace supports the storing of schemas in a central repository, which clients can access to serialize and deserialize messages. The schemas also maintain their own version histories and can be checked for compatibility between their different respective versions.
Aiven Kafka is easy to set up, either directly from Aiven Console:
Via our Aiven command line:
…or our REST API. You decide!
Opinions expressed by DZone contributors are their own.