How We Use Kafka
In this article, we take a look at how one company makes use of Kafka, demonstrating the flexibility that this tool provides to developers.
Humio is a log analytics system built to run both on-prem and as a hosted offering. It is designed "on-prem first" because many logging use cases need the privacy and security of managing your own logging solution, and because volume limitations can often be a problem in hosted scenarios.
From a software provider's point of view, fixing issues in an on-prem installation is inherently difficult, so we have strived to keep the solution simple. To that end, a Humio installation consists of a single Humio process per node, which depends on Kafka running nearby. (We recommend deploying one Humio node per physical CPU, so a dual-socket machine typically runs two Humio nodes.)
We use Kafka for two things: buffering ingest and sequencing events among the nodes of a Humio cluster.
Ingest Queue/Commit Log
The ingest queue buffers ingest so that Humio can absorb peak loads and similar situations. It also plays the role of a commit log: whenever we finish a block and append it to one of Humio's segment files, we also write the offset of the last message batch received into the on-disk block. That way we can restart with no loss by reading the last block of the Humio segment under construction and restarting ingest from the Kafka offset listed in that block.
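The recovery rule above can be sketched in a few lines. This is a minimal Python model, not Humio's code: `Block`, `Segment`, and `resume_offset` are hypothetical names, a plain list stands in for one Kafka partition, and resuming at the offset after the recorded one is an assumption about how the stored offset is interpreted.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Block:
    """A finished block; records the offset of the last message it contains."""
    events: List[str]
    last_kafka_offset: int


@dataclass
class Segment:
    """A segment file under construction, modeled as a list of blocks."""
    blocks: List[Block] = field(default_factory=list)

    def append_block(self, events: List[str], last_kafka_offset: int) -> None:
        self.blocks.append(Block(events, last_kafka_offset))


def resume_offset(segment: Segment) -> int:
    """Return the offset at which ingest should restart after a crash."""
    if not segment.blocks:
        return 0  # nothing persisted yet: start from the beginning
    # Assumption: the stored offset is the last one included in the block.
    return segment.blocks[-1].last_kafka_offset + 1


# A list stands in for one Kafka partition; the list index is the offset.
partition = ["e0", "e1", "e2", "e3", "e4"]

segment = Segment()
segment.append_block(partition[0:3], last_kafka_offset=2)
# Imagine a crash here: e3 and e4 were read but never written to a block.

start = resume_offset(segment)
replayed = partition[start:]  # the events that must be ingested again
```

Because the offset is stored inside the same block as the data it covers, the two can never disagree after a crash: either the block and its offset both made it to disk, or neither did.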
The primary coordination mechanism for our cluster state also goes through Kafka. The 'shared cluster state' is maintained using a straightforward event sourcing model, where we push updates through a single-partition/multiple-replica topic called global-events. All nodes periodically dump a snapshot of their internal state to local disk, and when a node starts up, it reads its own snapshot and reloads updates starting at the offset written in the snapshot. This has proven to be a very simple mechanism for dealing with distributed state.
To be on the safe side, there are a few more details to this, such as also recording the Kafka cluster id as the epoch of the Kafka offsets. So if you wipe your Kafka or redirect Humio to a different Kafka cluster, some manual intervention is required.
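A small Python model may make the snapshot-plus-replay idea concrete. It is a sketch under stated assumptions rather than Humio's implementation: `Snapshot` and `restore` are hypothetical names, the shared state is reduced to a dictionary, and a list stands in for the global-events topic.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Event = Tuple[str, str]  # (key, value) update to the shared state


@dataclass
class Snapshot:
    kafka_cluster_id: str  # epoch: offsets are only valid for this cluster
    next_offset: int       # first offset not yet reflected in `state`
    state: Dict[str, str]


def restore(snapshot: Snapshot, cluster_id: str, log: List[Event]) -> Snapshot:
    """Rebuild state from a snapshot plus the events recorded after it."""
    if snapshot.kafka_cluster_id != cluster_id:
        # Offsets from another cluster are meaningless; refuse to guess.
        raise RuntimeError("Kafka cluster id changed; manual intervention required")
    state = dict(snapshot.state)
    for key, value in log[snapshot.next_offset:]:
        state[key] = value  # apply each update in log order
    return Snapshot(cluster_id, len(log), state)


# A list stands in for the single-partition topic; the index is the offset.
log: List[Event] = [("node-1", "up"), ("node-2", "up"), ("node-1", "down")]
snap = Snapshot("cluster-A", next_offset=2, state={"node-1": "up", "node-2": "up"})

restored = restore(snap, "cluster-A", log)
# restored.state is {"node-1": "down", "node-2": "up"}
```

Since the topic has a single partition, every node sees the updates in the same order, so replaying from the same offset yields the same state on every node.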
As for ways to run Kafka, Humio supports three different configurations:
- Single node: We provide one Docker container with Zookeeper, Kafka, and Humio in it. This makes it easy to try it out; just say docker run humio/humio and go.
- Run a dedicated Kafka/Zookeeper to service your Humio cluster. We provide specialized containers for this scenario that deal with some management issues.
- Bring-your-own Kafka. We try to stay out of the way of managing it.
The first option is great for a small-to-medium instance running on a single server. Supporting both options 2 and 3, though, does complicate things somewhat.
On the other hand, many users want to run both Humio and Kafka as a cluster for scale and resilience. Unless the customer already has a dedicated team to manage Kafka, they'll typically be using Humio with our humio/kafka containers. In this scenario, we need to help users balance the Kafka cluster, configure replication, and so on. Humio uses a handful of topics that need varying configurations, and if the customer isn't familiar with Kafka, it can easily be misconfigured (too few replicas, all replicas ending up on a single node in the Kafka cluster, etc.). So we ship the entire Kafka admin client with Humio and perform these reconfigurations for the user. This mode is configured in Humio by setting
For customers that are already managing a Kafka solution, on the other hand, we need to stay out of their way. A customer who 'knows what they are doing' will just be annoyed if we try to reconfigure their Kafka. To make sure our topic names do not conflict with the existing ones, you can configure HUMIO_KAFKA_TOPIC_PREFIX to namespace the topics we need.
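As a rough illustration of what such namespacing amounts to, the sketch below derives topic names from the HUMIO_KAFKA_TOPIC_PREFIX variable. The function name, the "ingest" topic name, and the choice to concatenate the prefix directly (with no separator) are assumptions made for the example; only global-events and the variable name come from the text above.

```python
import os
from typing import List, Mapping

# "global-events" is named in the article; "ingest" is a placeholder name.
BASE_TOPICS = ["global-events", "ingest"]


def namespaced_topics(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the topic names to use, with the configured prefix applied."""
    # An unset variable means no prefix, i.e. the plain topic names.
    prefix = env.get("HUMIO_KAFKA_TOPIC_PREFIX", "")
    return [prefix + name for name in BASE_TOPICS]


shared = namespaced_topics({"HUMIO_KAFKA_TOPIC_PREFIX": "team-a-"})
# shared is ["team-a-global-events", "team-a-ingest"]
```

Passing the environment in as a mapping keeps the function easy to exercise without touching the real process environment.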
Some Details of Our Kafka Usage
Either way, we've run into some issues that might be interesting for a Kafka-minded audience.
- We don't use Kafka's built-in compression but provide our own. The issue is that in a high-performance scenario, the stock LZ4 compression uses JNI, which can block the garbage collector. With many consumers running in the same JVM, we can easily get into a situation where there's always a process that prevents GC from happening, leading to long pauses in which nothing gets done. We ran into the same issue with our web framework's use of gzip and replaced that with
- We started using the deleteRecords functionality in Kafka (when ingested data is safely replicated to storage nodes, we can delete old entries in the ingest topic), but have since come to regret that somewhat. It took several Kafka versions for it to work properly, and at the end of the day, the user still needs the assigned retention space in case Humio has trouble keeping up for some reason. It does save on typical disk space usage, but not on worst-case disk space needs. If we were deciding today, we probably wouldn't use it, but by now we may have customers depending on the disk space that is released this way.
- We use the beginningOffsets methods in the consumer to determine the size of the inbound queue and to determine if 'we're running behind' in ingest processing. These calls can sometimes take a very long time, and so it's hard to say if we're behind because the queue is long, or because it takes a long time to determine if the queue is long.
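One way to reason about the last point is to record how long the offset lookup took alongside the queue length, so a slow measurement is reported as such rather than read as a long queue. The Python sketch below is illustrative only: `LagSample`, `measure_lag`, and `classify` are hypothetical names, the thresholds are arbitrary, and a caller-supplied function stands in for the consumer's offset lookups.

```python
import time
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class LagSample:
    queue_length: int      # messages between our position and the log end
    retained: int          # messages still retained in the partition
    lookup_seconds: float  # how long the offset lookup itself took


def measure_lag(
    fetch_offsets: Callable[[], Tuple[int, int]],
    position: int,
    clock: Callable[[], float] = time.monotonic,
) -> LagSample:
    """fetch_offsets returns (beginning_offset, end_offset) for a partition."""
    started = clock()
    beginning, end = fetch_offsets()
    elapsed = clock() - started
    return LagSample(max(0, end - position), end - beginning, elapsed)


def classify(sample: LagSample, max_queue: int, max_lookup_seconds: float) -> str:
    """Classify a sample so a slow lookup is not mistaken for a long queue."""
    if sample.lookup_seconds > max_lookup_seconds:
        return "unknown"  # measurement too slow to trust as current
    return "behind" if sample.queue_length > max_queue else "ok"


# Usage with a fake lookup and a fake clock that reports a 0.5 s lookup.
ticks = iter([0.0, 0.5])
sample = measure_lag(lambda: (100, 1000), position=900, clock=lambda: next(ticks))
status = classify(sample, max_queue=50, max_lookup_seconds=1.0)  # "behind"
```

Separating the measurement from the classification makes it possible to alert on slow lookups independently of actual ingest lag.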
Anyhow, we love Kafka. It has made implementing a distributed system significantly less effort than if we had to do the distributed queue from scratch. Using Kafka as a building block has allowed us to focus our attention on building a fast and pleasant search experience.
Published at DZone with permission of Kresten Krab, DZone MVB. See the original article here.