Using Unravel for End-to-End Monitoring of Kafka
Unravel comes standard with a number of tools that can help you keep a close eye on the processes in your Kafka instances.
Customers with Kafka clusters struggle to understand what is happening in their Kafka implementations. There are out-of-the-box solutions like Ambari and Cloudera Manager which provide some high-level monitoring; however, most customers find these tools to be insufficient for troubleshooting purposes. These tools also fail to provide insight/visibility down to the applications acting as consumers that are processing Kafka data streams.
For Unravel customers, Kafka monitoring and insights come out-of-the-box with your installation. This tutorial shows best practices for using Unravel to monitor your Kafka environments. It assumes you have a baseline knowledge of Kafka concepts and architecture.
Unravel provides color-coded KPIs per cluster to give you a quick overall view on the health of your Kafka cluster. The colors represent the following:
- Green = Healthy
- Red = Unhealthy and is something you will want to investigate
- Blue = Metrics on cluster activity
Let's walk through a few scenarios and how we can use Unravel's KPIs to help us through troubleshooting issues.
Under-replicated partitions tell us that follower replicas are not keeping up with the partition leader. This adds latency, because consumers don't get the data they need until messages are replicated, and it leaves us more vulnerable to losing data if a leader fails. Any under-replicated partition at all is a bad sign, and something we'll want to root cause to avoid data loss. Generally speaking, under-replicated partitions point to an issue on a specific broker.
If under-replicated partitions are showing up in your cluster, we first need to understand how often this is occurring. The following graph gives us a better picture:
Number of Under-Replicated Partitions
If we are seeing under-replicated partitions then we'll want to understand which broker contains under-replicated partitions. Let's drill down to the broker level by clicking on the "Broker" tab:
From here, we identify which broker(s) have under-replicated partitions. Now that we have the offending broker(s) identified we'll want to investigate the offending broker logs to determine root cause.
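To make the idea concrete, here is a minimal sketch of how under-replication can be traced to a broker from partition metadata. It assumes the usual Kafka definition (a partition is under-replicated when its in-sync replica set is smaller than its assigned replica set). The dictionary layout, broker ids, and topic names are hypothetical, and this is not a description of how Unravel computes its KPI.

```python
from collections import Counter


def count_under_replicated(partitions):
    """Number of partitions whose ISR is smaller than the assigned replica set."""
    return sum(1 for p in partitions if len(p["isr"]) < len(p["replicas"]))


def lagging_replicas_by_broker(partitions):
    """Count, per broker, the partitions where that broker is an assigned
    replica but is missing from the in-sync replica (ISR) set."""
    counts = Counter()
    for p in partitions:
        for broker in p["replicas"]:
            if broker not in p["isr"]:
                counts[broker] += 1
    return dict(counts)


# Hypothetical metadata snapshot; ids and topic names are made up.
partitions = [
    {"topic": "orders", "partition": 0, "replicas": [1, 2, 3], "isr": [1, 2, 3]},
    {"topic": "orders", "partition": 1, "replicas": [2, 3, 1], "isr": [2]},
    {"topic": "clicks", "partition": 0, "replicas": [2, 1, 3], "isr": [2, 1]},
]

print("under-replicated partitions:", count_under_replicated(partitions))
for broker, n in sorted(lagging_replicas_by_broker(partitions).items()):
    print(f"broker {broker}: out of sync for {n} partition(s)")
```

In this made-up snapshot, broker 3 is missing from the ISR of two partitions, which would make its logs the first place to look.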
Another useful metric to monitor is the "Log flush latency, 99th percentile" graph, which shows the time it takes for the brokers to flush logs to disk.
Log Flush Latency
Log flush latency is important because the longer it takes to flush a log to disk, the more the pipeline backs up and the worse our latency and throughput become. When this number goes up, even from 10 ms to 20 ms, end-to-end latency balloons, which can also lead to under-replicated partitions.
If we notice that latency fluctuates greatly, then we'll want to identify which broker(s) are contributing to this. In order to do this we'll go back to the "Broker" tab and click on each broker to refresh the graph for the broker selected:
Once we identify the broker(s) with wildly fluctuating latency we will want to get further insight by investigating the logs for that broker. If the OS is having a hard time keeping up with log flush, then we may want to consider adding more brokers to our Kafka architecture.
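As a rough illustration of what "99th percentile" and "fluctuates greatly" can mean in numbers, the sketch below computes a nearest-rank percentile and flags brokers whose p99 series varies a lot relative to its mean. The 0.5 threshold, the sample values, and the function names are assumptions chosen for the example, not Unravel settings.

```python
import math
from statistics import mean, pstdev


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of
    samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def fluctuating_brokers(p99_series, max_cv=0.5):
    """Return brokers whose p99 latency series has a coefficient of
    variation (stddev / mean) above max_cv. The threshold is arbitrary."""
    flagged = []
    for broker, series in p99_series.items():
        cv = pstdev(series) / mean(series)
        if cv > max_cv:
            flagged.append(broker)
    return sorted(flagged)


# Hypothetical p99 log flush latencies in milliseconds, one value per interval.
p99_series = {
    "kafka1": [10, 11, 10, 12, 11],
    "kafka2": [10, 45, 12, 80, 15],
}

print(fluctuating_brokers(p99_series))
```

A relative measure such as the coefficient of variation is used here so that a broker hovering steadily around a higher latency is not confused with one that swings between low and high values.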
Generally speaking, you want to avoid offline partitions in your cluster. If this metric is greater than 0, then there are broker-level issues which need to be addressed.
Unravel provides the "# Offline Partitions" graph to understand when offline partitions occur:
This metric provides the total number of topic partitions in the cluster that are offline. This can happen if the brokers holding the replicas are down, or if unclean leader election is disabled and none of the available replicas are in sync, so none can be elected leader (disabling unclean election may be desirable to ensure no messages are lost).
In order to investigate this we'll start on the broker level by navigating to the Broker tab:
From here we can see that our broker "kafka2" contains offline partitions. Now that we have the offending broker(s) identified, we'll want to investigate the offending broker logs to determine the root cause.
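The two causes of an offline partition described above can be expressed as a small, deliberately simplified model of leader election. This is a sketch for reasoning only: real Kafka election involves the controller and more state than is shown here, and the function name and arguments are hypothetical.

```python
def elect_leader(replicas, isr, live_brokers, unclean_election=False):
    """Return the broker id that would lead the partition, or None if the
    partition would be offline under this simplified model."""
    # Preferred case: a replica that is both alive and in sync.
    for broker in replicas:
        if broker in isr and broker in live_brokers:
            return broker
    # Unclean election: accept any live replica, at the risk of losing
    # messages that the out-of-sync replica never received.
    if unclean_election:
        for broker in replicas:
            if broker in live_brokers:
                return broker
    return None


# Only broker 1 was in sync, and it is down.
print(elect_leader([1, 2, 3], isr=[1], live_brokers={2, 3}))
print(elect_leader([1, 2, 3], isr=[1], live_brokers={2, 3}, unclean_election=True))
```

The first call models the "safe but offline" outcome, while the second shows the availability gained, and the durability given up, when unclean election is allowed.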
The "# Controller" KPI displays the number of brokers in the cluster reporting as the active controller in the last interval. Controller assignment can change over time as shown in the "# Active Controller Trend" graph:
"# Controller" = 1: There is one active controller. This is the state that you want to see.
"# Controller" > 1: Can be good or bad. During steady state, there should be only one active controller per cluster. If this is greater than 1 for only one minute, then it probably means the active controller switched from one broker to another. If this persists for more than one minute, troubleshoot the cluster for "split brain."
Whenever the active controller count is anything other than 1, we'll want to investigate the broker logs for further insight.
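The rules above translate naturally into a small check over a per-minute series of active controller counts. The sketch below assumes one sample per minute, with the most recent sample last; the status strings and function name are made up for illustration.

```python
def controller_status(counts_per_minute):
    """Classify the latest active-controller reading using the rules of
    thumb from the text. Expects one sample per minute, newest last."""
    latest = counts_per_minute[-1]
    if latest == 1:
        return "ok"
    if latest < 1:
        return "no-controller"
    # Count how many consecutive trailing samples are above 1.
    consecutive = 0
    for count in reversed(counts_per_minute):
        if count > 1:
            consecutive += 1
        else:
            break
    return "possible-split-brain" if consecutive > 1 else "likely-switchover"


print(controller_status([1, 1, 2]))
print(controller_status([1, 2, 2, 2]))
```

Anything other than "ok" is a cue to open the broker logs; the distinction between the two "greater than 1" states only suggests how urgently.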
The last three KPIs show cluster activity within the last 24 hours. They will always be colored blue because these metrics are neither good nor bad in themselves; they are useful for gauging activity in your Kafka cluster. You can also view these metrics via their respective graphs. Tracking this activity can prompt you to:
- Add additional brokers to keep up with data velocity
- Evaluate the performance of topic architecture on your brokers
- Evaluate the performance of partition architecture for a topic
The next section provides best practices in using Unravel to evaluate the performance of your topics/brokers.
A common challenge for Kafka admins is providing an architecture for the topics/partitions in the cluster which can support the data velocity coming from producers. Most of the time, this is a shot in the dark because out-of-the-box solutions do not provide any actionable insight on activity for your topics/partitions. Let's investigate how Unravel can help shed some light on this.
On the producer side, it can be a challenge to decide how to partition topics at the Kafka level. Producers can route messages by key, or fall back to a round-robin strategy when no key has been defined for a message. Choosing the correct number of partitions for your data velocity is important in ensuring we have a real-time architecture that is performant.
Let's use Unravel to get a better understanding of how the current architecture is performing. We can then use this insight to guide us in choosing the correct number of partitions.
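The difference between keyed and keyless routing can be sketched in a few lines. This is an illustration of the concept only: Kafka's default Java partitioner hashes keys with murmur2, whereas the sketch swaps in CRC32 from the Python standard library, and newer Java clients batch keyless records with a sticky strategy rather than strict round-robin. The class name is hypothetical.

```python
import zlib
from itertools import count


class SketchPartitioner:
    """Toy partitioner: hash the key if present, otherwise round-robin."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self._counter = count()  # drives round-robin for keyless messages

    def partition(self, key=None):
        if key is None:
            return next(self._counter) % self.num_partitions
        # CRC32 is a stand-in for Kafka's murmur2 hash (an assumption
        # made to keep the example dependency-free).
        return zlib.crc32(key) % self.num_partitions


p = SketchPartitioner(num_partitions=3)
print([p.partition() for _ in range(4)])                     # keyless: cycles through partitions
print(p.partition(b"user-42") == p.partition(b"user-42"))    # keyed: always the same partition
```

Keyed routing preserves per-key ordering but can create hot partitions when a few keys dominate, which is one reason per-partition traffic is worth watching.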
To investigate, click on the "Topic" tab and scroll to the bottom to see the list of topics in your Kafka cluster:
In this table we can quickly identify topics with heavy traffic, which are the ones where we may want to understand how our partitions are performing.
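If the same per-topic and per-partition rates are available outside the UI, the kind of ranking described above is easy to reproduce. The sketch below assumes hypothetical throughput numbers and uses a simple max-to-mean ratio as a skew indicator; both function names are made up.

```python
def busiest_topics(topic_rates, top_n=3):
    """Return the top_n topic names ordered by throughput, highest first."""
    return sorted(topic_rates, key=topic_rates.get, reverse=True)[:top_n]


def partition_skew(partition_rates):
    """Ratio of the busiest partition's rate to the mean rate.
    1.0 means perfectly even traffic; larger values mean hotter partitions."""
    avg = sum(partition_rates) / len(partition_rates)
    return max(partition_rates) / avg if avg else 0.0


# Hypothetical bytes-in-per-second figures.
topic_rates = {"orders": 5_000, "clicks": 50_000, "audit": 20_000}
print(busiest_topics(topic_rates, top_n=2))
print(partition_skew([300, 0, 0]))
```

A heavily trafficked topic with a high skew value is a candidate for revisiting its key choice or partition count.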
Consumer Group Architecture
We can also make changes on the consumer group side to scale according to our topic/partition architecture. Unravel provides a convenient view of the consumer groups for each topic, so we can quickly get a health status of the consumer groups in the Kafka cluster. If a consumer group is a Spark Streaming application, then Unravel can also provide insights into that application, thereby providing an end-to-end monitoring solution for your streaming applications.
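A common way to reason about consumer group health is consumer lag: the gap between a partition's latest offset and the offset the group has committed. The sketch below shows that arithmetic on hypothetical offsets. It is an assumption that lag is a useful proxy here; the text does not say how Unravel derives its health status. Treating a missing committed offset as 0 is also a simplification.

```python
def consumer_lag(end_offsets, committed_offsets):
    """Per-partition lag: log end offset minus committed offset.
    Partitions with no committed offset are treated as committed at 0
    (a simplification for this sketch)."""
    lag = {}
    for tp, end in end_offsets.items():
        committed = committed_offsets.get(tp, 0)
        lag[tp] = max(0, end - committed)
    return lag


# Hypothetical offsets keyed by (topic, partition).
end_offsets = {("orders", 0): 100, ("orders", 1): 50}
committed = {("orders", 0): 90}

lag = consumer_lag(end_offsets, committed)
print(lag, "total:", sum(lag.values()))
```

Lag that keeps growing over time suggests the group cannot keep up with producers, which is when adding consumers (up to the partition count) or partitions becomes worth considering.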
Monitoring consumer groups has already been covered in Unravel's documentation which can be found here.
Thanks for reading.
Published at DZone with permission of George Demarest, DZone MVB. See the original article here.