3 Key Metrics for Kafka Monitoring
Three metrics that I found very useful from a development point of view and that saved us time while triaging a few corner cases.
“Without data you’re just a person with an opinion.”
— W. Edwards Deming
Kafka documents hundreds of metrics, and the CPU, memory, disk, and network metrics among them are useful when monitoring any system. In this article, I share three metrics that I found very useful from a development point of view and that saved us time while triaging a few corner cases reported by customers.
Lag per Topic: Alerts you when your consumers are running slower than usual. A high value often indicates one or more of the following situations:
(a) A spike in the messages being produced, probably a long one, because after a short spike the consumers usually catch up and bring the lag per topic back down after some time.
(b) Consumer processes that lack system resources or are waiting on blocking I/O or a network call. In one case, one of our EC2 instances (running a Kafka consumer) stopped polling for messages because the JVM crashed with an out-of-memory (OOM) error; however, the health check didn't replace the instance because it was monitoring the instance and not the application inside it.
If data points are missing from your dashboard at first and show up only after some time, add this metric to your monitoring dashboard. As a developer, you should either estimate the expected lag and set a threshold, or enable monitoring first and use the historical data to come up with a baseline.
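As a rough illustration of how lag per topic can be derived, the sketch below sums per-partition lag (log end offset minus committed consumer offset) into one number per topic and flags topics above a threshold. It assumes you have already collected both sets of offsets from your Kafka client or tooling; the function names, the dictionary layout, and the default threshold are hypothetical, and treating a partition with no committed offset as offset 0 is an assumption you may want to change.

```python
from collections import defaultdict


def lag_per_topic(end_offsets, committed_offsets):
    """Sum per-partition lag into a per-topic total.

    Both arguments map (topic, partition) -> offset. How the offsets are
    collected is left to your Kafka client; this function only does the math.
    """
    lag = defaultdict(int)
    for (topic, partition), end in end_offsets.items():
        # Assumption: a partition with no committed offset counts from 0.
        committed = committed_offsets.get((topic, partition), 0)
        # Clamp at 0 so a stale end-offset sample never yields negative lag.
        lag[topic] += max(end - committed, 0)
    return dict(lag)


def topics_over_threshold(lag, thresholds, default=1000):
    """Return the sorted topics whose lag exceeds their threshold.

    `thresholds` maps topic -> maximum acceptable lag; topics without an
    entry fall back to `default` (a placeholder value, not a recommendation).
    """
    return sorted(t for t, value in lag.items() if value > thresholds.get(t, default))


if __name__ == "__main__":
    end = {("orders", 0): 120, ("orders", 1): 80, ("payments", 0): 50}
    committed = {("orders", 0): 100, ("orders", 1): 80, ("payments", 0): 10}
    lag = lag_per_topic(end, committed)
    print(lag)
    print(topics_over_threshold(lag, {"orders": 10}, default=100))
```

The thresholds here stand in for the baseline mentioned above: once you have some historical lag data, replace the placeholder numbers with values drawn from it.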
Consumer Offset Delta (derived metric): Often, a consumer might simply stop receiving messages even though it is running and connected to the topic. This derived metric, the difference between current_consumer_offset and previous_consumer_offset sampled at a fixed interval (every 10 seconds, 30 seconds, or 1 minute), is the key to being alerted to the following probable causes:
(a) Messages not being produced or not at the rate expected
(b) Messages not being consumed or not at the rate expected
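A minimal sketch of this derived metric follows. The tracker keeps the previous offset per key and returns the difference on each sample; you would call it once per sampling interval. The `classify` helper is an addition of mine, not part of the metric itself: it assumes you also sample the log end offset the same way (a producer-side delta) so that causes (a) and (b) can be told apart. All names and labels are hypothetical.

```python
class OffsetDeltaTracker:
    """Tracks the change in an offset between consecutive samples."""

    def __init__(self):
        self._previous = {}

    def sample(self, key, current_offset):
        """Return current minus previous offset, or None on the first sample."""
        previous = self._previous.get(key)
        self._previous[key] = current_offset
        if previous is None:
            return None
        return current_offset - previous


def classify(consumer_delta, producer_delta):
    """Map one interval's deltas to a probable cause.

    consumer_delta: change in the committed consumer offset.
    producer_delta: change in the log end offset (assumed to be sampled too).
    """
    if consumer_delta == 0 and producer_delta == 0:
        return "no_messages_produced"      # cause (a)
    if consumer_delta == 0:
        return "consumer_stalled"          # cause (b)
    if consumer_delta < producer_delta:
        return "consumer_falling_behind"   # cause (b), partial
    return "ok"


if __name__ == "__main__":
    tracker = OffsetDeltaTracker()
    for offset in (100, 130, 130):
        print(tracker.sample("orders-0", offset))
```

To cover the "not at the rate expected" part of each cause, compare the deltas against a minimum expected rate per interval instead of against zero.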
Consumer Timer (derived metric): I recommend that developers add a timer that goes off when a poll cycle receives (a) 0 messages, (b) fewer than x messages, or (c) more than y messages. If this timer count breaches a threshold, raise an alert. This might sound difficult to implement initially, but if you keep track of this metric, the dev team can easily identify patterns that will help them improve the code.
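One possible shape for such a timer is sketched below, under a few assumptions: the consumer loop calls `record` with the number of messages each poll returned, `low` and `high` play the roles of x and y, and "breaches a threshold" is read as a bucket count reaching `alert_after`. The class and bucket names are hypothetical.

```python
from collections import Counter


class PollCycleMonitor:
    """Counts poll cycles that returned 0, fewer than x, or more than y messages."""

    def __init__(self, low, high, alert_after):
        self.low = low                  # x: below this a cycle counts as "low"
        self.high = high                # y: above this a cycle counts as "high"
        self.alert_after = alert_after  # bucket count that triggers an alert
        self.counts = Counter()

    def record(self, message_count):
        """Classify one poll cycle, bump its counter, and return the bucket."""
        if message_count == 0:
            bucket = "empty"
        elif message_count < self.low:
            bucket = "low"
        elif message_count > self.high:
            bucket = "high"
        else:
            bucket = "normal"
        self.counts[bucket] += 1
        return bucket

    def alerts(self):
        """Return the sorted abnormal buckets whose count reached the threshold."""
        return sorted(
            bucket
            for bucket, count in self.counts.items()
            if bucket != "normal" and count >= self.alert_after
        )


if __name__ == "__main__":
    monitor = PollCycleMonitor(low=5, high=100, alert_after=3)
    for received in (0, 0, 2, 50, 0, 150):
        monitor.record(received)
    print(monitor.counts)
    print(monitor.alerts())
```

In a long-running consumer you would probably reset the counters at the end of each reporting window so that old cycles do not keep an alert active; that part is left out here for brevity.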
Any insight you can get out of your system helps in deriving a strategy to optimize and cut costs. Monitoring is not just an operational exercise but also part of the development process. You can learn a great deal by observing firsthand how the Kafka cluster (or any other system) functions in production.