Working on high nines where the latency of every query matters is a whole different beast in comparison with normal Cassandra use cases where measurements on 99.9% or even 99% are enough. We recently worked on a project which had really tight latencies (threshold in milliseconds on 99.999% of requests) and we needed latency measurement of each and every request so we wanted to share with the Cassandra community how we hit those latency needs and share our tool that will help the community do the same.
Tools at Your Disposal
Let’s first see what Cassandra offers out of the box. Cassandra has latency measurements for both read and write and it is exposed by a metrics library. You can use it but the main problem here is that it measures latency with three nines precision (up to 99.9%). It is good for rough tuning and at the start of a deployment, but when you come to fine tuning and your SLA is on five nines you are missing lots of data. We were using wrk2 as load generator which measures latency on high nines (it includes coordinated omission, more details can be seen on wrk2 GitHub page) and soon enough we realized having monitoring on three nines is not enough since the results on five nines were far worse and we could not explain what was happening at the moments when latency was high.
Also, the driver has an option to start logging slow queries (at least the Java driver does) via QueryLogger. Basically, you tell it which queries to log (normal or slow) and give it threshold for slow queries, and the driver starts producing a report which covers all queries above the given threshold. This is nice, it covers all queries above the threshold and it is good for a high nines use case. You need some amount of custom coding if you want those log messages plotted somewhere (i.e. Grafana) but you can always write your own implementation of LatencyTracker and send metrics wherever you want.
We are using InfluxDB so we wrote InfluxLatencyTracker which is reporting slow queries directly in influx and from there we are plotting graphs on Grafana. This is all nice and exactly what we want from an application perspective, but we have an Amazon deployment and we are losing precious milliseconds between the cluster and driver, so we are basically blind for those slow queries shown on driver but not seen on the cluster level (those are the ones showing problems with configuration, network, and hardware). We needed a better way of storing and representing slow queries both on the cluster and on the driver so we can correlate information from both sources to figure out exact moments when we have problems and which need further investigation.
OpsCenter to the Rescue - Sort of
Our client is using DSE with OpsCenter for monitoring so we started browsing for a solution there. OpsCenter has a services part which has a performance service and one of the options is to measure slow queries. Basically, you provide a threshold and TTL and all queries above that threshold are stored inside Cassandra meta tables, which are then used by OpsCenter for a limited amount of time (defined by TTL).
This looked promising but the OpsCenter UI part shows only the worst query. Good thing we figured out that all queries are in the DB and you can always query them inside TTL. We figured out how to write a small shell script which will query the DB and send all results to InfluxDB and this can run scheduled on each node. This way we will have central metrics storage and we can persist all slow queries in a central place without TTL.
This sounded exactly like what we needed so we created a shell script quickly and started looking at our fresh new metrics on a Grafana dashboard. We could combine driver slow query reporting and slow query reporting from the cluster and we could see how many of the cluster queries created slow queries on the application level, and how many slow queries are on the application and not on the cluster level (those are most interesting because they show either hardware or network latency, or something waiting and choking on the driver before it goes to cluster).
We were all good with our new shiny metrics until this slow query Cassandra table started hitting our performance. Having one more table with lot of TTLs and lot of tombstones is one more set of data being compacted and evicted from tombstones. Scheduled read queries against that table, which grew in size, started being slow. When we did that every 5 minutes on each node, we were connecting to the cluster, taking over some of the network resources and influencing performance while the cluster is handling 20K writes and 10K reads in a second.
Cassandra Diagnostics Project
Whole different topic, but important to understand our chain of thought. For a long time, we wanted to build an effective audit solution which can hook into the node query path and serve as an audit, to store changes on a row in a separate table. We started and got half way through with that idea and then we got involved in this project measuring latency on high nines. We made a pivot and decided to a create diagnostic solution for Cassandra which will hook into the query path and measure latency, sending metrics to an external reporter.
We found byte buddy, a nice little project which enabled us to hook into the Cassandra code without changing code itself, and create a wrapper around the query processor which will measure latency. We wrapped that into a project and placed that jar in a library folder. Full details of our project along with other modules can be seen on our GitHub page. In a nutshell, the project consists of many modules. The connector is a changeable component which you choose based on the Cassandra version you use (currently we support 2.1 and 3.x).
The newest addition to the code is a driver connector which hooks into the query path on a driver level and reusing logic for measuring slow queries on the application side as well. Full logic is wrapped in the core module. The core is holding different submodules (slow query is just one of them) and it applies different logic for different modules outputting the result to a configured reporter. By default, slow queries are reported to standard log output but you have the option to configure different reporters as well. Currently, we support two more reporter implementations, a Riemann reporter outputting slow queries as riemann events and an InfluxDB reporter which is creating points which are stored then in InfluxDB using this reporter.
Conclusion and Things to Come
This project gave us exactly the insights we needed: we’ve got slow queries measured on the application and on the cluster level without putting additional pressure on Cassandra. It integrated nicely in a monitoring stack with InfluxDB, Grafana, and Riemann. What is even more important is that we apply the same logic on the driver and cluster level and we can plot graphs on Grafana one above the other and compare slow queries on both levels. Insights you get when you correlate those two metrics shows exact moments which need further investigation.
Inside the project, there are two more modules: heartbeat, which is there to test if reporters are working (simple scheduled I am alive ping); and the request rate module, which is measuring write and read query rates on node in configured time unit. We are preparing a lot more stuff, deployment automation using ansible scripts and some more useful modules for Cassandra metrics. If you like this idea and our project feel free to participate, it is open source and we keep the issues list up to date. You can see all our open source projects at our main GitHub page.