In the first two blog posts (part one and part two), we gave a couple of pointers about the settings that can be found in the driver and that you can leverage to improve the performance of application-to-Cassandra communication. Most of the time, this is enough to improve the performance significantly. However, we had a really tight SLA that made us go the extra mile. We needed to provide guarantees that 99.999% of read requests would perform below a certain threshold while we had both a Cassandra cluster and the application deployed on AWS' infrastructure. When working with percentile requirements defined like this, even the smallest glitch can hurt your SLA. This was the main reason why we needed to work hard both on the Cassandra cluster side and the application side.
In order to provide full context, we need to explain the use case better. We have a write-heavy, low-latency-read use case. The performance of our writes is not that important, and consistency is not an issue: we can write with consistency level ONE to get responses fast. Reads need to be fast — our SLA says we are aiming for five nines (99.999%) of reads under 100ms and four nines (99.99%) under 50ms. The whole infrastructure is deployed on AWS as a multi-DC deployment. We use EBS volumes to get persistence across reboots and snapshotting possibilities. This makes maintenance and operation easier, but it comes at a price: EBS volumes have disk latencies that can hurt your performance.
Separation of Read and Write Sessions
As explained previously, our write latency is not that important, while satisfying the SLA constraint on reads is. We needed to solve the problem of EBS volume disk latencies, so the first thing we looked at was load balancing. Since we have a multi-DC deployment, we use DCAwareRoundRobinPolicy as the base load balancing policy. Even if we wrap it in TokenAwarePolicy, disk latency on certain nodes will still hurt us: we can land on a node holding a replica, but if that node has disk latency, every request served from it will have that latency added. We needed a way to avoid the nodes with latency, and we needed a configuration that would let the application figure that out quickly. Our application used the same cluster session for both reads and writes. Our SLA has clear requirements for reads only, and tuning a shared session would be much harder, as reads and writes have different patterns and different requirements.
We decided to separate the sessions and run with one session for reads and one for writes. This resulted in some more connections to the cluster, but the benefits and flexibility it brought compensated for the extra connections. We also put monitoring of the connections in place up front, and this measurement proved that the cluster could handle it. With a clear separation between reads and writes, we could choose the best load balancing for each session, and we could tune reads without thinking about their impact on writes. If you decide to go this way, it is of the utmost importance to monitor the resources both sessions compete for, such as connections, thread pools, etc.
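Structurally, the split can be sketched like this with the DataStax Java driver 3.x. This is a minimal sketch, not our production code; the contact point and keyspace name are made-up placeholders:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

// Two independent Cluster/Session pairs: each gets its own connection pools,
// load balancing policy, and timeouts, so reads can be tuned without
// affecting writes.
public class Sessions {

    public static Session readSession() {
        Cluster readCluster = Cluster.builder()
                .addContactPoint("10.0.0.1") // placeholder contact point
                // read-specific policies (load balancing, timeouts) go here
                .build();
        return readCluster.connect("my_keyspace"); // placeholder keyspace
    }

    public static Session writeSession() {
        Cluster writeCluster = Cluster.builder()
                .addContactPoint("10.0.0.1") // placeholder contact point
                // write-specific policies go here
                .build();
        return writeCluster.connect("my_keyspace"); // placeholder keyspace
    }
}
```

The price is a second set of control connections and pools per node, which is why monitoring the connections, as mentioned above, matters.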
Tuning Reads: How to Minimize the Impact of EBS Disk Latency
We have already mentioned the problem of disk latency, and we had a few options for solving it: speculative executions with a low threshold (a request times out after the threshold and goes to the next node if the queried node has disk latency) and a latency-aware load balancing policy (it keeps an HDRHistogram of node performance and can be tuned to avoid slow nodes). We wanted to try both and compare. First, we tried speculative executions, which sounded straightforward. We set a threshold that fell within our SLA (30ms), we set the maximum number of attempts (three, as we had three replicas), and we let it run. We calculated that even with two timeouts we would still receive reads in 90ms, which fell under our SLA (99.999% of reads under 100ms).
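As a sketch, the speculative execution setup described above looks roughly like this with the 3.x driver. The contact point is a placeholder, and note that in recent 3.x versions the driver only sends speculative executions for statements marked idempotent:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.policies.ConstantSpeculativeExecutionPolicy;

public class SpeculativeReadCluster {

    public static Cluster build() {
        return Cluster.builder()
                .addContactPoint("10.0.0.1") // placeholder contact point
                .withSpeculativeExecutionPolicy(
                        // After 30ms without a response, query the next node;
                        // at most 3 executions in total (one per replica).
                        new ConstantSpeculativeExecutionPolicy(30, 3))
                .build();
    }
}
```

Worst case is then two 30ms waits plus the final attempt, which is where the 90ms figure comes from — but every speculative request is extra load on the cluster.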
However, the price of having that many requests on the cluster was something our cluster size could not handle. Thread pool queues had a lot of pending requests, and monitoring on the cluster side showed that the native transport limits could not cope with this many requests. So we decided to try the latency-aware policy.
When you browse the Internet for latency-aware policies, there are a lot of resources mentioning that it is a good idea but hard to tune. It is even said that the benefits you get in comparison with a token-aware policy are not worth the effort. However, it seemed like the perfect solution to our problem, so we decided to give it a try.
The initial run was with the default parameters. Basically, the policy measures latency over time and makes a decision based on the best performing node. We had three replicas and we wrapped TokenAwarePolicy in LatencyAwarePolicy, which effectively meant that, out of our three replicas, load balancing would choose the fastest one. Most of the time that was the replica in the same AZ as the application issuing the query, but not necessarily.
We noticed an increase in performance really fast. Most of the time we were meeting our SLA, except when we hit a node that had no performance issues before but was just starting to develop them. The good news is that the latency-aware policy lets you tune its sensitivity so you can detect slow nodes faster. We tuned three parameters. First, we lowered the exclusion threshold from 2 to 1.2. The default effectively tells load balancing to exclude nodes performing twice as badly as the fastest node; we wanted smaller margins and faster exclusion, so after the change we were excluding nodes that were just 20% worse than the fastest performing one.
The second parameter we changed was the scale, which defaults to 100ms and controls how quickly older latency measurements lose significance relative to fresh ones. Since we wanted to exclude nodes faster, we needed fresh measurements to dominate sooner, so we lowered the scale to 25ms.
For the last parameter, we changed the retry period. With this aggressive tuning there was a risk of excluding a lot of nodes, so we wanted to bring them back faster. The default was 10 seconds, and we halved it: setting it to 5 seconds meant that each penalized node would be retried after 5 seconds.
With those settings, we got exactly what we wanted. The read session had a load balancing policy that shifted away from nodes quickly once EBS volume latency started to bite, and it made its best effort to return nodes to the query plan as soon as the issues were over.
```
latencyAwarePolicyOptions:
  exclusionThreshold: 1.2
  scaleInMilliseconds: 25
  retryPeriodInSeconds: 5
  updateRateInMilliseconds: 100
  minimumMeasurements: 50
```

```java
LatencyAwarePolicy.builder(
        new TokenAwarePolicy(DCAwareRoundRobinPolicy.builder().build()))
```
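Wiring those options into the builder looks roughly like this with the 3.x driver. This is a sketch, not our exact production code; the contact point is a placeholder, and note that the driver API itself misspells "minimum":

```java
import java.util.concurrent.TimeUnit;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
import com.datastax.driver.core.policies.LatencyAwarePolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

public class LatencyAwareReadCluster {

    public static Cluster build() {
        LatencyAwarePolicy policy = LatencyAwarePolicy.builder(
                        new TokenAwarePolicy(DCAwareRoundRobinPolicy.builder().build()))
                .withExclusionThreshold(1.2)                // exclude nodes 20% slower than the best
                .withScale(25, TimeUnit.MILLISECONDS)       // weigh fresh measurements more
                .withRetryPeriod(5, TimeUnit.SECONDS)       // retry a penalized node after 5s
                .withUpdateRate(100, TimeUnit.MILLISECONDS) // recompute latency scores every 100ms
                .withMininumMeasurements(50)                // sic: misspelled in the driver API
                .build();

        return Cluster.builder()
                .addContactPoint("10.0.0.1") // placeholder contact point
                .withLoadBalancingPolicy(policy)
                .build();
    }
}
```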
Fail Early With Socket Timeout
When we think about which reads are valuable to our use case, we can conclude that it is only the reads that meet the SLA. Reads that stay stuck for a long time create load on the cluster and fall outside our SLA anyway, so they bring us no value. The client application communicating with our application has a firmly set threshold, and all requests above that threshold get discarded. That's why we wanted to do the same: discard requests above a certain threshold and ease the load on the cluster. The socket timeout option on the driver gave us exactly that. The details are explained in part one, but the main thing is that you set a timeout value in milliseconds on the driver, and all queries exceeding it will time out. It is up to you to choose the retry policy that makes the most sense for your use case. We set our timeout to 100ms so we could time out all queries above that value.
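A minimal sketch of that driver setting with the 3.x driver (the contact point is a placeholder). Note that this is a driver-side, per-host read timeout, distinct from Cassandra's server-side request timeouts:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.SocketOptions;

public class FailFastReadCluster {

    public static Cluster build() {
        return Cluster.builder()
                .addContactPoint("10.0.0.1") // placeholder contact point
                .withSocketOptions(
                        // Queries that take longer than 100ms fail on the
                        // driver side instead of piling up load on the cluster.
                        new SocketOptions().setReadTimeoutMillis(100))
                .build();
    }
}
```

Pair this with whichever retry decision fits your use case, as discussed above.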
With this tuning walkthrough, we wanted to emphasize the importance of understanding your use case. There is no silver bullet, no universal guide on how to tune the DataStax Java driver for Apache Cassandra. There are better practices for sure, but you must truly understand your use case to make some of these decisions. For us it made sense to discard queries above 100ms, since by that point we had already lost the money. In some other use case, we might instead add a retry policy that gives reads a couple more attempts before giving up. An important takeaway from this blog post is exactly that: make sure you fully understand your use case and the reasons behind it before you start changing driver and Cassandra settings.