Broken Glass: Diagnosing Production Cassandra Issues
I just passed my second year anniversary at Health Market Science (HMS), and we've been working with Cassandra for almost the entirety of my career here. In that time, we have had remarkably few problems with it. Like few other technologies I've worked with, Cassandra "just works."
But, as with *every* technology I've ever worked with, you eventually have some sort of issue, even if it is not with the technology itself, but rather your use of the technology. And that was the situation here. (gun? check. foot? check. aim... fire. =)
Here is our tale of when bullet met foot...
Our dependency on Cassandra has increased exponentially since its been in production. We've been adding product lines and clients to those product lines at an ever-increasing rate. And with that success, we've had to evolve the architecture over time, but some parts of the system have remained untouched because they've been cruising along. Over the last couple weeks, one of those parts reared its ugly head.
We've been scaling the nodes in our cluster vertically to accommodate demand. Our cluster is entirely virtual, so this was always the path of least resistance. Need more memory? No problem. Need more CPU? No problem. Need space/disk? We've got tons in our SAN. You do that a few times and with increasing frequency, and you can start to see a trend that doesn't end well. =)
First, as we increased our memory footprint, we weren't paying close enough attention to the tuning parameters in: http://www.datastax.com/docs/1.0/operations/tuning
We had our heap size set too large given our system memory, and that started causing hiccups in Cassandra. Once we brought that back in-line, we limped along for a few more weeks.
Then things came to a head last week. We saw the cliff at the end of the road. We found a "bug" in one of our client applications that was inadvertently introducing an artificial throttle. Fantastic! We make the code change (2 lines of code), do some testing, and release it to production. Bam, we increased our concurrency by orders of magnitude. Uh oh, what's that? Cassandra is choking?
Cassandra started to garbage collect rapidly. We quickly consulted the google Gods and went to the Oracle (Matrix reference, NOT the DB manufacturer =) for advice:
If you have not read through that presentation, do so before its too late. For performance tuning and C* diagnosis, there is no kung fu stronger than that of Aaron Morton (@aaronmorton).
We started looking at tpstats and cfstats. All seemed relatively okay. What could be expanding our footprint?
Well, we have a boat-load of column families. We've evolved the architecture and our data model, and in the newer applications we've taken a virtual-keyspaces approach, consolidating data into a single large column family using composite row keys. But alas, the legacy data model remains in production. Many of those column families see very little traffic, but Cassandra still reserves some memory for them. That might have been the culprit, but those column families had been there since the beginning of time. We had to look deeper.
We had heard about Bloom Filter bloat, and we thought that might be the issue. But looking at the on-disk size of the filters (ls -al **/*Filter.db), everything seemed hunky-dory and could fit well within our monstrous heap. (in 1.2 these have been moved off heap)
Way back when we had a brilliant idea to introduce some server-side AOP code to act as triggers. Initially, we used them to keep indexes in sync: wide-rows, and even at one point we kept Elastic Search up-to-date with server-side triggers. This kept the client-side code simple-stupid. The apps connecting to C* didn't need to know about any of our indexing mechanisms.
Eventually, we figured out that it was better to control that data flow in the app-layer (via Storm), but we still had AOP code server-side to manage the wide-rows. And despite the fact that I've recently been speaking out against our previous approach, that code was still in there. Could that be the be root cause? Our wide-rows were certainly getting wider... (into the millions of columns at this point)
One of our crew (kudos to sandrews) found JMeter Cassandra and started hammering away in a non-production environment. We attached a profiler, which exposed our problem -- the AOP inside. Fortunately, we had already been working on a patch that removed the AOP from C*. The patch moved the AOP code to the client-side (point-cutting Hector instead of Thrift/Cassandra). We applied the patch and tested away.
Voila, C* was humming again, and we all lived happily ever after.
A big thanks to +Aaron Morton again for the help. You are a rock star.
And to the crew at HMS, it's an honor to work with such a talented, passionate team.
Good on ya.