One of the more interesting aspects of being a community manager for NATS is the wide variety of use cases where NATS fits in — IoT, microservices, cloud-native architecture communication, replacing legacy enterprise messaging systems and their associated baggage, etc.
This also means that an increasingly wide spectrum of technologies are used along with NATS. This, of course, is not an issue for NATS: it is as simple a messaging system as you can imagine. NATS is a single binary (and a plain text protocol) with essentially no external dependencies — it makes no assumptions about the infrastructure pieces it interacts with, and "just works."
But the increasing variety of use cases and reference architectures mean I also get to hear about products, technologies, and integration patterns from the development community I would not have otherwise.
Henrik Johansson has been a long-time member of the NATS community, and I recently met him at GopherCon to catch up. A recent blog post of his is below, and it provides some great detail on the experimentation his team went through to optimize their data store — first with Redis, and ultimately with ScyllaDB and NATS.
ScyllaDB — A Journey Into the Unknown…
Pretentious as that might sound, it is not far from the truth. We — my coworkers and I — have traditionally used databases like PostgreSQL and MongoDB for most of our needs, and we have had great success doing so. We have had no dramatic problems with either scalability or reliability, nor is either storage system particularly difficult to work with.
So why endeavor to try another database? A brand-new one with an unproven track record, to boot? Well, honestly, there was a fair bit of curiosity involved, at least initially. It was also motivated by the performance benefit we could potentially gain. I am personally a bit of a performance freak, and I spend what is probably an inordinate amount of time squeezing the last bit of latency or throughput out of a system. Premature optimization? Perhaps, but I prefer to call it good engineering and high quality. We have also had issues scaling PostgreSQL horizontally across a cluster of machines: there has always been the master to track, and a little hands-on maintenance is almost always required. Don't get me wrong, PostgreSQL is a marvel of engineering and my absolute favorite when it comes to relational databases. It is stable, safe, and performs very well even in very demanding scenarios.
A Cassandra Clone
A clone, you say? Why yes, it is indeed a clone in some sense. But it is not just a rewrite of the Java codebase. It has taken that step closer to the metal that is often needed to get the most performance out of a system. There are no guarantees, but at least the door to performance is open, and you can spend your effort on the algorithms rather than on catering to the whims of the JVM.
Sure, it's a bit of flame bait, but there is some truth to it, so let's stick with it. Cassandra has long seen heavy use in many a high-profile site and is definitely one of those systems that inspire awe and respect in most developers' minds. To set out with the express purpose of beating it on its home turf of performance and scalability is nothing short of awesome! Time will tell whether the developer community will switch, or whether Cassandra will accept the challenge and push on. Either way, competition is healthy, and I for one welcome it.
Our Use Case
Initially, our use case was that we needed to relate a large set of transient identifiers ("A") to another quite large set of other identifiers ("B"). It may seem like a trivial thing to do, and we thought so as well as we started using Redis. Redis is known for its blazing performance, and the use case seemed cut and dried. It soon became apparent, though, that we needed to extract all the "A" identifiers corresponding to given "B" identifiers, which also seemed straightforward with Redis' SCAN method. We were satisfied, and it ran fast and stable for some time.
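As a rough sketch of the pattern (the key names here are invented for illustration, not taken from our actual schema), the forward mapping is a cheap key lookup, while the reverse listing leans on SCAN:

```
SET a:42 b:7                  # map one transient "A" identifier to its "B"
GET a:42                      # the common forward lookup: a single O(1) read

# Reverse direction: every "A" for a given "B". With no maintained
# reverse index, this cursors over the whole keyspace...
SCAN 0 MATCH a:* COUNT 1000
# ...followed by a GET per returned key to check whether it points at b:7.
```

The cost of that reverse listing grows with the total number of keys rather than with the size of the answer.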
However, after a while, we began seeing high latency in the system and started to investigate. It turned out that we had severely underestimated the number of transient identifiers, and we were struck rather bluntly by Redis' single-threaded nature. Agreed, this is by design, and we were aware of how it worked; we simply neglected some due diligence regarding our target flow of data.
That being somewhat embarrassing, we set out to correct the oversight. We tried to be clever with Redis' SCAN, running it in steps, but this quickly proved untenable because latencies grew too large too quickly. Perhaps we could have gotten a bigger server, but that felt like admitting defeat — although it really isn't; sometimes it can be the right solution.
We then started to think a little. What about our old and faithful PostgreSQL and MongoDB? Could they be the answer? We concluded that either could possibly be the answer, but we were not entirely convinced which one to choose — or whether it was worth the trouble to try. There would be some coding and setup, tuning indexes and queries. Nothing too hard, but nothing that could be done very quickly either. I had followed ScyllaDB since I first read about it in a pre-1.0 blog post some time ago and thought, "What if we were to try ScyllaDB?" It has the characteristics we want: low-latency reads and writes, as well as sharding and simple querying. After surprisingly little discussion, we decided to try it.
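To make that concrete, here is a minimal sketch of how such a mapping might look in CQL (the keyspace, table, and column names are invented for illustration). Partitioning by the "B" identifier turns "list every A for this B" into a single-partition read:

```sql
-- Hypothetical table: one row per (B, A) pair, partitioned by the "B" id.
CREATE TABLE idmap.mappings (
    b_id text,
    a_id text,
    PRIMARY KEY (b_id, a_id)
);

-- All "A" identifiers for one "B": one partition, no keyspace scan.
SELECT a_id FROM idmap.mappings WHERE b_id = 'b-42';
```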
So, How Are Things Going?
The bottom line is that we are very pleased with ScyllaDB. The problem of listing the identifier mappings is completely gone. It is, in fact, a use case eminently suitable for ScyllaDB. Normal simple lookups are also blazing fast, and while Redis may be faster in the uncontended case, a scaled scenario can be very different. Benchmarking is fun, but hard. Perhaps there will be time in the future for a test. Who knows?
Basically, the only hitch we have hit so far is that during the installation process, we realized that our RHEL 7 setup had some as-yet-unidentified policies in place that got in the way. We found workarounds, but so far there is no clear path to entirely friction-free installs.
The ScyllaDB repositories, however, worked nicely, and the actual install of the packages was a breeze. We also bumped into a couple of already-reported bugs in some of ScyllaDB's setup scripts that, as far as I know, have been fixed at the time of this writing. Moving from the familiar SQL query language into the realm of CQL hasn't been hard, but it gave us pause from time to time. There are just some limitations that you have to keep in mind when transitioning from a traditional relational query pattern to that of CQL.
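One hypothetical example of the kind of pause I mean (the table and column names are invented): filtering on a column that is not part of the primary key, routine in SQL, is rejected by CQL unless you opt in to the full scan explicitly:

```sql
-- Rejected: "created_at" is neither a partition nor a clustering key,
-- so this fails with an error telling you to use ALLOW FILTERING.
SELECT * FROM idmap.mappings WHERE created_at > '2016-01-01';

-- CQL makes the cost explicit instead of hiding it behind a query planner:
SELECT * FROM idmap.mappings WHERE created_at > '2016-01-01' ALLOW FILTERING;
```

There are no joins or subqueries either, so relationships you would normally resolve inside the database move into the data model or the application.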
ScyllaDB in Production
The database has been kicking along nicely, and we have expanded its usage well beyond the initial scope with great ease. We are using the eminent Gocql Go library and have had little to no issues so far — and awesome performance. We have also upgraded ScyllaDB twice, most recently to version 1.2.1, and it has been progressively easier. The second time took very little time and went without any problems.
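For the curious, a read through Gocql looks roughly like this. It is a sketch under assumed names (the host, keyspace, table, and columns are invented), and it naturally needs a running cluster to do anything:

```go
package main

import (
	"fmt"
	"log"

	"github.com/gocql/gocql"
)

func main() {
	// Hypothetical contact point and keyspace.
	cluster := gocql.NewCluster("scylla-1.example.com")
	cluster.Keyspace = "idmap"

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// List every "A" identifier for one "B": a single-partition read.
	iter := session.Query(`SELECT a_id FROM mappings WHERE b_id = ?`, "b-42").Iter()

	var aID string
	for iter.Scan(&aID) {
		fmt.Println(aID)
	}
	if err := iter.Close(); err != nil {
		log.Fatal(err)
	}
}
```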
This is always a point of fear for me when it comes to databases. You always have to take a database dump, reload it after the upgrade, and hope that everything works. Most of the time it works fine, but I never feel really comfortable. In all fairness, ScyllaDB's upgrade instructions advise taking a snapshot, which we did, but there was no need to actually restore it after the upgrade, which is very nice.
Show Me the Money
Well, it's fast. We have started to get used to seeing latencies in the microseconds, as opposed to the milliseconds that have usually been the standard. And these are not just the latencies reported by nodetool, but also those measured in our API handlers after JSON serialization, but before writing to the network.
*(Table: latency histogram with columns Percentile, SSTables, Writes (μs), Reads (μs), and Partition size (bytes).)*
As you can see, there seems to be no immediate cause for concern, although we probably should check that max value of ~50 ms.
Our APIs are really fast as well, and no caching has so far been needed. Both our standard REST endpoints and our NATS-enabled microservices work really well!
This brings an almost perverse satisfaction with the entire stack: Go, Echo, NATS, and, of course, ScyllaDB. It makes me doubt a large chunk of my previous experience with other stacks. It should also be noted that no extra tuning or tweaking has been done, and so far we see no reason to do any (other than possibly for fun)!
So What About the Rock or Monster Thing?
It was my, perhaps far-fetched, attempt at being witty. Recalling, somewhat vaguely, the Greek myth of Scylla and Charybdis, we can now get at a meaning: on one hand we have the rock, on the other the monster, and we have to navigate between them. This is where ScyllaDB comes in. It's both the monster of a database that you envision and the big rock to lean on when it's windy!