Defending the Cassandra Benchmark: What it Means to Compare NoSQL Performance
You may have heard about Jonathan Ellis criticizing Thumbtack Technology's NoSQL benchmarks on the DataStax blog - in short, he suggested that the benchmarks were improperly configured and failed to give Cassandra's performance the recognition it deserved. Well, Ben Engber at Thumbtack Technology heard about the criticism, too, and according to his recent response, Ellis is way off the mark.
Engber acknowledges that some of Ellis' basic points are valid - not every test was configured identically for each database - but different configurations, Engber argues, are necessary:
. . . Cassandra, Couchbase, Aerospike, and MongoDB are architected very differently. This is a pretty complex discussion, and we discussed the durability question explicitly in the second part of our study. Fundamentally, these databases work in different ways and are optimized for different things. The trick was to create a baseline that compares them in a useful way.
From there, Engber explains the set-up of the benchmarks and what, exactly, they aimed to measure. More interesting, though, is his higher-level discussion of benchmarks and their purpose:
Let’s take a step back to why we would want to run a benchmark in the first place. A benchmark is a synthetic thing, in a controlled environment, using a specialized and artificially designed workload. The only reason to do such a thing is if we hope to learn something about the real world.
According to Engber, though, the benchmarks were not invalid; they did yield useful data about real-world behavior. He acknowledges that the benchmarks were requested by Aerospike, and even concedes a couple of the flaws Ellis pointed out - one, he says, is mostly a failure of documentation, while the other is a valid but negligible omission. His overall point stands, however: the differing configurations were not a fudging of the rules or a sign of bias, but a necessary concession to create a meaningful baseline for comparison.
It comes down, as it so often does, to the complexity of these technologies: another benchmark configured differently (aiming at a different baseline) could produce entirely different results, depending on the strengths and weaknesses of each database. That has been a common sentiment for a while now: your database is a tool, so you need to understand what it is built for and use it only where it fits.
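To make the baseline problem concrete: benchmarks like these are commonly driven by a YCSB-style workload definition. The fragment below is a hypothetical sketch (the parameter values are illustrative, not taken from Thumbtack's actual study) showing the kind of knobs a workload file exposes:

```properties
# Hypothetical YCSB-style workload definition - values are illustrative,
# not taken from Thumbtack's published configuration.
workload=com.yahoo.ycsb.workloads.CoreWorkload
recordcount=10000000        # size of the preloaded dataset
operationcount=50000000     # total operations issued during the run
readproportion=0.5          # 50/50 read/update mix
updateproportion=0.5
requestdistribution=zipfian # skewed access pattern, hot keys
```

The catch Engber describes lives outside this file: even with an identical workload, each database's own settings (replication factor, write-acknowledgment level, synchronous vs. asynchronous persistence) change what a single "write" actually guarantees, so identical client-side configuration would not, by itself, make the comparison fair.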
In other words, it may be wise to take every benchmark with a grain of salt - or, more precisely, to be exact in your understanding of what it actually measures.