Here is why in "Cassandra vs.", it's Cassandra FTW!
Our organization processes thousands of data sources continuously to
produce a single consolidated view of the healthcare space. There are
two aspects of this problem that are challenging. The first is schema
management, and the second is processing time.
Creating a flexible RDBMS model to accommodate thousands of disparate data
sources is difficult, especially as those schemas change over time.
Even given a flexible relational model, properly accessing and
manipulating data in that model is complicated. That complexity bleeds
into application code and hampers analytics.
Given the volume of data and the frequency of updates, standardizing,
indexing, analyzing and processing that data takes days across
dozens of machines. And even with round-the-clock
processing, the business and customer appetites for additional and more
current analytics are insatiable.
Scaling the RDBMS vertically through hardware eventually hits its
limits. Scaling horizontally through sharding becomes a challenge of
its own: Operations and Maintenance (O&M) is difficult and
requires a lot of custom coding to accommodate the partitioning.
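
To make the "custom coding" point concrete, here is a minimal sketch of the kind of shard-routing logic an application ends up owning when it shards an RDBMS by hand. This is not our actual code: the shard names, the `shard_for` and `moved_keys` helpers, and the naive hash-modulo placement are all assumptions chosen for illustration.

```python
import hashlib

# Hypothetical shard names, for illustration only.
SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]


def shard_for(key, shards=SHARDS):
    """Pick a shard by hashing the record key (naive modulo placement)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]


def moved_keys(keys, old_shards, new_shards):
    """Return the keys whose shard changes when the shard list changes."""
    return [
        k for k in keys
        if shard_for(k, old_shards) != shard_for(k, new_shards)
    ]
```

Every query has to go through something like `shard_for`, and with modulo placement, adding a single machine changes the home of most keys (which `moved_keys` lets you measure). Data migration, cross-shard queries, and rebalancing then all become the application's problem, which is where the O&M cost comes from.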
We needed a distributed data system that provided:
- Flexible Schema Management
- Distributed Processing
- Easy Administration (to lower O&M costs)
Driven by the need for flexible schemas, we turned to NoSQL. We
considered Cassandra, MongoDB, CouchDB, HBase, and Riak. Immediately we
set out to see what support each of these had for "real" map/reduce.
Given the processing we do, we knew we would eventually need support
for all of Hadoop's goodness. This includes extensions like Pig
and Cascading.
CouchDB dropped out here. It supports map/reduce, but has little or no
notable support for Hadoop proper. MongoDB scored "acceptable", but the
Hadoop support was not nearly as evolved as the support in Cassandra. DataStax
actually distributes an enterprise version of Cassandra that fully
integrates the Hadoop runtime. Thus, we left MongoDB for another day
and scored HBase's Hadoop support off the charts.
Riak is interesting in that it provides very slick native support for
map/reduce (http://wiki.basho.com/MapReduce.html) via REST, while
also providing a nice bridge from Hadoop. I must admit, we were *very*
attracted to the REST interface (which is why we eventually went on to
create Virgil).
Left with Riak, HBase and Cassandra, we layered in some non-functional
requirements. First, we needed to be able to get third-party support.
Unfortunately, this is where Riak fell out. With DataStax and
Cloudera backing the other contenders, it was hard to go with what felt
like the "new kid on the block".
NOW, down to HBase and Cassandra. For this comparison, I won't bother
reiterating all the points from Dominic Williams's great post.
Given that post and a few others, we decided on Cassandra.
Now, since choosing Cassandra, I can say there are a few other *really*
important, less tangible considerations. The first is the code base.
Cassandra has an extremely clean and well maintained code base.
Jonathan and team do a fantastic job managing the community and the
code. As we adopted NoSQL, the ability to extend the code base and
incorporate our own features (e.g., triggers, a REST interface, and
server-side wide-row indexing) has proven invaluable.
Secondly, the community is phenomenal. That results in timely support
and solid releases on a regular schedule. They do a great job
prioritizing features, accepting contributions, and cranking out
releases. (They are now releasing roughly quarterly.) We've all probably been
part of other open source projects where the leadership is lacking, and
features and releases are unpredictable, which makes your own release
planning difficult. Kudos to the Cassandra team.