The Test Pilot team started by outlining all of their major requirements:
- Expected minimum users: 1 million. Design to accommodate 10 million by the end of the year and have a plan for scaling out to tens of millions. (This is the 1x 10x 100x rule of estimation of which I am a fan)
- Expected amount of data stored per experiment: 1.2 TB
- Expected peak traffic: approximately 75 GB per hour for two 8 hour periods following the conclusion of an experiment window. This two day period will result in collection of approximately 90% of the total data.
- Remain highly available under load
- Provide necessary validation and security constraints to prevent bad data from polluting the experiment or damaging the application
- Provide a flexible and easy-to-use way for data analysts to explore the data. While all of these guys are great with statistics and thinking about data, not all of them have a programming background, so higher-level APIs are a plus.
- Do it fast.
Einspanjer and the teams decided that a key-value or column store would be best for their use case. They started researching the usual requirements for large-scale data persistence: Scalability, Elasticity, Reliability, Storage, etc. The developers also needed to know what kind of analyst-friendly system they could fashion, how much hardware they'd have to pay for, the level of security (privacy of data is extremely important to the Test Pilot project), the amount of effort required to implement and maintain a solution, and the extensibility of the solution. Einspanjer and his team wanted an extensible data store that could evolve to fit their needs, but they didn't want a system that would bog them down if they decided to switch technologies.
Here were the the conclusions that the Test Pilot team came to for each NOSQL data store:
CassandraNew nodes receive half of the largest range of data by default, but this can be changed and load balancing can be configured. The rebalance command performs the work throughout all of the data ranges, and rebalancing can be monitored.
Cassandra is lighter on memory requirements compared to HBase, and so is Riak. To analyze data with Cassandra, the Test Pilot team will need to leverage their Hadoop cluster. Any extensible schema changes would require a rolling restart of the nodes. All three solutions make it pretty easy to replicate, export, or MapReduce data out of the system during migration.
HBase and cassandra could incorporate fail-over code during disaster recovery to locally spool submissions until the cluster came back online. For these two stores, Einspanjer thinks a second cluster would be the best option for disaster recovery.
Cassandra and Riak have no single point of failure. For analysis, both Cassandra and HBase can use Hadoop.
HBaseHBase splits data into regions with data files stored in Hadoop Distributed File Systems. Each set of regions has a RegionServer which normally owns regions on the local HDFS DataNode. Einspanjer explains: "If you add a new node, HDFS will begin considering that node for the purposes of replication. When a region file is split, HBase will determine which machines should be the owners of the newly split files. Eventually, the new node will store a reasonable portion of the new and newly split data. Re-balancing the data involves both re-balancing HDFS and then ensuring that HBase reassess the ownership of regions."
Einspanjer believes that with HBase his team would need a dedicated cluster to separate from their Socorro project already running on HBase (because of heavy peak traffic periods). Riak and Cassandra are lighter on memory requirements.
With HBase or Cassandra, the Test Pilot developers would need to add additional security functionality to the front-end layer so that the payload would look for invalid/incomplete data and reject it.
Until later versions of HBase provide better High Availability options, the Hadoop NameNode and HBase Master constitute a single point of failure. Some admin functions and upgrades will require a restart of the entire cluster and a maintenance window to modify the NameNode or HBase Master.
RiakWould you be surprised if I told you that Einspanjer and his team chose Riak over Cassandra? Well they did, and thanks to the availability of some Basho experts, they will be able to deliver a near turn-key solution. Einspanjer says, "In Riak, the data is divided into partitions that are distributed among the nodes. When a node is added, the distribution of partition ownership is changed and both old and new data will immediately begin migrating over to the new data." Because of Riak's elasticity, the team could even temporarily re-purpose some of their nodes to the write cluster to accommodate peak traffic periods.
Riak has a built-in REST server that is battle-tested and production ready. Minimal schema design is required and specific hooking in of the schema is not necessary. For security, the Webmachine pre-commit hooks allow business logic to be included so that it can perform payload inspection. Riak is also highly extensible because new buckets and schema changes are completely dynamic.
For disaster recovery, Riak could potentially reassign the entire reporting cluster temporarily to handle incoming submissions. If they could not make the Riak cluster available for client connections, they would have no buffer in place on the back end to spool submissions.
You can read Einspanjer's full blog post here.