Apache Cassandra is a highly scalable, distributed NoSQL database system. It achieves high scale by distributing data symmetrically among all the compute nodes in a cluster. In performance testing of a 288-node Cassandra cluster, Netflix achieved more than one million writes per second.
Each compute node in a Cassandra cluster contains a portion of the data, and that portion may be replicated on multiple nodes in one or more datacenters. Unlike other NoSQL systems, Cassandra supports consistency that is tunable from eventual consistency up to strong consistency.
Cassandra is open source and is currently at release v1.2. It comes with a SQL-like query language, CQL, that is currently at v3. The latest releases of Cassandra and CQL introduce a number of changes that simplify the task of learning to use them, making this a good time to look into the technology. Cassandra is worth considering where large amounts of data must be stored with good write performance. An important use case is the storage of time-series data.
The Windows Azure Developer Portal has an article written by Hanu Kommalapati showing how to deploy a Cassandra cluster to Windows Azure Virtual Machines running Linux and access it using a Node.js front end (note the fix suggested by Patriek van Dorp at the end of the article). Charles Lamanna (@clamanna) has a post showing how to deploy a performant Cassandra cluster to Windows Azure Virtual Machines running Linux (Q&A). Robin Schumacher has a post showing how to setup and monitor a multi-node Cassandra cluster on Windows. Kelly Sommers (@kellabyte) has an introductory post on Cassandra running in Windows.
There are many NoSQL choices available today. Kristof Kovacs (@kkovacs) has a post comparing Cassandra, MongoDB, CouchDB and Redis. Jonathan Ellis (@spyced), the CTO of Datastax, has a post comparing Cassandra with MongoDB, Riak and HBase. He has also uploaded an excellent deck showing the features introduced in Cassandra v1.2.
Data Model in Cassandra
Cassandra has an unusual – and, to be honest, somewhat confusing – physical data model. CQL3 provides a simpler logical data model, and this post is only going to consider the CQL3 data model. The coexistence of two data models leads to some nomenclature confusion, to which I will add by describing the model in terms that users of the Windows Azure Table Service would understand.
The data model comprises keyspaces, tables and columns. Tables have traditionally been called column families, but CQL3 uses the keyword TABLE instead.
A keyspace is similar to a schema in a relational model, in that it provides a container for application tables. Cassandra uses data replication to support high availability and data durability, and data replication is configured at the keyspace level.
A Cassandra table comprises a sequence of rows, each of which contains a set of columns, each of which is a name/value pair with a timestamp. A table has a composite primary key, which serves two functions. The first column in the primary key serves as the partition key and is used to allocate the row to the appropriate node(s). The remaining columns, if any, in the primary key uniquely identify a row for a given partition key. The columns not in the primary key store additional data for a row – and do not all have to be present in a specific row. This is similar to a table in the Windows Azure Table Service, with a partition key and row key, the difference being that the Table Service supports only a single column for the row key.
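To make the composite-key idea concrete, here is a minimal Python sketch (not Cassandra code) of a table keyed by a partition key and a single additional primary-key column; the names sensor_id and reading_time are hypothetical:

```python
# Illustrative model of a CQL3 table with primary key (sensor_id, reading_time):
# sensor_id is the partition key; reading_time completes the primary key.
# All names here are hypothetical, invented for the sketch.
from collections import defaultdict

table = defaultdict(dict)  # partition key -> {rest of primary key -> other columns}

def upsert(sensor_id, reading_time, **columns):
    # Columns outside the primary key need not all be present in every row.
    table[sensor_id].setdefault(reading_time, {}).update(columns)

upsert("sensor-1", "2013-02-01T00:00", temperature=21.5)
upsert("sensor-1", "2013-02-01T00:05", temperature=21.7, humidity=0.4)
upsert("sensor-2", "2013-02-01T00:00", temperature=19.0)

# Two logical rows share the partition key "sensor-1".
print(len(table["sensor-1"]))  # 2
```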
The above describes the logical view of the data as presented through CQL3 and is significantly less complex than the underlying physical data model. The physical data model pivots all the rows with the same partition key into columns of a single row. It is this pivot that leads to the statement that Cassandra supports up to 2 billion columns per (physical) row.
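The pivot can be sketched in a few lines of Python: logical rows sharing a partition key collapse into one physical row whose column names are composites of the remaining primary-key value and the column name (an illustration of the idea, not the actual storage format):

```python
# Sketch of the pivot from logical rows to one physical wide row.
# The row and column names are hypothetical.
logical_rows = [
    ("sensor-1", "00:00", {"temperature": 21.5}),
    ("sensor-1", "00:05", {"temperature": 21.7, "humidity": 0.4}),
]

physical_rows = {}
for partition_key, clustering_key, columns in logical_rows:
    row = physical_rows.setdefault(partition_key, {})
    for name, value in columns.items():
        # Each physical column name composes the clustering key and column name.
        row[(clustering_key, name)] = value

# One physical row now holds all three cell values for "sensor-1".
print(sorted(physical_rows["sensor-1"]))
# [('00:00', 'temperature'), ('00:05', 'humidity'), ('00:05', 'temperature')]
```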
Rows are allocated to nodes based on the hash value of the partition key; a Murmur3 hash is used by default in Cassandra v1.2. Murmur3 is not an order-preserving hash: partition keys are indexed, so a row can be retrieved quickly by partition key, but a range query on the partition key is non-performant. The row key is ordered, allowing performant range queries within a partition. Cassandra also supports secondary indexes on the other columns. Any query not matching an index results in a table scan, which is not advised in a scalable system.
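A quick Python illustration of why an unordered hash rules out partition-key range queries – md5 stands in for Murmur3 here, but any non-order-preserving hash behaves the same way:

```python
# Hashing the partition key scrambles key order, so the rows for a
# contiguous range of partition keys land all over the ring.
import hashlib

def token(key):
    # md5 stands in for Murmur3: deterministic, but not order-preserving.
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

keys = ["key-%03d" % i for i in range(100)]
by_key = sorted(keys)              # the order a range query would want
by_token = sorted(keys, key=token)  # the order the cluster actually uses
print(by_key == by_token)  # False: token order bears no relation to key order
```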
The syntax of CQL3 is very similar to that of SQL, and it provides the obvious DDL and DML statements. Cassandra does not support joins, so queries can be against only a single table.
Cassandra is designed to be a highly scalable database. This scalability is achieved by sharding the data over many compute nodes. Cassandra provides for high availability by using symmetric nodes, with no single node being privileged over the other nodes. Consistency is tunable by requiring that writes and reads be performed on specified numbers of nodes in each request.
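The usual rule of thumb for tunable consistency can be sketched in Python – this is an illustration of the arithmetic, not Cassandra's implementation. With replication factor N, writes acknowledged by W replicas and reads answered by R replicas must overlap on at least one up-to-date replica whenever R + W > N:

```python
# Sketch of the tunable-consistency rule of thumb (not Cassandra code).
# n = replication factor, w = replicas acknowledging a write,
# r = replicas consulted on a read.
def is_strongly_consistent(n, w, r):
    # If r + w > n, every read set intersects every write set,
    # so a read sees at least one replica with the latest write.
    return r + w > n

print(is_strongly_consistent(3, 2, 2))  # True: quorum writes + quorum reads
print(is_strongly_consistent(3, 1, 1))  # False: eventual consistency only
```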
The sharding model in Cassandra uses distributed hash tables based on the technique used in Amazon Dynamo. A Cassandra cluster has an associated hashing algorithm – by default, Murmur3. Each Cassandra node is associated with a range of hash values, referred to as tokens.
In the traditional distribution model – using physical nodes – the nodes are configured as a ring. Each node is associated with a single token, which specifies the upper value of the partition-key hash range for the rows allocated to the node. Essentially, going round the ring, the nodes host successive ranges of rows (with successive referring to the hashed value of the partition key). The token for each node must be calculated prior to node configuration, which complicates the addition of new nodes. High availability is achieved by replicating data to one or more other nodes. The simplest model is to replicate the data to successive nodes in the ring, although other choices are configurable, including some that take account of the physical layout of the nodes in the datacenter.
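The ring lookup can be sketched in Python – the tokens and node names below are hypothetical, and successive-node replication is the simple placement strategy described above:

```python
# Sketch of the classic physical-node ring (not Cassandra code).
# Each node's token is the upper bound of the hash range it owns;
# replicas are placed on the next nodes around the ring.
import bisect

tokens = [100, 200, 300, 400]             # one precalculated token per node
nodes = ["node-a", "node-b", "node-c", "node-d"]

def replicas(key_hash, replication_factor=2):
    # First node whose token >= the hash owns the row; wrap around past the end.
    i = bisect.bisect_left(tokens, key_hash) % len(nodes)
    return [nodes[(i + j) % len(nodes)] for j in range(replication_factor)]

print(replicas(150))  # ['node-b', 'node-c']
print(replicas(450))  # wraps around the ring: ['node-a', 'node-b']
```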
Cassandra v1.2 introduced virtual nodes, in which each node hosts many token ranges – by default, 256. When replication is configured, the data for each token range is distributed to a random set of other nodes. The benefit is that if a node needs to be regenerated, its data can be retrieved from many other nodes rather than a few, as is the case with physical nodes. Virtual nodes also make it much easier to add new nodes, since Cassandra handles the allocation of token ranges.
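A small Python sketch of virtual nodes – the node names are hypothetical, and 256 tokens per node mirrors the v1.2 default:

```python
# Sketch of virtual nodes (not Cassandra code): each node claims many
# random tokens, so its data is spread across many small ranges and
# can be rebuilt from many peers rather than a few.
import random

random.seed(42)  # fixed seed so the sketch is repeatable
ring = []        # (token, node) pairs
for node in ["node-a", "node-b", "node-c"]:
    for _ in range(256):  # 256 token ranges per node, the v1.2 default
        ring.append((random.getrandbits(64), node))
ring.sort()

def owner(key_hash):
    # First token at or above the hash owns it, wrapping around the ring.
    for token, node in ring:
        if token >= key_hash:
            return node
    return ring[0][1]

# Successive ranges now belong to an interleaved mix of nodes.
print([node for _, node in ring[:8]])
```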
Cassandra is a Java-based system that can be deployed into Windows Azure Virtual Machines. It runs on Windows Server VMs, but since Virtual Machines supports various Linux distributions, Linux is the deployment environment of choice. Depending on instance size, each Windows Azure instance can support between 1 and 16 disks, each of which can be up to 1TB in size.
When deploying Cassandra to Windows Azure it is worth doing performance testing to identify the optimal instance and disk size for the anticipated workload. This may involve, for example, using different storage accounts for disks attached to different VMs. The most authoritative comment I’ve seen indicates that this should not be necessary for different disks attached to a single instance. It is worth bearing in mind that Virtual Machines is still in preview, and the performance characteristics may differ when it reaches general availability.