Cassandra was born out of Facebook's need to store reverse indices of Facebook messages that users send and receive while communicating with their friends. The solution needed to scale incrementally while remaining cost effective. Traditional data storage was not an option, so Facebook created a non-relational solution called "Cassandra". The project was designed by Avinash Lakshman and Prashant Malik. Lakshman was one of the authors of Amazon's Dynamo, another large-scale NoSQL database. In many ways, Cassandra is like the second version of Dynamo, or a marriage of Dynamo and Google's BigTable. Lakshman further describes Cassandra's data model and the distributed properties provided by the system:
- Every row is identified by a unique key. The key is a string and there is no limit on its size.
- An instance of Cassandra has one table which is made up of one or more column families as defined by the user.
- The number of column families and the name of each of the above must be fixed at the time the cluster is started. There is no limitation the number of column families but it is expected that there would be a few of these.
- Each column family can contain one of two structures: supercolumns or columns. Both of these are dynamically created and there is no limit on the number of these that can be stored in a column family.
- Columns are constructs that have a name, a value and a user-defined timestamp associated with them. The number of columns that can be contained in a column family is very large. Columns could be of variable number per key. For instance key K1 could have 1024 columns/super columns while key K2 could have 64 columns/super columns.
- “Supercolumns” are a construct that have a name, and an infinite number of columns assosciated with them. The number of “Supercolumns” associated with any column family could be infinite and of a variable number per key. They exhibit the same characteristics as columns.
Distribution, Replication and Fault Tolerance
- Data is distributed across the nodes in the cluster using Consistent Hashing based and on an Order Preserving Hash function. We use an Order Preserving Hash so that we could perform range scans over the data for analysis at some later point.
- Cluster membership is maintained via Gossip style membership algorithm. Failures of nodes within the cluster are monitored using an Accrual Style Failure Detector.
- High availability is achieved using replication and we actively replicate data across data centers. Since eventual consistency is the mantra of the system reads execute on the closest replica and data is repaired in the background for increased read throughput.
- System exhibits incremental scalability properties which can be achieved as easily as dropping nodes and having them automatically bootstrapped with data.
Think of Cassandra as a large 4 or 5 level associative array. Each dimention of the array has a free index that is based on the keys in that level. The optional 5th level is the Supercolumn, which is where the real power comes from. It can allow a simple key-value architecture to deal with sorted lists based on a specified index. Cassandra has no single points of failure and it is able to scale from one node to several thousand in different data centers. There is no central master, so data can be written to any node in the cluster and read from any other node. Cassandra can be tuned to support more consistency or availability depending on your application. There's also a high availability guarantee where if one node goes down, another one will step in and replace it smoothly.
Insert Speed stress.py benchmark on a quad-core system with 2GB RAM (Cassandra 0.4 vs. 0.5) - Source
In the most recent version of Cassandra has improved concurrency across the board (see above). Released last month, Cassandra 0.5 also adds load balancing and significantly improves bootstrap. 0.5 also features new tools, including JSON-based data import and export, new JMX metrics, and an improved command line interface. Cassandra is one of three Apache projects that have graduated from the incubator this year. The others are Apache Pivot and Apache Subversion.