Apache Cassandra is a high-performance, highly scalable, fault-tolerant (i.e., no single point of failure), distributed non-relational database. Cassandra combines ideas from Google Bigtable and Amazon Dynamo to handle database workloads that traditional RDBMSs struggle to support.
Who is Using Cassandra?
Cassandra is in use at Apple (75,000+ nodes), Spotify (3,000+ nodes), eBay, Capital One, Macy's, Bank of America, Netflix, Twitter, Urban Airship, Constant Contact, Reddit, Cisco, OpenX, Rackspace, Ooyala, and many other companies with large, active data sets. The largest known Cassandra cluster has more than 300 TB of data across more than 400 machines (cassandra.apache.org).
RDBMS vs. Cassandra
||Feature||Cassandra||RDBMS
||Atomicity||Success or failure for inserts/deletes in a single partition (one or more rows in a single partition).||Enforced at every scope, at the cost of performance and scalability.
||Sharding||Native share-nothing architecture, inherently partitioned by a configurable strategy.||Often forced when scaling, partitioned by key or function.
||Consistency||No consistency in the ACID sense. Can be tuned to provide more consistency or more availability; the consistency level is configured per request. Since Cassandra is a distributed database, traditional locking and transactions are not possible (there is, however, a concept of lightweight transactions, which should be used very carefully).||Favors consistency over availability; tunable via isolation levels.
||Durability||Writes are durable to a replica node: they are recorded in memory and in the commit log before being acknowledged. In the event of a crash, the commit log is replayed on restart to recover any writes that were lost before the data was flushed to disk.||Typically, data is written to a single master node, sometimes configured with synchronous replication at the cost of performance and cumbersome data restoration.
||Multi-datacenter replication||Native, out-of-the-box capabilities for data replication over lower-bandwidth, higher-latency, less reliable connections.||Typically, only limited long-distance replication to read-only slaves receiving asynchronous updates.
||Security||Coarse-grained and primitive, but authorization, authentication, roles, and data encryption are provided out-of-the-box.||Fine-grained access control to objects.
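The per-request consistency trade-off can be made concrete with a little arithmetic. The following is a minimal sketch, not Cassandra code: the function names are hypothetical, and only a few consistency levels are modeled. It relies on the common rule of thumb that a read is guaranteed to overlap the most recent successful write when the number of replicas read plus the number of replicas written exceeds the replication factor.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that form a majority (the QUORUM level)."""
    return replication_factor // 2 + 1


def replicas_required(level: str, replication_factor: int) -> int:
    """Replicas that must respond for a request at the given consistency level."""
    levels = {
        "ONE": 1,
        "TWO": 2,
        "THREE": 3,
        "QUORUM": quorum(replication_factor),
        "ALL": replication_factor,
    }
    return levels[level]


def reads_see_latest_write(write_level: str, read_level: str,
                           replication_factor: int) -> bool:
    """True when the read and write replica sets must overlap (R + W > RF)."""
    w = replicas_required(write_level, replication_factor)
    r = replicas_required(read_level, replication_factor)
    return r + w > replication_factor
```

With a replication factor of 3, writing and reading at QUORUM touches two replicas each, so the sets overlap; writing and reading at ONE favors availability and latency but gives no such guarantee.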
Data Model Overview
Cassandra has a tabular schema comprising keyspaces, tables, partitions, rows, and columns.
Note that the terminology changed in Cassandra 3.x because of changes in the storage engine: a “column family” is now a table and a “row” is now a partition.
||Keyspace||A collection of tables.
||Table||A set of partitions.
||Partition||A set of rows that share the same partition key.
||Row||An ordered (inside of a partition) set of columns.
||Column||A key/value pair and timestamp.
||Column (Name, Value)
||(Key, Value, Timestamp)
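The nesting of keyspaces, tables, partitions, rows, and columns can be pictured as plain nested data structures. The sketch below is illustrative only and does not reflect how Cassandra stores data internally; all class names, the example keyspace and table, and the microsecond timestamp unit are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Tuple


@dataclass(frozen=True)
class Column:
    """A key/value pair plus the timestamp of the write."""
    name: str
    value: Any
    timestamp: int  # assumed here to be microseconds since the epoch


@dataclass
class Partition:
    """Rows that share one partition key, addressed by clustering key."""
    partition_key: Tuple
    rows: Dict[Tuple, Dict[str, Column]] = field(default_factory=dict)

    def write(self, clustering_key: Tuple, column: Column) -> None:
        # A row is created on first write and holds its columns by name.
        self.rows.setdefault(clustering_key, {})[column.name] = column


@dataclass
class Table:
    """A set of partitions, looked up by partition key."""
    name: str
    partitions: Dict[Tuple, Partition] = field(default_factory=dict)

    def partition(self, partition_key: Tuple) -> Partition:
        return self.partitions.setdefault(partition_key, Partition(partition_key))


@dataclass
class Keyspace:
    """A collection of tables."""
    name: str
    tables: Dict[str, Table] = field(default_factory=dict)


# Hypothetical example data: one keyspace, one table, one partition, one row.
shop = Keyspace("shop")
orders = shop.tables.setdefault("orders", Table("orders"))
orders.partition(("customer-42",)).write(
    ("2024-01-15",), Column("total", 99.5, timestamp=1705312800000000)
)
```

Reading a value back means walking the same path: keyspace, table, partition key, clustering key, column name.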
The keyspace is akin to a database or schema in an RDBMS: it contains a set of tables and is the level at which replication is configured. A keyspace is also the unit for Cassandra's access control mechanism. When access control is enabled, users must authenticate to access and manipulate data in a schema or table.
A table, previously known as a column family, is a map of rows. As in an RDBMS, a table is defined by a primary key. The primary key consists of a partition key and, optionally, clustering columns. The partition key determines data locality in the cluster: all data with the same partition key is stored together on the same node (and on that node's replicas). The clustering columns define how the data is ordered on disk within a partition. The client application provides rows that conform to the schema. Each row has the same fixed subset of columns.
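The roles of the two parts of the primary key can be sketched in a few lines. This is a simplified illustration under stated assumptions: the node names, the column names `sensor_id` and `reading_time`, and the helper functions are hypothetical, and the hash-modulo placement is a stand-in for Cassandra's actual partitioner and token ring, which also handle replication.

```python
import hashlib
from collections import defaultdict

NODES = ["node-a", "node-b", "node-c"]   # hypothetical three-node cluster
PARTITION_KEY = ("sensor_id",)           # decides which node holds the data
CLUSTERING_COLUMNS = ("reading_time",)   # decides order inside a partition


def node_for(partition_key: tuple) -> str:
    """Map a partition key to a node (simplified: hash modulo node count)."""
    digest = hashlib.sha256(repr(partition_key).encode("utf-8")).digest()
    token = int.from_bytes(digest[:8], "big")
    return NODES[token % len(NODES)]


def store(rows):
    """Group rows by node and partition, sorted by the clustering columns."""
    cluster = defaultdict(lambda: defaultdict(list))
    for row in rows:
        pk = tuple(row[c] for c in PARTITION_KEY)
        cluster[node_for(pk)][pk].append(row)
    for partitions in cluster.values():
        for partition_rows in partitions.values():
            partition_rows.sort(
                key=lambda r: tuple(r[c] for c in CLUSTERING_COLUMNS)
            )
    return cluster
```

Every row with the same `sensor_id` lands in one partition on one node, and rows within that partition come back in `reading_time` order regardless of the order in which they were written.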
This is a preview of the Apache Cassandra Refcard.