Refcard #153

Apache Cassandra

A Fault-Tolerant, Massively Scalable NoSQL Database

Covers data model, architecture, partitioning, strategies, indexes, libraries in various languages, and more.

Brought to you by

DataStax

Written by

Milan Milosevic, Lead Data Engineer, SmartCat
Brian O'Neill, Architect, Iron Mountain
Section 1

Apache Cassandra

This is a preview of the Apache Cassandra Refcard. To read the entire Refcard, please download the PDF from the link above.

Introduction

Apache Cassandra is a high-performance, extremely scalable, fault-tolerant (i.e., no single point of failure), distributed non-relational database solution. Cassandra combines all the benefits of Google Bigtable and Amazon Dynamo to handle the types of database management needs that traditional RDBMS vendors cannot support. DataStax is the leading worldwide commercial provider of Cassandra products, services, support, and training.

Who is Using Cassandra?

Cassandra is in use at Apple (75,000+ nodes), Spotify (3,000+ nodes), eBay, Capital One, Macy's, Bank of America, Netflix, Twitter, Urban Airship, Constant Contact, Reddit, Cisco, OpenX, Rackspace, Ooyala, and more companies that have large active data sets. The largest known Cassandra cluster has more than 300 TB of data across more than 400 machines (cassandra.apache.org).

RDBMS vs. Cassandra

| | Cassandra | RDBMS |
| --- | --- | --- |
| Atomicity | Success or failure for inserts/deletes in a single partition (one or more rows in a single partition). | Enforced at every scope, at the cost of performance and scalability. |
| Sharding | Native shared-nothing architecture, inherently partitioned by a configurable strategy. | Often forced when scaling, partitioned by key or function. |
| Consistency | No consistency in the ACID sense. Can be tuned to provide more consistency or more availability, and the consistency level is configured per request. Since Cassandra is a distributed database, traditional locking and transactions are not possible (there is, however, a concept of lightweight transactions, which should be used very carefully). | Favors consistency over availability; tunable via isolation levels. |
| Durability | Writes are durable to a replica node, being recorded in memory and the commit log before they are acknowledged. In the event of a crash, the commit log replays on restart to recover any writes that were lost before data was flushed to disk. | Typically, data is written to a single master node, sometimes configured with synchronous replication at the cost of performance and cumbersome data restoration. |
| Multi-Datacenter Replication | Native and out-of-the-box capabilities for data replication over lower-bandwidth, higher-latency, less reliable connections. | Typically, only limited long-distance replication to read-only slaves receiving asynchronous updates. |
| Security | Coarse-grained and primitive, but authorization, authentication, roles, and data encryption are provided out-of-the-box. | Fine-grained access control to objects. |
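
The per-request tuning described in the Consistency row can be illustrated with simple replica arithmetic. The sketch below is not driver code, and the function names are hypothetical. It assumes the commonly cited rule that a read is guaranteed to overlap the most recent acknowledged write when the number of replicas read plus the number of replicas that acknowledged the write exceeds the replication factor.

```python
def quorum(replication_factor: int) -> int:
    """Replicas needed for a quorum: a strict majority of all replicas."""
    return replication_factor // 2 + 1


def reads_see_latest_write(read_replicas: int, write_replicas: int,
                           replication_factor: int) -> bool:
    """True when every read set must overlap every write set (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor


if __name__ == "__main__":
    rf = 3  # assumed replication factor for this example
    q = quorum(rf)
    # Quorum reads combined with quorum writes always overlap.
    print(q, reads_see_latest_write(q, q, rf))
    # Reading and writing a single replica favors availability instead.
    print(reads_see_latest_write(1, 1, rf))
```

Choosing smaller read and write sets trades the overlap guarantee for lower latency and higher availability, which is the tuning the table refers to.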

Data Model Overview

Cassandra has a tabular schema comprising keyspaces, tables, partitions, rows, and columns.

Note that since Cassandra 3.x, the terminology has changed due to changes in the storage engine: a “column family” is now a table and a “row” is now a partition.

| | Definition | RDBMS Analogy | Object Equivalent |
| --- | --- | --- | --- |
| Schema/Keyspace | A collection of tables. | Schema/Database | Set |
| Table/Column Family | A set of partitions. | Table | Map |
| Partition | A set of rows that share the same partition key. | N/A | |
| Row | An ordered (inside of a partition) set of columns. | Row | OrderedMap |
| Column | A key/value pair and timestamp. | Column (Name, Value) | (Key, Value, Timestamp) |
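
The "Object Equivalent" column can be read as a set of nested maps. The following is a loose in-memory analogy only, not how the storage engine lays out data; the `Column` type, the `put` helper, and the sample keys and values are all hypothetical.

```python
from collections import OrderedDict
from typing import Any, NamedTuple


class Column(NamedTuple):
    """A (key, value, timestamp) triple, as in the Column entry above."""
    key: str
    value: Any
    timestamp: int


def put(table: dict, partition_key: str, row_key: str, column: Column) -> None:
    """Store a column; rows stay sorted by row_key inside their partition."""
    partition = table.setdefault(partition_key, OrderedDict())
    row = partition.setdefault(row_key, OrderedDict())
    row[column.key] = column
    # Re-sort so iteration order follows the row key, not insertion order.
    for key in sorted(partition):
        partition.move_to_end(key)


keyspace = {"readings": {}}    # a keyspace holds tables
table = keyspace["readings"]   # a table maps partition key -> partition
put(table, "sensor-1", "2024-01-02", Column("temp", 21.5, 2))
put(table, "sensor-1", "2024-01-01", Column("temp", 20.0, 1))
put(table, "sensor-2", "2024-01-01", Column("temp", 18.3, 3))

# Rows sharing a partition key live together and iterate in key order.
print(list(table["sensor-1"]))
```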

Schema

The keyspace is akin to a database or schema in an RDBMS: it contains a set of tables and is the level at which replication is configured. A keyspace is also the unit for Cassandra's access control mechanism. When access control is enabled, users must authenticate to access and manipulate data in a schema or table.
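
As a rough illustration of replication being set at the keyspace level, the sketch below builds a CQL `CREATE KEYSPACE` statement as a string. The helper name, the keyspace name `sensors`, and the datacenter names `dc1` and `dc2` are assumptions made for the example, and the helper does no escaping or validation, so it is not suitable for untrusted input.

```python
from typing import Mapping


def create_keyspace_cql(name: str, replicas_per_dc: Mapping[str, int]) -> str:
    """Build a CREATE KEYSPACE statement using NetworkTopologyStrategy."""
    entries = ["'class': 'NetworkTopologyStrategy'"]
    # One entry per datacenter: how many replicas of each partition it keeps.
    entries += [f"'{dc}': {count}" for dc, count in replicas_per_dc.items()]
    options = ", ".join(entries)
    return f"CREATE KEYSPACE IF NOT EXISTS {name} WITH replication = {{{options}}};"


if __name__ == "__main__":
    # Hypothetical keyspace replicated three times in dc1 and twice in dc2.
    print(create_keyspace_cql("sensors", {"dc1": 3, "dc2": 2}))
```

Every table created in such a keyspace inherits its replication settings, which is why the keyspace is the natural boundary for both replication and access control.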
