Refcard #153

Apache Cassandra

A Fault-Tolerant, Massively Scalable NoSQL Database

Distributed non-relational database Cassandra combines the benefits of major NoSQL databases to support data management needs not covered by traditional RDBMS vendors and is used at some of the most well-known, global organizations. This Refcard covers data modeling, Cassandra architecture, replication strategies, querying and indexing, libraries across eight languages, and more.

Published: Aug. 14, 2020    |    Modified: Sep. 02, 2020

Brought to you by

DataStax

Written by

Milan Milosevic, Lead Data Engineer, SmartCat
Brian O'Neill, Architect, Iron Mountain

Table of Contents

Introduction

Who is Using Cassandra?

Data Model Overview

Section 1

Introduction

Apache Cassandra is a high-performance, highly scalable, fault-tolerant (i.e., no single point of failure), distributed non-relational database. It combines the data model of Google Bigtable with the distributed architecture of Amazon Dynamo to handle database management needs that traditional RDBMS products cannot support.

Section 2

Who is Using Cassandra?

Cassandra is in use at Apple (75,000+ nodes), Spotify (3,000+ nodes), eBay, Capital One, Macy's, Bank of America, Netflix, Twitter, Urban Airship, Constant Contact, Reddit, Cisco, OpenX, Rackspace, Ooyala, and more companies that have large active data sets. The largest known Cassandra cluster has more than 300 TB of data across more than 400 machines (cassandra.apache.org).

RDBMS vs. Cassandra


| | Cassandra | RDBMS |
|---|---|---|
| Atomicity | Success or failure for inserts/deletes in a single partition (one or more rows in a single partition). | Enforced at every scope, at the cost of performance and scalability. |
| Sharding | Native share-nothing architecture, inherently partitioned by a configurable strategy. | Often forced when scaling, partitioned by key or function. |
| Consistency | No consistency in the ACID sense. Can be tuned to provide more consistency or more availability; the consistency level is configured per request. Since Cassandra is a distributed database, traditional locking and transactions are not possible (there is, however, a concept of lightweight transactions that should be used very carefully). | Favors consistency over availability; tunable via isolation levels. |
| Durability | Writes are durable to a replica node, being recorded in memory and in the commit log before they are acknowledged. In the event of a crash, the commit log is replayed on restart to recover any writes that had not yet been flushed to disk. | Typically, data is written to a single master node, sometimes configured with synchronous replication at the cost of performance and cumbersome data restoration. |
| Multi-Datacenter Replication | Native and out-of-the-box capabilities for data replication over lower-bandwidth, higher-latency, less reliable connections. | Typically, only limited long-distance replication to read-only slaves receiving asynchronous updates. |
| Security | Coarse-grained and primitive, but authorization, authentication, roles, and data encryption are provided out-of-the-box. | Fine-grained access control to objects. |
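
The per-request consistency tuning in the table can be illustrated with the usual overlap rule: a read is guaranteed to see the latest acknowledged write when the number of replicas read plus the number of replicas written exceeds the replication factor. The sketch below is plain Python arithmetic, not driver code; the function names are hypothetical and chosen only for this illustration.

```python
def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def reads_see_latest_write(read_replicas: int, write_replicas: int,
                           replication_factor: int) -> bool:
    """True when every read set must overlap every write set (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor


rf = 3
q = quorum(rf)
print(q)                                   # 2
print(reads_see_latest_write(q, q, rf))    # True: quorum reads + quorum writes
print(reads_see_latest_write(1, 1, rf))    # False: favors availability and latency
```

Reading and writing at quorum trades some availability for consistency; reading and writing a single replica does the opposite.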
Section 3

Data Model Overview

Cassandra has a tabular schema comprising keyspaces, tables, partitions, rows, and columns.

Note that the terminology changed in Cassandra 3.x along with the storage engine: a “column family” is now a table, and a “row” is now a partition.


| | Definition | RDBMS Analogy | Object Equivalent |
|---|---|---|---|
| Schema/Keyspace | A collection of tables. | Schema/Database | Set |
| Table/Column Family | A set of partitions. | Table | Map |
| Partition | A set of rows that share the same partition key. | N/A | |
| Row | An ordered (inside of a partition) set of columns. | Row | OrderedMap |
| Column | A key/value pair and timestamp. | Column (Name, Value) | (Key, Value, Timestamp) |
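
The "Object Equivalent" column can be read as a set of nested containers. The following is a minimal sketch in plain Python, assuming dictionaries stand in for the maps and a sort on the clustering key stands in for on-disk ordering. The table name, keys, values, and timestamps are made up for illustration and do not come from Cassandra itself.

```python
from typing import Any, Dict, List, NamedTuple, Tuple


class Column(NamedTuple):
    """A column: a value plus the timestamp of the write that produced it."""
    value: Any
    timestamp: int


Row = Dict[str, Column]            # column name -> column
Partition = Dict[Tuple, Row]       # clustering key -> row
Table = Dict[Tuple, Partition]     # partition key -> partition
Keyspace = Dict[str, Table]        # table name -> table

# Hypothetical data: one table, one partition, two rows.
keyspace: Keyspace = {
    "readings": {
        ("sensor-1",): {
            ("2020-08-14",): {"temp": Column(21.5, timestamp=2)},
            ("2020-08-13",): {"temp": Column(20.1, timestamp=1)},
        },
    },
}


def rows_in_clustering_order(partition: Partition) -> List[Tuple[Tuple, Row]]:
    """Return the rows of one partition sorted by clustering key."""
    return sorted(partition.items(), key=lambda item: item[0])


partition = keyspace["readings"][("sensor-1",)]
for clustering_key, row in rows_in_clustering_order(partition):
    print(clustering_key, row["temp"].value)
# ('2020-08-13',) 20.1
# ('2020-08-14',) 21.5
```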

Schema

The keyspace is akin to a database or schema in RDBMS, contains a set of tables, and is used for replication. A keyspace is also the unit for Cassandra's access control mechanism. When enabled, users must authenticate to access and manipulate data in a schema or table.
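
Because replication is configured on the keyspace, a keyspace definition usually carries its replication settings. The helper below is a hypothetical convenience that only builds the CQL text; the keyspace name and replication factor are example values. `SimpleStrategy` is used here for brevity and is generally suited to single-datacenter or test setups, while `NetworkTopologyStrategy` is the usual choice for multi-datacenter clusters.

```python
def create_keyspace_cql(name: str, replication_factor: int) -> str:
    """Build a CREATE KEYSPACE statement using SimpleStrategy replication."""
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {name} "
        f"WITH replication = {{'class': 'SimpleStrategy', "
        f"'replication_factor': {replication_factor}}};"
    )


print(create_keyspace_cql("demo", 3))
# CREATE KEYSPACE IF NOT EXISTS demo WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
```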

Table

A table, previously known as a column family, is a map of rows. As in an RDBMS, a table is defined by a primary key. The primary key consists of a partition key and clustering columns. The partition key defines data locality in the cluster: all data with the same partition key is stored together on a single node. The clustering columns define how the data is ordered on disk within a partition. The client application provides rows that conform to the schema. Each row has the same fixed subset of columns.
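
The two roles of the primary key can be sketched in a few lines: the partition key picks the node, and the clustering key orders rows inside the partition. This is a simplified model, not how Cassandra is implemented. It assumes a fixed list of hypothetical node names, uses MD5 with a modulo as a stand-in for the real partitioner (Murmur3 with token ranges by default), and ignores replication.

```python
import hashlib
from collections import defaultdict

NODES = ["node-a", "node-b", "node-c"]  # hypothetical node names


def node_for(partition_key: str) -> str:
    """Map a partition key to one node by hashing it (simplified placement)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    token = int.from_bytes(digest[:8], "big")
    return NODES[token % len(NODES)]


def store(cluster, partition_key, clustering_key, row):
    """Place a row in its partition on the node chosen by the partition key."""
    partition = cluster[node_for(partition_key)].setdefault(partition_key, {})
    partition[clustering_key] = row


def read_partition(cluster, partition_key):
    """Return one partition's rows sorted by clustering key."""
    partition = cluster[node_for(partition_key)].get(partition_key, {})
    return [(key, partition[key]) for key in sorted(partition)]


cluster = defaultdict(dict)  # node -> {partition key -> {clustering key -> row}}
store(cluster, "sensor-1", "2020-08-14", {"temp": 21.5})
store(cluster, "sensor-1", "2020-08-13", {"temp": 20.1})

for clustering_key, row in read_partition(cluster, "sensor-1"):
    print(clustering_key, row)
# 2020-08-13 {'temp': 20.1}
# 2020-08-14 {'temp': 21.5}
```

Both rows share the partition key `sensor-1`, so they land on the same node and come back in clustering order regardless of insertion order.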


This is a preview of the Apache Cassandra Refcard. To read the entire Refcard, please download the PDF from the link above.
