Cassandra provides simple primitives. Its simplicity allows it to scale linearly with continuous, high availability and very little performance degradation. That simplicity allows for extremely fast read and write operations for specific keys, but servicing more sophisticated queries that span keys requires pre-planning. Using the primitives that Cassandra provides, you can construct indexes that support exactly the query patterns of your application. Note, however, that queries may not perform well without properly designing your schema.
Secondary Indexes
To satisfy simple query patterns, Cassandra provides a native indexing capability called secondary indexes. A column family may have multiple secondary indexes. A secondary index is hash-based and uses specific columns to provide a reverse lookup mechanism from a specific column value to the relevant row keys. Under the hood, Cassandra maintains hidden column families that store the index. The strength of secondary indexes is allowing queries by value.
Secondary indexes are built in the background automatically without blocking reads or writes. To create a secondary index using CQL is straightforward. For example, you can define a table of data about movie fans, then create a secondary index of states where they live:
CREATE TABLE fans ( watcherID uuid, favorite_actor
text, address text, zip int, state text PRIMARY KEY
(watcherID) );
CREATE INDEX watcher_state ON fans (state);
Hot tip: Try to avoid indexes whenever possible. It is (almost) always a better idea to denormalize data and create a separate table that satisfies a particular query than it is to create an index.
Range Queries
It is important to consider partitioning when designing your schema to support range queries.
Range Queries with Order Preservation
Since order is preserved, order preserving partitioners better support range queries across a range of rows. Cassandra only needs to retrieve data from the subset of nodes responsible for that range. For example, if we are querying against a column family keyed by phone number and we want to find all phone numbers between that begin with 215-555, we could create a range query with start key 215-555-0000 and end key 215-555-9999.
To service this request with OrderPreservingPartitioning, it’s possible for Cassandra to compute the two relevant tokens: token(215-555-0000) and token(215-555-9999). Then satisfying that querying simply means consulting nodes responsible for that token range and retrieving the rows/tokens in that range.
Hot tip: Try to avoid queries with multiple partitions whenever possible. The data should be partitioned based on the access patterns, so it is a good idea to group the data in a single partition (or several) if such queries exist. If you have too many range queries that cannot be satisfied by looking into several partitions, you may want to rethink whether Cassandra is the best solution for your use case.
Range Queries with Random Partitioning
The RandomPartitioner provides no guarantees of any kind between keys and tokens. In fact, ideally row keys are distributed around the token ring evenly. Thus, the corresponding tokens for a start key and end key are not useful when trying to retrieve the relevant rows from tokens in the ring with the RandomPartitioner. Consequently, Cassandra must consult all nodes to retrieve the result. Fortunately, there are well-known design patterns to accommodate range queries. These are described next.
Index Patterns
There are a few design patterns to implement indexes. Each services different query patterns. The patterns leverage the fact that Cassandra columns are always stored in sorted order and all columns for a single row reside on a single host.
Inverted Indexes
First, let’s consider the inverted index pattern. In an inverted index, columns in one row become row keys in another. Consider the following dataset, in which users IDs are row keys:
Partition Key |
Rows/Columns |
BONE42 |
{ name : “Brian”} |
{ zip: 15283} |
{dob : 09/19/1982} |
LKEL76 |
{ name : “Lisa”} |
{ zip: 98612} |
{dob : 07/23/1993} |
COW89 |
{ name : “Dennis”} |
{ zip: 98612} |
{dob : 12/25/2004} |
Without indexes, searching for users in a specific zip code would mean scanning our Users column family row by row to find the users in the relevant zip code. Obviously, this does not perform well. To remedy the situation, we can create a table that represents the query we want to perform, inverting rows and columns. This would result in the following table:
Partition Key |
Rows/Columns |
98612 |
{ user_id : LKEL76 } |
|
|
|
{ user_id : COW89 } |
|
|
15283 |
{ user_id : BONE42 } |
|
|
Since each partition is stored on a single machine, Cassandra can quickly return all user IDs within a single zip code by returning all rows within a single partition. Cassandra simply goes to a single host based on partition key (zip code) and returns the contents of that single partition.
{{ parent.title || parent.header.title}}
{{ parent.tldr }}
{{ parent.linkDescription }}
{{ parent.urlSource.name }}