Cassandra Indexing: The Good, the Bad and the Ugly
The Bad : Partitioning
One of the tough things to get used to at first is that without any indexes queries that span rows can (very) be bad. Thinking back to our storage model however, that isn't surprising. The strategy that Cassandra uses to distribute the rows across hosts is called Partitioning.
Partitioning is the act of carving up the range of rowkeys assigning them into the "token ring", which also assigns responsibility for a segment (i.e. partition) of the rowkey range to each host. You've probably seen this when you initialized your cluster with a "token". The token gives the host a location along the token ring, which assigns responsibility for a section of the token range. Partitioning is the act of mapping the rowkey into the token range.
There are two primary partitioners: Random and Order Preserving. They are appropriately named. The RandomPartitioner hashes the rowkeys into tokens. With the RandomPartitioner, the token is a hash of the rowkey. This does a good job of evenly distributing your data across a set of nodes, but makes querying a range of the rowkey space incredibly difficult. From only a "start rowkey" value and an "end rowkey" value, Cassandra can't determine what range of the token space you need. It essentially needs to perform a "table scan" to answer the query, and a "table scan" in Cassandra is bad because it needs to go to each machine (most likely ALL machines if you have a good hash function) to answer the query.
The Good : Secondary Indexes
Cassandra does provide a native indexing mechanism in Secondary Indexes. Secondary Indexes work off of the columns values. You declare a secondary index on a Column Family. Datastax has good documentation on the usage. Under the hood, Cassandra maintains a "hidden column family" as the index. (See Ed Anuff's presentation for specifics) Since Cassandra doesn't maintain column value information in any one node, and secondary indexes are on columns value (rather than rowkeys), a query still needs to be sent to all nodes. Additionally, secondary indexes are not recommended for high-cardinality sets. I haven't looked yet, but I'm assuming this is because of the data model used within the "hidden column family". If the hidden column family stores a row per unique value (with rowkeys as columns), then it would mean scanning the rows to determine if they are within the range in the query.
From Ed's presentation:
- Not recommended for high cardinality values(i.e.timestamps,birthdates,keywords,etc.)
- Requires at least one equality comparison in a query--not great for less-than/greater-than/range queries
- Unsorted - results are in token order, not query value order
- Limited to search on datatypes, Cassandra natively understands
With all that said, secondary indexes work out of the box and we've had good success using them on simple values.
The Ugly : Do-It-Yourself (DIY) / Wide-Rows
Now, beauty is in the eye of the beholder. One of the beautiful things about NoSQL is the simplicity. The constructs are simple: Keyspaces, Column Families, Rows and Columns. Keeping it simple however means sometimes you need to take things into your own hands.
This is the case with wide-row indexes. Utilizing Cassandra's storage model, its easy to build your own indexes where each row-key becomes a column in the index. This is sometimes hard to get your head around, but lets imagine we have a case whereby we want to select all users in a zip code. The main users column family is keyed on userid, zip code is a column on each user row. We could use secondary indexes, but there are quite a few zip codes. Instead we could maintain a column family with a single row called "idx_zipcode". We could then write columns into this row of the form "zipcode_userid". Since the columns are stored in sorted order, it is fast to query for all columns that start with "18964" (e.g. we could use 18964_ and 18964_ZZZZZZ as start and end values).
One obvious downside of this approach is that rows are self-contained on a host. (again except for replicas) This means that all queries are going to hit a single node. I haven't yet found a good answer for this.