NoSQL and Cassandra in Plain English
Cassandra has been deployed at scale by companies like Netflix, eBay, Spotify, and Apple. Come learn why it's been so successful and also look at its drawbacks.
Cassandra is a highly scalable distributed database originally built at Facebook. It's now a top-level Apache open-source project with widespread adoption and is deployed at scale by Netflix, eBay, Spotify, and Apple.
Its main strengths are:
- Ease of use: Accessed via a subset of SQL (CQL), it supports select, insert, update, and delete operations against a table and column structure familiar to many database developers.
- Built for volume and velocity: A Netflix benchmark test demonstrated around 2.8 billion operations per hour using a mixed read/write workload. It managed an average write latency of just 1.75 milliseconds and an average throughput of 200,000 writes per second against a total of 70 terabytes of data.
- Built for scalability on-premises or the cloud: While you can build your own in-house cluster, Cassandra is well supported by the Amazon Elastic Compute platform. The entire Netflix benchmark ran on 285 nodes on the EC2 platform, costing around $400 per hour — then had no cost once the operation completed.
- Transparent node balancing: Data is automatically sharded (distributed) across nodes in the cluster for resilience. As demand grows and new nodes are added, the system transparently distributes the data to the new servers. On a cloud-based deployment, this could potentially be automated, providing elastic scalability.
- Highly fault tolerant: Many NoSQL databases (e.g., HBase) use a "master" node for write operations with an optional failover capability for recovery. Cassandra works in a "ring" topology with no single point of failure. Writes are automatically replicated to three (or more) nodes in a cluster, and data can be replicated to another data center in real time, providing built-in disaster recovery.
- Hadoop integration: Cassandra works well with Hadoop and HDFS and is accessible via Apache Spark, with support for multiple languages including Java, Python, Ruby, and C++.
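To give a flavour of the CQL mentioned above, the sketch below defines a simple table and runs the familiar DML operations. The keyspace, table, and column names are illustrative, not taken from any benchmark.

```sql
-- Keyspace with replication factor 3: each write is copied to three nodes
CREATE KEYSPACE IF NOT EXISTS demo
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

CREATE TABLE IF NOT EXISTS demo.users (
  user_id uuid PRIMARY KEY,   -- partition key: rows are sharded on this
  name    text,
  email   text
);

INSERT INTO demo.users (user_id, name, email)
VALUES (uuid(), 'Jane', 'jane@example.com');

UPDATE demo.users SET email = 'jane@work.com'
WHERE user_id = 123e4567-e89b-12d3-a456-426655440000;

-- Reads are fast primary-key lookups
SELECT name, email FROM demo.users
WHERE user_id = 123e4567-e89b-12d3-a456-426655440000;
```

Anyone comfortable with SQL can read this immediately; the differences only surface in what CQL deliberately leaves out (joins, arbitrary sorting), covered below.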
The diagram above illustrates a globally distributed system whereby users gain millisecond performance on a cluster of machines at each location, while results are automatically replicated globally. Using Amazon EC2, if the Singapore-based system crashed, the workload could automatically resume using London and New York-based servers at a reduced speed.
This benefit was illustrated in April 2011 when a major US-East outage on Amazon left many websites down, but Netflix operations running Cassandra were unaffected because of cross-regional replication in place.
Why Avoid Cassandra?
Like any specialized, highly tuned tool, Cassandra is built to solve a specific problem: high-volume, low-latency read/write lookups from a massive user base. It does, however, have significant limitations:
- No transaction support: An RDBMS like Oracle provides full read consistency and transaction isolation, which is absent from nearly all NoSQL databases. On Cassandra, every write operation is immediately visible to all users, which has massive implications for many applications. Caveat emptor.
- Joins not supported: Cassandra is not a relational database, and join operations are not supported. It does support simple parent-child relationships using compound keys, but database design must be focused on query access paths, using extensive data denormalization to maximize performance.
- Not suitable for OLAP: Designed for fast primary key lookups, it's not suitable for data warehouse-type queries, which aggregate data over millions of rows. While it supports secondary indexes, they're not recommended for high cardinality (unique) values, as performance degrades rapidly. This compares poorly to a traditional RDBMS, which often handles a mixed workload with ease.
- Limited sort operations: Most read operations focus on a single row, and there's no option to sort results based upon an arbitrary column. It's possible to store and retrieve data in a predefined key sequence, but again, this is inflexible compared to a typical RDBMS.
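The parent-child pattern and predefined key sequence mentioned above can be sketched with a compound primary key: the partition key groups related rows, and the clustering column fixes the sort order within a partition. Table and column names here are hypothetical.

```sql
-- One partition per sensor; readings stored pre-sorted by time
CREATE TABLE IF NOT EXISTS demo.readings (
  sensor_id  text,
  reading_ts timestamp,
  value      double,
  PRIMARY KEY (sensor_id, reading_ts)    -- compound key
) WITH CLUSTERING ORDER BY (reading_ts DESC);

-- Supported: a range scan within one partition, in the predefined order
SELECT reading_ts, value FROM demo.readings
WHERE sensor_id = 'sensor-42'
  AND reading_ts > '2018-01-01';

-- Not supported: joins, or ORDER BY on an arbitrary non-clustering column
```

This is what "database design focused on query access paths" means in practice: the table is shaped around the one query it must serve, and a second access path usually means a second, denormalized table.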
Cassandra and Eventual Consistency
An Oracle database supports immediate consistency, as there's only one copy of the data. However, on a distributed database, the data is replicated to multiple independently running nodes, and to maximize throughput, we may relax the consistency rule and settle for eventual consistency.
The diagram below illustrates a situation where an update is applied to one node and eventually written to the two other replicas. In the interim (maybe a few microseconds), a reader can potentially read the old (stale) copy of the data. Eventually, all three replicas will have consistent results — hence the name.
While not the best solution in every case, the Netflix benchmark demonstrates the benefits of relaxing the consistency rule, as read throughput increased from 600,000 to nearly one million reads per second. This clearly illustrates the trade-off of performance against consistency and resilience. If you don't need the guaranteed absolute latest entry, it's a significant performance gain.
Cassandra is also highly flexible and supports 11 levels of consistency. The most commonly used are:
- ONE — eventual consistency: For maximum performance on reads, this returns the first available version from the three (or more) replicas. On writes, the operation is complete once written to one node, which provides write throughput at the expense of resilience.
- QUORUM — mid-ground: A balance of performance and resilience, this marks a write operation complete as soon as a majority (two of the three) replicas are written and combines data from multiple replicas on read, taking the latest data available.
- ALL — maximum resilience: The slowest option, this favors resilience over performance, as a write must be written to all replicas before it's marked complete.
Note: If an operation is written using ALL consistency, it can be safely read using ONE, which maximizes read throughput over writes. This provides huge flexibility (and an opportunity for hard-to-identify bugs), but it's a powerful option to have available.
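In the cqlsh shell, the consistency level is set per session with the CONSISTENCY command (client drivers typically expose it per statement instead). A minimal sketch, reusing the hypothetical table from earlier:

```sql
-- cqlsh: apply QUORUM to subsequent reads and writes
CONSISTENCY QUORUM;
SELECT * FROM demo.users
WHERE user_id = 123e4567-e89b-12d3-a456-426655440000;

-- Write at ALL, then read safely at ONE
CONSISTENCY ALL;
INSERT INTO demo.users (user_id, name) VALUES (uuid(), 'Sam');
CONSISTENCY ONE;
SELECT * FROM demo.users
WHERE user_id = 123e4567-e89b-12d3-a456-426655440000;
```

Because the level is chosen per operation rather than per database, two pieces of code touching the same table can make different trade-offs, which is exactly where the hard-to-identify bugs come from.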
Warning: Transactions Not Supported
Although Cassandra has flexible support for consistency, this must not be confused with "read consistency" or support for transactions — a feature often taken for granted on traditional database platforms.
Effectively, on Cassandra, every write operation is implicitly committed and immediately visible to all readers without delay.
Take a simple example: an application that transfers money from a savings account to a current account would typically involve two operations committed in a single transaction (one to deduct from the savings account and another to credit the current account). In Cassandra, another user could read the balance of both accounts midway through with potentially catastrophic results.
Cassandra does support a lightweight compare-and-set operation to conditionally update a value, and the ability to batch multiple statements, but this provides, at most, a rather basic level of lightweight transaction support.
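The compare-and-set ("lightweight transaction") and batch features look like this in CQL. The account table is hypothetical, and note that a logged batch guarantees only that both statements eventually apply; it provides no isolation, so a reader can still observe it half-applied.

```sql
-- Compare-and-set: debit only if the balance still holds the expected value
UPDATE demo.accounts SET balance = 400
WHERE account_id = 'savings-1'
IF balance = 500;          -- returns [applied] = false if the value changed

-- Conditional insert
INSERT INTO demo.accounts (account_id, balance)
VALUES ('current-1', 0) IF NOT EXISTS;

-- Logged batch: both updates will eventually apply, with no isolation
BEGIN BATCH
  UPDATE demo.accounts SET balance = 400 WHERE account_id = 'savings-1';
  UPDATE demo.accounts SET balance = 100 WHERE account_id = 'current-1';
APPLY BATCH;
```

In the money-transfer example above, the batch prevents losing one leg of the transfer after a crash, but it does not stop another user reading the accounts midway through.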
In conclusion, if you have a web or sensor streaming-based application with a massive or highly unpredictable workload needing very low latency read/write operations using mainly unique keys, Cassandra is an excellent option. If you can deploy a cloud-based solution and need potentially massive scalability, all the better.
If, however, you have a data warehouse workload summarizing billions of rows, a database like Vertica might be a better option. Likewise, if you need full transactional support, you should look elsewhere — perhaps at VoltDB.
There are over 150 NoSQL databases available, many of which support massive throughput and scalability. However, where Cassandra is unique is in providing tuneable consistency along with a high level of resilience and native replication across a potentially globally distributed system.
Provided you can work with the lack of transaction support and eventual consistency, it's a potentially powerful tool to deploy.
Thanks for reading this article. If you found this helpful, you can view more articles on Big Data, Cloud Computing, Database Architecture, and the future of data warehousing on my website, www.Analytics.Today.
Published at DZone with permission of John Ryan , DZone MVB. See the original article here.