Over the past year, I’ve had the opportunity to talk to a number of enterprises deploying Cassandra databases in production. One of the most remarkable things I’ve found is that many organizations do not have a comprehensive backup and recovery strategy. Note that this is not because organizations do not think they need a backup for their data; in fact, it’s far from it! The reason is simply that no enterprise-ready backup and recovery solution exists today. Because of this, application owners are forced to mitigate the risk of data loss through various stop-gap techniques and solutions. It is critical for enterprises to understand the flaws and limitations of such solutions so they can accurately assess their risk exposure. In this blog post, we will look at some of these techniques and identify their pros and cons.
With a single "datacenter", you cannot rely on data replication to protect your database. All the nodes are in a single failure zone, so you are not protected against any disaster scenarios. Replication is only useful in providing availability. If there is an operational or logical error in your cluster, data replication will duly propagate that error across your cluster, making things worse very fast.
There is some merit to using a Multi-DC deployments to protect against disaster scenarios (DR), but not against logical or operational errors. The following schematic shows a typical Multi-Datacenter scenario used as an active backup for Cassandra.
DC01 and DC02 are set up in two different failure zones. The cluster uses LOCAL_QUORUM for consistency and "Network Topology" as the replication strategy. Applications connect to DC01 during regular operation. DC02 is set up as a passive datacenter in a different failure zone. Since applications are using LOCAL_QUORUM, there will not be an impact of replicating data across to the second datacenter. When the application experiences increased request failures in DC01, it will move to DC02. DC02 is used as an active backup, and applications can be moved there in case of node failures in DC01.
In such deployments, applications are protected against disaster scenarios if a majority of nodes were to go down, but there is no protection against logical errors, errors with bad data ingestion or human errors. Any such errors will get propagated into DC02 and eventually take the entire application down. For example, if a table is deleted due to an operational error or if an application ingests corrupted data into one of the keyspaces (on DC01), those errors will propagate to the second datacenter (DC02) and will become irrecoverable. For scenarios like this, a comprehensive backup and recovery strategy is required: one that provides point in time recovery of data in the database.
Error Resilient Data Model
There a non-trivial segment of customers who run mission critical services on their Cassandra databases, relying on large batch data ingest into their cluster at regular intervals. For such use cases, customers could modify their data model to be resilient to ingest errors. It takes a few hours to validate the last dataset that was ingested. If a high rate of errors was found in the most recent ingest, that data can be reverted if the data model is set up appropriately.
Cassandra counters can be maintained for every ingest and update. Applications can identify all the data associated with a particular counter value (greater than a set value) and then delete the rows if required. This can protect you against bad data ingest, but at a high-cost, including fidelity of the cluster. Counters in Cassandra are notorious for their performance and can slow down the application significantly.
Another alternative used is to leverage the timestamp value to identify all the rows that are part of a particular ingest. This has a serious limitation in that the timestamp value has to be part of the clustering key. Even if this is agreeable, this solution causes a huge overhead on regular operation of the database.
Regardless of the two techniques above, the data is only protected against ingest errors. There is no protection from operational or human errors (80% of all failures are due to these!) where data is deleted or corrupted across multiple data ingests.
Node Level Snapshots
A large group of customers rely on “hand rolled” scripts to backup the data in their production Cassandra databases. These types of solutions are what I would call high maintenance bandaids for sunny days! The problems with such solutions show up when there are any types of issues with the source clusters which is when the backups become relevant.
The Achilles heel of any node level solution is the lack of failure resiliency. Imagine one node becomes unresponsive during a backup window, a node level snapshot based solution will lose be unable to complete a backup or worse, create the backup with missing data. It is difficult and time-consuming to “repair” such backups and recover from them if the need arises.
Another key problem with a node level backup solution for Cassandra is the lack of cluster consistency of the data. When the backup is inconsistent, recovery operation has to go through a long and arduous repair process which takes hours to days. Going through such time-consuming process during a recovery causes prohibitive application downtime and lost revenue.
In this post, we looked at the limitations of various stop-gap solutions used today for backing up data in a Cassandra cluster. Data Replication with Multi-Datacenter deployments only protect against disaster recovery scenarios. Resilient data models protect against certain data ingest errors but not against operational errors. Node level snapshots provide an incomplete backup solution at best -that is not failure resilient; would not scale with the source cluster and uses storage inefficiently. Given these limitations, relying on node level snapshots will severely limit your recovery time and increase the risk exposure to data loss.
A comprehensive backup and recovery strategy for Cassandra (and similarly distributed databases) is critical when running mission-critical applications. Such a solution needs to have the following key characteristics:
- Optimized for faster recovery time to reduce application downtime in case of an error scenario
- Be resilient to failure scenarios and continue to take backups even when there are a manageable number of failures.
- Achieve minimal disruption to the regular operations and performance of the Cassandra cluster, in terms of performance, data model fidelity and application programming ease.
At Datos IO, we are building an enterprise-ready scalable and reliable backup and recovery solution that can encompass all the above characteristics and can become the basis of your comprehensive backup and recovery strategy across next-generation distributed databases.