Cassandra is a popular next-generation (NoSQL) database system that powers the backends of high-performance, web-scale enterprise applications. It is database software for the cloud, enabling organizations to run the growing number of cloud applications that require data distribution across datacenters and clouds.
While Cassandra offers high availability, there is significant room for improvement when it comes to meeting data protection requirements. Using Datos IO RecoverX software, application owners and database admins can create space-efficient, cluster-consistent backups of Cassandra databases. Creating such a cluster-consistent and space-efficient backup of a distributed database is a very challenging task. In this blog post, I will highlight why deduplication matters for Cassandra, why existing deduplication solutions fall short, and how deduplication can be achieved for Cassandra backups.
Introduction to Cassandra
You might be wondering: "Aren't existing deduplication solutions enough to save space for Cassandra snapshot files?" After all, there are multiple deduplication solutions that can eliminate redundant data at different levels of the storage stack. Why do we need a new one for Cassandra cluster backups?
Let's begin with some background information: what is Cassandra, and why is deduplication important for a Cassandra backup?
Cassandra is a distributed database that is becoming increasingly popular with the emergence of big data applications such as SaaS, IoT, and real-time analytics. These applications prioritize high availability and scalability over consistency. Accordingly, Cassandra supports eventual consistency rather than the strict consistency provided by traditional database systems such as Oracle, MySQL, and IBM DB2. "Eventually consistent" means that consistency will be achieved eventually rather than immediately. As the CAP theorem states, no distributed system can provide all three of the following properties at once: Consistency, Availability, and Partition tolerance. In short, Cassandra is an eventually consistent database that provides high performance, high availability, and high scalability, but not strong consistency.
Replication and its Role in Deduplication
One of the most important mechanisms of distributed scale-out database systems like Cassandra is data replication. By replicating the same data on different nodes across failure boundaries, distributed database systems can continue to service application requests even when a certain number of nodes fail. The downside is the performance overhead of maintaining multiple data copies: both write and read operations are slower, because writes must create multiple copies and reads must check consistency among them. Although asynchronous data replication can be used to minimize the performance overhead of writes, it also lowers the level of guaranteed consistency, which is what we call eventual consistency.
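Replica placement can be sketched with a toy consistent-hashing ring. This is a simplified model, not Cassandra's actual partitioner; the node names and hash function are illustrative only:

```python
import hashlib

def token(key: str) -> int:
    # Toy token: hash the key to a position on the ring.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def replicas(key: str, nodes: list, rf: int = 3) -> list:
    # Sort nodes by their ring position, find the first node whose
    # token is >= the key's token, then walk clockwise for rf replicas.
    ring = sorted(nodes, key=token)
    positions = [token(n) for n in ring]
    t = token(key)
    start = next((i for i, p in enumerate(positions) if p >= t), 0)
    return [ring[(start + i) % len(ring)] for i in range(rf)]

nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]
owners = replicas("user:42", nodes, rf=3)
# Three distinct nodes now hold copies of the same row, so the
# cluster survives the loss of any one of them.
```

With a replication factor of 3, every row lands on three distinct nodes, which is exactly the redundancy that keeps the cluster available through node failures.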
As I just explained, replication plays a very important role in a distributed database system, and therefore, we should not remove the redundancy from a live Cassandra cluster to save storage space.
The situation is different when we think about backup files (secondary data) from a Cassandra cluster. Like any other database system in an enterprise organization, Cassandra needs backups: not because Cassandra is unreliable or insufficiently available, but primarily because people make mistakes ("fat fingers") and enterprise applications sometimes have to keep the history of their databases. As they say, to err is human!
Cassandra has a convenient node-level snapshot feature, which can persist the complete state of an individual node to snapshot files. One very important point is that a Cassandra snapshot is a per-node operation; it guarantees nothing about cluster-wide state, as shown in the figure below.
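The per-node nature is visible in how a cluster snapshot must be taken: the `nodetool snapshot` command has to be run on every node individually, as there is no single cluster-wide snapshot command. A minimal sketch follows; the host list and SSH invocation are hypothetical (in practice this could be SSH, an agent, or JMX), and the actual call is left commented out:

```python
import subprocess

def snapshot_cluster(hosts, tag):
    """Trigger a node-level snapshot on every host. Each snapshot is
    independent, so consistency across nodes is not guaranteed."""
    cmds = []
    for host in hosts:
        # Hypothetical remote invocation; nodetool snapshot itself
        # only ever snapshots the local node it runs against.
        cmd = ["ssh", host, "nodetool", "snapshot", "-t", tag]
        cmds.append(cmd)
        # subprocess.run(cmd, check=True)  # uncomment against a real cluster
    return cmds

cmds = snapshot_cluster(["10.0.0.1", "10.0.0.2", "10.0.0.3"], "backup-tag-1")
```

Because each node snapshots itself at a slightly different moment, the resulting files capture slightly different cluster states, which is why cluster consistency has to be reconstructed at backup time.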
Backing up a Cassandra Cluster
To create a backup of a Cassandra cluster, we have to trigger a snapshot operation on each node, collect the created snapshot files, and then declare the set of collected snapshot files a backup. In this backup, the replicated data exists as is, so the backup will be N times larger than the user data, where N is the replication factor. Replication plays an important role in a live Cassandra cluster, providing high availability and scalability; but what use is replication in a backup? If we upload the backup files to an object store such as S3 or Swift, the already-replicated data will be replicated again by the object store, which maintains its own copies for reliability and availability.
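The space overhead is easy to quantify with back-of-the-envelope arithmetic. Assuming a replication factor of 3 in Cassandra and 3x replication inside the object store (both typical defaults, and ignoring compression and metadata), the multiplication looks like this:

```python
def naive_backup_size(user_data_gb, cassandra_rf=3, object_store_rf=3):
    # Each row appears cassandra_rf times in the snapshot set, and the
    # object store then replicates every uploaded byte again.
    return user_data_gb * cassandra_rf * object_store_rf

def deduped_backup_size(user_data_gb, object_store_rf=3):
    # Removing Cassandra's replicas before upload leaves one logical copy.
    return user_data_gb * object_store_rf

print(naive_backup_size(100))    # 900 GB stored for 100 GB of user data
print(deduped_backup_size(100))  # 300 GB after removing Cassandra replicas
```

Under these assumptions, deduplicating before upload cuts the stored footprint by the full Cassandra replication factor, 3x in this example.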
In summary, there are redundant data copies in Cassandra backup files (secondary data). If we can eliminate the redundant copies, we can save massive storage space for Cassandra backups without sacrificing retention periods. Saving storage space directly translates to saving big bucks in the operational cost of maintaining and running a Big Data system across an organization.
But let's return to the question of whether existing deduplication solutions would work for Cassandra backup files. You can test this by collecting Cassandra database files and placing them in a deduplication system. Will the solution reduce storage consumption? No! Existing deduplication solutions won't work for Cassandra data files, for the following two reasons.
First, Cassandra has a masterless, peer-to-peer architecture. Each node receives a different set of rows, so no two nodes in a cluster are identical, which means their data files will look different, as shown in the following figure. Because different combinations of rows are stored in Cassandra data files, data files from different nodes will hardly ever be identical, even at the chunk level.
Second, Cassandra data files are compressed in 64 KB chunks, regardless of row boundaries. If you are familiar with deduplication algorithms, you can easily see why Cassandra data files cannot be easily deduplicated: fixed-length chunk-based deduplication fails because of chunk alignment, and variable-length chunk-based deduplication fails because of compression.
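The alignment problem can be demonstrated in a few lines. Two files that hold the same rows in a different order, as replicas on different nodes would, share almost none of their fixed-size chunks. This is an illustrative toy (tiny 8-byte chunks, uncompressed data); real SSTables use compressed 64 KB blocks, which makes matters strictly worse:

```python
import hashlib

def fixed_chunks(data: bytes, size: int = 8):
    # Split into fixed-size chunks and hash each one, as a
    # fixed-length chunk deduplicator would.
    return {hashlib.sha1(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)}

rows = [b"row-%03d" % i for i in range(100)]
node_a = b"".join(rows)            # rows in one order
node_b = b"".join(reversed(rows))  # the same rows, different order

shared = fixed_chunks(node_a) & fixed_chunks(node_b)
# Identical row data, yet almost no chunk hashes match, because the
# 8-byte chunk grid cuts the 7-byte rows at different offsets in
# each file; chunk-level deduplication finds almost nothing to share.
```

Variable-length (content-defined) chunking would resynchronize on the shared row content here, but once each file is independently compressed in 64 KB blocks, the byte streams diverge and that approach breaks down as well.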
On top of that, Cassandra's compaction and small record sizes are additional reasons why existing block- or file-level deduplication solutions will not work for Cassandra backup files. Compaction is an independent per-node operation that merges multiple data files into a new data file, and small records can hardly be deduplicated with larger chunk sizes.
The existing file/block level deduplication solutions will not work for Cassandra backup files because:
- Each Cassandra data file contains a different set of records.
- Cassandra data files are compressed regardless of the row boundary.
- Cassandra runs compactions independently.
- Cassandra record sizes can be smaller than the deduplication chunk size.
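All four obstacles point the same way: the redundancy lives at the record level, so that is where deduplication has to happen. The sketch below deduplicates parsed rows across nodes by primary key, keeping the version with the latest timestamp. This is a simplified illustration only; RecoverX's actual algorithm is not public, and the `(key, timestamp, value)` row format here is invented:

```python
def dedupe_rows(node_files):
    """node_files: one list of rows per node, where each row is a
    (primary_key, timestamp, value) tuple."""
    latest = {}
    for rows in node_files:
        for key, ts, value in rows:
            # Keep only the newest version of each primary key,
            # collapsing the RF replicas into a single logical copy.
            if key not in latest or ts > latest[key][0]:
                latest[key] = (ts, value)
    return latest

# Three replicas of the same two rows (RF = 3), one replica stale.
node1 = [("k1", 10, "a"), ("k2", 20, "b")]
node2 = [("k1", 10, "a"), ("k2", 20, "b")]
node3 = [("k1", 10, "a"), ("k2", 15, "old")]
backup = dedupe_rows([node1, node2, node3])
# Two logical rows remain instead of six physical copies.
```

Working on parsed rows sidesteps chunk alignment and compression entirely, and picking the latest timestamp per key also irons out the stale replicas that per-node snapshots inevitably capture.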