This blog is the second in a series of posts about enterprise Cassandra deployment. In this installment, I will share best practices for a successful Cassandra deployment in a public cloud such as Amazon Web Services (AWS) or Google Cloud Platform (GCP). I will also discuss how you can protect and recover the data in your Cassandra cluster when it is deployed in the public cloud.
Cassandra Deployments in the Public Cloud
There are two main ways to run a successful enterprise Cassandra deployment in the public cloud:
- One option is to use a ‘managed’ Cassandra deployment. This is a reasonable approach, the only downside being the inability to customize data placement beyond what is typically provided by the vendor.
- If you want to have full control over your Cassandra deployment and customize it to scale and perform according to your business needs, managing the cluster yourself is a better option.
Please refer to the documentation from DataStax or Apache Cassandra for detailed best practices for Cassandra deployments. The following best practices are reiterated here to simplify deployment of a scalable Cassandra cluster in AWS or GCP.
Both AWS and GCP offer two types of storage for their compute instances. For AWS, the options are Elastic Block Store (EBS) and ephemeral (instance store) storage. In GCP, the options are Persistent Disks and Local SSD. Each option has its own tradeoffs. Pick the one that best suits your deployment.
EBS volumes have performance characteristics similar to network storage such as a NAS or SAN. Because EBS storage is multitenant, its performance can vary, which makes it a poor default for distributed, log-structured databases like Cassandra. If you want reliable performance from EBS volumes, use “EBS Optimized instances” with “Provisioned IOPS”. I/O to EBS volumes travels over the network, and EBS optimization gives that traffic dedicated bandwidth. This makes performance more predictable, but does not by itself make it faster. To get the most performance out of EBS volumes, attach multiple volumes to each instance and separate the volumes used for “Data” and “Commitlog”.
If you have to use EBS volumes, go with “EBS Optimized Instances” and “Provisioned IOPS”, and put Data and Commit Log on different volumes.
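As a rough sketch of this layout, the snippet below builds the request parameters for two Provisioned IOPS volumes (one for data, one for the commit log) and the matching `cassandra.yaml` settings. The availability zone, sizes, IOPS figures, tag key and mount points are placeholder assumptions, not sizing recommendations. The dictionaries are shaped for boto3's `ec2.create_volume` call, which is deliberately not invoked here.

```python
def provisioned_iops_volume(az, size_gib, iops, role):
    """Build keyword arguments for boto3's ec2.create_volume (not called here)."""
    return {
        "AvailabilityZone": az,
        "Size": size_gib,        # size in GiB
        "VolumeType": "io1",     # Provisioned IOPS SSD
        "Iops": iops,
        "TagSpecifications": [
            {
                "ResourceType": "volume",
                # "cassandra-role" is a hypothetical tag key used for illustration
                "Tags": [{"Key": "cassandra-role", "Value": role}],
            }
        ],
    }


def cassandra_yaml_storage(data_mount, commitlog_mount):
    """Render the cassandra.yaml settings that point at the two mount points."""
    return (
        "data_file_directories:\n"
        f"    - {data_mount}\n"
        f"commitlog_directory: {commitlog_mount}\n"
    )


# Placeholder values: adjust the zone, sizes and IOPS to your own workload.
data_volume = provisioned_iops_volume("us-east-1a", 500, 5000, "data")
commitlog_volume = provisioned_iops_volume("us-east-1a", 100, 2000, "commitlog")

yaml_fragment = cassandra_yaml_storage(
    "/var/lib/cassandra/data", "/var/lib/cassandra/commitlog"
)
print(yaml_fragment)
```

With credentials configured, each dictionary could be passed as `ec2.create_volume(**data_volume)`. The volumes must be created in the same availability zone as the instance they will be attached to.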
Ephemeral storage is a great option for non-production Cassandra environments such as test/dev and staging. Since ephemeral volumes are typically SSDs, there is no need to put the Data and Commitlog on separate drives. To optimize performance, choose EC2 instances with multiple ephemeral storage drives and stripe them together into one logical volume. Send both Data and Commitlog to the logical volume. Remember that when you stop or terminate the EC2 instance, the data in ephemeral storage is lost.
Stripe all Ephemeral drives into a single volume to get maximum performance.
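One common way to stripe drives on Linux is a RAID 0 array built with `mdadm`. The sketch below only assembles the command line. The device names and the `/dev/md0` target are assumptions, since the actual names depend on the instance type. Running the command requires root and destroys any data on the listed devices.

```python
import shlex


def raid0_command(devices, md_device="/dev/md0"):
    """Return the mdadm argument list that stripes `devices` into one RAID 0 array."""
    if len(devices) < 2:
        raise ValueError("striping needs at least two devices")
    return [
        "mdadm", "--create", md_device,
        "--level=0",                          # RAID 0: striping, no redundancy
        f"--raid-devices={len(devices)}",
        *devices,
    ]


# Hypothetical device names: list the real ones with `lsblk` on the instance.
stripe_cmd = raid0_command(["/dev/nvme1n1", "/dev/nvme2n1"])
print(shlex.join(stripe_cmd))
```

After the array exists, it still needs a filesystem and a mount point that the Cassandra data and commit log directories live on. The argument list can be handed to `subprocess.run(stripe_cmd, check=True)` once the device names have been verified.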
Cassandra uses a “Snitch” to determine the topology of the cluster, that is, the “Datacenter” and “Rack” that each node belongs to. The “Replication Strategy” then uses this topology to determine the placement of data across the different nodes in the cluster.
Amazon Web Services (AWS)
- Ec2Snitch works well for single region deployments.
- Ec2MultiRegionSnitch works well for clusters distributed across multiple AWS regions. Note that a multi-DC, active-active configuration makes your cluster resilient to region failures. For multi-DC clusters, use the “local quorum” consistency level to improve performance.
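The performance benefit of local quorum comes from simple arithmetic: a quorum is a majority of the replicas being counted. The sketch below compares the cluster-wide quorum with the local one for a hypothetical two-region cluster. The data center names and replication factors are assumptions chosen for illustration.

```python
def quorum(replicas):
    """Majority of `replicas`: floor(replicas / 2) + 1."""
    return replicas // 2 + 1


# Hypothetical replication factors per data center (one data center per region).
replication = {"us-east": 3, "us-west": 3}

# QUORUM counts replicas in every data center, so some acknowledgements
# have to cross regions.
cluster_quorum = quorum(sum(replication.values()))

# LOCAL_QUORUM counts only the replicas in the coordinator's data center.
local_quorum = quorum(replication["us-east"])

print(cluster_quorum)  # 4
print(local_quorum)    # 2
```

In this example a cluster-wide quorum needs four acknowledgements, at least one of which must come from the remote region, while a local quorum needs two, all from the local region.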
Google Cloud Platform (GCP)
- Use “GoogleCloudSnitch” for Cassandra deployments in GCP
- GoogleCloudSnitch treats regions as Cassandra data centers and availability zones as racks
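To make the relationship between the snitch and the replication strategy concrete, here is a minimal sketch that picks a snitch name for `cassandra.yaml` and builds a `NetworkTopologyStrategy` keyspace definition. The helper names, the keyspace name and the data center names are hypothetical. The data center names in the keyspace must match whatever your snitch actually reports (for example, as shown by `nodetool status`), so treat the values below as placeholders.

```python
def snitch_for(cloud, multi_region=False):
    """Pick the endpoint_snitch value following the guidance above."""
    if cloud == "aws":
        return "Ec2MultiRegionSnitch" if multi_region else "Ec2Snitch"
    if cloud == "gcp":
        return "GoogleCloudSnitch"
    raise ValueError(f"unsupported cloud: {cloud}")


def create_keyspace_cql(keyspace, replicas_per_dc):
    """Build a CREATE KEYSPACE statement with one replication factor per data center."""
    dcs = ", ".join(f"'{dc}': {rf}" for dc, rf in replicas_per_dc.items())
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {keyspace} WITH replication = "
        f"{{'class': 'NetworkTopologyStrategy', {dcs}}};"
    )


# The cassandra.yaml line for a multi-region AWS cluster.
yaml_line = f"endpoint_snitch: {snitch_for('aws', multi_region=True)}"

# Placeholder keyspace and data center names.
keyspace_cql = create_keyspace_cql("orders", {"us-east": 3, "us-west": 3})

print(yaml_line)
print(keyspace_cql)
```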
Both AWS and GCP offer a series of operational capabilities that appear to make it easier to maintain high uptime on an Apache Cassandra cluster. These capabilities have to be chosen carefully depending on your deployment needs.
- Load Balancing: A network-level load balancing service is offered by both AWS (Elastic Load Balancer) and GCP. It is not recommended to put a load balancer in front of Cassandra. It is better to let the client side handle this. Consider open source options such as Astyanax (a Cassandra client library) or Presto (a SQL query engine with a Cassandra connector).
- Autoscaling: Both AWS and GCP offer an “Autoscaling” facility for compute instances. Although this functionality looks very enticing, putting all C* nodes in a single autoscaling group will not produce the desired result. Instead, create a separate autoscaling group of ‘1 node’ for every node in your cluster. This ensures that any time a node goes down, it is automatically brought back up, that the cluster's data placement is not disturbed when nodes go down, and that nodes coming up do not cause excessive data movement.
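The one-group-per-node pattern can be sketched as follows for AWS. The function builds one set of parameters per Cassandra node, each with minimum, maximum and desired capacity fixed at 1. The dictionaries are shaped for boto3's `autoscaling.create_auto_scaling_group` call, which is not invoked here. The cluster name, launch template name and subnet IDs are hypothetical, and pinning each group to a single subnet (so a replacement comes up in the same availability zone) is an assumption of this sketch rather than something prescribed above.

```python
def single_node_groups(cluster_name, nodes, launch_template):
    """Build one auto scaling group definition per node.

    `nodes` is a list of (node_name, subnet_id) pairs.
    """
    return [
        {
            "AutoScalingGroupName": f"{cluster_name}-{node_name}",
            "LaunchTemplate": {
                "LaunchTemplateName": launch_template,
                "Version": "$Latest",
            },
            # Exactly one instance: the group only replaces a failed node,
            # it never grows or shrinks the cluster.
            "MinSize": 1,
            "MaxSize": 1,
            "DesiredCapacity": 1,
            "VPCZoneIdentifier": subnet_id,
        }
        for node_name, subnet_id in nodes
    ]


# Hypothetical node names and subnet IDs.
groups = single_node_groups(
    "cassandra-prod",
    [("node-1", "subnet-aaa"), ("node-2", "subnet-bbb"), ("node-3", "subnet-ccc")],
    launch_template="cassandra-node",
)
for group in groups:
    print(group["AutoScalingGroupName"], group["VPCZoneIdentifier"])
```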
Scalable Backups and Reliable Point-in-Time Recovery
Backing up your data is a must-have for enterprise organizations. The impact of data loss is so high that no enterprise wants to risk relying on replication alone. Protecting data in a public cloud environment may seem daunting, but there are several options available that can make this easier compared to an on-premises deployment. The simplest-looking option is the cloud provider's native volume snapshots:
I have seen a few customers use “snapshots” of EBS volumes as a way to protect the data in their cluster against logical and/or operational errors. This solution will work, but with some severe limitations. On top of the overhead of maintaining the scripts to take snapshots, there is a large overhead in storage costs. Volume snapshots are taken below the database and are not aware of Cassandra's data, so they do not give you application-level incremental backups: every replica is captured, along with every block rewritten since the previous snapshot. This causes rapid growth in the amount of data stored in snapshots, leading to a high cost for the backup solution.
Amazon S3 or Google Cloud Storage
Using an S3 bucket as the secondary storage enables enterprises to use low cost, high throughput storage, while reducing overall operating costs for storing backup data. Here are a few recommendations when using S3:
- S3 Region: Amazon S3 buckets are region specific. A bucket created in one region will provide better performance to other infrastructure in the same region. Consequently, it is important to ensure the S3 bucket used as the secondary storage is in the same region as the production database being protected and the Datos IO cluster.
- Storage Class and Life Cycle Policy: Amazon S3 allows users to create lifecycle policies on buckets. These policies can move objects to other storage classes, which changes the performance and response time characteristics of objects in the bucket. STANDARD_IA is more suitable for longer-lived, less frequently accessed data, and is not suitable as secondary storage. Additionally, do not turn on any lifecycle policies on the buckets used for the secondary storage. All buckets should be in the “STANDARD” storage class.
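These recommendations can be turned into a simple pre-flight check. The sketch below is a hypothetical helper that takes values you would already have fetched from S3 (for instance through boto3's `get_bucket_location`, `get_bucket_lifecycle_configuration` and `list_objects_v2`) and reports anything that conflicts with the guidance above. It makes no network calls, and the function name and message wording are assumptions of this example.

```python
def check_backup_bucket(bucket_region, cluster_region, lifecycle_rules, storage_classes):
    """Return a list of problems with a bucket intended as secondary storage.

    bucket_region:   LocationConstraint of the bucket (S3 reports None for us-east-1)
    cluster_region:  region of the production Cassandra cluster
    lifecycle_rules: lifecycle rule dicts, each with a "Status" key
    storage_classes: storage class of each sampled object in the bucket
    """
    problems = []

    region = bucket_region or "us-east-1"
    if region != cluster_region:
        problems.append(f"bucket is in {region}, cluster is in {cluster_region}")

    enabled = [rule for rule in lifecycle_rules if rule.get("Status") == "Enabled"]
    if enabled:
        problems.append(f"{len(enabled)} lifecycle rule(s) enabled")

    non_standard = sorted(set(storage_classes) - {"STANDARD"})
    if non_standard:
        problems.append("objects outside STANDARD: " + ", ".join(non_standard))

    return problems


# A bucket that follows the recommendations, and one that does not.
good = check_backup_bucket("us-west-2", "us-west-2", [], ["STANDARD", "STANDARD"])
bad = check_backup_bucket(
    None, "us-west-2", [{"ID": "archive", "Status": "Enabled"}], ["STANDARD_IA"]
)
print(good)
print(bad)
```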
As enterprises leverage the agility provided by public cloud environments, it’s important to understand the various facets of Cassandra deployments to ensure the efficient use of your money and time. In this article, we looked at various best practices to improve the operational efficiency of Cassandra deployments in a public cloud (Amazon or Google Cloud).