8 Best Practices to Get Started With Apache Cassandra
Apache Cassandra has become one of the most popular NoSQL databases out there. Read on to learn how to go about getting up to speed with this open source DB.
Since its inception in 2008, the open source Apache Cassandra™ database has proven to be unbeatable when you need high availability and scalability, especially for applications that can’t afford to lose data, even when a data center goes down. Top companies like Netflix, eBay, GitHub, and Instagram all rely on Cassandra to power always-on, highly individualized customer experiences. The leading database for hybrid cloud has some of the largest production deployments, including Apple’s, with more than 75,000 nodes storing more than 10 PB of data, and Chinese search engine Easou’s, with 270 nodes, 300 TB, and more than 800 million requests per day.
Cassandra is powerful, and it takes the best of the best to deploy it right. That’s one reason why Cassandra engineers are among the top 5 highest paid tech professionals today.
It does take time and commitment to learn the complexities of Apache Cassandra, but the rewards are worth it. If you’re interested in exploring Cassandra, here are eight steps to get started:
Know Your Access Patterns
To model data correctly, you first need to know the read and write patterns that your system will be servicing. For each query pattern, find out how often the query will occur under normal conditions and under peak load, and understand the expected SLA of each operation.
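One way to make this concrete is to write the access patterns down as data before designing any tables. The following is a minimal sketch, not a prescribed method: the query names, request rates, and SLA figures are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessPattern:
    name: str          # hypothetical label for the query
    kind: str          # "read" or "write"
    normal_qps: float  # requests per second under normal conditions
    peak_qps: float    # requests per second under peak load
    sla_ms: float      # target latency for the operation


def peak_totals(patterns):
    """Sum peak requests per second separately for reads and writes."""
    totals = {"read": 0.0, "write": 0.0}
    for p in patterns:
        totals[p.kind] += p.peak_qps
    return totals


# Illustrative numbers only; replace them with measurements from your system.
patterns = [
    AccessPattern("get_orders_by_customer", "read", 200, 1500, 20),
    AccessPattern("get_order_by_id", "read", 500, 3000, 10),
    AccessPattern("insert_order", "write", 100, 800, 15),
]

totals = peak_totals(patterns)
strictest = min(patterns, key=lambda p: p.sla_ms)
print(totals, strictest.name)
```

A catalog like this gives the data modeling and load testing steps a shared set of numbers to work from.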
Do the Data Modeling
Once you understand the questions that will be asked of your system, you can lay out the answers. Using the information about your query patterns, define the right keys for your tables. This helps you achieve the required latencies and prevents operational issues that result from inefficient data models and large partitions. Note that relational database modeling techniques do not directly translate to Cassandra data modeling. Carefully plan the compaction strategy for each table as well.
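As a rough illustration of choosing keys for a query and checking partition size, the sketch below pairs a hypothetical table definition with a back-of-the-envelope estimate. The keyspace, table, and column names are invented for the example, the 100 MB target is an assumed rule of thumb rather than a hard limit, and the estimate ignores per-row storage overhead.

```python
# Hypothetical CQL for one query: "latest readings for a sensor on a given day".
# The (sensor_id, day) partition key buckets each sensor's data by day.
CREATE_READINGS_BY_SENSOR_DAY = """
CREATE TABLE IF NOT EXISTS metrics.readings_by_sensor_day (
    sensor_id uuid,
    day date,
    reading_time timestamp,
    value double,
    PRIMARY KEY ((sensor_id, day), reading_time)
) WITH CLUSTERING ORDER BY (reading_time DESC);
"""

PARTITION_TARGET_MB = 100  # assumed rule-of-thumb ceiling, not a hard limit


def estimated_partition_mb(rows_per_partition, avg_row_bytes):
    """Crude partition size estimate: rows times average row size, in MB."""
    return rows_per_partition * avg_row_bytes / (1024 * 1024)


# Illustrative workload: 100,000 readings per sensor per day, ~200 bytes each.
readings_per_day = 100_000
avg_row_bytes = 200

per_day_mb = estimated_partition_mb(readings_per_day, avg_row_bytes)
per_month_mb = estimated_partition_mb(readings_per_day * 30, avg_row_bytes)
print(f"day bucket: {per_day_mb:.1f} MB, month bucket: {per_month_mb:.1f} MB")
```

With these assumed numbers, a daily bucket stays under the target while a monthly bucket does not, which is the kind of trade-off the partition key has to settle.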
Beware of Tombstones
There are many advantages to Cassandra’s log-structured merge-tree structure for storing data, but a big disadvantage of this structure is tombstones. You need to know what causes tombstones (not just deletes!) and how various compaction strategies deal with tombstones. It’s important to understand the impact that tombstones have on read latencies and to have a plan for purging unneeded data to avoid them.
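To see why tombstones exist at all, it can help to look at a toy model of a log-structured store. The sketch below is a deliberately simplified simulation, not Cassandra's implementation: the `TinyLsm` class and its methods are hypothetical, timestamps are plain integers, and compaction always merges every segment. It shows that a delete is a new write, that reads must still consider the marker, and that compaction only drops the marker once a grace period has passed. In Cassandra itself, expiring TTLs and writing explicit nulls are other well-known sources of tombstones.

```python
TOMBSTONE = object()  # marker written in place of a value when a key is deleted


class TinyLsm:
    """Toy log-structured store: immutable segments plus a merge step."""

    def __init__(self, gc_grace_seconds):
        self.gc_grace_seconds = gc_grace_seconds
        self.segments = []  # each segment maps key -> (write_time, value)

    def flush(self, writes):
        """Persist a batch of writes as a new immutable segment."""
        self.segments.append(dict(writes))

    def read(self, key):
        """Check every segment and return the newest value, or None if deleted."""
        newest = None
        for segment in self.segments:
            cell = segment.get(key)
            if cell is not None and (newest is None or cell[0] >= newest[0]):
                newest = cell
        if newest is None or newest[1] is TOMBSTONE:
            return None
        return newest[1]

    def tombstone_count(self):
        return sum(
            1 for seg in self.segments for _, value in seg.values()
            if value is TOMBSTONE
        )

    def compact(self, now):
        """Merge all segments; drop tombstones older than the grace period."""
        merged = {}
        for segment in self.segments:
            for key, cell in segment.items():
                if key not in merged or cell[0] >= merged[key][0]:
                    merged[key] = cell
        self.segments = [{
            key: cell for key, cell in merged.items()
            if not (cell[1] is TOMBSTONE
                    and now - cell[0] >= self.gc_grace_seconds)
        }]


db = TinyLsm(gc_grace_seconds=10)
db.flush({"a": (1, "x"), "b": (1, "y")})
db.flush({"a": (2, TOMBSTONE)})  # deleting "a" adds data instead of removing it
count_after_delete = db.tombstone_count()
db.compact(now=5)                # still inside the grace period
count_after_early_compaction = db.tombstone_count()
db.compact(now=12)               # grace period has elapsed
count_after_late_compaction = db.tombstone_count()
```

Until the late compaction runs, every read of the deleted key still has to process the marker, which is where the read latency cost comes from.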
Plan for and Practice Operations
Distributed databases are not simple to operate, but having a plan with well-documented and well-understood operational procedures can make it less complex. It’s a good idea to practice common operational tasks such as backing up and restoring nodes, adding or removing nodes from a cluster, repairing, and rebuilding a data center. You also want to incorporate security features early on, to ensure appropriate protection and prevent surprise impacts to your applications. Consider installing and using a program that will monitor your cluster and notify the right teams when an event occurs that needs intervention. Don’t forget you’ll need a plan for applying product upgrades.
Conduct Performance Testing
Before going from development to production, you’ll want to perform appropriate and realistic load testing. Set up a pre-production environment that matches the production specs, and create a load that matches your expected read and write volumes. Run this load test for several days to make sure that compaction and other background processes can keep up. Use this load test to establish baselines, and store those baseline metrics so that you have a record. This way, when you adjust settings or compare performance under a heavier load, you’ll have the ability to compare outputs to baselines. Test your happy path, but also see what your applications do while substantial operational activities are running — be sure to include administrative activities like bootstrapping new nodes or running ongoing maintenance activities.
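Storing and comparing baselines can be as simple as keeping a few latency percentiles per test run. The sketch below assumes you already collect per-request latencies in milliseconds from your load tool; the function names, the sample data, and the 10% tolerance are illustrative choices, not recommendations.

```python
import json
import statistics


def summarize(latencies_ms):
    """Reduce raw latencies to the percentiles tracked as a baseline."""
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def regressions(baseline, current, tolerance=0.10):
    """Return metrics that are worse than baseline by more than `tolerance`."""
    return {
        name: (baseline[name], current[name])
        for name in baseline
        if current[name] > baseline[name] * (1 + tolerance)
    }


# Made-up samples standing in for two load test runs.
baseline = summarize([float(ms) for ms in range(1, 101)])
slower_run = summarize([ms * 1.5 for ms in range(1, 101)])

# Serialize the baseline so it can be kept as a record of the run.
stored = json.dumps(baseline)
restored = json.loads(stored)

print(regressions(restored, slower_run))
```

Comparing against a stored record rather than memory makes it easier to tell whether a settings change or a heavier load actually moved the numbers.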
Choose Your Drivers
Drivers are an integral part of the application stack. There is a lot of intelligence and power built into the various drivers, so be sure to choose the appropriate ones. Choose the right load balancing policy, connection pooling settings, and retry strategies. Use prepared statements and asynchronous queries for better performance. Mark statements as idempotent when you can. It’s crucial to understand how each setting impacts the way your application behaves in normal conditions and in failure conditions.
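The reason idempotence matters is that a driver can only safely retry a request when running it twice has the same effect as running it once. The sketch below models that rule in plain Python without any real driver: `Statement`, `TransientTimeout`, and `execute_with_retry` are hypothetical stand-ins, and actual drivers expose their own retry policies and idempotence flags.

```python
from dataclasses import dataclass


class TransientTimeout(Exception):
    """Stand-in for a timeout where the outcome of the request is unknown."""


@dataclass
class Statement:
    cql: str
    is_idempotent: bool = False  # conservative default: do not assume safety


def execute_with_retry(send, statement, max_attempts=3):
    """Retry on timeout only when the statement is marked idempotent."""
    attempts = max_attempts if statement.is_idempotent else 1
    for attempt in range(1, attempts + 1):
        try:
            return send(statement)
        except TransientTimeout:
            if attempt == attempts:
                raise


def make_flaky_send(failures):
    """Build a fake transport that times out `failures` times, then succeeds."""
    state = {"calls": 0}

    def send(statement):
        state["calls"] += 1
        if state["calls"] <= failures:
            raise TransientTimeout(statement.cql)
        return "ok"

    return send, state


# Setting a column to a fixed value is safe to repeat.
safe = Statement("UPDATE users SET email = ? WHERE id = ?", is_idempotent=True)
send, state = make_flaky_send(failures=2)
result = execute_with_retry(send, safe)

# Incrementing a counter is not; a blind retry could apply it twice.
unsafe = Statement("UPDATE page_hits SET hits = hits + 1 WHERE page = ?")
send_unsafe, unsafe_state = make_flaky_send(failures=2)
try:
    execute_with_retry(send_unsafe, unsafe)
    unsafe_raised = False
except TransientTimeout:
    unsafe_raised = True
```

In this model the idempotent update succeeds on the third attempt, while the counter increment surfaces the timeout to the application after a single attempt.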
Use an Automated Orchestration Framework for Cluster Management
Plan to manage your cluster using an automated orchestration framework. There are several available that can do this job well, and you'll thank yourself for putting in the work to set up and get comfortable using these automated tools.
Seek Out Specialized Training
Apache Cassandra is well worth learning, and there are many open source resources available to help you get up to speed and improve your Cassandra skillset. There are Slack channels available, and Stack Overflow is a great place to get answers to questions. Leading Cassandra development companies like DataStax offer a free Academy as well as in-person training.
Getting started in Cassandra is a great way to uplevel your skillset and position yourself for a top-paying tech career. With a will to learn and dedication, you can master Apache Cassandra engineering and help take it to the next level.
Want to get a jump-start on your Cassandra career? Attend DataStax Accelerate, the world’s premier Apache Cassandra™ conference to learn from your peers, industry experts, and community leaders about how you can transform modern enterprise applications on any cloud at scale.