Working With Cassandra Databases
Learn about the differences between SQL and NoSQL databases, how Cassandra works and its best practices, and what the alternatives to Cassandra are.
Not long ago, the SQL database was the standard way of organizing data. Things have since changed with the appearance of numerous NoSQL databases that store data in various ways, such as key-value pairs or XML/JSON documents.
SQL Normalization vs. NoSQL Denormalization
Before getting into this topic, the first thing to know is that normalization refers to having well-organized data, which is usually done using relationships.
Let's suppose that we want to store user information for an online video platform. A single user can have one or more videos, so we can design the SQL tables as follows: a user table holding each user's details, and a video table with a `user_id` field that references user records by their primary key.
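As a sketch of what this normalized design could look like (the exact table and column names are assumptions, since the original schema is not shown), here is a minimal SQLite version:

```python
import sqlite3

# In-memory database; table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user (
    user_id   INTEGER PRIMARY KEY,
    user_name TEXT,
    country   TEXT,
    email     TEXT
);
CREATE TABLE video (
    video_id INTEGER PRIMARY KEY,
    title    TEXT,
    user_id  INTEGER REFERENCES user(user_id)  -- foreign key back to user
);
""")
conn.execute("INSERT INTO user VALUES (1, 'Dorian Jonas', 'Germany', 'email@example.com')")
conn.execute("INSERT INTO video VALUES (2341234, 'Cassandra introduction', 1)")

# A join reassembles user info for each video; the user row exists only once.
row = conn.execute("""
    SELECT v.title, u.user_name
    FROM video v JOIN user u ON u.user_id = v.user_id
""").fetchone()
print(row)  # ('Cassandra introduction', 'Dorian Jonas')
```

Because the user's details live in a single row, updating them touches one record, no matter how many videos reference it.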
We can also use the normalization technique in NoSQL, designing tables similar to the one above. However, this is not always feasible for reasons that will become evident below. We may decide to denormalize our document and repeat user information for every video.
The goal of normalization is to minimize data redundancy and prevent duplication; in this case, it keeps each user's details from being repeated on every video. This way, we can update just the user table without altering the video one.
| video_id | video_name | video_description | posted_timestamp | user_name | country | email |
|---|---|---|---|---|---|---|
| 2341234 | CassandraView | Cassandra introduction | 1415463675 | Dorian Jonas | Germany | email@example.com |
This leads to faster queries, but updating the user information in multiple records will be much slower.
Nowadays, since data storage is much cheaper than data transfer latency, we can duplicate data as much as we want to obtain fast reads and minimize the extra processing done after querying the database. Keep in mind, though, that this applies to very large datasets; usually, when the database footprint reaches 500GB, it is recommended to consider a NoSQL database.
Cassandra Best Practices
So first of all, what is the Apache Cassandra database? Cassandra is a distributed NoSQL database that runs as a cluster of equal nodes, with no single master. Being a distributed system, Cassandra uses gossip, a peer-to-peer protocol, to let each node learn about the state of the others.
Additionally, data is distributed across multiple nodes in a cluster. Each node is responsible for a range of primary keys, determined by their hash values. For example, keys with hash values from 0 to 3 live on node 1, those from 4 to 7 on node 2, and so on. This not only makes it easy to know where to find a piece of data but also enables quick access and spreads the data evenly across the cluster. That is another reason why it is so important to know how to model the data tables.
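The hash-range placement described above can be sketched in a few lines of Python. This is a simplification: real Cassandra assigns Murmur3 token ranges, usually with many virtual nodes per machine; the CRC32 hash, ring size, and node count here are illustrative assumptions.

```python
import zlib

NUM_NODES = 4
RING_SIZE = 16  # hash space 0..15, split into equal ranges per node

def node_for_key(key: str) -> int:
    """Map a partition key to the node that owns its hash range."""
    token = zlib.crc32(key.encode()) % RING_SIZE   # stand-in for the Murmur3 token
    return token // (RING_SIZE // NUM_NODES)       # tokens 0-3 -> node 0, 4-7 -> node 1, ...

# Every node computes the same deterministic mapping, so any node
# knows where a given key lives without a central coordinator.
placement = {key: node_for_key(key) for key in ["user:1", "user:2", "user:3"]}
```

The key property is that the mapping is deterministic and requires no lookup table, which is what lets any node in the cluster route a request for any key.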
Let's take the following scenario. We have an online video website and we want to model the comments by user table.
For this, we will use `user_id` as the partition key. Imagine having all the data in a map keyed by `user_id`. Next, what comes in handy is that `posted_timestamp` is the first clustering key. This means that all the entries for a given partition key live together on one node: it is impossible to have N entries for `user_id = 5` and `posted_timestamp = 2231` spread across multiple nodes. Additionally, the rows can be kept sorted, for example ascending by `posted_timestamp`. Reading the first N comments for a certain user is then done with little effort. This helps in terms of quick read speed; imagine having to load 10,000 comments from videos into memory and sort them by timestamp. That would require a lot of unnecessary processing.
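A toy in-memory model (not the real storage engine) can illustrate why this layout makes the "first N comments" query cheap. The table and function names here are illustrative assumptions:

```python
from bisect import insort

# Toy comments_by_user table: partition key = user_id, clustering key = posted_timestamp.
# All rows for one user_id live in one partition (here, one list), kept sorted on insert.
table: dict[int, list[tuple[int, str]]] = {}

def insert_comment(user_id: int, posted_timestamp: int, text: str) -> None:
    # insort keeps the partition ordered by timestamp at write time.
    insort(table.setdefault(user_id, []), (posted_timestamp, text))

def first_n_comments(user_id: int, n: int) -> list[tuple[int, str]]:
    # No global sort needed: rows are already ordered within the partition.
    return table.get(user_id, [])[:n]

insert_comment(5, 2231, "great intro")
insert_comment(5, 1900, "first!")
insert_comment(5, 3000, "thanks")
print(first_n_comments(5, 2))  # [(1900, 'first!'), (2231, 'great intro')]
```

The sorting cost is paid once per write instead of on every read, which mirrors how Cassandra stores clustered rows in order on disk.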
For more in-depth details on this topic, I recommend reading this documentation.
Cassandra Data Flow
To manage and access data efficiently, it is important to understand how Cassandra handles it internally.
Initially, the data is stored in memtables (in-memory structures), and at the same time each write is appended to commit log files on disk. These log files act as a backup for every write to Cassandra, which helps ensure durability even during a power failure: upon reboot, the in-memory data is recovered from the commit logs. Adding more and more information to Cassandra will eventually hit the memory limit. The data, ordered by primary key, is then flushed into actual files on disk called SSTables. Below you can see a representation of the data flow in Cassandra.
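This write path can be sketched as a small simulation, assuming a toy size limit of three entries (real Cassandra flushes based on configurable memory thresholds, and its structures are far more sophisticated):

```python
# Minimal sketch of Cassandra's write path: append to a commit log,
# write to an in-memory memtable, and flush to an immutable "SSTable"
# once the memtable reaches a size limit. Names and limit are illustrative.
MEMTABLE_LIMIT = 3

commit_log: list[tuple[str, str]] = []   # durable record of every write
memtable: dict[str, str] = {}            # in-memory, sorted at flush time
sstables: list[dict[str, str]] = []      # immutable on-disk files (simulated)

def write(key: str, value: str) -> None:
    commit_log.append((key, value))      # logged first, for crash recovery
    memtable[key] = value
    if len(memtable) >= MEMTABLE_LIMIT:
        # Flush: persist the memtable sorted by primary key, then clear memory.
        sstables.append(dict(sorted(memtable.items())))
        memtable.clear()

for i in range(5):
    write(f"k{i}", f"v{i}")
```

After five writes, the first three keys have been flushed into one SSTable, the last two are still in the memtable, and all five remain in the commit log for recovery.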
Getting a better view of how data is managed helps you grasp what actually happens under the hood and why unexpected behavior may sometimes occur.
Besides Apache Cassandra, many other NoSQL databases are out there. The main point of classification is the data model, and some databases overlap and support more than one. Some examples are:

- Key-value stores, such as Redis
- Document stores, such as MongoDB and CouchDB
- Wide-column stores, such as Cassandra and HBase
- Graph databases, such as Neo4j
A Final Word
In the database landscape, NoSQL adoption is rising, and when used appropriately, it can offer real benefits. Nonetheless, enterprises should proceed with caution and be aware of potential limitations and issues. Many businesses, especially those managing large amounts of data or wanting a smooth migration, choose to consult database service providers that specialize in Cassandra architecture and management, such as Instaclustr.
From my point of view, it is very important to know your business: how your data will look, how big it will grow, and how it will change over time. A solid understanding of these things will take you down the road of making the best choice for your system infrastructure.
Opinions expressed by DZone contributors are their own.