
Working With Cassandra Databases


Learn about differences between SQL and NoSQL databases, specific Cassandra workings and best practices, and what the alternatives to Cassandra are.


Not long ago, the SQL database was the normal way of organizing data. Now things have changed, with the appearance of multiple NoSQL databases that allow the storage of data in various ways, such as key-value or XML/JSON formats.

SQL Normalization vs. NoSQL Denormalization

Before getting into this topic, the first thing to know is that normalization means organizing data so that each piece of information is stored only once, usually by splitting it into tables linked through relationships.

Let’s suppose that we want to store user information for an online video platform. A single user can have one or more videos, so we can design the SQL user table as follows:

id      name          country  email
322958  Dorian Jonas  Germany  dorianjonas@example.com

We can then add a user_id field to our video table, which references the id of the matching record in the user table:

video_id  title          user_id  description             timestamp
2341234   CassandraView  322958   Cassandra introduction  1415463675

We can also use normalization in NoSQL, designing tables similar to the ones above. However, this is not always feasible, for reasons that will become evident below. Instead, we may decide to denormalize the data and repeat the user information for every video.

The goal of normalization is to minimize data redundancy and avoid duplication, which in this case would mean repeating each user's details on every video. With normalized tables, we can update just the user table without altering the video table. A denormalized video table, by contrast, looks like this:

video_id  title          description             timestamp   user_name     country  email
2341234   CassandraView  Cassandra introduction  1415463675  Dorian Jonas  Germany  dorianjonas@example.com

This leads to faster queries, but updating the user information in multiple records will be much slower.
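The trade-off can be illustrated with a small Python sketch. It is only an in-memory model, not database code: the helper names (`denormalize`, `update_email_normalized`, `update_email_denormalized`) and the second video are hypothetical, added to make the duplication visible.

```python
# Normalized: one user record, referenced by user_id from each video.
users = {
    322958: {"name": "Dorian Jonas", "country": "Germany",
             "email": "dorianjonas@example.com"},
}
videos = [
    {"video_id": 2341234, "title": "CassandraView", "user_id": 322958,
     "description": "Cassandra introduction", "timestamp": 1415463675},
    # Hypothetical second video, added only to show duplication.
    {"video_id": 2341235, "title": "CassandraView 2", "user_id": 322958,
     "description": "Cassandra data modeling", "timestamp": 1415550075},
]


def denormalize(videos, users):
    """Copy the user's details into every video record."""
    rows = []
    for video in videos:
        user = users[video["user_id"]]
        row = {k: v for k, v in video.items() if k != "user_id"}
        row.update(user_name=user["name"], country=user["country"],
                   email=user["email"])
        rows.append(row)
    return rows


def update_email_normalized(users, user_id, new_email):
    """One record changes, no matter how many videos the user has."""
    users[user_id]["email"] = new_email
    return 1


def update_email_denormalized(rows, old_email, new_email):
    """Every duplicated copy must be found and rewritten."""
    touched = 0
    for row in rows:
        if row["email"] == old_email:
            row["email"] = new_email
            touched += 1
    return touched


rows = denormalize(videos, users)
print(rows[0]["user_name"])  # the read needs no second lookup
print(update_email_normalized(users, 322958, "dorian@example.com"))  # 1
print(update_email_denormalized(rows, "dorianjonas@example.com",
                                "dorian@example.com"))  # 2
```

The denormalized read returns everything in one record, while the update cost grows with the number of videos the user has posted.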

Nowadays, since data storage is cheaper than data transfer latency, we can afford to duplicate data in order to obtain fast reads and minimize the processing done after querying the database. Keep in mind, however, that this trade-off pays off mainly for very large data sets. As a rough rule of thumb, it is worth considering a NoSQL database once the database footprint reaches around 500GB.

Cassandra Best Practices

So first of all, what is the Apache Cassandra Database?

  • It's been around for almost nine years and was first used by Facebook to power the inbox search feature.

  • It is datacenter-aware, with asynchronous replication.

  • It has a distributed, key-value, column-oriented structure.

Being a distributed system, Cassandra uses gossip to learn about other nodes. Gossip is a peer-to-peer (P2P) protocol through which nodes exchange information directly with one another.

Additionally, data is distributed across the nodes of a cluster. Each node is responsible for a given range of keys, according to the hash value of the partition key (the first part of the primary key). For example, keys with hash values from 0 to 3 are on node 1, those from 4 to 7 are on node 2, and so on. This not only tells Cassandra where to find the data but also gives quick access and spreads the data evenly across the cluster. That's another reason why it is so important to know how to model the data tables.
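The 0-to-7 example can be sketched in Python as follows. This is a toy model under stated assumptions: the node names, the eight-bucket hash space, and the use of MD5 are all illustrative choices. Cassandra's own partitioners work over a far larger token range (the default one is based on Murmur3).

```python
import hashlib

NUM_BUCKETS = 8  # hash values 0..7, as in the example above
# Hypothetical ownership table: node name -> inclusive hash range.
NODE_RANGES = {"node1": (0, 3), "node2": (4, 7)}


def bucket_for(key: str) -> int:
    """Hash the key into one of NUM_BUCKETS buckets (stable across runs)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS


def node_for(key: str) -> str:
    """Return the node whose range contains the key's hash bucket."""
    bucket = bucket_for(key)
    for node, (low, high) in NODE_RANGES.items():
        if low <= bucket <= high:
            return node
    raise ValueError(f"no node owns bucket {bucket}")


for user_id in ("322958", "5", "42"):
    print(user_id, bucket_for(user_id), node_for(user_id))
```

Because the owning node is computed from the key alone, any node can work out where a row lives without consulting a central directory.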

Let's take the following scenario. We have an online video website and we want to model the comments-by-user table.

For this, we will use user_id as the partition key, the first part of the primary key. Imagine having all the data in a map keyed by user_id. Next, what comes in handy is that posted_timestamp is the first clustering key. This means that all the entries for a given user_id are stored together on the same node, ordered by posted_timestamp. It's impossible to have N entries for user_id = 5 and posted_timestamp = 2231 spread across multiple nodes. Additionally, you can choose the sort order, for example, ascending by posted_timestamp. Now, reading the first N comments for a certain user is done with little effort. This helps read speed; imagine instead having to load 10,000 comments from videos into memory and sort them by timestamp. That would require a lot of unnecessary processing.
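The "map keyed by user_id" picture can be sketched in Python. It is only an in-memory analogy, not how Cassandra stores data on disk, and the function names and sample comments are hypothetical.

```python
import bisect
from collections import defaultdict

# partition key (user_id) -> list of (posted_timestamp, comment), kept sorted
comments_by_user = defaultdict(list)


def add_comment(user_id, posted_timestamp, comment):
    """Insert into the user's partition, keeping clustering order."""
    bisect.insort(comments_by_user[user_id], (posted_timestamp, comment))


def first_comments(user_id, n):
    """Read the first n comments for one user; no sorting at read time."""
    return comments_by_user.get(user_id, [])[:n]


add_comment(5, 2231, "Great video")
add_comment(5, 1200, "First!")
add_comment(7, 1500, "Thanks")
add_comment(5, 1800, "Very clear")
print(first_comments(5, 2))  # [(1200, 'First!'), (1800, 'Very clear')]
```

The sorting cost is paid once at write time, so a read is just a slice of one user's already-ordered entries.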

For more in-depth details on this topic, I recommend reading this documentation.

Cassandra Data Flow

To manage and access data, it is important to understand how Cassandra works internally.

Initially, the data is stored in memtables (in memory), and at the same time each write is appended to a commit log on disk. The commit log acts as a backup for each write to Cassandra, which helps ensure durability even during a power failure, because upon reboot the data is recovered into memory from the log. As more and more data is written, the memtable eventually reaches its memory limit. At that point, the data, ordered by key, is flushed to disk into files called SSTables. Below you can see a representation of data flow in Cassandra.

[Figure: data flow in Cassandra]

Understanding how data is managed makes it easier to grasp what actually happens and why unexpected behavior may sometimes occur.
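The write path described in this section can be modeled with a short Python sketch. It is a toy, not Cassandra's implementation: the class name `ToyNode`, the `memtable_limit` threshold counted in entries, and clearing the log after a flush are all simplifying assumptions.

```python
class ToyNode:
    """Toy model of the write path: commit log, memtable, SSTables."""

    def __init__(self, memtable_limit=3):
        self.memtable_limit = memtable_limit  # hypothetical flush threshold
        self.commit_log = []   # stands in for the on-disk log
        self.memtable = {}     # in-memory writes
        self.sstables = []     # stands in for files on disk

    def write(self, key, value):
        self.commit_log.append((key, value))  # logged for recovery
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Write the memtable out, sorted by key, and start a fresh one."""
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}
        self.commit_log = []  # assumption: flushed data needs no replay

    def crash_and_recover(self):
        """Lose memory, then rebuild the memtable from the commit log."""
        self.memtable = {}
        for key, value in self.commit_log:
            self.memtable[key] = value


node = ToyNode(memtable_limit=3)
for key in ("c", "a", "b", "d"):
    node.write(key, key.upper())
node.crash_and_recover()
print(node.sstables)  # [[('a', 'A'), ('b', 'B'), ('c', 'C')]]
print(node.memtable)  # {'d': 'D'}
```

The third write fills the memtable and triggers a flush, while the fourth write survives the simulated crash only because it was also recorded in the log.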

Cassandra Alternatives

Besides Apache Cassandra, many other NoSQL databases are out there. They are mainly classified by data model. Some databases support more than one model and therefore appear in several categories. Some examples are:

  • Column: Cassandra, HBase

  • Document: CouchDB, ArangoDB

  • Key-value: Dynamo, Redis

  • Graph: Neo4j, MarkLogic

  • Multi-model: MarkLogic, ArangoDB

A Final Word

In the database landscape, NoSQL adoption is rising, and when used appropriately, it can offer real benefits. Nonetheless, enterprises should proceed with caution and be aware of potential limitations and issues that may occur. Many businesses, especially those that manage large amounts of data or want a smooth migration, choose to consult with database service providers that specialize in Cassandra architecture and management, such as Instaclustr.

From my point of view, it is very important to know your business, how your data will look, how big it will be, and how it will be affected by time. Just having a solid understanding of these things will take you down the road of making the best choice for your system infrastructure.


