
Best Practices for Cassandra Data Modeling

Take a look at the key points that need to be kept in mind when designing a schema in Cassandra.

by Akhil Vijayan · Jul. 09, 18 · Opinion

People new to NoSQL databases tend to treat them like relational databases, but the two are quite different. For people from a relational background, CQL looks familiar, but the way to model data with it is different. Picking the right data model is the hardest part of using Cassandra. This article covers the key points to keep in mind when designing a schema in Cassandra. Following them should save you from redesigning your schemas again and again.

DON'T

Before explaining what should be done, let's talk about the things that we should not be concerned with when designing a Cassandra data model:

1) Minimize the Number of Writes

We should not worry about the number of writes to Cassandra. Writes are much cheaper than reads, so we should write the data in whatever way makes our read queries more efficient.

2) Minimize Data Duplication

Data duplication is a fact of life in a distributed database like Cassandra. Disk is cheap nowadays. To improve Cassandra reads, we duplicate data across tables so that each query can be served efficiently; replication across nodes separately keeps the data available when some nodes fail.

DO

Now let's jump to the important part: the things we do need to keep an eye on.

1) Spread Data Evenly Around the Cluster

Data should be spread around the cluster evenly so that every node has roughly the same amount of data. Data distribution is based on the partition key: a hash is calculated for each partition key, and that hash value decides which node in the cluster the data goes to. So we should choose a good partition key, which is the first part of the primary key.
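
A rough way to see why the choice of partition key matters is to simulate the placement. The sketch below is not Cassandra's actual partitioner (by default Cassandra uses Murmur3 tokens on a token ring); it uses an MD5 hash modulo a hypothetical four-node cluster purely for illustration. The `node_for` and `rows_per_node` helpers and the sample data are assumptions made for this example.

```python
import hashlib
from collections import Counter

NUM_NODES = 4  # hypothetical cluster size


def node_for(partition_key: str) -> int:
    """Map a partition key to a node index with a stable hash (stand-in for a token ring)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_NODES


def rows_per_node(partition_keys):
    """Count how many rows land on each node, given each row's partition key."""
    return Counter(node_for(key) for key in partition_keys)


# 10,000 rows partitioned by a unique employee_id: many distinct keys.
by_id = rows_per_node(str(employee_id) for employee_id in range(10_000))

# The same number of rows partitioned by a low-cardinality column where one value dominates.
designations = ["engineer"] * 9_000 + ["manager"] * 900 + ["director"] * 100
by_designation = rows_per_node(designations)

print("by employee_id:", sorted(by_id.items()))
print("by designation:", sorted(by_designation.items()))
```

With unique keys the rows end up roughly balanced across the four nodes, while the low-cardinality key sends at least 9,000 of the 10,000 rows to a single node.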

2) Minimize the Number of Partitions Read

Partitions are groups of rows that share the same partition key. When we perform a read query, the coordinator node requests every partition that contains the requested data. If that data is scattered across many partitions, the response is delayed by the overhead of requesting each one. This doesn't mean that we should avoid partitions; large data sets need to be partitioned. It means each query should read as few partitions as possible.
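
To make the cost concrete, the following sketch counts how many partitions a query would have to touch under two layouts. It models a table as a plain Python dict from partition key to rows, which is only a mental model and not how Cassandra stores data. The `build_table` and `partitions_touched` helpers and the sample rows are hypothetical.

```python
from collections import defaultdict

# Sample rows: (employee_id, employee_name, designation)
ROWS = [
    (1, "Asha", "engineer"),
    (2, "Ben", "engineer"),
    (3, "Chen", "manager"),
    (4, "Dina", "engineer"),
]


def build_table(rows, partition_key_of):
    """Group rows into partitions using the given partition-key function."""
    table = defaultdict(list)
    for row in rows:
        table[partition_key_of(row)].append(row)
    return table


def partitions_touched(table, predicate):
    """Return how many partitions hold at least one row matching the predicate."""
    return sum(1 for rows in table.values() if any(predicate(r) for r in rows))


def is_engineer(row):
    return row[2] == "engineer"


by_id = build_table(ROWS, lambda row: row[0])           # one partition per employee
by_designation = build_table(ROWS, lambda row: row[2])  # one partition per designation

print(partitions_touched(by_id, is_engineer))           # 3
print(partitions_touched(by_designation, is_engineer))  # 1
```

The same "all engineers" query touches three partitions when rows are partitioned by employee ID, but only one when they are partitioned by designation.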

To minimize partition reads, we need to model our data according to the queries that we use. Minimizing partition reads involves:

a) Model Data According to the Queries

We should always design the schema around the queries that we will issue to Cassandra. If all the data for a query lives in one table, the read will be faster.

b) Create a Table Based on Where You Can Satisfy Your Query by Reading (Roughly) One Partition

This means we should have one table per query pattern. Different tables should satisfy different needs. It is ok to duplicate data among different tables, but our focus should be to serve the read request from one table in order to optimize the read.
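
A minimal sketch of the "one table per query pattern" idea, using Python dicts as stand-ins for tables. Each write goes to both tables so that each read is a single partition lookup. The table names `employee_by_id` and `employees_by_designation` and the helper functions are invented for this illustration and do not appear in the schemas of this article.

```python
from collections import defaultdict

employee_by_id = {}                           # partition key: employee_id
employees_by_designation = defaultdict(dict)  # partition key: designation, clustering key: employee_id


def add_employee(employee_id, name, designation):
    """Write the same record to both tables; duplication is the price of cheap reads."""
    record = {"employee_id": employee_id, "employee_name": name, "designation": designation}
    employee_by_id[employee_id] = record
    employees_by_designation[designation][employee_id] = record


def get_employee(employee_id):
    """Query 1: served from a single partition of employee_by_id."""
    return employee_by_id.get(employee_id)


def get_by_designation(designation):
    """Query 2: served from a single partition, rows ordered by clustering key."""
    partition = employees_by_designation.get(designation, {})
    return [partition[eid] for eid in sorted(partition)]


add_employee(2, "Ben", "engineer")
add_employee(1, "Asha", "engineer")
add_employee(3, "Chen", "manager")

print(get_employee(3)["employee_name"])
print([e["employee_id"] for e in get_by_designation("engineer")])
```

Every write costs two inserts, but neither read has to scan or join anything, which is the trade Cassandra's cheap writes are meant to support.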

Let's take an example to understand it better.

Assume we want to create an employee table in Cassandra. Our fields will be employee ID, employee name, designation, salary, etc. Now, identify the queries that we will run frequently to fetch the data. Possible cases are:

1) To Get the Details of an Employee Against a Particular Employee ID

The schema looks like this:

CREATE TABLE employee (
    employee_id int PRIMARY KEY,
    employee_name text,
    designation text,
    salary int,
    location text
);

Let's match this against the rules:

Spread data evenly around the cluster: Yes, as each employee has its own partition.

Minimize the number of partitions read: Yes, only one partition is read to get the data.

2) To Get the Details of All the Employees for a Particular Designation

Now the requirement has changed. Now we need to get the employee details on the basis of designation. The schema will look like this:

CREATE TABLE employee (
    employee_id int,
    employee_name text,
    designation text,
    salary int,
    location text,
    PRIMARY KEY (designation, employee_id)
);

In the above schema, we have a composite primary key consisting of designation, which is the partition key, and employee_id, which is the clustering key.

This looks good, but let's again match it against our rules:

Spread data evenly around the cluster: Our schema may violate this rule. If a large number of records fall under one designation, all of that data will be bound to a single partition, and the data will not be evenly distributed.

Minimize the number of partitions read: Yes, only one partition is read to get the data.

3) To Get the Details of All Employees Living in a Particular Location

If we have a large number of records falling in a single partition, there will be an issue in spreading the data evenly around the cluster. We can resolve this issue by designing the model in this way:

CREATE TABLE employee (
    employee_id int,
    employee_name text,
    designation text,
    salary int,
    location text,
    PRIMARY KEY ((designation, location), employee_id)
);

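Now the data will be spread more evenly across the cluster, as the partition key also takes the location of each employee into account. Note that a query against this table has to supply both designation and location, since together they form the partition key.

This schema satisfies both of our rules.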
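
The effect of widening the partition key can be estimated with a quick count of rows per partition. The sample data below is made up for illustration, and `partition_sizes` is a hypothetical helper; the point is only to compare partitioning by designation alone with partitioning by (designation, location).

```python
from collections import Counter

# Hypothetical sample: (designation, location) for each employee row.
EMPLOYEES = (
    [("engineer", "Delhi")] * 400
    + [("engineer", "Pune")] * 350
    + [("engineer", "Noida")] * 250
    + [("manager", "Delhi")] * 60
    + [("manager", "Pune")] * 40
)


def partition_sizes(rows, partition_key_of):
    """Count rows per partition for a given partition-key function."""
    return Counter(partition_key_of(row) for row in rows)


by_designation = partition_sizes(EMPLOYEES, lambda row: row[0])
by_designation_location = partition_sizes(EMPLOYEES, lambda row: (row[0], row[1]))

print(max(by_designation.values()))           # largest partition: 1000 rows
print(max(by_designation_location.values()))  # largest partition: 400 rows
print(len(by_designation_location))           # 5 partitions instead of 2
```

The composite key breaks the 1,000-row "engineer" partition into three smaller ones. The trade-off is that fetching all engineers regardless of location would now mean reading three partitions in this sample, so the key should match the query that is actually issued.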

To sum it all up, Cassandra and RDBMS are different, and we need to think differently when we design a Cassandra data model. The above rules need to be followed in order to design a good data model that will be fast and efficient.

Thanks for reading this article till the end.

Reference: DataStax

This article was first published on the Knoldus blog.


Published at DZone with permission of Akhil Vijayan, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
