DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Please enter at least three characters to search
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Zones

Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks

The software you build is only as secure as the code that powers it. Learn how malicious code creeps into your software supply chain.

Apache Cassandra combines the benefits of major NoSQL databases to support data management needs not covered by traditional RDBMS vendors.

Generative AI has transformed nearly every industry. How can you leverage GenAI to improve your productivity and efficiency?

Modernize your data layer. Learn how to design cloud-native database architectures to meet the evolving demands of AI and GenAI workloads.

Related

  • Getting Started With Apache Cassandra
  • Building RAG Apps With Apache Cassandra, Python, and Ollama
  • Essential Techniques for Performance Tuning in Snowflake
  • Hello YugabyteDB: Running Kong on the Distributed PostgreSQL Database

Trending

  • How to Perform Custom Error Handling With ANTLR
  • How Can Developers Drive Innovation by Combining IoT and AI?
  • Can You Run a MariaDB Cluster on a $150 Kubernetes Lab? I Gave It a Shot
  • The Role of AI in Identity and Access Management for Organizations
  1. DZone
  2. Data Engineering
  3. Databases
  4. Using Primary, Partition, and Clustering Keys in ScyllaDB (or Cassandra)

Using Primary, Partition, and Clustering Keys in ScyllaDB (or Cassandra)

Learn core data modeling concepts that are critical for optimizing performance in wide-column NoSQL databases.

By 
Guy Shtub user avatar
Guy Shtub
·
May. 20, 24 · Tutorial
Likes (1)
Comment
Save
Tweet
Share
1.5K Views

Join the DZone community and get the full member experience.

Join For Free

In ScyllaDB and other NoSQL databases, the data model is based on the queries and not just around the domain entities. When creating the data model, we take into account both the conceptual data model and the application workflow: which queries will be performed by which users and how often.

One of the main goals of data modeling in ScyllaDB and other wide-column databases (e.g., Apache Cassandra) is to return results fast. To achieve that, you want:

  • Even data distribution: Data should be evenly spread across the cluster so that every node holds roughly the same amount of data. ScyllaDB determines which node should store the data based on hashing the partition key. Therefore, choosing a suitable partition key is crucial. More on this later on.
  • To minimize the number of partitions accessed in a read query: To make reads faster, we’d ideally have all the data required in a read query stored in a single Table. Although it’s fine to duplicate data across tables, in terms of performance, it’s better if the data needed for a read query is in one table.

Things you should NOT focus on:

  • Avoiding data duplication: To get efficient reads, we sometimes have to duplicate data. More about that and denormalization later in this lesson. In a later session, we learn how to avoid duplication in some cases using Secondary Indexes.
  • Minimizing the number of writes: writes in ScyllaDB aren’t free, but they are very efficient and “cheap.” ScyllaDB is optimized for high write throughput. Reads, while still very fast, are usually more expensive than writes and are harder to fine-tune. We’d usually be ready to increase the number of writes to increase read efficiency. Keep in mind that the number of tables also affects consistency. 

In this tutorial, you will use an example based on a Veterinary Clinic named 4Paws Clinic. In this clinic, each admitted animal has a connected heart rate monitor, which logs heart rate and other vital information every five seconds.

What’s a Primary Key?

A Primary Key is defined within a table. It is one or more columns used to identify a row. All tables must include a definition for a Primary Key. For example, consider this table:

CQL
 
CREATE TABLE heartrate_v1 (
   pet_chip_id uuid,
   time timestamp,
   heart_rate int,
   PRIMARY KEY (pet_chip_id)
);


In the example above the primary key is a single column – the pet_chip_id. If a Primary Key is made up of a single column, it is called a Simple Primary Key. 

For the above table perform the query:

 
SELECT * from heartrate_v1 WHERE pet_chip_id = 123e4567-e89b-12d3-a456-426655440b23;


What happens if we want to query our data by pet_chip_id but also by time? That is if our query is:

 
SELECT * from heartrate_v1 WHERE pet_chip_id = 123e4567-e89b-12d3-a456-426655440b23 AND time >='2019-03-04 07:01:00' AND time <='2019-03-04 07:02:00';


In that case, the above query won’t work. We can define the Primary Key to include more than one column, in which case it is called a Composite (or Compound) key. Create the following table:

CQL
 
CREATE TABLE heartrate_v2 (
   pet_chip_id uuid,
   time timestamp,
   heart_rate int,
   PRIMARY KEY (pet_chip_id, time)
);


And insert some data:

 
INSERT INTO heartrate_v2(pet_chip_id, time, heart_rate) VALUES (123e4567-e89b-12d3-a456-426655440b23, '2019-03-04 07:01:05', 100);

INSERT INTO heartrate_v2(pet_chip_id, time, heart_rate) VALUES (123e4567-e89b-12d3-a456-426655440b23, '2019-03-04 07:01:10', 90); 

INSERT INTO heartrate_v2(pet_chip_id, time, heart_rate) VALUES (123e4567-e89b-12d3-a456-426655440b23, '2019-03-04 07:01:50', 96); 

INSERT INTO heartrate_v2(pet_chip_id, time, heart_rate) VALUES (123e4567-e89b-12d3-a456-426655440b23, '2019-04-04 07:01:50', 99);  


Enter Partition Keys and Clustering Keys

In the case shown above, the first part of the Primary Key is called the Partition Key (pet_chip_id in the above example) and the second part is called the Clustering Key (time).

A Primary Key is composed of 2 parts:

  • The Partition Key is responsible for data distribution across the nodes. It determines which node will store a given row. It can be one or more columns.

Partition key

  • The Clustering Key is responsible for sorting the rows within the partition. It can be zero or more columns.

Now execute the query we previously saw, according to time:

 
SELECT * from heartrate_v2 WHERE pet_chip_id = 123e4567-e89b-12d3-a456-426655440b23 AND time >='2019-03-04 07:01:00' AND time <='2019-03-04 07:02:00';


Additionally, we previously had an issue with heartrate_v1, where a pet could only have one heart rate value recorded regardless of the time. Now that we defined the time to be a part of the primary key, each primary key, which is a combination of pet_chip_id and time, can have a heart rate value.

Read the data for the same pet:

 
SELECT * from heartrate_v2 WHERE pet_chip_id = 123e4567-e89b-12d3-a456-426655440b23 ;


We can see that as opposed to the previous example. This time the value wasn’t overwritten.

The Partition Key and the Clustering Keys With Multiple Columns

As we just saw, both the Partition Key and the Clustering Key can include more than one column, for example, if our query is:

 
SELECT * from heartrate_v3 WHERE pet_chip_id = 123e4567-e89b-12d3-a456-426655440b23 AND time ='2019-03-04 07:01:00' AND pet_name = 'Duke';


We could define the table as follows:

CQL
 
CREATE TABLE heartrate_v3 (
   pet_chip_id uuid,
   time timestamp,
   heart_rate int,
   pet_name text,
   PRIMARY KEY ((pet_chip_id, time), pet_name)
);


Create the above table, then insert some data:

 
INSERT INTO heartrate_v3(pet_chip_id, time, heart_rate, pet_name) VALUES (123e4567-e89b-12d3-a456-426655440b23, '2019-03-04 07:01:10', 90, 'Duke'); 


In this case, the partition key includes two columns: pet_chip_id and time, and the clustering key is pet_name. Keep in mind that every query must include all columns defined in the partition key.

Now try to execute this query:

 
SELECT * from heartrate_v3 WHERE pet_chip_id = 123e4567-e89b-12d3-a456-426655440b23; 


It fails, as the entire partition key is not given.

Now try this query:

 
SELECT * from heartrate_v3 WHERE pet_chip_id = 123e4567-e89b-12d3-a456-426655440b23 AND time ='2019-03-04 07:01:10' AND pet_name = 'Duke';  


It succeeds as the complete partition key is given.

Similarly, if we want each partition to be based on the pet_chip_id but want to be able to query according to pet_name and heart_rate:

 
SELECT * from heartrate_v4 WHERE pet_chip_id = 123e4567-e89b-12d3-a456-426655440b23 AND pet_name = 'Duke' AND heart_rate = 100; 


it is possible to define (do this):

CQL
 
CREATE TABLE heartrate_v4 (
   pet_chip_id uuid,
   time timestamp,
   heart_rate int,
   pet_name text,
   PRIMARY KEY (pet_chip_id, pet_name, heart_rate)
);


Note:

  • If there is more than one column in the Clustering Key (pet_name and heart_rate in the example above), the order of these columns defines the clustering order. For a given partition, all the rows are physically ordered inside ScyllaDB by the clustering order. This order determines what select query you can efficiently run on this partition.
  • In this example, the ordering is first by pet_name and then by heart_rate.
  • In addition to the Partition Key columns, a query may include the Clustering Key. If it does include the Clustering Key columns they must be used in the same order as they were defined.

Additional Data Modeling Exercises

If you want to continue learning about primary keys and other data modeling concepts, feel free to play around with these additional exercises and quizzes building on this same Veterinary Clinic example.

Apache Cassandra Data modeling Database clustering Partition (database)

Published at DZone with permission of Guy Shtub. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

  • Getting Started With Apache Cassandra
  • Building RAG Apps With Apache Cassandra, Python, and Ollama
  • Essential Techniques for Performance Tuning in Snowflake
  • Hello YugabyteDB: Running Kong on the Distributed PostgreSQL Database

Partner Resources

×

Comments
Oops! Something Went Wrong

The likes didn't load as expected. Please refresh the page and try again.

ABOUT US

  • About DZone
  • Support and feedback
  • Community research
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 100
  • Nashville, TN 37211
  • support@dzone.com

Let's be friends:

Likes
There are no likes...yet! 👀
Be the first to like this post!
It looks like you're not logged in.
Sign in to see who liked this post!