Cassandra: Intro to Designing Tables Access, Query and so on.
Join the DZone community and get the full member experience.
Join For FreeBackground
Cassandra is a fast NoSQL data store that has the distributed capabilities of Amazon's DynamoDB and Google's BigTable data structure.
Cassandra is a scalable, high available and partition tolerant database that had focused specifically with 'AP' of CAP theorem, however, 'C' can be achieved with higher performance penalty, by getting a quorum of all the nodes participating in the cluster.
I think that is already some background, so we should really start on the important stuff of this write-up.
Designing Tables
NoSQL practices will tell you that you need to know how will query your records before even design the tables. It means that we need to know what and how we're going to access that data.
Assuming we have a simple data set:
Vehicle |
id |
Name |
Brand |
Model |
BodyType |
Trim |
Year |
color |
wheelSize |
date_built |
You would want to be able to access the data by id, color, bodytype, year, model and brand.
If we distinguish, that this is how I will be accessing the data then we need to decide a couple of things.
1) Do we need access this dataset, using 1 or the attributes or really a combination of them?
2) Do I have a way to get other attributes elsewhere and create a combination keys?
For sake of this writeup assume that we selected 1) above because that is what we really need to do. Then we need to create Cassandra's cluster keys.
NOTE: That I have jumped from creating keyspace and so, as I've assume you've done those before.
01.
CREATE
TABLE
Vehicle (
02.
id timeuuid,
03.
name
text,
04.
brand text,
05.
model text,
06.
body_type text,
07.
trim text,
08.
color text,
09.
year
text,
10.
wheelsize
int
,
11.
date_built
timestamp
,
12.
PRIMARY
KEY
((id), color, body_type,
year
, model, brand ))
13.
WITH
CLUSTERING
ORDER
BY
(color
asc
, body_type
asc
,
year
asc
, model
asc
, brand
asc
);
Lets describe what these is, we told Cassandra that we are going to have a table that will store vehicle attributes that are group by color, body_type, year, model and brand. We also said to store them ascending.
So everytime we access Vehicle table we'll say via CQL
1.
SELECT
*
FROM
Keyspace.Vehicle
where
id=
'er34SRFer'
and
color=
'blue'
2.
and
body_type=
'sedan'
and
model=
'Camry'
and
brand=
'Toyota'
;
Now, we can't always provide all the keys, id, can be ommited but the other part of cluster keys must be provided, otherwise cassandra will prevent you from reading the data. Also, cassandra will allow you to do a lookup with only the id as the retriction attribute filter.
1.
SELECT * FROM Keyspace.Vehicle where id=
'er34SRFer'
;
That is because id here is a partition key, so cassandra treat this as unique data row for that record.
If you want to access by specific cluster key attribute that are not restrictive, then you'll need to create a lookup table. Doing so, is designed using inverted index, which is the common technique to solve this problem.
Let's say you want to do a lookup by color, then we need to create the following table.
1.
CREATE
TABLE
Vehicle_by_Color (
2.
color text,
3.
ids
set
<text>,
4.
PRIMARY
KEY
(color)
5.
);
With this table, I can keep all the ids of vehicle, and my key will be the 'color' column.
So, it my client code, what I do is.
1.
public
List<Vehicle> findByColor(String color) {
2.
Vehicle_by_Color ret = Cassandra.executeCQL(
" SELECT * FROM Keyspace.VEHICLE_BY_COLOR where color='black'"
);
3.
4.
//now find all the Vehicles matching this criteria.
5.
List<Vehicle> allVehicleInThisColor = Cassandra.executeCQL("SELECT * FROM Keyspace.VEHICLE where id in ( convertToInsCommaSeparated(ret.getIds()) );
6.
7.
return
allVehicleInThisColor;
8.
}
Now, I don't have to provide all the cluster keys.
Summary
NoSQL design is not similar to relational world. It really is breaking the rules of relational world about normalizing records set and so. Here, we've duplicated data, we have created lookup table(s) to store additional data just to get to the records we want. Also, all data are mostly collapse on a table, because attributes are aggregated to the master record Vehicle.
I hope this has help you designing your first Cassandra table. We've made few mistakes on desiging Cassandra tables in the past, as we've also added indexes, which is another set of complexity on its own. If you want to know more about that I suggest you read Cassandra's documentation.
Overall we are very happy to have selected Cassandra as it performs really well with the data and load we've throwed at it. For us we want to push our Cassandra install to next level of adding more data, node and running Spark to run aggregations and calculations. More to come ...
Published at DZone with permission of Melvin Ramos. See the original article here.
Opinions expressed by DZone contributors are their own.
Comments