Over a million developers have joined DZone.

Apache Cassandra vs. Apache Ignite: Affinity Collocation and Distributed SQL

DZone's Guide to

Apache Cassandra vs. Apache Ignite: Affinity Collocation and Distributed SQL

The simplified software architecture we get with Ignite is one of its many advantages over Cassandra. Read on to learn how.

· Database Zone ·
Free Resource

MariaDB TX, proven in production and driven by the community, is a complete database solution for any and every enterprise — a modern database for modern applications.

In the previous article, we reviewed and summarized pitfalls of the query-driven data modeling methodology (AKA denormalized data modeling) utilized in Apache Cassandra. Turns out, the methodology prevents us from developing efficient applications without insight into what our queries will be like. In reality, an application architecture under this scenario will get more complicated, less maintainable, and prone to notable data duplication.

Furthermore, it was discovered that the pitfalls are usually shielded with statements like, "If you really want to be scalable, fast, and highly available, then be ready to store duplicated data and sacrifice SQL along with strong consistency." That might have been true a decade ago but it's a total fable nowadays!

Not going too far, we picked another Apache Software Foundation citizen, Apache Ignite, and in this article, we will figure out how our Ignite-based application's architecture would look and what its maintenance costs would be.

The application of our choice tracks all the cars produced by a vendor and gives insights on production power for every single vendor. If you recall, it has the following relational model:

Cars and Vendors

Can we just go ahead and create those three tables in Ignite with the CREATE TABLE command and run our SQL-driven application immediately? Depends... If we were not supposed to join the data stored in different tables, then it would be enough. But in the previous article, we put a requirement that the application has to support two queries that require JOINs:

  1. Q1: Get car models produced by a vendor within a particular time frame (newest first).
  2. Q2: Get a number of cars of a specific model produced by a vendor.

In the Cassandra case, we got away with JOINs by creating two tables for each of the queries. Do we go through the same experience with Ignite? Totally not. In fact, Ignite's non-collocated JOINs are already at our service and doesn't require anything else from us once the three tables are created. However, they are not as efficient and fast as their collocated counter-part. Therefore, let's learn more about affinity collocation first and see how this concept is used in modern databases such as Apache Ignite.

Affinity Collocation-Based Data Modeling

Affinity collocation is a powerful concept of Ignite (and other distributed databases such as Google Spanner or MemSQL) that allows storing related data on the same cluster nodes. How do we know which data is related? It's simple, especially with the background of relational databases — just identify a parent-child relationship between your business objects, specify affinity keys in CREATE TABLE statements and your job is done because Ignite will take care of the rest!

Taking the Cars and Vendors application as an example, it's reasonable to use Vendor as a parent entity and Car with Production as its children. For instance, once this happens and the affinity keys are set, all the Cars produced by specific Vendor will be stored on the same node, as shown in the picture below:

Affinity Collocation

Specifically, the picture shows that all the Cars produced by Toyota are stored on Ignite node 1 while the Cars produced by Ford Motors are stored on Ignite node 2. This is the affinity collocation in action — Cars are stored on a node where an entry of a respective Vendor is.

It's time to define SQL tables to achieve such data distribution. Let's start with Vendor table:

    name VARCHAR,
    address VARCHAR

The Vendors will be distributed randomly across the cluster nodes. Ignite uses the primary key column to calculate a node that will own a Vendor's entry (refer to the recording of Ignite Essentials Part 1 webinar for more details).

The Car table goes next:

    id INT,
    vendor_id INT,
    model VARCHAR,
    year INT,
    price float,
    PRIMARY KEY(id, vendor_id)
) WITH "affinityKey=vendor_id";

The table for Cars has affinityKey parameter set to vendor_id column. This key instructs Ignite to store a Car on the cluster node of its vendor_id (refer to the recording of Ignite Essentials Part 2 webinar for more info).

Repeating the same procedure for the Production table, whose entries will be stored on the cluster nodes identified by vendor_id, as well:

CREATE TABLE Production (
    id INT,
    car_id INT,
    vendor_id INT,
    country VARCHAR,
    total INT,
    year INT,
    PRIMARY KEY(id, car_id, vendor_id)
) WITH "affinityKey=vendor_id";

That's it with the data modeling done right in Apache Ignite. The next step is to get down to our application code and make out all the required queries.

SQL Queries With JOINs

Ignite cluster can be queried with our good ol' friend SQL, which supports distributed JOINs and secondary indexes.

There are two types of JOINs in Ignite: collocated and non-collocated. The collocated JOINs omit data (that is needed to complete a JOIN) movement between cluster nodes expecting that the tables that are being joined are already collocated and all the data is available locally. That's the most efficient and performant JOINs you can get in distributed databases such as Ignite. The non-collocated JOINs as the name suggests is our a backup plan if there was no way to achieve affinity collocation for some tables but they still need to be joined. This type of JOINs is slower because they might involve data movement between cluster nodes at a JOIN time.

Previously, we could collocate Vendor, Car, and Production tables and now can benefit from the collocated JOINs preparing an SQL query for Q1 (get car models produced by a vendor within a particular time frame, newest first):

SELECT c.model, p.country, p.total, p.year FROM Vendor as v
JOIN Production as p ON v.id = p.vendor_id
JOIN Car as c ON c.id = p.car_id
WHERE v.name = 'Ford Motor' and p.year >= 2017
ORDER BY p.year;

Is there any way to make it even faster? Sure; let's define secondary indexes for Vendor.name and Production.year columns:

CREATE INDEX vendor_name_id ON Vendor (name);

CREATE INDEX prod_year_id ON Production (year);

The SQL query for required Q2 (get a number of cars of a specific model produced by a vendor) does not require extra effort from us:

SELECT p.country, p.total, p.year FROM Vendor as v
JOIN Production as p ON v.id = p.vendor_id
JOIN Car as c ON c.id = p.car_id
WHERE v.name = 'Ford Motor' and c.model = 'Explorer';

Now, whenever a boss stops by our cubicle requesting to add a new feature, we can quickly come up with a set of new SQL queries needed for it, configure secondary indexes, and go to the kitchen to chat with teammates. Job done! In comparison, recall what it took us to support new query Q2 for the Cassandra-based architecture.

Simplified Architecture: Mission Accomplished!

Affinity collocation-based data modeling in Ignite has the following advantages over the query-driven denormalized modeling in Cassandra:

  • The application's data layer is modeled in a familiar relational way and easy to maintain.
  • Data is accessed using standard SQL syntax.
  • Affinity collocation provides more benefits from modern distributed databases:
    • Efficient and highly performant distributed JOINs.
    • Collocated computations.
    • Watch a recording of this webinar to reveal more benefits of this technique.

The simplified software architecture we get with Ignite is not the only benefit if we consider it instead of Cassandra. Give me a couple of weeks to put together my thoughts on strong consistency and in-memory performance.

MariaDB AX is an open source database for modern analytics: distributed, columnar and easy to use.

apache ignite ,distributed queries ,apache cassandra ,architecture ,database ,tutorial ,data modeling

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}