Avoid Cross-Shard Data Movement in Distributed Databases

Learn four effective strategies to optimize distributed joins, reduce network overhead, and improve query performance in sharded databases.

Baskar Sikkayan

Mar. 26, 25 · Tutorial

Likes (4)

Comment

Save

4.5K Views

Modern applications rely on distributed databases to handle massive amounts of data and scale seamlessly across multiple nodes. While sharding helps distribute the load, it also introduces a major challenge — cross-shard joins and data movement, which can significantly impact performance.

When a query requires joining tables stored on different shards, the database must move data across nodes, leading to:

High network latency due to data shuffling
Increased query execution time as distributed queries become expensive
Higher CPU usage as more computation is needed to merge data across nodes

But what if we could eliminate unnecessary cross-shard joins?

In this article, we'll explore four proven strategies to avoid data movement in distributed databases:

Replicating reference tables
Collocating related data in the same shard
Using a mapping table for efficient joins
Precomputed join tables (materialized views)

By applying these strategies, you can optimize query execution, minimize network overhead, and scale your database efficiently.

Understanding the Problem: Cross-Shard Joins and Data Movement

How Data Movement Happens

Imagine a typical e-commerce database where:

The Orders table is sharded by CustomerID.

    SQL
   
 

   CREATE TABLE Orders (
    order_id INT PRIMARY KEY AUTO_INCREMENT,  -- Unique ID for the order
    customer_id INT NOT NULL,                 -- Foreign key to Customers (Sharding Key)
    product_id INT NOT NULL,                   -- Foreign key to Products
    quantity INT DEFAULT 1,                    -- Number of units ordered
    total_price DECIMAL(10,2),                 -- Total price of the order
    order_status ENUM('Pending', 'Shipped', 'Delivered', 'Cancelled') DEFAULT 'Pending',
    order_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    shard_key INT GENERATED ALWAYS AS (customer_id) VIRTUAL -- Helps with sharding logic
);
  

Column Name	Data Type	Description
`order_id`	INT (Primary Key, Auto Increment)	Unique identifier for the order.
`customer_id`	INT (Sharding Key)	Foreign key to `Customers` table; used for sharding.
`product_id`	INT (Foreign Key)	References `Products.product_id`.
`quantity`	INT (Default: 1)	Number of units ordered.
`total_price`	DECIMAL(10,2)	Total cost of the order.
`order_status`	ENUM('Pending', 'Shipped', 'Delivered', 'Cancelled') (Default: 'Pending')	Current order status.
`order_date`	TIMESTAMP (Default: CURRENT_TIMESTAMP)	Timestamp when the order was placed.
`shard_key`	INT (Virtual Column)	Helps in sharding logic by using `customer_id`.

Sharding Key: customer_id

Ensures that all orders of a given customer remain on the same shard.

Why Not order_id?

Because customers place multiple orders, keeping all their orders on the same shard optimizes query performance for customer history retrieval.

The Products table is sharded by ProductID.

    SQL
   
 

   CREATE TABLE Products (
    product_id INT PRIMARY KEY AUTO_INCREMENT,  -- Unique product ID (Sharding Key)
    product_name VARCHAR(255) NOT NULL,         -- Name of the product
    category VARCHAR(100),                      -- Product category (e.g., Electronics, Clothing)
    price DECIMAL(10,2) NOT NULL,               -- Product price
    stock_quantity INT DEFAULT 0,               -- Available stock count
    last_updated TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    shard_key INT GENERATED ALWAYS AS (product_id) VIRTUAL -- Helps with sharding logic
);
  

Column Name	Data Type	Description
`product_id`	INT (Primary Key, Auto Increment)	Unique identifier for the product (Sharding Key).
`product_name`	VARCHAR(255)	Name of the product.
`category`	VARCHAR(100)	Product category (e.g., Electronics, Clothing).
`price`	DECIMAL(10,2)	Price of the product.
`stock_quantity`	INT (Default: 0)	Available stock count.
`last_updated`	TIMESTAMP (Default: CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP)	Timestamp for last update.
`shard_key`	INT (Virtual Column)	Helps in sharding logic by using `product_id`.

Sharding Key: product_id

Ensures that product details are distributed evenly across shards.

Why Not category?

Because product_id ensures uniform distribution, whereas some categories may be more popular and could cause unbalanced shards.

Now, let's say we want to retrieve the product names for a specific customer’s orders:

    SQL
   
   SELECT o.order_id, o.customer_id, p.product_name
FROM Orders o
JOIN Products p ON o.product_id = p.product_id
WHERE o.customer_id = 123;

Since Orders and Products are sharded differently, fetching the product_name requires data movement between shards. This can cause significant performance degradation, especially as the system scales.

Strategy 1: Replicating Reference Tables

Concept

Instead of storing reference tables on separate shards, we replicate them across all nodes. This allows queries to perform joins locally, eliminating inter-shard communication.

How to Implement

In a database like PostgreSQL Citus, we can create a replicated reference table:

    SQL
   
   CREATE TABLE Products (
    product_id INT PRIMARY KEY,
    product_name TEXT
) DISTRIBUTED REPLICATED;

Benefits

Eliminates cross-shard joins for lookups.
Fast query execution, as the reference data is available on all shards.

When NOT to Use

If the reference table is too large (e.g., millions of records).
If the reference data changes frequently, causing replication overhead.

Strategy 2: Collocating Related Data in the Same Shard

Concept

Instead of sharding Orders by CustomerID and Products by ProductID, we shard both tables by the same key (CustomerID).

How It Works

    SQL
   
   SELECT o.order_id, o.customer_id, p.product_name
FROM Orders o
JOIN Products p ON o.product_id = p.product_id
WHERE o.customer_id = 123;

Now, because both tables are in the same shard, the join happens locally, avoiding cross-shard data movement.

Benefits

Optimized for queries that frequently join tables based on a shared key.
Works well for applications where customer-specific data is frequently accessed.

When NOT to Use

If tables grow at different rates, leading to uneven shard distribution.
If queries involve cross-customer searches, making sharding by CustomerID inefficient.

Strategy 3: Using a Mapping Table for Efficient Joins

Concept

Instead of performing cross-shard joins, we use a mapping table that stores which shard contains the required data.

How It Works

We create a customer-product mapping table:

    SQL
   
 

   CREATE TABLE CustomerProductMap (
    customer_id INT,
    product_id INT,
    shard_id INT,
    PRIMARY KEY (customer_id, product_id)
);
  

Before querying, we first retrieve the shard_id:

    SQL
   
   SELECT shard_id FROM CustomerProductMap WHERE customer_id = 123;

Then, we query only the relevant shard:

    SQL
   
 

   SELECT o.order_id, o.customer_id, p.product_name
FROM Orders o
JOIN Products p ON o.product_id = p.product_id
WHERE o.customer_id = 123
AND o.shard_id = <retrieved_shard>;
  

Benefits

Reduces the number of shards queried, improving performance.
Works well for large datasets that cannot be replicated.

When NOT to Use

If updates to CustomerProductMap are frequent, causing maintenance overhead.

Strategy 4: Precomputed Join Tables (Materialized Views)

Concept

Instead of executing joins at query time, we precompute and store frequently used join results in a materialized view or a separate table.

How It Works

We create a denormalized table that precomputes Orders and Products:

    SQL
   
 

   CREATE TABLE Precomputed_Order_Products (
    order_id INT PRIMARY KEY,
    customer_id INT,
    product_id INT,
    product_name TEXT,
    order_date TIMESTAMP
) SHARDED BY (customer_id);
  

A background ETL process precomputes the joins and inserts the data:

    SQL
   
   INSERT INTO Precomputed_Order_Products (order_id, customer_id, product_id, product_name, order_date)
SELECT o.order_id, o.customer_id, p.product_id, p.product_name, o.order_date
FROM Orders o
JOIN Products p ON o.product_id = p.product_id;

Querying without joins:

    SQL
   
   SELECT order_id, customer_id, product_name, order_date
FROM Precomputed_Order_Products
WHERE customer_id = 123;

Benefits

No joins required at query time, significantly improving performance.
Ideal for read-heavy workloads where query speed is critical.

When NOT to Use

If the dataset changes frequently, requiring frequent recomputation.
If storage costs for redundant data are too high.

Choosing the Right Strategy: Trade-offs and Considerations

Approach	Pros	Cons
Replicating Reference Tables	Fast joins, eliminates data movement	Only works for small tables
Collocating Data in the Same Shard	Efficient joins, avoids movement	Requires careful shard key selection
Mapping Table Approach	Efficient query routing	Extra storage & maintenance overhead
Precomputed Join Tables	Best for read-heavy workloads	Requires background ETL jobs

General Guidelines

Use replication – For small reference tables
Use colocation – When queries naturally group by a specific key
Use mapping tables – When reference tables are too large to replicate
Use precomputed joins – For high-performance analytical queries

Conclusion

Cross-shard data movement is one of the biggest performance challenges in distributed databases. We can significantly optimize query performance and reduce unnecessary network overhead by applying reference table replication, colocation, mapping tables, and precomputed joins.

Choosing the right strategy depends on query patterns, data distribution, and storage constraints. With the right approach, you can build highly scalable, performant distributed systems without being hindered by cross-shard joins.

Database Shard (database architecture) Data Types

Opinions expressed by DZone contributors are their own.

Related

Trending

Avoid Cross-Shard Data Movement in Distributed Databases

Learn four effective strategies to optimize distributed joins, reduce network overhead, and improve query performance in sharded databases.

Understanding the Problem: Cross-Shard Joins and Data Movement

How Data Movement Happens

Sharding Key: customer_id

Why Not order_id?

Sharding Key: product_id

Why Not category?

Strategy 1: Replicating Reference Tables

Concept

How to Implement

Benefits

When NOT to Use

Strategy 2: Collocating Related Data in the Same Shard

Concept

How It Works

Benefits

When NOT to Use

Strategy 3: Using a Mapping Table for Efficient Joins

Concept

How It Works

Benefits

When NOT to Use

Strategy 4: Precomputed Join Tables (Materialized Views)

Concept

How It Works

Benefits

When NOT to Use

Choosing the Right Strategy: Trade-offs and Considerations

General Guidelines

Conclusion

Related

Partner Resources