Avoid Cross-Shard Data Movement in Distributed Databases
Learn four effective strategies to optimize distributed joins, reduce network overhead, and improve query performance in sharded databases.
Join the DZone community and get the full member experience.
Join For FreeModern applications rely on distributed databases to handle massive amounts of data and scale seamlessly across multiple nodes. While sharding helps distribute the load, it also introduces a major challenge — cross-shard joins and data movement, which can significantly impact performance.
When a query requires joining tables stored on different shards, the database must move data across nodes, leading to:
- High network latency due to data shuffling
- Increased query execution time as distributed queries become expensive
- Higher CPU usage as more computation is needed to merge data across nodes
But what if we could eliminate unnecessary cross-shard joins?
In this article, we'll explore four proven strategies to avoid data movement in distributed databases:
- Replicating reference tables
- Collocating related data in the same shard
- Using a mapping table for efficient joins
- Precomputed join tables (materialized views)
By applying these strategies, you can optimize query execution, minimize network overhead, and scale your database efficiently.
Understanding the Problem: Cross-Shard Joins and Data Movement
How Data Movement Happens
Imagine a typical e-commerce database where:
The Orders table is sharded by CustomerID
.
CREATE TABLE Orders (
order_id INT PRIMARY KEY AUTO_INCREMENT, -- Unique ID for the order
customer_id INT NOT NULL, -- Foreign key to Customers (Sharding Key)
product_id INT NOT NULL, -- Foreign key to Products
quantity INT DEFAULT 1, -- Number of units ordered
total_price DECIMAL(10,2), -- Total price of the order
order_status ENUM('Pending', 'Shipped', 'Delivered', 'Cancelled') DEFAULT 'Pending',
order_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
shard_key INT GENERATED ALWAYS AS (customer_id) VIRTUAL -- Helps with sharding logic
);
Column Name
|
Data Type
|
Description
|
---|---|---|
order_id
|
INT (Primary Key, Auto Increment)
|
Unique identifier for the order.
|
customer_id
|
INT (Sharding Key)
|
Foreign key to Customers table; used for sharding.
|
product_id
|
INT (Foreign Key)
|
References Products.product_id .
|
quantity
|
INT (Default: 1)
|
Number of units ordered.
|
total_price
|
DECIMAL(10,2)
|
Total cost of the order.
|
order_status
|
ENUM('Pending', 'Shipped', 'Delivered', 'Cancelled') (Default: 'Pending')
|
Current order status.
|
order_date
|
TIMESTAMP (Default: CURRENT_TIMESTAMP)
|
Timestamp when the order was placed.
|
shard_key
|
INT (Virtual Column)
|
Helps in sharding logic by using customer_id .
|
Sharding Key: customer_id
Ensures that all orders of a given customer remain on the same shard.
Why Not order_id?
Because customers place multiple orders, keeping all their orders on the same shard optimizes query performance for customer history retrieval.
The Products table is sharded by ProductID
.
CREATE TABLE Products (
product_id INT PRIMARY KEY AUTO_INCREMENT, -- Unique product ID (Sharding Key)
product_name VARCHAR(255) NOT NULL, -- Name of the product
category VARCHAR(100), -- Product category (e.g., Electronics, Clothing)
price DECIMAL(10,2) NOT NULL, -- Product price
stock_quantity INT DEFAULT 0, -- Available stock count
last_updated TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
shard_key INT GENERATED ALWAYS AS (product_id) VIRTUAL -- Helps with sharding logic
);
Column Name
|
Data Type
|
Description
|
---|---|---|
product_id
|
INT (Primary Key, Auto Increment)
|
Unique identifier for the product (Sharding Key).
|
product_name
|
VARCHAR(255)
|
Name of the product.
|
category
|
VARCHAR(100)
|
Product category (e.g., Electronics, Clothing).
|
price
|
DECIMAL(10,2)
|
Price of the product.
|
stock_quantity
|
INT (Default: 0)
|
Available stock count.
|
last_updated
|
TIMESTAMP (Default: CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP)
|
Timestamp for last update.
|
shard_key
|
INT (Virtual Column)
|
Helps in sharding logic by using product_id .
|
Sharding Key: product_id
Ensures that product details are distributed evenly across shards.
Why Not category?
Because product_id
ensures uniform distribution, whereas some categories may be more popular and could cause unbalanced shards.
Now, let's say we want to retrieve the product names for a specific customer’s orders:
SELECT o.order_id, o.customer_id, p.product_name
FROM Orders o
JOIN Products p ON o.product_id = p.product_id
WHERE o.customer_id = 123;
Since Orders
and Products
are sharded differently, fetching the product_name
requires data movement between shards. This can cause significant performance degradation, especially as the system scales.
Strategy 1: Replicating Reference Tables
Concept
Instead of storing reference tables on separate shards, we replicate them across all nodes. This allows queries to perform joins locally, eliminating inter-shard communication.
How to Implement
In a database like PostgreSQL Citus, we can create a replicated reference table:
CREATE TABLE Products (
product_id INT PRIMARY KEY,
product_name TEXT
) DISTRIBUTED REPLICATED;
Benefits
- Eliminates cross-shard joins for lookups.
- Fast query execution, as the reference data is available on all shards.
When NOT to Use
- If the reference table is too large (e.g., millions of records).
- If the reference data changes frequently, causing replication overhead.
Strategy 2: Collocating Related Data in the Same Shard
Concept
Instead of sharding Orders by CustomerID
and Products by ProductID
, we shard both tables by the same key (CustomerID
).
How It Works
SELECT o.order_id, o.customer_id, p.product_name
FROM Orders o
JOIN Products p ON o.product_id = p.product_id
WHERE o.customer_id = 123;
Now, because both tables are in the same shard, the join happens locally, avoiding cross-shard data movement.
Benefits
- Optimized for queries that frequently join tables based on a shared key.
- Works well for applications where customer-specific data is frequently accessed.
When NOT to Use
- If tables grow at different rates, leading to uneven shard distribution.
- If queries involve cross-customer searches, making sharding by
CustomerID
inefficient.
Strategy 3: Using a Mapping Table for Efficient Joins
Concept
Instead of performing cross-shard joins, we use a mapping table that stores which shard contains the required data.
How It Works
We create a customer-product mapping table:
CREATE TABLE CustomerProductMap (
customer_id INT,
product_id INT,
shard_id INT,
PRIMARY KEY (customer_id, product_id)
);
Before querying, we first retrieve the shard_id
:
SELECT shard_id FROM CustomerProductMap WHERE customer_id = 123;
Then, we query only the relevant shard:
SELECT o.order_id, o.customer_id, p.product_name
FROM Orders o
JOIN Products p ON o.product_id = p.product_id
WHERE o.customer_id = 123
AND o.shard_id = <retrieved_shard>;
Benefits
- Reduces the number of shards queried, improving performance.
- Works well for large datasets that cannot be replicated.
When NOT to Use
- If updates to
CustomerProductMap
are frequent, causing maintenance overhead.
Strategy 4: Precomputed Join Tables (Materialized Views)
Concept
Instead of executing joins at query time, we precompute and store frequently used join results in a materialized view or a separate table.
How It Works
We create a denormalized table that precomputes Orders
and Products
:
CREATE TABLE Precomputed_Order_Products (
order_id INT PRIMARY KEY,
customer_id INT,
product_id INT,
product_name TEXT,
order_date TIMESTAMP
) SHARDED BY (customer_id);
A background ETL process precomputes the joins and inserts the data:
INSERT INTO Precomputed_Order_Products (order_id, customer_id, product_id, product_name, order_date)
SELECT o.order_id, o.customer_id, p.product_id, p.product_name, o.order_date
FROM Orders o
JOIN Products p ON o.product_id = p.product_id;
Querying without joins:
SELECT order_id, customer_id, product_name, order_date
FROM Precomputed_Order_Products
WHERE customer_id = 123;
Benefits
- No joins required at query time, significantly improving performance.
- Ideal for read-heavy workloads where query speed is critical.
When NOT to Use
- If the dataset changes frequently, requiring frequent recomputation.
- If storage costs for redundant data are too high.
Choosing the Right Strategy: Trade-offs and Considerations
Approach | Pros | Cons |
---|---|---|
Replicating Reference Tables | Fast joins, eliminates data movement | Only works for small tables |
Collocating Data in the Same Shard | Efficient joins, avoids movement | Requires careful shard key selection |
Mapping Table Approach | Efficient query routing | Extra storage & maintenance overhead |
Precomputed Join Tables | Best for read-heavy workloads | Requires background ETL jobs |
General Guidelines
- Use replication – For small reference tables
- Use colocation – When queries naturally group by a specific key
- Use mapping tables – When reference tables are too large to replicate
- Use precomputed joins – For high-performance analytical queries
Conclusion
Cross-shard data movement is one of the biggest performance challenges in distributed databases. We can significantly optimize query performance and reduce unnecessary network overhead by applying reference table replication, colocation, mapping tables, and precomputed joins.
Choosing the right strategy depends on query patterns, data distribution, and storage constraints. With the right approach, you can build highly scalable, performant distributed systems without being hindered by cross-shard joins.
Opinions expressed by DZone contributors are their own.
Comments