DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Please enter at least three characters to search
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Zones

Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks

Modernize your data layer. Learn how to design cloud-native database architectures to meet the evolving demands of AI and GenAI workkloads.

Secure your stack and shape the future! Help dev teams across the globe navigate their software supply chain security challenges.

Releasing software shouldn't be stressful or risky. Learn how to leverage progressive delivery techniques to ensure safer deployments.

Avoid machine learning mistakes and boost model performance! Discover key ML patterns, anti-patterns, data strategies, and more.

Related

  • Goose Migrations for Smooth Database Changes
  • ArangoDB: Achieving Success With a Multivalue Database
  • A Comprehensive Guide to Database Sharding: Building Scalable Systems
  • Practical Generators in Go 1.23 for Database Pagination

Trending

  • Microsoft Azure Synapse Analytics: Scaling Hurdles and Limitations
  • Beyond ChatGPT, AI Reasoning 2.0: Engineering AI Models With Human-Like Reasoning
  • Unlocking the Potential of Apache Iceberg: A Comprehensive Analysis
  • How the Go Runtime Preempts Goroutines for Efficient Concurrency
  1. DZone
  2. Data Engineering
  3. Databases
  4. Avoid Cross-Shard Data Movement in Distributed Databases

Avoid Cross-Shard Data Movement in Distributed Databases

Learn four effective strategies to optimize distributed joins, reduce network overhead, and improve query performance in sharded databases.

By 
Baskar Sikkayan user avatar
Baskar Sikkayan
·
Mar. 26, 25 · Tutorial
Likes (4)
Comment
Save
Tweet
Share
4.4K Views

Join the DZone community and get the full member experience.

Join For Free

Modern applications rely on distributed databases to handle massive amounts of data and scale seamlessly across multiple nodes. While sharding helps distribute the load, it also introduces a major challenge — cross-shard joins and data movement, which can significantly impact performance.

When a query requires joining tables stored on different shards, the database must move data across nodes, leading to:

  • High network latency due to data shuffling
  • Increased query execution time as distributed queries become expensive
  • Higher CPU usage as more computation is needed to merge data across nodes

But what if we could eliminate unnecessary cross-shard joins? 

In this article, we'll explore four proven strategies to avoid data movement in distributed databases:

  1. Replicating reference tables
  2. Collocating related data in the same shard
  3. Using a mapping table for efficient joins
  4. Precomputed join tables (materialized views)

By applying these strategies, you can optimize query execution, minimize network overhead, and scale your database efficiently.

Understanding the Problem: Cross-Shard Joins and Data Movement

How Data Movement Happens

Imagine a typical e-commerce database where:

The Orders table is sharded by CustomerID.

SQL
 
CREATE TABLE Orders (
    order_id INT PRIMARY KEY AUTO_INCREMENT,  -- Unique ID for the order
    customer_id INT NOT NULL,                 -- Foreign key to Customers (Sharding Key)
    product_id INT NOT NULL,                   -- Foreign key to Products
    quantity INT DEFAULT 1,                    -- Number of units ordered
    total_price DECIMAL(10,2),                 -- Total price of the order
    order_status ENUM('Pending', 'Shipped', 'Delivered', 'Cancelled') DEFAULT 'Pending',
    order_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    shard_key INT GENERATED ALWAYS AS (customer_id) VIRTUAL -- Helps with sharding logic
);


Column Name
Data Type
Description
order_id
INT (Primary Key, Auto Increment)
Unique identifier for the order.
customer_id
INT (Sharding Key)
Foreign key to Customers table; used for sharding.
product_id
INT (Foreign Key)
References Products.product_id.
quantity
INT (Default: 1)
Number of units ordered.
total_price
DECIMAL(10,2)
Total cost of the order.
order_status
ENUM('Pending', 'Shipped', 'Delivered', 'Cancelled') (Default: 'Pending')
Current order status.
order_date
TIMESTAMP (Default: CURRENT_TIMESTAMP)
Timestamp when the order was placed.
shard_key
INT (Virtual Column)
Helps in sharding logic by using customer_id.


Sharding Key: customer_id

Ensures that all orders of a given customer remain on the same shard.

Why Not order_id? 

Because customers place multiple orders, keeping all their orders on the same shard optimizes query performance for customer history retrieval.

The Products table is sharded by ProductID.

SQL
 
CREATE TABLE Products (
    product_id INT PRIMARY KEY AUTO_INCREMENT,  -- Unique product ID (Sharding Key)
    product_name VARCHAR(255) NOT NULL,         -- Name of the product
    category VARCHAR(100),                      -- Product category (e.g., Electronics, Clothing)
    price DECIMAL(10,2) NOT NULL,               -- Product price
    stock_quantity INT DEFAULT 0,               -- Available stock count
    last_updated TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    shard_key INT GENERATED ALWAYS AS (product_id) VIRTUAL -- Helps with sharding logic
);


Column Name
Data Type
Description
product_id
INT (Primary Key, Auto Increment)
Unique identifier for the product (Sharding Key).
product_name
VARCHAR(255)
Name of the product.
category
VARCHAR(100)
Product category (e.g., Electronics, Clothing).
price
DECIMAL(10,2)
Price of the product.
stock_quantity
INT (Default: 0)
Available stock count.
last_updated
TIMESTAMP (Default: CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP)
Timestamp for last update.
shard_key
INT (Virtual Column)
Helps in sharding logic by using product_id.


Sharding Key: product_id

Ensures that product details are distributed evenly across shards.

Why Not category? 

Because product_id ensures uniform distribution, whereas some categories may be more popular and could cause unbalanced shards.

Now, let's say we want to retrieve the product names for a specific customer’s orders:

SQL
 
SELECT o.order_id, o.customer_id, p.product_name
FROM Orders o
JOIN Products p ON o.product_id = p.product_id
WHERE o.customer_id = 123;


Since Orders and Products are sharded differently, fetching the product_name requires data movement between shards. This can cause significant performance degradation, especially as the system scales.

Strategy 1: Replicating Reference Tables

Concept

Instead of storing reference tables on separate shards, we replicate them across all nodes. This allows queries to perform joins locally, eliminating inter-shard communication.

How to Implement

In a database like PostgreSQL Citus, we can create a replicated reference table:

SQL
 
CREATE TABLE Products (
    product_id INT PRIMARY KEY,
    product_name TEXT
) DISTRIBUTED REPLICATED;


Benefits

  • Eliminates cross-shard joins for lookups.
  • Fast query execution, as the reference data is available on all shards.

When NOT to Use

  • If the reference table is too large (e.g., millions of records).
  • If the reference data changes frequently, causing replication overhead.

Strategy 2: Collocating Related Data in the Same Shard

Concept

Instead of sharding Orders by CustomerID and Products by ProductID, we shard both tables by the same key (CustomerID).

How It Works

SQL
 
SELECT o.order_id, o.customer_id, p.product_name
FROM Orders o
JOIN Products p ON o.product_id = p.product_id
WHERE o.customer_id = 123;


Now, because both tables are in the same shard, the join happens locally, avoiding cross-shard data movement.

Benefits

  • Optimized for queries that frequently join tables based on a shared key.
  • Works well for applications where customer-specific data is frequently accessed.

When NOT to Use

  • If tables grow at different rates, leading to uneven shard distribution.
  • If queries involve cross-customer searches, making sharding by CustomerID inefficient.

Strategy 3: Using a Mapping Table for Efficient Joins

Concept

Instead of performing cross-shard joins, we use a mapping table that stores which shard contains the required data.

How It Works

We create a customer-product mapping table:

SQL
 
CREATE TABLE CustomerProductMap (
    customer_id INT,
    product_id INT,
    shard_id INT,
    PRIMARY KEY (customer_id, product_id)
);


Before querying, we first retrieve the shard_id:

SQL
 
SELECT shard_id FROM CustomerProductMap WHERE customer_id = 123;


Then, we query only the relevant shard:

SQL
 
SELECT o.order_id, o.customer_id, p.product_name
FROM Orders o
JOIN Products p ON o.product_id = p.product_id
WHERE o.customer_id = 123
AND o.shard_id = <retrieved_shard>;


Benefits

  • Reduces the number of shards queried, improving performance.
  • Works well for large datasets that cannot be replicated.

When NOT to Use

  • If updates to CustomerProductMap are frequent, causing maintenance overhead.

Strategy 4: Precomputed Join Tables (Materialized Views)

Concept

Instead of executing joins at query time, we precompute and store frequently used join results in a materialized view or a separate table.

How It Works

We create a denormalized table that precomputes Orders and Products:

SQL
 
CREATE TABLE Precomputed_Order_Products (
    order_id INT PRIMARY KEY,
    customer_id INT,
    product_id INT,
    product_name TEXT,
    order_date TIMESTAMP
) SHARDED BY (customer_id);


A background ETL process precomputes the joins and inserts the data:

SQL
 
INSERT INTO Precomputed_Order_Products (order_id, customer_id, product_id, product_name, order_date)
SELECT o.order_id, o.customer_id, p.product_id, p.product_name, o.order_date
FROM Orders o
JOIN Products p ON o.product_id = p.product_id;


Querying without joins:

SQL
 
SELECT order_id, customer_id, product_name, order_date
FROM Precomputed_Order_Products
WHERE customer_id = 123;


Benefits

  • No joins required at query time, significantly improving performance.
  • Ideal for read-heavy workloads where query speed is critical.

When NOT to Use

  • If the dataset changes frequently, requiring frequent recomputation.
  • If storage costs for redundant data are too high.

Choosing the Right Strategy: Trade-offs and Considerations

Approach Pros Cons
Replicating Reference Tables Fast joins, eliminates data movement Only works for small tables
Collocating Data in the Same Shard Efficient joins, avoids movement Requires careful shard key selection
Mapping Table Approach Efficient query routing Extra storage & maintenance overhead
Precomputed Join Tables Best for read-heavy workloads Requires background ETL jobs


General Guidelines

  • Use replication – For small reference tables
  • Use colocation – When queries naturally group by a specific key
  • Use mapping tables – When reference tables are too large to replicate
  • Use precomputed joins – For high-performance analytical queries

Conclusion

Cross-shard data movement is one of the biggest performance challenges in distributed databases. We can significantly optimize query performance and reduce unnecessary network overhead by applying reference table replication, colocation, mapping tables, and precomputed joins.

Choosing the right strategy depends on query patterns, data distribution, and storage constraints. With the right approach, you can build highly scalable, performant distributed systems without being hindered by cross-shard joins.

Database Shard (database architecture) Data Types

Opinions expressed by DZone contributors are their own.

Related

  • Goose Migrations for Smooth Database Changes
  • ArangoDB: Achieving Success With a Multivalue Database
  • A Comprehensive Guide to Database Sharding: Building Scalable Systems
  • Practical Generators in Go 1.23 for Database Pagination

Partner Resources

×

Comments
Oops! Something Went Wrong

The likes didn't load as expected. Please refresh the page and try again.

ABOUT US

  • About DZone
  • Support and feedback
  • Community research
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 100
  • Nashville, TN 37211
  • support@dzone.com

Let's be friends:

Likes
There are no likes...yet! 👀
Be the first to like this post!
It looks like you're not logged in.
Sign in to see who liked this post!