Sharding Pitfalls Part III: Chunk Balancing and Collection Limits

By Francesca Krihely · Oct. 27, 14

In Parts 1 and 2 we covered a number of common issues people run into when managing a sharded MongoDB cluster. In this final post of the series we will cover a subtle but important distinction in how a sharded cluster is balanced, as well as a limitation that is relatively easy to work around but is nonetheless surprising when it comes up.

6. Chunk balancing != data balancing != traffic balancing

The balancer in a sharded cluster cares about just one thing:

Are chunks for a given collection evenly balanced across all shards?

If they are not, it will take steps to rectify that imbalance. This all sounds perfectly logical, and even with extra complexity like tagging involved, the logic is pretty straightforward. If we assume that all chunks are equal, then we can rest assured that our data is evenly balanced across all the shards in our cluster and sleep easy at night.

Although that is sometimes, perhaps even frequently, the case, it is not always true: chunks are not always equal. There can be massive “jumbo” chunks that exceed the maximum chunk size (64MiB by default), completely empty chunks, and everything in between.

Let’s use an example from our first pitfall, the monotonically increasing shard key. For our example, we have picked just such a key to shard on (date), and up until this point we have had just one shard and had not sharded the collection. We are about to add a second shard to our cluster and so we enable sharding on the collection and do the necessary admin work to add the new shard into the cluster.
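For concreteness, those admin steps look roughly like the following in the mongo shell. This is only a sketch: the database, collection, field, and shard names (traffic, events, created, rs1) are hypothetical stand-ins, not taken from the original example.

    // Run against a mongos for the cluster.
    // Enable sharding on the database, then shard the collection on the
    // monotonically increasing date field from the example.
    sh.enableSharding("traffic")
    sh.shardCollection("traffic.events", { created: 1 })

    // Add the second shard (a replica set named rs1 in this sketch).
    sh.addShard("rs1/rs1-node1:27017,rs1-node2:27017")

    // Review the resulting chunk layout and shard membership.
    sh.status()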

Once the collection is enabled for sharding, the first shard contains all the newly minted chunks. Let’s represent them in a simplified table of 10 chunks. This is not representative of a real data set, but it will do for illustrative purposes:

Table 1 - Initial Chunk Layout

Now we add our second shard. The balancer will kick in and attempt to distribute the chunks evenly. It will do this by moving the lowest-range chunks to the new shard until the chunk counts are even. Once it is finished balancing, our table now looks like this:

Table 2 - Balanced Chunk Layout
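If you want to confirm the even chunk counts the balancer is aiming for, you can group the chunk metadata by shard. A minimal sketch against the hypothetical traffic.events namespace from above (recent MongoDB versions key config.chunks by collection UUID rather than by an ns field, so the $match stage would need adjusting there):

    // Count chunks per shard for the example collection, straight from
    // the cluster metadata in the config database.
    db.getSiblingDB("config").chunks.aggregate([
      { $match: { ns: "traffic.events" } },
      { $group: { _id: "$shard", chunkCount: { $sum: 1 } } }
    ])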

That looks pretty good at the moment, but let's imagine that more recent chunks are more likely to see activity (say, updates) than older chunks. Adding the traffic share estimates for each chunk shows that shard1 is taking far more traffic (72%) than shard2 (28%), despite the chunks appearing balanced based on their approximate size. Hence, chunk balancing is not equal to traffic balancing.

Using that same example, let’s add another wrinkle - periodic deletion of old data. Every 3 months we run a job to delete any data older than 12 months. Let’s look at the impact of that on our table after we run it for the first time (assuming the first run happens on July 1st 2015).

Table 3 - Post-Delete Chunk Layout

The distribution of data is now completely skewed toward shard1 - shard2 is in fact empty! However, the balancer is completely unaware of this imbalance - the chunk count has remained the same the entire time, and as far as it is concerned the system is in a steady state. With no data on shard2, our traffic imbalance as seen above will be even worse, and we have essentially negated the benefit of having a second shard for this collection.
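This kind of skew is easy to detect even though the balancer ignores it: ask the collection for its shard distribution, which reports data size, document count, and chunk count per shard. Again, a sketch using the hypothetical names from above:

    // The balancer only counts chunks; this reports the actual data size and
    // document count per shard, which would reveal that shard2 is empty.
    db.getSiblingDB("traffic").events.getShardDistribution()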

Possible Mitigation Strategies

  • If data and traffic balance are important, select an appropriate shard key
  • Move chunks manually to address the imbalance: swap “hot” chunks for “cool” chunks and empty chunks for larger ones (see the sketch below)
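A manual move might look like the following sketch. The key value and target shard name are hypothetical; in practice you would pick them from sh.status() or the config.chunks metadata, and you may want to pause the balancer while you shuffle chunks by hand.

    // Optionally pause the balancer while rearranging chunks manually.
    sh.stopBalancer()

    // Move the chunk containing this (hypothetical) key value onto the
    // under-loaded shard, mixing hot and cool ranges across shards.
    sh.moveChunk(
      "traffic.events",
      { created: ISODate("2015-01-01T00:00:00Z") },
      "shard0001"
    )

    // Re-enable the balancer afterwards.
    sh.startBalancer()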

7. Waiting too long to shard a collection (collection too large)

This is not very common, but when it falls on your shoulders, it can be quite challenging to solve. There is a maximum data size for a collection when it is initially split, which is a function of the chunk size and the data size, as noted on the limits page.

If your collection contains less than 256GiB of data, then there will be no issue. If the collection size exceeds 256GiB but is less than 400GiB, then MongoDB may be able to do an initial split without any special measures being taken.

Otherwise, with larger initial data sizes and the default settings, the initial split will fail. It is worth noting that once split, the collection may grow as needed, without any real limitation, as long as you can continue to add shards as the data size grows.

Possible Mitigation Strategies

Since the limit is dictated by the chunk size and the data size, and assuming there is not much to be done about the data size, the remaining variable is the chunk size. This is adjustable (the default is 64MiB) and can be raised to let a large collection split initially, then reduced once that has been completed.

The required chunk size increase will depend on the actual data size. However, this is relatively easy to work out: simply divide your data size by 256GiB and then multiply that figure by 64MiB (rounding up if it is not a nice even number). As an example, let's consider a 4TiB collection:

4TiB ÷ 256GiB = 16
64MiB × 16 = 1024MiB

Hence, set the max chunk size to 1024MiB, then perform the initial sharding of the collection, and finally reduce the chunk size back to 64MiB using the same procedure.
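A minimal sketch of that procedure in the (legacy) mongo shell, assuming a hypothetical 4TiB collection bigdb.bigcoll with a date shard key; the chunk size setting lives in the config database and is expressed in megabytes:

    var conf = db.getSiblingDB("config")

    // Raise the maximum chunk size to 1024MB so the initial split of the
    // large collection can succeed.
    conf.settings.save({ _id: "chunksize", value: 1024 })

    // Perform the initial sharding of the large collection.
    sh.enableSharding("bigdb")
    sh.shardCollection("bigdb.bigcoll", { created: 1 })

    // Once the initial split has completed, restore the default chunk size.
    conf.settings.save({ _id: "chunksize", value: 64 })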

Thanks for reading through the Sharding Pitfall series! If you want to learn more about managing MongoDB deployments at scale, sign up for my online education course, MongoDB Advanced Deployment and Operations.

Planning for scale? No problem: MongoDB is here to help. Get a preview of what it’s like to work with MongoDB’s Technical Services Team. Give us some details on your deployment and we can set you up with an expert who can provide detailed guidance on all aspects of scaling with MongoDB, based on our experience with hundreds of deployments.



Published at DZone with permission of Francesca Krihely, DZone MVB. See the original article here.

