DZone
Java Zone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
  • Refcardz
  • Trend Reports
  • Webinars
  • Zones
  • |
    • Agile
    • AI
    • Big Data
    • Cloud
    • Database
    • DevOps
    • Integration
    • IoT
    • Java
    • Microservices
    • Open Source
    • Performance
    • Security
    • Web Dev
DZone > Java Zone > Choosing a Good Sharding Key in MongoDB (and MySQL)

Choosing a Good Sharding Key in MongoDB (and MySQL)

Peter Zaitsev user avatar by
Peter Zaitsev
·
Mar. 21, 15 · Java Zone · Interview
Like (0)
Save
Tweet
5.94K Views

Join the DZone community and get the full member experience.

Join For Free

MongoDB 3.0 was recently released. Instead of focusing on what’s new – that is so easy to find, let’s rather talk about something that has not changed a lot since the early MongoDB days. This topic is sharding and most specifically: how to choose a good sharding key. Note that most of the discussion will also apply to MySQL, so if you are more interested in sharding than in MongoDB, it could still be worth reading.

When do you want to shard?

In general sharding is recommended with MongoDB as soon as any of these conditions is met:

  • #1: A single server can no longer handle the write workload.
  • #2: The working set no longer fits in memory.
  • #3: The dataset is too large to easily fit in a single server.

Note that #1 and #2 are by far the most common reason why people need sharding. Also note that in the MySQL world, #2 does not imply that you need sharding.

What are the properties of a good sharding key?

The starting point is that a cross-shard query is very expensive in a sharded environment. It is easy to understand why: the query has to be executed independently on several shards, and then results have to be merged.

With MongoDB, mongos will transparently route queries to the right shards and will automatically merge the results: this is very handy but the hidden complexity can also make you forget that you have executed a very poorly optimized query.

This is where the choice of a sharding key is so critical: choose the right key and most queries will be simple and efficient, choose a wrong one and you’ll have ugly and slow queries.

Actually a good sharding does not need to have tens of properties, but only two:

  • Insertions should be as much balanced as possible across all shards.
  • Each query should be able to be executed by retrieving data from as little shards as possible (ideally 1 shard).

Sounds quite easy, right? However depending on your use case, it may be quite difficult to find a good sharding key. Let’s look at a few examples.

Social Network

Say we have users who can be connected to other users, who can read or write posts and who have their own wall.

All 3 collections can become very large, so sharding all will be necessary over time.

For the user and wall collection the user_id field is an obvious choice and it actually meets both criteria for a good sharding key.

For the post collection, user_id also looks like an obvious choice, but if you think about how the collection is accessed for reads, you will realize that you will need to fetch it using its post_id, not its user_id (simply because a user can have multiple posts). If you shard by user_id, any read to a single post will be broadcast to all shards: this is clearly not a good option.

So using post_id is a better choice. However it only meets criteria #2: most posts are never updated, so all the writes are insertions that will go to a single shard. However the traffic on the post collection is strongly in favor of reads, so being able to speed up reads while not slowing down writes is probably an acceptable tradeoff.

Access Logs

The workload here is very specific: write-intensive and append-only.

Sharding by ObjectId is definitely a bad idea: while data can be easily spread across all shards, all writes will only go to one shard, so you will have no benefit compared to a non-sharded setup when it comes to scale the writes.

A better solution is to use a hash of the ObjectId: that way data AND writes will be spread across all shards.

Another good option would be to use another field in your documents that you know is evenly distributed across the whole dataset. Such field may not exist though, that’s why hashing the ObjectId is a more generic solution.

Ecommerce

MongoDB can be a good option to store a product catalog: being schemaless, it can easily store products with very different attributes.

To be usable such a catalog must be searchable. This means that many indexes need to be added, and the working set will probably grow very quickly. In this case your main concern is probably not to scale the writes, but to make reads as efficient as possible.

Sharding can be an option because if done properly, each shard will act as a coarse-grained index. Now the issue is to find which field(s) will evenly distribute the dataset. Most likely a single field will not be enough, you will have to use a compound sharding key.

Here I would say that there is no generic solution, but if the products are for instance targeted at either kid, woman or man and if you have several categories of products, a potential sharding key would be (target, category, sku).

Note that in this case, reading from secondaries may be enough to ensure good scalability.

Conclusion

As you can see, choosing a correct sharding key is not always easy: do not assume that because some apps are sharded by some fields, you should do the same. You may need a totally different strategy or sharding may even not be the right solution for your issues.

If you are interested in learning more about MongoDB, I will be presenting a free webinar on March 25 at 10 a.m. Pacific time. It will be an introduction to MongoDB for MySQL DBAs and developers. Register here if you are interested.

MongoDB Database MySQL Shard (database architecture)

Published at DZone with permission of Peter Zaitsev, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Popular on DZone

  • 10 Steps to Become an Outstanding Java Developer
  • 6 Things Startups Can Do to Avoid Tech Debt
  • Ultra-Fast Microservices: When Microstream Meets Wildfly
  • How to Generate Fake Test Data

Comments

Java Partner Resources

ABOUT US

  • About DZone
  • Send feedback
  • Careers
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • MVB Program
  • Become a Contributor
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 600 Park Offices Drive
  • Suite 300
  • Durham, NC 27709
  • support@dzone.com
  • +1 (919) 678-0300

Let's be friends:

DZone.com is powered by 

AnswerHub logo