MongoDB enables horizontal scaling of the database using shards. Sharding is generally used when your data grows too large to fit on a single machine, or when query performance starts degrading as the data grows.
Note that sharding is not a fix for every slow-running query. In such cases the problem may lie in the structure of the data or in the indexes created on the collection. Sharding does not solve slowness caused by poor or missing indexes. Consider sharding only when the data outgrows the resources of one machine and adding more resources to that machine is not feasible or is more expensive.
If you have decided to shard the MongoDB cluster, it is very important to choose a good shard key. A poorly chosen shard key nullifies the benefits of sharding, and you may end up with a lot more data on one shard than on the others. Also note that the shard key is hard to change at a later stage. Older MongoDB versions did not allow changing it at all; MongoDB 4.4 added the ability to refine a shard key and 5.0 added resharding, but resharding a large collection is an expensive operation. So treat the choice as effectively permanent.
MongoDB offers two sharding strategies: range-based (ranged) and hash-based (hashed).
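To make the difference concrete, here is a minimal sketch in plain Python (no MongoDB involved) of how the two strategies place an ever-increasing key. The boundaries, the shard count and the use of MD5 with a modulo are my own simplifications for illustration; MongoDB's real hashed sharding assigns ranges of hash values to chunks rather than taking a modulo.

```python
import hashlib
from bisect import bisect_right
from collections import Counter

NUM_SHARDS = 4  # assumed cluster size, for illustration only


def range_shard(key, boundaries):
    """Ranged placement: the shard is the range the key falls into."""
    return bisect_right(boundaries, key)


def hashed_shard(key, num_shards=NUM_SHARDS):
    """Hashed placement: hash the key first, then map the hash to a shard.
    MD5 plus modulo is a stand-in, not MongoDB's actual algorithm."""
    digest = hashlib.md5(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Boundaries were picked when keys were in 0..999; new keys keep growing.
boundaries = [250, 500, 750]
new_keys = range(1000, 2000)  # e.g. an auto-incrementing order number

range_counts = Counter(range_shard(k, boundaries) for k in new_keys)
hash_counts = Counter(hashed_shard(k) for k in new_keys)

print("ranged:", dict(range_counts))  # every new key lands on the last shard
print("hashed:", dict(hash_counts))   # new keys are spread over all shards
```

With ranged placement all new writes pile onto the last range until it is split and moved, while hashing spreads them out at the cost of losing efficient range queries on the key.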
The MongoDB documentation explains what you should consider when choosing a shard key; see its "Choose a Shard Key" page. The shard key has to be an indexed field.
For me, though, examples really help in understanding the theory. So I have written some down and shown how I would choose the shard key for each of them.
Note that sharding is configured per collection. Hence, if a collection is small, such as a collection of categories or user roles, don't shard it. Shard only the large collections.
1. E-Commerce Shopping Order/Cart
If you are working on an e-commerce application that stores user orders (or carts) in MongoDB, you generally have to retrieve the data by user. Hence, you want all orders for a given user to be on one shard. For this case, user_key would be a good shard key.
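As a rough sketch of what this looks like in practice, the snippet below builds the document for MongoDB's shardCollection admin command. The namespace shop.orders and the helper function name are hypothetical; only the command name and its key field come from MongoDB itself.

```python
def shard_collection_command(namespace, key_field, hashed=False):
    """Build the document for MongoDB's `shardCollection` admin command.

    `namespace` is "<database>.<collection>"; the value 1 requests ranged
    sharding on the field and "hashed" requests hashed sharding.
    """
    return {
        "shardCollection": namespace,
        "key": {key_field: "hashed" if hashed else 1},
    }


# "shop.orders" is a made-up namespace for the order/cart collection.
cmd = shard_collection_command("shop.orders", "user_key")
print(cmd)

# With pymongo and a connection to a mongos router, this could be sent as:
#   client.admin.command(cmd)
# Older MongoDB versions also need sharding enabled on the database first
# (the `enableSharding` command).
```

The same helper with hashed=True would produce a hashed shard key, which is what I assume hash_id refers to in the later examples.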
2. B2B Order Management
If your application manages orders or purchases by another organization, then you will generally retrieve all orders placed by that organization in one query. The user who actually placed the order may not matter to you.
Example: if three users from Organization A make purchases on the organization's behalf, the dashboard will probably retrieve data only by organization key and not by individual user key. In this scenario, organization_key would be a good choice.
But suppose you need to query by organization for some reports and by user for others. If each user belongs to only one organization, then sharding by organization_key is still fine. But if a user can belong to multiple organizations, then shard by user_key.
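The trade-off can be sketched with a toy model in plain Python. The sample orders and the placement tables (which key value lives on which shard) are invented for illustration; a real cluster decides placement itself. Keep in mind as well that the router can only target a single shard when the query includes the shard key.

```python
def shards_holding(orders, shard_field, placement, filter_field, filter_value):
    """Return the set of shards that hold the orders matching the filter,
    given which shard each value of `shard_field` is placed on."""
    return {
        placement[order[shard_field]]
        for order in orders
        if order[filter_field] == filter_value
    }


# Hypothetical data: dave buys on behalf of two organizations.
orders = [
    {"organization_key": "org_a", "user_key": "alice"},
    {"organization_key": "org_a", "user_key": "bob"},
    {"organization_key": "org_b", "user_key": "carol"},
    {"organization_key": "org_b", "user_key": "dave"},
    {"organization_key": "org_c", "user_key": "dave"},
]

# Hypothetical placement of key values on shards.
org_placement = {"org_a": 0, "org_b": 1, "org_c": 2}
user_placement = {"alice": 0, "bob": 1, "carol": 2, "dave": 3}

# Sharded by organization_key: a single-organization user stays on one shard...
alice_by_org = shards_holding(orders, "organization_key", org_placement, "user_key", "alice")
# ...but a multi-organization user is spread over several shards.
dave_by_org = shards_holding(orders, "organization_key", org_placement, "user_key", "dave")
# Sharded by user_key: dave's orders are together again...
dave_by_user = shards_holding(orders, "user_key", user_placement, "user_key", "dave")
# ...at the price of spreading each organization's orders.
org_a_by_user = shards_holding(orders, "user_key", user_placement, "organization_key", "org_a")

print(alice_by_org, dave_by_org, dave_by_user, org_a_by_user)
```

Whichever key you pick, one of the two report types has to gather data from several shards, so choose the key that matches the more frequent or more latency-sensitive query.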
3. Product Data
If you have a lot of product data, then you can shard it by category or product type. This helps if you query by a particular product type. But if your category list is very limited, it is not advisable to use it as the shard key, because a handful of distinct values cannot be spread evenly across the shards. In such scenarios, you should use hash_id as the shard key.
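Here is a small simulation of why a limited category list is a poor shard key. It assumes hash_id means a hashed _id, uses made-up product data with three categories, and again uses MD5 with a modulo as a stand-in for MongoDB's real placement.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 8  # assumed cluster size, for illustration only


def hashed_shard(value, num_shards=NUM_SHARDS):
    """Map a key value to a shard via a hash (stand-in for MongoDB's own)."""
    digest = hashlib.md5(str(value).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Made-up catalogue: 9000 products, but only three distinct categories.
categories = ["books", "toys", "games"]
products = [{"_id": i, "category": categories[i % 3]} for i in range(9000)]

by_category = Counter(hashed_shard(p["category"]) for p in products)
by_hashed_id = Counter(hashed_shard(p["_id"]) for p in products)

# Three distinct key values can occupy at most three of the eight shards.
print("shards used with category key:", len(by_category))
print("shards used with hashed _id:  ", len(by_hashed_id))
```

However many shards you add, the category key can never use more of them than there are categories, while the hashed _id keeps spreading the documents.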
4. Blogs or Posts
If it's an application that has only one or two bloggers, then I believe you don't need to consider sharding. I don't think one person can create 10 million blogs. I checked with Superman; he is not interested in blogging.
So, if you have a blogging platform, then the shard key for the blog collection can be user_key again. All the blogs by a particular user will be on one shard and can easily be retrieved to show all blogs by that user.
5. Page Access
If you are collecting page access details as time-series data and appending them to MongoDB, the ideal shard key would be page_id. I am suggesting page_id here based on the reports and analytics I can think of. For me, such data is most useful for analyzing page performance, popularity, etc. If your use case is user-specific, then page_id may not fit.
6. Invoices or Payments
Invoices and payments are generally retrieved by user if your application is B2C, or by organization if your application is B2B. So, if it is B2C, use user_key as the shard key, and if it's B2B, use organization_key as the shard key.
7. Hotel or Flight Reservations
Hotel or flight reservations can be tricky, as a reservation can be made by an anonymous user and you may not have a user_key. One option is hash_id, which distributes the data randomly across the shards. But a few performance runs have shown that sharding by hotel_id or flight_id is more efficient than hash_id. Most business use cases require the data to be fetched by hotel_id. Since user_key is not available for anonymous users, hotel_id or flight_id would be a better choice from my perspective.
Compound Shard Key
MongoDB also supports compound shard keys, backed by a compound index. So far I have not found a good reason to use a compound shard key in any of my projects. I will update this article if I find a good one.
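For completeness, this is roughly what a compound shard key would look like as a shardCollection command document. The namespace billing.invoices, the field combination and the helper name are all hypothetical; I am not recommending this particular key.

```python
def compound_shard_command(namespace, fields):
    """Build a `shardCollection` command document with a compound key.

    Field order matters, just as in a compound index, so `fields` is an
    ordered list and each field is sharded by range (value 1).
    """
    return {
        "shardCollection": namespace,
        "key": {field: 1 for field in fields},
    }


# Hypothetical example: invoices keyed by organization first, then user.
cmd = compound_shard_command("billing.invoices", ["organization_key", "user_key"])
print(cmd)

# With pymongo and a connection to a mongos router, this could be sent as:
#   client.admin.command(cmd)
```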
I hope this article helps at least some of you in deciding the right shard key.