Sharding: What It Is and How It's Done

A potential interview question from the folks at Hibernating Rhinos, this one tackling sharding functions.

The following is likely to end up in the list of questions we’ll ask candidates to answer when they apply to Hibernating Rhinos.

Imagine a sharded database. A sharded database is one where the data is split among multiple nodes. To make things simple, we will assume that each datum in the database has a 64-bit key associated with it, and we are interested in distributing the information evenly among the nodes. This can be done using Jump Consistent Hashing (see the paper for details), and can be implemented using the following simple C# function:

private static int JumpConsistentHash(ulong key, int numBuckets)
{
    long chosenBucket = -1;
    long index = 0;
    while (index < numBuckets)
    {
        chosenBucket = index;
        key = key * 2862933555777941757UL + 1; // 64-bit LCG step
        // Jump forward based on the top 31 bits of the updated key.
        index = (long)((chosenBucket + 1) * ((double)(1L << 31) / (double)((key >> 33) + 1)));
    }
    return (int)chosenBucket;
}

This function takes a key and, given how many nodes there are in the cluster, returns the node on which that key resides.
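As a quick illustration (a minimal usage sketch; the keys here are arbitrary values, not real data), routing a handful of keys to a five-node cluster looks like this:

// Hypothetical usage: route a few arbitrary 64-bit keys to a 5-node cluster.
ulong[] sampleKeys = { 42UL, 1_000_000UL, 987_654_321_123UL };
foreach (var key in sampleKeys)
{
    int node = JumpConsistentHash(key, 5);
    Console.WriteLine($"key {key} -> node {node}");
}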

So far, so good, and this makes quite a lot of things much simpler. This function ensures that roughly 1/N of the data items in the database will need to move to a new node when one is introduced, which is pretty much exactly what we want in a sharded environment. However, this function doesn't help us figure out what to move.
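One rough way to see that property in action (a hedged sketch over random keys rather than real data; the sample size and seed are arbitrary) is to count how many keys change buckets when the cluster grows from five nodes to six; the count should come out at roughly 1/6 of the sample:

// Sketch: estimate the fraction of keys that move when going from 5 to 6 nodes.
var rng = new Random(1234);
var buffer = new byte[8];
const int sampleSize = 1_000_000;
int moved = 0;
for (int i = 0; i < sampleSize; i++)
{
    rng.NextBytes(buffer);
    ulong key = BitConverter.ToUInt64(buffer, 0);
    if (JumpConsistentHash(key, 5) != JumpConsistentHash(key, 6))
        moved++;
}
// Expect roughly 1/6 ≈ 16.7% of the keys to land on the new node.
Console.WriteLine($"Moved: {moved / (double)sampleSize:P1}");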

Assume that we already have a database with five nodes and ten billion data items spread across them according to the jump consistent hash function. Because of load, we need to add two additional nodes to the cluster, which means moving 2/7 of the cluster data (2.8 billion data items) to the new nodes. However, since moving the data items alone is going to be costly, we want to avoid scanning through all ten billion items on all nodes just to figure out which ones need to be shipped to the new nodes and which ones should remain where they are.
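To make those numbers concrete (again a rough sketch over random keys, not the actual ten billion items), one can check two things: roughly 2/7 of the keys change assignment when going from five nodes to seven, and every key that moves lands on one of the two new nodes (5 or 6), never on another existing node:

// Sketch: simulate growing the cluster from 5 nodes to 7 on random keys.
var rng = new Random(42);
var buffer = new byte[8];
const int sampleSize = 1_000_000;
int moved = 0, movedToNewNodes = 0;
for (int i = 0; i < sampleSize; i++)
{
    rng.NextBytes(buffer);
    ulong key = BitConverter.ToUInt64(buffer, 0);
    int before = JumpConsistentHash(key, 5);
    int after = JumpConsistentHash(key, 7);
    if (before == after)
        continue;
    moved++;
    if (after >= 5)
        movedToNewNodes++; // relocated items should only ever land on nodes 5 or 6
}
// Expect moved / sampleSize ≈ 2/7 ≈ 28.6%, and movedToNewNodes == moved.
Console.WriteLine($"Moved: {moved / (double)sampleSize:P1}, all to new nodes: {moved == movedToNewNodes}");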

Suggest a way that will allow the database to find out which data items need to be moved to the new nodes, without having to scan through all of them. In other words, anything that requires O(number of items in each node) work is out.

You are rated on the impact of your suggestion on overall cluster behavior: the cheaper your option, the better. You are free to store additional information (at cluster creation/modification, as data items are inserted into or deleted from the cluster, etc.) if it helps, but be aware that any impact on standard operations (reads and writes) should be minimal and well justified.

You only need to consider adding nodes to the cluster; handling the removal of nodes is not required.

Published at DZone with permission of Ayende Rahien, DZone MVB. See the original article here.
