Sharding: What It Is and How It's Done

DZone 's Guide to

Sharding: What It Is and How It's Done

A potential interview question from the folks at Hibernating Rhinos, this one tackling sharding functions.

· Database Zone ·
Free Resource

The following is likely to end up in the list of questions we’ll ask candidates to answer when they apply to Hibernating Rhinos.

Imagine a sharded database. A sharded database is one where the data is split among multiple nodes. To make things simple, we will assume that each datum in the database has a 64 bit key associated with it, and we are interested in distributing the information evenly among the nodes. This can be done using Jump Consistent Hashing (see paper for details), and can be implemented using the following simple C# function:

private static int JumpConsistentHash(ulong key, int numBuckets)
    ulong choosenBucket = ulong.MaxValue;
    ulong index = 0;
    while (index < (ulong) numBuckets)
        choosenBucket = index;
        key = key * 2862933555777941757UL + 1;
        index = (ulong)((choosenBucket + 1) * (double)(1L << 31) / (key >> 33) + 1);
    return (int)choosenBucket;

This function is responsible for taking a key and (given how many nodes there are in the cluster) provide which node this key resides on.

So far, so good, and this makes quite a lot of things much simpler. This function ensures that roughly 1/N of the data items in the databases will require movement to a new node when it is introduced. Which is pretty much exactly what we want in a sharded environment. However, this function doesn’t help us figure out what to move.

Assume that we already have a database that has five nodes, and ten billion data items spread across all five nodes, spread according to the consistent jump function. Because of load, we need to add additional two nodes to the cluster, and we need to move 2/7 (2.8 billion data items) of the cluster data to the new nodes. However, since moving the data items alone is going to be costly, we want to avoid scanning through all ten billion items in all nodes to figure out which ones we need to ship to the new nodes, and which ones should remain with the current node.

Suggest a way that will allow the database to find out which data items need to be moved to the new nodes, without having to scan through all of them. In other words, anything that requires zero (number of items in each node) is out.

You are rated on the impact of your suggestion on the overall cluster behavior. The cheaper your option, the better. You are free to store additional information (at cluster creation/modification, as data items are entered into the cluster/deleted, etc) if this can help you, but be aware that any impact on standard operations (reads and writes) should be minimal and well-justified.

You only need to consider adding nodes to the cluster, removing nodes from the cluster is not required.

database, interview answers, interview questions, sharding

Published at DZone with permission of Oren Eini , DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}