Shared-Nothing Data Partitioning
Shared-Nothing Data Partitioning
In this article, take a look at building a partitioned database.
Join the DZone community and get the full member experience.Join For Free
Nowadays, many technologies allow data partitioning — SQL and NoSQL distributed databases, in-memory data grids, etc.
The basic idea is to split the data into smaller parts and distribute them across multiple computers to achieve horizontal scalability. This way, we can accomplish a couple of things:
- Storing large datasets that won’t fit a single server space. This is particularly important for in-memory databases — even the most advanced commodity servers usually provide not more than several hundreds of gigabytes of RAM.
- Virtually unlimited throughput. Different requests are processed by multiple computers in parallel, utilizing the combined power of their CPUs.
With this in mind, let’s build a partitioned database!
Below is our sample data. We will store the information about credit card accounts, where every row consists of the credit card number (used as the primary key), owner’s name, and current balance.
There are only nine rows, but that’s only for the sake of simplicity, of course. In a real dataset that you will want to partition, you will have many more elements.
Now, let’s see how we can split this table into three smaller pieces — to store them on three independent servers.
Attempt #1: Naive
So we have nine rows and three servers; therefore, the most optimal distribution would be to have three rows on each server. The first thing that comes to mind is to simply go through the rows one by one, putting the first three to server #1, second three to server #2, and the last three to server #3. Like this:
Woohoo, we partitioned the data! Mission accomplished!
Well, not so fast. Although we successfully created a partitioned dataset, we didn’t think about how to use it going forward. Imagine that there is an incoming transaction for one of the credit cards, and you want to update the balance of the corresponding account. Which server should you send the request to? Currently, there is no way to know.
One of the ways to solve this is to memorize the mapping between credit card numbers and corresponding servers. As long as your application can get a hold of this mapping, it will be able to access a particular row knowing the credit card number.
The question is — where to store the mapping? Presumably, we will have to set up an additional server specifically for this purpose, which will have multiple implications:
- Such a server will be a single point of failure. If it goes down, the whole system will become non-functional.
- Every request from every client will have to go through this server, so it will quickly become a bottleneck. You will lose all the scalability that you expect from a distributed system.
Although we successfully partitioned the data across three servers, we didn’t create an efficient mechanism for a client application to find out where a particular row is stored. How can we achieve this and avoid having a shared state, effectively creating a shared-nothing architecture?
Attempt #2: Hashing
If you’ve worked with hash-based data structures, you probably have already thought that hashing might be a solution here. That’s exactly how a typical hash table puts entries into buckets, so why not use it here? Let’s try.
Instead of storing the mapping, we can consistently calculate it like this:
Here are the steps in more detail:
- Calculate an integer hash value based on the credit card number.
- Apply the absolute value function to discard negative values.
- Apply the modulo operation to the value, using the number of servers as the divisor.
If we apply this formula to all the credit card numbers that we’ve got, here is what we get:
Note that the function returns values in [0..2] range, so we’ll assume that “0” actually corresponds to server #1, “1” to the server #2, and “2” to server #3. Therefore, the data will be split like this:
As you can see, we made a tremendous step forward. The data is still evenly partitioned (every server owns three entries), but we do not rely on any externally stored mapping. Any client can simply use the hashing function to map a credit card number to the corresponding server.
At this point, we have an algorithm that would work perfectly fine if the number of servers never changes. However, the beauty of horizontally scalable systems is that you can add more servers in case you need more resources. Let’s take a look at what happens if we add the fourth server into the picture. To make it easier, here is the visualization of our current hash-based distribution:
Now, if we add another server, our hashing function stays the same, but we need to use “4” instead of “3” as the divisor for the modulo operation. Here is the mapping after this change:
As expected, the function remapped the data based on the new set of nodes. In a real system, however, this is not going to be free, as we will need to move the data around to comply with the updated distribution (this is often called “data rebalancing”). Here is what will happen:
Here, green arrows represent data movements that we want — these are the rows transferred to the new server we’ve just added. However, in addition to two green arrows, there are three red ones. Red arrows show the entries that have been moved between the first three servers. Such movements are entirely redundant and add no value. Moreover, with large datasets, they can significantly slow down the process and even saturate the network, causing complete downtime of the system.
It turns out that the basic hashing algorithm works on a stable topology when the number of servers does not change. Still, it lacks efficiency during the rebalancing. Let’s see if we can modify the algorithm so that it avoids the “red arrows” that we see on the diagram above — this is called the “minimal disruption” requirement.
Attempt #3: Rendezvous Hashing
One of the well-known algorithms that solve the issue is Rendezvous Hashing.
As the name suggests, this algorithm is also based on hashes. However, instead of calculating a hash solely for a credit card number, we will calculate multiple hashes, each based on a combination of the credit card number and a server number. Then, we will pick the server that gave us the largest value.
For example, for the first credit card number in our list, we can get the following hash values:
The second one is the largest, so we will use the second server (indexed as “1”) to store the information about this account.
And that’s where we get to the crux of the algorithm. Remember, we wanted to solve the issue of unnecessary movements caused by the addition of a new server? With rendezvous hashing, if we add the fourth server, there are only two possibilities:
- The hash value for the new server is smaller than the current largest. In this case, nothing changes, and the account entry does not move.
- The hash value for the new server is larger than the current largest. In this case, we will move the account entry to the new server.
In other words, the algorithm guarantees that any entry might only be moved to the newly added server; otherwise, it’s not moved at all. Therefore, rendezvous hashing satisfies the minimal disruption requirement.
Let’s see this in action, starting with the distribution across three servers:
Adding the fourth server:
No red arrows! We did transfer some of the entries to the new server, but there was no reshuffling between the first three nodes. Precisely what we’ve been looking for.
Congratulations! You’ve just built your first distributed database!
You’ve probably noticed that although the rendezvous hashing algorithm satisfies all the requirements, the distribution it provided is less optimal than the one created by the basic hashing. With four servers, for example, we have only one entry on server #3, and four entries on server #1.
That’s the price we pay for minimal disruption during rebalancing. Essentially, the efficiency of rebalancing is as crucial as uniformity of the distribution, but those two things conflict with each other.
Such tradeoffs are everywhere in distributed systems. Scalability is not free, as multiple servers have to coordinate with each other to provide data distribution with consistency, as well as efficient parallel processing.
I believe that’s what makes working with distributed systems hard, but also exciting and fun. I encourage everyone interested to dive deeper into the concepts I’ve described, as well as many other related topics that you can find in books and over the Internet.
The topic of data partitioning and distributed systems is humongous — I’m barely scratching the surface here. Here are only some of the examples of what you can take a look at as a next step:
- Alternatives to the rendezvous hashing algorithms. E.g., Consistent Hashing, which uses completely different techniques to achieve the same goals.
- What should be used as the underlying hashing function? I’ve used MurmurHash3 in my calculations, but there are many other options.
- How can we add replication? Distributed databases typically allow to store more than one copy of every piece of data for redundancy, so the underlying algorithms need to support that.
- How does data partitioning effect transactions or (in case of a SQL database) query execution?
- And many more.
If you would like to verify any of the calculations or mappings that are shown above, here is the code I’ve used: https://github.com/vkulichenko/hashing. Feel free to use it as a starting point for your investigations.
Another great way to study existing implementations is to look at open-source projects. Apache Ignite, for example, uses rendezvous hashing for data distribution — the source code is here. I’m proud to be a part of the Ignite community that managed to fine-tune the implementation to the state when it can digest petabytes of data, load balance requests, and perform data rebalancing efficiently. You’re welcome to join as well!
Opinions expressed by DZone contributors are their own.