
Distributed Task Assignment With Failover


If you have multiple nodes, how do you decide which node will distribute task assignments? In some cases, that's an easy question. In others? Not so much.


When building a distributed system, one of the more interesting aspects is how you are going to distribute task assignments. In other words, given that you have multiple nodes, how do you decide which node will do what? In some cases, that is a relatively easy question; you can say, “all nodes will process read requests.” But in others, it is more complex. Let's take the case where you have several nodes and you need a regular backup of a database that is replicated between all those nodes. You probably don’t want to run the backup across all the nodes; after all, they are pretty much the same and you don’t want to back up the exact same thing multiple times. On the other hand, you probably don’t want to assign this work statically. If you do, and the node that is responsible for the backup is down, you have no backup.

Another example of the problem can be seen when you have other processes that you would like to be sticky if possible, and only jump around if there is a failure. Bringing a new node online is a common thing to do in a cluster, and the ideal scenario in that case is that a single node will feed it all the data it needs. If we have multiple nodes doing that, they are likely to overlap, and they might very well overload the poor new server. So we want just one node to update its state, but if that node goes down midway, we need someone else to pick up the slack. For RavenDB, those sorts of tasks include things like ETL processes, subscriptions, backups, bootstrapping new servers, and more. We keep discovering new things that can use this sort of behavior.

But how do we actually make this work?

One way of doing this is to take advantage of the fact that RavenDB is using Raft and have the leader do task assignment. That works quite well, but it doesn’t scale. What do I mean by that? I mean that as the number of tasks that we need to manage grows, the complexity in the task assignment portion of the code grows as well. Ideally, I don’t want to have to consider twenty different variables and considerations before deciding what operation should go on which server, and trying to balance that sort of work in one place has proven to be challenging.

Instead, we have decided to allocate work in the cluster based on simple rules. Each task in the cluster has a key (which is typically generated by hashing some of its parameters), and that task can be assigned to a group of servers. Given those two details, we can use Jump Consistent Hashing to spread the load around. However, that doesn’t handle failover. We have a heartbeat process that can detect and notify nodes that a sibling has gone down, so combining those two, we get the following code:

public static string WhoseTaskIsIt(List<Entry> topology, ulong key)
{
   topology = new List<Entry>(topology); // local copy so we can safely change it
   while (true)
   {
       var index = (int)JumpConsistentHash(key, topology.Count);
       var entry = topology[index];
       if (entry.Disabled == false)
           return entry.Id;

       topology.RemoveAt(index);

       key = CombineHash(key, (ulong)topology.Count);
   }
}

//A Fast, Minimal Memory, Consistent Hash Algorithm
//by John Lamping, Eric Veach
//relevant article: https://arxiv.org/abs/1406.2294
public static long JumpConsistentHash(ulong key, int numBuckets)
{
  long b = -1L;
  long j = 0;
  while (j < numBuckets)
  {
      b = j;
      key = key * 2862933555777941757UL + 1;
      j = (long)((b + 1) * ((1L << 31) / ((double)(key >> 33) + 1)));
  }
  return b;
}

public static ulong CombineHash(ulong x, ulong y)
{
    // This is the Hash128to64 function from Google's CityHash (available
    // under the MIT License).  We use it to reduce multiple 64 bit hashes
    // into a single hash.

    // Murmur-inspired hashing.
    const ulong kMul = 0x9ddfea08eb382d69UL;
    ulong a = (y ^ x) * kMul;
    a ^= (a >> 47);
    ulong b = (x ^ a) * kMul;
    b ^= (b >> 47);
    b *= kMul;

    return b;
}

What we are doing here is relying on two different properties: Jump Consistent Hashing lets us know which node is responsible for what, and the Raft cluster leader keeps track of all the live nodes and lets us know when a node goes down. When we need to assign a task, we use its hashed key to find its preferred owner, and if it is alive, that is that. But if it is currently down, we do two things: we remove the downed node from the topology and re-hash the key with the new number of nodes in the cluster. That gives us a new preferred node, and so on until we find a live one.
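To make that flow easier to experiment with, here is a minimal, self-contained Python sketch of the same scheme (the `Node` record and the node names are invented for this example; the article's actual implementation is the C# above):

```python
from dataclasses import dataclass

MASK64 = (1 << 64) - 1
K_MUL = 0x9DDFEA08EB382D69  # mixing constant from CityHash's Hash128to64


def jump_consistent_hash(key: int, num_buckets: int) -> int:
    """Lamping & Veach's jump consistent hash, restricted to 64-bit math."""
    b, j = -1, 0
    while j < num_buckets:
        b = j
        key = (key * 2862933555777941757 + 1) & MASK64
        j = int((b + 1) * ((1 << 31) / ((key >> 33) + 1)))
    return b


def combine_hash(x: int, y: int) -> int:
    """Murmur-inspired mix of two 64-bit hashes into one (CityHash Hash128to64)."""
    a = ((y ^ x) * K_MUL) & MASK64
    a ^= a >> 47
    b = ((x ^ a) * K_MUL) & MASK64
    b ^= b >> 47
    return (b * K_MUL) & MASK64


@dataclass
class Node:
    id: str
    disabled: bool = False


def whose_task_is_it(topology: list, key: int) -> str:
    topology = list(topology)  # local copy so we can safely shrink it
    while topology:
        index = jump_consistent_hash(key, len(topology))
        node = topology[index]
        if not node.disabled:
            return node.id
        # Preferred owner is down: drop it and rehash, so the task lands
        # on a pseudo-random survivor rather than always the next node.
        del topology[index]
        key = combine_hash(key, len(topology))
    raise RuntimeError("no live node available")
```

Disabling the preferred owner and calling `whose_task_is_it` again deterministically moves the task to a different live node, which is exactly the failover behavior described.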

The reason we rehash on failover is that Jump Consistent Hashing will usually point to the same position in the topology (that is why we chose it in the first place, after all), so we rehash to get a different position, ensuring that the work doesn't all fall onto the next node in the list. Instead, the downed node's tasks are fairly distributed among the remaining live cluster members.
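To check that claim empirically, here is a small, self-contained Python experiment (an illustration, not part of the original code) that disables one node and looks at where its tasks land:

```python
MASK64 = (1 << 64) - 1
K_MUL = 0x9DDFEA08EB382D69  # CityHash Hash128to64 mixing constant


def jump_hash(key, buckets):
    # Lamping & Veach's jump consistent hash
    b, j = -1, 0
    while j < buckets:
        b = j
        key = (key * 2862933555777941757 + 1) & MASK64
        j = int((b + 1) * ((1 << 31) / ((key >> 33) + 1)))
    return b


def combine(x, y):
    # Murmur-inspired mix of two 64-bit values
    a = ((y ^ x) * K_MUL) & MASK64
    a ^= a >> 47
    b = ((x ^ a) * K_MUL) & MASK64
    b ^= b >> 47
    return (b * K_MUL) & MASK64


def owner(nodes, down, key):
    # Same failover loop as the article: drop the dead node, rehash, retry.
    nodes = list(nodes)
    while True:
        i = jump_hash(key, len(nodes))
        if nodes[i] not in down:
            return nodes[i]
        del nodes[i]
        key = combine(key, len(nodes))


nodes = ["A", "B", "C", "D"]
# Tasks whose preferred owner is node "D" in the full topology:
d_tasks = [k for k in range(10_000) if nodes[jump_hash(k, 4)] == "D"]
# With "D" down, those tasks fail over to every surviving node,
# not just to the next one in the list:
reassigned = {owner(nodes, {"D"}, k) for k in d_tasks}
```

Running this shows `reassigned` containing all three surviving nodes, confirming that a downed node's work is scattered rather than dumped on a single neighbor.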

The nice thing about this is that aside from keeping the live/down list up to date, the Raft cluster doesn’t really need to do anything. This is a consistent algorithm, so different nodes operating on the same data will arrive at the same result, and when a node goes down, one surviving node will pick up the job of bringing the new server up to spec while another starts the backup process. And all of that logic is right where we want it: right next to where the task logic itself is written.

This allows us to reason much more effectively about the behavior of each independent task, and it also lets each node tell you where each task is executing.



Published at DZone with permission of Oren Eini, DZone MVB. See the original article here.

