
10 Useful Tips To Implement Distributed Fail-Over


It is no secret that automatic fail-over in distributed environments is no picnic to implement. Here are some useful pointers if you ever decide to do it on your own:

  1. Make sure to implement some sort of heartbeat protocol. A heartbeat is a message that every node periodically emits to tell the others that it is alive. It is usually implemented with IP multicast, although the actual communication protocol is not important here. Other nodes consider a node failed after it misses a pre-configured number of heartbeats.
  2. Account for delays in node discovery. There is always a time window between an actual node crash and when other nodes find out about it.
  3. Store all messages on the sender node until they have been processed. This way you can fail them over to other nodes if the processing node fails.
  4. Account for the possibility of receiving multiple notification events for the same node failure - you don't want to process the same fail-over event more than once.
  5. Make sure that your message does not get failed over forever, i.e. that it does not keep jumping between grid nodes indefinitely. After a certain number of fail-over attempts, let the whole processing of the message fail.
  6. Make sure that your message does not get failed over to the same node it failed on initially - always give preference to other grid nodes.
  7. Make sure that message failure is not limited to node crashes. For example, you may want to fail over a message if it threw an exception on the remote node or returned a bad result.
  8. Avoid sending any messages from within synchronized blocks - this is a sure way to introduce deadlocks into your code.
  9. Make sure that fail-over happens automatically at the infrastructure level and is transparent to your application logic.
  10. Provide a good interface for your fail-over module and make it pluggable - fail-over logic, such as selecting a new node, may differ based on your application policy, so it is essential to be able to easily switch the underlying implementation.
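
To make tips 1, 2 and 4 a little more concrete, here is a minimal sketch of a heartbeat-based failure detector. The class name `HeartbeatMonitor`, its methods and the thresholds are hypothetical and chosen only for illustration; this is not GridGain's implementation, and the transport that actually delivers heartbeats is left out. Time is passed in explicitly so the logic is easy to test.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical failure detector; names and thresholds are illustrative assumptions. */
class HeartbeatMonitor {
    private final long intervalMillis; // expected gap between two heartbeats
    private final int maxMissed;       // missed heartbeats tolerated before a node counts as failed
    private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();
    private final Set<String> failed = ConcurrentHashMap.newKeySet();

    HeartbeatMonitor(long intervalMillis, int maxMissed) {
        this.intervalMillis = intervalMillis;
        this.maxMissed = maxMissed;
    }

    /** Records a heartbeat; a node that is heard from again is no longer marked as failed. */
    void onHeartbeat(String nodeId, long nowMillis) {
        lastSeen.put(nodeId, nowMillis);
        failed.remove(nodeId);
    }

    /**
     * Returns the nodes that have newly crossed the missed-heartbeat threshold.
     * A node already marked as failed is not reported again, so repeated checks
     * produce a single failure event per crash (tip 4).
     */
    Set<String> detectFailures(long nowMillis) {
        Set<String> newlyFailed = new HashSet<>();
        for (Map.Entry<String, Long> entry : lastSeen.entrySet()) {
            long missed = (nowMillis - entry.getValue()) / intervalMillis;
            if (missed >= maxMissed && failed.add(entry.getKey())) {
                newlyFailed.add(entry.getKey());
            }
        }
        return newlyFailed;
    }
}
```

With a 1000 ms interval and a threshold of 3, a crashed node still looks alive for up to three seconds after its last heartbeat, plus however long it takes until `detectFailures` is next called. That gap is the discovery window from tip 2.
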
Of course, you could always download GridGain and get all of the above right out of the box.
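
If you do roll your own, tips 3, 5, 6, 7 and 10 might combine into something like the sketch below. All names here (`FailoverPolicy`, `PreferUntriedPolicy`, `FailoverExecutor`) are hypothetical, and `remoteCall` merely stands in for whatever transport sends the message to a node. As a simplification, the default policy never retries a node the message has already failed on, which is stricter than the "give preference to other nodes" wording of tip 6.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.function.Function;

/** Pluggable node selection (tip 10); implementations encode the application's policy. */
interface FailoverPolicy {
    Optional<String> selectNode(List<String> aliveNodes, Set<String> alreadyTried);
}

/** Assumed default policy: first alive node the message has not failed on yet (tip 6). */
class PreferUntriedPolicy implements FailoverPolicy {
    @Override
    public Optional<String> selectNode(List<String> aliveNodes, Set<String> alreadyTried) {
        return aliveNodes.stream().filter(n -> !alreadyTried.contains(n)).findFirst();
    }
}

class FailoverExecutor {
    private final FailoverPolicy policy;
    private final int maxAttempts; // upper bound on attempts, so a message cannot bounce forever (tip 5)

    FailoverExecutor(FailoverPolicy policy, int maxAttempts) {
        this.policy = policy;
        this.maxAttempts = maxAttempts;
    }

    /**
     * Sends the work to one node after another until one succeeds. The sender keeps
     * the work (the remoteCall function) until it has been processed (tip 3).
     */
    <R> R execute(List<String> aliveNodes, Function<String, R> remoteCall) {
        Set<String> tried = new HashSet<>();
        RuntimeException lastError = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            Optional<String> node = policy.selectNode(aliveNodes, tried);
            if (!node.isPresent()) {
                break; // no untried node left
            }
            tried.add(node.get());
            try {
                return remoteCall.apply(node.get());
            } catch (RuntimeException e) {
                lastError = e; // an exception on the remote node also triggers fail-over (tip 7)
            }
        }
        throw new IllegalStateException(
            "Fail-over gave up after " + tried.size() + " attempt(s)", lastError);
    }
}
```

Application code would only call `execute` and see either a result or a single final failure, which keeps the retry logic out of the business logic in the spirit of tip 9. Swapping in a different `FailoverPolicy`, for example one that picks the least loaded node, requires no change to `FailoverExecutor`.
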

Published at DZone with permission of

