- Make sure to implement some sort of heartbeat protocol. A heartbeat is a message that every node emits to tell the others that it is alive. It is usually implemented with IP Multicast, but the actual communication protocol is not important here. Other nodes will consider a node failed after it misses a certain pre-configured number of heartbeats.
- Account for delays in node discovery. There is always a time window between an actual node crash and when other nodes find out about it.
- Store all messages on the sender node until they get processed. This way you can fail them over to other nodes if the processing node fails.
- Account for the possibility of receiving multiple notification events for the same node failure. You don't want to process the same fail-over event more than once.
- Make sure that a message does not fail over forever, i.e. keep jumping between grid nodes indefinitely. After a certain number of fail-over attempts, let the whole processing of the message fail.
- Make sure that your message does not get failed-over to the same node it failed on initially - always give preference to other grid nodes.
- Make sure that message failure is not limited to node crashes. For example, you may want to fail over a message if it threw an exception on the remote node or returned a bad result.
- Avoid sending any messages within synchronization blocks - this is a sure way to introduce deadlocks into your code.
- Make sure that fail-over happens automatically at infrastructure level and is transparent to your application logic.
- Provide a good interface for your fail-over module and make it pluggable. Fail-over logic, such as selecting a new node, may differ based on your application policy, so it is essential to be able to easily switch the underlying implementation.
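
Several of the tips above can be tied together in a small sketch. The Java code below is only an illustration under stated assumptions, not the implementation of any particular grid product: `FailoverPolicy`, `PreferOtherNodesPolicy`, `HeartbeatMonitor`, `FailoverManager`, and all of their methods are hypothetical names invented for this example. Nodes and messages are identified by plain strings, and timestamps are passed in explicitly so the logic stays deterministic.

```java
import java.util.*;

/** Hypothetical pluggable policy: returns the node to retry on, or null if none is available. */
interface FailoverPolicy {
    String selectNode(List<String> aliveNodes, Set<String> failedOn);
}

/** Prefers nodes the message has not failed on; falls back to any alive node. */
class PreferOtherNodesPolicy implements FailoverPolicy {
    @Override
    public String selectNode(List<String> aliveNodes, Set<String> failedOn) {
        for (String node : aliveNodes) {
            if (!failedOn.contains(node)) return node;
        }
        return aliveNodes.isEmpty() ? null : aliveNodes.get(0);
    }
}

/** Considers a node failed once it has been silent for more than maxMissed heartbeat intervals. */
class HeartbeatMonitor {
    private final long intervalMs;
    private final int maxMissed;
    private final Map<String, Long> lastSeen = new HashMap<>();

    HeartbeatMonitor(long intervalMs, int maxMissed) {
        this.intervalMs = intervalMs;
        this.maxMissed = maxMissed;
    }

    synchronized void onHeartbeat(String nodeId, long nowMs) {
        lastSeen.put(nodeId, nowMs);
    }

    /** Nodes that never sent a heartbeat are treated as failed. */
    synchronized boolean isFailed(String nodeId, long nowMs) {
        Long seen = lastSeen.get(nodeId);
        return seen == null || nowMs - seen > intervalMs * maxMissed;
    }
}

/** Keeps unprocessed messages on the sender and decides where to fail them over. */
class FailoverManager {
    private static final class Pending {
        String node;                                    // node currently processing the message
        final Set<String> failedOn = new HashSet<>();   // nodes the message already failed on
        int failovers;                                  // fail-over attempts so far
        Pending(String node) { this.node = node; }
    }

    private final FailoverPolicy policy;
    private final int maxFailovers;
    private final Map<String, Pending> pending = new HashMap<>();
    private final Set<String> handledFailures = new HashSet<>();
    private final Set<String> abandoned = new HashSet<>();

    FailoverManager(FailoverPolicy policy, int maxFailovers) {
        this.policy = policy;
        this.maxFailovers = maxFailovers;
    }

    synchronized void track(String messageId, String nodeId) {
        pending.put(messageId, new Pending(nodeId));
    }

    synchronized void onProcessed(String messageId) {
        pending.remove(messageId);
    }

    synchronized boolean isAbandoned(String messageId) {
        return abandoned.contains(messageId);
    }

    /**
     * Computes reroutes (messageId -> new node) under the lock and returns them,
     * so the caller can do the actual sends outside any synchronized block.
     * Assumes node IDs are unique per node start, so each ID fails at most once.
     */
    synchronized Map<String, String> onNodeFailed(String nodeId, List<String> aliveNodes) {
        Map<String, String> reroutes = new HashMap<>();
        if (!handledFailures.add(nodeId)) return reroutes; // duplicate failure event
        for (String messageId : new ArrayList<>(pending.keySet())) {
            Pending p = pending.get(messageId);
            if (!nodeId.equals(p.node)) continue;
            p.failedOn.add(nodeId);
            String target = p.failovers >= maxFailovers
                    ? null : policy.selectNode(aliveNodes, p.failedOn);
            if (target == null) {
                pending.remove(messageId);   // give up: the whole processing fails
                abandoned.add(messageId);
            } else {
                p.failovers++;
                p.node = target;
                reroutes.put(messageId, target);
            }
        }
        return reroutes;
    }
}
```

In this sketch the application would call `track` when it sends a message, `onProcessed` when the result arrives, and `onNodeFailed` whenever `HeartbeatMonitor` (or any other detector) reports a dead node. The caller then resends the returned messages after the method has released its lock. Swapping `PreferOtherNodesPolicy` for another `FailoverPolicy` changes node selection without touching the rest of the code.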
10 Useful Tips To Implement Distributed Fail-Over
Published at DZone with permission of Dmitriy Setrakyan, DZone MVB. See the original article here.