Fault Tolerance Is Not High Availability
Fault Tolerance Is Not High Availability
These terms are not interchangeable. Learn about the ins and outs of fault tolerance to highlight the differences between the two concepts.
Join the DZone community and get the full member experience.Join For Free
Sensu is an open source monitoring event pipeline. Try it today.
Not always, but more often than not, these two terms are treated as interchangeable, people tend to complain when your fault tolerant platform is having downtimes, or even when your highly available setup is not capable of processing some input.
In the following paragraphs, I'm going to discuss one of these concepts, hopefully shedding some light on the difference between the two.
You can think of Fault Tolerance as a less strict version of HA. The latter was all about keeping the offline time of your platform to the minimum and always trying to keep performance unaffected. Now, with FT, we will again, try to minimize downtime but performance will not be a concern, in fact, you could say that a degraded performance is going to be expected.
That being said, the main and most important difference between these two, is actually that if an error occurs during an active action, a highly available system does not ensure the correct end state of that action, whilst a fault tolerant one, does.
In other words, if, for instance, a web request is being processed by your highly available platform, and one of the nodes crashes, that user will probably get a 500 error back from the API, but the system will still be responsive for following requests. In the case of a fault-tolerant platform, the failure will somehow (more on this in a minute) be worked-around and the request will finish correctly, so the user can get a valid response. The second case will most likely take longer, due to the extra steps.
This distinction is crucial because it’ll be the key to understanding which one you’ll want to implement for your particular use case.
Usually, fault-tolerant systems try to catch the error at its source and find a solution before it becomes critical. What I mean by this is, having mirrored hard drives in case one of them fails, instead of letting it fail and replace the entire server, affecting whatever actions that server could’ve been performing at the time.
Because of the aim of this book, I will not focus on hardware-level fault tolerance, instead, I will only cover some of the most common techniques to ensure FT at a software level.
One way to design fault-tolerant architectures is by incorporating redundancy in your key components. What this means basically, is that you have one or more components performing the same task and some form of checking logic to determine when one of them is has failed and its output needs to be ignored.
This is a very common practice for mission-critical components, and it can be applied to many scenarios.
For example, and as an interesting piece of trivia, in 2012, SpaceX sent it’s Dragon capsule to berth with the ISS, during the ascent, the Falcon9 rocket used, suffer a failure on one of it’s 9 Merlin engines, and thanks to their implemented redundancy their onboard computer was able to reconfigure the other 8 to ensure the success of the mission.
Because these systems as so complex to code and to test, the cost-benefit ratio is not always something the normal software project can handle. Instead, these type of systems are usually present in critical systems, where human lives might be at risk (such as air traffic controllers, rocket guidance systems and nuclear power plants, amongst others).
Let’s go over some techniques to provide software redundancy and fault tolerance.
Triple Modular Redundancy
Also known as Triple Mode Redundancy (or TMR), this is a form of redundancy where three systems perform the same process and their results are checked by a majority voting system that in turn, produces a single output (see next image). If one of the three systems fails, then the other two will correct it by providing the accurate output to the voting system.
Generic example of a Triple Modular Redundancy system.
This is a particular implementation of the N-modular redundancy systems, whereas you might’ve guessed, you can add as many parallel systems as you see the need for, in order to provide a higher degree of fault tolerance for a given component. A particularly interesting real-world use case for this type of solutions (or more like a 5-modular redundancy system) is the FlexRay system.
FlexRay is a network communication protocol used in cars, it was developed by the FlexRay Consortium to govern onboard car computing. The consortium disbanded in 2009, but the protocol became a standard. Cars such as the Audi A4 and BMW 7 series use FlexRay. This protocol uses both, data redundancy, sending extra information for problem detection purposes as metadata in the same messages, and structural redundancy in the form of a redundant communication channel.
Forward Error Correction
Yet another way to add a form of redundancy to the system, in this case, Forward Error Correction (or FEC) adds redundancy into the message itself, that way the receiver can verify the actual data and correct a limited amount of errors detected due to noisy or unstable channels.
Depending on the algorithm used to encode the data, the amount of redundancy on the channel may vary and with it, the amount of actual data that can be transferred through it.
You have two main types of encoding algorithms, block codes, and convolutional codes. The first kind deals with fix length blocks of data, and one of the most common algorithms is Reed-Solomon. A classic example of this is two-dimensional barcodes — you use them every day and they are encoded in a way that the reader can withstand a certain amount of missing bits from the code.
Another very interesting real-world example of this type of redundancy can be found on the messages sent by the Voyager space probe, and other similar types of probes. As you can imagine, the communication with these types of devices can’t really afford retransmissions due to a faulty bit, so this type of encoding is used to ensure that the receiving end takes care of solving as many errors as it can due to a problematic channel.
On the other hand, convolutional codes deal with streams of arbitrary length of data and the most common algorithm used for this is called the Viterbi algorithm. This particular algorithm is used for CDMA and GSM cellular networks, dial-up models and yes, you guessed it, deep-space communications (sometimes it’s even used in combination in combination with Reed-Solomon to ensure whatever defect can’t be fixed using Viterbi is fixed using R-S).
Checkpointing is yet another way to provide tolerance to failure; it is, in fact, one method that is commonly used by many programs regular users interact with daily, one of them being word processors, such as Microsoft Word or Word from LibreOffice.
This technique consists on saving the current state of the system into a reliable storage, and restarting the system pre-loading that saved state whenever there is a problem. Rings a bell now? Word processors usually do this while you type — not on every keystroke, that would be too expensive, but on pre-set periods of time, the system will save your current work, in case there is some sort of crash.
Now, this sounds great for small systems, such as a word processor which is saving your current document, but what about whole distributed platforms?
Dealing With Distributed Checkpointing
For these cases, the task is a bit more complex because there is usually a dependency between nodes, so when one of them fails and is restored to a previous checkpoint, the others need to ensure that their current state is consistent. This can cause a cascade effect, leading to the system to returning to the only common stable state: it’s original checkpoint.
There are already some solutions designed to deal with this problem, so you don’t have to, for instance, DMTCP, which stands for Distributed MultiThreading CheckPointing. This tool provides the ability to checkpoint the status of an arbitrary number of distributed systems.
Another solution, which is actively being used in RFID tags, is called Mementos. In this particular use case, the tags don’t have a power source, they use the environment background energy to function, and this can lead to arbitrary power failures. This tool actively monitors the power levels and when there is enough to perform a checkpoint, it’ll store the current tag’s state into a non-volatile memory, which can later be used to reload that information
When to Use
This technique is one that clearly, doesn’t work on every system and you need to carefully analyze your particular needs before starting to plan for it.
Since you’re not checkpointing every time there is new input on your system, you can’t ensure that the action taking place during the error will be able to finish, but what you can ensure, is that the system will be able to handle sudden problems and it will be restored to the latest stable state. Wheather that works for your or not, that’s a whole different beast.
Cases such as a server crash during an API request will most likely not be able to complete, and if retried, it could potentially return an unexpected value, due to an old state on the server side.
I intentionally left this one for last, because it could be considered the sum of all of the above. What we have here is a distributed system, where some components fail, but the majority of the monitoring modules can’t reach a consensus. In other words, you’re in trouble.
The next diagram shows a basic and high-level example of what this problem means for a platform architecture. In it, you have five replicas of component C, which send their output to four different status checkers (A, R, M and Y), they, in turn, exchange “notes” and try to reach a consensus regarding the data they all received. But because there is a problem, maybe with the data channel or with the fifth component, it ends up sending different values to different checkers, so in this case, a majority consensus can’t be reached.
Example of a Byzantine problem, where there is a faulty component sending random data.
There are different approaches to tackle this type of problem, in fact, there are too many ot there to cover in a single chapter, so I’ll just go over the most common ones, to try to give you an idea of where to start.
The simplest one, and I wouldn’t consider it a solution, but more like a workaround, is to let your status checkers default to a specific value whenever consensus can’t be reached. That way the system is not stalled and the current operation can continue.
Another possible solution, especially when the fault is on the data channel, and not on the component generating the message itself, is to use sign the messages with some sort of CRC algorithm, that way faulty messages can be detected and ignored.
Finally, yet another approach to ensure the authenticity of the message sent is to use blockchain, just like Bitcoin does, with a Proof of Work approach, where each node that needs to send a message will need to authenticate it by performing a heavy computation. I’m simply mentioning this approach since it can definitely be material of an entire book, but the idea behind this approach is that it solves the Byzantine Generals problems without any inconvenience.
Well, did you enjoy it? This is just a portion of one of the chapters for my newest book (still under development) "Scaling your Node.js Apps." If you did enjoy it and would like to read more about testing and development in general, please feel free to visit my site.
Thanks for reading!
Opinions expressed by DZone contributors are their own.