Fault Tolerance on the Cheap Part I

Fault tolerance is a very real requirement for most, if not all, deployed systems. How can we build it without breaking the bank?

By Brian Troutwine · Apr. 29, 16 · Opinion

The property of fault tolerance is desirable in most systems, but there is little accessible literature on the subject. What does exist seems out of date, associated either with a defunct computer manufacturer, with early relational-database research, or with specific, esoteric functional programming languages.

Achieving fault tolerance, however, is not an esoteric matter. It is practical and approachable in an iterative fashion: fault tolerance is a matter of degree, amenable to the trade-offs necessitated by budgets or organizational needs.

This article is Part I in a two-part series that will discuss a high-level approach to building fault-tolerant systems. Here, we’ll focus on background information.

What Is Fault Tolerance?

The late Jim Gray defined fault-tolerant systems as those in which “parts of the system may fail but the rest of the system must tolerate failures and continue delivering service.” This is taken from Dr. Gray’s “Why Do Computers Stop and What Can Be Done About It?”, a 1985 paper about Tandem Computers’ NonStop, a computer system with fully redundant hardware.

Dr. Gray was concerned with the software running on these systems and the interaction with human operators, noting that these two categories made up the majority of observed failures. (That this undermined the business case for the Tandem NonStop was not mentioned in the article, but the trend of history toward deploying on cheap, fault-prone hardware and solving for this at the software level was inevitable.)

There are two things to unpack in Dr. Gray’s definition:

  • “fail”
  • “continue delivering service”

Consider a software/hardware system as a total abstraction, with each subcomponent of the system defined along some boundary interface. A CPU is such a component, speaking over certain buses and obeying certain properties. An ORM model object is another, obeying a certain call protocol and behaving in a certain fashion. The exact semantics of components are often unspecified, but a large portion can be discovered through trial-and-error experimentation.

A component that violates its discovered semantics has “failed.” This could be a permanent violation (maybe someone set your CPU on fire) or a temporary one (perhaps someone shut down the database your model object abstracts, but they’ll turn it back on again). What truly matters is that a component of the system is behaving in a fashion not anticipated.

This is where “continue delivering service” comes in. Seen from the outside, the system (the conglomeration of components) itself possesses some interface with some discoverable behavior. The service produced by this system is that behavior. A fault-tolerant system is one in which the unanticipated actions of a subcomponent do not bubble out as unanticipated behavior from the system as a whole. Your database may go offline and your ORM object may fail, but the caller of that object copes and things go on as expected.
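
To make this concrete, here is a minimal sketch in Python (all names are hypothetical, not from the article): a ProfileService caller wraps a database-backed model object, and when the model fails it serves a last-known-good answer rather than letting the fault escape the system boundary.

    import logging

    log = logging.getLogger("profiles")

    class DatabaseUnavailable(Exception):
        """Raised by the model when its backing store cannot be reached."""

    class ProfileModel:
        """Stand-in for an ORM model object backed by a database."""
        def __init__(self, rows):
            self.rows = rows   # pretend table
            self.db_up = True  # flipped to simulate an outage

        def fetch(self, user_id):
            if not self.db_up:
                raise DatabaseUnavailable(user_id)
            return self.rows[user_id]

    class ProfileService:
        """The caller: it copes when the model fails, so the fault never
        bubbles out as unanticipated behavior of the system as a whole."""
        def __init__(self, model):
            self.model = model
            self.cache = {}  # last known-good answers

        def get_profile(self, user_id):
            try:
                profile = self.model.fetch(user_id)
                self.cache[user_id] = profile  # refresh the fallback copy
                return profile
            except DatabaseUnavailable:
                log.warning("database down; serving cached profile for %s", user_id)
                # Degrade gracefully: a stale answer beats an unanticipated error.
                return self.cache.get(user_id, {"id": user_id, "name": "unknown"})

    service = ProfileService(ProfileModel({1: {"id": 1, "name": "ada"}}))
    service.get_profile(1)       # hits the "database" and caches the result
    service.model.db_up = False  # someone shuts the database down...
    service.get_profile(1)       # ...and the caller still delivers service

The point here is not the caching strategy. The point is that the model’s failure is anticipated at the boundary, so the service’s observable behavior stays within its contract.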

But how do you make such a thing?

Approaches to Coping

There are three broad approaches that organizations take to the production of fault-tolerant systems. Naturally, each has its tradeoffs.

Option 1: Perfection

In this option, organizations reduce the probability of undiscovered behavior in system components to vanishingly low values. Consider the process described in They Write the Right Stuff, a 1996 Fast Company article on the development of the Space Shuttle’s flight computer software. Because of the critical nature of the flight computer in the shuttle’s operation, unusual constraints were placed on the construction of the software. In particular:

  • Total control was held over the hardware. Everything was specified and custom-made. Unknowns were intolerable, so none were accepted.
  • Total understanding of the problem domain was required. Every little bit of physics was worked out, and the cooperation of the astronauts with the computer system was scripted.
  • Specific, explicit goals were defined by the parent organization (in this case NASA). The shuttle had a particular job to do, and this was defined in detail.
  • The service lifetime of the system was defined in advance. That is, the running time of the computer was set, allowing formal methods that assume fixed runtimes to be employed.

This approach was not without its failures; the first orbiter flight was delayed by a software bug, in fact. But each of the shuttle’s catastrophes was attributable to mechanical faults, not software.

The downside of taking the perfection route is that it is extremely expensive and stifling. When an organization attempts perfection, it is explicitly stating that the final system is all that matters. Discovering some cool new trick the hardware can do in addition to its stated goals does not matter. The intellectual freedom of the engineers, outside the strictures of the All Important Process, does not matter. All that matters is the end result.

Working in this fashion requires a boatload of money, a certain kind of engineer, sophisticated planning, and time. Systems produced like this are fault-tolerant because all possible faults are accounted for and handled, usually with a runbook and a staff of expert operators on standby.

Option 2: Hope for the Best

On the flip side is the “hope for the best” model of system development, exemplified by a certain social media company’s former motto, “Move Fast and Break Things.”

Such a model requires little upfront understanding of the problem domain, coupled with very short-term goals, often “just get something online.” It is also often cheaper to pull off in the short term: the costs associated with upfront planning are entirely avoided, and fewer engineers are needed to produce something (anything).

Organizations taking this approach are implicitly stating that the future system and its behavior matter less than the research done to produce it, sometimes resulting in a Silicon-Valley-style “pivot.”

The downside of this approach can be seen in the longer view. Ignorance of the problem domain will often result in long-term system issues, which may be resolvable but not without significant expense. Failures in a system produced in this way do propagate out to users, which may or may not be an issue depending on the system.

The most pernicious thing about this model is its cultural impact: it is difficult to flip a switch in an organization and declare that, from today, all software must be high quality. Ad hoc verification practices and poor operational management linger long after the organization declares them liabilities rather than virtues.

Sometimes, but not always, redundancy will be featured in a production “hope for the best” system, and the fail-over between components may or may not be tested. Such systems are often accidentally able to cope with failures and are kept online through ingenuity and coolness under fire.
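
As a sketch of what such accidental redundancy can look like (hypothetical names, not from the article), consider a primary/backup pair behind a fail-over wrapper. The wrapper is trivial to write; whether its backup path actually works is exactly the part that often goes unexercised until a real failure arrives.

    class Backend:
        """A redundant component; `up` is flipped to simulate a failure."""
        def __init__(self, name):
            self.name = name
            self.up = True

        def handle(self, request):
            if not self.up:
                raise ConnectionError(self.name)
            return f"{self.name} served {request}"

    def with_failover(primary, backup, request):
        # The fail-over path is one line of code, and it is often the
        # least exercised line in the whole system.
        try:
            return primary.handle(request)
        except ConnectionError:
            return backup.handle(request)

    primary, backup = Backend("primary"), Backend("backup")
    print(with_failover(primary, backup, "req-1"))  # primary serves
    primary.up = False                              # primary fails...
    print(with_failover(primary, backup, "req-2"))  # ...the untested path is now live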

Option 3: Embracing Failure

Sitting between these two options is the embrace of undiscovered faults as a first-class component of a system. Part II of this series will discuss this final option in depth. Stay tuned!


Published at DZone with permission of Brian Troutwine, DZone MVB.
