The repercussions of recent cloud outages — AWS’s S3 crash and Azure’s Active Directory cascading failure — linger in IT departments and manifest in revenue loss. But, the bigger story is that the next outage is around the corner — unpredictable, coming to get us on a random Tuesday. Whether businesses are using cloud providers, on-premise data centers, or hybrid setups to host web services and backends, infrastructure failures are a fact of life and have to be on our radars as a matter of routine. This makes architecting for failure and for the future, from the start, among the most pressing imperatives for business IT departments.
The next five years will see the rise and democratization of centralized control systems for cloud ops with fault tolerance architected into the very fabric of those systems. Configuration management is being reinvented and taken to entirely new levels of automated action, where the machines take responsibility for failure and do the right thing as part of their continuous tasking. The cloud's scalability, elasticity, distributed resources, and potential cost savings increasingly make it the wiser, preferred choice for enterprises. Unlike on-premise data centers, with cloud, the pieces are all there to help us withstand the storm of outages and their fallout. The challenge is to figure out how to stack, manage, and tune those pieces to automate resilience — and to do that as the pieces change over time.
Be Honest About the Weakest Links
Your architecting analysis begins with ruthless honesty about what causes system fragility. Knowing where the potential to get burned is and not taking shortcuts requires a disciplined, cynical, self-critical, transparent architecting and engineering ethic. Ask: have I really solved for this? If your transaction boundaries are fuzzy, if you’re not clear as to whether an interface is really idempotent, if folks try to sneak a little bit of state into what needs to be a stateless operation, those all constitute fundamental cracks in the infrastructure that can manifest in an outage. They have to be accounted for up front, not as afterthoughts. There has to be honest reconciliation of vulnerabilities.
In fact, whenever you’re building a system out of services, its reliability can be no greater than that of the least reliable service it depends on, and in practice it is lower, because the failure rates compound. To work well, maybe your business application requires ten services, some of which are cloud provider services, some of which are homegrown or adapted by your team. Think about the measured percentage of uptime each service has and the percentage of failure for each. If the application needs all ten to be up at once and they fail independently, its expected availability is roughly the product of the ten individual figures. It isn’t just the weakest individual link in the chain of services you have to worry about. It’s the combined weaknesses of all the links in the chain. This constitutes the overall fragility of the system.
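The compounding effect is easy to check with a little arithmetic. The sketch below is illustrative only: the uptime figures are hypothetical, the function name is made up for this example, and it assumes that every service is required and that failures are independent, which real outages often are not.

```python
from math import prod


def serial_availability(availabilities):
    """Availability of a system that needs every listed service to be up.

    Assumes independent failures; correlated failures would change the
    result.
    """
    return prod(availabilities)


# Hypothetical figures: ten services, each measured at 99.9% uptime.
services = [0.999] * 10

combined = serial_availability(services)
hours_down_per_year = (1 - combined) * 24 * 365

print(f"combined availability: {combined:.4%}")
print(f"expected downtime: {hours_down_per_year:.1f} hours/year")
```

Under these assumptions, ten services at "three nines" each yield a system at roughly 99.0 percent, an order of magnitude more downtime than any single service would suggest.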
Is Disaster Recovery an Antiquated Notion?
Not yet, but maybe it should be. It’s tough to stop thinking in terms of catastrophic recovery plans and reactive behavior. But a core best practice is to design cloud systems with preemptive, built-in mechanisms that expect failure, while at the same time designing flexible components that anticipate the only real certainty about the future: change. Having adaptive, automated fail-safes that are a fundamental part of the way you manage dynamic infrastructure is a very different notion from accepting that inevitable tech failures have to mean inevitable business catastrophes or revenue losses. The trend with next-generation automated infrastructure is that a single command resuming processes will replace multiple manual recovery steps potentially fraught with complications.
Separate Concerns and Determine Priorities
After honestly reflecting on system weaknesses and thinking anew about preventative disaster medicine, so to speak, you roll up your sleeves. Separating your application into the appropriate pieces and understanding the requirements for each piece are your architecting meat and potatoes in building a resilient system. Requirements boil down to prioritizing certain characteristics like reliability, durability, availability, accessibility, speed, security, and scalability for each of your components.
For example, persistent data need to be kept in a place featuring very high durability—durability level refers to whether or not your data get lost. Despite S3’s downtime, persistent data for thousands of businesses were never lost. The data weren’t available for a time, and that’s bad. But, had S3’s durability, as opposed to availability, not been exceptional, the outcome would have been infinitely worse.
On the other hand, the durability of data in the transactional part of the system that’s facing the customer, perhaps governed by a different service, is not of critical importance. Availability matters much more; keeping user interaction open is paramount. If you’re doing a few searches, say, and the system goes down and you just lose some session data, that’s inconvenient, but not terrible. Your persistent data are still safe in another part of your system, one shaped by its own highest-priority requirements.
Maybe there’s another component of your system where speed yielding cost efficiency is the highest priority consideration. You architect accordingly. Your prioritizations, sometimes in a cyclical manner, help you decide exactly which concerns to separate from others and to implement as different services or microservices—where exactly you draw the line.
Do No Harm: Architect a Responsive “Circuit Breaker”
Making assertions about architecting fault tolerance from the get-go, in the very fabric of a system, and demonstrating what that specifically looks like are two different things. It helps to examine part of a blueprint that shows abstract principles in action.
Fugue is an example of a dynamic cloud infrastructure orchestration and enforcement system that provides a concise, accurate view of an application’s cloud footprint at any given time and automatically returns infrastructure that strays for any reason to its declared state. It centralizes cloud management and handles failure elegantly with “built-in” preemptive mechanisms that use the cloud’s native advantages to keep applications safe. Its engine is the Fugue Conductor, which builds out infrastructure, checks it every 30 seconds, and is empowered to make the right decisions about processing work. Human intervention during an outage tends to invite adrenaline-driven mistakes, which can make things worse. A Conductor’s programmed, automated, Hippocratic mantra is ‘first, do no harm.’
When the S3 outage hit and AWS API error responses in the 500s indicated that something aberrant was underway, that unsafe behavior was unfolding on the other side of a cloud API, Fugue Conductors, following their core design, automatically erred on the side of caution and stopped work immediately—popping the metaphorical circuit breaker. Each Conductor, comparing actual infrastructure state for an application against declarations in a single, concise file serving as the source of truth, could not verify consistency. When a Conductor consumes messaging that indicates its view of the world is inaccurate, it halts all change. As the core service disruption, like S3’s, calms and a Conductor’s view of the world is restored, it’s designed to take a single resume command and continue work without missing a beat. Since all change was halted, since its async messages were not marked as read, and since the design is idempotent and stateless, a Conductor can start operating on a message again with nothing corrupted.
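The halt-and-resume pattern described above can be sketched in a few lines. This is not Fugue's implementation; it is a toy model, with every name (`Worker`, `fake_api`, the message strings) invented for illustration. It assumes that a status code of 500 or higher means the API cannot be trusted, and that applying a message is idempotent so a retry after resume is safe.

```python
from collections import deque


class Worker:
    """Toy worker that halts all change when the API looks unsafe."""

    def __init__(self, call_api):
        self.call_api = call_api   # callable(message) -> HTTP status code
        self.queue = deque()       # pending messages, oldest first
        self.halted = False        # True once the breaker has popped

    def submit(self, message):
        self.queue.append(message)

    def resume(self):
        """The single operator command that closes the breaker again."""
        self.halted = False

    def run(self):
        """Process messages until the queue is empty or the breaker pops."""
        done = []
        while self.queue and not self.halted:
            message = self.queue[0]        # peek; not marked as read yet
            status = self.call_api(message)
            if status >= 500:
                self.halted = True         # pop the breaker, keep the message
                break
            self.queue.popleft()           # acknowledge only after success
            done.append(message)
        return done


# Simulated outage: the hypothetical API answers 503 until it recovers.
outage = {"active": True}


def fake_api(message):
    return 503 if outage["active"] else 200


worker = Worker(fake_api)
worker.submit("create-vpc")
worker.submit("create-subnet")

first = worker.run()       # breaker pops; nothing is applied or lost
outage["active"] = False   # the provider recovers
worker.resume()            # one command from the operator
second = worker.run()      # both messages are processed in order
```

Because a message is removed from the queue only after a successful call, a halt leaves the pending work untouched, and a fresh worker holding the same queue could pick it up just as well.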
Beyond that, a Fugue Conductor is further architected so that, at any moment, it can go offline or be completely destroyed (its instance terminated) and another one will come up, picking up where the other left off with no reliance on the previous Conductor. A regular part of Fugue testing includes killing Conductors out of band and making sure that new ones come up correctly and that weird, inconsistent states can be handled safely. Testing like this, really pounding away at the Conductors, is an extension of architecting principles and helps ensure reliability and availability.
If you try to bolt on ad hoc features in a system to accomplish these things, it’ll be a mess. Either a system is blueprinted to handle eventual consistency and API problems or it isn’t.
Humans Forget, Computers Remember
Cloud crashes, big and small, are as inevitable as human fallibility. A vast majority of outages come from human operator error and that’s okay. We know that. It’s expected. People get tired. Even the savviest developers make the occasional mistake. As noted by AWS, the S3 cascade resulted from an incorrectly typed command. So, the rational and effective response isn’t to lose faith in humanity, but to give it a break. Build or use a system that checks for correctness before it does anything else at all. That’s a hard thing to layer back on top if you don’t have it in place from the start. Make it the responsibility of the machine to tell the operator whether something is right or wrong, to the extent possible.
A computer can say decisively whether something is possible to do. It can also respond to boundaries specified ahead of time around whitelisted, correct behaviors. Here’s the list of stuff you may do within these constraints, computer! Here’s a finite set to process, as opposed to a blacklist of what not to do, which might be an infinite set. Any operation performed against the system can first be checked to make sure it’s allowed. So, even if you fat-fingered something, you can’t destroy 3000 servers because that’s not allowed. This is critically important because humans can be bad at remembering the details, especially in the heat of the moment, whereas computers don’t break a sweat.
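A minimal sketch of that kind of up-front check follows. The operation names, the limits, and the `check` helper are all hypothetical, chosen only to show the shape of the idea: a finite list of permitted operations, each with a bound, consulted before anything is executed.

```python
# Hypothetical policy: the only operations an operator may request,
# each with an upper bound on how many servers one command may touch.
ALLOWED = {
    "restart_server": 5,
    "remove_server": 2,
}


class NotAllowed(Exception):
    """Raised when a request falls outside the allowed set."""


def check(operation, count):
    """Return True if the request is allowed; raise NotAllowed otherwise."""
    limit = ALLOWED.get(operation)
    if limit is None:
        raise NotAllowed(f"{operation!r} is not an allowed operation")
    if not 1 <= count <= limit:
        raise NotAllowed(
            f"{operation!r} is limited to {limit} targets, got {count}"
        )
    return True


def try_check(operation, count):
    """Convenience wrapper that returns the refusal message instead of raising."""
    try:
        return check(operation, count)
    except NotAllowed as err:
        return str(err)


print(try_check("remove_server", 2))      # within bounds
print(try_check("remove_server", 3000))   # a fat-fingered count is refused
print(try_check("drop_database", 1))      # an unlisted operation is refused
```

The design choice worth noting is that anything absent from the list is refused by default, so the policy does not have to anticipate every possible mistake.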
Making the machine responsible in its DNA for managing failure quickly, gracefully, and safely is how a cloud crash becomes a non-event for businesses.