The Value of Optimizing for Resilience

Learn what optimizing for resilience means, and why it's so valuable in software delivery, plus what it costs to achieve it.

What does it mean to optimise for resilience? Why is resilience so valuable to an organization, and how can operability contribute towards it? In this article, Steve Smith explains what optimising for resilience is, and why it is so valuable to IT delivery. This is part of the Resilience As A Continuous Delivery Enabler series:

  1. The Cost And Theatre Of Optimising For Robustness
  2. When Optimising For Robustness Fails
  3. The Value Of Optimising For Resilience
  4. Resilience As A Continuous Delivery Enabler - TBA

Resilience Is Graceful Extensibility

When an organisation wants to improve the reliability of its IT services, it should optimize for resilience. Resilience is the ability to "absorb or avoid damage without suffering complete failure," and it is achieved by minimising the Mean Time To Repair (MTTR) of services. Some classes of failure should never occur, some failures are more costly than others, and some safety-critical systems should never have failures, but in general, organizations should adhere to John Allspaw's advice that "being able to recover quickly from failure is more important than having failures less often."

Resilience can be thought of as graceful extensibility. In Four Concepts for Resilience and their Implications for Systems Safety in the Face of Complexity, David Woods describes graceful extensibility as "the ability of a system to extend its capacity to adapt when surprise events challenge its boundaries." Optimising for resilience means creating a production environment that can gracefully extend to deal with the unpredictable behaviors, unexpected changes, and periods of failure that will inevitably occur with running IT services. This allows for the cost per unit time and duration of production failures to be minimised, reducing both the direct revenue costs and indirect opportunity costs created by a failure.

Resilience needs to be built into teams and services throughout an organization. In Resilience Engineering In Practice, Erik Hollnagel et al. define the cornerstones of Resilience Engineering as:

  • Anticipation is knowing what to expect. This is imagining the potential for future failures, and mitigating for those scenarios in advance.
  • Monitoring is knowing what to look for. This is inspecting past and present operating conditions, and alerting when anomalies occur.
  • Response is knowing what to do. This is using guidelines, heuristics, improvisation skills, and situational awareness to mitigate a failure.
  • Learning is knowing what has happened. This is understanding the circumstances of a near-miss or failure, and sharing the observations.

These cornerstones are non-linear and complementary. For example, if a team has a major launch in the near future, it might invest more time in anticipating failure scenarios, which might result in improved monitoring and response capabilities.

Creating Adaptive Capacity With Operability

The graceful extensibility of an organization is derived from the adaptive capacity of its teams and their services. When an organization optimises for resilience it can create sources of adaptive capacity by making a long-term investment in the operability of its IT services. Operability is defined as "the ability to keep a system in a safe and reliable functioning condition," and it is associated with a set of practices:

Each of these operability practices can be linked to a cornerstone of Resilience Engineering. They will produce a more effective incident response, and increase adaptive capacity:

For example, incident response at Fruits-U-Like would be much improved if the organization were optimised for resilience. Assume its third-party registration service starts to struggle under load, new customers cannot check out their purchases, and the failure cost per unit time is £80K per day. The checkout team would receive an automated alert for the failure, and their logging and monitoring dashboards would show a correlation between checkout and registration failures. The team would be able to triage a third-party registration error within 5 minutes, and self-deploy an improvement to connection handling within a day. The failure would have a 1-day repair cost of £80K, with a detection sunk cost of £278 and a remediation opportunity cost of £79,722.
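The figures in this scenario can be reproduced with a few lines of arithmetic. The sketch below assumes the failure cost accrues at a constant rate over the day; the function and variable names are illustrative only and do not come from the original article.

```python
MINUTES_PER_DAY = 24 * 60


def failure_cost(cost_per_day, duration_minutes):
    """Cost of a failure lasting duration_minutes, assuming a constant daily cost rate."""
    return cost_per_day * duration_minutes / MINUTES_PER_DAY


cost_per_day = 80_000              # £80K per day, from the scenario
detection_minutes = 5              # time taken to triage the registration error
repair_minutes = MINUTES_PER_DAY   # fix self-deployed within a day

repair_cost = failure_cost(cost_per_day, repair_minutes)
detection_cost = failure_cost(cost_per_day, detection_minutes)  # sunk cost
remediation_cost = repair_cost - detection_cost                 # opportunity cost

print(
    f"repair £{repair_cost:,.0f}, "
    f"detection £{detection_cost:,.0f}, "
    f"remediation £{remediation_cost:,.0f}"
)
```

Rounded to the nearest pound, this gives the £80,000 repair cost, £278 detection cost, and £79,722 remediation cost quoted in the scenario.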

If the checkout team adopted Defensive Architecture techniques, they could combine a Circuit Breaker, a Bulkhead, and a Feature Toggle in anticipation of registration errors. If the registration service struggled under load, the Circuit Breaker would regulate registration requests to allow a percentage to succeed, and the Bulkhead would warn the checkout frontend to skip registration for some customers. This approach would reduce the failure cost per unit time to a marketing opportunity cost of £5K a day. The checkout team would not receive an alert, but within minutes their dashboards would highlight registration errors, and they could use a Feature Toggle to enable anonymous checkouts for new customers. This would allow them to deploy their connection handling fix within 3 hours with no customer impact. The result would be a 3-hour repair cost of £625, with a sunk cost of £18 and an opportunity cost of £607.
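As a rough illustration of how two of these techniques fit together, here is a minimal sketch of a count-based Circuit Breaker combined with a Feature Toggle. It is a deliberately simplified variant, not the Fruits-U-Like implementation: the breaker opens after a number of consecutive failures rather than letting a percentage of requests through, the Bulkhead is omitted, and every name in it (`CircuitBreaker`, `checkout`, `anonymous_checkout`, `failing_register`) is hypothetical.

```python
class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    """Opens after max_failures consecutive failures; while open, calls are skipped."""

    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = 0

    @property
    def is_open(self):
        return self.failures >= self.max_failures

    def call(self, func, *args):
        if self.is_open:
            raise CircuitOpenError("registration circuit is open")
        try:
            result = func(*args)
        except Exception:
            self.failures += 1   # count the failure, then let the caller see it
            raise
        self.failures = 0        # any success closes the circuit again
        return result


def checkout(customer, register, breaker, toggles):
    """Complete a purchase, skipping registration when the toggle is on."""
    if toggles.get("anonymous_checkout"):
        return "anonymous"
    try:
        breaker.call(register, customer)
    except (CircuitOpenError, ConnectionError):
        return "failed"
    return "registered"


calls = []


def failing_register(customer):
    """Stand-in for a registration service that is struggling under load."""
    calls.append(customer)
    raise ConnectionError("registration service overloaded")


breaker = CircuitBreaker(max_failures=2)
toggles = {"anonymous_checkout": False}

# Two real failures open the circuit; the third attempt never reaches the service.
results = [checkout("alice", failing_register, breaker, toggles) for _ in range(3)]

# The team flips the toggle, and new customers can check out anonymously.
toggles["anonymous_checkout"] = True
results.append(checkout("bob", failing_register, breaker, toggles))
print(results, len(calls))
```

In this sketch the breaker protects the struggling service from further load, while the toggle gives the team a way to restore checkouts without waiting for a deployment. A production breaker would normally also include a timed half-open state so that the circuit can recover on its own.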

1 In How Complex Systems Fail, Richard Cook warns that "hindsight bias remains the primary obstacle to accident investigation." There is no such thing as a root cause in a complex production system, nor a blameworthy individual

The Resilience As A Continuous Delivery Enabler series:

  1. The Cost And Theatre Of Optimising For Robustness
  2. Responding To Failure When Optimising For Robustness
  3. The Value Of Optimising For Resilience
  4. Resilience As A Continuous Delivery Enabler - TBA


This series is indebted to John Allspaw and Dave Snowden for their respective work on Resilience Engineering and Cynefin.

Thanks to Beccy Stafford, Charles Kubicek, Chris O'Dell, Edd Grant, Daniel Mitchell, Martin Jackson, and Thierry de Pauw for their feedback on this series.

performance, performance optimization, resilience

Published at DZone with permission of Steve Smith , DZone MVB. See the original article here.
