When we plan app infrastructure, we often ask how it will scale. Equally important, though, is how it will fail. You might not be able to prevent failure entirely, but you can mitigate its impact on your customers by building the capacity for failure, or graceful degradation, into your app.
I find that the best writing on software architecture is often by people who are not software architects. In fact, they may never have written a single line of code. This is because the goal of good software is to fit the needs of the customer so well that the software is invisible (quote repurposed from Donald Norman).
And that is a goal we software engineers share with designers and marketers, whose perspective is also relevant to the design of software.
Seth Godin’s article on Graceful Degradation is a great example of such writing. He says:
Most failures aren't shocking surprises. The law of large numbers is too strong for that. Instead, they are predictable events that smart designers plan for, instead of wishing them away as rare unpredictable accidents.
Failure is not an exception in software — it is the rule. That is why graceful degradation is such a key concept in software architecture.
Graceful degradation in cloud software is a wide-ranging topic encompassing people, code, and infrastructure. Here are four ideas that are critical to successfully incorporating graceful degradation into your app or service.
1. Service-Oriented Architecture
A service-oriented architecture built from independent services (or its modern variation, microservices) allows software architects to localize failure to a single service, thus preventing a failure in non-critical functionality from disrupting critical functionality.
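To make the idea of localizing failure concrete, here is a minimal sketch in Python. Everything in it is hypothetical: `fetch_recommendations` stands in for a non-critical service (it always fails here), and `render_product_page` stands in for the critical path that must keep working.

```python
from typing import Callable, List


def fetch_recommendations(user_id: str) -> List[str]:
    """Hypothetical non-critical service call; in this sketch it always fails."""
    raise ConnectionError("recommendation service unavailable")


def with_fallback(call: Callable[[], List[str]], fallback: List[str]) -> List[str]:
    """Run a non-critical call and return the fallback if it raises."""
    try:
        return call()
    except Exception:
        # Contain the failure: the caller keeps working with degraded output.
        return fallback


def render_product_page(user_id: str) -> dict:
    """Critical path: the page must render even without recommendations."""
    recommendations = with_fallback(lambda: fetch_recommendations(user_id), [])
    return {"product": "widget", "recommendations": recommendations}
```

The page still renders when the recommendation service is down; the customer just sees an empty recommendations list instead of an error.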
2. Elastic Hardware
Sudden spikes in traffic are a significant source of failures in cloud systems. Having elastic hardware that can spin up on demand goes a long way toward handling this failure scenario.
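As a rough illustration of what "spin up on demand" can mean, here is a sketch of one simple proportional scaling rule. The function name, the load units, and the instance limits are all assumptions made for this example; real autoscalers add smoothing, cooldowns, and other safeguards.

```python
import math


def desired_instances(current: int, observed_load: float, target_load: float,
                      min_instances: int = 1, max_instances: int = 20) -> int:
    """Scale the instance count in proportion to observed vs. target load per instance."""
    if current < 1 or target_load <= 0:
        raise ValueError("current must be >= 1 and target_load must be > 0")
    # If each instance carries 1.5x its target load, ask for 1.5x the instances.
    wanted = math.ceil(current * observed_load / target_load)
    # Clamp so a spike cannot scale without bound and a lull cannot scale to zero.
    return max(min_instances, min(max_instances, wanted))
```

For example, with 4 instances each seeing a load of 90 against a target of 60, the rule asks for 6 instances.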
3. Fault-Tolerant Communication
If a service stops meeting its SLA, the services that call it should taper off their calls and fall back to a backup behavior. This prevents failures from cascading. A good example is Netflix’s Hystrix.
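The pattern behind this is usually called a circuit breaker. The sketch below is a deliberately small Python version, not Hystrix's API: the class name, the thresholds, and the injectable `clock` parameter are all assumptions chosen for illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Stop calling a failing dependency for a while and use a fallback instead."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def is_open(self) -> bool:
        """True while the breaker is blocking calls to the dependency."""
        if self.opened_at is None:
            return False
        return self.clock() - self.opened_at < self.reset_timeout

    def call(self, primary: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.is_open():
            # Do not add load to a struggling service; degrade immediately.
            return fallback()
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                # Too many consecutive failures: (re)open the breaker.
                self.opened_at = self.clock()
            return fallback()
        # A success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Once the timeout has passed, the next call is let through as a trial. If it succeeds, normal traffic resumes; if it fails, the breaker opens again for another timeout period.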
4. Controlled Rollouts
Every new feature should have a controlled rollout to a subset of traffic and be gradually ramped up to all customers. Throughout the rollout, its impact on key performance and business metrics should be measured. If a metric degrades, the feature should be ramped down. That way, problems can be ironed out without risking the experience of every customer. A good example of controlled rollouts is my company, Split.
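Here is a minimal sketch of how a percentage rollout with a ramp-down rule might look in Python. This is not Split's API; the function names, the choice of error rate as the metric, and the `tolerance` and `step` values are assumptions made for illustration.

```python
import hashlib


def bucket(user_id: str, feature: str) -> int:
    """Map a user deterministically to a bucket in [0, 100)."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(user_id: str, feature: str, rollout_percent: int) -> bool:
    """A user sees the feature if their bucket falls below the rollout percentage."""
    return bucket(user_id, feature) < rollout_percent


def next_rollout_percent(current: int, baseline_error_rate: float,
                         treatment_error_rate: float, tolerance: float = 0.01,
                         step: int = 10) -> int:
    """Ramp up while the metric holds; ramp down to zero if it degrades."""
    if treatment_error_rate > baseline_error_rate + tolerance:
        return 0
    return min(100, current + step)
```

Because the bucket depends only on the feature name and the user ID, a user who sees the feature at 10% keeps seeing it as the rollout ramps to 50% and beyond, which keeps the experience consistent while you measure.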
How do you ensure graceful degradation in your systems? Have you come across non-engineers whose writings have influenced your thinking about software? Comment below!