Designing Resilient Systems
Cyber-resilience has recently become something of a buzzword, and various companies have begun to co-opt the term to describe things that are, well, not resilience at all. So, what is resilience?
DHS defines cyber resilience as the ability of systems to adapt to changing conditions, withstanding and recovering from disruptions while continuing to supply services. Those changing conditions are usually assumed to stem from a cyber attack, though in practice they may not always.
Resilience can show up in confidentiality, integrity, or availability, depending on the attribute attacked, and many consider it analogous to the five functions of NIST's Cybersecurity Framework: identify, protect, detect, respond, and recover.
NIST Cybersecurity Framework and Resilience
We've defined cyber-resilience, and we have a framework we can refer to in order to understand it more clearly. But what exactly in NIST's framework embodies resilience?
Realistically, all of the guidelines laid out in the framework are meant to help if you're designing resilience into your system.
Ideally, systems would be able to respond naturally to attack or failure. We see this in nature today: chemical gradients in ponds naturally balance out over time. Likewise, heat in a metal plate will spread throughout the plate, eventually becoming uniform. We'd like the same kind of thing to happen in computer systems, but unfortunately they don't work that way. This kind of emergent resilience may be the ideal, but computer systems are discrete, digital things, so the best we can do is emulate it. We can emulate service gradients with enough scale, but at a smaller scale, we just can't. This leads to the major problem with cyber-resilience today: despite what people claim, it's not anything special, and you've seen it before. (That has started to change recently; more on this later.)
How to Implement Resilience
Cyber resilience has, until recently, been implemented in one of two ways: either through redundancy or via Byzantine fault tolerance. And if you're a small company, you didn't want to deal with either one of these. They're expensive, difficult to manage, and hard to justify.
After all, how can you really justify a hot spare for your extranet to your typical CFO? Or two (which is what you really need, as a minimum)? Good luck with that.
Byzantine fault tolerance is even more annoying. It requires redundancy and is complicated to implement. So until recently, you didn't get resilience unless you really, really needed it, you had the pockets to pay for it, and you had the brains to implement it.
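To put rough numbers on that cost, here is a minimal sketch. It assumes the standard results that crash-fault-tolerant consensus (Paxos, Raft) needs at least 2f + 1 replicas to survive f failures, and that classic Byzantine fault tolerant protocols such as PBFT need at least 3f + 1. The function names are made up for illustration, and the majority vote shown is plain redundancy voting, not a Byzantine agreement protocol.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def replicas_needed(faults: int, byzantine: bool = True) -> int:
    """Minimum replica count to tolerate `faults` faulty nodes.

    Assumes the classic bounds: 3f + 1 for Byzantine faults,
    2f + 1 for crash faults.
    """
    if faults < 0:
        raise ValueError("faults must be non-negative")
    return 3 * faults + 1 if byzantine else 2 * faults + 1


def majority_vote(responses: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the answer given by a strict majority of replicas, else None."""
    if not responses:
        return None
    value, count = Counter(responses).most_common(1)[0]
    return value if count > len(responses) / 2 else None


# One crashed node out of three can still be outvoted by the other two.
print(replicas_needed(1, byzantine=False))   # 3
print(replicas_needed(1))                    # 4
print(majority_vote(["ok", "ok", "corrupt"]))  # ok
```

Even tolerating a single fault means paying for three or four copies of the system, which is exactly the conversation you didn't want to have with your CFO.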
Cloud computing is changing all of this.
Enter the Cloud
Cloud computing gives us computational resources at scale that can respond at system speed. Remember, we really can't implement emergent resilience, but we can design engineered resilience. And this is where the NIST framework comes in.
First, you need to identify some kind of basis threat. This is the threat (or realistically, the group of threats) that you'll be resilient against. You should already be doing this, in fact—this kind of thing has been called an attack model in the past, and usually incorporates some notion of an attack surface.
Next, you'll want to put controls in place to protect your system from the identified threats. Nothing new here; you do this now. This is pretty typical run-of-the-mill cybersecurity stuff, right? Making sure you have correctly configured firewalls, anti-malware solutions, that kind of thing.
Now, things become more interesting. We're designing resilience into a system, so we need some way to detect when the system is attacked or under threat. This should take many different forms: you should monitor external threat feeds, internal network traffic, system performance, system state, and overall resource use. You need to detect issues before you can respond to them, after all. Detection like this is more difficult, certainly, but we're making progress. Intrusion prevention systems are improving, and unified reporting tools are making potential problems more visible. This is still a difficult problem, though, and it is a long way from being solved.
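As a toy illustration of the detect step, here is a minimal sketch that watches one metric (say, requests per second) and flags samples that stray too far from a rolling baseline. The class name, window size, and threshold are assumptions chosen for the example; real detection would combine many signals and is far harder than this.

```python
from collections import deque
from statistics import mean, pstdev


class MetricMonitor:
    """Flags samples that deviate sharply from a rolling baseline (hypothetical helper)."""

    def __init__(self, window: int = 30, threshold: float = 3.0,
                 min_samples: int = 5, min_spread: float = 1e-6):
        self.samples = deque(maxlen=window)  # recent "normal" samples
        self.threshold = threshold           # allowed deviation, in standard deviations
        self.min_samples = min_samples       # don't judge until a baseline exists
        self.min_spread = min_spread         # avoids dividing by zero on flat data

    def observe(self, value: float) -> bool:
        """Record a sample; return True if it looks anomalous."""
        anomalous = False
        if len(self.samples) >= self.min_samples:
            baseline = mean(self.samples)
            spread = max(pstdev(self.samples), self.min_spread)
            anomalous = abs(value - baseline) / spread > self.threshold
        if not anomalous:
            # Only normal samples feed the baseline, so an attack
            # in progress doesn't gradually become the new normal.
            self.samples.append(value)
        return anomalous


monitor = MetricMonitor()
for rps in [100, 102, 98, 101, 99, 100, 500, 101]:
    if monitor.observe(rps):
        print(f"possible problem: {rps} requests/sec")
```

Run against the sample traffic above, only the spike to 500 is reported. The same pattern applies to any of the signals listed: resource use, error rates, or traffic volume.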
Next, if you've detected an impending issue, you need to respond. You need to ensure that your system can continue to provide service while under attack or compromised. This is where the cloud comes in.
In the past, we've used standby, isolated systems to continue to provide service. Today, we can spin up new system instances in the cloud. And we can do so without any initial investment, as long as we design things right. By "design things right," I mean that you need a system that is capable of bringing up new instances over a common distributed data repository, for example. You need a system that can handle failures without going down. Once you detect a problem, you can immediately provision new resources to continue to provide services while you recover from the problem.
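The respond step can be sketched roughly like this. It assumes stateless instances that all read from a shared data store, so a replacement can take over the moment it starts. The `provision` callable is a stand-in for whatever your cloud provider's API actually does; the fake version here just hands out names.

```python
import itertools
from typing import Callable, Dict


def make_fake_provisioner(prefix: str = "replacement") -> Callable[[], str]:
    """Hypothetical stand-in for a cloud API call that launches an instance."""
    counter = itertools.count(1)
    return lambda: f"{prefix}-{next(counter)}"


def reconcile(instances: Dict[str, bool], desired: int,
              provision: Callable[[], str]) -> Dict[str, bool]:
    """Drop unhealthy instances and provision replacements up to `desired`.

    `instances` maps instance name -> health flag from your detection layer.
    """
    healthy = {name: True for name, ok in instances.items() if ok}
    missing = max(0, desired - len(healthy))
    for _ in range(missing):
        healthy[provision()] = True
    return healthy


pool = {"web-a": True, "web-b": False, "web-c": True}
pool = reconcile(pool, desired=3, provision=make_fake_provisioner())
print(sorted(pool))  # ['replacement-1', 'web-a', 'web-c']
```

The compromised instance is simply dropped and replaced rather than repaired in place, which is what makes recovery something you can do later, at your own pace.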
What Resilience Looks Like
You want a resilient system? Well, then you want a system that is failing.
The company that has built arguably the most resilient system today is Netflix. They did it by constantly provoking failures in their infrastructure and ensuring that their infrastructure could respond and recover. They do this in production systems, seven days a week, 24 hours a day. And they've released their code, so you can do this too.
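To give a feel for the idea, here is a toy failure-injection loop. This is not Netflix's code, just a sketch under simple assumptions: an in-memory `Cluster` class (invented for this example) stands in for real infrastructure, and each round kills one random instance, checks that requests are still served, and then recovers.

```python
import random
from typing import List, Tuple


class Cluster:
    """Toy in-memory stand-in for a set of service instances."""

    def __init__(self, names: List[str]):
        self.up = {name: True for name in names}

    def live(self) -> List[str]:
        return [name for name, ok in self.up.items() if ok]

    def handle_request(self) -> str:
        """Serve from any live instance; fail if none are left."""
        live = self.live()
        if not live:
            raise RuntimeError("service unavailable")
        return live[0]

    def kill(self, name: str) -> None:
        self.up[name] = False

    def heal(self) -> None:
        for name in self.up:
            self.up[name] = True


def chaos_round(cluster: Cluster, rng: random.Random) -> Tuple[str, bool]:
    """Kill one random live instance and report whether service survived."""
    victim = rng.choice(cluster.live())
    cluster.kill(victim)
    try:
        cluster.handle_request()
        survived = True
    except RuntimeError:
        survived = False
    cluster.heal()  # recover before the next round
    return victim, survived


rng = random.Random(0)  # seeded so runs are repeatable
cluster = Cluster(["a", "b", "c"])
for _ in range(5):
    victim, survived = chaos_round(cluster, rng)
    print(f"killed {victim}: service {'survived' if survived else 'went down'}")
```

A three-instance cluster shrugs off every round, while a single-instance "cluster" fails the very first one. Finding that out on your own schedule, rather than an attacker's, is the whole point.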
So, what is resilience? The ability to function under attack. Do your systems need it? Well, you'd better get to failing.