I don’t often to editorials (and when I do, they tend to ramble), but I felt I’m due and this is a conversation I’ve been having a lot lately. I sit to talk with clients about cloud and one of the first questions I always get is “what is the SLA”? And I hate it.
The fact is that an SLA is an insurance policy. If your vendor doesn’t provide a basic level of service, you get a check. Not unlike my home owners insurance. If something happens, I get a check. The problem is that most of us NEVER want to have to get that check. If my house burns down, the insurance company will replace it. But all those personal mementos, the memories, the “feel” of the house are gone. So that’s a situation I’d rather avoid. What I REALLY want is safety. So install a fire-alarm, I make sure I have an extinguisher in the kitchen, I keep candles away from drapes. I take measures to help reduce the risk that I’ll need to cash my insurance policy.
When building solutions, we don’t want SLA’s. What we REALLY want is availability. So we as the solution owners need to take steps to help us achieve this. We have to weight the cost vs the benefit (do I need an extinguisher or a sprinkler system?) and determine how much we’re wiling to invest in actively working to achieve our own goals.
This is why when I get asked the question, I usually respond by giving them the answer and immediately jump into a discussion about resiliency. What is a service degradation versus an outage? How can we leverage redundancy? Can we decouple components and absorb service disruptions? These are the types of things we architects need to start considering, not just for cloud solutions but for everything we build.
I continue to tell developers that the public cloud is a stepping stone. The patterns we’re using in the public cloud are lessons learned that will eventually get applied back on premises. As the private cloud becomes less vapor and more reality, the ability to think in these new patterns is what will make the next generation of apps truly useful. If a server goes down, how quickly does your load balancer see this and take that server out of rotation? How do the servers shift workloads?
When working towards availability, we need to take several things in mind.
Failures will happen – how we deal with them is our choice. We can have the world stop, or we can figure out how to “degrade” our solution to keep anything we can going.
How are we going to recover – when things return to normal, how does the solution “catch up” with what happened during the disruption
the outage is less important than how fast we react – we need to know something has gone wrong before our clients call to tell us
We (solution/application architects) really need to start changing the conversation here. We need to steer away from SLA’s entirely and when we can’t manage that at least get to more meaningful, scenario based SLA’s. This can mean instead of saying “the email server will be 99% of the time” we switch to “99% of emails will be transmitted within 5 minutes." This is much more meaningful for the end users and also gives s more flexibility in how we achieve it. And depending on how traffic.
Anyway, enough rambling for now. I need to get a deck that discusses this ready for a presentation on Thursday that only about 20 minutes ago I realized I needed to do. Fortunately, I have an earlier draft of the session and definitely have the passion and knowhow to make this happen. So time to get cracking!
Until next time!