You’ve no doubt seen the mathematical operator <=, meaning ‘less than or equal to’. It matters here because it describes how application SLA (Service Level Agreement) metrics relate to the infrastructure underneath them.
In other words, you have to know both the application side and the infrastructure side of the equation, and the application’s SLA will always be less than or equal to the SLA of the underlying infrastructure.
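A quick sketch makes the ‘less than or equal to’ relation concrete: when every layer in a stack must be up for the application to be up, availabilities multiply, so the composite can never exceed any single layer. The layer names and figures below are illustrative, not real vendor SLAs.

```python
# Availability of a serially dependent stack is the product of each
# layer's availability. Values are illustrative examples only.
layers = {
    "application": 0.9999,
    "virtualization": 0.9995,
    "storage": 0.999,
    "network": 0.9999,
}

stack_availability = 1.0
for name, availability in layers.items():
    stack_availability *= availability

print(f"Composite availability: {stack_availability:.4%}")
```

The composite comes out lower than even the weakest single layer, which is exactly why the application side of the equation sits on the small end of the <=.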
SLA Turtles All the Way Down
If you have an application with a 99.99 percent uptime requirement, it has to run on infrastructure that delivers at least 99.99 percent uptime. That SLA needs to be measured from the top layer of the application down, including the RPO (Recovery Point Objective) and RTO (Recovery Time Objective). Making that promise on-premises is a challenge some teams have already learned about the hard way.
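It helps to translate those nines into an actual downtime budget, since ‘99.99 percent uptime’ sounds abstract until you see how few minutes per year it allows. A minimal sketch, using a non-leap year:

```python
# Translate an availability percentage into an annual downtime budget.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year

for uptime in (0.99, 0.999, 0.9999, 0.99999):
    downtime_minutes = MINUTES_PER_YEAR * (1 - uptime)
    print(f"{uptime:.3%} uptime -> {downtime_minutes:.1f} minutes down/year")
```

At 99.99 percent, the entire yearly budget is roughly 53 minutes, which is less time than many teams spend diagnosing a single incident.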
Think that AWS is the answer to your SLA? Look back at the recent S3 outage, which left many organizations explaining why they were hit by service issues despite being in the cloud. Remember that the cloud does not give you automatic resiliency. It does give you the ability to leverage geographically dispersed infrastructure in a resilient manner, but localized issues can still cause significant outages.
Amazon Web Services (@awscloud):
“The dashboard not changing color is related to S3 issue. See the banner at the top of the dashboard for updates.”
That tweet sums it up rather beautifully if you ask me. AWS’s own service status dashboard failed because it was hosted on the very infrastructure that suffered the outage. There is a strange irony in that.
Know Your Underlay
In order to meet your application SLA, you have to architect it on underlying infrastructure with an SLA greater than or equal to the SLA you require for the application itself. Even in the cloud you have to think bigger. It could be a matter of cross-region design, or in some cases an even more resilient underlay design that is cross-cloud.
The important takeaway is that an application’s SLA can only exceed that of any single infrastructure layer if a layer above it spans multiple instances of that service, which gives it a higher effective SLA. In other words, you can combine lower-SLA services at one layer, but by spanning multiple regions, zones, or clouds above it, you reduce that layer’s impact on the total SLA because you’ve removed the single point of failure.
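The redundancy math behind that claim is simple: if the regions fail independently and the service stays up as long as any one region is up, the unavailabilities multiply. A minimal sketch, where the 99.9 percent per-region figure is an assumed example, not a quoted SLA:

```python
def combined_availability(per_region: float, regions: int) -> float:
    """Availability of independent redundant regions where any one
    surviving region keeps the service up: 1 - (1 - a)^n."""
    return 1 - (1 - per_region) ** regions

single = 0.999  # illustrative per-region availability
print(f"1 region:  {combined_availability(single, 1):.4%}")
print(f"2 regions: {combined_availability(single, 2):.5%}")
print(f"3 regions: {combined_availability(single, 3):.7%}")
```

Two 99.9 percent regions combine to 99.9999 percent on paper. The independence assumption is doing heavy lifting here: shared control planes, DNS, or deployment pipelines can correlate failures and erase the benefit.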
Don’t take that as advice to run on spinning disks and think that HA proxy will save you from drive failures. You still have to architect the data to be shared across the boundaries, and the application must be able to recover gracefully when outages and significant slowdowns occur.
This also doesn’t deal with performance in any way. That’s a separate issue altogether. The SLA and associated RPO/RTO metrics are about availability and recoverability during significant business and technical disruptions.
The best advice you can get: build it to fail by design. As the SLA approaches 100 percent availability, the cost and effort to get there rise sharply, like an exponential curve with availability on the X-axis and cost on the Y-axis. Once you build a practice of designing for failure, the application designs that come later will inherit the lessons and services you created for the first ones.
It may seem expensive to design for failure, but how much does failure cost you? I’m betting on the design strategy myself.