Cloud Service Management: From DevOps to SRE (Part 1)

See how cloud hosting has changed reliability and the current state of Site (or Service) Reliability Management—and the role DevOps plays.

As an electrical engineer by training, I was taught reliability Service Level Objectives (SLOs) that were always defined in terms of 5-nines – 99.999% reliability. Engineering firms, especially those in the electrical engineering space, have had reliability objectives at that level for decades. In today’s world of web applications, such an objective is unreasonable. We will discuss why shortly, but first, let’s understand what that level of reliability means. When looking at reliability in the context of service availability, how much downtime does 99.999% availability translate to?

  • Daily: 0.9s
  • Weekly: 6.0s
  • Monthly: 26.3s
  • Yearly: 5m 15.6s

Yes, less than a second per day, only 6 seconds of downtime a week, or just 26.3 seconds of downtime in an entire month!

Now consider a more common availability objective, say 4-nines or 99.99%. It translates to downtime of:

  • Daily: 8.6s
  • Weekly: 1m 0.5s
  • Monthly: 4m 23.0s
  • Yearly: 52m 35.7s

That’s a full order of magnitude lower than the 5-nines reliability we used to talk about in engineering school, yet it still allows just 8.6 seconds of downtime a day, or one minute (and 0.5 seconds) a week! Even the more common 99.95% availability SLO allows a mere 43 seconds a day, or about 5 minutes (5m 2.4s) a week.
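The downtime figures above follow from simple arithmetic: the allowed downtime is the length of the period multiplied by the unavailable fraction (1 minus the availability). Here is a minimal Python sketch of that calculation. The function names are hypothetical, and the period lengths are my assumption (an average year of 365.25 days, with a month taken as one twelfth of that), so the last decimal may differ slightly from the figures quoted above depending on rounding.

```python
# Illustrative sketch only: helper names and period lengths are assumptions,
# not taken from any particular SRE tool.

SECONDS_PER_DAY = 24 * 60 * 60

# Assumed period lengths in seconds (average year of 365.25 days).
PERIOD_SECONDS = {
    "daily": SECONDS_PER_DAY,
    "weekly": 7 * SECONDS_PER_DAY,
    "monthly": 365.25 / 12 * SECONDS_PER_DAY,
    "yearly": 365.25 * SECONDS_PER_DAY,
}


def allowed_downtime(availability_percent):
    """Return allowed downtime in seconds per period for an availability SLO."""
    unavailable_fraction = 1 - availability_percent / 100
    return {
        period: seconds * unavailable_fraction
        for period, seconds in PERIOD_SECONDS.items()
    }


def format_duration(seconds):
    """Format a duration in seconds as 'Xm Y.Ys', or 'Y.Ys' under a minute."""
    minutes, rest = divmod(seconds, 60)
    if minutes:
        return f"{int(minutes)}m {rest:.1f}s"
    return f"{rest:.1f}s"


if __name__ == "__main__":
    for slo in (99.999, 99.99, 99.95):
        print(f"{slo}% availability:")
        for period, seconds in allowed_downtime(slo).items():
            print(f"  {period}: {format_duration(seconds)}")
```

Running the script prints the downtime allowance for the three SLOs discussed here; adding another target (say 99.9%) to the tuple shows how quickly the allowance grows with each nine that is dropped.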

The complexity of delivering the high level of reliability expected of web-based, cloud-hosted systems today (have you ever seen Facebook or Google Search have even a scheduled outage?), combined with the expectation of Continuous Delivery of new features and bug fixes (my mobile phone always has apps that need to be updated – always), has led to the evolution of a totally new field of Reliability Engineering tailored to such systems. Google, which has been a pioneer in this field, calls it Site Reliability Engineering (SRE). While it would be more aptly named Service Reliability Engineering (and still keep the acronym SRE), the name has caught on. The seminal work documenting Google’s approach and practices, the book of the same name (commonly referred to as the ‘SRE book’), has become the de facto standard on how to adopt SRE in an organization. ‘SRE Engineer’ has suddenly become almost as common a title on LinkedIn profiles as ‘DevOps Engineer’ (don’t get me started…).

Going back to the name Site Reliability Engineering, Google does define SRE as ‘Google’s approach to Service Management’. I guess they use the term ‘Site’ given the nature of their core business, but at the end of the day, it is all about Service Reliability Management.

In my three-part blog series on the topic of SRE, I examine SRE in depth, drawing parallels from other fields we are familiar with, and introducing new concepts like Antifragility.

  • In part 1, titled ‘From Apollo 13 to Google SRE’, I examine the origins of SRE. I compare today's SRE to reliability in a more traditional context (like what is required for a space mission like the Apollo Lunar missions). I also introduce the eight core tenets of SRE as defined by Google’s thesis on SRE.
  • In part 2, titled ‘Houston, we have an… outage!’, I examine incident response, comparing the SRE approach to incident management with how the NASA engineers and mission controllers responded to the incident on Apollo 13. Not much is unique, other than the obvious difference between saving the lives of astronauts stranded in space and rebooting servers.
  • In part 3, titled ‘Antifragile: When DevOps met SRE’, I introduce the term ‘Antifragile’ – things that are neither fragile nor robust, but rather thrive in chaos. I propose that adopting SRE requires the systems and services being supported to be Antifragile, and the supporting teams to be structured and trained to support Antifragile systems and services. The goal is not to prevent incidents; it is to minimize their impact and the Mean Time to Repair (MTTR). I introduce readers to Netflix’s Simian Army, which tests and prepares systems and teams to become Antifragile.

So, a question for you, my readers: are you adopting SRE in your organization? Are your systems and services Antifragile? Do share your answers and thoughts on this series in the comments section below.

Learn more about IBM’s approach to Cloud Service Management and Operations (CSMO) and its reference architecture in this GitHub repo, and read more in my friend and colleague Ingo Averdunk’s blog post.
