The Importance of SRE (and How It's Changing)
Today, companies are increasingly turning to the cloud to push services out to multiple geographies and ever-increasing user bases. Scaling up is of course beneficial, but maintaining reliability, security and safety standards at scale presents a significant challenge.
There’s no handbook for delivering a service effectively at scale, but firms would do well to follow the example of the larger companies that have led the way in the cloud. As one of the four big tech companies, also known as GAFA, Google is both a forerunner and a prime example of how to build and run services at scale. A core component of its success was its implementation of, and continued focus on, site reliability engineering (SRE).
A Critical Practice
Many of the principles behind site reliability engineering are borrowed from industries in which the failure of a system might have catastrophic consequences, such as aerospace or defense. The software employed in each of these has to be highly reliable and cannot be prone to failure.
As Google rapidly expanded across the globe, it had to run its services 24x7 in multiple geographies at scale. Having such a vast user base made it the perfect test case for many practices, particularly those that combine high-velocity releases with zero downtime.
For firms expanding services via the cloud, ensuring reliability for customers is essential. A service level agreement incorporating uptime will most likely be in place for most customers, so it’s essential to protect against downtime caused by overstretched systems.
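To make the link between an uptime commitment and tolerable downtime concrete, here is a minimal sketch of the arithmetic. The function name, the 30-day period and the example targets are illustrative assumptions, not figures taken from any particular agreement.

```python
def allowed_downtime_minutes(uptime_target_percent: float, period_days: int = 30) -> float:
    """Return the minutes of downtime an uptime target permits over a period.

    Both the target and the period length are hypothetical inputs; a real
    SLA defines its own measurement window and exclusions.
    """
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - uptime_target_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% uptime over 30 days allows about {minutes:.1f} minutes of downtime")
```

Under these assumptions, a 99.9% target over 30 days leaves roughly 43 minutes of downtime, which shows how little room an overstretched system has before a commitment is breached.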
DevOps and SRE
The adoption of site reliability engineering is inextricably tied to the evolution of DevOps. SRE focuses on a series of development-centric pillars, such as system monitoring and how to deal with failures. One way to gauge the effectiveness of a firm’s SRE practices is to examine how it deals with production incidents that affect the reliability and availability of the services it offers.
Site reliability engineering embraces failure and searches the depths of what happened before, during and after critical incidents so that more resilient systems can be put in place. But with so much data being created by systems today – the Microsoft Azure telemetry platform currently records 10 petabytes of data per day – it’s important to identify and correlate alerts that might signify a critical failure.
Root cause analysis and the post-mortem process are vital stages in the lifecycle of a production incident, but effective diagnosis is wholly dependent on the data that informs these stages.
The Move to AIOps
As with DevOps, automation is a significant part of SRE. Diagnosing where and when incidents occur and automating the remediation process is the ideal scenario for site reliability engineers. Increasingly, AI and machine learning tools are being employed to perform this function, giving rise to AIOps.
When an incident occurs in production, determining the level of threat posed to a system is paramount. No single alert explains what is going wrong, as each is merely a symptom, but no alert should be ignored, as it may be one in a series that indicates imminent critical failure.
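As a rough illustration of why alerts are better read as a series than one at a time, the sketch below groups alerts that arrive close together and flags groups that span several services. It is a simple time-window heuristic, not machine learning and not the method of any specific AIOps product; the `Alert` type, the 60-second window and the two-service threshold are all hypothetical choices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    timestamp: float  # seconds since some reference point (assumed unit)
    service: str      # hypothetical service name, e.g. "db" or "api"
    message: str


def group_alerts(alerts, window_seconds=60.0):
    """Group alerts into bursts: a new group starts when the gap to the
    previous alert exceeds window_seconds."""
    groups = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        if groups and alert.timestamp - groups[-1][-1].timestamp <= window_seconds:
            groups[-1].append(alert)
        else:
            groups.append([alert])
    return groups


def multi_service_groups(groups, min_services=2):
    """Keep only the groups whose alerts come from at least min_services
    distinct services, as a crude signal of a wider failure."""
    return [g for g in groups if len({a.service for a in g}) >= min_services]


if __name__ == "__main__":
    alerts = [
        Alert(0.0, "db", "slow queries"),
        Alert(30.0, "api", "elevated latency"),
        Alert(50.0, "network", "packet loss"),
        Alert(500.0, "api", "single timeout"),
    ]
    for group in multi_service_groups(group_alerts(alerts)):
        print([f"{a.service}: {a.message}" for a in group])
```

In this example the first three alerts fall into one group spanning three services, while the isolated later alert does not, which is the kind of distinction a correlation step is meant to surface.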
AIOps software can observe and determine causal relationships across multiple systems and services, with machine learning algorithms determining how incidents are dealt with.
Vital to this stage is the data that informs the algorithms. Telemetry tools must therefore be tuned to listen to the right noises within systems, so that the algorithms can identify what’s critical and what’s not.
In short, if the reliability of a service is essential, so too is the mitigation of incidents in production. Machine learning and AI tools can analyse metrics, logs, tracing and alerts holistically and predict where and when incidents are likely to occur. They can also correlate separate incidents from networks and databases and determine the root cause.
But automating a response to every incident is simply not possible for most firms, as this would require endless resources. It is also not necessary. This is why firms delivering services at scale must take a nuanced approach and focus on identifying and remediating critical incidents, rather than trying to create flawless systems.
Key Skills for Site Reliability Engineers
The business implications of delivering a poor-quality service have led to SRE practices informing the early development of products, with site reliability engineers being placed within or alongside engineering teams. Following a DevOps-focused model, SREs, rather than an external operations team, become responsible for the availability of their service.
SREs differ from traditional IT systems engineers in that they come from a development background. This typically means they are well equipped to execute incremental change at high velocity across multiple geographies, which requires a high level of sophistication and integration in the cloud. SREs are also able to scale services at a rate that keeps capacity from being exceeded and systems from being overstretched.
Because SRE is a nascent field, it is important to remember that there is no one-size-fits-all approach, as different firms will require different implementations. What works for Google, for example, may not work for other companies, especially those not born in the cloud. What is certain, however, is that SREs and SRE practices will become business-critical in the years to come.