Site Reliability Is More Important Than Ever, Yet Challenges Persist
Site reliability has become more important than ever. But the people responsible for delivering it face persistent challenges.
When Google originally defined the role of site reliability engineer (SRE), it was looking to address an urgent problem: the mission-critical need for high availability and performance of digital services. Too often, organizations lacked a proactive approach to reliability, relying instead on a reactive break-fix model. By adding SREs (engineers who combine expert operational troubleshooting with advanced development skills to make services more robust), companies could take a more strategic approach.
At least, that’s how it’s supposed to work. In practice, many SREs say their real-world activities fall well short of that ideal. If companies want to get the return they expect on site reliability investments, they need to understand where SREs are struggling, and where the processes and tools they rely on are breaking down.
These questions are near and dear to me and those I work with. So much so that my company commissioned an annual survey of SREs to get the inside scoop on their experiences. What are the biggest barriers SREs face in improving digital experiences? How are COVID-19 shutdowns affecting site reliability? Let’s take a closer look at what the people on the front lines are saying.
You Can’t Fix What You Can’t See
To ensure a good digital experience, you have to understand that experience from the customer’s point of view. Ideally, SREs start by examining what end-users encounter when accessing the site out at the edge and working backward from there. If load times look slow, for example, is the problem with the application code? The internet? A third-party component? To find out, all those elements need to be observable. That doesn’t happen automatically. You have to prioritize observability and work proactively to build observable systems.
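As a rough illustration of working backward from the user's experience, the sketch below takes a per-component timing breakdown for a single page load and reports which component accounts for the largest share. The component names and numbers are hypothetical, not survey data, and the function name is made up for this example; real measurements would come from whatever monitoring you already have in place.

```python
def largest_contributor(timings_ms):
    """Return (component, share_of_total) for the slowest component.

    timings_ms maps a component name to the milliseconds it contributed
    to one page load. The component names are illustrative assumptions.
    """
    total = sum(timings_ms.values())
    if total <= 0:
        raise ValueError("timings must sum to a positive number")
    component = max(timings_ms, key=timings_ms.get)
    return component, timings_ms[component] / total


# Hypothetical breakdown of one slow page load, in milliseconds.
sample_load = {
    "application_code": 300,
    "network": 120,
    "third_party": 580,
}

component, share = largest_contributor(sample_load)
print(f"{component} accounts for {share:.0%} of load time")
```

A breakdown like this is only possible when every element in the delivery chain reports timings, which is the point of building observable systems.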
Unfortunately, out in the real world, few companies seem to do that. When asked about the tools they rely on, 71% of SREs said that error rate is the key metric they track. This suggests that companies are making a big assumption about their monitoring tools: that any problem that could affect the customer experience will show up as an error their systems track.
Can you assume that, though? For most companies, the answer is no. The vast majority of web pages, for example, include at least one third-party domain. (The average web page uses nine!) But according to our survey, just 11% of SREs say their automated workflows extend to third-party providers.
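To get a feel for how many third-party domains a page depends on, a minimal sketch like the following can list them from the page's resource URLs. It assumes you already have that list (for example, exported from a browser's network panel), and it uses a deliberately naive definition of "first party": the page's own host, minus a leading `www.`, plus its subdomains. The function name and example URLs are hypothetical.

```python
from urllib.parse import urlsplit


def third_party_hosts(page_url, resource_urls):
    """Return the set of hosts that do not belong to the page's own domain."""
    base = urlsplit(page_url).hostname or ""
    if base.startswith("www."):
        base = base[4:]  # naive first-party domain; an assumption of this sketch

    hosts = set()
    for url in resource_urls:
        host = urlsplit(url).hostname
        if host is None:
            continue  # relative URL, served by the page's own host
        if host == base or host.endswith("." + base):
            continue  # first-party host or subdomain
        hosts.add(host)
    return hosts


# Hypothetical resource list for a page on example.com.
resources = [
    "https://cdn.example.com/app.js",
    "https://fonts.example-fonts.net/body.woff2",
    "https://analytics.example-metrics.io/tag.js",
    "/static/site.css",
]

for host in sorted(third_party_hosts("https://www.example.com/", resources)):
    print(host)
```

Each host in the result is a dependency that error-rate tracking on your own systems will not necessarily cover.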
In Most Cases, the Balance Between Operations and Development Work Isn’t Balanced at All
As the role was originally defined, SREs should spend no more than half their time on operations tasks like troubleshooting and incident response. The rest should be devoted to development work to improve systems. In practice, though, this 50/50 split is a pipe dream. Of the more than 700 SREs we surveyed, the majority spend 75% of their time on operations, leaving little time for proactive development activities.
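A team that wants to check itself against the 50% guideline could compute its operations share from logged hours. The sketch below assumes time is already tracked per category; the category labels, function names, and sample hours are hypothetical and chosen only to reproduce the 75% figure above.

```python
# Assumed labels for operations work; adjust to match your own time tracking.
OPS_CATEGORIES = {"troubleshooting", "incident_response"}


def ops_share(hours_by_category, ops_categories=OPS_CATEGORIES):
    """Return the fraction of logged hours spent on operations work."""
    total = sum(hours_by_category.values())
    if total <= 0:
        raise ValueError("no hours logged")
    ops = sum(h for cat, h in hours_by_category.items() if cat in ops_categories)
    return ops / total


def exceeds_ops_cap(hours_by_category, cap=0.5):
    """True when operations work takes more than the cap (50% by default)."""
    return ops_share(hours_by_category) > cap


# Hypothetical 40-hour week for one SRE.
week = {"troubleshooting": 18, "incident_response": 12, "development": 10}

print(f"ops share: {ops_share(week):.0%}, over cap: {exceeds_ops_cap(week)}")
```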
It should be obvious that the earlier in the process developers start thinking about reliability, the more robust the resulting applications will be. (It’s also a lot less expensive to expand existing reliability attributes of an application than to build them from scratch after the fact.) But more than half of SREs (53%) say they get brought into the development lifecycle too late to make the kind of difference they’re capable of making.
Those missed opportunities carry a cost, making applications more expensive to own and maintain. It doesn’t look like things will get better anytime soon, either. After two and a half months working from home as a result of COVID-19 shutdowns, SREs’ ops-related responsibilities had increased by a net 10%.
Shifting SREs To Remote Work Creates Both Opportunities and Challenges
COVID-19 has created huge disruptions in the way SREs operate—just as it’s disrupted every other aspect of work. As you’d expect, maintaining work/life balance has become a significant challenge for survey respondents who’ve been working from home. Just as troubling is the risk of burnout. More than 40% of SREs said that half their work or more consists of “toil”—mostly manual, repetitive tasks that could be automated.
The news wasn’t all bad. Envisioning what their jobs will look like in a post-pandemic world, 50% of SREs believe they’ll continue working remotely. (Before the pandemic, just one out of five SREs worked remotely.) Two-thirds said they were able to manage incidents just as effectively working from home as in the office. A net +9% of SREs said they were more effective.
Build a Better Future for SREs and the Applications They Support
If organizations want to get the most bang for their SRE buck, they have some work to do. Still, by making some changes in processes and priorities, companies can empower their SREs to work more proactively and effectively.
Based on this year’s survey results, we identified several recommendations to help companies get there, including:
- Put more emphasis on preventive measures and observability: Start by identifying where all your services converge into the user’s digital experience at the point of consumption. Make sure you’re evaluating not just your code, but the networks, infrastructure, and third-party and delivery chain components as well. If you don’t, you’re not getting the full picture.
- Get SREs involved sooner: Take concrete steps to include SREs earlier in the development process. Take the time to evaluate how you can better align SREs and DevOps teams, and identify the barriers SREs face in devoting more time to proactive work.
- Use shutdowns as an opportunity: With so many people struggling with isolation and work/life balance, now is the time to invest in improving employee experience and morale. We’re likely to emerge from this pandemic in a very different world. Embracing an employee-first mentality will go a long way towards helping you attract and retain talent.
These are just some of the highlights. For an in-depth picture of the current state of SREs and detailed conclusions and analysis, see the full report.