DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Please enter at least three characters to search
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Zones

Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks

Because the DevOps movement has redefined engineering responsibilities, SREs now have to become stewards of observability strategy.

Apache Cassandra combines the benefits of major NoSQL databases to support data management needs not covered by traditional RDBMS vendors.

The software you build is only as secure as the code that powers it. Learn how malicious code creeps into your software supply chain.

Generative AI has transformed nearly every industry. How can you leverage GenAI to improve your productivity and efficiency?

Related

  • Top Book Picks for Site Reliability Engineers
  • AI-Driven Kubernetes Troubleshooting With DeepSeek and k8sgpt
  • SRE Best Practices for Java Applications
  • Security Considerations for Observability: Enhancing Reliability and Protecting Systems Through Unified Monitoring and Threat Detection

Trending

  • DZone's Article Submission Guidelines
  • The End of “Good Enough Agile”
  • MCP Servers: The Technical Debt That Is Coming
  • Rust, WASM, and Edge: Next-Level Performance
  1. DZone
  2. Culture and Methodologies
  3. Methodologies
  4. Creating an SRE Practice: Why and How

Creating an SRE Practice: Why and How

Explore who SREs are, what they do, key philosophies shared by successful SRE teams, and how to start migrating your operations teams to the SRE model.

By 
Greg Leffler user avatar
Greg Leffler
·
Nov. 21, 22 · Analysis
Likes (3)
Comment
Save
Tweet
Share
8.8K Views

Join the DZone community and get the full member experience.

Join For Free

This is an article from DZone's 2022 Performance and Site Reliability Trend Report.

For more:


Read the Report

Site reliability engineering (SRE) is the state of the art for ensuring services are reliable and perform well. SRE practices power some of the most successful websites in the world. In this article, I'll discuss who site reliability engineers (SREs) are, what they do, key philosophies shared by successful SRE teams, and how to start migrating your operations teams to the SRE model.

Who Are SREs?

SREs operate some of the busiest and most complex systems in the world. There are many definitions for an SRE, but a good working definition is a superhuman-merged engineer who is a skilled software engineer and a skilled operations engineer. Each of these roles alone are difficult to hire, train, and retain — and finding people who are good enough at both roles to excel as SREs is even harder. In addition to engineering responsibilities, SREs also require a high level of trust, a keen eye for software quality, the ability to handle pressure, and a little bit of thrill-seeking (in order to handle being on call, of course). 

The Variance in SREs

There are many different job descriptions that are used when hiring SREs. The prototypical example of SRE hiring is Google, who has SRE roles in two different job families: operations-focused SREs and "software engineer" SREs. The interview process and career mobility for these two roles is very different despite both roles having the SRE title and similar responsibilities on the job. 

In reality, most people are not equally skilled at operations work and software engineering work. Acknowledging that different people have different interests within the job family is likely the best way to build a happy team. Offering a mix of roles and job descriptions is a good idea to attract a diverse mix of SRE talent to your team. 

What Do SREs Do?

As seen in Figure 1, the SRE's work consists of five tasks, often done cyclically, but also in parallel for several component services. 

Figure 1: SRE responsibility cycle 

Depending on the size and maturity of the company, the roles of SRE vary, but at most companies, they are responsible for these elements: architecture, deployment, operations, firefighting, and fixing. 

Architect Services

SREs understand how services actually operate in production, so they are responsible for helping design and architect scalable and reliable services. These decisions are generally sorted into design-related and capacity-related decisions. 

Design Considerations

This aspect focuses on reviewing the design of new services and involves answering questions like: 

  • Is a new service written in a way that works with our other services?
  • Is it scalable? 
  • Can it run in multiple environments at the same time? 
  • How does it store data/state, and how is that synchronized across other environments/regions?
  • What are its dependencies, and what services depend on it? 
  • How will we monitor and observe what this service does and how it performs?

Capacity Considerations

In addition to the overall architecture, SREs are tasked with figuring out cost and capacity requirements. To determine these requirements, questions like these are asked: 

  • Can this service handle our current volume of users?
  • What about 10x more users? 100x more users?
  • How much is this going to cost us per request handled? 
  • Is there a way that we can deploy this service more densely? 
  • What resource is bottlenecking this service once deployed? 

Operate Services 

Once the service has been designed, it must be deployed to production, and changes must be reviewed to ensure that those changes meet architecture goals and service-level objectives. 

Deploy Software

This part of the job is less important in larger organizations that have adopted a mature CI/CD practice, but many organizations are not yet there. SREs in these organizations are often responsible for the actual process of getting binaries into production, performing a canary deployment or A/B test, routing traffic appropriately, warming up caches, etc. At organizations without CI/ CD, SREs will generally also write scripting or other automation to assist in this deployment process. 

Review Code

SREs are often involved in the code review process for performance-critical sections of production applications as well as for writing code to help automate parts of their role to remove toil (more on toil below). This code must be reviewed by other SREs before it is adopted across the team. Additionally, when troubleshooting an on-call issue, a good SRE can identify faulty application code as part of the escalation flow or even fix it themselves. 

Firefight

While not glamorous, firefighting is a signature part of the role of an SRE. SREs are an early escalation target when issues are identified by an observability or monitoring system, and SREs are generally responsible for answering calls about service issues 24/7. Answering one of these calls is a combination of thrilling and terrifying: thrilling because your adrenaline starts to kick in and you are "saving the day" — terrifying because every second that the problem isn't fixed, your customers are unhappy. SREs answering on-call pages must identify the problem, find the problem in a very complicated system, and then fix the problem either on their own or by engaging a software engineer. 

Figure 2: The on-call workflow

For each on-call incident, SREs must identify that an issue exists using metrics, find the service causing the issue using traces, then identify the cause of the issue using logs.

Fix, Debrief, and Evaluate Incidents

As the on-call incidents described above are stressful, SREs have a strong interest in making sure that incidents do not repeat. This is done through post-incident reviews (sometimes called "postmortems"). During these sessions, all stakeholders for the service meet and figure out what went wrong, why the service failed, and how to make sure that the exact failure never happens again. 

Not listed above, but sometimes an SRE's responsibility is building and maintaining platforms and tooling for developers. These include source code repositories, CI/CD systems, code review platforms, and other developer productivity systems. In smaller organizations, it is more likely that SREs will build and maintain these systems, but as organizations grow, these tasks generally grow in scale to where it makes sense to have a separate (e.g., "developer productivity") team handle them. 

SRE Philosophies

One of the most common questions asked is how SREs differ from other operations roles. This is best illustrated through SRE philosophies, the most prevalent of which are listed below. While any operations role will likely embrace at least some of these, only SREs embrace them all. 

  • "Just say no" to toil
    • Toil is the enemy of SREs and is described as "tedious, repetitive tasks associated with running a production environment" by Google. You eliminate toil by automating processes so that manual work is eliminated.
    • One philosophy around toil held by many SREs is to try to "automate yourself out of a job" (though there will always be new services to work on, so you never quite get there).
  • Cattle, not pets
    • In line with reducing toil and increasing automation, an important philosophy for SREs is to treat servers, environments, and other infrastructure as disposable. Small organizations tend to take the opposite approach — treating each element of the application as something precious, even naming it. This doesn't scale in the long run. 
    • A good SRE will work to have the application's deployment fully automated so that infrastructure and code are stored in the same repositories and deploy at the same time, meaning that if the entire existing infrastructure was blown away, the application could be brought back up easily. 
  • Uptime above all 
    • Customer-facing downtime is not acceptable. The storied "five nines" of uptime (less than six minutes down per year) should be a baseline expectation for SREs, not a maximum. Services must have redundancy, security, and other defenses so that customer requests are always handled.  
  • Errors will happen
    • Error budgeting and the use of service-level indicators is the secret sauce behind delivering exceptional customer facing uptime. By accepting some unreliability in services and architecting their dependent services to work around this, customer experience can be maintained.  
  • Incidents must be responded to
    • An incident happening once sometimes happens. The same incident happening twice is beyond the pale. A thorough, blameless post-incident review process is essential to the goal of steadily increasing reliability and performance over time.

How to Migrate an Ops Team to SRE

Moving from a traditional operations role to an SRE practice is challenging and often seems overwhelming. Small steps add up to big impact. Adopting SRE philosophies, advancing the skill set of your team, and acknowledging that mistakes will occur are three things that can be done to start this process. 

Adopt SRE Philosophies

The most important first step is to adopt the SRE philosophies mentioned in the previous section. The one that will likely have the fastest payoff is to strive to eliminate toil. CI/CD can do this very well, so it is a good starting point. If you don't have a robust monitoring or observability system, that should also be a priority so that firefighting for your team is easier. 

Start Small: Uplevel Expectations and Skills

You can't boil the ocean. Everyone will not magically become SREs overnight. What you can do is provide resources to your team (some are listed at the end of this article) and set clear expectations and a clear roadmap to how you will go from your current state to your desired state. 

A good way to start this process is to consider migrating your legacy monitoring to observability. For most organizations, this involves instrumenting their applications to emit metrics, traces, and logs to a centralized system that can use AI to identify root causes and pinpoint issues faster. The recommended approach to instrument applications is using OpenTelemetry, a CNCFsupported open-source project that ensures you retain ownership of your data and that your team learns transferable skills.

Acknowledge There Will Be Mistakes

Downtime will likely increase as you start to adopt these processes, and that must be OK. Use of SRE principles described in this article will ultimately reduce downtime in the long run as more processes are automated and as people learn new skills. In addition to mistakes, accepting some amount of unreliability from each of your services is also critical to a healthy SRE practice in the long run. If the services are all built around this, and your observability is on-point, your application can remain running and serving customers without the unrealistic demands that come with 100 percent uptime for everything. 

Conclusion

SRE, traditionally, merges application developers with operations engineers to create a hybrid superhuman role that can do anything. SREs are difficult to hire and retain, so it's important to embrace as much of the SRE philosophy as possible. By starting small with one app or part of your infrastructure, you can ease the pain associated with changing how you develop and deploy your application. The benefits gained by adopting these modern practices have real business value and will enable you to be successful for years to come. 

Resources: 

  • Site Reliability Engineering, Google
  • "How to Run a Blameless Postmortem," Atlassian
  • Implementing Service Level Objectives, Alex Hidalgo  

This is an article from DZone's 2022 Performance and Site Reliability Trend Report.

For more:


Read the Report

Site reliability engineering

Opinions expressed by DZone contributors are their own.

Related

  • Top Book Picks for Site Reliability Engineers
  • AI-Driven Kubernetes Troubleshooting With DeepSeek and k8sgpt
  • SRE Best Practices for Java Applications
  • Security Considerations for Observability: Enhancing Reliability and Protecting Systems Through Unified Monitoring and Threat Detection

Partner Resources

×

Comments
Oops! Something Went Wrong

The likes didn't load as expected. Please refresh the page and try again.

ABOUT US

  • About DZone
  • Support and feedback
  • Community research
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 100
  • Nashville, TN 37211
  • support@dzone.com

Let's be friends:

Likes
There are no likes...yet! 👀
Be the first to like this post!
It looks like you're not logged in.
Sign in to see who liked this post!