What Is Site Reliability Engineering and Why You Should Embrace It

Site Reliability Engineering encompasses upkeep tasks for your website's overall health. See why it's crucial enough to warrant a dedicated role.

By Matt Watson · Dec. 15, 17 · Opinion

Software developers spend a lot of time chasing bugs and putting out production fires. I've been a software developer for over 15 years, and it has always been part of the job. Thanks to agile development, we are constantly shipping new code. The by-products of constant change are performance problems, software defects, and other issues that eat up our time.

Web applications that receive even a modest amount of traffic require constant care and feeding. This includes overseeing deployments, monitoring overall performance, reviewing error logs, and troubleshooting software defects.

These tasks have traditionally been handled by a mixture of lead developers, development management, system administrators, and, more often than not, nobody. The problem is that these critical tasks lacked a clear owner ... until now.

What Is Site Reliability Engineering?

Site reliability engineering (SRE) empowers software developers to own the ongoing daily operation of their applications in production. The goal is to bridge the gap between the development team that wants to ship things as fast as possible and the operations team that doesn't want anything to blow up in production.

In many organizations, you could argue that site reliability engineering eliminates much of the IT operations workload related to application monitoring. It shifts that responsibility to the development team itself.

"Fundamentally, it's what happens when you ask a software engineer to design an operations function." - Niall Murphy, Google

Site reliability engineers typically spend up to 50% of their time dealing with the daily care and feeding of software applications. They spend the rest of their time writing code like any other software developer would.

A key strength of a site reliability engineer is a deep understanding of the application: the code, how it runs, how it is configured, and how it scales. That knowledge is what makes them so valuable at monitoring and supporting it.

Some of the typical responsibilities of a site reliability engineer:

  • Proactively monitor and review application performance
  • Handle on-call and emergency support
  • Ensure software has good logging and diagnostics
  • Create and maintain operational runbooks
  • Help triage escalated support tickets
  • Work on feature requests, defects and other development tasks
  • Contribute to overall product roadmap

History of site reliability engineering

The concept of site reliability engineering started in 2003 within Google. As Google continued to grow and scale to become the massive company they are today, they encountered many of their own growing pains. Their challenge was how to support large-scale systems while also introducing new features continuously.

To accomplish the goal, they created a new role that had the dual purpose of developing new features while also ensuring that production systems ran smoothly. Site reliability engineering has grown significantly within Google and most projects have site reliability engineers as part of the team. Google now has over 1,500 site reliability engineers.

Site reliability engineering vs DevOps

So, I know what you are thinking ... how does site reliability engineering compare to DevOps?

Traditionally, DevOps has been more about collaboration between developers and operations, with a particular focus on deployments. Site reliability engineering is more focused on operations and monitoring. Depending on how you define DevOps, the two may or may not overlap.

At Stackify, we have hundreds of servers and we don't even have an IT operations team. So when I think of DevOps, I actually think about the functions of site reliability engineering. I believe other companies like us, which were born in the cloud and rely heavily on PaaS, will also see site reliability engineering as the missing element in their development team's success. We effectively operate as a NoOps team.

For larger companies, or companies that don't use the cloud, I could see both DevOps and site reliability engineering being used. DevOps practices can help IT rack, stack, configure, and deploy the servers and applications. The site reliability engineers can then handle the daily operation of the applications. They also act as a fast feedback loop to the entire team about how the application is performing and running in production.

Site reliability engineering skills

The type of skills needed will vary wildly based on your type of application, how and where it is deployed, and how it is monitored. At Stackify, most of our applications are deployed to Azure PaaS with a little PowerShell. In-depth knowledge of Windows or Linux systems management isn't much of a priority for us. We live in a pretty serverless world at Stackify. However, it may be really critical to your team depending on how your applications are deployed.

The other key skills for a good site reliability engineer are more focused on application monitoring and diagnostics. You want to hire people who are good problem solvers and have a knack for tracking down the root cause of an issue. Experience with application performance management tools like Retrace, New Relic, and others would be really valuable. They should be well versed in application logging best practices and exception handling.
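
To make "logging best practices and exception handling" a little more concrete, here is a minimal sketch in Python using only the standard library. It is not taken from Stackify or any particular tool; the names `JsonFormatter`, `build_logger`, `charge`, and the `order_id` field are all hypothetical, chosen only for illustration. The idea it shows is one common practice: emit one structured record per event, attach context such as an identifier, and log the full stack trace when an exception is caught instead of swallowing it.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Context passed through `extra=` shows up as attributes on the record.
        if hasattr(record, "order_id"):
            payload["order_id"] = record.order_id
        # Include the stack trace when the record carries exception info.
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(stream):
    """Create a logger (hypothetical name 'orders') that writes JSON to `stream`."""
    logger = logging.getLogger("orders")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


def charge(order_id, amount, logger):
    """Hypothetical business operation that logs both success and failure."""
    try:
        if amount <= 0:
            raise ValueError(f"invalid amount: {amount}")
        logger.info("charge succeeded", extra={"order_id": order_id})
        return True
    except ValueError:
        # logger.exception logs at ERROR level and attaches the traceback.
        logger.exception("charge failed", extra={"order_id": order_id})
        return False
```

Calling `charge("A2", -5, build_logger(sys.stderr))` would write a single JSON line containing the error level, the order identifier, and the traceback, which is the kind of record that is easy to search when troubleshooting in production.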

The future of site reliability engineering

Software developers are increasingly taking a larger role in deployments, production operations, and application monitoring. The tools available today make it extremely easy to deploy our applications and monitor them. Things like PaaS and application monitoring solutions like Retrace make it easy for developers to own their projects from ideation all the way to production.

I believe that IT operations will always exist in most medium to large enterprises. But I believe their type of work will continue to change because of the cloud, PaaS, containers, and other technologies. I previously wrote about divvying up the development and operations tasks.

Summary

As a developer who has been writing code for over 15 years, I feel like I have always been a site reliability engineer; I just didn't have the job title. In the future, I think every team will have site reliability engineers who take ownership of production operations. Thanks to the cloud and application monitoring tools like Retrace, there has never been a better time to be a software developer.

If you want to learn more about site reliability engineering, you can check out the free online book from the Google team.


Published at DZone with permission of Matt Watson, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
