DZone
Site Reliability Engineering (SRE) 101 With DevOps vs SRE

SRE is the practice and cultural shift towards creating a robust IT operation process that would instill stability, high performance, and scalability.

By Sunny Raskar · Updated Jun. 15, 20 · Tutorial

Consider the Scenario Below

An independent software vendor (ISV) developed a financial application for a global investment firm that serves global conglomerates, leading central banks, asset managers, broking firms, and governmental bodies. The development strategy for the application encompassed a well-thought-out DevOps plan with cutting-edge agile tools.

This ensured zero-downtime deployment at maximum productivity. The app handled financial transactions in real time at an enormous scale while safeguarding sensitive customer data and facilitating an uninterrupted workflow. One unfortunate day, the application crashed, and the investment firm suffered a severe backlash (financial and reputational) from its customers.

Here is the backstory: the application's workflow exchange had crossed its transaction threshold, and the lack of a responsive remedial action crippled the infrastructure. The intelligent automation brought forth by DevOps was confined mainly to the development and deployment environments. IT operations thus remained susceptible to such challenges.

Decoupling DevOps and RunOps — The Genesis of Site Reliability Engineering (SRE)

A decade or two ago, companies operated with a legacy IT mindset. IT operations consisted mostly of administrative jobs without automation. At that time, code writing, application testing, and deployment were all done manually. Somewhere around 2008-2010, automation started gaining prominence.

Dev and Ops then began working concurrently towards continuous integration and continuous deployment, backed by the agile software movement. The production team was mainly in charge of the runtime environment. However, it lacked the skill sets to manage IT operations, which resulted in application instability, as depicted in the scenario above.

Thus, DevOps and RunOps were decoupled, paving the way for SRE – a preventive technique to infuse stability into IT operations.

"Site Reliability Engineering is the practice and a cultural shift towards creating a robust IT operation process that would instill stability, high performance, and scalability to the production environment."

Software-First Approach: Brain Stem of SRE

“SRE is what happens when you ask a software engineer to design an operations team,” says Benjamin Treynor Sloss of Google. This means an SRE function is run by IT operations specialists who code. These specialist engineers implement a software-first approach to automate IT operations and preempt failures.

They apply cutting-edge software practices to integrate Dev and Ops on a single platform and execute test code across the continuous environment. They therefore need advanced skills, including DNS configuration; remediating server, network, and infrastructure problems; and fixing application glitches.

The software approach codifies every aspect of IT operations to build resiliency within infrastructure and applications. Thus, changes are managed via version control tools and checked for issues using test frameworks, while following the principle of observability.
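As a rough illustration of codified operations, the sketch below treats a set of DNS records as version-controlled data and checks it with an automated test before it is applied. The record layout, the `validate_dns_records` helper, and the validation rules (fully qualified names, a TTL between 60 and 86400 seconds, a parseable IP address) are assumptions made for this example and are not taken from any particular tool.

```python
import ipaddress


def validate_dns_records(records):
    """Return a list of human-readable problems found in hypothetical A/AAAA records."""
    problems = []
    for record in records:
        name = record.get("name", "")
        ttl = record.get("ttl", 0)
        value = record.get("value", "")

        # Assumed rule: names are fully qualified and end with a dot.
        if not name.endswith("."):
            problems.append(f"{name!r}: name must be fully qualified")

        # Assumed rule: TTL must be between one minute and one day.
        if not 60 <= ttl <= 86400:
            problems.append(f"{name!r}: ttl {ttl} is outside 60..86400")

        # The value must parse as an IPv4 or IPv6 address.
        try:
            ipaddress.ip_address(value)
        except ValueError:
            problems.append(f"{name!r}: {value!r} is not a valid IP address")
    return problems


good = [{"name": "app.example.com.", "ttl": 300, "value": "192.0.2.10"}]
bad = [{"name": "app.example.com", "ttl": 5, "value": "not-an-ip"}]

print(validate_dns_records(good))       # no problems reported
print(len(validate_dns_records(bad)))   # one problem per broken rule
```

A check like this could run in a test framework on every commit, so that a faulty change is caught in review rather than in production.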


The Principle of Error Budget

SRE engineers verify the code quality of changes in the application by asking the development team to produce evidence in the form of automated test results. SRE managers can set Service Level Objectives (SLOs) to gauge the performance of changes in the application. They should also set a threshold for the maximum permissible application downtime, known as the Error Budget. If the downtime caused by a change stays within the Error Budget, the SRE team can approve it. If not, the change should be rolled back and improved until it fits within the Error Budget.

"Error Budgets tend to bring balance between SRE and application development by mitigating risks. An Error Budget remains unaffected as long as system availability stays within the SLO. The Error Budget can always be adjusted by managing the SLOs or enhancing the IT operability. The ultimate goal remains application reliability and scalability."
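The approve-or-roll-back rule described above can be sketched as a small decision function. This is a minimal sketch: the function name `review_change`, the use of hours as the unit, and the sample numbers are assumptions for illustration only.

```python
def review_change(downtime_hours: float, error_budget_hours: float) -> str:
    """Approve a change if its downtime fits in the error budget, else roll it back."""
    if downtime_hours < 0 or error_budget_hours < 0:
        raise ValueError("downtime and budget must be non-negative")
    # Downtime that exactly uses up the budget is still treated as within it.
    if downtime_hours <= error_budget_hours:
        return "approve"
    return "roll back"


# Hypothetical figures: a 10-hour budget and two candidate changes.
print(review_change(downtime_hours=2.5, error_budget_hours=10.0))
print(review_change(downtime_hours=12.0, error_budget_hours=10.0))
```

In practice the budget would be tracked over a period, with each approved change reducing what is left for later ones.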

Calculating Error Budget

A simple formula to calculate Error Budget is (System Availability Percentage) minus (SLO Benchmark Percentage). Please refer to the System Availability Calculator below.

[Figure: availability cheat sheet]

Illustration

Suppose the system availability is 95% and your SLO threshold is 80%.

Error Budget: 95% - 80% = 15%

| Availability | SLA/SLO Target | Error Budget | Error Budget per Month (30 days) | Error Budget per Quarter (90 days) |
|---|---|---|---|---|
| 95% | 80% | 15% | 108 hours | 324 hours |

Error Budget per month: 108 hours. (A 5% share of a 24-hour day is 1.2 hours, so a 15% budget is 1.2 * 3 = 3.6 hours per day. Over 30 days, that is 30 * 3.6 = 108 hours.)

Error Budget per quarter: 108 * 3 = 324 hours.
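The arithmetic in this illustration can be reproduced with a short script. This is a sketch of the formula as stated in this article (availability minus SLO target); the function names are made up for the example, and other sources define the error budget differently, so treat the formula as this article's convention.

```python
def error_budget_percent(availability_pct: float, slo_pct: float) -> float:
    """Error budget as defined in this article: availability minus SLO target."""
    return availability_pct - slo_pct


def error_budget_hours(budget_pct: float, period_days: int) -> float:
    """Convert a percentage budget into hours over a period of whole days."""
    return budget_pct * period_days * 24 / 100


budget = error_budget_percent(95, 80)
print(budget)                          # 15
print(error_budget_hours(budget, 30))  # 108.0 hours per 30-day month
print(error_budget_hours(budget, 90))  # 324.0 hours per 90-day quarter
```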

Quick Trivia – Breaking monolithic applications lets us derive SLOs at a granular level.

Cultural Shift: A Right Step Towards Reliability and Scalability

Popular SRE engagement models all require additional hiring: Kitchen Sink, a.k.a. “Everything SRE” (a dedicated SRE team); Infrastructure (a back-end managed service); and Embedded (pairing an SRE engineer with one or more developers). These models tend to build dedicated teams that lead to a ‘Silo’ SRE environment.

The problem with the Silo environment is that it promotes a hands-off approach, which results in a lack of standardization and coordination between teams. So, a sensible approach is to shelve the project-oriented mindset and allow SRE to grow organically within the whole organization. It starts by apprising the teams of customer principles and instilling a data-driven method for ensuring application reliability and scalability.

Organizations must identify a change agent who will create and promote a culture of maximum system availability. This person can champion the change by practicing the principle of observability, of which monitoring is a subset. Observability essentially requires engineering teams to be vigilant about common and complex problems that hinder the attainment of reliability and scalability in the application. See the principles of observability below.
[Figure: pyramid of observability]

The principle of observability follows a cyclical approach, which ensures maximum application uptime.

Step Zero – Unlocking Potential of Pyramid of Observability

Step zero is making employees aware of end-to-end product detail – technical and functional. Until an operational specialist knows what to observe, the subsequent steps in the pyramid of observability remain futile.

Also, remember that this culture shift isn’t achievable overnight – it will be successful when practiced sincerely over a few months.

DevOps vs. SRE

People often confuse SRE with DevOps. DevOps and SRE are complementary practices to drive quality in the software development process and maintain application stability.

Let’s analyze four fundamental differences between DevOps and SRE.

| Parameter | DevOps | SRE |
|---|---|---|
| Monitoring vs. Remediation | DevOps typically deals with the pre-failure situation. It ensures conditions that do not lead to system downtime. | SRE deals with the post-failure situation. It needs a postmortem for root cause analysis. The main aim is to ensure maximum uptime and weed out failures for long-term reliability. |
| Role in Software Development Life Cycle (SDLC) | DevOps is primarily concerned with the efficient development and effective delivery of software products. It must ensure Zero Downtime Deployment (ZDD). It also needs to identify blind spots within infrastructure and application. | SRE manages IT operations efficiently once the application is deployed. It must ensure maximum application uptime and stability within the production environment. |
| Speed and Cost of Incremental Change | DevOps is all about rolling out new updates/features, faster release cycles, quicker deployment, continuous integration, and continuous development. The cost of achieving all this isn’t of significance. | SRE is all about instilling resilience and robustness in the new updates/features. However, it expects small changes at frequent intervals. This gives more room to measure those changes and adopt corrective measures in case of a possible failure. The bottom line is efficient testing and remediation to bring down the cost of failure. |
| Key Measurements | DevOps’ measurement plan revolves around CI/CD. It tends to measure process improvements and workflow productivity to maintain a quality feedback loop. | SRE regulates IT operations with specific parameters like Service Level Indicators (SLIs) and Service Level Objectives (SLOs). |
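To make the SLI and SLO parameters named above more concrete, here is a minimal sketch that measures an availability SLI and compares it with an SLO target. Defining the SLI as the share of successful requests is an assumption for this example (the article does not prescribe a specific indicator), and the function names and sample numbers are hypothetical.

```python
def availability_sli(successful_requests: int, total_requests: int) -> float:
    """Availability SLI in percent, assumed here to be successful / total requests."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return 100 * successful_requests / total_requests


def meets_slo(sli_pct: float, slo_pct: float) -> bool:
    """True when the measured indicator is at or above the objective."""
    return sli_pct >= slo_pct


# Hypothetical sample: 9,950 of 10,000 requests succeeded, measured against a 99% SLO.
sli = availability_sli(9_950, 10_000)
print(sli)                   # 99.5
print(meets_slo(sli, 99.0))  # True
```

The SLI is what gets measured and the SLO is the target it is held to, which is why the conclusion below speaks of SLI-driven SLOs.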

Conclusion – SRE Teams as Value Center

A software product is expected to deliver uninterrupted services. The ideal and optimal condition is maximum uptime with 24/7 service availability. This requires unmatched reliability and ultra-scalability.

Therefore, the right mindset is to treat SRE teams as a value center that combines customer-facing skills with sharp technical acumen. Lastly, for SRE to be successful, it is necessary to create SLI-driven SLOs, augment capabilities around cloud infrastructure, establish smooth inter-team coordination, and drive automation and AI within IT operations.


Published at DZone with permission of Sunny Raskar. See the original article here.

Opinions expressed by DZone contributors are their own.
