Implementing SLAs, SLOs, and SLIs: A Practical Guide for SREs

Explore definitions along with how SLAs, SLOs, and SLIs help in effective monitoring and maintaining system performance.

By Karthigayan Devan · Jun. 13, 24 · Analysis

As digital transformation moves more applications into cloud environments every day, monitoring and maintaining them becomes challenging, and teams need proper metrics in place to measure performance and take action. This is where SLAs, SLOs, and SLIs come into the picture: they support effective monitoring and help maintain system performance.

Defining SLA, SLO, SLI, and SRE

What Is an SLA? (Commitment)

A Service Level Agreement is an agreement between the cloud provider and the client/user about measurable metrics, such as uptime. It is normally handled by the company's legal department in line with business and legal terms. It covers all the factors that form part of the agreement and the consequences if the provider fails to meet them, such as credits or penalties. SLAs mostly apply to paid services, not free ones.

What Is an SLO? (Objective)

A Service Level Objective is an objective the cloud provider must meet to satisfy the agreement made with the client. It states the expectation for a specific individual metric (e.g., availability) that the provider must meet. This helps improve the overall service quality and reliability that clients receive.

What Is an SLI? (How Did We Do?)

A Service Level Indicator is the actual measurement used to check compliance with an SLO. It gives a quantified view of the service's performance (e.g., 99.92% availability).

Who Is an SRE?

A Site Reliability Engineer is an engineer who focuses on minimizing gaps between software development and operations. The role is related to DevOps, which focuses on identifying those gaps. An SRE creates and uses automation tools to monitor and observe software reliability in production environments.

In this article, we will discuss the importance of SLOs, SLIs, and SLAs and how a Site Reliability Engineer (SRE) can implement them for production applications.

Implementation of SLOs and SLIs

Let’s assume we have an application service that is up and running in a production environment. The first step is to determine what an SLO should be and what it should cover.

Example of SLOs

  • SLO = Target 
    • Above this target, GOOD
    • Below this target, BAD: Needs an action item
      • When setting a target, do not aim for 100% reliability. It is not achievable in practice, because patches, deployments, downtime, etc. will cause the service to miss it most of the time. This is where the Error Budget (EB) comes into the picture. EB is the maximum amount of time that a service can fail without contractual consequences.

For example:

  • SLA = 99.99% uptime
    • EB = 52 mins and 36 secs per year, or 4 mins and 23 secs per month; the system can be down for this long without consequences.

The next step is to decide how to measure this SLO. This is where the SLI comes into the picture: it is an indicator of the level of service that you are providing.
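As a quick check on the error budget arithmetic, here is a minimal Python sketch. The function name `error_budget` and the choice of a 365.25-day year are assumptions made for illustration; they do not come from any particular tool. For a 99.99% target, it works out to roughly 52 minutes 36 seconds per year and about 4 minutes 23 seconds per month.

```python
from datetime import timedelta


def error_budget(target_percent: float, period: timedelta) -> timedelta:
    """Return the downtime allowed by an availability target over a period.

    target_percent is the availability target, e.g. 99.99.
    """
    if not 0 < target_percent <= 100:
        raise ValueError("target_percent must be in the range (0, 100]")
    # Fraction of the period during which the service may be unavailable.
    allowed_fraction = 1 - target_percent / 100
    return timedelta(seconds=period.total_seconds() * allowed_fraction)


# Assumption: an average year of 365.25 days, and a month as 1/12 of it.
YEAR = timedelta(days=365.25)
MONTH = YEAR / 12

yearly = error_budget(99.99, YEAR)
monthly = error_budget(99.99, MONTH)
print("Yearly error budget:", yearly)
print("Monthly error budget:", monthly)
```

The same function can be reused for other targets or periods, for example `error_budget(99.9, timedelta(days=30))` for a rolling 30-day window.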

Example of SLIs

  • HTTP request success rate = No. of successful requests / total requests
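The ratio above can be sketched in a few lines of Python. The function name `success_ratio_sli` is hypothetical, and treating a window with no traffic as 100% is an assumption; some teams prefer to report "no data" instead.

```python
def success_ratio_sli(successful: int, total: int) -> float:
    """Return the percentage of requests that succeeded.

    Assumption: a window with no requests counts as 100% (nothing failed).
    """
    if successful < 0 or total < 0 or successful > total:
        raise ValueError("counts must satisfy 0 <= successful <= total")
    if total == 0:
        return 100.0
    return 100.0 * successful / total


# Example: 9,992 successful requests out of 10,000.
sli = success_ratio_sli(9_992, 10_000)
print(f"Success-rate SLI: {sli:.2f}%")
```

In practice, the two counts would come from load balancer logs or application metrics for the measurement window the SLO is defined over.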

Common SLI Metrics

  • Durability
  • Response time
  • Latency
  • Availability
  • Error rate
  • Throughput

Leverage automation, monitoring, and reporting tools (e.g., Prometheus, Grafana) to check SLIs and detect deviations from SLOs in real time.
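Monitoring tools handle this with their own query and alerting features. To illustrate only the underlying idea, here is a minimal in-process sketch that keeps a sliding window of recent request outcomes and flags when the measured SLI drops below the SLO. The class name `SlidingWindowSLI` and its methods are hypothetical and not part of any monitoring product.

```python
from collections import deque


class SlidingWindowSLI:
    """Track the success rate over the most recent requests (illustrative only)."""

    def __init__(self, slo_percent: float, window_size: int) -> None:
        if window_size <= 0:
            raise ValueError("window_size must be positive")
        self.slo_percent = slo_percent
        # A deque with maxlen discards the oldest outcome automatically.
        self.outcomes = deque(maxlen=window_size)

    def record(self, success: bool) -> None:
        """Record the outcome of one request."""
        self.outcomes.append(success)

    def current_sli(self) -> float:
        """Return the success percentage over the window (100.0 if empty)."""
        if not self.outcomes:
            return 100.0
        return 100.0 * sum(self.outcomes) / len(self.outcomes)

    def is_breaching(self) -> bool:
        """Return True when the measured SLI is below the SLO target."""
        return self.current_sli() < self.slo_percent


monitor = SlidingWindowSLI(slo_percent=99.0, window_size=100)
for _ in range(100):
    monitor.record(True)
healthy = monitor.is_breaching()      # all 100 requests succeeded

monitor.record(False)
monitor.record(False)
degraded = monitor.is_breaching()     # window now holds 98 successes, 2 failures
```

A count-based window keeps the example short; a time-based window is closer to how SLOs are usually stated, but needs timestamps for each outcome.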

Category | SLO | SLI
Availability | 99.92% uptime per month | X% of the time the app is available
Latency | 92% of requests with response time under 240 ms | X ms average response time for user requests
Error rate | Less than 0.8% of requests result in errors | X% of requests that fail

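Here is a minimal sketch of checking measured SLIs against the three SLO targets above. All names (`Slo`, `percent_under`, `evaluate`) are hypothetical. One assumption to note: to compare against the 92% latency target, the sketch measures the share of requests under 240 ms, not the average response time.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class Slo:
    name: str
    target: float            # target value, in percent
    higher_is_better: bool   # True for availability, False for error rate

    def is_met(self, measured: float) -> bool:
        """Compare a measured SLI value against this objective."""
        if self.higher_is_better:
            return measured >= self.target
        # "Less than X%" objectives use a strict comparison.
        return measured < self.target


def percent_under(latencies_ms: Sequence[float], threshold_ms: float) -> float:
    """Return the percentage of requests faster than threshold_ms."""
    if not latencies_ms:
        return 100.0
    fast = sum(1 for value in latencies_ms if value < threshold_ms)
    return 100.0 * fast / len(latencies_ms)


# Targets taken from the table above.
SLOS = [
    Slo("availability", 99.92, higher_is_better=True),
    Slo("latency_under_240ms", 92.0, higher_is_better=True),
    Slo("error_rate", 0.8, higher_is_better=False),
]


def evaluate(measured: Dict[str, float]) -> Dict[str, bool]:
    """Return, for each SLO, whether the measured SLI meets it."""
    return {slo.name: slo.is_met(measured[slo.name]) for slo in SLOS}


# Made-up sample measurements, for illustration only.
report = evaluate({
    "availability": 99.95,
    "latency_under_240ms": percent_under([100, 200, 300, 150], 240),
    "error_rate": 0.9,
})
```

With these sample values, only the availability objective is met; the latency and error-rate entries would each call for an action item.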
Challenges

  • SLA: SLAs are normally written by business or legal teams with no input from technical teams, so key aspects that need to be measured are missed.
  • SLO: Objectives are either not measurable or too broad to calculate.
  • SLI: There are too many metrics, and they are captured and calculated inconsistently. This creates a lot of effort for SREs and yields little benefit.

Best Practices

  • SLA: Involve the technical team when the company's business/legal team and the provider write the SLA. This helps the agreement reflect the actual technical scenarios.
  • SLO: Keep it simple and easily measurable, so it is easy to check whether we are in line with the objectives.
  • SLI: Define a standard set of metrics to monitor and measure. This helps SREs check the reliability and performance of the services.

Conclusion

Implementing SLAs, SLOs, and SLIs should be part of the system requirements and design, and it should be subject to continuous improvement. SREs need to understand and take responsibility for how the systems serve the business needs, and take the necessary measures to minimize the impact of failures.

