Understanding the relationships between SLO, SLI, and SRE

An SLI is a measure of compliance with an SLO, which means there is no SLI without an SLO. This article looks into the importance of SLIs and SLOs in SRE and how to implement them.

Alireza C

Nov. 15, 21 · Opinion

Even after delivering a project to a client, the software engineer’s job is not complete. The next phase is ensuring service reliability. In site reliability engineering (SRE) practice, there are two key concepts that the engineer should know: service level objectives (SLOs) and service level indicators (SLIs).

This article looks at why SLOs and SLIs matter in SRE and how to implement them.

What are service level objectives?

A service level objective is a target for a specific metric, such as uptime or response time. In other words, SLOs are the individual promises a service provider makes to the client, and they set expectations for the service. SLOs also give the IT and DevOps teams a goal to measure themselves against, so they can see how well they are performing.

A service may have more than one SLO, and SLOs apply to paying customers, non-paying customers, and even internal clients in the same organization. For example, when a customer-facing team uses tools provided by another team in the same organization, the two teams need clearly defined service level objectives so that the customer-facing team can meet its contractual obligations.

For an SLO to be effective, it must not be vague, overly complicated, or impossible to measure. Only the relevant SLOs should appear in the SLA, and they should be spelled out in plain language for clarity. It is also essential to factor in other issues, such as delays caused by the client.

Using an online service that is called by clients as an example, SLOs can cover system availability, how long it takes for a request to get a response, the error rate (how often an error is encountered, expressed as a fraction of requests), and the number of requests the service can handle per second.
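As a rough illustration, the four kinds of objective listed above could be written down as data. This is only a sketch: the SLO class, its field names, and every target value are hypothetical and chosen for the example, not taken from any real SLA or monitoring tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """One measurable objective (hypothetical structure for illustration)."""
    name: str
    target: float
    unit: str
    higher_is_better: bool  # True for availability, False for latency

    def is_met(self, measured: float) -> bool:
        # Compare a measured value against the target in the right direction.
        if self.higher_is_better:
            return measured >= self.target
        return measured <= self.target


# Illustrative targets only; none of these numbers come from a real agreement.
EXAMPLE_SLOS = [
    SLO("availability", 99.95, "percent", True),
    SLO("p95_latency", 300.0, "milliseconds", False),
    SLO("error_rate", 0.001, "fraction of requests", False),
    SLO("throughput", 500.0, "requests per second", True),
]
```

Writing each objective with an explicit unit and direction is one way to keep it from being vague or impossible to measure.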

What are service level indicators?

An SLI is a measure of compliance with an SLO. This means there is no SLI without an SLO.

Returning to the online service example, if the service level agreement (SLA) promises availability of 99.95 percent, then your SLO is 99.95 percent. Your SLI is then the actual availability reported by your system.

If your SLI is at or above 99.95 percent, then you have met your obligation to your client. While 100 percent availability is not possible, the goal is to get as close as possible.
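The comparison between SLI and SLO can be sketched in a few lines. The example below assumes availability is measured as the share of successful requests, which is only one possible definition (time-based uptime is another); the function names and the request counts are made up for illustration.

```python
SLO_TARGET_PERCENT = 99.95  # the availability figure used in the example above


def availability_sli(successful_requests: int, total_requests: int) -> float:
    """Availability as the percentage of requests that succeeded.

    Assumption: request-based availability. The counts would normally come
    from your monitoring system.
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return 100.0 * successful_requests / total_requests


def meets_slo(sli_percent: float, slo_percent: float = SLO_TARGET_PERCENT) -> bool:
    """The obligation is met when the measured SLI is at or above the SLO."""
    return sli_percent >= slo_percent


# Hypothetical counts for one measurement window.
sli = availability_sli(successful_requests=999_600, total_requests=1_000_000)
print(f"SLI = {sli:.2f}%, SLO met: {meets_slo(sli)}")
```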

Two of the challenges with SLIs are choosing the relevant metrics to track and tracking them as accurately as possible. Tracking metrics just because you can, and not because they matter to the client, is a waste of resources.

How does SRE benefit from SLOs and SLIs?

Having excellent and practical SLOs and SLIs is fundamental to seamlessly transitioning from development to operations. SLOs help the team prioritize their work, while SLIs indicate areas where attention is needed to meet client expectations.

Now that you know what SLOs and SLIs are, we will look at best practices for implementing them to improve your SRE practice.

Best practices for SLOs and SLIs

When formulating your SLOs within your SLA, it is important to pay attention to these points:

Take customers’ expectations into account

When drafting your SLA, it is important to know what your customers expect from your service or product. Once you understand what matters to your clients, your team can craft objectives that are practical and that the customer can work with.

Use the plainest language possible in your SLA

Your client might not read the document in your presence, where they could ask you for clarification. If any part of your SLA, including the SLOs, is ambiguous, you and your client will probably disagree about expectations down the line.

Not every metric is an SLO

You will avoid a lot of trouble by limiting your SLOs to the practical and essential ones. Use as few SLOs as possible; do not cram in as many as you can to show off your metric-tracking capabilities.

Don’t promise the moon even if you can deliver it

While setting your SLOs, you do not need to promise clients your total capacity. For example, if your system can maintain an uptime of 99.99 percent, you do not have to set your SLO at 99.99 percent. It is better to leave yourself some wiggle room by underpromising and overdelivering. This way, you can absorb unforeseen issues that affect the service you provide.
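To see how much wiggle room a slightly lower target buys, it helps to convert each availability percentage into the downtime it permits. The sketch below assumes a 30-day measurement window, which is an assumption made for the example and not something a given SLA necessarily uses; the function name is likewise hypothetical.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of downtime an availability target permits over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1.0 - slo_percent / 100.0)


# Compare the capacity figure from the text (99.99) with more modest targets.
for target in (99.99, 99.95, 99.9):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% availability allows about {minutes:.1f} minutes per 30 days")
```

Under these assumptions, a 99.99 percent target leaves a little over 4 minutes of downtime per 30 days, while 99.95 percent leaves about 21.6 minutes, which is the room available for handling unforeseen issues.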

Have a sound disaster recovery plan

Before committing to an SLO, prepare a detailed plan for what to do when your SLI drops below your SLO. Without one, the response will be uncoordinated and will waste your team’s time instead of fixing the problem.
