SLOs and why you should care

Ever wondered what all the fuss over Service Level Objectives (SLOs) is about? Let’s find out.

Published in Solaris Engineering · 5 min read · Sep 8, 2021

Service Level Agreements (SLAs) tend to be a constant center of attention within a company, because these carefully drafted contractual obligations with customers or clients must never be breached. Breaching an SLA typically means penalties for the organization: financial ones, as well as an adverse impact on its reputation.

The way to stay clear of such a breach is to set Service Level Objectives (SLOs). These are realistic, sometimes borderline ambitious, continuously monitored performance targets set internally across an organisation. To arrive at a valid SLO, you need to really understand what you’re measuring and whether it is a good indicator of quality within your product. This carefully identified and selected unit of measurement is referred to as the Service Level Indicator (SLI). An SLI can also be described as a carefully defined quantitative measure of some aspect of the level of service that is provided.

A good example of an indicator to measure (SLI) would be the availability of a product, and the target objective (SLO) would be ensuring your availability is always greater than 99.9% (availability > 99.9%). This is the general uptime of your product/service and is usually referred to by the number of 9s that show up: 5 nines = 99.999% and 3 nines = 99.9%. Get the gist? In this example, if your SLO target is availability > 99.9%, it would mean you are expecting to have no more than 8.76 hours of product downtime in a year, or 43.8 minutes of downtime per month. Anything more and you’ll be missing your SLOs and potentially crossing into breaching your SLAs. You can play with various availability targets and related uptime expectations at https://uptime.is.
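To make the arithmetic behind those downtime figures concrete, here is a minimal Python sketch. The function name `allowed_downtime_hours` and the 365-day year are our own assumptions for illustration, not part of any particular tool.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap year of 8760 hours


def allowed_downtime_hours(availability_pct: float, period_hours: float) -> float:
    """Return the downtime budget, in hours, that a target leaves over a period."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return period_hours * (100 - availability_pct) / 100


# Three nines (99.9%) over a year, then averaged per month.
yearly_hours = allowed_downtime_hours(99.9, HOURS_PER_YEAR)
monthly_minutes = yearly_hours / 12 * 60

print(f"{yearly_hours:.2f} hours per year")        # 8.76 hours per year
print(f"{monthly_minutes:.1f} minutes per month")  # 43.8 minutes per month
```

Swapping in 99.999 for the target shows how quickly the budget shrinks as you add nines.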

The advantage of having legitimate and ambitious SLOs in place is that they set a higher performance/quality target than your SLAs, minus the contractual obligations. As long as you’re within your SLOs, there’s a protective buffer that ensures your SLA “never” gets breached. In case you do miss your internal SLOs, reliable monitoring should be in place so there is ample time to rectify the situation before it turns into an SLA breach.
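One way to picture this buffer is as three zones that a measured value can fall into. The sketch below is only an illustration: the function name `classify_availability` and the example targets (an SLO of 99.9% and an SLA of 99.5%) are hypothetical values we picked, not real contractual numbers.

```python
def classify_availability(measured_pct: float, slo_pct: float, sla_pct: float) -> str:
    """Place a measured availability into one of three zones.

    Assumes the internal SLO is at least as strict as the external SLA.
    """
    if slo_pct < sla_pct:
        raise ValueError("the SLO should be at least as strict as the SLA")
    if measured_pct >= slo_pct:
        return "within SLO"
    if measured_pct >= sla_pct:
        # The buffer zone: time to react before the contract is affected.
        return "SLO missed, SLA still met"
    return "SLA breached"


# Hypothetical targets: SLO 99.9%, SLA 99.5%.
for measured in (99.95, 99.7, 99.2):
    print(measured, "->", classify_availability(measured, slo_pct=99.9, sla_pct=99.5))
```

The middle zone is where monitoring and alerting earn their keep: the internal target is missed, but the contractual one is not.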

Ok great … and why should I care?

In a nutshell, SLOs:

are internal performance/quality targets that you can clearly measure and quantitatively improve on as an organization
ensure you have a higher performance/quality target than your SLAs and keep your end customers and/or clients happy in the process
ensure you can confidently set SLAs for contractual obligations, since SLAs will be a lower quality target than your SLOs
give you clearly defined thresholds at which teams might need to react
help teams truly prioritize when the need to focus on quality and performance of their tooling/service outweighs delivering new features
transparently communicate org-wide what a tooling/service can deliver and what quality to expect

Hmm … sounds interesting … where to start?

Finding the right Service Level Indicators can be a pain. We’ve been there. Fortunately, there are 4 main indicators that form the basis for building strong SLOs: latency, traffic, errors and saturation. They are referred to as the 4 golden signals in the Google SRE handbook.

Latency

This is how long it takes your product/service to return a response to a request and is usually represented in milliseconds (ms). You guessed it: less is definitely better in this case! Clients don’t like to wait too long.

At Solarisbank, we closely monitor each and every API endpoint and have alerts configured to ensure they are meeting our SLO targets.
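As a rough illustration of checking a latency target, the sketch below computes a nearest-rank percentile over a handful of response times. This is not how our monitoring is implemented; the sample values, the `percentile` helper and the 300 ms target at the 95th percentile are all assumptions made up for the example.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical response times for one endpoint, in milliseconds.
latencies_ms = [120, 95, 110, 480, 105, 130, 99, 101, 250, 115]

SLO_P95_MS = 300  # assumed target: 95% of requests answered within 300 ms

p95 = percentile(latencies_ms, 95)
meets_slo = p95 <= SLO_P95_MS
print(f"p95 = {p95} ms, meets SLO: {meets_slo}")
```

A percentile is used here rather than an average because a single slow outlier, like the 480 ms request above, can hide behind a healthy-looking mean.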

Traffic

Does RPS, or spelled out, requests per second, ring a bell? Even if not, this indicator gives you insight into how much demand your system is under. Depending on the type of system, you might be dealing with other metrics such as network I/O rate, so you need to find out what fits best for your use case. For most client-facing systems, RPS is the way to go.
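If you have request timestamps at hand, a first RPS figure is just a count divided by a window length. The following is a minimal sketch under that assumption; the function name `requests_per_second` and the timestamps are hypothetical.

```python
def requests_per_second(timestamps, window_start, window_seconds):
    """Average request rate over the window [window_start, window_start + window_seconds)."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    window_end = window_start + window_seconds
    in_window = sum(1 for t in timestamps if window_start <= t < window_end)
    return in_window / window_seconds


# Hypothetical request arrival times, in seconds since some reference point.
arrivals = [0.1, 0.4, 0.9, 1.2, 1.3, 2.5, 9.9, 10.0, 12.0]

rps = requests_per_second(arrivals, window_start=0, window_seconds=10)
print(f"{rps} requests per second")
```

In practice a metrics system would usually compute this for you from a request counter; the point here is only what the number means.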

Errors

This is the fraction of all requests for which your service failed to return a valid response, and it can be visualised as a fraction or a percentage. It is also commonly measured the other way around using the success rate, focussing instead on the fraction of valid successful responses.
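Both views come from the same two counters. Here is a small sketch, assuming you can count total and failed requests over some period; the function names and the request counts are invented for the example.

```python
def error_rate(total_requests: int, failed_requests: int) -> float:
    """Fraction of requests that did not get a valid response."""
    if total_requests < 0 or failed_requests < 0:
        raise ValueError("counts must not be negative")
    if failed_requests > total_requests:
        raise ValueError("failed_requests cannot exceed total_requests")
    if total_requests == 0:
        return 0.0  # assumption: no traffic counts as no errors
    return failed_requests / total_requests


def success_rate(total_requests: int, failed_requests: int) -> float:
    """The same measurement seen from the other side."""
    return 1.0 - error_rate(total_requests, failed_requests)


# Hypothetical counts for one day.
total, failed = 20_000, 15
print(f"error rate:   {error_rate(total, failed):.3%}")
print(f"success rate: {success_rate(total, failed):.3%}")
```

How to treat a period with zero requests is a design choice; returning 0.0 here is one option, and skipping the period entirely is another.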

Saturation

How can you tell which bits of your system are under a higher than normal workload and which resources might be constrained? Some examples of things to be on the lookout for are CPU utilization or network bandwidth. Having a good handle on this gives you the ability to prioritize resources or more evenly distribute workload across your setup.
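A simple way to think about saturation is utilization: how much of a resource is in use compared to its capacity. The sketch below flags resources above a threshold; the resource names, the numbers and the 80% threshold are assumptions chosen for illustration.

```python
def saturated_resources(usage, threshold=0.8):
    """Return {resource: utilization} for every resource at or above the threshold.

    `usage` maps a resource name to a (used, capacity) pair in the same unit.
    """
    flagged = {}
    for name, (used, capacity) in usage.items():
        if capacity <= 0:
            raise ValueError(f"capacity for {name!r} must be positive")
        utilization = used / capacity
        if utilization >= threshold:
            flagged[name] = utilization
    return flagged


# Hypothetical snapshot of one host.
snapshot = {
    "cpu_cores": (7.2, 8),
    "network_mbps": (300, 1000),
    "memory_gb": (27, 32),
}

for resource, utilization in saturated_resources(snapshot).items():
    print(f"{resource} is at {utilization:.0%} of capacity")
```

Here the CPU and memory are flagged while the network is not, which hints at where to add resources or rebalance workload first.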

So to kick SLOs off, a good strategy is to start with only one of these four, ideally one that you already have metrics for or at least have a concept of how to extract metrics for, and then work your way up.

Happy SLO’ing!

We hope this gave you an idea of what to look out for and how introducing SLOs can improve the quality of your products for your clients. It did for us at Solarisbank, and we have tons of metrics to ensure we have a good grasp on our SLOs and, in essence, our SLAs with our partners.

We packed quite a few Site Reliability Engineering (SRE) principles into this post. If you want to take a deeper dive, all you need can be found in the Google SRE E-book, which is free.

Written by Manny Acquah
