The concept of service levels is a key part of Site Reliability Engineering (SRE), and it is explained well in the SRE book.
As the book puts it:
It’s impossible to manage a service correctly, let alone well, without understanding which behaviors really matter for that service and how to measure and evaluate those behaviors. To this end, we would like to define and deliver a given level of service to our users, whether they use an internal API or a public product.
We use intuition, experience, and an understanding of what users want to define service level indicators (SLIs), objectives (SLOs), and agreements (SLAs). These measurements describe basic properties of metrics that matter, what values we want those metrics to have, and how we’ll react if we can’t provide the expected service. Ultimately, choosing appropriate metrics helps to drive the right action if something goes wrong, and also gives an SRE team confidence that a service is healthy.
However, SLAs, SLOs, and SLIs can seem confusing at first. In this blog post and the accompanying YouTube video, I explain what each term means and how they differ.
Watch the video
To explain this, I use the analogy of a restaurant. Suppose you manage a quick-service restaurant whose USP is that food reaches the table within 20 minutes.
With this analogy we can define the key terms.
SLA: The Promise
Imagine walking into a restaurant. The menu you’re handed is akin to a Service Level Agreement (SLA). It’s a promise from the restaurant to you, the customer. The menu states what you can expect — quality of food, service speed, and the ambiance. In the IT world, an SLA is similar. It’s a formal agreement between a service provider and the end user that defines the level of service expected, like system uptime or response time.
To define an SLA, you need:
- Service Provider
- Service User
- Service
- Metric
The service provider delivers a service to the service user and commits to meeting the defined metrics over a period of time. The service user has to agree to this offer. Once it is formalized, the agreement is called an SLA.
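As a rough illustration, the four ingredients above can be captured in a small record. This is only a sketch: the class name `ServiceLevelAgreement`, its fields `target` and `period`, and the restaurant values are my own assumptions for the analogy, not a standard format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceLevelAgreement:
    """Hypothetical record of what an SLA pins down."""
    provider: str   # who delivers the service
    user: str       # who consumes it and agrees to the terms
    service: str    # what is being delivered
    metric: str     # what is measured
    target: float   # committed value for the metric (assumed field)
    period: str     # window the commitment applies to (assumed field)


# The restaurant promise from the analogy, written as an SLA record.
restaurant_sla = ServiceLevelAgreement(
    provider="Restaurant",
    user="Diner",
    service="Meal served at the table",
    metric="minutes from order to table",
    target=20,
    period="per order",
)

print(restaurant_sla)
```

Writing the agreement down this way makes it obvious when one of the pieces is missing, for example a metric without a target.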
SLO: The Kitchen’s Objective
Now, think about what’s happening in the kitchen. The chefs have their own goals, or Service Level Objectives (SLOs). These are targets set to achieve the promises made in the menu. If the menu promises a 20-minute meal prep time, the kitchen’s SLO might be to prepare each dish within 10 minutes.
This internal goal ensures they meet or exceed the customer’s expectations.
In IT services, SLOs are the specific goals set by a provider to achieve the standards set in the SLA.
SLI: Measuring the Service
Service Level Indicators (SLIs) are the metrics used to measure the service’s performance against the SLO. Back in our restaurant, an SLI would be the actual time taken to prepare each meal. If meals are consistently prepared within 10 minutes, the restaurant is meeting its SLO. If they take 15 minutes, the kitchen is missing its SLO, although the 20-minute promise to the customer still holds.
In IT, SLIs could be the actual uptime percentage or the average response time of a system.
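To make this concrete, here is a minimal sketch of computing an SLI from raw measurements and comparing it with an SLO. The prep times, the helper name `fraction_within_target`, and the 90% goal are assumptions I made up for illustration. Only the 10-minute objective comes from the restaurant example.

```python
def fraction_within_target(measurements, target):
    """Return the share of measurements at or below the target (the SLI)."""
    if not measurements:
        raise ValueError("need at least one measurement")
    good = sum(1 for value in measurements if value <= target)
    return good / len(measurements)


# Hypothetical prep times in minutes for ten orders.
prep_times = [8, 9, 12, 7, 10, 15, 9, 11, 6, 10]

SLO_TARGET_MINUTES = 10  # kitchen objective from the restaurant example
SLO_GOAL = 0.90          # assumed: 90% of dishes should meet the objective

sli = fraction_within_target(prep_times, SLO_TARGET_MINUTES)
slo_met = sli >= SLO_GOAL

print(f"SLI: {sli:.0%} of dishes ready within {SLO_TARGET_MINUTES} minutes")
print(f"SLO met: {slo_met}")
```

With these sample numbers, 7 of the 10 dishes are ready within 10 minutes, so the SLI is 70% and the assumed 90% goal is not met. In an IT setting, the same function could be fed request latencies instead of prep times.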
Error Budgets
No restaurant is perfect. There will be days when things don’t go as planned — a delayed dish, for instance. This is where the concept of an Error Budget comes in. It’s the allowable margin of error while still keeping customers happy. If the restaurant’s error budget for delayed meals is 5%, as long as they keep the delays under this threshold, they’re okay. In IT, an Error Budget is the amount of time services can be down without breaching the SLA. It’s a crucial part of reliability engineering, allowing teams to balance innovation and reliability.
The error budget is essentially the amount of time a service is allowed to be unavailable or not meeting the SLO before it breaches the SLA.
To calculate the error budget, we need to understand the difference between the committed Service Level Agreement (SLA) and the target Service Level Objective (SLO).
In our case:
- SLA is 95% availability.
- SLO is 99% availability.
First, let’s calculate the total allowed downtime for both:
Total Allowed Downtime for SLA (95% Availability)
- Yearly: 365 days × 5% = 18.25 days/year
- Monthly: 30 days × 5% = 1.5 days/month = 36 hours/month
- Weekly: 7 days × 5% = 0.35 days/week = 8.4 hours/week
- Daily: 24 hours × 5% = 1.2 hours/day
Total Allowed Downtime for SLO (99% Availability)
- Yearly: 365 days × 1% = 3.65 days/year
- Monthly: 30 days × 1% = 0.3 days/month = 7.2 hours/month
- Weekly: 7 days × 1% = 0.07 days/week = 1.68 hours/week
- Daily: 24 hours × 1% = 0.24 hours/day = 14.4 minutes/day
Now, the error budget is the difference between the SLA and SLO downtimes:
- Yearly Error Budget: 18.25 days (SLA) − 3.65 days (SLO) = 14.6 days/year
- Monthly Error Budget: 36 hours (SLA) − 7.2 hours (SLO) = 28.8 hours/month = 1.2 days/month
- Weekly Error Budget: 8.4 hours (SLA) − 1.68 hours (SLO) = 6.72 hours/week
- Daily Error Budget: 1.2 hours (SLA) − 0.24 hours (SLO) = 0.96 hours/day (57.6 minutes)
This error budget represents the amount of additional downtime you can afford without breaching the SLA, given that you’re aiming to meet the SLO. It’s a buffer that allows for some level of imperfection in service delivery while still maintaining the overall agreement with the client or user.
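The arithmetic above can be reproduced with a few lines of Python. This sketch follows the definition used in this post (SLA downtime minus SLO downtime). Other sources, including the SRE book, measure the error budget against the SLO alone. The function names and the 30-day month are my own assumptions.

```python
# Period lengths in hours; a 30-day month is assumed, as in the lists above.
HOURS_PER_PERIOD = {"year": 365 * 24, "month": 30 * 24, "week": 7 * 24, "day": 24}


def allowed_downtime_hours(availability_pct, period_hours):
    """Downtime permitted in a period at the given availability percentage."""
    return period_hours * (100 - availability_pct) / 100


def sla_slo_buffer_hours(sla_pct, slo_pct, period_hours):
    """Extra downtime between missing the SLO and breaching the SLA."""
    return (allowed_downtime_hours(sla_pct, period_hours)
            - allowed_downtime_hours(slo_pct, period_hours))


SLA_PCT, SLO_PCT = 95, 99

for name, hours in HOURS_PER_PERIOD.items():
    buffer = sla_slo_buffer_hours(SLA_PCT, SLO_PCT, hours)
    print(f"{name:>5}: {buffer:.2f} hours")
```

For the yearly period this gives 350.4 hours, which is the 14.6 days worked out by hand. Changing `SLA_PCT` and `SLO_PCT` lets you check the buffer for your own commitments.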
Availability Calculations
Availability is typically expressed as the percentage of time a service is up over a given period, most often a year.
The table below illustrates the permissible downtime for specific availability percentages, assuming continuous system operation.
Service level agreements commonly use monthly downtime or availability figures to determine service credits, aligning with monthly billing cycles.
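If you want to generate such figures yourself, the following sketch prints the permissible downtime per year and per month for a handful of availability levels. The list of levels, the 365-day year, the 30-day month, and the helper names are assumptions for illustration.

```python
HOURS_PER_YEAR = 365 * 24
HOURS_PER_MONTH = 30 * 24  # assumed 30-day month


def downtime_minutes(availability_pct, period_hours):
    """Minutes of downtime permitted in a period at the given availability."""
    return period_hours * 60 * (100 - availability_pct) / 100


def humanize(minutes):
    """Format a duration in the largest convenient unit."""
    if minutes >= 24 * 60:
        return f"{minutes / (24 * 60):.2f} days"
    if minutes >= 60:
        return f"{minutes / 60:.2f} hours"
    return f"{minutes:.2f} minutes"


# Assumed availability levels; pick the ones relevant to your service.
LEVELS = [90, 95, 99, 99.9, 99.99, 99.999]

print(f"{'Availability':<14}{'Per year':<16}{'Per month':<16}")
for pct in LEVELS:
    per_year = humanize(downtime_minutes(pct, HOURS_PER_YEAR))
    per_month = humanize(downtime_minutes(pct, HOURS_PER_MONTH))
    print(f"{str(pct) + '%':<14}{per_year:<16}{per_month:<16}")
```

The monthly column is the one to look at when an SLA ties service credits to a monthly billing cycle.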
Conclusion
Understanding SLA, SLO, SLI, and Error Budgets is crucial in any service-oriented industry, whether it’s a bustling restaurant or a cloud service provider. These components work together to ensure customer satisfaction and service efficiency. Just like a well-run restaurant that keeps customers coming back, a well-managed service with clear SLAs, SLOs, SLIs, and a reasonable Error Budget will maintain customer trust and business success.