The Multi-Agent OS for SRE & DevOps

SLOs, SLIs, SLAs, and Error Budgets: The Complete Guide

Reliability is not a vibe; it is a number with a budget attached. This is the definitive 2026 guide to service level objectives: what an SLO, SLI, SLA, and error budget each mean precisely, how to choose the right indicator, how to pick a target without chasing impossible nines, how error budgets and burn-rate alerting work, what a team actually does when the budget is spent, why your SLO should be stricter than your SLA, the mistakes that quietly waste the whole exercise, and a 90-day plan to roll SLOs out for real.

By Dr. Samson Tanimawo | Updated 2026 | 23 min read
[Image: Nova AI Ops dashboard showing SLO compliance and error-budget burn rate across services]

SLO, SLI, SLA, error budget: the one-paragraph mental model

Four terms get used interchangeably and they should not be. Here is the whole relationship in one breath: you measure the service with an SLI, you aim that measurement at an SLO, the gap between the SLO and 100% is your error budget, and the SLA is the looser version you put in a contract. Get those four in the right order and almost everything else about reliability engineering falls into place.

Let us define each precisely. An SLI (service level indicator) is a quantitative measure of some aspect of the service, almost always expressed as a ratio of good events to valid events: the proportion of requests that returned successfully, or the proportion served faster than a latency threshold. It is a number between 0 and 100%, and it answers the question "how are we doing right now?"

An SLO (service level objective) is a target value or range for an SLI, measured over a window. "99.9% of requests succeed over a rolling 28 days" is an SLO. It is the internal line you hold yourself to. It is a promise to yourselves, not to customers, which is exactly why it can be honest and strict.

An SLA (service level agreement) is a contract with a customer that includes one or more SLOs plus the consequences of missing them, usually service credits or refunds. The SLA is the legal, external commitment. It is almost always looser than your internal SLO, for reasons covered in the section on SLOs vs SLAs below.

An error budget is the complement of the SLO: it is the amount of unreliability you are permitted before you breach the objective. If the SLO is 99.9%, the error budget is 100% - 99.9% = 0.1%. That small slice of allowed failure is not waste; it is the fuel that lets you ship changes, run experiments, and survive the dependencies you do not control. The error budget is the bridge between "we want it reliable" and "we want to move fast," because it turns that tension into a single spendable number.

The mental model in one line: SLIs measure, SLOs target, error budgets are the room between the target and 100%, and SLAs are the contract. Everything reliability teams argue about reduces to defending the SLI, choosing the SLO, and spending the budget wisely.

Choosing the right SLI

The SLI is where most SLO programs succeed or fail, because if you measure the wrong thing the rest of the machinery faithfully optimizes the wrong thing. A good SLI tracks what your users actually experience, moves when they suffer, and stays flat when they do not. The classic formulation is a ratio:

SLI = good events / valid events. For availability, that is successful responses divided by all valid responses. For latency, it is requests served under a threshold divided by all valid requests. Expressing every SLI this way makes the math, the budget, and the alerting consistent across very different signals.
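A minimal sketch of that ratio in Python. The function name `sli` and the sample counts are illustrative assumptions, not part of any particular tool:

```python
def sli(good_events: int, valid_events: int) -> float:
    """Return the SLI as a fraction in [0, 1]: good events over valid events."""
    if valid_events <= 0:
        raise ValueError("an SLI needs at least one valid event")
    if not 0 <= good_events <= valid_events:
        raise ValueError("good events must be between 0 and valid events")
    return good_events / valid_events


# Hypothetical counts for one measurement window.
availability = sli(good_events=999_412, valid_events=1_000_000)
print(f"availability SLI: {availability:.4%}")
```

Because every indicator reduces to the same two counters, the budget and burn-rate calculations later in this guide can take any SLI as input without caring whether it measures availability, latency, or freshness.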

There are four families of SLI that cover the vast majority of services:

Availability

The fraction of valid requests that succeed. The subtlety is defining "success." A 200 with a broken body is not a success; a 503 from a circuit breaker is a failure; a 400 caused by a genuinely bad client request is usually not counted against you. Decide what "good" and "valid" mean once, write it down, and measure consistently.
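One way to write those definitions down is as code. The sketch below is an assumption-laden illustration: the `Response` type and its `body_ok` flag are hypothetical, and it excludes every 4xx from the valid set, which is a simplification you may not want (a 429 caused by your own rate limiter, for example, could reasonably count against you).

```python
from dataclasses import dataclass


@dataclass
class Response:
    status: int
    body_ok: bool = True  # hypothetical flag: did the payload pass a sanity check?


def is_valid(r: Response) -> bool:
    # Assumption: client errors (4xx) are the caller's fault and are
    # excluded from the denominator entirely.
    return not (400 <= r.status < 500)


def is_good(r: Response) -> bool:
    # Good means valid, not a server error, and the body was usable.
    return is_valid(r) and r.status < 500 and r.body_ok


def availability_sli(responses: list[Response]) -> float:
    valid = [r for r in responses if is_valid(r)]
    if not valid:
        raise ValueError("no valid requests in this window")
    good = sum(1 for r in valid if is_good(r))
    return good / len(valid)


sample = [
    Response(200),
    Response(200, body_ok=False),  # 200 with a broken body: valid, not good
    Response(503),                 # circuit breaker: valid, not good
    Response(400),                 # bad client request: not valid
    Response(200),
]
print(availability_sli(sample))  # 2 good out of 4 valid
```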

Latency

The fraction of requests served faster than a threshold, for example the share of requests under 300ms. Note that this is still a ratio of good events to valid events, not an average. Averages hide the tail; a threshold-based SLI keeps you honest about the slow requests that users actually notice.
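To see why the average misleads, here is a small sketch with invented latencies. The 300ms threshold comes from the example above; the sample data and the function name `latency_sli` are assumptions for illustration.

```python
from statistics import mean


def latency_sli(latencies_ms: list[float], threshold_ms: float = 300.0) -> float:
    """Fraction of requests served faster than the threshold."""
    if not latencies_ms:
        raise ValueError("no requests in this window")
    fast = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return fast / len(latencies_ms)


# Invented sample: 90 fast requests and 10 very slow ones.
latencies = [80.0] * 90 + [2000.0] * 10

print(f"mean latency: {mean(latencies):.0f} ms")       # under the 300 ms threshold
print(f"latency SLI:  {latency_sli(latencies):.0%}")  # yet one request in ten was slow
```

The mean sits below the threshold and looks healthy, while the threshold-based ratio shows that 10% of requests missed it.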

Error rate

Closely related to availability but framed around the rate of failed operations, useful for asynchronous and batch workloads where there is no clean request/response. The good-events ratio still applies: successful jobs over total valid jobs.

Freshness (and correctness)

For data pipelines, caches, and replicas, the user-facing question is often "how stale is this?" A freshness SLI measures the fraction of data served within an acceptable age. Correctness and durability SLIs round out the set for storage-heavy systems.

What makes a good SLI, concretely: it is measured as close to the user as possible (ideally at the load balancer or the client, not deep in a backend); it has an unambiguous definition of good and valid; it moves with user pain and ignores things users never feel; and it is cheap and reliable to collect. The anti-pattern is the vanity SLI, such as CPU utilization or raw uptime of a host, which can look perfect while the product is unusable.

Setting the target: why not 100%, and the nines table

Once you have an SLI, you have to choose the number. The single most important rule is this: the target is never 100%. A 100% objective is impossible to sustain, costs exponentially more for each additional nine, and leaves zero error budget, which freezes all change because every deploy threatens the perfect record. The correct target is the lowest reliability that keeps your users happy, because reliability beyond what users can perceive is money set on fire.

How do you find that number? Work backwards from the user. Look at where complaints, churn, and abandoned sessions actually start. If users do not notice the difference between 99.9% and 99.99%, do not pay for the extra nine. If a single bad hour triggers a wave of support tickets, you need a tighter target. Start with a defensible estimate, watch real behavior, and tune it. An SLO is a living target, not a stone tablet.

The cost of each nine is exponential, and the allowed downtime collapses fast. Here is the canonical nines table, with allowed unavailability per period:

| Availability SLO | Downtime / year | Downtime / 30 days | Downtime / day | Typical use |
| --- | --- | --- | --- | --- |
| 99% (two nines) | 3.65 days | 7.2 hours | 14.4 minutes | Internal tools, batch |
| 99.9% (three nines) | 8.77 hours | 43.2 minutes | 1.44 minutes | Most SaaS APIs |
| 99.95% | 4.38 hours | 21.6 minutes | 43.2 seconds | Paid, business-critical |
| 99.99% (four nines) | 52.6 minutes | 4.32 minutes | 8.6 seconds | Payments, core infra |
| 99.999% (five nines) | 5.26 minutes | 25.9 seconds | 0.86 seconds | Telco, very few systems |

Read that table as a price list. Moving from 99.9% to 99.99% is not 10% more work; it is a different operational regime that usually demands redundancy, automated failover, and sub-minute detection and remediation, which is exactly the territory where automation stops being a nicety and becomes the only way to hit the number. For most teams a target in the 99.9-99.99% band is the honest, affordable sweet spot.
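The figures in the nines table follow from one line of arithmetic: allowed downtime is (1 minus the SLO) times the length of the period. A small sketch, assuming a 365.25-day year (which matches the 8.77 hours shown for three nines); the function and constant names are illustrative.

```python
DAY_SECONDS = 24 * 60 * 60
THIRTY_DAYS_SECONDS = 30 * DAY_SECONDS
YEAR_SECONDS = 365.25 * DAY_SECONDS  # assumption: average year length


def allowed_downtime_seconds(slo_percent: float, period_seconds: float) -> float:
    """Seconds of total unavailability an availability SLO permits in a period."""
    if not 0 < slo_percent < 100:
        raise ValueError("the target must be strictly between 0% and 100%")
    return (1 - slo_percent / 100) * period_seconds


for target in (99.0, 99.9, 99.95, 99.99, 99.999):
    per_year_h = allowed_downtime_seconds(target, YEAR_SECONDS) / 3600
    per_30d_min = allowed_downtime_seconds(target, THIRTY_DAYS_SECONDS) / 60
    per_day_s = allowed_downtime_seconds(target, DAY_SECONDS)
    print(f"{target}%: {per_year_h:.2f} h/year, "
          f"{per_30d_min:.2f} min/30 days, {per_day_s:.2f} s/day")
```

Note that the function rejects 100% outright, which mirrors the rule above: a target with no allowed downtime leaves nothing to compute.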


Error budgets and burn rate

The error budget is simply 1 minus the SLO, expressed over a window. If your SLO is 99.9% over 28 days, your budget is 0.1% of that window, which works out to roughly 40 minutes of allowed badness across those four weeks (0.001 × 40,320 minutes). That budget is a resource you get to spend on change: every risky deploy, every experiment, every flaky dependency draws it down. With a rolling window it recovers gradually as old failures age out; with a calendar window it refills at the start of the next period. Either way, you get to spend again.
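The same budget can be expressed in time or in events. The sketch below shows both, plus the share of budget still unspent; the function names and the event counts are illustrative assumptions.

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Minutes of total unavailability the SLO allows over the window."""
    return (1 - slo) * window_days * 24 * 60


def error_budget_events(slo: float, valid_events: int) -> float:
    """Number of bad events the SLO allows among the valid events seen."""
    return (1 - slo) * valid_events


def budget_remaining(slo: float, bad_events: int, valid_events: int) -> float:
    """Fraction of the budget still unspent (negative once it is overspent)."""
    allowed = error_budget_events(slo, valid_events)
    return 1 - bad_events / allowed


# 99.9% over 28 days: about 40 minutes of allowed badness.
print(f"{error_budget_minutes(0.999, 28):.2f} minutes")

# Hypothetical window: 1,000,000 valid requests, 400 of them bad.
print(f"{budget_remaining(0.999, bad_events=400, valid_events=1_000_000):.0%} left")
```

For request-driven services the event form is usually the more honest one, since forty minutes of downtime at 3am and forty minutes at peak do not cost users the same.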

The key operational concept is burn rate: how fast you are spending the budget relative to the SLO window. A burn rate of 1 means you are on pace to exhaust exactly the budget by the end of the window, which is fine. A burn rate of 10 means you are spending ten times too fast and will run out in a tenth of the window. Burn rate is what turns a static target into a live signal.
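Concretely, burn rate is the observed error rate divided by the error rate the SLO allows. A minimal sketch, with hypothetical function names and counts, and assuming the burn stays constant when projecting time to exhaustion:

```python
import math


def burn_rate(bad_events: int, valid_events: int, slo: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    if valid_events <= 0:
        raise ValueError("need at least one valid event")
    observed_error_rate = bad_events / valid_events
    allowed_error_rate = 1 - slo
    return observed_error_rate / allowed_error_rate


def days_to_exhaustion(rate: float, window_days: float = 28) -> float:
    """Days until a full budget is gone, assuming the burn rate holds steady."""
    if rate <= 0:
        return math.inf
    return window_days / rate


# 1% of requests failing against a 99.9% SLO is a burn rate of about 10.
rate = burn_rate(bad_events=10, valid_events=1_000, slo=0.999)
print(f"burn rate {rate:.1f}, budget gone in {days_to_exhaustion(rate):.1f} days")
```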

Fast burn vs slow burn

Not all budget consumption is equal. A fast burn is a sudden outage chewing through the budget in minutes; it warrants paging someone immediately. A slow burn is a chronic, low-grade leak (a small but persistent error rate) that will exhaust the budget over days; it warrants a ticket, not a 3am page. Treating these the same is how teams end up either over-paged or blind to slow decay.

Multi-window, multi-burn-rate alerting

The modern way to alert on SLOs is not a static threshold but a combination of burn rate and time windows. You define a fast-burn alert that fires when, say, you have burned a large share of the budget over the last hour and the last five minutes confirm it is still happening. You add a slow-burn alert over a much longer window to catch chronic leaks. Pairing a long window for statistical confidence with a short window for freshness gives you alerts that are sensitive to real problems and resistant to noise. This is the single biggest upgrade most teams can make to their paging: stop alerting on causes and raw thresholds, start alerting on how fast the user-facing budget is burning.
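A sketch of that logic follows. The specific numbers are assumptions chosen for illustration (2% of the budget in one hour for the page, 10% in three days for the ticket); they resemble commonly published starting points but should be tuned to your own service. The `BurnAlert` type and `should_fire` function are hypothetical names, and the burn rates passed in are assumed to be measured separately over each window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnAlert:
    name: str
    long_window_hours: float   # gives statistical confidence
    short_window_hours: float  # confirms the burn is still happening
    budget_fraction: float     # share of budget burned in the long window that should alert
    action: str                # "page" or "ticket"

    def threshold(self, slo_window_hours: float) -> float:
        """Burn rate that consumes budget_fraction within the long window."""
        return self.budget_fraction * slo_window_hours / self.long_window_hours


def should_fire(alert: BurnAlert, long_burn: float, short_burn: float,
                slo_window_hours: float = 28 * 24) -> bool:
    """Fire only when BOTH windows are burning above the threshold."""
    limit = alert.threshold(slo_window_hours)
    return long_burn >= limit and short_burn >= limit


# Illustrative settings, not recommendations.
FAST = BurnAlert("fast-burn", long_window_hours=1, short_window_hours=5 / 60,
                 budget_fraction=0.02, action="page")
SLOW = BurnAlert("slow-burn", long_window_hours=72, short_window_hours=6,
                 budget_fraction=0.10, action="ticket")

print(should_fire(FAST, long_burn=20.0, short_burn=18.0))  # sustained outage
print(should_fire(FAST, long_burn=20.0, short_burn=0.5))   # already recovered
print(should_fire(SLOW, long_burn=1.2, short_burn=1.1))    # chronic leak
```

Requiring the short window to agree is what stops a page from firing for an incident that has already ended but still dominates the last hour of data.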

Error-budget policy: what teams actually do when the budget is spent

An SLO without a policy is just a chart. The error-budget policy is the agreement, written and signed off by both engineering and product before any crisis, that says what happens when the budget runs out. Deciding this in advance is the whole point: nobody negotiates priorities in the middle of an outage.

The canonical policy is the feature freeze. When the budget is exhausted, the team stops shipping risky changes and redirects effort to reliability work until the budget recovers. It is blunt, it is effective, and it aligns incentives instantly: product now cares about reliability because unreliability blocks the roadmap.

A mature policy usually has several graduated clauses:

  • Budget healthy: ship freely, run experiments, take reasonable risks. This is the budget doing its job.
  • Budget low (for example under 25% remaining): require extra review on risky changes, slow down non-urgent deploys, and start prioritizing the reliability backlog.
  • Budget exhausted: feature freeze. Only reliability fixes and critical security work ship until the budget recovers.
  • Budget repeatedly blown: escalate to leadership; the target, the architecture, or the staffing is wrong and needs a decision above the team.

The policy only works if it has teeth and is honored by both sides. If product can wave through features during a freeze, the SLO is theater. The discipline is what converts a number on a dashboard into a real control over the speed-versus-reliability trade-off.
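Writing the graduated clauses as code is one way to make them unambiguous. In this sketch the 25% threshold comes from the example clause above, while the rule that two consecutive blown windows counts as "repeatedly" is purely an assumption; the names are hypothetical.

```python
from enum import Enum


class PolicyState(Enum):
    HEALTHY = "ship freely, run experiments, take reasonable risks"
    LOW = "extra review on risky changes, slow non-urgent deploys"
    EXHAUSTED = "feature freeze: reliability and critical security work only"
    ESCALATE = "escalate to leadership: target, architecture, or staffing is wrong"


def policy_state(budget_remaining: float,
                 consecutive_blown_windows: int = 0,
                 low_threshold: float = 0.25,
                 escalate_after: int = 2) -> PolicyState:
    """Map the remaining budget (as a fraction) onto the agreed policy clause."""
    if budget_remaining <= 0:
        # Assumption: "repeatedly blown" means escalate_after windows in a row.
        if consecutive_blown_windows >= escalate_after:
            return PolicyState.ESCALATE
        return PolicyState.EXHAUSTED
    if budget_remaining < low_threshold:
        return PolicyState.LOW
    return PolicyState.HEALTHY


for remaining, blown in [(0.80, 0), (0.10, 0), (0.0, 1), (-0.20, 2)]:
    state = policy_state(remaining, blown)
    print(f"{remaining:+.0%} remaining -> {state.name}: {state.value}")
```

A function like this could gate a deploy pipeline, but the gate is only as strong as the agreement behind it: an override that product can flip at will puts you back in theater.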

SLOs vs SLAs: internal target versus contractual promise

People conflate these constantly, but they serve opposite purposes. The SLO is your internal early-warning line. The SLA is the external cliff with money attached. The single most important rule that follows: your SLO should always be stricter than your SLA.

Here is why. If you set your internal SLO equal to your contractual SLA, the first time you learn you are in trouble is the moment you have already breached the contract and owe customers credits. There is no warning, no buffer, no time to react. By setting the SLO tighter, you build in a safety cushion: your internal error budget runs out, your policy triggers, and your team starts fixing reliability while there is still contractual margin left.

A common, healthy pattern: publish a customer-facing SLA of 99.9%, but hold an internal SLO of 99.95%. The gap of 0.05 percentage points is your cushion: over a 30-day month the SLO allows 21.6 minutes of downtime against the SLA's 43.2. When you breach the internal 99.95% objective you go into reliability mode, but you have not yet missed the 99.9% you promised customers. The stricter SLO protects the looser SLA.
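The size of that cushion is easy to compute for any pair of targets. A small sketch, assuming a 30-day window and a hypothetical helper name:

```python
def cushion_minutes(slo: float, sla: float, window_days: int = 30) -> float:
    """Minutes of extra downtime between breaching the SLO and breaching the SLA."""
    if slo <= sla:
        raise ValueError("the internal SLO should be stricter than the SLA")
    window_minutes = window_days * 24 * 60
    sla_allowance = (1 - sla) * window_minutes
    slo_allowance = (1 - slo) * window_minutes
    return sla_allowance - slo_allowance


# Internal SLO of 99.95% behind a customer SLA of 99.9%.
print(f"{cushion_minutes(slo=0.9995, sla=0.999):.1f} minutes of margin")
```

The guard clause encodes the rule directly: setting the SLO equal to or looser than the SLA raises an error, because that configuration leaves no warning before the contract is breached.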

One more distinction: SLAs include consequences (credits, penalties, escalation paths) and are negotiated with legal and sales; SLOs are owned by engineering and exist purely to drive operational behavior. Never let the contract dictate your internal target. The contract is the floor you must never hit; the SLO is the line you defend so you never get close to that floor.

Common SLO mistakes

Most failed SLO programs fail the same handful of ways. Recognize these early.

Too many SLOs

The eager team creates an SLO for every metric on every service. The result is a wall of dashboards nobody reads and alerts nobody trusts. SLOs are expensive to maintain and only valuable if defended. Pick the few user journeys that matter and resist the urge to instrument everything.

Vanity SLIs

Choosing indicators that are easy to collect rather than meaningful to users: CPU, memory, raw host uptime, internal queue depth. These can sit at 100% while the product is down. Always ask: if this SLI is green, can a user still be in pain? If yes, it is the wrong SLI.

Alerting on causes, not symptoms

Paging on "disk is 80% full" or "CPU is high" generates noise and trains responders to ignore alerts, because most of those conditions never hurt a user. Alert on the symptom the user feels (the burn rate of a user-facing SLI) and let diagnosis find the cause. This single shift cuts page volume dramatically and raises trust in every remaining alert.

Setting the target by gut, then never revisiting it

A target plucked from the air and frozen forever is almost certainly wrong. Set it from user behavior, then tune it as you learn. An SLO that never changes is an SLO nobody is actually using.

An SLO with no policy

The most common failure of all: a beautifully measured SLO with no agreement about what happens when it is breached. Without a policy, breaching the SLO changes nothing, and the whole apparatus becomes decoration.

A 90-day plan to roll out SLOs

You do not roll SLOs out across an organization in a week. Here is a staged plan that gets one service to a defended, alerting, policy-backed SLO in a quarter, which then becomes the template for the rest.

The 90-day SLO rollout plan

  1. Weeks 1-2: Pick one critical user journey on one service. Define the SLI as a good-events / valid-events ratio, decide exactly what counts as good and valid, and confirm you can measure it close to the user.
  2. Weeks 3-4: Set a provisional SLO from real user behavior and historical data, not from a wish. Compute the error budget and the window. Write it down where the team can see it.
  3. Weeks 5-6: Instrument the SLI properly and build a dashboard that shows current SLI, the SLO line, budget remaining, and burn rate. Watch it for two weeks before you alert on it.
  4. Weeks 7-8: Add multi-window, multi-burn-rate alerting: a fast-burn page and a slow-burn ticket. Tune until the alerts are trustworthy and quiet when nothing is wrong.
  5. Weeks 9-10: Draft and sign the error-budget policy with product and engineering together. Agree the graduated clauses and the feature-freeze trigger before you need them.
  6. Weeks 11-12: Run a game day. Deliberately burn budget, confirm the alerts fire, the policy triggers, and the team responds. Fix whatever broke.
  7. Ongoing: Review the SLO monthly, tune the target from real user pain, and feed every breach into prevention. Then replicate the template onto the next service.

The 10-point SLO checklist

Before you call an SLO done, run it past these ten points. If you cannot tick all ten, the SLO is not finished.

  1. The SLI is a ratio of clearly defined good events to valid events.
  2. The SLI is measured as close to the user as possible.
  3. The SLI moves when users feel pain and stays flat when they do not.
  4. The SLO target is below 100% and justified by real user behavior.
  5. The window is explicit (for example a rolling 28 days) and consistent.
  6. The error budget is computed and visible as a number, not a vibe.
  7. Burn rate is on the dashboard, with fast-burn and slow-burn views.
  8. Alerting is multi-window, multi-burn-rate, and fires on symptoms not causes.
  9. A signed error-budget policy says what happens when the budget is spent.
  10. The internal SLO is stricter than any contractual SLA covering the same journey.


Frequently asked questions

What is the difference between an SLO, an SLI, and an SLA?
An SLI is the measurement, an SLO is the target, and an SLA is the contract. The SLI, service level indicator, is a number that describes how well the service is doing, such as the fraction of requests that succeed or the fraction served faster than 300ms. The SLO, service level objective, is the internal target you hold that indicator to over a window, for example 99.9% of requests succeed over 28 days. The SLA, service level agreement, is the external, contractual promise made to customers, usually looser than the SLO and carrying financial penalties if breached. In short: you measure with SLIs, you aim at SLOs, and you sign SLAs.
What makes a good SLI?
A good SLI is a ratio of good events to valid events that tracks what users actually feel. It is measured as close to the user as possible, it moves when the user experience degrades and stays flat when it does not, and it has a clear definition of what counts as a good event and what counts as a valid event. Request-based availability (good responses divided by total valid responses) and latency (the fraction of requests served under a threshold) are the canonical examples. Avoid indicators that look healthy while users suffer, such as raw server CPU, and avoid ones that are noisy or impossible to attribute to user pain.
Why should an SLO not be 100%?
Because 100% is the wrong target for any real system. It is effectively impossible to reach, it costs exponentially more for each extra nine, and chasing it freezes the pace of change because every deploy risks the perfect record. A 100% SLO also leaves no error budget, which means there is no room to ship features, run experiments, or tolerate the occasional dependency failure you do not control. The right target is the lowest level of reliability that keeps users happy, which leaves a deliberate, spendable budget for change. Reliability above what users notice is wasted money.
What is an error budget?
An error budget is the amount of unreliability you are allowed to spend, and it equals one minus the SLO. If your SLO is 99.9% over 28 days, your error budget is 0.1% of that window, which is roughly 40 minutes of allowed badness. While the budget has room, the team is free to ship fast, run risky experiments, and tolerate small failures. When the budget is spent, the policy kicks in and the team shifts from shipping features to protecting reliability. The error budget turns reliability from an argument into a number that both product and engineering can act on.
What is burn rate and why does it matter?
Burn rate is how fast you are consuming the error budget relative to the SLO window. A burn rate of 1 means you will exactly exhaust the budget by the end of the window; a burn rate of 10 means you are burning ten times too fast and will exhaust it in a tenth of the time. Burn rate matters because it separates a slow, chronic leak from a sudden catastrophic outage. Alerting on burn rate, rather than on a raw threshold, lets you page loudly for a fast burn that will exhaust the budget in an hour and ticket quietly for a slow burn that merely needs attention this week.
How does multi-window, multi-burn-rate alerting work?
It fires when the budget is burning too fast over both a long window and a short window at the same time. A single fast-burn alert might check that you have burned a large fraction of the budget over the last hour and confirm it over the last five minutes, so a brief blip does not page anyone but a sustained fast burn does. A slower alert watches a longer window for a chronic leak. Pairing a long window for confidence with a short window for freshness gives alerts that are both sensitive to real problems and resistant to noise, which is the whole point of moving from threshold alerts to burn-rate alerts.
What should a team do when the error budget is exhausted?
Follow the error-budget policy, which is agreed in advance so nobody negotiates it during a crisis. The standard response is a feature freeze: stop shipping risky changes and redirect engineering effort to reliability work until the budget recovers. Other clauses can include blocking non-urgent deploys, requiring extra review for changes, prioritizing the reliability backlog, and escalating to leadership if the budget stays blown. The policy must have teeth and be honored by both product and engineering, otherwise the SLO is just a dashboard. The point is to make the trade-off between speed and reliability automatic and pre-agreed.
Why should your SLO be stricter than your SLA?
Because the SLO is your internal early-warning line and the SLA is the contractual cliff with penalties. If you set your SLO equal to your SLA, you only find out you are in trouble when you have already breached the contract and owe customers money. Setting the SLO tighter, for example an internal SLO of 99.95% behind a customer SLA of 99.9%, gives you a buffer: your error budget runs out before the contract does, the policy triggers, and the team fixes reliability while there is still margin. The gap between the stricter SLO and the looser SLA is your safety cushion.
How many SLOs should a service have?
As few as possible while still covering what users care about, usually a small handful per service rather than dozens. The common mistake is creating an SLO for every metric, which produces a wall of dashboards nobody watches and alerts nobody trusts. Start with the one or two journeys that matter most to users, typically an availability SLI and a latency SLI on the critical path, and add more only when a real user-facing failure mode is not yet covered. A few meaningful, well-defended SLOs beat a hundred vanity ones every time.
Where does Nova AI Ops fit in an SLO program?
Nova watches error-budget burn across your services and acts within the policy envelope you define, so the budget becomes a live control plane rather than a dashboard reviewed at the weekly meeting. It correlates the signals behind a fast burn into a single incident with a ranked root cause, then auto-resolves the known-safe class of issues within the guardrails you set, which protects the budget before a human finishes reading the page. It does not replace your SLO tooling or your monitoring; it operates on top of them as the agentic layer that turns a burning error budget into automatic, policy-bounded action across AWS, GCP, Azure, Linux, and Windows.

