SLO Math Cheat Sheet
Every number you need to defend an SLO target in a meeting, without opening a calculator and without saying "I'll get back to you."
Availability to downtime
Memorise this table. The single most common SLO conversation is "what does N nines actually mean?" and the answer should be instant.
- 99% (two nines), 7h 12m/month, 3.65d/year, 1h 40m/week
- 99.5%, 3h 36m/month, 1.83d/year, 50m/week
- 99.9% (three nines), 43.2 min/month, 8.76h/year, 10.1 min/week
- 99.95%, 21.6 min/month, 4.38h/year, 5.04 min/week
- 99.99% (four nines), 4.32 min/month, 52.6 min/year, 1.01 min/week
- 99.995%, 2.16 min/month, 26.3 min/year
- 99.999% (five nines), 25.9 sec/month, 5.26 min/year
Formula: downtime = (1 - SLO) × window. A 30-day month is 43,200 minutes; a year is 525,600.
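The formula is one line of code; a quick sketch (the window table and function name are illustrative, not a standard API):

```python
# downtime = (1 - SLO) x window, windows in minutes:
# 30-day month = 43,200 min, 365-day year = 525,600 min
WINDOW_MIN = {"week": 7 * 24 * 60, "month": 30 * 24 * 60, "year": 365 * 24 * 60}

def downtime_minutes(slo: float, window: str) -> float:
    """Minutes of allowed downtime for an SLO (e.g. 0.999) over a window."""
    return (1 - slo) * WINDOW_MIN[window]

downtime_minutes(0.999, "month")   # ~43.2 min
downtime_minutes(0.9999, "year")   # ~52.6 min
```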
Error budget
The budget is what you're allowed to spend on outages, deploys, experiments. Spend it wisely or freeze releases.
- Budget = 1 - SLO. 99.9% SLO → 0.1% budget
- Allowed bad events = budget × total events. 0.1% × 10M requests = 10,000 errors allowed/window
- Budget remaining = 1 - (errors_observed / errors_allowed)
- Budget consumed % = (1 - actual_SLI) / (1 - SLO_target)
- If consumed > 100%, you've blown the budget: stop pushing risky changes
- If consumed < 25% with two weeks left, you're being too conservative: ship more
Burn rate
Burn rate is "how many times faster than allowed are you spending the budget right now?" A burn rate of 1.0 spends the entire budget exactly over the SLO window. Higher = faster.
- Burn rate = (error_rate_now) / (1 - SLO_target)
- Burn rate 1 exhausts the budget exactly at the end of the window
- Burn rate 14.4 exhausts a 30-day budget in 50 hours (Google SRE's classic fast-burn threshold)
- Burn rate 36 exhausts it in 20 hours
- Burn rate 720 exhausts it in 1 hour (the page-now value)
- Multi-window: alert when 1h burn > 14.4 and 5m burn > 14.4 (cuts false pages)
- Slow-burn: alert when 6h burn > 6 and 30m burn > 6
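The burn-rate math and the multi-window check translate directly; thresholds are the ones from the list above, helper names are illustrative:

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than allowed the budget is being spent."""
    return error_rate / (1 - slo)

def hours_to_exhaustion(rate: float, window_hours: float = 30 * 24) -> float:
    """Hours until a 30-day (720h) budget is gone at a constant burn rate."""
    return window_hours / rate

def fast_burn_page(burn_1h: float, burn_5m: float, threshold: float = 14.4) -> bool:
    """Multi-window fast-burn alert: page only if BOTH windows exceed the threshold."""
    return burn_1h > threshold and burn_5m > threshold

# A 1.44% error rate against a 99.9% SLO is a 14.4x burn: ~50h to empty
rate = burn_rate(0.0144, 0.999)
```

Requiring both the long and the short window to breach is what cuts the false pages: the 1h window proves the burn is sustained, the 5m window proves it is still happening.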
Composition
Stack services in series and the availabilities multiply (so availability drops and the error budgets roughly add). Stack in parallel behind a load balancer and availability climbs.
- Series (A then B): SLO_total = SLO_A × SLO_B. Two 99.9% services in series = 99.8%
- Parallel (A or B): SLO_total = 1 - (1 - SLO_A) × (1 - SLO_B). Two 99.9% replicas in parallel = 99.9999%
- Three services at 99.9% in series = 99.7%, or 2.16h/month downtime
- Want 99.99% end-to-end on a 5-service path? Each service needs ~99.998%
- Always set the SLO at the user-visible boundary, not per-service. Per-service targets are budgets, not promises
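The series/parallel rules in code (a sketch; `per_service_target` assumes the budget is split equally across the chain, per the 5-service example above):

```python
from math import prod

def series(*slos: float) -> float:
    """Chain of dependencies: availabilities multiply, so the total drops."""
    return prod(slos)

def parallel(*slos: float) -> float:
    """Redundant replicas: the system fails only when every replica fails."""
    return 1 - prod(1 - s for s in slos)

def per_service_target(end_to_end: float, n: int) -> float:
    """Equal per-service SLO needed for n services in series to hit a target."""
    return end_to_end ** (1 / n)

series(0.999, 0.999)            # ~0.998
parallel(0.999, 0.999)          # ~0.999999
per_service_target(0.9999, 5)   # ~0.99998
```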
Latency SLOs
Availability isn't enough: slow is broken. Latency SLOs read like "95% of requests under 250ms over 30 days."
- Good event = request with latency ≤ threshold. Bad event = anything slower or any error
- SLI = good / total. SLO is the target percentage
- Two-threshold SLO: 95% < 250ms and 99% < 1000ms catches both typical latency and the tail
- Avoid SLOs on average latency: averages hide the tail you actually care about
- p99 of p99s is not the system's p99; quantiles don't compose. Aggregate from raw histograms
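Counting good events from raw samples, per the rules above (an illustrative sketch; production systems compute this from event streams or merged histograms, never from averaged percentiles):

```python
def latency_sli(latencies_ms: list[float], threshold_ms: float, errors: int = 0) -> float:
    """SLI = good / total. Good = request at or under the latency threshold;
    errors count as bad events even when they return quickly."""
    total = len(latencies_ms) + errors
    good = sum(1 for ms in latencies_ms if ms <= threshold_ms)
    return good / total

samples = [120.0, 180.0, 240.0, 260.0, 900.0, 1500.0]  # hypothetical raw latencies
latency_sli(samples, 250)    # 0.5: three of six at or under 250ms
latency_sli(samples, 1000)   # ~0.83: five of six at or under 1s
```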