SLO Confidence Intervals
SLO measurements have uncertainty.
Idea
Every SLO number you publish is a sample, not a population. When you say "checkout was 99.95% available last month", what you really observed is some count of successful requests over some count of total requests, and the true underlying availability of the service is some other number that you cannot know exactly. The width of the gap between observed and true is a function of how many requests you saw.
The honest way to report this is a confidence interval:
- Point estimate: The single number, 99.95%, computed as successes divided by total. This is what most dashboards show today and what most teams use for the green/yellow/red traffic light.
- Confidence interval: The range the true value plausibly sits in. For a binomial like availability, the 95% Wilson interval on 99.95% from 200,000 requests is roughly 99.94% to 99.96%. From 2,000 requests it is roughly 99.72% to 99.99%, which is a much more honest picture of how much you actually know. (A code sketch of the computation follows this list.)
- Range, not point: Treat the SLO as a band. If the upper bound of the band is below your target, you have missed. If the entire band sits above the target, you have met. If the band straddles the target, you do not have enough data to say either way, and you need a longer window or more traffic.
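The Wilson score formula is standard; here is a minimal Python sketch of it. The function name is illustrative, and the two calls reproduce the 200,000-request and 2,000-request examples above:

```python
import math

def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion (z = 1.96)."""
    if total == 0:
        return (0.0, 1.0)  # no data: the true rate could be anything
    p_hat = successes / total
    denom = 1 + z * z / total
    center = (p_hat + z * z / (2 * total)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / total + z * z / (4 * total * total))
    return (center - half, center + half)

# Same 99.95% point estimate, very different amounts of knowledge:
print(wilson_interval(199_900, 200_000))  # ≈ (0.99939, 0.99959)
print(wilson_interval(1_999, 2_000))      # ≈ (0.99717, 0.99991)
```

The second band is roughly 14 times wider than the first, even though both dashboards would show the identical 99.95%.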
Most teams skip this because point estimates fit on a dashboard and CIs do not. The cost of skipping it shows up later, when a low-volume service reports 99.4% one month and 100% the next and nobody can tell whether anything actually changed.
When
Confidence intervals matter most when sample size is small relative to the precision you are claiming. Three failure modes to watch for:
- Low-volume services: Internal admin tools, batch jobs, infrequent API endpoints. A service that handles 800 requests a month cannot meaningfully report 99.99% availability. The CI on that point estimate is so wide that the number is meaningless. Either lower the precision claim or aggregate across a longer window.
- Tail-heavy distributions: Latency p99 from 1,000 samples can easily carry a CI of plus or minus 30%, depending on how heavy the tail is. If your p99 SLO is 800 ms and your observed p99 is 750 ms, the upper bound of the interval is over 970 ms. You are not meeting the SLO with confidence; you may just have gotten a lucky sample. (A bootstrap sketch for tail percentiles follows this list.)
- Per-month rollups: A 28-day window is short for high-precision targets. A 99.99% SLO requires tens of millions of requests in the window before even a tenth of a 9 is resolvable. Most teams claiming four nines are reporting noise.
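There is no tidy closed-form interval for an empirical percentile of an unknown distribution, so a common approach is the percentile bootstrap: resample the data with replacement, recompute the p99 each time, and read the interval off the spread of the estimates. A sketch with synthetic heavy-tailed data; the Pareto parameters, iteration count, and seed are illustrative assumptions:

```python
import random

def percentile(xs, q):
    """Simple empirical percentile: the value at rank q*n of the sorted sample."""
    xs = sorted(xs)
    return xs[min(len(xs) - 1, int(q * len(xs)))]

def bootstrap_p99_ci(samples, iters=2000, alpha=0.05):
    """Percentile-bootstrap CI for the p99 of a latency sample."""
    n = len(samples)
    estimates = sorted(
        percentile(random.choices(samples, k=n), 0.99)  # resample with replacement
        for _ in range(iters)
    )
    lo = estimates[int(alpha / 2 * iters)]
    hi = estimates[int((1 - alpha / 2) * iters) - 1]
    return lo, hi

# 1,000 heavy-tailed synthetic latencies (Pareto tail, ~100 ms floor):
random.seed(42)
latencies = [random.paretovariate(2.5) * 100 for _ in range(1000)]
print(percentile(latencies, 0.99), bootstrap_p99_ci(latencies))
# prints the point estimate and a band wide enough to change the SLO verdict
```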
The rule of thumb: if your SLO target has more nines than the log10 of your monthly request count minus two, you are overclaiming. A service with 100,000 requests a month (5 zeros) cannot honestly claim 99.99% (4 nines) at month resolution: the entire error budget is 10 failed requests, and the 95% interval on an observed count of 10 spans roughly 5 to 18. The math will not let you.
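That rule is one line of arithmetic. A sketch of it; the requirement of 100 expected budget-level failures is the assumption baked into the "minus two", not an industry standard:

```python
import math

def max_honest_nines(monthly_requests: int, min_budget_failures: int = 100) -> int:
    """Ceiling on claimable nines: insist on at least `min_budget_failures`
    expected failures at the budget boundary, so the failure count is a
    measurement rather than Poisson noise. The 100 is an assumption."""
    if monthly_requests < min_budget_failures:
        return 0
    return int(math.log10(monthly_requests / min_budget_failures))

print(max_honest_nines(100_000))      # 3 -> 99.9% is defensible, 99.99% is not
print(max_honest_nines(800))          # 0 -> no nines are measurable in a month
print(max_honest_nines(100_000_000))  # 6 -> 99.9999% is the statistical ceiling
```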
Display
Putting confidence intervals on dashboards is the part most teams resist, because intervals are visually busier than point estimates. The fix is a small UI investment that pays back in calibration and trust.
- Dual line on the time-series: Plot the point estimate as a solid line and the 95% CI as a shaded band around it. The band naturally narrows as the window accumulates more samples, which gives a visual reading of "do we know this yet."
- Lower-bound traffic light: Color the SLO badge red, yellow, or green based on the LOWER bound of the CI, not the point estimate. This biases toward honesty: a service is only "green" if you can prove it is meeting the target, not if your best guess says it is. (See the sketch after this list.)
- Sample-size annotation: Show "n = 1.2M requests, 95% CI = 99.94 to 99.96" inline. The number alone does not tell anyone how trustworthy it is. The annotation does.
- Don't compute monthly numbers from a single noisy week: If the CI on your week-1 number is wider than your SLO budget, do not extrapolate. Wait for the rest of the data or aggregate over a longer window.
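A sketch of the lower-bound traffic light and the inline annotation, reusing the hypothetical wilson_interval function from the earlier sketch; the three colors map directly onto the band logic in the Idea section:

```python
def slo_status(successes: int, total: int, target: float) -> str:
    """Traffic light driven by the CI band, not the point estimate."""
    lo, hi = wilson_interval(successes, total)  # from the sketch above
    if lo >= target:
        return "green"   # provably meeting the target
    if hi < target:
        return "red"     # provably missing it
    return "yellow"      # band straddles the target: not enough data yet

def slo_annotation(successes: int, total: int) -> str:
    """Inline annotation so the point estimate never travels alone."""
    lo, hi = wilson_interval(successes, total)
    return f"n = {total:,} requests, 95% CI = {lo:.2%} to {hi:.2%}"

# 99.95% observed against a 99.9% target, but only 2,000 requests:
print(slo_status(1_999, 2_000, 0.999))  # yellow, not green
print(slo_annotation(1_999, 2_000))     # n = 2,000 requests, 95% CI = 99.72% to 99.99%
```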
Showing CIs on the SLO dashboard is the difference between a reliability practice that is rigorous and one that is theatre. Nova AI Ops computes Wilson intervals on every SLI by default, plots the band on every chart, and flags services whose claimed precision exceeds what their request volume can statistically support, so you stop overclaiming numbers your data cannot back up.