SLO Target Setting Discipline
Setting realistic SLO targets.
Data-driven
SLO target setting is the most consequential decision in any reliability practice, and the one most teams get wrong by treating it as a meeting topic rather than a data-analysis exercise. The right number is not "what does leadership want to be able to claim" but "what does the system actually do, plus what stretch can engineering credibly commit to." The first question is answered by data; the second by judgment grounded in data.
What a data-driven baseline looks like:
- Past 90 days of actual performance: The minimum window for a meaningful baseline. Anything shorter misses seasonality, weekly cycles, and the long tail of dependency outages. 90 days captures enough variability to set a target that holds under realistic conditions.
- Honest baseline, with anomalies acknowledged: The baseline analysis includes the bad weeks, not just the median. A baseline computed from "the good days only" produces a target the team cannot defend in the bad ones. Document anomalies that were excluded; the documentation is what makes the baseline trustworthy.
- Per-dimension separately: Latency baseline, availability baseline, error-rate baseline, freshness baseline. Each dimension gets its own analysis. Setting the composite target first and decomposing produces less defensible numbers than per-dimension first and composing.
- Per-segment if needed: Some services have very different performance per region, per tenant tier, or per request type. The baseline per segment may differ enough that a single target across all segments is misleading. Per-segment baselines inform whether the SLO needs to be set per segment too.
- Show the distribution: The median is one number; the worst day is another; the long tail is a third. Show the p50, p95, and p99 of weekly performance over the 90 days. The target should be set against the realistic worst, not the median; the sketch after this list shows one way to compute that distribution.
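A minimal sketch of that baseline computation, assuming roughly 90 daily readings per dimension can be exported from your metrics store. The input shape, the nearest-rank percentile, and the explicit exclusion list are illustrative choices, not a prescribed toolchain:

```python
from statistics import median

def nearest_rank(values, pct):
    """Nearest-rank percentile; adequate for ~13 weekly samples."""
    ordered = sorted(values)
    idx = min(len(ordered) - 1, max(0, round(pct / 100 * len(ordered)) - 1))
    return ordered[idx]

def weekly_values(daily, excluded_days=()):
    """Roll daily readings into weekly means, keeping the bad weeks.

    Excluded anomaly days must be listed explicitly, so the exclusion
    is documented rather than silent.
    """
    kept = [v for i, v in enumerate(daily) if i not in set(excluded_days)]
    weeks = [kept[i:i + 7] for i in range(0, len(kept), 7)]
    return [sum(w) / len(w) for w in weeks if len(w) == 7]

def baseline(dimension, daily, excluded_days=()):
    """Distribution of weekly performance for one dimension.

    Readings are in the 'bad direction' (error rate, p99 latency),
    so the high percentiles are the realistic worst weeks.
    """
    weeks = weekly_values(daily, excluded_days)
    return {
        "dimension": dimension,
        "weeks": len(weeks),
        "p50": median(weeks),
        "p95": nearest_rank(weeks, 95),
        "p99": nearest_rank(weeks, 99),
    }

# 90 days of error-rate readings; day 41 was a documented load test.
daily_error_rate = [0.0012] * 83 + [0.004, 0.006, 0.0035, 0.005, 0.0045, 0.007, 0.0038]
print(baseline("error_rate", daily_error_rate, excluded_days=(41,)))
```

Running the same function per segment (per region, per tenant tier) gives the per-segment view; if the worst-week percentiles diverge widely across segments, that is the signal that the SLO needs per-segment targets.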
Data-driven targets take a day to compute and produce numbers the team can defend. Aspirational targets without data are the source of most chronic SLO misses.
Aspire
The baseline tells you what the system has been doing. The target should be a deliberate stretch beyond it. Set the target equal to the baseline and the SLO drives no improvement; set it far above the baseline and the team misses every quarter. The right answer is in the middle.
- 10 to 20% better than baseline: The rule of thumb that holds up across most teams, where "better" means cutting the error budget by 10 to 20%. If the baseline is 99.5%, a target of 99.6% cuts the budget from 0.5% to 0.4%, a 20% reduction. If the 90-day baseline is 99.85%, a target of 99.9% is a one-third reduction in error budget, a meaningful stretch at the aggressive end of the range.
- Stretch but achievable: The target should require real investment to hit but not architectural rework. A target the team believes it can hit with a quarter of focused reliability work, not one that requires a multi-region migration the team has not budgeted for.
- Account for known investments: If reliability work in the next quarter will improve the underlying signal (better caching, redundancy, faster rollback), bake that into the target. The target is what the service should be after the planned work, not what it has been historically.
- Stretch in the dimension that matters: If the baseline shows availability is fine but latency is the user-felt problem, stretch on latency. The dimension that gets stretched is the one the team is willing to invest in.
- Sanity-check against dependencies: If your target requires a hard upstream dependency to be 99.99% reliable and it sits at 99.9%, your target is mathematically impossible. The dependency math is the ceiling; aspirational stretching cannot exceed it. The sketch after this list applies both the stretch rule and the ceiling.
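A sketch of that stretch-plus-ceiling arithmetic, assuming a flat list of hard serial dependencies and a single budget_cut knob; both are illustrative simplifications of a real dependency graph:

```python
from math import prod

def dependency_ceiling(dep_availabilities):
    """Hard serial dependencies multiply: their product is the best you can do."""
    return prod(dep_availabilities)

def suggest_target(baseline_availability, dep_availabilities, budget_cut=0.20):
    """Cut the error budget by 10-20%, capped at the dependency ceiling."""
    stretch = 1 - (1 - baseline_availability) * (1 - budget_cut)
    return min(stretch, dependency_ceiling(dep_availabilities))

# Baseline 99.5% with two hard dependencies at 99.9% each: the ceiling
# is 0.999 * 0.999 = 99.8001%, so a 20% budget cut (99.6%) is admissible,
# while a 99.9% aspiration would exceed the ceiling and be clipped.
print(f"suggested target: {suggest_target(0.995, [0.999, 0.999]):.4%}")
```

If the ceiling binds, the aspiration has to wait for the dependency to improve; publishing a target the math forbids only schedules the miss.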
Aspiration without data is denial; data without aspiration is stagnation. The 10 to 20% rule keeps the team in the productive zone.
Avoid
The most common SLO target-setting mistakes come from picking round numbers without doing the data work. A "99.99%" target sounds rigorous and often is not.
- Avoid round numbers without data: 99% is round. 99.9% is round. 99.99% is round. None of these are special; they are convenient. The right target for your service might be 99.85% or 99.93%; the round-number convention loses precision and produces commitments that do not match reality.
- Avoid 99.99% by default: Many teams pick 99.99% because it sounds appropriately ambitious. The math is brutal: 99.99% allows about 4 minutes of monthly downtime, which a single AZ failure consumes. Hitting 99.99% requires multi-region architecture, hot standby, and operational discipline most teams do not have. Picking it without doing that work guarantees the team will miss; the sketch after this list makes the downtime math concrete.
- Justify with data: Whatever target you pick, document the data analysis that justifies it. The 90-day baseline. The aspirational stretch. The dependency math. The team's planned reliability investments. The documentation is what defends the target during leadership reviews and customer escalations.
- Avoid copying competitor SLAs: "Our competitor publishes 99.99%, so we will too" is the wrong reason to pick a target. The competitor may be lying; the competitor may have invested in architecture you have not; the competitor may be defining availability differently. Set your target based on your own data, not theirs.
- Avoid setting once and forgetting: The target needs review at least annually, often quarterly. As the system evolves, the right target evolves too. The target that was right at launch is not necessarily right two years in.
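A small sketch converting availability targets into monthly downtime budgets, using a 30-day month as in the 4-minute figure above; the set of targets printed is illustrative:

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes in a 30-day month

def downtime_budget_minutes(target, window_minutes=MINUTES_PER_30_DAYS):
    """Downtime the target permits over the window before the SLO is missed."""
    return (1 - target) * window_minutes

for target in (0.99, 0.999, 0.9985, 0.9999):
    print(f"{target:.4%} allows {downtime_budget_minutes(target):7.1f} min/month")

# 99.9900% allows 4.3 min/month: one unmitigated AZ failure spends it all.
```

The drop from 99.9% (43.2 minutes) to 99.99% (4.3 minutes) is the order-of-magnitude gap that makes the default choice so expensive.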
SLO target setting done with data, with deliberate stretch, and with honest revision over time produces commitments the team can keep. Nova AI Ops automates the baseline analysis, suggests target ranges based on observed performance and dependency math, and tracks target-versus-actual quarter over quarter so the SLO conversation stays anchored in evidence.