SLO vs Availability: Confusion
Two distinct concepts that are routinely conflated.
Availability
Availability is the simplest reliability metric and the easiest to game. Most teams that say "we have SLOs" actually have an availability target, which is necessary but not enough. Understanding the difference is the first step in moving from a checkbox reliability practice to one that actually catches the failure modes that hurt users.
What availability actually measures:
- Uptime divided by total time: The simplest form: the fraction of seconds in a window during which the service was reachable and serving requests. Sometimes computed at the request level instead: the fraction of requests that returned a non-error status. Both are availability; both are partial views.
- A single number, easy to communicate: "We are 99.9% available" is intelligible to anyone. That accessibility is also availability's biggest weakness, because the simple number invites people to stop there.
- Insensitive to user-visible failure modes: A service returning 200 OK with stale data, the wrong answer, or a 30-second response time is technically available. Availability alone cannot distinguish "broken" from "fine."
- Often gamed by definition: What counts as "downtime" is up to the team that publishes the metric: excluding planned maintenance, excluding regional outages, counting only specific endpoints. Each exclusion makes the number look better; none of them makes the service better.
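The two forms of availability described above can be sketched in a few lines. All names and numbers here are illustrative, not taken from any particular monitoring stack:

```python
from dataclasses import dataclass

@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # end-to-end latency

def time_based_availability(up_seconds: float, total_seconds: float) -> float:
    """Fraction of the window during which the service was reachable."""
    return up_seconds / total_seconds

def request_based_availability(requests: list[Request]) -> float:
    """Fraction of requests that returned a non-5xx status."""
    ok = sum(1 for r in requests if r.status < 500)
    return ok / len(requests)

# Five minutes of downtime in a day looks like ~99.65% time-based availability.
print(time_based_availability(86_100, 86_400))  # ~0.9965

# Request-based availability misses the slow request: the 4.8 s "success" counts as good.
reqs = [Request(200, 120), Request(200, 95), Request(503, 10), Request(200, 4800)]
print(request_based_availability(reqs))  # 0.75
```

Both numbers are honest as far as they go; neither sees the 4.8-second response that a user would call an outage.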
Availability is the floor of reliability measurement, not the ceiling. Teams that report only availability are reporting a partial truth.
SLO
An SLO (service-level objective) differs from an availability target in being multi-dimensional by design. It answers "is the service good enough?" by combining several SLIs (service-level indicators), each capturing a different dimension of "good."
- Includes latency, not just success: "P99 latency under 500 ms" is an SLI. A service can be available (returning 200s) but slow enough that users perceive it as broken. Latency-bound SLOs catch the slow-but-not-broken failure mode.
- Includes correctness, not just uptime: "99% of search results are above the relevance threshold" or "99% of payment confirmations match the source of truth" are correctness SLIs. Wrong answers count as failures, even when the request returned 200 OK.
- Includes freshness, where applicable: "95% of partitions arrive within 30 minutes" is a freshness SLI. A service serving stale data is failing in a way pure availability does not capture.
- Composes multiple dimensions into one target: A request that succeeded only counts as good if it met every dimension. A 200 OK at 5-second latency with stale data is a bad request even though availability would call it good.
- Tied to user experience, not infrastructure: SLO definitions should reflect what users actually experienced, not what the infrastructure reported. Server-side success rate routinely disagrees with client-side experience. The SLO should track the user side.
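The "every dimension must pass" rule can be expressed directly. The thresholds below (500 ms latency, 30-minute freshness) are example values, not prescriptions:

```python
from dataclasses import dataclass

@dataclass
class Request:
    status: int
    latency_ms: float
    correct: bool          # response matched the source of truth
    staleness_min: float   # age of the data that was served

def is_good(r: Request, max_latency_ms: float = 500,
            max_staleness_min: float = 30) -> bool:
    # A request counts as good only if it passes EVERY dimension.
    return (r.status < 500
            and r.latency_ms <= max_latency_ms
            and r.correct
            and r.staleness_min <= max_staleness_min)

def composite_sli(requests: list[Request]) -> float:
    """Fraction of requests that were good on all dimensions at once."""
    return sum(is_good(r) for r in requests) / len(requests)

# The 200 OK at 5 s latency with stale data from above: available, but bad.
r = Request(status=200, latency_ms=5000, correct=True, staleness_min=45)
print(is_good(r))  # False
```

The conjunction is the point: the composite SLI can only be as good as the weakest dimension allows.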
An SLO is the contract that says "this is what 'good' means for this service, measured the way users would measure it." It is richer than availability by design.
Avoid
The most common reliability mistake is conflating availability with the full SLO. The team picks an availability target, calls it the SLO, and assumes the practice is in place. The architecture-level failure modes the team is not measuring continue to bite, and the SLO dashboard remains green while users complain.
- "99.9% SLO" is a meaningless phrase on its own: Without saying which dimensions, over what window, against what user experience, the number is rhetoric. Be specific: "99.9% of API requests complete in under 200 ms with the correct response, measured at the load balancer over a rolling 28-day window."
- Don't publish an availability number and call it the SLO: If your status page shows 99.95% uptime and your customers report 5-second latency on every other call, you have an SLO problem the dashboard is hiding. The dashboard should reflect every dimension that matters.
- Don't set SLO targets without baselining each dimension separately: The latency that produces a good user experience is not always the latency the team can hit at 99.9%. Set the target per dimension, based on both what is achievable and what users need.
- Don't roll up multiple dimensions into a single percentage too early: Track each SLI separately. Roll up only at the dashboard layer, where the audience needs a summary. The team needs the per-dimension breakdown to know where to invest.
- Don't accept "we're meeting our SLO" without asking which dimensions: A team meeting its availability SLO while quietly failing latency or correctness is a team that hasn't yet been honest about what reliability means for its service.
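Tracking each SLI separately, as the list above argues, is what makes a green availability number and an unhappy user base reconcilable. A minimal sketch, with hypothetical field names and a 500 ms latency threshold:

```python
from dataclasses import dataclass

@dataclass
class Request:
    status: int
    latency_ms: float
    correct: bool

def per_dimension_slis(requests: list[Request], max_latency_ms: float = 500) -> dict:
    """Report each SLI on its own so the failing dimension stays visible."""
    n = len(requests)
    return {
        "availability": sum(r.status < 500 for r in requests) / n,
        "latency":      sum(r.latency_ms <= max_latency_ms for r in requests) / n,
        "correctness":  sum(r.correct for r in requests) / n,
    }

# 1000 requests: almost all return 200, but roughly half are slow.
reqs = ([Request(200, 100, True)] * 500
        + [Request(200, 2000, True)] * 499
        + [Request(500, 50, True)])
print(per_dimension_slis(reqs))
# availability comes out at 0.999 while the latency SLI sits near 0.5;
# the breakdown shows where to invest, and a single rolled-up number hides it.
```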
The fix is not abandoning availability; it is layering the other SLIs on top of it. Nova AI Ops tracks availability, latency, correctness, and freshness as separate SLIs per service, computes the composite SLO under multiple aggregation methods, and shows the per-dimension breakdown on the dashboard so the team has the data to defend a real reliability commitment instead of a sanitized availability headline.
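"Multiple aggregation methods" matters because the choice changes the number. Per-request intersection (a request must pass every dimension) is stricter than taking the minimum of independently computed per-dimension SLIs, since failures on different dimensions can hit different requests. A sketch with made-up data:

```python
def intersect_sli(rows: list[tuple]) -> float:
    """Per-request intersection: a request is good only if every dimension passed."""
    return sum(all(r) for r in rows) / len(rows)

def min_of_slis(rows: list[tuple]) -> float:
    """Minimum of the per-dimension SLIs, each computed independently."""
    dims = list(zip(*rows))  # transpose: one tuple per dimension
    return min(sum(d) / len(d) for d in dims)

# Each row: (available, fast_enough, correct) for one request.
rows = [(True, True, True),
        (True, False, True),
        (False, True, True),
        (True, True, False)]
print(intersect_sli(rows))  # 0.25: only one request passed everything
print(min_of_slis(rows))    # 0.75: each dimension alone looks fine
```

The gap between 0.75 and 0.25 is exactly the sanitized-headline effect the section warns about: each per-dimension number looks healthy while most requests were bad on some dimension.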