Best Practices · Intermediate · By Samson Tanimawo, PhD · Published Apr 7, 2026 · 7 min read

Choosing SLIs That Reflect Real User Pain (Not Just Uptime)

Uptime is the SLI everyone reaches for first. It is also the one most likely to be 99.99% green while users complain. Pick SLIs that move when users are angry.

Why uptime is not enough

Uptime says "the service answered HTTP at all." Modern services degrade in shapes uptime does not see. Slow checkout. Missing thumbnails. Stale recommendations. The user feels every one. The uptime metric registers nothing.

The reason uptime persists despite its inadequacy: it's easy to measure. A health check pings the service; the service answers; uptime is recorded as 100%. The metric requires no understanding of what the service does. Teams default to it because the alternative — measuring user-facing behaviour — requires more thought.

The cost shows up at the customer-experience boundary. A service with 99.99% uptime can have terrible perceived reliability if the 0.01% downtime is correlated with peak usage, or if the uptime count includes "answered with a 500 error." The metric is technically true and narratively false. SLIs that reflect actual user experience close this gap.

The four-question filter

The filter is four questions: (1) Does this metric move when users are in pain? (2) Does it measure behaviour we control? (3) Does it stay unbiased as system load changes? (4) Do stakeholders outside engineering care about it? Each question filters out a common SLI antipattern. Question 1 catches "metric we measure because we can." Question 2 catches "metric driven by external systems we don't control." Question 3 catches metrics that are correlated with system load and therefore biased. Question 4 catches metrics that engineers love and stakeholders ignore.

Apply all four to every candidate SLI. An SLI that fails any one is suspect; an SLI that fails two should not be adopted. Most teams' first attempt at "let's pick better SLIs" produces a list that fails 2-3 questions for at least half the candidates; the filter is doing real work.
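In code terms, the filter is just a checklist. A minimal Python sketch (the field names and question wording are my paraphrase of what each question catches, not any standard API):

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    """A candidate SLI scored against the four-question filter."""
    name: str
    moves_with_user_pain: bool        # Q1: not just "measurable because we can"
    within_our_control: bool          # Q2: not driven by external systems
    unbiased_by_load: bool            # Q3: not skewed when the system is busy
    meaningful_to_stakeholders: bool  # Q4: not an engineers-only metric

def filter_verdict(c: Candidate) -> str:
    """Fails one question: suspect. Fails two or more: do not adopt."""
    failures = sum(not ok for ok in (
        c.moves_with_user_pain, c.within_our_control,
        c.unbiased_by_load, c.meaningful_to_stakeholders))
    if failures == 0:
        return "adopt"
    if failures == 1:
        return "suspect"
    return "reject"
```

Running your current SLI list through something this mechanical makes the "fails 2-3 questions" discovery hard to argue with.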

Three SLIs that work

Tail latency, availability of critical paths, and correctness. Almost any service can express what it does in those three terms. The set is comprehensive and small enough to remember; large SLI sets dilute attention.

The three together cover the user-experience surface. Latency catches "slow"; availability catches "broken"; correctness catches "wrong." Most user complaints map to one of these three categories. Anything not covered is usually a symptom that bubbles up through one of them.

Why exactly three. One is too few (you'll have an incident type that fits none of the categories). Five is too many (the team won't track all of them seriously). Three fits in a head, fits on a dashboard, and covers the failure space well enough.

Tail latency, not average

Average latency hides the bad experiences. The user with the 8-second page load remembers it; the average says everything is fine. Measure p95 or p99 of the operations users actually do.

The math: averages compress outliers into invisibility. A service with 99% of requests at 200ms and 1% at 8 seconds has an average of 278ms — looks great. The 1% of users hitting 8 seconds churn. The average lied; the p99 (8 seconds) told the truth.
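A quick way to check that arithmetic, using a simple rank-based percentile (a sketch, not a production-grade estimator):

```python
import statistics

def percentile(values, q):
    """Value at index floor(q * n / 100), clamped to the last element."""
    ordered = sorted(values)
    index = min(q * len(ordered) // 100, len(ordered) - 1)
    return ordered[index]

# The example from the text: 99% of requests at 200ms, 1% at 8000ms.
latencies_ms = [200] * 99 + [8000]

average = statistics.mean(latencies_ms)   # 278.0 ms: "looks great"
p99 = percentile(latencies_ms, 99)        # 8000 ms: the truth the average hid
```

Same data, two metrics, opposite stories: that gap is the whole argument for tail latency.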

The choice between p95 and p99. p99 is more honest but noisier — small request counts produce unstable p99 numbers. p95 is more stable and roughly captures the same insight. Most production teams use p99 for high-volume services and p95 for low-volume ones. Either is fine; "average" is not.

Availability of the critical path

Not "the service is up." "The login button works." Pick three or four user-visible flows and measure availability of those specifically. Auth, search, checkout. If any of them is broken the user is having a bad day, even if the rest of the service is healthy.

The discipline of identifying critical paths. List the top user actions; for each, define what "available" means at the user level (not the system level). "Login is available when the auth endpoint responds with success and a valid session token within 2 seconds, AND the resulting page renders." That definition crosses 3-4 services and dependencies; if any link fails, login is not available.

How to measure: synthetic checks against the critical paths from the user's perspective. Real user monitoring (RUM) data complements but doesn't replace synthetics — synthetics give you constant signal even when traffic is low.
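A minimal sketch of what such a synthetic check might look like, assuming a hypothetical /login endpoint and the two-second budget from the definition above. A real probe would POST test credentials and verify the rendered page; this version only checks status, token presence, and latency:

```python
import time
import urllib.request

LATENCY_BUDGET_S = 2.0  # from the hypothetical definition: "within 2 seconds"

def login_available(status_code: int, got_session_token: bool, elapsed_s: float) -> bool:
    """User-level availability: success status AND a session token AND within budget."""
    return status_code == 200 and got_session_token and elapsed_s <= LATENCY_BUDGET_S

def probe_login(base_url: str) -> bool:
    """One synthetic check against a hypothetical /login endpoint."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(f"{base_url}/login", timeout=LATENCY_BUDGET_S) as resp:
            body = resp.read()
            elapsed = time.monotonic() - start
            return login_available(resp.status, b"session_token" in body, elapsed)
    except Exception:
        return False  # timeouts and connection errors count as unavailable
```

Note the decision logic is a separate pure function: that makes the availability definition testable without a network, and reusable against RUM data.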

Correctness of responses

Hardest to measure but the highest-leverage when it goes wrong. Did the recommendation engine return personalised results, or did it fall back to a generic list? Did the search return any results at all? A response that succeeds but is empty is often worse than a 500.

The trap. Correctness is harder to encode than availability or latency. "Did the user get the right thing?" requires knowing what right is. Most teams skip correctness because it's hard, then discover months later that their recommendations have been broken for weeks while uptime stayed at 99.99%.

Practical correctness SLIs. (1) Empty-result rate (% of search queries returning zero results — should be near 0% for non-trivial queries). (2) Fallback rate (% of recommendation calls that hit the generic fallback instead of personalised — should be low and stable). (3) Validation-failure rate (% of write operations rejected by validation — should be stable; spikes indicate upstream data quality issues).
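Each of these three rates reduces to counting events. A sketch, assuming hypothetical event labels from your own telemetry:

```python
from collections import Counter

def correctness_slis(events):
    """Compute the three correctness rates from a stream of event labels.
    The labels are hypothetical: substitute whatever your telemetry emits."""
    counts = Counter(events)

    def rate(bad, good):
        total = counts[bad] + counts[good]
        return counts[bad] / total if total else 0.0

    return {
        "empty_result_rate": rate("search_empty", "search_ok"),
        "fallback_rate": rate("rec_fallback", "rec_personalised"),
        "validation_failure_rate": rate("write_rejected", "write_ok"),
    }
```

The interesting signal is usually the trend, not the absolute number: a fallback rate that jumps from 2% to 40% overnight is the "recommendations broken for weeks" scenario caught in hours.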

Common antipatterns

SLIs that measure infrastructure health instead of user outcomes. CPU below 80%, queue depth below 1000, disk usage below 90%. These can all be true while users are unhappy. The SLI must reflect user experience, not system state.

SLIs that depend on a synthetic check that does not exercise the real path. A health check that only verifies the load balancer is up tells you nothing about whether the service can handle real requests. Synthetics must traverse the same code path real users do.

SLIs that are computed once a day. Daily-aggregated SLIs can't drive timely alerting or postmortem analysis. Compute SLIs at the same cadence the service operates — typically per-minute aggregates over rolling windows.
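Per-minute aggregates over a rolling window can be as simple as a bounded queue of (good, total) buckets. A sketch:

```python
from collections import deque

class RollingAvailability:
    """Availability over a rolling window of per-minute (good, total) buckets."""

    def __init__(self, window_minutes: int = 5):
        # maxlen makes old minutes fall off the window automatically.
        self.buckets = deque(maxlen=window_minutes)

    def record_minute(self, good: int, total: int) -> None:
        self.buckets.append((good, total))

    def availability(self) -> float:
        good = sum(g for g, _ in self.buckets)
        total = sum(t for _, t in self.buckets)
        return good / total if total else 1.0  # no traffic: vacuously available
```

This is the cadence alerting needs: a five-minute window updates every minute, so a breach is visible within minutes instead of tomorrow's daily report.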

The "all green" SLI. Some teams pick SLIs that are easy to keep green at 99.99%. Looks great in reports; tells you nothing because the target is too easy. The SLI should fail occasionally if it's calibrated honestly.

What to do this week

Three moves. (1) For your most-complained-about service, list its current SLIs. Run each through the four-question filter. Most teams find at least one fails. (2) Identify the top 3 user journeys for that service. Define availability of each one in user-experience terms (not system terms). (3) Implement synthetic checks for those 3 user journeys, running every minute. Compute availability per journey, alert when below the SLO. The work takes a sprint; the visibility lasts forever.
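The alerting condition in step (3) is a one-liner once per-journey availability is computed. A sketch with hypothetical journey names and SLO targets:

```python
def journeys_breaching_slo(availability_by_journey, slo_by_journey):
    """Return the journeys whose measured availability is below their SLO target."""
    return sorted(
        journey
        for journey, measured in availability_by_journey.items()
        if measured < slo_by_journey.get(journey, 0.999)  # default target is an assumption
    )
```

Wire the output to your paging system and you have per-journey alerting: noisy infrastructure metrics stay on dashboards, and only user-visible breaches wake anyone up.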