Aggregate SLOs vs Per-User
Aggregate hides individual experience.
Aggregate
The standard way to compute an SLO is across all requests in aggregate: total successes divided by total requests. The result is easy to compute, easy to display, and easy to game. The biggest weakness of aggregate SLOs is that they hide the case where most users have a great experience and a small subset have a terrible one. Both populations contribute to the same number; neither is visible separately.
What aggregate SLOs are good for and bad for:
- Easier to measure: The metric pipeline counts total successes and total requests, divides, and reports the ratio. The math is simple, the storage cost is small, the dashboards are clean. Aggregate SLOs are the right starting point for any SLO practice.
- Hides outliers: A service with 99.5% aggregate availability could be at 99.9% for most users and 95% for a small cohort. The aggregate is identical to that of a service that delivers 99.5% to everyone uniformly. The user experiences are radically different; the metric does not distinguish them.
- Standard for reporting: Customer SLAs are usually expressed in aggregate terms. The status page shows aggregate availability. Quarterly reports use aggregate numbers. The aggregate is the lingua franca for talking about reliability across the organization.
- Misses tenant-specific issues: When a single tenant's traffic produces high error rates because of their specific data shape, their specific integration, or a regional issue affecting only them, the aggregate SLO barely moves. The tenant churns; the aggregate stays green; nobody on the engineering team knows there was a problem.
- Misses concentration of failures: Some failure modes affect specific user cohorts disproportionately. Free-tier users on slower regional infrastructure. Enterprise customers with bespoke integrations. Mobile users on worse networks. The aggregate flattens all of these into one number.
Aggregate SLOs are necessary, not sufficient. They are the right top-line metric and the wrong investigation tool.
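The masking effect above can be seen in a few lines. This is a minimal sketch, not a production pipeline, and the cohort names and request counts are hypothetical, chosen so one cohort sits near 99.9% and a smaller one at 95%:

```python
# Hypothetical request tallies for two cohorts; all numbers are illustrative.
cohorts = {
    "most_users":   {"success": 91_830, "total": 91_922},  # ~99.9% for this cohort
    "small_cohort": {"success": 7_676,  "total": 8_080},   # 95% for this cohort
}

# Aggregate SLI: total successes over total requests, across all cohorts.
total_success = sum(c["success"] for c in cohorts.values())
total_requests = sum(c["total"] for c in cohorts.values())
aggregate = total_success / total_requests
```

The aggregate comes out near 99.5%, indistinguishable from a service that serves every user at 99.5%; the 95% cohort is invisible in the top-line number.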
Per-user
The complement to aggregate SLOs is per-user SLI tracking. Instead of computing reliability across all requests, compute it per user (or per tenant, per region, per cohort) and look at the distribution. The tail of that distribution tells the story the aggregate hides.
- Tail experience surfaces: The 1% of users with the worst experience may be at 95% availability while the median user is at 99.9%. The aggregate masks this; the per-user view exposes it. The 1% is usually where the most consequential bugs are hiding.
- Some users see 99%, others 95%: The dispersion of per-user SLO values is a signal. A tight distribution (everyone close to the median) means failures are random. A wide distribution (some users much worse than the median) means failures are systemic against specific cohorts.
- Cohort identification: When the per-user data shows a heavy tail, the next question is "who is in the tail?" Group the worst-experience users by cohort: tier, region, signup date, integration type. The cohort that shows up disproportionately is the structural issue worth investigating.
- Per-tenant SLA enforcement: Enterprise customers with individual SLAs need their own SLO calculation, not the aggregate. A tenant whose own slice is at 99% does not care that the aggregate is at 99.95%. They care about their own experience, and the SLA was written about their slice.
- Customer success integration: The customer success team needs per-customer reliability data. Aggregate is irrelevant to a CS conversation about a specific customer's experience. Per-user data is the input to proactive customer outreach.
Per-user SLOs are the diagnostic tool. They surface the failure modes that aggregate metrics hide, especially in multi-tenant SaaS where customers experience the platform very differently from each other.
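The per-user view can be sketched as a distribution over per-user success ratios. The tallies below are hypothetical (190 users near 99.9%, a 10-user degraded cohort at 95%), and "p99 per-user" here follows the convention above: the availability at the worst-1% boundary of the distribution:

```python
# Hypothetical per-user tallies: user -> (successes, total). Illustrative only.
per_user = {f"user{i}": (999, 1_000) for i in range(190)}
per_user.update({f"tenant{i}": (950, 1_000) for i in range(10)})

# Aggregate SLI: one number across all requests.
aggregate = sum(s for s, _ in per_user.values()) / sum(t for _, t in per_user.values())

# Per-user SLI distribution, worst users first.
availabilities = sorted(s / t for s, t in per_user.values())

# Availability at the worst-1% boundary of the per-user distribution.
k = max(1, len(availabilities) // 100)
p99_per_user = availabilities[k - 1]
```

With these numbers the aggregate sits above 99.6% while the worst-1% metric reads 95%: the same facts, two very different answers, which is exactly the gap the per-user view exists to expose.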
Layer
The right answer is not aggregate vs per-user. It is both, layered. The aggregate SLO is the headline; the per-user SLI is the investigation tool. Each answers a different question; using both is what makes the practice robust.
- Aggregate SLO at the top: The headline number that goes on the status page, in the SLA, in the executive dashboard. It is the customer-facing commitment and the company-wide reliability story.
- Per-user p99 SLI underneath: Track the worst 1% of users explicitly. Their experience is itself a metric. The 99th-percentile per-user availability tells you how bad the worst-served users have it. This is the metric that catches concentrated failure modes.
- Per-tenant SLI for enterprise customers: Each enterprise customer has its own SLI rolling up to its own SLA. The SRE team can tell, per customer, whether they are within commitment. The CS team can do proactive outreach when individual customers are at risk.
- Both signals in dashboards: The dashboard shows the aggregate at the top and the per-user distribution below. Operators see both at a glance. When the aggregate is fine but the tail is degrading, the team knows immediately that there is a tenant-specific issue.
- Both inform investment: The aggregate informs whether to invest in reliability at all. The per-user view informs where the investment should go. Together they produce focused reliability work, not generic reliability work.
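The per-tenant layer reduces to a compliance check over each tenant's own slice. A minimal sketch, with hypothetical tenant names, request counts, and SLA targets:

```python
# Hypothetical per-tenant rollups over the SLA window: successes, total
# requests, and each tenant's contracted availability target.
tenants = {
    "acme":   {"success": 99_950, "total": 100_000, "sla": 0.999},
    "globex": {"success": 49_300, "total": 50_000,  "sla": 0.995},
}

# Flag tenants whose own SLI is below their own SLA, regardless of the aggregate.
at_risk = []
for name, t in tenants.items():
    sli = t["success"] / t["total"]
    if sli < t["sla"]:
        at_risk.append((name, sli))
```

Here "acme" is within commitment while "globex" is not, even though the blended aggregate would look healthy; the at-risk list is the input for both SRE prioritization and proactive CS outreach.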
Aggregate plus per-user is the SLO architecture that scales from a single-tenant API to a multi-thousand-tenant SaaS platform. Nova AI Ops computes both layers in parallel, surfaces the per-user distribution alongside the aggregate, and identifies the cohorts that are pulling down the tail so the team's reliability investment targets the cases that matter most.
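The cohort-identification step, finding who is pulling down the tail, can be sketched as an over-representation check: compare each cohort's share of the tail against its share of the population. User IDs and cohort labels below are hypothetical:

```python
from collections import Counter

# Hypothetical user -> cohort mapping and the set of worst-1% ("tail") users.
cohort_of = {
    "u1": "enterprise", "u2": "enterprise", "u3": "free-eu", "u4": "free-eu",
    "u5": "free-eu", "u6": "free-us", "u7": "free-us", "u8": "free-us",
    "u9": "free-eu", "u10": "free-us",
}
tail_users = {"u3", "u4", "u9"}

overall = Counter(cohort_of.values())
tail = Counter(cohort_of[u] for u in tail_users)

# Lift: a cohort's share of the tail divided by its share of the population.
# Lift well above 1.0 marks a cohort that fails disproportionately.
lift = {
    c: (tail[c] / len(tail_users)) / (overall[c] / len(cohort_of))
    for c in tail
}
suspect = max(lift, key=lift.get)
```

In this toy data the "free-eu" cohort makes up 40% of users but 100% of the tail, a lift of 2.5, so that cohort is the structural issue worth investigating first.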