SLO Ownership
Who owns the SLO when it's off track?
Service team
An SLO without an owner is a number nobody defends. The first question to answer in any SLO design is "who is on the hook when this misses?" The answer is almost always the team that owns the service, and that ownership has to be explicit in writing, not implied by org chart geography.
What service-team ownership actually entails:
- Owns the achievement: The team is responsible for hitting the target, full stop. No "the platform team broke it." No "our dependencies were down." When the SLO misses, the service team is the team that explains why and proposes the fix. They can call out dependency contributions, but they own the conversation.
- Owns the engineering investment: Reliability work that the SLO requires (better tests, more monitoring, refactoring a brittle component) lands on the service team's roadmap. Not the platform team's, not the SRE team's. The team that owns the SLO owns the work.
- Owns the trade-offs: When product wants a feature that would put the SLO at risk, the service team is the one negotiating the trade-off. They have the data and they have the commitment, so they have the standing.
- Owns the postmortem: Incidents that consume the budget produce postmortems written by the service team. The SRE team facilitates; the service team writes. The ownership of the lesson lives where the ownership of the system lives.
This boundary keeps SLOs honest. Teams that do not own their reliability stop investing in it, and the SLO becomes a number on a dashboard nobody actually defends.
Platform team
The service team owns the SLO. The platform or SRE team owns the infrastructure that makes the SLO measurable, defendable, and comparable across services. This is a different job, and conflating it with service ownership is how reliability practices fall apart.
- Owns the measurement infrastructure: Metric pipelines, log aggregation, tracing, dashboards, alerting. The platform team builds and runs the substrate so service teams can measure their SLOs without building telemetry from scratch each time.
- Owns telemetry quality: Wilson intervals on availability, percentile-correct latency math, tail-aware aggregation, drift detection on metric pipelines. The service team consumes the telemetry; the platform team makes sure the telemetry is right (a Wilson-interval sketch follows this list).
- Owns the SLO definitions library: A shared catalog of SLI templates (latency, availability, freshness, correctness) that service teams can pick up and apply. Without this, every service team reinvents the wheel and produces SLOs that are technically present but practically incomparable.
- Owns the runtime that enforces governance: Auto-rollback on SLO breach, freeze gates on burn rate, deploy-time SLO checks. The platform team builds the rails; the service team uses them (see the deploy-gate sketch below).
- Does not own service SLO outcomes: When a service misses, the platform team helps investigate but does not take blame. Their job is to provide the tooling, not to make the service reliable.
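To make the telemetry-quality point concrete: a raw availability ratio (good events over total) is noisy for low-traffic services, and the Wilson score interval gives a lower bound the team can actually defend. A minimal Python sketch; the function name and the standalone usage are illustrative, not a real Nova AI Ops API:

```python
import math

def wilson_lower_bound(good: int, total: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for an availability ratio.

    For low-traffic services the raw ratio good/total swings wildly; the
    Wilson lower bound at 95% confidence (z=1.96) is a conservative
    estimate that converges to the raw ratio as traffic grows.
    """
    if total == 0:
        return 0.0  # no traffic: nothing to claim availability on
    p_hat = good / total
    denom = 1 + z**2 / total
    center = p_hat + z**2 / (2 * total)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / total + z**2 / (4 * total**2))
    return (center - margin) / denom

# 497 good out of 500 requests: the raw ratio says 99.4% available,
# but the Wilson lower bound is a more defensible ~98.3%.
print(wilson_lower_bound(497, 500))
```

Reporting the lower bound rather than the raw ratio errs conservative, which is the right direction for a number somebody has to defend.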
The platform team's job is to make it easy for service teams to do the right thing. They do not get credit for service team success and they do not get blame for service team failure. That separation is what keeps the platform investment focused on leverage rather than on individual service rescue work.
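As a sketch of those rails: a deploy-time check can compare the current error-budget burn rate against a freeze threshold before a release proceeds. Everything here (the SLO shape, the 2x threshold, the function names) is an illustrative assumption, not a description of any particular platform:

```python
from dataclasses import dataclass

@dataclass
class SLO:
    """Hypothetical SLO definition, as a platform template might express it."""
    name: str
    target: float            # e.g. 0.999 for three nines
    window_days: int = 30    # rolling compliance window

def burn_rate(slo: SLO, bad_fraction: float) -> float:
    """How fast the error budget is being consumed.

    A burn rate of 1.0 spends the budget exactly at the end of the
    window; 2.0 exhausts it halfway through.
    """
    budget = 1.0 - slo.target
    return bad_fraction / budget if budget > 0 else float("inf")

def deploy_gate(slo: SLO, bad_fraction: float, freeze_threshold: float = 2.0) -> bool:
    """Return True if a deploy may proceed, False to freeze.

    The 2x threshold is an illustrative starting point, not a universal
    constant.
    """
    return burn_rate(slo, bad_fraction) < freeze_threshold

checkout = SLO(name="checkout-availability", target=0.999)
# 0.3% of requests failing against a 0.1% budget: 3x burn, deploy frozen.
assert not deploy_gate(checkout, bad_fraction=0.003)
```

The threshold is a policy knob, not a mechanism: the platform team owns the gate, the service team owns the number it gates on.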
Escalation
Most SLO breaches resolve at the service team level: a sprint of reliability work, a fix, the budget recovers, the practice continues. Some do not. When breaches persist or compound, the escalation path has to be defined ahead of time so the team is not inventing it during a moment when judgment is already strained. One workable ladder, with a routing sketch after the list:
- One-quarter miss: service-team retro and corrective plan. The team writes up what caused the miss, what they are doing differently, and the timeline. Visible to engineering leadership. The team continues to operate normally; this is information, not intervention.
- Two-quarter miss: SRE leadership engages. The director-level SRE owner is now in the conversation. This is the layer that allocates platform investment, helps prioritize cross-team work, and can spot patterns across multiple service teams. The conversation is structural, not tactical.
- Three-quarter miss: executive-level resource decision. The org has to decide either to invest more in the team's reliability capacity (headcount, oncall budget, dependency renegotiation) or to publicly relax the SLO target to match what the architecture can sustain. Either is acceptable. Continuing to miss the same target every quarter is not.
- Major incident: cross-team postmortem. When a single incident consumes most of the budget, the retro escalates beyond the service team to include any contributing dependencies, the platform team, and any operational layer that played a role. This is broader than the routine retro and produces structural changes.
- Don't escalate too early: A team that misses one quarter is not a team in trouble. It is a team that learned something. Escalating prematurely undermines ownership and produces a culture of CYA writing instead of honest reflection.
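The ladder is mechanical enough to encode, which is part of the point: routing should not depend on someone remembering the policy mid-incident. A minimal sketch; the tier strings mirror the list above and the function is hypothetical:

```python
def escalation_level(consecutive_quarter_misses: int) -> str:
    """Map consecutive quarterly SLO misses to the escalation tier.

    Mirrors the ladder above: information at one quarter, structural
    engagement at two, a resource-or-retarget decision at three or more.
    """
    if consecutive_quarter_misses <= 0:
        return "none"
    if consecutive_quarter_misses == 1:
        return "service-team retro and corrective plan"
    if consecutive_quarter_misses == 2:
        return "SRE leadership engagement"
    return "executive resource decision: invest or relax the target"

assert escalation_level(1) == "service-team retro and corrective plan"
assert escalation_level(4) == "executive resource decision: invest or relax the target"
```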
SLO ownership done right draws a clear boundary between service-team accountability, platform-team enablement, and leadership escalation. Nova AI Ops tracks per-service SLO compliance, identifies the contributing dependencies, surfaces persistent breach patterns, and routes the escalation signal so the right team owns the right part of the conversation at the right moment.