Customer-Facing SLOs vs Internal
Externally promised vs internal targets.
External
The customer-facing SLO is the number you publish, the number on the SLA, the number procurement reads in the contract. It is the legal commitment with financial consequences when missed. Treat it as the contract it is, and pick the number conservatively.
What an external SLO means in practice:
- Contract with consequences.: The published SLA usually includes service credits, penalty clauses, or termination rights when the SLO is missed. Customers reference it during procurement. Legal references it during disputes. The number is load-bearing in the commercial relationship.
- Penalty if missed.: Service credit is the most common penalty, typically 10 to 50% of the affected month's fees if availability falls below the threshold. For high-value contracts, the financial exposure of repeated misses is real money. The number you commit to is also the number you bet on.
- Conservative by design.: The published number should be one your engineering team can hit comfortably, not the number you barely achieved last month. Conservative is not the same as low; it is the same as defensible across multiple quarters under realistic operating conditions.
- Stable, not aspirational.: External SLOs do not change frequently. Customers build their own systems on top of yours assuming the contract holds. Lowering the published number is a renegotiation that requires customer notice and often contract amendments. Raising it is fine but requires sustained evidence first.
- Audited and reported.: Most published SLOs come with quarterly performance reports, status pages, or RFOs (reasons for outage). The number is not just published once; it is defended on an ongoing basis. The reporting cadence is part of the cost.
The external SLO is a customer commitment, a sales tool, and a legal artifact in one. Pick the number with the same care you would pick any other contract term.
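To make the stakes concrete, the arithmetic behind a published number can be sketched in a few lines. This is a minimal illustration, not a contract model: it assumes a 30-day month, and the credit tiers in `EXAMPLE_CREDIT_TIERS` are hypothetical values picked from within the 10 to 50% range mentioned above. Real tiers come from the contract itself.

```python
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 minutes (assumed window)


def allowed_downtime_minutes(slo_percent: float,
                             window_minutes: int = MINUTES_PER_30_DAY_MONTH) -> float:
    """Downtime the availability SLO permits within the window, in minutes."""
    if not 0.0 <= slo_percent <= 100.0:
        raise ValueError("slo_percent must be between 0 and 100")
    return window_minutes * (100.0 - slo_percent) / 100.0


# Hypothetical schedule: (availability floor in %, credit as % of monthly fees).
# Ordered from best to worst availability. Illustrative only.
EXAMPLE_CREDIT_TIERS = [
    (99.9, 0),   # SLO met, no credit
    (99.0, 10),
    (95.0, 25),
    (0.0, 50),
]


def service_credit_percent(measured_percent: float, tiers=EXAMPLE_CREDIT_TIERS) -> int:
    """Return the credit of the first tier whose floor the measured availability meets."""
    for floor, credit in tiers:
        if measured_percent >= floor:
            return credit
    return tiers[-1][1]  # below every floor: worst-case credit


if __name__ == "__main__":
    print(round(allowed_downtime_minutes(99.9), 1))   # minutes of downtime allowed at 99.9%
    print(service_credit_percent(99.5))               # credit under the example tiers
```

At 99.9% over a 30-day month, the permitted downtime is about 43 minutes. That is the whole margin the published number leaves before credits apply.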
Internal
The internal SLO is what engineering aims for. It is tighter than the external one, and the gap between them is the reliability buffer that keeps the team from accidentally breaching the contract. The internal target is for engineering eyes; the external target is for customer eyes.
- Engineering target.: The internal SLO is what the team commits to internally, what burn-rate alerts fire against, what the deploy-freeze policy uses as its trigger. It is the operational number, separate from the contractual one.
- Tighter than external.: If the external SLO is 99.9%, the internal target is 99.95% or higher. The gap is the buffer that absorbs the inevitable volatility. Without the buffer, engineering's bad month is also the customer's bad month, and the SLA is at immediate risk.
- Buffer absorbs noise and incidents.: Real systems have variability that the external SLA cannot tolerate, so the internal buffer has to absorb it: maintenance windows, dependency outages, regional events. The buffer is what lets engineering recover from these without breaching the customer commitment.
- Drives operational decisions.: Deploy freezes, on-call escalation, and reliability sprints all key off the internal target. Engineering responds to internal-budget burn, not external. By the time the external SLO is at risk, the internal alarm should have fired well beforehand.
- Not visible to customers.: The internal target stays internal. Don't publish it. Don't put it on the status page. Customers see the external commitment and the actual performance, not the buffer that sits between them. Publishing the internal target dilutes the contract and confuses the conversation.
The internal target is what makes the external SLO routinely meetable. Without it, every quarter is a coin flip between meeting and missing.
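The two-target check can be sketched as a small classifier. This is an illustration under stated assumptions: availability is measured as the share of successful requests in the window, and the function names and status labels (`error_budget_used`, `slo_status`, `"internal_breach"`) are hypothetical, not part of any particular tool.

```python
def error_budget_used(total_requests: int, failed_requests: int,
                      target_percent: float) -> float:
    """Fraction of the error budget consumed; above 1.0 means the target is missed."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0.0 < target_percent < 100.0:
        raise ValueError("target_percent must be strictly between 0 and 100")
    allowed_failures = total_requests * (100.0 - target_percent) / 100.0
    return failed_requests / allowed_failures


def slo_status(total_requests: int, failed_requests: int,
               internal_percent: float, external_percent: float) -> str:
    """Classify the window against the internal target and the external SLO."""
    if internal_percent < external_percent:
        raise ValueError("internal target should be at least as tight as the external SLO")
    if error_budget_used(total_requests, failed_requests, external_percent) > 1.0:
        return "external_breach"   # the customer commitment is missed
    if error_budget_used(total_requests, failed_requests, internal_percent) > 1.0:
        return "internal_breach"   # engineering target missed, contract still intact
    return "healthy"


if __name__ == "__main__":
    # 700 failures in 1,000,000 requests: 99.93% availability.
    print(slo_status(1_000_000, 700, internal_percent=99.95, external_percent=99.9))
```

In the example run, the service is below its 99.95% internal target but still inside the 99.9% external SLO. That is the state where engineering should already be reacting.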
Ratio
The right gap between internal and external is a function of how volatile your system is and how much margin your operating model can afford. The rule of thumb below works for most teams.
- Internal error budget is roughly 0.5 times the external error budget.: If the external SLA allows 0.1% downtime (99.9%), the internal budget is 0.05% (99.95%). The team's "we missed" alarm fires when half the customer-facing budget remains, not when it is gone.
- 50% headroom is the sweet spot.: A thinner buffer (the internal alarm fires only after 75% of the external budget is burned) leaves little room to react before the contract is at risk. A wider buffer (the internal alarm fires after 25% of the external budget is burned) creates noise that desensitizes the team to real problems. 50% is the practical middle.
- Adjust by service criticality.: Revenue-path services with sharp customer impact may warrant a wider buffer, with the internal budget set at, say, 30% of the external one. Internal services that already have looser external SLOs may run thinner buffers (e.g., an internal budget at 70% of the external one). The defaults are sensible starting points; service teams can tune within reasonable bands.
- Different metric, same logic.: The buffer applies to all SLO dimensions: latency, error rate, freshness, correctness. Each has its own external commitment, its own internal target, and its own buffer. Tracking them separately keeps the math honest.
- Recalibrate the ratio after misses.: If the internal alarm consistently fires but the external SLA is still met comfortably, the buffer is too conservative; narrow the gap. If the external SLA is missed despite the internal alarm firing in time, the buffer is too narrow; widen it. The ratio is dynamic, not fixed.
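The ratio arithmetic and the recalibration rule can be sketched as follows. Here `budget_ratio` means the internal error budget divided by the external error budget (0.5 by default). The adjustment step of 0.1 and the 0.3 to 0.7 clamp are assumptions borrowed from the example figures in this section, not fixed rules.

```python
def internal_target_percent(external_percent: float, budget_ratio: float = 0.5) -> float:
    """Derive the internal target from the external SLO and the budget ratio."""
    if not 0.0 < budget_ratio <= 1.0:
        raise ValueError("budget_ratio must be in (0, 1]")
    external_budget = 100.0 - external_percent   # e.g. 0.1 for a 99.9% SLO
    return 100.0 - external_budget * budget_ratio


def recalibrate_ratio(budget_ratio: float,
                      internal_alarm_fired: bool,
                      external_met_comfortably: bool,
                      external_missed: bool,
                      step: float = 0.1,      # assumed adjustment size
                      low: float = 0.3,       # assumed lower band
                      high: float = 0.7) -> float:   # assumed upper band
    """Nudge the ratio after a review period, following the rule in the text."""
    if internal_alarm_fired and external_missed:
        # Buffer too narrow: shrink the internal budget to widen the gap.
        return max(low, round(budget_ratio - step, 2))
    if internal_alarm_fired and external_met_comfortably:
        # Buffer too conservative: grow the internal budget to narrow the gap.
        return min(high, round(budget_ratio + step, 2))
    return budget_ratio


if __name__ == "__main__":
    for ratio in (0.3, 0.5, 0.7):
        print(ratio, round(internal_target_percent(99.9, ratio), 3))
```

With a 99.9% external SLO, ratios of 0.3, 0.5, and 0.7 correspond to internal targets of roughly 99.97%, 99.95%, and 99.93%. The same two functions apply per dimension (latency, error rate, freshness) as long as each is expressed as a percentage target.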
The customer-facing SLO is the commitment. The internal SLO is the operational target. The gap between them is the engineering practice. Nova AI Ops tracks both targets per service, fires alerts on the internal-budget threshold, and surfaces the buffer remaining so engineering responds before the customer commitment is at risk.