SLOs on Internal APIs
Internal APIs: SLOs are looser.
When
Internal APIs sit in a quiet middle ground. They are not customer-facing, so the optics of an outage are softer. They are not infrastructure-private, so other teams depend on them and an outage propagates. The right SLO posture for internal APIs sits in this middle: looser than customer-facing, tighter than nothing, and explicit about the trade.
The conditions under which a looser internal SLO actually makes sense:
- Internal teams are the only consumers. No partner integrations, no public docs, no signed contract with a third party. The blast radius of a breach is bounded by your own engineering org.
- Consumers can negotiate the contract. The teams reading from your API can ask for a tighter SLO and you can either commit or tell them no. There is a real conversation, not a one-way obligation.
- Failure modes are recoverable. A 5-minute outage on an internal data API costs engineering productivity but does not cost revenue or trust with paying customers. The cost of failure is real but bounded.
- Investment level matches. The team running the API is sized for the looser target. You are not staffing 24/7 oncall on a service that can be down for 30 minutes during business hours without anyone noticing.
If any of these conditions does not hold, the internal-SLO logic does not apply and the API needs the same rigor as a customer-facing one. The discount only applies inside the lines.
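The all-or-nothing nature of these conditions can be sketched as a simple check. This is a minimal illustration, not a prescribed tool: the `ApiProfile` type, its field names, and the two example APIs are hypothetical and exist only to mirror the four conditions above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ApiProfile:
    """Hypothetical description of an API; field names are illustrative."""
    internal_consumers_only: bool   # no partners, public docs, or third-party contracts
    consumers_can_negotiate: bool   # consuming teams can ask for a tighter SLO
    failures_recoverable: bool      # an outage costs productivity, not revenue or trust
    staffing_matches_target: bool   # the team is sized for the looser target


def qualifies_for_internal_slo(api: ApiProfile) -> bool:
    """Return True only if every condition holds; one failure voids the discount."""
    return all((
        api.internal_consumers_only,
        api.consumers_can_negotiate,
        api.failures_recoverable,
        api.staffing_matches_target,
    ))


# Example inputs (made up for illustration).
reporting_api = ApiProfile(True, True, True, True)
partner_feed = ApiProfile(False, True, True, True)  # has an external consumer
```

Under these assumptions `reporting_api` qualifies for the looser posture, while `partner_feed` fails the first condition and should be held to customer-facing rigor.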
Loose
What does looser actually mean numerically? The pattern that holds up across most internal API teams:
- 99% availability is the typical baseline. That works out to roughly 7.2 hours of allowed downtime in a 30-day month, which is enough budget for routine maintenance, planned migrations, and the occasional incident. Anything below 99% starts to cost real productivity for the consuming teams.
- Latency tolerances roughly 2x the customer-facing equivalent. If your public API targets p99 under 200 ms, your internal API can target p99 under 400 ms. Internal consumers are typically server-side and can tolerate higher latency without user-visible impact.
- Office hours, not 24/7, oncall. Page on degradation during business hours. After hours, page only on full outage. Internal consumers are mostly idle outside business hours and the cost of waking the oncall does not match the impact.
- Honest "no SLO" is better than a fake one. If a service is genuinely best-effort, label it that way. A wishful 99.9% SLO that the team has no plan to defend is worse than no SLO at all, because consumers build assumptions on it that will fail.
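The numbers in this list follow from simple arithmetic, sketched below. The helper names are illustrative, and the 30-day month is an assumption chosen to keep the figures round.

```python
def allowed_downtime_hours(availability: float, days: int = 30) -> float:
    """Hours of downtime a given availability target permits over `days` days."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - availability) * days * 24


def internal_latency_target_ms(public_p99_ms: float, factor: float = 2.0) -> float:
    """Internal p99 target as a multiple of the customer-facing one."""
    return public_p99_ms * factor


internal_budget = allowed_downtime_hours(0.99)    # about 7.2 hours per 30-day month
wishful_budget = allowed_downtime_hours(0.999)    # about 0.72 hours, roughly 43 minutes
internal_p99 = internal_latency_target_ms(200)    # 400.0 ms
```

The tenfold gap between the two budgets is the practical reason a 99.9% target needs a plan behind it: it leaves well under an hour a month for maintenance and incidents combined.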
Looser is not lazy. It is matching the contract to the actual cost of failure and the actual investment the team can make. That alignment is what keeps the SLO honest over time.
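The office-hours paging rule described above could be expressed as a small decision function. This is a sketch under stated assumptions: business hours are taken to be Monday to Friday, 09:00 to 18:00 local time, and the `Health` states and function names are hypothetical rather than taken from any particular alerting tool.

```python
from datetime import datetime, time
from enum import Enum


class Health(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    FULL_OUTAGE = "full_outage"


# Assumed business hours: Monday to Friday, 09:00 to 18:00 local time.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(18, 0)


def in_business_hours(now: datetime) -> bool:
    """True on weekdays between the assumed start and end times."""
    return now.weekday() < 5 and BUSINESS_START <= now.time() < BUSINESS_END


def should_page(health: Health, now: datetime) -> bool:
    """Page on any degradation in business hours; after hours, only on full outage."""
    if health is Health.HEALTHY:
        return False
    if in_business_hours(now):
        return True
    return health is Health.FULL_OUTAGE
```

With these assumptions, a degradation at 23:00 on a Wednesday waits until morning, while a full outage at the same hour still pages.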
Escalate
The most common SLO accident in internal APIs is the day they stop being internal. A new partner integration, an open beta, an acquisition, a "let's let customers hit this directly" decision. The audience changed. The SLO often did not, and now the team is on the hook for a customer-grade contract they were not staffing for.
- Tighten when the audience expands. The first move when an internal API gets a new external consumer is to upgrade its SLO target to match the new audience. 99% becomes 99.9%. Office hours becomes 24/7. The investment must follow.
- Explicit reclassification. Don't let an API drift from internal to external by accident. The reclassification is a deliberate decision, with sign-off from the team that owns it, the consuming team, and ops leadership. Update the catalog, update the runbook, update the oncall rotation.
- Soak before commit. Run the API at the new SLO target for at least a quarter before publishing the SLA externally. The looser-target operational habits (deploy windows, batch maintenance, lower oncall response) need to fade before the tighter contract goes live.
- Versioning helps. If the internal version of an API needs to keep its looser SLO while the external version takes the tighter one, version them explicitly (v1-internal vs v2-public). Pretending one endpoint can serve both audiences at one SLO target is how teams end up missing both.
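The tighten-then-soak sequence above might be modeled as follows. This is a rough sketch: the `SloProfile` type and function names are hypothetical, the targets are the ones quoted in the list, and "a quarter" is approximated as 90 days.

```python
from dataclasses import dataclass, replace
from datetime import date, timedelta

SOAK_PERIOD = timedelta(days=90)  # assumption: "a quarter" approximated as 90 days


@dataclass(frozen=True)
class SloProfile:
    audience: str        # "internal" or "external"
    availability: float  # e.g. 0.99
    oncall: str          # "office-hours" or "24/7"


def tighten_for_external(profile: SloProfile) -> SloProfile:
    """Return the customer-grade profile for an API that gains an external consumer."""
    return replace(profile, audience="external", availability=0.999, oncall="24/7")


def can_publish_sla(tightened_on: date, today: date) -> bool:
    """True once the API has run at the tighter target for the full soak period."""
    return today - tightened_on >= SOAK_PERIOD


internal = SloProfile(audience="internal", availability=0.99, oncall="office-hours")
external = tighten_for_external(internal)  # the original profile is left unchanged
```

Returning a new profile instead of mutating the old one keeps both on record, which suits the case where an explicitly versioned internal endpoint keeps its looser target alongside the public one.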
Internal APIs are where most reliability practices learn the difference between an SLO and a wish. Nova AI Ops tracks per-consumer traffic on every API, so the moment an internal endpoint starts taking external traffic, the audience-shift signal is visible and the conversation about tightening the SLO can happen before the next incident forces it.