SLOs From Dependent Services
Your SLO inherits from dependencies.
Math
Your SLO is not a thing you set; it is a thing your dependencies hand you. If your service can only succeed when every backend it calls also succeeds, your maximum achievable SLO is bounded by the product of every dependency's SLO. The math is unforgiving and most teams discover it too late, after they have published a target their architecture cannot support.
The numbers that matter:
- Multiplicative composition: If your service depends on N services with availabilities a1, a2, ..., aN, and each call is required for your service to respond, your maximum availability is the product a1 × a2 × ... × aN. Five 99.9% dependencies cap you at 99.5% (0.999 to the fifth power). Ten 99.9% dependencies cap you at 99%. The ceiling drops fast.
- Latency tail composition is worse: Your p99 is not the sum of dependency p99s; for parallel fan-out it behaves like the worst case across all of them. A service calling five backends, each with a 100ms p99, will see its own p99 closer to 250ms, because at least one of the five calls lands in the slow tail far more often than 1% of the time.
- Your work adds variability: On top of the dependency floor, your own code adds error rate, latency, and saturation. Even with perfect dependencies, your SLO cannot exceed your own code's reliability. The two compose: dependency_avail × your_avail = upper bound.
- You inherit your dependencies' burns: When a dependency burns its budget, you burn yours at the same rate, regardless of how reliable your code is. Your SLO is hostage to theirs.
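The composition rules above are easy to verify directly. A minimal sketch in Python; the dependency counts and availabilities are the illustrative figures from the list, not real measurements:

```python
from functools import reduce

def availability_ceiling(dep_avails, own_avail=1.0):
    """Maximum achievable availability when every dependency call
    must succeed: the product of all availabilities, including your own."""
    return own_avail * reduce(lambda a, b: a * b, dep_avails, 1.0)

def tail_hit_probability(n_backends, per_call_tail=0.01):
    """Probability that at least one of n parallel calls lands in its
    slow tail (beyond its own p99): 1 - (1 - p)^n."""
    return 1 - (1 - per_call_tail) ** n_backends

# Five 99.9% hard dependencies cap you at ~99.5%; ten at ~99%.
print(round(availability_ceiling([0.999] * 5), 4))   # 0.995
print(round(availability_ceiling([0.999] * 10), 4))  # 0.99

# With five parallel backends, ~4.9% of requests hit at least one
# backend's slowest percentile -- that slow tail becomes your p95, not p99.
print(round(tail_hit_probability(5), 3))             # 0.049
```

The same `availability_ceiling` call, seeded with your real dependency tree, gives the number to compare against any SLO you are about to publish.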
The first move in any SLO conversation is to walk the dependency tree, multiply out the implied ceiling, and ask whether the number you want to publish is mathematically possible. Most of the time it is not, and the conversation has to shift from "how reliable do we want to be" to "what does the architecture allow."
Design
Once you accept the dependency math, the design choices that matter are the ones that reduce your effective dependency on each backend. There is no way to make a hard dependency on a 99% service produce a 99.99% own-service SLO. There are several ways to soften the dependency.
- Caching: A read-through cache with a short TTL turns a hard dependency into a soft one. If 95% of requests hit cache and 5% reach the backend, your effective backend availability becomes (cache_avail × 0.95) + (backend_avail × 0.05). A 99% backend behind a 99.9% cache contributes about 99.85% to your SLO instead of 99%.
- Redundancy: Two independent instances of the same service (different regions, different AZs, different vendors) reduce the effective failure rate. Two 99% services in active-active configuration give you something like 99.99% if their failures are uncorrelated. Correlation matters: redundancy across a shared substrate buys less than the math suggests.
- Circuit breakers and fallback: When a dependency fails, fall back to a default response or a stale cache instead of failing the request. The customer gets a degraded but useful answer instead of an error. Degraded counts as success against most reasonable SLO definitions.
- Asynchronous decoupling: Where the use case allows, queue the dependency call instead of blocking on it. The request returns immediately; the dependency call happens in the background; the result reconciles later. A request that does not block on a downstream service cannot fail because of it.
- Eliminate optional dependencies: Some calls do not have to happen on the request path. Personalization, analytics, recommendations, audit logging. Defer them to a queue or do them server-side after responding. Every dependency you remove from the critical path raises the ceiling.
The architectural goal is not zero dependencies. It is fewer hard dependencies on the request path. That ratio is the lever.
Track
Even with the best design, dependency failures will still show up in your SLO. The question is whether you find out fast enough to act, and whether you have receipts when stakeholders ask why this quarter's budget burned.
- Per-dependency SLI: Track every outbound call by destination service. Error rate, latency, saturation. Each dependency gets its own dashboard tile. When your own SLO is burning, the per-dependency view tells you which backend is contributing.
- Burn-rate alerting on dependencies you do not own: Set burn-rate alerts on your dependencies' SLOs as you observe them, even if the owning team has its own. Their budget burn is a leading indicator for your own SLO, and you may notice it before they do.
- Tag every span with the upstream service: Distributed tracing must record which call slowed or failed. When your latency spikes, the trace tells you "dependency X took 2 seconds" rather than leaving you to guess. This is the single highest-value tracing investment for SRE.
- Quarterly dependency review: Each quarter, review the SLOs of every service in your dependency tree against your own committed SLO. If a dependency has missed three quarters in a row, your SLO is at risk regardless of your own code. Either renegotiate the dependency contract or change your own commitment.
- Surface the dependency burn in stakeholder reviews: When you miss SLO this month, the executive review should show "dependency X burned 60% of our budget." This changes the conversation from "your team failed" to "the architecture has a structural issue we need to fix together."
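The per-dependency attribution described above reduces to simple bookkeeping once outbound calls are labeled by destination. A minimal sketch; the service names and call results are made up for illustration:

```python
from collections import defaultdict

class DependencyBurnTracker:
    """Attributes your own error-budget burn to the upstream
    dependency whose call failed."""

    def __init__(self):
        self.failures = defaultdict(int)
        self.total_failures = 0

    def record(self, dependency, ok):
        """Record one outbound call result, labeled by destination."""
        if not ok:
            self.failures[dependency] += 1
            self.total_failures += 1

    def burn_contribution(self):
        """Share of your budget burn attributable to each dependency."""
        if self.total_failures == 0:
            return {}
        return {dep: n / self.total_failures
                for dep, n in self.failures.items()}

tracker = DependencyBurnTracker()
for dep, ok in [("auth", False), ("auth", False), ("billing", False),
                ("auth", True), ("search", True)]:
    tracker.record(dep, ok)

# auth caused 2 of 3 failures, billing 1 of 3.
print(tracker.burn_contribution())
```

In practice the `record` calls come from your RPC client middleware, and the contribution map is exactly the "dependency X burned 60% of our budget" slide for the stakeholder review.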
Your SLO is honest only when it accounts for what your dependencies actually deliver. Nova AI Ops tracks every outbound call by destination, computes per-dependency burn rate, and shows the contribution of each upstream to your own SLO so you can renegotiate dependency contracts with the teams whose reliability is capping yours.