SLO Cascade: Service Dependencies
A downstream SLO can never be better than its upstream dependencies allow.
Math
Cascading SLO failures are the kind of math that quietly invalidates ambitious reliability targets. If your service depends on N upstream services, and your code can only succeed when every upstream call succeeds, your maximum achievable SLO is the product of the upstream SLOs. The math gets worse fast.
The numbers you cannot escape:
- Multiplicative composition: Two upstream services at 99.9% give you a ceiling of 99.8%. Five at 99.9% give you 99.5%. Ten at 99.9% give you 99%. Each additional dependency drops the ceiling, no matter how reliable your own code is.
- Latency tail compounds: Your p99 is not the sum of upstream p99s. It is closer to the worst case across N parallel samples, which is significantly worse than any single upstream's p99. A request fanning out to 5 services with 100ms p99 each has roughly a 5% chance that at least one call exceeds 100ms, so with typical heavy-tailed latencies your own p99 can land closer to 250ms.
- Failure correlation makes it worse: The math above assumes independent failures. In practice, upstream failures correlate (shared infrastructure, a common cloud provider, a shared transitive dependency), so the worst case arrives much sooner than the multiplicative math suggests.
- You inherit upstream burns: When an upstream service burns its error budget, you burn yours at the same rate, because their failure is your failure. Your SLO is hostage to theirs whether they know it or not.
The first move when designing any SLO is walking the dependency tree, multiplying the upstream availabilities, and asking whether the target you want to commit is mathematically possible. Most of the time it is not, and the conversation has to shift from "how reliable do we want to be" to "what does the architecture allow."
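The dependency-tree walk above is a few lines of arithmetic. A minimal sketch (the availabilities are illustrative, not measurements):

```python
def composite_ceiling(upstream_availabilities):
    """Maximum achievable availability when every upstream call must
    succeed: the product of the upstream availabilities."""
    ceiling = 1.0
    for a in upstream_availabilities:
        ceiling *= a
    return ceiling

# Ten upstreams, each at 99.9%, cap you at roughly 99%:
print(f"{composite_ceiling([0.999] * 10):.4%}")

# Fan-out tail: the chance that at least one of 5 parallel calls
# exceeds its own p99 (assuming independent latencies):
print(f"{1 - 0.99 ** 5:.2%}")
```

Run this against your real dependency list before committing a target; if the product is already below the SLO you want, no amount of care in your own code closes the gap.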
Design
The architectural fix for cascading SLO failures is to reduce your effective dependency on each upstream. You cannot make a 99% upstream produce a 99.99% downstream by being more careful. You can soften the dependency so that upstream failures degrade you partially instead of totally.
- Caching reduces dependency on slow upstreams: A cache hit ratio of 95% means only 5% of requests reach the upstream, so your effective availability is roughly 0.95 × cache availability + 0.05 × upstream availability. A 99% upstream behind a 99.9% cache contributes about 0.95 × 99.9% + 0.05 × 99% ≈ 99.85% to your composite, much better than 99% raw.
- Decouple via async boundaries: If a downstream call does not have to happen on the request path, queue it. The request returns immediately, the call happens in the background, and the result reconciles later. Async dependencies drop out of your synchronous availability calculation entirely.
- Circuit breakers for graceful failure: When an upstream is degraded, stop hammering it. Open the breaker and fall back to a default response or a stale cache. The customer gets a degraded but useful answer instead of an error.
- Bulkheads to prevent compound damage: Isolate upstream calls into separate thread pools or connection pools so a slow upstream cannot exhaust your shared resources. A misbehaving payment service should not consume the worker pool serving search.
- Retry with a budget, not infinitely: A retry budget bounds how many retries you will spend per second on a given upstream. This protects you from being dragged down by upstream slowness, and protects the upstream from being held under longer by your retries.
The architectural goal is not zero dependencies. It is fewer hard dependencies on the request path, with soft fallback for the ones that remain. That ratio is the lever that determines whether your SLO is realistic.
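The circuit-breaker pattern above can be sketched in a few dozen lines. This is a minimal illustration, not a production implementation: `upstream` and `fallback` are hypothetical callables, and the threshold and timeout values are placeholders, not tuned recommendations.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: after enough consecutive failures,
    serve the fallback directly instead of calling the upstream."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def call(self, upstream, fallback):
        # While open, serve the fallback until the reset timeout elapses.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback()
            # Half-open: allow one trial call through.
            self.opened_at = None
            self.failures = 0
        try:
            result = upstream()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            return fallback()
```

The key property is the last line of each branch: the caller always gets an answer. The degraded path (stale cache, default response) replaces the error, which is exactly how a hard dependency becomes a soft one.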
Monitor
Even with the best design, upstream failures will show up in your SLO. The question is whether you find out fast enough to act and whether you have receipts when stakeholders ask why this quarter's budget burned.
- Per-dependency SLIs: Track every outbound call by destination service: error rate, latency, saturation. Each upstream gets its own dashboard tile. When your own SLO is burning, the per-dependency view tells you which upstream is contributing.
- Burn-rate alerting on dependencies: Alert on each dependency's SLO as measured from your side, even if their team runs their own alerts. You may see their budget burn before they do, and their failure mode is a leading indicator for yours.
- Tag every span with the upstream service: Distributed tracing must record which call slowed or failed. When your latency spikes, the trace tells you "upstream X took 2 seconds" rather than leaving you to guess. This is the single highest-value tracing investment for SRE.
- Surface upstream contribution in the SLO dashboard: When customers ask why your service is missing its SLO this quarter, the dashboard should show "upstream X consumed 60% of our budget across three incidents." This changes the conversation from "your team failed" to "the architecture has a structural risk we need to address."
- Quarterly dependency review: Review the SLOs of every service in your dependency tree against your own committed SLO. If a dependency has missed three quarters in a row, your SLO is at risk regardless of your own code. Either renegotiate the dependency contract or change your own commitment.
Cascading SLO failures look like your team's failure but are usually structural. Nova AI Ops tracks every outbound call by destination, computes per-dependency burn rate, and shows the contribution of each upstream to your own SLO so you can renegotiate dependency contracts with the teams whose reliability is capping yours.