SLO Cascade Failures
When dependencies' SLOs break.
Risk
Reliability does not compose. If your service depends on a database that is 99% available and an authentication service that is also 99% available, your service cannot be more than about 98% available no matter how good your own code is. This is the cascading failure problem, and ignoring it is why so many teams commit to SLOs their architecture cannot meet.
The math that breaks SLO planning:
- Multiplicative availability: If you depend on N independent services with availability a1, a2, ..., aN, your maximum achievable availability is their product. Three 99% dependencies cap you at about 97%. Five 99.9% dependencies cap you at about 99.5%. The deeper your dependency tree, the lower the ceiling (the sketch after this list works through the numbers).
- Latency tail composition: The p99 latency of a service that fans out to N dependencies is not the sum of their p99s. It is governed by the slowest of the N calls, which is significantly worse than any single dependency's p99: to keep 99% of fan-out requests fast, each of five backends has to respond quickly about 99.8% of the time, so your p99 is set by their p99.8s. A service calling five backends each with a 100 ms p99 will see its own p99 closer to 250 ms.
- Error budget consumption is asymmetric: When a dependency fails, you burn your error budget at the rate of the dependency's failure, not yours. A 5 minute outage in a critical dependency burns 5 minutes of your budget too, regardless of how reliable your code is. Your SLO is hostage to theirs.
- Realistic targets are rare: Most teams set SLOs based on what leadership wants to be able to claim, not what their dependency math allows. A service committing to 99.99% on top of two 99.9% dependencies is making a promise the architecture cannot keep.
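A minimal sketch of both calculations, with illustrative numbers rather than real measurements: the availability ceiling is a straight product, and the fan-out p99 is estimated by simulating the slowest of five parallel calls against a made-up long-tailed latency distribution.

```python
import random

def composite_availability(availabilities: list[float]) -> float:
    """Ceiling on your availability when you hard-depend on every listed service."""
    ceiling = 1.0
    for a in availabilities:
        ceiling *= a
    return ceiling

print(composite_availability([0.99, 0.99]))    # ~0.980 -> about 98%
print(composite_availability([0.99] * 3))      # ~0.970 -> about 97%
print(composite_availability([0.999] * 5))     # ~0.995 -> about 99.5%

def fanout_p99(backends: int = 5, samples: int = 100_000) -> float:
    """p99 of a request that waits on the slowest of `backends` parallel calls.
    Each backend is modeled as usually fast with a 1% slow tail (p99 around 100 ms)."""
    def one_call() -> float:
        return random.uniform(5, 40) if random.random() < 0.99 else random.uniform(100, 400)
    maxima = sorted(max(one_call() for _ in range(backends)) for _ in range(samples))
    return maxima[int(samples * 0.99)]

print(f"fan-out p99: {fanout_p99():.0f} ms")   # far above any single backend's 100 ms p99
```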
The cascading failure problem is mathematical. You cannot solve it with better code. You solve it by either tightening the dependencies or decoupling from them.
Design
Once you accept that you cannot rely on perfect dependencies, the design choices that matter are the ones that reduce your effective dependency on each backend.
- Caching: A read-through cache with a sensible TTL turns a hard dependency into a soft one. If the cache hit rate is 95%, the availability you effectively see is roughly (cache_avail × 0.95) + (backend_avail × 0.05): a backend at 99% availability behind a 95%-hit-rate cache contributes more like 99.95% to your composite (worked through in the first sketch after this list).
- Circuit breakers: When a dependency is failing, stop hammering it. Open the breaker, fall back to a default response or a stale cache, and stop adding load to a struggling backend. This protects you from being dragged down by their slow recovery and protects them from being held under longer by your retries (a minimal breaker sketch follows the list).
- Bulkheads: Isolate dependency calls into separate thread pools or connection pools so a slow backend cannot exhaust your shared resources. A misbehaving payment service should not consume the worker pool serving search.
- Asynchronous boundaries: Where the use case allows, drop the synchronous call. Queue the work, return optimistically, reconcile later. A request that does not block on a downstream call cannot fail because of it.
- Graceful degradation: Decide ahead of time which features can run with reduced functionality when a dependency is unavailable. Search without personalization is still search. Checkout without recommendations is still checkout. Make the degraded path explicit instead of failing the request.
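The caching arithmetic above as a tiny sketch, with illustrative figures. The cache is treated as always up, which is the same assumption behind the 99.95% number.

```python
def effective_availability(backend_avail: float, cache_avail: float, hit_rate: float) -> float:
    """Hits only need the cache to be up; misses still need the backend."""
    return hit_rate * cache_avail + (1 - hit_rate) * backend_avail

# A 99% backend behind a 95%-hit-rate cache, treating the cache itself as always up:
print(effective_availability(backend_avail=0.99, cache_avail=1.0, hit_rate=0.95))  # 0.9995
```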
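And a minimal circuit-breaker sketch along the lines described above: consecutive-failure counting, a fixed cooldown, and a caller-supplied fallback. The thresholds and the fetch_profile / stale_profile_from_cache names are hypothetical placeholders, not a prescription.

```python
import time

class CircuitBreaker:
    """Minimal breaker sketch: open after consecutive failures, serve a fallback
    while open, and let one trial call through after a cooldown."""

    def __init__(self, failure_threshold: int = 5, reset_after_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                return fallback()          # open: do not add load to the struggling backend
            self.opened_at = None          # half-open: one more failure re-opens immediately
            self.failures = self.failure_threshold - 1
        try:
            result = fn()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            return fallback()

# Hypothetical usage; fetch_profile and stale_profile_from_cache are placeholders:
# breaker = CircuitBreaker()
# profile = breaker.call(lambda: fetch_profile(user_id),
#                        fallback=lambda: stale_profile_from_cache(user_id))
```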
The goal is not to eliminate dependencies. It is to make sure that dependency failures degrade your service partially, not totally. That is what makes SLOs achievable in a multi-service architecture.
Monitor
Even with the best design, dependency failures will happen. The question is how fast you find out and whether you can act before your own SLO burns. The answer is per-dependency telemetry that surfaces upstream issues before they become user-visible.
- Per-dependency error rate and latency: Track every outbound call by destination service, separately from your own service health. The dashboard for "your service" should be one tab. The dashboard for "your dependencies" should be another. Mixing them hides the cascade.
- Burn-rate alerts on dependency budget, not just yours: If a dependency is on track to consume more error budget than its own SLO allows, that is your early warning. You will see its budget burn before your own users notice, which is exactly when you want to know (a small burn-rate sketch follows this list).
- Tag every span with the dependency it calls: Distributed tracing must record which downstream call slowed or failed. When your own latency spikes, the trace tells you the cause is "dependency X took 2 seconds" instead of leaving you to guess (see the tracing sketch after the list).
- Cross-team alerting: When a dependency degrades, the team that owns it should hear about it from your monitoring before their own monitoring catches up. This sounds aggressive, but it is one of the highest-trust moves between teams. You become each other's canary.
- Status of every dependency on the SLO dashboard: When customers ask why your service is missing its SLO this month, the dashboard should show "dependency X consumed 60% of our budget through three incidents." That changes the conversation from "your team failed" to "the architecture has a structural issue we need to fix."
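A small sketch of the per-dependency burn-rate check described above. The service names, SLO targets, and hourly counts are illustrative; the 14.4x paging threshold is one common choice for a one-hour fast-burn window, not a universal constant.

```python
# Illustrative per-dependency counts for the last hour; in practice these come
# from your metrics backend, keyed by destination service.
DEPENDENCY_SLOS = {"auth-service": 0.999, "orders-db": 0.999, "payments-api": 0.995}

def burn_rate(errors: int, requests: int, slo_target: float) -> float:
    """How many times faster than 'sustainable' a dependency is eating its error budget.
    1.0 means exactly on budget; values far above 1.0 mean the budget runs out early."""
    if requests == 0:
        return 0.0
    return (errors / requests) / (1.0 - slo_target)

last_hour = {
    "auth-service": {"errors": 120, "requests": 50_000},
    "orders-db":    {"errors": 4, "requests": 80_000},
    "payments-api": {"errors": 2_500, "requests": 30_000},
}

for dep, counts in last_hour.items():
    rate = burn_rate(counts["errors"], counts["requests"], DEPENDENCY_SLOS[dep])
    if rate >= 14.4:      # common fast-burn paging threshold over a 1-hour window
        print(f"PAGE: {dep} burning error budget at {rate:.1f}x")
    elif rate >= 1.0:
        print(f"WARN: {dep} burning error budget at {rate:.1f}x")
```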
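And a sketch of per-dependency span tagging using the OpenTelemetry Python API. The peer.service attribute carries the destination service; the checkout-service, payments-api, and fetch_invoice names are placeholders.

```python
from opentelemetry import trace

tracer = trace.get_tracer("checkout-service")   # service name is an illustrative placeholder

def call_dependency(peer_service: str, fn):
    """Wrap an outbound call in a span attributed to the backend it hits, so a latency
    spike in your own traces points straight at the dependency responsible."""
    with tracer.start_as_current_span(f"call {peer_service}") as span:
        span.set_attribute("peer.service", peer_service)
        try:
            return fn()
        except Exception as exc:
            span.record_exception(exc)
            raise

# Hypothetical usage; fetch_invoice stands in for a real client call:
# invoice = call_dependency("payments-api", lambda: fetch_invoice(order_id))
```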
Per-dependency monitoring is what turns cascading failures from a mystery into a known and managed risk. Nova AI Ops tracks every outbound call by destination, computes per-dependency burn rate, alerts when a backend's failure mode is going to dominate your own SLO, and gives you the receipts to renegotiate dependency SLAs with the teams whose reliability is capping yours.