SLO & Reliability Practical
By Samson Tanimawo, PhD · Published Jun 28, 2025 · 4 min read

SLO Cascade Failures

When dependencies' SLOs break.

Risk

Reliability does not compose. If your service depends on a database that is 99% available and an authentication service that is also 99% available, your service cannot be more than about 98% available (0.99 × 0.99 ≈ 0.98), no matter how good your own code is. This is the cascading failure problem, and it is why so many teams set SLOs they cannot possibly meet.

The math that breaks SLO planning is simple: availabilities of serial dependencies multiply, so the ceiling on your service's availability is the product of every hard dependency's availability.
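To make that multiplication concrete, here is a minimal sketch (the function name is mine, not from the article) that computes the availability ceiling for a service with hard dependencies:

```python
# Sketch: serial availability multiplies, so every hard dependency
# lowers the ceiling on your own achievable SLO.
from math import prod

def availability_ceiling(dependency_availabilities):
    """Upper bound on a service's availability when it hard-depends on
    every backend in the list (all must be up simultaneously)."""
    return prod(dependency_availabilities)

# The example from the text: a 99% database and a 99% auth service.
print(f"{availability_ceiling([0.99, 0.99]):.4f}")        # 0.9801

# Add a third, 99.9% dependency and the ceiling drops further.
print(f"{availability_ceiling([0.99, 0.99, 0.999]):.4f}")
```

Note what this implies in reverse: a 99.9% SLO for your own service is arithmetically impossible if any single hard dependency offers less than 99.9%.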

The cascading failure problem is mathematical. You cannot solve it with better code. You solve it either by negotiating tighter SLOs from your dependencies or by decoupling from them.

Design

Once you accept that no dependency is perfectly reliable, the design choices that matter are the ones that reduce your effective dependency on each backend: timeouts, cached or default fallbacks, and circuit breakers that let you keep serving something when a backend cannot respond.

The goal is not to eliminate dependencies. It is to make sure that dependency failures degrade your service partially, not totally. That is what makes SLOs achievable in a multi-service architecture.
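One common way to turn a total failure into a partial one is a fallback: if a non-critical backend errors out, serve a degraded default instead of an error page. A minimal sketch, with hypothetical names (`fetch`, `DEFAULT_RECOMMENDATIONS`) not taken from the article:

```python
# Sketch: degrade instead of failing when a non-critical dependency is down.
DEFAULT_RECOMMENDATIONS = ["popular-item-1", "popular-item-2"]

def recommendations_with_fallback(fetch):
    """Call the recommendation backend, but never let its failure take the
    whole page down: on any error (the fetch callable is expected to enforce
    its own network timeout), serve a static default so the user sees a
    degraded page rather than an error page."""
    try:
        return fetch()
    except Exception:  # timeout, connection refused, backend 5xx, ...
        return DEFAULT_RECOMMENDATIONS
```

The design point is that the recommendation dependency now affects only the quality of one page element, not the availability of the page, so its outages no longer multiply into your SLO.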

Monitor

Even with the best design, dependency failures will happen. The question is how fast you find out and whether you can act before your own SLO burns. The answer is per-dependency telemetry that surfaces upstream issues before they become user-visible.
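The core of per-dependency telemetry can be sketched in a few lines: record the outcome of every outbound call by destination, then compare each backend's failure rate to your own error budget. This is an illustrative sketch with hypothetical names, not any particular product's implementation:

```python
# Sketch: per-dependency failure tracking and burn rate against your SLO.
from collections import defaultdict

class DependencyMonitor:
    def __init__(self, slo_target=0.999):
        self.slo_target = slo_target              # your own service's SLO
        self.calls = defaultdict(lambda: [0, 0])  # dep -> [total, failed]

    def record(self, dependency, ok):
        counts = self.calls[dependency]
        counts[0] += 1
        if not ok:
            counts[1] += 1

    def burn_rate(self, dependency):
        """One backend's failure rate divided by your error budget.
        Above 1.0, this dependency alone is burning your budget faster
        than your SLO allows -- time to page, or to renegotiate."""
        total, failed = self.calls[dependency]
        if total == 0:
            return 0.0
        error_budget = 1.0 - self.slo_target
        return (failed / total) / error_budget

    def dominating(self):
        """Dependencies whose failures currently dominate your SLO."""
        return [d for d in self.calls if self.burn_rate(d) > 1.0]
```

For example, with a 99.9% SLO (a 0.1% error budget), a database failing 0.2% of its calls shows a burn rate of about 2.0: that one backend is consuming your entire budget twice over, and the per-dependency numbers are exactly the evidence you bring to the SLA conversation.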

Per-dependency monitoring is what turns cascading failures from a mystery into a known, managed risk. Nova AI Ops tracks every outbound call by destination, computes per-dependency burn rate, alerts when a backend's failure mode is about to dominate your own SLO, and gives you the receipts to renegotiate dependency SLAs with the teams whose reliability is capping yours.