SLOs and Circuit Breakers
Circuit breakers protect the SLO.
Idea
Circuit breakers are the operational safety net that lets a service degrade gracefully instead of cascading into total failure. Most teams know about circuit breakers as a code-level pattern (Hystrix, resilience4j) without making the connection to SLO management. The integration is the win: tie circuit-breaker triggers to SLO burn rate, and the service automatically protects its own SLO under pressure.
What SLO-aware circuit breaking does:
- When SLO at risk, breakers open.: The circuit breaker monitors the SLO burn rate, not just the raw error rate. When the burn rate exceeds a threshold (e.g., 14x normal for more than 1 minute), the breaker opens. Calls to the protected dependency are short-circuited; the caller falls back to a default response or a cached value.
- Preserve core function during partial failure.: The breaker isolates the failing dependency. The rest of the service continues to operate. A search service whose recommendation backend is failing can serve search results with no recommendations, instead of failing the whole search request. The user sees a degraded but useful experience.
- Trade some functionality for stability.: The trade is explicit: features that depend on the broken backend stop working, but the service stays up. This is almost always the right trade. A working search without recommendations is much better than a failed search with rich recommendations.
- Auto-recover when SLO is safe.: When the burn rate drops back below the recovery threshold and stays there for the hysteresis window, the breaker closes. Calls to the dependency resume. The cycle is automatic; the breaker manages itself based on the SLO signal.
- Documented per service.: The breaker behavior is documented as part of the service's SLO definition. "When the recommendation backend is unhealthy, search returns results without recommendations and the recommendation field is null." The contract is visible to consumers; nobody is surprised by what degradation looks like.
SLO-aware circuit breakers are the difference between a service that protects its own SLO and one that has its SLO burned by every dependency that misbehaves.
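The loop described above can be sketched in a few lines. This is a minimal illustration, not any particular library's API: the class name, thresholds, and windows are all assumptions chosen to mirror the example in the list (open above 14x burn for a minute, close after a sustained recovery).

```python
class BurnRateBreaker:
    """Opens when the SLO burn rate stays above a threshold for a window;
    closes again only after the burn rate has stayed below a recovery
    threshold for a hysteresis window. Numbers are illustrative."""

    def __init__(self, open_threshold=14.0, recover_threshold=1.0,
                 open_window=60.0, hysteresis_window=300.0):
        self.open_threshold = open_threshold
        self.recover_threshold = recover_threshold
        self.open_window = open_window
        self.hysteresis_window = hysteresis_window
        self.state = "closed"
        self._above_since = None  # when burn first exceeded the threshold
        self._below_since = None  # when burn first dropped below recovery

    def observe(self, burn_rate, now):
        """Feed the current SLO burn rate; returns the breaker state."""
        if self.state == "closed":
            if burn_rate > self.open_threshold:
                if self._above_since is None:
                    self._above_since = now
                if now - self._above_since >= self.open_window:
                    self.state = "open"
                    self._below_since = None
            else:
                self._above_since = None
        else:  # open: require sustained recovery before closing
            if burn_rate < self.recover_threshold:
                if self._below_since is None:
                    self._below_since = now
                if now - self._below_since >= self.hysteresis_window:
                    self.state = "closed"
                    self._above_since = None
            else:
                self._below_since = None
        return self.state

    def call(self, protected, fallback):
        """Short-circuit to the fallback while the breaker is open."""
        return fallback() if self.state == "open" else protected()
```

A production breaker would typically add a half-open probe state before fully closing; the sketch collapses that into the hysteresis window to keep the SLO-driven open/close cycle visible.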
Setup
Setting up SLO-aware breakers is more configuration than code. The breaker library handles the mechanics; the SLO integration is the part that requires deliberate design.
- Threshold per service.: Each protected dependency has its own breaker with thresholds tuned to its SLO contribution. A backend that contributes 30% of the SLO budget needs a tighter breaker threshold than one that contributes 5%. The configuration matches the dependency's importance.
- Auto-trigger on burn rate.: The breaker watches the SLO burn rate from the dependency's perspective. When the burn rate exceeds the configured threshold for a configured window, the breaker opens. Both the threshold and the window are tuned per dependency.
- Rules-based, not single-metric.: The breaker can fire on combinations: high error rate AND high latency, OR high saturation. The rules let you express "the dependency is unhealthy in a way that matters for SLO" rather than just "error rate is high right now."
- Configurable fallback.: What happens when the breaker is open is configured per dependency. Default response. Cached value. Empty result. Graceful error to the caller. The fallback is a deliberate design decision, not a default.
- Observability built in.: Every breaker state change emits an event. Open. Half-open. Closed. The events go to the observability system so the team can see in real time which breakers are firing and why. The state changes are also part of the SLO retro data.
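The rules-based trigger above amounts to a predicate over several signals rather than a single metric. A hedged illustration, with signal names and thresholds that are placeholder assumptions:

```python
def unhealthy(error_rate, p99_latency_ms, saturation):
    """Fires on combinations that matter for the SLO, not one raw metric:
    errors AND latency together, OR saturation alone. Thresholds are
    illustrative placeholders, not tuned values."""
    return (error_rate > 0.05 and p99_latency_ms > 500) or saturation > 0.9
```

A spike in error rate with healthy latency does not fire; sustained saturation does, even with zero errors, because it predicts SLO-relevant failure.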
The setup is one-time per service. The ongoing maintenance is tuning the thresholds based on observed behavior. After a few cycles, the breaker fires when it should and stays closed when it should not.
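The per-dependency tuning described above can be captured as plain configuration. The schema below is an assumption for illustration, not a specific library's format; the point is that the high-budget dependency gets the tighter breaker and each dependency gets its own deliberate fallback.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class BreakerConfig:
    """Per-dependency breaker settings. Names and numbers are illustrative."""
    burn_rate_threshold: float   # multiples of normal budget consumption
    window_seconds: int          # how long the threshold must hold
    fallback: Callable[[], Any]  # what the caller gets while the breaker is open

# The recommendation backend carries more of the SLO budget, so its
# breaker fires at a lower burn rate over a shorter window than a
# minor dependency's.
BREAKERS = {
    "recommendations": BreakerConfig(
        burn_rate_threshold=10.0, window_seconds=60,
        fallback=list,            # search ships with an empty recommendation list
    ),
    "spell-check": BreakerConfig(
        burn_rate_threshold=20.0, window_seconds=300,
        fallback=lambda: None,    # the suggestion is simply omitted
    ),
}
```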
Test
Circuit breakers that have not fired in production are circuit breakers you cannot trust. The discipline that keeps them ready is testing them deliberately and regularly, before a real incident requires them to work.
- Quarterly chaos engineering.: Each quarter, the team injects failures that should trigger the breakers. Latency injection, error injection, full dependency outage. The breakers should fire within the configured window and the service should degrade gracefully without breaching its SLO.
- Breakers fire under stress.: Verify that the breaker actually opens at the configured threshold. A breaker tuned too loosely will not fire even under real failure; a breaker tuned too tightly will fire under normal load. The chaos test calibrates both directions.
- Verify recovery.: After the injection ends and the dependency is healthy again, verify the breaker closes within the recovery window. Stuck-open breakers are a real failure mode; they keep the service degraded after the underlying issue is fixed. The test catches them.
- Verify fallback behavior.: When the breaker is open, confirm the fallback behaves as documented. Default response is correct. Cached value is reasonable. Errors are graceful. The fallback path is exercised rarely in production; the test is the reliable way to verify it works.
- Test with the real SLO signal.: The chaos test feeds real signal into the SLO calculation, not synthetic data. The burn rate that triggers the breaker comes from the same metric pipeline production uses. This is what makes the test meaningful: production behavior, not test-fixture behavior.
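The drill above can be sketched as a replay harness: feed a burn-rate timeline (baseline, injected failure, recovery) through the breaker and assert that it opens within the configured window and closes within the recovery window. The breaker here is a minimal stand-in with illustrative thresholds; a real drill would drive the production metric pipeline instead.

```python
def make_breaker(open_at=14.0, close_at=1.0, open_hold=60, close_hold=300):
    """Minimal burn-rate breaker used as the system under test."""
    s = {"state": "closed", "since": None}
    def observe(burn, now):
        if s["state"] == "closed":
            if burn > open_at:
                s["since"] = now if s["since"] is None else s["since"]
                if now - s["since"] >= open_hold:
                    s["state"], s["since"] = "open", None
            else:
                s["since"] = None
        else:
            if burn < close_at:
                s["since"] = now if s["since"] is None else s["since"]
                if now - s["since"] >= close_hold:
                    s["state"], s["since"] = "closed", None
            else:
                s["since"] = None
        return s["state"]
    return observe

def run_chaos_drill(observe, timeline):
    """Replay (time, burn_rate) samples; return the state transitions seen."""
    transitions, last = [], None
    for now, burn in timeline:
        state = observe(burn, now)
        if state != last:
            transitions.append((now, state))
            last = state
    return transitions

# Drill: healthy baseline, error injection, then recovery. The breaker
# should open once during injection and close once after recovery.
timeline = (
    [(t, 0.5) for t in range(0, 60, 10)] +      # baseline burn
    [(t, 20.0) for t in range(60, 180, 10)] +   # injected failure
    [(t, 0.5) for t in range(180, 600, 10)]     # dependency healthy again
)
transitions = run_chaos_drill(make_breaker(), timeline)
```

The assertions a drill would make: exactly one open, exactly one close, each within its configured window; anything else means the thresholds need retuning or the breaker is stuck.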
SLO-aware circuit breakers are one of the highest-leverage availability patterns. They turn dependency failures from SLO-burning incidents into self-mitigated degradation events. Nova AI Ops integrates with circuit breaker telemetry, watches the breaker state alongside SLO burn rate, and runs scheduled chaos exercises to verify the breakers fire correctly so the team's confidence in their availability story is grounded in test evidence.