Degradation vs Failure Alerts

Distinguish degradation from full failure: the two call for different responses.

The distinction

Failure and degradation are different states with different signals, thresholds, and runbooks. Failure means the service is not working: hard 5xx, complete outage, total unavailability, sev 1. Degradation means the service is working but slow or partially broken, often sev 2. One alert cannot catch both.

Different signals

Failure signals are absolute: a sustained 5xx error rate above 50%, health-check failures, or no successful requests for 1 minute. Degradation signals are relative: p99 latency above target for 5 minutes, error rate above 1%, or queue depth above threshold. Either state can mask the other unless both sets of signals are watched.
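A minimal sketch of how the two signal sets could be evaluated side by side. The ServiceSnapshot fields, the classify function, and QUEUE_DEPTH_LIMIT are hypothetical names chosen for illustration. The thresholds come from the text above, except the queue depth limit, which is an assumed placeholder. The sketch also assumes the error rates are already computed over a sustained window, since the text does not say how long "sustained" is.

```python
from dataclasses import dataclass

# Thresholds taken from the text; QUEUE_DEPTH_LIMIT is an assumed placeholder.
FAILURE_5XX_RATE = 0.50          # sustained 5xx error rate above 50%
FAILURE_NO_SUCCESS_SECONDS = 60  # no successful requests for 1 minute
DEGRADED_ERROR_RATE = 0.01       # error rate above 1%
DEGRADED_P99_SECONDS = 300       # p99 above target for 5 minutes
QUEUE_DEPTH_LIMIT = 1000         # hypothetical; set per service


@dataclass
class ServiceSnapshot:
    """Hypothetical point-in-time view of one service."""
    error_rate_5xx: float              # fraction 0.0-1.0 over a sustained window
    error_rate: float                  # overall error fraction 0.0-1.0
    health_check_ok: bool
    seconds_since_last_success: float
    p99_over_target_seconds: float     # how long p99 has exceeded its target
    queue_depth: int


def classify(s: ServiceSnapshot) -> str:
    """Return "failure", "degradation", or "ok"."""
    # Failure is checked first so a degradation signal cannot mask an outage.
    if (
        s.error_rate_5xx > FAILURE_5XX_RATE
        or not s.health_check_ok
        or s.seconds_since_last_success >= FAILURE_NO_SUCCESS_SECONDS
    ):
        return "failure"
    if (
        s.p99_over_target_seconds >= DEGRADED_P99_SECONDS
        or s.error_rate > DEGRADED_ERROR_RATE
        or s.queue_depth > QUEUE_DEPTH_LIMIT
    ):
        return "degradation"
    return "ok"
```

Evaluating failure before degradation is one design choice among several; the point is that both sets are computed from the same snapshot, so neither is silently skipped.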

Different responses

The response posture differs. Failure is all hands: an incident commander, a war room, and customer comms within 15 minutes. Degradation: the on-call investigates, full incident response may not be needed, and comms may be informational only. The triage decision has real staffing and customer-comms implications.

Monitoring discipline

The discipline is two alert sets per service: one for failure, one for degradation, with different routing and different urgency. Failure alerts page immediately; degradation alerts page during business hours and notify off-hours. Each alert has its own runbook so the on-call knows what to do without looking it up.
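The routing rule can be sketched as a small function. The names route and is_business_hours are hypothetical, and the business-hours window (Monday to Friday, 09:00 to 17:00 local time) is an assumption; the text does not define it.

```python
from datetime import datetime

# Assumed business-hours window; adjust to the team's actual schedule.
BUSINESS_START_HOUR = 9
BUSINESS_END_HOUR = 17


def is_business_hours(now: datetime) -> bool:
    """True on Monday-Friday between the assumed start and end hours."""
    return now.weekday() < 5 and BUSINESS_START_HOUR <= now.hour < BUSINESS_END_HOUR


def route(alert_class: str, now: datetime) -> str:
    """Return "page" or "notify" for an alert of the given class."""
    if alert_class == "failure":
        # Failure alerts page immediately, regardless of time.
        return "page"
    if alert_class == "degradation":
        # Degradation pages during business hours and only notifies off-hours.
        return "page" if is_business_hours(now) else "notify"
    raise ValueError(f"unknown alert class: {alert_class!r}")
```

For example, route("degradation", datetime(2024, 1, 6, 10, 0)) falls on a Saturday and returns "notify", while the same call with "failure" returns "page".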

Quarterly review

The quarterly review keeps the classification sharp. Are degradation alerts firing for things that should be failures? Re-tune. Are failure alerts firing for what’s actually degradation? Re-classify. Customer-impact correlation surfaces the gap between alert behaviour and customer experience.
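One way the review questions could be tallied from alert history, as a sketch only. It assumes each past alert has been labelled after the fact with the customer impact actually observed ("outage", "partial", or "none"); the review function, that labelling scheme, and the finding names are all hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple


def review(alerts: Iterable[Tuple[str, str]]) -> Counter:
    """Tally (alert_class, observed_impact) pairs into review findings.

    alert_class is "failure" or "degradation"; observed_impact is
    "outage", "partial", or "none" (assumed post-incident labels).
    """
    findings: Counter = Counter()
    for alert_class, impact in alerts:
        if alert_class == "degradation" and impact == "outage":
            # A degradation alert fired for what was really a failure: re-tune.
            findings["retune_to_failure"] += 1
        elif alert_class == "failure" and impact == "partial":
            # A failure alert fired for what was really degradation: re-classify.
            findings["reclassify_as_degradation"] += 1
        elif impact == "none":
            # The alert fired but customers saw nothing.
            findings["no_customer_impact"] += 1
        else:
            findings["consistent"] += 1
    return findings
```

The non-"consistent" counts are the gap between alert behaviour and customer experience that the review is meant to surface.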