Degradation vs Failure Alerts

Distinguish degradation from full failure; each calls for a different response.

The distinction

Failure and degradation are different states with different signals, thresholds, and runbooks. Failure means the service is not working: hard 5xx, complete outage, total unavailability, sev 1. Degradation means the service is working but slow or partially broken, often sev 2. One alert cannot catch both.

Failure. Service not working; hard 5xx errors, complete outage, total unavailability; sev 1.
Degradation. Service working but slow or partially broken; latency above target, increased error rate, some features unavailable; often sev 2.
Different alerts required. One alert cannot catch both; different signals, thresholds, runbooks.
Per-state response posture. Failure triggers war-room response; degradation triggers investigation; conflating them produces the wrong response.

Different signals

Failure signals are absolute: 5xx error rate above 50% sustained, health-check failures, no successful requests for 1 minute. Degradation signals are measured against a target or threshold: latency p99 above target for 5 minutes, error rate above 1%, queue depth above threshold. If only one set is watched, a degraded service can slide into failure unnoticed.

Failure signals. 5xx above 50% sustained, health-check failures, no successful requests for 1 minute.
Degradation signals. Latency p99 above target for 5 minutes, error rate above 1%, queue depth above threshold.
Masking risk. A degraded service can silently become a failed service if only one set of signals is watched.
Per-signal threshold tuning. Thresholds are calibrated per service so that each alert fires neither too often nor too rarely.
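
The two signal sets above can be sketched as a single classifier that checks failure conditions first and degradation conditions second. This is a minimal illustration, not a real monitoring API: the names Snapshot, Thresholds, and classify are hypothetical, the snapshot is assumed to be already aggregated over a sustained window, and the queue-depth limit is a placeholder to be calibrated per service.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    OK = "ok"
    DEGRADED = "degraded"
    FAILED = "failed"


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical pre-aggregated view of one service's signals."""
    error_rate_5xx: float              # fraction of requests returning 5xx (0.0 to 1.0)
    health_check_passing: bool
    seconds_since_last_success: float
    seconds_p99_over_target: float     # how long p99 latency has exceeded its target
    queue_depth: int


@dataclass(frozen=True)
class Thresholds:
    failure_error_rate: float = 0.50          # 5xx above 50%
    failure_no_success_seconds: float = 60.0  # no successful requests for 1 minute
    degraded_error_rate: float = 0.01         # error rate above 1%
    degraded_latency_seconds: float = 300.0   # p99 above target for 5 minutes
    queue_depth_limit: int = 1000             # assumed value; calibrate per service


def classify(s: Snapshot, t: Thresholds = Thresholds()) -> State:
    # Failure signals are checked first so that a failed service
    # is never reported as merely degraded.
    if (
        s.error_rate_5xx > t.failure_error_rate
        or not s.health_check_passing
        or s.seconds_since_last_success >= t.failure_no_success_seconds
    ):
        return State.FAILED
    if (
        s.error_rate_5xx > t.degraded_error_rate
        or s.seconds_p99_over_target >= t.degraded_latency_seconds
        or s.queue_depth > t.queue_depth_limit
    ):
        return State.DEGRADED
    return State.OK
```

Evaluating both sets on every snapshot is what addresses the masking risk: a service that moves from a 2% error rate to failing health checks changes from DEGRADED to FAILED on the next evaluation.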

Different responses

The response posture differs. Failure: incident commander, war room, customer comms within 15 minutes; all hands. Degradation: on-call investigates, may not need full incident response, comms may be informational only. The triage decision has real staffing and customer-comms implications.

Failure response. Incident commander, war room, customer comms within 15 minutes; all hands.
Degradation response. On-call investigates; may not need full incident response; comms informational only.
Triage matters. Sev 1 vs sev 2 has real implications for staffing and customer comms.
Per-response named owner. The IC is named for failure and the investigator for degradation, so accountability is clear in both cases.
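
The response postures above could be written down as a lookup table so that triage yields a concrete plan. The sketch below is illustrative only: ResponsePlan, RESPONSE_PLANS, and plan_for are hypothetical names, and degradation is mapped to sev 2 as the common case rather than a fixed rule.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ResponsePlan:
    severity: str
    owner_role: str                        # the named owner for this response
    war_room: bool
    comms_kind: str
    comms_deadline_minutes: Optional[int]  # None means no fixed deadline


# Hypothetical table mirroring the two postures described above.
RESPONSE_PLANS = {
    "failure": ResponsePlan(
        severity="sev1",
        owner_role="incident commander",
        war_room=True,
        comms_kind="customer comms",
        comms_deadline_minutes=15,
    ),
    "degradation": ResponsePlan(
        severity="sev2",  # "often sev 2"; adjust per incident
        owner_role="on-call investigator",
        war_room=False,
        comms_kind="informational",
        comms_deadline_minutes=None,
    ),
}


def plan_for(state: str) -> ResponsePlan:
    """Return the response plan for 'failure' or 'degradation'."""
    return RESPONSE_PLANS[state]
```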

Monitoring discipline

Two alert sets per service is the discipline: one for failure, one for degradation, with different routing and different urgency. Failure alerts page immediately; degradation alerts page during business hours and raise a dashboard alert off-hours. Each alert has its own runbook so the on-call knows what to do without lookup.

Two alert sets per service. One for failure, one for degradation; different routing, different urgency.
Failure pages immediately. No business-hours filter; the response cannot wait.
Degradation differentiated. Page during business hours, dashboard alert off-hours; matches the response urgency.
Per-alert runbook. The on-call sees the alert, knows whether it’s failure or degradation, knows what to do.
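
The routing rule above can be sketched as a small function. The business-hours window (Monday to Friday, 09:00 to 17:00 in the service's local time) is an assumption for illustration, as are the names in_business_hours and route and the "page" / "dashboard" return values; a real setup would express this in the paging tool's own routing configuration.

```python
from datetime import datetime

# Assumed business-hours window; adjust to the team's actual schedule.
BUSINESS_START_HOUR = 9
BUSINESS_END_HOUR = 17


def in_business_hours(now: datetime) -> bool:
    """True on Monday to Friday between the start and end hours."""
    return now.weekday() < 5 and BUSINESS_START_HOUR <= now.hour < BUSINESS_END_HOUR


def route(alert_kind: str, now: datetime) -> str:
    """Return 'page' or 'dashboard' for a 'failure' or 'degradation' alert."""
    if alert_kind == "failure":
        # No business-hours filter: failure always pages.
        return "page"
    if alert_kind == "degradation":
        return "page" if in_business_hours(now) else "dashboard"
    raise ValueError(f"unknown alert kind: {alert_kind!r}")
```

For example, route("degradation", datetime(2024, 1, 6, 10, 0)) falls on a Saturday and so goes to the dashboard, while the same call with "failure" pages.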

Quarterly review

The quarterly review keeps the classification sharp. Are degradation alerts firing for things that should be failures? Re-tune. Are failure alerts firing for what’s actually degradation? Re-classify. Customer-impact correlation surfaces the gap between alert behaviour and customer experience.

Up-classify mistakes. Degradation alerts firing for failures get re-tuned; the response was too soft.
Down-classify mistakes. Failure alerts firing for degradation get re-classified; the response was too loud.
Customer impact correlation. Degradations that caused complaints without any alert firing; fix the gap.
Per-quarter classification audit. Each service’s alert set is reviewed every quarter to keep thresholds and classifications calibrated.
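
A quarterly audit along these lines could start from a simple tally of how each reviewed incident was alerted versus what it turned out to be. The sketch below assumes a hypothetical Incident record filled in during review; the field names and the bucket labels (too_soft, too_loud, missed, correct) are illustrative, not taken from any tool.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Incident:
    service: str
    alerted_as: Optional[str]  # "failure", "degradation", or None if no alert fired
    actual: str                # "failure" or "degradation", as judged in review


def audit(incidents: Iterable[Incident]) -> Counter:
    """Tally classification outcomes for one quarter's incidents."""
    counts: Counter = Counter()
    for inc in incidents:
        if inc.alerted_as is None:
            # Customers were affected but nothing fired: a coverage gap.
            counts["missed"] += 1
        elif inc.alerted_as == "degradation" and inc.actual == "failure":
            counts["too_soft"] += 1   # degradation alert fired for a failure; re-tune
        elif inc.alerted_as == "failure" and inc.actual == "degradation":
            counts["too_loud"] += 1   # failure alert fired for degradation; re-classify
        else:
            counts["correct"] += 1
    return counts
```

Running the tally per service, rather than across the whole fleet, keeps the results tied to the alert set that needs re-tuning.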