By Samson Tanimawo, PhD. Published Dec 5, 2025.

Degradation vs Failure Alerts

Degradation and full failure are different conditions, and they call for different responses.

The distinction

Failure: the service is not working. Hard 5xx errors, complete outage, total unavailability. Sev 1.

Degradation: the service is working but slow or partially broken. Latency above target, increased error rate, some features unavailable. Often sev 2.

A single alert can't catch both. Each needs its own signals, thresholds, and runbook.

Different signals

Failure signals: 5xx error rate above 50% sustained, health check failures, no successful requests for 1 minute.

Degradation signals: p99 latency above target for 5 minutes, error rate above 1%, queue depth above threshold.

Watch for one masking the other. A degraded service can become a failed service silently.
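
As a rough illustration of the signals above, here is a minimal Python sketch of a classifier. The `ServiceSnapshot` type, its field names, and the `QUEUE_DEPTH_LIMIT` value are hypothetical, and it assumes the error rate has already been averaged over whatever window you treat as "sustained". Failure conditions are checked first so that a degradation match never hides a failure.

```python
from dataclasses import dataclass

# Hypothetical threshold; the article only says "above threshold".
QUEUE_DEPTH_LIMIT = 1000


@dataclass
class ServiceSnapshot:
    """Hypothetical, pre-aggregated view of one service's health."""
    error_rate: float                  # fraction of failed requests, 0.0 to 1.0
    health_check_ok: bool              # result of the latest health check
    seconds_since_last_success: float  # time since the last successful request
    p99_over_target_seconds: float     # how long p99 latency has exceeded target
    queue_depth: int                   # current queue length


def classify(s: ServiceSnapshot) -> str:
    """Return "failure", "degradation", or "healthy"."""
    # Failure first: a failing service usually also looks degraded.
    if (
        s.error_rate > 0.50
        or not s.health_check_ok
        or s.seconds_since_last_success >= 60
    ):
        return "failure"
    if (
        s.p99_over_target_seconds >= 300
        or s.error_rate > 0.01
        or s.queue_depth > QUEUE_DEPTH_LIMIT
    ):
        return "degradation"
    return "healthy"
```

Evaluating both rule sets on every snapshot, rather than only while a degradation alert is quiet, is one way to notice a degraded service crossing into failure.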

Different responses

Failure: incident commander, war room, customer comms within 15 minutes. All hands.

Degradation: on-call investigates; may not need full incident response. Comms may be informational only.

Triage matters. Sev 1 vs sev 2 has real implications for staffing and customer comms.

Monitoring discipline

Two alert sets per service: one for failure, one for degradation. Different routing, different urgency.

Failure alerts: immediate page, at any hour. Degradation alerts: page during business hours; off-hours, raise a dashboard alert only.

Every alert links to its own runbook. The on-call engineer sees the alert, knows whether it is a failure or a degradation, and knows what to do.
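
The routing rule can be sketched in a few lines of Python. The `route` function, the 09:00 to 17:00 weekday definition of business hours, and the `"page"` / `"dashboard"` labels are assumptions for illustration; substitute your own schedule and paging integration.

```python
from datetime import datetime, time

# Assumed business hours; adjust to your team's schedule and time zone.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(17, 0)


def route(alert_class: str, now: datetime) -> str:
    """Return "page" or "dashboard" for an alert of the given class."""
    if alert_class == "failure":
        # Failures page immediately, regardless of the clock.
        return "page"
    if alert_class == "degradation":
        is_weekday = now.weekday() < 5  # Monday is 0, Sunday is 6
        in_hours = BUSINESS_START <= now.time() < BUSINESS_END
        return "page" if (is_weekday and in_hours) else "dashboard"
    raise ValueError(f"unknown alert class: {alert_class!r}")
```

Raising on an unknown class, instead of defaulting to the dashboard, keeps a mislabeled alert from being quietly downgraded.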

Quarterly review

Are degradation alerts firing for things that should be failures? Re-tune.

Are failure alerts firing for what's actually degradation? Re-classify.

Customer impact correlation: look for degradations that caused customer complaints but never fired an alert. Fix the gap.
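
One way to approach the correlation check, sketched in Python: take the timestamps of customer complaints and of fired alerts over the quarter, and list the complaints with no alert nearby. The `uncovered_complaints` helper and the 30-minute matching window are hypothetical, and real complaint data will need cleaning before a comparison like this means much.

```python
from datetime import datetime, timedelta


def uncovered_complaints(
    complaints: list[datetime],
    alerts: list[datetime],
    window: timedelta = timedelta(minutes=30),  # assumed matching window
) -> list[datetime]:
    """Return complaint times with no alert fired within `window` of them."""
    return [
        c for c in complaints
        if not any(abs(c - a) <= window for a in alerts)
    ]
```

Each timestamp it returns is a candidate gap: customers noticed something that no alert caught.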