Database Failure Modes and Detection
Database failures cluster into a small number of recognizable patterns. Recognizing the pattern is half the fix.
Why failures cluster
Database failures look infinite until you classify them. In practice, most production incidents fall into a small number of recognizable patterns.
- Pattern count. Six common modes cover the majority of database incidents; the long tail is rare.
- Recognition speed. Naming the pattern routes investigation in seconds, not hours.
- Per-pattern runbook. One runbook per pattern; on-call jumps to the right one without re-investigation.
- Training value. New engineers learn the patterns once; the knowledge transfers across databases and incidents.
Six common modes
- 1. Corruption.
- 2. Replication break.
- 3. Runaway transaction.
- 4. Vacuum stall (Postgres).
- 5. OOM.
- 6. Slow queries cascading.
Per-mode symptoms
Each failure mode has a distinctive signature in metrics and logs. Knowing the symptoms narrows investigation immediately.
- Corruption. Read errors, checksum failures, specific pages unreadable; rare but devastating.
- Replication break. Replication lag grows without bound; downstream readers serve stale data.
- Runaway transaction. Locks held forever, table bloat; pg_stat_activity shows a long-running query.
- Vacuum stall, OOM, cascade. Dead tuples accumulate; OOM-kill loops; query queue grows and latency climbs, respectively.
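As one illustration of turning a symptom into a check, the runaway-transaction signature can be sketched in Python. This is a minimal sketch, not a definitive implementation: the function name find_runaway_transactions and the 15-minute threshold are assumptions chosen for illustration, and the rows are assumed to have been fetched as dictionaries by whatever Postgres driver you use. The columns pid, state, xact_start and query are standard pg_stat_activity columns.

```python
from datetime import datetime, timedelta, timezone

# Query to run against Postgres. Backends with no open transaction have a
# NULL xact_start, so they are filtered out here.
LONG_XACT_SQL = """
SELECT pid, state, xact_start, query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
"""


def find_runaway_transactions(rows, now, max_age=timedelta(minutes=15)):
    """Return (pid, age, state) for transactions open longer than max_age.

    `rows` is an iterable of dicts with the columns selected above.
    The 15-minute default is an illustrative assumption; tune it per workload.
    Results are sorted oldest first, since the oldest transaction is usually
    the one holding locks and blocking cleanup.
    """
    runaway = []
    for row in rows:
        age = now - row["xact_start"]
        if age > max_age:
            runaway.append((row["pid"], age, row["state"]))
    runaway.sort(key=lambda item: item[1], reverse=True)
    return runaway


if __name__ == "__main__":
    # Hand-written sample rows standing in for a real query result.
    now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    sample = [
        {"pid": 101, "state": "idle in transaction",
         "xact_start": now - timedelta(hours=2), "query": "UPDATE ..."},
        {"pid": 102, "state": "active",
         "xact_start": now - timedelta(seconds=5), "query": "SELECT 1"},
    ]
    for pid, age, state in find_runaway_transactions(sample, now):
        print(pid, age, state)
```

Keeping the age comparison in plain Python rather than in the SQL makes the threshold easy to unit-test without a live database; pushing the filter into the WHERE clause is an equally reasonable choice.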
Auto-detection
Each pattern has a unique metric signature. Auto-detection with pattern matching beats human pattern-matching during an incident.
- Per-pattern signature. Each failure mode has a unique combination of metrics; encode it once, detect forever.
- Tooling. PMM, Datadog Database Monitoring, and native cloud database monitoring increasingly include pattern detection.
- Custom detection. For unusual patterns, custom Prometheus alerts on the metric combination work just as well.
- Routing. A detected pattern triggers the right runbook automatically; on-call gets context, not a raw alert.
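The signature-then-route idea above can be sketched as a small table of predicates. Everything specific here is an assumption for illustration: the metric names (replication_lag_seconds, dead_tuple_ratio and so on), the thresholds, and the runbook paths are hypothetical and would need to match whatever your monitoring stack actually exports.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Metrics = Dict[str, float]


@dataclass(frozen=True)
class Signature:
    mode: str                           # failure mode name
    runbook: str                        # hypothetical runbook path
    matches: Callable[[Metrics], bool]  # predicate over a metrics snapshot


# One entry per failure mode. Metric names and thresholds are illustrative.
SIGNATURES: List[Signature] = [
    Signature("corruption", "runbooks/corruption.md",
              lambda m: m.get("checksum_failures", 0) > 0),
    Signature("replication_break", "runbooks/replication-break.md",
              lambda m: m.get("replication_lag_seconds", 0) > 300),
    Signature("runaway_transaction", "runbooks/runaway-transaction.md",
              lambda m: m.get("oldest_xact_age_seconds", 0) > 900),
    Signature("vacuum_stall", "runbooks/vacuum-stall.md",
              lambda m: m.get("dead_tuple_ratio", 0) > 0.2),
    Signature("oom", "runbooks/oom.md",
              lambda m: m.get("oom_kills_last_hour", 0) > 0),
    Signature("slow_query_cascade", "runbooks/slow-query-cascade.md",
              lambda m: m.get("queued_queries", 0) > 50
              and m.get("p99_latency_ms", 0) > 1000),
]


def detect(metrics: Metrics) -> List[Signature]:
    """Return every signature whose predicate matches the snapshot."""
    return [s for s in SIGNATURES if s.matches(metrics)]


def build_alerts(metrics: Metrics) -> List[Dict[str, Optional[object]]]:
    """Turn a metrics snapshot into alerts that carry the mode and runbook.

    If nothing matches, a single 'unclassified' alert is returned so the
    incident is still surfaced, just without a runbook attached.
    """
    matched = detect(metrics)
    if not matched:
        return [{"mode": "unclassified", "runbook": None, "metrics": metrics}]
    return [{"mode": s.mode, "runbook": s.runbook, "metrics": metrics}
            for s in matched]


if __name__ == "__main__":
    for alert in build_alerts({"replication_lag_seconds": 1200.0}):
        print(alert["mode"], alert["runbook"])
```

Missing metrics default to zero, so a partial snapshot never raises; the trade-off is that a broken exporter looks healthy, which is worth a separate staleness check.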
Antipatterns
- Diagnosing each from scratch. Slow.
- A single ‘DB unhealthy’ alert. Hides which pattern fired.
- No runbook per pattern. Re-investigate every time.
What to do this week
Three moves. (1) Apply this pattern to your most-loaded table. (2) Measure query latency and write throughput before and after. (3) Document the win and the constraint so the next refactor inherits the knowledge.