Database Failure Modes and Detection

Database failures cluster into a small number of recognizable patterns. Recognizing the pattern is half the fix.

Why failures cluster

Database failures look endlessly varied until you classify them. In practice, most production incidents fall into one of a few recognizable patterns.

Pattern count. Six common modes cover the majority of database incidents; the long tail is rare.
Recognition speed. Naming the pattern routes investigation in seconds, not hours.
Per-pattern runbook. One runbook per pattern; on-call jumps to the right one without re-investigation.
Training value. New engineers learn the patterns once, and the knowledge transfers across databases and incidents.

Six common modes

1. Corruption.
2. Replication break.
3. Runaway transaction.
4. Vacuum stall (Postgres).
5. OOM.
6. Slow queries cascading.

Per-mode symptoms

Each failure mode has a distinctive signature in metrics and logs. Knowing the symptoms narrows investigation immediately.

Corruption. Read errors, checksum failures, or specific pages that cannot be read; rare but devastating.
Replication break. Replication lag grows without bound instead of recovering; downstream readers serve stale data.
Runaway transaction. Locks held far longer than normal, plus table bloat; pg_stat_activity shows a long-running query or transaction.
Vacuum stall. Dead tuples accumulate.
OOM. OOM-kill loops.
Slow queries cascading. The query queue grows and latency climbs.
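
As one illustration of checking a symptom, the runaway-transaction signature can be sketched in Python. This is a minimal sketch under stated assumptions, not a definitive check: the function name find_runaway and the 15-minute threshold are hypothetical, and the rows are assumed to arrive as dicts with timezone-aware timestamps (for example from a dict-style cursor). The query itself uses only standard pg_stat_activity columns.

```python
from datetime import datetime, timedelta, timezone

# Standard pg_stat_activity columns; run this with any Postgres driver.
LONG_TXN_SQL = """
SELECT pid, state, xact_start, query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
  AND pid <> pg_backend_pid()
"""


def find_runaway(rows, now, max_age=timedelta(minutes=15)):
    """Return (pid, age) pairs for transactions open longer than max_age.

    rows: iterable of dicts with at least "pid" and "xact_start".
    now: timezone-aware datetime to measure against.
    max_age: hypothetical threshold; tune it to the workload.
    """
    flagged = []
    for row in rows:
        age = now - row["xact_start"]
        if age > max_age:
            flagged.append((row["pid"], age))
    # Oldest transaction first, since it is the likeliest culprit.
    return sorted(flagged, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    sample = [
        {"pid": 101, "xact_start": now - timedelta(minutes=2)},
        {"pid": 202, "xact_start": now - timedelta(hours=3)},
    ]
    for pid, age in find_runaway(sample, now):
        print(f"pid {pid} has held a transaction open for {age}")
```

Keeping the threshold logic in a pure function means it can be tested without a live database; only the query needs a connection.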

Auto-detection

Each pattern has a unique metric signature. During an incident, automated matching on those signatures is faster and more consistent than a human spotting the pattern by eye.

Per-pattern signature. Each failure mode has a unique combination of metrics; encode it once, detect forever.
Tooling. PMM, Datadog Database Monitoring, and the native monitoring of cloud database services increasingly include pattern detection.
Custom detection. For unusual patterns, custom Prometheus alerts on the metric combination can work just as well.
Routing. A detected pattern triggers the right runbook automatically; the on-call engineer gets context, not a raw alert.
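
The signature-and-routing idea can be sketched as a small rule table in Python. Everything specific here is an assumption made for illustration: the metric names, the thresholds, and the runbook paths are hypothetical and would need to be replaced with whatever your monitoring stack actually exports.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Metrics = Dict[str, float]


@dataclass(frozen=True)
class Signature:
    mode: str                           # failure mode name
    runbook: str                        # hypothetical runbook path
    matches: Callable[[Metrics], bool]  # predicate over current metrics


# List order doubles as priority: the first match becomes the primary mode.
# Metric names and thresholds are illustrative assumptions.
SIGNATURES: List[Signature] = [
    Signature("corruption", "runbooks/corruption.txt",
              lambda m: m.get("checksum_failures", 0) > 0),
    Signature("replication_break", "runbooks/replication.txt",
              lambda m: m.get("replication_lag_seconds", 0) > 300),
    Signature("runaway_transaction", "runbooks/runaway_txn.txt",
              lambda m: m.get("oldest_xact_age_seconds", 0) > 900),
    Signature("vacuum_stall", "runbooks/vacuum.txt",
              lambda m: m.get("dead_tuple_ratio", 0) > 0.2),
    Signature("oom", "runbooks/oom.txt",
              lambda m: m.get("oom_kills", 0) > 0),
    Signature("slow_query_cascade", "runbooks/cascade.txt",
              lambda m: m.get("queue_depth", 0) > 50
              and m.get("p99_latency_ms", 0) > 1000),
]


def route(metrics: Metrics) -> Optional[dict]:
    """Return the primary matched mode with its runbook, or None if nothing matches."""
    hits = [s for s in SIGNATURES if s.matches(metrics)]
    if not hits:
        return None
    primary = hits[0]
    return {
        "mode": primary.mode,
        "runbook": primary.runbook,
        "also_matched": [s.mode for s in hits[1:]],
        "metrics": dict(metrics),  # context handed to on-call with the alert
    }


if __name__ == "__main__":
    print(route({"replication_lag_seconds": 900.0}))
```

Returning the other matched modes alongside the primary one matters because failure modes can overlap; an OOM-kill loop, for instance, may also produce a growing query queue.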

Antipatterns

Diagnosing each incident from scratch. Slow.
A single ‘DB unhealthy’ alert. It says something is wrong but not which pattern.
No runbook per pattern. The team re-investigates every time.

What to do this week

Three moves. (1) Apply this classification to your most-loaded database: write down the signature for each of the six modes. (2) Measure time to diagnosis on incidents before and after. (3) Document the win and the constraint so the next on-call inherits the knowledge.