Database Failure Modes and Detection
Database failures cluster into a small number of recognizable patterns. Recognizing the pattern is half the fix.
Why failures cluster
Most database incidents are one of six patterns.
Recognizing the pattern routes investigation in seconds, not hours.
Six common modes
1. Corruption.
2. Replication break.
3. Runaway transaction.
4. Vacuum stall (Postgres).
5. OOM.
6. Slow queries cascading.
Per-mode symptoms
Corruption: read errors and checksum failures.
Replication break: replica lag spikes and keeps growing without bound.
Runaway transaction: locks held indefinitely; table bloat.
Vacuum stall: dead tuples accumulate.
OOM: the database process is killed and enters a restart loop.
Slow queries cascading: the query queue grows and latency climbs.
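As a rough illustration of how these symptoms can be looked up, here is a minimal sketch that assumes PostgreSQL and any DB-API cursor. The view and column names come from the standard statistics views. The mode keys, the LIMIT values, and the helper name check_symptom are assumptions made for this example. OOM has no entry because the kill is visible in operating-system logs rather than in SQL.

```python
# Hypothetical mapping from failure mode to a PostgreSQL query that surfaces
# its symptom. The selection of columns and the LIMIT values are illustrative.
SYMPTOM_QUERIES = {
    # checksum_failures exists in PostgreSQL 12+ and stays NULL unless
    # data checksums are enabled on the cluster.
    "corruption": """
        SELECT datname, checksum_failures
        FROM pg_stat_database
        WHERE checksum_failures > 0
    """,
    # Run on the primary; replay_lag exists in PostgreSQL 10+.
    # A replica that is missing from the result is itself a warning sign.
    "replication_break": """
        SELECT application_name, state, replay_lag
        FROM pg_stat_replication
    """,
    # Oldest open transactions first.
    "runaway_transaction": """
        SELECT pid, state, now() - xact_start AS xact_age
        FROM pg_stat_activity
        WHERE xact_start IS NOT NULL
        ORDER BY xact_start
        LIMIT 5
    """,
    # Tables with the most dead tuples and when autovacuum last ran on them.
    "vacuum_stall": """
        SELECT relname, n_dead_tup, last_autovacuum
        FROM pg_stat_user_tables
        ORDER BY n_dead_tup DESC
        LIMIT 5
    """,
    # Rough proxy for a cascade: sessions currently waiting on a lock.
    "slow_query_cascade": """
        SELECT count(*) AS waiting
        FROM pg_stat_activity
        WHERE wait_event_type = 'Lock'
    """,
}


def check_symptom(cursor, mode):
    """Run the query for one mode on a DB-API cursor and return all rows."""
    cursor.execute(SYMPTOM_QUERIES[mode])
    return cursor.fetchall()
```

With a driver such as psycopg, check_symptom(conn.cursor(), "vacuum_stall") would return the tables carrying the most dead tuples. What counts as too many depends on the workload and is left to the caller.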
Auto-detection
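Each failure mode leaves a distinctive signature in metrics, so it can be detected automatically by matching current metrics against those signatures.
Tools such as PMM, Datadog Database Monitoring, and cloud providers' native monitoring increasingly include this.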
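Pattern matching on metrics can be sketched as one rule per failure mode, evaluated against a metrics snapshot. This is a minimal illustration, not how any of the tools above work internally. The Snapshot fields, the rule names, and every threshold are hypothetical and would need tuning against your own baseline.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """One sample of database metrics. All field names are hypothetical."""
    checksum_failures: int = 0
    replication_lag_s: float = 0.0
    oldest_txn_age_s: float = 0.0
    dead_tuple_ratio: float = 0.0
    oom_kills: int = 0
    queued_queries: int = 0
    p99_latency_ms: float = 0.0


# One rule per failure mode. Thresholds are illustrative assumptions.
RULES = {
    "corruption": lambda s: s.checksum_failures > 0,
    "replication_break": lambda s: s.replication_lag_s > 300,
    "runaway_transaction": lambda s: s.oldest_txn_age_s > 3600,
    "vacuum_stall": lambda s: s.dead_tuple_ratio > 0.2,
    "oom": lambda s: s.oom_kills > 0,
    # A cascade needs both signals: a growing queue and climbing latency.
    "slow_query_cascade": lambda s: s.queued_queries > 50 and s.p99_latency_ms > 1000,
}


def detect(snapshot):
    """Return the names of all failure modes whose rule matches the snapshot."""
    return [mode for mode, rule in RULES.items() if rule(snapshot)]
```

For example, detect(Snapshot(oom_kills=2)) returns ["oom"], and a healthy snapshot returns an empty list. More than one mode can match at once, which is useful in practice because a runaway transaction often shows up together with a vacuum stall.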
Antipatterns
- Diagnosing each incident from scratch. Slow.
- A single ‘DB unhealthy’ alert. It hides which pattern fired.
- No runbook per pattern. The team re-investigates every time.
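The opposite of these antipatterns is one named alert per pattern, each pointing at its own runbook. A minimal sketch follows. The alert names, the runbook paths, and the build_alert helper are all hypothetical.

```python
# Hypothetical alert names and runbook paths, one pair per failure mode.
ALERTS = {
    "corruption": ("DBCorruption", "runbooks/corruption.md"),
    "replication_break": ("DBReplicationBreak", "runbooks/replication-break.md"),
    "runaway_transaction": ("DBRunawayTransaction", "runbooks/runaway-transaction.md"),
    "vacuum_stall": ("DBVacuumStall", "runbooks/vacuum-stall.md"),
    "oom": ("DBOutOfMemory", "runbooks/oom.md"),
    "slow_query_cascade": ("DBSlowQueryCascade", "runbooks/slow-query-cascade.md"),
}


def build_alert(mode):
    """Build a pattern-specific alert payload that carries its runbook.

    An unknown mode raises KeyError on purpose: a pattern without a runbook
    should be noticed when the alert is defined, not during an incident.
    """
    name, runbook = ALERTS[mode]
    return {"alert": name, "runbook": runbook}
```

The responder then starts from the runbook for the pattern that fired instead of from a generic health alert.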
What to do this week
Three moves. (1) Apply this pattern to your most-loaded table. (2) Measure query latency and write throughput before and after. (3) Document the win and the constraint so the next refactor inherits the knowledge.