Database Query Plan Debugging
Reading a query plan is a learnable skill: the failure patterns are well known, and recognising them saves substantial time in every debugging session.
Why plans matter
Most slow queries trace back to one of four root causes, and the plan tells you which.
- Bad plan. Optimiser picked a worse strategy than available; usually a hint or rewrite gets the better plan.
- Missing index. Sequential scan where an index would have been faster; the most common fix.
- Bad statistics. The optimiser’s row estimates are wrong; ANALYZE refreshes them, and raising the per-column STATISTICS target sometimes helps.
- Data distribution surprise. Skew the optimiser couldn’t see, such as one value dominating a column; fix with partial indexes or a query rewrite.
Four common plan problems
- Sequential scan when expected indexed.
- Wrong join type (hash vs merge vs nested).
- Bad row estimate (off by 10x+).
- External sort or external hash (memory pressure).
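The four symptoms above can be checked mechanically on a single plan node. What follows is a minimal sketch, assuming the plan came from PostgreSQL's EXPLAIN (ANALYZE, FORMAT JSON) and was parsed into a dictionary. The function names and the sample node are hypothetical; only the dictionary keys come from PostgreSQL's JSON output. A join node is only flagged for review, since a script cannot tell whether the join type is wrong.

```python
# Hypothetical helper names; the dictionary keys are those emitted by
# PostgreSQL's EXPLAIN (ANALYZE, FORMAT JSON).

JOIN_TYPES = {"Hash Join", "Merge Join", "Nested Loop"}


def estimate_ratio(node):
    """Return how far the row estimate is off, as a factor >= 1."""
    est = max(node.get("Plan Rows", 0), 1)    # clamp to avoid dividing by zero
    act = max(node.get("Actual Rows", 0), 1)
    return max(est / act, act / est)


def symptoms(node, threshold=10):
    """List which of the four plan problems a single node might show."""
    found = []
    kind = node.get("Node Type", "")
    if kind == "Seq Scan":
        found.append("sequential scan")
    if kind in JOIN_TYPES:
        found.append("join to review: " + kind)
    if estimate_ratio(node) >= threshold:
        found.append("bad row estimate")
    spilled_sort = node.get("Sort Space Type") == "Disk"
    spilled_hash = node.get("Hash Batches", 1) > 1
    if spilled_sort or spilled_hash:
        found.append("external sort or hash")
    return found


# Hand-written sample node, not output from a real database.
sample = {"Node Type": "Seq Scan", "Plan Rows": 100, "Actual Rows": 25000}
print(symptoms(sample))  # ['sequential scan', 'bad row estimate']
```

A flagged sequential scan is not automatically a problem: on a small table, or when most rows match, it can be the cheapest option.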
Per-problem fix
Each plan problem has a canonical fix. Knowing the mapping turns the debug session from open-ended into mechanical.
- Sequential scan. Either the index for the predicate is missing or the planner thinks the table is small; run ANALYZE first to rule out stale statistics, then add the index if the scan persists.
- Wrong join type. ANALYZE the tables to refresh statistics; the optimiser usually picks the right join when stats are accurate.
- Bad row estimate. ALTER TABLE ... SET STATISTICS for skewed columns; raises the histogram resolution.
- External sort or hash. Tune work_mem upward (Postgres) or session-level memory; fits the operation in RAM.
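The problem-to-fix mapping above can be written down as a small lookup. This is a sketch under stated assumptions: it targets PostgreSQL, the function name `suggest_fix` is hypothetical, and the statistics target of 1000 and the `work_mem` value of 64MB are illustrative numbers, not recommendations. Identifiers are interpolated without quoting, so it is suitable only for trusted names.

```python
def suggest_fix(problem, table, column):
    """Return candidate PostgreSQL statements for one of the four plan problems.

    The numeric values are placeholders to be tuned per workload.
    """
    fixes = {
        "sequential scan": [
            f"ANALYZE {table};",                      # rule out stale statistics first
            f"CREATE INDEX ON {table} ({column});",   # then index the predicate column
        ],
        "wrong join type": [
            f"ANALYZE {table};",
        ],
        "bad row estimate": [
            f"ALTER TABLE {table} ALTER COLUMN {column} SET STATISTICS 1000;",
            f"ANALYZE {table};",                      # the new target applies on the next ANALYZE
        ],
        "external sort or hash": [
            "SET work_mem = '64MB';",                 # session-level setting
        ],
    }
    return fixes[problem]


# Hypothetical table and column names.
for statement in suggest_fix("bad row estimate", "orders", "customer_id"):
    print(statement)
```

For the join case, each table involved in the join would get its own ANALYZE; the sketch shows one for brevity.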
Workflow
The canonical workflow is four steps; once internalised, most slow queries can be diagnosed in 5-10 minutes, even complex ones.
- Step 1. EXPLAIN ANALYZE the query; see the actual plan with real numbers, not just the estimate.
- Step 2. Identify the most expensive node; the bottleneck is usually one stage, not distributed.
- Step 3. Check estimated vs actual rows at that node; off by 10x+ means statistics are wrong.
- Step 4. Apply the matching fix from the per-problem list above; verify with another EXPLAIN ANALYZE.
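Steps 2 and 3 can be sketched in code. The example below assumes a plan from PostgreSQL's EXPLAIN (ANALYZE, FORMAT JSON) that has already been parsed into a dictionary; the function names and the sample plan are hypothetical. Self time is approximated as a node's total time minus its children's, which is a simplification that ignores complications such as parallel workers.

```python
def walk(node):
    """Yield every node in the plan tree, parents before children."""
    yield node
    for child in node.get("Plans", []):
        yield from walk(child)


def total_ms(node):
    """Time spent in a node including its children, across all loops."""
    return node.get("Actual Total Time", 0.0) * node.get("Actual Loops", 1)


def self_ms(node):
    """Approximate time spent in the node itself, excluding its children."""
    children = sum(total_ms(child) for child in node.get("Plans", []))
    return max(total_ms(node) - children, 0.0)


def bottleneck(plan_root):
    """Step 2: return the node with the largest approximate self time."""
    return max(walk(plan_root), key=self_ms)


def estimate_off_by(node):
    """Step 3: factor (>= 1) by which estimated and actual rows differ."""
    est = max(node.get("Plan Rows", 0), 1)    # clamp to avoid dividing by zero
    act = max(node.get("Actual Rows", 0), 1)
    return max(est / act, act / est)


# Hand-written, simplified sample plan; not output from a real database.
sample_plan = {
    "Node Type": "Hash Join", "Actual Total Time": 120.0, "Actual Loops": 1,
    "Plan Rows": 500, "Actual Rows": 480,
    "Plans": [
        {"Node Type": "Seq Scan", "Actual Total Time": 95.0, "Actual Loops": 1,
         "Plan Rows": 200, "Actual Rows": 40000},
        {"Node Type": "Hash", "Actual Total Time": 5.0, "Actual Loops": 1,
         "Plan Rows": 50, "Actual Rows": 50},
    ],
}

node = bottleneck(sample_plan)
print(node["Node Type"], round(estimate_off_by(node)))  # Seq Scan 200
```

With real output, the JSON is an array whose first element holds the root node under the "Plan" key, so the root would be something like `json.loads(text)[0]["Plan"]`. Keep in mind that EXPLAIN ANALYZE actually executes the statement, so data-modifying queries are best wrapped in a transaction that is rolled back.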
Antipatterns
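- Reading a plan without understanding cost units. Costs are arbitrary planner units, not milliseconds; treating them as time leads to wrong conclusions.
- Adding indexes blindly. Every index must be maintained on each write, so unneeded ones slow down writes.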
- Trusting EXPLAIN without ANALYZE. Estimate-only; no real numbers.
What to do this week
Three moves. (1) Apply this workflow to your most-loaded table. (2) Measure query latency and write throughput before and after. (3) Document the win and the constraint so the next refactor inherits the knowledge.