Slow Query Investigation: How an Agent Routes the Work
Most slow queries belong to two or three patterns. The agent that classifies, then routes to specialised remediation paths, and the eval suite for each path.
Classify the slow query
Slow queries split into four patterns that need four different responses. Classifying first is what makes the rest of the agent’s reasoning tractable.
- Pattern A: query changed. New index missing or plan regression. Look at recent EXPLAIN diffs to confirm.
- Pattern B: data changed. Table grew, statistics are stale. Look at row counts and statistics age.
- Pattern C: contention. Lock waits, IO queue depth, buffer pressure. Look at wait events.
- Pattern D: external. Slow downstream service, slow disk, noisy neighbour. Look at correlations with infrastructure metrics.
Route to specialised remediation
Each pattern routes to a different remediation. Mixing them produces overconfident proposals on the wrong axis.
- Pattern A: index recommendation. Specific index DDL proposed; human approves; the index is applied during a low-traffic window.
- Pattern B: statistics refresh. Run ANALYZE on the table. Cheap and usually safe; can auto-apply after the trust window.
- Pattern C: lock contention. Identify the blocking session; surface the dependency chain; let humans decide whether to kill or wait.
- Pattern D: escalate. The slow query is a symptom of an upstream problem. The database team is not the right owner; route to the upstream service team.
Eval suite per path
Each pattern needs its own test cases. A single pass-rate metric across all four hides regressions where the agent gets one pattern right and a sibling wrong.
- Pattern A test. Query with a missing index. Agent identifies the gap and proposes correct DDL.
- Pattern B test. Query slow due to stale statistics. Agent identifies and proposes ANALYZE.
- Pattern C test. Query blocked by a lock. Agent identifies the blocker and surfaces the dependency chain.
- Pattern D test. Query slow due to disk pressure. Agent identifies the symptom and escalates without proposing a database-level fix.
Audit the routing
Routing accuracy is the metric that decides whether the rest of the agent works. Track it per pattern, not in aggregate.
- Identification accuracy. Track which pattern was identified versus which was correct in production. Disagreement is signal that the classifier needs work.
- Time-to-resolution per pattern. Pattern B usually resolves fastest; Pattern D is the slowest. The histogram tells you where to invest.
- Human acceptance rate. High rejection on Pattern A often means the agent is suggesting indexes that have hidden costs such as write amplification.
- Cross-pattern audit. Cases that look like one pattern and are really another (B masquerading as A is common) deserve a dedicated audit slice.
Trust path
Trust is earned per pattern, not for the whole agent. Auto-applying every pattern at once is a recipe for an outage.
- Read-only first 90 days. Operators see the proposed action and apply it manually. The agent records acceptance and rejection.
- Auto-apply Pattern B after 90 days. ANALYZE is low-risk and the trust history is usually clean by then.
- Pattern A after 180 days. Auto-apply index DDL only after 180 days and only during defined low-traffic windows; index changes are not low-risk.
- Patterns C and D never auto-apply. Lock kills and upstream escalations stay human-driven indefinitely; the blast radius is too uneven.