An Agentic Approach to Database Latency Spikes
Six tools, four decision points, one common failure mode: an agent design that triages database latency spikes without falling into the slow-query trap.
The six tools
The triage agent for database latency spikes runs against six narrow tools. Six is the cap because the prompt budget for tool descriptions is finite and a thin tool surface keeps the agent fast.
- Metric query. Pull p50, p95, p99 latency over the last hour. Bounded to the database the alert names.
- Query stats. Top 10 queries by total time, queries by row-scan count, queries with recent EXPLAIN regressions.
- Connection pool stats. Active, idle, waiting connections; recent pool exhaustion events.
- Recent deploys. Deploys in the last 4 hours to services that talk to this database.
- Schema changes. DDL applied in the last 24 hours.
- Infrastructure stats. Disk and CPU utilisation, IOPS, and queue depth.
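A minimal sketch of how the cap could be enforced in code rather than by convention. The tool names, the description strings, and the `ToolRegistry` class are hypothetical labels for the capabilities listed above, not an existing API; only the cap of six comes from the text.

```python
from dataclasses import dataclass

MAX_TOOLS = 6  # cap taken from the text; everything else here is illustrative


@dataclass(frozen=True)
class ToolSpec:
    name: str
    description: str  # short text that would be shown to the agent


class ToolRegistry:
    """Holds tool specs and refuses to grow past the cap."""

    def __init__(self, max_tools: int = MAX_TOOLS) -> None:
        self.max_tools = max_tools
        self._tools: dict[str, ToolSpec] = {}

    def register(self, spec: ToolSpec) -> None:
        if spec.name in self._tools:
            raise ValueError(f"duplicate tool: {spec.name}")
        if len(self._tools) >= self.max_tools:
            raise ValueError(f"tool cap of {self.max_tools} reached")
        self._tools[spec.name] = spec

    def names(self) -> list[str]:
        return list(self._tools)

    def description_chars(self) -> int:
        # Rough proxy for the prompt budget spent on tool descriptions.
        return sum(len(t.description) for t in self._tools.values())


def build_registry() -> ToolRegistry:
    """Register the six hypothetical tools, each scoped to the alerting database."""
    registry = ToolRegistry()
    for name, description in [
        ("metric_query", "p50/p95/p99 latency over the last hour"),
        ("query_stats", "top queries by total time, row scans, EXPLAIN regressions"),
        ("pool_stats", "active/idle/waiting connections, exhaustion events"),
        ("recent_deploys", "deploys to client services in the last 4 hours"),
        ("schema_changes", "DDL applied in the last 24 hours"),
        ("infra_stats", "disk and CPU utilisation, IOPS, queue depth"),
    ]:
        registry.register(ToolSpec(name, description))
    return registry
```

Registering a seventh tool raises `ValueError`, so widening the tool surface becomes a deliberate decision rather than gradual drift.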
Four decision points
Latency spikes branch on four binary questions. The agent walks them in order; the answer to each shrinks the search space for the next.
- Query versus infrastructure. Is this a query problem (top queries dominating) or an infrastructure problem (CPU or disk pegged)? Different remediation paths.
- Recent change versus intrinsic. Did a recent deploy or schema change precede the spike, or is this long-standing slowness? A recent change is the fastest to revert; intrinsic slowness needs deeper analysis.
- One service versus many. Many services affected means a database-level issue. One service means a service-level issue, not the database’s fault.
- Continuing versus recovering. A recovering spike calls for observation; a continuing spike calls for action. The remediation choice depends on which.
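The ordered walk can be sketched as a small function that answers the four questions in sequence. The `Signals` fields and the branch labels are assumptions made for illustration, and how each boolean is derived from the tools is left out. Treating "CPU or disk pegged" as the sole decider of the first question is a deliberate simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    """Hypothetical pre-digested evidence gathered from the tools."""
    cpu_or_disk_pegged: bool      # infrastructure at the limit
    recent_change: bool           # deploy or DDL inside the lookback window
    services_affected: int        # client services seeing the spike
    latency_still_elevated: bool  # latest latency still above baseline


def walk_decisions(s: Signals) -> list[str]:
    """Answer the four binary questions in order and return the path taken."""
    path = []
    # 1. Query versus infrastructure.
    path.append("infrastructure" if s.cpu_or_disk_pegged else "query")
    # 2. Recent change versus intrinsic.
    path.append("recent-change" if s.recent_change else "intrinsic")
    # 3. One service versus many.
    path.append("many-services" if s.services_affected > 1 else "one-service")
    # 4. Continuing versus recovering.
    path.append("continuing" if s.latency_still_elevated else "recovering")
    return path
```

Returning the whole path, rather than only the final branch, lets the on-call see which answer sent the agent down a given remediation route.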
The slow-query trap
The slow-query trap is the most common failure mode of a naive triage agent, and also the easiest to instrument against.
- Naive instinct. The agent looks at slow queries first and fixes the slowest one. This is wrong when the cause is elsewhere.
- Why it fails. The slowest query was often slow all along; what changed is in the connection pool, the disk, or a recent deploy.
- Time-correlation guard. Avoid the trap by requiring the query’s slowdown to begin around the spike’s start time. If it does not, treat the slow query as a coincidence, not the cause.
- Eval coverage. Include “always-slow query during an unrelated spike” in the eval set. The agent must not pick the always-slow query as the cause.
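One way to express the time-correlation guard is to compare a query's latency before and after the spike began. This is a sketch under stated assumptions: the `(timestamp, latency_ms)` sample format, the function name, and the 1.5x ratio threshold are all hypothetical and would need tuning against real data.

```python
from datetime import datetime
from statistics import mean


def correlates_with_spike(
    samples: list[tuple[datetime, float]],
    spike_start: datetime,
    min_ratio: float = 1.5,  # assumed threshold, not taken from the text
) -> bool:
    """Return True only if the query got slower when the spike began.

    `samples` holds (timestamp, latency_ms) pairs for a single query.
    """
    before = [ms for ts, ms in samples if ts < spike_start]
    during = [ms for ts, ms in samples if ts >= spike_start]
    if not during:
        return False  # the query did not run during the spike
    if not before:
        return True  # assumption: a query first seen at the spike is suspect
    # An always-slow query has similar means on both sides and fails this check.
    return mean(during) >= min_ratio * mean(before)
```

The same function covers the eval case above: a query that averages about 900 ms both before and during the spike returns `False`, so it cannot be ranked as the cause on slowness alone.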
What the agent should output
Output is structured so the on-call can scan it in seconds. Free-text triage erodes the agent’s value.
- Ranked causes. Most likely cause first, with a confidence score per cause and the runner-up clearly named.
- Evidence per cause. The metric panels and log excerpts that support the ranking. No ranked cause without supporting evidence.
- Recommended next action. Usually “investigate cause #1” with specific commands; sometimes “escalate, evidence is ambiguous.”
- Time and tool calls. Elapsed time and number of tool calls so the run cost stays visible to the on-call.
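The output fields above map naturally onto a small schema. The sketch below is one possible shape, not a prescribed format: the class names, field names, and the 0 to 1 confidence scale are assumptions. It enforces two rules from the list: causes are ranked by confidence, and no cause is accepted without evidence.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RankedCause:
    name: str
    confidence: float    # assumed to be on a 0.0 to 1.0 scale
    evidence: list[str]  # references to metric panels or log excerpts

    def __post_init__(self) -> None:
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")
        if not self.evidence:
            raise ValueError("no ranked cause without supporting evidence")


@dataclass
class TriageReport:
    causes: list[RankedCause]
    next_action: str
    elapsed_seconds: float
    tool_calls: int

    def __post_init__(self) -> None:
        # Most likely cause first, so the runner-up is always causes[1].
        self.causes = sorted(
            self.causes, key=lambda c: c.confidence, reverse=True
        )

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)
```

Validating at construction time means a malformed report fails inside the agent run instead of reaching the on-call.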
When to escalate
Four escalation triggers cover most of the cases the agent should not handle alone. Escalating quickly is more useful than guessing past the agent’s scope.
- Equally likely causes. Multiple causes carry similar confidence. The agent cannot disambiguate without acting; escalate to a human.
- Infrastructure cause. Disk or CPU at the limit. The triage agent’s tools do not include infrastructure remediation; escalate to the database team.
- Customer-data cause. A specific tenant’s pattern is responsible. Customer-impact judgement is human territory; escalate.
- Cross-incident correlation. If the spike correlates with another active incident, escalate so the response is coordinated rather than parallel.
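The triggers listed above can be checked mechanically at the end of a run. In this sketch the `TriageState` fields and the 0.10 margin used to decide that two causes are "equally likely" are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TriageState:
    """Hypothetical summary of a finished triage run."""
    confidences: tuple[float, ...]  # one score per ranked cause
    infrastructure_cause: bool      # disk or CPU at the limit
    tenant_specific_cause: bool     # one tenant's pattern is responsible
    correlated_incident: bool       # overlaps another active incident


def escalation_reasons(state: TriageState, tie_margin: float = 0.10) -> list[str]:
    """Return every trigger that fired; an empty list means no escalation."""
    reasons = []
    top_two = sorted(state.confidences, reverse=True)[:2]
    if len(top_two) == 2 and top_two[0] - top_two[1] <= tie_margin:
        reasons.append("equally likely causes")
    if state.infrastructure_cause:
        reasons.append("infrastructure cause")
    if state.tenant_specific_cause:
        reasons.append("customer-data cause")
    if state.correlated_incident:
        reasons.append("cross-incident correlation")
    return reasons
```

Returning all reasons that fired, rather than stopping at the first, gives the human who picks up the escalation the full picture.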