An Agentic Approach to Database Latency Spikes
Six tools, four decision points, one common failure mode. The agent design that triages db latency without falling into the slow-query trap.
The six tools
Metric query: pull p50/p95/p99 latency over the last hour. Bounded to the database the alert names.
Query stats: top 10 queries by total time, queries by row-scan count, queries with recent EXPLAIN regressions.
Connection pool stats: active, idle, waiting connections; recent pool exhaustion events.
Recent deploys: deploys to services that talk to this database in the last 4 hours.
Recent schema changes: DDL applied in the last 24 hours.
Disk and CPU stats: percent utilisation, IOPS, queue depth.
Four decision points
Is this a query problem (top queries dominating) or an infrastructure problem (CPU/disk pegged)? Different remediation paths.
Is this a recent-change problem (deploy or schema) or an intrinsic problem (long-standing slowness)? Recent-change is fastest to revert; intrinsic needs deeper analysis.
Is this affecting one service or many? Many services affected = database-level issue; one service = service-level issue, not database's fault.
Is this still happening or recovering? A recovering spike is observation; a continuing spike is action.
The slow-query trap
The naive agent looks for slow queries first and fixes the slowest one. This is wrong when the cause is elsewhere.
The slow-query trap explains many failed agent triages. The query was always slow; the new behaviour is in the connection pool, the disk, or the recent deploy.
Avoid by checking the time-correlation: does the slow query correlate with the spike's start time? If not, it is a coincidence, not the cause.
What the agent should output
Most likely cause, ranked. Confidence score per cause.
Evidence: the metric panels and log excerpts that support the ranking.
Recommended next action: usually "investigate cause #1" with specific commands; sometimes "escalate, evidence is ambiguous."
When to escalate
Multiple causes are equally likely. The agent cannot disambiguate without acting; escalate to a human.
The cause appears to be infrastructure (disk, CPU). Triage agent's tools do not include infrastructure remediation; escalate to the database team.
The cause appears to be customer data (a specific tenant's pattern). Customer-impact judgement is human territory; escalate.