Agentic SRE Advanced By Samson Tanimawo, PhD Published May 27, 2026 5 min read

An Agentic Approach to Database Latency Spikes

Six tools, four decision points, one common failure mode: an agent design that triages database latency without falling into the slow-query trap.

The six tools

Metric query: pull p50/p95/p99 latency over the last hour, scoped to the database named in the alert.

Query stats: top 10 queries by total time, queries by row-scan count, queries with recent EXPLAIN regressions.

Connection pool stats: active, idle, waiting connections; recent pool exhaustion events.

Recent deploys: deploys to services that talk to this database in the last 4 hours.

Recent schema changes: DDL applied in the last 24 hours.

Disk and CPU stats: percent utilisation, IOPS, queue depth.
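The six tools above can be written down as a small registry, so the agent's reach is explicit and its look-back windows are fixed. This is only a sketch: the names `ToolSpec`, `TOOLS`, and `lookback_for` are hypothetical, and the article does not prescribe any particular schema. Windows are set only where the text states one.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class ToolSpec:
    """Hypothetical description of one read-only triage tool."""
    name: str
    description: str
    lookback: Optional[timedelta] = None  # None where the article states no window


# One entry per tool listed in the article; identifiers are illustrative.
TOOLS = (
    ToolSpec("metric_query", "p50/p95/p99 latency for the alerted database",
             timedelta(hours=1)),
    ToolSpec("query_stats", "top 10 queries by total time, row scans, EXPLAIN regressions"),
    ToolSpec("connection_pool_stats", "active/idle/waiting connections, exhaustion events"),
    ToolSpec("recent_deploys", "deploys to services that talk to this database",
             timedelta(hours=4)),
    ToolSpec("recent_schema_changes", "DDL applied to this database",
             timedelta(hours=24)),
    ToolSpec("disk_cpu_stats", "percent utilisation, IOPS, queue depth"),
)


def lookback_for(tool_name: str) -> Optional[timedelta]:
    """Return the look-back window for a tool, or None if it has none."""
    for tool in TOOLS:
        if tool.name == tool_name:
            return tool.lookback
    raise KeyError(tool_name)
```

Keeping the windows in data rather than in prompts makes it harder for the agent to widen its own search quietly.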

Four decision points

Is this a query problem (top queries dominating) or an infrastructure problem (CPU/disk pegged)? The two have different remediation paths.

Is this a recent-change problem (deploy or schema) or an intrinsic problem (long-standing slowness)? A recent change is the fastest to revert; an intrinsic problem needs deeper analysis.

Is this affecting one service or many? Many services affected means a database-level issue; one service means a service-level issue that is not the database's fault.

Is this still happening or recovering? A recovering spike calls for observation; a continuing spike calls for action.
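The four decision points can be sketched as one pure function over signals gathered by the tools. Everything named here is an assumption for illustration: the `Signals` fields, the 0.5 dominance threshold, and the label strings are not from the article, and a real agent would derive the inputs from its own metrics.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Signals:
    """Hypothetical summary of what the six tools returned."""
    top_query_time_share: float  # fraction of total DB time spent in the top queries
    cpu_or_disk_pegged: bool
    recent_deploy_or_ddl: bool
    services_affected: int
    latency_trend: str           # "rising", "flat", or "falling"


def triage(s: Signals, dominance_threshold: float = 0.5) -> Dict[str, str]:
    """Answer the four decision points; the threshold is an assumed default."""
    query_dominated = s.top_query_time_share >= dominance_threshold

    # 1. Query problem or infrastructure problem?
    if query_dominated and s.cpu_or_disk_pegged:
        layer = "ambiguous"
    elif query_dominated:
        layer = "query"
    elif s.cpu_or_disk_pegged:
        layer = "infrastructure"
    else:
        layer = "unclear"

    # 2. Recent change or intrinsic?
    origin = "recent-change" if s.recent_deploy_or_ddl else "intrinsic"

    # 3. One service or many?
    scope = "database-level" if s.services_affected > 1 else "service-level"

    # 4. Still happening or recovering?
    posture = "observe" if s.latency_trend == "falling" else "act"

    return {"layer": layer, "origin": origin, "scope": scope, "posture": posture}
```

Returning all four answers together, rather than stopping at the first, keeps the evidence for each question visible in the final report.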

The slow-query trap

The naive agent looks for slow queries first and fixes the slowest one. This is wrong when the cause is elsewhere.

The slow-query trap explains many failed agent triages. In those cases the query was always slow; what changed is the connection pool, the disk, or a recent deploy.

Avoid the trap by checking time correlation: did the query become slow when the spike started? If not, its slowness is a coincidence, not the cause.
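One way to make the time-correlation check concrete is to compare a query's latency before and after the spike's start time. The sketch below assumes per-query latency samples are available as `(timestamp, milliseconds)` pairs; the function name `changed_at_spike` and the 1.5x ratio are illustrative choices, not part of the article.

```python
from datetime import datetime, timedelta
from statistics import median
from typing import List, Optional, Tuple

Sample = Tuple[datetime, float]  # (timestamp, query latency in ms)


def changed_at_spike(samples: List[Sample], spike_start: datetime,
                     min_ratio: float = 1.5) -> Optional[bool]:
    """True if the query got slower when the spike began, False if it was
    already this slow beforehand, None if there is too little data to say."""
    before = [ms for ts, ms in samples if ts < spike_start]
    after = [ms for ts, ms in samples if ts >= spike_start]
    if not before or not after:
        return None
    baseline = median(before)
    if baseline <= 0:
        return None
    return median(after) / baseline >= min_ratio


spike = datetime(2026, 1, 1, 12, 0)
minutes = range(-30, 30)

# A query that was slow all along: same latency before and after the spike.
always_slow = [(spike + timedelta(minutes=m), 900.0) for m in minutes]

# A query that regressed exactly when the spike started.
regressed = [(spike + timedelta(minutes=m), 100.0 if m < 0 else 400.0)
             for m in minutes]

print(changed_at_spike(always_slow, spike))  # False
print(changed_at_spike(regressed, spike))    # True
```

The always-slow query has the higher absolute latency, yet only the regressed query is a candidate cause, which is exactly the distinction the naive agent misses.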

What the agent should output

Most likely causes, ranked, with a confidence score per cause.

Evidence: the metric panels and log excerpts that support the ranking.

Recommended next action: usually "investigate cause #1" with specific commands; sometimes "escalate, evidence is ambiguous."
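The three output parts can be captured in a small structured report. This is a minimal sketch with hypothetical names (`Cause`, `TriageReport`); how confidence is computed is left open, and the example causes and evidence strings are invented for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Cause:
    name: str
    confidence: float  # 0.0 to 1.0; the scoring method is not specified here
    evidence: List[str] = field(default_factory=list)  # panel references, log excerpts


@dataclass
class TriageReport:
    causes: List[Cause]
    next_action: str

    def __post_init__(self) -> None:
        # Keep causes ranked, most likely first.
        self.causes = sorted(self.causes, key=lambda c: c.confidence, reverse=True)

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)


# Illustrative values only.
report = TriageReport(
    causes=[
        Cause("long-standing slow query", 0.2, ["query stats: unchanged over 24h"]),
        Cause("connection pool exhaustion", 0.7, ["pool stats: waiting connections rising"]),
    ],
    next_action="investigate cause #1",
)
print(report.to_json())
```

Sorting on construction means "cause #1" in the next action always refers to the highest-confidence entry, regardless of the order in which the agent found them.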

When to escalate

Multiple causes are equally likely. The agent cannot disambiguate without acting; escalate to a human.

The cause appears to be infrastructure (disk, CPU). The triage agent's tools do not include infrastructure remediation; escalate to the database team.

The cause appears to be customer data (a specific tenant's pattern). Customer-impact judgement is human territory; escalate.
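The three escalation rules can be expressed as a short check over the ranked causes. The category labels, the 0.1 tie margin, and the function name `escalation_reason` are assumptions made for this sketch; the article does not define what "equally likely" means numerically.

```python
from typing import List, Optional, Tuple

RankedCause = Tuple[str, float]  # (cause category, confidence)


def escalation_reason(causes: List[RankedCause],
                      tie_margin: float = 0.1) -> Optional[str]:
    """Return why the agent should escalate, or None if it may proceed."""
    if not causes:
        raise ValueError("at least one candidate cause is required")

    ranked = sorted(causes, key=lambda c: c[1], reverse=True)
    top_category, top_conf = ranked[0]

    # Rule 1: the top causes are too close to tell apart (assumed margin).
    if len(ranked) > 1 and top_conf - ranked[1][1] <= tie_margin:
        return "ambiguous evidence: escalate to a human"

    # Rule 2: infrastructure remediation is outside the agent's tools.
    if top_category == "infrastructure":
        return "infrastructure cause: escalate to the database team"

    # Rule 3: customer-impact judgement belongs to a human.
    if top_category == "customer-data":
        return "tenant-specific pattern: escalate to a human"

    return None
```

For example, `escalation_reason([("query", 0.8), ("infrastructure", 0.3)])` returns `None`, while swapping the two confidences routes the incident to the database team.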