Agentic SRE Advanced By Samson Tanimawo, PhD Published Apr 10, 2026 5 min read

Agent Specialization by Failure Mode: A Sketch

Network failures need different reasoning than database failures. This post lays out a taxonomy of failure modes and sketches which modes each specialist agent handles.

The taxonomy

Network failures: connectivity, DNS, latency between services. Tools: traceroute, dig, network metrics.

Database failures: query, connection, replication. Tools: query stats, EXPLAIN, replication lag.

Compute failures: CPU, memory, disk. Tools: process stats, OOM logs, disk metrics.

Application failures: bugs, exceptions, regressions. Tools: error rates, traces, recent deploys.

External failures: third-party APIs, CDN, DNS providers. Tools: status pages, probe results.
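One way to make the taxonomy above concrete is a small lookup table. This is a minimal sketch, not a prescribed schema: the names `FailureMode`, `ModeSpec`, `TAXONOMY`, and `tools_for` are assumptions introduced for illustration, and the tool strings are plain labels copied from the list above, not identifiers of real integrations.

```python
from dataclasses import dataclass
from enum import Enum


class FailureMode(Enum):
    NETWORK = "network"
    DATABASE = "database"
    COMPUTE = "compute"
    APPLICATION = "application"
    EXTERNAL = "external"


@dataclass(frozen=True)
class ModeSpec:
    symptoms: tuple[str, ...]  # what the mode covers
    tools: tuple[str, ...]     # what its specialist is given


# Entries mirror the taxonomy in the text; the labels are illustrative only.
TAXONOMY: dict[FailureMode, ModeSpec] = {
    FailureMode.NETWORK: ModeSpec(
        symptoms=("connectivity", "DNS", "latency between services"),
        tools=("traceroute", "dig", "network metrics"),
    ),
    FailureMode.DATABASE: ModeSpec(
        symptoms=("query", "connection", "replication"),
        tools=("query stats", "EXPLAIN", "replication lag"),
    ),
    FailureMode.COMPUTE: ModeSpec(
        symptoms=("CPU", "memory", "disk"),
        tools=("process stats", "OOM logs", "disk metrics"),
    ),
    FailureMode.APPLICATION: ModeSpec(
        symptoms=("bugs", "exceptions", "regressions"),
        tools=("error rates", "traces", "recent deploys"),
    ),
    FailureMode.EXTERNAL: ModeSpec(
        symptoms=("third-party APIs", "CDN", "DNS providers"),
        tools=("status pages", "probe results"),
    ),
}


def tools_for(mode: FailureMode) -> tuple[str, ...]:
    """Return the tool labels assigned to one failure mode."""
    return TAXONOMY[mode].tools
```

Keeping the table as data rather than scattering it across prompts means adding a mode later is a single new entry.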

Why specialists by failure mode

Each mode requires different reasoning patterns. Network failures involve graph traversal; database failures involve query analysis; compute failures involve resource accounting.

Each mode requires different tools. A network specialist does not need EXPLAIN; a database specialist does not need traceroute.

Specialists' prompts can be tighter and more domain-specific. A tighter prompt gives the model fewer irrelevant tools and instructions to weigh, which tends to make its behavior more reliable.
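The tool separation described above could be enforced with a per-specialist allow-list. The sketch below is hypothetical throughout: `Specialist`, `call_tool`, the prompt strings, and the tool labels are assumptions for illustration, and the tool call is stubbed out where a real agent would dispatch to an integration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Specialist:
    name: str
    system_prompt: str          # short and domain-specific
    allowed_tools: frozenset[str]

    def call_tool(self, tool: str) -> str:
        # Refuse anything outside this specialist's domain.
        if tool not in self.allowed_tools:
            raise PermissionError(f"{self.name} specialist may not call {tool!r}")
        # Stub: a real agent would invoke the tool here.
        return f"{self.name}:{tool}"


# Hypothetical prompts; real ones would be tuned per domain.
network = Specialist(
    name="network",
    system_prompt="Investigate connectivity, DNS, and inter-service latency.",
    allowed_tools=frozenset({"traceroute", "dig", "network metrics"}),
)
database = Specialist(
    name="database",
    system_prompt="Investigate queries, connections, and replication.",
    allowed_tools=frozenset({"query stats", "EXPLAIN", "replication lag"}),
)
```

With this shape, `database.call_tool("traceroute")` raises `PermissionError`, so a specialist that drifts outside its domain fails loudly instead of quietly widening its scope.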

Classifying the incoming alert

Most alerts have a clear primary mode: a replication-lag alert is database, a 5xx alert is application, a packet-loss alert is network.

Some alerts span modes: a "latency spike" could be database, network, or compute. Rather than guess, the classifier escalates these to a triage specialist that gathers evidence before committing to a mode.

The classifier itself is a small, fast LLM call. Its eval suite is the alert-to-mode mapping.
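The classifier and its eval suite might look like the following sketch. To keep it runnable without a model, a keyword lookup stands in for the LLM call; that substitution, the `HINTS` table, the `TRIAGE` label, and the sample alerts are all assumptions made for illustration, not the author's implementation.

```python
from typing import Callable

MODES = ("network", "database", "compute", "application", "external")
TRIAGE = "triage"  # hypothetical label for alerts with no single clear mode

# Hypothetical keyword hints standing in for the small LLM call.
HINTS = {
    "network": ("packet loss", "dns", "connectivity"),
    "database": ("replication", "query", "connection pool"),
    "compute": ("oom", "cpu", "disk"),
    "application": ("5xx", "exception", "error rate"),
    "external": ("third-party", "cdn", "status page"),
}


def classify(alert: str) -> str:
    """Return one mode when exactly one matches, otherwise escalate to triage."""
    text = alert.lower()
    matches = [mode for mode in MODES if any(hint in text for hint in HINTS[mode])]
    return matches[0] if len(matches) == 1 else TRIAGE


# The eval suite is just the expected alert-to-mode mapping (made-up alerts).
EVAL_SUITE = [
    ("Packet loss above 2% between az-a and az-b", "network"),
    ("Replication lag exceeds 30s on replica-2", "database"),
    ("5xx rate above threshold on checkout", "application"),
    ("OOM kill on worker-7", "compute"),
    ("Latency spike on checkout", TRIAGE),
]


def accuracy(classifier: Callable[[str], str], suite: list[tuple[str, str]]) -> float:
    """Fraction of alerts the classifier maps to the expected mode."""
    correct = sum(1 for alert, expected in suite if classifier(alert) == expected)
    return correct / len(suite)
```

Because `accuracy` takes the classifier as a parameter, the same suite can score a real LLM-backed classifier once one replaces the keyword stand-in.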

Composing specialists

Triage: identify the mode. Hand off to the right specialist.

Specialist: investigate within its domain. Produce hypothesis and recommended action.

Remediation: act on the hypothesis. Could be the specialist itself (if it has action tools) or a separate agent.

Audit: record the chain for postmortem and learning.
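The four stages above can be sketched as a single function that wires them together. Every name here (`Finding`, `AuditLog`, `handle_alert`, and the stub callables) is hypothetical, and the stubs return canned values where real agents would investigate and act.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Finding:
    mode: str
    hypothesis: str
    recommended_action: str


@dataclass
class AuditLog:
    entries: list[tuple[str, str]] = field(default_factory=list)

    def record(self, stage: str, detail: str) -> None:
        self.entries.append((stage, detail))


def handle_alert(
    alert: str,
    triage: Callable[[str], str],
    specialists: dict[str, Callable[[str], Finding]],
    remediate: Callable[[Finding], str],
    audit: AuditLog,
) -> Finding:
    mode = triage(alert)                    # Triage: identify the mode
    audit.record("triage", mode)
    finding = specialists[mode](alert)      # Specialist: investigate in-domain
    audit.record("specialist", finding.hypothesis)
    outcome = remediate(finding)            # Remediation: act on the hypothesis
    audit.record("remediation", outcome)
    return finding


# Stub stages with canned answers, for illustration only.
def stub_triage(alert: str) -> str:
    return "database"


def stub_db_specialist(alert: str) -> Finding:
    return Finding("database", "replica is lagging", "fail over reads to primary")


def stub_remediate(finding: Finding) -> str:
    return f"proposed: {finding.recommended_action}"


log = AuditLog()
result = handle_alert(
    "Replication lag exceeds 30s",
    stub_triage,
    {"database": stub_db_specialist},
    stub_remediate,
    log,
)
```

Passing the remediation step in as a callable leaves open both options from the text: the specialist can supply its own action function, or a separate agent can.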

How the taxonomy evolves

Year one: 5 modes covering 80% of incidents. Treat each misclassification as a bug and use it to refine the taxonomy.

Year two: 8 modes covering 90%. New modes added based on observed gaps.

Beyond year two: stability. The taxonomy stops growing; specialists deepen instead.
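Coverage figures like these can be tracked from postmortem labels. The sketch below assumes each closed incident carries a root-cause mode label assigned at postmortem; the function names, the `config` label, and the sample data are made up for illustration and are not measurements.

```python
from collections import Counter


def coverage(incident_modes: list[str], taxonomy: set[str]) -> float:
    """Fraction of incidents whose postmortem mode is in the taxonomy."""
    if not incident_modes:
        return 0.0
    covered = sum(1 for mode in incident_modes if mode in taxonomy)
    return covered / len(incident_modes)


def uncovered_counts(incident_modes: list[str], taxonomy: set[str]) -> Counter:
    """Count incidents per mode label that the taxonomy does not yet include."""
    return Counter(mode for mode in incident_modes if mode not in taxonomy)


YEAR_ONE = {"network", "database", "compute", "application", "external"}

# Made-up sample: ten incidents, two with a label outside the taxonomy.
sample = (
    ["network"] * 3
    + ["database"] * 2
    + ["compute"]
    + ["application"] * 2
    + ["config"] * 2
)

year_one_coverage = coverage(sample, YEAR_ONE)
gaps = uncovered_counts(sample, YEAR_ONE)
```

On this sample, `year_one_coverage` is 0.8 and `gaps` points at `config` as the most frequent uncovered label, which is the kind of observed gap that would justify adding a mode.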