Agent Specialization by Failure Mode: A Sketch
Network failures need different reasoning than database failures. This sketch lays out a taxonomy of failure modes and which modes each specialist handles.
The taxonomy
Network failures: connectivity, DNS, latency between services. Tools: traceroute, dig, network metrics.
Database failures: query, connection, replication. Tools: query stats, EXPLAIN, replication lag.
Compute failures: CPU, memory, disk. Tools: process stats, OOM logs, disk metrics.
Application failures: bugs, exceptions, regressions. Tools: error rates, traces, recent deploys.
External failures: third-party APIs, CDN, DNS providers. Tools: status pages, probe results.
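One way to make the taxonomy concrete is to write it down as data. The following is a minimal sketch, assuming Python 3.9+; the names FailureMode, TOOLS_BY_MODE, and tools_for are hypothetical, and the tool strings are labels copied from the list above, not bindings to real commands or APIs.

```python
from enum import Enum


class FailureMode(Enum):
    """The five failure modes from the taxonomy."""
    NETWORK = "network"
    DATABASE = "database"
    COMPUTE = "compute"
    APPLICATION = "application"
    EXTERNAL = "external"


# Illustrative tool labels per mode, taken from the taxonomy list.
# They are plain strings here, not real tool integrations.
TOOLS_BY_MODE: dict[FailureMode, tuple[str, ...]] = {
    FailureMode.NETWORK: ("traceroute", "dig", "network_metrics"),
    FailureMode.DATABASE: ("query_stats", "explain", "replication_lag"),
    FailureMode.COMPUTE: ("process_stats", "oom_logs", "disk_metrics"),
    FailureMode.APPLICATION: ("error_rates", "traces", "recent_deploys"),
    FailureMode.EXTERNAL: ("status_pages", "probe_results"),
}


def tools_for(mode: FailureMode) -> tuple[str, ...]:
    """Return the tool labels associated with one failure mode."""
    return TOOLS_BY_MODE[mode]
```

Keeping the mapping in one table makes it easy to check that every mode has tools assigned and that no specialist picks up a tool from another domain.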
Why specialists by failure mode
Each mode requires different reasoning patterns. Network failures involve graph traversal; database failures involve query analysis; compute failures involve resource accounting.
Each mode requires different tools. A network specialist does not need EXPLAIN; a database specialist does not need traceroute.
Specialists' prompts can be tighter and more domain-specific. A tighter prompt tends to be more reliable, since the model has fewer irrelevant instructions and tools to choose among.
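The three points above can be sketched as a per-specialist spec that carries its own reasoning focus, its own tool scope, and a short prompt built from both. This is an illustration under assumptions: SpecialistSpec, system_prompt, and check_tool are hypothetical names, and the prompt wording is invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpecialistSpec:
    """Hypothetical description of one specialist agent."""
    mode: str
    reasoning_focus: str
    allowed_tools: frozenset[str]

    def system_prompt(self) -> str:
        # The prompt mentions only this specialist's domain and tools.
        tools = ", ".join(sorted(self.allowed_tools))
        return (
            f"You investigate {self.mode} failures only. "
            f"Focus on {self.reasoning_focus}. "
            f"Available tools: {tools}."
        )

    def check_tool(self, tool: str) -> None:
        # Reject any tool call outside this specialist's scope.
        if tool not in self.allowed_tools:
            raise PermissionError(f"{self.mode} specialist cannot call {tool!r}")


NETWORK = SpecialistSpec(
    mode="network",
    reasoning_focus="graph traversal between services",
    allowed_tools=frozenset({"traceroute", "dig", "network_metrics"}),
)
DATABASE = SpecialistSpec(
    mode="database",
    reasoning_focus="query analysis",
    allowed_tools=frozenset({"query_stats", "explain", "replication_lag"}),
)
```

With this shape, NETWORK.check_tool("explain") raises PermissionError, which is the code-level version of "a network specialist does not need EXPLAIN".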
Classifying the incoming alert
Most alerts have a clear primary mode: a replication-lag alert is database, a 5xx alert is application, a packet-loss alert is network.
Some alerts span modes. "Latency spike" could be database, network, or compute. For these, the classifier escalates to a triage specialist that gathers evidence before committing to a mode.
The classifier itself is a small, fast LLM call. Its eval suite is the alert-to-mode mapping.
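A rough sketch of the classifier and its eval suite follows. The keyword rules are a stand-in for the small LLM call, used here only so the example runs without a model; the alert strings, KEYWORD_RULES, EVAL_CASES, classify, and evaluate are all hypothetical. Returning None represents "ambiguous, escalate to triage".

```python
from typing import Callable, Optional

# Stand-in for the small LLM call: simple keyword rules.
# A real classifier would replace this with a model request.
KEYWORD_RULES = (
    ("packet loss", "network"),
    ("replication", "database"),
    ("5xx", "application"),
    ("oom", "compute"),
)


def classify(alert: str) -> Optional[str]:
    """Return the primary mode, or None if the alert should go to triage."""
    text = alert.lower()
    matches = {mode for keyword, mode in KEYWORD_RULES if keyword in text}
    # Exactly one matching mode is a clear classification;
    # zero or several means the alert is ambiguous.
    return matches.pop() if len(matches) == 1 else None


# The eval suite is the alert-to-mode mapping (made-up alert texts).
EVAL_CASES = (
    ("Packet loss above 2% on edge link", "network"),
    ("Replication lag exceeds 30s", "database"),
    ("5xx rate above threshold on checkout", "application"),
    ("Latency spike on checkout", None),
)


def evaluate(classifier: Callable[[str], Optional[str]]) -> float:
    """Return the fraction of eval cases the classifier gets right."""
    correct = sum(1 for alert, expected in EVAL_CASES if classifier(alert) == expected)
    return correct / len(EVAL_CASES)
```

Because evaluate takes any callable with the same signature, the keyword stand-in and an LLM-backed classifier can be scored against the same mapping.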
Composing specialists
Triage: identify the mode. Hand off to the right specialist.
Specialist: investigate within its domain. Produce hypothesis and recommended action.
Remediation: act on the hypothesis. Could be the specialist itself (if it has action tools) or a separate agent.
Audit: record the chain for postmortem and learning.
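The four roles above can be wired together as a single function. This is a sketch under assumptions, not a definitive design: Finding, AuditLog, and handle_alert are hypothetical names, and each role is passed in as a plain callable so the composition stays independent of how any one agent is implemented.

```python
from dataclasses import dataclass, field
from typing import Callable, Mapping


@dataclass(frozen=True)
class Finding:
    """What a specialist produces: a hypothesis and a recommended action."""
    mode: str
    hypothesis: str
    recommended_action: str


@dataclass
class AuditLog:
    """Ordered record of each stage, kept for postmortem and learning."""
    entries: list[tuple[str, str]] = field(default_factory=list)

    def record(self, stage: str, detail: str) -> None:
        self.entries.append((stage, detail))


def handle_alert(
    alert: str,
    triage: Callable[[str], str],
    specialists: Mapping[str, Callable[[str], Finding]],
    remediate: Callable[[Finding], str],
    audit: AuditLog,
) -> str:
    """Run triage, specialist, and remediation in order, auditing each step."""
    mode = triage(alert)                    # Triage: identify the mode.
    audit.record("triage", mode)
    finding = specialists[mode](alert)      # Specialist: investigate in-domain.
    audit.record("specialist", finding.hypothesis)
    outcome = remediate(finding)            # Remediation: act on the hypothesis.
    audit.record("remediation", outcome)
    return outcome
```

Passing remediate separately covers both options in the text: it can be a method on the specialist itself or a different agent entirely.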
How the taxonomy evolves
Year one: 5 modes covering 80% of incidents. Treat misclassifications as bugs and refine the taxonomy in response.
Year two: 8 modes covering 90%. New modes added based on observed gaps.
Beyond year two: stability. The taxonomy stops growing; specialists deepen instead.
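Coverage figures like these can be tracked from postmortem labels. The sketch below assumes each incident has been tagged with a root-cause label after the fact; coverage, candidate_modes, the min_count threshold, and the "config" label in the usage note are hypothetical.

```python
from collections import Counter
from typing import Iterable, Sequence


def coverage(incident_labels: Sequence[str], taxonomy: Iterable[str]) -> float:
    """Fraction of incidents whose label falls inside the current taxonomy."""
    known = set(taxonomy)
    if not incident_labels:
        return 0.0
    covered = sum(1 for label in incident_labels if label in known)
    return covered / len(incident_labels)


def candidate_modes(
    incident_labels: Sequence[str],
    taxonomy: Iterable[str],
    min_count: int = 2,
) -> list[str]:
    """Labels outside the taxonomy that recur often enough to consider adding."""
    known = set(taxonomy)
    gaps = Counter(label for label in incident_labels if label not in known)
    return [label for label, count in gaps.most_common() if count >= min_count]
```

For example, with four "network", four "database", and two "config" incidents against a taxonomy of network and database, coverage is 0.8 and "config" is the one candidate for a new mode.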