Agent Specialization by Failure Mode: A Sketch
Network failures call for different reasoning than database failures. This sketch lays out a taxonomy of failure modes and which modes each specialist handles.
The taxonomy
Five failure modes cover most incidents: network failures (connectivity, DNS, latency between services; tools: traceroute, dig, network metrics); database failures (query, connection, replication; tools: query stats, EXPLAIN, replication lag); compute failures (CPU, memory, disk; tools: process stats, OOM logs, disk metrics); application failures (bugs, exceptions, regressions; tools: error rates, traces, recent deploys); and external failures (third-party APIs, CDN, DNS providers).
- Network failures. Connectivity, DNS, latency; traceroute, dig, network metrics.
- Database failures. Query, connection, replication; query stats, EXPLAIN, replication lag.
- Compute failures. CPU, memory, disk; process stats, OOM logs, disk metrics.
- Application failures. Bugs, exceptions, regressions; error rates, traces, recent deploys.
- External failures. Third-party APIs, CDN, DNS providers.
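The taxonomy above can be written down as data so that routing and tooling share one definition. This is a minimal sketch: the names `FailureMode`, `TAXONOMY`, and `tools_for` are illustrative assumptions, and the tool strings are labels copied from the text, not real command wrappers.

```python
from enum import Enum


class FailureMode(Enum):
    NETWORK = "network"
    DATABASE = "database"
    COMPUTE = "compute"
    APPLICATION = "application"
    EXTERNAL = "external"


# Symptoms and tools per mode, copied from the taxonomy in the text.
# The text lists no tools for external failures, so that entry is left empty.
TAXONOMY = {
    FailureMode.NETWORK: {
        "symptoms": ["connectivity", "DNS", "latency between services"],
        "tools": ["traceroute", "dig", "network metrics"],
    },
    FailureMode.DATABASE: {
        "symptoms": ["query", "connection", "replication"],
        "tools": ["query stats", "EXPLAIN", "replication lag"],
    },
    FailureMode.COMPUTE: {
        "symptoms": ["CPU", "memory", "disk"],
        "tools": ["process stats", "OOM logs", "disk metrics"],
    },
    FailureMode.APPLICATION: {
        "symptoms": ["bugs", "exceptions", "regressions"],
        "tools": ["error rates", "traces", "recent deploys"],
    },
    FailureMode.EXTERNAL: {
        "symptoms": ["third-party APIs", "CDN", "DNS providers"],
        "tools": [],
    },
}


def tools_for(mode: FailureMode) -> list:
    """Return a copy of the tool labels associated with a failure mode."""
    return list(TAXONOMY[mode]["tools"])
```

Keeping the modes in an enum means a misspelled mode fails loudly at lookup time instead of silently routing nowhere.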
Why specialists by failure mode
Each mode rewards its own specialisation. The reasoning patterns differ (network: graph traversal; database: query analysis; compute: resource accounting); the tools differ (a network specialist does not need EXPLAIN, a database specialist does not need traceroute); and a specialist’s prompt can be tighter and more domain-specific, which tends to make it more reliable. Each specialist can also be tested against an eval suite for its own domain.
- Different reasoning patterns. Graph traversal, query analysis, resource accounting; mode-specific.
- Different tools. Network specialist doesn’t need EXPLAIN; database specialist doesn’t need traceroute.
- Tighter domain-specific prompts. A narrow prompt tends to be more reliable than a general one.
- Per-specialist eval suite. Each specialist is tested on its own domain, so quality work can be targeted.
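The points above (a narrow tool set, a narrow prompt, and a per-domain eval suite) could be captured in one small structure per specialist. This is a sketch under assumptions: `SpecialistSpec`, `can_use`, and `pass_rate` are hypothetical names, the prompts are placeholders, and exact string matching stands in for whatever grading a real eval suite would use.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Tuple


@dataclass(frozen=True)
class SpecialistSpec:
    mode: str
    system_prompt: str
    allowed_tools: FrozenSet[str]

    def can_use(self, tool: str) -> bool:
        """A specialist may only call tools on its own allowlist."""
        return tool in self.allowed_tools


NETWORK = SpecialistSpec(
    mode="network",
    system_prompt="You investigate connectivity, DNS, and inter-service latency.",
    allowed_tools=frozenset({"traceroute", "dig", "network_metrics"}),
)

DATABASE = SpecialistSpec(
    mode="database",
    system_prompt="You investigate query, connection, and replication problems.",
    allowed_tools=frozenset({"query_stats", "explain", "replication_lag"}),
)


def pass_rate(
    specialist: Callable[[str], str],
    cases: List[Tuple[str, str]],
) -> float:
    """Fraction of (incident, expected hypothesis) cases answered exactly."""
    if not cases:
        raise ValueError("eval suite is empty")
    passed = sum(1 for incident, expected in cases if specialist(incident) == expected)
    return passed / len(cases)
```

With this shape, `NETWORK.can_use("explain")` is false and `DATABASE.can_use("traceroute")` is false, which mirrors the tool split described in the text.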
Classifying the incoming alert
The classifier routes alerts to the right specialist. Most alerts have a clear primary mode: a database alert is a database failure, a 5xx spike is an application failure, packet loss is a network failure. Some alerts span modes: a “latency spike” could be database, network, or compute, so the classifier escalates to a triage specialist that gathers evidence first. The classifier itself is a small, fast LLM call, and its eval suite is a labelled alert-to-mode mapping.
- Clear primary mode for most. Database, 5xx application, packet-loss network; routing is obvious.
- Spanning alerts go to triage. “Latency spike” can be many modes; triage gathers evidence first.
- Small fast classifier. The classifier itself is cheap; the specialists are heavy.
- Alert-to-mode eval suite. The classifier’s test bench; rerunning it shows whether routing accuracy is holding.
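The routing rule and its eval suite can be sketched as follows. The text describes the classifier as a small LLM call; the keyword matcher here is only a toy stand-in so the example runs on its own. `RULES`, `classify`, `accuracy`, and the sample alerts are all illustrative assumptions.

```python
from typing import Callable, List, Tuple

# Toy stand-in for the LLM classifier: naive substring keywords per mode.
RULES = {
    "database": ("replication", "query", "connection pool"),
    "application": ("5xx", "exception"),
    "network": ("packet loss", "packet-loss", "dns"),
    "compute": ("oom", "cpu", "disk"),
}


def classify(alert: str) -> str:
    """Return the single matching mode, or 'triage' when zero or several match."""
    text = alert.lower()
    matches = [
        mode
        for mode, keywords in RULES.items()
        if any(keyword in text for keyword in keywords)
    ]
    return matches[0] if len(matches) == 1 else "triage"


# Labelled alert-to-mode cases; made-up examples, not real alerts.
EVAL_CASES: List[Tuple[str, str]] = [
    ("Replication lag above 30s on db-replica-2", "database"),
    ("5xx rate at 4% on checkout", "application"),
    ("Packet loss 12% between zones", "network"),
    ("Latency spike on api gateway", "triage"),
]


def accuracy(classifier: Callable[[str], str], cases: List[Tuple[str, str]]) -> float:
    """Share of labelled alerts routed to the expected destination."""
    if not cases:
        raise ValueError("eval suite is empty")
    correct = sum(1 for alert, expected in cases if classifier(alert) == expected)
    return correct / len(cases)
```

Because `accuracy` takes the classifier as an argument, the same labelled cases can score the toy matcher today and an LLM-backed function later.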
Composing specialists
Four roles compose the system. Triage identifies the mode and hands off to the right specialist; the Specialist investigates within its domain and produces a hypothesis plus a recommended action; Remediation acts on the hypothesis (this could be the specialist itself or a separate agent); Audit records the chain for postmortem and learning.
- Triage. Identify mode; hand off to specialist; the routing layer.
- Specialist. Investigate within domain; produce hypothesis and recommended action.
- Remediation. Act on hypothesis; specialist itself or separate agent.
- Audit. Record the chain for postmortem and learning.
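The hand-offs between the four roles can be sketched as a single function that calls each role in turn and records every step. All names here (`Finding`, `AuditLog`, `handle_incident`, and the stub functions) are hypothetical, and the stubs return canned values where real agents would investigate.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Finding:
    mode: str
    hypothesis: str
    recommended_action: str


@dataclass
class AuditLog:
    entries: List[Tuple[str, str]] = field(default_factory=list)

    def record(self, role: str, detail: str) -> None:
        self.entries.append((role, detail))


def handle_incident(
    alert: str,
    triage: Callable[[str], str],
    specialists: Dict[str, Callable[[str], Finding]],
    remediate: Callable[[Finding], str],
    audit: AuditLog,
) -> Finding:
    """Run triage, specialist, and remediation in order, auditing each step."""
    mode = triage(alert)
    audit.record("triage", mode)
    if mode not in specialists:
        raise LookupError(f"no specialist registered for mode {mode!r}")
    finding = specialists[mode](alert)
    audit.record("specialist", finding.hypothesis)
    outcome = remediate(finding)
    audit.record("remediation", outcome)
    return finding


# Stub roles with canned answers, for illustration only.
def stub_triage(alert: str) -> str:
    return "database"


def stub_db_specialist(alert: str) -> Finding:
    return Finding("database", "replica is lagging behind primary", "route reads to primary")


def stub_remediate(finding: Finding) -> str:
    return f"proposed: {finding.recommended_action}"


audit = AuditLog()
finding = handle_incident(
    "Replication lag on db-replica-2",
    stub_triage,
    {"database": stub_db_specialist},
    stub_remediate,
    audit,
)
```

Passing remediation in as its own callable keeps both options from the text open: it can wrap the specialist itself or point at a separate agent.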
How the taxonomy evolves
The taxonomy grows, then stabilises. In year one, 5 modes cover 80% of incidents, and misclassifications are treated as bugs that prompt a refinement of the taxonomy. In year two, 8 modes cover 90%, with new modes added based on observed gaps. Beyond year two, the taxonomy stops growing and specialists deepen instead. Per-mode coverage is documented each quarter throughout.
- Year 1: 5 modes, 80%. Misclassifications are bugs; refine the taxonomy.
- Year 2: 8 modes, 90%. New modes added based on observed gaps.
- Year 3+: stability. Taxonomy stops growing; specialists deepen instead.
- Per-mode coverage tracked. Documented per quarter; shows where further investment is warranted.
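Coverage figures like the 80% and 90% above can be computed from a quarter's incident labels. This is a sketch with made-up data: the function name `coverage`, the `"unclassified"` label, and the sample quarter are assumptions for illustration, not measurements.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def coverage(
    incident_modes: Iterable[str],
    taxonomy: List[str],
) -> Tuple[float, Dict[str, float]]:
    """Return (overall coverage, per-mode share) for one reporting period.

    An incident counts as covered when its label is one of the taxonomy's modes.
    """
    counts = Counter(incident_modes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no incidents in this period")
    covered = sum(counts[mode] for mode in taxonomy)
    per_mode = {mode: counts[mode] / total for mode in taxonomy}
    return covered / total, per_mode


YEAR_ONE_MODES = ["network", "database", "compute", "application", "external"]

# Hypothetical quarter: 10 incidents, 2 of which fit no existing mode.
quarter = (
    ["network"] * 3
    + ["database"] * 2
    + ["compute", "application", "external"]
    + ["unclassified"] * 2
)

overall, per_mode = coverage(quarter, YEAR_ONE_MODES)
```

The incidents that land outside the taxonomy are the observed gaps from which new modes would be proposed.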