Agent Specialization by Failure Mode: A Sketch

Network failures need different reasoning than database failures. This sketch lays out a taxonomy of failure modes and which modes each specialist handles.

The taxonomy

Five failure modes cover most incidents. Network failures involve connectivity, DNS, and latency between services (tools: traceroute, dig, network metrics). Database failures involve queries, connections, and replication (tools: query stats, EXPLAIN, replication lag). Compute failures involve CPU, memory, and disk (tools: process stats, OOM logs, disk metrics). Application failures involve bugs, exceptions, and regressions (tools: error rates, traces, recent deploys). External failures involve third-party APIs, CDN, and DNS providers.

Network failures. Connectivity, DNS, latency; traceroute, dig, network metrics.
Database failures. Query, connection, replication; query stats, EXPLAIN, replication lag.
Compute failures. CPU, memory, disk; process stats, OOM logs, disk metrics.
Application failures. Bugs, exceptions, regressions; error rates, traces, recent deploys.
External failures. Third-party APIs, CDN, DNS providers.
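
One way to make the taxonomy concrete is a small lookup table. The following is a minimal sketch, not a prescribed schema: the names FailureMode, ModeSpec, TAXONOMY, and tools_for are assumptions introduced here, and the symptom and tool strings are copied from the list above. The text names no tools for external failures, so that entry is left empty rather than guessed.

```python
from dataclasses import dataclass
from enum import Enum


class FailureMode(Enum):
    NETWORK = "network"
    DATABASE = "database"
    COMPUTE = "compute"
    APPLICATION = "application"
    EXTERNAL = "external"


@dataclass(frozen=True)
class ModeSpec:
    symptoms: tuple[str, ...]
    tools: tuple[str, ...]


# Hypothetical encoding of the five modes. No tools are listed for
# external failures in the text, so that tuple stays empty.
TAXONOMY: dict[FailureMode, ModeSpec] = {
    FailureMode.NETWORK: ModeSpec(
        ("connectivity", "DNS", "latency"),
        ("traceroute", "dig", "network metrics"),
    ),
    FailureMode.DATABASE: ModeSpec(
        ("query", "connection", "replication"),
        ("query stats", "EXPLAIN", "replication lag"),
    ),
    FailureMode.COMPUTE: ModeSpec(
        ("CPU", "memory", "disk"),
        ("process stats", "OOM logs", "disk metrics"),
    ),
    FailureMode.APPLICATION: ModeSpec(
        ("bugs", "exceptions", "regressions"),
        ("error rates", "traces", "recent deploys"),
    ),
    FailureMode.EXTERNAL: ModeSpec(
        ("third-party APIs", "CDN", "DNS providers"),
        (),
    ),
}


def tools_for(mode: FailureMode) -> tuple[str, ...]:
    """Return the tool names associated with a failure mode."""
    return TAXONOMY[mode].tools
```

Keeping the table as data rather than code means a new mode can be added in one place when the taxonomy changes.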

Why specialists by failure mode

Each mode rewards its own kind of specialisation. The reasoning patterns differ (network: graph traversal; database: query analysis; compute: resource accounting). The tools differ (a network specialist does not need EXPLAIN, a database specialist does not need traceroute). A specialist’s prompt can be tighter and more domain-specific, and a narrower prompt tends to behave more reliably. Each specialist can also be tested on its own domain with a dedicated eval suite.

Different reasoning patterns. Graph traversal, query analysis, resource accounting; mode-specific.
Different tools. Network specialist doesn’t need EXPLAIN; database specialist doesn’t need traceroute.
Tighter domain-specific prompts. Tighter prompts are more reliable; specialisation pays.
Per-specialist eval suite. Each specialist tested on its domain; supports targeted quality.
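
The "different tools" point can be enforced rather than just documented. The sketch below assumes each specialist carries an explicit tool allowlist; Specialist, ToolNotAllowed, and use_tool are hypothetical names, and the tool call itself is stubbed out.

```python
from dataclasses import dataclass


class ToolNotAllowed(Exception):
    """Raised when a specialist asks for a tool outside its domain."""


@dataclass(frozen=True)
class Specialist:
    name: str
    allowed_tools: frozenset[str]

    def use_tool(self, tool: str) -> str:
        if tool not in self.allowed_tools:
            raise ToolNotAllowed(f"{self.name} specialist may not call {tool}")
        # A real system would invoke the tool here; this sketch only reports it.
        return f"{self.name}: ran {tool}"


# Allowlists taken from the taxonomy's tool lists.
network = Specialist(
    "network", frozenset({"traceroute", "dig", "network metrics"})
)
database = Specialist(
    "database", frozenset({"query stats", "EXPLAIN", "replication lag"})
)
```

With this shape, network.use_tool("dig") succeeds, while network.use_tool("EXPLAIN") raises ToolNotAllowed, so an out-of-domain tool call shows up as an explicit error.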

Classifying the incoming alert

The classifier routes alerts to the right specialist. Most alerts have a clear primary mode: a database alert is database, a 5xx alert is application, a packet-loss alert is network. Some alerts span modes: a “latency spike” could be database, network, or compute, so the classifier escalates to a triage specialist that gathers evidence first. The classifier itself is a small, fast LLM call, and its eval suite is the alert-to-mode mapping itself.

Clear primary mode for most. Database, 5xx application, packet-loss network; routing is obvious.
Spanning alerts go to triage. “Latency spike” can be many modes; triage gathers evidence first.
Small fast classifier. The classifier itself is cheap; the specialists are heavy.
Alert-to-mode eval suite. The classifier’s test bench; supports continued accuracy.
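
The routing and its eval suite can be sketched together. In the example below, a keyword lookup stands in for the small LLM call, purely so the sketch runs without a model; KEYWORD_TO_MODE, classify, EVAL_SUITE, and accuracy are hypothetical names, and the sample alerts are made up for illustration.

```python
from typing import Callable

TRIAGE = "triage"

# Hypothetical keyword rules standing in for the small LLM call.
KEYWORD_TO_MODE = {
    "packet loss": "network",
    "replication": "database",
    "5xx": "application",
    "oom": "compute",
}


def classify(alert: str) -> str:
    """Return a single mode, or TRIAGE when the alert is ambiguous."""
    text = alert.lower()
    matches = {mode for kw, mode in KEYWORD_TO_MODE.items() if kw in text}
    # Exactly one matching mode routes directly; zero or several go to triage.
    return matches.pop() if len(matches) == 1 else TRIAGE


# The eval suite is the alert-to-mode mapping: (alert text, expected route).
EVAL_SUITE = [
    ("Packet loss between zones a and b", "network"),
    ("Replication lag above threshold", "database"),
    ("5xx rate elevated on checkout", "application"),
    ("Latency spike on api gateway", TRIAGE),
]


def accuracy(classifier: Callable[[str], str], suite: list[tuple[str, str]]) -> float:
    """Fraction of suite entries the classifier routes as expected."""
    hits = sum(1 for alert, expected in suite if classifier(alert) == expected)
    return hits / len(suite)
```

Because accuracy takes the classifier as an argument, the same suite can score the keyword stand-in today and a model-backed classifier later.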

Composing specialists

Four roles compose the system. Triage identifies the mode and hands off to the right specialist. The specialist investigates within its domain and produces a hypothesis plus a recommended action. Remediation acts on the hypothesis; it could be the specialist itself or a separate agent. Audit records the chain for postmortem and learning.

Triage. Identify mode; hand off to specialist; the routing layer.
Specialist. Investigate within domain; produce hypothesis and recommended action.
Remediation. Act on hypothesis; specialist itself or separate agent.
Audit. Record the chain for postmortem and learning.
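
The four roles can be wired together as a short pipeline. This is a sketch under assumptions: Finding, AuditLog, and handle_incident are hypothetical names, each role is a plain function, and the stub roles at the bottom exist only to show the flow.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Finding:
    hypothesis: str
    recommended_action: str


@dataclass
class AuditLog:
    entries: list[tuple[str, str]] = field(default_factory=list)

    def record(self, role: str, detail: str) -> None:
        self.entries.append((role, detail))


def handle_incident(
    alert: str,
    triage: Callable[[str], str],
    specialists: dict[str, Callable[[str], Finding]],
    remediate: Callable[[Finding], str],
    audit: AuditLog,
) -> str:
    """Run triage, specialist, remediation in order, auditing each step."""
    mode = triage(alert)
    audit.record("triage", mode)
    finding = specialists[mode](alert)
    audit.record("specialist", finding.hypothesis)
    outcome = remediate(finding)
    audit.record("remediation", outcome)
    return outcome


# Stub roles, purely illustrative.
def stub_triage(alert: str) -> str:
    return "database" if "replication" in alert.lower() else "application"


def stub_db_specialist(alert: str) -> Finding:
    return Finding("replica is lagging", "fail over reads to primary")


def stub_remediate(finding: Finding) -> str:
    return f"proposed: {finding.recommended_action}"
```

Calling handle_incident("Replication lag high", stub_triage, {"database": stub_db_specialist}, stub_remediate, AuditLog()) walks the chain once and leaves three audit entries, one per acting role.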

How the taxonomy evolves

The taxonomy grows, then stabilises. In year one, 5 modes cover 80% of incidents, and misclassifications are treated as bugs that prompt a refinement of the taxonomy. In year two, 8 modes cover 90%, with new modes added based on observed gaps. Beyond year two, the taxonomy stops growing and specialists deepen instead.

Year 1: 5 modes, 80%. Misclassifications are bugs; refine the taxonomy.
Year 2: 8 modes, 90%. New modes added based on observed gaps.
Year 3+: stability. Taxonomy stops growing; specialists deepen instead.
Per-mode coverage tracked. Documented per quarter; supports continued investment.
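
Coverage tracking reduces to simple counting over labelled incidents. The sketch below assumes each incident is tagged after the fact with a mode label; coverage, gap_candidates, and the sample quarter (including the "config" label) are made up for illustration and are not measured data.

```python
from collections import Counter


def coverage(incident_modes: list[str], taxonomy: set[str]) -> float:
    """Fraction of incidents whose labelled mode is in the taxonomy."""
    if not incident_modes:
        return 0.0
    covered = sum(1 for mode in incident_modes if mode in taxonomy)
    return covered / len(incident_modes)


def gap_candidates(incident_modes: list[str], taxonomy: set[str]) -> list[tuple[str, int]]:
    """Uncovered labels, most frequent first: candidates for new modes."""
    gaps = Counter(mode for mode in incident_modes if mode not in taxonomy)
    return gaps.most_common()


MODES = {"network", "database", "compute", "application", "external"}

# Hypothetical quarter: ten incidents, two of which fall outside the taxonomy.
QUARTER = [
    "network", "database", "database", "compute", "application",
    "application", "external", "network", "config", "config",
]
```

For this sample, coverage(QUARTER, MODES) is 0.8 and gap_candidates(QUARTER, MODES) surfaces the uncovered label with its count, which is the kind of observed gap that would motivate a new mode.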