Agent Specialization by Failure Mode: A Sketch

Network failures need different reasoning than database failures. This sketch lays out a taxonomy of failure modes and which modes each specialist handles.

The taxonomy

Five failure modes cover most incidents. Network failures (connectivity, DNS, latency between services; tools: traceroute, dig, network metrics); database failures (query, connection, replication; tools: query stats, EXPLAIN, replication lag); compute failures (CPU, memory, disk; tools: process stats, OOM logs); application failures (bugs, exceptions, regressions; tools: error rates, traces, recent deploys); external failures (third-party APIs, CDN, DNS providers).
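One way to make the taxonomy concrete is to hold it as data, so that each specialist's scope and tool list come from a single table. The following is a minimal sketch under stated assumptions: the names FailureMode, ModeSpec, TAXONOMY, and tools_for are hypothetical, and the entries simply restate the modes and tools listed above.

```python
from dataclasses import dataclass
from enum import Enum


class FailureMode(Enum):
    NETWORK = "network"
    DATABASE = "database"
    COMPUTE = "compute"
    APPLICATION = "application"
    EXTERNAL = "external"


@dataclass(frozen=True)
class ModeSpec:
    covers: tuple  # what kinds of problems fall under the mode
    tools: tuple   # evidence sources the specialist may use


# Hypothetical table; entries restate the taxonomy in the text.
TAXONOMY = {
    FailureMode.NETWORK: ModeSpec(
        covers=("connectivity", "DNS", "latency between services"),
        tools=("traceroute", "dig", "network metrics"),
    ),
    FailureMode.DATABASE: ModeSpec(
        covers=("query", "connection", "replication"),
        tools=("query stats", "EXPLAIN", "replication lag"),
    ),
    FailureMode.COMPUTE: ModeSpec(
        covers=("CPU", "memory", "disk"),
        tools=("process stats", "OOM logs"),
    ),
    FailureMode.APPLICATION: ModeSpec(
        covers=("bugs", "exceptions", "regressions"),
        tools=("error rates", "traces", "recent deploys"),
    ),
    FailureMode.EXTERNAL: ModeSpec(
        covers=("third-party APIs", "CDN", "DNS providers"),
        tools=(),  # the text lists no tools for this mode; left empty rather than guessed
    ),
}


def tools_for(mode: FailureMode) -> tuple:
    """Return the tool allowlist for one failure mode."""
    return TAXONOMY[mode].tools
```

Keeping the tool list per mode in one place means a specialist can be given only the tools its mode names.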

Why specialists by failure mode

Each mode rewards a different specialisation. Each mode requires different reasoning patterns (network: graph traversal; database: query analysis; compute: resource accounting); each mode requires different tools (a network specialist does not need EXPLAIN, a database specialist does not need traceroute); a specialist’s prompt can be tighter and more domain-specific, and tighter prompts are more reliable.

Classifying the incoming alert

The classifier routes alerts to the right specialist. Most alerts have a clear primary mode (a database alert is database, a 5xx alert is application, a packet-loss alert is network); some alerts span modes (a “latency spike” could be database, network, or compute, so the classifier escalates to a triage specialist that gathers evidence first); the classifier itself is a small, fast LLM call, and its eval suite is the alert-to-mode mapping.
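The routing and its eval suite can be sketched as follows. This is an illustration only: a keyword matcher stands in for the LLM call so the example runs offline, and every name here (KEYWORD_RULES, keyword_classifier, EVAL_SUITE, run_eval) and every sample alert is hypothetical. The part that carries over to a real classifier is the shape of the eval suite: a list of alerts paired with the expected route, with ambiguous alerts expected to go to triage.

```python
TRIAGE = "triage"  # route for alerts with no single clear primary mode

# Stand-in rules. The text describes a small LLM call here, not keyword matching.
KEYWORD_RULES = {
    "database": ("database", "replication", "query"),
    "application": ("5xx", "exception"),
    "network": ("packet-loss", "packet loss", "dns"),
    "compute": ("oom", "cpu", "disk"),
}


def keyword_classifier(alert: str) -> str:
    """Return the single matching mode, or TRIAGE when zero or several match."""
    text = alert.lower()
    matches = [
        mode
        for mode, words in KEYWORD_RULES.items()
        if any(word in text for word in words)
    ]
    return matches[0] if len(matches) == 1 else TRIAGE


# The eval suite is the alert-to-mode mapping: (alert, expected route) pairs.
EVAL_SUITE = [
    ("database connection pool exhausted", "database"),
    ("5xx rate above threshold on checkout", "application"),
    ("packet-loss between zone a and zone b", "network"),
    ("latency spike on api gateway", TRIAGE),
]


def run_eval(classify, suite):
    """Run any classifier callable over the suite and collect mismatches."""
    failures = []
    for alert, expected in suite:
        got = classify(alert)
        if got != expected:
            failures.append((alert, expected, got))
    return failures


failures = run_eval(keyword_classifier, EVAL_SUITE)
```

Because run_eval takes the classifier as a callable, the same suite could be pointed at the LLM-backed version without changing the cases.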

Composing specialists

Four roles compose the system. Triage identifies the mode and hands off to the right specialist; the Specialist investigates within its domain and produces a hypothesis plus a recommended action; Remediation acts on the hypothesis (it could be the specialist itself or a separate agent); Audit records the chain for postmortems and learning.
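The hand-offs between the four roles can be sketched as a single function. This is a minimal sketch, not a definitive design: Finding, Incident, handle, and the stub functions are all hypothetical names, the roles are plain callables rather than agents, and Audit is modelled as an append-only log on the incident.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Finding:
    hypothesis: str
    recommended_action: str


@dataclass
class Incident:
    alert: str
    # Audit trail: (role, detail) pairs in the order the roles ran.
    audit_log: List[Tuple[str, str]] = field(default_factory=list)


def handle(incident, triage, specialists, remediate):
    """Run triage -> specialist -> remediation, recording each step for audit."""
    mode = triage(incident.alert)
    incident.audit_log.append(("triage", mode))

    finding = specialists[mode](incident.alert)
    incident.audit_log.append(("specialist", finding.hypothesis))

    outcome = remediate(finding)
    incident.audit_log.append(("remediation", outcome))
    return finding


# Hypothetical stubs standing in for real agents.
def stub_triage(alert):
    return "database"


def stub_db_specialist(alert):
    return Finding(
        hypothesis="replica is lagging",
        recommended_action="fail reads over to primary",
    )


def stub_remediate(finding):
    return f"proposed: {finding.recommended_action}"


incident = Incident(alert="replication lag above threshold")
finding = handle(
    incident,
    stub_triage,
    {"database": stub_db_specialist},
    stub_remediate,
)
```

Passing remediate as a separate callable leaves open whether remediation is the specialist itself or a separate agent.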

How the taxonomy evolves

The taxonomy grows, then stabilises. Year one: 5 modes cover 80% of incidents, and misclassifications are treated as bugs that prompt refining the taxonomy. Year two: 8 modes cover 90%, with new modes added based on observed gaps. Beyond year two: the taxonomy stops growing and specialists deepen instead.
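The coverage figures and the "observed gaps" can be computed from postmortem labels. The sketch below assumes each past incident has been labelled with its true mode after the fact; the functions coverage and candidate_modes, the sample labels, and the "configuration" gap are all made up for illustration.

```python
from collections import Counter


def coverage(incident_labels, taxonomy):
    """Fraction of incidents whose postmortem label is a mode in the taxonomy."""
    if not incident_labels:
        raise ValueError("need at least one labelled incident")
    covered = sum(1 for label in incident_labels if label in taxonomy)
    return covered / len(incident_labels)


def candidate_modes(incident_labels, taxonomy):
    """Labels outside the taxonomy, most frequent first: candidates for new modes."""
    gaps = Counter(label for label in incident_labels if label not in taxonomy)
    return gaps.most_common()


taxonomy = {"network", "database", "compute", "application", "external"}

# Hypothetical postmortem labels for ten incidents.
labels = (
    ["network"] * 2
    + ["database"] * 3
    + ["compute"]
    + ["application"] * 2
    + ["configuration"] * 2
)

year_one_coverage = coverage(labels, taxonomy)  # 8 of 10 incidents are covered
gaps = candidate_modes(labels, taxonomy)
```

Tracking these two numbers over time gives a simple signal for when to add a mode and when the taxonomy has stopped growing.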