Hand-off Patterns Between Triage and Remediation Agents
Triage produces a hypothesis. Remediation acts on it. The handoff schema, the validation step in between, and the case where remediation should refuse the handoff.
The handoff schema
Triage output: hypothesis (with confidence), evidence (the data that supports the hypothesis), recommended action (the next step).
Each field is required. Missing fields fail the handoff; the remediation agent does not act on partial input.
Schema is versioned. Changes to the schema are coordinated; both agents see the change at the same time.
Validate before handing off
Confidence threshold: if triage confidence < 0.7, the handoff stops; the case escalates to a human instead.
Action allowlist: triage's recommended action must be in the remediation agent's tool allowlist. If not, the handoff fails.
Evidence freshness: the evidence must be < 5 minutes old. Older evidence may be stale; refresh before acting.
When remediation should refuse
Hypothesis is implausible given the evidence. The remediation agent does its own check; if the hypothesis does not match the evidence, refuse.
Action is high-risk and confidence is borderline. The remediation agent escalates rather than acting.
Required pre-conditions are not met. Pre-flight checks fail; the action does not fire.
Auditing the handoff chain
Each handoff is logged: triage agent, hypothesis, confidence, recommended action, remediation agent, action taken (or refused).
The chain is reconstructable for any past run. "Why did the remediation agent restart this pod?" answers from the chain.
When things go wrong, the handoff log is the first thing to read. Was the bug in triage, in remediation, or in the handoff itself?
Eval cases for the chain
Successful handoff: triage produces correct hypothesis; remediation acts; outcome matches expected.
Refusal handoff: triage produces correct hypothesis with low confidence; remediation correctly refuses.
Stale handoff: evidence is old; remediation correctly refuses or refreshes.
Wrong-action handoff: triage suggests an action outside remediation's scope; the handoff correctly fails.