Handoff Patterns Between Triage and Remediation Agents
Triage produces a hypothesis; remediation acts on it. What follows covers the handoff schema, the validation step between the two agents, the cases where remediation should refuse the handoff, and how to audit and evaluate the chain.
The handoff schema
The handoff is a typed object that crosses the boundary between agents. Treat it like an API contract, not a free-text message, and the failure modes get tractable.
- Triage output. Hypothesis with confidence, evidence that supports the hypothesis, recommended action that follows from both.
- Required fields. Every field is mandatory. Missing fields fail the handoff and the remediation agent does not act on partial input.
- Versioned schema. The shape is pinned to a version number. Both agents pick up schema changes in the same deploy, never one ahead of the other.
- Idempotency key. Each handoff carries a unique ID so the remediation agent can detect and reject replays.
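The schema properties above can be sketched as a frozen dataclass plus a strict parser. This is a minimal illustration, not a reference implementation: the field names, `SCHEMA_VERSION`, `parse_handoff`, and `ReplayGuard` are all hypothetical, and the in-memory replay set stands in for whatever durable store a real deployment would use.

```python
from dataclasses import dataclass, fields
from datetime import datetime, timezone

SCHEMA_VERSION = 1  # hypothetical version number, bumped on any shape change


@dataclass(frozen=True)
class Handoff:
    """Typed object that crosses the triage -> remediation boundary."""
    schema_version: int
    idempotency_key: str
    hypothesis: str
    confidence: float
    evidence: tuple
    evidence_collected_at: datetime
    recommended_action: str


REQUIRED_FIELDS = tuple(f.name for f in fields(Handoff))


def parse_handoff(payload: dict) -> Handoff:
    """Build a Handoff, rejecting payloads with missing or empty fields."""
    missing = [n for n in REQUIRED_FIELDS if payload.get(n) in (None, "", [], ())]
    if missing:
        raise ValueError(f"handoff rejected, missing fields: {missing}")
    values = {n: payload[n] for n in REQUIRED_FIELDS}
    values["evidence"] = tuple(values["evidence"])  # keep the object immutable
    return Handoff(**values)


class ReplayGuard:
    """Remembers idempotency keys so a replayed handoff is rejected."""

    def __init__(self):
        self._seen = set()  # assumption: in-memory; production needs a durable store

    def accept(self, handoff: Handoff) -> bool:
        if handoff.idempotency_key in self._seen:
            return False
        self._seen.add(handoff.idempotency_key)
        return True


example_payload = {
    "schema_version": SCHEMA_VERSION,
    "idempotency_key": "handoff-0001",
    "hypothesis": "pod is crash-looping after an out-of-memory kill",
    "confidence": 0.9,
    "evidence": ["OOMKilled event on the pod"],
    "evidence_collected_at": datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
    "recommended_action": "restart_pod",
}
```

Parsing `example_payload` yields a `Handoff`; removing any key raises `ValueError`, so the remediation agent never sees partial input.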
Validate before handing off
Validation runs at the boundary, before the remediation agent ever sees the payload. Four checks catch most of the bad cases.
- Confidence threshold. If triage confidence is below 0.7, the handoff stops and the case escalates to a human instead.
- Action allowlist. Triage’s recommended action must appear in the remediation agent’s tool allowlist. If not, the handoff fails closed.
- Evidence freshness. Evidence must be under five minutes old. Stale evidence triggers a refresh before the action fires.
- Schema version match. The triage version must match what the remediation agent expects. A mismatch fails the handoff and pages the platform team.
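A minimal sketch of the boundary checks, assuming the handoff arrives as a dictionary with the keys shown. The thresholds (0.7, five minutes) come from the list above; the `Outcome` names, the allowlist contents, the expected version, and the order in which the checks run are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum


class Outcome(Enum):
    PASS = "pass"
    ESCALATE_TO_HUMAN = "escalate_to_human"
    FAIL_CLOSED = "fail_closed"
    REFRESH_EVIDENCE = "refresh_evidence"
    PAGE_PLATFORM_TEAM = "page_platform_team"


MIN_CONFIDENCE = 0.7
MAX_EVIDENCE_AGE = timedelta(minutes=5)
EXPECTED_SCHEMA_VERSION = 1  # hypothetical
ACTION_ALLOWLIST = frozenset({"restart_pod", "scale_up"})  # hypothetical


def validate_handoff(handoff: dict, now: datetime) -> Outcome:
    """Run the boundary checks; the first failing check decides the outcome."""
    # Version first: a mismatched shape makes the other fields untrustworthy.
    if handoff["schema_version"] != EXPECTED_SCHEMA_VERSION:
        return Outcome.PAGE_PLATFORM_TEAM
    if handoff["confidence"] < MIN_CONFIDENCE:
        return Outcome.ESCALATE_TO_HUMAN
    if handoff["recommended_action"] not in ACTION_ALLOWLIST:
        return Outcome.FAIL_CLOSED
    # "Under five minutes old": an age of exactly five minutes counts as stale.
    if now - handoff["evidence_collected_at"] >= MAX_EVIDENCE_AGE:
        return Outcome.REFRESH_EVIDENCE
    return Outcome.PASS


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
fresh = {
    "schema_version": 1,
    "confidence": 0.85,
    "recommended_action": "restart_pod",
    "evidence_collected_at": now - timedelta(minutes=2),
}
result = validate_handoff(fresh, now)
```

Returning an outcome rather than raising keeps each failure mode distinct, so the caller can route it to a human, a refresh, or a page.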
When remediation should refuse
The remediation agent is not a rubber stamp. It runs its own checks, and refusing is often the right answer.
- Implausible hypothesis. The remediation agent re-reads the evidence. If the hypothesis does not follow from what is actually there, refuse and escalate.
- High-risk plus borderline confidence. A risky action with confidence between 0.7 and 0.8 escalates to a human rather than firing automatically.
- Failed pre-conditions. If a pre-flight check for resource quota, deploy freeze, or a related open incident fails, the action does not fire.
- Recent same-action history. If the same remediation ran in the last 15 minutes and did not resolve the symptom, refuse and escalate; repeating it is unlikely to help.
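The refusal rules above could be combined into a single decision function, sketched here under several assumptions: the plausibility re-read is the agent's own judgment and is passed in as a boolean, the set of high-risk actions is hypothetical, and the 0.7 to 0.8 band is treated as inclusive at both ends. `PastRun` and `refusal_reason` are illustrative names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class PastRun:
    action: str
    finished_at: datetime
    resolved: bool


HIGH_RISK_ACTIONS = frozenset({"rollback_deploy"})  # hypothetical
BORDERLINE_LOW, BORDERLINE_HIGH = 0.7, 0.8
REPEAT_WINDOW = timedelta(minutes=15)


def refusal_reason(action: str, confidence: float, hypothesis_plausible: bool,
                   preflight: dict, history: list, now: datetime) -> Optional[str]:
    """Return why remediation refuses, or None if the action may fire."""
    if not hypothesis_plausible:
        return "implausible_hypothesis"
    if action in HIGH_RISK_ACTIONS and BORDERLINE_LOW <= confidence <= BORDERLINE_HIGH:
        return "high_risk_borderline_confidence"
    # preflight maps check name -> passed?, e.g. {"deploy_freeze": True}
    failed = sorted(name for name, passed in preflight.items() if not passed)
    if failed:
        return "failed_preconditions:" + ",".join(failed)
    for run in history:
        recent = now - run.finished_at <= REPEAT_WINDOW
        if run.action == action and recent and not run.resolved:
            return "recent_unresolved_repeat"
    return None


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
checks_ok = {"resource_quota": True, "deploy_freeze": True, "related_incident": True}
earlier = [PastRun("restart_pod", now - timedelta(minutes=10), resolved=False)]
reason = refusal_reason("restart_pod", 0.9, True, checks_ok, earlier, now)
```

In this example `reason` is `"recent_unresolved_repeat"`: the same restart ran ten minutes earlier without resolving the symptom.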
Auditing the handoff chain
The chain is reconstructable for every run. That property is what makes the system reviewable; without it, you cannot tell where a bad outcome came from.
- Per-handoff log line. Triage agent, hypothesis, confidence, recommended action, remediation agent, action taken or refused. One row, one handoff.
- Replay path. Any past run can be reconstructed end to end. “Why did the remediation agent restart this pod?” gets answered from the log alone.
- Bug localisation. When things go wrong, the handoff log is the first thing on-call reads. The fault sits in triage, in remediation, or in the handoff itself, and the log makes it obvious which.
- Retention. Keep handoff logs for at least 90 days. Many agent regressions surface only when a similar incident recurs weeks later.
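One way to get "one row, one handoff" is a JSON line per handoff. The sketch below assumes JSON Lines as the log format and invents the key names; `handoff_log_line` and `find_handoff` are hypothetical helpers, and the lookup is a linear scan rather than an indexed query.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def handoff_log_line(*, handoff_id: str, triage_agent: str, hypothesis: str,
                     confidence: float, recommended_action: str,
                     remediation_agent: str, outcome: str,
                     refusal_reason: Optional[str], at: datetime) -> str:
    """Serialise one handoff as one JSON line (key names are illustrative)."""
    row = {
        "handoff_id": handoff_id,
        "at": at.isoformat(),
        "triage_agent": triage_agent,
        "hypothesis": hypothesis,
        "confidence": confidence,
        "recommended_action": recommended_action,
        "remediation_agent": remediation_agent,
        "outcome": outcome,  # e.g. "acted" or "refused"
        "refusal_reason": refusal_reason,
    }
    return json.dumps(row, sort_keys=True)


def find_handoff(log_lines: list, handoff_id: str) -> Optional[dict]:
    """Answer "why did this happen?" from the log alone."""
    for line in log_lines:
        row = json.loads(line)
        if row["handoff_id"] == handoff_id:
            return row
    return None


line = handoff_log_line(
    handoff_id="handoff-0001",
    triage_agent="triage-v3",
    hypothesis="pod is crash-looping after an out-of-memory kill",
    confidence=0.9,
    recommended_action="restart_pod",
    remediation_agent="remediation-v2",
    outcome="acted",
    refusal_reason=None,
    at=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
)
row = find_handoff([line], "handoff-0001")
```

Because the row carries both the triage side (hypothesis, confidence, recommendation) and the remediation side (outcome, refusal reason), a reader can tell from one record which side of the boundary to investigate.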
Eval cases for the chain
The eval set covers the four canonical paths through the handoff. Run it on every change to either agent or to the schema.
- Successful handoff. Triage produces a correct hypothesis; remediation acts; outcome matches the expected resolution.
- Refusal handoff. Triage produces a correct hypothesis with borderline confidence for a high-risk action; remediation correctly refuses and escalates.
- Stale handoff. Evidence is older than the freshness threshold; remediation correctly refreshes or refuses.
- Wrong-action handoff. Triage suggests an action outside the remediation agent’s allowlist; the handoff fails closed at the boundary.
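The four paths lend themselves to a table-driven eval. In this sketch `run_chain` is a deliberately tiny stand-in for the real triage, boundary, and remediation chain; the action names, outcome labels, and the choice of a high-risk action at confidence 0.75 for the refusal case are all assumptions made for illustration.

```python
from dataclasses import dataclass

ALLOWLIST = frozenset({"restart_pod", "rollback_deploy"})  # hypothetical
HIGH_RISK = frozenset({"rollback_deploy"})  # hypothetical


@dataclass(frozen=True)
class EvalCase:
    name: str
    action: str
    confidence: float
    evidence_age_s: int
    expected: str


def run_chain(case: EvalCase) -> str:
    """Stand-in for the real chain; replace with a call into the agents."""
    if case.action not in ALLOWLIST:
        return "failed_closed"
    if case.evidence_age_s >= 300:
        return "refreshed"
    borderline_risky = case.action in HIGH_RISK and case.confidence <= 0.8
    if case.confidence < 0.7 or borderline_risky:
        return "escalated"
    return "acted"


CASES = [
    EvalCase("successful", "restart_pod", 0.95, 30, "acted"),
    EvalCase("refusal", "rollback_deploy", 0.75, 30, "escalated"),
    EvalCase("stale", "restart_pod", 0.95, 600, "refreshed"),
    EvalCase("wrong_action", "drop_database", 0.95, 30, "failed_closed"),
]


def failing_cases(chain=run_chain) -> list:
    """Names of the cases whose outcome differs from the expected one."""
    return [case.name for case in CASES if chain(case) != case.expected]
```

Wiring `failing_cases()` into CI and failing the build on a non-empty result is one way to run the set on every change to either agent or to the schema.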