Agentic SRE Advanced By Samson Tanimawo, PhD Published Jul 12, 2026

The Confusion Matrix Adapted for SRE Agent Output

True positives, false positives, missed pages, and the SRE-specific quadrant most teams forget. How to track each in production agents.

The classic 2x2

True positive: agent correctly identifies a real incident as needing attention. False positive: agent flags a non-issue. False negative: agent misses a real incident. True negative: agent correctly ignores noise.

Each cell has a different cost. False negatives are usually catastrophic in SRE; false positives are merely annoying. The matrix is asymmetric in cost; do not optimise for accuracy alone.
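To illustrate why accuracy alone misleads, here is a minimal sketch that scores two agents with identical accuracy but very different cost. The per-cell costs (`COST_FP`, `COST_FN`) and the counts are made-up values for illustration, not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    tp: int
    fp: int
    fn: int
    tn: int


# Hypothetical costs in arbitrary units; correct outcomes are treated as free.
COST_FP = 1.0   # an annoying but harmless flag
COST_FN = 50.0  # a missed real incident


def accuracy(c: Counts) -> float:
    total = c.tp + c.fp + c.fn + c.tn
    return (c.tp + c.tn) / total if total else 0.0


def weighted_cost(c: Counts) -> float:
    return c.fp * COST_FP + c.fn * COST_FN


agent_a = Counts(tp=40, fp=20, fn=0, tn=940)   # noisy, misses nothing
agent_b = Counts(tp=30, fp=10, fn=10, tn=950)  # quieter, misses ten incidents

for name, counts in (("A", agent_a), ("B", agent_b)):
    print(name, accuracy(counts), weighted_cost(counts))
```

Both agents score 0.98 on accuracy, yet under these assumed costs agent B is far more expensive because of its ten missed incidents.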

Track each cell over time. The week-over-week deltas tell you whether the agent is getting better, worse, or differently-broken.
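A week-over-week comparison can be as simple as subtracting per-cell counts. The sketch below assumes the weekly counts are already available as `Counter` objects; the numbers are invented.

```python
from collections import Counter

CELLS = ("TP", "FP", "FN", "TN")


def week_over_week(previous: Counter, current: Counter) -> dict[str, int]:
    """Return the change in each cell's count from one week to the next."""
    return {cell: current[cell] - previous[cell] for cell in CELLS}


last_week = Counter({"TP": 31, "FP": 12, "FN": 2, "TN": 410})
this_week = Counter({"TP": 29, "FP": 7, "FN": 5, "TN": 402})

deltas = week_over_week(last_week, this_week)
print(deltas)
```

In this made-up example FP fell by 5 while FN rose by 3: the agent got quieter but started missing more, which is the "differently-broken" case.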

The SRE-specific quadrant most teams forget

False-positive-with-cost: the agent acted, the action was unnecessary, and the action had a cost (a restart that briefly degraded service, a ticket that wasted on-call attention). This is not a normal false positive; it is more expensive.

Track FP-with-cost separately from plain FP. Reducing it may mean tolerating more plain FPs in exchange for firing fewer expensive actions.

Most teams miss this quadrant because their evaluation habits come from domains such as content moderation, where actions are cheap. SRE actions are not cheap; the matrix needs the extra dimension.
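One way to separate the two kinds of false positive is to record a cost on every run and split the FP cell on it. This is a sketch under assumptions: the `Run` fields and the idea of a single numeric `action_cost` are hypothetical, and a real system may need richer cost data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    agent_flagged: bool       # did the agent flag or act on this event?
    real_incident: bool       # ground truth from the human on-call
    action_cost: float = 0.0  # hypothetical unit, e.g. minutes of degraded service


def cell(run: Run) -> str:
    """Map a run to a matrix cell, splitting FP by whether the action cost anything."""
    if run.agent_flagged and run.real_incident:
        return "TP"
    if run.agent_flagged:
        return "FP_WITH_COST" if run.action_cost > 0 else "FP"
    return "FN" if run.real_incident else "TN"


print(cell(Run(agent_flagged=True, real_incident=False)))                   # FP
print(cell(Run(agent_flagged=True, real_incident=False, action_cost=3.0)))  # FP_WITH_COST
```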

How to instrument

Each agent run emits a result label. Each label is later compared to ground truth (the human on-call's verdict). The comparison populates the matrix.

Ground truth is gathered by a daily job that pulls the human's actions from the incident management system. The match between agent label and human label is the eval signal.

Send ambiguous cases to manual review. "The agent recommended X; the human did Y, but X would have worked too" is alternative-correct, not wrong.
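The comparison step, including the alternative-correct outcome, might look like the sketch below. The action strings and the `review` verdict values are assumptions made for illustration; the incident management system that supplies the human's action is not modeled here.

```python
from collections import Counter
from typing import Optional

REVIEW_VERDICTS = {"alternative_correct", "wrong"}


def eval_signal(agent_action: str, human_action: str, review: Optional[str] = None) -> str:
    """Compare the agent's recommendation with what the human actually did."""
    if agent_action == human_action:
        return "correct"
    if review is None:
        return "needs_review"  # a mismatch waits for a manual verdict
    if review not in REVIEW_VERDICTS:
        raise ValueError(f"unknown review verdict: {review!r}")
    return review


# (agent recommendation, human action, optional reviewer verdict)
daily_batch = [
    ("restart", "restart", None),
    ("restart", "rollback", "alternative_correct"),
    ("scale_up", "rollback", "wrong"),
    ("ignore", "page", None),
]

signals = Counter(eval_signal(a, h, r) for a, h, r in daily_batch)
print(signals)
```

Keeping `needs_review` as its own outcome stops unreviewed mismatches from being counted as errors before a human has looked at them.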

Tuning the trade-off

Tighten the agent's threshold for action: fewer FPs, more FNs. Loosen it: more FPs, fewer FNs. Pick the trade that matches your team's tolerance.
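The effect of moving the threshold can be checked by replaying labeled history at several settings. This sketch assumes the agent emits a confidence score per event and acts when the score is at or above the threshold; the scored examples are invented.

```python
# (agent confidence score, was it a real incident?) - invented history
scored = [
    (0.95, True),
    (0.80, True),
    (0.70, False),
    (0.60, True),
    (0.40, False),
    (0.30, True),
    (0.10, False),
]


def fp_fn(threshold: float) -> tuple[int, int]:
    """Count false positives and false negatives if the agent acts at >= threshold."""
    fp = sum(1 for score, real in scored if score >= threshold and not real)
    fn = sum(1 for score, real in scored if score < threshold and real)
    return fp, fn


for threshold in (0.2, 0.5, 0.9):
    print(threshold, fp_fn(threshold))
```

On this data the loose setting (0.2) yields 2 FPs and 0 FNs, the middle (0.5) yields 1 and 1, and the tight setting (0.9) yields 0 FPs and 3 FNs.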

The right trade differs by agent. Triage agents should err toward false positives (better to over-investigate than to miss an incident). Remediation agents should err toward false negatives (better to escalate than to act wrongly).

Document the trade. "This agent is tuned for high recall, accepting 10% FP, because missing real incidents is far more expensive than the FP cost." Future you will want this written down.
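The written-down trade can also live next to the agent as a small record that the eval job checks. This is only a sketch: the field names and agent name are hypothetical, and it assumes "10% FP" means that at most 10% of flagged runs are false positives, which is one possible reading.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TuningRecord:
    agent: str
    tuned_for: str
    max_fp_share: float  # assumption: FP / (TP + FP), the share of flags that were wrong
    rationale: str

    def within_budget(self, tp: int, fp: int) -> bool:
        """True if the observed share of false flags stays inside the documented budget."""
        flagged = tp + fp
        return flagged == 0 or fp / flagged <= self.max_fp_share


record = TuningRecord(
    agent="triage-agent",  # hypothetical agent name
    tuned_for="recall",
    max_fp_share=0.10,
    rationale="Missing real incidents is far more expensive than the FP cost.",
)

print(record.within_budget(tp=95, fp=5))   # True
print(record.within_budget(tp=80, fp=20))  # False
```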

The dashboard panel

Five cells (TP, TN, FN, FP, and FP-with-cost), each a count over the last 7 days. Color the cells by severity: TP and TN green, FN red, FP yellow, FP-with-cost orange. Eyes go to red first.

Add a tooltip showing the five most recent examples per cell, so operators can click through to see what the agent got right or wrong.

Refresh daily. Real-time is overkill for an eval signal; daily is the right cadence.
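The data behind such a panel could be assembled by a daily job along these lines. The record shape `(date, cell, run_id)`, the run ids, and the output dictionary are assumptions for illustration; rendering is left to whatever dashboard tool is in use.

```python
from collections import defaultdict
from datetime import date, timedelta

# Color scheme as described above.
COLORS = {"TP": "green", "TN": "green", "FN": "red", "FP": "yellow", "FP_WITH_COST": "orange"}


def build_panel(records, today, window_days=7, examples_per_cell=5):
    """Return count, color, and most recent example run ids for each cell."""
    cutoff = today - timedelta(days=window_days)
    # Keep only records inside the window, newest first.
    recent = sorted(
        (r for r in records if cutoff < r[0] <= today),
        key=lambda r: r[0],
        reverse=True,
    )
    by_cell = defaultdict(list)
    for _day, cell, run_id in recent:
        by_cell[cell].append(run_id)
    return {
        cell: {
            "count": len(by_cell[cell]),
            "color": color,
            "examples": by_cell[cell][:examples_per_cell],
        }
        for cell, color in COLORS.items()
    }


today = date(2026, 7, 12)
records = [
    (date(2026, 7, 12), "FN", "run-107"),
    (date(2026, 7, 11), "FP", "run-106"),
    (date(2026, 7, 10), "FP_WITH_COST", "run-105"),
    (date(2026, 7, 9), "TP", "run-104"),
    (date(2026, 7, 4), "FN", "run-101"),  # outside the 7-day window
]

panel = build_panel(records, today)
print(panel["FN"])
```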