Per-Decision Confidence: Surfacing It Without Over-Surfacing
Confidence scores are useful; confidence-overload makes operators numb. The confidence-budget pattern surfaces only the decisions that actually need a second look.
Why surface confidence at all
Operators need to know when to trust the agent and when to second-guess. Confidence scores convey that signal cheaply.
Without confidence, every agent output looks the same. The operator has to rebuild trust from scratch on each interaction.
Confidence is also a signal for downstream automation: high-confidence outputs auto-route; low-confidence escalates to human.
The over-surfacing trap
Showing confidence on every output makes operators numb. The signal becomes background noise within a week.
Worse, operators start ignoring confidence entirely. The information is there but unread; the cost was paid for no benefit.
Confidence-overload is the natural failure mode of teams that decide to surface confidence without thinking about how operators will consume it.
Confidence-budget pattern
Set a budget: surface confidence on at most 10 cases per day per operator. Not every case; the most important ones.
The 10 cases: the lowest-confidence outputs, the highest-stakes ones, and the ones where the agent flagged uncertainty explicitly.
Operators read 10 confidence scores carefully. They would have ignored 100. The budget makes the signal usable.
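A minimal sketch of how the budget selection might work, assuming each case carries a model-reported confidence, a stakes score, and an explicit uncertainty flag (the `Case` fields and the budget of 10 are illustrative, not a fixed API):

```python
from dataclasses import dataclass

@dataclass
class Case:
    case_id: str
    confidence: float        # model-reported confidence, 0.0-1.0
    stakes: float            # business-impact score; higher = riskier
    flagged_uncertain: bool  # did the agent explicitly flag uncertainty?

def select_for_review(cases: list[Case], budget: int = 10) -> list[Case]:
    """Pick at most `budget` cases per operator per day:
    explicit uncertainty flags first, then lowest confidence,
    then highest stakes."""
    ranked = sorted(
        cases,
        key=lambda c: (not c.flagged_uncertain, c.confidence, -c.stakes),
    )
    return ranked[:budget]
```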
Confidence thresholds for auto-routing
>0.9: auto-act (for actions the agent is allowed to take). Operator sees a notification, not a request for action.
0.7-0.9: surface to operator. Confidence-budget candidate; show the score because the case is borderline.
<0.7: escalate. The agent itself is unsure; operator decides what happens next.
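The thresholds translate directly into a routing function. A sketch, assuming the host system can say whether the agent is permitted to act on its own (the route names and the `allowed_to_act` parameter are assumptions):

```python
def route(confidence: float, allowed_to_act: bool) -> str:
    if confidence > 0.9 and allowed_to_act:
        return "auto_act"   # operator sees a notification only
    if confidence >= 0.7:
        return "surface"    # confidence-budget candidate; show the score
    return "escalate"       # agent is unsure; operator decides
```

Note that a high-confidence output the agent is not allowed to act on still falls through to "surface", so nothing auto-executes outside its permissions.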
Calibrating the confidence
Validate: what does "0.8 confidence" mean in production? Take 100 outputs scored at 0.8 and check how many were actually correct. It should be about 80%; if not, the model is miscalibrated.
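A sketch of that spot check, assuming production outputs are available as (confidence, was_correct) pairs:

```python
def calibration_gap(outputs: list[tuple[float, bool]],
                    target: float = 0.8, tol: float = 0.05) -> float:
    """Observed accuracy minus target, for outputs whose confidence
    falls within `tol` of the target. Near zero = well calibrated."""
    bucket = [correct for conf, correct in outputs
              if abs(conf - target) <= tol]
    if not bucket:
        raise ValueError("no outputs near the target confidence")
    accuracy = sum(bucket) / len(bucket)
    return accuracy - target
```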
Re-calibrate after every model swap. Confidence calibration is a model-specific property; it does not transfer.
Recalibration is mostly automatic: run the eval suite, compare predicted confidence to actual correctness, and adjust the thresholds.
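A sketch of that adjustment step, assuming the eval suite yields (confidence, was_correct) pairs; it scans for the lowest threshold whose above-threshold cases meet the target accuracy (the function name and the 0.9 target are illustrative):

```python
def recalibrate_threshold(evals: list[tuple[float, bool]],
                          required_accuracy: float = 0.9) -> float:
    """Find the lowest confidence threshold whose above-threshold
    eval cases meet `required_accuracy`."""
    for threshold in (t / 100 for t in range(50, 100)):
        above = [correct for conf, correct in evals if conf >= threshold]
        if above and sum(above) / len(above) >= required_accuracy:
            return threshold
    return 1.0  # nothing meets the bar; route everything to a human
```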