Per-Decision Confidence: Surfacing It Without Over-Surfacing
Confidence scores are useful, but confidence overload makes operators numb. A confidence budget surfaces only the decisions that actually need a second look.
Why surface confidence at all
Operators need to know when to trust the agent and when to second-guess it, and confidence scores convey that signal cheaply. Without confidence, every agent output looks the same and the operator has to rebuild trust from scratch on each interaction. Confidence is also a signal for downstream automation: high-confidence outputs auto-route, and low-confidence outputs escalate to a human.
- Trust signal for operator. When to trust, when to second-guess; cheap signal.
- Without it, rebuild trust each time. Outputs look identical; expensive cognition.
- Routing signal. High-confidence auto-routes; low-confidence escalates.
- Per-decision confidence value. Numeric or banded; the actionable surface.
The over-surfacing trap
Surfacing too much confidence breaks the signal. Showing confidence on every output makes operators numb, and the signal becomes background noise within a week. Worse, operators start ignoring confidence entirely: the information is there but unread, so the cost was paid for no benefit. Confidence overload is the natural failure mode of teams that decide to surface confidence without thinking about how operators will consume it.
- Numbness within a week. Per-output confidence becomes background noise.
- Ignored entirely. Information present but unread; cost paid for no benefit.
- Surfacing without consumption design. The natural failure mode.
- Per-operator attention budget. Confidence consumes attention; budget it.
Confidence-budget pattern
The budget pattern preserves the signal. Set a budget: surface confidence on at most 10 cases per day per operator, not on every case. Fill those 10 slots with the lowest-confidence outputs, the highest-stakes ones, and the ones where the agent explicitly flagged uncertainty. Operators will read 10 confidence scores carefully but would ignore 100, so the budget is what keeps the signal usable.
- 10 cases per day per operator. Bounded attention; preserves signal.
- Lowest confidence plus highest stakes. The selection criteria; pick the consequential.
- Plus explicit uncertainty flags. Cases where agent flagged itself; surfaced.
- 10 read carefully vs 100 ignored. The budget is the signal-preservation mechanism.
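A minimal sketch of the budget selection described above, in Python. The `Decision` fields, the 0-to-1 `stakes` rating, and the ranking rule (explicit flags first, then low confidence plus high stakes) are illustrative assumptions; the pattern itself only says which kinds of cases should win the slots, not how to weigh them against each other.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    case_id: str
    confidence: float                # agent's score in [0, 1]
    stakes: float                    # hypothetical 0-1 rating of how costly a mistake is
    flagged_uncertain: bool = False  # agent explicitly flagged its own uncertainty


def select_for_surfacing(decisions, budget=10):
    """Return at most `budget` decisions whose confidence is shown to the operator."""

    def priority(d):
        # Assumed ranking: explicit uncertainty flags first, then a simple
        # blend of low confidence and high stakes.
        return (d.flagged_uncertain, (1.0 - d.confidence) + d.stakes)

    ranked = sorted(decisions, key=priority, reverse=True)
    return ranked[:budget]
```

Called once per operator per day with that operator's queue, this caps the number of surfaced scores at `budget` no matter how many cases came in. The weighting inside `priority` is the part most likely to need tuning for a given workload.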
Confidence thresholds for auto-routing
Three confidence bands drive routing. Above 0.9: auto-act (only for actions the agent is allowed to take; the operator sees a notification, not a request for action). 0.7 to 0.9: surface to the operator (a confidence-budget candidate; show the score because the case is borderline). Below 0.7: escalate (the agent itself is unsure, so the operator decides what happens next).
- >0.9 auto-act. For permitted actions; operator notified, not asked.
- 0.7-0.9 surface to operator. Confidence-budget candidate; borderline.
- <0.7 escalate. Agent unsure; operator decides.
- Per-band documented action. Each band has a documented response; supports correct routing.
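The three bands map directly onto a small routing function. The sketch below uses the thresholds stated above; the function name, the returned labels, and the choice to surface a high-confidence case when the action is not permitted are assumptions made for illustration.

```python
def route(confidence, action_permitted=True):
    """Map a confidence score to one of the three documented responses."""
    if confidence > 0.9:
        # Auto-act only applies to actions the agent is allowed to take.
        # Falling back to "surface" otherwise is an assumption of this sketch.
        return "auto_act" if action_permitted else "surface"
    if confidence >= 0.7:
        # Borderline band: a candidate for the confidence budget.
        return "surface"
    # The agent itself is unsure; the operator decides what happens next.
    return "escalate"
```

Scores of exactly 0.9 and exactly 0.7 land in the middle band here, which matches reading the bands as ">0.9", "0.7-0.9", and "<0.7".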
Calibrating the confidence
Calibration keeps the score meaningful. Validate what “0.8 confidence” means in production: take 100 outputs scored at 0.8 and check how many were actually correct. About 80 should be; if the count is far off, the model is mis-calibrated. Re-calibrate after every model swap, because calibration is a model-specific property and does not transfer. Recalibration is mostly automatic: run the eval suite, compare predicted confidence to actual correctness, and adjust the threshold.
- Validate 0.8 = 80% correct. Take 100 at 0.8; verify; mis-calibration if not.
- Re-calibrate per model swap. Calibration is model-specific; doesn’t transfer.
- Automatic recalibration. Eval suite plus correctness comparison plus threshold adjustment.
- Per-quarter calibration check. Documented per cycle; supports continued meaning.
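The "take 100 at 0.8 and count the correct ones" check generalizes to bucketing eval results by confidence and comparing each bucket's average score with its observed accuracy. The sketch below assumes the eval suite yields `(confidence, was_correct)` pairs; the function name, the bin count, and the report fields are hypothetical.

```python
def calibration_report(records, n_bins=10):
    """Compare predicted confidence with observed accuracy, bucket by bucket.

    `records` is an iterable of (confidence, was_correct) pairs, assumed to
    come from an eval suite with known-correct answers.
    """
    bins = [[] for _ in range(n_bins)]
    for confidence, was_correct in records:
        # A confidence of exactly 1.0 goes into the top bucket.
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, was_correct))

    report = []
    for items in bins:
        if not items:
            continue
        mean_confidence = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        report.append({
            "count": len(items),
            "mean_confidence": mean_confidence,
            "accuracy": accuracy,
            # Positive gap: the model is overconfident in this bucket.
            "gap": mean_confidence - accuracy,
        })
    return report
```

For 100 outputs at 0.8 of which 80 were correct, the single bucket reports a gap near zero; if only 60 were correct, the gap is about 0.2, which indicates the score overstates reliability. Running this after a model swap and again at each quarterly check gives a comparable record per cycle.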