LLM-as-Judge for SRE Agent Output: Pitfalls and Patterns
Judges are cheaper than humans and more biased. The bias categories you must counter, the rubric design that holds up, and the cases where humans are still required.
When to use a judge
An LLM judge fills a specific gap: scoring open-ended output where exact-match scoring is impossible and human review is too slow or too expensive.
- Open-ended output. Hypotheses, summaries, postmortems, anything where many valid answers exist and exact-match scoring fails.
- Cost vs human. Judges run 5x to 50x cheaper than humans and far faster. The trade is bias and noise; judges are not neutral.
- High volume, low stakes. Use judges for high-volume, low-stakes evals. Use humans for low-volume, high-stakes ones; the hybrid covers most of the space.
- Pre-merge gates. Judges work well as pre-merge eval gates because the speed lets you run the full set on every change.
Biases you must counter
Judges have predictable biases. Each one has a known counter; ignoring them produces evals that look rigorous and are not.
- Position bias. In pairwise comparisons, judges tend to prefer the first option presented. Counter by randomising the order across cases and averaging.
- Length bias. Judges prefer longer answers. Counter by including length in the rubric so “appropriate brevity” is a scoring dimension.
- Self-bias. A model judging its own output gives itself higher scores. Use a different model family as the judge.
- Confidence bias. Confident, hedge-free answers score higher even when wrong. Counter with a separate “hedges appropriately” dimension.
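The position-bias counter can be sketched in a few lines. This is a minimal sketch, not a definitive implementation: it uses a deterministic variant of the randomise-and-average counter, scoring each pair in both orders and averaging. The `PairwiseJudge` signature and the stub judge are assumptions made for illustration; a real judge call would wrap whatever model client you use.

```python
from typing import Callable

# Hypothetical judge signature (an assumption, not a real API):
# takes (case, first, second) and returns a preference in [0, 1]
# for the option shown FIRST.
PairwiseJudge = Callable[[str, str, str], float]


def debiased_preference(judge: PairwiseJudge, case: str, a: str, b: str) -> float:
    """Return the preference for `a` over `b`, averaged over both orderings."""
    a_first = judge(case, a, b)  # preference for a when a is shown first
    b_first = judge(case, b, a)  # preference for b when b is shown first
    # Convert the second call into a preference for a, then average.
    return (a_first + (1.0 - b_first)) / 2.0


def biased_stub_judge(case: str, first: str, second: str) -> float:
    """Stand-in judge with no real opinion and a fixed lean toward slot one."""
    return 0.7
```

With the stub, a single call reports a 0.7 preference for whichever answer comes first; the averaged score lands at 0.5, which is the honest answer for a judge that has no opinion beyond slot order.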
Rubric design
The rubric is what separates a useful judge from a noisy one. Specific dimensions and explicit negatives are the difference.
- Specific dimensions. “Identifies the correct affected service: yes or no” beats “is the answer good.”
- Independent scoring. Each dimension is scored independently. Aggregate at the end; do not let the judge aggregate, or you lose the per-dimension signal.
- Explicit negatives. “Does not invent metrics: yes or no” catches hallucinations the positive criteria might miss.
- Few-shot grounding. Include 2 to 3 worked examples in the rubric prompt so the judge sees the calibration target rather than guessing it.
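A rubric along these lines might look like the sketch below. It assumes a hypothetical `ask` callable that sends one yes/no question plus the agent output to the judge and returns a boolean; the third dimension and all names are illustrative. Few-shot examples would live inside the prompt that `ask` builds and are left out here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Dimension:
    key: str
    question: str  # phrased so that "yes" is always the good outcome


# Illustrative rubric: one positive criterion, one explicit negative,
# and one brevity dimension.
RUBRIC: List[Dimension] = [
    Dimension("correct_service", "Identifies the correct affected service: yes or no"),
    Dimension("no_invented_metrics", "Does not invent metrics: yes or no"),
    Dimension("appropriate_brevity", "Is as brief as the incident warrants: yes or no"),
]

# Hypothetical judge call (an assumption): (question, agent_output) -> True for "yes".
AskJudge = Callable[[str, str], bool]


def score_output(ask: AskJudge, output: str) -> Dict[str, bool]:
    """Ask one question per call so each dimension is scored independently."""
    return {d.key: ask(d.question, output) for d in RUBRIC}


def aggregate(scores: Dict[str, bool]) -> float:
    """Aggregate in code, not in the judge; per-dimension scores stay available."""
    return sum(scores.values()) / len(scores)
```

Keeping `score_output` and `aggregate` separate means a drop in the aggregate can be traced back to the dimension that moved.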
Calibrate against humans
A judge that has not been calibrated against humans is decoration, not measurement. The protocol below is small, repeatable, and the only way to defend the eval.
- Calibration set. Pick 30 cases. Have humans score them. Have the judge score them. Compute case-level agreement.
- Agreement target. Target 90 percent or higher. Below that, the judge is unreliable for this rubric. Either tune the rubric or accept that humans are required.
- Quarterly re-calibration. Judge models update, rubrics drift, agreement rates change. The calibration is not one-time work.
- Disagreement audit. Read every case where judge and human disagreed. The disagreement is where the rubric needs sharpening.
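The agreement check and the disagreement audit can share one small helper. This sketch assumes each case has a single pass/fail verdict from the human and from the judge, keyed by a case id; the function and constant names are illustrative.

```python
from typing import Dict, List, Tuple

AGREEMENT_TARGET = 0.90  # the 90 percent target described above


def agreement_report(
    human: Dict[str, bool], judge: Dict[str, bool]
) -> Tuple[float, List[str]]:
    """Return (case-level agreement rate, ids of cases to audit)."""
    # Only compare cases that both the human and the judge scored.
    shared = sorted(human.keys() & judge.keys())
    if not shared:
        raise ValueError("no cases scored by both human and judge")
    disagreements = [cid for cid in shared if human[cid] != judge[cid]]
    rate = 1.0 - len(disagreements) / len(shared)
    return rate, disagreements
```

Comparing the returned rate against `AGREEMENT_TARGET` gives the go/no-go signal, and the returned ids are the reading list for the audit.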
Cases where judges fail
Four classes of evaluation defeat judges entirely. Pretending otherwise produces measurements that mislead the team.
- Real-world verification. “Is this hypothesis correct” requires looking at production data the judge does not have. Use a deterministic checker or a human.
- Tribal knowledge. “This team usually solves X with Y” is invisible to the judge. The convention lives in postmortems, not in training data.
- High stakes. Judges are noisy. High-stakes calls (production deploys, security responses) deserve human time even when the judge is confident.
- Adversarial output. Output crafted to game the rubric will earn inflated scores. For adversarial workloads, the judge needs an adversarial eval set of its own.