LLM-as-Judge for SRE Agent Output: Pitfalls and Patterns
Judges are cheaper than humans and more biased. The bias categories you must counter, the rubric design that holds up, and the cases where humans are still required.
When to use a judge
Judge models score open-ended output: hypotheses, summaries, postmortems. Anything where exact-match scoring fails because there are many valid answers.
Judges are 5-50x cheaper than humans and orders of magnitude faster. The trade is bias and noise; judges are not neutral.
Use judges for high-volume, low-stakes evals. Use humans for low-volume, high-stakes ones. The hybrid covers most of the space.
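A toy routing rule for that hybrid, assuming you can tag each eval stream with rough volume and stakes; the threshold is an illustrative assumption, not a recommendation:

```python
def pick_grader(cases_per_week: int, high_stakes: bool) -> str:
    """Route an eval stream to a judge model or a human reviewer."""
    if high_stakes:
        return "human"  # noisy judges do not get high-stakes calls
    if cases_per_week >= 100:  # assumed cut-off for "high volume"
        return "judge"
    return "human"  # low volume is cheap enough to keep human
```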
Biases you must counter
Position bias: judges prefer the first option presented. Counter by randomising presentation order across cases (a sketch follows this list).
Length bias: judges prefer longer answers. Counter by including length in the rubric ("appropriate brevity" is a scoring dimension).
Self-bias: a model judging its own output gives itself higher scores. Use a different model family as the judge.
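A minimal sketch of the position-bias counter. `judge_call` is a hypothetical wrapper around your judge model (ideally from a different model family than the agent being scored, per the self-bias point) that takes a case plus two candidates and returns which one it preferred:

```python
import random

def judged_preference(judge_call, case, answer_a, answer_b, rng=random):
    """Randomise presentation order per case so position bias averages out
    across the eval set. judge_call(case, first, second) -> "first"/"second"."""
    if rng.random() < 0.5:
        first, second, labels = answer_a, answer_b, ("a", "b")
    else:
        first, second, labels = answer_b, answer_a, ("b", "a")
    verdict = judge_call(case, first, second)
    return labels[0] if verdict == "first" else labels[1]
```

A stricter variant scores both orders and records a tie when the verdicts disagree; it costs double but flags exactly the cases where the judge is order-sensitive.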
Rubric design
Specific dimensions, not vague ones. "Identifies the correct affected service: yes/no" beats "is the answer good."
Each dimension is independently scored. Aggregate at the end; do not let the judge aggregate, or you lose the per-dimension signal.
Include negative criteria explicitly. "Does not invent metrics: yes/no" catches hallucinations the positive criteria might miss.
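A minimal sketch of such a rubric, covering all three points. The dimension names and questions are hypothetical, and `judge_yes_no` is assumed to be a wrapper that asks the judge a single question and returns a boolean:

```python
# Each dimension is one yes/no question; "yes" is always the passing answer,
# including for the negative criterion.
RUBRIC = [
    ("correct_service", "Does the answer identify the correct affected service?"),
    ("appropriate_brevity", "Is the answer appropriately brief for an incident update?"),
    ("no_invented_metrics", "Does the answer avoid citing metrics absent from the input?"),
]

def score_case(judge_yes_no, case, answer):
    """Score dimensions independently; aggregate in code, not in the judge,
    so the per-dimension signal survives."""
    scores = {key: judge_yes_no(case, answer, question) for key, question in RUBRIC}
    return scores, all(scores.values())  # (per-dimension detail, overall pass)
```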
Calibrate against humans
Pick 30 cases. Have humans score them. Have the judge score them. Compute agreement: the fraction of cases where the judge's verdict matches the human's.
Target 90%+ agreement. Below that, the judge is unreliable for this rubric. Either tune the rubric or accept that humans are required for this eval.
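A minimal sketch of the agreement computation, assuming both graders produce a case-level pass/fail verdict keyed by case id:

```python
def judge_human_agreement(judge_verdicts, human_verdicts):
    """Fraction of shared cases where the judge's verdict matches the human's.
    Both arguments map case_id -> bool (pass/fail)."""
    shared = judge_verdicts.keys() & human_verdicts.keys()
    if not shared:
        raise ValueError("no overlapping cases to compare")
    matches = sum(judge_verdicts[c] == human_verdicts[c] for c in shared)
    return matches / len(shared)

# Gate: below 0.90, tune the rubric or keep humans on this eval.
# usable = judge_human_agreement(judge_verdicts, human_verdicts) >= 0.90
```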
Re-calibrate quarterly. Judge models update; rubrics drift; agreement rates change. The calibration is not one-time.
Cases where judges fail
Anything requiring real-world verification: "is this hypothesis correct" requires looking at production data the judge does not have.
Anything that depends on tribal knowledge: "this team usually solves X with Y" is invisible to the judge.
Anything where the cost of being wrong is high: judges are noisy; high-stakes calls deserve human time.