LLM-as-Judge for SRE Agent Output: Pitfalls and Patterns

Judges are cheaper than humans and more biased. This piece covers the bias categories you must counter, the rubric design that holds up, and the cases where humans are still required.

When to use a judge

An LLM judge fills a specific gap: scoring open-ended output where exact-match scoring is impossible and human review is too slow or too expensive.

Biases you must counter

Judges have predictable biases. Each one has a known counter; ignoring them produces evals that look rigorous and are not.
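
As one illustration of a bias with a known counter, consider position bias in pairwise comparison, where a judge tends to favor whichever answer it sees first. The sketch below is a minimal example, not a prescribed method: it runs the comparison in both orders and keeps a verdict only if it survives the swap. The `Judge` callable and its "first" / "second" / "tie" return values are assumptions made for illustration; a real judge would wrap a model call.

```python
from typing import Callable

# Hypothetical judge interface (an assumption for this sketch): it receives
# the two candidate outputs in presentation order and returns "first",
# "second", or "tie".
Judge = Callable[[str, str], str]


def debiased_preference(judge: Judge, a: str, b: str) -> str:
    """Return "a", "b", or "tie" after judging in both presentation orders."""
    forward = judge(a, b)   # a is shown first
    backward = judge(b, a)  # b is shown first

    # Map each positional verdict back to the underlying candidate.
    forward_pick = {"first": "a", "second": "b", "tie": "tie"}[forward]
    backward_pick = {"first": "b", "second": "a", "tie": "tie"}[backward]

    # Trust only a verdict that survives the swap; otherwise call it a tie.
    return forward_pick if forward_pick == backward_pick else "tie"
```

A judge that always prefers the first slot ends up with "tie" on every pair, which makes the bias visible instead of letting it leak into the score. The cost is two judge calls per comparison.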

Rubric design

The rubric is what separates a useful judge from a noisy one. Specific dimensions and explicit negatives are the difference.
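
A rubric with specific dimensions and explicit negatives could be represented as plain data and rendered into the judge prompt. The sketch below is one possible shape, assuming "explicit negatives" means stated things that must not earn credit. The `Dimension` class, the `render_rubric` helper, and the example dimension are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    """One narrowly scoped question the judge answers with pass or fail."""
    name: str
    question: str
    # Explicit negatives: things that must NOT earn credit on this dimension.
    negatives: tuple[str, ...] = ()


def render_rubric(dimensions: list[Dimension]) -> str:
    """Render the rubric as plain text to embed in a judge prompt."""
    lines = []
    for d in dimensions:
        lines.append(f"{d.name}: {d.question} Answer pass or fail.")
        for negative in d.negatives:
            lines.append(f"  Do NOT give credit for: {negative}")
    return "\n".join(lines)


# Illustrative dimension for an SRE agent's incident summary (hypothetical).
EXAMPLE_RUBRIC = [
    Dimension(
        name="root_cause",
        question="Does the output name a specific root cause?",
        negatives=(
            "restating the alert text",
            "listing symptoms without a cause",
        ),
    ),
]
```

Keeping the rubric as data rather than free text makes it easy to version, diff, and score each dimension separately.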

Calibrate against humans

A judge that has not been calibrated against humans is decoration, not measurement. The protocol below is small, repeatable, and the only way to defend the eval.
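
One common way to quantify calibration is chance-corrected agreement between judge labels and human labels on the same items. The sketch below computes Cohen's kappa by hand; it is an illustrative choice of metric, not necessarily the one the protocol uses, and the function name and label values are assumptions.

```python
from collections import Counter


def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Chance-corrected agreement between two label lists of equal length."""
    if not human or len(human) != len(judge):
        raise ValueError("label lists must be non-empty and the same length")

    n = len(human)
    # Fraction of items where the judge and the human gave the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n

    # Agreement expected by chance, from each rater's label frequencies.
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    expected = sum(human_counts[k] * judge_counts[k] for k in human_counts) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

Raw percent agreement overstates calibration when one label dominates, which is why a chance-corrected measure is used here. What kappa threshold counts as acceptable is a decision for the team and is not specified in this sketch.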

Cases where judges fail

Three classes of evaluation defeat judges entirely. Pretending otherwise produces measurements that mislead the team.