Structured vs Unstructured Evals: When Each Wins
Multiple-choice evals are cheap and shallow. Free-form evals are expensive and informative. The decision rule for picking the right shape for each task.
When structured evals win
Classification tasks. Spam-or-not, severity-1-to-5, intent detection. The expected output is a label; a structured eval scores it exactly, because the label either matches or it does not.
They cost almost nothing to run; you can score thousands of cases per minute. Regression detection is fast and reliable.
Limitation: structured evals miss reasoning quality. The model can pick the right label for the wrong reason.
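A minimal sketch of what this reduces to, assuming a hypothetical classify function that returns a label and a simple list-of-dicts case format:

```python
# Structured-eval sketch. `classify` is a hypothetical stand-in for whatever
# model call returns a label; the case format here is an assumption.
cases = [
    {"input": "WIN A FREE CRUISE!!! Click now", "expected": "spam"},
    {"input": "Are we still on for lunch at noon?", "expected": "not_spam"},
]

def run_structured_eval(classify, cases):
    """Exact-match scoring: a case passes only if the predicted label matches."""
    failures = []
    for case in cases:
        predicted = classify(case["input"])
        if predicted != case["expected"]:
            failures.append((case["input"], case["expected"], predicted))
    accuracy = 1 - len(failures) / len(cases)
    return accuracy, failures
```

The scoring itself is a plain comparison, so the cost is dominated by the model calls, not the eval harness.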
When unstructured evals win
Generation tasks. Summarisation, postmortem drafts, email replies. The expected output is open-ended; structured scoring misses what matters.
Use an LLM-as-judge or human review. Slower and more expensive, but it captures quality dimensions that labels cannot.
Limitation: judge models are noisy. Calibrate against humans; expect 90%+ agreement, not 99%.
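A rough LLM-as-judge sketch, assuming a hypothetical call_model function that sends a prompt to the judge model and returns its text reply; the rubric dimensions are illustrative:

```python
import re

# Judge prompt with an explicit, parseable output format.
JUDGE_PROMPT = """You are grading a drafted summary against its source.
Score each dimension from 1 (poor) to 5 (excellent), one per line,
in the exact form "<dimension>: <score>".

Dimensions: accuracy, completeness, tone.

Source:
{source}

Draft:
{draft}
"""

def judge(call_model, source, draft):
    """Return {dimension: score} parsed from the judge model's reply."""
    reply = call_model(JUDGE_PROMPT.format(source=source, draft=draft))
    scores = {}
    for match in re.finditer(r"(accuracy|completeness|tone)\s*:\s*([1-5])",
                             reply, re.IGNORECASE):
        scores[match.group(1).lower()] = int(match.group(2))
    return scores
```

Before trusting those scores, run the judge on a sample that humans have already graded and compare; that is the calibration step above.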
The hybrid that ships
Structured evals run on every PR. Cheap, fast, and they catch obvious regressions.
Unstructured evals run weekly or per release. Expensive, but they catch subtle quality drift.
Layer both. Structured is the loud floor; unstructured is the quiet ceiling.
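One way the layering could look in a test suite, reusing the run_structured_eval and judge sketches above; the environment variable, thresholds, and case shapes are assumptions, not recommendations:

```python
import os

def ci_eval_gate(classify, call_model, labeled_cases, generation_cases):
    # Fast floor: exact-match labels, runs on every PR.
    accuracy, failures = run_structured_eval(classify, labeled_cases)
    assert accuracy >= 0.98, f"structured eval regression: {failures[:5]}"

    # Slow ceiling: judge pass, only when a scheduled job sets RUN_JUDGE_EVALS.
    # Assumes each generation case already carries the model's draft output.
    if os.environ.get("RUN_JUDGE_EVALS"):
        flagged = []
        for case in generation_cases:
            scores = judge(call_model, case["source"], case["draft"])
            if not scores or min(scores.values()) < 3:
                flagged.append((case["source"][:60], scores))
        assert not flagged, f"judge flagged quality drift: {flagged[:5]}"
```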
Common mistakes
Using only structured evals on a generation task. The eval suite passes but quality has rotted; users notice before the team does.
Using only unstructured evals on a classification task. Expensive and slow when a multiple-choice eval would have caught the regression in seconds.
Pick the format by task type, not by team preference.
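The same rule as a hypothetical helper, with illustrative task names:

```python
def eval_format(task_type):
    # Label-shaped outputs get structured evals, open-ended outputs get
    # judge or human review, and shipping systems layer both.
    if task_type in {"classification", "intent_detection", "severity_rating"}:
        return "structured"
    if task_type in {"summarisation", "postmortem_draft", "email_reply"}:
        return "unstructured"
    return "hybrid"
```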