Structured vs Unstructured Evals: When Each Wins
Structured (multiple-choice) evals are cheap but narrow. Unstructured (free-form) evals are expensive but informative. This piece gives the decision rule for picking the right shape for each task.
When structured evals win
Structured evals win on classification tasks: spam-or-not, severity 1 to 5, intent detection. The expected output is a label, so scoring is an exact match against the expected answer with no judgment call, and the cost is low enough to run thousands of cases per minute. The limitation is that they miss reasoning quality.
- Classification tasks. Spam-or-not, severity-1-to-5, intent-detection; scoring is an exact match against the expected label.
- Cheap and fast. Thousands of cases per minute; regression detection is fast and reliable.
- Reasoning blind spot. The model can pick the right label for the wrong reason; the score doesn’t catch it.
- Per-PR run target. Structured evals run on every PR; the feedback loop is fast enough to gate merges.
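A minimal sketch of what such a merge gate might look like, assuming a hypothetical `predict` callable that wraps the model and a hypothetical 0.95 accuracy threshold (neither comes from a specific framework). It shows why scoring is cheap: each case is a single string comparison.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class LabeledCase:
    """One structured eval case (hypothetical shape, for illustration)."""
    text: str       # input shown to the model
    expected: str   # gold label, e.g. "spam" or "not_spam"


def exact_match_accuracy(
    cases: Sequence[LabeledCase],
    predict: Callable[[str], str],
) -> float:
    """Fraction of cases where the predicted label equals the gold label.

    Labels are compared after trimming whitespace and lowercasing, so
    "Spam " and "spam" count as the same label.
    """
    if not cases:
        raise ValueError("need at least one case to score")
    correct = sum(
        1
        for case in cases
        if predict(case.text).strip().lower() == case.expected.strip().lower()
    )
    return correct / len(cases)


def passes_gate(accuracy: float, threshold: float = 0.95) -> bool:
    """Merge gate: True when accuracy meets the (assumed) threshold."""
    return accuracy >= threshold
```

Note what the sketch does not do: it never inspects why the model chose a label, which is the reasoning blind spot described above.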
When unstructured evals win
Unstructured evals win on generation tasks: summarisation, postmortem drafts, email replies. The expected output is open-ended and structured scoring misses what matters; LLM-as-judge or human review captures quality dimensions that labels cannot, at the cost of speed and money.
- Generation tasks. Summarisation, postmortem drafts, email replies; the expected output is open-ended.
- LLM-as-judge or human review. Slower and more expensive but captures quality dimensions that labels cannot.
- Judge noise. Judge models are noisy. Calibrate them against human ratings on the same samples, and expect 90%+ agreement, not 99%.
- Per-release cadence. Unstructured evals run weekly or per-release; the cost matches the cadence.
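Calibrating a judge against humans can be as simple as comparing verdicts on a shared sample. The sketch below assumes both the judge and the human reviewers emit one categorical verdict per sample (for example "pass" or "fail"); the function names are illustrative, not from any library.

```python
from typing import List, Sequence


def agreement_rate(judge: Sequence[str], human: Sequence[str]) -> float:
    """Fraction of samples where the judge verdict matches the human verdict."""
    if len(judge) != len(human):
        raise ValueError("judge and human verdicts must cover the same samples")
    if not judge:
        raise ValueError("need at least one sample")
    matches = sum(1 for j, h in zip(judge, human) if j == h)
    return matches / len(judge)


def disagreements(judge: Sequence[str], human: Sequence[str]) -> List[int]:
    """Indices of samples where the verdicts differ, for manual review."""
    return [i for i, (j, h) in enumerate(zip(judge, human)) if j != h]
```

Reading the disagreeing samples is usually more useful than the headline number, since they show whether the judge is wrong in a consistent direction. Raw agreement also ignores chance agreement, so a skewed label distribution can make it look better than it is.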
The hybrid that ships
The hybrid that ships layers both. Structured evals on every PR catch obvious regressions cheaply; unstructured evals weekly or per-release catch subtle quality drifts that structured evals miss. Structured is the loud floor; unstructured is the quiet ceiling.
- Structured every PR. Cheap, fast, catches obvious regressions; the merge gate.
- Unstructured per-release. Expensive but catches subtle quality drifts; the release gate.
- Layered floor and ceiling. Structured is the loud floor; unstructured is the quiet ceiling.
- Per-eval purpose documented. Each suite’s role is written down, which speeds up investigation when an eval surfaces a regression.
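One way to keep the layering and the documented purpose in the same place is a small suite registry. This is a sketch under assumptions: the suite names, the `Trigger` values, and the `purpose` strings are all hypothetical, and a real setup would hook `suites_for` into whatever CI system is in use.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Trigger(Enum):
    PULL_REQUEST = "pull_request"   # merge gate
    RELEASE = "release"             # release gate


@dataclass(frozen=True)
class EvalSuite:
    name: str
    kind: str         # "structured" or "unstructured"
    trigger: Trigger
    purpose: str      # why this suite exists, kept next to its schedule


# Hypothetical registry: one cheap structured suite, one expensive judged suite.
SUITES = [
    EvalSuite(
        name="intent_labels",
        kind="structured",
        trigger=Trigger.PULL_REQUEST,
        purpose="Catch obvious label regressions before merge.",
    ),
    EvalSuite(
        name="summary_quality",
        kind="unstructured",
        trigger=Trigger.RELEASE,
        purpose="Catch subtle quality drift that labels cannot score.",
    ),
]


def suites_for(trigger: Trigger) -> List[EvalSuite]:
    """Return the suites that should run for a given trigger."""
    return [suite for suite in SUITES if suite.trigger is trigger]
```

Keeping `purpose` as a required field means a suite cannot be registered without someone stating what it is for.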
Common mistakes
The two common mistakes are both mismatches between format and task. Structured evals on a generation task pass while quality rots; unstructured evals on a classification task are expensive and slow when a multiple-choice eval would have caught the regression in seconds. Pick the format by task type, not by team preference.
- Structured on generation. Eval suite passes but quality has rotted; users notice before the team does.
- Unstructured on classification. Expensive and slow when a multiple-choice eval would have caught it in seconds.
- Format follows task. Pick by task type, not by team preference; the task shape drives the eval shape.
- Per-task format documented. The rationale for each format choice is committed to the eval directory, where later reviewers can find it.
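The "format follows task" rule can be written down as a lookup that lives alongside the evals. The task names below are taken from the examples in this article; the mapping itself and the function name are an illustrative sketch, not a standard.

```python
# Hypothetical mapping from task type to eval format.
FORMAT_BY_TASK = {
    # Classification: the expected output is a label.
    "spam_detection": "structured",
    "severity_rating": "structured",
    "intent_detection": "structured",
    # Generation: the expected output is open-ended.
    "summarisation": "unstructured",
    "postmortem_draft": "unstructured",
    "email_reply": "unstructured",
}


def choose_eval_format(task_type: str) -> str:
    """Return the eval format for a task type.

    Unknown task types raise instead of falling back to a default, so the
    format for a new task has to be decided and recorded explicitly.
    """
    try:
        return FORMAT_BY_TASK[task_type]
    except KeyError:
        raise ValueError(f"no eval format recorded for task type {task_type!r}") from None
```

Raising on unknown tasks is a deliberate choice in this sketch: a silent default is how a generation task ends up with a structured eval that passes while quality rots.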