Structured vs Unstructured Evals: When Each Wins

Structured (multiple-choice) evals are cheap and noisy. Unstructured (free-form) evals are expensive and informative. What follows is a decision rule for picking the right shape per task.

When structured evals win

Structured evals win on classification tasks: spam-or-not, severity 1 to 5, intent detection. The expected output is a label, so scoring reduces to an exact match against the expected label, and the cost is low enough to run thousands of cases per minute. The limitation is that they score only the final label and say nothing about the quality of the reasoning behind it.
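
As a rough illustration of why structured scoring is cheap, here is a minimal sketch of an exact-match scorer. The function name score_structured and the sample labels are made up for this example; a real harness would load cases from a dataset rather than inline lists.

```python
from collections import Counter


def score_structured(predictions, expected):
    """Return exact-match accuracy and a tally of (expected, predicted) misses."""
    if len(predictions) != len(expected):
        raise ValueError("predictions and expected must have the same length")
    if not expected:
        raise ValueError("need at least one case to score")

    correct = 0
    confusions = Counter()
    for pred, gold in zip(predictions, expected):
        if pred == gold:
            correct += 1
        else:
            # Record which label was expected and which one the model gave.
            confusions[(gold, pred)] += 1
    return correct / len(expected), confusions


# Hypothetical spam-or-not cases: one of the four predictions is wrong.
accuracy, confusions = score_structured(
    ["spam", "not_spam", "spam", "spam"],
    ["spam", "not_spam", "not_spam", "spam"],
)
print(accuracy)
print(dict(confusions))
```

The scorer never looks at how the model reached its label, which is the limitation described above.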

When unstructured evals win

Unstructured evals win on generation tasks: summarisation, postmortem drafts, email replies. The expected output is open-ended, so label-based scoring misses what matters. LLM-as-judge or human review captures quality dimensions that labels cannot, at the cost of speed and money.
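
One way to picture unstructured scoring is a rubric applied by a pluggable judge. The sketch below is an assumption-heavy illustration: the rubric dimensions, the 1 to 5 scale, and stub_judge are all hypothetical, and stub_judge is a placeholder standing in for an LLM call or a human rating, not a real judging method.

```python
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical rubric dimensions; choose ones that matter for the task.
RUBRIC: List[str] = ["faithfulness", "coverage", "tone"]

# A judge maps (source, output, dimension) to a score from 1 to 5.
Judge = Callable[[str, str, str], int]


def score_unstructured(
    source: str, output: str, judge: Judge, rubric: List[str] = RUBRIC
) -> Dict[str, float]:
    """Ask the judge for one score per rubric dimension and average them."""
    scores: Dict[str, float] = {}
    for dimension in rubric:
        score = judge(source, output, dimension)
        if not 1 <= score <= 5:
            raise ValueError(f"judge returned out-of-range score for {dimension}")
        scores[dimension] = float(score)
    scores["overall"] = mean(scores[d] for d in rubric)
    return scores


def stub_judge(source: str, output: str, dimension: str) -> int:
    # Placeholder only: penalises very short outputs on coverage.
    # In practice this would be an LLM-as-judge call or a human reviewer.
    if dimension == "coverage" and len(output) < 20:
        return 2
    return 4


result = score_unstructured(
    "Long incident report text goes here.", "DB was down.", stub_judge
)
print(result)
```

Each judge call costs time and money, which is why this shape runs less often than exact-match scoring.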

The hybrid that ships

The hybrid that ships layers both. Structured evals on every PR catch obvious regressions cheaply; unstructured evals, run weekly or per release, catch subtle quality drift that structured evals miss. Structured is the loud floor; unstructured is the quiet ceiling.
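
The layering can be sketched as a small scheduling rule. The Trigger names and the suites_for function are hypothetical, and running the structured suite on weekly and release runs as well is an assumption of this sketch rather than something the schedule above requires.

```python
from enum import Enum
from typing import List


class Trigger(Enum):
    PULL_REQUEST = "pull_request"
    WEEKLY = "weekly"
    RELEASE = "release"


def suites_for(trigger: Trigger) -> List[str]:
    """Pick which eval suites to run for a given pipeline trigger."""
    # Assumption: the cheap structured suite runs on every trigger.
    suites = ["structured"]
    # The expensive unstructured suite runs only weekly or per release.
    if trigger in (Trigger.WEEKLY, Trigger.RELEASE):
        suites.append("unstructured")
    return suites


for trigger in Trigger:
    print(trigger.value, suites_for(trigger))
```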

Common mistakes

The two common mistakes are both mismatches between format and task. Structured evals on a generation task keep passing while quality rots; unstructured evals on a classification task are expensive and slow when a multiple-choice eval would have caught the regression in seconds. Pick the format by task type, not by team preference.
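
The decision rule can be written down as a lookup plus a mismatch check. This is only a sketch: the task names mirror the examples used earlier, and pick_eval_format and check_mismatch are hypothetical helpers, not part of any eval library.

```python
from typing import Optional

# Task types taken from the examples in this article.
CLASSIFICATION = {"spam_detection", "severity_rating", "intent_detection"}
GENERATION = {"summarisation", "postmortem_draft", "email_reply"}


def pick_eval_format(task: str) -> str:
    """Choose the eval format from the task type, not from preference."""
    if task in CLASSIFICATION:
        return "structured"
    if task in GENERATION:
        return "unstructured"
    raise ValueError(f"unknown task type: {task}")


def check_mismatch(task: str, chosen_format: str) -> Optional[str]:
    """Return a warning if the chosen format does not fit the task, else None."""
    recommended = pick_eval_format(task)
    if chosen_format == recommended:
        return None
    if chosen_format == "structured":
        return "structured eval on a generation task: may pass while quality rots"
    return "unstructured eval on a classification task: slow and expensive"


print(check_mismatch("summarisation", "structured"))
print(check_mismatch("spam_detection", "unstructured"))
print(check_mismatch("intent_detection", "structured"))
```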