AI & ML Practical · By Samson Tanimawo, PhD · Published Jul 30, 2026 · 4 min read

Structured vs Unstructured Evals: When Each Wins

Multiple-choice evals are cheap and noisy. Free-form evals are expensive and informative. Here's the decision rule for picking the right shape per task.

When structured evals win

Classification tasks. Spam-or-not, severity 1 to 5, intent detection. The expected output is a label; structured evals score it perfectly.

Costs are low: you can run thousands of cases per minute. Regression detection is fast and reliable.
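A structured eval reduces to exact-match scoring over labeled cases. Here's a minimal sketch; `predict` is a hypothetical stand-in classifier, not a real model call:

```python
# Minimal structured-eval loop: exact-match label scoring.

def predict(text: str) -> str:
    # Stand-in classifier for illustration; swap in your model call.
    return "spam" if "free money" in text.lower() else "ham"

CASES = [
    ("Claim your free money now!!!", "spam"),
    ("Meeting moved to 3pm", "ham"),
    ("FREE MONEY inside", "spam"),
]

def run_structured_eval(cases) -> float:
    # Accuracy: fraction of cases where the predicted label matches.
    correct = sum(predict(text) == label for text, label in cases)
    return correct / len(cases)

print(f"accuracy: {run_structured_eval(CASES):.2f}")
```

Because each case is a cheap function call plus a string comparison, thousands of cases per minute is realistic even without parallelism.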

Limitation: structured evals miss reasoning quality. The model can pick the right label for the wrong reason.

When unstructured evals win

Generation tasks. Summarisation, postmortem drafts, email replies. The expected output is open-ended; structured scoring misses what matters.

Use LLM-as-judge or human review. Slower and more expensive, but they capture quality dimensions that labels cannot.

Limitation: judge models are noisy. Calibrate against humans; expect 90%+ agreement, not 99%.
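Calibration is just an agreement rate between judge verdicts and human verdicts on the same outputs. A minimal sketch, with hypothetical pass/fail verdicts for illustration:

```python
# Calibration check: how often does the LLM judge agree with humans?
# Verdicts below are made-up examples, not real eval data.

human = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail"]
judge = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]

def agreement(a: list, b: list) -> float:
    # Fraction of items where the two raters gave the same verdict.
    assert len(a) == len(b), "rate the same outputs"
    return sum(x == y for x, y in zip(a, b)) / len(a)

rate = agreement(human, judge)
print(f"judge-human agreement: {rate:.0%}")
```

If the rate sits below roughly 90%, tighten the judge prompt or rubric before trusting its scores; past that point, remaining disagreement is mostly irreducible rater noise.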

The hybrid that ships

Structured evals on every PR. Cheap, fast, catches obvious regressions.

Unstructured evals weekly or per-release. Expensive but catches subtle quality drifts.

Layer both. Structured is the loud floor; unstructured is the quiet ceiling.
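The layering can be expressed as a simple gate keyed on the CI event. A sketch with placeholder suite functions (the names and events are illustrative, not a real CI API):

```python
# Layered eval schedule: cheap structured gate on every PR,
# expensive judge pass added per release or weekly.

def run_structured_suite() -> bool:
    # Placeholder: exact-match checks. Fast; runs on every PR.
    return True

def run_judge_suite() -> bool:
    # Placeholder: LLM-as-judge pass. Slow; runs per release.
    return True

def gate(event: str) -> bool:
    if event == "pull_request":
        return run_structured_suite()  # the loud floor
    if event in ("release", "weekly"):
        # The quiet ceiling: only run the expensive pass once
        # the cheap floor already holds.
        return run_structured_suite() and run_judge_suite()
    raise ValueError(f"unknown event: {event}")

print(gate("pull_request"), gate("release"))
```

The short-circuit `and` means a release run that fails the cheap structured suite never pays for the judge pass.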

Common mistakes

Using only structured evals on a generation task. The eval suite passes but quality has rotted; users notice before the team does.

Using only unstructured evals on a classification task. Expensive and slow when a multiple-choice eval would have caught the regression in seconds.

Pick the format by task type, not by team preference.