By Samson Tanimawo, PhD. Published Jul 30, 2026.

Structured vs Unstructured Evals: When Each Wins

Multiple-choice evals are cheap and noisy. Free-form evals are expensive and informative. Here is the decision rule for picking the right shape per task.

When structured evals win

Classification tasks: spam-or-not, severity 1-to-5, intent detection. The expected output is a label, so a structured eval can score it by exact match against the expected answer.

Cost is low: a suite can run thousands of cases per minute, so regression detection is fast and reliable.

Limitation: structured evals miss reasoning quality. The model can pick the right label for the wrong reason.
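As an illustration, a structured eval for a spam classifier can be little more than an exact-match loop. The sketch below is a minimal example under stated assumptions: the `Case` schema, the `keyword_classifier` stand-in for a model call, and the four sample cases are all hypothetical and not taken from any particular eval framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    """One labelled eval case (hypothetical schema)."""
    prompt: str
    expected: str


def normalise(label: str) -> str:
    # Trim whitespace and ignore case so "Spam " and "spam" count as equal.
    return label.strip().lower()


def exact_match_accuracy(cases, predict) -> float:
    """Fraction of cases where the predicted label equals the expected label."""
    if not cases:
        raise ValueError("need at least one eval case")
    hits = sum(normalise(predict(c.prompt)) == normalise(c.expected) for c in cases)
    return hits / len(cases)


def keyword_classifier(prompt: str) -> str:
    # Stand-in for a real model call (assumption): flags any mention of "prize".
    return "spam" if "prize" in prompt.lower() else "not_spam"


CASES = [
    Case("Claim your free prize now", "spam"),
    Case("Meeting moved to 3pm", "not_spam"),
    Case("You won a PRIZE, click here", "spam"),
    Case("Limited offer, act fast", "spam"),  # the stand-in misses this one
]

accuracy = exact_match_accuracy(CASES, keyword_classifier)
print(f"accuracy = {accuracy:.2f}")  # prints "accuracy = 0.75"
```

The score only says whether the label matched. It carries no information about why the model chose it, which is the limitation described above.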

When unstructured evals win

Generation tasks: summarisation, postmortem drafts, email replies. The expected output is open-ended, and structured scoring misses what matters.

Use LLM-as-judge or human review. Both are slower and more expensive, but they capture quality dimensions that labels cannot.

Limitation: judge models are noisy. Calibrate the judge against human ratings of the same outputs; expect 90%+ agreement, not 99%.
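One way to run that calibration is to have the judge and a human rate the same outputs, then compare the verdicts. The sketch below computes raw percent agreement, and also Cohen's kappa, which is an addition here rather than something the article prescribes; it corrects for agreement that would occur by chance. The ten verdicts are made-up illustrative data, and the 0.90 threshold is an assumption taken from the "90%+" rule of thumb above.

```python
from collections import Counter


def percent_agreement(judge, human) -> float:
    """Share of items where judge and human gave the same verdict."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty rating lists of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge, human) -> float:
    """Agreement corrected for chance, based on each rater's label frequencies."""
    n = len(judge)
    observed = percent_agreement(judge, human)
    judge_counts = Counter(judge)
    human_counts = Counter(human)
    # Probability that both raters pick the same label independently.
    expected = sum(judge_counts[k] * human_counts[k] for k in judge_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Illustrative verdicts on the same ten outputs (made-up data).
HUMAN = ["pass", "pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass", "fail"]
JUDGE = ["pass", "pass", "pass", "fail", "fail", "pass", "fail", "pass", "fail", "fail"]

MIN_AGREEMENT = 0.90  # assumed threshold, from the "90%+" rule of thumb

agreement = percent_agreement(JUDGE, HUMAN)
kappa = cohens_kappa(JUDGE, HUMAN)
judge_is_calibrated = agreement >= MIN_AGREEMENT

print(f"agreement = {agreement:.2f}, kappa = {kappa:.2f}")
```

With heavily imbalanced labels, raw agreement can look high even for a judge that always answers "pass", which is why a chance-corrected number is worth reporting next to it.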

Layering both

Run structured evals on every PR. They are cheap, fast, and catch obvious regressions.

Run unstructured evals weekly or per release. They are expensive but catch subtle quality drift.

Layer both. Structured is the loud floor; unstructured is the quiet ceiling.
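The layering can be written down as a small schedule that maps each trigger to the suites it runs. This is a sketch only: the `Trigger` names and the `SCHEDULE` table are hypothetical, and running the structured suite on weekly and release runs as well as on PRs is an assumption about what "layer both" means in practice.

```python
from enum import Enum


class Trigger(Enum):
    """Events that can start an eval run (hypothetical names)."""
    PULL_REQUEST = "pull_request"
    WEEKLY = "weekly"
    RELEASE = "release"


# Which triggers run which suite. The structured suite is cheap, so it runs
# everywhere (assumption); the unstructured suite is reserved for slower cadences.
SCHEDULE = {
    "structured": {Trigger.PULL_REQUEST, Trigger.WEEKLY, Trigger.RELEASE},
    "unstructured": {Trigger.WEEKLY, Trigger.RELEASE},
}


def suites_for(trigger: Trigger) -> list:
    """Return the names of the suites to run for a trigger, in sorted order."""
    return sorted(name for name, triggers in SCHEDULE.items() if trigger in triggers)


for trigger in Trigger:
    print(trigger.value, "->", suites_for(trigger))
```

Keeping the schedule as data rather than scattered conditionals makes it easy to review which checks gate a PR and which only run on the slower cadence.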

Common mistakes

Using only structured evals on a generation task. The eval suite passes while quality rots, and users notice before the team does.

Using only unstructured evals on a classification task. Expensive and slow when a multiple-choice eval would have caught the regression in seconds.

Pick the format by task type, not by team preference.
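The rule is simple enough to encode as a lookup plus a check for the two mismatches described above. The sketch below is illustrative: the task-type strings, the `TASK_TO_FORMAT` table, and both function names are hypothetical.

```python
from typing import Optional, Set

# Decision rule: the task type, not team preference, picks the eval format.
TASK_TO_FORMAT = {
    "classification": "structured",
    "generation": "unstructured",
}


def required_format(task_type: str) -> str:
    """Return the eval format that a task type needs."""
    try:
        return TASK_TO_FORMAT[task_type]
    except KeyError:
        raise ValueError(f"unknown task type: {task_type!r}") from None


def missing_format(task_type: str, suite_formats: Set[str]) -> Optional[str]:
    """Return the format the suite lacks for this task, or None if covered."""
    needed = required_format(task_type)
    return None if needed in suite_formats else needed


# A generation task covered only by structured evals is flagged.
print(missing_format("generation", {"structured"}))
# A classification task covered only by unstructured evals is flagged too.
print(missing_format("classification", {"unstructured"}))
# A layered suite has no gap.
print(missing_format("generation", {"structured", "unstructured"}))
```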