AI & ML · Beginner · By Samson Tanimawo, PhD · Published Apr 29, 2025 · 9 min read

Model Evaluations 101: Beyond Accuracy

Accuracy is one number. Real ML systems need ten. Here is how to evaluate models so you actually know if they’re working in production.

Why accuracy alone misleads

Accuracy is the percentage of predictions a model gets right. It’s the easiest metric to compute and the most common one to over-rely on.

The classic failure: a fraud-detection dataset where 99% of transactions are legitimate. A model that always predicts “not fraud” achieves 99% accuracy and detects zero fraud. The metric is technically correct and operationally useless.
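The failure is easy to reproduce in a few lines of Python (the numbers are synthetic, purely for illustration):

```python
# 1,000 transactions, 10 fraudulent, and a "model" that always predicts not-fraud.
labels = [1] * 10 + [0] * 990   # 1 = fraud, 0 = legitimate
predictions = [0] * 1000        # the do-nothing model

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
fraud_caught = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))

print(accuracy)      # 0.99 -- looks great on a dashboard
print(fraud_caught)  # 0    -- catches nothing
```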

Production-grade evaluation requires multiple metrics, chosen for the failure modes you care about. Below, the families you’ll keep reaching for.

Classification metrics: precision, recall, F1

For any binary classification task, four numbers tell the real story:

  - True positives (TP): positive cases the model correctly flagged.
  - False positives (FP): negative cases the model wrongly flagged.
  - False negatives (FN): positive cases the model missed.
  - True negatives (TN): negative cases the model correctly passed.

From those:

  - Precision = TP / (TP + FP): of everything the model flagged, how much was right.
  - Recall = TP / (TP + FN): of everything that should have been flagged, how much was caught.
  - F1: the harmonic mean of precision and recall, a single number that punishes imbalance between the two.

Precision-recall tradeoffs are domain-specific. Spam filter: high recall (don’t miss spam) at the cost of some precision (some legit mail in spam folder). Court system: high precision (don’t convict the innocent) even if recall drops.
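These definitions fit in a few lines. A minimal sketch (the counts below are hypothetical):

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Derive precision, recall, and F1 from raw confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

# A spam filter that flags 80 of 100 spam messages, plus 20 legitimate ones:
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=20)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.8 0.8 0.8
```

The zero-division guards matter in practice: a model that flags nothing has undefined precision, and silently crashing mid-eval is worse than reporting 0.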

Text generation: BLEU, ROUGE, and their limits

For tasks where the model produces free text (translation, summarisation), classical metrics try to match outputs against references:

  - BLEU: n-gram precision against one or more reference texts; the long-standing machine-translation standard.
  - ROUGE: n-gram recall against references; the usual choice for summarisation.
  - METEOR: exact-match overlap extended with stemming and synonym matching.

All three require reference outputs, which are expensive to produce. They also don’t distinguish “factually wrong” from “differently phrased.” A confident hallucination can score high if it’s linguistically close to the reference.
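To see why these metrics reward surface overlap, here is a toy clipped unigram precision, the core idea behind BLEU-1 (real BLEU adds higher-order n-grams and a brevity penalty):

```python
from collections import Counter

def unigram_precision(candidate: str, reference: str) -> float:
    """Clipped unigram precision: each candidate word counts only as many
    times as it appears in the reference. A simplified slice of BLEU-1."""
    cand = candidate.lower().split()
    ref_counts = Counter(reference.lower().split())
    matched = sum(min(c, ref_counts[w]) for w, c in Counter(cand).items())
    return matched / len(cand) if cand else 0.0

# A confident factual error still scores high if it is linguistically close:
print(unigram_precision("the cat sat on the mat", "the cat sat on the mat"))  # 1.0
print(unigram_precision("the cat sat on the hat", "the cat sat on the mat"))  # ~0.83: wrong fact, high score
```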

LLM-as-judge

The most common 2024-2025 evaluation approach for generative output: use a stronger LLM to score the outputs of the model under test.

The pattern: define a rubric (“rate this answer 1-5 on factual accuracy, completeness, and tone”), give it to a frontier model along with the question and the answer, and average the scores across hundreds of examples.
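A sketch of that loop. The rubric wording is illustrative, and `call_judge` is a placeholder for whatever client function sends a prompt to your judge model and returns its text reply:

```python
import re
from statistics import mean

RUBRIC = """Rate the answer from 1 to 5 on factual accuracy, completeness, and tone.
Reply with a single integer, nothing else.

Question: {question}
Answer: {answer}"""

def judge_score(question: str, answer: str, call_judge) -> int:
    """Score one (question, answer) pair with a judge model."""
    reply = call_judge(RUBRIC.format(question=question, answer=answer))
    match = re.search(r"[1-5]", reply)
    return int(match.group()) if match else 0  # unparseable reply counts as a failure

def evaluate(examples, call_judge) -> float:
    """Average judge score over (question, answer) pairs."""
    return mean(judge_score(q, a, call_judge) for q, a in examples)

# With a stub judge, for demonstration only:
print(evaluate([("2+2?", "4"), ("Capital of France?", "Paris")], lambda prompt: "5"))
```

Forcing a single-integer reply and treating anything else as a zero keeps the parsing honest; judges drift into prose more often than you would expect.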

Strengths: cheap, scales, captures quality nuances numerical metrics miss.

Caveats: judges have biases (they often prefer their own family’s style, longer answers, and answers that match their training data). Calibrate against human ratings on a sample. Don’t trust LLM-as-judge alone for high-stakes decisions.

Task-specific metrics

For specific applications, narrow metrics beat general ones:

  - Code generation: pass@k (does at least one of k sampled solutions pass the tests?).
  - Retrieval and RAG: recall@k and MRR (is the right document returned, and how high is it ranked?).
  - Speech recognition: word error rate (WER).
  - Ranking and recommendation: NDCG.

Pick the metric closest to the business outcome. “BLEU on translation” is a proxy for “does the customer understand the translated email?” If you can measure the latter, do.
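As one example of a narrow metric: recall@k for a retrieval system, i.e. the fraction of relevant documents that appear in the top k results. A sketch with hypothetical document IDs:

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top-k retrieved results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

# Hypothetical query: 2 relevant docs, one retrieved within the top 3.
print(recall_at_k(["d7", "d2", "d9", "d4"], {"d2", "d4"}, k=3))  # 0.5
```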

Building eval sets that grow with your product

The eval set is your ground truth. It deserves the same care as production code.

Three principles:

  1. Start small. 50-100 examples are enough to detect major regressions. Don’t wait for a 10,000-example gold set; you’ll never start.
  2. Grow on real failures. Every production bug becomes an eval case. The eval set is the museum of every wrong answer your model has ever given.
  3. Version it. Track when an eval case was added and why. When a metric score moves, you want to know which examples drove the change.
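One way to make cases versionable is to give each one explicit provenance fields and store them as JSON lines in the repo, so every addition shows up in the diff. The field names below are illustrative, not a standard:

```python
from dataclasses import dataclass, field, asdict
import json

@dataclass
class EvalCase:
    """One versioned eval example with provenance metadata."""
    case_id: str
    input: str
    expected: str
    added_on: str           # when the case entered the set
    source: str             # why it exists, e.g. the production bug it came from
    tags: list[str] = field(default_factory=list)

case = EvalCase(
    case_id="eval-0042",
    input="Summarise this refund policy for a customer.",
    expected="Mentions the 30-day window and the receipt requirement.",
    added_on="2025-04-12",
    source="production bug: model invented a 90-day refund window",
    tags=["summarisation", "hallucination"],
)

# One JSON line per case: plain files diff cleanly under version control.
print(json.dumps(asdict(case)))
```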

An eval set you can’t version, can’t reproduce, and didn’t collect from real failures is decoration. The teams that ship the best AI products are also the teams with the most boring, exhaustive, well-curated eval sets.