Model Evaluations 101: Beyond Accuracy
Accuracy is one number. Real ML systems need ten. Here is how to evaluate models so you actually know if they’re working in production.
Why accuracy alone misleads
Accuracy is the percentage of predictions a model gets right. It’s the easiest metric to compute and the most common one to over-rely on.
The classic failure: a fraud-detection dataset where 99% of transactions are legitimate. A model that always predicts “not fraud” achieves 99% accuracy and detects zero fraud. The metric is technically correct and operationally useless.
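That degenerate baseline takes a few lines to verify. A toy sketch, assuming 0 = legitimate and 1 = fraud:

```python
# Toy dataset: 990 legitimate transactions (0) and 10 fraudulent ones (1).
labels = [0] * 990 + [1] * 10
# The degenerate model: always predict "not fraud".
preds = [0] * 1000

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
fraud_caught = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)

print(accuracy)      # 0.99 -- looks great
print(fraud_caught)  # 0    -- catches nothing
```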
Production-grade evaluation requires multiple metrics, chosen for the failure modes you care about. Below, the families you’ll keep reaching for.
Classification metrics: precision, recall, F1
For any classification task, four numbers tell the real story:
- True positives (TP): predicted fraud, actually fraud.
- False positives (FP): predicted fraud, actually legitimate (annoying for users).
- False negatives (FN): predicted legitimate, actually fraud (the costly miss).
- True negatives (TN): predicted legitimate, actually legitimate.
From those:
- Precision = TP / (TP + FP). Of the things we flagged, how many really were positive?
- Recall = TP / (TP + FN). Of the things that really were positive, how many did we catch?
- F1 score = harmonic mean of precision and recall. A single number when you need one.
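A minimal implementation of those three formulas, with the zero-denominator guards the formulas themselves omit:

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 from confusion-matrix counts,
    guarding the empty-denominator cases."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# A fraud model that catches 8 of 10 frauds and wrongly flags 2 legit transactions:
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
# p = 0.8, r = 0.8, f1 ≈ 0.8
```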
Precision-recall tradeoffs are domain-specific. Spam filter: high recall (don’t miss spam) at the cost of some precision (some legit mail in spam folder). Court system: high precision (don’t convict the innocent) even if recall drops.
Text generation: BLEU, ROUGE, and their limits
For tasks where the model produces free text (translation, summarisation), classical metrics try to match outputs against references:
- BLEU: n-gram precision. Looks at overlapping word sequences. Used heavily in machine translation. Doesn’t account for synonyms or paraphrasing.
- ROUGE: n-gram recall. Common in summarisation. Same blind spots as BLEU.
- BERTScore: uses contextual embeddings for similarity. Catches paraphrases that BLEU misses. The default modern choice for reference-based text eval.
All three require reference outputs, which are expensive to produce. They also don’t distinguish “factually wrong” from “differently phrased.” A confident hallucination can score high if it’s linguistically close to the reference.
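The n-gram machinery underneath these metrics is small. Below is a sketch of clipped n-gram precision, BLEU's core ingredient; real BLEU additionally combines several n-gram orders and applies a brevity penalty, both omitted here:

```python
from collections import Counter

def ngram_precision(candidate: list[str], reference: list[str], n: int) -> float:
    """Clipped n-gram precision: each candidate n-gram counts only as
    many times as it appears in the reference."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return overlap / total if total else 0.0

candidate = "the cat sat on the mat".split()
reference = "the cat is on the mat".split()
print(ngram_precision(candidate, reference, 1))  # 5 of 6 unigrams match: ~0.83
# A perfect paraphrase ("the feline rested on the rug") would score far lower.
```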
LLM-as-judge
The most common 2024-2025 evaluation approach for generative output: use a stronger LLM to score the outputs of the model under test.
The pattern: define a rubric (“rate this answer 1-5 on factual accuracy, completeness, and tone”), give it to a frontier model along with the question and the answer, and average the scores across hundreds of examples.
Strengths: cheap, scales, captures quality nuances numerical metrics miss.
Caveats: judges have biases (they often prefer their own family’s style, longer answers, and answers that match their training data). Calibrate against human ratings on a sample. Don’t trust LLM-as-judge alone for high-stakes decisions.
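In code, the pattern is little more than prompt templating plus an average. This sketch assumes a hypothetical `call_judge(prompt) -> int` wrapper around whatever frontier-model client you use; the rubric text is illustrative:

```python
import statistics

RUBRIC = (
    "Rate this answer 1-5 on factual accuracy, completeness, and tone. "
    "Reply with a single integer."
)

def judge_prompt(question: str, answer: str) -> str:
    """Assemble the rubric, question, and answer into one judging prompt."""
    return f"{RUBRIC}\n\nQuestion: {question}\nAnswer: {answer}\nScore:"

def judge_score(examples, call_judge) -> float:
    """Average judge scores over (question, answer) pairs.

    call_judge(prompt) -> int is a hypothetical wrapper around your
    frontier-model client, not a real library API.
    """
    scores = [call_judge(judge_prompt(q, a)) for q, a in examples]
    return statistics.mean(scores)
```

Running the same evaluation twice with different judge models, or spot-checking a sample against human raters, is the cheapest calibration you can do.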
Task-specific metrics
For specific applications, narrow metrics beat general ones:
- Code generation: pass@k. Generate k candidate solutions, run them against test cases, count as success if any one passes. The current standard for code-model evaluation.
- Question answering with a single right answer: exact match + F1 over tokens. Used in SQuAD and similar benchmarks.
- Information retrieval: NDCG@k, MRR, Recall@k. Each measures a different aspect of how well the right document made it to the top of results.
- Conversation quality: turn-level helpfulness + session completion + escalation rate. No single number captures whether a chat went well.
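Computing pass@k by generating exactly k samples is high-variance; common practice in the code-eval literature is to generate n >= k samples, count the c that pass, and use an unbiased estimator:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: given n generated samples of which c
    passed the tests, the probability that at least one of k randomly
    drawn samples is correct."""
    if n - c < k:
        return 1.0  # too few failures to fill a k-subset: some draw must pass
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=10, c=3, k=1))  # ≈ 0.3
```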
Pick the metric closest to the business outcome. “BLEU on translation” is a proxy for “does the customer understand the translated email?” If you can measure the latter, do.
Building eval sets that grow with your product
The eval set is your ground truth. It deserves the same care as production code.
Three principles:
- Start small. 50-100 examples is enough to detect major regressions. Don’t wait for a 10,000-example gold set; you’ll never start.
- Grow on real failures. Every production bug becomes an eval case. The eval set is the museum of every wrong answer your model has ever given.
- Version it. Track when an eval case was added and why. When a metric score moves, you want to know which examples drove the change.
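A versioned eval set can be as simple as one JSON line per case checked into git, each carrying its provenance. The schema below is illustrative, not a standard:

```python
import json
from dataclasses import asdict, dataclass

@dataclass
class EvalCase:
    # Illustrative schema: the point is that every case records when it
    # was added and which failure it guards against.
    case_id: str
    prompt: str
    expected: str
    added: str   # ISO date the case entered the set
    reason: str  # the production failure it reproduces

case = EvalCase(
    case_id="fraud-017",
    prompt="wire transfer of $9,900 split across three accounts",
    expected="fraud",
    added="2025-03-02",
    reason="prod miss: transfers structured under the reporting threshold",
)

# One JSON line per case in git: diffable, blame-able, reproducible.
print(json.dumps(asdict(case)))
```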
An eval set you can’t version, can’t reproduce, and didn’t collect from real failures is decoration. The teams that ship the best AI products are also the teams with the most boring, exhaustive, well-curated eval sets.