Data Contamination in ML Benchmarks
If your benchmark questions appeared in the training data, the model isn’t reasoning; it’s remembering. Contamination has quietly broken many published comparisons.
What contamination is
Benchmark contamination occurs when test data ends up in the training data, often by accident. The model has effectively memorised the test; its score reflects recall, not generalisation. Reported accuracy is inflated, sometimes dramatically.
How it happens
- Public benchmarks (MMLU, HumanEval, MATH) are scraped along with the rest of the web.
- Researchers post benchmark questions on forums, blogs, and Q&A sites; those leak into training corpora.
- Synthetic data generators trained on contaminated models produce contaminated outputs.
Detection
Two practical tests:
- String match: search the training corpus for verbatim test questions. Crude but catches direct contamination.
- Likelihood gap: compute the model’s likelihood of the test question text vs random comparable text. A large gap suggests memorisation.
Neither test is perfect — string matching misses paraphrases, and likelihood gaps can flag merely common phrasing — but together they catch most direct cases. Recent benchmark releases increasingly include contamination audits.
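A minimal sketch of both checks. The normalisation, the `logprob_fn` callable, and the control-text comparison are assumptions for illustration — in practice `logprob_fn` would wrap whatever scoring interface your model exposes, and the corpus search would run over an index rather than raw strings:

```python
import re

def normalize(text):
    # Lowercase and collapse whitespace so trivial formatting
    # differences don't hide a verbatim match.
    return re.sub(r"\s+", " ", text.lower()).strip()

def string_match_contaminated(question, corpus_docs):
    # Crude verbatim check: is the test question a substring of any
    # training document after normalisation? Misses paraphrases.
    q = normalize(question)
    return any(q in normalize(doc) for doc in corpus_docs)

def likelihood_gap(logprob_fn, question, control_texts):
    # Compare the model's mean per-token log-likelihood of the test
    # question against comparable control text. A large positive gap
    # suggests memorisation.
    # `logprob_fn` is a hypothetical callable: string -> mean
    # per-token log-probability under the model.
    q_lp = logprob_fn(question)
    control_lp = sum(logprob_fn(t) for t in control_texts) / len(control_texts)
    return q_lp - control_lp
```

The threshold for "a large gap" is an empirical choice; calibrate it on text you know is and isn't in the corpus before trusting it.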
Avoiding it in your own evals
For any internal eval that drives decisions:
- Don’t use public benchmarks as your sole signal. Build a private held-out set.
- Rotate eval questions periodically. The set you used six months ago may be partially leaked.
- Test multiple variants (paraphrased, translated, modified) to distinguish memorisation from understanding.
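The variant test in the last bullet can be sketched as an accuracy gap. `score_fn` is a hypothetical callable (1.0 for a correct answer, 0.0 otherwise) standing in for your actual eval harness:

```python
def memorisation_gap(score_fn, originals, variants):
    # Score the model on the original eval questions and on
    # paraphrased/modified variants of the same questions.
    orig_acc = sum(score_fn(q) for q in originals) / len(originals)
    var_acc = sum(score_fn(q) for q in variants) / len(variants)
    # A model that understands the task should score similarly on
    # both sets; a large drop on variants suggests the original
    # answers were memorised rather than derived.
    return orig_acc - var_acc
```

A gap near zero is consistent with genuine capability; a large positive gap is a red flag worth investigating question by question.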
The mature stance: treat every published benchmark number as an upper bound. Private, internal eval data created after the model’s training cutoff — post-2025, for current models — is the only reliable signal.