Training Data: Why It Decides Everything
A great architecture with mediocre data loses to a mediocre architecture with great data. Every time. Here is what “great data” actually means and the audit every project should run.
Garbage in, confident garbage out
The classic phrase is “garbage in, garbage out.” With machine learning, the accurate version is worse: garbage in, confident garbage out. The model produces its output with the same level of certainty regardless of whether the underlying data was clean or broken. This is the most dangerous failure mode in production ML.
Every serious practitioner reaches the same conclusion within a year: time spent on the data pays back many times more than time spent on the model. A team that obsesses over architecture choices but ships on unexamined training data is making a classic beginner mistake.
The three qualities of good training data
Good training data has three properties. Missing any one of them caps your model’s performance, no matter how sophisticated the architecture.
- Representative: the training distribution matches the production distribution. If your spam filter trained on 2018 emails deploys on 2025 emails, it will fail silently on new spam patterns.
- Clean: labels are correct and consistent. If one annotator labels “sarcasm” as positive and another labels it as negative, the model learns confusion.
- Balanced enough: if 99% of your examples are one class, the model can hit 99% accuracy by always predicting that class and actually learning nothing.
A dataset missing any of these produces a model that performs well on paper and badly in production.
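The imbalance trap is easy to demonstrate: a model that always predicts the majority class looks accurate while learning nothing. A minimal sketch (the labels here are made up for illustration; `majority_baseline_accuracy` is a hypothetical helper, not a library function):

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of a 'model' that always predicts the most common class."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)

# 99% negative, 1% positive -- the shape of a typical fraud dataset
labels = ["neg"] * 990 + ["pos"] * 10
print(majority_baseline_accuracy(labels))  # 0.99
```

Any real model has to beat this baseline, which is why per-class recall, not overall accuracy, is the metric to watch on imbalanced data.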
Size: how much is enough?
There’s no universal answer. An honest rough guide, by task:
- Tabular supervised learning: 1,000-10,000 examples per class is usually enough to ship something useful.
- Image classification with a pretrained model: 100-1,000 images per class, thanks to transfer learning.
- Image classification from scratch: 50,000+ images per class. Don’t do this; use transfer learning.
- Text classification: 1,000-10,000 labelled documents per class for most business tasks.
- Large language model pretraining: trillions of tokens. Only the largest companies do this.
The practical rule: start with what you have, measure the model’s accuracy, and if it isn’t good enough, the first thing to try is doubling your data, not changing the model.
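One way to act on that rule is a learning curve: train at increasing dataset sizes and check whether held-out accuracy is still climbing. If it is, more data will help; if it has plateaued, look at the model instead. A self-contained sketch using a toy nearest-centroid classifier on synthetic data (a real project would swap in its own model and dataset):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Two overlapping Gaussian blobs: class 0 near (-1, -1), class 1 near (+1, +1)
    X = np.concatenate([rng.normal(-1.0, 1.5, (n // 2, 2)),
                        rng.normal(+1.0, 1.5, (n // 2, 2))])
    y = np.array([0] * (n // 2) + [1] * (n // 2))
    return X, y

def nearest_centroid_accuracy(X_train, y_train, X_test, y_test):
    """Classify each test point by its nearest class centroid."""
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in (0, 1)])
    dists = np.linalg.norm(X_test[:, None, :] - centroids[None, :, :], axis=2)
    return float((dists.argmin(axis=1) == y_test).mean())

X_test, y_test = make_data(2000)
accuracies = {}
for n in (50, 200, 800):
    X_train, y_train = make_data(n)
    accuracies[n] = nearest_centroid_accuracy(X_train, y_train, X_test, y_test)
    print(f"n={n:4d}  held-out accuracy={accuracies[n]:.3f}")
```

The shape of the curve, not the absolute numbers, is what answers the "more data or better model?" question.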
Labels: the most expensive part of ML
Labelling is boring, slow, and often the biggest cost line in an ML project. A well-labelled dataset for a medical imaging task can cost six figures because trained radiologists have to label every image.
Three techniques that reduce the pain:
- Active learning: train a small model on a small labelled set, have it predict on unlabelled data, and then have humans label only the examples the model is most uncertain about. This cuts labelling cost by 3-10×.
- Weak supervision: write simple labelling rules (“if the email contains ‘Viagra’, label it spam”), apply them to unlabelled data, and train on the resulting noisy labels. Libraries like Snorkel formalise this.
- LLM labelling: use a large language model to pre-label your data, and have humans review and correct. For text tasks, this now often matches human labeller accuracy at 1% of the cost.
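The core of active learning is the uncertainty-sampling step: given the model's predicted probabilities on unlabelled data, send annotators the examples closest to the decision boundary. A minimal sketch for a binary classifier (`uncertainty_sample` and the probabilities are illustrative, not from any specific library):

```python
import numpy as np

def uncertainty_sample(probs, k):
    """Pick the k unlabelled examples whose predicted positive-class
    probability is closest to 0.5 -- the ones the model is least sure about."""
    probs = np.asarray(probs)
    uncertainty = -np.abs(probs - 0.5)          # higher = more uncertain
    return np.argsort(uncertainty)[-k:][::-1]   # indices to send to annotators

# Model predictions on five unlabelled examples (illustrative numbers)
probs = [0.02, 0.48, 0.97, 0.55, 0.90]
print(uncertainty_sample(probs, 2))  # [1 3] -- the two examples nearest 0.5
```

Examples at 0.02 or 0.97 would teach the model little; labelling budget goes where the model is genuinely undecided.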
Bias, hiding in plain sight
Training data is a snapshot of history. History contains structural bias. Models trained on biased data reproduce and amplify that bias, sometimes catastrophically.
The canonical example: an Amazon resume-screening model trained on 10 years of hired-candidate data learned to penalise applicants whose resumes contained the word “women’s” (as in “women’s chess club”). The model was doing exactly what the data taught it: past hiring had favoured men, so the model rediscovered that pattern. Amazon killed the project.
Bias audits belong in every ML project. Break down model accuracy across demographic groups. Check that the error rates don’t diverge sharply. If they do, the dataset has structural gaps, and patching the model won’t fix the root cause.
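The per-group breakdown needs no special tooling. A minimal sketch with hypothetical labels and group assignments (real audits would use the actual demographic attribute):

```python
from collections import defaultdict

def accuracy_by_group(y_true, y_pred, groups):
    """Break overall accuracy down by a group attribute."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        totals[g] += 1
        hits[g] += int(t == p)
    return {g: hits[g] / totals[g] for g in totals}

# Illustrative data: the model is noticeably less accurate on group "b"
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(accuracy_by_group(y_true, y_pred, groups))  # {'a': 0.75, 'b': 0.5}
```

A gap this size between groups is exactly the signal that the dataset, not the model, needs attention.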
The audit every project needs
Before any model ships, run through this checklist. It takes a day and catches failures that would otherwise take months of production incidents to surface.
- Distribution check: plot the distribution of each important feature in training vs production. Any feature whose distribution shifted is a signal the model will underperform.
- Label spot-check: hand-review 100 random labels. If more than 3 are wrong, that suggests a label error rate above roughly 3%, and the whole labelling process needs review, not just those 100 labels.
- Class balance: is the class distribution similar in train, validation, and production?
- Leakage: does any training feature depend on the target label? (“Number of doctor visits” predicting “has disease” can leak.)
- Fairness: does accuracy differ by demographic group by more than 5 percentage points? If yes, flag for review before launch.
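The distribution check is easy to automate. One standard tool is the two-sample Kolmogorov-Smirnov statistic, the maximum gap between the empirical CDFs of a feature in training versus production (also available as `scipy.stats.ks_2samp`). A minimal sketch on one synthetic numeric feature:

```python
import numpy as np

def ks_statistic(train_vals, prod_vals):
    """Two-sample KS statistic: max gap between the empirical CDFs
    of one feature in training vs production data."""
    combined = np.sort(np.concatenate([train_vals, prod_vals]))
    cdf_train = np.searchsorted(np.sort(train_vals), combined, side="right") / len(train_vals)
    cdf_prod = np.searchsorted(np.sort(prod_vals), combined, side="right") / len(prod_vals)
    return float(np.max(np.abs(cdf_train - cdf_prod)))

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 5000)    # feature as seen in training
same = rng.normal(0.0, 1.0, 5000)     # production, no shift
shifted = rng.normal(0.8, 1.0, 5000)  # production after drift

print(f"no shift: {ks_statistic(train, same):.3f}")     # small
print(f"shifted:  {ks_statistic(train, shifted):.3f}")  # large
```

Run the same check per feature on a schedule after launch; a statistic that jumps is an early warning well before accuracy metrics degrade.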
When to say no: the project can’t be done yet
Sometimes the answer is that the data isn’t ready. That’s a valid engineering output. A mediocre model shipping with bad data is worse than no model at all, because the bad predictions will be used in decisions and will need to be explained in retrospect.
Push back on timelines when:
- You have under 500 examples and no way to get more.
- Labels are disputed within the team, let alone across the annotation pool.
- The production distribution hasn’t stabilised (the product itself is still changing).
Saying “not yet” is one of the most senior things an ML engineer can do. Do it when the data says so.