AI & ML · Beginner · By Samson Tanimawo, PhD · Published Jan 28, 2025 · 9 min read

Training Data: Why It Decides Everything

A great architecture with mediocre data loses to a mediocre architecture with great data. Every time. Here is what “great data” actually means and the audit every project should run.

Garbage in, confident garbage out

The classic phrase is “garbage in, garbage out.” With machine learning, the accurate version is worse: garbage in, confident garbage out. The model produces its output with the same level of certainty regardless of whether the underlying data was clean or broken. This is the most dangerous failure mode in production ML.

Every serious practitioner reaches the same conclusion within a year: time spent on the data pays back many times more than time spent on the model. A team that obsesses over architecture choices but ships a model trained on unaudited data is making a classic beginner mistake.

The three qualities of good training data

Good training data has three properties. Missing any one of them caps your model’s performance, no matter how sophisticated the architecture.

  1. Size: enough examples for the model to learn the pattern rather than memorise the exceptions.
  2. Label quality: the answers the model learns from are actually correct.
  3. Representativeness: the data covers the situations the model will face in production, without structural gaps.

A dataset missing any of these produces a model that performs well on paper and badly in production.

Size: how much is enough?

There’s no universal answer. A rough, honest guide by task:

  1. Simple tabular prediction: often workable with a few thousand labelled rows.
  2. Fine-tuning a pretrained model (images, text): often hundreds to a few thousand labelled examples per class.
  3. Training a deep model from scratch: tens of thousands of examples at minimum, usually far more.

These are orders of magnitude, not promises; the only reliable answer comes from measuring.

The practical rule: start with what you have, measure the model’s accuracy, and if it isn’t good enough, the first thing to try is doubling your data, not changing the model.
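That rule can be seen in a learning curve: accuracy as a function of training-set size. A minimal sketch, assuming nothing about your actual data — the two-cluster generator and the nearest-centroid “model” below are stand-ins for whatever dataset and model you really have:

```python
import random

random.seed(0)

def make_point(label):
    # Synthetic data: two Gaussian clusters centred at 0 and 3.
    centre = 0.0 if label == 0 else 3.0
    return (random.gauss(centre, 1.0), label)

def train_centroids(points):
    # "Training" here is just computing each class's mean.
    sums, counts = {0: 0.0, 1: 0.0}, {0: 0, 1: 0}
    for x, y in points:
        sums[y] += x
        counts[y] += 1
    return {y: sums[y] / counts[y] for y in (0, 1)}

def accuracy(centroids, points):
    # Predict the class whose centroid is nearest.
    correct = sum(
        1 for x, y in points
        if min(centroids, key=lambda c: abs(x - centroids[c])) == y
    )
    return correct / len(points)

test_set = [make_point(i % 2) for i in range(1000)]

# Double the training set each round and watch accuracy move.
for n in (8, 16, 32, 64, 128):
    train = [make_point(i % 2) for i in range(n)]
    print(n, round(accuracy(train_centroids(train), test_set), 3))
```

If the curve is still climbing at your current data size, more data is the cheapest win; if it has flattened, only then is it worth touching the model.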

Labels: the most expensive part of ML

Labelling is boring, slow, and often the biggest cost line in an ML project. A well-labelled dataset for a medical imaging task can cost six figures because trained radiologists have to label every image.

Three techniques that reduce the pain:

  1. Active learning: let the current model pick the examples it is least confident about, and label those first.
  2. Model-assisted pre-labelling: have a model draft the labels and have humans correct them, which is much faster than labelling from scratch.
  3. Programmatic labelling: encode heuristics and existing metadata as labelling rules, then hand-review the cases where the rules disagree.

Bias, hiding in plain sight

Training data is a snapshot of history. History contains structural bias. Models trained on biased data reproduce and amplify that bias, sometimes catastrophically.

The canonical example: an Amazon resume-screening model trained on 10 years of hired-candidate data learned to penalise applicants whose resumes contained the word “women’s” (as in “women’s chess club”). The model was doing exactly what the data taught it: past hiring had favoured men, so the model rediscovered that pattern. Amazon killed the project.

Bias audits belong in every ML project. Break down model accuracy across demographic groups. Check that the error rates don’t diverge sharply. If they do, the dataset has structural gaps, and patching the model won’t fix the root cause.
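The group breakdown is a few lines of code. A minimal sketch — the record fields (`group`, `label`, `pred`) are placeholders for your own schema, and the toy records exist only to show the shape of the output:

```python
from collections import defaultdict

def accuracy_by_group(records):
    # Accuracy computed separately per demographic group.
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["group"]] += 1
        correct[r["group"]] += int(r["label"] == r["pred"])
    return {g: correct[g] / total[g] for g in total}

def max_gap(acc):
    # Largest accuracy difference between groups, in percentage points.
    return (max(acc.values()) - min(acc.values())) * 100

# Toy records: group A gets 3 of 4 right, group B gets 2 of 4.
records = [
    {"group": "A", "label": 1, "pred": 1},
    {"group": "A", "label": 0, "pred": 0},
    {"group": "A", "label": 1, "pred": 1},
    {"group": "A", "label": 0, "pred": 1},
    {"group": "B", "label": 1, "pred": 0},
    {"group": "B", "label": 0, "pred": 1},
    {"group": "B", "label": 1, "pred": 1},
    {"group": "B", "label": 0, "pred": 0},
]
acc = accuracy_by_group(records)
print(acc)               # {'A': 0.75, 'B': 0.5}
print(max_gap(acc) > 5)  # True: gap exceeds 5 points, flag for review
```

If the gap exceeds your threshold, the fix lives in the dataset, not in the model.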

The audit every project needs

Before any model ships, run through this checklist. It takes a day. It catches the failures that take months of incidents to surface in production.

  1. Distribution check: plot the distribution of each important feature in training vs production. Any feature whose distribution shifted is a signal the model will underperform.
  2. Label spot-check: hand-review 100 random labels. If more than 3 are wrong, your whole labelling process needs a review, not just a reshuffle.
  3. Class balance: is the class distribution similar in train, validation, and production?
  4. Leakage: does any training feature depend on the target label? (“Number of doctor visits” predicting “has disease” can leak.)
  5. Fairness: does accuracy differ by demographic group by more than 5 percentage points? If yes, flag for review before launch.
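Several of these checks reduce to a few lines each. A hedged sketch of the first three, run on synthetic data — swap in your own feature columns, labels, and thresholds, and note that a real distribution check should plot full histograms rather than compare one summary statistic:

```python
import random
from collections import Counter

random.seed(1)

def mean(xs):
    return sum(xs) / len(xs)

def drift_flag(train_col, prod_col, tolerance=0.5):
    # Crude distribution check: has the production mean moved by more
    # than `tolerance` training standard deviations?
    mu = mean(train_col)
    sd = mean([(x - mu) ** 2 for x in train_col]) ** 0.5
    return abs(mean(prod_col) - mu) > tolerance * sd

def balance(labels):
    # Class distribution as fractions, to compare across splits.
    counts = Counter(labels)
    n = len(labels)
    return {k: counts[k] / n for k in counts}

def spot_check_flag(n_wrong, max_wrong=3):
    # The checklist rule: more than 3 wrong labels in a 100-label
    # review means the labelling process itself needs attention.
    return n_wrong > max_wrong

train = [random.gauss(0, 1) for _ in range(1000)]
prod_same = [random.gauss(0, 1) for _ in range(1000)]
prod_shifted = [random.gauss(1.0, 1) for _ in range(1000)]

print(drift_flag(train, prod_same))      # False: no drift
print(drift_flag(train, prod_shifted))   # True: distribution shifted
print(balance([0, 0, 0, 1]))             # {0: 0.75, 1: 0.25}
print(spot_check_flag(n_wrong=5))        # True: review the process
```

The leakage and fairness checks follow the same pattern: a small function, a threshold, and a flag that blocks launch until someone looks.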

When to say no: the project can’t be done yet

Sometimes the answer is that the data isn’t ready. That’s a valid engineering output. A mediocre model shipping with bad data is worse than no model at all, because the bad predictions will be used in decisions and will need to be explained in retrospect.

Push back on timelines when:

  1. The label spot-check fails and there’s no budget to re-label.
  2. Training and production distributions have visibly diverged.
  3. Error rates split sharply across demographic groups and the gaps trace back to dataset coverage.
  4. The dataset is simply too small to reach the accuracy the product needs.

Saying “not yet” is one of the most senior things an ML engineer can do. Do it when the data says so.