Synthetic Data: The Quality Paradox
Synthetic data lets you train without paying for human labels, but it can also quietly degrade quality if you do not handle it carefully. That is the paradox of generating your own training set.
Why use synthetic data
Human-labelled data is expensive, slow to produce, and limited to what humans can write. Synthetic data, generated by a model, is fast, cheap, and unbounded. It also exposes the model to scenarios real data may not cover.
For specific tasks (math, code, structured extraction), synthetic data has driven measurable capability gains since 2023. The DeepSeek and Phi families lean heavily on it.
The collapse risk
If you train a model on its own outputs, then on the next generation’s outputs, and so on, quality degrades. The model converges to a narrow distribution that loses the long tail of real data. This is “model collapse,” demonstrated rigorously in 2024 papers.
The mechanism: each generation amplifies the most-likely outputs and forgets rare ones. Like a lossy photocopy of a photocopy.
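The photocopy effect is easy to see in a toy simulation. The sketch below (an illustrative model, not a training run) treats each "generation" as sampling with replacement from the previous generation's outputs: rare values drop out and never come back, so diversity shrinks monotonically.

```python
import random

random.seed(0)

# Start with a "real" dataset: 1,000 distinct values, standing in for the
# full distribution including its long tail.
data = list(range(1000))
initial_unique = len(set(data))

# Each generation trains on the previous generation's outputs. Here that is
# modelled as sampling with replacement: common values get amplified,
# rare values are lost for good.
for generation in range(10):
    data = random.choices(data, k=len(data))

final_unique = len(set(data))
print(initial_unique, final_unique)  # unique values shrink every generation
```

Each pass loses roughly a third of the remaining distinct values at first; after a handful of generations most of the tail is gone, which is the collapse dynamic in miniature.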
Mitigations
- Mix with real data. Even 10-20% real human-written data anchors the distribution.
- Filter aggressively. Discard synthetic examples that fail quality checks. Quality > quantity.
- Distillation, not generation. Use a strong teacher model to label real inputs, rather than generating both inputs and labels.
- Diversity scoring. Embed synthetic examples; penalise duplicates.
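A minimal sketch of the diversity-scoring idea: embed each synthetic example and drop anything too close to an example already kept. Here a bag-of-words vector stands in for a real sentence embedding, and the `0.9` cosine threshold is an arbitrary illustrative choice; both are assumptions, not a prescribed recipe.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for a real sentence embedding: bag-of-words counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def diversity_filter(examples, threshold=0.9):
    """Keep an example only if it is not a near-duplicate of one already kept."""
    kept, vecs = [], []
    for ex in examples:
        v = embed(ex)
        if all(cosine(v, kv) < threshold for kv in vecs):
            kept.append(ex)
            vecs.append(v)
    return kept

batch = [
    "Solve for x: 2x + 3 = 11",
    "Solve for x: 2x + 3 = 11",        # duplicate, gets dropped
    "Write a function that reverses a list",
]
print(diversity_filter(batch))  # keeps the two distinct examples
```

In production you would swap the bag-of-words vectors for proper embeddings and use an approximate-nearest-neighbour index rather than the quadratic scan shown here, but the keep-or-drop logic is the same.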
Where it works now
Mathematical reasoning, code, structured extraction, and instruction-following formats all benefit from carefully curated synthetic data. Open-ended creative tasks benefit less.
The 2026 production pattern: real data for the floor, synthetic for the volume, aggressive filtering throughout. Roughly 10:90 real-to-synthetic in many fine-tuning recipes, with quality controls dominating the work.
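The mixing step of that pattern can be sketched in a few lines. The `mix` helper and the 10% real fraction below are illustrative assumptions matching the rough ratio mentioned above; real pipelines would also weight, shuffle, and filter far more carefully.

```python
import random

random.seed(42)

# Hypothetical corpora: a small real set and a large synthetic pool.
real = [f"real-{i}" for i in range(50)]
synthetic = [f"syn-{i}" for i in range(2000)]

def mix(real, synthetic, total=1000, real_fraction=0.10):
    # Oversample the real data (with replacement) so it fills its share
    # of the mix, then fill the rest from the synthetic pool.
    n_real = int(total * real_fraction)
    batch = random.choices(real, k=n_real) + random.sample(synthetic, total - n_real)
    random.shuffle(batch)
    return batch

batch = mix(real, synthetic)
print(sum(x.startswith("real-") for x in batch))  # prints 100: 10% of 1000
```

Oversampling the small real set with replacement is one common way to hold the real fraction steady when the synthetic pool dwarfs it; an alternative is to upweight real examples in the loss instead.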