Scaling Laws: Chinchilla, Hoffmann, and Beyond
Scaling laws are empirical curves that predict model loss given parameter count and training data. They turned LLM training from craft into engineering.
What scaling laws are
Scaling laws are empirical relationships discovered by training many small models and fitting curves to their loss. They predict: given X parameters and Y training tokens, the test loss will be roughly Z.
The shape is power-law: each doubling of compute cuts the reducible loss by a roughly constant factor, so the curve is a straight line on log-log axes. The constants depend on architecture, data, and optimiser, but the qualitative shape is remarkably consistent.
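The fitted curves typically take a parametric form like the one in Hoffmann et al.: L(N, D) = E + A/N^α + B/D^β, where N is parameters and D is tokens. A minimal sketch, using the constants published in the Chinchilla paper (treat them as illustrative — refitting on different data or architectures gives different values):

```python
# Chinchilla-style parametric loss: L(N, D) = E + A / N**alpha + B / D**beta
# Constants are the Chinchilla paper's published fits; they depend on the
# architecture, data, and optimiser used, so treat them as illustrative.
E, A, B = 1.69, 406.4, 410.7
alpha, beta = 0.34, 0.28

def predicted_loss(n_params: float, n_tokens: float) -> float:
    """Predicted test loss for N parameters trained on D tokens."""
    return E + A / n_params**alpha + B / n_tokens**beta

# E is the irreducible loss floor; the two power-law terms shrink as you
# add parameters and data respectively.
print(predicted_loss(70e9, 1.4e12))   # roughly the Chinchilla-70B regime
print(predicted_loss(140e9, 2.8e12))  # doubling both shaves the reducible part
```

The additive form makes the trade-off explicit: a huge model starved of data is still stuck on the B/D^β term, which is the quantitative sense in which pre-Chinchilla models were undertrained.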
Kaplan 2020
The first major paper (Kaplan et al., OpenAI). Trained dozens of models spanning several orders of magnitude, from under a million parameters up to roughly a billion, and extrapolated. Conclusion: model size matters more than data size; bigger models are more sample-efficient, so a fixed compute budget should go mostly into parameters.
This drove the 2020-2021 trajectory of GPT-3 and similar "train a giant model on whatever data we have" runs. In retrospect, Kaplan's extrapolation was wrong: the fitted exponents were skewed by training choices (notably learning-rate schedules not tuned to each token budget), which made data look less valuable than it is.
Chinchilla 2022
Hoffmann et al. (DeepMind) re-examined the relationship at much larger scale. Their finding: for a given compute budget, parameters and tokens should be scaled in equal proportion — quadruple the compute, double both. Earlier models had been undertrained.
The Chinchilla rule of thumb: ~20 tokens per parameter. A 70B model should see ~1.4T training tokens for compute-optimal performance. The Chinchilla-70B model, trained per this prescription, beat the 280B Gopher model on most benchmarks at one-quarter the parameter count.
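Combined with the standard approximation that training costs about 6·N·D FLOPs, the 20-tokens-per-parameter rule pins down a unique compute-optimal (N, D) pair for any budget. A sketch of that arithmetic:

```python
import math

TOKENS_PER_PARAM = 20  # Chinchilla rule of thumb

def chinchilla_optimal(compute_flops: float) -> tuple[float, float]:
    """Split a training budget C ~= 6 * N * D into compute-optimal N and D.

    With D = 20 * N, C = 6 * N * (20 * N) = 120 * N**2, so N = sqrt(C / 120).
    """
    n_params = math.sqrt(compute_flops / (6 * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens

# Sanity check against Chinchilla-70B's budget: 6 * 70e9 * 1.4e12 FLOPs
n, d = chinchilla_optimal(6 * 70e9 * 1.4e12)
print(f"{n:.2e} params, {d:.2e} tokens")  # recovers ~7.0e10 params, ~1.4e12 tokens
```

Plugging in Gopher's compute budget the same way shows why a 280B model at that budget is far off the optimal ratio.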
This realigned the entire field. After Chinchilla, training runs targeted compute-optimal token counts, which often meant much more data on smaller models.
After Chinchilla
The 2023-2025 trajectory complicated the picture. Several observations:
- Chinchilla-optimal is for training compute only. If you also care about inference cost, you want even more tokens per parameter (smaller models, more data), because per-token inference cost is proportional to parameter count.
- Llama and Mistral models train far past Chinchilla-optimal token counts — often 10x or more tokens per parameter. Empirically this over-training keeps improving downstream task performance.
- Synthetic data adds a wrinkle: scaling laws derived from real-text data may not hold in regimes where synthetic data dominates.
The current rule of thumb for production-oriented training: train smaller than Chinchilla suggests, on more data than Chinchilla suggests. Inference cost dominates training cost over a model’s lifetime.
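The trade-off above can be sketched with the usual approximations — training ≈ 6·N·D FLOPs, inference ≈ 2·N FLOPs per generated token. The served-token volume and the 8B/70B comparison below are hypothetical numbers chosen for illustration:

```python
def lifetime_flops(n_params: float, train_tokens: float, served_tokens: float) -> float:
    """Approximate lifetime compute: 6*N*D to train, plus 2*N per served token."""
    return 6 * n_params * train_tokens + 2 * n_params * served_tokens

SERVED = 1e14  # hypothetical lifetime inference volume, in tokens

# Chinchilla-optimal 70B vs. an over-trained 8B at the same training budget
# (6 * 8e9 * 12.25e12 == 6 * 70e9 * 1.4e12 FLOPs of training compute).
big = lifetime_flops(70e9, 1.4e12, SERVED)
small = lifetime_flops(8e9, 12.25e12, SERVED)
print(f"70B lifetime: {big:.2e} FLOPs")
print(f"8B  lifetime: {small:.2e} FLOPs")  # far cheaper once inference dominates
```

At this (made-up) serving volume, inference dwarfs training for both models, so the smaller over-trained model wins on lifetime compute even though it ends training at a slightly higher loss.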
What scaling laws don’t say
Scaling laws predict loss. Loss correlates with capability but not perfectly. Several phenomena escape scaling laws:
- Emergent capabilities: certain abilities appear suddenly at certain scales without smooth precursors.
- Long-tail knowledge: rare facts don’t track the loss curve.
- Reasoning depth: not predicted by parameter count alone.
- Alignment: post-training behaviour doesn't track pretraining loss.
Scaling laws are essential planning tools. They’re not a complete theory of capability. The remaining gaps are where the interesting research happens.