AI & ML Advanced · By Samson Tanimawo, PhD · Published Dec 9, 2025 · 8 min read

Scaling Laws: Chinchilla, Hoffmann, and Beyond

Scaling laws are empirical curves that predict model loss given parameter count and training data. They turned LLM training from craft into engineering.

What scaling laws are

Scaling laws are empirical relationships discovered by training many small models and fitting curves to their loss. They predict: given X parameters and Y training tokens, the test loss will be roughly Z.

The shape is a power law: each doubling of compute shrinks the reducible loss by a roughly constant factor, so the curve is a straight line on a log-log plot. The constants depend on architecture, data, and optimiser, but the qualitative shape is remarkably consistent.
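A quick sketch of how such curves are used in practice: fit the power-law form to a handful of small runs, then extrapolate. The loss values below are generated from illustrative Chinchilla-style constants, not real measurements.

```python
import numpy as np
from scipy.optimize import curve_fit

# Power-law form: L(N) = a * N^(-alpha) + c, where c is an
# irreducible loss floor and N is the parameter count.
def power_law(n, a, alpha, c):
    return a * n ** (-alpha) + c

# Synthetic "measurements" from small runs, generated from
# illustrative constants (not real training data).
true_a, true_alpha, true_c = 406.4, 0.34, 1.69
model_sizes = np.array([1e6, 3e6, 1e7, 3e7, 1e8])
losses = power_law(model_sizes, true_a, true_alpha, true_c)

# Fit the curve to the small runs...
(a, alpha, c), _ = curve_fit(power_law, model_sizes, losses,
                             p0=[400.0, 0.3, 1.5])

# ...then extrapolate to a model size never trained.
predicted_1b = power_law(1e9, a, alpha, c)
```

The entire value of the method is in that last line: five cheap runs buy a prediction about a run a thousand times larger.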

Kaplan 2020

The first major paper (Kaplan et al., OpenAI). Trained dozens of models spanning several orders of magnitude, from under a million to over a billion parameters, and extrapolated. Conclusion: model size matters more than data size; bigger models are more sample-efficient.

This drove the 2020-2021 trajectory of GPT-3 and similar “train a giant model on whatever data we have” runs. In retrospect, Kaplan’s extrapolation was wrong; Hoffmann et al. later attributed the discrepancy in part to learning-rate schedules that were not tuned to each run’s token budget.

Chinchilla 2022

Hoffmann et al. (DeepMind) re-examined the relationship at much larger scale. Their finding: for a given compute budget, parameters and tokens should be scaled in equal proportion — double the compute, and both the optimal model size and the optimal token count grow by roughly √2. Earlier models had been undertrained.
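The equal-proportion result falls out of Hoffmann et al.’s parametric loss fit. The sketch below sweeps how a fixed FLOP budget is split between parameters and tokens, using the constants reported in the paper and the standard C ≈ 6·N·D training-FLOPs approximation:

```python
import numpy as np

# Hoffmann et al.'s fitted parametric loss: L(N, D) = E + A/N^alpha + B/D^beta
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(n_params, n_tokens):
    return E + A / n_params**alpha + B / n_tokens**beta

# Fix a compute budget and sweep the split between N (params) and D (tokens),
# using the approximation C ≈ 6*N*D training FLOPs.
C = 1e23
candidate_n = np.logspace(9, 12, 400)   # parameter counts to try
implied_d = C / (6 * candidate_n)       # tokens implied by the budget
n_best = candidate_n[np.argmin(loss(candidate_n, implied_d))]
```

Re-running with a doubled budget moves `n_best` up by roughly √2: parameters and tokens grow together, each absorbing about half the extra compute.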

The Chinchilla rule of thumb: ~20 tokens per parameter. A 70B model should see ~1.4T training tokens for compute-optimal performance. The Chinchilla-70B model, trained per this prescription, beat the 280B Gopher model on most benchmarks at one-quarter the parameter count.
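The 20-tokens-per-parameter rule combines with the standard C ≈ 6·N·D FLOP estimate into a back-of-envelope allocator — a sketch of the heuristic, not the paper’s actual fitting procedure:

```python
def chinchilla_split(compute_flops):
    """Split a FLOP budget C between parameters N and tokens D using
    two heuristics: C ~ 6*N*D and D ~ 20*N. Substituting the second
    into the first gives C ~ 120*N**2, so N = sqrt(C/120)."""
    n_params = (compute_flops / 120) ** 0.5
    n_tokens = 20 * n_params
    return n_params, n_tokens

# Chinchilla's own budget: 6 * 70e9 params * 1.4e12 tokens ~ 5.9e23 FLOPs.
# Feeding that budget back in recovers the 70B / 1.4T split.
n, d = chinchilla_split(5.9e23)
```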

This realigned the entire field. After Chinchilla, training runs targeted compute-optimal token counts, which often meant much more data on smaller models.

After Chinchilla

The 2023-2025 trajectory complicated the picture. Compute-optimal turned out not to be deployment-optimal: a model is trained once but served billions of times, so labs began training far past the Chinchilla point. Meta’s Llama models are the clearest example — Llama 3 8B saw roughly 15T tokens, nearly a hundred times the ~20-tokens-per-parameter prescription. Data quality and curation also proved to shift the fitted constants, not just the raw token count.

The current rule of thumb for production-oriented training: train smaller than Chinchilla suggests, on more data than Chinchilla suggests. Inference cost dominates training cost over a model’s lifetime.
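A toy lifetime-cost comparison makes the point. The 6·N FLOPs per training token and 2·N FLOPs per generated token are standard approximations; the lifetime traffic figure and the pairing of the two models are assumptions for illustration only:

```python
def lifetime_flops(n_params, train_tokens, served_tokens):
    train = 6 * n_params * train_tokens   # ~6*N FLOPs per training token
    serve = 2 * n_params * served_tokens  # ~2*N FLOPs per generated token
    return train + serve

served = 1e13  # tokens served over the deployment lifetime (assumption)

# Chinchilla-optimal 70B vs an over-trained 8B (Llama-3-style token count).
big   = lifetime_flops(70e9, 1.4e12, served)
small = lifetime_flops(8e9, 15e12, served)
```

With these numbers the 8B model costs more to train but far less to serve; past roughly a trillion served tokens, its total lifetime cost drops below the 70B model’s. That arithmetic is why production labs over-train small models.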

What scaling laws don’t say

Scaling laws predict loss. Loss correlates with capability, but not perfectly. Several phenomena escape them: abilities that appear abruptly at scale rather than improving smoothly, tasks where performance worsens as models grow (inverse scaling), and downstream behaviours — factuality, instruction-following, safety — that a single loss number does not capture.

Scaling laws are essential planning tools. They’re not a complete theory of capability. The remaining gaps are where the interesting research happens.