Energy and Sustainability in ML
Training a frontier model consumes on the order of 100 GWh, closer to a small town's annual electricity use than a small country's. Inference at scale dwarfs training. The footprint matters, and some jurisdictions now regulate its reporting.
The math
Training GPT-4-class models: ~100 GWh, roughly the annual electricity use of ten thousand US households (a US household averages ~10.5 MWh/year). A single H100 draws ~0.7 kW at TDP; with host, networking, and cooling overhead, budget roughly 1.5 kW per GPU. A 1,000-GPU inference cluster: ~1.5 MW continuous, ~13 GWh/year. Production fleets run tens of thousands of GPUs, so inference dwarfs training at scale.
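The cluster arithmetic can be sketched as a quick back-of-envelope script. The per-GPU draw is an assumption (TDP plus node and cooling overhead), not a measurement:

```python
# Back-of-envelope inference cluster energy.
# GPU_POWER_KW is an assumed figure: ~0.7 kW TDP plus host,
# networking, and cooling overhead. Not a measured value.

GPU_POWER_KW = 1.5
CLUSTER_GPUS = 1_000
HOURS_PER_YEAR = 8_760

cluster_mw = GPU_POWER_KW * CLUSTER_GPUS / 1_000
annual_gwh = cluster_mw * HOURS_PER_YEAR / 1_000

print(f"Cluster draw: {cluster_mw:.1f} MW continuous")
print(f"Annual energy: {annual_gwh:.0f} GWh/year")
```

Swapping in your own measured per-GPU draw and fleet size is the whole exercise; the structure of the estimate does not change.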
Efficiency gains
Per-token energy has dropped roughly 10x in three years through:
- Quantisation (4-bit weights cut memory traffic, the bottleneck in decode, and reduce compute where hardware supports low-precision math).
- Mixture of experts (only a few experts are active per token, so far fewer FLOPs per token than a dense model of the same size).
- Speculative decoding (a small draft model proposes tokens that the large model verifies in one forward pass, yielding more accepted tokens per pass).
- Better hardware (H100 → H200 → B100, with each generation improving performance per watt).
The trajectory: per-token energy keeps falling faster than usage grows for now.
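How these techniques compound into an order-of-magnitude saving can be sketched with illustrative multipliers. Every number below is an assumption chosen to show the compounding effect, not a benchmark:

```python
# Sketch: compounding per-token energy savings.
# Baseline and all multipliers are illustrative assumptions.

BASELINE_J_PER_TOKEN = 3.0  # assumed fp16 dense-model baseline

multipliers = {
    "4-bit quantisation": 0.5,    # assumed: memory-bound decode roughly halves
    "mixture of experts": 0.5,    # assumed: ~2x fewer active params per token
    "speculative decoding": 0.7,  # assumed: ~1.4x tokens per target-model pass
    "newer hardware": 0.6,        # assumed: generational perf/W gain
}

energy = BASELINE_J_PER_TOKEN
for name, m in multipliers.items():
    energy *= m
    print(f"after {name}: {energy:.3f} J/token")

print(f"overall reduction: {BASELINE_J_PER_TOKEN / energy:.0f}x")
```

Four independent ~1.4-2x wins multiply to roughly 10x, which is why no single technique explains the trend.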
Reporting
EU companies above certain size thresholds must report Scope 1, 2, and 3 emissions, including ML compute, under the Corporate Sustainability Reporting Directive (CSRD). Cloud providers increasingly publish per-region emissions data.
What teams should do
- Track per-feature compute and emissions.
- Pick lower-carbon regions when latency permits.
- Use efficiency techniques (quantisation, caching, routing) for both cost and emissions.
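The first two practices can be combined into one small accounting function. The grid intensities and workload figures below are illustrative placeholders, and the region names are hypothetical, not any provider's actual region identifiers:

```python
# Sketch: per-feature emissions tracking and carbon-aware region choice.
# Grid intensities (g CO2e/kWh) are illustrative placeholders, not live data.

GRID_G_CO2_PER_KWH = {
    "us-east": 380,
    "eu-north": 40,
    "asia-se": 480,
}

def feature_emissions_kg(gpu_hours: float, gpu_kw: float, region: str) -> float:
    """Emissions for one feature's inference workload, in kg CO2e."""
    energy_kwh = gpu_hours * gpu_kw
    return energy_kwh * GRID_G_CO2_PER_KWH[region] / 1_000

# Same workload, different regions: latency budget permitting, pick the greener one.
workload = dict(gpu_hours=2_000, gpu_kw=1.5)  # assumed monthly feature footprint
for region in GRID_G_CO2_PER_KWH:
    kg = feature_emissions_kg(region=region, **workload)
    print(f"{region}: {kg:.0f} kg CO2e")
```

The same workload can differ by 10x in emissions purely on region choice, which is why placement is usually the cheapest lever on this list.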