Mixture of Depths: The Follow-up to MoE
MoE varied which expert each token visited. Mixture of Depths varies how many layers each token visits. Easy tokens skip layers; hard tokens go deep.
The core idea
In a standard transformer, every token goes through every layer. A simple word like “the” passes through 96 transformer layers identically to a complex piece of reasoning. Wasteful.
Mixture of Depths (MoD) lets each token decide whether it needs the current layer’s computation. Easy tokens skip the layer (the residual passes through unchanged). Hard tokens get processed normally. Total compute drops; quality is preserved.
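The skip mechanism above can be sketched in a few lines. This is a minimal illustration, not any lab's implementation; `router_w`, `layer_fn`, and `capacity` are hypothetical names, and the "layer" is a stand-in function.

```python
import numpy as np

def mod_layer(x, router_w, layer_fn, capacity=0.5):
    """One Mixture-of-Depths layer, sketched: the top-K tokens by
    router score are processed; the rest pass through on the
    residual stream unchanged.

    x: (seq_len, d_model) token activations.
    """
    seq_len = x.shape[0]
    k = max(1, int(capacity * seq_len))
    scores = x @ router_w                     # one scalar score per token
    topk = np.argsort(scores)[-k:]            # indices of the k highest scorers
    out = x.copy()                            # skipped tokens: residual unchanged
    out[topk] = x[topk] + layer_fn(x[topk])   # processed tokens: residual + layer
    return out

# Toy usage with a "layer" that doubles its input.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))
w = rng.normal(size=(4,))
y = mod_layer(x, w, lambda h: 2 * h, capacity=0.5)
```

With `capacity=0.5`, exactly half of the eight tokens pick up the layer's contribution; the other half come out byte-identical to their inputs.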
How the router decides
At each layer, a small linear router scores every token. The top-K tokens (by score) get processed by the layer; the rest skip.
K is fixed in advance; typically 50% of tokens are processed and 50% skip. This keeps per-batch compute predictable (essential for production serving) while saving roughly half of each routed layer's cost.
The routing decision is per-layer and per-token. The same token might be processed in some layers and skipped in others.
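Because each layer has its own router, the set of processed tokens can change from layer to layer. A toy sketch of a three-layer stack (shapes, names, and the `tanh` layer body are all illustrative):

```python
import numpy as np

def route_mask(x, router_w, k):
    """Per-layer top-k mask: True where the token is processed."""
    scores = x @ router_w
    mask = np.zeros(x.shape[0], dtype=bool)
    mask[np.argsort(scores)[-k:]] = True
    return mask

rng = np.random.default_rng(1)
seq_len, d_model, n_layers, k = 8, 4, 3, 4
x = rng.normal(size=(seq_len, d_model))
masks = []
for _ in range(n_layers):
    w = rng.normal(size=(d_model,))       # each layer learns its own router
    mask = route_mask(x, w, k)
    masks.append(mask)
    # Stand-in for the layer body; skipped rows pass through unchanged.
    x = np.where(mask[:, None], x + np.tanh(x), x)

# Row i of per_token shows which layers processed token i; the
# pattern may differ token to token and layer to layer.
per_token = np.stack(masks, axis=1)       # (seq_len, n_layers)
```

Each layer still processes exactly k tokens, so total compute stays fixed even though any individual token's depth varies.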
Training challenges
Two issues that took research to solve:
- Causal-aware routing: in autoregressive generation, the router can't see future tokens, so it must decide from past context alone. Recent work addresses this with auxiliary predictors.
- Load balancing: as with MoE, naive routing collapses into always-process or never-process; an auxiliary loss pushes the router toward the target K-fraction.
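One simple form such an auxiliary loss can take (an illustrative sketch, not the exact loss from any paper): treat the router logit as a probability of processing, and penalize the squared deviation of its batch mean from the target fraction.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def load_balance_loss(router_logits, target_fraction=0.5):
    """Penalize the router when its average 'process' probability
    drifts from the target K-fraction. Illustrative form only."""
    p = sigmoid(router_logits)        # per-token process probability
    return (p.mean() - target_fraction) ** 2

# Logits symmetric around zero average to probability 0.5: zero loss.
balanced = np.array([-2.0, 2.0, -1.0, 1.0, 0.0])
# A collapsed router that wants to process everything is penalized.
collapsed = np.array([5.0, 5.0, 5.0, 5.0, 5.0])
```

Added to the main loss with a small weight, this keeps the router from drifting toward the degenerate all-or-nothing solutions.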
Combining with MoE
MoE and MoD are orthogonal. MoE varies which expert each token visits. MoD varies whether each token visits the layer at all. Combining them yields models with very large total parameters at very low per-token compute.
Mixture-of-Depth-and-Experts (MoDE) and similar combinations are an active research area. Early reports show 30-50% compute savings over MoE alone, with no accuracy loss.
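A rough sketch of how the two mechanisms compose: a depth router first selects the top-K tokens, and only those tokens pass through an MoE block whose gate then picks one expert each. All names here are hypothetical, the experts are tiny stand-in FFNs, and real MoDE variants differ in detail.

```python
import numpy as np

def moe_block(x, expert_ws, gate_w):
    """Tiny MoE: each token goes to its argmax expert (top-1 gating)."""
    choice = (x @ gate_w).argmax(axis=1)       # (tokens,) chosen expert ids
    out = np.empty_like(x)
    for e, w in enumerate(expert_ws):
        sel = choice == e
        out[sel] = np.tanh(x[sel] @ w)         # expert e's FFN stand-in
    return out

def mode_layer(x, depth_w, expert_ws, gate_w, capacity=0.5):
    """MoD gate around an MoE block: skipped tokens bypass both the
    depth router's layer and the experts entirely."""
    k = max(1, int(capacity * x.shape[0]))
    topk = np.argsort(x @ depth_w)[-k:]        # tokens worth spending compute on
    out = x.copy()                             # the rest ride the residual stream
    out[topk] = x[topk] + moe_block(x[topk], expert_ws, gate_w)
    return out

rng = np.random.default_rng(2)
d = 4
x = rng.normal(size=(8, d))
depth_w = rng.normal(size=(d,))
expert_ws = [rng.normal(size=(d, d)) for _ in range(2)]
gate_w = rng.normal(size=(d, 2))
y = mode_layer(x, depth_w, expert_ws, gate_w)
```

The compounding is visible in the compute accounting: half the tokens pay nothing at this layer, and each processed token pays for only one expert rather than the full FFN width.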
Where this stands in 2025
MoD is in research and limited deployment. Frontier labs are running it experimentally; it’s not yet in mainstream open-weight releases. Expect it to mature into the default in 2026, similar to MoE’s trajectory in 2023-2024.
For practitioners: not actionable yet. Watch the benchmarks. The technique that combines MoE, MoD, and Flash Attention will be the architecture of the late-2026 frontier.