Mixture of Experts (MoE), Explained
A 200B-parameter MoE costs about as much to run as a 20B dense model. That’s the trick: only a small fraction of the model is active for any single token.
The core idea
A standard transformer applies the same dense feedforward network to every token. A Mixture of Experts replaces that single feedforward with N parallel expert networks (typically 8-128) and a routing function that picks the top 1-2 experts per token.
The total parameter count balloons (8 experts at 4B each = 32B in MoE layers vs 4B in a dense model). But only the routed experts compute for any given token, so per-token compute stays close to the dense baseline.
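The arithmetic above is worth making concrete. A minimal sketch using the illustrative numbers from the text (not any specific model):

```python
# Back-of-envelope accounting for an MoE layer
# (illustrative numbers, not tied to a specific model).
n_experts = 8
params_per_expert = 4e9   # 4B parameters per expert
top_k = 2                 # experts active per token

total_params = n_experts * params_per_expert   # what must be stored: 32B
active_params = top_k * params_per_expert      # what computes per token: 8B

print(f"stored: {total_params / 1e9:.0f}B, active: {active_params / 1e9:.0f}B")
```

Stored parameters grow linearly with the number of experts; active parameters grow only with K.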
Result: dramatically more parameters (more capacity, more knowledge stored) at similar inference cost. The headline trick of modern frontier models.
How routing works
Each MoE layer has a small linear projection that scores every expert against the current token’s representation. The top-K experts (usually K=2) are selected. The token is routed to those experts; their outputs are summed, weighted by the normalised routing scores, and passed to the next layer.
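In code, that routing step looks roughly like this. A minimal NumPy sketch; the function and variable names are mine, and real implementations batch tokens per expert rather than looping per token:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def moe_layer(x, w_router, experts, k=2):
    """Minimal top-k routing sketch (names and shapes are illustrative).
    x: (tokens, d) activations; w_router: (d, n_experts); experts: callables."""
    scores = x @ w_router                     # score every expert per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        top = np.argsort(scores[t])[-k:]      # indices of the top-k experts
        weights = softmax(scores[t, top])     # normalise the routing scores
        for w, e in zip(weights, top):
            out[t] += w * experts[e](x[t])    # weighted sum of expert outputs
    return out
```

Production kernels gather all tokens destined for the same expert into one batched matmul instead of this per-token loop, which is where the all-to-all communication discussed later comes from.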
Three details that matter:
- Top-K typically equals 2. K=1 is faster but loses accuracy; gains plateau by around K=4.
- Auxiliary load-balancing loss. Without it, training collapses to using only a few experts. The auxiliary loss penalises uneven utilisation.
- Capacity factor. Each expert has a token budget per batch. Tokens beyond capacity are dropped or rerouted, which hurts quality. Tuning capacity factor matters.
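The load-balancing loss can be sketched concretely. Below is one common formulation, in the style of the Switch Transformer's auxiliary loss; exact details vary across implementations, and the function name is mine:

```python
import numpy as np

def load_balance_loss(router_probs, assignments, n_experts):
    """Switch-Transformer-style auxiliary loss (one common formulation).
    router_probs: (tokens, n_experts) softmax outputs of the router
    assignments:  (tokens,) index of the expert each token was dispatched to
    """
    # f_i: fraction of tokens dispatched to expert i
    f = np.bincount(assignments, minlength=n_experts) / len(assignments)
    # P_i: mean router probability mass on expert i
    P = router_probs.mean(axis=0)
    # Scaled so perfectly uniform routing gives 1.0; if the router
    # collapses onto one expert, the loss approaches n_experts.
    return n_experts * float(f @ P)
```

Adding a small multiple of this term to the training loss nudges the router toward spreading tokens evenly, which is what prevents the collapse described above.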
Why MoE wins at scale
The crucial insight: parameters store knowledge, but compute does the work. MoE decouples the two. You can have a model with 1T parameters (huge knowledge capacity) that trains and serves at the cost of a 100B dense model.
Empirically, MoE models match or beat dense models at the same inference FLOPs. For cost-bounded deployments they’re usually the better choice, with one caveat: memory.
The hard parts
Three challenges that have shaped MoE engineering:
- All experts in memory. Even though only 2 are active per token, all 32-128 must be loaded. Memory cost tracks the full parameter count, not the active count, which is huge. Inference servers spread this across multi-GPU rigs.
- Routing instability during training. Without careful auxiliary losses, expert utilisation collapses. The auxiliary loss design has been a research focus for the last two years.
- Communication overhead. Distributed MoE training requires all-to-all communication when tokens shuffle to their experts, so bandwidth becomes the bottleneck. Specialised libraries (MegaBlocks, Tutel) handle this.
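The memory point is easy to quantify. A rough sketch using approximate public figures for Mixtral 8x7B (roughly 47B total parameters, roughly 13B active per token):

```python
# Rough weight-memory estimate (approximate public figures for Mixtral 8x7B).
total_params = 47e9     # ~47B parameters stored across all experts
active_params = 13e9    # ~13B parameters actually used per token
bytes_per_param = 2     # fp16 / bf16 weights

weights_gb = total_params * bytes_per_param / 1e9
print(f"~{weights_gb:.0f} GB of weights resident")  # pays memory for 47B, computes with 13B
```

You pay memory for the full 47B even though each token only exercises about 13B, which is the tax discussed above.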
Production models that use it
By 2025, most frontier LLMs are MoE under the hood:
- Mixtral 8x7B and Mixtral 8x22B (open-weight, and a popular reference point for MoE).
- DeepSeek-V3, DeepSeek-R1.
- Reportedly: GPT-4 family, Gemini family, Claude (architectures aren’t fully disclosed).
- Switch Transformer, GShard (research-side foundational work).
The trajectory: MoE will be the default architecture for any model above ~30B parameters where inference cost matters more than memory cost.
When it matters for you
If you’re a consumer of LLMs: MoE matters because it’s why the frontier APIs got so much better at the same price point in 2024-2025. The model has more capacity at the same compute cost.
If you’re self-hosting: MoE saves money at sufficient scale. A 22B-active MoE that fits in 8x H100s outperforms a 70B dense model on the same hardware. Below that scale, the all-experts-in-memory tax negates the benefit.
If you’re training: MoE adds infrastructure complexity. Use MegaBlocks or DeepSpeed-MoE; don’t roll your own expert routing.