AI & ML · Intermediate
By Samson Tanimawo, PhD · Published Jul 15, 2025 · 8 min read

Mixture of Experts (MoE), Explained

A 200B-parameter MoE costs about as much to run as a 20B dense model. That’s the trick: only a small fraction of the model is active for any single token.

The core idea

A standard transformer applies the same dense feed-forward network to every token. A Mixture of Experts replaces that single feed-forward block with K parallel experts (typically 8-128), plus a routing function that picks the top 1-2 experts for each token.

The total parameter count balloons (8 experts at 4B each = 32B in MoE layers vs 4B in a dense model). But only the routed experts compute for any given token, so per-token compute stays close to the dense baseline.
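The accounting above can be sketched in a few lines, using the article's example numbers (8 experts of 4B parameters each, top-2 routing; the helper name is mine):

```python
# Illustrative parameter accounting for the MoE layers of a model.
# Total params must all be stored; only the top-K experts run per token.
def moe_params(num_experts: int, params_per_expert: float, top_k: int):
    total = num_experts * params_per_expert   # parameters stored
    active = top_k * params_per_expert        # parameters used per token
    return total, active

total, active = moe_params(num_experts=8, params_per_expert=4e9, top_k=2)
print(f"stored in MoE layers: {total / 1e9:.0f}B")   # 32B
print(f"active per token:     {active / 1e9:.0f}B")  # 8B, vs 4B dense
```

Note that with top-2 routing the active parameter count is twice the single dense expert, which is why per-token compute stays *close to*, not equal to, the dense baseline.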

Result: dramatically more parameters (more capacity, more knowledge stored) at similar inference cost. The headline trick of modern frontier models.

How routing works

Each MoE layer has a small linear projection that scores every expert against the current token’s representation. The top-K experts (usually K=2) are selected, and the token is routed to them. Their outputs are combined in a weighted sum, with weights given by the normalized routing scores, and the result is passed to the next layer.
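A minimal sketch of that routing step for a single token, assuming a hidden size of 16, 8 experts, and K=2. The experts here are plain linear maps for brevity (real experts are small MLPs), and all shapes and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, k = 16, 8, 2

x = rng.standard_normal(d_model)                      # token representation
W_router = rng.standard_normal((n_experts, d_model))  # router projection
# Stand-in experts: one linear map each (real experts are MLPs).
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]

logits = W_router @ x                  # score every expert for this token
top = np.argsort(logits)[-k:]          # indices of the top-K experts
gates = np.exp(logits[top])            # softmax over just the K scores
gates = gates / gates.sum()

# Weighted sum of the selected experts' outputs.
y = sum(g * (experts[i] @ x) for g, i in zip(gates, top))
print(y.shape)  # (16,)
```

Only the two selected experts do any matrix multiplication; the other six contribute nothing to this token's compute.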

Three details that matter:

Why MoE wins at scale

The crucial insight: parameters store knowledge, but compute does the work. MoE decouples the two. You can build a model with 1T parameters (huge knowledge capacity) that trains and runs inference at roughly the cost of a 100B dense model.

Empirically, MoE models match or beat dense models at the same inference FLOPs, which makes them the better choice for cost-bounded deployments. One caveat: memory. Every expert must be resident even though only a few run per token.

The hard parts

Three challenges that have shaped MoE engineering:

Production models that use it

By 2025, most frontier LLMs are MoE under the hood:

The trajectory: MoE will be the default architecture for any model above ~30B parameters where inference cost matters more than memory cost.

When it matters for you

If you’re a consumer of LLMs: MoE matters because it’s why the frontier APIs got so much better at the same price point in 2024-2025. The model has more capacity at the same compute cost.

If you’re self-hosting: MoE saves money at sufficient scale. A 22B-active MoE that fits in 8x H100s outperforms a 70B dense model on the same hardware. Below that scale, the all-experts-in-memory tax negates the benefit.
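A back-of-envelope check on that memory tax, at fp16 (2 bytes per parameter). The 141B total-parameter figure for the 22B-active MoE is an illustrative assumption, not from the article:

```python
BYTES_PER_PARAM = 2  # fp16 weights

def weight_gb(params: float) -> float:
    """Weight memory in GB for a given parameter count (fp16)."""
    return params * BYTES_PER_PARAM / 1e9

moe_total = 141e9   # ASSUMED total size of a 22B-active MoE
moe_active = 22e9
dense = 70e9

print(f"MoE weights:   {weight_gb(moe_total):.0f} GB "
      f"(only {weight_gb(moe_active):.0f} GB touched per token)")
print(f"dense weights: {weight_gb(dense):.0f} GB")
print(f"8x H100 HBM:   {8 * 80} GB")
```

All 282 GB of MoE weights must sit in GPU memory even though each token touches only 44 GB of them; on a single smaller node, that all-experts-in-memory tax is exactly what erases the advantage.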

If you’re training: MoE adds infrastructure complexity. Use Megablocks or DeepSpeed-MoE; don’t roll your own expert routing.