Mixture of Experts

Mixture of Experts (MoE) is a neural network architecture that uses several specialized expert sub-networks and a router that activates only a few of them for each input token. It lets a model grow its total parameter count dramatically while keeping the actual computation per token limited to a small subset of experts.

  • MoE replaces a single large neural network with multiple expert sub-networks, using a router to select and activate only a few experts per token in a sparse design.
  • Because the total parameter count is large but each token is processed by only the selected experts, the architecture offers far greater model capacity at roughly the same per-token compute cost.
  • The two essentials are the router's top-k routing and load balancing across experts, with an auxiliary loss trained alongside to keep tokens from piling onto a single expert.
  • Mixtral 8x7B uses only 2 of its 8 experts per token, so despite 46.7B total parameters it activates just 12.9B per token; Mistral reports 6x faster inference than Llama 2 70B.
  • In the era of large language models, MoE is a core technique for decoupling model capacity from compute cost, and it underpins the design of models such as Mixtral and the Switch Transformer.

What Is Mixture of Experts

Mixture of Experts (MoE) is an architecture that, instead of treating a model as one monolithic neural network, builds it from multiple expert sub-networks plus a router (gating network) that chooses among them. When an input arrives, the router decides which expert each token should be sent to, and only the selected experts take part in the computation. Unlike a dense model that fires every parameter on every pass, MoE turns on only a handful of experts per token, and this sparse activation is its defining property.

The advantage of this design is that it decouples model capacity from compute cost. The total parameter count grows as you add experts, yet the actual computation for processing a single token stays confined to the small set of selected experts. In transformer-based LLMs, MoE is typically applied by replacing the feed-forward (FFN) block in some or all layers with several expert FFNs, while the attention layers remain shared by every token.

How It Works: Routing and Sparse Activation

The router sits at the center of the mechanism. For an input token vector x, the router multiplies by a gating weight matrix Wg and applies a softmax to produce a score for each expert, then selects only the top k experts by score. This is called top-k routing. The order of the two steps varies by model: Mixtral, for example, selects the top-k logits first and applies the softmax only over those. The Switch Transformer uses k=1 (a single expert), while Mixtral uses k=2 (two experts). Experts that are not selected perform no forward pass for that token, which is where the compute savings come from.

# Conceptual routing (top-k)
gates = softmax(x @ W_g)        # score for each expert
top = topk(gates, k)           # select top-k experts (e.g., k=2)
y = sum( gates[i] * expert_i(x) for i in top )  # only selected experts compute
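
The conceptual routing above can be made concrete with a small runnable sketch. It uses plain Python lists rather than a tensor library so that it stays dependency-free, and everything in it (the `route_token` function, the toy scaling "experts", the gating matrix values) is hypothetical and chosen only for illustration. It follows the softmax-then-top-k order described above and does not renormalize the selected gates.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(x, w_g, experts, k=2):
    """Send one token vector x through its top-k experts.

    x:       list of d floats (the token vector)
    w_g:     d x n_experts gating matrix, as a list of rows
    experts: list of callables, each mapping a d-vector to a d-vector
    Returns (output vector, indices of the experts that actually ran).
    """
    n_experts = len(experts)
    # Router logits: x @ W_g, one score per expert.
    logits = [sum(x[i] * w_g[i][e] for i in range(len(x)))
              for e in range(n_experts)]
    gates = softmax(logits)
    # Keep the k highest-scoring experts.
    top = sorted(range(n_experts), key=lambda e: gates[e], reverse=True)[:k]
    # Weighted sum over the selected experts only; the rest are never called.
    y = [0.0] * len(x)
    for e in top:
        out = experts[e](x)
        y = [acc + gates[e] * o for acc, o in zip(y, out)]
    return y, top

# Toy setup: 4 hypothetical "experts" that just scale their input.
calls = []

def make_expert(idx, scale):
    def expert(v):
        calls.append(idx)  # record that this expert did real work
        return [scale * t for t in v]
    return expert

experts = [make_expert(i, float(i + 1)) for i in range(4)]
w_g = [[0.1, 0.9, 0.3, 0.5],
       [0.2, 0.1, 0.8, 0.4]]
x = [1.0, 2.0]

y, chosen = route_token(x, w_g, experts, k=2)
print("experts selected:", chosen)
print("experts executed:", calls)
```

Only the two selected experts are ever called, which is the source of the compute savings. A production layer would batch tokens per expert and might renormalize the selected gates so they sum to 1; this sketch leaves both out for brevity.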

Sparse routing brings one complication along with it. As training proceeds, tokens can start funneling toward a few popular experts while the rest go almost unused, producing an imbalance. To prevent this, MoE adds a load-balancing auxiliary loss that encourages the experts to be used evenly. Hugging Face's MoE explainer describes adding an auxiliary loss so that "all experts are given equal importance," and notes techniques such as the router z-loss, which penalizes large logits entering the router to improve training stability.
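
To make the two losses concrete, here is a minimal sketch. The text above does not give their exact form, so this assumes the formulation commonly attributed to the Switch Transformer for load balancing (alpha * N * sum of f_i * P_i) and a mean squared log-sum-exp for the router z-loss. The function names, the coefficient `alpha=0.01`, and the toy inputs are illustrative assumptions.

```python
import math

def load_balancing_loss(router_probs, assignments, n_experts, alpha=0.01):
    """Auxiliary loss alpha * N * sum_i f_i * P_i (assumed formulation).

    router_probs: per-token lists of softmax probabilities over experts
    assignments:  per-token index of the expert the token was sent to (top-1)
    f_i = fraction of tokens dispatched to expert i
    P_i = mean router probability assigned to expert i
    """
    n_tokens = len(assignments)
    f = [assignments.count(i) / n_tokens for i in range(n_experts)]
    p = [sum(probs[i] for probs in router_probs) / n_tokens
         for i in range(n_experts)]
    return alpha * n_experts * sum(fi * pi for fi, pi in zip(f, p))

def router_z_loss(router_logits):
    """Mean squared log-sum-exp of each token's router logits."""
    total = 0.0
    for logits in router_logits:
        peak = max(logits)
        lse = peak + math.log(sum(math.exp(v - peak) for v in logits))
        total += lse ** 2
    return total / len(router_logits)

# Balanced routing: 4 tokens spread evenly over 4 experts.
uniform = [[0.25, 0.25, 0.25, 0.25]] * 4
balanced = load_balancing_loss(uniform, [0, 1, 2, 3], n_experts=4)

# Collapsed routing: every token prefers, and is sent to, expert 0.
peaked = [[0.97, 0.01, 0.01, 0.01]] * 4
collapsed = load_balancing_loss(peaked, [0, 0, 0, 0], n_experts=4)

# The z-loss grows as the router logits grow in magnitude.
small_z = router_z_loss([[0.1, -0.1, 0.0, 0.2]])
large_z = router_z_loss([[10.0, -10.0, 0.0, 20.0]])

print("balanced vs collapsed:", balanced, collapsed)
print("small vs large logits:", small_z, large_z)
```

Under this formulation the load-balancing term is smallest when tokens and router probability are spread uniformly, so adding it to the training loss pushes the router away from collapsing onto one expert.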

Comparison with Dense Models

| Aspect | Dense Model | MoE (Sparse) Model |
|---|---|---|
| Activation scope | All parameters used on every token | Only the experts selected per token |
| Compute vs. parameters | Compute grows in proportion to parameter count | Total parameters are large, but per-token compute stays roughly constant |
| Pretraining speed | Baseline | Reaches the same quality faster (up to about 7x for the Switch Transformer) |
| Memory (VRAM) | All parameters are loaded, and all are used | All experts must be held in memory, even those not actually used |
| Fine-tuning | Relatively stable | More prone to overfitting, requires separate hyperparameters |

Significance and Evidence

The MoE concept itself dates back to the 1991 work of Jacobs et al., "Adaptive Mixtures of Local Experts," but the turning point for its serious application to modern large neural networks was the 2017 paper by Shazeer et al., "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer" (arXiv:1701.06538). That paper proposed a sparsely-gated MoE layer built from up to thousands of experts, showing that model capacity could be scaled by more than 1000x with only minor losses in computational efficiency.

Google's Switch Transformer (Fedus et al., arXiv:2101.03961) later simplified top-k routing to the minimal case of k=1, cutting routing computation and communication cost while preserving quality, and trained the Switch-C model at up to 1.571 trillion parameters. At the same compute budget, it reported up to a 7x pretraining speedup over dense T5 baselines (T5-Base and T5-Large), and a 4x speedup over T5-XXL for its trillion-parameter models.

A case that clearly demonstrated MoE's impact in a real, openly available LLM is Mistral AI's Mixtral 8x7B (arXiv:2401.04088). Each layer is composed of 8 experts and the router picks 2 per token, so while the total is 46.7B parameters, only 12.9B are activated per token. According to Mistral's official announcement, Mixtral handles a 32k-token context and outperforms Llama 2 70B on most benchmarks with "6x faster inference," while matching or surpassing GPT-3.5. In effect, it achieves the quality of a far larger model at the speed and cost of a 12.9B dense model.
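
The 46.7B and 12.9B figures can be reproduced with back-of-the-envelope arithmetic. The sketch below is only an estimate: the configuration values (width, expert FFN width, layer count, vocabulary size, grouped-query attention heads) are not given in this article and are assumed from the publicly released model configuration, and the function name is hypothetical.

```python
# Assumed configuration values (from the public model config, not from this article).
HIDDEN = 4096        # model width
FFN = 14336          # expert feed-forward width
LAYERS = 32
N_EXPERTS = 8
TOP_K = 2
VOCAB = 32000
N_HEADS = 32
N_KV_HEADS = 8       # grouped-query attention
HEAD_DIM = HIDDEN // N_HEADS

def mixtral_like_params(experts_counted):
    """Parameter count when `experts_counted` experts per layer are included."""
    expert = 3 * HIDDEN * FFN                     # three projections per expert FFN
    attn = (2 * HIDDEN * HIDDEN                   # query and output projections
            + 2 * HIDDEN * N_KV_HEADS * HEAD_DIM) # key and value projections
    router = HIDDEN * N_EXPERTS                   # gating matrix
    norms = 2 * HIDDEN                            # two norm layers per block
    per_layer = experts_counted * expert + attn + router + norms
    embeddings = 2 * VOCAB * HIDDEN               # input embedding + output head
    return LAYERS * per_layer + embeddings + HIDDEN  # plus the final norm

total = mixtral_like_params(N_EXPERTS)   # every expert counted
active = mixtral_like_params(TOP_K)      # only the experts a token passes through

print(f"total:  {total / 1e9:.1f}B")
print(f"active: {active / 1e9:.1f}B")
```

The attention, router, and embedding parameters are shared by every token, which is why the total is 46.7B rather than 8 x 7B = 56B, and why the active count is more than a quarter of the total.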

The trade-offs, however, are clear. At inference time every expert must be loaded into memory even though most go unused, which imposes a heavy VRAM burden, and during fine-tuning MoE is more susceptible to overfitting than a dense model. The Hugging Face MoE explainer points out that "even at the same pretraining perplexity, a sparse model can lag behind a dense one on reasoning-heavy downstream tasks," while also noting that MoE tends to gain more from instruction tuning than dense models do.
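
The memory side of this trade-off is easy to estimate from the parameter counts quoted above. This is a rough sketch that assumes 2 bytes per parameter (fp16 or bf16 weights) and ignores the KV cache, activations, and any quantization; the helper name is hypothetical.

```python
def weight_memory_gb(n_params, bytes_per_param=2):
    """Approximate weight storage in GB (10**9 bytes); 2 bytes = fp16/bf16."""
    return n_params * bytes_per_param / 1e9

TOTAL_PARAMS = 46.7e9    # Mixtral 8x7B, all experts (figure quoted above)
ACTIVE_PARAMS = 12.9e9   # parameters used per token (figure quoted above)

resident = weight_memory_gb(TOTAL_PARAMS)   # what must sit in memory
touched = weight_memory_gb(ACTIVE_PARAMS)   # what one token actually uses

print(f"resident weights: {resident:.1f} GB")
print(f"used per token:   {touched:.1f} GB")
print(f"ratio:            {resident / touched:.1f}x")
```

Under these assumptions the model computes like a 12.9B model but must be provisioned, in memory terms, like a 46.7B one.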

References