@jbhuang0604: Huge! It’s amazing how often Noam’s papers end up at the center of the field. In many tutorial videos I’ve made, they’v…
Summary
The article provides a detailed explanation of Mixture of Experts (MoE) in transformers, covering routing, load balancing, and recent innovations like fine-grained experts. It also highlights the significance of Noam Shazeer's research contributions and his move from Google to OpenAI.
Huge!
It’s amazing how often Noam’s papers end up at the center of the field. In many tutorial videos I’ve made, they’ve been a recurring foundation for explaining key ideas. GOAT!
MoE: https://youtu.be/0QQlYR1r6pQ
SwiGLU: https://youtu.be/JRaPNrpsQ9s
MQA: https://youtu.be/Y-o545eYjXM
Transformer: https://youtu.be/rcWMRA9E5RI
TL;DR: Mixture-of-Experts architecture enables scaling model parameters while maintaining training and inference efficiency by routing each token to only a sparse subset of experts, making it a key technology in many advanced AI models.
How Feed-Forward Networks (FFNs) Store and Retrieve Facts
Transformers are mainly composed of alternating stacked attention layers and feed-forward networks (FFNs). The attention mechanism captures contextual dependencies between tokens, while the FFN is responsible for retrieving factual information from model weights. Take a token embedding vector as an example: suppose we have the sentence “The capital of Taiwan is here,” and the model needs to complete it correctly. The FFN consists of three steps:
- Up-projection: A linear layer (weight matrix W_up, bias vector) maps the input embedding vector x to the hidden feature dimension (typically 4 times the input dimension), resulting in a vector z. Each row of W_up can be seen as a semantic concept; the dot product measures the relevance of the input to these concepts. For example, certain questions (semantic directions) may have a high response (positive value) to an input token, while others are near zero or negative.
- Nonlinear activation: An activation function like ReLU is applied to remove irrelevant responses, producing the activation vector a.
- Down-projection: Another linear layer (weight matrix W_down) maps a back to the original dimension. The column vectors of W_down encode factual information (e.g., “Taipei”, “TSMC”) related to specific questions. The output y is a weighted sum of the columns of W_down, weighted by a.
Finally, y is added back to the original token embedding via a residual connection, preserving contextual information while integrating newly retrieved facts.
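The three steps and the residual connection can be sketched in a few lines of plain Python. This is a minimal illustration, not a real transformer layer: the helper names (`matvec`, `relu`, `ffn`), the toy sizes (model dimension 2, hidden dimension 3), and all weight values are assumptions chosen so the arithmetic is easy to follow by hand.

```python
def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def relu(v):
    """Zero out negative entries."""
    return [max(0.0, x) for x in v]

def ffn(x, W_up, b_up, W_down):
    # Up-projection: one dot product per row of W_up (one "concept" per row).
    z = [s + b for s, b in zip(matvec(W_up, x), b_up)]
    # Nonlinear activation: drop negative (irrelevant) responses.
    a = relu(z)
    # Down-projection: a weighted sum of the columns of W_down, weighted by a.
    y = matvec(W_down, a)
    # Residual connection: add the retrieved information back to the input.
    return [xi + yi for xi, yi in zip(x, y)]

# Toy values (assumed for illustration only).
x = [1.0, 0.0]
W_up = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]]   # hidden x model = 3 x 2
b_up = [0.0, 0.0, 0.0]
W_down = [[0.5, 0.0, 0.0], [0.0, 2.0, 1.0]]    # model x hidden = 2 x 3
out = ffn(x, W_up, b_up, W_down)
```

Here only the first hidden unit responds positively to x, so only the first column of W_down contributes to the output.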
From Dense FFNs to Sparse Mixture of Experts (MoE)
Increasing the hidden dimension of FFNs enhances model expressiveness (more questions, finer-grained facts) but slows computation and increases memory usage. The solution is to let each token activate only a small set of relevant units. The core idea of sparse MoE: divide a large FFN into multiple smaller, more specialized “expert” networks, and activate only a sparse subset of experts for each token.
Routing Mechanism
The router computes a score (logit) for each token–expert pair, applies softmax to obtain a probability distribution, and selects the top-k experts with the highest probabilities. The input token is sent to the selected experts for processing, and the output is a weighted average of those experts’ outputs (weights are the router probabilities). The router is simple to implement: it has a learnable weight vector for each expert; logit = dot product (input embedding, expert weight vector).
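The routing steps described above can be sketched as follows. This is a minimal sketch for a single token under stated assumptions: the function names (`softmax`, `route`, `moe_layer`, `make_expert`) are hypothetical, the experts are stand-in functions that simply scale their input, and the selected probabilities are used as weights without renormalization, as the text describes. Some implementations renormalize the top-k weights so they sum to 1; that variant is not shown.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route(x, router_weights, k):
    """Return (expert_index, probability) pairs for the top-k experts."""
    # One learnable weight vector per expert; logit = dot(x, weight vector).
    logits = [sum(w * xi for w, xi in zip(w_e, x)) for w_e in router_weights]
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:k]
    return [(e, probs[e]) for e in top]

def moe_layer(x, router_weights, experts, k):
    """Combine the selected experts' outputs, weighted by router probability."""
    selected = route(x, router_weights, k)
    y = [0.0] * len(x)
    for e, p in selected:
        out = experts[e](x)
        y = [yi + p * oi for yi, oi in zip(y, out)]
    return y, selected

def make_expert(scale):
    """Stand-in expert (assumption): scales every input component."""
    return lambda v: [scale * vi for vi in v]

# Toy setup (assumed): 4 experts, top-2 routing, 2-dimensional token.
x = [1.0, 2.0]
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
y, selected = moe_layer(x, router_weights, experts, k=2)
```

With these toy weights the logits are 1, 2, -1, and -2, so experts 1 and 0 are selected, in that order, and the other two experts are never evaluated.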
Fine-Grained Experts and Shared Experts
Recent work (e.g., DeepSeek-V3) uses many small experts (256), activating 8 per token, an approach called “fine-grained experts”. This significantly improves performance and has been adopted by open-source models like Qwen 2.5, Kimi K2, etc. Another design includes a shared expert that is always activated, learning general information, while the routed experts focus on specific patterns. But experimental results vary: DeepSeek benefits, while OLMoE observed no improvement, potentially because converting one routed expert to a shared expert limits the diversity of expert combinations.
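The point about combination diversity can be made concrete by counting. The sketch below compares two hypothetical configurations with the same budget of 256 experts and 8 active per token: one where all 8 are routed, and one where a single expert is converted to an always-on shared expert, leaving 7 routed choices out of 255. This setup is an assumption for illustration and is not the exact configuration of any model named above.

```python
import math

# All 8 active experts are chosen by the router from 256 candidates.
routed_only = math.comb(256, 8)

# One expert is always on; the router picks the remaining 7 from 255.
with_shared = math.comb(255, 7)

# How many times fewer distinct expert combinations the second design allows.
ratio = routed_only // with_shared
```

Under these assumptions the shared-expert design has 32 times fewer possible expert combinations per token, since C(256, 8) = (256 / 8) · C(255, 7).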
Expert Capacity and Load Balancing
Capacity and Overflow Issues
In distributed deployment, each device has a fixed capacity (number of tokens it can handle). For example, with 16 tokens, 8 experts, and a capacity factor of 1, each expert can handle 2 tokens. However, routing is dynamic and imbalanced, causing some experts to overflow (tokens are dropped). In practice, dropped tokens are passed directly to the next layer via a residual connection, skipping the expert computation. Increasing the capacity factor reduces overflow but increases computation and communication costs (wasted padding from empty slots).
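The capacity arithmetic and the overflow behavior can be sketched as below. The assumptions are: top-1 routing, a first-come-first-served drop policy, and hypothetical helper names (`expert_capacity`, `dispatch`). The token-to-expert assignment is made up to show an imbalanced batch.

```python
import math

def expert_capacity(num_tokens, num_experts, capacity_factor, top_k=1):
    """Number of token slots each expert gets for one batch."""
    return math.ceil(num_tokens * top_k / num_experts * capacity_factor)

def dispatch(assignments, num_experts, capacity):
    """assignments[i] is the expert chosen for token i (top-1 routing)."""
    buckets = [[] for _ in range(num_experts)]
    dropped = []
    for token, expert in enumerate(assignments):
        if len(buckets[expert]) < capacity:
            buckets[expert].append(token)
        else:
            # Overflow: the token skips the expert and is carried
            # forward only by the residual connection.
            dropped.append(token)
    return buckets, dropped

# 16 tokens, 8 experts, imbalanced routing (assumed example).
assignments = [0] * 5 + [1] * 3 + [2, 2, 3, 3, 4, 5, 6, 7]

cap_1 = expert_capacity(16, 8, 1.0)             # 2 slots per expert
buckets_1, dropped_1 = dispatch(assignments, 8, cap_1)

cap_15 = expert_capacity(16, 8, 1.5)            # 3 slots per expert
buckets_15, dropped_15 = dispatch(assignments, 8, cap_15)

# Empty (padded) slots that still cost compute and communication.
padding_1 = cap_1 * 8 - sum(len(b) for b in buckets_1)
padding_15 = cap_15 * 8 - sum(len(b) for b in buckets_15)
```

In this example, raising the capacity factor from 1 to 1.5 cuts the dropped tokens from 4 to 2 but raises the number of empty slots from 4 to 10, which is the trade-off described above.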
Drop-Free Block Sparse Matrix Multiplication
Traditional implementations force each expert to receive the same number of tokens, leading to a trade-off between “overflow or waste”. A newer approach: represent expert computation as block diagonal matrix multiplication. When experts receive different numbers of tokens, each block has a different size (non-uniform routing), and block sparse matrix multiplication kernels can compute without instantiating zero blocks, avoiding dropping or padding.
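The logical structure of the drop-free approach can be sketched by grouping tokens per expert and letting each group have its own size. This is only an illustration of the grouping idea under stated assumptions: it does not implement a block sparse matrix multiplication kernel, tokens are single floats, and the experts are hypothetical stand-ins that scale their input.

```python
def group_by_expert(assignments, num_experts):
    """Collect token indices per expert; group sizes may differ."""
    groups = [[] for _ in range(num_experts)]
    for token, expert in enumerate(assignments):
        groups[expert].append(token)
    return groups

def dropless_moe(tokens, assignments, experts):
    """Process every token with its expert: no capacity limit, no padding."""
    outputs = [None] * len(tokens)
    groups = group_by_expert(assignments, len(experts))
    for expert_id, token_ids in enumerate(groups):
        # Each group corresponds to one variable-sized diagonal block.
        # An expert with no tokens contributes no work at all.
        for t in token_ids:
            outputs[t] = experts[expert_id](tokens[t])
    return outputs

def make_expert(scale):
    """Stand-in expert (assumption): multiplies the token by a constant."""
    return lambda token: token * scale

experts = [make_expert(float(e + 1)) for e in range(8)]
tokens = [float(t) for t in range(16)]
assignments = [0] * 5 + [1] * 3 + [2, 2, 3, 3, 4, 5, 6, 7]

groups = group_by_expert(assignments, 8)
outputs = dropless_moe(tokens, assignments, experts)
block_sizes = [len(g) for g in groups]
```

The block sizes here are 5, 3, 2, 2, 1, 1, 1, 1: every token is processed, and no expert is padded up to a common size.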
Load Balancing During Training
With random initial weights, all experts are equally poor. If the router assigns most tokens to a few experts, those experts get more updates and become better, creating a self-reinforcing cycle while other experts become nearly useless. The goal of load balancing is to distribute token processing evenly.
Method 1: Noisy Top-k Gating
Add Gaussian noise to the router logits before the top-k selection to encourage the router to select a more diverse set of experts. A learnable per-expert scaling factor can also be applied to the noise.
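A minimal sketch of noisy gating for one token is shown below. It assumes the noise scale is produced per expert by a softplus over a second learnable projection (`w_noise`), which is how I understand the formulation in Shazeer et al. (2017); the function names and toy weights are hypothetical. All weights are set to zero so the clean logits tie and only the noise decides the selection.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softplus(v):
    """Smooth, always-positive function used here as the noise scale."""
    return math.log1p(math.exp(v))

def noisy_logits(x, w_gate, w_noise, rng):
    """Clean logit plus Gaussian noise with a per-expert scale."""
    logits = []
    for w_g, w_n in zip(w_gate, w_noise):
        clean = dot(w_g, x)
        noise_scale = softplus(dot(w_n, x))
        logits.append(clean + rng.gauss(0.0, 1.0) * noise_scale)
    return logits

def top_k(values, k):
    """Indices of the k largest values, largest first."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]

# Toy setup (assumed): 4 experts whose clean logits are all equal.
rng = random.Random(0)
x = [1.0, 1.0]
w_gate = [[0.0, 0.0]] * 4
w_noise = [[0.0, 0.0]] * 4

counts = [0] * 4
for _ in range(200):
    winner = top_k(noisy_logits(x, w_gate, w_noise, rng), 1)[0]
    counts[winner] += 1
```

Without noise, the tie would always resolve to expert 0 in this sketch; with noise, the 200 selections spread across all four experts.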
Method 2: Soft Constraints (Auxiliary Losses)
- Expert Importance: Sum the router probabilities assigned to an expert over the entire batch. Encourage a uniform distribution of expert importance. Measurable via the coefficient of variation (CV = std/mean): a uniform distribution gives CV = 0. But this metric can be misleading: importance counts probability mass rather than assignments, so an expert can accumulate its full share from many tokens that never select it. Importance can then be uniform (CV = 0) while the actual token assignment is highly skewed (e.g., with 8 experts, the first four get 5 tokens each and the last four get 0).
- Direct Load: Count the number of tokens routed to each expert. This more directly encourages each expert to receive roughly equal training samples.
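The gap between the two measures can be checked numerically. The batch below is a constructed example, not data from the video: 20 tokens, 8 experts, top-1 routing, with per-token probabilities chosen (as an assumption) so that every expert accumulates the same importance even though only the first four experts are ever selected. The population standard deviation is used for the CV, which is also an assumption.

```python
import statistics

def coefficient_of_variation(values):
    """CV = std / mean, using the population standard deviation."""
    return statistics.pstdev(values) / statistics.mean(values)

def token_probs(top_expert):
    """Router probabilities for one token whose top-1 expert is top_expert."""
    probs = [0.1] * 4 + [0.125] * 4   # baseline mass on all 8 experts
    probs[top_expert] = 0.2           # the selected expert gets the most mass
    return probs

# 5 tokens prefer each of experts 0..3; experts 4..7 are never the top choice.
batch = [token_probs(e) for e in range(4) for _ in range(5)]

# Importance: total probability mass per expert over the batch.
importance = [sum(p[e] for p in batch) for e in range(8)]

# Load: number of tokens actually routed to each expert (top-1).
load = [0] * 8
for p in batch:
    load[p.index(max(p))] += 1

cv_importance = coefficient_of_variation(importance)
cv_load = coefficient_of_variation(load)
```

Every expert ends up with importance 2.5 (up to floating-point rounding), so the importance CV is essentially 0, while the load is 5 tokens for each of the first four experts and 0 for the rest, giving a load CV of 1.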
Sources
YouTube Video: Huge! It’s amazing how often Noam’s papers end up at the center of the field… (https://www.youtube.com/watch?v=0QQlYR1r6pQ)
Noam Shazeer (@NoamShazeer): I’m excited to share that I’ll be joining OpenAI and look forward to working with the exceptional team there.
It was a difficult decision to move on. I’m incredibly proud of the amazing team at Google and everything we’ve built together. It has been an honor and a pleasure to