@jbhuang0604: Huge! It’s amazing how often Noam’s papers end up at the center of the field. In many tutorial videos I’ve made, they’v…

X AI KOLs Following 06/18/26, 03:34 AM News

mixture-of-experts transformer sparse-moe routing load-balancing research openai google

Summary

The article provides a detailed explanation of Mixture of Experts (MoE) in transformers, covering routing, load balancing, and recent innovations like fine-grained experts. It also highlights the significance of Noam Shazeer's research contributions and his move from Google to OpenAI.

Huge! It’s amazing how often Noam’s papers end up at the center of the field. In many tutorial videos I’ve made, they’ve been a recurring foundation for explaining key ideas. GOAT! MoE: https://youtu.be/0QQlYR1r6pQ SwiGLU: https://youtu.be/JRaPNrpsQ9s MQA: https://youtu.be/Y-o545eYjXM Transformer: https://youtu.be/rcWMRA9E5RI


Cached at: 06/18/26, 06:04 AM


TL;DR: Mixture-of-Experts architecture enables scaling model parameters while maintaining training and inference efficiency by routing each token to only a sparse subset of experts, making it a key technology in many advanced AI models.

How Feed-Forward Networks (FFNs) Store and Retrieve Facts

Transformers are built mainly from alternating attention layers and feed-forward networks (FFNs). The attention mechanism captures contextual dependencies between tokens, while the FFN retrieves factual information stored in the model weights. As an example, take the embedding vector x of a token in the sentence “The capital of Taiwan is here,” which the model needs to complete correctly. The FFN processes x in three steps:

Up-projection: A linear layer (weight matrix W_up, plus a bias vector) maps the input embedding vector x to the hidden feature dimension (typically 4 times the input dimension), producing a vector z. Each row of W_up can be seen as a question (a semantic direction), and its dot product with x measures how relevant the input is to that question. Some questions respond strongly (large positive value) to an input token, while others are near zero or negative.
Nonlinear activation: An activation function such as ReLU zeroes out the irrelevant (negative) responses, producing the activation vector a.
Down-projection: Another linear layer (weight matrix W_down) maps a back to the original dimension. Each column of W_down encodes factual information (e.g., “Taipei”, “TSMC”) associated with one question. The output y is the sum of the columns of W_down, weighted by the entries of a.

Finally, y is added back to the original token embedding via a residual connection, preserving contextual information while integrating newly retrieved facts.
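The three steps and the residual connection can be sketched in a few lines of plain Python. This is only an illustration under assumed toy sizes (d_model = 4, d_hidden = 16) and random weights; the helper names matvec and ffn are made up for this sketch.

```python
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def ffn(x, W_up, b_up, W_down):
    # Up-projection: one dot product per row of W_up (one "question" each).
    z = [s + b for s, b in zip(matvec(W_up, x), b_up)]
    # ReLU zeroes out the negative (irrelevant) responses.
    a = [max(0.0, v) for v in z]
    # Down-projection: sum of the columns of W_down, weighted by a.
    y = matvec(W_down, a)
    # Residual connection: add the retrieved content back to the input.
    return [xi + yi for xi, yi in zip(x, y)]

random.seed(0)
d_model, d_hidden = 4, 16  # assumed toy sizes; hidden is 4x the input
W_up = [[random.gauss(0, 0.5) for _ in range(d_model)] for _ in range(d_hidden)]
b_up = [0.0] * d_hidden
W_down = [[random.gauss(0, 0.5) for _ in range(d_hidden)] for _ in range(d_model)]

x = [1.0, -0.5, 0.25, 2.0]
out = ffn(x, W_up, b_up, W_down)
print(out)
```

If W_down is all zeros, the FFN retrieves nothing and the output equals the input, which is exactly what the residual connection guarantees.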

From Dense FFNs to Sparse Mixture of Experts (MoE)

Increasing the hidden dimension of an FFN makes the model more expressive (more questions, finer-grained facts), but it also slows computation and increases memory usage. The way out is that each token only needs to activate a small set of relevant units. The core idea of sparse MoE is to divide a large FFN into multiple smaller, more specialized “expert” networks and to activate only a sparse subset of experts for each token.
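A rough parameter count makes the trade-off concrete. The sizes below (d_model = 1024, a total hidden size of 32768 split across 8 experts, 2 experts active per token) are assumptions chosen for illustration, and biases are ignored.

```python
def ffn_params(d_model, d_hidden):
    # Up- and down-projection weight matrices; biases ignored.
    return 2 * d_model * d_hidden

# Assumed sizes, for illustration only.
d_model, d_hidden_total = 1024, 32768
num_experts, top_k = 8, 2

d_hidden_expert = d_hidden_total // num_experts
total_params = num_experts * ffn_params(d_model, d_hidden_expert)
active_params = top_k * ffn_params(d_model, d_hidden_expert)
router_params = d_model * num_experts  # one weight vector per expert

print("total FFN parameters:  ", total_params)
print("active per token:      ", active_params)
print("fraction active:       ", active_params / total_params)
print("extra router parameters:", router_params)
```

Splitting the FFN leaves the total parameter count unchanged, while each token only pays for top_k / num_experts of it plus a small router.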

Routing Mechanism

The router computes a score (logit) for each token–expert pair, applies softmax to obtain a probability distribution, and selects the top-k experts with the highest probabilities. The input token is sent to the selected experts for processing, and the output is a weighted average of those experts’ outputs (weights are the router probabilities). The router is simple to implement: it has a learnable weight vector for each expert; logit = dot product (input embedding, expert weight vector).
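A minimal sketch of this routing step follows. The experts here are toy functions that just scale their input, and the router weights are hand-picked so the result is easy to follow; all of these are assumptions for illustration. The sketch uses the raw softmax probabilities as weights; some implementations instead renormalize them over the selected experts.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route(x, router_weights, k):
    # One logit per expert: dot product of the input with that expert's vector.
    logits = [sum(w * v for w, v in zip(w_e, x)) for w_e in router_weights]
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:k]
    return top, [probs[e] for e in top]

def moe_forward(x, router_weights, experts, k):
    top, gates = route(x, router_weights, k)
    y = [0.0] * len(x)
    for e, g in zip(top, gates):
        out = experts[e](x)  # only the selected experts are evaluated
        y = [yi + g * oi for yi, oi in zip(y, out)]
    return y

# Toy setup (hypothetical): expert e multiplies its input by e + 1.
experts = [lambda v, s=s: [s * vi for vi in v] for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[2.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0], [-1.0, 0.0, 0.0]]
x = [1.0, 0.0, 0.0]

print(route(x, router_weights, k=2))
y = moe_forward(x, router_weights, experts, k=2)
print(y)
```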

Fine-Grained Experts and Shared Experts

Recent work (e.g., DeepSeek-V3) uses many small experts (256) and activates 8 per token, a design called “fine-grained experts”. This significantly improves performance and has been adopted by open-source models such as Qwen 2.5 and Kimi K2. Another design adds a shared expert that is always activated and learns general information, while the routed experts focus on specific patterns. Experimental results vary, however: DeepSeek reports a benefit, while OLMoE observed no improvement, potentially because converting one routed expert into a shared expert limits the diversity of expert combinations.
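The “diversity of expert combinations” argument can be checked with simple counting. The small configurations below (8 experts with 2 active, and the same budget split into 16 half-size experts with 4 active) are hypothetical; only the 256-choose-8 case uses numbers from the text.

```python
import math

# Hypothetical coarse setup: 8 experts, 2 active per token.
coarse = math.comb(8, 2)            # 28 possible expert pairs

# Same compute budget with each expert split in two: 16 experts, 4 active.
fine = math.comb(16, 4)             # 1820 possible combinations

# Turning one of the 2 active slots into an always-on shared expert
# leaves a single routed slot chosen among the remaining 7 experts.
with_shared = math.comb(8 - 1, 2 - 1)  # 7 possible combinations

# Choosing 8 routed experts out of 256, as in the configuration quoted above.
large = math.comb(256, 8)

print(coarse, fine, with_shared, large)
```

Finer experts multiply the number of distinct combinations a token can use, whereas fixing one slot as a shared expert shrinks it, which is one plausible reading of why results for shared experts differ across models.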

Expert Capacity and Load Balancing

Capacity and Overflow Issues

In distributed deployment, each device, and therefore each expert it hosts, has a fixed capacity (the number of tokens it can handle), computed as (tokens / experts) × capacity factor. For example, with 16 tokens, 8 experts, and a capacity factor of 1, each expert can handle 2 tokens. However, routing is dynamic and imbalanced, so some experts overflow and their excess tokens are dropped. In practice, dropped tokens are passed directly to the next layer via the residual connection, skipping the expert computation. Increasing the capacity factor reduces overflow but increases computation and communication costs, because empty slots are filled with wasted padding.
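The capacity arithmetic and the overflow behavior can be sketched as follows, assuming top-1 routing and a first-come, first-served fill order. The skewed assignment list is made up for illustration.

```python
import math

def expert_capacity(num_tokens, num_experts, capacity_factor):
    return math.ceil(num_tokens / num_experts * capacity_factor)

def dispatch(assignments, num_experts, capacity):
    """Fill each expert's slots in token order; overflow tokens are dropped."""
    slots = [[] for _ in range(num_experts)]
    dropped = []
    for token, expert in enumerate(assignments):
        if len(slots[expert]) < capacity:
            slots[expert].append(token)
        else:
            dropped.append(token)  # would skip the expert via the residual path
    return slots, dropped

num_tokens, num_experts = 16, 8
capacity = expert_capacity(num_tokens, num_experts, capacity_factor=1.0)

# Hypothetical skewed top-1 routing: expert 0 is chosen by 5 tokens.
assignments = [0, 0, 0, 0, 1, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3]
slots, dropped = dispatch(assignments, num_experts, capacity)
padding = sum(capacity - len(s) for s in slots)

print("capacity per expert:", capacity)
print("dropped tokens:", dropped)
print("empty (padded) slots:", padding)
```

In this example overflow and waste happen at the same time: popular experts drop tokens while unpopular ones carry empty slots.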

Drop-Free Block Sparse Matrix Multiplication

Traditional implementations force each expert to receive the same number of tokens, leading to a trade-off between “overflow or waste”. A newer approach: represent expert computation as block diagonal matrix multiplication. When experts receive different numbers of tokens, each block has a different size (non-uniform routing), and block sparse matrix multiplication kernels can compute without instantiating zero blocks, avoiding dropping or padding.
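The sketch below only illustrates the semantics of the drop-free approach: tokens are grouped by expert into variable-sized blocks, and each block is multiplied by its own expert's weights with no dropping and no padding. It is not a block-sparse kernel; real implementations fuse this into sparse GPU matrix multiplications. The function names and toy weights are assumptions.

```python
def group_by_expert(assignments, num_experts):
    groups = [[] for _ in range(num_experts)]
    for token, expert in enumerate(assignments):
        groups[expert].append(token)
    return groups

def dropless_moe(tokens, assignments, expert_weights):
    """Apply expert e's matrix to exactly the tokens routed to e."""
    outputs = [None] * len(tokens)
    groups = group_by_expert(assignments, len(expert_weights))
    for e, group in enumerate(groups):
        W = expert_weights[e]
        # One diagonal block per expert; its height is len(group), which
        # may differ between experts and may be zero.
        for t in group:
            outputs[t] = [sum(w * v for w, v in zip(row, tokens[t])) for row in W]
    return outputs

# Toy data (hypothetical): expert e is the identity matrix scaled by e + 1.
expert_weights = [[[s, 0.0], [0.0, s]] for s in (1.0, 2.0, 3.0)]
tokens = [[1.0, 2.0], [0.5, -1.0], [2.0, 0.0], [-1.0, 4.0]]
assignments = [0, 0, 2, 0]  # non-uniform: expert 1 receives nothing

block_sizes = [len(g) for g in group_by_expert(assignments, 3)]
outputs = dropless_moe(tokens, assignments, expert_weights)
print(block_sizes)
print(outputs)
```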

Load Balancing During Training

With random initial weights, all experts are equally poor. If the router assigns most tokens to a few experts, those experts get more updates and become better, creating a self-reinforcing cycle while other experts become nearly useless. The goal of load balancing is to distribute token processing evenly.

Method 1: Noisy Top-k Gating

Add Gaussian noise to the logits to encourage the router to select a more diverse set of experts. Also, a learnable scaling factor can be added per expert.
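A minimal sketch of noisy gating follows. It assumes the per-expert noise scale is the softplus of a learned projection of the input (W_noise here), which is one common way to keep the scale positive; the weights and names are illustrative, not taken from the source.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softplus(v):
    return math.log1p(math.exp(v))

def noisy_top_k(x, W_gate, W_noise, k, rng):
    noisy = []
    for w_g, w_n in zip(W_gate, W_noise):
        scale = softplus(dot(w_n, x))  # assumed learnable per-expert noise scale
        noisy.append(dot(w_g, x) + rng.gauss(0.0, 1.0) * scale)
    order = sorted(range(len(noisy)), key=lambda e: noisy[e], reverse=True)
    return order[:k], noisy

rng = random.Random(0)
x = [1.0, 0.0]
W_gate = [[1.0, 0.0], [0.9, 0.0], [0.0, 0.0]]  # clean logits: 1.0, 0.9, 0.0
W_noise = [[0.0, 0.0]] * 3                     # scale = softplus(0) = ln 2

# Without noise expert 0 would always win; with noise the choice varies.
picks = [noisy_top_k(x, W_gate, W_noise, 1, rng)[0][0] for _ in range(200)]
print({e: picks.count(e) for e in sorted(set(picks))})
```

Because the scale is learned per expert, the router can in principle reduce the noise for experts where it is confident and keep exploring elsewhere.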

Method 2: Soft Constraints (Auxiliary Losses)

Expert Importance: Sum the router probabilities assigned to an expert over the entire batch, and encourage these sums to be uniform across experts. Uniformity can be measured with the coefficient of variation (CV = std/mean); a perfectly uniform distribution gives CV=0. But this metric can be misleading: every expert can accumulate the same total probability (CV=0) while the actual token assignment is highly skewed (e.g., the first four experts get 5 tokens each and the last four get 0), because an expert can collect probability mass from many tokens without ever being selected.
Direct Load: Count the number of tokens routed to each expert. This more directly encourages each expert to receive roughly equal training samples.
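The gap between the two measures can be reproduced with a small constructed batch. The numbers are an assumption made for this illustration: 20 tokens and 8 experts with top-1 routing, where every token gives 0.5 to one of the first four experts and 0.125 to each of the last four.

```python
from statistics import mean, pstdev

def cv(values):
    """Coefficient of variation: standard deviation divided by mean."""
    return pstdev(values) / mean(values)

num_experts, num_tokens = 8, 20

# Constructed router probabilities (illustrative, each row sums to 1).
probs = []
for t in range(num_tokens):
    p = [0.0] * 4 + [0.125] * 4
    p[t % 4] = 0.5  # the token's top expert is one of the first four
    probs.append(p)

# Importance: total probability mass each expert receives over the batch.
importance = [sum(p[e] for p in probs) for e in range(num_experts)]

# Load: number of tokens actually routed to each expert (top-1).
load = [0] * num_experts
for p in probs:
    load[max(range(num_experts), key=lambda e: p[e])] += 1

print("importance:", importance, "CV =", cv(importance))
print("load:      ", load, "CV =", cv(load))
```

Importance is identical for all eight experts, yet the last four never receive a token, which is why a load-based term is needed alongside the importance term.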

Sources

YouTube Video: Huge! It’s amazing how often Noam’s papers end up at the center of the field… (https://www.youtube.com/watch?v=0QQlYR1r6pQ)

Noam Shazeer (@NoamShazeer): I’m excited to share that I’ll be joining OpenAI and look forward to working with the exceptional team there.

It was a difficult decision to move on. I’m incredibly proud of the amazing team at Google and everything we’ve built together. It has been an honor and a pleasure to


Similar Articles

Mixture of Experts (MoEs) in Transformers

@markchen90: Very excited to welcome @NoamShazeer to OpenAI as our new lead for architecture research! His work on transformers, MoE…

new MoE from ai2, EMO

@Jianlin_S: MoE (9): The Gate Normalization Debate https://kexue.fm/archives/11782

Emergent Modularity in Mixture-of-Experts Models (8 minute read)
