How to Scale Mixture-of-Experts: From muP to the Maximally Scale-Stable Parameterization
Summary
This paper develops a principled scaling theory for Mixture-of-Experts (MoE) architectures and introduces the Maximally Scale-Stable Parameterization (MSSP), which targets stable training and hyperparameter transfer as width, depth, expert width, and number of experts grow. Experiments support the resulting prescription.
Cached at: 05/15/26, 06:27 AM
# How to Scale Mixture-of-Experts: From muP to the Maximally Scale-Stable Parameterization

Source: [https://arxiv.org/abs/2605.14200](https://arxiv.org/abs/2605.14200) [View PDF](https://arxiv.org/pdf/2605.14200)

> Abstract: Recent frontier large language models predominantly rely on Mixture-of-Experts (MoE) architectures. Despite empirical progress, there is still no principled understanding of how hyperparameters should scale with network width $N$, expert width $N_e$, number of experts $M$, sparsity $K$, and depth $L$ to ensure both stability and optimal performance at scale. We take a principled step toward resolving this gap by analyzing three different scaling regimes: (I) co-scaling $N \asymp N_e$, (II) co-scaling $N \asymp M \asymp K$, and (III) full proportional scaling of $N$, $N_e$, $M$, and $K$. For each regime, we develop a novel Dynamical Mean Field Theory (DMFT) description of the limiting training dynamics of MoEs that provides a formal foundation for our analysis. Within this framework, we derive the unique parameterization for SGD and Adam satisfying all maximal-update ($\mu$) desiderata. We then show that the resulting $\mu$P prescription does not reliably induce monotonic improvement with scale or robust learning-rate transfer. We trace these pathologies to scale-dependent observables in the aggregation dynamics, which motivates a refined set of desiderata that we term maximal scale stability. Guided by this principle, we derive a Maximally Scale-Stable Parameterization (MSSP) for both SGD and Adam in all three scaling regimes, and characterize the corresponding limiting dynamics (qualitatively distinct from the $\mu$P limit) through a separate DMFT analysis. Experiments verify that MSSP robustly recovers learning rate transfer and monotonic improvement with scale across regimes. Combined with existing depth-scaling theory, these results provide a complete scaling prescription for MoE architectures as a function of width, depth, expert width, and number of experts.

## Submission history

From: Leena Chennuru Vankadara \[[view email](https://arxiv.org/show-email/7eb074a5/2605.14200)\]

**\[v1\]** Wed, 13 May 2026 23:32:00 UTC (32,556 KB)
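To make the symbols in the abstract concrete, the following is a minimal NumPy sketch of a generic top-K MoE layer, showing where network width $N$, expert width $N_e$, number of experts $M$, and sparsity $K$ enter. It is an illustration under stated assumptions, not the paper's architecture or parameterization: the function name `moe_forward`, the ReLU experts, the softmax gating over the selected experts, and the $1/\sqrt{\text{fan-in}}$ initialization are all placeholder choices made here, and none of the $\mu$P or MSSP scaling rules are encoded.

```python
import numpy as np


def moe_forward(x, router_w, w_in, w_out, k):
    """Forward pass of a generic top-k MoE layer for one token (illustrative).

    x:        (N,)         input vector, N = network width
    router_w: (M, N)       router weights, M = number of experts
    w_in:     (M, N_e, N)  expert up-projections, N_e = expert width
    w_out:    (M, N, N_e)  expert down-projections
    k:        number of experts active per token (the sparsity K)
    """
    logits = router_w @ x                       # one routing score per expert
    top = np.argsort(logits)[-k:]               # indices of the k largest scores
    gates = np.exp(logits[top] - logits[top].max())
    gates = gates / gates.sum()                 # softmax over the selected experts
    out = np.zeros_like(x)
    for g, m in zip(gates, top):
        h = np.maximum(w_in[m] @ x, 0.0)        # ReLU expert (assumed nonlinearity)
        out += g * (w_out[m] @ h)               # gate-weighted aggregation
    return out, top, gates


# Hypothetical sizes; the initialization below is a generic placeholder,
# not the scaling prescribed by the paper.
rng = np.random.default_rng(0)
N, N_e, M, K = 16, 8, 4, 2
x = rng.standard_normal(N)
router_w = rng.standard_normal((M, N)) / np.sqrt(N)
w_in = rng.standard_normal((M, N_e, N)) / np.sqrt(N)
w_out = rng.standard_normal((M, N, N_e)) / np.sqrt(N_e)

out, top, gates = moe_forward(x, router_w, w_in, w_out, K)
```

In this sketch, regime (I) corresponds to growing `N` and `N_e` together, regime (II) to growing `N`, `M`, and `K` together, and regime (III) to growing all four. The gate-weighted sum in the loop is the aggregation step, which is where the abstract locates the scale-dependent observables.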
Similar Articles
What is the point of MoE models, beyond being faster?
A discussion about the advantages of Mixture of Experts (MoE) models over dense models beyond speed, considering RAM constraints and scaling limits.
Mixture of Experts (MoEs) in Transformers
Hugging Face blog post explaining Mixture of Experts (MoEs) architecture in Transformers, covering the shift from dense to sparse models, weight loading optimizations, expert parallelism, and training techniques for MoE-based language models.
UniPool: A Globally Shared Expert Pool for Mixture-of-Experts
UniPool introduces a shared expert pool architecture for Mixture-of-Experts models, reducing parameter growth with depth while improving efficiency and performance over standard MoE baselines.
Is there a limit on the number of active parameters in an MoE model?
Discussion on the limit of active parameters in Mixture-of-Experts (MoE) models, questioning whether there is a cap on active parameter count beyond which quality doesn't improve.
Emergent Modularity in Mixture-of-Experts Models (8 minute read)
Ai2 releases EMO, a 14B-parameter mixture-of-experts language model trained to develop emergent modularity. It allows using a small subset of experts for specific tasks while maintaining near full-model performance.