How to Scale Mixture-of-Experts: From muP to the Maximally Scale-Stable Parameterization

arXiv cs.LG Papers

Summary

This paper develops a principled scaling theory for Mixture-of-Experts (MoE) architectures. It introduces the Maximally Scale-Stable Parameterization (MSSP), which in the authors' experiments recovers learning-rate transfer and monotonic improvement with scale across network width, expert width, and number of experts. Combined with existing depth-scaling theory, this gives a complete scaling prescription for MoEs.

arXiv:2605.14200v1. Abstract: Recent frontier large language models predominantly rely on Mixture-of-Experts (MoE) architectures. Despite empirical progress, there is still no principled understanding of how hyperparameters should scale with network width $N$, expert width $N_e$, number of experts $M$, sparsity $K$, and depth $L$ to ensure both stability and optimal performance at scale. We take a principled step toward resolving this gap by analyzing three different scaling regimes: (I) co-scaling $N\asymp N_e$, (II) co-scaling $N\asymp M\asymp K$, and (III) full proportional scaling of $N, N_e, M$, and $K$. For each regime, we develop a novel Dynamical Mean Field Theory (DMFT) description of the limiting training dynamics of MoEs that provides a formal foundation for our analysis. Within this framework, we derive the unique parameterization for SGD and Adam satisfying all maximal-update ($\mu$) desiderata. We then show that the resulting $\mu$P prescription does not reliably induce monotonic improvement with scale or robust learning-rate transfer. We trace these pathologies to scale-dependent observables in the aggregation dynamics, which motivates a refined set of desiderata that we term maximal scale stability. Guided by this principle, we derive a Maximally Scale-Stable Parameterization (MSSP) for both SGD and Adam in all three scaling regimes, and characterize the corresponding limiting dynamics - qualitatively distinct from the $\mu$P limit - through a separate DMFT analysis. Experiments verify that MSSP robustly recovers learning rate transfer and monotonic improvement with scale across regimes. Combined with existing depth-scaling theory, these results provide a complete scaling prescription for MoE architectures as a function of width, depth, expert width, and number of experts.

# How to Scale Mixture-of-Experts: From muP to the Maximally Scale-Stable Parameterization
Source: [https://arxiv.org/abs/2605.14200](https://arxiv.org/abs/2605.14200)
[View PDF](https://arxiv.org/pdf/2605.14200)

> Abstract: Recent frontier large language models predominantly rely on Mixture-of-Experts (MoE) architectures. Despite empirical progress, there is still no principled understanding of how hyperparameters should scale with network width $N$, expert width $N_e$, number of experts $M$, sparsity $K$, and depth $L$ to ensure both stability and optimal performance at scale. We take a principled step toward resolving this gap by analyzing three different scaling regimes: (I) co-scaling $N\asymp N_e$, (II) co-scaling $N\asymp M\asymp K$, and (III) full proportional scaling of $N, N_e, M$, and $K$. For each regime, we develop a novel Dynamical Mean Field Theory (DMFT) description of the limiting training dynamics of MoEs that provides a formal foundation for our analysis. Within this framework, we derive the unique parameterization for SGD and Adam satisfying all maximal-update ($\mu$) desiderata. We then show that the resulting $\mu$P prescription does not reliably induce monotonic improvement with scale or robust learning-rate transfer. We trace these pathologies to scale-dependent observables in the aggregation dynamics, which motivates a refined set of desiderata that we term maximal scale stability. Guided by this principle, we derive a Maximally Scale-Stable Parameterization (MSSP) for both SGD and Adam in all three scaling regimes, and characterize the corresponding limiting dynamics - qualitatively distinct from the $\mu$P limit - through a separate DMFT analysis. Experiments verify that MSSP robustly recovers learning rate transfer and monotonic improvement with scale across regimes. Combined with existing depth-scaling theory, these results provide a complete scaling prescription for MoE architectures as a function of width, depth, expert width, and number of experts.
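To make the symbols in the abstract concrete, here is a minimal sketch of a top-$K$ MoE feed-forward layer in plain Python. It is an illustration under stated assumptions, not the paper's construction: the function names (`moe_layer`, `matvec`, `random_matrix`), the ReLU experts, the softmax gate over the selected experts, and the $1/\sqrt{\text{fan-in}}$ initialization are all choices made for this example, and the code implements neither $\mu$P nor MSSP. It only shows where network width $N$, expert width $N_e$, number of experts $M$, and sparsity $K$ enter the computation.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def moe_layer(x, W_router, W_in, W_out, K):
    """Top-K MoE feed-forward layer for one token (hypothetical sketch).

    x:        length-N vector, N = network width
    W_router: M x N router weights, M = number of experts
    W_in:     M matrices of shape N_e x N, N_e = expert width
    W_out:    M matrices of shape N x N_e
    K:        number of experts activated per token (sparsity)
    """
    logits = matvec(W_router, x)  # one routing score per expert
    # Indices of the K highest-scoring experts.
    top = sorted(range(len(logits)), key=lambda m: logits[m], reverse=True)[:K]
    # Softmax over the selected experts only (an assumption of this sketch).
    peak = max(logits[m] for m in top)
    weights = [math.exp(logits[m] - peak) for m in top]
    total = sum(weights)
    gates = [w / total for w in weights]

    out = [0.0] * len(x)
    for g, m in zip(gates, top):
        hidden = [max(v, 0.0) for v in matvec(W_in[m], x)]  # length N_e, ReLU
        expert_out = matvec(W_out[m], hidden)               # length N
        # Gated sum of the selected experts' outputs.
        out = [o + g * y for o, y in zip(out, expert_out)]
    return out, top, gates


def random_matrix(rows, cols, rng):
    """Gaussian matrix with 1/sqrt(fan-in) scale (placeholder, not muP or MSSP)."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
N, N_e, M, K = 8, 16, 4, 2  # toy sizes chosen for illustration
x = [rng.gauss(0.0, 1.0) for _ in range(N)]
W_router = random_matrix(M, N, rng)
W_in = [random_matrix(N_e, N, rng) for _ in range(M)]
W_out = [random_matrix(N, N_e, rng) for _ in range(M)]
y, top, gates = moe_layer(x, W_router, W_in, W_out, K)
print(len(y), sorted(top), round(sum(gates), 6))
```

In this toy layer, regime (I) would grow `N` and `N_e` together, regime (II) would grow `N`, `M`, and `K` together, and regime (III) would grow all four. The gated sum in the loop is presumably the kind of aggregation step whose scale dependence the abstract refers to, though that mapping is an assumption of this sketch rather than something the abstract states.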

## Submission history

From: Leena Chennuru Vankadara. **[v1]** Wed, 13 May 2026 23:32:00 UTC (32,556 KB)
