How to Scale Mixture-of-Experts: From muP to the Maximally Scale-Stable Parameterization

arXiv cs.LG 05/15/26, 04:00 AM Papers

Summary

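This paper develops a principled scaling theory for Mixture-of-Experts (MoE) architectures. It shows that the $\mu$P prescription does not reliably yield learning-rate transfer or monotonic improvement with scale, and introduces the Maximally Scale-Stable Parameterization (MSSP), which experiments show recovers both across three regimes that scale width, expert width, number of experts, and sparsity. Combined with existing depth-scaling theory, the prescription also covers depth.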

arXiv:2605.14200v1 Announce Type: new Abstract: Recent frontier large language models predominantly rely on Mixture-of-Experts (MoE) architectures. Despite empirical progress, there is still no principled understanding of how hyperparameters should scale with network width $N$, expert width $N_e$, number of experts $M$, sparsity $K$, and depth $L$ to ensure both stability and optimal performance at scale. We take a principled step toward resolving this gap by analyzing three different scaling regimes: (I) co-scaling $N\asymp N_e$, (II) co-scaling $N\asymp M\asymp K$, and (III) full proportional scaling of $N, N_e, M$, and $K$. For each regime, we develop a novel Dynamical Mean Field Theory (DMFT) description of the limiting training dynamics of MoEs that provides a formal foundation for our analysis. Within this framework, we derive the unique parameterization for SGD and Adam satisfying all maximal-update ($\mu$) desiderata. We then show that the resulting $\mu$P prescription does not reliably induce monotonic improvement with scale or robust learning-rate transfer. We trace these pathologies to scale-dependent observables in the aggregation dynamics, which motivates a refined set of desiderata that we term maximal scale stability. Guided by this principle, we derive a Maximally Scale-Stable Parameterization (MSSP) for both SGD and Adam in all three scaling regimes, and characterize the corresponding limiting dynamics - qualitatively distinct from the $\mu$P limit - through a separate DMFT analysis. Experiments verify that MSSP robustly recovers learning rate transfer and monotonic improvement with scale across regimes. Combined with existing depth-scaling theory, these results provide a complete scaling prescription for MoE architectures as a function of width, depth, expert width, and number of experts.

Original Article

Cached at: 05/15/26, 06:27 AM

# How to Scale Mixture-of-Experts: From muP to the Maximally Scale-Stable Parameterization
Source: [https://arxiv.org/abs/2605.14200](https://arxiv.org/abs/2605.14200)
[View PDF](https://arxiv.org/pdf/2605.14200)

> Abstract: Recent frontier large language models predominantly rely on Mixture-of-Experts (MoE) architectures. Despite empirical progress, there is still no principled understanding of how hyperparameters should scale with network width $N$, expert width $N_e$, number of experts $M$, sparsity $K$, and depth $L$ to ensure both stability and optimal performance at scale. We take a principled step toward resolving this gap by analyzing three different scaling regimes: (I) co-scaling $N\asymp N_e$, (II) co-scaling $N\asymp M\asymp K$, and (III) full proportional scaling of $N, N_e, M$, and $K$. For each regime, we develop a novel Dynamical Mean Field Theory (DMFT) description of the limiting training dynamics of MoEs that provides a formal foundation for our analysis. Within this framework, we derive the unique parameterization for SGD and Adam satisfying all maximal-update ($\mu$) desiderata. We then show that the resulting $\mu$P prescription does not reliably induce monotonic improvement with scale or robust learning-rate transfer. We trace these pathologies to scale-dependent observables in the aggregation dynamics, which motivates a refined set of desiderata that we term maximal scale stability. Guided by this principle, we derive a Maximally Scale-Stable Parameterization (MSSP) for both SGD and Adam in all three scaling regimes, and characterize the corresponding limiting dynamics - qualitatively distinct from the $\mu$P limit - through a separate DMFT analysis. Experiments verify that MSSP robustly recovers learning rate transfer and monotonic improvement with scale across regimes. Combined with existing depth-scaling theory, these results provide a complete scaling prescription for MoE architectures as a function of width, depth, expert width, and number of experts.
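
To make the three scaling regimes in the abstract concrete, the following is a minimal sketch of how the shape of one MoE layer changes under each regime. It is not the paper's parameterization and says nothing about learning rates or initialization. The names `MoEConfig`, `scale`, and `param_counts` are hypothetical. The sketch assumes that dimensions not named in a regime are held fixed, that "$\asymp$" can be illustrated by exact proportional growth, and that each expert is a bias-free two-matrix MLP ($N \to N_e \to N$) behind an $N \times M$ router.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEConfig:
    N: int    # network width
    N_e: int  # expert (hidden) width
    M: int    # number of experts
    K: int    # experts active per token (sparsity)


def scale(base: MoEConfig, regime: str, factor: int) -> MoEConfig:
    """Grow `base` by `factor` along the dimensions a regime co-scales.

    Assumption: dimensions not mentioned in a regime stay fixed.
    """
    if regime == "I":    # N and N_e grow together
        return MoEConfig(base.N * factor, base.N_e * factor, base.M, base.K)
    if regime == "II":   # N, M and K grow together
        return MoEConfig(base.N * factor, base.N_e, base.M * factor, base.K * factor)
    if regime == "III":  # N, N_e, M and K all grow proportionally
        return MoEConfig(base.N * factor, base.N_e * factor,
                         base.M * factor, base.K * factor)
    raise ValueError(f"unknown regime: {regime!r}")


def param_counts(cfg: MoEConfig) -> tuple[int, int]:
    """Return (total, active-per-token) weight counts for one MoE layer.

    Assumed layout: router of shape N x M, and per expert two matrices
    of shapes N x N_e and N_e x N, with no biases.
    """
    router = cfg.N * cfg.M
    per_expert = 2 * cfg.N * cfg.N_e
    total = router + cfg.M * per_expert
    active = router + cfg.K * per_expert
    return total, active


base = MoEConfig(N=64, N_e=128, M=8, K=2)
for regime in ("I", "II", "III"):
    cfg = scale(base, regime, factor=2)
    total, active = param_counts(cfg)
    print(f"regime {regime}: {cfg} total={total} active={active}")
```

Under these assumptions, doubling in regime (I) roughly quadruples both total and active weights through the $N \cdot N_e$ term, while regime (II) reaches a similar count by adding experts rather than widening them. Regime (III) compounds the two.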

## Submission history

From: Leena Chennuru Vankadara \[[view email](https://arxiv.org/show-email/7eb074a5/2605.14200)\] **\[v1\]** Wed, 13 May 2026 23:32:00 UTC (32,556 KB)

Similar Articles

What is the point of MoE models, beyond being faster?

Mixture of Experts (MoEs) in Transformers

UniPool: A Globally Shared Expert Pool for Mixture-of-Experts

Is there a limit on the number of active parameters in an MoE model?

Emergent Modularity in Mixture-of-Experts Models (8 minute read)
