$\phi$-Balancing for Mixture-of-Experts Training

arXiv cs.LG Papers

Summary

This paper proposes φ-balancing, a principled framework for load balancing in Mixture-of-Experts models that directly targets population-level expert balance using convex duality and mirror descent, achieving more stable expert utilization and outperforming prior methods on reasoning and code generation benchmarks.

arXiv:2605.15403v1 Announce Type: new Abstract: Mixture-of-Experts (MoE) models rely on balanced expert utilization to fully realize their scalability. However, existing load-balancing methods are largely heuristic and operate on noisy mini-batch assignment statistics, introducing bias relative to population-level objectives. We propose $\phi$-balancing, a principled framework that directly targets population-level expert balance by minimizing a strictly convex, symmetric, and differentiable potential of the expected routing distribution. Using convex duality, we derive an equivalent min-max formulation and obtain a simple online algorithm via mirror descent, yielding an efficient EMA-based routing adjustment with negligible overhead. Across large-scale pretraining and downstream fine-tuning, $\phi$-balancing consistently outperforms prior Switch-style and loss-free baselines, demonstrating more stable and effective expert utilization.

# ϕ-Balancing for Mixture-of-Experts Training
Source: [https://arxiv.org/html/2605.15403](https://arxiv.org/html/2605.15403)
Jonathan Li, Qi Wang, Runlong Liao, Shuozhe Li, Chen Liang, Ni Lao, Qiang Liu

###### Abstract

Mixture-of-Experts (MoE) models rely on balanced expert utilization to fully realize their scalability. However, existing load-balancing methods are largely heuristic and operate on noisy mini-batch assignment statistics, introducing bias relative to population-level objectives. We propose $\phi$-balancing, a principled framework that directly targets population-level expert balance by minimizing a strictly convex, symmetric, and differentiable potential of the expected routing distribution. Using convex duality, we derive an equivalent min-max formulation and obtain a simple online algorithm via mirror descent, yielding an efficient EMA-based routing adjustment with negligible overhead. Across large-scale pretraining and downstream fine-tuning, $\phi$-balancing consistently outperforms prior Switch-style and loss-free baselines, demonstrating more stable and effective expert utilization.

Mixture-of-Experts, Load Balancing, Transformer, Sparsity

## 1 Introduction

Mixture-of-Experts (MoE) Transformers have emerged as an effective approach for scaling deep learning models by dynamically selecting a small subset of expert modules for each input token. This strategy substantially increases model capacity while keeping computation nearly constant (shazeer2017outrageously; fedus2022switch), enabling large-scale language and vision models with billions of parameters to operate at roughly constant FLOPs (lepikhin2021gshard; riquelme2021scaling; fedus2022switch).

Algorithm 1: $\phi$-balancing for one MoE layer

Require: strictly convex, symmetric, and differentiable $\phi$; $\eta \in (0,1]$; $\alpha > 0$; $\mathbf{m} \leftarrow \mathbf{0}$; routing frequencies $f_e$ ([4](https://arxiv.org/html/2605.15403#S2.E4))

1. Compute routing probabilities $p_{i,e}$ for each token $i$
2. $\mathbf{p}_e \leftarrow \frac{1}{T}\sum_{i=1}^{T} p_{i,e}$ for $e = 1, \dots, E$ (expert loads)
3. Let $\mathbf{p} = (\mathbf{p}_1, \dots, \mathbf{p}_E)$
4. $\mathbf{m} \leftarrow (1-\eta)\mathbf{m} + \eta\mathbf{p}$ (EMA of loads)
5. $\mathcal{L}_{\text{aux}} \leftarrow \begin{cases} \textbf{ST-MoE:}\;\; \sum_{e=1}^{E} f_e \mathbf{p}_e \\ \textbf{Ours:}\;\; \sum_{e=1}^{E} \nabla\phi(\mathbf{m})_e \mathbf{p}_e \end{cases}$
6. Update model using $\nabla\big(\mathcal{L}_{\text{task}} + \alpha \cdot E \cdot \mathcal{L}_{\text{aux}}\big)$

![Refer to caption](https://arxiv.org/html/2605.15403v1/x1.png)

Figure 1: Performance gains on reasoning and code generation benchmarks. We compare the proposed method (Ours) against the ST-MoE baseline on the Moonlight-16B-A3B-Instruct architecture (LiuSY25). The proposed approach outperforms the baseline across all selected tasks, yielding significant gains in mathematical reasoning (Math500), general capability (LiveBench), code synthesis (HumanEval), and logic (BBH).

A key challenge in MoE training is to ensure balanced utilization of experts, which is essential for fully leveraging model capacity and avoiding performance degradation. A number of methods have been proposed to address this challenge, including Switch-style load-balancing losses (shazeer2017outrageously; lepikhin2021gshard; fedus2022switch) and more recent loss-free balancing approaches (wang2024auxiliary). However, an often unspoken issue is that most existing balancing objectives are heuristic in nature, as they do not correspond to minimizing a well-defined population-level objective. In principle, the true goal is to achieve balanced expert usage under the whole data distribution. In contrast, widely used methods such as Switch-style MoE (ST-MoE) rely on per-mini-batch statistics and realized expert assignment frequencies, which introduce systematic bias relative to population-level uniformity objectives.

We propose $\phi$-balancing, a principled load-balancing framework that directly targets population-level expert balance. Our approach formulates load balancing as the minimization of a strictly convex, symmetric, and differentiable potential $\phi$ applied to the population mean routing distribution. To avoid the bias introduced by per-batch approximation, we adopt a min-max formulation via convex duality and apply online mirror descent to solve the resulting inner problem. This yields a simple yet broad family of algorithms, shown in Algorithm [1](https://arxiv.org/html/2605.15403#alg1), that maintains an exponential moving average (EMA) of routing probabilities with negligible overhead, processed through the mirror map $\nabla\phi$.

Empirically, we find that $\phi$-balancing consistently outperforms ST-MoE across a wide range of settings (Figure [1](https://arxiv.org/html/2605.15403#S1.F1)), including pretraining MoE-augmented Gemma models (Kamath25; Liang25), where we systematically scale the number of active parameters $N$, expert count $E$, and routing granularity $G$ under controlled compute budgets, and ablations on EMA-based load tracking, the choice of mirror map $\phi$, and the EMA decay rate. While many choices for $\phi$ are possible, we recommend the negative entropy function as the most effective in practice.

We further evaluate per-benchmark LoRA fine-tuning on instruction-tuned MoE backbones (liu2024deepseek; dai2024deepseekmoe; LiuSY25) across seven benchmarks, totaling approximately 40,000 NVIDIA H100 HBM3-80GB GPU hours for all experiments.

## 2 Background on Mixtures of Experts

We consider a standard decoder-only Transformer composed of $L$ layers. In a dense Transformer, each layer processes the input sequence via a Self-Attention module followed by a shared Feed-Forward Network (FFN). The MoE architecture replaces this dense FFN with a sparse modular layer consisting of a learnable router and a set of $E$ experts, $\{\text{FFN}_1, \dots, \text{FFN}_E\}$ (shazeer2017outrageously).

Let $\boldsymbol{x} = (\boldsymbol{x}_i)_{i=1}^{T} \in \mathbb{R}^{T \times d}$ denote the input to a layer, where $T$ is the sequence length and $d$ is the model hidden dimension. For each token $\boldsymbol{x}_i$, the MoE layer output $\boldsymbol{y}_i$ is computed as the router-weighted sum of the experts:

$$\boldsymbol{y}_i = \sum_{e=1}^{E} R(\boldsymbol{x}_i)_e \cdot \text{FFN}_e(\boldsymbol{x}_i; d_{\text{ffn}}). \tag{1}$$

Here, each expert is parameterized as a standard two-layer MLP. Following recent state-of-the-art implementations (shazeer2020glu; dai2024deepseekmoe; agarwal2025gpt), we utilize the SwiGLU activation function, defined as:

$$\text{FFN}_e(\boldsymbol{u}) = W_2^{(e)} \cdot \text{SwiGLU}(W_1^{(e)} \boldsymbol{u}), \tag{2}$$

where $W_1^{(e)} \in \mathbb{R}^{d_{\text{ffn}} \times d}$ and $W_2^{(e)} \in \mathbb{R}^{d \times d_{\text{ffn}}}$ are independent parameters for expert $e$.

### 2.1 Sparse Routing Mechanism

The computational efficiency of MoEs relies on the routing function $R(\cdot)$, which enforces sparsity by directing each token to a small subset of $k$ experts (where $k \ll E$). The router typically consists of a learnable projection matrix $W_r \in \mathbb{R}^{E \times d}$. The *routing weights* are determined by normalizing the projection scores over the top-$k$ indices (shazeer2017outrageously):

$$R(\boldsymbol{x}) = \text{softmax}\left(\text{Top-}k(W_r \boldsymbol{x})\right). \tag{3}$$

The $\text{Top-}k(\cdot)$ operator sets all logits to $-\infty$ except for the $k$ largest elements. Consequently, $R(\boldsymbol{x})_e$ is zero for all non-selected experts, allowing the model to skip the majority of expert computations. When only one expert is activated ($k = 1$), we do not apply the softmax: a softmax over a single surviving logit is identically 1, which would give zero gradient on the router logits. This conditional computation decouples parameter count from inference cost; however, it introduces the load-balancing challenges that we address in Section [3](https://arxiv.org/html/2605.15403#S3).
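The routing rule in (3) can be sketched in a few lines of plain Python. This is a minimal illustration under our own assumptions, not the authors' implementation: the function name `top_k_route` and the example logits are hypothetical.

```python
import math

def top_k_route(logits, k):
    """Routing weights R(x): softmax over the k largest logits, 0 elsewhere."""
    # Indices of the k largest router logits (the Top-k selection).
    top = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:k]
    # Numerically stable softmax restricted to the selected experts.
    mx = max(logits[e] for e in top)
    exps = {e: math.exp(logits[e] - mx) for e in top}
    z = sum(exps.values())
    return [exps[e] / z if e in exps else 0.0 for e in range(len(logits))]

# Hypothetical router logits W_r x for one token with E = 4 experts.
weights = top_k_route([2.0, 0.5, 1.0, -1.0], k=2)
```

With `k=1` the single surviving weight is exactly 1 whatever the logits are, which illustrates why the softmax is skipped in that case.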

### 2.2 Baseline Load Balancing Strategy

![Refer to caption](https://arxiv.org/html/2605.15403v1/x2.png)

Figure 2: Pretraining scaling studies under controlled per-token compute. We evaluate routing stability and optimization across three orthogonal MoE scaling axes, while keeping the per-token computational cost (FLOPs) approximately constant within each study by adjusting expert size as needed. (Left) Active-parameter scaling: we train models with $E = 16$ experts and $A = 2$ active experts per token, varying the number of *active parameters* $N \in \{111\text{M},\, 338\text{M},\, 588\text{M},\, 986\text{M}\}$. (Middle) Granularity scaling: for fixed model size $M$ and activation ratio $A/E$, we vary the granularity factor $G \in \{2, 4, 8, 16, 32\}$ by increasing the total number of experts from $16$ to $256$ while proportionally shrinking each expert, so per-token FLOPs remain constant. (Right) Expert-count scaling (activation ratio): we isolate the effect of $A/E$ by holding the compute budget $M$, the number of activated experts $A = 2$, and the expert size (granularity) fixed, and varying the total number of experts $E \in \{8, 16, 32, 64, 128\}$.

While the router constitutes a negligible fraction of the total parameter count, it orchestrates the utilization of the model's vast expert capacity. Here, we recall the standard auxiliary load-balancing loss (LBL) used by ST-MoE (fedus2022switch). This formulation remains the dominant paradigm for training large-scale sparse models, including DeepSeek (liu2024deepseek), OlMoE (muennighoff2024olmoe), and DeepSpeed-MoE (rajbhandari2022deepspeedmoe).

The LBL objective encourages tokens to be distributed uniformly across the $E$ experts. For a minibatch of $T$ tokens, let $\mathbf{p}_e$ denote the batch-mean *pre-top-$k$* routing probability assigned to expert $e$, let $p_{i,e}$ denote the routing probability of expert $e$ for token $\boldsymbol{x}_i$, and let $f_e$ denote the realized routing frequency of expert $e$ under top-$k$ routing:

$$\begin{aligned} \mathbf{p}_e &= \frac{1}{T}\sum_{i=1}^{T} p_{i,e}, \quad \text{where } p_{i,e} := \text{softmax}(W_r \boldsymbol{x}_i)_e, \\ f_e &= \frac{1}{kT}\sum_{i=1}^{T} \mathbb{I}\left(e \in \text{Top-}k(W_r \boldsymbol{x}_i)\right). \end{aligned} \tag{4}$$

The auxiliary loss is defined as the dot product of these two vectors:

$$\mathcal{L}_{\text{aux}} = \sum_{e=1}^{E} f_e \cdot \mathbf{p}_e. \tag{5}$$

As shown by fedus2022switch, minimizing ([5](https://arxiv.org/html/2605.15403#S2.E5)) encourages both the gating probabilities and the discrete selections to approach a uniform distribution.
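To make (4) and (5) concrete, the following sketch computes $\mathbf{p}_e$, $f_e$, and the baseline auxiliary loss for a toy mini-batch. It is an illustration only: the helper names (`softmax`, `switch_aux_loss`) and the two toy batches are our own assumptions, and the $\alpha \cdot E$ scaling from Algorithm 1 is left out.

```python
import math

def softmax(z):
    mx = max(z)
    ex = [math.exp(v - mx) for v in z]
    s = sum(ex)
    return [v / s for v in ex]

def switch_aux_loss(batch_logits, k):
    """Return (L_aux, p, f) as in Eqs. (4)-(5) for a batch of router logits."""
    T, E = len(batch_logits), len(batch_logits[0])
    p = [0.0] * E  # batch-mean pre-top-k routing probabilities
    f = [0.0] * E  # realized top-k routing frequencies
    for logits in batch_logits:
        probs = softmax(logits)
        for e in range(E):
            p[e] += probs[e] / T
        top = sorted(range(E), key=lambda e: logits[e], reverse=True)[:k]
        for e in top:
            f[e] += 1.0 / (k * T)
    return sum(fe * pe for fe, pe in zip(f, p)), p, f

# Two tokens, E = 2 experts, top-1 routing (hypothetical logits).
balanced_loss, _, _ = switch_aux_loss([[1.0, 0.0], [0.0, 1.0]], k=1)
collapsed_loss, _, _ = switch_aux_loss([[1.0, 0.0], [1.0, 0.0]], k=1)
```

In the balanced batch both vectors are uniform and the loss equals $1/E$; when every token prefers the same expert the loss is larger, which is the signal the baseline uses to push toward uniform routing.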

Loss-free balancing. Rather than introducing an explicit load-balancing loss, which can inject interference gradients and degrade task learning, *loss-free balancing* (wang2024auxiliary) enforces balance by directly modifying the routing decision. Concretely, it adds a learned, expert-specific bias to the router logits *before* the top-$k$ selection, and updates these biases online using each expert's recent utilization.

![Refer to caption](https://arxiv.org/html/2605.15403v1/x3.png)

Figure 3: Pre-training dynamics and expert utilization. We compare $\phi$-balancing (red, solid) against ST-MoE (blue, dashed) over 10k steps. (Left) Validation loss and accuracy show that $\phi$-balancing (negative entropy) achieves comparable or superior convergence. (Right) Gini coefficient and expert loading analysis demonstrate significantly lower routing imbalance for $\phi$-balancing. $\phi$-balancing maintains tighter bounds between maximum and minimum expert load, staying closer to the perfect allocation line (green) compared to ST-MoE, which exhibits higher variance in expert capacity usage.

## 3 $\phi$-balancing

In this section, we introduce the $\phi$-balancing loss. Unlike classical approaches that enforce balance only within individual mini-batches, our goal is to regularize *global* expert usage over the entire data distribution. Concretely, we encourage globally uniform expert utilization via a strictly convex, symmetric, and differentiable potential function $\phi$.

### 3.1 The Global Load-Balancing Objective

Let $\mathbf{p}(\boldsymbol{x};\theta) \in \Delta^{E}$ denote the predicted routing probability vector for an input token $\boldsymbol{x}$, parameterized by $\theta$ (i.e., $\mathbf{p}(\boldsymbol{x};\theta) = \text{softmax}(W_r \boldsymbol{x})$). For a specific expert $e$, $\mathbf{p}(\boldsymbol{x};\theta)_e$ represents the probability mass assigned to that expert. We define the *global mean routing distribution* $\bar{\mathbf{p}}(\theta)$ as the expectation of the routing probabilities over the distribution of tokens $\mathcal{D}$ induced by the training corpus:

$$\bar{\mathbf{p}}(\theta) = \mathbb{E}_{\boldsymbol{x} \sim \mathcal{D}}\left[\mathbf{p}(\boldsymbol{x};\theta)\right], \tag{6}$$

which satisfies $\sum_{e=1}^{E} \bar{\mathbf{p}}(\theta)_e = 1$.

#### Load balancing via convex duality.

Our goal is to encourage the token population-level routing distribution $\bar{\mathbf{p}}(\theta)$ to be uniform, so that in expectation, all experts are utilized equally over the data distribution. We formulate this objective as the optimization problem

$$\min_{\theta} \mathcal{L}_{\text{bal}}(\theta) := \min_{\theta} \phi(\bar{\mathbf{p}}(\theta)), \tag{7}$$

where the potential function $\phi: \mathbb{R}^{E} \to \mathbb{R}$ is chosen to be strictly convex, symmetric, and differentiable.

The strict convexity and symmetry of $\phi$ guarantee that the objective in ([7](https://arxiv.org/html/2605.15403#S3.E7)) attains a unique minimum over the probability simplex at the uniform distribution, which is formalized by a lemma in the appendix on proofs. Representative choices of $\phi$ are summarized in Table [1](https://arxiv.org/html/2605.15403#S3.T1). Importantly, $\phi$ is not restricted to additive or separable forms such as $\sum_e \psi(\mathbf{p}_e)$ and can capture more general dependencies across experts.

#### The estimation challenge.

Optimizing ([7](https://arxiv.org/html/2605.15403#S3.E7)) directly with stochastic gradient descent is problematic. Since $\bar{\mathbf{p}}(\theta)$ is an expectation over the dataset, it must be estimated, and using the local mean of a mini-batch $\mathcal{B}$, denoted as $\hat{\mathbf{p}} = \frac{1}{|\mathcal{B}|}\sum_{\boldsymbol{x} \in \mathcal{B}} \mathbf{p}(\boldsymbol{x};\theta)$, introduces significant bias. Because $\phi$ is non-linear, the expectation of the function is not the function of the expectation:

$$\mathbb{E}_{\mathcal{B}}[\phi(\hat{\mathbf{p}})] \neq \phi\left(\mathbb{E}_{\mathcal{B}}[\hat{\mathbf{p}}]\right) = \phi(\bar{\mathbf{p}}(\theta)). \tag{8}$$

For small batch sizes, *this bias artificially forces the router to balance every individual mini-batch rather than the global distribution*, potentially degrading performance.
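The gap in (8) can be checked exactly on a toy population. The sketch below is our own illustration, not an experiment from the paper: it assumes a hypothetical population of two equally likely tokens whose routing vectors are skewed in opposite directions, so the population mean is perfectly uniform, and it uses negative entropy as $\phi$.

```python
import itertools
import math

def neg_entropy(p):
    # phi(p) = sum_e p_e log p_e, with the convention 0 log 0 := 0.
    return sum(v * math.log(v) for v in p if v > 0)

# Hypothetical population: two token types, equally likely.
population = [[0.9, 0.1], [0.1, 0.9]]

def expected_batch_potential(batch_size):
    """Exact E_B[phi(p_hat)] by enumerating all equally likely batches."""
    batches = list(itertools.product(population, repeat=batch_size))
    total = 0.0
    for batch in batches:
        p_hat = [sum(tok[e] for tok in batch) / batch_size for e in range(2)]
        total += neg_entropy(p_hat)
    return total / len(batches)

phi_of_mean = neg_entropy([0.5, 0.5])  # phi at the true population mean
gap_b1 = expected_batch_potential(1) - phi_of_mean
gap_b8 = expected_batch_potential(8) - phi_of_mean
```

Both gaps are positive, as Jensen's inequality requires for a convex $\phi$, and the gap shrinks as the batch grows. Even though the population is already balanced, a per-batch objective would still report imbalance, and more so for small batches.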

#### Duality and mirror descent.

To address the estimation challenges induced by batch-wise statistics, we leverage convex duality to decouple population-level estimation from per-batch updates. Using the identity

$$\phi(\mathbf{p}) = \sup_{\mathbf{q} \in \mathbb{R}^{E}} \langle \mathbf{p}, \mathbf{q} \rangle - \phi^{*}(\mathbf{q}), \tag{9}$$

we obtain the min-max problem

$$\min_{\theta} \max_{\mathbf{q} \in \mathbb{R}^{E}} \left( \mathbb{E}_{\boldsymbol{x} \sim \mathcal{D}}\left[\left\langle \mathbf{p}(\boldsymbol{x};\theta), \mathbf{q} \right\rangle\right] - \phi^{*}(\mathbf{q}) \right), \tag{10}$$

where $\mathbf{q} \in \mathbb{R}^{E}$ denotes the dual variable. Intuitively, each component $\mathbf{q}_e$ represents the accumulated congestion cost of expert $e$. When an expert becomes over-utilized (large $\mathbf{p}_e$), its price $\mathbf{q}_e$ increases, amplifying the penalty $\left\langle \mathbf{p}, \mathbf{q} \right\rangle$ in the primal objective. This encourages the router to shift probability mass toward under-utilized experts.
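As a concrete check of identity (9), consider negative entropy, whose conjugate is listed in Table 1. This worked example is added for illustration and assumes $\mathbf{p}$ has strictly positive entries summing to one.

```latex
\begin{aligned}
\phi^{*}(\mathbf{q}) &= \sum_{e} \exp(\mathbf{q}_e - 1), \\
\frac{\partial}{\partial \mathbf{q}_e}
  \Big[ \langle \mathbf{p}, \mathbf{q} \rangle - \phi^{*}(\mathbf{q}) \Big]
  &= \mathbf{p}_e - \exp(\mathbf{q}_e - 1) = 0
  \;\Longrightarrow\; \mathbf{q}_e^{\star} = \log \mathbf{p}_e + 1, \\
\langle \mathbf{p}, \mathbf{q}^{\star} \rangle - \phi^{*}(\mathbf{q}^{\star})
  &= \sum_{e} \mathbf{p}_e (\log \mathbf{p}_e + 1) - \sum_{e} \mathbf{p}_e
  = \sum_{e} \mathbf{p}_e \log \mathbf{p}_e = \phi(\mathbf{p}).
\end{aligned}
```

The maximizer has the form $\mathbf{q}^{\star} = \nabla\phi(\mathbf{p})$, so an expert with larger probability mass receives a larger price.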

For any fixed $\theta$, strict convexity and the first-order optimality condition imply that the inner maximization problem admits a unique maximizer given by $\mathbf{q}^{\star} = \nabla\phi\left(\bar{\mathbf{p}}(\theta)\right)$. Computing $\mathbf{q}^{\star}$ exactly is infeasible in practice, as it requires access to the full data distribution. Moreover, directly applying gradient ascent on the dual variable suffers from high variance when only mini-batch estimates are available. Instead, we exploit the convex structure of the dual problem and adopt mirror descent, which naturally yields a stable online estimator.

Denote by $\mathbf{p}_t$ the empirical mean routing distribution over the mini-batch $\mathcal{B}_t$ at iteration $t$:

$$\mathbf{p}_t := \frac{1}{|\mathcal{B}_t|}\sum_{\boldsymbol{x} \in \mathcal{B}_t} \mathbf{p}(\boldsymbol{x};\theta_t).$$

A single mirror ascent step (beck2003mirror) on the dual objective ([10](https://arxiv.org/html/2605.15403#S3.E10)) is equivalent to maintaining an exponential moving average (EMA) of the batch routing distributions followed by a price update:

$$\begin{aligned} \mathbf{m}_{t+1} &\leftarrow (1-\eta)\mathbf{m}_t + \eta\mathbf{p}_t \\ \mathbf{q}_{t+1} &\leftarrow \nabla\phi(\mathbf{m}_{t+1}), \end{aligned} \tag{11}$$

where $\mathbf{m}$ represents the primal variable corresponding to $\mathbf{q}$ and $\eta \in (0,1]$ is the step size.

The full derivation is provided in the appendix on proofs.
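The equivalence between a mirror ascent step and the EMA update in (11) can be sketched as follows. This is our own reconstruction, not the authors' proof: it assumes the conjugate $\phi^{*}$ serves as the mirror map and that $\phi$ is of Legendre type, so that $\nabla\phi^{*} = (\nabla\phi)^{-1}$; the appendix derivation may differ in detail.

```latex
\begin{aligned}
\text{stochastic dual gradient at } \mathbf{q}_t:\quad
  & \mathbf{g}_t = \mathbf{p}_t - \nabla\phi^{*}(\mathbf{q}_t), \\
\text{mirror ascent step:}\quad
  & \nabla\phi^{*}(\mathbf{q}_{t+1})
    = \nabla\phi^{*}(\mathbf{q}_t) + \eta\, \mathbf{g}_t
    = (1-\eta)\,\nabla\phi^{*}(\mathbf{q}_t) + \eta\, \mathbf{p}_t, \\
\text{with } \mathbf{m}_t := \nabla\phi^{*}(\mathbf{q}_t):\quad
  & \mathbf{m}_{t+1} = (1-\eta)\,\mathbf{m}_t + \eta\, \mathbf{p}_t,
  \qquad
  \mathbf{q}_{t+1} = \nabla\phi(\mathbf{m}_{t+1}).
\end{aligned}
```

Under these assumptions the mirror step size $\eta$ plays the role of the EMA weight on the newest batch, which is consistent with the constraint $\eta \in (0,1]$.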

Table 1: Summary of $\phi$-balancing variants. The choice of the potential function $\phi$ determines the relationship between the accumulated expert usage state $\mathbf{m}_{t+1}$ and the auxiliary loss $\mathcal{L}_{\text{aux}}$. Here, summations are taken over the experts $e$, and $q$ denotes the conjugate exponent such that $\frac{1}{p} + \frac{1}{q} = 1$. We follow the convention $0 \log 0 := 0$.

| Variant | Primal Potential $\phi(\mathbf{p})$ | Dual Potential $\phi^{*}(\mathbf{q})$ | Auxiliary Loss $\mathcal{L}_{\text{aux}}$ |
| --- | --- | --- | --- |
| Euclidean Norm ($p = 2$) | $\frac{1}{2}\lVert\mathbf{p}\rVert_2^2$ | $\frac{1}{2}\lVert\mathbf{q}\rVert_2^2$ | $\sum \mathbf{p}_{t,e} \cdot \mathbf{m}_{t+1,e}$ |
| $\ell_p$ Norm ($p > 1$) | $\frac{1}{p}\lVert\mathbf{p}\rVert_p^p$ | $\frac{1}{q}\lVert\mathbf{q}\rVert_q^q$ | $\sum \mathbf{p}_{t,e} \cdot \operatorname{sgn}(\mathbf{m}_{t+1,e})\lvert\mathbf{m}_{t+1,e}\rvert^{p-1}$ |
| Soft $\ell_1$ Norm ($\delta > 0$) | $\sum\left(\lvert\mathbf{p}_e\rvert - \delta\log\left(\frac{1}{\delta}\lvert\mathbf{p}_e\rvert + 1\right)\right)$ | $\sum -\delta\left(\lvert\mathbf{q}_e\rvert + \log(1 - \lvert\mathbf{q}_e\rvert)\right)$ $^{*}$ | $\sum \mathbf{p}_{t,e} \cdot \mathbf{m}_{t+1,e}\left(\lvert\mathbf{m}_{t+1,e}\rvert + \delta\right)^{-1}$ |
| Negative Entropy | $\sum \mathbf{p}_e \log \mathbf{p}_e$ | $\sum \exp(\mathbf{q}_e - 1)$ | $\sum \mathbf{p}_{t,e} \cdot (\log \mathbf{m}_{t+1,e} + 1)$ |
| Tsallis Entropy ($\alpha > 0, \alpha \neq 1$) | $\sum (\mathbf{p}_e^{\alpha} - \mathbf{p}_e)(\alpha - 1)^{-1}$ | no simple closed form | $\sum \mathbf{p}_{t,e} \cdot (\alpha\mathbf{m}_{t+1,e}^{\alpha-1} - 1)(\alpha - 1)^{-1}$ |
| Rényi Entropy ($\alpha \in (0,1)$) | $\frac{1}{\alpha-1}\log\left(\sum \mathbf{p}_e^{\alpha}\right)$ | no simple closed form | $\sum \mathbf{p}_{t,e} \cdot (\alpha\mathbf{m}_{t+1,e}^{\alpha-1})\left((\alpha-1)\sum \mathbf{m}_j^{\alpha}\right)^{-1}$ |
| Pseudo-Huber ($\delta > 0$) | $\sum\left(\sqrt{\mathbf{p}_e^2 + \delta^2} - \delta\right)$ | $\sum\left(-\delta\sqrt{1 - \mathbf{q}_e^2} + \delta\right)$ $^{\dagger}$ | $\sum \mathbf{p}_{t,e} \cdot \mathbf{m}_{t+1,e}\left(\mathbf{m}_{t+1,e}^2 + \delta^2\right)^{-\frac{1}{2}}$ |
| Log-cosh ($\beta > 0$) | $\sum \frac{1}{\beta}\log\cosh(\beta\mathbf{p}_e)$ | $\sum\left(\frac{1+\mathbf{q}_e}{2\beta}\log(1+\mathbf{q}_e) + \frac{1-\mathbf{q}_e}{2\beta}\log(1-\mathbf{q}_e)\right)$ $^{\ddagger}$ | $\sum \mathbf{p}_{t,e} \cdot \tanh(\beta\mathbf{m}_{t+1,e})$ |
| Softplus | $\sum \log(\exp(\mathbf{p}_e) + 1)$ | $\sum\left(\mathbf{q}_e\log\mathbf{q}_e + (1-\mathbf{q}_e)\log(1-\mathbf{q}_e)\right)$ $^{\S}$ | $\sum \mathbf{p}_{t,e} \cdot (\exp(-\mathbf{m}_{t+1,e}) + 1)^{-1}$ |

$^{*}$ when $\lVert\mathbf{q}\rVert_{\infty} < 1$, otherwise $\infty$. $^{\dagger}$ when $\lVert\mathbf{q}\rVert_{\infty} \leq 1$, otherwise $\infty$. $^{\ddagger}$ when $\lvert\mathbf{q}_e\rvert < 1$, otherwise $0$ when $\lvert\mathbf{q}_e\rvert = 1$, otherwise $\infty$. $^{\S}$ when $\mathbf{q} \in [0,1]^{E}$, otherwise $\infty$.

Using $\mathbf{q}_{t+1}$ as an approximation for $\mathbf{q}^{\star}$, the loss w.r.t. $\theta$ becomes

$$\mathcal{L}_{\text{aux}} = \left\langle \mathbf{p}_t, \mathbf{q}_{t+1} \right\rangle = \sum_{e=1}^{E} \mathbf{p}_{t,e} \nabla\phi(\mathbf{m}_{t+1})_e, \tag{12}$$

which yields Algorithm [1](https://arxiv.org/html/2605.15403#alg1). Note that we should apply the stop-gradient operator to $\mathbf{m}_{t+1}$ (and hence $\mathbf{q}_{t+1}$) when optimizing the router, so that gradients flow only through $\mathbf{p}_t$.
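Updates (11) and (12) with negative entropy as $\phi$ can be sketched in plain Python as below. This is a minimal sketch under our own assumptions, not the authors' code: the function name `phi_balancing_step` and the toy load vector are hypothetical, the $\alpha \cdot E$ scaling of Algorithm 1 is omitted, and the batch loads are assumed strictly positive (as softmax outputs are) so the logarithm is defined even when $\mathbf{m}$ starts at zero.

```python
import math

def phi_balancing_step(m, p_batch, eta):
    """One step of Eqs. (11)-(12) with phi = negative entropy.

    Returns (m_new, q, aux_loss). In an autodiff framework, m_new and q
    would be wrapped in a stop-gradient so that only p_batch carries gradient.
    """
    # EMA of the batch-mean routing probabilities (primal state).
    m_new = [(1 - eta) * mi + eta * pi for mi, pi in zip(m, p_batch)]
    # Price update q = grad phi(m) = log(m) + 1, treated as a constant.
    q = [math.log(mi) + 1.0 for mi in m_new]
    # Auxiliary loss <p_t, q_{t+1}>; its gradient w.r.t. p_e is just q_e.
    aux = sum(pi * qi for pi, qi in zip(p_batch, q))
    return m_new, q, aux

# Hypothetical batch loads for E = 4 experts, with expert 0 overloaded.
p_t = [0.7, 0.1, 0.1, 0.1]
m1, q1, aux1 = phi_balancing_step([0.0] * 4, p_t, eta=0.1)
```

Because the prices are constants with respect to the router, the gradient of the loss with respect to each $\mathbf{p}_{t,e}$ is $\mathbf{q}_{t+1,e}$: the overloaded expert carries the highest price, so gradient descent moves probability mass away from it.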

#### Related methods.

As shown in Algorithm [1](https://arxiv.org/html/2605.15403#alg1), our method differs from the ST-MoE loss only in replacing the realized frequency $f_e$ in $\mathcal{L}_{\text{switch}} \propto \sum_e f_e \cdot \mathbf{p}_e$ with our $\nabla\phi(\mathbf{m}_{t+1})_e$. The *hard dispatch fraction* $f_e$ (the percentage of tokens actually sent to expert $e$) introduces discrete, non-differentiable assignment noise. In contrast, our method relies solely on the dual variable $\mathbf{q}$, which tracks the history of the *soft routing probabilities*. Consequently, our regularizer operates entirely within the smooth probability space, avoiding the instability associated with discrete routing decisions.

DeepSeek MoE (wang2024auxiliary; deepseekv2) similarly maintains an EMA of recent expert loads to dynamically update per-expert routing-score biases before the top-$k$ decision. However, this approach still relies on the hard routing frequency $f_e$ and does not correspond to principled optimization of a population-level objective as in our derivation.

### 3.2 Examples of $\phi$

The behavior of the min-max LBL ([10](https://arxiv.org/html/2605.15403#S3.E10)) is governed by the potential $\phi$. This function determines how the accumulated routing statistics (primal vector $\mathbf{m}$) are mapped to the expert prices (dual vector $\mathbf{q}$) according to ([11](https://arxiv.org/html/2605.15403#S3.E11)). We summarize in Table [1](https://arxiv.org/html/2605.15403#S3.T1) several examples of $\phi$, all of which are strictly convex, symmetric, and differentiable.

#### Euclidean potential.

Setting the potential to the squared Euclidean norm $\phi(\mathbf{m}) = \frac{1}{2}\|\mathbf{m}\|_2^2$ yields the conjugate $\phi^{*}(\mathbf{q}) = \frac{1}{2}\|\mathbf{q}\|_2^2$. Since the link function is defined as $\mathbf{q} = \nabla\phi(\mathbf{m})$, this choice induces the identity map, effectively equating the price vector to the state: $\mathbf{q}_{t+1} = \mathbf{m}_{t+1}$.

#### $\ell_p$ potentials.

A simple smooth family parameterized by $p > 1$ is

$$\phi(\mathbf{m}) = \frac{1}{p}\|\mathbf{m}\|_p^p = \frac{1}{p}\sum_{e=1}^{E} \mathbf{m}_e^{p},$$

which yields the link function $\mathbf{q} = \nabla\phi(\mathbf{m}) = \mathbf{m}^{p-1}$ (since $\mathbf{m} \in \Delta^{E}$ is nonnegative). The exponent $p$ controls the elasticity of the pricing mechanism:

- *$p \to 1$ (dampened):* The exponent vanishes, driving prices toward uniformity regardless of usage history. This effect is also approximated by the *soft $\ell_1$ potential* with $\phi(\mathbf{m}) = \|\mathbf{m}\|_1 - \delta\left\|\log\left(\frac{1}{\delta}|\mathbf{m}| + 1\right)\right\|_1$ and link function $\mathbf{q} = \nabla\phi(\mathbf{m}) = \frac{\mathbf{m}}{|\mathbf{m}| + \delta}$.
- *$p \to \infty$ (aggressive):* The exponent diverges, causing small disparities in usage to result in extreme price penalties.

#### Negative Shannon entropic potential.

Setting $\phi(\mathbf{m}) = \sum \mathbf{m}_e \log \mathbf{m}_e$ yields the dual relationship

$$\mathbf{q} = \nabla\phi(\mathbf{m}) = \log(\mathbf{m}) + \mathbf{1}.$$

This establishes an exponential link between the primal distribution and the dual prices, i.e. $\mathbf{m}_e = \exp(\mathbf{q}_e - 1) \propto \exp(\mathbf{q}_e)$. Unlike the linear response, this penalizes low-probability experts aggressively, effectively acting as a soft barrier.

#### Negative Tsallis entropic potential.

The negative Tsallis entropy is parameterized by $\alpha > 0$ and $\alpha \neq 1$ and defined as

$$\phi(\mathbf{m}) = \sum_{e=1}^{E} \frac{\mathbf{m}_e^{\alpha} - \mathbf{m}_e}{\alpha - 1},$$

with gradient

$$\nabla\phi(\mathbf{m}) = \frac{\alpha\mathbf{m}^{\alpha-1} - 1}{\alpha - 1}.$$

It converges to the negative Shannon entropy in the limit as $\alpha \to 1$.

#### Negative Rényi entropic potential.

Another family that generalizes the negative Shannon entropy is the negative Rényi entropy, parameterized by $\alpha \in (0,1)$ and defined as

$$\phi(\mathbf{m}) = \frac{1}{\alpha - 1}\log\left(\sum_{e=1}^{E} \mathbf{m}_e^{\alpha}\right)$$

with gradient

$$\nabla\phi(\mathbf{m}) = \frac{\alpha\mathbf{m}^{\alpha-1}}{(\alpha - 1)\sum_{j=1}^{E} \mathbf{m}_j^{\alpha}}.$$

It converges to the negative Shannon entropy in the limit as $\alpha \to 1$.

#### Robust potentials.

There are several choices of $\phi$ based on *smooth robust losses*, whose sigmoidal gradients control how aggressively large usage disparities translate into prices. The *pseudo-Huber potential* behaves quadratically in a neighborhood of the origin but smoothly transitions to an approximately linear regime, thereby limiting the influence of extreme outliers. The precise transition scale is controlled by the parameter $\delta > 0$. Similar properties are enjoyed by the *log-cosh* and *softplus potentials*, whose respective link functions $\mathbf{q} = \tanh(\beta\mathbf{m})$ and

$$\mathbf{q} = \sigma(\mathbf{m}) := \frac{1}{\exp(-\mathbf{m}) + 1}$$

are especially well-behaved.
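The link functions $\mathbf{q} = \nabla\phi(\mathbf{m})$ discussed in this section are all cheap elementwise maps. The sketch below collects several of them in plain Python for comparison; the function names are our own, the formulas follow Table 1, and entries of $\mathbf{m}$ are assumed strictly positive wherever a logarithm or negative power is taken.

```python
import math

def link_euclidean(m):
    return list(m)                                   # q = m

def link_lp(m, p):
    return [v ** (p - 1) for v in m]                 # q = m^(p-1), m >= 0

def link_neg_entropy(m):
    return [math.log(v) + 1.0 for v in m]            # q = log m + 1

def link_tsallis(m, alpha):
    # q = (alpha * m^(alpha-1) - 1) / (alpha - 1), alpha > 0, alpha != 1
    return [(alpha * v ** (alpha - 1) - 1) / (alpha - 1) for v in m]

def link_log_cosh(m, beta):
    return [math.tanh(beta * v) for v in m]          # q = tanh(beta m)

def link_softplus(m):
    return [1.0 / (math.exp(-v) + 1.0) for v in m]   # q = sigmoid(m)

# Hypothetical EMA state with one overloaded expert.
m_state = [0.7, 0.1, 0.1, 0.1]
prices_entropy = link_neg_entropy(m_state)
prices_tsallis = link_tsallis(m_state, alpha=1.0001)
```

Evaluating the Tsallis link with $\alpha$ close to 1 gives prices that nearly coincide with the negative-entropy prices, matching the stated limit. Every link is monotone in $\mathbf{m}_e$, so a uniform state always produces uniform prices.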

## 4 Experiments

We evaluate $\phi$-balancing across a range of settings and find that $\phi$-balancing with negative entropy consistently performs best (Figure [3](https://arxiv.org/html/2605.15403#S2.F3)), outperforming Switch-style and loss-free load-balancing baselines across model scales, architectures, and downstream tasks. In large-scale Gemma pretraining, $\phi$-balancing yields more stable routing, lower validation loss, and substantially reduced capacity violations when varying model scale, expert count, and granularity. In downstream fine-tuning, these stability gains translate into stronger task performance and more consistent expert specialization across domains. Our ablations show that history-aware population tracking is critical for robustness, and that entropy-based potentials provide the best overall trade-off between routing stability and downstream accuracy.

### 4.1 Gemma-based Language Model Pretraining

We first evaluate $\phi$-balancing on MoE-augmented Gemma language models (Liang25). Unless otherwise stated, all models use top-2 routing and are trained on C4 (RaffelSRLNMZLL20) with the same Gemma-style pretraining recipe (see the appendix for details on hyperparameters) using negative entropy as $\phi$. Following the settings in tian2025towards, we systematically vary (i) the number of *active* parameters $N$, (ii) the number of experts $E$, and (iii) the MoE *granularity* $G$ (Figure [2](https://arxiv.org/html/2605.15403#S2.F2)).

#### Scaling active parameters.

To study how $\phi$-balancing behaves across model scales, we train a family of MoE Transformers with $E = 16$ experts and $A = 2$ active experts per token, and vary the number of active parameters $N$ in $\{111\text{M}, 338\text{M}, 588\text{M}, 986\text{M}\}$. Here, $N$ counts only the parameters that are touched for a single token under top-$2$ routing. For each scale, we compare $\phi$-balancing against standard Switch-style load balancing and loss-free load balancing. We see that the proposed $\phi$-balancing strategy consistently outperforms both baselines across all tested model scales, achieving the lowest validation loss at the 986M parameter mark.

#### Scaling the number of experts.

Next, we fix the total active parameter budget and per-token compute, and vary the number of experts $E \in \{8, 16, 32, 64, 128\}$, keeping the number of active experts at $A = 2$. As $E$ increases, we proportionally reduce the size of each expert so that the total FLOPs per token remain approximately constant. This isolates the effect of expert multiplicity, allowing us to test whether $\phi$-balancing continues to stabilize routing when many small experts are available. We see that the performance gap between $\phi$-balancing and both baselines is maintained across the entire range of activation ratios, indicating that the benefit of $\phi$-balancing is robust to the level of model sparsity.

#### Scaling granularity.

Finally, we study the effect of MoE granularity by varying the granularity factor $G \in \{2, 4, 8, 16, 32\}$, defined as $G = d_{\mathrm{ff}} / d_{\mathrm{expert}}$, where $d_{\mathrm{expert}}$ denotes the hidden dimension of a single expert and $d_{\mathrm{ff}}$ is the total feed-forward dimension of the MoE layer. Increasing $G$ increases the total number of experts while proportionally decreasing the size of each expert, so that the overall capacity and per-token compute remain fixed and the activation ratio $A/E$ is held constant. Intuitively, larger $G$ corresponds to slicing the feed-forward capacity into finer-grained experts. This setting is particularly sensitive to routing instability, and serves as a stress test for $\phi$-balancing versus conventional load-balancing losses.

Table 2: Ablation on mirror maps $\phi$. We report validation loss and maximum global load-balance violation ($\text{MaxVio}_{\text{global}}$), defined in the appendix on notation; lower is better.
