FlexMoE: One-for-All Nested Intra-Expert Pruning for MoE Language Models

arXiv cs.LG Papers

Summary

FlexMoE proposes a one-for-all nested intra-expert pruning method for MoE language models, enabling multiple deployable subnetworks from a single training run with minimal performance loss.

arXiv:2606.27866v1 Announce Type: new Abstract: Mixture-of-Experts (MoE) language models scale model ability with sparsely activated experts, making this architecture a standard recipe for modern large models. However, sparse activation does not remove the deployment burden of storing and serving all experts, and the available deployment budget can vary substantially across devices, users, and workloads. Existing MoE compression methods are still largely fixed-budget, typically optimizing one compressed endpoint at each chosen target budget. We study a different setting: converting a large pretrained MoE LLM into a nested family of deployable subnetworks across budgets. Our method first ranks expert FFN channels by their importance, then lets each expert learn a discrete action to prune its channels. By gradually increasing cost pressure, a single action-training run exports a series of action masks from high to low budgets, each of which identifies a reliable smaller subnetwork nested in the ranked base model. Moreover, we use a single recovery fine-tune at a mid pruning budget (40%) to recover degraded model quality and transfer the recovered model to other unseen budgets. Overall, our framework surpasses recent MoE compression baselines. Specifically, on Qwen2-57B-A14B, our method retains ~99.8% of base performance while pruning 50% of routed expert parameters even without fine-tuning. For deployment, our pruned subnetworks deliver real memory reduction and throughput gains, and further support realtime online budget switching with kernel-level co-design.

# FlexMoE: One-for-All Nested Intra-Expert Pruning for MoE Language Models
Source: [https://arxiv.org/html/2606.27866](https://arxiv.org/html/2606.27866)
Fan Mo National University of Singapore e1583153@u\.nus\.edu Yuxuan Han National University of Singapore han\_yuxuan@u\.nus\.edu Geng Zhang National University of Singapore zhangg@comp\.nus\.edu\.sg Wangbo Zhao National University of Singapore wangbo\.zhao96@gmail\.com Yang You National University of Singapore youy@comp\.nus\.edu\.sg

###### Abstract

Mixture-of-Experts (MoE) language models scale model ability with sparsely activated experts, making this architecture a standard recipe for modern large models. However, sparse activation does not remove the deployment burden of storing and serving all experts, and the available deployment budget can vary substantially across devices, users, and workloads. Existing MoE compression methods are still largely fixed-budget, typically optimizing one compressed endpoint at each chosen target budget. We study a different setting: converting a large pretrained MoE LLM into a nested family of deployable subnetworks across budgets. Our method first ranks expert FFN channels by their importance, then lets each expert learn a discrete action to prune its channels. By gradually increasing cost pressure, a single action-training run exports a series of action masks from high to low budgets, each of which identifies a reliable smaller subnetwork nested in the ranked base model. Moreover, we use a single recovery fine-tune at a mid pruning budget (40%) to recover degraded model quality and transfer the recovered model to other unseen budgets. Overall, our framework surpasses recent MoE compression baselines. Specifically, on Qwen2-57B-A14B, our method retains $\sim 99.8\%$ of base performance while pruning $50\%$ of routed expert parameters even without fine-tuning. For deployment, our pruned subnetworks deliver real memory reduction and throughput gains, and further support realtime online budget switching with kernel-level co-design.

## 1 Introduction

![Refer to caption](https://arxiv.org/html/2606.27866v1/figures/flexmoe-preview.png)

Figure 1: Comparison between existing fixed-budget MoE compression methods and FlexMoE. Existing methods typically optimize or instantiate a compressed endpoint for a specified target budget, including expert pruning/skipping methods such as NAEE (Lu et al., [2024](https://arxiv.org/html/2606.27866#bib.bib4)) and expert-weight decomposition methods such as MoE-SVD (Li et al., [2025](https://arxiv.org/html/2606.27866#bib.bib7)) and TD-MoE (Xu et al., [2026](https://arxiv.org/html/2606.27866#bib.bib8)). In contrast, FlexMoE enables one-for-all structural pruning and multi-budget deployment without maintaining a separate compressed checkpoint for each deployment target.

Mixture-of-Experts (MoE) has become one of the most effective recipes for scaling large models: by increasing total parameter capacity while activating only a small subset of experts per token, MoE models can achieve strong quality with much lower active computation (Lepikhin et al., [2021](https://arxiv.org/html/2606.27866#bib.bib1); Fedus et al., [2022](https://arxiv.org/html/2606.27866#bib.bib2)). This design has already powered a broad range of successful systems, from top-tier large-scale research systems to a range of compact yet powerful open-source models, indicating that sparse expert architectures are an increasingly standard way to scale modern foundation models.

However, sparsely activated experts do not remove the deployment burden of storing and serving all experts. In deployment, one does not always need, or have the budget, to use the largest available MoE model operating at its maximum capacity. (Here, budget denotes the deployment-side resource envelope available to a model under a given serving setting, including device hardware capacity, memory footprint, latency/throughput targets, context-length demands, or service-level objectives (SLOs).) In practice, such budgets vary across users, devices, platforms, and workloads: one service may need long-context or multi-agent reasoning under a generous budget, while another may serve lightweight chatbot requests under much stricter latency and cost constraints (Fu et al., [2024](https://arxiv.org/html/2606.27866#bib.bib13); Gao et al., [2026](https://arxiv.org/html/2606.27866#bib.bib34); Ma et al., [2026](https://arxiv.org/html/2606.27866#bib.bib39)). This makes MoE compression not merely a model reduction problem, but also a deployment adaptation problem.

Existing MoE compression methods, however, are still largely fixed-budget. Expert-pruning methods remove, skip, or merge experts structurally, while compression methods such as low-rank decomposition or expert-internal pruning typically optimize one compressed endpoint at a chosen target budget (Lu et al., [2024](https://arxiv.org/html/2606.27866#bib.bib4); Bai et al., [2025](https://arxiv.org/html/2606.27866#bib.bib9); Yang et al., [2024b](https://arxiv.org/html/2606.27866#bib.bib5); Gu et al., [2025](https://arxiv.org/html/2606.27866#bib.bib6); Li et al., [2025](https://arxiv.org/html/2606.27866#bib.bib7); Xu et al., [2026](https://arxiv.org/html/2606.27866#bib.bib8)). This is useful when a deployment target is known in advance, but much less convenient when budgets vary across scenarios or change over time: moving to another operating point often requires rerunning the entire compression pipeline, reloading the model, or maintaining multiple models for separate budgets. Moreover, recent systems work highlights that realistic LLM serving must operate under non-stationary traffic and mixed request requirements, where online reconfiguration is often valuable but difficult to realize cleanly (Gao et al., [2026](https://arxiv.org/html/2606.27866#bib.bib34); Qian et al., [2026](https://arxiv.org/html/2606.27866#bib.bib35)).

Motivated by these observations, we propose FlexMoE, a nested intra-expert pruning framework that converts one pretrained MoE model into a family of materializable subnetworks across budgets. (In the method and experiments, budget refers to the ratio of pruned parameters over all parameters in routed experts.) As illustrated in Figure [2](https://arxiv.org/html/2606.27866#S2.F2), our pipeline first reranks expert FFN channels by estimated importance to enable top-retained channel slicing. We then let each expert learn a retention ratio from a predefined discrete ratio set, continuously in a training loop with increasing pruning pressure. A single training run thus yields a family of action masks across multiple budgets, where each mask identifies a reliable pruned subnetwork nested in the ranked pretrained MoE model. Optionally, we further perform a single mid-budget fine-tuning stage to recover a shared set of weights from parameter pruning, enabling model reuse across the entire budget family. This results in a "train-once, deploy-many" pruning pipeline. In addition, to achieve real throughput benefits in the real-time online budget adjustment scenario, we pair this budget family with a deployment-oriented system co-design based on kernel-level optimization.

We conducted comprehensive experiments on Mixtral-8x7B, Phi-3.5-MoE, and Qwen2-57B-A14B. Across these backbones, FlexMoE surpasses strong MoE compression baselines while offering substantially more flexible deployment. In particular, on Qwen2-57B-A14B, the pruned subnet retains about 99.8% of the base performance at a 50% pruning budget and still preserves about 92.9% at an 80% budget. We further show that the shared weights recovered by a single mid-budget fine-tune transfer well to unseen higher and lower budgets, and that the resulting subnet family delivers real throughput gains and more flexible deployment interfaces, especially when coupled with our exploratory algorithm–system co-design for online budget switching.

## 2 Related Work

### 2.1 MoE Compression

As shown in Figure [1](https://arxiv.org/html/2606.27866#S1.F1), MoE compression methods mainly operate in two directions. A substantial line of MoE parameter-pruning work reduces, skips, or merges experts at the expert level. NAEE shows that experts in pretrained MoE LLMs are not equally important and studies expert pruning and skipping to improve deployment efficiency (Lu et al., [2024](https://arxiv.org/html/2606.27866#bib.bib4)). DiEP pushes this direction further with differentiable expert pruning, learning which experts to retain under compression objectives rather than relying only on heuristic expert ranking (Bai et al., [2025](https://arxiv.org/html/2606.27866#bib.bib9)). On the expert-merging side, HC-SMoE clusters and merges functionally similar experts to reduce model size without retraining (Chen et al., [2024](https://arxiv.org/html/2606.27866#bib.bib21)). For skipping, MoNE replaces redundant expert outputs with lightweight novices instead of retaining the full expert computation, providing another way to reduce parameters and deployment cost (Zhang et al., [2025](https://arxiv.org/html/2606.27866#bib.bib22)). Another line compresses expert weights internally while largely preserving the original MoE structure. MoE-SVD and D2-MoE are representative approaches for pretrained MoE LLMs, while TD-MoE further extends this direction to cross-expert joint tensor decomposition within each layer (Li et al., [2025](https://arxiv.org/html/2606.27866#bib.bib7); Gu et al., [2025](https://arxiv.org/html/2606.27866#bib.bib6); Xu et al., [2026](https://arxiv.org/html/2606.27866#bib.bib8)). MoE-I$^2$ is also related in that it combines inter-expert pruning with intra-expert decomposition (Yang et al., [2024b](https://arxiv.org/html/2606.27866#bib.bib5)). Our work is closest to this intra-expert compression family, most of which relies on low-rank decomposition; in contrast, our pruning method is training-based and fine-grained. We also impose a sliceable nested structure within each expert of the same pretrained MoE base model, which shares the same router allocation and expert topology across subnetworks and enables more flexible deployment interfaces and budget switching.

### 2.2 Nested Subnetworks, Mask Learning, and Ranking

The broader train-once, deploy-many idea comes from elastic and nested subnetwork training. US-Nets introduced universally slimmable subnetworks, and later MatFormer, Flextron, and AmoebaLLM extended nested or many-in-one parameter sharing to transformers and large language models (Yu and Huang, [2019](https://arxiv.org/html/2606.27866#bib.bib10); Devvrit et al., [2024](https://arxiv.org/html/2606.27866#bib.bib11); Cai et al., [2024](https://arxiv.org/html/2606.27866#bib.bib12); Fu et al., [2024](https://arxiv.org/html/2606.27866#bib.bib13)). More recent MoE work has started to bring elasticity into MoE language models directly by learning coarse-to-fine expert ranking or slimmable expert widths (Wang et al., [2025](https://arxiv.org/html/2606.27866#bib.bib19); Tastan et al., [2026](https://arxiv.org/html/2606.27866#bib.bib20)). Our work is different: rather than introducing elasticity during MoE pre-training, we start from pretrained MoE LLMs and derive a family of nested subnetworks across budgets at the post-training stage.

Our method also draws on differentiable structure learning and gradient-based importance ranking. MaskLLM and Gumbel-Softmax provide standard tools for learning discrete structural decisions with gradient-based optimization (Fang et al., [2024](https://arxiv.org/html/2606.27866#bib.bib15); Jang et al., [2017](https://arxiv.org/html/2606.27866#bib.bib17)), while RECAP and earlier Taylor criteria show that grouped first-order saliency is effective for structured pruning decisions (Ilhan et al., [2024](https://arxiv.org/html/2606.27866#bib.bib14); Molchanov et al., [2019](https://arxiv.org/html/2606.27866#bib.bib16)). We combine these ingredients in a new setting: learning a per-expert channel prefix-slicing action over importance-ordered expert channels.

![Refer to caption](https://arxiv.org/html/2606.27866v1/figures/method-pipeline.png)

Figure 2: Overview of the FlexMoE pipeline. First, it reranks the FFN projection channels of each expert by estimated importance to enable top-retained channel slicing. Then each expert learns a retention ratio continuously in an optimization loop with increasing pruning pressure, so that a single training run yields a family of action masks across multiple budgets, where each saved mask identifies a pruned subnetwork nested in the reranked base MoE model. Finally, an optional single-point fine-tuning stage at a mid budget (40%) recovers a shared set of weights from parameter pruning, so that the recovered weights can be reused across the entire budget family.

## 3 Method

We consider a pretrained MoE LLM with $L$ MoE layers and $E$ experts per layer. For a hidden state $h$ routed to expert $e$ in layer $l$, we write the expert FFN as

$$\mathrm{FFN}_{l,e}(h)=W^{\mathrm{down}}_{l,e}\Big(\phi(W^{\mathrm{gate}}_{l,e}h)\odot W^{\mathrm{up}}_{l,e}h\Big),\tag{1}$$

where $\phi(\cdot)$ is the activation function and the intermediate FFN width is $d_{\mathrm{ff}}$.

### 3.1 Intra-Expert Channel Ranking

Directly applying channel prefix slicing to pretrained experts is brittle\. We therefore perform a one\-time hidden\-channel reordering step that converts each expert FFN into an importance\-ordered layout, so that smaller prefixes preserve more important parameters under prefix\-slice pruning\.

For expert $(l,e)$, we define a structured parameter group $\Theta_{l,e,j}$ for the $j$-th FFN hidden channel, consisting of the corresponding rows of $W^{\mathrm{gate}}_{l,e}$ and $W^{\mathrm{up}}_{l,e}$ together with the matching column of $W^{\mathrm{down}}_{l,e}$. In other words, $\Theta_{l,e,j}$ collects all parameters attached to hidden channel $j$ in the FFN. The FFN hidden channels are permutation-invariant as long as the same permutation is applied consistently to the corresponding rows/columns of $W^{\mathrm{gate}}_{l,e}$, $W^{\mathrm{up}}_{l,e}$, and $W^{\mathrm{down}}_{l,e}$ (Navon et al., [2023](https://arxiv.org/html/2606.27866#bib.bib45)). This allows us to reorder hidden channels by importance without changing the full expert function, while making the prefix channel slicing action much more meaningful.

We estimate a first-order Taylor saliency for each hidden channel on a small ranking set $\mathcal{B}_{\mathrm{rank}}$ of calibration batches. For batch $b\in\mathcal{B}_{\mathrm{rank}}$, the batch-wise and final group saliency are computed as

$$g^{(b)}_{\theta}=\frac{\partial\mathcal{L}^{(b)}}{\partial\theta},\qquad s^{(b)}_{l,e,j}=\sum_{\theta\in\Theta_{l,e,j}}\big(\theta\,g^{(b)}_{\theta}\big)^{2},\qquad s_{l,e,j}=\frac{1}{|\mathcal{B}_{\mathrm{rank}}|}\sum_{b\in\mathcal{B}_{\mathrm{rank}}}s^{(b)}_{l,e,j}.\tag{2}$$

Intuitively, $s_{l,e,j}$ estimates how strongly the parameter group $\Theta_{l,e,j}$ (channel $j$) affects the loss $\mathcal{L}$. This follows the grouped first-order Taylor view of structured saliency and is closely related to the ranking signal used in RECAP (Molchanov et al., [2019](https://arxiv.org/html/2606.27866#bib.bib16); Ilhan et al., [2024](https://arxiv.org/html/2606.27866#bib.bib14)). However, in our setting, we use it only once as a preprocessing step to sort expert FFN hidden channels before action learning, rather than as part of an iterative prune–recover procedure. We then sort $s_{l,e,j}$ in descending order to obtain a permutation $\pi_{l,e}$, or equivalently a permutation matrix $P_{l,e}$, which is applied consistently across the expert FFN hidden dimension:

$$\widetilde{W}^{\mathrm{gate}}_{l,e}=P_{l,e}W^{\mathrm{gate}}_{l,e},\qquad\widetilde{W}^{\mathrm{up}}_{l,e}=P_{l,e}W^{\mathrm{up}}_{l,e},\qquad\widetilde{W}^{\mathrm{down}}_{l,e}=W^{\mathrm{down}}_{l,e}P_{l,e}^{\top}.\tag{3}$$
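The ranking and reordering step can be illustrated with a minimal pure-Python sketch. It assumes a single toy expert with SiLU gating, and it uses random stand-in gradients in place of real backpropagated gradients from calibration batches. The helper names (`channel_saliency`, `reorder`, `rand_matrix`) are hypothetical and not taken from the paper's code.

```python
import math
import random


def silu(x):
    return x / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def ffn(w_gate, w_up, w_down, h):
    """Gated FFN of Eq. (1); w_gate/w_up are d_ff x d_model, w_down is d_model x d_ff."""
    hidden = [silu(dot(g, h)) * dot(u, h) for g, u in zip(w_gate, w_up)]
    return [dot(row, hidden) for row in w_down]


def channel_saliency(w_gate, w_up, w_down, grad_batches):
    """Eq. (2): sum of (theta * grad)^2 over each channel group, averaged over batches."""
    d_ff = len(w_gate)
    scores = [0.0] * d_ff
    for g_gate, g_up, g_down in grad_batches:
        for j in range(d_ff):
            s = sum((w * g) ** 2 for w, g in zip(w_gate[j], g_gate[j]))
            s += sum((w * g) ** 2 for w, g in zip(w_up[j], g_up[j]))
            s += sum((wr[j] * gr[j]) ** 2 for wr, gr in zip(w_down, g_down))
            scores[j] += s / len(grad_batches)
    return scores


def reorder(w_gate, w_up, w_down, scores):
    """Eq. (3): permute gate/up rows and down columns by descending saliency."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    return (
        [w_gate[j] for j in order],
        [w_up[j] for j in order],
        [[row[j] for j in order] for row in w_down],
        order,
    )


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_model, d_ff = 4, 6  # toy sizes, chosen only for illustration
w_gate = rand_matrix(d_ff, d_model, rng)
w_up = rand_matrix(d_ff, d_model, rng)
w_down = rand_matrix(d_model, d_ff, rng)
# Stand-in gradients: a real run would backpropagate the loss on calibration batches.
grad_batches = [
    (rand_matrix(d_ff, d_model, rng), rand_matrix(d_ff, d_model, rng), rand_matrix(d_model, d_ff, rng))
    for _ in range(3)
]

scores = channel_saliency(w_gate, w_up, w_down, grad_batches)
r_gate, r_up, r_down, order = reorder(w_gate, w_up, w_down, scores)

h = [rng.uniform(-1.0, 1.0) for _ in range(d_model)]
out_before = ffn(w_gate, w_up, w_down, h)
out_after = ffn(r_gate, r_up, r_down, h)
max_gap = max(abs(a - b) for a, b in zip(out_before, out_after))
print(max_gap)  # differs from zero only by floating-point rounding
```

The full expert output is unchanged by the permutation, which is what makes the reordering a free preprocessing step.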
After reordering, a retention ratio $r\in(0,1]$ corresponds to keeping only the top $k=\lceil r\,d_{\mathrm{ff}}\rceil$ channels among all ranked channels. Let $m(r)\in\{0,1\}^{d_{\mathrm{ff}}}$ be a prefix mask whose first $k$ entries are one and whose remaining entries are zero. The corresponding sliced expert is

$$\mathrm{FFN}^{(r)}_{l,e}(h)=\widetilde{W}^{\mathrm{down}}_{l,e}\Big(m(r)\odot\phi(\widetilde{W}^{\mathrm{gate}}_{l,e}h)\odot\widetilde{W}^{\mathrm{up}}_{l,e}h\Big),\tag{4}$$

where $\phi(\cdot)$ denotes the gate activation (e.g., SiLU). During action training, all parameters are kept and we only use the prefix mask $m(r)$ to simulate pruned expert outputs. During deployment, parameters masked by the learned actions can be dropped or excluded from the forward computation to reduce cost.

This importance\-ordered FFN layout yields a shared nested expert weight space\. This nesting property is crucial for learning a coherent family of budget\-specific action masks, enabling one\-for\-all recovery fine\-tuning across subnetworks and online budget switching\.
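To make the relation between training-time masking and deployment-time slicing concrete, the sketch below checks that applying the prefix mask of Eq. (4) gives the same output as physically dropping the masked channels. It is a toy illustration under assumed sizes (`d_ff = 8`, `d_model = 4`) with random weights standing in for an already ranked expert. The function names are hypothetical.

```python
import math
import random


def silu(x):
    return x / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def prefix_mask(r, d_ff):
    """m(r): first k = ceil(r * d_ff) entries are one, the rest are zero."""
    k = math.ceil(r * d_ff)
    return [1.0] * k + [0.0] * (d_ff - k)


def masked_ffn(w_gate, w_up, w_down, h, r):
    """Training-time view (Eq. 4): keep all weights, zero the masked channels."""
    m = prefix_mask(r, len(w_gate))
    hidden = [mj * silu(dot(g, h)) * dot(u, h) for mj, g, u in zip(m, w_gate, w_up)]
    return [dot(row, hidden) for row in w_down]


def sliced_ffn(w_gate, w_up, w_down, h, r):
    """Deployment-time view: physically keep only the first k channels."""
    k = math.ceil(r * len(w_gate))
    hidden = [silu(dot(g, h)) * dot(u, h) for g, u in zip(w_gate[:k], w_up[:k])]
    return [dot(row[:k], hidden) for row in w_down]


rng = random.Random(1)
d_model, d_ff = 4, 8  # toy sizes, chosen only for illustration
w_gate = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_ff)]
w_up = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_ff)]
w_down = [[rng.uniform(-1, 1) for _ in range(d_ff)] for _ in range(d_model)]
h = [rng.uniform(-1, 1) for _ in range(d_model)]

gaps = {}
for r in (0.1, 0.4, 0.7, 1.0):
    a = masked_ffn(w_gate, w_up, w_down, h, r)
    b = sliced_ffn(w_gate, w_up, w_down, h, r)
    gaps[r] = max(abs(x - y) for x, y in zip(a, b))
print(gaps)
```

Because every retention ratio selects a prefix of the same ordered channels, a smaller subnetwork is always contained in a larger one, which is the nesting property the text relies on.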

### 3.2 Action Mask Learning

#### Problem Formulation\.

Given the importance-ordered experts, we learn a token-independent discrete slice action for each expert. Let the action set be

$$\mathcal{A}=\{r_{1},\dots,r_{K}\},\tag{5}$$

where each $r_{k}\in(0,1]$ is a predefined channel retention ratio, e.g., $0.1$, $0.4$, $0.7$, or $1.0$ (fully retained). For every expert $(l,e)$, we maintain trainable action logits $\alpha_{l,e}\in\mathbb{R}^{K}$. To sample subnetworks with one-hot actions while preserving gradients for training the action logits, we use Straight-Through Gumbel-Softmax (Jang et al., [2017](https://arxiv.org/html/2606.27866#bib.bib17); Fang et al., [2024](https://arxiv.org/html/2606.27866#bib.bib15)). In detail, the relaxed action distribution $z^{\mathrm{soft}}_{l,e}$ and the hard sampled action $z^{\mathrm{hard}}_{l,e}$ are

$$z^{\mathrm{soft}}_{l,e}=\mathrm{softmax}\!\left(\frac{\alpha_{l,e}+g_{l,e}}{\tau}\right),\qquad z^{\mathrm{hard}}_{l,e}=\mathrm{one\_hot}\!\left(\arg\max_{k}z^{\mathrm{soft}}_{l,e,k}\right),\tag{6}$$

where the noise terms $g_{l,e,k}\sim\mathrm{Gumbel}(0,1)$ are added to the action logits, and a linearly annealed temperature $\tau$ controls the impact of the Gumbel noise, promoting early-stage action exploration and subnet sampling. The forward pass of the sampled expert subnet uses the straight-through estimator

$$\tilde{z}_{l,e}=\mathrm{sg}\!\left[z^{\mathrm{hard}}_{l,e}-z^{\mathrm{soft}}_{l,e}\right]+z^{\mathrm{soft}}_{l,e},\tag{7}$$

where $\mathrm{sg}(\cdot)$ denotes stop-gradient. $\tilde{z}_{l,e}$ ensures that actions are discrete in the forward pass but differentiable in the backward pass. During training, all straight-through estimators jointly define a sampled action mask, which is applied to each expert and produces a nested subnet of the full MoE model:

$$\mathcal{M}=\{\tilde{z}_{l,e}\}_{l=1,e=1}^{L,E}.\tag{8}$$
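The sampling in Eqs. (6) and (7) can be sketched in plain Python for one expert. This shows only forward values; the stop-gradient in Eq. (7) matters only inside an autograd framework, where the backward pass would flow through `z_soft`. The temperature endpoints and step count in `linear_anneal` are hypothetical, since the section does not state them.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [v / total for v in exps]


def sample_gumbel(rng):
    # Clamp the uniform sample away from 0 and 1 so both logs stay finite.
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def linear_anneal(step, total_steps, start, end):
    """Linear schedule from `start` to `end` (endpoint values are assumptions)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac


def gumbel_softmax_action(alpha, tau, rng):
    """Eq. (6): relaxed distribution z_soft and its one-hot argmax z_hard."""
    noisy = [(a + sample_gumbel(rng)) / tau for a in alpha]
    z_soft = softmax(noisy)
    k = max(range(len(z_soft)), key=z_soft.__getitem__)
    z_hard = [1.0 if i == k else 0.0 for i in range(len(z_soft))]
    return z_soft, z_hard


ratios = [0.1, 0.4, 0.7, 1.0]   # action set from the text
alpha = [0.0, 0.0, 0.0, 2.0]    # toy logits biased toward the full-width action
rng = random.Random(0)
tau = linear_anneal(step=10, total_steps=100, start=1.0, end=0.1)

z_soft, z_hard = gumbel_softmax_action(alpha, tau, rng)
# Eq. (7), forward value only: (z_hard - z_soft) is treated as a constant, so the
# forward result equals z_hard while gradients would pass through z_soft.
z_tilde = [(zh - zs) + zs for zh, zs in zip(z_hard, z_soft)]
sampled_ratio = sum(z * r for z, r in zip(z_hard, ratios))
print(z_soft, z_hard, sampled_ratio)
```

In a framework with automatic differentiation, the same construction is usually written with a detach or stop-gradient call around `z_hard - z_soft`.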
We optimize these action logits on a calibration dataset with a quality–cost objective:

$$\mathcal{L}_{\mathrm{action}}=\mathcal{L}_{\mathrm{qual}}+\lambda_{\mathrm{cost}}(t)\,\mathcal{C}(p,q)-\beta(t)\,\mathcal{H}(p).\tag{9}$$

The first term, $\mathcal{L}_{\mathrm{qual}}$, is a teacher-guided quality-preservation term that keeps sampled subnetworks (student) close to the full model (teacher); in our implementation it consists of the LLM cross-entropy loss on the sampled student subnet plus a teacher–student KL divergence term. For the cost objective, $\mathcal{C}(p,q)$ is the expected load-sensitive compute cost of the sampled subnet,

$$\mathcal{C}(p,q)=\sum_{l=1}^{L}\sum_{e=1}^{E}q_{l,e}\sum_{k=1}^{K}p_{l,e,k}\,r_{k},\tag{10}$$

where $p_{l,e}=\mathrm{softmax}(\alpha_{l,e})$ denotes the currently learned clean action distribution, without Gumbel sampling noise or $\tau$ scaling, and $q_{l,e}$ is the fraction of tokens assigned to expert $(l,e)$ within its MoE layer (the expert load ratio). Introducing $q_{l,e}$ makes the optimization load-sensitive: assigning thicker actions to highly routed experts preserves task accuracy but also brings more computation and incurs larger cost penalties, so the actions must learn an expert-wise accuracy–efficiency trade-off. We further include an entropy regularizer $\mathcal{H}(p)$ computed from the clean action probabilities $p_{l,e}$. It is the categorical entropy of each expert's action distribution, $-\sum_{k}p_{l,e,k}\log p_{l,e,k}$, averaged across layers and experts. This term encourages exploration, prevents premature collapse to a single action, and helps the model discover more reliable MoE subnets. The entropy weight $\beta(t)$ is also annealed linearly.
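As a worked example of Eqs. (9) and (10), the sketch below evaluates the cost and entropy terms for one layer with two experts. The probabilities, load ratios, quality loss, and weights `lam_cost` and `beta` are made-up illustrative values, not settings from the paper.

```python
import math


def expected_cost(p, q, ratios):
    """Eq. (10): load-weighted expected retention ratio, summed over layers and experts."""
    return sum(
        q[l][e] * sum(pk * rk for pk, rk in zip(p[l][e], ratios))
        for l in range(len(p))
        for e in range(len(p[l]))
    )


def mean_entropy(p):
    """H(p): categorical entropy per expert, averaged over all layers and experts."""
    entropies = [
        -sum(pk * math.log(pk) for pk in dist if pk > 0.0)
        for layer in p
        for dist in layer
    ]
    return sum(entropies) / len(entropies)


def action_loss(l_qual, p, q, ratios, lam_cost, beta):
    """Eq. (9): quality term plus weighted cost minus weighted entropy."""
    return l_qual + lam_cost * expected_cost(p, q, ratios) - beta * mean_entropy(p)


ratios = [0.1, 0.4, 0.7, 1.0]
# One layer, two experts: expert 0 is certain about full width, expert 1 is uniform.
p = [[[0.0, 0.0, 0.0, 1.0], [0.25, 0.25, 0.25, 0.25]]]
# Assumed load ratios: expert 0 receives 75% of the routed tokens.
q = [[0.75, 0.25]]

cost = expected_cost(p, q, ratios)   # 0.75 * 1.0 + 0.25 * 0.55 = 0.8875
entropy = mean_entropy(p)            # (0 + ln 4) / 2 = ln 2
loss = action_loss(l_qual=2.0, p=p, q=q, ratios=ratios, lam_cost=1.0, beta=0.1)
print(cost, entropy, loss)
```

The heavily loaded expert dominates the cost, so raising `lam_cost` puts the most pressure on exactly the experts that are expensive at serving time.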

Before action optimization and pruning begin, the action logits are initialized to favor the thickest action (full model retained). During optimization, we gradually increase $\lambda_{\mathrm{cost}}(t)$ so that the actions move from weaker to stronger channel pruning, progressively pushing experts toward thinner widths. At any training checkpoint, the hardened action $\hat{z}_{l,e}$ induces the assigned channel retention ratio $\hat{r}_{l,e}$ for expert $(l,e)$,

$$\hat{z}_{l,e}=\mathrm{one\_hot}\!\left(\arg\max_{k}p_{l,e,k}\right),\qquad\hat{r}_{l,e}=\sum_{k=1}^{K}\hat{z}_{l,e,k}\,r_{k},\tag{11}$$

which together define the currently trained global action mask and its corresponding prune budget

$$\hat{\mathcal{M}}=\{\hat{r}_{l,e}\}_{l=1,e=1}^{L,E},\qquad\hat{\rho}=1-\frac{1}{LE}\sum_{l=1}^{L}\sum_{e=1}^{E}\hat{r}_{l,e}.\tag{12}$$

Here, $\hat{\mathcal{M}}$ specifies the selected retention ratio (action) of every expert, while $\hat{\rho}$ gives the overall prune budget (the percentage of pruned expert parameters) of the MoE subnet produced by $\hat{\mathcal{M}}$. In practice, we may sample multiple subnet actions on the same batch and average $\mathcal{L}_{\mathrm{qual}}$ to reduce variance. Finally, since each action mask $\hat{\mathcal{M}}$ identifies a materializable MoE subnet nested in the full model, saving only these action masks along the training trajectory yields a sequence of budget-specific pruned MoE subnets from the full model in a single action-learning run.
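A small sketch of Eqs. (11) and (12): hardening the clean action distributions into per-expert retention ratios and computing the resulting prune budget. The probabilities are toy values for two layers with two experts each, and the helper names are hypothetical.

```python
def harden(p, ratios):
    """Eq. (11): pick the argmax action of each expert and map it to its retention ratio."""
    return [
        [ratios[max(range(len(dist)), key=dist.__getitem__)] for dist in layer]
        for layer in p
    ]


def prune_budget(r_hat):
    """Eq. (12): one minus the mean retention ratio over all experts."""
    flat = [r for layer in r_hat for r in layer]
    return 1.0 - sum(flat) / len(flat)


ratios = [0.1, 0.4, 0.7, 1.0]
p = [
    [[0.05, 0.05, 0.10, 0.80], [0.10, 0.60, 0.20, 0.10]],  # layer 1
    [[0.70, 0.10, 0.10, 0.10], [0.10, 0.20, 0.60, 0.10]],  # layer 2
]

mask = harden(p, ratios)      # [[1.0, 0.4], [0.1, 0.7]]
rho = prune_budget(mask)      # 1 - 2.2 / 4 = 0.45
print(mask, rho)
```

The uniform average in Eq. (12) matches the fraction of pruned expert parameters when all experts share the same width, up to the rounding introduced by the ceiling in $k=\lceil r\,d_{\mathrm{ff}}\rceil$.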

#### Clip FFN Forward Kernel Co\-Design\.

When deploying the nested pruned MoE models, we found that a naive Python implementation of online FFN channel prefix slicing degrades throughput for two reasons. First, learning one retention ratio per expert creates many expert FFNs with different effective widths at runtime, which conflicts with the GPU's preference for large, shape-regular GEMMs. Practical MoE inference frameworks usually improve utilization further by batching multiple experts into one batched GEMM, but in our approach this execution pattern largely degenerates into per-expert single GEMMs. This also prompted us to use a discrete action set rather than continuous retention ratios, which reduces the number of misaligned shapes. Second, standard MoE experts store and compute the gate and up projections as one concatenated gate-up weight matrix. Under this layout, obtaining the concatenated gate-up matrix of a sliced expert online requires two extra slice operations and one concatenation whose working set is the entire gate-up weight matrix, while transmitting and applying action masks introduces additional host-side scheduling overhead. This bottleneck is a direct consequence of combining nested prefix slicing of weights with online budget-conditioned inference, and does not arise when running static pruned subnet checkpoints. To mitigate these overheads, we implement a customized kernel and explore the potential of online runtime budget adjustment in FlexMoE. We first bucket routed experts by retained width, align each width upward to a hardware-friendly size, and invoke cuBLAS batched GEMMs per bucket to reduce the fragmented small-shape per-expert execution. We further store gate-up weights in an interleaved layout, so that a single prefix slice over the interleaved gate-up tensor yields all required weights instead of two separate slices. The batched GEMM can then directly produce packed gate-up outputs, which are consumed by a fused kernel that reads the interleaved gate-up activation and computes the gated outputs, shrinking the concatenation working set from the entire gate-up weights to the much smaller activation. See Appendix [C](https://arxiv.org/html/2606.27866#A3) for more details.
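The two host-side ideas in this paragraph, width bucketing and the interleaved gate-up layout, can be sketched in plain Python. This is not the paper's kernel: the alignment of 64, the toy width `d_ff = 1000`, the row-wise interleaving granularity, and all function names are assumptions made for illustration.

```python
import math
from collections import defaultdict


def aligned_width(r, d_ff, align=64):
    """Retained width ceil(r * d_ff), rounded up to a multiple of `align`, capped at d_ff."""
    k = math.ceil(r * d_ff)
    return min(d_ff, math.ceil(k / align) * align)


def bucket_experts(ratios_per_expert, d_ff, align=64):
    """Group expert indices by aligned width so each bucket can share one batched GEMM."""
    buckets = defaultdict(list)
    for expert, r in enumerate(ratios_per_expert):
        buckets[aligned_width(r, d_ff, align)].append(expert)
    return dict(buckets)


def interleave(w_gate, w_up):
    """Store rows as gate_0, up_0, gate_1, up_1, ... (assumed interleaving granularity)."""
    rows = []
    for g, u in zip(w_gate, w_up):
        rows.append(g)
        rows.append(u)
    return rows


# Hypothetical per-expert retention ratios for one layer with eight experts.
layer_ratios = [1.0, 0.4, 0.4, 0.1, 0.7, 0.4, 1.0, 0.1]
buckets = bucket_experts(layer_ratios, d_ff=1000)
print(buckets)

# Row labels stand in for weight rows of a ranked expert.
w_gate = ["g0", "g1", "g2", "g3"]
w_up = ["u0", "u1", "u2", "u3"]
packed = interleave(w_gate, w_up)

k = 2
prefix = packed[: 2 * k]          # one contiguous slice covers both projections
gate_k, up_k = prefix[0::2], prefix[1::2]
print(prefix, gate_k, up_k)
```

With the concatenated layout, the same top-`k` selection would need one slice from the gate half, one from the up half, and a concatenation, which is the overhead the interleaved layout is meant to avoid.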

### 3.3 Recovery Fine-Tuning

Pure parameter pruning can still degrade language modeling quality, so FlexMoE is paired with an optional single recovery fine-tuning stage. Instead of recovering each pruned MoE subnetwork separately, we choose one mid-budget action mask $\mathcal{M}^{\mathrm{mid}}$ and fine-tune only that masked model. Concretely, we freeze the channel-ranked base weights $W_{0}$ and attach LoRA adapters (Hu et al., [2022](https://arxiv.org/html/2606.27866#bib.bib18)) to the expert FFNs enabled by the action mask, training only the adapter parameters $\varphi$.

Under a fixed mid-budget mask $\mathcal{M}^{\mathrm{mid}}$, we view the student model as the sum of a masked full-model branch and a LoRA branch,

$$f_{\mathrm{stu}}(x;\mathcal{M}^{\mathrm{mid}},W_{0},\varphi)=f_{\mathrm{base}}(x;\mathcal{M}^{\mathrm{mid}},W_{0})+f_{\mathrm{lora}}(x;\mathcal{M}^{\mathrm{mid}},\Delta W(\varphi)).\tag{13}$$

We optimize the LoRA parameters with a task loss plus a teacher–student distillation term:

$$\mathbb{E}_{(x,y)\sim\mathcal{D}_{\mathrm{rec}}}\Big[\lambda_{\mathrm{task}}\,\mathrm{CE}\big(f_{\mathrm{stu}}(x;\mathcal{M}^{\mathrm{mid}},W_{0},\varphi),y\big)+\lambda_{\mathrm{kl}}\,\mathrm{KL}\big(f_{\mathrm{tea}}(x;W_{0})\;\|\;f_{\mathrm{stu}}(x;\mathcal{M}^{\mathrm{mid}},W_{0},\varphi)\big)\Big],\tag{14}$$

where the teacher $f_{\mathrm{tea}}(x;W_{0})$ is the in-place full base model without masking or LoRA adapters, and the student is the $\mathcal{M}^{\mathrm{mid}}$-masked, LoRA-augmented subnet. After training, we merge the learned adapters back into the ranked base weights to obtain $\overline{W}$, and reuse these recovered weights for every action mask $\mathcal{M}^{(b)}$ with budget index $b\in\{1,\dots,B\}$:

$$f^{(b)}(x)=f(x;\mathcal{M}^{(b)},\overline{W}),\qquad\mathcal{M}^{(b)}\in\{\mathcal{M}^{(1)},\dots,\mathcal{M}^{(B)}\}.\tag{15}$$

This extends the train-once, deploy-many property from action learning to recovery fine-tuning: one action-training run yields a series of nested subnetworks, and, because the router and expert topology stay frozen and the nesting is preserved, one mid-budget recovery yields a single shared recovered full model that can be used with all masks at both lower and higher budgets.
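For a single token position, the integrand of Eq. (14) reduces to a cross-entropy term on the student plus a KL term from teacher to student. The sketch below computes it from raw logits. The logits and the weights `lam_task` and `lam_kl` are hypothetical values, and batching, masking, and the LoRA parameterization are left out.

```python
import math


def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def recovery_loss(student_logits, teacher_logits, target, lam_task=1.0, lam_kl=1.0):
    """Per-token integrand of Eq. (14): lam_task * CE(student, y) + lam_kl * KL(teacher || student)."""
    log_s = log_softmax(student_logits)
    log_t = log_softmax(teacher_logits)
    ce = -log_s[target]
    kl = sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_t, log_s))
    return lam_task * ce + lam_kl * kl, ce, kl


teacher = [2.0, 1.0, 0.1, -1.0]   # full model without masking (toy logits)
student = [1.5, 1.2, 0.0, -0.5]   # masked, adapter-augmented subnet (toy logits)

loss, ce, kl = recovery_loss(student, teacher, target=0)
same_loss, same_ce, same_kl = recovery_loss(teacher, teacher, target=0)
uniform_loss, uniform_ce, _ = recovery_loss([0.0, 0.0, 0.0, 0.0], teacher, target=0)
print(loss, ce, kl)
```

When the student matches the teacher exactly, the KL term vanishes and only the task loss remains, so the distillation term penalizes only the drift introduced by pruning.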

## 4 Experiments

### 4.1 Experimental Setup

#### Implementation Details and Evaluation Tasks\.

We use three pretrained MoE LLMs: Mixtral-8x7B (Jiang et al., [2024](https://arxiv.org/html/2606.27866#bib.bib3)) as the main model, and Phi-3.5-MoE (Abdin et al., [2024](https://arxiv.org/html/2606.27866#bib.bib46)) and Qwen2-57B-A14B (Yang et al., [2024a](https://arxiv.org/html/2606.27866#bib.bib47)) as cross-model validation with different MoE architectures and increasing sparsity. We define the discrete action set as $\mathcal{A}=\{0.1,0.4,0.7,1\}$, and throughout this section the pruning ratio (budget) is defined by the global prune budget $\hat{\rho}$ in Eq. ([12](https://arxiv.org/html/2606.27866#S3.E12)) of an action mask applied to the ranked full MoE model. During action learning, we export one action mask whenever $\hat{\rho}$ increases by roughly $1\%$. For recovery fine-tuning, we choose the $40\%$ budget action mask as the recovery point. The channel importance ranking, action learning, and recovery fine-tuning stages all use Zyda-2 (Tokpanov et al., [2024](https://arxiv.org/html/2606.27866#bib.bib30)) as the calibration dataset. To evaluate pruning quality, we report zero-shot accuracy on seven widely used reasoning benchmarks implemented with lm-eval-harness: ARC-Challenge, ARC-Easy, HellaSwag, OpenBookQA, PIQA, WinoGrande, and MathQA (Gao et al., [2021](https://arxiv.org/html/2606.27866#bib.bib23); Clark et al., [2018](https://arxiv.org/html/2606.27866#bib.bib24); Zellers et al., [2019](https://arxiv.org/html/2606.27866#bib.bib25); Mihaylov et al., [2018](https://arxiv.org/html/2606.27866#bib.bib26); Bisk et al., [2020](https://arxiv.org/html/2606.27866#bib.bib27); Sakaguchi et al., [2020](https://arxiv.org/html/2606.27866#bib.bib28); Amini et al., [2019](https://arxiv.org/html/2606.27866#bib.bib29)). Action learning and recovery fine-tuning were run on 2× NVIDIA H200 GPUs for convenience. Importance ranking and all other experiments were run on a single NVIDIA H200.

#### Baselines\.

We fix the baseline family throughout the main comparison\. Our primary baselines are MoE\-SVD and TD\-MoE, since both are recent, strong compression methods for pretrained MoE models that also preserve the router and expert topology \(Li et al\., [2025](https://arxiv.org/html/2606.27866#bib.bib7); Xu et al\., [2026](https://arxiv.org/html/2606.27866#bib.bib8)\)\. We additionally report NAEE and MoE\-I2 as broader references, representing expert\-level pruning/skipping and mixed inter\-/intra\-expert compression, respectively \(Lu et al\., [2024](https://arxiv.org/html/2606.27866#bib.bib4); Yang et al\., [2024b](https://arxiv.org/html/2606.27866#bib.bib5)\)\. To avoid selective reporting, we use a shared comparison grid centered on 20%, 40%, and 60% prune budgets, and include each baseline whenever a result for the corresponding model–budget pair is publicly available, whether in the original paper, its appendix, or its public OpenReview revision/author response\. When a baseline result is unavailable for a given model–budget pair, we mark it as N/A\.

### 4\.2 Main Results and Analysis

| Pruned ratio | Mixtral\-8x7B method | ARC\-c | ARC\-e | HellaS | OBQA | PIQA | WinoG | MathQA | Avg | Phi\-3\.5\-MoE method | Avg | Qwen2\-57B\-A14B method | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0% | Base model | 57 | 84 | 65 | 36 | 82 | 76 | 43 | 63\.29 | Base model | 62\.00 | Base model | 58\.71 |
| 20% | NAEE [2024](https://arxiv.org/html/2606.27866#bib.bib4) | 47 | 76 | 58 | 32 | 79 | 72 | 40 | 57\.71 | NAEE [2024](https://arxiv.org/html/2606.27866#bib.bib4) | N/A | NAEE [2024](https://arxiv.org/html/2606.27866#bib.bib4) | 55\.86 |
| 20% | MoE\-I2 [2024b](https://arxiv.org/html/2606.27866#bib.bib5) | 48 | 79 | 55 | 32 | 78 | 74 | 37 | 57\.57 | MoE\-I2 [2024b](https://arxiv.org/html/2606.27866#bib.bib5) | N/A | MoE\-I2 [2024b](https://arxiv.org/html/2606.27866#bib.bib5) | N/A |
| 20% | MoE\-SVD \(fine\-tuned\) [2025](https://arxiv.org/html/2606.27866#bib.bib7) | 55 | 80 | 61 | 33 | 81 | 73 | 38 | 60\.14 | MoE\-SVD \(fine\-tuned\) [2025](https://arxiv.org/html/2606.27866#bib.bib7) | 61\.14 | MoE\-SVD [2025](https://arxiv.org/html/2606.27866#bib.bib7) | 56\.57 |
| 20% | TD\-MoE [2026](https://arxiv.org/html/2606.27866#bib.bib8) | 53 | 83 | 64 | 33 | 82 | 77 | 40 | 61\.71 | TD\-MoE [2026](https://arxiv.org/html/2606.27866#bib.bib8) | 61\.00 | TD\-MoE [2026](https://arxiv.org/html/2606.27866#bib.bib8) | 58\.29 |
| 20% | FlexMoE \(Ours\) | 54 | 82 | 64 | 35 | 81 | 77 | 40 | 61\.86 | FlexMoE \(Ours\) | 63\.29 | FlexMoE \(Ours\)∗ | 58\.43 |
| 40% | NAEE [2024](https://arxiv.org/html/2606.27866#bib.bib4) | 36 | 63 | 46 | 25 | 72 | 64 | 35 | 48\.71 | NAEE [2024](https://arxiv.org/html/2606.27866#bib.bib4) | 57\.57 | NAEE [2024](https://arxiv.org/html/2606.27866#bib.bib4) | 53\.14 |
| 40% | MoE\-I2 \(P\+F\) [2024b](https://arxiv.org/html/2606.27866#bib.bib5) | 38 | 71 | 43 | 26 | 69 | 66 | 31 | 49\.14 | MoE\-I2 [2024b](https://arxiv.org/html/2606.27866#bib.bib5) | 45\.29 | MoE\-I2 [2024b](https://arxiv.org/html/2606.27866#bib.bib5) | N/A |
| 40% | MoE\-SVD [2025](https://arxiv.org/html/2606.27866#bib.bib7) | 38 | 72 | 43 | 27 | 71 | 67 | 32 | 50\.00 | MoE\-SVD [2025](https://arxiv.org/html/2606.27866#bib.bib7) | 55\.86 | MoE\-SVD [2025](https://arxiv.org/html/2606.27866#bib.bib7) | 48\.14 |
| 40% | TD\-MoE [2026](https://arxiv.org/html/2606.27866#bib.bib8) | 47 | 77 | 57 | 28 | 79 | 76 | 35 | 57\.00 | TD\-MoE [2026](https://arxiv.org/html/2606.27866#bib.bib8) | 57\.86 | TD\-MoE [2026](https://arxiv.org/html/2606.27866#bib.bib8) | 55\.57 |
| 40% | FlexMoE \(Ours\) | 49 | 77 | 60 | 33 | 80 | 73 | 34 | 58\.00 | FlexMoE \(Ours\) | 59\.43 | FlexMoE \(Ours\)∗ | 58\.86 |
| 60% | NAEE [2024](https://arxiv.org/html/2606.27866#bib.bib4) | 23 | 42 | 33 | 17 | 62 | 55 | 26 | 36\.86 | NAEE [2024](https://arxiv.org/html/2606.27866#bib.bib4) | N/A | NAEE [2024](https://arxiv.org/html/2606.27866#bib.bib4) | 44\.00 |
| 60% | MoE\-I2 [2024b](https://arxiv.org/html/2606.27866#bib.bib5) | 22 | 44 | 32 | 18 | 58 | 55 | 23 | 36\.00 | MoE\-I2 [2024b](https://arxiv.org/html/2606.27866#bib.bib5) | N/A | MoE\-I2 [2024b](https://arxiv.org/html/2606.27866#bib.bib5) | N/A |
| 60% | MoE\-SVD [2025](https://arxiv.org/html/2606.27866#bib.bib7) | 23 | 45 | 33 | 19 | 62 | 55 | 25 | 37\.43 | MoE\-SVD [2025](https://arxiv.org/html/2606.27866#bib.bib7) | 48\.57 | MoE\-SVD [2025](https://arxiv.org/html/2606.27866#bib.bib7) | 46\.86 |
| 60% | TD\-MoE [2026](https://arxiv.org/html/2606.27866#bib.bib8) | 28 | 55 | 38 | 21 | 65 | 62 | 24 | 41\.86 | TD\-MoE [2026](https://arxiv.org/html/2606.27866#bib.bib8) | 49\.86 | TD\-MoE [2026](https://arxiv.org/html/2606.27866#bib.bib8) | 51\.57 |
| 60% | FlexMoE \(Ours\) | 35 | 65 | 51 | 23 | 74 | 71 | 28 | 49\.57 | FlexMoE \(Ours\) | 53\.00 | FlexMoE \(Ours\)∗ | 58\.57 |

Table 1: Task accuracy results for FlexMoE across models\. All action masks applied at these budgets were exported from the same action\-training run for the corresponding model\. We apply single\-point recovery fine\-tuning at the 40% pruning budget; the 20% and 60% results are obtained by directly reusing the same recovered model without extra fine\-tuning\. For Qwen2\-57B\-A14B, ∗ denotes that no fine\-tuning stage is applied: results are obtained by directly applying the learned action mask to the channel\-ranked base model\. Full results on Phi\-3\.5\-MoE and Qwen2\-57B\-A14B are in Appendix Table 3\. All numbers are zero\-shot accuracy \(%\)\.

#### Task Accuracy Results Analysis\.

Across models, the main practical pattern is consistent\. On Mixtral\-8x7B, the recovered model achieves the best average score at the 40% recovery point, and the same recovered weights remain the strongest when transferred to 20% and 60%\. On Phi\-3\.5\-MoE, our recovered model also achieves the best average score at all three budgets\. On Qwen2\-57B\-A14B, FlexMoE outperforms all baselines even without recovery fine\-tuning\. Given this strong performance on Qwen2, we additionally evaluate it at 50% and 80% prune budgets\. Without any recovery fine\-tuning, it remains nearly lossless at a $50\%$ prune budget \($\sim 99.8\%$ of the base average score\), and even at an $80\%$ prune budget it retains about $92.9\%$ of the base average score \(see Appendix Table 3\)\. These results demonstrate that FlexMoE preserves model quality well under pruning\.
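The retention figures above can be checked from the per\-task Qwen2\-57B\-A14B accuracies reported in Appendix Table 3\. The short script below recomputes the averages and the fraction of the base average that is retained; the function names are illustrative only\.

```python
def average(scores):
    """Mean zero-shot accuracy over the seven benchmarks."""
    return sum(scores) / len(scores)


def retention(pruned, base):
    """Pruned average as a percentage of the base average."""
    return 100.0 * average(pruned) / average(base)


# Per-task accuracies (ARC-c, ARC-e, HellaS, OBQA, PIQA, WinoG, MathQA)
# copied from the Qwen2-57B-A14B rows of Appendix Table 3.
qwen2_base = [47, 75, 63, 33, 80, 74, 39]
qwen2_50 = [47, 76, 61, 33, 80, 73, 40]
qwen2_80 = [44, 73, 54, 29, 77, 73, 32]

for name, scores in [("base", qwen2_base), ("50%", qwen2_50), ("80%", qwen2_80)]:
    print(f"{name}: avg={average(scores):.2f}, "
          f"retained={retention(scores, qwen2_base):.1f}%")
```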

#### Effectiveness of Action Learning\.

Figure [3](https://arxiv.org/html/2606.27866#S4.F3) shows that the learned action distributions are clearly non\-uniform across depth, indicating that FlexMoE does not merely recover a global prune ratio, but learns a structured budget allocation\. Moreover, the learned profiles are model\-dependent: Mixtral\-8x7B and Phi\-3\.5\-MoE retain relatively thicker actions in earlier MoE layers, which is consistent with prior observations that MoE layers can differ substantially in compression sensitivity, with early MoE blocks often requiring more capacity or precision \(Li et al\., [2024](https://arxiv.org/html/2606.27866#bib.bib33); Bai et al\., [2025](https://arxiv.org/html/2606.27866#bib.bib9)\)\. In contrast, Qwen2\-57B\-A14B uses a finer\-grained routed\-expert design together with a shared\-expert architecture, so preserving large routed\-expert width in the earliest layers is less critical\. This suggests that the action learner adapts to model\-specific expert redundancy patterns\. A second trend is that the layer\-wise action distribution becomes more uniform on sparser MoE backbones with more experts, as reflected by the progressively weaker colors on Phi and the more concentrated probability values on Qwen2\. We interpret this as evidence that finer\-grained expert architectures with fewer channels have stronger within\-layer substitutability, so the quality gap between experts and actions is smaller and the budget can be distributed more evenly\. These interpretations are further supported by the structure\-destruction ablation in Appendix [B\.2](https://arxiv.org/html/2606.27866#A2.SS2), where layer\-wise and globally shuffled actions consistently yield worse results, confirming that our proposed expert\-wise action learning captures meaningful architecture\-dependent structure rather than random budget convergence and allocation\.
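To make the shuffling ablation concrete, the sketch below shows one plausible way to build the two shuffled controls from a learned action table\. It assumes actions are stored as one list per layer with one entry per expert; the function names and the toy table are hypothetical, and the exact protocol in Appendix B\.2 may differ\.

```python
import random


def shuffle_layerwise(actions, seed=0):
    """Permute actions among the experts of each layer independently."""
    rng = random.Random(seed)
    shuffled = []
    for layer in actions:
        layer_copy = list(layer)
        rng.shuffle(layer_copy)
        shuffled.append(layer_copy)
    return shuffled


def shuffle_globally(actions, seed=0):
    """Permute actions across all experts in all layers."""
    rng = random.Random(seed)
    flat = [a for layer in actions for a in layer]
    rng.shuffle(flat)
    shuffled, start = [], 0
    for layer in actions:
        shuffled.append(flat[start:start + len(layer)])
        start += len(layer)
    return shuffled


# Hypothetical learned table: 3 layers x 4 experts, thicker actions early.
learned = [
    [1.0, 1.0, 0.7, 0.7],
    [0.7, 0.4, 0.4, 0.4],
    [0.4, 0.4, 0.1, 0.1],
]
layerwise = shuffle_layerwise(learned)
globally = shuffle_globally(learned)
```

Under the equal\-width assumption, both controls keep the same multiset of actions and therefore the same overall budget; the layer\-wise variant additionally keeps each layer's budget, so any quality drop isolates the effect of which expert receives which action\.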

![Refer to caption](https://arxiv.org/html/2606.27866v1/figures/cross-model-heatmap.png)Figure 3: Layer\-wise action distributions learned by FlexMoE at the 40% prune budget\. A lighter cell color indicates a higher fraction of experts in that layer assigned to the corresponding action\.
#### Comparison of Fine\-Tuning Strategies\.

![Refer to caption](https://arxiv.org/html/2606.27866v1/figures/ft-method-compare.png)Figure 4: Average downstream accuracy of different recovery strategies on Mixtral\-8x7B across prune budgets\.

Table 2: Offline clipping throughput results under SGLang\.

To support more flexible deployment across budgets, the recovered model should be robust under multiple pruning action masks\. As detailed in Appendix [B\.3](https://arxiv.org/html/2606.27866#A2.SS3), our first attempt followed AmoebaLLM\-style cross\-budget fine\-tuning \(CP\-FT\), which applies sandwich sampling and a division factor to balance the different distillation loss scales across budgets \(Yu and Huang, [2019](https://arxiv.org/html/2606.27866#bib.bib10); Fu et al\., [2024](https://arxiv.org/html/2606.27866#bib.bib13)\)\. In practice, however, Table 3 shows that our single\-point fine\-tuning strategy \(SP\-FT and SP\-XFER\) already provides a strong shared recovered model across budgets, outperforming CP\-FT at a lower training cost\. More intuitively, the blue curve in Figure [4](https://arxiv.org/html/2606.27866#S4.F4.1) shows that CP\-FT recovery underperforms both per\-budget fine\-tuning and our single\-point fine\-tuning over most budgets\. By contrast, the red curve \(single\-point\) stays much closer to the green curve \(per\-budget\); the two coincide at the 40% budget, where fine\-tuning is applied\. This shows that, in our setting, a single mid\-budget recovery point is already sufficient to produce a reusable shared recovered model, and its performance on transferred unseen budgets is still close to per\-budget fine\-tuning\. This makes the middle budget a particularly effective compromise: it does not fully optimize any one endpoint, but it provides the best trade\-off between quality and cross\-budget reuse\. In Appendix [B\.3](https://arxiv.org/html/2606.27866#A2.SS3), we provide more ablation results and detailed analysis supporting our proposed one\-step mid\-budget fine\-tuning strategy\.
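For readers unfamiliar with sandwich sampling, the following sketch shows the general idea from slimmable\-network training applied to prune budgets: each training step always includes the two extreme budgets plus a few randomly chosen intermediate ones\. The budget grid, the number of random samples, and the function name are assumptions for illustration; the division factor used to rescale per\-budget distillation losses is not modeled here\.

```python
import random


def sandwich_sample(budgets, num_random=2, rng=None):
    """Return the two extreme budgets plus `num_random` random intermediate ones."""
    if rng is None:
        rng = random.Random()
    ordered = sorted(set(budgets))
    if len(ordered) < 2:
        raise ValueError("need at least two distinct budgets")
    lowest, highest = ordered[0], ordered[-1]
    middle = ordered[1:-1]
    extra = rng.sample(middle, min(num_random, len(middle)))
    return [lowest, highest] + extra


# Hypothetical grid of prune budgets (in percent) used during joint fine-tuning.
budget_grid = [20, 25, 30, 35, 40, 45, 50, 55, 60]
step_budgets = sandwich_sample(budget_grid, num_random=2, rng=random.Random(0))
print(step_budgets)
```

Single\-point fine\-tuning corresponds to the degenerate case where every step uses only the 40% action mask, which is part of why its training cost is lower\.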

### 4\.3 Deployment Performance and Analysis

#### Experiment settings\.

Finally, we evaluate the deployment performance of the pruned budget family under a serving\-oriented runtime\. We use the SGLang engine \(Zheng et al\., [2024](https://arxiv.org/html/2606.27866#bib.bib40)\) on a single H200 GPU under a synthetic workload of 4096 prompt requests with input length $=64$ and output length $=256$\. SGLang is a serving\-oriented runtime designed for high\-throughput structured LLM execution, and this setup is intended to approximate a realistic serving regime with substantial concurrent traffic rather than a single\-request latency test\. Our primary metric is model throughput \(tok/s\), defined as the sum of the prefill \(input\) and decoding \(output\) throughput\. Results are averaged across multiple runs\.
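Under the stated workload, the token counts are fixed, so total throughput depends only on the wall\-clock time\. The sketch below assumes throughput is computed as total prefill plus decode tokens divided by elapsed time, which is one reading of the metric above; the elapsed time used here is a made\-up placeholder, not a measured result\.

```python
def workload_tokens(num_requests, input_len, output_len):
    """Total prefill and decode token counts for a fixed-length workload."""
    return num_requests * input_len, num_requests * output_len


def total_throughput(num_requests, input_len, output_len, elapsed_s):
    """Prefill plus decode tokens per second over one benchmark run."""
    prefill, decode = workload_tokens(num_requests, input_len, output_len)
    return (prefill + decode) / elapsed_s


# Workload from the experiment settings: 4096 requests, 64 in, 256 out.
prefill_tokens, decode_tokens = workload_tokens(4096, 64, 256)

# Placeholder elapsed time, chosen only to make the arithmetic concrete.
example_elapsed_s = 100.0
example_tok_per_s = total_throughput(4096, 64, 256, example_elapsed_s)
print(prefill_tokens, decode_tokens, example_tok_per_s)
```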

#### Offline Pruning Throughput\.

![Refer to caption](https://arxiv.org/html/2606.27866v1/figures/qwen2_bsz_trend.png)Figure 5: Qwen2\-57B\-A14B offline clipping throughput across batch sizes\.

To evaluate the static deployment value of FlexMoE, we first test *offline clipping*, where the pruned subnet is exported as a standalone fixed\-budget checkpoint and directly loaded by the serving runtime\. Table [2](https://arxiv.org/html/2606.27866#S4.T2) shows clear end\-to\-end throughput gains across all three models with substantially reduced checkpoint size, confirming that the learned budget family yields both lower memory cost and real deployment speedup\. We further analyze the impact of serving batch size on Qwen2 in Figure [5](https://arxiv.org/html/2606.27866#S4.F5.1)\. Throughput gains become much more visible as batch size increases, suggesting that the benefit of reduced expert FFN computation is mainly limited by scheduler\-level concurrency and is translated more effectively into throughput under higher GPU utilization\. Even when throughput gains are modest at small batch sizes, the smaller pruned checkpoints still improve deployment feasibility and increase the memory headroom available for caches or longer contexts\.

#### Toward Co\-Designed Online Budget Scheduling\.

Figure 6: Online clipping throughput\.

With clip\-FFN kernel co\-design, FlexMoE demonstrates the potential for online budget adjustment\. We implemented and tested another deployment scenario, *online clipping*: the server keeps the channel\-ranked full base model in memory, and the operator can adjust the online inference budget by specifying budget\-specific action masks at runtime\. This makes budget switching possible without unloading the current service or reloading another checkpoint\. Figure [6](https://arxiv.org/html/2606.27866#S4.F6) shows the effectiveness of our co\-designed kernel on Mixtral\-8x7B under this scenario\. With the customized kernel, Mixtral achieves real speedups over a naive Python implementation, demonstrating that the benefits of parameter reduction have begun to materialize and outweigh the operational overhead of online parameter pruning\. The kernel\-optimized path remains slower than the corresponding offline mode, which is expected because the overhead of frequent budget\-conditioned slicing and dispatch for action masks cannot be entirely removed in this scenario\. This is still a worthwhile trade\-off: online clipping sacrifices some peak efficiency, but gains the ability to switch budgets on the fly without service interruption or checkpoint reload\. We take this result as an exploratory study of algorithm–system co\-design for FlexMoE, toward more fine\-grained strategies for real\-time online budget adjustment\.
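The idea behind online clipping can be illustrated with a naive reference implementation\. The sketch below assumes a SwiGLU\-style expert FFN whose intermediate channels are already sorted by importance, so that applying an action amounts to using only the first `keep` channels of each projection\. All names and the tiny random weights are hypothetical; this is the kind of plain Python path that a co\-designed kernel would replace, not the kernel itself\.

```python
import math
import random


def silu(x):
    return x / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


def clipped_expert_forward(x, w_gate, w_up, w_down, keep):
    """Run the expert using only the first `keep` ranked channels."""
    hidden = [silu(dot(w_gate[c], x)) * dot(w_up[c], x) for c in range(keep)]
    return [dot(row[:keep], hidden) for row in w_down]


def masked_expert_forward(x, w_gate, w_up, w_down, keep):
    """Reference: compute every channel, then zero the pruned ones."""
    hidden = [silu(dot(g, x)) * dot(u, x) for g, u in zip(w_gate, w_up)]
    hidden = [h if c < keep else 0.0 for c, h in enumerate(hidden)]
    return [dot(row, hidden) for row in w_down]


# Tiny hypothetical expert: d_model=4, d_ff=10, weights stay in memory in full.
rng = random.Random(0)
d_model, d_ff = 4, 10
w_gate = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_ff)]
w_up = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_ff)]
w_down = [[rng.uniform(-1, 1) for _ in range(d_ff)] for _ in range(d_model)]
x = [rng.uniform(-1, 1) for _ in range(d_model)]

# Switching the budget only changes `keep`; no weights are reloaded.
outputs = {}
for action in (0.1, 0.4, 0.7, 1.0):
    keep = max(1, round(d_ff * action))
    outputs[action] = clipped_expert_forward(x, w_gate, w_up, w_down, keep)
```

Slicing a prefix gives the same result as masking the pruned channels but skips their computation, which is where the speedup comes from; the per\-call slicing in this naive form is also a source of the overhead discussed above\.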

## 5 Conclusion

We presented FlexMoE, a post\-training compression framework that converts a pretrained MoE LLM into a nested family of deployable subnetworks\. By ranking expert FFN channels by estimated importance and learning one discrete retention action per expert across budgets, our method obtains a series of reliable pruned subnetworks nested in a large pretrained MoE from a single action\-training run\. We further showed that a one\-step recovery fine\-tune at a single mid\-budget point is already sufficient to produce a shared recovered model that transfers well to other unseen budgets\. Experiments on various MoE models show that the proposed framework surpasses recent strong MoE compression baselines and becomes more effective on sparser MoE LLMs\. Finally, offline pruned subnetworks deliver real throughput gains at their fixed budget, and a kernelized co\-design makes runtime budget switching feasible\. We hope this work provides a new perspective on MoE model structure search and a practical foundation for budget\-adaptive MoE model deployment and inference\.

#### Limitations and Future Works\.

While our main focus is MoE structure search and static expert\-parameter pruning, the multi\-budget shared\-weight family produced by FlexMoE opens a promising direction for future work on stronger system\-level co\-design and dynamic budget\-adaptation strategies for deployment and online serving\.

## References

- M\. Abdin, J\. Aneja, H\. Awadalla, A\. Awadallah, A\. A\. Awan, N\. Bach, A\. Bahree, A\. Bakhtiari, J\. Bao, H\. Behl, et al\. \(2024\) Phi\-3 technical report: a highly capable language model locally on your phone\. arXiv preprint arXiv:2404\.14219\.
- A\. Amini, S\. Gabriel, S\. Lin, R\. Koncel\-Kedziorski, Y\. Choi, and H\. Hajishirzi \(2019\) MathQA: towards interpretable math word problem solving with operation\-based formalisms\. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 \(Long and Short Papers\), pp\. 2357–2367\. External Links: [Document](https://dx.doi.org/10.18653/v1/N19-1245)\.
- S\. Bai, H\. Li, J\. Zhang, Z\. Hong, and S\. Guo \(2025\) DiEP: adaptive mixture\-of\-experts compression through differentiable expert pruning\. In Advances in Neural Information Processing Systems\.
- Y\. Bisk, R\. Zellers, R\. Le Bras, J\. Gao, and Y\. Choi \(2020\) PIQA: reasoning about physical commonsense in natural language\. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol\. 34, pp\. 7432–7439\.
- R\. Cai, S\. Muralidharan, G\. Heinrich, H\. Yin, Z\. Wang, J\. Kautz, and P\. Molchanov \(2024\) Flextron: many\-in\-one flexible large language model\. In Proceedings of the 41st International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol\. 235, pp\. 5298–5311\.
- I\. Chen, H\. Liu, W\. Sun, C\. Chao, Y\. Hsu, and C\. Lee \(2024\) Retraining\-free merging of sparse mixture\-of\-experts via hierarchical clustering\. arXiv preprint arXiv:2410\.08589\.
- P\. Clark, I\. Cowhey, O\. Etzioni, T\. Khot, A\. Sabharwal, C\. Schoenick, and O\. Tafjord \(2018\) Think you have solved question answering? try ARC, the AI2 reasoning challenge\. arXiv preprint arXiv:1803\.05457\.
- Devvrit, S\. Kudugunta, A\. Kusupati, T\. Dettmers, K\. Chen, I\. Dhillon, Y\. Tsvetkov, H\. Hajishirzi, S\. Kakade, A\. Farhadi, and P\. Jain \(2024\) MatFormer: nested transformer for elastic inference\. In Advances in Neural Information Processing Systems\. External Links: [Document](https://dx.doi.org/10.5555/3737916.3742377)\.
- G\. Fang, H\. Yin, S\. Muralidharan, G\. Heinrich, J\. Pool, J\. Kautz, P\. Molchanov, and X\. Wang \(2024\) MaskLLM: learnable semi\-structured sparsity for large language models\. In Advances in Neural Information Processing Systems\.
- W\. Fedus, B\. Zoph, and N\. Shazeer \(2022\) Switch transformers: scaling to trillion parameter models with simple and efficient sparsity\. Journal of Machine Learning Research 23 \(120\), pp\. 1–39\.
- Y\. Fu, Z\. Yu, J\. Li, J\. Qian, Y\. Zhang, X\. Yuan, D\. Shi, R\. Yakunin, and Y\. C\. Lin \(2024\) AmoebaLLM: constructing any\-shape large language models for efficient and instant deployment\. In Advances in Neural Information Processing Systems\.
- L\. Gao, J\. Tow, S\. Biderman, S\. Black, A\. DiPofi, C\. Foster, L\. Golding, J\. Hsu, K\. McDonell, N\. Muennighoff, J\. Phang, L\. Reynolds, E\. Tang, A\. Thite, B\. Wang, K\. Wang, and A\. Zou \(2021\) A framework for few\-shot language model evaluation\. Zenodo\. External Links: [Document](https://dx.doi.org/10.5281/zenodo.5371628)\.
- S\. Gao, J\. Yin, F\. Wang, and W\. Dong \(2026\) FLYING serving: on\-the\-fly parallelism switching for large language model serving\. arXiv preprint arXiv:2602\.22593\.
- H\. Gu, W\. Li, L\. Li, Q\. Zhu, M\. G\. Lee, S\. Sun, W\. Xue, and Y\. Guo \(2025\) Delta decompression for MoE\-based LLMs compression\. In International Conference on Machine Learning\.
- E\. J\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, and W\. Chen \(2022\) LoRA: low\-rank adaptation of large language models\. In International Conference on Learning Representations\.
- F\. Ilhan, G\. Su, S\. F\. Tekin, T\. Huang, S\. Hu, and L\. Liu \(2024\) Resource\-efficient transformer pruning for finetuning of large models\. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp\. 16206–16215\.
- E\. Jang, S\. Gu, and B\. Poole \(2017\) Categorical reparameterization with gumbel\-softmax\. In International Conference on Learning Representations\.
- A\. Q\. Jiang, A\. Sablayrolles, A\. Roux, A\. Mensch, B\. Savary, C\. Bamford, D\. S\. Chaplot, D\. de las Casas, E\. Bou Hanna, F\. Bressand, et al\. \(2024\) Mixtral of experts\. arXiv preprint arXiv:2401\.04088\.
- D\. Lepikhin, H\. Lee, Y\. Xu, D\. Chen, O\. Firat, Y\. Huang, M\. Krikun, N\. Shazeer, and Z\. Chen \(2021\) GShard: scaling giant models with conditional computation and automatic sharding\. In International Conference on Learning Representations\.
- P\. Li, X\. Jin, Y\. Cheng, and T\. Chen \(2024\) Examining post\-training quantization for mixture\-of\-experts: a benchmark\. arXiv preprint arXiv:2406\.08155\.
- W\. Li, L\. Li, H\. Gu, Y\. Huang, M\. G\. Lee, S\. Sun, W\. Xue, and Y\. Guo \(2025\) MoE\-SVD: structured mixture\-of\-experts LLMs compression via singular value decomposition\. In International Conference on Machine Learning\.
- X\. Lu, Q\. Liu, Y\. Xu, A\. Zhou, S\. Huang, B\. Zhang, J\. Yan, and H\. Li \(2024\) Not all experts are equal: efficient expert pruning and skipping for mixture\-of\-experts large language models\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pp\. 6159–6172\. External Links: [Document](https://dx.doi.org/10.18653/v1/2024.acl-long.334)\.
- X\. Ma, H\. Hong, T\. Um, J\. Lee, S\. Choy, W\. Lee, and M\. Jeon \(2026\) ORBITFLOW: SLO\-aware long\-context LLM serving with fine\-grained KV cache reconfiguration\. arXiv preprint arXiv:2601\.10729\.
- T\. Mihaylov, P\. Clark, T\. Khot, and A\. Sabharwal \(2018\) Can a suit of armor conduct electricity? a new dataset for open book question answering\. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp\. 2381–2391\. External Links: [Document](https://dx.doi.org/10.18653/v1/D18-1260)\.
- P\. Molchanov, A\. Mallya, S\. Tyree, I\. Frosio, and J\. Kautz \(2019\) Importance estimation for neural network pruning\. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp\. 11264–11272\.
- A\. Navon, A\. Shamsian, I\. Achituve, E\. Fetaya, G\. Chechik, and H\. Maron \(2023\) Equivariant architectures for learning in deep weight spaces\. In International Conference on Machine Learning \(ICML\)\.
- S\. Qian, K\. Liu, P\. C\. Sruthi, L\. Tan, and Y\. Zhang \(2026\) Towards resiliency in large language model serving with kevlarflow\. arXiv preprint arXiv:2601\.22438\.
- K\. Sakaguchi, R\. Le Bras, C\. Bhagavatula, and Y\. Choi \(2020\) WinoGrande: an adversarial winograd schema challenge at scale\. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol\. 34, pp\. 8732–8740\.
- N\. Tastan, S\. Laskaridis, K\. Nandakumar, and S\. Horvath \(2026\) MoSE: mixture of slimmable experts for efficient and adaptive language models\. arXiv preprint arXiv:2602\.06154\.
- Y\. Tokpanov, P\. Glorioso, Q\. Anthony, and B\. Millidge \(2024\) Zyda\-2: a 5 trillion token high\-quality dataset\. arXiv preprint arXiv:2411\.06068\.
- Y\. Wang, Q\. Hu, Y\. Ding, R\. Wang, Y\. Gong, J\. Jiao, Y\. Shen, P\. Cheng, and J\. Su \(2025\) Training matryoshka mixture\-of\-experts for elastic inference\-time expert utilization\. arXiv preprint arXiv:2509\.26520\.
- Y\. Xu, Y\. Wang, X\. Peng, H\. Zang, M\. Chen, P\. Xia, and Z\. Wen \(2026\) TD\-MoE: tensor decomposition for MoE models\. In International Conference on Learning Representations\. External Links: [Link](https://openreview.net/forum?id=D9cnZNZfxX)\.
- A\. Yang, B\. Yang, B\. Hui, B\. Zheng, B\. Yu, C\. Zhou, C\. Li, C\. Li, D\. Liu, F\. Huang, et al\. \(2024a\) Qwen2 technical report\. arXiv preprint arXiv:2407\.10671\.
- C\. Yang, Y\. Sui, J\. Xiao, L\. Huang, Y\. Gong, Y\. Duan, W\. Jia, M\. Yin, Y\. Cheng, and B\. Yuan \(2024b\) MoE\-I2: compressing mixture of experts models through inter\-expert pruning and intra\-expert low\-rank decomposition\. In Findings of the Association for Computational Linguistics: EMNLP 2024, pp\. 10456–10466\. External Links: [Document](https://dx.doi.org/10.18653/v1/2024.findings-emnlp.612)\.
- J\. Yu and T\. S\. Huang \(2019\) Universally slimmable networks and improved training techniques\. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp\. 1803–1811\.
- R\. Zellers, A\. Holtzman, Y\. Bisk, A\. Farhadi, and Y\. Choi \(2019\) HellaSwag: can a machine really finish your sentence?\. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp\. 4791–4800\. External Links: [Document](https://dx.doi.org/10.18653/v1/P19-1472)\.
- G\. Zhang, Y\. Han, Y\. Lou, W\. Zhao, Y\. Zhang, and Y\. You \(2025\) MoNE: replacing redundant experts with lightweight novices for structured pruning of MoE\. arXiv preprint arXiv:2507\.00390\.
- L\. Zheng, L\. Yin, Z\. Xie, C\. Sun, J\. Huang, C\. H\. Yu, S\. Cao, C\. Kozyrakis, I\. Stoica, J\. E\. Gonzalez, C\. Barrett, and Y\. Sheng \(2024\) SGLang: efficient execution of structured language model programs\. arXiv preprint arXiv:2312\.07104\.

## Appendix A Additional Experimental Details and Analysis

### A\.1 Full Result Tables

Table 3: Full cross-model results with all reported per-task accuracies. All numbers are zero-shot accuracy (%).

**Mixtral-8x7B** (8 experts per layer, top-2 activated per token)

| Ratio | Method | ARC-c | ARC-e | HellaS | OBQA | PIQA | WinoG | MathQA | Avg |
|---|---|---|---|---|---|---|---|---|---|
| 0% | Base model | 57 | 84 | 65 | 36 | 82 | 76 | 43 | 63.29 |
| 20% | NAEE | 47 | 76 | 58 | 32 | 79 | 72 | 40 | 57.71 |
| 20% | MoE-I² | 48 | 79 | 55 | 32 | 78 | 74 | 37 | 57.57 |
| 20% | MoE-SVD (fine-tuned) | 55 | 80 | 61 | 33 | 81 | 73 | 38 | 60.14 |
| 20% | TD-MoE | 53 | 83 | 64 | 33 | 82 | 77 | 40 | 61.71 |
| 20% | Ours (CP-FT) | 53 | 81 | 62 | 33 | 81 | 76 | 37 | 60.43 |
| 20% | Ours (SP-XFER) | 54 | 82 | 64 | 35 | 81 | 77 | 40 | 61.86 |
| 25% | Ours (CP-FT) | 48 | 79 | 61 | 33 | 80 | 75 | 36 | 58.86 |
| 25% | Ours (SP-XFER) | 53 | 81 | 63 | 34 | 80 | 75 | 38 | 60.57 |
| 40% | NAEE | 36 | 63 | 46 | 25 | 72 | 64 | 35 | 48.71 |
| 40% | MoE-I² (P+F) | 38 | 71 | 43 | 26 | 69 | 66 | 31 | 49.14 |
| 40% | MoE-SVD | 38 | 72 | 43 | 27 | 71 | 67 | 32 | 50.00 |
| 40% | TD-MoE | 47 | 77 | 57 | 28 | 79 | 76 | 35 | 57.00 |
| 40% | Ours (CP-FT) | 44 | 75 | 57 | 31 | 79 | 68 | 31 | 55.00 |
| 40% | Ours (SP-FT) | 49 | 77 | 60 | 33 | 80 | 73 | 34 | 58.00 |
| 45% | Ours (CP-FT) | 43 | 74 | 56 | 30 | 78 | 72 | 28 | 54.43 |
| 45% | Ours (SP-XFER) | 44 | 76 | 58 | 31 | 78 | 72 | 31 | 55.71 |
| 50% | MoE-SVD (fine-tuned) | 37 | 67 | 50 | 25 | 73 | 64 | 28 | 49.14 |
| 50% | Ours (CP-FT) | 43 | 73 | 56 | 27 | 78 | 67 | 27 | 53.00 |
| 50% | Ours (SP-XFER) | 39 | 71 | 56 | 30 | 78 | 71 | 29 | 53.43 |
| 55% | Ours (CP-FT) | 35 | 66 | 48 | 28 | 75 | 63 | 26 | 48.71 |
| 55% | Ours (SP-XFER) | 36 | 69 | 54 | 28 | 77 | 69 | 28 | 51.57 |
| 60% | NAEE | 23 | 42 | 33 | 17 | 62 | 55 | 26 | 36.86 |
| 60% | MoE-I² | 22 | 44 | 32 | 18 | 58 | 55 | 23 | 36.00 |
| 60% | MoE-SVD | 23 | 45 | 33 | 19 | 62 | 55 | 25 | 37.43 |
| 60% | TD-MoE | 28 | 55 | 38 | 21 | 65 | 62 | 24 | 41.86 |
| 60% | Ours (CP-FT) | 33 | 64 | 46 | 28 | 73 | 62 | 27 | 47.57 |
| 60% | Ours (SP-XFER) | 35 | 65 | 51 | 23 | 74 | 71 | 28 | 49.57 |

**Phi-3.5-MoE** (16 experts per layer, top-2 activated per token)

| Ratio | Method | ARC-c | ARC-e | HellaS | OBQA | PIQA | WinoG | MathQA | Avg |
|---|---|---|---|---|---|---|---|---|---|
| 0% | Base model | 56 | 77 | 68 | 40 | 79 | 76 | 38 | 62.00 |
| 20% | MoE-SVD (fine-tuned) | 54 | 81 | 65 | 39 | 79 | 74 | 36 | 61.14 |
| 20% | TD-MoE | 55 | 77 | 65 | 39 | 79 | 74 | 38 | 61.00 |
| 20% | Ours (SP-XFER) | 57 | 83 | 65 | 39 | 79 | 81 | 39 | 63.29 |
| 40% | NAEE | 48 | 73 | 61 | 35 | 76 | 73 | 37 | 57.57 |
| 40% | MoE-I² | 40 | 59 | 27 | 29 | 70 | 67 | 25 | 45.29 |
| 40% | MoE-SVD | 48 | 72 | 58 | 35 | 75 | 72 | 31 | 55.86 |
| 40% | TD-MoE | 50 | 75 | 61 | 35 | 78 | 73 | 33 | 57.86 |
| 40% | Ours | 49 | 74 | 58 | 35 | 76 | 65 | 28 | 55.00 |
| 40% | Ours (SP-FT) | 53 | 80 | 59 | 38 | 78 | 71 | 37 | 59.43 |
| 60% | MoE-SVD | 40 | 60 | 46 | 30 | 71 | 68 | 25 | 48.57 |
| 60% | TD-MoE | 41 | 70 | 46 | 28 | 73 | 68 | 23 | 49.86 |
| 60% | Ours (SP-XFER) | 46 | 74 | 51 | 32 | 75 | 65 | 28 | 53.00 |

**Qwen2-57B-A14B** (8 shared + 64 routed experts per layer, 8 + top-8 activated per token)

| Ratio | Method | ARC-c | ARC-e | HellaS | OBQA | PIQA | WinoG | MathQA | Avg |
|---|---|---|---|---|---|---|---|---|---|
| – | Base model | 47 | 75 | 63 | 33 | 80 | 74 | 39 | 58.71 |
| 20% | NAEE | 42 | 72 | 59 | 31 | 79 | 72 | 36 | 55.86 |
| 20% | MoE-SVD | 45 | 74 | 61 | 30 | 78 | 73 | 35 | 56.57 |
| 20% | TD-MoE | 49 | 79 | 59 | 31 | 80 | 73 | 37 | 58.29 |
| 20% | Ours | 46 | 74 | 63 | 33 | 81 | 74 | 38 | 58.43 |
| 40% | NAEE | 39 | 72 | 54 | 28 | 76 | 71 | 32 | 53.14 |
| 40% | MoE-SVD | 33 | 63 | 45 | 29 | 71 | 65 | 31 | 48.14 |
| 40% | TD-MoE | 45 | 77 | 54 | 29 | 78 | 71 | 35 | 55.57 |
| 40% | Ours | 48 | 77 | 62 | 33 | 80 | 73 | 39 | 58.86 |
| 50% | Ours | 47 | 76 | 61 | 33 | 80 | 73 | 40 | 58.57 |
| 60% | NAEE | 29 | 58 | 44 | 21 | 69 | 61 | 26 | 44.00 |
| 60% | MoE-SVD | 32 | 62 | 44 | 27 | 69 | 64 | 30 | 46.86 |
| 60% | TD-MoE | 41 | 73 | 46 | 28 | 74 | 68 | 31 | 51.57 |
| 60% | Ours | 48 | 79 | 59 | 31 | 80 | 74 | 39 | 58.57 |
| 80% | Ours | 44 | 73 | 54 | 29 | 77 | 73 | 32 | 54.57 |

We report the complete per-task results here for clarity. Table 3 preserves the full task breakdown for all evaluated model–budget pairs. We use three short tags for our implementation variants: SP-FT denotes that the shared recovered model is fine-tuned exactly at this budget (Single-Point Fine-Tuning); SP-XFER denotes that the shared recovered model is reused at this unseen budget (Single-Point fine-tuning transFER); and CP-FT denotes that the recovered model is jointly fine-tuned over multiple budget action masks as an ablation (Cross-Point Fine-Tuning, see Appendix [B.3](https://arxiv.org/html/2606.27866#A2.SS3.SSS0.Px1)). Rows without a tag apply the trained action mask directly to the channel-ranked base model, without recovery fine-tuning. Compared with the main-text tables, we additionally include several intermediate budget points on Mixtral-8x7B (25%, 45%, 50%, and 55%), which show that a single action-training run remains stable along the traversed budget path rather than only at the three shared comparison points of 20%, 40%, and 60%. For Qwen2-57B-A14B, we also report two additional stress-test points at 50% and 80% pruning to further probe the limit of the learned subnet family on a highly sparse MoE backbone.
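As a quick reading aid for the Avg column, the sketch below assumes that Avg is the unweighted mean of the seven task accuracies and recomputes it for two Qwen2-57B-A14B rows of Table 3. The helper name `average` is hypothetical and only serves this illustration.

```python
def average(scores):
    """Unweighted mean over the seven task accuracies, rounded like the table."""
    return round(sum(scores) / len(scores), 2)

# Rows copied from Table 3 (Qwen2-57B-A14B), in the column order
# ARC-c, ARC-e, HellaS, OBQA, PIQA, WinoG, MathQA.
base = [47, 75, 63, 33, 80, 74, 39]
ours_50 = [47, 76, 61, 33, 80, 73, 40]

base_avg = average(base)        # agrees with the reported 58.71
ours_avg = average(ours_50)     # agrees with the reported 58.57
retained = ours_avg / base_avg  # fraction of base accuracy kept at 50% pruning

print(base_avg, ours_avg, f"{100 * retained:.2f}%")
```

Under this assumption, the ratio of the two averages is the quantity behind the "fraction of base performance retained" statements.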

### A.2 Additional Discussion on Cross-Budget Transfer and Cross-Model Trends

#### Cross\-Budget Transfer\.

The full tables make two empirical patterns particularly clear. The first is cross-budget transfer. On Mixtral, SP-XFER is already competitive at low compression, but its relative advantage over strong baselines becomes more visible as pruning becomes stronger. This is not surprising, for two reasons. First, at low compression ratios, the feasible pruning space is itself small: only a limited fraction of expert parameters is removed, so many methods can still find similarly good solutions and the headroom for separation is correspondingly narrow. In this regime, the main practical gain of our pipeline is not a dramatic accuracy gap, but the fact that one recovery point already provides a reusable weight set for several deployment budgets.

The second reason is structural and is tied directly to how single-point recovery fine-tuning works in our framework. The shared recovery point is trained on a middle-budget subnet, so the learned LoRA parameters mainly cover the prefix channels that are active at that budget. Those channels are exactly the ones reused by all tighter descendant masks, which explains why transfer remains strong at higher compression. By contrast, looser budgets expose more tail channels that are not covered by adapter parameters during recovery and therefore receive no updates when the adapter is merged into the base model. This creates a trade-off: compared with independent per-budget fine-tuning, single-point recovery sacrifices a limited amount of accuracy at some high budgets because the tail channels are not updated, but it removes the need to train, store, and maintain separate recovered models for different budget points.
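The coverage argument can be made concrete with a toy calculation. The sketch below assumes that each expert keeps a prefix of its ranked channels, that the per-expert widths are nested across budgets, and that recovery updates exactly the prefix that is active at the recovery budget. The function name `coverage` and all width values are hypothetical.

```python
def coverage(k_recover, k_target):
    """Fraction of the target mask's active channels that received recovery updates.

    Both arguments list per-expert retained widths (prefix lengths) under the
    same channel ranking, so expert e keeps channels [0, k) at each budget.
    """
    updated = sum(min(r, t) for r, t in zip(k_recover, k_target))
    return updated / sum(k_target)

# Hypothetical retained widths for four experts with I = 16 channels each.
widths = {
    "20%": [16, 14, 12, 10],  # loosest budget
    "40%": [12, 10, 8, 8],    # recovery point
    "60%": [8, 6, 6, 4],      # tightest budget
}

tighter = coverage(widths["40%"], widths["60%"])  # every active channel was updated
looser = coverage(widths["40%"], widths["20%"])   # tail channels were never updated
print(tighter, round(looser, 3))
```

In this toy setting, transferring to the tighter mask keeps full coverage, while transferring to the looser mask leaves part of the active channels at their unrecovered values, which mirrors the trade-off described above.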

#### Cross-Model Trends.

Another pattern is that FlexMoE becomes increasingly effective on more strongly sparse MoE backbones\. The progression from Mixtral to Phi to Qwen2 is visible both in the tables and in our action\-training dynamics\. Mixtral\-8x7B has only 8 routed experts per layer with top\-2 routing, so each active expert accounts for a relatively large fraction of the routed computation\. As a result, pruning inside one expert is more sensitive and the quality preservation term in action learning remains harder to suppress under increasing cost pressure\. Phi\-3\.5\-MoE is moderately more sparse and already shows stronger transfer\. Qwen2\-57B\-A14B combines a much larger routed\-expert pool with shared experts, so each routed expert contributes a smaller fraction of the total effective computation and there is substantially more room for structured expert\-internal compression before quality degrades sharply\.

Our training\-time observations are consistent with this interpretation\. Under comparable settings, action learning on Mixtral takes about 48 hours to drive the retained parameter ratio from full capacity to the target region, compared with about 30 hours on Phi\-3\.5\-MoE and only about 7 hours on Qwen2\. Intuitively, in the more strongly sparse models, removing capacity from one routed expert perturbs the full teacher–student gap less, so the cost term can push the policy toward thinner actions more easily\. We do not attribute this trend to expert count alone, since routing design, shared experts, and other architectural factors also matter\. Still, taken together, the full results and the training dynamics strongly suggest that stronger sparsity makes the quality–cost trade\-off optimized by FlexMoE easier to satisfy and more favorable in practice\.

## Appendix B More Ablation Study

### B.1 Effect of Importance-Aware Channel Reordering

We ablate the importance-aware reordering stage on Mixtral-8x7B. The goal is to test whether prefix slicing remains effective without first sorting expert FFN channels by importance. We compare three variants: base model, which directly applies our action-learning pipeline to the original unranked model; 100 ranked, where channels are ranked using 100 calibration samples; and 5k ranked, where the same reordering procedure uses 5,000 calibration samples. A subtle but important point is that, at the same target prune ratio, the action masks are not shared across these variants. Instead, each model variant runs its own action-learning process and uses the mask exported from that run. This is the fairest comparison: different channel orderings change the meaning of prefix slicing itself, so forcing the same mask across different orderings would confound the effect of reordering with an incompatible pruning pattern. Our setup keeps the overall training pipeline identical across model variants and changes only the ranking stage. For each model variant, the action masks at different prune ratios are all exported from a single action-training run. No recovery fine-tuning is applied.

Table 4: Ablation on importance-aware reordering for prefix slicing on Mixtral-8x7B. All numbers are zero-shot accuracy (%). "100 ranked" and "5k ranked" denote that the channel ordering is estimated using 100 and 5,000 calibration samples, respectively.

Table [4](https://arxiv.org/html/2606.27866#A2.T4) shows that importance-aware reordering is crucial for making prefix slicing effective. Without reordering, the base model reaches only 54.43 average accuracy at the 20% prune ratio and 48.86 at 40%. In contrast, the ranked variants substantially improve performance, reaching 59.71/60.00 at 20% and 53.43/53.86 at 40%. The improvement is especially clear at the tighter 40% budget, where preserving a more informative prefix matters more. This supports our core motivation: a prefix channel-retention action is only meaningful once the channel layout has been transformed into an importance-ordered space. The ranking stage is also sample-efficient in practice: with a modest number of calibration samples, the importance estimates converge and the resulting channel-ranked layout is already stable enough to support prefix-slicing actions, while further enlarging the ranking set yields only marginal gains.
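To illustrate why the ordering matters for prefix slicing, the sketch below compares how much total importance a length-$k$ prefix retains before and after ranking. The importance scores are made-up numbers for a single expert, and the helper names `rank_channels` and `prefix_mass` are hypothetical. How importance is actually estimated is not restated here.

```python
def rank_channels(importance):
    """Return channel indices sorted from most to least important."""
    return sorted(range(len(importance)), key=lambda c: -importance[c])

def prefix_mass(importance, order, k):
    """Share of total importance kept by the first k channels in `order`."""
    return sum(importance[c] for c in order[:k]) / sum(importance)

# Hypothetical per-channel importance scores for one expert (I = 8).
importance = [0.1, 2.0, 0.3, 1.5, 0.05, 0.9, 0.2, 0.6]
identity = list(range(len(importance)))  # unranked layout
ranked = rank_channels(importance)       # importance-ordered layout

k = 4  # keep half of the channels
unranked_kept = prefix_mass(importance, identity, k)
ranked_kept = prefix_mass(importance, ranked, k)
print(ranked, round(unranked_kept, 3), round(ranked_kept, 3))
```

In a gated FFN, such a permutation leaves the expert function unchanged as long as it is applied consistently to the gate rows, the up rows, and the down-projection columns, so ranking changes only which channels a prefix keeps.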

### B.2 Effect of Expert-Wise Action Learning

#### Uniform and Random Action\-Mask Ablation\.

The learned action mask in Figure [8](https://arxiv.org/html/2606.27866#A2.F8) suggests that the structure captured by FlexMoE is not only layer-wise but also expert-wise: the distribution of actions differs across layers, and within the same layer, different experts are often assigned different retention actions rather than sharing one uniform budget. This motivates our structure-destruction ablation. Starting from a learned action mask at a target budget, we construct two controls while keeping the overall action histogram and pruning budget unchanged: global shuffle, which randomly permutes all expert actions across the whole model and therefore destroys both inter-layer and expert-wise structure; and in-layer shuffle, which randomly permutes expert actions only within each layer, preserving the layer-wise action counts but removing expert-wise specialization inside that layer. We then apply these shuffled masks back to the same ranked model and evaluate accuracy on the same seven downstream datasets used in the main experiments, reporting only the averaged score. We also include a uniform ablation, where all experts are assigned the same retention ratio while keeping the same global pruning ratio.
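The three controls can be sketched as follows. This is a minimal illustration that assumes the mask is stored as a list of layers, each holding one discrete action id per expert, and that all experts have the same size. The function names, the toy mask, and the action-to-ratio mapping are hypothetical.

```python
import random
from collections import Counter

def global_shuffle(mask, rng):
    """Permute all expert actions across the whole model."""
    flat = [a for layer in mask for a in layer]
    rng.shuffle(flat)
    it = iter(flat)
    return [[next(it) for _ in layer] for layer in mask]

def in_layer_shuffle(mask, rng):
    """Permute expert actions only within each layer."""
    out = []
    for layer in mask:
        layer = list(layer)
        rng.shuffle(layer)
        out.append(layer)
    return out

def uniform_ratio(mask, ratios):
    """Single retention ratio with the same global budget (equal-size experts assumed)."""
    actions = [a for layer in mask for a in layer]
    return sum(ratios[a] for a in actions) / len(actions)

def histogram(mask):
    return Counter(a for layer in mask for a in layer)

# Hypothetical learned mask: 3 layers x 4 experts, entries are action ids.
mask = [[0, 1, 1, 3], [2, 2, 3, 3], [0, 0, 1, 2]]
ratios = {0: 1.0, 1: 0.75, 2: 0.5, 3: 0.25}  # hypothetical action -> retention ratio

rng = random.Random(0)
shuffled_global = global_shuffle(mask, rng)
shuffled_local = in_layer_shuffle(mask, rng)
uniform = uniform_ratio(mask, ratios)
```

Both shuffles keep the global action histogram, so the pruning budget is unchanged. Only the in-layer variant also keeps each layer's action counts.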

Table [8](https://arxiv.org/html/2606.27866#A2.F8) shows that destroying the learned structure consistently hurts performance. For both Mixtral-8x7B and Qwen2-57B-A14B, the learned mask yields the strongest average accuracy, while both shuffled controls degrade the result to different degrees. The gap becomes larger at the tighter pruning budget, indicating that structured action learning matters more when the model has less room to absorb poor budget allocation. Moreover, global shuffle is consistently worse than in-layer shuffle, and in-layer shuffle is in turn worse than the learned mask, which shows that the learned policy captures not only a meaningful layer-wise allocation across depth, but also expert-wise specialization within a layer. Another notable trend is that Qwen2 is less sensitive than Mixtral to the in-layer shuffle, remaining much closer to the learned mask. This is consistent with our broader cross-model observation that more strongly sparse MoE backbones exhibit stronger expert and action substitutability. Overall, these results reinforce that FlexMoE is not merely learning a global prune ratio, but a structured action pattern across both layers and experts.

![Refer to caption](https://arxiv.org/html/2606.27866v1/figures/intro_action_mask.png)

Figure 7: An example of a learned action mask at the 40% budget on Mixtral-8x7B. Each cell records the action chosen by one expert, which maps to a predefined weight-retention ratio.

Figure 8: Uniform and shuffled action-mask ablation. All numbers are average zero-shot accuracy (%) over the same seven evaluation datasets used in the main experiments.

### B.3 Advantages of Single-Point Fine-Tuning (SP-FT) Recovery

#### Cross-Point Fine-Tuning (CP-FT) Ablation Details.

To obtain one recovered model that can serve multiple action masks, we first implemented a cross-point fine-tuning strategy (CP-FT) inspired by the sandwich rule in universally slimmable networks (US-Nets) and the many-subnet distillation idea of AmoebaLLM (Yu and Huang \[[2019](https://arxiv.org/html/2606.27866#bib.bib10)\], Fu et al. \[[2024](https://arxiv.org/html/2606.27866#bib.bib13)\]). In each micro-step, we load a pool of trained action masks across budgets. We attach LoRA to the full expert FFN projections and activate it for both the full-teacher and subnet forward passes. The full subnet is trained with the standard language-model CE loss, which prevents the full model from degrading and thereby corrupting the in-place distillation targets of the subnets, while each sampled subnet is trained by soft distillation against the full-subnet teacher. Following the AmoebaLLM-style balancing factor used in our implementation, each subnet distillation loss is reweighted by the ratio between the magnitude of the full-subnet CE loss and that of the subnet distillation loss, so that no low-budget subnet dominates optimization because of its higher distillation loss. Concretely, with full-subnet loss $\mathcal{L}_{\mathrm{full}}$ and sampled subnet losses $\{\mathcal{L}_{\mathrm{sub}}^{(i)}\}_{i=1}^{K}$, we optimize

$$\mathcal{L}_{\mathrm{CP\text{-}FT}} = \mathcal{L}_{\mathrm{full}} + \sum_{i=1}^{K} \frac{\left|\mathcal{L}_{\mathrm{full}}\right|}{\left|\mathcal{L}_{\mathrm{sub}}^{(i)}\right| + \epsilon}\,\mathcal{L}_{\mathrm{sub}}^{(i)}, \tag{16}$$

where $\mathcal{L}_{\mathrm{full}}$ is the full-subnet CE loss and each $\mathcal{L}_{\mathrm{sub}}^{(i)}$ is the distillation loss from a sampled action mask. For each optimization step, the sampled masks (subnets) follow a US-Nets sandwich pattern consisting of the minimum-ratio mask and several random intermediate masks from the action pool, while the full model is always included as the teacher.
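The balancing in Eq. (16) and the sandwich sampling can be sketched on plain scalars. This is only an illustration: the function names `cp_ft_loss` and `sample_sandwich`, the pool of retention ratios, and the default `eps` are hypothetical. In an autograd framework the balancing weight would presumably be treated as a constant (for example by detaching it), which is our reading and not something stated above.

```python
import random

def cp_ft_loss(loss_full, losses_sub, eps=1e-8):
    """Eq. (16): full-subnet CE loss plus magnitude-balanced subnet losses."""
    total = loss_full
    for loss_sub in losses_sub:
        # Rescale each subnet loss to the magnitude of the full-subnet loss.
        weight = abs(loss_full) / (abs(loss_sub) + eps)
        total += weight * loss_sub
    return total

def sample_sandwich(ratios, n_random, rng):
    """Pick the minimum-ratio mask plus n_random other masks from the pool."""
    smallest = min(ratios)
    others = [r for r in ratios if r != smallest]
    return [smallest] + rng.sample(others, n_random)

# Hypothetical values: one full-subnet CE loss and two subnet distillation losses.
total = cp_ft_loss(2.0, [4.0, 8.0])

# Hypothetical pool of masks, identified by their retained ratios.
picks = sample_sandwich([0.8, 0.6, 0.4, 0.2], n_random=2, rng=random.Random(0))
```

With the weight held constant, every subnet term contributes at roughly the scale of $\mathcal{L}_{\mathrm{full}}$, so a low-budget subnet with a large raw distillation loss cannot dominate the update.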

#### Why Single Point Fine Tuning \(SP\-FT\) is More Effective in Practice\.

According to Figure [4](https://arxiv.org/html/2606.27866#S4.F4.1), this CP-FT strategy underperforms both per-budget fine-tuning and our current single-point alternative over most budgets, whereas SP-FT followed by direct transfer already stays much closer to the per-budget fine-tuning results. We believe the reason is structural. CP-FT jointly optimizes many masks with substantially different active channel prefixes, so the shared LoRA update must satisfy several incompatible subnet configurations at once. In practice this may weaken specialization and blur the recovery signal at any one budget. By contrast, leveraging the invariant nesting property of the frozen router and expert topology, SP-FT concentrates all recovery capacity on one concrete subnet and then reuses the merged weights across nearby masks. Although this sacrifices a small amount of accuracy relative to independently fine-tuning every budget, it removes the need to train, store, and maintain one recovered model per budget point.

#### Why the Middle Budget Generalizes Best\.

We further tested single-point recovery (SP-FT) at three different budgets and then transferred the recovered weights to all three target masks. Table [10](https://arxiv.org/html/2606.27866#A2.F10) and Figure [10](https://arxiv.org/html/2606.27866#A2.F10) show a clear asymmetry. Fine-tuning at the high-budget point (20%) generalizes poorly when transferred downward: performance degrades monotonically as pruning becomes stronger, which is consistent with the intuition that high-budget recovery still spends its capacity on tail channels that are later removed by tighter masks. Interestingly, it is also not the best even on the 20% target itself, where the mid-budget recovery point (40%) performs better. A plausible explanation is that moderate masking may act as a useful regularizer: compared with 20%-FT, 40%-FT is forced to recover only the more reusable prefix channels, which improves robustness even at slightly looser budgets. On the other side, low-budget recovery (60%) behaves as expected: it is strongest or nearly strongest near its own fine-tuned budget, but its upward transfer remains weaker because the recovered update covers only a relatively small active prefix and therefore cannot adequately restore the additional channels exposed at looser masks. Overall, the 40% recovery point provides the best global trade-off between coverage and specialization: it does not fully optimize any endpoint, but it yields the strongest one-step generalization across the whole budget family.

Figure 9: Single-point recovery-point ablation on Mixtral-8x7B. Each row is a target mask, and each column indicates the recovered model trained on that fixed budget point. All numbers are average zero-shot accuracy (%) over the same seven evaluation datasets used in the main experiments.

![Refer to caption](https://arxiv.org/html/2606.27866v1/figures/diff-ft-budget.png)

Figure 10: Recovery-point generalization curve. The x-axis is the prune budget and the y-axis is average downstream accuracy. Curves correspond to single-point recovery performed at budgets of 20%, 40%, and 60%, respectively.

## Appendix C Kernel Co-Design Implementation Details

### C.1 Naive Python Online Clipped FFN

Our first online clipping approach is implemented directly on top of the HuggingFace-style forward path of the original model and applies budget-conditioned weight clipping inside the expert FFN forward at runtime. In the current Mixtral implementation, this path first identifies all routed experts in the current batch and then iterates over them. In the original full path (without weight pruning), the gate and up projections are stored contiguously as a single gate–up weight tensor, so one expert forward can be executed with a single larger GEMM for better GPU utilization, followed by a chunk into gate and up projection outputs. By contrast, online clipping breaks this fast path: once a runtime mask requests a retained width $k_e < I$, where $I$ is the expert intermediate width, the implementation must first slice the gate rows and up rows separately, concatenate them into a clipped contiguous gate–up weight tensor, slice the down projection to the same width, and only then launch the matrix multiplication and chunk. Since this happens inside a Python-side per-expert loop, the runtime pays both dispatch overhead and extra tensor-manipulation overhead before the actual GEMM.

**Input:** hidden states $X$, routed expert assignments, router weights, per-expert retained widths $\{k_e\}$, dense expert weights $\{W^{gu}_e, W^{down}_e\}$

**Output:** final routed-expert FFN output $Y$

1. Initialize $Y \leftarrow 0$.
2. Find all active experts in the current batch.
3. **for each** active expert $e$ **do**
   1. Gather routed token states $X_e$ and router weights for expert $e$.
   2. Read retained width $k_e$.
   3. **if** $k_e = I$ **then** (original full path: use the full contiguous gate-up weight)
      - $Z^{gu}_e \leftarrow X_e (W^{gu}_e)^{\top}$
      - $(G_e, U_e) \leftarrow \operatorname{chunk}(Z^{gu}_e, 2)$
      - $H_e \leftarrow \operatorname{SiLU}(G_e) \odot U_e$
      - $O_e \leftarrow H_e (W^{down}_e)^{\top}$
   4. **else** (online clipped path: extra slicing and reconstruction)
      - $W^{gate}_e \leftarrow W^{gu}_e[0\!:\!k_e, :]$ (extra slice)
      - $W^{up}_e \leftarrow W^{gu}_e[I\!:\!I+k_e, :]$ (extra slice)
      - $\widehat{W}^{down}_e \leftarrow W^{down}_e[:, 0\!:\!k_e]$ (extra slice)
      - $\widehat{W}^{gu}_e \leftarrow \operatorname{cat}(W^{gate}_e, W^{up}_e)$ (hotspot)
      - $\widehat{Z}^{gu}_e \leftarrow X_e (\widehat{W}^{gu}_e)^{\top}$
      - $(\widehat{G}_e, \widehat{U}_e) \leftarrow \operatorname{chunk}(\widehat{Z}^{gu}_e, 2)$
      - $\widehat{H}_e \leftarrow \operatorname{SiLU}(\widehat{G}_e) \odot \widehat{U}_e$
      - $O_e \leftarrow \widehat{H}_e (\widehat{W}^{down}_e)^{\top}$
   5. Weight $O_e$ by router scores and scatter-add into $Y$.
4. **return** $Y$

Algorithm 1: Naive Python Online Clipped FFN.

This implementation is simple but inefficient for two reasons. First, once the learned action mask assigns different retained widths to different experts, the runtime can no longer naturally batch expert FFNs into one regular MoE matrix multiplication; it instead degrades toward many small expert-wise GEMMs plus Python-side scheduling. Second, compared with the dense full path, clipped execution inserts extra slice and cat operations on the weight tensors before every clipped expert forward. The concatenation step is especially costly, because PyTorch's implementation allocates a new tensor and copies the data into it. These overheads are specific to our nested prefix-slicing setting: they arise because the runtime is asked to materialize budget-specific clipped subnetworks on the fly from one shared MoE checkpoint that serves the whole budget family.
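The per-expert body of Algorithm 1 can be sketched in plain Python lists, which makes the extra slice and concatenation steps explicit. This is a simplified illustration, not the actual implementation: it handles a single expert, omits router weighting and the scatter-add, and uses hypothetical names (`matmul_t`, `expert_forward`) and toy weights.

```python
import math

def matmul_t(x, w):
    """Compute x @ w^T for row-major lists: x is [T][D], w is [N][D]."""
    return [[sum(a * b for a, b in zip(row, wrow)) for wrow in w] for row in x]

def silu(v):
    return v / (1.0 + math.exp(-v))

def expert_forward(x, w_gu, w_down, k):
    """One expert's FFN with retained width k.

    w_gu is [2*I][H] with gate rows first, then up rows; w_down is [H][I].
    """
    inter = len(w_gu) // 2
    if k < inter:
        w_gate = w_gu[:k]                     # extra slice
        w_up = w_gu[inter:inter + k]          # extra slice
        w_down = [row[:k] for row in w_down]  # extra column slice
        w_gu = w_gate + w_up                  # concatenation (allocates and copies)
    z = matmul_t(x, w_gu)                     # [T][2k]
    half = len(w_gu) // 2
    h = [[silu(g) * u for g, u in zip(row[:half], row[half:])] for row in z]
    return matmul_t(h, w_down)                # [T][H]

x = [[1.0, -2.0], [0.5, 0.25]]                # T = 2 tokens, H = 2
w_gu = [[0.1, 0.2], [0.3, -0.4], [0.5, 0.6],  # gate rows g0..g2
        [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # up rows u0..u2
w_down = [[1.0, 2.0, 3.0], [0.5, -1.0, 0.25]] # [H][I] with I = 3

clipped = expert_forward(x, w_gu, w_down, k=2)
# Reference: full path with the pruned channel's down column zeroed out.
w_down_zeroed = [row[:2] + [0.0] for row in w_down]
reference = expert_forward(x, w_gu, w_down_zeroed, k=3)
```

The clipped output agrees with the full path in which the removed channel contributes nothing, so prefix clipping changes only which channels participate, while the slice and concatenation work is pure overhead.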

### C.2 Online-Reordered Shared Weight Layout

To reduce the online clipping overhead, we first arrange the cross-budget shared model weights into a layout that is more convenient for runtime prefix slicing. In the original expert, the concatenated gate-up projection weight is stored as

$$W^{gu}_{\mathrm{orig}} = [g_0, \ldots, g_{I-1}, u_0, \ldots, u_{I-1}],$$

so obtaining $W^{gu}_{\mathrm{clip}}$ with only the top $k$ channels requires two separate row slices followed by one concatenation:

$$W^{gate}_{[:k]},\; W^{up}_{[:k]},\; W^{gu}_{\mathrm{clip}} = \operatorname{cat}(W^{gate}_{[:k]}, W^{up}_{[:k]}).$$

Meanwhile, the original down projection is stored as $W^{down}_{\mathrm{orig}} \in \mathbb{R}^{H \times I}$, so clipping also requires a column slice, which carries a larger overhead.

Our export path rewrites the shared weights into a new layout:

$$W^{gu}_{\mathrm{reord}} = [g_0, u_0, g_1, u_1, \ldots, g_{I-1}, u_{I-1}], \qquad W^{down}_{\mathrm{reord}} = (W^{down}_{\mathrm{orig}})^{\top} \in \mathbb{R}^{I \times H}.$$

Under this layout, clipping to $W^{gu}_{\mathrm{clip}}$ with the top $k$ channels becomes one prefix slice on the interleaved gate-up tensor and one row slice on the transposed down tensor:

$$W^{gu}_{\mathrm{clip}} = W^{gu}_{\mathrm{reord}}[0:2k, :], \qquad W^{down}_{\mathrm{clip}} = W^{down}_{\mathrm{reord}}[0:k, :].$$

In the original online path, the runtime must rebuild $W^{gu}_{\mathrm{clip}}$ before every clipped expert forward. In the reordered layout, each requested weight is obtained with a single direct prefix slice. This removes the repeated tensor assembly from the online scheduling cost.
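The equivalence between the two layouts can be checked on a toy example. In the sketch below, string labels stand in for weight rows and entries, and the helper names (`interleave_gate_up`, `transpose`, `clip_original`, `clip_reordered`) are hypothetical.

```python
def interleave_gate_up(w_gu):
    """Rows [g0..g_{I-1}, u0..u_{I-1}] -> rows [g0, u0, g1, u1, ...]."""
    inter = len(w_gu) // 2
    out = []
    for j in range(inter):
        out.append(w_gu[j])
        out.append(w_gu[inter + j])
    return out

def transpose(w):
    return [list(col) for col in zip(*w)]

def clip_original(w_gu, w_down, k):
    """Original layout: two row slices, one concatenation, one column slice."""
    inter = len(w_gu) // 2
    return w_gu[:k] + w_gu[inter:inter + k], [row[:k] for row in w_down]

def clip_reordered(w_gu_reord, w_down_reord, k):
    """Reordered layout: one prefix slice per tensor."""
    return w_gu_reord[:2 * k], w_down_reord[:k]

# Labels stand in for weight rows (gate/up) and entries (down), with I = 3, H = 2.
w_gu = ["g0", "g1", "g2", "u0", "u1", "u2"]
w_down = [["d00", "d01", "d02"], ["d10", "d11", "d12"]]  # [H][I]

w_gu_reord = interleave_gate_up(w_gu)
w_down_reord = transpose(w_down)                         # [I][H]

gu_ref, down_ref = clip_original(w_gu, w_down, k=2)
gu_clip, down_clip = clip_reordered(w_gu_reord, w_down_reord, k=2)
```

Python list slicing copies, but on a contiguous row-major tensor a prefix slice along the leading dimension is itself a contiguous view, which is what makes the reordered layout cheap to clip at runtime.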

### C.3 Kernelized Clipped FFN Forward

**Input:** hidden states $X$, routed expert ids $\mathrm{eid}$, router weights $\alpha$, reordered weights $\{\widetilde{W}^{gu}_e, \widetilde{W}^{down}_e\}$, retained widths $\{k_e\}$

**Output:** routed-expert FFN output $Y$

1. $Y \leftarrow \mathbf{0}$
2. $\mathcal{B} \leftarrow \operatorname{BucketByRetI}(\{k_e\})$ (group active experts by retained width)
3. **for each** $(k, \mathcal{E}_k) \in \mathcal{B}$ **do**
   1. $k_{\mathrm{eff}} \leftarrow \operatorname{AlignUp}(k, k_{\mathrm{align}})$ (working width aligned to the hardware)
   2. $(X_k, \mathrm{eid}_k, \alpha_k) \leftarrow \operatorname{GatherRoutedToken}(X, \mathrm{eid}, \alpha, \mathcal{E}_k)$
   3. $(\pi, \mathrm{eid}_k) \leftarrow \operatorname{Sort}(\mathrm{eid}_k)$; $X_k \leftarrow X_k[\pi]$, $\alpha_k \leftarrow \alpha_k[\pi]$ (make each expert's token segment contiguous)
   4. $\mathcal{S} \leftarrow [\,]$, $\mathcal{W}_{gu} \leftarrow [\,]$, $\mathcal{W}_{down} \leftarrow [\,]$
   5. **for each** token segment $(start, end, e)$ in $\mathrm{eid}_k$ **do** (build grouped-GEMM operand views)
      - $\mathcal{S}.\operatorname{append}\left(X_k[start\!:\!end, :]\right)$
      - $\mathcal{W}_{gu}.\operatorname{append}\left(\widetilde{W}^{gu}_e[0\!:\!2k_{\mathrm{eff}}, :]\right)$
      - $\mathcal{W}_{down}.\operatorname{append}\left(\widetilde{W}^{down}_e[0\!:\!k_{\mathrm{eff}}, :]\right)$
   6. $Z_{gu} \leftarrow \operatorname{cublasGroupedGemm}(\mathcal{S}, \mathcal{W}_{gu}^{\top})$
   7. $G \leftarrow Z_{gu}[:, 0::2]$; $U \leftarrow Z_{gu}[:, 1::2]$ (read interleaved activations)
   8. $H \leftarrow \operatorname{SiLU}(G) \odot U$; $H[:, k\!:\!k_{\mathrm{eff}}] \leftarrow 0$ (zero out the channels added by alignment)
   9. $Z_{down} \leftarrow \operatorname{cublasGroupedGemm}(H, \mathcal{W}_{down})$
   10. $Z_{down} \leftarrow Z_{down} \odot \alpha_k$; $Y \leftarrow \operatorname{ScatterAdd}(Y, Z_{down}, \pi)$ (add outputs back at their original positions)
4. **return** $Y$

Algorithm 2: Kernelized Online Clipped FFN Co-Design.

On top of the reordered shared weights, we implement a customized CUDA path for online clipped FFN execution. The key idea is to avoid treating every active expert as a fully irregular independent GEMM. Instead, the discrete action set enables us to bucket active experts by their retained width, align each bucket width upward to a hardware-friendly effective width $k_{\mathrm{eff}}$, and then process all routed tokens in that bucket with grouped GEMMs rather than isolated expert-wise GEMMs.
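The bucketing and alignment steps can be sketched on the host side as follows. The alignment granularity of 64, the per-expert widths, and the function names `align_up` and `bucket_by_width` are hypothetical. The sketch also assumes the aligned width never exceeds the stored intermediate width.

```python
from collections import defaultdict

def align_up(k, k_align):
    """Round k up to the next multiple of k_align."""
    return ((k + k_align - 1) // k_align) * k_align

def bucket_by_width(widths, active_experts):
    """Group active expert ids by their retained width."""
    buckets = defaultdict(list)
    for e in active_experts:
        buckets[widths[e]].append(e)
    return dict(buckets)

# Hypothetical retained widths for five experts; expert 3 is not routed to.
widths = {0: 96, 1: 200, 2: 96, 3: 256, 4: 200}
buckets = bucket_by_width(widths, active_experts=[0, 1, 2, 4])

# Each bucket gets one aligned working width and one grouped-GEMM launch.
plan = {k: (align_up(k, 64), experts) for k, experts in buckets.items()}
print(plan)
```

Because the action set is discrete, the number of distinct widths and therefore the number of buckets stays small, no matter how many experts are active.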

Within one bucket, routed tokens are first sorted by expert id. This step does not change the computation, but it places the tokens belonging to the same expert in one contiguous group, so the implementation can form per-expert tensor views rather than materializing scattered copies. These views include: (i) the routed token segment contiguously grouped by expert, (ii) the up-aligned prefix-sliced interleaved gate–up weight $\widetilde{W}^{gu}_e[0:2k_{\mathrm{eff}}, :]$, and (iii) the up-aligned prefix-sliced transposed down weight $\widetilde{W}^{down}_e[0:k_{\mathrm{eff}}, :]$. The resulting view lists are then passed directly to the cuBLAS grouped GEMM function. In this sense, the weight reordering and the kernel-level bucketing are tightly coupled: the former reduces the cost of the necessary parameter-slicing actions, and the latter converts many irregular expert calls back into a grouped matrix-multiplication workload.

After the first grouped GEMM using the interleaved $\widetilde{W}^{gu}_e$, the output activations are still in interleaved form. Instead of reconstructing concatenated gate-up tensors explicitly in weight space, the kernel reads the interleaved gate/up outputs directly from this grouped-GEMM output, applies the $\mathrm{SiLU}$ gate, and writes the compact hidden activation $H[:, j]$. It then zero-masks the padded channels in $[k_e, k_{\mathrm{eff}})$ introduced by alignment. Importantly, this reconstruction is now performed on the activation tensor of shape roughly $[\text{routed tokens}, 2k_{\mathrm{eff}}]$, which is usually (depending on concurrency) much smaller than the original gate-up weight tensor. The reconstruction cost of slicing the gate-up projection is therefore paid on a much smaller working set. A second grouped GEMM then applies the down projection.
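The activation-side step can be sketched in plain Python as follows. This stands in for what the CUDA path does per token row and is not the kernel itself. The function name `gated_hidden` and the toy activations are hypothetical.

```python
import math

def silu(v):
    return v / (1.0 + math.exp(-v))

def gated_hidden(z_gu, k, k_eff):
    """Build H from interleaved [g0, u0, g1, u1, ...] GEMM outputs.

    z_gu has shape [T][2 * k_eff]. Channels in [k, k_eff) exist only because
    of alignment, so they are zeroed and add nothing to the down projection.
    """
    hidden = []
    for row in z_gu:
        gates, ups = row[0::2], row[1::2]   # strided read, no weight-space rebuild
        h = [silu(g) * u for g, u in zip(gates, ups)]
        h[k:k_eff] = [0.0] * (k_eff - k)    # zero-mask the padded channels
        hidden.append(h)
    return hidden

# One routed token, retained width k = 3 padded to k_eff = 4.
z_gu = [[1.0, 2.0, 0.0, 5.0, 3.0, 4.0, -1.0, 7.0]]
hidden = gated_hidden(z_gu, k=3, k_eff=4)
```

Only the activation row is de-interleaved here, which is the point made above: the strided read touches a tensor whose size scales with the number of routed tokens, not with the weight matrix.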

Overall, this co\-design relieves the scheduling hotspots of online budget\-switching and turns the theoretical parameter and computation reduction of nested subnetworks into real throughput gains\.
