Attribution-Guided and Coverage-Maximized Pruning for Structural MoE Compression
Summary
Proposes a structural pruning framework for MoE models that maximizes channel-score coverage via attribution-based approximation, achieving 50% or 25% pruning with 4-bit quantization and reducing memory footprint by 5.27x on Qwen3-30B-A3B.
# Attribution-Guided and Coverage-Maximized Pruning for Structural MoE Compression
Source: [https://arxiv.org/html/2606.18304](https://arxiv.org/html/2606.18304)
Jiacheng Wang, Ge Yang, Yongcheng Jing, Jinyang Guo, Xianglong Liu, Dacheng Tao
###### Abstract
Mixture\-of\-Experts \(MoE\) models scale compute efficiently, yet they remain expensive to deploy due to their substantial memory footprint and inference overhead\. Prior methods mainly operate at the expert level, either removing whole experts or ranking experts by importance\. However, such expert\-wise decisions are too coarse to identify redundancy; they often misallocate pruning budgets and limit compression\. To alleviate this dilemma, we observe that information in MoE experts is highly concentrated in a few channels, leaving substantial redundancy even in “high importance” experts\. Accordingly, we propose a structural pruning framework tailored for MoEs, reformulating the prune\-ratio objective as maximizing channel\-score coverage via an efficient attribution\-based approximation\. Experiments on DeepSeek and Qwen MoEs retain accuracy under 50% pruning, or 25% pruning jointly with 4\-bit quantization, reducing the memory footprint of Qwen3\-30B\-A3B by $5.27\times$ and outperforming state\-of\-the\-art baselines across diverse benchmarks\. Our code is available at [https://github\.com/yifu\-ding/MoE\-Slimming](https://github.com/yifu-ding/MoE-Slimming)\.
Mixture\-of\-Experts, Model Compression, Structural Pruning, Multimodal Large Language Models
## 1 Introduction
Mixture\-of\-Experts \(MoE\) architectures have become a dominant paradigm for scaling language models, offering high parameter capacity while maintaining manageable computation by activating only a subset of experts for each token \(Xue et al\., [2024](https://arxiv.org/html/2606.18304#bib.bib51); Qwen\-Team, [2025](https://arxiv.org/html/2606.18304#bib.bib27); Guo et al\., [2025a](https://arxiv.org/html/2606.18304#bib.bib50)\)\. To deploy large modern MoEs effectively and accelerate inference, structural pruning, which removes entire channels or experts to yield a smaller, hardware\-efficient dense model, offers a promising solution \(Ma et al\., [2023](https://arxiv.org/html/2606.18304#bib.bib52); Gao et al\., [2024](https://arxiv.org/html/2606.18304#bib.bib53); An et al\., [2024](https://arxiv.org/html/2606.18304#bib.bib54); Guo et al\., [2024](https://arxiv.org/html/2606.18304#bib.bib2)\)\. Quantization, which reduces model bit\-widths, is a complementary efficiency approach \(Gong et al\., [2025](https://arxiv.org/html/2606.18304#bib.bib3); Lv et al\., [2026](https://arxiv.org/html/2606.18304#bib.bib4)\)\. In contrast to dense models, which use a single FFN per layer shared by all tokens, MoEs comprise multiple experts with token\-dependent routing\. Experts are activated at vastly different frequencies and exhibit non\-uniform internal redundancy \(Huang et al\., [2024](https://arxiv.org/html/2606.18304#bib.bib19); Zhang et al\., [2024](https://arxiv.org/html/2606.18304#bib.bib9)\)\. Consequently, pruning decisions are tightly coupled to data\-dependent activation\.

Figure 1: Overview of our pruning framework: estimating expert importance via an attribution\-based approximation \(left\), maximizing score coverage to avoid wasting capacity \(middle\), and applying alignment\-aware redistribution for compact storage and kernel\-friendly low\-bit inference \(right\)\.
Allocating good prune ratios across heterogeneous experts becomes substantially harder as modern MoEs scale to hundreds of experts, compared to earlier MoEs with only a handful of experts\. Expert\-wise, loss\-based ablations \(Zhang et al\., [2024](https://arxiv.org/html/2606.18304#bib.bib9); Lu et al\., [2024](https://arxiv.org/html/2606.18304#bib.bib17)\) require evaluating each expert separately, so the cost scales linearly with the number of experts and becomes impractical at scale \(Yang et al\., [2024](https://arxiv.org/html/2606.18304#bib.bib8); Bai et al\., [2025](https://arxiv.org/html/2606.18304#bib.bib7)\)\. Routing statistics \(He et al\., [2025](https://arxiv.org/html/2606.18304#bib.bib10); Lee et al\., [2025](https://arxiv.org/html/2606.18304#bib.bib11); Xie et al\., [2024](https://arxiv.org/html/2606.18304#bib.bib12)\) are cheap to collect, but they only capture selection frequency and aggregation proportion rather than the experts’ true contribution\. Moreover, both families of methods make decisions at the expert level, treating each expert as a whole unit and failing to characterize its internal redundancy, namely *how much capacity can be safely removed* under significant expert heterogeneity\. As a result, accurate and scalable capacity allocation across experts in large MoEs remains underexplored\.
In this paper, we rethink MoE structural pruning based on our observation that MoE information is highly concentrated in a small fraction of channels, making expert\-level importance too coarse to capture internal redundancy\. To the best of our knowledge, we are the first to show that even “high\-importance” experts may not require large capacity\. This motivates a score\-coverage\-maximized allocation that prioritizes high\-contribution structures and avoids wasting capacity on low\-score tails\.
We propose Attribution\-guided and Coverage\-Maximized Expert\-wise Pruning, a framework tailored for MoE slimming\. As shown in [Figure 1](https://arxiv.org/html/2606.18304#S1.F1), instead of allocating prune ratios directly from expert\-level importance, we maximize *channel score coverage* under a global budget, which better aligns with the highly concentrated and unbalanced information distribution in modern MoEs\.
Our framework consists of three components, as shown in [Figure 1](https://arxiv.org/html/2606.18304#S1.F1): \(1\) Attribution\-guided Loss Approximation \(ALA\) efficiently estimates expert importance layer by layer, without exhaustive ablation\. \(2\) Coverage\-maximized Budget Allocation \(CBA\) uses ALA scores to perform coverage\-driven capacity allocation under a global budget, retaining high\-contribution channels while pruning low\-score tails\. \(3\) Alignment\-Aware Redistribution \(AAR\) adjusts dimensions after the initial allocation to satisfy low\-bit kernel constraints, ensuring seamless integration with quantized storage and efficient inference\.
Our framework achieves impressive results on representative MoE architectures, including DeepSeek and Qwen MoEs, across diverse downstream benchmarks\. On general knowledge benchmarks, it delivers over $5\times$ compression with an average accuracy drop of at most 1%\. On reasoning benchmarks, the compressed models consistently approach or even surpass their original counterparts across various models and tasks\. These results demonstrate the effectiveness of our fine\-grained, expert\-wise pruning framework and provide a practical path toward efficient MoE deployment\.
The main contributions are summarized as follows:
- We observe that MoE information is concentrated in a small fraction of channels, making expert\-level importance too coarse to capture the internal redundancy of experts\.
- We are the first to introduce *channel score coverage* as a pruning objective, reformulating capacity allocation as maximizing coverage under a global budget to avoid wasting capacity on low\-score tails\.
- We propose an *attribution\-guided loss approximation* to enable scalable expert importance estimation with $20\times$ fewer GPU hours, and *alignment\-aware redistribution* to satisfy kernel shape constraints, allowing kernel\-friendly storage and efficient inference\.
- Experiments on DeepSeek and Qwen MoEs deliver over $5\times$ compression with strong accuracy, with under 1% drop on general knowledge, and 94\.5 on MATH500 for Qwen3\-30B\-A3B under aggressive 50% pruning\.
## 2 Related Works
Due to space limitations, a more comprehensive discussion is provided in [Appendix D](https://arxiv.org/html/2606.18304#A4)\.
##### MoE Compression\.
For efficient deployment of large MoEs, prior work explores: \(i\) *Expert trimming* and *expert skipping* to reduce runtime computation \(Liu et al\., [2024a](https://arxiv.org/html/2606.18304#bib.bib26); Bai et al\., [2025](https://arxiv.org/html/2606.18304#bib.bib7); Lu et al\., [2024](https://arxiv.org/html/2606.18304#bib.bib17); Chen et al\., [2025b](https://arxiv.org/html/2606.18304#bib.bib21); Huang et al\., [2025](https://arxiv.org/html/2606.18304#bib.bib1)\)\. \(ii\) *Expert slimming* to compress each expert via pruning, quantization, or low\-rank factorization \(Yang et al\., [2024](https://arxiv.org/html/2606.18304#bib.bib8); Xie et al\., [2024](https://arxiv.org/html/2606.18304#bib.bib12); Chen et al\., [2025a](https://arxiv.org/html/2606.18304#bib.bib20); Guo et al\., [2024](https://arxiv.org/html/2606.18304#bib.bib2); Chen et al\., [2024](https://arxiv.org/html/2606.18304#bib.bib5)\)\. Concurrent to this work, an anonymized submission \(provided in the supplementary material\) studies MoE pruning with a focus on structural pruning along the hidden dimension \(Anonymous, [2026](https://arxiv.org/html/2606.18304#bib.bib57)\)\. \(iii\) *Expert merging* to combine similar experts \(Zhao et al\., [2025](https://arxiv.org/html/2606.18304#bib.bib23); Guo et al\., [2025b](https://arxiv.org/html/2606.18304#bib.bib24)\)\. Most approaches operate at the granularity of whole experts or apply uniform compression to each expert, while only limited work explores heterogeneous compression across experts, e\.g\., different low\-rank ranks \(Yang et al\., [2024](https://arxiv.org/html/2606.18304#bib.bib8)\) and mixed\-precision bitwidth assignments \(Chen et al\., [2025a](https://arxiv.org/html/2606.18304#bib.bib20)\)\.
##### Expert Importance Estimation\.
A key challenge in MoE compression is estimating per\-expert importance\. Existing approaches commonly rely on router outputs \(gate weights, token usage\) \(He et al\., [2025](https://arxiv.org/html/2606.18304#bib.bib10); Lee et al\., [2025](https://arxiv.org/html/2606.18304#bib.bib11); Huang et al\., [2024](https://arxiv.org/html/2606.18304#bib.bib19)\), activation\-based metrics \(Dong et al\., [2025](https://arxiv.org/html/2606.18304#bib.bib15); Zhao et al\., [2025](https://arxiv.org/html/2606.18304#bib.bib23)\), performance\-based criteria \(e\.g\., loss or accuracy degradation under ablation\) \(Liu et al\., [2024a](https://arxiv.org/html/2606.18304#bib.bib26); Yang et al\., [2024](https://arxiv.org/html/2606.18304#bib.bib8)\), or learnable scalars \(Bai et al\., [2025](https://arxiv.org/html/2606.18304#bib.bib7)\)\. However, these signals are often inadequate for MoE slimming because they operate at the expert level and ignore how information is concentrated inside each expert, making them suitable only for expert trimming rather than fine\-grained expert slimming\. Our approach advances prior work by replacing costly expert\-wise ablation with an efficient approximation, and goes beyond ranking experts to channel\-level budget allocation via global score\-coverage maximization\.
## 3 Pre\-analysis: *The Inherent Difficulty of Expert\-Level Importance Estimation*
MoEs sparsely route tokens across experts, and experts contribute unequally to final performance, making expert importance estimation a key problem in MoE compression\. However, existing methods typically rely on router outputs or expert statistics, which are often coarse and unreliable for fine\-grained slimming allocations\. Below, we revisit common metrics and their limitations, and motivate our approach by highlighting a fundamental mismatch between expert contribution and internal redundancy\.
### 3\.1 Limitation of Heuristic Metrics
Existing metrics suffer from two key limitations: \(1\) router outputs \(e\.g\., routing weights or token usage\) only quantify token engagement but do not indicate whether an expert’s output is beneficial or harmful; \(2\) raw statistics \(e\.g\., weights, activations, or gradients\) have layer\-dependent magnitudes and also correlate poorly with the actual contribution of experts within a layer\.

Figure 2: Misalignment between router outputs and expert\-wise ablated NLL\. \(a\) and \(b\) rank the top 50 experts by router weight and token usage\. The NLL \(bars\) demonstrates a weak correlation with router outputs\. Notably, the orange bars highlight that even selected experts can provide negative contributions\.
##### Routers can make wrong choices\.
Some prior works estimate expert importance using router outputs, such as post\-softmax probabilities or the number of tokens routed to each expert\. However, [Figure 2](https://arxiv.org/html/2606.18304#S3.F2) shows that these routing statistics can be seriously misaligned with an expert’s true contribution, measured by the expert\-wise ablated Negative Log\-Likelihood \(NLL\)\. Concretely, we plot the expert\-wise ablated $\Delta$NLL \(bars\) alongside router probabilities and token usage on Qwen1\.5\-MoE\-A2\.7B: \(a\) shows the top 50 experts ranked by router weight, and \(b\) shows them ranked by token usage\. Empirically, both router probabilities and token usage exhibit weak correlation with $\Delta$NLL\. Highly prioritized or frequently activated experts can cause only a minor loss increase when removed \(blue bars\), and removing some even reduces the loss \(orange bars below zero\)\. This suggests that routing signals mainly reflect selection and how the experts’ outputs are aggregated, rather than whether an expert is beneficial, and that a selected expert can be noisy or even harmful\.
Takeaway 1: Router\-derived statistics \(post\-softmax weights, token usage\) only reflect the engagement of experts, not their actual contribution\.

Figure 3: Incomparability of raw statistics \(weight, activation and gradient\) across layers and experts\. \(a\) shows that raw statistics of weights and activations grow monotonically while gradients decay with depth; \(b\) reveals the lack of intra\-layer correlation between these statistics and the actual $\Delta\mathrm{NLL}$ when ablated\.
##### Incomparability of raw statistics across layers or experts\.
Beyond router signals, another common heuristic estimates expert importance from raw forward or backward statistics \(e\.g\., weights, activations, or gradients\)\. However, these quantities are not comparable across layers and are often not informative across experts within a layer, making them unreliable as direct importance proxies\. \(1\) *Cross\-layer magnitude bias\.* In [Figure 3](https://arxiv.org/html/2606.18304#S3.F3)\(a\), the layer\-wise mean $\pm$ std of raw weights and activations exhibits a depth\-dependent trend, whereas gradients decay with depth\. Such behavior arises from factors such as residual accumulation and normalization\. In contrast, layerwise $\Delta$NLL \(blue markers\) follows a different pattern and does not align with any raw statistic, indicating that magnitudes are inherently layer\-dependent and unsuitable for cross\-layer comparison\. \(2\) *Intra\-layer non\-correlation\.* [Figure 3](https://arxiv.org/html/2606.18304#S3.F3)\(b\) shows a similar issue within a single layer: after sorting experts by expert\-wise ablated $\Delta$NLL \(bars\), the corresponding weight, activation, and gradient statistics \(mean $\pm$ std\) exhibit no meaningful relation to loss impact, failing to distinguish helpful experts\.
Takeaway 2: Raw statistics \(weights, activations, gradients\) exhibit weak cross\-layer and intra\-layer correlation with the actual loss change when an expert is removed, and thus fail to reliably represent expert importance\.
### 3\.2 Mismatch between Redundancy and Contribution
Some prior work estimates expert importance by measuring the loss increase when an expert is entirely removed\. While this yields an expert\-wise ranking, it does not indicate how much capacity can be safely removed within each expert\.
##### Visualization of channel redundancy\.
To examine how information is distributed inside each expert, in [Figure 4](https://arxiv.org/html/2606.18304#S3.F4)\(a\) we sort channels by their scores \(see Appendix [Section C\.3\.1](https://arxiv.org/html/2606.18304#A3.SS3.SSS1)\) in descending order and plot the cumulative fraction of the score covered by the top\-$k\%$ channels\. The results reveal pronounced heterogeneity in intra\-expert redundancy: for some experts, nearly 40% of channels convey negligible information, whereas other experts exhibit much weaker concentration\. This expert\-specific redundancy cannot be captured by expert\-level importance signals alone\.
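The cumulative coverage curve described above can be sketched in a few lines. This is only an illustration: the function names and the two toy score vectors are made up here, and the actual channel scores are defined in the paper's appendix.

```python
from itertools import accumulate


def coverage_curve(scores):
    """Cumulative share of total score held by the top-n channels, n = 1..len(scores)."""
    ordered = sorted(scores, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    return [s / total for s in accumulate(ordered)]


def channels_for_coverage(scores, target):
    """Smallest number of top-scoring channels whose cumulative share reaches `target`."""
    curve = coverage_curve(scores)
    for n, frac in enumerate(curve, start=1):
        if frac >= target - 1e-12:  # small tolerance for float rounding
            return n
    return len(curve)


# Two hypothetical experts with 10 channels each (made-up scores).
concentrated = [50, 30, 10, 5, 2, 1, 1, 0.5, 0.3, 0.2]
flat = [10] * 10

print(channels_for_coverage(concentrated, 0.9))
print(channels_for_coverage(flat, 0.9))
```

In this toy setting the concentrated expert reaches a given coverage with far fewer channels than the flat one, which is the kind of heterogeneity an expert-level score cannot express.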

Figure 4: \(a\) Cumulative channel score distribution, which reveals that many experts have highly concentrated channel scores\. \(b\) Layerwise output loss under various prune ratios; for some experts, the loss drops rapidly after keeping only a small fraction of channels\.
##### Contribution can be recovered by few channels\.
Given the concentration patterns, whole\-expert ablation becomes an overly coarse proxy for redundancy\. [Figure 4](https://arxiv.org/html/2606.18304#S3.F4)\(b\) shows that even when removing all channels causes a noticeable $\Delta$NLL, the loss may drop rapidly once only a small fraction of channels is restored\. Therefore, whole\-expert ablation mainly serves as a binary signal about whether an expert is important, rather than quantifying how much redundancy exists within the expert or how pruning budgets should be allocated across channels\.
Takeaway 3: Expert\-wise ablation measures expert\-level contribution \($\Delta$NLL\) but does not reflect internal redundancy: channel scores can be highly concentrated in a small fraction of channels\.
In conclusion, there is still no accurate and scalable metric that quantifies such redundancy across layers and experts, which motivates a fine\-grained pruning strategy to determine the actual capacity of each expert\.

Figure 5: The overview of the Attribution\-Guided & Coverage\-Maximized Expert\-wise Pruning framework for MoE models\.
## 4 Proposed Method
In this section, we present a structural\-pruning method for MoE slimming\. An overview is shown in [Figure 5](https://arxiv.org/html/2606.18304#S3.F5)\.
### 4\.1 Attribution\-Guided Loss Approximation
##### Rationale\.
As [Section 3\.1](https://arxiv.org/html/2606.18304#S3.SS1.SSS0.Px1) shows, routing outputs and raw statistics can be weakly aligned with an expert’s true impact on the final output\. This raises a key question: how can we obtain an accurate yet efficient proxy of loss impact for expert\-wise prune budget allocation?
In this subsection, we propose an Attribution\-guided Loss Approximation \(ALA\) to estimate expert contributions, producing a scalable expert\-wise loss proxy that initializes our coverage\-maximized pruning algorithm in [Section 4\.2](https://arxiv.org/html/2606.18304#S4.SS2)\.
##### Derivation\.
Let $h_{\ell}\in\mathbb{R}^{d}$ be the input hidden state of the MoE block in layer $\ell$\. The block output can be written as
$$y_{\ell}=\sum_{e\in\mathcal{E}_{\ell}}g_{\ell,e}(h_{\ell})\,z_{\ell,e}, \tag{1}$$
where $z_{\ell,e}=f_{\ell,e}(h_{\ell})$ and $g_{\ell,e}(h_{\ell})\geq 0$ are the expert output and the router probability for the selected experts $e\in\mathcal{E}_{\ell}$\. Removing expert $e$ corresponds to setting $z_{\ell,e}=0$, which perturbs the layer output by
$$\Delta y_{\ell}^{(e)}=-g_{\ell,e}\,z_{\ell,e}. \tag{2}$$
Let $\mathcal{L}$ denote the loss evaluated against the original layer output\. We approximate the loss change using a first\-order Taylor expansion around $y_{\ell}$\. Thus, the loss change induced by removing expert $e$ is
$$\Delta\mathcal{L}^{(e)}\approx\left(\frac{\partial\mathcal{L}}{\partial y_{\ell}}\right)^{\top}\Delta y_{\ell}^{(e)}=-\left(\frac{\partial\mathcal{L}}{\partial y_{\ell}}\right)^{\top}(g_{\ell,e}z_{\ell,e}). \tag{3}$$
Using the chain rule, the loss gradient with respect to the expert output satisfies $\frac{\partial\mathcal{L}}{\partial z_{\ell,e}}=g_{\ell,e}\frac{\partial\mathcal{L}}{\partial y_{\ell}}$\. Substituting this into the first\-order expansion yields the final approximation
$$\Delta\mathcal{L}_{\ell}^{(e)}\approx-\left(\frac{\partial\mathcal{L}_{\ell}}{\partial z_{\ell,e}}\right)^{\top}z_{\ell,e}, \tag{4}$$
which serves as a proxy of expert contribution and can be computed for all experts within a layer in one backward pass\.
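To make the first-order approximation concrete, here is a minimal pure-Python sketch on a single token. It assumes a toy squared-error loss and hand-written gradients in place of a real model and autograd; all names (`ala_scores`, `ablation_deltas`) and the numbers are hypothetical and not taken from the released code.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def moe_output(gates, expert_outputs):
    """y = sum_e g_e * z_e (Eq. 1) for one token."""
    dim = len(expert_outputs[0])
    return [sum(g * z[i] for g, z in zip(gates, expert_outputs)) for i in range(dim)]


def loss(y, target):
    """Toy squared-error loss standing in for the real objective (an assumption)."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(y, target))


def ala_scores(gates, expert_outputs, target):
    """First-order estimate of the loss change from removing each expert (Eq. 4)."""
    y = moe_output(gates, expert_outputs)
    grad_y = [a - b for a, b in zip(y, target)]  # dL/dy for the toy loss, computed once
    scores = []
    for g, z in zip(gates, expert_outputs):
        grad_z = [g * gy for gy in grad_y]       # chain rule: dL/dz_e = g_e * dL/dy
        scores.append(-dot(grad_z, z))
    return scores


def ablation_deltas(gates, expert_outputs, target):
    """Exact loss change from zeroing each expert's output, for comparison."""
    base = loss(moe_output(gates, expert_outputs), target)
    deltas = []
    for e in range(len(gates)):
        zs = [z if i != e else [0.0] * len(z) for i, z in enumerate(expert_outputs)]
        deltas.append(loss(moe_output(gates, zs), target) - base)
    return deltas


gates = [0.6, 0.3, 0.1]
expert_outputs = [[1.0, 0.5], [0.2, -0.4], [-0.3, 0.1]]
target = [0.7, 0.2]
print(ala_scores(gates, expert_outputs, target))
print(ablation_deltas(gates, expert_outputs, target))
```

For this quadratic toy loss the exact ablation delta differs from the first-order score only by the second-order term $\tfrac{1}{2}g_{\ell,e}^{2}\lVert z_{\ell,e}\rVert^{2}$, so the two lists can be compared directly; with a real network the gap depends on the curvature of the loss.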
##### Implementation and efficiency\.
We collect $\Delta\mathcal{L}_{\ell}^{(e)}$ on a calibration set of roughly 3M tokens using an exponential moving average \(EMA\)\. We perturb all experts at layer $\ell$ by uniformly scaling their activation outputs with a small factor, and then apply a simple square\-root smoothing to the loss, obtaining the expert\-wise importance prior $\boldsymbol{\phi}$\.
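One way the EMA accumulation and square-root smoothing could look is sketched below. The decay value, the clamping of negative scores to zero, and the max-normalisation are all assumptions made for this illustration; the text above does not specify them, and the uniform output-scaling perturbation is not modelled.

```python
import math


def ema_update(running, batch_scores, decay=0.9):
    """Blend a new batch of per-expert scores into the running average."""
    if running is None:
        return list(batch_scores)
    return [decay * r + (1.0 - decay) * b for r, b in zip(running, batch_scores)]


def sqrt_prior(scores, eps=1e-12):
    """Square-root smoothing, then scaling so the largest prior is close to 1."""
    clipped = [max(s, 0.0) for s in scores]  # assumption: negative scores are clamped to 0
    rooted = [math.sqrt(s) for s in clipped]
    top = max(rooted)
    return [r / (top + eps) for r in rooted]


# Hypothetical per-expert loss-change estimates from three calibration batches.
batches = [[0.04, 0.01, -0.002], [0.05, 0.012, 0.001], [0.03, 0.008, 0.0]]
running = None
for batch in batches:
    running = ema_update(running, batch)
phi = sqrt_prior(running)
print(phi)
```

The square root compresses the dynamic range of the scores, so low-scoring experts keep a somewhat larger share of the prior than they would under the raw values.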
Table 1: Comparisons of time costs \(GPU hours\) between loss\-based importance estimation using expert\-wise ablation and ours\.
We compare calibration time in [Table 1](https://arxiv.org/html/2606.18304#S4.T1) against expert\-wise ablation under the same data amount and iterations\. Our method reduces the time cost by 14\-26$\times$ due to a smaller search space\. While heuristics such as greedy search \(Cao et al\., [2024](https://arxiv.org/html/2606.18304#bib.bib55)\) or genetic algorithms \(Liu et al\., [2024a](https://arxiv.org/html/2606.18304#bib.bib26)\) can reduce ablation cost, they may fall into local optima\. Moreover, none of them has been validated on MoEs with hundreds of experts\.
### 4\.2 Coverage\-Maximized Budget Allocation
Algorithm 1 Coverage\-Maximized Allocation Search
1: Input: Score allocation weights $\boldsymbol{\phi}\in\mathbb{R}_{+}^{|\mathcal{G}|}$; prefix sums $\{\mathcal{S}_{g}(n)\}_{g\in\mathcal{G}}$; total scores $\{\mathrm{S}^{tot}_{g}\}_{g\in\mathcal{G}}$; channel budget $N_{\mathrm{budget}}$; total channels $N^{tot}$; tolerance $\varepsilon$\.
2: Output: Channel budgets $\{N_{g}^{\star}\}_{g\in\mathcal{G}}$
3: $\alpha_{\min}\leftarrow 0$, $\alpha_{\max}\leftarrow 1$
4: while $\alpha_{\min}<\alpha_{\max}$ do
5:   $\alpha\leftarrow(\alpha_{\min}+\alpha_{\max})/2$
6:   $\boldsymbol{\rho}\leftarrow\min\big(\alpha\,\boldsymbol{\phi},\,1\big)$
7:   $N(\boldsymbol{\rho})\leftarrow\sum_{g\in\mathcal{G}}\min\left\{n\,\middle|\,\mathcal{S}_{g}(n)\geq\rho_{g}(\alpha)\,\mathrm{S}^{tot}_{g}\right\}$
8:   if $\bigl|N(\boldsymbol{\rho})-N_{\mathrm{budget}}\bigr|\leq\varepsilon\,N^{tot}$ then
9:     $N_{g}^{\star}\leftarrow\min\left\{n\,\middle|\,\mathcal{S}_{g}(n)\geq\rho_{g}(\alpha)\,\mathrm{S}^{tot}_{g}\right\}$, $\forall g\in\mathcal{G}$
10:     break
11:   end if
12:   if $N(\boldsymbol{\rho}(\alpha))>N_{\mathrm{budget}}$ then
13:     $\alpha_{\max}\leftarrow\alpha$
14:   else
15:     $\alpha_{\min}\leftarrow\alpha$
16:   end if
17: end while
18: return $\{N_{g}^{\star}\}_{g\in\mathcal{G}}$
Motivated by the mismatch between expert contribution and internal redundancy observed in [Section 3\.2](https://arxiv.org/html/2606.18304#S3.SS2), we propose a new objective that directly rewards retaining the concentrated, high\-contribution channels\.
##### Unified coverage formulation for inter\- and intra\-layer allocation\.
Consider a group $\mathcal{G}$, which can be either all layers or all experts within a single layer\. Each layer or expert $g\in\mathcal{G}$ contains channels $c\in\mathcal{C}_{g}$ with non\-negative scores $s_{g,c}\geq 0$\. Sorting channels by $s_{g,c}$ in descending order, we denote them as $s_{g,(1)}\geq\cdots\geq s_{g,(|\mathcal{C}_{g}|)}$\. We then precompute the prefix sums over the top $n$ channels as
$$\mathcal{S}_{g}(n)=\sum_{i=1}^{n}s_{g,(i)},\qquad\mathrm{S}^{tot}_{g}=\mathcal{S}_{g}(|\mathcal{C}_{g}|). \tag{5}$$
Given the precomputed prefix sums, the coverage ratio for the top\-$n$ channels is computed directly as $\rho_{g}(n)=\mathcal{S}_{g}(n)/\mathrm{S}^{tot}_{g}$\.
The core idea of our algorithm is to change the objective from *prune ratio* allocation to *channel score coverage* allocation\. Given a global prune target $p$, the total channel budget is $N_{\mathrm{budget}}(p)=(1-p)N^{tot}=(1-p)\sum_{g\in\mathcal{G}}|\mathcal{C}_{g}|$\. We allocate channels by searching for the largest target coverage vector $\boldsymbol{\rho}\in[0,1]^{|\mathcal{G}|}$ that maximizes total covered score while retaining the minimum number of channels:
$$N(\boldsymbol{\rho})=\sum_{g\in\mathcal{G}}N_{g}(\rho_{g})=\sum_{g\in\mathcal{G}}\min\left\{n\,\middle|\,\mathcal{S}_{g}(n)\geq\rho_{g}\,\mathrm{S}^{tot}_{g}\right\}, \tag{6}$$
where each $\rho_{g}\in\boldsymbol{\rho}$ is the coverage ratio of $g$, and $N(\boldsymbol{\rho})$ is the minimal number of channels needed to reach coverage $\boldsymbol{\rho}$\. Since $\mathcal{S}_{g}(n)$ is monotone in $n$, $N(\boldsymbol{\rho})$ can be obtained efficiently via binary search \(Appendix [Algorithm 3](https://arxiv.org/html/2606.18304#alg3)\)\.
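The per-group lookup inside Eq. (6) can be sketched with Python's standard `bisect` module. This is an illustrative reading rather than the appendix algorithm itself; the function names are hypothetical, and returning 0 channels for a zero coverage target is a convention chosen here.

```python
from bisect import bisect_left
from itertools import accumulate


def prefix_sums(scores):
    """S_g(n) for n = 1..|C_g|, with channels sorted by descending score (Eq. 5)."""
    return list(accumulate(sorted(scores, reverse=True)))


def min_channels(prefix, rho):
    """N_g(rho): smallest n with S_g(n) >= rho * S_g^tot (inner term of Eq. 6)."""
    if rho <= 0:
        return 0  # convention assumed here: zero coverage needs no channels
    threshold = rho * prefix[-1]
    # prefix is non-decreasing because scores are non-negative, so bisection applies
    n = bisect_left(prefix, threshold) + 1
    return min(n, len(prefix))


prefix = prefix_sums([5, 1, 3, 1])  # sorted scores 5, 3, 1, 1
print(prefix)
print([min_channels(prefix, rho) for rho in (0.5, 0.8, 0.85, 1.0)])
```

Because each lookup is logarithmic in the number of channels, evaluating $N(\boldsymbol{\rho})$ for a candidate coverage vector stays cheap even with many groups.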
The pipeline of CBA is shown in [Algorithm 1](https://arxiv.org/html/2606.18304#alg1)\. We initialize $\boldsymbol{\rho}$ using a non\-negative importance prior $\boldsymbol{\phi}\in\mathbb{R}^{|\mathcal{G}|}$ derived from ALA and a single scaling factor $\alpha$ \(line 6 in [Algorithm 1](https://arxiv.org/html/2606.18304#alg1)\)\. We apply binary search over $\alpha$ to find the largest $\alpha^{\star}$ such that $N(\boldsymbol{\rho}(\alpha^{\star}))\leq N_{\mathrm{budget}}$, which yields the final budgets for each item $g\in\mathcal{G}$: $N_{g}^{\star}=N_{g}(\rho_{g}(\alpha^{\star}))$\.
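A compact sketch of the search over $\alpha$ follows. It deviates from Algorithm 1 in two labeled ways: it runs a fixed number of bisection steps instead of a tolerance-based stop (to guarantee termination with floating-point $\alpha$), and it keeps the last feasible allocation so the budget is never exceeded. Function names and the toy scores are hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate


def min_channels(prefix, rho):
    """Smallest n with S_g(n) >= rho * S_g^tot; 0 when rho <= 0 (assumed convention)."""
    if rho <= 0:
        return 0
    n = bisect_left(prefix, rho * prefix[-1]) + 1
    return min(n, len(prefix))


def coverage_allocate(scores_per_group, phi, prune_ratio, alpha_hi=1.0, iters=50):
    """Bisection over alpha; returns kept-channel counts per group within the budget."""
    prefixes = [list(accumulate(sorted(s, reverse=True))) for s in scores_per_group]
    n_total = sum(len(p) for p in prefixes)
    budget = int((1.0 - prune_ratio) * n_total + 1e-9)  # N_budget = (1 - p) * N_tot

    def counts(alpha):
        # rho_g = min(alpha * phi_g, 1), then the minimal channel count per group
        return [min_channels(p, min(alpha * w, 1.0)) for p, w in zip(prefixes, phi)]

    if sum(counts(alpha_hi)) <= budget:
        return counts(alpha_hi)        # even the largest alpha fits the budget
    lo, hi = 0.0, alpha_hi
    best = counts(lo)                  # alpha = 0 keeps nothing and is always feasible
    for _ in range(iters):             # fixed step count: a deviation from Algorithm 1
        mid = (lo + hi) / 2.0
        kept = counts(mid)
        if sum(kept) > budget:
            hi = mid                   # over budget: lower the coverage target
        else:
            lo, best = mid, kept       # feasible: remember it and try a larger alpha
    return best


# Two hypothetical experts with equal priors and a 50% prune target.
concentrated = [50, 30, 10, 5, 2, 1, 1, 0.5, 0.3, 0.2]
flat = [10.0] * 10
budgets = coverage_allocate([concentrated, flat], phi=[1.0, 1.0], prune_ratio=0.5)
print(budgets)
```

With equal priors, a uniform 50% ratio would keep five channels in each toy expert, whereas the coverage-driven search gives most of the budget to the flat expert, whose score is spread across many channels.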
##### Inter\-layer vs\. intra\-layer instantiation\.
The procedure above is identical for inter\-layer and intra\-layer allocation, which differ only in the definition of the group $\mathcal{G}$ and the initialization of the importance estimate $\boldsymbol{\phi}$: \(1\) Inter\-layer allocation\. $\mathcal{G}$ consists of all layers, and $\mathcal{C}_{g}$ \($\forall g\in\mathcal{G}$\) includes all channels in one layer\. We set $\boldsymbol{\phi}$ using the layerwise loss, and our algorithm produces budgets $N_{\ell}^{\star}$ for all layers\. \(2\) Intra\-layer allocation\. $\mathcal{G}=\{(\ell,1),\ldots,(\ell,E)\}$ contains all experts at layer $\ell$, and $\mathcal{C}_{g}$ denotes the channels of expert $g\in\mathcal{G}$\. $\boldsymbol{\phi}_{\ell}$ is derived from our ALA \([Section 4\.1](https://arxiv.org/html/2606.18304#S4.SS1)\), and we run the same search under the layer budget $N_{\ell}^{\star}$ to obtain $N_{\ell,e}^{\star},\ \forall e\in\mathcal{E}_{\ell}$\.
Overall, our CBA algorithm takes the ALA outcome as budget initialization and translates it into kept channels by maximizing the accumulated scores within each expert\. As illustrated in [Figure 5](https://arxiv.org/html/2606.18304#S3.F5), unlike prune\-by\-ratio baselines \(gray\) that can waste capacity on low\-score tails, our method retains only high\-score channels to maximize score coverage \(red\)\. A time breakdown of CBA is provided in Appendix [Table 12](https://arxiv.org/html/2606.18304#A3.T12), and more details are in Appendix [Section A\.1](https://arxiv.org/html/2606.18304#A1.SS1)\.
### 4\.3 Alignment\-Aware Redistribution
##### Rationale\.
To remain compatible with low\-bit quantization after pruning, inference backends \(e\.g\., BitsAndBytes\) require the input dimensions to be multiples of a hardware\-friendly block size\. Otherwise, frameworks may \(i\) trigger warnings and fall back to slower generic implementations; \(ii\) suffer degraded throughput, e\.g\., Qwen3\-30B\-A3B reaches 14\.21 tokens/s when dimensions are aligned to 128 but only 10\.23 tokens/s when they are not \(see [Table 2](https://arxiv.org/html/2606.18304#S4.T2)\); and \(iii\) incur padding that wastes storage and compute while conveying no information: Qwen3\-30B\-A3B would have $4.1\%$ padded channels, corresponding to $\approx 4.0\times 10^{8}$ wasted parameters \(see Appendix [Section B\.2](https://arxiv.org/html/2606.18304#A2.SS2)\)\.
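The padding overhead in point (iii) can be illustrated with a short calculation. The per-expert budgets below are invented, and measuring the share relative to the padded total is an assumption of this sketch; the paper's own accounting is in its appendix.

```python
def padded_fraction(channel_counts, block=128):
    """Share of stored channels that are pure padding after rounding each count up to `block`."""
    padded = [-(-n // block) * block for n in channel_counts]  # ceil to a multiple of block
    extra = sum(padded) - sum(channel_counts)
    return extra / sum(padded)


counts = [300, 415, 128, 77, 512, 390]  # hypothetical unaligned per-expert budgets
print(round(padded_fraction(counts), 4))
print(padded_fraction([128, 256]))       # already aligned: no padding
```

Budgets that already sit on block boundaries incur no padding, which is the situation an alignment step aims to reach without padding.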
Table 2: Throughput and latency of Qwen MoE models with and without channel alignment \(under 50% sparsity\)\.
Our coverage\-based allocation produces per\-expert channel budgets $N_{\ell,e}$ that are optimal for score coverage but may violate low\-bit GEMM constraints that require channel dimensions to be multiples of a block size $a$ \(e\.g\., $64$ or $128$\)\. We therefore apply an Alignment\-Aware Redistribution \(AAR\) that converts $\{N_{\ell,e}\}$ into aligned budgets $\{N^{\mathrm{aligned}}_{\ell,e}\}$ while staying as close as possible to the original allocation\.
##### Downward alignment\.
First, we drop extremely small experts using a minimal channel threshold $m$: experts with $N_{\ell,e}<m$ are set to zero and excluded from redistribution, because overly slim experts can convey more noise than information\. For each remaining expert, we apply downward alignment to the nearest multiple of $a$, producing a kernel\-compatible base budget $N^{\mathrm{base}}_{\ell,e}=\big\lfloor\tilde{N}_{\ell,e}/a\big\rfloor\cdot a$\. The released quota after rounding in layer $\ell$ is $R_{\ell}=N_{\ell}^{\star}-\sum_{e}N^{\mathrm{base}}_{\ell,e}$ \(red slices\), corresponding to $q_{\ell}=\lfloor R_{\ell}/a\rfloor$ additional $a$\-blocks that can be reassigned\.
##### Hamilton largest\-remainder apportionment\.
We perform alignment\-aware redistribution via Hamilton’s largest\-remainder rule\. After rounding each active expert’s channel budget down to the nearest multiple of $a$, we obtain per\-expert fractional remainders $r_{\ell,e}\in[0,1)$ that quantify how close each expert is to the next $a$\-block\. We then allocate the $q_{\ell}$ available $a$\-blocks to the experts with the largest remainders, which can be written as
$$b_{\ell,e}=\mathbb{I}\left[e\in\{\pi(1),\ldots,\pi(q_{\ell})\}\right], \tag{7}$$
where $\pi$ sorts experts in descending order of $r_{\ell,e}$\. The final aligned channel budgets are
$$N^{\prime}_{\ell,e}=N^{\mathrm{base}}_{\ell,e}+a\cdot b_{\ell,e}. \tag{8}$$
This preserves the coverage\-based allocation as closely as possible while ensuring divisibility by $a$ for efficient low\-bit kernels\. The complete redistribution procedure and all implementation details are provided in Appendix [Section A\.2](https://arxiv.org/html/2606.18304#A1.SS2)\.
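The downward alignment and largest-remainder step of Eqs. (7) and (8) can be sketched as below. This is a simplified reading: the function name, the tie-breaking by expert index, and the example threshold are assumptions, and the full procedure in the appendix may handle more cases.

```python
def align_budgets(budgets, layer_budget, block=128, min_channels=0):
    """Round per-expert budgets to multiples of `block`, then hand out spare blocks."""
    # Drop overly slim experts (N < m); they are excluded from redistribution.
    active = [n if n >= min_channels else 0 for n in budgets]
    base = [(n // block) * block for n in active]             # downward alignment
    remainders = [(n % block) / block for n in active]        # fractional part in [0, 1)
    spare_blocks = max((layer_budget - sum(base)) // block, 0)  # q = floor(R / a)
    # Sort experts by descending remainder; ties broken by index (an assumption).
    order = sorted(range(len(active)), key=lambda e: (-remainders[e], e))
    eligible = [e for e in order if active[e] > 0]
    winners = set(eligible[:spare_blocks])                    # b_e = 1 for the top-q experts
    return [base[e] + (block if e in winners else 0) for e in range(len(active))]


budgets = [300, 415, 130, 40, 395]  # hypothetical coverage-based budgets for one layer
aligned = align_budgets(budgets, layer_budget=sum(budgets), block=128, min_channels=64)
print(aligned)
```

Each expert receives at most one extra block, matching the indicator in Eq. (7), so any quota beyond the number of active experts stays unassigned in this sketch.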
## 5Experiments
### 5\.1Experimental Setup
##### Models and Compared Methods\.
We evaluate our method on representative open\-source MoEs covering different scales, including DeepSeek\-MoE\-16B \(Dai et al\., [2024](https://arxiv.org/html/2606.18304#bib.bib36)\), DeepSeek\-V2\-Lite \(DeepSeek\-AI, [2024](https://arxiv.org/html/2606.18304#bib.bib37)\), Qwen1\.5\-MoE\-A2\.7B \(Team, [2024](https://arxiv.org/html/2606.18304#bib.bib35)\), and Qwen3\-30B\-A3B\-Thinking \(Qwen\-Team, [2025](https://arxiv.org/html/2606.18304#bib.bib27)\)\. We compare against recent LLM and MoE compression methods, including Wanda \(Sun et al\., [2023](https://arxiv.org/html/2606.18304#bib.bib48)\), which uses unstructured pruning, and MoNE \(Zhang et al\., [2025](https://arxiv.org/html/2606.18304#bib.bib22)\), which uses structural pruning; both are denoted as Px%\. EAC\-MoE \(Chen et al\., [2025b](https://arxiv.org/html/2606.18304#bib.bib21)\) and [He et al\.](https://arxiv.org/html/2606.18304#bib.bib10) combine pruning and quantization \(Px%Qyb\)\. MoE\-I2 \(Yang et al\., [2024](https://arxiv.org/html/2606.18304#bib.bib8)\) uses low\-rank decomposition \(Lx%\), and PuzzleMoE \(Zhao et al\., [2025](https://arxiv.org/html/2606.18304#bib.bib23)\) applies expert merging \(Mx%\)\. Here, $\mathrm{x}\%$ denotes the parameter reduction ratio, and $\mathrm{yb}$ denotes y\-bit quantization\. Detailed descriptions of each baseline can be found in [Section C\.1](https://arxiv.org/html/2606.18304#A3.SS1.SSS0.Px2)\.
##### Implementations\.
We adopt 25% channel pruning combined with 4\-bit quantization via alignment, denoted OursQ \(P25%Q4b\), and a more aggressive 50% pruning without quantization or alignment, denoted Ours \(P50%\)\. We generate pruning allocations using C4 \(Raffel et al\., [2019](https://arxiv.org/html/2606.18304#bib.bib42)\) for general knowledge benchmarks, and GSM8K \(Cobbe et al\., [2021](https://arxiv.org/html/2606.18304#bib.bib38)\) or OpenCodeReasoning \(Ahmad et al\., [2025](https://arxiv.org/html/2606.18304#bib.bib43)\) for math and code, followed by lightweight fine\-tuning on Alpaca \(Taori et al\., [2023](https://arxiv.org/html/2606.18304#bib.bib41)\)\. Extended configurations are provided in Appendix [Section C\.1](https://arxiv.org/html/2606.18304#A3.SS1)\.
Table 3: Comparison on Qwen MoE models\. MMLU is evaluated under 5\-shot setting, while other tasks are evaluated zero\-shot\.
Table 4: Reasoning benchmarks with math and code tasks\.
Table 5: Comparison on DeepSeek MoE models\. MMLU is evaluated under 5\-shot setting, while other tasks are evaluated zero\-shot\.
### 5\.2Overall Results
##### Results on General Tasks\.
[Table 3](https://arxiv.org/html/2606.18304#S5.T3) and [Table 5](https://arxiv.org/html/2606.18304#S5.T5) report zero\-shot accuracy on knowledge tasks together with storage for Qwen MoEs and DeepSeek MoEs, respectively\. Under the quantization\-aware setting, OursQ consistently preserves or improves accuracy while substantially reducing storage across all models\. In particular, on DeepSeek\-MoE\-16B, Qwen1\.5\-MoE\-A2\.7B, and Qwen3\-30B\-A3B, OursQ even surpasses the original model in average performance after lightweight fine\-tuning, with more than $5\times$ storage reduction from jointly applying structural pruning and quantization\. On DeepSeek\-V2\-Lite, OursQ also outperforms Wanda and MoNE even under a more aggressive compression ratio\. Overall, the results show that our attribution\-guided, coverage\-maximized allocation achieves strong compression with negligible accuracy loss, and that alignment\-aware redistribution allows integration with low\-bit quantization for further storage savings\. We further compare channel\-level pruning with expert\-level pruning baselines at matched storage budgets in Appendix [Section C\.2\.1](https://arxiv.org/html/2606.18304#A3.SS2.SSS1), where the Pareto frontier in [Figure 9](https://arxiv.org/html/2606.18304#A3.F9) shows that the advantage grows as the compression budget tightens\.
##### Results on Reasoning Benchmarks\.
We also report accuracy and pass@1 performance on math and code reasoning tasks in [Table 4](https://arxiv.org/html/2606.18304#S5.T4)\. On Qwen1\.5\-MoE\-A2\.7B, our method largely retains GSM8K accuracy in both the quantized and non\-quantized settings \(58\.20 vs\. 61\.50\) while improving HumanEval from 34\.20 to 38\.14\. In contrast, the previous method \(Guo et al\., [2025b](https://arxiv.org/html/2606.18304#bib.bib24)\), which also allocates different prune ratios to experts based on similarity clustering, degrades reasoning accuracy, especially on GSM8K\. For the larger Qwen3\-30B\-A3B, our method remains robust at higher difficulty, reaching 95\.0 on MATH500 under 50% sparsity\. This indicates that attribution\-guided coverage allocation can preserve the critical intermediate representation space while reducing noisy information during complex reasoning, even under aggressive structural compression\.

Figure 6: Comparison of storage and runtime memory usage \(GB\)\.
##### Storage and Memory Reduction\.
We report the storage footprint and peak runtime memory usage in [Figure 6](https://arxiv.org/html/2606.18304#S5.F6)\. Our method yields substantial memory savings\. Applying P50% nearly halves peak memory on all MoEs, e\.g\., from 57\.24 GB to 32\.02 GB on Qwen3\-30B\-A3B\. Combining P25% with Q4b achieves the smallest storage, reducing it by over $3\times$, although throughput may drop slightly due to on\-the\-fly dequantization\. Results for additional MoEs are reported in Appendix [Section C\.2\.3](https://arxiv.org/html/2606.18304#A3.SS2.SSS3)\.
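As a rough sanity check on these figures, the reported peak\-memory numbers and an idealized storage ratio can be worked out by hand\. The second line assumes a 16\-bit baseline and that both the 25% parameter reduction and the 4\-bit quantization apply uniformly to all stored weights; this is an idealization introduced here, not a statement about the actual checkpoint layout\.

$$
\begin{aligned}
\text{P50\% peak memory on Qwen3-30B-A3B:}\quad & \frac{57.24}{32.02}\approx 1.79\times,\qquad 1-\frac{32.02}{57.24}\approx 44\%\ \text{reduction},\\
\text{idealized P25\%Q4b storage:}\quad & \frac{1}{(1-0.25)\cdot\frac{4}{16}}=\frac{1}{0.1875}\approx 5.33\times.
\end{aligned}
$$

The idealized $5.33\times$ is close to the $5.27\times$ memory\-footprint reduction reported for Qwen3\-30B\-A3B in the abstract\. The small gap would be consistent with overheads such as quantization scales or components left unquantized, though the text does not break this down\.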
Table 6: Comparison of inter\- and intra\-layer allocation strategies\.
### 5\.3Ablation Studies
[Table 6](https://arxiv.org/html/2606.18304#S5.T6) compares inter\-layer and intra\-layer sparsity allocation under a 50% pruning budget\. Simple heuristics \(uniform or U\-shaped schedules\) consistently underperform data\-driven strategies, indicating that the expert importance in MoE is highly unbalanced\. The coverage\-based allocation strategy improves both inter\- and intra\-layer results\. For inter\-layer allocation, coverage initialized with smoothed loss performs best \(40\.0 on ARC\-c, 58\.2 on GSM8K\), approaching the non\-pruned model \(40\.4 and 61\.5\)\. For intra\-layer allocation, coverage initialized with the attribution\-based proxy also outperforms others\. These results confirm that coverage\-based allocation is robust under aggressive pruning, and attribution\-approximated loss yields stronger importance estimates and better performance\.
Table 7: Comparison of smoothing functions on Qwen1\.5\-MoE\-A2\.7B under P50%\.
##### Smoothing of layerwise loss\.
The square\-root smoothing is a simple monotone, concave transform for compressing the dynamic range of layerwise losses\. As shown in [Table 7](https://arxiv.org/html/2606.18304#S5.T7), all smoothed variants outperform the unsmoothed baseline, and square\-root gives the best average\. Definitions of all smoothing functions and full per\-task results are provided in Appendix [Section C\.3\.3](https://arxiv.org/html/2606.18304#A3.SS3.SSS3.Px2)\.
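To make the effect of square\-root smoothing concrete, the sketch below compares the dynamic range of hypothetical layerwise losses before and after the transform\. The loss values, the helper names `sqrt_smooth` and `dynamic_range`, and the normalization of smoothed losses into weights that sum to one are illustrative assumptions; how the smoothed losses enter the allocation is defined in the paper's method section and appendix, not here\.

```python
import math


def sqrt_smooth(losses):
    """Square-root smooth non-negative losses, then normalize to sum to 1."""
    smoothed = [math.sqrt(x) for x in losses]
    total = sum(smoothed)
    return [s / total for s in smoothed]


def dynamic_range(values):
    """Ratio between the largest and smallest value."""
    return max(values) / min(values)


# Hypothetical layerwise losses spanning two orders of magnitude.
raw = [0.01, 0.04, 0.25, 1.0]
raw_weights = [x / sum(raw) for x in raw]
smooth_weights = sqrt_smooth(raw)

print(round(dynamic_range(raw_weights), 6))     # about 100
print(round(dynamic_range(smooth_weights), 6))  # about 10
```

Because the square root is concave, a max\-to\-min ratio of $r$ shrinks to $\sqrt{r}$, so a few high\-loss layers dominate the resulting weights less\.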
Table 8: Comparison of two AAR residual reallocation strategies on Qwen1\.5\-MoE\-A2\.7B with different alignment block sizes $a$\. CSQA Avg is the mean accuracy over PIQA, ARC\-c, ARC\-e, BoolQ, HellaSwag, and WinoGrande\.
##### AAR residual reallocation strategy\.
In AAR reallocation, we compare two criteria: *largest removed channels* \(l\-r\-c\), which prioritizes structural capacity recovery, and *largest removed scores* \(l\-r\-s\), which targets score\-weighted importance loss\. [Table 8](https://arxiv.org/html/2606.18304#S5.T8) suggests that channel count and importance\-weighted loss are strongly correlated under coverage\-based pruning, as both criteria yield comparable accuracy across all block sizes $a\in\{64,128,256\}$\.
### 5\.4Visualization and Analysis

Figure 7: Raw loss \(top colorbar\), score coverage ratio \(purple colorbar\) vs\. channel keep ratio \(channels retained after structured pruning, pink colorbar\) for each layer on Qwen3\-30B\-A3B\.
##### Coverage ratio vs\. Prune ratio\.
To examine the coverage\-based allocation, we visualize the layer\-wise raw loss, score coverage ratio, and channel keep ratio in [Figure 7](https://arxiv.org/html/2606.18304#S5.F7)\. Our method consistently retains a high fraction of cumulative channel scores by maximizing score coverage \(about 90% to 99% under P25%\), while the fraction of kept channels varies widely from 60% to 93%\. This highlights that layers or experts with highly concentrated scores can preserve most information with relatively few channels\. We also provide expert\-level visualizations, channel score distributions, and the resulting sparsity allocation in Appendix [Section C\.5\.1](https://arxiv.org/html/2606.18304#A3.SS5.SSS1)\.
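The distinction between coverage ratio and keep ratio can be illustrated with a small sketch\. It is not the allocation algorithm of the paper \(see Appendix A\.1 for that\); it only counts how many top\-scoring channels are needed to reach a target share of the total score\. The score vectors and the helper name `channels_for_coverage` are hypothetical, and scores are assumed non\-negative\.

```python
def channels_for_coverage(scores, target):
    """Smallest number of top-scoring channels whose scores reach `target`
    (a fraction in (0, 1]) of the total score."""
    total = sum(scores)
    running = 0.0
    for k, s in enumerate(sorted(scores, reverse=True), start=1):
        running += s
        if running >= target * total:
            return k
    return len(scores)


# Two hypothetical experts with the same total score but different concentration.
concentrated = [50, 30, 12, 4, 1, 1, 1, 1]
flat = [13, 13, 13, 13, 12, 12, 12, 12]

for name, scores in [("concentrated", concentrated), ("flat", flat)]:
    k = channels_for_coverage(scores, target=0.90)
    print(name, "keep ratio:", k / len(scores))
```

At the same 90% coverage target, the concentrated expert keeps 3 of 8 channels while the flat one keeps all 8, which mirrors the wide spread of keep ratios at similar coverage described above\.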
##### Robustness across routing architectures\.
Our method is not tied to a specific routing design\. The main\-text experiments already cover Qwen\-style standard top\-$k$ routing and the more constrained DeepSeek routing with load balancing\. We further switch the activated experts from top\-$2$ to top\-$1$ on Qwen1\.5\-MoE\-A2\.7B and DeepSeek\-V2\-Lite\. As shown in [Table 9](https://arxiv.org/html/2606.18304#S5.T9), accuracy drops under top\-$1$ routing, but our method still preserves most of the original performance at $50\%$ pruning under both settings\. Full results and router entropy statistics are provided in Appendix [Section C\.4\.2](https://arxiv.org/html/2606.18304#A3.SS4.SSS2), confirming that expert heterogeneity persists across routing dynamics, layer depths, and data sources\.
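For readers less familiar with the routing knob being varied, the following is a generic sketch of top\-$k$ expert selection for a single token\. It is not the router of any specific model evaluated here: the function name `top_k_route`, the softmax over router logits, and the renormalization of the selected gate weights are assumptions, and real Qwen or DeepSeek routers differ in details such as normalization and load balancing\.

```python
import math


def top_k_route(logits, k):
    """Return [(expert_index, gate_weight)] for the k highest-scoring experts."""
    # Numerically stable softmax over the router logits.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]

    # Pick the k most probable experts and renormalize their weights (assumption).
    chosen = sorted(range(len(probs)), key=lambda i: -probs[i])[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


# Hypothetical router logits for one token over four experts.
logits = [2.0, 1.0, 0.5, 0.1]
print(top_k_route(logits, k=2))  # two experts share the token
print(top_k_route(logits, k=1))  # a single expert handles the token
```

Lowering $k$ concentrates each token on fewer experts, which is the setting change examined in the robustness experiment\.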
![[Uncaptioned image]](https://arxiv.org/html/2606.18304v1/x8.png)
Figure 8: Combinations of pruning and quantization\. The blue point is the 16\-bit baseline at $P=0\%$\.
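Table 9: Robustness under different top\-$k$ routing strategies\.

| Model | Top\-$k$ | P% | Avg |
| --- | --- | --- | --- |
| Qwen1\.5\-MoE\-A2\.7B | 4 | 0 | 62\.98 |
| Qwen1\.5\-MoE\-A2\.7B | 4 | 50 | 58\.27 |
| Qwen1\.5\-MoE\-A2\.7B | 2 | 0 | 62\.98 |
| Qwen1\.5\-MoE\-A2\.7B | 2 | 50 | 58\.27 |
| Qwen1\.5\-MoE\-A2\.7B | 1 | 0 | 60\.56 |
| Qwen1\.5\-MoE\-A2\.7B | 1 | 50 | 55\.15 |
| DeepSeek\-V2\-Lite | 6 | 0 | 62\.23 |
| DeepSeek\-V2\-Lite | 6 | 50 | 59\.08 |
| DeepSeek\-V2\-Lite | 2 | 0 | 54\.84 |
| DeepSeek\-V2\-Lite | 2 | 50 | 50\.12 |
| DeepSeek\-V2\-Lite | 1 | 0 | 62\.23 |
| DeepSeek\-V2\-Lite | 1 | 50 | 59\.08 |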
##### Additional analyses\.
Appendix [Section C\.4\.1](https://arxiv.org/html/2606.18304#A3.SS4.SSS1) evaluates calibration\-corpus sensitivity and shows that general tasks remain stable across general\-domain corpora, while math and code benefit from matched calibration data\. We also provide wider pruning–quantization sweeps, second\-order attribution comparisons, and AAR hyperparameter studies in Appendix [Sections C\.2\.2](https://arxiv.org/html/2606.18304#A3.SS2.SSS2), [C\.3\.2](https://arxiv.org/html/2606.18304#A3.SS3.SSS2) and [C\.3\.4](https://arxiv.org/html/2606.18304#A3.SS3.SSS4), which support the same accuracy–efficiency and robustness trends\.
## 6Conclusion
We propose an attribution\-guided, expert\-wise slimming framework for MoEs that reformulates pruning as maximizing channel\-score coverage, which better captures internal redundancy and avoids allocating capacity to low\-contribution structures\. With alignment\-aware redistribution, the pruned model remains kernel\-compatible for low\-bit quantization and achieves substantial compression while preserving accuracy\. Experiments on modern MoEs demonstrate a practical path toward efficient MoE deployment\.
## Impact Statement
This paper presents work whose goal is to advance the field of machine learning\. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here\.
## Acknowledgement
This work was supported by the National Natural Science Foundation of China \(Nos\. 62476018, 62525601\), the Academic Excellence Foundation of BUAA for PhD Students, and the Fundamental Research Funds for the Central Universities\.
Dr Tao’s research is partially supported by NTU RSR and Start Up Grants\.
## References
- W\. U\. Ahmad, S\. Narenthiran, S\. Majumdar, A\. Ficek, S\. Jain, J\. Huang, V\. Noroozi, and B\. Ginsburg \(2025\)Opencodereasoning: advancing data distillation for competitive coding\.arXiv preprint arXiv:2504\.01943\.Cited by:[§C\.1](https://arxiv.org/html/2606.18304#A3.SS1.SSS0.Px4.p1.3),[§5\.1](https://arxiv.org/html/2606.18304#S5.SS1.SSS0.Px1.p2.4)\.
- Y\. An, X\. Zhao, T\. Yu, M\. Tang, and J\. Wang \(2024\)Fluctuation\-based adaptive structured pruning for large language models\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.38,pp\. 10865–10873\.Cited by:[§1](https://arxiv.org/html/2606.18304#S1.p1.1)\.
- Anonymous \(2026\)Orchestrating hidden\-intermediate pruning\-and\-distill for moes slimming\.Note: Anonymous ICML 2026 submission \(under review\)\. Anonymized concurrent submission\. The anonymized PDF is provided in the supplementary material\.Cited by:[§2](https://arxiv.org/html/2606.18304#S2.p2.1)\.
- S\. Bai, H\. Li, J\. Zhang, Z\. Hong, and S\. Guo \(2025\)DiEP: adaptive mixture\-of\-experts compression through differentiable expert pruning\.Advances in neural information processing systems\.Cited by:[Appendix D](https://arxiv.org/html/2606.18304#A4.p2.1),[Appendix D](https://arxiv.org/html/2606.18304#A4.p4.1),[§1](https://arxiv.org/html/2606.18304#S1.p2.1),[§2](https://arxiv.org/html/2606.18304#S2.SS0.SSS0.Px1.p1.1),[§2](https://arxiv.org/html/2606.18304#S2.p2.1)\.
- Y\. Bisk, R\. Zellers, R\. L\. Bras, J\. Gao, and Y\. Choi \(2019\)PIQA: reasoning about physical commonsense in natural language\.InAAAI Conference on Artificial Intelligence,Cited by:[§C\.1](https://arxiv.org/html/2606.18304#A3.SS1.SSS0.Px5.p1.1)\.
- M\. Cao, G\. Li, J\. Ji, J\. Zhang, X\. Ma, S\. Liu, and L\. Yin \(2024\)Condense, don’t just prune: enhancing efficiency and performance in moe layer pruning\.arXiv preprint arXiv:2412\.00069\.Cited by:[§4\.1](https://arxiv.org/html/2606.18304#S4.SS1.SSS0.Px3.p2.1)\.
- H\. Chen, C\. Lv, L\. Ding, H\. Qin, X\. Zhou, Y\. Ding, X\. Liu, M\. Zhang, J\. Guo,et al\.\(2024\)DB\-LLM: accurate dual\-binarization for efficient LLMs\.InFindings of the Association for Computational Linguistics: ACL,Cited by:[§2](https://arxiv.org/html/2606.18304#S2.p2.1)\.
- M\. Chen, J\. Tworek, H\. Jun, Q\. Yuan, H\. P\. de Oliveira Pinto, J\. Kaplan, H\. Edwards, Y\. Burda, N\. Joseph, G\. Brockman, A\. Ray, R\. Puri, G\. Krueger, M\. Petrov, H\. Khlaaf, G\. Sastry, P\. Mishkin, B\. Chan, S\. Gray, N\. Ryder, M\. Pavlov, A\. Power, L\. Kaiser, M\. Bavarian, C\. Winter, P\. Tillet, F\. P\. Such, D\. Cummings, M\. Plappert, F\. Chantzis, E\. Barnes, A\. Herbert\-Voss, W\. H\. Guss, A\. Nichol, A\. Paino, N\. Tezak, J\. Tang, I\. Babuschkin, S\. Balaji, S\. Jain, W\. Saunders, C\. Hesse, A\. N\. Carr, J\. Leike, J\. Achiam, V\. Misra, E\. Morikawa, A\. Radford, M\. Knight, M\. Brundage, M\. Murati, K\. Mayer, P\. Welinder, B\. McGrew, D\. Amodei, S\. McCandlish, I\. Sutskever, and W\. Zaremba \(2021\)Evaluating large language models trained on code\.External Links:2107\.03374Cited by:[§C\.1](https://arxiv.org/html/2606.18304#A3.SS1.SSS0.Px5.p1.1)\.
- Y\. Chen, Y\. Xie, R\. Yang, W\. Jiang, W\. Wang, Y\. He, Y\. Chen, P\. Zhao, and Y\. Wang \(2025a\)Collaborative compression for large\-scale moe deployment on edge\.Cited by:[Appendix D](https://arxiv.org/html/2606.18304#A4.p2.1),[Appendix D](https://arxiv.org/html/2606.18304#A4.p3.1),[Appendix D](https://arxiv.org/html/2606.18304#A4.p4.1),[§2](https://arxiv.org/html/2606.18304#S2.p2.1)\.
- Y\. Chen, Y\. Shao, P\. Wang, and J\. Cheng \(2025b\)EAC\-moe: expert\-selection aware compressor for mixture\-of\-experts large language models\.InAnnual Meeting of the Association for Computational Linguistics,Cited by:[§C\.1](https://arxiv.org/html/2606.18304#A3.SS1.SSS0.Px2.p1.3),[Appendix D](https://arxiv.org/html/2606.18304#A4.p2.1),[§2](https://arxiv.org/html/2606.18304#S2.p2.1),[§5\.1](https://arxiv.org/html/2606.18304#S5.SS1.SSS0.Px1.p1.8)\.
- M\. N\. R\. Chowdhury, M\. Wang, K\. E\. Maghraoui, N\. Wang, P\. Chen, and C\. Carothers \(2024\)A provably effective method for pruning experts in fine\-tuned sparse mixture\-of\-experts\.InInternational Conference on Machine Learning,External Links:[Document](https://dx.doi.org/10.48550/arXiv.2405.16646)Cited by:[Appendix D](https://arxiv.org/html/2606.18304#A4.p2.1),[Appendix D](https://arxiv.org/html/2606.18304#A4.p4.1)\.
- C\. Clark, K\. Lee, M\. Chang, T\. Kwiatkowski, M\. Collins, and K\. Toutanova \(2019\)BoolQ: exploring the surprising difficulty of natural yes/no questions\.InNorth American Chapter of the Association for Computational Linguistics,Cited by:[§C\.1](https://arxiv.org/html/2606.18304#A3.SS1.SSS0.Px5.p1.1)\.
- P\. Clark, I\. Cowhey, O\. Etzioni, T\. Khot, A\. Sabharwal, C\. Schoenick, and O\. Tafjord \(2018\)Think you have solved question answering? try arc, the ai2 reasoning challenge\.InarXiv\.org,Cited by:[§C\.1](https://arxiv.org/html/2606.18304#A3.SS1.SSS0.Px5.p1.1)\.
- K\. Cobbe, V\. Kosaraju, M\. Bavarian, M\. Chen, H\. Jun, L\. Kaiser, M\. Plappert, J\. Tworek, J\. Hilton, R\. Nakano, C\. Hesse, and J\. Schulman \(2021\)Training verifiers to solve math word problems\.InarXiv\.org,Cited by:[§C\.1](https://arxiv.org/html/2606.18304#A3.SS1.SSS0.Px4.p1.3),[§C\.1](https://arxiv.org/html/2606.18304#A3.SS1.SSS0.Px5.p1.1),[§5\.1](https://arxiv.org/html/2606.18304#S5.SS1.SSS0.Px1.p2.4)\.
- D\. Dai, C\. Deng, C\. Zhao, R\. Xu, H\. Gao, D\. Chen, J\. Li, W\. Zeng, X\. Yu, Y\. Wu, Z\. Xie, Y\. K\. Li, P\. Huang, F\. Luo, C\. Ruan, Z\. Sui, and W\. Liang \(2024\)DeepSeekMoE: towards ultimate expert specialization in mixture\-of\-experts language models\.pp\. 1280–1297\.Cited by:[§C\.1](https://arxiv.org/html/2606.18304#A3.SS1.SSS0.Px1.p1.1),[§5\.1](https://arxiv.org/html/2606.18304#S5.SS1.SSS0.Px1.p1.8)\.
- DeepSeek\-AI \(2024\)DeepSeek\-v2: a strong, economical, and efficient mixture\-of\-experts language model\.External Links:2405\.04434Cited by:[§C\.1](https://arxiv.org/html/2606.18304#A3.SS1.SSS0.Px1.p1.1),[§5\.1](https://arxiv.org/html/2606.18304#S5.SS1.SSS0.Px1.p1.8)\.
- Z\. Dong, H\. Peng, P\. Liu, W\. X\. Zhao, D\. Wu, F\. Xiao, and Z\. Wang \(2025\)Domain\-specific pruning of large mixture\-of\-experts models with few\-shot demonstrations\.Cited by:[Appendix D](https://arxiv.org/html/2606.18304#A4.p2.1),[Appendix D](https://arxiv.org/html/2606.18304#A4.p4.1),[§2](https://arxiv.org/html/2606.18304#S2.SS0.SSS0.Px1.p1.1)\.
- S\. Gao, C\. Lin, T\. Hua, T\. Zheng, Y\. Shen, H\. Jin, and Y\. Hsu \(2024\)DISP\-llm: dimension\-independent structural pruning for large language models\.InNeural Information Processing Systems,External Links:[Document](https://dx.doi.org/10.48550/arXiv.2410.11988)Cited by:[§1](https://arxiv.org/html/2606.18304#S1.p1.1)\.
- R\. Gong, Y\. Ding, Z\. Wang, C\. Lv, X\. Zheng, J\. Du, Y\. Yong, S\. Gu, H\. Qin,et al\.\(2025\)A survey of low\-bit large language models: basics, systems, and algorithms\.Neural Networks,pp\. 107856\.Cited by:[§1](https://arxiv.org/html/2606.18304#S1.p1.1)\.
- D\. Guo, D\. Yang, H\. Zhang, J\. Song, R\. Zhang, R\. Xu, Q\. Zhu, S\. Ma, P\. Wang, X\. Bi,et al\.\(2025a\)Deepseek\-r1: incentivizing reasoning capability in llms via reinforcement learning\.arXiv preprint arXiv:2501\.12948\.Cited by:[§1](https://arxiv.org/html/2606.18304#S1.p1.1)\.
- H\. Guo, J\. Yao, B\. Wang, J\. Du, S\. Cao, D\. Di, S\. Zhang, and Z\. Li \(2025b\)Cluster\-driven expert pruning for mixture\-of\-experts large language models\.arXiv preprint arXiv:2504\.07807\.Cited by:[§C\.1](https://arxiv.org/html/2606.18304#A3.SS1.SSS0.Px2.p1.3),[Appendix D](https://arxiv.org/html/2606.18304#A4.p2.1),[§2](https://arxiv.org/html/2606.18304#S2.p2.1),[§5\.2](https://arxiv.org/html/2606.18304#S5.SS2.SSS0.Px2.p1.1)\.
- J\. Guo, J\. Wu, Z\. Wang, J\. Liu, G\. Yang, Y\. Ding, R\. Gong, H\. Qin, and X\. Liu \(2024\)Compressing large language models by joint sparsification and quantization\.InInternational Conference on Machine Learning,Cited by:[§1](https://arxiv.org/html/2606.18304#S1.p1.1),[§2](https://arxiv.org/html/2606.18304#S2.p2.1)\.
- S\. He, D\. Dong, L\. Ding, and A\. Li \(2025\)Towards efficient mixture of experts: a holistic study of compression techniques\.Transactions on Machine Learning Research\.External Links:ISSN 2835\-8856,[Link](https://openreview.net/forum?id=HTpMOl6xSI)Cited by:[§C\.1](https://arxiv.org/html/2606.18304#A3.SS1.SSS0.Px2.p1.3),[Appendix D](https://arxiv.org/html/2606.18304#A4.p2.1),[Appendix D](https://arxiv.org/html/2606.18304#A4.p4.1),[§1](https://arxiv.org/html/2606.18304#S1.p2.1),[§2](https://arxiv.org/html/2606.18304#S2.SS0.SSS0.Px1.p1.1),[§5\.1](https://arxiv.org/html/2606.18304#S5.SS1.SSS0.Px1.p1.8),[Table 3](https://arxiv.org/html/2606.18304#S5.T3.11.11.11.3),[Table 3](https://arxiv.org/html/2606.18304#S5.T3.9.9.9.2),[Table 5](https://arxiv.org/html/2606.18304#S5.T5.26.26.26.2),[Table 5](https://arxiv.org/html/2606.18304#S5.T5.28.28.28.3),[Table 5](https://arxiv.org/html/2606.18304#S5.T5.5.5.5.2),[Table 5](https://arxiv.org/html/2606.18304#S5.T5.7.7.7.3)\.
- D\. Hendrycks, C\. Burns, S\. Basart, A\. Zou, M\. Mazeika, D\. Song, and J\. Steinhardt \(2020\)Measuring massive multitask language understanding\.ArXivabs/2009\.03300\.Cited by:[§C\.1](https://arxiv.org/html/2606.18304#A3.SS1.SSS0.Px5.p1.1)\.
- W\. Huang, Y\. Liao, J\. Liu, R\. He, H\. Tan, S\. Zhang, H\. Li, S\. Liu, and X\. Qi \(2024\)MC\-moe: mixture compressor for mixture\-of\-experts llms gains more\.InarXiv\.org,External Links:[Document](https://dx.doi.org/10.48550/arXiv.2410.06270)Cited by:[§1](https://arxiv.org/html/2606.18304#S1.p1.1),[§2](https://arxiv.org/html/2606.18304#S2.SS0.SSS0.Px1.p1.1)\.
- Y\. Huang, Z\. Wang, Z\. Yuan, Y\. Ding, R\. Gong, J\. Guo, X\. Liu, and J\. Zhang \(2025\)MoDES: accelerating mixture\-of\-experts multimodal large language models via dynamic expert skipping\.arXiv preprint arXiv:2511\.15690\.Cited by:[§2](https://arxiv.org/html/2606.18304#S2.p2.1)\.
- N\. Jain, K\. Han, A\. Gu, W\. Li, F\. Yan, T\. Zhang, S\. Wang, A\. Solar\-Lezama, K\. Sen, and I\. Stoica \(2024\)Livecodebench: holistic and contamination free evaluation of large language models for code\.arXiv preprint arXiv:2403\.07974\.Cited by:[§C\.1](https://arxiv.org/html/2606.18304#A3.SS1.SSS0.Px5.p1.1)\.
- A\. Q\. Jiang, A\. Sablayrolles, A\. Roux, A\. Mensch, B\. Savary, C\. Bamford, D\. S\. Chaplot, D\. d\. l\. Casas, E\. B\. Hanna, F\. Bressand,et al\.\(2024\)Mixtral of experts\.arXiv preprint arXiv:2401\.04088\.Cited by:[Appendix D](https://arxiv.org/html/2606.18304#A4.p1.1)\.
- J\. Lee, S\. Hwang, A\. Qiao, D\. F\. Campos, Z\. Yao, and Y\. He \(2025\)Stun: structured\-then\-unstructured pruning for scalable moe pruning\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 13660–13676\.Cited by:[Appendix D](https://arxiv.org/html/2606.18304#A4.p2.1),[Appendix D](https://arxiv.org/html/2606.18304#A4.p4.1),[§1](https://arxiv.org/html/2606.18304#S1.p2.1),[§2](https://arxiv.org/html/2606.18304#S2.SS0.SSS0.Px1.p1.1)\.
- N\. Lee, T\. Ajanthan, and P\. H\. S\. Torr \(2018\)SNIP: single\-shot network pruning based on connection sensitivity\.InInternational Conference on Learning Representations,Cited by:[6th item](https://arxiv.org/html/2606.18304#A3.I1.i6.p1.1.1)\.
- L\. Li, Q\. Zhu, J\. Wang, W\. Li, H\. Gu, S\. Han, and Y\. Guo \(2025\)Sub\-moe: efficient mixture\-of\-expert llms compression via subspace expert merging\.InarXiv\.org,Cited by:[Appendix D](https://arxiv.org/html/2606.18304#A4.p2.1),[Appendix D](https://arxiv.org/html/2606.18304#A4.p4.1)\.
- H\. Lightman, V\. Kosaraju, Y\. Burda, H\. Edwards, B\. Baker, T\. Lee, J\. Leike, J\. Schulman, I\. Sutskever, and K\. Cobbe \(2023\)Let’s verify step by step\.arXiv preprint arXiv:2305\.20050\.Cited by:[§C\.1](https://arxiv.org/html/2606.18304#A3.SS1.SSS0.Px5.p1.1)\.
- E\. Liu, J\. Zhu, Z\. Lin, X\. Ning, M\. B\. Blaschko, S\. Yan, G\. Dai, H\. Yang, and Y\. Wang \(2024a\)Efficient expert pruning for sparse mixture\-of\-experts language models: enhancing performance and reducing inference costs\.InarXiv\.org,External Links:[Document](https://dx.doi.org/10.48550/arXiv.2407.00945)Cited by:[Appendix D](https://arxiv.org/html/2606.18304#A4.p2.1),[Appendix D](https://arxiv.org/html/2606.18304#A4.p4.1),[Appendix D](https://arxiv.org/html/2606.18304#A4.p6.2),[§2](https://arxiv.org/html/2606.18304#S2.SS0.SSS0.Px1.p1.1),[§2](https://arxiv.org/html/2606.18304#S2.p2.1),[§4\.1](https://arxiv.org/html/2606.18304#S4.SS1.SSS0.Px3.p2.1)\.
- S\. Liu, C\. Wang, H\. Yin, P\. Molchanov, Y\. F\. Wang, K\. Cheng, and M\. Chen \(2024b\)DoRA: weight\-decomposed low\-rank adaptation\.arXiv preprint arXiv:2402\.09353\.Cited by:[§C\.1](https://arxiv.org/html/2606.18304#A3.SS1.SSS0.Px4.p1.3)\.
- X\. Lu, Q\. Liu, Y\. Xu, A\. Zhou, S\. Huang, B\. Zhang, J\. Yan, and H\. Li \(2024\)Not all experts are equal: efficient expert pruning and skipping for mixture\-of\-experts large language models\.InAnnual Meeting of the Association for Computational Linguistics,External Links:[Document](https://dx.doi.org/10.48550/arXiv.2402.14800)Cited by:[Appendix D](https://arxiv.org/html/2606.18304#A4.p2.1),[Appendix D](https://arxiv.org/html/2606.18304#A4.p4.1),[§1](https://arxiv.org/html/2606.18304#S1.p2.1),[§2](https://arxiv.org/html/2606.18304#S2.p2.1)\.
- C\. Lv, B\. Zhang, Y\. Yong, R\. Gong, Y\. Huang, S\. Gu, J\. Wu, Y\. Shi, J\. Guo,et al\.\(2026\)LLMC\+: benchmarking vision\-language model compression with a plug\-and\-play toolkit\.InAAAI Conference on Artificial Intelligence,Cited by:[§1](https://arxiv.org/html/2606.18304#S1.p1.1)\.
- X\. Ma, G\. Fang, and X\. Wang \(2023\)Llm\-pruner: on the structural pruning of large language models\.Advances in neural information processing systems36,pp\. 21702–21720\.Cited by:[§1](https://arxiv.org/html/2606.18304#S1.p1.1)\.
- A\. Muzio, A\. Sun, and C\. He \(2024\)SEER\-moe: sparse expert efficiency through regularization for mixture\-of\-experts\.InarXiv\.org,External Links:[Document](https://dx.doi.org/10.48550/arXiv.2404.05089)Cited by:[Appendix D](https://arxiv.org/html/2606.18304#A4.p2.1),[Appendix D](https://arxiv.org/html/2606.18304#A4.p4.1)\.
- Qwen\-Team \(2025\)Qwen3 technical report\.External Links:2505\.09388,[Link](https://arxiv.org/abs/2505.09388)Cited by:[§C\.1](https://arxiv.org/html/2606.18304#A3.SS1.SSS0.Px1.p1.1),[Appendix D](https://arxiv.org/html/2606.18304#A4.p1.1),[§1](https://arxiv.org/html/2606.18304#S1.p1.1),[§5\.1](https://arxiv.org/html/2606.18304#S5.SS1.SSS0.Px1.p1.8)\.
- C\. Raffel, N\. M\. Shazeer, A\. Roberts, K\. Lee, S\. Narang, M\. Matena, Y\. Zhou, W\. Li, and P\. J\. Liu \(2019\)Exploring the limits of transfer learning with a unified text\-to\-text transformer\.J\. Mach\. Learn\. Res\.21,pp\. 140:1–140:67\.Cited by:[§C\.1](https://arxiv.org/html/2606.18304#A3.SS1.SSS0.Px4.p1.3),[§5\.1](https://arxiv.org/html/2606.18304#S5.SS1.SSS0.Px1.p2.4)\.
- D\. Rein, B\. L\. Hou, A\. C\. Stickland, J\. Petty, R\. Y\. Pang, J\. Dirani, J\. Michael, and S\. R\. Bowman \(2023\)GPQA: a graduate\-level google\-proof q&a benchmark\.External Links:2311\.12022,[Link](https://arxiv.org/abs/2311.12022)Cited by:[§C\.1](https://arxiv.org/html/2606.18304#A3.SS1.SSS0.Px5.p1.1)\.
- K\. Sakaguchi, R\. L\. Bras, C\. Bhagavatula, and Y\. Choi \(n\.d\.\)WinoGrande: an adversarial winograd schema challenge at scale\.Proceedings of the AAAI Conference on Artificial Intelligence\.Cited by:[§C\.1](https://arxiv.org/html/2606.18304#A3.SS1.SSS0.Px5.p1.1)\.
- O\. Skean, M\. R\. Arefin, D\. Zhao, N\. Patel, J\. Naghiyev, Y\. LeCun, and R\. Shwartz\-Ziv \(2025\)Layer by layer: uncovering hidden representations in language models\.arXiv preprint arXiv:2502\.02013\.Cited by:[§A\.1](https://arxiv.org/html/2606.18304#A1.SS1.SSS0.Px1.p4.1)\.
- J\. Song, Y\. Chen, X\. Wang, C\. Shen, and M\. Song \(2019\)Deep model transferability from attribution maps\.Advances in Neural Information Processing Systems32\.Cited by:[4th item](https://arxiv.org/html/2606.18304#A3.I1.i4.p1.5.1),[5th item](https://arxiv.org/html/2606.18304#A3.I1.i5.p1.1.1)\.
- M\. Sun, Z\. Liu, A\. Bair, and J\. Z\. Kolter \(2023\)A simple and effective pruning approach for large language models\.arXiv preprint arXiv:2306\.11695\.Cited by:[3rd item](https://arxiv.org/html/2606.18304#A3.I1.i3.p1.1.1),[§C\.1](https://arxiv.org/html/2606.18304#A3.SS1.SSS0.Px2.p1.3),[§5\.1](https://arxiv.org/html/2606.18304#S5.SS1.SSS0.Px1.p1.8)\.
- R\. Taori, I\. Gulrajani, T\. Zhang, Y\. Dubois, X\. Li, C\. Guestrin, P\. Liang, and T\. B\. Hashimoto \(2023\)Stanford alpaca: an instruction\-following llama model\.GitHub\.Note:[https://github\.com/tatsu\-lab/stanford\_alpaca](https://github.com/tatsu-lab/stanford_alpaca)Cited by:[§C\.1](https://arxiv.org/html/2606.18304#A3.SS1.SSS0.Px4.p1.3),[§5\.1](https://arxiv.org/html/2606.18304#S5.SS1.SSS0.Px1.p2.4)\.
- Q\. Team \(2024\)Qwen1\.5\-moe: matching 7b model performance with 1/3 activated parameters\.External Links:[Link](https://qwenlm.github.io/blog/qwen-moe/)Cited by:[§C\.1](https://arxiv.org/html/2606.18304#A3.SS1.SSS0.Px1.p1.1),[§5\.1](https://arxiv.org/html/2606.18304#S5.SS1.SSS0.Px1.p1.8)\.
- Y\. Xie, Z\. Zhang, D\. Zhou, C\. Xie, Z\. Song, X\. Liu, Y\. Wang, X\. Lin, and A\. Xu \(2024\)MoE\-pruner: pruning mixture\-of\-experts large language model using the hints from its router\.InarXiv\.org,External Links:[Document](https://dx.doi.org/10.48550/arXiv.2410.12013)Cited by:[Appendix D](https://arxiv.org/html/2606.18304#A4.p2.1),[Appendix D](https://arxiv.org/html/2606.18304#A4.p4.1),[§1](https://arxiv.org/html/2606.18304#S1.p2.1),[§2](https://arxiv.org/html/2606.18304#S2.p2.1)\.
- H\. Xu, H\. Wu, X\. Ke, J\. Wu, R\. Xu, and J\. Xu \(2025\)MCMoE: completing missing modalities with mixture of experts for incomplete multimodal action quality assessment\.External Links:2511\.17397,[Link](https://arxiv.org/abs/2511.17397)Cited by:[Appendix D](https://arxiv.org/html/2606.18304#A4.p2.1),[Appendix D](https://arxiv.org/html/2606.18304#A4.p4.1)\.
- F\. Xue, Z\. Zheng, Y\. Fu, J\. Ni, Z\. Zheng, W\. Zhou, and Y\. You \(2024\)OpenMoE: an early effort on open mixture\-of\-experts language models\.arXiv preprint arXiv:2402\.01739\.Cited by:[§1](https://arxiv.org/html/2606.18304#S1.p1.1)\.
- C\. Yang, Y\. Sui, J\. Xiao, L\. Huang, Y\. Gong, Y\. Duan, W\. Jia, M\. Yin, Y\. Cheng, and B\. Yuan \(2024\) MoE\-i2: compressing mixture of experts models through inter\-expert pruning and intra\-expert low\-rank decomposition\. In Findings of the Association for Computational Linguistics: EMNLP 2024, Miami, Florida, USA, pp\. 10456–10466\. Cited by: [§C\.1](https://arxiv.org/html/2606.18304#A3.SS1.SSS0.Px2.p1.3), [§C\.1](https://arxiv.org/html/2606.18304#A3.SS1.SSS0.Px4.p1.3), [Appendix D](https://arxiv.org/html/2606.18304#A4.p2.1), [Appendix D](https://arxiv.org/html/2606.18304#A4.p3.1), [Appendix D](https://arxiv.org/html/2606.18304#A4.p4.1), [§1](https://arxiv.org/html/2606.18304#S1.p2.1), [§2](https://arxiv.org/html/2606.18304#S2.SS0.SSS0.Px1.p1.1), [§2](https://arxiv.org/html/2606.18304#S2.p2.1), [§5\.1](https://arxiv.org/html/2606.18304#S5.SS1.SSS0.Px1.p1.8)\.
- R\. Zellers, A\. Holtzman, Y\. Bisk, A\. Farhadi, and Y\. Choi \(2019\) HellaSwag: can a machine really finish your sentence?\. In Annual Meeting of the Association for Computational Linguistics, pp\. 4791–4800\. Cited by: [§C\.1](https://arxiv.org/html/2606.18304#A3.SS1.SSS0.Px5.p1.1)\.
- G\. Zhang, Y\. Han, Y\. Lou, W\. Zhao, Y\. Zhang, and Y\. You \(2025\) MoNE: replacing redundant experts with lightweight novices for structured pruning of moe\. Cited by: [§C\.1](https://arxiv.org/html/2606.18304#A3.SS1.SSS0.Px2.p1.3), [Appendix D](https://arxiv.org/html/2606.18304#A4.p2.1), [Appendix D](https://arxiv.org/html/2606.18304#A4.p4.1), [§5\.1](https://arxiv.org/html/2606.18304#S5.SS1.SSS0.Px1.p1.8)\.
- Y\. Zhang and T\. Math\-AI \(2025\) American invitational mathematics examination \(aime\) 2025\. Cited by: [§C\.1](https://arxiv.org/html/2606.18304#A3.SS1.SSS0.Px5.p1.1)\.
- Z\. Zhang, X\. Liu, H\. Cheng, C\. Xu, and J\. Gao \(2024\) Diversifying the expert knowledge for task\-agnostic pruning in sparse mixture\-of\-experts\. In arXiv\.org, External Links: [Document](https://dx.doi.org/10.48550/arXiv.2407.09590)\. Cited by: [Appendix D](https://arxiv.org/html/2606.18304#A4.p2.1), [Appendix D](https://arxiv.org/html/2606.18304#A4.p4.1), [§1](https://arxiv.org/html/2606.18304#S1.p1.1), [§1](https://arxiv.org/html/2606.18304#S1.p2.1)\.
- Y\. Zhao, Z\. Wang, and M\. Zhang \(2025\) PuzzleMoE: efficient compression of large mixture\-of\-experts models via sparse expert merging and bit\-packed inference\. arXiv preprint arXiv:2511\.04805\. Cited by: [§C\.1](https://arxiv.org/html/2606.18304#A3.SS1.SSS0.Px2.p1.3), [Appendix D](https://arxiv.org/html/2606.18304#A4.p2.1), [Appendix D](https://arxiv.org/html/2606.18304#A4.p4.1), [§2](https://arxiv.org/html/2606.18304#S2.SS0.SSS0.Px1.p1.1), [§2](https://arxiv.org/html/2606.18304#S2.p2.1), [§5\.1](https://arxiv.org/html/2606.18304#S5.SS1.SSS0.Px1.p1.8)\.
### Contents
- [A Algorithms](https://arxiv.org/html/2606.18304#A1)
  - [A.1 Complete Process of Maximum Coverage Allocation Algorithm](https://arxiv.org/html/2606.18304#A1.SS1)
  - [A.2 Hamilton Apportionment Redistribution](https://arxiv.org/html/2606.18304#A1.SS2)
- [B Derivation and Proof](https://arxiv.org/html/2606.18304#A2)
  - [B.1 Complete Proof of Attribution-based Loss Approximation](https://arxiv.org/html/2606.18304#A2.SS1)
  - [B.2 Derivation of Expected Redundant Channels](https://arxiv.org/html/2606.18304#A2.SS2)
- [C Experiments](https://arxiv.org/html/2606.18304#A3)
  - [C.1 Experimental Setup](https://arxiv.org/html/2606.18304#A3.SS1)
  - [C.2 Overall Comparisons](https://arxiv.org/html/2606.18304#A3.SS2)
    - [C.2.1 Pareto Frontier of Channel-level vs. Expert-level Pruning Methods](https://arxiv.org/html/2606.18304#A3.SS2.SSS1)
    - [C.2.2 Wider Pruning–Quantization Combinations](https://arxiv.org/html/2606.18304#A3.SS2.SSS2)
    - [C.2.3 Speedup and Memory Usage with Different Alignment Granularity](https://arxiv.org/html/2606.18304#A3.SS2.SSS3)
    - [C.2.4 Calibration Runtime Breakdown](https://arxiv.org/html/2606.18304#A3.SS2.SSS4)
  - [C.3 Further Ablation Studies on Proposed Methods](https://arxiv.org/html/2606.18304#A3.SS3)
    - [C.3.1 Channel Score Metric Selection](https://arxiv.org/html/2606.18304#A3.SS3.SSS1)
    - [C.3.2 First-order vs. Second-order Attribution Score](https://arxiv.org/html/2606.18304#A3.SS3.SSS2)
    - [C.3.3 Loss Smoothing](https://arxiv.org/html/2606.18304#A3.SS3.SSS3)
      - [C.3.3.1 Raw Loss vs. Smoothed Losses as the Target Coverage Ratio](https://arxiv.org/html/2606.18304#A3.SS3.SSS3.Px1)
      - [C.3.3.2 Alternative Smoothing Functions for Layerwise Loss](https://arxiv.org/html/2606.18304#A3.SS3.SSS3.Px2)
    - [C.3.4 Hyperparameter Sensitivity in CBA and AAR](https://arxiv.org/html/2606.18304#A3.SS3.SSS4)
  - [C.4 Sensitivity and Robustness Analysis](https://arxiv.org/html/2606.18304#A3.SS4)
    - [C.4.1 Sensitivity to Calibration Corpus](https://arxiv.org/html/2606.18304#A3.SS4.SSS1)
    - [C.4.2 Robustness Across Routing Policies](https://arxiv.org/html/2606.18304#A3.SS4.SSS2)
  - [C.5 Visualizations](https://arxiv.org/html/2606.18304#A3.SS5)
    - [C.5.1 Visualization of Loss, Scores and Sparsity Allocation at Expert-Level](https://arxiv.org/html/2606.18304#A3.SS5.SSS1)
- [D More Related Works](https://arxiv.org/html/2606.18304#A4)
## Appendix A Algorithms
### A.1 Complete Process of Maximum Coverage Allocation Algorithm
This section provides full algorithmic details for the Coverage\-Based Allocation \(CBA\) introduced in Section 4\.2, including the concrete inter\-layer and intra\-layer instantiations and the associated search procedures\.
##### Inter\-layer Allocation\.
We first allocate the overall channel budget to each layer through a layerwise saliency coverage ratio search. Given a target prune ratio $p \in (0,1)$, the total number of channels remaining in the pruned model is $N_{\mathrm{target}} = (1-p)\,IEL$, where $I$, $E$, and $L$ are the intermediate dimension, the number of experts per layer, and the number of layers, respectively.
Step 1: Channel saliency preparation. Let $\mathcal{C}_\ell$ denote the set of all intermediate channels at layer $\ell$ (i.e., $|\mathcal{C}_\ell| = IE$). Each channel $c \in \mathcal{C}_\ell$ has a non-negative saliency score $s_{\ell,c} \geq 0$, which can be computed through various criteria, such as activation magnitude, weight norms, gradient norms, or simple combinations of them. We provide an ablation on the choice of channel saliency criterion in [Section C.3.1](https://arxiv.org/html/2606.18304#A3.SS3.SSS1).
Step 2: Prefix sums calculation. We sort the channels of each layer by saliency score in descending order and write $s_{\ell,(1)} \geq s_{\ell,(2)} \geq \dots \geq s_{\ell,(|\mathcal{C}_\ell|)}$ for the sorted scores. Next, we compute the prefix sum of the leading $n$ channels at layer $\ell$, denoted $\mathcal{S}_\ell(n)$, by [Algorithm 2](https://arxiv.org/html/2606.18304#alg2):

$$\mathcal{S}_\ell(n) = \sum_{i=1}^{n} s_{\ell,(i)}, \quad \text{where } 0 \leq n \leq |\mathcal{C}_\ell|, \tag{9}$$

with the empty sum $\mathcal{S}_\ell(0) = 0$. Based on the prefix sums, the saliency coverage ratio of the top-$n$ channels in layer $\ell$ can be obtained in $\mathcal{O}(1)$ time as $\rho_\ell(n) = \frac{\mathcal{S}_\ell(n)}{\mathrm{S}^{tot}_\ell}$, where $\mathrm{S}^{tot}_\ell = \sum_{c \in \mathcal{C}_\ell} s_{\ell,c}$ is the total saliency score.
Algorithm 2 Prefix sums $\mathcal{S}_g(n)$ for a group $g$ (a layer or an expert)

1: Input: sorted channel scores $s_{g,(1)} \geq s_{g,(2)} \geq \cdots \geq s_{g,(|\mathcal{C}_g|)}$ for group $g$; total channel number $|\mathcal{C}_g|$
2: Output: prefix sums $\mathcal{S}_g(n)$ for $n = 1, \dots, |\mathcal{C}_g|$
3: $\mathcal{S}_g(0) \leftarrow 0$
4: for $n = 1$ to $|\mathcal{C}_g|$ do
5:   $\mathcal{S}_g(n) \leftarrow \mathcal{S}_g(n-1) + s_{g,(n)}$
6: end for
7: return $\mathcal{S}_g(n)$ for all $n = 1, \dots, |\mathcal{C}_g|$
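The prefix-sum construction and the constant-time coverage lookup can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation: the function names `prefix_sums` and `coverage` and the toy scores are assumptions introduced here.

```python
import numpy as np

def prefix_sums(scores):
    """Return the array S with S[n] = sum of the n largest scores and S[0] = 0."""
    s = np.sort(np.asarray(scores, dtype=np.float64))[::-1]  # descending order
    return np.concatenate(([0.0], np.cumsum(s)))

def coverage(S, n):
    """Coverage ratio rho(n) = S(n) / S_tot, a single array lookup."""
    return S[n] / S[-1]

# Toy saliency scores for one group (made up for illustration).
scores = [0.5, 4.0, 1.5, 2.0]
S = prefix_sums(scores)
print(S)               # prefix sums over the sorted scores
print(coverage(S, 2))  # share of total saliency held by the top-2 channels
```

Storing the leading zero keeps the indexing aligned with Algorithm 2, so `S[n]` is exactly the prefix sum of the top $n$ channels.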
Step 3: Layerwise loss collection. Motivated by prior findings that layers differ in functionality and redundancy (Skean et al., [2025](https://arxiv.org/html/2606.18304#bib.bib49)), we use a small calibration dataset to collect a layerwise loss: we inject a scaling perturbation as noise into one layer at a time and compute the negative log-likelihood (NLL) loss relative to the original model. Since the raw loss can span a large range, we smooth it with a square root, which gives the initial layerwise target saliency coverage ratio $\mathbf{w} \in \mathbb{R}^L$.
Step 4: Best coverage ratio by binary search. Given a target global pruning ratio $p$, our goal is to find the largest saliency coverage ratio $\boldsymbol{\rho}^\star \in [0,1]^L$ that covers as many high-saliency channels as possible, under the constraint of $N_{\mathrm{target}}$ total remaining channels. We run a one-dimensional binary search to approach the supremum $\boldsymbol{\rho}^\star$.
As shown in [Algorithm 1](https://arxiv.org/html/2606.18304#alg1), we apply a coefficient $\alpha$ as the starting point (line 5):

$$\boldsymbol{\rho}(\alpha) = \min\big(\alpha\,\boldsymbol{\phi},\, 1\big), \tag{10}$$

where $\boldsymbol{\rho}(\alpha) = \{\rho_1, \rho_2, \dots, \rho_L\}$. Then, we accumulate the minimal number of channels required in each layer to reach the saliency coverage ratio as

$$
\begin{aligned}
N(\boldsymbol{\rho}(\alpha)) &= \sum_{\ell \in [1,L]} N_\ell(\rho_\ell) && (11) \\
&= \sum_{\ell \in [1,L]} \min\left\{ n \,\middle|\, \mathcal{S}_\ell(n) \geq \rho_\ell\, \mathrm{S}^{tot}_\ell \right\}. && (12)
\end{aligned}
$$

Since $\mathcal{S}_\ell(n)$ is monotonic in $n$, we can find $N_\ell(\rho)$ by binary search in $\mathcal{O}(\log|\mathcal{C}_\ell|)$ time. The procedure is given in [Algorithm 3](https://arxiv.org/html/2606.18304#alg3).
Algorithm 3 Binary search of minimal channels $N_g(\rho)$ for target coverage $\rho$ in group $g$

1: Input: prefix sums $\mathcal{S}_g(n)$; target coverage $\rho \in [0,1]$; total score $\mathrm{S}^{tot}_g$; channel number $|\mathcal{C}_g|$
2: Output: minimal number of channels $N_g(\rho)$
3: $n_{\min} \leftarrow 1,\; n_{\max} \leftarrow |\mathcal{C}_g|$
4: while $n_{\min} < n_{\max}$ do
5:   $n \leftarrow \left\lfloor (n_{\min} + n_{\max})/2 \right\rfloor$
6:   if $\mathcal{S}_g(n) \geq \rho\,\mathrm{S}^{tot}_g$ then
7:     $n_{\max} \leftarrow n$
8:   else
9:     $n_{\min} \leftarrow n + 1$
10:   end if
11: end while
12: $N_g(\rho) \leftarrow n_{\min}$
13: return $N_g(\rho)$
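A direct transcription of Algorithm 3 into Python may help make the search bounds concrete. It is a sketch under stated assumptions: the prefix-sum array is assumed to carry a leading zero (so `S[n]` is the sum of the top $n$ scores and `S[-1]` is the total), and the name `min_channels` is introduced here for illustration.

```python
import numpy as np

def min_channels(S, rho):
    """Smallest n >= 1 with S[n] >= rho * S_tot, mirroring Algorithm 3."""
    target = rho * S[-1]
    n_min, n_max = 1, len(S) - 1        # search range [1, |C_g|]
    while n_min < n_max:
        n = (n_min + n_max) // 2
        if S[n] >= target:
            n_max = n                   # n already covers the target
        else:
            n_min = n + 1               # need more channels
    return n_min

# Prefix sums of the toy scores [4.0, 2.0, 1.5, 0.5] (made up for illustration).
S = np.array([0.0, 4.0, 6.0, 7.5, 8.0])
for rho in (0.0, 0.75, 0.8, 1.0):
    print(rho, min_channels(S, rho))
```

Because the lower bound starts at 1, the search keeps at least one channel even for $\rho = 0$; each call halves the range per step, so it takes a logarithmic number of comparisons in the group size.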
Step 5: Termination condition. If the gap between the current total number of preserved channels $N(\boldsymbol{\rho})$ and the target $N_{\mathrm{budget}}$ (line 8 of Algorithm 1) falls below the tolerance $\epsilon$ (which controls pruning precision and defaults to 0.01), the loop terminates immediately (line 9). In practice, we also cap the number of iterations to bound the search time. Otherwise, we continue the binary search over $\alpha \in [0,1]$ (lines 12–16) and finally obtain the supremum $\alpha^\star$ such that $N(\alpha^\star) \leq N_{\mathrm{budget}} + \epsilon N^{tot}$, i.e., $p(\boldsymbol{\rho}^\star) \approx p$.
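Steps 4 and 5 can be sketched together as a bisection over $\alpha$. The sketch below is hypothetical and not taken from the paper's code: `coverage_search`, `min_channels`, the iteration cap `max_iter=50`, and the toy prefix sums are all assumptions, and `np.searchsorted` stands in for the inner binary search of Algorithm 3. The same routine applies to a set of layers or to the experts of one layer.

```python
import numpy as np

def min_channels(S, rho):
    """Smallest n >= 1 with S[n] >= rho * S[-1]; S[0] = 0 is the empty prefix."""
    n = int(np.searchsorted(S, rho * S[-1], side="left"))
    return min(max(n, 1), len(S) - 1)

def coverage_search(prefix, prior, budget, eps=0.01, max_iter=50):
    """Bisect alpha in [0, 1] until the kept-channel total matches the budget.

    prefix: list of prefix-sum arrays, one per group (layer or expert)
    prior:  per-group target coverage in [0, 1] (the importance prior)
    """
    n_tot = sum(len(S) - 1 for S in prefix)
    lo, hi = 0.0, 1.0
    for _ in range(max_iter):                      # iteration cap (assumed value)
        alpha = (lo + hi) / 2
        rho = np.minimum(alpha * np.asarray(prior), 1.0)        # Eq. 10
        alloc = [min_channels(S, r) for S, r in zip(prefix, rho)]
        total = sum(alloc)                                      # Eqs. 11-12
        if abs(total - budget) <= eps * n_tot:     # termination condition
            break
        if total > budget:
            hi = alpha                             # too many channels kept
        else:
            lo = alpha                             # budget not yet used
    return alpha, alloc

# Two toy groups: one with concentrated saliency, one with uniform saliency.
prefix = [np.array([0.0, 8.0, 9.0, 9.5, 10.0]), np.array([0.0, 1.0, 2.0, 3.0, 4.0])]
alpha, alloc = coverage_search(prefix, prior=[1.0, 1.0], budget=4)
print(alpha, alloc)
```

In this toy case the group whose saliency is concentrated in one channel ends up with a smaller budget than the group whose saliency is spread evenly, which is the behaviour the coverage objective is meant to produce.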
##### Intra\-Layer Allocation\.
After the layerwise budget $N_\ell^\star$ is fixed, we run the coverage search within each layer to allocate channels across its experts.
Algorithm 4 Intra-layer coverage search at layer $\ell$ with an expert-wise importance prior

1: Input: experts $\mathcal{E}_\ell = \{1, \ldots, E\}$ at layer $\ell$; non-negative importance prior $\{\phi_{\ell,e}\}_{e \in \mathcal{E}_\ell}$; prefix sums $\{\mathcal{S}_{\ell,e}(n),\, \mathrm{S}^{tot}_{\ell,e}\}_{e \in \mathcal{E}_\ell}$; per-expert channel counts $\{|\mathcal{C}_{\ell,e}|\}_{e \in \mathcal{E}_\ell}$; layer budget $N^\star_\ell$; tolerance $\varepsilon$
2: Output: expert-wise channel budgets $\{N^\star_{\ell,e}\}_{e \in \mathcal{E}_\ell}$
3: $N^{tot}_\ell \leftarrow \sum_{e \in \mathcal{E}_\ell} |\mathcal{C}_{\ell,e}|$
4: $\alpha_{\min} \leftarrow 0,\; \alpha_{\max} \leftarrow 1$
5: while $\alpha_{\min} < \alpha_{\max}$ do
6:   $\alpha \leftarrow (\alpha_{\min} + \alpha_{\max})/2$
7:   for each expert $e \in \mathcal{E}_\ell$ do
8:     $\rho_{\ell,e} \leftarrow \min(\alpha\,\phi_{\ell,e},\, 1)$
9:     $N_{\ell,e}(\rho_{\ell,e}) \leftarrow \min\{n \mid \mathcal{S}_{\ell,e}(n) \geq \rho_{\ell,e}\,\mathrm{S}^{tot}_{\ell,e}\}$ ([Algorithm 3](https://arxiv.org/html/2606.18304#alg3))
10:   end for
11:   $N_\ell(\boldsymbol{\rho}) \leftarrow \sum_{e \in \mathcal{E}_\ell} N_{\ell,e}(\rho_{\ell,e})$
12:   if $\bigl|N_\ell(\boldsymbol{\rho}) - N^\star_\ell\bigr| \leq \varepsilon\, N^{tot}_\ell$ then
13:     $N^\star_{\ell,e} \leftarrow N_{\ell,e}(\rho_{\ell,e})$, $\forall e \in \mathcal{E}_\ell$
14:     break
15:   end if
16:   if $N_\ell(\boldsymbol{\rho}) > N^\star_\ell$ then
17:     $\alpha_{\max} \leftarrow \alpha$
18:   else
19:     $\alpha_{\min} \leftarrow \alpha$
20:   end if
21: end while
22: return $\{N^\star_{\ell,e}\}_{e \in \mathcal{E}_\ell}$
Step 1: Channel saliency reuse. We reuse the channel saliency scores obtained above, denoted $s_{\ell,e,c}$ for channel $c$ of expert $e$ at layer $\ell$. Since the layerwise prefix sums have already been computed, recomputing the expert-wise prefix sums $\mathcal{S}_{\ell,e}(n)$ costs less than 0.1 ms. A time breakdown is given in [Section C.2.4](https://arxiv.org/html/2606.18304#A3.SS2.SSS4).
Step 2: Expert-wise importance estimator. We provide an efficient and accurate expert-wise loss approximation in [Section 4.1](https://arxiv.org/html/2606.18304#S4.SS1.SSS0.Px2), which replaces the time-consuming collection of expert-wise losses by ablating experts one by one. Let $\boldsymbol{\phi}_\ell$ denote the expert-wise importance at layer $\ell$.
Step 3: Best coverage ratio by binary search. As in the inter-layer saliency coverage search, given a target number of remaining channels $N_\ell^\star$, we start from an initial scaling factor $\alpha \in (0,1)$ and set

$$\boldsymbol{\rho}_\ell(\alpha) = \min\big(\alpha\,\boldsymbol{\phi}_\ell,\, 1\big), \qquad \forall \ell \in [1,L], \tag{13}$$

where $\boldsymbol{\rho}_\ell(\alpha) = \{\rho_{\ell,1}, \rho_{\ell,2}, \dots, \rho_{\ell,E}\}$. Next, the minimal number of channels required at layer $\ell$ to reach the coverage is

$$
\begin{aligned}
N_\ell(\boldsymbol{\rho}_\ell(\alpha)) &= \sum_{e \in [1,E]} N_{\ell,e}(\rho_{\ell,e}) && (14) \\
&= \sum_{e \in [1,E]} \min\left\{ n \,\middle|\, \mathcal{S}_{\ell,e}(n) \geq \rho_{\ell,e}\, \mathrm{S}^{tot}_{\ell,e} \right\}, && (15)
\end{aligned}
$$

where $\mathrm{S}^{tot}_{\ell,e} = \sum_{c \in \mathcal{C}_{\ell,e}} s_{\ell,e,c}$ is the total saliency score of expert $e$ at layer $\ell$. We then iteratively search $\alpha$ to adjust $\boldsymbol{\rho}_\ell$, comparing $N_\ell(\boldsymbol{\rho}_\ell)$ against the target channel budget $N_\ell^\star$, until we find the optimal $\alpha^\star$ such that $\big|N_\ell(\boldsymbol{\rho}_\ell^\star) - N_\ell^\star\big| \leq \varepsilon N^{tot}_\ell$, where $\boldsymbol{\rho}_\ell^\star = \min\big(\alpha^\star \boldsymbol{\phi}_\ell,\, 1\big)$, $\forall \ell \in [1,L]$. Each layer repeats the same binary search. The complete algorithm for intra-layer allocation is given in [Algorithm 4](https://arxiv.org/html/2606.18304#alg4).
Combining inter-layer and intra-layer allocation, we obtain the final pruning budget $N_{\ell,e}^\star$ for each expert, which satisfies the global pruning constraint $p$ while maximizing the saliency coverage ratio under the given layerwise and expert-wise importance. As a result, layers and experts with more redundancy receive a smaller budget, whereas those with dispersed channel saliency retain more channels.
As for the computational complexity of the overall allocation process, since $\mathcal{S}_\ell(n)$ and $\mathcal{S}_{\ell,e}(n)$ are non-decreasing in $n$, both $N(\boldsymbol{\rho}(\alpha))$ and $N_\ell(\boldsymbol{\rho}_\ell(\alpha))$ are non-decreasing in $\alpha$. The binary search over $\alpha$ is therefore well defined and needs only a number of iterations logarithmic in the required precision on $\alpha$ (bounded in practice by the iteration cap) to find the optimal $\alpha^\star$, both globally and for every layer.
### A.2 Hamilton Apportionment Redistribution
We cast the channel redistribution problem as a classical Hamilton apportionment. Let $N_{\ell,e}$ denote the channels allocated to expert $e$ in layer $\ell$ by the maximum coverage algorithm introduced above, and let $a$ be the supported GEMM block size (e.g., $a = 64, 128, \dots$). We additionally introduce a minimal channel threshold $m$ to eliminate extremely small experts that carry little information because too few channels remain.
Step 1: Minimal-channel trimming. We first trim experts whose allocated channels are fewer than $m$:

$$\tilde{N}_{\ell,e} = \begin{cases} 0, & N_{\ell,e} < m, \\ N_{\ell,e}, & N_{\ell,e} \geq m, \end{cases} \qquad \mathcal{A}_\ell = \{ e \mid \tilde{N}_{\ell,e} > 0 \}, \tag{16}$$

where $\mathcal{A}_\ell$ denotes the active experts after trimming.
Step 2: Downward alignment. For each remaining expert $e \in \mathcal{A}_\ell$, we round $\tilde{N}_{\ell,e}$ down to the nearest multiple of $a$, ensuring compatibility with low-bit GEMM kernels:

$$N^{\mathrm{base}}_{\ell,e} = \left\lfloor \frac{\tilde{N}_{\ell,e}}{a} \right\rfloor \cdot a, \qquad e \in \mathcal{A}_\ell, \tag{17}$$

and set $N^{\mathrm{base}}_{\ell,e} = 0$ for trimmed experts $e \notin \mathcal{A}_\ell$.
Step 3: Compute remaining quota and available blocks. The channel budget released by trimming and alignment is collected and segmented into units of $a$-blocks. The remaining quota to be redistributed and the number of available blocks in layer $\ell$ are therefore

$$R_\ell = N_\ell^\star - \sum_{e \in [1,E]} N^{\mathrm{base}}_{\ell,e}, \quad q_\ell = \left\lfloor \frac{R_\ell}{a} \right\rfloor. \tag{18}$$
Step 4: Hamilton apportionment over experts. We redistribute the remaining quota in $q_\ell$ discrete $a$-blocks. The fractional remainder of each expert induced by downward alignment is

$$r_{\ell,e} = \frac{\tilde{N}_{\ell,e} - N^{\mathrm{base}}_{\ell,e}}{a} \in [0,1), \quad e \in \mathcal{A}_\ell. \tag{19}$$

To stay close to the original allocation derived from expert importance, each expert can receive at most one additional block. We sort $r_{\ell,e}$ in descending order and let $\pi$ be a permutation of $\mathcal{A}_\ell$ such that $r_{\ell,\pi(1)} \geq r_{\ell,\pi(2)} \geq \cdots \geq r_{\ell,\pi(|\mathcal{A}_\ell|)}$. The $q_\ell$ experts with the largest remainders each receive one additional $a$-block, which can be written as

$$b_{\ell,e} = \mathbb{I}\left[ e \in \{\pi(1), \ldots, \pi(q_\ell)\} \right]. \tag{20}$$

Finally, the aligned channels are

$$N'_{\ell,e} = N^{\mathrm{base}}_{\ell,e} + a \cdot b_{\ell,e}. \tag{21}$$

The resulting aligned channel counts approach the original layerwise allocation budget, satisfy expert capacity constraints, and guarantee that every expert has a channel dimension divisible by $a$. This enables the pruned model to be stored and computed with low-bit quantization, yielding both effective compression and inference speedup without redundant zero padding on MoE models.
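The four steps above can be sketched as one short function. This is a minimal illustration, not the authors' code: the name `hamilton_align`, the default threshold `m=16`, the guard that clamps the block count at zero, and the toy allocation are assumptions made here.

```python
def hamilton_align(alloc, layer_budget, a=64, m=16):
    """Align per-expert channel counts to multiples of a, following Eqs. 16-21.

    alloc:        channels N_{l,e} given to each expert by the coverage search
    layer_budget: layer budget N_l^*
    a:            GEMM block size; m: minimal channel threshold (assumed default)
    """
    trimmed = [n if n >= m else 0 for n in alloc]            # Eq. 16
    base = [(n // a) * a for n in trimmed]                   # Eq. 17
    q = max((layer_budget - sum(base)) // a, 0)              # Eq. 18, clamped at 0
    active = [e for e, n in enumerate(trimmed) if n > 0]
    rem = {e: (trimmed[e] - base[e]) / a for e in active}    # Eq. 19
    # Experts with the largest remainders get one extra block each (Eq. 20).
    winners = set(sorted(active, key=lambda e: rem[e], reverse=True)[:q])
    return [base[e] + (a if e in winners else 0) for e in range(len(alloc))]  # Eq. 21

# Toy layer with four experts and a budget of 384 channels (made-up numbers).
aligned = hamilton_align([200, 120, 60, 4], layer_budget=384, a=64, m=16)
print(aligned, sum(aligned))
```

In the toy call, the expert with 4 channels is trimmed, the counts are rounded down to 192, 64, and 0, and the two released blocks go to the experts with the largest remainders, so every resulting count is a multiple of 64.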
## Appendix B Derivation and Proof
### B.1 Complete Proof of Attribution-based Loss Approximation in [Section 4.1](https://arxiv.org/html/2606.18304#S4.SS1)
Let $h_\ell \in \mathbb{R}^d$ be the input hidden state of the MoE block in layer $\ell$, and let $z_{\ell,e} = f_{\ell,e}(h_\ell) \in \mathbb{R}^d$ denote the output of expert $e$ before gating. The output of MoE layer $\ell$ is the weighted sum of the top-$k$ experts:

$$y_\ell = \sum_{e \in \mathcal{E}_\ell} g_{\ell,e}(h_\ell)\, z_{\ell,e}, \tag{22}$$

where $\mathcal{E}_\ell$ is the set of top-$k$ experts selected by the router at layer $\ell$, with $|\mathcal{E}_\ell| = k$ ($k$ is typically set to 1, 2, 4, or 8 in modern MoEs), and $g_{\ell,e}(h_\ell) \geq 0$ is the router weight of expert $e$.
We measure the contribution of expert $e$ at layer $\ell$ by the loss change when this expert is removed. If expert $e \in \mathcal{E}_\ell$ is ranked in the top-$k$ by the router and selected for a specific token, removing it corresponds to replacing $z_{\ell,e}$ with zero, which induces a perturbation in the layer output

$$
\begin{aligned}
\Delta y_\ell^{(e)} &= \hat{y}_\ell^{(e)} - y_\ell && (23) \\
&= \sum_{e' \in \mathcal{E}_\ell \setminus \{e\}} g_{\ell,e'}\, z_{\ell,e'} \;-\; \sum_{e' \in \mathcal{E}_\ell} g_{\ell,e'}\, z_{\ell,e'} && (24) \\
&= -\,g_{\ell,e}\, z_{\ell,e}, && (25)
\end{aligned}
$$

where $\hat{y}_\ell^{(e)}$ is the layer output when expert $e$ is removed.
Let $\mathcal{L}$ be the loss compared to the original layer's output. For any perturbation $\Delta y$ applied to the layer output $y_\ell$, the loss can be written by the first-order Taylor expansion as

$$\mathcal{L}(y_\ell + \Delta y) = \mathcal{L}(y_\ell) + \left( \frac{\partial \mathcal{L}}{\partial y_\ell} \right)^{\top} \Delta y + \mathcal{O}\bigl( \|\Delta y\|^2 \bigr). \tag{26}$$

In our case, removing expert $e$ at layer $\ell$ induces $\Delta y_\ell^{(e)}$ ([Equation 23](https://arxiv.org/html/2606.18304#A2.E23)) in the block output, and the loss change is

$$\Delta \mathcal{L}^{(e)} = \mathcal{L}\bigl( y_\ell + \Delta y_\ell^{(e)} \bigr) - \mathcal{L}(y_\ell), \tag{27}$$

which can be approximated by keeping only the first-order term in [Equation 26](https://arxiv.org/html/2606.18304#A2.E26) as

$$\Delta \mathcal{L}^{(e)} \approx \left( \frac{\partial \mathcal{L}}{\partial y_\ell} \right)^{\top} \Delta y_\ell^{(e)} = -\left( \frac{\partial \mathcal{L}}{\partial y_\ell} \right)^{\top} \bigl( g_{\ell,e}\, z_{\ell,e} \bigr). \tag{28}$$
By the chain rule, the gradient w.r.t. the expert output $z_{\ell,e}$ is

$$\frac{\partial \mathcal{L}}{\partial z_{\ell,e}} = g_{\ell,e}\, \frac{\partial \mathcal{L}}{\partial y_\ell}, \tag{29}$$

so the final loss change can be estimated as

$$\Delta \mathcal{L}^{(e)} \approx -\left( \frac{\partial \mathcal{L}}{\partial z_{\ell,e}} \right)^{\top} z_{\ell,e}. \tag{30}$$

The approximated loss change is then used to measure the importance of all experts at layer $\ell$.
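The derivation can be checked numerically on a toy problem. The sketch below is an illustration under explicit assumptions: a quadratic loss stands in for $\mathcal{L}$ (so its gradient is available in closed form), and the dimensions, router weights, random data, and function names are made up. It compares the first-order score of Equation 30 with the exact loss change of Equation 27.

```python
import numpy as np

def toy_loss(out, target):
    """Quadratic stand-in for the loss L (an assumption for illustration)."""
    return 0.5 * float(np.sum((out - target) ** 2))

def first_order_score(grad_y, g_e, z_e):
    """Eqs. 29-30: Delta L ~= -(dL/dz_e)^T z_e, with dL/dz_e = g_e * dL/dy."""
    grad_z = g_e * grad_y
    return -float(grad_z @ z_e)

def exact_change(y, target, g_e, z_e):
    """Eq. 27: loss change when the contribution g_e * z_e is removed."""
    return toy_loss(y - g_e * z_e, target) - toy_loss(y, target)

rng = np.random.default_rng(0)
d = 8
z = rng.normal(size=(2, d))              # outputs of k = 2 selected experts
g = np.array([0.7, 0.3])                 # router weights
y = g @ z                                # layer output, Eq. 22
target = y + 0.1 * rng.normal(size=d)    # toy reference output
grad_y = y - target                      # dL/dy of the quadratic loss

for e in range(2):
    approx = first_order_score(grad_y, g[e], z[e])
    exact = exact_change(y, target, g[e], z[e])
    print(e, approx, exact, exact - approx)
```

For a quadratic loss the gap between the two quantities is exactly the dropped second-order term $\tfrac{1}{2}\|g_{\ell,e} z_{\ell,e}\|^2$, which makes the role of the $\mathcal{O}(\|\Delta y\|^2)$ remainder in Equation 26 visible; it can be sizeable because removing an expert is not a small perturbation.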
### B.2 Derivation of Expected Redundant Channels in [Section 4.3](https://arxiv.org/html/2606.18304#S4.SS3) Rationale
From the pruning perspective, one can define a purely logical sparsity level as

$$s_{\text{logical}} = 1 - \frac{K}{D},$$

where $D$ is the original channel dimensionality and $K$ is the number of channels retained after pruning. Under 4-bit quantization, however, parameters are physically stored and processed in fixed-size blocks. With a block size of 64, a linear layer with effective hidden size $K$ is packed as

$$\tilde{D} = \left\lceil \frac{K}{64} \right\rceil \cdot 64,$$

and the corresponding physical compression ratio becomes

$$s_{\text{physical}} = 1 - \frac{\tilde{D}}{D}.$$

If we do not explicitly align $K$ during pruning, each expert can waste between 0 and 63 channels at the storage level. Assuming that the residue $K \bmod 64$ is approximately uniform in $\{0, \dots, 63\}$, the expected padding overhead per expert is

$$\mathbb{E}[\tilde{D} - K] = \frac{1}{64} \sum_{r=1}^{63} (64 - r) = 31.5 \text{ channels}.$$
For a Qwen3\-style MoE block with hidden sizeD=768D=768,E=128E=128experts andL=64L=64layers, this corresponds to roughly
31\.5768≈4\.1%\\frac\{31\.5\}\{768\}\\approx 4\.1\\%
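The padding arithmetic above can be checked numerically. This is a small sketch under the same assumptions as the derivation (block size 64, uniformly distributed residue); the function names are hypothetical.

```python
import math

def physical_width(k, block=64):
    """Channels actually stored when k kept channels are packed in blocks."""
    return math.ceil(k / block) * block

def expected_padding(block=64):
    """Mean padding when K mod block is uniform on {0, ..., block - 1}.

    A residue of 0 needs no padding; a residue r > 0 needs block - r channels.
    """
    return sum((block - r) % block for r in range(block)) / block

def sparsity(d, k, block=64):
    """Return (logical, physical) sparsity for original width d and kept width k."""
    logical = 1 - k / d
    physical = 1 - physical_width(k, block) / d
    return logical, physical

print(expected_padding(64))        # 31.5
print(expected_padding(64) / 768)  # about 0.041
print(sparsity(768, 400))          # physical sparsity is lower than logical
```

With $D=768$ and $K=400$, the expert is stored at width 448, so the physical sparsity falls from about 47.9% to about 41.7%.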
## Appendix CExperiments
### C\.1Experimental Setup
##### Models\.
We conduct experiments on the following representative open\-source MoE LLMs that cover different scales and architectural choices: DeepSeek\-MoE\-16B\(Daiet al\.,[2024](https://arxiv.org/html/2606.18304#bib.bib36)\), DeepSeek\-V2\-Lite\(DeepSeek\-AI,[2024](https://arxiv.org/html/2606.18304#bib.bib37)\), Qwen1\.5\-MoE\-A2\.7B\(Team,[2024](https://arxiv.org/html/2606.18304#bib.bib35)\), and Qwen3\-30B\-A3B\-Thinking\(Qwen\-Team,[2025](https://arxiv.org/html/2606.18304#bib.bib27)\)\.
##### Compared Methods\.
We compare our approach with advanced MoE compression baselines that span a range of techniques. EAC-MoE (Chen et al., [2025b](https://arxiv.org/html/2606.18304#bib.bib21)) performs joint pruning and quantization; we report its configurations with $\alpha=0.3$ (11% sparsity) and $\alpha=0.7$ (38% sparsity), with a corresponding average bitwidth of 3.03. [He et al.](https://arxiv.org/html/2606.18304#bib.bib10) (He et al., [2025](https://arxiv.org/html/2606.18304#bib.bib10)) combine expert trimming and slimming; we report the 25% layer or block drop setting together with 4-bit AWQ quantization. MoE-I2 (Yang et al., [2024](https://arxiv.org/html/2606.18304#bib.bib8)) jointly applies inter-expert pruning to remove redundant experts and intra-expert low-rank decomposition to reduce the parameter redundancy within the remaining experts. PuzzleMoE (Zhao et al., [2025](https://arxiv.org/html/2606.18304#bib.bib23)) focuses on expert merging by 25% or 50%, and provides customized CUDA kernels for efficient inference. MoNE (Zhang et al., [2025](https://arxiv.org/html/2606.18304#bib.bib22)) prunes MoE models by replacing redundant experts with lightweight counterparts. C-Prune (Guo et al., [2025b](https://arxiv.org/html/2606.18304#bib.bib24)) addresses intra-layer and inter-layer expert redundancy in MoE LLMs via a two-stage framework of layer-wise expert clustering followed by global cluster pruning. Wanda (Sun et al., [2023](https://arxiv.org/html/2606.18304#bib.bib48)) is a training-free unstructured pruning method that scores each weight by the product of its magnitude and the corresponding input activation norm, requiring no retraining or weight reconstruction. We report the results from the original papers under the closest comparable settings.
##### Pruning Settings\.
We use channel pruning as the structural sparsification technique because mainstream inference engines can implement it easily. In the following experiments, we adopt two variants. Ours applies 50% channel sparsity without quantization, and thus does not require alignment-aware redistribution. If an expert is assigned zero channels after pruning, we trim the expert and shrink the corresponding router dimension, so that the expert is never selected. OursQ applies 25% channel sparsity and further performs 4-bit quantization using BitsAndBytes NF4. When quantization is applied, we enable Alignment-Aware Redistribution with granularity $a=128$ and enforce a minimum expert channel size of $m=128$. We select $a=128$ and $m=128$ based on a small grid search over feasible settings, constrained by linear layer shapes and quantized operator support. We report throughput and peak memory trade-offs of the explored settings in Appendix [Section C.2.3](https://arxiv.org/html/2606.18304#A3.SS2.SSS3) and [Figure 10](https://arxiv.org/html/2606.18304#A3.F10), and choose the setting with the best overall efficiency.
##### Calibration and Fine\-tuning\.
We use C4 (Raffel et al., [2019](https://arxiv.org/html/2606.18304#bib.bib42)) as the calibration dataset for commonsense tasks. For reasoning benchmarks, we calibrate using samples drawn from GSM8K (Cobbe et al., [2021](https://arxiv.org/html/2606.18304#bib.bib38)) or OpenCodeReasoning (Ahmad et al., [2025](https://arxiv.org/html/2606.18304#bib.bib43)), depending on the task category. After pruning, we follow [Yang et al.](https://arxiv.org/html/2606.18304#bib.bib8) and fine-tune on Alpaca (Taori et al., [2023](https://arxiv.org/html/2606.18304#bib.bib41)) for 2 epochs. We fine-tune the MoE blocks using DoRA (Liu et al., [2024b](https://arxiv.org/html/2606.18304#bib.bib44)) with rank 32 and learning rate $1\mathrm{e}{-4}$, while adapting the routing module with rank 4 and learning rate $1\mathrm{e}{-6}$. We use AdamW with warmup ratio 0.1 and gradient clipping at 0.5, without weight decay. All training is conducted on 4$\times$H20 GPUs. The training cost is 12 GPU hours for Qwen1.5-MoE-A2.7B and 48 GPU hours for Qwen3-30B-A3B, and models of similar scale exhibit comparable training time.
##### Benchmarks and Evaluation\.
We evaluate using two widely adopted toolkits: the LM Evaluation Harness ([https://github.com/EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness)) and OpenCompass ([https://github.com/open-compass/opencompass](https://github.com/open-compass/opencompass)). We report zero-shot performance on general reasoning and knowledge benchmarks, including ARC-C (Clark et al., [2018](https://arxiv.org/html/2606.18304#bib.bib29)), ARC-E (Clark et al., [2018](https://arxiv.org/html/2606.18304#bib.bib29)), HellaSwag (Zellers et al., [2019](https://arxiv.org/html/2606.18304#bib.bib30)), PIQA (Bisk et al., [2019](https://arxiv.org/html/2606.18304#bib.bib31)), BoolQ (Clark et al., [2019](https://arxiv.org/html/2606.18304#bib.bib32)), WinoGrande (Sakaguchi et al., [undefined](https://arxiv.org/html/2606.18304#bib.bib34)), and MMLU (Hendrycks et al., [2020](https://arxiv.org/html/2606.18304#bib.bib33)), and math/code benchmarks with 8-shot, including GSM8K (Cobbe et al., [2021](https://arxiv.org/html/2606.18304#bib.bib38)), HumanEval (Chen et al., [2021](https://arxiv.org/html/2606.18304#bib.bib39)), MATH500 (Lightman et al., [2023](https://arxiv.org/html/2606.18304#bib.bib45)), AIME25 (Zhang and Math-AI, [2025](https://arxiv.org/html/2606.18304#bib.bib58)), GPQA (Rein et al., [2023](https://arxiv.org/html/2606.18304#bib.bib59)), and LiveCodeBench (Jain et al., [2024](https://arxiv.org/html/2606.18304#bib.bib40)). We follow the default task configurations and official evaluation protocols provided by the toolkits and report standard metrics, e.g., accuracy for multiple-choice tasks, exact match for math, and pass@1 for code generation. For long-context evaluations, we set the maximum sequence and output lengths as follows: AIME25 uses MAX\_SEQ\_LEN=65536 and MAX\_OUT\_LEN=32768; MATH500 uses MAX\_SEQ\_LEN=16384 and MAX\_OUT\_LEN=4096; LiveCodeBench\_v6\_academic uses MAX\_SEQ\_LEN=32768 and MAX\_OUT\_LEN=16384.
### C\.2Overall Comparisons
#### C\.2\.1Pareto Frontier of Channel\-level vs\. Expert\-level Pruning Methods
We provide the full per\-task accuracy for channel\-level pruning and expert\-level pruning baselines under matched storage budgets\. Across moderate\-to\-aggressive budgets \(25%–75%\), channel\-level pruning consistently stays on the Pareto frontier in[Figure9](https://arxiv.org/html/2606.18304#A3.F9)\. Under the mildest 13\.3% pruning setting, the channel budget is loose enough that expert\-level pruning remains competitive, but the advantage of channel\-level allocation becomes pronounced once the compression budget tightens\.
Table 10: Per-task accuracy of channel-level (Ours) vs. expert-level pruning baselines on Qwen1.5-MoE-A2.7B at matched storage budgets. Our method performs channel-level structural pruning, whereas the competing baselines adopt expert-level pruning.

Figure 9: Pareto frontier of average downstream-task accuracy versus compressed model storage (GB) for Qwen1.5-MoE-A2.7B. Our channel-level pruning consistently dominates expert-level baselines across the full compression range.
#### C\.2\.2Wider Pruning–Quantization Combinations
Based on our current experiments, P25%Q4b gives the best accuracy-efficiency tradeoff among the default deployment-oriented settings, but we do not claim it is a universal global optimum. We sweep a wider range of $P/Q$ combinations on Qwen1.5-MoE-A2.7B in [Table 11](https://arxiv.org/html/2606.18304#A3.T11). Stronger compression gives lower storage but a larger accuracy drop, while milder compression preserves accuracy better. The framework therefore supports flexible operating points depending on deployment constraints.
Table 11: Storage and downstream accuracy under different combinations of pruning ratio $P$ and quantization bitwidth $Q$ on Qwen1.5-MoE-A2.7B. The default OursQ configuration is highlighted.
#### C\.2\.3Speedup and Memory Usage with Different Alignment Granularity
[Figure 10](https://arxiv.org/html/2606.18304#A3.F10) reports throughput (tokens/s) and runtime peak memory of Qwen1.5-MoE-A2.7B as a function of the minimum kept-channel threshold $m$ (rows) and the alignment block size $a$ (columns) used in our Alignment-Aware Redistribution.
Across all settings, combining channel pruning with 4-bit quantization substantially reduces peak memory compared to the unpruned baseline, confirming that AAR successfully integrates structural sparsity with low-bit storage. Larger block sizes $a$ align channel counts to coarser multiples, which enables larger and more regular GEMM kernels and is accordingly reflected in higher throughput. However, coarser alignment reduces the degree to which each expert's channel count tracks the original CBA solution, so there is a natural trade-off between kernel efficiency and allocation fidelity.
The minimum-channel threshold $m$ controls the smallest permissible expert width after alignment. Small $m$ allows very thin experts to survive, which can hurt throughput due to irregular kernel sizes, while large $m$ forces low-importance experts to retain more channels than necessary, slightly increasing memory. The heatmap shows that $m=128$ with $a=128$ or $a=256$ achieves a favorable balance: throughput is near the maximum achievable value, and memory remains well below the unpruned baseline. These are the default settings used throughout the main experiments.
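The roles of $a$ and $m$ can be illustrated with a simplified alignment step. The sketch below is not the paper's AAR procedure: it assumes each expert's kept-channel count is floored to a multiple of `a`, raised to a minimum (with `m` rounded up to a multiple of `a`), and that freed channels are handed back in chunks of `a` to the experts that lost the most channels. The function name, the `width` parameter, and the example numbers are all hypothetical.

```python
import math

def align_allocation(keep, a=128, m=128, width=1024):
    """Snap per-expert kept-channel counts to multiples of `a` (simplified sketch).

    keep  : unaligned kept-channel count per expert
    a     : alignment block size
    m     : minimum kept channels per expert (rounded up to a multiple of a)
    width : original expert width, an upper bound on kept channels
    """
    m_aligned = math.ceil(m / a) * a
    cap = (width // a) * a
    aligned = [min(cap, max(m_aligned, (k // a) * a)) for k in keep]

    # Channels freed by flooring, to be redistributed in chunks of `a`.
    residual = sum(keep) - sum(aligned)
    # Assumed priority: experts that lost the most channels come first.
    order = sorted(range(len(keep)), key=lambda i: keep[i] - aligned[i], reverse=True)

    progress = True
    while residual >= a and progress:
        progress = False
        for i in order:
            if residual >= a and aligned[i] + a <= cap:
                aligned[i] += a
                residual -= a
                progress = True
    return aligned

print(align_allocation([700, 250, 333, 512]))  # [640, 256, 384, 512]
```

Every returned count is a multiple of `a`, and the total never exceeds the unaligned total unless the minimum `m` forces it to. A coarser `a` moves each expert further from its unaligned count, which is the fidelity cost discussed above.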

Figure 10:Throughput and runtime memory usage of Qwen1\.5\-MoE\-A2\.7B with different minimal channel numbers and alignment granularity\.
#### C\.2\.4Calibration Runtime Breakdown
Table 12: Time breakdown of generating prune allocation of 50% sparsity.

| Model | Stage | Time (ms) | Time (%) |
|---|---|---|---|
| Qwen1.5-MoE-A2.7B | Overall generating prune plan | 2078.31 | 100% |
| | – Inter-layer coverage search | 63.00 | 3.03% |
| | Smooth weights | 30.35 | 1.46% |
| | Compute prefix sum | 14.16 | 0.68% |
| | Binary search | 17.90 | 0.86% |
| | – Intra-layer coverage search | 1836.49 | 88.36% |
| | Recompute prefix sum | 0.08 | <0.01% |
| | Binary Search (24 MoE layers) | 1836.41 | 88.36% |
| | – per layer | 76.52 | 3.68% |
| | – Alignment-aware redistribution | 178.82 | 8.60% |
| | Compute $K^{\mathrm{base}}$ | 52.84 | 2.54% |
| | Compute headroom | 14.96 | 0.72% |
| | Allocate chunks | 110.14 | 5.30% |
| | Clamp to $I$ | 0.88 | 0.04% |
| DeepSeek-MoE-16B | Overall generating prune plan | 2914.31 | 100% |
| | – Inter-layer coverage search | 73.81 | 2.53% |
| | Smooth weights | 30.48 | 1.05% |
| | Compute prefix sum | 15.57 | 0.53% |
| | Binary search | 27.14 | 0.93% |
| | – Intra-layer coverage search | 2647.70 | 90.85% |
| | Recompute prefix sum | 0.10 | <0.01% |
| | Binary Search (27 MoE layers) | 2647.60 | 90.85% |
| | – per layer | 98.06 | 3.36% |
| | – Alignment-aware redistribution | 192.80 | 6.62% |
| | Compute $K^{\mathrm{base}}$ | 56.12 | 1.93% |
| | Compute headroom | 17.16 | 0.59% |
| | Allocate chunks | 118.57 | 4.07% |
| | Clamp to $I$ | 0.95 | 0.03% |
| Qwen3-30B-A3B | Overall generating prune plan (total) | 10063.12 | 100% |
| | – Inter-layer coverage search | 87.43 | 0.87% |
| | Smooth weights | 29.36 | 0.29% |
| | Compute prefix sum | 19.91 | 0.20% |
| | Binary search | 37.15 | 0.37% |
| | – Intra-layer coverage search | 9619.28 | 95.59% |
| | Recompute prefix sum | 0.09 | <0.01% |
| | Binary Search (48 MoE layers) | 9619.19 | 95.59% |
| | – per layer | 200.40 | 1.99% |
| | – Alignment-aware redistribution | 356.41 | 3.54% |
| | Compute $K^{\mathrm{base}}$ | 20.44 | 0.20% |
| | Compute headroom | 17.02 | 0.17% |
| | Allocate chunks | 317.40 | 3.15% |
| | Clamp to $I$ | 1.55 | 0.02% |
We provide a time breakdown of the calibration process in Table [12](https://arxiv.org/html/2606.18304#A3.T12), demonstrating the efficiency of the proposed method. Even for a large MoE model such as Qwen3-30B-A3B, generating the prune plan takes only about 10 seconds (10,063 ms), and the intra-layer binary search accounts for roughly 88% to 96% of the total across the three models.
### C\.3Further Ablation Studies on Proposed Methods
#### C\.3\.1Channel Score Metric Selection
In this ablation, we only change the channel score definition ($s_{\ell,c}$) while keeping all other components of the pipeline fixed.
##### Definition of metrics\.
- Weight Magnitude (channel-wise L2 norm). For expert $e$ in layer $\ell$ and a projection $\phi$ with weight matrix $W_{\ell,e}^{(\phi)}\in\mathbb{R}^{O_{\phi}\times I_{\phi}}$, we define the importance of input channel $c$ by the L2 norm of the corresponding weight column: $$s_{\ell,e,c}^{(\phi,W)}=\lVert W^{(\phi)}_{\ell,e,c}\rVert_{2}^{O_{\phi}}=\Big(\sum_{o\in O_{\phi}}\big(W_{\ell,e,o,c}^{(\phi)}\big)^{2}\Big)^{\frac{1}{2}}. \tag{31}$$
- Activation Magnitude (channel-wise L2 norm). For expert $e$ in layer $\ell$ and a projection $\phi$, let $A^{(\phi)}_{\ell,e,t,c}$ denote the activation of channel $c$ at token $t$. We compute the channel magnitude by an L2 norm over the tokens in $\mathcal{T}$: $$s_{\ell,e,c}^{(\phi,A)}=\lVert A^{(\phi)}_{\ell,e,c}\rVert_{2}^{\mathcal{T}}=\Big(\sum_{t\in\mathcal{T}}\big(A_{\ell,e,t,c}^{(\phi)}\big)^{2}\Big)^{\frac{1}{2}}. \tag{32}$$
- Weight$\times$Activation (Sun et al., [2023](https://arxiv.org/html/2606.18304#bib.bib48)). We follow Wanda to combine the magnitude of each weight with the per-channel activation norm. We multiply the two and then reduce along the output dimension: $$s_{\ell,e,c}^{(\phi,\,WA)}=\sum_{o\in O_{\phi}}\big(\big|W_{\ell,e,o,c}^{(\phi)}\big|\cdot\lVert A_{\ell,e,c}^{(\phi)}\rVert_{2}^{\mathcal{T}}\big). \tag{33}$$
- Gradient Saliency Map (Song et al., [2019](https://arxiv.org/html/2606.18304#bib.bib46)). We use a small calibration set and collect the activation gradient by forward and backward propagation. Let $g_{\ell,e,t,c}^{(\phi,A)}=\nabla_{A_{\ell,e,t,c}^{(\phi)}}\mathcal{L}$ denote the gradient w.r.t. the activation of channel $c$ in expert $e$ and projection $\phi$ at token $t$. We aggregate the gradients with an L2 norm over the token dimension: $$s_{\ell,e,c}^{(\phi,g)}=\big\lVert g_{\ell,e,c}^{(\phi,A)}\big\rVert_{2}^{\mathcal{T}}=\Big(\sum_{t\in\mathcal{T}}\big(g_{\ell,e,t,c}^{(\phi,A)}\big)^{2}\Big)^{\frac{1}{2}}. \tag{34}$$
- Activation$\times$Gradient Saliency Map (Song et al., [2019](https://arxiv.org/html/2606.18304#bib.bib46)). We use the element-wise product between the activation and its gradient to score channel importance. We take the absolute value and then average over the token dimension: $$s_{\ell,e,c}^{(\phi,\,gA)}=\frac{1}{|\mathcal{T}|}\sum_{t\in\mathcal{T}}\Big|A_{\ell,e,t,c}^{(\phi)}\cdot g_{\ell,e,t,c}^{(\phi,A)}\Big|. \tag{35}$$
- SNIP First-order Sensitivity (Weight$\times$Gradient) (Lee et al., [2018](https://arxiv.org/html/2606.18304#bib.bib47)). We follow SNIP to score channels by the first-order Taylor approximation, using the element-wise product between each weight and its gradient. Let $g_{\ell,e,o,c}^{(\phi,W)}=\nabla_{W_{\ell,e,o,c}^{(\phi)}}\mathcal{L}$ denote the gradient w.r.t. the weight connecting input channel $c$ to output channel $o$ in expert $e$ and projection $\phi$, obtained by backpropagating the loss on the calibration set. We take the absolute value and then reduce along the output dimension: $$s_{\ell,e,c}^{(\phi,\,Wg)}=\sum_{o\in O_{\phi}}\Big|W_{\ell,e,o,c}^{(\phi)}\cdot g_{\ell,e,o,c}^{(\phi,W)}\Big|. \tag{36}$$
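Several of the metrics above can be sketched on plain nested lists to show how the reductions differ. This is illustrative only: the function names and the tiny matrices are assumptions, with `W` stored as output rows by input channels and `A`, `G` stored as tokens by channels.

```python
import math

def weight_l2(W):
    """Eq. (31): L2 norm of each input-channel column of W (shape O x I)."""
    return [math.sqrt(sum(row[c] ** 2 for row in W)) for c in range(len(W[0]))]

def activation_l2(A):
    """Eq. (32): L2 norm of each channel over tokens (A has shape T x I)."""
    return [math.sqrt(sum(tok[c] ** 2 for tok in A)) for c in range(len(A[0]))]

def weight_times_activation(W, A):
    """Eq. (33): sum over outputs of |W[o][c]| times the activation norm of c."""
    act = activation_l2(A)
    return [sum(abs(row[c]) for row in W) * act[c] for c in range(len(act))]

def activation_times_gradient(A, G):
    """Eq. (35): token-averaged |activation * gradient| per channel."""
    n_tokens = len(A)
    return [
        sum(abs(A[t][c] * G[t][c]) for t in range(n_tokens)) / n_tokens
        for c in range(len(A[0]))
    ]

# Hypothetical toy data: 2 output rows, 2 input channels, 3 tokens.
W = [[3, 0], [4, 1]]
A = [[1.0, 0.0], [2.0, 4.0], [2.0, 3.0]]
G = [[1.0, -1.0], [0.0, 0.5], [-1.0, 0.0]]

print(weight_l2(W))                     # [5.0, 1.0]
print(activation_l2(A))                 # [3.0, 5.0]
print(weight_times_activation(W, A))    # [21.0, 5.0]
print(activation_times_gradient(A, G))
```

On this toy data the weight-based score ranks channel 0 first while the activation-based score ranks channel 1 first, so the choice of metric can change which channels survive pruning.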
Table 13: Ablation on channel score definition $s_{\ell,c}$. Inter-layer and intra-layer allocation use the same maximum-coverage procedure, while only the channel saliency scores differ.

As shown in [Table 13](https://arxiv.org/html/2606.18304#A3.T13), the choice of channel scores has a non-trivial impact on performance: results on the commonsense task (ARC-c) vary moderately across metrics, whereas the gap becomes substantial on reasoning-heavy benchmarks (GSM8K and HumanEval). For example, on GSM8K, Activation reaches 58.2, while Weight drops to 25.9, and other weight- or gradient-based variants also lag behind. Overall, we use Activation as the default channel scoring metric in all experiments.
#### C\.3\.2First\-order vs\. Second\-order Attribution Score
Our default attribution-based score $s_{e}^{(1)}$ is a first-order Taylor approximation, motivated by computational efficiency. To verify that this approximation does not compromise allocation quality, we additionally implement a lightweight but *exact* second-order proxy $s_{e}^{(2)}$ by perturbing each expert output with a scalar $\alpha_{e}$ and benchmark both against the true ablated score $s_{e}^{(\mathrm{true})}=\mathcal{L}_{e}(0)-\mathcal{L}_{e}(1)$. [Table 14](https://arxiv.org/html/2606.18304#A3.T14) shows that $s_{e}^{(1)}$ already correlates highly with $s_{e}^{(\mathrm{true})}$ (Pearson $0.959$; channel-allocation Pearson $0.966$), while $s_{e}^{(2)}$ matches it almost exactly. The second-order proxy improves the end-to-end average by $+1.2$%, at the cost of $\sim 17\times$ longer calibration. The first-order score thus remains a strong default, and the second-order proxy serves as an enhanced variant whenever the additional calibration budget is acceptable.
Table 14: Comparison of the first-order attribution score $s_{e}^{(1)}$ used in our main results and an exact second-order proxy $s_{e}^{(2)}$ on Qwen1.5-MoE-A2.7B under P25%Q4b. The second-order proxy is also benchmarked against the true ablated score $s_{e}^{(\mathrm{true})}$ computed by directly removing each expert.
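One way to relate the three scores is a Taylor expansion along the scalar perturbation. The following is a sketch under stated assumptions rather than the paper's definition: it assumes $\mathcal{L}_{e}(\alpha)$ is the loss when the output of expert $e$ is scaled by $\alpha$, so that $\alpha=1$ is the original model and $\alpha=0$ removes the expert. The exact form of $s_{e}^{(2)}$ used by the authors may differ.

```latex
\begin{aligned}
s_{e}^{(\mathrm{true})}
  = \mathcal{L}_{e}(0)-\mathcal{L}_{e}(1)
  &= -\mathcal{L}_{e}'(1) + \tfrac{1}{2}\,\mathcal{L}_{e}''(1) + \text{(higher-order terms)},\\
s_{e}^{(1)} &= -\mathcal{L}_{e}'(1)
  = -\Bigl(\frac{\partial\mathcal{L}}{\partial z_{\ell,e}}\Bigr)^{\top} z_{\ell,e},\\
s_{e}^{(2)} &= -\mathcal{L}_{e}'(1) + \tfrac{1}{2}\,\mathcal{L}_{e}''(1).
\end{aligned}
```

Under this reading, the first line is the expansion of $\mathcal{L}_{e}$ around $\alpha=1$ evaluated at $\alpha=0$, the second line recovers Equation 30, and the third adds the curvature along $\alpha$ that the first-order score drops.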
#### C\.3\.3Loss Smoothing
##### Raw Loss vs\. Smoothed Losses as the Target Coverage Ratio
[Figure11](https://arxiv.org/html/2606.18304#A3.F11)illustrates how the square\-root smoothing transforms the raw layerwise loss into a more balanced coverage target, and how that target translates into the final channel keep ratio\.
The top colorbar shows the raw ablated loss per layer, which spans a wide dynamic range: a small number of critical layers dominate the signal while most layers contribute only modestly\. Directly using raw loss as the inter\-layer importance signal would therefore concentrate the retained budget on a few layers and drastically under\-budget the rest\. After applying square\-root smoothing \(middle colorbar\), the dynamic range is compressed: the most sensitive layers are down\-weighted, and moderately sensitive layers receive a proportionally larger share of the budget\. The resulting score\-coverage targets are more uniformly distributed across layers, enabling a stable and globally balanced pruning allocation\.
The bottom colorbar shows the actual channel keep ratio produced by the coverage\-maximized allocation under this smoothed target\. Layers that are assigned a higher coverage ratio \(darker color\) retain more of their channels, while layers with highly concentrated scores can meet the same target with a smaller fraction\. Comparing the middle and bottom colorbars illustrates the decoupling between coverage target and channel count that is central to our method: a high coverage target does not imply a large channel budget when the score distribution is concentrated\.
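The decoupling between coverage target and channel count can be seen on two toy score vectors. This is a minimal sketch, assuming coverage is the fraction of total channel score held by the top-ranked channels; the function name and the score values are hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate

def keep_ratio_for_coverage(scores, coverage):
    """Fraction of channels needed so the top-scoring ones reach `coverage`."""
    ranked = sorted(scores, reverse=True)
    prefix = list(accumulate(ranked))  # cumulative score of the top-k channels
    target = coverage * prefix[-1]
    kept = min(len(ranked), bisect_left(prefix, target) + 1)
    return kept / len(ranked)

concentrated = [50, 30, 10, 4, 3, 1, 1, 1]   # score mass in a few channels
flat = [13, 13, 13, 13, 12, 12, 12, 12]      # score mass spread evenly

print(keep_ratio_for_coverage(concentrated, 0.85))  # 0.375 (3 of 8 channels)
print(keep_ratio_for_coverage(flat, 0.85))          # 0.875 (7 of 8 channels)
```

Both vectors are given the same 85% coverage target, yet the concentrated one meets it with fewer than half as many channels.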

Figure 11:Losses \(raw and smoothed\), coverage ratio and channel keep ratio after pruning\.
##### Alternative Smoothing Functions for Layerwise Loss
The square\-root smoothing used in inter\-layer allocation is not a theoretically essential component; rather, it is a simple realization of monotone\-concave dynamic\-range compression\. Without smoothing, a few high\-loss layers capture most of the channel budget while low\-loss layers are over\-pruned\. Applying any monotone\-concave transform reduces this imbalance by suppressing outlier values while preserving the relative ordering of layers\.
We compare the default $\sqrt{\cdot}$ smoothing against three standard alternatives. Let $x\geq 0$ denote the raw layerwise loss, and $\mu$ and $\sigma$ the mean and standard deviation of the losses across all layers.
- Square-root (ours): $g(x)=\sqrt{x}$.
- Log smoothing: $g(x)=\log(1+\alpha x)$, with $\alpha=5$.
- Huber-style smoothing: $$g(x)=\begin{cases}x, & x\leq\delta,\\ \delta+\sqrt{\delta\,(x-\delta)}, & x>\delta,\end{cases}\qquad\delta=\mu+0.5\,\sigma.$$
- Clip-based smoothing: $g(x)=\operatorname{clip}(x,\;\mu-k\sigma,\;\mu+k\sigma)$, with $k=0.5$.
All four functions are monotone non-decreasing (preserving the relative layer ordering, up to ties under clipping) and compress the upper end of the dynamic range; the square-root and log transforms are additionally concave on $x\geq 0$. All smoothed variants substantially outperform the unsmoothed baseline as shown in [Table 15](https://arxiv.org/html/2606.18304#A3.T15), supporting the need for dynamic-range compression. The square-root transform gives the best result while requiring no hyperparameter tuning.
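The four smoothing functions can be written directly from their definitions. The sketch below applies them to made-up layerwise losses and reports the max/min ratio as a simple measure of dynamic range. The loss values, the helper `dynamic_range`, and the use of the population standard deviation for $\sigma$ are assumptions.

```python
import math
from statistics import mean, pstdev

def smooth_sqrt(x):
    return math.sqrt(x)

def smooth_log(x, alpha=5.0):
    return math.log1p(alpha * x)  # log(1 + alpha * x)

def smooth_huber(x, delta):
    # Identity up to delta, square-root growth beyond it.
    return x if x <= delta else delta + math.sqrt(delta * (x - delta))

def smooth_clip(x, mu, sigma, k=0.5):
    return min(max(x, mu - k * sigma), mu + k * sigma)

def dynamic_range(values):
    return max(values) / min(values)

losses = [0.02, 0.05, 0.04, 0.90, 0.10]  # hypothetical raw layerwise losses
mu, sigma = mean(losses), pstdev(losses)
delta = mu + 0.5 * sigma

smoothed = {
    "sqrt": [smooth_sqrt(x) for x in losses],
    "log": [smooth_log(x) for x in losses],
    "huber": [smooth_huber(x, delta) for x in losses],
    "clip": [smooth_clip(x, mu, sigma) for x in losses],
}

print("raw", round(dynamic_range(losses), 2))
for name, values in smoothed.items():
    print(name, round(dynamic_range(values), 2))
```

On these values every transform keeps the layer ordering and shrinks the max/min ratio, although by very different amounts: the Huber-style variant only compresses losses well above $\delta$, while clipping collapses all small losses to the same floor.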
Table 15: Downstream accuracy with channel allocations derived from different monotone-concave smoothing functions of the layerwise loss on Qwen1.5-MoE-A2.7B under P50%.

Figure 12: Smoothed layerwise loss under different monotone-concave smoothing functions.
#### C\.3\.4Hyperparameter Sensitivity in CBA and AAR
For Coverage-Maximized Budget Allocation, we set the maximum number of binary-search iterations to 50. In practice, the search usually converges within 30 iterations, and the maximum only serves as a safeguard. For Alignment-Aware Redistribution, the minimum kept-channel threshold $m$ prevents overly thin experts and is constrained by hardware block size. We ablate $m\in\{64,128,256,512\}$ in [Table 16](https://arxiv.org/html/2606.18304#A3.T16); $m=128$ provides the best trade-off and is used by default. The residual reallocation strategy is summarized in the main text in [Table 8](https://arxiv.org/html/2606.18304#S5.T8), with full per-task results in [Table 17](https://arxiv.org/html/2606.18304#A3.T17).
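A capped binary search of this kind can be sketched as follows. This is an assumption-laden illustration, not the paper's CBA algorithm: it searches for the largest shared coverage ratio `tau` whose total kept-channel count fits a budget, and it uses a single ratio for all experts rather than per-expert targets. The names `kept_channels` and `max_coverage_under_budget` and the tolerance are hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate

def kept_channels(scores, tau):
    """Number of top-scoring channels needed to cover a fraction tau of the score."""
    prefix = list(accumulate(sorted(scores, reverse=True)))
    return min(len(scores), bisect_left(prefix, tau * prefix[-1]) + 1)

def max_coverage_under_budget(experts, budget, max_iters=50, tol=1e-6):
    """Binary-search the largest tau whose total kept channels stay within budget."""
    lo, hi = 0.0, 1.0
    for _ in range(max_iters):  # iteration cap acts only as a safeguard
        if hi - lo < tol:
            break
        mid = (lo + hi) / 2
        used = sum(kept_channels(s, mid) for s in experts)
        if used <= budget:
            lo = mid   # feasible: try a higher coverage
        else:
            hi = mid   # over budget: lower the coverage
    return lo, [kept_channels(s, lo) for s in experts]

experts = [
    [50, 30, 10, 4, 3, 1, 1, 1],        # concentrated scores
    [13, 13, 13, 13, 12, 12, 12, 12],   # flat scores
]
tau, allocation = max_coverage_under_budget(experts, budget=8)  # keep 8 of 16
print(round(tau, 3), allocation)
```

The search works because the kept-channel count is non-decreasing in `tau`. With the tolerance used here it stops after about 20 halvings, well under the cap. Note that every expert keeps at least one channel in this sketch, so a budget smaller than the number of experts cannot be met.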
Table 16: Ablation on the minimum-channel threshold $m$ in AAR on Qwen1.5-MoE-A2.7B under P25%Q4b.

Table 17: Comparison of two AAR residual reallocation strategies on Qwen1.5-MoE-A2.7B with different alignment block sizes $a$. l-r-c: *largest removed channels*; l-r-s: *largest removed scores*.
### C\.4Sensitivity and Robustness Analysis
#### C\.4\.1Sensitivity to Calibration Corpus
Our default setup follows common post\-training compression practice, using C4 for general tasks, GSM8K for math, and OpenCodeReasoning for code\. To examine the sensitivity systematically, we conduct an ablation on six calibration corpora: WikiText2, C4, Pile, RedPajama, GSM8K \(train\), and OpenCodeReasoning\. Results in[Table18](https://arxiv.org/html/2606.18304#A3.T18)show that general tasks are relatively robust to general\-domain corpora, while domain\-specific tasks benefit from domain\-matched calibration\. The sensitivity is therefore structured rather than arbitrary\.
Table 18:Sensitivity of OursQto the choice of calibration corpus on Qwen1\.5\-MoE\-A2\.7B underP25%Q4b\.
#### C\.4\.2Robustness Across Routing Policies
Our method is not tied to a specific routing design. The main-text experiments cover Qwen-style standard top-$k$ routing and DeepSeek-style routing with load-balancing considerations. [Figure 13](https://arxiv.org/html/2606.18304#A3.F13) visualizes router entropy across tasks, layer depths, and MoE models, showing that the evaluated settings cover different routing dynamics rather than a single homogeneous pattern.
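Router entropy can be computed per token from the router logits. The sketch below assumes entropy is the Shannon entropy of the softmax routing distribution over experts; the exact definition behind the figure is not stated in this section, and the function names are hypothetical.

```python
import math

def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def router_entropy(logits):
    """Shannon entropy (in nats) of the routing distribution for one token."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0)

uniform = router_entropy([0.0, 0.0, 0.0, 0.0])   # maximal: log(4)
mild = router_entropy([1.0, 0.0, 0.0, 0.0])
peaked = router_entropy([10.0, 0.0, 0.0, 0.0])   # close to zero
print(uniform, mild, peaked)
```

High entropy means the router spreads probability across many experts, while low entropy means a few experts dominate. Averaging this quantity over tokens would give one value per layer and task.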
Figure 13: Router entropy distributions across tasks, layer depths, and MoE models evaluated in the main text. The distributions illustrate the routing-dynamics variation used to evaluate robustness across architectures and tasks.

[Figure 14](https://arxiv.org/html/2606.18304#A3.F14) further reports expert activation magnitudes across representative shallow, middle, and deep layers of Qwen1.5-MoE-A2.7B under different calibration corpora. The distributions differ across both layers and corpora, confirming that expert heterogeneity is not an artifact of a single calibration source.
Figure 14: Distribution of expert activation magnitudes across representative shallow, middle, and deep layers of Qwen1.5-MoE-A2.7B under different calibration corpora. The figure supports the calibration robustness analysis by showing that activation heterogeneity persists across data sources.

To further evaluate robustness under different routing budgets, we switch the activated experts from top-2 to top-1 on Qwen1.5-MoE-A2.7B and DeepSeek-V2-Lite. As expected, accuracy drops when fewer experts are activated, but our method still preserves most of the original performance at 50% pruning under both top-1 and top-2 routing.
Table 19: Evaluation under different top-$k$ routing strategies for Qwen1.5-MoE-A2.7B and DeepSeek-V2-Lite, with and without 50% structural pruning.
### C\.5Visualizations
#### C\.5\.1Visualization of Loss, Scores and Sparsity Allocation at Expert\-Level
[Figures15](https://arxiv.org/html/2606.18304#A3.F15)and[16](https://arxiv.org/html/2606.18304#A3.F16)visualize the expert\-level channel\-score distribution and the resulting allocation for Qwen1\.5\-MoE\-A2\.7B and Qwen3\-30B\-A3B, respectively\. Each bar stack represents one expert: darker segments at the top correspond to high\-scoring channels, while lighter segments reflect channels with smaller scores\. The fraction of dark segments therefore indicates how concentrated an expert’s score is\. Experts whose information is packed into a small number of channels exhibit darker, narrower stacks\.
The red lines report the fraction of channels retained after coverage\-maximized allocation\. Experts with a highly concentrated score distribution \(tall dark segments, small light tails\) are assigned a smaller channel budget, because a high coverage target can be met by keeping only the top channels\. Conversely, experts with flat, spread\-out distributions require a larger kept\-channel fraction to reach the same coverage threshold\. The yellow diamond markers show the attribution score of each expert, which controls the per\-expert coverage target in our intra\-layer allocation\. Experts with higher attribution scores receive a tighter coverage target \(more channels retained\) to avoid degrading high\-contribution experts, whereas low\-attribution experts are compressed more aggressively\.
For Qwen3\-30B\-A3B,[Figure16](https://arxiv.org/html/2606.18304#A3.F16)further compares two channel\-score metrics: activation\-based scores \(panels a, b\) and weight\-gradient\-based scores \(panels c, d\)\. Both metrics lead to similar kept\-channel allocations \(red lines\), confirming that the coverage\-maximized allocation is robust to the choice of scoring metric\.

Figure 15: Cumulative channel-score fractions (blue stacked bars), kept channels (red lines), and attribution scores (yellow diamond markers) for each expert in specific layers of Qwen1.5-MoE-A2.7B.
Figure 16:Cumulative channel\-score fractions \(blue stacked bars\), kept channels \(red lines\), and attribution scores \(yellow diamonds\) for experts within a representative layer of Qwen3\-30B\-A3B\. Panels \(a\) and \(b\) use activation\-based scores, while \(c\) and \(d\) use weight gradient scores \(weight times gradient\)\. The channel\-score concentration pattern appears under both metrics\. Although experts exhibit substantial heterogeneity, our coverage\-maximized budget allocation yields similar kept\-channel allocations across metrics \(red\)\.
## Appendix DMore Related Works
Efficient MoE. MoE compression and acceleration have attracted increasing interest as models continue to grow in scale, from Mixtral 8$\times$7B with 8 experts, of which 2 are activated for each token (Jiang et al., [2024](https://arxiv.org/html/2606.18304#bib.bib6)), to Qwen3-235B-A22B with 128 experts, of which 8 are activated for each token (Qwen-Team, [2025](https://arxiv.org/html/2606.18304#bib.bib27)).
Existing methods can be broadly categorized into four major techniques\. \(1\)Expert trimmingremoves a subset of experts through data driven selection, so that low contribution experts are never loaded or computed, reducing both memory footprint and computations\(Liuet al\.,[2024a](https://arxiv.org/html/2606.18304#bib.bib26); Baiet al\.,[2025](https://arxiv.org/html/2606.18304#bib.bib7); Muzioet al\.,[2024](https://arxiv.org/html/2606.18304#bib.bib13); Chowdhuryet al\.,[2024](https://arxiv.org/html/2606.18304#bib.bib14); Donget al\.,[2025](https://arxiv.org/html/2606.18304#bib.bib15); Luet al\.,[2024](https://arxiv.org/html/2606.18304#bib.bib17); Zhanget al\.,[2025](https://arxiv.org/html/2606.18304#bib.bib22)\)\. \(2\)Expert skippingis a complementary approach that retains the full expert pool while skipping the computation of low importance experts at inference time, typically through routing thresholds or dynamic gating\(Liuet al\.,[2024a](https://arxiv.org/html/2606.18304#bib.bib26); Baiet al\.,[2025](https://arxiv.org/html/2606.18304#bib.bib7); Luet al\.,[2024](https://arxiv.org/html/2606.18304#bib.bib17); Xuet al\.,[2025](https://arxiv.org/html/2606.18304#bib.bib60); Chenet al\.,[2025b](https://arxiv.org/html/2606.18304#bib.bib21)\)\. \(3\)Expert slimmingcompresses the internal structure and parameter of each expert by pruning, quantization, or low rank decomposition, while keeping the number of experts fixed\(Yanget al\.,[2024](https://arxiv.org/html/2606.18304#bib.bib8); Heet al\.,[2025](https://arxiv.org/html/2606.18304#bib.bib10); Leeet al\.,[2025](https://arxiv.org/html/2606.18304#bib.bib11); Xieet al\.,[2024](https://arxiv.org/html/2606.18304#bib.bib12); Xuet al\.,[2025](https://arxiv.org/html/2606.18304#bib.bib60); Chenet al\.,[2025a](https://arxiv.org/html/2606.18304#bib.bib20),[b](https://arxiv.org/html/2606.18304#bib.bib21)\)\. 
\(4\) Expert merging clusters experts with similar behavior or activation patterns and combines them into fewer experts by averaging, SVD\-based factorization, or pairwise merging strategies \(Zhang et al\., [2024](https://arxiv.org/html/2606.18304#bib.bib9); Li et al\., [2025](https://arxiv.org/html/2606.18304#bib.bib16); Zhao et al\., [2025](https://arxiv.org/html/2606.18304#bib.bib23); Guo et al\., [2025b](https://arxiv.org/html/2606.18304#bib.bib24)\)\.
Despite this diversity, most existing works rank experts only at the granularity of entire experts and do not explicitly analyze the redundancy within each expert\. MoE\-I2 \(Yang et al\., [2024](https://arxiv.org/html/2606.18304#bib.bib8)\) reduces the parameters via low\-rank decomposition and assigns higher ranks to more important experts while using lower ranks for less important ones\. However, the speedup is limited: fragmenting the computation into small kernels makes it difficult to reach the peak throughput of one larger kernel, and introduces additional overhead in kernel launching, cache hit rate, and memory access\. Chen et al\. \(Chen et al\., [2025a](https://arxiv.org/html/2606.18304#bib.bib20)\) quantize all parameters to a low bitwidth, compare the reconstruction error, and assign higher bitwidths to experts that are more sensitive to quantization\. This strategy, however, is only feasible for methods with a small search space, e\.g\., 4/8 bitwidth, which is insufficient for fine\-grained expert\-wise compression budget allocation\.
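To make the coarseness of a two-level bitwidth search space concrete, the following is a minimal sketch of sensitivity-ranked allocation. It is not the procedure of Chen et al.; the function name `allocate_bitwidths`, the average-bit budget, and the toy error values are all assumptions made for illustration.

```python
def allocate_bitwidths(recon_error, avg_bits_budget, low=4, high=8):
    """Give `high` bits to the most quantization-sensitive experts.

    recon_error: hypothetical per-expert reconstruction error measured
                 after quantizing every expert to `low` bits.
    avg_bits_budget: target average bitwidth over all experts.
    """
    n = len(recon_error)
    # How many experts can be upgraded without exceeding the average budget.
    num_high = int((avg_bits_budget - low) * n / (high - low))
    num_high = max(0, min(n, num_high))
    # Rank experts from most to least sensitive.
    ranked = sorted(range(n), key=lambda e: recon_error[e], reverse=True)
    bits = [low] * n
    for e in ranked[:num_high]:
        bits[e] = high
    return bits


# Toy errors for four experts (illustrative values, not measured data).
errors = [0.9, 0.1, 0.5, 0.2]
print(allocate_bitwidths(errors, avg_bits_budget=5))
print(allocate_bitwidths(errors, avg_bits_budget=6))
```

With only two levels, each expert's budget can take just one of two values, so experts with quite different sensitivities end up with the same allocation. This is the granularity limit the paragraph above points to.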
Expert Importance\. A key driver behind MoE efficiency designs is the highly unbalanced contribution of different experts\. This has motivated a variety of methods for measuring expert importance\. \(1\) Router\-based statistics are widely used, including average gate scores, processed token counts \(expert hit rates\), and gate variations during fine\-tuning \(He et al\., [2025](https://arxiv.org/html/2606.18304#bib.bib10); Lee et al\., [2025](https://arxiv.org/html/2606.18304#bib.bib11); Xie et al\., [2024](https://arxiv.org/html/2606.18304#bib.bib12); Muzio et al\., [2024](https://arxiv.org/html/2606.18304#bib.bib13); Chowdhury et al\., [2024](https://arxiv.org/html/2606.18304#bib.bib14); Dong et al\., [2025](https://arxiv.org/html/2606.18304#bib.bib15); Li et al\., [2025](https://arxiv.org/html/2606.18304#bib.bib16); Lu et al\., [2024](https://arxiv.org/html/2606.18304#bib.bib17); Xu et al\., [2025](https://arxiv.org/html/2606.18304#bib.bib60); Chen et al\., [2025a](https://arxiv.org/html/2606.18304#bib.bib20)\)\. \(2\) Activation\-based metrics such as gate\-weighted outputs or activation saliency are also employed \(Dong et al\., [2025](https://arxiv.org/html/2606.18304#bib.bib15); Li et al\., [2025](https://arxiv.org/html/2606.18304#bib.bib16); Zhang et al\., [2025](https://arxiv.org/html/2606.18304#bib.bib22); Zhao et al\., [2025](https://arxiv.org/html/2606.18304#bib.bib23)\)\. \(3\) Loss\- or accuracy\-based criteria measure the performance drop when removing a particular expert or a subset of experts\. For example, they quantify the impact on reconstruction loss or downstream task performance after compression \(Liu et al\., [2024a](https://arxiv.org/html/2606.18304#bib.bib26); Yang et al\., [2024](https://arxiv.org/html/2606.18304#bib.bib8); Zhang et al\., [2024](https://arxiv.org/html/2606.18304#bib.bib9); Lu et al\., [2024](https://arxiv.org/html/2606.18304#bib.bib17)\)\.
\(4\) A learnable method has recently been proposed, which learns a set of importance scalars that are jointly optimized during fine\-tuning \(Bai et al\., [2025](https://arxiv.org/html/2606.18304#bib.bib7)\)\.
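As an illustration of the router-based statistics in category (1), the sketch below computes per-expert hit rates and average gate scores for a single MoE layer. It assumes softmax gating over all experts followed by top-k selection without renormalization; real routers may differ. The function name `router_statistics` and the toy logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def router_statistics(router_logits, top_k):
    """Return (hit_rate, mean_gate) per expert for one layer.

    router_logits: list of per-token lists, one logit per expert.
    hit_rate[e]:  fraction of tokens routed to expert e.
    mean_gate[e]: average gate score over tokens routed to e
                  (0.0 if the expert was never selected).
    """
    num_experts = len(router_logits[0])
    hits = [0] * num_experts
    gate_sum = [0.0] * num_experts
    for logits in router_logits:
        probs = softmax(logits)
        # Indices of the top-k experts for this token.
        chosen = sorted(range(num_experts),
                        key=lambda e: probs[e], reverse=True)[:top_k]
        for e in chosen:
            hits[e] += 1
            gate_sum[e] += probs[e]
    num_tokens = len(router_logits)
    hit_rate = [h / num_tokens for h in hits]
    mean_gate = [gate_sum[e] / hits[e] if hits[e] else 0.0
                 for e in range(num_experts)]
    return hit_rate, mean_gate


# Three calibration tokens, four experts (illustrative values only).
toy_logits = [[2.0, 1.0, 0.0, -1.0],
              [0.0, 3.0, 1.0, -2.0],
              [1.0, 0.5, 2.0, -3.0]]
hit_rate, mean_gate = router_statistics(toy_logits, top_k=2)
print(hit_rate, mean_gate)
```

Both statistics depend on the calibration tokens and on each layer's own router, which is worth keeping in mind when comparing values across layers.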
However, one limitation is that router\-based and performance\-based statistics are often not comparable across layers\. Routers and activations in different layers may follow very different decision patterns or distributions\. Loss values can be depth\-dependent and unstable under different experimental setups, including the source of calibration data, the loss function, and the tokenization scheme\. As a result, many previous methods resort to assigning a uniform compression ratio to all layers instead of performing cross\-layer importance comparison\.
A second limitation is that most existing metrics exhibit a high dynamic range and are primarily used to rank experts and entirely trim the least important $k$ experts, rather than to support precise allocation of compression ratios\. Only a few works explore redundancy within experts\. MoE\-I2 and Liu et al\. \(Liu et al\., [2024a](https://arxiv.org/html/2606.18304#bib.bib26)\) remove a small group of experts at a time and compare the resulting loss increase or accuracy drop in order to infer expert importance within each layer\. But such loss\- or accuracy\-based methods are only feasible for relatively small MoE models with a limited number of experts\. When the search space over layers and experts grows, even greedy or genetic strategies incur prohibitively high computational cost, which severely restricts their applicability to modern large\-scale MoE architectures\.
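The growth of this search space can be sketched with a short calculation. The counts below assume a simplified setting in which a fixed number of experts is removed from a single layer and every candidate requires one evaluation on calibration data; the function names and the removal counts are illustrative assumptions, not figures from the cited works.

```python
import math


def exhaustive_candidates(num_experts, num_removed):
    """Number of distinct expert subsets that could be removed from one layer."""
    return math.comb(num_experts, num_removed)


def greedy_evaluations(num_experts, num_removed):
    """Evaluations for greedy removal: at each step, try every remaining
    expert once and drop the one whose removal hurts least."""
    return sum(num_experts - step for step in range(num_removed))


# 8 experts per layer versus 128 experts per layer, removing a quarter.
for n, r in [(8, 2), (128, 32)]:
    print(n, r, exhaustive_candidates(n, r), greedy_evaluations(n, r))
```

Under these assumptions, removing 2 of 8 experts gives 28 subsets and 15 greedy evaluations per layer, whereas removing 32 of 128 experts needs 3600 greedy evaluations per layer, each of which is a forward pass over calibration data, and the exhaustive count is far beyond enumeration.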