Generic Expert Coverage for Pruning Sparse Mixture-of-Experts Language Models
Summary
Proposes Generic TB-Coverage, a coverage-aware expert pruning method for sparse Mixture-of-Experts language models that uses only generic text corpora for calibration and preserves cross-corpus expert coverage, improving accuracy and reducing perplexity degradation.
# Generic Expert Coverage for Pruning Sparse Mixture-of-Experts Language Models
Source: [https://arxiv.org/html/2607.01710](https://arxiv.org/html/2607.01710)
###### Abstract
Sparsely activated Mixture\-of\-Experts \(MoE\) language models contain substantial structured redundancy among routed experts, but pruning them without downstream calibration data remains challenging\. Existing expert\-pruning methods typically rely on a single aggregated importance score, which can bias the retained set toward experts favored by dominant calibration patterns\. We propose Generic TB\-Coverage, a coverage\-aware expert pruning method that uses only generic text corpora \(WikiText2 and C4\) for calibration\. Instead of collapsing expert utility into one score, our method profiles per\-expert utility separately on each corpus and enforces a fixed\-budget coverage rule that preserves high\-utility experts from each corpus before constructing the final pruning mask\. Across Qwen1\.5\-MoE\-A2\.7B and DeepSeek\-MoE\-16B\-Base at 25%, 50%, and 75% retention budgets, our method improves average accuracy on six common zero\-shot benchmarks over random pruning, REAP, and ExpertSparsity, while also reducing perplexity degradation on WikiText2 and C4\. The gains are largest under aggressive pruning \(25% and 50% retain\), suggesting that preserving cross\-corpus expert coverage is an effective generic\-data prior for MoE pruning\. Our improvements hold with fixed pruning budgets and no downstream calibration data\.
## 1 Introduction
Sparse Mixture\-of\-Experts \(MoE\) language models improve parameter efficiency by activating only a small subset of experts per token, achieving strong performance while keeping per\-token computation manageable \(Shazeer et al\. [2017](https://arxiv.org/html/2607.01710#bib.bib22); Fedus et al\. [2022](https://arxiv.org/html/2607.01710#bib.bib1); Jiang et al\. [2024](https://arxiv.org/html/2607.01710#bib.bib2)\)\. This sparse activation also creates structured redundancy: many experts can be removed with limited quality loss if the retained subset preserves the model’s core behaviors \(Dai et al\. [2024](https://arxiv.org/html/2607.01710#bib.bib4)\)\. The challenge is selecting that subset without access to downstream validation data\.
Existing expert\-pruning methods usually rank experts using a single scalar criterion, such as routing frequency, reconstruction utility, or REAP\-style importance \(Lu et al\. [2024](https://arxiv.org/html/2607.01710#bib.bib5)\)\. However, when calibration data are heterogeneous, scalar aggregation can over\-favor experts that dominate the average signal and under\-retain experts that support less frequent but still important generic language behaviors\. For example, consider two generic calibration corpora: WikiText2 \(encyclopedic, narrative text\) and C4 \(broad web text\)\. An expert that contributes strongly on encyclopedic patterns but not on web discourse will receive a moderate average score, even though it may be the *top\-ranked* expert on WikiText2\. A scalar ranking would likely discard it in favor of experts that score moderately on both corpora but are not critical for either\. This suggests that expert retention should preserve cross\-corpus coverage rather than only optimize a single aggregated importance score\.
This problem is especially acute when the goal is *general\-purpose pruning* without downstream calibration data\. A principled pruning method should not rely on downstream evaluation tasks during calibration, since good downstream numbers could then partly reflect task leakage rather than genuine generalization\. We study expert pruning in this practically important setting: *can we prune MoE experts using only generic language corpora while preserving broad downstream behavior?*
Our key observation is that even purely generic corpora do not exercise the same experts in the same way\. A pruning rule that preserves only globally dominant experts can therefore reduce expert coverage across generic behaviors before downstream transfer is ever evaluated\. We address this by profiling expert utility separately on multiple generic corpora and enforcing balanced protection across corpus\-specific rankings\.
We propose Generic TB\-Coverage \(Task\-Balanced Coverage\), a coverage\-aware expert pruning method\. The method profiles per\-expert importance separately for each generic calibration corpus, builds corpus\-specific expert rankings, and selects protected experts through a round\-robin coverage rule\. Protected experts are then merged into a reconstruction\-stable candidate mask, and the exact retention budget is restored by removing the lowest\-ranked unprotected experts\. The entire procedure uses only WikiText2 and C4 for calibration and requires no fine\-tuning\.
We evaluate on two MoE language models, Qwen1\.5\-MoE\-A2\.7B \(Bai et al\. [2024](https://arxiv.org/html/2607.01710#bib.bib3)\) and DeepSeek\-MoE\-16B\-Base \(Dai et al\. [2024](https://arxiv.org/html/2607.01710#bib.bib4)\), at three expert\-retention budgets \(25%, 50%, 75%\)\. The primary metric is the average accuracy over six common zero\-shot benchmarks \(ARC\-Challenge, ARC\-Easy, HellaSwag, PIQA, WinoGrande, BoolQ\); we additionally report MMLU, GSM8K, and Math500 as auxiliary stress tests\.
Our contributions are:
1. We identify a failure mode of scalar expert ranking for downstream\-free MoE pruning: dominant calibration patterns can over\-concentrate the retained expert set\.
2. We propose a simple coverage\-aware pruning rule that preserves high\-utility experts across multiple generic corpora before fixed\-budget mask construction\.
3. We show that this rule improves pruning quality on two open MoE LMs across three retention budgets and six zero\-shot benchmarks\.
4. We analyze random\-pruning variance and show that single\-seed random baselines can be misleading in sparse MoE pruning\.
## 2 Related Work
MoE model compression\. Sparse MoE architectures \(Shazeer et al\. [2017](https://arxiv.org/html/2607.01710#bib.bib22); Lepikhin et al\. [2021](https://arxiv.org/html/2607.01710#bib.bib20); Fedus et al\. [2022](https://arxiv.org/html/2607.01710#bib.bib1)\) route each token to a small subset of experts, creating natural opportunities for expert\-level pruning\. ST\-MoE \(Zoph et al\. [2022](https://arxiv.org/html/2607.01710#bib.bib19)\) studies training stability in sparse models, while DeepSeek\-MoE \(Dai et al\. [2024](https://arxiv.org/html/2607.01710#bib.bib4)\) introduces fine\-grained expert segmentation that increases expert count and hence pruning potential\. MoEfication \(Fu et al\. [2023](https://arxiv.org/html/2607.01710#bib.bib21)\) converts dense feed\-forward layers into MoE structures\.
Expert importance criteria\. Several criteria have been proposed for ranking expert importance\. Routing frequency counts how often each expert is selected by the router\. REAP\-style methods compute importance scores that combine routing probability with expert output magnitude\. ExpertSparsity \(Lu et al\. [2024](https://arxiv.org/html/2607.01710#bib.bib5)\) uses reconstruction\-based layerwise pruning, selecting experts to minimize the reconstruction error between the pruned and original MoE layer outputs\. These methods share the limitation that they optimize a single scalar objective, which can cause over\-concentration of retained experts\.
LLM pruning\. Beyond expert pruning, pruning methods for large language models include layer removal \(Men et al\. [2024](https://arxiv.org/html/2607.01710#bib.bib7)\), width pruning \(Ma et al\. [2023](https://arxiv.org/html/2607.01710#bib.bib8)\), and unstructured sparsity \(Frantar and Alistarh [2023](https://arxiv.org/html/2607.01710#bib.bib24); Sun et al\. [2024](https://arxiv.org/html/2607.01710#bib.bib25)\)\. SliceGPT \(Ashkboos et al\. [2024](https://arxiv.org/html/2607.01710#bib.bib6)\) applies orthogonal transformations to the weight matrices and then deletes rows and columns to shrink the hidden dimension\. These methods operate at the weight or layer level rather than the expert level and are complementary to our approach\.
Coverage and diversity in pruning\. The idea of preserving diversity in pruned models has been explored in network pruning for convolutional networks, where filter diversity criteria help maintain representational capacity\. To our knowledge, Generic TB\-Coverage is the first to explicitly introduce a coverage rule for expert\-level MoE pruning based on per\-corpus profiling\.
## 3 Method
### Problem Formulation
Consider a sparse MoE language model with $L$ MoE layers\. At each MoE layer $l$, the model has $N$ routed experts $\{e_1, \ldots, e_N\}$\. For each input token with hidden state $h_x$, the router selects the top\-$k$ experts and computes a weighted sum of their outputs:
$$\text{MoE}(h_x) = \sum_{e \in \text{TopK}(h_x)} g_e(x) \cdot f_e(h_x), \tag{1}$$
where $g_e(x)$ is the router’s softmax probability assigned to expert $e$ for token $x$ \(set to zero for non\-selected experts\), and $f_e(h_x)$ is the expert’s feed\-forward output on the hidden state $h_x$\.
Given a retain ratio $\rho \in (0,1]$, the pruning budget retains $K = \lfloor \rho N \rfloor$ routed experts per MoE layer\. The goal is to construct a binary expert mask $m_l \in \{0,1\}^N$ with exactly $K$ ones per layer such that the pruned model preserves broad language performance\. After pruning, the router selects among the retained experts only\.
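As an illustration of Eq\. \(1\) and of routing under a pruning mask, the following is a minimal NumPy sketch for a single token\. The linear router, the toy linear experts, and the function name `moe_forward` are hypothetical stand\-ins and not part of the method\. How gate values are treated after pruning is not specified in this section, so the sketch assumes that pruned experts are simply excluded from top\-$k$ selection and that the original softmax probabilities are reused without renormalization\.

```python
import numpy as np


def moe_forward(h, router_w, expert_ws, k, mask=None):
    """Eq. (1) for a single token with toy linear experts.

    h:         (d,) hidden state h_x
    router_w:  (N, d) router projection (hypothetical parameterization)
    expert_ws: (N, d, d) one weight matrix per expert, standing in for f_e
    mask:      optional (N,) 0/1 array; 1 marks a retained expert
    """
    logits = router_w @ h
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                       # softmax over all N experts

    ranking_scores = probs.copy()
    if mask is not None:
        # Assumption: pruned experts are simply excluded from top-k selection.
        ranking_scores[~np.asarray(mask, dtype=bool)] = -np.inf

    top = np.argsort(-ranking_scores)[:k]      # indices of the k selected experts
    out = np.zeros_like(h, dtype=float)
    for e in top:
        out += probs[e] * (expert_ws[e] @ h)   # g_e(x) * f_e(h_x)
    return out, top


rng = np.random.default_rng(0)
N, d, k = 8, 4, 2
h = rng.normal(size=d)
router_w = rng.normal(size=(N, d))
expert_ws = rng.normal(size=(N, d, d))

mask = np.array([1, 1, 1, 1, 0, 0, 0, 0])      # retain K = 4 of N = 8 experts
dense_out, dense_top = moe_forward(h, router_w, expert_ws, k)
pruned_out, pruned_top = moe_forward(h, router_w, expert_ws, k, mask)
```

With the mask applied, every index in `pruned_top` refers to a retained expert, which mirrors the statement that the router selects among the retained experts only\.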
### Generic Expert Profiling
The method uses two generic calibration corpora: WikiText2 \(Merity et al\. [2017](https://arxiv.org/html/2607.01710#bib.bib9)\) and C4 \(Raffel et al\. [2020](https://arxiv.org/html/2607.01710#bib.bib10)\)\. For each corpus $t \in \mathcal{T} = \{\text{WikiText2}, \text{C4}\}$, we run the model on calibration text and collect per\-expert statistics independently for each MoE layer\.
For each layer $l$, expert $e$, and calibration corpus $t$, we compute a REAP\-style utility score that combines the router weight with the expert output norm, averaged over all calibration tokens where expert $e$ is selected by the router:
$$s_t^{(l)}(e) = \mathbb{E}_{x \sim \mathcal{D}_t}\left[ g_e(x) \cdot \| f_e(h_x^{(l)}) \|_2 \;\big|\; e \in \text{TopK}(x) \right], \tag{2}$$
where $h_x^{(l)}$ is the hidden state at layer $l$, $g_e(x)$ is the router weight for expert $e$ on token $x$, and $f_e(h_x^{(l)})$ is the expert output\. The expectation is approximated by averaging over all tokens in the calibration set from corpus $t$ for which expert $e$ is among the top\-$k$ selected experts\. This score captures both how strongly the router activates expert $e$ \(through $g_e$\) and how much it contributes to the hidden state \(through $\| f_e \|_2$\)\. We profile each calibration corpus independently, producing a separate score vector $\mathbf{s}_t^{(l)} = [s_t^{(l)}(e_1), \ldots, s_t^{(l)}(e_N)]$ per MoE layer\.
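The conditional average in Eq\. \(2\) can be sketched as follows for one layer and one corpus\. The array layout and the name `reap_style_scores` are assumptions made for illustration; in practice the gate values and output norms would be collected with forward hooks during profiling\. Assigning a score of zero to an expert that is never selected is also an assumption, since the section does not say how that case is handled\.

```python
import numpy as np


def reap_style_scores(gates, out_norms, selected):
    """Per-expert utility for one layer and one corpus, following Eq. (2).

    gates:     (T, N) router weights g_e(x) for T calibration tokens
    out_norms: (T, N) expert output norms ||f_e(h_x)||_2
    selected:  (T, N) boolean, True where expert e is in TopK(x)
    Returns an (N,) vector; never-selected experts get 0 (assumption).
    """
    selected = np.asarray(selected, dtype=bool)
    contrib = np.where(selected, np.asarray(gates) * np.asarray(out_norms), 0.0)
    counts = selected.sum(axis=0)              # tokens routed to each expert
    totals = contrib.sum(axis=0)
    return np.divide(totals, counts,
                     out=np.zeros(counts.shape, dtype=float),
                     where=counts > 0)


# Three tokens, three experts, top-1 routing for brevity.
gates = np.array([[0.6, 0.4, 0.0],
                  [0.7, 0.3, 0.0],
                  [0.2, 0.8, 0.0]])
out_norms = np.array([[2.0, 1.0, 5.0],
                      [1.0, 1.0, 5.0],
                      [3.0, 0.5, 5.0]])
selected = np.array([[True, False, False],
                     [True, False, False],
                     [False, True, False]])

scores = reap_style_scores(gates, out_norms, selected)
```

Running the function once per corpus yields the separate score vectors that the coverage rule consumes\.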
### Coverage\-Aware Expert Protection
Instead of ranking experts by their average score $\bar{s}^{(l)}(e) = \frac{1}{|\mathcal{T}|} \sum_t s_t^{(l)}(e)$, Generic TB\-Coverage builds per\-corpus rankings and selects protected experts by round\-robin across corpora\.
For each MoE layer $l$, we define a per\-layer protection budget $B$ \(a hyperparameter controlling how many experts receive coverage protection\), chosen such that $B \leq K$ for every layer and retention budget, guaranteeing feasibility\. The round\-robin procedure works as follows:
1. Sort all $N$ experts by $s_{\text{WikiText2}}^{(l)}(e)$ in descending order, producing ranking $\pi_{\text{Wiki}}$\.
2. Sort all $N$ experts by $s_{\text{C4}}^{(l)}(e)$ in descending order, producing ranking $\pi_{\text{C4}}$\.
3. Starting with an empty protected set $P_l$, alternate between the two rankings: at each step, take the highest\-ranked expert from the current corpus’s ranking that is not yet in $P_l$, and add it\.
4. Stop when $|P_l| = B$\.
This ensures that both corpora contribute to the protected set regardless of the absolute score magnitudes\. An expert that is top\-ranked on WikiText2 but mid\-ranked on C4 will still be protected, even though its average score might not place it in the top\-$K$ by a single\-criterion ranking\. If both rankings agree on the top experts \(high overlap\), the round\-robin degenerates to standard top\-$B$ selection, so the rule introduces no harm in that case\.
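The round\-robin steps above can be sketched in a few lines of Python\. The function name `round_robin_protect`, the tie\-breaking by expert index, and the toy scores are assumptions for illustration only; the scores are invented to show a case in which averaging would discard the top\-ranked expert of each corpus\.

```python
def round_robin_protect(scores_by_corpus, budget):
    """Select `budget` protected experts by alternating over per-corpus rankings.

    scores_by_corpus: dict mapping corpus name -> list of per-expert scores.
    Corpora are visited in insertion order; ties break by expert index.
    """
    rankings = [sorted(range(len(s)), key=lambda e: -s[e])
                for s in scores_by_corpus.values()]
    n = len(rankings[0])
    if any(len(r) != n for r in rankings):
        raise ValueError("all corpora must score the same set of experts")
    if budget > n:
        raise ValueError("protection budget exceeds the number of experts")

    protected, seen = [], set()
    cursors = [0] * len(rankings)
    turn = 0
    while len(protected) < budget:
        r = turn % len(rankings)
        # Skip experts that another corpus has already protected.
        while rankings[r][cursors[r]] in seen:
            cursors[r] += 1
        expert = rankings[r][cursors[r]]
        seen.add(expert)
        protected.append(expert)
        turn += 1
    return protected


# Toy scores (hypothetical): expert 0 leads on WikiText2, expert 1 leads on C4.
toy_scores = {
    "WikiText2": [0.90, 0.10, 0.55, 0.50],
    "C4":        [0.05, 0.60, 0.55, 0.50],
}
protected = round_robin_protect(toy_scores, budget=2)

# Baseline for comparison: top-2 experts by average score.
avg = [sum(col) / len(col) for col in zip(*toy_scores.values())]
top_by_average = sorted(range(len(avg)), key=lambda e: -avg[e])[:2]
```

In this toy case the average ranking keeps experts 2 and 3, while the round\-robin rule protects experts 0 and 1, the per\-corpus leaders\. When both rankings are identical, the same function returns the top\-$B$ experts of the shared ranking\.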
### Budget\-Preserving Mask Construction
Algorithm 1: Generic TB\-Coverage Expert Pruning
Input: MoE model $\mathcal{M}$; corpora $\mathcal{T} = \{\text{WikiText2}, \text{C4}\}$; retain ratio $\rho$; protection budget $B$
Output: Mask $m$ with $K = \lfloor \rho N \rfloor$ retained experts per layer
1. for each MoE layer $l$ do
2. for each corpus $t \in \mathcal{T}$ do
3. Profile $\mathcal{M}$ on $t$; compute scores $s_t(e)$
4. end for
5. Build per\-corpus rankings by $s_t(e)$
6. Select $P_l$ by round\-robin over rankings \($|P_l| = B$\)
7. Init mask from reconstruction candidate $\hat{m}_l$
8. Merge: set $m_l(e) = 1$ for all $e \in P_l$
9. if $\sum_e m_l(e) > K$ then
10. Remove lowest\-ranked unprotected by $\bar{s}^{(l)}$ until $K$ met
11. end if
12. end for
13. return mask $m$
The final mask keeps exactly $K$ experts per MoE layer\. We initialize from a reconstruction\-stable candidate mask $\hat{m}_l$ \(obtained via layerwise reconstruction minimization on C4\) that already retains $K$ experts per layer\. We merge the protected set $P_l$ into this candidate: set $m_l(e) = 1$ for all $e \in P_l$ \(protected experts are always retained\), then if the union exceeds $K$, remove the lowest\-ranked *unprotected* experts in order of ascending average score $\bar{s}^{(l)}(e)$ until exactly $K$ experts remain\. Protected experts are never removed; the budget is restored entirely by dropping unprotected reconstruction\-selected experts\.
This procedure is budget\-preserving: the final mask always retains exactly $K$ experts per MoE layer\. Algorithm [1](https://arxiv.org/html/2607.01710#alg1) summarizes the full method\.
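A minimal sketch of the merge\-and\-trim step for one layer is given below\. It assumes the reconstruction candidate is already available as a set of $K$ expert indices; how that candidate is computed is outside the sketch\. The name `build_mask` and the toy inputs are hypothetical\.

```python
def build_mask(candidate, protected, avg_scores, k):
    """Merge protected experts into a K-expert candidate, then trim back to K.

    candidate:  indices kept by the reconstruction candidate (exactly k of them)
    protected:  indices chosen by the coverage rule (at most k of them)
    avg_scores: per-expert average score used to order removals
    """
    candidate, protected = set(candidate), set(protected)
    if len(candidate) != k:
        raise ValueError("candidate mask must retain exactly k experts")
    if len(protected) > k:
        raise ValueError("protection budget B must satisfy B <= K")

    kept = candidate | protected
    # Only unprotected experts may be dropped, lowest average score first.
    removable = sorted(kept - protected, key=lambda e: avg_scores[e])
    for expert in removable[: len(kept) - k]:
        kept.remove(expert)
    return [1 if e in kept else 0 for e in range(len(avg_scores))]


# Toy layer with N = 6 experts and K = 3 (all values hypothetical).
avg_scores = [0.475, 0.35, 0.55, 0.50, 0.20, 0.10]
mask = build_mask(candidate={2, 3, 4}, protected=[0, 1],
                  avg_scores=avg_scores, k=3)
```

Here the union has five experts, so the two lowest\-scoring unprotected experts \(4, then 3\) are dropped and the final mask retains experts 0, 1, and 2\. Because the candidate holds $K$ experts and $B \leq K$, there are always enough unprotected experts to remove\.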
### Complexity Analysis
With $|\mathcal{T}|$ calibration corpora, the additional profiling cost is $|\mathcal{T}|$ forward\-only passes over the calibration set \(no gradient computation\)\. With two corpora and approximately 512 sequences of length 1024 tokens, this is modest relative to model training\. Mask construction involves per\-layer sorting of $N$ experts \($O(N \log N)$ per layer\) and round\-robin selection \($O(B)$ per layer\), which is negligible relative to the profiling cost\. The method stores only per\-expert scalar scores per layer and does not require post\-pruning fine\-tuning\.
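To make the profiling cost concrete, the arithmetic can be written out as a short sketch\. It assumes that the 512 sequences of 1024 tokens are drawn from each corpus, as described in the calibration setup\. The helper names are hypothetical, and the layer count passed to `stored_scores` is a placeholder rather than a figure taken from the paper\.

```python
def calibration_tokens(num_corpora: int = 2, sequences: int = 512,
                       seq_len: int = 1024) -> int:
    """Tokens processed during profiling: one forward-only pass per corpus."""
    return num_corpora * sequences * seq_len


def stored_scores(num_corpora: int, num_moe_layers: int, num_experts: int) -> int:
    """Scalars kept after profiling: one score per corpus, layer, and expert."""
    return num_corpora * num_moe_layers * num_experts


total_tokens = calibration_tokens()
# num_moe_layers=24 is a placeholder, not a figure taken from the paper.
example_scores = stored_scores(num_corpora=2, num_moe_layers=24, num_experts=60)
```

Under these assumptions profiling touches 1,048,576 tokens in total, and the stored statistics amount to a few thousand scalars\.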
## 4 Experiments
### Setup
Models\. We evaluate on two sparse MoE language models:
- Qwen1\.5\-MoE\-A2\.7B \(Bai et al\. [2024](https://arxiv.org/html/2607.01710#bib.bib3)\): 60 routed experts per MoE layer, with top\-4 routing\.
- DeepSeek\-MoE\-16B\-Base \(Dai et al\. [2024](https://arxiv.org/html/2607.01710#bib.bib4)\): 64 routed experts per MoE layer, with top\-6 routing\.
Retain budgets\. We report three expert\-retention ratios: 25%, 50%, and 75%\. For Qwen, this corresponds to retaining 15, 30, or 45 out of 60 routed experts per layer\. For DeepSeek, this corresponds to retaining 16, 32, or 48 out of 64 routed experts per layer\.
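The retained\-expert counts follow directly from $K = \lfloor \rho N \rfloor$\. The short sketch below reproduces them from the expert counts and routing widths stated above; the dictionary layout and the helper name `retained_count` are illustrative choices\.

```python
import math

# (routed experts per MoE layer, router top-k), as stated in the setup.
MODELS = {
    "Qwen1.5-MoE-A2.7B": (60, 4),
    "DeepSeek-MoE-16B-Base": (64, 6),
}
RATIOS = (0.25, 0.50, 0.75)


def retained_count(rho: float, n_experts: int) -> int:
    """K = floor(rho * N) routed experts kept per MoE layer."""
    return math.floor(rho * n_experts)


budgets = {name: [retained_count(r, n) for r in RATIOS]
           for name, (n, _) in MODELS.items()}

# Sanity check: the router's top-k never exceeds the smallest retained set.
routing_feasible = all(top_k <= min(budgets[name])
                       for name, (_, top_k) in MODELS.items())
```

The resulting budgets are 15, 30, and 45 experts for Qwen and 16, 32, and 48 for DeepSeek, so top\-4 and top\-6 routing remain feasible even at 25% retain\.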
Calibration\. Our method uses WikiText2 \(Merity et al\. [2017](https://arxiv.org/html/2607.01710#bib.bib9)\) and C4 \(Raffel et al\. [2020](https://arxiv.org/html/2607.01710#bib.bib10)\) only\. We draw 512 sequences of length 1024 tokens from each corpus\. No downstream benchmark data is used during calibration\.
Baselines\. We compare against three baselines:
- Random pruning: Uniform random expert removal per layer, reported as mean and standard deviation over six seeds \(0, 1, 2, 3, 4, 42\)\.
- Original REAP: Direct top\-$k$ expert selection using REAP\-style importance scores\. For a fair comparison with a method that does use task\-specific calibration, we run REAP with Evol\-CodeAlpaca calibration \(128 texts, sequence length 2048\)\.
- ExpertSparsity \(Lu et al\. [2024](https://arxiv.org/html/2607.01710#bib.bib5)\): Reconstruction\-based layerwise expert pruning using C4, adapted to our model wrappers\. We follow the original paper’s layerwise reconstruction protocol\.
We note that REAP uses domain\-specific calibration data while our method uses only generic corpora, making the comparison favorable to REAP in terms of calibration informativeness\.
Evaluation\. We evaluate on two language modeling metrics \(WikiText2 PPL, C4 PPL, computed on the standard validation splits using sliding\-window evaluation\) and six primary downstream tasks: ARC\-Challenge and ARC\-Easy \(Clark et al\. [2018](https://arxiv.org/html/2607.01710#bib.bib11)\), HellaSwag \(Zellers et al\. [2019](https://arxiv.org/html/2607.01710#bib.bib12)\), PIQA \(Bisk et al\. [2020](https://arxiv.org/html/2607.01710#bib.bib13)\), WinoGrande \(Sakaguchi et al\. [2021](https://arxiv.org/html/2607.01710#bib.bib14)\), and BoolQ \(Clark et al\. [2019](https://arxiv.org/html/2607.01710#bib.bib15)\)\. We report *Common Avg* as the unweighted average over these six tasks\. We additionally report MMLU \(Hendrycks et al\. [2021](https://arxiv.org/html/2607.01710#bib.bib16)\), GSM8K \(Cobbe et al\. [2021](https://arxiv.org/html/2607.01710#bib.bib17)\), and Math500 \(Lightman et al\. [2023](https://arxiv.org/html/2607.01710#bib.bib18)\) as auxiliary stress tests for knowledge and reasoning\. All evaluations use full validation/test sets with zero\-shot prompting \(no few\-shot examples\)\.
### Main Results
Table [1](https://arxiv.org/html/2607.01710#S4.T1) presents the main results\. Generic TB\-Coverage achieves the highest Common Avg across both models and all three retain budgets\.
Table 1: Main results on Qwen1\.5\-MoE\-A2\.7B and DeepSeek\-MoE\-16B\-Base\. Random pruning reports mean ± std over six seeds\. Other methods are deterministic\. Bold: best Common Avg per group\.
Figure 1: Common Avg accuracy across methods and retain ratios on both models\. Generic TB\-Coverage \(green\) consistently achieves the highest average accuracy across all settings\.
On Qwen1\.5\-MoE\-A2\.7B, Generic TB\-Coverage improves Common Avg over Paper ExpertSparsity by \+0\.041, \+0\.080, and \+0\.007 at 25%, 50%, and 75% retain, respectively\. On DeepSeek\-MoE\-16B\-Base, the improvements are \+0\.044, \+0\.035, and \+0\.006\. The gains are largest under aggressive pruning \(25% and 50% retain\), where retaining the right experts matters most\. At 75% retain, all methods perform closer together since only a quarter of experts are removed\.
Figure 2: WikiText2 and C4 perplexity \(log scale, lower is better\) across methods and retain ratios\. Generic TB\-Coverage achieves substantially lower PPL than all baselines, with the largest gains under aggressive pruning\.
Figure [1](https://arxiv.org/html/2607.01710#S4.F1) visualizes the Common Avg comparison, and Figure [2](https://arxiv.org/html/2607.01710#S4.F2) shows the PPL results\. The perplexity improvements are substantial\. For DeepSeek at 25% retain, C4 PPL drops from 1734\.92 \(Original REAP\) and 615\.18 \(Paper ExpertSparsity\) to 423\.72 under Generic TB\-Coverage\. Similarly, WikiText2 PPL drops from 544\.15 and 176\.63 to 137\.03\. These reductions indicate that coverage\-aware expert selection not only preserves downstream task accuracy but also stabilizes the language modeling distribution\.
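The relative size of these perplexity reductions can be checked with a few lines of arithmetic\. The sketch below uses only the DeepSeek 25%\-retain values quoted in the paragraph above; the helper name `relative_reduction` and the dictionary layout are illustrative\.

```python
def relative_reduction(baseline: float, ours: float) -> float:
    """Fraction by which `ours` lowers perplexity relative to `baseline`."""
    return (baseline - ours) / baseline


# DeepSeek-MoE-16B-Base at 25% retain, values quoted in the text.
deepseek_25 = {
    "C4": {"Original REAP": 1734.92, "Paper ExpertSparsity": 615.18,
           "Generic TB-Coverage": 423.72},
    "WikiText2": {"Original REAP": 544.15, "Paper ExpertSparsity": 176.63,
                  "Generic TB-Coverage": 137.03},
}

reductions = {
    corpus: {
        baseline: relative_reduction(ppl, values["Generic TB-Coverage"])
        for baseline, ppl in values.items()
        if baseline != "Generic TB-Coverage"
    }
    for corpus, values in deepseek_25.items()
}
```

With these numbers, the reduction relative to Paper ExpertSparsity is roughly 31% on C4 and 22% on WikiText2, and roughly 75% on both corpora relative to Original REAP\.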
### Random Multi\-Seed Analysis
Random pruning is commonly used as a baseline, but single\-seed results can be misleading\. Table [2](https://arxiv.org/html/2607.01710#S4.T2) shows that random pruning exhibits high variance across seeds\.
Table 2: Random pruning \(mean ± std over 6 seeds\) vs\. Generic TB\-Coverage\. Single\-seed random results can appear competitive but are not reliable\.
At 75% retain on Qwen, a single random seed \(seed 42\) achieves Common Avg of 0\.669, which appears competitive with our method’s 0\.662\. However, the multi\-seed mean is 0\.645 ± 0\.041, and the worst seed drops to 0\.604\. Similarly, on DeepSeek at 75%, random PPL ranges from 10\.7 to 31\.6 across seeds\. This variance shows that single\-seed random pruning is not a reliable baseline and supports the case for principled, coverage\-aware selection\.
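A multi\-seed random baseline of this kind can be sketched as follows\. The seed list is the one stated in the setup; everything else is an assumption made for illustration, including the helper names, the use of one generator per seed for a single layer, and the choice of the sample standard deviation \(`ddof=1`\)\. Per\-seed accuracies are not listed in this section, so the aggregation helper is shown without data\.

```python
import numpy as np

SEEDS = (0, 1, 2, 3, 4, 42)


def random_mask(n_experts: int, k: int, seed: int) -> np.ndarray:
    """Keep k experts chosen uniformly at random without replacement."""
    rng = np.random.default_rng(seed)
    keep = rng.choice(n_experts, size=k, replace=False)
    mask = np.zeros(n_experts, dtype=int)
    mask[keep] = 1
    return mask


def mean_std(values):
    """Mean and sample standard deviation across seeds (ddof=1 is an assumption)."""
    arr = np.asarray(values, dtype=float)
    return arr.mean(), arr.std(ddof=1)


# One random mask per seed for a Qwen-sized layer at 75% retain (45 of 60).
masks = {seed: random_mask(n_experts=60, k=45, seed=seed) for seed in SEEDS}
```

Each mask retains exactly 45 experts, and evaluating the pruned model once per seed gives the values that `mean_std` would aggregate\.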
### Discussion
Gains under aggressive pruning\. The improvement from Generic TB\-Coverage is largest at 25% and 50% retain, where retaining the right experts is critical\. At 75% retain, most experts are preserved and the gap between methods narrows\. This pattern is consistent across both models and suggests that coverage\-aware selection is most valuable when the pruning budget is tight\. Our method addresses selection bias in expert retention, but does not optimize routing adaptation after pruning\.
PPL stability\. Generic TB\-Coverage achieves the lowest perplexity in every setting\. The PPL improvements are especially large at 25% retain, where the method reduces WikiText2 PPL by 9%–24% and C4 PPL by 28%–39% over the next\-best baseline \(ExpertSparsity\)\. The absolute PPL values at 25% retain are high \(e\.g\., C4 PPL of 423\.72 for DeepSeek\), which reflects the severity of aggressive pruning rather than an evaluation artifact\. Under aggressive pruning, the language model distribution degrades substantially for all methods, and our method degrades least\.
Reasoning benchmarks\. GSM8K scores are near zero across all methods and retain budgets, and Math500 scores are similarly low\. We include these results as stress tests but do not claim that Generic TB\-Coverage preserves mathematical reasoning ability\. The low scores likely reflect the limited reasoning capacity of the base models rather than a shortcoming of the pruning method\.
Per\-corpus protection budget\. The current experiments use fixed protection budgets \(20 experts for Qwen, 24 for DeepSeek\) across all retain ratios\. These values were selected to provide sufficient coverage without dominating the mask at high retain ratios\. Sensitivity analysis of the protection budget is an important direction for future work\. We note that the coverage rule’s benefit could be further isolated by comparing against simple multi\-corpus aggregation baselines \(e\.g\., mean or max of per\-corpus scores\); we leave this systematic ablation to future work\.
## 5 Conclusion
We have presented Generic TB\-Coverage, a coverage\-aware expert pruning method for sparse MoE language models\. By profiling expert utility separately on WikiText2 and C4 and protecting experts through a round\-robin coverage rule, the method preserves generic language behaviors that single\-criterion methods may discard\. Across two MoE models and three retain budgets, Generic TB\-Coverage improves downstream average accuracy and language modeling perplexity over random pruning, direct REAP, and reconstruction\-based ExpertSparsity, with the largest gains under aggressive pruning\.
Limitations\. Our method is intentionally simple and static: it does not adapt experts after pruning, learn corpus weights, or optimize a formal diversity objective\. The current study uses only two generic corpora and two open\-base MoE models of moderate scale \(2\.7B and 16B parameters\); whether the same coverage rule scales to instruction\-tuned models, larger expert counts, or pruning\-plus\-quantization settings remains open\. GSM8K and Math500 scores are near zero across all methods, so we cannot assess the method’s effect on reasoning ability\. The protection budget $B$ is fixed across retain ratios and requires manual selection\. Finally, we do not measure end\-to\-end inference latency or memory footprint after pruning\.
## References
- S\. Ashkboos, M\. Croci, M\. G\. d\. Nascimento, J\. Hensman, D\. James, and S\. Hoeche \(2024\) SliceGPT: compress large language models by deleting and optimizing layers\. In Proceedings of the 41st International Conference on Machine Learning\.
- J\. Bai, S\. Bai, Y\. Chu, Z\. Cui, K\. Dang, X\. Deng, Y\. Fan, W\. Ge, Y\. Han, F\. Huang, et al\. \(2024\) Qwen1\.5\-MoE: matching 7B model performance with 1/3 activated parameters\.
- Y\. Bisk, R\. Zellers, J\. Gao, and Y\. Choi \(2020\) PIQA: reasoning about physical commonsense in natural language\. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol\. 34, pp\. 7432–7439\.
- C\. Clark, K\. Lee, M\. Chang, T\. Kwiatkowski, M\. Collins, and K\. Toutanova \(2019\) BoolQ: exploring the surprising difficulty of natural yes/no questions\. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics\.
- P\. Clark, I\. Cowhey, O\. Etzioni, T\. Khot, A\. Sabharwal, C\. Schoenick, and O\. Tafjord \(2018\) Think you have solved question answering? Try ARC, the AI2 reasoning challenge\.
- K\. Cobbe, V\. Kosaraju, M\. Bavarian, M\. Chen, H\. Jun, L\. Kaiser, M\. Plappert, J\. Tworek, J\. Hilton, R\. Nakano, C\. Hesse, and J\. Schulman \(2021\) Training verifiers to solve math word problems\.
- D\. Dai, C\. Deng, C\. Zhao, R\. X\. Xu, H\. Gao, D\. Chen, J\. Li, W\. Zeng, X\. Zhang, Y\. Wang, et al\. \(2024\) DeepSeek\-MoE: towards ultimate expert specialization in mixture\-of\-experts language models\. In Proceedings of the 41st International Conference on Machine Learning\.
- W\. Fedus, B\. Zoph, and N\. Shazeer \(2022\) Switch transformers: scaling to trillion parameter models with simple and efficient sparsity\. Journal of Machine Learning Research 23\(120\), pp\. 1–39\.
- E\. Frantar and D\. Alistarh \(2023\) SparseGPT: massive language models can be accurately pruned in one\-shot\. In Proceedings of the 40th International Conference on Machine Learning\.
- Z\. Fu, Q\. Zhang, X\. Liu, Z\. Liu, et al\. \(2023\) Go beyond the impossible: MoEfication of transformer models\.
- D\. Hendrycks, C\. Burns, S\. Basart, A\. Zou, M\. Mazeika, D\. Song, and J\. Steinhardt \(2021\) Measuring massive multitask language understanding\. In Proceedings of the International Conference on Learning Representations\.
- A\. Q\. Jiang, A\. Sablayrolles, A\. Roux, A\. Mensch, B\. Savary, C\. Bamford, D\. S\. Chaplot, D\. d\. l\. Casas, E\. B\. Hanna, F\. Bressand, et al\. \(2024\) Mixtral of experts\.
- D\. Lepikhin, H\. Lee, Y\. Xu, D\. Chen, O\. Firat, Y\. Huang, M\. Krikun, N\. Shazeer, and Z\. Chen \(2021\) GShard: scaling giant models with conditional computation and automatic sharding\. In Proceedings of the International Conference on Learning Representations\.
- H\. Lightman, V\. Kosaraju, Y\. Burda, H\. Edwards, B\. Baker, T\. Lee, J\. Leike, J\. Schulman, I\. Sutskever, and K\. Cobbe \(2023\) Let’s verify step by step\.
- X\. Lu, A\. Huang, Y\. Liu, W\. Qiu, L\. Zhou, J\. Li, J\. Bian, G\. Li, and Z\. Li \(2024\) Not all experts are equal: efficient expert pruning and skipping for mixture of experts large language models\. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing\.
- X\. Ma, G\. Fang, and X\. Wang \(2023\) LLM\-Pruner: on the structural pruning of large language models\. In Advances in Neural Information Processing Systems\.
- X\. Men, M\. He, Q\. Xu, Y\. Wang, B\. Luo, and M\. Zhang \(2024\) ShortGPT: layers in large language models are more redundant than you expect\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics\.
- S\. Merity, C\. Xiong, J\. Bradbury, and R\. Socher \(2017\) Pointer sentinel mixture models\. In Proceedings of the 5th International Conference on Learning Representations\.
- C\. Raffel, N\. Shazeer, A\. Roberts, K\. Lee, S\. Narang, M\. Matena, Y\. Zhou, W\. Li, and P\. J\. Liu \(2020\) Exploring the limits of transfer learning with a unified text\-to\-text transformer\. Vol\. 21, pp\. 1–67\.
- K\. Sakaguchi, R\. Le Bras, C\. Bhagavatula, and Y\. Choi \(2021\) WinoGrande: an adversarial Winograd schema challenge at scale\. Vol\. 64, pp\. 99–106\.
- N\. Shazeer, A\. Mirhoseini, K\. Maziarz, A\. Davis, Q\. Le, G\. Hinton, and J\. Dean \(2017\) Outrageously large neural networks: the sparsely\-gated mixture\-of\-experts layer\. In Proceedings of the International Conference on Learning Representations\.
- M\. Sun, Z\. Liu, A\. Bair, and J\. Z\. Kolter \(2024\) A simple and effective pruning approach for large language models\. In Proceedings of the International Conference on Learning Representations\.
- R\. Zellers, A\. Holtzman, Y\. Bisk, A\. Farhadi, and Y\. Choi \(2019\) HellaSwag: can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics\.
- B\. Zoph, I\. Bello, S\. Kumar, N\. Du, Y\. Huang, J\. Dean, N\. Shazeer, and W\. Fedus \(2022\) ST\-MoE: designing stable and transferable sparse expert models\. arXiv preprint arXiv:2202\.08906\.