Less is MoE: Trimming Experts in Domain-Specialist Language Models

arXiv cs.LG Papers

Summary

This paper introduces Fisher-MoE, a method that compresses Mixture-of-Experts models by trimming intermediate dimensions within FFN layers using Fisher importance, achieving 45% weight memory reduction and 21% throughput improvement without significant capability loss.

arXiv:2606.05538v1 Announce Type: new Abstract: Mixture-of-Experts (MoE) models achieve strong performance through conditional computation, but their large parameter footprint poses deployment challenges. Prior MoE compression approaches catastrophically fail when evaluated on general-purpose benchmarks beyond commonsense reasoning. We trace this failure to the granularity of compression: important capabilities are distributed across experts but concentrated in FFN sparse intermediate dimensions. To identify these dimensions, we use Fisher importance which outperforms activation-, router-score-, and magnitude-based alternatives, and identifies tiny sets of task-critical dimensions: in Qwen1.5-MoE, removing as few as 12 of 1.35M routed-FFN intermediate dimensions collapses GSM8K accuracy while largely preserving factual-knowledge performance. Building on this, we propose Fisher-MoE, which operates within FFN to remove intermediate dimensions ranked by Fisher importance. At the same 50% MoE compression ratio, Fisher-MoE preserves model capability, while reducing weight memory by ~45% and improving inference throughput by 21%. These findings suggest intermediate dimension granularity is an effective unit for both compression and ranking where capability concentrates in MoE models.

Cached at: 06/05/26, 08:11 AM

# Less is MoE: Trimming Experts in Domain-Specialist Language Models
Source: [https://arxiv.org/html/2606.05538](https://arxiv.org/html/2606.05538)
Haoze He¹\*, Xinkai Zou²\*, Xuan Jiang³, Xingyuan Ding¹, Ao Qu³, Juncheng Billy Li¹, Heather Miller¹

¹Carnegie Mellon University, ²UCSD, ³MIT

{haozeh, xingyuad, junchenl, heather.miller}@cs.cmu.edu, x9zou@ucsd.edu, {xuanj, qua}@mit.du

\*Equal contribution.

###### Abstract

Mixture-of-Experts (MoE) models achieve strong performance through conditional computation, but their large parameter footprint poses deployment challenges. Prior MoE compression approaches catastrophically fail when evaluated on general-purpose benchmarks beyond commonsense reasoning. We trace this failure to the granularity of compression: important capabilities are distributed across experts but concentrated in FFN sparse intermediate dimensions. To identify these dimensions, we use Fisher importance, which outperforms activation-, router-score-, and magnitude-based alternatives and identifies tiny sets of task-critical dimensions: in Qwen1.5-MoE, removing as few as 12 of 1.35M routed-FFN intermediate dimensions collapses GSM8K accuracy while largely preserving factual-knowledge performance. Building on this, we propose Fisher-MoE, which operates within FFN to remove intermediate dimensions ranked by Fisher importance. At the same 50% MoE compression ratio, Fisher-MoE preserves model capability, while reducing weight memory by ~45% and improving inference throughput by 21%. These findings suggest intermediate dimension granularity is an effective unit for both compression and ranking where capability concentrates in MoE models.


## 1 Introduction

Mixture-of-Experts (MoE) models have emerged as a dominant paradigm for scaling language model capacity while maintaining efficient inference through conditional computation (Shazeer et al., [2017](https://arxiv.org/html/2606.05538#bib.bib63); Lepikhin et al., [2021](https://arxiv.org/html/2606.05538#bib.bib64); Fedus et al., [2022](https://arxiv.org/html/2606.05538#bib.bib65)). Conditional computation lets models with tens of billions of parameters run inference efficiently, but their large total parameter footprint still poses significant challenges for deployment in terms of memory, storage, and serving.

To reduce this footprint, prior work compresses MoE models by removing or merging experts based on heuristic importance metrics, such as activation frequency (Muzio et al., [2024](https://arxiv.org/html/2606.05538#bib.bib38); Lu et al., [2024a](https://arxiv.org/html/2606.05538#bib.bib39); Chen et al., [2022](https://arxiv.org/html/2606.05538#bib.bib41)), router scores (Xie et al., [2024](https://arxiv.org/html/2606.05538#bib.bib42); Gu et al., [2025](https://arxiv.org/html/2606.05538#bib.bib48)), or weight magnitudes (Lee et al., [2024](https://arxiv.org/html/2606.05538#bib.bib43); Yang et al., [2024b](https://arxiv.org/html/2606.05538#bib.bib40); Li et al., [2023](https://arxiv.org/html/2606.05538#bib.bib49); Chen et al., [2024](https://arxiv.org/html/2606.05538#bib.bib50)). Despite differences in these metrics, existing approaches share a common design choice: compression is performed at the granularity of entire experts. Moreover, existing methods are primarily evaluated on commonsense reasoning benchmarks, which we find to be unstable and weak indicators of compression quality (see Appendix [I](https://arxiv.org/html/2606.05538#A9)). We instead evaluate on more challenging general-purpose benchmarks spanning mathematical reasoning, code generation, knowledge, and multilingual understanding, following the evaluation settings in official technical reports (Qwen Team, [2024](https://arxiv.org/html/2606.05538#bib.bib66); Yang et al., [2025](https://arxiv.org/html/2606.05538#bib.bib53)). The picture changes dramatically. Under a fair, controlled comparison using a unified compression framework (He et al., [2025b](https://arxiv.org/html/2606.05538#bib.bib46)), we evaluate activation-, score-, and magnitude-based methods at a fixed MoE compression ratio $p=50\%$ (the fraction of routed-expert FFN parameters removed; defined in §[3.1](https://arxiv.org/html/2606.05538#S3.SS1)). As shown in Figure [2](https://arxiv.org/html/2606.05538#S2.F2), all existing expert-level approaches suffer catastrophic performance collapse on benchmarks such as GSM8K, HumanEval, MBPP, and MATH.

![Refer to caption](https://arxiv.org/html/2606.05538v1/x1.png)

Figure 1: Intermediate dimension compression of a single MoE expert FFN. Edge colors encode the Fisher importance score $s_{i,j}^{\mathrm{dim}}$ of each intermediate dimension (red = high, blue = low). The bottom $50\%$ of dimensions by Fisher score are removed (*faded*), reducing $W_i^{\mathrm{gate}}, W_i^{\mathrm{up}} \in \mathbb{R}^{d_{\mathrm{ff}} \times d}$ and $W_i^{\mathrm{down}} \in \mathbb{R}^{d \times d_{\mathrm{ff}}}$ to $\hat{d}_{\mathrm{ff}} = d_{\mathrm{ff}}/2$ without discarding any expert.

We trace the failure of existing MoE compression methods to two factors. First, inaccurate importance metrics fail to estimate the importance of the parameters. Second, compression at an overly coarse granularity assumes capability is localized at the expert level, whereas in reality it is distributed across experts but concentrated in a small subset of intermediate dimensions. Our contribution is not the Fisher information itself but the choice of the intermediate dimension as the attributable unit in an MoE, together with the aggregation of parameter-level Fisher scores into per-dimension scores for each expert. We use Fisher importance as the underlying scoring tool to localize this attribution (§[2](https://arxiv.org/html/2606.05538#S2)) and to design a finer-grained compression method at the intermediate dimension level (§[3](https://arxiv.org/html/2606.05538#S3)).

#### \(1\) Fisher importance as a unit attribution tool\.

Prior methods rely on activation ratios, weight magnitude, or router scores, which we show are poor proxies for parameter importance\. In contrast, empirical Fisher information performs substantially better as both a compression metric and an attribution tool\. We validate this through three lines of evidence: \(a\) Fisher importance outperforms existing metrics under controlled comparison, \(b\) zeroing just 12 out of 1\.35 million intermediate dimensions identified by Fisher importance destroys mathematical reasoning while preserving general knowledge, and \(c\) removing the bottom 50% of dimensions preserves overall performance \(§[2](https://arxiv.org/html/2606.05538#S2)\)\.

#### \(2\) Fine\-grained intermediate dimension compression\.

Prior methods operate at the level of entire experts. However, capability in MoE models is distributed across experts but concentrated in a small subset of intermediate dimensions. Removing entire experts therefore discards critical dimensions alongside redundant ones. We close this expert-vs-intermediate-dimension localization gap by proposing Fisher-MoE, which converts the per-dimension attribution above into a structurally smaller MoE: it performs fine-grained compression within each expert by physically removing the rows of $W^{\text{gate}}, W^{\text{up}}$ and the columns of $W^{\text{down}}$ that correspond to low-Fisher-score dimensions (§[3](https://arxiv.org/html/2606.05538#S3)).

Our key contributions are:

- We report a structural property of MoE models in which capability is not localized at the expert level but instead concentrated in a small subset of intermediate dimensions distributed across experts. This suggests one reason for the failure of expert-level MoE compression methods.
- We define the intermediate dimension as the attributable structure and use Fisher information as a tool for characterizing this structure. We empirically show that Fisher importance identifies both critical and redundant intermediate dimensions.
- We propose Fisher-MoE, a fine-grained compression method that operates at the intermediate-dimension level instead of the expert level. At a $p=50\%$ compression ratio, it preserves downstream performance while improving inference throughput by 21%.

## 2 Model Capability Attribution

Can we find an importance metric that ranks parameters by their impact on model capability? In this section, we propose the *empirical Fisher information* as a better metric. We first derive Fisher importance from a second-order expansion of the KL divergence and contrast it with prior heuristics (§[2.1](https://arxiv.org/html/2606.05538#S2.SS1)), then provide three lines of empirical evidence that this score is a useful ranking signal: it outperforms all alternatives (§[2.1](https://arxiv.org/html/2606.05538#S2.SS1.SSS0.Px4)), masking a few Fisher-ranked critical dimensions collapses generation-heavy tasks while sparing knowledge tasks (§[2.2](https://arxiv.org/html/2606.05538#S2.SS2)), and the set of Fisher-important dimensions shared across tasks is small and removing it collapses every task (§[2.3](https://arxiv.org/html/2606.05538#S2.SS3)). For brevity, we group benchmarks into four categories: Knowledge covers MMLU, CEval, and CMMLU; Code covers HumanEval and MBPP; Reasoning is BBH; Math covers MATH and GSM8K.

### 2.1 Empirical Fisher Information

Let $p_{\theta}(y \mid x)$ be the model's output distribution and $\mathcal{D}$ a calibration dataset of $N$ samples with sequence length $T$. We denote by $\mathcal{L}(x,y) = -\log p_{\theta}(y \mid x)$ the loss. Each expert $E_i$ is a gated FFN

$$E_i(\mathbf{x}) = W_i^{\text{down}}\left(\sigma(W_i^{\text{gate}}\mathbf{x}) \odot W_i^{\text{up}}\mathbf{x}\right), \tag{1}$$

with $W_i^{\text{gate}}, W_i^{\text{up}} \in \mathbb{R}^{d_{\text{ff}} \times d}$, $W_i^{\text{down}} \in \mathbb{R}^{d \times d_{\text{ff}}}$, and $\sigma(\cdot)$ an element-wise nonlinearity (e.g., SiLU).

#### Empirical Fisher approximation\.

A natural way to ask "how important is parameter $\theta_i$?" is to ask how much the predictive distribution moves when we perturb it. For a small $\delta \in \mathbb{R}^{|\theta|}$, a second-order Taylor expansion of the KL divergence between the unperturbed and perturbed models yields

$$\mathbb{E}_{x}\,\mathrm{KL}\!\left[p_{\theta} \,\|\, p_{\theta+\delta}\right] = \tfrac{1}{2}\,\delta^{\top} F_{\theta}\,\delta + O(\|\delta\|^{3}), \tag{2}$$

where $F_{\theta} \in \mathbb{R}^{|\theta| \times |\theta|}$ is the *Fisher Information Matrix*. Letting $g_{\theta}(x,y) := \nabla_{\theta} \log p_{\theta}(y \mid x)$ denote the per-token score function,

$$F_{\theta} = \mathbb{E}_{x,y}\!\left[g_{\theta}(x,y)\, g_{\theta}(x,y)^{\top}\right]. \tag{3}$$

The full $F_{\theta}$ is intractable for billion-parameter models. We adopt the *empirical Fisher*: we replace the expectation $\mathbb{E}_{y \sim p_{\theta}(y \mid x)}$ with the ground-truth labels in $\mathcal{D}$, yielding

$$\hat{F}_{\theta} \;=\; \frac{1}{N} \sum_{(x,y) \in \mathcal{D}} \left(\nabla_{\theta} \log p_{\theta}(y \mid x)\right)^{2}, \tag{4}$$

where the square is applied element-wise, so only the diagonal of the outer product in Eq. 3 is retained. Throughout the rest of the paper we refer to this score simply as the *Fisher*. When we apply this metric to MoE compression we call the resulting method Fisher-MoE.
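As a minimal sketch of Eq. 4, the snippet below computes the diagonal empirical Fisher for a toy Bernoulli-logistic model. The toy model, the function names, and the two calibration samples are illustrative assumptions and not the paper's setup; an actual run would obtain the per-sample gradients from backpropagation through the MoE.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def empirical_fisher_diag(theta, data):
    """Mean over (x, y) of the squared per-parameter gradient of log p_theta(y | x)."""
    fisher = [0.0] * len(theta)
    for x, y in data:
        p1 = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        # For a Bernoulli-logistic model: d/dtheta_k log p(y | x) = (y - p1) * x_k
        for k, xk in enumerate(x):
            grad_k = (y - p1) * xk
            fisher[k] += grad_k * grad_k
    return [f / len(data) for f in fisher]

# Hypothetical calibration set: two (x, y) pairs with ground-truth labels y.
theta = [0.0, 0.0, 0.0]
data = [([1.0, 0.0, 2.0], 1), ([0.0, 0.0, 1.0], 0)]
scores = empirical_fisher_diag(theta, data)
print(scores)
```

The second parameter never receives a gradient on this calibration set, so its score is zero; the third parameter is touched by both samples and scores highest.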

#### Baselines\.

Prior MoE compression instead uses one of three metrics:

- *Activation ratio:* $s_i^{\text{act}} = \frac{1}{NT} \sum_{(\mathbf{x},t) \in \mathcal{D}} \mathbb{1}[i \in T(\mathbf{x}_t)]$, the fraction of tokens routed to expert $i$ (Muzio et al., [2024](https://arxiv.org/html/2606.05538#bib.bib38); Lu et al., [2024a](https://arxiv.org/html/2606.05538#bib.bib39)).
- *Router score:* $s_i^{\text{score}} = \frac{1}{NT} \sum g_i(\mathbf{x}_t)$, the average gating weight (Xie et al., [2024](https://arxiv.org/html/2606.05538#bib.bib42)).
- *Magnitude:* $s_i^{\text{mag}} = \|W_i^{\text{gate}}\|_F + \|W_i^{\text{up}}\|_F + \|W_i^{\text{down}}\|_F$, a data-independent weight norm (Lee et al., [2024](https://arxiv.org/html/2606.05538#bib.bib43); Yang et al., [2024b](https://arxiv.org/html/2606.05538#bib.bib40)).
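The three baseline scores above can be sketched in a few lines of plain Python. The routing decisions, gating weights, and weight matrices below are made-up toy values, and the function names are hypothetical; the sketch only mirrors the three formulas.

```python
import math

def activation_ratio(routed, num_experts):
    """routed[t] is the set of expert indices selected for token t."""
    counts = [0] * num_experts
    for chosen in routed:
        for i in chosen:
            counts[i] += 1
    return [c / len(routed) for c in counts]

def router_score(gates):
    """gates[t][i] is the gating weight of expert i for token t (0 if not routed)."""
    num_experts = len(gates[0])
    return [sum(g[i] for g in gates) / len(gates) for i in range(num_experts)]

def frobenius(W):
    return math.sqrt(sum(w * w for row in W for w in row))

def magnitude_score(w_gate, w_up, w_down):
    """Data-independent score: sum of the three Frobenius norms."""
    return frobenius(w_gate) + frobenius(w_up) + frobenius(w_down)

# Toy example: 4 tokens, 3 experts, top-2 routing.
routed = [{0, 1}, {0, 2}, {0, 1}, {1, 2}]
gates = [[0.6, 0.4, 0.0], [0.7, 0.0, 0.3], [0.5, 0.5, 0.0], [0.0, 0.2, 0.8]]
act = activation_ratio(routed, 3)
score = router_score(gates)
mag = magnitude_score([[3.0, 4.0]], [[3.0, 4.0]], [[3.0], [4.0]])
print(act, score, mag)
```

All three scores are assigned per expert, which is why they can only drive expert-level decisions.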

#### Extension to intermediate dimensions\.

A key practical advantage of Fisher importance is that the same derivation applies whether $W$ is a router weight matrix, a full expert FFN, or a single intermediate dimension. We instantiate Eq. [4](https://arxiv.org/html/2606.05538#S2.E4) at three granularities used throughout the paper: *(1) Router-level Fisher*: $W$ is the row of the router gate corresponding to expert $i$; *(2) Expert-level Fisher*: $W$ is the union of $W_i^{\text{gate}}, W_i^{\text{up}}, W_i^{\text{down}}$ for expert $i$; *(3) Intermediate dimension Fisher*: $W$ is the rows of $W_i^{\text{gate}}, W_i^{\text{up}}$ and the column of $W_i^{\text{down}}$ corresponding to a single intermediate dimension $j$ of expert $i$. Let $G_i^{m} := \nabla_{W_i^{m}} \mathcal{L}$ for $m \in \{g, u, d\}$. Then

$$s_{i,j}^{\text{Fisher}} = \frac{1}{d} \sum_{k} \big[(G_i^{g})_{j,k}^{2} + (G_i^{u})_{j,k}^{2} + (G_i^{d})_{k,j}^{2}\big], \tag{5}$$

averaged over calibration samples; the sum runs over the hidden dimension $k \in [d]$ and the denominator $d$ normalizes by the total number of parameters tied to dimension $j$. The intermediate dimension form is what enables both fine-grained compression (removing low-scoring dimensions, §[3](https://arxiv.org/html/2606.05538#S3)) and fine-grained attribution (identifying high-scoring critical dimensions, §[2.2](https://arxiv.org/html/2606.05538#S2.SS2)).
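The aggregation in Eq. 5 can be sketched as follows, assuming the per-sample gradients of the three projection matrices are already available as nested lists. The function name and the single toy gradient sample are hypothetical.

```python
def intermediate_dim_fisher(grad_samples, d, d_ff):
    """Per-dimension Fisher score for one expert.

    grad_samples: list of (G_gate, G_up, G_down), one tuple per calibration sample,
    where G_gate and G_up are d_ff x d and G_down is d x d_ff.
    """
    scores = [0.0] * d_ff
    for g_gate, g_up, g_down in grad_samples:
        for j in range(d_ff):
            total = 0.0
            for k in range(d):
                # Row j of gate/up and column j of down all belong to dimension j.
                total += g_gate[j][k] ** 2 + g_up[j][k] ** 2 + g_down[k][j] ** 2
            scores[j] += total / d
    # Average over calibration samples.
    return [s / len(grad_samples) for s in scores]

# Toy gradients for one sample with d = 2 and d_ff = 2.
sample = (
    [[1.0, 0.0], [0.0, 0.0]],  # G_gate
    [[0.0, 2.0], [0.0, 0.0]],  # G_up
    [[0.0, 1.0], [3.0, 0.0]],  # G_down
)
scores = intermediate_dim_fisher([sample], d=2, d_ff=2)
print(scores)
```

Note that the score for dimension `j` reads a row from the gate and up gradients but a column from the down gradient, matching the index order in Eq. 5.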

We compare all four importance metrics under expert-level compression at MoE compression ratio $p=50\%$. To ensure a fair comparison, we use the unified compression framework of He et al. ([2025b](https://arxiv.org/html/2606.05538#bib.bib46)): all methods share identical calibration data (GSM8K training set, 128 samples), compression procedure, and evaluation protocol on Qwen1.5-MoE; only the importance metric differs. Fisher importance dominates across all tasks (Figure [2](https://arxiv.org/html/2606.05538#S2.F2)). Full results can be found in Table [9](https://arxiv.org/html/2606.05538#A4.T9).

![Refer to caption](https://arxiv.org/html/2606.05538v1/x2.png)

Figure 2: Expert-level pruning with existing importance metrics on Qwen1.5-MoE at a $50\%$ compression ratio.

### 2.2 Can Fisher Importance Identify Task-Critical Parameters?

Table 1: Impact of masking the top-12 dimensions and removing critical/redundant intermediate dimensions on performance.

We rank all $\sim 1.35\text{M}$ MoE FFN intermediate dimensions by Fisher score (computed on 128 GSM8K training samples) and zero-mask the top fraction. Masking just 12 of $1.35\text{M}$ intermediate dimensions ($0.001\%$) collapses GSM8K from $35.9\%$ to $0.8\%$ and MATH from $13.0\%$ to $1.1\%$, with code and BBH also degrading sharply, while multiple-choice knowledge tasks (MMLU/CEval/CMMLU) largely retain base-model accuracy (Table [1](https://arxiv.org/html/2606.05538#S2.T1); full breakdown in Table [10](https://arxiv.org/html/2606.05538#A4.T10)). Appendix [J](https://arxiv.org/html/2606.05538#A10) confirms that none of the residual $0.8\%$ on GSM8K reflects real computation: all 10 "correct" answers are coincidental digit matches. Figure [3](https://arxiv.org/html/2606.05538#S2.F3) shows the underlying cause: the Fisher distribution is extremely heavy-tailed, with these twelve dimensions $\sim 1000\times$ above the population mean.

![Refer to caption](https://arxiv.org/html/2606.05538v1/figures/critical_neurons_3d_surface_new.png)

Figure 3: Fisher scores of intermediate dimensions across experts and layers in Qwen1.5-MoE.

We further measure each dimension's mean forward activation magnitude. The population is sharply skewed: the top-12 Fisher dimensions carry mean activations 77–450$\times$ the median activation of all parameters (Appendix [B](https://arxiv.org/html/2606.05538#A2)). They match the *massive-activation* signature of Sun et al. ([2024a](https://arxiv.org/html/2606.05538#bib.bib81)), further suggesting that Fisher importance identifies critical parameters.

To understand why masking only twelve dimensions devastates math reasoning, we trace the failure into the model's attention dynamics. The *attention sink* (Xiao et al., [2024](https://arxiv.org/html/2606.05538#bib.bib82)) stabilises autoregressive decoding. Masking the top-12 dimensions reduces mean `<BOS>` attention in the mid-stack (layers 7–17, where the sink dominates) by $\sim 70\%$ (Appendix [C](https://arxiv.org/html/2606.05538#A3)). This directly explains the dissociation in Table [1](https://arxiv.org/html/2606.05538#S2.T1): multi-step generation depends on the sink-stabilised token-to-token dynamics and degenerates into the echo and garbled outputs of Appendix [J](https://arxiv.org/html/2606.05538#A10), whereas MCQ tasks need only a single-token classification and are largely sink-insensitive.

The cross-domain analysis from Appendix [H](https://arxiv.org/html/2606.05538#A8) suggests why removing GSM8K-calibrated critical dimensions selectively destroys math reasoning. Table [16](https://arxiv.org/html/2606.05538#A8.T16) shows that the pairwise overlap of Fisher-important dimensions between tasks exhibits clear domain structure: math tasks (GSM8K/MATH) share 65.3% of their important dimensions. The high overlap explains why removing GSM8K-calibrated critical dimensions also destroys MATH performance, while the lower overlap with knowledge tasks ($\sim 55\%$) suggests why MMLU/CEval/CMMLU remain intact.

### 2.3 Identify Universally Indispensable Dimensions

The domain-overlap analysis identifies a set of intermediate dimensions retained by all eight evaluation tasks under $p=50\%$ ($\mathcal{K}_{\cap}$, 4.88% of parameters). Removing the universally *kept* dimensions collapses every task to near-zero accuracy. Removing the universally *discarded* dimensions ($\mathcal{D}_{\cap}$, 4.01% of parameters) preserves near-full performance across all tasks (Table [1](https://arxiv.org/html/2606.05538#S2.T1)), confirming these dimensions are genuinely redundant. This result, combined with §[2.2](https://arxiv.org/html/2606.05538#S2.SS2), leads to a key insight for compression: removing the long redundant tail of intermediate dimensions ranked by Fisher preserves model performance. We leverage this in §[3](https://arxiv.org/html/2606.05538#S3).
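One way to read the universally kept and universally discarded sets is as intersections of per-task keep and discard sets. The sketch below follows that reading; the task names, scores, and helper functions are hypothetical, and the paper's exact procedure may differ.

```python
def kept_set(scores, p):
    """Indices of the top (1 - p) fraction of dimensions by score."""
    n_keep = len(scores) - int(p * len(scores))
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return set(order[:n_keep])

def universal_sets(task_scores, p):
    """Return (kept by every task, discarded by every task)."""
    num_dims = len(next(iter(task_scores.values())))
    all_dims = set(range(num_dims))
    kept = [kept_set(s, p) for s in task_scores.values()]
    universally_kept = set.intersection(*kept)
    universally_discarded = set.intersection(*[all_dims - k for k in kept])
    return universally_kept, universally_discarded

# Toy Fisher scores for 6 dimensions under two hypothetical calibration tasks.
task_scores = {
    "task_a": [9, 8, 7, 1, 2, 3],
    "task_b": [9, 1, 8, 7, 2, 3],
}
k_cap, d_cap = universal_sets(task_scores, p=0.5)
print(sorted(k_cap), sorted(d_cap))
```

In this toy case dimensions 0 and 2 are kept by both tasks and dimensions 4 and 5 are discarded by both, while dimensions 1 and 3 are task-specific.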

## 3 Keep the Experts, Slim Their FFNs

Section [2](https://arxiv.org/html/2606.05538#S2) showed that Fisher ranks parameters in a way that aligns with downstream impact: removing the top-ranked dimensions disrupts generation the most, while the bottom-ranked dimensions form a long redundant tail. In this section, we leverage this ranking for compression. We first formalize prior expert-level compression and our finer-grained intermediate dimension alternative (§[3.1](https://arxiv.org/html/2606.05538#S3.SS1)), then demonstrate that intermediate dimension compression substantially outperforms expert-level methods (§[3.2](https://arxiv.org/html/2606.05538#S3.SS2)), and finally quantify the deployment benefits (§[3.3](https://arxiv.org/html/2606.05538#S3.SS3)).

### 3.1 From Expert-Level to Intermediate Dimension Compression

A Mixture-of-Experts layer consists of $n$ experts combined through a learned routing mechanism. Given an input $\mathbf{x} \in \mathbb{R}^{d}$, the MoE layer output is

$$f(\mathbf{x};\Phi) = \sum_{i \in T(\mathbf{x})} g_i(\mathbf{x})\, E_i(\mathbf{x};\theta_i), \tag{6}$$

where $T(\mathbf{x})$ denotes the set of top-$k$ experts selected by the router, $g_i(\mathbf{x})$ is the routing weight for expert $i$, and $\Phi = \{(\theta_i, W_i, b_i)\}_{i=1}^{n}$ collects all layer parameters. Let $\mathcal{P}_{\text{MoE}}$ be the set of all expert FFN weights, and $\hat{\mathcal{P}}_{\text{MoE}}$ what remains after compression. We define the MoE compression ratio as

$$p \;=\; 1 - \frac{|\hat{\mathcal{P}}_{\text{MoE}}|}{|\mathcal{P}_{\text{MoE}}|}.$$

At $p=50\%$, expert-level methods remove half of the experts, and intermediate dimension compression reduces $d_{\text{ff}}$ inside experts. Because the MLP weights dominate LLM parameter counts, while attention, embeddings, router, and layer norms are never compressed, $p=50\%$ still translates to a 43–48% whole-model compression rate $p_{\text{model}}$ across backbones (reported in Table [29](https://arxiv.org/html/2606.05538#A13.T29)).
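The compression ratio defined above is simple parameter bookkeeping. The sketch below uses hypothetical layer sizes and a hypothetical 90/10 split between expert FFN weights and all other weights; it shows that both granularities reach the same $p$ and how $p$ shrinks to a whole-model rate once uncompressed parameters are counted.

```python
def expert_params(d, d_ff):
    """Gate and up are d_ff x d, down is d x d_ff."""
    return 3 * d * d_ff

def moe_compression_ratio(remaining, original):
    return 1.0 - remaining / original

def whole_model_ratio(p, moe_params, other_params):
    """Fraction of all parameters removed when only expert FFNs are compressed."""
    return p * moe_params / (moe_params + other_params)

# Hypothetical sizes, chosen only for illustration.
d, d_ff, n_experts = 8, 16, 4
original = n_experts * expert_params(d, d_ff)

# Expert-level: drop half of the experts.
p_expert = moe_compression_ratio((n_experts // 2) * expert_params(d, d_ff), original)
# Intermediate-dimension: keep every expert, halve d_ff.
p_intdim = moe_compression_ratio(n_experts * expert_params(d, d_ff // 2), original)

# Assumed split: 90% of parameters in expert FFNs, 10% elsewhere.
p_model = whole_model_ratio(0.5, moe_params=90, other_params=10)
print(p_expert, p_intdim, p_model)
```

The equal values of `p_expert` and `p_intdim` are what make the later comparison between the two granularities a like-for-like one.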

#### Expert\-level compression\.

Prior methods select a subset $\mathcal{S} \subset \{1, \dots, n\}$ of experts to remove in each MoE layer. The compressed layer becomes

$$\hat{f}(\mathbf{x}) = \sum_{i \in T(\mathbf{x}) \setminus \mathcal{S}} \hat{g}_i(\mathbf{x})\, E_i(\mathbf{x};\theta_i), \tag{7}$$

where $\hat{g}_i$ denotes re-normalized routing weights. Expert selection uses activation ratio, router score, weight magnitude, or Fisher.
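A minimal sketch of Eq. 7 follows. It assumes that "re-normalized" means the surviving routing weights are rescaled to sum to one, which the text does not spell out; the function names and toy values are hypothetical.

```python
def renormalized_routing(selected, removed):
    """selected: dict expert index -> routing weight for the top-k experts.
    removed: set of pruned expert indices (the set S in Eq. 7)."""
    survivors = {i: g for i, g in selected.items() if i not in removed}
    total = sum(survivors.values())
    if total == 0.0:
        return {}  # every selected expert was pruned
    # Assumption: rescale surviving weights so they sum to 1.
    return {i: g / total for i, g in survivors.items()}

def compressed_layer_output(selected, removed, expert_outputs):
    """Weighted sum of the surviving experts' outputs."""
    weights = renormalized_routing(selected, removed)
    dim = len(next(iter(expert_outputs.values())))
    out = [0.0] * dim
    for i, g in weights.items():
        for k in range(dim):
            out[k] += g * expert_outputs[i][k]
    return out

# Toy token routed to experts 0, 1, 3; expert 1 has been pruned.
selected = {0: 0.5, 1: 0.3, 3: 0.2}
expert_outputs = {0: [1.0, 0.0], 1: [5.0, 5.0], 3: [0.0, 1.0]}
weights = renormalized_routing(selected, removed={1})
out = compressed_layer_output(selected, {1}, expert_outputs)
print(weights, out)
```

The pruned expert's contribution disappears entirely from the output, which illustrates why a token whose most useful expert falls in $\mathcal{S}$ loses that computation outright.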

![Refer to caption](https://arxiv.org/html/2606.05538v1/x3.png)

Figure 4: Performance of Qwen1.5-MoE under expert-level compression at three drop ratios (25%, 50%, 75%).

#### Proposed intermediate dimension compression.

Instead of removing entire experts, we operate at the granularity of individual FFN intermediate dimensions. For each expert $E_i$ in each MoE layer, we use the Fisher importance score $s_{i,j}^{\text{dim}}$ (Eq. [5](https://arxiv.org/html/2606.05538#S2.E5), formulated in §[2.1](https://arxiv.org/html/2606.05538#S2.SS1)) to rank every intermediate dimension $j \in \{1, \dots, d_{\text{ff}}\}$ and remove the lowest-scoring fraction $p$. The compressed expert becomes

$$\hat{E}_i(\mathbf{x}) = \hat{W}_i^{\text{down}}\left(\sigma(\hat{W}_i^{\text{gate}}\mathbf{x}) \odot \hat{W}_i^{\text{up}}\mathbf{x}\right), \tag{8}$$

where $\hat{W}_i^{\text{gate}}, \hat{W}_i^{\text{up}} \in \mathbb{R}^{\hat{d}_{\text{ff}} \times d}$ and $\hat{W}_i^{\text{down}} \in \mathbb{R}^{d \times \hat{d}_{\text{ff}}}$ are obtained by physically removing the rows of $W_i^{\text{gate}}, W_i^{\text{up}}$ and the corresponding columns of $W_i^{\text{down}}$. The removal is determined by the Fisher score $s_{i,j}^{\text{dim}}$ together with a budget rule that fixes *where* the budget $p$ is spent. We instantiate three rules: IntDim-E keeps the top $(1-p)$ fraction of dimensions within each expert; IntDim-L pools dimensions across experts in a layer and keeps the top $(1-p)$ fraction per layer; IntDim-G pools dimensions across the whole model and keeps the top $(1-p)$ fraction globally. All three share the same score and budget to isolate the effect of allocation flexibility. They reduce the actual parameter count and computation per expert and preserve the model's routing behavior. We choose Fisher importance as our scoring criterion, as justified in §[2](https://arxiv.org/html/2606.05538#S2).
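The three budget rules differ only in the pool over which dimensions are ranked. The sketch below implements that reading on a toy score tensor `scores[layer][expert][dim]`; the function names, the rounding of the keep count, and the scores are assumptions made for illustration.

```python
def top_fraction(items, p):
    """Keep the top (1 - p) fraction of (score, key) pairs and return the kept keys."""
    n_keep = len(items) - int(p * len(items))
    ranked = sorted(items, key=lambda item: item[0], reverse=True)
    return {key for _, key in ranked[:n_keep]}

def intdim_e(scores, p):
    """Rank within each expert."""
    kept = set()
    for l, layer in enumerate(scores):
        for e, dims in enumerate(layer):
            kept |= top_fraction([(s, (l, e, j)) for j, s in enumerate(dims)], p)
    return kept

def intdim_l(scores, p):
    """Rank across all experts of a layer."""
    kept = set()
    for l, layer in enumerate(scores):
        pool = [(s, (l, e, j)) for e, dims in enumerate(layer) for j, s in enumerate(dims)]
        kept |= top_fraction(pool, p)
    return kept

def intdim_g(scores, p):
    """Rank across the whole model."""
    pool = [(s, (l, e, j))
            for l, layer in enumerate(scores)
            for e, dims in enumerate(layer)
            for j, s in enumerate(dims)]
    return top_fraction(pool, p)

# Toy scores: 2 layers x 2 experts x 2 intermediate dimensions.
scores = [[[10, 9], [8, 2]], [[7, 0.5], [6, 0.1]]]
kept_e = intdim_e(scores, 0.5)
kept_l = intdim_l(scores, 0.5)
kept_g = intdim_g(scores, 0.5)
print(sorted(kept_e), sorted(kept_l), sorted(kept_g))
```

All three rules keep four of the eight dimensions here, but only IntDim-E leaves every expert with the same width; under the pooled rules, experts in this sketch end up with different $\hat{d}_{\text{ff}}$.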

### 3.2 Intermediate Dimension Compression Outperforms Expert-Level Methods

![Refer to caption](https://arxiv.org/html/2606.05538v1/x4.png)

Figure 5: Expert-level vs. intermediate dimension compression at $p=50\%$.

We compare expert-level and intermediate dimension compression at the same compression ratio $p=50\%$ on Qwen1.5-MoE-A2.7B across eight general-purpose benchmarks. Results on additional backbones, including OLMoE, Qwen3-30B, and Qwen3.5-35B, are provided in §[4](https://arxiv.org/html/2606.05538#S4). Both use Fisher and domain-specific calibration data (128 GSM8K training samples). At $p=50\%$, expert-level compression removes half of the routed experts per layer; IntDim-E reduces $d_{\text{ff}}$ by half within each expert.

Figure [5](https://arxiv.org/html/2606.05538#S3.F5) (full results in Table [12](https://arxiv.org/html/2606.05538#A4.T12)) shows that intermediate dimension compression dramatically outperforms expert-level compression, including Expert-level Fisher, across all benchmarks. We further scale our experiments across different compression ratios in Fig. [4](https://arxiv.org/html/2606.05538#S3.F4) (full results in Table [11](https://arxiv.org/html/2606.05538#A4.T11)): with $p = 25\%, 50\%, 75\%$, intermediate dimension removal consistently retains base model performance and outperforms expert-level methods. In particular, comparisons with Expert-level Fisher highlight the advantage of fine-grained compression.

The core reason for this advantage is that knowledge in MoE models is distributed across all experts in a long-tailed manner, rather than concentrated in a few (He et al., [2026](https://arxiv.org/html/2606.05538#bib.bib5)). Each expert contains both essential and redundant intermediate dimensions. Expert-level compression must discard entire experts, including their essential components, whereas intermediate dimension compression selectively removes only the least important computation within each expert, precisely the dimensions that §[2.3](https://arxiv.org/html/2606.05538#S2.SS3) showed are genuinely dispensable.

### 3.3 Deployment Cost and Inference Speedup

Table 2: Inference efficiency of 50% compressed Qwen1.5-MoE on H100.

Intermediate dimension compression achieves two benefits. (1) Parameter reduction compared to the base model: the gate, up, and down projection matrices are physically resized, where $\hat{W}_i^{\text{gate}}, \hat{W}_i^{\text{up}} \in \mathbb{R}^{\hat{d}_{\text{ff}} \times d}, \hat{W}_i^{\text{down}} \in \mathbb{R}^{d \times \hat{d}_{\text{ff}}}$ are compressed from $W_i^{\text{gate}}, W_i^{\text{up}} \in \mathbb{R}^{d_{\text{ff}} \times d}, W_i^{\text{down}} \in \mathbb{R}^{d \times d_{\text{ff}}}$, producing a smaller model. (2) Inference speedup compared to the base model and expert-level removal baselines: expert removal methods (Muzio et al., [2024](https://arxiv.org/html/2606.05538#bib.bib38); Lu et al., [2024a](https://arxiv.org/html/2606.05538#bib.bib39); Chen et al., [2022](https://arxiv.org/html/2606.05538#bib.bib41); Li et al., [2023](https://arxiv.org/html/2606.05538#bib.bib49); Chen et al., [2025](https://arxiv.org/html/2606.05538#bib.bib34)) discard entire experts from MoE layers but keep the number of activated parameters per token the same. Reducing the FFN intermediate dimension instead lowers the number of active parameters per token.
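To illustrate what "physically resized" means, the sketch below slices a toy gated FFN down to a kept set of intermediate dimensions and checks that the result matches zero-masking the removed rows of the gate and up projections. The weights, input, and function names are hypothetical, and SiLU is assumed as the nonlinearity.

```python
import math

def silu(z):
    return z / (1.0 + math.exp(-z))

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def expert_forward(w_gate, w_up, w_down, x):
    """Gated FFN: W_down (silu(W_gate x) * W_up x)."""
    gate = [silu(z) for z in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)

def slice_expert(w_gate, w_up, w_down, keep):
    """Drop rows of gate/up and columns of down that are not in `keep`."""
    keep = sorted(keep)
    return ([w_gate[j] for j in keep],
            [w_up[j] for j in keep],
            [[row[j] for j in keep] for row in w_down])

def mask_expert(w_gate, w_up, w_down, keep):
    """Same shapes as the original, with removed rows of gate/up set to zero."""
    zero = [0.0] * len(w_gate[0])
    return ([row if j in keep else zero for j, row in enumerate(w_gate)],
            [row if j in keep else zero for j, row in enumerate(w_up)],
            w_down)

# Toy expert with d = 2 and d_ff = 3; keep dimensions 0 and 2.
w_gate = [[0.5, -1.0], [1.0, 2.0], [0.3, 0.3]]
w_up = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0]]
w_down = [[1.0, 2.0, 3.0], [0.0, -1.0, 0.5]]
x = [1.0, 2.0]
keep = {0, 2}

sliced = slice_expert(w_gate, w_up, w_down, keep)
y_sliced = expert_forward(*sliced, x)
y_masked = expert_forward(*mask_expert(w_gate, w_up, w_down, keep), x)
print(y_sliced, y_masked)
```

Both variants produce the same output, but only the sliced one stores and multiplies smaller matrices, which is where the memory and per-token compute savings come from.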

We benchmark Qwen1.5-MoE at a 50% compression ratio against the base model on a single NVIDIA H100 GPU using vLLM with bf16 precision.¹ Concretely, each layer goes from 86.1M to 68.8M parameters (a 20% per-layer reduction). The theoretical decode-time speedup for Qwen1.5-MoE is $2.689\text{B}/2.274\text{B} \approx 1.18\times$. Table [2](https://arxiv.org/html/2606.05538#S3.T2) shows that the IntDim-E compressed model achieves a $1.21\times$ wall-clock speedup, slightly exceeding the theoretical estimate. The additional gain comes from the freed VRAM being automatically converted by vLLM into a larger KV cache, increasing the request capacity by 34%. The VRAM and disk footprint drops from 27 GB to 16 GB.

¹ We greedy-decode 1024 prompts with `max_new_tokens` = 256. A 1024-prompt warm-up precedes each timed run, and `gpu_memory_utilization` = 0.85.
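The ratios quoted in this paragraph can be checked with a few lines of arithmetic. All numbers below are taken from the text; reading 2.689B and 2.274B as per-token active parameter counts before and after compression is an assumption, and the variable names are hypothetical.

```python
# Figures quoted in the text.
base_active = 2.689e9        # assumed: active parameters per token, base model
compressed_active = 2.274e9  # assumed: active parameters per token, compressed model
layer_before, layer_after = 86.1e6, 68.8e6
footprint_before_gb, footprint_after_gb = 27, 16

theoretical_speedup = base_active / compressed_active
per_layer_reduction = 1 - layer_after / layer_before
footprint_reduction = 1 - footprint_after_gb / footprint_before_gb

print(f"theoretical speedup: {theoretical_speedup:.2f}x")
print(f"per-layer reduction: {per_layer_reduction:.1%}")
print(f"footprint reduction: {footprint_reduction:.1%}")
```

The measured $1.21\times$ sits slightly above the ratio computed here, consistent with the KV-cache explanation given in the text.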

## 4 Experiments

Table 3: Performance of different pruning strategies at MoE compression ratio $p=50\%$. Upper: Qwen1.5-MoE-A2.7B and OLMoE-1B-7B on general benchmarks. Lower: Qwen3-30B-A3B and Qwen3.5-35B-A3B on frontier long-CoT math reasoning. Calibration uses the corresponding domain training data. Here, “HE” is short for HumanEval, “MA” is short for MultiArith, and “Reas.” is short for Reasoning. Column groups: Knowledge (MMLU, CEval, CMMLU), Coding (HE, MBPP), Math (MATH, GSM8K, MA), Reas. (BBH).

Qwen1.5-MoE-A2.7B

| Method | MMLU | CEval | CMMLU | HE | MBPP | MATH | GSM8K | MA | BBH | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Base Model | 59.3 | 59.3 | 60.7 | 32.6 | 24.6 | 13.0 | 35.9 | 76.7 | 30.7 | 43.3 |
| MoE comp. (activation) | 26.3 | 42.5 | 43.1 | 0.0 | 0.0 | 0.0 | 1.9 | 3.3 | 24.5 | 15.7 |
| MoE comp. | 37.2 | 13.5 | 32.3 | 0.0 | 0.8 | 0.0 | 1.9 | 2.0 | 23.5 | 12.4 |
| MoE-Pruner | 18.6 | 33.3 | 35.8 | 0.8 | 0.0 | 0.1 | 1.7 | 2.7 | 13.2 | 11.8 |
| MoE comp. (Fisher) | 39.4 | 48.8 | 47.7 | 12.9 | 8.1 | 2.0 | 16.9 | 15.5 | 25.7 | 24.1 |
| Expert-level Fisher | 37.9 | 48.6 | 44.2 | 12.9 | 11.7 | 2.4 | 18.0 | 37.7 | 24.1 | 26.4 |
| Fisher-IntDim-E (ours) | 50.3 | 62.0 | 61.6 | 21.2 | 20.5 | 8.0 | 35.0 | 61.5 | 28.2 | 38.7 |
| Fisher-IntDim-L (ours) | 49.1 | 57.5 | 62.0 | 30.5 | 22.8 | 12.6 | 22.0 | 89.8 | 28.3 | 41.6 |
| Fisher-IntDim-G (ours) | 49.0 | 61.3 | 64.9 | 34.2 | 24.1 | 9.8 | 27.2 | 92.2 | 28.2 | 43.4 |

OLMoE-1B-7B

| Method | MMLU | CEval | CMMLU | HE | MBPP | MATH | GSM8K | MA | BBH | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Base Model | 53.5 | 33.1 | 33.0 | 9.8 | 15.2 | 11.6 | 53.2 | 87.5 | 26.5 | 35.9 |
| MoE comp. (activation) | 33.2 | 22.5 | 26.0 | 9.2 | 11.9 | 1.5 | 5.0 | 22.2 | 22.6 | 17.1 |
| MoE comp. | 31.6 | 22.6 | 27.1 | 9.2 | 10.6 | 3.5 | 7.1 | 17.3 | 24.7 | 17.1 |
| MoE-Pruner† | 17.3 | 12.9 | 19.0 | 0.0 | 0.0 | 0.0 | 2.1 | 2.7 | 12.2 | 7.4 |
| MoE comp. (Fisher) | 33.8 | 27.0 | 29.0 | 5.5 | 9.1 | 1.9 | 5.2 | 26.5 | 24.0 | 18.0 |
| Expert-level Fisher | 34.4 | 27.7 | 27.6 | 7.3 | 10.1 | 1.7 | 6.1 | 21.5 | 25.4 | 18.0 |
| Fisher-IntDim-E (ours) | 39.9 | 29.2 | 29.9 | 9.8 | 7.3 | 2.3 | 24.9 | 75.7 | 27.3 | 27.4 |
| Fisher-IntDim-L (ours) | 40.4 | 30.8 | 31.4 | 9.8 | 17.0 | 7.8 | 44.3 | 87.7 | 25.1 | 32.7 |
| Fisher-IntDim-G (ours) | 40.4 | 31.4 | 31.7 | 10.4 | 15.7 | 6.9 | 41.9 | 81.0 | 25.2 | 31.6 |

† Footnote 2: For OLMoE-1B-7B, MoE-Pruner uses low-magnitude weights as the importance metric, since the original high-magnitude variant collapses accuracy across all tasks.

In §[4.1](https://arxiv.org/html/2606.05538#S4.SS1), we study how different MoE compression strategies behave when applied directly to the base model without any post-training. We compress Qwen1.5-MoE-A2.7B and OLMoE-1B-7B-0125 at $p=50\%$ under task-matched domain calibration, and report performance across the eight general-task benchmarks above to isolate the effect of the importance metric and compression granularity. In §[4.2](https://arxiv.org/html/2606.05538#S4.SS2), we study whether the granularity advantage transfers to long-CoT mathematical reasoning at frontier scale, on Qwen3-30B-A3B and Qwen3.5-35B-A3B with 128 Stanford-S1 calibration samples. In §[4.3](https://arxiv.org/html/2606.05538#S4.SS3), we study whether compressed models that are subsequently post-trained can achieve results competitive with fully post-trained uncompressed models. In §[4.4](https://arxiv.org/html/2606.05538#S4.SS4), we study the out-of-domain generalization of single-domain calibration: we use GSM8K calibration for all methods and evaluate the resulting models on the seven non-math benchmarks to test whether the Fisher signal preserves general capability when the calibration distribution is narrow. In §[4.5](https://arxiv.org/html/2606.05538#S4.SS5), we show empirically that Fisher-MoE composes with quantization to further reduce deployment cost: we apply 4-bit AWQ (Lin et al., [2024](https://arxiv.org/html/2606.05538#bib.bib45)) on top of Fisher-MoE at $p=50\%$ and compare the accuracy and footprint of *Ours+AWQ* against *Base+AWQ* to verify that the two reductions compose without amplifying quantization sensitivity.

After evaluating different compression ratios in Fig. [4](https://arxiv.org/html/2606.05538#S3.F4) ($p = 25\%, 50\%, 75\%$), we fix the MoE compression ratio to $p=50\%$ for the remainder of this section ($p$ is defined in §[3.1](https://arxiv.org/html/2606.05538#S3.SS1); the corresponding *whole-model* parameter reduction is $\sim$45% across the four backbones, see Table [29](https://arxiv.org/html/2606.05538#A13.T29)), and focus on scaling across models and benchmarks, along with generalization and compatibility. Detailed experimental settings, including models and datasets, baselines, calibration data, computational resources, and the training and evaluation framework, are in Appendix [E](https://arxiv.org/html/2606.05538#A5). We include an ablation study and cost analysis on the calibration sample size $N \in \{32, 64, 128, 256, 512\}$ in Appendix [F](https://arxiv.org/html/2606.05538#A6), and use $N=128$; all methods use the same calibration data. Computing Fisher importance on 128 samples takes less than 30 seconds on an H100 node for the 30B models. Further comparisons with dense pruning and sparsity baselines are in Appendix [L](https://arxiv.org/html/2606.05538#A12), though they fall outside the scope of MoE architectures.
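To make concrete what a Fisher calculation over $N$ calibration samples involves, here is a minimal pure-Python sketch on a toy ReLU feed-forward block. It is not the paper's implementation: the precise Fisher importance definition is given earlier in the paper, and the toy architecture, the squared-error loss, the function name `fisher_scores`, and the choice to sum squared gradients over all weights attached to an intermediate dimension are assumptions made here for illustration. Real MoE experts are typically gated FFNs and would be scored with automatic differentiation.

```python
import random

def fisher_scores(w_up, w_down, samples):
    """Empirical-Fisher-style score per intermediate dimension (toy ReLU FFN).

    w_up:    d_ff rows of length d_model (row j produces intermediate dim j)
    w_down:  d_model rows of length d_ff (column j consumes intermediate dim j)
    samples: list of (x, target) pairs; loss = 0.5 * ||y - target||^2
    """
    d_ff = len(w_up)
    scores = [0.0] * d_ff
    for x, target in samples:
        z = [sum(w * xi for w, xi in zip(row, x)) for row in w_up]
        h = [zj if zj > 0.0 else 0.0 for zj in z]            # ReLU
        y = [sum(w * hj for w, hj in zip(row, h)) for row in w_down]
        err = [yi - ti for yi, ti in zip(y, target)]         # dLoss/dy
        for j in range(d_ff):
            # Squared gradient norm of column j of w_down: grad = err_i * h_j.
            g_down = sum((e * h[j]) ** 2 for e in err)
            # Signal backpropagated into z_j; the ReLU derivative gates it.
            back = sum(w_down[i][j] * err[i] for i in range(len(err)))
            back = back if z[j] > 0.0 else 0.0
            # Squared gradient norm of row j of w_up: grad = back * x_k.
            g_up = sum((back * xk) ** 2 for xk in x)
            scores[j] += g_down + g_up
    return [s / len(samples) for s in scores]

random.seed(0)
d_model, d_ff, n_samples = 4, 6, 128
w_up = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(d_ff)]
w_down = [[random.gauss(0, 1) for _ in range(d_ff)] for _ in range(d_model)]
w_up[3] = [0.0] * d_model  # a dead dimension: it never activates
samples = [([random.gauss(0, 1) for _ in range(d_model)],
            [random.gauss(0, 1) for _ in range(d_model)])
           for _ in range(n_samples)]

scores = fisher_scores(w_up, w_down, samples)
ranking = sorted(range(d_ff), key=lambda j: scores[j])  # least important first
print(ranking)
```

The dead dimension receives a score of exactly zero and is ranked first for removal. The cost is one forward and one backward pass per calibration sample, which is consistent with the calculation being cheap relative to training.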

### 4.1 Compression on Base Models: General Tasks

We compress each base model at the MoE compression ratio $p=50\%$ and evaluate across the eight general-purpose benchmarks. This isolates the effect of the compression algorithm: any performance loss reflects information that the importance metric and granularity failed to preserve.

Table [3](https://arxiv.org/html/2606.05538#S4.T3) shows a consistent picture across both backbones. Expert-level baselines, regardless of importance metric, suffer catastrophic collapse on generation-heavy tasks, including GSM8K, MATH, and HumanEval. Fisher-MoE, by contrast, largely retains capability on the math and code benchmarks: on Qwen1.5-MoE, Fisher-IntDim-G scores 27.2 on GSM8K and 34.2 on HumanEval (base model: 35.9 and 32.6), whereas no baseline exceeds 18.0 and 12.9, respectively. The advantage holds on knowledge tasks too.

Additionally, allowing a more flexible allocation of intermediate dimensions leads to further gains. Specifically, Fisher-IntDim-G outperforms Fisher-IntDim-E on both OLMoE (average 31.6 vs. 27.4) and Qwen1.5-MoE (average 43.4 vs. 38.7). In the remainder of this section, we adopt Fisher-IntDim-G for scaling experiments, including larger models, long chain-of-thought (CoT) math reasoning, and supervised fine-tuning (SFT).
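The effect of the allocation scope can be illustrated with a small sketch. The variants are defined earlier in the paper; here we assume, purely for illustration, that the E, L, and G suffixes correspond to ranking intermediate dimensions within each expert, within each layer, and across the whole model. The function names and the toy scores below are hypothetical.

```python
def select_kept_dims(scores, keep_ratio, scope):
    """Return the set of (layer, expert, dim) triples that are kept.

    scores[layer][expert][dim] is an importance score (higher = keep).
    scope: "expert" ranks within each expert, "layer" within each layer,
           "global" across the whole model.
    """
    groups = {}
    for li, layer in enumerate(scores):
        for ei, expert in enumerate(layer):
            key = {"expert": (li, ei), "layer": (li,), "global": ()}[scope]
            for di, s in enumerate(expert):
                groups.setdefault(key, []).append((s, (li, ei, di)))
    kept = set()
    for items in groups.values():
        budget = int(keep_ratio * len(items))
        items.sort(key=lambda it: it[0], reverse=True)
        kept.update(idx for _, idx in items[:budget])
    return kept

def dims_per_expert(kept, scores):
    """Count how many dimensions each (layer, expert) retains."""
    return {(li, ei): sum(1 for (kl, ke, _) in kept if (kl, ke) == (li, ei))
            for li, layer in enumerate(scores) for ei, _ in enumerate(layer)}

# Toy model: 2 layers x 2 experts x 4 intermediate dimensions.
toy_scores = [
    [[9.0, 8.0, 7.0, 6.0], [1.4, 1.3, 1.2, 1.1]],
    [[5.0, 4.0, 0.5, 0.4], [0.3, 0.2, 0.1, 0.05]],
]
for scope in ("expert", "layer", "global"):
    kept = select_kept_dims(toy_scores, keep_ratio=0.5, scope=scope)
    print(scope, dims_per_expert(kept, toy_scores))
```

All three scopes keep the same total number of dimensions, but the wider scopes let the budget flow toward experts and layers with higher scores. A practical consequence of this design is that experts end up with different widths, and an expert can in principle be emptied entirely.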

On CEval, CMMLU, and MultiArith, Fisher-MoE matches or modestly exceeds the uncompressed base model. We interpret these gains as evidence that Fisher-based pruning can sometimes remove dimensions associated with brittle shortcut behavior or out-of-distribution output formats, thereby revealing capabilities already present in the base model rather than introducing new ones. A similar pattern is observed for the Qwen3 and Qwen3.5 models in §[4.2](https://arxiv.org/html/2606.05538#S4.SS2). We further analyze this phenomenon quantitatively in Appendix [K](https://arxiv.org/html/2606.05538#A11). Compression reduces generation-time biases that suppress useful computation already available in the original model. Finally, we show that removing entire experts or attention heads causes severe degradation on generation-heavy benchmarks, while FFN intermediate dimension compression is substantially more robust (Appendix [Q](https://arxiv.org/html/2606.05538#A17)).

### 4.2 Compression on Base Models: Math Reasoning at Larger Scale

We next test whether the granularity advantage carries over to long-CoT math reasoning at larger scale, where errors compound across many decoding steps and any loss of generation capability is unforgiving. We compress Qwen3-30B-A3B and Qwen3.5-35B-A3B at the same MoE compression ratio $p=50\%$ and evaluate on five long-CoT reasoning benchmarks (Table [3](https://arxiv.org/html/2606.05538#S4.T3)).

### 4.3 Fine-Tuning Enabled Domain Enhancement

The base-model results above measure the naive case: compressed weights deployed without any further training. In practice, a compressed model is often used as the starting point for domain SFT. The relevant question for deployment is therefore: *does compressing first and then applying SFT recover the base + SFT ceiling?* We compress at $p=50\%$, then run SFT on the corresponding domain training data, and compare against the uncompressed base + SFT.

(a) Qwen1.5-MoE-A2.7B

| Method | MATH | GSM8K | MultiArith |
| --- | --- | --- | --- |
| Base + SFT | 15.9 | 50.8 | 87.8 |
| MoE-Pruner + SFT | 3.7 | 40.6 | 35.8 |
| MoE comp. + SFT | 4.9 | 30.3 | 44.7 |
| MoE comp. (activation) + SFT | 4.8 | 28.5 | 47.7 |
| MoE comp. (Fisher) + SFT | 14.3 | 46.0 | 81.8 |
| Expert-level Fisher + SFT | 15.4 | 46.4 | 84.8 |
| Fisher-IntDim-G + SFT (ours) | 17.6 | 52.0 | 88.0 |

(b) OLMoE-1B-7B-0125

| Method | MATH | GSM8K | MultiArith |
| --- | --- | --- | --- |
| Base + SFT | 12.4 | 45.1 | 92.5 |
| MoE-Pruner + SFT | 2.3 | 15.2 | 39.5 |
| MoE comp. + SFT | 9.4 | 30.6 | 87.8 |
| MoE comp. (activation) + SFT | 9.8 | 31.8 | 87.0 |
| MoE comp. (Fisher) + SFT | 11.4 | 37.1 | 89.3 |
| Expert-level Fisher + SFT | 10.5 | 39.1 | 89.2 |
| Fisher-IntDim-G + SFT (ours) | 11.9 | 43.8 | 92.3 |

Table 4: Compression at MoE ratio $p=50\%$ followed by SFT on domain training data. *Base + SFT* is the uncompressed reference ceiling.

Results in Table [4](https://arxiv.org/html/2606.05538#S4.T4) show that, with half the FFN parameters, Fisher-MoE after SFT is competitive with the uncompressed Base + SFT ceiling on GSM8K, MultiArith, and MATH. In contrast, the activation-, score-, and magnitude-based compression methods underperform significantly even after SFT.
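One way to quantify "competitive" is the ratio of each compressed + SFT score to the Base + SFT ceiling. The short sketch below computes that ratio from the Table 4 numbers; the retention metric and the variable names are our own illustrative choices, not something the paper reports.

```python
# Scores copied from Table 4, in the order (MATH, GSM8K, MultiArith).
benchmarks = ("MATH", "GSM8K", "MultiArith")
table4 = {
    "Qwen1.5-MoE-A2.7B": {"base_sft": (15.9, 50.8, 87.8),
                          "ours_sft": (17.6, 52.0, 88.0)},
    "OLMoE-1B-7B-0125": {"base_sft": (12.4, 45.1, 92.5),
                         "ours_sft": (11.9, 43.8, 92.3)},
}

# Retention = (Fisher-IntDim-G + SFT) / (Base + SFT); 1.0 means parity.
retention = {
    model: {b: ours / base
            for b, base, ours in zip(benchmarks, rows["base_sft"], rows["ours_sft"])}
    for model, rows in table4.items()
}

for model, per_bench in retention.items():
    print(model, {b: f"{r:.1%}" for b, r in per_bench.items()})
```

On these numbers, the compressed Qwen1.5-MoE model is at or above parity on all three benchmarks, and the compressed OLMoE model stays above 95% of the ceiling on each.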

### 4.4 Out-of-Domain Generalization

Table 5: Evaluation of Qwen1.5-MoE compressed models. GSM8K is used for calibration. In-domain tasks include arithmetic reasoning (MATH/GSM8K/MultiArith); out-of-domain results are grouped into Knowledge (MMLU/CEval/CMMLU), Code (HumanEval/MBPP) and Reasoning (BBH). Full results in Table [13](https://arxiv.org/html/2606.05538#A4.T13).

A reasonable concern with Fisher importance is that calibration on one domain may bias the retained dimensions toward that domain, hollowing out general capability. We test this directly: we calibrate on math (GSM8K, 128 samples), apply 50% intermediate dimension compression, and evaluate the resulting model on seven out-of-domain (OOD) task categories. We then compare its OOD performance against the other baselines. Results in Table [5](https://arxiv.org/html/2606.05538#S4.T5) show that on both in-domain and OOD tasks, Fisher-IntDim-E dominates the activation-, score-, and magnitude-based baselines by large margins.

### 4.5 Compatibility with Quantization

Fisher-MoE reduces parameter count structurally; AWQ (Lin et al., [2024](https://arxiv.org/html/2606.05538#bib.bib45)) reduces bit-width per parameter. The two operate on orthogonal axes of model footprint, so combining them should multiply the savings, provided neither destroys the capability the other depends on. We test this by applying AWQ (settings in Appendix §[G](https://arxiv.org/html/2606.05538#A7)) on top of IntDim-E at $p=50\%$ and comparing against AWQ applied to the base model.

Table 6: Stacking AWQ on top of Fisher-MoE at MoE ratio $p=50\%$. Full results in Table [14](https://arxiv.org/html/2606.05538#A4.T14).

The two reductions compose: combining IntDim-E at $p=50\%$ with 4-bit AWQ shrinks the deployed footprint of Qwen1.5-MoE-A2.7B from 26.67 GiB (bf16, base) to roughly 1/8 of that, while preserving the capability profile reported in §[3.3](https://arxiv.org/html/2606.05538#S3.SS3). The accuracy loss of *IntDim-E+AWQ* relative to *IntDim-E* matches that of *Base Model+AWQ* relative to *Base Model*, confirming that Fisher-MoE is compatible with quantization.
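The multiplicative composition can be checked with idealized arithmetic. The sketch below is a rough estimate only: it assumes every remaining weight is stored at exactly 4 bits instead of 16, ignores quantization metadata (scales, zero points) and any tensors kept at higher precision, and tries both the 50% MoE ratio and the approximate 45% whole-model reduction, since the text does not say which one underlies the "roughly 1/8" figure.

```python
BASE_GIB_BF16 = 26.67            # from the text: Qwen1.5-MoE-A2.7B in bf16
BITS_BEFORE, BITS_AFTER = 16, 4  # bf16 -> 4-bit AWQ (idealized)

def combined_factor(param_reduction):
    """Footprint multiplier when structural and bit-width savings multiply."""
    structural = 1.0 - param_reduction
    quantization = BITS_AFTER / BITS_BEFORE
    return structural * quantization

# 0.50: MoE compression ratio p; 0.45: approximate whole-model reduction.
estimates = {r: combined_factor(r) for r in (0.50, 0.45)}
for reduction, factor in estimates.items():
    print(f"reduction {reduction:.0%}: factor {factor:.4f} "
          f"(about {BASE_GIB_BF16 * factor:.2f} GiB)")
```

Under these assumptions the factor lands between 1/8 and about 1/7, which is in line with the reported footprint; the exact deployed size depends on the quantization settings in the appendix.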

## 5 Conclusion

We revisit MoE compression through the lens of two questions: *which signal* should drive parameter selection, and *at what granularity* should compression operate. For the first, we show that the empirical Fisher information outperforms activation-, router-, and magnitude-based heuristics across backbones and benchmarks (§[2](https://arxiv.org/html/2606.05538#S2), §[4.1](https://arxiv.org/html/2606.05538#S4.SS1), §[4.2](https://arxiv.org/html/2606.05538#S4.SS2)). For the second, we show that pushing compression below the expert level to FFN intermediate dimensions changes the regime: generation-heavy capabilities that collapse under expert-level compression are largely retained (§[3](https://arxiv.org/html/2606.05538#S3), §[4.1](https://arxiv.org/html/2606.05538#S4.SS1)). Beyond speedup and memory efficiency advantages (§[3.3](https://arxiv.org/html/2606.05538#S3.SS3), §[4.5](https://arxiv.org/html/2606.05538#S4.SS5)), our Fisher-importance attribution analysis localizes a striking dissociation (§[2.2](https://arxiv.org/html/2606.05538#S2.SS2)), suggesting that intermediate-dimension Fisher importance is a useful tool for both compression (§[3](https://arxiv.org/html/2606.05538#S3)) and understanding where capabilities live inside models (§[2.2](https://arxiv.org/html/2606.05538#S2.SS2), §[2.3](https://arxiv.org/html/2606.05538#S2.SS3)).

## Limitations

Due to limited computational resources, we did not extend Fisher-MoE to MoE backbones with 100B+ parameters, nor did we run a full recovery study with longer post-training on larger and more diverse corpora to characterize how much of the residual gap to the uncompressed base model can be closed. We leave both directions for future work.

## References

- J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, et al. (2021) Program synthesis with large language models. arXiv preprint arXiv:2108.07732. Cited by: [Appendix E](https://arxiv.org/html/2606.05538#A5.SS0.SSS0.Px1.p1.1).
- I. Chen, H. Liu, W. Sun, C. Chao, Y. Hsu, C. Lee, et al. (2024) Retraining-free merging of sparse moe via hierarchical clustering. arXiv preprint arXiv:2410.08589. Cited by: [§1](https://arxiv.org/html/2606.05538#S1.p2.1).
- I. Chen, H. Liu, W. Sun, C. Chao, Y. Hsu, and C. Lee (2025) Retraining-free merging of sparse moe via hierarchical clustering. In International Conference on Machine Learning, pp. 8594–8620. Cited by: [§3.3](https://arxiv.org/html/2606.05538#S3.SS3.p1.2).
- M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Voss, W. H. Guss, A. Nichol, A. Paino, N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse, A. N. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. Knight, M. Brundage, M. Murati, K. Mayer, P. Welinder, B. McGrew, D. Amodei, S. McCandlish, I. Sutskever, and W. Zaremba (2021) Evaluating large language models trained on code. External Links: 2107.03374. Cited by: [Appendix E](https://arxiv.org/html/2606.05538#A5.SS0.SSS0.Px1.p1.1).
- T. Chen, S. Huang, Y. Xie, B. Jiao, D. Jiang, H. Zhou, J. Li, and F. Wei (2022) Task-specific expert pruning for sparse mixture-of-experts. arXiv preprint arXiv:2206.00277. Cited by: [Appendix A](https://arxiv.org/html/2606.05538#A1.SS0.SSS0.Px1.p1.1), [§1](https://arxiv.org/html/2606.05538#S1.p2.1), [§3.3](https://arxiv.org/html/2606.05538#S3.SS3.p1.2).
- K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman (2021) Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168. Cited by: [Appendix E](https://arxiv.org/html/2606.05538#A5.SS0.SSS0.Px1.p1.1).
- W. Fedus, B. Zoph, and N. Shazeer (2022) Switch transformers: scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research 23 (120), pp. 1–39. Cited by: [§1](https://arxiv.org/html/2606.05538#S1.p1.1).
- E. Frantar and D. Alistarh (2023) Sparsegpt: massive language models can be accurately pruned in one-shot. In International conference on machine learning, pp. 10323–10337. Cited by: [Table 28](https://arxiv.org/html/2606.05538#A12.T28), [Appendix L](https://arxiv.org/html/2606.05538#A12.p1.1), [§P.2](https://arxiv.org/html/2606.05538#A16.SS2.p1.3).
- M. Geva, R. Schuster, J. Berant, and O. Levy (2021) Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 5484–5495. Cited by: [Appendix A](https://arxiv.org/html/2606.05538#A1.SS0.SSS0.Px2.p1.1).
- H. Gu, W. Li, L. Li, Q. Zhu, M. Lee, S. Sun, W. Xue, and Y. Guo (2025) Delta decompression for moe-based llms compression. arXiv preprint arXiv:2502.17298. Cited by: [Appendix A](https://arxiv.org/html/2606.05538#A1.SS0.SSS0.Px1.p1.1), [§1](https://arxiv.org/html/2606.05538#S1.p2.1).
- S. Gugger, L. Debut, T. Wolf, P. Schmid, Z. Mueller, S. Mangrulkar, M. Sun, and B. Bossan (2022) Accelerate: training and inference at scale made simple, efficient and adaptable. [https://github.com/huggingface/accelerate](https://github.com/huggingface/accelerate). Cited by: [Appendix E](https://arxiv.org/html/2606.05538#A5.SS0.SSS0.Px5.p1.1).
- N. Habib, C. Fourrier, H. Kydlíček, T. Wolf, and L. Tunstall (2023) LightEval: a lightweight framework for llm evaluation. External Links: [Link](https://github.com/huggingface/lighteval). Cited by: [Appendix E](https://arxiv.org/html/2606.05538#A5.SS0.SSS0.Px5.p1.1).
- B. Hassibi, D.G. Stork, and G.J. Wolff (1993) Optimal brain surgeon and general network pruning. In IEEE International Conference on Neural Networks, pp. 293–299 vol. 1. External Links: [Document](https://dx.doi.org/10.1109/ICNN.1993.298572). Cited by: [Appendix A](https://arxiv.org/html/2606.05538#A1.SS0.SSS0.Px2.p1.1).
- B. Hassibi and D. Stork (1992) Second order derivatives for network pruning: optimal brain surgeon. Advances in neural information processing systems 5. Cited by: [§P.2](https://arxiv.org/html/2606.05538#A16.SS2.p1.3).
- H. He, X. Ding, X. Jiang, X. Zou, A. Cheng, Y. Zhao, J. B. Li, and H. Miller (2026) Preserving long-tailed expert information in mixture-of-experts tuning. arXiv preprint arXiv:2604.23036. Cited by: [§3.2](https://arxiv.org/html/2606.05538#S3.SS2.p3.1).
- H. He, J. B. Li, X. Jiang, and H. Miller (2025a) Sparse matrix in large language model fine-tuning. External Links: 2405.15525, [Link](https://arxiv.org/abs/2405.15525). Cited by: [Appendix A](https://arxiv.org/html/2606.05538#A1.SS0.SSS0.Px2.p1.1).
- S. He, D. Dong, L. Ding, and A. Li (2025b) Towards efficient mixture of experts: a holistic study of compression techniques. Transactions on Machine Learning Research. External Links: ISSN 2835-8856, [Link](https://openreview.net/forum?id=HTpMOl6xSI). Cited by: [Appendix E](https://arxiv.org/html/2606.05538#A5.SS0.SSS0.Px2.p1.1), [§1](https://arxiv.org/html/2606.05538#S1.p2.1), [§2.1](https://arxiv.org/html/2606.05538#S2.SS1.SSS0.Px4.p1.1).
- D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt (2021a) Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR). Cited by: [Appendix E](https://arxiv.org/html/2606.05538#A5.SS0.SSS0.Px1.p1.1).
- D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt (2021b) Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874. Cited by: [Appendix E](https://arxiv.org/html/2606.05538#A5.SS0.SSS0.Px1.p1.1).
- F. Kunstner, P. Hennig, and L. Balles (2019) Limitations of the empirical fisher approximation for natural gradient descent. Advances in neural information processing systems 32. Cited by: [§P.1](https://arxiv.org/html/2606.05538#A16.SS1.SSS0.Px2.p1.2).
- W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica (2023) Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles. Cited by: [Appendix E](https://arxiv.org/html/2606.05538#A5.SS0.SSS0.Px5.p1.1).
- Y. LeCun, J. Denker, and S. Solla (1989) Optimal brain damage. Advances in neural information processing systems 2. Cited by: [Appendix A](https://arxiv.org/html/2606.05538#A1.SS0.SSS0.Px2.p1.1), [§P.2](https://arxiv.org/html/2606.05538#A16.SS2.p1.3).
- J. Lee, A. Qiao, D. F. Campos, Z. Yao, Y. He, et al. (2024) Stun: structured-then-unstructured pruning for scalable moe pruning. arXiv preprint arXiv:2409.06211. Cited by: [Appendix A](https://arxiv.org/html/2606.05538#A1.SS0.SSS0.Px1.p1.1), [§1](https://arxiv.org/html/2606.05538#S1.p2.1), [3rd item](https://arxiv.org/html/2606.05538#S2.I1.i3.p1.1).
- D. Lepikhin, H. Lee, Y. Xu, D. Chen, O. Firat, Y. Huang, M. Krikun, N. Shazeer, and Z. Chen (2021) GShard: scaling giant models with conditional computation and automatic sharding. In International Conference on Learning Representations. Cited by: [§1](https://arxiv.org/html/2606.05538#S1.p1.1).
- H. Li, Y. Zhang, F. Koto, Y. Yang, H. Zhao, Y. Gong, N. Duan, and T. Baldwin (2024) Cmmlu: measuring massive multitask language understanding in chinese. In Findings of the Association for Computational Linguistics: ACL 2024, pp. 11260–11285. Cited by: [Appendix E](https://arxiv.org/html/2606.05538#A5.SS0.SSS0.Px1.p1.1).
- P. Li, Z. Zhang, P. Yadav, Y. Sung, Y. Cheng, M. Bansal, and T. Chen (2023) Merge, then compress: demystify efficient smoe with hints from its routing policy. arXiv preprint arXiv:2310.01334. Cited by: [§1](https://arxiv.org/html/2606.05538#S1.p2.1), [§3.3](https://arxiv.org/html/2606.05538#S3.SS3.p1.2).
- J. Lin, J. Tang, H. Tang, S. Yang, W. Chen, W. Wang, G. Xiao, X. Dang, C. Gan, and S. Han (2024) Awq: activation-aware weight quantization for on-device llm compression and acceleration. Proceedings of machine learning and systems 6, pp. 87–100. Cited by: [§4.5](https://arxiv.org/html/2606.05538#S4.SS5.p1.1), [§4](https://arxiv.org/html/2606.05538#S4.p1.2).
- X. Lu, Q. Liu, Y. Xu, A. Zhou, S. Huang, B. Zhang, J. Yan, and H. Li (2024a) Not all experts are equal: efficient expert pruning and skipping for mixture-of-experts large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), L. Ku, A. Martins, and V. Srikumar (Eds.), Bangkok, Thailand, pp. 6159–6172. External Links: [Link](https://aclanthology.org/2024.acl-long.334/), [Document](https://dx.doi.org/10.18653/v1/2024.acl-long.334). Cited by: [Appendix A](https://arxiv.org/html/2606.05538#A1.SS0.SSS0.Px1.p1.1), [§1](https://arxiv.org/html/2606.05538#S1.p2.1), [1st item](https://arxiv.org/html/2606.05538#S2.I1.i1.p1.2), [§3.3](https://arxiv.org/html/2606.05538#S3.SS3.p1.2).
- X. Lu, Q. Liu, Y. Xu, A. Zhou, S. Huang, B. Zhang, J. Yan, and H. Li (2024b) Not all experts are equal: efficient expert pruning and skipping for mixture-of-experts large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 6159–6172. Cited by: [Appendix L](https://arxiv.org/html/2606.05538#A12.p1.1).
- K. Meng, D. Bau, A. Andonian, and Y. Belinkov (2022) Locating and editing factual associations in GPT. In Advances in Neural Information Processing Systems, Vol. 35, pp. 17359–17372. Cited by: [Appendix A](https://arxiv.org/html/2606.05538#A1.SS0.SSS0.Px2.p1.1).
- P. Michel, O. Levy, and G. Neubig (2019) Are sixteen heads really better than one?. In Advances in Neural Information Processing Systems, Vol. 32. Cited by: [Appendix A](https://arxiv.org/html/2606.05538#A1.SS0.SSS0.Px2.p1.1).
- P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz (2017) Pruning convolutional neural networks for resource efficient inference. In International Conference on Learning Representations. Cited by: [Appendix A](https://arxiv.org/html/2606.05538#A1.SS0.SSS0.Px2.p1.1).
- N. Muennighoff, L. Soldaini, D. Groeneveld, K. Lo, J. Morrison, S. Min, W. Shi, P. Walsh, O. Tafjord, N. Lambert, Y. Gu, S. Arora, A. Bhagia, D. Schwenk, D. Wadden, A. Wettig, B. Hui, T. Dettmers, D. Kiela, A. Farhadi, N. A. Smith, P. W. Koh, A. Singh, and H. Hajishirzi (2024) OLMoE: open mixture-of-experts language models. External Links: 2409.02060, [Link](https://arxiv.org/abs/2409.02060). Cited by: [Appendix E](https://arxiv.org/html/2606.05538#A5.SS0.SSS0.Px1.p1.1).
- A. Muzio, A. Sun, and C. He (2024) Seer-moe: sparse expert efficiency through regularization for mixture-of-experts. arXiv preprint arXiv:2404.05089. Cited by: [Appendix A](https://arxiv.org/html/2606.05538#A1.SS0.SSS0.Px1.p1.1), [§1](https://arxiv.org/html/2606.05538#S1.p2.1), [1st item](https://arxiv.org/html/2606.05538#S2.I1.i1.p1.2), [§3.3](https://arxiv.org/html/2606.05538#S3.SS3.p1.2).
- Qwen Team (2024) Qwen1.5-MoE: matching 7b model performance with 1/3 activated parameters. Qwen Blog. External Links: [Link](https://qwenlm.github.io/blog/qwen-moe/). Cited by: [Appendix O](https://arxiv.org/html/2606.05538#A15.p1.1), [Appendix E](https://arxiv.org/html/2606.05538#A5.SS0.SSS0.Px1.p1.1), [§1](https://arxiv.org/html/2606.05538#S1.p2.1).
- Qwen Team (2026) Qwen3.5: towards native multimodal agents. External Links: [Link](https://qwen.ai/blog?id=qwen3.5). Cited by: [Appendix E](https://arxiv.org/html/2606.05538#A5.SS0.SSS0.Px1.p1.1).
- D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman (2024) GPQA: a graduate-level google-proof q&a benchmark. In First Conference on Language Modeling. External Links: [Link](https://openreview.net/forum?id=Ti67584b98). Cited by: [Appendix E](https://arxiv.org/html/2606.05538#A5.SS0.SSS0.Px1.p1.1).
- J. Ren, S. Rajbhandari, R. Y. Aminabadi, O. Ruwase, S. Yang, M. Zhang, D. Li, and Y. He (2021) ZeRO-offload: democratizing billion-scale model training. arXiv preprint arXiv:2101.06840. External Links: [Link](https://arxiv.org/abs/2101.06840). Cited by: [Appendix E](https://arxiv.org/html/2606.05538#A5.SS0.SSS0.Px5.p1.1).
- S. Roy and D. Roth (2015) Solving general arithmetic word problems. In Proceedings of the 2015 conference on empirical methods in natural language processing, pp. 1743–1752. Cited by: [Appendix E](https://arxiv.org/html/2606.05538#A5.SS0.SSS0.Px1.p1.1).
- N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean (2017) Outrageously large neural networks: the sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538. Cited by: [§1](https://arxiv.org/html/2606.05538#S1.p1.1).
- M. Sun, X. Chen, J. Z. Kolter, and Z. Liu (2024a) Massive activations in large language models. In Conference on Language Modeling (COLM). Cited by: [Appendix B](https://arxiv.org/html/2606.05538#A2.SS0.SSS0.Px3.p2.16), [Appendix B](https://arxiv.org/html/2606.05538#A2.p1.2), [Appendix C](https://arxiv.org/html/2606.05538#A3.p1.1), [Appendix C](https://arxiv.org/html/2606.05538#A3.p2.1), [Appendix C](https://arxiv.org/html/2606.05538#A3.p5.1), [§2.2](https://arxiv.org/html/2606.05538#S2.SS2.p2.2).
- M. Sun, Z. Liu, A. Bair, and J. Z. Kolter (2024b) A simple and effective pruning approach for large language models. In International Conference on Learning Representations. Cited by: [Table 28](https://arxiv.org/html/2606.05538#A12.T28), [Appendix L](https://arxiv.org/html/2606.05538#A12.p1.1), [§P.2](https://arxiv.org/html/2606.05538#A16.SS2.p1.3).
- M. Suzgun, N. Scales, N. Schärli, S. Gehrmann, Y. Tay, H. W. Chung, A. Chowdhery, Q. Le, E. Chi, D. Zhou, et al. (2023) Challenging big-bench tasks and whether chain-of-thought can solve them. In Findings of the Association for Computational Linguistics: ACL 2023, pp. 13003–13051. Cited by: [Appendix E](https://arxiv.org/html/2606.05538#A5.SS0.SSS0.Px1.p1.1).
- E. Voita, D. Talbot, F. Moiseev, R. Sennrich, and I. Titov (2019) Analyzing multi-head self-attention: specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5797–5808. Cited by: [Appendix A](https://arxiv.org/html/2606.05538#A1.SS0.SSS0.Px2.p1.1).
- L. von Werra, Y. Belkada, L. Tunstall, E. Beeching, T. Thrush, N. Lambert, S. Huang, K. Rasul, and Q. Gallouédec (2020) TRL: transformer reinforcement learning. GitHub. [https://github.com/huggingface/trl](https://github.com/huggingface/trl). Cited by: [Appendix E](https://arxiv.org/html/2606.05538#A5.SS0.SSS0.Px5.p1.1).
- G. Xiao, Y. Tian, B. Chen, S. Han, and M. Lewis (2024) Efficient streaming language models with attention sinks. In International Conference on Learning Representations (ICLR). Cited by: [Appendix C](https://arxiv.org/html/2606.05538#A3.p1.1), [Appendix C](https://arxiv.org/html/2606.05538#A3.p2.1), [Appendix C](https://arxiv.org/html/2606.05538#A3.p5.1), [§2.2](https://arxiv.org/html/2606.05538#S2.SS2.p3.3).
- Y. Xie, Z. Zhang, D. Zhou, C. Xie, Z. Song, X. Liu, Y. Wang, X. Lin, and A. Xu (2024) Moe-pruner: pruning mixture-of-experts large language model using the hints from its router. arXiv preprint arXiv:2410.12013. Cited by: [Appendix A](https://arxiv.org/html/2606.05538#A1.SS0.SSS0.Px1.p1.1), [Appendix E](https://arxiv.org/html/2606.05538#A5.SS0.SSS0.Px2.p1.1), [§1](https://arxiv.org/html/2606.05538#S1.p2.1), [2nd item](https://arxiv.org/html/2606.05538#S2.I1.i2.p1.1).
- A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. (2025) Qwen3 technical report. arXiv preprint arXiv:2505.09388. Cited by: [Appendix E](https://arxiv.org/html/2606.05538#A5.SS0.SSS0.Px1.p1.1), [§1](https://arxiv.org/html/2606.05538#S1.p2.1).
- A. Yang, B. Yang, B. Hui, B. Zheng, B. Yu, C. Zhou, C. Li, C. Li, D. Liu, F. Huang, G. Dong, H. Wei, H. Lin, J. Tang, J. Wang, J. Yang, J. Tu, J. Zhang, J. Ma, J. Xu, J. Zhou, J. Bai, J. He, J. Lin, K. Dang, K. Lu, K. Chen, K. Yang, M. Li, M. Xue, N. Ni, P. Zhang, P. Wang, R. Peng, R. Men, R. Gao, R. Lin, S. Wang, S. Bai, S. Tan, T. Zhu, T. Li, T. Liu, W. Ge, X. Deng, X. Zhou, X. Ren, X. Zhang, X. Wei, X. Ren, Y. Fan, Y. Yao, Y. Zhang, Y. Wan, Y. Chu, Y. Liu, Z. Cui, Z. Zhang, and Z. Fan (2024a) Qwen2 technical report. arXiv preprint arXiv:2407.10671. Cited by: [Appendix E](https://arxiv.org/html/2606.05538#A5.SS0.SSS0.Px1.p1.1).
- C. Yang, Y. Sui, J. Xiao, L. Huang, Y. Gong, Y. Duan, W. Jia, M. Yin, Y. Cheng, and B. Yuan (2024b) MoE-i2: compressing mixture of experts models through inter-expert pruning and intra-expert low-rank decomposition. In Findings of the Association for Computational Linguistics: EMNLP 2024, Y. Al-Onaizan, M. Bansal, and Y. Chen (Eds.), Miami, Florida, USA, pp. 10456–10466. External Links: [Link](https://aclanthology.org/2024.findings-emnlp.612/), [Document](https://dx.doi.org/10.18653/v1/2024.findings-emnlp.612). Cited by: [Appendix A](https://arxiv.org/html/2606.05538#A1.SS0.SSS0.Px1.p1.1), [§1](https://arxiv.org/html/2606.05538#S1.p2.1), [3rd item](https://arxiv.org/html/2606.05538#S2.I1.i3.p1.1).

## Appendix A Related Work

Prior MoE compression methods can be grouped by their *importance metric* (how compression targets are selected, as discussed in §[1](https://arxiv.org/html/2606.05538#S1)).

#### Importance metrics\.

Methods differ in how they select which experts or parameters to compress. *Activation-ratio-based* methods measure routing frequency (Muzio et al., [2024](https://arxiv.org/html/2606.05538#bib.bib38); Lu et al., [2024a](https://arxiv.org/html/2606.05538#bib.bib39); Chen et al., [2022](https://arxiv.org/html/2606.05538#bib.bib41)). *Router-score-based* methods use average gating weights (Xie et al., [2024](https://arxiv.org/html/2606.05538#bib.bib42); Gu et al., [2025](https://arxiv.org/html/2606.05538#bib.bib48)). *Magnitude-based* methods rank by weight norms (Lee et al., [2024](https://arxiv.org/html/2606.05538#bib.bib43); Yang et al., [2024b](https://arxiv.org/html/2606.05538#bib.bib40)). MoE-Pruner (Xie et al., [2024](https://arxiv.org/html/2606.05538#bib.bib42)) combines gated values with weight magnitudes. Our work demonstrates that *gradient-based* importance, which directly measures each parameter's contribution to the loss, substantially outperforms all of these alternatives (§[2.1](https://arxiv.org/html/2606.05538#S2.SS1)). Moreover, all prior MoE compression methods operate at the granularity of entire experts or expert blocks; our intermediate dimension compression (§[3](https://arxiv.org/html/2606.05538#S3)) represents a strictly finer granularity that better preserves the distributed knowledge structure of MoE models.

#### Model Attribution

Our Fisher-importance-based model attribution analysis relates to the broader literature on understanding which model components are responsible for specific capabilities. The discussion of parameter importance can be traced back to Optimal Brain Damage (LeCun et al., [1989](https://arxiv.org/html/2606.05538#bib.bib68)) and Optimal Brain Surgeon (Hassibi et al., [1993](https://arxiv.org/html/2606.05538#bib.bib69)), which use second-order information (Hessian diagonals) or first-order Taylor approximations (Molchanov et al., [2017](https://arxiv.org/html/2606.05538#bib.bib70)) to estimate the loss increase from weight removal. Recent work extends these ideas to attention heads (Michel et al., [2019](https://arxiv.org/html/2606.05538#bib.bib71); Voita et al., [2019](https://arxiv.org/html/2606.05538#bib.bib72)) or attention matrices (He et al., [2025a](https://arxiv.org/html/2606.05538#bib.bib4)). Several studies have investigated where factual knowledge is stored in transformers. Geva et al. ([2021](https://arxiv.org/html/2606.05538#bib.bib74)) show that FFN layers function as key-value memories. Meng et al. ([2022](https://arxiv.org/html/2606.05538#bib.bib75)) use causal tracing to locate factual associations in specific MLP layers. Our Fisher-importance-based attribution provides a finer-grained lens for model attribution and for understanding parameter specialization than layer-structure-wise analysis alone.

## Appendix B Critical Dimensions Coincide with FFN Activation Outliers

Section [2.2](https://arxiv.org/html/2606.05538#S2.SS2) establishes that zero-masking the top-12 Fisher-ranked intermediate dimensions of Qwen1.5-MoE collapses GSM8K from 35.9% to 0.8% while leaving multi-choice knowledge tasks largely intact. This appendix examines the forward-pass activation behaviour of those twelve dimensions and places them within the broader literature on activation outliers (Sun et al., [2024a](https://arxiv.org/html/2606.05538#bib.bib81)). We make a deliberately limited claim: the critical dimensions are extreme upper-tail activation outliers that share the qualitative signature of – but are not at the canonical scale of – the massive-activation phenomenon documented in dense LLMs.

#### Fisher and activation magnitude are not statistically independent\.

Before reporting the activation statistics we make explicit a known mathematical dependency that prevents interpreting them as an independent confirmation of Fisher. For the down-projection slab, the gradient with respect to weight $(W^{\text{down}}_{i})_{k,j}$ factors as $\partial\mathcal{L}/\partial(W^{\text{down}}_{i})_{k,j} \;=\; (\partial\mathcal{L}/\partial h^{(\ell)}_{k})\cdot a^{(i)}_{j}$, where $a^{(i)}_{j}$ is the post-activation along intermediate dimension $j$ of expert $i$ and $\partial\mathcal{L}/\partial h^{(\ell)}_{k}$ is the upstream gradient at the residual stream. Squaring and summing this term across $k$ – the form that enters the down-projection contribution of Eq. [5](https://arxiv.org/html/2606.05538#S2.E5) – therefore scales with $(a^{(i)}_{j})^{2}$. High-activation channels accumulate Fisher mass through this term by construction, so any positive correlation between Fisher rank and activation magnitude is partly expected, not a separate signal. The gate- and up-projection contributions in Eq. [5](https://arxiv.org/html/2606.05538#S2.E5) do not factor in the same way, so the coupling is partial rather than total, but it is strong enough that we treat the activation analysis below as *descriptive* – characterising the magnitude regime the top-12 occupy and connecting them to the activation-outlier literature – rather than as an independent attribution channel.
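The factorisation of the down-projection gradient can be checked numerically. The following is a minimal NumPy sketch, assuming a single token and a bare linear down-projection $h = W^{\text{down}} a$; the sizes, the seed, and all variable names are illustrative assumptions rather than anything taken from the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_int = 8, 5              # illustrative sizes, not the model's

a = rng.normal(size=d_int)         # post-activations a_j of one expert (synthetic)
g = rng.normal(size=d_model)       # upstream gradient dL/dh_k (synthetic)

# For h = W_down @ a, dL/dW_down[k, j] = g[k] * a[j], i.e. an outer product.
grad_w_down = np.outer(g, a)

# Squaring and summing over k gives the down-projection term per dimension j.
down_term = (grad_w_down ** 2).sum(axis=0)

# Factored form: a_j^2 times the squared norm of the upstream gradient.
factored = a ** 2 * (g ** 2).sum()

# Scaling the activations by 10 scales the term by 100.
scaled_term = (np.outer(g, 10.0 * a) ** 2).sum(axis=0)
```

Under these assumptions `down_term` and `factored` agree, which is the sense in which a high-activation channel accumulates score through this term regardless of the upstream gradient.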

Table 7: Forward activation magnitudes of the top-12 Fisher-ranked intermediate dimensions in Qwen1.5-MoE on GSM8K.

#### Measurement\.

For every routed expert in Qwen1.5-MoE-A2.7B and every intermediate dimension inside it (in total ∼2.03M dimensions across 24 MoE layers and 60 experts per layer), we record the mean absolute post-activation $|a_j|$ on the GSM8K calibration set (128 samples, the same set used to compute Fisher). We then compare the population distribution of $|a_j|$ against the values observed at the twelve dimensions whose Fisher score is largest.

#### Population baseline is sharply skewed\.

The empirical distribution of mean $|a|$ is heavy-tailed even before we look at any Fisher-selected outliers: median 0.17 and $p_{99} = 1.17$. In other words, only about 1% of all FFN intermediate dimensions exceed mean activation magnitude 1.17, and the typical dimension carries a forward signal roughly seven times smaller (0.17 versus 1.17).

#### Top-12 Fisher dimensions are extreme upper-tail outliers – but smaller than canonical massive activations.

Table [7](https://arxiv.org/html/2606.05538#A2.T7) reports, for each of the twelve dimensions, the mean $|a|$ together with its ratio to the population median and to the population $p_{99}$. The top nine dimensions carry mean $|a|$ between 4.0 and 77.9, i.e., 23× to 450× the population median and 3× to 66× the population $p_{99}$. Even the weakest two of the twelve (ranks 11 and 12) sit at 7–8× the median, still above the 99th percentile of the entire MoE FFN. For calibration against the original literature, Sun et al. ([2024a](https://arxiv.org/html/2606.05538#bib.bib81)) report *massive activations* in dense LLMs whose magnitudes exceed the channel mean by roughly $10^{4}\times$ – two to three orders of magnitude beyond what we observe here. We therefore describe the top-12 as *extreme upper-tail activation outliers* that share the qualitative signature of the massive-activation phenomenon – a tiny subset of hidden coordinates carrying disproportionate forward-pass mass – without claiming they are massive activations in the strict numerical sense of the original definition. The MoE setting may also dilute the per-channel magnitude relative to a dense backbone, since each expert is activated only on a fraction of tokens.
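The ratios quoted for the top nine dimensions follow directly from the reported population statistics. As a small sketch (the function names are hypothetical helpers, not part of any released code), the population summary and the two ratios can be computed as follows, using only the endpoint values stated in the text:

```python
import numpy as np

def population_stats(mean_abs_act):
    """Return (median, p99) of the per-dimension mean |a| values."""
    x = np.asarray(mean_abs_act, dtype=float)
    return float(np.median(x)), float(np.percentile(x, 99))

def outlier_ratios(values, median, p99):
    """Ratios of selected dimensions' mean |a| to the population median and p99."""
    v = np.asarray(values, dtype=float)
    return v / median, v / p99

# Endpoints reported in the text for the top nine dimensions, with the
# reported population median (0.17) and p99 (1.17).
to_median, to_p99 = outlier_ratios([4.0, 77.9], median=0.17, p99=1.17)
```

The resulting ratios are consistent with the rounded 23× to 450× and 3× to 66× ranges quoted for the top nine dimensions.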

The Spearman rank correlation between Fisher and mean $|a|$ over the twelve dimensions is 0.91. Given the gradient–activation coupling noted above, this correlation has the expected direction and approximately the expected strength: it confirms that intermediate-dimension Fisher concentrates on high-activation channels, but does not establish a statistically independent attribution channel. Read this way, the coupling is also the reason intermediate-dimension Fisher is a useful tractable attribution: by inheriting weight from the activation-driven down-projection term, it inherits an inductive bias toward exactly the outlier channels that prior work on activation outliers and attention sinks has shown to be load-bearing (Appendix [C](https://arxiv.org/html/2606.05538#A3)).
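For readers unfamiliar with the statistic, Spearman's coefficient is the Pearson correlation of the ranks. The sketch below assumes tie-free inputs and uses toy values, not the paper's twelve measurements; `spearman_rho` is a hypothetical helper name.

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman correlation for tie-free inputs: Pearson correlation of the ranks."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return float(np.corrcoef(rank_x, rank_y)[0, 1])

# Toy example: twelve items whose second score swaps two adjacent ranks.
score_a = np.arange(12, 0, -1)                      # 12, 11, ..., 1
score_b = score_a.copy()
score_b[[3, 4]] = score_b[[4, 3]]                   # one adjacent swap
rho = spearman_rho(score_a, score_b)
```

A single adjacent swap among twelve items already moves the coefficient slightly below 1, which gives a feel for how close to monotone a value of 0.91 is.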

#### Internal structure within the top\-12\.

Ranks 1–9 are extreme outliers deep in the upper tail of the population distribution (≥23× the median), while ranks 10–12 are still above $p_{99}$ but no longer of the same magnitude (7–16×). This tapering is consistent with activation outliers being concentrated in only a handful of channels per model – the first nine ranks carry the bulk of the abnormal forward-pass mass that, as Appendix [C](https://arxiv.org/html/2606.05538#A3) shows, is associated with the mid-stack attention sink.

## Appendix C Masking Critical Dimensions Reduces the Mid-Stack Attention Sink

Appendix [B](https://arxiv.org/html/2606.05538#A2) shows that the twelve Fisher-critical intermediate dimensions of Qwen1.5-MoE coincide with extreme upper-tail outliers in the FFN activation distribution – a milder MoE analogue of the massive activations documented in dense LLMs by Sun et al. ([2024a](https://arxiv.org/html/2606.05538#bib.bib81)). This appendix reports a second, downstream property of those dimensions: masking them substantially reduces the BOS attention sink (Xiao et al., [2024](https://arxiv.org/html/2606.05538#bib.bib82)) in the mid-stack layers where the sink dominates. We use this observation to relate the accuracy dissociation in Table [1](https://arxiv.org/html/2606.05538#S2.T1) to a known mechanistic concept, while being explicit about what the experiment can and cannot establish.

In a softmax attention layer the attention weights for each query must sum to one. When no key is genuinely relevant, optimisation pressure has been shown to push probability mass onto a low-information “parking” position – typically the BOS token (Xiao et al., [2024](https://arxiv.org/html/2606.05538#bib.bib82)). This sink stabilises decoding in two ways: it absorbs unallocated probability mass without distorting on-topic tokens, and it provides a residual-stream anchor whose norm keeps mid-stack pre-softmax logits inside a numerically well-behaved range. Sun et al. ([2024a](https://arxiv.org/html/2606.05538#bib.bib81)) further argue that the sink is supported by extreme FFN activations that progressively inflate the BOS residual-stream norm across depth.

We measure, per decoder layer, the mean fraction of softmax attention paid to the BOS token, averaged across attention heads and across all positions of every GSM8K test prompt\. We compare the base model against the Mask\-Top\-12 variant – the identical model with those twelve intermediate dimensions zeroed in the forward pass\. Layers are grouped into bands that reflect the baseline sink profile of Qwen1\.5\-MoE\.
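The per-layer measurement can be sketched as follows. This is an illustrative NumPy version under stated assumptions: attention weights are available as an array of shape (heads, query positions, key positions), BOS sits at key index 0, and the weights here are synthetic causal-softmax outputs rather than activations from the model. The band boundaries are the ones named in the text (L0–1, L2–6, L7–17, L18–23); the helper names are hypothetical.

```python
import numpy as np

BANDS = {"L0-1": range(0, 2), "L2-6": range(2, 7),
         "L7-17": range(7, 18), "L18-23": range(18, 24)}

def causal_softmax(scores):
    """Row-wise softmax over keys with a causal mask; scores: (heads, T, T)."""
    T = scores.shape[-1]
    mask = np.tril(np.ones((T, T), dtype=bool))
    masked = np.where(mask, scores, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)

def bos_fraction(attn):
    """Mean attention mass on key 0, averaged over heads and query positions."""
    return float(attn[:, :, 0].mean())

def band_means(per_layer):
    """Average the per-layer BOS fractions inside each layer band."""
    return {name: float(np.mean([per_layer[l] for l in layers]))
            for name, layers in BANDS.items()}

# Synthetic stand-in for 24 decoder layers with 4 heads and 16 tokens each.
rng = np.random.default_rng(0)
per_layer = [bos_fraction(causal_softmax(rng.normal(size=(4, 16, 16))))
             for _ in range(24)]
summary = band_means(per_layer)
```

Running the same computation on the base model and on the masked variant, then comparing the two `summary` dictionaries band by band, corresponds to the comparison described above.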

Table 8: Mean attention to BOS per decoder-layer band on GSM8K with Qwen1.5-MoE-A2.7B.

#### Masking the top-12 substantially reduces the mid-stack sink.

Table [8](https://arxiv.org/html/2606.05538#A3.T8) shows that Mask-Top-12 attenuates the sink in the layers where it is strongest at baseline. In the mid-stack (L7–17) mean BOS attention falls from 0.32–0.44 down to 0.07–0.18, roughly a 70% reduction. The early sink-free layers (L0–1) are unchanged; the weaker early-middle and late tails (L2–6, L18–23) are only mildly attenuated. The layer-localised pattern is consistent with the twelve dimensions contributing meaningfully to the mid-stack sink rather than driving attention behaviour uniformly across depth.

We are deliberate about the strength of this claim. The result establishes a strong *association*: the twelve dimensions are activation outliers (Appendix [B](https://arxiv.org/html/2606.05538#A2)), they sit on the same family of channels prior work has linked to BOS attention dynamics (Sun et al., [2024a](https://arxiv.org/html/2606.05538#bib.bib81); Xiao et al., [2024](https://arxiv.org/html/2606.05538#bib.bib82)), and removing them produces a sizeable mid-stack sink reduction. It does *not* prove that those twelve dimensions are the unique source of the sink: any sufficiently large activation-outlier ablation might produce a qualitatively similar mid-stack reduction, and our experiment varies only which dimensions are masked rather than sweeping comparably-sized outlier subsets. We treat the result as evidence that the critical dimensions identified by Fisher are part of – not necessarily the entirety of – the FFN-side substrate that maintains the mid-stack sink.

Two controls partially tighten the attribution. (i) Removing the universally-critical core $\mathcal{K}_{\cap}$ (∼4.88% of all dimensions; §[2.3](https://arxiv.org/html/2606.05538#S2.SS3)) – which strictly contains the top-12 – collapses BOS attention globally rather than only in the mid-stack, consistent with the larger set containing additional sink-supporting dimensions beyond the top-12. (ii) Removing the comparably-sized but Fisher-redundant set $\mathcal{D}_{\cap}$ (∼4.01%) leaves BOS attention essentially unchanged in every layer band. Together with the mid-stack-only response under Mask-Top-12, these controls argue against the alternative that any random ablation of similar parameter mass would disturb the sink. They do not, however, control for activation magnitude at a matched set size – a stronger causal isolation we leave to future work.

#### Linking sink reduction to the accuracy dissociation\.

The pattern of sink reduction lines up with the accuracy pattern in Table [1](https://arxiv.org/html/2606.05538#S2.T1): generation-heavy benchmarks (GSM8K, MATH, HumanEval, MBPP) require many autoregressive decoding steps, each of which depends on the numerically stable attention dynamics that the mid-stack sink maintains; once the sink is attenuated, long-form generation degenerates into the echoes and garbled outputs catalogued in Appendix [J](https://arxiv.org/html/2606.05538#A10) (31.1% echo, 7.5% empty – behaviours absent at baseline). Short-answer MCQ tasks (MMLU, CEval, CMMLU, BBH) require only a single-token prediction conditioned on the prompt, do not heavily exercise long-range attention dynamics, and correspondingly retain 90–98% of base accuracy. We frame this as a coherent mechanistic explanation: the sink-reduction direction and the accuracy-collapse direction match, and that match is consistent with the literature linking outlier FFN activations to attention sinks.

#### Summary\.

Taken together, Appendices [B](https://arxiv.org/html/2606.05538#A2) and [C](https://arxiv.org/html/2606.05538#A3) place the twelve Fisher-critical dimensions in a coherent mechanistic context: (i) intermediate-dimension Fisher localises capability onto a tiny set of channels whose Fisher scores are ∼1000× the population mean; (ii) those channels sit in the extreme upper tail of the FFN activation distribution, a milder MoE analogue of the massive-activation phenomenon documented for dense LLMs; (iii) masking them substantially reduces the mid-stack BOS attention sink while leaving sink-free layers alone, mirroring the generation-vs-MCQ dissociation in Table [1](https://arxiv.org/html/2606.05538#S2.T1). We present this as a consistent pattern that connects intermediate-dimension Fisher to known mechanistic concepts.

## Appendix D Detailed Experimental Results

This appendix collects the full per\-benchmark numbers that underlie the figures and tables in the main paper\. Each block below corresponds to one main\-paper figure or table\.

Table 9: Raw data underlying Figure 2: expert-level pruning with existing importance metrics on Qwen1.5-MoE at $p=50\%$ MoE compression ratio.

Table 10: Raw data underlying Table 1 (mask-top-12 and critical/redundant dimension removal). Performance (%) on Qwen1.5-MoE-A2.7B.

Table 11: Raw data underlying Figure 3: Qwen1.5-MoE accuracy (%) under expert-level compression at three drop ratios.

Table 12: Raw data underlying Table 2: expert-level vs. intermediate dimension compression at $p=50\%$ on Qwen1.5-MoE-A2.7B.

Table 13: Raw data underlying Table 7: strict zero-shot evaluation on Qwen1.5-MoE-A2.7B. GSM8K is the calibration source and is part of the in-domain set.

Table 14: Raw data underlying Table 8: stacking 4-bit AWQ on top of Fisher-MoE at MoE compression ratio $p=50\%$ on Qwen1.5-MoE-A2.7B. Disk and VRAM are in GiB.

## Appendix E Additional Experimental Settings

#### Model Architecture and Dataset:

In our experimental setup, we use the open-weight models OLMoE-1B-7B-0125 (Muennighoff et al., [2024](https://arxiv.org/html/2606.05538#bib.bib22)), Qwen1.5-MoE-A2.7B (Yang et al., [2024a](https://arxiv.org/html/2606.05538#bib.bib15)), Qwen3-30B-A3B (Yang et al., [2025](https://arxiv.org/html/2606.05538#bib.bib53)), and Qwen3.5-35B-A3B (Qwen Team, [2026](https://arxiv.org/html/2606.05538#bib.bib23)). We adopt two data configurations: a general task configuration following technical reports (Yang et al., [2024a](https://arxiv.org/html/2606.05538#bib.bib15); Qwen Team, [2024](https://arxiv.org/html/2606.05538#bib.bib66)) for the lightweight pre-trained LLMs (Qwen1.5-MoE-A2.7B, OLMoE-1B-7B-0125), and a long CoT configuration for the stronger and larger reasoning LLMs (Qwen3-30B-A3B, Qwen3.5-35B-A3B). For Qwen1.5-MoE-A2.7B and OLMoE-1B-7B-0125, we evaluate on GSM8K (Cobbe et al., [2021](https://arxiv.org/html/2606.05538#bib.bib6)), MMLU (Hendrycks et al., [2021a](https://arxiv.org/html/2606.05538#bib.bib2)), HumanEval (Chen et al., [2021](https://arxiv.org/html/2606.05538#bib.bib14)), MBPP (Austin et al., [2021](https://arxiv.org/html/2606.05538#bib.bib11)), CEval, CMMLU (Li et al., [2024](https://arxiv.org/html/2606.05538#bib.bib12)), MATH (Hendrycks et al., [2021b](https://arxiv.org/html/2606.05538#bib.bib13)), BBH (Suzgun et al., [2023](https://arxiv.org/html/2606.05538#bib.bib8)), and MultiArith (Roy and Roth, [2015](https://arxiv.org/html/2606.05538#bib.bib7)), which together cover knowledge, code generation, mathematical reasoning, and general reasoning. For Qwen3-30B-A3B and Qwen3.5-35B-A3B, we evaluate on the SOTA long-CoT math reasoning benchmarks AIME2026, AIME2025, GPQA-Diamond (Rein et al., [2024](https://arxiv.org/html/2606.05538#bib.bib10)), MATH-500, and Olympiad Bench.

All pretrained backbones (Qwen1.5-MoE, Qwen3/Qwen3.5, OLMoE) and all evaluation benchmarks used in this paper are public research artifacts; we use each under its original license (Qwen / Apache-2.0 for the Qwen family, Apache-2.0 for OLMoE, and the respective research licenses for GSM8K, MATH, MMLU, HumanEval, MBPP, CEval, CMMLU, BBH, MultiArith, AIME, GPQA-D, Olympiad Bench, MATH-500, and Stanford-S1). Our released code and checkpoints inherit the license of the corresponding base model.

#### MoE Compression Baselines:

For state-of-the-art (SOTA) MoE compression baselines, we include the unified MoE compression framework of He et al. ([2025b](https://arxiv.org/html/2606.05538#bib.bib46)) (denoted *MoE compression*, which uses router scores as the importance metric) and MoE-Pruner (Xie et al., [2024](https://arxiv.org/html/2606.05538#bib.bib42)) (weight magnitude). Both baselines operate at the expert granularity. To compare importance signals more comprehensively while holding the compression framework fixed, we further extend the unified framework with two alternative importance metrics (activation ratio and router-Fisher), yielding two controlled variants that we denote *MoE compression (activation)* and *MoE compression (Fisher)*. This setup lets us attribute performance differences to the importance metric and granularity.

#### Fisher\-MoE Variants\.

We propose four Fisher-MoE variants that differ in pruning granularity and allocation flexibility. (1) Expert-level Fisher treats each expert as the pruning unit, where the Fisher score of expert $i$ is computed over the union of its parameters $W_{i}^{\text{gate}}$, $W_{i}^{\text{up}}$, and $W_{i}^{\text{down}}$, and entire experts are removed according to this score. (2) IntDim-E applies Fisher scoring at the intermediate dimension level but keeps the top $(1-p)$ fraction of dimensions within each expert, enforcing the same compression ratio for every expert. (3) IntDim-L pools intermediate dimensions across all experts within the same MoE layer and keeps the top $(1-p)$ fraction per layer, allowing the retained dimensions to be redistributed among experts in that layer. (4) IntDim-G pools intermediate dimensions across the entire model and keeps the top $(1-p)$ fraction globally, providing the most flexible allocation across both layers and experts. All intermediate dimension variants use the same Fisher scoring criterion and overall parameter budget, allowing us to isolate the effect of increasingly flexible fine-grained allocation.
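The three intermediate-dimension variants differ only in the pool over which the top $(1-p)$ fraction is taken. The sketch below illustrates that difference, assuming the per-dimension Fisher scores are already available as an array of shape (layers, experts, dims); the function name, the `scope` argument, and the random scores are hypothetical and only serve the illustration.

```python
import numpy as np

def keep_mask(scores, p, scope):
    """Boolean keep-mask over intermediate dimensions.

    scores: (layers, experts, dims) importance scores; p: fraction to drop;
    scope: "expert" (IntDim-E), "layer" (IntDim-L) or "global" (IntDim-G).
    """
    L, E, D = scores.shape
    if scope == "expert":
        flat = scores.reshape(L * E, D)          # one pool per expert
    elif scope == "layer":
        flat = scores.reshape(L, E * D)          # one pool per MoE layer
    elif scope == "global":
        flat = scores.reshape(1, L * E * D)      # a single model-wide pool
    else:
        raise ValueError(f"unknown scope: {scope}")
    n_keep = int(round((1 - p) * flat.shape[1]))
    # Indices of the n_keep largest scores inside each pool.
    top = np.argsort(-flat, axis=1)[:, :n_keep]
    mask = np.zeros(flat.shape, dtype=bool)
    np.put_along_axis(mask, top, True, axis=1)
    return mask.reshape(L, E, D)

# Synthetic scores for 2 layers, 3 experts, 8 dimensions per expert.
rng = np.random.default_rng(0)
toy_scores = rng.random((2, 3, 8))
masks = {s: keep_mask(toy_scores, p=0.5, scope=s) for s in ("expert", "layer", "global")}
```

All three masks keep the same total number of dimensions; only the per-expert and per-layer counts are free to vary as the pool widens.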

#### Calibration–Evaluation Disjointness:

For every benchmark used as both a calibration source and an evaluation target \(e\.g\., GSM8K, MATH, MultiArith, MBPP, HumanEval\), we use the standard training split for calibration \(and SFT, where applicable\) and evaluate on the standard test split, which has no overlap with calibration\. The same disjoint train/test convention is used for every other in\-domain calibration–evaluation pair throughout the paper\. To ensure fair comparison, all methods use the same calibration data to compute importance scores and perform compression\.

#### Training Framework and Hyper\-parameters:

We use the `huggingface-trl` (von Werra et al., [2020](https://arxiv.org/html/2606.05538#bib.bib20)) library with ZeRO-2 or ZeRO-3 (Ren et al., [2021](https://arxiv.org/html/2606.05538#bib.bib16)) for fine-tuning, and `vllm` (Kwon et al., [2023](https://arxiv.org/html/2606.05538#bib.bib19)), `lighteval` (Habib et al., [2023](https://arxiv.org/html/2606.05538#bib.bib17)), and `accelerate` (Gugger et al., [2022](https://arxiv.org/html/2606.05538#bib.bib18)) for inference and evaluation. Both training and evaluation use bf16.

#### Computational Resources:

We run all experiments, baseline implementations, and post-training on 8× NVIDIA H100 80 GB GPUs or 8× NVIDIA A100 80 GB GPUs. CPU–GPU communication is over PCIe Gen4, and inter-GPU communication is over NVLink-3.

## Appendix F Ablation Study on Calibration Set Size

The Fisher importance score (§[2.1](https://arxiv.org/html/2606.05538#S2.SS1)) is a Monte-Carlo estimate of the squared gradient magnitude $|\nabla_{W}\mathcal{L}(x,y)|^{2}$ over a calibration set $\mathcal{D}\subset\mathcal{X}\times\mathcal{Y}$:

$$s^{\text{Fisher}}(W)=\frac{1}{N}\sum_{(x,y)\in\mathcal{D}}\left|\nabla_{W}\mathcal{L}(x,y)\right|^{2},\qquad N=|\mathcal{D}|. \tag{9}$$

Because every $(x,y)$ requires a forward and a backward pass through the full MoE, the cost of computing $s^{\text{Fisher}}$ scales linearly with $N$. The natural question is how large $N$ must be for the resulting Fisher ranking, and thus the compressed model that ranking selects, to stabilize.
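Eq. (9) and the stability question can be illustrated on synthetic data. The sketch below assumes per-sample gradients are already materialised as an array of shape (N, parameters); the heavy-tailed per-parameter scale, the sizes, and the helper names are assumptions made for illustration and do not reproduce the paper's experiment.

```python
import numpy as np

def fisher_score(per_sample_grads):
    """Eq. (9): average of squared per-sample gradients, shape (N, ...) -> (...)."""
    g = np.asarray(per_sample_grads, dtype=float)
    return (g ** 2).mean(axis=0)

def topk_overlap(score_a, score_b, k):
    """Fraction of the top-k entries of score_a that are also top-k in score_b."""
    top_a = set(np.argsort(-score_a)[:k].tolist())
    top_b = set(np.argsort(-score_b)[:k].tolist())
    return len(top_a & top_b) / k

# Synthetic per-sample gradients: 512 samples, 1000 parameters, with a
# heavy-tailed per-parameter scale (an assumption, not measured data).
rng = np.random.default_rng(0)
scale = rng.lognormal(sigma=1.5, size=1000)
grads = rng.normal(size=(512, 1000)) * scale

reference = fisher_score(grads)                 # estimate from all 512 samples
for n in (32, 64, 128, 256):
    overlap = topk_overlap(fisher_score(grads[:n]), reference, k=100)
    print(n, overlap)
```

Comparing the top-k set at a small $N$ against the set at the largest available $N$ is one simple way to see when the ranking stops changing; the downstream-accuracy sweep reported in this appendix is the more direct test.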

We answer this empirically: we fix the backbone (Qwen1.5-MoE-A2.7B), the MoE compression ratio ($p=50\%$ via Fisher-IntDim-E), the calibration domain (domain-matched calibration data), and the decoding settings (greedy, vLLM v0.8.4, bf16), then vary only $N\in\{32,64,128,256,512\}$. Table [15](https://arxiv.org/html/2606.05538#A6.T15) reports downstream accuracy of the resulting compressed model. The 'N/A' results for $N=512$ occur because some datasets contain fewer than 512 samples for calibration.

Table 15: Fisher-IntDim-E calibration-size sweep on Qwen1.5-MoE-A2.7B at $p=50\%$ (domain-matched calibration, greedy decoding via vLLM). Each column reports downstream accuracy of the compressed model produced when the Fisher score is estimated from $N$ calibration samples. Higher is better.

Three observations stand out. First, the Fisher ranking is sample-efficient: at $N=32$ the resulting compressed model already retains 33.7% on GSM8K and 50.4% on MMLU, both above the strongest expert-level Fisher baseline at $N=128$ (Table [3](https://arxiv.org/html/2606.05538#S4.T3), Fisher-Expert: 18.0/37.9). This is consistent with the empirical Fisher being a Monte-Carlo estimator of $\mathbb{E}_{(x,y)\sim\mathcal{D}_{\text{cal}}}[|\nabla_{W}\mathcal{L}|^{2}]$: the average is dominated by a small number of high-gradient examples, so even tens of samples concentrate the ranking around the right tail. Second, the curve is essentially flat past $N=128$: MMLU and MATH change only marginally from $N=128$ to $N=512$ (50.3% → 53.0% and 8.0% → 7.8%) despite a 4× increase in compute.

#### Default\.

Following the knee of the curve, we use $N=128$ for all main-paper experiments. This setting is the smallest sample budget for which the Fisher-IntDim-E model is within ∼2 points of the $N=512$ result on MMLU and MATH, while requiring only 128 forward+backward passes per backbone, a cost within one minute on one H100 node even for the 30B-parameter MoEs.

## Appendix G AWQ Quantization Settings

#### Quantization precision\.

AWQ is configured with per-group W4A16-asymmetric quantization: weight bit-width `w_bit=4` (INT4), `zero_point=True` (asymmetric, per-group scale and zero-point), group size `q_group_size=128`, activations kept in the original FP16/BF16 dtype, GEMM kernel format, and packed INT4 `safetensors` output.
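To make the storage format concrete, the sketch below shows plain round-to-nearest per-group asymmetric 4-bit quantization with group size 128. It illustrates only the W4 asymmetric format (one scale and one zero-point per group); it is not AWQ's activation-aware scale search, and the function names are hypothetical rather than AutoAWQ APIs.

```python
import numpy as np

def quantize_w4_asym(w, group_size=128):
    """Per-group asymmetric 4-bit quantization of a flat weight vector."""
    w = np.asarray(w, dtype=np.float32).reshape(-1, group_size)
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = np.maximum(w_max - w_min, 1e-8) / 15.0     # 4 bits -> levels 0..15
    zero = np.clip(np.round(-w_min / scale), 0, 15)    # per-group zero-point
    q = np.clip(np.round(w / scale) + zero, 0, 15).astype(np.uint8)
    return q, scale, zero

def dequantize(q, scale, zero):
    """Map 4-bit codes back to floating point with the per-group scale and zero-point."""
    return (q.astype(np.float32) - zero) * scale

# Synthetic weights: two groups of 128 values.
rng = np.random.default_rng(0)
weights = rng.normal(size=256).astype(np.float32)
codes, scales, zeros = quantize_w4_asym(weights)
recovered = dequantize(codes, scales, zeros).reshape(-1)
```

Each group stores 128 four-bit codes plus one scale and one zero-point, which is where the per-group overhead on top of the nominal 4 bits per weight comes from.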

#### Calibration\.

Calibration uses PileVal (alternatives: C4, WikiText2), `split=train`, text column `text`, at most 128 calibration samples (script variants: 32, 1024), maximum sequence length 512 (variant: 256), and duo-scaling enabled (jointly optimize the scale w.r.t. weight and activation).

#### Quantized layers\.

Quantization is applied per `Qwen2MoeDecoderLayer`; within each layer, weights are partitioned into *scaling groups* that share AWQ smoothing and quantization jointly.

- **Attention.** (i) Input LayerNorm → [Q-proj, K-proj, V-proj] (grouped scaling); (ii) V-proj → [O-proj] (only when `num_kv_heads == num_heads`; O-proj is still quantized).
- **MoE sparse layers** (most decoder layers). (i) Post-Attention LayerNorm → [all expert gate/up projections + shared-expert gate/up] (jointly scaled across 60 experts × 2 + 2 shared modules); (ii) per expert: up-proj → [down-proj]; (iii) shared expert: up-proj → [down-proj].
- **Dense MLP layers** (`mlp_only_layers`). (i) Post-Attention LayerNorm → [MLP gate-proj, up-proj]; (ii) MLP up-proj → [MLP down-proj].

#### Modules excluded from quantization\.

`modules_to_not_convert = ["gate", "shared_expert_gate"]`: the MoE router (`mlp.gate`, computes routing logits) and the shared-expert gate (`shared_expert_gate`, sigmoid gating). AutoAWQ defaults additionally skip `lm_head`, all embedding layers, and every LayerNorm/RMSNorm.

#### Additional behaviours\.

`fuse_layers` is a no-op for Qwen2-MoE (no QKV/MLP fusion; module structure is preserved). `get_act_for_scaling` uses `is_scalable=False` (no extra scalable activations at the decoder-layer level). `move_embed` transfers embedding tokens to the configured device for calibration hidden-state collection. Layer type: `Qwen2MoeDecoderLayer`; sequence-length key: `max_position_embeddings`.

#### Summary\.

AWQ performs per-group W4A16-asymmetric quantization (group size 128) across Qwen2-MoE, covering all QKV/O projections in attention and gate/up/down modules in all routed and shared experts, while skipping routers (`mlp.gate`), the shared-expert gate, layer norms, embeddings, and the head. Calibration uses PileVal with 128 samples × 512 tokens by default and duo-scaling enabled; the output uses the GEMM kernel format.

## Appendix H Shared Dimensions Across Tasks

To further characterize how retained dimensions are shared across tasks, we compute pairwise overlaps. For each pair $(a,b)$, we define

$$\mathrm{Overlap}(a,b)\;=\;\frac{|\mathcal{K}_{a}\cap\mathcal{K}_{b}|}{|\mathcal{K}_{a}|}\;=\;\frac{|\mathcal{K}_{a}\cap\mathcal{K}_{b}|}{|\mathcal{K}_{b}|}, \tag{10}$$

where the equality holds since $|\mathcal{K}_{a}|=|\mathcal{K}_{b}|$ under a shared drop ratio. Table [16](https://arxiv.org/html/2606.05538#A8.T16) shows the resulting overlap matrix. Three structured patterns emerge: (1) Linguistic affinity. The Chinese benchmarks C-Eval and CMMLU exhibit the highest overlap (69.6%). (2) Domain affinity. Coding tasks (HumanEval/MBPP, 66.5%) and math tasks (GSM8K/MATH, 65.3%) form similarly strong clusters.
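The overlap in Eq. (10) is a plain set computation. A minimal sketch follows, assuming each task's kept set is given as a collection of dimension identifiers; the helper name and the toy identifiers are hypothetical.

```python
def overlap(kept_a, kept_b):
    """Eq. (10): shared fraction of two equally sized kept-dimension sets."""
    a, b = set(kept_a), set(kept_b)
    if len(a) != len(b):
        raise ValueError("Eq. (10) assumes both tasks keep the same number of dimensions")
    return len(a & b) / len(a)

# Toy kept sets identified by (layer, expert, dim) tuples.
kept_task_a = {(0, 0, 1), (0, 0, 2), (0, 1, 5), (1, 3, 7)}
kept_task_b = {(0, 0, 2), (0, 1, 5), (1, 2, 0), (1, 3, 9)}
shared = overlap(kept_task_a, kept_task_b)   # two of four dimensions are shared
```

Because both sets have the same size under a shared drop ratio, the measure is symmetric, which is why a single matrix entry per task pair suffices.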

Table 16: Pairwise overlap of kept intermediate dimensions between tasks\. Values above $65\%$ are highlighted in bold\.
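
The overlap in Eq. (10) is a plain set computation. The sketch below shows it on toy data; the task names and kept-dimension sets are made up for illustration and do not come from the paper.

```python
from itertools import combinations

def overlap(kept_a: set, kept_b: set) -> float:
    """Fraction of task a's kept dimensions that task b also keeps (Eq. 10)."""
    if not kept_a or len(kept_a) != len(kept_b):
        # Eq. 10 relies on equal-sized, non-empty kept sets (shared drop ratio).
        raise ValueError("kept sets must be non-empty and of equal size")
    return len(kept_a & kept_b) / len(kept_a)

# Toy kept-dimension index sets (hypothetical, four kept dimensions per task).
kept = {
    "task_x": {0, 1, 2, 3},
    "task_y": {2, 3, 4, 5},
    "task_z": {0, 1, 2, 5},
}

# Upper triangle of the pairwise overlap matrix.
overlap_matrix = {
    (a, b): overlap(kept[a], kept[b])
    for a, b in combinations(sorted(kept), 2)
}
```

Because the kept sets have equal size, the measure is symmetric, so only the upper triangle needs to be computed.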

## Appendix I Commonsense Benchmarks Exhibit High Variance Under Random Compression

In Section [1](https://arxiv.org/html/2606.05538#S1), we argue that an overlooked issue is that prior works predominantly evaluate compressed MoE models on commonsense reasoning benchmarks \(e\.g\., ARC, PIQA, HellaSwag\)\. We find that these benchmarks exhibit large variance across runs and are unreliable indicators of compression quality\. For instance, when 50% of experts are removed at random, performance on individual commonsense tasks can fluctuate by over 20 percentage points across random seeds, and removing more parameters can paradoxically yield better scores\.

We compress Qwen1\.5\-MoE\-A2\.7B at 50% expert removal using two different random seeds and evaluate on eight commonsense reasoning tasks\. No calibration data or importance metric is used; experts are selected uniformly at random\. For comparison, we also include Fisher\-based attention head compression at 25% and 50% drop ratios\. Table [17](https://arxiv.org/html/2606.05538#A9.T17) reports the results alongside the uncompressed base model\.

Table 17: Commonsense reasoning performance under random 50% expert removal \(two seeds\) and Fisher\-based attention head compression on Qwen1\.5\-MoE\-A2\.7B\. The base model \(no compression\) is shown for reference\.

Three phenomena are noteworthy:

#### Extreme inter\-seed variance\.

The average commonsense accuracy differs by 18\.4 percentage points between the two random seeds \(9\.1% vs\. 27\.5%\)\. On individual tasks, the variance is even larger: BoolQ fluctuates by 47\.0 points \(12\.0% vs\. 59\.0%\) and PIQA by 33\.6 points \(8\.6% vs\. 42\.2%\)\. This level of variance means that any single\-run comparison between compression methods on these benchmarks is essentially uninformative—the difference between “method A beats method B” and the reverse can be determined entirely by which random subset of experts happens to be removed\.

#### More compression can paradoxically improve scores\.

Consider the attention head results: removing 50% of heads with Fisher\-based selection achieves *higher* average commonsense accuracy \(23\.1%\) than removing only 25% of heads \(18\.7%\)\. On 6 out of 8 individual tasks, namely ARC\-C \(16\.0 vs\. 11\.5\), HellaSwag \(18\.6 vs\. 3\.1\), OBQA \(21\.4 vs\. 17\.0\), PIQA \(39\.0 vs\. 34\.8\), SIQA \(26\.4 vs\. 8\.4\), and WinoGrande \(32\.8 vs\. 22\.2\), the more aggressively compressed model outperforms the less compressed one\. This is nonsensical: removing more parameters should not improve the model\. The explanation is that commonsense benchmarks have high random baselines \(BoolQ: 50%, PIQA: 50% for binary/two\-option tasks\) and at high compression ratios the scores are dominated by noise rather than genuine model capability\.

#### Random removal can beat Fisher\-guided compression\.

Random seed 42 achieves higher commonsense accuracy \(27\.5%\) than Fisher\-based 25% attention head compression \(18\.7%\), despite using no importance metric whatsoever and removing twice as many parameters\. On BoolQ specifically, random removal scores 59\.0%—above the 50% random baseline and far above the Fisher\-guided 39\.0%\. This further confirms that commonsense benchmarks cannot reliably distinguish between compression strategies at moderate\-to\-high compression ratios\.

These observations motivate our use of more challenging benchmarks \(GSM8K, HumanEval, MMLU, MATH, BBH, CEval, CMMLU, MBPP\) in the main paper, which have much lower random baselines and require genuine multi\-step reasoning or generation\.

## Appendix J Failure\-Mode Dissection of GSM8K Responses at 0\.001% Critical Intermediate Dimension Removal

Section [2\.2](https://arxiv.org/html/2606.05538#S2.SS2) reports that masking the top $\sim$12 most Fisher\-important intermediate dimensions \(0\.001% of the 1\.35M MoE FFN intermediate dimensions in Qwen1\.5\-MoE\-A2\.7B\) collapses GSM8K accuracy from 35\.9% to 0\.8%\. To understand *how* the model fails rather than merely *that* it fails, we manually inspect all 1,319 GSM8K test outputs and classify each into one of eight categories\. We compare the resulting distribution to the base model, where accuracy is 35\.9% and the model still produces coherent multi\-step reasoning\. Decoding uses temperature $0.1$ and max\_tokens $=500$ in both settings\.

#### Categories\.

We define eight mutually exclusive categories spanning the observed output behaviors:

- • Real reasoning: a multi\-step chain\-of\-thought with explicit arithmetic and connectives that reaches a numeric conclusion\.
- • Partial reasoning: reasoning tokens \(“total”, “step”, “therefore”\) appear but the chain is incomplete or incoherent\.
- • Exact echo: a verbatim copy of the input question\.
- • Truncated echo: a partial copy of the question that loops or terminates mid\-sentence\.
- • Garbled echo: the question text with name substitutions, dropped clauses, or reordered phrases \(a semantics\-free transformation\)\.
- • Numeric only: a bare digit, almost always “1”\.
- • Empty output: no tokens generated\.
- • Other: short fragments, paraphrases, or otherwise garbled text not covered above\.
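
The classification in this appendix is done by manual inspection. Purely as an illustration, the categories with a mechanical definition could be pre-sorted automatically before a human pass. The sketch below is hypothetical and is not the authors' procedure; the function name, the `min_prefix_chars` threshold, and the fallback label are all assumptions.

```python
def coarse_category(question: str, output: str, min_prefix_chars: int = 20) -> str:
    """Heuristic pre-filter for the mechanically definable categories.

    Anything it cannot decide is returned as "needs manual review".
    """
    q = " ".join(question.split())  # normalize whitespace
    o = " ".join(output.split())
    if not o:
        return "empty output"
    if o.replace(",", "").replace(".", "", 1).isdigit():
        return "numeric only"
    if o == q:
        return "exact echo"
    # Only catches copies that stop mid-question; looping copies are not detected.
    if len(o) >= min_prefix_chars and q.startswith(o):
        return "truncated echo"
    return "needs manual review"
```

Reasoning quality and garbled echoes depend on semantics, so this sketch deliberately leaves them for manual review.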

#### Distribution\.

Table [18](https://arxiv.org/html/2606.05538#A10.T18) reports the category breakdown at top\-12 removal alongside the base model\. Three patterns are noteworthy\.

Table 18: Distribution of GSM8K output categories \(1,319 test items\) under base model and top\-12 \($\sim$0\.001%\) reverse intermediate dimension removal on Qwen1\.5\-MoE\-A2\.7B\. Values are counts \(percentage\)\. Real reasoning collapses by $14\times$ while echo behaviors and empty outputs emerge\.

\(1\) Reasoning collapses by an order of magnitude\. Real reasoning drops from 60\.0% to 4\.2% \(a $14\times$ reduction\) and partial reasoning drops from 7\.6% to 2\.1%\. The 10 “correct” answers at top\-12 removal are coincidental matches between numbers appearing in the echoed question text and the gold answer; none reflect residual computation\. This reduction in reasoning capacity, rather than the raw accuracy drop, is the more direct signal that the removed dimensions support multi\-step generation\.

\(2\) Echo behaviors emerge as the dominant failure mode\. Exact, truncated, and garbled echoes together account for 31\.1% of outputs at top\-12 removal but only 4\.1% at top\-1\. The model defaults to copying or lightly transforming the input rather than generating a response\. Garbled echoes typically substitute names while preserving sentence structure \(e\.g\., “Gissela” $\to$ “Gordy”\), suggesting that surface\-level next\-token prediction survives while task\-conditioned generation does not\.

\(3\) Empty outputs emerge\. At top\-12 removal, 99 outputs \(7\.5%\) contain no tokens, a behavior absent at top\-1\. Removing the critical dimensions occasionally pushes the next\-token distribution into an immediate end\-of\-sequence state at the very first decoding step\.

#### Representative examples\.

The qualitative gap between the two settings is best illustrated by paired outputs\.

Base model \(real reasoning, correct\):

Q: Edgar eats 18 pretzels a day\. If his brother eats 1/2 as many, how many does his brother eat in a week? Output: *Edgar’s brother eats 1/2 as many pretzels as Edgar, which means he eats $18/2=9$ pretzels a day\. To find out how many pretzels his brother eats in a week, we multiply … $9\times 7=63$ pretzels in a week\. Therefore, his brother eats 63 pretzels in a week\.* Pred: 63; Gold: 63 \(correct\)

top\-12 removal \(truncated echo\):

Q: The farm has 30 cows and the zoo has 20 sheep\. The zoo has twice as many cows as the farm does… Output: *The farm has 30 cows and the zoo has 20 sheep\. The zoo has twice as many cows as the farm does…* \(verbatim until truncation\) Pred: 20; Gold: 120

top\-12 removal \(garbled echo\):

Q: Gissela, Gordy, and Gary are truck drivers\. Gissela has a truck large enough to haul 4,000 pounds… Output: *Gordy, Gordy, and Gary are truck drivers\. Gissela has a truck large enough to haul 4,000 pounds…* Pred: 800; Gold: 2,800

#### Interpretation\.

The shift between base model and top\-12 is qualitative, not just quantitative: removing $\sim$12 additional intermediate dimensions does not gradually degrade reasoning but eliminates the model’s ability to transition from input\-conditioned copying to forward generation\. The combined 86% share of echo and “other” fragment outputs indicates that the critical dimensions identified by Fisher importance participate in the computation mapping question representations to multi\-step reasoning trajectories, rather than in the surface\-level token\-prediction circuit that produces fluent text\. This view is consistent with the cross\-domain dissociation in Table [1](https://arxiv.org/html/2606.05538#S2.T1): multiple\-choice knowledge tasks, which rely on input\-conditioned token prediction, retain 90–98% of base accuracy, while generation\-heavy tasks, which require multi\-step trajectories, collapse to 3–13%\.

## Appendix K Mechanisms of Compression\-Induced Improvement: A Unified Fisher\-Prior Account

We observe that intermediate dimension compression yields modest accuracy gains on several benchmarks \(MultiArith, CMMLU, CEval\) despite removing 50% of routed\-expert FFN parameters\. We attribute these gains to a single underlying mechanism: empirical Fisher importance systematically removes high\-prior, low\-evidence dimensions while preserving reasoning\-supporting dimensions\. We further show that the magnitude of improvement on each benchmark is governed by how much *shortcut headroom* the base model’s failure modes leave behind\. We examine this through output category distributions across benchmarks\.

### K\.1 Shortcut Suppression on Generation Tasks

On unconstrained reasoning tasks, the base model emits template shortcuts and question paraphrases for 20\.2% of MultiArith outputs \(Short direct answer 15\.5% \+ Garbled echo 4\.7%\), against 71\.5% Real reasoning\. We adopt eight output categories \(*real reasoning*: explicit multi\-step chain\-of\-thought reaching a numeric conclusion; *partial reasoning*: reasoning tokens present but incoherent or incomplete; *short direct answer*, *garbled echo*, etc\.\)\. Intermediate dimension compression shifts this distribution: *Short direct answer* and *Garbled echo* drop 20\.2pp to nearly zero, while *Real reasoning* rises to 98\.0%–98\.2%\. The accompanying accuracy gain \(\+13\.5% to \+14\.7% from Fisher\-IntDim\-L to Fisher\-IntDim\-G\) shows that our pruning further disrupts these residual shortcuts even on top of an already\-competent base\.

This distributional shift is quantitatively reflected in generation lengths: base outputs are bimodal \(median 44 tokens, SD 106\.6\), with a peak of one\-line shortcut answers plus a long tail of garbled echoes capped at the 500\-token generation limit \($p_{95}=499$\), while Fisher\-IntDim\-G outputs are unimodal \(median 54, SD 59\.0, $p_{95}=96$\), eliminating both extremes\. On problem \#257 \(“Will had \$83, spent \$47; how many \$4 toys can he buy?”\), the base model emits only “10” with no derivation; Fisher\-IntDim\-G writes three explicit steps \($83-47=36$; $36/4=9$\) and answers correctly\. We attribute this to a *Fisher–prior asymmetry*: shortcut behavior reflects high\-prior, low\-evidence outputs whose gradients w\.r\.t\. expert FFN parameters are small, while multi\-step reasoning requires precise intermediate\-state propagation and produces large gradients\. Empirical Fisher therefore systematically retains reasoning\-supporting dimensions and discards prior\-driven shortcut dimensions, exposing latent reasoning capacity that the base model fails to invoke\.
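
The length statistics quoted above (median, SD, and 95th percentile) can be computed per model from the list of generated-token counts. The following is a minimal sketch with made-up lengths; the nearest-rank percentile and the sample standard deviation are assumptions, since the text does not say which estimators were used.

```python
import statistics

def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(values)
    rank = -(-pct * len(ordered) // 100)  # ceil(pct * n / 100) in integer math
    return ordered[rank - 1]

def length_summary(token_lengths):
    """Median, sample SD, and p95 of generation lengths (in tokens)."""
    return {
        "median": statistics.median(token_lengths),
        "sd": statistics.stdev(token_lengths),
        "p95": nearest_rank_percentile(token_lengths, 95),
    }

# Hypothetical bimodal lengths: very short answers plus outputs near a 500-token cap.
toy_lengths = [3, 4, 5, 44, 50, 60, 499, 500]
summary = length_summary(toy_lengths)
```

A large gap between the median and p95, as in the toy data, is the kind of signature the text describes for the bimodal base-model outputs.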

Table 19: Detailed Prediction Analysis on MultiArith

### K\.2 Format Compliance on Multiple\-Choice Tasks

The same Fisher–prior mechanism produces a second manifestation on multiple\-choice benchmarks, where the base model’s high\-prior failure mode is to echo the option text instead of emitting a single answer letter\. On CMMLU, 23\.6% of base outputs are option\-text echoes; on CEval, 25\.6%\. Both are high\-prior, low\-evidence emissions: the model copies salient input spans rather than committing to a letter\. Fisher\-IntDim suppresses these echoes \(CMMLU: 23\.6% $\rightarrow$ 5\.9%, $-17.7$pp; CEval: 25\.6% $\rightarrow$ 10\.5%, $-15.1$pp\) and shifts probability mass to the single\-letter format, yielding accuracy gains of \+1\.3–4\.2pp on CMMLU and \+2\.0pp on CEval\. The smaller magnitude relative to MultiArith reflects the smaller fraction of recoverable mass: for option\-text echoes, the underlying answer choice is often already wrong, so format correction alone cannot rescue accuracy\.

Table 20: Detailed Prediction Analysis on CMMLU

Table 21: Detailed Prediction Analysis on CEval

### K\.3 Partial\-Reasoning Completion on Long\-CoT Benchmarks

The shortcut\-suppression account above applies to base models that frequently default to terse or template\-style outputs\. A natural question is whether the same Fisher–prior mechanism produces analogous effects on frontier\-scale reasoning models, where the dominant failure mode is not shortcut emission but *incomplete chain\-of\-thought*: the model initiates a multi\-step reasoning trace but fails to reach a coherent numeric conclusion\.

We examine this through output category distributions on AIME 2025, AIME 2026, and Olympiad Bench for Qwen3\-30B\-A3B and Qwen3\.5\-35B\-A3B, evaluated with avg@8 sampling \(240 samples per benchmark, 30 problems $\times$ 8 draws\)\.

#### Dominant failure mode shifts from shortcut to partial reasoning\.

Unlike the small base models in §[K\.1](https://arxiv.org/html/2606.05538#A11.SS1), neither Qwen3\-30B\-A3B nor Qwen3\.5\-35B\-A3B produces echo, numeric\-only, or empty outputs at any measurable rate\. The sole failure mode is *partial reasoning*: chains that contain valid reasoning tokens but are incomplete or internally incoherent\. This accounts for 24\.6% and 35\.4% of Qwen3\-30B\-A3B outputs on AIME 2025 and AIME 2026 respectively, and 7\.9% and 5\.4% for the stronger Qwen3\.5\-35B\-A3B\.

#### Compression converts partial reasoning into real reasoning\.

Tables [22](https://arxiv.org/html/2606.05538#A11.T22)–[26](https://arxiv.org/html/2606.05538#A11.T26) show that Fisher\-IntDim\-G shifts partial reasoning to real reasoning across all five settings with no new failure modes introduced\. The accuracy gains are consistent with the recoverable\-mass account: Qwen3\-30B\-A3B, which starts from a higher partial\-reasoning rate, benefits more \(\+26\.7pp on AIME 2025, \+6\.7pp on AIME 2026\) than Qwen3\.5\-35B\-A3B \(\+10\.0pp and \+6\.7pp\), whose partial\-reasoning rate is already low\.

Table 22: Distribution of AIME 2025 output categories under base Qwen3\-30B\-A3B and Fisher\-IntDim\-G \(50% routed\-FFN compression\)\. 240 samples \(30 problems $\times$ avg@8\)\. Pruned model produces *deeper* multi\-step CoT than the base\.

Table 23: Distribution of AIME 2026 output categories under base Qwen3\-30B\-A3B and Fisher\-IntDim\-G\. 240 samples \(30 problems $\times$ avg@8\)\.

Table 24: Distribution of AIME 2025 output categories under base Qwen3\.5\-35B\-A3B and Fisher\-IntDim\-G\. 240 samples \(30 problems $\times$ avg@8\)\.

Table 25: Distribution of AIME 2026 output categories under base Qwen3\.5\-35B\-A3B and Fisher\-IntDim\-G\. 240 samples \(30 problems $\times$ avg@8\)\.

Table 26: Distribution of Olympiad Bench output categories under base Qwen3\.5\-35B\-A3B and Fisher\-IntDim\-G\. 674 samples \(single\-sample per problem\)\.

#### Mechanism: Fisher importance targets incomplete\-chain dimensions\.

We interpret this under the same Fisher–prior account\. In a long\-CoT reasoning model, *partial reasoning* represents a regime where intermediate\-state propagation partially succeeds \(the model initiates a plausible chain\) but fails to sustain the precise token\-to\-token dependencies required to close it\. Dimensions that support only the initialization of a reasoning chain without contributing to its completion receive lower Fisher scores: their gradient signal across calibration examples reflects high\-prior behavior \(starting a chain is likely regardless of the specific problem\) rather than evidence\-driven, problem\-conditioned computation \(completing it\)\. Removing these dimensions selectively suppresses the partial\-chain attractor, exposing the model’s latent capacity to produce complete derivations\.

#### Contrast with the small\-model setting\.

The long\-CoT case and the small\-model case share the same underlying mechanism \(Fisher\-MoE removes high\-prior, low\-evidence dimensions\) but differ in which prior is suppressed\. In small base models the prior is a surface\-level output template \(echo, one\-line answer\)\. In frontier reasoning models the prior is an *incomplete chain initialization*: the model defaults to beginning a plausible\-looking reasoning trace without the problem\-specific precision to complete it\. In both cases the recoverable accuracy gain is determined by how large a fraction of outputs are trapped in the high\-prior attractor, and compression releases exactly that fraction\. The clean absence of echo or empty\-output failures in Tables [22](https://arxiv.org/html/2606.05538#A11.T22)–[26](https://arxiv.org/html/2606.05538#A11.T26) further confirms that the removed dimensions are genuinely redundant: unlike critical\-dimension removal, pruning the bottom 50% by Fisher score introduces no new failure modes in either model family\.

### K\.4 Summary: One Mechanism, Three Manifestations

Across all benchmarks examined, intermediate\-dimension Fisher pruning suppresses the same class of behavior: high\-prior, low\-evidence outputs that the model produces when it defaults to surface\-level continuation rather than task\-conditioned generation\. The mechanism manifests differently depending on the model family and benchmark format, but the underlying logic is identical in each case\.

- • Unconstrained arithmetic \(MultiArith\)\. The base model’s dominant failure mode is one\-line shortcut answers and garbled echoes\. Compression removes the dimensions that support these high\-prior, low\-computation outputs, shifting the majority of shortcut outputs to complete multi\-step derivations\.
- • Multiple\-choice knowledge \(CMMLU/CEval\)\. The base model echoes full option text instead of committing to a single answer letter in roughly one quarter of outputs\. Compression substantially reduces this echo rate, but the accuracy gain is smallest here because format correction alone rarely rescues semantically wrong answers\.
- • Long\-CoT math reasoning \(AIME/Olympiad\)\. In frontier\-scale reasoning models, echo and shortcut behaviors are absent entirely\. The sole failure mode is partial reasoning: chains that initiate plausibly but fail to close\. Compression converts partial chains to complete derivations, with gain magnitude proportional to the base model’s partial\-reasoning rate: the model with a higher baseline partial\-reasoning rate benefits more, consistent with the recoverable\-mass account\.

The magnitude of accuracy gain in each case is governed by the *recoverable mass*: how large a fraction of outputs are trapped in the high\-prior attractor, and how directly format or chain\-completion correction translates to accuracy\. Table [27](https://arxiv.org/html/2606.05538#A11.T27) summarizes the three manifestations\.

Table 27: Summary of the Fisher–prior mechanism across benchmark types\.

We emphasize that none of these gains reflect added capability\. They reflect removal of a generation\-time prior that suppresses latent circuits already present in the model\. The long\-CoT results further strengthen this interpretation: in Tables [22](https://arxiv.org/html/2606.05538#A11.T22)–[26](https://arxiv.org/html/2606.05538#A11.T26), compression introduces zero new failure modes while monotonically shifting outputs toward complete reasoning, confirming that the removed dimensions are genuinely redundant rather than load\-bearing\. On benchmarks limited by raw reasoning capacity rather than shortcut headroom \(MATH, MMLU, BBH\), Fisher\-IntDim incurs the expected modest capacity cost \(Table [3](https://arxiv.org/html/2606.05538#S4.T3)\)\.

## Appendix L Dense LLM Pruning Baseline Analysis

Table 28: Comparison of Wanda \(Sun et al\., [2024b](https://arxiv.org/html/2606.05538#bib.bib77)\) and SparseGPT \(Frantar and Alistarh, [2023](https://arxiv.org/html/2606.05538#bib.bib35)\) with our method on Qwen1\.5\-MoE\-A2\.7B at 50% sparsity \(zero\-shot, T=0\)\.

Wanda \(Sun et al\., [2024b](https://arxiv.org/html/2606.05538#bib.bib77)\) and SparseGPT \(Frantar and Alistarh, [2023](https://arxiv.org/html/2606.05538#bib.bib35)\) are post\-training weight compression methods originally designed for dense decoder\-only LLMs: Wanda prunes weights by the product of magnitude and input activation norm without weight updates, while SparseGPT formulates compression as a layer\-wise sparse regression with second\-order information from an approximated Hessian\. Both target unstructured or N:M semi\-structured patterns \(e\.g\., 2:4\) over FFN and attention weight matrices, and are unaware of MoE\-specific structural properties such as routing dynamics or expert grouping; when applied to MoE, they degenerate into per\-row compression of individual expert FFN matrices, ignoring inter\-expert redundancy\. Consequently, Lu et al\. \([2024b](https://arxiv.org/html/2606.05538#bib.bib78)\) report that directly applying Wanda 2:4 to Mixtral 8x7B causes substantial drops on general LM Harness benchmarks and a near\-collapse on math reasoning, suggesting that weight\-level semi\-structured sparsity inadequately preserves task\-specific expert specialization\. In addition, Wanda and SparseGPT rely on sparse GEMM, which typically provides limited speedup for MoE models\. In contrast, Fisher\-MoE preserves dense GEMM execution and achieves more consistent and substantial speedups, as shown in §[3\.3](https://arxiv.org/html/2606.05538#S3.SS3)\.

Our results on Qwen1\.5\-MoE\-A2\.7B \(Table [28](https://arxiv.org/html/2606.05538#A12.T28)\) exhibit the same trend: both Wanda and SparseGPT retain moderate average performance but suffer pronounced degradation on math and code generation, while Fisher\-IntDim\-G \(ours\) achieves comparable averages with stronger preservation on knowledge\-heavy and code benchmarks\. This complementarity reflects a fundamental design difference: weight\-level 2:4 sparsity preserves all experts while sparsifying their internal weights, whereas our method exploits MoE\-aware intermediate dimension structure to selectively retain task\-critical intermediate dimensions across experts\.

## Appendix M Model Size and Whole\-Model Compression

In this section, we report the total parameter counts and disk footprints of several MoE backbones before and after applying intermediate dimension compression at a target MoE compression ratio of $p=50\%$\.

It is important to distinguish between the *MoE compression ratio* $p$ and the resulting *whole\-model compression rate* $p_{\text{model}}$\. While $p$ refers to the fraction of parameters removed within the expert FFN modules, $p_{\text{model}}$ measures the reduction in total model parameters\. The latter is consistently smaller because several components are not compressed, including attention layers, the shared expert, embeddings, routing networks, and layer normalization parameters\.
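
The relation between the two ratios follows from the definitions: if only routed-expert FFN parameters are removed, the whole-model rate is the MoE ratio scaled by the share of routed-expert FFN parameters in the full model. The sketch below makes this explicit; the function name and the parameter counts are hypothetical and are not the values in Table 29.

```python
def whole_model_rate(p: float, routed_ffn_params: float, total_params: float) -> float:
    """Fraction of all parameters removed when a fraction p of the
    routed-expert FFN parameters is removed and nothing else is touched."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if not 0 < routed_ffn_params <= total_params:
        raise ValueError("routed_ffn_params must be positive and at most total_params")
    return p * routed_ffn_params / total_params

# Hypothetical backbone in which 9 of every 10 parameters sit in routed-expert FFNs.
rate = whole_model_rate(0.5, routed_ffn_params=9.0, total_params=10.0)
```

Under this assumption the whole-model rate can never exceed the MoE ratio, and it approaches that ratio only as routed experts dominate the parameter count.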

As shown in Table [29](https://arxiv.org/html/2606.05538#A13.T29), intermediate dimension compression achieves substantial reductions in total model size across all backbones, with whole\-model compression rates ranging from approximately $43\%$ to $48\%$\. Notably, these reductions translate directly into proportional decreases in disk storage requirements, making the compressed models significantly more efficient for deployment without modifying the model architecture or routing structure\.

Table 29: Parameters and disk size for several models before and after compression at MoE compression ratio $p=50\%$\. The *whole\-model compression rate* $p_{\text{model}}$ is the percentage reduction in total parameter count; it is smaller than $p$ and varies across backbones because attention, the shared expert, embeddings, the router, and layer norms are not compressed\.

## Appendix N Divergence of Expert Selection Across Importance Metrics

To better understand how different importance metrics affect expert\-level pruning decisions, we analyze the overlap between the sets of retained experts selected by three methods: Fisher\-based importance \(A\), activation\-based importance \(B\), and score\-based importance \(C\)\.

Let $\mathcal{E}^{(m)}_{\ell}$ denote the set of retained experts at layer $\ell$ under method $m$\. We quantify similarity using two complementary measures: \(i\) the Jaccard similarity over the union of retained experts across all layers, and \(ii\) the average per\-layer overlap, defined as the number of shared experts per layer\.
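
Both measures can be sketched in a few lines. The code below assumes that the Jaccard similarity pools (layer, expert) pairs across layers, which is one reading of "over the union of retained experts across all layers"; the function names and the toy retained sets are illustrative only.

```python
def jaccard_over_layers(retained_a, retained_b):
    """Jaccard similarity over (layer, expert) pairs pooled across layers.

    retained_a / retained_b: one collection of retained expert ids per layer.
    """
    pairs_a = {(layer, e) for layer, experts in enumerate(retained_a) for e in experts}
    pairs_b = {(layer, e) for layer, experts in enumerate(retained_b) for e in experts}
    union = pairs_a | pairs_b
    return len(pairs_a & pairs_b) / len(union) if union else 1.0

def mean_layer_overlap(retained_a, retained_b):
    """Average number of experts retained by both methods in the same layer."""
    shared = [len(set(a) & set(b)) for a, b in zip(retained_a, retained_b)]
    return sum(shared) / len(shared)

# Toy example: 2 layers, each method retains 2 experts per layer (made-up ids).
method_a = [{0, 1}, {2, 3}]
method_b = [{1, 2}, {2, 3}]
jaccard = jaccard_over_layers(method_a, method_b)
per_layer = mean_layer_overlap(method_a, method_b)
```

Pooling by (layer, expert) pair keeps expert 2 in layer 0 distinct from expert 2 in layer 1, which matters because expert indices are only meaningful within a layer.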

Table [30](https://arxiv.org/html/2606.05538#A14.T30) summarizes the results\.

Fisher selects substantially different experts\. The overlap between Fisher\-based pruning and both activation\- and score\-based methods is low, with Jaccard similarities of only $28.5\%$ and $29.1\%$, respectively\. At the layer level, Fisher shares on average only $\sim 13$ out of 30 experts with these methods\. In contrast, activation\- and score\-based pruning exhibit much higher agreement \(Jaccard $54.9\%$, $\sim 21.5/30$ overlap\), indicating that these heuristic metrics tend to select similar experts\.

Limited consensus across all methods\. The intersection across all three methods contains only $8.8$ experts per layer on average, representing less than one\-third of the expert pool\. This further highlights the diversity of expert importance signals captured by different metrics\.

Implications for compression\. These results suggest that Fisher\-based importance captures a fundamentally different notion of expert utility compared to activation\- or score\-based heuristics\. In particular, the low overlap indicates that Fisher is not merely a refinement of existing metrics, but rather identifies distinct experts that may be critical for preserving performance\. This observation provides insight into why Fisher\-guided compression outperforms prior expert\-level pruning approaches\.

Table 30: Jaccard similarity and average per\-layer overlap of removed expert sets across different pruning methods\.

## Appendix O Comparison Against the Dense Baseline and Uncompressed MoE

To place Fisher\-MoE in the broader context of dense\-vs\-sparse model trade\-offs, we compare three models of comparable total parameter budget: \(i\) Qwen1\.5\-7B, a dense transformer; \(ii\) Qwen1\.5\-MoE\-A2\.7B, the uncompressed sparse MoE that serves as our base model; and \(iii\) Fisher\-MoE, our compressed model derived from Qwen1\.5\-MoE\-A2\.7B by removing 50% of every routed expert’s FFN intermediate dimensions using Fisher\-IntDim\-E with 128 GSM8K calibration samples \(moe\_intermediate\_size: $1408\to 704$\)\. We report task accuracy, activated/total parameters, and standalone throughput on a single NVIDIA A100\-80G following the same setting as the Qwen official blog post \(Qwen Team, [2024](https://arxiv.org/html/2606.05538#bib.bib66)\)\.

#### Setup\.

All accuracies use vLLM v0\.8\.4 in bfloat16 with max\_model\_len $=4096$, temperature $=0.1$, and seed $=1234$\. MBPP uses \-\-stop\_sequences "\[DONE\], END"\. The throughput benchmark fixes input length to 1,000 tokens and output length to 1,000 tokens on a single A100\-80G\.

#### Parameter and activation budget\.

Table [31](https://arxiv.org/html/2606.05538#A15.T31) compares parameter counts\. While Fisher\-MoE and Qwen1\.5\-7B occupy comparable total budgets \(8\.1 B vs\. 7\.7 B\), Fisher\-MoE activates only 2\.27 B parameters per token, about $29\%$ of the dense activation cost, because at most $K{=}4$ of $N{=}60$ routed experts fire per token after intermediate dimension compression\.

Table 31: Parameter counts for the dense baseline and the two MoE models\. Activated parameters are non\-expert parameters plus $K{=}4$ active routed experts per layer\. Counts are extracted directly from safetensors headers\.

#### Task accuracy\.

Table [32](https://arxiv.org/html/2606.05538#A15.T32) reports eight\-task accuracy\. Fisher\-MoE retains the math performance of the dense baseline: it slightly *exceeds* the dense Qwen1\.5\-7B on GSM8K \(35\.0 vs\. 34\.1\) and matches it on MATH \(8\.0 vs\. 8\.2\), while losing modest ground on knowledge\-heavy and coding tasks \(MMLU $-8.2$, HumanEval $-12.3$\)\. The average drop from dense is roughly $5$ points despite activating $3.4\times$ fewer parameters per token\.

Table 32: Eight\-task accuracy \(%\) for the dense baseline, the uncompressed Qwen1\.5\-MoE\-A2\.7B, and Fisher\-MoE compressed at the 50% MoE compression ratio with in\-domain calibration\. Higher is better\. Same vLLM decoding settings as the main experiments\. Qwen1\.5\-MoE\-A2\.7B numbers are reproduced from the base row of Table [3](https://arxiv.org/html/2606.05538#S4.T3)\.

#### Throughput and tokens\-per\-second\.

Table [33](https://arxiv.org/html/2606.05538#A15.T33) reports vLLM throughput on a single A100\-80G with $1{,}000$ input and $1{,}000$ output tokens\. Qwen1\.5\-MoE\-A2\.7B already runs $\sim 1.74\times$ faster than the dense Qwen1\.5\-7B because of its sparse activation pattern and shared expert\. Fisher\-MoE pushes this further: by halving each routed expert’s intermediate dimension, the per\-token expert FLOPs drop accordingly, yielding $2.10\times$ the throughput of the dense baseline and $1.21\times$ the throughput of the uncompressed MoE\.
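
The ratios quoted in this appendix are mutually consistent, which can be checked with simple arithmetic. The sketch below uses only numbers stated in the text; the variable names are illustrative.

```python
# Speedups relative to the dense Qwen1.5-7B, as quoted in the text.
moe_vs_dense = 1.74         # uncompressed Qwen1.5-MoE-A2.7B
fisher_vs_dense = 2.10      # Fisher-MoE

# Implied speedup of Fisher-MoE over the uncompressed MoE.
fisher_vs_moe = fisher_vs_dense / moe_vs_dense

# Activated-parameter comparison, in billions of parameters.
dense_activated = 7.7       # a dense model activates all of its parameters
fisher_activated = 2.27
activation_share = fisher_activated / dense_activated     # about 0.29
activation_reduction = dense_activated / fisher_activated  # about 3.4x

print(f"{fisher_vs_moe:.2f}x over MoE, {activation_share:.0%} of dense activation, "
      f"{activation_reduction:.1f}x fewer activated parameters")
```

Dividing the two dense-relative speedups recovers the reported gain over the uncompressed MoE to within rounding.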

Table 33: Standalone vLLM throughput on a single NVIDIA A100\-80G with $1{,}000$ input tokens and $1{,}000$ output tokens\. Higher is better\. “Speedup” columns are relative to Qwen1\.5\-7B and Qwen1\.5\-MoE\-A2\.7B respectively\.

At a comparable total\-parameter budget, Fisher\-MoE delivers the strongest inference profile of the three: $2.10\times$ the dense baseline throughput, $29\%$ of its activated\-parameter cost, and accuracy competitive with the dense model on the math reasoning tasks \(GSM8K, MATH\)\. Compared to the uncompressed Qwen1\.5\-MoE\-A2\.7B, removing $50\%$ of routed\-expert intermediate dimensions buys an additional $1.21\times$ throughput on top of the MoE baseline’s already substantial inference advantage, demonstrating that intermediate\-dimension compression is complementary to, rather than a dilution of, the inference benefits of sparse architectures\.

## Appendix P Theoretical Comparison: Empirical Fisher, Diagonal Hessian, and First\-Order Pruning

A reasonable concern about our use of the empirical Fisher as the importance metric is how “Fisher importance” relates to Hessian\-based and first\-order attribution criteria\. This appendix consolidates the relationships and reports the controlled comparison that supports this hedged framing\.

### P\.1 What the Empirical Fisher Computes

With $\mathcal{L}(x,y)=-\log p_{\theta}(y\mid x)$, the empirical Fisher \(Eq\. [4](https://arxiv.org/html/2606.05538#S2.E4)\) is the diagonal of $\frac{1}{N}\sum_{(x,y)\in\mathcal{D}}\nabla_{\theta}\mathcal{L}(x,y)\nabla_{\theta}\mathcal{L}(x,y)^{\top}$\. Two facts are directly relevant\.

#### \(i\) Empirical Fisher\.

For a per\-parameter score, the empirical Fisher reduces to $s^{\text{Fisher}}(\theta_{i})=\tfrac{1}{N}\sum_{(x,y)}(\partial\mathcal{L}/\partial\theta_{i})^{2}$\. This is the Taylor\-expansion saliency: at a parameter we expect $\Delta\mathcal{L}\approx|\partial\mathcal{L}/\partial\theta_{i}|\cdot|\delta_{i}|$, so the empirical Fisher is the variance of this first\-order increment over the calibration distribution\. In our setting the two metrics coincide up to a constant after we group parameters into intermediate dimension units \(Eq\. [5](https://arxiv.org/html/2606.05538#S2.E5)\), and the practical advantage we report over magnitude/activation/router heuristics is therefore equally a statement about first\-order pruning under the same intermediate dimension grouping\.
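
The per-parameter score above can be aggregated into one score per intermediate dimension. The sketch below is a pure-Python illustration under a stated assumption: a dimension's score is the sum of the per-parameter empirical Fisher over the weights attached to it (row j of the gate and up projections, column j of the down projection). The exact grouping of Eq. 5 is not reproduced here, and the function names and toy gradients are hypothetical.

```python
def fisher_intdim_scores(per_sample_grads):
    """Empirical Fisher score per intermediate dimension.

    per_sample_grads: list of dicts, one per calibration sample, with keys
    'gate' and 'up' (shape [n_inter][d_model]) and 'down' ([d_model][n_inter]).
    """
    n_samples = len(per_sample_grads)
    n_inter = len(per_sample_grads[0]["gate"])
    scores = [0.0] * n_inter
    for g in per_sample_grads:
        for j in range(n_inter):
            sq = sum(v * v for v in g["gate"][j])           # row j of gate proj
            sq += sum(v * v for v in g["up"][j])            # row j of up proj
            sq += sum(row[j] * row[j] for row in g["down"]) # column j of down proj
            scores[j] += sq / n_samples                     # mean over samples
    return scores

def dims_to_keep(scores, drop_ratio):
    """Indices of the highest-scoring dimensions after dropping drop_ratio."""
    n_keep = len(scores) - int(len(scores) * drop_ratio)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:n_keep])

# Toy gradients for one expert with d_model = 2 and n_inter = 2 (made up).
grads = [
    {"gate": [[1.0, 0.0], [0.0, 0.0]], "up": [[2.0, 0.0], [0.0, 0.0]],
     "down": [[0.0, 0.0], [0.0, 1.0]]},
    {"gate": [[0.0, 1.0], [0.0, 0.0]], "up": [[0.0, 0.0], [0.0, 0.0]],
     "down": [[0.0, 0.0], [0.0, 0.0]]},
]
scores = fisher_intdim_scores(grads)
kept = dims_to_keep(scores, drop_ratio=0.5)
```

A practical implementation would accumulate squared gradients with a tensor library during backpropagation; nested lists are used here only to keep the sketch self-contained.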

#### \(ii\) Empirical Fisher vs\. true Fisher vs\. diagonal Hessian\.

The Fisher information matrix is $F_{\theta} = \mathbb{E}_{x}\mathbb{E}_{y\sim p_{\theta}}\left[\nabla_{\theta}\log p_{\theta}\,\nabla_{\theta}\log p_{\theta}^{\top}\right] = -\mathbb{E}_{x}\mathbb{E}_{y\sim p_{\theta}}\left[\nabla^{2}_{\theta}\log p_{\theta}(y\mid x)\right]$, so at the model's optimum on its own data distribution, the (true) Fisher and the negative Hessian of the log-likelihood coincide. The *empirical* Fisher samples $y$ from the data rather than from the model, which makes it a biased estimate of either quantity away from the optimum (Kunstner et al., [2019](https://arxiv.org/html/2606.05538#bib.bib80)); for a frozen, pretrained MoE evaluated on a calibration set the bias can be non-negligible.
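
The distinction between the true and the empirical Fisher can be checked numerically on a toy problem. The sketch below is illustrative only and is not taken from the paper: it assumes a 3-class softmax model whose logits are treated directly as the parameters, so the gradient of the negative log-likelihood is $p - e_y$ and its Hessian is $\mathrm{diag}(p) - pp^{\top}$. All function names are hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def nll_hessian(z):
    # Hessian of -log p(y | z) w.r.t. the logits; it does not depend on y.
    p = softmax(z)
    return np.diag(p) - np.outer(p, p)

def true_fisher(z):
    # Expectation of the gradient outer product with y drawn from the MODEL.
    p = softmax(z)
    k = len(p)
    fisher = np.zeros((k, k))
    for y in range(k):
        g = p - np.eye(k)[y]          # gradient of the NLL for label y
        fisher += p[y] * np.outer(g, g)
    return fisher

def empirical_fisher(z, observed_labels):
    # Average gradient outer product with y taken from the DATA.
    p = softmax(z)
    k = len(p)
    grads = np.stack([p - np.eye(k)[y] for y in observed_labels])
    return grads.T @ grads / len(observed_labels)

z = np.array([2.0, 0.0, -1.0])
labels = [2, 2, 2, 2]  # toy data that disagrees with the model's prediction

print(np.allclose(true_fisher(z), nll_hessian(z)))               # True
print(np.allclose(empirical_fisher(z, labels), nll_hessian(z)))  # False
```

In this toy the true Fisher matches the Hessian exactly, while the empirical Fisher departs from both because the observed labels are unlikely under the model, which is the off-optimum situation described above.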

### P.2 Relationship to Hessian-Based Pruning (OBD/OBS, SparseGPT/Wanda)

Optimal Brain Damage (OBD, LeCun et al. ([1989](https://arxiv.org/html/2606.05538#bib.bib68))) and Optimal Brain Surgeon (OBS, Hassibi and Stork ([1992](https://arxiv.org/html/2606.05538#bib.bib79))) score weight $\theta_i$ by $s^{\text{OBD}}_i = \tfrac{1}{2} H_{ii}\,\theta_i^{2}$, the second-order Taylor estimate of the loss increase under a single-weight perturbation. Modern weight-pruning methods such as SparseGPT (Frantar and Alistarh, [2023](https://arxiv.org/html/2606.05538#bib.bib35)) and Wanda (Sun et al., [2024b](https://arxiv.org/html/2606.05538#bib.bib77)) replace the parameter Hessian with a layer-wise input second-moment matrix, but the spirit is the same: rank parameters by curvature $\times$ magnitude. Two practical differences with our criterion:

- Curvature vs. sensitivity. OBD/OBS rank parameters by the second-order loss change after the optimal compensating update. Empirical Fisher ranks parameters by the first-order loss change with no compensation. The two coincide at an optimum and on the data distribution the model was trained on, but disagree off-optimum and under domain mismatch.
- Aggregation unit. OBD/OBS and SparseGPT/Wanda are derived for unstructured pruning of single weights or layer-input groupings. Our contribution (Eq. [5](https://arxiv.org/html/2606.05538#S2.E5)) groups parameters tied to a single FFN intermediate dimension across the gate/up/down slabs of a routed expert. This grouping is what enables structurally smaller MoE inference (§[3](https://arxiv.org/html/2606.05538#S3)); it is orthogonal to the choice between Fisher- and Hessian-based scoring rules and could be paired with either.
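
The aggregation unit described in the second bullet can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a SwiGLU-style expert with gate/up matrices of shape `(d_ff, d_model)` and a down matrix of shape `(d_model, d_ff)`, and it assumes the group score is a plain sum of per-parameter scores (the exact form of Eq. 5 is not reproduced here). The function names and the random stand-in scores are hypothetical.

```python
import numpy as np

def intermediate_dim_scores(fisher_gate, fisher_up, fisher_down):
    # Row j of gate/up and column j of down all belong to intermediate dim j.
    return fisher_gate.sum(axis=1) + fisher_up.sum(axis=1) + fisher_down.sum(axis=0)

def trim_expert(w_gate, w_up, w_down, scores, keep_ratio):
    # Keep the highest-scoring dimensions and slice all three slabs consistently.
    d_ff = scores.shape[0]
    n_keep = max(1, int(round(keep_ratio * d_ff)))
    keep = np.sort(np.argsort(scores)[-n_keep:])
    return w_gate[keep], w_up[keep], w_down[:, keep], keep

def swiglu_expert(x, w_gate, w_up, w_down):
    # down( SiLU(gate(x)) * up(x) )
    g = x @ w_gate.T
    hidden = g / (1.0 + np.exp(-g)) * (x @ w_up.T)
    return hidden @ w_down.T

rng = np.random.default_rng(0)
d_model, d_ff = 4, 6
w_gate = rng.normal(size=(d_ff, d_model))
w_up = rng.normal(size=(d_ff, d_model))
w_down = rng.normal(size=(d_model, d_ff))

# Stand-in per-parameter scores; in practice these would be squared gradients.
fisher_gate, fisher_up, fisher_down = (rng.random(w.shape) for w in (w_gate, w_up, w_down))

scores = intermediate_dim_scores(fisher_gate, fisher_up, fisher_down)
g2, u2, d2, keep = trim_expert(w_gate, w_up, w_down, scores, keep_ratio=0.5)
print(g2.shape, u2.shape, d2.shape)  # (3, 4) (3, 4) (4, 3)
```

Because the same index set is removed from all three slabs, the trimmed expert is a dense FFN with a smaller hidden width and needs no sparse kernels or masks at inference time.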

Wanda and SparseGPT are primarily designed for dense model pruning and sparsification, and thus fall outside the scope of MoE compression. Nevertheless, we include comparisons with these methods in Appendix [L](https://arxiv.org/html/2606.05538#A12) to demonstrate the effectiveness of the proposed Fisher-MoE.

### P.3 Empirical Comparison Under the Same Grouping

To check that the gains we report are not simply a relabeling of an existing scoring rule, we compare three metrics on Qwen1.5-MoE-A2.7B at $p = 50\%$ under *exactly* the intermediate dimension grouping of Eq. [5](https://arxiv.org/html/2606.05538#S2.E5), with the same calibration data (GSM8K training set, 128 samples), the same compression operation, and the same evaluation protocol. Only the per-parameter score differs:

- Magnitude (data-free): $|\theta_i|$, the baseline signal.
- First-order $|\nabla\mathcal{L}|$ (no square): $\tfrac{1}{N}\sum |\partial\mathcal{L}/\partial\theta_i|$, classical first-order pruning without squaring.
- Empirical Fisher / squared gradient (ours): $\tfrac{1}{N}\sum (\partial\mathcal{L}/\partial\theta_i)^{2}$ (Eq. [4](https://arxiv.org/html/2606.05538#S2.E4)).
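
The three per-parameter scores listed above differ only in how they aggregate weights or per-sample gradients. The sketch below makes that concrete on toy numbers; the function names, the two-parameter setup, and the gradient values are invented for illustration, and it says nothing about how the metrics compare after the grouping of Eq. 5.

```python
import numpy as np

def magnitude_score(theta):
    # Data-free baseline: absolute weight value.
    return np.abs(theta)

def first_order_score(per_sample_grads):
    # Mean absolute gradient over calibration samples (no square).
    return np.abs(per_sample_grads).mean(axis=0)

def empirical_fisher_score(per_sample_grads):
    # Mean squared gradient over calibration samples.
    return np.square(per_sample_grads).mean(axis=0)

theta = np.array([0.5, -2.0])
# Rows are calibration samples, columns are the two parameters.
grads = np.array([[1.0, 0.0],
                  [1.0, 0.0],
                  [1.0, 0.0],
                  [1.0, 3.0]])

print(magnitude_score(theta))         # values 0.5 and 2.0
print(first_order_score(grads))       # values 1.0 and 0.75
print(empirical_fisher_score(grads))  # values 1.0 and 2.25
```

In this toy the first parameter has a small, consistent gradient and the second has one rare large gradient. Squaring weights the rare large gradient more heavily, so the two gradient-based scores rank the parameters in opposite order here.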

Computing a faithful diagonal\-Hessian variant on a 14B\-parameter MoE is several times the cost of empirical Fisher \(a Hutchinson estimator requires multiple Hessian–vector products per calibration sample\); we discuss the relationship theoretically above and treat a full empirical sweep as future work\.
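
For reference, the Hutchinson-style diagonal estimator mentioned above can be sketched in a few lines. This is a generic illustration, not code from the paper: it assumes access to a Hessian-vector product callable `hvp` (here a small explicit matrix stands in for the model Hessian) and uses Rademacher probes, for which $\mathbb{E}[z \odot Hz] = \mathrm{diag}(H)$.

```python
import numpy as np

def hutchinson_diag(hvp, dim, n_probes, rng):
    # Estimate diag(H) as the average of z * (H z) over random +/-1 probes.
    est = np.zeros(dim)
    for _ in range(n_probes):
        z = rng.choice(np.array([-1.0, 1.0]), size=dim)
        est += z * hvp(z)  # one Hessian-vector product per probe
    return est / n_probes

# Small symmetric stand-in for a Hessian (hypothetical values).
H = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, -0.3],
              [0.0, -0.3, 3.0]])

rng = np.random.default_rng(0)
approx = hutchinson_diag(lambda v: H @ v, dim=3, n_probes=2000, rng=rng)
print(approx)      # close to the true diagonal, up to sampling noise
print(np.diag(H))
```

Each probe costs one Hessian-vector product, and the off-diagonal entries contribute noise that only averages out over many probes. The empirical Fisher, by contrast, needs only the gradients from a single backward pass per sample, which is the cost gap the paragraph above refers to.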

Empirical Fisher is best understood as a coordinate\-wise squared\-gradient score under the intermediate dimension grouping that is our actual contribution\. Our contribution lies in the attribution unit and grouping \(Eq\.[5](https://arxiv.org/html/2606.05538#S2.E5)\) and the structural\-removal operation it enables \(§[3](https://arxiv.org/html/2606.05538#S3)\), not in the choice of scoring rule against the broader family of gradient\- and Hessian\-based criteria\.

## Appendix Q Where is the Redundancy? Locating Compressible Substructures in an MoE

The intermediate dimension granularity is one of three structural axes along which a sparse MoE could plausibly be compressed. To justify our choice empirically, we run a controlled head-to-head comparison: we hold the importance signal fixed (Fisher importance, §[2.1](https://arxiv.org/html/2606.05538#S2.SS1)), the backbone fixed (Qwen1.5-MoE-A2.7B), and the calibration data fixed (domain-matched, $128$ samples), and we vary only *which structural unit Fisher importance is applied to*. We consider three axes:

1. Expert-level pruning: Fisher scores are aggregated per routed expert, and the lowest-ranked experts are removed wholesale. We further study the gate/up/down sub-blocks individually (Fisher-UP, Fisher-DOWN, Fisher-GATE, Fisher-UP+GATE) and a router-side variant (MoE compression (Fisher)) to localize the contribution within an expert.
2. Intermediate dimension pruning: Fisher scores are aggregated per FFN intermediate dimension across the gate/up/down slabs of every routed expert (Fisher-IntDim, our Fisher-MoE method).
3. Attention-head pruning: Fisher scores are aggregated per attention head, and the lowest-ranked heads are removed.
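
The difference between the first two axes comes down to where the same scores are summed before ranking. The sketch below is a toy illustration with invented numbers, not the paper's procedure: it assumes per-dimension Fisher scores are already available as an `(n_experts, d_ff)` array, and it assumes dimensions are ranked globally across experts, which the text above does not specify.

```python
import numpy as np

def expert_level_mask(dim_scores, ratio):
    # Sum scores per expert and drop the lowest-scoring whole experts.
    n_experts, d_ff = dim_scores.shape
    n_drop = int(round(ratio * n_experts))
    drop = np.argsort(dim_scores.sum(axis=1))[:n_drop]
    keep = np.ones((n_experts, d_ff), dtype=bool)
    keep[drop] = False
    return keep

def intermediate_dim_mask(dim_scores, ratio):
    # Drop the lowest-scoring individual dimensions (global ranking: an assumption).
    n_drop = int(round(ratio * dim_scores.size))
    order = np.argsort(dim_scores, axis=None)
    keep = np.ones(dim_scores.size, dtype=bool)
    keep[order[:n_drop]] = False
    return keep.reshape(dim_scores.shape)

# Two experts with four intermediate dimensions each (hypothetical scores).
dim_scores = np.array([[5.0, 0.1, 4.0, 0.2],
                       [3.0, 0.3, 0.15, 2.0]])

expert_keep = expert_level_mask(dim_scores, ratio=0.5)
dim_keep = intermediate_dim_mask(dim_scores, ratio=0.5)

print(expert_keep.sum(), dim_keep.sum())  # 4 4  (same parameter budget)
print(dim_scores[expert_keep].sum())      # about 9.3 of the total score kept
print(dim_scores[dim_keep].sum())         # 14.0 of the total score kept
```

At an identical budget, removing a whole expert also discards its high-scoring dimensions, whereas the finer unit can keep them. This is only a picture of the mechanism; the empirical evidence is the sweep reported in Table 34.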

For each axis we sweep a fixed compression budget and report eight downstream tasks. To put expert-level pruning on the strongest possible footing, we also include the activation-, score-, and magnitude-based heuristics from prior work as additional expert-level baselines, and an alternative *router DenseMixer* variant.

Table [34](https://arxiv.org/html/2606.05538#A17.T34) reports the full sweep.

Table 34: Where is the redundancy? Fisher-guided compression applied at three different structural granularities on Qwen1.5-MoE-A2.7B (domain-matched calibration). Heuristic expert-level baselines (activation/score/magnitude) are included for reference. Higher is better.

#### Expert-level pruning is uniformly fragile.

Every method that removes whole experts collapses on generation-heavy tasks: GSM8K stays below $23\%$ of base, HumanEval below $50\%$, and MATH near zero, regardless of whether the importance signal is activation, score, magnitude, Fisher, router-side Fisher, or expert-level Fisher. Decomposing Expert-Fisher further into its three projection slabs (UP, DOWN, GATE) does not help: no slab dominates the others. We read this as evidence that redundancy at expert granularity is *limited*: the units identified for removal are not truly redundant, just less salient on average, so discarding them takes essential computation with them.

#### Attention\-head pruning is even more fragile\.

At $25\%$ head removal, MMLU drops to $43.12\%$ ($\sim 73\%$ of base) and code/math collapse below $13$ points. At $50\%$ head removal, the model essentially stops working ($\leq 4\%$ on every task). Attention heads concentrate too much per-head capacity to be removed at this scale without retraining; head-level redundancy in this MoE is effectively zero at the budgets we consider.

#### Intermediate dimension pruning has substantial slack\.

The contrast with FFN intermediate dimensions is striking. At a *25%* budget, Fisher-IntDim already *exceeds* the uncompressed base on CEval ($+10.4$), CMMLU ($+6.8$), and GSM8K ($+6.5$), and matches base on MMLU/MBPP/BBH within $1$–$4$ points. At the much more aggressive *50%* budget, intermediate dimension pruning still retains $84$–$116\%$ of base on every non-MATH task, while every expert-level competitor at a comparable MoE compression ratio retains $\leq 25\%$ of base on most generation tasks.

#### Conclusion\.

Across the same backbone, the same Fisher signal, and the same calibration set, FFN intermediate dimensions are the structural unit with the most slack: 25–50% of them can be removed without meaningful performance loss, whereas pruning the same fraction of experts or attention heads is catastrophic\. This empirically grounds the granularity choice in our main paper\.

Holding the Fisher signal and calibration fixed, FFN intermediate dimensions admit 25–50% removal at near\-zero cost on most tasks, while removing the same fraction of experts or attention heads collapses generation\-heavy benchmarks—identifying intermediate dimensions as the locus of structural redundancy in the MoE \(Appendix[Q](https://arxiv.org/html/2606.05538#A17)\)\.
