ConMoE: Expert-Pool Consolidation via Prototype Reassignment for MoE Compression

arXiv cs.AI Papers

Summary

ConMoE proposes a train-free prototype remapping framework for Mixture-of-Experts (MoE) compression, which selects a subset of experts as reusable prototypes and deterministically remaps original expert calls to them, reducing memory usage without weight updates or fine-tuning.

arXiv:2605.29350v1 Announce Type: new Abstract: Mixture-of-Experts (MoE) language models reduce per-token computation but still require storing and serving all experts, making deployment memory-intensive. Existing post-training compression methods mainly shrink this cost by pruning experts or merging their weights. We formulate post-training MoE compression as expert-pool consolidation: retaining a smaller set of pretrained experts as reusable prototypes and deterministically remapping each original expert reference to one selected prototype. This view separates the reduced expert pool from the reuse structure that represents the original expert slots, and allows prototype sharing within local layer scopes while preserving the original router interface. We propose ConMoE, a train-free prototype remapping framework that selects retained experts using calibration-based contribution and replaceability signals, then redirects original expert calls to the selected prototypes without weight updates or post-compression fine-tuning. Experiments on three pretrained MoE language models show that ConMoE matches or outperforms strong pruning and merging baselines in several settings, achieving the best average score on deepseek-moe-16b-base at both 25% and 50% routed-expert reduction, while remaining competitive on Qwen3-30B-A3B and OLMoE-1B-7B-0125. Ablations indicate that deterministic reassignment is the most stable component, whereas broader cross-layer sharing and post-hoc weight fusion are model-dependent.

Cached at: 05/29/26, 09:17 AM

# ConMoE: Expert-Pool Consolidation via Prototype Reassignment for MoE Compression
Source: [https://arxiv.org/html/2605.29350](https://arxiv.org/html/2605.29350)
###### Abstract

Mixture-of-Experts (MoE) language models reduce per-token computation but still require storing and serving all experts, making deployment memory-intensive. Existing post-training compression methods mainly shrink this cost by pruning experts or merging their weights. We formulate post-training MoE compression as *expert-pool consolidation*: retaining a smaller set of pretrained experts as reusable prototypes and deterministically remapping each original expert reference to one selected prototype. This view separates the reduced expert pool from the reuse structure that represents the original expert slots, and allows prototype sharing within local layer scopes while preserving the original router interface. We propose ConMoE, a train-free prototype remapping framework that selects retained experts using calibration-based contribution and replaceability signals, then redirects original expert calls to the selected prototypes without weight updates or post-compression fine-tuning. Experiments on three pretrained MoE language models show that ConMoE matches or outperforms strong pruning and merging baselines in several settings, achieving the best average score on deepseek-moe-16b-base at both 25% and 50% routed-expert reduction, while remaining competitive on Qwen3-30B-A3B and OLMoE-1B-7B-0125. Ablations indicate that deterministic reassignment is the most stable component, whereas broader cross-layer sharing and post-hoc weight fusion are model-dependent.


Yilun Yao¹, Jiaming Pan¹, Elsie Dai¹, Peizhuang Cong¹, Yaoming Li¹, Tong Yang¹

¹Peking University

## 1 Introduction

Mixture-of-Experts (MoE) architectures scale language models by activating only a small subset of experts for each token, allowing large parameter counts with relatively low per-token computation (Shazeer et al., [2017](https://arxiv.org/html/2605.29350#bib.bib1); Lepikhin et al., [2020](https://arxiv.org/html/2605.29350#bib.bib2); Lewis et al., [2021](https://arxiv.org/html/2605.29350#bib.bib40); Fedus et al., [2022](https://arxiv.org/html/2605.29350#bib.bib3)). This design has been adopted in recent MoE language models such as Mixtral, DeepSeekMoE, Qwen-MoE, and OLMoE (Jiang et al., [2024](https://arxiv.org/html/2605.29350#bib.bib4); Dai et al., [2024](https://arxiv.org/html/2605.29350#bib.bib5); Yang et al., [2025](https://arxiv.org/html/2605.29350#bib.bib13); Muennighoff et al., [2025](https://arxiv.org/html/2605.29350#bib.bib6)). However, sparsity mainly reduces computation rather than storage: the full routed expert pool must still be stored and served even though each token uses only a few experts. As MoE models grow, expert storage becomes a major obstacle to efficient deployment (Rajbhandari et al., [2022](https://arxiv.org/html/2605.29350#bib.bib41); Gale et al., [2022](https://arxiv.org/html/2605.29350#bib.bib42)).

Existing post-training MoE compression methods typically reduce this cost by pruning experts (Lu et al., [2024](https://arxiv.org/html/2605.29350#bib.bib11); Chen et al., [2022](https://arxiv.org/html/2605.29350#bib.bib8); Lasby et al., [2026](https://arxiv.org/html/2605.29350#bib.bib7)) or merging multiple experts into fewer modules (Li et al., [2024](https://arxiv.org/html/2605.29350#bib.bib17); Chen et al., [2025](https://arxiv.org/html/2605.29350#bib.bib19); Miao et al., [2025](https://arxiv.org/html/2605.29350#bib.bib22); LI et al., [2026](https://arxiv.org/html/2605.29350#bib.bib18)). These methods shrink the expert pool, but they often conflate two distinct questions: which expert parameters should be retained, and how the router's original expert references should be represented after compression. In this work, we study a complementary view in which a compressed MoE retains a smaller set of pretrained experts as reusable prototypes, while explicitly mapping each original expert reference to the retained pool.

We formulate this view as *expert-pool consolidation*. Under a fixed reduction budget, a compressed MoE consists of a reduced prototype pool and a deterministic reassignment map from original experts to selected prototypes. This separates two decisions that are usually coupled in pruning and merging: which expert parameters are stored, and how the original router-facing expert slots are represented. The original router interface can therefore be preserved by redirecting each expert call through the reassignment map, while multiple original expert slots may share the same stored prototype. The same formulation also permits local cross-layer reuse: nearby layers may share prototypes when they contain reusable redundancy, but we restrict sharing to bounded local scopes to avoid mismatch from model-wide expert reuse.

Based on this formulation, we propose ConMoE, a train-free prototype remapping framework for post-training MoE compression. ConMoE selects a budgeted subset of pretrained experts as prototypes and deterministically reassigns each original expert to one selected prototype. The selected prototypes are reused directly, without weight updates or post-compression fine-tuning, and the original router is kept unchanged. Post-hoc weight fusion is studied only as a diagnostic sensitivity analysis, not as part of the default ConMoE pipeline (Wortsman et al., [2022](https://arxiv.org/html/2605.29350#bib.bib39); Yadav et al., [2023](https://arxiv.org/html/2605.29350#bib.bib43)). We report *logical* routed-expert reduction: a selected prototype is counted once even if it represents multiple original expert slots, while realizing the corresponding physical memory savings requires a shared-prototype checkpoint or runtime.

In summary, this work makes three contributions\. First, we formulate one\-shot MoE compression as expert\-pool consolidation with explicit prototype reassignment\. Second, we propose ConMoE, a train\-free remapping method that preserves the original router interface while reducing the logical routed\-expert pool\. Third, we empirically show across multiple pretrained MoE language models that remapping\-based consolidation is a viable alternative to pruning and merging under matched logical routed\-expert budgets\. Our ablations further indicate that deterministic reassignment is the most stable component, while broader cross\-layer sharing and post\-hoc weight fusion are model\-dependent\.

## 2 Related Work

#### Post\-training MoE compression\.

Sparse MoE language models reduce per-token computation by activating only a few experts, but their full routed expert pool still creates substantial memory and deployment overhead. Existing post-training MoE compression methods mainly shrink this expert pool through expert pruning or expert merging. Expert pruning removes experts according to usage frequency, routing mass, activation statistics, or searched importance scores (Lu et al., [2024](https://arxiv.org/html/2605.29350#bib.bib11); Yang et al., [2024](https://arxiv.org/html/2605.29350#bib.bib12); Chen et al., [2022](https://arxiv.org/html/2605.29350#bib.bib8); Lasby et al., [2026](https://arxiv.org/html/2605.29350#bib.bib7); Liu et al., [2026](https://arxiv.org/html/2605.29350#bib.bib14)). Expert merging instead combines multiple experts into fewer modules using routing statistics, output similarity, clustering, alignment, or subspace fusion, as in M-SMoE/MC-SMoE, HC-SMoE, MergeMoE, and Sub-MoE (Li et al., [2024](https://arxiv.org/html/2605.29350#bib.bib17); Chen et al., [2025](https://arxiv.org/html/2605.29350#bib.bib19); Miao et al., [2025](https://arxiv.org/html/2605.29350#bib.bib22); LI et al., [2026](https://arxiv.org/html/2605.29350#bib.bib18)). These methods are closest to ours in objective, since they also aim to reduce routed-expert storage after pretraining. However, pruning removes experts and merging constructs new or fused expert modules, whereas ConMoE keeps selected pretrained experts as reusable prototypes and explicitly remaps original expert references to them. This makes the reuse structure part of the compressed model rather than a by-product of deletion or fusion.

#### Non\-uniform budgets and local cross\-layer reuse\.

Recent pruning and compression methods show that expert redundancy is heterogeneous across layers, making uniform layer-wise budgets suboptimal. DiEP learns layer-level pruning rates, while EvoESAP decouples within-layer expert ranking from across-layer budget allocation (Bai et al., [2025](https://arxiv.org/html/2605.29350#bib.bib9); Liu et al., [2026](https://arxiv.org/html/2605.29350#bib.bib14)). Related shared-pool architectures such as UniPool further challenge the assumption that each layer must own a private expert set (Huang et al., [2026](https://arxiv.org/html/2605.29350#bib.bib15)). ConMoE is complementary to these works: it targets existing pretrained checkpoints, requires no gradient updates, and preserves the original router-facing expert slots. Instead of training a globally shared expert pool from scratch, ConMoE performs post-training prototype remapping and allows neighboring layers to share a local candidate pool when beneficial. This local-scope view avoids assuming that experts from distant layers are interchangeable, while still permitting cross-layer reuse within bounded neighborhoods.

![Refer to caption](https://arxiv.org/html/2605.29350v1/x1.png)

Figure 1: Overview of ConMoE. Starting from a pretrained MoE with layer-wise routed expert pools, ConMoE performs prototype-based expert-pool consolidation within local scopes, each containing one or more neighboring MoE layers. It uses calibration statistics and expert distances to select pretrained experts as reusable prototypes, and deterministically reassigns each original expert reference to one selected prototype. The compressed MoE preserves the original router interface by redirecting original expert calls to their assigned prototypes in the logical reduced pool.

## 3 Problem Formulation

### 3.1 Sparse MoE Expert Pools

Consider a decoder-only Transformer with MoE layers indexed by $l \in \{1,\ldots,L\}$. The routed feed-forward block at layer $l$ contains an expert pool $\mathcal{E}^{(l)} = \{E^{(l)}_{1},\ldots,E^{(l)}_{N_{l}}\}$. For an input token representation $h_{t}^{(l)}$, the router selects a top-$k$ expert set $T^{(l)}(t)$ and assigns normalized routing weights $g_{i}^{(l)}(t)$ to the selected experts. The MoE output is

$$\mathrm{MoE}^{(l)}(h_{t}^{(l)}) = \sum_{i \in T^{(l)}(t)} g_{i}^{(l)}(t)\, E_{i}^{(l)}(h_{t}^{(l)}).$$

Although each token activates only a few experts, every routed expert must remain stored and addressable because routing decisions vary across tokens and inputs. We focus on compressing this routed-expert pool, while keeping shared experts, routers, attention blocks, embeddings, and other non-routed modules unchanged.
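As a reading aid, the sparse MoE output above can be sketched in a few lines of plain Python. This is a toy illustration, not the implementation of any of the evaluated models: the names `softmax` and `moe_forward` are hypothetical, experts are modeled as plain callables on lists of floats, and the routing weights are assumed to be a softmax renormalized over the selected top-$k$ logits (the text only states that they are normalized).

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [v / z for v in exps]

def moe_forward(h, router_logits, experts, k):
    """Toy top-k MoE output: sum over i in T(t) of g_i(t) * E_i(h)."""
    # T(t): indices of the k largest router logits (ties keep the lower index).
    top = sorted(range(len(experts)), key=lambda i: router_logits[i], reverse=True)[:k]
    # Assumption: gates are a softmax renormalized over the selected experts only.
    gates = softmax([router_logits[i] for i in top])
    out = [0.0] * len(h)
    for g, i in zip(gates, top):
        y = experts[i](h)
        out = [o + g * v for o, v in zip(out, y)]
    return out, top, gates

# Four toy "experts" that simply scale their input by 1, 2, 3, and 4.
experts = [lambda h, s=s: [s * v for v in h] for s in (1.0, 2.0, 3.0, 4.0)]
out, top, gates = moe_forward([1.0, 2.0], [0.0, 1.0, 3.0, 3.0], experts, k=2)
# top == [2, 3], gates == [0.5, 0.5], out == [3.5, 7.0]
```

Only two of the four experts run for this token, yet all four must stay in memory because a different token may select the other two. This is the storage cost that the rest of the section targets.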

### 3.2 Expert-Pool Consolidation

Let $G \subseteq \{1,\ldots,L\}$ be a local scope containing one or more neighboring MoE layers, and let

$$\mathcal{E}_{G} = \bigcup_{l \in G} \mathcal{E}^{(l)}$$

denote the original routed expert pool in this scope. Given a routed-expert reduction ratio $\rho \in [0,1)$, we aim to construct a reduced prototype pool $P_{G}$ with

$$|P_{G}| = K, \qquad K = \max\bigl(1, \mathrm{round}((1-\rho)\,|\mathcal{E}_{G}|)\bigr).$$

Thus, $\rho = 25\%$ corresponds to retaining approximately $75\%$ of the routed experts in the logical prototype pool, while $\rho = 50\%$ retains approximately half of them.
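The budget formula can be checked with a small helper. The function name `prototype_budget` is hypothetical, and the tie rule is an assumption: the text writes $\mathrm{round}$ without saying how halves are handled, so this sketch rounds halves up, which differs from Python's built-in `round` (round half to even).

```python
import math

def prototype_budget(num_experts, rho):
    """K = max(1, round((1 - rho) * |E_G|)) for a scope with num_experts experts."""
    if num_experts < 1:
        raise ValueError("a scope needs at least one expert")
    if not 0.0 <= rho < 1.0:
        raise ValueError("rho must lie in [0, 1)")
    # Assumption: halves round up; the source does not specify the tie rule.
    return max(1, math.floor((1.0 - rho) * num_experts + 0.5))

print(prototype_budget(64, 0.25))  # 48
print(prototype_budget(64, 0.50))  # 32
print(prototype_budget(5, 0.50))   # 3 (built-in round(2.5) would give 2)
print(prototype_budget(4, 0.90))   # 1, the lower bound from max(1, ...)
```

The `max(1, ...)` guard matters only for very small scopes or very aggressive ratios, where rounding alone would leave no prototype to map to.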

Expert-pool consolidation also specifies how the original expert pool is represented by the reduced pool. We denote this reassignment by

$$m_{G} : \mathcal{E}_{G} \rightarrow P_{G},$$

where $m_{G}(e)$ is the stored prototype that represents the original expert reference $e$. A compressed scope is therefore described by two objects: the reduced prototype pool $P_{G}$ and the reassignment map $m_{G}$.

This formulation separates two coupled decisions in MoE compression: which expert parameters are stored, and how the original router-facing expert slots are represented. In this work, ConMoE uses retained pretrained experts directly as prototypes, i.e., $P_{G} \subseteq \mathcal{E}_{G}$, and does not update or fuse expert weights in the default setting. When $G$ contains multiple neighboring layers, the fixed budget can be allocated non-uniformly across layers; when $G$ contains a single layer, the formulation reduces to layer-local consolidation.
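To make the notion of a local scope concrete, the sketch below groups layers into scopes and enumerates the pooled expert references $\mathcal{E}_{G}$ as `(layer, index)` pairs. It assumes scopes are non-overlapping blocks of consecutive layers of a fixed size; the text only requires that a scope contain neighboring layers, so this particular partition, and the names `build_scopes` and `scope_pool`, are illustrative choices.

```python
def build_scopes(num_layers, scope_size):
    """Partition layers 1..L into consecutive, non-overlapping scopes.

    Assumption: fixed-size contiguous blocks; the last scope may be smaller.
    """
    if num_layers < 1 or scope_size < 1:
        raise ValueError("num_layers and scope_size must be positive")
    layers = list(range(1, num_layers + 1))
    return [layers[i:i + scope_size] for i in range(0, num_layers, scope_size)]

def scope_pool(scope, experts_per_layer):
    """E_G: all (layer, expert_index) references for the layers in one scope."""
    return [(l, i) for l in scope for i in range(1, experts_per_layer[l] + 1)]

scopes = build_scopes(num_layers=6, scope_size=4)
print(scopes)                               # [[1, 2, 3, 4], [5, 6]]
print(scope_pool(scopes[1], {5: 2, 6: 3}))  # [(5, 1), (5, 2), (6, 1), (6, 2), (6, 3)]
```

With `scope_size=1` every scope is a single layer, which corresponds to the layer-local case mentioned above.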

### 3.3 Consolidation Objective

An ideal reduced pool should represent the original experts with low reassignment cost. Let $d(e,p)$ be the cost of representing original expert $e$ by prototype $p$, and let $w_{e}$ measure the importance of $e$ under the original model. For a candidate prototype pool $P \subseteq \mathcal{E}_{G}$, define

$$D(e,P) = \min_{p \in P} d(e,p), \qquad L_{G}(P) = \sum_{e \in \mathcal{E}_{G}} w_{e}\, D(e,P).$$

The ideal remapping-only consolidation problem is

$$P_{G}^{\star} = \operatorname*{arg\,min}_{P \subseteq \mathcal{E}_{G},\ |P| = K} L_{G}(P).$$

Given a selected prototype pool $P_{G}$, each original expert is assigned to its nearest prototype:

$$m_{G}(e) = \operatorname*{arg\,min}_{p \in P_{G}} d(e,p).$$

This objective captures the central trade-off of expert-pool consolidation: the reduced pool should prioritize important experts while also covering the original expert pool with low reassignment error. Directly minimizing $L_{G}(P)$ is a combinatorial prototype-selection problem. ConMoE therefore uses this objective as a guiding principle and introduces an efficient score-based prototype selection rule in the next section.
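The combinatorial nature of the objective is easiest to see on a toy pool. The sketch below evaluates $L_{G}(P)$ and solves the ideal problem by exhaustive search, which is only feasible for tiny pools because the number of size-$K$ subsets grows combinatorially. The distance matrix, the weights, and the function names are made up for illustration and are not taken from the paper.

```python
from itertools import combinations

def consolidation_loss(dist, weights, prototypes):
    """L_G(P) = sum_e w_e * min_{p in P} d(e, p)."""
    return sum(w * min(dist[e][p] for p in prototypes)
               for e, w in enumerate(weights))

def brute_force_prototypes(dist, weights, k):
    """Exact minimizer of L_G(P) over all size-k subsets (toy sizes only)."""
    n = len(weights)
    return min(combinations(range(n), k),
               key=lambda P: consolidation_loss(dist, weights, P))

# Four toy experts placed on a line; d is the absolute difference of positions.
pos = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(a - b) for b in pos] for a in pos]
weights = [1.0, 2.0, 1.0, 3.0]  # hypothetical importance values w_e

best = brute_force_prototypes(dist, weights, k=2)
print(best, consolidation_loss(dist, weights, best))  # (1, 3) 2.0
```

The optimum keeps one expert from each tight pair and, within each pair, the one with the larger weight. That is the trade-off between importance and coverage that the score-based rule in the next section approximates.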

## 4 Method

ConMoE performs expert-pool consolidation by selecting a reduced set of pretrained experts as prototypes and defining a deterministic reassignment map from original experts to those prototypes. For each local scope $G$, let $\mathcal{E}_{G}$ be the original routed expert pool and let $K$ be the prototype budget. ConMoE constructs a prototype set

$$P_{G} \subseteq \mathcal{E}_{G}, \qquad |P_{G}| = K,$$

together with a reassignment map

$$m_{G} : \mathcal{E}_{G} \rightarrow P_{G}.$$

Each original expert is therefore represented by one selected pretrained prototype. No expert weights are updated or fused in the default construction.

### 4.1 Prototype Scoring

The prototype set should retain experts that are useful under the pretrained routing distribution and difficult to substitute within the same scope. ConMoE estimates these two properties using a routing-conditioned contribution score and a replaceability score.

For each expert $e \in \mathcal{E}_{G}$, let $\mathcal{D}_{e}$ be the calibration tokens that activate it. We define its routing-conditioned contribution as

$$a_{e} = \frac{1}{|\mathcal{D}_{e}|} \sum_{t \in \mathcal{D}_{e}} g_{e}(t)\, \|e(h_{t})\|_{2},$$

with $a_{e} = 0$ when $\mathcal{D}_{e}$ is empty. This score measures the average contribution of $e$ conditional on being selected.
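The contribution score is a per-expert running average, so it can be accumulated in one pass over calibration activations. The sketch below assumes the calibration pass has already been logged as `(expert_index, gate_weight, expert_output)` triples, one per token and selected expert; that record format and the name `contribution_scores` are hypothetical.

```python
import math

def contribution_scores(records, num_experts):
    """a_e = mean over tokens in D_e of g_e(t) * ||e(h_t)||_2, or 0 if D_e is empty.

    records: iterable of (expert_index, gate_weight, expert_output_vector),
    one entry per (token, selected expert) pair seen during calibration.
    """
    totals = [0.0] * num_experts
    counts = [0] * num_experts
    for e, gate, output in records:
        l2_norm = math.sqrt(sum(v * v for v in output))
        totals[e] += gate * l2_norm
        counts[e] += 1
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]

records = [
    (0, 0.5, [3.0, 4.0]),   # gate 0.5, output norm 5
    (0, 1.0, [0.0, 2.0]),   # gate 1.0, output norm 2
    (2, 0.25, [6.0, 8.0]),  # gate 0.25, output norm 10
]
print(contribution_scores(records, num_experts=3))  # [2.25, 0.0, 2.5]
```

Because the average is taken over $\mathcal{D}_{e}$ rather than over all tokens, a rarely routed expert with large gated outputs can outscore a frequently routed one, as expert 2 does here.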

To estimate replaceability, we use the nearest-neighbor distance within the scope:

$$b_{e} = \min_{e' \in \mathcal{E}_{G} \setminus \{e\}} d(e,e').$$

Here $d(e,e')$ is a normalized parameter distance between experts; its exact form is given in Appendix [B.1](https://arxiv.org/html/2605.29350#A2.SS1). A larger $b_{e}$ indicates that $e$ has no close substitute in $\mathcal{E}_{G}$.

We normalize $a_{e}$ and $b_{e}$ within the scope,

$$\bar{a}_{e} = \mathrm{Norm}_{G}(a_{e}), \qquad \bar{b}_{e} = \mathrm{Norm}_{G}(b_{e}),$$

where $\mathrm{Norm}_{G}(\cdot)$ denotes min–max normalization over experts in $\mathcal{E}_{G}$. The final prototype score is

$$s_{e} = \bar{a}_{e}\, \bar{b}_{e}.$$

This score favors experts that are both useful when routed to and hard to replace.
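A minimal sketch of the scoring step follows. It takes a precomputed pairwise distance matrix as input, since the exact form of $d$ is deferred to the appendix and is not reproduced here. The function names are hypothetical, and mapping a constant vector to all zeros in `min_max` is an assumption for a case the text does not cover.

```python
def replaceability(dist):
    """b_e = distance from e to its nearest other expert in the scope (needs >= 2 experts)."""
    n = len(dist)
    return [min(dist[e][o] for o in range(n) if o != e) for e in range(n)]

def min_max(xs):
    """Min-max normalization to [0, 1] over one scope."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        return [0.0] * len(xs)  # assumption: constant inputs normalize to zero
    return [(x - lo) / (hi - lo) for x in xs]

def prototype_scores(a, dist):
    """s_e = normalized contribution times normalized replaceability."""
    a_bar = min_max(a)
    b_bar = min_max(replaceability(dist))
    return [x * y for x, y in zip(a_bar, b_bar)]

# Toy scope: experts on a line, with made-up contribution scores a_e.
pos = [0.0, 1.0, 4.0, 10.0]
dist = [[abs(p - q) for q in pos] for p in pos]
a = [2.0, 4.0, 3.0, 1.0]
print(replaceability(dist))  # [1.0, 1.0, 3.0, 6.0]
scores = prototype_scores(a, dist)
# Only expert 2 gets a nonzero score (about 0.267) in this toy scope.
```

One consequence of min-max normalization is visible in the toy output: the expert with the lowest contribution and the experts with the lowest replaceability receive a score of exactly zero, so the product ranks them last regardless of their other signal.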

### 4.2 Prototype Selection and Reassignment

Given the prototype score, ConMoE selects the top-$K$ experts as the reduced prototype set:

$$P_{G} = \operatorname{TopK}_{e \in \mathcal{E}_{G}}(s_{e}, K).$$

Each original expert is then assigned to its nearest selected prototype:

$$m_{G}(e) = \operatorname*{arg\,min}_{p \in P_{G}} d(e,p).$$

This induces a prototype-centered partition of the original expert pool:

$$A_{p} = \{e \in \mathcal{E}_{G} : m_{G}(e) = p\}, \qquad p \in P_{G}.$$

The clusters $A_{p}$ define the reuse structure of the compressed expert pool. Unlike expert merging methods, ConMoE does not combine the weights of experts in $A_{p}$; the selected prototype $p$ remains the original pretrained expert.

The top-$K$ selection rule is a computationally simple heuristic guided by the consolidation objective in Section [3](https://arxiv.org/html/2605.29350#S3). It does not attempt to exactly solve the combinatorial prototype-selection problem.
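Selection and reassignment reduce to a sort followed by a nearest-neighbor lookup, as the sketch below shows on integer expert indices. The function names are hypothetical. Breaking ties by the lower index is an assumption, as is $d(e,e) = 0$ with strictly positive distances elsewhere, which makes every prototype map to itself.

```python
def select_prototypes(scores, k):
    """P_G: the k highest-scoring experts (ties broken by lower index)."""
    order = sorted(range(len(scores)), key=lambda e: (-scores[e], e))
    return sorted(order[:k])

def reassign(dist, prototypes):
    """m_G: map every original expert to its nearest selected prototype."""
    return {e: min(prototypes, key=lambda p: (dist[e][p], p))
            for e in range(len(dist))}

def clusters(mapping):
    """A_p: the original experts represented by each prototype p."""
    groups = {}
    for e, p in mapping.items():
        groups.setdefault(p, []).append(e)
    return groups

pos = [0.0, 1.0, 10.0, 12.0]
dist = [[abs(a - b) for b in pos] for a in pos]
scores = [0.9, 0.1, 0.8, 0.2]  # made-up prototype scores s_e

P = select_prototypes(scores, k=2)
m = reassign(dist, P)
print(P)            # [0, 2]
print(m)            # {0: 0, 1: 0, 2: 2, 3: 2}
print(clusters(m))  # {0: [0, 1], 2: [2, 3]}
```

The map `m` is the only new object the compressed model needs besides the retained weights: the weights of experts 1 and 3 are dropped, and their slots are served by prototypes 0 and 2 without any averaging.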

### 4.3 Consolidated MoE Operator

For a layer $l \in G$, let $T^{(l)}(t)$ be the set of experts selected by the pretrained router for token $t$, with routing weights $g_{i}^{(l)}(t)$. The consolidated MoE operator replaces each selected expert by its assigned prototype:

$$\widetilde{\mathrm{MoE}}^{(l)}(h_{t}^{(l)}) = \sum_{p \in P_{G}} \alpha_{p}^{(l)}(t)\, p(h_{t}^{(l)}),$$

where

$$\alpha_{p}^{(l)}(t) = \sum_{\begin{subarray}{c} i \in T^{(l)}(t) \\ m_{G}(E_{i}^{(l)}) = p \end{subarray}} g_{i}^{(l)}(t).$$

Thus, when multiple routed experts selected for the same token are assigned to the same prototype, their routing weights are aggregated into a single coefficient.

Applying this construction independently to all scopes yields the compressed MoE\. When a scope contains one layer, ConMoE reduces to layer\-local consolidation\. When a scope contains multiple neighboring layers, the same formulation allows local cross\-layer prototype reuse\.
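The consolidated operator can be sketched as a thin wrapper around the unchanged router output: the router still returns original expert indices and weights, and the wrapper redirects them through the reassignment map. The code below is a toy single-layer illustration with experts as plain callables and hypothetical function names; in a multi-layer scope the keys would be `(layer, index)` pairs rather than integers.

```python
def consolidated_coefficients(selected, gates, mapping):
    """alpha_p(t): sum of routing weights of selected experts that map to prototype p."""
    alpha = {}
    for i, g in zip(selected, gates):
        p = mapping[i]
        alpha[p] = alpha.get(p, 0.0) + g
    return alpha

def consolidated_forward(h, selected, gates, mapping, prototypes):
    """Consolidated MoE output: sum over prototypes of alpha_p(t) * p(h)."""
    out = [0.0] * len(h)
    for p, a in consolidated_coefficients(selected, gates, mapping).items():
        y = prototypes[p](h)  # each shared prototype runs once per token
        out = [o + a * v for o, v in zip(out, y)]
    return out

# Router output for one token: original experts 0, 1, 3 with their weights.
selected, gates = [0, 1, 3], [0.5, 0.25, 0.25]
mapping = {0: 0, 1: 0, 2: 2, 3: 2}           # experts 1 and 3 were consolidated
prototypes = {0: lambda h: [1.0 * v for v in h],
              2: lambda h: [2.0 * v for v in h]}

print(consolidated_coefficients(selected, gates, mapping))  # {0: 0.75, 2: 0.25}
print(consolidated_forward([1.0, 2.0], selected, gates, mapping, prototypes))  # [1.25, 2.5]
```

Aggregating the weights first means a prototype shared by several selected experts is evaluated once rather than once per original slot, and the total routing mass for the token is unchanged.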

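Table 1: Main results on six multiple-choice benchmarks. All compressed methods are one-shot and use no post-compression fine-tuning. Reduction denotes the logical routed-expert reduction ratio.

| Model | Reduction | Type | Method | WinoGrande | ARC-C | ARC-E | BoolQ | HellaSwag | PIQA | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-30B-A3B | – | – | Original | 0.707 | 0.564 | 0.791 | 0.887 | 0.776 | 0.805 | 0.755 |
| Qwen3-30B-A3B | 25% | Merging | M-SMoE | 0.710 | 0.552 | 0.788 | 0.884 | 0.775 | 0.804 | 0.751 |
| Qwen3-30B-A3B | 25% | Merging | HC-SMoE | 0.694 | 0.465 | 0.728 | 0.859 | 0.653 | 0.758 | 0.693 |
| Qwen3-30B-A3B | 25% | Pruning | Frequency | 0.705 | 0.559 | 0.791 | 0.885 | 0.776 | 0.804 | 0.754 |
| Qwen3-30B-A3B | 25% | Pruning | REAP | 0.698 | 0.550 | 0.795 | 0.881 | 0.768 | 0.797 | 0.748 |
| Qwen3-30B-A3B | 25% | – | ConMoE | 0.710 | 0.546 | 0.786 | 0.885 | 0.772 | 0.804 | 0.751 |
| Qwen3-30B-A3B | 50% | Merging | M-SMoE | 0.696 | 0.507 | 0.777 | 0.818 | 0.735 | 0.751 | 0.714 |
| Qwen3-30B-A3B | 50% | Merging | HC-SMoE | 0.521 | 0.307 | 0.503 | 0.734 | 0.370 | 0.630 | 0.511 |
| Qwen3-30B-A3B | 50% | Pruning | Frequency | 0.690 | 0.520 | 0.773 | 0.874 | 0.747 | 0.794 | 0.733 |
| Qwen3-30B-A3B | 50% | Pruning | REAP | 0.685 | 0.549 | 0.765 | 0.861 | 0.736 | 0.790 | 0.731 |
| Qwen3-30B-A3B | 50% | – | ConMoE | 0.715 | 0.514 | 0.731 | 0.851 | 0.715 | 0.780 | 0.717 |
| deepseek-moe-16b-base | – | – | Original | 0.701 | 0.459 | 0.695 | 0.740 | 0.772 | 0.797 | 0.694 |
| deepseek-moe-16b-base | 25% | Merging | M-SMoE | 0.683 | 0.411 | 0.676 | 0.738 | 0.682 | 0.772 | 0.660 |
| deepseek-moe-16b-base | 25% | Merging | HC-SMoE | 0.691 | 0.418 | 0.667 | 0.755 | 0.740 | 0.785 | 0.676 |
| deepseek-moe-16b-base | 25% | Pruning | Frequency | 0.691 | 0.432 | 0.680 | 0.699 | 0.747 | 0.791 | 0.673 |
| deepseek-moe-16b-base | 25% | Pruning | REAP | 0.693 | 0.447 | 0.690 | 0.692 | 0.765 | 0.798 | 0.681 |
| deepseek-moe-16b-base | 25% | – | ConMoE | 0.696 | 0.447 | 0.706 | 0.745 | 0.757 | 0.796 | 0.691 |
| deepseek-moe-16b-base | 50% | Merging | M-SMoE | 0.563 | 0.335 | 0.568 | 0.619 | 0.269 | 0.704 | 0.510 |
| deepseek-moe-16b-base | 50% | Merging | HC-SMoE | 0.648 | 0.371 | 0.609 | 0.678 | 0.629 | 0.746 | 0.613 |
| deepseek-moe-16b-base | 50% | Pruning | Frequency | 0.628 | 0.392 | 0.626 | 0.635 | 0.678 | 0.773 | 0.622 |
| deepseek-moe-16b-base | 50% | Pruning | REAP | 0.654 | 0.409 | 0.678 | 0.631 | 0.699 | 0.780 | 0.642 |
| deepseek-moe-16b-base | 50% | – | ConMoE | 0.674 | 0.398 | 0.663 | 0.697 | 0.690 | 0.781 | 0.651 |
| OLMoE-1B-7B-0125 | – | – | Original | 0.689 | 0.492 | 0.770 | 0.704 | 0.782 | 0.796 | 0.705 |
| OLMoE-1B-7B-0125 | 25% | Merging | M-SMoE | 0.626 | 0.465 | 0.705 | 0.631 | 0.627 | 0.782 | 0.639 |
| OLMoE-1B-7B-0125 | 25% | Merging | HC-SMoE | 0.662 | 0.451 | 0.729 | 0.659 | 0.712 | 0.762 | 0.662 |
| OLMoE-1B-7B-0125 | 25% | Pruning | Frequency | 0.665 | 0.452 | 0.702 | 0.643 | 0.668 | 0.775 | 0.651 |
| OLMoE-1B-7B-0125 | 25% | Pruning | REAP | 0.674 | 0.470 | 0.753 | 0.589 | 0.747 | 0.779 | 0.669 |
| OLMoE-1B-7B-0125 | 25% | – | ConMoE | 0.678 | 0.488 | 0.756 | 0.618 | 0.741 | 0.778 | 0.676 |
| OLMoE-1B-7B-0125 | 50% | Merging | M-SMoE | 0.515 | 0.365 | 0.593 | 0.429 | 0.327 | 0.620 | 0.475 |
| OLMoE-1B-7B-0125 | 50% | Merging | HC-SMoE | 0.564 | 0.364 | 0.606 | 0.578 | 0.528 | 0.671 | 0.552 |
| OLMoE-1B-7B-0125 | 50% | Pruning | Frequency | 0.570 | 0.343 | 0.547 | 0.573 | 0.494 | 0.704 | 0.539 |
| OLMoE-1B-7B-0125 | 50% | Pruning | REAP | 0.575 | 0.343 | 0.582 | 0.499 | 0.563 | 0.710 | 0.545 |
| OLMoE-1B-7B-0125 | 50% | – | ConMoE | 0.570 | 0.327 | 0.577 | 0.579 | 0.498 | 0.683 | 0.540 |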

## 5 Experiments

We evaluate ConMoE from three perspectives\. First, we compare its quality–storage trade\-off against pruning and merging baselines under matched logical routed\-expert reduction budgets\. Second, we use controlled ablations to isolate reassignment structure, cross\-layer scope, prototype selection, and post\-hoc fusion\. Third, we analyze whether pretrained MoE checkpoints contain local cross\-layer expert substitutability\.

### 5.1 Experimental Setup

#### Models\.

We evaluate on three pretrained MoE language models with different scales and expert layouts: Qwen3-30B-A3B (Yang et al., [2025](https://arxiv.org/html/2605.29350#bib.bib13)), deepseek-moe-16b-base (Dai et al., [2024](https://arxiv.org/html/2605.29350#bib.bib5)), and OLMoE-1B-7B-0125 (Muennighoff et al., [2025](https://arxiv.org/html/2605.29350#bib.bib6)). Unless otherwise stated, all methods are one-shot and train-free, and ConMoE denotes the default remapping-only method. We compress only routed experts, while keeping shared experts, routers, attention blocks, embeddings, layer norms, and output heads unchanged.

#### Calibration and evaluation\.

We use unlabeled calibration text only to collect routing statistics, expert usage, and expert-output norms. No labels, losses, gradients, or post-compression fine-tuning are used. All downstream evaluations are run with lm-eval (Gao et al., [2024](https://arxiv.org/html/2605.29350#bib.bib38)); calibration sources, prompt sampling, and metric definitions are detailed in Appendices [B.3](https://arxiv.org/html/2605.29350#A2.SS3) and [B.4](https://arxiv.org/html/2605.29350#A2.SS4). The main comparison uses six multiple-choice benchmarks: WinoGrande, ARC-C, ARC-E, BoolQ, HellaSwag, and PIQA (Sakaguchi et al., [2019](https://arxiv.org/html/2605.29350#bib.bib34); Clark et al., [2018](https://arxiv.org/html/2605.29350#bib.bib29), [2019](https://arxiv.org/html/2605.29350#bib.bib35); Zellers et al., [2019](https://arxiv.org/html/2605.29350#bib.bib27); Bisk et al., [2019](https://arxiv.org/html/2605.29350#bib.bib37)). We report task accuracy or normalized accuracy and the average score across the suite. For controlled ablations, we additionally include MMLU (Hendrycks et al., [2021](https://arxiv.org/html/2605.29350#bib.bib36)).

For compression, we report the logical routed\-expert reduction ratio, where each selected prototype is counted once\. Thus, 25% reduction keeps approximately 75% of routed experts as logical prototypes, while 50% reduction keeps approximately half\. Materialized checkpoints are used only for evaluation compatibility and are not counted as compressed storage\.
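The logical reduction ratio can be read directly off a reassignment map: each distinct prototype counts once, however many original slots it serves. The helper below is a hypothetical illustration of that bookkeeping, not the paper's evaluation code.

```python
def logical_reduction(mapping):
    """1 - (number of distinct prototypes) / (number of original expert slots)."""
    if not mapping:
        raise ValueError("mapping must cover at least one expert")
    return 1.0 - len(set(mapping.values())) / len(mapping)

# Eight original slots served by six distinct prototypes: 25% logical reduction.
mapping = {0: 0, 1: 0, 2: 2, 3: 3, 4: 4, 5: 4, 6: 6, 7: 7}
print(logical_reduction(mapping))  # 0.25
```

A checkpoint that materializes every original slot with a copy of its prototype has the same logical reduction but no physical saving, which is why materialized checkpoints are not counted as compressed storage.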

#### Baselines\.

We compare against representative pruning and merging baselines. For pruning, we use Frequency pruning as a simple routing-saliency baseline and REAP pruning as a stronger contribution-based baseline (Lasby et al., [2026](https://arxiv.org/html/2605.29350#bib.bib7)). For merging, we compare against M-SMoE and HC-SMoE, two retraining-free expert merging methods (Li et al., [2024](https://arxiv.org/html/2605.29350#bib.bib17); Chen et al., [2025](https://arxiv.org/html/2605.29350#bib.bib19)). All methods are evaluated at matched logical routed-expert reduction budgets. For ConMoE, the main table uses local scopes: scope size 4 for Qwen3-30B-A3B and scope size 1 for deepseek-moe-16b-base and OLMoE-1B-7B-0125. We study scope size explicitly in Section [5.3](https://arxiv.org/html/2605.29350#S5.SS3).

### 5.2 Main Results

Table [1](https://arxiv.org/html/2605.29350#S4.T1) reports the main comparison across three MoE models and two routed-expert reduction ratios, 25% and 50%. The table groups methods into pruning, merging, and remapping-based expert-pool consolidation. The key question is whether representing original expert slots by retained prototypes provides a competitive quality–storage trade-off.

Across models and reduction ratios, ConMoE is competitive with strong pruning and merging baselines and achieves the best or near\-best average performance in several settings\. On DeepSeek, ConMoE obtains the best average score at both reduction ratios\. On OLMoE, it performs best at 25% reduction and remains close to the strongest baselines at 50% reduction\. On Qwen3, pruning is particularly strong, especially at 50% reduction, while ConMoE remains competitive and preserves strong performance on WinoGrande and BoolQ\.

These results suggest that deterministic prototype remapping is a useful alternative to directly deleting or merging experts\. Compared with pruning, ConMoE also defines how original expert references are reassigned to reusable prototypes\. Compared with merging, default ConMoE reuses selected pretrained experts directly rather than constructing fused expert weights\.

### 5.3 Ablation Studies

We next isolate the main design choices in ConMoE\. Unless otherwise stated, ablations are conducted on Qwen3\-30B\-A3B at 50% routed\-expert reduction\. To reduce cost while preserving task diversity, we report ARC\-C, HellaSwag, and MMLU\. ARC\-C and HellaSwag report normalized accuracy, while MMLU reports accuracy\.

#### Consolidation structure\.

Table [2](https://arxiv.org/html/2605.29350#S5.T2) isolates how the reduced prototype pool and reassignment structure are constructed. All variants are remapping-only and use the same expert-to-prototype reassignment mechanism; they differ only in candidate scope and selection policy. We use descriptive component names because these variants analyze mechanisms inside ConMoE rather than define separate algorithms.

Table 2: Consolidation-structure ablation on Qwen3-30B-A3B at 50% routed-expert reduction. All variants are remapping-only. ARC-C and HellaSwag report normalized accuracy; MMLU reports accuracy.

| Component variant | ARC-C | HellaSwag | MMLU | Avg. |
|---|---|---|---|---|
| Layer-local | 0.515 | 0.715 | 0.651 | 0.627 |
| Cross-layer fixed-$k$ | 0.515 | 0.720 | 0.647 | 0.627 |
| Adaptive prototype | 0.504 | 0.716 | 0.639 | 0.620 |

The layer-local variant already provides a strong consolidation baseline, indicating that explicit reassignment to retained prototypes can preserve much of the original expert-pool behavior without modifying weights. Allowing a cross-layer candidate pool with a fixed per-layer budget slightly improves HellaSwag while maintaining a similar average score, suggesting that neighboring layers can provide useful substitutes. Adaptive prototype selection replaces the uniform per-layer budget with a contribution–replaceability-aware rule. On these representative Qwen3 tasks, it remains close to cross-layer fixed-$k$, so adaptive selection is better viewed as a controlled capacity-allocation mechanism than as a uniform improvement on every task.

#### Effect of scope size\.

![Refer to caption](https://arxiv.org/html/2605.29350v1/x2.png)

Figure 2: Effect of scope size on Qwen3-30B-A3B at 50% routed-expert reduction. Small scopes achieve comparable performance, and scope 4 gives the best average score on this representative subset. Expanding the scope to 8 or 16 layers substantially degrades all tasks, indicating that cross-layer expert reuse is beneficial mainly within a local neighborhood.

Figure [2](https://arxiv.org/html/2605.29350#S5.F2) studies how the size of the cross-layer candidate pool affects consolidation. All variants use the same prototype-selection and remapping procedure, and only differ in the number of neighboring MoE layers grouped into one scope. The results show that cross-layer reuse is useful mainly within a limited local range. Scope sizes 1, 2, and 4 obtain comparable performance, while the best scope depends on the task. Scope 4 gives the best average score on this representative subset. In contrast, larger scopes perform substantially worse: when the scope is increased to 8 or 16 layers, all three tasks degrade, with especially large drops on HellaSwag and MMLU. This suggests that nearby layers can sometimes share useful prototypes, but distant layers likely correspond to different depth-specific transformations.

#### Prototype selection\.

Table 3: Prototype-selection ablation on Qwen3-30B-A3B at 50% routed-expert reduction. All variants use the same remapping mechanism. ARC-C and HellaSwag report normalized accuracy; MMLU reports accuracy.

| Selection policy | ARC-C | HellaSwag | MMLU | Avg. |
|---|---|---|---|---|
| Usage top-$k$ | 0.509 | 0.586 | 0.630 | 0.575 |
| REAP top-$k$ | 0.504 | 0.718 | 0.639 | 0.620 |
| Distance-only | 0.242 | 0.319 | 0.231 | 0.264 |
| Adaptive prototype | 0.504 | 0.716 | 0.640 | 0.620 |

Table [3](https://arxiv.org/html/2605.29350#S5.T3) compares different prototype-selection policies under the same remapping mechanism. The results show that routing-conditioned contribution is the dominant selection signal. Distance-only selection performs poorly, indicating that retaining hard-to-replace experts without considering routed contribution can preserve experts with limited downstream impact. Usage top-$k$ is also weaker, mainly due to a large drop on HellaSwag. REAP top-$k$ provides a much stronger contribution signal and matches adaptive prototype selection on this subset, suggesting that replaceability is best used as a controlled complement rather than as a standalone criterion.

#### Post\-hoc fusion diagnostics\.

Table 4: Post-hoc fusion diagnostics on Qwen3-30B-A3B at 50% routed-expert reduction. “None” is the default remapping-only ConMoE setting. All variants use the same adaptive prototype selection and reassignment structure.

| Fusion | ARC-C | HellaSwag | MMLU | Avg. |
| --- | --- | --- | --- | --- |
| None | 0.504 | 0.716 | 0.639 | 0.620 |
| Arcee | 0.488 | 0.531 | 0.512 | 0.510 |
| Weighted average | 0.451 | 0.469 | 0.358 | 0.426 |

Table [4](https://arxiv.org/html/2605.29350#S5.T4) evaluates whether weight-level fusion improves the selected prototype pool after the remapping structure is fixed. The “None” row is the default ConMoE setting used in the main experiments, where selected pretrained prototypes are reused directly. On Qwen3, direct prototype remapping is more stable than both tested fusion operators: Arcee fusion (Goddard et al., [2025](https://arxiv.org/html/2605.29350#bib.bib24)) substantially reduces HellaSwag and MMLU, and weighted averaging (Wortsman et al., [2022](https://arxiv.org/html/2605.29350#bib.bib39)) degrades performance further. These results indicate that the gains of ConMoE come from deterministic reassignment to retained pretrained prototypes, rather than from a particular weight-merging operator.

![Refer to caption](https://arxiv.org/html/2605.29350v1/x3.png)

Figure 3: Cross-layer nearest-neighbor analysis on Qwen3-30B-A3B. Left: source-layer to nearest-neighbor-layer distribution under normalized parameter distance within four-layer scopes. Right: fraction of experts in each layer whose nearest neighbor lies in a different layer. Overall, 50.4% of routed experts have a cross-layer nearest neighbor, indicating that expert substitutability is not strictly layer-local. The near-diagonal structure further suggests that such substitutability is local rather than model-wide.

### 5.4 Analysis: Local Cross-layer Expert Substitutability

We further analyze whether pretrained MoE checkpoints contain reusable expert structure across neighboring layers\. Rather than measuring downstream accuracy, this analysis tests whether a strictly layer\-local redundancy assumption is consistent with the geometry of pretrained experts\.

For each routed expert, we compute its nearest neighbor under the normalized parameter distance used by ConMoE within local four-layer scopes, and record whether the nearest neighbor lies in the same layer or in a different layer. Figure [3](https://arxiv.org/html/2605.29350#S5.F3) reports the source-to-nearest-layer distribution and the cross-layer nearest-neighbor fraction for each layer.

The heatmap shows that nearest neighbors concentrate in local near\-diagonal blocks, indicating that expert proximity is structured by depth rather than randomly distributed across the model\. However, this structure is not purely layer\-local: on Qwen3\-30B\-A3B, 50\.4% of routed experts have a cross\-layer nearest neighbor under four\-layer scopes, and several middle layers exceed 70%\. This motivates local cross\-layer candidate pools in ConMoE, while still cautioning against interpreting parameter\-space proximity as evidence that cross\-layer sharing is always preferable\.

The locality of the heatmap also explains why ConMoE uses bounded cross\-layer scopes rather than a single model\-wide expert pool\. Neighboring layers expose reusable redundancy, while distant layers may correspond to different depth\-specific transformations, consistent with the scope ablation where moderate scopes share expert capacity without forcing all layers into one global pool\.

## 6 Discussion

The experiments suggest three practical lessons\. First, deterministic reassignment is the most stable component: even layer\-local consolidation preserves much of the original expert\-pool behavior without modifying weights\. Second, cross\-layer reuse should remain local\. Moderate scopes can expose nearby reusable prototypes, whereas broader scopes introduce depth mismatch and degrade performance\. Third, routing\-conditioned contribution is the primary prototype\-selection signal, while replaceability mainly regularizes capacity allocation by discouraging redundant selections\. Post\-hoc fusion is not the source of the gains in our setting: the remapping\-only model is more stable than the tested fusion operators, distinguishing ConMoE from methods that construct new fused experts\.

## 7 Conclusion

We presented ConMoE, a train\-free framework that casts one\-shot MoE compression as prototype selection plus deterministic expert\-slot remapping\. By reusing selected pretrained experts directly, ConMoE reduces the logical routed\-expert pool while preserving the original router interface\. Experiments across pretrained MoE language models show competitive quality–storage trade\-offs, with ablations highlighting deterministic remapping as the most robust component\.

## Limitations

ConMoE is train\-free but relies on calibration data to estimate routing demand and expert saliency\. Its cross\-layer reuse is also local: overly broad scopes can introduce depth mismatch and reduce performance\. In addition, prototype refinement is model\-dependent and should be viewed as an optional module rather than the core contribution\. Finally, our compression ratios measure the logical routed\-expert budget; realizing the same memory savings in deployment requires a shared\-prototype runtime or checkpoint format rather than a fully materialized compatibility checkpoint\.

## References

- S. Bai, H. Li, J. Zhang, Z. Hong, and S. Guo (2025). DiEP: adaptive mixture-of-experts compression through differentiable expert pruning. External Links: 2509.16105, [Link](https://arxiv.org/abs/2509.16105). Cited by: [§2](https://arxiv.org/html/2605.29350#S2.SS0.SSS0.Px2.p1.1).
- Y. Bisk, R. Zellers, R. L. Bras, J. Gao, and Y. Choi (2019). PIQA: reasoning about physical commonsense in natural language. External Links: 1911.11641, [Link](https://arxiv.org/abs/1911.11641). Cited by: [§5.1](https://arxiv.org/html/2605.29350#S5.SS1.SSS0.Px2.p1.1).
- I. Chen, H. Liu, W. Sun, C. Chao, Y. Hsu, and C. Lee (2025). Retraining-free merging of sparse moe via hierarchical clustering. Cited by: [§1](https://arxiv.org/html/2605.29350#S1.p2.1), [§2](https://arxiv.org/html/2605.29350#S2.SS0.SSS0.Px1.p1.1), [§5.1](https://arxiv.org/html/2605.29350#S5.SS1.SSS0.Px3.p1.1).
- T. Chen, S. Huang, Y. Xie, B. Jiao, D. Jiang, H. Zhou, J. Li, and F. Wei (2022). Task-specific expert pruning for sparse mixture-of-experts. External Links: 2206.00277, [Link](https://arxiv.org/abs/2206.00277). Cited by: [§1](https://arxiv.org/html/2605.29350#S1.p2.1), [§2](https://arxiv.org/html/2605.29350#S2.SS0.SSS0.Px1.p1.1).
- C. Clark, K. Lee, M. Chang, T. Kwiatkowski, M. Collins, and K. Toutanova (2019). BoolQ: exploring the surprising difficulty of natural yes/no questions. External Links: 1905.10044, [Link](https://arxiv.org/abs/1905.10044). Cited by: [§5.1](https://arxiv.org/html/2605.29350#S5.SS1.SSS0.Px2.p1.1).
- P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord (2018). Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv:1803.05457v1. Cited by: [§5.1](https://arxiv.org/html/2605.29350#S5.SS1.SSS0.Px2.p1.1).
- D. Dai, C. Deng, C. Zhao, R. X. Xu, H. Gao, D. Chen, J. Li, W. Zeng, X. Yu, Y. Wu, Z. Xie, Y. K. Li, P. Huang, F. Luo, C. Ruan, Z. Sui, and W. Liang (2024). DeepSeekMoE: towards ultimate expert specialization in mixture-of-experts language models. External Links: 2401.06066, [Link](https://arxiv.org/abs/2401.06066). Cited by: [§1](https://arxiv.org/html/2605.29350#S1.p1.1), [§5.1](https://arxiv.org/html/2605.29350#S5.SS1.SSS0.Px1.p1.1).
- W. Fedus, B. Zoph, and N. Shazeer (2022). Switch transformers: scaling to trillion parameter models with simple and efficient sparsity. External Links: 2101.03961, [Link](https://arxiv.org/abs/2101.03961). Cited by: [§1](https://arxiv.org/html/2605.29350#S1.p1.1).
- T. Gale, D. Narayanan, C. Young, and M. Zaharia (2022). MegaBlocks: efficient sparse training with mixture-of-experts. External Links: 2211.15841, [Link](https://arxiv.org/abs/2211.15841). Cited by: [§1](https://arxiv.org/html/2605.29350#S1.p1.1).
- L. Gao, J. Tow, B. Abbasi, S. Biderman, S. Black, A. DiPofi, C. Foster, L. Golding, J. Hsu, A. Le Noac’h, H. Li, K. McDonell, N. Muennighoff, C. Ociepa, J. Phang, L. Reynolds, H. Schoelkopf, A. Skowron, L. Sutawika, E. Tang, A. Thite, B. Wang, K. Wang, and A. Zou (2024). The language model evaluation harness. Zenodo. External Links: [Document](https://dx.doi.org/10.5281/zenodo.12608602), [Link](https://zenodo.org/records/12608602). Cited by: [§5.1](https://arxiv.org/html/2605.29350#S5.SS1.SSS0.Px2.p1.1).
- C. Goddard, S. Siriwardhana, M. Ehghaghi, L. Meyers, V. Karpukhin, B. Benedict, M. McQuade, and J. Solawetz (2025). Arcee’s mergekit: a toolkit for merging large language models. External Links: 2403.13257, [Link](https://arxiv.org/abs/2403.13257). Cited by: [§5.3](https://arxiv.org/html/2605.29350#S5.SS3.SSS0.Px4.p1.1).
- D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt (2021). Measuring massive multitask language understanding. External Links: 2009.03300, [Link](https://arxiv.org/abs/2009.03300). Cited by: [§5.1](https://arxiv.org/html/2605.29350#S5.SS1.SSS0.Px2.p1.1).
- M. Huang, H. Shi, C. Zheng, Y. Wu, G. Chen, X. Yu, Y. Yin, and H. Cheng (2026). UniPool: a globally shared expert pool for mixture-of-experts. External Links: 2605.06665, [Link](https://arxiv.org/abs/2605.06665). Cited by: [§2](https://arxiv.org/html/2605.29350#S2.SS0.SSS0.Px2.p1.1).
- A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bamford, D. S. Chaplot, D. de las Casas, E. B. Hanna, F. Bressand, G. Lengyel, G. Bour, G. Lample, L. R. Lavaud, L. Saulnier, M. Lachaux, P. Stock, S. Subramanian, S. Yang, S. Antoniak, T. L. Scao, T. Gervet, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed (2024). Mixtral of experts. External Links: 2401.04088, [Link](https://arxiv.org/abs/2401.04088). Cited by: [§1](https://arxiv.org/html/2605.29350#S1.p1.1).
- M. Lasby, I. Lazarevich, N. Sinnadurai, S. Lie, Y. Ioannou, and V. Thangarasa (2026). REAP the experts: why pruning prevails for one-shot moe compression. External Links: 2510.13999, [Link](https://arxiv.org/abs/2510.13999). Cited by: [§1](https://arxiv.org/html/2605.29350#S1.p2.1), [§2](https://arxiv.org/html/2605.29350#S2.SS0.SSS0.Px1.p1.1), [§5.1](https://arxiv.org/html/2605.29350#S5.SS1.SSS0.Px3.p1.1).
- D. Lepikhin, H. Lee, Y. Xu, D. Chen, O. Firat, Y. Huang, M. Krikun, N. Shazeer, and Z. Chen (2020). GShard: scaling giant models with conditional computation and automatic sharding. External Links: 2006.16668, [Link](https://arxiv.org/abs/2006.16668). Cited by: [§1](https://arxiv.org/html/2605.29350#S1.p1.1).
- M. Lewis, S. Bhosale, T. Dettmers, N. Goyal, and L. Zettlemoyer (2021). BASE layers: simplifying training of large, sparse models. External Links: 2103.16716, [Link](https://arxiv.org/abs/2103.16716). Cited by: [§1](https://arxiv.org/html/2605.29350#S1.p1.1).
- L. Li, Q. Zhu, J. Wang, W. Li, H. Gu, S. Han, and Y. Guo (2026). Sub-moe: efficient mixture-of-expert llms compression via subspace expert merging. In Proceedings of the 40th Annual AAAI Conference on Artificial Intelligence. Cited by: [§1](https://arxiv.org/html/2605.29350#S1.p2.1), [§2](https://arxiv.org/html/2605.29350#S2.SS0.SSS0.Px1.p1.1).
- P. Li, Z. Zhang, P. Yadav, Y. Sung, Y. Cheng, M. Bansal, and T. Chen (2024). Merge, then compress: demystify efficient smoe with hints from its routing policy. External Links: 2310.01334, [Link](https://arxiv.org/abs/2310.01334). Cited by: [§1](https://arxiv.org/html/2605.29350#S1.p2.1), [§2](https://arxiv.org/html/2605.29350#S2.SS0.SSS0.Px1.p1.1), [§5.1](https://arxiv.org/html/2605.29350#S5.SS1.SSS0.Px3.p1.1).
- Z. Liu, S. Tang, B. Sun, Z. Shen, and X. Yuan (2026). EvoESAP: non-uniform expert pruning for sparse moe. External Links: 2603.06003, [Link](https://arxiv.org/abs/2603.06003). Cited by: [§2](https://arxiv.org/html/2605.29350#S2.SS0.SSS0.Px1.p1.1), [§2](https://arxiv.org/html/2605.29350#S2.SS0.SSS0.Px2.p1.1).
- X. Lu, Q. Liu, Y. Xu, A. Zhou, S. Huang, B. Zhang, J. Yan, and H. Li (2024). Not all experts are equal: efficient expert pruning and skipping for mixture-of-experts large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), L. Ku, A. Martins, and V. Srikumar (Eds.), Bangkok, Thailand, pp. 6159–6172. External Links: [Link](https://aclanthology.org/2024.acl-long.334/), [Document](https://dx.doi.org/10.18653/v1/2024.acl-long.334). Cited by: [§1](https://arxiv.org/html/2605.29350#S1.p2.1), [§2](https://arxiv.org/html/2605.29350#S2.SS0.SSS0.Px1.p1.1).
- R. Miao, Y. Yao, Z. Wang, Z. Wang, B. Yi, L. Liu, Y. Zhao, and T. Yang (2025). MergeMoE: efficient compression of moe models via expert output merging. External Links: 2510.14436, [Link](https://arxiv.org/abs/2510.14436). Cited by: [§1](https://arxiv.org/html/2605.29350#S1.p2.1), [§2](https://arxiv.org/html/2605.29350#S2.SS0.SSS0.Px1.p1.1).
- N. Muennighoff, L. Soldaini, D. Groeneveld, K. Lo, J. Morrison, S. Min, W. Shi, P. Walsh, O. Tafjord, N. Lambert, Y. Gu, S. Arora, A. Bhagia, D. Schwenk, D. Wadden, A. Wettig, B. Hui, T. Dettmers, D. Kiela, A. Farhadi, N. A. Smith, P. W. Koh, A. Singh, and H. Hajishirzi (2025). OLMoE: open mixture-of-experts language models. External Links: 2409.02060, [Link](https://arxiv.org/abs/2409.02060). Cited by: [§1](https://arxiv.org/html/2605.29350#S1.p1.1), [§5.1](https://arxiv.org/html/2605.29350#S5.SS1.SSS0.Px1.p1.1).
- S. Rajbhandari, C. Li, Z. Yao, M. Zhang, R. Y. Aminabadi, A. A. Awan, J. Rasley, and Y. He (2022). DeepSpeed-moe: advancing mixture-of-experts inference and training to power next-generation ai scale. External Links: 2201.05596, [Link](https://arxiv.org/abs/2201.05596). Cited by: [§1](https://arxiv.org/html/2605.29350#S1.p1.1).
- K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y. Choi (2019). WinoGrande: an adversarial winograd schema challenge at scale. External Links: 1907.10641, [Link](https://arxiv.org/abs/1907.10641). Cited by: [§5.1](https://arxiv.org/html/2605.29350#S5.SS1.SSS0.Px2.p1.1).
- N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean (2017). Outrageously large neural networks: the sparsely-gated mixture-of-experts layer. External Links: 1701.06538, [Link](https://arxiv.org/abs/1701.06538). Cited by: [§1](https://arxiv.org/html/2605.29350#S1.p1.1).
- M. Wortsman, G. Ilharco, S. Y. Gadre, R. Roelofs, R. Gontijo-Lopes, A. S. Morcos, H. Namkoong, A. Farhadi, Y. Carmon, S. Kornblith, and L. Schmidt (2022). Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In Proceedings of the 39th International Conference on Machine Learning, K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, and S. Sabato (Eds.), Proceedings of Machine Learning Research, Vol. 162, pp. 23965–23998. External Links: [Link](https://proceedings.mlr.press/v162/wortsman22a.html). Cited by: [§1](https://arxiv.org/html/2605.29350#S1.p4.1), [§5.3](https://arxiv.org/html/2605.29350#S5.SS3.SSS0.Px4.p1.1).
- P. Yadav, D. Tam, L. Choshen, C. Raffel, and M. Bansal (2023). TIES-merging: resolving interference when merging models. In Thirty-seventh Conference on Neural Information Processing Systems. External Links: [Link](https://openreview.net/forum?id=xtaX3WyCj1). Cited by: [§1](https://arxiv.org/html/2605.29350#S1.p4.1).
- A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, C. Zheng, D. Liu, F. Zhou, F. Huang, F. Hu, H. Ge, H. Wei, H. Lin, J. Tang, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Zhou, J. Lin, K. Dang, K. Bao, K. Yang, L. Yu, L. Deng, M. Li, M. Xue, M. Li, P. Zhang, P. Wang, Q. Zhu, R. Men, R. Gao, S. Liu, S. Luo, T. Li, T. Tang, W. Yin, X. Ren, X. Wang, X. Zhang, X. Ren, Y. Fan, Y. Su, Y. Zhang, Y. Zhang, Y. Wan, Y. Liu, Z. Wang, Z. Cui, Z. Zhang, Z. Zhou, and Z. Qiu (2025). Qwen3 technical report. External Links: 2505.09388, [Link](https://arxiv.org/abs/2505.09388). Cited by: [§1](https://arxiv.org/html/2605.29350#S1.p1.1), [§5.1](https://arxiv.org/html/2605.29350#S5.SS1.SSS0.Px1.p1.1).
- C. Yang, Y. Sui, J. Xiao, L. Huang, Y. Gong, Y. Duan, W. Jia, M. Yin, Y. Cheng, and B. Yuan (2024). MoE-i2: compressing mixture of experts models through inter-expert pruning and intra-expert low-rank decomposition. In Findings of the Association for Computational Linguistics: EMNLP 2024, Y. Al-Onaizan, M. Bansal, and Y. Chen (Eds.), Miami, Florida, USA, pp. 10456–10466. External Links: [Link](https://aclanthology.org/2024.findings-emnlp.612/), [Document](https://dx.doi.org/10.18653/v1/2024.findings-emnlp.612). Cited by: [§2](https://arxiv.org/html/2605.29350#S2.SS0.SSS0.Px1.p1.1).
- R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi (2019). HellaSwag: can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Cited by: [§5.1](https://arxiv.org/html/2605.29350#S5.SS1.SSS0.Px2.p1.1).

## Appendix A Reproducibility and Compliance

#### AI assistant use\.

AI assistants were used to support manuscript writing, language polishing, and statistical analysis\. All AI\-assisted analyses, results, claims, and final text were reviewed and verified by the authors\.

#### Artifacts and licenses\.

We use publicly available pretrained MoE checkpoints and benchmark datasets according to their released licenses and terms of use\. Our experiments are for research purposes\. We do not redistribute the original model weights or benchmark data; the released code contains scripts for reproducing compression and evaluation with user\-provided access to the corresponding artifacts\.

#### Compute\.

All experiments, including expert\-pool consolidation and downstream evaluation, were implemented and run on two NVIDIA A100\-SXM4\-80GB GPUs\.

## Appendix B Implementation Details

### B.1 Distance and Normalization

Let $\mathcal{M}=\{\mathrm{gate},\mathrm{up},\mathrm{down}\}$ denote the set of routed expert projections. For each projection $m\in\mathcal{M}$, we define

$$
\delta_{m}(e,e')=\frac{2\,\|W_{m}^{e}-W_{m}^{e'}\|_{F}}{\|W_{m}^{e}\|_{F}+\|W_{m}^{e'}\|_{F}+2\epsilon}.
$$

The expert distance is the average projection distance:

$$
d(e,e')=\frac{1}{|\mathcal{M}|}\sum_{m\in\mathcal{M}}\delta_{m}(e,e').
$$

This normalization prevents projections with larger parameter norms from dominating the distance.

For any scope-level score $x_{e}$, min–max normalization is

$$
\mathrm{Norm}_{G}(x_{e})=\frac{x_{e}-x_{G}^{\min}}{x_{G}^{\max}-x_{G}^{\min}+\epsilon},
$$

where

$$
x_{G}^{\min}=\min_{u\in\mathcal{E}_{G}}x_{u},\qquad x_{G}^{\max}=\max_{u\in\mathcal{E}_{G}}x_{u}.
$$
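The two definitions above translate fairly directly into NumPy. The following is a minimal sketch, not the authors' code: the dictionary layout of an expert (`"gate"`, `"up"`, `"down"` keys mapping to 2-D arrays), the function names, and the value of `EPS` are assumptions made for illustration.

```python
import numpy as np

EPS = 1e-8  # assumed value; this section does not state the epsilon used

def projection_distance(w_a, w_b, eps=EPS):
    """delta_m(e, e'): Frobenius distance normalized by the two weight norms."""
    num = 2.0 * np.linalg.norm(w_a - w_b)  # default norm of a 2-D array is Frobenius
    den = np.linalg.norm(w_a) + np.linalg.norm(w_b) + 2.0 * eps
    return num / den

def expert_distance(expert_a, expert_b, eps=EPS):
    """d(e, e'): mean of delta_m over the gate, up, and down projections."""
    names = ("gate", "up", "down")
    total = sum(projection_distance(expert_a[m], expert_b[m], eps) for m in names)
    return total / len(names)

def minmax_normalize(scores, eps=EPS):
    """Norm_G applied to every score x_e within one scope G."""
    scores = np.asarray(scores, dtype=float)
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo + eps)
```

Because each projection distance is divided by the sum of the two norms, it stays in the range from 0 (identical weights) to roughly 2 (weights of opposite sign), whatever the scale of the projection.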

### B.2 Compression Accounting and Evaluation Checkpoints

All methods compress only routed experts\. Shared experts, routers, attention blocks, embeddings, normalization layers, and output heads are kept unchanged\.

We report the logical routed\-expert reduction ratio\. A selected prototype is counted once in the logical reduced pool, even if multiple original expert slots are assigned to it\. For compatibility with standard HuggingFace evaluation pipelines, we materialize an evaluation checkpoint by filling each original expert slot with the weights of its assigned prototype\. This materialized checkpoint preserves the original architecture for evaluation only and is not counted as compressed storage\.
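The accounting rule can be illustrated with a small sketch. It assumes the remapping is stored as a dictionary from each original slot to its prototype, with slots keyed by a hypothetical `(layer, expert_index)` tuple; neither the data layout nor the function names come from the paper.

```python
def logical_reduction(remap):
    """Fraction of routed-expert slots removed from the logical pool.

    Each distinct prototype is counted once, however many slots point to it.
    """
    num_slots = len(remap)
    num_prototypes = len(set(remap.values()))
    return 1.0 - num_prototypes / num_slots

def materialize(remap, prototype_weights):
    """Fill every original slot with its prototype's weights (evaluation only)."""
    return {slot: prototype_weights[proto] for slot, proto in remap.items()}

# Four slots in layer 0 consolidated onto two prototypes (toy example).
remap = {(0, 0): (0, 0), (0, 1): (0, 0), (0, 2): (0, 2), (0, 3): (0, 2)}
prototype_weights = {(0, 0): "weights of expert (0, 0)", (0, 2): "weights of expert (0, 2)"}

ratio = logical_reduction(remap)                  # two prototypes for four slots
checkpoint = materialize(remap, prototype_weights)  # four filled slots
```

In this sketch the materialized slots share references to the prototype objects; a checkpoint written to disk in the original architecture would store them as separate copies, which is why it is not counted as compressed storage.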

### B.3 Calibration Data

ConMoE and the pruning baselines use unlabeled calibration text to estimate routing statistics and expert\-output saliency\. Calibration uses only text prompts; it does not use labels, losses, gradients, or post\-compression fine\-tuning\. The default calibration source is matched to the benchmark family used in evaluation\. For each task, we sample 128 examples with seed 42 and format them as multiple\-choice prompts\.
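A seeded sampling step of this kind might look as follows. This is a sketch under stated assumptions: the prompt template in `format_multiple_choice` and both function names are hypothetical, and only the sample size (128) and the seed (42) are taken from the text.

```python
import random

def sample_calibration(examples, n=128, seed=42):
    """Draw a reproducible calibration subset without replacement."""
    pool = list(examples)
    rng = random.Random(seed)  # local generator, leaves global random state untouched
    return rng.sample(pool, min(n, len(pool)))

def format_multiple_choice(question, choices):
    """Render one example as a text-only prompt; the gold label is never included."""
    letters = "ABCDEFGH"
    lines = [f"Question: {question}"]
    lines += [f"{letters[i]}. {choice}" for i, choice in enumerate(choices)]
    lines.append("Answer:")
    return "\n".join(lines)
```

Using a dedicated `random.Random(seed)` instance keeps the subset identical across runs regardless of other random calls in the pipeline.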

Table 5: Default calibration sources. Calibration uses only text prompts and does not use labels or losses.

| Task | Source | Split |
| --- | --- | --- |
| ARC-C | AI2 ARC-Challenge | validation |
| ARC-E | AI2 ARC-Easy | validation |
| BoolQ | SuperGLUE BoolQ | train |
| HellaSwag | HellaSwag | train |
| MMLU | MMLU all | validation |
| PIQA | PIQA | train |
| WinoGrande | WinoGrande-XL | train |

### B.4 Evaluation Metrics

The main table reports six multiple\-choice benchmarks: WinoGrande, ARC\-C, ARC\-E, BoolQ, HellaSwag, and PIQA\. The ablations additionally use MMLU as one representative knowledge\-intensive task\. We report normalized accuracy for ARC\-C, ARC\-E, and HellaSwag, and accuracy for BoolQ, MMLU, PIQA, and WinoGrande\. All averages are simple arithmetic means over the tasks included in the corresponding table or figure\.

## Appendix C Baseline and Ablation Details

#### Frequency pruning\.

Frequency pruning is a routing-only pruning baseline. It ranks routed experts by the number of calibration tokens for which they appear in the router top-$k$ set, and keeps the highest-frequency experts under the matched expert budget. It tests whether a simple usage signal is sufficient for expert-pool reduction.
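The ranking described above can be sketched for a single layer as follows. The input format (one list of top-$k$ expert ids per calibration token), the tie-breaking by expert id, and the function name are assumptions for illustration.

```python
from collections import Counter

def frequency_prune(topk_per_token, num_experts, keep):
    """Keep the `keep` experts that appear most often in the router top-k sets.

    topk_per_token: iterable of per-token expert-id lists for one MoE layer.
    Ties are broken by the lower expert id (an arbitrary, assumed rule).
    """
    counts = Counter()
    for token_experts in topk_per_token:
        counts.update(set(token_experts))  # count each expert once per token
    ranked = sorted(range(num_experts), key=lambda e: (-counts[e], e))
    return sorted(ranked[:keep])

# Toy layer with 4 experts, top-2 routing, and 4 calibration tokens.
kept = frequency_prune([[0, 1], [0, 2], [0, 1], [3, 1]], num_experts=4, keep=2)
```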

#### REAP pruning\.

The REAP pruning baseline ranks experts by a contribution score based on router weight and expert\-output norm on calibration tokens\. For each expert, the score is computed over tokens that route to that expert\. This provides a stronger pruning baseline than frequency alone because it accounts for both routing selection and the magnitude of expert contribution\. In the main comparison, REAP pruning uses a uniform per\-layer budget, matching the pruning setup used by this class of post\-training expert pruning methods\.
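One plausible reading of this score is the mean, over tokens routed to an expert, of the router weight times the norm of that expert's output. The sketch below implements that reading for one layer; the exact aggregation, the array layout, and the function names are assumptions rather than details confirmed by this section.

```python
import numpy as np

def contribution_scores(gate_weights, output_norms, routed_mask):
    """Per-expert mean of (router weight * expert-output norm) over routed tokens.

    All three inputs have shape (num_tokens, num_experts); routed_mask is 1
    where the expert is in the token's top-k set and 0 elsewhere.
    """
    contrib = gate_weights * output_norms * routed_mask
    counts = routed_mask.sum(axis=0)
    return contrib.sum(axis=0) / np.maximum(counts, 1)  # experts never routed to score 0

def uniform_keep(scores, keep):
    """Keep the `keep` highest-scoring experts of one layer (uniform per-layer budget)."""
    order = np.argsort(-scores, kind="stable")
    return sorted(order[:keep].tolist())

gate = np.array([[0.7, 0.3, 0.0], [0.6, 0.0, 0.4]])
norms = np.array([[1.0, 2.0, 5.0], [3.0, 1.0, 1.0]])
mask = np.array([[1.0, 1.0, 0.0], [1.0, 0.0, 1.0]])
scores = contribution_scores(gate, norms, mask)
```

Unlike a pure frequency count, an expert that is selected often but with small router weights or small outputs receives a low score here.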

#### M\-SMoE merging\.

The M\-SMoE baseline first selects high\-usage core experts and assigns the remaining experts to similar cores\. The resulting groups are merged with usage\-weighted averaging\. This baseline represents routing\-statistics\-guided expert merging\.
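The grouping and merging steps can be sketched as below. This is an illustration only: experts are represented by flattened weight vectors, and cosine similarity of those vectors stands in for the similarity measure, which this section does not specify. The function name is hypothetical.

```python
import numpy as np

def usage_weighted_merge(weights, usage, num_cores):
    """Group experts around high-usage cores and merge each group.

    weights: (num_experts, dim) flattened expert parameters (nonzero rows assumed).
    usage:   (num_experts,) routing counts on calibration data.
    Returns (assign, merged): assign[e] is the core of expert e, and merged maps
    each core id to the usage-weighted average of its group.
    """
    cores = np.argsort(-usage, kind="stable")[:num_cores]
    unit = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    sim = unit @ unit[cores].T              # cosine similarity to each core
    assign = cores[np.argmax(sim, axis=1)]  # most similar core per expert
    assign[cores] = cores                   # a core always stays in its own group
    merged = {}
    for c in cores:
        members = np.where(assign == c)[0]
        w = usage[members] / usage[members].sum()  # assumes the core has nonzero usage
        merged[int(c)] = (w[:, None] * weights[members]).sum(axis=0)
    return assign, merged
```

Note the contrast with remapping-only consolidation: here the retained weights are new averaged tensors rather than unmodified pretrained experts.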

#### HC\-SMoE merging\.

The HC\-SMoE baseline groups experts by output behavior on calibration inputs\. It clusters expert\-output features and uses a representative expert for each cluster before applying the same budgeted merging protocol\. This baseline represents output\-feature\-based expert merging\.

#### Layer\-local consolidation\.

Layer\-local consolidation uses the same expert\-to\-prototype remapping mechanism as ConMoE, but restricts the prototype pool to each layer independently\. This variant isolates explicit reassignment without allowing cross\-layer reuse\.
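A deterministic remapping step might be sketched as follows, assuming each non-retained expert is redirected to its nearest retained prototype under the expert distance. The nearest-prototype rule, the tie-breaking by prototype id, and the function name are assumptions; this section does not spell out the assignment rule.

```python
def remap_to_prototypes(distance, prototypes):
    """Map every expert slot in one scope to a retained prototype.

    distance:   square nested list, distance[e][p] between experts e and p.
    prototypes: ids of the retained experts in this scope.
    Prototypes map to themselves; other slots map to the nearest prototype,
    with ties broken by the lower prototype id so the result is deterministic.
    """
    proto_ids = sorted(set(prototypes))
    remap = {}
    for e in range(len(distance)):
        if e in proto_ids:
            remap[e] = e
        else:
            remap[e] = min(proto_ids, key=lambda p: (distance[e][p], p))
    return remap

# One layer with four experts; experts 0 and 3 are retained.
distance = [
    [0.0, 0.2, 0.8, 0.9],
    [0.2, 0.0, 0.7, 0.9],
    [0.8, 0.7, 0.0, 0.1],
    [0.9, 0.9, 0.1, 0.0],
]
remap = remap_to_prototypes(distance, prototypes=[0, 3])
```

For the layer-local variant the distance matrix covers the experts of a single layer; a cross-layer scope would pass a matrix over all experts of the grouped layers instead.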

#### Cross-layer fixed-$k$ consolidation.

Cross-layer fixed-$k$ consolidation allows experts in neighboring layers to share a candidate prototype pool, but still assigns an equal number of prototypes to each layer. This variant isolates the effect of the cross-layer candidate pool before adaptive prototype selection.

#### Adaptive prototype selection\.

Adaptive prototype selection is the selection rule used by ConMoE in the ablation tables\. It selects prototypes using both routing\-conditioned contribution and replaceability\. Contribution measures how much an expert contributes when selected by the router, while replaceability measures how difficult it is to substitute that expert with another expert in the same local scope\.
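To make the interplay of the two signals concrete, the sketch below combines them as a weighted sum of min–max-normalized scores. The combination rule, the weight `lam`, the use of nearest-neighbor distance as the replaceability signal, and the function names are all hypothetical; the actual ConMoE selection rule is defined in the method section and may differ.

```python
def minmax(values, eps=1e-8):
    """Min-max normalize a list of scores within one scope."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo + eps) for v in values]

def select_prototypes(contribution, distance, budget, lam=0.3):
    """Pick `budget` prototypes in one scope (needs at least two experts).

    contribution: per-expert routing-conditioned contribution scores.
    distance:     square nested list of expert distances within the scope.
    lam:          hypothetical weight on the replaceability term.
    """
    n = len(contribution)
    # Hard-to-replace experts are far from their nearest neighbor.
    hardness = [min(distance[e][u] for u in range(n) if u != e) for e in range(n)]
    c, r = minmax(contribution), minmax(hardness)
    score = [c[e] + lam * r[e] for e in range(n)]
    ranked = sorted(range(n), key=lambda e: (-score[e], e))
    return sorted(ranked[:budget])

contribution = [0.9, 0.8, 0.1, 0.2]
distance = [
    [0.0, 0.1, 0.9, 0.5],
    [0.1, 0.0, 0.9, 0.5],
    [0.9, 0.9, 0.0, 0.8],
    [0.5, 0.5, 0.8, 0.0],
]
small_lam = select_prototypes(contribution, distance, budget=2, lam=0.3)
large_lam = select_prototypes(contribution, distance, budget=2, lam=2.0)
```

With a small `lam` the high-contribution experts are kept; with a large `lam` the selection drifts toward isolated experts with little routed contribution, which mirrors the failure mode of the distance-only ablation.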

#### Usage top-$k$, REAP top-$k$, and distance-only selection.

The prototype-selection ablation compares adaptive prototype selection against three controlled alternatives. Usage top-$k$ selects prototypes using only routing frequency. REAP top-$k$ selects prototypes using only the REAP contribution score. Distance-only selection ignores routing contribution and keeps experts that are hardest to replace under the normalized expert distance. These variants separate usage-only, contribution-only, and replaceability-only signals.

#### Post\-hoc fusion diagnostics\.

The default ConMoE model is remapping-only: the selected pretrained prototypes are reused directly. For diagnostic ablations, we also test two post-hoc fusion operators after the prototype set and reassignment map have already been fixed. Arcee applies a base-preserving selective weight fusion operator to each prototype-centered cluster. Weighted average combines the experts in each cluster using routing-derived weights. These fusion operators are not part of the default ConMoE pipeline.

## Appendix D Analysis Protocols

#### Cross\-layer nearest\-neighbor analysis\.

The cross-layer analysis in Section [5.4](https://arxiv.org/html/2605.29350#S5.SS4) uses the same normalized expert distance as ConMoE. For each routed expert, we find its nearest neighbor within the local scope and record whether the nearest neighbor lies in the same layer or a different layer. This analysis does not depend on downstream labels and is used only to examine whether expert substitutability is strictly layer-local.
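The protocol can be sketched for a single scope as follows, assuming a precomputed distance matrix over all experts in the scope and a list giving the layer of each expert. The function name, the input layout, and the tie-breaking by expert id are illustrative assumptions.

```python
def cross_layer_nn_fraction(distance, layer_of):
    """Fraction of experts whose nearest neighbor lies in a different layer.

    distance: square nested list of expert distances within one scope.
    layer_of: layer index of each expert, aligned with the rows of `distance`.
    """
    n = len(distance)
    cross = 0
    for e in range(n):
        others = (u for u in range(n) if u != e)
        nearest = min(others, key=lambda u: (distance[e][u], u))  # ties: lower id
        cross += layer_of[nearest] != layer_of[e]
    return cross / n

distance = [
    [0.0, 0.1, 0.9, 0.5],
    [0.1, 0.0, 0.9, 0.5],
    [0.9, 0.9, 0.0, 0.8],
    [0.5, 0.5, 0.8, 0.0],
]
fraction = cross_layer_nn_fraction(distance, layer_of=[0, 0, 1, 1])
```

Applying the same function per source layer, rather than per scope, would give the per-layer fractions of the kind plotted in the right panel of Figure 3.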

#### Scope\-size analysis\.

The scope\-size ablation varies the number of neighboring MoE layers that share a candidate prototype pool\. Scope size one corresponds to layer\-local consolidation\. Larger scopes allow local cross\-layer reuse, while overly large scopes test whether distant layers introduce depth mismatch\. All points in the scope\-size figure use the same routed\-expert reduction ratio and differ only in scope size\.
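The grouping of layers into scopes can be sketched as below, assuming scopes are contiguous, non-overlapping blocks of MoE layers. That partitioning scheme and the function name are assumptions; the text only states that neighboring layers share a candidate pool.

```python
def build_scopes(moe_layers, scope_size):
    """Partition MoE layer indices into contiguous scopes of `scope_size` layers.

    The last scope may be smaller when the layer count is not a multiple
    of the scope size.
    """
    if scope_size < 1:
        raise ValueError("scope_size must be at least 1")
    return [moe_layers[i:i + scope_size] for i in range(0, len(moe_layers), scope_size)]

layers = list(range(8))
local = build_scopes(layers, 1)   # layer-local consolidation
blocks = build_scopes(layers, 4)  # four-layer scopes
```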
