Post-Trained MoE Can Skip Half Experts via Self-Distillation
Summary
ZEDA is a low-cost framework that converts post-trained static MoE models into dynamic ones by injecting zero-output experts and adapting the model through self-distillation. On Qwen3-30B-A3B and GLM-4.7-Flash it removes over 50% of expert FLOPs with marginal accuracy loss across 11 benchmarks.
Cached at: 05/19/26, 06:31 AM
Source: https://huggingface.co/papers/2605.18643
Abstract
Zero-Expert Self-Distillation Adaptation (ZEDA) enables efficient dynamic Mixture-of-Experts models by converting static models into adaptive ones with reduced computational costs and improved inference speed.
Mixture-of-Experts (MoE) scales language models efficiently through sparse expert activation, and its dynamic variant further reduces computation by adjusting the activated experts in an input-dependent manner. Existing dynamic MoE methods usually rely on pre-training from scratch or task-specific adaptation, leaving the practical conversion of fully trained MoE models underexplored. Enabling such adaptation would directly reduce inference costs by allowing easy tokens to bypass unnecessary experts during serving. This paper introduces Zero-Expert Self-Distillation Adaptation (ZEDA), a low-cost framework that transforms post-trained static MoE models into efficient dynamic ones. To stabilize this architectural conversion, ZEDA injects parameter-free zero-output experts into each MoE layer and adapts the augmented model through two-stage self-distillation, utilizing the original MoE as a frozen teacher and applying a group-level balancing loss. On Qwen3-30B-A3B and GLM-4.7-Flash across 11 benchmarks spanning math, code, and instruction following, ZEDA eliminates over 50% of expert FLOPs at marginal accuracy loss. It outperforms the strongest dynamic MoE baseline by 6.1 and 4.0 points on the two models, and delivers a ~1.20× end-to-end inference speedup.
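The zero-output expert idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (`route_token`, `moe_forward`, `kl_divergence`), the softmax top-k router, the lack of weight renormalization, and the toy experts are all assumptions made for illustration. The sketch shows that when a routing slot lands on a zero-output expert, nothing is executed for that slot, so the number of real experts run per token varies with the input. The KL term is a common generic distillation objective between a frozen teacher and a student; the paper's actual two-stage objective and its group-level balancing loss are not reproduced here.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, num_real_experts, top_k):
    # Assumption: the router scores real experts (indices < num_real_experts)
    # and zero-output experts (indices >= num_real_experts) together.
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Zero-output experts contribute nothing, so they are simply dropped.
    return [(i, probs[i]) for i in top if i < num_real_experts]

def moe_forward(x, router_logits, experts, top_k):
    # Returns the weighted sum of the selected real experts' outputs and
    # the number of real experts that were actually executed.
    active = route_token(router_logits, len(experts), top_k)
    out = [0.0] * len(x)
    for idx, weight in active:
        y = experts[idx](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out, len(active)

def kl_divergence(teacher_probs, student_probs, eps=1e-12):
    # Generic distillation loss KL(teacher || student) over one token's
    # output distribution; hypothetical stand-in for the paper's objective.
    return sum(p * math.log((p + eps) / (q + eps))
               for p, q in zip(teacher_probs, student_probs))

# Toy setup: 2 real experts plus 2 zero-output slots, top_k = 2.
experts = [lambda x: [2.0 * v for v in x], lambda x: [v + 1.0 for v in x]]
token = [1.0, 2.0]

_, executed_easy = moe_forward(token, [0.0, 0.1, 3.0, 2.5], experts, top_k=2)
_, executed_hard = moe_forward(token, [3.0, 2.5, 0.0, 0.1], experts, top_k=2)
print(executed_easy, executed_hard)  # 0 2
```

In this toy run the "easy" token routes both slots to zero-output experts and executes no real expert, while the "hard" token executes two. In a transformer block the MoE output is typically added to a residual stream, so a token that selects only zero-output experts passes through the layer unchanged.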
Similar Articles
@rohanpaul_ai: A large MoE model may be wasting half its expert compute on tokens that barely need expert help. In this paper 50% of e…
A new method called Zero-Expert Self-Distillation Adaptation (ZEDA) allows MoE models like Qwen3 and GLM to skip half their expert computations on easy tokens with minimal accuracy loss, achieving ~20% inference speedup by adding dummy experts that output nothing.
Less is MoE: Trimming Experts in Domain-Specialist Language Models
This paper introduces Fisher-MoE, a method that compresses Mixture-of-Experts models by trimming intermediate dimensions within FFN layers using Fisher importance, achieving 45% weight memory reduction and 21% throughput improvement without significant capability loss.
XPERT: Expert Knowledge Transfer for Effective Training of Language Models
The paper introduces XPERT, a framework that extracts and reuses expert knowledge from pre-trained Mixture-of-Experts (MoE) language models to improve training efficiency and performance in downstream models.
EMO: Pretraining Mixture of Experts for Emergent Modularity
EMO is a Mixture-of-Experts model that enables modular deployment by grouping similar domain tokens with shared experts, achieving performance comparable to standard MoEs while allowing significant expert pruning (25% of experts retain 99% of performance).
@FinanceYF5: MoE models may waste about half of their expert computations on tokens that don't need experts. 1/ Half of the experts are working for nothing. MoE models already seem efficient, but a paper finds that many tokens don't need expert processing at all. ZEDA teaches the model to "save when possible," skipping up to 50% of expert computations.
A paper discovers that about 50% of expert computations in MoE models are wasted on tokens that don't need expert processing. The proposed ZEDA method teaches the model to skip these computations, saving up to half of expert calculations.