Post-Trained MoE Can Skip Half Experts via Self-Distillation

Hugging Face Daily Papers

Summary

ZEDA is a low-cost framework that converts post-trained static MoE models into dynamic ones by injecting zero-output experts and using self-distillation, achieving over 50% expert FLOP reduction with marginal accuracy loss on benchmarks.

Mixture-of-Experts (MoE) scales language models efficiently through sparse expert activation, and its dynamic variant further reduces computation by adjusting the activated experts in an input-dependent manner. Existing dynamic MoE methods usually rely on pre-training from scratch or task-specific adaptation, leaving the practical conversion of fully trained MoE models underexplored. Enabling such adaptation would directly reduce inference costs by allowing easy tokens to bypass unnecessary experts during serving. This paper introduces Zero-Expert Self-Distillation Adaptation (ZEDA), a low-cost framework that transforms post-trained static MoE models into efficient dynamic ones. To stabilize this architectural conversion, ZEDA injects parameter-free zero-output experts into each MoE layer and adapts the augmented model through two-stage self-distillation, using the original MoE as a frozen teacher and applying a group-level balancing loss. On Qwen3-30B-A3B and GLM-4.7-Flash across 11 benchmarks spanning math, code, and instruction following, ZEDA eliminates over 50% of expert FLOPs at marginal accuracy loss. It outperforms the strongest dynamic MoE baseline by 6.1 and 4.0 points on the two models, respectively, and delivers a ~1.20× end-to-end inference speedup.
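
To make the idea of a zero-output expert concrete, here is a minimal sketch in plain Python of a top-k router whose candidate list is extended with parameter-free zero experts. It is an illustration under stated assumptions, not the paper's implementation: the function names (`route_token`, `moe_output`), the softmax-then-top-k ordering, and the choice not to renormalize the remaining weights are all hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(logits, num_real, top_k):
    """Pick top_k slots among real + zero experts; return the real ones to run.

    Assumption: indices >= num_real are zero-output experts. They always
    return 0, so selecting one is equivalent to skipping that slot.
    """
    probs = softmax(logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    return [(i, probs[i]) for i in chosen if i < num_real]

def moe_output(x, logits, experts, top_k):
    """Weighted sum of the selected real experts; zero experts cost nothing."""
    active = route_token(logits, len(experts), top_k)
    out = [0.0] * len(x)
    for idx, weight in active:
        y = experts[idx](x)  # only real experts are ever evaluated
        out = [o + weight * v for o, v in zip(out, y)]
    return out, len(active)

# Two toy real experts plus two zero experts (router logits have 4 entries).
experts = [lambda x: [2.0 * v for v in x], lambda x: [v + 1.0 for v in x]]
easy_logits = [0.1, 2.0, 3.0, -1.0]   # a zero expert ranks first
hard_logits = [3.0, 2.0, -5.0, -5.0]  # both real experts rank first

_, n_easy = moe_output([1.0, 2.0], easy_logits, experts, top_k=2)
_, n_hard = moe_output([1.0, 2.0], hard_logits, experts, top_k=2)
print(n_easy, n_hard)  # number of real experts actually executed per token
```

In this toy setup the "easy" token runs one real expert and the "hard" token runs two, which is the input-dependent behavior the abstract describes: the number of executed experts varies per token while the router still selects a fixed number of slots.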

Cached at: 05/19/26, 06:31 AM

Source: https://huggingface.co/papers/2605.18643

Abstract

Zero-Expert Self-Distillation Adaptation (ZEDA) enables efficient dynamic Mixture-of-Experts models by converting static models into adaptive ones with reduced computational costs and improved inference speed.

Mixture-of-Experts (MoE) scales language models efficiently through sparse expert activation, and its dynamic variant further reduces computation by adjusting the activated experts in an input-dependent manner. Existing dynamic MoE methods usually rely on pre-training from scratch or task-specific adaptation, leaving the practical conversion of fully trained MoE models underexplored. Enabling such adaptation would directly reduce inference costs by allowing easy tokens to bypass unnecessary experts during serving. This paper introduces Zero-Expert Self-Distillation Adaptation (ZEDA), a low-cost framework that transforms post-trained static MoE models into efficient dynamic ones. To stabilize this architectural conversion, ZEDA injects parameter-free zero-output experts into each MoE layer and adapts the augmented model through two-stage self-distillation, using the original MoE as a frozen teacher and applying a group-level balancing loss. On Qwen3-30B-A3B and GLM-4.7-Flash across 11 benchmarks spanning math, code, and instruction following, ZEDA eliminates over 50% of expert FLOPs at marginal accuracy loss. It outperforms the strongest dynamic MoE baseline by 6.1 and 4.0 points on the two models, respectively, and delivers a ~1.20× end-to-end inference speedup.
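
The abstract does not spell out the distillation objective, so the following is only a sketch of one common choice: a per-token KL divergence between the frozen teacher's output distribution and the student's. The function names are hypothetical, and the two-stage schedule and the group-level balancing loss mentioned above are deliberately left out because their form is not given here.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one token's vocabulary logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def kl_teacher_student(teacher_logits, student_logits):
    """KL(teacher || student) for a single token position."""
    t_logp = log_softmax(teacher_logits)
    s_logp = log_softmax(student_logits)
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t_logp, s_logp))

def distillation_loss(teacher_batch, student_batch):
    """Mean per-token KL. Teacher logits are plain constants here, which
    mirrors a frozen teacher: only the student would receive gradients."""
    losses = [kl_teacher_student(t, s) for t, s in zip(teacher_batch, student_batch)]
    return sum(losses) / len(losses)

# Toy batch of two token positions over a 3-word vocabulary.
teacher = [[1.0, 2.0, 3.0], [0.5, 0.5, 0.5]]
student = [[1.0, 2.0, 2.5], [0.4, 0.6, 0.5]]
print(distillation_loss(teacher, student))
```

The loss is zero when the student reproduces the teacher's distribution and grows as routing to zero-output experts shifts the student's predictions, which is the pressure that keeps skipping confined to tokens where it changes the output little.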

Similar Articles

Less is MoE: Trimming Experts in Domain-Specialist Language Models

arXiv cs.LG

This paper introduces Fisher-MoE, a method that compresses Mixture-of-Experts models by trimming intermediate dimensions within FFN layers using Fisher importance, achieving 45% weight memory reduction and 21% throughput improvement without significant capability loss.

EMO: Pretraining Mixture of Experts for Emergent Modularity

Hugging Face Daily Papers

EMO is a Mixture-of-Experts model that enables modular deployment by grouping similar domain tokens with shared experts, achieving performance comparable to standard MoEs while allowing significant expert pruning (25% of experts retain 99% of performance).

@FinanceYF5: MoE models may waste about half of expert computations on tokens that don't need experts. 1/ Half of the experts are working for nothing. MoE models already seem efficient, but a paper finds that many tokens don't need expert processing at all. ZEDA teaches the model to "save when possible," skipping up to 50% of expert computations.

X AI KOLs Following

A paper discovers that about 50% of expert computations in MoE models are wasted on tokens that don't need expert processing. The proposed ZEDA method teaches the model to skip these computations, saving up to half of expert calculations.