Post-Trained MoE Can Skip Half Experts via Self-Distillation

Hugging Face Daily Papers 05/18/26, 12:00 AM Papers

Summary

ZEDA is a low-cost framework that converts post-trained static MoE models into dynamic ones by injecting zero-output experts and using self-distillation, achieving over 50% expert FLOP reduction with marginal accuracy loss on benchmarks.

Mixture-of-Experts (MoE) scales language models efficiently through sparse expert activation, and its dynamic variant further reduces computation by adjusting the number of activated experts in an input-dependent manner. Existing dynamic MoE methods usually rely on pre-training from scratch or task-specific adaptation, leaving the practical conversion of fully trained MoE models underexplored. Enabling such conversion would directly reduce inference cost by allowing easy tokens to bypass unnecessary experts during serving. This paper introduces Zero-Expert Self-Distillation Adaptation (ZEDA), a low-cost framework that transforms post-trained static MoE models into efficient dynamic ones. To stabilize this architectural conversion, ZEDA injects parameter-free zero-output experts into each MoE layer and adapts the augmented model through two-stage self-distillation, using the original MoE as a frozen teacher and applying a group-level balancing loss. On Qwen3-30B-A3B and GLM-4.7-Flash, across 11 benchmarks spanning math, code, and instruction following, ZEDA eliminates over 50% of expert FLOPs at marginal accuracy loss. It outperforms the strongest dynamic MoE baseline by 6.1 and 4.0 points on the two models, respectively, and delivers a ~1.20× end-to-end inference speedup.
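The idea of zero-output experts can be illustrated with a toy router. The sketch below is not the paper's implementation: it assumes the router simply scores the zero-output experts alongside the real ones, that top-k selection runs over the combined list, and that the selected weights are not renormalized. The names `moe_forward` and `expert_flops_saved` are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, router_logits, experts, top_k):
    """Run one token through an MoE layer with zero-output experts.

    router_logits has len(experts) + num_zero entries; any index
    >= len(experts) is a zero-output expert (assumed convention).
    Returns (output vector, number of real experts executed).
    """
    probs = softmax(router_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    out = [0.0] * len(x)
    executed = 0
    for i in ranked[:top_k]:
        if i >= len(experts):
            continue  # zero-output expert: adds nothing, costs no expert FLOPs
        y = experts[i](x)
        out = [o + probs[i] * yi for o, yi in zip(out, y)]
        executed += 1
    return out, executed

def expert_flops_saved(executed_counts, top_k):
    """Fraction of expert calls skipped relative to always running top_k."""
    return 1.0 - sum(executed_counts) / (top_k * len(executed_counts))

# Toy setup: 2 real experts, 2 zero-output experts, top-2 routing.
experts = [lambda x: [2.0 * v for v in x], lambda x: [v + 1.0 for v in x]]
token = [1.0, 2.0]
_, n_easy = moe_forward(token, [0.1, 0.0, 3.0, 2.5], experts, top_k=2)
_, n_hard = moe_forward(token, [3.0, 2.5, 0.0, 0.1], experts, top_k=2)
print(n_easy, n_hard, expert_flops_saved([n_easy, n_hard], top_k=2))
```

In this toy case the "easy" token routes both of its slots to zero-output experts and runs no real expert, while the "hard" token runs two, so half of the expert calls are skipped across the pair. Real savings depend on how the adapted router distributes tokens.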

Original Article

Cached at: 05/19/26, 06:31 AM

Paper page - Post-Trained MoE Can Skip Half Experts via Self-Distillation

Source: https://huggingface.co/papers/2605.18643 Authors:

Abstract

Zero-Expert Self-Distillation Adaptation (ZEDA) enables efficient dynamic Mixture-of-Experts models by converting static models into adaptive ones with reduced computational costs and improved inference speed.

Mixture-of-Experts (MoE) scales language models efficiently through sparse expert activation, and its dynamic variant further reduces computation by adjusting the number of activated experts in an input-dependent manner. Existing dynamic MoE methods usually rely on pre-training from scratch or task-specific adaptation, leaving the practical conversion of fully trained MoE models underexplored. Enabling such conversion would directly reduce inference cost by allowing easy tokens to bypass unnecessary experts during serving. This paper introduces Zero-Expert Self-Distillation Adaptation (ZEDA), a low-cost framework that transforms post-trained static MoE models into efficient dynamic ones. To stabilize this architectural conversion, ZEDA injects parameter-free zero-output experts into each MoE layer and adapts the augmented model through two-stage self-distillation, using the original MoE as a frozen teacher and applying a group-level balancing loss. On Qwen3-30B-A3B and GLM-4.7-Flash, across 11 benchmarks spanning math, code, and instruction following, ZEDA eliminates over 50% of expert FLOPs at marginal accuracy loss. It outperforms the strongest dynamic MoE baseline by 6.1 and 4.0 points on the two models, respectively, and delivers a ~1.20× end-to-end inference speedup.
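Self-distillation from a frozen teacher is commonly implemented as a divergence between the teacher's and the student's next-token distributions. The sketch below shows one such term, KL(teacher || student) for a single token position, under the assumption that both models expose vocabulary logits. The abstract does not give the exact loss, the two-stage schedule, or the form of the group-level balancing loss, so none of those are reproduced here, and `distill_kl` is a hypothetical name.

```python
import math

def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]

def distill_kl(teacher_logits, student_logits):
    """KL(teacher || student) over the vocabulary for one token position.

    The teacher (the original static MoE) is treated as a fixed target;
    only the student (the model with zero-output experts) would be updated.
    """
    t = log_softmax(teacher_logits)
    s = log_softmax(student_logits)
    return sum(math.exp(ti) * (ti - si) for ti, si in zip(t, s))

# Toy logits over a 3-token vocabulary (illustrative values only).
teacher = [2.0, 0.0, -1.0]
student_close = [1.9, 0.1, -1.0]
student_far = [0.0, 0.0, 0.0]
print(distill_kl(teacher, student_close), distill_kl(teacher, student_far))
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge, which is what lets the original model anchor the converted one while tokens are rerouted to zero-output experts.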

Get this paper in your agent:

hf papers read 2605.18643

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Similar Articles

@rohanpaul_ai: A large MoE model may be wasting half its expert compute on tokens that barely need expert help. In this paper 50% of e…

Less is MoE: Trimming Experts in Domain-Specialist Language Models

XPERT: Expert Knowledge Transfer for Effective Training of Language Models

EMO: Pretraining Mixture of Experts for Emergent Modularity

@FinanceYF5: MoE models may waste about half of expert computations on tokens that don't need experts 1/ Half of experts are working for nothing MoE models already seem efficient, but a paper finds that many tokens don't need expert processing at all. ZEDA teaches the model to "save when possible," skipping up to 50% of expert computations.