Pruning and Distilling Mixture-of-Experts into Dense Language Models

Hugging Face Daily Papers, 05/27/26, 12:00 AM

Summary

A systematic framework converts mixture-of-experts models into dense architectures through expert scoring, selection, grouping, and knowledge distillation, achieving superior performance and efficiency compared to traditional pruning methods.

Mixture-of-Experts (MoE) is now the dominant architecture for frontier language models, yet it requires all expert parameters to be loaded in memory, making it less preferable for memory-constrained deployment. Existing compression methods reduce the number of experts but the output remains an MoE model with the same fundamental limitation. We present the first systematic framework for converting a trained MoE into a standard fully dense architecture: experts are scored, selected, and grouped, then concatenated into a dense FFN and refined by knowledge distillation from the MoE teacher. We evaluate 7 scoring, 5 grouping, and 2 magnitude scaling methods across a range of selected expert counts on Qwen3-30B-A3B, yielding 350 configurations. We find that the choice of scoring method is the most impactful, with our novel diversity-aware scoring consistently outperforming prior methods on Qwen3-30B-A3B, DeepSeek-V2-Lite, and GPT-OSS-20B. Under a controlled comparison at matched parameter count, MoE-to-dense outperforms dense-to-dense pruning by +6.3 pp in average downstream accuracy after ~4B-token distillation at 1.6x faster training wall-clock speed.
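The concatenation step can be illustrated with a small NumPy sketch. It is not the paper's implementation: the SwiGLU-style expert layout, the toy dimensions, the `route_counts` array, and the routing-frequency scoring are all assumptions made for illustration (the paper's diversity-aware scoring, grouping, and magnitude scaling are not described in this abstract). The sketch shows one fact that holds for any such layout: stacking the selected experts along the hidden dimension yields a single dense FFN whose output equals the unweighted sum of those experts. The `kd_loss` helper is a generic KL-based distillation loss, likewise an assumption about what "knowledge distillation from the MoE teacher" might look like.

```python
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))

def expert_ffn(x, w_gate, w_up, w_down):
    """One SwiGLU-style FFN (assumed layout): down(silu(gate(x)) * up(x))."""
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

def concat_experts(experts, selected):
    """Stack the selected experts along the hidden dimension into one dense FFN."""
    w_gate = np.concatenate([experts[i][0] for i in selected], axis=1)
    w_up = np.concatenate([experts[i][1] for i in selected], axis=1)
    w_down = np.concatenate([experts[i][2] for i in selected], axis=0)
    return w_gate, w_up, w_down

def kd_loss(teacher_logits, student_logits, temperature=1.0):
    """Generic distillation loss: mean KL(teacher || student) over rows."""
    def log_softmax(z):
        z = z / temperature
        z = z - z.max(axis=-1, keepdims=True)
        return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return float((np.exp(log_p) * (log_p - log_q)).sum(axis=-1).mean())

rng = np.random.default_rng(0)
d_model, d_expert, n_experts, k = 16, 8, 6, 3  # toy sizes, hypothetical

# Each expert is a (w_gate, w_up, w_down) triple of random weights.
experts = [
    (rng.normal(size=(d_model, d_expert)),
     rng.normal(size=(d_model, d_expert)),
     rng.normal(size=(d_expert, d_model)))
    for _ in range(n_experts)
]

# Placeholder scoring (NOT the paper's method): how often each expert was routed to.
route_counts = np.array([5, 40, 12, 33, 2, 8])
selected = sorted(np.argsort(-route_counts)[:k].tolist())

x = rng.normal(size=(4, d_model))
dense = concat_experts(experts, selected)
dense_out = expert_ffn(x, *dense)
sum_out = sum(expert_ffn(x, *experts[i]) for i in selected)

print("selected experts:", selected)
print("dense hidden size:", dense[0].shape[1])
print("dense FFN matches sum of experts:", np.allclose(dense_out, sum_out))
```

The dense layer has no router, so every token passes through all selected experts with equal weight. That differs from the teacher's per-token routing, which is presumably the gap the distillation stage is meant to close.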

Cached at: 06/10/26, 12:12 AM

Source: https://huggingface.co/papers/2605.28207

Models citing this paper: 2

EvanOLeary/laguna-xs2-densify-smoke (updated 11 days ago)

EvanOLeary/laguna-xs2-dense-k8-recon (Text Generation, 3B, updated 11 days ago, 210)

Similar Articles

SHAPE: Coalition-Aware Expert Pruning for Sparse Mixture-of-Experts LLMs

SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training

XPERT: Expert Knowledge Transfer for Effective Training of Language Models

TENP: Trapezoidal Expert Neuron Pruning For Mixture-of-Experts

Less is MoE: Trimming Experts in Domain-Specialist Language Models
