SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training
Summary
This paper explores structured pruning and knowledge distillation techniques for compressing large Mixture-of-Experts (MoE) models during pre-training. It demonstrates that progressive pruning and combined distillation strategies, such as multi-token prediction distillation, improve downstream performance, exemplified by compressing Qwen3-Next-80A3B to a more efficient 23A2B model.
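To make the "progressive pruning" idea concrete: rather than cutting the model to its target architecture in one shot, the architecture (here, the expert count) is reduced in stages over the training budget. The sketch below is a hypothetical staged schedule; the stage count and expert numbers are placeholders for illustration, not the paper's configuration.

```python
def experts_at_step(step: int, total_steps: int,
                    start_experts: int = 512, target_experts: int = 128,
                    stages: int = 4) -> int:
    """Expert count to keep at a given training step under a staged schedule.

    All parameter values are placeholders; the paper's actual schedule (and
    whether it stages experts, depth, or width) may differ.
    """
    # Which stage of the schedule the current step falls into (0..stages).
    stage = min(stages, (step * stages) // max(total_steps, 1))
    # Linearly interpolate the expert count across the discrete stages.
    return start_experts - (stage * (start_experts - target_experts)) // stages

# Example: a 4-stage schedule over 100k steps steps down 512 -> 416 -> 320 -> 224 -> 128.
```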
Source: https://huggingface.co/papers/2605.08738 · Published on May 9
Submitted by Shengkun Tang (https://huggingface.co/Shengkun) on May 12
Abstract
Research demonstrates that structured pruning and knowledge distillation effectively compress mixture-of-experts models at scale, with progressive pruning and combined distillation strategies improving performance.
Structured pruning and knowledge distillation (KD) are typical techniques for compressing large language models, but it remains unclear how they should be applied at pretraining scale, especially to recent mixture-of-experts (MoE) models. In this work, we systematically study MoE compression in large-scale pretraining, focusing on three key questions: whether pruning provides a better initialization than training from scratch, how expert compression choices affect the final model after continued training, and which training strategy is most effective. We have the following findings: First, across depth, width, and expert compression, pruning a pretrained MoE consistently outperforms training the target architecture from scratch under the same training budget. Second, different one-shot expert compression methods converge to similar final performance after large-scale continual pretraining. Motivated by this, we introduce a simple partial-preservation expert merging strategy that improves downstream performance across most benchmarks. Third, combining KD with the language modeling loss outperforms KD alone, particularly on knowledge-intensive tasks. We further propose multi-token prediction (MTP) distillation, which yields consistent gains. Finally, given the same training tokens, progressive pruning schedules outperform one-shot compression, suggesting that gradual architecture transitions lead to better optimization trajectories. Putting it all together, we compress Qwen3-Next-80A3B to a 23A2B model that retains competitive performance. These results offer practical guidance for efficient MoE compression at scale.
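The third finding, that KD works best when combined with the language-modeling loss, can be illustrated with a short PyTorch sketch. This is a minimal version assuming a simple weighted sum with mixing weight `alpha` and temperature `tau` (both illustrative, not values from the paper); the paper's MTP distillation additionally distills multi-token prediction heads, which this single-token sketch omits.

```python
import torch.nn.functional as F

def combined_kd_loss(student_logits, teacher_logits, labels, alpha=0.5, tau=1.0):
    """Weighted sum of the LM loss and a KD term; `alpha`/`tau` are illustrative."""
    # Flatten (batch, seq, vocab) -> (batch * seq, vocab) for token-level losses.
    s = student_logits.flatten(0, 1)
    t = teacher_logits.flatten(0, 1)
    # Standard next-token language-modeling loss on ground-truth labels.
    lm_loss = F.cross_entropy(s, labels.flatten(), ignore_index=-100)
    # Distillation loss: KL divergence to the frozen teacher's softened distribution.
    kd_loss = F.kl_div(
        F.log_softmax(s / tau, dim=-1),
        F.softmax(t / tau, dim=-1),
        reduction="batchmean",
    ) * tau ** 2
    # The paper reports that this combination beats KD alone,
    # especially on knowledge-intensive tasks.
    return alpha * lm_loss + (1.0 - alpha) * kd_loss
```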
Get this paper in your agent:
hf papers read 2605.08738
Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Similar Articles
XPERT: Expert Knowledge Transfer for Effective Training of Language Models
The paper introduces XPERT, a framework that extracts and reuses expert knowledge from pre-trained Mixture-of-Experts (MoE) language models to improve training efficiency and performance in downstream models.
Qwen-Scope: Turning Sparse Features into Development Tools for Large Language Models
This article introduces Qwen-Scope, a toolkit of Sparse Autoencoders (SAEs) trained on Qwen3 and Qwen3.5 models to enable mechanistic analysis and intervention. It releases 14 groups of SAE weights covering dense and MoE backbones, providing sparse representations for residual-stream activations.
Qwen/Qwen3.6-35B-A3B
Qwen releases Qwen3.6-35B-A3B, an open-weight Mixture-of-Experts model with 35B total parameters and 3B active parameters, featuring significant improvements in agentic coding and reasoning preservation.
@cjzafir: Qwen 3.5 4B model and 8B are too good. I fine-tuned a 4B model today and got 98% accuracy on full precision and Q8 quan…
A developer reports achieving high accuracy with fine-tuned Qwen 3.5 4B and 8B models using Unsloth, suggesting a shift towards specialized Expert Language Models (ELMs) for niche tasks.
Weight Pruning Amplifies Bias: A Multi-Method Study of Compressed LLMs for Edge AI
This study reveals a 'Smart Pruning Paradox' where activation-aware pruning methods like Wanda preserve perplexity but significantly amplify bias in Large Language Models deployed on edge devices.