SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training
Summary
This paper explores structured pruning and knowledge distillation techniques for compressing large Mixture-of-Experts (MoE) models during pre-training. It demonstrates that progressive pruning and combined distillation strategies, such as multi-token prediction distillation, improve downstream performance, exemplified by compressing Qwen3-Next-80A3B to a more efficient 23A2B model.
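To make the "progressive pruning" idea concrete: rather than cutting the model to its target architecture in one shot, the architecture (here, the expert count) is reduced in stages over the training budget. The sketch below is a hypothetical staged schedule; the stage count and expert numbers are placeholders for illustration, not the paper's configuration.

```python
def experts_at_step(step: int, total_steps: int,
                    start_experts: int = 512, target_experts: int = 128,
                    stages: int = 4) -> int:
    """Expert count to keep at a given training step under a staged schedule.

    All parameter values are placeholders; the paper's actual schedule (and
    whether it stages experts, depth, or width) may differ.
    """
    # Which stage of the schedule the current step falls into (0..stages).
    stage = min(stages, (step * stages) // max(total_steps, 1))
    # Linearly interpolate the expert count across the discrete stages.
    return start_experts - (stage * (start_experts - target_experts)) // stages

# Example: a 4-stage schedule over 100k steps steps down 512 -> 416 -> 320 -> 224 -> 128.
```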
Source: https://huggingface.co/papers/2605.08738 · Published on May 9
Submitted by Shengkun Tang (https://huggingface.co/Shengkun) on May 12
Abstract
Research demonstrates that structured pruning and knowledge distillation effectively compress mixture-of-experts models at scale, with progressive pruning and combined distillation strategies improving performance.
Structured pruning and knowledge distillation (KD) are typical techniques for compressing large language models, but it remains unclear how they should be applied at pretraining scale, especially to recent mixture-of-experts (MoE) models. In this work, we systematically study MoE compression in large-scale pretraining, focusing on three key questions: whether pruning provides a better initialization than training from scratch, how expert compression choices affect the final model after continued training, and which training strategy is most effective. We have the following findings: First, across depth, width, and expert compression, pruning a pretrained MoE consistently outperforms training the target architecture from scratch under the same training budget. Second, different one-shot expert compression methods converge to similar final performance after large-scale continual pretraining. Motivated by this, we introduce a simple partial-preservation expert merging strategy that improves downstream performance across most benchmarks. Third, combining KD with the language modeling loss outperforms KD alone, particularly on knowledge-intensive tasks. We further propose multi-token prediction (MTP) distillation, which yields consistent gains. Finally, given the same training tokens, progressive pruning schedules outperform one-shot compression, suggesting that gradual architecture transitions lead to better optimization trajectories. Putting it all together, we compress Qwen3-Next-80A3B to a 23A2B model that retains competitive performance. These results offer practical guidance for efficient MoE compression at scale.
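The third finding, that KD works best when combined with the language-modeling loss, can be illustrated with a short PyTorch sketch. This is a minimal version assuming a simple weighted sum with mixing weight `alpha` and temperature `tau` (both illustrative, not values from the paper); the paper's MTP distillation additionally distills multi-token prediction heads, which this single-token sketch omits.

```python
import torch.nn.functional as F

def combined_kd_loss(student_logits, teacher_logits, labels, alpha=0.5, tau=1.0):
    """Weighted sum of the LM loss and a KD term; `alpha`/`tau` are illustrative."""
    # Flatten (batch, seq, vocab) -> (batch * seq, vocab) for token-level losses.
    s = student_logits.flatten(0, 1)
    t = teacher_logits.flatten(0, 1)
    # Standard next-token language-modeling loss on ground-truth labels.
    lm_loss = F.cross_entropy(s, labels.flatten(), ignore_index=-100)
    # Distillation loss: KL divergence to the frozen teacher's softened distribution.
    kd_loss = F.kl_div(
        F.log_softmax(s / tau, dim=-1),
        F.softmax(t / tau, dim=-1),
        reduction="batchmean",
    ) * tau ** 2
    # The paper reports that this combination beats KD alone,
    # especially on knowledge-intensive tasks.
    return alpha * lm_loss + (1.0 - alpha) * kd_loss
```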
Get this paper in your agent:
hf papers read 2605.08738
Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Similar Articles
XPERT: Expert Knowledge Transfer for Effective Training of Language Models
The paper introduces XPERT, a framework that extracts and reuses expert knowledge from pre-trained Mixture-of-Experts (MoE) language models to improve training efficiency and performance in downstream models.
Qwen-Scope: Turning Sparse Features into Development Tools for Large Language Models
This article introduces Qwen-Scope, a toolkit of Sparse Autoencoders (SAEs) trained on Qwen3 and Qwen3.5 models to enable mechanistic analysis and intervention. It releases 14 groups of SAE weights covering dense and MoE backbones, providing sparse representations for residual-stream activations.
Qwen/Qwen3.6-35B-A3B
Qwen releases Qwen3.6-35B-A3B, an open-weight Mixture-of-Experts model with 35B total parameters and 3B active parameters, featuring significant improvements in agentic coding and reasoning preservation.
@cjzafir: Qwen 3.5 4B model and 8B are too good. I fine-tuned a 4B model today and got 98% accuracy on full precision and Q8 quan…
A developer reports achieving high accuracy with fine-tuned Qwen 3.5 4B and 8B models using Unsloth, suggesting a shift towards specialized Expert Language Models (ELMs) for niche tasks.
Weight Pruning Amplifies Bias: A Multi-Method Study of Compressed LLMs for Edge AI
This study reveals a 'Smart Pruning Paradox' where activation-aware pruning methods like Wanda preserve perplexity but significantly amplify bias in Large Language Models deployed on edge devices.