Be wary of Qwen/Claude distillations - they're often worse than the base model

Reddit r/LocalLLaMA News

Summary

A critical analysis warning that many Qwen/Claude distillation models use too few training samples (e.g., 4K) to transfer actual capabilities, often degrading quality instead of improving it, unlike the official DeepSeek-R1 distills, which used ~700K samples.

Just to be clear: I'm not trying to call anybody out or be mean to those who take the time/money to make these models. I just want to inform people about these distills/finetunes, since there's clearly some confusion going on.

I'm going to assume those of us who visit this subreddit often have noticed these models, particularly "Qwopus" and the like, though I'm sure there are probably Gemma 4/Claude distills too. As I type this, there's a Qwen 3.6 based Claude Fable 5 distillation model on the frontpage. Seems pretty cool, right? Yep. Up until you actually look into how these models were distilled.

This new Fable distillation uses around 4,000 samples of Fable 5/Opus 4.8 output to finetune Qwen 3.6. 4K samples is basically *nothing* when it comes to improving a model's quality/performance. At best, it'll act slightly differently. But it certainly won't perform better than just running standard Qwen 3.6. If anything, it's likely to slightly degrade quality. Why? 4K samples is just not enough.

And I'm aware that Qwopus (or it may be another finetune called Qwen3.6-Claude-Opus.4.6-Distill, iirc) has a version trained on ~8-10K samples rather than 3-4K. Unfortunately, that's still nowhere near enough to be meaningful. If anybody remembers the original DeepSeek-R1 Llama/Qwen distillations that DeepSeek officially released back when the model first came out, around 700,000 samples from R1 were used to create those distills, roughly 175x the 4K used here. That's enough to not only impact behaviour, but actually improve benchmark scores.

So, these Qwen + Claude models will have a slightly different reasoning style. They might feel "more Opus-like" chatting-wise. But they are not performing better than their base Qwen models, even though, based on everything I've seen, a lot of people seem to think they are. Even with that Qwen/Opus distill that uses like 10K+ samples, that's still just not enough to transfer any sort of actual capability.

[There's a decent example of someone testing this, showing Qwopus hallucinating compared to standard Qwen 3.6 and taking twice as long.](https://akitaonrails.com/en/2026/04/24/llm-benchmarks-parte-3-deepseek-kimi-mimo/#the-discovery-claude-distillation-doesnt-transfer-library-knowledge) There are also ofc plenty of people on this sub who have posted similar results.

So yeah, just something to be aware of whenever you come across these distills/finetunes. At the very least, don't blindly trust them to be superior; bench them on your own specific use cases. I've personally tried a couple of these finetunes, and both had issues with coherence and subtle mistakes that the standard model didn't have. But YMMV.
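
For anyone who wants to actually do the "bench them on your own use cases" part, here's a rough sketch of what that could look like. It just runs the same prompts through each model and counts how many outputs pass a check you write yourself. Everything here is an assumption for illustration: `pass_rate`, `compare`, `fake_base`, and `fake_distill` are hypothetical names, and the two "models" are hard-coded stand-ins, not real Qwen or distill outputs. You'd swap them for whatever function calls your local inference setup.

```python
from typing import Callable, Dict, List, Tuple

# A test case: a prompt plus a checker that returns True if the output is acceptable.
Case = Tuple[str, Callable[[str], bool]]
Generate = Callable[[str], str]


def pass_rate(generate: Generate, cases: List[Case]) -> float:
    """Return the fraction of cases whose generated output passes its checker."""
    if not cases:
        raise ValueError("need at least one test case")
    passed = sum(1 for prompt, check in cases if check(generate(prompt)))
    return passed / len(cases)


def compare(models: Dict[str, Generate], cases: List[Case]) -> Dict[str, float]:
    """Score every model on the same set of cases."""
    return {name: pass_rate(generate, cases) for name, generate in models.items()}


# Placeholder "models" (hypothetical): replace these with calls to your own
# base model and the distill/finetune you want to check.
def fake_base(prompt: str) -> str:
    canned = {"2+2?": "4", "Capital of France?": "Paris"}
    return canned.get(prompt, "")


def fake_distill(prompt: str) -> str:
    canned = {"2+2?": "4", "Capital of France?": "Lyon"}
    return canned.get(prompt, "")


# Use prompts and checks taken from your own workload, not these toy ones.
cases: List[Case] = [
    ("2+2?", lambda out: "4" in out),
    ("Capital of France?", lambda out: "Paris" in out),
]

scores = compare({"base": fake_base, "distill": fake_distill}, cases)
print(scores)  # {'base': 1.0, 'distill': 0.5} with the placeholder models above
```

The numbers it prints only reflect the canned placeholder answers. The point is the shape: same prompts, same checks, base model and distill side by side, on tasks you actually care about.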
