Be wary of Qwen/Claude distillations - they're often worse than the base model

Reddit r/LocalLLaMA 06/16/26, 10:48 AM News

distillation fine-tuning qwen claude open-source model-quality ai-warning

Summary

A critical analysis warning that many Qwen/Claude distillation models are trained on too few samples (e.g., 4K) to transfer real capabilities and often degrade quality rather than improve it, in contrast to the official DeepSeek-R1 distills, which used ~700K samples.

Just to be clear: I'm not trying to call anybody out or be mean to those who spend the time/money to make these models. I just want to inform people about these distills/finetunes, since there's clearly some confusion going on.

I'm going to assume those of us who visit this subreddit often have noticed these models, particularly the "Qwopus" model and the like, though I'm sure there are Gemma 4/Claude distills too. As I type this, there's a Qwen 3.6 based Claude Fable 5 distillation model on the frontpage. Seems pretty cool, right? Yep. Up until you actually look into how these models were distilled.

This new Fable distillation uses around 4,000 samples of Fable 5/Opus 4.8 to finetune Qwen 3.6. 4K samples is basically *nothing* when it comes to improving a model's quality/performance. At best, it'll act slightly differently, but it certainly won't perform better than just running standard Qwen 3.6. If anything, it's likely to slightly degrade quality. Why? 4K samples is just not enough.

And I am aware that Qwopus (or it may be another finetune called Qwen3.6-Claude-Opus.4.6-Distill iirc) has a version trained on ~8-10K samples rather than 3-4K. Unfortunately, that's still nowhere near enough to be meaningful. If anybody remembers the original DeepSeek-R1 Llama/Qwen distillations that DeepSeek officially released back when the model first came out, ~700,000 samples from R1 were used to create those distills. That's enough to not only change behaviour, but actually improve benchmark scores.

So, these Qwen + Claude models will have a slightly different reasoning style. They might feel "more Opus-like" chat-wise. But they are not performing better than their base Qwen models, and based on everything I've seen, a lot of people seem to think they are. Even that Qwen/Opus distill with 10K+ samples still doesn't have enough data to transfer any sort of actual capability.

[There's a decent example of someone testing this, showing Qwopus hallucinating compared to the standard Qwen 3.6, and also taking twice as long.](https://akitaonrails.com/en/2026/04/24/llm-benchmarks-parte-3-deepseek-kimi-mimo/#the-discovery-claude-distillation-doesnt-transfer-library-knowledge) There are also, of course, plenty of people on this sub who have posted similar results.

So yeah, just something to be aware of whenever you come across these distills/finetunes. At the very least, don't blindly trust them to be superior, and bench them on your own specific use cases. I've personally tried a couple of these finetunes, and both had issues with coherence and subtle mistakes that the standard model didn't have. But YMMV.
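
For anyone wondering what "bench them on your own use cases" could look like in practice, here is a minimal sketch, not a real benchmark suite. It assumes you can wrap each model in a plain Python function that takes a prompt and returns text. The test cases, the substring-match scoring, and the `base_stub`/`distill_stub` functions are all hypothetical placeholders; swap the stubs for calls into whatever inference setup you actually run.

```python
import time
from typing import Callable, Dict, List, Tuple

# Hypothetical test cases: (prompt, substring a correct answer must contain).
# Replace these with prompts from your own workload.
CASES: List[Tuple[str, str]] = [
    ("What is 2 + 2?", "4"),
    ("Capital of France?", "Paris"),
]


def score(generate: Callable[[str], str],
          cases: List[Tuple[str, str]]) -> Tuple[float, float]:
    """Return (fraction of cases passed, total seconds) for one model."""
    passed = 0
    start = time.perf_counter()
    for prompt, expected in cases:
        answer = generate(prompt)
        # Crude check: case-insensitive substring match.
        if expected.lower() in answer.lower():
            passed += 1
    elapsed = time.perf_counter() - start
    return passed / len(cases), elapsed


def compare(base: Callable[[str], str],
            distill: Callable[[str], str],
            cases: List[Tuple[str, str]]) -> Dict[str, object]:
    """Run the same cases through both models and summarize the results."""
    base_acc, base_time = score(base, cases)
    distill_acc, distill_time = score(distill, cases)
    return {
        "base_accuracy": base_acc,
        "distill_accuracy": distill_acc,
        "base_seconds": base_time,
        "distill_seconds": distill_time,
        "distill_regressed": distill_acc < base_acc,
    }


# Placeholder stand-ins for real models (assumption: yours would call
# a local inference server or library instead of a lookup table).
def base_stub(prompt: str) -> str:
    return {"What is 2 + 2?": "4", "Capital of France?": "Paris"}.get(prompt, "")


def distill_stub(prompt: str) -> str:
    return {"What is 2 + 2?": "4", "Capital of France?": "Lyon"}.get(prompt, "")


result = compare(base_stub, distill_stub, CASES)
print(result)
```

Substring matching is deliberately crude. For coding or long-form tasks you'd want a stricter check (running the generated code, exact-match answers, etc.), but even a small fixed set of prompts run through both models will tell you more than the model card does.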


