@witcheer: this is the first Qwen3.6-27B coding tune I've measured that improves real bug-fixing (!!!). - quality (MMLU/ARC/HellaS…

X AI KOLs Timeline 06/17/26, 02:47 PM Models

qwen fine-tuning coding bug-fixing benchmarks open-source agentic

Summary

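A community fine-tune of Qwen3.6-27B resolves 20 of 30 sampled SWE-bench Verified tasks versus 19 for the base model, while general quality and agentic scores stay flat. Two earlier coding tunes distilled on synthetic agent traces regressed on the same real-bug test.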

Original Article

Cached at: 06/17/26, 04:00 PM

this is the first Qwen3.6-27B coding tune I’ve measured that improves real bug-fixing (!!!).

- quality (MMLU/ARC/HellaSwag/GSM8K/HumanEval): 93.3 vs base 94.0. flat.
- agentic score (native tool-calling, 40 tasks): 98.0 vs base 98.6. flat.
- real bugs (SWE-bench Verified, 30, official harness): 20/30 vs base 19/30. up. it solves 2 the base can’t and gives up less (6 empty patches vs 8).
- MTP drafter: 2.0 to 2.4x vs base 1.8 to 2.2x. the fine-tune kept its drafter.
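
The real-bug counts above imply a specific overlap between the two models, which can be worked out with a little arithmetic. A minimal sketch, assuming both models ran on the same 30 tasks; the function and class names are illustrative, and only the three counts (20, 19, and 2 solved by the tune alone) come from the post.

```python
from dataclasses import dataclass


@dataclass
class Overlap:
    both: int        # tasks solved by both models
    tuned_only: int  # tasks solved only by the fine-tune
    base_only: int   # tasks solved only by the base model


def overlap_from_counts(tuned_solved: int, base_solved: int, tuned_only: int) -> Overlap:
    """Derive the solved-set overlap from the three reported counts."""
    both = tuned_solved - tuned_only
    base_only = base_solved - both
    if both < 0 or base_only < 0:
        raise ValueError("reported counts are inconsistent")
    return Overlap(both=both, tuned_only=tuned_only, base_only=base_only)


TOTAL_TASKS = 30  # size of the SWE-bench Verified sample in the post

overlap = overlap_from_counts(tuned_solved=20, base_solved=19, tuned_only=2)
tuned_rate = 20 / TOTAL_TASKS
base_rate = 19 / TOTAL_TASKS

print(overlap)
print(f"resolve rate: tuned {tuned_rate:.1%}, base {base_rate:.1%}")
```

Under that assumption the counts imply 18 tasks solved by both models and 1 that only the base solves, so the net gain is a single task out of 30.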

this is the third Qwen3.6-27B coding tune I’ve benched. the other two were distilled on synthetic agent traces and both regressed on real bugs.

across all three tunes the synthetic agentic score sits in a 2.4pt band (97.6 to 100), while real SWE-bench resolves span 11 to 20.
the cheap axis can't tell them apart. 
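
One way to make that comparison concrete is to put both ranges on the same 0 to 100 scale. A rough sketch, assuming the 11 to 20 figures are resolved counts out of the same 30 tasks (the post states 30 only for the pi-tune run); the function name is illustrative.

```python
def spread_in_points(low: float, high: float, scale: float) -> float:
    """Width of a score range, expressed in percentage points of its scale."""
    return (high - low) / scale * 100


# Agentic score is already reported on a 0-100 scale.
agentic_spread = spread_in_points(97.6, 100.0, 100.0)

# ASSUMPTION: SWE results are counts out of 30 tasks for all three tunes.
swe_spread = spread_in_points(11, 20, 30)

ratio = swe_spread / agentic_spread

print(f"agentic spread: {agentic_spread:.1f} pts")
print(f"real SWE spread: {swe_spread:.1f} pts")
print(f"real SWE spread is about {ratio:.1f}x wider")
```

On these assumptions the three tunes differ by roughly 2.4 points on the agentic axis and 30 points on real bugs, which is the gap the post is pointing at.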

pi-tune even has the lowest quality score of the group and the best real resolve rate. real capability tracks the training data, not the "agentic coder" label: real traces improved it, synthetic distill narrowed it.

only the reality anchor could see the difference.

> **Tongyi Lab (@Ali_TongyiLab):**
> We are pleased to highlight an excellent community model from developer : Qwen3.6-27B-MTP-pi-reasoning-GGUF.
> 
> Built on our Qwen3.6-27B base model, this release focuses on optimizing automated programming and debugging workflows for local coding agents.
> 
> If you are exploring local

Similar Articles

I can't get Qwen3.6 27B to outperform Qwen-Coder-Next and I'm not sure why

bytkim/Qwen3.6-27B-MTP-pi-tune-GGUF

@populartourist: Having worked consistently with Qwen3.6 27B NVFP4 on repos - it's clear that this quant is not reliable, at least for c…

@songjunkr: SuperQwen3.6-35B-DFlash-MLX is ready. Benchmark: Comparison of original vs. tuned versions on 100 actual items from com…

Any news (or hope) of Qwen-3.6 14B and 9B distills for local coding ?
