Qwen 3.6 27B on DeepSWE

Reddit r/LocalLLaMA 06/07/26, 08:13 PM News

benchmark qwen open-source local-models deep-swe coding-benchmark

Summary

Qwen 3.6 27B scored 2% on the DeepSWE benchmark, placing 18th of 20, above Haiku 4.5 and Minimax M2.7, and highlighting the gap between local and leading-edge models.

Overview:
* It scored 2% (1.79%, rounded up).
* It placed 18th of 20, scoring above Haiku 4.5 and Minimax M2.7.
* The full benchmark took 70 hours.
* Average time per task: 32 minutes.
* Average output tokens per task: 44k.

Perspectives:
* It scored suspiciously close to 3.6 Plus, which makes me wonder how the architecture of 3.6 Plus differs from 27B.
* Qwen 3.6 27B has a bad reputation in the community for being verbose. Surprisingly, its output token count was on par with or lower than that of similar models.

Methodology:
* Qwen 3.6 27B in FP8 with a BF16 KV cache, reasoning on, and a 262k context window on vLLM.
* The model ran on 1x RTX 6000 Pro Blackwell on RunPod.
* Ran with the mini-swe-agent harness on Modal sandboxes.
* Ran 1 rollout per task instead of the official 4 to save time, which is why the images do not show a score range.
* Costs were calculated from the tasks completed at the RunPod hourly rate.
* Codex 5.5xhigh was used to orchestrate and monitor the full benchmark run.

[src](https://xcancel.com/Youssofal_/status/2063672976982069413)

The best OS model, Kimi-k2.6, is so far from the performance of the leading edge. Most can't even run Kimi locally, and something like Qwen 3.6 27B is the local poor man's SOTA. It appears to take great size to perform at the leading edge. Models that start to be competitive tend to go closed source real quick. It doesn't feel like local will win. It feels more like a game of "how badly will local lose".
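The cost method mentioned in the methodology (time spent per task multiplied by the RunPod hourly rate) can be sketched roughly as below. This is only an illustration: the post does not state the hourly rate, the number of tasks, or whether tasks ran concurrently, so `HYPOTHETICAL_HOURLY_RATE_USD` and both helper functions are assumptions rather than the author's actual script.

```python
# Rough cost arithmetic for a benchmark run billed by the GPU hour.
# The 70 hours and 32 minutes come from the post; the hourly rate is a
# placeholder, since the post does not state the real RunPod price.

TOTAL_HOURS = 70                      # from the post
AVG_TASK_MINUTES = 32                 # from the post
HYPOTHETICAL_HOURLY_RATE_USD = 2.00   # assumption, not from the post


def cost_per_task(hourly_rate_usd: float, avg_task_minutes: float) -> float:
    """Cost of one task if it occupies the GPU alone for its full duration."""
    return hourly_rate_usd * avg_task_minutes / 60.0


def implied_sequential_tasks(total_hours: float, avg_task_minutes: float) -> float:
    """Number of tasks that would fit in the run if they ran one at a time."""
    return total_hours * 60.0 / avg_task_minutes


if __name__ == "__main__":
    per_task = cost_per_task(HYPOTHETICAL_HOURLY_RATE_USD, AVG_TASK_MINUTES)
    total = HYPOTHETICAL_HOURLY_RATE_USD * TOTAL_HOURS
    tasks = implied_sequential_tasks(TOTAL_HOURS, AVG_TASK_MINUTES)
    print(f"cost per task: ${per_task:.2f}")
    print(f"total run cost: ${total:.2f}")
    print(f"tasks if strictly sequential: {tasks:.2f}")
```

If tasks overlapped on the same GPU, the per-task cost would be lower than this estimate and the sequential task count would not apply, so the figures should be read as a sanity check only.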

Similar Articles

Qwen 3.7 Max scores 60.6% on SWE-Bench Pro
Qwen 3.7 Max achieves a score of 60.6% on SWE-Bench Pro, demonstrating competitive performance on software engineering tasks.

Qwen3.7 Max scored by Artificial Analysis, 27B/35B waiting room
Qwen3.7 Max ranks 5th on Artificial Analysis benchmarks, matching GPT-5.4 and outperforming Gemini 3.5 Flash, while Qwen3.6 27B trails significantly.

Qwen3.7: The Agent Frontier (15 minute read)

Qwen3.6-35B-A3B and 9B are officially on the public Terminal-Bench 2.0 leaderboard!

Qwen 3.6 35B A3B vs Qwen 3.5 122B A10B