A hands-on benchmark of four local LLMs—Qwen3.6-27B, Qwen3.6-35B, Qwen3.5-27B and Gemma 4—on a 20k-token architecture-writing task shows Qwen3.6-27B delivering the best overall balance of clarity, completeness and usefulness on an RTX 5090.
I ran a pretty simple but revealing local-LLM test. At first I was only going to post about the two Qwens and Gemma4 and go to bed, but what do you know, I went on reddit and saw a post that Qwen3.6-27B had dropped. Oh well...

Models tested:

* **Gemma4**
  * `cyankiwi/gemma-4-31B-it-AWQ-4bit`
* **Qwen3.6-35B**
  * `RedHatAI/Qwen3.6-35B-A3B-NVFP4`
* **Qwen3.5-27B**
  * `QuantTrio/Qwen3.5-27B-AWQ`
* **Qwen3.6-27B**
  * `cyankiwi/Qwen3.6-27B-AWQ-INT4`

Context: I'm working on a fairly complex tool that takes noisy evidence and turns it into a structured "truth report." I gave the same Hermes writing agent ("Scribe") the same task: take 2 architecture blueprint docs (v1 baseline + v2 expansion) describing the "truth engine" and produce a unified `Masterplan.md` explaining:

- what the product is
- the user problem
- UX/product shape
- UVP/moat
- pipeline
- agent roles
- architecture
- trust/legal/provenance posture
- what changed between plan V1 and V2

V1: ~16k tokens, V2: ~4.6k tokens, combined: ~20.6k tokens.

Then I ran the full workflow locally on my RTX 5090 with all 4 models:

- **Gemma4**
- **Qwen3.6-35B**
- **Qwen3.5-27B**
- **Qwen3.6-27B**

To make it fair and push the models, each model got:

1. initial draft
2. second-pass revision
3. final polish

Each stage was directed and reviewed by my GPT-5.4 agent Manny, so this wasn't just "ask once and compare vibes."

## What I/Manny scored

- **Clarity**
- **Completeness**
- **Discipline**
- **Usefulness**

## Final results

### Clarity

- Gemma4: **9.4**
- Qwen3.6-27B: **8.8**
- Qwen3.6-35B: **8.1**
- Qwen3.5-27B: **7.4**

**Winner: Gemma4** (at a cost; read further below)

Gemma was the best editor. Cleanest structure, best pacing, strongest restraint.
---

### Completeness

- Qwen3.6-35B: **9.6**
- Qwen3.5-27B: **9.1**
- Qwen3.6-27B: **8.7**
- Gemma4: **7.9**

**Winner: Qwen3.6-35B**

The 35B Qwen wrote the most exhaustive architecture doc by far. Best sourcebook, most implementation mass.

---

### Discipline

- Gemma4: **9.5**
- Qwen3.6-27B: **8.6**
- Qwen3.6-35B: **7.7**
- Qwen3.5-27B: **6.8**

**Winner: Gemma4**

Gemma best preserved the actual product identity.

---

### Usefulness

- Qwen3.6-27B: **9.3**
- Qwen3.6-35B: **9.2**
- Gemma4: **8.9**
- Qwen3.5-27B: **8.8**

**Winner: Qwen3.6-27B**

This was the surprise. The 27B Qwen3.6 ended up as the best **overall practical workhorse**, with a better balance of depth, readability, and usability than the others.

## Final ranking

1. **Qwen3.6-27B** — best all-around balance
2. **Gemma4** — best editor / strategist
3. **Qwen3.6-35B** — best exhaustive drafter
4. **Qwen3.5-27B** — solid, but clearly behind the others for this task

# 1) Best overall balance

**Qwen3.6-27B**

This is the new interesting winner. It doesn't beat Gemma4 on clarity or discipline. It doesn't beat Qwen3.6-35B on completeness. But it wins the thing that matters most for a real working master plan: **balance**. It's the best compromise between:

* readability
* completeness
* structure
* practical usefulness

# 2) Best editor / best strategist

**Gemma4**

If the goal is:

* cleanest finished document
* strongest executive readability
* best restraint
* best "this feels like a real deliberate plan"

then Gemma still wins.

# 3) Best exhaustive architecture quarry

**Qwen3.6-35B**

If the goal is:

* maximum implementation mass
* biggest architecture sourcebook
* richest mining material for downstream docs

then Qwen3.6-35B is still the beast.

# 4) Fourth place

**Qwen3.5-27B**

Not bad. Not embarrassing. But now clearly behind both Qwen3.6 variants and Gemma for this kind of long-form architecture/planning task.
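For reference, the per-category scores can be tabulated directly. This is a plain-Python sketch (the model names and numbers are exactly the ones above; nothing else is assumed) showing that taking the max in each category reproduces the stated winners:

```python
# Scores as reported above, one dict per category.
scores = {
    "clarity":      {"Gemma4": 9.4, "Qwen3.6-27B": 8.8, "Qwen3.6-35B": 8.1, "Qwen3.5-27B": 7.4},
    "completeness": {"Qwen3.6-35B": 9.6, "Qwen3.5-27B": 9.1, "Qwen3.6-27B": 8.7, "Gemma4": 7.9},
    "discipline":   {"Gemma4": 9.5, "Qwen3.6-27B": 8.6, "Qwen3.6-35B": 7.7, "Qwen3.5-27B": 6.8},
    "usefulness":   {"Qwen3.6-27B": 9.3, "Qwen3.6-35B": 9.2, "Gemma4": 8.9, "Qwen3.5-27B": 8.8},
}

# Highest score in each category = that category's winner.
winners = {category: max(model_scores, key=model_scores.get)
           for category, model_scores in scores.items()}
# Gemma4 takes clarity and discipline; the two Qwen3.6 variants split the rest.
```

Note the final ranking is a judgment call on top of these numbers, not a simple mean, since usefulness/balance was weighted most heavily.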
## Actual takeaway

This ended up being a really clean split:

- **Gemma4 = best editor**
- **Qwen3.6-35B = best expander**
- **Qwen3.6-27B = best practical default**
- **Qwen3.5-27B = respectable, but not the winner**

So if I were setting a default local writing worker for long-form architecture/master-plan work today, I'd probably choose **Qwen3.6-27B**. It's the best compromise between:

- readability
- completeness
- structure
- practical usefulness

Personal note re Gemma4: it was **drastically** shorter than the Qwens in its final output.

* **Gemma4** → **147 lines**
* **Qwen3.6-35B** → **725 lines**
* **Qwen3.5-27B** → **840 lines**
* **Qwen3.6-27B** → **555 lines**

So while I do agree that less is often more, I found the Gemma4 output lacking in both technical depth and detail. Sure, it captured the core concepts, but I would position its output as more of a pitch deck or high-level concept; the technical details and concepts are sorely missing. On the other end of the spectrum, Qwen3.6-35B delivered roughly 5x the volume. That document could genuinely serve as a technical blueprint and architecture implementation bible. Qwen3.5-27B produced even more, but that was quantity over quality. I would honestly have rated Gemma4 less favourably than Manny did, so make of that what you will.

For **first-draft-only** performance, I'd rank them:

# One-shot ranking

1. **Qwen3.6-27B**
2. **Qwen3.6-35B**
3. **Qwen3.5-27B**
4. **Gemma4**

# Why

# 1) Qwen3.6-27B

Best balance right out of the gate:

* strong product framing
* solid structure
* good density
* less bloated than the other Qwens
* more complete than Gemma's first draft

This was the best **raw first shot**.

# 2) Qwen3.6-35B

Very strong one-shot draft, but more sprawling:

* most exhaustive
* richest implementation mass
* more likely to over-include
* better sourcebook than polished masterplan on first pass

If you want maximum raw material, this one was a beast.
# 3) Qwen3.5-27B

Good first-draft generator, but sloppier:

* ambitious
* broad
* lots of content
* weaker discipline and coherence than the 3.6 models

Still useful, but clearly behind both 3.6 variants.

# 4) Gemma4

Gemma (arguably) won the **final polished-document** contest, but not the first-draft contest. Its one-shot behaviour was:

* too compressed
* too selective
* not thorough enough for the initial task

It needed the later revision passes to get more substance. Depending on the audience, this may be either good or bad.

# Short version

* **Best one-shot:** Qwen3.6-27B
* **Best after revision/polish:** Gemma4
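For anyone wanting to reproduce the setup, the three-pass workflow (draft → revision → polish, each stage chained into the next) can be sketched roughly like this. It's a minimal sketch, not my actual harness: the `complete` callable stands in for whatever local backend you run (vLLM, LM Studio, etc.), and the stage instructions are placeholders, not the real prompts, which were directed per-stage by the reviewer agent:

```python
from typing import Callable

# Placeholder stage directions; in the real run each stage was
# directed and reviewed by a separate reviewer agent.
STAGES = [
    ("draft", "Write a unified Masterplan.md from the two blueprint docs."),
    ("revision", "Revise the draft: close gaps, tighten structure, keep every section."),
    ("polish", "Final polish: clarity and pacing only, add no new content."),
]

def run_pipeline(source_docs: str, complete: Callable[[str], str]) -> dict[str, str]:
    """Feed each stage's output into the next; keep all intermediate outputs."""
    outputs: dict[str, str] = {}
    current = source_docs
    for name, instruction in STAGES:
        current = complete(f"{instruction}\n\n---\n\n{current}")
        outputs[name] = current
    return outputs

# Stub backend for illustration: tags the text with the prompt's first word.
result = run_pipeline("v1 + v2 blueprints", lambda prompt: f"[{prompt.split()[0]}] ok")
```

Injecting `complete` keeps the loop backend-agnostic, so the same three-pass harness runs unchanged against each of the four quantized models.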
A user compares Qwen3.6 35B-A3B and Gemma 4 26B-A4B-IT running locally on a 16GB VRAM GPU via LM Studio, finding Qwen3.6 produces more detailed outputs while both run at comparable speeds. The post is an informal community comparison using quantized models.
A user benchmarks three Qwen models (Qwen3.5-27B dense, Qwen3.5-122B-A10B MoE, Qwen3.6-35B-A3B MoE) on 4x RTX 3090 GPUs under real agentic workloads, finding that MoE models consistently underperform the dense 27B at following strict global rules despite speed advantages, with the Qwen3.6-35B leading in generation throughput.
A developer benchmarked 21 local LLMs on MacBook Air M5 using HumanEval+ and found Qwen 3.6 35B-A3B (MoE) leads at 89.6% with 16.9 tok/s, while Qwen 2.5 Coder 7B offers the best RAM-to-performance ratio at 84.2% in 4.5 GB. Notably, Gemma 4 models significantly underperformed expectations (31.1% for 31B), possibly due to Q4_K_M quantization effects.
Personal benchmark shows Gemma-4E4B tops for routing, Qwen-3.6 27/30B beats Gemma-4 for coding, and MiniMax M2.7 MXFP4 replaces giant Qwen-3.5 quants in an OpenCode llama-swap workflow.