A hands-on benchmark of four local LLMs—Qwen3.6-27B, Qwen3.6-35B, Qwen3.5-27B and Gemma 4—on a 20k-token architecture-writing task shows Qwen3.6-27B delivering the best overall balance of clarity, completeness and usefulness on an RTX 5090.
I ran a pretty simple but revealing local-LLM test. At first I was only going to post about the two Qwens and Gemma4 and go to bed, but what do you know, I went on reddit and saw a post that Qwen3.6-27B had dropped. Oh well...

Models tested:

* **Gemma4**
  * `cyankiwi/gemma-4-31B-it-AWQ-4bit`
* **Qwen3.6-35B**
  * `RedHatAI/Qwen3.6-35B-A3B-NVFP4`
* **Qwen3.5-27B**
  * `QuantTrio/Qwen3.5-27B-AWQ`
* **Qwen3.6-27B**
  * `cyankiwi/Qwen3.6-27B-AWQ-INT4`

Context: I'm working on a fairly complex tool that takes noisy evidence and turns it into a structured "truth report." I gave the same Hermes writing agent ("Scribe") the same task: take 2 architecture blueprint docs (v1 baseline + v2 expansion) describing the "truth engine" and produce a unified `Masterplan.md` explaining:

- what the product is
- the user problem
- UX/product shape
- UVP/moat
- pipeline
- agent roles
- architecture
- trust/legal/provenance posture
- what changed between plan V1 and V2

V1: ~16k tokens, V2: ~4.6k tokens, combined: ~20.6k tokens.

Then I ran the full workflow locally on my RTX 5090 with all 4 models:

- **Gemma4**
- **Qwen3.6-35B**
- **Qwen3.5-27B**
- **Qwen3.6-27B**

To make it fair and push the models, each model got:

1. initial draft
2. second-pass revision
3. final polish

Each stage was directed and reviewed by my GPT-5.4 agent Manny, so this wasn't just "ask once and compare vibes."

## What I/Manny scored

- **Clarity**
- **Completeness**
- **Discipline**
- **Usefulness**

## Final results

### Clarity

- Gemma4: **9.4**
- Qwen3.6-27B: **8.8**
- Qwen3.6-35B: **8.1**
- Qwen3.5-27B: **7.4**

**Winner: Gemma4** (at a cost; read further below)

Gemma was the best editor. Cleanest structure, best pacing, strongest restraint.
---

### Completeness

- Qwen3.6-35B: **9.6**
- Qwen3.5-27B: **9.1**
- Qwen3.6-27B: **8.7**
- Gemma4: **7.9**

**Winner: Qwen3.6-35B**

The 35B Qwen wrote the most exhaustive architecture doc by far. Best sourcebook, most implementation mass.

---

### Discipline

- Gemma4: **9.5**
- Qwen3.6-27B: **8.6**
- Qwen3.6-35B: **7.7**
- Qwen3.5-27B: **6.8**

**Winner: Gemma4**

Gemma best preserved the actual product identity.

---

### Usefulness

- Qwen3.6-27B: **9.3**
- Qwen3.6-35B: **9.2**
- Gemma4: **8.9**
- Qwen3.5-27B: **8.8**

**Winner: Qwen3.6-27B**

This was the surprise. The 27B Qwen3.6 ended up as the best **overall practical workhorse**, with a better balance of depth, readability, and usability than the others.

## Final ranking

1. **Qwen3.6-27B** — best all-around balance
2. **Gemma4** — best editor / strategist
3. **Qwen3.6-35B** — best exhaustive drafter
4. **Qwen3.5-27B** — solid, but clearly behind the others for this task

# 1) Best overall balance

**Qwen3.6-27B**

This is the new interesting winner. It doesn't beat Gemma4 on clarity or discipline. It doesn't beat Qwen3.6-35B on completeness. But it wins the thing that matters most for a real working master plan: **balance**. It's the best compromise between:

* readability
* completeness
* structure
* practical usefulness

# 2) Best editor / best strategist

**Gemma4**

If the goal is:

* cleanest finished document
* strongest executive readability
* best restraint
* best "this feels like a real deliberate plan"

then Gemma still wins.

# 3) Best exhaustive architecture quarry

**Qwen3.6-35B**

If the goal is:

* maximum implementation mass
* biggest architecture sourcebook
* richest mining material for downstream docs

then Qwen3.6-35B is still the beast.

# 4) Fourth place

**Qwen3.5-27B**

Not bad. Not embarrassing. But now clearly behind both Qwen3.6 variants and Gemma for this kind of long-form architecture/planning task.
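For reference, the per-category scores can be tabulated directly. This is a plain-Python sketch (the model names and numbers are exactly the ones above; nothing else is assumed) showing that taking the max in each category reproduces the stated winners:

```python
# Scores as reported above, one dict per category.
scores = {
    "clarity":      {"Gemma4": 9.4, "Qwen3.6-27B": 8.8, "Qwen3.6-35B": 8.1, "Qwen3.5-27B": 7.4},
    "completeness": {"Qwen3.6-35B": 9.6, "Qwen3.5-27B": 9.1, "Qwen3.6-27B": 8.7, "Gemma4": 7.9},
    "discipline":   {"Gemma4": 9.5, "Qwen3.6-27B": 8.6, "Qwen3.6-35B": 7.7, "Qwen3.5-27B": 6.8},
    "usefulness":   {"Qwen3.6-27B": 9.3, "Qwen3.6-35B": 9.2, "Gemma4": 8.9, "Qwen3.5-27B": 8.8},
}

# Highest score in each category = that category's winner.
winners = {category: max(model_scores, key=model_scores.get)
           for category, model_scores in scores.items()}
# Gemma4 takes clarity and discipline; the two Qwen3.6 variants split the rest.
```

Note the final ranking is a judgment call on top of these numbers, not a simple mean, since usefulness/balance was weighted most heavily.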
## Actual takeaway

This ended up being a really clean split:

- **Gemma4 = best editor**
- **Qwen3.6-35B = best expander**
- **Qwen3.6-27B = best practical default**
- **Qwen3.5-27B = respectable, but not the winner**

So if I were setting a default local writing worker for long-form architecture/master-plan work today, I'd probably choose **Qwen3.6-27B**. It's the best compromise between:

- readability
- completeness
- structure
- practical usefulness

Personal note re Gemma4: it was **drastically** shorter than the Qwens in its final output.

* **Gemma4** → **147 lines**
* **Qwen3.6-35B** → **725 lines**
* **Qwen3.5-27B** → **840 lines**
* **Qwen3.6-27B** → **555 lines**

So while I do agree that less is often more, I found the Gemma4 output lacking in both technical depth and detail. Sure, it captured the core concepts, but I would position its output as more of a pitch deck or high-level concept; the technical details and concepts are sorely missing. On the other end of the spectrum, Qwen3.6-35B delivered roughly 5x the volume. That document could genuinely serve as a technical blueprint and architecture implementation bible. Qwen3.5-27B produced even more, but that was quantity over quality. I would honestly have rated Gemma4 less favourably than Manny did, so make of that what you will.

For **first-draft-only** performance, I'd rank them:

# One-shot ranking

1. **Qwen3.6-27B**
2. **Qwen3.6-35B**
3. **Qwen3.5-27B**
4. **Gemma4**

# Why

# 1) Qwen3.6-27B

Best balance right out of the gate:

* strong product framing
* solid structure
* good density
* less bloated than the other Qwens
* more complete than Gemma's first draft

This was the best **raw first shot**.

# 2) Qwen3.6-35B

Very strong one-shot draft, but more sprawling:

* most exhaustive
* richest implementation mass
* more likely to over-include
* better sourcebook than polished masterplan on first pass

If you want maximum raw material, this one was a beast.
# 3) Qwen3.5-27B

Good first-draft generator, but sloppier:

* ambitious
* broad
* lots of content
* weaker discipline and coherence than the 3.6 models

Still useful, but clearly behind both 3.6 variants.

# 4) Gemma4

Gemma (arguably) won the **final polished-document** contest, but not the first-draft contest. Its one-shot behaviour was:

* too compressed
* too selective
* not thorough enough for the initial task

It needed the later revision passes to get more substance. Depending on the audience, this may be either good or bad.

# Short version

* **Best one-shot:** Qwen3.6-27B
* **Best after revision/polish:** Gemma4
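For anyone wanting to reproduce the setup, the three-pass workflow (draft → revision → polish, each stage chained into the next) can be sketched roughly like this. It's a minimal sketch, not my actual harness: the `complete` callable stands in for whatever local backend you run (vLLM, LM Studio, etc.), and the stage instructions are placeholders, not the real prompts, which were directed per-stage by the reviewer agent:

```python
from typing import Callable

# Placeholder stage directions; in the real run each stage was
# directed and reviewed by a separate reviewer agent.
STAGES = [
    ("draft", "Write a unified Masterplan.md from the two blueprint docs."),
    ("revision", "Revise the draft: close gaps, tighten structure, keep every section."),
    ("polish", "Final polish: clarity and pacing only, add no new content."),
]

def run_pipeline(source_docs: str, complete: Callable[[str], str]) -> dict[str, str]:
    """Feed each stage's output into the next; keep all intermediate outputs."""
    outputs: dict[str, str] = {}
    current = source_docs
    for name, instruction in STAGES:
        current = complete(f"{instruction}\n\n---\n\n{current}")
        outputs[name] = current
    return outputs

# Stub backend for illustration: tags the text with the prompt's first word.
result = run_pipeline("v1 + v2 blueprints", lambda prompt: f"[{prompt.split()[0]}] ok")
```

Injecting `complete` keeps the loop backend-agnostic, so the same three-pass harness runs unchanged against each of the four quantized models.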
A user compares Qwen3.6 35B-A3B and Gemma 4 26B-A4B-IT running locally on a 16GB VRAM GPU via LM Studio, finding Qwen3.6 produces more detailed outputs while both run at comparable speeds. The post is an informal community comparison using quantized models.
A user benchmarks three Qwen models (Qwen3.5-27B dense, Qwen3.5-122B-A10B MoE, Qwen3.6-35B-A3B MoE) on 4x RTX 3090 GPUs under real agentic workloads, finding that MoE models consistently underperform the dense 27B at following strict global rules despite speed advantages, with the Qwen3.6-35B leading in generation throughput.
A developer benchmarked 21 local LLMs on MacBook Air M5 using HumanEval+ and found Qwen 3.6 35B-A3B (MoE) leads at 89.6% with 16.9 tok/s, while Qwen 2.5 Coder 7B offers the best RAM-to-performance ratio at 84.2% in 4.5 GB. Notably, Gemma 4 models significantly underperformed expectations (31.1% for 31B), possibly due to Q4_K_M quantization effects.
Personal benchmark shows Gemma-4E4B tops for routing, Qwen-3.6 27/30B beats Gemma-4 for coding, and MiniMax M2.7 MXFP4 replaces giant Qwen-3.5 quants in an OpenCode llama-swap workflow.