@zhixianio: Finished testing, feeling quite surprised, not sure if I'm using it wrong. Feel free to provide counterexamples. Here are my results: On M5 Max, pitting this community fine-tuned gemma-4-12B-coder (llama.cpp) against my daily driver Qwen3.6-35B-…


Summary

The user tested the community fine-tuned gemma-4-12B-coder against Qwen3.6-35B-A3B MoE on three programming tasks, finding that gemma performed poorly on complex stateful programs, while Qwen 35B remained robust.


Cached at: 06/15/26, 03:07 PM

Finished testing, and I’m honestly quite surprised by the results. Not sure if I’m using it wrong — feel free to offer counterexamples. Here’s what I found:

On the M5 Max, I pitted this community fine-tuned gemma-4-12B-coder (llama.cpp) against my daily driver Qwen3.6-35B-A3B MoE (oMLX). Three tasks:

  • Matplotlib data chart: Tie — both ran correctly on the first try, properly formatted.
  • Three.js galaxy particle effect: Qwen produced a rotatable, zoomable galaxy; gemma gave a black screen (missing importmap, a CDN version number off by one digit, particles too small to see: three bugs stacked together).
  • A fully playable Tetris: Qwen's actually works (blocks drop, lines clear, score reached 117, Next preview all present); gemma's blocks don't drop at all and the score stays at 0.
  • Video: Left = Tetris comparison, Right = Galaxy effect comparison

Later, I also downloaded the original gemma-4-12B-it (also 4-bit quantized) and ran the same Tetris task → still broken (empty board, score NaN, line count jumping randomly). This shows the bottleneck is the 12B parameter scale, which can't handle complex programs that need long, stateful, single-shot generation; it has nothing to do with the fine-tuning.
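To make "stateful" concrete, here is a minimal Python sketch of the state a Tetris loop has to keep consistent on every tick. It is not the code either model generated. The `Game` class, the 10×20 board, the single O piece, and the 100-points-per-line rule are all assumptions for illustration (the scoring in the test above was clearly different, since the score reached 117). The failures described above map onto this state: a piece that never falls, a score that becomes NaN, or a line count that jumps all mean one of these fields was updated inconsistently.

```python
from dataclasses import dataclass, field

WIDTH, HEIGHT = 10, 20  # assumed board size, not taken from the post
SPAWN = [(4, 0), (5, 0), (4, 1), (5, 1)]  # hypothetical O piece as (x, y) cells


@dataclass
class Game:
    # board[y][x] is 1 for a locked cell, 0 for empty; row 0 is the top row
    board: list = field(default_factory=lambda: [[0] * WIDTH for _ in range(HEIGHT)])
    piece: list = field(default_factory=lambda: list(SPAWN))
    score: int = 0  # kept as an int so it can never turn into NaN
    lines: int = 0

    def can_move(self, dx, dy):
        """Check whether the falling piece can shift by (dx, dy)."""
        for x, y in self.piece:
            nx, ny = x + dx, y + dy
            if nx < 0 or nx >= WIDTH or ny < 0 or ny >= HEIGHT:
                return False
            if self.board[ny][nx]:
                return False
        return True

    def clear_lines(self):
        """Remove full rows, add empty rows on top, update the counters."""
        kept = [row for row in self.board if not all(row)]
        cleared = HEIGHT - len(kept)
        self.board = [[0] * WIDTH for _ in range(cleared)] + kept
        self.lines += cleared
        self.score += 100 * cleared  # hypothetical scoring rule

    def tick(self):
        """Advance one gravity step. Returns True if the piece locked."""
        if self.can_move(0, 1):
            self.piece = [(x, y + 1) for x, y in self.piece]
            return False
        for x, y in self.piece:
            self.board[y][x] = 1
        self.clear_lines()
        self.piece = list(SPAWN)  # spawn the next piece
        return True
```

Even this stripped-down version has four pieces of state (board, piece, score, lines) that must all change together when a piece locks. A full game adds rotation, input handling, the Next preview, and rendering on top, all of which the model has to get right in one pass.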

One more interesting finding: with thinking mode enabled, the original model spent all 12,000 tokens “thinking” and never output a single line of code; the coder fine-tune, by contrast, learned to “think a little, then act”. Fine-tuning improves convergence and efficiency, but it doesn’t raise the ceiling of a 12B model.

My Qwen 35B still sits firmly on the sweet spot throne.

Hugging Models (@HuggingModels):
Gemma 4 12B Coder is here and it’s a game changer for local code generation. This GGUF model packs Google’s latest gemma-4 architecture into a compact 12B size, perfect for running on consumer hardware. It’s optimized for reasoning and thinking, making it ideal for developers who
