@zhixianio: Finished testing, feeling quite surprised, not sure if I'm using it wrong. Feel free to provide counterexamples. Here are my results: On M5 Max, pitting this community fine-tuned gemma-4-12B-coder (llama.cpp) against my daily driver Qwen3.6-35B-…

X AI KOLs Timeline 06/15/26, 10:34 AM News

code-generation model-comparison gemma-4 qwen-3 benchmark llama-cpp

Summary

The user tested the community fine-tuned gemma-4-12B-coder against Qwen3.6-35B-A3B MoE on three programming tasks, finding that gemma performed poorly on complex stateful programs, while Qwen 35B remained robust.

Finished testing, and I'm quite astonished. Not sure if I'm using it the wrong way. Feel free to provide counterexamples. Here are my results: On M5 Max, pitting this community fine-tuned gemma-4-12B-coder (llama.cpp) against my daily driver Qwen3.6-35B-A3B MoE (oMLX). Three tasks: - matplotlib data chart → Tie, both ran correctly on first try with proper formatting. - three.js galaxy particle effect → Qwen produced a rotatable and scalable galaxy; gemma gave a black screen (missing importmap, CDN version number typo, particle size too small to see — three bugs combined). - A fully playable Tetris → Qwen actually works (falling, line clearing, score up to 117, Next preview all present); gemma's blocks don't fall at all, score stuck at 0. - Video: Left = Tetris comparison, Right = galaxy effect comparison. Later I also downloaded the original gemma-4-12B-it (also 4-bit quantized) and ran the same Tetris → still crashed (empty board, score NaN, line count jumping). This shows the bottleneck is that the 12B scale cannot handle long, stateful, one-shot complex programs, and it's not about fine-tuning. Also found an interesting point: with the original model in thinking mode, all 12,000 tokens were spent on 'thinking', not a single line of code output; whereas the coder fine-tuned version learned to 'think a little then act'. Fine-tuning improves convergence/efficiency, but doesn't raise the ceiling of 12B. My Qwen 35B still sits solidly on the sweet spot throne.

Original Article

View Cached Full Text

Cached at: 06/15/26, 03:07 PM

Finished testing, and I’m honestly quite surprised by the results. Not sure if I’m using it wrong — feel free to offer counterexamples. Here’s what I found:

On the M5 Max, I pitted this community fine-tuned gemma-4-12B-coder (llama.cpp) against my daily driver Qwen3.6-35B-A3B MoE (oMLX). Three tasks:

Matplotlib data chart: Tie — both ran correctly on the first try, properly formatted.
Three.js galaxy particle effect: Qwen produced a rotatable and zoomable galaxy; gemma gave a black screen (missing importmap, wrong CDN version number by one digit, particle size too small to see — three bugs stacked together).
A fully playable Tetris: Qwen actually works (block drop, line clear, score up to 117, Next preview all present); gemma’s blocks don’t drop at all, score stays at 0.
Video: Left = Tetris comparison, Right = Galaxy effect comparison

Later, I also downloaded the original gemma-4-12B-it (also 4-bit quantized) and ran the same Tetris → still broken (empty board, score NaN, line count jumping randomly). This shows the bottleneck is that the 12B parameter size can’t handle complex programs that require “long context, stateful, single-shot generation”, unrelated to fine-tuning.

Also found an interesting point: when enabling thinking mode on the original version, it used up 12,000 tokens just “thinking” without outputting a single line of code; the coder fine-tuned version, on the other hand, learned to “think a little then act”. Fine-tuning improves convergence/efficiency, but doesn’t raise the ceiling of a 12B model.

My Qwen 35B still sits firmly on the sweet spot throne.

Hugging Models (@HuggingModels):
Gemma 4 12B Coder is here and it’s a game changer for local code generation. This GGUF model packs Google’s latest gemma-4 architecture into a compact 12B size, perfect for running on consumer hardware. It’s optimized for reasoning and thinking, making it ideal for developers who

@zhixianio: Finished testing, feeling quite surprised, not sure if I'm using it wrong. Feel free to provide counterexamples. Here are my results: On M5 Max, pitting this community fine-tuned gemma-4-12B-coder (llama.cpp) against my daily driver Qwen3.6-35B-…

Similar Articles

Gemma 4 beats Qwen 3.5 (UPDATE), and Qwen 3.6 27B + MiniMax M2.7 is the best OpenCode setup

Qwen3.6:27b single-shot fixed a CSS UI bug that had Gemma4:26B doom looping uselessly for 15 minutes

gemma-4-12b-it vs Qwen3.5-9B on shared benchmarks: Qwen is overall winner beating gemma in 5/8 benchmarks despite a smaller footprint

Layman's comparison on Qwen3.6 35b-a3b and Gemma4 26b-a4b-it

I benchmarked 21 local LLMs on a MacBook Air M5 for code quality AND speed

Submit Feedback

Similar Articles

Gemma 4 beats Qwen 3.5 (UPDATE), and Qwen 3.6 27B + MiniMax M2.7 is the best OpenCode setup

Qwen3.6:27b single-shot fixed a CSS UI bug that had Gemma4:26B doom looping uselessly for 15 minutes

gemma-4-12b-it vs Qwen3.5-9B on shared benchmarks: Qwen is overall winner beating gemma in 5/8 benchmarks despite a smaller footprint
Qwen3.5-9B outperforms gemma-4-12b-it on 5 of 8 benchmarks despite having a smaller footprint, with gemma only slightly better at coding.

Layman's comparison on Qwen3.6 35b-a3b and Gemma4 26b-a4b-it

I benchmarked 21 local LLMs on a MacBook Air M5 for code quality AND speed