The author reports that the Gemma 4 12b QAT model suffers from a regression in tool calling and coding tasks compared to the standard Q5_K_L version, due to a bug involving control token misconfiguration. Despite high token speed, the model's inconsistent outputs make it unsuitable for agent workflows.
I spent the last few days trying to get consistent tool calling out of the new Gemma 4 12b QAT model and had to give up. When the model actually works, it works great, but for my specific use case and workflows it is just not for me. It is a major regression compared to the standard Q5\_K\_L version, which worked without issue.

I know the general consensus is that Qwen is for coding and Gemma is for creatives. But I can tell you for a fact that I code very well with the regular Q5\_K\_L version. When factoring in prompt structure, edits, and specific coding languages, I was able to generate 2,300 solid lines of code on a project (fully debugged, architecturally sound, and tested). Additionally, I was able to generate 10,000 lines of story writing on a generic prompt about a samurai.

Speed is not everything. The main problem with this QAT model is that it constantly questions itself during generation. I tried using it for coding in my custom VS Code extension, writing stories, and real use cases, but the results are completely inconsistent despite hitting a solid 60 tokens a second.

To rule out any backend or hardware misconfiguration, here is the continuous startup block from my server logs showing the exact GPU detection, thread assignment, context allocation, and the native template auto-match:

    0.00.074.191 I - CUDA0 : NVIDIA GeForce RTX 4080 SUPER (16375 MiB, 15061 MiB free)
    0.00.074.205 I - CPU : 12th Gen Intel(R) Core(TM) i7-12700KF (98097 MiB, 86472 MiB free)
    0.00.074.254 I system_info: n_threads = 12 (n_threads_batch = 12) / 20 | CUDA : ARCHS = 890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
    0.00.074.293 I srv init: using 19 threads for HTTP server
    0.00.080.574 I srv load_model: loading model 'E:\models\gemma-4-12B-it-qat-UD-Q4_K_XL.gguf'
    0.01.205.117 W load: control-looking token: 50 '<|tool_response>' was not control-type; this is probably a bug in the model. its type will be overridden
    0.01.205.496 W load: control-looking token: 212 '</s>' was not control-type; this is probably a bug in the model. its type will be overridden
    0.01.242.092 W load: special_eog_ids contains '<|tool_response|>', removing '</s>' token from EOG list
    0.03.279.202 W llama_context: n_ctx_seq (32768) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
    0.03.370.810 I slot load_model: id 0 | task -1 | new slot, n_ctx = 32768
    0.03.370.887 I srv load_model: prompt cache is enabled, size limit: 8192 MiB
    4.07.196.023 I srv params_from_: Chat format: peg-gemma4

The hardware lines show the 4080 Super is utilized cleanly and the thread count matches the i7-12700KF topology. The server successfully initialized the 32768 context size and auto-detected the proper native peg-gemma4 chat layout from the model metadata on its own. That leaves the token bug shown in the warnings as the cause of the broken tool calling.

The core failure point is right there in the `control-looking token` warnings: the model is misconfiguring and overriding its own tool response tags before it even starts processing, breaking structured function execution. If you rely on agent workflows or developer extensions, save your time and stick to the regular quants.
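If you want to check whether your own GGUF triggers the same warnings, here is a minimal Python sketch that scans a saved startup log for them. The regex is an assumption based only on the two warning lines quoted above, so other server builds may word the message differently, and `find_overridden_tokens` is just an illustrative helper name, not part of any tool.

```python
import re

# Assumption: the warning wording matches the lines quoted in the log above.
CONTROL_WARNING = re.compile(
    r"control-looking token:\s*(?P<id>\d+)\s+'(?P<text>[^']*)'\s+was not control-type"
)


def find_overridden_tokens(log_text):
    """Return (token_id, token_text) for each control-type override warning."""
    return [
        (int(match.group("id")), match.group("text"))
        for match in CONTROL_WARNING.finditer(log_text)
    ]


# Two warning lines copied from the startup log in this post.
sample_log = (
    "0.01.205.117 W load: control-looking token: 50 '<|tool_response>' "
    "was not control-type; this is probably a bug in the model. "
    "its type will be overridden\n"
    "0.01.205.496 W load: control-looking token: 212 '</s>' "
    "was not control-type; this is probably a bug in the model. "
    "its type will be overridden\n"
)

for token_id, token_text in find_overridden_tokens(sample_log):
    print(f"token {token_id}: {token_text}")
```

Running the same check on the log from a regular quant of the same model would show whether the warnings are specific to the QAT file. An empty result only means this exact warning was not logged, not that tool calling is fine.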
A user benchmarks three quantized versions of Gemma 4 26B IT (4-bit, 6-bit, and 8-bit QAT) on MMLU_PRO and HumanEval, finding that the QAT 8-bit model performs worse than the 6-bit quant on HumanEval and is not clearly better than 4-bit, questioning the superiority of QAT for this model.
Google releases Gemma 4 models optimized with Quantization-Aware Training (QAT) to improve efficiency for mobile and laptop deployment, reducing memory footprint to 1GB for the E2B model while preserving quality.
A Google Gemma team member has confirmed that Gemma 4 QAT (Quantization-Aware Training) models will be releasing soon, suggesting users wait before testing their own quantizations.
A user reports that the QAT variant of Gemma 4 26B A4B performs worse than the non-QAT version on a chessboard SVG test, drawing pieces unstably even with the suggested settings.