What's your experience with Gemma4 QAT?
Summary
A user shares a positive experience with the Gemma 4 QAT model, noting quality improvements and speed gains with MTP, and asks others about their experiences.
Similar Articles
[3090] Gemma4 QAT + MTP quick TPS numbers [TLDR 1.2-1.8x better]
Benchmark results showing 1.2-1.8x token-per-second speedups on Gemma 4 models (12B and 26B) using QAT and MTP on a 24GB RTX 3090 GPU.
Gemma 4 12b QAT is a regression for my use case, despite all the hype.. Not my main Squeeze
The author reports that the Gemma 4 12b QAT model suffers from a regression in tool calling and coding tasks compared to the standard Q5_K_L version, due to a bug involving control token misconfiguration. Despite high token speed, the model's inconsistent outputs make it unsuitable for agent workflows.
Gemma 4 26B A4B IT QAT Comparison
A user benchmarks three quantized versions of Gemma 4 26B IT (4-bit, 6-bit, and 8-bit QAT) on MMLU_PRO and HumanEval. The 8-bit QAT model scores below the 6-bit quant on HumanEval and is not clearly better than the 4-bit one, which calls into question the superiority of QAT for this model.
Gemma 4 QAT benchmark results (AMD 7900 XTX): faster, less VRAM, no quality loss
A user benchmarks Google's Gemma 4 QAT models on an AMD 7900 XTX, reporting up to 45% faster generation, 83% higher throughput, and significant VRAM savings (e.g., 5.7GB for the 12B QAT model) with no quality loss compared to standard weights.
@osanseviero: Gemma 4 MTP just got officially merged into llama.cpp This means you can use Gemma 4 QAT + MTP for a lightweight + supe…
Gemma 4 MTP has been merged into llama.cpp, enabling lightweight and fast inference with Gemma 4 QAT and MTP.