What's your experience with Gemma4 QAT?

Reddit r/LocalLLaMA 06/08/26, 12:11 AM Models

gemma4 qat quantization user-experience performance roleplay mtp

Summary

User shares positive experience with Gemma4 QAT model, noting quality improvements and speed gains with MTP, and asks others for their experiences.

Hey everyone! Not a native speaker, so please correct my English where I make mistakes (I can only learn from it!). Although it has only been out for a short while, I wanted to post about it because it's been such a joy.

To say it upfront: I use Qwen3.6 27B for programming and Gemma4 for basically everything else, so I can't say anything meaningful about programming.

Previously I used Gemma4-31B Q4_K_L (for long-context tasks, 128k with Q8_0 cache) and Q6_K_L (for short-context tasks, 32k with Q8_0 cache). For short-context tasks, think quick translations, roleplaying, short but accurate OCR, etc. For long-context tasks, think long-document parsing, web-search research, etc.

With the QAT model, I've been able to use the same model for both kinds of tasks (nice!) and I notice subtle quality improvements. In roleplay, for example, it uses a much more varied vocabulary, makes more context-relevant remarks, understands correlations better and is able to use them, etc. Sadly I have no experience with the Q8_0 model, but from what I can tell the QAT model performs at least better than bartowski's Q6_K_L. It is, however, still severely hampered by cache quantization: a Q8_0 cache shows noticeable degradation for me at 128K.

Using MTP with Gemma4 31B QAT has been amazing too! I get 50 t/s tg (as opposed to 21 t/s) when summarizing a 32k-token Wikipedia page, and ~36 t/s tg during roleplay (as opposed to 20 t/s). You can likely get higher numbers on Linux (I'm stuck with Windows for now...). I had to dial it in, though: 5 max drafts seemed to work well for me, but for my friends 4 or 6 worked better. Try each value from 3 to 7 in five separate runs of the same task and see which one runs best for you.

So yeah, enough about my experiences! How was yours? Do you notice any improvement or degradation when using the QAT models? And what is programming like on it?
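For anyone who wants to redo the math or compare their own draft settings, here is a minimal Python sketch. It only does arithmetic on numbers you measure yourself; it does not launch any model. The function names `speedup` and `best_draft_count` are made up for this example, and the per-draft values in the last call are arbitrary placeholders, not measurements.

```python
from __future__ import annotations


def speedup(with_mtp_tps: float, baseline_tps: float) -> float:
    """Return how many times faster generation is with MTP than without."""
    if baseline_tps <= 0:
        raise ValueError("baseline_tps must be positive")
    return with_mtp_tps / baseline_tps


def best_draft_count(results: dict[int, float]) -> int:
    """Return the max-draft setting with the highest measured tokens/s."""
    if not results:
        raise ValueError("no measurements given")
    return max(results, key=results.get)


# Tokens/s figures quoted in the post (text generation, with vs. without MTP).
print(f"summarization: {speedup(50, 21):.2f}x")  # prints "summarization: 2.38x"
print(f"roleplay: {speedup(36, 20):.2f}x")       # prints "roleplay: 1.80x"

# Placeholder values only (NOT real measurements): replace them with the
# tokens/s you observe for each max-draft setting on the same task.
placeholder_runs = {3: 1.0, 4: 2.0, 5: 3.0, 6: 2.0, 7: 1.0}
print("best max drafts:", best_draft_count(placeholder_runs))
```

With the figures from the post, that works out to roughly 2.4x for summarization and 1.8x for roleplay.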

Similar Articles

[3090] Gemma4 QAT + MTP quick TPS numbers [TLDR 1.2-1.8x better]
Benchmark results showing 1.2-1.8x token-per-second speedups on Gemma 4 models (12B and 26B) using QAT and MTP on a 24GB RTX 3090 GPU.

Gemma 4 12b QAT is a regression for my use case, despite all the hype.. Not my main Squeeze

Gemma 4 26B A4B IT QAT Comparison

Gemma 4 QAT benchmark results (AMD 7900 XTX): faster, less VRAM, no quality loss

@osanseviero: Gemma 4 MTP just got officially merged into llama.cpp This means you can use Gemma 4 QAT + MTP for a lightweight + supe…