Unsloth released Gemma 4 QAT MTP assistant models as GGUF files on Hugging Face, available in q8_0 and larger quantization formats.
A user benchmarks three quantized versions of Gemma 4 26B IT (4-bit, 6-bit, and 8-bit QAT) on MMLU_PRO and HumanEval. The QAT 8-bit model scores worse than the 6-bit quant on HumanEval and is not clearly better than the 4-bit quant, which leads the user to question whether QAT is superior for this model.
Google released Gemma 4 models with quantization-aware training (QAT) at Q4_0 precision on Hugging Face, offering efficient variants from 5B to 33B parameters.
New QAT Gemma 4 checkpoints offer similar performance with ~4x less memory, enabling a 1GB memory footprint for Gemma 4 E2B via a new mobile quantization format.
Benchmark results show 1.2-1.8x tokens-per-second speedups on Gemma 4 models (12B and 26B) using QAT and MTP on a 24GB RTX 3090 GPU.
The author reports that the Gemma 4 12B QAT model regresses in tool calling and coding tasks compared to the standard Q5_K_L version, and attributes this to a bug involving misconfigured control tokens. Despite high token speed, the model's inconsistent outputs make it unsuitable for agent workflows.
A technical comparison finds that Google's Q4_0 quantized Gemma 4 models use higher precision and keep more tensors at high precision than Unsloth's Q4_K_XL versions, which results in larger file sizes.
A user shares a positive experience with the Gemma 4 QAT model, noting quality improvements and speed gains with MTP, and asks others about their experiences.
A discussion on the potential of 2-bit Quantization-Aware Training (QAT) for larger MoE models, comparing it with 4-bit QAT and ternary LLMs and considering its feasibility on consumer hardware.
A user seeks clarification on the relationship between MTP (Multi-Token Prediction) and QAT (Quantization-Aware Training) in llama.cpp, particularly regarding GGUF compatibility for the Gemma 4 model and the new QAT string in filenames.
A user reports that the QAT quantized variant of Gemma 4 26B A4B performs worse on a chessboard SVG test than the non-QAT version, drawing pieces unstably despite using the suggested settings.
Google's Gemma 4 12B QAT model achieves 120 tok/s on a 12GB GPU using Multi-Token Prediction (MTP) with llama.cpp. The post includes a step-by-step guide, and a benchmark comparison against running without MTP shows a 2x speedup.
A discussion on whether it is sensible to apply alternative quantization methods to quantization-aware trained (QAT) models like Gemma 4, questioning whether Unsloth's benchmarks showing performance closer to the QAT fine-tunes are beneficial or counterproductive.
A user benchmarks Google's Gemma 4 QAT models on an AMD 7900 XTX, reporting up to 45% faster generation, 83% higher throughput, and significant VRAM savings (e.g., 5.7GB for the 12B QAT model) with no quality loss compared to standard weights.
Unsloth releases GGUF quantized versions of Google DeepMind's Gemma 4 models, optimized with Quantization-Aware Training (QAT) to reduce memory requirements while preserving quality, supporting multiple formats and sizes for diverse deployment.
A Google Gemma team member has confirmed that Gemma 4 QAT (Quantization-Aware Training) models will be released soon, and suggests that users wait before testing their own quantizations.
Announcement of an upcoming release of a quantized version of the B27 model using quantization-aware training (QAT), described as the smartest B27 yet.