@_philschmid: More Gemma 4! New QAT Gemma 4 checkpoints with similar performance while using ~4x less memory! It comes with a new mob…

X AI KOLs Following 06/08/26, 02:24 PM Models

gemma-4 quantization-aware-training qat model-compression mobile memory-footprint huggingface

Summary

New QAT Gemma 4 checkpoints offer similar performance with ~4x less memory, enabling a 1GB memory footprint for Gemma 4 E2B via a new mobile quantization format.

More Gemma 4! New QAT Gemma 4 checkpoints with similar performance while using ~4x less memory! They come with a new mobile quantization format that reduces the memory footprint of Gemma 4 E2B to just 1GB. Quantization-Aware Training (QAT) simulates low-precision operations during training so that the model can be quantized afterwards with little to no accuracy loss, giving smaller, faster models. Available on @huggingface and directly runnable.
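To make the "simulates low-precision operations during training" idea concrete, here is a minimal sketch of the fake-quantization step that QAT schemes typically apply in the forward pass. It is an illustration under stated assumptions, not the recipe used for the Gemma 4 checkpoints: the symmetric per-tensor 4-bit grid, the 16-bit baseline used for the size comparison, and the helper names `fake_quantize` and `packed_size_bytes` are all hypothetical choices made for this example.

```python
def fake_quantize(weights, bits=4):
    """Quantize-dequantize round trip (assumed symmetric, per-tensor).

    The returned values are still floats, but each one sits on a signed
    `bits`-bit grid, so the forward pass sees the rounding error that
    real quantization would introduce later.
    """
    qmax = 2 ** (bits - 1) - 1      # largest integer level, 7 for 4-bit
    qmin = -(2 ** (bits - 1))       # smallest integer level, -8 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0

    dequantized = []
    for w in weights:
        q = round(w / scale)            # snap to the nearest integer level
        q = max(qmin, min(qmax, q))     # clamp to the representable range
        dequantized.append(q * scale)   # map back to float for the forward pass
    return dequantized, scale


def packed_size_bytes(num_weights, bits):
    """Storage for the weights alone, ignoring scales and other metadata."""
    return (num_weights * bits + 7) // 8


if __name__ == "__main__":
    weights = [1.0, -0.5, 0.25, 0.0]
    deq, scale = fake_quantize(weights, bits=4)
    print("scale:", scale)
    print("fake-quantized:", deq)
    # Assumed baseline: 16-bit weights versus a 4-bit quantized format.
    print("16-bit bytes:", packed_size_bytes(1000, 16))
    print("4-bit bytes:", packed_size_bytes(1000, 4))
```

During actual QAT the rounding step is non-differentiable, so training frameworks usually pass gradients through it unchanged (a straight-through estimator); that part is omitted here. The size helper shows where a figure of roughly 4x can come from if the baseline is 16-bit and the target is 4-bit, which is an assumption rather than something the post states.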

Cached at: 06/08/26, 03:22 PM

Similar Articles

Gemma 4 QAT models: Optimizing compression for mobile and laptop efficiency

Hacker News Top

Google releases Gemma 4 models optimized with Quantization-Aware Training (QAT) to improve efficiency for mobile and laptop deployment, reducing memory footprint to 1GB for the E2B model while preserving quality.

Gemma 4 QAT 31B responds better to KV cache quantization too

Reddit r/LocalLLaMA

The Gemma 4 QAT 31B model demonstrates improved behavior with KV cache quantization, suggesting enhanced inference efficiency.

Gemma 4 QAT benchmark results (AMD 7900 XTX): faster, less VRAM, no quality loss

Reddit r/LocalLLaMA

A user benchmarks Google's Gemma 4 QAT models on an AMD 7900 XTX, reporting up to 45% faster generation, 83% higher throughput, and significant VRAM savings (e.g., 5.7GB for the 12B QAT model) with no quality loss compared to standard weights.

Google's quantization aware trained Gemma checkpoints enabling mobile device inference just dropped on HF

Reddit r/singularity

Google released quantization-aware trained Gemma 4 checkpoints on HuggingFace, optimized for mobile device inference and available in QAT Mobile and Q4_0 variants.

Gemma 4 26B A4B IT QAT Comparison

Reddit r/LocalLLaMA

A user benchmarks three quantized versions of Gemma 4 26B IT (4-bit, 6-bit, and 8-bit QAT) on MMLU_PRO and HumanEval, finding that the QAT 8-bit model performs worse than the 6-bit quant on HumanEval and is not clearly better than 4-bit, questioning the superiority of QAT for this model.
