qat

#qat

Compressed Whisper large-v3-turbo to 368 MB with Q3_K-matched QAT — multilingual WER results

Reddit r/openclaw ↗ · 2026-06-28

Whisper large-v3-turbo has been compressed to 368 MB using Q3_K-matched quantization-aware training, with multilingual word error rate results reported.

#qat

Gemma 4 QAT 31B responds better to KV cache quantization too

Reddit r/LocalLLaMA ↗ · 2026-06-22

The Gemma 4 QAT 31B model reportedly also responds better to KV cache quantization, which would further reduce memory use during inference.

#qat

@analogalok: Gemma 4 12B QAT (dense) achieves 1000+ tokens/sec prefill on 8GB VRAM with 120k context Gemma 4 12B QAT (dense), TurboQ…

X AI KOLs Following ↗ · 2026-06-18

Gemma 4 12B QAT (dense) achieves over 1000 tokens per second prefill on an 8GB RTX 4060 with 120k context using TurboQuant, enabling full GPU layer offloading. This represents a 42% increase in prefill speed over previous methods.

#qat

moar QAT stuff and hairy ticks

Reddit r/LocalLLaMA ↗ · 2026-06-15

The author releases improved GGUF quantized versions of Gemma 4 models (12B and 31B) using a more accurate quantization-aware training process that achieves lower KLD and higher same-top percentage than stock quantizations.

#qat

Unsloth Gemma 4 QAT MTP assistant models now available

Reddit r/LocalLLaMA ↗ · 2026-06-09

Unsloth released Gemma 4 QAT MTP assistant models as GGUF files on Hugging Face, available in q8_0 and larger quantization formats.

#qat

Gemma 4 26B A4B IT QAT Comparison

Reddit r/LocalLLaMA ↗ · 2026-06-09

A user benchmarks three quantized versions of Gemma 4 26B A4B IT (4-bit, 6-bit, and 8-bit QAT) on MMLU_PRO and HumanEval. The 8-bit QAT model scores worse than the 6-bit quant on HumanEval and is not clearly better than the 4-bit one, which leads the author to question whether QAT is superior for this model.

#qat

@_philschmid: Weights: https://huggingface.co/collections/google/gemma-4-qat-q4-0… Blog: https://blog.google/innovation-and-ai/techno…

X AI KOLs Following ↗ · 2026-06-08

Google released Gemma 4 models with quantization-aware training (QAT) at Q4_0 precision on Hugging Face, offering efficient variants from 5B to 33B parameters.

#qat

@_philschmid: More Gemma 4! New QAT Gemma 4 checkpoints with similar performance while using ~4x less memory! It comes with a new mob…

X AI KOLs Following ↗ · 2026-06-08

New QAT Gemma 4 checkpoints offer similar performance with ~4x less memory, enabling a 1GB memory footprint for Gemma 4 E2B via a new mobile quantization format.

#qat

[3090] Gemma4 QAT + MTP quick TPS numbers [TLDR 1.2-1.8x better]

Reddit r/LocalLLaMA ↗ · 2026-06-08

Benchmark results show 1.2-1.8x tokens-per-second speedups for Gemma 4 models (12B and 26B) using QAT and MTP on a 24GB RTX 3090 GPU.

#qat

Gemma 4 12b QAT is a regression for my use case, despite all the hype.. Not my main Squeeze

Reddit r/LocalLLaMA ↗ · 2026-06-08

The author reports that the Gemma 4 12b QAT model suffers from a regression in tool calling and coding tasks compared to the standard Q5_K_L version, due to a bug involving control token misconfiguration. Despite high token speed, the model's inconsistent outputs make it unsuitable for agent workflows.

#qat

QATs Q4_0 from Google have more precision than Q4_K_XL from Unsloth (at least some)

Reddit r/LocalLLaMA ↗ · 2026-06-08

A technical comparison reveals that Google's Q4_0 quantized Gemma-4 models have higher precision and more high-precision tensors than Unsloth's Q4_K_XL versions, resulting in larger file sizes.

#qat

What's your experience with Gemma4 QAT?

Reddit r/LocalLLaMA ↗ · 2026-06-08

A user shares a positive experience with the Gemma4 QAT model, noting quality improvements and speed gains with MTP, and asks others about theirs.

#qat

2-bit QAT model releases

Reddit r/LocalLLaMA ↗ · 2026-06-07

A discussion of the potential of 2-bit quantization-aware training (QAT) for larger MoE models, comparing the performance of 2-bit QAT models with 4-bit QAT and ternary LLMs and considering their feasibility on consumer hardware.

#qat

MTP and QAT - what is the relation?

Reddit r/LocalLLaMA ↗ · 2026-06-07

A user seeks clarification on the relation between MTP (Multi-Token Prediction) and QAT (Quantization-Aware Training) in llama.cpp, particularly regarding GGUF compatibility for the Gemma4 model and the new QAT string in filenames.

#qat

QAT variant of Gemma4 26B A4B is not working well for me

Reddit r/LocalLLaMA ↗ · 2026-06-07

A user reports that the QAT variant of Gemma4 26B A4B performs worse than the non-QAT version on a chessboard SVG test, drawing pieces unstably even with the suggested settings.

#qat

120 tok/s on 12GB VRAM with Gemma 4 12B QAT MTP

Reddit r/LocalLLaMA ↗ · 2026-06-06

Google's Gemma 4 12B QAT model achieves 120 tok/s on a 12GB GPU using Multi-Token Prediction (MTP) with llama.cpp. The post includes a step-by-step guide and a benchmark comparison showing a 2x speedup over running without MTP.

#qat

Does it make sense to use alternative quantizations of QAT models? [D]

Reddit r/MachineLearning ↗ · 2026-06-06

A discussion of whether it makes sense to apply alternative quantization methods to quantization-aware trained (QAT) models like Gemma-4. The post asks whether Unsloth's benchmarks, which show performance closer to the QAT fine-tunes, point to a benefit or to something counterproductive.

#qat

Gemma 4 QAT benchmark results (AMD 7900 XTX): faster, less VRAM, no quality loss

Reddit r/LocalLLaMA ↗ · 2026-06-05

A user benchmarks Google's Gemma 4 QAT models on an AMD 7900 XTX, reporting up to 45% faster generation, 83% higher throughput, and significant VRAM savings (e.g., 5.7GB for the 12B QAT model) with no quality loss compared to standard weights.

#qat

unsloth/gemma-4-12B-it-qat-GGUF

Hugging Face Models Trending ↗ · 2026-06-05

Unsloth releases GGUF quantized versions of Google DeepMind's Gemma 4 models, optimized with Quantization-Aware Training (QAT) to reduce memory requirements while preserving quality, supporting multiple formats and sizes for diverse deployment.

#qat

Gemma 4 QAT confirmed to release soon!

Reddit r/LocalLLaMA ↗ · 2026-06-04

A Google Gemma team member has confirmed that Gemma 4 QAT (Quantization-Aware Training) models will be released soon and suggested that users wait before testing their own quantizations.
