@analogalok: Gemma 4 12B QAT (dense) achieves 1000+ tokens/sec prefill on 8GB VRAM with 120k context Gemma 4 12B QAT (dense), TurboQ…
Summary
Gemma 4 12B QAT (dense) achieves over 1000 tokens per second prefill on an 8GB RTX 4060 with 120k context using TurboQuant, enabling full GPU layer offloading. This represents a 42% increase in prefill speed over previous methods.
View Cached Full Text
Cached at: 06/18/26, 04:08 PM
Gemma 4 12B QAT (dense) achieves 1000+ tokens/sec prefill on 8GB VRAM with 120k context
Gemma 4 12B QAT (dense), TurboQuant (Without MTP), RTX 4060 8GB VRAM:
Prefill: 1000+ tok/s (42% increase) Decode: 25+ tok/s (25% increase) Context: 120k (150% increase)
prefill was 700 tok/sec and decode 20 tok/sec with only 48k context without turbo quant (older test with mtp link in the comments)
llama.cpp TurboQuant flags:
-m gemma-4-12B-it-qat-UD-Q4_K_XL.gguf -c 120000 –cache-type-k q8_0 –cache-type-v turbo3 -ngl 99 –port 8080
tested with a 27k prompt, 120k context loaded.
-ngl 99 here isn’t a typo, full 12B dense, every layer on GPU, on an 8GB card. that’s the part worth sitting with. The model has vision, audio input, thinking/reasoning and fits your 8GB card.
TurboQuant’s KV cache savings are what free up the room to do that at 120k context.
side by side with yesterday: 26B A4B MoE got 320+ tok/s prefill. this dense 12B is clearing 1000+
rig: RTX 4060 8GB · i7H · 16GB RAM
same two flags as yesterday, different model size:
–cache-type-k q8_0 –cache-type-v turbo3
thanks to TheTom/llama-cpp-turboquant, TurboQuant fork of llama.cpp by Tom Turney (@no_stp_on_snek) to make this work.
unsloth’s model quant huggingface and the llama.cpp fork github link in the comments
Do you prefer a dense or a MoE for your 8GB card?
Alok (@analogalok): Google’s Gemma 4 26B A4B QAT hits 25+ tokens/sec and 320+ tokens/sec prefill on 8 GB VRAM (RTX 4060) + 16 GB RAM using TurboQuant
Prefill just went from 200 → 320+ tok/s on the same 8GB card. 1.6x, no new hardware, no new quant, just a KV cache trick stacked on top of the Gemma
Similar Articles
@leopardracer: GEMMA 4 26B ON AN RTX 4060 WITH A 248K TOKEN CONTEXT WINDOW 20 tokens per second and a context window so large you can …
Gemma 4 26B runs on an RTX 4060 with 248K token context at 20 tokens per second using llama.cpp and Q4_K_XL quantization, enabling local processing of entire codebases on consumer hardware.
@analogalok: Run Gemma 4 26B MoE on 8GB VRAM with 250k context at 20+ tokens/sec If you own any 8GB VRAM graphics card, stop what yo…
Alok demonstrates running Gemma 4 26B MoE on 8GB VRAM using Unsloth's QAT quant and the -cmoe flag in llama.cpp, achieving 20 tokens/sec with 250k context, marking a major milestone for budget local AI.
120 tok/s on 12GB VRAM with Gemma 4 12B QAT MTP
Google's Gemma 4 12B QAT model achieves 120 tok/s on a 12GB GPU using Multi-Token Prediction (MTP) with llama.cpp. A step-by-step guide and benchmark comparison without MTP show a 2x speedup.
Gemma 4 26B Hits 600 Tok/s on One RTX 5090
A benchmark shows that using vLLM with DFlash speculative decoding boosts Gemma 4 26B inference to ~578 tokens per second on a single RTX 5090, achieving a 2.56x speedup over baseline.
Gemma 4 QAT benchmark results (AMD 7900 XTX): faster, less VRAM, no quality loss
A user benchmarks Google's Gemma 4 QAT models on an AMD 7900 XTX, reporting up to 45% faster generation, 83% higher throughput, and significant VRAM savings (e.g., 5.7GB for the 12B QAT model) with no quality loss compared to standard weights.