@analogalok: Gemma 4 12B QAT (dense) achieves 1000+ tokens/sec prefill on 8GB VRAM with 120k context Gemma 4 12B QAT (dense), TurboQ…

X AI KOLs Following Models

Summary

Gemma 4 12B QAT (dense) achieves over 1000 tokens per second prefill on an 8GB RTX 4060 with 120k context using TurboQuant, enabling full GPU layer offloading. This represents a 42% increase in prefill speed over previous methods.

Gemma 4 12B QAT (dense) achieves 1000+ tokens/sec prefill on 8GB VRAM with 120k context Gemma 4 12B QAT (dense), TurboQuant (Without MTP), RTX 4060 8GB VRAM: Prefill: 1000+ tok/s (42% increase) Decode: 25+ tok/s (25% increase) Context: 120k (150% increase) prefill was 700 tok/sec and decode 20 tok/sec with only 48k context without turbo quant (older test with mtp link in the comments) llama.cpp TurboQuant flags: -m gemma-4-12B-it-qat-UD-Q4_K_XL.gguf -c 120000 --cache-type-k q8_0 --cache-type-v turbo3 -ngl 99 --port 8080 tested with a 27k prompt, 120k context loaded. -ngl 99 here isn't a typo, full 12B dense, every layer on GPU, on an 8GB card. that's the part worth sitting with. The model has vision, audio input, thinking/reasoning and fits your 8GB card. TurboQuant's KV cache savings are what free up the room to do that at 120k context. side by side with yesterday: 26B A4B MoE got 320+ tok/s prefill. this dense 12B is clearing 1000+ rig: RTX 4060 8GB · i7H · 16GB RAM same two flags as yesterday, different model size: --cache-type-k q8_0 --cache-type-v turbo3 thanks to TheTom/llama-cpp-turboquant, TurboQuant fork of llama.cpp by Tom Turney (@no_stp_on_snek) to make this work. unsloth's model quant huggingface and the llama.cpp fork github link in the comments Do you prefer a dense or a MoE for your 8GB card?
Original Article
View Cached Full Text

Cached at: 06/18/26, 04:08 PM

Gemma 4 12B QAT (dense) achieves 1000+ tokens/sec prefill on 8GB VRAM with 120k context

Gemma 4 12B QAT (dense), TurboQuant (Without MTP), RTX 4060 8GB VRAM:

Prefill: 1000+ tok/s (42% increase) Decode: 25+ tok/s (25% increase) Context: 120k (150% increase)

prefill was 700 tok/sec and decode 20 tok/sec with only 48k context without turbo quant (older test with mtp link in the comments)

llama.cpp TurboQuant flags:

-m gemma-4-12B-it-qat-UD-Q4_K_XL.gguf -c 120000 –cache-type-k q8_0 –cache-type-v turbo3 -ngl 99 –port 8080

tested with a 27k prompt, 120k context loaded.

-ngl 99 here isn’t a typo, full 12B dense, every layer on GPU, on an 8GB card. that’s the part worth sitting with. The model has vision, audio input, thinking/reasoning and fits your 8GB card.

TurboQuant’s KV cache savings are what free up the room to do that at 120k context.

side by side with yesterday: 26B A4B MoE got 320+ tok/s prefill. this dense 12B is clearing 1000+

rig: RTX 4060 8GB · i7H · 16GB RAM

same two flags as yesterday, different model size:

–cache-type-k q8_0 –cache-type-v turbo3

thanks to TheTom/llama-cpp-turboquant, TurboQuant fork of llama.cpp by Tom Turney (@no_stp_on_snek) to make this work.

unsloth’s model quant huggingface and the llama.cpp fork github link in the comments

Do you prefer a dense or a MoE for your 8GB card?

Alok (@analogalok): Google’s Gemma 4 26B A4B QAT hits 25+ tokens/sec and 320+ tokens/sec prefill on 8 GB VRAM (RTX 4060) + 16 GB RAM using TurboQuant

Prefill just went from 200 → 320+ tok/s on the same 8GB card. 1.6x, no new hardware, no new quant, just a KV cache trick stacked on top of the Gemma

Similar Articles

120 tok/s on 12GB VRAM with Gemma 4 12B QAT MTP

Reddit r/LocalLLaMA

Google's Gemma 4 12B QAT model achieves 120 tok/s on a 12GB GPU using Multi-Token Prediction (MTP) with llama.cpp. A step-by-step guide and benchmark comparison without MTP show a 2x speedup.

Gemma 4 26B Hits 600 Tok/s on One RTX 5090

Reddit r/LocalLLaMA

A benchmark shows that using vLLM with DFlash speculative decoding boosts Gemma 4 26B inference to ~578 tokens per second on a single RTX 5090, achieving a 2.56x speedup over baseline.