120 tok/s on 12GB VRAM with Gemma 4 12B QAT MTP

Reddit r/LocalLLaMA 06/06/26, 06:53 PM News

gemma-4 qat mtp inference benchmarking quantization llama-cpp

Summary

Google's Gemma 4 12B QAT model achieves 120 tok/s on a 12GB GPU using Multi-Token Prediction (MTP) with llama.cpp. A step-by-step guide and benchmark comparison without MTP show a 2x speedup.

Google just released the QAT (Quantization-Aware Training) variant of their Gemma 4 models, including 12B, so it was only natural for me to benchmark it on my 12GB GPU since it fits entirely in VRAM. I was pleasantly surprised with the result! By using llama.cpp patched with the Gemma 4 MTP PR, and loading Unsloth's [gemma-4-12B-it-qat-GGUF](https://huggingface.co/unsloth/gemma-4-12B-it-qat-GGUF) quant and Google's [gemma-4-12B-it-qat-q4\_0-unquantized-assistant](https://huggingface.co/google/gemma-4-12B-it-qat-q4_0-unquantized-assistant) QAT assistant / draft model, which I converted to GGUF and uploaded to HuggingFace as [gemma-4-12B-it-qat-assistant-MTP-Q8\_0-GGUF](https://huggingface.co/Janvitos/gemma-4-12B-it-qat-assistant-MTP-Q8_0-GGUF) using llama.cpp's convert\_hf\_to\_gguf.py, I was able to achieve **120 tok/s** with [mtp-bench.py](https://gist.github.com/am17an/228edfb84ed082aa88e3865d6fa27090/)! # Before we start, here's my PC specs: OS: CachyOS GPU: RTX 4070 Super 12GB (iGPU as main GPU) CPU: AMD Ryzen 7 9700X RAM: 32GB DDR5-6000 # Here's my llama.cpp command: llama-server \ -m gemma-4-12B-it-qat-UD-Q4_K_XL.gguf \ --model-draft gemma-4-12B-it-qat-assistant-MTP-Q8_0.gguf \ --spec-type draft-mtp \ --spec-draft-n-max 4 \ --parallel 1 \ --ctx-size 131072 \ --temp 1.0 \ --top-p 0.95 \ --top-k 64 # For comparison, here's my [mtp-bench.py](http://mtp-bench.py) benchmark results without MTP: ❯ ./mtp-bench.py code_python pred= 192 draft= 0 acc= 0 rate=n/a tok/s=59.9 code_cpp pred= 192 draft= 0 acc= 0 rate=n/a tok/s=60.0 explain_concept pred= 192 draft= 0 acc= 0 rate=n/a tok/s=59.9 summarize pred= 192 draft= 0 acc= 0 rate=n/a tok/s=59.9 qa_factual pred= 192 draft= 0 acc= 0 rate=n/a tok/s=59.9 translation pred= 192 draft= 0 acc= 0 rate=n/a tok/s=60.0 creative_short pred= 192 draft= 0 acc= 0 rate=n/a tok/s=60.0 stepwise_math pred= 192 draft= 0 acc= 0 rate=n/a tok/s=59.8 long_code_review pred= 192 draft= 0 acc= 0 rate=n/a tok/s=57.6 Aggregate: { "n_requests": 9, "total_predicted": 1728, "total_draft": 0, "total_draft_accepted": 0, "aggregate_accept_rate": null, "wall_s_total": 30.2 } # Here's my [mtp-bench.py](http://mtp-bench.py) benchmark results with MTP: ❯ ./mtp-bench.py code_python pred= 192 draft= 172 acc= 133 rate=0.773 tok/s=130.5 code_cpp pred= 192 draft= 187 acc= 128 rate=0.684 tok/s=120.4 explain_concept pred= 192 draft= 213 acc= 119 rate=0.559 tok/s=105.7 summarize pred= 192 draft= 168 acc= 134 rate=0.798 tok/s=133.5 qa_factual pred= 192 draft= 210 acc= 120 rate=0.571 tok/s=107.2 translation pred= 192 draft= 175 acc= 132 rate=0.754 tok/s=128.6 creative_short pred= 192 draft= 240 acc= 110 rate=0.458 tok/s=94.0 stepwise_math pred= 192 draft= 165 acc= 135 rate=0.818 tok/s=135.7 long_code_review pred= 192 draft= 197 acc= 125 rate=0.634 tok/s=111.7 Aggregate: { "n_requests": 9, "total_predicted": 1728, "total_draft": 1727, "total_draft_accepted": 1136, "aggregate_accept_rate": 0.6578, "wall_s_total": 15.66 } To achieve this, all you need is a 12GB NVIDIA GPU and enough free VRAM to fit Gemma 4 12GB + assistant entirely in GPU memory. With CachyOS and my dGPU set as a secondary GPU, this gives me pretty much 100% free VRAM. On Windows, or if using your dGPU as your main GPU, you will probably loose 500MB+ of VRAM to the OS and driver, so you might need to lower the context size, or it might simply not work. You'll probably need to do some testing 😄 # Here's step-by-step instructions to get this working: 1. Clone llama.cpp git clone https://github.com/ggml-org/llama.cpp.git cd llama.cpp 2. Fetch and switch to the Gemma 4 MTP PR branch git fetch origin pull/23398/head:gemma4-mtp git checkout gemma4-mtp 3. Build with CUDA support for NVIDIA GPUs cmake -B build -DGGML_CUDA=ON -DBUILD_SHARED_LIBS=OFF cmake --build build --config Release -j$(nproc) 4. Download Unsloth's Gemma 4 12B QAT here: https://huggingface.co/unsloth/gemma-4-12B-it-qat-GGUF 5. Download Google's Gemma 4 assistant / draft here https://huggingface.co/Janvitos/gemma-4-12B-it-qat-assistant-MTP-Q8_0-GGUF 6. Load the models with llama-server llama-server \ -m gemma-4-12B-it-qat-UD-Q4_K_XL.gguf \ --model-draft gemma-4-12B-it-qat-assistant-MTP-Q8_0.gguf \ --spec-type draft-mtp \ --spec-draft-n-max 4 \ --parallel 1 \ --ctx-size 131072 \ --temp 1.0 \ --top-p 0.95 \ --top-k 64 Cheers 😄

Original Article

120 tok/s on 12GB VRAM with Gemma 4 12B QAT MTP

Similar Articles

[3090] Gemma4 QAT + MTP quick TPS numbers [TLDR 1.2-1.8x better]

@analogalok: Gemma 4 12B QAT (dense) achieves 1000+ tokens/sec prefill on 8GB VRAM with 120k context Gemma 4 12B QAT (dense), TurboQ…

Gemma 4 QAT benchmark results (AMD 7900 XTX): faster, less VRAM, no quality loss

80 tok/sec and 128K context on 12GB VRAM with Qwen3.6 35B A3B and llama.cpp MTP

@analogalok: Run Gemma 4 26B MoE on 8GB VRAM with 250k context at 20+ tokens/sec If you own any 8GB VRAM graphics card, stop what yo…

Submit Feedback

Similar Articles

[3090] Gemma4 QAT + MTP quick TPS numbers [TLDR 1.2-1.8x better]
Benchmark results showing 1.2-1.8x token-per-second speedups on Gemma 4 models (12B and 26B) using QAT and MTP on a 24GB RTX 3090 GPU.

@analogalok: Gemma 4 12B QAT (dense) achieves 1000+ tokens/sec prefill on 8GB VRAM with 120k context Gemma 4 12B QAT (dense), TurboQ…

Gemma 4 QAT benchmark results (AMD 7900 XTX): faster, less VRAM, no quality loss

80 tok/sec and 128K context on 12GB VRAM with Qwen3.6 35B A3B and llama.cpp MTP

@analogalok: Run Gemma 4 26B MoE on 8GB VRAM with 250k context at 20+ tokens/sec If you own any 8GB VRAM graphics card, stop what yo…