Google's Gemma 4 12B QAT model achieves 120 tok/s on a 12GB GPU using Multi-Token Prediction (MTP) with llama.cpp. The post includes a step-by-step guide, and a benchmark comparison against a run without MTP shows a roughly 2x speedup.
Google just released the QAT (Quantization-Aware Training) variants of their Gemma 4 models, including 12B, so it was only natural for me to benchmark it on my 12GB GPU, since it fits entirely in VRAM. I was pleasantly surprised with the result!

I used llama.cpp patched with the Gemma 4 MTP PR, Unsloth's [gemma-4-12B-it-qat-GGUF](https://huggingface.co/unsloth/gemma-4-12B-it-qat-GGUF) quant as the main model, and Google's [gemma-4-12B-it-qat-q4\_0-unquantized-assistant](https://huggingface.co/google/gemma-4-12B-it-qat-q4_0-unquantized-assistant) QAT assistant / draft model. I converted the assistant to GGUF using llama.cpp's convert\_hf\_to\_gguf.py and uploaded it to HuggingFace as [gemma-4-12B-it-qat-assistant-MTP-Q8\_0-GGUF](https://huggingface.co/Janvitos/gemma-4-12B-it-qat-assistant-MTP-Q8_0-GGUF). With this setup, I was able to achieve **120 tok/s** with [mtp-bench.py](https://gist.github.com/am17an/228edfb84ed082aa88e3865d6fa27090/)!

# Before we start, here are my PC specs:

    OS:  CachyOS
    GPU: RTX 4070 Super 12GB (iGPU as main GPU)
    CPU: AMD Ryzen 7 9700X
    RAM: 32GB DDR5-6000

# Here's my llama.cpp command:

    llama-server \
      -m gemma-4-12B-it-qat-UD-Q4_K_XL.gguf \
      --model-draft gemma-4-12B-it-qat-assistant-MTP-Q8_0.gguf \
      --spec-type draft-mtp \
      --spec-draft-n-max 4 \
      --parallel 1 \
      --ctx-size 131072 \
      --temp 1.0 \
      --top-p 0.95 \
      --top-k 64

# For comparison, here are my [mtp-bench.py](https://gist.github.com/am17an/228edfb84ed082aa88e3865d6fa27090/) benchmark results without MTP:

    ❯ ./mtp-bench.py
      code_python        pred= 192 draft=   0 acc=   0 rate=n/a   tok/s=59.9
      code_cpp           pred= 192 draft=   0 acc=   0 rate=n/a   tok/s=60.0
      explain_concept    pred= 192 draft=   0 acc=   0 rate=n/a   tok/s=59.9
      summarize          pred= 192 draft=   0 acc=   0 rate=n/a   tok/s=59.9
      qa_factual         pred= 192 draft=   0 acc=   0 rate=n/a   tok/s=59.9
      translation        pred= 192 draft=   0 acc=   0 rate=n/a   tok/s=60.0
      creative_short     pred= 192 draft=   0 acc=   0 rate=n/a   tok/s=60.0
      stepwise_math      pred= 192 draft=   0 acc=   0 rate=n/a   tok/s=59.8
      long_code_review   pred= 192 draft=   0 acc=   0 rate=n/a   tok/s=57.6

    Aggregate:
    {
      "n_requests": 9,
      "total_predicted": 1728,
      "total_draft": 0,
      "total_draft_accepted": 0,
      "aggregate_accept_rate": null,
      "wall_s_total": 30.2
    }

# Here are my [mtp-bench.py](https://gist.github.com/am17an/228edfb84ed082aa88e3865d6fa27090/) benchmark results with MTP:

    ❯ ./mtp-bench.py
      code_python        pred= 192 draft= 172 acc= 133 rate=0.773 tok/s=130.5
      code_cpp           pred= 192 draft= 187 acc= 128 rate=0.684 tok/s=120.4
      explain_concept    pred= 192 draft= 213 acc= 119 rate=0.559 tok/s=105.7
      summarize          pred= 192 draft= 168 acc= 134 rate=0.798 tok/s=133.5
      qa_factual         pred= 192 draft= 210 acc= 120 rate=0.571 tok/s=107.2
      translation        pred= 192 draft= 175 acc= 132 rate=0.754 tok/s=128.6
      creative_short     pred= 192 draft= 240 acc= 110 rate=0.458 tok/s=94.0
      stepwise_math      pred= 192 draft= 165 acc= 135 rate=0.818 tok/s=135.7
      long_code_review   pred= 192 draft= 197 acc= 125 rate=0.634 tok/s=111.7

    Aggregate:
    {
      "n_requests": 9,
      "total_predicted": 1728,
      "total_draft": 1727,
      "total_draft_accepted": 1136,
      "aggregate_accept_rate": 0.6578,
      "wall_s_total": 15.66
    }

To achieve this, all you need is a 12GB NVIDIA GPU with enough free VRAM to fit Gemma 4 12B + the assistant entirely in GPU memory. With CachyOS and my dGPU set as the secondary GPU, pretty much 100% of its VRAM is free. On Windows, or if you use your dGPU as your main GPU, you will probably lose 500MB+ of VRAM to the OS and driver, so you might need to lower the context size, or it might simply not work. You'll probably need to do some testing 😄

# Here are step-by-step instructions to get this working:

1. Clone llama.cpp

       git clone https://github.com/ggml-org/llama.cpp.git
       cd llama.cpp

2. Fetch and switch to the Gemma 4 MTP PR branch

       git fetch origin pull/23398/head:gemma4-mtp
       git checkout gemma4-mtp

3. Build with CUDA support for NVIDIA GPUs

       cmake -B build -DGGML_CUDA=ON -DBUILD_SHARED_LIBS=OFF
       cmake --build build --config Release -j$(nproc)

4. Download Unsloth's Gemma 4 12B QAT here: https://huggingface.co/unsloth/gemma-4-12B-it-qat-GGUF

5. Download my GGUF conversion of Google's Gemma 4 assistant / draft model here: https://huggingface.co/Janvitos/gemma-4-12B-it-qat-assistant-MTP-Q8_0-GGUF

6. Load the models with llama-server

       llama-server \
         -m gemma-4-12B-it-qat-UD-Q4_K_XL.gguf \
         --model-draft gemma-4-12B-it-qat-assistant-MTP-Q8_0.gguf \
         --spec-type draft-mtp \
         --spec-draft-n-max 4 \
         --parallel 1 \
         --ctx-size 131072 \
         --temp 1.0 \
         --top-p 0.95 \
         --top-k 64

Cheers 😄
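If you want to sanity-check the "2x" figure against the numbers posted above, here is a minimal sketch that recomputes the speedup and draft acceptance rate from the two benchmark runs. The function names (`mean`, `summarize`) are illustrative only and are not part of mtp-bench.py. It also assumes `wall_s_total` covers the whole run, overhead included, which the post does not state explicitly.

```python
# Figures copied from the two mtp-bench.py runs in the post.
baseline = {
    "wall_s_total": 30.2,
    "tok_s": [59.9, 60.0, 59.9, 59.9, 59.9, 60.0, 60.0, 59.8, 57.6],
}
mtp = {
    "total_draft": 1727,
    "total_draft_accepted": 1136,
    "wall_s_total": 15.66,
    "tok_s": [130.5, 120.4, 105.7, 133.5, 107.2, 128.6, 94.0, 135.7, 111.7],
}


def mean(values):
    """Plain arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def summarize(baseline, mtp):
    """Compare the two runs by total wall time and by per-prompt tok/s."""
    return {
        # Ratio of total wall time for the same 9 prompts.
        "wall_time_speedup": round(baseline["wall_s_total"] / mtp["wall_s_total"], 2),
        # Unweighted mean of the per-prompt generation rates.
        "mean_tok_s_baseline": round(mean(baseline["tok_s"]), 1),
        "mean_tok_s_mtp": round(mean(mtp["tok_s"]), 1),
        # Fraction of drafted tokens that the main model accepted.
        "accept_rate": round(mtp["total_draft_accepted"] / mtp["total_draft"], 4),
    }


if __name__ == "__main__":
    for key, value in summarize(baseline, mtp).items():
        print(f"{key}: {value}")
```

By wall time the speedup comes out a bit under 2x, and the mean per-prompt rate lands just below the 120 tok/s headline. The headline is best read as a rounded average, with creative writing at the low end and math and summarization at the high end.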
A user benchmarks Google's Gemma 4 QAT models on an AMD 7900 XTX, reporting up to 45% faster generation, 83% higher throughput, and significant VRAM savings (e.g., 5.7GB for the 12B QAT model) with no quality loss compared to standard weights.
A user shares a configuration for achieving over 80 tokens per second with Qwen3.6 35B A3B on a 12GB VRAM GPU using llama.cpp and Multi-Token Prediction (MTP). The post includes benchmark results and specific command-line parameters to optimize performance.
Alok demonstrates running Gemma 4 26B MoE on 8GB VRAM using Unsloth's QAT quant and the -cmoe flag in llama.cpp, achieving 20 tokens/sec with 250k context, marking a major milestone for budget local AI.
A benchmark shows that using vLLM with DFlash speculative decoding boosts Gemma 4 26B inference to ~578 tokens per second on a single RTX 5090, achieving a 2.56x speedup over baseline.