Benchmark results showing 1.2-1.8x token-per-second speedups on Gemma 4 models (12B and 26B) using QAT and MTP on a 24GB RTX 3090 GPU.
These last few weeks have been a godsend for GPU-poor peeps with 24GB or less:

1. Killer models released (Gemma 4 / Qwen 3.6)
2. Free intelligence via QAT
3. Bonus speed via MTP

We're at the tipping point where GPU-poor people (24GB and below) are actually NOT poor any more. I was already happy with Gemma 4 31b running at 40 tok/s, but now it's 70-80 tok/s. It's no wonder 3090 prices are increasing.

For ref:

\- limit=1, OSL=192, concurrency 1, temp=1.0/top\_k=64/top\_p=0.95, ctx=40960, q8\_0 KV cache, parallel=1

\- For the 12b, I tested both TEXT only and mmproj multimodal. Same speedup. (I'm TOTALLY loving the fact that you can actually TALK to the model, and it's a split second before it starts generating a response. No TTS yet though)

• Hardware

\- CPU: Intel Core i9-13900H, 14 cores / 20 threads

\- RAM: 62 GiB system RAM, 8 GiB swap

\- GPU: NVIDIA GeForce RTX 3090, 24 GiB VRAM

\- Driver/CUDA: NVIDIA driver 595.71.05, CUDA 13.2

\- OS/kernel: Ubuntu 24.04-ish, Linux 6.17.0-35-generic

Startup config:

    llama-server \
      -m gemma-4-12B-it-qat-UD-Q4_K_XL.gguf \
      --model-draft gemma-4-12B-it-qat-assistant-MTP-Q8_0.gguf \
      --spec-type draft-mtp \
      --spec-draft-n-max 4 \
      --parallel 1 \
      --ctx-size 40960 \
      --temp 1.0 \
      --top-p 0.95 \
      --top-k 64 \
      --spec-draft-ngl all \
      --spec-draft-type-k q8_0 \
      --spec-draft-type-v q8_0

UPDATE: for 26b, it turns out the best n-max is 1, which gives a 1.26x speedup:

| setting | tok/s | speedup | accept |
|---|---|---|---|
| no MTP | 143.01 | 1.00x | - |
| n-max 1 | 180.01 | 1.26x | 0.765 |
| n-max 2 | 175.77 | 1.23x | 0.654 |
| n-max 3 | 170.37 | 1.19x | 0.576 |
| n-max 4 | 165.90 | 1.16x | 0.492 |
| n-max 5 | 155.51 | 1.09x | 0.444 |

Also, what the test is: 11 requests, one each for coding, humanities, math, QA, RAG, reasoning, STEM, writing, multilingual, summarization, roleplay.
Allocated context is 40960, but prompt lengths were only about 22 to 1578 tokens, averaging about 280. The output target is --osl 192 per turn; some samples are multi-turn, so the maximum total for a full-length sample is 15 turns * 192 = 2880 generated tokens, though stop tokens can end samples early. This is meant to be a super quick benchmark to get a rough idea of the potential impact of QAT + MTP. A proper full grid of context + depth will be done separately.
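For anyone who wants to redo the arithmetic on their own runs, here is a minimal Python sketch that recomputes the speedup column and picks the best n-max. The tok/s and accept values are copied from the 26b table above; the names `speedup` and `best_n_max` are made up for illustration and are not part of llama.cpp or any benchmark tool.

```python
# Values copied from the 26b results table in the post.
BASELINE_TOK_S = 143.01  # "no MTP" row

# n-max -> (tok/s, draft acceptance rate)
MTP_RESULTS = {
    1: (180.01, 0.765),
    2: (175.77, 0.654),
    3: (170.37, 0.576),
    4: (165.90, 0.492),
    5: (155.51, 0.444),
}


def speedup(tok_s, baseline=BASELINE_TOK_S):
    """Generation speed relative to the no-MTP baseline."""
    return tok_s / baseline


def best_n_max(results):
    """Return the n-max setting with the highest measured tok/s."""
    return max(results, key=lambda n: results[n][0])


if __name__ == "__main__":
    for n, (tok_s, accept) in sorted(MTP_RESULTS.items()):
        print(f"n-max {n}: {tok_s:7.2f} tok/s  {speedup(tok_s):.2f}x  accept {accept:.3f}")
    print("best n-max:", best_n_max(MTP_RESULTS))
    # Upper bound on generated tokens for a full-length multi-turn sample.
    print("max generated tokens per sample:", 15 * 192)
```

In this data both the acceptance rate and tok/s drop with every extra draft token, so the best setting is simply the smallest one. That may not hold for other models or prompt mixes, which is why it is worth sweeping n-max rather than assuming a value.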
Google's Gemma 4 12B QAT model achieves 120 tok/s on a 12GB GPU using Multi-Token Prediction (MTP) with llama.cpp. The post includes a step-by-step guide, and its benchmark comparison against a run without MTP shows a 2x speedup.
A user benchmarks Google's Gemma 4 QAT models on an AMD 7900 XTX, reporting up to 45% faster generation, 83% higher throughput, and significant VRAM savings (e.g., 5.7GB for the 12B QAT model) with no quality loss compared to standard weights.
A benchmark shows that using vLLM with DFlash speculative decoding boosts Gemma 4 26B inference to ~578 tokens per second on a single RTX 5090, achieving a 2.56x speedup over baseline.
Gemma 4 12B QAT (dense) achieves over 1000 tokens per second prefill on an 8GB RTX 4060 with 120k context using TurboQuant, enabling full GPU layer offloading. This represents a 42% increase in prefill speed over previous methods.
A detailed benchmark comparing Qwen3.6-35B and Gemma4-26B on Radeon 7900 XTX shows Gemma is ~20% faster end-to-end despite slower token generation, because Qwen generates ~2x more tokens due to internal reasoning. The article recommends using Qwen for throughput-bound batch work and Gemma for latency-sensitive single requests.
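To see how a model with faster token generation can still lose end-to-end, here is a rough back-of-the-envelope sketch in Python. It assumes end-to-end time is dominated by decode (tokens generated divided by tok/s) and ignores prefill and any other overhead, which the summary above does not break out. The function name `implied_decode_ratio` is hypothetical, and the only inputs are the two rounded figures quoted above (~2x tokens, ~20% faster).

```python
def implied_decode_ratio(token_ratio, e2e_time_ratio):
    """Estimate how much faster model A decodes than model B (tok/s ratio).

    Assumption: end-to-end time ~= tokens / tok_s, with prefill ignored.
    token_ratio    = tokens generated by A / tokens generated by B
    e2e_time_ratio = end-to-end time of A / end-to-end time of B
    """
    return token_ratio / e2e_time_ratio


if __name__ == "__main__":
    # A = Qwen, B = Gemma. "~20% faster" can be read two ways:
    # Gemma takes 80% of Qwen's time (time ratio 1.25), or
    # Qwen takes 120% of Gemma's time (time ratio 1.2).
    print(f"{implied_decode_ratio(2.0, 1.25):.2f}x")
    print(f"{implied_decode_ratio(2.0, 1.2):.2f}x")
```

Under those assumptions, Qwen would need a clearly higher decode rate than Gemma and would still finish later, because it emits about twice as many tokens per request. This is an estimate from the rounded summary figures, not a measurement.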