[3090] Gemma4 QAT + MTP quick TPS numbers [TLDR 1.2-1.8x better]

Reddit r/LocalLLaMA 06/08/26, 02:07 PM Models

gemma-4 qat mtp benchmark gpu performance llama-cpp

Summary

Benchmark results showing 1.2-1.8x token-per-second speedups on Gemma 4 models (12B and 26B) using QAT and MTP on a 24GB RTX 3090 GPU.

These last few weeks have been a godsend for GPU-poor peeps with 24GB of VRAM or less:

1. Killer models released (Gemma 4 / Qwen 3.6)
2. Free intelligence via QAT
3. Bonus speed via MTP

We're at the tipping point where GPU-poor people (24GB and below) are actually NOT poor any more. I was already happy with Gemma 4 31b running at 40 tok/s, but now it's 70-80 tok/s. It's no wonder 3090 prices are increasing.

For ref:

- limit=1, OSL=192, concurrency 1, temp=1.0/top_k=64/top_p=0.95, ctx=40960, q8_0 KV cache, parallel=1
- For the 12b, I tested both text-only and mmproj multimodal. Same speedup in both. (I'm TOTALLY loving the fact that you can actually TALK to the model, and it's only a split second before it starts generating a response. No TTS yet though.)

Hardware:

- CPU: Intel Core i9-13900H, 14 cores / 20 threads
- RAM: 62 GiB system RAM, 8 GiB swap
- GPU: NVIDIA GeForce RTX 3090, 24 GiB VRAM
- Driver/CUDA: NVIDIA driver 595.71.05, CUDA 13.2
- OS/kernel: Ubuntu 24.04-ish, Linux 6.17.0-35-generic

Startup config:

    llama-server \
      -m gemma-4-12B-it-qat-UD-Q4_K_XL.gguf \
      --model-draft gemma-4-12B-it-qat-assistant-MTP-Q8_0.gguf \
      --spec-type draft-mtp \
      --spec-draft-n-max 4 \
      --parallel 1 \
      --ctx-size 40960 \
      --temp 1.0 \
      --top-p 0.95 \
      --top-k 64 \
      --spec-draft-ngl all \
      --spec-draft-type-k q8_0 \
      --spec-draft-type-v q8_0

UPDATE: for 26b, it turns out the best n-max is 1, which gives a 1.26x speedup:

    setting    tok/s     speedup    accept
    ━━━━━━━━━  ━━━━━━━━  ━━━━━━━━━  ━━━━━━━━
    no MTP     143.01    1.00x      -
    n-max 1    180.01    1.26x      0.765
    n-max 2    175.77    1.23x      0.654
    n-max 3    170.37    1.19x      0.576
    n-max 4    165.90    1.16x      0.492
    n-max 5    155.51    1.09x      0.444

Also, what the test is: 11 requests, one each for coding, humanities, math, QA, RAG, reasoning, STEM, writing, multilingual, summarization, roleplay.
Allocated context is 40960 tokens, but prompt lengths were only about 22 to 1578 tokens, averaging about 280. The output target is --osl 192 per turn; some samples are multi-turn, so the max full-length total is 15 turns * 192 = 2880 generated tokens, but stop tokens can end samples early. This is meant to be a super quick benchmark to get a rough idea of the potential impact of QAT + MTP. A full proper grid of context + depth will be done separately.
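If you want to sanity-check the 26b table or plug in your own numbers, here is a minimal sketch that recomputes the speedup column from the tok/s figures and picks the best n-max. The tok/s values are copied from the post; the function names (`speedups`, `best_n_max`) are made up for illustration and are not part of llama.cpp or any benchmark tool.

```python
# tok/s figures copied from the 26b table in the post (RTX 3090, concurrency 1).
BASELINE_TPS = 143.01  # the "no MTP" row

# Key is the --spec-draft-n-max setting, value is the measured tok/s.
MTP_TPS = {1: 180.01, 2: 175.77, 3: 170.37, 4: 165.90, 5: 155.51}


def speedups(baseline, runs):
    """Return {n_max: speedup over baseline}, rounded to 2 decimals."""
    return {n: round(tps / baseline, 2) for n, tps in runs.items()}


def best_n_max(runs):
    """Return the n-max setting with the highest tok/s."""
    return max(runs, key=runs.get)


if __name__ == "__main__":
    table = speedups(BASELINE_TPS, MTP_TPS)
    for n, s in table.items():
        print(f"n-max {n}: {MTP_TPS[n]:7.2f} tok/s  {s:.2f}x")
    print("best n-max:", best_n_max(MTP_TPS))
```

Swap in your own baseline and per-setting tok/s to see where the sweet spot lands on your card. Note that this only looks at throughput; the accept column is not used here.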


Similar Articles

120 tok/s on 12GB VRAM with Gemma 4 12B QAT MTP

Gemma 4 QAT benchmark results (AMD 7900 XTX): faster, less VRAM, no quality loss

Gemma 4 26B Hits 600 Tok/s on One RTX 5090

@analogalok: Gemma 4 12B QAT (dense) achieves 1000+ tokens/sec prefill on 8GB VRAM with 120k context Gemma 4 12B QAT (dense), TurboQ…

Qwen3.6-35B vs Gemma4-26B on 7900 XTX
