[3090] Gemma4 QAT + MTP quick TPS numbers [TLDR 1.2-1.8x better]

Reddit r/LocalLLaMA Models

Summary

Benchmark results showing 1.2-1.8x tokens-per-second speedups on Gemma 4 models (12B and 26B) using QAT (quantization-aware training) and MTP (multi-token prediction) on a 24GB RTX 3090 GPU.

These last few weeks have been a godsend for GPU-poor peeps with 24GB (and below):

1. Killer models released (Gemma 4 / Qwen 3.6)
2. Free intelligence via QAT
3. Bonus speed via MTP

We're at the tipping point where GPU-poor (24GB and below) people are actually NOT poor any more. I was already happy with Gemma 4 31b running at 40 tok/s, but now it's 70-80 tok/s. It's no wonder 3090 prices are increasing.

For ref:

- limit=1, OSL=192, concurrency 1, temp=1.0/top_k=64/top_p=0.95, ctx=40960, q8_0 KV cache, parallel=1
- For the 12b, I tested both TEXT only and mmproj multimodal. Same speedup. (I'm TOTALLY loving the fact that you can actually TALK to the model, and it's a split second before it starts generating a response. No TTS yet though.)

Hardware:

- CPU: Intel Core i9-13900H, 14 cores / 20 threads
- RAM: 62 GiB system RAM, 8 GiB swap
- GPU: NVIDIA GeForce RTX 3090, 24 GiB VRAM
- Driver/CUDA: NVIDIA driver 595.71.05, CUDA 13.2
- OS/kernel: Ubuntu 24.04-ish, Linux 6.17.0-35-generic

Startup config:

    llama-server \
      -m gemma-4-12B-it-qat-UD-Q4_K_XL.gguf \
      --model-draft gemma-4-12B-it-qat-assistant-MTP-Q8_0.gguf \
      --spec-type draft-mtp \
      --spec-draft-n-max 4 \
      --parallel 1 \
      --ctx-size 40960 \
      --temp 1.0 \
      --top-p 0.95 \
      --top-k 64 \
      --spec-draft-ngl all \
      --spec-draft-type-k q8_0 \
      --spec-draft-type-v q8_0

UPDATE: for 26b, it turns out the best n-max is 1, which gives a 1.26x speedup:

    setting    tok/s    speedup  accept
    ━━━━━━━━━  ━━━━━━━  ━━━━━━━  ━━━━━━
    no MTP     143.01   1.00x    -
    n-max 1    180.01   1.26x    0.765
    n-max 2    175.77   1.23x    0.654
    n-max 3    170.37   1.19x    0.576
    n-max 4    165.90   1.16x    0.492
    n-max 5    155.51   1.09x    0.444

Also, what the test is: 11 requests, one each for coding, humanities, math, QA, RAG, reasoning, STEM, writing, multilingual, summarization, roleplay.
Context allocated is 40960, but prompt lengths were only about 22 to 1578 tokens, average about 280. Output target is --osl 192 per turn; some samples are multi-turn, so the max full-length total is 15 turns * 192 = 2880 generated tokens, but stop tokens can end samples early. This is meant to be a super quick benchmark to get a rough idea of the potential impact of QAT + MTP. A full proper grid of context + depth will be done separately.
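
For anyone who wants to sanity-check the numbers: the speedup column for the 26b run is each setting's tok/s divided by the no-MTP baseline, and the 2880-token ceiling is turns times OSL. A minimal Python sketch of that arithmetic follows. The tok/s figures are copied from the post; the names `RESULTS`, `speedups`, `best_setting`, and `max_generated_tokens` are made up for illustration and are not part of the actual benchmark script.

```python
# tok/s figures copied from the 26b table in the post.
# All function and variable names here are hypothetical helpers.
RESULTS = {
    "no MTP": 143.01,
    "n-max 1": 180.01,
    "n-max 2": 175.77,
    "n-max 3": 170.37,
    "n-max 4": 165.90,
    "n-max 5": 155.51,
}


def speedups(results, baseline="no MTP"):
    """Return each setting's throughput relative to the baseline setting."""
    base = results[baseline]
    return {name: tps / base for name, tps in results.items()}


def best_setting(results):
    """Return the setting name with the highest tok/s."""
    return max(results, key=results.get)


def max_generated_tokens(turns, osl):
    """Upper bound on generated tokens if every turn runs to the full OSL."""
    return turns * osl


if __name__ == "__main__":
    for name, ratio in speedups(RESULTS).items():
        print(f"{name:8s} {RESULTS[name]:7.2f} tok/s  {ratio:.2f}x")
    print("best:", best_setting(RESULTS))
    print("max generated tokens:", max_generated_tokens(15, 192))
```

The same helper works for any other sweep: swap in your own tok/s numbers and baseline key to see where the draft length stops paying off.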
