DiffusionGemma 26B A4B results on my 5090

Reddit r/LocalLLaMA 06/11/26, 03:00 PM Models

diffusiongemma 26b a4b gguf quantization benchmark rtx5090 llama.cpp

Summary

This post presents benchmark results and tuning parameters for running DiffusionGemma 26B A4B GGUF models on an RTX 5090 GPU, showing up to 44% speedup via optimized temperature settings and quantization choices.

# DiffusionGemma 26B A4B: Tuning Results

[https://huggingface.co/unsloth/diffusiongemma-26B-A4B-it-GGUF](https://huggingface.co/unsloth/diffusiongemma-26B-A4B-it-GGUF)

## System

- **GPU**: RTX 5090 (32 GB VRAM), CUDA 13.3
- **Build**: `llama.cpp` PR #24423, GCC-15, Ninja, ccache
- **Flash Attention**: Auto-disabled on SM120, which limits max context
- **Models**: `unsloth/diffusiongemma-26B-A4B-it-GGUF`

## Models

| Variant | File | Size |
|---------|------|------|
| Q6_K | `diffusiongemma-26B-A4B-it-Q6_K.gguf` | 22 GB |
| Q4_K_M | `diffusiongemma-26B-A4B-it-Q4_K_M.gguf` | 16 GB |

## Max Stable Context

| Quant | Formula | Max ctx | -n limit | VRAM limit |
|-------|---------|---------|----------|------------|
| Q6_K | 16 blocks × 256 + 2048 | 6,144 | -n 4096 | 22 GB model + ~10 GB buffers |
| Q4_K_M | 32 blocks × 256 + 2048 | 10,240 | -n 8192 | 16 GB model + ~14 GB buffers |

Context is limited by compute buffer size: Flash Attention is auto-disabled on the RTX 5090 (SM120), causing O(n²) memory scaling for full attention. The model itself supports up to 262k context; 64k is achievable with Flash Attention enabled.
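
For anyone who wants to plug in their own block count, here is a minimal Python sketch of the formula column in the Max Stable Context table. The constant names are mine, and treating the 2048 term as a fixed reserve on top of the generated tokens is an assumption: the post only gives the formula, not what that term covers.

```python
# Assumed constants, read off the "Formula" column (names are hypothetical).
BLOCK_TOKENS = 256      # tokens per diffusion block
FIXED_RESERVE = 2048    # extra context added on top of the generated tokens

def max_context(blocks: int) -> tuple[int, int]:
    """Return (-n limit, total context) for a given number of blocks."""
    n_limit = blocks * BLOCK_TOKENS
    return n_limit, n_limit + FIXED_RESERVE

# Block counts taken from the table: 16 for Q6_K, 32 for Q4_K_M.
for quant, blocks in {"Q6_K": 16, "Q4_K_M": 32}.items():
    n_limit, ctx = max_context(blocks)
    print(f"{quant}: -n {n_limit}, ctx {ctx}")
```

This reproduces the table values (6,144 and 10,240). It says nothing about how many blocks actually fit in VRAM; that still has to be found by testing.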

## Best Parameters

| Parameter | Q6_K | Q4_K_M |
|-----------|------|--------|
| `--diffusion-eb-t-max` | 0.4 | 0.3 |
| `--diffusion-eb-t-min` | 0.1 | 0.05 |
| `--diffusion-eb-max-steps` | auto (48) | 20 |
| `--diffusion-eb-entropy-bound` | 0.1 (default) | 0.1 (default) |
| `--diffusion-eb-confidence` | 0.005 (default) | 0.005 (default) |
| `--diffusion-eb-stability` | 1 (default) | 1 (default) |
| `-ub` / `-b` | auto-derived from -n | auto-derived from -n |

### Optimal invocations

**Q6_K fastest:**

```
./build/bin/llama-diffusion-cli \
  -m /path/to/diffusiongemma-26B-A4B-it-Q6_K.gguf \
  -ngl 99 -n 2048 \
  --diffusion-eb-t-max 0.4 --diffusion-eb-t-min 0.1
```

**Q4_K_M fastest:**

```
./build/bin/llama-diffusion-cli \
  -m /path/to/diffusiongemma-26B-A4B-it-Q4_K_M.gguf \
  -ngl 99 -n 8192 \
  --diffusion-eb-max-steps 20 \
  --diffusion-eb-t-max 0.3 --diffusion-eb-t-min 0.05
```

## Speed Comparison

### Multi-block throughput (long prompt, 2048 token generation)

| Context | Q6_K default | Q6_K tuned | Q4_K_M default | Q4_K_M tuned |
|---------|-------------|------------|----------------|--------------|
| -n 2048 (ctx=4096) | 180 tok/s | **213 tok/s** | 174 tok/s | **244 tok/s** |
| -n 3072 (ctx=5120) | 183 tok/s | **209 tok/s** | 175 tok/s | **245 tok/s** |
| -n 8192 (ctx=10240) | n/a | n/a | 175 tok/s | **252 tok/s** |

### Short-prompt (single block, 256 tokens)

| Metric | Q6_K default | Q6_K tuned | Q4_K_M default | Q4_K_M tuned |
|--------|-------------|------------|----------------|--------------|
| Throughput | 523 tok/s | 523 tok/s | 456 tok/s | **545 tok/s** |
| Steps per block | 6 | 6 | 8 | **6** |

### Speedup over default

| Quant | -n 2048 | -n 3072 | -n 8192 |
|-------|---------|---------|---------|
| Q6_K | **+18%** | **+14%** | n/a |
| Q4_K_M | **+40%** | **+40%** | **+44%** |

## Parameter Impact Analysis

### Temperature range (t-max / t-min): biggest lever

Lower temperature makes the model less exploratory, so the canvas converges in fewer denoising steps. The effect is consistent across both quantizations.

| t-max / t-min | Q6_K steps/blk | Q6_K tok/s | Q4_K_M steps/blk | Q4_K_M tok/s |
|---------------|----------------|------------|-------------------|--------------|
| 0.8 / 0.4 (default) | 15.8 | 180 | 18.0 | 174 |
| 0.6 / 0.2 | 14.8 | 192 | 16.9 | 188 |
| 0.4 / 0.1 | **13.0** | **213** | 13.2 | 221 |
| 0.3 / 0.05 | 13.5 | 199 | **12.6** | **230** |
| 0.2 / 0.05 | 12.0\* | 223\* | 15.0\* | 260\* |

\* Single-block or partial generation: quality degraded, speed inflated.

Going too cold (t-max < 0.25) kills multi-block generation: the model becomes too deterministic to produce diverse tokens for subsequent blocks.

### EB max-steps: Q4_K_M only

Capping the maximum denoising steps per block helps Q4_K_M but not Q6_K. The smaller quant converges faster, so a hard cap at 20 shaves off ~1.2 steps/block without hurting quality.

| max-steps | Q4_K_M steps/blk | Q4_K_M tok/s |
|-----------|-------------------|--------------|
| auto (48) | 12.6 | 230 |
| 24 | 12.0 | 236 |
| **20** | **11.4** | **244** |
| 18 | 12.2 | 235 |
| 16 | 12.8 | 228 |

### Entropy-bound: stick with default

| entropy-bound | Q6_K tok/s | Q4_K_M tok/s | Effect |
|---------------|------------|---------------|--------|
| 0.05 | 152 | 216 | Too selective → more steps |
| **0.1 (default)** | **180** | **230** | Sweet spot |
| 0.15 | n/a | 240 | Slight improvement on Q4 |
| 0.2 | 158 | 233 | Too noisy → more steps |

### Batch size: auto is optimal

| -ub / -b | Q6_K tok/s | Notes |
|----------|------------|-------|
| auto (4096) | **213** | Derived from -n / ctx |
| 512 | 203 | Smaller = less parallelism |
| 8192 | 213 | Larger = no benefit |

## Key Findings

1. **Q4_K_M is the better choice**: about 67% more context (10,240 vs 6,144 tokens) and 18% faster generation (252 vs 213 tok/s, comparing the best tuned result for each quant).
2. **Temperature is everything**: lowering t-max from 0.8→0.3 and t-min from 0.4→0.05 accounts for virtually all the speedup. The rest of the EB params are already well-tuned at defaults.
3. **Bigger context doesn't slow down Q4_K_M**: speed actually *improves* at larger context (252 tok/s at -n 8192 vs 244 at -n 2048), presumably because the larger batch gives the entropy-bound sampler better signal.
4. **Flash Attention is the blocker for 64k**: once SM120 support lands in llama.cpp, the compute buffer bottleneck goes away and DiffusionGemma's full 262k context should be reachable on a single RTX 5090.
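
If you want to recheck the speedup percentages or compare against your own runs, here is a minimal Python sketch. The tok/s pairs are copied from the multi-block throughput table; the helper name `speedup_pct` is just for illustration, and rounding to a whole percent is my assumption about how the table was produced.

```python
# (default tok/s, tuned tok/s), copied from the multi-block throughput table.
results = {
    ("Q6_K", 2048): (180, 213),
    ("Q6_K", 3072): (183, 209),
    ("Q4_K_M", 2048): (174, 244),
    ("Q4_K_M", 3072): (175, 245),
    ("Q4_K_M", 8192): (175, 252),
}

def speedup_pct(default: float, tuned: float) -> int:
    """Gain of tuned over default throughput, rounded to a whole percent."""
    return round((tuned / default - 1) * 100)

for (quant, n), (default, tuned) in results.items():
    print(f"{quant} -n {n}: +{speedup_pct(default, tuned)}%")
```

Swap in your own measurements to see whether the same temperature settings pay off on other hardware.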
