DiffusionGemma 26B A4B results on my 5090

Reddit r/LocalLLaMA Models

Summary

This post presents benchmark results and tuning parameters for running DiffusionGemma 26B A4B GGUF models on an RTX 5090 GPU, showing speedups of up to 44% over default settings (Q4_K_M) from a lower temperature range and a cap on denoising steps.

# DiffusionGemma 26B A4B: Tuning Results

[https://huggingface.co/unsloth/diffusiongemma-26B-A4B-it-GGUF](https://huggingface.co/unsloth/diffusiongemma-26B-A4B-it-GGUF)

## System

- **GPU**: RTX 5090 (32 GB VRAM), CUDA 13.3
- **Build**: `llama.cpp` PR #24423, GCC-15, Ninja, ccache
- **Flash Attention**: Auto-disabled on SM120, which limits max context
- **Models**: `unsloth/diffusiongemma-26B-A4B-it-GGUF`

## Models

| Variant | File | Size |
|---------|------|------|
| Q6_K | `diffusiongemma-26B-A4B-it-Q6_K.gguf` | 22 GB |
| Q4_K_M | `diffusiongemma-26B-A4B-it-Q4_K_M.gguf` | 16 GB |

## Max Stable Context

| Quant | Formula | Max ctx | -n limit | VRAM limit |
|-------|---------|---------|----------|------------|
| Q6_K | 16 blocks × 256 + 2048 | 6,144 | -n 4096 | 22 GB model + ~10 GB buffers |
| Q4_K_M | 32 blocks × 256 + 2048 | 10,240 | -n 8192 | 16 GB model + ~14 GB buffers |

Context is limited by compute buffer size: Flash Attention is auto-disabled on RTX 5090 (SM120), causing O(n²) memory scaling for full attention. The model itself supports up to 262k context; 64k is achievable with Flash Attention enabled.

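
The "Formula" column in the Max Stable Context table can be checked with a few lines of Python. This is only a sketch of the arithmetic: the constant names are mine, and the post does not say what the fixed 2048-token term represents, so it is treated here as an opaque offset.

```python
# Constants read off the "Formula" column of the Max Stable Context table.
# The names are illustrative; they are not llama.cpp identifiers.
BLOCK_TOKENS = 256     # tokens per block
EXTRA_CONTEXT = 2048   # fixed term added on top of the generated blocks


def generation_limit(blocks: int) -> int:
    """Largest -n value for a given number of blocks."""
    return blocks * BLOCK_TOKENS


def max_context(blocks: int) -> int:
    """Total context: generated tokens plus the fixed 2048-token term."""
    return generation_limit(blocks) + EXTRA_CONTEXT


for quant, blocks in {"Q6_K": 16, "Q4_K_M": 32}.items():
    print(f"{quant}: -n {generation_limit(blocks)}, ctx {max_context(blocks)}")
```

The same offset shows up in the throughput tables, where `-n 2048` is listed with `ctx=4096`.
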
## Best Parameters

| Parameter | Q6_K | Q4_K_M |
|-----------|------|--------|
| `--diffusion-eb-t-max` | 0.4 | 0.3 |
| `--diffusion-eb-t-min` | 0.1 | 0.05 |
| `--diffusion-eb-max-steps` | auto (48) | 20 |
| `--diffusion-eb-entropy-bound` | 0.1 (default) | 0.1 (default) |
| `--diffusion-eb-confidence` | 0.005 (default) | 0.005 (default) |
| `--diffusion-eb-stability` | 1 (default) | 1 (default) |
| `-ub` / `-b` | auto-derived from -n | auto-derived from -n |

### Optimal invocations

**Q6_K fastest:**

```
./build/bin/llama-diffusion-cli \
  -m /path/to/diffusiongemma-26B-A4B-it-Q6_K.gguf \
  -ngl 99 -n 2048 \
  --diffusion-eb-t-max 0.4 --diffusion-eb-t-min 0.1
```

**Q4_K_M fastest:**

```
./build/bin/llama-diffusion-cli \
  -m /path/to/diffusiongemma-26B-A4B-it-Q4_K_M.gguf \
  -ngl 99 -n 8192 \
  --diffusion-eb-max-steps 20 \
  --diffusion-eb-t-max 0.3 --diffusion-eb-t-min 0.05
```

## Speed Comparison

### Multi-block throughput (long prompt, 2048 token generation)

| Context | Q6_K default | Q6_K tuned | Q4_K_M default | Q4_K_M tuned |
|---------|-------------|------------|----------------|--------------|
| -n 2048 (ctx=4096) | 180 tok/s | **213 tok/s** | 174 tok/s | **244 tok/s** |
| -n 3072 (ctx=5120) | 183 tok/s | **209 tok/s** | 175 tok/s | **245 tok/s** |
| -n 8192 (ctx=10240) | n/a | n/a | 175 tok/s | **252 tok/s** |

### Short-prompt (single block, 256 tokens)

| Metric | Q6_K default | Q6_K tuned | Q4_K_M default | Q4_K_M tuned |
|--------|-------------|------------|----------------|--------------|
| Throughput | 523 tok/s | 523 tok/s | 456 tok/s | **545 tok/s** |
| Steps per block | 6 | 6 | 8 | **6** |

### Speedup over default

| Quant | -n 2048 | -n 3072 | -n 8192 |
|-------|---------|---------|---------|
| Q6_K | **+18%** | **+14%** | n/a |
| Q4_K_M | **+40%** | **+40%** | **+44%** |

## Parameter Impact Analysis

### Temperature range (t-max / t-min): biggest lever

Lower temperature makes the model less exploratory, so the canvas converges in fewer denoising steps. The effect is consistent across both quantizations.

| t-max / t-min | Q6_K steps/blk | Q6_K tok/s | Q4_K_M steps/blk | Q4_K_M tok/s |
|---------------|----------------|------------|-------------------|--------------|
| 0.8 / 0.4 (default) | 15.8 | 180 | 18.0 | 174 |
| 0.6 / 0.2 | 14.8 | 192 | 16.9 | 188 |
| 0.4 / 0.1 | **13.0** | **213** | 13.2 | 221 |
| 0.3 / 0.05 | 13.5 | 199 | **12.6** | **230** |
| 0.2 / 0.05 | 12.0\* | 223\* | 15.0\* | 260\* |

\* Single-block or partial generation: quality degraded, speed inflated.

Going too cold (t-max < 0.25) kills multi-block generation: the model becomes too deterministic to produce diverse tokens for subsequent blocks.

### EB max-steps: Q4_K_M only

Capping the maximum denoising steps per block helps Q4_K_M but not Q6_K. The smaller quant converges faster, so a hard cap at 20 shaves off ~1.2 steps/block without hurting quality.

| max-steps | Q4_K_M steps/blk | Q4_K_M tok/s |
|-----------|-------------------|--------------|
| auto (48) | 12.6 | 230 |
| 24 | 12.0 | 236 |
| **20** | **11.4** | **244** |
| 18 | 12.2 | 235 |
| 16 | 12.8 | 228 |

### Entropy-bound: stick with default

| entropy-bound | Q6_K tok/s | Q4_K_M tok/s | Effect |
|---------------|------------|---------------|--------|
| 0.05 | 152 | 216 | Too selective → more steps |
| **0.1 (default)** | **180** | **230** | Sweet spot |
| 0.15 | n/a | 240 | Slight improvement on Q4 |
| 0.2 | 158 | 233 | Too noisy → more steps |

### Batch size: auto is optimal

| -ub / -b | Q6_K tok/s | Notes |
|----------|------------|-------|
| auto (4096) | **213** | Derived from -n / ctx |
| 512 | 203 | Smaller = less parallelism |
| 8192 | 213 | Larger = no benefit |

## Key Findings

1. **Q4_K_M is the better choice**: about 67% more context (10,240 vs 6,144 tokens) and 18% faster generation (252 tok/s at -n 8192 vs Q6_K's best of 213 tok/s at -n 2048).
2. **Temperature is everything**: lowering t-max from 0.8→0.3 and t-min from 0.4→0.05 accounts for virtually all the speedup. The rest of the EB params are already well-tuned at defaults.
3. **Bigger context doesn't slow down Q4_K_M**: speed actually *improves* at larger context (252 tok/s at -n 8192 vs 244 at -n 2048). Presumably the larger batch gives the entropy-bound sampler better signal.
4. **Flash Attention is the blocker for 64k**: once SM120 support lands in llama.cpp, the compute buffer bottleneck goes away and DiffusionGemma's full 262k context should be reachable on a single RTX 5090.
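
For anyone who wants to re-derive the "Speedup over default" percentages, here is a minimal sketch that recomputes them from the multi-block throughput table. The numbers are copied from that table; the dictionary layout and function name are just illustrative choices.

```python
# (default tok/s, tuned tok/s) copied from the multi-block throughput table,
# keyed by (quant, -n value).
THROUGHPUT = {
    ("Q6_K", 2048): (180, 213),
    ("Q6_K", 3072): (183, 209),
    ("Q4_K_M", 2048): (174, 244),
    ("Q4_K_M", 3072): (175, 245),
    ("Q4_K_M", 8192): (175, 252),
}


def speedup_percent(default: float, tuned: float) -> int:
    """Relative gain of tuned over default, rounded to a whole percent."""
    return round((tuned - default) / default * 100)


SPEEDUPS = {key: speedup_percent(*pair) for key, pair in THROUGHPUT.items()}

for (quant, n), pct in SPEEDUPS.items():
    print(f"{quant} -n {n}: +{pct}%")
```

The same helper covers the cross-quant comparison: `speedup_percent(213, 252)` gives the gap between the best tuned Q6_K and Q4_K_M results.
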