DiffusionGemma 26b on a 4090 at up to 475t/s... and some thoughts...

Reddit r/LocalLLaMA Models

Summary

A user shares their experience running DiffusionGemma 26B on a 4090 GPU via vLLM. It hit 475 t/s on the first prompt and ranged from about 290 to 700 t/s afterwards, but the drawbacks (single-user only, lower accuracy, weak long-context recall, limited context length) lead them to conclude it's not worth using over the regular 26B model.

Figured I'd post a bit of info for anyone else who was thinking about messing with this model on a 3090/4090. Obviously I can't use the nvfp4 version, but I got it up and running in vLLM using diffusiongemma-26B-A4B-it-AWQ-INT4. Had to run it in the custom vLLM docker they provide for the purpose, then load a gemma 4 tool/reasoning parser.

Once it was all done, it pushed 475t/s on the first prompt, and it seems to run between 290t/s and 700t/s depending on output length and context (long outputs come out very fast). It's pretty heavy though, so you're not getting long context (I tested at 8k and could have gone higher, but not THAT much higher).

Downsides? It's single-user only (it slows down if you try to batch it), its responses are clearly worse (it makes mistakes the regular 26B-A4B doesn't), and it can't find a needle in a haystack to save its life (context fades quick). Time to first token is also a hair slower on short prompts than with a regular LLM (it's diffusing everything and giving you the chunks all at once, so it takes a bit longer to get that first chunk).

Is it worth bothering with? I don't think so. The regular 26B-A4B running through llama.cpp still nails down over 300t/s when batched, and it's significantly more accurate.
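For anyone who wants to compare numbers like these on their own setup, here is a minimal sketch of how time to first token and t/s could be worked out from a streamed response. It assumes you have already recorded, for each streamed chunk, when it arrived and how many tokens it held. The names `stream_stats` and `StreamStats` are made up for illustration, and the figures in the usage example are invented, not measurements from this post.

```python
from dataclasses import dataclass


@dataclass
class StreamStats:
    ttft_s: float      # seconds from request start to the first chunk
    total_tps: float   # all tokens / total wall time (includes the wait for chunk 1)
    decode_tps: float  # tokens after the first chunk / time after the first chunk


def stream_stats(request_start, chunks):
    """Compute latency and throughput from chunk arrival events.

    chunks: list of (arrival_time_s, tokens_in_chunk), in arrival order.
    Hypothetical helper: how you collect these events depends on your client.
    """
    if not chunks:
        raise ValueError("no chunks received")

    first_time = chunks[0][0]
    last_time = chunks[-1][0]
    total_tokens = sum(n for _, n in chunks)

    ttft = first_time - request_start
    total_elapsed = last_time - request_start
    total_tps = total_tokens / total_elapsed if total_elapsed > 0 else float("inf")

    # Exclude the first chunk so the wait before it does not skew the rate.
    decode_tokens = total_tokens - chunks[0][1]
    decode_elapsed = last_time - first_time
    decode_tps = decode_tokens / decode_elapsed if decode_elapsed > 0 else 0.0

    return StreamStats(ttft, total_tps, decode_tps)


if __name__ == "__main__":
    # Invented example: two large chunks, the first arriving after 0.5 s.
    print(stream_stats(0.0, [(0.5, 256), (1.0, 256)]))
```

Splitting the total rate from the post-first-chunk rate matters here because a model that hands back big chunks all at once can show a slower first token while still posting a high overall t/s.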