DiffusionGemma 26b on a 4090 at up to 475t/s... and some thoughts...

Reddit r/LocalLLaMA 06/18/26, 10:29 PM Models

Summary

A user shares their experience running DiffusionGemma 26B on a 4090 GPU via vLLM, achieving up to 475t/s but noting drawbacks like single-user limitation, lower accuracy, and short context, concluding it's not worth using over the regular 26B model.

Figured I'd post a bit of info for anyone else who was thinking about messing with this model on a 3090/4090. Obviously I can't use the NVFP4 version, but I got it up and running in vLLM using diffusiongemma-26B-A4B-it-AWQ-INT4. I had to run it in the custom vLLM Docker image they provide for the purpose, then load a Gemma 4 tool/reasoning parser.

Once it was all done, it pushed 475t/s on the first prompt, and it seems to run between 290t/s and 700t/s depending on output length and context (long outputs come out very fast). It's pretty heavy on memory though, so you're not getting long context (I tested at 8k and could have gone higher, but not THAT much higher).

Downsides? It's single-user only (it slows down if you try to batch it), its responses are clearly worse (it makes mistakes the regular 26B-A4B doesn't), and it can't find a needle in a haystack to save its life (context fades quickly). Time to first token on short prompts is also a hair slower than a regular LLM: it diffuses a whole chunk and hands it to you all at once, so that first chunk takes a bit longer to show up.

Is it worth bothering with? I don't think so. The regular 26B-A4B running through llama.cpp still hits over 300t/s when batched, and it's significantly more accurate.
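For anyone who wants to compare numbers like these on their own card, here's a rough sketch of how the two kinds of t/s can be computed from your own timings. It's not what the post author ran; `RequestTiming`, `end_to_end_tps`, and `aggregate_tps` are made-up names for illustration, and the example values at the bottom are placeholders, not measurements. Single-stream t/s divides one request's output tokens by its total time (so a fixed wait for the first chunk weighs more on short outputs), while batched t/s sums the tokens of all concurrent requests over one shared wall-clock window.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class RequestTiming:
    """Timing for one generation request, as measured on the client side."""
    output_tokens: int  # tokens the server generated for this request
    ttft_s: float       # seconds from sending the request to the first chunk
    total_s: float      # seconds from sending the request to the last chunk


def end_to_end_tps(t: RequestTiming) -> float:
    """Single-stream tokens per second, time to first token included."""
    return t.output_tokens / t.total_s


def aggregate_tps(timings: Sequence[RequestTiming], wall_s: float) -> float:
    """Combined tokens per second for requests that ran concurrently.

    wall_s is the wall-clock time from the first request being sent to the
    last request finishing, not the sum of the per-request times.
    """
    return sum(t.output_tokens for t in timings) / wall_s


if __name__ == "__main__":
    # Placeholder values only: same first-chunk wait, different output lengths.
    short = RequestTiming(output_tokens=100, ttft_s=0.2, total_s=0.5)
    long_ = RequestTiming(output_tokens=1000, ttft_s=0.2, total_s=2.2)
    print(f"short output: {end_to_end_tps(short):.1f} t/s")
    print(f"long output:  {end_to_end_tps(long_):.1f} t/s")

    # Two concurrent requests that together finished in 1.0 s of wall time.
    batch = [RequestTiming(100, 0.2, 1.0), RequestTiming(100, 0.2, 1.0)]
    print(f"batched:      {aggregate_tps(batch, wall_s=1.0):.1f} t/s")
```

Keeping the single-stream and batched figures separate matters when comparing the two models, since a 475t/s single-user number and a 300t/s batched number aren't measuring the same thing.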
