Gemma 4 26B Hits 600 Tok/s on One RTX 5090

Reddit r/LocalLLaMA News

Summary

A benchmark shows that using vLLM with DFlash speculative decoding boosts Gemma 4 26B inference to ~578 tokens per second on a single RTX 5090, achieving a 2.56x speedup over baseline.

I ran a benchmark to see how much DFlash speculative decoding actually helps in vLLM.

Setup:

* GPU: RTX 5090, 32GB VRAM
* vLLM: 0.19.2rc1
* Main model: cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit
* Draft model: z-lab/gemma-4-26B-A4B-it-DFlash
* Workload: random dataset, 256 input tokens, 1024 output tokens
* Concurrency: 1
* Request rate: 1
* Tested num_speculative_tokens from 0 to 15

The short version:

Baseline without DFlash:

* ~228 output tok/s
* ~4455 ms mean E2E latency

Best practical DFlash setting:

* num_speculative_tokens=13
* max_num_batched_tokens=8192
* ~578 output tok/s
* ~1738 ms mean E2E latency
* ~2.56x speedup

One interesting thing: the fastest average setting was not automatically the best serving setting. num_speculative_tokens=13 with max_num_batched_tokens=4096 had slightly better mean latency, but worse p95. Moving to 8192 gave a cleaner tail.

I made a short video showing the setup, script, benchmark method, graphs, and final recommended command: [https://youtu.be/S_zbHH5Ycs0](https://youtu.be/S_zbHH5Ycs0)

Charts / script / results: [https://medium.com/@ttio2tech_28094/3a7ac4f73e5d](https://medium.com/@ttio2tech_28094/3a7ac4f73e5d)

Curious if others are seeing similar optimal speculative-token counts with DFlash, especially on 4090/5090 or different Gemma/Qwen models.
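For anyone who wants to sanity-check the headline numbers, here is a minimal Python sketch that recomputes the speedup from the figures reported above. The constant and function names are made up for illustration, and the "implied" rate is only a rough cross-check (output tokens divided by mean E2E latency), so it is not expected to match the reported tok/s exactly.

```python
# Figures as reported in the post (all approximate, single RTX 5090).
BASELINE_TOK_S = 228.0    # output tok/s without DFlash
BASELINE_E2E_MS = 4455.0  # mean end-to-end latency without DFlash
DFLASH_TOK_S = 578.0      # output tok/s, num_speculative_tokens=13
DFLASH_E2E_MS = 1738.0    # mean end-to-end latency, same setting
OUTPUT_TOKENS = 1024      # output length used in the workload


def implied_tok_s(output_tokens: int, e2e_ms: float) -> float:
    """Rough per-request rate: output tokens divided by E2E latency in seconds."""
    return output_tokens / (e2e_ms / 1000.0)


# Speedup measured two ways: by throughput and by mean latency.
throughput_speedup = DFLASH_TOK_S / BASELINE_TOK_S
latency_speedup = BASELINE_E2E_MS / DFLASH_E2E_MS

print(f"throughput speedup: {throughput_speedup:.2f}x")
print(f"latency speedup:    {latency_speedup:.2f}x")
print(f"implied baseline rate: {implied_tok_s(OUTPUT_TOKENS, BASELINE_E2E_MS):.0f} tok/s")
print(f"implied DFlash rate:   {implied_tok_s(OUTPUT_TOKENS, DFLASH_E2E_MS):.0f} tok/s")
```

With these inputs the latency ratio rounds to 2.56x, which matches the reported speedup, while the throughput ratio comes out slightly lower at about 2.54x. Both are built from rounded figures, so the small gap is not surprising.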
