Gemma 4 26B Hits 600 Tok/s on One RTX 5090

Reddit r/LocalLLaMA 05/08/26, 02:13 PM News

gemma-4 rtx-5090 vllm speculative-decoding df-lash benchmark performance

Summary

A benchmark shows that using vLLM with DFlash speculative decoding boosts Gemma 4 26B inference to ~578 tokens per second on a single RTX 5090, achieving a 2.56x speedup over baseline.

I ran a benchmark to see how much DFlash speculative decoding actually helps in vLLM. Setup: * GPU: RTX 5090, 32GB VRAM * vLLM: 0.19.2rc1 * Main model: cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit * Draft model: z-lab/gemma-4-26B-A4B-it-DFlash * Workload: random dataset, 256 input tokens, 1024 output tokens * Concurrency: 1 * Request rate: 1 * Tested num\_speculative\_tokens from 0 to 15 The short version: Baseline without DFlash: * \~228 output tok/s * \~4455 ms mean E2E latency Best practical DFlash setting: * num\_speculative\_tokens=13 * max\_num\_batched\_tokens=8192 * \~578 output tok/s * \~1738 ms mean E2E latency * \~2.56x speedup One interesting thing: the fastest average setting was not automatically the best serving setting. num\_speculative\_tokens=13 with max\_num\_batched\_tokens=4096 had slightly better mean latency, but worse p95. Moving to 8192 gave a cleaner tail. I made a short video showing the setup, script, benchmark method, graphs, and final recommended command: [https://youtu.be/S\_zbHH5Ycs0](https://youtu.be/S_zbHH5Ycs0) Charts / script / results: [https://medium.com/@ttio2tech\_28094/3a7ac4f73e5d](https://medium.com/@ttio2tech_28094/3a7ac4f73e5d) Curious if others are seeing similar optimal speculative-token counts with DFlash, especially on 4090/5090 or different Gemma/Qwen models.

Original Article

Gemma 4 26B Hits 600 Tok/s on One RTX 5090

Similar Articles

@googlegemma: Gemma 4 up to 3x faster, directly in your phone! Check out the difference Speculative Decoding makes! Multi-Token Predi…

z-lab/gemma-4-31B-it-DFlash

Gemma 4 MTP vs DFlash on 1x H100: dense vs MoE results

I tested Qwen3.6-27B, Qwen3.6-35B-A3B, Qwen3.5-27B and Gemma 4 on the same real architecture-writing task on an RTX 5090

24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context)

Submit Feedback

Similar Articles

@googlegemma: Gemma 4 up to 3x faster, directly in your phone! Check out the difference Speculative Decoding makes! Multi-Token Predi…

Gemma 4 MTP vs DFlash on 1x H100: dense vs MoE results

I tested Qwen3.6-27B, Qwen3.6-35B-A3B, Qwen3.5-27B and Gemma 4 on the same real architecture-writing task on an RTX 5090

24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context)