Gemma 4 26B Hits 600 Tok/s on One RTX 5090

Reddit r/LocalLLaMA News

Summary

A benchmark shows that using vLLM with DFlash speculative decoding boosts Gemma 4 26B inference to ~578 tokens per second on a single RTX 5090, achieving a 2.56x speedup over baseline.

I ran a benchmark to see how much DFlash speculative decoding actually helps in vLLM.

Setup:

* GPU: RTX 5090, 32 GB VRAM
* vLLM: 0.19.2rc1
* Main model: cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit
* Draft model: z-lab/gemma-4-26B-A4B-it-DFlash
* Workload: random dataset, 256 input tokens, 1024 output tokens
* Concurrency: 1
* Request rate: 1
* Tested num_speculative_tokens from 0 to 15

The short version:

Baseline without DFlash:

* ~228 output tok/s
* ~4455 ms mean E2E latency

Best practical DFlash setting:

* num_speculative_tokens=13
* max_num_batched_tokens=8192
* ~578 output tok/s
* ~1738 ms mean E2E latency
* ~2.56x speedup

One interesting thing: the fastest average setting was not automatically the best serving setting. num_speculative_tokens=13 with max_num_batched_tokens=4096 had slightly better mean latency, but a worse p95. Moving to 8192 gave a cleaner tail.

I made a short video showing the setup, script, benchmark method, graphs, and final recommended command: [https://youtu.be/S_zbHH5Ycs0](https://youtu.be/S_zbHH5Ycs0)

Charts / script / results: [https://medium.com/@ttio2tech_28094/3a7ac4f73e5d](https://medium.com/@ttio2tech_28094/3a7ac4f73e5d)

Curious if others are seeing similar optimal speculative-token counts with DFlash, especially on 4090/5090 or with different Gemma/Qwen models.
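For anyone who wants a starting point without watching the video, here is a minimal sketch of how the best setting above might be wired up through vLLM's offline Python API. The model names are taken from the post; the assumption is that DFlash plugs into vLLM's standard speculative_config the same way existing draft-model setups do, so the exact keys it needs may differ.

```python
# Minimal sketch: Gemma 4 26B with a DFlash draft model via vLLM's offline API.
# Model names come from the post; the speculative_config keys are assumed to
# mirror vLLM's existing draft-model integrations and may differ for DFlash.
from vllm import LLM, SamplingParams

llm = LLM(
    model="cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit",
    max_num_batched_tokens=8192,       # the post's best "clean tail" setting
    speculative_config={
        "model": "z-lab/gemma-4-26B-A4B-it-DFlash",
        "num_speculative_tokens": 13,  # sweet spot from the 0-15 sweep
    },
)

# One request at a time with 1024 output tokens, mirroring the benchmark workload.
params = SamplingParams(temperature=0.0, max_tokens=1024)
outputs = llm.generate(["Explain speculative decoding in one paragraph."], params)
print(outputs[0].outputs[0].text)
```

Recent vLLM releases expose the same two knobs on `vllm serve` as `--speculative-config` (JSON) and `--max-num-batched-tokens`, which is the route you'd take to reproduce the serving-side latency numbers rather than offline generation.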

Similar Articles

z-lab/gemma-4-31B-it-DFlash

Hugging Face Models Trending

Z-lab released DFlash, a speculative decoding drafter model for Gemma-4-31B-it that uses lightweight block diffusion to draft multiple tokens in parallel, achieving up to a 5.8x speedup over the autoregressive baseline.

Gemma 4 MTP vs DFlash on 1x H100: dense vs MoE results

Reddit r/LocalLLaMA

This benchmark compares Gemma 4's Multi-Token Prediction (MTP) and z-lab's DFlash speculative decoding on a single H100 GPU, finding MTP faster for dense models and DFlash faster for MoE models.

24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context)

Reddit r/LocalLLaMA

A developer demonstrates running MoE models such as Qwen 3.6 35B-A3B and Gemma 4 26B-A4B at 24+ tok/s with 128k context on an old GTX 1080 (8 GB VRAM), using llama.cpp with MoE offloading and TurboQuant KV cache quantization, and shares optimization tricks for Gemma's MTP speculative decoding.