Gemma 4 26B Hits 600 Tok/s on One RTX 5090
Summary
A benchmark shows that using vLLM with DFlash speculative decoding boosts Gemma 4 26B inference to ~578 tokens per second on a single RTX 5090, achieving a 2.56x speedup over baseline.
Similar Articles
@leopardracer: GEMMA 4 26B ON AN RTX 4060 WITH A 248K TOKEN CONTEXT WINDOW 20 tokens per second and a context window so large you can …
Gemma 4 26B runs on an RTX 4060 with 248K token context at 20 tokens per second using llama.cpp and Q4_K_XL quantization, enabling local processing of entire codebases on consumer hardware.
@analogalok: Gemma 4 12B QAT (dense) achieves 1000+ tokens/sec prefill on 8GB VRAM with 120k context Gemma 4 12B QAT (dense), TurboQ…
Gemma 4 12B QAT (dense) achieves over 1000 tokens per second prefill on an 8GB RTX 4060 with 120k context using TurboQuant, enabling full GPU layer offloading. This represents a 42% increase in prefill speed over previous methods.
DiffusionGemma 26B A4B results on my 5090
This post presents benchmark results and tuning parameters for running DiffusionGemma 26B A4B GGUF models on an RTX 5090 GPU, showing up to 44% speedup via optimized temperature settings and quantization choices.
Gemma 4 E2B running in-browser at 255 tok/s using WebGPU kernels written by Fable 5
Gemma 4 is demonstrated running in-browser via WebGPU at 255 tokens per second, using kernels generated by Fable 5, showcasing efficient on-device inference.
DiffusionGemma 26b on a 4090 at up to 475t/s... and some thoughts...
A user shares their experience running DiffusionGemma 26B on a 4090 GPU via vLLM, achieving up to 475t/s but noting drawbacks like single-user limitation, lower accuracy, and short context, concluding it's not worth using over the regular 26B model.