DiffusionGemma under real workloads feels very different from benchmark demos

Reddit r/LocalLLaMA News

Summary

Internal testing of DiffusionGemma shows a larger-than-expected performance gap between H100 and A100 GPUs under real-world workloads, with H100s scaling much better as concurrency increases. Efficiency also varies sharply with workload type, which raises questions about how well benchmark numbers reflect real serving.

okay, after testing DiffusionGemma a bit more internally, we genuinely can’t tell if this is the start of something big or if everyone’s just getting distracted by crazy TPS numbers again lol

but one thing that stood out REALLY fast for us was how different the H100 vs A100 behavior felt compared to normal transformer inference. on some runs the H100s scaled almost exactly how you’d want them to. the A100s were still good, but once concurrency started increasing, the gap widened way more than we expected. not the usual “yeah H100 is faster” difference - this felt more dramatic.

another thing we noticed was that the model looks absolutely insane on cleaner workloads and shorter generations, but once you start mixing longer outputs, uneven request lengths, streaming, multiple users, different temperatures etc., the behavior changes really fast. some workloads looked almost suspiciously fast, honestly. then one messy real-world style batch would suddenly bring efficiency down harder than expected.

also, GPU utilization patterns looked very different from what we’re used to seeing with normal decode-heavy serving. hard to explain properly yet, but it didn’t feel like the classic token-by-token bottleneck situation at all.

dropping some pics from the A100 test boxes as well.

we’re still testing a lot more combinations + real traffic simulations right now, and honestly the more we test, the more questions we have. will share more numbers once we finish running more workloads across the stacks.

curious if others here are seeing similar behavior or completely different results.
