DiffusionGemma under real workloads feels very different from benchmark demos

Reddit r/LocalLLaMA 06/11/26, 02:18 PM News

diffusion-gemma gpu-benchmarking h100 a100 real-world-performance inference model-testing

Summary

Internal testing of DiffusionGemma reveals significant performance differences between H100 and A100 GPUs under real-world workloads, with H100s scaling much better under concurrency, and efficiency varying greatly depending on workload type, raising questions about benchmark reliability.

Okay, after testing DiffusionGemma a bit more internally, we genuinely can’t tell if this is the start of something big or if everyone’s just getting distracted by crazy TPS numbers again lol.

One thing that stood out REALLY fast for us was how different the H100 vs A100 behavior felt compared to normal transformer inference. On some runs the H100s scaled almost exactly how you’d want them to. The A100s were still good, but once concurrency started increasing, the gap widened way more than we expected. Not the usual “yeah, H100 is faster” difference; this felt more dramatic.

Another thing we noticed: the model looks absolutely insane on cleaner workloads and shorter generations, but once you start mixing longer outputs, uneven request lengths, streaming, multiple users, different temperatures, etc., the behavior changes really fast. Some workloads looked almost suspiciously fast, honestly. Then one messy real-world-style batch would suddenly bring efficiency down harder than expected.

Also, GPU utilization patterns looked very different from what we’re used to seeing with normal decode-heavy serving. Hard to explain properly yet, but it didn’t feel like the classic token-by-token bottleneck situation at all.

Dropping some pics from the A100 test boxes as well.

We’re still testing a lot more combinations + real traffic simulations right now, and honestly the more we test, the more questions we have. Will share more numbers once we finish running more workloads across the stacks.

Curious if others here are seeing similar behavior or completely different results.
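For anyone who wants to compare notes on the clean vs messy distinction, here is a minimal sketch of how the two request mixes could be generated reproducibly. It is not the harness used for the runs in this post. Every name in it (`Request`, `clean_workload`, `messy_workload`, `length_spread`) and every number (length buckets, temperatures, user count, 50% streaming) is a hypothetical placeholder, and it only builds request descriptions; it does not call any serving API.

```python
import random
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical request description; field names are illustrative only.
    user_id: int
    max_new_tokens: int
    temperature: float
    stream: bool


def clean_workload(n, max_new_tokens=128):
    """Benchmark-demo style mix: one user, short uniform outputs, no streaming."""
    return [Request(0, max_new_tokens, 0.7, False) for _ in range(n)]


def messy_workload(n, users=8, seed=0):
    """Mixed traffic: uneven output lengths, several users, varied temperatures."""
    rng = random.Random(seed)  # seeded so two test boxes see the same mix
    lengths = [64, 256, 1024, 4096]  # assumed buckets, not measured traffic
    temperatures = [0.0, 0.7, 1.0]   # assumed values
    return [
        Request(
            user_id=rng.randrange(users),
            max_new_tokens=rng.choice(lengths),
            temperature=rng.choice(temperatures),
            stream=rng.random() < 0.5,
        )
        for _ in range(n)
    ]


def length_spread(requests):
    """Ratio of the longest to the shortest requested output in a batch."""
    lengths = [r.max_new_tokens for r in requests]
    return max(lengths) / min(lengths)
```

Running the same seeded mix on the H100 and A100 boxes at each concurrency level, and reporting `length_spread` next to the throughput number, would make results from different people easier to line up.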

Similar Articles

@mervenoyann: DiffusionGemma is out it's compute-bound so 4x faster compared to other Gemma-4 models (1k tok/s on H100) also great on…

DiffusionGemma 26B A4B results on my 5090

DifussionGemma 4 on 4x7900xtx

DiffusionGemma: 4x Faster Text Generation

Gemma 4 MTP vs DFlash on 1x H100: dense vs MoE results
