vLLM announces native support for Google DeepMind's DiffusionGemma, a 26B discrete diffusion language model that generates 256-token blocks in parallel, enabling low-latency inference at 1200+ tok/s on a single H200.
DiffusionGemma is out: it is compute-bound, runs 4x faster than other Gemma-4 models (1k tok/s on an H100), and excels at coding tasks, including 3D generation and front-end work.
Google introduces DiffusionGemma, an experimental 26B MoE open model that achieves up to 4x faster text generation on GPUs using text diffusion, targeting speed-critical interactive local workflows.