@_philschmid: Gemma goes diffusion! DiffusionGemma with up to 1000+ tokens per second! - Built on Gemma 4 as a 26B MoE model. - 3.8B …

X AI KOLs Following 06/10/26, 04:24 PM Models

diffusion gemma moe text-generation parallel-inference apache-2.0 open-source

Summary

DiffusionGemma is a 26B MoE model built on Gemma 4 that uses diffusion to generate text in parallel 256-token blocks, reaching up to 1000+ tokens per second. It uses 3.8B parameters during inference, fits within 18 GB of VRAM when quantized, and is released under Apache 2.0.

Gemma goes diffusion! DiffusionGemma with up to 1000+ tokens per second! 🌬️

- Built on Gemma 4 as a 26B MoE model.
- 3.8B parameters during inference.
- Generates text in 256-token blocks in parallel.
- Fits within 18 GB VRAM limits when quantized.
- Apache 2.0 https://t.co/rnQsdRNoD0
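As a rough sanity check on the 18 GB figure, the sketch below estimates weight storage for a 26B-parameter model at a few bit widths. It assumes all 26B parameters must be resident in memory even though only 3.8B are used per token (typical for MoE models, but not stated in the post), and it ignores activations, caches, and quantization metadata. The bit widths are illustrative; the post does not say which quantization format the 18 GB figure refers to.

```python
def weight_memory_gb(total_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (10^9 bytes), weights only."""
    return total_params * bits_per_param / 8 / 1e9

TOTAL_PARAMS = 26e9  # from the post: 26B MoE model

# Illustrative bit widths (assumption: the post does not name a format).
for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weight_memory_gb(TOTAL_PARAMS, bits):.1f} GB")
```

Under these assumptions, 4-bit weights alone come to about 13 GB, which would leave headroom within 18 GB for runtime overhead, while 8-bit weights (about 26 GB) would not fit.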

Cached at: 06/10/26, 05:53 PM

Similar Articles

DiffusionGemma: 4x Faster Text Generation

Hacker News Top

Google introduces DiffusionGemma, an experimental 26B MoE open model that achieves up to 4x faster text generation on GPUs using text diffusion, targeting speed-critical interactive local workflows.

@mervenoyann: DiffusionGemma is out it's compute-bound so 4x faster compared to other Gemma-4 models (1k tok/s on H100) also great on…

X AI KOLs Following

DiffusionGemma is out; it's compute-bound and 4x faster than other Gemma-4 models with 1k tok/s on H100, and excels at coding tasks including 3D generation and front-end.
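To put the quoted numbers side by side, here is a small back-of-the-envelope calculation. It treats "1k tok/s" and "4x" as peak figures taken at face value; the baseline throughput it derives is an inference from those two numbers, not something any of the posts state.

```python
BLOCK_TOKENS = 256        # block size quoted for DiffusionGemma
PEAK_TOK_PER_S = 1000.0   # "1k tok/s on H100" (peak figure)
SPEEDUP = 4.0             # "4x faster" (upper bound as quoted)

def seconds_per_block(tok_per_s: float, block_tokens: int = BLOCK_TOKENS) -> float:
    """Time to produce one block at a given sustained throughput."""
    return block_tokens / tok_per_s

# Inferred, not reported: the throughput the 4x claim would imply for the baseline.
implied_baseline = PEAK_TOK_PER_S / SPEEDUP

print(f"Implied baseline: {implied_baseline:.0f} tok/s")
print(f"One block at peak: {seconds_per_block(PEAK_TOK_PER_S):.3f} s")
print(f"One block at implied baseline: {seconds_per_block(implied_baseline):.3f} s")
```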

DiffusionGemma

Simon Willison's Blog

Google released DiffusionGemma, an open-weight text generation model (26B parameters, 4B active) under the Apache 2.0 license, demonstrating high inference speeds via NVIDIA's NIM cloud API.

@HuggingPapers: NVIDIA just released an NVFP4-quantized DiffusionGemma on Hugging Face A 26B MoE multimodal model generating text via p…

X AI KOLs Following

NVIDIA released a 26B MoE multimodal model called DiffusionGemma on Hugging Face, using NVFP4 quantization and achieving over 1,100 tokens per second on Hopper hardware.

DiffusionGemma: The Developer Guide - Google Developers Blog

Reddit r/LocalLLaMA

DiffusionGemma is a new experimental model from Google DeepMind that uses parallel generation on a 256-token canvas, achieving up to 4x faster token generation on GPUs. This developer guide explains its architecture, bidirectional context, and includes a fine-tuning recipe for solving Sudoku.
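To give a feel for what "parallel generation on a 256-token canvas" can mean, here is a toy sketch of a generic masked-diffusion-style decoding loop. It is not DiffusionGemma's actual algorithm: the stand-in `toy_predict` function, the toy vocabulary, the confidence-based commit rule, and the step count are all assumptions made for illustration. The point is only that every position on the canvas is proposed at once and the block is finished in a handful of passes rather than one pass per token.

```python
import random

MASK = None                   # marks a position that has not been decided yet
CANVAS_SIZE = 256             # block size quoted in the summary above
VOCAB = ["a", "b", "c", "d"]  # toy vocabulary, purely illustrative

def toy_predict(canvas, rng):
    """Hypothetical stand-in for a model.

    Proposes a (token, confidence) pair for every masked position. A real
    model would condition on the whole canvas (bidirectional context);
    this stub just samples at random.
    """
    return {i: (rng.choice(VOCAB), rng.random())
            for i, tok in enumerate(canvas) if tok is MASK}

def decode_block(num_steps=8, seed=0):
    """Fill a fully masked canvas in at most num_steps parallel passes."""
    rng = random.Random(seed)
    canvas = [MASK] * CANVAS_SIZE
    for step in range(num_steps):
        proposals = toy_predict(canvas, rng)
        if not proposals:
            break
        # Commit an equal share of the remaining positions each pass,
        # most confident first; the final pass commits everything left.
        remaining_steps = num_steps - step
        k = -(-len(proposals) // remaining_steps)  # ceiling division
        ranked = sorted(proposals.items(), key=lambda kv: kv[1][1], reverse=True)
        for i, (tok, _conf) in ranked[:k]:
            canvas[i] = tok
    return canvas

block = decode_block()
print(sum(tok is not MASK for tok in block), "of", CANVAS_SIZE, "positions filled")
```

In this sketch the 256-token block is completed in 8 calls to the stand-in model, whereas token-by-token decoding would need 256 calls. How many passes the real model uses is not stated in the summaries above.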