@_philschmid: Gemma goes diffusion! DiffusionGemma with up to 1000+ tokens per second! - Built on Gemma 4 as a 26B MoE model. - 3.8B …


Summary

DiffusionGemma, a 26B MoE model built on Gemma 4 with 3.8B parameters active during inference, reaches up to 1000+ tokens per second by using diffusion to generate text in parallel 256-token blocks. It fits within 18 GB of VRAM when quantized and is released under Apache 2.0.

Cached at: 06/10/26, 05:53 PM

Gemma goes diffusion! DiffusionGemma with up to 1000+ tokens per second! 🌬️

  • Built on Gemma 4 as a 26B MoE model.
  • 3.8B parameters during inference.
  • Generates text in 256-token blocks in parallel.
  • Fits within 18 GB VRAM limits when quantized.
  • Apache 2.0 https://t.co/rnQsdRNoD0
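
The 18 GB figure can be sanity-checked with rough arithmetic. The sketch below is a back-of-the-envelope estimate, not the model's published memory profile. The post does not say which quantization scheme is used, so the bit widths are assumptions, and the estimate covers weights only (no activations, caches, or runtime overhead). It also assumes that all 26B weights stay resident in memory even though only 3.8B are active per step, which is typical for MoE models without offloading.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 26e9   # from the post: 26B MoE model
VRAM_LIMIT_GB = 18    # from the post: fits within 18 GB when quantized

# Assumed bit widths; the post does not name the quantization format.
for bits in (16, 8, 4):
    gb = weight_memory_gb(TOTAL_PARAMS, bits)
    verdict = "fits" if gb <= VRAM_LIMIT_GB else "does not fit"
    print(f"{bits:>2}-bit weights: {gb:.1f} GB ({verdict} in {VRAM_LIMIT_GB} GB)")
```

Under these assumptions only the 4-bit case stays under the limit, at roughly 13 GB for weights, which leaves about 5 GB for everything else.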

Similar Articles

DiffusionGemma: 4x Faster Text Generation

Hacker News Top

Google introduces DiffusionGemma, an experimental 26B MoE open model that achieves up to 4x faster text generation on GPUs using text diffusion, targeting speed-critical interactive local workflows.

DiffusionGemma

Simon Willison's Blog

Google released DiffusionGemma, an open-weight text generation model (26B parameters, 4B active) under the Apache 2.0 license, demonstrating high inference speeds via NVIDIA's NIM cloud API.

DiffusionGemma: The Developer Guide - Google Developers Blog

Reddit r/LocalLLaMA

DiffusionGemma is a new experimental model from Google DeepMind that uses parallel generation on a 256-token canvas, achieving up to 4x faster token generation on GPUs. This developer guide explains its architecture, bidirectional context, and includes a fine-tuning recipe for solving Sudoku.
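
To get a feel for what a 256-token canvas means in practice, here is a minimal arithmetic sketch. It assumes output is produced one full canvas at a time, with the last canvas partly filled, and it treats the quoted 1000 tokens per second as a sustained rate, although the source only gives it as an "up to" figure on unspecified hardware. The function names are illustrative and not part of any DiffusionGemma API.

```python
BLOCK_TOKENS = 256  # canvas size stated in the source


def blocks_needed(n_tokens: int, block_tokens: int = BLOCK_TOKENS) -> int:
    """Number of canvases needed to cover n_tokens (ceiling division)."""
    return (n_tokens + block_tokens - 1) // block_tokens


def seconds_at(n_tokens: int, tokens_per_second: float) -> float:
    """Generation time if the given throughput were sustained."""
    return n_tokens / tokens_per_second


# Assumed response lengths, chosen only for illustration.
for n in (256, 1000, 4096):
    print(f"{n} tokens: {blocks_needed(n)} block(s), "
          f"{seconds_at(n, 1000):.3f} s at 1000 tok/s")
```

At that rate a single full canvas would take about a quarter of a second, which is the kind of latency the interactive use cases mentioned above depend on.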