Google's latest DiffusionGemma open AI model comes with a 4x speed boost

Ars Technica Models

Summary

Google released DiffusionGemma, an experimental open-source diffusion model for text generation that achieves a 4x speed boost over autoregressive models and is optimized for local processing.

<p>Another day, another AI model from Google. This time, Google DeepMind has released a new member of the <a href="https://arstechnica.com/ai/2026/04/google-announces-gemma-4-open-ai-models-switches-to-apache-2-0-license/">Gemma 4 open model family</a>, but it's fundamentally different from the rest of the lineup. DiffusionGemma doesn't generate outputs linearly like most AI models. Instead, it can produce an entire block of text in parallel. <a href="https://blog.google/innovation-and-ai/technology/developers-tools/diffusion-gemma-faster-text-generation/">Google says</a> this makes it faster and more efficient when running on local hardware like an Nvidia DGX or a humble gaming GPU.</p>
<p>Most AI models are designed to be autoregressive: they generate text left to right, one token at a time. DiffusionGemma has more in common with image generation models, which start with static and then denoise it to create the desired content. This model starts with a field of placeholder tokens, then runs over the canvas multiple times, generating likely tokens and using those to improve its estimates of the others. At the end of the process, the model finalizes its token outputs in one large block, the "denoised" text canvas.</p>
<figure class="video ars-wp-video ars-wp-video--horizontal"> <div class="" style=""> <div class="wrapper ars-wp-video-wrapper relative" style="aspect-ratio: 1.7732656514382;"> <video class="wp-video-shortcode absolute w-full h-full object-contain left-0 top-0" id="video-2158841-1" width="2096" height="1182" preload="metadata" controls="controls"><source type="video/mp4" src="https://cdn.arstechnica.net/wp-content/uploads/2026/06/Diffusion_Process_3_1.mp4?_=1"></source></video> </div> <figcaption> <span class="icon caption-arrow icon-drop-indicator"></span> </figcaption> </div> </figure>
<p>DiffusionGemma is fairly large in the realm of Google's open models. It's a Mixture of Experts (MoE) model with a total of 26 billion parameters, but only 3.8 billion are activated during inference. That means it should fit in the 18GB RAM allotment of a high-end GPU. In testing with an RTX 5090, DiffusionGemma spits out around 700 tokens per second. With a single Nvidia H100 AI accelerator, DiffusionGemma can produce 1,000+ tokens per second. That's about four times the output of the similarly sized autoregressive Gemma models.</p>
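The denoising process described above can be pictured with a toy loop. This is only a rough sketch of the general idea of filling a masked canvas in parallel, not DiffusionGemma's actual algorithm: `toy_predict`, `denoise`, `tokens_per_pass`, and the confidence formula are all hypothetical, and the "model" simply reads from a known target sentence.

```python
import random

MASK = "<mask>"

def toy_predict(canvas, target, rng):
    """Hypothetical stand-in for one model pass over the whole canvas.

    Returns {position: (proposed_token, confidence)} for every masked slot.
    A real model would predict from context; this toy reads the known target
    and raises confidence when neighboring slots are already filled.
    """
    proposals = {}
    for i, tok in enumerate(canvas):
        if tok != MASK:
            continue
        neighbors = [canvas[j] for j in (i - 1, i + 1) if 0 <= j < len(canvas)]
        filled = sum(1 for n in neighbors if n != MASK)
        confidence = 0.3 + 0.3 * filled + 0.1 * rng.random()
        proposals[i] = (target[i], confidence)
    return proposals

def denoise(target, tokens_per_pass=3, seed=0):
    """Fill a fully masked canvas over several passes, several tokens at a time."""
    rng = random.Random(seed)
    canvas = [MASK] * len(target)
    passes = 0
    while MASK in canvas:
        proposals = toy_predict(canvas, target, rng)
        # Commit only the most confident proposals; the rest stay masked
        # and are re-estimated on the next pass with more context.
        best = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in best[:tokens_per_pass]:
            canvas[i] = proposals[i][0]
        passes += 1
    return canvas, passes

target = "the quick brown fox jumps over the lazy dog".split()
canvas, passes = denoise(target)
print(" ".join(canvas), f"({passes} passes for {len(target)} tokens)")
```

With nine tokens and three committed per pass, the loop finishes in three passes, where a token-by-token loop would need nine.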

Cached at: 06/10/26, 08:46 PM

# Google's latest DiffusionGemma open AI model comes with a 4x speed boost

Source: [https://arstechnica.com/google/2026/06/googles-latest-diffusiongemma-open-ai-model-comes-with-a-4x-speed-boost/](https://arstechnica.com/google/2026/06/googles-latest-diffusiongemma-open-ai-model-comes-with-a-4x-speed-boost/)

## Multiple paths to local efficiency

If diffusion is so much faster, why isn’t Google using it in big cloud-based Gemini models? Google has [experimented with this](https://arstechnica.com/gadgets/2025/05/gemini-2-5-is-leaving-preview-just-in-time-for-googles-new-250-ai-subscription/), but there are a few drawbacks to text diffusion, including a higher error rate. In image diffusion models, a single badly predicted pixel doesn’t make the image useless, but language is discrete. An equivalent error in text can make a block of tokens meaningless and force you to start over to get a better output. Diffusion models also waste resources when the desired output is only a few tokens long. They have to do a lot more parallel work to whittle down to, say, five tokens that an autoregressive model would produce from beginning to end in just five steps.

[![](https://cdn.arstechnica.net/wp-content/uploads/2026/06/updated-Intelligence_vs_Latency_.width-1000.format-webp.png)](https://cdn.arstechnica.net/wp-content/uploads/2026/06/updated-Intelligence_vs_Latency_.width-1000.format-webp.png)

DiffusionGemma is about as capable as other Gemma models, but it’s much faster. Credit: Google

The efficiency gain for local processing makes this an appealing avenue of experimentation, though. In the cloud, autoregressive models can batch large numbers of compute jobs from multiple users so they’re always churning out tokens, and the high-bandwidth memory (HBM) used in these systems can move data around much more efficiently. Conversely, local AI encounters wasted compute cycles due to lower memory bandwidth and idle time. Diffusion models can make more efficient use of available compute, but this isn’t the only way. Google also recently began [implementing Multi-Token Prediction](https://arstechnica.com/ai/2026/05/googles-gemma-4-open-ai-models-use-speculative-decoding-to-get-up-to-3x-faster/) (MTP) drafters, which use otherwise wasted compute cycles to predict possible tokens to increase speed. But diffusion is even faster than the MTP versions of Gemma.

Google stresses that DiffusionGemma is experimental, but it’s available under the same Apache 2.0 license as all the other fourth-generation Gemma models. You can download the model weights today from [Hugging Face](https://huggingface.co/google/diffusiongemma-26B-A4B-it). Google says it worked with Nvidia to ensure DiffusionGemma was optimized for a variety of setups, including high-end RTX GPUs (quantized) and enterprise systems like the H100 or DGX Spark platform.
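The short-output drawback mentioned above can be illustrated with a back-of-the-envelope count of sequential model passes. This is a simplified sketch, not a measurement: `block_size` and `passes_per_block` are made-up illustrative values, not published DiffusionGemma settings, and the count ignores that each diffusion pass processes a whole block and therefore costs more compute than a single autoregressive step.

```python
import math

def autoregressive_steps(n_tokens):
    # One sequential forward pass per generated token.
    return n_tokens

def diffusion_steps(n_tokens, block_size=256, passes_per_block=16):
    # Hypothetical cost model: every block needs a fixed number of
    # denoising passes, no matter how few tokens are actually wanted.
    blocks = math.ceil(n_tokens / block_size)
    return blocks * passes_per_block

for n in (5, 64, 256, 1024):
    ar = autoregressive_steps(n)
    df = diffusion_steps(n)
    print(f"{n:>5} tokens: autoregressive={ar:>5} passes, diffusion={df:>3} passes")
```

Under these assumed numbers, a five-token reply takes more passes with diffusion than with token-by-token generation, while longer outputs favor diffusion by a wide margin.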

Similar Articles

DiffusionGemma: 4x Faster Text Generation

Hacker News Top

Google introduces DiffusionGemma, an experimental 26B MoE open model that achieves up to 4x faster text generation on GPUs using text diffusion, targeting speed-critical interactive local workflows.

DiffusionGemma

Simon Willison's Blog

Google released DiffusionGemma, an open-weight text generation model (26B parameters, 4B active) under Apache 2 license, demonstrating high inference speeds via NVIDIA's NIM cloud API.

google/diffusiongemma-26B-A4B-it

Hugging Face Models Trending

Google DeepMind releases DiffusionGemma, a 26B-parameter Mixture-of-Experts model that uses discrete diffusion for faster text generation, supporting multimodal inputs and a 256K token context.