NVIDIA Accelerates Google DeepMind’s DiffusionGemma for Local AI

NVIDIA Blog Models

Summary

NVIDIA optimizes Google DeepMind's DiffusionGemma, an open model that generates text in parallel 256-token blocks, achieving up to 4x faster performance on local RTX GPUs, DGX Spark, and DGX Station systems.

<div id="bsf_rt_marker"></div><p><span data-contrast="none">Today, Google DeepMind released DiffusionGemma — an experimental open model built for exceptionally fast text generation. NVIDIA has optimized DiffusionGemma to run even faster across NVIDIA GeForce RTX GPUs, the NVIDIA RTX PRO platform and NVIDIA DGX Spark systems, from local PCs to the cloud.</span><span data-ccp-props="{}"> </span></p> <p><span data-contrast="none">Rather than generating text one word at a time, DiffusionGemma generates multiple words in parallel to output whole blocks of text, opening a new, low-latency frontier for the kind of single-user workloads that developers, researchers and AI enthusiasts run every day.</span><span data-ccp-props="{}"> </span></p> <p><span data-contrast="none">Features of the new model include:</span><span data-ccp-props="{}"> </span></p> <ul> <li><b><span data-contrast="none">Parallel generation: </span></b><span data-contrast="none">DiffusionGemma denoises up to 256 tokens per step instead of predicting one at a time.</span><span data-ccp-props="{}"> </span></li> <li><b><span data-contrast="auto">Built on Gemma 4: </span></b><span data-contrast="auto">DiffusionGemma pairs a diffusion head with Google’s Gemma 4 architecture, a 26-billion-parameter mixture-of-experts model that activates just 3.8 billion parameters per step.</span><span data-ccp-props="{}"> </span></li> <li><b><span data-contrast="auto">Up to 4x faster performance:</span></b><span data-contrast="auto"> The speedup lands on local hardware, where single-user generation usually stalls.</span><span data-ccp-props="{}"> </span></li> <li><b><span data-contrast="auto">Open and local:</span></b><span data-contrast="auto"> DiffusionGemma is an open-weights model under a permissive Apache 2.0 license and runs entirely on RTX and DGX Spark — no cloud, no per-token cost — with day-zero support in </span><a target="_blank" 
href="https://huggingface.co/nvidia/diffusiongemma-26B-A4B-it-NVFP4"><span data-contrast="none">Hugging Face Transformers</span></a><span data-contrast="auto">, vLLM and Unsloth.</span><span data-ccp-props="{}"> </span></li> </ul> <h2><strong><span class="TextRun MacChromeBold SCXW229153698 BCX0" lang="EN-US" xml:lang="EN-US" data-contrast="none"><span class="NormalTextRun SCXW229153698 BCX0">A Different Way to Generate Text</span></span><span class="EOP SCXW229153698 BCX0" data-ccp-props="{}"> </span></strong></h2> <p><span data-contrast="none">Almost every large language model (LLM) in wide use today is autoregressive — meaning it generates text one token at a time, with each new word depending on the one before it. That sequential process is what makes interactive AI feel like it’s typing.</span><span data-ccp-props="{}"> </span></p> <p><span data-contrast="none">DiffusionGemma takes a different path. Built on the Gemma 4 26B mixture-of-experts architecture, it generates text the way diffusion models generate images: by starting from noise and refining a whole block of text at once. Each step denoises up to 256 tokens in parallel rather than emitting a single token and waiting to compute the next.</span><span data-ccp-props="{}"> </span></p> <p><span data-contrast="none">The result is a model that thinks in blocks instead of sequentially. 
For latency-sensitive, single-user work — such as interactive chat, agentic loops or on-device assistants that plan and act — that parallelism translates into responses fast enough to keep pace with how developers think and iterate.</span></p> <h2><b><span data-contrast="none">DiffusionGemma Flies on NVIDIA GPUs</span></b><span data-ccp-props="{}"> </span></h2> <p><span class="TextRun SCXW65433496 BCX0" lang="EN-US" xml:lang="EN-US" data-contrast="none"><span class="NormalTextRun SCXW65433496 BCX0">Generating one token at a time is fundamentally a memory-bound problem — a traditional LLM spends most of its time waiting on memory bandwidth, not doing math, which leaves a lot of compute on the table.</span></span><span class="EOP SCXW65433496 BCX0" data-ccp-props="{}"> </span></p> <p><span data-contrast="none">Diffusion flips the equation. Pulling a full 256-token block through the transformer in parallel is a compute-bound workload — exactly what NVIDIA GPUs are built for. NVIDIA Tensor Cores accelerate the dense parallel math, and the CUDA software stack lets the model run efficiently from day one without bespoke tuning. In short, the model’s design plays directly to the GPU’s strengths.</span><span data-ccp-props="{}"> </span></p> <p><span class="TextRun SCXW243680204 BCX0" lang="EN-US" xml:lang="EN-US" data-contrast="none"><span class="NormalTextRun CommentStart SCXW243680204 BCX0">That shows up in the numbers. 
</span><span class="NormalTextRun SCXW243680204 BCX0">DiffusionGemma delivers 1,000 tokens/sec on a single NVIDIA H100 Tensor Core GPU, 150 tokens/sec on NVIDIA DGX Spark and the fastest local inference on NVIDIA DGX Station — roughly 4x faster than an equivalent autoregressive model running in the same single-user regime.</span></span><span class="EOP SCXW243680204 BCX0" data-ccp-props="{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;201341983&quot;:0,&quot;335551550&quot;:1,&quot;335551620&quot;:1,&quot;335559685&quot;:0,&quot;335559737&quot;:0,&quot;335559738&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:279}"> </span></p> <p><span data-contrast="none">That advantage holds across NVIDIA’s full lineup, running:</span><span data-ccp-props="{}"> </span></p> <ul> <li><b><span data-contrast="none">Locally on the NVIDIA DGX Spark deskside personal AI supercomputer </span></b><span 
data-contrast="none">— powered by the NVIDIA GB10 Grace Blackwell Superchip with 128GB of unified memory — with the preinstalled NVIDIA AI software stack ready for prototyping, fine-tuning and fully local agent workflows.</span><span data-ccp-props="{}"> </span></li> <li><b><span data-contrast="none">On NVIDIA RTX PRO 6000 workstations, </span></b>providing<span data-contrast="none"> developers, researchers and AI professionals with the headroom to run local low-latency generation and agentic loops as part of a professional workflow.</span><span data-ccp-props="{}"> </span></li> <li><b><span data-contrast="none">On DGX Station, </span></b>delivering best-in-class,<span data-contrast="none"> high-speed inference at up to 800 tokens/sec for low-latency text generation and agentic loops with 748GB of coherent memory.</span><span data-ccp-props="{}"> </span></li> <li><b><span data-contrast="none">On GeForce RTX GPUs, </span></b>with<span data-contrast="none"> llama.cpp support coming soon.</span><span data-ccp-props="{}"> </span></li> </ul> <h2><span class="TextRun MacChromeBold SCXW101704684 BCX0" lang="EN-US" xml:lang="EN-US" data-contrast="none"><span class="NormalTextRun CommentStart CommentHighlightPipeClicked CommentHighlightClicked CommentImportant SCXW101704684 BCX0">Get Started Locally</span></span></h2> <p><span data-contrast="none">The fastest way to start testing and prototyping the model is through Hugging Face Transformers, which runs DiffusionGemma on a GeForce RTX 5090 or DGX Spark out of the box. For higher-throughput inference, vLLM provides day-zero serving support. </span><span data-ccp-props="{}"> </span></p> <p><span data-contrast="none">For adapting the model to a specific task or domain, fine-tuning is available through Unsloth and NVIDIA NeMo framework, with ready-made DGX Spark playbooks to get a local environment running quickly. 
Check out the vLLM playbooks for </span><a target="_blank" href="https://build.nvidia.com/spark/vllm"><span data-contrast="none">DGX Spark</span></a><span data-contrast="none">, </span><a target="_blank" href="https://build.nvidia.com/rtx/vllm"><span data-contrast="none">RTX PRO</span></a><span data-contrast="none"> and </span><a target="_blank" href="https://build.nvidia.com/station/vllm"><span data-contrast="none">DGX Station</span></a><span data-contrast="none">.</span><span data-ccp-props="{}"> </span></p> <p><span data-contrast="none">Try DiffusionGemma on Hugging Face or test it for free using NVIDIA-hosted application programming interfaces at </span><a target="_blank" href="https://build.nvidia.com/"><span data-contrast="none">build.nvidia.com</span></a><span data-contrast="none">.</span><span data-ccp-props="{}"> </span></p> <p><span data-contrast="none">Go deeper on the architecture and local deployment by reading the </span><a target="_blank" href="https://developer.nvidia.com/blog/?p=118305"><span data-contrast="none">NVIDIA technical blog</span></a><span data-contrast="none"> and the <a class="Hyperlink TrackedChange TrackChangeHyperlinkInstruction SCXW235303664 BCX0" href="https://blog.google/innovation-and-ai/technology/developers-tools/diffusion-gemma-faster-text-generation/" target="_blank" rel="noreferrer noopener"><span class="TrackChangeTextInsertion TrackedChange SCXW235303664 BCX0"><span class="TextRun Highlight Underlined SCXW235303664 BCX0" lang="EN-US" xml:lang="EN-US" data-contrast="none"><span class="NormalTextRun SCXW235303664 BCX0" data-ccp-charstyle="Hyperlink">Google DeepMind announcement</span></span></span></a>.</span></p> <h2><b><span data-contrast="none">#ICYMI: The Latest From RTX AI Garage</span></b><span data-ccp-props="{}"> </span></h2> <p><span data-contrast="auto"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f3ac.png" alt="🎬" class="wp-smiley" style="height: 1em; max-height: 1em;" /></span><span 
data-contrast="auto"> </span><b><span data-contrast="auto">NVIDIA researchers released SANA-WM</span></b><span data-contrast="auto">, an open source world model that turns a single image and a camera path into a minute-long, 720p video with precise 6-DoF control. At just 2.6 billion parameters, its distilled version generates a full 60-second clip in 34 seconds on a single NVIDIA GeForce RTX 5090 GPU using the NVFP4 format — delivering up to 36x higher throughput than comparable open models while running on one GPU. Read </span><a target="_blank" href="https://arxiv.org/pdf/2605.15178"><span data-contrast="auto">the paper.</span></a><span data-ccp-props="{}"> </span></p> <p><span data-contrast="auto"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f6e0.png" alt="🛠" class="wp-smiley" style="height: 1em; max-height: 1em;" /> </span><b><span data-contrast="auto">Building Windows agents just got a full toolset </span></b><span data-contrast="auto">— </span><a target="_blank" href="https://developer.nvidia.com/blog/build-personal-ai-agents-on-windows-pcs-with-new-tools-from-microsoft-and-nvidia/"><span data-contrast="none">NVIDIA and Microsoft</span></a><span data-contrast="auto"> rolled out turnkey agent sandboxing on native Windows — Microsoft eXecution Containers plus the NVIDIA OpenShell runtime — alongside up to 2x faster agentic inference and native Windows support for Hermes Agent.</span><span data-ccp-props="{}"> </span></p> <p><b><span data-contrast="auto"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f916.png" alt="🤖" class="wp-smiley" style="height: 1em; max-height: 1em;" /></span></b><b><span data-contrast="auto">DGX Spark goes from unboxing to a running agent in minutes</span></b><span data-contrast="auto"> — A streamlined NVIDIA NemoClaw install gets developers to a working local agent fast, with Qwen3.6-35B running up to 2.6x faster on vLLM. 
And the new cluster assistant in NVIDIA Sync links up to four DGX Spark units into one 512GB pool — enough for ~400-billion-parameter models.</span><span data-ccp-props="{}"> </span></p> <p><i><span data-contrast="none">Plug in to RTX Spark on </span></i><a target="_blank" href="https://www.facebook.com/NVIDIARTXSpark/"><i><span data-contrast="none">Facebook</span></i></a><i><span data-contrast="none">, </span></i><a target="_blank" href="https://www.instagram.com/nvidiartxspark"><i><span data-contrast="none">Instagram</span></i></a><i><span data-contrast="none">, </span></i><a target="_blank" href="https://www.tiktok.com/@nvidiartxspark"><i><span data-contrast="none">TikTok</span></i></a><i><span data-contrast="none"> and </span></i><a target="_blank" href="https://x.com/NVIDIARTXSpark"><i><span data-contrast="none">X</span></i></a><i><span data-contrast="none"> — and stay informed by subscribing to the </span></i><a target="_blank" href="https://www.nvidia.com/en-us/ai-on-rtx/?modal=subscribe-ai"><i><span data-contrast="none">RTX Spark newsletter</span></i></a><i><span data-contrast="none">.</span></i><span data-ccp-props="{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;335551550&quot;:0,&quot;335551620&quot;:0,&quot;335559738&quot;:0,&quot;335559739&quot;:300}"> </span></p> <p><i><span data-contrast="none">See </span></i><a target="_blank" href="https://www.nvidia.com/en-eu/about-nvidia/terms-of-service/"><i><span data-contrast="none">notice</span></i></a><i><span data-contrast="none"> regarding software product information.</span></i></p>
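
The memory-bound versus compute-bound argument in "DiffusionGemma Flies on NVIDIA GPUs" can be made concrete with a rough arithmetic-intensity estimate (FLOPs performed per byte of weights read). The sketch below is a back-of-the-envelope model, not a benchmark. It takes the 3.8-billion active and 26-billion total parameter counts and the 256-token block size from the article, and it assumes roughly 0.5 bytes per weight for a 4-bit format such as NVFP4, about 2 FLOPs per active parameter per token, and no cost for attention, KV cache, activations or repeated denoising steps. The function name and constants are illustrative only.

```python
def arithmetic_intensity(active_params, weights_read, tokens_per_step, bytes_per_param):
    """Approximate FLOPs per byte of weight traffic for one forward step."""
    # Assumption: about 2 FLOPs (one multiply, one add) per active parameter per token.
    flops = 2.0 * active_params * tokens_per_step
    # Assumption: weight reads dominate memory traffic for a single user.
    bytes_moved = weights_read * bytes_per_param
    return flops / bytes_moved


ACTIVE_PARAMS = 3.8e9    # active parameters per step (from the article)
TOTAL_PARAMS = 26e9      # total mixture-of-experts parameters (from the article)
BLOCK_TOKENS = 256       # tokens denoised per step (from the article)
BYTES_PER_PARAM = 0.5    # assumption: 4-bit weights, scale factors ignored

# Autoregressive decoding: one token per step, only the active experts are read.
autoregressive = arithmetic_intensity(ACTIVE_PARAMS, ACTIVE_PARAMS, 1, BYTES_PER_PARAM)

# Block diffusion, optimistic bound: every token in the block hits the same experts.
block_best = arithmetic_intensity(ACTIVE_PARAMS, ACTIVE_PARAMS, BLOCK_TOKENS, BYTES_PER_PARAM)

# Block diffusion, pessimistic bound: the block touches every expert in the model.
block_worst = arithmetic_intensity(ACTIVE_PARAMS, TOTAL_PARAMS, BLOCK_TOKENS, BYTES_PER_PARAM)

print(f"autoregressive:       {autoregressive:8.1f} FLOPs/byte")
print(f"256-token block best: {block_best:8.1f} FLOPs/byte")
print(f"256-token block worst:{block_worst:8.1f} FLOPs/byte")
```

Under these assumptions a single-token step performs about 4 FLOPs for every byte of weights it reads, while a 256-token step lands somewhere between roughly 150 and 1,024 FLOPs per byte depending on how many experts the block touches. That shift is why the block workload leans on Tensor Core throughput rather than memory bandwidth. Measured speedups depend on details this estimate leaves out, including how many denoising steps each block needs.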
Cached at: 06/10/26, 06:12 PM

Source: [https://blogs.nvidia.com/blog/rtx-ai-garage-local-gemma-diffusion/](https://blogs.nvidia.com/blog/rtx-ai-garage-local-gemma-diffusion/)
