[Talk] Text Diffusion — Google DeepMind's Brendan O’Donoghue

Reddit r/LocalLLaMA Papers

Summary

DeepMind researcher Brendan O'Donoghue provides an in-depth introduction to text diffusion models, which generate text through iterative denoising. Compared to autoregressive models, they offer lower latency but limited throughput, and demonstrate unique advantages such as self-correction and dynamic computation.

This video was released just a week ago, right before the release of DiffusionGemma, and it's even more relevant now. It answers many of the questions and much of the confusion I've seen in this subreddit about the release, so I highly recommend watching it if you're interested.

Cached at: 06/12/26, 02:57 AM

TL;DR: DeepMind researcher Brendan O'Donoghue introduces text diffusion models, which generate text through iterative denoising. They offer lower latency than autoregressive models but lower throughput, and have distinctive advantages such as self-correction and adaptive computation.

## Text Diffusion Basics

The idea behind text diffusion is similar to image and video diffusion. During training, noise is gradually added to a clean token sequence (e.g., by randomly replacing tokens) and a neural network learns to denoise it. During inference, the model starts from pure noise (a sequence of random tokens) and iteratively refines it until it arrives at a clean output. Unlike autoregressive models, which generate one token at a time, diffusion models process the entire sequence at once over multiple forward passes, so each position can attend bidirectionally, including to later tokens.

A year ago, DeepMind released a research preview called Gemini Diffusion, made available to about 100,000 users. The model was a variant of Gemini that used text diffusion instead of autoregressive generation. At the time, its quality was close to that of Gemini 2.0 Flash-Lite (slightly better on code), with much lower latency. That was a year ago, and there have been updates since (e.g., DiffusionGemma).

## Autoregressive vs. Diffusion: Pros and Cons

### Advantages

- **Lower latency (faster inference)**: A single request can be served at a higher tokens-per-second rate.
- **Bidirectional attention**: The model can look at future tokens, enabling self-correction. For example, it can reason first, then backtrack and fix earlier text after discovering an error.
- **Adaptive computation**: After training, the model can automatically decide how many denoising steps it needs based on problem difficulty: few steps for simple problems, more for complex ones.
- **In-place editing**: The model can be directed to rewrite only specified tokens in the sequence, leaving the rest untouched.
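The refinement loop behind the advantages above can be sketched with a toy example. This is a minimal illustration under stated assumptions, not the method used in Gemini Diffusion or DiffusionGemma: `toy_denoiser`, `refine`, `TARGET`, and the "repair a few positions per pass" rule are all hypothetical stand-ins for a trained network. The loop starts from random tokens, proposes a new value for every editable position on each pass, stops as soon as a pass changes nothing (a crude stand-in for learned adaptive stopping), and never touches positions outside `editable` (in-place editing).

```python
import random

VOCAB = list("abcdefghijklmnopqrstuvwxyz ")
TARGET = "text diffusion"  # hypothetical "clean" sequence the toy denoiser knows


def toy_denoiser(tokens, editable, fixes_per_pass=4):
    """Hypothetical stand-in for a trained network.

    Looks at the whole sequence at once and repairs up to `fixes_per_pass`
    wrong, editable positions. A real model would predict tokens from
    bidirectional context rather than from a known target.
    """
    proposal = list(tokens)
    budget = fixes_per_pass
    for i, tok in enumerate(tokens):
        if budget == 0:
            break
        if editable[i] and tok != TARGET[i]:
            proposal[i] = TARGET[i]
            budget -= 1
    return proposal


def refine(tokens, editable=None, max_steps=64):
    """Refine `tokens` until a pass changes nothing.

    Returns the final text and the number of passes that changed something.
    """
    tokens = list(tokens)
    if editable is None:
        editable = [True] * len(tokens)  # full generation: every position is free
    for step in range(max_steps):
        proposal = toy_denoiser(tokens, editable)
        if proposal == tokens:  # nothing left to fix: stop early
            return "".join(tokens), step
        tokens = proposal
    return "".join(tokens), max_steps


if __name__ == "__main__":
    rng = random.Random(0)
    noise = [rng.choice(VOCAB) for _ in TARGET]  # start from random tokens
    print(refine(noise))

    # In-place edit: only positions 11 and 12 may change.
    mask = [i in (11, 12) for i in range(len(TARGET))]
    print(refine("Text diffusoin", mask))
```

In this toy, a sequence with more wrong tokens needs more passes, which loosely mirrors the adaptive-computation idea. In the in-place edit, the capital `T` survives because its position is not marked editable.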
### Disadvantages

- **Lower throughput for large batch requests**: Autoregressive models can batch many queries together for parallel processing, fully utilizing GPU/TPU compute to serve many users at low cost. Diffusion models run multiple forward passes over the same sequence, so they become compute-bound at much smaller batch sizes. Latency per user is low, but overall throughput is typically lower, leading to higher serving costs. This is the main reason text diffusion hasn't been widely deployed at scale.

## Why Diffusion Models Have Lower Latency (Hardware Explanation)

The operational bottleneck on modern GPUs/TPUs is **memory bandwidth**, not compute (FLOPs). Running an autoregressive model requires transferring the entire model weights, KV cache, etc., from HBM to the tensor cores for every generated token. At batch size 1, generating N tokens requires N full weight transfers.

With a diffusion model, suppose you want to generate 256 tokens. Using 24 denoising iterations instead of 256 autoregressive steps cuts the number of weight transfers to 24/256 ≈ 0.09, or about 1/10. If the model is truly bandwidth-bound, latency can drop by about 10x.

The Gemini Diffusion preview achieved a steady output rate of about 2000 tokens/s (including preprocessing time), depending on query length.

## Bidirectional Inference and Self-Correction Example

Prompt: compute `(square root of 81 * (2/3) squared) + ...` (problem abbreviated, correct answer 39).

- After one forward pass: output = 60, incorrect.
- After two forward passes: output becomes 49, and reasoning steps appear (e.g., "calculate square root of 81", etc.).
- After three forward passes: reasoning completes, and the output is corrected to 39. The model backtracked and fixed its initial answer.

For comparison, the contemporaneous ChatGPT 4o (output 40) and Gemini 2.5 Flash (output 42, and insisted on its error) both failed this question. Those models are much larger than Gemini Diffusion, but autoregressive causal attention makes self-correction difficult.
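The difference between causal and bidirectional attention mentioned above can be made concrete with attention masks. The sketch below uses plain Python lists rather than any specific framework; `causal_mask`, `bidirectional_mask`, and `sees_future` are illustrative names, not part of any library or of the models discussed in the talk. Entry `[i][j]` is `True` when position `i` may attend to position `j`.

```python
def causal_mask(n):
    """Autoregressive-style mask: position i sees only positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Diffusion-style mask: every position sees every position."""
    return [[True] * n for _ in range(n)]


def sees_future(mask, i):
    """True if position i can attend to at least one later position."""
    return any(mask[i][j] for j in range(i + 1, len(mask)))


if __name__ == "__main__":
    n = 4
    # Position 0 stands for an early "answer" token, later positions for reasoning.
    print(sees_future(causal_mask(n), 0))         # False
    print(sees_future(bidirectional_mask(n), 0))  # True
```

Under the causal mask, an early answer token cannot be conditioned on reasoning written after it, and in standard autoregressive decoding it is not revisited once emitted. With the full mask, each denoising pass can recompute that token with the later reasoning in view.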
Text diffusion models naturally have bidirectional information flow and can correct errors over multiple iterations.

## Adaptive Computation: Variable Steps

During training, the model learns to decide on its own when to stop denoising. For simple replies (e.g., "write the first 100 digits of π"), 4 steps suffice; a medium-difficulty task (writing a FizzBuzz code snippet) takes about 18 steps; more complex questions (explaining quantum mechanics in a paragraph) take 31 steps. On a set of older benchmarks, difficult tasks (e.g., GPQA Diamond) required significantly more iterations than simple ones (e.g., MBPP basic Python problems). This allows the model to allocate computation according to problem complexity.

Source: YouTube video (https://www.youtube.com/watch?v=r305-aQTaU0)
