Tag
DeepMind researcher Brendan O'Donoghue gives an in-depth introduction to text diffusion models, which generate text by iterative denoising. Compared with autoregressive models, they offer lower latency but more limited throughput, and they have distinctive advantages such as self-correction and dynamic computation.
This paper proposes BiCache, a novel KV caching technique for shared prefixes in diffusion language models. It avoids accuracy collapse by dynamically reusing cached keys and values in shallow layers, and it achieves a 36.3%–98.3% throughput improvement.