Tag
This paper introduces DiffRetriever, a method that uses diffusion language models to generate multiple representative tokens in parallel for efficient information retrieval, outperforming autoregressive baselines in speed and accuracy.
DFlash is a new speculative decoding framework that uses a lightweight block diffusion model for parallel token drafting, achieving over 6x acceleration compared to autoregressive methods. It significantly outperforms existing state-of-the-art methods like EAGLE-3 while maintaining high output quality.