@simplifyinAI: Researchers just made LLMs 8.5x faster with zero accuracy loss. It's called DFlash. It replaces the slow autoregressive…
Summary
Researchers introduced DFlash, a method that replaces autoregressive drafters with block diffusion models to achieve 8.5x faster LLM inference with zero accuracy loss.
Similar Articles
@_avichawla: Researchers found a way to make LLMs 8.5x faster! (without compromising accuracy) Speculative decoding is quite an effe…
Researchers introduced DFlash, a technique using block diffusion models for speculative decoding that accelerates LLM inference by up to 8.5x without accuracy loss. It is already integrated with major frameworks like vLLM and SGLang.
@DivyanshT91162: Autoregressive LLMs might already be getting replaced Someone built dLLM — an open-source library that can turn ANY aut…
dLLM is an open-source library that converts any autoregressive LLM into a diffusion LLM, enabling parallel decoding and faster text generation.
@zhijianliu_: DFlash is now running in a production inference stack. More draft models coming soon. https://github.com/z-lab/dflash
DFlash is a lightweight block diffusion model for speculative decoding, now running in production with support for various LLMs like Qwen and Gemma.
DFlash: Block Diffusion for Flash Speculative Decoding
DFlash is a new speculative decoding framework that uses a lightweight block diffusion model for parallel token drafting, achieving over 6x acceleration compared to autoregressive methods. It significantly outperforms existing state-of-the-art methods like EAGLE-3 while maintaining high output quality.
z-lab/dflash
DFlash introduces a block diffusion method for flash speculative decoding to enhance inference speed in large language models.