@simplifyinAI: Researchers just made LLMs 8.5x faster with zero accuracy loss. It's called DFlash. It replaces the slow autoregressive…


Summary

Researchers introduced DFlash, a speculative decoding method that replaces the autoregressive drafter with a block diffusion model, for a reported 8.5x faster LLM inference with zero accuracy loss.

Researchers just made LLMs 8.5x faster with zero accuracy loss.

It's called DFlash. It replaces the slow autoregressive drafter in speculative decoding with a block diffusion model that guesses all tokens in one parallel shot.

→ Vanilla: 48.5 tokens/sec
→ DFlash: 415 tokens/sec
→ Same model. Same quality.

Already integrated with vLLM, SGLang, and Transformers. Draft models on HuggingFace for Qwen3, Llama 3.1, Kimi-K2.5, gpt-oss and more.

100% Open Source.
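To make the "same model, same quality" claim concrete, here is a minimal toy sketch of greedy speculative decoding with a block drafter. It is not DFlash's code: `target_next`, `block_draft`, and the injected drafter errors are hypothetical stand-ins, and a real block diffusion drafter would produce its block in one parallel pass rather than in a Python loop. The sketch only shows why verification by the target model keeps the output identical to plain greedy decoding while needing far fewer target passes.

```python
from typing import List, Tuple

Token = int
VOCAB = 50  # toy vocabulary size (assumption for illustration)


def target_next(prefix: List[Token]) -> Token:
    """Hypothetical stand-in for the large target model (greedy next token)."""
    return (sum(prefix) * 31 + len(prefix)) % VOCAB


def block_draft(prefix: List[Token], block_size: int) -> List[Token]:
    """Hypothetical stand-in for a block drafter proposing block_size tokens.

    It mostly agrees with the target but is deliberately wrong at some
    positions, so that the rejection path is exercised.
    """
    ctx = list(prefix)
    draft = []
    for _ in range(block_size):
        tok = target_next(ctx)
        if len(ctx) % 5 == 4:          # injected drafter error
            tok = (tok + 1) % VOCAB
        draft.append(tok)
        ctx.append(tok)
    return draft


def greedy_decode(prompt: List[Token], num_tokens: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    seq = list(prompt)
    for _ in range(num_tokens):
        seq.append(target_next(seq))
    return seq


def speculative_decode(
    prompt: List[Token], num_tokens: int, block_size: int
) -> Tuple[List[Token], int]:
    """Draft a block, then keep the longest prefix the target agrees with."""
    seq = list(prompt)
    target_passes = 0
    while len(seq) - len(prompt) < num_tokens:
        draft = block_draft(seq, block_size)
        # In a real system the target scores every draft position in ONE
        # batched forward pass; we count it as a single pass here.
        target_passes += 1
        accepted: List[Token] = []
        for tok in draft:
            expected = target_next(seq + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # target's correction, then stop
                break
        else:
            # Whole block accepted: the target pass also yields one extra token.
            accepted.append(target_next(seq + accepted))
        seq.extend(accepted)
    return seq[: len(prompt) + num_tokens], target_passes


if __name__ == "__main__":
    prompt = [1, 2, 3]
    baseline = greedy_decode(prompt, 40)
    fast, passes = speculative_decode(prompt, 40, block_size=8)
    print("identical output:", fast == baseline)
    print("target passes:", passes, "vs", 40, "for plain greedy decoding")
```

Every emitted token is either a draft token the target agrees with or the target's own correction, so the output cannot differ from greedy decoding. The speedup depends entirely on how many draft tokens are accepted per pass and on how cheap the drafter is.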

Similar Articles

DFlash: Block Diffusion for Flash Speculative Decoding

Papers with Code Trending

DFlash is a new speculative decoding framework that uses a lightweight block diffusion model for parallel token drafting, achieving over 6x acceleration compared to autoregressive methods. It significantly outperforms existing state-of-the-art methods like EAGLE-3 while maintaining high output quality.

z-lab/dflash

GitHub Trending (daily)

DFlash introduces a block diffusion method for flash speculative decoding to enhance inference speed in large language models.