DFlash: Block Diffusion for Flash Speculative Decoding
Summary
DFlash is a speculative decoding framework that uses a lightweight block diffusion model to draft tokens in parallel, achieving over 6x lossless acceleration compared with standard autoregressive decoding. It delivers up to 2.5x higher speedup than the state-of-the-art speculative decoding method EAGLE-3 while maintaining output quality.
Cached at: 05/08/26, 08:32 AM
Source: https://huggingface.co/papers/2602.06036
Abstract
DFlash is a speculative decoding framework that uses a lightweight block diffusion model for parallel token drafting, achieving significant speedup over existing autoregressive methods while maintaining high-quality outputs.
Autoregressive large language models (LLMs) deliver strong performance but require inherently sequential decoding, leading to high inference latency and poor GPU utilization. Speculative decoding mitigates this bottleneck by using a fast draft model whose outputs are verified in parallel by the target LLM; however, existing methods still rely on autoregressive drafting, which remains sequential and limits practical speedups. Diffusion LLMs offer a promising alternative by enabling parallel generation, but current diffusion models typically underperform compared with autoregressive models. In this paper, we introduce DFlash, a speculative decoding framework that employs a lightweight block diffusion model for parallel drafting. By generating draft tokens in a single forward pass and conditioning the draft model on context features extracted from the target model, DFlash enables efficient drafting with high-quality outputs and higher acceptance rates. Experiments show that DFlash achieves over 6x lossless acceleration across a range of models and tasks, delivering up to 2.5x higher speedup than the state-of-the-art speculative decoding method EAGLE-3.
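The draft-then-verify loop described in the abstract can be illustrated with a minimal sketch. This is not the DFlash implementation: `toy_target`, `toy_draft`, `verify_block`, and `speculative_decode` are hypothetical names, the "models" are plain Python functions, and verification uses greedy token matching rather than the sampling-based acceptance rule used by many speculative decoding systems. The point is only to show why accepting a drafted block is lossless (the output equals what the target would produce on its own) while needing fewer verification rounds than tokens generated.

```python
from typing import Callable, List, Tuple

Token = int
TargetFn = Callable[[List[Token]], Token]             # greedy next token of the target
DraftFn = Callable[[List[Token], int], List[Token]]   # proposes a whole block at once


def verify_block(context: List[Token], draft: List[Token],
                 target_next: TargetFn) -> List[Token]:
    """Keep the longest draft prefix the target agrees with, plus one target token.

    A real system scores all draft positions in one parallel target forward
    pass; this sequential loop is a stand-in that yields the same result.
    """
    accepted: List[Token] = []
    for tok in draft:
        expected = target_next(context + accepted)
        if tok != expected:
            accepted.append(expected)   # replace the first mismatch and stop
            return accepted
        accepted.append(tok)
    accepted.append(target_next(context + accepted))  # bonus token after full acceptance
    return accepted


def speculative_decode(prompt: List[Token], target_next: TargetFn,
                       draft_block: DraftFn, block_size: int,
                       max_new_tokens: int) -> Tuple[List[Token], int]:
    """Return the generated sequence and the number of draft/verify rounds."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < max_new_tokens:
        draft = draft_block(out, block_size)
        out.extend(verify_block(out, draft, target_next))
        rounds += 1
    return out[: len(prompt) + max_new_tokens], rounds


def autoregressive_decode(prompt: List[Token], target_next: TargetFn,
                          max_new_tokens: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(target_next(out))
    return out


def toy_target(ctx: List[Token]) -> Token:
    """Hypothetical deterministic 'target model' over an 11-token vocabulary."""
    return (ctx[-1] * 3 + len(ctx)) % 11


def toy_draft(ctx: List[Token], k: int) -> List[Token]:
    """Hypothetical imperfect drafter: wrong whenever the prefix length is a multiple of 4."""
    block: List[Token] = []
    cur = list(ctx)
    for _ in range(k):
        tok = toy_target(cur)
        if len(cur) % 4 == 0:
            tok = (tok + 1) % 11        # inject a drafting error
        block.append(tok)
        cur.append(tok)
    return block


if __name__ == "__main__":
    prompt = [1, 2]
    spec, rounds = speculative_decode(prompt, toy_target, toy_draft,
                                      block_size=4, max_new_tokens=12)
    assert spec == autoregressive_decode(prompt, toy_target, 12)
    print(f"12 tokens generated in {rounds} verification rounds")
```

In this sketch every round emits at least one token that the target itself would have chosen, so drafting errors cost speed but never change the output. The achievable speedup then depends on how many drafted tokens are accepted per round and on how cheap the drafter is, which is where the abstract's single-pass block drafting and higher acceptance rates come in.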
Community
Can I use ParoQuant with DFlash?
Yes, DFlash works well with quantized models, including ParoQuant-quantized ones.
Hey man, love your work! Would you be able to make a DFlash for the 122B one? Is it hard?
Get this paper in your agent:
hf papers read 2602.06036
Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Models citing this paper: 54
#### z-lab/Qwen3.6-27B-DFlash
Text Generation • 2B • Updated 11 days ago • 30.5k • 263
#### z-lab/Qwen3.6-35B-A3B-DFlash
Text Generation • 0.5B • Updated 12 days ago • 58.9k • 214
#### z-lab/Qwen3.5-27B-DFlash
Text Generation • 2B • Updated about 1 month ago • 23.6k • 107
#### spiritbuun/Qwen3.6-27B-DFlash-GGUF
2B • Updated 14 days ago • 27.2k • 56
## Datasets citing this paper: 0
Spaces citing this paper: 3
Collections including this paper: 10
Similar Articles
@zhijianliu_: DFlash is now running in a production inference stack. More draft models coming soon. https://github.com/z-lab/dflash
DFlash is a lightweight block diffusion model for speculative decoding, now running in production with support for various LLMs like Qwen and Gemma.
@charles_irl: dflash go brr
NVIDIA announces DFlash, an open source block diffusion model for speculative decoding that achieves up to 15x higher inference throughput on Blackwell GPUs while maintaining interactivity.
DFlash and Spec V2 Decoding (14 minute read)
Z Lab, SGLang, and Modal release DFlash, a new speculative decoding model for Qwen 3.5 397B-A17B that uses block diffusion and KV injection to achieve over 4x throughput improvement over baseline and 1.5x over native MTP.
z-lab/dflash
DFlash introduces a block diffusion method for flash speculative decoding to enhance inference speed in large language models.
@charles_irl: Speculation Is All You Need. In this blog post, we announce the co-release (w/ Z Lab) of six more state-of-the-art DFla…
Modal and Z Lab release six new DFlash speculative decoding draft models for Qwen 3.x, achieving over 1000 tokens per second on a B200 and arguing that speculative decoding is the most impactful inference optimization.