DFlash: Block Diffusion for Flash Speculative Decoding

Papers with Code Trending Papers

Summary

DFlash is a new speculative decoding framework that uses a lightweight block diffusion model for parallel token drafting, achieving over 6x lossless acceleration over standard autoregressive decoding. It delivers up to 2.5x higher speedup than the state-of-the-art method EAGLE-3 while exactly preserving the target model's outputs.

Autoregressive large language models (LLMs) deliver strong performance but require inherently sequential decoding, leading to high inference latency and poor GPU utilization. Speculative decoding mitigates this bottleneck by using a fast draft model whose outputs are verified in parallel by the target LLM; however, existing methods still rely on autoregressive drafting, which remains sequential and limits practical speedups. Diffusion LLMs offer a promising alternative by enabling parallel generation, but current diffusion models typically underperform compared with autoregressive models. In this paper, we introduce DFlash, a speculative decoding framework that employs a lightweight block diffusion model for parallel drafting. By generating draft tokens in a single forward pass and conditioning the draft model on context features extracted from the target model, DFlash enables efficient drafting with high-quality outputs and higher acceptance rates. Experiments show that DFlash achieves over 6x lossless acceleration across a range of models and tasks, delivering up to 2.5x higher speedup than the state-of-the-art speculative decoding method EAGLE-3.
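To make the mechanism concrete, below is a minimal, runnable sketch of the general draft-and-verify scheme the abstract describes, with a block-diffusion-style drafter reduced to a single denoising step over a fully masked block. Everything here is an illustrative assumption rather than the DFlash implementation: `ToyLM`, `ToyBlockDrafter`, and the greedy acceptance rule are simplified stand-ins, and the paper's conditioning of the drafter on context features extracted from the target model is omitted.

```python
import torch

VOCAB, MASK = 100, 0

class ToyLM:
    """Deterministic toy 'model': logits at position p put all mass on
    (sum of ids[:p+1]) % VOCAB, i.e. its prediction for position p+1.
    A runnable stand-in for a real transformer, nothing more."""
    def __call__(self, ids):
        logits = torch.zeros(len(ids), VOCAB)
        total = 0
        for p, t in enumerate(ids):
            total = (total + t) % VOCAB
            logits[p, total] = 1.0
        return logits

class ToyBlockDrafter:
    """Block-diffusion-style drafter reduced to one denoising step:
    append a fully masked block, run ONE forward pass, and read off a
    guess for every block position at once (parallel drafting)."""
    def __init__(self, lm):
        self.lm = lm
    def fill_block(self, ctx, block_size):
        logits = self.lm(ctx + [MASK] * block_size)
        return [int(logits[len(ctx) + j - 1].argmax()) for j in range(block_size)]

def speculative_round(target, drafter, ctx, block_size=8):
    """Draft a block, then verify it with ONE parallel target pass.
    Accept the longest draft prefix the target agrees with (greedy
    acceptance; real systems use a probabilistic rule), plus one
    correction or bonus token from the target."""
    draft = drafter.fill_block(ctx, block_size)
    logits = target(ctx + draft)              # single verification pass
    out = []
    for j, tok in enumerate(draft):
        pred = int(logits[len(ctx) + j - 1].argmax())
        if pred != tok:
            out.append(pred)                  # target's correction ends the round
            return out
        out.append(tok)
    out.append(int(logits[-1].argmax()))      # whole block accepted: bonus token
    return out

target, drafter = ToyLM(), ToyBlockDrafter(ToyLM())
ctx = [3, 1, 4, 1, 5]
for _ in range(3):
    ctx += speculative_round(target, drafter, ctx)
print(ctx)
```

The property the sketch preserves is the one the speedup comes from: both the drafter and the verifier run a single forward pass per round, so latency scales with the number of rounds rather than with the number of generated tokens.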

Paper page - DFlash: Block Diffusion for Flash Speculative Decoding

Source: https://huggingface.co/papers/2602.06036



Community

Can I use ParoQuant with DFlash?


Yes, DFlash works well with quantized models, including ParoQuant-quantized ones.

Hey man, love your work! Would you be able to make a DFlash for the 122B one? Is it hard?


Get this paper in your agent:

`hf papers read 2602.06036`

Don’t have the latest CLI? `curl -LsSf https://hf.co/cli/install.sh | bash`

Models citing this paper (54)

- z-lab/Qwen3.6-27B-DFlash · Text Generation · 2B · Updated 11 days ago · 30.5k · 263
- z-lab/Qwen3.6-35B-A3B-DFlash · Text Generation · 0.5B · Updated 12 days ago · 58.9k · 214
- z-lab/Qwen3.5-27B-DFlash · Text Generation · 2B · Updated about 1 month ago · 23.6k · 107
- spiritbuun/Qwen3.6-27B-DFlash-GGUF · 2B · Updated 14 days ago · 27.2k · 56

Browse 54 models citing this paper

Datasets citing this paper (0)

No dataset linking this paper

Cite arxiv.org/abs/2602.06036 in a dataset README.md to link it from this page.

Spaces citing this paper (3)

Collections including this paper (10)

Browse 10 collections that include this paper

Similar Articles

z-lab/dflash

GitHub Trending (daily)

DFlash introduces a block diffusion method for flash speculative decoding to enhance inference speed in large language models.

z-lab/Qwen3.6-35B-A3B-DFlash

Hugging Face Models Trending

z-lab releases DFlash, a speculative decoding drafter that uses a lightweight block-diffusion model to draft 15–16 tokens in parallel, yielding up to 2.9× speedup for Qwen3.6-35B-A3B inference.

z-lab/gemma-4-31B-it-DFlash

Hugging Face Models Trending

z-lab released DFlash, a speculative decoding drafter model for Gemma-4-31B-it that uses lightweight block diffusion to draft multiple tokens in parallel, achieving up to 5.8x speedup over the autoregressive baseline.

z-lab/Qwen3.6-27B-DFlash

Hugging Face Models Trending

This article introduces Qwen3.6-27B-DFlash, a specialized drafter model for DFlash, a speculative decoding method that uses block diffusion to accelerate inference. It provides installation instructions for vLLM and SGLang to enable parallel drafting with the target Qwen3.6-27B model; a rough usage sketch follows below.
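For a sense of what serving with such a drafter can look like, here is a hedged sketch against vLLM's generic speculative-decoding interface. The `speculative_config` keys shown and whether stock vLLM accepts a DFlash drafter this way are assumptions, and the target repo id `Qwen/Qwen3.6-27B` is inferred from the article; follow the DFlash repository's installation instructions for the supported setup.

```python
# Hedged sketch: serving a target model with a block-diffusion drafter via
# vLLM's speculative-decoding config. Config keys and model names are
# assumptions; consult the DFlash repo for the supported integration.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3.6-27B",                 # target model (assumed repo id)
    speculative_config={
        "model": "z-lab/Qwen3.6-27B-DFlash",  # drafter listed on this page
        "num_speculative_tokens": 8,          # tokens drafted per round
    },
)

out = llm.generate(
    ["Explain speculative decoding in one paragraph."],
    SamplingParams(temperature=0.0, max_tokens=128),
)
print(out[0].outputs[0].text)
```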