@zhijianliu_: This is what DFlash was built for. Our block-diffusion drafter + KV injection, now running at frontier scale — thanks t…

X AI KOLs Following 06/15/26, 04:32 PM Tools

inference-acceleration block-diffusion kv-cache speculative-decoding llm modal sglang

Summary

DFlash, a block-diffusion drafter with KV injection for speculative decoding, is now running at frontier scale. Integrated into SGLang with support from Modal, and with a drafter trained for Qwen 397B-A17B, it achieves up to 4.3x greater throughput over baseline and 1.5x over native MTP.

Original Article

Cached at: 06/16/26, 03:37 PM

This is what DFlash was built for. ⚡

Our block-diffusion drafter + KV injection, now running at frontier scale — thanks to @modal and @sgl_project for the engine + integration support!

Modal (@modal): We worked with @lmsysorg and https://t.co/Cg0JsVomui to:

- integrate DFlash spec into @sgl_project
- make it faster with overlap
- train a DFlash drafter for @Alibaba_Qwen 397B-A17B

The result: up to 4.3x greater throughput over baseline and 1.5x over native MTP.
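
The two ratios quoted above can be combined into a rough estimate of how native MTP compares to baseline. This is only a back-of-the-envelope sketch: it assumes both "up to" figures were measured in the same configuration, which the post does not state, and the variable names are illustrative.

```python
# Figures quoted in the post (both are "up to" values).
dflash_over_baseline = 4.3
dflash_over_mtp = 1.5

# ASSUMPTION: both ratios come from the same workload and hardware setup.
# If so, native MTP's speedup over baseline is their quotient.
mtp_over_baseline = dflash_over_baseline / dflash_over_mtp

print(f"Implied native MTP speedup over baseline: {mtp_over_baseline:.2f}x")
```

Under that assumption, native MTP would sit at roughly 2.9x over baseline. If the two peaks come from different settings, the quotient is not meaningful.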

Similar Articles

@zhijianliu_: DFlash is now running in a production inference stack. More draft models coming soon. https://github.com/z-lab/dflash

X AI KOLs Following

DFlash is a lightweight block diffusion model for speculative decoding, now running in production with support for various LLMs like Qwen and Gemma.

DFlash and Spec V2 Decoding (14 minute read)

TLDR AI

Z Lab, SGLang, and Modal release DFlash, a new speculative decoding model for Qwen 3.5 397B-A17B that uses block diffusion and KV injection to achieve over 4x throughput improvement over baseline and 1.5x over native MTP.

DFlash: Block Diffusion for Flash Speculative Decoding

Papers with Code Trending

DFlash is a new speculative decoding framework that uses a lightweight block diffusion model for parallel token drafting, achieving over 6x acceleration compared to autoregressive methods. It significantly outperforms existing state-of-the-art methods like EAGLE-3 while maintaining high output quality.
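
For readers new to the idea, the draft-then-verify loop behind speculative decoding can be sketched with toy models. This is a minimal illustration of generic greedy speculative decoding with a drafter that proposes a whole block at once. It is not DFlash's implementation: `toy_target`, `toy_block_drafter`, `speculative_step`, and `generate` are hypothetical names, the drafter is a hand-written stand-in rather than a block diffusion model, and KV injection is not modeled.

```python
from typing import Callable, List, Tuple

Token = int

def toy_target(context: List[Token]) -> Token:
    """Stand-in for the large target model: greedy next token."""
    return (context[-1] + 1) % 10

def toy_block_drafter(context: List[Token], k: int) -> List[Token]:
    """Stand-in for a drafter that emits k tokens in one shot.
    It deliberately gets the third token of each block wrong."""
    block = [(context[-1] + 1 + i) % 10 for i in range(k)]
    if k > 2:
        block[2] = (block[2] + 1) % 10
    return block

def speculative_step(
    context: List[Token],
    target_next: Callable[[List[Token]], Token],
    draft_block: Callable[[List[Token], int], List[Token]],
    k: int,
) -> List[Token]:
    """Draft k tokens, keep the longest prefix the target agrees with,
    then append one token chosen by the target itself."""
    accepted: List[Token] = []
    for tok in draft_block(context, k):
        # A real engine scores all k positions in one batched target pass;
        # the per-position call here is only for readability.
        expected = target_next(context + accepted)
        if tok != expected:
            accepted.append(expected)  # target's correction ends the step
            return accepted
        accepted.append(tok)
    accepted.append(target_next(context + accepted))  # bonus token
    return accepted

def generate(
    prompt: List[Token],
    target_next: Callable[[List[Token]], Token],
    draft_block: Callable[[List[Token], int], List[Token]],
    k: int,
    max_new: int,
) -> Tuple[List[Token], int]:
    """Run speculative steps until max_new tokens have been produced."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(out, target_next, draft_block, k))
        steps += 1
    return out[: len(prompt) + max_new], steps

tokens, steps = generate([0], toy_target, toy_block_drafter, k=4, max_new=8)
print(tokens, steps)
```

With greedy verification the output matches what the target would produce on its own; the gain comes from needing fewer target steps than generated tokens, which depends on how often the drafter's block is accepted.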

z-lab/Qwen3.6-27B-DFlash

Hugging Face Models Trending

This article introduces Qwen3.6-27B-DFlash, a specialized drafter model for DFlash, a novel speculative decoding method that uses block diffusion to accelerate inference. It provides installation instructions for vLLM and SGLang to enable parallel drafting with the target Qwen3.6-27B model.

z-lab/Qwen3.6-35B-A3B-DFlash