@charles_irl: dflash go brr

X AI KOLs Timeline 06/24/26, 12:04 AM Models

open-source speculative-decoding inference nvidia blackwell diffusion-model performance

Summary

NVIDIA reports that DFlash, an open-source block diffusion model for speculative decoding, delivers up to 15x higher inference throughput on Blackwell GPUs while maintaining the same user interactivity.


Original Article


Cached at: 06/24/26, 12:17 AM

dflash go brr

NVIDIA AI (@NVIDIAAI): Increase inference performance by up to 15x without sacrificing responsiveness.

DFlash, an open source lightweight block diffusion model designed for speculative decoding, delivers up to 15x higher throughput on NVIDIA Blackwell while maintaining the same user interactivity.
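For readers unfamiliar with speculative decoding, here is a minimal sketch of the general draft-then-verify loop in its greedy form. It is not DFlash's implementation: the names `speculative_step`, `generate`, `toy_target`, and `toy_draft` are hypothetical, and the toy drafter merely stands in for a model that proposes a whole block of tokens at once, which is the role a block diffusion drafter would play. In a real system the target model checks all drafted positions in one parallel forward pass; the loop below does the same checks sequentially for clarity.

```python
from typing import Callable, List, Tuple

Token = int
TargetFn = Callable[[List[Token]], Token]             # context -> greedy next token
DraftFn = Callable[[List[Token], int], List[Token]]   # context, k -> k drafted tokens


def speculative_step(context: List[Token], target_next: TargetFn,
                     draft_block: DraftFn, block_size: int) -> List[Token]:
    """Draft a block, keep the prefix the target agrees with, then add one target token."""
    proposal = draft_block(context, block_size)
    accepted: List[Token] = []
    for tok in proposal:
        expected = target_next(context + accepted)
        if tok != expected:
            # First mismatch: discard the rest of the draft and use the target's token.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
    # Whole block accepted: the target still contributes one extra token.
    accepted.append(target_next(context + accepted))
    return accepted


def generate(prompt: List[Token], target_next: TargetFn, draft_block: DraftFn,
             block_size: int, max_new_tokens: int) -> Tuple[List[Token], int]:
    """Run speculative steps until max_new_tokens are produced; also return the step count."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(out, target_next, draft_block, block_size))
        steps += 1
    return out[: len(prompt) + max_new_tokens], steps


def toy_target(context: List[Token]) -> Token:
    """Hypothetical stand-in for the large model: a deterministic next-token rule."""
    return (context[-1] * 3 + 1) % 11


def toy_draft(context: List[Token], k: int) -> List[Token]:
    """Hypothetical stand-in for the drafter: imitates the target but is wrong on token 0."""
    ctx = list(context)
    block: List[Token] = []
    for _ in range(k):
        t = toy_target(ctx)
        guess = t if t != 0 else 1  # deliberate error so some drafts get rejected
        block.append(guess)
        ctx.append(guess)
    return block


if __name__ == "__main__":
    tokens, steps = generate([1], toy_target, toy_draft, block_size=4, max_new_tokens=10)
    print(tokens, "in", steps, "verification steps")
```

Because every accepted token is checked against the target's own greedy choice, the output matches what the target would have produced alone; the gain comes from needing fewer target steps than tokens generated whenever the drafter is often right.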

Similar Articles

DFlash: Block Diffusion for Flash Speculative Decoding

Papers with Code Trending

DFlash is a new speculative decoding framework that uses a lightweight block diffusion model for parallel token drafting, achieving over 6x acceleration compared to autoregressive methods. It significantly outperforms existing state-of-the-art methods like EAGLE-3 while maintaining high output quality.

@zhijianliu_: DFlash is now running in a production inference stack. More draft models coming soon. https://github.com/z-lab/dflash

X AI KOLs Following

DFlash is a lightweight block diffusion model for speculative decoding, now running in production with support for various LLMs like Qwen and Gemma.

@charles_irl: Speculation Is All You Need. In this blog post, we announce the co-release (w/ Z Lab) of six more state-of-the-art DFla…

X AI KOLs Following

Modal and Z Lab release six new DFlash speculative decoding draft models for Qwen 3.x, achieving over 1000 tokens per second on a B200 and arguing that speculative decoding is the most impactful inference optimization.

DFlash and Spec V2 Decoding (14 minute read)

TLDR AI

Z Lab, SGLang, and Modal release DFlash, a new speculative decoding model for Qwen 3.5 397B-A17B that uses block diffusion and KV injection to achieve over 4x throughput improvement over baseline and 1.5x over native MTP.

z-lab/dflash

GitHub Trending (daily)

DFlash introduces a block diffusion method for flash speculative decoding to enhance inference speed in large language models.