@modal: We worked with @lmsysorg and http://z-lab.ai to - integrate DFlash spec into @sgl_project - make it faster with overlap…
Summary
Modal collaborated with LMSys and Z Lab to integrate DFlash speculative decoding into SGLang, achieving up to 4.3x throughput improvement over baseline and 1.5x over native multi-token prediction for large language models.
Cached at: 06/15/26, 07:06 PM
We worked with @lmsysorg and https://t.co/Cg0JsVomui to
- integrate DFlash spec into @sgl_project
- make it faster with overlap
- train a DFlash drafter for @Alibaba_Qwen 397B-A17B
The result: up to 4.3x greater throughput over baseline and 1.5x over native MTP. https://t.co/tIypW8vc7E
Z Lab
Source: https://z-lab.ai/

We focus on making AI smaller, faster, and more efficient through full-stack innovations:
- 🧠 Algorithm: Designing efficient model architectures and approximations (e.g., sparsity, compression).
- ⚙️ System: Building hardware-aware system support to accelerate emerging AI workloads.
- 🚀 Application: Working with real-world use cases in generative AI, robotics, and scientific discovery.
We are part of the UCSD ML Systems Group and the UCSD Center for Visual Computing.
News
- Jan 2026: ParoQuant is accepted to ICLR 2026! ParoQuant enables efficient reasoning LLM inference through pairwise rotation quantization.
- Jan 2026: DFlash is released! DFlash uses block diffusion for speculative decoding, enabling efficient and high-quality parallel drafting.
- Jun 2025: SparseVILA is accepted to ICCV 2025! SparseVILA decouples visual token sparsity for efficient vision-language model inference.
- Jun 2025: SparseLoRA is accepted to ICML 2025! SparseLoRA applies contextual sparsity to skip unnecessary computations during fine-tuning, achieving up to 2.2× compute reduction.
Similar Articles
@lmsysorg: New blog: The next generation of speculative decoding: DFlash and Spec V2 DFlash + Spec V2 hit >4.3X baseline throughpu…
New research on DFlash and Spec V2 speculative decoding methods achieves >4.3X baseline throughput for LLM inference, released as the default speculative decoding engine in SGLang.
@_avichawla: Researchers found a way to make LLMs 8.5x faster! (without compromising accuracy) Speculative decoding is quite an effe…
Researchers introduced DFlash, a technique using block diffusion models for speculative decoding that accelerates LLM inference by up to 8.5x without accuracy loss. It is already integrated with major frameworks like vLLM and SGLang.
@charles_irl: Many are belatedly realizing that intelligence must be open. For open intelligence to succeed, developers must work tog…
A collaboration between Modal, SGLang, and Z Lab integrates DFlash speculation into SGLang, achieving up to 4.3x throughput improvement for Alibaba's Qwen 397B-A17B model, advancing open intelligence.
z-lab/dflash
DFlash introduces a block diffusion method for flash speculative decoding to enhance inference speed in large language models.
DFlash and Spec V2 Decoding (14 minute read)
Z Lab, SGLang, and Modal release DFlash, a new speculative decoding model for Qwen 3.5 397B-A17B that uses block diffusion and KV injection to achieve more than 4x the baseline throughput and 1.5x that of native MTP.