@charles_irl: Many are belatedly realizing that intelligence must be open. For open intelligence to succeed, developers must work tog…
Summary
A collaboration between Modal, SGLang, and Z Lab integrates DFlash speculative decoding into SGLang and trains a DFlash drafter for Alibaba's Qwen 397B-A17B model, achieving up to 4.3x greater throughput over baseline and 1.5x over native MTP.
Cached at: 06/16/26, 11:40 AM
Many are belatedly realizing that intelligence must be open.
For open intelligence to succeed, developers must work together across institutional lines.
That’s why I’m particularly excited about this collab across @modal, @sgl_project, and Z Lab:
Modal (@modal): We worked with @lmsysorg and https://t.co/Cg0JsVomui to
- integrate DFlash spec into @sgl_project
- make it faster with overlap
- train a DFlash drafter for @Alibaba_Qwen 397B-A17B
The result: up to 4.3x greater throughput over baseline and 1.5x over native MTP.
Similar Articles
@modal: We worked with @lmsysorg and http://z-lab.ai to - integrate DFlash spec into @sgl_project - make it faster with overlap…
Modal collaborated with LMSys and Z Lab to integrate DFlash speculative decoding into SGLang, achieving up to 4.3x throughput improvement over baseline and 1.5x over native multi-token prediction for large language models.
@charles_irl: Speculation Is All You Need. In this blog post, we announce the co-release (w/ Z Lab) of six more state-of-the-art DFla…
Modal and Z Lab release six new DFlash speculative decoding draft models for Qwen 3.x, achieving over 1000 tokens per second on a B200 and arguing that speculative decoding is the most impactful inference optimization.
@lmsysorg: New blog: The next generation of speculative decoding: DFlash and Spec V2 DFlash + Spec V2 hit >4.3X baseline throughpu…
The DFlash and Spec V2 speculative decoding methods achieve >4.3x baseline throughput for LLM inference and are released as the default speculative decoding engine in SGLang.
@zhijianliu_: This is what DFlash was built for. Our block-diffusion drafter + KV injection, now running at frontier scale — thanks t…
DFlash, a block-diffusion drafter with KV injection, is now running at frontier scale, achieving up to 4.3x greater throughput over baseline, integrated with Modal and SGLang for Qwen 397B.
DFlash and Spec V2 Decoding (14 minute read)
Z Lab, SGLang, and Modal release DFlash, a new speculative decoding model for Qwen 3.5 397B-A17B that uses block diffusion and KV injection to achieve more than 4x the baseline throughput and 1.5x that of native MTP.