@charles_irl: Many are belatedly realizing that intelligence must be open. For open intelligence to succeed, developers must work tog…
Summary
A collaboration between Modal, SGLang, and Z Lab integrates DFlash speculative decoding into SGLang and trains a DFlash drafter for Alibaba's Qwen 397B-A17B model, achieving up to 4.3x greater throughput over baseline and 1.5x over native MTP.
Cached at: 06/16/26, 11:40 AM
Many are belatedly realizing that intelligence must be open.
For open intelligence to succeed, developers must work together across institutional lines.
That’s why I’m particularly excited about this collab across @modal, @sgl_project, and Z Lab:
Modal (@modal): We worked with @lmsysorg and https://t.co/Cg0JsVomui to
- integrate DFlash spec into @sgl_project
- make it faster with overlap
- train a DFlash drafter for @Alibaba_Qwen 397B-A17B
The result: up to 4.3x greater throughput over baseline and 1.5x over native MTP.
Similar Articles
@modal: We worked with @lmsysorg and http://z-lab.ai to - integrate DFlash spec into @sgl_project - make it faster with overlap…
Modal collaborated with LMSys and Z Lab to integrate DFlash speculative decoding into SGLang, achieving up to 4.3x throughput improvement over baseline and 1.5x over native multi-token prediction for large language models.
@lmsysorg: New blog: The next generation of speculative decoding: DFlash and Spec V2 DFlash + Spec V2 hit >4.3X baseline throughpu…
New research on DFlash and Spec V2 speculative decoding methods achieves >4.3X baseline throughput for LLM inference, released as the default speculative decoding engine in SGLang.
@zhijianliu_: This is what DFlash was built for. Our block-diffusion drafter + KV injection, now running at frontier scale — thanks t…
DFlash, a block-diffusion drafter with KV injection, is now running at frontier scale, achieving up to 4.3x greater throughput over baseline, integrated with Modal and SGLang for Qwen 397B.
DFlash and Spec V2 Decoding (14 minute read)
Z Lab, SGLang, and Modal release DFlash, a new speculative decoding model for Qwen 3.5 397B-A17B that uses block diffusion and KV injection to achieve over 4x throughput improvement over baseline and 1.5x over native MTP.
@Ali_TongyiLab: We are pleased to highlight an excellent community model from developer : Qwen3.6-27B-MTP-pi-reasoning-GGUF. Built on o…
Alibaba's Tongyi Lab highlights a community model, Qwen3.6-27B-MTP-pi-reasoning-GGUF, built on Qwen3.6-27B, optimized for automated programming and debugging workflows for local coding agents.