@charles_irl: Many are belatedly realizing that intelligence must be open. For open intelligence to succeed, developers must work tog…

X AI KOLs Following 06/15/26, 04:38 PM Tools

open-intelligence sglang inference-speed throughput collaboration speculative-decoding

Summary

A collaboration between Modal, SGLang, and Z Lab integrates DFlash speculative decoding into SGLang and trains a DFlash drafter for Alibaba's Qwen 397B-A17B model, reaching up to 4.3x greater throughput over baseline and 1.5x over native MTP, a step forward for open intelligence.

Original Article

Cached at: 06/16/26, 11:40 AM

Many are belatedly realizing that intelligence must be open.

For open intelligence to succeed, developers must work together across institutional lines.

That’s why I’m particularly excited about this collab across @modal, @sgl_project, and Z Lab:

Modal (@modal): We worked with @lmsysorg and https://t.co/Cg0JsVomui to

- integrate DFlash spec into @sgl_project
- make it faster with overlap
- train a DFlash drafter for @Alibaba_Qwen 397B-A17B

The result: up to 4.3x greater throughput over baseline and 1.5x over native MTP.

Similar Articles

@modal: We worked with @lmsysorg and http://z-lab.ai to - integrate DFlash spec into @sgl_project - make it faster with overlap…

X AI KOLs Following

Modal collaborated with LMSys and Z Lab to integrate DFlash speculative decoding into SGLang, achieving up to 4.3x throughput improvement over baseline and 1.5x over native multi-token prediction for large language models.

@charles_irl: Speculation Is All You Need. In this blog post, we announce the co-release (w/ Z Lab) of six more state-of-the-art DFla…

X AI KOLs Following

Modal and Z Lab release six new DFlash speculative decoding draft models for Qwen 3.x, achieving over 1000 tokens per second on a B200 and arguing that speculative decoding is the most impactful inference optimization.

@lmsysorg: New blog: The next generation of speculative decoding: DFlash and Spec V2 DFlash + Spec V2 hit >4.3X baseline throughpu…

X AI KOLs Following

New research on DFlash and Spec V2 speculative decoding methods achieves >4.3X baseline throughput for LLM inference, released as the default speculative decoding engine in SGLang.

@zhijianliu_: This is what DFlash was built for. Our block-diffusion drafter + KV injection, now running at frontier scale — thanks t…

X AI KOLs Following

DFlash, a block-diffusion drafter with KV injection, is now running at frontier scale, achieving up to 4.3x greater throughput over baseline, integrated with Modal and SGLang for Qwen 397B.

DFlash and Spec V2 Decoding (14 minute read)

TLDR AI

Z Lab, SGLang, and Modal release DFlash, a new speculative decoding model for Qwen 3.5 397B-A17B that uses block diffusion and KV injection to achieve over 4x throughput improvement over baseline and 1.5x over native MTP.
