@modal: We worked with @lmsysorg and http://z-lab.ai to - integrate DFlash spec into @sgl_project - make it faster with overlap…

X AI KOLs Following 06/15/26, 04:29 PM Models

speculative-decoding inference-optimization llm qwen open-source collaboration efficiency

Summary

Modal collaborated with LMSys and Z Lab to integrate DFlash speculative decoding into SGLang, achieving up to 4.3x throughput improvement over baseline and 1.5x over native multi-token prediction for large language models.

Original Article

Cached at: 06/15/26, 07:06 PM

We worked with @lmsysorg and https://t.co/Cg0JsVomui to:

- integrate DFlash spec into @sgl_project
- make it faster with overlap
- train a DFlash drafter for @Alibaba_Qwen 397B-A17B

The result: up to 4.3x greater throughput over baseline and 1.5x over native MTP. https://t.co/tIypW8vc7E
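
For readers new to the draft-and-verify loop behind speculative decoding, the toy Python sketch below may help. It is not DFlash or SGLang code: every function name is hypothetical, the "drafter" is a hand-written rule rather than a block-diffusion model, and verification is simulated one position at a time, whereas a real system would score the whole drafted block in a single target forward pass. It only illustrates why accepting several drafted tokens per verification step can raise throughput without changing the greedy output.

```python
from typing import Callable, List, Tuple

Token = int


def greedy_decode(target_next: Callable[[List[Token]], Token],
                  prompt: List[Token], max_new_tokens: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target_next(tokens))
    return tokens


def speculative_decode(target_next: Callable[[List[Token]], Token],
                       draft_block: Callable[[List[Token], int], List[Token]],
                       prompt: List[Token], max_new_tokens: int,
                       block_size: int = 4) -> Tuple[List[Token], int]:
    """Greedy speculative decoding; returns (tokens, number of verify steps)."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    steps = 0
    while len(tokens) < goal:
        steps += 1
        proposal = draft_block(tokens, block_size)  # cheap drafter proposes a block
        accepted: List[Token] = []
        for tok in proposal:
            expected = target_next(tokens + accepted)
            if tok == expected:
                accepted.append(tok)       # drafted token matches the target
            else:
                accepted.append(expected)  # replace first mismatch, drop the rest
                break
        else:
            # Whole block accepted: the verify pass also yields one extra token.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:goal], steps


def toy_target(ctx: List[Token]) -> Token:
    """Hypothetical 'large model': counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token], k: int) -> List[Token]:
    """Hypothetical drafter: mostly agrees with the target, but never guesses 7."""
    out, last = [], ctx[-1]
    for _ in range(k):
        nxt = (last + 1) % 10
        if nxt == 7:
            nxt = 0  # deliberate mistake to exercise the rejection path
        out.append(nxt)
        last = nxt
    return out


if __name__ == "__main__":
    spec_tokens, verify_steps = speculative_decode(toy_target, toy_draft, [0], 20)
    print(spec_tokens == greedy_decode(toy_target, [0], 20), verify_steps)
```

In this sketch the speculative output always equals the plain greedy output, because every emitted token is either confirmed or supplied by the target; the gain comes only from needing fewer verification steps than generated tokens.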

Z Lab

Source: https://z-lab.ai/

We focus on making AI smaller, faster, and more efficient through full-stack innovations:

🧠 Algorithm: Designing efficient model architectures and approximations (e.g., sparsity, compression).
⚙️ System: Building hardware-aware system support to accelerate emerging AI workloads.
🚀 Application: Working with real-world use cases in generative AI, robotics, and scientific discovery.

We are part of the UCSD ML Systems Group and the UCSD Center for Visual Computing.

News

Jan 2026: ParoQuant is accepted to ICLR 2026! ParoQuant enables efficient reasoning LLM inference through pairwise rotation quantization.
Jan 2026: DFlash is released! DFlash uses block diffusion for speculative decoding, enabling efficient and high-quality parallel drafting.
Jun 2025: SparseVILA is accepted to ICCV 2025! SparseVILA decouples visual token sparsity for efficient vision-language model inference.
Jun 2025: SparseLoRA is accepted to ICML 2025! SparseLoRA applies contextual sparsity to skip unnecessary computations during fine-tuning, achieving up to 2.2× compute reduction.

Similar Articles

@lmsysorg: New blog: The next generation of speculative decoding: DFlash and Spec V2 DFlash + Spec V2 hit >4.3X baseline throughpu…

@_avichawla: Researchers found a way to make LLMs 8.5x faster! (without compromising accuracy) Speculative decoding is quite an effe…

@charles_irl: Many are belatedly realizing that intelligence must be open. For open intelligence to succeed, developers must work tog…

z-lab/dflash

DFlash and Spec V2 Decoding (14 minute read)
