@modal: We worked with @lmsysorg and http://z-lab.ai to - integrate DFlash spec into @sgl_project - make it faster with overlap…

Summary

Modal collaborated with LMSys and Z Lab to integrate DFlash speculative decoding into SGLang, achieving up to 4.3x throughput improvement over baseline and 1.5x over native multi-token prediction for large language models.


Cached at: 06/15/26, 07:06 PM

We worked with @lmsysorg and https://t.co/Cg0JsVomui to

  • integrate DFlash spec into @sgl_project
  • make it faster with overlap
  • train a DFlash drafter for @Alibaba_Qwen 397B-A17B

The result: up to 4.3x greater throughput over baseline and 1.5x over native MTP. https://t.co/tIypW8vc7E


Z Lab

Source: https://z-lab.ai/

We focus on making AI smaller, faster, and more efficient through full-stack innovations:

  • 🧠 Algorithm: Designing efficient model architectures and approximations (e.g., sparsity, compression).
  • ⚙️ System: Building hardware-aware system support to accelerate emerging AI workloads.
  • 🚀 Application: Working with real-world use cases in generative AI, robotics, and scientific discovery.

We are part of the UCSD ML Systems Group and the UCSD Center for Visual Computing.

News

  • Jan 2026: ParoQuant is accepted to ICLR 2026! ParoQuant enables efficient reasoning LLM inference through pairwise rotation quantization.
  • Jan 2026: DFlash is released! DFlash uses block diffusion for speculative decoding, enabling efficient and high-quality parallel drafting.
  • Jun 2025: SparseVILA is accepted to ICCV 2025! SparseVILA decouples visual token sparsity for efficient vision-language model inference.
  • Jun 2025: SparseLoRA is accepted to ICML 2025! SparseLoRA applies contextual sparsity to skip unnecessary computations during fine-tuning, achieving up to 2.2× compute reduction.

Similar Articles

z-lab/dflash

GitHub Trending (daily)

DFlash introduces a block diffusion method for flash speculative decoding to enhance inference speed in large language models.

DFlash and Spec V2 Decoding (14 minute read)

TLDR AI

Z Lab, SGLang, and Modal release DFlash, a new speculative decoding model for Qwen 3.5 397B-A17B that uses block diffusion and KV injection to achieve over 4x throughput improvement over baseline and 1.5x over native MTP.
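
For readers new to the term, the general draft-and-verify idea behind speculative decoding can be sketched in a few lines. This is a toy illustration only: it is not DFlash's block-diffusion drafter and not the SGLang API. The names `speculative_step`, `generate`, `toy_target`, and `toy_draft` are hypothetical, the "models" are plain functions over integer tokens, and acceptance is greedy (exact match). In a real system the target model checks the whole drafted block in one batched forward pass; here that pass is simulated with sequential calls.

```python
from typing import Callable, List, Tuple

Token = int
TargetFn = Callable[[List[Token]], Token]            # context -> next token (greedy)
DraftFn = Callable[[List[Token], int], List[Token]]  # context, k -> k proposed tokens


def speculative_step(target: TargetFn, draft: DraftFn,
                     context: List[Token], k: int) -> List[Token]:
    """One draft-and-verify round; returns the tokens that get committed."""
    proposed = draft(context, k)  # the drafter proposes a block of k tokens at once
    accepted: List[Token] = []
    for tok in proposed:
        expected = target(context + accepted)
        if tok != expected:
            # First mismatch: keep the target's own token and drop the rest of the block.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
    # Whole block accepted: the verification pass also yields one extra token.
    accepted.append(target(context + accepted))
    return accepted


def generate(target: TargetFn, draft: DraftFn, prompt: List[Token],
             n_new: int, k: int) -> Tuple[List[Token], int]:
    """Generate n_new tokens; also count simulated target verification passes."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < n_new:
        out.extend(speculative_step(target, draft, out, k))
        passes += 1
    return out[len(prompt):len(prompt) + n_new], passes


# Toy stand-ins (assumptions for illustration): the target counts upward,
# and the drafter is deliberately wrong on multiples of 4.
def toy_target(context: List[Token]) -> Token:
    return context[-1] + 1


def toy_draft(context: List[Token], k: int) -> List[Token]:
    start = context[-1]
    return [-1 if (start + i) % 4 == 0 else start + i for i in range(1, k + 1)]


tokens, passes = generate(toy_target, toy_draft, prompt=[0], n_new=10, k=4)
print(tokens, passes)
```

The output matches what the target alone would produce token by token, but it is reached in fewer verification rounds than tokens generated. That gap is where throughput gains of this kind come from, and it grows with how often the drafter's proposals are accepted.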