@bstnxbt: DFlash v0.1.4 : custom Metal verify kernels for quantized Qwen3 hybrid models, plus significant peak memory reduction a…

Summary

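DFlash v0.1.4 adds custom Metal verify kernels for quantized Qwen3 hybrid models and significantly reduces peak memory at long context. On an M5 Max (40-core GPU, 64GB), Qwen3.6-35B-A3B-4bit reaches a reported 2.20x the throughput of the stock mlx_lm baseline at 1024 (138.3 → 300.3 tok/s).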

DFlash v0.1.4: custom Metal verify kernels for quantized Qwen3 hybrid models, plus significant peak memory reduction at long context.

M5 Max 40-core GPU, 64GB, stock mlx_lm baseline:

Qwen3.6-35B-A3B-4bit:
► @ 1024 · 138.3 → 300.3 tok/s (2.20x)
► @ 2048 · 135.6 → 246.4
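
For readers unfamiliar with the term, "verify" here presumably refers to the verification step of speculative decoding, where the target model checks a block of tokens proposed by a cheaper drafter. The sketch below illustrates the generic greedy acceptance rule only; it is not DFlash's implementation, and the function name and token lists are hypothetical. It assumes the target model has already produced its own greedy choice at each drafted position, plus one extra position after the block.

```python
from typing import List, Sequence


def accept_draft_prefix(draft_tokens: Sequence[int],
                        target_tokens: Sequence[int]) -> List[int]:
    """Generic greedy verification for speculative decoding (illustrative only).

    draft_tokens:  tokens proposed by the drafter for the next k positions.
    target_tokens: the target model's greedy choice at each of those k
                   positions, plus one extra choice for position k + 1.
    Returns the tokens that are actually emitted this step.
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target_tokens must have one more entry than draft_tokens")

    emitted: List[int] = []
    for drafted, wanted in zip(draft_tokens, target_tokens):
        if drafted != wanted:
            # First disagreement: emit the target's token and stop.
            emitted.append(wanted)
            return emitted
        emitted.append(drafted)

    # Every drafted token was accepted, so the extra target token comes for free.
    emitted.append(target_tokens[-1])
    return emitted


if __name__ == "__main__":
    # Two drafted tokens match, the third does not.
    print(accept_draft_prefix([5, 7, 9, 2], [5, 7, 1, 4, 8]))
```

Each verification pass emits at least one token, and more when the drafter agrees with the target, which is where the throughput gain of speculative decoding comes from.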

Cached at: 04/20/26, 09:39 AM

Similar Articles

DFlash and Spec V2 Decoding (14 minute read)

TLDR AI

Z Lab, SGLang, and Modal release DFlash, a new speculative decoding model for Qwen 3.5 397B-A17B that uses block diffusion and KV injection to achieve over 4x throughput improvement over baseline and 1.5x over native MTP.

@charles_irl: dflash go brr

X AI KOLs Timeline

NVIDIA announces DFlash, an open source block diffusion model for speculative decoding that achieves up to 15x higher inference throughput on Blackwell GPUs while maintaining interactivity.