@bstnxbt: DFlash v0.1.4 : custom Metal verify kernels for quantized Qwen3 hybrid models, plus significant peak memory reduction a…

X AI KOLs Following 04/18/26, 07:53 PM Tools

quantization metal-kernels qwen3 inference-optimization mlx open-source

Summary

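DFlash v0.1.4 adds custom Metal verify kernels for quantized Qwen3 hybrid models and significantly reduces peak memory at long context. On an M5 Max (40-core GPU, 64GB), Qwen3.6-35B-A3B-4bit reaches 300.3 tok/s at 1024 versus 138.3 tok/s for the stock mlx_lm baseline, a reported 2.20x speedup.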

DFlash v0.1.4: custom Metal verify kernels for quantized Qwen3 hybrid models, plus significant peak memory reduction at long context.

M5 Max 40-core GPU, 64GB, stock mlx_lm baseline:

Qwen3.6-35B-A3B-4bit:
► @ 1024 · 138.3 → 300.3 tok/s (2.20x)
► @ 2048 · 135.6 → 246.4
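The speedup figures in the post are throughput ratios: DFlash tok/s divided by baseline tok/s. A minimal Python sketch of that arithmetic, using only the two pairs of rates quoted above (the `speedup` helper and the `reported` dictionary are illustrative names, not part of DFlash or mlx_lm):

```python
def speedup(baseline_tok_s: float, dflash_tok_s: float) -> float:
    """Return the throughput ratio of the accelerated run over the baseline."""
    return dflash_tok_s / baseline_tok_s


# (baseline tok/s, DFlash tok/s) as quoted in the post, keyed by the "@" value.
reported = {
    1024: (138.3, 300.3),
    2048: (135.6, 246.4),
}

# Ratio computed directly from the quoted rates; not the author's own figure.
ratios = {ctx: speedup(base, fast) for ctx, (base, fast) in reported.items()}

for ctx, ratio in ratios.items():
    print(f"@ {ctx}: {ratio:.2f}x")
```

Dividing the quoted rates gives about 2.17x at 1024 and about 1.82x at 2048. The post states 2.20x for the first case and does not say how that figure was derived, so the small gap may come from rounding or from averaging per-run ratios; that explanation is an assumption.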

Cached at: 04/20/26, 09:39 AM

Similar Articles

@zhijianliu_: DFlash for Qwen3.6-35B-A3B just dropped The community was running the day-1 preview before we even finished training. N…

X AI KOLs Following

Z Lab releases DFlash for Qwen3.6-35B-A3B, a speculative decoding draft model; training is complete and the weights are now available on GitHub and HuggingFace.

DFlash and Spec V2 Decoding (14 minute read)

TLDR AI

Z Lab, SGLang, and Modal release DFlash, a new speculative decoding model for Qwen 3.5 397B-A17B that uses block diffusion and KV injection to achieve over 4x throughput improvement over baseline and 1.5x over native MTP.

@pupposandro: 2.5x faster than llama.cpp on Strix Halo. We just shipped DFlash + PFlash for the AMD Ryzen AI MAX+ 395 iGPU (gfx1151, …

X AI KOLs Following

A new toolset (DFlash + PFlash) runs inference 2.5x faster than llama.cpp on the AMD Ryzen AI MAX+ 395 iGPU with 128 GiB of unified memory, with the speedup demonstrated on Qwen3.6-27B.

@charles_irl: dflash go brr

X AI KOLs Timeline

NVIDIA announces DFlash, an open source block diffusion model for speculative decoding that achieves up to 15x higher inference throughput on Blackwell GPUs while maintaining interactivity.

[Benchmark] DFlash Speculative Decoding + KV Cache Compression on RTX 5090 — 3.26x Speedup

Reddit r/LocalLLaMA

Benchmarks of DFlash speculative decoding combined with KV cache compression on RTX 5090 show up to 3.26x speedup on Qwen3.6-27B with minimal perplexity degradation, with q4_0/turbo4 providing the best balance.