RDNA3 Flash Attention fix just dropped in llama.cpp b9158
Summary
llama.cpp b9158 has been released with a fix for Flash Attention on RDNA3 GPUs, improving performance for AMD users.
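For readers who want to check the impact on their own hardware, a minimal before/after sketch (assuming a ROCm/HIP build of llama.cpp; cmake option names and flag syntax have shifted between releases, so check the docs for b9158, and model.gguf stands in for whatever GGUF model you are testing):

    # build with the HIP backend, targeting RX 7900-class RDNA3 GPUs
    cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1100
    cmake --build build --config Release
    # benchmark with flash attention disabled, then enabled
    ./build/bin/llama-bench -m model.gguf -fa 0
    ./build/bin/llama-bench -m model.gguf -fa 1

Running the same pair of llama-bench commands on a pre-b9158 build and on b9158 isolates how much of the gain comes from the Flash Attention fix.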
Similar Articles
@pupposandro: https://x.com/pupposandro/status/2054241934164492328
The post announces DFlash and PFlash speculative decoding support for AMD Strix Halo iGPUs, demonstrating significant inference speedups over llama.cpp's ROCm/HIP backend.
Luce DFlash + PFlash on AMD Strix Halo: Qwen3.6-27B at 2.23x decode and 3.05x prefill vs llama.cpp HIP
Luce releases DFlash and PFlash support for AMD Strix Halo APUs, achieving 2.23x decode and 3.05x prefill speedups over llama.cpp HIP on Qwen3.6-27B.
@pupposandro: 2.5x faster than llama.cpp on Strix Halo. We just shipped DFlash + PFlash for the AMD Ryzen AI MAX+ 395 iGPU (gfx1151, …
A new toolset (DFlash + PFlash) achieves 2.5x faster inference than llama.cpp on the AMD Ryzen AI MAX+ 395 iGPU (gfx1151), with significant speedups for Qwen3.6-27B using its 128 GiB of unified memory.
@bstnxbt: DFlash v0.1.4 : custom Metal verify kernels for quantized Qwen3 hybrid models, plus significant peak memory reduction a…
DFlash v0.1.4 adds custom Metal verify kernels for quantized Qwen3 hybrid models, along with significant peak-memory reductions and 2.2x throughput improvements at long context on M5 Max GPUs.
ExLlamaV3 Major Updates!
ExLlamaV3 has shipped a series of major updates, including Gemma 4 support, improved caching efficiency, and the new DFlash technology for significantly faster inference across a range of models.