RDNA3 Flash Attention fix just dropped in llama.cpp b9158

Reddit r/LocalLLaMA 05/15/26, 12:50 AM Tools

flash-attention rdna3 bug-fix llama-cpp inference amd-gpu release

Summary

llama.cpp b9158 has been released with a fix for Flash Attention on RDNA3 GPUs, improving performance for AMD users.

[https://github.com/ggml-org/llama.cpp/releases](https://github.com/ggml-org/llama.cpp/releases)

Similar Articles

@pupposandro: https://x.com/pupposandro/status/2054241934164492328

X AI KOLs Timeline

The article announces support for DFlash and PFlash speculative decoding in llama.cpp for AMD Strix Halo iGPUs, demonstrating significant speedups in inference performance using ROCm.

Luce DFlash + PFlash on AMD Strix Halo: Qwen3.6-27B at 2.23x decode and 3.05x prefill vs llama.cpp HIP

Reddit r/LocalLLaMA

Luce releases DFlash and PFlash support for AMD Strix Halo APUs, achieving 2.23x decode and 3.05x prefill speedups over llama.cpp HIP on Qwen3.6-27B.

@pupposandro: 2.5x faster than llama.cpp on Strix Halo. We just shipped DFlash + PFlash for the AMD Ryzen AI MAX+ 395 iGPU (gfx1151, …

X AI KOLs Following

A new toolset (DFlash + PFlash) runs inference 2.5x faster than llama.cpp on the AMD Ryzen AI MAX+ 395 iGPU, demonstrating significant speedups for Qwen3.6-27B with 128 GiB of unified memory.

@bstnxbt: DFlash v0.1.4 : custom Metal verify kernels for quantized Qwen3 hybrid models, plus significant peak memory reduction a…

X AI KOLs Following

DFlash v0.1.4 adds custom Metal verify kernels for quantized Qwen3 hybrid models, along with a significant peak memory reduction and a 2.2x throughput improvement at long context on M5 Max GPUs.

ExLlamaV3 Major Updates!