RDNA3 Flash Attention fix just dropped by llama.cpp b9158
Summary
llama.cpp b9158 has been released with a fix for Flash Attention on RDNA3 GPUs, improving performance for AMD users.
Similar Articles
RDNA2 flash attention isn’t enabled stock, I enabled it with this build and doubled my speed
A custom binary workaround enables flash attention on AMD RDNA2 GPUs for llama.cpp, roughly doubling inference speed to 70-80 tok/s, whereas the stock build crashes. Only confirmed working with Qwen3.6 35B/27B.
llama.cpp B9387 Significant AMD/ROCm PP Update
llama.cpp version b9387 introduces MFMA support for the AMD CDNA architecture (MI100, MI200, MI300 series), improving prompt processing (PP) performance on datacenter AMD GPUs.
@pupposandro: https://x.com/pupposandro/status/2054241934164492328
The article announces support for DFlash and PFlash speculative decoding in llama.cpp for AMD Strix Halo iGPUs, demonstrating significant speedups in inference performance using ROCm.
Flash Attention for llama.cpp on RDNA3: 47% less KV VRAM than Vulkan f16 K, KLD almost lossless on F16 K / q4_0 V. Part 1.
A new packed16 K technique for llama.cpp on RDNA3 GPUs reduces KV cache VRAM by 47% compared to Vulkan fp16, using int8 packing and native dot4 instructions to keep K values at fp16 quality with minimal KL divergence (KLD).
@nullfoundry: hey everyone. i'd like to share my new recipe for dflash (merged yesterday on official llama.cpp) llama-server -hf uns…
Sharing a new recipe for dflash speculative decoding in llama.cpp, achieving ~70 TPS on a single RTX 3090 using Qwen3.6-27B GGUF with a draft model.