RDNA2 flash attention isn’t enabled stock, I enabled it with this build and doubled my speed

Reddit r/LocalLLaMA 05/19/26, 04:24 AM Tools

flash-attention amd rdna2 llama-cpp rocm performance workaround

Summary

A custom llama.cpp binary works around a crash to enable flash attention on AMD RDNA2 GPUs under ROCm, reaching 70-80 tok/s versus 30 tok/s on Vulkan; stock ROCm builds crash. Only confirmed working with Qwen3.6 35B and 27B.

What's good everybody, I probably have the fastest possible setup on these AMD Radeon RDNA2 GPUs for one reason only: a custom binary that bypasses an assert statement causing a crash in today's stock releases. Bypassing that assert enables flash attention. It works for the ROCm llama.cpp build with Qwen3.6 35B.

tl;dr: Vulkan gets 30 tok/s, stock ROCm doesn't run, and this build gets 70-80 tok/s. Try it yourself:

https://github.com/Minerest/llama.cpp_RDNA2_FlashAttnEnabled/releases/tag/mtp-fa-workaround

If you guys try to run flash attention on ROCm with this hardware on a stock llama.cpp build, you will hit a wall.

GGML Flash Attention Crash (gfx1030/gfx1031)

    GGML_ASSERT(max_blocks_per_sm > 0) failed
    ggml/src/ggml-cuda/fattn-common.cuh:1054

Basically, HIP reports that hipOccupancyMaxActiveBlocksPerMultiprocessor = 0, which is wrong. This build is working proof that we do, indeed, have memory. I patched in a workaround that logs at the point where you would have crashed.

There are some technical findings on GitHub, but for the rest of you who just want a faster build, this is it.

Buyer beware: local AI on ROCm crashes often. Gemma crashes on bigger contexts with this build. DeepSeek ran very, very slowly. The only models I've confirmed working are Qwen3.6 35B and 27B.

And for those who want the llama-server flags:

    exec "$REPO/mtp-build/bin/llama-server" \
      -m "$MODEL" \
      --spec-type draft-mtp \
      --spec-draft-n-max 2 \
      -fa on \
      --no-mmproj \
      -ngl 50 \
      -ts 16,10 \
      -c 64192 \
      --parallel 1 \
      --host 127.0.0.1 --port 8080

And finally, the llama.cpp build command, post patch:

    cmake -S . -B build-instrumented \
      -DCMAKE_BUILD_TYPE=Release \
      -DGGML_HIP=ON \
      -DGPU_TARGETS="gfx1030;gfx1031" \
      -DROCM_PATH=/usr \
      -DBUILD_SHARED_LIBS=ON \
      -DCMAKE_HIP_FLAGS="-DGGML_FATTN_TRACE"
    cmake --build build-instrumented --target llama-bench -j6
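For anyone wondering what "bypass the assert and log instead" could look like, here is a minimal sketch. It is not the code from the linked fork or from llama.cpp. The helper name `sanitize_max_blocks_per_sm` and the fallback of one block per SM are assumptions made for illustration, and the real patch may choose a different value or handle the case elsewhere.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical helper, NOT taken from llama.cpp or the linked fork.
// `reported` stands in for the value handed back by the occupancy query
// (hipOccupancyMaxActiveBlocksPerMultiprocessor in the post).
// Stock behaviour per the post: GGML_ASSERT(max_blocks_per_sm > 0) aborts.
// Sketch of a workaround: log the bad value and carry on with a fallback.
inline int sanitize_max_blocks_per_sm(int reported, int fallback = 1) {
    assert(fallback >= 1);  // the fallback itself must be a usable value
    if (reported > 0) {
        return reported;    // a sane report passes through unchanged
    }
    // Assumed log format; the fork's actual message is not shown in the post.
    std::fprintf(stderr,
                 "fattn: occupancy query returned %d, assuming %d block(s) per SM\n",
                 reported, fallback);
    return fallback;
}
```

Calling `sanitize_max_blocks_per_sm(0)` returns 1 and writes a line to stderr, while a positive value such as 4 is returned untouched. A conservative fallback trades possible under-utilization for not crashing, which fits the post's report that the kernel does run once the assert is out of the way.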

Similar Articles

RDNA3 Flash Attention fix just dropped by llama.cpp b9158
llama.cpp b9158 has been released with a fix for Flash Attention on RDNA3 GPUs, improving performance for AMD users.

@pupposandro: 2.5x faster than llama.cpp on Strix Halo. We just shipped DFlash + PFlash for the AMD Ryzen AI MAX+ 395 iGPU (gfx1151, …

Luce DFlash + PFlash on AMD Strix Halo: Qwen3.6-27B at 2.23x decode and 3.05x prefill vs llama.cpp HIP
Luce releases DFlash and PFlash support for AMD Strix Halo APUs, achieving 2.23x decode and 3.05x prefill speedups over llama.cpp HIP on Qwen3.6-27B.

BeeLlama v0.2.0 – major DFlash update. Single RTX 3090: Qwen 3.6 27B up to 164 tps (4.40x), Gemma 4 31B up to 177.8 tps (4.93x). Prompt processing speed near baseline.

@pupposandro: https://x.com/pupposandro/status/2054241934164492328