RDNA2 flash attention isn't enabled in stock builds; I enabled it with this build and doubled my speed

What's good everybody. I probably have the fastest possible setup on these AMD Radeon RDNA2 GPUs, for one reason only: a custom binary that bypasses the assert statement that crashes today's stock releases, which enables flash attention. It works with a ROCm llama.cpp build running qwen3.6 35B.

tl;dr: Vulkan: 30 tok/s. Stock ROCm: doesn't run. This build: 70-80 tok/s.
Try it yourself:

https://github.com/Minerest/llama.cpp_RDNA2_FlashAttnEnabled/releases/tag/mtp-fa-workaround

If you try to run flash attention on ROCm with this hardware using a stock llama.cpp build, you will hit a wall:

GGML Flash Attention Crash (gfx1030/gfx1031)
GGML_ASSERT(max_blocks_per_sm > 0) failed
ggml/src/ggml-cuda/fattn-common.cuh:1054

Basically, HIP reports hipOccupancyMaxActiveBlocksPerMultiprocessor = 0, which is wrong. The fact that this build runs is working proof that we do, indeed, have the memory. I patched in a workaround that logs at the point where you would have crashed. There are some technical findings on GitHub, but for the rest of you who just want a faster build, this is it.
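For anyone curious what "log instead of crash" can look like, here is a minimal sketch of the idea. It is not the actual patch: the helper name effective_blocks_per_sm is hypothetical, and falling back to 1 block per multiprocessor is my assumption about a reasonable default, not something confirmed by the release. The real change lives in the GitHub repo.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical helper, not taken from llama.cpp or from the patched build.
// reported_blocks stands in for the value that the HIP occupancy query
// (hipOccupancyMaxActiveBlocksPerMultiprocessor) hands back.
int effective_blocks_per_sm(int reported_blocks) {
    assert(reported_blocks >= 0);  // a negative occupancy would be a caller bug

    if (reported_blocks > 0) {
        return reported_blocks;    // normal case: trust the runtime
    }

    // Stock builds abort here via GGML_ASSERT(max_blocks_per_sm > 0).
    // Assumption: log a warning and fall back to one block per multiprocessor.
    std::fprintf(stderr,
                 "warning: occupancy query returned 0, assuming 1 block per SM\n");
    return 1;
}
```

Calling effective_blocks_per_sm(0) returns 1 and prints the warning to stderr, while any positive value passes through untouched, so hardware that reports occupancy correctly behaves exactly as before.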

Buyer beware: local AI on ROCm crashes often. Gemma crashes on bigger contexts with this build. DeepSeek ran very, very slowly. The only models I've confirmed working are qwen3.6 35B and 27B.

And for those who want the llama-server flags:

exec "$REPO/mtp-build/bin/llama-server" \
-m "$MODEL" \
--spec-type draft-mtp \
--spec-draft-n-max 2 \
-fa on \
--no-mmproj \
-ngl 50 \
-ts 16,10 \
-c 64192 \
--parallel 1 \
--host 127.0.0.1 --port 8080

And finally, the llama.cpp build command, post-patch:

cmake -S . -B build-instrumented \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_HIP=ON \
-DGPU_TARGETS="gfx1030;gfx1031" \
-DROCM_PATH=/usr \
-DBUILD_SHARED_LIBS=ON \
-DCMAKE_HIP_FLAGS="-DGGML_FATTN_TRACE"
cmake --build build-instrumented --target llama-bench -j6

submitted by /u/DiscipleofDeceit666