What's good everybody, I probably have the fastest possible setup on these AMD Radeon RDNA2 GPUs for one reason only. A custom binary that bypasses some assert statement causing a crash in today’s stock releases. This binary bypasses that assert and enables flash attention. Works for rocm lamma cpp build with qwen3.6 35B.
tldr; vulkan tok/s 30. stock rocm tok/s: Doesnt run. This build: 70-80 tok/s
try it yourself.
https://github.com/Minerest/llama.cpp\_RDNA2\_FlashAttnEnabled/releases/tag/mtp-fa-workaround
If you guys try to run flash attention on rocm with this hardware with a stock llama cpp build, you will hit a wall.
GGMLFlash Attention Crash (gfx1030/gfx1031)
GGML_ASSERT(max_blocks_per_sm > 0) failed
ggml/src/ggml-cuda/fattn-common.cuh:1054
Basically, HIP reports that hipOccupancyMaxActiveBlocksPerMultiprocessor
= 0 which is wrong. This is working proof that we do, indeed, have memory. I patched a workaround log when you would have crashed. There's some technical findings in github, but for the rest of you who just want a faster build, this is it.
Buyer Beware, local AI on rocm crash often. Gemma crashes on bigger contexts with this build. Deepseek ran very, very slowly. Only confirmed working AI I've tried is qwen3.6 35B and 27B.
And for those who want the llama server flags.
exec "$REPO/mtp-build/bin/llama-server" \
-m "$MODEL" \
--spec-type draft-mtp \
--spec-draft-n-max 2 \
-fa on \
--no-mmproj \
-ngl 50 \
-ts 16,10 \
-c 64192 \
--parallel 1 \
--host 127.0.0.1 --port 8080 \
And finally, the llama cpp build command post patch
cmake -S . -B build-instrumented \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_HIP=ON \
-DGPU_TARGETS="gfx1030;gfx1031" \
-DROCM_PATH=/usr \
-DBUILD_SHARED_LIBS=ON \
-DCMAKE_HIP_FLAGS="-DGGML_FATTN_TRACE"
cmake --build build-instrumented --target llama-bench -j6
[link] [comments]