Strix Halo ROCm + MTP Notes (May 2026)

With MTP (multi-token prediction) merged into mainline llama.cpp, I wanted to try out some other optimizations I could think of. I ended up testing backends, MTP, and a bump to the ROCm nightlies.

What's changed:

ROCm 7.13 works on gfx1151 (7.2.2 could see the GPU but couldn't compile shaders)
MTP merged to llama.cpp main yesterday (May 16)
I ran 3 models x 2 backends x 3 prompt lengths + a full-context decode test

The headline: on the 35B MoE, ROCm decode speed drops 64% at full context, but MTP recovers most of it. Vulkan barely drops.

Full writeup with all tables: https://kmarble.dev/posts/strix-halo-full-context-decode-drops/

But the quick version:

35B MoE at full context (76k prompt tokens, 5k output):

ROCm non-MTP: 16.6 tok/s (was 46.2 empty)
ROCm MTP: 37.5 tok/s (was 63.7 empty)
Vulkan non-MTP: 28.9 tok/s (was 32.7 empty)
Vulkan MTP: 34.3 tok/s (was 46.8 empty)
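The drop percentages can be recomputed from the four lines above. This is a minimal sketch: the dictionary and function names are my own for illustration, and the numbers are copied straight from the list.

```python
# Decode throughput in tok/s for the 35B MoE, copied from the list above.
# Each value is (empty context, full context).
results_35b = {
    "ROCm non-MTP": (46.2, 16.6),
    "ROCm MTP": (63.7, 37.5),
    "Vulkan non-MTP": (32.7, 28.9),
    "Vulkan MTP": (46.8, 34.3),
}


def drop_pct(empty: float, full: float) -> float:
    """Percentage of empty-context throughput lost at full context."""
    return (empty - full) / empty * 100.0


drops_35b = {name: drop_pct(e, f) for name, (e, f) in results_35b.items()}

for name, pct in drops_35b.items():
    print(f"{name}: {pct:.0f}% drop")
```

By this arithmetic the non-MTP figures come out at about 64% (ROCm) and 12% (Vulkan), and the MTP configurations lose roughly 41% (ROCm) and 27% (Vulkan) relative to their own empty-context speeds.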

122B MoE at full context (drop is vs. empty context):

Vulkan non-MTP: 23.7 tok/s (only 12% drop)
ROCm MTP: 19.2 tok/s (38% drop)
Vulkan MTP: 21.9 tok/s (6% drop)

27B dense (avoid it): 6-9 tok/s at full context regardless of backend.

Insights:

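ROCm was about 1.4x Vulkan at empty context (46.2 vs 32.7 tok/s, both non-MTP), but at full context ROCm only stays ahead with MTP on: 1.3x (37.5 with MTP vs 28.9 for non-MTP Vulkan). Without MTP, ROCm falls behind Vulkan (16.6 vs 28.9)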
Vulkan is way more stable at full context - only 12% drop vs ROCm's 64%
MTP on 122B Vulkan holds up best, losing only 6% at full context (vs 12% for non-MTP), though at 21.9 tok/s it is still slightly behind non-MTP's 23.7. MTP on 122B ROCm drops 38%
The dense 27B is unusable - 5x slower than 35B MoE because it processes 27B active params per token vs 3B
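The backend ratios can be checked the same way. A small sketch using only the 35B MoE figures from the quick version; the variable names are made up for illustration.

```python
# 35B MoE decode throughput (tok/s), copied from the quick-version list.
rocm = {"empty": 46.2, "full": 16.6, "full_mtp": 37.5}
vulkan = {"empty": 32.7, "full": 28.9, "full_mtp": 34.3}

# A ratio above 1 means ROCm is faster, below 1 means Vulkan is faster.
ratios = {
    "empty, non-MTP vs non-MTP": rocm["empty"] / vulkan["empty"],
    "full, non-MTP vs non-MTP": rocm["full"] / vulkan["full"],
    "full, MTP vs MTP": rocm["full_mtp"] / vulkan["full_mtp"],
    "full, ROCm MTP vs Vulkan non-MTP": rocm["full_mtp"] / vulkan["full"],
}

for label, value in ratios.items():
    print(f"{label}: {value:.2f}x")
```

Comparing like with like matters here: the ratio depends heavily on whether MTP is enabled on one side, both, or neither.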

Setup: ROCm 7.13 with the therock-gfx1151 codegen path from kyuz0's toolbox. Vulkan 1.3 RADV. llama.cpp b9188. All numbers come from live requests through a llama-swap proxy, not synthetic llama-bench runs.

BF16 models don't work at full context on Strix Halo. Q8 for 35B, Q4 for 122B.
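For a feel of the sizes involved, here is a rough back-of-envelope for weight memory. It assumes nominal bits per weight (16 for BF16, 8 for Q8, 4 for Q4); real GGUF quants store extra scale data, so actual files are somewhat larger, and this ignores the KV cache entirely. It is only an illustration of the arithmetic, not a diagnosis of why BF16 fails here.

```python
GIB = 1024 ** 3  # bytes per GiB


def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GiB at a nominal bit width."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / GIB


# Parameter counts are the model sizes named in the post; bit widths are
# nominal assumptions, not measured file sizes.
for label, params, bits in [
    ("35B BF16", 35, 16),
    ("35B Q8", 35, 8),
    ("122B BF16", 122, 16),
    ("122B Q4", 122, 4),
]:
    print(f"{label}: ~{weight_gib(params, bits):.0f} GiB of weights")
```

Under these assumptions, halving the bit width halves the weight footprint, which leaves correspondingly more room for a long-context KV cache.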

For my setup, ROCm MTP on 35B MoE stays the production choice: 37.5 tok/s at full context, under 100W, 262k context available. But if you care more about quality than speed, 122B on Vulkan at 23-24 tok/s is competitive.

submitted by /u/IvGranite