Strix Halo ROCm + MTP Notes (May 2026)

Reddit r/LocalLLaMA Tools

Summary

Technical benchmark comparing ROCm and Vulkan backends for LLM inference on Strix Halo hardware after MTP merged into llama.cpp, revealing ROCm suffers severe performance drops at full context while Vulkan remains stable.

With MTP merged into mainline llama.cpp, I wanted to try a few other optimizations I could think of. I ended up testing backends, MTP, and a bump to the ROCm nightlies.

What's changed:

- ROCm 7.13 works on gfx1151 (7.2.2 could see the GPU but couldn't compile shaders)
- MTP merged to llama.cpp main yesterday (May 16)
- I ran 3 models x 2 backends x 3 prompt lengths + a full-context decode test

The headline: ROCm drops 64% at full context, but MTP recovers most of it. Vulkan barely drops.

Full writeup with all tables: https://kmarble.dev/posts/strix-halo-full-context-decode-drops/

But the quick version:

35B MoE at full context (76k prompt tokens, 5k output):

- ROCm non-MTP: 16.6 tok/s (was 46.2 empty)
- ROCm MTP: 37.5 tok/s (was 63.7 empty)
- Vulkan non-MTP: 28.9 tok/s (was 32.7 empty)
- Vulkan MTP: 34.3 tok/s (was 46.8 empty)

122B MoE at full context (drop is relative to empty context):

- Vulkan non-MTP: 23.7 tok/s (only 12% drop)
- ROCm MTP: 19.2 tok/s (38% drop)
- Vulkan MTP: 21.9 tok/s (6% drop)

27B dense (avoid it): 6-9 tok/s at full context regardless of backend.

Insights:

1. On the 35B MoE, ROCm was about 1.4x Vulkan at empty context (46 vs 32 tok/s, both non-MTP), but at full context the gap narrows to 1.3x (ROCm MTP 37.5 vs Vulkan non-MTP 28.9)
2. Vulkan is far more stable at full context: only a 12% drop vs ROCm's 64% (35B MoE, non-MTP)
3. MTP on 122B Vulkan holds up best, dropping only 6% at full context (non-MTP drops 12%), while MTP on 122B ROCm drops 38%
4. The dense 27B is unusable: about 5x slower than the 35B MoE, because it processes 27B active params per token vs 3B

Setup: ROCm 7.13 with the therock-gfx1151 codegen path from kyuz0's toolbox. Vulkan 1.3 RADV. llama.cpp b9188. All numbers come from live tests through the llama-swap proxy, not synthetic llama-bench runs. BF16 models don't work at full context on Strix Halo, so I used Q8 for the 35B and Q4 for the 122B.

For my setup, ROCm MTP on the 35B MoE stays the production choice: 37.5 tok/s at full context, under 100W, 262k context available. But if you care more about quality than speed, the 122B on Vulkan at 23-24 tok/s is competitive.
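The drop percentages and backend ratios quoted in the post can be rechecked from the raw tok/s figures. The sketch below is not the author's tooling; the names `RESULTS_35B`, `drop_percent`, `speedup`, and `summarize` are made up for illustration, and the only data used are the 35B MoE numbers listed in the post.

```python
# Decode throughput in tok/s as quoted in the post for the 35B MoE:
# (empty context, full context). Names here are illustrative only.
RESULTS_35B = {
    "ROCm non-MTP": (46.2, 16.6),
    "ROCm MTP": (63.7, 37.5),
    "Vulkan non-MTP": (32.7, 28.9),
    "Vulkan MTP": (46.8, 34.3),
}


def drop_percent(empty: float, full: float) -> float:
    """Relative throughput loss from empty to full context, in percent."""
    return (empty - full) / empty * 100.0


def speedup(a: float, b: float) -> float:
    """How many times faster throughput a is than throughput b."""
    return a / b


def summarize(results: dict) -> dict:
    """Map each configuration to its drop percentage, rounded to one decimal."""
    return {
        name: round(drop_percent(empty, full), 1)
        for name, (empty, full) in results.items()
    }


if __name__ == "__main__":
    for name, drop in summarize(RESULTS_35B).items():
        print(f"{name:15s} {drop:5.1f}% drop at full context")

    rocm_empty, rocm_full = RESULTS_35B["ROCm non-MTP"]
    vk_empty, vk_full = RESULTS_35B["Vulkan non-MTP"]
    rocm_mtp_full = RESULTS_35B["ROCm MTP"][1]

    # Like-for-like (non-MTP) ratio at empty context.
    print(f"ROCm / Vulkan, empty, non-MTP: {speedup(rocm_empty, vk_empty):.2f}x")
    # The post's full-context comparison pairs ROCm MTP with Vulkan non-MTP.
    print(f"ROCm MTP / Vulkan non-MTP, full: {speedup(rocm_mtp_full, vk_full):.2f}x")
    # Like-for-like (non-MTP) ratio at full context, where Vulkan leads.
    print(f"Vulkan / ROCm, full, non-MTP: {speedup(vk_full, rocm_full):.2f}x")
```

With the quoted figures this reproduces the roughly 64% ROCm non-MTP drop and the roughly 12% Vulkan non-MTP drop. It also shows that the MTP configurations lose throughput at full context on both backends (about 41% on ROCm and 27% on Vulkan), even though their absolute tok/s stays higher.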