Strix Halo users, a rejected PR can give you up to 30% faster PP for MOEs.

Reddit r/LocalLLaMA Tools

Summary

A rejected PR for llama.cpp provides up to 30% faster prompt processing for MOE models on AMD Strix Halo hardware, with gains diminishing at higher context lengths.

Here's the PR by pedapudi. https://github.com/ggml-org/llama.cpp/pull/21344 It's merge request has been denied so it will not be in mainline llama.cpp. The changes are so small that I just put them into whatever the current release of llama.cpp is. Read the PR for more info. It will only work with MOEs. Also, it gives the most boost at low context. As the context rises, the gain diminishes. Pedapudi explains why that happens in the PR. Here are some numbers. It really works well. The tiny amount of time it takes me to apply the code to the current release of llama.cpp is time well spent. main ggml_cuda_init: found 1 ROCm devices (Total VRAM: 128000 MiB): Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 128000 MiB | model | size | params | backend | ngl | mmap | test | t/s | | ------------------------------ | ---------: | ---------: | ---------- | --: | ---: | --------------: | -------------------: | | qwen35moe 35B.A3B Q4_K - Small | 19.45 GiB | 34.66 B | ROCm | 99 | 0 | pp512 | 1106.11 ± 8.60 | | qwen35moe 35B.A3B Q4_K - Small | 19.45 GiB | 34.66 B | ROCm | 99 | 0 | pp512 @ d10000 | 755.79 ± 2.58 | | qwen35moe 35B.A3B Q4_K - Small | 19.45 GiB | 34.66 B | ROCm | 99 | 0 | pp512 @ d20000 | 587.61 ± 1.52 | | qwen35moe 35B.A3B Q4_K - Small | 19.45 GiB | 34.66 B | ROCm | 99 | 0 | pp512 @ d40000 | 415.09 ± 2.45 | | qwen35moe 35B.A3B Q4_K - Small | 19.45 GiB | 34.66 B | ROCm | 99 | 0 | pp512 @ d60000 | 316.89 ± 2.35 | PR ggml_cuda_init: found 1 ROCm devices (Total VRAM: 128000 MiB): Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 128000 MiB | model | size | params | backend | ngl | mmap | test | t/s | | ------------------------------ | ---------: | ---------: | ---------- | --: | ---: | --------------: | -------------------: | | qwen35moe 35B.A3B Q4_K - Small | 19.45 GiB | 34.66 B | ROCm | 99 | 0 | pp512 | 1447.62 ± 7.10 | **+31%** | qwen35moe 35B.A3B Q4_K - Small | 19.45 GiB | 34.66 B | ROCm | 99 | 0 | pp512 @ d10000 | 905.60 ± 3.53 | **+20%** | qwen35moe 35B.A3B Q4_K - Small | 19.45 GiB | 34.66 B | ROCm | 99 | 0 | pp512 @ d20000 | 685.23 ± 3.03 | **+16%** | qwen35moe 35B.A3B Q4_K - Small | 19.45 GiB | 34.66 B | ROCm | 99 | 0 | pp512 @ d40000 | 459.42 ± 2.70 | **+11%** | qwen35moe 35B.A3B Q4_K - Small | 19.45 GiB | 34.66 B | ROCm | 99 | 0 | pp512 @ d60000 | 342.41 ± 2.43 | **+8%**
Original Article

Similar Articles

Strix Halo ROCm + MTP Notes (May 2026)

Reddit r/LocalLLaMA

Technical benchmark comparing ROCm and Vulkan backends for LLM inference on Strix Halo hardware after MTP merged into llama.cpp, revealing ROCm suffers severe performance drops at full context while Vulkan remains stable.

Is HIPfire worth it for Strix Halo?

Reddit r/LocalLLaMA

The article asks for community evaluations of HIPfire's performance and quality on AMD Strix Halo hardware, specifically regarding long context support compared to llama.cpp.