Strix Halo Llama.cpp MTP Benchmarks: 27B Gets Much Faster, 35B Is Mixed

Reddit r/LocalLLaMA 05/16/26, 04:41 PM Tools

llama-cpp multi-token-prediction benchmark strix-halo amd qwen

Summary

Benchmarks of MTP (Multi-Token Prediction) in llama.cpp on Strix Halo show significant speedups for 27B Qwen models in long-context chat, but mixed results for 35B models.

### **TL;DR** All models were Qwen3.6 **27B-MTP vs Base 27B (15k single-turn): Faster overall** * **Total Time (wall):** 87.44s → 77.39s (**10.05s faster** / -11.50%) * **Generation:** 7.63 → 16.15 t/s (+111.77% speedup) * **Prompt Processing:** 279.75 → 244.90 t/s (-12.46% slowdown) **35B-MTP vs Base 35B (15k single-turn): Slower overall** * **Total Time (wall):** 20.83s → 23.16s (**2.33s slower** / +11.17%) * **Generation:** 48.18 → 56.12 t/s (+16.47% speedup) * **Prompt Processing:** 972.18 → 811.90 t/s (-16.49% slowdown) **27B-MTP vs Base 27B (5-turn chat, ~28.5k context): Massive time savings** * **Total Time (wall):** 258.65s → 200.55s (**58.10s faster** / -22.46%) * **Turns 2-5 (wall):** 211.37s → 155.33s (**56.04s faster** / -26.51%) * **Avg Generation:** 7.61 → 17.98 t/s (+136.41% speedup) * **Avg Prompt Processing:** 254.20 → 207.87 t/s (-18.23% slowdown) **35B-MTP vs Base 35B (5-turn chat, ~28.5k context): Roughly tied, slightly slower** * **Total Time (wall):** 58.86s → 60.24s (**1.38s slower** / +2.34%) * **Turns 2-5 (wall):** 47.96s → 49.21s (**1.25s slower** / +2.62%) * **Avg Generation:** 46.66 → 58.23 t/s (+24.80% speedup) * **Avg Prompt Processing:** 826.47 → 703.45 t/s (-14.89% slowdown) **Terminology:** * `wall` = real end-to-end elapsed time from sending the request to receiving the full response. * `pp` = prompt processing throughput (tokens/sec). * `gen t/s` = generation throughput (tokens/sec). --- ### **Hardware / Software** * **CPU:** AMD RYZEN AI MAX+ 395 (16C/32T) * **iGPU:** Radeon 8060S (RADV GFX1151) * **RAM:** 30 GiB * **OS:** Ubuntu 24.04, kernel 6.17 * **llama.cpp / llama-server:** 9187 (0253fb21f) * **Vulkan Instance:** 1.4.313 * **GPU API:** 1.4.305 * **Mesa RADV:** 25.0.7 --- ### **Models Tested (all Unsloth)** * `Qwen3.6-27B-Q8_0.gguf` * `Qwen3.6-27B-Q8_0-MTP.gguf` * `Qwen3.6-35B-A3B-Q8_0.gguf` * `Qwen3.6-35B-A3B-Q8_0-MTP.gguf` --- ### **Runtime Config Used** * `--ctx-size 128000` * `-b 2048` * `--ubatch-size 1024` * `--flash-attn on` * `--threads 16` * `--threads-batch 16` **MTP models only:** * `--spec-type draft-mtp` * `--spec-draft-n-max 3` * `--spec-draft-p-min 0.75` --- ### **Methodology** **15k single-turn uncached** * Synthetic agentic prompt calibrated to ~15k prompt tokens. * `max_tokens=256`, `temperature=0`. * Prompt randomized each run (RUN_TAG) so `cache_n=0` (true uncached prefill). * 2 runs per model. **5-turn subsequent-turn test** * Same scripted 5-turn back-and-forth for each model. * ~3900-word user payload each turn. * Context grows to ~28.5k prompt tokens by turn 5. * `max_tokens=220`, `temperature=0`. * Reported both full 5-turn total and turns 2-5 only (to isolate “subsequent turn” behavior). --- ### **Stability** * Retry logic on transient 502/503/504 for long runs. * Reported both server infer timing and client-observed wall time. --- ### **Full Results (Latency-Focused)** **15k single-turn** | Family | Non-MTP wall | MTP wall | Delta | | --- | --- | --- | --- | | **27B** | 87.44s | 77.39s | -11.50% | | **35B** | 20.83s | 23.16s | +11.17% | **5-turn total (~28.5k by turn 5)** | Family | Non-MTP wall | MTP wall | Delta | | --- | --- | --- | --- | | **27B** | 258.65s | 200.55s | -22.46% | | **35B** | 58.86s | 60.24s | +2.34% | **Subsequent turns only (turns 2-5)** | Family | Non-MTP wall | MTP wall | Delta | | --- | --- | --- | --- | | **27B** | 211.37s | 155.33s | -26.51% | | **35B** | 47.96s | 49.21s | +2.62% | --- ### **Takeaways** * **MTP consistently lowers pp** and increases generation t/s. * **Workload shape dictates the overall winner:** * If decode dominates, MTP can win hard (as seen on 27B here). * If prefill dominates enough, MTP may lose slightly overall (as seen on 35B here). * **On this Strix Halo setup:** * **27B-MTP** is a strong practical upgrade for long-context chat workflows. * **35B-MTP** is mixed: faster token generation, but slightly slower end-to-end for these specific long-context tests.

Original Article

Strix Halo Llama.cpp MTP Benchmarks: 27B Gets Much Faster, 35B Is Mixed

Similar Articles

Qwen 3.6-27B Dense with MTP on Strix Halo Windows - Benchmarks

Qwen3.5-122B-Q5-MTP - Qwen3.5-122B-Q6-MTP

Strix Halo ROCm + MTP Notes (May 2026)

llama.cpp - Qwen3.6/3.5-MTP - Share your benchmarks t/s

MTP (Multi-Token Prediction): 2x Faster Token Generation on AMD Strix Halo & Radeon 9700 AI Pro

Submit Feedback

Similar Articles

Qwen 3.6-27B Dense with MTP on Strix Halo Windows - Benchmarks
Community benchmarks of Qwen 3.6-27B Dense and MTP variants running via llama.cpp on Strix Halo Windows, showing token/s speeds for various tasks.

Qwen3.5-122B-Q5-MTP - Qwen3.5-122B-Q6-MTP

Strix Halo ROCm + MTP Notes (May 2026)

llama.cpp - Qwen3.6/3.5-MTP - Share your benchmarks t/s

MTP (Multi-Token Prediction): 2x Faster Token Generation on AMD Strix Halo & Radeon 9700 AI Pro