Benchmarks of MTP (Multi-Token Prediction) in llama.cpp on Strix Halo show significant speedups for 27B Qwen models in long-context chat, but mixed results for 35B models.
### **TL;DR** All models were Qwen3.6 **27B-MTP vs Base 27B (15k single-turn): Faster overall** * **Total Time (wall):** 87.44s → 77.39s (**10.05s faster** / -11.50%) * **Generation:** 7.63 → 16.15 t/s (+111.77% speedup) * **Prompt Processing:** 279.75 → 244.90 t/s (-12.46% slowdown) **35B-MTP vs Base 35B (15k single-turn): Slower overall** * **Total Time (wall):** 20.83s → 23.16s (**2.33s slower** / +11.17%) * **Generation:** 48.18 → 56.12 t/s (+16.47% speedup) * **Prompt Processing:** 972.18 → 811.90 t/s (-16.49% slowdown) **27B-MTP vs Base 27B (5-turn chat, ~28.5k context): Massive time savings** * **Total Time (wall):** 258.65s → 200.55s (**58.10s faster** / -22.46%) * **Turns 2-5 (wall):** 211.37s → 155.33s (**56.04s faster** / -26.51%) * **Avg Generation:** 7.61 → 17.98 t/s (+136.41% speedup) * **Avg Prompt Processing:** 254.20 → 207.87 t/s (-18.23% slowdown) **35B-MTP vs Base 35B (5-turn chat, ~28.5k context): Roughly tied, slightly slower** * **Total Time (wall):** 58.86s → 60.24s (**1.38s slower** / +2.34%) * **Turns 2-5 (wall):** 47.96s → 49.21s (**1.25s slower** / +2.62%) * **Avg Generation:** 46.66 → 58.23 t/s (+24.80% speedup) * **Avg Prompt Processing:** 826.47 → 703.45 t/s (-14.89% slowdown) **Terminology:** * `wall` = real end-to-end elapsed time from sending the request to receiving the full response. * `pp` = prompt processing throughput (tokens/sec). * `gen t/s` = generation throughput (tokens/sec). --- ### **Hardware / Software** * **CPU:** AMD RYZEN AI MAX+ 395 (16C/32T) * **iGPU:** Radeon 8060S (RADV GFX1151) * **RAM:** 30 GiB * **OS:** Ubuntu 24.04, kernel 6.17 * **llama.cpp / llama-server:** 9187 (0253fb21f) * **Vulkan Instance:** 1.4.313 * **GPU API:** 1.4.305 * **Mesa RADV:** 25.0.7 --- ### **Models Tested (all Unsloth)** * `Qwen3.6-27B-Q8_0.gguf` * `Qwen3.6-27B-Q8_0-MTP.gguf` * `Qwen3.6-35B-A3B-Q8_0.gguf` * `Qwen3.6-35B-A3B-Q8_0-MTP.gguf` --- ### **Runtime Config Used** * `--ctx-size 128000` * `-b 2048` * `--ubatch-size 1024` * `--flash-attn on` * `--threads 16` * `--threads-batch 16` **MTP models only:** * `--spec-type draft-mtp` * `--spec-draft-n-max 3` * `--spec-draft-p-min 0.75` --- ### **Methodology** **15k single-turn uncached** * Synthetic agentic prompt calibrated to ~15k prompt tokens. * `max_tokens=256`, `temperature=0`. * Prompt randomized each run (RUN_TAG) so `cache_n=0` (true uncached prefill). * 2 runs per model. **5-turn subsequent-turn test** * Same scripted 5-turn back-and-forth for each model. * ~3900-word user payload each turn. * Context grows to ~28.5k prompt tokens by turn 5. * `max_tokens=220`, `temperature=0`. * Reported both full 5-turn total and turns 2-5 only (to isolate “subsequent turn” behavior). --- ### **Stability** * Retry logic on transient 502/503/504 for long runs. * Reported both server infer timing and client-observed wall time. --- ### **Full Results (Latency-Focused)** **15k single-turn** | Family | Non-MTP wall | MTP wall | Delta | | --- | --- | --- | --- | | **27B** | 87.44s | 77.39s | -11.50% | | **35B** | 20.83s | 23.16s | +11.17% | **5-turn total (~28.5k by turn 5)** | Family | Non-MTP wall | MTP wall | Delta | | --- | --- | --- | --- | | **27B** | 258.65s | 200.55s | -22.46% | | **35B** | 58.86s | 60.24s | +2.34% | **Subsequent turns only (turns 2-5)** | Family | Non-MTP wall | MTP wall | Delta | | --- | --- | --- | --- | | **27B** | 211.37s | 155.33s | -26.51% | | **35B** | 47.96s | 49.21s | +2.62% | --- ### **Takeaways** * **MTP consistently lowers pp** and increases generation t/s. * **Workload shape dictates the overall winner:** * If decode dominates, MTP can win hard (as seen on 27B here). * If prefill dominates enough, MTP may lose slightly overall (as seen on 35B here). * **On this Strix Halo setup:** * **27B-MTP** is a strong practical upgrade for long-context chat workflows. * **35B-MTP** is mixed: faster token generation, but slightly slower end-to-end for these specific long-context tests.
Benchmark comparison of Qwen3.5-122B Q5 and Q6 quantized models using llama.cpp with multi-token prediction on Strix Halo, showing throughput of 20.24 t/s and 17.17 t/s respectively.
Technical benchmark comparing ROCm and Vulkan backends for LLM inference on Strix Halo hardware after MTP merged into llama.cpp, revealing ROCm suffers severe performance drops at full context while Vulkan remains stable.
llama.cpp releases version b9495 with optimizations for Qwen3.6/3.5-MTP (Multi-Token Prediction) and requests users to share their benchmark results with full command details.
MTP (Multi-Token Prediction) can accelerate LLM inference by 2x, especially for coding agents. This video demonstrates performance improvements with Qwen 3.6 on AMD Strix Halo and Dual Radeon 9700.