The article benchmarks the Qwen3.6-27B model using Multi-Token Prediction (MTP) and tensor parallelism on dual Mi50 GPUs, demonstrating significant speedups via llama.cpp.
**TLDR:** The hype is real! 1.5x speedup. Up to 2x speedup with tensor parallelism! After reading the PR I immediately hunted for MTP-compatible Q4\_1 quants (they offer a small speedup on these compute-lacking older cards) but couldn't find any. Luckily I came across [this](https://www.reddit.com/r/LocalLLaMA/comments/1t6r1ny/extracted_mtp_tensor_ggufs_smaller_donor_models/) post which highlighted how to transplant MTP grafting onto your own quants, and thus attached it to Bartowski's quant I already had. # Setup * CachyOS (Arch Linux) * ROCm 7.2 Built the llama.cpp fork [https://github.com/skyne98/llama.cpp-gfx906](https://github.com/skyne98/llama.cpp-gfx906) with [https://github.com/ggml-org/llama.cpp/pull/22673](https://github.com/ggml-org/llama.cpp/pull/22673) and ran the following command with the included PR benchmark script: llama-server -m ~/models/Qwen3.6-27B-MTP-Q4_1.gguf \ --temp 1.0 --min-p 0.0 --top-k 20 --top-p 0.95 \ --jinja --presence-penalty 1.5 \ --chat-template-kwargs '{"preserve_thinking": true}' \ -ub 2048 -b 2048 \ -fa 1 -np 1 \ --no-mmap --no-warmup \ -dev ROCm0,ROCm1 --fit on -fitt 256 # Script Benchmark Stock: code_python pred= 192 draft= 0 acc= 0 rate=n/a tok/s=26.2 code_cpp pred= 192 draft= 0 acc= 0 rate=n/a tok/s=26.2 explain_concept pred= 192 draft= 0 acc= 0 rate=n/a tok/s=26.3 summarize pred= 192 draft= 0 acc= 0 rate=n/a tok/s=26.4 qa_factual pred= 192 draft= 0 acc= 0 rate=n/a tok/s=26.4 translation pred= 192 draft= 0 acc= 0 rate=n/a tok/s=26.4 creative_short pred= 192 draft= 0 acc= 0 rate=n/a tok/s=26.4 stepwise_math pred= 192 draft= 0 acc= 0 rate=n/a tok/s=26.3 long_code_review pred= 192 draft= 0 acc= 0 rate=n/a tok/s=26.0 With MTP on: `--spec-type mtp --spec-draft-n-max 2` code_python pred= 192 draft= 144 acc= 119 rate=0.826 tok/s=39.6 code_cpp pred= 192 draft= 156 acc= 113 rate=0.724 tok/s=36.5 explain_concept pred= 192 draft= 154 acc= 113 rate=0.734 tok/s=36.7 summarize pred= 192 draft= 138 acc= 121 rate=0.877 tok/s=40.7 qa_factual pred= 192 draft= 144 acc= 119 rate=0.826 tok/s=39.4 translation pred= 192 draft= 152 acc= 115 rate=0.757 tok/s=37.5 creative_short pred= 192 draft= 156 acc= 113 rate=0.724 tok/s=36.6 stepwise_math pred= 192 draft= 146 acc= 118 rate=0.808 tok/s=39.0 long_code_review pred= 192 draft= 150 acc= 115 rate=0.767 tok/s=37.8 Aggregate: { "n_requests": 9, "total_predicted": 1728, "total_draft": 1340, "total_draft_accepted": 1046, "aggregate_accept_rate": 0.7806, "wall_s_total": 51.42 } With tensor parallelism on: `-sm tensor` code_python pred= 192 draft= 0 acc= 0 rate=n/a tok/s=35.0 code_cpp pred= 192 draft= 0 acc= 0 rate=n/a tok/s=34.8 explain_concept pred= 192 draft= 0 acc= 0 rate=n/a tok/s=34.6 summarize pred= 192 draft= 0 acc= 0 rate=n/a tok/s=34.6 qa_factual pred= 192 draft= 0 acc= 0 rate=n/a tok/s=34.7 translation pred= 192 draft= 0 acc= 0 rate=n/a tok/s=34.7 creative_short pred= 192 draft= 0 acc= 0 rate=n/a tok/s=34.7 stepwise_math pred= 192 draft= 0 acc= 0 rate=n/a tok/s=34.6 long_code_review pred= 192 draft= 0 acc= 0 rate=n/a tok/s=34.3 Combining MTP and tensor parallelism: code_python pred= 192 draft= 142 acc= 120 rate=0.845 tok/s=59.8 code_cpp pred= 192 draft= 148 acc= 116 rate=0.784 tok/s=56.6 explain_concept pred= 192 draft= 146 acc= 117 rate=0.801 tok/s=56.8 summarize pred= 53 draft= 42 acc= 31 rate=0.738 tok/s=54.5 qa_factual pred= 192 draft= 148 acc= 117 rate=0.790 tok/s=56.8 translation pred= 192 draft= 146 acc= 117 rate=0.801 tok/s=57.3 creative_short pred= 192 draft= 154 acc= 114 rate=0.740 tok/s=54.8 stepwise_math pred= 192 draft= 140 acc= 121 rate=0.864 tok/s=59.6 long_code_review pred= 192 draft= 148 acc= 117 rate=0.790 tok/s=56.2 Aggregate: { "n_requests": 9, "total_predicted": 1589, "total_draft": 1214, "total_draft_accepted": 970, "aggregate_accept_rate": 0.799, "wall_s_total": 32.24 # Real-world benchmark The numbers above looks absolutely insane, however in the real-world the speed up dwindles very quickly - not to mention there's a regression in prefill speed which is currently being worked on. I ran [this](https://github.com/alexziskind1/machine_tests/blob/main/ml/auto_prompter/prompts/extra_long_programming_code_heavy_17947t.txt) 18k coding prompt and it's clear the 60t/s is only observable for very short prompts, but combining MTP and tensor parallelism does indeed net a hefty 2x speedup. Stock: prompt eval time = 53173.24 ms / 19191 tokens ( 2.77 ms per token, 360.91 tokens per second) eval time = 337695.94 ms / 7791 tokens ( 43.34 ms per token, 23.07 tokens per second) total time = 390869.18 ms / 26982 tokens With MTP on: prompt eval time = 84388.11 ms / 19191 tokens ( 4.40 ms per token, 227.41 tokens per second) eval time = 260732.83 ms / 8408 tokens ( 31.01 ms per token, 32.25 tokens per second) total time = 345120.94 ms / 27599 tokens With tensor parallelism: prompt eval time = 41925.27 ms / 19191 tokens ( 2.18 ms per token, 457.74 tokens per second) eval time = 253262.25 ms / 8104 tokens ( 31.25 ms per token, 32.00 tokens per second) total time = 295187.53 ms / 27295 tokens Combining MTP and tensor parallelism: prompt eval time = 49696.04 ms / 19191 tokens ( 2.59 ms per token, 386.17 tokens per second) eval time = 155821.64 ms / 7440 tokens ( 20.94 ms per token, 47.75 tokens per second) total time = 205517.69 ms / 26631 tokens
Benchmark results for running Qwen 3.6 27B on AMD MI50 GPUs using a custom vllm fork, achieving 52.8 tokens/s TG and 1569 tokens/s PP without quantization or MTP, demonstrating usability for agentic tasks on 2018 hardware.
A user benchmarks the MTP variant of Qwen3.6 27B against the normal version on a single RTX 3090 using llama.cpp, finding MTP offers up to 2.37x faster generation at long contexts (32k-64k) but with slower prefill and no concurrency support yet.
A technical test of llama.cpp's new Multi-Token Prediction (MTP) support using Qwen3.6 models on an RTX 5090, comparing performance with and without MTP across different prompts and GGUF quantizations.
Benchmark comparison of Qwen3.5-122B Q5 and Q6 quantized models using llama.cpp with multi-token prediction on Strix Halo, showing throughput of 20.24 t/s and 17.17 t/s respectively.
Benchmarks of Multi-Token Prediction (MTP) on Gemma 4 31B and Qwen 3.6 27B using vLLM and llama.cpp show up to 3.34x faster inference, with optimal speculative token counts varying by model and engine.