More Qwen3.6-27B MTP success but on dual Mi50s

Reddit r/LocalLLaMA 05/09/26, 02:29 PM News

Summary

The post benchmarks the Qwen3.6-27B model with Multi-Token Prediction (MTP) and tensor parallelism on dual Mi50 GPUs using a llama.cpp fork, reporting roughly 1.5x faster generation with MTP alone and about 2x when the two are combined.

**TLDR:** The hype is real! 1.5x speedup. Up to 2x speedup with tensor parallelism!

After reading the PR I immediately hunted for MTP-compatible Q4\_1 quants (they offer a small speedup on these compute-lacking older cards) but couldn't find any. Luckily I came across [this](https://www.reddit.com/r/LocalLLaMA/comments/1t6r1ny/extracted_mtp_tensor_ggufs_smaller_donor_models/) post, which showed how to graft MTP tensors onto your own quants, so I attached them to the Bartowski quant I already had.

# Setup

* CachyOS (Arch Linux)
* ROCm 7.2

I built the llama.cpp fork [https://github.com/skyne98/llama.cpp-gfx906](https://github.com/skyne98/llama.cpp-gfx906) with [https://github.com/ggml-org/llama.cpp/pull/22673](https://github.com/ggml-org/llama.cpp/pull/22673) and ran the following command, together with the benchmark script included in the PR:

    llama-server -m ~/models/Qwen3.6-27B-MTP-Q4_1.gguf \
      --temp 1.0 --min-p 0.0 --top-k 20 --top-p 0.95 \
      --jinja --presence-penalty 1.5 \
      --chat-template-kwargs '{"preserve_thinking": true}' \
      -ub 2048 -b 2048 \
      -fa 1 -np 1 \
      --no-mmap --no-warmup \
      -dev ROCm0,ROCm1 --fit on -fitt 256

# Script Benchmark

Stock:

    code_python       pred= 192 draft=   0 acc=   0 rate=n/a   tok/s=26.2
    code_cpp          pred= 192 draft=   0 acc=   0 rate=n/a   tok/s=26.2
    explain_concept   pred= 192 draft=   0 acc=   0 rate=n/a   tok/s=26.3
    summarize         pred= 192 draft=   0 acc=   0 rate=n/a   tok/s=26.4
    qa_factual        pred= 192 draft=   0 acc=   0 rate=n/a   tok/s=26.4
    translation       pred= 192 draft=   0 acc=   0 rate=n/a   tok/s=26.4
    creative_short    pred= 192 draft=   0 acc=   0 rate=n/a   tok/s=26.4
    stepwise_math     pred= 192 draft=   0 acc=   0 rate=n/a   tok/s=26.3
    long_code_review  pred= 192 draft=   0 acc=   0 rate=n/a   tok/s=26.0

With MTP on: `--spec-type mtp --spec-draft-n-max 2`

    code_python       pred= 192 draft= 144 acc= 119 rate=0.826 tok/s=39.6
    code_cpp          pred= 192 draft= 156 acc= 113 rate=0.724 tok/s=36.5
    explain_concept   pred= 192 draft= 154 acc= 113 rate=0.734 tok/s=36.7
    summarize         pred= 192 draft= 138 acc= 121 rate=0.877 tok/s=40.7
    qa_factual        pred= 192 draft= 144 acc= 119 rate=0.826 tok/s=39.4
    translation       pred= 192 draft= 152 acc= 115 rate=0.757 tok/s=37.5
    creative_short    pred= 192 draft= 156 acc= 113 rate=0.724 tok/s=36.6
    stepwise_math     pred= 192 draft= 146 acc= 118 rate=0.808 tok/s=39.0
    long_code_review  pred= 192 draft= 150 acc= 115 rate=0.767 tok/s=37.8

    Aggregate: {
      "n_requests": 9,
      "total_predicted": 1728,
      "total_draft": 1340,
      "total_draft_accepted": 1046,
      "aggregate_accept_rate": 0.7806,
      "wall_s_total": 51.42
    }

With tensor parallelism on: `-sm tensor`

    code_python       pred= 192 draft=   0 acc=   0 rate=n/a   tok/s=35.0
    code_cpp          pred= 192 draft=   0 acc=   0 rate=n/a   tok/s=34.8
    explain_concept   pred= 192 draft=   0 acc=   0 rate=n/a   tok/s=34.6
    summarize         pred= 192 draft=   0 acc=   0 rate=n/a   tok/s=34.6
    qa_factual        pred= 192 draft=   0 acc=   0 rate=n/a   tok/s=34.7
    translation       pred= 192 draft=   0 acc=   0 rate=n/a   tok/s=34.7
    creative_short    pred= 192 draft=   0 acc=   0 rate=n/a   tok/s=34.7
    stepwise_math     pred= 192 draft=   0 acc=   0 rate=n/a   tok/s=34.6
    long_code_review  pred= 192 draft=   0 acc=   0 rate=n/a   tok/s=34.3

Combining MTP and tensor parallelism:

    code_python       pred= 192 draft= 142 acc= 120 rate=0.845 tok/s=59.8
    code_cpp          pred= 192 draft= 148 acc= 116 rate=0.784 tok/s=56.6
    explain_concept   pred= 192 draft= 146 acc= 117 rate=0.801 tok/s=56.8
    summarize         pred=  53 draft=  42 acc=  31 rate=0.738 tok/s=54.5
    qa_factual        pred= 192 draft= 148 acc= 117 rate=0.790 tok/s=56.8
    translation       pred= 192 draft= 146 acc= 117 rate=0.801 tok/s=57.3
    creative_short    pred= 192 draft= 154 acc= 114 rate=0.740 tok/s=54.8
    stepwise_math     pred= 192 draft= 140 acc= 121 rate=0.864 tok/s=59.6
    long_code_review  pred= 192 draft= 148 acc= 117 rate=0.790 tok/s=56.2

    Aggregate: {
      "n_requests": 9,
      "total_predicted": 1589,
      "total_draft": 1214,
      "total_draft_accepted": 970,
      "aggregate_accept_rate": 0.799,
      "wall_s_total": 32.24
    }

# Real-world benchmark

The numbers above look absolutely insane, but in real-world use the speedup dwindles very quickly, not to mention there's a regression in prefill speed which is currently being worked on. I ran [this](https://github.com/alexziskind1/machine_tests/blob/main/ml/auto_prompter/prompts/extra_long_programming_code_heavy_17947t.txt) 18k coding prompt, and it's clear the 60 t/s is only observable for very short prompts, but combining MTP and tensor parallelism does indeed net a hefty 2x speedup in generation (23.07 to 47.75 t/s).

Stock:

    prompt eval time =   53173.24 ms / 19191 tokens (    2.77 ms per token,   360.91 tokens per second)
           eval time =  337695.94 ms /  7791 tokens (   43.34 ms per token,    23.07 tokens per second)
          total time =  390869.18 ms / 26982 tokens

With MTP on:

    prompt eval time =   84388.11 ms / 19191 tokens (    4.40 ms per token,   227.41 tokens per second)
           eval time =  260732.83 ms /  8408 tokens (   31.01 ms per token,    32.25 tokens per second)
          total time =  345120.94 ms / 27599 tokens

With tensor parallelism:

    prompt eval time =   41925.27 ms / 19191 tokens (    2.18 ms per token,   457.74 tokens per second)
           eval time =  253262.25 ms /  8104 tokens (   31.25 ms per token,    32.00 tokens per second)
          total time =  295187.53 ms / 27295 tokens

Combining MTP and tensor parallelism:

    prompt eval time =   49696.04 ms / 19191 tokens (    2.59 ms per token,   386.17 tokens per second)
           eval time =  155821.64 ms /  7440 tokens (   20.94 ms per token,    47.75 tokens per second)
          total time =  205517.69 ms / 26631 tokens
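
For anyone who wants to sanity-check the ratios, here is a minimal Python sketch that recomputes the prefill and generation speedups from the real-world timing lines above, plus the two aggregate accept rates from the script benchmark. The numbers are copied from the post; the names `RUNS`, `tokens_per_second`, `speedups`, and `accept_rate` are made up for this illustration and are not part of llama.cpp or the PR's benchmark script.

```python
# Figures copied from the llama-server timing lines quoted in the post.
# Each entry: (prompt_ms, prompt_tokens, eval_ms, eval_tokens)
RUNS = {
    "stock":      (53173.24, 19191, 337695.94, 7791),
    "mtp":        (84388.11, 19191, 260732.83, 8408),
    "tensor":     (41925.27, 19191, 253262.25, 8104),
    "mtp+tensor": (49696.04, 19191, 155821.64, 7440),
}


def tokens_per_second(ms, tokens):
    """Convert a (milliseconds, token count) pair to tokens per second."""
    return tokens / (ms / 1000.0)


def speedups(runs, baseline="stock"):
    """Return prefill and decode throughput of each run relative to the baseline."""
    b_p_ms, b_p_tok, b_e_ms, b_e_tok = runs[baseline]
    base_prefill = tokens_per_second(b_p_ms, b_p_tok)
    base_decode = tokens_per_second(b_e_ms, b_e_tok)
    result = {}
    for name, (p_ms, p_tok, e_ms, e_tok) in runs.items():
        result[name] = {
            "prefill_x": tokens_per_second(p_ms, p_tok) / base_prefill,
            "decode_x": tokens_per_second(e_ms, e_tok) / base_decode,
        }
    return result


def accept_rate(accepted, drafted):
    """Fraction of drafted tokens that were accepted."""
    return accepted / drafted


if __name__ == "__main__":
    for name, s in speedups(RUNS).items():
        print(f"{name:<11} prefill x{s['prefill_x']:.2f}  decode x{s['decode_x']:.2f}")
    # Totals taken from the two "Aggregate" blocks of the script benchmark.
    print(f"MTP accept rate:        {accept_rate(1046, 1340):.4f}")
    print(f"MTP+tensor accept rate: {accept_rate(970, 1214):.4f}")
```

The comparison uses tokens per second rather than total time because each run generated a different number of tokens (7791 for stock versus 7440 for the combined run), so wall-clock totals are not directly comparable. On these figures, generation with MTP plus tensor parallelism comes out at a bit over 2x stock, while prefill with MTP alone drops to roughly 0.63x, which matches the prefill regression mentioned above.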

Similar Articles

MI50s Qwen 3.6 27B @52.8 tps TG @1569 tps PP (no MTP, no Quant)

@Snixtp: https://x.com/Snixtp/status/2055734339346768225

Testing llama.cpp MTP support on Qwen3.6 - RTX 5090

Qwen3.5-122B-Q5-MTP - Qwen3.5-122B-Q6-MTP

I tested MTP on vLLM and llama.cpp for Gemma 4 & Qwen 3.6 — 3.34x faster inference, here are my findings RTX 6000 PRO.
