llama.cpp - Qwen3.6/3.5-MTP - Share your benchmarks t/s

Reddit r/LocalLLaMA 06/03/26, 07:07 PM Tools

llama-cpp qwen3 benchmarks multi-token-prediction inference optimization

Summary

llama.cpp release b9495 includes further optimizations and fixes for Qwen3.6/3.5-MTP (Multi-Token Prediction). The poster asks users to try it and share their benchmark results along with the full command they ran.

I think the dust has settled (95+%) for Qwen3.6/3.5-MTP. Since the initial PR there have been many optimizations and fixes. Even earlier today, another MTP-related PR was merged and released ([b9495](https://github.com/ggml-org/llama.cpp/releases/tag/b9495)). So try this latest version and share your benchmarks (t/s)\*. Great work by u/am17an and the other folks.

\* Please share everything so it is useful for others too. Benchmarks that are missing particular details become inaccurate. I/we would also like to have the most optimized full command for the best t/s.

To save your time, just copy your console output with the full command (it has all the important details: model quant, context size, KV cache, fit/ncmoe, MTP, etc.) and paste it here. A sample is below (not mine, pasted from a random thread).

    llama-server \
      -m ../models/Qwen3.6-35B-A3B-MTP-UD-Q5_K_XL.gguf \
      --host 0.0.0.0 \
      --port 8080 \
      --ctx-size 150000 \
      --flash-attn on \
      -b 2048 \
      -ub 512 \
      --cache-type-k q8_0 \
      --cache-type-v q8_0 \
      --jinja \
      --threads 11 \
      --threads-batch 11 \
      -cram 12288 \
      --mlock \
      -fit on \
      --chat-template-kwargs '{"preserve_thinking": true}' \
      --spec-type mtp \
      --spec-draft-n-max 3 \
      --temp 0.6 \
      --top-p 0.95 \
      --top-k 20 \
      --min-p 0.0 \
      -np 1 \
      --presence-penalty 0.0 \
      --repeat-penalty 1.0

    prompt eval time = 128889.09 ms / 26796 tokens (4.81 ms per token, 207.90 tokens per second)
    eval time = 10969.17 ms / 264 tokens (41.55 ms per token, 24.07 tokens per second)
    total time = 139858.26 ms / 27060 tokens
    draft acceptance rate = 0.52614 ( 161 accepted / 306 generated)
    statistics mtp: #calls(b,g,a) = 6 2811 2305, #gen drafts = 2811, #acc drafts = 2305, #gen tokens = 8433, #acc tokens = 5507, dur(b,g,a) = 0.020, 41478.073, 74.975 ms
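If you want to compare pasted results quickly, the headline numbers can be recomputed from the raw timings. Here is a minimal sketch in Python, assuming the log lines keep the same format as the sample in this post. The `summarize` function and both regexes are illustrative names of my own, not part of llama.cpp, and other builds may print slightly different lines.

```python
import re

# Console output copied from the sample in this post. The line format is an
# assumption: other llama-server builds may word these lines differently.
LOG = """\
prompt eval time = 128889.09 ms / 26796 tokens (4.81 ms per token, 207.90 tokens per second)
eval time = 10969.17 ms / 264 tokens (41.55 ms per token, 24.07 tokens per second)
total time = 139858.26 ms / 27060 tokens
draft acceptance rate = 0.52614 ( 161 accepted / 306 generated)
"""

# Matches the "prompt eval time" and "eval time" lines (anchored at line start).
TIMING = re.compile(
    r"^\s*(prompt eval|eval) time\s*=\s*([\d.]+) ms\s*/\s*(\d+) tokens", re.M
)
# Matches the draft acceptance line and captures the two raw counts.
DRAFT = re.compile(
    r"draft acceptance rate\s*=\s*[\d.]+\s*\(\s*(\d+) accepted\s*/\s*(\d+) generated\)"
)


def summarize(log: str) -> dict:
    """Recompute tokens/second and draft acceptance from raw log values."""
    out = {}
    for kind, ms, tokens in TIMING.findall(log):
        key = "prompt_tps" if kind == "prompt eval" else "gen_tps"
        out[key] = int(tokens) / (float(ms) / 1000.0)
    match = DRAFT.search(log)
    if match:
        accepted, generated = map(int, match.groups())
        out["draft_acceptance"] = accepted / generated
    return out


if __name__ == "__main__":
    for name, value in summarize(LOG).items():
        print(f"{name}: {value:.4f}")
```

Recomputing from the millisecond and token counts, rather than trusting the printed t/s, makes it easier to spot a paste where the command and the timings come from different runs.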

Similar Articles

Testing llama.cpp MTP support on Qwen3.6 - RTX 5090

Qwen3.5-122B-Q5-MTP - Qwen3.5-122B-Q6-MTP

Qwen 3.6-27B Dense with MTP on Strix Halo Windows - Benchmarks
Community benchmarks of Qwen 3.6-27B Dense and MTP variants running via llama.cpp on Strix Halo Windows, showing token/s speeds for various tasks.

More Qwen3.6-27B MTP success but on dual Mi50s

Strix Halo Llama.cpp MTP Benchmarks: 27B Gets Much Faster, 35B Is Mixed