A user reports that MTP versions of Qwen 3.6 and Gemma 4 models produce lower quality outputs in code review tasks compared to non-MTP counterparts, with only marginal real-world speed improvements despite higher token generation rates.
Hi. I am self-hosting Qwen 3.6 27B Q8_K_XL with llama.cpp on 4x 5070 Ti. (All 4 cards are on a single x16 slot bifurcated to 4x4 with risers.)

I've been testing it on several work repos with Opencode CLI, and in about 8 out of 10 cases the output of the non-MTP model is far superior to the MTP one. The prompt is simple: `Do a code review of this branch.` The non-MTP model produces more findings, with more detailed descriptions and fix suggestion snippets; everything is better. It usually takes fewer tokens too (for example, ~40k for non-MTP vs ~60k for MTP).

And real-life speed is not so great either:

- Non-MTP: ~2000 pp/s and ~50-60 tg/s.
- MTP: ~1300 pp/s and ~100-120 tg/s.

So while MTP has double the TG numbers, real-life agent tasks finish within about 20% of each other in wall-clock time when comparing MTP vs non-MTP.

I do not understand what I am doing wrong. Everyone swears that MTP is basically free performance with the same quality, but for me MTP degrades the output, needs more VRAM (which I expected, of course), and consumes more context...
My settings

__Qwen MTP__ (file from https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF)

```bash
exec /opt/llama.cpp/build-cuda/bin/llama-server \
  --host 0.0.0.0 \
  --port 8081 \
  --alias Qwen3.6-27B \
  --model /opt/models/qwen36/27b/unsloth/Qwen3.6-27B-UD-Q8_K_XL.gguf \
  --ctx-size 262144 \
  --device CUDA0,CUDA1,CUDA2,CUDA3 \
  --fit off \
  --split-mode tensor \
  --tensor-split 1,1,1,1 \
  --gpu-layers all \
  --flash-attn on \
  --kv-offload \
  --cache-type-k f16 \
  --cache-type-v f16 \
  --batch-size 4096 \
  --ubatch-size 1024 \
  --parallel 1 \
  --jinja \
  --top-p 0.95 \
  --top-k 20 \
  --temp 0.6 \
  --min-p 0.00 \
  --spec-type draft-mtp \
  --spec-draft-n-max 2 \
  --no-cache-idle-slots \
  --cache-ram 32768 \
  --presence-penalty 0.0 \
  --repeat-penalty 1.0 \
  --mmproj /opt/models/qwen36/27b/unsloth/mmproj-BF16.gguf \
  --image-min-tokens 1024 \
  --cache-prompt \
  --ctx-checkpoints 128 \
  --checkpoint-min-step 512 \
  --cache-reuse 512 \
  --cache-idle-slots \
  --no-context-shift \
  --no-kv-unified \
  --slot-prompt-similarity 0.10 \
  --reasoning on \
  --chat-template-kwargs '{"preserve_thinking":true}' \
  --no-mmproj-offload
```

For __Qwen Non MTP__ (file from https://huggingface.co/unsloth/Qwen3.6-27B-GGUF) the only thing that differs is:

```bash
--model /opt/models/qwen36/27b/unsloth/Qwen3.6-27B-UD-NoMTP-Q8_K_XL.gguf
# missing --spec-type and --spec-draft-n-max flags
```

I also tried https://huggingface.co/unsloth/gemma-4-31B-it-qat-GGUF and had a similar experience comparing MTP and non-MTP.

Has anyone had a similar experience?

P.S. I'll add some examples on some OSS repos, perhaps with llama.cpp logs, when I get home.
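For anyone who wants to sanity-check the "double TG but only ~20% faster" observation, here is a rough back-of-the-envelope sketch. It models wall-clock time as prefill time plus generation time, using the pp/s and tg/s midpoints quoted in the post. The split of the ~40k and ~60k totals into prompt tokens and generated tokens is not given in the post, so the split below is purely a hypothetical assumption, and the function name `wall_clock_seconds` is made up for illustration.

```python
def wall_clock_seconds(prompt_tokens, gen_tokens, pp_rate, tg_rate):
    """Rough estimate: time to process the prompt plus time to generate."""
    return prompt_tokens / pp_rate + gen_tokens / tg_rate


# Rates are midpoints of the ranges reported in the post.
# The prompt/generation token split is a HYPOTHETICAL assumption (75% / 25%).
non_mtp = wall_clock_seconds(
    prompt_tokens=30_000, gen_tokens=10_000, pp_rate=2000, tg_rate=55
)
mtp = wall_clock_seconds(
    prompt_tokens=45_000, gen_tokens=15_000, pp_rate=1300, tg_rate=110
)

ratio = mtp / non_mtp
print(f"non-MTP: {non_mtp:.0f} s, MTP: {mtp:.0f} s, MTP/non-MTP: {ratio:.2f}")
```

Under these assumed numbers, slower prefill and the larger token count eat most of the generation speedup, which is at least consistent with the small end-to-end difference described above. A different split would shift the result, so treat it as an illustration only.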
A user benchmarked MTP (Multi-Token Prediction) on Gemma 4 with mlx-vlm on an M4 Max Mac Studio, finding it excellent for code generation (1.53x faster, 66% acceptance) but detrimental for JSON output (50% slower, only 8% acceptance) and neutral for long-form prose, suggesting that MTP's benefits vanish when acceptance drops below 50%.
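The link between acceptance rate and speedup can be illustrated with the standard speculative-decoding estimate. This is a minimal sketch, assuming each drafted token is accepted independently with the same probability and that one verification pass costs about the same as one normal decoding step. The `draft_cost` value (cost of drafting one token relative to a main-model step) is a hypothetical number, and the "acceptance" figures reported by benchmark tools may be defined differently from the per-token rate used here.

```python
def expected_tokens_per_step(accept_rate, n_draft):
    """Expected tokens emitted per verification pass.

    Counts the accepted draft tokens plus the one token the main model
    always contributes, assuming independent per-token acceptance.
    """
    if accept_rate >= 1.0:
        return float(n_draft + 1)
    return (1 - accept_rate ** (n_draft + 1)) / (1 - accept_rate)


def estimated_speedup(accept_rate, n_draft, draft_cost):
    """Expected tokens per step divided by the relative cost of that step."""
    step_cost = 1 + n_draft * draft_cost  # one verify pass + n draft passes
    return expected_tokens_per_step(accept_rate, n_draft) / step_cost


# draft_cost=0.2 is a HYPOTHETICAL relative cost, chosen only for illustration.
for rate in (0.08, 0.50, 0.66):
    speedup = estimated_speedup(rate, n_draft=2, draft_cost=0.2)
    print(f"acceptance {rate:.2f}: estimated speedup {speedup:.2f}x")
```

With these assumed costs, a low acceptance rate gives an estimate below 1x (a slowdown), because the drafting work is paid for on every step while few drafted tokens survive verification.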
Benchmarks of Multi-Token Prediction (MTP) support in llama.cpp for the Qwen3.6-35B-A3B model on a 6GB VRAM laptop show that MTP is not worth using due to significantly slower prompt processing outweighing minor generation speed gains. The author found that using q4_0 quantization for the draft KV cache saves VRAM without hurting quality.
Benchmarks of Multi-Token Prediction (MTP) on Gemma 4 31B and Qwen 3.6 27B using vLLM and llama.cpp show up to 3.34x faster inference, with optimal speculative token counts varying by model and engine.
A user benchmarks the MTP variant of Qwen3.6 27B against the normal version on a single RTX 3090 using llama.cpp, finding MTP offers up to 2.37x faster generation at long contexts (32k-64k) but with slower prefill and no concurrency support yet.