Worse quality with MTP - Qwen 3.6, Gemma 4

Reddit r/LocalLLaMA 06/25/26, 07:10 AM News

mtp multi-token-prediction qwen gemma llama.cpp quality-issues self-hosting

Summary

A user reports that MTP versions of Qwen 3.6 and Gemma 4 models produce lower quality outputs in code review tasks compared to non-MTP counterparts, with only marginal real-world speed improvements despite higher token generation rates.

Hi. I am self-hosting Qwen 3.6 27B Q8_K_XL with llama.cpp on 4x 5070 Ti. (All four cards are on a single x16 slot bifurcated to 4x4 with risers.) I've been testing it on several work repos with the Opencode CLI, and in about 8 out of 10 cases the output of the non-MTP model is far superior to the MTP one. The prompt is simple: `Do a code review of this branch.` The non-MTP model produces more findings, with more detailed descriptions and fix suggestion snippets; everything is better. It usually takes fewer tokens too (for example, ~40k for non-MTP vs ~60k for MTP).

And real-life speed is not so great either:

- The non-MTP model gives me ~2000 pp/s and ~50-60 tg/s.
- The MTP model gives me ~1300 pp/s and ~100-120 tg/s.

So while MTP has double the TG numbers, real-life agent tasks finish within about 20% of each other when comparing MTP vs non-MTP.

I do not understand what I am doing wrong. Everyone swears that MTP is free performance with the same quality, but for me MTP degrades the output, needs more VRAM (which I expected, of course), and consumes more context...
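A rough back-of-the-envelope sketch of why doubling tg/s need not halve wall-clock time: an agent run pays for both prompt processing and generation, and the MTP run here is slower at prompt processing and uses more tokens. The throughput figures are the ones reported above. The split of each run into prompt and generated tokens is a hypothetical assumption (the post only gives ~40k vs ~60k overall), and `estimate_seconds` is an illustrative helper, not part of llama.cpp.

```bash
#!/usr/bin/env bash
# Rough wall-clock model: seconds = prompt_tokens / pp_rate + gen_tokens / tg_rate.
# Ignores prompt caching, tool-call latency, and everything else in a real agent loop.
estimate_seconds() {
  local prompt_tokens=$1 gen_tokens=$2 pp_rate=$3 tg_rate=$4
  awk -v p="$prompt_tokens" -v g="$gen_tokens" -v pp="$pp_rate" -v tg="$tg_rate" \
    'BEGIN { printf "%.1f\n", p / pp + g / tg }'
}

# ASSUMPTION: hypothetical token split, 30k prompt + 10k generated (non-MTP)
# vs 45k prompt + 15k generated (MTP). Rates are midpoints of the reported ranges.
non_mtp=$(estimate_seconds 30000 10000 2000 55)
mtp=$(estimate_seconds 45000 15000 1300 110)

echo "non-MTP: ${non_mtp}s"
echo "MTP:     ${mtp}s"
awk -v a="$mtp" -v b="$non_mtp" \
  'BEGIN { printf "MTP / non-MTP time ratio: %.2f\n", a / b }'
```

Under this assumed split the two runs land well within 20% of each other, which matches the observation above. A different split would shift the ratio, so plugging in the real token counts from the server logs is the more useful check.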
My settings

__Qwen MTP__ (file from https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF)

```bash
exec /opt/llama.cpp/build-cuda/bin/llama-server \
  --host 0.0.0.0 \
  --port 8081 \
  --alias Qwen3.6-27B \
  --model /opt/models/qwen36/27b/unsloth/Qwen3.6-27B-UD-Q8_K_XL.gguf \
  --ctx-size 262144 \
  --device CUDA0,CUDA1,CUDA2,CUDA3 \
  --fit off \
  --split-mode tensor \
  --tensor-split 1,1,1,1 \
  --gpu-layers all \
  --flash-attn on \
  --kv-offload \
  --cache-type-k f16 \
  --cache-type-v f16 \
  --batch-size 4096 \
  --ubatch-size 1024 \
  --parallel 1 \
  --jinja \
  --top-p 0.95 \
  --top-k 20 \
  --temp 0.6 \
  --min-p 0.00 \
  --spec-type draft-mtp \
  --spec-draft-n-max 2 \
  --no-cache-idle-slots \
  --cache-ram 32768 \
  --presence-penalty 0.0 \
  --repeat-penalty 1.0 \
  --mmproj /opt/models/qwen36/27b/unsloth/mmproj-BF16.gguf \
  --image-min-tokens 1024 \
  --cache-prompt \
  --ctx-checkpoints 128 \
  --checkpoint-min-step 512 \
  --cache-reuse 512 \
  --cache-idle-slots \
  --no-context-shift \
  --no-kv-unified \
  --slot-prompt-similarity 0.10 \
  --reasoning on \
  --chat-template-kwargs '{"preserve_thinking":true}' \
  --no-mmproj-offload
```

For __Qwen Non MTP__ (file from https://huggingface.co/unsloth/Qwen3.6-27B-GGUF) the only thing that differs is:

```bash
--model /opt/models/qwen36/27b/unsloth/Qwen3.6-27B-UD-NoMTP-Q8_K_XL.gguf
# the --spec-type and --spec-draft-n-max flags are omitted
```

I also tried https://huggingface.co/unsloth/gemma-4-31B-it-qat-GGUF, with a similar experience when comparing MTP and non-MTP.

Has anyone had a similar experience?

P.S. I'll add some examples on some OSS repos, perhaps with llama.cpp logs, when I get home.
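On the `--spec-draft-n-max 2` setting above: with at most two drafted tokens per step, each verification pass can emit at most three tokens. A minimal sketch of the usual speculative-decoding estimate follows, assuming each drafted token is accepted independently with the same probability (a simplification; the real acceptance rate depends on the content and is not reported in the post). `expected_tokens_per_step` is a hypothetical helper, not a llama.cpp tool.

```bash
#!/usr/bin/env bash
# Expected tokens emitted per verification step with up to n_max drafted tokens:
#   1 + a + a^2 + ... + a^n_max
# ASSUMPTION: every drafted token is accepted independently with probability a.
expected_tokens_per_step() {
  local accept=$1 n_max=$2
  awk -v a="$accept" -v n="$n_max" 'BEGIN {
    total = 0; term = 1
    for (k = 0; k <= n; k++) { total += term; term *= a }
    printf "%.2f\n", total
  }'
}

# Illustrative acceptance rates only; none of these values come from the post.
for a in 0.5 0.7 0.9; do
  echo "accept=$a -> $(expected_tokens_per_step "$a" 2) tokens/step"
done
```

This ignores the cost of running the MTP head itself, so the real tg/s gain would be lower than the tokens-per-step figure suggests. The acceptance rate actually achieved on code review prompts would have to come from the server's own logs.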


Similar Articles

MTP is all about acceptance rate

What's your experience with Gemma4 QAT?

MTP for Qwen3.6-35B-A3B on 6GB VRAM laptop: not worth it

I tested MTP on vLLM and llama.cpp for Gemma 4 & Qwen 3.6 — 3.34x faster inference, here are my findings RTX 6000 PRO.

@Snixtp: https://x.com/Snixtp/status/2055734339346768225
