Not All MTP Assistants Are Created Equal

Reddit r/LocalLLaMA 06/12/26, 10:14 AM Tools

mtp speculative-decoding llama-cpp gemma-4 inference-speed local-llm optimization

Summary

A detailed technical exploration of MTP speculative decoding in llama.cpp with Gemma 4 models, showing that assistant model selection and quantization significantly impact speedups, and that not all 'same name' assistants perform equally.

Since release, there has been a lot of pushback against MTP on the grounds that it doesn't work. It does, it's just tough to get right. I've been experimenting with MTP speculative decoding in llama.cpp, and one thing became obvious pretty quickly: not all MTP assistants are created equal.

I run Gemma 4 Heretic models locally, and the difference between the wrong assistant and the right assistant was massive. Just because you're running Gemma 4 26B Q4 does not mean you can plug in any Gemma 4 26B Q4 assistant draft model.

My results so far:

- Gemma 4 26B Heretic Q8: ~30 t/s → ~55-62 t/s
- Gemma 4 12B Heretic Q4: ~12 t/s → ~35-54 t/s
- Gemma 4 26B QAT/Q4 Heretic Vision: ~65 t/s → ~70-75 t/s
- Gemma 4 31B Q4 Heretic Vision: ~14 t/s → ~25-30 t/s

The biggest lesson was that simply loading an assistant model does not mean it will work well. And the same name does NOT mean the same performance. Two models on Hugging Face both named gemma 4 31b 4q assistant.gguf do not necessarily perform alike and are not always copies of each other.

For the 26B Q4 model alone, I tested at least 6 assistants. Some were already available as GGUFs. Others I downloaded from Hugging Face and quantized myself. Some technically worked but gave poor acceptance rates. Others provided almost no measurable speedup. Eventually I found working pairings for all 4 models.

Another interesting discovery came from Google's official Gemma 4 assistant models. I downloaded the official assistant/MTP models from Hugging Face, converted them to GGUF, and generated multiple variants including Q4, Q8, and unquantized versions. The results surprised me. For both the 12B and 31B models, the unquantized assistant consistently outperformed the quantized assistants. The Q4 assistants still improved performance over running without MTP, but the unquantized assistants were often roughly 10 t/s faster. In other words, assistant quantization matters too.
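To put the before/after numbers listed above on a common scale, here is a small Python sketch that converts them into speedup ratios. The throughput figures are copied from the results in this post; the names RESULTS and speedup_range are made up for illustration and are not part of llama.cpp.

```python
# Reported throughput in tokens/s, taken from the post:
# (base without MTP, MTP low end, MTP high end).
RESULTS = {
    "Gemma 4 26B Heretic Q8": (30, 55, 62),
    "Gemma 4 12B Heretic Q4": (12, 35, 54),
    "Gemma 4 26B QAT/Q4 Heretic Vision": (65, 70, 75),
    "Gemma 4 31B Q4 Heretic Vision": (14, 25, 30),
}


def speedup_range(base, mtp_low, mtp_high):
    """Return (low, high) speedup factors relative to the base throughput."""
    return mtp_low / base, mtp_high / base


if __name__ == "__main__":
    # Sort by base speed so the slowest base model is printed first.
    for name, (base, low, high) in sorted(RESULTS.items(), key=lambda kv: kv[1][0]):
        s_low, s_high = speedup_range(base, low, high)
        print(f"{name}: {base} t/s base, {s_low:.2f}x to {s_high:.2f}x with MTP")
```

On these figures the slowest base model (12 t/s) gains the most and the fastest (65 t/s) gains the least, which lines up with the approximate numbers reported here. The inputs are rough ranges from one setup, so treat the ratios as ballpark values only.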
A few other observations:

- Some assistants loaded successfully but barely improved performance.
- Some assistants had poor draft acceptance rates and actually reduced gains.
- Some mismatched assistants crashed with tensor shape/assertion errors.
- Higher draft counts were never better (could be the nature of Heretic). ALL of my best results came from "spec-draft-n-max = 1".
- The slower the base model, the larger the benefit tended to be.

One thing I learned quickly is that you need to verify MTP is actually active. I started watching the logs for:

common_speculative_impl_draft_mtp: adding speculative implementation 'draft-mtp'

and then checking draft acceptance rates and real-world generation speed. Without that confirmation, it's very easy to think you're benchmarking MTP when you're actually just benchmarking the base model, because MTP can silently drop out.

One of the more interesting results was getting MTP working alongside vision on the 26B QAT/Q4 model. I expected to need separate vision and text configurations, but the model loaded successfully with:

- Vision (mmproj)
- Draft-MTP
- 96k context
- Flash Attention

and still generated around 70+ t/s in text workloads while retaining image support.

My overall takeaway: if you tried MTP once and got weak results, don't assume MTP is useless. Try different assistants. Try different quantizations. Watch your acceptance rates. Verify MTP actually initialized.

For me, the difference between "an assistant model" and "the right assistant model" was often the difference between a small improvement and a 2x speedup. So far, "it loads" and "it's the right assistant" aren't the same.
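The log check described above can be scripted so that a benchmark run fails loudly when MTP never initialized. This is a minimal Python sketch, assuming you have saved the llama.cpp output to a file yourself. The marker text is the log line quoted in this post and may differ between builds; the function name mtp_initialized and the file name in the usage comment are hypothetical.

```python
import sys

# Marker text quoted in the post. Assumption: your llama.cpp build logs the
# same wording; adjust the string if your logs differ.
MTP_MARKER = "adding speculative implementation 'draft-mtp'"


def mtp_initialized(log_lines):
    """Return True if any log line contains the draft-mtp marker."""
    return any(MTP_MARKER in line for line in log_lines)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python check_mtp.py server.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        ok = mtp_initialized(fh)
    print("MTP marker found" if ok else "MTP marker not found: you may be benchmarking the base model only")
    sys.exit(0 if ok else 1)
```

Finding the marker only shows that the draft-mtp path was set up. It says nothing about acceptance rate, so the measured t/s and the acceptance figures in the logs still need to be checked by hand.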


Similar Articles

Multi-Token Prediction (MTP) for LLaMA.cpp - Gemma 4 speedup by 40%

MTP is all about acceptance rate

MTP and QTA - what is the relation?

I tested MTP on vLLM and llama.cpp for Gemma 4 & Qwen 3.6 — 3.34x faster inference, here are my findings RTX 6000 PRO.

[WIP] Gemma 4 MTP
