A detailed technical exploration of MTP speculative decoding in llama.cpp with Gemma 4 models, showing that assistant model selection and quantization significantly impact speedups, and that not all 'same name' assistants perform equally.
Since their release, MTP assistants have drawn a lot of rejection because MTP supposedly doesn't work. It does; it's just tough to get right.

I've been experimenting with MTP speculative decoding in llama.cpp, and one thing became obvious pretty quickly: not all MTP assistants are created equal. I run Gemma 4 Heretic models locally, and the difference between the wrong assistant and the right assistant was massive. Just because you're running Gemma 4 26B Q4 does not mean you can plug in any Gemma 4 26B Q4 assistant draft model.

My results so far:

- Gemma 4 26B Heretic Q8: ~30 t/s → ~55-62 t/s
- Gemma 4 12B Heretic Q4: ~12 t/s → ~35-54 t/s
- Gemma 4 26B QAT/Q4 Heretic Vision: ~65 t/s → ~70-75 t/s
- Gemma 4 31B Q4 Heretic Vision: ~14 t/s → ~25-30 t/s

The biggest lesson was that simply loading an assistant model does not mean it will work well. And the same name does NOT mean the same performance. Two models on Hugging Face both named gemma 4 31b q4 assistant.gguf do not perform alike and are not always copies of each other.

For the 26B Q4 model alone, I tested at least 6 assistants. Some were already available as GGUFs. Others I downloaded from Hugging Face and quantized myself. Some technically worked but gave poor acceptance rates. Others provided almost no measurable speedup. Eventually I found working pairings for all 4 models.

Another interesting discovery came from Google's official Gemma 4 assistant models. I downloaded the official assistant/MTP models from Hugging Face, converted them to GGUF, and generated multiple variants, including Q4, Q8, and unquantized versions. The results surprised me. For both the 12B and 31B models, the unquantized assistant consistently outperformed the quantized assistants. The Q4 assistants still improved performance over running without MTP, but the unquantized assistants were often roughly 10 t/s faster. In other words, assistant quantization matters too.
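To compare pairings on equal footing, it can help to turn the before/after figures reported above into speedup ratios. The snippet below is a minimal sketch that does only that arithmetic; the dictionary layout and the function names (`speedup_range`, `summarize`) are my own choices for illustration, and the numbers are copied from the results list, with each "after" range kept as a (low, high) pair.

```python
# Before/after throughput in tokens/s, copied from the results list above.
# Each entry is (baseline, (mtp_low, mtp_high)).
RESULTS = {
    "Gemma 4 26B Heretic Q8": (30.0, (55.0, 62.0)),
    "Gemma 4 12B Heretic Q4": (12.0, (35.0, 54.0)),
    "Gemma 4 26B QAT/Q4 Heretic Vision": (65.0, (70.0, 75.0)),
    "Gemma 4 31B Q4 Heretic Vision": (14.0, (25.0, 30.0)),
}


def speedup_range(baseline, mtp_range):
    """Return (low, high) speedup factors relative to the baseline."""
    low, high = mtp_range
    return low / baseline, high / baseline


def summarize(results):
    """Return a list of (model name, low speedup, high speedup), rounded."""
    rows = []
    for name, (baseline, mtp_range) in results.items():
        low, high = speedup_range(baseline, mtp_range)
        rows.append((name, round(low, 2), round(high, 2)))
    return rows


if __name__ == "__main__":
    for name, low, high in summarize(RESULTS):
        print(f"{name}: {low:.2f}x to {high:.2f}x")
```

Since the inputs are approximate ("~") figures, the ratios are rough guides rather than precise measurements.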
A few other observations:

- Some assistants loaded successfully but barely improved performance.
- Some assistants had poor draft acceptance rates and actually reduced gains.
- Some mismatched assistants crashed with tensor shape/assertion errors.
- Higher draft counts were never better (could be the nature of Heretic). ALL of my best results came from "spec-draft-n-max = 1".
- The slower the base model, the larger the benefit tended to be.

One thing I learned quickly is that you need to verify MTP is actually active. I started watching the logs for:

    common_speculative_impl_draft_mtp: adding speculative implementation 'draft-mtp'

and then checking draft acceptance rates and real-world generation speed. Without that confirmation, it's very easy to think you're benchmarking MTP when you're actually just benchmarking the base model, because MTP can drop out silently.

One of the more interesting results was getting MTP working alongside vision on the 26B QAT/Q4 model. I expected to need separate vision and text configurations, but the model loaded successfully with:

- Vision (mmproj)
- Draft-MTP
- 96k context
- Flash Attention

and still generated around 70+ t/s in text workloads while retaining image support.

My overall takeaway: if you tried MTP once and got weak results, don't assume MTP is useless. Try different assistants. Try different quantizations. Watch your acceptance rates. Verify MTP actually initialized. For me, the difference between "an assistant model" and "the right assistant model" was often the difference between a small improvement and a 2x speedup. So far, "it loads" and "it's the right assistant" aren't the same.
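The "verify MTP is actually active" step could be scripted roughly as follows. This is a minimal sketch, not part of llama.cpp: it assumes you have saved the server or CLI output to a text file, and it only searches for the log message quoted above. The helper name `mtp_initialized` is hypothetical, and the exact log wording may differ between llama.cpp builds, so treat a miss as a reason to read the log by hand rather than as proof.

```python
import sys

# Log message quoted in the post. The wording is an assumption about your
# build; other llama.cpp versions may phrase it differently.
MTP_MARKER = "adding speculative implementation 'draft-mtp'"


def mtp_initialized(log_text: str) -> bool:
    """Return True if the captured log contains the draft-mtp init message."""
    return MTP_MARKER in log_text


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical file name): python check_mtp.py llama.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as f:
        text = f.read()
    if mtp_initialized(text):
        print("draft-mtp initialized")
    else:
        print("draft-mtp marker not found; you may be benchmarking the base model")
```

Finding the marker only shows that MTP initialized. Acceptance rates and real generation speed still need to be checked separately, as described above.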
A new implementation of Multi-Token Prediction (MTP) in llama.cpp achieves a 40% speedup for Gemma 4 models, tested on a MacBook Pro M5 Max. The post provides links to quantized GGUF models and the patched source code.
A user benchmarked MTP (Multi-Token Prediction) on Gemma 4 with mlx-vlm on M4 Max Studio, finding it excellent for code generation (1.53x faster, 66% acceptance) but detrimental for JSON output (50% slower, only 8% acceptance) and neutral for long-form prose, suggesting MTP benefits vanish when acceptance drops below 50%.
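Why acceptance rate dominates can be illustrated with the idealized model commonly used in the speculative decoding literature; it is not taken from the benchmark above. The sketch assumes each drafted token is accepted independently with probability `alpha`, that verifying the drafted tokens costs the same as one ordinary target-model pass, and that one draft token costs a fraction `draft_cost` of a target pass. All three are simplifying assumptions, and the function names are mine.

```python
def expected_tokens_per_step(alpha: float, n_draft: int) -> float:
    """Expected tokens emitted per target-model pass.

    Assumes each of the n_draft drafted tokens is accepted independently
    with probability alpha; the target pass always contributes one token.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if n_draft < 0:
        raise ValueError("n_draft must be non-negative")
    if alpha == 1.0:
        return float(n_draft + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^n_draft
    return (1.0 - alpha ** (n_draft + 1)) / (1.0 - alpha)


def idealized_speedup(alpha: float, n_draft: int, draft_cost: float) -> float:
    """Speedup over plain decoding under the assumptions stated above.

    draft_cost is the (hypothetical) cost of drafting one token, expressed
    as a fraction of one target-model pass.
    """
    return expected_tokens_per_step(alpha, n_draft) / (1.0 + n_draft * draft_cost)


if __name__ == "__main__":
    for alpha in (0.08, 0.50, 0.66):
        print(alpha, round(idealized_speedup(alpha, n_draft=1, draft_cost=0.1), 3))
```

With an assumed `draft_cost` of 0.1 and one drafted token, this model gives a gain at 66% acceptance and a small loss at 8%. It does not reproduce the reported 50% slowdown for JSON, which suggests the real run had overheads this model leaves out.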
A user seeks clarification on the relationship between MTP (Multi-Token Prediction) and QAT (Quantization-Aware Training) in llama.cpp, particularly regarding GGUF compatibility for the Gemma 4 model and the new QAT string in filenames.
Benchmarks of Multi-Token Prediction (MTP) on Gemma 4 31B and Qwen 3.6 27B using vLLM and llama.cpp show up to 3.34x faster inference, with optimal speculative token counts varying by model and engine.
llama.cpp is an open-source C/C++ library for efficient LLM inference on various hardware, supporting multiple quantization formats and GPU backends. This README details its features, installation, and recent updates including Hugging Face cache migration and multimodal support.