MTP is all about acceptance rate

Reddit r/LocalLLaMA 05/08/26, 10:11 PM News

mtp benchmark gemma4 mlx-vlm performance apple-silicon

Summary

A user benchmarked MTP (Multi-Token Prediction) on Gemma 4 (Gemma4-26b-a4b) with mlx-vlm on an M4 Max Studio. MTP sped up code generation (1.53x, 66% draft acceptance), was roughly neutral for long-form prose (0.95x, 31% acceptance), and halved throughput for JSON output (0.50x, 8% acceptance). The poster's testing suggests the benefit disappears once acceptance drops below 50%.

So I was very excited about the MTP stuff, especially since Gemma4 has become my "daily driver" for some stuff. I grabbed the latest mlx-vlm, ran some tests, and found it disappointing.

| Workload | MTP off | MTP on | Result | Draft accept rate |
|---|---|---|---|---|
| Code generation | 75 tok/s | 114.8 tok/s | 1.53× faster | 66% of slots |
| Long-form prose | 75 tok/s | 71.1 tok/s | 0.95× (wash) | 31% of slots |
| JSON output | 51.3 tok/s | 25.6 tok/s | **0.50× (half speed)** | 8% of slots |

- Code generation was the typical "Write some python functions to do X"
- Long-form prose was "Write an 800 word essay on paper money in the Tang Dynasty"
- JSON output was my core use case, where I'm handing the LLM a list of items, asking it to group them by similarity according to some rules, and getting them back in a structured output*.

So if you want to use it for local coding, MTP is great. If you're not, maybe not so hot. My regression testing seems to indicate that once token acceptance dips below 50%, the overhead kills the benefit.

All this on an M4 Max Studio w/Gemma4-26b-a4b.

*Bonus for you hackers: Gemma's JSON structure instruction following is pretty good, and I find using structured output to be about a 20% hit to token generation. It is faster to just accept a little bit of sloppy JSON and massage it at runtime, so all this is with json_schema off, which mlx-vlm doesn't support for spec-decode anyway.
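For anyone who wants to plug in their own numbers, here is a rough Python sketch of the arithmetic behind the table and the "below 50%" observation. The throughput and acceptance figures are copied from the post. The cost model is a toy assumption, not something the post or mlx-vlm specifies: `draft_slots` (drafted tokens per step) and `step_cost` (cost of one MTP step relative to a plain decode step) are hypothetical parameters introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    workload: str
    tok_s_off: float    # tokens/s with MTP off (from the post's table)
    tok_s_on: float     # tokens/s with MTP on (from the post's table)
    accept_rate: float  # fraction of draft slots accepted (from the post's table)


RUNS = [
    Run("Code generation", 75.0, 114.8, 0.66),
    Run("Long-form prose", 75.0, 71.1, 0.31),
    Run("JSON output", 51.3, 25.6, 0.08),
]


def measured_speedup(run: Run) -> float:
    """Ratio of MTP-on to MTP-off throughput, i.e. the table's Result column."""
    return run.tok_s_on / run.tok_s_off


def modeled_speedup(accept_rate: float, draft_slots: int, step_cost: float) -> float:
    """Toy model (an assumption, not the poster's method).

    Each MTP step emits 1 verified token plus accept_rate * draft_slots
    accepted draft tokens, and costs step_cost plain decode steps.
    """
    return (1.0 + accept_rate * draft_slots) / step_cost


def breakeven_accept_rate(draft_slots: int, step_cost: float) -> float:
    """Acceptance rate at which the toy model predicts a speedup of exactly 1.0."""
    return (step_cost - 1.0) / draft_slots


if __name__ == "__main__":
    for run in RUNS:
        print(f"{run.workload:16s} {measured_speedup(run):.2f}x "
              f"at {run.accept_rate:.0%} acceptance")

    # Hypothetical values: 2 draft slots, MTP step costs 2x a plain step.
    print(f"break-even acceptance: {breakeven_accept_rate(2, 2.0):.0%}")
```

Under those illustrative values (2 draft slots, an MTP step costing twice a plain decode step), the toy model breaks even at 50% acceptance. That lines up with the threshold the poster reports, but the real slot count and overhead were not measured here and will vary with hardware, model, and context length.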

Similar Articles

I tested MTP on vLLM and llama.cpp for Gemma 4 & Qwen 3.6 — 3.34x faster inference, here are my findings RTX 6000 PRO.

MTP (Multi-Token Prediction): 2x Faster Token Generation on AMD Strix Halo & Radeon 9700 AI Pro

Multi-Token Prediction (MTP) for LLaMA.cpp - Gemma 4 speedup by 40%

MTP+GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 - llama.cpp

MTP for Qwen3.6-35B-A3B on 6GB VRAM laptop: not worth it
