B9109: preemptive fix for mtp & mmproj fix soon? It appears so

Reddit r/LocalLLaMA Tools

Summary

Upcoming llama.cpp updates address crashes that occur when multimodal projection (mmproj) is combined with multi-token prediction (MTP), by enabling image processing through the draft context. The changes also introduce parallel draft support to improve speculative decoding scalability.

The three relevant commits in this build:

- `spec : process images through the draft context`: this directly addresses the mmproj + MTP crash. Previously, images (mmproj) couldn't be processed through the speculative/draft context at all; this commit adds that capability and is the actual fix in progress.
- `server : fix mtmd draft processing`: mtmd is the multimodal (mmproj) handler. Explicitly fixing draft processing for multimodal means the developers know about the crash and are targeting it.
- `spec : support parallel drafts`: infrastructure for running multiple draft models simultaneously, which is required for MTP to work properly at scale with parallel slots.

Taken together, landing all three in one build (the multimodal draft fix, parallel draft support, and images through the draft context) suggests a focused push to get MTP + mmproj working together. PR #22673 might not be far behind.

Similar Articles

New Gemma 4 MTP on MLX?

Reddit r/LocalLLaMA

Google released Multi-Token Prediction (MTP) drafters for Gemma 4 to accelerate inference via speculative decoding, but support for MLX is currently unconfirmed or unavailable.

MTP+GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 - llama.cpp

Reddit r/LocalLLaMA

A user benchmarks token generation speed in llama.cpp with the GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 flag, comparing performance with and without MTP (Multi-Token Prediction). Results show a significant speedup, from 49 tok/s to 64 tok/s, when MTP is enabled on an RTX 5090 running a Qwen3.6-27B model.
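For reference, GGML_CUDA_ENABLE_UNIFIED_MEMORY is an environment variable read by llama.cpp's CUDA backend, not a compile flag. A minimal invocation sketch, where the model filename and layer count are illustrative placeholders rather than the poster's actual setup:

```shell
# Unified memory lets the CUDA backend fall back to pageable host memory
# when VRAM is exhausted, instead of failing to allocate.
GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 ./llama-server \
  -m qwen-model.gguf \
  --n-gpu-layers 99
```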

MTP is all about acceptance rate

Reddit r/LocalLLaMA

A user benchmarked MTP (Multi-Token Prediction) on Gemma 4 with mlx-vlm on an M4 Max Studio. They found it excellent for code generation (1.53x faster, 66% acceptance), detrimental for JSON output (50% slower, only 8% acceptance), and neutral for long-form prose, suggesting MTP's benefits vanish once the acceptance rate drops below 50%.
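The acceptance-rate cliff can be seen with a back-of-envelope cost model. Assuming an i.i.d. per-token acceptance probability `a`, a draft that costs a fraction `c` of a target forward pass, and `k` drafted tokens per step (a simplification in the spirit of the standard speculative decoding analysis, not numbers taken from the post):

```python
def expected_accepted(a, k):
    # Expected tokens produced per verify step when drafting k tokens:
    # the accepted prefix plus the target's one corrected token, which
    # sums the geometric series to (1 - a**(k+1)) / (1 - a) for a < 1.
    return (1 - a ** (k + 1)) / (1 - a)

def speedup(a, k, c=0.1):
    # Each step costs k draft passes (k*c) plus one target pass (1),
    # versus expected_accepted(a, k) plain target passes.
    return expected_accepted(a, k) / (1 + k * c)

print(round(speedup(0.66, 4), 2))  # code-gen-like acceptance: clear win
print(round(speedup(0.08, 4), 2))  # JSON-like acceptance: a slowdown
```

With 66% acceptance the model predicts a healthy speedup, while at 8% acceptance the ratio falls below 1.0, i.e. speculation actively costs time, consistent with the direction of the benchmark's findings.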