B9109: preemptive fix for mtp & mmproj fix soon? It appears so
Summary
Upcoming llama.cpp updates address crashes that occur when multimodal projection (mmproj) is combined with multi-token prediction (MTP), by enabling image processing through draft contexts. The changes also introduce parallel draft support to improve speculative decoding scalability.
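To make the mechanism concrete, here is a minimal, self-contained sketch of the greedy draft-and-verify loop that speculative decoding (and an MTP drafter) relies on, and of why the draft context has to be fed the same image-derived prefix as the target context. The ToyModel class and its prefill/next_token methods are stand-ins invented for illustration; this is not llama.cpp's actual API.

```python
# Minimal, self-contained sketch of the greedy draft-and-verify loop behind
# speculative decoding. This is NOT llama.cpp's API: ToyModel, prefill() and
# next_token() are invented stand-ins used only to show why the draft context
# must receive the same image-derived prefix as the target context.

class ToyModel:
    """Toy LM: next token = (last token + step + sum(prefix)) % vocab."""
    def __init__(self, step, vocab=100):
        self.step, self.vocab = step, vocab
        self.prefix = []                      # stands in for projected image embeddings

    def prefill(self, prefix):
        self.prefix = list(prefix)

    def next_token(self, tokens):
        # The stored prefix influences every prediction, so a draft that never
        # saw the image prefix diverges from the target and acceptance collapses.
        return (tokens[-1] + self.step + sum(self.prefix)) % self.vocab


def speculative_step(target, draft, tokens, n_draft=4):
    # 1. The draft model proposes n_draft tokens cheaply (an MTP head plays this role).
    draft_tokens, ctx = [], list(tokens)
    for _ in range(n_draft):
        t = draft.next_token(ctx)
        draft_tokens.append(t)
        ctx.append(t)

    # 2. The target verifies: keep the longest prefix it would also have produced.
    accepted, ctx = [], list(tokens)
    for t in draft_tokens:
        if target.next_token(ctx) == t:
            accepted.append(t)
            ctx.append(t)
        else:
            break

    # 3. The target always contributes one token, so generation never stalls.
    accepted.append(target.next_token(ctx))
    return accepted


if __name__ == "__main__":
    target, draft = ToyModel(step=3), ToyModel(step=3)
    image_prefix = [7, 7, 7]                  # pretend mmproj output
    target.prefill(image_prefix)
    draft.prefill(image_prefix)               # the "preemptive fix": the draft sees the image too
    print(speculative_step(target, draft, tokens=[1]))   # all 4 draft tokens accepted + 1 bonus token
```

If the `draft.prefill(...)` call is dropped, the two toy models diverge immediately and almost nothing is accepted, which is the sketch's analogue of the mmproj/MTP mismatch the upcoming fix addresses.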
Similar Articles
@ivanfioravanti: llamacpp is gonna get MTP support soon!
llama.cpp will soon support Multi-Token Prediction (MTP), enhancing inference efficiency.
New Gemma 4 MTP on MLX?
Google released Multi-Token Prediction (MTP) drafters for Gemma 4 to accelerate inference via speculative decoding, but MLX support is currently unconfirmed or unavailable.
MTP+GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 - llama.cpp
A user benchmarked token generation speed in llama.cpp with the GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 environment variable set, comparing performance with and without MTP (Multi-Token Prediction). Results show a significant speedup, from 49 tok/s to 64 tok/s (roughly 1.3x), when MTP is enabled on an RTX 5090 with a Qwen3.6-27B model.
MTP is all about acceptance rate
A user benchmarked MTP (Multi-Token Prediction) on Gemma 4 with mlx-vlm on an M4 Max Studio. It was excellent for code generation (1.53x faster, 66% acceptance) but detrimental for JSON output (50% slower, only 8% acceptance) and neutral for long-form prose, suggesting that MTP benefits vanish when acceptance drops below 50%; a simplified acceptance-rate model is sketched at the end of this section.
@jundotkim: oMLX 0.3.9.dev2 released. Highlights: - Gemma 4 MTP on the vision path (thanks to @Prince_Canuma's mlx-vlm). Image+text…
oMLX 0.3.9.dev2 is released with improved Gemma 4 support, DFlash engine integration, and ParoQuant capabilities for local LLM inference on Apple Silicon.
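The acceptance-rate observation above generalizes: under a simple independence assumption, k draft tokens with per-token acceptance probability a yield an expected (1 - a^(k+1)) / (1 - a) tokens per verification step. The sketch below plugs in the acceptance rates reported in the Gemma 4 benchmark; the relative draft-token cost (draft_cost=0.2 per draft token versus one target forward pass) and the independence assumption are illustrative, not measured.

```python
# Back-of-envelope model of why acceptance rate dominates MTP gains. The
# independence assumption and the relative draft cost (draft_cost=0.2 per
# draft token vs. one target forward pass) are illustrative assumptions,
# not measurements from the benchmark above.

def expected_speedup(acceptance, n_draft=4, draft_cost=0.2):
    # With per-token acceptance probability a and k draft tokens, the expected
    # number of tokens emitted per verification step (including the target's
    # bonus token) is (1 - a**(k+1)) / (1 - a).
    a, k = acceptance, n_draft
    tokens_per_step = (1 - a ** (k + 1)) / (1 - a) if a < 1 else k + 1
    step_cost = 1 + k * draft_cost            # one target pass + k cheap draft tokens
    return tokens_per_step / step_cost

for a in (0.66, 0.50, 0.08):                  # code-gen, break-even, JSON cases
    print(f"acceptance {a:.0%}: ~{expected_speedup(a):.2f}x")
```

With these assumed costs the model predicts roughly 1.4x at 66% acceptance, near break-even at 50%, and a clear slowdown at 8%, matching the direction (though not the exact magnitude) of the numbers reported above.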