Tag
Tested Multi-Token Prediction on a llama.cpp fork with a Qwen-based MoE model, achieving a +0.41% PPL improvement over the fp16 baseline.
atomic.chat's MTP technique speeds up local LLM inference by drafting multiple tokens and verifying them together, achieving up to 137% speedup on Qwen 27B dense model with zero accuracy loss.
A tweet demonstrates that Multi-Token Prediction (MTP) achieves significant speedups for Qwen models on dual RTX 5090 hardware, suggesting that local inference can now rival cloud-model performance.
Introduces Multi-token Residual Prediction (MRP), a lightweight module for diffusion language models that enables dependency-aware multi-token denoising within a single backbone forward pass, achieving up to 1.42× lossless speedup.
Google AI Edge Gallery v1.0.13 & v1.0.14 updates add support for Gemma 4 with multi-token prediction, Pixel TPU optimization, experimental MCP, new skills, and chat history saving, enhancing on-device generative AI capabilities.
Julien C explains how to run llama.cpp with Multi-token prediction (MTP) for ~2x generation speed, using either the Dense 27B or MoE 35B model, with instructions for installation and configuration.
MTP (Multi-Token Prediction) can accelerate LLM inference by 2x, especially for coding agents. This video demonstrates performance improvements with Qwen 3.6 on AMD Strix Halo and Dual Radeon 9700.
Quantizing the Multi-Token Prediction (MTP) KV cache to q8_0 in llama.cpp for Qwen models reduces VRAM usage without affecting inference speed or acceptance rate, effectively providing a 'free lunch' for memory-constrained setups.
Jackrong releases Qwopus3.5-9B-Coder-MTP-GGUF, a Qwen-based 9B coding model fine-tuned with Multi-Token Prediction (MTP) architecture, achieving 35.8% throughput improvement and 8.3% accuracy gain over the base model, with perfect scores on coding and math benchmarks.
A technical test of llama.cpp's new Multi-Token Prediction (MTP) support using Qwen3.6 models on an RTX 5090, comparing performance with and without MTP across different prompts and GGUF quantizations.
Benchmark comparison of Qwen3.5-122B Q5 and Q6 quantized models using llama.cpp with multi-token prediction on Strix Halo, showing throughput of 20.24 t/s and 17.17 t/s respectively.
A user benchmarks the MTP variant of Qwen3.6 27B against the normal version on a single RTX 3090 using llama.cpp, finding MTP offers up to 2.37x faster generation at long contexts (32k-64k) but with slower prefill and no concurrency support yet.
llama.cpp version b9180 has been released, featuring Multi-Token Prediction (MTP). The release is marked by successful builds and developer relief.
Benchmarks of MTP (Multi-Token Prediction) in llama.cpp on Strix Halo show significant speedups for 27B Qwen models in long-context chat, but mixed results for 35B models.
The pull request adding MTP (Multi-Token Prediction) support to llama.cpp has been merged into the master branch.
Pull request adding Multi-Token Prediction (MTP) support to llama.cpp, enabling speculative decoding for faster inference.
Multi-token prediction (MTP) has been approved for integration into llama.cpp, indicating an upcoming update to the local LLM inference tool.
Implemented Multi-Token Prediction for Qwen on llama.cpp with TurboQuant, achieving a 40% performance boost and a 90% acceptance rate, running locally on a MacBook Pro M5 Max.
Provides Docker images for running MTP models with llama.cpp, including quantization comparisons and usage instructions.
A user benchmarks token generation speed on llama.cpp with the GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 flag, comparing performance with and without MTP (Multi-Token Prediction). Results show a significant speedup from 49 tok/s to 64 tok/s when MTP is enabled on an RTX 5090 with a Qwen3.6-27B model.
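Several entries above describe MTP as drafting multiple tokens and then verifying them together, and report an "acceptance rate" alongside lossless output. The following is a minimal sketch of that draft-and-verify idea under greedy decoding, not llama.cpp's actual implementation. The names `toy_target`, `toy_draft`, `greedy_decode`, and `draft_and_verify` are hypothetical, and the two "models" are toy integer functions standing in for a real backbone and its MTP head. A real engine would check all drafted tokens in one batched forward pass; here the check is written as a loop for clarity.

```python
from typing import Callable, List, Tuple

Token = int
NextFn = Callable[[List[Token]], Token]


def toy_target(seq: List[Token]) -> Token:
    # Hypothetical stand-in for the full model's greedy next-token choice.
    return (seq[-1] * 3 + 1) % 7


def toy_draft(seq: List[Token]) -> Token:
    # Hypothetical stand-in for a cheap draft head: agrees with the target
    # except when the sequence length is a multiple of 5.
    guess = toy_target(seq)
    return (guess + 1) % 7 if len(seq) % 5 == 0 else guess


def greedy_decode(target: NextFn, prompt: List[Token], n_new: int) -> List[Token]:
    # Baseline: one target call per generated token.
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target(seq))
    return seq


def draft_and_verify(
    target: NextFn, draft: NextFn, prompt: List[Token], n_new: int, k: int = 4
) -> Tuple[List[Token], float]:
    # Returns the generated sequence and the fraction of drafted tokens accepted.
    seq = list(prompt)
    goal = len(prompt) + n_new
    drafted = accepted = 0
    while len(seq) < goal:
        # 1) Draft up to k tokens cheaply.
        proposal: List[Token] = []
        for _ in range(min(k, goal - len(seq))):
            proposal.append(draft(seq + proposal))
        drafted += len(proposal)
        # 2) Verify: keep drafted tokens while they match the target's choice.
        for tok in proposal:
            expected = target(seq)
            if tok == expected:
                seq.append(tok)
                accepted += 1
            else:
                # First mismatch: take the target's token and discard the rest.
                seq.append(expected)
                break
        else:
            # Every draft was accepted; the verification step also yields
            # one extra token from the target.
            if len(seq) < goal:
                seq.append(target(seq))
    rate = accepted / drafted if drafted else 0.0
    return seq, rate
```

Because every emitted token is either confirmed or supplied by the target, the output matches plain greedy decoding exactly; the speedup in a real system comes from verifying several tokens per target pass, so it depends on how often drafts are accepted.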