Tag
Proposes EntMTP, a training-free scheduler that adapts tree-based attention topologies for speculative decoding based on local entropy estimates, achieving 1.09-1.15x speedup over Hydra and up to 1.36x over Medusa.
Google Research introduces a new architecture using frozen Multi-Token Prediction to accelerate Gemini Nano models on Pixel devices, significantly improving speed and energy efficiency for on-device AI features.
This article investigates whether quantization affects the draft rate in multi-token prediction models, exploring potential trade-offs between model compression and inference efficiency.
An interactive guide explaining speculative decoding and multi-token prediction in LLMs, covering techniques from rejection sampling to MTP used in Qwen 3.6 and Gemma 4, with live diagrams and sliders.
A user reports that MTP versions of Qwen 3.6 and Gemma 4 models produce lower quality outputs in code review tasks compared to non-MTP counterparts, with only marginal real-world speed improvements despite higher token generation rates.
GLM-5.2 adopts MTP (Multi-Token Prediction) technology to accelerate inference and fixes a training-inference discrepancy in GLM-5.1's MTP that caused KV cache mixing issues.
SuperThoughts compresses consecutive chain-of-thought tokens into latent representations and decodes two tokens per step, achieving ~20–30% CoT length reduction with minimal accuracy loss on math reasoning benchmarks, while doubling inference throughput.
Release of Qwopus3.6-27B-v2-MTP, a fine-tuned multi-token prediction reasoning model based on Qwen3.6-27B, optimized for coding, DevOps, and math tasks with improved generation speed.
NVIDIA released the Nemotron 3 open model family in three sizes: Nano, Super, and Ultra. It improves hardware efficiency through architectural choices such as a hybrid Mamba-Transformer design, latent MoE, and multi-token prediction, and is released under the Open MDW 1.1 open license.
Bebop proposes entropy-aware multi-token prediction with rejection sampling and a novel TV loss to accelerate RL training of LLMs, achieving up to 1.8x speedup. The method addresses the degradation of acceptance rates during RL by optimizing training objectives.
This blog post provides tips and benchmarks for achieving nearly 200 tokens per second inference on DeepSeek V4 Flash using vLLM on a dual GH200 workstation, highlighting the use of a quantized checkpoint from Canada-Quant and tensor parallelism optimizations.
llama.cpp releases version b9495 with optimizations for Qwen3.6/3.5-MTP (Multi-Token Prediction) and asks users to share their benchmark results along with the full commands they ran.
A developer benchmarks Gemma 4 E4B using Google's LiteRT engine against a Q4 GGUF quant, finding a ~2.4x speedup in text generation due to multi-token prediction (MTP), but only 1.1x in image captioning. The post provides a Python wrapper that exposes an OpenAI-compatible endpoint, with limitations such as deterministic output and a single-session engine.
bytkim releases a 4-bit QLoRA SFT Multi-Token Prediction fine-tune of Qwen3.6-27B, packaged as GGUF for local agentic coding. The no-thinking tune is designed for low-latency direct output in agent loops.
Compares unsloth and bartowski MTP GGUF quantizations for Qwen models across various sizes and quantization levels, finding that unsloth GGUFs are generally smaller and offer similar or better decoding speed; MTP benefits larger dense models more.
Benchmarks of Multi-Token Prediction (MTP) on Gemma 4 31B and Qwen 3.6 27B using vLLM and llama.cpp show up to 3.34x faster inference, with optimal speculative token counts varying by model and engine.
llama.cpp release b9406 fixes a crash (GGML_ASSERT) when using MTP with MoE vision models like Qwen3.6-35B-A3B.
llama.cpp natively supports Multi-Token Prediction (MTP) without requiring an extra draft model. By leveraging the model's built-in prediction head, local models like Qwen3.6-27B achieve 1.7x+ speedup, making 27B models run smoothly on consumer GPUs.
NVFP4 quantization and Multi-Token Prediction support have been added to llama.cpp in release b9297.
Native multi-token prediction (MTP) support has been added to the exo local inference tool for Qwen3.6 MLX models, achieving up to 2x speedup on 27B models on an M5 Max laptop while maintaining exactness.