Tag
Provides Docker images for running MTP models with llama.cpp, including quantization comparisons and usage instructions.
A user benchmarks token generation speed in llama.cpp with the GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 environment variable set, comparing performance with and without MTP (Multi-Token Prediction). Generation speeds up from 49 tok/s to 64 tok/s (about 31%) when MTP is enabled on an RTX 5090 with a Qwen3.6-27B model.
This benchmark compares Gemma 4's Multi-Token Prediction (MTP) and z-lab's DFlash speculative decoding methods on a single H100 GPU, showing that MTP is faster for dense models and DFlash is faster for MoE models.
Unsloth releases GGUF-quantized versions of Qwen3.6 models with Multi-Token Prediction (MTP) support.
This article announces the release of the Qwen3.6-35B-A3B model weights on Hugging Face, optimized by Unsloth with Multi-Token Prediction (MTP) for faster generation via llama.cpp. It highlights improvements in agentic coding capabilities, tool calling, and reasoning context preservation.
llama.cpp will soon support Multi-Token Prediction (MTP), which should improve inference efficiency.
A new implementation of Multi-Token Prediction (MTP) in llama.cpp achieves a 40% speedup for Gemma 4 models, tested on a MacBook Pro with an M5 Max. The post provides links to quantized GGUF models and the patched source code.
Google's Gemma 4 achieves up to 3x faster inference speeds through speculative decoding and multi-token prediction, enabling efficient on-device deployment.
This Hugging Face repository provides GGUF files for Qwen3.6-27B with Multi-Token Prediction (MTP) layers grafted onto Unsloth UD XL quantizations. It includes instructions for building llama.cpp with MTP support to enable speculative decoding.
Google DeepMind released Gemma 4 MTP drafters for the Gemma 4 family, enabling significant decoding speedups via speculative decoding while maintaining exact generation quality for low-latency applications.
Google DeepMind releases Gemma 4, a family of open-weights multimodal models featuring Multi-Token Prediction (MTP) for up to 2x decoding speedups, supporting text, image, video, and audio with enhanced reasoning and coding capabilities.