Tag
This benchmark compares Gemma 4's Multi-Token Prediction (MTP) and z-lab's DFlash speculative decoding methods on a single H100 GPU, finding that MTP is faster for dense models and DFlash is faster for MoE models.
Unsloth releases GGUF-quantized versions of Qwen3.6 models with Multi-Token Prediction (MTP) support.
This article announces the release of the Qwen3.6-35B-A3B model weights on Hugging Face, optimized by Unsloth with Multi-Token Prediction (MTP) for faster generation via llama.cpp. It highlights improvements in agentic coding capabilities, tool calling, and reasoning context preservation.
llama.cpp will soon support Multi-Token Prediction (MTP), enhancing inference efficiency.
A new implementation of Multi-Token Prediction (MTP) in llama.cpp achieves a 40% speedup for Gemma 4 models, tested on a MacBook Pro with an M5 Max chip. The post provides links to quantized GGUF models and the patched source code.
Google's Gemma 4 achieves up to 3x faster inference speeds through speculative decoding and multi-token prediction, enabling efficient on-device deployment.
This Hugging Face repository provides GGUF files for Qwen3.6-27B with Multi-Token Prediction (MTP) layers grafted onto Unsloth UD XL quantizations. It includes instructions for building llama.cpp with MTP support to enable speculative decoding.
Google DeepMind released Gemma 4 MTP drafters for the Gemma 4 family, enabling significant decoding speedups via speculative decoding while maintaining exact generation quality for low-latency applications.
Google DeepMind releases Gemma 4, a family of open-weights multimodal models featuring Multi-Token Prediction (MTP) for up to 2x decoding speedups, supporting text, image, video, and audio with enhanced reasoning and coding capabilities.