multi-token-prediction

#multi-token-prediction

llama.cpp docker images to run MTP models

Reddit r/LocalLLaMA ↗ · 2026-05-13

Provides Docker images for running MTP models with llama.cpp, including quantization comparisons and usage instructions.

MTP+GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 - llama.cpp

Reddit r/LocalLLaMA ↗ · 2026-05-12

A user benchmarks token generation speed in llama.cpp with the GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 environment variable set, with and without Multi-Token Prediction (MTP). On an RTX 5090 running a Qwen3.6-27B model, enabling MTP raises throughput from 49 tok/s to 64 tok/s, a gain of roughly 31%.
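The relative gain implied by those two throughput figures can be checked with a few lines of Python. This is only a sketch of the arithmetic; the helper name `speedup_percent` is made up for illustration and is not part of llama.cpp or of the original post.

```python
def speedup_percent(baseline_tps: float, new_tps: float) -> float:
    """Relative throughput gain of new_tps over baseline_tps, in percent."""
    return (new_tps / baseline_tps - 1.0) * 100.0


# Figures quoted in the summary: 49 tok/s without MTP, 64 tok/s with MTP.
print(f"{speedup_percent(49.0, 64.0):.1f}%")  # prints "30.6%"
```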

Gemma 4 MTP vs DFlash on 1x H100: dense vs MoE results

Reddit r/LocalLLaMA ↗ · 2026-05-12

This benchmark compares Gemma 4's Multi-Token Prediction (MTP) with z-lab's DFlash, two speculative decoding methods, on a single H100 GPU. MTP is faster for dense models and DFlash is faster for MoE models.

MTP on Unsloth

Reddit r/LocalLLaMA ↗ · 2026-05-11

Unsloth releases GGUF-quantized versions of Qwen3.6 models with Multi-Token Prediction (MTP) support.

unsloth/Qwen3.6-35B-A3B-MTP-GGUF

Hugging Face Models Trending ↗ · 2026-05-11

This article announces the release of the Qwen3.6-35B-A3B model weights on Hugging Face, optimized by Unsloth with Multi-Token Prediction (MTP) for faster generation via llama.cpp. It highlights improvements in agentic coding capabilities, tool calling, and reasoning context preservation.

@ivanfioravanti: llamacpp is gonna get MTP support soon!

X AI KOLs Following ↗ · 2026-05-08

llama.cpp will soon support Multi-Token Prediction (MTP), which should improve inference efficiency.

Multi-Token Prediction (MTP) for LLaMA.cpp - Gemma 4 speedup by 40%

Reddit r/LocalLLaMA ↗ · 2026-05-08

A new implementation of Multi-Token Prediction (MTP) in llama.cpp achieves a 40% speedup for Gemma 4 models, tested on a MacBook Pro with an M5 Max. The post links to quantized GGUF models and the patched source code.

@googlegemma: Gemma 4 up to 3x faster, directly in your phone! Check out the difference Speculative Decoding makes! Multi-Token Predi…

X AI KOLs Timeline ↗ · 2026-05-07

Google's Gemma 4 achieves up to 3x faster inference speeds through speculative decoding and multi-token prediction, enabling efficient on-device deployment.

havenoammo/Qwen3.6-27B-MTP-UD-GGUF

Hugging Face Models Trending ↗ · 2026-05-06

This Hugging Face repository provides GGUF files for Qwen3.6-27B with Multi-Token Prediction (MTP) layers grafted onto Unsloth UD XL quantizations. It includes instructions for building llama.cpp with MTP support to enable speculative decoding.

google/gemma-4-26B-A4B-it-assistant

Hugging Face Models Trending ↗ · 2026-04-23

Google DeepMind released Gemma 4 MTP drafters for the Gemma 4 family, enabling significant decoding speedups via speculative decoding while maintaining exact generation quality for low-latency applications.
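Why speculative decoding can keep generation quality exact is easiest to see in the greedy case: the drafter only proposes tokens, and every token that reaches the output is the one the target model would have picked anyway. The sketch below is a toy illustration under stated assumptions, not Gemma 4 or llama.cpp code. The `target` and `draft` callables are hypothetical stand-ins for real models, and the per-token verification loop stands in for what a real engine would do in one batched forward pass.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # maps a context to its next token


def greedy_decode(target: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Plain greedy decoding: one target call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], n_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: draft proposes up to k tokens, target verifies."""
    out = list(prompt)
    goal = len(prompt) + n_new
    while len(out) < goal:
        # 1. The cheap drafter proposes a short continuation.
        ctx, proposal = list(out), []
        for _ in range(min(k, goal - len(out))):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2. The target checks each proposed position. Only the target's own
        #    token is ever appended, so the output matches greedy decoding.
        for tok in proposal:
            expected = target(out)
            out.append(expected)
            if expected != tok:
                break  # first mismatch: discard the rest of the proposal
    return out


# Hypothetical toy "models": the drafter is wrong at every third position.
def target(ctx: List[int]) -> int:
    return (ctx[-1] * 3 + len(ctx)) % 11


def draft(ctx: List[int]) -> int:
    tok = target(ctx)
    return tok if len(ctx) % 3 else (tok + 1) % 11


assert speculative_decode(target, draft, [1], 20) == greedy_decode(target, [1], 20)
```

With sampling instead of greedy decoding, real implementations rely on an acceptance and resampling rule to preserve the target distribution; that case is not shown here.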

google/gemma-4-31B-it-assistant

Hugging Face Models Trending ↗ · 2026-04-23

Google DeepMind releases Gemma 4, a family of open-weights multimodal models featuring Multi-Token Prediction (MTP) for up to 2x decoding speedups, supporting text, image, video, and audio with enhanced reasoning and coding capabilities.
