@no_stp_on_snek: Tested out MTP for the first time on my llamacpp fork last night with turbo4 sym. GX10 hardware. using MoE model: llmfa…
Summary
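Tested Multi-Token Prediction (MTP) on a llamacpp fork with turbo4 sym on GX10 hardware, using a Qwen-based MoE model. Reported perplexity (PPL) was +0.41% versus the fp16 baseline, i.e. a small increase rather than an improvement, since lower PPL is better.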
Cached at: 05/23/26, 08:01 AM
Tested out MTP for the first time on my llamacpp fork last night with turbo4 sym.
GX10 hardware.
using MoE model: llmfan46/Qwen3.6-35B-A3B-uncensored-heretic-Native-MTP-Preserved
+0.41 % PPL vs fp16 baseline https://t.co/pwzhfphHCK
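For readers unfamiliar with the "+X % PPL vs baseline" shorthand, here is a minimal sketch of how such a figure is usually derived: the relative change in perplexity between a quantized run and a reference run. The function name and the example numbers are hypothetical placeholders chosen for illustration. They are not the author's measurements, and the post does not say how its figure was computed.

```python
def relative_ppl_delta(ppl_test: float, ppl_baseline: float) -> float:
    """Return the perplexity change of a test run relative to a baseline, in percent.

    Positive values mean the test run has HIGHER perplexity (slightly worse)
    than the baseline; negative values mean lower perplexity (better).
    """
    if ppl_baseline <= 0:
        raise ValueError("baseline perplexity must be positive")
    return (ppl_test - ppl_baseline) / ppl_baseline * 100.0


if __name__ == "__main__":
    # Hypothetical numbers for illustration only (not taken from the post).
    baseline_ppl = 10.000   # e.g. an fp16 reference run
    quantized_ppl = 10.041  # e.g. a quantized run on the same text
    delta = relative_ppl_delta(quantized_ppl, baseline_ppl)
    print(f"{delta:+.2f} % PPL vs baseline")
```

Under this convention a small positive delta means the quantized setup stays close to the reference, at a slight cost in perplexity.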
Similar Articles
Testing llama.cpp MTP support on Qwen3.6 - RTX 5090
A technical test of llama.cpp's new Multi-Token Prediction (MTP) support using Qwen3.6 models on an RTX 5090, comparing performance with and without MTP across different prompts and GGUF quantizations.
I tested MTP on vLLM and llama.cpp for Gemma 4 & Qwen 3.6 — 3.34x faster inference, here are my findings RTX 6000 PRO.
Benchmarks of Multi-Token Prediction (MTP) on Gemma 4 31B and Qwen 3.6 27B using vLLM and llama.cpp show up to 3.34x faster inference, with optimal speculative token counts varying by model and engine.
@julien_c: I've seen some confusion online on how to run llama.cpp with MTP (Multi-token prediction) in the simplest way possible.…
Julien C explains how to run llama.cpp with Multi-token prediction (MTP) for ~2x generation speed, using either the Dense 27B or MoE 35B model, with instructions for installation and configuration.
@ivanfioravanti: llamacpp is gonna get MTP support soon!
llamacpp will soon support Multi-Token Prediction (MTP), enhancing inference efficiency.
Multi-Token Prediction (MTP) for Qwen on LLaMA.cpp + TurboQuant
Implemented Multi-Token Prediction for Qwen on LLaMA.cpp with TurboQuant, achieving a 40% performance boost and 90% acceptance rate, running locally on a MacBook Pro M5 Max.