@no_stp_on_snek: Tested out MTP for the first time on my llamacpp fork last night with turbo4 sym. GX10 hardware. using MoE model: llmfa…

X AI KOLs Following 05/22/26, 01:34 PM Models

multi-token-prediction llamacpp moe open-source inference qwen ppl

Summary

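Tested Multi-Token Prediction (MTP) on a llama.cpp fork with turbo4 sym on GX10 hardware, using a Qwen-based MoE model. Perplexity (PPL) came in 0.41% above the fp16 baseline. Higher PPL is worse, so this is a small quality cost, not an improvement.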

Original Article

Cached at: 05/23/26, 08:01 AM

Tested out MTP for the first time on my llamacpp fork last night with turbo4 sym.

GX10 hardware.

using MoE model: llmfan46/Qwen3.6-35B-A3B-uncensored-heretic-Native-MTP-Preserved

+0.41 % PPL vs fp16 baseline https://t.co/pwzhfphHCK
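For readers unfamiliar with how a figure like "+0.41 % PPL vs fp16 baseline" is usually derived, here is a minimal sketch. It assumes the standard definition of perplexity as the exponential of the mean per-token negative log-likelihood, and a relative delta against the baseline. The function names and the sample numbers are hypothetical and chosen for illustration. They are not taken from the post, and the author's exact measurement setup is not stated.

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token), natural log."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def relative_ppl_delta_percent(ppl_test, ppl_baseline):
    """Relative PPL change in percent; positive means worse than the baseline."""
    return (ppl_test / ppl_baseline - 1.0) * 100.0


# Hypothetical per-token NLLs for the same text under two configurations.
baseline_nlls = [1.80, 2.10, 1.95, 2.05]   # e.g. an fp16 run (made-up values)
quantized_nlls = [1.81, 2.10, 1.96, 2.06]  # e.g. a quantized run (made-up values)

ppl_base = perplexity(baseline_nlls)
ppl_quant = perplexity(quantized_nlls)
delta = relative_ppl_delta_percent(ppl_quant, ppl_base)
print(f"baseline PPL={ppl_base:.4f}  quantized PPL={ppl_quant:.4f}  delta={delta:+.2f}%")
```

Under this convention a positive delta means the tested configuration predicts the evaluation text slightly worse than the baseline, so a small positive number is the desirable outcome for a quantized build.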

Similar Articles

Testing llama.cpp MTP support on Qwen3.6 - RTX 5090

Reddit r/LocalLLaMA

A technical test of llama.cpp's new Multi-Token Prediction (MTP) support using Qwen3.6 models on an RTX 5090, comparing performance with and without MTP across different prompts and GGUF quantizations.

I tested MTP on vLLM and llama.cpp for Gemma 4 & Qwen 3.6 — 3.34x faster inference, here are my findings RTX 6000 PRO.

Reddit r/LocalLLaMA

Benchmarks of Multi-Token Prediction (MTP) on Gemma 4 31B and Qwen 3.6 27B using vLLM and llama.cpp show up to 3.34x faster inference, with optimal speculative token counts varying by model and engine.

@julien_c: I've seen some confusion online on how to run llama.cpp with MTP (Multi-token prediction) in the simplest way possible.…

X AI KOLs Following

Julien C explains how to run llama.cpp with Multi-token prediction (MTP) for ~2x generation speed, using either the Dense 27B or MoE 35B model, with instructions for installation and configuration.

@ivanfioravanti: llamacpp is gonna get MTP support soon!

X AI KOLs Following

llama.cpp will soon support Multi-Token Prediction (MTP), which is aimed at faster inference.

Multi-Token Prediction (MTP) for Qwen on LLaMA.cpp + TurboQuant

Reddit r/LocalLLaMA

Implemented Multi-Token Prediction for Qwen on LLaMA.cpp with TurboQuant, achieving a 40% performance boost and 90% acceptance rate, running locally on a MacBook Pro M5 Max.
