A developer gets TurboQuant TBQ4 KV cache and Multi-Token Prediction (MTP) working on AMD ROCm for RDNA3 GPUs in llama.cpp, enabling 64k context on 24 GB VRAM at competitive token rates.
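The 64k-on-24GB figure is plausible from KV-cache arithmetic alone. Here is a back-of-envelope sketch; the layer count, KV-head count, and head dimension are hypothetical stand-ins (the post doesn't give Qwen3.6's exact architecture), and TBQ4 is modeled simply as roughly 4.5 bits per element, which is also an assumption:

```python
# Rough KV-cache sizing for a long-context run. All architecture numbers
# below are hypothetical placeholders, not confirmed Qwen3.6 specs.
n_layers = 64        # assumed transformer layer count
n_kv_heads = 8       # assumed GQA key/value heads
head_dim = 128       # assumed per-head dimension
ctx = 64 * 1024      # 64k-token context window

def kv_cache_gib(bits_per_elt: float) -> float:
    # 2x for keys and values, one entry per layer per position.
    elts = 2 * n_layers * n_kv_heads * head_dim * ctx
    return elts * bits_per_elt / 8 / 2**30

print(f"f16 KV cache : {kv_cache_gib(16):5.1f} GiB")   # ~16.0 GiB
print(f"~4-bit KV    : {kv_cache_gib(4.5):5.1f} GiB")  # ~4.5 GiB
```

Under these assumptions an f16 cache alone would rival the whole 24 GB budget once the weights are loaded, so a ~4-bit cache is what makes the 64k window fit.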
UnslothAI founder Daniel Han released experimental MTP GGUF versions of Qwen3.6, achieving 140 tokens/s for the 27B model and 220 tokens/s for the 35B-A3B on consumer GPUs, a 1.4x speedup with zero accuracy loss.
A systematic analysis of Qwen3.6 27B benchmarks reveals that speculative inference (MTP) significantly accelerates coding tasks but slows down creative writing, with task type mattering more than quantization or temperature settings.
A user shares a configuration achieving over 80 tokens per second with Qwen3.6 35B-A3B on a 12 GB VRAM GPU using llama.cpp and MTP. The post includes benchmark results and the specific command-line parameters used to tune performance.
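For reproducing this kind of setup, the main knob is how many layers stay on the GPU. A small budgeting sketch follows; the file size, layer count, and overheads are made-up illustrative numbers, not the post's actual values:

```python
# Estimate how many layers of a GGUF fit in a VRAM budget, to pick a
# starting value for llama.cpp's -ngl (--n-gpu-layers) flag.
# All numbers here are illustrative assumptions, not the post's values.
model_file_gib = 18.0   # assumed size of the quantized 35B-A3B GGUF
n_layers = 48           # assumed layer count
vram_gib = 12.0
overhead_gib = 1.5      # context buffers, compute scratch, display, etc.
kv_cache_gib = 1.0      # assumed KV cache at the chosen context size

per_layer_gib = model_file_gib / n_layers
budget = vram_gib - overhead_gib - kv_cache_gib
n_gpu_layers = int(budget / per_layer_gib)
print(f"-ngl {min(n_gpu_layers, n_layers)}  (~{per_layer_gib:.2f} GiB/layer)")
```

For MoE models like the A3B variants, setups like this typically also keep the expert tensors in system RAM via llama.cpp's tensor-override options, so the always-active dense weights fill the GPU first.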
A user benchmarked MTP on Gemma 4 with mlx-vlm on an M4 Max Mac Studio, finding it excellent for code generation (1.53x faster, 66% acceptance) but detrimental for JSON output (50% slower, only 8% acceptance) and neutral for long-form prose, suggesting MTP's benefits vanish when acceptance drops below 50%.
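The acceptance-rate threshold follows from the standard speculative-decoding cost model. A toy sketch; the draft depth k and relative overhead c are assumed values chosen for illustration, not measurements from the post:

```python
# Toy cost model for speculative decoding / MTP with k drafted tokens:
# assuming an i.i.d. per-token acceptance probability a, the expected
# tokens accepted per verification step is (1 - a**(k+1)) / (1 - a);
# drafting adds a relative overhead c per drafted token.
def speedup(a: float, k: int = 3, c: float = 0.19) -> float:
    expected_tokens = (1 - a ** (k + 1)) / (1 - a)
    return expected_tokens / (1 + k * c)

for a in (0.66, 0.50, 0.08):
    print(f"acceptance {a:.0%}: {speedup(a):.2f}x")
```

With these assumed constants the model lands near the reported 1.53x at 66% acceptance and shows a clear net loss at 8%; the exact break-even point depends on overheads the post doesn't quantify.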
A developer achieved 80+ t/s inference on Qwen3.6 27B with 262K context on a single RTX 4090 by combining MTP with TurboQuant's lossless KV cache compression, sharing their implementation fork and technical details.
llama.cpp will soon support MTP, speeding up inference by predicting several tokens per forward pass.
The author provides extracted GGUF files containing only the MTP tensors for Qwen3.6 models, so users can graft the tensors onto full models they already have instead of downloading complete MTP-enabled files.
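Before grafting, it's worth verifying what an MTP-only file actually contains. A minimal inspection sketch using the gguf Python package (pip install gguf); the filename and the "mtp" substring in tensor names are assumptions, since the post's exact naming scheme isn't given:

```python
from gguf import GGUFReader  # gguf-py, the reader that ships with llama.cpp

# Hypothetical filename; the post's actual artifact names aren't reproduced here.
reader = GGUFReader("qwen3.6-27b-mtp-tensors.gguf")

total_bytes = 0
for t in reader.tensors:
    # Assumes MTP tensors are identifiable by name; adjust to the real scheme.
    if "mtp" in t.name.lower():
        total_bytes += int(t.n_bytes)
        shape = "x".join(str(int(d)) for d in t.shape)
        print(f"{t.name:48s} {shape:>16s} {t.tensor_type.name}")

print(f"\nMTP tensor payload: {total_bytes / 2**20:.1f} MiB")
```

The graft itself, copying these tensors into a full model file, can then be done with gguf's GGUFWriter or with whatever tooling the post ships.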