@victormustar: llama.cpp with MTP support makes local models fast enough to use as daily drivers Qwen3.6-27B dense generation (on A10G…

Summary

llama.cpp adds MTP support for Qwen3.6 models, boosting generation speed by 78% on A10G hardware, making local models viable as daily drivers.


Cached at: 05/18/26, 10:38 PM

llama.cpp with MTP support makes local models fast enough to use as daily drivers 🚀

Qwen3.6-27B dense generation (on A10G): From 25 tok/s → 45 tok/s (+78%).
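
As a quick check on what those figures mean in practice, the sketch below computes the speedup ratio and the time to generate a reply. It uses only the rounded numbers quoted in the post (25 and 45 tok/s); the 1000-token reply length is an illustrative assumption. Note that the rounded figures work out to +80%, so the quoted +78% was presumably computed from unrounded measurements that the post does not give.

```python
def speedup(before_tps, after_tps):
    """Return (ratio, percent gain) for a change in tokens per second."""
    ratio = after_tps / before_tps
    return ratio, (ratio - 1.0) * 100.0


def seconds_for(tokens, tps):
    """Time in seconds to generate `tokens` at a steady `tps` rate."""
    return tokens / tps


# Rounded figures from the post; the 1000-token reply is an assumption.
ratio, pct = speedup(25.0, 45.0)
print(f"{ratio:.2f}x, +{pct:.0f}%")
print(f"{seconds_for(1000, 25.0):.1f} s -> {seconds_for(1000, 45.0):.1f} s")
```

At these rates a 1000-token reply drops from about 40 seconds to about 22 seconds of generation time, ignoring prompt processing.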

Two flags on llama-server: --spec-type draft-mtp --spec-draft-n-max 2 https://t.co/hhslKpLE71
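
A minimal sketch of how the two flags might be assembled into a full llama-server command line. The two speculative-decoding flags and their values are taken from the post; the model path, the port, and the helper name build_server_command are hypothetical and only there for illustration. The script prints the command rather than launching the server.

```python
import shlex

# Hypothetical model path: substitute the GGUF file you actually have.
MODEL_PATH = "models/qwen3.6-27b.gguf"


def build_server_command(model_path, draft_max=2, port=8080):
    """Build the llama-server argv, including the two MTP flags from the post."""
    return [
        "llama-server",
        "-m", model_path,        # model file to load
        "--port", str(port),     # HTTP port (illustrative choice)
        "--spec-type", "draft-mtp",
        "--spec-draft-n-max", str(draft_max),
    ]


cmd = build_server_command(MODEL_PATH)
print(shlex.join(cmd))
```

Building the command as a list keeps each flag and its value as separate arguments, so the same list can be passed to subprocess.run without shell quoting issues.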

Georgi Gerganov (@ggerganov): llama.cpp adds MTP for the Qwen3.6 family

This is a significant milestone for the local AI ecosystem. The performance jump with these changes is massive and elevates local inference on commodity hardware further.

Special thanks to Aman Gupta for leading this development!

Similar Articles

Llama.cpp B9406 MTP mmproj fix

Reddit r/LocalLLaMA

Llama.cpp release B9406 fixes a crash (GGML_ASSERT) when using MTP with MoE vision models like Qwen3.6-35B-A3B.

MTP for Qwen3.6-35B-A3B on 6GB VRAM laptop: not worth it

Reddit r/LocalLLaMA

Benchmarks of Multi-Token Prediction (MTP) support in llama.cpp for the Qwen3.6-35B-A3B model on a 6GB VRAM laptop suggest that MTP is not worth using in that setup: significantly slower prompt processing outweighs the minor generation speed gains. The author also found that using q4_0 quantization for the draft KV cache saves VRAM without hurting quality.

More Qwen3.6-27B MTP success but on dual Mi50s

Reddit r/LocalLLaMA

The article benchmarks the Qwen3.6-27B model using Multi-Token Prediction (MTP) and tensor parallelism on dual Mi50 GPUs, demonstrating significant speedups via llama.cpp.