@victormustar: llama.cpp with MTP support makes local models fast enough to use as daily drivers Qwen3.6-27B dense generation (on A10G…
Summary
llama.cpp adds MTP support for Qwen3.6 models, boosting generation speed by 78% on A10G hardware, making local models viable as daily drivers.
Cached at: 05/18/26, 10:38 PM
llama.cpp with MTP support makes local models fast enough to use as daily drivers 🚀
Qwen3.6-27B dense generation (on A10G): From 25 tok/s → 45 tok/s (+78%).
Two flags on llama-server: --spec-type draft-mtp --spec-draft-n-max 2 https://t.co/hhslKpLE71
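As a rough illustration, the two flags from the post can be assembled into a full llama-server invocation. This is a sketch only: the flags are written with double hyphens, as long options normally are; the model path is a hypothetical placeholder; and `-m` is the usual llama-server option for choosing a model file. The helper name `build_llama_server_cmd` is made up for this example.

```python
import shlex


def build_llama_server_cmd(model_path, draft_n_max=2):
    """Build an argv list for llama-server using the MTP flags quoted in the post."""
    return [
        "llama-server",
        "-m", model_path,                        # hypothetical model file, not from the post
        "--spec-type", "draft-mtp",              # flag quoted in the post
        "--spec-draft-n-max", str(draft_n_max),  # flag quoted in the post (value 2)
    ]


cmd = build_llama_server_cmd("models/qwen3.6-27b.gguf")  # placeholder path
print(shlex.join(cmd))  # shell-quoted command line, ready to paste into a terminal
```

The list can also be passed directly to `subprocess.run(cmd)` on a machine where llama-server is installed and the model file exists.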
Georgi Gerganov (@ggerganov): llama.cpp adds MTP for the Qwen3.6 family
This is a significant milestone for the local AI ecosystem. The performance jump with these changes is massive and elevates local inference on commodity hardware further.
Special thanks to Aman Gupta for leading this development!
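To put the quoted numbers in context, here is a small back-of-the-envelope sketch. It assumes a simplified model of draft-and-verify decoding that does not come from the post: each drafted token is accepted independently with probability `accept_prob`, drafts are accepted only as a prefix, and every verification pass yields at least one token. It also ignores the cost of drafting and verifying, so it gives an upper bound rather than a prediction. Both function names are hypothetical.

```python
def percent_gain(before, after):
    """Relative throughput gain in percent when going from `before` to `after` tok/s."""
    return (after / before - 1.0) * 100.0


def expected_tokens_per_step(accept_prob, draft_n_max):
    """Expected tokens per verification pass under the simplified model.

    One token is always produced; the k-th drafted token survives only if all
    k drafts up to it are accepted, which has probability accept_prob ** k.
    The expectation is therefore 1 + p + p^2 + ... + p^n.
    """
    return sum(accept_prob ** k for k in range(draft_n_max + 1))


# Rounded figures from the post: 25 tok/s before, 45 tok/s after.
gain = percent_gain(25, 45)

# With a draft length of 2, as in the quoted flags, the ceiling is 3 tokens per pass.
ceiling = expected_tokens_per_step(1.0, 2)
half = expected_tokens_per_step(0.5, 2)
print(f"gain: {gain:.0f}%  ceiling: {ceiling}  at p=0.5: {half}")
```

The rounded 25 and 45 tok/s figures give roughly +80%, close to the quoted +78%, which presumably reflects unrounded measurements. Under this toy model a draft length of 2 yields at most 3 tokens per pass, so a real-world gain somewhat below 2x is plausible once acceptance rates and overhead are accounted for.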
Similar Articles
@hank_aibtc: https://x.com/ClementDelangue/status/2058672394865111544/video/1… Local LLM speed ceiling broken again! llama.cpp natively supports MTP (Multi-Token Prediction): - No extra draft model needed…
llama.cpp natively supports Multi-Token Prediction (MTP) without requiring an extra draft model. By leveraging the model's built-in prediction head, local models like Qwen3.6-27B achieve 1.7x+ speedup, making 27B models run smoothly on consumer GPUs.
@ggerganov: llama.cpp adds MTP for the Qwen3.6 family This is a significant milestone for the local AI ecosystem. The performance j…
llama.cpp adds Multi-Token Prediction (MTP) support for the Qwen3.6 family, delivering massive performance improvements for local AI inference on commodity hardware.
Llama.cpp B9406 MTP mmproj fix
Llama.cpp release B9406 fixes a crash (GGML_ASSERT) when using MTP with MoE vision models like Qwen3.6-35B-A3B.
MTP for Qwen3.6-35B-A3B on 6GB VRAM laptop: not worth it
Benchmarks of Multi-Token Prediction (MTP) support in llama.cpp for the Qwen3.6-35B-A3B model on a 6GB VRAM laptop show that MTP is not worth using due to significantly slower prompt processing outweighing minor generation speed gains. The author found that using q4_0 quantization for the draft KV cache saves VRAM without hurting quality.
More Qwen3.6-27B MTP success but on dual Mi50s
The article benchmarks the Qwen3.6-27B model using Multi-Token Prediction (MTP) and tensor parallelism on dual Mi50 GPUs, demonstrating significant speedups via llama.cpp.