@Snixtp: https://x.com/Snixtp/status/2055734339346768225
Summary
A user benchmarks the MTP variant of Qwen3.6 27B against the normal version on a single RTX 3090 using llama.cpp, finding MTP offers up to 2.37x faster generation at long contexts (32k-64k) but with slower prefill and no concurrency support yet.
Cached at: 05/17/26, 03:27 AM
Qwen3.6 27B vs MTP on a single RTX 3090
I tested the “normal” Qwen3.6 27B GGUF against the new MTP GGUF variant from @UnslothAI in llama.cpp, using one RTX 3090 with 24GB VRAM.
The goal was simple: see whether MTP actually helps on local consumer hardware, especially at longer context lengths.
Setup:
- GPU: single RTX 3090
- Runtime: llama.cpp
- Quant: Q4_K_S GGUF
- KV cache: q8_0 for K and V
- GPU power limit: 250W
- Prompt lengths tested: 4k, 16k, 32k, 64k
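For anyone wanting to reproduce a similar setup, here is a minimal sketch of how the settings above could map to a llama-server invocation. It only assembles and prints the argument list. The model file name and the `--n-gpu-layers 99` value are assumptions, not taken from the post, and any MTP-specific options are deliberately left out because the post does not list them. The 250W power limit is set outside llama.cpp (for example with `nvidia-smi -pl 250`).

```python
import shlex


def build_server_command(model_path, ctx_size, parallel=1):
    """Assemble a llama-server argument list resembling the setup above."""
    return [
        "llama-server",
        "--model", model_path,
        "--ctx-size", str(ctx_size),
        "--n-gpu-layers", "99",      # assumption: offload every layer to the GPU
        "--cache-type-k", "q8_0",    # q8_0 KV cache for K
        "--cache-type-v", "q8_0",    # q8_0 KV cache for V
        "--parallel", str(parallel), # number of slots (the "p" in p1, p4, p8)
    ]


# Hypothetical file name; substitute the GGUF you actually downloaded.
cmd = build_server_command("qwen3.6-27b-Q4_K_S.gguf", 65536)
print(shlex.join(cmd))
```

Building the command as a list keeps it easy to loop over the tested prompt lengths without string-quoting mistakes.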
The Short Version
MTP was fast, but not everywhere.
At 4k context, the normal baseline was faster. But as context length increased, MTP's advantage over the baseline grew a lot.
Generation speed:
MTP is not a win for small prompts, but it becomes very useful for long context generation.
At 32k, it was more than 2x faster for generation. At 64k, the speedup over baseline was even larger.
This seems too good to be true, so there has to be a tradeoff. And yes, there is a small tradeoff.
The Tradeoff: Prefill Gets Slower
The downside from my testing is prompt processing.
MTP had slower prefill/prompt processing across the tested context lengths. In this run, MTP prompt processing ran at about 69-86% of baseline speed, so it was between 14% and 31% slower.
That can be a big caveat.
If your workload is mostly short prompts, small completions, or many fresh requests where prefill dominates, MTP may not feel faster. At 4k context, it was actually slower overall for generation too.
But if your workload is long-context generation, where decode speed matters more after a big prompt is loaded, MTP starts to look much better.
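To make the tradeoff concrete, here is a rough break-even sketch: how many tokens have to be generated before MTP's faster decode pays back its slower prefill. It uses the 32k generation speeds and the worst-case 69% prefill ratio from this post. The baseline prefill speed is a hypothetical placeholder (absolute prefill numbers are not given here), "32k" is assumed to mean 32,768 tokens, and decode speed is treated as constant over the whole completion.

```python
def total_seconds(prompt_tokens, gen_tokens, prefill_tps, decode_tps):
    """Simple latency model: prefill time plus decode time."""
    return prompt_tokens / prefill_tps + gen_tokens / decode_tps


def break_even_gen_tokens(prompt_tokens, base_prefill_tps, prefill_ratio,
                          base_decode_tps, mtp_decode_tps):
    """Generated tokens at which MTP and baseline take the same total time."""
    mtp_prefill_tps = base_prefill_tps * prefill_ratio
    extra_prefill = prompt_tokens / mtp_prefill_tps - prompt_tokens / base_prefill_tps
    saved_per_token = 1 / base_decode_tps - 1 / mtp_decode_tps
    if saved_per_token <= 0:
        return float("inf")  # MTP never catches up if its decode is not faster
    return extra_prefill / saved_per_token


BASE_PREFILL_TPS = 1000.0  # HYPOTHETICAL placeholder, not a measured value
PROMPT_TOKENS = 32_768     # assumption: "32k" means 32,768 tokens

n = break_even_gen_tokens(PROMPT_TOKENS, BASE_PREFILL_TPS, 0.69, 27.15, 57.29)
print(f"MTP wins end-to-end after roughly {n:.0f} generated tokens")
```

Swap in your own measured prefill speed to see where the crossover lands for your workload. Short completions stay below the break-even point, and long ones clear it.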
What Happened at Long Context?
The 32k and 64k numbers are where MTP became clearly useful.
At 32k:
- Baseline: 27.15 tok/s
- MTP: 57.29 tok/s
- Speedup: 2.11x
At 64k:
- Baseline: 21.88 tok/s
- MTP: 51.89 tok/s
- Speedup: 2.37x
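The speedup figures are just the ratio of MTP to baseline generation speed. A small sketch that recomputes them from the tok/s numbers above (the dictionary layout and function name are only illustrative):

```python
# Generation speeds in tok/s from this post: (baseline, MTP)
results = {
    "32k": (27.15, 57.29),
    "64k": (21.88, 51.89),
}


def speedup(baseline_tps, mtp_tps):
    """Ratio of MTP generation speed to baseline generation speed."""
    return mtp_tps / baseline_tps


for ctx, (baseline_tps, mtp_tps) in results.items():
    print(f"{ctx}: {speedup(baseline_tps, mtp_tps):.2f}x")
# prints:
# 32k: 2.11x
# 64k: 2.37x
```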
This is on a single 3090, not a high-VRAM workstation card.
The test used q8_0 KV cache, which helped make the longer context fit on 24GB VRAM.
Concurrency Results
True MTP concurrency above p1 was not tested because llama.cpp MTP currently does not support -np > 1, according to the Unsloth model card. So the fair MTP comparison here is p1 only.
Even so, I did log the results for baseline Qwen3.6 27B at 32k:
The p8 failure is not surprising. Eight 32k slots create a very large KV cache memory requirement, and the 3090 ran out of VRAM while trying to allocate another 1197 MiB.
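A back-of-the-envelope sketch of why slot count matters: KV cache size grows linearly with the number of slots when each slot holds its own 32k context. The formula below assumes a plain transformer where every counted layer stores K and V for every token. The layer count, KV head count, and head dimension are hypothetical placeholders, not Qwen3.6 27B's real configuration, so read the absolute numbers as illustrative only. The q8_0 size assumes blocks of 32 int8 values plus one fp16 scale (34 bytes per 32 elements).

```python
Q8_0_BYTES_PER_ELEMENT = 34 / 32  # 32 int8 values + one fp16 scale per block


def kv_cache_mib(n_layers, n_kv_heads, head_dim, ctx_per_slot, n_slots,
                 bytes_per_element=Q8_0_BYTES_PER_ELEMENT):
    """Estimate KV cache size in MiB for n_slots independent contexts."""
    # Factor of 2 covers both K and V.
    elements = 2 * n_layers * n_kv_heads * head_dim * ctx_per_slot * n_slots
    return elements * bytes_per_element / 2**20


# HYPOTHETICAL architecture values, chosen only to show the scaling.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 16, 4, 256

for slots in (1, 4, 8):
    size = kv_cache_mib(N_LAYERS, N_KV_HEADS, HEAD_DIM, 32_768, slots)
    print(f"p{slots}: ~{size:.0f} MiB of KV cache")
```

Whatever the real dimensions are, p8 needs eight times the KV cache of p1, on top of the model weights and compute buffers that already occupy much of the 24GB.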
My Takeaway
For local llama.cpp users, MTP looks promising, but the use case matters.
- For short prompts, baseline may still be better.
- For long-context generation, MTP can be much faster.
- Prefill is slower with MTP.
- Current llama.cpp MTP support is limited to p1, so it does not yet support concurrency.
- On 24GB VRAM, 32k p4 is possible for baseline with q8_0 KV, but 32k p8 does not fit.
It seems to fit very well into setups with Hermes Agent and @openclaw, as well as agentic coding, where you might care more about generation speed than prompt processing.