@Snixtp: https://x.com/Snixtp/status/2055734339346768225


Summary

A user benchmarks the MTP variant of Qwen3.6 27B against the normal version on a single RTX 3090 using llama.cpp, finding MTP offers up to 2.37x faster generation at long contexts (32k-64k) but with slower prefill and no concurrency support yet.

https://t.co/Vy4UrWlKLc

Cached at: 05/17/26, 03:27 AM

Qwen3.6 27B vs MTP on a single RTX 3090

I tested the “normal” Qwen3.6 27B GGUF against the new MTP GGUF variant from @UnslothAI in llama.cpp, using one RTX 3090 with 24GB VRAM.

The goal was simple: see whether MTP actually helps on local consumer hardware, especially at longer context lengths.

Setup:

  • GPU: single RTX 3090

  • Runtime: llama.cpp

  • Quant: Q4_K_S GGUF

  • KV cache: q8_0 for K and V

  • GPU power limit: 250W

  • Prompt lengths tested: 4k, 16k, 32k, 64k

The Short Version

MTP was fast, but not everywhere.

At 4k context, the normal baseline was faster. But as context length increased, MTP pulled ahead, and its advantage over baseline kept growing.

Generation speed:

MTP is not a win for small prompts, but it becomes very useful for long context generation.

At 32k, it was more than 2x faster for generation. At 64k, the speedup over baseline was even larger.

This seems too good to be true, so there has to be a tradeoff. And yes, there is one.

The Tradeoff: Prefill Gets Slower

The downside from my testing is prompt processing.

MTP had slower prefill/prompt processing across the tested context lengths. In this run, MTP prompt processing ran at about 69-86% of baseline speed, so between 14% and 31% slower.
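To make those percentages concrete, here is a minimal sketch of the arithmetic in Python. The function names are illustrative; the only inputs are the 69% and 86% relative speeds reported above. One point it brings out: running at 69% of baseline throughput means the same prompt takes roughly 45% longer in wall-clock time, not 31% longer.

```python
def throughput_drop_percent(relative_speed: float) -> float:
    """Percent reduction in tokens/second when running at relative_speed x baseline."""
    return (1.0 - relative_speed) * 100.0


def extra_time_percent(relative_speed: float) -> float:
    """Percent increase in wall-clock time to process the same prompt."""
    return (1.0 / relative_speed - 1.0) * 100.0


# Relative prefill speeds reported in the post (MTP as a fraction of baseline).
for relative_speed in (0.69, 0.86):
    print(
        f"{relative_speed:.0%} of baseline: "
        f"{throughput_drop_percent(relative_speed):.0f}% lower throughput, "
        f"{extra_time_percent(relative_speed):.0f}% more prefill time"
    )
```

So the worst case in this run is closer to "prefill takes about one and a half times as long" than to "prefill takes a third longer".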

That can be a big caveat.

If your workload is mostly short prompts, small completions, or many fresh requests where prefill dominates, MTP may not feel faster. At 4k context, it was actually slower overall for generation too.

But if your workload is long-context generation, where decode speed matters more after a big prompt is loaded, MTP starts to look much better.
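A rough way to reason about where the crossover sits is to model a request as prefill time plus decode time and solve for the number of generated tokens at which MTP catches up. The sketch below does that in Python. It is a simplified model under stated assumptions, not a measurement: the decode speeds are the 32k figures from this post, the 0.69 ratio is the worst-case prefill ratio reported above, "32k" is assumed to mean 32,768 tokens, and `BASELINE_PREFILL_TOK_S` is a hypothetical placeholder because the post does not report absolute prefill speeds. Substitute your own measured value.

```python
def total_seconds(prompt_tokens, gen_tokens, prefill_tok_s, decode_tok_s):
    """Simple latency model: time to ingest the prompt plus time to generate."""
    return prompt_tokens / prefill_tok_s + gen_tokens / decode_tok_s


def break_even_gen_tokens(prompt_tokens, base_prefill, mtp_prefill,
                          base_decode, mtp_decode):
    """Generated tokens needed before MTP's faster decode repays its slower prefill."""
    extra_prefill = prompt_tokens / mtp_prefill - prompt_tokens / base_prefill
    saved_per_token = 1.0 / base_decode - 1.0 / mtp_decode
    if saved_per_token <= 0:
        return float("inf")  # MTP decode is not faster, so it never catches up
    return max(0.0, extra_prefill / saved_per_token)


PROMPT_TOKENS = 32_768            # assumption: "32k" means 32,768 tokens
BASE_DECODE_TOK_S = 27.15         # from the post, 32k baseline
MTP_DECODE_TOK_S = 57.29          # from the post, 32k MTP
PREFILL_RATIO = 0.69              # from the post, worst-case MTP/baseline ratio
BASELINE_PREFILL_TOK_S = 1000.0   # HYPOTHETICAL placeholder, not measured in the post

n = break_even_gen_tokens(
    PROMPT_TOKENS,
    BASELINE_PREFILL_TOK_S,
    BASELINE_PREFILL_TOK_S * PREFILL_RATIO,
    BASE_DECODE_TOK_S,
    MTP_DECODE_TOK_S,
)
print(f"MTP wins once the completion exceeds about {n:.0f} tokens")
```

The model ignores everything except prefill and decode throughput, and it assumes a fresh prompt each time. If the prompt is already cached from a previous turn, the prefill penalty shrinks and the break-even point moves toward zero.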

What Happened at Long Context?

The 32k and 64k numbers are where MTP became clearly useful.

At 32k:

  • Baseline: 27.15 tok/s

  • MTP: 57.29 tok/s

  • Speedup: 2.11x

At 64k:

  • Baseline: 21.88 tok/s

  • MTP: 51.89 tok/s

  • Speedup: 2.37x
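The speedup figures above are simply MTP tokens per second divided by baseline tokens per second. A short Python sketch that reproduces them from the reported numbers (the variable and function names are illustrative):

```python
# Generation speeds in tokens/second, as reported above: (baseline, MTP).
RESULTS = {
    "32k": (27.15, 57.29),
    "64k": (21.88, 51.89),
}


def speedup(baseline_tok_s: float, mtp_tok_s: float) -> float:
    """Ratio of MTP generation speed to baseline generation speed."""
    return mtp_tok_s / baseline_tok_s


for context, (baseline, mtp) in RESULTS.items():
    print(f"{context}: {speedup(baseline, mtp):.2f}x")
```

Note that MTP's absolute speed is slightly lower at 64k than at 32k. The speedup grows because the baseline slows down more than MTP does as context gets longer.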

This is on a single 3090, not a high-VRAM workstation card.

The test used q8_0 KV cache, which helped make the longer context fit on 24GB VRAM.

Concurrency Results

True MTP concurrency above p1 was not tested because llama.cpp MTP currently does not support -np > 1, according to the Unsloth model card. So the fair MTP comparison here is p1 only.

Even so, I did log concurrency results for baseline Qwen3.6 27B at 32k:

The p8 failure is not surprising. Eight 32k slots create a very large KV cache memory requirement, and the 3090 ran out of VRAM while trying to allocate another 1197 MiB.

My Takeaway

For local llama.cpp users, MTP looks promising, but the use case matters.

  • For short prompts, baseline may still be better.

  • For long-context generation, MTP can be much faster

  • Prefill is slower with MTP

  • Current llama.cpp MTP support is limited to p1, so it does not yet support concurrency

  • On 24GB VRAM, 32k p4 is possible for baseline with q8_0 KV, but 32k p8 does not fit

It seems to fit very well into setups with Hermes Agent and @openclaw, as well as agentic coding, where you might care more about generation speed than prompt processing.
