@Snixtp: https://x.com/Snixtp/status/2055734339346768225

X AI KOLs Timeline 05/16/26, 07:37 PM News

llama-cpp qwen multi-token-prediction rtx-3090 benchmark local-inference gguf

Summary

A user benchmarks the MTP variant of Qwen3.6 27B against the normal version on a single RTX 3090 using llama.cpp, finding MTP offers up to 2.37x faster generation at long contexts (32k-64k) but with slower prefill and no concurrency support yet.

https://t.co/Vy4UrWlKLc

Original Article


Cached at: 05/17/26, 03:27 AM

Qwen3.6 27B vs MTP on a single RTX 3090

I tested the “normal” Qwen3.6 27B GGUF against the new MTP GGUF variant from @UnslothAI in llama.cpp, using one RTX 3090 with 24GB VRAM.

The goal was simple: see whether MTP actually helps on local consumer hardware, especially at longer context lengths.

Setup:

GPU: single RTX 3090
Runtime: llama.cpp
Quant: Q4_K_S GGUF
KV cache: q8_0 for K and V
GPU power limit: 250W
Prompt lengths tested: 4k, 16k, 32k, 64k
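For readers who want to reproduce something similar, here is a rough sketch of how a setup like this might be launched. The post does not give the exact command, so treat it as an assumption: the model filename is a placeholder, and any MTP-specific options are left out because the source does not state them. The flags themselves (-m, -c, -ngl, --cache-type-k, --cache-type-v, -np) are standard llama-server options, and nvidia-smi -pl sets the GPU power limit.

```python
import shlex

def power_limit_cmd(watts):
    # nvidia-smi -pl sets the board power limit (usually needs root).
    return ["nvidia-smi", "-pl", str(watts)]

def server_cmd(model_path, ctx_tokens, slots=1):
    # Builds a llama-server argument list matching the setup listed above.
    # model_path is a placeholder; the author's real filename is not given.
    return [
        "llama-server",
        "-m", model_path,
        "-c", str(ctx_tokens),       # total context size in tokens
        "-ngl", "99",                # offload all layers to the GPU
        "--cache-type-k", "q8_0",    # quantized K cache
        "--cache-type-v", "q8_0",    # quantized V cache (may need flash attention enabled)
        "-np", str(slots),           # parallel slots; MTP is reported to need 1
    ]

if __name__ == "__main__":
    print(shlex.join(power_limit_cmd(250)))
    print(shlex.join(server_cmd("model-Q4_K_S.gguf", 65536)))
```

Building the command as a list keeps each flag and its value explicit, which makes it easy to sweep the context size across 4k, 16k, 32k and 64k from a script.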

The Short Version

MTP was fast, but not everywhere.

At 4k context, the normal baseline was faster. But as context length increased, MTP's advantage over baseline grew a lot.

Generation speed:

MTP is not a win for small prompts, but it becomes very useful for long context generation.

At 32k, it was more than 2x faster for generation. At 64k, it was even faster compared to baseline.

This seems too good to be true, so there has to be a tradeoff. And yes, there is a small one.

The Tradeoff: Prefill Gets Slower

The downside from my testing is prompt processing.

MTP had slower prefill/prompt processing across the tested context lengths. In this run, MTP prompt processing ran at about 69-86% of baseline speed, so it was between 14% and 31% slower.

That can be a big caveat.

If your workload is mostly short prompts, small completions, or many fresh requests where prefill dominates, MTP may not feel faster. At 4k context, it was actually slower overall for generation too.

But if your workload is long-context generation, where decode speed matters more after a big prompt is loaded, MTP starts to look much better.
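To make that tradeoff concrete, the sketch below estimates how many generated tokens it takes before MTP's faster decode pays back its slower prefill. It uses the 32k decode speeds and the worst-case 69% prefill ratio reported in this post. The baseline prefill speed of 1000 tok/s is a made-up placeholder, since the post gives no absolute prefill numbers, and the model assumes decode speed stays constant during generation.

```python
def total_seconds(prompt_tokens, gen_tokens, prefill_tps, decode_tps):
    # Simple two-phase model: process the prompt, then generate.
    return prompt_tokens / prefill_tps + gen_tokens / decode_tps

def break_even_gen_tokens(prompt_tokens, base_prefill_tps, base_decode_tps,
                          mtp_prefill_ratio, mtp_decode_tps):
    # Extra time MTP spends on prefill compared to baseline.
    mtp_prefill_tps = base_prefill_tps * mtp_prefill_ratio
    extra_prefill = prompt_tokens / mtp_prefill_tps - prompt_tokens / base_prefill_tps
    # Time MTP saves on every generated token.
    saved_per_token = 1.0 / base_decode_tps - 1.0 / mtp_decode_tps
    if saved_per_token <= 0:
        return float("inf")  # MTP never catches up if decode is not faster
    return extra_prefill / saved_per_token

if __name__ == "__main__":
    n = break_even_gen_tokens(
        prompt_tokens=32768,
        base_prefill_tps=1000.0,   # HYPOTHETICAL, not from the post
        base_decode_tps=27.15,     # baseline decode at 32k (from the post)
        mtp_prefill_ratio=0.69,    # worst-case prefill ratio (from the post)
        mtp_decode_tps=57.29,      # MTP decode at 32k (from the post)
    )
    print(f"MTP breaks even after roughly {n:.0f} generated tokens")
```

Past the break-even point every additional generated token widens MTP's lead, which is why long completions after a long prompt favor it. With a different real prefill speed the break-even count shifts, so plug in your own measurements.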

What Happened at Long Context?

The 32k and 64k numbers are where MTP became clearly useful.

At 32k:

Baseline: 27.15 tok/s
MTP: 57.29 tok/s
Speedup: 2.11x

At 64k:

Baseline: 21.88 tok/s
MTP: 51.89 tok/s
Speedup: 2.37x
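The speedup figures above are just the ratio of MTP to baseline decode throughput. A small sketch that recomputes them from the reported tok/s values, and also converts each to seconds per 1000 generated tokens, which can be easier to picture:

```python
# Decode throughput in tok/s as reported above: (baseline, MTP)
results = {
    "32k": (27.15, 57.29),
    "64k": (21.88, 51.89),
}

def speedup(baseline_tps, mtp_tps):
    # Ratio of MTP throughput to baseline throughput.
    return mtp_tps / baseline_tps

def seconds_per_1000_tokens(tps):
    # Wall-clock time to generate 1000 tokens at a steady rate.
    return 1000.0 / tps

if __name__ == "__main__":
    for ctx, (base, mtp) in results.items():
        print(f"{ctx}: {speedup(base, mtp):.2f}x, "
              f"{seconds_per_1000_tokens(base):.1f}s vs "
              f"{seconds_per_1000_tokens(mtp):.1f}s per 1000 tokens")
    # Speedups print as 2.11x (32k) and 2.37x (64k), matching the post.
```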

This is on a single 3090, not a high-VRAM workstation card.

The test used q8_0 KV cache, which helped make the longer context fit on 24GB VRAM.

Concurrency Results

True MTP concurrency above p1 was not tested because llama.cpp MTP currently does not support -np > 1, according to the Unsloth model card. So the fair MTP comparison here is p1 only.

Even so, I did log the results for baseline Qwen3.6 27B at 32k:

The p8 failure is not surprising. Eight 32k slots create a very large KV cache memory requirement, and the 3090 ran out of VRAM while trying to allocate another 1197 MiB.
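A back-of-the-envelope estimate shows why slot count matters so much: KV cache size grows linearly with both context length and number of slots. The sketch below uses the standard formula for attention layers with grouped-query attention and the q8_0 block layout (34 bytes per 32 values). The architecture numbers are illustrative placeholders, not confirmed values for Qwen3.6 27B, so read the real ones from the GGUF metadata before trusting any absolute figure.

```python
# q8_0 stores 32 int8 values plus one fp16 scale per block: 34 bytes / 32 values.
Q8_0_BYTES_PER_ELEMENT = 34 / 32

def kv_cache_bytes(ctx_per_slot, slots, n_attn_layers, n_kv_heads, head_dim,
                   bytes_per_element=Q8_0_BYTES_PER_ELEMENT):
    # K and V each hold n_kv_heads * head_dim values per token per attention layer.
    per_token = 2 * n_attn_layers * n_kv_heads * head_dim * bytes_per_element
    return per_token * ctx_per_slot * slots

if __name__ == "__main__":
    # PLACEHOLDER architecture values, chosen only to illustrate the scaling.
    arch = dict(n_attn_layers=16, n_kv_heads=4, head_dim=256)
    for slots in (1, 4, 8):
        mib = kv_cache_bytes(32768, slots, **arch) / (1024 ** 2)
        print(f"p{slots} at 32k per slot: about {mib:.0f} MiB of KV cache")
```

Whatever the true per-token cost is, going from p4 to p8 doubles the KV cache on top of the model weights, which is consistent with a 24GB card fitting one and not the other.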

My Takeaway

For local llama.cpp users, MTP looks promising, but the use case matters.

For short prompts, baseline may still be better.
For long-context generation, MTP can be much faster.
Prefill is slower with MTP.
Current llama.cpp MTP support is limited to p1, so it does not yet support concurrency.
On 24GB VRAM, 32k p4 is possible for baseline with q8_0 KV, but 32k p8 does not fit.

It seems to fit very well into setups with Hermes Agent and @openclaw, as well as agentic coding, where you might care more about generation speed than prompt processing.

