@populartourist: Unsloth Qwen3.6 27B Q6_K doing over 100 t/s with MTP on RTX 5090. That's coming up from 45-50 t/s without MTP. That's i…
Summary
Unsloth Qwen3.6 27B Q6_K achieves over 100 tokens per second with MTP on RTX 5090, up from 45-50 t/s without MTP.
Cached at: 05/17/26, 11:32 AM
Unsloth Qwen3.6 27B Q6_K doing over 100 t/s with MTP on RTX 5090.
That’s coming up from 45-50 t/s without MTP. That’s insane.
--spec-draft-n-max 3 --spec-draft-p-min 0.75 https://t.co/D0tQkBU7r9
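For a rough sense of scale, the figures in the post can be turned into a speedup range, and the draft cap of 3 into an upper bound on tokens per verification pass. The sketch below is only illustrative. The function names are hypothetical, and the acceptance rates in the loop are assumed values, since the post does not report one. The 0.75 in the post is the value passed to the p-min flag, not a measured acceptance rate. The per-pass estimate uses the standard independent-acceptance model from the speculative decoding literature, which may not match how MTP drafts behave in practice.

```python
def speedup_range(base_low, base_high, mtp_rate):
    """Return (min, max) speedup of mtp_rate over a baseline throughput range."""
    return mtp_rate / base_high, mtp_rate / base_low


def expected_tokens_per_pass(accept_rate, draft_max):
    """Expected tokens emitted per verification pass.

    Assumes each drafted token is accepted independently with probability
    accept_rate (a simplifying assumption), and that the pass always
    yields at least one token from the main model.
    """
    if accept_rate >= 1.0:
        return draft_max + 1.0
    return (1.0 - accept_rate ** (draft_max + 1)) / (1.0 - accept_rate)


# Figures from the post: 45-50 t/s without MTP, over 100 t/s with MTP.
low, high = speedup_range(45.0, 50.0, 100.0)
print(f"speedup of at least {low:.2f}x to {high:.2f}x")

# Draft cap of 3 from the post; acceptance rates here are assumptions.
for rate in (0.5, 0.75, 0.9):
    print(rate, round(expected_tokens_per_pass(rate, 3), 2))
```

With a draft cap of 3, no pass can emit more than 4 tokens, so a speedup a little above 2x is plausible under this model once the cost of drafting and verification is taken into account.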
Similar Articles
Got MTP + TurboQuant running — Qwen3.6-27B -- 80+ t/s at 262K context on a single RTX 4090
Developer achieved 80+ t/s inference on Qwen3.6-27B with 262K context on a single RTX 4090 by combining MTP (Multi-Token Prediction) with TurboQuant's lossless KV cache compression, sharing their implementation fork and technical details.
@Snixtp: https://x.com/Snixtp/status/2055734339346768225
A user benchmarks the MTP variant of Qwen3.6 27B against the normal version on a single RTX 3090 using llama.cpp, finding MTP offers up to 2.37x faster generation at long contexts (32k-64k) but with slower prefill and no concurrency support yet.
125 tok/s for Qwen3.6 q4xl on 2x 4060ti is insane perf/dollar
A user reports achieving 125 tokens per second running Qwen3.6 q4xl on two RTX 4060 Ti GPUs, highlighting excellent performance per dollar and wondering if further optimization can reach 150 tok/s.
80 tok/sec and 128K context on 12GB VRAM with Qwen3.6 35B A3B and llama.cpp MTP
A user shares a configuration for achieving over 80 tokens per second with Qwen3.6 35B A3B on a 12GB VRAM GPU using llama.cpp and Multi-Token Prediction (MTP). The post includes benchmark results and specific command-line parameters to optimize performance.
MI50s Qwen 3.6 27B @52.8 tps TG @1569 tps PP (no MTP, no Quant)
Benchmark results for running Qwen 3.6 27B on AMD MI50 GPUs using a custom vLLM fork, achieving 52.8 tokens/s token generation (TG) and 1569 tokens/s prompt processing (PP) without quantization or MTP, demonstrating usability for agentic tasks on 2018 hardware.