@atomic_chat_hq: MTP speedup Qwen by 2.5x in Atomic Chat Dense vs MoE models on 2x RTX 5090 Qwen3.6 27B: 51 → 117 tps +137% Qwen3.6 35B-…

X AI KOLs Timeline 05/20/26, 10:25 PM Tools

mtp speedup qwen inference-optimization local-ai open-source

Summary

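Atomic Chat reports that MTP speeds up Qwen3.6 27B (dense) by a stated 2.5x (51 → 117 tps) and Qwen3.6 35B-A3B (MoE) by a stated 25% (218 → 267 tps) on 2x RTX 5090, with ~80% draft acceptance, zero accuracy loss, and ~1 GB extra VRAM. MTP works like speculative decoding: it drafts several tokens ahead and verifies them in one pass.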
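The draft-and-verify loop named in the summary can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: `target_next` and `draft_next` are hypothetical stand-ins for the full model and the drafter, `draft_len=3` is an arbitrary choice, and none of it is Atomic Chat's code. Under greedy decoding the result matches plain decoding token for token, which is the sense in which a scheme like this can avoid accuracy loss.

```python
from typing import List, Tuple

Token = int


def target_next(ctx: List[Token]) -> Token:
    """Toy 'full model' (hypothetical): deterministic next token from the context."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_next(ctx: List[Token]) -> Token:
    """Toy 'drafter' (hypothetical): agrees with the target except when len(ctx) % 5 == 0."""
    tok = target_next(ctx)
    return (tok + 1) % 50 if len(ctx) % 5 == 0 else tok


def greedy_decode(prompt: List[Token], n_new: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out


def speculative_decode(prompt: List[Token], n_new: int, draft_len: int = 3) -> Tuple[List[Token], int]:
    """Draft `draft_len` tokens, verify them against the target, keep the matching prefix.

    Returns the generated sequence and the number of verification passes.
    """
    out = list(prompt)
    goal = len(prompt) + n_new
    passes = 0
    while len(out) < goal:
        # 1) Draft several tokens ahead with the cheap drafter.
        draft: List[Token] = []
        for _ in range(draft_len):
            draft.append(draft_next(out + draft))

        # 2) Verify. A real engine would score all draft positions in one
        #    batched forward pass; the toy target is simply called per position.
        passes += 1
        accepted: List[Token] = []
        for tok in draft:
            expected = target_next(out + accepted)
            if tok != expected:
                accepted.append(expected)  # first wrong draft is replaced by the target's token
                break
            accepted.append(tok)
        else:
            # Every draft was accepted, so the same pass also yields one extra token.
            accepted.append(target_next(out + accepted))
        out.extend(accepted)
    return out[:goal], passes


if __name__ == "__main__":
    prompt = [1, 2, 3]
    spec, passes = speculative_decode(prompt, n_new=20)
    print("matches plain greedy decoding:", spec == greedy_decode(prompt, 20))
    print("verification passes for 20 tokens:", passes)
```

Each verification pass yields at least one token, so the worst case is no slower in pass count than plain decoding; the gain comes from passes where several drafts are accepted at once.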

MTP speedup Qwen by 2.5x in Atomic Chat

Dense vs MoE models on 2x RTX 5090
Qwen3.6 27B: 51 → 117 tps +137%
Qwen3.6 35B-A3B: 218 → 267 tps +25%

MTP drafts several tokens ahead and verifies them in one pass. The speedup depends on memory moved per pass. Dense 27B reads all 27B params per token, MoE 35B-A3B only reads 3B active. Dense had way more to save by batching.

The baseline tps also differ (218 vs 51) for the same reason from the other side. Token generation is memory-bandwidth bound, and MoE moves ~8x less memory per token, so its baseline is already 4x ahead.

~80% draft acceptance. Zero accuracy loss. ~1 GB extra VRAM.

Open-source code and local AI app – in the comments
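The ratios in the post can be recomputed from the quoted tps figures, and the ~80% acceptance rate can be turned into a rough tokens-per-pass estimate. This is a back-of-the-envelope sketch, not the authors' measurement method: the draft length is not stated in the post, so `draft_len` below is a hypothetical parameter, and the estimate assumes each draft token is accepted independently with the same probability. Ratios computed from the rounded tps values may differ somewhat from the percentages quoted in the post.

```python
def speedup(before_tps: float, after_tps: float) -> float:
    """Throughput ratio after/before."""
    return after_tps / before_tps


def expected_tokens_per_pass(accept_rate: float, draft_len: int) -> float:
    """Expected tokens produced per verification pass.

    Geometric-series estimate, assuming each of the `draft_len` draft tokens
    is accepted independently with probability `accept_rate`, and that every
    pass yields at least one token from the full model.
    """
    if accept_rate >= 1.0:
        return draft_len + 1.0
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


# Figures quoted in the post: (baseline tps, tps with MTP).
REPORTED = {
    "Qwen3.6 27B (dense)": (51, 117),
    "Qwen3.6 35B-A3B (MoE)": (218, 267),
}

if __name__ == "__main__":
    for name, (before, after) in REPORTED.items():
        s = speedup(before, after)
        print(f"{name}: {before} -> {after} tps, x{s:.2f} ({(s - 1) * 100:+.1f}%)")

    # Baseline gap between MoE and dense, and the parameter-read ratio behind it.
    print(f"MoE vs dense baseline: x{speedup(51, 218):.2f}")
    print(f"params read per token, dense vs MoE active: x{27 / 3:.0f}")

    # draft_len is an assumption; the post only gives the ~80% acceptance rate.
    for draft_len in (1, 2, 3, 4):
        est = expected_tokens_per_pass(0.8, draft_len)
        print(f"draft_len={draft_len}: ~{est:.2f} tokens per pass")
```

The tokens-per-pass estimate is only an upper bound on speedup: each pass also pays for drafting and for whatever memory traffic batching does not amortize, which is consistent with the post's point that the MoE model, reading far fewer parameters per token, has less to save.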

Cached at: 05/21/26, 08:24 AM

Similar Articles

RTX 5080 16GB: Qwen3.6 35B MoE at 128k context — 56 tok/s, and why MTP doesn't help

@Snixtp: https://x.com/Snixtp/status/2055734339346768225

Qwen3.6 27B on a 5090, 6.4k sample tok/s distribution after tuning MTP/cache settings

Got MTP + TurboQuant running — Qwen3.6-27B -- 80+ t/s at 262K context on a single RTX 4090

@atomic_chat_hq: 1-bit Hy3 running locally is 2.2x faster than its API at the same quality! We gave both models the same task and compar…
