@rohanpaul_ai: Another good news for local-LLM from atomic[.]chat, that runs 100% offline on your computer. They just showed MTP (Mult…

X AI KOLs Following Tools

Summary

atomic.chat's MTP technique speeds up local LLM inference by drafting multiple tokens and verifying them together, achieving up to 137% speedup on Qwen 27B dense model with zero accuracy loss.

Another good news for local-LLM from atomic[.]chat, that runs 100% offline on your computer. They just showed MTP (Multi-Token Prediction) pushing local Qwen models from 51 to 117 tokens/s on dense 27B. And an MoE 35B-A3B model rose from 218 to 267 tokens/s on 2x RTX 5090. Instead of generating and checking one token at a time, MTP (Multi-Token Prediction) drafts multiple future tokens and verifies them together, so the GPU does less repeated work for every word it prints. And this makes local LLMs much faster when the draft tokens are accepted often enough. For many local LLM runs, the limit is not pure compute, but memory bandwidth: how fast the GPU can keep feeding weights into computation. A local GPU generating text often spends most of its time pulling model weights from VRAM again and again for each token, so if MTP lets the model check several drafted tokens in one forward pass, it reduces how often the same giant weight matrix has to be reread. The most interesting claim in their test is ~80% draft acceptance with zero accuracy loss and only ~1GB extra VRAM, because speculative decoding often becomes useful only when the draft tokens are accepted often enough. So we get this strong local AI result because it improves generation speed without changing the model’s answers, but the dense model is the real winner because memory bandwidth was its main bottleneck. Their GitHub repo is fully open source.
Original Article
View Cached Full Text

Cached at: 05/21/26, 08:13 AM

Another good news for local-LLM from atomic[.]chat, that runs 100% offline on your computer.

They just showed MTP (Multi-Token Prediction) pushing local Qwen models from 51 to 117 tokens/s on dense 27B.

And an MoE 35B-A3B model rose from 218 to 267 tokens/s on 2x RTX 5090.

Instead of generating and checking one token at a time, MTP (Multi-Token Prediction) drafts multiple future tokens and verifies them together, so the GPU does less repeated work for every word it prints.

And this makes local LLMs much faster when the draft tokens are accepted often enough.

For many local LLM runs, the limit is not pure compute, but memory bandwidth: how fast the GPU can keep feeding weights into computation.

A local GPU generating text often spends most of its time pulling model weights from VRAM again and again for each token, so if MTP lets the model check several drafted tokens in one forward pass, it reduces how often the same giant weight matrix has to be reread.

The most interesting claim in their test is ~80% draft acceptance with zero accuracy loss and only ~1GB extra VRAM, because speculative decoding often becomes useful only when the draft tokens are accepted often enough.

So we get this strong local AI result because it improves generation speed without changing the model’s answers, but the dense model is the real winner because memory bandwidth was its main bottleneck.

Their GitHub repo is fully open source.

atomic.chat (@atomic_chat_hq): MTP speedup Qwen by 2.5x in Atomic Chat

Dense vs MoE models on 2x RTX 5090 Qwen3.6 27B: 51 → 117 tps +137% Qwen3.6 35B-A3B: 218 → 267 tps +25%

MTP drafts several tokens ahead and verifies them in one pass. The speedup depends on memory moved per pass. Dense 27B reads all 27B

Similar Articles