@rohanpaul_ai: More good news for local LLMs from atomic[.]chat, which runs 100% offline on your computer. They just showed MTP (Mult…

X AI KOLs Following 05/21/26, 03:50 AM Tools

local-llm multi-token-prediction inference-speed open-source qwen gpu-optimization

Summary

atomic.chat's MTP technique speeds up local LLM inference by drafting multiple tokens and verifying them together. The reported result is a speedup of up to 137% on the dense Qwen 27B model, with a claimed zero accuracy loss.

Original Article

Cached at: 05/21/26, 08:13 AM

More good news for local LLMs from atomic[.]chat, which runs 100% offline on your computer.

They just showed MTP (Multi-Token Prediction) pushing a local dense Qwen 27B model from 51 to 117 tokens/s.

And an MoE 35B-A3B model rose from 218 to 267 tokens/s on 2x RTX 5090.

Instead of generating and checking one token at a time, MTP drafts multiple future tokens and verifies them together, so the GPU does less repeated work for every word it prints.

And this makes local LLMs much faster when the draft tokens are accepted often enough.
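The draft-and-verify loop can be sketched in a few lines. This is a toy illustration, not atomic.chat's implementation: the names `speculative_step`, `toy_target`, and `toy_draft` are hypothetical, and a separate draft function stands in for the model's own multi-token prediction heads. A real engine would score all drafted positions in one batched forward pass, whereas this sketch calls the target once per position for readability.

```python
from typing import Callable, List

# A "model" here is any function that maps a token prefix to the next token
# (greedy decoding). Both models below are toys, assumed for illustration.
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     prefix: List[int], k: int) -> List[int]:
    """Draft k tokens, then keep the longest run the target agrees with."""
    # 1) Draft k tokens cheaply, one after another.
    drafted: List[int] = []
    for _ in range(k):
        drafted.append(draft(prefix + drafted))

    # 2) Verify the drafts against the target model.
    accepted: List[int] = []
    for tok in drafted:
        expected = target(prefix + accepted)
        if tok != expected:
            # First mismatch: keep the target's token and drop the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # Every draft matched, so the target also contributes one bonus token.
    accepted.append(target(prefix + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken,
             prompt: List[int], n_tokens: int, k: int = 3) -> List[int]:
    """Generate n_tokens after the prompt using draft-and-verify steps."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(target, draft, out, k))
    return out[: len(prompt) + n_tokens]


def greedy(target: NextToken, prompt: List[int], n_tokens: int) -> List[int]:
    """Plain one-token-at-a-time decoding, used as the reference output."""
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(target(out))
    return out


def toy_target(prefix: List[int]) -> int:
    return (prefix[-1] * 7 + 3) % 50 if prefix else 0


def toy_draft(prefix: List[int]) -> int:
    # Agrees with the target except at every fifth position.
    tok = toy_target(prefix)
    return tok if len(prefix) % 5 else (tok + 1) % 50
```

Because every emitted token is either a draft the target agreed with or the target's own choice, `generate(toy_target, toy_draft, [1], 20)` returns the same sequence as `greedy(toy_target, [1], 20)`. That is the sense in which this kind of decoding can be faster without changing the model's answers.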

For many local LLM runs, the limit is not pure compute, but memory bandwidth: how fast the GPU can keep feeding weights into computation.

A local GPU generating text often spends most of its time pulling model weights from VRAM again and again for each token. If MTP lets the model check several drafted tokens in one forward pass, the same giant weight matrices have to be reread less often per generated token.
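A back-of-the-envelope model makes the bandwidth argument concrete. The sketch below assumes decode speed is limited only by streaming the weights from VRAM once per forward pass, and it ignores the KV cache, activations, and the extra cost of drafting and verifying. The function name and all numbers are illustrative assumptions, not measurements from the post.

```python
def max_tokens_per_second(weight_bytes_per_pass: float,
                          bandwidth_bytes_per_s: float,
                          tokens_per_pass: float = 1.0) -> float:
    """Upper bound on decode speed when each pass streams the weights once."""
    passes_per_second = bandwidth_bytes_per_s / weight_bytes_per_pass
    return passes_per_second * tokens_per_pass


# Illustrative inputs only (assumed, not from the post):
# 27e9 parameters at roughly 0.5 bytes each, and 1e12 bytes/s of bandwidth.
weights = 27e9 * 0.5
bandwidth = 1.0e12

baseline = max_tokens_per_second(weights, bandwidth)       # 1 token per pass
with_mtp = max_tokens_per_second(weights, bandwidth, 2.5)  # 2.5 tokens per pass
print(f"baseline bound: {baseline:.1f} tok/s, with drafts: {with_mtp:.1f} tok/s")
```

Under this simplified model the bound scales linearly with the number of tokens that survive each pass, which is why the acceptance rate matters so much.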

The most interesting claim in their test is ~80% draft acceptance with zero accuracy loss and only ~1GB extra VRAM. That matters because speculative decoding generally pays off only when most draft tokens are accepted.
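To see why the acceptance rate matters, here is a rough estimate of how many tokens one verification pass yields. It assumes each drafted token is accepted independently with the same probability, which is a simplification, and the draft length of 3 is a hypothetical value since the post does not state one.

```python
def expected_tokens_per_pass(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per verification pass.

    The chance that the first i drafts are all accepted is accept_rate**i.
    The i = 0 term is the one token the target model always contributes.
    """
    return sum(accept_rate ** i for i in range(draft_len + 1))


# Hypothetical draft length of 3 with the ~80% acceptance quoted in the post.
print(round(expected_tokens_per_pass(0.8, 3), 3))  # 2.952
```

With no accepted drafts the estimate falls back to 1 token per pass (ordinary decoding), and with every draft accepted it reaches `draft_len + 1`.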

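So this is a strong local AI result: it improves generation speed without changing the model’s answers. The dense model is the real winner because it has to read all 27B parameters on every pass, so memory bandwidth was its main bottleneck. The MoE model activates only about 3B parameters per token, so it had less to gain.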

Their GitHub repo is fully open source.

atomic.chat (@atomic_chat_hq): MTP speedup Qwen by 2.5x in Atomic Chat

Dense vs MoE models on 2x RTX 5090:

Qwen3.6 27B: 51 → 117 tps (+137%)

Qwen3.6 35B-A3B: 218 → 267 tps (+25%)

MTP drafts several tokens ahead and verifies them in one pass. The speedup depends on memory moved per pass. Dense 27B reads all 27B
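As a sanity check on the quoted figures, the speedups can be recomputed from the rounded tokens/s values. The helper name `speedup` is hypothetical. The rounded figures give slightly lower percentages than the quoted +137% and +25%; one possible explanation, which is an assumption, is that the quoted percentages were computed from unrounded measurements.

```python
from typing import Tuple


def speedup(before_tps: float, after_tps: float) -> Tuple[float, float]:
    """Return (ratio, percent gain) for a before/after throughput pair."""
    ratio = after_tps / before_tps
    return ratio, (ratio - 1.0) * 100.0


# Rounded tokens/s figures as quoted in the post.
reported = {
    "Qwen3.6 27B (dense)": (51, 117),
    "Qwen3.6 35B-A3B (MoE)": (218, 267),
}

for name, (before, after) in reported.items():
    ratio, pct = speedup(before, after)
    print(f"{name}: {ratio:.2f}x ({pct:+.0f}%)")
# Qwen3.6 27B (dense): 2.29x (+129%)
# Qwen3.6 35B-A3B (MoE): 1.22x (+22%)
```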

Similar Articles

@rohanpaul_ai: atomic[.]chat just made Gemma 4 26B faster inside LLaMA.cpp. making token generation about 40% faster in its MacBook Pr…

@rohanpaul_ai: atomic[.]chat (a desktop app that runs LLMs locally) ran a very revealing comparison for local AI agents, on a MacBook …

@rohanpaul_ai: Qwen 3.6 27B on a MacBook Pro M5 Max 64GB hitting 34tokens per sec, locally with atomic[.]chat 90% acceptance rate, i.e…

@atomic_chat_hq: MTP speedup Qwen by 2.5x in Atomic Chat Dense vs MoE models on 2x RTX 5090 Qwen3.6 27B: 51 → 117 tps +137% Qwen3.6 35B-…

@DivyanshT91162: Local LLMs just hit a whole new level This Hugging Face release is actually insane: "gpt-oss-20b-tq3" An official 20B+ …
