@rohanpaul_ai: Another good news for local-LLM from atomic[.]chat, that runs 100% offline on your computer. They just showed MTP (Mult…
Summary
atomic.chat's MTP technique speeds up local LLM inference by drafting multiple tokens and verifying them together, achieving up to 137% speedup on Qwen 27B dense model with zero accuracy loss.
View Cached Full Text
Cached at: 05/21/26, 08:13 AM
Another good news for local-LLM from atomic[.]chat, that runs 100% offline on your computer.
They just showed MTP (Multi-Token Prediction) pushing local Qwen models from 51 to 117 tokens/s on dense 27B.
And an MoE 35B-A3B model rose from 218 to 267 tokens/s on 2x RTX 5090.
Instead of generating and checking one token at a time, MTP (Multi-Token Prediction) drafts multiple future tokens and verifies them together, so the GPU does less repeated work for every word it prints.
And this makes local LLMs much faster when the draft tokens are accepted often enough.
For many local LLM runs, the limit is not pure compute, but memory bandwidth: how fast the GPU can keep feeding weights into computation.
A local GPU generating text often spends most of its time pulling model weights from VRAM again and again for each token, so if MTP lets the model check several drafted tokens in one forward pass, it reduces how often the same giant weight matrix has to be reread.
The most interesting claim in their test is ~80% draft acceptance with zero accuracy loss and only ~1GB extra VRAM, because speculative decoding often becomes useful only when the draft tokens are accepted often enough.
So we get this strong local AI result because it improves generation speed without changing the model’s answers, but the dense model is the real winner because memory bandwidth was its main bottleneck.
Their GitHub repo is fully open source.
atomic.chat (@atomic_chat_hq): MTP speedup Qwen by 2.5x in Atomic Chat
Dense vs MoE models on 2x RTX 5090 Qwen3.6 27B: 51 → 117 tps +137% Qwen3.6 35B-A3B: 218 → 267 tps +25%
MTP drafts several tokens ahead and verifies them in one pass. The speedup depends on memory moved per pass. Dense 27B reads all 27B
Similar Articles
@rohanpaul_ai: atomic[.]chat just made Gemma 4 26B faster inside LLaMA.cpp. making token generation about 40% faster in its MacBook Pr…
atomic.chat has optimized Gemma 4 26B inference in LLaMA.cpp, achieving ~40% faster token generation on MacBook Pro M5 Max using Multi-Token Prediction (MTP) speculative decoding. This is a notable win for local AI users running desktop apps, coding agents, and private on-device assistants.
@rohanpaul_ai: atomic[.]chat (a desktop app that runs LLMs locally) ran a very revealing comparison for local AI agents, on a MacBook …
Liquid's LFM2.5-8B-A1B outperformed OpenAI's gpt-oss-20b on a tool-calling benchmark when run locally on a MacBook Pro, completing all required tool calls in half the time while using less memory.
@rohanpaul_ai: Qwen 3.6 27B on a MacBook Pro M5 Max 64GB hitting 34tokens per sec, locally with atomic[.]chat 90% acceptance rate, i.e…
Qwen 3.6 27B achieves 34 tokens/sec on a MacBook Pro M5 Max 64GB locally with 90% draft acceptance, enabled by TurboQuant, GGUF, and llama.cpp, showcasing a major advancement in laptop-based AI inference.
@atomic_chat_hq: MTP speedup Qwen by 2.5x in Atomic Chat Dense vs MoE models on 2x RTX 5090 Qwen3.6 27B: 51 → 117 tps +137% Qwen3.6 35B-…
Atomic Chat's MTP technique speeds up Qwen dense models by 2.5x and MoE models by 25% on 2x RTX 5090 with zero accuracy loss and ~1 GB extra VRAM, using speculative decoding to draft and verify multiple tokens in one pass.
@DivyanshT91162: Local LLMs just hit a whole new level This Hugging Face release is actually insane: "gpt-oss-20b-tq3" An official 20B+ …
A new 20B+ parameter MoE model from OpenAI, quantized to 3-bit via TurboQuant and optimized with MLX, allows for high-performance local LLM inference on standard 16GB MacBooks.