

Summary

UnslothAI founder Daniel Han released an experimental MTP GGUF of Qwen3.6 that reaches 140 tokens/s on the 27B model and 220 tokens/s on the 35B-A3B variant on a single consumer GPU, an over-1.4x speedup over the original GGUF with no accuracy loss.

@berryxia: Damn, even my eyes can't keep up with this speed! Daniel Han, founder of UnslothAI (YC S24, previously doing ML at NVIDIA), just released the experimental MTP GGUF of Qwen3.6. The 27B model runs at 140 tokens/s on a single GPU, and the 35B-A3B version is even more insane, hitting 220 tokens/s. That's over 1.4x faster than the original GGUF, with zero accuracy loss. They tested extensively and found that setting draft tokens to 2 is the sweet spot: any higher and the acceptance rate plummets, so actual speed drops. Looking at that benchmark curve, my biggest takeaway is that the performance ceiling for local large models has been pushed up significantly. I used to think 30B+ models were too slow to run locally, but now MTP speculative decoding is squeezing every drop of potential out of consumer GPUs. If you're playing with llama.cpp, running local agents, or doing daily coding, try this update immediately. Local AI is starting to feel less and less like a "compromise version."
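For intuition on why 2 is the sweet spot: MTP speculative decoding only pays off while the target model keeps accepting the drafted tokens, and each extra draft position is less likely to survive verification. Here's a minimal toy model in Python (every number in it is an illustrative assumption, not a measurement from this release) showing throughput peaking at a short draft length and falling off after that.

```python
# Toy model of speculative-decoding throughput vs. draft length.
# All numbers (acceptance probability, decay, draft cost) are illustrative
# assumptions for intuition, not measurements from the Unsloth release.

def expected_tokens(p0: float, decay: float, k: int) -> float:
    """Expected tokens committed per verification step with k drafted tokens.

    Draft position i is accepted with probability p0 * decay**i, and a
    drafted token only counts if every earlier one was accepted too; the
    target model's verification pass always commits one token of its own.
    """
    total, run = 1.0, 1.0
    for i in range(k):
        run *= p0 * decay**i   # probability the accepted chain reaches position i
        total += run
    return total

def relative_speed(p0: float, decay: float, k: int, draft_cost: float) -> float:
    """Throughput relative to plain decoding: expected tokens per step,
    divided by the cost of one target forward pass plus k cheap MTP drafts."""
    return expected_tokens(p0, decay, k) / (1.0 + k * draft_cost)

if __name__ == "__main__":
    for k in range(6):
        s = relative_speed(p0=0.7, decay=0.6, k=k, draft_cost=0.1)
        print(f"draft tokens = {k}: ~{s:.2f}x")
```

With these made-up parameters the curve peaks at draft = 2 (about 1.66x) and declines from there, which matches the shape of the tradeoff the post describes.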

Similar Articles

@AI_jacksaku: This week’s GitHub dark horse—Unsloth speeds up AI model training 2-5× while cutting VRAM use by 80%. What does that mean? Fine-tuning a large model used to require an A100 cluster and tens of thousands of dollars. Now one RTX 4090 can finish the job in a few hours. How? By optimizing attention compute, eliminating redundant memory copies, and adding QLoRA & Flash Attention support.


Unsloth open-source tool boosts large-model fine-tuning speed 2-5× and slashes VRAM by 80%, letting a single RTX 4090 finish in hours what once needed an A100 cluster.
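For context on what that workflow looks like, here's a minimal QLoRA fine-tuning sketch following Unsloth's published quickstart pattern. Treat it as a sketch: the checkpoint name, dataset, and some trl argument names are assumptions and vary across library versions.

```python
# Sketch of a QLoRA fine-tune with Unsloth, based on its quickstart pattern.
# The checkpoint, dataset, and some trl argument names are assumptions and
# may differ across library versions.
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments

# Load a pre-quantized 4-bit base model: QLoRA keeps the frozen weights in
# 4-bit, which is where most of the VRAM savings come from.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",  # assumed example checkpoint
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach small trainable LoRA adapters; the 4-bit base model stays frozen.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                       # LoRA rank: size of the adapter matrices
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

dataset = load_dataset("imdb", split="train[:1%]")  # placeholder text dataset

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        max_steps=60,
        fp16=True,
        output_dir="outputs",
    ),
)
trainer.train()
```

The key design point is that only the low-rank adapters receive gradients, so optimizer state and activations shrink dramatically, which is how a single RTX 4090 can stand in for the A100 cluster the teaser mentions.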