@ItsmeAjayKV: Update on 3090: Now with Qwen 3.6-35b-a3b moe (q6_k_xl). Crossed 90 t/s for the very first time, no MTP yet, prefill sp…
Summary
A user reports exceeding 90 tokens per second of generation with the Qwen 3.6-35b-a3b MoE model (q6_k_xl) on an RTX 3090 using llama.cpp, with prefill speeds above 1000 t/s, a sign that running large language models locally on consumer hardware is practical.
Cached at: 06/18/26, 12:05 AM
Update on 3090: Now with Qwen 3.6-35b-a3b moe (q6_k_xl).
Crossed 90 t/s for the very first time, no MTP yet. The prefill speed improvement is also so good, crossed 1000 t/s.
I'm also going to download an even bigger Q8 model now. Previously it was not an option, now it is, and since I'm OK with reduced speed (40-50 t/s is fine) for an even better quality quant, I'll take it.
AJ (@ItsmeAjayKV): 🔓Achievement Unlocked: Running Qwen3.6-27b dense🚀🚀
Thanks to the RTX 3090, now I can do this. Running @Alibaba_Qwen Qwen 3.6 27B (Q5_K_XL from @UnslothAI)
quick llama.cpp benchmark results (without MTP):
- 1,247 tok/s prompt processing (512 token prompt)
- 35 tok/s generation
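To put the throughput figures quoted above into wall-clock terms, here is a minimal sketch that converts prefill and generation speeds into an approximate response time. It assumes both rates stay constant regardless of context length, which real runs will not match exactly. The function name `estimate_latency` and the 350-token reply length are hypothetical, chosen only for illustration.

```python
def estimate_latency(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Rough seconds to process a prompt and then generate a reply.

    Assumption: prefill and generation throughput are constant,
    which ignores slowdown at longer context lengths.
    """
    if prefill_tps <= 0 or decode_tps <= 0:
        raise ValueError("throughput must be positive")
    prefill_s = prompt_tokens / prefill_tps   # time to ingest the prompt
    decode_s = output_tokens / decode_tps     # time to emit the reply
    return prefill_s + decode_s


# 27B dense figures from the quoted post: 1,247 tok/s prefill, 35 tok/s generation.
dense_s = estimate_latency(512, 350, 1247, 35)

# 35b-a3b MoE figures from the main post, using the reported 1000 and 90 t/s floors.
moe_s = estimate_latency(512, 350, 1000, 90)

print(f"27B dense: {dense_s:.1f} s, 35b-a3b MoE: {moe_s:.1f} s")
```

Under these assumptions, generation dominates the total in both cases, so the generation rate matters far more than prefill for short prompts like the 512-token one in the benchmark.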
Similar Articles
Wow! Qwen 3.6:35b-a3b on a 3090... pretty amazing.
A user shares results from running a quantized Qwen 3.6:35b-a3b model on a used RTX 3090, reaching 160 tokens per second of output once the model fit into VRAM, and demonstrates vision capabilities with a 75-second video processing time.
@ItsmeAjayKV: Achievement Unlocked: Running Qwen3.6-27b dense Thanks to the RTX 3090, now I can do this. Running @Alibaba_Qwen Qwen 3…
User benchmarks Qwen3.6-27B on an RTX 3090 using llama.cpp, achieving 35 tok/s generation and 1247 tok/s prompt processing.
Qwen 3.5 122B MoE OC on a single 3090 at 35 t/s — full local stack breakdown
Detailed breakdown of running Qwen 3.5 122B MoE on a single RTX 3090 at 35 t/s using a custom llama.cpp fork (ik_llama.cpp) with fused MoE operations and expert offloading to CPU RAM, significantly outperforming stock llama.cpp MTP.
@seclink: Just hit 134 tok/s with Qwen 3.5-27B Dense and 73 tok/s with the new Qwen 3.6-27B on a single RTX 3090. The 2026 open-source scene is moving at lightspeed…
A single RTX 3090 pushes 134 tok/s on the fresh 27B Qwen 3.5 Dense and 73 tok/s on Qwen 3.6-27B via fused kernels plus speculative decoding, with GGUF drops the same evening.
Qwen3.6-35B-A3B Q4 262k context on 8GB 3070 Ti = +30tps
The author shares detailed tuning tips for running the Qwen3.6-35B-A3B MoE model on an 8GB RTX 3070 Ti with up to 262k context using llama.cpp, achieving 30+ tps, and notes a 25% speed boost when switching from Windows to Ubuntu Server.