@ItsmeAjayKV: Update on 3090: Now with Qwen 3.6-35b-a3b moe (q6_k_xl). Crossed 90 t/s for the very first time, no MTP yet, prefill sp…


Summary

A user reports exceeding 90 tokens per second of generation speed with the Qwen 3.6-35b-a3b MoE model (q6_k_xl quant) on an RTX 3090 using llama.cpp, with prefill speeds above 1000 t/s, suggesting that running large language models locally on consumer hardware is practical.


Cached at: 06/18/26, 12:05 AM

Update on 3090: Now with Qwen 3.6-35b-a3b MoE (q6_k_xl).

Crossed 90 t/s for the very first time, no MTP yet. The prefill speed improvement is also really good: crossed 1000 t/s.

I'm also going to download the even bigger Q8 model now. Previously it was not an option, now it is, and since I'm OK with reduced speed (40-50 t/s is fine) for an even better quality quant, I'll take it.
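
To put the quoted speeds in perspective, here is a rough back-of-the-envelope sketch of what they could mean for a single request. The workload (a 2,000-token prompt and a 500-token reply) is a made-up example, and the Q8 prefill rate is assumed to match the Q6 figure because the post gives no number for it. Only the 1000 t/s prefill, the 90 t/s generation, and the 40-50 t/s target come from the post.

```python
def estimate_latency_s(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Rough wall-clock estimate: time to read the prompt plus time to generate."""
    if prefill_tps <= 0 or decode_tps <= 0:
        raise ValueError("throughput values must be positive")
    return prompt_tokens / prefill_tps + output_tokens / decode_tps


# Hypothetical workload, not taken from the post.
PROMPT_TOKENS = 2000
OUTPUT_TOKENS = 500

# Figures reported in the post for the q6_k_xl quant.
q6_latency = estimate_latency_s(PROMPT_TOKENS, OUTPUT_TOKENS,
                                prefill_tps=1000, decode_tps=90)

# Speed range the author says is acceptable for Q8.
# Assumption: prefill stays at 1000 t/s (the post does not say).
q8_latency_at_40 = estimate_latency_s(PROMPT_TOKENS, OUTPUT_TOKENS,
                                      prefill_tps=1000, decode_tps=40)
q8_latency_at_50 = estimate_latency_s(PROMPT_TOKENS, OUTPUT_TOKENS,
                                      prefill_tps=1000, decode_tps=50)

print(f"q6_k_xl at 90 t/s: {q6_latency:.1f} s")
print(f"Q8 at 40 t/s:      {q8_latency_at_40:.1f} s")
print(f"Q8 at 50 t/s:      {q8_latency_at_50:.1f} s")
```

Under these assumptions generation time dominates, so dropping from 90 t/s to 40-50 t/s roughly doubles the wait for a reply of this length. Real numbers will vary with context length and settings.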

AJ (@ItsmeAjayKV): 🔓Achievement Unlocked: Running Qwen3.6-27b dense🚀🚀

Thanks to the RTX 3090, now I can do this. Running @Alibaba_Qwen Qwen 3.6 27B (Q5_K_XL from @UnslothAI)

quick llama.cpp benchmark results (without MTP):

  • 1,247 tok/s prompt processing (512 token prompt)
  • 35 tok/s
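
For anyone wanting to try a comparable measurement, below is a minimal sketch that assembles a `llama-bench` command (the benchmarking tool that ships with llama.cpp) with a 512-token prompt test. The post does not say which tool or flags were used, so this is an assumption; the model path, the 128-token generation length, and the `-ngl 99` offload value are illustrative placeholders, not the author's settings.

```python
import os
import shlex
import shutil
import subprocess

# Hypothetical path: point this at your own GGUF file.
MODEL_PATH = "models/qwen-27b-q5_k_xl.gguf"


def build_bench_cmd(model_path, n_prompt=512, n_gen=128, n_gpu_layers=99,
                    binary="llama-bench"):
    """Build an argument list for llama-bench.

    -m   model file
    -p   prompt length for the prompt-processing test
    -n   number of tokens for the generation test
    -ngl number of layers to offload to the GPU
    """
    return [
        binary,
        "-m", model_path,
        "-p", str(n_prompt),
        "-n", str(n_gen),
        "-ngl", str(n_gpu_layers),
    ]


cmd = build_bench_cmd(MODEL_PATH)
print(shlex.join(cmd))

# Only run when both the binary and the model file are actually present.
if shutil.which(cmd[0]) and os.path.isfile(MODEL_PATH):
    subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when the model path contains spaces.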

Similar Articles

Wow! Qwen 3.6:35b-a3b on a 3090... pretty amazing.

Reddit r/artificial

A user shares impressive results running a quantized Qwen 3.6:35b-a3b model on a used RTX 3090, achieving 160 tokens per second output after fitting the model into VRAM, and demonstrates vision capabilities with a 75-second video processing time.

Qwen3.6-35B-A3B Q4 262k context on 8GB 3070 Ti = +30tps

Reddit r/LocalLLaMA

The author shares detailed tuning tips for running the Qwen3.6-35B-A3B MoE model on an 8GB RTX 3070 Ti with up to 262k context using llama.cpp, achieving 30+ tps, and notes a 25% speed boost when switching from Windows to Ubuntu Server.