@oscarmartin: The world of AI is local, I have no doubt about it anymore @_nasch_ getting 87 tok/s with Qwen3.6 27B on a consumer AMD…
Summary
A demonstration of how the -ncmoe flag in llama.cpp significantly increases Qwen3.6 inference speed on consumer GPUs, reaching 70 tok/s on an RTX 4070 12GB versus 21 tok/s with Ollama.
Cached at: 05/30/26, 08:09 AM
The world of AI is local, I have no doubt about it anymore 💪
@nasch getting 87 tok/s with Qwen3.6 27B on a consumer AMD card.
Me, in my video: 70 tok/s with Qwen3.6 35B on a 4070 12GB.
This is moving very fast. It's exciting. https://t.co/dPqGJ8AR3P
OscarMartin (@oscarmartin): Ollama was giving me 21 tok/s with Qwen3.6 35B (12 GB VRAM).
Same model, same GPU → llama.cpp + -ncmoe 15 = 70 tok/s.
It's not magic. It's a flag that Ollama doesn't expose.
Exact command: llama-cli -m ~/models/Qwen3.6-35B-A3B-UD-IQ3_XXS.gguf -ngl 99 -ncmoe 15 -p "Hola"
Real demo here 👇
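As far as I understand llama.cpp's help text, -ncmoe N (long form --n-cpu-moe) keeps the MoE expert weights of the first N layers in CPU memory, while -ngl 99 offloads everything else to the GPU. The right N depends on how much VRAM is free, so it is usually found by trial. The sketch below only prints the tweet's command for a few candidate values so they can be tried one at a time. The helper name ncmoe_cmd and the candidate values 10, 15 and 20 are illustrative assumptions, not something from the tweet.

```shell
# Model path taken from the tweet; override by setting MODEL in the environment.
MODEL="${MODEL:-$HOME/models/Qwen3.6-35B-A3B-UD-IQ3_XXS.gguf}"

# Hypothetical helper: print the tweet's llama-cli command for a given -ncmoe value.
# $1 = number of layers whose MoE expert weights stay on the CPU.
ncmoe_cmd() {
  printf 'llama-cli -m %s -ngl 99 -ncmoe %s -p "Hola"\n' "$MODEL" "$1"
}

# Candidate values are arbitrary; 15 is the one reported in the tweet.
# This loop only prints the commands, it does not run llama-cli.
for n in 10 15 20; do
  ncmoe_cmd "$n"
done
```

A lower value leaves more expert weights on the GPU, which should be faster as long as the model still fits in VRAM. A higher value frees VRAM at the cost of more work on the CPU.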
Similar Articles
@cniongolo: I’m not sure people realize yet that you can actually run Qwen3.6-35B-A3B-Claude-4.7-Opus-abliterated-MTP-GGUF on a dua…
Demonstrates running a custom Qwen model (Qwen3.6-35B-A3B-Claude-4.7-Opus-abliterated-MTP-GGUF) on dual Nvidia RTX PRO 6000 Blackwell GPUs at 195 tokens per second using Hugging Face Inference.
@ItsmeAjayKV: Update on 3090: Now with Qwen 3.6-35b-a3b moe (q6_k_xl). Crossed 90 t/s for the very first time, no MTP yet, prefill sp…
A user reports achieving over 90 tokens per second inference speed with Qwen 3.6-35b-a3b MoE model on an RTX 3090 using llama.cpp, with prefill speeds exceeding 1000 t/s, indicating practical local deployment of large language models on consumer hardware.
@ItsmeAjayKV: Achievement Unlocked: Running Qwen3.6-27b dense Thanks to the RTX 3090, now I can do this. Running @Alibaba_Qwen Qwen 3…
User benchmarks Qwen3.6-27B on an RTX 3090 using llama.cpp, achieving 35 tok/s generation and 1247 tok/s prompt processing.
Running Qwen3.6 35b a3b on 8gb vram and 32gb ram ~190k context
The author shares a high-performance local inference configuration for running Qwen3.6 35B A3B on limited hardware (8GB VRAM, 32GB RAM) using a modified llama.cpp with TurboQuant support, achieving ~37-51 tok/sec with ~190k context.
@rohanpaul_ai: Qwen 3.6 27B on a MacBook Pro M5 Max 64GB hitting 34 tokens per sec, locally with atomic[.]chat 90% acceptance rate, i.e…
Qwen 3.6 27B achieves 34 tokens/sec on a MacBook Pro M5 Max 64GB locally with 90% draft acceptance, enabled by TurboQuant, GGUF, and llama.cpp, showcasing a major advancement in laptop-based AI inference.