@oscarmartin: The world of AI is local, I have no doubt about it anymore @_nasch_ getting 87 tok/s with Qwen3.6 27B on a consumer AMD…

X AI KOLs Following 05/29/26, 07:21 AM Tools

local-ai inference llama-cpp qwen performance gpu ollama

Summary

A demonstration of how the -ncmoe flag in llama.cpp significantly increases Qwen3.6 inference speed on consumer GPUs, reaching 70 tok/s on a 12 GB RTX 4070 versus 21 tok/s with Ollama (roughly 3.3x).
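To put the two figures in perspective, here is a small worked calculation in Python. It uses only the throughput numbers quoted in the post (21 and 70 tok/s); the 1,000-token response length and the function names are illustrative assumptions, not measurements from the post.

```python
def speedup(new_tps: float, base_tps: float) -> float:
    """Ratio of two decode throughputs (tokens per second)."""
    return new_tps / base_tps


def generation_seconds(n_tokens: int, tps: float) -> float:
    """Time to generate n_tokens at a steady decode rate, ignoring prompt processing."""
    return n_tokens / tps


OLLAMA_TPS = 21.0      # reported in the post for Ollama
LLAMA_CPP_TPS = 70.0   # reported in the post for llama.cpp with -ncmoe 15
N_TOKENS = 1000        # hypothetical response length, chosen for illustration

if __name__ == "__main__":
    print(f"speedup:   {speedup(LLAMA_CPP_TPS, OLLAMA_TPS):.2f}x")
    print(f"Ollama:    {generation_seconds(N_TOKENS, OLLAMA_TPS):.1f} s")
    print(f"llama.cpp: {generation_seconds(N_TOKENS, LLAMA_CPP_TPS):.1f} s")
```

Under these assumptions, a 1,000-token answer drops from about 48 seconds to about 14 seconds of generation time.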

The world of AI is local, I have no doubt about it anymore 💪 @_nasch_ getting 87 tok/s with Qwen3.6 27B on a consumer AMD GPU. Me, in my video: 70 tok/s with Qwen3.6 35B on a 4070 12GB. This is moving very fast. It's exciting. https://t.co/dPqGJ8AR3P

Cached at: 05/30/26, 08:09 AM

The world of AI is local, I have no doubt about it anymore 💪

@_nasch_ getting 87 tok/s with Qwen3.6 27B on a consumer AMD GPU.

Me, in my video: 70 tok/s with Qwen3.6 35B on a 4070 12GB.

This is moving very fast. It's exciting. https://t.co/dPqGJ8AR3P

OscarMartin (@oscarmartin): Ollama was giving me 21 tok/s with Qwen3.6 35B (12 GB VRAM).

Same model, same GPU → llama.cpp + -ncmoe 15 = 70 tok/s.

It's not magic. It's a flag that Ollama doesn't expose.

Exact command: llama-cli -m ~/models/Qwen3.6-35B-A3B-UD-IQ3_XXS.gguf -ngl 99 -ncmoe 15 -p "Hola"

Real demo here 👇
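For readers who want to script the command, here is a minimal Python sketch that assembles it as an argument list. The flag descriptions in the comments are a reading of llama.cpp's help text (`-ngl` as the short form of `--n-gpu-layers`, `-ncmoe` as the short form of `--n-cpu-moe`) and should be checked against the installed build. The helper name `build_llama_cli_command` is hypothetical, and the value 15 is simply the one quoted in the post, not a general recommendation.

```python
import os
import shlex


def build_llama_cli_command(model_path, n_gpu_layers=99, n_cpu_moe=15, prompt="Hola"):
    """Return the argv list for the llama-cli call quoted in the post."""
    return [
        "llama-cli",
        # Expand "~" ourselves: no shell does it when argv is passed as a list.
        "-m", os.path.expanduser(model_path),
        # Offload up to this many layers to the GPU; 99 effectively means "all".
        "-ngl", str(n_gpu_layers),
        # Keep the MoE expert weights of the first N layers in CPU memory,
        # so the rest of the model fits in limited VRAM.
        "-ncmoe", str(n_cpu_moe),
        "-p", prompt,
    ]


if __name__ == "__main__":
    cmd = build_llama_cli_command("~/models/Qwen3.6-35B-A3B-UD-IQ3_XXS.gguf")
    # Print a copy-pasteable command; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Generally, raising `n_cpu_moe` moves more expert weights into system RAM (less VRAM used, usually slower generation), while lowering it does the opposite, so the best value depends on the model and the card.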

Similar Articles

@cniongolo: I’m not sure people realize yet that you can actually run Qwen3.6-35B-A3B-Claude-4.7-Opus-abliterated-MTP-GGUF on a dua…

X AI KOLs Following

Demonstrates running a custom Qwen model (Qwen3.6-35B-A3B-Claude-4.7-Opus-abliterated-MTP-GGUF) on dual Nvidia RTX PRO 6000 Blackwell GPUs at 195 tokens per second using Hugging Face Inference.

@ItsmeAjayKV: Update on 3090: Now with Qwen 3.6-35b-a3b moe (q6_k_xl). Crossed 90 t/s for the very first time, no MTP yet, prefill sp…

X AI KOLs Timeline

A user reports achieving over 90 tokens per second of inference speed with the Qwen 3.6-35b-a3b MoE model on an RTX 3090 using llama.cpp, with prefill speeds exceeding 1000 t/s, indicating that local deployment of large language models on consumer hardware is practical.

@ItsmeAjayKV: Achievement Unlocked: Running Qwen3.6-27b dense Thanks to the RTX 3090, now I can do this. Running @Alibaba_Qwen Qwen 3…

X AI KOLs Timeline

User benchmarks Qwen3.6-27B on an RTX 3090 using llama.cpp, achieving 35 tok/s generation and 1247 tok/s prompt processing.

Running Qwen3.6 35b a3b on 8gb vram and 32gb ram ~190k context

Reddit r/LocalLLaMA

The author shares a high-performance local inference configuration for running Qwen3.6 35B A3B on limited hardware (8GB VRAM, 32GB RAM) using a modified llama.cpp with TurboQuant support, achieving ~37-51 tok/sec with ~190k context.

@rohanpaul_ai: Qwen 3.6 27B on a MacBook Pro M5 Max 64GB hitting 34tokens per sec, locally with atomic[.]chat 90% acceptance rate, i.e…