@leftcurvedev_: Anyone with 8GB or 12GB VRAM setups needs to understand that "-ncmoe" is the key flag to boost performance on llama.cpp…
Summary
Explains how the -ncmoe flag in llama.cpp improves performance for MoE models like Qwen3.6 35B A3B on limited VRAM (8-12GB) by offloading some expert layers to CPU+RAM, with benchmarks showing up to 5x speedup on an RTX 3070Ti.
Similar Articles
Experts first llama.cpp
A developer created an experimental fork of llama.cpp that offloads only used experts instead of entire layers to VRAM, achieving speed improvements for MoE models on GPUs with limited VRAM like the RTX 2060 12GB. The author is asking for testers to validate performance on other Nvidia GPUs.
Qwen3.6-35B-A3B Q4 262k context on 8GB 3070 Ti = +30tps
The author shares detailed tuning tips for running the Qwen3.6-35B-A3B MoE model on an 8GB RTX 3070 Ti with up to 262k context using llama.cpp, achieving 30+ tps, and notes a 25% speed boost when switching from Windows to Ubuntu Server.
Qwen 3.5 122B MoE OC on a single 3090 at 35 t/s — full local stack breakdown
Detailed breakdown of running Qwen 3.5 122B MoE on a single RTX 3090 at 35 t/s using a custom llama.cpp fork (ik_llama.cpp) with fused MoE operations and expert offloading to CPU RAM, significantly outperforming stock llama.cpp MTP.
Drastically improve prompt processing speed for --n-cpu-moe partially offloaded models
The article shares a performance optimization trick for llama.cpp, showing that increasing the micro-batch size (`-ub`) combined with partial CPU offloading (`--n-cpu-moe`) can drastically improve prompt processing speed for large models like gpt-oss-120b on consumer GPUs.
24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context)
A developer demonstrates running MoE models like Qwen 3.6 35B-A3B and Gemma 4 26B-A4B at 24+ tok/s on an old GTX 1080 (8GB VRAM) with 128k context using llama.cpp with MoE offloading and TurboQuant KV cache quantization, revealing optimization tricks for Gemma's MTP speculative decoding.