@leftcurvedev_: Anyone with 8GB or 12GB VRAM setups needs to understand that "-ncmoe" is the key flag to boost performance on llama.cpp…

X AI KOLs Timeline 05/08/26, 06:06 PM Tools

llama-cpp vram moe performance optimization inference local-llm

Summary

Explains how the -ncmoe flag in llama.cpp improves performance for MoE models like Qwen3.6 35B A3B on limited VRAM (8-12GB) by offloading some expert layers to CPU+RAM, with benchmarks showing up to 5x speedup on an RTX 3070Ti.

Anyone with 8GB or 12GB VRAM setups needs to understand that "-ncmoe" is the key flag to boost performance on llama.cpp Here are my results for Qwen3.6 35B A3B, with 64k q8_0 context on a 8GB RTX 3070Ti: no flag → 8.7 tok/s RAM: 13.6GB & VRAM: 7.8GB -ncmoe 35 → 27.5 tok/s RAM: 12.1GB & VRAM: 4.3GB -ncmoe 30 → 32.5 tok/s RAM: 12GB & VRAM: 5.6GB -ncmoe 25 → 40.9 tok/s RAM: 12GB & VRAM: 6.9GB Please note the ram and vram usage you see are total usage of a windows pc, with the model running. My friend's setup: 8GB VRAM and 16GB RAM. You can boost performance by switching to Linux, just something to keep in mind. Basically, this flag keeps the MoE experts in the first X layers on your CPU + RAM, instead of eating all your VRAM straight away. This is a smart hybrid offload way that lets you run bigger models without OOM while keeping the rest on your GPU for speed. As we can see on the data, there's a sweet spot. When we lower it from 35 to 25, speed bumps +50% because there are more layers on your GPU (look at the VRAM usage). The key here is to play around with the number and fit as much as possible on your VRAM, goal is to have 1GB/800MB headroom to avoid stress. ↓ server flags below

Original Article

Similar Articles

Experts first llama.cpp

Reddit r/LocalLLaMA

A developer created an experimental fork of llama.cpp that offloads only used experts instead of entire layers to VRAM, achieving speed improvements for MoE models on GPUs with limited VRAM like the RTX 2060 12GB. The author is asking for testers to validate performance on other Nvidia GPUs.

Qwen3.6-35B-A3B Q4 262k context on 8GB 3070 Ti = +30tps

Reddit r/LocalLLaMA

The author shares detailed tuning tips for running the Qwen3.6-35B-A3B MoE model on an 8GB RTX 3070 Ti with up to 262k context using llama.cpp, achieving 30+ tps, and notes a 25% speed boost when switching from Windows to Ubuntu Server.

Qwen 3.5 122B MoE OC on a single 3090 at 35 t/s — full local stack breakdown

Reddit r/openclaw

Detailed breakdown of running Qwen 3.5 122B MoE on a single RTX 3090 at 35 t/s using a custom llama.cpp fork (ik_llama.cpp) with fused MoE operations and expert offloading to CPU RAM, significantly outperforming stock llama.cpp MTP.

Drastically improve prompt processing speed for --n-cpu-moe partially offloaded models

Reddit r/LocalLLaMA

The article shares a performance optimization trick for llama.cpp, showing that increasing the micro-batch size (`-ub`) combined with partial CPU offloading (`--n-cpu-moe`) can drastically improve prompt processing speed for large models like gpt-oss-120b on consumer GPUs.

24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context)

Reddit r/LocalLLaMA

A developer demonstrates running MoE models like Qwen 3.6 35B-A3B and Gemma 4 26B-A4B at 24+ tok/s on an old GTX 1080 (8GB VRAM) with 128k context using llama.cpp with MoE offloading and TurboQuant KV cache quantization, revealing optimization tricks for Gemma's MTP speculative decoding.

Similar Articles

Experts first llama.cpp

Qwen3.6-35B-A3B Q4 262k context on 8GB 3070 Ti = +30tps

Qwen 3.5 122B MoE OC on a single 3090 at 35 t/s — full local stack breakdown

Drastically improve prompt processing speed for --n-cpu-moe partially offloaded models

24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context)

Submit Feedback