gpu-tuning

Tag

Cards List
#gpu-tuning

Drastically improve prompt processing speed for --n-cpu-moe partially offloaded models

Reddit r/LocalLLaMA · 2026-05-12

The article shares a performance optimization trick for llama.cpp, showing that increasing the micro-batch size (`-ub`) combined with partial CPU offloading (`--n-cpu-moe`) can drastically improve prompt processing speed for large models like gpt-oss-120b on consumer GPUs.

0 favorites 0 likes
← Back to home

Submit Feedback