Tag
Testing shows that default pipeline parallelism in llama.cpp wastes VRAM with no speed benefit; compiling with GGML_SCHED_MAX_COPIES=1 saves significant VRAM while maintaining identical inference speed.
The article presents benchmark results for 8 local LLMs on an RTX 3090, showing that power efficiency peaks around 225W, with diminishing returns at maximum power.