inference-performance


Pipeline parallelism in llama.cpp may be wasting your VRAM

Reddit r/LocalLLaMA · 23h ago

Testing shows that default pipeline parallelism in llama.cpp wastes VRAM with no speed benefit; compiling with GGML_SCHED_MAX_COPIES=1 saves significant VRAM while maintaining identical inference speed.
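
As a rough sketch of what that build change might look like: the summary only names the GGML_SCHED_MAX_COPIES=1 setting. The commands below assume a standard CMake build of llama.cpp from the repository root, and assume a CUDA backend (GGML_CUDA=ON), which the summary does not state. Adjust the backend flag to match your hardware.

```shell
# Assumption: run from the root of a llama.cpp checkout, with CMake and the CUDA toolkit installed.
# GGML_SCHED_MAX_COPIES sets the number of input copies the scheduler keeps for pipeline parallelism;
# setting it to 1 is the change the post describes.
cmake -B build -DGGML_CUDA=ON -DGGML_SCHED_MAX_COPIES=1

# Build in Release mode using parallel jobs.
cmake --build build --config Release -j
```

Whether this saves VRAM without a speed cost on a given setup is worth checking by comparing memory use and tokens per second before and after the rebuild.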


@Snixtp: More efficiency tests on a single 3090 TL;DR: - I tested 8 local LLMs on a single RTX 3090, power limit from 100W to 45…

X AI KOLs Following · 2026-05-08

Benchmark results for 8 local LLMs on a single RTX 3090 show that power efficiency peaks at a power limit of around 225W, with diminishing returns at maximum power.
