Tag
The author states they will immediately order an M5 Ultra Mac Studio with max RAM if Apple releases it soon, citing the M3 Ultra's high resale value and the M5's inference performance leap as reasons.
Charles Frye announces a blog post detailing contributions to FA4 internals, focusing on inference performance improvements that have been upstreamed.
Testing shows that default pipeline parallelism in llama.cpp wastes VRAM with no speed benefit; compiling with GGML_SCHED_MAX_COPIES=1 saves significant VRAM while maintaining identical inference speed.
The article presents benchmark results for 8 local LLMs on an RTX 3090, showing that power efficiency peaks around 225W, with diminishing returns at maximum power.