inference-performance

#inference-performance

@jun_song: If Apple drops the M5 Ultra Mac Studio soon, I am ordering it with max RAM instantly. No time to hesitate. The M3 Ultra…

X AI KOLs Following ↗ · 2026-06-21

The author states they will immediately order an M5 Ultra Mac Studio with max RAM if Apple releases it soon, citing the M3 Ultra's high resale value and the M5's inference performance leap as reasons.

@derangineer: the goats in the game

X AI KOLs Following ↗ · 2026-06-11

Charles Frye announces a blog post detailing contributions to the internals of FA4 (FlashAttention-4), focusing on inference performance improvements that have been upstreamed.

Pipeline parallelism in llama.cpp may be wasting your VRAM

Reddit r/LocalLLaMA ↗ · 2026-06-08

The poster's testing suggests that llama.cpp's default pipeline parallelism consumes extra VRAM with no speed benefit; building with GGML_SCHED_MAX_COPIES=1 saved significant VRAM while inference speed stayed identical.

@Snixtp: More efficiency tests on a single 3090 TL;DR: - I tested 8 local LLMs on a single RTX 3090, power limit from 100W to 45…

X AI KOLs Following ↗ · 2026-05-08

The post presents benchmark results for 8 local LLMs on a single RTX 3090 across a range of power limits, showing that power efficiency peaks around 225W, with diminishing returns at maximum power.
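
One way to read "power efficiency" in a test like this is tokens generated per joule, meaning throughput divided by power. The sketch below assumes that definition and assumes the GPU draws roughly its configured power limit. The `Run` type, the function names, and every number in `runs` are hypothetical illustrations, not the post's data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    power_limit_w: float  # GPU power cap in watts (assumed close to actual draw)
    tokens_per_s: float   # measured generation throughput at that cap


def tokens_per_joule(run: Run) -> float:
    """Efficiency as throughput divided by power: tokens generated per joule."""
    return run.tokens_per_s / run.power_limit_w


def most_efficient(runs: list[Run]) -> Run:
    """Return the run with the highest tokens-per-joule figure."""
    return max(runs, key=tokens_per_joule)


# Hypothetical measurements for illustration only, NOT the figures from the post.
runs = [
    Run(100, 20.0),
    Run(150, 36.0),
    Run(225, 63.0),
    Run(300, 72.0),
    Run(350, 77.0),
]

best = most_efficient(runs)
print(f"best efficiency at {best.power_limit_w:.0f} W: "
      f"{tokens_per_joule(best):.3f} tok/J")
```

With data shaped like this, throughput keeps rising as the cap goes up, but each extra watt buys fewer tokens, which is the diminishing-returns pattern the summary describes.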
