I compared all specs of the major GPUs/machines that are being used here, because bandwidth is not everything. Some of y'all need a reality check.

Reddit r/LocalLLaMA 05/30/26, 12:44 AM News

gpu-comparison llm-inference hardware-benchmarks mac-studio nvidia amd prefill-performance

Summary

The author compares various GPUs for LLM inference, critiquing common benchmarks and emphasizing the importance of prefill performance over generation speed, offering recommendations for different budgets and use cases.

Hot takes:

- The Mac Studio is an overpriced Raspberry Pi that is far less efficient than people think (along with most Macs). The M5 MBP is better with the "tensor" matrix MMA, but not by much.
- Spark was actually decent when it was just 3-4k. Strix is obviously much better now.
- 3090s are complete overkill for single-stream usage; V100s are much better value if you can find them cheap. P40s are very niche, but decent if you want exactly 48GB of VRAM, run MoE models, and don't have money for MI50s or V100s.
- P100s are extremely underrated entry-level LLM GPUs that are not talked about enough. 200 bucks (dual GPU) for a combined 32GB of 700GB/s memory and 70% of M3 Ultra compute is crazy.

I understand that this sub is now filled with gamers who do nothing but ERP with anime waifus on their setups, but for people who do something actually productive, prefill is still very important, and it is completely hidden by the "generate a 1000-word story" benchmarks that most posts and big AI YouTube channels run. This matters especially with multimodal models, which eat up context like mad.

I'm still collecting data for the prefill and generation charts I'd like to do in the future. I also couldn't find much reliable power data, so if you can provide that from your own setups in the comments, I'd be glad.

Thanks for coming to my TED talk.

Edit: Grammar shenanigans
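To make the prefill-versus-generation point concrete, here is a rough back-of-envelope sketch. It is not from the post. It assumes the usual simplification that generation is limited by memory bandwidth (every active weight is read once per token) and that prefill is limited by compute (about 2 FLOPs per active parameter per token). The function names and the two example machines are hypothetical, and the compute figures are placeholders, not specs or measurements of any real hardware.

```python
def decode_tokens_per_s(bandwidth_gb_s, active_params_b, bytes_per_param):
    """Rough ceiling on generation speed, assuming it is bandwidth-bound:
    each generated token reads all active weights from memory once."""
    weight_gb = active_params_b * bytes_per_param
    return bandwidth_gb_s / weight_gb


def prefill_tokens_per_s(compute_tflops, active_params_b):
    """Rough ceiling on prompt processing, assuming it is compute-bound:
    about 2 FLOPs per active parameter per prompt token."""
    flops_per_token = 2 * active_params_b * 1e9
    return compute_tflops * 1e12 / flops_per_token


def time_to_first_token_s(prompt_tokens, compute_tflops, active_params_b):
    """Seconds spent on prefill before the first token can be generated."""
    return prompt_tokens / prefill_tokens_per_s(compute_tflops, active_params_b)


# Two HYPOTHETICAL machines with identical bandwidth but different compute.
# All numbers are placeholders chosen for easy arithmetic.
machines = {
    "low_compute": {"bandwidth_gb_s": 700, "compute_tflops": 10},
    "high_compute": {"bandwidth_gb_s": 700, "compute_tflops": 40},
}

# Hypothetical model: 8B active parameters stored at 1 byte each.
ACTIVE_PARAMS_B = 8
BYTES_PER_PARAM = 1
PROMPT_TOKENS = 32_000

for name, m in machines.items():
    tg = decode_tokens_per_s(m["bandwidth_gb_s"], ACTIVE_PARAMS_B, BYTES_PER_PARAM)
    pp = prefill_tokens_per_s(m["compute_tflops"], ACTIVE_PARAMS_B)
    ttft = time_to_first_token_s(PROMPT_TOKENS, m["compute_tflops"], ACTIVE_PARAMS_B)
    print(f"{name}: generation ~{tg:.1f} tok/s, prefill ~{pp:.0f} tok/s, "
          f"first token after ~{ttft:.1f} s")
```

Under these assumptions both machines generate at the same speed, so a short "write a story" benchmark cannot tell them apart, while the wait before the first token on a long prompt differs by 4x. Real results sit below these ceilings and depend on kernels, quantization, and batch size, so treat this only as a way to reason about which number a benchmark is actually exercising.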


Similar Articles

Ran the same models across Strix Halo, RTX 3090, and RTX 5070 because I wanted my own numbers

Memory Bandwidth for Local AI Hardware (2026 Edition)

Benchmarks of 20 small LLMs on a 6GB RTX 4050

Small comparison on full compute performance (Anima) of 5090 (600,475 and 400W) vs 6000 PRO MaxQ (325W), and 6000 PRO WS/SE (600W).

(Rant ;)) Make your benchmarks realistic
