gpu-comparison

#gpu-comparison

I compared all specs of the major GPUs/machines that are being used here, because bandwidth is not everything. Some of y'all need a reality check.

Reddit r/LocalLLaMA ↗ · 2026-05-30

The author compares various GPUs for LLM inference, critiquing common benchmarks and emphasizing the importance of prefill performance over generation speed, offering recommendations for different budgets and use cases.

#gpu-comparison

Ran the same models across Strix Halo, RTX 3090, and RTX 5070 because I wanted my own numbers

Reddit r/LocalLLaMA ↗ · 2026-05-16

The author completed 55 inference benchmark runs across Strix Halo, RTX 3090, and RTX 5070 with multiple backends. The results show that memory bandwidth dominates decode speed, that the RTX 5070 beats the RTX 3090 on small models, and that reasoning models appear ~5x slower due to hidden reasoning content.

#gpu-comparison

Linux - Why does llama.cpp ROCm consume SO much VRAM for KV cache compared to Vulkan?

Reddit r/LocalLLaMA ↗ · 2026-05-14

A user reports that llama.cpp with ROCm consumes significantly more VRAM for the KV cache than the Vulkan backend, despite identical model and settings, prompting investigation into potential causes.
