multi-gpu

#multi-gpu

I accidentally crippled my 4x RTX 3090 LLM rig with a hidden PCIe 2.0 x4 slot and fixing it doubled Mistral 128B performance

Reddit r/LocalLLaMA ↗ · 5d ago

A user discovered that a hidden PCIe 2.0 x4 electrical limitation on a Threadripper workstation board was crippling one of four RTX 3090s, causing poor multi-GPU LLM inference performance. Fixing the slot layout and switching to tensor split mode more than doubled Mistral 128B throughput, from ~11 to ~24.7 tok/s.

@ggerganov: Highlighting recent advances in multi-GPU and tensor parallel support in llama.cpp Over the last few months llama.cpp m…

X AI KOLs Following ↗ · 5d ago

llama.cpp maintainers and NVIDIA engineers collaborated to significantly improve multi-GPU performance in ggml, enabling hardware-agnostic tensor parallelism and major performance gains on RTX systems.

Tensor split mode: CUDA error on latest llama.cpp with Qwen-3.6-27b

Reddit r/LocalLLaMA ↗ · 6d ago

A user reports a CUDA error when using tensor split mode with the latest llama.cpp and the Qwen-3.6-27b model on dual RTX 3090s running Ubuntu Server 24.04 and Docker.

ICYM: llama.cpp b9455 --SM Tensor KV Cache Fix is MERGED

Reddit r/LocalLLaMA ↗ · 2026-06-01

llama.cpp version b9455 merges a fix for `-sm tensor` KV cache quantization on multi-GPU setups, addressing a shape information loss issue when flattening tensors.

Llama.cpp : Split Mode Tensor Fix Incoming?

Reddit r/LocalLLaMA ↗ · 2026-05-25

llama.cpp is expected to receive a fix for split mode tensor crashes on multi-GPU setups, which currently cause VRAM exhaustion every 90-120 minutes. The fix also reportedly brings a ~35% throughput improvement over layer mode.

@levidiamode: Day 138/365 of GPU Programming One of my favorite lectures I've watched this year is Stanford's CS336 lecture 7 on GPU …

X AI KOLs Timeline ↗ · 2026-05-21

A learner shares enthusiasm for Stanford CS336 lecture 7 on GPU parallelism, which covers fundamental operations and connects them to multi-GPU setups and parallelism techniques like tensor, data, and pipeline parallelism.

@barrowjoseph: https://x.com/barrowjoseph/status/2056417511826989310

X AI KOLs Timeline ↗ · 2026-05-18

A grad student shares their experience building a multi-GPU workstation with 4x3090 Ti running on a single US wall outlet, detailing constraints, power-limiting challenges, and component choices.

EnergyLens: Predictive Energy-Aware Exploration for Multi-GPU LLM Inference Optimization

arXiv cs.LG ↗ · 2026-05-15

EnergyLens is an end-to-end framework for predictive energy-aware optimization of multi-GPU LLM inference, validated on Llama3 and Qwen3-MoE, achieving mean absolute percentage errors between 9.25% and 13.19% and revealing significant energy variation across configurations.

NCCL-Free Tensor Parallelism on Dual Blackwell PCIe llama.cpp b9095 released!

Reddit r/LocalLLaMA ↗ · 2026-05-10

llama.cpp build b9095 introduces NCCL-free tensor parallelism for dual Blackwell PCIe GPUs, enabling efficient multi-GPU inference without relying on NCCL.

What speed is everyone getting on Qwen3.6 27b?

Reddit r/LocalLLaMA ↗ · 2026-04-22

A user benchmarks Qwen3.6-27B-Q8_0 at ~13 tokens/sec on 3 mixed GPUs with 128k context via llama.cpp and asks whether that performance is typical.
