A user discovered that a hidden PCIe 2.0 x4 electrical limitation on a Threadripper workstation board was crippling one of four RTX 3090s, causing poor multi-GPU LLM inference performance. Fixing the slot layout and switching to tensor split mode doubled Mistral 128B throughput from ~11 to ~24.7 tok/s.
llama.cpp maintainers and NVIDIA engineers collaborated to significantly improve multi-GPU performance in ggml, enabling hardware-agnostic tensor parallelism and major performance gains on RTX systems.
A user reports a CUDA error when using tensor split mode with the latest llama.cpp and the Qwen-3.6-27b model on dual RTX 3090s running Ubuntu Server 24.04 and Docker.
llama.cpp version b9455 merges a fix for `-sm tensor` KV cache quantization on multi-GPU setups, addressing a shape information loss issue when flattening tensors.
llama.cpp is expected to receive a fix for tensor split mode crashes on multi-GPU setups, which currently cause VRAM exhaustion every 90-120 minutes. The fix also reportedly brings a ~35% throughput improvement over layer mode.
A learner shares enthusiasm for Stanford CS336 lecture 7 on GPU parallelism, which covers fundamental operations and connects them to multi-GPU setups and parallelism techniques like tensor, data, and pipeline parallelism.
A grad student shares their experience building a multi-GPU workstation with 4x3090 Ti running on a single US wall outlet, detailing constraints, power-limiting challenges, and component choices.
EnergyLens is an end-to-end framework for predictive energy-aware optimization of multi-GPU LLM inference, validated on Llama3 and Qwen3-MoE, achieving mean absolute percentage errors between 9.25% and 13.19% and revealing significant energy variation across configurations.
llama.cpp build b9095 introduces NCCL-free tensor parallelism for dual Blackwell PCIe GPUs, enabling efficient multi-GPU inference without relying on NCCL.
A user benchmarks Qwen3.6-27B-Q8_0 at ~13 tokens/sec on three mixed GPUs with 128k context via llama.cpp and asks whether that performance is typical.
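
To put the PCIe 2.0 x4 limitation from the Threadripper item in perspective, here is a minimal sketch of the theoretical link bandwidth arithmetic. The per-lane rates and encodings are the published PCIe figures; the comparison against a PCIe 4.0 x16 link is an assumption for illustration, since the summary does not say what the other three slots ran at. The helper name `pcie_bandwidth_gb_s` is hypothetical.

```python
# Per-lane transfer rate (GT/s) and line-encoding efficiency per PCIe generation.
PCIE_GENERATIONS = {
    2: (5.0, 8 / 10),      # 8b/10b encoding
    3: (8.0, 128 / 130),   # 128b/130b encoding
    4: (16.0, 128 / 130),  # 128b/130b encoding
}


def pcie_bandwidth_gb_s(generation: int, lanes: int) -> float:
    """Theoretical one-direction bandwidth in GB/s, before protocol overhead."""
    rate_gt_s, efficiency = PCIE_GENERATIONS[generation]
    return rate_gt_s * efficiency * lanes / 8  # divide by 8: bits -> bytes


# The limited slot from the summary.
slow = pcie_bandwidth_gb_s(2, 4)
# Assumed comparison point: a full PCIe 4.0 x16 link (not stated in the summary).
full = pcie_bandwidth_gb_s(4, 16)

print(f"PCIe 2.0 x4:  {slow:.1f} GB/s")
print(f"PCIe 4.0 x16: {full:.1f} GB/s")
print(f"ratio: {full / slow:.1f}x")
```

Under these assumptions the limited slot tops out at 2 GB/s per direction, against roughly 31.5 GB/s for a full-width PCIe 4.0 link. Real-world throughput is lower on both once protocol overhead is included.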