A user discovered that a hidden PCIe 2.0 x4 electrical limitation on a Threadripper workstation board was crippling one of four RTX 3090s, causing poor multi-GPU LLM inference performance. Fixing the slot layout and switching llama.cpp to `--split-mode tensor` more than doubled Mistral Medium 3.5 128B Q4_K throughput, from ~11 to ~24.7 tok/s.
I’m posting this as a warning for anyone building multi-GPU local LLM rigs with older workstation/HEDT boards.

My setup (Node #04):

* Gigabyte X399 Designare EX
* Threadripper 1950X
* 128GB DDR4
* 4x RTX 3090
* 10GbE TP-Link/Aquantia NIC
* llama.cpp NCCL build
* vLLM for safetensors models

I was getting weirdly disappointing multi-GPU results. The rig worked, all 4 GPUs were detected, VRAM was available, models loaded, but some workloads were underwhelming.

Example: Mistral Medium 3.5 128B Q4_K GGUF was only doing around 11 tok/s with low GPU usage, roughly 30%.

I assumed it was a backend/model/split/NCCL issue. Turns out one of the 3090s was sitting in a physical x16 slot that is electrically PCIe 2.0 x4 on this board. Even worse, before fixing BIOS/settings/placement, Linux showed that GPU negotiating as low as Gen2 x1 / Gen1 x4.

The smoking gun:

```bash
nvidia-smi --query-gpu=index,name,pci.bus_id,pcie.link.gen.current,pcie.link.width.current,pcie.link.gen.max,pcie.link.width.max --format=csv
```

Bad layout showed one GPU effectively crippled. After moving the cards around, the GPUs now show:

```text
GPU0: Gen3 max, x8
GPU1: Gen3 max, x16
GPU2: Gen3 max, x8
GPU3: Gen3 max, x16
```

The hidden mistake was that the board has multiple physical x16-length slots, but not all are electrically equal. The PCIe 2.0 x4 slot belongs to the NIC, not a 3090.

After fixing the slot layout, results changed dramatically.

Qwen3.6 27B BF16 with vLLM TP=4 + MTP at 260K context:

```text
~78-80 tok/s generation
~80% draft acceptance rate
```

Qwen3.6 27B BF16 GGUF with llama.cpp NCCL build, `--split-mode tensor`, MTP enabled:

```text
~66.5 tok/s
~85% draft acceptance
```

Mistral Medium 3.5 128B Q4_K GGUF with llama.cpp:

Before, using `--split-mode layer`:

```text
~11 tok/s
low GPU utilization
```

After switching to proper PCIe layout and using:

```bash
--split-mode tensor --tensor-split 25,25,25,25
```

Result:

```text
~24.7 tok/s
```

So the lessons:

1. Do not trust physical slot length. Check electrical lane layout in the motherboard manual.
2. Always verify real negotiated PCIe width/speed from Linux.
3. `nvidia-smi` and `lspci -vv` are your friends.
4. On llama.cpp, `--split-mode layer` can badly underuse GPUs for some large GGUF models.
5. `--split-mode tensor` made a huge difference for my Mistral 128B GGUF test.
6. If one GPU is accidentally on a bad PCIe path, the whole multi-GPU inference setup can look like a backend problem when it is actually a slot layout problem.

Useful commands:

```bash
nvidia-smi topo -m
```

```bash
nvidia-smi --query-gpu=index,name,pci.bus_id,pcie.link.gen.current,pcie.link.width.current,pcie.link.gen.max,pcie.link.width.max --format=csv
```

```bash
for B in 09:00.0 0a:00.0 41:00.0 42:00.0; do
  echo "===== $B ====="
  sudo lspci -vv -s "$B" | grep -E "LnkCap|LnkSta"
done
```

If you are building a “cheap VRAM monster” with used 3090s, check this before blaming NCCL, llama.cpp, vLLM, quantization, or the model.

In my case, fixing PCIe slot placement turned the rig from “why is this so underwhelming?” into “okay, this thing is actually a monster.”
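For anyone who wants to automate the "verify the negotiated link" lesson, here is a rough sketch that flags any GPU whose current PCIe generation or width falls below a floor you choose. The function name `check_pcie_links` and the default floor of Gen3 x8 are my own illustrative choices, not something from the rig above. It expects the `csv,noheader` form of the `nvidia-smi` query with only the current-link fields.

```bash
#!/usr/bin/env bash
# Hypothetical helper: flag GPUs whose negotiated PCIe link is below a floor.
# Input on stdin: CSV rows of index, name, bus_id, gen_current, width_current
# (the shape nvidia-smi prints with --format=csv,noheader).
# Arguments: minimum generation (default 3) and minimum width (default 8).
# Both defaults are assumptions for illustration; pick what your board should deliver.
check_pcie_links() {
  local min_gen="${1:-3}" min_width="${2:-8}"
  awk -F', *' -v g="$min_gen" -v w="$min_width" '
    {
      # $4 = current link generation, $5 = current link width
      status = ($4 + 0 < g || $5 + 0 < w) ? "SLOW" : "OK"
      printf "%s GPU%s %s Gen%s x%s\n", status, $1, $3, $4, $5
    }'
}
```

On a live box it would be fed like this: `nvidia-smi --query-gpu=index,name,pci.bus_id,pcie.link.gen.current,pcie.link.width.current --format=csv,noheader | check_pcie_links 3 8`. One caveat: the current link generation can read lower while a GPU is idle because of link power management, so it is worth running the check while the cards are under load before concluding that a slot is the problem.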
A user shares power limit testing on a 4x RTX 3090 setup running Qwen3.6-27B with vLLM, finding 220W as the sweet spot for peak efficiency with minimal throughput loss.
A user details their modding and benchmarking of an AMD Strix Halo system with dual RTX 3090 eGPUs and NVLink, finding improvements in LLM inference speed for dense models, especially with vLLM, and discusses power efficiency trade-offs.
The article presents benchmark results for 8 local LLMs on an RTX 3090, showing that power efficiency peaks around 225W, with diminishing returns at maximum power.
A benchmark analysis of Qwen 3.6 27B MTP on 4x RTX 3090 GPUs, demonstrating that using NVLink for tensor parallelism yields significant throughput improvements (up to +53%) over PCIe configurations.
A user shares their setup using two modded RTX 2080 Ti GPUs with 22GB VRAM each to run Qwen 3.6 27B at 38 tokens/s with llama.cpp, including tips on power limiting, tensor split mode, and KV cache settings.