I accidentally crippled my 4x RTX 3090 LLM rig with a hidden PCIe 2.0 x4 slot and fixing it doubled Mistral 128B performance

Reddit r/LocalLLaMA 06/04/26, 04:45 PM News

multi-gpu llm-inference pcie local-llm hardware llama-cpp performance-optimization

Summary

A user discovered that a hidden PCIe 2.0 x4 electrical limitation on a Threadripper workstation board was crippling one of four RTX 3090s, causing poor multi-GPU LLM inference performance. Fixing the slot layout and switching to tensor split mode doubled Mistral 128B throughput from ~11 to ~24.7 tok/s.
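To put the slot limitation in numbers, here is a rough sketch of the theoretical one-direction bandwidth for the link configurations mentioned in the post. The helper name `pcie_gbps` is hypothetical. The per-lane rates are the standard PCIe figures (Gen1 and Gen2 use 8b/10b encoding, Gen3 uses 128b/130b). These are ceilings, not measurements from this rig.

```shell
#!/usr/bin/env bash
# pcie_gbps GEN LANES -> theoretical one-direction bandwidth in GB/s.
# Hypothetical helper for illustration; real-world throughput is lower.
pcie_gbps() {
  awk -v gen="$1" -v lanes="$2" 'BEGIN {
    # GT/s per lane * encoding efficiency / 8 bits per byte
    if      (gen == 1) per_lane = 2.5 * 8 / 10 / 8
    else if (gen == 2) per_lane = 5.0 * 8 / 10 / 8
    else if (gen == 3) per_lane = 8.0 * 128 / 130 / 8
    else { print "unsupported generation" > "/dev/stderr"; exit 1 }
    printf "%.2f\n", per_lane * lanes
  }'
}

pcie_gbps 2 4    # the hidden slot: prints 2.00
pcie_gbps 3 8    # prints 7.88
pcie_gbps 3 16   # prints 15.75
```

On paper, a Gen2 x4 link has roughly a quarter of the bandwidth of Gen3 x8 and an eighth of Gen3 x16.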

I’m posting this as a warning for anyone building multi-GPU local LLM rigs with older workstation/HEDT boards.

My setup (Node #04)

* Gigabyte X399 Designare EX
* Threadripper 1950X
* 128GB DDR4
* 4x RTX 3090
* 10GbE TP-Link/Aquantia NIC
* llama.cpp NCCL build
* vLLM for safetensors models

I was getting weirdly disappointing multi-GPU results. The rig worked, all 4 GPUs were detected, VRAM was available, models loaded, but some workloads were underwhelming.

Example: Mistral Medium 3.5 128B Q4_K GGUF was only doing around 11 tok/s with low GPU usage, roughly 30%. I assumed it was a backend/model/split/NCCL issue.

Turns out one of the 3090s was sitting in a physical x16 slot that is electrically PCIe 2.0 x4 on this board. Even worse, before fixing BIOS/settings/placement, Linux showed that GPU negotiating as low as Gen2 x1 / Gen1 x4.

The smoking gun:

```bash
nvidia-smi --query-gpu=index,name,pci.bus_id,pcie.link.gen.current,pcie.link.width.current,pcie.link.gen.max,pcie.link.width.max --format=csv
```

Bad layout showed one GPU effectively crippled. After moving the cards around, the GPUs now show:

```text
GPU0: Gen3 max, x8
GPU1: Gen3 max, x16
GPU2: Gen3 max, x8
GPU3: Gen3 max, x16
```

The hidden mistake was that the board has multiple physical x16-length slots, but not all are electrically equal. The PCIe 2.0 x4 slot belongs to the NIC, not a 3090.

After fixing the slot layout, results changed dramatically.

Qwen3.6 27B BF16 with vLLM TP=4 + MTP at 260K context:

```text
~78-80 tok/s generation
~80% draft acceptance rate
```

Qwen3.6 27B BF16 GGUF with llama.cpp NCCL build, `--split-mode tensor`, MTP enabled:

```text
~66.5 tok/s
~85% draft acceptance
```

Mistral Medium 3.5 128B Q4_K GGUF with llama.cpp:

Before, using `--split-mode layer`:

```text
~11 tok/s
low GPU utilization
```

After switching to proper PCIe layout and using:

```bash
--split-mode tensor --tensor-split 25,25,25,25
```

Result:

```text
~24.7 tok/s
```

So the lessons:

1. Do not trust physical slot length. Check electrical lane layout in the motherboard manual.
2. Always verify real negotiated PCIe width/speed from Linux.
3. `nvidia-smi` and `lspci -vv` are your friends.
4. On llama.cpp, `--split-mode layer` can badly underuse GPUs for some large GGUF models.
5. `--split-mode tensor` made a huge difference for my Mistral 128B GGUF test.
6. If one GPU is accidentally on a bad PCIe path, the whole multi-GPU inference setup can look like a backend problem when it is actually a slot layout problem.

Useful commands:

```bash
nvidia-smi topo -m
```

```bash
nvidia-smi --query-gpu=index,name,pci.bus_id,pcie.link.gen.current,pcie.link.width.current,pcie.link.gen.max,pcie.link.width.max --format=csv
```

```bash
for B in 09:00.0 0a:00.0 41:00.0 42:00.0; do
  echo "===== $B ====="
  sudo lspci -vv -s "$B" | grep -E "LnkCap|LnkSta"
done
```

If you are building a “cheap VRAM monster” with used 3090s, check this before blaming NCCL, llama.cpp, vLLM, quantization, or the model.

In my case, fixing PCIe slot placement turned the rig from “why is this so underwhelming?” into “okay, this thing is actually a monster.”
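The negotiated-link check from the lessons above could be scripted roughly like this. It is a minimal sketch, not something from the original post: `check_pcie_links` is a hypothetical helper, and the default thresholds (Gen3, x8) are an assumption taken from the slowest link in the fixed layout. It expects the same `--query-gpu` field order as the command above, with `--format=csv,noheader` so there is no header row.

```shell
#!/usr/bin/env bash
# check_pcie_links [MIN_GEN] [MIN_WIDTH]
# Hypothetical helper. Reads rows on stdin in the field order
#   index, name, pci.bus_id, gen.current, width.current, gen.max, width.max
# and prints one WARN/ok line per GPU based on the *current* link.
check_pcie_links() {
  local min_gen="${1:-3}" min_width="${2:-8}"   # assumed thresholds
  awk -F', *' -v min_gen="$min_gen" -v min_width="$min_width" '
    {
      gen = $4 + 0; width = $5 + 0
      status = (gen < min_gen || width < min_width) ? "WARN" : "ok  "
      printf "%s GPU%s (%s): Gen%d x%d\n", status, $1, $3, gen, width
    }'
}

# Usage on a live system (left commented out so the snippet has no side effects):
# nvidia-smi --query-gpu=index,name,pci.bus_id,pcie.link.gen.current,pcie.link.width.current,pcie.link.gen.max,pcie.link.width.max \
#   --format=csv,noheader | check_pcie_links 3 8
```

Run it while the GPUs are under load: idle cards can drop to a lower link generation to save power, so an idle reading may show a warning for a link that is actually fine.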
