Weird to get near linear scaling by adding another GPU?

Reddit r/LocalLLaMA 06/08/26, 08:26 AM News

gpu-scaling inference-benchmark qwen3 tensor-parallelism decode-tps 3090

Summary

A user reports near-linear performance scaling when adding a second RTX 3090 for inference with qwen3.6-27b-autoround-int4: mean decode TPS rose roughly 1.8x to 1.9x (53 to 94 on narrative, 62 to 120 on code) using tensor parallelism over PCIe, without NVLink.

Single-stream benchmarks (club-3090)

model: qwen3.6-27b-autoround-int4

**BEFORE:** 1x3090

*Their default script recipe for a single 3090 (4-bit quant and 4-bit kv cache, mtp=2)*

NARRATIVE decode\_TPS: mean = **53**, std = **0.6**

CODE decode\_TPS: mean = **62**, std = **1.4**

**AFTER:** 2x3090

*Their default script recipe for dual 3090s (4-bit quant and 8-bit kv cache, mtp=3)*

NARRATIVE decode\_TPS: mean = **94**, std = **1.3**

CODE decode\_TPS: mean = **120**, std = **2.1**

This is running *without NVLink*, on an x8/x8 motherboard. For some reason P2P was automatically enabled (no driver hack needed). Tensor parallelism = 2.

I am truly shocked that I got almost linear scaling in performance.

I still get odd parsing errors in my quality tests when editing large code files in Agent mode (VSCode), though not the same ones as before. For some reason, forcing the model to use CLI editing tools is much more reliable than whatever VSCode is doing with the Agent.

I am likely going to move to their 8-bit weight model recipe as well.
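To put a number on "almost linear", here is a minimal sketch that computes the speedup and scaling efficiency from the mean decode TPS figures quoted above. The helper name `scaling` and the dictionary layout are illustrative only and are not part of the club-3090 scripts. Keep in mind that the two recipes also differ in kv cache precision and mtp setting, so the ratio is not a pure measure of the effect of the second GPU.

```python
# Mean decode throughput (tokens/s) as reported in the post.
single_gpu = {"narrative": 53.0, "code": 62.0}   # 1x3090 recipe
dual_gpu = {"narrative": 94.0, "code": 120.0}    # 2x3090 recipe, TP=2


def scaling(before: float, after: float, n_gpus: int = 2) -> tuple[float, float]:
    """Return (speedup, efficiency), where efficiency = speedup / n_gpus.

    Hypothetical helper for illustration; an efficiency of 1.0 would be
    perfectly linear scaling.
    """
    speedup = after / before
    return speedup, speedup / n_gpus


results = {name: scaling(single_gpu[name], dual_gpu[name]) for name in single_gpu}

for name, (speedup, efficiency) in results.items():
    print(f"{name}: {speedup:.2f}x speedup, {efficiency:.0%} of ideal 2x scaling")
```

With the reported means this works out to about 1.77x (89% of ideal) on the narrative workload and about 1.94x (97% of ideal) on the code workload.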


Similar Articles

Benchmark Qwen 3.6 27B MTP on 2x3090 NVLINK

Reddit r/LocalLLaMA

A benchmark analysis of Qwen 3.6 27B MTP on 4x RTX 3090 GPUs, demonstrating that using NVLink for tensor parallelism yields significant throughput improvements (up to +53%) over PCIe configurations.

@rumgewieselt: Now its getting crazy ... 3x 1080 Ti (Pascal, 33GB VRAM) Qwen 3.6 27B MTP with 196K TurboQuant ~28-30 t/s consistently

X AI KOLs Timeline

A user demonstrates successful local inference of a 27B parameter Qwen model across three GTX 1080 Ti GPUs, achieving approximately 28-30 tokens per second using TurboQuant optimization.

RTX Pro 4500 Blackwell Performance Numbers

Reddit r/LocalLLaMA

A user shares performance benchmarks comparing the Nvidia RTX Pro 4500 Blackwell 32GB GPU against the RTX 5060 Ti 16GB for AI inference, showing 1.6-6x speed improvements depending on model size and quantization.

@leopardracer: https://x.com/leopardracer/status/2055341758523883631

X AI KOLs Timeline

A user shares their experience setting up a dual-GPU local AI lab with RTX 4080 Super and 5060 Ti, running Qwen 3.6 models via llama.cpp and llama-swap to reduce API costs and enable unrestricted experimentation.

I accidentally crippled my 4x RTX 3090 LLM rig with a hidden PCIe 2.0 x4 slot and fixing it doubled Mistral 128B performance

Reddit r/LocalLLaMA

A user discovered that a hidden PCIe 2.0 x4 electrical limitation on a Threadripper workstation board was crippling one of four RTX 3090s, causing poor multi-GPU LLM inference performance. Fixing the slot layout and switching to tensor split mode doubled Mistral 128B throughput from ~11 to ~24.7 tok/s.
