Weird to get near linear scaling by adding another GPU?

Reddit r/LocalLLaMA News

Summary

A user reports near-linear performance scaling when adding a second RTX 3090 for inference with a Qwen3.6 27B INT4 model: mean decode TPS rose by roughly 1.8x to 1.9x (53 to 94 on narrative prompts, 62 to 120 on code prompts) without NVLink.

Single-stream benchmarks (club-3090)

Model: qwen3.6-27b-autoround-int4

**BEFORE:** 1x3090

*Their default script recipe for a single 3090 (4-bit quant and 4-bit KV cache, mtp=2)*

NARRATIVE decode_TPS: mean = **53**, std = **0.6**

CODE decode_TPS: mean = **62**, std = **1.4**

**AFTER:** 2x3090

*Their default script recipe for dual 3090s (4-bit quant and 8-bit KV cache, mtp=3)*

NARRATIVE decode_TPS: mean = **94**, std = **1.3**

CODE decode_TPS: mean = **120**, std = **2.1**

This is running *without NVLink* on an x8/x8 PCIe motherboard. For some reason P2P was automatically enabled (no driver hack needed). Tensor parallelism = 2.

I am truly shocked that I got almost linear scaling in performance.

I still get odd parsing errors in my quality tests when editing large code files in Agent mode (VSCode), though not the same ones as before. For some reason, forcing the model to use CLI editing tools is much more reliable than whatever VSCode is doing with the Agent.

I am likely going to move to their 8-bit weight model recipe as well.
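To put a number on "almost linear", here is a minimal sketch that computes the speedup and scaling efficiency from the mean decode TPS figures reported above. The `scaling` helper and the `RESULTS` layout are illustrative names, not part of the club-3090 scripts. Note that the two recipes differ in more than GPU count (4-bit vs 8-bit KV cache, mtp=2 vs mtp=3), so the ratio is not a pure measure of tensor-parallel scaling.

```python
# Mean decode TPS as reported in the post (single-stream).
RESULTS = {
    "narrative": {"1x3090": 53.0, "2x3090": 94.0},
    "code": {"1x3090": 62.0, "2x3090": 120.0},
}


def scaling(before_tps, after_tps, n_gpus=2):
    """Return (speedup, efficiency), where efficiency 1.0 means perfectly linear scaling."""
    speedup = after_tps / before_tps
    efficiency = speedup / n_gpus
    return speedup, efficiency


for workload, tps in RESULTS.items():
    speedup, efficiency = scaling(tps["1x3090"], tps["2x3090"])
    print(f"{workload}: {speedup:.2f}x speedup, {efficiency:.0%} of linear")

# Prints:
# narrative: 1.77x speedup, 89% of linear
# code: 1.94x speedup, 97% of linear
```

The code workload lands closer to linear than the narrative one, which may partly reflect the deeper MTP setting in the dual-GPU recipe rather than the second GPU alone.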