Tag
A user shares their experience running large language models on a 3x3090 (72GB VRAM) setup in Q2 2026, recommending models like GPT-OSS 120b, Qwen3.5 122b, and GLM 4.5 Air 106B, and asking for newer alternatives.
A user reports near-linear performance scaling when adding a second RTX 3090 for inference with a Qwen model, achieving roughly 1.8x decode TPS improvement without NVLink.
A user shares impressive results from running a quantized Qwen 3.6:35b-a3b model on a used RTX 3090: output reaches 160 tokens per second once the model fits into VRAM, and a vision demo processes a video in 75 seconds.
A GitHub repository called club-3090 provides recipes and configs for serving large language models locally on RTX 3090 GPUs, with support for multiple engines and for quantization methods such as Dflash and TurboQuant, including newly unlocked Q5 quants.
Users discuss the performance tradeoffs of using the new MTP merge in llama.cpp to run Qwen 3.6 35B on dual 3090s, sharing token speeds and seeking optimal configurations.