Best Settings for 48GB VRAM + Qwen 3.6 27B

Reddit r/LocalLLaMA Tools

Summary

A user shares settings for running Qwen3.6 27B (Q8_0) on a dual-GPU setup (RTX 4090 + RTX 3090, 48GB VRAM total) with llama.cpp, reporting 75-100 t/s token generation and 1500 t/s prompt processing with a 250k context.

Hey everyone, I've been running Qwen3.6 27B (Q8_0) across an RTX 4090 + RTX 3090 setup using llama.cpp with tensor split, and I wanted to share what's been working best for me so far. Let me know if anyone has better settings.

Hardware: RTX 4090 (24GB) + RTX 3090 (24GB), 48GB VRAM total
OS: Arch Linux (using the iGPU for display)

Settings:
Quant: Q8_0
Split mode: tensor
Layers on GPU: -ngl 999
Context: 250k (-c 250000)
Speculative decoding: --spec-type draft-mtp --spec-draft-n-max 4
Parallel requests: -np 3
Unified KV cache: -kvu
Chat template: --chat-template-kwargs '{"preserve_thinking": true}'
Flags: --no-mmap -fa on --jinja -fit off --no-op-offload
Vision: mmproj-F16 with --no-mmproj-offload

This gives me 75-100 t/s tg and 1500 t/s pp, with 250k unquantized context + vision + MTP.
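For anyone who wants to try these settings, here is a rough sketch of how they might be assembled into one launch command. Several things are assumptions rather than details from the post: the binary name `llama-server` (the post only says llama.cpp), the `-m` and `--mmproj` file paths (hypothetical placeholders), and the `--split-mode` spelling for "Split mode: tensor". All other flags are copied as listed above and may not exist in every llama.cpp build. The script only prints the command; it does not launch anything.

```python
import shlex


def build_llama_server_args(model_path, mmproj_path, ctx=250_000,
                            parallel=3, draft_max=4):
    """Assemble an argument list from the settings listed in the post.

    Assumptions (not stated in the post): the binary is `llama-server`,
    the split mode is passed as `--split-mode`, and both file paths are
    placeholders supplied by the caller.
    """
    return [
        "llama-server",                       # assumed binary name
        "-m", model_path,                     # hypothetical path to the Q8_0 GGUF
        "--mmproj", mmproj_path,              # hypothetical path to mmproj-F16
        "--no-mmproj-offload",                # vision projector setting from the post
        "--split-mode", "tensor",             # "Split mode: tensor" (flag spelling assumed)
        "-ngl", "999",                        # layers on GPU
        "-c", str(ctx),                       # context size
        "--spec-type", "draft-mtp",           # speculative decoding type
        "--spec-draft-n-max", str(draft_max),
        "-np", str(parallel),                 # parallel requests
        "-kvu",                               # unified KV cache
        "--chat-template-kwargs", '{"preserve_thinking": true}',
        "--no-mmap",
        "-fa", "on",
        "--jinja",
        "-fit", "off",
        "--no-op-offload",
    ]


if __name__ == "__main__":
    args = build_llama_server_args("Qwen3.6-27B-Q8_0.gguf", "mmproj-F16.gguf")
    # shlex.join quotes the JSON argument so the line can be pasted into a shell.
    print(shlex.join(args))
```

Keeping the arguments in a list and quoting them with `shlex.join` avoids mistakes with the JSON passed to `--chat-template-kwargs`, which needs single quotes on a shell command line.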
