Best Settings for 48GB VRAM + Qwen 3.6 27B

Reddit r/LocalLLaMA 06/20/26, 09:07 AM Tools

llamacpp qwen model-inference multi-gpu vram-optimization speculative-decoding tensor-split

Summary

A user shares optimized settings for running Qwen3.6 27B (Q8_0) on a dual-GPU setup (RTX 4090 + RTX 3090) with llama.cpp, reporting 75-100 t/s token generation and 1500 t/s prompt processing (pp) with 250k context.

Hey everyone, I've been running Qwen3.6 27B (Q8_0) across an RTX 4090 + RTX 3090 setup using llama.cpp with tensor split, and I wanted to share what's been working best for me so far. Let me know if anyone has better settings.

Hardware: RTX 4090 (24GB) + RTX 3090 (24GB), 48GB VRAM total
OS: Arch Linux (using the iGPU for display)

Settings:
Quant: Q8_0
Split mode: tensor
Layers on GPU: -ngl 999
Context: 250k (-c 250000)
Speculative decoding: --spec-type draft-mtp --spec-draft-n-max 4
Parallel requests: -np 3
Unified KV cache: -kvu
Chat template: --chat-template-kwargs '{"preserve_thinking": true}'
Flags: --no-mmap -fa on --jinja -fit off --no-op-offload
Vision: mmproj-F16 with --no-mmproj-offload

This gives me 75-100 t/s tg and 1500 pp with 250k unquantized context + vision + MTP.
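For anyone who wants to try this, the settings above could be assembled into a single launch command roughly as follows. This is a minimal sketch, not the exact command from the post: the llama-server binary name, both file paths, and the --split-mode spelling of "Split mode: tensor" are assumptions, and every other flag is copied as listed. Check the flags against --help for your llama.cpp build before relying on it.

```python
import shlex

# Hypothetical paths: the post does not name the GGUF or mmproj files.
MODEL_PATH = "models/Qwen3.6-27B-Q8_0.gguf"
MMPROJ_PATH = "models/mmproj-F16.gguf"


def build_llama_server_args(model_path: str, mmproj_path: str) -> list[str]:
    """Return an argv list for the configuration described in the post."""
    return [
        "llama-server",                # assumed binary (-np suggests the server)
        "-m", model_path,              # Q8_0 model file
        "--mmproj", mmproj_path,       # F16 vision projector
        "--no-mmproj-offload",         # as listed under "Vision"
        "--split-mode", "tensor",      # assumed spelling of "Split mode: tensor"
        "-ngl", "999",                 # layers on GPU
        "-c", "250000",                # 250k context
        "--spec-type", "draft-mtp",    # speculative decoding, as listed
        "--spec-draft-n-max", "4",
        "-np", "3",                    # parallel requests
        "-kvu",                        # unified KV cache
        "--chat-template-kwargs", '{"preserve_thinking": true}',
        "--no-mmap",
        "-fa", "on",
        "--jinja",
        "-fit", "off",
        "--no-op-offload",
    ]


if __name__ == "__main__":
    # shlex.join quotes the JSON argument so the line can be pasted into a shell.
    print(shlex.join(build_llama_server_args(MODEL_PATH, MMPROJ_PATH)))
```

Keeping the arguments in a list makes it easy to change one setting at a time (for example the context size or -np) when comparing throughput.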


Similar Articles

2 old RTX 2080 Ti with 22GB vram each Qwen3.6 27B at 38 token/s with f16 kv cache

Reddit r/LocalLLaMA

A user shares their setup using two modded RTX 2080 Ti GPUs with 22GB VRAM each to run Qwen 3.6 27B at 38 tokens/s with llama.cpp, including tips on power limiting, tensor split mode, and KV cache settings.

Best config for Qwen3.6 27b / llama.cpp / opencode

Reddit r/LocalLLaMA

Community thread sharing optimized llama.cpp launch commands for running the 27B Qwen3.6 GGUF model with long 100K-512K context on multi-GPU setups.

Running Qwen3.6 35b a3b on 8gb vram and 32gb ram ~190k context

Reddit r/LocalLLaMA

The author shares a high-performance local inference configuration for running Qwen3.6 35B A3B on limited hardware (8GB VRAM, 32GB RAM) using a modified llama.cpp with TurboQuant support, achieving ~37-51 tok/sec with ~190k context.

Qwen 3.6 27B on 24GB VRAM setup: backend comparisons, quant choice and settings (llama.cpp, ik_llama.cpp, BeeLlama, vllm)

Reddit r/LocalLLaMA

The article compares llama.cpp backends for running Qwen 3.6 27B on an RTX 3090 24GB, finding ik_llama.cpp with IQ4_KS quantization yields the best performance (1261 tok/s prefill, 72.9 tok/s decode).

7900XTX 24GB vram, can finally fit Q6K+MTP with Qwen 3.6 27B at 131k context

Reddit r/LocalLLaMA

A guide on optimizing VRAM usage on an AMD 7900XTX to run a 27B Qwen model with Q6K quantization and 131k context by compiling llama.cpp with OpenBLAS and CUDA_FA_ALL_QUANTS, and using KV cache quantization at q5_0/q4_0.
