7900XTX 24GB vram, can finally fit Q6K+MTP with Qwen 3.6 27B at 131k context

Reddit r/LocalLLaMA Tools

Summary

A guide to freeing up VRAM on an AMD 7900XTX so that a Q6K quant of Qwen 3.6 27B fits with 131k context: drive the monitor from the iGPU, compile llama.cpp with OpenBLAS and GGML_CUDA_FA_ALL_QUANTS, and quantize the KV cache to q5_0/q4_0.

OS: CachyOS

Instructions:

Connect your monitor directly to the iGPU, so that when you boot Linux the dGPU's VRAM is 100% free. By default, driving the display from the dGPU eats about 700 MB to 1.2 GB of VRAM, which is lost context space. Yes, you can still game normally with this approach.

Set the KV cache to q5_0/q4_0 (make sure to compile with GGML_CUDA_FA_ALL_QUANTS). Yes, Q5_0/Q4_0 is about 1.6% less precise than Q8, but in exchange it uses about 12% less VRAM, as shown here (Qwen does an amazing job with a quantized KV cache): https://anbeeld.com/articles/kv-cache-quantization-benchmarks-for-long-context

Now I can run the Unsloth Q6K quant of Qwen 3.6 27B (~22 GB) with 131k context at 55~60 t/s.

Add these arguments when compiling llama.cpp (I got the BLAS changes from a guy who said they helped him reduce VRAM usage, and well...):

-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS -DGGML_CUDA_FA_ALL_QUANTS=true

You can then just pass these llama.cpp arguments:

-ctk q5_0 -ctv q4_0 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 --presence-penalty 0.0 --repeat-penalty 1.0 -c 131000 --jinja --mlock --parallel 1 --no-mmproj
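
If you want to see why dropping the KV cache from q8_0 to q5_0/q4_0 frees up room, here is a rough Python sketch of the arithmetic. The bytes-per-block numbers are ggml's storage sizes for these formats (a 2-byte fp16 scale plus the packed values for every 32 elements). The layer count, KV head count and head size in the example are made-up placeholders, NOT the real Qwen 3.6 27B values, so take the absolute GiB figures as illustrative only. The ratio between cache types does not depend on them.

```python
# Storage cost of 32 cached elements in each format.
# f16 is unblocked (2 bytes per element); it is listed per 32 elements
# only so that every entry uses the same unit.
BLOCK_BYTES = {"f16": 64, "q8_0": 34, "q5_0": 22, "q4_0": 18}
BLOCK_SIZE = 32


def bytes_per_element(cache_type: str) -> float:
    """Average bytes used to store one cached value."""
    return BLOCK_BYTES[cache_type] / BLOCK_SIZE


def kv_cache_bytes(n_ctx, n_attn_layers, n_kv_heads, head_dim, type_k, type_v):
    """Rough KV cache size: one K and one V vector per token, layer and KV head."""
    elems_per_token = n_attn_layers * n_kv_heads * head_dim
    k_bytes = elems_per_token * bytes_per_element(type_k)
    v_bytes = elems_per_token * bytes_per_element(type_v)
    return int(n_ctx * (k_bytes + v_bytes))


if __name__ == "__main__":
    # HYPOTHETICAL model shape, for illustration only (not Qwen's real config).
    shape = dict(n_ctx=131_000, n_attn_layers=16, n_kv_heads=4, head_dim=256)
    for type_k, type_v in [("f16", "f16"), ("q8_0", "q8_0"), ("q5_0", "q4_0")]:
        size = kv_cache_bytes(**shape, type_k=type_k, type_v=type_v)
        print(f"-ctk {type_k} -ctv {type_v}: {size / 2**30:.2f} GiB")
```

Whatever the model shape, q5_0 keys plus q4_0 values take 40 bytes per pair of 32-element blocks against 68 for q8_0/q8_0, so the cache itself shrinks to roughly 59% of its q8_0 size. How much that moves total VRAM usage depends on the weights and compute buffers sitting next to it.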