7900XTX 24GB vram, can finally fit Q6K+MTP with Qwen 3.6 27B at 131k context

Reddit r/LocalLLaMA 06/20/26, 08:23 AM Tools

vram-optimization amd-gpu qwen-model llama-cpp context-length quantization ai-inference

Summary

A guide on optimizing VRAM usage on an AMD 7900XTX to run a 27B Qwen model with Q6K quantization and 131k context by compiling llama.cpp with OpenBLAS and CUDA_FA_ALL_QUANTS, and using kvcache quantization at q5_0/q4_0.

OS: CatchyOS Instructions: Connect monitor to iGPU directly so when you boot Linux your dGPU vram is 100% free since by default when you use your dGPU it consumes about 700mb~1.2gb of lost context space, yes you can still game normally using this approach. Setup kvcache at q5_0/q4_0 (make sure to compile with CUDA_ALL_QUANTS) Yes, Q5_0/Q4_0 is 1.6%~ less precise than Q8 by giving 12% less vram usage as proven here: (Qwen does an amazing job with kvcache). https://anbeeld.com/articles/kv-cache-quantization-benchmarks-for-long-context Now I can run Qwen 3.6 27B Unsloth Q6K model (22GB~) with 131k context at 55~60t/s Add these arguments to compile (the blas changes I got from here with a guy saying that it helped him reduce vram usage, and well...) -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS -DGGML_CUDA_FA_ALL_QUANTS=true You can then just pass the llama.cpp arguments: -ctk q5_0 -ctv q4_0 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 --presence-penalty 0.0 --repeat-penalty 1.0 -c 131000 --jinja --mlock --parallel 1 --no-mmproj

Original Article

Similar Articles

Best Settings for 48GB VRAM + Qwen 3.6 27B

Reddit r/LocalLLaMA

A user shares optimized settings for running Qwen3.6 27B (Q8_0) on a dual GPU setup (RTX 4090 + RTX 3090) with llama.cpp, achieving 75-100 t/s and 1500 pp with 250k context.

Running Qwen3.6 35b a3b on 8gb vram and 32gb ram ~190k context

Reddit r/LocalLLaMA

The author shares a high-performance local inference configuration for running Qwen3.6 35B A3B on limited hardware (8GB VRAM, 32GB RAM) using a modified llama.cpp with TurboQuant support, achieving ~37-51 tok/sec with ~190k context.

Qwen 3.6 27B on 24GB VRAM setup: backend comparisons, quant choice and settings (llama.cpp, ik_llama.cpp, BeeLlama, vllm)

Reddit r/LocalLLaMA

The article compares llama.cpp backends for running Qwen 3.6 27B on an RTX 3090 24GB, finding ik_llama.cpp with IQ4_KS quantization yields the best performance (1261 tok/s prefill, 72.9 tok/s decode).

80 tok/sec and 128K context on 12GB VRAM with Qwen3.6 35B A3B and llama.cpp MTP

Reddit r/LocalLLaMA

A user shares a configuration for achieving over 80 tokens per second with Qwen3.6 35B A3B on a 12GB VRAM GPU using llama.cpp and Multi-Token Prediction (MTP). The post includes benchmark results and specific command-line parameters to optimize performance.

Qwen 3.6-35B-A3B with 977 tk/s prompt processing and 262k context window on Intel Arc B70 Pro

Reddit r/LocalLLaMA

This article describes how to use the SYCL backend with llama.cpp to achieve over 60 tokens per second on the Qwen 3.6-35B-A3B model using an Intel Arc Pro B70 GPU, with the entire model and KV cache in VRAM.

Similar Articles

Best Settings for 48GB VRAM + Qwen 3.6 27B

Running Qwen3.6 35b a3b on 8gb vram and 32gb ram ~190k context

Qwen 3.6 27B on 24GB VRAM setup: backend comparisons, quant choice and settings (llama.cpp, ik_llama.cpp, BeeLlama, vllm)

80 tok/sec and 128K context on 12GB VRAM with Qwen3.6 35B A3B and llama.cpp MTP

Qwen 3.6-35B-A3B with 977 tk/s prompt processing and 262k context window on Intel Arc B70 Pro

Submit Feedback