7900XTX 24GB vram, can finally fit Q6K+MTP with Qwen 3.6 27B at 131k context
Summary
A guide on optimizing VRAM usage on an AMD 7900XTX to run a 27B Qwen model with Q6K quantization and 131k context by compiling llama.cpp with OpenBLAS and CUDA_FA_ALL_QUANTS, and using kvcache quantization at q5_0/q4_0.
Similar Articles
Best Settings for 48GB VRAM + Qwen 3.6 27B
A user shares optimized settings for running Qwen3.6 27B (Q8_0) on a dual GPU setup (RTX 4090 + RTX 3090) with llama.cpp, achieving 75-100 t/s and 1500 pp with 250k context.
Running Qwen3.6 35b a3b on 8gb vram and 32gb ram ~190k context
The author shares a high-performance local inference configuration for running Qwen3.6 35B A3B on limited hardware (8GB VRAM, 32GB RAM) using a modified llama.cpp with TurboQuant support, achieving ~37-51 tok/sec with ~190k context.
Qwen 3.6 27B on 24GB VRAM setup: backend comparisons, quant choice and settings (llama.cpp, ik_llama.cpp, BeeLlama, vllm)
The article compares llama.cpp backends for running Qwen 3.6 27B on an RTX 3090 24GB, finding ik_llama.cpp with IQ4_KS quantization yields the best performance (1261 tok/s prefill, 72.9 tok/s decode).
80 tok/sec and 128K context on 12GB VRAM with Qwen3.6 35B A3B and llama.cpp MTP
A user shares a configuration for achieving over 80 tokens per second with Qwen3.6 35B A3B on a 12GB VRAM GPU using llama.cpp and Multi-Token Prediction (MTP). The post includes benchmark results and specific command-line parameters to optimize performance.
Qwen 3.6-35B-A3B with 977 tk/s prompt processing and 262k context window on Intel Arc B70 Pro
This article describes how to use the SYCL backend with llama.cpp to achieve over 60 tokens per second on the Qwen 3.6-35B-A3B model using an Intel Arc Pro B70 GPU, with the entire model and KV cache in VRAM.