NVFP4 KV cache quantization on sm120 will make 32GB VRAM systems very capable
Summary
NVFP4 KV cache quantization on sm120 significantly reduces the memory the KV cache needs for large language models, enabling 32GB VRAM systems to reach ~60 tok/sec inference at a 196k-token context with Qwen3.6-27B.
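To make the memory argument concrete, here is a rough sketch of how KV cache size scales with the storage format. The bits-per-element figures include per-block scale overhead: NVFP4 stores 16 four-bit values plus one 8-bit scale per block (4.5 bits per element, ignoring the negligible per-tensor scale), and llama.cpp's q8_0 and q4_0 store 32 values plus one fp16 scale per block. The model shape below is hypothetical and is not taken from Qwen3.6-27B, the function name `kv_cache_bytes` is illustrative, and "196k" is assumed to mean 196 * 1024 tokens. The sketch also assumes every layer keeps a full K and V cache; a model that mixes in other layer types would need less.

```python
# Bits per stored element, including the per-block scale overhead.
BITS_PER_ELEMENT = {
    "fp16": 16.0,
    "q8_0": 8.5,   # llama.cpp: 32 int8 values + one fp16 scale per block
    "q4_0": 4.5,   # llama.cpp: 32 4-bit values + one fp16 scale per block
    "nvfp4": 4.5,  # 16 FP4 (E2M1) values + one FP8 scale per block
}


def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, fmt):
    """Estimate KV cache size in bytes for a given storage format."""
    # K and V each hold n_kv_heads * head_dim elements per token per layer.
    elements = 2 * n_layers * n_kv_heads * head_dim * n_tokens
    return elements * BITS_PER_ELEMENT[fmt] / 8


# Hypothetical model shape (an assumption, NOT Qwen3.6-27B's real config).
shape = dict(n_layers=64, n_kv_heads=8, head_dim=128)
n_tokens = 196 * 1024  # assumed reading of "196k context"

for fmt in BITS_PER_ELEMENT:
    gib = kv_cache_bytes(n_tokens, fmt=fmt, **shape) / 2**30
    print(f"{fmt:>6}: {gib:.1f} GiB")
```

Whatever the real model shape is, the ratio between formats stays the same: a 4.5-bit format needs 16 / 4.5, or roughly 3.6 times, less KV cache memory than fp16.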
Similar Articles
Quantizing MTP KV Cache = free lunch?
Quantizing the Multi-Token Prediction (MTP) KV cache to q8_0 in llama.cpp for Qwen models reduces VRAM usage without affecting inference speed or acceptance rate, effectively providing a 'free lunch' for memory-constrained setups.
KVarN: Native vLLM backend for KV-cache quantization by Huawei
Huawei CSL releases KVarN, a native vLLM attention backend for KV-cache quantization that delivers 3-5x more KV-cache capacity and up to ~1.3x the throughput of FP16, with no calibration required. It claims up to ~2.4x the throughput of TurboQuant while maintaining FP16-level accuracy on models like Qwen3-32B.
High VRAM local coding model — still Qwen 3.6 27B?
The user discusses their experience with Qwen 3.6 27B for local coding tasks and asks for recommendations for larger models (100B+) suitable for systems with 224GB of VRAM.
@witcheer: everyone says NVFP4 makes blackwell cards "faster." I benchmarked Qwen3.6-27B three ways on my 5090: >NVFP4 >plain Q4_K…
A benchmark of NVFP4 on an RTX 5090 with Qwen3.6-27B shows prefill speed gains of 32-42% over equal-bit Q4_K_M and 52-68% over Q6_K, but decode gains are modest (+9% vs Q4) because decode is memory-bandwidth bound. The quality loss compared to Q6 is minimal (-0.8 average), making NVFP4 a good choice for local inference.
7900XTX 24GB vram, can finally fit Q6K+MTP with Qwen 3.6 27B at 131k context
A guide to optimizing VRAM usage on an AMD 7900XTX to run a 27B Qwen model with Q6K quantization and a 131k context by compiling llama.cpp with OpenBLAS and CUDA_FA_ALL_QUANTS and quantizing the KV cache to q5_0/q4_0.