NVFP4 kv cache quantization on sm120 will make 32GB VRAM systems very capable

Reddit r/LocalLLaMA 06/18/26, 12:51 PM Tools

nvfp4 kv-cache quantization memory-optimization qwen inference vram

Summary

The poster reports ~60 tok/sec generation speed at a context size of 196608 with Qwen3.6-27B on a 32GB VRAM system (2 x 5060) using FP8 KV cache quantization, and expects NVFP4 KV cache quantization on sm120 to make such systems considerably more capable once it is available.

The best I can get from Qwen3.6-27B on my 32GB of VRAM (2 x 5060) is ~60 tok/sec generation speed at a context size of 196608. That is with the sakamakismile text nvfp4 quant and FP8 KV cache quantization. NVFP4 KV cache quantization can't get here fast enough. It reminds me of a game I couldn't play on my first PC because it needed a minimum of 640KB of RAM.
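To put rough numbers on why a 4-bit KV cache matters at this context size, here is a minimal sketch of the usual KV cache size arithmetic. The layer count, KV head count, and head dimension below are hypothetical placeholders, not the real Qwen3.6-27B configuration, and `kv_cache_bytes` is an illustrative helper rather than any library's API. The NVFP4 figure assumes the standard layout of 4-bit values with one 8-bit scale per block of 16 values, so about 4.5 bits per element.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bits_per_element):
    """Bytes needed to store K and V for one sequence of context_len tokens."""
    # Factor of 2 covers the separate K and V tensors in every layer.
    elements = 2 * n_layers * n_kv_heads * head_dim * context_len
    return elements * bits_per_element / 8


FP16_BITS = 16.0
FP8_BITS = 8.0
# NVFP4: 4-bit values plus one 8-bit scale shared by each block of 16 values.
# The extra per-tensor scale is negligible and ignored here.
NVFP4_BITS = 4.0 + 8.0 / 16.0

# Hypothetical architecture values, chosen only to illustrate the arithmetic.
N_LAYERS = 16      # layers that actually keep a KV cache (assumption)
N_KV_HEADS = 4     # assumption
HEAD_DIM = 256     # assumption
CONTEXT = 196608   # context size from the post

for name, bits in [("FP16", FP16_BITS), ("FP8", FP8_BITS), ("NVFP4", NVFP4_BITS)]:
    gib = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, CONTEXT, bits) / 2**30
    print(f"{name:>5}: {gib:.2f} GiB")

# Context that fits in the same memory scales inversely with bits per element.
print(f"NVFP4 vs FP8 capacity ratio: {FP8_BITS / NVFP4_BITS:.2f}x")
```

Whatever the real architecture values are, the ratio between formats does not depend on them: moving from FP8 to NVFP4 shrinks the cache to 4.5/8 of its size, which frees VRAM for a longer context or a less aggressive weight quant.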

Similar Articles

Quantizing MTP KV Cache = free lunch?

Reddit r/LocalLLaMA

Quantizing the Multi-Token Prediction (MTP) KV cache to q8_0 in llama.cpp for Qwen models reduces VRAM usage without affecting inference speed or acceptance rate, effectively providing a 'free lunch' for memory-constrained setups.

KVarN: Native vLLM backend for KV-cache quantization by Huawei

Hacker News Top

Huawei CSL releases KVarN, a native vLLM attention backend for KV-cache quantization that delivers 3-5x more KV-cache capacity and up to ~1.3x the throughput of FP16, with no calibration required. It claims up to ~2.4x the throughput of TurboQuant while maintaining FP16-level accuracy on models like Qwen3-32B.

High VRAM local coding model — still Qwen 3.6 27B?

Reddit r/LocalLLaMA

The user discusses their experience with Qwen 3.6 27B for local coding tasks and asks for recommendations for larger models (100B+) suitable for systems with 224GB of VRAM.

@witcheer: everyone says NVFP4 makes blackwell cards "faster." I benchmarked Qwen3.6-27B three ways on my 5090: >NVFP4 >plain Q4_K…

X AI KOLs Timeline

A benchmark of NVFP4 on an RTX 5090 with Qwen3.6-27B shows prefill speed gains of 32-42% over equal-bit Q4_K_M and 52-68% over Q6_K, but decode gains are modest (+9% vs Q4) as decode is memory-bandwidth bound. The quality loss compared to Q6 is minimal (-0.8 average), making NVFP4 a good choice for local inference.

7900XTX 24GB vram, can finally fit Q6K+MTP with Qwen 3.6 27B at 131k context

Reddit r/LocalLLaMA

A guide to optimizing VRAM usage on an AMD 7900XTX to run a 27B Qwen model with Q6K quantization and 131k context by compiling llama.cpp with OpenBLAS and CUDA_FA_ALL_QUANTS, and using KV cache quantization at q5_0/q4_0.
