NVFP4 kv cache quantization on sm120 will make 32GB VRAM systems very capable

Reddit r/LocalLLaMA Tools

Summary

NVFP4 KV cache quantization on sm120 significantly improves memory efficiency for large language models, enabling 32GB VRAM systems to achieve ~60 tok/sec inference at 196k context size with Qwen3.6-27B.

The best i can get from Qwen3.6-27B on my 32GB VRAM (2 x 5060) is ~60 tok/sec gen speed at context size 196608. (sakamakismile text nvfp4). Fp8 kv quantization. NVFP4 kv cache quantization can’t get here fast enough. Reminds me of the time there was this game i couldn’t play on my first pc, because it needed 640KB of RAM minimum.
Original Article

Similar Articles

Quantizing MTP KV Cache = free lunch?

Reddit r/LocalLLaMA

Quantizing the Multi-Token Prediction (MTP) KV cache to q8_0 in llama.cpp for Qwen models reduces VRAM usage without affecting inference speed or acceptance rate, effectively providing a 'free lunch' for memory-constrained setups.

KVarN: Native vLLM backend for KV-cache quantization by Huawei

Hacker News Top

Huawei CSL releases KVarN, a native vLLM attention backend for KV-cache quantization that delivers 3-5x more KV-cache capacity and up to ~1.3x the throughput of FP16, with no calibration required. It claims up to ~2.4x the throughput of TurboQuant while maintaining FP16-level accuracy on models like Qwen3-32B.