I mapped the KLD of KV cache quantization for Qwen3.6-35B-A3B and Gemma4-E2B QAT
Summary
The author maps the Kullback-Leibler divergence (KLD) of KV cache quantization for the Qwen3.6-35B-A3B and Gemma4-E2B QAT models.
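For readers unfamiliar with the metric, here is a minimal sketch of how a mean per-token KLD between a full-precision run and a quantized-cache run could be computed. It is not the author's method. The function names `log_softmax` and `mean_kld` are hypothetical, and the logits are synthetic noise rather than output from either model.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def mean_kld(base_logits, quant_logits):
    # KL(P || Q) per token position, averaged over positions.
    # P: distribution from the baseline run (assumed: unquantized KV cache).
    # Q: distribution from the run with a quantized KV cache.
    log_p = log_softmax(base_logits)
    log_q = log_softmax(quant_logits)
    per_token = (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)
    return float(per_token.mean())

# Synthetic stand-in data: 8 token positions, vocabulary of 32 entries.
rng = np.random.default_rng(0)
base = rng.normal(size=(8, 32))
# Assumption: quantization error is modeled here as small Gaussian noise.
noisy = base + rng.normal(scale=0.05, size=base.shape)

print("identical logits:", mean_kld(base, base))
print("perturbed logits:", mean_kld(base, noisy))
```

Identical logits give a KLD of zero, and any perturbation gives a positive value, so lower is better when comparing cache formats against the same baseline.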
Similar Articles
Gemma 4 QAT 31B responds better to KV cache quantization too
The Gemma 4 QAT 31B model also responds better to KV cache quantization, which suggests more efficient inference.
Here are my KV cache quantization benchmarks: TurboQuant is overrated but saved by TCQ, q5 deserves more attention, and symmetric q8 might be a waste of VRAM
A detailed benchmark comparing KV cache quantization methods (TurboQuant, TCQ, q4, q5, q8) using PPL and KLD metrics on Qwen 3.6 27B. It finds that TCQ improves low-bit quantization, that asymmetric KV beats symmetric at the same size, and that q8 is often overkill. The analysis and data are in the linked article.
Statistical Inference and Quality Measures of KV Cache Quantisations Inspired by TurboQuant
This paper analyzes KV cache quantization schemes inspired by TurboQuant, using statistical inference and a new 6D error framework to evaluate quality measures like KL divergence and geometric error.
@anirudhbv_ce: Introducing SpectralQuant.. here to save your KV cache :)
SpectralQuant is a new KV cache quantization technique achieving 5.95× compression on Mistral 7B with only 7.5% perplexity overhead, significantly outperforming TurboQuant while requiring only 15 seconds of calibration per model.
KVarN: Native vLLM backend for KV-cache quantization by Huawei
Huawei CSL releases KVarN, a native vLLM attention backend for KV-cache quantization that delivers 3-5x more KV-cache capacity and up to ~1.3x the throughput of FP16, with no calibration required. It claims up to ~2.4x the throughput of TurboQuant while maintaining FP16-level accuracy on models like Qwen3-32B.