I mapped the KLD of KV cache quantization for Qwen3.6-35B-A3B and Gemma4-E2B QAT

Reddit r/LocalLLaMA 06/23/26, 03:12 PM Papers

kv-cache quantization kld qwen gemma4 model-analysis efficient-inference

Summary

The author maps the Kullback-Leibler divergence of KV cache quantization for the Qwen3.6-35B-A3B and Gemma4-E2B QAT models.

No content available

Original Article

Similar Articles

Gemma 4 QAT 31B responds better to KV cache quantization too

Reddit r/LocalLLaMA

The Gemma 4 QAT 31B model demonstrates improved behavior with KV cache quantization, suggesting enhanced inference efficiency.

Here are my KV cache quantization benchmarks: TurboQuant is overrated but saved by TCQ, q5 deserves more attention, and symmetric q8 might be a waste of VRAM

Reddit r/LocalLLaMA

A detailed benchmark comparing KV cache quantization methods (TurboQuant, TCQ, q4, q5, q8) using PPL and KLD metrics on Qwen 3.6 27B, finding that TCQ improves low-bit quantization, asymmetric KV beats symmetric at same size, and q8 is often overkill. Includes analysis and data in linked article.

Statistical Inference and Quality Measures of KV Cache Quantisations Inspired by TurboQuant

arXiv cs.LG

This paper analyzes KV cache quantization schemes inspired by TurboQuant, using statistical inference and a new 6D error framework to evaluate quality measures like KL divergence and geometric error.

@anirudhbv_ce: Introducing SpectralQuant.. here to save your KV cache :)

X AI KOLs Timeline

SpectralQuant is a new KV cache quantization technique achieving 5.95× compression on Mistral 7B with only 7.5% perplexity overhead, significantly outperforming TurboQuant while requiring only 15 seconds of calibration per model.

KVarN: Native vLLM backend for KV-cache quantization by Huawei

Hacker News Top

Huawei CSL releases KVarN, a native vLLM attention backend for KV-cache quantization that delivers 3-5x more KV-cache capacity and up to ~1.3x the throughput of FP16, with no calibration required. It claims up to ~2.4x the throughput of TurboQuant while maintaining FP16-level accuracy on models like Qwen3-32B.