Here are my KV cache quantization benchmarks: TurboQuant is overrated but saved by TCQ, q5 deserves more attention, and symmetric q8 might be a waste of VRAM
A detailed benchmark comparing KV cache quantization methods (TurboQuant, TCQ, q4, q5, q8) using PPL and KLD metrics on Qwen 3.6 27B, finding that TCQ improves low-bit quantization, asymmetric KV beats symmetric at same size, and q8 is often overkill. Includes analysis and data in linked article.
Greetings from TurboQuant's former biggest defender, now a middle-sized, niche-aware TurboQuant defender. Today I'm presenting the results of thoroughly exploring the world of PPL and KLD benchmarks on my single RTX 3090 using [BeeLlama v0.1.2](https://github.com/Anbeeld/beellama.cpp), with some backstory of unsuccessfully trying other tests and then re-exploring PPL and KLD even more thoroughly to compensate.

Tests were done with Qwen 3.6 27B (`Q5_K_S` and `IQ4_XS`) at 64k and 128k context, so a decent model with decent quants at a decent context length. Basically the setup we 24 GB VRAM folks are actually using, which keeps the results grounded. I'm not in any position to talk shit about the [vLLM study](https://vllm.ai/blog/2026-05-11-turboquant), but it really looked like a "how to invest and become rich if you already have $1,000,000" book to me, with regular 4-bit and 5-bit quants missing from the comparison.

Here are my findings:

* **PPL hides the tail, KLD exposes it.** All the way down through `q4_0`, PPL stays less than 0.01 above `bf16`. Even `turbo3_tcq` only adds ~0.02 PPL. But 99.9% KL divergence tells a different story: while `q5_0` (at 34.4% of `bf16`) is obviously behind `q8_0`, it's still not bad. But then `q4_0`'s tail KLD is 32% worse than `q5_0`'s. Now this is what breaks your tool calls and JSON structure.
* **Rotation closed the gap at 4 bits.** llama.cpp already applies random rotation to KV vectors before quantizing, which is the same basic trick TurboQuant uses. At 4 bits, `turbo4` has no quality advantage over `q4_0`, saves almost no memory, and runs 17% slower. TurboQuant's value is at 2-3 bits, where it has no alternatives anyway.
* **TCQ saves the low end.** `turbo3_tcq` is consistently much better than plain `turbo3`, and `turbo2_tcq` is much better than `turbo2`. They are a legit solution for cases where you need aggressive compression. Now what is TCQ, you might ask? Luckily, the article covers this as well!
* **Asymmetric KV beats symmetric at the same size.** `q5_0/q4_0` takes the same memory as `q4_1/q4_1` but beats it across all test configs in 99.9% precision. After K reaches `q5_0`, the next useful bit goes to V, not to `q5_1` K.
* **Higher model precision means more cache damage.** `Q5_K_S` took 3-5% more 99.9% precision damage than `IQ4_XS` at the same cache quant. Model and KV cache quants are not independent, and it's better to balance them rather than focus on only one or the other, as they both feed from the same VRAM pool.
* **q8 is mostly a luxury tier, unless you have spare VRAM.** `q8_0/q5_0` at 43.8% of `bf16` KV keeps 99.9% precision at 93.7-98.2% across configs, so full `q8_0/q8_0` at 53.1% is mostly validation for when you aren't struggling with VRAM anyway.

**Here's the article, with all the data and quite a bit of analysis:** [https://anbeeld.com/articles/kv-cache-quantization-benchmarks-for-long-context](https://anbeeld.com/articles/kv-cache-quantization-benchmarks-for-long-context)
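For anyone who wants to sanity-check the size percentages quoted in the findings, here is a minimal Python sketch. It assumes the standard ggml block layouts (32 values per block plus per-block metadata) and that the K and V caches hold the same number of values; the function names are made up for illustration, and the TurboQuant/TCQ types are left out because their layouts are not given in this post.

```python
# Bytes per 32-value block for the standard ggml quant formats.
# Assumption: the fork used in the benchmarks keeps these layouts.
BLOCK_BYTES = {
    "q4_0": 18,  # fp16 scale + 32 x 4-bit values
    "q4_1": 20,  # fp16 scale + fp16 min + 32 x 4-bit values
    "q5_0": 22,  # fp16 scale + 32 high bits + 32 x 4-bit values
    "q5_1": 24,  # fp16 scale + fp16 min + 32 high bits + 32 x 4-bit values
    "q8_0": 34,  # fp16 scale + 32 x 8-bit values
}
BLOCK_VALUES = 32
BF16_BITS = 16.0


def bits_per_value(cache_type):
    """Average storage cost of one cached value, in bits (hypothetical helper)."""
    if cache_type == "bf16":
        return BF16_BITS
    return BLOCK_BYTES[cache_type] * 8 / BLOCK_VALUES


def kv_fraction_of_bf16(k_type, v_type):
    """KV cache size relative to a bf16/bf16 cache.

    Assumes K and V contain the same number of values, so the total
    is just the average of the two per-value costs.
    """
    return (bits_per_value(k_type) + bits_per_value(v_type)) / (2 * BF16_BITS)


if __name__ == "__main__":
    for k_type, v_type in [
        ("q8_0", "q8_0"),
        ("q8_0", "q5_0"),
        ("q5_0", "q5_0"),
        ("q5_0", "q4_0"),
        ("q4_1", "q4_1"),
    ]:
        frac = kv_fraction_of_bf16(k_type, v_type)
        print(f"{k_type}/{v_type}: {frac:.1%} of bf16")
```

Under those assumptions `q5_0/q4_0` and `q4_1/q4_1` land on exactly the same size, which is why that pair makes a fair asymmetric-vs-symmetric comparison.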
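To illustrate why an averaged metric can look fine while the tail is bad, here is a toy Python sketch. It assumes "99.9% KLD" means the 99.9th percentile of per-token KL divergence between the reference (`bf16` cache) and quantized-cache output distributions; the logits are invented, the helper names are hypothetical, and this is not the benchmark's actual code. Perplexity is the exponential of a mean, so it gets diluted by rare bad tokens the same way the mean KLD does here.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def kl_divergence(ref_logits, test_logits):
    """KL(P_ref || P_test) for a single token position, in nats."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(test_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def percentile(values, pct):
    """Nearest-rank percentile (one of several common definitions)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[min(rank, len(ordered)) - 1]


# Toy data: 2000 token positions over a 3-token vocabulary.
# The quantized run matches the reference everywhere except 5 positions,
# where it swaps the top two tokens (think: a broken JSON brace).
ref = [4.0, 1.0, 0.0]
flipped = [1.0, 4.0, 0.0]
klds = [kl_divergence(ref, ref) for _ in range(1995)]
klds += [kl_divergence(ref, flipped) for _ in range(5)]

mean_kld = sum(klds) / len(klds)
tail_kld = percentile(klds, 99.9)

print(f"mean KLD:  {mean_kld:.5f}")
print(f"99.9% KLD: {tail_kld:.5f}")
```

The mean barely moves because 1995 of the 2000 positions are perfect, while the tail statistic lands right on the handful of positions that went wrong.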
This paper analyzes KV cache quantization schemes inspired by TurboQuant, using statistical inference and a new 6D error framework to evaluate quality measures like KL divergence and geometric error.
SpectralQuant is a new KV cache quantization technique achieving 5.95× compression on Mistral 7B with only 7.5% perplexity overhead, significantly outperforming TurboQuant while requiring only 15 seconds of calibration per model.
Huawei CSL releases KVarN, a native vLLM attention backend for KV-cache quantization that delivers 3-5x more KV-cache capacity and up to ~1.3x the throughput of FP16, with no calibration required. It claims up to ~2.4x the throughput of TurboQuant while maintaining FP16-level accuracy on models like Qwen3-32B.
This article benchmarks various Qwen3.6-27B quantizations (Q8 to Q2) using KLD and Same Top P metrics, comparing providers like Unsloth and mradermacher, and offers recommendations for quality-size trade-offs.
This paper introduces RateQuant, a method for optimal mixed-precision KV cache quantization that uses rate-distortion theory to address distortion model mismatch. It significantly reduces perplexity compared to existing methods like KIVI and QuaRot with minimal calibration overhead.