Here are my KV cache quantization benchmarks: TurboQuant is overrated but saved by TCQ, q5 deserves more attention, and symmetric q8 might be a waste of VRAM

Reddit r/LocalLLaMA 05/19/26, 05:37 PM Tools

kv-cache quantization benchmarks llama-cpp turboquant tcq ppl kld long-context

Summary

A detailed benchmark comparing KV cache quantization methods (TurboQuant, TCQ, q4, q5, q8) using PPL and KLD metrics on Qwen 3.6 27B, finding that TCQ improves low-bit quantization, asymmetric KV beats symmetric at same size, and q8 is often overkill. Includes analysis and data in linked article.

Greetings from TurboQuant's former biggest defender, now a middle-sized, niche-aware TurboQuant defender. Today I'm presenting the results of thoroughly exploring the world of PPL and KLD benchmarks on my single RTX 3090 using [BeeLlama v0.1.2](https://github.com/Anbeeld/beellama.cpp), with some backstory of unsuccessfully trying other tests and then re-exploring PPL and KLD even more thoroughly to compensate.

Tests were done with Qwen 3.6 27B (`Q5_K_S` and `IQ4_XS`) at 64k and 128k context, so a decent model with decent quants at a decent context length. Basically the setup we 24 GB VRAM folks are actually using, which keeps the results grounded. I'm not in any position to talk shit about the [vLLM study](https://vllm.ai/blog/2026-05-11-turboquant), but it really looked like a "how to invest and become rich if you already have $1,000,000" book to me, with regular 4-bit and 5-bit quants missing from the comparison.

Here are my findings:

* **PPL hides the tail, KLD exposes it.** All the way down through `q4_0`, PPL stays less than 0.01 above `bf16`. Even `turbo3_tcq` only adds ~0.02 PPL. But the 99.9% KL divergence, i.e. the tail, tells a different story: `q5_0` (at 34.4% of `bf16` KV size) is obviously behind `q8_0` but still not bad, while `q4_0`'s tail KLD is 32% worse than `q5_0`'s. That tail is what breaks your tool calls and JSON structure.
* **Rotation closed the gap at 4 bits.** llama.cpp already applies random rotation to KV vectors before quantizing, which is the same basic trick TurboQuant uses. At 4 bits, `turbo4` has no quality advantage over `q4_0`, saves almost no memory, and runs 17% slower. TurboQuant's value is at 2-3 bits, where it has no alternatives anyway.
* **TCQ saves the low end.** `turbo3_tcq` is consistently much better than plain `turbo3`, and `turbo2_tcq` is much better than `turbo2`. They are a legit solution for cases where you need aggressive compression. Now what is TCQ, you might ask? Luckily, the article covers this as well!
* **Asymmetric KV beats symmetric at the same size.** `q5_0/q4_0` takes the same memory as `q4_1/q4_1` but beats it across all test configs in 99.9% precision. After K reaches `q5_0`, the next useful bit goes to V, not to `q5_1` K.
* **Higher model precision means more cache damage.** `Q5_K_S` took 3-5% more 99.9% precision damage than `IQ4_XS` at the same cache quant. Model and KV cache quants are not independent, and since they both feed from the same VRAM pool, it's better to balance them than to focus on only one or the other.
* **q8 is mostly a luxury tier, unless you have spare VRAM.** `q8_0/q5_0` at 43.8% of `bf16` KV keeps 99.9% precision at 93.7-98.2% across configs, so full `q8_0/q8_0` at 53.1% is mostly validation for when you don't struggle with VRAM anyway.

**Here's the article, with all the data and quite a bit of analysis:** [https://anbeeld.com/articles/kv-cache-quantization-benchmarks-for-long-context](https://anbeeld.com/articles/kv-cache-quantization-benchmarks-for-long-context)
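The size percentages quoted above can be checked with a bit of arithmetic. This is a minimal sketch, assuming the standard ggml block layouts (32 values per block plus an fp16 scale, and an extra fp16 offset for the `_1` types) and assuming K and V hold the same number of elements. The helper names are made up for illustration, and the `turbo*` types are left out because their layout isn't described in the post.

```python
# Bytes per block of 32 values for the classic ggml cache quant types.
BLOCK_BYTES = {
    "q4_0": 18,  # 16 bytes of 4-bit values + 2-byte fp16 scale
    "q4_1": 20,  # q4_0 payload + 2-byte fp16 offset
    "q5_0": 22,  # 16 bytes low nibbles + 4 bytes of fifth bits + 2-byte scale
    "q5_1": 24,  # q5_0 payload + 2-byte fp16 offset
    "q8_0": 34,  # 32 bytes of int8 values + 2-byte fp16 scale
}
BLOCK_SIZE = 32   # values per block
BF16_BITS = 16.0  # bits per value in an unquantized bf16 cache


def bits_per_value(cache_type):
    """Average storage cost of one cached value, in bits."""
    if cache_type == "bf16":
        return BF16_BITS
    return BLOCK_BYTES[cache_type] * 8 / BLOCK_SIZE


def kv_fraction_of_bf16(k_type, v_type):
    """Size of a K/V pair relative to bf16/bf16 (assumes equal K and V element counts)."""
    return (bits_per_value(k_type) + bits_per_value(v_type)) / (2 * BF16_BITS)


if __name__ == "__main__":
    pairs = [
        ("q4_0", "q4_0"),
        ("q5_0", "q4_0"),
        ("q4_1", "q4_1"),
        ("q5_0", "q5_0"),
        ("q8_0", "q5_0"),
        ("q8_0", "q8_0"),
    ]
    for k_type, v_type in pairs:
        frac = kv_fraction_of_bf16(k_type, v_type)
        print(f"{k_type}/{v_type}: {frac * 100:.3f}% of bf16 KV")
```

Under these assumptions `q5_0/q4_0` and `q4_1/q4_1` both cost 10 bits per K/V value pair out of 32, which is why they can be compared at equal memory. The same function prices other candidates too: `q5_1/q4_0` and `q5_0/q4_1` both come out at 10.5 out of 32.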
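To see why an averaged metric like PPL can look flat while the 99.9% KLD moves, here is a toy sketch. It is not the benchmark methodology from the article: the logits are synthetic, and the noise model (tiny perturbation on most tokens, a large one on 0.5% of them) is an assumption made up purely to illustrate the effect. In a real run the two distributions would come from the `bf16` cache and the quantized cache on the same text.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def percentile(values, pct):
    """Nearest-rank percentile, with pct in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, min(len(ordered), math.ceil(pct / 100 * len(ordered))))
    return ordered[rank - 1]


random.seed(0)
VOCAB, TOKENS = 16, 5000  # toy sizes, chosen only to keep the demo fast

per_token_kld = []
for t in range(TOKENS):
    ref_logits = [random.gauss(0, 2) for _ in range(VOCAB)]
    # Hypothetical cache-quantization noise: small almost everywhere,
    # large on one token in every 200.
    noise = 1.5 if t % 200 == 0 else 0.02
    quant_logits = [x + random.gauss(0, noise) for x in ref_logits]
    per_token_kld.append(kl_divergence(softmax(ref_logits), softmax(quant_logits)))

mean_kld = sum(per_token_kld) / TOKENS
median_kld = percentile(per_token_kld, 50)
tail_kld = percentile(per_token_kld, 99.9)

print(f"mean KLD   : {mean_kld:.5f}")
print(f"median KLD : {median_kld:.5f}")
print(f"99.9% KLD  : {tail_kld:.5f}")
```

The mean gets diluted by the thousands of tokens that are barely affected, while the 99.9% value reports the rare tokens where the quantized distribution really diverges. Those rare tokens are the ones that can flip a bracket or a tool name.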
