Can we already use Google's TurboQuant (TQ) for KV Cache in llama-server? Or are we waiting for a PR?

Reddit r/LocalLLaMA Tools

Summary

Community discussion on whether Google's TurboQuant compression can already be applied to the KV cache in llama-server, or whether implementation is still pending.

Hey everyone,

Ever since Google announced [TurboQuant](https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/), I've been following the news about its extreme compression with little to no noticeable quality degradation. It gets mentioned constantly on this sub, but despite all the discussion I'm honestly still a bit confused: is it actually usable for us right now, and if so, how?

I recently saw a post where someone applied TQ quantization directly to the **model weights**. They got Qwen3.5-27B running at near-Q4_0 quality in a file roughly 10% smaller, which finally let it fit comfortably on a 16GB card (specifically an RTX 5060 Ti). That's huge for those of us on consumer GPUs.

However, since TurboQuant was originally pitched largely for its efficiency with context and memory, my main question is about the **KV cache**. As we all know, context length is the real VRAM killer. So my doubts are:

1. **Can we currently apply TQ quantization to the KV cache when using llama-server (llama.cpp)?**
2. If yes, how do we enable it? Is there already a CLI flag for it, similar to the existing `--cache-type-k` / `--cache-type-v` options with `q4_0` / `q8_0`? (I've put how I currently set those in the snippet below.)
3. Or is TQ strictly limited to model weights right now, and we are still waiting for an official PR/release from the llama.cpp team to support it for the KV cache?

I'd love to hear if anyone has tested this or knows the current development status. Thanks!
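For context, here's roughly how I enable the existing quantized cache types today (this is not TurboQuant; the model path and context size are just placeholders from my own setup, so treat it as a sketch):

```bash
# Existing KV-cache quantization in llama-server (not TurboQuant).
# Model path and context size are placeholders for my own setup.
./llama-server -m ./my-model.gguf -c 32768 \
    --cache-type-k q8_0 \
    --cache-type-v q8_0
# As far as I know, quantizing the V cache also requires flash attention
# (--flash-attn) to be enabled; the K cache can be quantized on its own.
```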

Similar Articles

KV Cache Compression 900000x Beyond TurboQuant and Per-Vector Shannon Limit

Hacker News Top

A new paper proposes sequential KV cache compression using probabilistic language tries and predictive delta coding, achieving theoretical compression ratios of ~914,000× beyond TurboQuant by exploiting the sequential structure of language model tokens rather than treating vectors independently.

KV Packet: Recomputation-Free Context-Independent KV Caching for LLMs

Hugging Face Daily Papers

KV Packet proposes a recomputation-free cache reuse framework for LLMs that uses trainable soft-token adapters to bridge context discontinuities, eliminating overhead while maintaining performance comparable to full recomputation baselines on Llama-3.1 and Qwen2.5.

OjaKV: Context-Aware Online Low-Rank KV Cache Compression

arXiv cs.CL

OjaKV introduces a context-aware online low-rank KV cache compression framework that uses hybrid storage and Oja's algorithm for incremental subspace adaptation to reduce GPU memory bottlenecks in long-context LLM inference without model fine-tuning.
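For anyone unfamiliar with the "Oja's algorithm" part of that summary, here is a minimal Python sketch of the classic Oja subspace rule applied to a toy stream of key vectors. The dimensions, learning rate, and the compress/reconstruct helpers are illustrative assumptions, not OjaKV's actual implementation (which, per the abstract, also uses hybrid storage).

```python
import numpy as np

# Minimal sketch: Oja's subspace rule for incrementally tracking a low-rank
# basis of streaming key vectors. Illustrative only; not the OjaKV code.
d, r, lr = 128, 16, 1e-3                 # head dim, kept rank, learning rate (assumed)
rng = np.random.default_rng(0)
W = np.linalg.qr(rng.standard_normal((d, r)))[0]   # d x r (approximately) orthonormal basis

def oja_update(W, x, lr):
    """One streaming update of Oja's subspace rule: W += lr * (x y^T - W y y^T)."""
    y = W.T @ x                           # project the new key vector onto the current basis
    W += lr * (np.outer(x, y) - W @ np.outer(y, y))
    return W

def compress(W, x):
    """Keep only the r-dim coefficients; reconstruct (lossily) with W @ coeffs."""
    return W.T @ x

# Toy stream standing in for per-token key vectors.
for _ in range(1000):
    x = rng.standard_normal(d)
    W = oja_update(W, x, lr)

x = rng.standard_normal(d)
coeffs = compress(W, x)                   # store r floats instead of d
x_hat = W @ coeffs                        # lossy reconstruction from the low-rank cache
```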