Can we already use Google's TurboQuant (TQ) for KV Cache in llama-server? Or are we waiting for a PR?

Reddit r/LocalLLaMA Tools

Summary

Community discussion on whether Google's TurboQuant compression can already be applied to the KV cache in llama-server, or whether implementation is still pending.

Hey everyone,

Ever since Google announced [TurboQuant](https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/), I've been following the news about its extreme compression with little to no noticeable quality degradation. It gets mentioned constantly on this sub, but despite all the discussion I'm honestly still a bit confused: is it actually usable for us right now, and if so, how?

I recently saw a post where someone applied TQ quantization directly to the **model weights**. They got Qwen3.5-27B running at near-Q4_0 quality in a file roughly 10% smaller, which finally let it fit comfortably on a 16GB card (specifically an RTX 5060 Ti). That's huge for those of us on consumer GPUs.

However, since TurboQuant was originally pitched largely for its efficiency with context and memory, my main question is about the **KV cache**. As we all know, context length is the real VRAM killer. So my doubts are:

1. **Can we currently apply TQ quantization to the KV cache when using llama-server (llama.cpp)?**
2. If yes, how do we enable it? Is there already a CLI flag for it, similar to the existing `--cache-type-k` / `--cache-type-v` options with `q4_0` / `q8_0`? (I've put how I currently set those in the snippet below.)
3. Or is TQ strictly limited to model weights right now, and we are still waiting for an official PR/release from the llama.cpp team to support it for the KV cache?

I'd love to hear if anyone has tested this or knows the current development status. Thanks!
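For context, here's roughly how I enable the existing quantized cache types today (this is not TurboQuant; the model path and context size are just placeholders from my own setup, so treat it as a sketch):

```bash
# Existing KV-cache quantization in llama-server (not TurboQuant).
# Model path and context size are placeholders for my own setup.
./llama-server -m ./my-model.gguf -c 32768 \
    --cache-type-k q8_0 \
    --cache-type-v q8_0
# As far as I know, quantizing the V cache also requires flash attention
# (--flash-attn) to be enabled; the K cache can be quantized on its own.
```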

Similar Articles

KV Cache Compression 900000x Beyond TurboQuant and Per-Vector Shannon Limit

Hacker News Top

A new paper proposes sequential KV cache compression using probabilistic language tries and predictive delta coding, achieving theoretical compression ratios of ~914,000× beyond TurboQuant by exploiting the sequential structure of language model tokens rather than treating vectors independently.

KV Packet: Recomputation-Free Context-Independent KV Caching for LLMs

Hugging Face Daily Papers

KV Packet proposes a recomputation-free cache reuse framework for LLMs that uses trainable soft-token adapters to bridge context discontinuities, eliminating overhead while maintaining performance comparable to full recomputation baselines on Llama-3.1 and Qwen2.5.

OjaKV: Context-Aware Online Low-Rank KV Cache Compression

arXiv cs.CL

OjaKV introduces a context-aware online low-rank KV cache compression framework that uses hybrid storage and Oja's algorithm for incremental subspace adaptation to reduce GPU memory bottlenecks in long-context LLM inference without model fine-tuning.
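For anyone unfamiliar with the "Oja's algorithm" part of that summary, here is a minimal Python sketch of the classic Oja subspace rule applied to a toy stream of key vectors. The dimensions, learning rate, and the compress/reconstruct helpers are illustrative assumptions, not OjaKV's actual implementation (which, per the abstract, also uses hybrid storage).

```python
import numpy as np

# Minimal sketch: Oja's subspace rule for incrementally tracking a low-rank
# basis of streaming key vectors. Illustrative only; not the OjaKV code.
d, r, lr = 128, 16, 1e-3                 # head dim, kept rank, learning rate (assumed)
rng = np.random.default_rng(0)
W = np.linalg.qr(rng.standard_normal((d, r)))[0]   # d x r (approximately) orthonormal basis

def oja_update(W, x, lr):
    """One streaming update of Oja's subspace rule: W += lr * (x y^T - W y y^T)."""
    y = W.T @ x                           # project the new key vector onto the current basis
    W += lr * (np.outer(x, y) - W @ np.outer(y, y))
    return W

def compress(W, x):
    """Keep only the r-dim coefficients; reconstruct (lossily) with W @ coeffs."""
    return W.T @ x

# Toy stream standing in for per-token key vectors.
for _ in range(1000):
    x = rng.standard_normal(d)
    W = oja_update(W, x, lr)

x = rng.standard_normal(d)
coeffs = compress(W, x)                   # store r floats instead of d
x_hat = W @ coeffs                        # lossy reconstruction from the low-rank cache
```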