Can we already use Google's TurboQuant (TQ) for KV Cache in llama-server? Or are we waiting for a PR?
Summary
Community discussion on whether Google's TurboQuant compression can already be applied to the KV cache in llama-server, or whether support is still pending an upstream PR.
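For context, mainline llama-server already exposes conventional quantized KV-cache types through the `--cache-type-k` / `--cache-type-v` flags (e.g. `q8_0`); the open question in the thread is whether TurboQuant arrives behind a similar switch or is still waiting on an upstream PR.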
Similar Articles
KV Cache Compression 900,000× Beyond TurboQuant and Per-Vector Shannon Limit
A new paper proposes sequential KV cache compression using probabilistic language tries and predictive delta coding, claiming a theoretical compression ratio of ~914,000× beyond TurboQuant by exploiting the sequential structure of language-model tokens rather than treating vectors independently.
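As rough intuition for that claim (not the paper's actual method), here is a minimal sketch of predictive rank/delta coding: a stand-in bigram model predicts each next token, the token is stored as its rank under that prediction, and the resulting mostly-tiny ranks compress far better than the raw text. The bigram model and the zlib back end are illustrative assumptions; the paper's probabilistic language tries would play the predictor role.

```python
# Illustrative only: bigram counts stand in for the paper's probabilistic
# language tries, and zlib stands in for a real entropy coder.
import zlib
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Toy predictor: counts of next-token given previous token."""
    model = defaultdict(Counter)
    for prev, cur in zip(tokens, tokens[1:]):
        model[prev][cur] += 1
    return model

def rank_encode(tokens, model):
    """Store each token (after the first, kept out of band) as its rank
    under the model's prediction; predictable text yields tiny ranks."""
    ranks = []
    for prev, cur in zip(tokens, tokens[1:]):
        ordering = [t for t, _ in model[prev].most_common()]
        ranks.append(ordering.index(cur))  # assumes (prev, cur) seen in training
    return ranks

tokens = ("the cat sat on the mat and a dog sat on a rug and " * 40).split()
model  = train_bigram(tokens)
ranks  = rank_encode(tokens, model)        # here every rank is 0 or 1
raw    = zlib.compress(" ".join(tokens).encode())
coded  = zlib.compress(bytes(ranks))
print(len(raw), len(coded))                # rank stream typically much smaller
```

Anything near the headline ratio would of course require a far stronger predictor plus proper arithmetic coding; the sketch only shows why sequential prediction can beat treating each vector independently.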
KV Packet: Recomputation-Free Context-Independent KV Caching for LLMs
KV Packet proposes a recomputation-free cache-reuse framework for LLMs that uses trainable soft-token adapters to bridge context discontinuities, eliminating recomputation overhead while maintaining performance comparable to full-recomputation baselines on Llama-3.1 and Qwen2.5.
TTKV: Temporal-Tiered KV Cache for Long-Context LLM Inference
TTKV introduces a temporal-tiered KV cache that mimics human memory to cut 128K-context LLM inference latency by 76% and double throughput, while reducing cross-tier traffic by 5.94×.
OjaKV: Context-Aware Online Low-Rank KV Cache Compression
OjaKV introduces a context-aware online low-rank KV cache compression framework that uses hybrid storage and Oja's algorithm for incremental subspace adaptation to reduce GPU memory bottlenecks in long-context LLM inference without model fine-tuning.
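Oja's rule itself is standard streaming PCA, so the core of such a scheme is easy to sketch. Below is a minimal sketch under illustrative assumptions (the rank r, learning rate, and QR re-orthonormalization are my choices, not OjaKV's published configuration or storage layout): each key is cached as r coordinates instead of d, while the basis U adapts online to the incoming key stream.

```python
# Streaming low-rank key compression via Oja's rule (illustrative sketch;
# r, lr, and the QR step are assumptions, not OjaKV's actual settings).
import numpy as np

d, r, lr = 64, 8, 1e-3
rng   = np.random.default_rng(0)
basis = rng.standard_normal((r, d))                   # hidden low-rank key structure
U     = np.linalg.qr(rng.standard_normal((d, r)))[0]  # current orthonormal basis, d x r

def oja_step(U, k, lr):
    """One Oja update nudging U toward the top-r subspace of the key stream."""
    y = U.T @ k                       # r-dim projection: this is what gets cached
    U = U + lr * np.outer(k, y)       # Hebbian-style rank-1 update
    return np.linalg.qr(U)[0], y      # re-orthonormalize so U stays well-conditioned

cache = []                            # r floats per key instead of d
for _ in range(3000):
    k = rng.standard_normal(r) @ basis + 0.01 * rng.standard_normal(d)
    U, y = oja_step(U, k, lr)
    cache.append(y)

k_test = rng.standard_normal(r) @ basis
err = np.linalg.norm(U @ (U.T @ k_test) - k_test) / np.linalg.norm(k_test)
print(f"relative reconstruction error: {err:.3f}")  # small => subspace learned online
```

A real system also has to handle basis drift, since entries cached under an older U must still be readable later; presumably that is where OjaKV's hybrid storage comes in.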
BeeLlama.cpp: advanced DFlash & TurboQuant with support for reasoning and vision. Qwen 3.6 27B Q5 with 200k context on a 3090, 2-3x faster than baseline (peak 135 tps!)
BeeLlama.cpp is a performance-focused fork of llama.cpp that introduces DFlash speculative decoding and TurboQuant KV-cache compression, enabling high-speed local inference of large models like Qwen 3.6 27B on consumer hardware.
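DFlash's internals aren't spelled out here, but speculative decoding in general is a draft-and-verify loop, which a short sketch can illustrate. The greedy-acceptance variant below uses toy stand-in models; draft_next, target_next, K, and steps are all assumptions, not BeeLlama.cpp's API. A cheap draft proposes K tokens, the target checks them (in a real engine this is one batched forward pass, which is where the speedup comes from), and the longest agreeing prefix is kept, so the output matches target-only greedy decoding.

```python
# Generic greedy speculative decoding sketch; names and the toy models are
# illustrative assumptions, not DFlash or the BeeLlama.cpp implementation.
from typing import Callable, List

def speculate(prefix: List[int],
              draft_next: Callable[[List[int]], int],
              target_next: Callable[[List[int]], int],
              K: int = 4, steps: int = 8) -> List[int]:
    out = list(prefix)
    for _ in range(steps):
        # 1) cheap draft model proposes K tokens autoregressively
        ctx, draft = list(out), []
        for _ in range(K):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # 2) target verifies; keep the agreeing prefix, then take the
        #    target's own token at the first mismatch, so the result is
        #    exactly what the target alone would have produced greedily
        ctx = list(out)
        for t in draft:
            tt = target_next(ctx)
            out.append(tt)
            if tt != t:
                break
            ctx.append(t)
    return out

# toy models: target counts up mod 10; draft agrees except every 5th step
target = lambda ctx: (ctx[-1] + 1) % 10
draft  = lambda ctx: (ctx[-1] + 1) % 10 if len(ctx) % 5 else (ctx[-1] + 2) % 10
print(speculate([0], draft, target))   # [0, 1, 2, 3, ...] as pure target decoding
```

The claimed 2-3x speedup depends on how often the draft's proposals survive verification; the sketch makes that acceptance logic explicit.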