@songhan_mit: Explore our continued efforts on KV cache compression:
Summary
A tweet from Song Han highlights continued work on KV cache compression, pointing to a blog post by Weian Mao on the system-level aspects of KV cache efficiency that papers usually skip.
Cached at: 06/15/26, 07:06 PM
Explore our continued efforts on KV cache compression:
Weian Mao (@WeianMaoX): 🔗 Our new blog digs into a side of KV cache efficiency that papers usually skip: https://t.co/GXo228eJtf Most work here is about the algorithm: eviction papers, for instance, focus on which entries are worth keeping. But the algorithm only helps if the systems underneath can
Similar Articles
@yukangchen_: We are excited to share a new technical article “KV Cache Compression and Its Infra Problems.” https://research.nvidia.…
NVIDIA Research publishes a technical blog post examining KV cache compression techniques and their infrastructure problems, including how FlashAttention and paged attention create practical obstacles for production deployment of long-context LLMs, with a proposed geometric solution using RoPE.
@che_shr_cat: 1/ We have spent years optimizing KV cache via head-sharing (GQA/MQA), but we ignored a fundamental assumption: why do …
This thread challenges the assumption that Transformers require separate Q, K, and V projections, proposing that merging them can yield massive memory savings for the KV cache.
CompressKV: Semantic-Retrieval-Guided KV-Cache Compression for Resource-Efficient Long-Context LLM Inference
CompressKV proposes a semantic-retrieval-guided KV-cache compression method for GQA-based LLMs that identifies Semantic Retrieval Heads and uses them to retain critical tokens. It retains over 97% of full-cache performance using only 3% of the KV cache on LongBench tasks.
@ZeroZ_JQ: https://x.com/ZeroZ_JQ/status/2066380476970103028
The article reframes the KV cache from an engineering perspective, arguing that it is not just an inference optimization technique: in the agent era it becomes runtime infrastructure for reusing already-computed results, helping AI avoid redundant reasoning.
PolyKV: Heterogeneous Retention and Allocation for KV Cache Compression
PolyKV is a layer-wise KV cache compression framework that assigns heterogeneous eviction policies and non-uniform budgets per layer, significantly improving over uniform baselines on LongBench with LLaMA-3.1-8B and Qwen3-8B.