Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction
Summary
This paper introduces a learned global retention-based KV cache eviction method that improves long-context reasoning by selectively retaining useful tokens and reducing attention dilution, while significantly lowering memory usage.
Cached at: 05/12/26, 02:49 AM
Source: https://huggingface.co/papers/2605.09649
Abstract
Learned global retention-based key-value cache eviction improves long-context reasoning by selectively retaining useful tokens while reducing memory usage.
The key-value (KV) cache is a major bottleneck in long-context inference, where memory and computation grow with sequence length. Existing KV eviction methods reduce this cost but typically degrade performance relative to full-cache inference. Our key insight is that full-cache attention is not always optimal: in long contexts, irrelevant tokens can dilute attention away from useful evidence, so selective, learnable eviction can improve generation rather than merely approximate the full cache. We introduce a global retention-based KV eviction method that learns each token’s future utility under a unified memory budget. Lightweight retention gates assign utility scores to cached KV entries, and a shared final scoring projection calibrates these scores across all layers and heads. This enables a single global eviction policy in which tokens from different layers, heads, and modalities compete directly for cache capacity. We further provide theoretical analysis showing that preferentially retaining useful tokens reduces attention dilution, and we justify geometric retention as a query-agnostic proxy for future utility. Across diverse long-context language and vision-language reasoning, and multi-turn dialogue benchmarks, our method substantially reduces KV memory while matching or surpassing full-cache inference. These results suggest that learned, globally calibrated KV eviction is not only a compression technique, but also a mechanism for improving long-context reasoning.
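The two ideas in the abstract, a single eviction budget shared by every layer and head, and attention dilution from irrelevant tokens, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the utility scores here are random stand-ins for the learned retention-gate outputs, and the function names `global_keep_mask` and `useful_attention_mass` are hypothetical.

```python
import numpy as np


def global_keep_mask(scores: np.ndarray, budget: int) -> np.ndarray:
    """Keep the `budget` highest-scoring KV entries across ALL layers and heads.

    scores: shape (layers, heads, tokens), one utility score per cached KV
            entry (assumed to be already calibrated to a common scale).
    Returns a boolean mask of the same shape; True means "retain".
    """
    flat = scores.ravel()
    budget = max(0, min(budget, flat.size))
    mask = np.zeros(flat.size, dtype=bool)
    if budget > 0:
        # argpartition moves the `budget` largest scores into the last slots.
        kth = flat.size - budget
        keep = np.argpartition(flat, kth)[kth:]
        mask[keep] = True
    return mask.reshape(scores.shape)


def useful_attention_mass(useful_logit: float, noise_logits: np.ndarray) -> float:
    """Softmax attention weight on one useful token among irrelevant ones."""
    logits = np.concatenate(([useful_logit], noise_logits))
    weights = np.exp(logits - logits.max())  # subtract max for stability
    return float(weights[0] / weights.sum())


# Hypothetical scores: 2 layers, 4 heads, 16 cached tokens each.
rng = np.random.default_rng(0)
scores = rng.random((2, 4, 16))
mask = global_keep_mask(scores, budget=32)
print(mask.sum())          # total entries retained equals the budget
print(mask.sum(axis=-1))   # per-(layer, head) counts are free to differ

# Dilution: the same useful token gets less weight as irrelevant tokens grow.
print(useful_attention_mass(2.0, np.zeros(10)))
print(useful_attention_mass(2.0, np.zeros(1000)))
```

Because the top-k selection runs over the flattened score array, a head with many high-utility tokens can hold more of the budget than a head with few, which is the sense in which entries "compete directly" for capacity. A per-head budget would instead apply the selection along the last axis only.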
Similar Articles
Epiphany-Aware KV Cache Eviction Without the Attention Matrix
This paper introduces EpiKV, a KV cache eviction method that scores token importance via changes in internal representations (epiphany score) instead of attention weights, avoiding the need to materialize the attention matrix. It achieves competitive performance on reasoning benchmarks while enabling up to 16× longer context lengths.
TTKV: Temporal-Tiered KV Cache for Long-Context LLM Inference
TTKV introduces a temporal-tiered KV cache that mimics human memory to cut 128K-context LLM inference latency by 76% and double throughput while reducing cross-tier traffic 5.94×.
Open sourcing InfiniteKV: a KV cache that files old tokens as 104-byte searchable records in RAM or on disk instead of deleting them. Mistral-7B answered from token 76,747, 2.3x past its trained window. Colab demo
InfiniteKV is an open-source KV cache technique that compresses old tokens into 104-byte searchable records stored in RAM or on disk, enabling models to handle million-token contexts beyond their trained window without discarding data. Verified working with Mistral-7B and SmolLM2.
KV Packet: Recomputation-Free Context-Independent KV Caching for LLMs
KV Packet proposes a recomputation-free cache reuse framework for LLMs that uses trainable soft-token adapters to bridge context discontinuities, eliminating overhead while maintaining performance comparable to full recomputation baselines on Llama-3.1 and Qwen2.5.
Information-Aware KV Cache Compression for Long Reasoning
This paper proposes InfoKV, an entropy-aware KV cache compression framework that combines token-level predictive uncertainty with attention scores to improve long-context reasoning efficiency. Experiments show it outperforms existing attention-based methods on Llama-3.1, Llama-3.2, and DeepSeek-R1.