Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction

Hugging Face Daily Papers

Summary

This paper introduces a learned global retention-based KV cache eviction method that improves long-context reasoning by selectively retaining useful tokens and reducing attention dilution, while significantly lowering memory usage.

The key-value (KV) cache is a major bottleneck in long-context inference, where memory and computation grow with sequence length. Existing KV eviction methods reduce this cost but typically degrade performance relative to full-cache inference. Our key insight is that full-cache attention is not always optimal: in long contexts, irrelevant tokens can dilute attention away from useful evidence, so selective, learnable eviction can improve generation rather than merely approximate the full cache. We introduce a global retention-based KV eviction method that learns each token's future utility under a unified memory budget. Lightweight retention gates assign utility scores to cached KV entries, and a shared final scoring projection calibrates these scores across all layers and heads. This enables a single global eviction policy in which tokens from different layers, heads, and modalities compete directly for cache capacity. We further provide theoretical analysis showing that preferentially retaining useful tokens reduces attention dilution, and we justify geometric retention as a query-agnostic proxy for future utility. Across diverse long-context language and vision-language reasoning, and multi-turn dialogue benchmarks, our method substantially reduces KV memory while matching or surpassing full-cache inference. These results suggest that learned, globally calibrated KV eviction is not only a compression technique, but also a mechanism for improving long-context reasoning.
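The global eviction policy described above, in which entries from every layer and head compete for one shared budget, can be illustrated with a minimal sketch. It assumes the utility scores are already computed and comparable across layers and heads; the paper's retention gates and shared scoring projection are not reproduced here. The function name `global_evict`, the `(layer, head, position)` key layout, and the toy scores are all hypothetical.

```python
import heapq


def global_evict(scores, budget):
    """Keep the `budget` highest-scoring KV entries across all layers and heads.

    scores: dict mapping (layer, head, position) -> utility score (float).
            The scores are assumed to be calibrated so that they can be
            compared directly across layers and heads.
    Returns a dict {(layer, head): sorted list of kept positions}.
    """
    # One global ranking: entries from different layers and heads compete
    # for the same slots instead of each head getting a fixed quota.
    kept = heapq.nlargest(budget, scores.items(), key=lambda item: item[1])

    per_head = {}
    for (layer, head, pos), _score in kept:
        per_head.setdefault((layer, head), []).append(pos)
    return {key: sorted(positions) for key, positions in per_head.items()}


# Toy scores, made up for illustration only (not taken from the paper).
toy_scores = {
    (0, 0, 0): 0.9, (0, 0, 1): 0.1, (0, 0, 2): 0.8,
    (0, 1, 0): 0.2, (0, 1, 1): 0.3, (0, 1, 2): 0.1,
    (1, 0, 0): 0.7, (1, 0, 1): 0.6, (1, 0, 2): 0.05,
}

kept = global_evict(toy_scores, budget=4)
for head_key, positions in sorted(kept.items()):
    print(head_key, positions)
```

In this toy run, head `(0, 1)` keeps nothing because all of its entries score below the global cutoff. A per-head quota would have forced it to keep some entries, which is the contrast the global budget is meant to draw.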

Cached at: 05/12/26, 02:49 AM


Source: https://huggingface.co/papers/2605.09649

Abstract

Learned global retention-based key-value cache eviction improves long-context reasoning by selectively retaining useful tokens while reducing memory usage.

The key-value (KV) cache is a major bottleneck in long-context inference, where memory and computation grow with sequence length. Existing KV eviction methods reduce this cost but typically degrade performance relative to full-cache inference. Our key insight is that full-cache attention is not always optimal: in long contexts, irrelevant tokens can dilute attention away from useful evidence, so selective, learnable eviction can improve generation rather than merely approximate the full cache. We introduce a global retention-based KV eviction method that learns each token’s future utility under a unified memory budget. Lightweight retention gates assign utility scores to cached KV entries, and a shared final scoring projection calibrates these scores across all layers and heads. This enables a single global eviction policy in which tokens from different layers, heads, and modalities compete directly for cache capacity. We further provide theoretical analysis showing that preferentially retaining useful tokens reduces attention dilution, and we justify geometric retention as a query-agnostic proxy for future utility. Across diverse long-context language and vision-language reasoning, and multi-turn dialogue benchmarks, our method substantially reduces KV memory while matching or surpassing full-cache inference. These results suggest that learned, globally calibrated KV eviction is not only a compression technique, but also a mechanism for improving long-context reasoning.
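The attention-dilution claim can be made concrete with a small worked example. This is a simplified toy model, not the paper's analysis: it assumes a single relevant token with one attention logit and a set of distractor tokens that all share a lower logit. The function name and the logit values are hypothetical.

```python
import math


def relevant_attention_mass(relevant_logit, distractor_logit, num_distractors):
    """Softmax weight on one relevant token among identical distractors.

    Toy assumption: every distractor has the same logit, so the softmax
    denominator is exp(relevant) + num_distractors * exp(distractor).
    """
    relevant = math.exp(relevant_logit)
    distractors = num_distractors * math.exp(distractor_logit)
    return relevant / (relevant + distractors)


# Same relevant token, same logits; only the number of cached distractors changes.
full_cache = relevant_attention_mass(4.0, 0.0, num_distractors=1000)
after_eviction = relevant_attention_mass(4.0, 0.0, num_distractors=10)

print(f"full cache:     {full_cache:.3f}")
print(f"after eviction: {after_eviction:.3f}")
```

Under these assumptions the relevant token's share of attention rises once most distractors are evicted, even though its own logit is unchanged. That is the sense in which a smaller cache can sharpen attention rather than only approximate the full one.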

Similar Articles

Epiphany-Aware KV Cache Eviction Without the Attention Matrix

arXiv cs.LG

This paper introduces EpiKV, a KV cache eviction method that scores token importance via changes in internal representations (epiphany score) instead of attention weights, avoiding the need to materialize the attention matrix. It achieves competitive performance on reasoning benchmarks while enabling up to 16× longer context lengths.

Open sourcing InfiniteKV: a KV cache that files old tokens as 104-byte searchable records in RAM or on disk instead of deleting them. Mistral-7B answered from token 76,747, 2.3x past its trained window. Colab demo

Reddit r/LocalLLaMA

InfiniteKV is an open-source KV cache technique that compresses old tokens into 104-byte searchable records stored in RAM or on disk, enabling models to handle million-token contexts beyond their trained window without discarding data. Verified working with Mistral-7B and SmolLM2.

KV Packet: Recomputation-Free Context-Independent KV Caching for LLMs

Hugging Face Daily Papers

KV Packet proposes a recomputation-free cache reuse framework for LLMs that uses trainable soft-token adapters to bridge context discontinuities, eliminating overhead while maintaining performance comparable to full recomputation baselines on Llama-3.1 and Qwen2.5.

Information-Aware KV Cache Compression for Long Reasoning

arXiv cs.CL

This paper proposes InfoKV, an entropy-aware KV cache compression framework that combines token-level predictive uncertainty with attention scores to improve long-context reasoning efficiency. Experiments show it outperforms existing attention-based methods on Llama-3.1, Llama-3.2, and DeepSeek-R1.