Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction

Hugging Face Daily Papers 05/10/26, 12:00 AM Papers

Summary

This paper introduces a learned global retention-based KV cache eviction method that improves long-context reasoning by selectively retaining useful tokens and reducing attention dilution, while significantly lowering memory usage.

The key-value (KV) cache is a major bottleneck in long-context inference, where memory and computation grow with sequence length. Existing KV eviction methods reduce this cost but typically degrade performance relative to full-cache inference. Our key insight is that full-cache attention is not always optimal: in long contexts, irrelevant tokens can dilute attention away from useful evidence, so selective, learnable eviction can improve generation rather than merely approximate the full cache. We introduce a global retention-based KV eviction method that learns each token's future utility under a unified memory budget. Lightweight retention gates assign utility scores to cached KV entries, and a shared final scoring projection calibrates these scores across all layers and heads. This enables a single global eviction policy in which tokens from different layers, heads, and modalities compete directly for cache capacity. We further provide theoretical analysis showing that preferentially retaining useful tokens reduces attention dilution, and we justify geometric retention as a query-agnostic proxy for future utility. Across diverse long-context language and vision-language reasoning, and multi-turn dialogue benchmarks, our method substantially reduces KV memory while matching or surpassing full-cache inference. These results suggest that learned, globally calibrated KV eviction is not only a compression technique, but also a mechanism for improving long-context reasoning.
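The attention-dilution argument above can be illustrated with a toy softmax calculation. This is a minimal sketch, not the paper's analysis: it assumes a single useful token competing against identical irrelevant tokens, and the function name, logit values, and token counts are all hypothetical choices made for illustration.

```python
import math


def useful_attention_mass(useful_logit, distractor_logit, num_distractors):
    """Softmax weight on one useful token competing with identical distractors.

    Toy model (an assumption, not from the paper): every irrelevant token
    has the same attention logit, so the softmax denominator grows linearly
    with the number of irrelevant tokens kept in the cache.
    """
    useful = math.exp(useful_logit)
    noise = num_distractors * math.exp(distractor_logit)
    return useful / (useful + noise)


# Same useful token, same logits; only the number of cached distractors changes.
full_cache = useful_attention_mass(4.0, 0.0, 10_000)
after_eviction = useful_attention_mass(4.0, 0.0, 100)

print(f"full cache:     {full_cache:.4f}")
print(f"after eviction: {after_eviction:.4f}")
```

Under these assumed numbers, the useful token receives well under 1% of the attention mass with the full cache and roughly a third of it once most distractors are evicted, which is the sense in which eviction can help rather than merely approximate full-cache attention.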

Original Article

Cached at: 05/12/26, 02:49 AM

Source: https://huggingface.co/papers/2605.09649

Abstract

Learned global retention-based key-value cache eviction improves long-context reasoning by selectively retaining useful tokens while reducing memory usage.

The key-value (KV) cache is a major bottleneck in long-context inference, where memory and computation grow with sequence length. Existing KV eviction methods reduce this cost but typically degrade performance relative to full-cache inference. Our key insight is that full-cache attention is not always optimal: in long contexts, irrelevant tokens can dilute attention away from useful evidence, so selective, learnable eviction can improve generation rather than merely approximate the full cache. We introduce a global retention-based KV eviction method that learns each token’s future utility under a unified memory budget. Lightweight retention gates assign utility scores to cached KV entries, and a shared final scoring projection calibrates these scores across all layers and heads. This enables a single global eviction policy in which tokens from different layers, heads, and modalities compete directly for cache capacity. We further provide theoretical analysis showing that preferentially retaining useful tokens reduces attention dilution, and we justify geometric retention as a query-agnostic proxy for future utility. Across diverse long-context language and vision-language reasoning, and multi-turn dialogue benchmarks, our method substantially reduces KV memory while matching or surpassing full-cache inference. These results suggest that learned, globally calibrated KV eviction is not only a compression technique, but also a mechanism for improving long-context reasoning.
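The idea of a single global eviction policy under a unified memory budget can be sketched as a global top-k selection over all cached entries. This is an illustrative sketch only, not the paper's implementation: the abstract gives no formulas, so the `CacheEntry` type, the per-token `gate` value, and the scoring rule `gate ** age` (one possible reading of "geometric retention") are assumptions, and the learned gates and the shared scoring projection are replaced here by hand-picked constants.

```python
import heapq
from typing import NamedTuple


class CacheEntry(NamedTuple):
    layer: int
    head: int
    position: int  # decoding step at which the token was cached
    gate: float    # hypothetical retention gate in (0, 1); learned in the paper


def retention_score(entry, current_step):
    """Assumed geometric retention: the score decays as gate ** age."""
    age = current_step - entry.position
    return entry.gate ** age


def evict_globally(entries, budget, current_step):
    """Keep the `budget` highest-scoring entries across ALL layers and heads."""
    return heapq.nlargest(
        budget, entries, key=lambda e: retention_score(e, current_step)
    )


# Hand-picked gates for illustration; entries from different layers and heads
# compete for the same three cache slots.
entries = [
    CacheEntry(layer=0, head=0, position=0, gate=0.99),
    CacheEntry(layer=0, head=0, position=8, gate=0.50),
    CacheEntry(layer=0, head=1, position=5, gate=0.90),
    CacheEntry(layer=1, head=0, position=9, gate=0.60),
    CacheEntry(layer=1, head=1, position=2, gate=0.70),
]
kept = evict_globally(entries, budget=3, current_step=10)
for entry in kept:
    print(entry)
```

Because the budget is shared rather than split evenly, the allocation is non-uniform in this example: two of the three surviving entries come from layer 0, and the head at layer 1, head 1 keeps nothing. A per-head budget would have forced every head to keep the same number of entries regardless of their scores.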

Get this paper in your agent:

hf papers read 2605.09649

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Similar Articles

Epiphany-Aware KV Cache Eviction Without the Attention Matrix

TTKV: Temporal-Tiered KV Cache for Long-Context LLM Inference

Open sourcing InfiniteKV: a KV cache that files old tokens as 104-byte searchable records in RAM or on disk instead of deleting them. Mistral-7B answered from token 76,747, 2.3x past its trained window. Colab demo

KV Packet: Recomputation-Free Context-Independent KV Caching for LLMs

Information-Aware KV Cache Compression for Long Reasoning
