CONF-KV: Confidence-Aware KV Cache Eviction with Mixed-Precision Storage for Long-Horizon LLM
Summary
CONF-KV is a KV-cache management system that uses model uncertainty to dynamically adjust cache retention, improving memory efficiency for long-context LLM inference while maintaining accuracy within 1.5-2.1 perplexity points.
View Cached Full Text
Cached at: 05/29/26, 11:04 PM
Paper page - CONF-KV: Confidence-Aware KV Cache Eviction with Mixed-Precision Storage for Long-Horizon LLM
Source: https://huggingface.co/papers/2605.24786
Abstract
CONF-KV is a KV-cache management system that dynamically adjusts cache retention based on model uncertainty, improving memory efficiency and performance for long-sequence language model inference.
Long-horizon LLM inference turns the key--value (KV) cache into the dominant GPU memory consumer and makes per-token attention increasingly expensive. Many common eviction policies use static recency windows or historical attention, leaving unused a signal computed on every decoding step: the model’s current uncertainty. We introduce CONF-KV, a KV-cache manager that converts the next-token distribution into a scalarconfidence scoreand uses it to choose the per-step cache budget, retaining more context when the model is uncertain and pruning aggressively when it is confident. Within each budget, tokens are ranked by a composite of accumulatedattention massand recency, while a protected recent window preserves local coherence. We combine the policy withblockwise online-softmax attention,mixed FP16/INT8 storage, and apyramidal per-layer budgetvariant. Across four model families and generated lengths up to 4K, CONF-KV stays near the footprint of a fixed 512-token sliding window while remaining within 1.5--2.1 perplexity points of full KV. On Needle-in-a-Haystack up to 32K tokens, CONF-KV reaches 91.4% retrieval accuracy versus 53.8% for sliding windows and 80.6% for H2O; on 75 VisualWebArena tasks it retains 95.3% of full-KV success at 2.8 times lower peak memory.
View arXiv pageView PDFAdd to collection
Get this paper in your agent:
hf papers read 2605\.24786
Don’t have the latest CLI?curl \-LsSf https://hf\.co/cli/install\.sh \| bash
Models citing this paper0
No model linking this paper
Cite arxiv.org/abs/2605.24786 in a model README.md to link it from this page.
Datasets citing this paper0
No dataset linking this paper
Cite arxiv.org/abs/2605.24786 in a dataset README.md to link it from this page.
Spaces citing this paper0
No Space linking this paper
Cite arxiv.org/abs/2605.24786 in a Space README.md to link it from this page.
Collections including this paper0
No Collection including this paper
Add this paper to acollectionto link it from this page.
Similar Articles
KV Packet: Recomputation-Free Context-Independent KV Caching for LLMs
KV Packet proposes a recomputation-free cache reuse framework for LLMs that uses trainable soft-token adapters to bridge context discontinuities, eliminating overhead while maintaining performance comparable to full recomputation baselines on Llama-3.1 and Qwen2.5.
When Does Value-Aware KV Eviction Help? A Fixed-Contract Diagnostic for Non-Monotone Cache Compression
This paper introduces a fixed-contract diagnostic tool to analyze why KV cache compression methods succeed or fail in long-context LLM inference. It identifies three failure modes—missing evidence, scoring irrelevant tokens, and breaking related evidence—and evaluates them on LongBench and NeedleBench.
Reformulating KV Cache Eviction Problem for Long-Context LLM Inference
This paper introduces LaProx, a novel KV Cache eviction strategy for long-context LLM inference that reformulates the problem as an output-aware matrix multiplication approximation, achieving high performance with only 5% cache usage.
TTKV: Temporal-Tiered KV Cache for Long-Context LLM Inference
TTKV introduces a temporal-tiered KV cache that mimics human memory to cut 128K-context LLM inference latency by 76% and double throughput while reducing cross-tier traffic 5.94×.
LKV: End-to-End Learning of Head-wise Budgets and Token Selection for LLM KV Cache Eviction
This paper introduces LKV, a method for end-to-end learning of head-wise budgets and token selection to optimize KV cache eviction in large language models, achieving state-of-the-art performance with high compression rates.