KV Packet: Recomputation-Free Context-Independent KV Caching for LLMs
Summary
KV Packet is a recomputation-free cache-reuse framework for LLMs that wraps cached documents in trainable soft-token adapters to bridge context discontinuities, eliminating recomputation overhead while maintaining F1 scores comparable to full-recomputation baselines on Llama-3.1 and Qwen2.5.
Source: https://huggingface.co/papers/2604.13226
Abstract
KV Packet is a cache reuse framework that eliminates recomputation overhead in large language models by treating cached documents as immutable packets with trainable soft-token adapters.
Large Language Models (LLMs) rely heavily on Key-Value (KV) caching to minimize inference latency. However, standard KV caches are context-dependent: reusing a cached document in a new context requires recomputing KV states to account for shifts in attention distribution. Existing solutions such as CacheBlend, EPIC, and SAM-KV mitigate this issue by selectively recomputing a subset of tokens; however, they still incur non-negligible computational overhead (FLOPs) and increased Time-to-First-Token (TTFT) latency. In this paper, we propose KV Packet, a recomputation-free cache reuse framework that treats cached documents as immutable “packets” wrapped in lightweight trainable soft-token adapters, which are trained via self-supervised distillation to bridge context discontinuities. Experiments on Llama-3.1 and Qwen2.5 demonstrate that the proposed KV Packet method achieves near-zero FLOPs and lower TTFT than recomputation-based baselines, while retaining F1 scores comparable to those of the full recomputation baseline.
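For intuition only, the following is a minimal, self-contained sketch of the splicing step that recomputation-free reuse implies: precomputed per-layer K/V tensors for each cached document are concatenated with small boundary K/V blocks instead of re-encoding any document token. The cache layout, the tensor shapes, and the random stand-ins for the trained soft-token adapters are illustrative assumptions, not the paper's implementation.

import torch

n_layers, n_heads, d_head = 4, 8, 64

def fake_packet(doc_len, n_adapter=4):
    # A cached document "packet": frozen per-layer K/V for the document plus
    # small K/V blocks standing in for the trained soft-token boundary adapters.
    kv = lambda t: (torch.randn(n_layers, n_heads, t, d_head),
                    torch.randn(n_layers, n_heads, t, d_head))
    return {"pre": kv(n_adapter), "doc": kv(doc_len), "post": kv(n_adapter)}

def splice_packets(packets):
    # Concatenate [pre | doc | post] for every packet along the token axis.
    # No document token is re-encoded, so prefill FLOPs stay near zero.
    keys, values = [], []
    for p in packets:
        for part in ("pre", "doc", "post"):
            k, v = p[part]
            keys.append(k)
            values.append(v)
    return torch.cat(keys, dim=2), torch.cat(values, dim=2)

cache_k, cache_v = splice_packets([fake_packet(512), fake_packet(256)])
print(cache_k.shape)  # torch.Size([4, 8, 784, 64]): 768 document tokens + 16 adapter tokens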
View arXiv page (https://arxiv.org/abs/2604.13226) · View PDF (https://arxiv.org/pdf/2604.13226) · GitHub (https://github.com/ChuangtaoChen-TUM/KVPacket)
Get this paper in your agent:
hf papers read 2604.13226
Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Similar Articles
TTKV: Temporal-Tiered KV Cache for Long-Context LLM Inference
TTKV introduces a temporal-tiered KV cache that mimics human memory to cut 128K-context LLM inference latency by 76% and double throughput while reducing cross-tier traffic by 5.94×.
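As a rough illustration of what an age-based tiered KV cache could look like (the three tiers, their sizes, and the demotion-by-age policy are assumptions for exposition, not TTKV's actual design):

from collections import deque

class TieredKVCache:
    def __init__(self, hot_size=4096, warm_size=32768):
        self.hot = deque()      # most recent tokens (think GPU HBM)
        self.warm = deque()     # older tokens (think CPU memory)
        self.cold = []          # oldest tokens (think compressed / offloaded)
        self.hot_size, self.warm_size = hot_size, warm_size

    def append(self, pos, kv):
        # New tokens land in the fast tier; overflow is demoted by age.
        # The demotions below are the cross-tier traffic a real system
        # would try to minimize.
        self.hot.append((pos, kv))
        if len(self.hot) > self.hot_size:
            self.warm.append(self.hot.popleft())
        if len(self.warm) > self.warm_size:
            self.cold.append(self.warm.popleft())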
OjaKV: Context-Aware Online Low-Rank KV Cache Compression
OjaKV introduces a context-aware online low-rank KV cache compression framework that uses hybrid storage and Oja's algorithm for incremental subspace adaptation to reduce GPU memory bottlenecks in long-context LLM inference without model fine-tuning.
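For a concrete feel of the online subspace adaptation mentioned above, here is a toy numpy sketch of Oja's subspace rule applied to a synthetic stream of key vectors; the dimensions, learning rate, re-orthonormalization step, and data generator are assumptions, not details taken from the paper.

import numpy as np

def oja_subspace_step(W, x, lr=1e-2):
    # One online step of Oja's subspace rule: W tracks the top-r principal
    # subspace of the streamed vectors. W: (d, r), x: (d,).
    y = W.T @ x                                          # project onto current basis
    W = W + lr * (np.outer(x, y) - W @ np.outer(y, y))   # Hebbian growth minus decay
    W, _ = np.linalg.qr(W)                               # keep the basis well conditioned
    return W

rng = np.random.default_rng(0)
d, r = 128, 16
basis = np.linalg.qr(rng.standard_normal((d, r)))[0]
proj = rng.standard_normal((r, d))                       # synthetic low-rank key generator
for _ in range(2000):
    key = rng.standard_normal(r) @ proj + 0.01 * rng.standard_normal(d)
    basis = oja_subspace_step(basis, key)

key = rng.standard_normal(r) @ proj
code = basis.T @ key          # store r numbers per key instead of d
approx = basis @ code         # approximate reconstruction at attention time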
KV Cache Compression 900000x Beyond TurboQuant and Per-Vector Shannon Limit
A new paper proposes sequential KV cache compression using probabilistic language tries and predictive delta coding, achieving theoretical compression ratios of ~914,000× beyond TurboQuant by exploiting the sequential structure of language model tokens rather than treating vectors independently.
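As a loose illustration of predictive delta coding on a KV stream: the trivial "previous vector" predictor below stands in for the paper's probabilistic language-trie predictor, and the quantization scale is an arbitrary choice, so this is a sketch of the coding idea rather than the method itself.

import numpy as np

def delta_encode(vectors, scale=0.05):
    # Each vector is stored as a coarsely quantized residual against the
    # previously decoded vector ("predict: same as the last token").
    prev = np.zeros_like(vectors[0])
    encoded = []
    for v in vectors:
        residual = v - prev
        q = np.clip(np.round(residual / scale), -127, 127).astype(np.int8)
        encoded.append(q)
        prev = prev + q.astype(v.dtype) * scale    # track what the decoder will see
    return encoded

def delta_decode(encoded, scale=0.05, dtype=np.float32):
    prev = np.zeros(encoded[0].shape, dtype=dtype)
    out = []
    for q in encoded:
        prev = prev + q.astype(dtype) * scale
        out.append(prev.copy())
    return out

rng = np.random.default_rng(0)
stream, v = [], np.zeros(64, dtype=np.float32)
for _ in range(8):
    v = v + 0.02 * rng.standard_normal(64).astype(np.float32)   # slowly drifting stream
    stream.append(v.copy())
recovered = delta_decode(delta_encode(stream))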
Sparse Prefix Caching for Hybrid and Recurrent LLM Serving
This paper introduces sparse prefix caching for hybrid and recurrent LLMs, which stores recurrent states at a limited set of checkpoint positions to avoid dense caching while minimizing recomputation. The method outperforms standard heuristics on real-world data, especially when requests share substantial but non-identical prefixes.
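A minimal sketch of the checkpointing idea, assuming a fixed checkpoint stride and a simple longest-usable-checkpoint lookup; the paper's actual checkpoint placement policy may differ.

import bisect

class CheckpointCache:
    # Toy container mapping token positions to recurrent-state snapshots.
    def __init__(self, stride=256):
        self.stride = stride
        self.positions = []     # sorted checkpoint positions
        self.states = {}        # position -> recurrent state snapshot

    def maybe_store(self, pos, state):
        # Keep only sparse checkpoints instead of a state per token.
        if pos % self.stride == 0 and pos not in self.states:
            self.states[pos] = state
            bisect.insort(self.positions, pos)

    def resume_point(self, shared_prefix_len):
        # Latest checkpoint at or before the shared prefix: only the
        # (shared_prefix_len - pos) tokens after it need recomputation.
        i = bisect.bisect_right(self.positions, shared_prefix_len) - 1
        if i < 0:
            return 0, None      # nothing usable: recompute the prefix from scratch
        pos = self.positions[i]
        return pos, self.states[pos]

cache = CheckpointCache(stride=256)
for pos in range(2048):
    cache.maybe_store(pos, state={"h": f"state@{pos}"})
print(cache.resume_point(1900))   # (1792, {'h': 'state@1792'})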
High-Fidelity KV Cache Summarization Using Entropy and Low-Rank Reconstruction
Proposes an SRC pipeline that uses entropy-based selection and low-rank reconstruction to summarize the KV cache instead of pruning tokens, reducing VRAM for million-token LLM contexts while avoiding catastrophic attention errors.
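For illustration, a toy sketch of keeping a few tokens exactly and summarizing the rest with a low-rank factorization; the entropy-style importance score and the rank used here are assumptions, not the SRC pipeline's actual criteria.

import torch

def summarize_kv(K, attn_probs, keep=32, rank=16):
    # K: (T, d) keys for one head; attn_probs: (Q, T) attention those keys received.
    # Per-key importance score: each key's summed -p*log(p) contribution across
    # queries -- an entropy-style proxy, not necessarily the paper's criterion.
    p = attn_probs.clamp_min(1e-9)
    score = -(p * p.log()).sum(dim=0)                    # (T,)
    keep_idx = score.topk(keep).indices
    rest = torch.ones(K.shape[0], dtype=torch.bool)
    rest[keep_idx] = False
    # The remaining keys are summarized with a rank-r reconstruction instead of
    # being dropped outright (values would be handled analogously).
    U, S, Vh = torch.linalg.svd(K[rest], full_matrices=False)
    summary = (U[:, :rank] * S[:rank]) @ Vh[:rank]
    return K[keep_idx], summary

K = torch.randn(1024, 128)
attn = torch.softmax(torch.randn(64, 1024), dim=-1)
kept, rest_summary = summarize_kv(K, attn)
print(kept.shape, rest_summary.shape)   # torch.Size([32, 128]) torch.Size([992, 128])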