KV Packet: Recomputation-Free Context-Independent KV Caching for LLMs
Summary
KV Packet proposes a recomputation-free cache reuse framework for LLMs that uses trainable soft-token adapters to bridge context discontinuities. It eliminates recomputation overhead while keeping F1 scores comparable to the full recomputation baseline on Llama-3.1 and Qwen2.5.
Cached at: 04/20/26, 08:29 AM
Source: https://huggingface.co/papers/2604.13226
Abstract
KV Packet is a cache reuse framework that eliminates recomputation overhead in large language models by treating cached documents as immutable packets with trainable soft-token adapters.
Large Language Models (LLMs) rely heavily on Key-Value (KV) caching to minimize inference latency. However, standard KV caches are context-dependent: reusing a cached document in a new context requires recomputing KV states to account for shifts in attention distribution. Existing solutions such as CacheBlend, EPIC, and SAM-KV mitigate this issue by selectively recomputing a subset of tokens; however, they still incur non-negligible computational overhead (FLOPs) and increased Time-to-First-Token (TTFT) latency. In this paper, we propose KV Packet, a recomputation-free cache reuse framework that treats cached documents as immutable “packets” wrapped in lightweight trainable soft-token adapters, which are trained via self-supervised distillation to bridge context discontinuities. Experiments on Llama-3.1 and Qwen2.5 demonstrate that the proposed KV Packet method achieves near-zero FLOPs and lower TTFT than recomputation-based baselines, while retaining F1 scores comparable to those of the full recomputation baseline.
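To make the "immutable packet" idea concrete, here is a minimal pure-Python sketch of assembling a context cache by concatenation only. It is an illustration under stated assumptions, not the authors' implementation: the `Packet` class, the prefix/suffix placement of adapter entries, and the helper names `assemble` and `attend` are all hypothetical. The adapter entries are fixed numbers rather than trained soft tokens, and positional encoding is ignored.

```python
import math
from dataclasses import dataclass

Vector = list[float]
KVPair = tuple[Vector, Vector]  # (key, value) for one cached position


@dataclass(frozen=True)
class Packet:
    """Hypothetical container: document KV entries wrapped in adapter KV entries."""
    prefix: tuple[KVPair, ...]  # adapter entries (assumed placement; fixed numbers here)
    body: tuple[KVPair, ...]    # document entries, computed once and reused as-is
    suffix: tuple[KVPair, ...]  # adapter entries (assumed placement; fixed numbers here)

    def entries(self) -> list[KVPair]:
        return [*self.prefix, *self.body, *self.suffix]


def assemble(packets: list[Packet]) -> list[KVPair]:
    """Build a context cache by concatenation only; no KV entry is recomputed."""
    cache: list[KVPair] = []
    for packet in packets:
        cache.extend(packet.entries())
    return cache


def attend(query: Vector, cache: list[KVPair]) -> Vector:
    """Single-head scaled dot-product attention of one query over the cache."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key, _ in cache]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(cache[0][1])
    return [
        sum(w * v[i] for w, (_, v) in zip(weights, cache)) / total
        for i in range(dim)
    ]


# Two toy packets with 2-dimensional keys and values (made-up numbers).
doc_a = Packet(
    prefix=(([0.1, 0.0], [0.0, 0.0]),),
    body=(([1.0, 0.0], [1.0, 0.0]), ([0.0, 1.0], [0.0, 1.0])),
    suffix=(([0.0, 0.1], [0.0, 0.0]),),
)
doc_b = Packet(
    prefix=(([0.2, 0.0], [0.0, 0.0]),),
    body=(([0.5, 0.5], [0.5, 0.5]), ([1.0, 1.0], [1.0, 1.0])),
    suffix=(([0.0, 0.2], [0.0, 0.0]),),
)

# The same packets can be reused in any order without touching their contents.
cache = assemble([doc_b, doc_a])
query = [1.0, 0.0]
out = attend(query, cache)
```

In this toy setting the cached entries are the very same objects held by the packets, so reuse costs no extra attention FLOPs at assembly time. In the paper, the adapters are trained via self-supervised distillation so that reuse also preserves output quality; the sketch does not model that training step.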
View arXiv page (https://arxiv.org/abs/2604.13226) View PDF (https://arxiv.org/pdf/2604.13226) GitHub (https://github.com/ChuangtaoChen-TUM/KVPacket)
Similar Articles
CompressKV: Semantic-Retrieval-Guided KV-Cache Compression for Resource-Efficient Long-Context LLM Inference
CompressKV proposes a semantic-retrieval-guided KV-cache compression method for GQA-based LLMs, identifying Semantic Retrieval Heads to retain critical tokens. It achieves over 97% of full-cache performance using only 3% of the KV cache on LongBench tasks.
TTKV: Temporal-Tiered KV Cache for Long-Context LLM Inference
TTKV introduces a temporal-tiered KV cache that mimics human memory to cut 128K-context LLM inference latency by 76% and double throughput while reducing cross-tier traffic 5.94×.
SeKV: Resolution-Adaptive KV Cache with Hierarchical Semantic Memory for Long-Context LLM Inference
SeKV is a resolution-adaptive KV cache method that organizes context into entropy-guided semantic spans stored across a GPU-CPU hierarchy, enabling selective token-level reconstruction during decoding while reducing GPU memory by 53.3% versus full caching at 128K context.
Reformulating KV Cache Eviction Problem for Long-Context LLM Inference
This paper introduces LaProx, a novel KV Cache eviction strategy for long-context LLM inference that reformulates the problem as an output-aware matrix multiplication approximation, achieving high performance with only 5% cache usage.
PolyKV: Heterogeneous Retention and Allocation for KV Cache Compression
PolyKV is a layer-wise KV cache compression framework that assigns heterogeneous eviction policies and non-uniform budgets per layer, significantly improving over uniform baselines on LongBench with LLaMA-3.1-8B and Qwen3-8B.