KV Packet: Recomputation-Free Context-Independent KV Caching for LLMs

Summary

KV Packet is a recomputation-free cache reuse framework for LLMs that uses trainable soft-token adapters to bridge context discontinuities. On Llama-3.1 and Qwen2.5 it reduces recomputation FLOPs to near zero while keeping F1 scores comparable to the full recomputation baseline.

Large Language Models (LLMs) rely heavily on Key-Value (KV) caching to minimize inference latency. However, standard KV caches are context-dependent: reusing a cached document in a new context requires recomputing KV states to account for shifts in attention distribution. Existing solutions such as CacheBlend, EPIC, and SAM-KV mitigate this issue by selectively recomputing a subset of tokens; however, they still incur non-negligible computational overhead (FLOPs) and increased Time-to-First-Token (TTFT) latency. In this paper, we propose KV Packet, a recomputation-free cache reuse framework that treats cached documents as immutable “packets” wrapped in lightweight trainable soft-token adapters, which are trained via self-supervised distillation to bridge context discontinuities. Experiments on Llama-3.1 and Qwen2.5 demonstrate that the proposed KV Packet method achieves near-zero FLOPs and lower TTFT than recomputation-based baselines, while retaining F1 scores comparable to those of the full recomputation baseline.
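
The reuse pattern described above can be illustrated with a toy sketch. This is not the authors' implementation: the names `KVPacket`, `wrap_document`, and `attend` are hypothetical, the adapter entries are fixed numbers rather than trained soft tokens, the adapter placement (in front of the document) is an assumption, and positional encodings, multiple heads, and the distillation training are all omitted. It only shows the idea that each document's keys and values are computed once, stored unchanged, and concatenated at query time without recomputation.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class KVPacket:
    """Immutable cached keys/values for one document plus its adapter entries."""
    keys: tuple
    values: tuple


def wrap_document(doc_keys, doc_values, adapter_keys, adapter_values):
    # Assumption: adapter entries are simply prepended to the document entries.
    # In the paper the adapters are trained; here they are fixed placeholders.
    keys = tuple(tuple(k) for k in list(adapter_keys) + list(doc_keys))
    values = tuple(tuple(v) for v in list(adapter_values) + list(doc_values))
    return KVPacket(keys, values)


def attend(query, packets):
    """Single-head softmax attention over the concatenated packets.

    The cached keys and values are read as stored; nothing is recomputed.
    """
    keys = [k for p in packets for k in p.keys]
    values = [v for p in packets for v in p.values]
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


# Two documents cached independently, then reused together in one context.
adapter_k, adapter_v = [[0.1, 0.0]], [[0.0, 0.0]]
packet_a = wrap_document([[1.0, 0.0]], [[1.0, 2.0]], adapter_k, adapter_v)
packet_b = wrap_document([[0.0, 1.0]], [[3.0, 4.0]], adapter_k, adapter_v)
output = attend([1.0, 1.0], [packet_a, packet_b])
```

Because this toy has no positional encoding, the result does not depend on packet order; in a real model, position handling is one of the sources of the context discontinuity that the adapters are meant to bridge.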

Cached at: 04/20/26, 08:29 AM

Source: https://huggingface.co/papers/2604.13226

Abstract

KV Packet is a cache reuse framework that eliminates recomputation overhead in large language models by treating cached documents as immutable packets with trainable soft-token adapters.

View arXiv page (https://arxiv.org/abs/2604.13226) View PDF (https://arxiv.org/pdf/2604.13226) GitHub (https://github.com/ChuangtaoChen-TUM/KVPacket)
