KV Packet: Recomputation-Free Context-Independent KV Caching for LLMs

Hugging Face Daily Papers

Summary

KV Packet proposes a recomputation-free cache reuse framework for LLMs that uses trainable soft-token adapters to bridge context discontinuities, eliminating overhead while maintaining performance comparable to full recomputation baselines on Llama-3.1 and Qwen2.5.

Large Language Models (LLMs) rely heavily on Key-Value (KV) caching to minimize inference latency. However, standard KV caches are context-dependent: reusing a cached document in a new context requires recomputing KV states to account for shifts in attention distribution. Existing solutions such as CacheBlend, EPIC, and SAM-KV mitigate this issue by selectively recomputing a subset of tokens; however, they still incur non-negligible computational overhead (FLOPs) and increased Time-to-First-Token (TTFT) latency. In this paper, we propose KV Packet, a recomputation-free cache reuse framework that treats cached documents as immutable "packets" wrapped in lightweight trainable soft-token adapters, which are trained via self-supervised distillation to bridge context discontinuities. Experiments on Llama-3.1 and Qwen2.5 demonstrate that the proposed KV Packet method achieves near-zero FLOPs and lower TTFT than recomputation-based baselines, while retaining F1 scores comparable to those of the full recomputation baseline.
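
The context dependence described above can be illustrated with a toy model. The sketch below is not taken from the paper: it assumes a single attention head, identity Q/K/V projections, no positional encoding, and made-up two-dimensional embeddings (`document` and `prefix` are hypothetical). Under those assumptions, the keys a second layer would cache for a document change as soon as other tokens precede it, which is why a cache built in isolation cannot simply be dropped into a new context.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_layer(hidden):
    # Toy single-head causal self-attention with identity Q/K/V projections
    # and a residual connection (a simplification, not a real model layer).
    dim = len(hidden[0])
    outputs = []
    for i, query in enumerate(hidden):
        visible = hidden[: i + 1]  # causal mask: positions 0..i only
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in visible
        ]
        weights = softmax(scores)
        mixed = [
            sum(w * tok[d] for w, tok in zip(weights, visible))
            for d in range(dim)
        ]
        outputs.append([h + m for h, m in zip(query, mixed)])
    return outputs

def second_layer_keys(embeddings):
    # With identity projections, the keys cached at layer 2 are simply
    # the hidden states produced by layer 1.
    return causal_attention_layer(embeddings)

document = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical token embeddings
prefix = [[0.5, -1.0], [2.0, 0.5]]               # hypothetical preceding context

keys_alone = second_layer_keys(document)
keys_in_context = second_layer_keys(prefix + document)[len(prefix):]

# Largest element-wise gap between the two versions of the document's keys.
max_diff = max(
    abs(a - b)
    for row_a, row_b in zip(keys_alone, keys_in_context)
    for a, b in zip(row_a, row_b)
)
print(f"largest difference in layer-2 keys: {max_diff:.4f}")
```

In this toy setup the first-layer keys are identical in both runs, since they depend only on the token embeddings; the mismatch appears one layer up, where each hidden state already mixes in whatever came before it. Real models add positional encodings on top of this, so the simplification here likely understates the effect.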

Cached at: 04/20/26, 08:29 AM

Paper page - KV Packet: Recomputation-Free Context-Independent KV Caching for LLMs

Source: https://huggingface.co/papers/2604.13226

Abstract

KV Packet is a cache reuse framework that eliminates recomputation overhead in large language models by treating cached documents as immutable packets with trainable soft-token adapters.
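
One way to picture the "immutable packet" idea is as a data structure in which each document's KV entries are computed once, frozen, and later concatenated with other packets without being touched. The sketch below is an illustration under stated assumptions, not the authors' implementation: the names `Packet`, `header`, `trailer`, `assemble_cache`, and `toy_entry` are hypothetical, the placement of adapter entries on both sides of the document is a guess, and the adapter values are placeholders rather than trained parameters.

```python
from dataclasses import dataclass
from typing import List, Tuple

# One cached position: a (key, value) pair of float vectors (hypothetical layout).
KVEntry = Tuple[Tuple[float, ...], Tuple[float, ...]]

@dataclass(frozen=True)
class Packet:
    doc_id: str
    header: Tuple[KVEntry, ...]   # hypothetical adapter entries before the document
    body: Tuple[KVEntry, ...]     # document KV, computed once in isolation, then frozen
    trailer: Tuple[KVEntry, ...]  # hypothetical adapter entries after the document

    def entries(self) -> Tuple[KVEntry, ...]:
        return self.header + self.body + self.trailer

def assemble_cache(packets: List[Packet]) -> List[KVEntry]:
    # Build a request's cache by concatenation only: no document position is
    # re-encoded, so the cost is a copy rather than a forward pass.
    cache: List[KVEntry] = []
    for packet in packets:
        cache.extend(packet.entries())
    return cache

def toy_entry(x: float) -> KVEntry:
    # Placeholder key/value vectors; a real cache would hold per-layer tensors.
    return ((x, x), (x, -x))

adapter = (toy_entry(0.0),)  # stands in for trained soft-token adapter entries
doc_a = Packet("a", adapter, (toy_entry(1.0), toy_entry(2.0)), adapter)
doc_b = Packet("b", adapter, (toy_entry(3.0),), adapter)

# The same packets can be reused in any order; their bodies are never rebuilt.
cache_ab = assemble_cache([doc_a, doc_b])
cache_ba = assemble_cache([doc_b, doc_a])
```

The frozen dataclass and tuple fields are a design choice for the sketch: they make accidental in-place modification of a cached document an error, which mirrors the "immutable" wording in the description above. Everything that would adapt a packet to its neighbours lives in the adapter entries, which in the paper are trained and here are only stand-ins.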

arXiv page (https://arxiv.org/abs/2604.13226), PDF (https://arxiv.org/pdf/2604.13226), GitHub (https://github.com/ChuangtaoChen-TUM/KVPacket)

Get this paper in your agent:

hf papers read 2604.13226

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Similar Articles

OjaKV: Context-Aware Online Low-Rank KV Cache Compression

arXiv cs.CL

OjaKV introduces a context-aware online low-rank KV cache compression framework that uses hybrid storage and Oja's algorithm for incremental subspace adaptation to reduce GPU memory bottlenecks in long-context LLM inference without model fine-tuning.

KV Cache Compression 900000x Beyond TurboQuant and Per-Vector Shannon Limit

Hacker News Top

A new paper proposes sequential KV cache compression using probabilistic language tries and predictive delta coding, achieving theoretical compression ratios of ~914,000× beyond TurboQuant by exploiting the sequential structure of language model tokens rather than treating vectors independently.

Sparse Prefix Caching for Hybrid and Recurrent LLM Serving

arXiv cs.LG

This paper introduces sparse prefix caching for hybrid and recurrent LLMs, which stores recurrent states at a limited set of checkpoint positions to avoid dense caching while minimizing recomputation. The method outperforms standard heuristics on real-world data, especially when requests share substantial but non-identical prefixes.