CompressKV: Semantic-Retrieval-Guided KV-Cache Compression for Resource-Efficient Long-Context LLM Inference

arXiv cs.AI 06/24/26, 04:00 AM Papers
Summary
CompressKV proposes a semantic-retrieval-guided KV-cache compression method for GQA-based LLMs, identifying Semantic Retrieval Heads to retain critical tokens. It achieves over 97% full-cache performance using only 3% of the KV cache on LongBench tasks.
arXiv:2606.24467v1
# CompressKV: Semantic-Retrieval-Guided KV-Cache Compression for Resource-Efficient Long-Context LLM Inference
Source: [https://arxiv.org/html/2606.24467](https://arxiv.org/html/2606.24467)

Copyright for this paper by its authors\. Use permitted under Creative Commons License Attribution 4\.0 International \(CC BY 4\.0\)\.


SuRE’26: Workshop on Sustainability and Resource\-Efficiency of Artificial Intelligence, August 17, 2026, Bremen, Germany

Jingcun Wang, Olga Kondrateva, Yiyu Shi, Bing Li, Grace Li Zhang

Technical University of Darmstadt, Darmstadt, Germany; University of Notre Dame, Notre Dame, IN, USA; Technical University of Ilmenau, Ilmenau, Germany

Emails: xiaolin\.lin@tu\-darmstadt\.de \(corresponding author\), jingcun\.wang@tu\-darmstadt\.de, olga\.kondrateva@tu\-darmstadt\.de, yshi4@nd\.edu, bing\.li@tu\-ilmenau\.de, grace\.zhang@tu\-darmstadt\.de

\(2026\)

###### Abstract

Long\-context large language model \(LLM\) inference is increasingly constrained by the memory footprint and decoding cost of key\-value \(KV\) caches, limiting sustainable deployment on resource\-constrained hardware\. Existing KV cache eviction methods typically apply heuristic token scoring over all heads in GQA\-based LLMs\. These methods ignore the different functionalities of attention heads, leading to the eviction of critical tokens and thus degrading the performance of LLMs\. To address this issue, we propose CompressKV, a resource\-efficient KV\-cache compression framework for GQA\-based LLMs\. Instead of aggregating attention scores from all heads, CompressKV identifies Semantic Retrieval Heads \(SRHs\) that capture both the initial and final tokens of a prompt and semantically important mid\-context evidence, and uses them to select tokens whose KV pairs should be retained\. Furthermore, CompressKV allocates cache budgets across layers according to offline estimates of layer\-wise eviction error\. Experiments on LongBench and Needle\-in\-a\-Haystack show that CompressKV consistently outperforms existing KV\-cache eviction methods across memory budgets\. Notably, it preserves over 97% of full\-cache performance using only 3% of the KV cache on LongBench question\-answering tasks and achieves 90% accuracy with just 0\.7% KV storage on Needle\-in\-a\-Haystack\. These results demonstrate an improved resource–performance trade\-off for long\-context LLM inference\. Our code is publicly available at:[https://github\.com/TUDa\-HWAI/CompressKV](https://github.com/TUDa-HWAI/CompressKV)

###### keywords:

Large Language Models, Long\-Context Inference, KV\-Cache Compression, Resource\-Efficient AI, Efficient Inference

## 1 Introduction

Recent advances in large language models \(LLMs\) \[openai2024gpt4technicalreport,anthropic\_claude3\_2024,grattafiori2024llama3herdmodels,qwen2025qwen25technicalreport,jingcun2025\] have boosted their long\-context processing capabilities\. However, the key\-value \(KV\) cache grows linearly with the input length\. A large KV cache slows inference, because attention must be computed over all cached keys and values, and it requires substantial memory, which creates a major bottleneck in the deployment of long\-context LLMs\. Therefore, effective compression of the KV cache is essential for computational efficiency and model scalability\.

State\-of\-the\-art KV cache compression focuses on quantization, low\-rank approximation, and KV cache eviction \[liu2024kivi,kang2024gearefficientkvcache,ge2024modeltellsdiscardadaptive,xiao2024efficientstreaminglanguagemodels,li2024snapkvllmknowslooking,cai2025pyramidkvdynamickvcache,yang2024pyramidinferpyramidkvcache,qin2025cakecascadingadaptivekv\]\. Among them, KV\-cache eviction, which discards the KV pairs of unimportant tokens while retaining the rest, has attracted increasing attention\.

Several criteria have been proposed to identify tokens for KV\-cache eviction\. For example, StreamingLLM \[xiao2024efficientstreaminglanguagemodels\] retains only the first and last tokens and thus neglects potentially important tokens in the middle of the prompt\. SnapKV \[li2024snapkvllmknowslooking\] clusters recent attention scores within an observation window at the end of the prompt, either per head or per head group, to identify and retain the important tokens receiving the highest attention values\. CAKE \[qin2025cakecascadingadaptivekv\] extends SnapKV’s method by adding the attention variance in an observation window to the eviction score, enabling it to capture tokens whose importance fluctuates over time\.

While the above criteria work well in many KV\-cache eviction scenarios, they overlook head heterogeneity: all heads are weighted equally, and eviction decisions are made from aggregated attention scores \(typically the sum across heads within a group\)\. In fact, attention heads exhibit different functionalities\. For example, in Grouped Query Attention \(GQA\)\-based LLMs \[ainslie2023gqatraininggeneralizedmultiquery\], some attention heads, called Streaming Heads, focus exclusively on the beginning and the end of a prompt \[xiao2024efficientstreaminglanguagemodels,xiao2024duoattentionefficientlongcontextllm\]\. When a GQA group is dominated by Streaming Heads, those heads have the largest influence on KV cache eviction, so only the KV pairs of the initial and last tokens are retained\. As a result, crucial mid\-context tokens may be evicted, degrading LLM performance\.
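
A toy illustration of this dominance effect, assuming a GQA group of three heads whose scores are pooled by summation\. The attention values and the helper `top_n_indices` are hypothetical and chosen only to make the effect visible; they are not taken from the paper\.

```python
# Toy GQA group: two streaming heads and one retrieval head (illustrative values).
# Each list is one head's attention mass over 8 prompt tokens.
streaming_a = [0.45, 0.01, 0.01, 0.01, 0.01, 0.01, 0.05, 0.45]
streaming_b = [0.40, 0.02, 0.02, 0.02, 0.02, 0.02, 0.10, 0.40]
retrieval = [0.12, 0.05, 0.05, 0.50, 0.08, 0.05, 0.05, 0.10]


def top_n_indices(scores, n):
    """Return the sorted indices of the n largest scores (hypothetical helper)."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n])


# Head-agnostic pooling: sum the scores of all heads in the group.
group_sum = [sum(column) for column in zip(streaming_a, streaming_b, retrieval)]

kept_by_group = top_n_indices(group_sum, 2)      # ranking from the pooled scores
kept_by_retrieval = top_n_indices(retrieval, 2)  # ranking from the retrieval head alone

print(kept_by_group, kept_by_retrieval)
```

With a budget of two tokens, the pooled ranking keeps only the two boundary tokens, while the retrieval head on its own would keep the mid\-context token at index 3\.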

Besides evicting the KV pairs of unimportant tokens, recent work also allocates memory budgets across layers\. For example, \[xiao2024efficientstreaminglanguagemodels,li2024snapkvllmknowslooking\] allocate a fixed number of KV pairs to each layer without considering differences between layers\. \[yang2024pyramidinferpyramidkvcache,cai2025pyramidkvdynamickvcache,qin2025cakecascadingadaptivekv\] allocate the KV cache budget across layers based on attention distributions or layer\-wise statistics such as attention entropy or variance, which often require additional online computation\. Moreover, attention distributions can vary significantly across models, which limits the generalization ability and effectiveness of these allocation strategies\. Orthogonally, HeadKV \[fu2024headsmatterheadlevelkv\] and AdaKV \[feng2025adakvoptimizingkvcache\] extend budget allocation to the head level\.

In this paper, we observe that certain attention heads are capable of retrieving important tokens within the text and attending to their surrounding semantic context\. We refer to these heads as Semantic Retrieval Heads\. Motivated by this observation, we identify the Semantic Retrieval Heads in each layer, use them to determine the crucial tokens, and share a unified set of crucial token indices across all heads within that layer\. This approach mitigates the dominance of Streaming Heads in KV cache eviction and thereby improves the performance of GQA\-based models\. Furthermore, we analyze the cache eviction error of each layer individually and introduce a layer\-adaptive KV cache allocation strategy\. Our contributions are as follows:

\(1\) We introduce a Semantic\-Retrieval–driven mechanism to address streaming\-head dominance in GQA, preventing important tokens from being evicted\. The identified Semantic Retrieval Heads then guide token importance estimation and KV\-cache eviction\. Our experimental results demonstrate that Semantic Retrieval Heads can identify which tokens are unimportant before generation\.

\(2\) We estimate each layer’s compression impact by computing the Frobenius norm of the difference between its attention‐block outputs with the compressed cache and those with the full cache, during the decoding stage\. Cache budgets are then proportionally assigned across layers, prioritizing layers with higher errors\. Importantly, this analysis is performed offline and does not introduce any additional overhead during online inference\.

\(3\) CompressKV is validated on multiple LLMs using LongBench and Needle\-in\-a\-Haystack \(NIAH\)\. On LongBench, CompressKV maintains over 99% of full‐cache performance with only 19% of the KV budget and retains 97% of question‐answering accuracy using just 3% of the cache\. On the Needle‐in‐a‐Haystack retrieval benchmark, it achieves 90% of the baseline accuracy with only 0\.7% of KV storage\.

## 2 Background and Related Work

![Refer to caption](https://arxiv.org/html/2606.24467v1/x1.png)Figure 1: Motivation\. \(a\) The attention score distribution of a streaming head \(SH\)\. \(b\) The attention score distribution of a retrieval head \(RH\)\. \(c\) Streaming attention heads in a GQA group dominate the token eviction, so only the initial and final tokens are retained and the critical tokens are evicted\.

### 2\.1 KV\-Cache Compression for Budgeted Long\-Context Inference

To alleviate the burden of KV cache storage, various KV cache compression methods have been proposed, e\.g\., quantization \[liu2024kivi\], low‐rank approximation \[kang2024gearefficientkvcache\], and KV cache eviction\. In particular, KV cache eviction reduces cache size by removing the KV pairs of unimportant tokens without retraining\. There are different eviction strategies\. For example, StreamingLLM \[xiao2024efficientstreaminglanguagemodels\] focuses solely on retaining the first and last tokens, which only addresses the Streaming Head scenario and neglects potentially important tokens in the middle of the sequence\. To overcome this limitation, more advanced methods have been proposed \[liu2023scissorhandsexploitingpersistenceimportance,zhang2023h2oheavyhitteroracleefficient,li2024snapkvllmknowslooking,han2024lminfinitezeroshotextremelength,oren2024transformersmultistaternns\]\. A representative example is SnapKV \[li2024snapkvllmknowslooking\], which clusters recent attention scores, either per head or per head group, to identify important tokens and retain their KV pairs\. Besides, recent approaches, including PyramidKV \[cai2025pyramidkvdynamickvcache\], D2O \[wan2025d2odynamicdiscriminativeoperations\], and CAKE \[qin2025cakecascadingadaptivekv\], dynamically allocate cache budgets based on attention statistics or modeled attention dynamics of all the layers in an LLM\. Beyond layer\-level allocation, HeadKV \[fu2024headsmatterheadlevelkv\] and AdaKV \[feng2025adakvoptimizingkvcache\] further refine the cache budget with head\-level allocation\. Their strategies for selecting important tokens extend SnapKV’s eviction strategy\.

Despite their effectiveness, existing eviction pipelines have two limitations that are especially relevant to GQA\-based LLMs\. First, many prior KV cache eviction pipelines compute token importance via head\-agnostic pooling \(e\.g\., across heads within each GQA group\) when selecting tokens for eviction, effectively treating all attention heads equally and ignoring their functional heterogeneity\. Recent work \[olsson2022incontextlearninginductionheads,kwon2022fastposttrainingpruningframework,zheng2024attentionheadslargelanguage,ren2024identifyingsemanticinductionheads,wu2024retrievalheadmechanisticallyexplains,todd2024functionvectorslargelanguage,yin2025attentionheadsmatterincontext,tang2024razorattentionefficientkvcache,fu2024headsmatterheadlevelkv\] has shown that different attention heads have distinct roles\. Some attention heads, called Streaming Heads in prior work, always focus on the beginning and the end of a prompt\. For example, in Figure [1](https://arxiv.org/html/2606.24467#S2.F1)\(a\), head 0 is such a Streaming Head, since the attention scores of the initial token and the last tokens are larger than those of the remaining tokens\. In contrast, some attention heads, called Retrieval Heads in \[wu2024retrievalheadmechanisticallyexplains\], exhibit copy‑and‑paste behaviors in long‑context scenarios\. For example, in Figure [1](https://arxiv.org/html/2606.24467#S2.F1)\(b\), head 1 is such a Retrieval Head, since the attention scores of the correct answer “sandwich” are larger\. HeadKV \[fu2024headsmatterheadlevelkv\] further scores heads using retrieval and reasoning signals\. In GQA\-based LLMs, Streaming Heads tend to have a larger effect on KV cache eviction than the other heads, so only the KV pairs corresponding to the initial and last tokens are retained\. This leads to the eviction of crucial tokens in the middle of a prompt and thus degrades the performance of LLMs\. Figure [1](https://arxiv.org/html/2606.24467#S2.F1)\(c\) illustrates such an example, where Streaming Heads including head 0 and head 1 dominate token eviction for KV cache compression\.

Second, existing layer\-adaptive allocation methods \[yang2024pyramidinferpyramidkvcache,cai2025pyramidkvdynamickvcache,qin2025cakecascadingadaptivekv\] often rely on attention distributions or layer\-wise statistics such as entropy and variance\. These signals can introduce additional online computation and may vary across models, making the resulting allocation less robust under fixed resource budgets\. In contrast, CompressKV identifies Semantic Retrieval Heads offline and uses them to guide retrieval\-aware token selection, while assigning layer\-wise budgets according to offline eviction\-error estimates\. This design improves the accuracy–memory trade\-off without requiring online layer\-importance estimation or budget search during generation\.

## 3 CompressKV

CompressKV includes three key components: \(1\) Identification of the attention heads that are capable of retrieving important tokens within the text and attending to their surrounding semantic context\. \(2\) Important token selection driven by such identified heads\. \(3\) Error\-aware layer\-adaptive cache allocation\. In the following subsections, we will first explain our observations and insights into identification of attention heads with specified functionalities\. Afterwards, we will take advantage of such heads to select tokens for KV cache eviction\. Furthermore, different cache budgets will be allocated to different layers\.

![Refer to caption](https://arxiv.org/html/2606.24467v1/x2.png)Figure 2: Illustration of Semantic Retrieval Head identification versus traditional Retrieval Head selection\. Semantic Retrieval Heads capture attention over the entire answer span, addressing the limitations of traditional methods that rely solely on copy\-and\-paste behavior\.

### 3\.1 Observations and Insights

To prevent Streaming Attention Heads from dominating KV\-cache eviction as illustrated in Figure[1](https://arxiv.org/html/2606.24467#S2.F1)\(c\), we use Retrieval Heads—rather than all attention heads—to identify important tokens for KV\-cache eviction\. Importantly, most prior methods do not leverage retrieval heads for token\-level eviction decisions; for example, HeadKV mainly uses retrieval\-head signals for head\-level KV budget allocation instead of selecting tokens to keep or evict\. To this end, we first review how prior work identifies Retrieval Heads\.

Previous work identifies Retrieval Heads using a strict top\-1 rule: an attention head is labeled a Retrieval Head if its highest attention score falls exactly on the correct answer token during generation \[wu2024retrievalheadmechanisticallyexplains\]\. This identification technique emphasizes copy\-and\-paste behavior\. \[tang2024razorattentionefficientkvcache\] extends copy\-and\-paste identification by classifying both echo heads \(copy\-and\-paste to the identical prior token\) and induction heads \(an extension that attends to the immediately preceding token\) as Retrieval Heads\. HeadKV \[fu2024headsmatterheadlevelkv\] relaxes the strict top\-1 criterion to a top\-k hit: at each decoding step, a head is credited if the ground\-truth answer token ranks within its top\-k attention weights\.

Although HeadKV’s criterion is more relaxed than strict top\-1, it remains peak\-driven, privileging sharp attention on the answer token\. In long contexts, where attention is sparse and skewed towards boundary tokens, top\-1 rules yield low hit rates and can under\-credit attention heads whose attention covers the answer span and its semantic neighborhood without placing a single sharp peak on the exact answer token\. In HeadKV, if parts of the answer span do not appear within the top\-k ranked positions, heads allocating substantial attention to these tokens may not be credited\. For instance, in Figure [2](https://arxiv.org/html/2606.24467#S3.F2)\(a\), head 0 fails to receive credit because the relevant tokens fall outside the top\-k range, even though it covers the region around the correct answer\. Moreover, because the top\-k threshold in HeadKV is tied to the answer length, this method reverts to the original strict top\-1 regime when answers are short, e\.g\., only one or two tokens\.

To address this limitation, we introduce Semantic Retrieval Heads \(SRH\), a span\-aggregation standard that credits attention heads for both copy\-and\-paste behaviours and deeper semantic dependencies\. We then use such heads to identify important tokens for KV cache eviction, thereby preventing crucial mid\-prompt evidence from being suppressed by streaming heads\. For a visual comparison between Semantic Retrieval Heads and traditional Retrieval Heads, please refer to Section[4\.7](https://arxiv.org/html/2606.24467#S4.SS7)\.

### 3\.2 Semantic Retrieval Head Identification Standards

Instead of requiring exact top‑k hits as in traditional Retrieval Head identification, we use a calibration dataset \(following \[wu2024retrievalheadmechanisticallyexplains\]; provided in our codebase\) to evaluate each head $h$ by aggregating its attention mass over the entire answer span whenever the model generates a correct answer token\. Formally, we define the SRH score $S_{\mathrm{SRH}}(h)$ as

$$S_{\mathrm{SRH}}(h) = \sum_{t=1}^{N} \mathbf{1}_{\{y_{t} \in \mathcal{A}\}} \sum_{j \in \mathcal{A}} a_{t,j}^{(h)}, \tag{1}$$

where $y_{t}$ is the generated token at step $t$, $\mathcal{A}$ is the answer span, and $a_{t,j}^{(h)}$ is head $h$’s attention weight on the $j$‑th token of $\mathcal{A}$\. The higher the score of a head, the more capable it is of capturing semantic information\.

Figure [2](https://arxiv.org/html/2606.24467#S3.F2)\(b\) illustrates the concept of this new identification standard\. By summing over the entire span, we can capture attention heads that contribute semantically relevant context even when they never achieve top‑1 attention on a single token\. Aggregation over multiple tokens enables the method to recognize heads that attend to semantic cues, such as “eat” or “a thing” around “sandwich”, rather than only pure copy‑and‑paste patterns\. For example, head 0 in Figure [2](https://arxiv.org/html/2606.24467#S3.F2) is considered a Semantic Retrieval Head under our new standard, although it is not considered a Retrieval Head by the traditional identification methods\.
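
The span\-aggregated score of Equation 1 can be sketched as follows\. This is a minimal illustration rather than the authors’ implementation: the function names, the toy attention rows, and the simplified top\-1 check \(does the peak fall inside the answer span\) are all assumptions made for the example\.

```python
def srh_score(attn_per_step, generated, answer_positions, answer_tokens):
    """Equation (1): sum a head's attention mass on the answer span,
    counted only at steps where the generated token is in the answer.

    attn_per_step[t][j]: this head's attention weight on prompt position j at step t
    generated[t]:        token produced at decoding step t
    answer_positions:    prompt positions that form the answer span A
    answer_tokens:       set of tokens contained in the answer span
    """
    score = 0.0
    for weights, token in zip(attn_per_step, generated):
        if token in answer_tokens:                              # indicator 1{y_t in A}
            score += sum(weights[j] for j in answer_positions)  # mass on the span
    return score


def top1_hits(attn_per_step, generated, answer_positions, answer_tokens):
    """Simplified strict top-1 rule, for contrast: count steps whose peak is in the span."""
    hits = 0
    for weights, token in zip(attn_per_step, generated):
        if token in answer_tokens:
            peak = max(range(len(weights)), key=lambda j: weights[j])
            hits += peak in answer_positions
    return hits


generated = ["a", "sandwich", "."]
answer_tokens = {"a", "sandwich"}
answer_positions = [2, 3]

# A head that spreads mass over the span without ever peaking on it.
diffuse_head = [
    [0.40, 0.05, 0.25, 0.25, 0.05, 0.00],
    [0.40, 0.05, 0.25, 0.25, 0.05, 0.00],
    [0.90, 0.05, 0.00, 0.00, 0.05, 0.00],
]
# A streaming-like head that attends mostly to the first and last positions.
streaming_head = [[0.50, 0.00, 0.01, 0.01, 0.00, 0.48]] * 3

diffuse_score = srh_score(diffuse_head, generated, answer_positions, answer_tokens)
streaming_score = srh_score(streaming_head, generated, answer_positions, answer_tokens)
diffuse_hits = top1_hits(diffuse_head, generated, answer_positions, answer_tokens)
print(diffuse_score, streaming_score, diffuse_hits)
```

In this toy setting the diffuse head never registers a top\-1 hit, yet it receives a much higher span\-aggregated score than the streaming\-like head, which is the behavior the new standard is meant to reward\.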

![Refer to caption](https://arxiv.org/html/2606.24467v1/x3.png)Figure 3: Illustration of the token selection driven by Semantic Retrieval Heads\.

### 3\.3 Token Selection Driven by Semantic Retrieval Heads

In GQA\-based LLMs, for each layer, we select the top\-$k$ Semantic Retrieval Heads, ranked by the score defined in Equation [1](https://arxiv.org/html/2606.24467#S3.E1), as the criterion for selecting important tokens for KV cache eviction\. All the attention heads within this layer share a common set of selected token indices determined by these top Semantic Retrieval Heads\. This concept is illustrated in Figure [3](https://arxiv.org/html/2606.24467#S3.F3), where a layer has two groups\. In this example, Head 2 and Head 3 are the top two Semantic Retrieval Heads\. Following SnapKV, the attention score matrices of these heads are compressed by summing over the observation window and pooling across the token dimension\. Afterwards, the compressed vectors are averaged\. The $N$ tokens with the highest averaged attention scores are selected and their KV cache pairs are retained\. The KV cache pairs of the remaining tokens are evicted to compress the KV cache\.
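
The selection steps above can be sketched in plain Python\. This is a simplified illustration under stated assumptions: max pooling with stride 1 is assumed for the pooling step, the function names and toy attention values are hypothetical, and any separate handling of the observation\-window tokens themselves is omitted\.

```python
def pool_1d(scores, kernel_size):
    """Max-pool with stride 1 and same-length output (assumed pooling variant)."""
    half = kernel_size // 2
    return [max(scores[max(0, i - half): i + half + 1]) for i in range(len(scores))]


def select_tokens(head_attn, srh_heads, num_keep, kernel_size=5):
    """Pick the token indices to retain for one layer.

    head_attn[h][q][j]: head h's attention from observation-window query q
                        to prompt token j
    srh_heads:          indices of the top Semantic Retrieval Heads of this layer
    """
    pooled = []
    for h in srh_heads:
        summed = [sum(column) for column in zip(*head_attn[h])]  # sum over the window
        pooled.append(pool_1d(summed, kernel_size))              # pool over tokens
    averaged = [sum(column) / len(pooled) for column in zip(*pooled)]  # average heads
    ranked = sorted(range(len(averaged)), key=lambda j: averaged[j], reverse=True)
    return sorted(ranked[:num_keep])  # indices shared by every head in the layer


head_attn = {
    0: [[0.9, 0, 0, 0, 0, 0, 0, 0, 0, 0.1]] * 2,  # streaming head, ignored below
    2: [[0.1, 0, 0, 0, 0, 0.8, 0.1, 0, 0, 0]] * 2,
    3: [[0.1, 0, 0, 0, 0, 0.1, 0.8, 0, 0, 0]] * 2,
}
kept = select_tokens(head_attn, srh_heads=[2, 3], num_keep=4, kernel_size=3)
print(kept)
```

Because only heads 2 and 3 contribute to the ranking, the streaming head 0 cannot pull the selection towards the boundary tokens, and the retained indices cluster around the mid\-context evidence\.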

### 3\.4 Error\-Aware Layer\-Adaptive Cache Allocation

To maximize memory efficiency under strict budget constraints, we propose an error\-aware and layer\-adaptive cache allocation strategy\. Instead of relying on attention statistics as in the previous methods, this approach quantifies the compression error caused by KV cache compression, using full\-cache outputs as the reference\. We specifically focus on the extreme compression setting, where only a small fraction of tokens are retained in each layer’s KV cache\. For each layer $l$ and decoding step $t$, let $\mathbf{O}^{l}_{\text{full},t}$ and $\mathbf{O}^{l}_{\text{comp},t}$ denote the attention outputs using the full and compressed KV caches, respectively:

$$\mathbf{O}^{l}_{\text{full},t} = \mathbf{W}^{l}_{O}\,\mathrm{Attn}\!\left(\mathbf{Q}^{l}_{t},\ \mathbf{K}^{l}_{\text{full}},\ \mathbf{V}^{l}_{\text{full}}\right), \tag{2}$$

$$\mathbf{O}^{l}_{\text{comp},t} = \mathbf{W}^{l}_{O}\,\mathrm{Attn}\!\left(\mathbf{Q}^{l}_{t},\ \mathbf{K}^{l}_{\text{comp}},\ \mathbf{V}^{l}_{\text{comp}}\right), \tag{3}$$

where $\mathbf{W}^{l}_{O}$ is the output projection matrix of layer $l$, $\mathbf{Q}^{l}_{t}$ is the query, $\mathbf{K}^{l}$ is the key, and $\mathbf{V}^{l}$ is the value representation at layer $l$\. To evaluate the error incurred by compressing the KV cache per layer, the error score for layer $l$ is computed and normalized as

$$e^{(l)} = \sum\nolimits_{t=1}^{T} \frac{\|\Delta\mathbf{O}^{l}_{t}\|_{F}}{\|\mathbf{O}^{l}_{\mathrm{full},t}\|_{F} + \epsilon}, \quad \tilde{e}^{(l)} = \frac{e^{(l)}}{\sum\nolimits_{k} e^{(k)}}, \tag{4}$$

where $T$ is the total number of decoding steps, $\|\cdot\|_{F}$ denotes the Frobenius norm, and $\epsilon$ is a small positive constant \(e\.g\., $10^{-6}$\) to prevent division by zero\.
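
A minimal sketch of the error score in Equation 4, written in plain Python\. It assumes that the difference term is the full\-cache output minus the compressed\-cache output, in line with the description in contribution \(2\); the function names and the toy matrices are hypothetical\.

```python
import math


def frobenius(matrix):
    """Frobenius norm of a matrix given as a list of rows."""
    return math.sqrt(sum(x * x for row in matrix for x in row))


def layer_error(full_outputs, comp_outputs, eps=1e-6):
    """Sum over decoding steps of the relative Frobenius error of one layer."""
    total = 0.0
    for o_full, o_comp in zip(full_outputs, comp_outputs):
        # Assumed definition of the difference: full-cache minus compressed-cache output.
        delta = [[a - b for a, b in zip(row_f, row_c)]
                 for row_f, row_c in zip(o_full, o_comp)]
        total += frobenius(delta) / (frobenius(o_full) + eps)
    return total


def normalize(errors):
    """Normalize per-layer errors so that they sum to one."""
    total = sum(errors)
    return [e / total for e in errors]


# Two toy layers, one decoding step each (illustrative numbers only).
full = [[[3.0, 4.0]]]
layer_a_comp = [[[3.0, 0.0]]]   # large deviation from the full-cache output
layer_b_comp = [[[3.0, 3.0]]]   # small deviation from the full-cache output

errors = [layer_error(full, layer_a_comp), layer_error(full, layer_b_comp)]
norm_errors = normalize(errors)
print(norm_errors)
```

The layer whose compressed output deviates more from the full\-cache output receives the larger normalized error and would therefore be prioritized when budgets are assigned\.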

Given the normalized per\-layer error scores $\tilde{e}$ and the total cache budget $B_{\mathrm{total}}$, we first assign a minimum allocation $m$ and a maximum allocation $M$ to each layer, so that no layer receives either no memory budget or an excessively large one\. The remaining budget is distributed in proportion to the error scores\.
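
One possible way to realize this allocation is sketched below\. The text does not spell out how budget above the cap is handled, so the redistribution loop is an assumption, as are the function name and the toy error scores; fractional budgets are left unrounded for brevity\. The bounds use the values given later in the implementation details, a minimum of 32 and a maximum of three times the average per\-layer budget\.

```python
def allocate_budgets(norm_errors, total_budget, min_budget, max_budget):
    """Give every layer min_budget, then split the rest in proportion to the
    normalized errors, capping each layer at max_budget and re-spreading the
    excess over layers that are still below the cap (assumed policy)."""
    num_layers = len(norm_errors)
    budgets = [float(min_budget)] * num_layers
    remaining = total_budget - min_budget * num_layers
    active = set(range(num_layers))             # layers still below the cap
    while remaining > 1e-9 and active:
        weight = sum(norm_errors[i] for i in active)
        if weight <= 0:
            break
        shares = {i: remaining * norm_errors[i] / weight for i in active}
        remaining = 0.0
        for i, share in shares.items():
            room = max_budget - budgets[i]
            given = min(share, room)
            budgets[i] += given
            remaining += share - given          # excess returns to the pool
            if given >= room:
                active.discard(i)
    return budgets


# Four toy layers with an average budget of 256 tokens per layer.
per_layer = 256
norm_errors = [0.90, 0.05, 0.03, 0.02]
budgets = allocate_budgets(norm_errors, total_budget=4 * per_layer,
                           min_budget=32, max_budget=3 * per_layer)
print(budgets)
```

In this example the first layer would receive more than the cap from a purely proportional split, so it is clipped at the maximum and the surplus is shared among the other layers, while the total stays at the overall budget\.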

## 4 Experiments

#### Baselines and Models\.

We compare CompressKV with six KV\-cache eviction baselines: StreamingLLM \[xiao2024efficientstreaminglanguagemodels\], SnapKV \[li2024snapkvllmknowslooking\], PyramidKV \[cai2025pyramidkvdynamickvcache\], CAKE \[qin2025cakecascadingadaptivekv\], HeadKV \[fu2024headsmatterheadlevelkv\], and AdaKV \[feng2025adakvoptimizingkvcache\]\. All methods are evaluated with greedy decoding on Llama\-3\.1\-8B\-Instruct \[grattafiori2024llama3herdmodels\], Mistral\-7B\-Instruct\-v0\.3 \[jiang2024clipdinovisualencoders\], Qwen2\.5\-14B\-Instruct, and Qwen2\.5\-32B\-Instruct \[qwen2025qwen25technicalreport\]\. We further examine the orthogonality of CompressKV in Section [4\.6](https://arxiv.org/html/2606.24467#S4.SS6), where we integrate it with head\-level allocation, prefilling acceleration, and KV\-cache quantization\.

#### Evaluating Tasks\.

To evaluate CompressKV’s performance under different memory budgets, we adopt two comprehensive benchmarks and one masking‑based ablation analysis: \(1\) LongBench \[bai2024longbenchbilingualmultitaskbenchmark\], which contains 16 long\-context subtasks across single\-document QA, multi\-document QA, summarization, few\-shot learning, synthetic tasks, and code completion; \(2\) Needle‑in‑a‑Haystack \(NIAH\) \[gkamradt2024llmtest\], which measures the retrieval of a target answer hidden in extended text; and \(3\) an ablation of retrieval head types, following \[wu2024retrievalheadmechanisticallyexplains\], where we selectively disable Semantic Retrieval Heads \(SRH\) and traditional Retrieval Heads \(TRH\) to quantify their contributions\. We also compare CompressKV with TRH vs\. SRH under equal per\-layer KV budgets, e\.g\., 256 tokens, and report the results separately\.

#### Implementation Details\.

We evaluate all methods under average per\-layer KV\-cache budgets, denoted as $B_{\mathrm{per\text{-}layer}}$, ranging from 128 to 2048 tokens\. Given a total KV\-cache budget $B_{\mathrm{total}}$ over $L$ transformer layers, $B_{\mathrm{per\text{-}layer}} = B_{\mathrm{total}} / L$ denotes the average budget assigned to each layer\. StreamingLLM and SnapKV use uniform layer\-wise budgets, whereas PyramidKV, CAKE, and CompressKV redistribute budgets across layers under the same total memory constraint\. HeadKV and AdaKV are applied at the GQA\-group granularity to respect grouped\-query attention\. For fairness, all methods evict tokens only during prefilling and use the same local attention setting as SnapKV \[li2024snapkvllmknowslooking\]: $\texttt{window\_size} = 8$ and $\texttt{kernel\_size} = 5$\.

For CompressKV, we select the top four SRHs per layer, identified offline once per model using the calibration data from \[wu2024retrievalheadmechanisticallyexplains\]\. Layer\-adaptive allocation is also computed offline by measuring normalized Frobenius\-norm reconstruction errors between compressed\-cache and full\-cache attention\-block outputs under minimal\-size KV compression on LongBench\. We constrain each layer’s budget to $[m, M]$, with $m = 32$ and $M = 3 \times B_{\mathrm{per\text{-}layer}}$, and allocate the remaining KV pairs proportionally to the normalized errors\. During inference, the precomputed SRH sets and layer\-wise budgets are fixed and reused for all samples, so CompressKV does not require online layer\-importance estimation or additional dynamic profiling during generation\.
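
As a worked instance of these settings, assume $L = 32$ transformer layers \(the depth of Llama\-3\.1\-8B\) and an average budget of 256 tokens per layer\. The symbol $B^{(l)}$ for the budget of layer $l$ is notation introduced here for illustration only\.

$$B_{\mathrm{total}} = L \cdot B_{\mathrm{per\text{-}layer}} = 32 \times 256 = 8192, \qquad m = 32, \qquad M = 3 \times 256 = 768,$$

$$\sum_{l=1}^{L} B^{(l)} = 8192, \qquad 32 \le B^{(l)} \le 768 .$$

Under these assumptions, a layer with a large eviction error can hold up to three times the average budget, while every layer keeps at least 32 tokens\.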

### 4\.1 Evaluation on LongBench Benchmark

Table [1](https://arxiv.org/html/2606.24467#S4.T1) reports average LongBench scores under two representative KV\-cache budgets: 256 tokens for tight memory settings and 1024 tokens for moderate compression\. CompressKV consistently achieves the highest average performance across model families and scales, ranging from Llama\-3\.1\-8B\-Instruct \(Llama\-3\.1\-8B\) and Mistral\-7B\-Instruct\-v0\.3 \(Mistral\-7B\) to the larger Qwen2\.5\-14B\-Instruct \(Qwen2\.5\-14B\) and Qwen2\.5\-32B\-Instruct \(Qwen2\.5\-32B\) models\. The gains are most pronounced under the tighter 256\-token budget, showing that CompressKV is especially effective when KV\-cache memory is severely constrained\. These results also indicate that the benefit of CompressKV generalizes from 7B/8B\-scale models to larger 14B/32B models\. As illustrated in Figure [4](https://arxiv.org/html/2606.24467#S4.F4), we further benchmark CompressKV on LongBench across KV\-cache sizes from 128 to 2048 using Llama\-3\.1\-8B\-Instruct and Mistral\-7B\-Instruct\-v0\.3\. CompressKV consistently outperforms all baselines across the full budget range, with the largest margins at small cache sizes\. These results show that SRH\-guided token selection and layer\-adaptive budget allocation improve the memory–performance trade\-off of long\-context LLM inference, especially under strict memory constraints\.

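Table 1: Average LongBench scores under fixed KV\-cache budgets\. FullKV is the uncompressed reference and does not depend on the budget, so its score is repeated in both budget columns of each model\. The Budget row indicates the average retained KV tokens per layer for compressed methods\. Llama\-3\.1\-8B and Mistral\-7B are the small\-scale LLMs; Qwen2\.5\-14B and Qwen2\.5\-32B are the large\-scale LLMs\.

| Method | Llama\-3\.1\-8B | Llama\-3\.1\-8B | Mistral\-7B | Mistral\-7B | Qwen2\.5\-14B | Qwen2\.5\-14B | Qwen2\.5\-32B | Qwen2\.5\-32B |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FullKV | 49\.08 | 49\.08 | 47\.82 | 47\.82 | 49\.80 | 49\.80 | 48\.57 | 48\.57 |
| Budget | 256 | 1024 | 256 | 1024 | 256 | 1024 | 256 | 1024 |
| StreamingLLM | 33\.92 | 36\.95 | 31\.22 | 34\.73 | 25\.86 | 29\.84 | 25\.28 | 29\.62 |
| SnapKV | 45\.21 | 47\.82 | 43\.76 | 46\.48 | 43\.77 | 48\.18 | 43\.36 | 47\.34 |
| PyramidKV | 44\.36 | 47\.65 | 43\.06 | 45\.96 | 42\.71 | 47\.70 | 42\.11 | 46\.98 |
| CAKE | 46\.30 | 47\.97 | 44\.73 | 46\.66 | 44\.70 | 48\.52 | 44\.49 | 47\.51 |
| HeadKV | 44\.11 | 47\.05 | 44\.10 | 46\.41 | 44\.21 | 48\.42 | 44\.02 | 47\.50 |
| AdaKV | 44\.45 | 47\.94 | 43\.75 | 46\.38 | 43\.68 | 48\.19 | 43\.30 | 47\.33 |
| CompressKV | 46\.71 | 48\.24 | 45\.43 | 46\.96 | 45\.37 | 48\.69 | 44\.73 | 47\.78 |

![Refer to caption](https://arxiv.org/html/2606.24467v1/experiment_results/longbench_cache_budget_broken_axis.png)Figure 4: Average performance on 16 LongBench datasets under varying KV\-cache budgets, compared with baseline methods\.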
### 4\.2 Evaluation on Needle In A Haystack

Figure [5](https://arxiv.org/html/2606.24467#S4.F5) presents average Needle\-in\-a\-Haystack performance across KV budgets for Llama\-3\.1\-8B\-Instruct \(8K–128K context\) and Mistral\-7B\-Instruct\-v0\.3 \(2K–32K\), showing that CompressKV consistently surpasses competing methods at every budget\. On Mistral\-7B\-Instruct\-v0\.3, CompressKV, HeadKV, and CAKE achieve nearly lossless compression with a KV budget as small as 256, highlighting their robustness\. On Llama\-3\.1\-8B\-Instruct, AdaKV and HeadKV underperform at low budgets, while CompressKV achieves nearly lossless performance at a 2048 KV budget \(5% of the full cache\) and still retains 90% of the original performance with a KV budget of only 256 \(0\.7% capacity\)\. Together with the LongBench evaluation, these results show that CompressKV preserves general LLM performance across diverse long\-context tasks while delivering efficient KV\-cache compression\.

![Refer to caption](https://arxiv.org/html/2606.24467v1/experiment_results/niah_llama_broken_mistral_normal.png)

Figure 5: Average performance on the NIAH benchmark under different KV\-cache budget settings, in comparison with baseline methods\.
### 4\.3Semantic Retrieval Heads: Causal Ablation and Head\-Agnostic Gains

Following the masking\-based causal test of \[wu2024retrievalheadmechanisticallyexplains\], we conduct targeted ablations on Mistral\-7B\-Instruct\-v0\.3\. Specifically, we mask the top\-$k$ heads \($k \in \{10, 20, 30\}$\) on the NIAH benchmark\. Table [2](https://arxiv.org/html/2606.24467#S4.T2)\(a\) reports the resulting performance drop and compares Semantic Retrieval Heads \(SRH\) against traditional Retrieval Heads \(TRH\)\. While masking TRH leads to only a minor degradation, masking even a small subset of Semantic Retrieval Heads yields a substantial drop in retrieval accuracy and markedly increases hallucinations, highlighting their critical role in faithful retrieval and localization of supporting evidence\.

CompressKV is compatible with heterogeneous head definitions\. Table [2](https://arxiv.org/html/2606.24467#S4.T2)\(b\) compares CompressKV using TRH vs\. SRH on Mistral\-7B\-Instruct\-v0\.3 under a fixed per\-layer KV budget of 256 tokens\. SRH yields a modest average gain over TRH \(\+0\.24\)\. Moreover, even with TRH and without dynamic budget allocation, CompressKV still surpasses most representative baselines \(Table [1](https://arxiv.org/html/2606.24467#S4.T1)\), evidencing more precise salient\-token selection\.

Table 2: Comparison between traditional Retrieval Heads \(TRH\) and Semantic Retrieval Heads \(SRH\)\. \(a\) Performance drop after masking top\-$k$ heads on NIAH\. \(b\) LongBench average score when CompressKV uses TRH or SRH under the same KV\-cache budget\. Darker cells in \(a\) indicate larger drops\.

\(a\) Causal masking on NIAH\.

| Heads | Top\-10 | Top\-20 | Top\-30 |
| --- | --- | --- | --- |
| TRH | 1\.02 | 7\.67 | 13\.30 |
| SRH | 24\.55 | 72\.56 | 73\.81 |

\(b\) LongBench average score\.

| Heads | Avg\. | Δ |
| --- | --- | --- |
| TRH | 44\.72 | 0\.00 |
| SRH | 44\.96 | \+0\.24 |

### 4\.4Memory and Latency under Long\-Context Scaling

We evaluate end\-to\-end latency, decoding latency, time to first token, peak GPU memory, and throughput on Llama\-3\.1\-8B\-Instruct with FlashAttention\-2 \[dao2023flashattention2fasterattentionbetter\] on a single NVIDIA A100, sweeping context length from 4K to 128K with a fixed generation length of 1024\. We compare CompressKV with a full\-cache baseline and six KV\-cache eviction methods under a 1024\-token KV budget \(except full cache\)\. Figure [6](https://arxiv.org/html/2606.24467#S4.F6) shows that end\-to\-end latency and time to first token increase with context length for all methods, while eviction\-based approaches keep decoding latency nearly constant; in contrast, full\-cache decoding latency grows with context length\. Under the fixed KV budget, eviction methods have similar peak memory, whereas full cache uses substantially more memory at long contexts\.
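For intuition on why full\-cache memory grows with context while a fixed budget stays flat, the following back\-of\-the\-envelope sketch estimates KV\-cache size\. The model shape \(32 layers, 8 KV heads, head dimension 128, 16\-bit values\) is an assumption taken from the public Llama\-3\.1\-8B configuration, not from the paper, and `kv_cache_bytes` is a hypothetical helper\. The estimate ignores model weights, activations, and the prefill\-time cache before eviction, so it does not reproduce the peak\-memory measurements in Figure 6\.

```python
def kv_cache_bytes(tokens_per_layer, num_layers=32, num_kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    """Bytes needed to store keys and values for `tokens_per_layer` tokens in every layer.

    Default shape values are assumptions based on the public Llama-3.1-8B config.
    """
    per_token_per_layer = 2 * num_kv_heads * head_dim * bytes_per_value  # K and V
    return tokens_per_layer * num_layers * per_token_per_layer


MIB = 1024 ** 2

for context in (4 * 1024, 32 * 1024, 128 * 1024):
    full = kv_cache_bytes(context)
    # After eviction, at most 1024 tokens are kept per layer.
    budget = kv_cache_bytes(min(context, 1024))
    print(f"{context:>7} tokens: full={full / MIB:8.0f} MiB, "
          f"budget-1024={budget / MIB:4.0f} MiB")
```

Under these assumptions the full cache grows from 512 MiB at 4K tokens to 16 GiB at 128K tokens, whereas a 1024\-token budget stays at 128 MiB after eviction regardless of context length\.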

![Refer to caption](https://arxiv.org/html/2606.24467v1/experiment_results/hardware_summary_pretty.png)

Figure 6: Comprehensive evaluation of inference efficiency on a single NVIDIA A100 GPU\.
### 4\.5Ablation Studies

To evaluate the contribution of each component of CompressKV, we conduct a series of ablation studies on the LongBench benchmark using Mistral\-7B\-Instruct\-v0\.3 with a fixed KV\-cache budget of 256\.

#### Token selection and layer\-wise cache allocation\.

We ablate SRH\-driven token selection and layer\-aware budget allocation in Table [3](https://arxiv.org/html/2606.24467#S4.T3)\(a\)\. Adding our selection to SnapKV improves accuracy from 43\.76 to 44\.96, and adding layer\-aware allocation raises it further to 45\.43, so the two components are complementary\.
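The exact allocation rule is not given in this section\. As a minimal sketch, assume each layer has an offline eviction\-error estimate and that the total budget is split in proportion to these errors, with a per\-layer floor, while the average per\-layer budget stays fixed\. The function name `allocate_layer_budgets`, the proportional rule, the `min_budget` floor, and the error values are all illustrative assumptions, not the authors' method\.

```python
def allocate_layer_budgets(errors, avg_budget, min_budget=32):
    """Split a total budget of len(errors) * avg_budget tokens across layers.

    Layers with a larger (hypothetical) offline eviction error receive more tokens.
    Assumes non-negative errors and min_budget <= avg_budget.
    """
    num_layers = len(errors)
    total = num_layers * avg_budget
    spare = total - num_layers * min_budget  # tokens left after the per-layer floor
    error_sum = sum(errors)
    if error_sum == 0:
        return [avg_budget] * num_layers
    budgets = [min_budget + int(spare * e / error_sum) for e in errors]
    # Hand out tokens lost to integer truncation, largest-error layers first.
    leftover = total - sum(budgets)
    order = sorted(range(num_layers), key=lambda i: errors[i], reverse=True)
    for i in order[:leftover]:
        budgets[i] += 1
    return budgets


errors = [0.40, 0.10, 0.30, 0.20]  # made-up per-layer eviction errors
budgets = allocate_layer_budgets(errors, avg_budget=256)
print(budgets, sum(budgets) / len(budgets))
```

The design point this illustrates is that the mean budget is unchanged, so a comparison at "budget 256" remains fair while layers that are more sensitive to eviction keep more tokens\.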

#### Number of selected heads per layer\.

We sweep the number of SRHs per layer from 2 to 24 \(Table [3](https://arxiv.org/html/2606.24467#S4.T3)\(b\)\)\. Accuracy peaks at 4 heads and saturates thereafter \(Top\-6: −0\.17; Top\-12: 0\.00\), with Top\-24 slightly worse \(−0\.66\); thus, 4 heads per layer suffice\.

Table 3: Component analysis on Mistral\-7B\-Instruct\-v0\.3 under a fixed KV\-cache budget of 256\.

\(a\) Contribution of each component\.

| Method | Acc\. \(%\) |
| --- | --- |
| SnapKV | 43\.76 |
| \+ SRH Selection | 44\.96 |
| \+ SRH \+ Layer Alloc\. | 45\.43 |

\(b\) Effect of the number of SRHs per layer\.

| SRHs per Layer | Mean Acc\. \(%\) | Δ vs\. Top\-4 |
| --- | --- | --- |
| Top\-2 | 44\.33 | −0\.63 |
| Top\-4 | 44\.96 | 0\.00 |
| Top\-6 | 44\.79 | −0\.17 |
| Top\-12 | 44\.96 | 0\.00 |
| Top\-24 | 44\.30 | −0\.66 |
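To make the head\-count ablation concrete, here is a minimal sketch of head\-restricted token scoring: attention weights from only the selected heads of a layer are summed per token, and the highest\-scoring tokens are retained\. How CompressKV obtains these attention weights \(for example, over which query window\) and whether it additionally pools scores or always keeps recent tokens is not specified in this section, so the input layout, the helper `select_tokens`, and the toy numbers are assumptions\.

```python
def select_tokens(attn, head_ids, budget):
    """Return sorted indices of the `budget` tokens with the highest summed attention.

    attn[h][t] is the (hypothetical) attention mass head h places on prompt token t;
    only heads listed in `head_ids` contribute to the score.
    """
    num_tokens = len(attn[0])
    scores = [sum(attn[h][t] for h in head_ids) for t in range(num_tokens)]
    ranked = sorted(range(num_tokens), key=lambda t: scores[t], reverse=True)
    return sorted(ranked[:budget])


# Toy layer with 3 heads and 6 prompt tokens.
attn = [
    [0.70, 0.05, 0.05, 0.05, 0.05, 0.10],  # head 0: mostly attends to the first token
    [0.10, 0.05, 0.50, 0.25, 0.05, 0.05],  # head 1: attends to mid-context evidence
    [0.05, 0.05, 0.40, 0.30, 0.05, 0.15],  # head 2: attends to mid-context evidence
]

print(select_tokens(attn, head_ids=[0, 1, 2], budget=2))  # scoring with all heads
print(select_tokens(attn, head_ids=[1, 2], budget=2))     # scoring with selected heads only
```

In this toy example, scoring with all heads keeps tokens 0 and 2, because the head dominated by the first token displaces token 3 from a budget of two; restricting the score to heads 1 and 2 keeps tokens 2 and 3\.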

### 4\.6Orthogonal to Prior Efficiency Methods

#### With Prefilling Acceleration\.

CompressKV can be integrated with prefilling\-stage accelerators such as MInference \[jiang2024minference\] and XAttention \[xu2025xattention\], as they target prefilling cost while CompressKV targets decoding\-stage KV\-cache memory\. We conduct the integration experiments on Mistral\-7B\-Instruct\-v0\.3 under a 2048\-token per\-layer KV\-cache budget\. As shown in Figure [7](https://arxiv.org/html/2606.24467#S4.F7), the combined variants maintain accuracy close to the prefilling\-only baselines while further reducing decoding memory\.

#### With KV\-Cache Quantization\.

CompressKV also complements KV\-cache quantization methods such as KIVI \[liu2024kivi\]: KIVI reduces KV precision, whereas CompressKV prunes less critical tokens while preserving full\-precision KV entries\. On Mistral\-7B\-Instruct\-v0\.3 with a 2048\-token per\-layer KV\-cache budget, Figure [7](https://arxiv.org/html/2606.24467#S4.F7) shows that 2\-bit KIVI slightly outperforms CompressKV at comparable memory usage, but degrades sharply under 1\-bit quantization\. In contrast, CompressKV remains robust, and combining it with 2\-bit KIVI further reduces KV memory to about 1\.6% of the 16\-bit full\-cache baseline while maintaining strong accuracy\.
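As a rough consistency check on the combined memory figure, KV memory relative to a 16\-bit full cache can be approximated as the retained\-token fraction times the bit\-width ratio\. This sketch ignores quantization metadata \(scales, zero points\) and any tokens a quantizer keeps in full precision, and the retained fraction below is back\-solved from the reported 1\.6% rather than taken from the paper; `relative_kv_memory` is a hypothetical helper\.

```python
def relative_kv_memory(retained_fraction, bits, full_bits=16):
    """Approximate KV memory as a fraction of the full-precision, full-length cache."""
    return retained_fraction * bits / full_bits


# 2-bit quantization alone keeps every token at 2/16 of the original precision.
print(relative_kv_memory(1.0, bits=2))

# Retained-token fraction implied by the reported ~1.6% when eviction
# and 2-bit storage are combined (an inferred value, not a reported one).
implied_fraction = 0.016 * 16 / 2
print(round(implied_fraction, 3))
```

The first value is 12\.5%, and the implied retained fraction is about 12\.8%, which is in line with the statement that 2\-bit KIVI and standalone CompressKV operate at comparable memory usage\.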

![Refer to caption](https://arxiv.org/html/2606.24467v1/experiment_results/integration_combined_horizontal_largefont.png)

Figure 7: Integration of CompressKV with existing efficiency techniques on Mistral\-7B\-Instruct\-v0\.3\. Left: integration with prefilling\-stage accelerators; the dashed line denotes standalone CompressKV\. Right: integration with KV\-cache quantization; the dashed line denotes 16\-bit FullKV\.

#### With Head\-Level Allocation\.

CompressKV can also be combined with head\-level budget allocation methods such as HeadKV \[fu2024headsmatterheadlevelkv\] and AdaKV \[feng2025adakvoptimizingkvcache\]\. We evaluate these integrations on Llama\-3\.1\-8B\-Instruct under different KV\-cache budgets\. Integrating our token selection with HeadKV yields HeadCompressKV, while combining our token selection with error\-aware layer\-wise allocation on AdaKV yields AdaCompressKV\. As shown in Figure [8](https://arxiv.org/html/2606.24467#S4.F8), both variants consistently improve performance across KV\-cache budgets, achieving gains of up to nearly 2 points on LongBench and 11 points on Needle\-in\-a\-Haystack under tight memory\.

![Refer to caption](https://arxiv.org/html/2606.24467v1/experiment_results/headlevel_combined_score_delta.png)

Figure 8: Integration of CompressKV with head\-level allocation methods on Llama\-3\.1\-8B\-Instruct\.

### 4\.7Head Visualization

In Figure [9](https://arxiv.org/html/2606.24467#S4.F9), we present a comparison between traditional Retrieval Heads and Semantic Retrieval Heads identified using Mistral\-7B\-Instruct\-v0\.3\. All scores are L1\-normalized across the attention head importance distributions\. Unlike traditional methods that require exact top\-$k$ attention hits, our approach aggregates scores over entire answer spans, capturing heads that contribute semantically relevant context even when they never achieve top\-1 attention for individual tokens\. For instance, as shown in Figure [9](https://arxiv.org/html/2606.24467#S4.F9), layers 0 and 1 of the Mistral model have zero scores for all heads using the traditional method, whereas our approach successfully identifies heads of lower yet meaningful importance\.
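The contrast between hit\-based and span\-aggregated scoring can be illustrated with a small sketch\. The precise scoring definitions belong to the method section and are not restated here, so the two functions below \(`hit_score`, `span_score`\), the use of top\-1 hits as the hit criterion, and the toy attention rows are simplified assumptions rather than the paper's formulas\.

```python
def hit_score(attn_rows, span):
    """Fraction of decoding steps whose top-1 attended token lies inside the answer span."""
    hits = 0
    for row in attn_rows:
        top = max(range(len(row)), key=lambda t: row[t])
        hits += top in span
    return hits / len(attn_rows)


def span_score(attn_rows, span):
    """Average attention mass placed anywhere inside the answer span."""
    return sum(sum(row[t] for t in span) for row in attn_rows) / len(attn_rows)


def l1_normalize(scores):
    """Scale scores so their absolute values sum to 1 (all zeros stay zero)."""
    total = sum(abs(s) for s in scores)
    return [s / total if total else 0.0 for s in scores]


span = {2, 3}  # positions of the answer tokens in a 5-token toy context
heads = {
    "head_a": [[0.1, 0.1, 0.6, 0.1, 0.1], [0.1, 0.1, 0.1, 0.6, 0.1]],  # sharp top-1 hits
    "head_b": [[0.4, 0.0, 0.3, 0.3, 0.0], [0.4, 0.0, 0.3, 0.3, 0.0]],  # mass spread over the span
}

hit = l1_normalize([hit_score(rows, span) for rows in heads.values()])
agg = l1_normalize([span_score(rows, span) for rows in heads.values()])
print("hit-based:      ", hit)
print("span-aggregated:", agg)
```

Here `head_b` never places its top\-1 attention inside the span, so its hit\-based score is zero, yet it still receives a sizable share of the span\-aggregated score\. This is the kind of head that a hit\-based criterion would miss\.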

![Refer to caption](https://arxiv.org/html/2606.24467v1/experiment_results/appendix/mistral_new.png)

Figure 9: Head visualization for Mistral\-7B\-Instruct\-v0\.3\. Left: traditional Retrieval Heads\. Right: Semantic Retrieval Heads\.

## 5Conclusion

We presented CompressKV, a KV\-cache compression framework for GQA\-based LLMs that improves the resource–performance trade\-off of long\-context inference\. CompressKV identifies Semantic Retrieval Heads to avoid streaming\-head\-dominated token eviction and uses offline layer\-wise eviction errors to allocate cache budgets adaptively across layers\. Experiments on LongBench and Needle\-in\-a\-Haystack across multiple models and cache budgets show that CompressKV consistently preserves accuracy under tight KV\-cache memory constraints\. These results demonstrate that retrieval\-aware token selection and error\-aware allocation provide an effective path toward more memory\-efficient and sustainable long\-context LLM inference\.

## Acknowledgement

This work is funded by the European Union \- European Research Council \(ERC\) Starting Grant \- Project\-ID 101219243\. Views and opinions expressed are however those of the author\(s\) only and do not necessarily reflect those of the European Union or the European Research Council\. Neither the European Union nor the granting authority can be held responsible for them\.

## Declaration on Generative AI

During the preparation of this work, the authors used ChatGPT \(OpenAI\) to assist with grammar improvement, language polishing, and manuscript editing\. After using this tool, the authors carefully reviewed and revised all content and take full responsibility for the accuracy, originality, and integrity of the publication\.

## References