HARD-KV: Head-Adaptive Regularization for Decoding-time KV Compression
Summary
Hard-KV introduces a Cascade Cache hierarchy and Logits Calibration mechanism to resolve the static-dynamic mismatch in head-adaptive KV cache compression, achieving up to 2x throughput improvement in long-context LLM inference.
View Cached Full Text
Cached at: 06/30/26, 05:29 AM
# Hard-KV: Head-Adaptive Regularization for Decoding-time KV Compression
Source: [https://arxiv.org/html/2606.28831](https://arxiv.org/html/2606.28831)
Feiyang RenBowen ZengDalin ZhangJinpeng ChenGang ChenHuan Li
###### Abstract
Long\-context LLM inference faces a fundamental conflict: head\-adaptive compression algorithms \(e\.g\., Top\-ppnucleus sampling\) offer superior accuracy by dynamically fluctuating memory budgets, yet modern inference engines \(e\.g\., vLLM\) demand rigid, static memory patterns to leverage CUDA Graphs and PagedAttention\. We resolve this “Static\-Dynamic” mismatch withHard\-KV, a unified framework that that bridges dynamic selection with rigid system constraints\.Hard\-KVintroduces a Cascade Cache hierarchy, managing the token lifecycle across dense, sparse, and condensed tiers\. Crucially, we propose a Logits Calibration mechanism that normalizes diverse importance metrics into a unified probability space, enabling consistent Top\-ppbudgeting across heterogeneous heads\. To bridge the efficiency gap, we offer a system\-level solution, which rewrites fragmented, dynamic indices into contiguous physical layouts compatible with high\-performance inference engine\. Extensive experiments on math\-reasoning benchmarks \(AIME, U\-Math\) verify thatHard\-KVachieves up to 2×\\timesthroughput improvement over static baselines while maintaining high\-fidelity generation in 10k\+ token scenarios\. Code is available at[https://github\.com/SuDIS\-ZJU/HARDInfer](https://github.com/SuDIS-ZJU/HARDInfer)\.
Machine Learning, ICML
## 1Introduction
The deployment of Large Language Models \(LLMs\) in long\-context scenarios is increasingly constrained by the linear growth of Key\-Value \(KV\) cache memory\. As sequence lengths extend into the tens of thousands, the memory footprint of the KV cache frequently exceeds that of the model weights, creating a critical bottleneck for throughput and latency\. To mitigate this, decoding\-time compression methods\(Liaoet al\.,[2025](https://arxiv.org/html/2606.28831#bib.bib32); Wuet al\.,[2025](https://arxiv.org/html/2606.28831#bib.bib34); Behnamet al\.,[2025](https://arxiv.org/html/2606.28831#bib.bib21); Joet al\.,[2025](https://arxiv.org/html/2606.28831#bib.bib51); Zenget al\.,[2025](https://arxiv.org/html/2606.28831#bib.bib58); Zhanget al\.,[2025a](https://arxiv.org/html/2606.28831#bib.bib61)\)have emerged, dynamically reducing memory usage by retaining only the most critical tokens during generation\.
A central tension in this domain lies between algorithmic accuracy and system efficiency\. Recent research\(Fuet al\.,[2024](https://arxiv.org/html/2606.28831#bib.bib5); Xiaoet al\.,[2024](https://arxiv.org/html/2606.28831#bib.bib35); Tanget al\.,[2024](https://arxiv.org/html/2606.28831#bib.bib36)\)indicates that “head\-adaptive” strategies—which allow cache size to fluctuate based on the specific information density of each attention head—offer superior performance\. This accuracy is crucial for extending compression to decoding time, particularly for tasks like mathematical reasoning where errors accumulate over long generation trajectories\(Chenet al\.,[2025a](https://arxiv.org/html/2606.28831#bib.bib24)\)\.Linet al\.\([2025](https://arxiv.org/html/2606.28831#bib.bib8)\)argue that the superiority of head\-adaptive budget allocation stems from the intrinsic dynamism of attention, demonstrating that simple\-yet\-effective Top\-pp\(nucleus sampling\(Holtzmanet al\.,[2019](https://arxiv.org/html/2606.28831#bib.bib25)\)\) is sufficient for selection\. Similarly, other works\(Xuanfan Niet al\.,[2025](https://arxiv.org/html/2606.28831#bib.bib38); Fenget al\.,[2025b](https://arxiv.org/html/2606.28831#bib.bib40),[a](https://arxiv.org/html/2606.28831#bib.bib41)\)propose new criteria for performance\-prioritized compression that rely on dynamic allocation\.
However, this dynamic nature is fundamentally incompatible with modern inference engines\. The uneven, unpredictable, and fragmented memory indices generated by head\-adaptive selection conflict with standard optimizations such as PagedAttention\(Kwonet al\.,[2023b](https://arxiv.org/html/2606.28831#bib.bib12)\), continuous batching\(Yuet al\.,[2022](https://arxiv.org/html/2606.28831#bib.bib19)\), and CUDA Graphs\(NVIDIA Corporation,[2024](https://arxiv.org/html/2606.28831#bib.bib16)\), all of which rely on*static*or*regularized*memory block structures\.
Challenge:We characterize the core conflict as the gap between*Dynamic, Context\-Dependent Attention Patterns*and*Static, Request\-Independent System Designs*\.
To bridge this gap, we proposeHard\-KV\(Head\-AdaptiveRegularization forDecoding\-timeKVCompression\), a framework designed to enable dynamic cache allocation within the constraints of modern inference architectures\.
To bridge this gap,Hard\-KVintroduces a Cascade Cache architecture that organizes KV memory into three sequence\-level tiers\. We further propose a Logits Calibration procedure that converts Top\-kkselections into Top\-pp\-aligned counterparts, allowing diverse KV selection strategies to share a unified head\-adaptive budget allocation mechanism\. This calibration produces cache patterns that better follow the raw attention distribution, making them naturally compatible with the three\-tier cache\. Finally,Hard\-KVregularizes dynamic head\-wise indices through a customized inference system that combines memory rewriting and sparse loading\. This design reconciles adaptive token selection with the structural constraints of modern inference engines, preserving algorithmic accuracy while improving system efficiency\. On math\-reasoning benchmarks, including AIME and U\-MATH,Hard\-KVachieves up to a 1\.5×\\timesperformance gain and a 2×\\timesthroughput improvement\.
To summarize, our work unifies head\-adaptive KV cache management into a robust system that is both algorithmically performant and system\-efficient We summarize our technical contributions as follows:
1. 1\.Unified Cascade Cache Architecture:We introduce a multi\-tier memory hierarchy that standardizes the token lifecycle\. This structure accommodates various KV selection methods, unifying diverse compression techniques into a single pipeline\. \(Section[3\.1](https://arxiv.org/html/2606.28831#S3.SS1)\)
2. 2\.Logits Calibration for Effective Top\-ppBudgeting:We propose a calibration method connecting Top\-kkand Top\-ppselection\. By mapping distribution alterations onto the raw attention distribution, we enable Top\-ppbudgeting to dynamically adapt to attention patterns rather than relying on fixed limits\. \(Section[3\.2](https://arxiv.org/html/2606.28831#S3.SS2)\)
3. 3\.System\-Aware Index Regularization:We provide a method to integrate dynamic compression into rigid system architectures\. Our regularization for head\-adaptive indices resolves compatibility issues with CUDA Graphs, ensuring high system efficiency with minimal metadata overhead\.\(Section[3\.3](https://arxiv.org/html/2606.28831#S3.SS3)\)
## 2Related Work
##### Adaptive KV Cache Compression
KV cache compression\(Liet al\.,[2024](https://arxiv.org/html/2606.28831#bib.bib1)\)optimizes LLM inference SLOs \(e\.g\., latency, throughput\) by reducing memory footprints during attention computation\. Unlike Efficient Attention methods\(Yanget al\.,[2024](https://arxiv.org/html/2606.28831#bib.bib2); Ainslieet al\.,[2023](https://arxiv.org/html/2606.28831#bib.bib3); Fuet al\.,[2025](https://arxiv.org/html/2606.28831#bib.bib4)\)designed for full\-stack industrial deployment, KV cache compression has largely remained a research testbed, lacking robust integration with modern inference engines\.
Recent studies\(Fuet al\.,[2024](https://arxiv.org/html/2606.28831#bib.bib5); Fenget al\.,[2024](https://arxiv.org/html/2606.28831#bib.bib7); Duet al\.,[2025](https://arxiv.org/html/2606.28831#bib.bib6); Zenget al\.,[2024](https://arxiv.org/html/2606.28831#bib.bib62)\)have pivoted from uniform compression to*head\-adaptive strategies*, driven by the observation of attention heterogeneity\. Methods like AdaKV\(Fenget al\.,[2024](https://arxiv.org/html/2606.28831#bib.bib7)\)and HeadKV\(Fuet al\.,[2024](https://arxiv.org/html/2606.28831#bib.bib5)\)implement head\-wise allocation based on metrics like eviction loss and importance scores\. Similarly,Duet al\.\([2025](https://arxiv.org/html/2606.28831#bib.bib6)\)use policy optimization to mix full and sparse attention, showing that heterogeneous configurations outperform uniform baselines\.
Concurrently, approaches like Twilight\(Linet al\.,[2025](https://arxiv.org/html/2606.28831#bib.bib8)\)and MagicPig\(Chenet al\.,[2024b](https://arxiv.org/html/2606.28831#bib.bib9)\)have shown that Top\-pp\(cumulative probability\) selection yields higher fidelity than rigid Top\-kk\(fixed count\) truncation\. Despite the superiority of Top\-ppselection in ideal scenarios, robustly integrating these dynamic strategies remains a challenge: existing methods often decouple sampling from memory management, while diverse importance metrics \(e\.g\., L2 norm vs\. attention scores\) defy unified probabilistic budgeting\.
Hard\-KVbridges this gap with a*logits calibration*mechanism that normalizes diverse selection metrics into a unified probability space, enabling consistent Top\-ppbudgeting across heterogeneous heads while enforcing strict, system\-compatible regularity on the resulting memory layout\.
##### System\-Algorithm Co\-Design in LLM Inference
State\-of\-the\-art inference engines maximize throughput by enforcing rigid, predictable execution patterns\. Innovations such as RadixAttention\(Zhenget al\.,[2024](https://arxiv.org/html/2606.28831#bib.bib14)\), Tree\-Attention\(Caiet al\.,[2024](https://arxiv.org/html/2606.28831#bib.bib26)\), and static load\-balancing\(Liuet al\.,[2024](https://arxiv.org/html/2606.28831#bib.bib27)\)demonstrate the efficacy of co\-designing algorithms with static system constraints\.
Likewise, deploying head\-adaptive KV compression introduces an algorithm\-system conflict: the memory allocations*vary*at every step, violating the static assumptions of engines built for regular allocation\. Although head\-wise sparse attention is computationally feasible via kernel support like FlashInfer\(Yeet al\.,[2025](https://arxiv.org/html/2606.28831#bib.bib28)\), we identify three systemic barriers that prevent theoretical compression rates from actual speedups:
*1\) CUDA Graphs Incompatibility:*CUDA Graphs rely on static tensor shapes and persistent pointers\. The variable cache sizes produced by Top\-ppselection force the system to either frequently recapture graphs or revert to eager execution, neutralizing latency gains\.
*2\) PagedAttention Fragmentation:*Current paging mechanisms in real\-time LLM serving\(Yuet al\.,[2022](https://arxiv.org/html/2606.28831#bib.bib19); Kwonet al\.,[2023b](https://arxiv.org/html/2606.28831#bib.bib12)\)assume uniform memory distribution\. Head\-adaptive eviction shatters this uniformity, causing severe fragmentation that breaks the block contiguity required for efficient virtual\-to\-physical mapping\.
*3\) KV Selection Overhead:*Standard KV selection requires computing𝒪\(n2\)\\mathcal\{O\}\(n^\{2\}\)attention maps, which is incompatible with FlashAttention\(Dao,[2023](https://arxiv.org/html/2606.28831#bib.bib42)\)\. Such a cost introduces significant decoding overhead, stalling token generation\.
Unlike prefill\-phase compression\(Liet al\.,[2024](https://arxiv.org/html/2606.28831#bib.bib1); Chenet al\.,[2024a](https://arxiv.org/html/2606.28831#bib.bib63)\), which occurs only once, our work targets periodic compression during the latency\-critical*decoding*phase\. Proceeding to existing works\(Behnamet al\.,[2025](https://arxiv.org/html/2606.28831#bib.bib21); Wuet al\.,[2025](https://arxiv.org/html/2606.28831#bib.bib34); Joet al\.,[2025](https://arxiv.org/html/2606.28831#bib.bib51)\), we consider it crucial to maintain CUDA Graphs compatibility over tens of thousands of decoding steps in long\-context generation\. We address this by integrating head\-adaptive compression into a custom engine built onNano\-vLLM\(Yu Xingkai,[2025](https://arxiv.org/html/2606.28831#bib.bib17)\)\. While prior work, such as DiffKV\(Zhanget al\.,[2025b](https://arxiv.org/html/2606.28831#bib.bib39)\), manages dynamic head\-wise memory allocation via bidirectional page tables and GPU\-side circular lists, we take a different approach\. We focus on algorithm\-system integration, proposing a novel hybrid of sparse and condensed formats to systematically tame head adaptivity\.
## 3TheHard\-KVDesign
### 3\.1The Framework Overview
Figure 1:Left: The*Cascade Cache*hierarchy manages the token lifecycle across three storage tiers: Dense \(recent\), Sparse \(adaptive\), and Condensed \(archival\)\.Right: TheHard\-KVexecution pipeline\. \(Step 1\) The KV cache grow in context; \(Step 2\) When compression is triggered, different selection methods will be calibrated towards real attention distribution and conduct dynamic Top\-ppsampling; \(Step 3\) Upon the sparse mask generated in the previous step, the KV indices will be condensed by rewriting; \(Step 4\) The KV cache can be further reduces to satisfy hard constraints by eviction or merging\.##### The Cascade Cache Hierarchy
Standard self\-attention suffers from linear memory complexity𝒪\(n\)\\mathcal\{O\}\(n\)\. While prior works have explored either*eviction*\(maintaining a fixed budget\) or*sparse retrieval*\(dynamic loading\), these approaches often treat KV memory as a monolithic structure\. We proposeCascade Cache, a hierarchical architecture that unifies these paradigms by managing the KV lifecycle across three distinct tiers \(Figure[1](https://arxiv.org/html/2606.28831#S3.F1), Left\):
- •Tier 1: Dense Cache\.Buffers the most recent tokens in contiguous memory\. This ensures full\-fidelity attention for the local window, preserving “attention sinks”\(Xiaoet al\.,[2023](https://arxiv.org/html/2606.28831#bib.bib10)\)and immediate syntactic dependencies\.
- •Tier 2: Sparse Cache\.As tokens age out of the dense window, they transition to the Sparse Cache\. We apply*Calibrated Top\-ppSelection*\(Section[3\.2](https://arxiv.org/html/2606.28831#S3.SS2)\) to retain only information\-dense pairs, effectively implementing head\-adaptive budgeting based on attention entropy\.
- •Tier 3: Condensed Cache\.To enforce hard memory limits, the oldest blocks are further compressed into constant\-sized storage via eviction or weighted merging\.
##### The Continuous Compression Pipeline
Building upon the Cascade Cache architecture,Hard\-KVorchestrates the KV cache lifecycle via an asynchronous four\-step pipeline \(Figure[1](https://arxiv.org/html/2606.28831#S3.F1), Right\)\. We implement this within a customNano\-vLLMengine\(Yu Xingkai,[2025](https://arxiv.org/html/2606.28831#bib.bib17)\)to ensure strict compatibility withcontinuous batchingandCUDA Graphs\(see details in Appendix[A](https://arxiv.org/html/2606.28831#A1)\)\.*Step 1*: The decoding requests linearly append KV pairs to the Tier 1 Dense Cache, utilizing efficient append\-only kernels\.*Step 2*: When the dense window reaches a predefined threshold, the*Logits Calibration*mechanism activates\. It normalizes diverse importance metrics into a unified probability space, generating a head\-adaptive sparse mask via Top\-ppsampling \(details in Section[3\.2](https://arxiv.org/html/2606.28831#S3.SS2)\)\.*Step 3*: To resolve the static\-dynamic mismatch, the*Index Regularization*engine rewrites the discontinuous selected tokens into a regularized, contiguous physical layout\.*Step 4*: Finally, to maximize memory utilization, the oldest blocks are compacted via eviction or merging before new dense blocks are allocated\.
We adapt the block table to the head\-wise format with selection masks indicating effective cache\. A metadata management system is coupled to support head\-adaptive dynamic budget\. This system is critical for enabling high\-throughput CUDA Graphs execution and preventing memory fragmentation \(details in Section[3\.3](https://arxiv.org/html/2606.28831#S3.SS3)\)\.
This pipeline can operate asynchronously; the Dense Cache handles immediate writes at the end of the sequence; meanwhile, compression \(Steps 2\-4\) operates on established history, minimizing latency impact\.
With the efficient Cascade Attention \(see Appendix[F](https://arxiv.org/html/2606.28831#A6)\) kernel implementation provided by Flashinfer\(Yeet al\.,[2025](https://arxiv.org/html/2606.28831#bib.bib28)\),Hard\-KVcan compute the exact full\-sequence attention from the partial attention outputs of three tiers in the proposed Cascade Cache\.
##### Cautious Update Strategy
A critical refinement in the Tier 1→\\rightarrowTier 2 transition is the*Cautious Update*strategy\. Unlike methods that compress based on a single snapshot \(e\.g\., the mean average attention in the last window\), we treat the dense window traversal as a “voting” period\. We maintain a cumulative selection mask from*all*queries in the window and only compress a block after it fully exits the dense tier\. For a strict constraint on the budget, we use ranking on the votes for selecting candidates to preserve\. This prevents the premature eviction of tokens that are temporarily irrelevant to the current step but critical for the local context\.
### 3\.2Logits Calibration for Effective Top\-ppBudgeting
This section proposes a unified framework to normalize heterogeneous methods, facilitating fair evaluation and control via*a single hyper\-parameterpp*\.
##### Unified Top\-P Selection


Figure 2:Comparison of the log number of selected tokens \(yy\-axis\) under P90\. Solid lines \(“—”\) represent the upper bound, while dotted lines \(“\- \- \-”\) represent the lower bound\.Left:Selections based on raw logits\.Right:Selections based on max\-pooled logits \(SnapKV\)\.Existing Key\-Value \(KV\) selection methods typically encode inductive biases, such as attention locality and token redundancy, directly into attention scores\. While such heuristics are effective for static Top\-kkselection, they are often misaligned with the inherently context\-dependent nature of Top\-ppbudgeting\.
Ideally, a fixed probability thresholdppshould produce a consistent retained\-token budget across different selection methods\. However, we observe that directly normalizing logits modified by different priors can cause their selection patterns to deviate substantially from those induced by the raw logits\. We refer to this phenomenon as*selection collapse*\(Figure[2](https://arxiv.org/html/2606.28831#S3.F2)\): a sharp reduction in the retained\-token budget under the same probability thresholdpp\. As shown in the figure, compared with the raw attention baseline, applying a selection method such asSnapKVreduces the retained budget from roughly1000\(≈e7\)1000~\(\\approx e^\{7\}\)tokens to only10\(≈e3\)10~\(\\approx e^\{3\}\)tokens under the samepp\-density at P90\.
To address this issue, we first unify diverse KV selection operations into a three\-step formulation, which allows most existing selection methods to be adapted to a Top\-ppbudget; details are provided in Appendix[E](https://arxiv.org/html/2606.28831#A5)\.
We address this by first unifying diverse operations into a three\-step framework capable of adapting most selection methods to a Top\-ppbudget \(refer to Appendix[E](https://arxiv.org/html/2606.28831#A5)\)\.
Furthermore, we propose an*order\-preserving*calibration method to facilitate Top\-ppsampling without compromising the underlying selection strategies\. This approach maps distributions from methods likeSnapKVandR\-KVto a normalized softmax distribution\. Crucially, it preserves the\(k,p\)\(k,p\)mapping \(i\.e\., linking Top\-kksets to cumulative probabilitiespp\) by adjusting the temperature of the attention scores\.
###### Problem 1\.
Given a thresholdpp, let Top\-ppsampling over\{Si\}\\\{S\_\{i\}\\\}yieldkktokens for a specific head\. The probability mass for thesekkattention scores is denoted byMk\(S\)=∑i∈Top\-kSiM\_\{k\}\(S\)=\\sum\_\{i\\in\\text\{Top\-\}k\}S\_\{i\}\. We aim to minimize the error between the probability mass of the raw attention scores and the shifted scores by adjusting the temperatureTT:
minT‖Mk\(S\(1T\)\)−Mk\(S^\(1T\)\)‖\.\\min\_\{T\}\\left\\\|M\_\{k\}\\left\(S\\left\(\{\\frac\{1\}\{T\}\}\\right\)\\right\)\-M\_\{k\}\\left\(\\hat\{S\}\\left\(\{\\frac\{1\}\{T\}\}\\right\)\\right\)\\right\\\|\.
In the standard setup, temperature scaling is defined asS\(1T\)=S1TS\(\\frac\{1\}\{T\}\)=S^\{\\frac\{1\}\{T\}\}\. However, for methods whereS^\\hat\{S\}is not guaranteed to be a valid probability distribution, directly computingS^1T\\hat\{S\}^\{\\frac\{1\}\{T\}\}renders the problem intractable\. Therefore, we extend the definition of temperature control to operate on the altered scores \(i\.e\.,S^\\hat\{S\}\) rather than the raw logits\.
Under the novel setup, we still expect to guarantee the invariance of the sampling results for the original methods\.
###### Constraint 1\.
Order\-invariance: Let the distribution alteration on the raw logitszzbe formalized by a functiong\(z\)g\(z\)\. We constructS^\(g\(z\),T\)\\hat\{S\}\(g\(z\),T\)such thatS^\|k≡S^\(g\(z\),T\)\|k\\hat\{S\}\|\_\{k\}\\equiv\\hat\{S\}\(g\(z\),T\)\|\_\{k\}, whereS\|kS\|\_\{k\}denotes the Top\-kksubset ofSS\.
Under this definition,S^\(⋅\)\\hat\{S\}\(\\cdot\)can be any function that simplifies the solution to Problem[1](https://arxiv.org/html/2606.28831#Thmproblem1)while retaining the intuitive properties of “temperature”\. We investigate the following two scenarios:
###### Solution 1\.
Ifg\(z\)g\(z\)preserves the order ofS^\\hat\{S\}, we may optionally defineS^=S\(g\(z/T\)\)\\hat\{S\}=S\(g\(z/T\)\)\.
Examples of order\-preserving \(See Condition[1](https://arxiv.org/html/2606.28831#Thmcondition1)in Appendix[D\.2](https://arxiv.org/html/2606.28831#A4.SS2)\) ofg\(z\)g\(z\)include max\-pooling inSnapKVand RocketKV, key quantization in Twilight, and vanilla attention averaging in H2O\.
###### Solution 2\.
Ifg\(z\)g\(z\)is arbitrary, we defineS^=S\(\(g\(z\)/T\)\\hat\{S\}=S\(\(g\(z\)/T\)\.
Solution[2](https://arxiv.org/html/2606.28831#ThmSolution2)applies to methods such as the pairwise scoring between keys inR\-KVand KeyDiff\(Parket al\.,[2025](https://arxiv.org/html/2606.28831#bib.bib43)\)\.
In both scenarios, we identify the optimal temperatureTTfor each KV head using either binary search \(zero\-order optimization\) or gradient descent \(first\-order optimization\)\. These approaches yield a tempered attention distribution that satisfies Constraint[1](https://arxiv.org/html/2606.28831#Thmconstraint1)while achieving near\-optimal Top\-kkmass preservation error, denoted by‖Mk\(S\)−Mk\(S^\)‖\\\|M\_\{k\}\(S\)\-M\_\{k\}\(\\hat\{S\}\)\\\|\. The complete algorithm and associated proofs are detailed in Appendix[D](https://arxiv.org/html/2606.28831#A4)\.
This*Calibration*enables the fair evaluation on Top\-pppatterns for methods originally designed exclusively for Top\-kkselection\.
\(a\)The P60 and P90\-P60 token selection heatmap from raw attention scores in layer 6 averaged on heads\.
\(b\)The P60 and P90\-P60 token selection heatmap from maxpool\-ed scores \(SnapKV\) in layer 6 averaged on heads\.
\(c\)The P60 and P90\-P60 token selection heatmap from similarity enhanced scores \(R\-KV\) in layer 6 averaged on heads\.
Figure 3:Three different patterns observed in different resolutions of the selection heatmaps, corresponding to three tiers in Cascade Cache\. The figure is generated by stacking binary attention selection maps of all KV heads, whereREDrepresents for selected andTEALrepresents for non\-selected\.
##### Unified Patterns in Top\-ppSampling
Following calibration, we can precisely evaluate the effectiveness of Top\-ppsampling on two representative methods originally designed for alternative selection criteria:
- •SnapKV: Applies max\-pooling over the sequence and selects a Top\-kkbudget based on the pooled attention scores\.
- •R\-KV: Computes the cosine similarity between keys and prunes based on a fixed similarity threshold\. The Top\-kkselection is derived from a weighted sum of attention and similarity scores\.
We analyze the effectiveness of Top\-ppsampling by partitioning the selection results into two distinct categories:low\-resolutionandhigh\-resolution\.
Low\-resolutionrefers to the subset of tokens selected under a lower probability thresholdpp\(e\.g\., P60\)\. Selection results in this regime typically concentrate on time\-invariant tokens, exhibiting vertical patterns\.
High\-resolutionis defined as the incremental subset of tokens required to reach a higher thresholdpp\. Unlike the low\-resolution set, high\-resolution selection results are often dispersed and exhibit turbulent distribution as the sequence length increases, forming horizontal patterns\. These patterns frequently demonstrate periodic or unpredictable retention of tokens that were not selected in recent decoding steps\.
In both regimes, we observe an accumulation of attention scores within the sliding window, manifesting as sloping patterns\. This observation reinforces the common practice of maintaining a local window for the most recent tokens\.
With calibration, different methods can yield similar densities in the heatmaps in Figure[3](https://arxiv.org/html/2606.28831#S3.F3)under a unifiedppmetric\. By aligning the Top\-ppbudget with constraints on probability mass, we observe comparable patterns across methods, albeit with varying emphases\. This alignment framework further supports the development of compression methods that preserve the fundamental properties of attention computation \(Appendix[G](https://arxiv.org/html/2606.28831#A7)\)\.
### 3\.3A System View of KV Index Management
This section introduces how to unify the layer\-wise varying KV indices into a*fixed buffer*storing indices for different heads in support for CUDA Graphs\.
#### 3\.3\.1Regularization of Head\-Adaptive KV Indices
Figure 4:Left:The process of KV cache indexing in PagedAttention\. The inference engine uses batch\-level block table to collect indices in global\-level block pool that point to physical KV blocks\. Please refer to Appendix[B\.1](https://arxiv.org/html/2606.28831#A2.SS1)for further details\.Right:Three techniques to regularize head\-adaptive KV blocks\.The above head\-adaptive Top\-ppselection for KV Cache will incur challenges for inference systems\. To ensure compatibility with CUDA Graphs, the indices and offsets for KV cache pages \(or blocks\) must be fixed across all layers\.
To this end, we explore combinations of the following three index regularization operations:
1. 1\.*Rewrite Cache\.*The most direct approach is to read the selected blocks and write them into a regularized, contiguous range\.
2. 2\.*Sparse Loading\.*Hardware\-efficient attention masks are applied during kernel execution for layer\-wise dynamic loading\. By enabling masking, indices can be regularized at the cost of redundant KV cache loading\.
3. 3\.*Flatten Index\.*Indices are flattened into a 1\-dimensional array\. Using this technique, we can regularize per\-layer block indices to either the maximum length \(i\.e\.,Max, yielding higher utilization of the loaded KV cache\) or the sum over lengths \(i\.e\.,Sum, requiring fewer rewrite operations\)\.
Based on the constraints imposed on KV cache index layouts, we categorize the design space into four tasks\. Detailed configurations and corresponding experiments are provided in Appendix[B\.2](https://arxiv.org/html/2606.28831#A2.SS2)\.
We focus here on the most flexible and systematically challenging setting,Head\-wise Allocation \(HA\), and empirically compare strategies constructed from the three index regularization operations above\.
Figure 5:The latency evaluation for solutions in the Head\-wise Allocation \(HA\) task\. See Appendix[B\.2](https://arxiv.org/html/2606.28831#A2.SS2)for ablations of other tasks\.Task: Head\-wise Allocation \(HA\)\.Different heads have different budgets, and independently selected indices\.
Solutions:HA\-Sparse\- vanilla sparse loading for head\-flattened indices; \- rewrite KV Cache to the maximum number of allocated blocks, with often higher preparation operation and higher computation efficiency in longer context\.
As shown in Figure[5](https://arxiv.org/html/2606.28831#S3.F5),HA\-Sparsealways retains a lower cost at preparation stage and is comparably efficiency when the sequence length is limited;HA\-Maxoften incurs higher preparation latency due to the rewrite operation\. However, when the sequence extends to over 10 thousand,HA\-Maxcan cut the computations latency in half\. The above analysis demonstrates the necessity in rewriting KV cache into condensed blocks to maximize throughput, especially in long decoding\.
In our system design, we set two budget threshold ofLlowerL\_\{lower\}andLupperL\_\{upper\}\. When exceedingLlowerL\_\{lower\}, the system will trigger selections for sparse loading; when exceedingLupperL\_\{upper\}, the system will rewrite and organize KV cache layouts in contiguous space\.
#### 3\.3\.2Principles for Smart Index Management
We present three governing principles \(P1–P3\) for management and regularization of sparse KV Cache metadata\.
P1: Joint optimization of memory occupancy and latency\.
Cache rewriting typically incurs a higher preparation cost than sparse loading, due to additional bandwidth pressure and memory movement during the rewrite\. However, this upfront cost can be amortized by the resulting reduction in per\-step decoding latency, thereby improving compute\-side efficiency\. We characterize this trade\-off using a*Memory\-Time Integral*metric, which jointly accounts for memory occupancy and execution time and thus provides a clearer view of system\-level throughput\. A detailed discussion is provided in Appendix[B\.2](https://arxiv.org/html/2606.28831#A2.SS2)\.
P2: Head\-wise allocation for block indices\.
Figure 6:Flattened Head\-wise block table\.A practical and efficient solution for fine\-grained KV cache allocation \(head\-wise allocation\) is to revise the global or batch\-level block table metadata \(See Appendix[B\.2](https://arxiv.org/html/2606.28831#A2.SS2)\)\. Unlike existing methods that employ index mapping across all layers and heads, our design utilizes head\-independent indices for physical cache regularized for all layers\. Crucially, this data structure supports CUDA Graphs, capturing kernel executions throughout the model’s depth\.
P3: Sparsity\-agnostic management\.
Two factors are critical in the latency breakdown for taskHA: overall sparsity and maximum sparsity \(previously set for 12\.5% and 50%, respectively\)\. When holding overall sparsity constant, the maximum sparsity
Figure 7:Latency breakdown by max sparsity per head and overall sparsity\.dictates the load budget required for the attention computation\. Conversely, given a fixed maximum sparsity, the overall sparsity represents the total utilization within that budget\. Our latency analysis indicates that maximum sparsity exerts a relatively smoother impact on overall latency compared to variations in overall sparsity\.
## 4Experiment


Figure 8:Illustrations for experiment settings\.Upper: Fixed Top\-kkbudget\. Base: always keepkkblocks; Ours:kkas the maximum block usage for Top\-pppruned each head\.Lower: Dynamic Top\-ppbudget\. Base: keep tokens by Top\-ppmetrics; Ours: constrain the growth of the block usage\.Table 1:Fixed Budget Performance \(Top\-kk\)\.Comparison of baselines vsHard\-KVacross budget sizes \{1024, 2048, 4096\}\. Top\-ppbudget for our methods is 0\.90\.Full Attnrepresents performance without budget constraints\.MethodConfigBudget Size102420484096AccLenAccLenAccLenDataset: AIME24Full Attention \(Ref\.\)Acc: 79\.9%Len: 13735\.7VanillaBase10\.0%32161\.033\.1%23530\.629\.7%31843\.5\+ Ours24\.0%24158\.137\.3%24158\.133\.5%23449\.8SnapKVBase6\.2%31323\.121\.3%23860\.357\.7%19157\.4\+ Ours13\.3%32768\.040\.2%28471\.456\.7%21616\.5RKVBase13\.3%30423\.223\.3%27968\.043\.3%23281\.5\+ Ours23\.4%30480\.853\.3%27819\.657\.9%28094\.0Dataset: AIME25Full Attention \(Ref\.\)Acc: 63\.3%Len: 17528\.8VanillaBase6\.5%32309\.716\.7%26617\.524\.8%31851\.4\+ Ours20\.5%28086\.212\.9%27929\.520\.7%28794\.4SnapKVBase3\.1%31007\.519\.3%26291\.527\.9%25621\.4\+ Ours3\.3%31796\.323\.3%30152\.442\.3%25228\.6RKVBase10\.8%31879\.521\.9%27706\.740\.0%23944\.3\+ Ours17\.8%32279\.026\.7%31068\.656\.7%28231\.6Dataset: UMathFull Attention \(Ref\.\)Acc: 50\.4%Len: 9815\.7VanillaBase9\.8%32768\.038\.0%16102\.532\.1%26314\.3\+ Ours46\.1%15363\.536\.2%16791\.536\.1%16118\.8SnapKVBase18\.0%22350\.741\.8%19021\.446\.7%12420\.2\+ Ours15\.5%27705\.746\.2%22499\.449\.4%14520\.4RKVBase28\.1%19625\.335\.1%17915\.239\.9%27429\.7\+ Ours24\.1%24346\.152\.0%24748\.550\.3%21296\.5
We evaluate our framework under two settings:
1. 1\.Fixed \(Top\-kk\) Budget\.\(Figure[8](https://arxiv.org/html/2606.28831#S4.F8), Upper\) We adapt selection methods to regularized Top\-ppbudgets wherekklimits the maximum KV Cache consumption \(LupperL\_\{upper\}, see Section[3\.3\.1](https://arxiv.org/html/2606.28831#S3.SS3.SSS1)\)\. Due to our method’s head\- and layer\-adaptive nature, actual cache usage remains below Top\-kkbaselines\. We defaultLlower=Lupper/2L\_\{lower\}=L\_\{upper\}/2\.
2. 2\.Dynamic \(Top\-pp\) Budget\.\(Figure[8](https://arxiv.org/html/2606.28831#S4.F8), Lower\) Baselines evict cache only when the token selection probability is zero under the Top\-ppbudget\. While this preserves baseline performance, our method regularizes the budget to constrain cache growth, balancing performance with high sparsity for efficiency\.
Baseline Adaptations\.Hard\-KVunifies Top\-kkmethods into Top\-pphead\-adaptive counterparts via the 3\-step framework \(Appendix[E](https://arxiv.org/html/2606.28831#A5)\)\. We adapt baselines strictly as logits\-level transformations rather than direct algorithmic ports\. Compression triggers every 128 steps during decoding, differing from prefill\-only settings\.
- •Vanilla: Uses raw attention logits \(akin to H2O\)\. For fairness, we enforce the retention of attention sinks and a local window, combining H2O and StreamLLM strategies\.
- •SnapKV: Max\-pools raw logits along the sequence dimension\. We adapt the original score\-pooling to logit\-pooling without affecting selection \(Solution[1](https://arxiv.org/html/2606.28831#ThmSolution1)\), while retaining the local window\.
- •R\-KV: Converts pair\-wise similarity computations into a logits\-level transformation \(Solution[2](https://arxiv.org/html/2606.28831#ThmSolution2)\), similarly preserving attention sinks and the local window\.
Datasets & Models\.We focus on long\-decoding scenarios where models generate\>\>10k tokens \(e\.g\., reasoning tasks\)\. We evaluate on AIME24\(Zhang and Math\-AI,[2024](https://arxiv.org/html/2606.28831#bib.bib29)\), AIME25\(Zhang and Math\-AI,[2025](https://arxiv.org/html/2606.28831#bib.bib30)\), and U\-Math\(Chernyshevet al\.,[2024](https://arxiv.org/html/2606.28831#bib.bib31)\)\(see Appendix[C\.1](https://arxiv.org/html/2606.28831#A3.SS1)for details\)\. Experiments use Qwen3\-8B \(with*think*mode enabled\), with Qwen3\-4B and Qwen3\-32B results provided in Appendix[C\.3](https://arxiv.org/html/2606.28831#A3.SS3)\.
Figure 9:Sparsity Utilization Visualization\.We sort the effective budget for different heads and different layers to draw the distribution\. For a lower sparsity utilization \(Left\), the efficiency suffer from the bottleneck in loading unused cache\. To improve performance, we can increase sparsity utilization by increasing overall sparsity \(regulated by Top\-ppbudget\), with approximately same efficiency\. To improve efficiency, we can lower the max sparsity per head \(regulated by Top\-kkbudget\), with same overall sparsity ratio that preserves the performance\.### 4\.1Fixed\-budget Performance
Table 2:Fixed Budget Performance \(Top\-pp\)\.Comparison of baselines vsHard\-KVacross different Top\-ppbudgets \(\{P70, P80, P90\}\)\. The Top\-kkbudget is 4096\.MethodBaseTop\-ppBudgetAccLenp70p80p90AccLenAccLenAccLenDataset: AIME24Vanilla29\.7%31843\.540\.1%26495\.323\.3%27571\.833\.5%23449\.8SnapKV57\.7%19157\.463\.3%19386\.856\.7%21757\.356\.7%21616\.5RKV43\.3%23281\.553\.3%27802\.451\.5%29152\.157\.9%28094\.0Dataset: AIME25Vanilla24\.8%31851\.420\.1%30100\.818\.9%29155\.420\.7%28794\.4SnapKV27\.9%25621\.436\.6%24264\.553\.3%24229\.642\.3%25228\.6RKV40\.0%23944\.339\.9%29174\.346\.7%28309\.956\.7%28231\.6Dataset: UMathVanilla32\.1%26314\.338\.2%22502\.743\.9%17195\.436\.1%16118\.8SnapKV46\.7%12420\.245\.3%14212\.948\.1%14059\.449\.4%14520\.4RKV39\.9%27429\.738\.0%23649\.947\.5%23681\.050\.3%21296\.5
Fixed Budget Results \(Top\-kk\)\.Table[1](https://arxiv.org/html/2606.28831#S4.T1)presents results under varying Top\-kkbudgets\. Our method achieves up to a1\.5×\\timesperformance boostat the largest budget setting \(4096\)\. We observe consistent improvements for bothSnapKVandR\-KVacross all three datasets\. ForVanilla, while there is a considerable boost in low\-budget settings, performance degrades at higher budgets\. We attribute this drop to factors intrinsic to the method, specifically the accumulation of noisy signals in raw logits\.
A notable finding is the effectiveness ofHard\-KVin low\-budget regimes, driven by the superior utilization inherent to Top\-ppsampling\. Given the intrinsic difficulty of competition\-level math reasoning, Top\-kkbaselines frequently fail in low\-budget settings, often yielding overlong, high\-perplexity responses\(Chenet al\.,[2025a](https://arxiv.org/html/2606.28831#bib.bib24)\)\. By compressing block allocation before the hard budget is reached \(Figure[8](https://arxiv.org/html/2606.28831#S4.F8)\), our method preserves capacity for retaining critical KV cache states later in the generation process, thereby improving accuracy\.
Fixed Budget Results \(Top\-pp\)\.As discussed above, the fixed Top\-ppbudget acts as a reference for the target token density that compression methods should maintain in the current context, and therefore affects the selections accumulated over compression steps\. Table[2](https://arxiv.org/html/2606.28831#S4.T2)reports performance under different Top\-ppbudgets\. ForSnapKVandR\-KV, the shifted logits benefit substantially from largerppvalues, indicating that these adapted methods can dynamically recover useful information from long\-tailed attention distributions\. By contrast, raw logits \(Vanilla\) show relatively flat performance acrossppvalues, consistent with our earlier observation thatVanilladegrades under larger Top\-kkbudgets\. This suggests that simply increasing the retained probability mass is insufficient when the underlying logits contain accumulated noise; addressing this limitation in raw attention logits under high\-budget settings is left to future work\.
Table 3:Dynamic Budget Trade\-offs\.Analysis of Efficiency vs\. Accuracy\.SU: Sparsity Utilization,Thr\.: Throughput \(tokens/s\)\.MethodConfigAIME24AIME25UMathAccSUThr\.AccSUThr\.AccSUThr\.System Throughput Baselines \(Tokens/s\)Base System346350580\+ CUDA Graph477467748\+ Top\-kkpruning806823996Top\-p=0\.90p=0\.90VanillaBase35\.3%33\.2%27933\.3%35\.5%28944\.5%46\.6%329\+ Ours33\.5%83\.4%73220\.7%86\.3%79236\.1%79\.5%819SnapKVBase43\.3%30\.7%30143\.3%18\.9%34342\.0%37\.1%321\+ Ours49\.2%90\.3%63242\.6%97\.9%62646\.5%84\.3%726Top\-p=0\.60p=0\.60VanillaBase33\.9%27\.1%28020\.0%47\.5%29125\.3%40\.1%317\+ Ours36\.7%77\.1%72821\.0%83\.1%69840\.1%82\.0%887SnapKVBase26\.7%23\.4%30016\.7%30\.1%30530\.0%21\.1%319\+ Ours50\.2%88\.2%64643\.3%85\.2%64546\.0%89\.2%729
### 4\.2Dynamic\-budget Performance
Dynamic Budget Results\.Table[3](https://arxiv.org/html/2606.28831#S4.T3)shows performance in dynamic budget settings, along with throughput ablations\. Under the Top\-kksetting, both CUDA Graph integrations and pruning contribute crucially to the overall throughput improvements\. Under the Top\-ppsetting, baselines here retain tokens whenever any layer or head selects them within the sliding window \(see Figure[8](https://arxiv.org/html/2606.28831#S4.F8)for illustrations\)\. While this yields accurate Top\-ppselections, it has caused downgraded throughput compared with Top\-kkcounterparts, due to incompatibility with CUDA Graph and blocking operation of compression\. Another consequence in naive Top\-pppruning is suboptimal sparsity utilization \(defined as max sparsity per head divided by overall sparsity; see Figure[9](https://arxiv.org/html/2606.28831#S4.F9)\)\. By the KV index management system introduced in Section[3\.3](https://arxiv.org/html/2606.28831#S3.SS3), we can regulate the overall sparsity via a fixed Top\-ppbudget and enforce efficiency via a Top\-kkthreshold\. Our method achieves performance comparable to Top\-ppsparse baselines while maintaining the efficiency of Top\-kkeviction\.
Summary and Recommendations\.Synthesizing these results, we recommend optimizing performance by regulating sparsity utilization along two dimensions\. For performance\-sensitive tasks, we suggest using a largerkkto increase overall token retention, together with a largerppto improve utilization of the retained cache\. For efficiency\-sensitive tasks, we suggest using a smallerkkto improve throughput, together with a smallerppto reduce occupied cache space and avoid premature eviction\.
## 5Conclusion and Future Work
In summary, we propose a scheme that unifies diverse KV compression methods to facilitate Top\-ppsampling, incorporating*logits calibration*\. Furthermore, we provide a systems\-level solution for managing head\-adaptive KV indices\. By integrating these components,Hard\-KV, built upon*Cascade Cache*, achieves a superior trade\-off between performance and efficiency compared to fixed\-budget baselines\.
In future work, we aim to extend theHard\-KVframework to streaming compression regimes\(Chenet al\.,[2025b](https://arxiv.org/html/2606.28831#bib.bib64); Wanget al\.,[2025b](https://arxiv.org/html/2606.28831#bib.bib65)\)and investigate sequence\-level hybrids of constant\-sized and linear\-sized memory\(Wanget al\.,[2025a](https://arxiv.org/html/2606.28831#bib.bib66)\)\.
## Acknowledgment
This work was supported by the Fundamental and Interdisciplinary Disciplines Breakthrough Plan of the Ministry of Education of China \(No\. JYB2025XDXM103\) and CCF\-Ant Research Fund \(No\. RF20250402\)\.
## Impact Statement
This paper presents work whose goal is to advance the field of Machine Learning\. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here\.
## References
- J\. Ainslie, J\. Lee\-Thorp, M\. De Jong, Y\. Zemlyanskiy, F\. Lebrón, and S\. Sanghai \(2023\)Gqa: training generalized multi\-query transformer models from multi\-head checkpoints\.arXiv preprint arXiv:2305\.13245\.Cited by:[§2](https://arxiv.org/html/2606.28831#S2.SS0.SSS0.Px1.p1.1)\.
- Y\. Bai, X\. Lv, J\. Zhang, H\. Lyu, J\. Tang, Z\. Huang, Z\. Du, X\. Liu, A\. Zeng, L\. Hou,et al\.\(2024\)Longbench: a bilingual, multitask benchmark for long context understanding\.InProceedings of the 62nd annual meeting of the association for computational linguistics \(volume 1: Long papers\),pp\. 3119–3137\.Cited by:[§C\.1](https://arxiv.org/html/2606.28831#A3.SS1.p1.1)\.
- Y\. Bai, S\. Tu, J\. Zhang, H\. Peng, X\. Wang, X\. Lv, S\. Cao, J\. Xu, L\. Hou, Y\. Dong,et al\.\(2025\)Longbench v2: towards deeper understanding and reasoning on realistic long\-context multitasks\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 3639–3664\.Cited by:[§C\.1](https://arxiv.org/html/2606.28831#A3.SS1.p1.1)\.
- P\. Behnam, Y\. Fu, R\. Zhao, P\. Tsai, Z\. Yu, and A\. Tumanov \(2025\)RocketKV: accelerating long\-context llm inference via two\-stage kv cache compression\.arXiv preprint arXiv:2502\.14051\.Cited by:[§1](https://arxiv.org/html/2606.28831#S1.p1.1),[§2](https://arxiv.org/html/2606.28831#S2.SS0.SSS0.Px2.p6.1)\.
- T\. Cai, Y\. Li, Z\. Geng, H\. Peng, J\. D\. Lee, D\. Chen, and T\. Dao \(2024\)Medusa: simple llm inference acceleration framework with multiple decoding heads\.arXiv preprint arXiv:2401\.10774\.Cited by:[§2](https://arxiv.org/html/2606.28831#S2.SS0.SSS0.Px2.p1.1)\.
- A\. Chen, R\. Geh, A\. Grover, G\. V\. d\. Broeck, and D\. Israel \(2025a\)The pitfalls of kv cache compression\.arXiv preprint arXiv:2510\.00231\.Cited by:[§1](https://arxiv.org/html/2606.28831#S1.p2.1),[§4\.1](https://arxiv.org/html/2606.28831#S4.SS1.p2.2)\.
- X\. Chen, K\. Tao, K\. Shao, and H\. Wang \(2025b\)Streamingtom: streaming token compression for efficient video understanding\.arXiv preprint arXiv:2510\.18269\.Cited by:[§5](https://arxiv.org/html/2606.28831#S5.p2.1)\.
- Y\. Chen, G\. Wang, J\. Shang, S\. Cui, Z\. Zhang, T\. Liu, S\. Wang, Y\. Sun, D\. Yu, and H\. Wu \(2024a\)Nacl: a general and effective kv cache eviction framework for llms at inference time\.arXiv preprint arXiv:2408\.03675\.Cited by:[§B\.2](https://arxiv.org/html/2606.28831#A2.SS2.p8.1),[§2](https://arxiv.org/html/2606.28831#S2.SS0.SSS0.Px2.p6.1)\.
- Z\. Chen, R\. Sadhukhan, Z\. Ye, Y\. Zhou, J\. Zhang, N\. Nolte, Y\. Tian, M\. Douze, L\. Bottou, Z\. Jia,et al\.\(2024b\)Magicpig: lsh sampling for efficient llm generation\.arXiv preprint arXiv:2410\.16179\.Cited by:[§2](https://arxiv.org/html/2606.28831#S2.SS0.SSS0.Px1.p3.3)\.
- K\. Chernyshev, V\. Polshkov, E\. Artemova, A\. Myasnikov, V\. Stepanov, A\. Miasnikov, and S\. Tilga \(2024\)U\-math: a university\-level benchmark for evaluating mathematical skills in llms\.arXiv preprint arXiv:2412\.03205\.Cited by:[§C\.1](https://arxiv.org/html/2606.28831#A3.SS1.p2.1),[§4](https://arxiv.org/html/2606.28831#S4.p4.1)\.
- K\. Cobbe, V\. Kosaraju, M\. Bavarian, M\. Chen, H\. Jun, L\. Kaiser, M\. Plappert, J\. Tworek, J\. Hilton, R\. Nakano, C\. Hesse, and J\. Schulman \(2021\)Training verifiers to solve math word problems\.arXiv preprint arXiv:2110\.14168\.Cited by:[§C\.1](https://arxiv.org/html/2606.28831#A3.SS1.p2.1)\.
- T\. Dao \(2023\)Flashattention\-2: faster attention with better parallelism and work partitioning\.arXiv preprint arXiv:2307\.08691\.Cited by:[§2](https://arxiv.org/html/2606.28831#S2.SS0.SSS0.Px2.p5.1)\.
- W\. Du, L\. Jiang, K\. Tao, X\. Liu, and H\. Wang \(2025\)Which heads matter for reasoning? rl\-guided kv cache compression\.arXiv preprint arXiv:2510\.08525\.Cited by:[§2](https://arxiv.org/html/2606.28831#S2.SS0.SSS0.Px1.p2.1)\.
- Y\. Feng, H\. Guo, J\. Lv, S\. K\. Zhou, and X\. Xie \(2025a\)Taming the fragility of kv cache eviction in llm inference\.arXiv preprint arXiv:2510\.13334\.Cited by:[§1](https://arxiv.org/html/2606.28831#S1.p2.1)\.
- Y\. Feng, J\. Lv, Y\. Cao, X\. Xie, and S\. K\. Zhou \(2024\)Ada\-kv: optimizing kv cache eviction by adaptive budget allocation for efficient llm inference\.arXiv preprint arXiv:2407\.11550\.Cited by:[§2](https://arxiv.org/html/2606.28831#S2.SS0.SSS0.Px1.p2.1)\.
- Y\. Feng, J\. Lv, Y\. Cao, X\. Xie, and S\. K\. Zhou \(2025b\)Identify critical kv cache in llm inference from an output perturbation perspective\.arXiv preprint arXiv:2502\.03805\.Cited by:[§1](https://arxiv.org/html/2606.28831#S1.p2.1)\.
- Y\. Fu, Z\. Cai, A\. Asi, W\. Xiong, Y\. Dong, and W\. Xiao \(2024\)Not all heads matter: a head\-level kv cache compression method with integrated retrieval and reasoning\.arXiv preprint arXiv:2410\.19258\.Cited by:[§1](https://arxiv.org/html/2606.28831#S1.p2.1),[§2](https://arxiv.org/html/2606.28831#S2.SS0.SSS0.Px1.p2.1)\.
- Z\. Fu, W\. Song, Y\. Wang, X\. Wu, Y\. Zheng, Y\. Zhang, D\. Xu, X\. Wei, T\. Xu, and X\. Zhao \(2025\)Sliding window attention training for efficient large language models\.arXiv preprint arXiv:2502\.18845\.Cited by:[§2](https://arxiv.org/html/2606.28831#S2.SS0.SSS0.Px1.p1.1)\.
- D\. Hendrycks, C\. Burns, S\. Kadavath, A\. Arora, S\. Basart, E\. Tang, D\. Song, and J\. Steinhardt \(2021\)Measuring mathematical problem solving with the math dataset\.arXiv preprint arXiv:2103\.03874\.Cited by:[§C\.1](https://arxiv.org/html/2606.28831#A3.SS1.p2.1)\.
- A\. Holtzman, J\. Buys, L\. Du, M\. Forbes, and Y\. Choi \(2019\)The curious case of neural text degeneration\.arXiv preprint arXiv:1904\.09751\.Cited by:[§1](https://arxiv.org/html/2606.28831#S1.p2.1)\.
- C\. Hsieh, S\. Sun, S\. Kriman, S\. Acharya, D\. Rekesh, F\. Jia, Y\. Zhang, and B\. Ginsburg \(2024\)RULER: what’s the real context size of your long\-context language models?\.arXiv preprint arXiv:2404\.06654\.Cited by:[§C\.1](https://arxiv.org/html/2606.28831#A3.SS1.p1.1)\.
- Huggingface \(2025\)Math\-verify\.External Links:[Link](https://github.com/huggingface/Math-Verify)Cited by:[§C\.1](https://arxiv.org/html/2606.28831#A3.SS1.p2.1)\.
- D\. Jo, J\. Song, Y\. Kim, and J\. Kim \(2025\)Fastkv: kv cache compression for fast long\-context processing with token\-selective propagation\.arXiv preprint arXiv:2502\.01068\.Cited by:[§1](https://arxiv.org/html/2606.28831#S1.p1.1),[§2](https://arxiv.org/html/2606.28831#S2.SS0.SSS0.Px2.p6.1)\.
- D\. P\. Kingma and J\. Ba \(2014\)Adam: a method for stochastic optimization\.arXiv preprint arXiv:1412\.6980\.Cited by:[§D\.1](https://arxiv.org/html/2606.28831#A4.SS1.p2.1)\.
- W\. Kwon, Z\. Li, S\. Zhuang, Y\. Sheng, L\. Zheng, C\. H\. Yu, J\. E\. Gonzalez, H\. Zhang, and I\. Stoica \(2023a\)Efficient memory management for large language model serving with pagedattention\.InProceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles,Cited by:[§B\.1](https://arxiv.org/html/2606.28831#A2.SS1.p1.1)\.
- W\. Kwon, Z\. Li, S\. Zhuang, Y\. Sheng, L\. Zheng, C\. H\. Yu, J\. Gonzalez, H\. Zhang, and I\. Stoica \(2023b\)Efficient memory management for large language model serving with pagedattention\.InProceedings of the 29th symposium on operating systems principles,pp\. 611–626\.Cited by:[§B\.1](https://arxiv.org/html/2606.28831#A2.SS1.p1.1),[§1](https://arxiv.org/html/2606.28831#S1.p3.1),[§2](https://arxiv.org/html/2606.28831#S2.SS0.SSS0.Px2.p4.1)\.
- Y\. Li, Y\. Huang, B\. Yang, B\. Venkitesh, A\. Locatelli, H\. Ye, T\. Cai, P\. Lewis, and D\. Chen \(2024\)Snapkv: llm knows what you are looking for before generation\.Advances in Neural Information Processing Systems\.Cited by:[§B\.2](https://arxiv.org/html/2606.28831#A2.SS2.p8.1),[§2](https://arxiv.org/html/2606.28831#S2.SS0.SSS0.Px1.p1.1),[§2](https://arxiv.org/html/2606.28831#S2.SS0.SSS0.Px2.p6.1)\.
- M\. Liao, L\. Wang, C\. Zhang, Z\. Shen, X\. Mao, S\. Qin, Q\. Lin, S\. Rajmohan, D\. Zhang, and H\. Wan \(2025\)G\-kv: decoding\-time kv cache eviction with global attention\.arXiv preprint arXiv:2512\.00504\.Cited by:[§1](https://arxiv.org/html/2606.28831#S1.p1.1)\.
- C\. Lin, J\. Tang, S\. Yang, H\. Wang, T\. Tang, B\. Tian, I\. Stoica, S\. Han, and M\. Gao \(2025\)Twilight: adaptive attention sparsity with hierarchical top\-pppruning\.arXiv preprint arXiv:2502\.02770\.Cited by:[§1](https://arxiv.org/html/2606.28831#S1.p2.1),[§2](https://arxiv.org/html/2606.28831#S2.SS0.SSS0.Px1.p3.3)\.
- A\. Liu, B\. Feng, B\. Xue, B\. Wang, B\. Wu, C\. Lu, C\. Zhao, C\. Deng, C\. Zhang, C\. Ruan,et al\.\(2024\)Deepseek\-v3 technical report\.arXiv preprint arXiv:2412\.19437\.Cited by:[§2](https://arxiv.org/html/2606.28831#S2.SS0.SSS0.Px2.p1.1)\.
- I\. Loshchilov and F\. Hutter \(2017\)Decoupled weight decay regularization\.arXiv preprint arXiv:1711\.05101\.Cited by:[§D\.1](https://arxiv.org/html/2606.28831#A4.SS1.p2.1)\.
- NVIDIA Corporation \(2024\)CUDA c\+\+ programming guide: cuda graphs\.Note:Accessed: 2024\-01\-25External Links:[Link](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs)Cited by:[§1](https://arxiv.org/html/2606.28831#S1.p3.1)\.
- J\. Park, D\. Jones, M\. J\. Morse, R\. Goel, M\. Lee, and C\. Lott \(2025\)KeyDiff: key similarity\-based kv cache eviction for long\-context llm inference in resource\-constrained environments\.arXiv preprint arXiv:2504\.15364\.Cited by:[§3\.2](https://arxiv.org/html/2606.28831#S3.SS2.SSS0.Px1.p10.1)\.
- S\. Ruder \(2016\)An overview of gradient descent optimization algorithms\.arXiv preprint arXiv:1609\.04747\.Cited by:[§D\.1](https://arxiv.org/html/2606.28831#A4.SS1.p2.1)\.
- H\. Tang, Y\. Lin, J\. Lin, Q\. Han, S\. Hong, Y\. Yao, and G\. Wang \(2024\)Razorattention: efficient kv cache compression through retrieval heads\.arXiv preprint arXiv:2407\.15891\.Cited by:[§1](https://arxiv.org/html/2606.28831#S1.p2.1)\.
- D\. Wang, R\. Zhu, S\. Abreu, Y\. Shan, T\. Kergan, Y\. Pan, Y\. Chou, Z\. Li, G\. Zhang, W\. Huang,et al\.\(2025a\)A systematic analysis of hybrid linear attention\.arXiv preprint arXiv:2507\.06457\.Cited by:[§5](https://arxiv.org/html/2606.28831#S5.p2.1)\.
- Y\. Wang, X\. Liu, X\. Gui, X\. Lin, B\. Yang, C\. Liao, T\. Chen, and L\. Zhang \(2025b\)Accelerating streaming video large language models via hierarchical token compression\.arXiv preprint arXiv:2512\.00891\.Cited by:[§5](https://arxiv.org/html/2606.28831#S5.p2.1)\.
- J\. Wu, Z\. Wang, L\. Zhang, Y\. Lai, Y\. He, and D\. Zhou \(2025\)Scope: optimizing key\-value cache compression in long\-context generation\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 10775–10790\.Cited by:[§1](https://arxiv.org/html/2606.28831#S1.p1.1),[§2](https://arxiv.org/html/2606.28831#S2.SS0.SSS0.Px2.p6.1)\.
- G\. Xiao, J\. Tang, J\. Zuo, J\. Guo, S\. Yang, H\. Tang, Y\. Fu, and S\. Han \(2024\)Duoattention: efficient long\-context llm inference with retrieval and streaming heads\.arXiv preprint arXiv:2410\.10819\.Cited by:[§1](https://arxiv.org/html/2606.28831#S1.p2.1)\.
- G\. Xiao, Y\. Tian, B\. Chen, S\. Han, and M\. Lewis \(2023\)Efficient streaming language models with attention sinks\.arXiv preprint arXiv:2309\.17453\.Cited by:[1st item](https://arxiv.org/html/2606.28831#S3.I1.i1.p1.1)\.
- Y\. Xu, Z\. Jie, H\. Dong, L\. Wang, X\. Lu, A\. Zhou, A\. Saha, C\. Xiong, and D\. Sahoo \(2024\)Think: thinner key cache by query\-driven pruning\.arXiv preprint arXiv:2407\.21018\.Cited by:[§B\.2](https://arxiv.org/html/2606.28831#A2.SS2.p8.1)\.
- L\. X\. Xuanfan Ni, L\. W\. Chenyang Lyu, L\. L\. Mo Yu, J\. Z\. Fandong Meng, and P\. Li \(2025\)Towards threshold\-free kv cache pruning\.arXiv preprint arXiv:2502\.16886\.Cited by:[§1](https://arxiv.org/html/2606.28831#S1.p2.1)\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv,et al\.\(2025\)Qwen3 technical report\.arXiv preprint arXiv:2505\.09388\.Cited by:[§C\.3](https://arxiv.org/html/2606.28831#A3.SS3.p1.1)\.
- S\. Yang, B\. Wang, Y\. Zhang, Y\. Shen, and Y\. Kim \(2024\)Parallelizing linear transformers with the delta rule over sequence length\.Advances in neural information processing systems37,pp\. 115491–115522\.Cited by:[§2](https://arxiv.org/html/2606.28831#S2.SS0.SSS0.Px1.p1.1)\.
- Z\. Ye, L\. Chen, R\. Lai, W\. Lin, Y\. Zhang, S\. Wang, T\. Chen, B\. Kasikci, V\. Grover, A\. Krishnamurthy, and L\. Ceze \(2025\)FlashInfer: efficient and customizable attention engine for llm inference serving\.arXiv preprint arXiv:2501\.01005\.External Links:[Link](https://arxiv.org/abs/2501.01005)Cited by:[Appendix F](https://arxiv.org/html/2606.28831#A6.p1.1),[§2](https://arxiv.org/html/2606.28831#S2.SS0.SSS0.Px2.p2.1),[§3\.1](https://arxiv.org/html/2606.28831#S3.SS1.SSS0.Px2.p4.1)\.
- Z\. Ye, R\. Lai, B\. Lu, C\. Lin, S\. Zheng, L\. Chen, T\. Chen, and L\. Ceze \(2024\)Cascade inference: memory bandwidth efficient shared prefix batch decoding\.External Links:[Link](https://flashinfer.ai/2024/02/02/cascade-inference.html)Cited by:[Appendix F](https://arxiv.org/html/2606.28831#A6.p1.1)\.
- G\. Yu, J\. S\. Jeong, G\. Kim, S\. Kim, and B\. Chun \(2022\)Orca: a distributed serving system for\{\\\{transformer\-based\}\\\}generative models\.In16th USENIX Symposium on Operating Systems Design and Implementation \(OSDI 22\),pp\. 521–538\.Cited by:[§1](https://arxiv.org/html/2606.28831#S1.p3.1),[§2](https://arxiv.org/html/2606.28831#S2.SS0.SSS0.Px2.p4.1)\.
- Yu Xingkai \(2025\)Nano\-vllm\.External Links:[Link](https://github.com/GeeeekExplorer/nano-vllm)Cited by:[§2](https://arxiv.org/html/2606.28831#S2.SS0.SSS0.Px2.p6.1),[§3\.1](https://arxiv.org/html/2606.28831#S3.SS1.SSS0.Px2.p1.1)\.
- H\. Zeng, D\. Zhao, P\. Yang, W\. Hou, T\. Zheng, H\. Li, W\. Ji, and J\. Zhai \(2025\)Lethe: layer\-and time\-adaptive kv cache pruning for reasoning\-intensive llm serving\.arXiv preprint arXiv:2511\.06029\.Cited by:[§1](https://arxiv.org/html/2606.28831#S1.p1.1)\.
- Z\. Zeng, B\. Lin, T\. Hou, H\. Zhang, and Z\. Deng \(2024\)In\-context kv\-cache eviction for llms via attention\-gate\.arXiv preprint arXiv:2410\.12876\.Cited by:[§2](https://arxiv.org/html/2606.28831#S2.SS0.SSS0.Px1.p2.1)\.
- H\. Zhang, H\. Zhang, X\. Ma, J\. Zhang, and S\. Guo \(2025a\)LazyEviction: lagged kv eviction with attention pattern observation for efficient long reasoning\.arXiv preprint arXiv:2506\.15969\.Cited by:[§1](https://arxiv.org/html/2606.28831#S1.p1.1)\.
- Y\. Zhang, Y\. Hu, R\. Zhao, J\. C\. Lui, and H\. Chen \(2025b\)Diffkv: differentiated memory management for large language models with parallel kv compaction\.InProceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles,pp\. 431–445\.Cited by:[§2](https://arxiv.org/html/2606.28831#S2.SS0.SSS0.Px2.p6.1)\.
- Y\. Zhang and T\. Math\-AI \(2024\)American invitational mathematics examination \(aime\) 2024\.Cited by:[§C\.1](https://arxiv.org/html/2606.28831#A3.SS1.p2.1),[§4](https://arxiv.org/html/2606.28831#S4.p4.1)\.
- Y\. Zhang and T\. Math\-AI \(2025\)American invitational mathematics examination \(aime\) 2025\.Cited by:[§C\.1](https://arxiv.org/html/2606.28831#A3.SS1.p2.1),[§4](https://arxiv.org/html/2606.28831#S4.p4.1)\.
- L\. Zheng, L\. Yin, Z\. Xie, C\. L\. Sun, J\. Huang, C\. H\. Yu, S\. Cao, C\. Kozyrakis, I\. Stoica, J\. E\. Gonzalez,et al\.\(2024\)Sglang: efficient execution of structured language model programs\.Advances in neural information processing systems37,pp\. 62557–62583\.Cited by:[§B\.1](https://arxiv.org/html/2606.28831#A2.SS1.p1.1),[§2](https://arxiv.org/html/2606.28831#S2.SS0.SSS0.Px2.p1.1)\.
## Appendix ASystem Design forHard\-Infer
LLMEngineSequenceManagerBlockManagerRequestQueueModel RunnerKV Cache\(Dense Blocks\)Attention Layer \(Dense\)All heads process same cached dataFigure 10:nanovllm: Monolithic Dense Cache Architecture\.HARD LLMEngineHARD KV CacheHARD AttentionCacheManagerHeadwiseBlockManagerRequestQueueModel RunnerSparse\(Mask\)Condensed\(Rewrite\)packed\_headwise\_mask \(uint8\)PartialUpdateQueryRewriteFigure 11:nanovllm\_HARD: Hierarchical Sparse\-Dense Architecture\.Build uponNano\-vLLM, theHard\-Inferfork have made the following key adaptations:
1. 1\.`Block Manager`⟶\\longrightarrow`Headwise Block Manager` We implement head\-flattened block indices \(as shown in Figure[6](https://arxiv.org/html/2606.28831#S3.F6)in Section[3\.3\.2](https://arxiv.org/html/2606.28831#S3.SS3.SSS2)\)\. This design facilitates head\-adaptive KV block allocation and release, enabling more fine\-grained memory management for the KV cache\.
2. 2\.`Sequence Manager`⟶\\longrightarrow`Cache Manager` The vanillaNano\-vLLMmanages metadata \(such as the`block table`, i\.e\., allocated KV blocks for a given request\) directly via the sequence/request data structure\. In contrast, for theCascade Cachein this work, we implement a dedicated`Cache Manager`\. We realize the three\-tier cache using a unified head\-wise mask structure that covers the entire sequence\. This head\-wise mask is packed in[uint8 format](https://docs.flashinfer.ai/generated/flashinfer.quantization.segment_packbits.html)for kernel efficiency\. The three\-tier cache spaces are isolated by pointers in the`req\_to\_token\_pool`[\(in SGLang forward\_batch\)](https://github.com/sgl-project/sglang/blob/673dc09d9ba30331fa60f23c88c0483865d3afce/python/sglang/srt/model_executor/forward_batch_info.py)and dynamically apply the effective mask during decoding or KV selection computations\.
3. 3\.`Flash Attention Layer`⟶\\longrightarrow`HARD Attention Layer` For the attention computation backend, we transition from Flash Attention kernels to FlashInfer\. This is due to FlashInfer’s native support for a block size of 1, which enables the arbitrary selection of KV indices\. To ensure compatibility with CUDA Graphs, we implement a*Partial Update*mechanism that bypasses high\-overhead preparation steps to directly update the head\-wise mask for different layers\. Furthermore, for decoding\-time KV cache compression, we store the query cache in a sliding window and support efficient rewrite kernels\.
For further details in the architecture ofHard\-Infer, please refer toarchitecture\.mdin our code repository\.
## Appendix BFurther Discussions on KV Index Regularization
### B\.1PagedAttention, Page Table, and KV Index Metadata
Figure 12:The metadata preparations during a complete decoding\.Global:The logical KV block pointers to the physical KV blocks;Batch:The request\-to\-logical\-block mapping, gatherer in a running batch;Kernel:The tiling data needed to execute kernel computation, often most computation\-intensive\.PagedAttention\(Kwonet al\.,[2023b](https://arxiv.org/html/2606.28831#bib.bib12)\)addresses memory fragmentation in LLM inference systems\. Engines like vLLM\(Kwonet al\.,[2023a](https://arxiv.org/html/2606.28831#bib.bib13)\)and SGLang\(Zhenget al\.,[2024](https://arxiv.org/html/2606.28831#bib.bib14)\)implement this via various backends, each necessitating distinct metadata computations\.
Decoding workloads are typically memory\-bound but suffer disproportionate scheduling overhead\. This bottleneck is exacerbated by KV cache compression, particularly under aggressive sparsity ratios \(e\.g\.,≈20%\\approx 20\\%\)\. The primary source of this overhead is metadata preparation, which we categorize into three levels:
Global\-Level:The engine pre\-allocates a global KV memory pool\. At runtime, requests are assigned random block indices \(Continuous Batching\)\. While standard attention uses a single index per request, head\-adaptive scenarios require dynamic indices across layers\. To maintain compatibility with CUDA Graphs, we expand these allocated indices by the number of heads\.
Batch\-Level:Inference executes in batches\. For decoding\-only requests, we maintain a list of occupied block indices stacked into a block table, often flattened for efficiency\. In compression scenarios, this table must support read/write operations to facilitate data selection methods\.
Kernel\-Level:The most computationally intensive phase is tiling metadata for kernel execution\. Flashinfer employs a unified Block Compressed Sparse Row \(BSR\) format to support custom attention mechanisms, including KV compression\. To optimize preparation, we employ unified indices across layers; this avoids per\-layer overhead while remaining compatible with head\-adaptive selection via head expansion\. Additionally, we utilize Flashinfer’s custom masking to exclude attention scores at specific positions\.
### B\.2Discussion on trade\-off in the Index Management for Layerwise Selection
In this subsection, we extend the discussion from Section[3\.3\.1](https://arxiv.org/html/2606.28831#S3.SS3.SSS1)to explore potential schemas for different KV selection layouts\. Head\-wise Allocation \(HA\) requires the most dynamic selection mechanism, while the subsequent three layouts present simpler tasks that can be addressed using similar methodologies\. The metrics reported below were collected on an NVIDIA RTX 4090 GPU using a batch size of 32, with an overall sparsity ratio of 12\.5% and a maximum per\-head sparsity ratio of 25%\.
It is important to note that during attention computation, head parallelism can be effectively mapped to query parallelism\. Specifically, kernels such as FlashInfer flatten the head dimension, arranging query tile sizes for each Cooperative Thread Array \(CTA\) via the transformation:\[num\_queries,num\_kv\_heads,head\_dim\]→\[num\_queries×num\_kv\_heads,1,head\_dim\]\[num\\\_queries,num\\\_kv\\\_heads,head\\\_dim\]\\rightarrow\[num\\\_queries\\times num\\\_kv\\\_heads,1,head\\\_dim\]\.
Base Task: Token\-wise Selection\.All indices across different layers are identical\. We only need to maintain global \(batch\-level\) block indices, requiring no additional optimization efforts\.
Task 1: Layer\-wise Selection \(LS\)\.In this setting, while different layers share the same token budget, the specific indices are selected independently\.
Figure 13:Performance evaluations for solutions in the Layer\-Selection \(LS\) task\.Solutions:Vanilla\(✗\) – Requires planning before every layer, resulting in broken CUDA Graphs;LS\-Rewrite\(✓\) – Consolidates selected KV blocks by reading and writing them into new blocks with unified indices;LS\-Sparse\(✓\) – Maintains full\-attention allocated indices and passes a selection mask per layer;LS\-Sum\(✗\) – Allocates KV blocks for each layer and concatenates them during attention preparation\. This is inefficient due to excessive redundant loading\.
Task 2: Layer\-wise Allocation \(LA\)\.Different layers are assigned different budgets with independently selected indices\.
Figure 14:Performance evaluations for solutions in the Layer\-Allocation \(LA\) task\.Solutions:LA\-Sparse\(–\) – Utilizes sparse loading, identical to ‘LS\-Sparse’;LA\-Max\(✓\) – Unifies block indices based on the maximum layer\-wise allocated pages\. This necessitates an additional rewrite operation, resulting in higher preparation latency\.
*Latency Analysis:*It is important to note that the higher latency incurred by rewriting the KV cache is amortized over decoding steps\. We only need to rewrite the cache once, utilizing vanilla allocation thereafter for non\-selection decoding steps\. For prefill\-time compression methods\(Liet al\.,[2024](https://arxiv.org/html/2606.28831#bib.bib1); Xuet al\.,[2024](https://arxiv.org/html/2606.28831#bib.bib52); Chenet al\.,[2024a](https://arxiv.org/html/2606.28831#bib.bib63)\), this additional cost is effectively negligible\. Regarding decoding\-time compression methods \(which use periodic compression operations\), the cost is averaged over the compression frequency\.
Task 3: Head\-wise Selection \(HS\)\.Different heads have different budgets and independently selected indices, yet they total the same KV cache budget per layer\.
Figure 15:Performance evaluations for solutions in Head\-Selection \(HS\) scenarios\.To utilize the benefits of flexible KV cache allocation in this setting, we assign different block IDs to different heads in the global/batch\-level metadata\. However, we ensure that each block index corresponds to the KV cache of all layers \(see Figure[6](https://arxiv.org/html/2606.28831#S3.F6)in Section[3\.3\.2](https://arxiv.org/html/2606.28831#S3.SS3.SSS2)\)\.
Solutions:HS\-Sparse\(–\) – A sparse loading method that keeps full indices in sequence and loads the effective KV cache via sparse masks;HS\-Max\(✓\) – Unifies head\-wise indices to the maximum number of allocated blocks per head, requiring a rewrite operation to align indices across heads;HS\-Sum\(✗\) – Unifies head\-wise indices to the sum of allocated blocks per layer, requiring a complex rewrite operation across layers\.
Finally, we reach toHead\-wise Allocation \(HA\)task that is discussed in Section[3\.3](https://arxiv.org/html/2606.28831#S3.SS3)\.
##### Trade\-off in Rewrite Operation
Building on the trade\-off discussed in Section[3\.3\.2](https://arxiv.org/html/2606.28831#S3.SS3.SSS2), we characterize the system cost using the Memory\-Time\-Integral \(MTI\):
MTI\(Memory\-Time\-Integral\)=∫0Tmemory\(t\)dtMTI\(\\text\{Memory\-Time\-Integral\}\)=\\int\_\{0\}^\{T\}\\text\{memory\(t\)\}\\ \\mathrm\{d\}t
Figure[16](https://arxiv.org/html/2606.28831#A2.F16)compares the MTI ofHA\-SparseandHA\-Max\. Notably,HA\-Maxquickly offsets the temporary overhead incurred by rewrite operations \(in<5<5steps\)\. This implies that increasing the size of the*Condense Cache*\(HA\-Max\) over the*Sparse Cache*\(HA\-Sparse\) improves request\-level throughput\. Nevertheless, we cannot minimize the Sparse Cache entirely, as it is critical for the*Cautious Update*strategy \(Section[3\.1](https://arxiv.org/html/2606.28831#S3.SS1)\)\. Therefore, we select a balanced memory allocation \(Llower=Lupper/2L\_\{lower\}=L\_\{upper\}/2\) for the experimental evaluation presented in Section[3\.3\.1](https://arxiv.org/html/2606.28831#S3.SS3.SSS1)and[4](https://arxiv.org/html/2606.28831#S4)\.
\(a\)Memory\-latency integral on 1 sequence with length of 2048\.
\(b\)Memory\-latency integral on 1 sequence with length of 20480\.
\(c\)Memory\-latency integral on 32 sequence with length of 2048\.
\(d\)Memory\-latency integral on 32 sequence with length of 20480\.
Figure 16:Trade\-off measured by Memory\-latency integral for different number of sequences and different sequence lengths\. In most settings, the rewrite integral can catch up its sparse counterpart in less then 5 decoding steps\.
## Appendix CExperiments Details
### C\.1Choice of Datasets
Hard\-KVis designed for decoding\-time KV cache compression, specifically targeting the challenges inherent in long\-context generation\. In contrast to traditional long\-context tasks—such as document QA, few\-shot learning, summarization, and retrieval\(Baiet al\.,[2024](https://arxiv.org/html/2606.28831#bib.bib53),[2025](https://arxiv.org/html/2606.28831#bib.bib54); Hsiehet al\.,[2024](https://arxiv.org/html/2606.28831#bib.bib55)\)—which are characterized bylong inputsandshort outputs, we focus on scenarios featuringshort inputsandlong outputs\.
This focus necessitates an evaluation centered on complex mathematical reasoning\. Accordingly, we select theAIME24\(Zhang and Math\-AI,[2024](https://arxiv.org/html/2606.28831#bib.bib29)\),AIME25\(Zhang and Math\-AI,[2025](https://arxiv.org/html/2606.28831#bib.bib30)\), andU\-MATH\(Chernyshevet al\.,[2024](https://arxiv.org/html/2606.28831#bib.bib31)\)benchmarks\. U\-MATH is a university\-level dataset derived from real\-world academic sources; it remains relatively under\-explored in the literature regarding reasoning models\. We find that U\-MATH is particularly effective for evaluating KV compression methods, as its moderate overall accuracy allows for a granular distinction between different levels of attention sparsity\. For our experimental setup, we filtered the dataset to a subset of 50 non\-image\-based questions with definitive answers to facilitate objective evaluation\. While popular benchmarks such asMATH\-500\(Hendryckset al\.,[2021](https://arxiv.org/html/2606.28831#bib.bib56)\)andGSM8K\(Cobbeet al\.,[2021](https://arxiv.org/html/2606.28831#bib.bib57)\)were considered, they are largely saturated and typically generate fewer than 10k tokens on average\. All experimental evaluations were conducted using themath\-verify\(Huggingface,[2025](https://arxiv.org/html/2606.28831#bib.bib18)\)framework\.
### C\.2About Reproduction
All experiments were conducted on NVIDIA A100 and NVIDIA RTX 4090 GPUs\. The results reported in Table[1](https://arxiv.org/html/2606.28831#S4.T1)and Table[2](https://arxiv.org/html/2606.28831#S4.T2)represent the average of three independent trials for each configuration\. For reproducibility, comprehensive guidelines are provided in our[documentation](https://github.com/SuDIS-ZJU/HARDInfer)\. Beyond the code repository, we also release full experiment logs to substantiate the reliability of our results and facilitate further analysis of token generation patterns\.
### C\.3Further Experiment Results
We present the experimental results for Qwen3\-4B in Table[4](https://arxiv.org/html/2606.28831#A3.T4)and Table[5](https://arxiv.org/html/2606.28831#A3.T5)\. When compared with the Qwen\-8B results \(Table[1](https://arxiv.org/html/2606.28831#S4.T1)and Table[2](https://arxiv.org/html/2606.28831#S4.T2)\), the performance metrics are approximately equivalent\. This similarity can be attributed to thearchitectural congruencybetween the two models; specifically, Qwen3\-4B and Qwen3\-8B shareidenticalattention layer configurations, possessing the same`head\_dim`,`num\_heads`,`num\_kv\_heads`, and`num\_layers`\. Consequently, optimizations targeting the KV cache yield consistent effects across both models\. This observation aligns with the official technical reportYanget al\.\([2025](https://arxiv.org/html/2606.28831#bib.bib50)\), which documents comparable performance profiles for these two architectures\.
However, our experiments with Qwen3\-4B reveal greater volatility in performance metrics\. Notably, we observe instances of improved results associated with longer generation lengths\. This finding is counter\-intuitive, as extended generation lengths typically degrade performance in the context of KV cache eviction\. We leave for future investigation the analysis of how non\-attention parameters, such as Feed\-Forward Networks \(FFN\), may modulate the efficacy of KV cache compression methods\.
Table 4:Fixed Budget Performance of Qwen3\-4B\.Comparison of baselines vsHard\-KVacross different Top\-kkbudgets \(\{1024, 2048, 4096\}\)\. The Top\-ppbudget for our methods is set as 0\.90\.Full Attentionshows the performance results without budget constraints\.DatasetSelectionConfigBudget SizeMethod102420484096AccGen\. LenAccGen\. LenAccGen\. LenAIME24Full Attn68\.2%14625\.3VanillaBase3\.3%32768\.036\.7%28846\.556\.7%18706\.9\+ Ours24\.0%31790\.839\.9%30273\.236\.6%30864\.9SnapKVBase3\.2%32652\.436\.0%29044\.436\.7%19007\.7\+ Ours10\.7%32768\.040\.2%29201\.655\.4%20931\.6RKVBase3\.3%26750\.126\.7%26245\.950\.0%18967\.4\+ Ours23\.4%31036\.743\.3%31788\.056\.5%28267\.2AIME25Full Attn63\.3%17998\.7VanillaBase6\.5%32309\.716\.7%26617\.524\.8%31851\.4\+ Ours10\.5%30906\.712\.9%30837\.820\.7%29966\.5SnapKVBase10\.9%32229\.712\.3%30040\.443\.3%23654\.5\+ Ours9\.3%32768\.026\.7%28254\.244\.3%26548\.9RKVBase13\.7%27924\.721\.9%26836\.230\.7%20616\.6\+ Ours13\.2%31496\.420\.7%31809\.653\.4%28202\.8UMathFull Attn53\.3%10111\.5VanillaBase14\.0%28859\.636\.7%19021\.444\.0%12828\.0\+ Ours36\.7%23695\.836\.2%24691\.838\.0%24832\.7SnapKVBase18\.0%29332\.531\.8%20651\.444\.2%14371\.1\+ Ours22\.5%29204\.842\.0%23663\.249\.4%16209\.9RKVBase30\.1%18504\.635\.1%17685\.852\.0%12825\.3\+ Ours20\.1%31351\.642\.0%23663\.250\.3%23156\.9
Table 5:Dynamic Budget Performance of Qwen3\-4B\.Comparison of baselines vsHard\-KVacross different Top\-ppbudgets \(\{P70, P80, P90\}\)\. The Top\-kkbudget is 4096\.DatasetMethodBaseTop\-ppbudgetAccGen\. Lenp70p80p90AccGen\. LenAccGen\. LenAccGen\. LenAIME24Vanilla56\.7%18709\.928\.5%31004\.022\.0%31453\.036\.6%30864\.9SnapKV36\.7%19007\.750\.0%23631\.756\.7%29990\.466\.7%21796\.8RKV50\.0%18967\.453\.3%27802\.451\.5%29152\.156\.5%28267\.2AIME25Vanilla24\.8%31851\.417\.2%29977\.026\.7%30941\.220\.7%29966\.5SnapKV43\.3%23654\.550\.0%22225\.140\.0%27711\.644\.3%26548\.9RKV30\.7%20616\.636\.9%28183\.240\.3%28214\.153\.4%28202\.8UMathVanilla44\.0%12828\.042\.0%25278\.839\.3%25729\.938\.0%24832\.7SnapKV44\.2%14371\.144\.6%15291\.452\.3%16546\.849\.4%16209\.9RKV52\.0%12825\.350\.0%22459\.152\.0%22564\.050\.3%23156\.9
We further report results for Qwen3\-32B in Table[6](https://arxiv.org/html/2606.28831#A3.T6)\. Serving larger\-scale models poses substantial system challenges: on a single GPU, such models leave less than 20GB for the KV cache \(e\.g\., Qwen3\-32B on a single NVIDIA A100 80GB GPU\), making multi\-GPU parallelism necessary\. We adopt Tensor Parallelism \(TP\), rather than Pipeline Parallelism, which introduces pipeline bubbles, or Expert Parallelism, which is incompatible with CUDA Graphs\.
Since TP naturally partitions attention heads across instances, each instance performs head\-wise KV eviction independently\. To minimize communication overhead, our customizedNano\-vLLMengine implements two key optimizations:
- •Asynchronous mask management\.We avoid synchronizing head\-wise sparse masks across instances by managing them on the CPU, thereby preventing frequent CPU–GPU synchronization stalls\.
- •Localized metadata\.Sequence\-level metadata, such as head\-wise block tables, is stored locally within each instance, eliminating cross\-instance serialization and communication overhead\.
As a result, the only required communication is a one\-time gather of head counts per sequence, which introduces negligible overhead compared to non\-parallel configurations\.
Table 6:Fixed Budget Performance of Qwen3\-32B \(Budget: 4096\)\.Transposed comparison of baselines vsHard\-KVfor a Top\-kkbudget of 4096 across datasets\. The Top\-ppbudget for our methods is set as 0\.90\.Full Attentionshows performance results without budget constraints\.SelectionConfigAIME24AIME25UMathMethodAccGen\. LenAccGen\. LenAccGen\. LenFull Attn76\.7%13104\.070\.0%16852\.056\.0%9169\.0VanillaBase40\.0%25918\.033\.3%28701\.048\.0%22183\.0\+ Ours50\.0%22164\.043\.3%24474\.052\.0%14711\.0SnapKVBase53\.3%24862\.033\.3%24256\.050\.0%21097\.0\+ Ours60\.0%21549\.046\.7%22901\.056\.0%21778\.0RKVBase60\.0%19911\.040\.0%21275\.050\.0%10537\.0\+ Ours66\.7%17030\.050\.0%21427\.052\.0%10123\.0
## Appendix DFurther Discussions on logits Calibration
### D\.1Algorithms for solving Problem[1](https://arxiv.org/html/2606.28831#Thmproblem1)under order\-invariance constraint
Algorithm 1Gradient Descent onTTParameters :
kk: for Top\-kk;pp: for Top\-pp;
TT: temperatures with shape
\[bsz×num\_kv\_heads,query\_len\]\[bsz\\times num\\\_kv\\\_heads,query\\\_len\]
Optimizers :
OptimizerOptimizer\(e\.g\. AdamW\);
Initialize :
TT,OptimizerOptimizer
k←Num\_Selected\_TopP\(S,p\)k\\leftarrow Num\\\_Selected\\\_TopP\(S,p\)
while*‖Mk\(S\)−Mk\(S^\)‖<ℰ\\\|M\_\{k\}\(S\)\-M\_\{k\}\(\\hat\{S\}\)\\\|<\\mathcal\{E\}andi≤Ni\\leq N*do
2Solution 1z^\\displaystyle\\hat\{z\}←g\(z/T\)\\displaystyle\\leftarrow g\(z/T\)S^\\displaystyle\\hat\{S\}=Softmax\(z^\)\\displaystyle=\\operatorname\{Softmax\}\(\\hat\{z\}\)Solution 2z^\\displaystyle\\hat\{z\}←g\(z\)\\displaystyle\\leftarrow g\(z\)S^\\displaystyle\\hat\{S\}=Softmax\(z^/T\)\\displaystyle=\\operatorname\{Softmax\}\(\\hat\{z\}/T\)
ℓ←‖Ms\(S\)−Mk\(S^\)‖\\ell\\leftarrow\\\|M\_\{s\}\(S\)\-M\_\{k\}\(\\hat\{S\}\)\\\|Update
TTvia
OptimizerOptimizer
Algorithm 2Binary Search onTTParameters :
kk: for Top\-kk;pp: for Top\-pp;
TT: temperatures\[bsz×num\_kv\_heads,query\_len\]\[bsz\\times num\\\_kv\\\_heads,query\\\_len\]
Initialize :
TminT\_\{min\},TmaxT\_\{max\}
k←Num\_Selected\_TopP\(S,p\)k\\leftarrow Num\\\_Selected\\\_TopP\(S,p\)
do
2
Tmid←Tmin\+Tmax2T\_\{mid\}\\leftarrow\\frac\{T\_\{min\}\+T\_\{max\}\}\{2\}Solution 1z^\\displaystyle\\hat\{z\}←g\(z/Tmid\)\\displaystyle\\leftarrow g\(z/T\_\{mid\}\)S^\\displaystyle\\hat\{S\}←Softmax\(z^\)\\displaystyle\\leftarrow\\operatorname\{Softmax\}\(\\hat\{z\}\)Tmin\\displaystyle T\_\{min\}←WHERE\(Mk\(S\)<Mk\(S^\),Tmid,Tmin\)\\displaystyle\\leftarrow\\operatorname\{WHERE\}\(M\_\{k\}\(S\)<M\_\{k\}\(\{\\hat\{S\}\}\),T\_\{mid\},T\_\{min\}\)Solution 2z^\\displaystyle\\hat\{z\}←g\(z\)\\displaystyle\\leftarrow g\(z\)S^\\displaystyle\\hat\{S\}←Softmax\(z^/Tmid\)\\displaystyle\\leftarrow\\operatorname\{Softmax\}\(\\hat\{z\}/T\_\{mid\}\)Tmax\\displaystyle T\_\{max\}←WHERE\(Mk\(S\)<Mk\(S^\),Tmax,Tmid\)\\displaystyle\\leftarrow\\operatorname\{WHERE\}\(M\_\{k\}\(S\)<M\_\{k\}\(\{\\hat\{S\}\}\),T\_\{max\},T\_\{mid\}\)
3while*‖Mk\(S\)−Mk\(S^\)‖<ℰ\\\|M\_\{k\}\(S\)\-M\_\{k\}\(\\hat\{S\}\)\\\|<\\mathcal\{E\}andALL\(Tmin<Tmax\)\\operatorname\{ALL\}\(T\_\{min\}<T\_\{max\}\)*;
Figure 17:An example of solved temperatures forR\-KV\. The solution of temperatures are relatively stable despite exceptional failures\.In this section, we will provide the two choices of algorithms to solve Problem[1](https://arxiv.org/html/2606.28831#Thmproblem1)under the order\-invariance Constraint[1](https://arxiv.org/html/2606.28831#Thmconstraint1)\.
Algorithm[1](https://arxiv.org/html/2606.28831#alg1)uses Gradient Descend to optimize temperaturesTTas parameters\. This algorithm treats the Problem[1](https://arxiv.org/html/2606.28831#Thmproblem1)as an optimization problem that can be solved by modern optimizers\(Kingma and Ba,[2014](https://arxiv.org/html/2606.28831#bib.bib47); Loshchilov and Hutter,[2017](https://arxiv.org/html/2606.28831#bib.bib48); Ruder,[2016](https://arxiv.org/html/2606.28831#bib.bib49)\)\. In practice, we prefer Adam\(Kingma and Ba,[2014](https://arxiv.org/html/2606.28831#bib.bib47)\)which converge quicker and have stable performances across different settings\.
Algorithm[2](https://arxiv.org/html/2606.28831#alg2)views Problem[1](https://arxiv.org/html/2606.28831#Thmproblem1)as a searching problem, when the transformation function satisfy certain constraints \(like being concave\), multiplying logitszzby1T\\frac\{1\}\{T\}yields sharpened distribution for lowerTTand flattened distribution for higherTT\. The sharpened/flattened distribution can be this way shaped to satisfy the mass\-preserving in Problem[1](https://arxiv.org/html/2606.28831#Thmproblem1)
In practice, we apply Algorithm[1](https://arxiv.org/html/2606.28831#alg1)for bothSnapKVandR\-KV\. An example of solved temperaturesTTis shown in Figure[17](https://arxiv.org/html/2606.28831#A4.F17)\. The results is largely constant in different decoding steps, while failure cases exist because of numerical instability and the illness of the problem\. For this reason, we only solve the temperatures for the first compression in each request, since solving temperatures in every decoding step is not necessary\.
### D\.2Proof of order\-preserving temperature transformation
In this part, will prove that when applying function \(like max\-pool\) on the raw logitszzdo not affect the Top\-kkselection when applied on the attention scoreSS\(order\-preserving\), we can shape the attention distribution by directly conduct temperature transformationz/Tz/Twithout changing the selection results\. Letzzdenote the attention logits andS\(⋅\)=Softmax\(⋅\)S\(\\cdot\)=\\operatorname\{Softmax\}\(\\cdot\)\. We consider a pooling or aggregation functiong\(⋅\)g\(\\cdot\)\(such asMaxPool\\operatorname\{MaxPool\}\) applied to chunks of the input\. We defineS^\\hat\{S\}as the altered attention score derived from the logits\.
###### Definition D\.1\(Monotonic Consistency\)\.
Letg\(⋅\)g\(\\cdot\)be a function such that the ranking of its outputs is preserved under element\-wise monotonic transformation\. Specifically,ggsatisfies the following condition:
###### Condition 1\.
For any two inputsz\(1\)z\_\{\(1\)\}andz\(2\)z\_\{\(2\)\}, ifS\(g\(z\(1\)\)\)≥S\(g\(z\(2\)\)\)S\\left\(g\(z\_\{\(1\)\}\)\\right\)\\geq S\\left\(g\(z\_\{\(2\)\}\)\\right\), then:
g\(S\(z\(1\)\)\)≥g\(S\(z\(2\)\)\)g\(S\(z\_\{\(1\)\}\)\)\\geq g\(S\(z\_\{\(2\)\}\)\)
This condition holds for order\-statistic functions such asMaxPool\\operatorname\{MaxPool\},MinPool\\operatorname\{MinPool\}, andMedian\\operatorname\{Median\}, but not necessarily for arithmetic means\.
###### Lemma D\.2\.
For any temperatureT\>0T\>0, the scaling functionf\(x\)=x/Tf\(x\)=x/Tis strictly monotonically increasing\. Consequently, for anya,b∈ℝa,b\\in\\mathbb\{R\}:
a≥b⇔a/T≥b/Ta\\geq b\\iff a/T\\geq b/T
###### Proof\.
The proof follows from the definition of a strictly increasing function on strictly positive denominators\. ∎
###### Theorem D\.3\.
LetTopK\(⋅\)\\operatorname\{TopK\}\(\\cdot\)be the operator that returns the set of indices corresponding to thekklargest values\. Ifg\(⋅\)g\(\\cdot\)satisfies Condition[1](https://arxiv.org/html/2606.28831#Thmcondition1), then the selection of key indices is invariant to temperature scalingT\>0T\>0and the order of Softmax application\. Formally:
S^\|k≡g\(z\)\|k\\hat\{S\}\|\_\{k\}\\equiv g\(z\)\|\_\{k\}whereS^=S\(g\(z/T\)\)\\hat\{S\}=S\(g\(z/T\)\),\|k\|\_\{k\}represent the Top\-kksubset\.
###### Proof\.
We aim to show that the ranking of elements inS^\\hat\{S\}is identical to the ranking of elements ing\(z\)g\(z\)\.
Consider two arbitrary indices\(1\)\(1\)and\(2\)\(2\)such that the pooled logit of\(1\)\(1\)is greater than or equal to\(2\)\(2\):
g\(z\(1\)\)≥g\(z\(2\)\)g\(z\_\{\(1\)\}\)\\geq g\(z\_\{\(2\)\}\)
Step 1: Invariance to Temperature \(Lemma[D\.2](https://arxiv.org/html/2606.28831#A4.Thmtheorem2)\) SinceT\>0T\>0, applying Lemma[D\.2](https://arxiv.org/html/2606.28831#A4.Thmtheorem2):
g\(z\(1\)\)/T≥g\(z\(2\)\)/T⟹g\(z\(1\)/T\)≥g\(z\(2\)/T\)g\(z\_\{\(1\)\}\)/T\\geq g\(z\_\{\(2\)\}\)/T\\implies g\(z\_\{\(1\)\}/T\)\\geq g\(z\_\{\(2\)\}/T\)\(Note: Sinceggis order\-preserving,g\(cx\)=cg\(x\)g\(cx\)=cg\(x\)forc\>0c\>0in the case of MaxPool, or generally preserves rank\)\.
Step 2: Monotonicity of Softmax The Softmax functionS\(⋅\)S\(\\cdot\)is strictly monotonic with respect to its input arguments\. Therefore:
g\(z\(1\)/T\)≥g\(z\(2\)/T\)⇔S\(g\(z\(1\)/T\)\)≥S\(g\(z\(2\)/T\)\)g\(z\_\{\(1\)\}/T\)\\geq g\(z\_\{\(2\)\}/T\)\\iff S\\left\(g\(z\_\{\(1\)\}/T\)\\right\)\\geq S\\left\(g\(z\_\{\(2\)\}/T\)\\right\)LetS^=S\(g\(z/T\)\)\\hat\{S\}=S\(g\(z/T\)\)\. We have established that the order ofS^\\hat\{S\}is determined strictly by the order ofg\(z\)g\(z\)\.
Step 3: Generalization via Condition[1](https://arxiv.org/html/2606.28831#Thmcondition1) Using Condition[1](https://arxiv.org/html/2606.28831#Thmcondition1), we ensure that this ranking is consistent with the ”true” attention distribution \(had Softmax been applied before pooling\)\. SinceS\(g\(⋅\)\)S\(g\(\\cdot\)\)preserves the order ofg\(⋅\)g\(\\cdot\), and Condition[1](https://arxiv.org/html/2606.28831#Thmcondition1)links this tog\(S\(⋅\)\)g\(S\(\\cdot\)\), we conclude that identifying the Top\-K elements in the computationally efficient proxyS^\\hat\{S\}yields the same indices as the original computation\.
Thus, the Top\-K indices remain invariant\. ∎
## Appendix EThe Formal Definitions of the Three\-step Framework for KV Selection
Figure 18:\(Step 1\)Gatherthe KV cache in subgroups along the sequence for following computation; \(Step 2\)Reducein the subgroup to calculate ranking score andscatterto the post\-compressed blocks; \(Step 3\)Sampleby the scattered score to satisfy Top\-kkor Top\-ppconstraints\.LetS\(i\)=S\(𝐪,𝐤\(i\)\)S\_\{\(i\)\}=S\(\\mathbf\{q\},\\mathbf\{k\}\_\{\(i\)\}\)denote the attention score for a query𝐪\\mathbf\{q\}and key𝐤\(i\)\\mathbf\{k\}\_\{\(i\)\}\. The framework proceeds as follows:
Gather: We define a set of indicesGiG\_\{i\}based on a grouping functionff\(e\.g\., quantization or similarity\) and a corresponding equivalence relation:
Gi=\{j∣f\(kj,Si\)∼f\(ki,Si\)\}\.G\_\{i\}=\\\{j\\mid f\(k\_\{j\},S\_\{i\}\)\\sim f\(k\_\{i\},S\_\{i\}\)\\\}\.\(1\)
Reduce\-Scatter: We calculate a proxy scoreS^i\\hat\{S\}\_\{i\}for the group and scatter it to the associated indices:
S^i=REDUCE\(\{S\(q,kj\)∣j∈Gi\}\)\\displaystyle\\hat\{S\}\_\{i\}=\\operatorname\{REDUCE\}\(\\\{S\(q,k\_\{j\}\)\\mid j\\in G\_\{i\}\\\}\)\(2\)SCATTER\(Sj:=S^i∣j∈G^i\)\.\\displaystyle\\operatorname\{SCATTER\}\(S\_\{j\}=\\hat\{S\}\_\{i\}\\mid j\\in\\hat\{G\}\_\{i\}\)\.
Note that the reduction function may operate on the keyskik\_\{i\}, necessitating the re\-computation ofS^i\\hat\{S\}\_\{i\}\. In KV cache compression scenarios, the original subgroupGiG\_\{i\}need not satisfyG^i≡Gi\\hat\{G\}\_\{i\}\\equiv G\_\{i\}, whereG^i\\hat\{G\}\_\{i\}represents a compressed subgroup \(occupying less memory\)\.
Sample: The final set of indicesℐfinal\\mathcal\{I\}\_\{\\text\{final\}\}is determined by:
ℐfinal=SELECT\(\{S^i\}\),\\mathcal\{I\}\_\{\\text\{final\}\}=\\operatorname\{SELECT\}\(\\\{\\hat\{S\}\_\{i\}\\\}\),\(3\)whereSELECT∈\{Top\-k,Top\-p\}\\operatorname\{SELECT\}\\in\\\{\\text\{Top\-\}k,\\text\{Top\-\}p\\\}\.
We perform logit calibrations \(Section[3\.2](https://arxiv.org/html/2606.28831#S3.SS2)\) after Step 2,Reduce\-Scatter, which solve the Problem[1](https://arxiv.org/html/2606.28831#Thmproblem1)by any one of the above Algorithm[1](https://arxiv.org/html/2606.28831#alg1)and[2](https://arxiv.org/html/2606.28831#alg2)\. The solution process, can be alternatively conducted in subgroups generated in Step 1,Gather, bringing better efficiency in optimizing on temperaturesTT\. For the reason stated in Appendix[D\.1](https://arxiv.org/html/2606.28831#A4.SS1), we reuse the temperatures solved in the first compression step\.
## Appendix FMathematical Abstraction of Cascade Attention
To seamlessly compute attention across these physically separated tiers, we can utilize a composable Attention State algebra proposed by Flashinfer\(Yeet al\.,[2025](https://arxiv.org/html/2606.28831#bib.bib28)\)originally designed for Prefix Caching\(Yeet al\.,[2024](https://arxiv.org/html/2606.28831#bib.bib44)\)\.
Recall that standard self\-attention for a query𝐪\\mathbf\{q\}and a set of KV pairs indexed byIIis computed as:
Attention\(𝐪,I\)=∑i∈Iexp\(𝐪𝐤\(i\)⊤\)𝐯\(i\)∑j∈Iexp\(𝐪𝐤\(j\)⊤\)\.\\text\{Attention\}\(\\mathbf\{q\},I\)=\\frac\{\\sum\_\{i\\in I\}\\exp\(\\mathbf\{q\}\\mathbf\{k\}\_\{\(i\)\}^\{\\top\}\)\\mathbf\{v\}\_\{\(i\)\}\}\{\\sum\_\{j\\in I\}\\exp\(\\mathbf\{q\}\\mathbf\{k\}\_\{\(j\)\}^\{\\top\}\)\}\.\(4\)
The denominator represents the total attention mass, which we term the*Sum of Exponentials*\(SE\(I\)\\operatorname\{\{SE\}\}\(I\)\)\. The numerator represents the unnormalized weighted sum of value vectors\.𝐨\(I\)\\mathbf\{o\}\(I\)is defined as the attention result restricted solely to the subsetII:
𝐨\(I\)=∑i∈ISoftmax\(SE\(i\)\)𝐯\(i\)=∑i∈ISE\(i\)𝐯\(i\)SE\(I\),\\mathbf\{o\}\(I\)=\\sum\_\{i\\in I\}\\operatorname\{Softmax\}\(\\operatorname\{\{SE\}\}\_\{\(i\)\}\)\\mathbf\{v\}\_\{\(i\)\}=\\frac\{\\sum\_\{i\\in I\}\\operatorname\{\{SE\}\}\_\{\(i\)\}\\mathbf\{v\}\_\{\(i\)\}\}\{\\operatorname\{\{SE\}\}\(I\)\},\(5\)
We define theAttention Stateof a tierIIas the tuple\[𝐨\(I\)SE\(I\)\]\\begin\{bmatrix\}\\mathbf\{o\}\(I\)\\\\ \\operatorname\{\{SE\}\}\(I\)\\end\{bmatrix\}, which fully encapsulates the partial computation ofII\. To aggregate results from multiple tiers \(e\.g\., a Sparse TierIIand a Dense TierJJ\), a binaryMerge Operator⊕\\opluscan be applied to merge the partial outputs for the complete outputs:
\[𝐨\(I∪J\)SE\(I∪J\)\]\\displaystyle\\begin\{bmatrix\}\\mathbf\{o\}\(I\\cup J\)\\\\ \\operatorname\{\{SE\}\}\(I\\cup J\)\\end\{bmatrix\}=\[𝐨\(I\)SE\(I\)\]⊕\[𝐨\(J\)SE\(J\)\]\\displaystyle=\\begin\{bmatrix\}\\mathbf\{o\}\(I\)\\\\ \\operatorname\{\{SE\}\}\(I\)\\end\{bmatrix\}\\oplus\\begin\{bmatrix\}\\mathbf\{o\}\(J\)\\\\ \\operatorname\{\{SE\}\}\(J\)\\end\{bmatrix\}\(6\)=\[𝐨\(I\)SE\(I\)\+𝐨\(J\)SE\(J\)SE\(I\)\+SE\(J\)SE\(I\)\+SE\(J\)\]\.\\displaystyle=\\begin\{bmatrix\}\\frac\{\\mathbf\{o\}\(I\)\\operatorname\{\{SE\}\}\(I\)\+\\mathbf\{o\}\(J\)\\operatorname\{\{SE\}\}\(J\)\}\{\\operatorname\{\{SE\}\}\(I\)\+\\operatorname\{\{SE\}\}\(J\)\}\\\\ \\operatorname\{\{SE\}\}\(I\)\+\\operatorname\{\{SE\}\}\(J\)\\end\{bmatrix\}\.
## Appendix GA Case Study on Attention Property\-Aware KV Compression on the Calibrated Logits
The*Logits Calibration*proves to be capable of preserving the density of selection heatmap\. In this section, we will provide another example that preserves Attention property \(like density\), built on top of*Logits Calibration*\.
##### LSE\\operatorname\{\{LSE\}\}\-preserving KV Cache Merging

\(a\)LSE\\operatorname\{\{LSE\}\}for raw attention scores\.
\(b\)P60LSE\\operatorname\{\{LSE\}\}for raw attention scores\.
\(c\)P60 log number of selected tokens \(raw\)\.
\(d\)P60 log number of selected tokens \(maxpool\-ed\)\.
Figure 19:LSE\\operatorname\{\{LSE\}\}and log number of selected tokens\.Observation 1:\(comparing \(a\) and \(b\)\), a drop in log scale when calculatingLSE\\operatorname\{\{LSE\}\}for a givenppcompared with full counterparts;Observation 2:\(comparing \(a\) and \(c\)\), an increasing trend for bothLSE\\operatorname\{\{LSE\}\}and log number of selections\.Observation 3:\(comparing \(c\) and \(d\)\), a clear drop in log scale in selections by maxpool\-ed logits\.\(a\)TheLSE\\operatorname\{LSE\}of the compressed KV under budget of 512 \(SnapKV\)\.
\(b\)TheLSE\\operatorname\{LSE\}of the compressed KV under budget of 512 \(R\-KV\)\.
Figure 20:The two figured above show the drop inLSE\\operatorname\{\{LSE\}\}and a growing magnitude as the seuqence grows\.Recall thatLSE\\operatorname\{\{LSE\}\}is defined as:
LSE\(q,K\)=log\(∑jeqK\(j\)⊤\)\\operatorname\{\{LSE\}\}\(q,K\)=\\log\\left\(\\sum\\limits\_\{j\}e^\{qK^\{\\top\}\_\{\(j\)\}\}\\right\)
This metric is computed by reduction over the whole sequence and is yielded during the computation of Flash Attention\. As shown in Figure[19](https://arxiv.org/html/2606.28831#A7.F19),LSE\\operatorname\{\{LSE\}\}connects closely with the number of selected tokens \(density\) in log scale\.
This intrigues us to investigate if the LSE for the kept tokens can still keep the same trend, when executing strict Top\-kkcompression \(eviction for both methods\), If the trend remains stable, this suggest the possibility of constant memory; otherwise the property of oracle attention is impaired, especially in the our specific scenario, where KV cache are periodically compressed to fixed budget during long\-context decoding\.
As shown in Figure[20](https://arxiv.org/html/2606.28831#A7.F20), both methods have encountered clear drop in LSE which accumulates continually as the sequence grows\. Intuitively, LSE serves as a health indicator for attention computation \(as it is soft argmax operator\) and should reflect or assess the illness of a periodic compression methods\. For both TopK selection, we have observed loss in LSE \(in Figure[20](https://arxiv.org/html/2606.28831#A7.F20)\) , which leads to discussion in the next section on how to preserve LSE when pruning KV Cache\.
DenoteSE\(q,K\)=∑jeqK\(j\)⊤\\operatorname\{\{SE\}\}\(q,K\)=\\sum\\limits\_\{j\}e^\{qK^\{\\top\}\_\{\(j\)\}\}, we have:
Ooracle\\displaystyle O\_\{oracle\}=∑ieqK\(i\)⊤V\(i\)SE\(q,K\)\\displaystyle=\\sum\\limits\_\{i\}\\frac\{e^\{qK^\{\\top\}\_\{\(i\)\}\}V\_\{\(i\)\}\}\{\\operatorname\{\{SE\}\}\\left\(q,K\\right\)\}\(7\)O1\\displaystyle O\_\{1\}=\(Ooracle−eqK\(j\)⊤V\(j\)\+eqK\(k\)⊤V\(k\)SE\(q,K\)\)×SE\(q,K\)SE\(q,K\\\{K\(j\),K\(k\)\}∪\{f\(K\(j\),K\(k\)\)\}\)\\displaystyle=\\left\(O\_\{oracle\}\-\\frac\{e^\{qK^\{\\top\}\_\{\(j\)\}\}V\_\{\(j\)\}\+e^\{qK^\{\\top\}\_\{\(k\)\}\}V\_\{\(k\)\}\}\{\\operatorname\{\{SE\}\}\\left\(\{q,K\}\\right\)\}\\right\)\\times\\frac\{\\operatorname\{\{SE\}\}\\left\(q,K\\right\)\}\{\\operatorname\{\{SE\}\}\\left\(q,K\\backslash\\left\\\{K\_\{\(j\)\},K\_\{\(k\)\}\\right\\\}\\cup\\left\\\{f\(K\_\{\(j\)\},K\_\{\(k\)\}\)\\right\\\}\\right\)\}\+eqf\(K\(j\)⊤,K\(k\)⊤\)g\(V\(j\),V\(k\)\)SE\(q,K\\\{K\(j\),K\(k\)\}∪\{f\(K\(j\),K\(k\)\)\}\)\\displaystyle\+\\frac\{e^\{qf\(K^\{\\top\}\_\{\(j\)\},K^\{\\top\}\_\{\(k\)\}\)\}g\(V\_\{\(j\)\},V\_\{\(k\)\}\)\}\{\\operatorname\{\{SE\}\}\\left\(q,K\\backslash\\left\\\{K\_\{\(j\)\},K\_\{\(k\)\}\\right\\\}\\cup\\left\\\{f\(K\_\{\(j\)\},K\_\{\(k\)\}\)\\right\\\}\\right\)\}
OoracleO\_\{oracle\}is the precise output of the self\-attention\. For the initial compression, we denote the compressed output asO1O\_\{1\}\(will be extended toOiO\_\{i\}in further compressions\), the concept of the ”first” compression is limited to the selection of two pairs of KV cache for demonstration\. This single step of compression does not have to make a complete operation\. It is only theminimal unitthat we can analyze the intervention on the attention output forselective compressionmethods\. TheK\(j\)K\_\{\(j\)\}\(V\(j\)V\_\{\(j\)\}\) represent thejj\-th key \(value\) vectors in the whole sequence \(the index in sequence will be wrapped in bracelets by default\)\. For each specific compression method, we formalize by pair\-wise mapping, i\.e\.f\(K\(j\),K\(k\)\):=\(K\(j\),K\(k\)\)→K^\(j\)f\(K\_\{\(j\)\},K\_\{\(k\)\}\):=\(K\_\{\(j\)\},K\_\{\(k\)\}\)\\rightarrow\\hat\{K\}\_\{\(j\)\}\. For evictions,ffis formularized asf\(Kanchor,Kevict\)=𝟘⋅Kanchor\+𝟙⋅Kevictf\(K\_\{anchor\},K\_\{evict\}\)=\\mathbb\{0\}\\cdot K\_\{anchor\}\+\\mathbb\{1\}\\cdot K\_\{evict\}; for merging methods,ffcan bef\(K\(j\),K\(k\)\)=w1⋅K\(j\)\+w2⋅K\(k\)f\(K\_\{\(j\)\},K\_\{\(k\)\}\)=w\_\{1\}\\cdot K\_\{\(j\)\}\+w\_\{2\}\\cdot K\_\{\(k\)\}or in a more complex form\.
In the following deduction, we will simplify some representations for ease of notations\.KKandVVwill be the current key/value vectors in theii\-th compression steps, andK^\\hat\{K\}andV^\\hat\{V\}will be in the\(i\+1\)\(i\+1\)\-th step, i\.e\.K^=f\(K\(j\),K\(k\)\)\\hat\{K\}=f\(K\_\{\(j\)\},K\_\{\(k\)\}\)\(V^=g\(V\(j\),V\(k\)\)\\hat\{V\}=g\(V\_\{\(j\)\},V\_\{\(k\)\}\)\)\.OiO\_\{i\}andSEi\\operatorname\{\{SE\}\}\_\{i\}represent theii\-th reduction result \(without bracelets\)\.
Oi\\displaystyle O\_\{i\}=\(Oi−1−eqK\(j\)⊤V\(j\)\+eqK\(k\)⊤V\(k\)SEi−1\)\\displaystyle=\\left\(O\_\{i\-1\}\-\\frac\{e^\{qK^\{\\top\}\_\{\(j\)\}\}V\_\{\(j\)\}\+e^\{qK^\{\\top\}\_\{\(k\)\}\}V\_\{\(k\)\}\}\{\\operatorname\{\{SE\}\}\_\{i\-1\}\}\\right\)\(8\)×SEi−1SEi\+eqK^⊤V^SEi\\displaystyle\\times\\frac\{\\operatorname\{\{SE\}\}\_\{i\-1\}\}\{\\operatorname\{\{SE\}\}\_\{i\}\}\+\\frac\{e^\{q\\hat\{K\}^\{\\top\}\}\\hat\{V\}\}\{\\operatorname\{\{SE\}\}\_\{i\}\}ΔOi\\displaystyle\\Delta O\_\{i\}=Oi−Oi−1\\displaystyle=O\_\{i\}\-O\_\{i\-1\}=\(eqKun⊤VunSEi−eqKun⊤VunSEi−1\)⏟Shift in Unchanged Tokens\\displaystyle=\\underbrace\{\\left\(\\frac\{e^\{qK^\{\\top\}\_\{un\}\}V\_\{un\}\}\{\\operatorname\{\{SE\}\}\_\{i\}\}\-\\frac\{e^\{qK^\{\\top\}\_\{un\}\}V\_\{un\}\}\{\\operatorname\{\{SE\}\}\_\{i\-1\}\}\\right\)\}\_\{\\text\{Shift in Unchanged Tokens\}\}\+\(eqK^⊤V^SEi−eqK⊤VSEi−1\)⏟Shift due to Compression\\displaystyle\+\\underbrace\{\\left\(\\frac\{e^\{q\\hat\{K\}^\{\\top\}\}\\hat\{V\}\}\{\\operatorname\{\{SE\}\}\_\{i\}\}\-\\frac\{e^\{qK^\{\\top\}\}V\}\{\\operatorname\{\{SE\}\}\_\{i\-1\}\}\\right\)\}\_\{\\text\{Shift due to Compression\}\}
###### Theorem G\.1\.
Denote a proper linear approximation \(weighted attention score\) of key cache merging as the following :
K^\(new\)=eqK\(j\)⊤eqK\(j\)⊤\+eqK\(k\)⊤K\(j\)\+eqK\(k\)⊤eqK\(j\)⊤\+eqK\(k\)⊤K\(k\)\\hat\{K\}\_\{\(new\)\}=\\frac\{e^\{qK^\{\\top\}\_\{\(j\)\}\}\}\{e^\{qK^\{\\top\}\_\{\(j\)\}\}\+e^\{qK^\{\\top\}\_\{\(k\)\}\}\}K\_\{\(j\)\}\+\\frac\{e^\{qK^\{\\top\}\_\{\(k\)\}\}\}\{e^\{qK^\{\\top\}\_\{\(j\)\}\}\+e^\{qK^\{\\top\}\_\{\(k\)\}\}\}K\_\{\(k\)\}\(9\)The solution to minimize the following:
ΔSEi=\|eqK^\(new\)⊤−\(eqK\(j\)⊤\+eqK\(k\)⊤\)\|\(SEpreserving\)\\Delta\\operatorname\{\{SE\}\}\_\{i\}=\\left\|e^\{q\\hat\{K\}^\{\\top\}\_\{\(new\)\}\}\-\\left\(e^\{qK^\{\\top\}\_\{\(j\)\}\}\+e^\{qK^\{\\top\}\_\{\(k\)\}\}\\right\)\\right\|\\ \(\\operatorname\{\{SE\}\}\\ preserving\)\(10\)
is to let
qK\(j\)⊤≫qK\(k\)⊤\.qK\_\{\(j\)\}^\{\\top\}\\gg qK\_\{\(k\)\}^\{\\top\}\.\(11\)
###### Proof\.
SupposeΔSEi≡0\\Delta\\operatorname\{\{SE\}\}\_\{i\}\\equiv 0, we have the following system of equations:
qK^⊤\\displaystyle q\\hat\{K\}^\{\\top\}=q\(eqK\(j\)⊤K\(j\)\+eqK\(k\)⊤K\(k\)eqK\(j\)⊤\+eqK\(k\)⊤\)⊤\\displaystyle=q\\left\(\\frac\{e^\{qK^\{\\top\}\_\{\(j\)\}\}K\_\{\(j\)\}\+e^\{qK^\{\\top\}\_\{\(k\)\}\}K\_\{\(k\)\}\}\{e^\{qK^\{\\top\}\_\{\(j\)\}\}\+e^\{qK^\{\\top\}\_\{\(k\)\}\}\}\\right\)^\{\\top\}\(9\.a\)eqK^⊤\\displaystyle e^\{q\\hat\{K\}^\{\\top\}\}=eqK\(j\)⊤\+eqK\(k\)⊤\\displaystyle=e^\{qK\_\{\(j\)\}^\{\\top\}\}\+e^\{qK\_\{\(k\)\}^\{\\top\}\}\(9\.b\)
LetqK\(j\)⊤=x1qK\_\{\(j\)\}^\{\\top\}=x\_\{1\}andqK\(k\)⊤=x2qK\_\{\(k\)\}^\{\\top\}=x\_\{2\}, combining Equation[9\.a](https://arxiv.org/html/2606.28831#A7.Ex14)and[9\.b](https://arxiv.org/html/2606.28831#A7.Ex15), we get the following:
x1lnx1\+x2lnx2x1\+x2=ln\(x1\+x2\)\\frac\{x\_\{1\}\\ln x\_\{1\}\+x\_\{2\}\\ln x\_\{2\}\}\{x\_\{1\}\+x\_\{2\}\}=\\ln\\left\(x\_\{1\}\+x\_\{2\}\\right\)\(13\)
Rearranging this leads to the*Entropy*of the attention distribution \(which is always non\-negative\)\.
x1ln\(1\+x2x1\)\+x2ln\(1\+x1x2\)⏟Always\>0=0\.\\underbrace\{x\_\{1\}\\ln\\left\(1\+\\frac\{x\_\{2\}\}\{x\_\{1\}\}\\right\)\+x\_\{2\}\\ln\\left\(1\+\\frac\{x\_\{1\}\}\{x\_\{2\}\}\\right\)\}\_\{\\text\{Always \}\>0\}=0\.\(14\)
Stated algebraically, lett=x2x1t=\\frac\{x\_\{2\}\}\{x\_\{1\}\}, we haveh\(t\)=ln\(1\+t\)\+tln\(1\+1t\)\>0h\(t\)=\\ln\(1\+t\)\+t\\ln\(1\+\\frac\{1\}\{t\}\)\>0\(Jensen Inequality\)\. We would like to solve the condition forttthat leads toh\(t\)→0h\(t\)\\to 0\. We have:
limt→0h\(t\)\\displaystyle\\lim\_\{t\\to 0\}h\(t\)=0\\displaystyle=0\(15\)witht\\displaystyle\\text\{with\\ \}t=x2x1=eqk2⊤eqk1⊤=eq\(k2−k1\)⊤\\displaystyle=\\frac\{x\_\{2\}\}\{x\_\{1\}\}=\\frac\{e^\{qk\_\{2\}^\{\\top\}\}\}\{e^\{qk\_\{1\}^\{\\top\}\}\}=e^\{q\(k\_\{2\}\-k\_\{1\}\)^\{\\top\}\}⇒q\(k2−k1\)⊤→−∞\\displaystyle\\Rightarrow q\(k\_\{2\}\-k\_\{1\}\)^\{\\top\}\\to\-\\infty⟹qk1⊤≫qk2⊤\\displaystyle\\Longrightarrow qk\_\{1\}^\{\\top\}\\gg qk\_\{2\}^\{\\top\}∎
\(a\)TheLSE\\operatorname\{LSE\}absolute error ofSnapKVpruned KV cache \(fixed 512 budget\) v\.s\. full KV cache\.
\(b\)TheLSE\\operatorname\{LSE\}absolute error ofSnapKVpruned\-and\-merged KV cache \(fixed 512 budget\) v\.s\. full KV cache\.
\(c\)TheLSE\\operatorname\{LSE\}absolute error ofR\-KVpruned KV cache \(fixed 512 budget\) v\.s\. full KV cache\.
\(d\)TheLSE\\operatorname\{LSE\}absolute error ofR\-KVpruned\-and\-merged KV cache \(fixed 512 budget\) v\.s\. full KV cache\.
Figure 21:The absolute error comparison between w\. and w\.o\.LSE\\operatorname\{\{LSE\}\}\-preserved merging forSnapKVandR\-KVAs shown in the figure[21\(a\)](https://arxiv.org/html/2606.28831#A7.F21.sf1)[21\(b\)](https://arxiv.org/html/2606.28831#A7.F21.sf2)[21\(c\)](https://arxiv.org/html/2606.28831#A7.F21.sf3)and[21\(d\)](https://arxiv.org/html/2606.28831#A7.F21.sf4), withLSE\\operatorname\{\{LSE\}\}\-preserved merging, the tendency of increasing absolute error is clearly suppressed\. We will provide a more thorough analysis in the Appendix to explain why pruning long\-tailed KV cache will result in a growing absolute error and whyLSE\\operatorname\{\{LSE\}\}\-preserve merging can suppress such tendency\.Similar Articles
CompressKV: Semantic-Retrieval-Guided KV-Cache Compression for Resource-Efficient Long-Context LLM Inference
CompressKV proposes a semantic-retrieval-guided KV-cache compression method for GQA-based LLMs, identifying Semantic Retrieval Heads to retain critical tokens. It achieves over 97% full-cache performance using only 3% of the KV cache on LongBench tasks.
KV Cache Is Becoming the Memory Hierarchy of Inference
The article discusses how the KV cache is evolving into a memory hierarchy for LLM inference, optimizing memory management during decoding.
SeKV: Resolution-Adaptive KV Cache with Hierarchical Semantic Memory for Long-Context LLM Inference
SeKV is a resolution-adaptive KV cache method that organizes context into entropy-guided semantic spans stored across a GPU-CPU hierarchy, enabling selective token-level reconstruction during decoding while reducing GPU memory by 53.3% versus full caching at 128K context.
MosaicKV: Serving Long-Context LLM with Dynamic Two-D KV Cache Compression
MosaicKV introduces dynamic two-dimensional KV cache compression for long-context LLM serving, achieving up to 16x attention speedup and 3x memory reduction with minimal accuracy loss.
LKV: End-to-End Learning of Head-wise Budgets and Token Selection for LLM KV Cache Eviction
This paper introduces LKV, a method for end-to-end learning of head-wise budgets and token selection to optimize KV cache eviction in large language models, achieving state-of-the-art performance with high compression rates.