Epiphany-Aware KV Cache Eviction Without the Attention Matrix
Summary
This paper introduces EpiKV, a KV cache eviction method that scores token importance via changes in internal representations (epiphany score) instead of attention weights, avoiding the need to materialize the attention matrix. It achieves competitive performance on reasoning benchmarks while enabling up to 16× longer context lengths.
# Epiphany-Aware KV Cache Eviction Without the Attention Matrix
Source: [https://arxiv.org/html/2606.26472](https://arxiv.org/html/2606.26472)
Steven Kolawole, Language Technologies Institute, Carnegie Mellon University, skolawol@cs.cmu.edu; Virginia Smith, Machine Learning Department, Carnegie Mellon University, smithv@cmu.edu
###### Abstract
As reasoning models emit chains of thought tens of thousands of tokens long, the KV cache increasingly becomes a deployment bottleneck. Existing cache eviction methods rank tokens by attention weight, which is a noisy importance proxy in long reasoning traces and which prohibits the use of fused kernels in production inference by forcing the model to materialize the attention matrix. In this work, we instead score tokens with a metric we term the *epiphany* score: the change in the model’s internal representation, read directly from the forward pass with no attention matrix and negligible extra state. Our resulting cache eviction method, EpiKV, requires no training, classifier, or custom kernel, and can be used directly in FlashAttention inference stacks unchanged, scaling to a $16\times$ longer feasible context than attention-based scoring. At a 4096-token cache, EpiKV reaches 72% on MATH-500, matching the strongest attention-based baseline (ThinKV 71%, H2O 67%); a lag-normalized KV variant reaches 37% on AIME-2024 at 8192 tokens against the best of them (33%), at up to $2.8\times$ the speed.
## 1 Introduction
Reasoning models such as DeepSeek-R1 (Guo et al., [2025](https://arxiv.org/html/2606.26472#bib.bib19)) solve hard problems by generating long chains of thought; a single competition-mathematics problem can take tens of thousands of tokens of internal reasoning before an answer. The key–value (KV) cache grows linearly with this length and quickly becomes the memory bottleneck of deployment: at $10^{4}$–$10^{5}$ decode tokens it dominates device memory and caps the batch size a server can hold (Kwon et al., [2023](https://arxiv.org/html/2606.26472#bib.bib1)). KV cache eviction addresses this by retaining only a budget of $K$ tokens, but it raises the question every method must answer: which tokens matter?
Existing decode-time eviction methods for reasoning traces answer this question with the attention weight (Zhang et al., [2023](https://arxiv.org/html/2606.26472#bib.bib9); Ramachandran et al., [2026](https://arxiv.org/html/2606.26472#bib.bib11); Hu et al., [2025](https://arxiv.org/html/2606.26472#bib.bib12); Su et al., [2026](https://arxiv.org/html/2606.26472#bib.bib24)). However, attention weight has two critical drawbacks. First, it is a noisy proxy for importance: attention sinks absorb weight regardless of content (Xiao et al., [2024](https://arxiv.org/html/2606.26472#bib.bib10)), and filler tokens attract weight while being generated yet are never referenced again. Second, it is architecturally expensive: reading the attention weights requires materializing the attention matrix, which state-of-the-art approaches such as FlashAttention are built to avoid (Dao, [2024](https://arxiv.org/html/2606.26472#bib.bib15)). Setting `output_attentions=True` forces the eager kernel and exhausts an 80 GB A100 below the length of almost every reasoning trace, while a FlashAttention pass over the same model scales an order of magnitude further (Figure [1](https://arxiv.org/html/2606.26472#S1.F1)).
Figure 1: Peak GPU memory of a single forward pass vs. context length on an 80 GB A100. Reading attention weights (eager) grows quadratically and exhausts the GPU at 8192 tokens; the pass our method reads from scales to 65,536, a $16\times$ longer feasible context. This is the architectural payoff of not needing the attention matrix (detailed in §[4.4](https://arxiv.org/html/2606.26472#S4.SS4)).

We introduce *epiphany-aware* KV cache eviction (EpiKV), which scores tokens by the change in the model’s internal representation (hidden state at specific layers and KV vectors) rather than by attention weight, and is read from the standard forward pass with no attention matrix. The name refers to the transition points in a reasoning trace (e.g., a concluded step, a committed insight) where the residual stream shifts most, which we find are the tokens worth keeping.
#### Contributions\.
1. We identify a two-band layer anatomy in a 32-layer reasoning model: hidden-state change at layers 7–13 (Band A) correlates positively with token importance and at layers 18–25 (Band B) negatively, measured against counterfactual occlusion labels. The combined signal outperforms every attention-based signal we test.
2. We find that the raw signal carries a monotonic positional trend within a trace, so it tracks position as much as content, and we show that a causal rolling $z$-score removes it, recovering eviction quality.
3. At deployable budgets our attention-matrix-free methods match or exceed the strongest attention-based baselines on both MATH-500 and AIME-2024 (§[4.3](https://arxiv.org/html/2606.26472#S4.SS3)).
4. We quantify the engineering payoff: our method runs up to $2.8\times$ faster than attention-based eviction at equal budget, and avoids the attention-matrix memory wall that makes attention-based scoring infeasible at long context.
Together these make eviction deployable in standard FlashAttention serving stacks \(§[5](https://arxiv.org/html/2606.26472#S5.SS0.SSS0.Px2)\)\. We release the counterfactual importance labels as a validation resource\.
Figure 2: How the importance score is computed at one decode step. A single forward pass over the model’s layers yields hidden states and the KV cache. We read the hidden-state change at Band A (layers 7–13, positively correlated with importance) and Band B (layers 18–25, negatively correlated), combine them into a causal rolling $z$-score $s(t) = z_{\mathrm{A}} - z_{\mathrm{B}}$, and keep the top-$K$ tokens in the KV cache. Unlike attention-based eviction, the score needs no $n \times n$ attention matrix, so it adds no memory and is compatible with fused attention kernels (e.g. FlashAttention) and the inference stacks built on them. Diagram annotation: attention-based scoring builds the $n \times n$ matrix ($\mathcal{O}(n^{2})$ memory); we never do.
## 2 Related Work
#### Attention\-based KV eviction\.
Most eviction methods rank tokens by attention weight. H2O keeps cumulative-attention “heavy hitters” (Zhang et al., [2023](https://arxiv.org/html/2606.26472#bib.bib9)); StreamingLLM keeps attention sinks plus a recent window (Xiao et al., [2024](https://arxiv.org/html/2606.26472#bib.bib10)); SnapKV selects context tokens from an end-of-prompt observation window (Li et al., [2024a](https://arxiv.org/html/2606.26472#bib.bib16)); PyramidKV allocates larger budgets to lower layers (Cai et al., [2025](https://arxiv.org/html/2606.26472#bib.bib21)); and ChunkKV evicts contiguous chunks to preserve local semantics (Liu et al., [2025](https://arxiv.org/html/2606.26472#bib.bib20)). These target long *inputs*, and all need the attention distribution, which requires materializing the $n \times n$ attention matrix and so rules out the fused kernels (e.g. FlashAttention (Dao et al., [2022](https://arxiv.org/html/2606.26472#bib.bib14); Dao, [2024](https://arxiv.org/html/2606.26472#bib.bib15))) that production inference relies on. We measure this cost directly (Section [4](https://arxiv.org/html/2606.26472#S4)).
#### Reasoning\-aware eviction\.
A second line targets the long *generation* traces of reasoning models, where attention is non-monotonic and milestone tokens matter long after they are last attended. ThinKV classifies thought segments by attention sparsity and applies per-type quantization and eviction via a custom kernel (Ramachandran et al., [2026](https://arxiv.org/html/2606.26472#bib.bib11)), needing the attention weights, an offline calibration of its sparsity thresholds and layer subset, and a token-block refresh window; RaaS uses an attention-refreshed LRU timestamp with full prefill preservation (Hu et al., [2025](https://arxiv.org/html/2606.26472#bib.bib12)); LongFlow scores by $\lVert \mathrm{softmax}(\mathit{scores})\,V \rVert_{1}$ on the same model class (Su et al., [2026](https://arxiv.org/html/2606.26472#bib.bib24)); AhaKV (Gu et al., [2025](https://arxiv.org/html/2606.26472#bib.bib23)) and CAOTE (Goel et al., [2025](https://arxiv.org/html/2606.26472#bib.bib22)) refine attention-based scores; and LagKV normalizes KV statistics against a lagged window, avoiding attention (Liang et al., [2025](https://arxiv.org/html/2606.26472#bib.bib13)). Except for LagKV, all derive their signal from attention. We instead use representational change in the residual stream and cached KV vectors.
#### Retrieval and quantization\.
Orthogonal directions reduce KV cost without choosing which tokens to drop: retrieval keeps every token and fetches a subset per step (Tang et al., [2024](https://arxiv.org/html/2606.26472#bib.bib26); Liu et al., [2026](https://arxiv.org/html/2606.26472#bib.bib27)), SideQuest prompts the model to delete stale tool responses (Kariyappa and Suh, [2026](https://arxiv.org/html/2606.26472#bib.bib30)), and quantization lowers the precision of retained entries (Hooper et al., [2024](https://arxiv.org/html/2606.26472#bib.bib25); Sharma et al., [2025](https://arxiv.org/html/2606.26472#bib.bib31)). All are stackable and complementary to our signal.
#### Hidden states as importance signals\.
Mid-network layers carry the model’s load-bearing computation: ROME and MEMIT localise factual recall to mid-layer feed-forward modules (Meng et al., [2022](https://arxiv.org/html/2606.26472#bib.bib17), [2023](https://arxiv.org/html/2606.26472#bib.bib18)), which act as key–value memories (Geva et al., [2021](https://arxiv.org/html/2606.26472#bib.bib29)); these are the same layers (7–13) where we find the strongest positive correlation with token importance. Speculative decoding gives convergent evidence: EAGLE drafts from hidden states, not token embeddings, because they carry richer predictive structure (Li et al., [2024b](https://arxiv.org/html/2606.26472#bib.bib28)).
#### Positioning\.
No prior decode-time eviction method for reasoning traces combines a non-attention importance signal with attention-matrix-free scoring. ThinKV, RaaS, and LongFlow are reasoning-aware but attention-derived; LagKV is attention-free but generic and KV-only. Ours has both, and adds a layer-level account of where importance lives: a positive mid-layer band and a negative upper-mid band.
## 3 Method
### 3.1 Problem setup
During autoregressive decoding the key–value (KV) cache grows by one entry per layer per generated token. For a reasoning trace of $n$ tokens over a model with $L$ layers and $H$ key–value heads of dimension $d_{h}$, the cache holds $2LHd_{h}n$ scalars, which for traces of $10^{4}$–$10^{5}$ tokens dominates device memory. Decode-time eviction caps the cache at a budget of $K$ tokens: at each step a policy scores the cached positions and retains the $K$ highest-scoring ones, discarding the rest permanently.
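To make the $2LHd_{h}n$ count concrete, the following sketch converts it to bytes. Only $L = 32$ comes from this paper; the values `n_kv_heads=8`, `head_dim=128`, and 16-bit storage are assumptions about the model configuration, used here purely for illustration.

```python
GIB = 1024 ** 3

def kv_cache_bytes(n_tokens, n_layers=32, n_kv_heads=8, head_dim=128,
                   bytes_per_scalar=2):
    """Bytes held by the KV cache: 2 (keys and values) * L * H * d_h * n scalars.

    n_kv_heads, head_dim and bytes_per_scalar are assumed values, not
    figures reported in the paper.
    """
    scalars = 2 * n_layers * n_kv_heads * head_dim * n_tokens
    return scalars * bytes_per_scalar

if __name__ == "__main__":
    for n in (4096, 32768, 100_000):
        print(f"{n:>7} tokens: {kv_cache_bytes(n) / GIB:.2f} GiB")
```

Under these assumptions each cached token costs 128 KiB, so the cache grows linearly and a budget of $K$ tokens caps it at $K$ times that amount regardless of trace length.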
We call a policy *FlashAttention-compatible* (FA2-compatible) if it computes its score using only (i) the cached keys and values, which already reside in high-bandwidth memory, and (ii) the per-layer hidden states exposed by the standard `output_hidden_states` interface. Such a policy never requests `output_attentions` and never materializes the $n \times n$ attention matrix, so it runs inside a FlashAttention forward pass without forcing the eager fallback. A policy is *attention-requiring* if it needs the attention matrix (equivalently, `output_attentions=True`), which disables FlashAttention’s tiling and reintroduces $\mathcal{O}(n^{2})$ peak memory for scoring. Figure [2](https://arxiv.org/html/2606.26472#S1.F2) contrasts the two regimes.
### 3.2 The hidden-state variance signal
Attention weight, the proxy used by every prior decode\-time eviction method for reasoning traces, is both a noisy importance signal and architecturally costly to extract \(§[1](https://arxiv.org/html/2606.26472#S1)\); we keep these two objections separate throughout\.
Our signal is the per-token change in the residual stream. For layer $l$ and decode position $t$, let $h_{l}(t)$ be the hidden state and define the L2 diff

$$g_{l}(t) = \lVert h_{l}(t) - h_{l}(t-1) \rVert_{2}. \tag{1}$$

A large $g_{l}(t)$ indicates that generating token $t$ shifted the model’s internal state at layer $l$, which is the signature of a consequential token (an intermediate result, a concluded step, a transition from exploratory to convergent reasoning) rather than fluent filler. We refer to these transition points as *epiphany* tokens.
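As a minimal sketch of Equation (1), the function below computes the L2 diff between consecutive hidden states at one layer. Plain Python lists stand in for hidden-state tensors, and skipping the first position (which has no predecessor) is an assumption, since the text does not say how $t = 0$ is handled.

```python
import math

def l2_diff(h_prev, h_curr):
    """Euclidean norm of the change between two hidden-state vectors."""
    return math.sqrt(sum((c - p) ** 2 for p, c in zip(h_prev, h_curr)))

def layer_diffs(hidden_states):
    """g_l(t) for t = 1 .. n-1, given the layer-l hidden state at each position.

    The first position is skipped because it has no predecessor
    (an assumption made for this sketch).
    """
    return [l2_diff(hidden_states[t - 1], hidden_states[t])
            for t in range(1, len(hidden_states))]

if __name__ == "__main__":
    toy = [[0.0, 0.0], [3.0, 4.0], [3.0, 4.0], [3.0, 5.0]]
    print(layer_diffs(toy))  # large value = large shift in the residual stream
```

In a real decode loop only the previous step's hidden state needs to be kept per scored layer, which is why the extra state is small.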
#### The two\-band anatomy\.
A per-layer correlation study against counterfactual importance labels (Section [3.4](https://arxiv.org/html/2606.26472#S3.SS4)) identifies two bands with consistent and opposite behavior on competition mathematics. *Band A* (layers 7–13) has consistently positive Spearman $\rho$: high $g_{l}$ marks an important token. *Band B* (layers 18–25) has consistently negative $\rho$: high $g_{l}$ marks a dispensable token. We interpret the two bands in §[5](https://arxiv.org/html/2606.26472#S5); the split is consistent with mid-layer factual retrieval (Meng et al., [2022](https://arxiv.org/html/2606.26472#bib.bib17), [2023](https://arxiv.org/html/2606.26472#bib.bib18); Geva et al., [2021](https://arxiv.org/html/2606.26472#bib.bib29)). We combine the two bands into a single score
$$s(t) = \bar{g}_{10}(t) - \bar{g}_{21}(t), \tag{2}$$

where $\bar{g}_{l}(t)$ is the rolling mean of $g_{l}$ over the trailing window of $w = 64$ tokens. The window is causal (it uses only positions $\leq t$), so the score for token $t$ never depends on future tokens. Tokens with high $s$ are retained.
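Equation (2) can be sketched as a running sum over a trailing window. The function names are hypothetical, and averaging over the tokens available so far during the first $w$ steps is an assumption; the text does not specify warm-up behaviour.

```python
def causal_rolling_mean(values, window=64):
    """Mean of values[t-window+1 .. t] for every t, using only positions <= t.

    During warm-up (t < window - 1) the mean is taken over the tokens seen
    so far; that choice is an assumption of this sketch.
    """
    out, running = [], 0.0
    for t, v in enumerate(values):
        running += v
        if t >= window:
            running -= values[t - window]  # drop the value leaving the window
        out.append(running / min(t + 1, window))
    return out

def hs_variance_score(g10, g21, window=64):
    """Raw two-band score s(t) = mean_g10(t) - mean_g21(t)."""
    m10 = causal_rolling_mean(g10, window)
    m21 = causal_rolling_mean(g21, window)
    return [a - b for a, b in zip(m10, m21)]

if __name__ == "__main__":
    print(hs_variance_score([1, 2, 3, 4], [4, 3, 2, 1], window=2))
```

Because each output depends only on earlier inputs, appending new tokens never changes a score that has already been emitted.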
#### The temporal\-trend correction\.
The raw score ([2](https://arxiv.org/html/2606.26472#S3.E2)) carries a confound we discovered during analysis and report as a methodological finding. Within a single trace, $\bar{g}_{10}$ tends to decrease and $\bar{g}_{21}$ tends to increase with position, so $s(t)$ tracks position as much as content: in short traces it can rank early (droppable) tokens above late (load-bearing) ones. The aggregate $\rho$ that motivates the bands is driven partly by cross-problem structure and overstates within-trace ranking quality. We correct this with a causal rolling $z$-score,
$$z_{l}(t) = \frac{g_{l}(t) - \mu_{l}(t)}{\sigma_{l}(t) + \varepsilon}, \tag{3}$$

where $\mu_{l}(t)$ and $\sigma_{l}(t)$ are the mean and standard deviation of $g_{l}$ over the trailing window, and score with $z_{10}(t) - z_{21}(t)$. This converts absolute magnitude (position-contaminated) into local deviation (position-agnostic), in the spirit of lag-relative normalization (Liang et al., [2025](https://arxiv.org/html/2606.26472#bib.bib13)) and analytical detrending (Gu et al., [2025](https://arxiv.org/html/2606.26472#bib.bib23)) but applied to hidden-state diffs. The detrended variant, EpiKV, is our primary method.
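The detrending in Equation (3) can be illustrated with a small sketch. It assumes the trailing window includes position $t$, uses the population standard deviation, and picks `eps=1e-6`; none of these details are stated in the text. The demo shows that a purely linear positional drift maps to a constant $z$-score once the window is full, while a local spike still stands out.

```python
import math

def causal_rolling_zscore(values, window=64, eps=1e-6):
    """z(t) = (g(t) - mean) / (std + eps) over the trailing window ending at t."""
    out = []
    for t in range(len(values)):
        win = values[max(0, t - window + 1): t + 1]
        mu = sum(win) / len(win)
        var = sum((v - mu) ** 2 for v in win) / len(win)  # population variance (assumed)
        out.append((values[t] - mu) / (math.sqrt(var) + eps))
    return out

def epikv_score(g10, g21, window=64, eps=1e-6):
    """Detrended two-band score z_10(t) - z_21(t)."""
    z10 = causal_rolling_zscore(g10, window, eps)
    z21 = causal_rolling_zscore(g21, window, eps)
    return [a - b for a, b in zip(z10, z21)]

if __name__ == "__main__":
    drift = [0.01 * t for t in range(200)]   # pure positional trend
    z = causal_rolling_zscore(drift)
    print(round(z[100], 4), round(z[150], 4))  # equal: the trend carries no ranking signal
    spiked = list(drift)
    spiked[120] += 1.0                        # one token with an unusually large shift
    zs = causal_rolling_zscore(spiked)
    print(zs[120] > zs[119])
```

This direct version recomputes each window and costs $\mathcal{O}(w)$ per token; running sums of $g$ and $g^{2}$ would make it constant-time per step.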


Figure 3: Accuracy vs. cache budget on MATH-500 (left, $n = 100$) and AIME-2024 (right, $n = 30$). Solid: FA2-compatible methods; dashed: attention-requiring; dotted: no-eviction ceiling.
### 3.3 Eviction policies
Table [1](https://arxiv.org/html/2606.26472#S3.T1) lists every policy we evaluate. The score for each hidden-state and KV policy is computed once when a token is generated and then frozen, so scoring is fully online and causal. Each policy preserves the prompt (prefill) tokens and a trailing recency window, and applies its budget to the remaining positions. (Footnote 1: Structural-token preservation differs across methods (sinks, recency, prefill); Appendix [D](https://arxiv.org/html/2606.26472#A4) tabulates each, and we treat the differences as a comparison caveat.)
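The retention rule described above (protect prefill and a recency window, then spend the budget on the rest) might look like the following sketch. The function name and the `recency` parameter are hypothetical, and the actual window sizes used per method are given in the paper's appendix, not here.

```python
def select_kept_positions(scores, budget, n_prefill, recency):
    """Return the sorted cache positions to keep.

    scores    : frozen per-token importance scores, one per cached position
    budget    : number of non-protected positions to retain
    n_prefill : leading prompt tokens that are always kept
    recency   : trailing tokens that are always kept (hypothetical parameter)
    """
    n = len(scores)
    protected = set(range(min(n_prefill, n))) | set(range(max(0, n - recency), n))
    candidates = [i for i in range(n) if i not in protected]
    # Highest score first; Python's sort is stable, so ties keep earlier tokens first.
    ranked = sorted(candidates, key=lambda i: scores[i], reverse=True)
    return sorted(protected | set(ranked[:budget]))

if __name__ == "__main__":
    scores = [9.0, 9.0, 0.1, 5.0, 0.2, 4.0, 0.3, 1.0, 1.0]
    print(select_kept_positions(scores, budget=2, n_prefill=2, recency=2))
```

Since scores are frozen at generation time, each decode step only needs to score the newest token and, if the cache is over budget, drop the lowest-ranked unprotected position.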
| Method | Signal | FA2 |
| --- | --- | --- |
| *Attention-requiring baselines* | | |
| H2O | cumulative attention | ✗ |
| ThinKV | R/E/T segment entropy | ✗ |
| RaaS | attention LRU timestamp | ✗ |
| *Hidden-state (ours)* | | |
| HS-variance | $\bar{g}_{10} - \bar{g}_{21}$ | ✓ |
| EpiKV | $z_{10} - z_{21}$ | ✓ |
| Band-adaptive | Band A/B layers | ✓ |
| *KV-vector (ours)* | | |
| KV-key var | key variance | ✓ |
| KV-val var | value variance | ✓ |
| Lag-KV | lag-norm. key + value | ✓ |
| *Hybrid (attention + hidden state)* | | |
| Attn $\times$ HS | cumul. attn + $z_{10}$ | ✗ |
| Segment-HS | ThinKV seg. + HS rank | ✗ |

Table 1: Eviction policies evaluated. FA2 = runs inside a FlashAttention forward pass (reads only cached KV and hidden states); baselines are cited in Section [2](https://arxiv.org/html/2606.26472#S2). The hidden-state and KV-vector families are our contribution; the hybrids isolate the value of combining signals at the cost of FA2 compatibility.

The KV-vector family scores tokens from quantities already in the cache, with no hidden states required. *KV-key* and *KV-val* use the rolling-mean variance of the key and value vectors across head dimension. *Lag-KV* adapts the lag-relative normalization of Liang et al. ([2025](https://arxiv.org/html/2606.26472#bib.bib13)) to streaming decode: each token’s key and value vectors are normalized by the previous chunk’s per-channel range before the variance is taken, which removes domain-level magnitude shifts. We use the previous chunk (causal) rather than the next chunk (look-ahead) used by the original prefill-time formulation.
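One possible reading of the causal lag normalization is sketched below for a single head. The chunk size, the min-max form of the "range" normalization, the handling of the first chunk (left unscored), and summing the key and value scores are all assumptions made for illustration; the text gives only the high-level recipe.

```python
from statistics import pvariance

def channel_range(chunk):
    """Per-channel (min, max) over the vectors of one chunk."""
    dims = range(len(chunk[0]))
    lo = [min(vec[d] for vec in chunk) for d in dims]
    hi = [max(vec[d] for vec in chunk) for d in dims]
    return lo, hi

def lag_scores(vectors, chunk_size, eps=1e-6):
    """Variance across channels after normalizing by the PREVIOUS chunk's range.

    Tokens in the first chunk have no reference chunk and get None
    (an assumption; such tokens would simply be kept).
    """
    scores = [None] * len(vectors)
    for start in range(chunk_size, len(vectors), chunk_size):
        lo, hi = channel_range(vectors[start - chunk_size:start])
        for t in range(start, min(start + chunk_size, len(vectors))):
            normed = [(x - l) / (h - l + eps)
                      for x, l, h in zip(vectors[t], lo, hi)]
            scores[t] = pvariance(normed)
    return scores

def lag_kv_scores(keys, values, chunk_size, eps=1e-6):
    """Key score plus value score per token (the combination rule is assumed)."""
    ks = lag_scores(keys, chunk_size, eps)
    vs = lag_scores(values, chunk_size, eps)
    return [None if k is None else k + v for k, v in zip(ks, vs)]

if __name__ == "__main__":
    keys = [[0.0, 0.0], [1.0, 2.0], [0.5, 1.0], [1.0, 0.0]]
    print(lag_scores(keys, chunk_size=2))
```

Using the previous chunk as the reference keeps the score causal: a token's score is final as soon as it is generated.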
### 3.4 Counterfactual importance labels
The band anatomy rests on ground-truth importance labels obtained by counterfactual occlusion (full protocol in Appendix [C](https://arxiv.org/html/2606.26472#A3)). For each correctly answered trace we slide a 32-token window (stride 16) over the reasoning span, replace it with padding, regenerate the answer from the modified context, and label the window important if the answer changes (logical-OR over overlapping windows). The occlusion feeds the same context length for every window, so the label measures content, not position, unlike an earlier truncation variant that proxied position and inflated attention signals. Regeneration is greedy. The important fraction is $\approx 0.20$ on MATH-500 and 0.52–0.64 on AIME, reflecting that nearly every token of a hard problem is load-bearing.
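The sliding-window labelling can be sketched as follows. `answer_changes` is a hypothetical callback standing in for "pad out tokens `[start, end)`, regenerate greedily, and compare with the original answer"; the treatment of the final partial window is also an assumption.

```python
def occlusion_labels(n_tokens, answer_changes, window=32, stride=16):
    """Per-token importance labels from sliding-window occlusion.

    answer_changes(start, end) -> bool is a hypothetical callback that occludes
    tokens [start, end) and reports whether the regenerated answer differs.
    A token is important if ANY window covering it changes the answer
    (logical OR over overlapping windows).
    """
    labels = [False] * n_tokens
    for start in range(0, n_tokens, stride):
        end = min(start + window, n_tokens)
        if answer_changes(start, end):
            for i in range(start, end):
                labels[i] = True
        if end == n_tokens:  # last window reached the end of the reasoning span
            break
    return labels

if __name__ == "__main__":
    # Toy stand-in: pretend only token 40 is load-bearing.
    labels = occlusion_labels(96, lambda s, e: s <= 40 < e)
    print(sum(labels), "of", len(labels), "tokens labelled important")
```

With a stride of half the window, every interior token is covered by two windows, so a single load-bearing token marks up to 48 neighbours as important; the labels are therefore window-resolution, not token-resolution.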
### 3.5 Experimental setup
#### Model\.
DeepSeek-R1-Distill-LLaMA-8B (32 layers) (Guo et al., [2025](https://arxiv.org/html/2606.26472#bib.bib19)), chosen for direct comparability with ThinKV and for being an open-weight member of the reasoning-model class. Generation is greedy throughout, so reported differences are not sampling noise.
#### Datasets\.
MATH-500 (Hendrycks et al., [2021](https://arxiv.org/html/2606.26472#bib.bib32); Lightman et al., [2024](https://arxiv.org/html/2606.26472#bib.bib33)) is the primary benchmark (competition maths, verifiable boxed answers, traces of $\sim$4k–16k tokens). AIME-2024 tests higher cache pressure with $\sim$16k–32k-token traces. GSM8K (Cobbe et al., [2021](https://arxiv.org/html/2606.26472#bib.bib34)) is used only as a difficulty-regime probe for the layer anatomy (App. [F](https://arxiv.org/html/2606.26472#A6)), not as a head-to-head accuracy benchmark.
#### Budgets and metrics\.
Cache budgets $K \in \{512, 1024, 2048, 4096\}$ on MATH-500 and $\{512, \dots, 8192\}$ on AIME-2024. We report accuracy (exact match on the boxed answer), per-problem wall-clock time, and per-example peak GPU memory (reset before each problem).
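For readers unfamiliar with boxed-answer scoring, a minimal sketch follows. The paper does not describe its extractor, so the choices here (take the last `\boxed{...}`, match braces, compare stripped strings) are assumptions; real graders usually also normalize equivalent forms.

```python
def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth, out = 1, []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out).strip()
        out.append(ch)
    return None  # braces never closed, e.g. a truncated generation

def exact_match(generated, reference):
    """True only if an answer was extracted and equals the reference string."""
    pred = extract_boxed(generated)
    return pred is not None and pred == reference.strip()

if __name__ == "__main__":
    print(exact_match(r"... so the answer is \boxed{\frac{1}{2}}", r"\frac{1}{2}"))
    print(exact_match("", "5"))  # an empty generation counts as wrong
```

Under this scoring an empty or unstructured generation has no extractable answer and is counted as incorrect.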
#### Attention back\-end\.
Attention-requiring policies run in eager mode; FA2-compatible ones run with `flash_attention_2` as a separate configuration (their scoring is unaffected by the back-end, since they never read the attention matrix). Each run uses one GPU of our cluster’s comparable 46–49 GB cards (L40, L40S, RTX 6000 Ada, A6000); see Appendix [H](https://arxiv.org/html/2606.26472#A8).
## 4 Results
We report the H2O failure that motivates a non\-attention signal \(§[4\.1](https://arxiv.org/html/2606.26472#S4.SS1)\), the signal validation behind the two\-band anatomy \(§[4\.2](https://arxiv.org/html/2606.26472#S4.SS2)\), end\-to\-end accuracy at each budget \(§[4\.3](https://arxiv.org/html/2606.26472#S4.SS3)\), the speed and memory profile \(§[4\.4](https://arxiv.org/html/2606.26472#S4.SS4)\), and the difficulty\-regime anatomy \(§[4\.5](https://arxiv.org/html/2606.26472#S4.SS5)\)\. Full tables are in Appendix[B](https://arxiv.org/html/2606.26472#A2)\.
### 4.1 Attention-based eviction collapses
H2O does not degrade gracefully on reasoning traces; it collapses. On MATH-500 its accuracy falls from 67% at a 4096-token budget to 49% at 2048 and to 5% at 1024 (Table [3](https://arxiv.org/html/2606.26472#A2.T3)), the last an order of magnitude below the no-eviction ceiling of 75%. The collapse is empty output rather than wrong output: H2O produces no generated answer on 93 of 100 problems at a 1024-token budget, 48 at 2048, and 27 at 4096 (immediate end-of-sequence); at 512 it instead emits unstructured text with no extractable answer on 99 of 100. This matches the attention-map failure RaaS documents on reasoning traces, and is the empirical case for not deriving the eviction signal from attention.
### 4.2 The two-band importance signal
The two-band anatomy of §[3.2](https://arxiv.org/html/2606.26472#S3.SS2) (Band A positive, Band B negative) holds against the occlusion labels consistently across both competition-mathematics datasets and both attention back-ends (Appendix [A](https://arxiv.org/html/2606.26472#A1)). Cumulative attention (`h2o_attn`) is the weakest signal measured, with $|\rho| \leq 0.09$ on every eager dataset, below every hidden-state band layer. A causal rolling-64 window improves correlation over the raw signal by 32–57% across datasets (Table [10](https://arxiv.org/html/2606.26472#A5.T10)); pre-RoPE key statistics give no measurable benefit ($\Delta|\rho| \leq 0.0005$).
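The correlations reported here are Spearman $\rho$ between a per-token signal and binary importance labels. A dependency-free sketch is given below; it uses average ranks for ties, which matters because binary labels are heavily tied. This is a generic implementation of the statistic, not the paper's evaluation code.

```python
import math

def average_ranks(xs):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def spearman(signal, labels):
    """Spearman rho = Pearson correlation of the rank-transformed inputs."""
    return pearson(average_ranks(signal), average_ranks(labels))

if __name__ == "__main__":
    signal = [0.1, 0.9, 0.2, 0.8]   # toy per-token scores
    labels = [0, 1, 0, 1]           # toy occlusion labels
    print(round(spearman(signal, labels), 3))
```

A positive value means high signal tends to mark important tokens (the Band A pattern); a negative value means the opposite (the Band B pattern).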
### 4.3 Accuracy at deployable budgets
At a 4096-token cache on MATH-500, EpiKV reaches 72%, above ThinKV (71%) and H2O (67%) and within 3 points of the 75% ceiling (Table [3](https://arxiv.org/html/2606.26472#A2.T3); Figure [3](https://arxiv.org/html/2606.26472#S3.F3)); the FA2-compatible family clusters at 70–72% while the attention-requiring baselines span 67–71%. The margin over the best attention baseline is one problem of 100, so we claim parity-or-better at this budget, and we obtain it without ever materializing the attention matrix. On AIME-2024 at 8192 the lag-normalized KV method reaches 37% against 33% for the best attention-requiring method (Table [4](https://arxiv.org/html/2606.26472#A2.T4)); at $n = 30$ this is one problem of difference.
Two honest qualifications. First, no single FA2-compatible method dominates across budgets: at 2048 on MATH-500 the band-adaptive and KV variants reach 57% while EpiKV drops to 49%, and RaaS leads at 60%. Second, at the tightest budgets ($\leq 1024$) an eager hybrid that combines segment classification with the hidden-state ranker leads (36% at 1024, 7% at 512), and no FA2-compatible method matches it there. The contribution here is parity-or-better with attention-based eviction at the budgets that matter for deployment, obtained without materializing the attention matrix.
### 4.4 Speed and memory
#### Speed\.
Two effects make eviction faster. First, a capped cache shrinks per-step attention, so every eviction method (even eager ones) runs below the uncapped no-eviction baseline (763 s on AIME-2024 at 8192). Second, FA2-compatible methods additionally avoid the eager-attention kernel: on AIME-2024 at 8192 the lag-normalized method (440 s per problem) is $1.6\times$ faster than ThinKV, the fastest attention baseline (721 s), and up to $2.8\times$ faster overall (RaaS, 1239 s; Table [6](https://arxiv.org/html/2606.26472#A2.T6), Figure [4](https://arxiv.org/html/2606.26472#S4.F4)). This FA2 speed-up is method-specific, not automatic: the raw key-variance and lag-key variants recompute scores over the whole cache each step and only match the eager baselines. H2O’s low wall-time at tight budgets reflects its empty-generation collapse, not efficiency.
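The two speed-up factors follow directly from the per-problem wall-clock times quoted above, as this short check shows. The dictionary keys are labels chosen for this sketch; the numbers are the ones reported in the text.

```python
# Per-problem wall-clock times (seconds) on AIME-2024 at an 8192-token budget,
# as quoted in the text above.
wall_time_s = {"lag_kv": 440, "thinkv": 721, "raas": 1239}

def speedup(baseline, method, times=wall_time_s):
    """How many times faster `method` is than `baseline`."""
    return times[baseline] / times[method]

if __name__ == "__main__":
    print(f"vs ThinKV: {speedup('thinkv', 'lag_kv'):.1f}x")
    print(f"vs RaaS:   {speedup('raas', 'lag_kv'):.1f}x")
```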
Figure 4: Accuracy vs. wall-clock time per problem on AIME-2024 at an 8192-token budget (top-left is better). FA2-compatible methods (green) dominate the accuracy–speed frontier; Lag-KV is both the most accurate and the fastest, while the attention-based baselines (red) sit slower and no more accurate.
#### Memory\.
In the decode regime measured, peak memory is set by the cache budget, not the method: at the tightest AIME budget every eviction method saves $\approx 2.9$ GB over no eviction (Table [8](https://arxiv.org/html/2606.26472#A2.T8)). The architectural memory advantage of being FA2-compatible appears at prefill, where reading attention weights materializes the $H \times n \times n$ maps. On an 80 GB A100, a forward pass with `output_attentions=True` already uses 52 GB at a 4096-token context and runs out of memory at 8192, whereas a FlashAttention pass over the same model scales to 65,536 tokens at 48 GB, a $16\times$ longer feasible context (Figure [1](https://arxiv.org/html/2606.26472#S1.F1)). This compounds at the batch level: holding the cache at a 2048-token budget supports 224 concurrent 32,768-token requests on the same GPU against 14 without eviction (App. [I](https://arxiv.org/html/2606.26472#A9)), and the gap widens with context length.
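A back-of-the-envelope sketch of both effects is given below. Everything except the 32-layer depth is an assumption: 32 query heads, 16-bit attention maps retained for all layers at once, 128 KiB of KV cache per token, and a hypothetical 56 GiB KV pool. The attention estimate covers only the quadratic term and excludes weights and activations, so it is not expected to equal the measured totals.

```python
GIB = 1024 ** 3

def attention_maps_bytes(n_tokens, n_layers=32, n_heads=32, bytes_per_scalar=2):
    """Bytes to hold every layer's H x n x n attention map (assumed 16-bit)."""
    return n_layers * n_heads * n_tokens * n_tokens * bytes_per_scalar

def max_concurrent(pool_bytes, tokens_per_request, bytes_per_token=131072):
    """Requests that fit in a KV pool; bytes_per_token is an assumed 128 KiB."""
    return pool_bytes // (tokens_per_request * bytes_per_token)

if __name__ == "__main__":
    for n in (4096, 8192):
        print(f"{n} tokens: {attention_maps_bytes(n) / GIB:.0f} GiB of attention maps")
    pool = 56 * GIB  # hypothetical KV pool size, chosen only for illustration
    print("no eviction (32,768 tokens cached):", max_concurrent(pool, 32768))
    print("2048-token budget:                 ", max_concurrent(pool, 2048))
```

Doubling the context quadruples the attention-map term, which is why the eager path hits the memory wall so abruptly. Whatever the pool size, capping a 32,768-token request at a 2048-token cache raises capacity by $32768 / 2048 = 16\times$, the same ratio as the reported 224 against 14.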
### 4.5 Difficulty-regime anatomy
The two-band anatomy is specific to competition mathematics. On GSM8K (grade-school arithmetic, $n_{\mathrm{eff}} = 352$) the positive band moves to early layers and the negative band extends across most of the network, and both attention entropy and key variance reverse sign relative to MATH-500 (Appendix [F](https://arxiv.org/html/2606.26472#A6)). Where the importance signal lives depends on task difficulty, and this is evidence that the signal tracks a real property of how reasoning is consolidated, not a fixed layer index.
## 5 Discussion
#### What the two\-band anatomy means, and why it moves\.
The positive band (layers 7–13) coincides with the mid-network layers that mechanistic-interpretability work identifies as the site of factual retrieval and feature routing (Meng et al., [2022](https://arxiv.org/html/2606.26472#bib.bib17), [2023](https://arxiv.org/html/2606.26472#bib.bib18); Geva et al., [2021](https://arxiv.org/html/2606.26472#bib.bib29)): large hidden-state change there marks a token where the model retrieves or composes content. The negative band (18–25) is the counterintuitive half: these upper-mid layers prepare the output distribution and are active even for fluent, low-surprise tokens, so large change there signals predictable continuation rather than content worth keeping; subtracting the bands exploits this opposition. The band locations are not universal: they shift with task difficulty (§[4.5](https://arxiv.org/html/2606.26472#S4.SS5), Appendix [F](https://arxiv.org/html/2606.26472#A6)), which indicates the signal tracks where load-bearing computation happens (deeper for harder problems) and makes the layer indices a per-regime hyperparameter (layers 10 and 21 for the competition-mathematics setting we target).
#### Attention\-matrix\-free scoring is the deployment contribution\.
No prior decode-time eviction method for reasoning traces avoids the attention matrix: ThinKV needs the attention weights, an offline calibration step, and a custom kernel, and H2O, RaaS, and LongFlow all require the attention weights and therefore the eager kernel. The cost of that requirement is not academic. At the 16k–64k contexts typical of reasoning traces, reading the attention weights to score tokens exhausts GPU memory before the trace even fits (§[4.4](https://arxiv.org/html/2606.26472#S4.SS4)), while our signal is read from the same forward pass the model already runs. Scoring is also causal: the rolling $z$-score fixes a token’s fate at the step it is produced, where ThinKV’s $\tau = 128$ refresh window defers classification by up to $\tau$ tokens. EpiKV drops into vLLM, TGI, or SGLang unchanged, with no training, no classifier, and no kernel fork. For a method already at accuracy parity and faster at equal budget, that is what makes it well-suited to production.
#### A trend that residual\-stream signals share\.
The finding that the raw hidden-state signal carries a monotonic positional trend within a trace, so that aggregate correlation overstates within-trace ranking quality, is not specific to our method. Any importance signal read from the residual stream over a long generation is exposed to the same drift, and the causal rolling $z$-score we use is a cheap, general correction. The deeper cause, that certain layers have systematically different activation magnitudes early versus late in a generation, is worth study in its own right.
#### Limitations\.
Latency and memory are measured single-GPU and single-example; batched and multi-GPU throughput is projected from KV-cache arithmetic (Appendix [I](https://arxiv.org/html/2606.26472#A9)) rather than measured end-to-end, and the prefill-memory advantage is shown by a forward-pass microbenchmark, not a long-prompt deployment. The AIME-2024 comparison is $n = 30$, where a three-point gap is a single problem (Appendix [G](https://arxiv.org/html/2606.26472#A7)); pooling AIME 2024–2026 to $n \approx 90$ would firm it up. Results are from one model family (DeepSeek-R1-Distill-LLaMA-8B), as is common in this line of work; transfer across architectures and scales is untested.
#### Future work\.
As extensions, an attention-matrix-free analogue of the segment hybrid (e.g., segment classification from KV statistics rather than attention entropy) would target the tight-budget regime where the eager hybrid still leads. Chunk-level scoring (Liu et al., [2025](https://arxiv.org/html/2606.26472#bib.bib20)) over hidden-state change and per-layer budgets (Cai et al., [2025](https://arxiv.org/html/2606.26472#bib.bib21)) are orthogonal gains, and quantization (Hooper et al., [2024](https://arxiv.org/html/2606.26472#bib.bib25); Sharma et al., [2025](https://arxiv.org/html/2606.26472#bib.bib31)) is stackable.
## Acknowledgements
We thank Vashisth Tiwari for their helpful comments and pointers in the ideation of this work\.
## References
- PyramidKV: dynamic KV cache compression based on pyramidal information funneling\. Cited by: [§2](https://arxiv.org/html/2606.26472#S2.SS0.SSS0.Px1.p1.1), [§5](https://arxiv.org/html/2606.26472#S5.SS0.SSS0.Px5.p1.1)\.
- K\. Cobbe, V\. Kosaraju, M\. Bavarian, M\. Chen, H\. Jun, L\. Kaiser, M\. Plappert, J\. Tworek, J\. Hilton, R\. Nakano, C\. Hesse, and J\. Schulman \(2021\) Training verifiers to solve math word problems\. arXiv preprint arXiv:2110\.14168\. Cited by: [§3\.5](https://arxiv.org/html/2606.26472#S3.SS5.SSS0.Px2.p1.2)\.
- T\. Dao, D\. Fu, S\. Ermon, A\. Rudra, and C\. Ré \(2022\) FlashAttention: fast and memory\-efficient exact attention with IO\-awareness\. In Advances in Neural Information Processing Systems \(NeurIPS\), Vol\. 35, pp\. 16344–16359\. Cited by: [§2](https://arxiv.org/html/2606.26472#S2.SS0.SSS0.Px1.p1.1)\.
- T\. Dao \(2024\) FlashAttention\-2: faster attention with better parallelism and work partitioning\. In International Conference on Learning Representations \(ICLR\), Vol\. 2024, pp\. 35549–35562\. Cited by: [§1](https://arxiv.org/html/2606.26472#S1.p2.1), [§2](https://arxiv.org/html/2606.26472#S2.SS0.SSS0.Px1.p1.1)\.
- M\. Geva, R\. Schuster, J\. Berant, and O\. Levy \(2021\) Transformer feed\-forward layers are key\-value memories\. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing \(EMNLP\), pp\. 5484–5495\. Cited by: [§2](https://arxiv.org/html/2606.26472#S2.SS0.SSS0.Px4.p1.1), [§3\.2](https://arxiv.org/html/2606.26472#S3.SS2.SSS0.Px1.p1.4), [§5](https://arxiv.org/html/2606.26472#S5.SS0.SSS0.Px1.p1.1)\.
- R\. Goel, J\. Park, M\. Gagrani, D\. Jones, M\. Morse, H\. Langston, M\. Lee, and C\. Lott \(2025\) CAOTE: KV cache selection for LLMs via attention output error\-based token eviction\. arXiv preprint arXiv:2504\.14051\. Cited by: [§2](https://arxiv.org/html/2606.26472#S2.SS0.SSS0.Px2.p1.1)\.
- Y\. Gu, Z\. Jiang, J\. Jin, K\. Guo, Z\. Zhang, and X\. Xu \(2025\) AhaKV: adaptive holistic attention\-driven KV cache eviction for efficient inference of large language models\. arXiv preprint arXiv:2506\.03762\. Cited by: [§2](https://arxiv.org/html/2606.26472#S2.SS0.SSS0.Px2.p1.1), [§3\.2](https://arxiv.org/html/2606.26472#S3.SS2.SSS0.Px2.p1.9)\.
- D\. Guo, D\. Yang, H\. Zhang, J\. Song, P\. Wang, Q\. Zhu, R\. Xu, R\. Zhang, S\. Ma, X\. Bi, et al\. \(2025\) DeepSeek\-R1 incentivizes reasoning in LLMs through reinforcement learning\. Nature 645\(8081\), pp\. 633–638\. Cited by: [§1](https://arxiv.org/html/2606.26472#S1.p1.3), [§3\.5](https://arxiv.org/html/2606.26472#S3.SS5.SSS0.Px1.p1.1)\.
- D\. Hendrycks, C\. Burns, S\. Kadavath, A\. Arora, S\. Basart, E\. Tang, D\. Song, and J\. Steinhardt \(2021\) Measuring mathematical problem solving with the MATH dataset\. In Advances in Neural Information Processing Systems \(NeurIPS\) Datasets and Benchmarks Track\. Cited by: [§3\.5](https://arxiv.org/html/2606.26472#S3.SS5.SSS0.Px2.p1.2)\.
- C\. Hooper, S\. Kim, H\. Mohammadzadeh, M\. W\. Mahoney, Y\. S\. Shao, K\. Keutzer, and A\. Gholami \(2024\) KVQuant: towards 10 million context length LLM inference with KV cache quantization\. In Advances in Neural Information Processing Systems \(NeurIPS\), Vol\. 37, pp\. 1270–1303\. Cited by: [§2](https://arxiv.org/html/2606.26472#S2.SS0.SSS0.Px3.p1.1), [§5](https://arxiv.org/html/2606.26472#S5.SS0.SSS0.Px5.p1.1)\.
- J\. Hu, W\. Huang, W\. Wang, Z\. Li, T\. Hu, Z\. Liu, X\. Chen, T\. Xie, and Y\. Shan \(2025\) RaaS: reasoning\-aware attention sparsity for efficient LLM reasoning\. In Findings of the Association for Computational Linguistics: ACL 2025, Vienna, Austria, pp\. 2577–2590\. Cited by: [§1](https://arxiv.org/html/2606.26472#S1.p2.1), [§2](https://arxiv.org/html/2606.26472#S2.SS0.SSS0.Px2.p1.1)\.
- S\. Kariyappa and G\. E\. Suh \(2026\) SideQuest: model\-driven KV cache management for long\-horizon agentic reasoning\. arXiv preprint arXiv:2602\.22603\. Cited by: [§2](https://arxiv.org/html/2606.26472#S2.SS0.SSS0.Px3.p1.1)\.
- W\. Kwon, Z\. Li, S\. Zhuang, Y\. Sheng, L\. Zheng, C\. H\. Yu, J\. E\. Gonzalez, H\. Zhang, and I\. Stoica \(2023\) Efficient memory management for large language model serving with PagedAttention\. In Proceedings of the 29th Symposium on Operating Systems Principles \(SOSP\), pp\. 611–626\. Cited by: [§1](https://arxiv.org/html/2606.26472#S1.p1.3)\.
- Y\. Li, Y\. Huang, B\. Yang, B\. Venkitesh, A\. Locatelli, H\. Ye, T\. Cai, P\. Lewis, and D\. Chen \(2024a\) SnapKV: LLM knows what you are looking for before generation\. In Advances in Neural Information Processing Systems \(NeurIPS\), Vol\. 37, pp\. 22947–22970\. Cited by: [§2](https://arxiv.org/html/2606.26472#S2.SS0.SSS0.Px1.p1.1)\.
- Y\. Li, F\. Wei, C\. Zhang, and H\. Zhang \(2024b\) EAGLE: speculative sampling requires rethinking feature uncertainty\. In International Conference on Machine Learning, pp\. 28935–28948\. Cited by: [§2](https://arxiv.org/html/2606.26472#S2.SS0.SSS0.Px4.p1.1)\.
- M\. Liang, J\. Zhang, X\. Li, and J\. Li \(2025\) LagKV: lag\-relative information of the KV cache tells which tokens are important\. arXiv preprint arXiv:2504\.04704\. Cited by: [§2](https://arxiv.org/html/2606.26472#S2.SS0.SSS0.Px2.p1.1), [§3\.2](https://arxiv.org/html/2606.26472#S3.SS2.SSS0.Px2.p1.9), [§3\.3](https://arxiv.org/html/2606.26472#S3.SS3.p2.1)\.
- H\. Lightman, V\. Kosaraju, Y\. Burda, H\. Edwards, B\. Baker, T\. Lee, J\. Leike, J\. Schulman, I\. Sutskever, and K\. Cobbe \(2024\) Let’s verify step by step\. In International Conference on Learning Representations, Vol\. 2024, pp\. 39578–39601\. Cited by: [§3\.5](https://arxiv.org/html/2606.26472#S3.SS5.SSS0.Px2.p1.2)\.
- G\. Liu, C\. Li, Z\. Ning, J\. Lin, Y\. Yao, D\. Ke, M\. Guo, and J\. Zhao \(2026\) FreeKV: boosting KV cache retrieval for efficient LLM inference\. In The Fourteenth International Conference on Learning Representations\. Cited by: [§2](https://arxiv.org/html/2606.26472#S2.SS0.SSS0.Px3.p1.1)\.
- X\. Liu, Z\. Tang, P\. Dong, Z\. Li, Y\. Liu, B\. Li, X\. Hu, and X\. Chu \(2025\) ChunkKV: semantic\-preserving KV cache compression for efficient long\-context LLM inference\. In Advances in Neural Information Processing Systems \(NeurIPS\), Vol\. 38, pp\. 28728–28778\. Cited by: [§2](https://arxiv.org/html/2606.26472#S2.SS0.SSS0.Px1.p1.1), [§5](https://arxiv.org/html/2606.26472#S5.SS0.SSS0.Px5.p1.1)\.
- K\. Meng, D\. Bau, A\. Andonian, and Y\. Belinkov \(2022\) Locating and editing factual associations in GPT\. In Advances in Neural Information Processing Systems \(NeurIPS\), Vol\. 35, pp\. 17359–17372\. Cited by: [§2](https://arxiv.org/html/2606.26472#S2.SS0.SSS0.Px4.p1.1), [§3\.2](https://arxiv.org/html/2606.26472#S3.SS2.SSS0.Px1.p1.4), [§5](https://arxiv.org/html/2606.26472#S5.SS0.SSS0.Px1.p1.1)\.
- K\. Meng, A\. S\. Sharma, A\. Andonian, Y\. Belinkov, and D\. Bau \(2023\) Mass\-editing memory in a transformer\. In International Conference on Learning Representations \(ICLR\)\. Cited by: [§2](https://arxiv.org/html/2606.26472#S2.SS0.SSS0.Px4.p1.1), [§3\.2](https://arxiv.org/html/2606.26472#S3.SS2.SSS0.Px1.p1.4), [§5](https://arxiv.org/html/2606.26472#S5.SS0.SSS0.Px1.p1.1)\.
- A\. Ramachandran, M\. Neseem, C\. Sakr, R\. Venkatesan, B\. Khailany, and T\. Krishna \(2026\) ThinKV: thought\-adaptive KV cache compression for efficient reasoning models\. In International Conference on Learning Representations \(ICLR\)\. Note: Oral\. Cited by: [§1](https://arxiv.org/html/2606.26472#S1.p2.1), [§2](https://arxiv.org/html/2606.26472#S2.SS0.SSS0.Px2.p1.1)\.
- A\. Sharma, H\. Ding, J\. Li, N\. Dani, and M\. Zhang \(2025\) MiniKV: pushing the limits of 2\-bit KV cache via compression and system co\-design for efficient long context inference\. In Findings of the Association for Computational Linguistics: ACL 2025, Vienna, Austria, pp\. 18506–18523\. Cited by: [§2](https://arxiv.org/html/2606.26472#S2.SS0.SSS0.Px3.p1.1), [§5](https://arxiv.org/html/2606.26472#S5.SS0.SSS0.Px5.p1.1)\.
- Y\. Su, Z\. Tian, D\. Qiao, Y\. Zhou, J\. Li, and M\. Zhang \(2026\) LongFlow: efficient KV cache compression for reasoning models\. arXiv preprint arXiv:2603\.11504\. Cited by: [§1](https://arxiv.org/html/2606.26472#S1.p2.1), [§2](https://arxiv.org/html/2606.26472#S2.SS0.SSS0.Px2.p1.1)\.
- J\. Tang, Y\. Zhao, K\. Zhu, G\. Xiao, B\. Kasikci, and S\. Han \(2024\) Quest: query\-aware sparsity for efficient long\-context LLM inference\. In Forty\-first International Conference on Machine Learning\. Cited by: [§2](https://arxiv.org/html/2606.26472#S2.SS0.SSS0.Px3.p1.1)\.
- G\. Xiao, Y\. Tian, B\. Chen, S\. Han, and M\. Lewis \(2024\) Efficient streaming language models with attention sinks\. In International Conference on Learning Representations \(ICLR\), Vol\. 2024, pp\. 21875–21895\. Cited by: [§1](https://arxiv.org/html/2606.26472#S1.p2.1), [§2](https://arxiv.org/html/2606.26472#S2.SS0.SSS0.Px1.p1.1)\.
- Z\. Zhang, Y\. Sheng, T\. Zhou, T\. Chen, L\. Zheng, R\. Cai, Z\. Song, Y\. Tian, C\. Ré, C\. Barrett, Z\. Wang, and B\. Chen \(2023\) H2O: heavy\-hitter oracle for efficient generative inference of large language models\. In Advances in Neural Information Processing Systems \(NeurIPS\)\. Cited by: [§1](https://arxiv.org/html/2606.26472#S1.p2.1), [§2](https://arxiv.org/html/2606.26472#S2.SS0.SSS0.Px1.p1.1)\.
## Appendix A Per\-layer importance correlations
Table [2](https://arxiv.org/html/2606.26472#A1.T2) reports the Spearman $\rho$ between $\bar{g}_l$ \(rolling\-64 hidden\-state L2 diff at layer $l$\) and the counterfactual importance labels, for all 32 layers on the two competition\-mathematics datasets in both attention back\-ends\. Band A \(7–13\) is positive throughout; Band B \(18–25\) is negative throughout\. The last layer \(l31\) flips sign across datasets and is not used\.
Table 2: Spearman $\rho$ of $\bar{g}_l$ with importance labels, all 32 layers\. Band A \(7–13\) and Band B \(18–25\) rows in bold\.
Figure 5: Per\-layer Spearman $\rho$ between rolling\-64 hidden\-state change and counterfactual importance\. Band A \(7–13\) is positive and Band B \(18–25\) negative across both datasets and back\-ends\.
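A minimal sketch of how a per-layer correlation of this kind can be computed, under stated assumptions. The hidden-state change is taken here as the L2 norm of the difference between consecutive token positions at one layer (this appendix does not spell out the exact definition), smoothed with a trailing 64-token mean. Spearman $\rho$ is computed as the Pearson correlation of tie-averaged ranks, which matters because the labels are binary. All function names are hypothetical.

```python
import numpy as np


def average_ranks(values):
    """Ranks starting at 1; tied values share their mean rank."""
    v = np.asarray(values, dtype=np.float64)
    order = np.argsort(v, kind="mergesort")
    sorted_v = v[order]
    ranks = np.empty(len(v), dtype=np.float64)
    i = 0
    while i < len(v):
        j = i
        while j + 1 < len(v) and sorted_v[j + 1] == sorted_v[i]:
            j += 1
        ranks[order[i:j + 1]] = 0.5 * (i + j) + 1.0  # mean of ranks i+1 .. j+1
        i = j + 1
    return ranks


def spearman_rho(a, b):
    """Spearman correlation as Pearson correlation of the rank vectors."""
    return float(np.corrcoef(average_ranks(a), average_ranks(b))[0, 1])


def rolling_l2_change(hidden, window=64):
    """hidden: (seq_len, d_model) states of one layer -> (seq_len,) scores.

    Assumption: the change is measured between consecutive positions.
    """
    h = np.asarray(hidden, dtype=np.float64)
    step = np.linalg.norm(np.diff(h, axis=0), axis=1)
    step = np.concatenate([[0.0], step])          # position 0 has no predecessor
    csum = np.concatenate([[0.0], np.cumsum(step)])
    idx = np.arange(len(step))
    start = np.maximum(0, idx - window + 1)       # trailing (causal) window
    return (csum[idx + 1] - csum[start]) / (idx + 1 - start)
```

Given one layer's hidden states `hidden` and per-token labels `labels` for a trace, `spearman_rho(rolling_l2_change(hidden), labels)` gives one token-level value; how values are pooled across traces is not covered by this sketch.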
## Appendix B Full Phase 1 results
Tables [3](https://arxiv.org/html/2606.26472#A2.T3)–[8](https://arxiv.org/html/2606.26472#A2.T8) give the complete accuracy, per\-problem wall\-clock time, and per\-example peak GPU memory for every method and budget\. FA2\-compatible methods are marked ✓\. MATH\-500 is $n=100$; AIME\-2024 is $n=30$ \(each problem $\approx 3.3$ points\)\.
Table 3: MATH\-500 accuracy \(%\) vs\. cache budget\. *none* \(no eviction\) is budget\-independent and shown once\.
Table 4: AIME\-2024 accuracy \(%\) vs\. cache budget\.
Table 5: MATH\-500 mean wall\-clock time per problem \(s\)\. H2O at 1024 is fast because it collapses to near\-empty generations, not because it is efficient\.
Table 6: AIME\-2024 mean wall\-clock time per problem \(s\)\. lag\-kv at 8192 is $2.8\times$ faster than raas \(440\.5 vs\. 1238\.7\)\.
Table 7: MATH\-500 mean peak GPU memory \(MB\)\. Differences are driven by budget, not method; eager and FA2 are within a few hundred MB at equal budget\.
Table 8: AIME\-2024 mean peak GPU memory \(MB\)\. At the tightest budget all eviction methods save $\approx 2.9$ GB over no eviction; the saving shrinks at larger budgets\.
## Appendix C Counterfactual labeling details
Labels are produced by sliding\-window occlusion over the reasoning span of each correctly answered trace\. Window size 32, stride 16 \(each interior position is covered by two windows\)\. The answer boundary is located by searching for the </think\> token sequence, falling back to the last \\boxed\{ and then to the final 64 tokens\. For each window the tokens are replaced with the padding id, the full modified context up to the boundary is fed, and the answer is regenerated greedily with up to 512 new tokens\. A position is labeled important \(1\) if any covering window flips the answer, else 0; prompt positions are fixed to 1 and the answer span is not tested\. Regeneration is deterministic, so labels are reproducible\.
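The labeling loop can be sketched as follows. This is an illustration under assumptions, not the released labeling code: `flips_answer` is a hypothetical callback standing in for the mask-and-regenerate step, windows are assumed to start at the first reasoning token, and the last window is assumed to be truncated at the answer boundary.

```python
def occlusion_labels(prompt_len, answer_start, flips_answer, window=32, stride=16):
    """Return 0/1 labels for positions [0, answer_start).

    flips_answer(start, end) is a hypothetical callback: it should replace
    tokens [start, end) with the padding id, regenerate the answer greedily,
    and return True if the final answer changes.
    The answer span (positions >= answer_start) is not tested or labeled.
    """
    labels = [0] * answer_start
    for p in range(min(prompt_len, answer_start)):
        labels[p] = 1                          # prompt positions are fixed to 1
    start = prompt_len
    while start < answer_start:
        end = min(start + window, answer_start)
        if flips_answer(start, end):
            for p in range(start, end):
                labels[p] = 1                  # any flipping window marks its tokens
        start += stride
    return labels


# Toy stand-in for the model: the answer flips only when position 50 is occluded.
toy = occlusion_labels(prompt_len=4, answer_start=100,
                       flips_answer=lambda s, e: s <= 50 < e)
```

In the toy run, position 50 is covered by two overlapping windows, so every token in either window is marked important. This shows why the labels are window-granular rather than token-exact.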
## Appendix D Eviction policy composition and budgets
Table [9](https://arxiv.org/html/2606.26472#A4.T9) records, for each method, which structural tokens are preserved and how the budget $K$ is allocated\. The policies differ: H2O preserves sinks plus a recency window, RaaS and the hidden\-state/KV families preserve the entire prefill, and ThinKV preserves only a recency window and may retain fewer than $K$ tokens because its per\-segment R/E/T budgets \($\{64, 32, 8\}$\) need not sum to $K$\. Recency is $\min(128, K/4)$ throughout\.
Table 9: Structural\-token preservation and budget allocation per method\.
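One way to compose these preservation rules into a keep-mask is sketched below for the prefill-preserving families only. Several details are assumptions made for illustration, since the actual allocation is given in Table 9: the budget $K$ is taken to count every retained token (prefill and recency included), $K/4$ is rounded down, and the remaining slots go to the highest-scoring tokens. The function name is hypothetical.

```python
import numpy as np


def keep_mask(scores, prefill_len, budget):
    """Boolean mask over cached positions; True means the token is kept."""
    scores = np.asarray(scores, dtype=np.float64)
    n = len(scores)
    keep = np.zeros(n, dtype=bool)
    if n <= budget:
        keep[:] = True                       # nothing to evict yet
        return keep
    recency = min(128, budget // 4)          # recency window, min(128, K/4)
    keep[:prefill_len] = True                # structural: entire prefill
    if recency > 0:
        keep[n - recency:] = True            # structural: most recent tokens
    free = budget - int(keep.sum())          # slots left for scored tokens (assumed)
    if free > 0:
        candidates = np.flatnonzero(~keep)
        best = np.argsort(scores[candidates])[::-1][:free]
        keep[candidates[best]] = True
    return keep


mask = keep_mask(np.arange(1000), prefill_len=10, budget=100)
```

If the prefill and recency window alone exceed the budget, this sketch keeps both and scores nothing. A real policy has to decide that case explicitly.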
## Appendix E Temporal smoothing and RoPE
Rolling\-64 smoothing outperforms an EMA \($\alpha=0.9$\) and the raw signal across datasets \(Table [10](https://arxiv.org/html/2606.26472#A5.T10)\)\. Pre\-RoPE versus post\-RoPE key statistics is a null result: the maximum $\Delta|\rho|$ observed across datasets and smoothing variants is 0\.0005, so pre\-RoPE collection is omitted\.
Table 10: $|\rho|$ of kv\-key variance under three smoothings \(representative of all families\)\. Rolling\-64 improves over raw by 32–57%\.
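For reference, the two smoothers compared here can be written in a few lines. The EMA convention below, $s_t = \alpha s_{t-1} + (1-\alpha) x_t$ with $s_0 = x_0$, is an assumption, since the appendix gives only $\alpha$. Both functions are causal, and the names are hypothetical.

```python
import numpy as np


def rolling_mean(x, window=64):
    """Trailing mean over at most `window` values, including the current one."""
    x = np.asarray(x, dtype=np.float64)
    csum = np.concatenate([[0.0], np.cumsum(x)])
    idx = np.arange(len(x))
    start = np.maximum(0, idx - window + 1)
    return (csum[idx + 1] - csum[start]) / (idx + 1 - start)


def ema(x, alpha=0.9):
    """Exponential moving average; alpha weights the running state (assumed)."""
    x = np.asarray(x, dtype=np.float64)
    out = np.empty_like(x)
    if len(x) == 0:
        return out
    acc = x[0]
    for t, v in enumerate(x):
        acc = alpha * acc + (1.0 - alpha) * v
        out[t] = acc
    return out
```

The practical difference is the memory profile. The rolling mean forgets a token completely after 64 steps, while the EMA keeps an exponentially decaying trace of every earlier token.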
## Appendix F GSM8K difficulty\-regime anatomy
On GSM8K \(355 correctly answered traces, $n_{\mathrm{eff}}=352$\) the layer anatomy shifts relative to competition mathematics\. Band A moves to early layers \(l0–l7 positive; l0 $=+0.181$\), the negative band extends across l10–l30 \(strongest l15 $=-0.351$\), and the last layer is strongly positive \(l31 $=+0.231$\)\. Attention entropy reverses sign relative to MATH\-500 \($-0.313$ vs\. $+0.176$\) and kv\-key variance reverses \($-0.261$ vs\. $+0.380$\); both reversals are confirmed at high $n_{\mathrm{eff}}$\. The shift indicates that where the importance signal lives depends on task difficulty: harder problems route load\-bearing computation through mid\-layers, simpler arithmetic through early layers\. GSM8K is therefore reported as a difficulty\-regime probe, not a head\-to\-head accuracy benchmark\.
## Appendix G Statistical power
Effective sample size is the number of independent traces, not token pairs, since tokens within a trace are correlated\. Table [11](https://arxiv.org/html/2606.26472#A7.T11) gives $n_{\mathrm{eff}}$ and the approximate 95% confidence half\-width \(Fisher $z$\)\. MATH\-500 and GSM8K are the only high\-power datasets; every AIME configuration has a confidence interval spanning zero, which is why AIME results are reported as directional and tagged for pooling to $n\approx 90$\.
Table 11: Effective sample sizes and standard errors\.
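The Fisher $z$ interval can be sketched as below: the correlation is mapped through $\operatorname{atanh}$, a half-width of $1.96/\sqrt{n_{\mathrm{eff}}-3}$ is applied in $z$-space, and the endpoints are mapped back with $\tanh$. This is the textbook approximation and may differ in detail from the computation behind Table 11. The value $\rho = 0.2$ in the usage lines is a hypothetical input, not a reported result.

```python
import math


def fisher_ci(rho, n_eff, z_crit=1.96):
    """Approximate 95% confidence interval for a correlation coefficient."""
    if n_eff <= 3:
        raise ValueError("Fisher z needs n_eff > 3")
    z = math.atanh(rho)                       # variance-stabilizing transform
    half = z_crit / math.sqrt(n_eff - 3)      # half-width in z-space
    return math.tanh(z - half), math.tanh(z + half)


# Same hypothetical correlation at a small and a large effective sample size.
small = fisher_ci(0.2, 30)
large = fisher_ci(0.2, 352)
```

With 30 independent traces the interval for this hypothetical $\rho$ includes zero, while at 352 it does not, which is the contrast between the low-power and high-power datasets described above.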
## Appendix H Implementation notes
Eviction is applied through the HuggingFace DynamicCache\. Two issues required fixes for correctness: keep\-masks were moved to each tensor’s device for multi\-GPU device\_map="auto" runs, and the post\-eviction cache is rebuilt by constructing an empty DynamicCache and calling update per layer so that \_seen\_tokens matches the retained length \(otherwise the model builds a causal mask one position too long\)\. Multi\-GPU FlashAttention runs hit a kernel\-coordination launch failure, so all flash benchmarks use a single GPU\. The no\-eviction baseline is run once and copied across budgets\. The prefill\-memory microbenchmark \(Section [4](https://arxiv.org/html/2606.26472#S4)\) was run on an NVIDIA A100 \(80 GB\)\. The Phase\-1 accuracy/time/memory benchmarks ran on the cluster’s comparable 46–49 GB GPUs \(NVIDIA L40, L40S, RTX 6000 Ada, RTX A6000\), one GPU per job\.
## Appendix I Throughput projection
Figure [6](https://arxiv.org/html/2606.26472#A9.F6) projects the maximum number of concurrent requests that fit on an 80 GB GPU as a function of context length, computed from the per\-token KV\-cache size, with and without eviction\.
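The arithmetic behind such a projection can be sketched as follows. The model shape (32 layers, 8 KV heads, head dimension 128), 2-byte cache entries, and roughly 16 GB of resident weights are assumptions chosen to resemble an 8B LLaMA-style model in half precision. Activations and allocator overhead are ignored, so the absolute counts are optimistic. Function names are hypothetical.

```python
def kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Bytes of KV cache one token occupies: keys and values, all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def max_concurrent(context_len, budget=None, gpu_bytes=80e9, weight_bytes=16e9):
    """Requests that fit when each holds min(context_len, budget) cached tokens."""
    cached = context_len if budget is None else min(context_len, budget)
    per_request = cached * kv_bytes_per_token()
    return int((gpu_bytes - weight_bytes) // per_request)


# Capacity at a 32k-token trace, without eviction and with a 4096-token budget.
no_eviction = max_concurrent(32768)
with_budget = max_concurrent(32768, budget=4096)
```

Because the cached length is capped at the budget, `max_concurrent` stops depending on `context_len` once traces exceed it. This is the flat line in the projection, while the no-eviction count falls inversely with trace length.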
Figure 6: Maximum concurrent requests on an 80 GB GPU vs\. context length\. Without eviction, capacity falls as traces grow; a fixed cache budget holds it flat\.