NestedKV: Nested Memory Routing for Long-Context KV Cache Compression

arXiv cs.CL Papers

Summary

NestedKV is a training-free KV cache compression method that uses nested memory routing with multi-time-scale anomaly scoring to improve long-context language model efficiency, achieving significant gains on benchmarks like RULER and LongBench.

arXiv:2605.26678v1 Announce Type: new Abstract: Long-context language models are limited by the memory footprint of the key-value (KV) cache. Existing training-free KV compression methods usually rank tokens by one importance signal -- attention, recency, layer-wise allocation, or key distinctiveness -- which becomes brittle when useful context is globally distinctive, locally episodic, or immediately relevant. We introduce NestedKV, a key-only KV cache compression method inspired by the Continuum Memory System in Nested Learning. NestedKV maintains global, block-level, and sliding-window key anchors, scores tokens by multi-time-scale cosine anomaly, and combines the resulting rankings with a training-free outer learner using head-adaptive mixing and surprise-gated token routing. The score is paired with adaptive per-head budgets and requires no training or LLM modification. Across RULER (4k--32k), LooGLE, LongBench, LongBench-E, InfiniteBench, and MMLU-Pro on Qwen3 and Llama-3.2 models, NestedKV is strongest when the retained cache is small. On Qwen3-4B, it improves over KeyDiff by up to 19.10 points on RULER and 19.29 on LongBench at $r=0.75$; at $r=0.95$, it retains 37.32 on LongBench versus 17.55 for KeyDiff.

Cached at: 05/27/26, 09:08 AM

# NestedKV: Nested Memory Routing for Long-Context KV Cache Compression
Source: [https://arxiv.org/html/2605.26678](https://arxiv.org/html/2605.26678)
Hong Chen¹, Xiang Liu¹, Yubo Gao¹, Yuxuan Fan¹, Bo Wang¹, Yuanlin Chu¹, Yuanguo Lin², Xuming Hu¹

¹The Hong Kong University of Science and Technology (Guangzhou), ²Jimei University

{hchen763, xliu886, ygao704, yfan546, bwang423, ychu763}@connect.hkust-gz.edu.cn, xdlyg@jmu.edu.cn, xuminghu@hkust-gz.edu.cn

###### Abstract

Long-context language models are limited by the memory footprint of the key-value (KV) cache. Existing training-free KV compression methods usually rank tokens by one importance signal (attention, recency, layer-wise allocation, or key distinctiveness), which becomes brittle when useful context is globally distinctive, locally episodic, or immediately relevant. We introduce NestedKV, a key-only KV cache compression method inspired by the Continuum Memory System in Nested Learning. NestedKV maintains global, block-level, and sliding-window key anchors, scores tokens by multi-time-scale cosine anomaly, and combines the resulting rankings with a training-free outer learner using head-adaptive mixing and surprise-gated token routing. The score is paired with adaptive per-head budgets and requires no training or LLM modification. Across RULER (4k–32k), LooGLE, LongBench, LongBench-E, InfiniteBench, and MMLU-Pro on Qwen3 and Llama-3.2 models, NestedKV is strongest when the retained cache is small. On Qwen3-4B, it improves over KeyDiff by up to 19.10 points on RULER and 19.29 on LongBench at $r=0.75$; at $r=0.95$, it retains 37.32 on LongBench versus 17.55 for KeyDiff.


## 1 Introduction

Long-context language models have become a standard interface for document understanding, retrieval-augmented generation, coding, and multi-turn interaction. Their practical deployment, however, is constrained by an increasingly simple bottleneck: the key-value (KV) cache grows linearly with context length and batch size. For long prompts and high-throughput serving, this transient memory can dominate inference cost even when model weights are fixed. As a result, a growing line of work studies training-free KV cache compression, aiming to reduce cache memory without fine-tuning the model or changing the attention implementation (Liu et al., [2023](https://arxiv.org/html/2605.26678#bib.bib1); Zhang et al., [2023](https://arxiv.org/html/2605.26678#bib.bib2); Xiao et al., [2024](https://arxiv.org/html/2605.26678#bib.bib3); Li et al., [2024b](https://arxiv.org/html/2605.26678#bib.bib4); Cai et al., [2024](https://arxiv.org/html/2605.26678#bib.bib5); Feng et al., [2026](https://arxiv.org/html/2605.26678#bib.bib6); Park et al., [2026](https://arxiv.org/html/2605.26678#bib.bib7)).

Most existing methods can be understood as choosing one anchor for token importance: past attention mass under a persistence-of-importance hypothesis (Liu et al., [2023](https://arxiv.org/html/2605.26678#bib.bib1); Zhang et al., [2023](https://arxiv.org/html/2605.26678#bib.bib2)), recency and attention sinks (Xiao et al., [2024](https://arxiv.org/html/2605.26678#bib.bib3)), an observation window near the end of the prompt (Li et al., [2024b](https://arxiv.org/html/2605.26678#bib.bib4)), layer-wise cache budgets (Cai et al., [2024](https://arxiv.org/html/2605.26678#bib.bib5)), or geometric distinctiveness of keys from the mean direction (Park et al., [2026](https://arxiv.org/html/2605.26678#bib.bib7)). A complementary line allocates the cache budget adaptively across heads rather than uniformly (Feng et al., [2026](https://arxiv.org/html/2605.26678#bib.bib6)).

![Refer to caption](https://arxiv.org/html/2605.26678v1/x1.png)

Figure 1: Attention from the last 64 queries on a long-context retrieval prompt (Qwen3-4B, RULER niah_multivalue, $N=3{,}800$, 4 needles $\star 1$–$\star 4$). Top: attention mass (log scale). Bottom: tokens retained by an attention-sorted compressor at $r=0.50$ and $r=0.85$; surviving needles green, evicted red.

Figure [1](https://arxiv.org/html/2605.26678#S1.F1) illustrates why an attention-based view is structurally insufficient under aggressive compression. Even before considering any specific design choice, the attention signal itself concentrates on the prompt tail and the attention-sink prefix, while answer-bearing tokens sit in low-attention regions and receive a negligible share of the mass. Any compressor that scores tokens by past attention therefore inherits this geometric misalignment and evicts the wrong tokens first, a problem that grows worse, not better, as the budget shrinks. This motivates moving the compression score out of attention space altogether and into the key stream.

These approaches are effective, but they also expose a common limitation: each method compresses the cache through a single view of memory\. A token may be important because it is globally unusual in the document, because it marks a local topic shift inside one segment, or because it is part of the recent stream that will shape immediate generation\. Under mild compression, a single statistic may be sufficient\. Under aggressive compression or longer contexts, these notions diverge\. A global mean can miss local episodes; a local rule can overfit repetitive blocks; a recent\-window rule can discard earlier evidence needed for retrieval or multi\-hop reasoning\.

We propose NestedKV, a training-free KV cache compression method based on a continuum-memory view of token importance. Following the Nested Learning perspective that models maintain compressed context flows through nested memory systems with a self-modifying update rule (Behrouz et al., [2026a](https://arxiv.org/html/2605.26678#bib.bib14)), NestedKV maintains a three-time-scale memory directly over the cached key stream (a stable, an episodic, and a current anchor) and scores each token by its cosine anomaly against each scale. The three scales act as inner learners: a token receives three rankings rather than one, and is retained if it is anomalous against any of the scales.

A training\-free outer learner then combines these inner rankings on two axes\. Per attention head, the most discriminative scale is up\-weighted relative to a fixed prior, so heads can specialize in different temporal roles\. Per token, the cross\-scale disagreement between the three rankings is read as a compression\-induced surprise signal, and high surprise smoothly routes the score from the blended view toward the strongest individual memory\. Together, these two axes instantiate the self\-modifying compressor motif of Nested Learning at test time, with no trainable parameters and no modification to the underlying LLM\.

The score is key\-only and remains compatible with optimized attention kernels\. The full policy is combined with adaptive per\-head memory allocation, separating two questions that are often entangled: which tokens are informative within a head, and how much memory each head should receive\.

We evaluate NestedKV on a suite of long-context benchmarks (RULER (Hsieh et al., [2024](https://arxiv.org/html/2605.26678#bib.bib18)), LongBench (Bai et al., [2024](https://arxiv.org/html/2605.26678#bib.bib19)), LooGLE (Li et al., [2024a](https://arxiv.org/html/2605.26678#bib.bib21)), LongBench-E, and InfiniteBench (Zhang et al., [2024](https://arxiv.org/html/2605.26678#bib.bib22))) and a short-context knowledge benchmark (MMLU-Pro (Wang et al., [2024](https://arxiv.org/html/2605.26678#bib.bib23))), using Qwen3-4B as the primary frozen model. The main empirical pattern is that the continuum-memory score is most useful exactly where a single anchor should be weakest: at higher compression ratios and longer contexts. NestedKV is best or near-best across most RULER context–ratio cells, with the clearest gains under aggressive compression and longer contexts. On Qwen3-4B it also raises the LongBench average at $r=0.75$ from 30.77 (KeyDiff) to 50.06, and on MMLU-Pro it remains within $0.2$ points of the Full KV ceiling at $r=0.25$ while most baselines degrade.

Our contributions are:

- We reframe training-free KV cache compression as continuum-memory anomaly detection over the key stream, giving token eviction a Nested Learning interpretation as bounded test-time memory maintenance.
- We introduce NestedKV, which uses three time-scale key statistics (stable, episodic, and current) as inner learners and combines their per-token anomaly rankings through a training-free outer learner that adapts per head and per token, the latter driven by compression-induced surprise. The score is paired with adaptive per-head memory allocation, with no training or LLM modification.
- We provide empirical evidence across six benchmarks (RULER, LongBench, LooGLE, LongBench-E, InfiniteBench, and MMLU-Pro) that multi-time-scale scoring is especially valuable under aggressive compression and long contexts, while not compromising short-prompt capability.

## 2 Method

NestedKV compresses the KV cache after prefill and before autoregressive decoding. It is applied independently at each Transformer layer, while coordinating memory allocation across the layer's KV heads. The model parameters, attention function, and retained value vectors are unchanged; the method only determines which cached positions remain in the bounded test-time memory. Figure [2](https://arxiv.org/html/2605.26678#S2.F2) summarizes the three components: a nested continuum memory state over the cached keys, a per-scale anomaly score blended into a primary continuum reading, and a surprise-guided routing rule that selects between the blended reading and the strongest individual memory for each token.

![Refer to caption](https://arxiv.org/html/2605.26678v1/x2.png)

Figure 2: Overview of NestedKV. Left (Section [2.2](https://arxiv.org/html/2605.26678#S2.SS2)). Three time-scale summaries of the cached key stream: stable mean $\mu_s$, episodic block mean $\mu_e(i)$, and current sliding-window mean $\mu_c(i)$. Middle (Sections [2.3](https://arxiv.org/html/2605.26678#S2.SS3)–[2.4](https://arxiv.org/html/2605.26678#S2.SS4)). Each key produces per-scale cosine anomalies $s_s(i), s_e(i), s_c(i)$, normalized per head and combined by a head-adaptive softmax into the blended score $s_b(i)$. Right (Section [2.4](https://arxiv.org/html/2605.26678#S2.SS4)). Surprise-guided routing measures cross-scale disagreement: agreeing tokens keep $s_b(i)$, disagreeing tokens are routed to the maximum-anchor reading so any single anomaly flag suffices. The final score drives the retain/evict decision (bottom row).

### 2.1 KV Compression as Nested Memory Maintenance

For a frozen LLM, the prefilled KV cache is the inner memory state through which the model carries the context flow into future decoding steps. KV compression therefore asks for a bounded memory policy rather than a standalone token deletion rule. For layer $\ell$ and KV head $h$, let

$$M_{\ell,h} = (K_{\ell,h}, V_{\ell,h}) \tag{1}$$

be the full prefill memory. NestedKV constructs a compressed memory

$$M_{\ell,h}^{B_h} = \mathcal{C}_{\phi}(K_{\ell,h}, V_{\ell,h}; B_h), \tag{2}$$

where $B_h$ is the head-specific memory budget and $\phi$ denotes the fixed NestedKV memory policy. No parameters in $\phi$ are learned; the policy is defined by the continuum-memory state and the head-wise allocation rule below.

To simplify notation, we describe one layer and one KV head and omit $\ell, h$. Let

$$\begin{aligned} K &= [k_1, \ldots, k_N] \in \mathbb{R}^{N \times d}, \\ V &= [v_1, \ldots, v_N] \in \mathbb{R}^{N \times d_v}. \end{aligned} \tag{3}$$

Given budget $B$, the compressor returns an index set $\mathcal{S}$ with $|\mathcal{S}| = B$ and memory $M^{B} = (K_{\mathcal{S}}, V_{\mathcal{S}})$. All scores are computed from normalized keys $\hat{k}_i = k_i / \lVert k_i \rVert_2$, so the memory policy focuses on directional structure in key space.

![Refer to caption](https://arxiv.org/html/2605.26678v1/x3.png)

Figure 3: LongBench-Qasper attention-series probe (Dasigi et al., [2021](https://arxiv.org/html/2605.26678#bib.bib20); Bai et al., [2024](https://arxiv.org/html/2605.26678#bib.bib19)). Q1–Q3 attend to different answer regions (vertical lines), while NestedKV assigns saliency across these dispersed positions.

Figure [3](https://arxiv.org/html/2605.26678#S2.F3) shows the motivation for using a memory state rather than a single token-importance view. Even within the same Qasper document, different downstream questions activate different answer regions across layers. A compressor that commits to one temporal anchor risks preserving only one such pattern. NestedKV therefore constructs a continuum memory over the key stream, so a token can be retained when it is distinctive globally, within its local episode, or relative to the current stream.

### 2.2 Continuum Memory State

The Continuum Memory System view suggests that memory should not collapse into a single temporal scale. A token can be redundant with the document as a whole, redundant within its local episode, or redundant with the recent stream. NestedKV represents these three notions as a continuum memory state

$$\mathcal{M}(i) = \{\mu_s, \mu_e(i), \mu_c(i)\} \tag{4}$$

for every cached token $i$.

#### Stable memory\.

The stable component summarizes the whole prefilled context:

$$\mu_s = \frac{1}{N} \sum_{j=1}^{N} \hat{k}_j. \tag{5}$$

It captures document-level regularities that persist across the entire context.
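As a rough illustration of Eq. (5), the sketch below normalizes each key to unit length and averages the results. It uses plain Python lists for readability; the helper names `l2_normalize` and `stable_anchor` are hypothetical, and a real implementation would presumably operate on batched tensors per layer and head.

```python
import math

def l2_normalize(vec):
    """Return vec / ||vec||_2 (assumes vec is not the zero vector)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def stable_anchor(keys):
    """Mean of the unit-normalized keys over the whole prefilled context."""
    unit = [l2_normalize(k) for k in keys]
    n, d = len(unit), len(unit[0])
    return [sum(u[c] for u in unit) / n for c in range(d)]

# Toy example: three 2-dimensional keys.
keys = [[3.0, 4.0], [0.0, 2.0], [1.0, 0.0]]
mu_s = stable_anchor(keys)
```

Because the keys are normalized first, a key with a large norm does not dominate the anchor; only its direction contributes.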

#### Episodic memory\.

The episodic component summarizes the local segment containing token $i$. Let $B(i)$ be the block containing $i$, with block size

$$b = \operatorname{clip}\left(\left\lfloor N/32 \right\rfloor, 128, 256\right). \tag{6}$$

Then

$$\mu_e(i) = \frac{1}{|B(i)|} \sum_{j \in B(i)} \hat{k}_j. \tag{7}$$

It captures passage-level or turn-level structure that may be invisible to a global summary.
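The block-size rule and block mean can be sketched as follows. The clip formula is taken from Eq. (6); the partition into contiguous, non-overlapping blocks starting at the first token (with a possibly shorter final block) is an assumption, since the text does not spell out the layout. Function names are hypothetical.

```python
def block_size(n_tokens):
    """Block size b = clip(floor(N / 32), 128, 256)."""
    return min(max(n_tokens // 32, 128), 256)

def episodic_anchors(unit_keys):
    """Per-token block mean of already-normalized keys.

    Assumption: blocks are contiguous, non-overlapping, and start at
    position 0; the last block may be shorter than b.
    """
    n, d = len(unit_keys), len(unit_keys[0])
    b = block_size(n)
    anchors = []
    for start in range(0, n, b):
        block = unit_keys[start:start + b]
        mean = [sum(k[c] for k in block) / len(block) for c in range(d)]
        anchors.extend([mean] * len(block))  # every token in the block shares it
    return anchors

# 300 tokens -> b = 128, so blocks cover [0, 128), [128, 256), [256, 300).
toy_keys = [[1.0, 0.0]] * 128 + [[0.0, 1.0]] * 172
mu_e = episodic_anchors(toy_keys)
```

Under this rule a 3,800-token prompt uses the lower clip of 128 tokens per block, while a 32k-token prompt uses the upper clip of 256.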

#### Current memory\.

The current component summarizes the immediate causal stream ending at $i$. With window size $W=64$,

$$\begin{aligned} \mu_c(i) &= \frac{1}{i-\ell_i+1} \sum_{j=\ell_i}^{i} \hat{k}_j, \\ \ell_i &= \max(1, i-W+1). \end{aligned} \tag{8}$$

It captures short-range continuity and immediate redundancy.
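A minimal sketch of the causal sliding-window mean in Eq. (8), translated to 0-based indexing. The function name is hypothetical, and the direct slice-and-average form is chosen for clarity rather than speed.

```python
def current_anchors(unit_keys, window=64):
    """Mean of the last `window` normalized keys ending at each position.

    0-based version of Eq. (8): the window for position i covers
    max(0, i - window + 1) .. i inclusive, so early positions use fewer keys.
    """
    d = len(unit_keys[0])
    anchors = []
    for i in range(len(unit_keys)):
        lo = max(0, i - window + 1)
        recent = unit_keys[lo:i + 1]
        anchors.append([sum(k[c] for k in recent) / len(recent) for c in range(d)])
    return anchors

# Toy 1-dimensional stream with a window of 2 so the averages are easy to read.
stream = [[0.0], [1.0], [2.0], [3.0]]
mu_c = current_anchors(stream, window=2)
```

The window is causal: the anchor for position $i$ never looks at later keys, which distinguishes it from the block mean that covers the whole segment.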

### 2.3 Per-Scale Anomaly Scores

NestedKV evicts a token when its key is already predictable from the continuum memory state and retains it when it is anomalous under that state. Because the three memory scales summarize different temporal structure, we keep their anomaly readings separate rather than collapsing them into a single anchor up front. For each cached token $i$, the per-scale anomaly scores are

$$\begin{aligned} a_s(i) &= -\cos(\hat{k}_i, \mu_s), \\ a_e(i) &= -\cos(\hat{k}_i, \mu_e(i)), \\ a_c(i) &= -\cos(\hat{k}_i, \mu_c(i)). \end{aligned} \tag{9}$$

A low $a_k(i)$ means token $i$ is typical with respect to memory scale $k$; a high $a_k(i)$ means the token carries information not well explained by that scale and should remain available for future attention. Each score is min-max normalized within its head, yielding $\tilde{a}_s, \tilde{a}_e, \tilde{a}_c$ on a common per-head scale. To preserve attention stability, the first $n_{\mathrm{sink}}=4$ positions are pinned by assigning them a large value before selection.
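The scoring step can be sketched for one scale as below: negative cosine to the anchor, min-max normalization within the head, and sink pinning. Using `float("inf")` as the "large value" is an assumption, as is the handling of a constant score vector; all function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity (assumes neither vector is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def anomaly_scores(unit_keys, anchors):
    """a(i) = -cos(k_i, mu(i)); `anchors` holds one anchor per token."""
    return [-cosine(k, mu) for k, mu in zip(unit_keys, anchors)]

def min_max(scores):
    """Rescale scores to [0, 1] within one head (all zeros if constant)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def pin_sinks(scores, n_sink=4, big=float("inf")):
    """Force the first n_sink positions to survive any top-B selection."""
    return [big if i < n_sink else s for i, s in enumerate(scores)]

# One shared anchor pointing along the first axis.
toy_keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
raw = anomaly_scores(toy_keys, [[1.0, 0.0]] * 3)
norm = min_max(raw)
```

In the toy example the key aligned with the anchor gets the lowest normalized score and the key pointing the opposite way gets the highest, matching the reading that typical tokens are evicted first.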

The three readings now propose three potentially different token rankings, and any rule for combining them is itself a modeling decision\. The next subsection introduces the outer learner that performs this combination\.

### 2.4 Outer Learner: Head-Adaptive Blend with Surprise Routing

NestedKV combines the per-scale anomaly scores through a training-free outer learner that adapts on two complementary axes: which memory scale is reliable *on each attention head*, and which combination rule should apply *on each token*.

#### Head\-adaptive blend\.

Different heads specialize in different temporal roles, so a single mixing rule across all heads wastes head capacity. For each head we measure how strongly each memory scale separates its top from its bottom tokens,

$$\Delta_k = \overline{\operatorname{top}_p(\tilde{a}_k)} - \overline{\operatorname{bot}_p(\tilde{a}_k)}, \tag{10}$$

with $p=10\%$. A larger $\Delta_k$ means scale $k$ produces a more discriminative ranking on this head. The head-adaptive blend weight is then a softmax over reliability gaps anchored by a fixed log-prior,

$$w_k = \frac{\exp\bigl(\log w_k^0 + \beta\,\Delta_k\bigr)}{\sum_j \exp\bigl(\log w_j^0 + \beta\,\Delta_j\bigr)}, \tag{11}$$

with prior $(w_s^0, w_e^0, w_c^0) = (0.4, 0.4, 0.2)$ shared across the model and fixed temperature $\beta$. The blended score on that head is

$$a_{\text{blend}}(i) = w_s\,\tilde{a}_s(i) + w_e\,\tilde{a}_e(i) + w_c\,\tilde{a}_c(i). \tag{12}$$

The prior $(0.4, 0.4, 0.2)$ enters only as a Bayesian anchor, not as a final coefficient: heads on which one scale is markedly more discriminative shift weight onto that scale, while heads on which the three scales are similarly informative fall back on the prior.
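The head-adaptive blend of Eqs. (10) to (12) can be sketched as follows. The prior comes from the text; the temperature value `beta=4.0` is purely illustrative because the paper only says it is fixed, and the rounding of the top and bottom 10% to a token count is also an assumption. Function names are hypothetical.

```python
import math

PRIOR = {"s": 0.4, "e": 0.4, "c": 0.2}  # (stable, episodic, current) from the text

def reliability_gap(norm_scores, p=0.10):
    """Mean of the top p fraction minus mean of the bottom p fraction."""
    m = max(1, int(len(norm_scores) * p))  # assumed rounding rule
    ranked = sorted(norm_scores)
    return sum(ranked[-m:]) / m - sum(ranked[:m]) / m

def blend_weights(gaps, beta=4.0):
    """Softmax over log-prior + beta * gap; beta is an assumed value."""
    logits = {k: math.log(PRIOR[k]) + beta * gaps[k] for k in PRIOR}
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {k: math.exp(v - top) for k, v in logits.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}

def blended_score(norm, weights):
    """Weighted sum of the three normalized per-scale scores, per token."""
    n = len(norm["s"])
    return [sum(weights[k] * norm[k][i] for k in weights) for i in range(n)]

equal = blend_weights({"s": 0.5, "e": 0.5, "c": 0.5})
shifted = blend_weights({"s": 0.2, "e": 0.2, "c": 0.9})
```

When all three gaps are equal the softmax returns the prior exactly, which is the fall-back behaviour described above; when one scale has a clearly larger gap, its weight rises above its prior share.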

#### Compression\-induced surprise\.

The blended score is reliable when the three memory scales agree on the relative anomaly of a token, but it is brittle when they disagree. A token that is highly anomalous against the stable memory may still be typical inside its local episode, or vice versa, and any average hides exactly the cross-scale information that distinguishes these cases. We define the *compression-induced surprise* of token $i$ as the standard deviation of the inner memories' rankings,

$$s(i) = \operatorname{std}\bigl(\tilde{a}_s(i),\, \tilde{a}_e(i),\, \tilde{a}_c(i)\bigr). \tag{13}$$

Low surprise means the inner memories agree; high surprise means they disagree, so any single average is at risk of being driven by one scale's blind spot.

#### Routed score\.

When the inner memories disagree, the safer reading is the strongest individual memory rather than their average:

$$a_{\text{win}}(i) = \max\bigl(\tilde{a}_s(i),\, \tilde{a}_e(i),\, \tilde{a}_c(i)\bigr). \tag{14}$$

The NestedKV score combines the two branches with a sigmoid gate over surprise,

$$\begin{aligned} \alpha(i) &= \sigma\bigl(\kappa\,(s(i)-\tau)\bigr), \\ a^{\star}(i) &= (1-\alpha(i))\,a_{\text{blend}}(i) + \alpha(i)\,a_{\text{win}}(i), \end{aligned} \tag{15}$$

with fixed gate threshold $\tau$ and sharpness $\kappa$ shared across all benchmarks. Tokens on which the three scales agree pass through the head-adaptive blend; tokens that produce high cross-scale disagreement are routed toward the strongest memory.
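The surprise gate of Eqs. (13) to (15) can be sketched per token as below. The values `tau=0.2` and `kappa=20.0` are illustrative assumptions (the paper states only that both are fixed), and the population form of the standard deviation is likewise assumed. Function names are hypothetical.

```python
import math

def surprise(a_s, a_e, a_c):
    """Population standard deviation of the three normalized scores (assumed form)."""
    vals = (a_s, a_e, a_c)
    mean = sum(vals) / 3
    return math.sqrt(sum((v - mean) ** 2 for v in vals) / 3)

def routed_score(a_s, a_e, a_c, a_blend, tau=0.2, kappa=20.0):
    """Gate between the blended score and the max score; tau and kappa are assumed."""
    alpha = 1.0 / (1.0 + math.exp(-kappa * (surprise(a_s, a_e, a_c) - tau)))
    a_win = max(a_s, a_e, a_c)
    return (1.0 - alpha) * a_blend + alpha * a_win

# Scales agree: blend and winner coincide, so the gate has no effect.
agree = routed_score(0.5, 0.5, 0.5, a_blend=0.5)

# Only the stable scale flags the token; blend under the prior is 0.42.
disagree = routed_score(0.9, 0.1, 0.1, a_blend=0.4 * 0.9 + 0.4 * 0.1 + 0.2 * 0.1)
```

In the second case the blend alone would rank the token at 0.42, but high disagreement pushes the routed score close to the winning value of 0.9, so a single strong anomaly flag is enough to protect the token.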

### 2.5 NestedKV Compression Operator

The single-head compression operator applies $\operatorname{TopB}$ to the NestedKV score $a^{\star}$:

$$\mathcal{C}_{\phi}(K, V; B) = \{(k_i, v_i) : i \in \operatorname{TopB}(a^{\star}_{1:N})\}, \tag{16}$$

where $\operatorname{TopB}$ returns the $B$ positions with the largest scores after sink pinning. This operator is the concrete bounded-memory update of NestedKV: it keeps positions that are anomalous against at least one memory scale, either by the head-adaptive blend or by the surprise-routed winner, and removes positions already absorbed by the continuum state.
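A minimal sketch of the single-head operator in Eq. (16): pin the sink positions, take the $B$ highest-scoring positions, and gather the matching keys and values. Returning the kept indices in their original order is an assumption about cache layout; function names are hypothetical.

```python
def top_b(scores, budget, n_sink=4):
    """Indices of the `budget` largest scores after pinning the first n_sink."""
    pinned = [float("inf") if i < n_sink else s for i, s in enumerate(scores)]
    order = sorted(range(len(pinned)), key=lambda i: pinned[i], reverse=True)
    return sorted(order[:budget])  # keep original token order (assumption)

def compress(keys, values, scores, budget):
    """Keep only the (key, value) pairs selected by top_b."""
    keep = top_b(scores, budget)
    return [keys[i] for i in keep], [values[i] for i in keep]

scores = [0.0, 0.0, 0.0, 0.0, 0.9, 0.1, 0.8, 0.2]
kept = top_b(scores, budget=6)
k_kept, v_kept = compress(list("abcdefgh"), list("ABCDEFGH"), scores, budget=6)
```

With a budget of 6 the four sink positions are always retained, leaving two slots for the highest-scoring remaining tokens (positions 4 and 6 in this toy example).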

### 2.6 Head-Wise Memory Competition

Different heads act as parallel memory channels and can specialize in different temporal roles\. Some heads may concentrate high residuals around local episodes, while others may spread memory over stable document structure or recent transitions\. NestedKV therefore allocates a layer budget across heads by letting head\-token pairs compete under their normalized continuum residuals\.

Let $a_{h,i}$ be the normalized residual for token $i$ in head $h$, and let $B_{\ell}$ be the total number of KV positions retained in layer $\ell$. NestedKV selects the globally highest-residual pairs

$$\mathcal{P}_{\ell} = \operatorname{TopB}_{B_{\ell}}\{(h,i) : a_{h,i}\}, \tag{17}$$

subject to a small per-head safeguard so that every memory channel keeps a minimum state. This induces a head-specific budget

$$B_h = |\{i : (h,i) \in \mathcal{P}_{\ell}\}|, \quad \sum_h B_h = B_{\ell}. \tag{18}$$

Each head then applies $\mathcal{C}_{\phi}(K_h, V_h; B_h)$. The result is a bounded layer memory whose capacity is distributed according to continuum-memory surprise rather than a uniform allocation.
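One way the head-wise competition of Eqs. (17) and (18) might look in code is sketched below. The size and mechanics of the per-head safeguard are not given in the text, so `min_per_head` and the reserve-then-fill strategy are assumptions, as is the requirement that the layer budget covers the reserved slots.

```python
def allocate_head_budgets(residuals, layer_budget, min_per_head=1):
    """Split a layer budget across heads by global competition.

    residuals[h][i] is the normalized score of token i in head h.
    Assumes layer_budget >= len(residuals) * min_per_head.
    """
    chosen = set()
    # Safeguard (assumed form): reserve each head's best tokens first.
    for h, row in enumerate(residuals):
        best = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        chosen.update((h, i) for i in best[:min_per_head])
    # Fill the remaining slots with the globally highest-scoring pairs.
    pool = [(row[i], h, i)
            for h, row in enumerate(residuals)
            for i in range(len(row)) if (h, i) not in chosen]
    pool.sort(reverse=True)
    remaining = max(0, layer_budget - len(chosen))
    chosen.update((h, i) for _, h, i in pool[:remaining])
    budgets = [0] * len(residuals)
    for h, _ in chosen:
        budgets[h] += 1
    return budgets

# Head 0 has uniformly higher residuals than head 1.
budgets = allocate_head_budgets(
    [[0.9, 0.8, 0.7, 0.6], [0.1, 0.2, 0.3, 0.05]], layer_budget=4)
```

Here head 0 wins three of the four slots, while head 1 keeps only its reserved minimum, which illustrates how capacity follows the residual structure rather than a uniform split.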

## 3 Experiments

Table 1: RULER average over 13 tasks. Each group reports Full KV and compressed-cache results across model scales, context lengths, and eviction ratios. Shaded rows are NestedKV.

We evaluate NestedKV as a training-free KV cache compressor for frozen LLM inference. The experimental matrix tests whether continuum-memory compression is robust across model families, model scales, compression ratios, and task types.

### 3.1 Experimental Setup

#### Models\.

We evaluate Qwen3-0.6B, Qwen3-4B, Qwen3-8B (Yang et al., [2025](https://arxiv.org/html/2605.26678#bib.bib16)), Llama-3.2-1B-Instruct, and Llama-3.2-3B-Instruct (Grattafiori et al., [2024](https://arxiv.org/html/2605.26678#bib.bib17)). For every benchmark and compression ratio, we also run the same model with the full KV cache as an upper-bound reference.

#### Benchmarks\.

We evaluate on a suite of long-context and short-context benchmarks. RULER (Hsieh et al., [2024](https://arxiv.org/html/2605.26678#bib.bib18)) provides controlled synthetic tasks across multiple context lengths; we report the average over 13 RULER tasks at 4k, 8k, 16k, and 32k contexts. LongBench (Bai et al., [2024](https://arxiv.org/html/2605.26678#bib.bib19)) evaluates real long-context understanding tasks. LooGLE (Li et al., [2024a](https://arxiv.org/html/2605.26678#bib.bib21)) evaluates long-dependency and short-dependency QA over long documents. We additionally evaluate LongBench-E (Bai et al., [2024](https://arxiv.org/html/2605.26678#bib.bib19)) and InfiniteBench longbook_qa_eng and code_debug (Zhang et al., [2024](https://arxiv.org/html/2605.26678#bib.bib22)); to test whether the compressor preserves short-prompt capability, we evaluate MMLU-Pro (Wang et al., [2024](https://arxiv.org/html/2605.26678#bib.bib23)), a 10-choice multi-domain knowledge benchmark. The LongBench-E and InfiniteBench numbers are reported in Appendix [C](https://arxiv.org/html/2605.26678#A3).

#### Compression ratios\.

We evaluate eviction ratios $r \in \{0.25, 0.50, 0.75\}$, where $r$ denotes the fraction of KV entries removed after prefill.
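Since $r$ is the fraction removed, the retained cache per head shrinks as $(1-r)N$. The short sketch below works this out for a 32k-token prompt; the rounding convention (flooring the evicted count) and the function name are assumptions.

```python
def retained_entries(n_tokens, eviction_ratio):
    """KV entries kept after evicting a fraction r (flooring is an assumption)."""
    return n_tokens - int(n_tokens * eviction_ratio)

# Retained entries for a 32,768-token prompt at the three evaluated ratios.
kept_by_ratio = {r: retained_entries(32768, r) for r in (0.25, 0.50, 0.75)}
```

At $r=0.75$ only a quarter of the prompt's KV entries remain, which is the regime where the text reports the largest gaps between methods.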

#### Baselines\.

We compare against representative training-free KV compression methods implemented through the kvpress evaluation framework released with Expected Attention (Devoto et al., [2025](https://arxiv.org/html/2605.26678#bib.bib13)): StreamingLLM (Xiao et al., [2024](https://arxiv.org/html/2605.26678#bib.bib3)), SnapKV (Li et al., [2024b](https://arxiv.org/html/2605.26678#bib.bib4)), Expected Attention, PyramidKV (Cai et al., [2024](https://arxiv.org/html/2605.26678#bib.bib5)), and KeyDiff (Park et al., [2026](https://arxiv.org/html/2605.26678#bib.bib7)). Full-cache inference is included as the no-compression reference. We omit H2O (Zhang et al., [2023](https://arxiv.org/html/2605.26678#bib.bib2)), as it runs out of memory on most settings in our evaluation matrix.

![Refer to caption](https://arxiv.org/html/2605.26678v1/x4.png)

Figure 4: LooGLE Rouge-L score as a function of the eviction ratio $r$ for the long-dependency QA (top) and short-dependency QA (bottom) splits, across four models. Dotted lines mark each model's Full KV reference. NestedKV (solid red) is competitive across dependency regimes and remains robust under stronger compression.

#### Environment.

All experiments are conducted on a server with four NVIDIA L20 GPUs\. We use the same hardware configuration for full\-cache references and compressed\-cache runs so that accuracy comparisons are not affected by changes in model parallelism or execution backend\.

![Refer to caption](https://arxiv.org/html/2605.26678v1/x5.png)

Figure 5: LongBench 8-task average vs. eviction ratio $r$ on Qwen3-4B (top) and Llama-3.2-3B-Instruct (bottom). Dashed lines mark each model's Full KV ceiling.

### 3.2 Long-Context Tasks

We organize the main results by benchmark\. Each table or figure reports Full KV, multiple compression ratios, and multiple model families in a single view\. NestedKV rows are shaded\.

Table [1](https://arxiv.org/html/2605.26678#S3.T1) first reports RULER, a controlled synthetic benchmark that isolates retrieval and aggregation behaviour at 4k–32k contexts. Across Qwen3 and Llama-3.2-Instruct models, NestedKV is consistently strongest or near-strongest, and its advantage is clearest under aggressive compression. On Qwen3-4B, it improves over KeyDiff in all reported context–ratio cells; at $r=0.75$, the gains are +19.10 at 4k, +15.99 at 8k, and +17.69 at 16k. The 32k columns show the same trend at longer context: attention-based baselines can be competitive when the retained budget is still sufficient, but NestedKV avoids the sharp drops seen in single-signal methods as the budget tightens.

Figure [4](https://arxiv.org/html/2605.26678#S3.F4) evaluates real long-document QA on LooGLE. On long-dependency QA, NestedKV is best or within $0.6$ Rouge-L of the best method on three of four models, though PyramidKV is stronger on Qwen3-8B at $r \leq 0.50$. On short-dependency QA, NestedKV leads on Qwen3-8B and Llama-3.2-3B-Instruct, while PyramidKV or ExpAttn can win in easier low-compression settings. The key trend is therefore not a uniform per-cell win, but stable performance across dependency regimes as compression increases.

Figure [5](https://arxiv.org/html/2605.26678#S3.F5) reports LongBench averages, where the main signal is the slope of degradation. At $r=0.25$ and $r=0.50$, the methods are close: on Qwen3-4B, SnapKV is marginally ahead of NestedKV and ExpAttn is competitive, while on Llama-3.2-3B-Instruct the strongest baseline differs by less than one point. The curves separate once the cache becomes scarce. On Qwen3-4B, NestedKV reaches 50.06 at $r=0.75$ versus 30.77 for KeyDiff, remains highest at $r=0.85$ (45.38 vs. 40.61 for SnapKV), and keeps 37.32 at $r=0.95$ while KeyDiff falls to 17.55. On Llama-3.2-3B-Instruct, NestedKV is similarly the strongest method from $r=0.75$ onward, reaching 48.14, 44.51, and 35.47 at $r \in \{0.75, 0.85, 0.95\}$. These results support the central claim: NestedKV is not primarily a low-compression peak-score method; its benefit is that multi-time-scale scoring degrades more gracefully when the retained cache must be small.

### 3.3 Short-Context Preservation

To check whether the long-context gains come at the cost of short-prompt capability, we evaluate MMLU-Pro (Wang et al., [2024](https://arxiv.org/html/2605.26678#bib.bib23)) on Qwen3-4B in a 0-shot setting. Figure [6](https://arxiv.org/html/2605.26678#S3.F6) reports accuracy versus the compression ratio $r$. NestedKV is the top or tied method at every ratio: essentially lossless at $r=0.25$ ($36.5$ vs. Full KV $36.3$), and at $r=0.75$ it retains $33.1$ while SnapKV and PyramidKV collapse to $22.0$ and $21.4$. This indicates that continuum-memory scoring does not compromise short-prompt knowledge access.

![Refer to caption](https://arxiv.org/html/2605.26678v1/x6.png)

Figure 6: MMLU-Pro accuracy on Qwen3-4B versus compression ratio $r$. The dotted line is the Full KV baseline.

### 3.4 Ablation Study

We ablate NestedKV's two core components, the continuum memory state and the adaptive head-wise allocation, by removing them in isolation and jointly. We evaluate on Qwen3-4B at RULER 4k under aggressive compression ($r=0.75$), where neither component is masked by ceiling effects.

Table [2](https://arxiv.org/html/2605.26678#S3.T2) reports the results. The two components contribute comparably: removing the adaptive head-wise budget drops the score by 8.41 points, while replacing the three-scale continuum score with a single-anchor key-distinctiveness score drops it by 7.99 points. Removing both jointly drops the score by 19.10 points, more than the sum of the individual deltas (16.40), because the two components are coupled by the discrete top-$k$ cache decision: continuum scoring decides *which* tokens remain salient, while adaptive allocation decides *where* the retained budget is spent across heads. Each component can partially compensate when the other remains, but removing both eliminates both compensation paths. We repeat the same four-variant ablation on LongBench and LooGLE in Appendix [C.5](https://arxiv.org/html/2605.26678#A3.SS5) (Figure [7](https://arxiv.org/html/2605.26678#A3.F7)); the same pattern holds, but the continuum component becomes the dominant contributor on real-world long-document tasks.
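The super-additivity claim can be checked directly from the three reported drops. The symbols $\Delta_{\text{adaptive}}$, $\Delta_{\text{continuum}}$, and $\Delta_{\text{both}}$ are labels introduced here for the score drops when removing the adaptive budget, the continuum score, and both, respectively:

$$\Delta_{\text{adaptive}} + \Delta_{\text{continuum}} = 8.41 + 7.99 = 16.40 < 19.10 = \Delta_{\text{both}}, \qquad \Delta_{\text{both}} - 16.40 = 2.70.$$

The joint removal therefore costs 2.70 points more than the two individual removals would predict if their effects were independent.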

Table 2: Component ablation of NestedKV on Qwen3-4B RULER 4k at $r=0.75$. “w/o continuum” replaces the three-scale continuum score with a single-anchor key-distinctiveness score; “w/o adaptive” replaces the head-adaptive budget with a uniform per-head budget. Both components contribute roughly equally; their combined removal exceeds the sum of the individual deltas.

Table 3: Efficiency on Qwen3-4B with a 32k-token context on a single NVIDIA L20 in bf16, $r=0.75$. KV denotes retained prompt-cache entries used during decoding; prefill is in seconds, decode in ms/token, and peak memory in GB.

### 3\.5Efficiency

Table [3](https://arxiv.org/html/2605.26678#S3.T3) reports prefill latency, per-token decode latency, and peak GPU memory of NestedKV against KeyDiff, SnapKV, and the Full KV baseline on Qwen3-4B with a 32k-token context. At the same $r$, all compressed methods use the same retained prompt-cache budget during decoding, which gives nearly identical decode latency and peak memory while remaining substantially below Full KV. The three-time-scale scoring incurs a one-time prefill overhead that is amortized over decoding, and stays within 0.5% of the single-anchor KeyDiff prefill.

## 4Conclusion

We presented NestedKV, a training\-free KV cache compression method that treats eviction as bounded test\-time memory maintenance\. Instead of ranking cached tokens from a single signal, NestedKV uses three time\-scale key statistics — stable, episodic, and current — as inner learners, each producing its own cosine anomaly ranking\. A training\-free outer learner combines these rankings on two axes: a head\-adaptive softmax mix anchored by a fixed log\-prior, and a per\-token sigmoid gate over compression\-induced surprise that routes between the blended view and the strongest individual memory\.

Our experiments show that continuum-memory scoring is especially useful when single-anchor rules become brittle under stronger compression and longer contexts. The ablations further indicate that stable, episodic, and current memories contribute complementary signals, and that adaptive head-wise allocation amplifies these gains by matching memory budgets to head-specific residual structure.

## Limitations

NestedKV assumes that redundancy with respect to stable, episodic, and current key statistics is a useful signal for cache eviction\. This assumption is strongest for retrieval, question answering, and long\-document understanding, where repeated local context often indicates information already represented by the continuum state\. It can be weaker for code\-completion style tasks: local repetition in code may be the very pattern the model must preserve for the next line\. This suggests that future variants should adapt the continuum weights to input structure, for example, by reducing episodic or current redundancy penalties when the prompt exhibits code\-like local regularity\.

Our experiments are also limited to frozen open\-weight models and training\-free compression during prefill\. NestedKV does not modify decoding kernels or learn task\-specific parameters, which keeps the method simple but leaves open whether learned or query\-aware variants could further improve difficult settings\.

## References

- Longbench: a bilingual, multitask benchmark for long context understanding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 3119–3137. Cited by: [§1](https://arxiv.org/html/2605.26678#S1.p8.3), [Figure 3](https://arxiv.org/html/2605.26678#S2.F3), [§3.1](https://arxiv.org/html/2605.26678#S3.SS1.SSS0.Px2.p1.1).
- A. Behrouz, M. Razaviyayn, P. Zhong, and V. Mirrokni (2026a) Nested learning: the illusion of deep learning architectures. Advances in Neural Information Processing Systems 38, pp. 46968–47002. Cited by: [Appendix B](https://arxiv.org/html/2605.26678#A2.SS0.SSS0.Px3.p1.1), [§1](https://arxiv.org/html/2605.26678#S1.p5.1).
- A. Behrouz, P. Zhong, and V. Mirrokni (2026b) Titans: learning to memorize at test time. Advances in Neural Information Processing Systems 38, pp. 113506–113543. Cited by: [Appendix B](https://arxiv.org/html/2605.26678#A2.SS0.SSS0.Px3.p1.1).
- Z. Cai, Y. Zhang, B. Gao, Y. Liu, Y. Li, T. Liu, K. Lu, W. Xiong, Y. Dong, J. Hu, et al. (2024) Pyramidkv: dynamic kv cache compression based on pyramidal information funneling. arXiv preprint arXiv:2406.02069. Cited by: [Appendix B](https://arxiv.org/html/2605.26678#A2.SS0.SSS0.Px2.p1.1), [§1](https://arxiv.org/html/2605.26678#S1.p1.1), [§1](https://arxiv.org/html/2605.26678#S1.p2.1), [§3.1](https://arxiv.org/html/2605.26678#S3.SS1.SSS0.Px4.p1.1).
- H. Chen, X. Liu, B. Wang, Y. Fan, Y. Chu, Z. Li, X. Chu, and X. Hu (2026) SONIC: segmented optimized nexus for information compression in key-value caching. arXiv preprint arXiv:2601.21927. Cited by: [Appendix B](https://arxiv.org/html/2605.26678#A2.SS0.SSS0.Px2.p1.1).
- P. Dasigi, K. Lo, I. Beltagy, A. Cohan, N. A. Smith, and M. Gardner (2021) A dataset of information-seeking questions and answers anchored in research papers. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4599–4610. Cited by: [Figure 3](https://arxiv.org/html/2605.26678#S2.F3).
- A. Devoto, M. Jeblick, and S. Jégou (2025) Expected attention: kv cache compression by estimating attention from future queries distribution. arXiv preprint arXiv:2510.00636. Cited by: [Appendix B](https://arxiv.org/html/2605.26678#A2.SS0.SSS0.Px1.p1.1), [§3.1](https://arxiv.org/html/2605.26678#S3.SS1.SSS0.Px4.p1.1).
- A. Devoto, Y. Zhao, S. Scardapane, and P. Minervini (2024) A simple and effective l_2 norm-based strategy for kv cache compression. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 18476–18499. Cited by: [Appendix B](https://arxiv.org/html/2605.26678#A2.SS0.SSS0.Px2.p1.1).
- Y. Feng, J. Lv, Y. Cao, X. Xie, and S. K. Zhou (2026) Ada-kv: optimizing kv cache eviction by adaptive budget allocation for efficient llm inference. Advances in Neural Information Processing Systems 38, pp. 113152–113188. Cited by: [Appendix B](https://arxiv.org/html/2605.26678#A2.SS0.SSS0.Px1.p1.1), [§1](https://arxiv.org/html/2605.26678#S1.p1.1), [§1](https://arxiv.org/html/2605.26678#S1.p2.1).
- A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. (2024) The llama 3 herd of models. arXiv preprint arXiv:2407.21783. Cited by: [§3.1](https://arxiv.org/html/2605.26678#S3.SS1.SSS0.Px1.p1.1).
- C. Hsieh, S. Sun, S. Kriman, S. Acharya, D. Rekesh, F. Jia, Y. Zhang, and B. Ginsburg (2024) RULER: what’s the real context size of your long-context language models?. arXiv preprint arXiv:2404.06654. Cited by: [§1](https://arxiv.org/html/2605.26678#S1.p8.3), [§3.1](https://arxiv.org/html/2605.26678#S3.SS1.SSS0.Px2.p1.1).
- J. Li, M. Wang, Z. Zheng, and M. Zhang (2024a) Loogle: can long-context language models understand long contexts?. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 16304–16333. Cited by: [§1](https://arxiv.org/html/2605.26678#S1.p8.3), [§3.1](https://arxiv.org/html/2605.26678#S3.SS1.SSS0.Px2.p1.1).
- Y. Li, Y. Huang, B. Yang, B. Venkitesh, A. Locatelli, H. Ye, T. Cai, P. Lewis, and D. Chen (2024b) Snapkv: llm knows what you are looking for before generation. Advances in Neural Information Processing Systems 37, pp. 22947–22970. Cited by: [Appendix B](https://arxiv.org/html/2605.26678#A2.SS0.SSS0.Px1.p1.1), [§1](https://arxiv.org/html/2605.26678#S1.p1.1), [§1](https://arxiv.org/html/2605.26678#S1.p2.1), [§3.1](https://arxiv.org/html/2605.26678#S3.SS1.SSS0.Px4.p1.1).
- X. Liu, H. Chen, X. Hu, and X. Chu (2025) FlowKV: enhancing multi-turn conversational coherence in llms via isolated key-value cache management. arXiv preprint arXiv:2505.15347. Cited by: [Appendix B](https://arxiv.org/html/2605.26678#A2.SS0.SSS0.Px2.p1.1).
- X. Liu, Z. Tang, H. Chen, P. Dong, Z. Li, X. Zhou, B. Li, X. Hu, and X. Chu (2026) Semantic integrity matters: benchmarking and preserving high-density reasoning in kv cache compression. External Links: 2502.01941, [Link](https://arxiv.org/abs/2502.01941). Cited by: [Appendix B](https://arxiv.org/html/2605.26678#A2.SS0.SSS0.Px2.p1.1).
- Z. Liu, A. Desai, F. Liao, W. Wang, V. Xie, Z. Xu, A. Kyrillidis, and A. Shrivastava (2023) Scissorhands: exploiting the persistence of importance hypothesis for llm kv cache compression at test time. Advances in Neural Information Processing Systems 36, pp. 52342–52364. Cited by: [Appendix B](https://arxiv.org/html/2605.26678#A2.SS0.SSS0.Px1.p1.1), [§1](https://arxiv.org/html/2605.26678#S1.p1.1), [§1](https://arxiv.org/html/2605.26678#S1.p2.1).
- M. Oren, M. Hassid, N. Yarden, Y. Adi, and R. Schwartz (2024) Transformers are multi-state rnns. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 18724–18741. Cited by: [Appendix B](https://arxiv.org/html/2605.26678#A2.SS0.SSS0.Px3.p1.1).
- J. Park, D. Jones, M. Morse, R. Goel, M. Lee, and C. Lott (2026) Keydiff: key similarity-based kv cache eviction for long-context llm inference in resource-constrained environments. Advances in Neural Information Processing Systems 38, pp. 5983–6019. Cited by: [Appendix B](https://arxiv.org/html/2605.26678#A2.SS0.SSS0.Px2.p1.1), [§1](https://arxiv.org/html/2605.26678#S1.p1.1), [§1](https://arxiv.org/html/2605.26678#S1.p2.1), [§3.1](https://arxiv.org/html/2605.26678#S3.SS1.SSS0.Px4.p1.1).
- Y. Wang, X. Ma, G. Zhang, Y. Ni, A. Chandra, S. Guo, W. Ren, A. Arulraj, X. He, Z. Jiang, et al. (2024) Mmlu-pro: a more robust and challenging multi-task language understanding benchmark. Advances in Neural Information Processing Systems 37, pp. 95266–95290. Cited by: [§1](https://arxiv.org/html/2605.26678#S1.p8.3), [§3.1](https://arxiv.org/html/2605.26678#S3.SS1.SSS0.Px2.p1.1), [§3.3](https://arxiv.org/html/2605.26678#S3.SS3.p1.9).
- G. Xiao, Y. Tian, B. Chen, S. Han, and M. Lewis (2024) Efficient streaming language models with attention sinks. In International Conference on Learning Representations, Vol. 2024, pp. 21875–21895. Cited by: [Appendix B](https://arxiv.org/html/2605.26678#A2.SS0.SSS0.Px1.p1.1), [Appendix B](https://arxiv.org/html/2605.26678#A2.SS0.SSS0.Px3.p1.1), [§1](https://arxiv.org/html/2605.26678#S1.p1.1), [§1](https://arxiv.org/html/2605.26678#S1.p2.1), [§3.1](https://arxiv.org/html/2605.26678#S3.SS1.SSS0.Px4.p1.1).
- A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. (2025) Qwen3 technical report. arXiv preprint arXiv:2505.09388. Cited by: [§3.1](https://arxiv.org/html/2605.26678#S3.SS1.SSS0.Px1.p1.1).
- X. Zhang, Y. Chen, S. Hu, Z. Xu, J. Chen, M. K. Hao, X. Han, Z. L. Thai, S. Wang, Z. Liu, et al. (2024) Infty bench: extending long context evaluation beyond 100k tokens. arXiv preprint arXiv:2402.13718. Cited by: [§C.3](https://arxiv.org/html/2605.26678#A3.SS3.p1.12), [§C.4](https://arxiv.org/html/2605.26678#A3.SS4.p1.6), [§1](https://arxiv.org/html/2605.26678#S1.p8.3), [§3.1](https://arxiv.org/html/2605.26678#S3.SS1.SSS0.Px2.p1.1).
- Z. Zhang, Y. Sheng, T. Zhou, T. Chen, L. Zheng, R. Cai, Z. Song, Y. Tian, C. Ré, C. Barrett, et al. (2023) H2o: heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems 36, pp. 34661–34710. Cited by: [Appendix B](https://arxiv.org/html/2605.26678#A2.SS0.SSS0.Px1.p1.1), [§1](https://arxiv.org/html/2605.26678#S1.p1.1), [§1](https://arxiv.org/html/2605.26678#S1.p2.1), [§3.1](https://arxiv.org/html/2605.26678#S3.SS1.SSS0.Px4.p1.1).

## Appendix AImplementation Details

#### Scoring cost\.

NestedKV is training-free and parameter-free. The stable memory requires one mean over the key sequence; episodic memory is computed with block means; and current memory is computed with cumulative sums. The scoring cost is therefore $O(Nd)$ per layer and head. NestedKV only removes cached positions and leaves retained keys and values unchanged, so the compressed memory remains compatible with standard attention implementations.
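
To make the $O(Nd)$ cost concrete, the sketch below computes the three anchors (one global mean, block means, and a trailing-window mean) from a single prefix-sum pass and scores each key by cosine anomaly. It is an illustration under stated assumptions, not the paper's implementation: the function names `cosine_anomaly` and `three_scale_anomalies` are hypothetical, anchors are taken as means of raw keys, and the current window is assumed to be a trailing window that includes the token itself. Blending the three rankings is not shown.

```python
import math

def cosine_anomaly(key, anchor):
    """Return 1 - cos(key, anchor). Zero vectors get anomaly 0.0 by convention here."""
    dot = sum(k * a for k, a in zip(key, anchor))
    norm = math.sqrt(sum(k * k for k in key)) * math.sqrt(sum(a * a for a in anchor))
    return 1.0 - dot / norm if norm > 0.0 else 0.0

def three_scale_anomalies(keys, block_size, window):
    """keys: list of N vectors of dimension d for one layer and head.

    Returns three lists of N anomalies: (stable, episodic, current).
    """
    n, d = len(keys), len(keys[0])

    # prefix[i] is the element-wise sum of keys[0:i]; built in one O(N d) pass.
    prefix = [[0.0] * d]
    for key in keys:
        prefix.append([p + k for p, k in zip(prefix[-1], key)])

    def mean(lo, hi):
        # Mean of keys[lo:hi] in O(d), read off the prefix sums.
        return [(b - a) / (hi - lo) for a, b in zip(prefix[lo], prefix[hi])]

    stable_anchor = mean(0, n)  # assumed: global mean key
    stable, episodic, current = [], [], []
    for i, key in enumerate(keys):
        block_lo = (i // block_size) * block_size
        block_hi = min(block_lo + block_size, n)
        win_lo = max(0, i + 1 - window)  # assumed: trailing window ending at i
        stable.append(cosine_anomaly(key, stable_anchor))
        episodic.append(cosine_anomaly(key, mean(block_lo, block_hi)))
        current.append(cosine_anomaly(key, mean(win_lo, i + 1)))
    return stable, episodic, current
```

Each token does a constant number of $O(d)$ operations after the prefix pass, so the total work is linear in $N$ for every scale.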

#### Inference pipeline\.

All methods are evaluated using the same kvpress-based runner and the same HuggingFace model checkpoints. NestedKV is implemented as a training-free prefill compressor: after the prompt is encoded, each layer computes per-scale anomalies from cached keys, combines them with the head-adaptive blend and surprise-gated route described in Section [2.4](https://arxiv.org/html/2605.26678#S2.SS4), allocates head-wise memory budgets, and removes low-scoring entries before decoding. The head-adaptive blend weights and surprise gates are computed once during this prefill-time compression step and then kept fixed throughout autoregressive decoding. NestedKV does not recompute scores, scale reliabilities, or routes for retained prompt tokens as new tokens are generated; newly decoded tokens are appended normally and remain available to subsequent decoding steps. Adapting the continuum-memory weights during very long-form generation is left to future work.
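
The "compress once at prefill, then append normally" behaviour described above can be sketched for a single head as follows. This is a minimal illustration and not the kvpress API: `select_kept_positions` and `CompressedHeadCache` are hypothetical names, the scores are assumed to be already blended, and the budget is assumed to include the pinned sink tokens.

```python
def select_kept_positions(scores, budget, num_sinks=4):
    """Return sorted prompt positions to keep for one head.

    scores: one score per prompt position (higher means keep).
    budget: total positions to retain, assumed to include pinned sinks.
    """
    n = len(scores)
    sinks = list(range(min(num_sinks, n, budget)))
    remaining = max(budget - len(sinks), 0)
    # Rank the non-sink positions by score, highest first.
    candidates = sorted(range(len(sinks), n), key=lambda i: scores[i], reverse=True)
    return sorted(sinks + candidates[:remaining])


class CompressedHeadCache:
    """Keys and values of one head after a one-time prefill compression."""

    def __init__(self, keys, values, scores, budget):
        kept = select_kept_positions(scores, budget)
        # Retained entries are copied unchanged; only positions are dropped.
        self.keys = [keys[i] for i in kept]
        self.values = [values[i] for i in kept]
        self.kept_positions = kept

    def append(self, key, value):
        # Decoded tokens are appended as usual; prompt scores are not recomputed.
        self.keys.append(key)
        self.values.append(value)
```

Because the selection happens only in the constructor, nothing on the decode path touches scores or routes, which matches the fixed-after-prefill behaviour stated in the text.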

#### Hyperparameters\.

Unless otherwise stated, NestedKV uses log-prior $(w_s^0, w_e^0, w_c^0) = (0.4, 0.4, 0.2)$, block size $\operatorname{clip}(\lfloor N/32 \rfloor, 128, 256)$, current-window size $64$, and $4$ pinned sink tokens. The head-adaptive blend uses $\beta = 3.0$; the surprise gate uses threshold $\tau = 0.60$ and sharpness $\kappa = 10.0$. Before applying the gate, surprise scores are min–max normalized within each head and mean-centered with a rectifier. For head-wise budgeting, we use a per-head safeguard $\alpha_s = 0.20$: for a sequence of length $N$ and eviction ratio $r$, each head first preserves $\lfloor 0.20 \lfloor (1-r)N \rfloor \rfloor$ of its own highest-scoring tokens before the remaining budget is allocated by cross-head competition. These constants are fixed across all models, benchmarks, and compression ratios.
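
As a worked illustration of these constants, the sketch below evaluates the block-size clip and the per-head safeguard for concrete sequence lengths, and shows one plausible form of the surprise gate. The helper names are hypothetical. The gate form $\sigma(\kappa(s - \tau))$ is an assumption inferred from the "threshold" and "sharpness" wording; the exact definition is the one in Section 2.4. Exact rational arithmetic is used so that the floors are not perturbed by floating-point rounding.

```python
import math
from fractions import Fraction

def block_size(n):
    """clip(floor(N / 32), 128, 256)."""
    return max(128, min(256, n // 32))

def safeguard_per_head(n, r, alpha_s=0.20):
    """floor(alpha_s * floor((1 - r) * N)) tokens reserved for each head."""
    retained = math.floor((1 - Fraction(str(r))) * n)
    return math.floor(Fraction(str(alpha_s)) * retained)

def surprise_gate(surprise, tau=0.60, kappa=10.0):
    """Assumed gate form: sigmoid(kappa * (surprise - tau)) on normalized surprise."""
    return 1.0 / (1.0 + math.exp(-kappa * (surprise - tau)))

for n in (4096, 32768):
    print(n, block_size(n), safeguard_per_head(n, r=0.75))
```

For $N = 4096$ at $r = 0.75$ this gives a block size of 128 and 204 safeguarded tokens per head out of 1024 retained; for $N = 32768$ the block size saturates at 256 and the safeguard is 1638 tokens.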

Table 4: Evaluation data scale. “Examples” counts the fixed evaluation set per model. Average input length is measured in tokens under the evaluation prompt construction; RULER evaluates 6,500 samples per context-length setting, and InfiniteBench reflects the 40k input cap.

Table 5: LongBench per-task scores on Qwen3-4B across different KV eviction ratios. NestedKV rows are shaded.

## Appendix BRelated Work

#### Training\-free KV cache eviction and budgeting\.

KV cache compression reduces inference memory by retaining only a subset of past keys and values. A central line of work ranks cached tokens by attention-derived importance. Scissorhands proposes the persistence-of-importance hypothesis, observing that tokens important in previous attention tend to remain important later (Liu et al., [2023](https://arxiv.org/html/2605.26678#bib.bib1)). H2O similarly keeps heavy hitters in attention scores while preserving recent tokens (Zhang et al., [2023](https://arxiv.org/html/2605.26678#bib.bib2)). StreamingLLM identifies the importance of initial attention sinks and combines them with a recent window to support streaming generation (Xiao et al., [2024](https://arxiv.org/html/2605.26678#bib.bib3)). SnapKV uses an observation window at the end of the prompt to estimate which prefix tokens are likely to matter during generation (Li et al., [2024b](https://arxiv.org/html/2605.26678#bib.bib4)). Expected Attention estimates future attention scores from the distribution of future queries rather than directly observing them (Devoto et al., [2025](https://arxiv.org/html/2605.26678#bib.bib13)). Ada-KV addresses a different but related issue: after tokens are scored, the available cache budget should be allocated adaptively across heads rather than uniformly (Feng et al., [2026](https://arxiv.org/html/2605.26678#bib.bib6)). NestedKV follows this line of training-free compression, but defines token importance as a multi-time-scale anomaly of the stored key stream rather than as a single attention, recency, or head-allocation statistic.

#### Geometric, structured, and conversational cache compression\.

Other methods avoid direct attention scoring and compress the cache using structure in key/value representations or dialogue history. KeyDiff ranks tokens by their cosine difference from the mean key direction, showing that geometric distinctiveness is a strong training-free signal for long-context inference (Park et al., [2026](https://arxiv.org/html/2605.26678#bib.bib7)). L2-norm-based compression uses the magnitude of key embeddings as a simple importance proxy (Devoto et al., [2024](https://arxiv.org/html/2605.26678#bib.bib12)). PyramidKV studies layer-wise differences in cache importance and allocates memory according to a pyramidal information-funneling pattern (Cai et al., [2024](https://arxiv.org/html/2605.26678#bib.bib5)). FlowKV studies multi-turn conversations and isolates the KV cache of completed turns to avoid repeatedly recompressing older conversational context (Liu et al., [2025](https://arxiv.org/html/2605.26678#bib.bib8)). SONIC further emphasizes segment-level structure by compressing historical dialogue segments into compact Nexus tokens (Chen et al., [2026](https://arxiv.org/html/2605.26678#bib.bib9)). Beyond efficiency, KVFundaBench shows that compression can affect fundamental abilities such as world knowledge, reasoning, code generation, safety, and long-context generation in task-dependent ways (Liu et al., [2026](https://arxiv.org/html/2605.26678#bib.bib10)). NestedKV is closest in spirit to key-space geometric methods, but replaces the single global anchor with a continuum anchor spanning global, block-local, and recent statistics.

#### Nested learning and test\-time memory\.

Several recent works reinterpret Transformer inference as maintaining a bounded recurrent or memory state rather than as unbounded full-context attention. StreamingLLM emphasizes attention sinks as persistent state variables for streaming generation (Xiao et al., [2024](https://arxiv.org/html/2605.26678#bib.bib3)), while TOVA explicitly connects KV cache compression to bounded multi-state RNNs (Oren et al., [2024](https://arxiv.org/html/2605.26678#bib.bib11)). Titans introduces a neural long-term memory module that learns to memorize at test time, contrasting attention as short-term memory with a persistent learned memory state (Behrouz et al., [2026b](https://arxiv.org/html/2605.26678#bib.bib15)). Nested Learning further frames models as nested optimization problems over context flows and presents HOPE with a Continuum Memory System spanning multiple memory time scales and a self-modifying update rule that adapts to compression-induced surprise (Behrouz et al., [2026a](https://arxiv.org/html/2605.26678#bib.bib14)). NestedKV instantiates both ideas for frozen LLM KV compression. The continuum-memory view drives the inner learners: each cached key is read against stable, episodic, and current memory anchors, producing three anomaly rankings rather than one. The self-modifying view drives the outer learner: cross-head reliability gaps and per-token cross-scale disagreement determine, respectively, how the three rankings are mixed on each head and whether the score falls back to the strongest individual memory on each token. Neither axis introduces trainable parameters or modifies the underlying LLM.

## Appendix CAdditional Results

### C\.1LongBench

Tables [5](https://arxiv.org/html/2605.26678#A1.T5) and [6](https://arxiv.org/html/2605.26678#A3.T6) report per-task scores for the eight LongBench tasks used in Figure [5](https://arxiv.org/html/2605.26678#S3.F5), covering both models across all evaluated eviction ratios. Task abbreviations: NarrQA = NarrativeQA, MF-QA = MultiFieldQA-en, 2Wiki = 2WikiMQA, GovRpt = GovReport, TrivQA = TriviaQA, PsgRet = Passage-Retrieval-en.

Table 6: LongBench per-task scores on Llama-3.2-3B-Instruct across different KV eviction ratios. NestedKV rows are shaded.

### C\.2LongBench\-E

Table [7](https://arxiv.org/html/2605.26678#A3.T7) reports LongBench-E results, the length-balanced variant of LongBench with 0–4k, 4–8k, and 8k+ input buckets, averaged over its 13 tasks. On the smaller Llama-3.2-3B-Instruct model, NestedKV is within a fraction of a point of the best baseline at $r \leq 0.50$, trails KeyDiff by 2.2 points at $r = 0.75$, and is the top method at $r \in \{0.85, 0.95\}$. On Qwen3-4B, NestedKV stays within roughly four points of the best attention-based baseline at $r \leq 0.85$ and becomes the top method at $r = 0.95$. The pattern is consistent with the main text: continuum-memory scoring is most useful under aggressive compression, while at moderate compression all attention- and key-based methods cluster tightly.

Table 7: LongBench-E 13-task average across compression ratios. Shaded rows are NestedKV.

### C\.3InfiniteBench longbook\_qa\_eng

Table 8: InfiniteBench longbook_qa_eng F1 score (%) on Qwen3-4B over all $n=351$ examples with a 40k input cap. Shaded row is NestedKV; per-column best in bold.

Table [8](https://arxiv.org/html/2605.26678#A3.T8) reports the long-document QA setting of InfiniteBench (Zhang et al., [2024](https://arxiv.org/html/2605.26678#bib.bib22)), the longbook_qa_eng task, in which the model must answer free-form questions over book-length English contexts truncated to 40k tokens. Absolute F1 is low for all methods ($\sim$10%) because Qwen3-4B is itself a weak book-length QA model under token-level F1 with 1–3 word gold answers. The informative signal is robustness as the retained cache shrinks. At light and moderate compression ($r = 0.25, 0.50$), all methods except StreamingLLM remain within roughly three points of the Full KV baseline (12.09), so no method is meaningfully separated. The separation appears only at aggressive compression: at $r = 0.75$ NestedKV (11.20) is the best method and is the only one that does not degrade (its score is unchanged from $r = 0.25$), while KeyDiff, SnapKV, and ExpAttn fall to the 6.8–7.8 range. PyramidKV is the only baseline that remains comparably robust (10.19). The combination of stable, episodic, and current anchors keeps a more diverse set of supporting tokens than a single-anchor scorer, which is what preserves usable accuracy once the budget becomes tight.
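
To see why token-level F1 stays low when gold answers are only one to three words long, consider the usual bag-of-tokens F1 sketched below. This is a generic illustration with a hypothetical `token_f1` helper that assumes lowercasing and whitespace tokenization; the benchmark's own answer normalization is not reproduced here, and the example strings are made up.

```python
from collections import Counter

def token_f1(prediction, gold):
    """Token-level F1 between two lowercased, whitespace-tokenized strings."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# A correct but wordy answer is penalized through precision.
print(token_f1("the answer is elizabeth bennet", "Elizabeth Bennet"))
```

With a two-word gold answer, a correct five-word response scores about 0.57 because precision is only 2/5, so even a model that finds the right span loses much of the credit unless it answers tersely.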

### C\.4InfiniteBench code\_debug

Table 9: InfiniteBench code_debug accuracy (%) on Qwen3-4B over all $n=394$ examples with a 40k input cap. Shaded row is NestedKV; per-column best in bold.

Table [9](https://arxiv.org/html/2605.26678#A3.T9) complements the MMLU-Pro analysis in Section [3.3](https://arxiv.org/html/2605.26678#S3.SS3) with a code-intensive multiple-choice task from InfiniteBench (Zhang et al., [2024](https://arxiv.org/html/2605.26678#bib.bib22)). The model must identify which function in a repository-scale Python context contains a deliberate bug. Because the answer is a single multiple-choice label rather than a free-form span, methods are far more robust to compression here than on longbook_qa_eng: every method except PyramidKV stays within a few points of the Full KV baseline (34.26) across all three ratios. NestedKV is the strongest compressed method at every ratio (33.76, 34.01, 31.22), remaining within about three points of Full KV, while PyramidKV is the weakest (23.9–26.7). This indicates that the continuum-memory score preserves compact code-level answer selection at least as well as any single-signal baseline, and does so consistently as the budget tightens.

### C\.5Cross\-Benchmark Ablation

![Refer to caption](https://arxiv.org/html/2605.26678v1/x7.png)

Figure 7: Cross-benchmark ablation on Qwen3-4B at $r=0.75$. Red bars show full NestedKV; other bars remove adaptive budgeting, continuum scoring, or both.

Figure [7](https://arxiv.org/html/2605.26678#A3.F7) extends the RULER 4k ablation of Section [3.4](https://arxiv.org/html/2605.26678#S3.SS4) to LongBench and LooGLE. The qualitative picture is consistent across the four benchmarks: NestedKV (full) is the top configuration on every column, and removing both components is the worst on three of the four. However, the relative contribution of the two components depends strongly on the task. On RULER 4k the two components contribute almost equally ($-8.41$ without adaptive allocation vs $-7.99$ without continuum scoring), but on LongBench and LooGLE-long the continuum score is the dominant factor: $-14.89$ without continuum scoring vs $-3.18$ without adaptive allocation on LongBench, and $-6.04$ vs $-2.47$ on LooGLE-long, i.e. continuum scoring carries between $2.4\times$ and $4.7\times$ the per-task value of head-adaptive allocation. On LooGLE-short, the two components are again comparable and small individually, but jointly account for $-9.67$ points, the strongest super-additivity among the ablations. This pattern suggests that the controlled synthetic structure of RULER masks how much of NestedKV’s gain on real-world long-document tasks is driven by the continuum-memory score, rather than by adaptive head-wise allocation alone.
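
The $2.4\times$ to $4.7\times$ range follows from the per-component drops quoted in this paragraph. The short sketch below recomputes the ratios; the dictionary layout is illustrative, and each pair is ordered as (drop without continuum scoring, drop without adaptive allocation).

```python
# Per-component score drops (points) at r = 0.75, as quoted in the text.
# Each tuple: (without continuum scoring, without adaptive allocation).
drops = {
    "RULER 4k": (7.99, 8.41),
    "LongBench": (14.89, 3.18),
    "LooGLE-long": (6.04, 2.47),
}

# Ratio > 1 means continuum scoring matters more than adaptive allocation.
ratios = {name: continuum / adaptive for name, (continuum, adaptive) in drops.items()}
for name, ratio in ratios.items():
    print(f"{name}: continuum/adaptive = {ratio:.2f}x")
```

The ratios come out at roughly 0.95 on RULER 4k, 4.68 on LongBench, and 2.45 on LooGLE-long, so the two components are balanced only on the synthetic benchmark.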

### C\.6Hyperparameter Sensitivity

![Refer to caption](https://arxiv.org/html/2605.26678v1/x8.png)

Figure 8: Hyperparameter sensitivity on Qwen3-4B RULER 4k at $r=0.75$. Red markers indicate the default schedule used in the main results.

Figure [8](https://arxiv.org/html/2605.26678#A3.F8) reports the RULER 4k score at $r=0.75$ as each block/window schedule knob is moved one step away from the default. Across the three axes the scores stay within 2.04 points of the default; the smallest block clip ($[64, 128]$) is the only configuration losing more than 1.2 points, and the largest block fraction ($N/64$) is marginally better than the default (+0.46, within run-to-run noise). NestedKV is therefore robust to the precise block and window schedule in the explored neighbourhood, and the default schedule sits within 0.47 points of the best observed configuration.

![Refer to caption](https://arxiv.org/html/2605.26678v1/x9.png)

Figure 9: Router/prior hyperparameter sensitivity on Qwen3-4B RULER 4k at $r=0.75$. Red markers indicate the default configuration used in the main results.

Figure [9](https://arxiv.org/html/2605.26678#A3.F9) reports the same robustness probe for the four router and prior knobs introduced in Section [2.4](https://arxiv.org/html/2605.26678#S2.SS4): the mixing temperature $\beta$, the gate threshold $\tau$, the gate sharpness $\kappa$, and the stable-memory prior weight $w_s^0$. Within the explored neighbourhoods, all eleven cells stay within 0.94 points of each other and within 0.58 points of the default. NestedKV does not require per-task tuning of the surprise-router schedule.

## Appendix DPotential Risks

NestedKV is an inference\-time efficiency method and does not train a new language model, introduce new datasets, or add new generation capabilities\. Its main risk is therefore indirect: by reducing the memory cost of long\-context inference, it may lower the cost of deploying existing LLMs in applications that process sensitive, copyrighted, or otherwise high\-risk long documents\. Any downstream deployment should inherit the safety, privacy, and data\-governance controls required for the underlying model and application domain\.

A second risk is reliability\. KV cache compression changes the effective context available to the model, and incorrect eviction may silently remove evidence needed for faithful generation\. This is particularly important in high\-stakes settings such as legal, medical, financial, or security\-sensitive document analysis\. We therefore view NestedKV as an efficiency technique that should be validated on the target task and compression ratio before deployment, rather than as a guarantee\-preserving replacement for full\-cache inference\.

## Appendix ELicense Information

All experiments use publicly released research artifacts, and we do not redistribute model weights or benchmark data. The Qwen3 checkpoints used in our experiments are released under Apache License 2.0. The Llama-3.2-1B-Instruct and Llama-3.2-3B-Instruct checkpoints are governed by the Llama 3.2 Community License. The kvpress library used for evaluation is released under Apache License 2.0.

For benchmarks, RULER’s public generation code is released under Apache License 2\.0\. LongBench and LongBench\-E are associated with the LongBench public repository under the MIT License\. LooGLE’s public HuggingFace dataset card lists CC\-BY\-SA\-4\.0\. InfiniteBench is released under the MIT License, and MMLU\-Pro is released under the MIT License\. Some HuggingFace mirrors used by our runner do not specify independent license metadata; in those cases we follow the upstream benchmark licenses where available\.
