Probing the Prompt KV Cache: Where It Becomes Dispensable
Summary
This paper systematically investigates when and which parts of the prompt KV cache become dispensable during LLM decoding, showing that redundancy primarily involves chat template scaffolding rather than task content, and replacement with neutral filler preserves accuracy.
# Probing the Prompt KV Cache: Where It Becomes Dispensable
Source: [https://arxiv.org/html/2605.30574](https://arxiv.org/html/2605.30574)
Vinayshekhar Bannihatti Kumar, Manoj Ghuhan Arivazhagan, Disha Makhija, Rashmi Gangadharaiah

AWS AI Labs

{vinayshk,mghuhan,dismakhi,rgangad}@amazon.com
###### Abstract
Prior KV cache compression schemes empirically demonstrate that the prompt cache is partially redundant during decoding, dropping or summarising entries with little accuracy loss. We ask *when* and *what kind* of redundancy this is: at which layers, after how many decoding steps, and in what form can the prompt-span KV cache be replaced without breaking the task? A controlled splice intervention swept over layer cutoff and decoding steps shows this redundancy is about *form* (chat-template scaffolding) rather than content. Replacing the upper-layer prompt-span KV cache with the KV cache from a chat-template scaffold whose user content is a neutral filler recovers near-clean accuracy, while zeroing the same slots collapses accuracy. The dissociation replicates across the Qwen3, Gemma 3, and Llama 3 families on multiple datasets.
Equal contribution: Manoj Ghuhan Arivazhagan and Disha Makhija.
## 1 Introduction
A growing body of work hints that LLMs use the prompt differently at different stages of the forward pass. Attention concentrates on the first few positions as a positional sink (Xiao et al., [2024](https://arxiv.org/html/2605.30574#bib.bib1)), the massive activations behind these sinks also explain mid-layer compression valleys (Queipo-de-Llano et al., [2025](https://arxiv.org/html/2605.30574#bib.bib2)), in-context demonstrations are summarised into compact internal task and function vectors (Hendel et al., [2023](https://arxiv.org/html/2605.30574#bib.bib11); Todd et al., [2024](https://arxiv.org/html/2605.30574#bib.bib12)), and cross-prompt KV substitution attacks show that generation can be hijacked by overwriting the trailing token positions across all layers of the cache with another prompt's KV (Ganesh et al., [2025](https://arxiv.org/html/2605.30574#bib.bib3)), though that work studies attack feasibility rather than what the cache encodes. KV-cache compression schemes (Cai et al., [2024](https://arxiv.org/html/2605.30574#bib.bib4); Li et al., [2024](https://arxiv.org/html/2605.30574#bib.bib5); Zhang et al., [2023](https://arxiv.org/html/2605.30574#bib.bib6); Ge et al., [2024](https://arxiv.org/html/2605.30574#bib.bib7); Liu et al., [2023](https://arxiv.org/html/2605.30574#bib.bib8)) go further and empirically show that large fractions of the prompt cache can be dropped or summarised with little accuracy loss, indicating partial redundancy. Each line of work, however, addresses the question at a narrow scope (four sink tokens, BOS-token activations, a single task vector, an all-layer-or-nothing swap with one donor pair, or a specific pruning policy) and leaves open the question of when the prompt-span KV cache can be replaced: at which layers, after how many decoding steps, and with what donor content? The answer determines whether compression schemes must preserve entries, drop them, or substitute placeholders.
We address this with a controlled splice intervention swept over layer cutoff and decoding steps with donor caches varying in how much of the prompt’s form \(chat\-template scaffolding\) and content \(the user’s task\) they preserve, tracing a phase diagram of when the prompt\-span KV cache can be replaced without breaking the task\.
Our contributions are: (i) The first systematic characterisation of *when* the decode-time prompt KV cache becomes dispensable. (ii) *What* of the prompt the cache must retain at upper layers, via a form-vs-content dissociation showing that chat-template scaffolding is sufficient and task content is dispensable, replicated across multiple LLMs and datasets. (iii) A donor-noise ladder over a 180-variant algorithmic-donor benchmark we built to systematically vary noise, in which the donor cache injects increasing amounts of noise (same-algo $\to$ diff-algo $\to$ diff-family). Together these give an end-to-end picture suggesting that at upper layers form alone is sufficient, while at lower layers the cache content is causally consulted, with recovery cost growing as the donor drifts from the target task.
Figure 1: Qwen3-4B heatmaps for zero and blank on GSM8K (top) and MBPP (bottom). Panels: (a) GSM8K, blank; (b) GSM8K, zero; (c) MBPP, blank; (d) MBPP, zero. Each cell shows pass% at one $(L, W)$. Blank recovers far faster than zero along both $L$ and $W$.
## 2 Models and Datasets
We evaluate Qwen3-4B, Gemma-3-4B-IT, Qwen3-8B, and Llama-3-8B-Instruct (Yang et al., [2025](https://arxiv.org/html/2605.30574#bib.bib18); Gemma Team, [2025](https://arxiv.org/html/2605.30574#bib.bib19); Grattafiori et al., [2024](https://arxiv.org/html/2605.30574#bib.bib20)) with greedy decoding on four datasets: GSM8K, MBPP, HumanEval, and the algorithmic-donor benchmark described below. GSM8K (Cobbe et al., [2021](https://arxiv.org/html/2605.30574#bib.bib15)) is a standard chain-of-thought arithmetic benchmark, and we score answers by exact match on the final numeric value. MBPP (Austin et al., [2021](https://arxiv.org/html/2605.30574#bib.bib16)) is a standalone Python code-completion benchmark, and we score completions by passing the test suite shipped with each problem. HumanEval (Chen et al., [2021](https://arxiv.org/html/2605.30574#bib.bib17)) is a Python function-completion benchmark of 164 hand-written programming problems, each with a function signature, docstring, and hidden unit tests, and we score completions by test-suite pass. From each of the datasets, we sample 100 problems to keep the experiments manageable.

Algorithmic-donor benchmark. A Python benchmark spanning 9 algorithmic families (search, sort, graph traversal, shortest path, fibonacci, max-subarray, primes, two-sum, string-match) with 10 LLM-generated problems per family. Each problem appears in two prompt variants that pose the same underlying problem but pin the solution to a different algorithm class (e.g., the same search-family problem is asked once as linear-search and once as binary-search), giving 180 variants indexed by a (problem $p$, algorithm-class $c$, family $f$) triple, each shipped with a hidden test suite. Correctness is scored as target-algorithm match (via LLM judge) plus test-suite pass.
## 3 Experimental Setup
### Splice intervention\.
We study chat-tuned decoder transformers with $N$ layers. Prefilling on a prompt of length $T_p$ yields a KV cache $K^{(\ell)}, V^{(\ell)} \in \mathbb{R}^{T_p \times d}$ at each layer $\ell \in \{0, \ldots, N-1\}$, and during decoding step $t$ the model attends over the concatenation of these prompt-span entries with the generated-span entries produced by the previous $t$ steps.

Every experiment in this paper is an instance of a single intervention parameterised by three quantities: a cumulative layer cutoff $L$, a splice onset $W$ (the number of decoding steps performed before the intervention), and a donor cache $\tilde{K}, \tilde{V}$. The cutoff $L$ defines the patched layer set $\mathcal{P} = \{\ell : \ell \geq L\}$, and layers outside $\mathcal{P}$ are unmodified. The onset $W$ defines the intervention timing. For decoding steps $t < W$ the unmodified cache is used, and at step $t = W$ the prompt-span entries at every layer $\ell \in \mathcal{P}$ are replaced in place, $K^{(\ell)}_{0:T_p} \leftarrow \tilde{K}^{(\ell)}_{0:T_p}$ and likewise for $V$, after which decoding continues from the modified cache through to completion. Generated-span entries are never modified. The donor cache $\tilde{K}, \tilde{V}$ takes one of five forms (Table [1](https://arxiv.org/html/2605.30574#S3.T1)). For the three forms produced from an alternate prompt, we trim the longer prompt's scaffold at the character level until target and donor prompts tokenise to the same length, so the patched slots are positionally aligned.

Table 1: Donor caches. All interventions modify the same prompt-span positions at the same layers under the same splice onset; only the content varies. Algorithmic-donor variants are indexed by a triple $(p, c, f)$ of problem $p$, algorithm-class $c$, family $f$.
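The replacement rule above can be illustrated with a minimal sketch. It assumes the cache is held as one NumPy array per layer with rows indexed by position; the function names `splice_prompt_cache` and `zero_donor` and the toy sizes are hypothetical, and only the zero donor is built here because the other donors need a model forward pass.

```python
import numpy as np


def splice_prompt_cache(keys, values, donor_keys, donor_values, layer_cutoff, prompt_len):
    """Overwrite prompt-span rows at every layer >= layer_cutoff, in place.

    keys/values: lists with one (positions x d) array per layer (assumed layout).
    Returns the list of patched layer indices.
    """
    patched_layers = list(range(layer_cutoff, len(keys)))
    for layer in patched_layers:
        # Only rows 0..prompt_len-1 change; generated-span rows are left alone.
        keys[layer][:prompt_len] = donor_keys[layer][:prompt_len]
        values[layer][:prompt_len] = donor_values[layer][:prompt_len]
    return patched_layers


def zero_donor(keys, values, prompt_len):
    """Build an all-zero donor with the same prompt-span shape as the target."""
    donor_keys = [np.zeros_like(k[:prompt_len]) for k in keys]
    donor_values = [np.zeros_like(v[:prompt_len]) for v in values]
    return donor_keys, donor_values


# Toy cache: 6 layers, 4 prompt positions, 3 generated positions, d = 8.
rng = np.random.default_rng(0)
num_layers, prompt_len, generated_len, dim = 6, 4, 3, 8
keys = [rng.normal(size=(prompt_len + generated_len, dim)) for _ in range(num_layers)]
values = [rng.normal(size=(prompt_len + generated_len, dim)) for _ in range(num_layers)]
original_keys = [k.copy() for k in keys]

donor_keys, donor_values = zero_donor(keys, values, prompt_len)
patched = splice_prompt_cache(keys, values, donor_keys, donor_values,
                              layer_cutoff=4, prompt_len=prompt_len)
print("patched layers:", patched)
```

With `layer_cutoff=4` on six layers, only layers 4 and 5 are touched, and their rows from position `prompt_len` onward keep their original values.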
### Sweep and reporting\.
For every (model, task, intervention) we run the full grid $L \in \{0, 3, 6, \ldots, N-1\}$, $W \in \{0, 5, 10, \ldots, 200\}$ plus the unintervened *clean* baseline. We summarise each condition as (i) an $L \times W$ phase-diagram heatmap and (ii) a *recovery frontier* $W^{\star}(L; \alpha) = \min\{W : \mathrm{pass}(L, W) \geq \alpha\, p_{\mathrm{clean}}\}$, where $p_{\mathrm{clean}}$ is the unintervened pass rate of the same model on the same task and $\alpha \in (0, 1]$.
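As a minimal sketch of how the recovery frontier could be read off a sweep, the function below applies the definition to a nested dictionary of pass rates. The dictionary layout, the name `recovery_frontier`, and the choice to return `None` when no onset clears the threshold are assumptions, and the pass rates are made-up numbers rather than results from the paper.

```python
def recovery_frontier(pass_rates, clean_rate, alpha):
    """Smallest onset W per cutoff L with pass(L, W) >= alpha * clean_rate.

    pass_rates: {L: {W: pass rate}} (assumed layout). A cutoff that never
    clears the threshold maps to None, since the min is undefined there.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must lie in (0, 1]")
    threshold = alpha * clean_rate
    frontier = {}
    for cutoff, by_onset in pass_rates.items():
        recovered = [onset for onset, rate in by_onset.items() if rate >= threshold]
        frontier[cutoff] = min(recovered) if recovered else None
    return frontier


# Onset grid from the text: W in {0, 5, 10, ..., 200}.
onset_grid = list(range(0, 201, 5))

# Made-up pass rates (percent) on a tiny sub-grid, for illustration only.
toy_pass_rates = {
    0: {0: 0.0, 5: 10.0, 10: 30.0, 15: 50.0},
    3: {0: 20.0, 5: 65.0, 10: 78.0, 15: 80.0},
    6: {0: 79.0, 5: 80.0, 10: 80.0, 15: 80.0},
}
toy_frontier = recovery_frontier(toy_pass_rates, clean_rate=80.0, alpha=0.75)
print(toy_frontier)  # {0: None, 3: 5, 6: 0}
```

Here the threshold is $0.75 \times 80 = 60$, so the shallowest cutoff never recovers within the toy grid, the middle one recovers at $W = 5$, and the deepest one needs no pre-splice decoding.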
## 4 Results and Analysis


Figure 2: Recovery frontier $W^{\star}(L; \alpha = 0.75)$ for zero (blue) vs blank (orange) on GSM8K (top) and MBPP (bottom); columns are the four models. Blank requires less pre-splice decoding than zero at every $L$.

Figure 3: Recovery frontier $W^{\star}(L; \alpha)$ on the algorithmic-donor benchmark for Qwen3-8B and Llama-3-8B-Instruct at $\alpha \in \{0.4, 0.67\}$. Recovery cost grows with donor noise: diff-family (red) is highest, blank (blue) lowest.

### (R1) Form is sufficient, content is dispensable.
We probe whether the prompt cache content matters at upper layers in two steps. We first ask whether *task content* is required at all by running the blank intervention (Table [1](https://arxiv.org/html/2605.30574#S3.T1)). Recovery is rapid on both axes (Fig. [1](https://arxiv.org/html/2605.30574#S1.F1)). On Qwen3-4B GSM8K (clean $= 90\%$, median length $= 147$), blank reaches $88\%$ at $L = 30, W = 0$; sweeping along $W$ at fixed $L = 27$, it hits $87\%$ by $W = 25$. MBPP (clean $= 71\%$, median length $= 79$) and HumanEval (clean $= 79\%$, median length $= 121$) reproduce the shape observed on GSM8K. This shows that retaining the prompt's structural form, while discarding its task content, is sufficient at upper layers.

If task content is not required, what is? We push further and remove the structural scaffolding too via zero. Recovery still happens, but is markedly slower on both axes than blank. On Qwen3-4B GSM8K at $W = 0$, zero sits at $1\%$ at $L = 27$, climbs to $48\%$ at $L = 30$, and only matches clean at $L = 33$; along $W$ at $L = 27$, it crosses $87\%$ only near $W \approx 170$, roughly seven times the pre-splice decoding steps that blank needs. MBPP and HumanEval both show the same gap. The two interventions converge at the boundaries. As seen in Fig. [1](https://arxiv.org/html/2605.30574#S1.F1), both match clean at $L = 33$ and both collapse at shallow $L$ with $W = 0$, where no substituted cache can compensate for missing prompt context the model has not yet seen. We conclude that at upper layers the prompt's *form*, including chat-template scaffolding and positional structure, is what must be preserved, while the task *content* carried within can be replaced with a neutral filler.
### \(R2\) Phase diagram across models and tasks\.
The (R1) shape generalises beyond a single model and task. The recovery frontier $W^{\star}(L; \alpha = 0.75)$ in Fig. [2](https://arxiv.org/html/2605.30574#S4.F2) plots zero (blue) and blank (orange) for all four models on GSM8K (top) and MBPP (bottom). On both tasks, blank recovers faster: its frontier sits consistently to the left of zero in every panel of the figure. At intermediate $L$ the gap is large: on Qwen3-4B GSM8K it is $\sim$60 steps at $L = 15$, widening to $\sim$115 steps at $L = 24$, and both lines collapse to $W = 0$ only at the topmost cutoffs (blank from $L \approx N - 6$ onwards, zero a few layers deeper). The gap shrinks for the larger Qwen3-8B but does not invert, and the qualitative blank-to-the-left pattern holds in every panel we plot. The same dissociation holds on the algorithmic-donor benchmark, where the blank intervention recovers faster than any other donor type (Fig. [3](https://arxiv.org/html/2605.30574#S4.F3)).
### \(R3\) Donor noise determines lower\-layer recovery cost\.
At lower layers the prompt cache is no longer dispensable: what gets injected matters, and how much it matters tracks how far the donor cache is from the target. Fig. [3](https://arxiv.org/html/2605.30574#S4.F3) plots the recovery frontier $W^{\star}(L; \alpha)$ on the algorithmic-donor benchmark for Qwen3-8B and Llama-3-8B-Instruct at two thresholds ($\alpha = 0.4, 0.67$). Blank, which preserves only the prompt's structural form, is closest to the $W = 0$ axis at every $L$, consistent with (R1): when no task content is injected, decoding recovers fastest. The donor curves degrade as the donor's content drifts from the target task. The within-family donors (same-algo and diff-algo) track each other and require modest pre-splice decoding to recover even at shallow $L$; the across-family donor (diff-family) sits consistently above them, requiring substantially more $W$ to clear the same threshold. The pattern holds for both Qwen3-8B and Llama-3-8B-Instruct and is sharper at the stricter $\alpha = 0.67$, where the gap between diff-family and the within-family lines widens to tens of decoding steps. Together with (R1), these results give an end-to-end picture: at upper layers the model needs only the prompt's structural form, while at lower layers it actively processes the cache content, with quality degrading as the injected content moves further from the target task. See the appendix for per-model heatmaps (Appendix [A](https://arxiv.org/html/2605.30574#A1)), per-donor heatmaps (Appendix [F](https://arxiv.org/html/2605.30574#A6)), results on the HumanEval dataset (Appendix [E](https://arxiv.org/html/2605.30574#A5)), splice pseudocode (Appendix [C](https://arxiv.org/html/2605.30574#A3)), and example generations (Appendix [D](https://arxiv.org/html/2605.30574#A4)).
## 5 Related Work
Prior work documents prompt-cache redundancy at narrow scopes, including positional sinks at the first $\sim$4 tokens (Xiao et al., [2024](https://arxiv.org/html/2605.30574#bib.bib1)), mid-layer compression valleys driven by massive activations (Queipo-de-Llano et al., [2025](https://arxiv.org/html/2605.30574#bib.bib2)), the lost-in-the-middle effect in which models underperform on relevant content placed in the middle of long contexts (Liu et al., [2024](https://arxiv.org/html/2605.30574#bib.bib21)), and ICL summarisation into compact internal task and function vectors (Hendel et al., [2023](https://arxiv.org/html/2605.30574#bib.bib11); Todd et al., [2024](https://arxiv.org/html/2605.30574#bib.bib12)). A complementary mechanistic line argues that in-context copying is implemented by induction heads, a distributed attention pattern rather than a vector summary (Olsson et al., [2022](https://arxiv.org/html/2605.30574#bib.bib10)). KV-cache compression schemes (Cai et al., [2024](https://arxiv.org/html/2605.30574#bib.bib4); Li et al., [2024](https://arxiv.org/html/2605.30574#bib.bib5); Zhang et al., [2023](https://arxiv.org/html/2605.30574#bib.bib6); Ge et al., [2024](https://arxiv.org/html/2605.30574#bib.bib7); Liu et al., [2023](https://arxiv.org/html/2605.30574#bib.bib8)) prune by attention magnitude, and our results suggest a complementary axis of substitution by structural placeholders at upper layers. Ganesh et al. ([2025](https://arxiv.org/html/2605.30574#bib.bib3)) is the closest methodologically, an attack feasibility study that hijacks generation by overwriting trailing token positions across all KV layers with a fixed donor. We ask a different question: we run a graded donor-noise ladder, expose an algorithmic-overlap inversion, and add the form-vs-content (zero-vs-blank) dissociation.
## 6 Conclusion
A controlled splice intervention across layers and decoding steps reveals a depth-dependent split in what the prompt KV cache must carry. At upper layers, chat-template scaffolding alone is sufficient: content-free filler recovers near-clean accuracy while zeroing the same slots collapses generation. At lower layers, content is causally active: donor caches degrade accuracy in proportion to their distance from the target task, with cross-family donors requiring substantially more decoding steps to recover than within-family donors. For KV-cache compression, the practical implication is that upper-layer prompt entries can be replaced with structural placeholders rather than dropped, while lower-layer entries must preserve task-specific content.
## Limitations
Our splice intervention requires write access to the prompt-span key/value cache, which restricts the analysis to settings where model internals can be modified at inference time. It does not directly characterise behaviour observable through standard inference APIs. The study covers four chat-tuned decoder-only transformers in the 4B to 8B parameter range and three task domains alongside an algorithmic-donor benchmark. We have not investigated encoder-decoder architectures, base models without instruction tuning, or substantially larger models. The absolute layer depth at which form alone becomes sufficient is reported per model rather than predicted from architectural quantities. Finally, our blank construction relies on a chat-template scaffold filled with a single neutral filler token. This isolates form from content cleanly but does not distinguish which structural elements, such as role markers, length, position, or specific template tokens, carry the load. We leave a finer decomposition of form to future work.
## References
- J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, and C. Sutton (2021). Program synthesis with large language models. arXiv preprint arXiv:2108.07732.
- Z. Cai, Y. Zhang, B. Gao, Y. Liu, Y. Li, T. Liu, K. Lu, W. Xiong, Y. Dong, J. Hu, and W. Xiao (2024). PyramidKV: dynamic KV cache compression based on pyramidal information funneling. arXiv preprint arXiv:2406.02069.
- M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al. (2021). Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
- K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman (2021). Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
- M. Ganesh, K. Iyer, and A. B. S. Ananthan (2025). Whose narrative is it anyway? a KV cache manipulation attack. arXiv preprint arXiv:2511.12752.
- S. Ge, Y. Zhang, L. Liu, M. Zhang, J. Han, and J. Gao (2024). Model tells you what to discard: adaptive KV cache compression for LLMs. In International Conference on Learning Representations (ICLR). arXiv:2310.01801.
- Gemma Team (2025). Gemma 3 technical report. arXiv preprint arXiv:2503.19786.
- A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. (2024). The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.
- R. Hendel, M. Geva, and A. Globerson (2023). In-context learning creates task vectors. In Findings of the Association for Computational Linguistics: EMNLP 2023. arXiv:2310.15916.
- Y. Li, Y. Huang, B. Yang, B. Venkitesh, A. Locatelli, H. Ye, T. Cai, P. Lewis, and D. Chen (2024). SnapKV: LLM knows what you are looking for before generation. arXiv preprint arXiv:2404.14469.
- N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang (2024). Lost in the middle: how language models use long contexts. Transactions of the Association for Computational Linguistics 12, pp. 157–173. arXiv:2307.03172.
- Z. Liu, A. Desai, F. Liao, W. Wang, V. Xie, Z. Xu, A. Kyrillidis, and A. Shrivastava (2023). Scissorhands: exploiting the persistence of importance hypothesis for LLM KV cache compression at test time. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:2305.17118.
- C. Olsson, N. Elhage, N. Nanda, N. Joseph, N. DasSarma, T. Henighan, B. Mann, A. Askell, Y. Bai, A. Chen, T. Conerly, D. Drain, D. Ganguli, Z. Hatfield-Dodds, D. Hernandez, S. Johnston, A. Jones, J. Kernion, L. Lovitt, K. Ndousse, D. Amodei, T. Brown, J. Clark, J. Kaplan, S. McCandlish, and C. Olah (2022). In-context learning and induction heads. Transformer Circuits Thread. [Link](https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html).
- E. Queipo-de-Llano, Á. Arroyo, F. Barbero, X. Dong, M. Bronstein, Y. LeCun, and R. Shwartz-Ziv (2025). Attention sinks and compression valleys in LLMs are two sides of the same coin. arXiv preprint arXiv:2510.06477.
- E. Todd, M. L. Li, A. S. Sharma, A. Mueller, B. C. Wallace, and D. Bau (2024). Function vectors in large language models. In International Conference on Learning Representations (ICLR). arXiv:2310.15213.
- G. Xiao, Y. Tian, B. Chen, S. Han, and M. Lewis (2024). Efficient streaming language models with attention sinks. In International Conference on Learning Representations (ICLR). arXiv:2309.17453.
- A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, C. Zheng, D. Liu, F. Zhou, F. Huang, F. Hu, H. Ge, H. Wei, H. Lin, J. Tang, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Zhou, J. Lin, K. Dang, K. Bao, K. Yang, L. Yu, L. Deng, M. Li, M. Xue, M. Li, P. Zhang, P. Wang, Q. Zhu, R. Men, R. Gao, S. Liu, S. Luo, T. Li, T. Tang, W. Yin, X. Ren, X. Wang, X. Zhang, X. Ren, Y. Fan, Y. Su, Y. Zhang, Y. Zhang, Y. Wan, Y. Liu, Z. Wang, Z. Cui, Z. Zhang, Z. Zhou, and Z. Qiu (2025). Qwen3 technical report. arXiv preprint arXiv:2505.09388.
- Z. Zhang, Y. Sheng, T. Zhou, T. Chen, L. Zheng, R. Cai, Z. Song, Y. Tian, C. Ré, C. Barrett, Z. Wang, and B. Chen (2023). H2O: heavy-hitter oracle for efficient generative inference of large language models. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:2306.14048.
Appendix
## Appendix A Per-Model Phase Diagrams
Fig. [1](https://arxiv.org/html/2605.30574#S1.F1) in the main text shows phase diagrams for Qwen3-4B only. Fig. [4](https://arxiv.org/html/2605.30574#A1.F4) provides the analogous per-cell heatmaps for Gemma-3-4B-IT, Qwen3-8B, and Llama-3-8B-Instruct on both GSM8K and MBPP, for both zero and blank. Across every (model, task) pair, the same pattern holds. Blank cells stay near the clean baseline once $L$ is sufficiently deep, while zero cells form a red region at shallow $L$ that contracts only at the topmost cutoffs.
Figure 4: Per-cell heatmaps for Gemma-3-4B-IT, Qwen3-8B, and Llama-3-8B-Instruct on GSM8K and MBPP. Panels: (a) to (d) Gemma-3-4B-IT, (e) to (h) Qwen3-8B, (i) to (l) Llama-3-8B-Instruct; within each model the order is GSM8K zero, GSM8K blank, MBPP zero, MBPP blank. Each cell is a pass-rate percentage at one $(L, W)$ configuration. Rows within a panel are cumulative layer cutoffs, columns are pre-splice decoding steps $W$. The left column shows zero and the right column shows blank. Across every (model, task) pair, blank stays near the clean baseline at deep $L$ while zero produces a large red region at shallow $L$ that only contracts at the topmost cutoffs.
## Appendix B Clean Baselines
Table [2](https://arxiv.org/html/2605.30574#A2.T2) reports the unintervened (*clean*) pass rate $p_{\mathrm{clean}}$ used as the reference in the recovery frontier $W^{\star}(L; \alpha) = \min\{W : \mathrm{pass}(L, W) \geq \alpha\, p_{\mathrm{clean}}\}$. Each entry is the pass rate of the unmodified model on the full benchmark sample (100 items for GSM8K, 100 items for MBPP), computed once per (model, task) pair and shared across all interventions.

Table 2: Clean baselines $p_{\mathrm{clean}}$ (%, exact match for GSM8K, test-suite pass for MBPP and HumanEval) for the four models evaluated.
## Appendix C Splice Intervention Pseudocode
Algorithm [1](https://arxiv.org/html/2605.30574#alg1) restates the splice intervention as a modified greedy decoding loop. The loop differs from a standard greedy decode only in line [6](https://arxiv.org/html/2605.30574#alg1.l6), which performs a single in-place replacement of the prompt-span entries at the patched layers once decoding has reached step $W$. The patched cache is then used for every subsequent decoding step until generation terminates.
Algorithm 1: Splice-intervention greedy decode

1. Input: model $M$ with $N$ layers, prefilled cache $(K, V)$ with prompt-span length $T_p$, donor cache $(\tilde{K}, \tilde{V})$, layer cutoff $L$, splice onset $W$, max new tokens $T_{\max}$
2. $\mathcal{P} \leftarrow \{\ell : \ell \geq L\}$ $\triangleright$ patched layers
3. $y_0 \leftarrow \arg\max_{v}\, M_{\mathrm{logits}}(K, V)_v$
4. $\mathit{tokens} \leftarrow [y_0]$
5. for $t = 1$ to $T_{\max} - 1$ do
6. if $t = W$ then
7. for $\ell \in \mathcal{P}$ do
8. $K^{(\ell)}_{0:T_p} \leftarrow \tilde{K}^{(\ell)}_{0:T_p}$
9. $V^{(\ell)}_{0:T_p} \leftarrow \tilde{V}^{(\ell)}_{0:T_p}$
10. end for
11. end if
12. $(K, V) \leftarrow M(\mathit{tokens}[t-1], (K, V))$ $\triangleright$ forward pass, appends generated KV to cache
13. $y_t \leftarrow \arg\max_{v}\, M_{\mathrm{logits}}(K, V)_v$
14. if $y_t = \mathrm{eos}$ then break
15. end if
16. $\mathit{tokens}.\mathrm{append}(y_t)$
17. end for
18. return $\mathit{tokens}$
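The listing can be exercised end to end with a small stand-in model. The sketch below follows the loop structure of Algorithm 1 as written; `ToyModel` is a hypothetical deterministic stub, not an LLM, and the per-layer NumPy cache layout, the function name `splice_greedy_decode`, and the decision to copy the input cache instead of mutating the caller's arrays are all assumptions made for illustration.

```python
import numpy as np


class ToyModel:
    """Hypothetical stand-in for a decoder: deterministic integer arithmetic only."""

    def __init__(self, num_layers, dim, vocab_size):
        self.num_layers, self.dim, self.vocab_size = num_layers, dim, vocab_size

    def logits(self, keys, values):
        # Score the vocabulary from the top-layer values; argmax is `target`.
        target = int(values[-1].sum()) % self.vocab_size
        return -np.abs(np.arange(self.vocab_size) - target)

    def forward(self, token, keys, values):
        # Append one generated-span row per layer (all entries token + 1).
        row = np.full((1, self.dim), token + 1)
        return [np.vstack([k, row]) for k in keys], [np.vstack([v, row]) for v in values]


def splice_greedy_decode(model, keys, values, donor_keys, donor_values,
                         prompt_len, layer_cutoff, splice_onset, max_new_tokens, eos_id):
    # Work on copies so the caller's cache is untouched (a choice of this sketch).
    keys, values = [k.copy() for k in keys], [v.copy() for v in values]
    patched_layers = range(layer_cutoff, model.num_layers)
    tokens = [int(np.argmax(model.logits(keys, values)))]
    # As in the listing, the splice check sits inside the loop, which starts at t = 1.
    for t in range(1, max_new_tokens):
        if t == splice_onset:
            for layer in patched_layers:
                keys[layer][:prompt_len] = donor_keys[layer][:prompt_len]
                values[layer][:prompt_len] = donor_values[layer][:prompt_len]
        keys, values = model.forward(tokens[t - 1], keys, values)
        y_t = int(np.argmax(model.logits(keys, values)))
        if y_t == eos_id:
            break
        tokens.append(y_t)
    return tokens, keys, values


num_layers, dim, vocab_size, prompt_len = 4, 2, 11, 3
model = ToyModel(num_layers, dim, vocab_size)
base = np.arange(prompt_len * dim).reshape(prompt_len, dim)
keys = [base + layer for layer in range(num_layers)]
values = [base + layer + 10 for layer in range(num_layers)]
donor_keys = [np.zeros_like(k) for k in keys]      # zero donor
donor_values = [np.zeros_like(v) for v in values]

common = dict(prompt_len=prompt_len, layer_cutoff=2, max_new_tokens=5, eos_id=-1)
clean_tokens, _, _ = splice_greedy_decode(model, keys, values, donor_keys, donor_values,
                                          splice_onset=None, **common)
spliced_tokens, spliced_keys, spliced_values = splice_greedy_decode(
    model, keys, values, donor_keys, donor_values, splice_onset=2, **common)
print(clean_tokens, spliced_tokens)
```

Passing `splice_onset=None` never matches a step index, which gives the unintervened decode for comparison; tokens emitted before the onset are identical in both runs by construction.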
## Appendix D Example Generations
To complement the aggregate pass-rate results, Tables [3](https://arxiv.org/html/2605.30574#A4.T3) and [4](https://arxiv.org/html/2605.30574#A4.T4) report side-by-side greedy completions from Qwen3-4B on representative MBPP and GSM8K problems at $(L = 27, W = 0)$, the regime where the zero vs blank gap is largest while blank is still recoverable. Each row shows the output under the unintervened decode, the blank intervention, and the zero intervention. Visible characters are verbatim from the decoded output, problem statements are paraphrased to a single line, and "$\ldots$" marks elided text. In zero cells this is further token-loop repetition, and in clean or blank cells this is surrounding prose.

Table 3: Greedy completions from Qwen3-4B on two MBPP problems under the unintervened (clean) decode, the blank intervention, and the zero intervention, all at $L = 27, W = 0$. Under blank the model emits well-formed, mostly correct Python. For string_to_list the completion is identical to the unintervened output, and for find_star_num it produces an equivalent closed form ($6n^2 - 6n + 1$ vs. $6n(n-1) + 1$) but under a renamed entry point (a), which would fail a strict test harness despite the formula being correct. Under zero the decoder collapses into a single-token loop and produces no usable solution.

Table 4: Greedy completions from Qwen3-4B on two GSM8K problems under the unintervened (clean) decode, the blank intervention, and the zero intervention, all at $L = 27, W = 0$. Reference (gold) answers are 70 and 120. Under clean the model follows a coherent arithmetic chain and reaches the gold answer in both cases. Under blank the decoder remains arithmetic and roughly on-topic but introduces small content errors, such as misreading the discount rate in the first problem and renaming the protagonists in the second, resulting in incorrect final answers. Under zero the model emits a degenerate single-token loop and never produces a numeric answer.
## Appendix E HumanEval Replication
The form-vs-content dissociation also replicates on HumanEval (Chen et al., [2021](https://arxiv.org/html/2605.30574#bib.bib17)). Fig. [5](https://arxiv.org/html/2605.30574#A5.F5) shows the recovery frontier $W^{\star}(L; \alpha = 0.75)$ for zero and blank on all four models, and Fig. [6](https://arxiv.org/html/2605.30574#A5.F6) gives the per-cell phase diagrams. Across every model, blank sits to the left of zero and reaches near-baseline at deep $L$ while zero requires substantially more pre-splice decoding to recover.
Figure 5: HumanEval recovery frontier $W^{\star}(L; \alpha = 0.75)$ for zero (blue) and blank (orange), one panel per model. Blank sits consistently to the left of zero across every model.

Figure 6: Per-cell heatmaps on HumanEval for the four models, one row per model. Panels: (a), (b) Qwen3-4B; (c), (d) Gemma-3-4B-IT; (e), (f) Qwen3-8B; (g), (h) Llama-3-8B-Instruct; within each pair the order is zero, blank. The left column shows zero and the right column shows blank. Each cell is a pass-rate percentage at one $(L, W)$ configuration. Rows within a panel are cumulative layer cutoffs, columns are pre-splice decoding steps $W$. Blank stays near the clean baseline at deep $L$, while zero produces a large red region at shallow $L$ that only contracts at the topmost cutoffs.
## Appendix F Per-Model Heatmaps on the Algorithmic-Donor Benchmark
Fig. [7](https://arxiv.org/html/2605.30574#A6.F7) gives a unified cell-level view of the algorithmic-donor benchmark across the four models and four donor caches, under both test-pass scoring and an algo-aware score that additionally checks whether the emitted solution uses the prescribed algorithm class. Reading across columns reproduces the noise ladder of (R3): damage grows from *blank* and *same-algo* through *diff-algo* to *diff-family*. The two scoring rules diverge most under *diff-algo*, where the model passes the tests but commits to the donor's algorithm class, so the algo-aware column develops a large red region that persists to deeper $L$.
Figure 7: Unified per-cell heatmaps on the algorithmic-donor benchmark. Rows are models (Qwen3-4B, Gemma-3-4B-IT, Qwen3-8B, Llama-3-8B-Instruct), and columns group the four donor caches (*blank*, *same-algo*, *diff-algo*, *diff-family*) under test-pass scoring and an algo-aware score that also requires the emitted solution to use the prescribed algorithm class.
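The difference between the two scoring rules can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: the `Completion` record and the `detected_class` field are hypothetical, and how the algorithm class of an emitted solution is actually detected is not specified here.

```python
from dataclasses import dataclass


@dataclass
class Completion:
    tests_passed: bool      # did the emitted solution pass every test case?
    detected_class: str     # hypothetical label for the algorithm actually used
    prescribed_class: str   # algorithm class requested by the prompt


def passes_tests(c: Completion) -> bool:
    """Test-pass scoring: only functional correctness counts."""
    return c.tests_passed


def passes_algo_aware(c: Completion) -> bool:
    """Algo-aware scoring: correct output AND the prescribed algorithm class."""
    return c.tests_passed and c.detected_class == c.prescribed_class


# A diff-algo style outcome: right answer, but with the donor's algorithm.
flipped = Completion(tests_passed=True, detected_class="A", prescribed_class="B")
```

A completion like `flipped` counts as a pass under the first rule and a failure under the second, which is the divergence the *diff-algo* columns show.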
## Appendix G Algorithmic-Donor Benchmark Construction
The algorithmic\-donor benchmark is built around a hand\-written taxonomy of algorithm pairs\. Each pair shares an input/output contract but admits two distinct algorithmic strategies, an A side \(typically a naive baseline\) and a B side \(typically an optimised alternative\)\. Nine pairs were specified in this taxonomy\. The pairs cover search \(linear vs binary\), sort \(bubble vs merge\), graph traversal \(BFS vs DFS\), shortest path \(Dijkstra vs Bellman\-Ford\), Fibonacci \(naive recursion vs bottom\-up DP\), max\-subarray \(brute force vs Kadane\), primes \(trial division vs Sieve of Eratosthenes\), two\-sum \(nested\-loop vs hash\-map single pass\), and string match \(naive vs KMP\)\. Each taxonomy entry is paired with a short natural\-language description of the contract and the strategy\.
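For reference, the nine pairs can be written down as a small data structure. This is only a sketch: the `AlgoPair` type and field names are hypothetical, and assigning the first-named algorithm in each pair to the A side is an assumption (the text says the A side is typically the naive baseline, but does not state the assignment for every pair).

```python
from typing import NamedTuple


class AlgoPair(NamedTuple):
    family: str
    side_a: str  # assumed: the first-named algorithm in the text
    side_b: str  # assumed: the second-named algorithm in the text


TAXONOMY = [
    AlgoPair("search", "linear search", "binary search"),
    AlgoPair("sort", "bubble sort", "merge sort"),
    AlgoPair("graph traversal", "BFS", "DFS"),
    AlgoPair("shortest path", "Dijkstra", "Bellman-Ford"),
    AlgoPair("Fibonacci", "naive recursion", "bottom-up DP"),
    AlgoPair("max-subarray", "brute force", "Kadane"),
    AlgoPair("primes", "trial division", "Sieve of Eratosthenes"),
    AlgoPair("two-sum", "nested loop", "hash-map single pass"),
    AlgoPair("string match", "naive", "KMP"),
]
```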
Variant generation: Sonnet 4.5 (max_tokens=4096, temperature=0.8) was used in the dataset construction. The system prompt used in this pass is shown below.
Variant Generation System Prompt

You are a programming problem generator. Given an algorithm family with algorithms A and B, generate exactly 10 diverse problem variants. Each variant should be a different concrete problem (different domain, data types, or scenario) that both algorithms can solve.

For each variant, provide:
1. A Python prompt asking for algorithm B
2. A Python solution using algorithm A

Return a JSON array of 10 objects, each with keys:
- "variant_name": short description
- "python_b_prompt": one-sentence Python prompt explicitly naming algorithm B
- "python_a_code": complete Python function implementing algorithm A

Return ONLY valid JSON, no prose.
The user prompt provided the family name, the names of algorithms A and B, and an example A\-side and B\-side prompt drawn from the taxonomy\. The output was constrained to a JSON array of ten objects per family\. Each call was retried up to three times on parse or transport errors with a short sleep between retries\.
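The parse-and-retry loop described above might look like the sketch below. Several details are assumptions: `call_model` is a hypothetical callable standing in for the actual API request, "up to three times" is read as three attempts in total, the sleep length is arbitrary, and the checks on array length and keys are added here for illustration rather than taken from the paper.

```python
import json
import time
from typing import Callable, List, Optional

REQUIRED_KEYS = {"variant_name", "python_b_prompt", "python_a_code"}


def generate_variants(
    call_model: Callable[[], str],  # hypothetical: returns the raw model reply
    max_attempts: int = 3,
    sleep_s: float = 1.0,
    expected: int = 10,
) -> List[dict]:
    """Request variants, parse the JSON reply, and retry on failure."""
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        try:
            # json.JSONDecodeError subclasses ValueError, so parse errors
            # are caught by the handler below.
            variants = json.loads(call_model())
            if not isinstance(variants, list) or len(variants) != expected:
                raise ValueError(f"expected a JSON array of {expected} objects")
            for obj in variants:
                if not isinstance(obj, dict) or not REQUIRED_KEYS.issubset(obj):
                    raise ValueError("variant is missing a required key")
            return variants
        except (ValueError, ConnectionError, TimeoutError) as err:
            last_error = err
            if attempt < max_attempts:
                time.sleep(sleep_s)  # short pause before the next attempt
    raise RuntimeError(f"giving up after {max_attempts} attempts") from last_error
```

Validating the structure inside the `try` block means a well-formed but incomplete reply is retried in the same way as a malformed one.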
### Flipped halves\.
Using the same model at temperature=0.2, with all other settings unchanged, the system prompt provided the existing A-side Python implementation and the name of algorithm B and asked for two outputs: a Python implementation of algorithm B with the identical function signature, and a Python prompt requesting algorithm A for the same problem. The user prompt repeated the algorithm names, embedded the existing A-side Python code, and quoted the original B-side Python prompt for context. After this pass each variant carries all four artifacts: an A-side prompt and reference implementation in Python, and a B-side prompt and reference implementation in Python.
### Ground truth and tests\.
The committed ground-truth file holds, per variant, two prompt sides (A-side and B-side) plus a deterministic test suite of five cases (one edge case and four normal cases). Reference Python implementations were produced as a generation-pipeline artifact and are checked for internal consistency, but the splice experiments only use the prompts and test cases. Across the nine families with ten variants each, this yields ninety problems and one hundred and eighty prompt variants in total, indexed by a (problem $p$, algorithm-class $c$, family $f$) triple.
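The counts above follow directly from the benchmark's shape, as this short sketch shows. The integer numbering of problems and families and the labels "A" and "B" are hypothetical conventions chosen for illustration; the actual ground-truth file may index entries differently.

```python
from itertools import product

N_FAMILIES = 9
VARIANTS_PER_FAMILY = 10
ALGO_CLASSES = ("A", "B")

# One problem per (family, variant) pair, numbered consecutively.
problems = [
    (f * VARIANTS_PER_FAMILY + v, f)
    for f, v in product(range(N_FAMILIES), range(VARIANTS_PER_FAMILY))
]

# One prompt per problem and algorithm class, keyed as a (p, c, f) triple.
prompt_index = [(p, c, f) for (p, f) in problems for c in ALGO_CLASSES]

print(len(problems), len(prompt_index))  # 90 180
```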
## Appendix H Example Generations on the Algorithmic-Donor Benchmark
Tables [5](https://arxiv.org/html/2605.30574#A8.T5) and [6](https://arxiv.org/html/2605.30574#A8.T6) show greedy completions from Qwen3-4B under all four donor interventions at $(L=12, W=15)$. Each entry illustrates the characteristic failure mode of that donor condition.
Table 5: Algorithmic-donor examples (1/2): *blank* and *same-algo* at $L=12$, decoding step $=15$, Qwen3-4B.

Table 6: Algorithmic-donor examples (2/2): *diff-algo* and *diff-family* at $L=12$, decoding step $=15$, Qwen3-4B. Together with Table [5](https://arxiv.org/html/2605.30574#A8.T5), these illustrate the graded noise ladder: *blank* loses the algorithm class but produces correct output; *same-algo* preserves the algorithm but leaks donor identifiers; *diff-algo* flips the algorithm entirely; *diff-family* produces an unrelated computation.