The Structural Attention Tax: How Retrieval Format Hijacks In-Context Learning Independent of Content
Summary
This paper identifies and formalizes the 'structural attention tax' phenomenon, where the format of retrieved content (e.g., knowledge graph triples) independently distorts LLM attention distribution regardless of semantic relevance, leading to compressed demonstration attention. It provides a formal framework, empirical evidence across models and benchmarks, and proposes structure-aware mitigation strategies.
# How Retrieval Format Hijacks In-Context Learning Independent of Content
Source: [https://arxiv.org/html/2606.11198](https://arxiv.org/html/2606.11198)
## The Structural Attention Tax: How Retrieval Format Hijacks In\-Context Learning Independent of Content
###### Abstract
Retrieval-augmented generation (RAG) systems inject external knowledge to improve LLM outputs, yet the *format* of injected content, as distinct from its semantic relevance, can independently distort the model’s attention distribution. We identify and formalise a phenomenon we term the structural attention tax: knowledge graph (KG) triples, due to their relational delimiters and repeated slot patterns, capture 2–3$\times$ more attention per token than semantically equivalent natural-language text ($\hat{\sigma}(\text{KG}) \approx 0.70$ vs. $\hat{\sigma}(\text{neutral}) \approx 0.25$), compressing demonstration attention by up to 42%, *regardless of whether the triples are relevant or noise*. We develop a formal framework decomposing attention scores into semantic and structural components (Eq. [2](https://arxiv.org/html/2606.11198#S3.E2)), derive a compression bound (Proposition [1](https://arxiv.org/html/2606.11198#Thmproposition1)) connecting token-level format bias to demonstration attention loss, and show that the structural term governs *how much* attention is diverted while the semantic term governs *whether* this helps or hurts. This decoupling reveals two orthogonal axes for improving retrieval-augmented ICL: optimising retrieval quality (semantic axis) and reducing format-driven attention capture (structural axis). Empirically, across two model families (Mistral-7B, LLaMA-3-8B) and three QA benchmarks, we observe that source–task alignment dominates: task-matched BM25 retrieval achieves 58–62% on HotpotQA vs. ConceptNet’s 25–27%, a >30 pp gap that dwarfs all gating strategies ($\leq$2 pp).
We derive five structure\-aware mitigation strategies from the framework, ranging from zero\-cost prompt modifications to training\-time regularisation; format flattening \(S3\) is validated by both accuracy and attention\-level evidence from a verbalized\-triple control, while structural dispersal \(S1\) yields mixed results that illuminate the challenges of format\-level intervention\.
## 1 Introduction
Retrieval-augmented generation (RAG) (Jiang and others, [2023b](https://arxiv.org/html/2606.11198#bib.bib16); Sui et al., [2025](https://arxiv.org/html/2606.11198#bib.bib19)) has become a standard strategy for grounding LLM outputs in external knowledge. The dominant concern in RAG research is *what* to retrieve: selecting relevant passages (Parry et al., [2024](https://arxiv.org/html/2606.11198#bib.bib29)), gating on confidence (Jiang and others, [2023b](https://arxiv.org/html/2606.11198#bib.bib16)), or chaining knowledge graph facts into reasoning traces (Sui et al., [2025](https://arxiv.org/html/2606.11198#bib.bib19)). Far less attention has been paid to a complementary question: *how does the format of retrieved content interact with the transformer’s attention mechanism, independent of whether that content is semantically useful?*
We show that this question matters more than it might appear. In transformer-based in-context learning (ICL) (Brown and others, [2020](https://arxiv.org/html/2606.11198#bib.bib4); Chen et al., [2025](https://arxiv.org/html/2606.11198#bib.bib24)), all prompt regions compete for the same fixed attention budget. When knowledge graph triples are injected into the prompt, their distinctive structure (relational delimiters, repeated slot patterns, high token-level regularity) creates a *format-driven bias* in attention allocation that operates independently of the triples’ semantic content. We term this phenomenon the structural attention tax: a systematic over-allocation of attention to structurally salient prompt regions, with a corresponding under-allocation (compression) of attention to demonstrations and other task-critical context.
Our central theoretical contribution is a decomposition of attention scores into semantic and structural components (Section [3](https://arxiv.org/html/2606.11198#S3); Figure [1](https://arxiv.org/html/2606.11198#S3.F1)), yielding a formal characterisation of how format bias interacts with content relevance:
- The structural component $\lambda \cdot \sigma(K)$ determines the *magnitude* of attention diversion: how many attention units are taxed away from demonstrations.
- The semantic component $\bar{s}_K^{\text{sem}}$ determines the *sign* of the performance effect: whether diverted attention carries useful signal or noise.
This decoupling implies that optimising *what* to retrieve (semantic axis) and reducing *format-driven capture* (structural axis) are orthogonal improvement strategies, a perspective that unifies several previously disconnected observations (Shi and others, [2023](https://arxiv.org/html/2606.11198#bib.bib23); Wu et al., [2024](https://arxiv.org/html/2606.11198#bib.bib31); Liu and others, [2024](https://arxiv.org/html/2606.11198#bib.bib14)).
We develop this framework through four contributions:
1. The structural attention tax framework (Section [3](https://arxiv.org/html/2606.11198#S3)): a formal decomposition of attention competition in augmented ICL, yielding four testable predictions and a provable compression bound (Proposition [1](https://arxiv.org/html/2606.11198#Thmproposition1)).
2. Empirical validation of the format–content decoupling (Section [5](https://arxiv.org/html/2606.11198#S5)): using a seven-condition study across two model families (Mistral-7B, LLaMA-3-8B) and three QA tasks, we show that KG triples absorb 2–3$\times$ more attention per token than neutral text, with noise and relevant triples exhibiting nearly identical attention patterns, confirming that the structural tax is format-driven, not content-driven.
3. A source-alignment dominance result (Section [5.3](https://arxiv.org/html/2606.11198#S5.SS3)): task-matched BM25 retrieval outperforms mismatched ConceptNet retrieval by >30 pp on HotpotQA, demonstrating that source selection along the semantic axis dwarfs gating sophistication ($\leq$2 pp). This result is currently limited to one task and confounded by retrieval-unit differences (Section [8](https://arxiv.org/html/2606.11198#S8)).
4. Five structure-aware mitigation strategies (Section [6](https://arxiv.org/html/2606.11198#S6)): derived from the framework, targeting the structural term $\lambda \cdot \sigma(K)$ through prompt modification, logit suppression, and training-time regularisation. Two strategies are empirically evaluated: S3 (format flattening) is supported by both accuracy and attention-level evidence (Appendix LABEL:app:c5b); S1 (structural dispersal) yields mixed results with model-dependent effects (Appendix LABEL:app:s1\_dispersal). The remaining three are mathematically grounded but untested.
Scope: We do not claim the structural attention tax makes KG augmentation universally harmful; we argue its existence as an independent, format-driven cost has been overlooked (Section [8](https://arxiv.org/html/2606.11198#S8)).
## 2 Related Work
In-Context Learning. Brown and others ([2020](https://arxiv.org/html/2606.11198#bib.bib4)) showed LLMs generalise from demonstrations without gradient updates. Demonstration format (Min and others, [2022](https://arxiv.org/html/2606.11198#bib.bib12)), skill matching (An et al., [2023](https://arxiv.org/html/2606.11198#bib.bib13)), and schema-structured prompts (Chen et al., [2025](https://arxiv.org/html/2606.11198#bib.bib24)) affect performance. Parametric fact recall degrades before ICL ability under compression (Jin and others, [2023](https://arxiv.org/html/2606.11198#bib.bib21)). Parry et al. ([2024](https://arxiv.org/html/2606.11198#bib.bib29)) frame ICL as applied information retrieval.
KG Augmentation and RAG. Pipelines range from triple injection (Li et al., [2023](https://arxiv.org/html/2606.11198#bib.bib9)) to graph-based reasoning (Huang and others, [2023](https://arxiv.org/html/2606.11198#bib.bib11)). FLARE (Jiang and others, [2023b](https://arxiv.org/html/2606.11198#bib.bib16)) gates retrieval on confidence; FiDeLiS (Sui et al., [2025](https://arxiv.org/html/2606.11198#bib.bib19)) chains KG facts into verifiable traces. Zheng and others ([2023](https://arxiv.org/html/2606.11198#bib.bib22)) show KG triples serve as factual overrides; Wu et al. ([2024](https://arxiv.org/html/2606.11198#bib.bib31)) quantify the “tug-of-war” between parametric priors and retrieved evidence. Liu and others ([2024](https://arxiv.org/html/2606.11198#bib.bib14)) find RAG gains are modest when chain-of-thought already approaches correct conclusions; Shi and others ([2023](https://arxiv.org/html/2606.11198#bib.bib23)) show LLMs are susceptible to distraction by irrelevant context. While these works identify cases where retrieval can hurt, none isolate the *format-driven* component of attention distortion from the *content-driven* component, which is the central contribution of our framework.
Multi-Hop Reasoning. Chain-of-thought (Wei and others, [2022](https://arxiv.org/html/2606.11198#bib.bib15)) and backward chaining (Kazemi and others, [2023](https://arxiv.org/html/2606.11198#bib.bib17)) improve multi-hop accuracy. Confidence calibration (Deng et al., [2024](https://arxiv.org/html/2606.11198#bib.bib27)) and the memory–reasoning distinction (Jin and others, [2025](https://arxiv.org/html/2606.11198#bib.bib20)) motivate our task contrast.
Positioning. Our work introduces the *structural attention tax* as a formal concept, provides a decomposition framework that separates format effects from content effects in retrieval-augmented ICL, and derives mitigation strategies grounded in this decomposition. This complements existing work on retrieval quality and gating (Jiang and others, [2023b](https://arxiv.org/html/2606.11198#bib.bib16); Wu et al., [2024](https://arxiv.org/html/2606.11198#bib.bib31)) by identifying an orthogonal axis of improvement.
## 3 The Structural Attention Tax Framework
Figure 1: The structural attention tax. Each bar shows how the first answer token’s last-layer attention (summing to 100%) is distributed across four prompt regions: instruction (I), demonstrations (D), knowledge (K), and query (Q). Top bar, natural language (“A dog is a common pet. Cats are also animals.”): demonstrations 35%, knowledge 25%, query 35%. Bottom bar, KG triples (dog|IsA|pet; cat|IsA|animal): demonstrations 20%, knowledge 48%, query 28%. The knowledge content is identical in both rows; only the *format* differs. Presenting knowledge as KG triples (bottom) nearly doubles the knowledge-region attention share ($25\% \to 48\%$) and compresses demonstration attention from $35\%$ to $20\%$, regardless of whether the triples are semantically relevant. The dashed line marks the natural-language knowledge boundary; attention to its left has been “taxed away” from demonstrations. (Illustrative values; see Section [5](https://arxiv.org/html/2606.11198#S5) for measured data.)
We develop a formal account of how the *format* of injected knowledge interacts with the transformer’s attention mechanism, generating four testable predictions (see Figure [1](https://arxiv.org/html/2606.11198#S3.F1) for an overview). The decompositions serve as heuristic frameworks that organise otherwise disconnected observations into a coherent theory.
### 3.1 Setup and Notation
Let $q$ denote a query with gold answer $y^{*}$. The prompt is $x = [I;\, D;\, K;\, q]$ with instruction $I$, demonstrations $D$, optional knowledge $K$, and question $q$. Define $c_0(q) \triangleq p(y^{*} \mid x_{\varnothing})$ (no knowledge) and $c_K(q) \triangleq p(y^{*} \mid x_K)$ (with knowledge).
### 3.2 Attention Score Decomposition
For query token $i$ at layer $l$, the attention mass on $K$ is $A_K^{(l)}(i) = \sum_{j \in K} \exp(s_{ij}^{(l)}) / \sum_{k} \exp(s_{ik}^{(l)})$. Since attention is normalised, $A_D + A_K + A_I + A_Q = 1$. We decompose the attention score into semantic and structural components:
$$s_{ij}^{(l)} = \underbrace{s_{ij}^{(l),\text{sem}}}_{\text{content-driven}} + \underbrace{b_j^{(l)}}_{\text{format bias}}. \tag{1}$$
The effective attention allocated to $K$ decomposes as:
$$A_K^{(l,h)}(i) = \underbrace{A_K^{(l,h),\text{sem}}(i)}_{\text{semantic relevance}} + \underbrace{\lambda^{(l,h)} \cdot \sigma(K)}_{\text{structural attention tax}}, \tag{2}$$
where $\sigma(K) \in [0,1]$ quantifies structural intensity (triple density, delimiter frequency, slot repetitiveness) and $\lambda^{(l,h)}$ is a model-intrinsic bias coefficient. The term $\lambda \cdot \sigma(K)$ is the *structural attention tax*: attention captured by format alone, independent of whether the content is relevant, irrelevant, or noise.
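The zero-sum effect of the format bias in Eq. 1 can be illustrated with a toy softmax. The sketch below is only an illustration under stated assumptions: the region sizes, the flat semantic scores, and the bias value of 1.0 are all hypothetical and are not taken from the measurements in this paper.

```python
import math

def region_attention(scores, regions):
    """Softmax the per-token scores, then sum the attention mass per region."""
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]  # subtract max for stability
    total = sum(weights)
    shares = {}
    for weight, region in zip(weights, regions):
        shares[region] = shares.get(region, 0.0) + weight / total
    return shares

# Hypothetical toy prompt: 2 instruction, 4 demonstration, 3 knowledge, 2 query tokens.
regions = ["I"] * 2 + ["D"] * 4 + ["K"] * 3 + ["Q"] * 2
semantic = [0.0] * len(regions)  # assumed: identical semantic score for every token
bias = 1.0                       # assumed format bias b_j, applied to knowledge tokens only

plain = region_attention(semantic, regions)
biased = [s + (bias if r == "K" else 0.0) for s, r in zip(semantic, regions)]
taxed = region_attention(biased, regions)

for name in ("I", "D", "K", "Q"):
    print(f"{name}: {plain[name]:.3f} -> {taxed[name]:.3f}")
```

Because the shares always sum to one, any mass gained by the knowledge region through the bias term is lost by the other regions, with no change in semantic scores.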
###### Definition 1 (Structural capture potential).
For region $\mathcal{R}$ with $m$ tokens: $\sigma(\mathcal{R}) = \gamma \cdot \frac{1}{m} \sum_{j \in \mathcal{R}} \mathbb{I}[\text{token}_j \in \mathcal{P}_{\text{struct}}] + \beta_{\text{rep}} \cdot \text{rep}(\mathcal{R})$, where $\mathcal{P}_{\text{struct}}$ is the structured-pattern token set (relation keywords, delimiters, slot markers) and $\text{rep}(\mathcal{R})$ quantifies repetitiveness.
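A minimal sketch of how Definition 1 might be computed on raw text follows. Several pieces are assumptions made for illustration, since the definition leaves them open: the tokenizer, the contents of the structured-pattern set, the form of rep(R) (here, the share of tokens that repeat an earlier token), and the weights `gamma` and `beta_rep`. The resulting values are not the estimates reported in Section 5.

```python
import re

# Assumed structured-pattern token set: delimiters plus ConceptNet-style relation names.
STRUCT_TOKENS = {"|", ";", "IsA", "AtLocation", "UsedFor", "PartOf"}

def tokenize(text):
    """Crude tokenizer (assumption): alphabetic words, with '|' and ';' kept as tokens."""
    return re.findall(r"[A-Za-z]+|[|;]", text)

def repetitiveness(tokens):
    """Assumed rep(R): share of tokens that repeat an earlier token in the region."""
    if not tokens:
        return 0.0
    return 1.0 - len(set(tokens)) / len(tokens)

def structural_capture(text, gamma=0.5, beta_rep=0.5):
    """sigma(R) = gamma * structured-token fraction + beta_rep * rep(R)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    struct_fraction = sum(t in STRUCT_TOKENS for t in tokens) / len(tokens)
    return gamma * struct_fraction + beta_rep * repetitiveness(tokens)

triples = "dog|IsA|pet; cat|IsA|animal"
prose = "A dog is a common pet. Cats are also animals."
print(structural_capture(triples), structural_capture(prose))
```

With `gamma + beta_rep = 1` the score stays in [0, 1], matching the range stated for the structural intensity; the triple string scores higher than the prose string because of its delimiters and repeated relation tokens.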
The key insight is that the structural and semantic components play fundamentally different roles: *$\lambda \cdot \sigma(K)$ determines how much attention is taxed; $\bar{s}_K^{\text{sem}}$ determines whether this tax helps or hurts.* This decoupling generates two orthogonal improvement axes: reducing format-driven capture (targeting $\sigma(K)$ or $\lambda$) and improving retrieval quality (targeting $\bar{s}_K^{\text{sem}}$).
### 3.3 Demonstration Compression Bound
The zero-sum constraint implies that the structural tax compresses demonstration attention:
$$A_D^{(l),\text{eff}} = A_D^{(l),\text{sem}} - \eta \cdot \lambda \cdot \sigma(K) \cdot \frac{A_D^{(l),\text{sem}}}{\sum_{R \neq K} A_R^{(l),\text{sem}}}, \tag{3}$$
where $\eta \in [0.5, 1.0]$ is a competition coefficient.
###### Proposition 1 (Demonstration compression bound).
If $K$ has $m$ tokens with mean logit $\bar{s}_K$ and $D$ has mean logit $\bar{s}_D$, then:
$$\frac{A_D^{(K)}}{A_D^{(0)}} \geq \frac{1}{1 + \frac{m}{T_0} \cdot \exp(\bar{s}_K - \bar{s}_D)}. \tag{4}$$
Incorporating the structural decomposition, $\bar{s}_K = \bar{s}_K^{\text{sem}} + \lambda \cdot \sigma(K)$, so:
$$\frac{A_D^{(K)}}{A_D^{(0)}} \geq \frac{1}{1 + \frac{m}{T_0} \cdot \exp\!\bigl(\bar{s}_K^{\text{sem}} + \lambda \cdot \sigma(K) - \bar{s}_D\bigr)}. \tag{5}$$
The structural term $\lambda \cdot \sigma(K)$ appears inside the exponential, meaning even modest format bias is *amplified exponentially* in its effect on compression. When $\lambda \cdot \sigma(K) \gg |\bar{s}_K^{\text{sem}} - \bar{s}_D|$, noise and relevant triples compress demonstrations nearly identically, a signature prediction of the structural attention tax.
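To see how the right-hand side of Eq. 5 behaves, the sketch below evaluates it for a few values of the structural term. This section does not define $T_0$; the code treats it as the token count of the prompt without knowledge, which is an assumption, and every numeric input is hypothetical.

```python
import math

def compression_lower_bound(m, t0, s_k_sem, tax, s_d):
    """Right-hand side of Eq. 5: lower bound on A_D(K) / A_D(0).

    m       : number of knowledge tokens
    t0      : assumed to be the token count of the prompt without knowledge
    s_k_sem : mean semantic logit of the knowledge region
    tax     : structural term lambda * sigma(K), added to the knowledge logit
    s_d     : mean logit of the demonstration region
    """
    return 1.0 / (1.0 + (m / t0) * math.exp(s_k_sem + tax - s_d))

# Hypothetical setting: 30 knowledge tokens against a 300-token base prompt.
for tax in (0.0, 0.5, 1.0, 2.0):
    relevant = compression_lower_bound(30, 300, s_k_sem=0.1, tax=tax, s_d=0.0)
    noise = compression_lower_bound(30, 300, s_k_sem=-0.1, tax=tax, s_d=0.0)
    print(f"tax={tax:.1f}  relevant>={relevant:.3f}  noise>={noise:.3f}")
```

The bound loosens (the guaranteed retained share of demonstration attention falls) as the structural term grows, for relevant and noise content alike.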
### 3.4 Source–Task Alignment
The influence of $K$ on the model’s output decomposes into a useful signal component and a distraction component (Appendix LABEL:app:mi\_decomposition). Prediction 1 (Source dominance): When $K$ is misaligned (e.g., ConceptNet for Wikipedia-based questions), distraction dominates. The structural attention tax amplifies this: misaligned triples are not merely uninformative but *actively costly* because they impose a format-driven attention tax on top of semantic distraction.
### 3.5 Confidence-Dependent Interference
The KL divergence $D_{\mathrm{KL}}(p(\cdot \mid x_K) \,\|\, p(\cdot \mid x_{\varnothing}))$ quantifies the representational shift from knowledge injection. In the *high-confidence regime* ($c_0 \to 1$), any shift leaks mass away from $y^{*}$:
$$c_0(q) \approx 1 \implies c_K(q) \leq c_0(q) - \underbrace{\textstyle\sum_{y \neq y^{*}} [p(y \mid x_K) - p(y \mid x_{\varnothing})]^{+}}_{\triangleq\, \ell(q,K)\, \geq\, 0}. \tag{6}$$
Prediction 2 (Confidence modulation): The expected accuracy change is:
$$\mathbb{E}[\Delta] \approx \underbrace{P(c_0 \ll 1) \cdot \mathbb{E}[\Delta \mid c_0 \ll 1]}_{>0} + \underbrace{P(c_0 \approx 1) \cdot \mathbb{E}[\Delta \mid c_0 \approx 1]}_{<0}. \tag{7}$$
The structural tax exacerbates this: even when $K$ contains useful signal, format-driven attention capture reduces the model’s ability to attend to demonstrations that provide task-critical calibration.
### 3.6 Testable Predictions
The framework generates four predictions:
- P1 (Source dominance): task-matched retrieval outperforms mismatched.
- P2 (Confidence modulation): KG hurts high-confidence tasks, helps low-confidence tasks.
- P3 (Format-invariant capture): noise and relevant triples absorb similar attention because $\lambda \cdot \sigma(K)$ dominates.
- P4 (Compression–performance decoupling): KG-broken and KG-fixed samples show similar attention but divergent accuracy.
## 4 Methodology
### 4.1 Experimental Design
The prompt takes the form $x = [\text{Instr.};\; \text{Demos};\; T(q);\; q]$ with $k{=}3$ demonstrations held constant. We design seven conditions to isolate the structural attention tax from semantic effects:
- C1 (Standard ICL, no external context)
- C2 (Relevant-KG: top-3 ConceptNet triples by cosine similarity, all-MiniLM-L6-v2)
- C3 (Noise-KG: unrelated triples)
- C4 (Scalar-Gated: inject when first-token log-prob $< -0.3$)
- C5 (Neutral-Text: length-matched neutral sentences)
- C5b (Verbalized triples; Appendix LABEL:app:c5b)
- C6 (Multi-Feature Gate: dual inference, diagnostic only)
- C7 (BM25 Wikipedia passage, HotpotQA only)
The C2/C3/C5 contrast is the key design for isolating the structural tax: C2 and C3 share high $\sigma(K)$ but differ in semantic relevance; C5 has low $\sigma(K)$ but matched token count. If the structural tax exists, C2 and C3 should show similar attention capture despite opposite semantic relevance, and both should exceed C5.
*Caveat:* C2 optimises for surface similarity rather than answer-discriminative relevance; stronger KG retrieval methods might yield different results.
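The scalar gate in condition C4 (inject when the first-token log-probability is below $-0.3$) can be sketched as follows. This is a minimal illustration, not the authors’ implementation: the function name, the newline-joined prompt layout, and the triple serialisation are assumptions, and `first_token_logprob` is taken to come from a prior no-knowledge forward pass.

```python
def build_gated_prompt(instruction, demos, question, triples,
                       first_token_logprob, threshold=-0.3):
    """Assemble [Instr.; Demos; T(q); q], adding T(q) only for low-confidence queries.

    first_token_logprob is assumed to be measured on a run without knowledge.
    Returns the prompt string and a flag saying whether triples were injected.
    """
    inject = first_token_logprob < threshold
    parts = [instruction, *demos]
    if inject:
        parts.append("; ".join(triples))  # assumed serialisation of the triples
    parts.append(question)
    return "\n".join(parts), inject

demos = ["Q: Where do fish live? A: water"]
triples = ["dog|IsA|pet", "cat|IsA|animal"]

prompt, injected = build_gated_prompt(
    "Answer the question.", demos, "Q: What is a dog? A:", triples,
    first_token_logprob=-1.2,  # hypothetical low-confidence query
)
print(injected)
print(prompt)
```

A query whose first-token log-probability is above the threshold gets the plain C1-style prompt, so the gate costs one extra forward pass per query.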
### 4.2 Datasets, Models, and Metrics
We evaluate on CommonsenseQA (Talmor et al., [2019](https://arxiv.org/html/2606.11198#bib.bib5)) (five-way MC), HotpotQA (Yang and others, [2018](https://arxiv.org/html/2606.11198#bib.bib6)) (multi-hop), and TriviaQA (Joshi et al., [2017](https://arxiv.org/html/2606.11198#bib.bib1)) (open-domain factual). Base set: $n{=}200$; expanded to $n{=}1{,}000$ for McNemar tests. Models: Mistral-7B-Instruct-v0.1 (Jiang and others, [2023a](https://arxiv.org/html/2606.11198#bib.bib8)) and LLaMA-3-8B-Instruct (Dubey and others, [2024](https://arxiv.org/html/2606.11198#bib.bib2)), both with 4-bit NF4 quantisation and greedy decoding. Quantisation introduces perturbations: FP16 shows $+10$ pp on HotpotQA C1 (Appendix [A.2](https://arxiv.org/html/2606.11198#A1.SS2)), so fine-grained effects (1–3 pp) should be interpreted cautiously. Metrics: exact-match accuracy, first-token log-probability as confidence proxy, McNemar’s test with Bonferroni correction.
*Confidence proxy caveat:* First-token log-probability conflates knowledge confidence with format preferences and tokeniser biases; cross-task comparability is limited.
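For readers unfamiliar with the significance test named above, a minimal sketch of McNemar’s test with a Bonferroni adjustment follows. The text does not say whether the exact binomial or the chi-square variant was used; the exact two-sided form is assumed here, and the discordant-pair counts are hypothetical.

```python
from math import comb

def mcnemar_exact(b, c):
    """Exact two-sided McNemar p-value from the discordant pair counts.

    b: items correct under condition A only; c: items correct under condition B only.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

def bonferroni(p_value, n_tests):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p_value * n_tests)

# Hypothetical counts for one model-task pair; six comparisons in total
# (two models times three tasks).
p_raw = mcnemar_exact(b=25, c=60)
p_adj = bonferroni(p_raw, n_tests=6)
print(p_raw, p_adj, p_adj < 0.05)
```

Comparing the adjusted p-value with 0.05 is equivalent to comparing the raw p-value with 0.05 / 6, which rounds to 0.0083.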
## 5 Results
We organise results around the four predictions of the structural attention tax framework, leading with the core structural findings \(Predictions 3–4\), then examining performance effects \(Prediction 2\), and concluding with the source\-alignment result \(Prediction 1\)\.
### 5.1 The Structural Tax in Action (Predictions 3–4)
Format-invariant attention capture (Prediction 3). Last-layer attention shows KG triples absorbing 7–10% (Mistral) or 3–6% (LLaMA) of attention mass *regardless of relevance*, with demonstration compression up to 42% (Figure [2](https://arxiv.org/html/2606.11198#S5.F2)). Critically, C3 (noise triples) shows attention absorption comparable to C2 (relevant triples), at 7.7% vs. 10.3% for Mistral on CSQA, confirming that the structural tax operates independently of content (Figure [2](https://arxiv.org/html/2606.11198#S5.F2)). In contrast, C5 (neutral text) absorbs only $\sim$3%, yielding the signature $\sim$3$\times$ ratio: $\hat{\sigma}(\text{KG}) \approx 0.70$ vs. $\hat{\sigma}(\text{neutral}) \approx 0.25$.
Figure 2: Structural attention tax evidence. (a) Attention by region under C1/C2/C3: KG triples capture 7–10% of attention regardless of relevance, compressing demonstration attention by up to 42%. (b) C2 vs. C3 KG-region attention: similar capture rates for relevant (C2) and irrelevant (C3) triples confirm format-driven allocation (Prediction 3).
Quantifying the tax. KG tokens receive $\approx 0.34\%$/token (Mistral) vs. $0.13\%$/token for demonstrations ($2.6\times$), with estimated structural bias $\lambda \approx 0.07$–$0.10$ (Mistral) and $0.03$–$0.06$ (LLaMA). The exponential amplification in Eq. [5](https://arxiv.org/html/2606.11198#S3.E5) explains why even modest $\lambda \cdot \sigma(K) \approx 0.05$–$0.07$ produces substantial compression.
Compression–performance decoupling (Prediction 4). KG-broken and KG-fixed samples show structurally similar attention redistribution patterns (Table LABEL:tab:attention), confirming that compression magnitude alone does not predict performance direction. This validates the core claim of the framework: the structural tax determines *how much*; the semantic content determines *whether*.
Neutral-text control (C5). C5 decomposes the C2 effect into crowding (C1$-$C5) and semantic (C5$-$C2) components (most estimates 1–3 pp, within CI at $n{=}200$). On CSQA, neutral text has negligible effect while KG triples degrade performance, consistent with content-driven interference *amplified by* the structural tax. On HotpotQA, KG triples overcome crowding for a net gain.
These are correlational last-layer observations. Multi-layer analysis (Appendix LABEL:app:multilayer) confirms that KG attention elevation persists across layers, but causal validation (attention masking, activation patching) remains important future work.
### 5.2 Confidence-Dependent Effects (Prediction 2)
Table [1](https://arxiv.org/html/2606.11198#S5.T1) reports accuracy across conditions (95% CI $\approx \pm 6.5$ pp at $n{=}200$). On CSQA (high parametric confidence), C2 shows a directional decrease ($-2.0$ and $-1.0$ pp), consistent with Eq. [6](https://arxiv.org/html/2606.11198#S3.E6): triples containing plausible competing entities redistribute mass away from $y^{*}$. On HotpotQA (low confidence), C2 improves accuracy ($+2.5/+2.0$ pp), consistent with Eq. [7](https://arxiv.org/html/2606.11198#S3.E7). On TriviaQA, results are mixed; LLaMA shows $-4.5$ pp. All differences except one fall within the CI.
McNemar’s test at $n{=}1{,}000$ with Bonferroni correction ($\alpha_{\text{adj}} = 0.0083$; Table [2](https://arxiv.org/html/2606.11198#S5.T2)): only LLaMA-3-8B HotpotQA survives ($p_{\text{Bonf}} = 0.001$; OR $= 3.36$). A sign test across all six outcomes yields $p = 0.69$ (non-significant), so the confidence-dependent pattern remains a directional trend requiring replication.
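The sign-test figure can be checked with a few lines of arithmetic. The sketch below assumes an exact two-sided binomial sign test; the 4-versus-2 split is an assumption made here for illustration (the text does not report the split), chosen because it is one that gives a p-value of 0.69 with six outcomes.

```python
from math import comb

def sign_test_two_sided(n_positive, n_negative):
    """Exact two-sided sign test p-value under a fair-coin null."""
    n = n_positive + n_negative
    if n == 0:
        return 1.0
    k = min(n_positive, n_negative)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

# Assumed split: 4 of 6 outcomes in one direction, 2 in the other.
print(round(sign_test_two_sided(4, 2), 2))  # 0.69
```

With only six outcomes, even a unanimous 6-versus-0 split would give a two-sided p-value of about 0.03, so the sign test has little power at this sample size.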
Table 1: Accuracy (%, $n=200$; 95% CI $\approx \pm 6.5$ pp). Best per pair in bold.
Table 2: McNemar’s test (C2 vs C1, $n{=}1{,}000$). $\dagger$ = $p_{\text{Bonf}} < 0.05$. $\ddagger$ = $p_{\text{raw}} < 0.05$ only.
### 5.3 Source–Task Alignment Dominates (Prediction 1)
Having established that the structural tax is format-driven and measurable, we now examine the complementary semantic axis. Table [3](https://arxiv.org/html/2606.11198#S5.T3) reports C7 on HotpotQA. Mistral achieves 61.5% under C7 vs. 24.5% under C2 ($+37.0$ pp); LLaMA achieves 58.0% vs. 26.5% ($+31.5$ pp). The >30 pp gap is an order of magnitude larger than any gating strategy ($\leq$2 pp), strongly supporting Prediction 1.
In the language of our framework, Wikipedia retrieval for HotpotQA maximises $\bar{s}_K^{\text{sem}}$ (task-aligned content) while presenting information in coherent prose with lower $\sigma(K)$ than triple format, reducing both semantic distraction and the structural attention tax simultaneously.
*Caveats:* C2 and C7 differ in retrieval unit, token budget, and text coherence; C7 is limited to HotpotQA; a stronger KG pipeline might narrow the gap. Gating strategies (C4, C6) yield $\leq$2 pp improvements, negligible compared to source selection (Appendix LABEL:app:oracle).
Table 3: C7 on HotpotQA ($n{=}200$). The >30 pp gap supports Prediction 1 (source dominance). C1 values differ from Table [1](https://arxiv.org/html/2606.11198#S5.T1) (separate split).
## 6 Structure-Aware Mitigation Strategies
The structural attention tax framework identifies $\lambda \cdot \sigma(K)$ as a format-driven attention cost independent of content quality. This suggests a principled design space for mitigation, targeting different terms in Eq. [2](https://arxiv.org/html/2606.11198#S3.E2). S3 is supported by both accuracy and attention-level evidence from the C5b condition (Appendix LABEL:app:c5b); S1 is empirically tested with mixed results (Appendix LABEL:app:s1\_dispersal); the remaining three strategies are untested framework-derived hypotheses.
S1: Structural Dispersal (targets $\sigma(K)\!\downarrow$). Interleave triples with natural-language prose, reducing contiguous structural density: $\sigma_{\text{disp}} = \sigma_{\text{orig}} / \rho$ where $\rho > 1$ is the dispersal factor. Predicted compression reduction: $\Delta_D' \approx \Delta_D / \rho$. A pilot experiment (Appendix LABEL:app:s1\_dispersal) reveals that dispersal effects are model-dependent: LLaMA-3-8B shows the predicted 20–39% reduction in KG attention, but Mistral-7B shows a paradoxical *increase*, and accuracy degrades in both cases ($-0.5$ to $-6.5$ pp), suggesting that bridging phrases can introduce new structural anchors that offset the intended dispersal.
S2: Attention Logit Suppression (targets $\bar{s}_K\!\downarrow$). Before softmax, subtract $c > 0$ from KG-region logits: $s_{ij}^{(l)} \leftarrow s_{ij}^{(l)} - c \cdot \mathbb{I}[j \in K]$. This directly counteracts the structural tax:
$$\frac{A_D^{(K,c)}}{A_D^{(0)}} \geq \frac{1}{1 + \frac{m}{T_0} \cdot \exp(\bar{s}_K - c - \bar{s}_D)}. \tag{8}$$
With $\bar{s}_K - \bar{s}_D \approx 0.96$, $c \in [0.5, 1.5]$ should substantially reduce compression while preserving semantic signal.
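The suppression step in S2 can be sketched on a toy score vector. Everything below is illustrative: the region sizes are hypothetical, the only value taken from the text is the 0.96 logit gap, and a real implementation would apply the subtraction inside each attention head rather than on a flat list. As an aside, 0.96 matches $\ln 2.6$ to two decimals, which would tie it to the per-token ratio in Section 5.1, though the text does not state that derivation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of logits."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def demo_attention(scores, regions, c=0.0):
    """Demonstration attention mass after subtracting c from knowledge-token logits."""
    adjusted = [s - c if r == "K" else s for s, r in zip(scores, regions)]
    probs = softmax(adjusted)
    return sum(p for p, r in zip(probs, regions) if r == "D")

# Hypothetical prompt: 6 demonstration, 3 knowledge, 2 query tokens.
regions = ["D"] * 6 + ["K"] * 3 + ["Q"] * 2
gap = 0.96  # logit advantage of knowledge tokens, as quoted in the text
scores = [gap if r == "K" else 0.0 for r in regions]

for c in (0.0, 0.5, 1.0, 1.5):
    print(f"c={c:.1f}  demo attention={demonstration:.3f}"
          if False else f"c={c:.1f}  demo attention={demo_attention(scores, regions, c):.3f}")
```

Setting `c` equal to the gap restores the uniform allocation in this toy case; larger values start to suppress the knowledge region below its unbiased share, which is why the choice of `c` trades off compression against loss of semantic signal.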
S3: Format Flattening (targets $\sigma(K)\!\downarrow$). Convert triples to natural sentences, reducing $\sigma(K)$ by removing delimiter patterns and slot structure. The C5b condition (Appendix LABEL:app:c5b) provides direct support: verbalized triples not only maintain accuracy on 4 of 6 model–task pairs, but also reduce KG-region attention by 17–29% on LLaMA-3-8B (ratio C5b/C2 $\approx 0.71$–$0.83$), confirming that format flattening lowers the structural tax while preserving semantic content. Extension: interrogative form (e.g., “Have you considered that a rug is often on a floor?”) further aligns with demonstration style.
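A minimal sketch of format flattening follows, assuming triples arrive in the `head|relation|tail` form shown in Figure 1, separated by semicolons. The templates are hypothetical and are not the verbalisation used for the C5b condition; the relation names follow ConceptNet conventions.

```python
# Hypothetical sentence templates keyed by ConceptNet-style relation names.
TEMPLATES = {
    "IsA": "{head} is a kind of {tail}.",
    "AtLocation": "{head} is often found at {tail}.",
    "UsedFor": "{head} is used for {tail}.",
}
FALLBACK = "{head} is related to {tail}."  # assumed default for unknown relations

def verbalize(triple_text):
    """Turn 'head|relation|tail; ...' into plain sentences with no delimiters."""
    sentences = []
    for triple in triple_text.split(";"):
        triple = triple.strip()
        if not triple:
            continue
        # Assumes exactly three '|'-separated fields per triple.
        head, relation, tail = (part.strip() for part in triple.split("|"))
        sentence = TEMPLATES.get(relation, FALLBACK).format(head=head, tail=tail)
        sentences.append(sentence[0].upper() + sentence[1:])
    return " ".join(sentences)

print(verbalize("dog|IsA|pet; cat|IsA|animal"))
# Dog is a kind of pet. Cat is a kind of animal.
```

The output contains neither pipe delimiters nor relation keywords, which are the tokens that Definition 1 counts toward structural intensity.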
S4: Confidence-Modulated Injection (targets effective attention). Define trust coefficient $\mu(q) \in [0,1]$: $A_K^{\text{eff}} = \mu(q) \cdot A_K^{\text{sem}} + (1 - \mu(q)) \cdot \lambda \cdot \sigma(K)$. A meta-instruction (“These facts are for reference only”) encourages high $\mu(q)$ for confident queries.
S5: Structural Adversarial Regularisation (targets $\lambda\!\downarrow$). During fine-tuning, penalise attention to noise triples:
$$\mathcal{L}_{\text{struct}} = \alpha_{\text{reg}} \sum_{l} \sum_{i} \frac{A_K^{(l)}(i; K_{\text{noise}})}{A_D^{(l)}(i) + \epsilon}. \tag{9}$$
This directly reduces $\lambda$, the model-intrinsic format bias; it is the most expensive but most durable approach.
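The penalty in Eq. 9 can be written out directly. The sketch below uses plain Python lists so that it runs without dependencies; an actual fine-tuning setup would compute the same ratio on differentiable attention tensors. The attention values, `alpha_reg`, and `eps` are hypothetical.

```python
def structural_penalty(attn_k, attn_d, alpha_reg=0.1, eps=1e-6):
    """L_struct = alpha_reg * sum over layers l and query tokens i of A_K / (A_D + eps).

    attn_k[l][i]: attention mass on the noise-triple region at layer l, query token i.
    attn_d[l][i]: attention mass on the demonstration region at the same position.
    """
    total = 0.0
    for layer_k, layer_d in zip(attn_k, attn_d):
        for a_k, a_d in zip(layer_k, layer_d):
            total += a_k / (a_d + eps)
    return alpha_reg * total

# Hypothetical attention masses for 2 layers and 2 query tokens.
attn_k = [[0.10, 0.08], [0.07, 0.09]]
attn_d = [[0.20, 0.25], [0.30, 0.22]]
print(structural_penalty(attn_k, attn_d))
```

The ratio form means the penalty grows both when noise triples gain attention and when demonstrations lose it, so minimising it pushes against the compression effect from either side.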
Table 4: Mitigation strategies derived from the structural attention tax framework. S3 is supported by C5b accuracy and attention evidence (Appendix LABEL:app:c5b); S1 shows mixed results (Appendix LABEL:app:s1\_dispersal); S2/S4/S5 are untested.
Table [4](https://arxiv.org/html/2606.11198#S6.T4) summarises all five strategies. The strategies form a cost–effectiveness ladder: S1/S3 require only prompt modification, S2 requires logit access, S4 requires dual inference, S5 requires fine-tuning. This structured design space is a direct consequence of the decomposition in Eq. [2](https://arxiv.org/html/2606.11198#S3.E2): each strategy targets a specific term, enabling principled selection based on deployment constraints. Notably, the contrasting outcomes of S1 (mixed) and S3 (supported) highlight that *how* $\sigma(K)$ is reduced matters: eliminating structural patterns entirely (S3) is more reliable than diluting them with additional tokens that may themselves become attention anchors (S1).
## 7Discussion
The structural attention tax as a unifying concept\.Prior work has noted that retrieval can hurt\(Shi and others,[2023](https://arxiv.org/html/2606.11198#bib.bib23)\), that parametric and retrieved knowledge compete\(Wuet al\.,[2024](https://arxiv.org/html/2606.11198#bib.bib31)\), and that RAG gains diminish when models are already confident\(Liu and others,[2024](https://arxiv.org/html/2606.11198#bib.bib14)\)\. Our framework unifies these observations by identifying a single mechanism—format\-driven attention capture—that operates orthogonally to content quality\. The∼\\sim3×\\timesratio between KG\-format and neutral\-text attention capture \(σ^\(KG\)/σ^\(neutral\)\\hat\{\\sigma\}\(\\text\{KG\}\)/\\hat\{\\sigma\}\(\\text\{neutral\}\)\) confirms that triple format elevates attention independently of content, and the exponential amplification in Eq\.[5](https://arxiv.org/html/2606.11198#S3.E5)explains why this modest bias produces substantial downstream effects\.
Two orthogonal axes. The >30 pp BM25 gap demonstrates the semantic axis; the $\sim 3\times$ structural ratio identifies an untapped structural axis. Current RAG research focuses almost exclusively on the former; our framework motivates systematic attention to the latter.
Practical guidelines. (1) Match knowledge source to task; this is our strongest finding, though it currently rests on a single task. (2) If parametric confidence is high, avoid KG injection from mismatched sources. (3) Apply format flattening (zero-cost, empirically supported) rather than structural dispersal (which may introduce new attention anchors). (4) Use logit suppression when attention access is available. (5) For deployment, consider structural adversarial regularisation. Guidelines (4) and (5) follow from the framework but have not yet been evaluated empirically.
Broader implications. The structural attention tax may extend to any prompt component with distinctive formatting (SQL, JSON, code blocks), suggesting that format normalisation should be a standard preprocessing step in RAG pipelines.
## 8 Limitations
- Statistical power: Only one of six Bonferroni-corrected comparisons is significant (sign test, $p = 0.69$); the confidence-dependent pattern is a directional trend.
- Retrieval: Cosine-similarity ConceptNet retrieval is relatively weak; BM25 was evaluated only on HotpotQA.
- Scale: Two 7B/8B models, 4-bit NF4 quantisation (FP16 shows $+10$ pp on HotpotQA C1), greedy decoding.
- Attention: Last-layer, correlational, no causal intervention.
- Theory: Eq. LABEL:eq:mi_decomp is a heuristic; Eq. [2](https://arxiv.org/html/2606.11198#S3.E2) assumes additive separation.
- Mitigation: S3 is supported by both accuracy and attention evidence; S1 yields mixed results with model-dependent effects; the remaining three strategies (S2, S4, S5) are untested.
- Generality: Demonstrated only for KG triple format.

See Appendix LABEL:app:extended_limitations for extended discussion.
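For readers who want to run this kind of power check themselves, the sketch below implements an exact two-sided sign test and a Bonferroni-adjusted significance level using only the standard library. The function names are hypothetical, and the counts in the usage example are chosen for illustration; they are not the paper's data.

```python
from math import comb


def sign_test_two_sided(n_pos: int, n_neg: int) -> float:
    """Exact two-sided sign test p-value under H0: P(positive) = 0.5.

    Ties are assumed to have been dropped before counting.
    """
    n = n_pos + n_neg
    if n == 0:
        return 1.0
    k = min(n_pos, n_neg)
    # Probability of a split at least as lopsided as the one observed,
    # on one side of the Binomial(n, 0.5) distribution.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def bonferroni_alpha(alpha: float, n_comparisons: int) -> float:
    """Per-comparison significance level after Bonferroni correction."""
    return alpha / n_comparisons


# Hypothetical example: 4 comparisons favour one condition, 2 the other.
p_value = sign_test_two_sided(4, 2)
print(f"sign test p = {p_value:.4f}")
print(f"per-test alpha for 6 comparisons = {bonferroni_alpha(0.05, 6):.4f}")
```

With only six comparisons, even a 4-to-2 split gives a p-value far above any conventional threshold, which illustrates how little a sign test can resolve at this sample size.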
## 9 Conclusion
We have introduced the structural attention tax: a format-driven mechanism by which structured prompt regions (such as knowledge graph triples) capture disproportionate attention independent of their semantic content, compressing demonstration attention by up to 42%. Our formal framework decomposes attention competition into semantic and structural components, revealing that these two axes govern different aspects of retrieval-augmented ICL: the semantic term determines *whether* augmentation helps; the structural term determines *how much* attention is taxed.
This decoupling yields three actionable insights. First, source–task alignment along the semantic axis dominates: task-matched BM25 retrieval achieves 58–62% on HotpotQA vs. ConceptNet’s 25–27% (>30 pp gap). Second, the structural tax is real and measurable: $\hat{\sigma}(\text{KG}) \approx 3 \times \hat{\sigma}(\text{neutral})$, with noise and relevant triples showing comparable attention capture. Third, the framework generates a principled design space of five mitigation strategies; empirical evaluation of two prompt-level strategies reveals that format flattening (S3) effectively reduces the structural tax (17–29% KG attention reduction on LLaMA-3-8B) while preserving accuracy, whereas structural dispersal (S1) produces model-dependent effects with accuracy degradation, highlighting that the design of format-level interventions requires care to avoid introducing new structural anchors.
Future work: (i) run causal interventions (attention masking, activation patching) to validate the structural tax mechanism; (ii) extend C7 to other tasks; (iii) empirically evaluate the remaining three mitigation strategies (S2, S4, S5); (iv) investigate the structural tax for other formatted prompt components (SQL, JSON, code blocks).
## References
- S. An, B. Zhou, Z. Lin, Q. Fu, B. Chen, N. Zheng, W. Chen, and J. Lou (2023) Skill-based few-shot selection for in-context learning. In Proceedings of EMNLP, pp. 13472–13492. Cited by: [§2](https://arxiv.org/html/2606.11198#S2.p1.1).
- T. B. Brown et al. (2020) Language models are few-shot learners. In Advances in Neural Information Processing Systems, Vol. 33, pp. 1877–1901. Cited by: [§1](https://arxiv.org/html/2606.11198#S1.p2.1), [§2](https://arxiv.org/html/2606.11198#S2.p1.1).
- P. Chen, S. Chen, M. Wang, S. X. Leong, P. Fung, V. Bernales, and A. Aspuru-Guzik (2025) Schema for in-context learning. arXiv preprint arXiv:2510.13905. Cited by: [§1](https://arxiv.org/html/2606.11198#S1.p2.1), [§2](https://arxiv.org/html/2606.11198#S2.p1.1).
- S. Deng, N. Zhang, N. Oo, and B. Hooi (2024) Towards a unified view of answer calibration for multi-step reasoning. In Proc. 2nd Workshop on Natural Language Reasoning and Structured Explanations (NLRSE @ ACL 2024), pp. 25–38. Cited by: [§2](https://arxiv.org/html/2606.11198#S2.p3.1).
- A. Dubey et al. (2024) The Llama 3 herd of models. arXiv preprint arXiv:2407.21783. Cited by: [§4.2](https://arxiv.org/html/2606.11198#S4.SS2.p1.3).
- Q. Huang et al. (2023) PRODIGY: enabling in-context learning over graphs. In Advances in Neural Information Processing Systems. Cited by: [§2](https://arxiv.org/html/2606.11198#S2.p2.1).
- A. Q. Jiang et al. (2023a) Mistral 7B. arXiv preprint arXiv:2310.06825. Cited by: [§4.2](https://arxiv.org/html/2606.11198#S4.SS2.p1.3).
- Z. Jiang et al. (2023b) Active retrieval augmented generation. In Proceedings of EMNLP, pp. 7969–7992. Cited by: [§1](https://arxiv.org/html/2606.11198#S1.p1.1), [§2](https://arxiv.org/html/2606.11198#S2.p2.1), [§2](https://arxiv.org/html/2606.11198#S2.p4.1).
- M. Jin et al. (2025) Disentangling memory and reasoning ability in large language models. In Proceedings of ACL, pp. 1681–1701. Cited by: [§2](https://arxiv.org/html/2606.11198#S2.p3.1).
- T. Jin et al. (2023) The cost of down-scaling language models: fact recall deteriorates before in-context learning. arXiv preprint arXiv:2310.04680. Cited by: [§2](https://arxiv.org/html/2606.11198#S2.p1.1).
- M. Joshi, E. Choi, D. Weld, and L. Zettlemoyer (2017) TriviaQA: a large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of ACL, pp. 1601–1611. Cited by: [§4.2](https://arxiv.org/html/2606.11198#S4.SS2.p1.3).
- S. M. Kazemi et al. (2023) LAMBADA: backward chaining for automated reasoning in natural language. In Proceedings of ACL, pp. 6547–6568. Cited by: [§2](https://arxiv.org/html/2606.11198#S2.p3.1).
- T. Li, X. Ma, A. Zhuang, Y. Gu, Y. Su, and W. Chen (2023) Few-shot in-context learning on knowledge base question answering. In Proceedings of ACL, pp. 6966–6980. Cited by: [§2](https://arxiv.org/html/2606.11198#S2.p2.1).
- J. Liu et al. (2024) How much can RAG help the reasoning of LLM? arXiv preprint arXiv:2410.02338. Cited by: [§1](https://arxiv.org/html/2606.11198#S1.p3.2), [§2](https://arxiv.org/html/2606.11198#S2.p2.1), [§7](https://arxiv.org/html/2606.11198#S7.p1.3).
- S. Min et al. (2022) Rethinking the role of demonstrations: what makes in-context learning work? In Proceedings of EMNLP, pp. 11048–11064. Cited by: [§2](https://arxiv.org/html/2606.11198#S2.p1.1).
- A. Parry, D. Ganguly, and M. Chandra (2024) “In-context learning” or: how I learned to stop worrying and love applied information retrieval. In Proceedings of SIGIR, pp. 14–25. Cited by: [§1](https://arxiv.org/html/2606.11198#S1.p1.1), [§2](https://arxiv.org/html/2606.11198#S2.p1.1).
- F. Shi et al. (2023) Large language models can be easily distracted by irrelevant context. In Proceedings of ICML, pp. 31210–31227. Cited by: [§1](https://arxiv.org/html/2606.11198#S1.p3.2), [§2](https://arxiv.org/html/2606.11198#S2.p2.1), [§7](https://arxiv.org/html/2606.11198#S7.p1.3).
- Y. Sui, Y. He, N. Liu, X. He, K. Wang, and B. Hooi (2025) FiDeLiS: faithful reasoning in LLMs for knowledge graph question answering. In Findings of ACL, pp. 8315–8330. Cited by: [§1](https://arxiv.org/html/2606.11198#S1.p1.1), [§2](https://arxiv.org/html/2606.11198#S2.p2.1).
- A. Talmor, J. Herzig, N. Lourie, and J. Berant (2019) CommonsenseQA: a question answering challenge targeting commonsense knowledge. In Proceedings of NAACL-HLT, pp. 4149–4158. Cited by: [§4.2](https://arxiv.org/html/2606.11198#S4.SS2.p1.3).
- J. Wei et al. (2022) Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, Vol. 35, pp. 24824–24837. Cited by: [§2](https://arxiv.org/html/2606.11198#S2.p3.1).
- K. Wu, E. Wu, and J. Zou (2024) How faithful are RAG models? Quantifying the tug-of-war between RAG and LLMs’ internal prior. arXiv preprint arXiv:2404.10198. Cited by: [§1](https://arxiv.org/html/2606.11198#S1.p3.2), [§2](https://arxiv.org/html/2606.11198#S2.p2.1), [§2](https://arxiv.org/html/2606.11198#S2.p4.1), [§7](https://arxiv.org/html/2606.11198#S7.p1.3).
- Z. Yang et al. (2018) HotpotQA: a dataset for diverse, explainable multi-hop question answering. In Proceedings of EMNLP, pp. 2369–2380. Cited by: [§4.2](https://arxiv.org/html/2606.11198#S4.SS2.p1.3).
- C. Zheng et al. (2023) Can we edit factual knowledge by in-context learning? In Proceedings of EMNLP, pp. 4862–4876. Cited by: [§2](https://arxiv.org/html/2606.11198#S2.p2.1).
## Appendix
## Appendix A Experimental Setup and Validation
### A.1 Extended Condition Descriptions
- C2: Triples ranked by cosine similarity (all-MiniLM-L6-v2), rendered as natural language.
- C3: Shares relational format with C2 ($\sigma(\text{C3}) \approx \sigma(\text{C2})$), enabling Prediction 3 testing.
- C4: $\tau = -0.3$ via grid search on 50 held-out samples.
- C5: Low $\sigma$ ($\sigma(\text{C5}) \ll \sigma(\text{C2})$); does not control for syntactic form.
- C6: $\tau_{a}$, $\delta$ via 5-fold CV.
- C7: HotpotQA only. Entity extraction: spaCy NER + noun-chunk detector.
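The C2 ranking step can be pictured with a minimal sketch such as the one below. The embeddings are hand-written toy vectors standing in for sentence-encoder outputs (the paper uses all-MiniLM-L6-v2), and `rank_by_cosine`, `top_k`, and the candidate texts are hypothetical names and data introduced only for illustration.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_cosine(
    query_vec: Sequence[float],
    candidates: List[Tuple[str, Sequence[float]]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the top_k candidate texts by similarity to the query."""
    scored = [(text, cosine(query_vec, vec)) for text, vec in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 3-d vectors; real embeddings would come from a sentence encoder.
query = [1.0, 0.0, 0.0]
candidates = [
    ("a hammer is used for driving nails", [0.9, 0.1, 0.0]),
    ("a bird can fly", [0.0, 1.0, 0.0]),
    ("a nail is made of metal", [0.7, 0.0, 0.7]),
]

for text, score in rank_by_cosine(query, candidates, top_k=2):
    print(f"{score:.3f}  {text}")
```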
### A.2 FP16 vs. INT4 Quantisation Comparison
Table [5](https://arxiv.org/html/2606.11198#A1.T5) compares FP16 and INT4 accuracy on a subset of HotpotQA. The $+10$ pp gap under C1 indicates that quantisation substantially affects baseline performance, warranting caution when interpreting small effect sizes in the main experiments.
Table 5: FP16 vs. INT4 (Mistral-7B, HotpotQA, $n = 50$).
### A.3 Alias-Aware Evaluation
Alias matching yields uniform improvements ($\leq 3$ pp) that do not alter the conclusions.
### A.4 SARP Threshold Stability
Bootstrap stability analysis (200 resamples) confirms that $\tau = -0.3$ is almost never optimal ($< 4\%$); the modal choice is $\tau = -0.05$ (62–100%).
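A stability check of this kind can be sketched as follows: resample the held-out set with replacement, rerun the grid search on each resample, and record how often each candidate threshold wins. Only the 200-resample count comes from the text. The objective (`accuracy_at`), the synthetic held-out data, and the candidate grid are hypothetical stand-ins, so the printed rates say nothing about the paper's results.

```python
import random
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Sample = Tuple[float, bool]  # (gating score, whether injection helps)


def accuracy_at(tau: float, samples: Sequence[Sample]) -> float:
    """Hypothetical objective: how often 'inject iff score >= tau' is right."""
    hits = sum((score >= tau) == helps for score, helps in samples)
    return hits / len(samples)


def best_threshold(grid: Sequence[float], samples: Sequence[Sample]) -> float:
    # max() returns the first maximiser, so ties go to the earliest grid value.
    return max(grid, key=lambda tau: accuracy_at(tau, samples))


def bootstrap_selection_rates(
    grid: Sequence[float],
    samples: Sequence[Sample],
    n_resamples: int = 200,
    seed: int = 0,
) -> Dict[float, float]:
    """Fraction of bootstrap resamples in which each threshold is selected."""
    rng = random.Random(seed)
    wins: Counter = Counter()
    for _ in range(n_resamples):
        resample = [rng.choice(samples) for _ in samples]
        wins[best_threshold(grid, resample)] += 1
    return {tau: wins[tau] / n_resamples for tau in grid}


# Synthetic held-out set of 50 samples (illustration only).
data_rng = random.Random(1)
held_out: List[Sample] = []
for _ in range(50):
    score = data_rng.uniform(-1.0, 1.0)
    helps = score + data_rng.gauss(0.0, 0.3) > 0.0
    held_out.append((score, helps))

grid = [-0.5, -0.3, -0.1, -0.05, 0.0, 0.1]
rates = bootstrap_selection_rates(grid, held_out)
for tau, rate in rates.items():
    print(f"tau={tau:+.2f}: selected in {rate:.0%} of resamples")
```

A threshold that is chosen on the original sample but rarely wins across resamples, as reported above for $\tau = -0.3$, is a sign that the grid search overfit the small held-out set.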
## Appendix B Supplementary Results
### B.1 Additional Main Results
This section reports supplementary metrics that complement the main accuracy results. Table [6](https://arxiv.org/html/2606.11198#A2.T6) reports mean answer log-probabilities, serving as a confidence proxy across conditions. Table [7](https://arxiv.org/html/2606.11198#A2.T7) traces per-sample error transitions from C1 to C2. Figure LABEL:fig:heatmap visualises accuracy deltas across all conditions as a heatmap. Table LABEL:tab:mcnemar500 reports McNemar tests at $n \approx 500$ as an intermediate power check.
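Both the error-flow table and the McNemar tests start from paired per-sample correctness under two conditions. The sketch below shows one way to compute them, assuming boolean correctness lists for C1 and C2. The function names and the toy data are hypothetical, and the continuity-corrected chi-square form is one common variant of McNemar's test, not necessarily the one used in the paper.

```python
import math
from collections import Counter
from typing import Dict, Sequence, Tuple

Flow = Dict[Tuple[bool, bool], int]


def error_flow(c1_correct: Sequence[bool], c2_correct: Sequence[bool]) -> Flow:
    """Count per-sample transitions (correct under C1, correct under C2)."""
    if len(c1_correct) != len(c2_correct):
        raise ValueError("both conditions must cover the same samples")
    counts = Counter(zip(c1_correct, c2_correct))
    return {
        (a, b): counts.get((a, b), 0)
        for a in (True, False)
        for b in (True, False)
    }


def mcnemar_p(flow: Flow) -> float:
    """McNemar p-value using the continuity-corrected chi-square statistic."""
    b = flow[(True, False)]  # correct under C1, wrong under C2
    c = flow[(False, True)]  # wrong under C1, correct under C2
    if b + c == 0:
        return 1.0
    # Clamp at zero so that b == c gives a statistic of 0 rather than 1/(b+c).
    chi2 = max(abs(b - c) - 1, 0) ** 2 / (b + c)
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


# Toy paired outcomes for 20 samples (illustration only).
c1 = [True] * 10 + [False] * 10
c2 = [True] * 6 + [False] * 4 + [True] * 2 + [False] * 8

flow = error_flow(c1, c2)
print("C1 correct -> C2 wrong:", flow[(True, False)])
print("C1 wrong -> C2 correct:", flow[(False, True)])
print(f"McNemar p = {mcnemar_p(flow):.3f}")
```

Only the discordant cells enter the test, so two conditions with identical accuracy can still differ in which samples they get right; the error-flow table makes that visible.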
Table 6: Answer log-probability (mean), $n = 200$.

Table 7: Per-sample error flow (C1 $\to$ C2), $n = 200$.