In-Context Optimization for Retrieval-Augmented Generation: A Gradient-Descent Perspective


Summary

This paper studies retrieval-augmented generation as an in-context optimization process, showing that linear self-attention can implement gradient descent on a unified RAG objective. It proposes a lightweight method for frozen RAG LLMs that predicts context-conditioned updates, improving performance across multiple QA benchmarks.

arXiv:2605.26356v1 Announce Type: new


# In-Context Optimization for Retrieval-Augmented Generation: A Gradient-Descent Perspective
Source: [https://arxiv.org/html/2605.26356](https://arxiv.org/html/2605.26356)
Mingchen Li¹, Jiatan Huang², Chuxu Zhang², Liang Zhao³, Hong Yu¹

¹University of Massachusetts, Amherst; ²University of Connecticut; ³Emory University

###### Abstract

In\-context learning has recently been linked to implicit gradient descent in linear self\-attention models, suggesting that context can induce a forward\-pass update\. Retrieval\-augmented generation \(RAG\) also relies on context, but retrieved documents are usually treated as static evidence rather than signals for adaptation\. We study RAG as an in\-context optimization process\. First, we show that one linear self\-attention layer can implement one gradient\-descent step on a unified linearized RAG objective covering both projection\-based and dot\-product retrieval interfaces\. This gives an exact regime where retrieval\-augmented prediction and in\-context optimization coincide\. We use this result not as a literal model of LLM computation, but as a guide for adapting the interaction between queries and retrieved evidence\. We then test the boundary of this correspondence: it remains stable under controlled linear extensions, but becomes feature\-distribution dependent under nonlinear architectures\. Finally, we turn this view into a lightweight method for frozen RAG LLMs\. The method keeps the retriever and backbone fixed, and predicts a context\-conditioned update to a generator\-side evidence\-use interface\. Across seven QA benchmarks, two retrievers, and two frozen LLM backbones, this forward\-only update improves a shared\-interface baseline, transfers to held\-out tasks, and approaches test\-time gradient adaptation at much lower per\-query cost\.

## 1 Introduction

Large language models \(LLMs\) have achieved strong performance across many natural\-language tasks, but adapting them to knowledge outside their static pretraining corpus remains difficult\. Retrieval\-augmented generation \(RAG\) \(Lewis et al\., [2020](https://arxiv.org/html/2605.26356#bib.bib7)\) addresses this limitation by conditioning a frozen LLM on documents retrieved from an external corpus\. However, retrieval alone does not solve the full adaptation problem\. After relevant documents are retrieved, the model must still decide how to use them for a new task, domain, or query distribution\.

Existing RAG systems usually address this problem in one of three ways\. The first is to keep both the retriever and generator fixed, and simply prepend retrieved documents to the input\. This strategy is efficient, but it treats retrieved documents as static evidence and gives the model no mechanism to adjust how evidence should be used\. The second is to fine\-tune the retriever, the generator, or both\. This can improve task performance, but it requires additional training and can be expensive when the task or domain changes\. The third is to use in\-context learning \(ICL\), where a few input\-output examples are provided at inference time\. ICL is attractive because it avoids full model retraining, but it is still unclear whether these examples merely provide extra demonstrations, or whether they can induce a more systematic update to how a RAG model uses retrieved evidence\.

This paper asks a simple question: can retrieved evidence and a few RAG examples act not only as context to read from, but also as a signal for adapting how the model uses evidence? Answering this question requires connecting two views that have mostly been studied separately\. On one side, recent theory shows that, under linear self\-attention, in\-context learning can implement gradient descent on the examples in the context \(von Oswald et al\., [2023](https://arxiv.org/html/2605.26356#bib.bib6); Akyürek et al\., [2023](https://arxiv.org/html/2605.26356#bib.bib28); Mahankali et al\., [2023](https://arxiv.org/html/2605.26356#bib.bib4)\)\. This suggests that context can behave like a forward\-pass update, rather than only as additional input text\. On the other side, RAG introduces structure that is absent from standard ICL theory: a query, retrieved documents, query\-evidence interactions, and a generator that must combine them to produce an answer\. It remains unclear whether the gradient\-descent view of ICL extends to retrieval\-augmented prediction, and whether such a view can guide practical adaptation in real RAG systems\.

We study RAG from this in\-context optimization perspective\. Our goal is not to claim that modern retrieval\-augmented LLMs literally perform gradient descent during inference\. Instead, we use a controlled linear setting to identify where such an update would act\. In linear RAG, the relevant update acts on the interaction between the query and retrieved evidence\. This gives a simple design principle for LLM\-scale RAG: rather than changing which documents are retrieved, we adapt how the frozen generator uses the retrieved documents\. Guided by this principle, we propose a forward\-only adaptation method for frozen RAG LLMs\. The method keeps both the external retriever and the LLM backbone fixed, and adapts only a lightweight generator\-side evidence\-use interface implemented with LoRA\. At inference time, given new few\-shot RAG demonstrations, the predictor produces the update in a single forward pass, enabling the generator to adjust how it uses retrieved documents without re\-training on the new dataset\.

We develop this idea in three steps\. First, we prove that one linear self\-attention layer can implement one gradient\-descent step on a unified linearized RAG objective covering both projection\-based and dot\-product retrieval interfaces\. Second, we test how far this correspondence extends beyond the exact construction\. A trained self\-attention layer closely matches the constructed gradient\-descent predictor under controlled linear shifts, varying document counts, and stacked depths, while nonlinear architectures and real\-world regression data reveal a clear dependence on feature distribution\. Third, we use the optimization view to guide LLM\-scale RAG adaptation\. Across seven QA benchmarks, two retrievers, and two frozen LLM backbones, the predicted update improves a shared\-interface baseline, transfers to held\-out tasks, and approaches test\-time gradient adaptation at much lower per\-query cost\. Our contributions are summarized as follows:

- An in\-context optimization view of linear RAG\. We extend the ICL\-as\-gradient\-descent perspective from generator\-only prediction to retrieval\-augmented prediction\. We prove that one linear self\-attention layer can implement one gradient\-descent step on a unified linearized RAG loss covering both linear\-projection and dot\-product retrieval interfaces\. We also show that stacking $K$ linear self\-attention layers gives a multi\-step view of in\-context optimization for linear RAG\.
- A boundary analysis beyond the exact linear setting\. We test when the linear construction remains predictive and when it breaks\. On synthetic linear regression tasks, a trained self\-attention layer closely matches the constructed gradient\-descent predictor under distribution shift, varying document counts, and stacked depths\. On nonlinear architectures and four real\-world regression datasets, the alignment degrades in a structured way and becomes sensitive to feature distribution\.
- Adapting evidence use without test\-time backpropagation\. We use the optimization view to guide adaptation in frozen RAG LLMs\. Rather than changing the external retriever, we adapt a generator\-side evidence\-use interface implemented with Q/K/V LoRA modules\. A small context\-conditioned predictor amortizes the autograd\-defined $K$\-step update to this interface\. Across seven QA benchmarks, two backbones, and two retrievers, the predicted update improves a shared\-interface baseline, transfers to held\-out domains, and approaches test\-time gradient adaptation at much lower per\-query cost\.

## 2 Related Work

Retrieval\-augmented generation \(RAG\) conditions a language model on documents retrieved from an external corpus \(Lewis et al\., [2020](https://arxiv.org/html/2605.26356#bib.bib7); Guu et al\., [2020](https://arxiv.org/html/2605.26356#bib.bib19); Karpukhin et al\., [2020](https://arxiv.org/html/2605.26356#bib.bib9); Izacard and Grave, [2021](https://arxiv.org/html/2605.26356#bib.bib39); Borgeaud et al\., [2022](https://arxiv.org/html/2605.26356#bib.bib37); Zhang et al\., [2025a](https://arxiv.org/html/2605.26356#bib.bib23)\)\. Prior work has improved RAG through better retrieval, prompting, evidence fusion, and joint retriever\-generator training \(Ram et al\., [2023](https://arxiv.org/html/2605.26356#bib.bib36); Asai et al\., [2024](https://arxiv.org/html/2605.26356#bib.bib38); Huang et al\., [2026](https://arxiv.org/html/2605.26356#bib.bib16); Zhang et al\., [2025b](https://arxiv.org/html/2605.26356#bib.bib18)\)\. Our focus is complementary: rather than changing which documents are retrieved, we study how a frozen generator can adapt its use of already\-retrieved evidence\. We position this contribution relative to three lines of work\.

#### In\-context learning as gradient descent\.

A growing line of theory interprets in\-context learning as implicit optimization\. Under linear self\-attention, a single ICL forward pass can implement one gradient\-descent step, and stacked layers can implement multiple steps \(von Oswald et al\., [2023](https://arxiv.org/html/2605.26356#bib.bib6); Akyürek et al\., [2023](https://arxiv.org/html/2605.26356#bib.bib28); Mahankali et al\., [2023](https://arxiv.org/html/2605.26356#bib.bib4); Dai et al\., [2023](https://arxiv.org/html/2605.26356#bib.bib42)\)\. Later work extends this view to preconditioned gradient descent \(Ahn et al\., [2023](https://arxiv.org/html/2605.26356#bib.bib40)\), in\-context algorithm selection \(Bai et al\., [2023](https://arxiv.org/html/2605.26356#bib.bib41)\), the role of depth \(Vladymyrov et al\., [2024](https://arxiv.org/html/2605.26356#bib.bib17); Gatmiry et al\., [2024](https://arxiv.org/html/2605.26356#bib.bib5)\), and kernel\-regression interpretations of attention \(Shen et al\., [2026](https://arxiv.org/html/2605.26356#bib.bib2); Ren and Liu, [2024](https://arxiv.org/html/2605.26356#bib.bib1)\)\. These analyses mainly study generator\-only settings, often through linear regression or simplified attention\. RAG introduces additional structure, including retrieved documents, query\-evidence interactions, and evidence\-conditioned generation\. We extend the gradient\-descent view to a linearized RAG setting and use it to identify where evidence\-use adaptation should act\.

#### Context\-conditioned weight prediction\.

Another line of work learns auxiliary networks that produce model updates from a small context\. HyperTuning \(Phang et al\., [2023](https://arxiv.org/html/2605.26356#bib.bib29)\) predicts soft prompts or low\-rank weights from few\-shot examples\. HyperFlow \(Kim et al\., [2025](https://arxiv.org/html/2605.26356#bib.bib31)\) learns support\-conditioned fine\-tuning dynamics\. MAC \(Tack et al\., [2024](https://arxiv.org/html/2605.26356#bib.bib32)\) maps documents into memory modulations\. MEND \(Mitchell et al\., [2022](https://arxiv.org/html/2605.26356#bib.bib33)\) maps fine\-tuning gradients into knowledge edits\. RAG\-GD follows the broad template of predicting an update from context, but differs in both the target and the adaptation site\. The target is not a downstream task loss, a meta\-learning objective, a memory objective, or a single editing gradient\. Instead, the predictor matches an autograd\-defined $K$\-step SGD update induced by RAG\-formatted demonstrations\. The adapted parameters are also restricted to a generator\-side evidence\-use interface, while the retriever and backbone remain fixed\.

#### Test\-time adaptation\.

Standard adaptation either updates model parameters before deployment, as in fine\-tuning and LoRA \(Hu et al\., [2022](https://arxiv.org/html/2605.26356#bib.bib35)\), or leaves the model unchanged at inference, as in pure ICL\. Test\-time training \(Sun et al\., [2020](https://arxiv.org/html/2605.26356#bib.bib34)\) lies between these extremes by updating parameters for each test instance, but this requires per\-instance backpropagation and becomes expensive for large LLMs\. Recent studies compare ICL, fine\-tuning, and trainable RAG as system\-level adaptation strategies \(Wang et al\., [2024a](https://arxiv.org/html/2605.26356#bib.bib21); Mosbach et al\., [2023](https://arxiv.org/html/2605.26356#bib.bib20); Li et al\., [2025](https://arxiv.org/html/2605.26356#bib.bib24)\)\. RAG\-GD targets the same goal of adapting at inference time, but amortizes the update: a small predictor emits a LoRA update to the generator’s evidence\-use interface in one forward pass\. Thus, it avoids backpropagation through the LLM at deployment while keeping both the external retriever and the frozen backbone unchanged\.

## 3 A Linear RAG Setting Where Self\-Attention Implements Gradient Descent

This section establishes the linear\-regime basis for our in\-context optimization view of RAG\. We study a controlled setting in which retrieval\-augmented prediction and gradient descent can be connected exactly\. The goal is not to model modern RAG systems literally: real retrievers involve discrete document selection, and modern LLMs are deep and nonlinear\. Instead, we isolate a differentiable retrieval\-augmented prediction problem and show that one linear self\-attention layer can realize the prediction shift produced by one gradient\-descent step\. The proof and explicit construction are in Appendix [A](https://arxiv.org/html/2605.26356#A1), and the derivations for the retrieval variants are in Appendix [B](https://arxiv.org/html/2605.26356#A2)\.

### 3\.1 Self\-Attention

We begin with a multi\-head self\-attention block parameterized by $\theta=\{P_h,W_{h,V},W_{h,K},W_{h,Q}\}_{h=1}^{H}$\. Given tokens $\{e_1,\ldots,e_N\}\subset\mathbb{R}^{d}$, the update for token $e_j$ is

$$e_j\leftarrow e_j+\mathrm{SA}_{\theta}(j,\{e_i\}_{i=1}^{N})=e_j+\sum_{h}P_hV_h\,\mathrm{softmax}(K_h^{\top}q_{h,j}),\tag{1}$$

where $V_h$, $K_h$, and $q_{h,j}$ are the value matrix, key matrix, and query vector for head $h$\. Following \(von Oswald et al\., [2023](https://arxiv.org/html/2605.26356#bib.bib6); Vladymyrov et al\., [2024](https://arxiv.org/html/2605.26356#bib.bib17)\), we remove the softmax and bias terms to obtain the linear self\-attention \(LSA\) update:

$$e_j\leftarrow e_j+\mathrm{LSA}_{\theta}(j,\{e_i\}_{i=1}^{N})=e_j+\sum_{h}P_hV_hK_h^{\top}q_{h,j}.\tag{2}$$
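As a concrete reading of Eqs\. 1 and 2, the following minimal sketch computes a single\-head update for one token, with and without the softmax. It is illustrative only: the helper names `matvec` and `attention_update` and the toy matrices are assumptions made here, and the sketch assumes that the columns of $V_h$ and $K_h$ are the projected tokens $W_{h,V}e_i$ and $W_{h,K}e_i$ and that $q_{h,j}=W_{h,Q}e_j$.

```python
import math

def matvec(m, v):
    """Multiply matrix m (a list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def attention_update(tokens, j, w_q, w_k, w_v, p, linear=True):
    """Return e_j + P V f(K^T q_j) for one head.

    f is the identity for linear self-attention (Eq. 2) and a softmax
    over context positions for standard attention (Eq. 1).
    """
    q = matvec(w_q, tokens[j])
    keys = [matvec(w_k, e) for e in tokens]
    values = [matvec(w_v, e) for e in tokens]
    # One score per context token: the entries of K^T q_j.
    scores = [sum(a * b for a, b in zip(k, q)) for k in keys]
    if not linear:
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        scores = [x / total for x in exps]
    dim = len(values[0])
    # Score-weighted sum of value vectors: V f(K^T q_j).
    mixed = [sum(s * v[c] for s, v in zip(scores, values)) for c in range(dim)]
    delta = matvec(p, mixed)
    return [a + b for a, b in zip(tokens[j], delta)]

# Toy example with identity projections (an assumption for readability).
identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
lsa_out = attention_update(tokens, 2, identity, identity, identity, identity, linear=True)
sa_out = attention_update(tokens, 2, identity, identity, identity, identity, linear=False)
```

Without the softmax, the raw inner products act directly as mixing weights, which is what allows the update to be matched term by term against a gradient expression later in this section.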

### 3\.2 A Unified Linearized RAG Predictor

We use a linearized abstraction of retrieval\-augmented prediction\. Rather than modeling discrete top\-$k$ selection, this abstraction captures a differentiable interface in which query features and retrieval\-derived features jointly determine the prediction\. Both a projection\-based retrieval interface \(Lewis et al\., [2020](https://arxiv.org/html/2605.26356#bib.bib7)\) and a dot\-product retrieval interface \(Karpukhin et al\., [2020](https://arxiv.org/html/2605.26356#bib.bib9)\) can be written as $y=W_1x_1+W_2x_2$, where $x_1$ denotes the query\-side feature and $x_2$ denotes the retrieval\-derived feature\. For the projection\-based interface, we set $x_1=x_q$, $x_2=D$, and $W_2\triangleq W_1W_d$, where $W_d$ projects document embeddings into the prediction space\. For the dot\-product interface, we set $x_1=x_2=x_q$ and $W_2=W_z\left(\sum_i d_id_i^{\top}\right)M^{\top}$, where $M$ parameterizes query\-document similarity\. For tractability, we use the shared\-encoder simplification $M=W_e^{\top}W_e$, so $M$ is symmetric\. The general DPR formulation \(Karpukhin et al\., [2020](https://arxiv.org/html/2605.26356#bib.bib9)\) allows separate query and document encoders, with $M=W_q^{\top}W_d$\. Full derivations are provided in Appendix [B](https://arxiv.org/html/2605.26356#A2)\.
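One plausible way to see why the dot\-product interface fits the same template is sketched below. It assumes unnormalized similarity scores $s_i=d_i^{\top}M^{\top}x_q$ and a linear readout $W_z$ applied to the score\-weighted document sum $z$; both are assumptions made here for illustration, and the authoritative derivation is the one in Appendix B.

```latex
z \;=\; \sum_i d_i\, s_i
  \;=\; \sum_i d_i \left(d_i^{\top} M^{\top} x_q\right)
  \;=\; \Big(\sum_i d_i d_i^{\top}\Big) M^{\top} x_q,
\qquad
W_z z \;=\; \underbrace{W_z \Big(\sum_i d_i d_i^{\top}\Big) M^{\top}}_{W_2}\, x_q .
```

Under these assumptions the retrieved documents enter only through the matrix $\sum_i d_id_i^{\top}$, so the retrieval term is linear in $x_q$ and can be absorbed into $W_2$ with $x_2=x_q$.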

### 3\.3 Optimization Objective

Given training examples $\{(x_1^i,x_2^i,y_i)\}_{i=1}^{N}$, we consider the squared loss

$$L(W_1,W_2)=\frac{1}{2N}\sum_{i=1}^{N}\left\|W_1x_1^i+W_2x_2^i-y_i\right\|^2.\tag{3}$$

One gradient\-descent step with learning rate $\eta$ gives

$$\Delta W_k=-\eta\nabla_{W_k}L=-\frac{\eta}{N}\sum_{i=1}^{N}\left(W_1x_1^i+W_2x_2^i-y_i\right)(x_k^i)^{\top},\qquad k\in\{1,2\}.\tag{4}$$

For a query token with features $(x_1,x_2)$, the corresponding prediction shift is $\Delta y\triangleq\Delta W_1x_1+\Delta W_2x_2$\. Thus, $\Delta y$ is the change in prediction after updating $W_k$ to $W_k'=W_k+\Delta W_k$\.
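Substituting the gradient step of Eq\. 4 into the definition of $\Delta y$ makes the dependence on inner products explicit. The shorthand $r_i$ for the per\-example residual is introduced here only for illustration.

```latex
r_i \;\triangleq\; W_1 x_1^i + W_2 x_2^i - y_i,
\qquad
\Delta y
  \;=\; \Delta W_1 x_1 + \Delta W_2 x_2
  \;=\; -\frac{\eta}{N} \sum_{i=1}^{N} r_i
        \left[ (x_1^i)^{\top} x_1 + (x_2^i)^{\top} x_2 \right].
```

The prediction shift is therefore a sum of context residuals weighted by inner products between context features and query features, which has the same shape as a value vector weighted by a key\-query score.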

### 3\.4 Linear Self\-attention Reproduces one Gradient Step

###### Lemma 1 \(Linear self\-attention implements one RAG gradient step\)\.

Consider a 1\-head linear self\-attention layer, context tokens $e_i=(x_1^i,x_2^i,y^i)$ for $i=1,\ldots,N$, and a query token $e_j=(x_1^j,x_2^j,y^j)$\. Let $\Delta W_1$ and $\Delta W_2$ be the one\-step gradient\-descent updates in Eq\. [4](https://arxiv.org/html/2605.26356#S3.E4)\. There exist matrices $W_K,W_Q,W_V$ and an output projection $P$ such that one LSA update changes only the $y$\-coordinate of $e_j$:

$$e_j\leftarrow e_j+\left(0,\,0,\,\Delta W_1x_1^j+\Delta W_2x_2^j\right).\tag{5}$$

Equivalently,

$$PVK^{\top}q_j=\left(0,\,0,\,\Delta W_1x_1^j+\Delta W_2x_2^j\right).\tag{6}$$

Therefore, the LSA update exactly matches the prediction shift induced by one gradient\-descent step on the unified linearized RAG predictor\.

The construction is given in Appendix [A](https://arxiv.org/html/2605.26356#A1)\. Intuitively, the value projection encodes the residual $W_1x_1^i+W_2x_2^i-y^i$\. The key\-query interaction computes the inner products $(x_1^i)^{\top}x_1^j$ and $(x_2^i)^{\top}x_2^j$\. The output projection then writes the resulting weighted residual sum into the query token’s prediction coordinate\.
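The intuition above can be checked numerically on a toy problem. The sketch below is not the Appendix A construction; it is a minimal check, assuming scalar targets, that the value/key\-query/output\-projection reading yields the same number as an explicit gradient step. The function names and the toy data are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

def gd_prediction_shift(w1, w2, context, x1_q, x2_q, lr):
    """Delta y from one gradient step on the squared loss (Eqs. 3-4), scalar targets."""
    n = len(context)
    d_w1 = [0.0] * len(w1)
    d_w2 = [0.0] * len(w2)
    for x1, x2, y in context:
        r = dot(w1, x1) + dot(w2, x2) - y  # residual of the current predictor
        d_w1 = [g - lr / n * r * v for g, v in zip(d_w1, x1)]
        d_w2 = [g - lr / n * r * v for g, v in zip(d_w2, x2)]
    return dot(d_w1, x1_q) + dot(d_w2, x2_q)

def lsa_prediction_shift(w1, w2, context, x1_q, x2_q, lr):
    """The same quantity, written as output-scale * sum_i value_i * (key_i . query)."""
    n = len(context)
    shift = 0.0
    for x1, x2, y in context:
        value = dot(w1, x1) + dot(w2, x2) - y   # value projection: the residual
        score = dot(x1 + x2, x1_q + x2_q)       # key-query product over (x1, x2)
        shift += value * score
    return -lr / n * shift                      # output projection: scale by -lr/N

# Hypothetical toy data: two query-side features, one retrieval-derived feature.
w1, w2 = [0.5, -1.0], [2.0]
context = [([1.0, 0.0], [1.0], 1.0),
           ([0.0, 1.0], [-1.0], 0.5),
           ([1.0, 1.0], [0.5], -2.0)]
gd_shift = gd_prediction_shift(w1, w2, context, [0.5, 0.5], [2.0], lr=0.1)
lsa_shift = lsa_prediction_shift(w1, w2, context, [0.5, 0.5], [2.0], lr=0.1)
```

Here `x1 + x2` is list concatenation, so the key\-query score is the sum of the two inner products named in the paragraph above.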

This construction also gives a controlled multi\-step analogue\. If each LSA layer represents one gradient\-like update, then after $K$ layers,

$$\hat{y}^{(K)}_{N+1}=\hat{y}^{(0)}_{N+1}+\sum_{t=0}^{K-1}\left(\Delta W^{(t)}_{1}x^{1}_{N+1}+\Delta W^{(t)}_{2}x^{2}_{N+1}\right),\tag{7}$$

where $\Delta W^{(t)}_{1}$ and $\Delta W^{(t)}_{2}$ are the implicit updates represented by layer $t$\. We use this multi\-step view as a linear\-regime guide rather than as a literal claim about frozen LLM computation\. In later sections, this view motivates a forward\-only mechanism that adapts how a frozen generator uses retrieved evidence\.

## 4 Testing the Boundary of the Linear Correspondence

Lemma [1](https://arxiv.org/html/2605.26356#Thmlemma1) gives an exact correspondence in a controlled linear setting\. We now ask how far this correspondence remains predictive when the setting is varied\. The experiments have two goals: first, to verify that a trained linear self\-attention layer can reproduce the constructed gradient\-descent predictor; second, to identify where the correspondence begins to break beyond the exact regime\.

### 4\.1 Linear\-Regime Verification

Each token concatenates an input feature, a retrieval\-derived feature, and a target, $e_i=(x_i,z_i,y_i)$ for $i=1,\ldots,N$\. The auxiliary slot $z_i$ instantiates the unified RAG view\. For the projection\-based interface, $z_i$ is a document\-derived feature\. For the dot\-product interface, $z_i=x_i$, and document information is injected into the keys and values\. We train an LSA layer $\theta$ to minimize expected squared error across tasks, using minibatch SGD over freshly sampled tasks\. Following prior work \(Garg et al\., [2022](https://arxiv.org/html/2605.26356#bib.bib10); von Oswald et al\., [2023](https://arxiv.org/html/2605.26356#bib.bib6)\), each task is generated from a teacher with weights $W_{\tau}\sim\mathcal{N}(0,I)$\. Inputs are sampled as $x_{\tau,i}\sim\mathcal{U}(-1,1)^{n_I}$, and targets are generated by $y_{\tau,i}=W_{\tau}^{1}x_{\tau,i}^{1}+W_{\tau}^{2}x_{\tau,i}^{2}$\. We set $N=n_I=10$ and sweep the document count $k\in\{2,5,10,25\}$\. We compare the trained layer $\theta^{\ast}$ with the constructed predictor that exactly realizes one gradient\-descent step on the unified RAG loss\. On $T_{\mathrm{val}}=10^{4}$ held\-out tasks, we report the prediction difference $\|\hat{y}_{\theta^{\ast}}-\hat{y}_{\theta,\mathrm{rag}}\|_2$, the cosine similarity between input sensitivities $\partial\hat{y}/\partial x_{\mathrm{test}}$, and the corresponding sensitivity $\ell_2$ difference\. Details are in Appendix [C](https://arxiv.org/html/2605.26356#A3)\.
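The data\-generating process and the two comparison metrics can be sketched in a few lines. This is a rough illustration rather than the paper's experimental code: it assumes scalar targets, draws the second feature from the same uniform distribution as the first (the paragraph above does not state how it is sampled), and uses hypothetical function names.

```python
import math
import random

def sample_task(n_context, dim, rng):
    """Sample one task: teacher weights ~ N(0, 1) per entry, inputs ~ U(-1, 1)."""
    w1 = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    w2 = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    examples = []
    for _ in range(n_context + 1):  # the extra example serves as the query
        x1 = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        x2 = [rng.uniform(-1.0, 1.0) for _ in range(dim)]  # assumed distribution
        y = sum(a * b for a, b in zip(w1, x1)) + sum(a * b for a, b in zip(w2, x2))
        examples.append((x1, x2, y))
    return examples[:-1], examples[-1]

def l2_difference(u, v):
    """Euclidean distance, used for prediction and sensitivity differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    """Cosine between two sensitivity vectors (assumes neither is all zeros)."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)

# N = n_I = 10 as in the setup above; the seed is arbitrary.
context, query = sample_task(n_context=10, dim=10, rng=random.Random(0))
```

In the comparison described above, the two vectors passed to `cosine_similarity` would be the input gradients of the trained layer and of the constructed predictor at the same test point.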

![Refer to caption](https://arxiv.org/html/2605.26356v1/x1.png)

![Refer to caption](https://arxiv.org/html/2605.26356v1/x2.png)

![Refer to caption](https://arxiv.org/html/2605.26356v1/x3.png)

![Refer to caption](https://arxiv.org/html/2605.26356v1/x4.png)

Figure 1: Single\-layer LSA reproduces one gradient\-descent step on the unified linearized RAG loss\. The two left panels report the projection\-based interface, and the two right panels report the dot\-product interface\. Across both variants, the trained LSA layer and the constructed gradient\-descent predictor are nearly indistinguishable on held\-out tasks\.

Figure [1](https://arxiv.org/html/2605.26356#S4.F1) verifies the construction for both retrieval interfaces\. The trained LSA layer closely matches the constructed predictor: the loss difference is small, the sensitivity cosine is close to $1$, and the sensitivity $\ell_2$ difference is negligible\. This numerically confirms the algebraic correspondence in Lemma [1](https://arxiv.org/html/2605.26356#Thmlemma1)\.

### 4\.2 Controlled Stress Tests

We next test whether the agreement persists under controlled changes within the linear regime\. We vary the document count, shift the test\-input distribution, and stack LSA layers with shared parameters\. When sweeping the document count $n\in\{2,5,10,25\}$ and shifting the test\-input range to $\alpha\in\{0.5,1,1.5,2\}$ while keeping training fixed at $\alpha=1$, the loss difference between the trained Transformer and the gradient predictor remains small \(Figure [5](https://arxiv.org/html/2605.26356#A3.F5), Appendix [C](https://arxiv.org/html/2605.26356#A3)\)\. Stacking LSA layers further supports the multi\-step picture in Eq\. [7](https://arxiv.org/html/2605.26356#S3.E7)\. At depths $2$ and $5$, the loss and prediction differences remain small across document counts\. The residual gap at $\mathrm{Docs}=25$ also shrinks as depth increases \(Figure [6](https://arxiv.org/html/2605.26356#A3.F6), Appendix [C](https://arxiv.org/html/2605.26356#A3)\)\. The projection\-based interface shows similar behavior \(Appendix [E](https://arxiv.org/html/2605.26356#A5)\)\. These results suggest that the linear correspondence is not a fragile single\-step artifact, but remains stable under controlled linear extensions\.

### 4\.3 Nonlinear Stress Test

We then examine where the correspondence begins to break\. We add MLP layers after the input embedding and evaluate on four real\-world regression datasets: California Housing, Bike Sharing, Wine Quality, and Predict Calorie Expenditure\. We focus on the dot\-product interface throughout\. Dataset details are in Appendix [F](https://arxiv.org/html/2605.26356#A6.SS0.SSS0.Px2)\. This experiment is diagnostic\. We do not claim that normalization solves RAG adaptation\. Instead, we use normalization to control the feature geometry that interacts with dot\-product retrieval\. We compare Z\-score \(Bishop, [2006](https://arxiv.org/html/2605.26356#bib.bib25)\), Min–Max \(Bishop, [2006](https://arxiv.org/html/2605.26356#bib.bib25)\), rank\-based normalization \(Conover, [1999](https://arxiv.org/html/2605.26356#bib.bib26)\), and Tanh normalization\. The training set is used as the retrieval corpus and is normalized with Z\-score throughout\. Only the input\-side normalization is varied, and alignment is measured using the same metrics as in Section [4\.1](https://arxiv.org/html/2605.26356#S4.SS1)\.
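For reference, the four input\-side normalizations can be written per feature as below. This is a sketch under assumptions: the exact variants used in the experiments are not specified in this section, so the rank scaling to $[0,1]$ and the tanh\-of\-z\-score form are common choices picked here for illustration, and every function assumes a non\-constant feature with at least two values.

```python
import math

def z_score(values):
    """Center to zero mean and scale to unit (population) standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

def min_max(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def rank_normalize(values):
    """Replace each value by its rank scaled into [0, 1]; ties broken by position."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for rank, idx in enumerate(order):
        ranks[idx] = rank / (len(values) - 1)
    return ranks

def tanh_normalize(values):
    """Squash z-scores with tanh (an assumed variant, bounded in (-1, 1))."""
    return [math.tanh(z) for z in z_score(values)]

# A toy heavy-tailed feature: four small values and one outlier.
feature = [1.0, 2.0, 3.0, 4.0, 100.0]
scaled = min_max(feature)
ranked = rank_normalize(feature)
```

On the toy feature, `min_max` maps the four small values into roughly the bottom 3% of the range because the single outlier sets the scale, whereas `rank_normalize` spaces them evenly. This is one simple way to see how the choice of normalization changes the geometry that dot products operate on.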

![Refer to caption](https://arxiv.org/html/2605.26356v1/x5.png)

![Refer to caption](https://arxiv.org/html/2605.26356v1/x6.png)

![Refer to caption](https://arxiv.org/html/2605.26356v1/x7.png)

![Refer to caption](https://arxiv.org/html/2605.26356v1/x8.png)

![Refer to caption](https://arxiv.org/html/2605.26356v1/x9.png)

![Refer to caption](https://arxiv.org/html/2605.26356v1/x10.png)

![Refer to caption](https://arxiv.org/html/2605.26356v1/x11.png)

![Refer to caption](https://arxiv.org/html/2605.26356v1/x12.png)

Figure 2: Effect of input normalization on the alignment between the trained nonlinear Transformer and the gradient\-descent predictor under the dot\-product interface\. Top row: Bike Sharing\. Bottom row: California Housing\. Columns report loss difference, sensitivity cosine, model difference, and prediction difference\. Min–Max normalization closely matches the gradient\-descent predictor on Bike Sharing, where features are bounded and roughly uniform\. On California Housing, where features are skewed and heavy\-tailed, the alignment degrades\.

Figure [2](https://arxiv.org/html/2605.26356#S4.F2) shows two representative cases\. On Bike Sharing, Min–Max normalization gives the closest agreement between the trained nonlinear Transformer and the gradient\-descent predictor, likely because the features are bounded and not dominated by outliers\. On California Housing, the alignment is weaker: skewed and heavy\-tailed features make dot\-product geometry more sensitive to outliers\. The sensitivity cosine drops, the model difference grows, and prediction differences become less stable\. The same pattern appears on the remaining datasets\. Predict Calorie Expenditure behaves similarly to Bike Sharing, while Wine Quality behaves more like California Housing \(Figure [7](https://arxiv.org/html/2605.26356#A4.F7), Appendix [D](https://arxiv.org/html/2605.26356#A4)\)\. Overall, the linear optimization view remains informative when feature geometry is stable, but becomes less predictive when retrieval\-derived dot products are dominated by skewed or heavy\-tailed features\. This empirical boundary supports our use of the linear construction as a guide for adaptation, rather than as a literal model of LLM computation\. Next, we use this view to design a forward\-only update to the generator\-side evidence\-use interface in LLM\-scale RAG\.

## 5 LLM\-Scale RAG: Amortizing the Gradient\-Descent Update

We now instantiate this view as RAG\-GD, a forward\-only adaptation method for frozen billion\-parameter RAG LLMs\. In this setting, we do not assume an exact equivalence between an LLM forward pass and gradient descent\. Instead, we use gradient descent as an operational target for adapting how the generator uses retrieval\-conditioned information\. Given a few RAG\-formatted demonstrations, we use autograd during training to compute the update that gradient descent would make to the generator\-side retrieval interface\. We then train a lightweight predictor to approximate this update\. At inference time, the predictor produces the update in a single forward pass, without further training the RAG system or backpropagating through the frozen LLM\.

Concretely, we first train a base retrieval adapter $W_0^{\mathrm{ret}}$ on RAG\-formatted examples from the NQ training split\. It is a low\-rank LoRA perturbation to the Q/K/V projections of every attention layer in a frozen LLM\. All inputs are RAG\-formatted instances $(x,\mathcal{D}_x,y)$, where $y$ is the gold answer\. In this work, we use a fixed external retriever, either BM25 or E5, to select the retrieved documents $\mathcal{D}_x$ for each question $x$ from a fixed corpus\. This adapter serves as a generator\-side retrieval interface: it does not select documents, but modulates how the generator uses evidence selected by BM25 or E5\. The predictor $g_{\phi}$ is then meta\-trained on few\-shot support contexts $C=\{(x_i,\mathcal{D}_i,y_i)\}_{i=1}^{N}$ from NQ, TriviaQA, HotpotQA, 2WikiMultiHopQA, and MuSiQue\. PopQA and Bamboogle are held out from both stages and used only for evaluation\.

### 5\.1 Supervision Target: Autograd\-Defined Interface Update

The supervision target is the update thatKKSGD steps would produce on the generator\-side retrieval interface using a support contextCC\. Starting fromWret\(0\)=W0retW\_\{\\mathrm\{ret\}\}^\{\(0\)\}=W\_\{0\}^\{\\mathrm\{ret\}\}, we run

Δ​WGD\(K\)​\(C\)=Wret\(K\)−W0ret,Wret\(t\+1\)=Wret\(t\)−η​∇ℒ​\(Wret\(t\);C\),\\Delta W\_\{\\mathrm\{GD\}\}^\{\(K\)\}\(C\)=W\_\{\\mathrm\{ret\}\}^\{\(K\)\}\-W\_\{0\}^\{\\mathrm\{ret\}\},\\qquad W\_\{\\mathrm\{ret\}\}^\{\(t\+1\)\}=W\_\{\\mathrm\{ret\}\}^\{\(t\)\}\-\\eta\\nabla\\mathcal\{L\}\(W\_\{\\mathrm\{ret\}\}^\{\(t\)\};C\),\(8\)fort=0,…,K−1t=0,\\ldots,K\-1, whereℒ\\mathcal\{L\}is the answer\-token cross\-entropy conditioned on the question and retrieved documents\. We computeΔ​WGD\(K\)​\(C\)\\Delta W\_\{\\mathrm\{GD\}\}^\{\(K\)\}\(C\)with autograd and detach it from the predictor optimizer\. Equation[8](https://arxiv.org/html/2605.26356#S5.E8)is not a theorem\-preserving lift of Lemma[1](https://arxiv.org/html/2605.26356#Thmlemma1)\. The setting changes from squared regression with a linear predictor to cross\-entropy training of a deep LLM, and the adapted parameter becomes a low\-rank Q/K/V LoRA interface\. Its role is practical: it provides an optimization\-derived target for how RAG demonstrations should adjust generator\-side evidence use\.
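The recursion in Eq. 8 can be sketched on a toy problem. The snippet below is a dependency-free illustration only: `k_step_sgd_delta` and `toy_grad` are hypothetical names, the parameters are a flat list of floats instead of LoRA factors, and a quadratic loss stands in for the answer-token cross-entropy that the paper differentiates with autograd.

```python
from typing import Callable, List

Vector = List[float]


def k_step_sgd_delta(
    w0: Vector,
    grad_fn: Callable[[Vector], Vector],
    eta: float,
    num_steps: int,
) -> Vector:
    """Return W^(K) - W^(0) after `num_steps` plain gradient steps (Eq. 8)."""
    w = list(w0)  # copy so the base parameters stay untouched
    for _ in range(num_steps):
        grad = grad_fn(w)
        w = [wi - eta * gi for wi, gi in zip(w, grad)]
    return [wk - wi for wk, wi in zip(w, w0)]


def toy_grad(w: Vector) -> Vector:
    """Gradient of the stand-in loss L(w) = 0.5 * sum((w - target)^2)."""
    target = [3.0, -1.0]  # assumed toy optimum, not from the paper
    return [wi - ti for wi, ti in zip(w, target)]


# Two steps with eta = 0.5 move halfway to the target, then halfway again.
delta = k_step_sgd_delta([0.0, 0.0], toy_grad, eta=0.5, num_steps=2)
print(delta)
```

In the setting described above, the returned difference would play the role of $\Delta W_{\mathrm{GD}}^{(K)}(C)$ and would be treated as a constant target when training the predictor.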

### 5.2 Predictor Architecture and Matching Objective

The predictor $g_\phi$ is a context encoder with per-layer, per-projection update heads. Each demonstration $(x_i, \mathcal{D}_i, y_i) \in C$ is formatted by concatenating the question, retrieved documents, and gold answer. The frozen LLM with the base adapter encodes each sequence, and we use the EOS hidden state $h_i \in \mathbb{R}^{d_h}$. We aggregate demonstrations by mean pooling, $\bar{h}(C) = \frac{1}{N} \sum_{i=1}^{N} h_i$. For each layer $\ell$ and projection type $\pi \in \{Q, K, V\}$, the update head outputs $\widetilde{\Delta W}_{\ell,\pi} = U_{\ell,\pi} V_{\ell,\pi}^{\top} \in \mathbb{R}^{d \times d}$, where $U_{\ell,\pi}, V_{\ell,\pi} \in \mathbb{R}^{d \times r}$ are generated by a two-layer MLP from $\bar{h}(C)$. The rank $r$ matches the base adapter, so $g_\phi(C)$ has the same shape as $W_0^{\mathrm{ret}}$. We train $g_\phi$ to match the autograd-defined target per layer and projection type. Let $\Delta W_{\ell,\pi}^{\star} \triangleq [\Delta W_{\mathrm{GD}}^{(K)}(C)]_{\ell,\pi}$. The matching loss is

$$\mathcal{L}_{\mathrm{match}}(\phi; C) = \sum_{\ell,\pi} \left[ 1 - \frac{\langle \widetilde{\Delta W}_{\ell,\pi}, \Delta W_{\ell,\pi}^{\star} \rangle}{\|\widetilde{\Delta W}_{\ell,\pi}\|_F \, \|\Delta W_{\ell,\pi}^{\star}\|_F} + \lambda \left| \log \frac{\|\widetilde{\Delta W}_{\ell,\pi}\|_F}{\|\Delta W_{\ell,\pi}^{\star}\|_F} \right| \right], \tag{9}$$

where the cosine term matches direction and the log-magnitude term matches scale. We use $\lambda = 0.1$ throughout. At deployment, the predictor emits $\widetilde{\Delta W}(C)$ in one forward pass, and the frozen LLM answers with the adapted interface $W_0^{\mathrm{ret}} + \widetilde{\Delta W}(C)$.
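As a concrete reading of Eq. 9, the following sketch evaluates the matching loss on small matrices stored as nested lists. It is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`low_rank_update`, `match_term`, `match_loss`) are hypothetical, no autograd or MLP head is involved, and both Frobenius norms are assumed to be nonzero.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def low_rank_update(U: Matrix, V: Matrix) -> Matrix:
    """Form U V^T for U, V of shape (d, r), giving a (d, d) update."""
    return [[sum(u * v for u, v in zip(u_row, v_row)) for v_row in V] for u_row in U]


def frob_inner(A: Matrix, B: Matrix) -> float:
    """Frobenius inner product <A, B>."""
    return sum(a * b for row_a, row_b in zip(A, B) for a, b in zip(row_a, row_b))


def frob_norm(A: Matrix) -> float:
    return math.sqrt(frob_inner(A, A))


def match_term(pred: Matrix, target: Matrix, lam: float = 0.1) -> float:
    """One (layer, projection) summand of Eq. 9; assumes nonzero norms."""
    n_pred, n_tgt = frob_norm(pred), frob_norm(target)
    cosine = frob_inner(pred, target) / (n_pred * n_tgt)
    return (1.0 - cosine) + lam * abs(math.log(n_pred / n_tgt))


def match_loss(preds: Sequence[Matrix], targets: Sequence[Matrix], lam: float = 0.1) -> float:
    """Sum of match_term over all (layer, projection) pairs."""
    return sum(match_term(p, t, lam) for p, t in zip(preds, targets))


# Rank-1 prediction and a target with the same direction but e times the norm.
pred = low_rank_update([[1.0], [0.0]], [[1.0], [2.0]])
scaled = [[math.e * x for x in row] for row in pred]
same_direction_loss = match_term(pred, scaled)
```

With identical direction the cosine term vanishes, so `same_direction_loss` reduces to the scale penalty $\lambda \lvert \log(1/e) \rvert = \lambda$. This separates direction errors from magnitude errors, which is the role the text assigns to the two terms.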

### 5.3 Benchmarks, Baselines, and Metrics

#### Benchmarks

We evaluate on seven open-domain QA benchmarks: NQ, TriviaQA, PopQA, HotpotQA, 2WikiMultiHopQA, MuSiQue, and Bamboogle. NQ, TriviaQA, and PopQA are single-hop, while the remaining four are multi-hop. We use Qwen 2.5-7B-Instruct (Qwen et al., [2024](https://arxiv.org/html/2605.26356#bib.bib44)) and Llama 3.1-8B-Instruct (Grattafiori et al., [2024](https://arxiv.org/html/2605.26356#bib.bib45)) as frozen backbones. Each query is augmented with the top five documents from BM25 or E5-large (Wang et al., [2024b](https://arxiv.org/html/2605.26356#bib.bib46)), using the same retrieval cache for all methods. The support size is $N = 3$, and we train predictors with $K \in \{1, 5, 10\}$.

#### Baselines and Metrics

We compare RAG-GD with six baselines. **Query Only** uses no retrieved documents. **Vanilla RAG** prepends retrieved documents to the prompt. **Base adapter** applies $W_0^{\mathrm{ret}}$ without context-conditioned perturbation. **+ few shot** variants concatenate support demonstrations into Vanilla RAG or Base adapter prompts. **Prompt tuning** (Lester et al., [2021](https://arxiv.org/html/2605.26356#bib.bib30)) learns a soft prefix using the same supervision pool as $W_0^{\mathrm{ret}}$. **HyperTuning** (Phang et al., [2023](https://arxiv.org/html/2605.26356#bib.bib29)) uses the same predictor architecture as RAG-GD, but trains through downstream task loss rather than the autograd-defined target. **TT-SGD** performs $K$ SGD steps on $C$ at test time and serves as the non-amortized reference. RAG-GD shares $W_0^{\mathrm{ret}}$ with Base adapter and adds only $\widetilde{\Delta W}(C)$. Table [1](https://arxiv.org/html/2605.26356#S5.T1) reports the headline comparison against no-perturbation baselines. Full results for Prompt tuning, HyperTuning, TT-SGD, and + few shot variants are in Appendix [I](https://arxiv.org/html/2605.26356#A9). We report SQuAD-style (Rajpurkar et al., [2016](https://arxiv.org/html/2605.26356#bib.bib47)) exact match (EM) and token-overlap F1.
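For readers unfamiliar with the two metrics, the sketch below follows the commonly used SQuAD-style normalization (lowercasing, stripping punctuation and English articles, collapsing whitespace). It is an illustrative reimplementation, not the evaluation script used in the paper; the function names are hypothetical, and taking the maximum over several gold answers is an assumption about how multi-answer benchmarks are scored.

```python
import re
import string
from collections import Counter
from typing import Callable, Iterable

_PUNCT = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def best_over_golds(
    metric: Callable[[str, str], float], prediction: str, golds: Iterable[str]
) -> float:
    """Score against every gold answer and keep the best (assumed convention)."""
    return max(metric(prediction, gold) for gold in golds)


em = exact_match("The Eiffel Tower!", "eiffel tower")
f1 = token_f1("Paris, France", "Paris")
```

Here `em` is 1.0 because the article and punctuation are removed before comparison, while `f1` is 2/3: the prediction has precision 1/2 and recall 1 against the single gold token.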

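Table 1: Main QA results across seven benchmarks, two frozen LLM backbones, and two retrievers. RAG-GD ($K{=}5$) applies a context-conditioned update on top of the same static retrieval adapter $W_0^{\mathrm{ret}}$ used by Base adapter. Bold values mark the best result per column within each backbone block. NQ, TriviaQA, and PopQA are single-hop QA; HotpotQA, 2Wiki, MuSiQue, and Bamboogle are multi-hop QA. Full context-conditioned baseline results are in Appendix [I](https://arxiv.org/html/2605.26356#A9).

| Backbone | Method | Retriever | NQ EM | NQ F1 | TriviaQA EM | TriviaQA F1 | PopQA EM | PopQA F1 | HotpotQA EM | HotpotQA F1 | 2Wiki EM | 2Wiki F1 | MuSiQue EM | MuSiQue F1 | Bamboogle EM | Bamboogle F1 | Avg. EM | Avg. F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen-2.5-7B | Query Only | – | 15.95 | 24.28 | 43.33 | 49.51 | 16.02 | 19.76 | 18.40 | 25.39 | 23.91 | 28.12 | 3.80 | 10.57 | 11.20 | 18.02 | 18.94 | 25.09 |
| Qwen-2.5-7B | Vanilla RAG | BM25 | 27.61 | 36.66 | 58.24 | 65.77 | 28.84 | 33.18 | 31.28 | 41.25 | 27.87 | 33.24 | 5.87 | 13.05 | 10.40 | 21.16 | 27.16 | 34.90 |
| Qwen-2.5-7B | Vanilla RAG | E5 | 39.16 | 50.03 | 62.99 | 70.80 | 44.03 | 50.21 | 32.45 | 42.21 | 25.48 | 31.43 | 5.79 | 12.77 | 18.40 | 26.74 | 32.61 | 40.60 |
| Qwen-2.5-7B | Base adapter | BM25 | 32.57 | 41.45 | 60.11 | 67.93 | 31.55 | 35.63 | 32.31 | 43.45 | 28.22 | 34.13 | 6.41 | 15.46 | 16.80 | 26.01 | 29.71 | 37.72 |
| Qwen-2.5-7B | Base adapter | E5 | 41.77 | 51.22 | 63.31 | 71.62 | 47.05 | 51.82 | 33.78 | 44.64 | 27.89 | 33.97 | 6.95 | 15.76 | 18.40 | 28.78 | 34.16 | 42.54 |
| Qwen-2.5-7B | RAG-GD ($K{=}5$) | BM25 | 34.46 | 43.54 | 63.27 | 70.69 | 33.22 | 37.69 | **35.54** | **47.14** | 28.86 | 34.48 | **9.26** | **19.73** | 22.40 | 32.85 | 32.43 | 40.87 |
| Qwen-2.5-7B | RAG-GD ($K{=}5$) | E5 | **42.91** | **52.71** | **65.98** | **73.60** | **48.12** | **52.61** | **35.54** | 47.00 | **29.67** | **35.47** | 9.14 | 19.26 | **25.60** | **35.12** | **36.71** | **45.11** |
| Llama-3.1-8B | Query Only | – | 22.46 | 32.51 | 52.67 | 59.79 | 20.63 | 25.09 | 18.31 | 25.71 | 26.39 | 31.04 | 3.81 | 9.51 | 6.40 | 12.88 | 21.52 | 28.08 |
| Llama-3.1-8B | Vanilla RAG | BM25 | 31.41 | 40.95 | 60.43 | 68.35 | 31.08 | 35.43 | 31.92 | 42.46 | 26.07 | 31.82 | 5.75 | 12.44 | 14.40 | 22.93 | 28.72 | 36.34 |
| Llama-3.1-8B | Vanilla RAG | E5 | 40.72 | 52.23 | 64.42 | 72.58 | 45.85 | 51.48 | 32.78 | 43.19 | 23.44 | 29.46 | 6.04 | 12.25 | 24.80 | 32.12 | 34.01 | 41.90 |
| Llama-3.1-8B | Base adapter | BM25 | 38.47 | 49.42 | 62.66 | 72.46 | 37.80 | 42.29 | 37.35 | 50.41 | 33.66 | 39.55 | 11.46 | 21.98 | 29.60 | 41.29 | 35.86 | 45.34 |
| Llama-3.1-8B | Base adapter | E5 | 43.46 | 54.60 | 63.90 | 74.05 | 51.69 | 55.74 | 37.31 | 49.90 | 33.45 | 39.45 | 11.83 | 22.07 | 30.40 | 40.48 | 38.86 | 48.04 |
| Llama-3.1-8B | RAG-GD ($K{=}5$) | BM25 | 40.22 | 50.01 | 66.13 | 74.28 | 37.81 | 42.19 | 38.99 | **51.14** | **34.15** | **40.04** | 12.54 | 22.61 | 28.00 | 39.61 | 36.83 | 45.70 |
| Llama-3.1-8B | RAG-GD ($K{=}5$) | E5 | **45.68** | **55.84** | **67.66** | **76.07** | **52.31** | **56.48** | **39.01** | 50.94 | 33.94 | 39.83 | **13.20** | **23.47** | **32.00** | **42.08** | **40.54** | **49.24** |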

### 5.4 Results

Table [1](https://arxiv.org/html/2605.26356#S5.T1) reports the headline comparison between RAG-GD ($K{=}5$) and the no-perturbation baselines. Figures [3](https://arxiv.org/html/2605.26356#S5.F3) and [4](https://arxiv.org/html/2605.26356#S5.F4) compare against additional context-conditioned methods, including Prompt tuning, HyperTuning, TT-SGD, and + few shot variants. Full per-method and per-benchmark results are in Appendix [I](https://arxiv.org/html/2605.26356#A9), and Algorithm [1](https://arxiv.org/html/2605.26356#algorithm1) gives the deployment procedure.

#### The predicted update improves the base retrieval adapter\.

Across all backbone and retriever configurations, RAG-GD improves average EM and F1 over Base adapter. This comparison is controlled: both methods share the external retriever, retrieval cache, frozen backbone, and base adapter $W_0^{\mathrm{ret}}$. The only difference is the predicted perturbation $\widetilde{\Delta W}(C)$, which isolates the effect of adapting the generator-side retrieval interface from the support context.

#### The learned update transfers to held\-out tasks\.

PopQA and Bamboogle are held out from training for both $W_0^{\mathrm{ret}}$ and $g_\phi$. On PopQA, RAG-GD improves EM over Base adapter in every backbone and retriever setting, with F1 close or improved in most cases. On Bamboogle, gains are strongest on Qwen, while Llama with BM25 falls slightly below Base adapter (28.00 vs. 29.60 EM in Table [1](https://arxiv.org/html/2605.26356#S5.T1)). The transfer is consistent but not uniform, suggesting that $g_\phi$ learns a reusable update rule for generator-side evidence use rather than only fitting the meta-training tasks.

![Method-family comparison](https://arxiv.org/html/2605.26356v1/x13.png)
(a) Method-family comparison.

![Robustness to inner GD depth K](https://arxiv.org/html/2605.26356v1/x14.png)
(b) Robustness to inner GD depth $K$.

Figure 3: The gradient-derived update improves context-conditioned adaptation and is relatively insensitive to $K$. (a) Average EM across seven QA benchmarks on Qwen 2.5-7B with E5 retrieval. RAG-GD matches the TT-SGD reference using one predictor forward pass. (b) Average EM across $K \in \{1, 5, 10\}$ on both backbones and retrievers. Solid lines use E5, and dashed lines use BM25.

#### Gradient-update supervision matters.

Figure [3(a)](https://arxiv.org/html/2605.26356#S5.F3.sf1) compares RAG-GD with context-conditioned baselines on Qwen-2.5-7B with E5 retrieval. HyperTuning uses the same predictor architecture but trains through downstream task loss rather than matching the autograd-defined update. RAG-GD improves average EM and F1 over HyperTuning and Prompt tuning, indicating that the gradient-derived target contributes beyond the predictor architecture.

#### Performance is largely insensitive to $K$.

Figure [3(b)](https://arxiv.org/html/2605.26356#S5.F3.sf2) sweeps the inner-loop depth $K \in \{1, 5, 10\}$ across both backbones and retrievers. A single amortized step already recovers most of the gain, while additional steps bring only small and configuration-dependent changes. Thus, the $K{=}5$ setting in Table [1](https://arxiv.org/html/2605.26356#S5.T1) is representative. Full per-$K$ results are in Appendix [I](https://arxiv.org/html/2605.26356#A9).

#### Amortization approaches test\-time adaptation at lower cost\.

Figure [4](https://arxiv.org/html/2605.26356#S5.F4) shows the EM-cost tradeoff for Qwen-2.5-7B with E5 retrieval. TT-SGD performs inner-loop backpropagation through the 7B LLM at test time, while RAG-GD moves this computation into training and uses only one forward pass through $g_\phi$ at inference. As a result, RAG-GD reaches a similar average EM and F1 operating point at substantially lower per-query cost.

![EM-cost tradeoff](https://arxiv.org/html/2605.26356v1/x15.png)

Figure 4: EM-cost tradeoff on Qwen 2.5-7B with E5 retrieval. Per-query cost is shown on a log scale. The shaded region marks methods that run inner GD at test time.

### 5.5 Discussion

These results complete the theory-to-practice arc. The linear regime gives an exact gradient-descent correspondence, while the nonlinear experiments show where this correspondence becomes feature-dependent. At LLM scale, we do not claim that frozen RAG LLMs implement the linear equivalence internally. Instead, the autograd-defined update provides a practical target for context-conditioned adaptation. Because RAG-GD and Base adapter share the same retriever, retrieval cache, backbone, and $W_0^{\mathrm{ret}}$, the gains isolate the contribution of the predicted update to the generator-side retrieval interface. The held-out transfer and cost-performance tradeoff suggest that gradient-supervised retrieval-interface adaptation is a promising forward-only alternative to test-time backpropagation for RAG.

## 6 Conclusion

We studied retrieval-augmented generation through an in-context optimization lens. In a controlled linear setting, we showed that one linear self-attention layer can implement one gradient-descent step on a unified linearized RAG loss covering projection-based and dot-product retrieval interfaces. Empirical tests verified this construction in the exact regime and revealed a structured boundary under nonlinear architectures and real regression data, where alignment becomes sensitive to feature distribution. At LLM scale, we turned this view into a forward-only adaptation method by using the autograd-defined $K$-step update to a generator-side Q/K/V LoRA interface as supervision for a lightweight predictor. Across seven QA benchmarks, two retrievers, and two frozen backbones, the predicted update improved a shared-adapter baseline, transferred to held-out domains, and was largely insensitive to $K$. Overall, these results suggest that retrieved evidence can be treated not only as external context, but also as a signal for context-induced adaptation in RAG.

## 7 Limitations

Our linear construction is an analytical starting point rather than a literal account of modern RAG LLMs, and the nonlinear experiments show that the correspondence depends on architecture and feature distribution\. At LLM scale, we instantiate the view with a generator\-side Q/K/V LoRA interface while keeping the retriever and backbone fixed\. Open questions remain about the inner\-loop optimizer, update parameterization, predictor capacity, scaling to larger backbones, and robustness across broader retrieval settings\. Future work should characterize when context\-induced updates improve evidence use, when they should be suppressed, and how uncertainty\-aware gating can make such updates robust to noisy retrieval\.

## References

- K. Ahn, X. Cheng, H. Daneshmand, and S. Sra (2023) Transformers learn to implement preconditioned gradient descent for in-context learning. In NeurIPS.
- E. Akyürek, D. Schuurmans, J. Andreas, T. Ma, and D. Zhou (2023) What learning algorithm is in-context learning? investigations with linear models. In ICLR.
- A. Asai, Z. Wu, Y. Wang, A. Sil, and H. Hajishirzi (2024) Self-rag: learning to retrieve, generate, and critique through self-reflection. In ICLR.
- Y. Bai, F. Chen, H. Wang, C. Xiong, and S. Mei (2023) Transformers as statisticians: provable in-context learning with in-context algorithm selection. In NeurIPS.
- C.M. Bishop (2006) Pattern recognition and machine learning. Springer.
- S. Borgeaud, A. Mensch, J. Hoffmann, T. Cai, E. Rutherford, K. Millican, G. van den Driessche, J. Lespiau, et al. (2022) Improving language models by retrieving from trillions of tokens. In ICML.
- W. J. Conover (1999) Practical nonparametric statistics. Wiley.
- D. Dai, Y. Sun, L. Dong, Y. Hao, S. Ma, Z. Sui, and F. Wei (2023) Why can gpt learn in-context? language models implicitly perform gradient descent as meta-optimizers. In ACL.
- S. Garg, D. Tsipras, P. Liang, and G. Valiant (2022) What can transformers learn in-context? a case study of simple function classes. In NeurIPS.
- K. Gatmiry, N. Saunshi, S. J. Reddi, S. Jegelka, and S. Kumar (2024) Can looped transformers learn to implement multi-step gradient descent for in-context learning?. In ICML.
- A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, A. Yang, A. Fan, A. Goyal, A. Hartshorn, A. Yang, A. Mitra, A. Sravankumar, A. Korenev, A. Hinsvark, A. Rao, et al. (2024) The llama 3 herd of models. CoRR.
- K. Guu, K. Lee, Z. Tung, P. Pasupat, and M. Chang (2020) REALM: retrieval-augmented language model pre-training. In ICML.
- X. Ho, A. D. Nguyen, S. Sugawara, and A. Aizawa (2020) Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps. In COLING.
- E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen (2022) LoRA: low-rank adaptation of large language models. In ICLR.
- J. Huang, M. Li, Z. Yao, D. Li, Y. Zhang, Z. Yang, Y. Xiao, F. Ouyang, X. Li, S. Han, and H. Yu (2026) RiTeK: a dataset for large language models complex reasoning over textual knowledge graphs in medicine. In ACL.
- G. Izacard and E. Grave (2021) Leveraging passage retrieval with generative models for open domain question answering. In EACL.
- J. Jin, Y. Zhu, Z. Dou, G. Dong, X. Yang, C. Zhang, T. Zhao, Z. Yang, and J. Wen (2025) FlashRAG: a modular toolkit for efficient retrieval-augmented generation research. In WWW.
- M. Joshi, E. Choi, D. S. Weld, and L. Zettlemoyer (2017) TriviaQA: a large scale distantly supervised challenge dataset for reading comprehension. In ACL.
- V. Karpukhin, B. Oğuz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, and W. Yih (2020) Dense passage retrieval for open-domain question answering. In EMNLP.
- D. Kim, C. Kim, and S. Hong (2025) HyperFlow: gradient-free emulation of few-shot fine-tuning. CoRR.
- T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, K. Toutanova, L. Jones, M. Kelcey, M. Chang, A. M. Dai, J. Uszkoreit, Q. Le, and S. Petrov (2019) Natural questions: a benchmark for question answering research. In TACL.
- B. Lester, R. Al-Rfou, and N. Constant (2021) The power of scale for parameter-efficient prompt tuning. In EMNLP.
- P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih, T. Rocktäschel, S. Riedel, and D. Kiela (2020) Retrieval-augmented generation for knowledge-intensive nlp tasks. In NeurIPS.
- M. Li, Z. Zhan, H. Yang, Y. Xiao, J. Huang, and R. Zhang (2025) Benchmarking retrieval-augmented large language models in biomedical nlp: application, robustness, and self-awareness. Science Advances.
- A. Mahankali, T. B. Hashimoto, and T. Ma (2023) One step of gradient descent is provably the optimal in-context learner with one layer of linear self-attention. CoRR.
- A. Mallen, A. Asai, V. Zhong, R. Das, D. Khashabi, and H. Hajishirzi (2023) When not to trust language models: investigating effectiveness of parametric and non-parametric memories. In ACL.
- E. Mitchell, C. Lin, A. Bosselut, C. Finn, and C. D. Manning (2022) Fast model editing at scale. In ICLR.
- M. Mosbach, T. Pimentel, S. Ravfogel, D. Klakow, and Y. Elazar (2023) Few-shot fine-tuning vs. in-context learning: a fair comparison and evaluation. In ACL.
- J. Phang, Y. Mao, P. He, and W. Chen (2023) HyperTuning: toward adapting large language models without back-propagation. In ICML.
- O. Press, M. Zhang, S. Min, L. Schmidt, N. A. Smith, and M. Lewis (2023) Measuring and narrowing the compositionality gap in language models. In EMNLP.
- Qwen, A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, H. Lin, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Lu, K. Bao, K. Yang, L. Yu, et al. (2024) Qwen2.5 technical report. CoRR.
- P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP.
- O. Ram, Y. Levine, I. Dalmedigos, D. Muhlgay, A. Shashua, K. Leyton-Brown, and Y. Shoham (2023) In-context retrieval-augmented language models. In TACL.
- R. Ren and Y. Liu (2024) Towards understanding how transformers learn in-context through a representation learning lens. In NeurIPS.
- Z. Shen, A. Hsu, R. Lai, and W. Liao (2026) Understanding in-context learning on structured manifolds: bridging attention to kernel methods. In ICLR.
- Y. Sun, X. Wang, Z. Liu, J. Miller, A. A. Efros, and M. Hardt (2020) Test-time training with self-supervision for generalization under distribution shifts. In ICML.
- J. Tack, J. Kim, E. Mitchell, J. Shin, Y. W. Teh, and J. R. Schwarz (2024) Online adaptation of language models with a memory of amortized contexts. In NeurIPS.
- H. Trivedi, N. Balasubramanian, T. Khot, and A. Sabharwal (2022) MuSiQue: multihop questions via single-hop question composition. In TACL.
- L. van der Maaten and G. Hinton (2008) Visualizing data using t-sne. JMLR.
- M. Vladymyrov, J. von Oswald, M. Sandler, and R. Ge (2024) Linear transformers are versatile in-context learners. In ICML.
- J. von Oswald, E. Niklasson, E. Randazzo, J. Sacramento, A. Mordvintsev, A. Zhmoginov, and M. Vladymyrov (2023) Transformers learn in-context by gradient descent. In ICML.
- F. Wang, C. Lin, Y. Cao, and Y. Kang (2024a) Benchmarking general-purpose in-context learning. CoRR.
- L. Wang, N. Yang, X. Huang, B. Jiao, L. Yang, D. Jiang, R. Majumder, and F. Wei (2024b) Text embeddings by weakly-supervised contrastive pre-training. CoRR.
- Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. W. Cohen, R. Salakhutdinov, and C. D. Manning (2018) HotpotQA: a dataset for diverse, explainable multi-hop question answering. In EMNLP.
- Q. Zhang, S. Chen, Y. Bei, Z. Yuan, H. Zhou, Z. Hong, H. Chen, Y. Xiao, C. Zhou, J. Dong, Y. Chang, and X. Huang (2025a) A survey of graph retrieval-augmented generation for customized large language models. CoRR.
- Y. Zhang, A. K. Singh, P. E. Latham, and A. Saxe (2025b) Training dynamics of in-context learning in linear attention. In ICML.

## Broader Impacts

The method we propose lowers the cost of adapting a deployed large language model to a new task or domain\. At inference, the LLM still runs a forward pass to generate the answer, but no backward pass through the LLM is required: adapting to the context costs only a single small forward pass through a context\-conditional weight predictor\. If adopted at scale, this could reduce the energy footprint of test\-time adaptation pipelines for retrieval\-augmented systems\. The flip side is that lower adaptation cost could also accelerate the deployment of models in domains for which the underlying LLM has not been carefully evaluated, including settings where retrieval can amplify biases in the corpus\. We encourage practitioners adopting this style of adaptation to retain the same evaluation discipline that would apply to a fine\-tuned model\. All datasets used in our evaluation are publicly available QA and regression benchmarks that do not contain personally identifiable or sensitive information, and our work does not raise additional ethical concerns beyond those discussed above\.

## Reproducibility Statement

All datasets used in our main evaluation are publicly available QA benchmarks (NQ, TriviaQA, PopQA, HotpotQA, 2WikiMultiHopQA, MuSiQue, Bamboogle). The supplementary regression datasets discussed in the appendix are also public. Preprocessing steps, dataset splits, and the training pools used for the reference retrieval adapter $W_0^{\mathrm{ret}}$ and the predictor $g_\phi$ are documented in the appendix.

Our implementation uses PyTorch with the Hugging Face Transformers library on top of frozen Qwen 2.5-7B-Instruct (Qwen et al., [2024](https://arxiv.org/html/2605.26356#bib.bib44)) and Llama 3.1-8B-Instruct (Grattafiori et al., [2024](https://arxiv.org/html/2605.26356#bib.bib45)) backbones. Hyperparameters for the reference retrieval adapter $W_0^{\mathrm{ret}}$, the predictor $g_\phi$, and the inner-loop SGD target ($\eta$, $K$, number of demonstrations $N$) are listed in Appendix [G](https://arxiv.org/html/2605.26356#A7). All experiments were conducted on NVIDIA A100 GPUs.

## Appendix A Proof of Lemma [1](https://arxiv.org/html/2605.26356#Thmlemma1)

#### Statement.

Given a 1-head linear-attention layer and tokens $e_j = (x_1^j,\, x_2^j,\, y^j)$ for $j = 1, \ldots, N$, we construct key, query, and value matrices $W_K$, $W_Q$, $W_V$ and a projection $P$ such that one linear self-attention update on each $e_j$ matches one gradient-descent step on the unified RAG loss of Section [3](https://arxiv.org/html/2605.26356#S3). The update modifies only the $y$-coordinate of each token:

$$e_j \leftarrow e_j + \big(0,\; 0,\; \Delta W_1 x_1^j + \Delta W_2 x_2^j\big) \;=\; e_j + P\,V\,K^{\top} q_j, \tag{1}$$

where $\Delta W_1$ and $\Delta W_2$ are the gradient-step updates of Eq. [4](https://arxiv.org/html/2605.26356#S3.E4) in the main text.

#### Setup\.

We are given $N$ context tokens, each of the form $e^i = (x_1^i,\, x_2^i,\, y^i)$ corresponding to one training pair, plus a query token $e_{N+1} = (x_1^{N+1},\, x_2^{N+1},\, 0)$ at position $N{+}1$. The model is asked to predict the updated $y$-value at position $N{+}1$.

#### Step 1: expand the post\-update prediction\.

Writing the post-update prediction $y'$ as the original prediction plus the contributions of $\Delta W_1$ and $\Delta W_2$,

$$\begin{aligned}
y' &= W_1' x_1 + W_2' x_2 \\
&= (W_1 + \Delta W_1)\,x_1 + (W_2 + \Delta W_2)\,x_2 \\
&= W_1 x_1 + W_2 x_2 \;+\; \Delta W_1 x_1 + \Delta W_2 x_2.
\end{aligned} \tag{2}$$

#### Step 2: gradient step on the unified RAG loss\.

Under the squared loss $L(W_1, W_2) = \tfrac{1}{2N} \sum_{i=1}^{N} \|W_1 x_1^i + W_2 x_2^i - y_i\|^2$, one gradient step with learning rate $\eta$ yields

$$\Delta W_1 = -\eta\,\nabla_{W_1} L = -\frac{\eta}{N} \sum_{i=1}^{N} \big(W_1 x_1^i + W_2 x_2^i - y_i\big)\,(x_1^i)^{\top}, \tag{3}$$

$$\Delta W_2 = -\eta\,\nabla_{W_2} L = -\frac{\eta}{N} \sum_{i=1}^{N} \big(W_1 x_1^i + W_2 x_2^i - y_i\big)\,(x_2^i)^{\top}, \tag{4}$$

$$\Delta y = \Delta W_1 x_1 + \Delta W_2 x_2. \tag{5}$$

#### Step 3: rewrite $\Delta y$ as a sum of outer-product contractions.

Substituting Eqs. [3](https://arxiv.org/html/2605.26356#A1.E3)–[4](https://arxiv.org/html/2605.26356#A1.E4) into Eq. [5](https://arxiv.org/html/2605.26356#A1.E5) and evaluating at the query token $j$,

$$\Delta y \;=\; -\frac{\eta}{N} \sum_{i=1}^{N} \big(W_1 x_1^i + W_2 x_2^i - y_i\big)\,(x_1^i)^{\top} x_1^j \;-\; \frac{\eta}{N} \sum_{i=1}^{N} \big(W_1 x_1^i + W_2 x_2^i - y_i\big)\,(x_2^i)^{\top} x_2^j. \tag{6}$$

Equivalently, the update applied to the token at position $j$ is

$$\begin{pmatrix} x_1^j \\ x_2^j \\ y^j \end{pmatrix} \;\leftarrow\; \begin{pmatrix} x_1^j \\ x_2^j \\ y^j \end{pmatrix} \;+\; \begin{pmatrix} 0 \\ 0 \\ \Delta y \end{pmatrix}, \qquad \text{with} \qquad \begin{pmatrix} 0 \\ 0 \\ \Delta y \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ \Delta W_1 x_1 + \Delta W_2 x_2 \end{pmatrix}. \tag{7}$$

#### Step 4: cast the update as a linear self\-attention output\.

Using the identity $a\,b^{\top}c=(a\otimes b^{\top})\,c$ and grouping terms, Eq. [6](https://arxiv.org/html/2605.26356#A1.E6) can be written as

$$
\begin{pmatrix}0\\ 0\\ \Delta y\end{pmatrix}\;=\;-\frac{\eta}{N}\sum_{i=1}^{N}\underbrace{\begin{pmatrix}0\\ 0\\ W_1x_1^{i}+W_2x_2^{i}-y^{i}\end{pmatrix}}_{\text{value vector } v_i}\,\otimes\,\underbrace{\begin{pmatrix}x_1^{i}&x_2^{i}&0\end{pmatrix}}_{\text{key vector } k_i^{\top}}\;\underbrace{\begin{pmatrix}x_1^{j}\\ x_2^{j}\\ 0\end{pmatrix}}_{\text{query vector } q_j}.
\tag{8}
$$

Each factor in Eq. [8](https://arxiv.org/html/2605.26356#A1.E8) can be obtained by applying a fixed linear projection to the token $e^{i}=(x_1^{i},x_2^{i},y^{i})$ or $e^{j}$:

$$
v_i=\underbrace{\begin{pmatrix}0&0&0\\ 0&0&0\\ W_1&W_2&-I_y\end{pmatrix}}_{W_V}\,e^{i},\qquad k_i=\underbrace{\begin{pmatrix}I_x&0&0\\ 0&I_x&0\\ 0&0&0\end{pmatrix}}_{W_K}\,e^{i},\qquad q_j=\underbrace{\begin{pmatrix}I_x&0&0\\ 0&I_x&0\\ 0&0&0\end{pmatrix}}_{W_Q}\,e^{j}.
\tag{9}
$$

#### Step 5: explicit construction\.

Combining Eqs. [8](https://arxiv.org/html/2605.26356#A1.E8) and [9](https://arxiv.org/html/2605.26356#A1.E9), the gradient step of Eqs. [3](https://arxiv.org/html/2605.26356#A1.E3)–[4](https://arxiv.org/html/2605.26356#A1.E4) is realized as a linear self-attention update with the closed-form projections

$$
\begin{pmatrix}x_1^{j}\\ x_2^{j}\\ y^{j}\end{pmatrix}\leftarrow\begin{pmatrix}x_1^{j}\\ x_2^{j}\\ y^{j}\end{pmatrix}-\frac{\eta}{N}\sum_{i=1}^{N}\Bigg(\underbrace{\begin{pmatrix}0&0&0\\ 0&0&0\\ W_1&W_2&-I_y\end{pmatrix}}_{W_V}\begin{pmatrix}x_1^{i}\\ x_2^{i}\\ y^{i}\end{pmatrix}\Bigg)\otimes\Bigg(\underbrace{\begin{pmatrix}I_x&0&0\\ 0&I_x&0\\ 0&0&0\end{pmatrix}}_{W_K}\begin{pmatrix}x_1^{i}\\ x_2^{i}\\ y^{i}\end{pmatrix}\Bigg)^{\top}\Bigg(\underbrace{\begin{pmatrix}I_x&0&0\\ 0&I_x&0\\ 0&0&0\end{pmatrix}}_{W_Q}\begin{pmatrix}x_1^{j}\\ x_2^{j}\\ y^{j}\end{pmatrix}\Bigg)
\tag{10}
$$

The projection $P$ is taken to be the identity on the $y$-coordinate, so that the value contribution lands in the $y$-slot of $e^{j}$. Comparing the right-hand side of Eq. [10](https://arxiv.org/html/2605.26356#A1.E10) to the gradient step in Eqs. [3](https://arxiv.org/html/2605.26356#A1.E3)–[4](https://arxiv.org/html/2605.26356#A1.E4), the two sides agree term by term. One linear self-attention update therefore reproduces one gradient-descent step on the unified RAG loss, as claimed. ∎
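
The construction can be checked numerically. The following is a minimal NumPy sketch, not the authors' code: it builds $W_V$, $W_K$, $W_Q$ as in Eq. (9), applies the linear self-attention update of Eq. (10) to a query token, and compares the result with the explicit gradient step of Eqs. (3)–(5). The dimensions, the random seed, and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, dy, N, eta = 4, 3, 2, 8, 0.1   # toy sizes (assumed, not from the paper)

W1 = rng.normal(size=(dy, d1))
W2 = rng.normal(size=(dy, d2))
X1 = rng.normal(size=(N, d1))           # rows are x_1^i
X2 = rng.normal(size=(N, d2))           # rows are x_2^i
Y = rng.normal(size=(N, dy))            # rows are y^i
x1q = rng.normal(size=d1)               # query-token inputs x_1^j, x_2^j
x2q = rng.normal(size=d2)

# Explicit gradient step, Eqs. (3)-(5).
R = X1 @ W1.T + X2 @ W2.T - Y           # residuals, shape (N, dy)
dW1 = -(eta / N) * R.T @ X1
dW2 = -(eta / N) * R.T @ X2
delta_y_gd = dW1 @ x1q + dW2 @ x2q

# Closed-form projections of Eq. (9).
D = d1 + d2 + dy
W_V = np.zeros((D, D))
W_V[d1 + d2:, :d1] = W1
W_V[d1 + d2:, d1:d1 + d2] = W2
W_V[d1 + d2:, d1 + d2:] = -np.eye(dy)
W_K = np.zeros((D, D))
W_K[:d1 + d2, :d1 + d2] = np.eye(d1 + d2)
W_Q = W_K.copy()

# Linear self-attention update of Eq. (10): -(eta/N) * sum_i v_i (k_i^T q_j).
E = np.hstack([X1, X2, Y])                          # context tokens e^i
e_q = np.concatenate([x1q, x2q, np.zeros(dy)])      # query token, empty y-slot
V, K, q = E @ W_V.T, E @ W_K.T, W_Q @ e_q
update = -(eta / N) * V.T @ (K @ q)
delta_y_lsa = update[d1 + d2:]                      # only the y-slot is written

print(np.allclose(delta_y_gd, delta_y_lsa))
```

The first `d1 + d2` entries of `update` are zero by construction, which corresponds to the update touching only the $y$-slot of the query token.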

## Appendix B Linear RAG: derivation of the unified retriever formulation

#### Main function.

$$
y=(W_q,W_z)\begin{bmatrix}x_q\\[3pt] \sum_{i=1}^{n}(W_ex_q)^{\top}(W_ed_i)\,d_i\end{bmatrix}=W_qx_q+W_z\sum_{i=1}^{n}(W_ex_q)^{\top}(W_ed_i)\,d_i .
\tag{11}
$$

#### Define $M$ and rewrite the similarity.

We adopt the shared-encoder simplification, in which the same linear encoder $W_e$ maps both queries and documents into the retrieval space. The general DPR formulation \[[19](https://arxiv.org/html/2605.26356#bib.bib9)\] permits separate query and document encoders $W_q$ and $W_d$ (here $W_q$ denotes the query encoder, not the generator weight of Eq. (11)). In that case the analysis below carries through up to Eq. (17) with $M=W_q^{\top}W_d$, but $M$ need not be symmetric, so the simplification of Eq. (18) does not apply. Defining

$$
M\triangleq W_e^{\top}W_e\quad\Rightarrow\quad(W_ex_q)^{\top}(W_ed_i)=x_q^{\top}W_e^{\top}W_ed_i=x_q^{\top}Md_i .
\tag{12}
$$

Hence,

$$
y=W_qx_q+W_z\sum_{i=1}^{n}(x_q^{\top}Md_i)\,d_i .
\tag{13}
$$

#### Converting “scalar $\times$ vector” into “matrix $\times$ vector.”

Note that $x_q^{\top}Md_i$ is a scalar, and the following identity holds:

$$
(x_q^{\top}Md_i)\,d_i=d_i(d_i^{\top}M^{\top}x_q)=(d_id_i^{\top})M^{\top}x_q .
\tag{14}
$$

Therefore,

$$
\sum_{i=1}^{n}(x_q^{\top}Md_i)\,d_i=\sum_{i=1}^{n}(d_id_i^{\top})M^{\top}x_q=\Big(\sum_{i=1}^{n}d_id_i^{\top}\Big)M^{\top}x_q .
\tag{15}
$$

#### Define the document second-moment matrix $D$.

$$
D\triangleq\sum_{i=1}^{n}d_id_i^{\top}\quad\Rightarrow\quad\sum_{i=1}^{n}(x_q^{\top}Md_i)\,d_i=DM^{\top}x_q .
\tag{16}
$$

#### Substituting back into $y$.

$$
y=W_qx_q+W_zDM^{\top}x_q .
\tag{17}
$$

Under the shared-encoder simplification, $M=W_e^{\top}W_e$ is symmetric, i.e., $M^{\top}=M$. Thus the expression simplifies to

$$
y=W_qx_q+W_zDMx_q .
\tag{18}
$$

Grouping the right-hand side gives an equivalent linear mapping:

$$
y=(W_q+W_zDM)\,x_q .
\tag{19}
$$
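
As a quick numerical sanity check of Eqs. (11)–(19), the sketch below compares the per-document sum with the collapsed linear map $(W_q+W_zDM)\,x_q$, and also covers the separate-encoder case, where the transpose in Eq. (17) must be kept. All shapes, the seed, and the names `W_a`, `W_b` (hypothetical separate query and document encoders) are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
dx, de, dy, n = 5, 3, 2, 7              # toy sizes (assumed)

W_q = rng.normal(size=(dy, dx))         # generator weight on the query, Eq. (11)
W_z = rng.normal(size=(dy, dx))         # generator weight on the retrieved feature
W_e = rng.normal(size=(de, dx))         # shared linear encoder
x_q = rng.normal(size=dx)
docs = rng.normal(size=(n, dx))         # rows are d_i

# Eq. (11): dot-product scores weight each document.
scores = (docs @ W_e.T) @ (W_e @ x_q)   # (W_e x_q)^T (W_e d_i), shape (n,)
y_sum = W_q @ x_q + W_z @ (scores @ docs)

# Eq. (19): collapsed linear map with M = W_e^T W_e and D = sum_i d_i d_i^T.
M = W_e.T @ W_e
D = docs.T @ docs
y_lin = (W_q + W_z @ D @ M) @ x_q

# Separate encoders: M is not symmetric, so Eq. (17) with M^T is the right form.
W_a = rng.normal(size=(de, dx))         # hypothetical query encoder
W_b = rng.normal(size=(de, dx))         # hypothetical document encoder
scores_asym = (docs @ W_b.T) @ (W_a @ x_q)
y_asym_sum = W_q @ x_q + W_z @ (scores_asym @ docs)
M_asym = W_a.T @ W_b
y_asym_lin = (W_q + W_z @ D @ M_asym.T) @ x_q

print(np.allclose(y_sum, y_lin), np.allclose(y_asym_sum, y_asym_lin))
```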

## Appendix C Linear-regime equivalence: full setup and additional figures

This appendix supplements Section [4.1](https://arxiv.org/html/2605.26356#S4.SS1) of the main paper with a full description of the synthetic-regression setup and the additional robustness and stacked-layer figures referenced there.

### C.1 Setup details

#### Tokens.

Each token concatenates an input vector, a retrieval-derived feature, and a target,

$$
e_i=(x_i,\,z_i,\,y_i),\qquad i=1,\ldots,N,
\tag{20}
$$

where $N$ is the number of in-context examples for a single task $\tau$. The auxiliary slot $z_i$ instantiates the unified RAG view of Section [3](https://arxiv.org/html/2605.26356#S3). Under the linear-projection retriever, $z_i$ is a document-derived feature; under the dot-product retriever, $z_i=x_i$ and the document information is injected through the keys and values rather than through the token (see “dot-product injection” below).

#### Pre\-training objective\.

We train an LSA layer parameterized by $\theta$ to minimize the expected squared prediction error across tasks:

$$
\mathcal{L}(\theta)=\frac{1}{B}\sum_{\tau=1}^{B}\big\|\hat{y}_{\theta}\big(\{e_{\tau,i}\}_{i=1}^{N},\;e_{\tau,N+1}\big)-y_{\tau,N+1}\big\|^{2},
\tag{21}
$$

where the query token at position $N+1$ is $e_{\tau,N+1}=(x_{\text{test}},z_{\text{test}},0)$ and $y_{\tau,N+1}$ is its target. The objective is optimized with minibatch SGD over a fresh batch of tasks at each iteration. We denote the parameters at convergence by $\theta^{*}$.

#### Synthetic data\.

Following \[[9](https://arxiv.org/html/2605.26356#bib.bib10), [41](https://arxiv.org/html/2605.26356#bib.bib6)\], we generate each task $\tau$ from a teacher with weights $W_{\tau}\sim\mathcal{N}(0,I)$. Inputs are drawn from $x_{\tau,i}\sim\mathcal{U}(-1,1)^{n_I}$ and targets are constructed as $y_{\tau,i}=W_{\tau}^{1}\,x_{\tau,i}^{1}+W_{\tau}^{2}\,x_{\tau,i}^{2}$. We set $N=n_I=10$ with output dimension $1$, and we sweep the document count $k\in\{2,5,10,25\}$.

#### Dot\-product injection\.

Under the linear-projection retriever the document is included in the token, $e_i=(x_i,\mathcal{D},y_i)$, and the LSA layer can learn to select relevant documents during pre-training. Under the dot-product retriever the document is not concatenated into the token; instead, document information is injected directly into the key and value matrices,

$$
K=\begin{bmatrix}K_{\text{ctx}}\\ h_d\end{bmatrix},\qquad V=\begin{bmatrix}V_{\text{ctx}}\\ h_d\end{bmatrix},
\tag{22}
$$

where $K_{\text{ctx}},V_{\text{ctx}}$ are the contextual key/value rows and $h_d=f(\mathcal{D})\in\mathbb{R}^{B\times\dim(x)}$ is a fixed projection of the document set $\mathcal{D}=\{d_1,\ldots,d_n\}$ into the input dimension.

#### Constructed reference predictor\.

The trained LSA layer (parameters $\theta^{*}$) is compared against a *constructed* predictor that realizes one gradient-descent step on the unified RAG loss exactly. Following the construction of Eq. [10](https://arxiv.org/html/2605.26356#A1.E10), we set the value, key, and query projections so that one LSA update reproduces the gradient step. For the linear-projection retriever, $W_1$ and $W_2$ inside $W_V$ are initialized to zero, following \[[41](https://arxiv.org/html/2605.26356#bib.bib6)\]. For the dot-product retriever, $W_2=W_z\big(\sum_i d_id_i^{\top}\big)M^{\top}$, with $W_z\in\mathbb{R}^{d_y\times d_d}$ and $M\in\mathbb{R}^{d_d\times d_d}$ sampled independently from $\mathcal{N}(0,\sigma^{2})$, and document features $C\sim\mathcal{U}(-\tfrac{1}{2},\tfrac{1}{2})^{k\times d_d}$. The inner-loop learning rate $\eta$ for the constructed predictor is chosen by line search to minimize the constructed model’s loss over $10^{4}$ training tasks. We write the resulting prediction as $\hat{y}_{\theta,\mathrm{rag}}(x_{\text{test}})$.

#### Evaluation metrics\.

On $T_{\text{val}}=10^{4}$ held-out validation tasks, following \[[41](https://arxiv.org/html/2605.26356#bib.bib6)\], we report the mean of three quantities between the trained and constructed predictors: (i) the prediction difference $\big\|\hat{y}_{\theta^{*}}(x_{\tau,\text{test}})-\hat{y}_{\theta,\mathrm{rag}}(x_{\tau,\text{test}})\big\|_{2}$; (ii) the cosine similarity between the input-sensitivities $\partial\hat{y}_{\theta,\mathrm{rag}}/\partial x_{\text{test}}$ and $\partial\hat{y}_{\theta^{*}}/\partial x_{\text{test}}$; and (iii) the corresponding $\ell_2$ sensitivity difference.
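
The three quantities can be illustrated with a short sketch. It is an assumption-laden stand-in rather than the paper's evaluation code: the two predictors are arbitrary scalar-output callables, and the sensitivities are estimated with central finite differences instead of automatic differentiation. The function names `sensitivity` and `agreement_metrics` are hypothetical.

```python
import numpy as np

def sensitivity(predict, x, eps=1e-5):
    """Central finite-difference estimate of d predict / d x (scalar output)."""
    grad = np.zeros_like(x)
    for k in range(x.size):
        step = np.zeros_like(x)
        step[k] = eps
        grad[k] = (predict(x + step) - predict(x - step)) / (2 * eps)
    return grad

def agreement_metrics(pred_a, pred_b, x_tests):
    """Mean prediction difference, sensitivity cosine, and l2 sensitivity difference."""
    pred_diff, cosine, sens_diff = [], [], []
    for x in x_tests:
        pred_diff.append(abs(pred_a(x) - pred_b(x)))
        g_a, g_b = sensitivity(pred_a, x), sensitivity(pred_b, x)
        cosine.append(g_a @ g_b / (np.linalg.norm(g_a) * np.linalg.norm(g_b)))
        sens_diff.append(np.linalg.norm(g_a - g_b))
    return float(np.mean(pred_diff)), float(np.mean(cosine)), float(np.mean(sens_diff))

# Toy usage: two linear predictors that differ only in scale.
rng = np.random.default_rng(0)
w = rng.normal(size=10)
f = lambda x: float(w @ x)
g = lambda x: float(2.0 * (w @ x))
x_tests = rng.uniform(-1.0, 1.0, size=(5, 10))

m_same = agreement_metrics(f, f, x_tests)
m_scaled = agreement_metrics(f, g, x_tests)
print(m_same, m_scaled)
```

In the scaled case the cosine stays at 1 while the $\ell_2$ sensitivity difference equals $\|w\|$, which shows why the direction and magnitude metrics are reported separately.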

### C.2 Robustness to distribution shift and document count

![Refer to caption](https://arxiv.org/html/2605.26356v1/x16.png)

![Refer to caption](https://arxiv.org/html/2605.26356v1/x17.png)

![Refer to caption](https://arxiv.org/html/2605.26356v1/x18.png)

![Refer to caption](https://arxiv.org/html/2605.26356v1/x19.png)

Figure 5: Robustness of the single-layer agreement of Section [4.1](https://arxiv.org/html/2605.26356#S4.SS1). Left two: loss as a function of document count for the linear-projection (left) and dot-product (centre-left) retrievers. Right two: loss under input distribution shift, with test inputs drawn from $\mathcal{U}(-\alpha,\alpha)^{n_I}$ for varying $\alpha$, for the linear-projection (centre-right) and dot-product (right) retrievers. The trained Transformer, the constructed gradient-descent predictor, and their interpolation track each other closely in all settings.

To probe whether the trained LSA layer captures a generalizable update rule rather than memorizing the training distribution, we vary two factors at test time. First, we sweep the document count $n\in\{2,5,10,25\}$ and recompute the comparison; second, we sample test inputs from $\mathcal{U}(-\alpha,\alpha)^{n_I}$ with $\alpha\in\{0.5,1,1.5,2\}$ while training is fixed at $\alpha=1$. Figure [5](https://arxiv.org/html/2605.26356#A3.F5) reports the resulting loss curves. With the linear-projection retriever, the absolute loss rises with document count (the projected document features carry more variance) but the LSA layer follows the gradient predictor in lockstep. With the dot-product retriever, where the document information enters through the second-moment matrix $\sum_i d_id_i^{\top}$, the loss is largely insensitive to document count. The dot-product variant is also computationally cheaper, since no per-document projection is required.

### C.3 Stacked-layer agreement under the dot-product retriever

![Refer to caption](https://arxiv.org/html/2605.26356v1/x20.png)

![Refer to caption](https://arxiv.org/html/2605.26356v1/x21.png)

![Refer to caption](https://arxiv.org/html/2605.26356v1/x22.png)

![Refer to caption](https://arxiv.org/html/2605.26356v1/x23.png)

![Refer to caption](https://arxiv.org/html/2605.26356v1/x24.png)

![Refer to caption](https://arxiv.org/html/2605.26356v1/x25.png)

![Refer to caption](https://arxiv.org/html/2605.26356v1/x26.png)

![Refer to caption](https://arxiv.org/html/2605.26356v1/x27.png)

Figure 6: Stacked-layer agreement under the dot-product retriever (Section [4.1](https://arxiv.org/html/2605.26356#S4.SS1)). Top row: 2-layer model. Bottom row: 5-layer model. Columns: (a) loss difference between trained Transformer and constructed gradient-descent predictor, (b) sensitivity cosine, (c) model difference, (d) prediction difference. Agreement remains close at both depths; the small residual gap at $\text{Docs}=25$ in the 2-layer setting shrinks as depth increases to 5.

Figure [6](https://arxiv.org/html/2605.26356#A3.F6) reports the dot-product variant at depths 2 and 5. The loss differences between the trained Transformer and the constructed predictor remain small across document counts, and the prediction differences converge to similar values. The number of retrieved documents has a depth-dependent effect on the residual: at depth 2, the prediction difference is smaller for $\text{Docs}=2$ than for $\text{Docs}=25$, but this gap narrows at depth 5. The corresponding analysis for the linear-projection retriever is reported in Appendix [E](https://arxiv.org/html/2605.26356#A5).

## Appendix D Normalization analysis: per-dataset extended results

This appendix supplements Section [4.3](https://arxiv.org/html/2605.26356#S4.SS3) of the main paper. Section [4.3](https://arxiv.org/html/2605.26356#S4.SS3) reports the headline result on Bike Sharing and California Housing; here we cover the remaining two datasets, Predict Calorie Expenditure and Wine Quality. Datasets and normalization methods are as defined in Section [4.3](https://arxiv.org/html/2605.26356#S4.SS3) and Appendix [F](https://arxiv.org/html/2605.26356#A6.SS0.SSS0.Px2). The training set is used as the retrieval corpus and is normalized with Z-score throughout; only the input-side normalization is varied, between Z-score \[[5](https://arxiv.org/html/2605.26356#bib.bib25)\], Min–Max \[[5](https://arxiv.org/html/2605.26356#bib.bib25)\], rank-based \[[7](https://arxiv.org/html/2605.26356#bib.bib26)\], and Tanh \[[39](https://arxiv.org/html/2605.26356#bib.bib27)\].

On Predict Calorie Expenditure, the trained Transformer continues to track the gradient-descent predictor closely, mirroring the alignment seen on Bike Sharing in Section [4.3](https://arxiv.org/html/2605.26356#S4.SS3). Wine Quality is the harder case. Under Min–Max normalization, a few outliers dominate the scaling and compress most of the samples near zero. Two effects follow. The sensitivity cosine drops because the sensitivity vectors diverge from those of the gradient-descent predictor. The prediction difference also fluctuates more strongly, indicating instability in the alignment between RAG and ICL dynamics under heavy-tailed feature distributions. This is consistent with the California Housing pattern in Section [4.3](https://arxiv.org/html/2605.26356#S4.SS3): when retrieval-derived dot products are dominated by skewed features, the linear correspondence becomes less predictive.
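
The compression effect under Min–Max scaling is easy to reproduce on a toy feature. The sketch below uses common textbook forms of the four schemes; the exact definitions used in the paper are given in its Section 4.3 and the cited references, so these formulas (in particular the tanh-estimator form with the 0.01 factor) are assumptions, and the feature values are made up.

```python
import numpy as np

def z_score(x):
    return (x - x.mean()) / x.std()

def min_max(x):
    return (x - x.min()) / (x.max() - x.min())

def rank_based(x):
    # Rank of each value (0 = smallest), rescaled to [0, 1]; assumes no ties.
    ranks = np.argsort(np.argsort(x))
    return ranks / (len(x) - 1)

def tanh_norm(x):
    # A commonly used tanh-estimator form (assumed here).
    return 0.5 * (np.tanh(0.01 * (x - x.mean()) / x.std()) + 1.0)

# Toy feature with one heavy-tailed outlier.
feature = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 1000.0])

for name, fn in [("z-score", z_score), ("min-max", min_max),
                 ("rank", rank_based), ("tanh", tanh_norm)]:
    print(name, np.round(fn(feature), 4))
```

With Min–Max, the five inlier values all fall below 0.005, whereas the rank-based transform spreads them evenly over the unit interval regardless of the outlier's magnitude.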

![Refer to caption](https://arxiv.org/html/2605.26356v1/x28.png)

![Refer to caption](https://arxiv.org/html/2605.26356v1/x29.png)

![Refer to caption](https://arxiv.org/html/2605.26356v1/x30.png)

![Refer to caption](https://arxiv.org/html/2605.26356v1/x31.png)

![Refer to caption](https://arxiv.org/html/2605.26356v1/x32.png)

![Refer to caption](https://arxiv.org/html/2605.26356v1/x33.png)

![Refer to caption](https://arxiv.org/html/2605.26356v1/x34.png)

![Refer to caption](https://arxiv.org/html/2605.26356v1/x35.png)

Figure 7: Per-dataset normalization results. Top row: Predict Calorie Expenditure. Bottom row: Wine Quality. Each column reports a different evaluation metric: loss difference with the trained Transformer, training loss of RAG, sensitivity cosine, model difference, and prediction difference. The four normalization schemes (Z-score, Min–Max, rank-based, Tanh) are overlaid within each panel.

## Appendix E Stacked-layer agreement under the projection-based retriever

![Refer to caption](https://arxiv.org/html/2605.26356v1/x36.png)

![Refer to caption](https://arxiv.org/html/2605.26356v1/x37.png)

![Refer to caption](https://arxiv.org/html/2605.26356v1/x38.png)

![Refer to caption](https://arxiv.org/html/2605.26356v1/x39.png)

![Refer to caption](https://arxiv.org/html/2605.26356v1/x40.png)

![Refer to caption](https://arxiv.org/html/2605.26356v1/x41.png)

![Refer to caption](https://arxiv.org/html/2605.26356v1/x42.png)

![Refer to caption](https://arxiv.org/html/2605.26356v1/x43.png)

Figure 8: Stacked-layer agreement under the projection-based retriever. Top row: 2-layer model. Bottom row: 5-layer model. Columns: (a) loss difference between trained Transformer and constructed gradient-descent predictor, (b) sensitivity cosine, (c) model difference, (d) prediction difference.

This appendix complements the dot-product analysis of Appendix [C](https://arxiv.org/html/2605.26356#A3) with the projection-based retriever. Under this interface, the retrieved documents are concatenated with the input tokens, $e_i=(x_i,\mathcal{D},y_i)$, so each per-document feature is processed jointly with the query through every stacked layer. As the document count grows, the variance of the per-token features grows with it, which amplifies the discrepancy between the trained Transformer and the constructed gradient-descent predictor at any fixed depth. Stacking layers reduces this discrepancy: at depth 5, the loss and prediction differences are uniformly smaller across document counts than at depth 2, and the sensitivity cosine is closer to 1.

Whereas the dot-product sweep in Appendix [C](https://arxiv.org/html/2605.26356#A3) reports results at 2, 5, 10, and 25 documents, the projection-based sweep is restricted to 2, 5, 10, and 15. The compute cost of stacking concatenated-document tokens grows substantially with retrieval size, and the growing-residual trend is already clear at 15 documents, so the 25-document run is omitted.

## Appendix F Dataset details

#### QA benchmarks \(main evaluation\)\.

The seven question\-answering benchmarks used in the main evaluation are all publicly available:

- Natural Questions (NQ) \[[21](https://arxiv.org/html/2605.26356#bib.bib48)\]: open-domain factoid QA over Wikipedia.
- TriviaQA \[[18](https://arxiv.org/html/2605.26356#bib.bib49)\]: large-scale trivia question answering with evidence documents.
- PopQA \[[26](https://arxiv.org/html/2605.26356#bib.bib50)\]: popularity-stratified entity-centric QA (held out from training).
- HotpotQA \[[44](https://arxiv.org/html/2605.26356#bib.bib51)\]: multi-hop QA with comparison and bridge questions.
- 2WikiMultiHopQA \[[13](https://arxiv.org/html/2605.26356#bib.bib52)\]: multi-hop questions grounded in Wikipedia article pairs.
- MuSiQue \[[38](https://arxiv.org/html/2605.26356#bib.bib53)\]: compositional multi-hop QA constructed by composing single-hop pairs.
- Bamboogle \[[30](https://arxiv.org/html/2605.26356#bib.bib54)\]: small multi-hop benchmark on long-tail entities (held out from training).

For each benchmark we use the standard FlashRAG \[[17](https://arxiv.org/html/2605.26356#bib.bib43)\] corpus and retrieval splits. NQ is used to train the source retrieval adapter $W_0^{\mathrm{ret}}$. NQ, TriviaQA, HotpotQA, 2WikiMultiHopQA, and MuSiQue are used to meta-train the predictor $g_{\phi}$. PopQA and Bamboogle are held out from both adapter training and predictor meta-training, and are used only for evaluation.

#### Synthetic and tabular regression datasets \(linear\-attention sanity check and normalization analysis\)\.

- California Housing: Given eight features, \[MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude\], the task is to predict MedHouseVal. The dataset is split into 16,640 training samples and 2,000 test samples.
- Bike Sharing: Using the features \[season, yr, mnth, hr, holiday, weekday, workingday, weathersit, temp, atemp, hum, windspeed, casual, registered\], the task is to predict count. The dataset contains 15,641 training samples and 1,738 test samples.
- Wine Quality: Given eleven physicochemical features, \[fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol\], the task is to predict the wine quality (a sensory score ranging from 0 to 10). The dataset is split into 4,408 training samples and 490 test samples.
- Predict Calorie Expenditure: Using the features \[Gender, Age, Height, Weight, Duration, Heart\_Rate, Body\_Temp\], the task is to predict the number of Calories expended. The dataset is split into 13,500 training samples and 1,540 test samples.

## Appendix G Implementation Details

#### Source retrieval adapter $W_0^{\mathrm{ret}}$.

We implement the generator-side retrieval interface with LoRA modules on the $\{q,k,v\}$ projections of every transformer block. The LoRA rank is $16$, with $\alpha=32$ and dropout $0$. The LLM backbone is kept frozen and loaded in 4-bit NF4 precision. We train $W_0^{\mathrm{ret}}$ with AdamW using learning rate $10^{-4}$, weight decay $0.01$, gradient clipping $1.0$, and gradient accumulation $4$ for $3{,}000$ steps on RAG-formatted examples from the NQ training split. Each example is paired with the top-$K_{\mathrm{ret}}=5$ retrieved documents from the fixed external retriever. For each retriever setting, we use the corresponding fixed retrieval cache and keep this cache unchanged across methods. This adapter serves only as a generator-side evidence-use interface: it does not select documents, but modulates how the frozen generator uses the retrieved evidence.

#### Predictor $g_{\phi}$.

The predictor maps a support context $C=\{(x_i,D_i,y_i)\}_{i=1}^{N}$ with $N=3$ demonstrations to a context-conditioned update for the generator-side retrieval interface. Each demonstration is formatted by concatenating the question, retrieved documents, and gold answer, and is encoded by the frozen LLM equipped with $W_0^{\mathrm{ret}}$. We take the EOS hidden state $h_i$ of each demonstration and aggregate the support context by mean pooling,

$$
\bar{h}(C)=\frac{1}{N}\sum_{i=1}^{N}h_i .
$$

The pooled representation is passed to a two-layer MLP encoder with hidden dimension $256$ and output dimension $64$, followed by per-layer and per-projection update heads for the $\{q,k,v\}$ LoRA modules. The update heads output low-rank perturbations with the same LoRA rank as $W_0^{\mathrm{ret}}$, so that $g_{\phi}(C)$ has the same parameter shape as the base retrieval adapter. We train $g_{\phi}$ with AdamW using learning rate $5\times10^{-4}$, weight decay $0.01$, gradient clipping $1.0$, and gradient accumulation $4$ for $3{,}000$ steps. The predictor is meta-trained on support contexts sampled from NQ, TriviaQA, HotpotQA, 2WikiMultiHopQA, and MuSiQue. PopQA and Bamboogle are excluded from both adapter training and predictor meta-training, and are used only for held-out evaluation.
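
To make the shapes concrete, here is a minimal NumPy sketch of a predictor with this layout: mean pooling, a two-layer MLP (256 hidden units, 64 outputs), and one head per layer and projection that emits a rank-16 pair of LoRA factors. Several details are assumptions not stated in the text: the ReLU activation, the linear form of the heads, the way each head's output is split into factors `A` and `B`, the toy model width `D_MODEL = 32`, and the use of two layers. Weights are random, so this only illustrates the data flow.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, HIDDEN, LATENT, RANK = 32, 256, 64, 16   # D_MODEL is a toy value (assumed)
N_LAYERS, PROJS = 2, ("q", "k", "v")               # toy depth (assumed)

def init(shape):
    return rng.normal(scale=0.02, size=shape)

params = {
    "W1": init((HIDDEN, D_MODEL)), "b1": np.zeros(HIDDEN),
    "W2": init((LATENT, HIDDEN)), "b2": np.zeros(LATENT),
    # One linear head per (layer, projection); each emits flattened A and B.
    "heads": {(layer, proj): init((2 * RANK * D_MODEL, LATENT))
              for layer in range(N_LAYERS) for proj in PROJS},
}

def predict_update(hidden_states, params):
    """Map EOS hidden states (N, D_MODEL) to per-module low-rank factors."""
    h_bar = hidden_states.mean(axis=0)                         # mean pooling
    z = np.maximum(params["W1"] @ h_bar + params["b1"], 0.0)   # ReLU (assumed)
    z = params["W2"] @ z + params["b2"]
    update = {}
    for key, w_head in params["heads"].items():
        flat = w_head @ z
        a = flat[:RANK * D_MODEL].reshape(RANK, D_MODEL)       # LoRA "A" factor
        b = flat[RANK * D_MODEL:].reshape(D_MODEL, RANK)       # LoRA "B" factor
        update[key] = (a, b)
    return update

h = rng.normal(size=(3, D_MODEL))        # N = 3 demonstrations
update = predict_update(h, params)
a_q0, b_q0 = update[(0, "q")]
print(len(update), a_q0.shape, b_q0.shape, (b_q0 @ a_q0).shape)
```

Each predicted pair has the same shape as a rank-16 LoRA module, so it can be added to the corresponding factors of the base adapter.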

#### Matching objective\.

The predictor is trained to match the autograd-defined inner-GD target for each layer and projection type. The matching loss uses a cosine term for update direction and a log-magnitude term for update scale, as defined in Section [5.2](https://arxiv.org/html/2605.26356#S5.SS2). We set the magnitude weight to $\lambda=0.1$ throughout. No downstream answer loss is applied when training $g_{\phi}$ for the main RAG-GD results.
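
The exact loss is defined in Section 5.2 of the main paper and is not reproduced in this appendix. As a hedged illustration only, one plausible instantiation combines $1-\cos(\cdot,\cdot)$ with a squared difference of log-norms; the functional form below, the `eps` stabilizer, and the name `matching_loss` are assumptions.

```python
import math

def matching_loss(pred, target, lam=0.1, eps=1e-12):
    """Cosine direction term plus lam * squared log-magnitude gap (assumed form)."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm_p = math.sqrt(sum(p * p for p in pred))
    norm_t = math.sqrt(sum(t * t for t in target))
    cos_term = 1.0 - dot / (norm_p * norm_t + eps)
    mag_term = (math.log(norm_p + eps) - math.log(norm_t + eps)) ** 2
    return cos_term + lam * mag_term

same = matching_loss([1.0, 2.0], [1.0, 2.0])        # identical updates
scaled = matching_loss([2.0, 4.0], [1.0, 2.0])      # right direction, wrong scale
flipped = matching_loss([-1.0, -2.0], [1.0, 2.0])   # right scale, wrong direction
print(same, scaled, flipped)
```

Under this form a scale error is penalized only through the $\lambda$-weighted term, while a direction error is penalized at full weight.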

#### Inner GD target\.

For each support context, the supervision target is computed by running $K$ steps of SGD on the $N=3$ demonstrations, starting from $W_0^{\mathrm{ret}}$. The inner-loop learning rate is $\eta=10^{-2}$, and we evaluate $K\in\{1,5,10\}$. Gradients are taken only with respect to the LoRA parameters of the generator-side retrieval interface; the external retriever and the LLM backbone remain fixed.
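
The target computation can be sketched on a toy problem. Here the LoRA parameters are replaced by a small weight vector and the language-model loss by a squared loss on three examples; both substitutions, the full-batch gradient, and the convention that the target is the displacement from the starting point are assumptions made for illustration.

```python
import numpy as np

def inner_gd_target(w0, grad_fn, num_steps, eta=1e-2):
    """Run num_steps gradient steps from w0 and return the total displacement."""
    w = w0.copy()
    for _ in range(num_steps):
        w = w - eta * grad_fn(w)
    return w - w0

# Toy stand-in for the N = 3 support demonstrations.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
y = rng.normal(size=3)
w0 = np.zeros(4)

def loss(w):
    return 0.5 * np.mean((X @ w - y) ** 2)

def grad(w):
    return X.T @ (X @ w - y) / len(y)

targets = {K: inner_gd_target(w0, grad, K) for K in (1, 5, 10)}
for K, delta in targets.items():
    print(K, round(float(loss(w0 + delta)), 6))
```

For $K=1$ the target reduces to $-\eta\,\nabla L(W_0)$, the single-step update analysed in Appendix A; larger $K$ moves further along the loss surface at proportionally higher supervision cost.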

#### Compute\.

All experiments are run on NVIDIA A100 80GB GPUs. A single training run for $W_0^{\mathrm{ret}}$ takes approximately $50$–$60$ minutes. Training $g_{\phi}$ takes approximately $1.5$ hours for $K=1$, $4$ hours for $K=5$, and $7$ hours for $K=10$. The full result table requires approximately $50$ A100-hours.

## Appendix H Algorithm and inference procedure for RAG-GD

Algorithm [1](https://arxiv.org/html/2605.26356#algorithm1) summarises the deployment-time procedure of RAG-GD: build a RAG-formatted support context, predict the retrieval-interface update with a single forward pass through $g_{\phi}$, and generate with the perturbed interface. No backward pass through the LLM is required at deployment.

Input: External retriever $R$, frozen LLM $f$, source adapter $W_0^{\mathrm{ret}}$, predictor $g_{\phi}$, support size $N$, query $q$

Output: Generated answer $\hat{y}_q$

- Phase 1: Build RAG-formatted support context
  - $C\leftarrow\emptyset$
  - for $i=1,\ldots,N$ do
    - Sample support pair $(x_i,y_i)$ from the task support pool
    - $\mathcal{D}_i\leftarrow R(x_i)$ (retrieve top-$k$ documents)
    - $C\leftarrow C\cup\{(x_i,\mathcal{D}_i,y_i)\}$
- Phase 2: Predict retrieval-interface update
  - for $(x_i,\mathcal{D}_i,y_i)\in C$ do
    - $h_i\leftarrow f_{W_0^{\mathrm{ret}}}(x_i,\mathcal{D}_i,y_i)_{\mathrm{EOS}}$
  - $\bar{h}(C)\leftarrow\frac{1}{N}\sum_{i=1}^{N}h_i$
  - $\widetilde{\Delta W}(C)\leftarrow g_{\phi}(\bar{h}(C))$
- Phase 3: Generate with adapted interface
  - $\mathcal{D}_q\leftarrow R(q)$
  - $\hat{y}_q\leftarrow f_{W_0^{\mathrm{ret}}+\widetilde{\Delta W}(C)}(q,\mathcal{D}_q)$
  - return $\hat{y}_q$

Algorithm 1: RAG-GD: Forward-Only Retrieval-Interface Adaptation
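
The three phases can be sketched as a plain Python function in which every model component is passed in as a callable. This is a hedged illustration of the control flow only: `retrieve`, `encode_eos`, `predict_update`, and `generate` are hypothetical stand-ins for $R$, $f_{W_0^{\mathrm{ret}}}(\cdot)_{\mathrm{EOS}}$, $g_{\phi}$, and $f_{W_0^{\mathrm{ret}}+\widetilde{\Delta W}}$, and the toy stubs at the bottom exist only so that the example runs.

```python
import random

def rag_gd_infer(query, support_pool, retrieve, encode_eos,
                 predict_update, generate, n_support=3, seed=0):
    """Forward-only inference following the three phases of Algorithm 1."""
    rng = random.Random(seed)

    # Phase 1: build the RAG-formatted support context.
    context = []
    for x_i, y_i in rng.sample(list(support_pool), n_support):
        context.append((x_i, retrieve(x_i), y_i))

    # Phase 2: encode each demonstration, mean-pool, predict the update.
    hidden = [encode_eos(x_i, docs_i, y_i) for x_i, docs_i, y_i in context]
    dim = len(hidden[0])
    h_bar = [sum(h[k] for h in hidden) / len(hidden) for k in range(dim)]
    delta_w = predict_update(h_bar)

    # Phase 3: retrieve for the query and generate with the adapted interface.
    return generate(query, retrieve(query), delta_w)

# Toy stubs (hypothetical; real components would wrap a retriever and an LLM).
pool = [("aa", "b"), ("cccc", "dd"), ("e", "fff")]
answer = rag_gd_infer(
    query="who?",
    support_pool=pool,
    retrieve=lambda text: [f"doc about {text}"],
    encode_eos=lambda x, docs, y: [float(len(x)), float(len(y))],
    predict_update=lambda h_bar: sum(h_bar),
    generate=lambda q, docs, dw: f"{q}|{len(docs)}|{dw:.1f}",
)
print(answer)
```

No gradient is computed anywhere in this path; the only per-query costs are the support encodings, one predictor call, and one generation call.
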
## Appendix I Full per-method comparison on QA benchmarks

We split the per-benchmark numbers into two tables. Table [2](https://arxiv.org/html/2605.26356#A9.T2) reports the methods that we ran on *both* Qwen-2.5-7B-Instruct and Llama-3.1-8B-Instruct: Query Only, Vanilla RAG, Base adapter, and RAG-GD at $K\in\{1,5,10\}$. Table [3](https://arxiv.org/html/2605.26356#A9.T3) reports the additional context-conditioned baselines that we ran only on Qwen due to compute constraints: Vanilla RAG + few shot, Base adapter + few shot, Prompt tuning, HyperTuning, and TT-SGD at $K=5$. Together they complement the slim main-text Table [1](https://arxiv.org/html/2605.26356#S5.T1) and the per-family aggregates in Figures [3](https://arxiv.org/html/2605.26356#S5.F3) and [4](https://arxiv.org/html/2605.26356#S5.F4).

HyperTuning \[[29](https://arxiv.org/html/2605.26356#bib.bib29)\] uses the same predictor architecture as RAG-GD but supervises against a downstream task loss instead of the SGD-update target, so the contrast between the HyperTuning rows in Table [3](https://arxiv.org/html/2605.26356#A9.T3) and the RAG-GD ($K=5$) rows for Qwen in Table [2](https://arxiv.org/html/2605.26356#A9.T2) isolates the choice of supervision signal. TT-SGD performs $K=5$ inner gradient-descent steps at test time and serves as a reference for what test-time gradient adaptation would achieve at the same backbone and retriever. The + few shot variants concatenate the support demonstrations into the prompt under the corresponding base configuration.

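Table 2: Methods run on both backbones. Per-benchmark exact match (EM) and F1 for Query Only, Vanilla RAG, Base adapter, and RAG-GD at $K \in \{1, 5, 10\}$. NQ, TriviaQA, and PopQA are single-hop QA; HotpotQA, 2Wiki, MuSiQue, and Bamboogle are multi-hop QA.

| Backbone | Method | Retriever | NQ EM | NQ F1 | TriviaQA EM | TriviaQA F1 | PopQA EM | PopQA F1 | HotpotQA EM | HotpotQA F1 | 2Wiki EM | 2Wiki F1 | MuSiQue EM | MuSiQue F1 | Bamboogle EM | Bamboogle F1 | Avg. EM | Avg. F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen-2.5-7B | Query Only | – | 15.95 | 24.28 | 43.33 | 49.51 | 16.02 | 19.76 | 18.40 | 25.39 | 23.91 | 28.12 | 3.80 | 10.57 | 11.20 | 18.02 | 18.94 | 25.09 |
| Qwen-2.5-7B | Vanilla RAG | BM25 | 27.61 | 36.66 | 58.24 | 65.77 | 28.84 | 33.18 | 31.28 | 41.25 | 27.87 | 33.24 | 5.87 | 13.05 | 10.40 | 21.16 | 27.16 | 34.90 |
| Qwen-2.5-7B | Vanilla RAG | E5 | 39.16 | 50.03 | 62.99 | 70.80 | 44.03 | 50.21 | 32.45 | 42.21 | 25.48 | 31.43 | 5.79 | 12.77 | 18.40 | 26.74 | 32.61 | 40.60 |
| Qwen-2.5-7B | Base adapter | BM25 | 32.57 | 41.45 | 60.11 | 67.93 | 31.55 | 35.63 | 32.31 | 43.45 | 28.22 | 34.13 | 6.41 | 15.46 | 16.80 | 26.01 | 29.71 | 37.72 |
| Qwen-2.5-7B | Base adapter | E5 | 41.77 | 51.22 | 63.31 | 71.62 | 47.05 | 51.82 | 33.78 | 44.64 | 27.89 | 33.97 | 6.95 | 15.76 | 18.40 | 28.78 | 34.16 | 42.54 |
| Qwen-2.5-7B | RAG-GD ($K=1$) | BM25 | 33.88 | 43.20 | 63.02 | 70.66 | 33.25 | 37.73 | 35.25 | 46.79 | 28.84 | 34.59 | 9.14 | 19.62 | 22.40 | 33.06 | 32.25 | 40.81 |
| Qwen-2.5-7B | RAG-GD ($K=1$) | E5 | 42.49 | 52.32 | 65.61 | 73.51 | 48.41 | 52.93 | 35.34 | 46.94 | 29.66 | 35.50 | 8.93 | 19.17 | 24.00 | 34.18 | 36.35 | 44.94 |
| Qwen-2.5-7B | RAG-GD ($K=5$) | BM25 | 34.46 | 43.54 | 63.27 | 70.69 | 33.22 | 37.69 | 35.54 | 47.14 | 28.86 | 34.48 | 9.26 | 19.73 | 22.40 | 32.85 | 32.43 | 40.87 |
| Qwen-2.5-7B | RAG-GD ($K=5$) | E5 | 42.91 | 52.71 | 65.98 | 73.60 | 48.12 | 52.61 | 35.54 | 47.00 | 29.67 | 35.47 | 9.14 | 19.26 | 25.60 | 35.12 | 36.71 | 45.11 |
| Qwen-2.5-7B | RAG-GD ($K=10$) | BM25 | 34.57 | 43.72 | 63.26 | 70.57 | 33.11 | 37.57 | 35.45 | 47.07 | 28.64 | 34.32 | 9.35 | 19.58 | 22.40 | 31.92 | 32.40 | 40.68 |
| Qwen-2.5-7B | RAG-GD ($K=10$) | E5 | 42.46 | 52.14 | 65.93 | 73.49 | 47.96 | 52.43 | 35.69 | 47.05 | 29.43 | 35.14 | 8.90 | 19.09 | 26.40 | 34.67 | 36.68 | 44.86 |
| Llama-3.1-8B | Query Only | – | 22.46 | 32.51 | 52.67 | 59.79 | 20.63 | 25.09 | 18.31 | 25.71 | 26.39 | 31.04 | 3.81 | 9.51 | 6.40 | 12.88 | 21.52 | 28.08 |
| Llama-3.1-8B | Vanilla RAG | BM25 | 31.41 | 40.95 | 60.43 | 68.35 | 31.08 | 35.43 | 31.92 | 42.46 | 26.07 | 31.82 | 5.75 | 12.44 | 14.40 | 22.93 | 28.72 | 36.34 |
| Llama-3.1-8B | Vanilla RAG | E5 | 40.72 | 52.23 | 64.42 | 72.58 | 45.85 | 51.48 | 32.78 | 43.19 | 23.44 | 29.46 | 6.04 | 12.25 | 24.80 | 32.12 | 34.01 | 41.90 |
| Llama-3.1-8B | Base adapter | BM25 | 38.47 | 49.42 | 62.66 | 72.46 | 37.80 | 42.29 | 37.35 | 50.41 | 33.66 | 39.55 | 11.46 | 21.98 | 29.60 | 41.29 | 35.86 | 45.34 |
| Llama-3.1-8B | Base adapter | E5 | 43.46 | 54.60 | 63.90 | 74.05 | 51.69 | 55.74 | 37.31 | 49.90 | 33.45 | 39.45 | 11.83 | 22.07 | 30.40 | 40.48 | 38.86 | 48.04 |
| Llama-3.1-8B | RAG-GD ($K=1$) | BM25 | 39.58 | 49.60 | 66.13 | 74.13 | 37.89 | 42.38 | 38.53 | 50.67 | 33.72 | 39.61 | 12.25 | 22.48 | 30.40 | 41.69 | 36.93 | 45.79 |
| Llama-3.1-8B | RAG-GD ($K=1$) | E5 | 45.10 | 55.19 | 67.63 | 75.98 | 52.38 | 56.61 | 38.27 | 50.09 | 33.44 | 39.47 | 12.32 | 22.52 | 30.40 | 40.72 | 39.93 | 48.65 |
| Llama-3.1-8B | RAG-GD ($K=5$) | BM25 | 40.22 | 50.01 | 66.13 | 74.28 | 37.81 | 42.19 | 38.99 | 51.14 | 34.15 | 40.04 | 12.54 | 22.61 | 28.00 | 39.61 | 36.83 | 45.70 |
| Llama-3.1-8B | RAG-GD ($K=5$) | E5 | 45.68 | 55.84 | 67.66 | 76.07 | 52.31 | 56.48 | 39.01 | 50.94 | 33.94 | 39.83 | 13.20 | 23.47 | 32.00 | 42.08 | 40.54 | 49.24 |
| Llama-3.1-8B | RAG-GD ($K=10$) | BM25 | 40.28 | 50.29 | 65.93 | 74.07 | 37.73 | 42.10 | 39.10 | 51.20 | 34.26 | 40.08 | 12.62 | 22.86 | 28.80 | 39.92 | 36.96 | 45.79 |
| Llama-3.1-8B | RAG-GD ($K=10$) | E5 | 45.48 | 55.51 | 67.71 | 76.03 | 52.38 | 56.49 | 39.10 | 50.96 | 34.03 | 39.79 | 13.41 | 23.41 | 32.00 | 42.83 | 40.59 | 49.29 |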
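The Avg. columns appear to be the unweighted mean over the seven benchmarks. The short check below, a sketch rather than the authors' evaluation code, recomputes the average for the Qwen-2.5-7B Query Only row from the per-benchmark values in Table 2. The helper name `macro_average` and the dictionary layout are illustrative assumptions.

```python
from statistics import fmean

BENCHMARKS = ["NQ", "TriviaQA", "PopQA", "HotpotQA", "2Wiki", "MuSiQue", "Bamboogle"]

# (EM, F1) per benchmark for Qwen-2.5-7B, Query Only, copied from Table 2.
qwen_query_only = {
    "NQ": (15.95, 24.28),
    "TriviaQA": (43.33, 49.51),
    "PopQA": (16.02, 19.76),
    "HotpotQA": (18.40, 25.39),
    "2Wiki": (23.91, 28.12),
    "MuSiQue": (3.80, 10.57),
    "Bamboogle": (11.20, 18.02),
}


def macro_average(row):
    """Unweighted mean of EM and F1 over all benchmarks, rounded to 2 decimals."""
    em = fmean(row[b][0] for b in BENCHMARKS)
    f1 = fmean(row[b][1] for b in BENCHMARKS)
    return round(em, 2), round(f1, 2)


print(macro_average(qwen_query_only))  # (18.94, 25.09), matching the Avg. columns
```

Because the mean is unweighted, small benchmarks such as Bamboogle contribute as much to Avg. as large ones such as NQ.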

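Table 3: Additional context-conditioned baselines, run on Qwen-2.5-7B only. For comparison anchors (Qwen Base adapter and RAG-GD at the same retriever and benchmark), see Table [2](https://arxiv.org/html/2605.26356#A9.T2). NQ, TriviaQA, and PopQA are single-hop QA; HotpotQA, 2Wiki, MuSiQue, and Bamboogle are multi-hop QA.

| Method | Retriever | NQ EM | NQ F1 | TriviaQA EM | TriviaQA F1 | PopQA EM | PopQA F1 | HotpotQA EM | HotpotQA F1 | 2Wiki EM | 2Wiki F1 | MuSiQue EM | MuSiQue F1 | Bamboogle EM | Bamboogle F1 | Avg. EM | Avg. F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Vanilla RAG + few shot | BM25 | 27.72 | 36.80 | 58.41 | 65.98 | 28.80 | 33.18 | 31.73 | 41.43 | 26.27 | 32.05 | 5.88 | 13.58 | 11.20 | 20.27 | 27.14 | 34.76 |
| Vanilla RAG + few shot | E5 | 38.78 | 49.75 | 63.14 | 70.94 | 43.82 | 50.00 | 32.28 | 42.05 | 24.20 | 30.80 | 5.99 | 13.21 | 16.80 | 25.42 | 32.14 | 40.31 |
| Base adapter + few shot | BM25 | 34.07 | 43.18 | 62.26 | 69.97 | 32.94 | 37.65 | 34.65 | 46.18 | 28.91 | 34.56 | 9.06 | 19.66 | 19.20 | 30.19 | 31.58 | 40.20 |
| Base adapter + few shot | E5 | 41.52 | 51.25 | 65.03 | 73.11 | 47.85 | 52.38 | 34.51 | 46.08 | 29.81 | 35.70 | 8.44 | 18.67 | 26.40 | 36.08 | 36.22 | 44.75 |
| Prompt tuning | BM25 | 28.94 | 40.81 | 58.47 | 69.11 | 32.42 | 37.14 | 32.43 | 44.98 | 32.18 | 38.54 | 7.36 | 17.79 | 22.40 | 33.15 | 30.60 | 40.22 |
| Prompt tuning | E5 | 35.92 | 48.16 | 61.09 | 71.73 | 47.10 | 51.96 | 32.72 | 45.14 | 31.60 | 38.04 | 7.28 | 17.22 | 26.40 | 32.98 | 34.59 | 43.60 |
| HyperTuning | BM25 | 33.62 | 43.21 | 61.38 | 69.90 | 33.71 | 37.97 | 34.55 | 46.34 | 31.87 | 37.81 | 7.52 | 17.60 | 19.20 | 31.69 | 31.69 | 40.65 |
| HyperTuning | E5 | 40.72 | 50.51 | 64.19 | 72.57 | 48.88 | 53.25 | 34.76 | 46.16 | 31.47 | 37.44 | 6.95 | 16.74 | 24.00 | 32.83 | 35.85 | 44.21 |
| TT-SGD ($K=5$) | BM25 | 34.37 | 43.32 | 63.32 | 70.75 | 32.94 | 37.63 | 35.16 | 47.13 | 28.76 | 34.53 | 8.77 | 19.51 | 23.20 | 33.26 | 32.36 | 40.88 |
| TT-SGD ($K=5$) | E5 | 42.52 | 52.14 | 65.50 | 73.45 | 47.95 | 52.51 | 35.62 | 47.43 | 29.67 | 35.59 | 8.93 | 19.13 | 25.60 | 34.88 | 36.54 | 45.02 |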
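The cross-table comparisons described above can be read off the Avg. columns. The sketch below only subtracts reported averages for Qwen-2.5-7B (RAG-GD at $K=5$ from Table 2, HyperTuning and TT-SGD from Table 3); it is an illustrative helper, and the names `AVG` and `gap` are assumptions, not part of the paper's code.

```python
# Avg. (EM, F1) on Qwen-2.5-7B, copied from Tables 2 and 3.
AVG = {
    "RAG-GD (K=5)": {"BM25": (32.43, 40.87), "E5": (36.71, 45.11)},
    "HyperTuning": {"BM25": (31.69, 40.65), "E5": (35.85, 44.21)},
    "TT-SGD (K=5)": {"BM25": (32.36, 40.88), "E5": (36.54, 45.02)},
}


def gap(method_a, method_b, retriever):
    """Return (EM, F1) of method_a minus method_b, in points, for one retriever."""
    a = AVG[method_a][retriever]
    b = AVG[method_b][retriever]
    return tuple(round(x - y, 2) for x, y in zip(a, b))


for retriever in ("BM25", "E5"):
    # Supervision-signal contrast: same predictor architecture, different target.
    print(retriever, "RAG-GD vs HyperTuning:", gap("RAG-GD (K=5)", "HyperTuning", retriever))
    # Forward-only update versus test-time gradient adaptation.
    print(retriever, "RAG-GD vs TT-SGD:", gap("RAG-GD (K=5)", "TT-SGD (K=5)", retriever))
```

On these averages, RAG-GD at $K=5$ is ahead of HyperTuning by 0.74 EM and 0.22 F1 with BM25 and by 0.86 EM and 0.90 F1 with E5, and it is within 0.2 points of TT-SGD on both metrics and both retrievers.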

## Appendix J Use of large language models

Large language models (LLMs) were used only for language polishing and minor grammatical editing of this manuscript.
