Diagnosing and Repairing Factual Errors in RAG under Budget Constraints

arXiv cs.AI 06/30/26, 04:00 AM Papers
Summary
This paper proposes D2R-RAG, a model-agnostic and resource-aware framework that diagnoses and repairs factual errors in RAG systems under latency and VRAM constraints, achieving better accuracy-efficiency trade-offs on FEVER and HotpotQA.
arXiv:2606.29377v1 Announce Type: new Abstract: Retrieval-Augmented Generation (RAG) improves the factuality of large language models by grounding responses in external evidence, yet real-world deployments remain fragile. Failures often stem from missing or weakly relevant evidence, as well as from generation that does not faithfully reflect the retrieved context. Many existing approaches rely on fine-tuning, privileged access to internal model signals, or resource-insensitive escalation strategies, which limits their practicality in black-box and budget-constrained settings. We propose D2R-RAG (Diagnose-to-Repair RAG), a model-agnostic and resource-aware framework that combines lightweight failure diagnosis with adaptive repair. D2R-RAG derives interpretable failure signatures from observable signals in the query, retrieved evidence, and generated response, and then selects from a small set of corrective actions under explicit latency and VRAM constraints. Experiments on FEVER and HotpotQA show that D2R-RAG improves reliability over recent baselines and achieves better accuracy--efficiency trade-offs across multiple compute budgets. The code is available at https://github.com/CyberScienceLab/D2R-RAG/.
Cached at: 06/30/26, 05:33 AM
# Diagnosing and Repairing Factual Errors in RAG under Budget Constraints
Source: [https://arxiv.org/html/2606.29377](https://arxiv.org/html/2606.29377)

###### Abstract

Retrieval-Augmented Generation (RAG) improves the factuality of large language models by grounding responses in external evidence, yet real-world deployments remain fragile. Failures often stem from missing or weakly relevant evidence, as well as from generation that does not faithfully reflect the retrieved context. Many existing approaches rely on fine-tuning, privileged access to internal model signals, or resource-insensitive escalation strategies, which limits their practicality in black-box and budget-constrained settings. We propose D2R-RAG (Diagnose-to-Repair RAG), a model-agnostic and resource-aware framework that combines lightweight failure diagnosis with adaptive repair. D2R-RAG derives interpretable failure signatures from observable signals in the query, retrieved evidence, and generated response, and then selects from a small set of corrective actions under explicit latency and VRAM constraints. Experiments on FEVER and HotpotQA show that D2R-RAG improves reliability over recent baselines and achieves better accuracy–efficiency trade-offs across multiple compute budgets. The code is available at [https://github.com/CyberScienceLab/D2R-RAG/](https://github.com/CyberScienceLab/D2R-RAG/).

###### Keywords

RAG, Factuality Verification, Contextual Bandits, Adaptive Repair.

Soroush Hashemifar¹, Havva Alizadeh Noughabi², Fattane Zarrinkalam¹,²,\*, Ali Dehghantanha²

¹ College of Engineering, University of Guelph, Guelph, ON, Canada

² Cyber Science Lab, School of Computer Science, University of Guelph, Guelph, ON, Canada

\* fzarrink@uoguelph.ca

## 1. Introduction

Retrieval-Augmented Generation (RAG) [lewis2020retrieval] has become a standard approach for improving the factuality of large language model (LLM) outputs by grounding generation in external evidence rather than relying solely on parametric memory. In many real deployments, however, RAG remains unreliable: answers can be wrong when the retriever misses key evidence, when retrieved passages are relevant but incomplete, or when the generator produces content that is not supported by the provided context [xie2024adaptive]. These failure modes are stochastic and non-uniform, and they are most problematic in settings with strict latency and hardware budgets (e.g., limited GPU memory, rate-limited APIs, or cost-constrained edge/cloud deployments), where iterative retrieval or resource-intensive reranking is computationally prohibitive [ray2025metis].

Recent work has proposed dynamic and self-correcting RAG variants from three directions: (1) training-time modifications, (2) inference-time control, and (3) learning-based decision policies. Self-RAG [asai2024self] trains generators to emit reflection control tokens that decide when to retrieve and how to critique intermediate outputs, improving factuality through self-assessment, but requires fine-tuning and architectural coupling that limits use in black-box or API-based settings. Moreover, DRAGIN [su2403dragin] uses token-level uncertainty and other internal dynamics to trigger retrieval during inference, offering lightweight, training-free control, yet relies on logit access (unavailable in closed-source LLM APIs). Finally, learning-based policies like MBA-RAG [tang2025mba] (multi-armed bandit balancing quality and efficiency, dependent on an external query-complexity predictor) and QueryBandits [cho2025querybandits] (bandit-based query rewriting using semantic and lexical signals) demonstrate that online decision-making can improve robustness, but typically target specific pipeline stages or require additional supervision or modeling components.

This paper addresses *model-agnostic, resource-aware RAG recovery in black-box settings*, arguing that reliable recovery requires two capabilities: a *lightweight diagnostic* distinguishing retrieval-side evidence insufficiency from generation-side unfaithfulness using only observable artifacts, and the *least-cost corrective action* under explicit latency and VRAM budgets. We operationalize these principles in D2R-RAG (Diagnose-to-Repair RAG), which performs triangulated verification by combining (1) textual entailment checks between retrieved evidence and both the query and response, with (2) structured consistency checks that align relational triples extracted from the response against a knowledge graph to identify entity/predicate mismatches. These signals yield an interpretable failure signature separating missing evidence from unsupported generation. Repair selection is then formulated as a contextual multi-armed bandit [langford2007epoch] where each arm corresponds to a specific intervention (query rewriting, retrieval depth or mode adjustment, cross-encoder reranking, or index refresh), and a LinUCB policy [li2010contextual] learns to balance expected factual recovery against computational cost. Notably, the framework requires neither generator fine-tuning nor access to internal logits, enabling deployment with closed-source LLM APIs.

We evaluate D2R-RAG on fact verification and multi-hop question answering using FEVER [Thorne18Fever] and HotpotQA [yang2018hotpotqa]. Our evaluation emphasizes both factual performance and operational efficiency under multiple latency/VRAM budget regimes. Beyond aggregate results, we analyze how the learned policy allocates high-cost interventions across different failure signatures and how performance changes as resource budgets tighten.

## 2. Methodology

![Refer to caption](https://arxiv.org/html/2606.29377v1/x1.png)

Figure 1. Overview of D2R-RAG.

Figure [1](https://arxiv.org/html/2606.29377#S2.F1) gives an overview of D2R-RAG, which augments a RAG pipeline with two modules.

(1) Failure Diagnosis. This stage determines whether the initial RAG output is trustworthy and identifies the likely failure type. It uses only the query $q$, evidence $E$, response $r$, and predicted label $y$, all of which are observable, so it is suitable for black-box deployment. The goal is to produce an interpretable failure signature that separates retrieval-side insufficiency from generation-side unfaithfulness, rather than perfectly explaining every error. We derive the signature using three complementary signals: query entailment $e^{q}$, response entailment $e^{r}$, and KG alignment status $\kappa$ [naveen2023nli]. See Appendix [A](https://arxiv.org/html/2606.29377#A1) for additional details.

Given $(\kappa, e^{q}, e^{r}, y)$, our failure types are: wrong-predicate (WP) for a KG conflict, insufficient-evidence (IE) when the retrieved evidence does not entail the query ($e^{q} = \textsf{neutral}$), wrong-response (WR) when the response is not entailed by the retrieved evidence ($e^{r} \in \{\textsf{neutral}, \textsf{contradict}\}$), and label–evidence mismatch (LEM) for FEVER when the predicted $y$ conflicts with the query-context entailment. For FEVER, IE and LEM cases emit the abstention label *Unverified* rather than a potentially correct label for the wrong reason, improving trustworthiness and prioritizing retrieval-oriented repairs (evidence-gated label prediction).

(2) Adaptive Repair via Contextual Bandits. D2R-RAG performs single-shot repair by selecting a patch and re-running the RAG pipeline. We formulate this as a contextual multi-armed bandit (CMAB) [langford2007epoch]: the learner observes a context vector $\mathbf{x} \in \mathbb{R}^{d}$, chooses an action $a \in \mathcal{A}$, and receives a reward reflecting output quality and resource cost. The action space $\mathcal{A}$ is intentionally small and black-box deployable: (1) prompt-level rewrites (paraphrasing and simplification) to reduce query ambiguity [cho2025querybandits]; (2) reranker activation using a cross-encoder to improve evidence precision; and (3) retrieval-level interventions adjusting depth $k$, switching between BM25 and dense retrieval, or refreshing the index. We learn the repair policy with LinUCB [li2010contextual], which assumes a linear reward model $\hat{r}(a \mid \mathbf{x}) = \mathbf{x}^{\top}\bm{\theta}_{a}$ and selects actions by maximizing an upper-confidence bound (more details in Appendix [B](https://arxiv.org/html/2606.29377#A2)):

$$a^{*} = \arg\max_{a \in \mathcal{A}} \left( \mathbf{x}^{\top}\bm{\theta}_{a} + \alpha \sqrt{\mathbf{x}^{\top}\mathbf{A}_{a}^{-1}\mathbf{x}} \right) \tag{1}$$

where $\alpha$ controls exploration and $\mathbf{A}_{a}$ is the action-specific covariance matrix. After applying $a^{*}$, the revised output and measured action cost define the observed reward. We use a resource-aware reward scoring repairs by output reliability and budget compliance. Let $L_{a}$ and $V_{a}$ denote the latency and additional VRAM usage incurred by action $a$, with per-action budgets $B_{L}$ and $B_{V}$. The reward is:

$$r(a) = \frac{1}{4}\left( \mathbbm{1}_{NF} + \mathbbm{1}_{KG} + 2 \cdot \mathbbm{1}_{NLI} \right) \prod_{x \in \{L, V\}} \mathbbm{1}\!\left[ x_{a} \leq B_{x} \right] \!\left( 1 - \frac{x_{a}}{B_{x}} \right) \tag{2}$$

where $\mathbbm{1}_{NF}$ indicates a NoFailure diagnosis after repair, $\mathbbm{1}_{KG}$ indicates triple-level consistency, and $\mathbbm{1}_{NLI}$ indicates entailment support. Hard gates $\mathbbm{1}[x_{a} \leq B_{x}]$ zero out rewards for budget-violating actions, while multiplicative terms encourage cheaper actions among feasible ones. Following each interaction, LinUCB updates only the selected action:

$$\mathbf{A}_{a} \leftarrow \mathbf{A}_{a} + \mathbf{x}\mathbf{x}^{\top}, \qquad \mathbf{b}_{a} \leftarrow \mathbf{b}_{a} + r(a)\,\mathbf{x}, \qquad \bm{\theta}_{a} \leftarrow \mathbf{A}_{a}^{-1}\mathbf{b}_{a} \tag{3}$$
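The selection rule in Eq. 1 and the update in Eq. 3 can be sketched in a few lines of dependency-free Python. This is a minimal illustration, not the authors' implementation: the class name, method names, and the action labels in the usage note are hypothetical. As an implementation choice that does not come from the paper, the sketch stores $\mathbf{A}_{a}^{-1}$ directly and applies the Sherman-Morrison identity, which is algebraically equivalent to adding $\mathbf{x}\mathbf{x}^{\top}$ to $\mathbf{A}_{a}$ and re-inverting.

```python
import math


class LinUCB:
    """Disjoint LinUCB sketch: one linear reward model per repair action."""

    def __init__(self, actions, dim, alpha=2.0):
        self.actions = list(actions)
        self.alpha = alpha
        # A_a starts as the identity, so its inverse is the identity too.
        self.A_inv = {
            a: [[float(i == j) for j in range(dim)] for i in range(dim)]
            for a in self.actions
        }
        self.b = {a: [0.0] * dim for a in self.actions}

    @staticmethod
    def _matvec(M, v):
        return [sum(m * x for m, x in zip(row, v)) for row in M]

    @staticmethod
    def _dot(u, v):
        return sum(p * q for p, q in zip(u, v))

    def theta(self, a):
        # theta_a = A_a^{-1} b_a (last assignment in Eq. 3).
        return self._matvec(self.A_inv[a], self.b[a])

    def ucb(self, a, x):
        # Eq. 1: x^T theta_a + alpha * sqrt(x^T A_a^{-1} x).
        mean = self._dot(x, self.theta(a))
        var = self._dot(x, self._matvec(self.A_inv[a], x))
        return mean + self.alpha * math.sqrt(max(var, 0.0))

    def select(self, x):
        # Ties are broken in favour of the earliest action in the list.
        return max(self.actions, key=lambda a: self.ucb(a, x))

    def update(self, a, x, reward):
        # Eq. 3, applied to the selected action only. Sherman-Morrison turns
        # the rank-one update A_a <- A_a + x x^T into an update of A_a^{-1}.
        Ax = self._matvec(self.A_inv[a], x)
        denom = 1.0 + self._dot(x, Ax)
        for i in range(len(x)):
            for j in range(len(x)):
                self.A_inv[a][i][j] -= Ax[i] * Ax[j] / denom
        self.b[a] = [bi + reward * xi for bi, xi in zip(self.b[a], x)]
```

For example, `policy = LinUCB(["rewrite", "rerank", "deeper_retrieval"], dim=2)` followed by `policy.update("rerank", [1.0, 0.0], 0.5)` shrinks the confidence width of `rerank` along the first feature, so an untried action with a wider bound is preferred at the next `policy.select([1.0, 0.0])`.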

## 3. Experiments

We investigate three research questions. RQ1: Does D2R-RAG improve factual quality while respecting deployment costs compared to baselines? RQ2: Does the policy adapt interventions to specific failure modes rather than using a uniform strategy? RQ3: How does the reward formulation in D2R-RAG determine the learned repair strategy, and how stable is this under varying latency/VRAM budgets?

### 3.1. Experimental Setup

Datasets. We evaluate D2R-RAG on FEVER [Thorne18Fever] (2,000 claim verification instances) and HotpotQA [yang2018hotpotqa] (1,000 multi-hop question answering instances).

Baselines. We evaluate D2R-RAG against Naive-RAG (a single-pass RAG), Query Paraphrase [deng2023rephrase] (rewriting the query conditioned on the retrieved context), Context Expansion (increasing dense retrieval depth from $k$ to $k' = 20$), and Thompson Sampling [Thompson1933thompson] (replacing LinUCB with a context-free policy). All methods use a GPT-4o-mini generator and 128-token chunks with 16-token overlap, within LlamaIndex.

Evaluation Metrics. We report Exact Match (EM) for HotpotQA, Accuracy (ACC) for FEVER, and response Relevance and Faithfulness. Efficiency is measured by Latency (retrieval to final output) and peak VRAM (maximum GPU memory during execution).

Implementation Settings. We use DeBERTa-v3-large-nli for entailment checks and Babelscape/rebel-large for triple extraction. We impose budgets of 3s latency and 6 MB VRAM per query. The repair policy is trained online for two epochs, with exploration parameter $\alpha = 2$.
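To make these budgets concrete, the reward of Eq. 2 can be evaluated with a short function. This is a sketch under stated assumptions: the function name and argument names are hypothetical, the defaults are the 3s and 6 MB budgets quoted above, and the measurements in the usage note are invented illustration values, not results from the paper.

```python
def repair_reward(no_failure, kg_consistent, nli_entailed,
                  latency_s, vram_mb,
                  budget_latency_s=3.0, budget_vram_mb=6.0):
    """Resource-aware reward of Eq. 2 for a single repair action."""
    # Quality term: (1_NF + 1_KG + 2 * 1_NLI) / 4, which lies in [0, 1].
    quality = (int(no_failure) + int(kg_consistent) + 2 * int(nli_entailed)) / 4
    cost_factor = 1.0
    for used, budget in ((latency_s, budget_latency_s),
                         (vram_mb, budget_vram_mb)):
        if used > budget:
            return 0.0  # hard gate: a budget-violating action earns nothing
        cost_factor *= 1.0 - used / budget  # soft term: cheaper is better
    return quality * cost_factor
```

With hypothetical measurements of 1.5s and 1.5 MB, a repair that passes all three checks scores (1 - 0.5) × (1 - 0.25) = 0.375, a repair that only restores entailment support scores half of that, and any repair taking 3.5s scores 0 regardless of quality. Note that an action that uses its entire budget also scores 0, since the factor $1 - x_{a}/B_{x}$ vanishes at $x_{a} = B_{x}$.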

### 3.2. Results

RQ1. Table [1](https://arxiv.org/html/2606.29377#S3.T1) shows that D2R-RAG improves factual quality while maintaining efficiency constraints; bold and underlined values represent best and second-best results per dataset/metric, respectively. On FEVER, it improves ACC from 56.3% (Naive-RAG) to 61.5% (Thompson Sampling) and 60.8% (LinUCB), with Thompson Sampling achieving the best cost profile (1.14s, 0.17 MB) compared to LinUCB (1.47s, 0.36 MB). On HotpotQA, Thompson Sampling reaches 40.4% EM with 71.52% faithfulness, while LinUCB reaches 39.8% EM at higher latency (3.41s vs 2.37s). LinUCB can exceed the latency budget (3.41s against the 3s limit), since budget constraints are enforced per individual repair action rather than end-to-end. Overall, LinUCB prioritizes quality at higher resource cost, while Thompson Sampling offers a stronger quality–efficiency trade-off (see Appendix [C](https://arxiv.org/html/2606.29377#A3)).

| Dataset | Method | Latency (s) | VRAM (MB) | Relevance (%) | Faithfulness (%) | ACC (%) | EM (%) |
|---|---|---|---|---|---|---|---|
| FEVER | Naive-RAG | 1.16 | 0.20 | 50.16 | 91.67 | 56.3 | - |
| FEVER | Query Paraphrase | 1.32 | 0.20 | 50.09 | 92.69 | 56.7 | - |
| FEVER | Context Expansion | 1.35 | 0.20 | 50.24 | 92.82 | 61.8 | - |
| FEVER | D2R-RAG (LinUCB) | 1.47 | 0.36 | 50.80 | 92.34 | 60.8 | - |
| FEVER | D2R-RAG (T.Sampling) | 1.14 | 0.17 | 50.00 | 92.37 | 61.5 | - |
| HotpotQA | Naive-RAG | 1.52 | 0.36 | 28.39 | 68.58 | - | 36.2 |
| HotpotQA | Query Paraphrase | 1.60 | 0.36 | 27.90 | 74.65 | - | 39.8 |
| HotpotQA | Context Expansion | 1.52 | 0.36 | 32.14 | 70.70 | - | 40.6 |
| HotpotQA | D2R-RAG (LinUCB) | 3.41 | 1.04 | 31.66 | 69.89 | - | 39.8 |
| HotpotQA | D2R-RAG (T.Sampling) | 2.37 | 1.23 | 30.69 | 71.52 | - | 40.4 |

Table 1. Comparison of D2R-RAG and baselines on FEVER and HotpotQA.

RQ2. Figure [2](https://arxiv.org/html/2606.29377#S3.F2) shows that D2R-RAG conditions actions on failure type. On FEVER, IE cases favor retrieval escalation ($\sim$60–80% on deeper dense/BM25), while on HotpotQA, WR and WP failures route to reranker activation, indicating that selecting the right passages (not merely retrieving more) is critical for multi-hop questions. LinUCB explores a broader action distribution, while Thompson Sampling exploits more (concentrates on high-utility repairs), yet both preserve the same failure-to-action mapping.

![Refer to caption](https://arxiv.org/html/2606.29377v1/x2.png) (a) LinUCB on FEVER
![Refer to caption](https://arxiv.org/html/2606.29377v1/x3.png) (b) Thompson Sampling on FEVER
![Refer to caption](https://arxiv.org/html/2606.29377v1/x4.png) (c) LinUCB on HotpotQA
![Refer to caption](https://arxiv.org/html/2606.29377v1/x5.png) (d) Thompson Sampling on HotpotQA

Figure 2. Analysis of failure type frequency exposed by D2R-RAG.

RQ3. Table [2](https://arxiv.org/html/2606.29377#S3.T2) ablates the two mechanisms that encode deployment constraints: Unweighted, which removes the soft resource-weighting terms, and Unconstrained, which removes the hard budget gates. On HotpotQA, Unconstrained reduces latency (3.41 $\rightarrow$ 2.03s) and VRAM (1.04 $\rightarrow$ 0.84 MB) with minimal EM drop, while Unweighted improves quality metrics (Relevance 31.66 $\rightarrow$ 32.20%; Faithfulness 69.89 $\rightarrow$ 70.59%) with near-identical EM. On FEVER, both ablations improve ACC (60.8 $\rightarrow$ 61.3–61.4%) and reduce memory. Budget sensitivity tests (Table [3](https://arxiv.org/html/2606.29377#S3.T3)) show that Stringent (0.7$\times$) cuts VRAM on both datasets while preserving performance, and that Relaxed (1.5$\times$) lowers latency and boosts faithfulness. Notably, D2R-RAG achieves the highest HotpotQA EM at highest cost, demonstrating that budget shaping controls quality–cost trade-offs. More results are provided in Appendix [C](https://arxiv.org/html/2606.29377#A3).

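| Dataset | Variant | Latency (s) | VRAM (MB) | Relevance (%) | Faithfulness (%) | ACC (%) | EM (%) |
|---|---|---|---|---|---|---|---|
| FEVER | Unconstrained | 1.40 | 0.27 | 50.91 | 92.61 | 61.3 | - |
| FEVER | Unweighted | 1.46 | 0.19 | 50.64 | 92.62 | 61.4 | - |
| FEVER | D2R-RAG | 1.47 | 0.36 | 50.80 | 92.34 | 60.8 | - |
| HotpotQA | Unconstrained | 2.03 | 0.84 | 30.36 | 69.46 | - | 39.0 |
| HotpotQA | Unweighted | 2.79 | 0.68 | 32.20 | 70.59 | - | 39.4 |
| HotpotQA | D2R-RAG | 3.41 | 1.04 | 31.66 | 69.89 | - | 39.8 |

Table 2. Overall performance comparison of reward variants.

| Dataset | Variant | Latency (s) | VRAM (MB) | Relevance (%) | Faithfulness (%) | ACC (%) | EM (%) |
|---|---|---|---|---|---|---|---|
| FEVER | Stringent | 2.22 | 0.24 | 51.01 | 92.80 | 60.6 | - |
| FEVER | Relaxed | 1.42 | 0.32 | 50.63 | 92.76 | 60.6 | - |
| FEVER | D2R-RAG | 1.47 | 0.36 | 50.80 | 92.34 | 60.8 | - |
| HotpotQA | Stringent | 2.23 | 0.66 | 29.99 | 71.46 | - | 39.2 |
| HotpotQA | Relaxed | 2.31 | 0.72 | 30.81 | 71.57 | - | 39.4 |
| HotpotQA | D2R-RAG | 3.41 | 1.04 | 31.66 | 69.89 | - | 39.8 |

Table 3. Overall performance comparison across budget variants.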

## 4. Conclusion

We introduced D2R-RAG, a repair framework for black-box RAG systems with lightweight diagnostics and adaptive patch decisions. Experiments suggest that reliability improves most when repair is targeted to specific failure signatures. Limitations include: (1) diagnostic signal imperfections, since KG alignment and NLI models can struggle with noisy evidence and multi-hop reasoning, cascading errors into suboptimal repairs; and (2) a policy constrained to the exposed repair actions, which limits its capability and can lead it to overuse costly patches when cheaper alternatives are unavailable.

## GenAI Usage Disclosure

OpenAI’s ChatGPT was used to refine sentence clarity and grammatical correctness during manuscript preparation.

## Acknowledgments

This work was supported in part by the NSERC-CSE Research Community Grants (ALLRP 598786-24), NSERC Canada Research Chair Grant (CRC-2024-00017), and the National Cybersecurity Consortium (2025-1601) projects. Researchers funded through the NSERC-CSE Research Communities Grants do not represent the Communications Security Establishment Canada or the Government of Canada. Any research, opinions or positions they produce as part of this initiative do not represent the official views of the Government of Canada.

## Appendix A. Diagnostic Signals

We derive failure signatures from three complementary diagnostic signals. The first and second are semantic support via textual entailment: an NLI model evaluates whether retrieved passages address the query and whether the response is supported by evidence. Concretely, each passage in $E$ is treated as a premise and the query (or response) as a hypothesis; passage-level predictions are aggregated into two coarse entailment states $e^{q}$ and $e^{r}$. Intuitively, $e^{q}$ captures whether retrieved evidence is relevant and sufficient for the information need, while $e^{r}$ captures whether the generated output remains faithful to the evidence. The third source is structured consistency derived from relational triples. Entailment signals can miss fine-grained relational errors, e.g., when the response mentions the correct entities but asserts an incorrect predicate or swaps roles. To capture such errors, we extract relations from the response and check whether the stated relations align with the underlying knowledge source. In practice, this signal is most useful for distinguishing “topically plausible but relationally wrong” answers from answers that are merely missing evidence. Algorithm [1](https://arxiv.org/html/2606.29377#alg1) summarizes the rule-based diagnosis.
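The passage-level aggregation described above can be illustrated as follows. The text does not state the aggregation rule, so the priority order used here (any entailing passage wins, then any contradicting passage, otherwise neutral) is purely an assumption, as are the function name and the `classify` callable, which stands in for an NLI model that maps a (premise, hypothesis) pair to one of three labels.

```python
from typing import Callable, Iterable


def aggregate_nli(passages: Iterable[str], hypothesis: str,
                  classify: Callable[[str, str], str]) -> str:
    """Collapse per-passage NLI labels into one coarse entailment state.

    ASSUMPTION: the priority entail > contradict > neutral is illustrative;
    the source only says that passage-level predictions are aggregated.
    """
    # Each retrieved passage is the premise; the query or response is the
    # hypothesis, as described in Appendix A.
    labels = {classify(premise, hypothesis) for premise in passages}
    if "entail" in labels:
        return "entail"
    if "contradict" in labels:
        return "contradict"
    return "neutral"  # also covers an empty evidence set
```

Calling the function once with the query as hypothesis gives $e^{q}$, and once with the response gives $e^{r}$. Other rules, such as majority voting or confidence thresholds, would fit the same interface.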

## Appendix B. Context Representation and Training Process

To form the context vector $\mathbf{x}$ used by the repair policy, in our implementation, we concatenate: (1) a semantic representation of the query, (2) the diagnosed failure type, (3) the entailment and triple-alignment statuses, and (4) normalized remaining latency and VRAM budgets. This representation supports cost-sensitive choices: evidence-insufficiency signatures favor retrieval-side escalation, while evidence-present but unsupported-response signatures prioritize higher-precision evidence selection (e.g., reranking) or query rewrites. Algorithm [2](https://arxiv.org/html/2606.29377#alg2) summarizes the training procedure. The process is one-shot: for each diagnosed failure, the policy selects a single action and performs exactly one additional RAG pass.
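One way to assemble such a context vector is sketched below. The four ingredients follow the list above, but the encoding is an assumption: one-hot vectors for the categorical signals, division by the per-query budget for normalization, and all names are hypothetical. The KG status vocabulary mirrors the rows of Table 4, and the default budgets are the 3s and 6 MB limits from Section 3.1.

```python
FAILURE_TYPES = ("WP", "IE", "WR", "LEM")
NLI_STATES = ("entail", "neutral", "contradict")
KG_STATES = ("consistent", "conflict", "missing", "no_triplet")


def one_hot(value, vocabulary):
    """Return a one-hot list; unknown values map to the all-zero vector."""
    return [1.0 if value == v else 0.0 for v in vocabulary]


def build_context(query_embedding, failure_type, e_q, e_r, kappa,
                  remaining_latency_s, remaining_vram_mb,
                  budget_latency_s=3.0, budget_vram_mb=6.0):
    """Concatenate the four feature groups into the context vector x."""
    return (
        list(query_embedding)                     # (1) query semantics
        + one_hot(failure_type, FAILURE_TYPES)    # (2) diagnosed failure type
        + one_hot(e_q, NLI_STATES)                # (3) query entailment
        + one_hot(e_r, NLI_STATES)                #     response entailment
        + one_hot(kappa, KG_STATES)               #     triple-alignment status
        + [remaining_latency_s / budget_latency_s,  # (4) normalized budgets
           remaining_vram_mb / budget_vram_mb]
    )
```

With a two-dimensional placeholder embedding, `build_context([0.1, 0.2], "IE", "neutral", "entail", "missing", 1.5, 3.0)` yields an 18-dimensional vector whose last two entries are both 0.5. In practice the query embedding would come from a sentence encoder, which is left out here.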

Algorithm 1. Failure Diagnosis

Input: query $q$, evidence set $E$, response $r$, predicted task label $y$

Output: failure type $f \in \{\textsf{NoFail}, \textsf{WP}, \textsf{IE}, \textsf{WR}, \textsf{LEM}\}$

- $\kappa \leftarrow \textsc{TripleAlignStatus}(r)$
- $e^{q} \leftarrow \textsc{AggregateNLI}(E, q)$
- $e^{r} \leftarrow \textsc{AggregateNLI}(E, r)$
- if $\kappa = \textsf{conflict}$ then
  - return WP
- end if
- if $e^{q} = \textsf{neutral}$ then
  - return IE
- end if
- if $e^{r} \neq \textsf{entail}$ then
  - return WR
- end if
- if $y$ is provided and $\textsc{LabelCompatible}(y, e^{q}) = \textsf{false}$ then
  - return LEM
- end if
- return NoFailure
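The rule cascade of Algorithm 1 translates almost line for line into Python. In this sketch the three signals are passed in precomputed rather than derived from $(q, E, r)$, and the function names are hypothetical. The paper does not define `LabelCompatible`, so the FEVER label mapping below (SUPPORTS with entail, REFUTES with contradict) is an assumption made only for illustration.

```python
from typing import Optional

# ASSUMPTION: illustrative mapping from FEVER labels to NLI states.
FEVER_LABEL_TO_NLI = {"SUPPORTS": "entail", "REFUTES": "contradict"}


def label_compatible(y: str, e_q: str) -> bool:
    """Hypothetical stand-in for LabelCompatible(y, e^q)."""
    return FEVER_LABEL_TO_NLI.get(y) == e_q


def diagnose(kappa: str, e_q: str, e_r: str, y: Optional[str] = None) -> str:
    """Return the first failure type whose rule fires, in Algorithm 1 order."""
    if kappa == "conflict":
        return "WP"   # response triples conflict with the knowledge graph
    if e_q == "neutral":
        return "IE"   # retrieved evidence does not address the query
    if e_r != "entail":
        return "WR"   # response is not supported by the evidence
    if y is not None and not label_compatible(y, e_q):
        return "LEM"  # predicted label disagrees with the evidence
    return "NoFailure"
```

Because the checks are ordered, a KG conflict masks every later rule: `diagnose("conflict", "neutral", "contradict")` returns `"WP"` even though the IE and WR conditions also hold.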

Algorithm 2. Bandit Policy Learning

Input: dataset $\mathcal{D}$, default RAG configuration $P$

Output: learned bandit parameters $\{\mathbf{A}_{a}, \mathbf{b}_{a}\}_{a \in \mathcal{A}}$ (equivalently $\{\bm{\theta}_{a}\}_{a \in \mathcal{A}}$)

- Initialize $\mathbf{A}_{a} \leftarrow \mathbf{I}$, $\mathbf{b}_{a} \leftarrow \mathbf{0}$ for all $a \in \mathcal{A}$
- for each epoch do
  - for each query $q \in \mathcal{D}$ do
    - $(r, y, E) \leftarrow \textsc{RAG}(q, P)$
    - $f \leftarrow \textsc{DetectFailure}(q, E, r, y)$
    - if $f \neq \textsf{NoFailure}$ then
      - $\mathbf{x} \leftarrow \textsc{BuildContext}(q, f, E, r)$
      - $a \leftarrow \textsc{SelectAction}(\mathbf{x}, \{\mathbf{A}_{a}, \mathbf{b}_{a}\})$
      - $P' \leftarrow \textsc{ApplyPatch}(P, a)$
      - $(r', y', E') \leftarrow \textsc{RAG}(q, P')$
      - $f' \leftarrow \textsc{DetectFailure}(q, E', r', y')$
      - $r(a) \leftarrow \textsc{ComputeReward}(f', L_{a}, V_{a})$
      - $\textsc{UpdateLinUCB}(a, \mathbf{x}, r(a))$
    - end if
  - end for
- end for
- return $\{\mathbf{A}_{a}, \mathbf{b}_{a}\}_{a \in \mathcal{A}}$
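The control flow of Algorithm 2 can be sketched as a plain Python loop with every pipeline component injected as a callable. All parameter names are hypothetical stand-ins. Two signatures are assumptions that the algorithm leaves implicit: `run_rag` is taken to return the measured latency and VRAM alongside the output, and `policy` is taken to expose `select(x)` and `update(action, x, reward)`.

```python
def train_repair_policy(dataset, default_config, run_rag, detect_failure,
                        build_context, apply_patch, compute_reward, policy,
                        epochs=2):
    """One-shot repair loop in the shape of Algorithm 2.

    ASSUMPTIONS: run_rag(query, config) returns
    (response, label, evidence, latency_s, vram_mb); policy has
    select(x) and update(action, x, reward). Returns a log of repairs.
    """
    history = []
    for _ in range(epochs):
        for query in dataset:
            response, label, evidence, _, _ = run_rag(query, default_config)
            failure = detect_failure(query, evidence, response, label)
            if failure == "NoFailure":
                continue  # nothing to repair, so no bandit update
            x = build_context(query, failure, evidence, response)
            action = policy.select(x)
            patched_config = apply_patch(default_config, action)
            # Exactly one additional RAG pass per diagnosed failure.
            response, label, evidence, latency_s, vram_mb = run_rag(
                query, patched_config)
            failure_after = detect_failure(query, evidence, response, label)
            reward = compute_reward(failure_after, latency_s, vram_mb)
            policy.update(action, x, reward)  # only the chosen arm changes
            history.append((query, failure, action, reward))
    return history
```

Keeping the components injectable means the same loop works with LinUCB or with a context-free policy such as the Thompson Sampling baseline, as long as the policy object offers the two assumed methods.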

## Appendix C. Diagnostic Analysis and Budget Sensitivity

Regarding factual quality, Table [4](https://arxiv.org/html/2606.29377#A3.T4) shows that correct predictions concentrate in entailment-supportive regimes: on HotpotQA, both policies yield 170+ correct answers under response-entailment versus far fewer under contradiction, while on FEVER, most correct predictions fall under entailed response with contradiction states remaining rare and mostly incorrect. Frequent missing KG alignment indicates triple extraction falls short for many queries, yet policies still succeed when entailment signals provide adequate support.

| Diagnostic | Status | HotpotQA LinUCB | HotpotQA T.Sampling | FEVER LinUCB | FEVER T.Sampling |
|---|---|---|---|---|---|
| KG Alignment | Consistent | 80 (65) | 73 (60) | 112 (81) | 111 (75) |
| KG Alignment | Conflict | 9 (16) | 8 (19) | 0 (0) | 0 (0) |
| KG Alignment | Missing | 110 (220) | 121 (219) | 494 (309) | 502 (308) |
| KG Alignment | No Triplet | 0 (0) | 0 (0) | 2 (2) | 2 (2) |
| Query Entailment | Entail | 115 (148) | 112 (155) | 350 (203) | 351 (195) |
| Query Entailment | Contradict | 59 (107) | 57 (95) | 258 (139) | 264 (140) |
| Query Entailment | Neutral | 25 (46) | 33 (48) | 0 (50) | 0 (50) |
| Resp. Entailment | Entail | 173 (229) | 174 (230) | 600 (377) | 611 (375) |
| Resp. Entailment | Contradict | 7 (30) | 6 (22) | 8 (10) | 4 (9) |
| Resp. Entailment | Neutral | 19 (42) | 22 (46) | 0 (5) | 0 (1) |

Table 4. Diagnostic signal counts. Each cell reports the number of correct (incorrect) instances, where correctness means EM=1 on HotpotQA and label accuracy (ACC)=1 on FEVER.

Further budget sensitivity analysis (Figure [3](https://arxiv.org/html/2606.29377#A3.F3)) shows that under tight budgets, the policy relies more heavily on low-cost retrieval adjustments, especially on HotpotQA.

![Refer to caption](https://arxiv.org/html/2606.29377v1/x6.png) (a) Stringent variant on FEVER
![Refer to caption](https://arxiv.org/html/2606.29377v1/x7.png) (b) Relaxed variant on FEVER
![Refer to caption](https://arxiv.org/html/2606.29377v1/x8.png) (c) Stringent variant on HotpotQA
![Refer to caption](https://arxiv.org/html/2606.29377v1/x9.png) (d) Relaxed variant on HotpotQA

Figure 3. Analysis of failure type frequency under stringent and relaxed budgets.