Diagnosing and Repairing Factual Errors in RAG under Budget Constraints
Summary
This paper proposes D2R-RAG, a model-agnostic and resource-aware framework that diagnoses and repairs factual errors in RAG systems under latency and VRAM constraints, achieving better accuracy-efficiency trade-offs on FEVER and HotpotQA.
# Diagnosing and Repairing Factual Errors in RAG under Budget Constraints
Source: [https://arxiv.org/html/2606.29377](https://arxiv.org/html/2606.29377)
###### Abstract\.
Retrieval-Augmented Generation (RAG) improves the factuality of large language models by grounding responses in external evidence, yet real-world deployments remain fragile. Failures often stem from missing or weakly relevant evidence, as well as from generation that does not faithfully reflect the retrieved context. Many existing approaches rely on fine-tuning, privileged access to internal model signals, or resource-insensitive escalation strategies, which limits their practicality in black-box and budget-constrained settings. We propose D2R-RAG (Diagnose-to-Repair RAG), a model-agnostic and resource-aware framework that combines lightweight failure diagnosis with adaptive repair. D2R-RAG derives interpretable failure signatures from observable signals in the query, retrieved evidence, and generated response, and then selects from a small set of corrective actions under explicit latency and VRAM constraints. Experiments on FEVER and HotpotQA show that D2R-RAG improves reliability over recent baselines and achieves better accuracy–efficiency trade-offs across multiple compute budgets. The code is available at [https://github.com/CyberScienceLab/D2R-RAG/](https://github.com/CyberScienceLab/D2R-RAG/).
###### keywords:
Keywords: RAG, Factuality Verification, Contextual Bandits, Adaptive Repair\.
Soroush Hashemifar¹, Havva Alizadeh Noughabi², Fattane Zarrinkalam¹˒²˒\*, Ali Dehghantanha²

¹ College of Engineering, University of Guelph, Guelph, ON, Canada

² Cyber Science Lab, School of Computer Science, University of Guelph, Guelph, ON, Canada

\* fzarrink@uoguelph.ca
## 1. Introduction
Retrieval-Augmented Generation (RAG) [lewis2020retrieval] has become a standard approach for improving the factuality of large language model (LLM) outputs by grounding generation in external evidence rather than relying solely on parametric memory. In many real deployments, however, RAG remains unreliable: the system can fail to produce a correct answer when the retriever misses key evidence, when retrieved passages are relevant but incomplete, or when the generator produces content that is not supported by the provided context [xie2024adaptive]. These failure modes are stochastic and non-uniform, and they are most problematic in settings with strict latency and hardware budgets (e.g., limited GPU memory, rate-limited APIs, or cost-constrained edge/cloud deployments), where iterative retrieval or resource-intensive reranking is computationally prohibitive [ray2025metis].
Recent work has proposed dynamic and self-correcting RAG variants from three directions: (1) training-time modifications, (2) inference-time control, and (3) learning-based decision policies. Self-RAG [asai2024self] trains generators to emit reflection control tokens that decide when to retrieve and how to critique intermediate outputs, improving factuality through self-assessment, but requires fine-tuning and architectural coupling that limits use in black-box or API-based settings. Moreover, DRAGIN [su2403dragin] uses token-level uncertainty and other internal dynamics to trigger retrieval during inference, offering lightweight, training-free control, yet relies on logit access (unavailable in closed-source LLM APIs). Finally, learning-based policies like MBA-RAG [tang2025mba] (multi-armed bandit balancing quality and efficiency, dependent on an external query-complexity predictor) and QueryBandits [cho2025querybandits] (bandit-based query rewriting using semantic and lexical signals) demonstrate that online decision-making can improve robustness, but typically target specific pipeline stages or require additional supervision or modeling components.
This paper addresses *model-agnostic, resource-aware RAG recovery in black-box settings*, arguing that reliable recovery requires two capabilities: a *lightweight diagnostic* that distinguishes retrieval-side evidence insufficiency from generation-side unfaithfulness using only observable artifacts, and selection of the *least-cost corrective action* under explicit latency and VRAM budgets. We operationalize these principles in D2R-RAG (Diagnose-to-Repair RAG), which performs triangulated verification by combining (1) textual entailment checks between retrieved evidence and both the query and response, with (2) structured consistency checks that align relational triples extracted from the response against a knowledge graph to identify entity/predicate mismatches. These signals yield an interpretable failure signature separating missing evidence from unsupported generation. Repair selection is then formulated as a contextual multi-armed bandit [langford2007epoch] where each arm corresponds to a specific intervention (query rewriting, retrieval depth or mode adjustment, cross-encoder reranking, or index refresh), and a LinUCB policy [li2010contextual] learns to balance expected factual recovery against computational cost. Notably, the framework requires neither generator fine-tuning nor access to internal logits, enabling deployment with closed-source LLM APIs.
We evaluate D2R-RAG on fact verification and multi-hop question answering using FEVER [Thorne18Fever] and HotpotQA [yang2018hotpotqa]. Our evaluation emphasizes both factual performance and operational efficiency under multiple latency/VRAM budget regimes. Beyond aggregate results, we analyze how the learned policy allocates high-cost interventions across different failure signatures and how performance changes as resource budgets tighten.
## 2. Methodology
Figure 1. Overview of D2R-RAG.

Figure [1](https://arxiv.org/html/2606.29377#S2.F1) gives an overview of D2R-RAG, which augments a RAG pipeline with two modules.
(1) Failure Diagnosis. This stage determines whether the initial RAG output is trustworthy and identifies the likely failure type. It uses only the query $q$, evidence $E$, response $r$, and predicted label $y$, all of which are observable in black-box deployment. The goal is to produce an interpretable failure signature that separates retrieval-side insufficiency from generation-side unfaithfulness, rather than perfectly explaining every error. We derive the signature using three complementary signals: query entailment $e^{q}$, response entailment $e^{r}$, and KG alignment status $\kappa$ [naveen2023nli]. See Appendix [A](https://arxiv.org/html/2606.29377#A1) for additional details.
Given $(\kappa, e^{q}, e^{r}, y)$, our failure types are: wrong-predicate (WP) for a KG conflict, insufficient-evidence (IE) when the retrieved evidence does not entail the query ($e^{q}=\textsf{neutral}$), wrong-response (WR) when the response is not entailed by the retrieved evidence ($e^{r}\in\{\textsf{neutral},\textsf{contradict}\}$), and label–evidence mismatch (LEM) for FEVER when the predicted $y$ conflicts with the query-context entailment. For FEVER, IE and LEM cases emit the abstention label *Unverified* rather than a label that might be correct for the wrong reason, which improves trustworthiness and prioritizes retrieval-oriented repairs (evidence-gated label prediction).
(2) Adaptive Repair via Contextual Bandits. D2R-RAG performs single-shot repair by selecting a patch and re-running the RAG pipeline. Patch selection is formulated as a contextual multi-armed bandit (CMAB) [langford2007epoch]: the learner observes a context vector $\mathbf{x}\in\mathbb{R}^{d}$, chooses an action $a\in\mathcal{A}$, and receives a reward reflecting output quality and resource cost. The action space $\mathcal{A}$ is intentionally small and black-box deployable: (1) prompt-level rewrites (paraphrasing and simplification) to reduce query ambiguity [cho2025querybandits]; (2) reranker activation using a cross-encoder to improve evidence precision; and (3) retrieval-level interventions adjusting depth $k$, switching between BM25 and dense retrieval, or refreshing the index. We learn the repair policy with LinUCB [li2010contextual], which assumes a linear reward model $\hat{r}(a\mid\mathbf{x})=\mathbf{x}^{\top}\bm{\theta}_{a}$ and selects actions by maximizing an upper-confidence bound (more details in Appendix [B](https://arxiv.org/html/2606.29377#A2)):
$$a^{*}=\arg\max_{a\in\mathcal{A}}\left(\mathbf{x}^{\top}\bm{\theta}_{a}+\alpha\sqrt{\mathbf{x}^{\top}\mathbf{A}_{a}^{-1}\mathbf{x}}\right), \tag{1}$$

where $\alpha$ controls exploration and $\mathbf{A}_{a}$ is the action-specific covariance matrix. After applying $a^{*}$, the revised output and measured action cost define the observed reward. We use a resource-aware reward that scores repairs by output reliability and budget compliance. Let $L_{a}$ and $V_{a}$ denote the latency and additional VRAM usage incurred by action $a$, with per-action budgets $B_{L}$ and $B_{V}$. The reward is:
$$r(a)=\frac{1}{4}\left(\mathbbm{1}_{NF}+\mathbbm{1}_{KG}+2\cdot\mathbbm{1}_{NLI}\right)\prod_{x\in\{L,V\}}\mathbbm{1}\left[x_{a}\leq B_{x}\right]\left(1-\frac{x_{a}}{B_{x}}\right), \tag{2}$$

where $\mathbbm{1}_{NF}$ indicates a NoFailure diagnosis after repair, $\mathbbm{1}_{KG}$ indicates triple-level consistency, and $\mathbbm{1}_{NLI}$ indicates entailment support. The hard gates $\mathbbm{1}[x_{a}\leq B_{x}]$ zero out rewards for budget-violating actions, while the multiplicative terms $(1-x_{a}/B_{x})$ encourage cheaper actions among feasible ones. Following each interaction, LinUCB updates only the selected action:
$$\mathbf{A}_{a}\leftarrow\mathbf{A}_{a}+\mathbf{x}\mathbf{x}^{\top},\qquad\mathbf{b}_{a}\leftarrow\mathbf{b}_{a}+r(a)\,\mathbf{x},\qquad\bm{\theta}_{a}\leftarrow\mathbf{A}_{a}^{-1}\mathbf{b}_{a}. \tag{3}$$
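The selection rule in Eq. (1) and the update in Eq. (3) can be sketched in a few lines of dependency-free Python. This is a minimal illustration, not the authors' implementation: the class name `LinUCB` and its method names are hypothetical, and instead of inverting $\mathbf{A}_{a}$ at every step the sketch maintains $\mathbf{A}_{a}^{-1}$ directly with the Sherman–Morrison identity, which is mathematically equivalent to the rank-one update in Eq. (3).

```python
import math


class LinUCB:
    """Disjoint LinUCB sketch: one linear reward model per repair action."""

    def __init__(self, n_actions, dim, alpha=2.0):
        self.alpha = alpha
        self.dim = dim
        # A_a starts as the identity, so its inverse is also the identity.
        self.A_inv = [[[1.0 if i == j else 0.0 for j in range(dim)]
                       for i in range(dim)] for _ in range(n_actions)]
        self.b = [[0.0] * dim for _ in range(n_actions)]

    @staticmethod
    def _matvec(M, v):
        return [sum(m * vi for m, vi in zip(row, v)) for row in M]

    @staticmethod
    def _dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    def ucb(self, action, x):
        """Upper-confidence score x^T theta_a + alpha * sqrt(x^T A_a^{-1} x)."""
        theta = self._matvec(self.A_inv[action], self.b[action])
        width = self._dot(x, self._matvec(self.A_inv[action], x))
        return self._dot(x, theta) + self.alpha * math.sqrt(width)

    def select(self, x):
        """Eq. (1): pick the action with the largest score (first on ties)."""
        scores = [self.ucb(a, x) for a in range(len(self.b))]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, action, x, reward):
        """Eq. (3) for the chosen action only, via Sherman-Morrison:
        (A + x x^T)^{-1} = A^{-1} - (A^{-1} x)(A^{-1} x)^T / (1 + x^T A^{-1} x).
        """
        A_inv = self.A_inv[action]
        u = self._matvec(A_inv, x)
        denom = 1.0 + self._dot(x, u)
        for i in range(self.dim):
            for j in range(self.dim):
                A_inv[i][j] -= u[i] * u[j] / denom
        self.b[action] = [bi + reward * xi
                          for bi, xi in zip(self.b[action], x)]
```

With every model initialized identically, the first call to `select` falls back to the lowest action index. After an action has been tried, its confidence width shrinks, so untried actions are preferred until their observed rewards say otherwise.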
## 3. Experiments
We investigate three research questions. RQ1: Does D2R-RAG improve factual quality while respecting deployment costs compared to baselines? RQ2: Does the policy adapt interventions to specific failure modes rather than using a uniform strategy? RQ3: How does the reward formulation in D2R-RAG determine the learned repair strategy, and how stable is this under varying latency/VRAM budgets?
### 3.1. Experimental Setup
Datasets. We evaluate D2R-RAG on FEVER [Thorne18Fever] (2,000 claim verification instances) and HotpotQA [yang2018hotpotqa] (1,000 multi-hop question answering instances).
Baselines. We evaluate D2R-RAG against Naive-RAG (a single-pass RAG), Query Paraphrase [deng2023rephrase] (rewriting the query conditioned on the retrieved context), Context Expansion (increasing dense retrieval depth from $k$ to $k'=20$), and Thompson Sampling [Thompson1933thompson] (replacing LinUCB with a context-free policy). All methods use a GPT-4o-mini generator and 128-token chunks with 16-token overlap, within LlamaIndex.
Evaluation Metrics. We report Exact Match (EM) for HotpotQA, Accuracy (ACC) for FEVER, and response Relevance and Faithfulness. Efficiency is measured by Latency (from retrieval to final output) and peak VRAM (maximum GPU memory during execution).
Implementation Settings. We use DeBERTa-v3-large-nli for entailment checks and Babelscape/rebel-large for triple extraction. We impose budgets of 3s latency and 6 MB VRAM per query. The repair policy is trained online for two epochs, with exploration parameter $\alpha=2$.
### 3.2. Results
RQ1. Table [1](https://arxiv.org/html/2606.29377#S3.T1) shows that D2R-RAG improves factual quality over Naive-RAG at a modest resource cost; bold and underlined values represent the best and second-best results per dataset and metric, respectively. On FEVER, it improves ACC from 56.3% (Naive-RAG) to 61.5% (Thompson Sampling) and 60.8% (LinUCB), with Thompson Sampling achieving the best cost profile (1.14s, 0.17 MB) compared to LinUCB (1.47s, 0.36 MB). On HotpotQA, Thompson Sampling reaches 40.4% EM with 71.52% faithfulness, while LinUCB reaches 39.8% EM at higher latency (3.41s vs 2.37s). LinUCB exceeds the 3s latency budget (3.41s), since budget constraints are enforced per individual repair action rather than end-to-end. Overall, LinUCB prioritizes quality at higher resource cost, while Thompson Sampling offers a stronger quality–efficiency trade-off (see Appendix [C](https://arxiv.org/html/2606.29377#A3)).

| Dataset | Method | Latency (s) | VRAM (MB) | Relevance | Faithfulness | ACC | EM |
|---|---|---|---|---|---|---|---|
| FEVER | Naive-RAG | 1.16 | 0.20 | 50.16 | 91.67 | 56.3 | - |
| FEVER | Query Paraphrase | 1.32 | 0.20 | 50.09 | 92.69 | 56.7 | - |
| FEVER | Context Expansion | 1.35 | 0.20 | 50.24 | 92.82 | 61.8 | - |
| FEVER | D2R-RAG (LinUCB) | 1.47 | 0.36 | 50.80 | 92.34 | 60.8 | - |
| FEVER | D2R-RAG (T.Sampling) | 1.14 | 0.17 | 50.00 | 92.37 | 61.5 | - |
| HotpotQA | Naive-RAG | 1.52 | 0.36 | 28.39 | 68.58 | - | 36.2 |
| HotpotQA | Query Paraphrase | 1.60 | 0.36 | 27.90 | 74.65 | - | 39.8 |
| HotpotQA | Context Expansion | 1.52 | 0.36 | 32.14 | 70.70 | - | 40.6 |
| HotpotQA | D2R-RAG (LinUCB) | 3.41 | 1.04 | 31.66 | 69.89 | - | 39.8 |
| HotpotQA | D2R-RAG (T.Sampling) | 2.37 | 1.23 | 30.69 | 71.52 | - | 40.4 |

Table 1. Comparison of D2R-RAG and baselines on FEVER and HotpotQA.

RQ2. Figure [2](https://arxiv.org/html/2606.29377#S3.F2) shows that D2R-RAG conditions actions on failure type. On FEVER, IE cases favor retrieval escalation ($\sim$60–80% on deeper dense/BM25), while on HotpotQA, WR and WP failures route to reranker activation, indicating that selecting the right passages (not merely retrieving more) is critical for multi-hop questions. LinUCB explores a broader action distribution, while Thompson Sampling exploits more (concentrating on high-utility repairs), yet both preserve the same failure-to-action mapping.
Figure 2. Analysis of failure type frequency exposed by D2R-RAG. Panels: (a) LinUCB on FEVER, (b) Thompson Sampling on FEVER, (c) LinUCB on HotpotQA, (d) Thompson Sampling on HotpotQA.

RQ3. Table [2](https://arxiv.org/html/2606.29377#S3.T2) ablates the two mechanisms that encode deployment constraints: Unweighted, which removes the soft resource-weighting terms, and Unconstrained, which removes the hard budget gates. On HotpotQA, Unconstrained reduces latency (3.41 → 2.03s) and VRAM (1.04 → 0.84 MB) with a minimal EM drop, while Unweighted improves quality metrics (Relevance 31.66 → 32.20%; Faithfulness 69.89 → 70.59%) with near-identical EM. On FEVER, both ablations improve ACC (60.8 → 61.3–61.4%) and reduce memory. Budget sensitivity tests (Table [3](https://arxiv.org/html/2606.29377#S3.T3)) show that Stringent (0.7×) cuts VRAM on both datasets while preserving performance, and Relaxed (1.5×) lowers latency and boosts faithfulness. Notably, D2R-RAG achieves the highest HotpotQA EM at the highest cost, demonstrating that budget shaping controls quality–cost trade-offs. More results are provided in Appendix [C](https://arxiv.org/html/2606.29377#A3).
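
| Dataset | Variant | Latency (s) | VRAM (MB) | Relevance | Faithfulness | ACC | EM |
|---|---|---|---|---|---|---|---|
| FEVER | Unconstrained | 1.40 | 0.27 | 50.91 | 92.61 | 61.3 | – |
| FEVER | Unweighted | 1.46 | 0.19 | 50.64 | 92.62 | 61.4 | – |
| FEVER | D2R-RAG | 1.47 | 0.36 | 50.80 | 92.34 | 60.8 | – |
| HotpotQA | Unconstrained | 2.03 | 0.84 | 30.36 | 69.46 | – | 39.0 |
| HotpotQA | Unweighted | 2.79 | 0.68 | 32.20 | 70.59 | – | 39.4 |
| HotpotQA | D2R-RAG | 3.41 | 1.04 | 31.66 | 69.89 | – | 39.8 |

Table 2. Overall performance comparison of reward variants.

| Dataset | Variant | Latency (s) | VRAM (MB) | Relevance | Faithfulness | ACC | EM |
|---|---|---|---|---|---|---|---|
| FEVER | Stringent | 2.22 | 0.24 | 51.01 | 92.80 | 60.6 | - |
| FEVER | Relaxed | 1.42 | 0.32 | 50.63 | 92.76 | 60.6 | - |
| FEVER | D2R-RAG | 1.47 | 0.36 | 50.80 | 92.34 | 60.8 | - |
| HotpotQA | Stringent | 2.23 | 0.66 | 29.99 | 71.46 | - | 39.2 |
| HotpotQA | Relaxed | 2.31 | 0.72 | 30.81 | 71.57 | - | 39.4 |
| HotpotQA | D2R-RAG | 3.41 | 1.04 | 31.66 | 69.89 | - | 39.8 |

Table 3. Overall performance comparison across budget variants.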
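To make the reward variants concrete, the following sketch expresses Eq. (2) as one function with two switches that correspond to the ablations: `soft_weight=False` for Unweighted and `hard_gate=False` for Unconstrained. The function name, argument names, and the two flags are hypothetical. The default budgets are the 3s and 6 MB values from the implementation settings. Clamping the soft term at zero when the hard gate is disabled is an assumption of this sketch, since the paper does not say how over-budget actions are scored in that variant.

```python
def repair_reward(no_failure, kg_consistent, nli_entailed, latency, vram,
                  latency_budget=3.0, vram_budget=6.0,
                  hard_gate=True, soft_weight=True):
    """Resource-aware reward of Eq. (2), with switches for the two ablations."""
    # Quality term: (1_NF + 1_KG + 2 * 1_NLI) / 4, in [0, 1].
    quality = (int(no_failure) + int(kg_consistent)
               + 2 * int(nli_entailed)) / 4.0

    factor = 1.0
    for used, budget in ((latency, latency_budget), (vram, vram_budget)):
        # Hard gate: any budget violation zeroes the reward.
        if hard_gate and used > budget:
            return 0.0
        # Soft weight: cheaper feasible actions keep more of the quality term.
        # The clamp is an assumption for the ungated variant only.
        if soft_weight:
            factor *= max(0.0, 1.0 - used / budget)
    return quality * factor
```

For example, a fully successful repair that uses 1.5s and 1.5 MB scores $1 \cdot (1-0.5)(1-0.25)=0.375$ under the full reward, and 1.0 with `soft_weight=False`. Note that under Eq. (2) an action that consumes exactly its budget passes the hard gate but still receives zero reward, because the soft term vanishes at $x_{a}=B_{x}$.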
## 4. Conclusion
We introduced D2R-RAG, a repair framework for black-box RAG systems with lightweight diagnostics and adaptive patch decisions. Experiments suggest that reliability improves most when repair is targeted to specific failure signatures. Limitations include: (1) diagnostic signal imperfections, since KG alignment and NLI models can struggle with noisy evidence and multi-hop reasoning, and their errors cascade into suboptimal repairs; and (2) the policy is constrained to the exposed repair actions, which limits what it can fix and can lead it to overuse costly patches when cheaper alternatives are unavailable.
## GenAI Usage Disclosure
OpenAI’s ChatGPT was used to refine sentence clarity and grammatical correctness during manuscript preparation\.
## Acknowledgments
This work was supported in part by the NSERC\-CSE Research Community Grants \(ALLRP 598786\-24\), NSERC Canada Research Chair Grant \(CRC\-2024\-00017\), and the National Cybersecurity Consortium \(2025\-1601\) projects\. Researchers funded through the NSERC\-CSE Research Communities Grants do not represent the Communications Security Establishment Canada or the Government of Canada\. Any research, opinions or positions they produce as part of this initiative do not represent the official views of the Government of Canada\.
## Appendix A. Diagnostic Signals
We derive failure signatures from three complementary diagnostic signals. The first and second are semantic support via textual entailment: an NLI model evaluates whether retrieved passages address the query and whether the response is supported by evidence. Concretely, each passage in $E$ is treated as a premise and the query (or response) as a hypothesis; passage-level predictions are aggregated into two coarse entailment states $e^{q}$ and $e^{r}$. Intuitively, $e^{q}$ captures whether retrieved evidence is relevant and sufficient for the information need, while $e^{r}$ captures whether the generated output remains faithful to the evidence. The third source is structured consistency derived from relational triples. Entailment signals can miss fine-grained relational errors, e.g., a response mentions the correct entities but asserts an incorrect predicate or swaps roles. To capture such errors, we extract relations from the response and check whether the stated relations align with the underlying knowledge source. In practice, this signal is most useful for distinguishing “topically plausible but relationally wrong” answers from answers that are merely missing evidence. Algorithm [1](https://arxiv.org/html/2606.29377#alg1) summarizes the rule-based diagnosis.
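The text does not specify how passage-level NLI predictions are aggregated into the coarse states $e^{q}$ and $e^{r}$. The sketch below shows one plausible rule, purely as an assumption: the strongest entailing passage wins if it clears a confidence threshold and is at least as confident as the strongest contradicting passage; otherwise a sufficiently confident contradiction wins; otherwise the state is neutral. The function name, the `(label, confidence)` input format, and the 0.5 threshold are all hypothetical.

```python
def aggregate_nli(passage_preds, threshold=0.5):
    """Collapse per-passage (label, confidence) pairs into one coarse state.

    Hypothetical aggregation rule; the source does not specify one.
    Labels are expected to be "entail", "neutral", or "contradict".
    """
    # Track the most confident entailing and contradicting passages.
    best = {"entail": 0.0, "contradict": 0.0}
    for label, confidence in passage_preds:
        if label in best:
            best[label] = max(best[label], confidence)

    if best["entail"] >= threshold and best["entail"] >= best["contradict"]:
        return "entail"
    if best["contradict"] >= threshold:
        return "contradict"
    # No passage supports or refutes the hypothesis strongly enough.
    return "neutral"
```

Under this rule a single strongly entailing passage is enough to mark the evidence as supportive, which fits retrieval settings where most of the top-$k$ passages are neutral toward any given hypothesis. An empty evidence set yields `"neutral"`, which the diagnosis rules treat as insufficient evidence when it occurs for $e^{q}$.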
## Appendix B. Context Representation and Training Process
To form the context vector $\mathbf{x}$ used by the repair policy, our implementation concatenates: (1) a semantic representation of the query, (2) the diagnosed failure type, (3) the entailment and triple-alignment statuses, and (4) the normalized remaining latency and VRAM budgets. This representation supports cost-sensitive choices: evidence-insufficiency signatures favor retrieval-side escalation, while evidence-present but unsupported-response signatures prioritize higher-precision evidence selection (e.g., reranking) or query rewrites. Algorithm [2](https://arxiv.org/html/2606.29377#alg2) summarizes the training procedure. The process is one-shot: for each diagnosed failure, the policy selects a single action and performs exactly one additional RAG pass.
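The concatenation described above can be sketched as follows. The paper lists the four components but not their encoding, so the one-hot encoding of categorical fields, the function names, and the clipping of budget fractions to $[0,1]$ are assumptions of this sketch. The KG status vocabulary follows the statuses reported in Table 4, and the default budgets are the 3s and 6 MB values from the implementation settings.

```python
FAILURE_TYPES = ("WP", "IE", "WR", "LEM")
NLI_STATES = ("entail", "neutral", "contradict")
KG_STATES = ("consistent", "conflict", "missing", "no_triplet")


def one_hot(value, vocabulary):
    """Encode a categorical value; raises ValueError for unknown values."""
    index = vocabulary.index(value)
    return [1.0 if i == index else 0.0 for i in range(len(vocabulary))]


def build_context(query_embedding, failure_type, e_q, e_r, kappa,
                  latency_left, vram_left,
                  latency_budget=3.0, vram_budget=6.0):
    """Concatenate query semantics, diagnosis, and remaining budgets."""
    # Remaining budgets as fractions of the full budget, clipped to [0, 1].
    budgets = [
        max(0.0, min(1.0, latency_left / latency_budget)),
        max(0.0, min(1.0, vram_left / vram_budget)),
    ]
    return (list(query_embedding)
            + one_hot(failure_type, FAILURE_TYPES)
            + one_hot(e_q, NLI_STATES)
            + one_hot(e_r, NLI_STATES)
            + one_hot(kappa, KG_STATES)
            + budgets)
```

With this layout the context dimension is the embedding size plus 16, and the failure signature occupies fixed positions, so a linear reward model can learn separate weights for each failure type and budget level.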
**Algorithm 1** Failure Diagnosis

1. **Input:** query $q$, evidence set $E$, response $r$, predicted task label $y$
2. **Output:** failure type $f\in\{\textsf{NoFailure},\textsf{WP},\textsf{IE},\textsf{WR},\textsf{LEM}\}$
3. $\kappa\leftarrow\textsc{TripleAlignStatus}(r)$
4. $e^{q}\leftarrow\textsc{AggregateNLI}(E,q)$
5. $e^{r}\leftarrow\textsc{AggregateNLI}(E,r)$
6. **if** $\kappa=\textsf{conflict}$ **then**
7. **return** WP
8. **end if**
9. **if** $e^{q}=\textsf{neutral}$ **then**
10. **return** IE
11. **end if**
12. **if** $e^{r}\neq\textsf{entail}$ **then**
13. **return** WR
14. **end if**
15. **if** $y$ is provided **and** $\textsc{LabelCompatible}(y,e^{q})=\textsf{false}$ **then**
16. **return** LEM
17. **end if**
18. **return** NoFailure

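The rule cascade of Algorithm 1 translates almost line for line into Python. In this sketch the diagnostic signals are passed in as precomputed strings. The label-compatibility function for FEVER (SUPPORTS pairs with entail, REFUTES with contradict) is an assumption, since the paper does not define `LabelCompatible`, and the helper `gated_label` illustrates the evidence-gated abstention described in Section 2. All function names are hypothetical.

```python
def fever_label_compatible(label, e_q):
    """Hypothetical compatibility rule between a FEVER label and e^q."""
    expected = {"SUPPORTS": "entail", "REFUTES": "contradict"}
    return expected.get(label) == e_q


def diagnose(kappa, e_q, e_r, label=None, label_compatible=None):
    """Return the first failure type whose rule fires, else "NoFailure"."""
    if kappa == "conflict":
        return "WP"   # response triples conflict with the knowledge graph
    if e_q == "neutral":
        return "IE"   # retrieved evidence does not address the query
    if e_r != "entail":
        return "WR"   # response is not supported by the evidence
    if (label is not None and label_compatible is not None
            and not label_compatible(label, e_q)):
        return "LEM"  # predicted label disagrees with query entailment
    return "NoFailure"


def gated_label(failure_type, label):
    """Evidence-gated prediction: abstain on IE and LEM failures."""
    return "Unverified" if failure_type in ("IE", "LEM") else label
```

Because the rules are checked in a fixed order, a response with both a KG conflict and neutral query entailment is reported as WP, so each instance receives exactly one failure type.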
**Algorithm 2** Bandit Policy Learning

1. **Input:** dataset $\mathcal{D}$, default RAG configuration $P$
2. **Output:** learned bandit parameters $\{\mathbf{A}_{a},\mathbf{b}_{a}\}_{a\in\mathcal{A}}$ (equivalently $\{\bm{\theta}_{a}\}_{a\in\mathcal{A}}$)
3. Initialize $\mathbf{A}_{a}\leftarrow\mathbf{I}$, $\mathbf{b}_{a}\leftarrow\mathbf{0}$ for all $a\in\mathcal{A}$
4. **for** each epoch **do**
5. **for** each query $q\in\mathcal{D}$ **do**
6. $(r,y,E)\leftarrow\textsc{RAG}(q,P)$
7. $f\leftarrow\textsc{DetectFailure}(q,E,r,y)$
8. **if** $f\neq\textsf{NoFailure}$ **then**
9. $\mathbf{x}\leftarrow\textsc{BuildContext}(q,f,E,r)$
10. $a\leftarrow\textsc{SelectAction}(\mathbf{x},\{\mathbf{A}_{a},\mathbf{b}_{a}\})$
11. $P'\leftarrow\textsc{ApplyPatch}(P,a)$
12. $(r',y',E')\leftarrow\textsc{RAG}(q,P')$
13. $f'\leftarrow\textsc{DetectFailure}(q,E',r',y')$
14. $r(a)\leftarrow\textsc{ComputeReward}(f',L_{a},V_{a})$
15. $\textsc{UpdateLinUCB}(a,\mathbf{x},r(a))$
16. **end if**
17. **end for**
18. **end for**
19. **return** $\{\mathbf{A}_{a},\mathbf{b}_{a}\}_{a\in\mathcal{A}}$

## Appendix C. Diagnostic Analysis and Budget Sensitivity
Regarding factual quality, Table [4](https://arxiv.org/html/2606.29377#A3.T4) shows that correct predictions concentrate in entailment-supportive regimes: on HotpotQA, both policies yield 170+ correct answers under response entailment versus far fewer under contradiction, while on FEVER, most correct predictions fall under an entailed response, with contradiction states remaining rare and mostly incorrect. The frequent Missing status for KG alignment indicates that triple extraction falls short for many queries, yet the policies still succeed when entailment signals provide adequate support.

| Diagnostic | Status | HotpotQA LinUCB | HotpotQA T.Sampling | FEVER LinUCB | FEVER T.Sampling |
|---|---|---|---|---|---|
| KG Alignment | Consistent | 80 (65) | 73 (60) | 112 (81) | 111 (75) |
| KG Alignment | Conflict | 9 (16) | 8 (19) | 0 (0) | 0 (0) |
| KG Alignment | Missing | 110 (220) | 121 (219) | 494 (309) | 502 (308) |
| KG Alignment | No Triplet | 0 (0) | 0 (0) | 2 (2) | 2 (2) |
| Query Entailment | Entail | 115 (148) | 112 (155) | 350 (203) | 351 (195) |
| Query Entailment | Contradict | 59 (107) | 57 (95) | 258 (139) | 264 (140) |
| Query Entailment | Neutral | 25 (46) | 33 (48) | 0 (50) | 0 (50) |
| Resp. Entailment | Entail | 173 (229) | 174 (230) | 600 (377) | 611 (375) |
| Resp. Entailment | Contradict | 7 (30) | 6 (22) | 8 (10) | 4 (9) |
| Resp. Entailment | Neutral | 19 (42) | 22 (46) | 0 (5) | 0 (1) |

Table 4. Diagnostic signal counts. Each cell reports the number of correct (incorrect) instances, where correctness means EM=1 on HotpotQA and label accuracy (ACC)=1 on FEVER.

Further budget sensitivity analysis (Figure [3](https://arxiv.org/html/2606.29377#A3.F3)) shows that under tight budgets, the policy relies more heavily on low-cost retrieval adjustments, especially on HotpotQA.
Figure 3. Analysis of failure type frequency under stringent and relaxed budgets. Panels: (a) Stringent variant on FEVER, (b) Relaxed variant on FEVER, (c) Stringent variant on HotpotQA, (d) Relaxed variant on HotpotQA.