AdaGATE: Adaptive Gap-Aware Token-Efficient Evidence Assembly for Multi-Hop Retrieval-Augmented Generation

arXiv cs.CL Papers

Summary

AdaGATE is a training-free evidence controller for multi-hop RAG that uses entity-centric gap tracking, micro-query generation, and utility-based selection to improve robustness under noisy retrieval, achieving state-of-the-art evidence F1 with fewer input tokens.

arXiv:2605.05245v1 Announce Type: new Abstract: Retrieval-augmented generation (RAG) remains brittle on multi-hop questions in realistic deployment settings, where retrieved evidence may be noisy or redundant and only limited context can be passed to the generator. Existing controllers address parts of this problem, but typically either expand context additively, select from a fixed top-k set, or optimize relevance without explicitly repairing missing bridge facts. We propose AdaGATE, a training-free evidence controller for multi-hop RAG that frames evidence selection as a token-constrained repair problem. AdaGATE combines entity centric gap tracking, targeted micro-query generation, and a utility based selection mechanism that balances gap coverage, corroboration, novelty, redundancy, and direct question relevance. We evaluate AdaGATE on HotpotQA under clean, redundancy, and noise injected retrieval conditions. Across all three settings, AdaGATE achieves the best evidence F1 among the compared controllers, reaching 62.3% on clean data and 71.2% under redundancy injection, while using 2.6x fewer input tokens than Adaptive-k. These results suggest that explicit gap-aware repair, combined with token-efficient evidence selection, improves robustness in multi-hop RAG under imperfect retrieval. Our code and evaluation pipeline are available at https://github.com/eliguo/AdaGATE.


# AdaGATE: Adaptive Gap-Aware Token-Efficient Evidence Assembly for Multi-Hop Retrieval-Augmented Generation
Source: [https://arxiv.org/html/2605.05245](https://arxiv.org/html/2605.05245)
Yilin Guo \(Center for Data Science, New York University, yg3030@nyu\.edu\); Yinshan Wang \(Tandon School of Engineering, New York University, yw9023@nyu\.edu\); Yixuan Wang \(Center for Data Science, New York University, yw8735@nyu\.edu\)

###### Abstract

Retrieval\-augmented generation \(RAG\) remains brittle on multi\-hop questions in realistic deployment settings, where retrieved evidence may be noisy or redundant and only limited context can be passed to the generator\. Existing controllers address parts of this problem, but typically either expand context additively, select from a fixed top\-$k$ set, or optimize relevance without explicitly repairing missing bridge facts\. We propose AdaGATE, a training\-free evidence controller for multi\-hop RAG that frames evidence selection as a token\-constrained repair problem\. AdaGATE combines entity centric gap tracking, targeted micro\-query generation, and a utility based selection mechanism that balances gap coverage, corroboration, novelty, redundancy, and direct question relevance\. We evaluate AdaGATE on HotpotQA under clean, redundancy, and noise injected retrieval conditions\. Across all three settings, AdaGATE achieves the best evidence F1 among the compared controllers, reaching 62\.3% on clean data and 71\.2% under redundancy injection, while using 2\.6× fewer input tokens than Adaptive\-$k$\. These results suggest that explicit gap\-aware repair, combined with token\-efficient evidence selection, improves robustness in multi\-hop RAG under imperfect retrieval\. Our code and evaluation pipeline are available at [https://github\.com/eliguo/AdaGATE](https://github.com/eliguo/AdaGATE)\.


## 1 Introduction

Retrieval\-augmented generation \(RAG\) improves large language models \(LLMs\) by conditioning generation on external documents, reducing hallucinations and improving factual accuracy \(Fan et al\., [2024](https://arxiv.org/html/2605.05245#bib.bib3)\)\. In realistic deployments, however, retrieved evidence is often noisy, redundant, or incomplete\. Under limited context budgets imposed by API costs and latency constraints \(Taguchi et al\., [2025](https://arxiv.org/html/2605.05245#bib.bib9); Peng et al\., [2025](https://arxiv.org/html/2605.05245#bib.bib7)\), RAG systems cannot simply pass all retrieved content to the generator\. This challenge is especially acute for multi\-hop questions, where answering often depends on assembling a small set of complementary passages: missing a bridge fact can cause failure, while admitting redundant or misleading evidence can distort the final answer\. These constraints motivate viewing multi\-hop RAG as a token\-constrained evidence assembly problem under imperfect retrieval\.

Prior work has shown that RAG performance depends not only on retrieval quality, but also on how retrieved evidence is selected and organized for generation\. Irrelevant passages can substantially degrade answer quality \(Cuconasu et al\., [2024](https://arxiv.org/html/2605.05245#bib.bib2)\), and dense retrievers often return clusters of nearly duplicate chunks that reduce coverage of the full reasoning chain\. Recent methods address different parts of this problem\. Self\-RAG trains an LLM to interleave generation with retrieval\-aware reflection \(Asai et al\., [2023](https://arxiv.org/html/2605.05245#bib.bib1)\)\. Adaptive\-$k$ selects a query specific number of passages based on similarity score gaps \(Taguchi et al\., [2025](https://arxiv.org/html/2605.05245#bib.bib9)\)\. SEAL\-RAG performs entity centric gap repair through targeted micro\-queries and replacement\-based updates \(Lahmy and Yozevitch, [2025](https://arxiv.org/html/2605.05245#bib.bib5)\)\. However, these approaches typically either expand context additively, operate over a fixed top\-$k$ set, or do not explicitly balance gap repair, redundancy, and context efficiency within a single evidence selection procedure\.

![Refer to caption](https://arxiv.org/html/2605.05245v1/x1.png)

Figure 1: AdaGATE framework overview\. Unlike SEAL\-RAG \(Lahmy and Yozevitch, [2025](https://arxiv.org/html/2605.05245#bib.bib5)\), AdaGATE's training\-free, gap\-aware controller explicitly enforces token efficiency on top of a fixed retriever and LLM\.

We propose AdaGATE \(Figure [1](https://arxiv.org/html/2605.05245#S1.F1)\), a training\-free controller that treats multi\-hop evidence selection as gap\-aware, token\-efficient repair\. AdaGATE maintains an entity centric ledger, issues targeted micro\-queries with a question aware fallback channel, and scores candidates with a utility function balancing gap coverage, corroboration, novelty, redundancy, and question relevance\. A utility adaptive capacity heuristic then assembles a compact evidence set under a global token budget\.

Evaluated on HotpotQA under clean, redundancy injection, and noise injection conditions against four baselines, AdaGATE achieves the highest evidence F1 across all three settings \(62\.3% clean, 71\.2% redundancy, 62\.7% noise\) while using 2\.6× fewer input tokens than Adaptive\-$k$\.

The main contributions of this work are: \(1\) We formulate multi\-hop RAG under imperfect retrieval as a token constrained evidence repair problem and clarify the limitations of fixed\-$k$ evidence controllers in noisy and redundant settings; \(2\) We introduce AdaGATE, a training\-free controller that combines entity centric gap tracking, utility based evidence scoring, and adaptive capacity control for compact evidence assembly; and \(3\) We develop a stress tested evaluation protocol on HotpotQA with controlled redundancy and noise injection, and compare controllers across answer quality, grounding, and token efficiency\.

## 2 Related Work

### 2\.1 Multi\-Hop RAG under Imperfect Retrieval

Standard RAG pipelines retrieve a fixed top\-$k$ set of passages and concatenate them with the query, implicitly treating evidence selection as a one\-shot step\. Multi\-hop QA benchmarks such as HotpotQA and 2WikiMultiHopQA expose the limits of this assumption: answering often requires combining complementary facts across documents, and omitting a single bridge passage can cause failure even when retrieval recall is high \(Yang et al\., [2018](https://arxiv.org/html/2605.05245#bib.bib12); Welbl et al\., [2018](https://arxiv.org/html/2605.05245#bib.bib10)\)\. Prior work further shows that long or noisy contexts degrade generation quality due to distractor sensitivity and “lost in the middle” effects \(Liu et al\., [2023](https://arxiv.org/html/2605.05245#bib.bib6); Cuconasu et al\., [2024](https://arxiv.org/html/2605.05245#bib.bib2)\)\. These findings motivate controllers that explicitly manage evidence composition under imperfect retrieval\.

### 2\.2 Active and Corrective RAG Controllers

Recent work has made the RAG controller more adaptive\. Self\-RAG trains an LLM to interleave generation with retrieval\-aware reflection \(Asai et al\., [2023](https://arxiv.org/html/2605.05245#bib.bib1)\)\. Adaptive\-RAG routes questions among no\-retrieval, single\-step, and multi\-step strategies based on estimated complexity \(Jeong et al\., [2024](https://arxiv.org/html/2605.05245#bib.bib4)\)\. CRAG evaluates retrieved documents and triggers corrective actions such as additional retrieval or document decomposition when evidence quality appears low \(Yan et al\., [2024](https://arxiv.org/html/2605.05245#bib.bib11)\)\. These methods make retrieval more adaptive, but they primarily decide when or whether to retrieve, and several rely on model finetuning, rather than focusing on token efficient evidence assembly for multi\-hop reasoning\.

SEAL\-RAG is most closely related to our setting \(Lahmy and Yozevitch, [2025](https://arxiv.org/html/2605.05245#bib.bib5)\)\. It maintains an entity centric ledger, identifies missing information as gaps, and issues targeted micro\-queries to repair a fixed evidence set through replacement rather than expansion\. Our work builds directly on this line of explicit gap\-aware repair, but extends it beyond a fixed top\-$k$ setting\.

### 2\.3 Adaptive Evidence Selection and Position of This Work

A complementary line of work studies how much context to include\. Adaptive\-$k$ selects a query specific number of passages by identifying the largest drop in sorted similarity scores \(Taguchi et al\., [2025](https://arxiv.org/html/2605.05245#bib.bib9)\)\. AdaGReS formulates evidence selection as a token budgeted optimization problem balancing relevance and redundancy \(Peng et al\., [2025](https://arxiv.org/html/2605.05245#bib.bib7)\)\. These approaches reason about capacity and redundancy, but they do not explicitly model multi\-hop information gaps or use targeted micro\-queries to repair missing evidence\.

AdaGATE combines these two perspectives\. Like SEAL\-RAG, it is a training\-free controller that performs explicit gap\-aware evidence repair with an entity centric ledger\. Like Adaptive\-$k$ and AdaGReS, it reasons about context efficiency under limited budgets\. Its key difference is to integrate gap\-aware repair, question\-aware fallback retrieval, redundancy\-aware utility scoring, and adaptive capacity control within a single evidence selection procedure for multi\-hop RAG\.

## 3 Method

We formulate multi\-hop RAG under deployment constraints as a token constrained evidence repair problem\. Given a query $q$, a corpus $\mathcal{D}$, and a global token budget $B$, the goal is to assemble a compact evidence set that supports multi\-hop reasoning while avoiding redundant or misleading passages\. AdaGATE is a training\-free controller built on top of a fixed retriever and generator\. At each iteration $t$, it maintains an evidence set $E_t$, an entity centric ledger $U_t$, and a set of unresolved information gaps $G_t$\. Relative to SEAL\-RAG \(Lahmy and Yozevitch, [2025](https://arxiv.org/html/2605.05245#bib.bib5)\), AdaGATE makes three changes: it replaces fixed\-$k$ evidence selection with token\-constrained selection, adds a question\-aware fallback channel to gap\-targeted retrieval, and uses utility\-adaptive capacity control to avoid overfilling the context with low\-value passages\.

### 3\.1 Gap\-Aware Retrieval and Evidence State

Let $\mathcal{C}_t$ denote the candidate pool retrieved at iteration $t$\. Each passage $c \in \mathcal{C}_t$ has token length $\ell(c)$, and the final evidence set must satisfy

$$\sum_{c \in E_t} \ell(c) \leq B. \tag{1}$$

Following SEAL\-RAG, AdaGATE uses two LLM\-based primitives: \(1\) ledger extraction, which summarizes the current evidence set into structured entity–relation–value tuples with confidence scores, and \(2\) gap specification, which identifies missing facts needed to answer the question \(Lahmy and Yozevitch, [2025](https://arxiv.org/html/2605.05245#bib.bib5)\)\. We treat these as black\-box components and focus on how AdaGATE uses them to guide retrieval and evidence selection under a limited context budget\.

For each gap $g \in G_t$, AdaGATE generates one or more targeted micro\-queries\. To improve robustness when gap extraction is noisy or overly abstract, it also generates a small set of question\-anchored fallback queries derived directly from $q$\. The union of gap\-aware and question\-aware queries is sent to the retriever to form the next candidate pool $\mathcal{C}_t$\. This design allows the controller to continue exploring useful evidence even when the current gap representation is incomplete\.
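The query union described above can be sketched as follows\. This is a minimal illustration, not the authors' implementation: `build_queries`, `micro_query_fn`, and `fallback_fn` are hypothetical names, the two callables stand in for the LLM prompts that produce micro\-queries and fallback queries, and the case\-insensitive de\-duplication is an added assumption\.

```python
from typing import Callable, Iterable, List


def build_queries(
    question: str,
    gaps: Iterable[str],
    micro_query_fn: Callable[[str, str], List[str]],  # hypothetical: (question, gap) -> queries
    fallback_fn: Callable[[str], List[str]],          # hypothetical: question -> fallback queries
) -> List[str]:
    """Return the union of gap-targeted micro-queries and question-anchored fallbacks."""
    queries: List[str] = []
    for gap in gaps:
        queries.extend(micro_query_fn(question, gap))
    # Fallback queries are always added, so retrieval still has something
    # to work with when the gap list is empty or too abstract.
    queries.extend(fallback_fn(question))

    # Assumption: drop case-insensitive duplicates, keeping first-seen order.
    seen = set()
    unique: List[str] = []
    for q in queries:
        key = q.strip().lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(q.strip())
    return unique
```

Because the fallback queries do not depend on the gap list, the retriever receives at least the question\-anchored queries even when gap specification returns nothing usable\.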

### 3\.2 Utility\-Based Evidence Scoring

Given the current query, ledger, gaps, and evidence state, AdaGATE assigns each candidate passage $c \in \mathcal{C}_t$ a scalar utility score

$$
\begin{aligned}
S_t(c) &= \lambda_1\,\mathrm{GapCov}(c, G_t) + \lambda_2\,\mathrm{Corr}(c, U_t) \\
&\quad + \lambda_3\,\mathrm{Nov}(c, U_t) - \lambda_4\,\mathrm{Red}(c, E_t) \\
&\quad + \lambda_5\,\mathrm{Rel}_Q(c, q).
\end{aligned}
\tag{2}
$$

The five terms capture complementary roles in multi\-hop evidence assembly\. $\mathrm{GapCov}(c, G_t)$ rewards passages that directly address unresolved gaps\. $\mathrm{Corr}(c, U_t)$ rewards support for low\-confidence facts already present in the ledger\. $\mathrm{Nov}(c, U_t)$ favors passages that contribute new entities or relations rather than repeating lateral information\. $\mathrm{Red}(c, E_t)$ penalizes candidates that are highly similar to evidence already selected\. Finally, $\mathrm{Rel}_Q(c, q)$ measures direct relevance to the original question and acts as a fallback signal when gap extraction is noisy\. Compared with SEAL\-RAG, the most important additions are the explicit redundancy penalty and the question\-aware relevance term, which together make the controller more robust under noisy or redundant retrieval\.
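As a concrete reading of Eq\. 2, the sketch below combines five precomputed component scores into one utility value\. The weight values are placeholders chosen for illustration only \(the paper states the weights were set heuristically and does not list them here\), and `UtilityWeights` and `utility` are hypothetical names\. How each component is computed is left to the caller\.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UtilityWeights:
    # Placeholder values for lambda_1..lambda_5; NOT the paper's settings.
    gap_cov: float = 1.0
    corr: float = 0.5
    nov: float = 0.5
    red: float = 1.0
    rel_q: float = 0.5


def utility(
    gap_cov: float,
    corr: float,
    nov: float,
    red: float,
    rel_q: float,
    w: UtilityWeights = UtilityWeights(),
) -> float:
    """Weighted sum of Eq. 2; redundancy is the only subtracted term."""
    return (
        w.gap_cov * gap_cov
        + w.corr * corr
        + w.nov * nov
        - w.red * red
        + w.rel_q * rel_q
    )
```

With these placeholder weights, two candidates that agree on every component except redundancy are ordered by the redundancy term alone, which is the behavior the penalty is meant to produce when the pool contains paraphrases\.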

### 3\.3 Token\-Constrained Selection with Adaptive Capacity

AdaGATE does not fix the number of passages passed to the generator\. Instead, it selects evidence under the token budget in Eq\. [1](https://arxiv.org/html/2605.05245#S3.E1), allowing the final evidence set size to vary with passage length and utility\. In practice, AdaGATE uses the utility score in Eq\. [2](https://arxiv.org/html/2605.05245#S3.E2) as a surrogate for marginal value and greedily assembles a compact evidence set from the highest scoring candidates\.

To avoid filling the available budget with many mediocre passages, AdaGATE estimates an effective capacity from the utility distribution\. Let

$$S_t^{(1)} \geq S_t^{(2)} \geq \dots \geq S_t^{(M)}$$

denote candidate utilities sorted in descending order, and define adjacent drops

$$\Delta_i = S_t^{(i)} - S_t^{(i+1)}.$$

AdaGATE chooses the largest drop

$$i^{\star} = \arg\max_{i} \Delta_i$$

and sets

$$K_t^{\text{eff}} = i^{\star} + B_{\text{buf}},$$

where $B_{\text{buf}} = 2$ is a small buffer\. The largest utility drop separates a high\-value prefix from a lower\-value tail; AdaGATE prioritizes candidates within the top $K_t^{\text{eff}}$ range and greedily selects from them while enforcing the global token budget\.
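The capacity heuristic and the budgeted greedy pass can be sketched as follows\. This is an illustration under stated assumptions rather than the released code: the function names are hypothetical, ties in the largest drop resolve to the earliest index, the capacity is capped at the pool size, and a passage that does not fit the remaining budget is skipped rather than ending the selection\.

```python
from typing import List, Tuple


def effective_capacity(scores_desc: List[float], buffer: int = 2) -> int:
    """K_eff = i* + B_buf, where i* is the 1-based index of the largest adjacent drop."""
    if len(scores_desc) < 2:
        return len(scores_desc)
    drops = [scores_desc[i] - scores_desc[i + 1] for i in range(len(scores_desc) - 1)]
    i_star = max(range(len(drops)), key=drops.__getitem__) + 1  # convert to 1-based
    # Assumption: never report a capacity larger than the candidate pool.
    return min(i_star + buffer, len(scores_desc))


def select_evidence(
    candidates: List[Tuple[str, float, int]],  # (passage_id, utility, token_length)
    budget: int,
    buffer: int = 2,
) -> List[str]:
    """Greedy selection within the top-K_eff candidates under the token budget."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    k_eff = effective_capacity([c[1] for c in ranked], buffer)
    chosen: List[str] = []
    used = 0
    for passage_id, _, length in ranked[:k_eff]:
        if used + length <= budget:  # enforce Eq. 1
            chosen.append(passage_id)
            used += length
    return chosen
```

For utilities `[0.9, 0.85, 0.3, 0.25, 0.2, 0.1]` the largest drop follows the second candidate, so $i^{\star} = 2$ and $K_t^{\text{eff}} = 4$; the last two candidates are never considered even if the budget would admit them\.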

AdaGATE iterates over four stages: extract, search, score, and replace\. It first extracts the current ledger and unresolved gaps from $E_t$, then retrieves new candidates using both gap\-aware and question\-aware queries, scores candidates with Eq\. [2](https://arxiv.org/html/2605.05245#S3.E2) and estimates the effective capacity, and finally updates the evidence set by replacing lower utility passages with higher utility candidates while respecting the token budget\. The process stops when no useful repair remains, no meaningful gaps are identified, or a maximum number of repair iterations is reached\. After termination, the final evidence set is concatenated with the question and passed to the generator\.
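The four\-stage loop can be outlined as a small control skeleton\. Everything here is a simplified sketch: `repair_loop` and the `extract`, `search`, and `score` callables are hypothetical stand\-ins for the LLM primitives, the retriever, and Eq\. 2; the replace step is approximated by rebuilding the evidence set greedily from the union of old and new passages; and the effective\-capacity cap is omitted to keep the skeleton short\.

```python
from typing import Callable, List, Sequence, Tuple

Passage = Tuple[str, int]  # (text, token_length); hypothetical representation


def repair_loop(
    question: str,
    initial: Sequence[Passage],
    extract: Callable[[str, List[Passage]], Tuple[list, list]],  # -> (ledger, gaps)
    search: Callable[[str, list], List[Passage]],                # gaps -> new candidates
    score: Callable[[Passage, str, list, list], float],          # utility surrogate
    budget: int,
    max_iters: int = 3,
) -> List[Passage]:
    evidence = list(initial)
    for _ in range(max_iters):
        # 1) extract: ledger and unresolved gaps from the current evidence
        ledger, gaps = extract(question, evidence)
        if not gaps:
            break  # no meaningful gaps remain
        # 2) search: retrieve candidates for the gaps (and fallbacks)
        candidates = search(question, gaps)
        # 3) score: rank the union of current evidence and new candidates
        pool = list(dict.fromkeys(evidence + candidates))  # de-duplicate, keep order
        ranked = sorted(pool, key=lambda p: score(p, question, ledger, gaps), reverse=True)
        # 4) replace: rebuild the evidence set greedily under the token budget
        updated: List[Passage] = []
        used = 0
        for passage in ranked:
            if used + passage[1] <= budget:
                updated.append(passage)
                used += passage[1]
        if set(updated) == set(evidence):
            break  # no useful repair was found
        evidence = updated
    return evidence
```

The three `break` and loop\-bound conditions correspond to the three stopping criteria in the text: no gaps, no useful repair, and the iteration cap $L$\.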

## 4 Experimental Setup

### 4\.1 Dataset and Retrieval Setup

We evaluate on HotpotQA \(Yang et al\., [2018](https://arxiv.org/html/2605.05245#bib.bib12)\), a multi\-hop QA benchmark over Wikipedia in which each question is associated with two supporting paragraphs and additional distractor passages\. We use the distractor setting, which provides both relevant and irrelevant evidence and is therefore suitable for studying evidence selection under imperfect retrieval\.

All controllers share the same retrieval infrastructure\. We use a single Pinecone index built from the first 1,000 HotpotQA validation examples, yielding 10,919 document chunks embedded with OpenAI text\-embedding\-3\-small\. Evaluation is conducted on $N = 200$ validation examples\. Because all methods retrieve from the same index and use the same embedding model, observed differences reflect evidence control rather than changes in the retriever\.

### 4\.2 Stress\-Test Retrieval Conditions

To evaluate robustness beyond the clean benchmark setting, we construct controlled perturbations of the candidate pools while keeping the question, gold answer, and supporting facts unchanged\. Each condition is indexed into a separate Pinecone namespace\.

#### Noise injection\.

We construct a mixed noise pool by combining two perturbation types applied to the original passages: syntax distortion \(word order scrambling, spelling corruption, and partial truncation\) and cross\-query injection \(irrelevant passages sampled from other examples\)\. The noise ratio is fixed at $\rho_{\text{noise}} = 0.5$, increasing the candidate pool from 10 to 20 passages per example\.
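To make the syntax\-distortion perturbation concrete, the sketch below applies the three named operations to a single passage\. It is a hypothetical illustration: the function name `distort`, the adjacent\-character swap used as the spelling corruption, and the `typo_rate` and `keep_frac` values are assumptions, not the parameters used in the paper\. Cross\-query injection \(sampling passages from other examples\) is not shown\.

```python
import random


def distort(text: str, seed: int = 0, typo_rate: float = 0.2, keep_frac: float = 0.7) -> str:
    """Scramble word order, corrupt some spellings, then truncate the passage."""
    rng = random.Random(seed)  # seeded so the perturbation is reproducible
    words = text.split()

    # Word-order scrambling.
    rng.shuffle(words)

    # Spelling corruption: swap two adjacent characters in some longer words.
    corrupted = []
    for w in words:
        if len(w) > 3 and rng.random() < typo_rate:
            i = rng.randrange(len(w) - 1)
            w = w[:i] + w[i + 1] + w[i] + w[i + 2:]
        corrupted.append(w)

    # Partial truncation: keep only the leading fraction of the words.
    keep = max(1, int(len(corrupted) * keep_frac))
    return " ".join(corrupted[:keep])
```

A distorted passage keeps most of its vocabulary, which is one plausible reason such passages can still score highly under embedding similarity\.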

#### Redundancy injection\.

We augment each example with paraphrastic or partially overlapping variants of the gold supporting passages\. The redundancy ratio is fixed at $\rho_{\text{red}} = 0.5$, increasing the total pool to approximately 19,854 indexed documents\. Unlike the noise condition, these passages are topically relevant but largely non\-complementary, testing whether a controller can avoid spending budget on repeated evidence\.

### 4\.3 Models and Baselines

All methods use the same retriever, generator, and evaluation setup\. The retriever is OpenAI text\-embedding\-3\-small with Pinecone, retrieving $k = 3$ passages per query\. The generator is gpt\-4o\-mini, treated as a black\-box backbone without finetuning\.

We compare five controller settings:

- Basic RAG: retrieve the top\-$k$ passages and generate directly\.
- Self\-RAG \(Asai et al\., [2023](https://arxiv.org/html/2605.05245#bib.bib1)\): retrieval aware self\-reflective generation\.
- Adaptive\-$k$ \(Taguchi et al\., [2025](https://arxiv.org/html/2605.05245#bib.bib9)\): query specific passage count based on similarity score gaps\.
- SEAL\-RAG \(Lahmy and Yozevitch, [2025](https://arxiv.org/html/2605.05245#bib.bib5)\): entity centric gap\-repair controller, evaluated with $L \in \{1, 3\}$ repair iterations\.
- AdaGATE: our gap\-aware, token\-constrained controller, evaluated with $L \in \{1, 3\}$ and global token budget $B = 3000$\.

This comparison spans fixed size, adaptive size, and repair based evidence controllers under a shared retrieval and generation pipeline\.

### 4\.4 Evaluation Metrics

We evaluate along four dimensions: answer correctness, evidence quality, grounding quality, and token efficiency\.

#### Answer correctness\.

We use gpt\-4o as a judge to compare predicted answers against gold answers semantically and assign a binary correctness label\. Using a stronger judge than the generator reduces self\-evaluation bias \(Saad\-Falcon et al\., [2023](https://arxiv.org/html/2605.05245#bib.bib8)\)\.

#### Evidence quality\.

We report precision, recall, and F1 against the gold supporting document titles annotated in HotpotQA\. Retrieved titles are extracted from chunk headings and deduplicated before comparison\.
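A title\-level computation of these metrics for one example might look like the sketch below\. The function name `evidence_prf` is hypothetical, and scoring each example separately before averaging is an assumption; the text does not say whether F1 is averaged per example or derived from aggregate precision and recall\.

```python
from typing import Iterable, Tuple


def evidence_prf(retrieved_titles: Iterable[str], gold_titles: Iterable[str]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 over de-duplicated document titles."""
    retrieved = set(retrieved_titles)  # de-duplicate chunk titles
    gold = set(gold_titles)
    hits = len(retrieved & gold)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For instance, retrieving titles `["A", "A", "B", "C"]` against gold titles `["A", "D"]` gives precision 1/3 and recall 1/2, so F1 is 0\.4\.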

#### Grounding quality\.

We follow ARES \(Saad\-Falcon et al\., [2023](https://arxiv.org/html/2605.05245#bib.bib8)\) and report Context Relevance \(CR\), Answer Faithfulness \(AF\), and Answer Relevance \(AR\), again using gpt\-4o as the judge\. These metrics complement title based retrieval measures with generation\-aware judgments over the retrieved evidence and final answer\.

#### Token efficiency\.

We measure average input tokens per query, average number of documents passed to the generator, and tokens per correctly answered question, enabling direct cost–quality comparisons across controllers\.
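These three quantities can be aggregated from per\-query logs as sketched below\. The record layout and the name `token_efficiency` are hypothetical\. The exact formula for tokens per correctly answered question is not given in the text; this sketch assumes it is the mean input tokens over correctly answered queries, a reading chosen because the reported values sit close to the per\-query averages\.

```python
from typing import Dict, List, Tuple


def token_efficiency(records: List[Tuple[int, int, bool]]) -> Dict[str, float]:
    """records: one (input_tokens, num_docs, is_correct) tuple per query."""
    if not records:
        raise ValueError("records must not be empty")
    n = len(records)
    correct_tokens = [tokens for tokens, _, ok in records if ok]
    return {
        "avg_tokens": sum(tokens for tokens, _, _ in records) / n,
        "avg_docs": sum(docs for _, docs, _ in records) / n,
        # Assumption: mean input tokens over correctly answered queries.
        "tokens_per_correct": (
            sum(correct_tokens) / len(correct_tokens) if correct_tokens else float("inf")
        ),
    }
```

A different definition, such as total tokens divided by the number of correct answers, would give larger values whenever accuracy is below 100%, so the choice should be stated alongside any reported number\.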

All judges are blinded to controller identity and see only the question, answer, and retrieved passages\. We use fixed prompts throughout\.

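## 5 Results

We report results for seven controller configurations \(the five controllers of Section 4\.3, with SEAL\-RAG and AdaGATE each run at $L = 1$ and $L = 3$\) across three conditions\. Full numerical results are in Table [2](https://arxiv.org/html/2605.05245#A1.T2) in the Appendix\.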

### 5\.1 Evidence Quality and Answer Correctness on Clean Data

![Refer to caption](https://arxiv.org/html/2605.05245v1/x2.png)

Figure 2: Answer correctness and evidence quality\. Red = best per condition; green = worst; black = AdaGATE otherwise\. AdaGATE achieves the highest F1 across all three conditions\.

Figure [2](https://arxiv.org/html/2605.05245#S5.F2) reveals a consistent tension between precision and recall across all controllers on clean data\.

#### Add\-only controllers\.

Adaptive\-$k$ achieves the highest accuracy \(69\.0%\) but at the cost of severely collapsed precision \(P=0\.278\)\. Its high recall \(R=0\.820\) arises because it retrieves an average of 8\.6 documents per query, admitting many irrelevant passages alongside the gold supporting ones\. Basic RAG is the most balanced passive baseline \(Acc=58\.5%, F1=0\.601\)\. Self\-RAG’s document grading mechanism occasionally rejects all retrieved passages, leaving the generator with an empty evidence set and explaining its lower recall relative to Basic RAG\.

#### SEAL\-RAG: implementation versus paper\.

SEAL\-RAG achieves the highest precision \(P=0\.808 at $L = 3$\), but recall remains low and flat at approximately 0\.42 regardless of the condition or repair iterations\. This behavior is directly traceable to a discrepancy between the SEAL\-RAG paper and its implementation\. The paper describes a utility\-based ranking mechanism for evidence selection; the actual implementation replaces this with an LLM entity selection step that instructs the model to “choose the fewest entities needed to answer\.” In practice, this causes the model to select a single entity on the majority of questions, resulting in an average of only 1\.0 documents passed to the generator\. High precision follows naturally from single document selection, but this context systematically misses bridge facts on multi\-hop questions requiring evidence from two or more sources\. Increasing repair iterations from $L = 1$ to $L = 3$ improves accuracy modestly \(62\.0% → 68\.5%\) but does not resolve the fundamental recall ceiling\.

#### AdaGATE\.

AdaGATE addresses this recall bottleneck by passing an average of 2\.8–3\.0 documents to the generator, assembling a richer evidence set through its adaptive capacity mechanism while suppressing redundant passages via the utility scoring function\. The result is the highest F1 on clean data \(62\.3% at $L = 1$\), outperforming SEAL\-RAG \($L = 1$\) by 8\.2 F1 points\.

### 5\.2 Token Efficiency

![Refer to caption](https://arxiv.org/html/2605.05245v1/x3.png)

Figure 3: Token efficiency across controllers and conditions\. Red = most efficient; green = least efficient\. Adaptive\-$k$ is consistently the least token\-efficient controller\.

Figure [3](https://arxiv.org/html/2605.05245#S5.F3) shows that controllers differ by more than an order of magnitude in token usage\. Adaptive\-$k$ is the least efficient in every condition, consuming 1,116 tokens on clean data and 1,592 under redundancy, with tokens\-per\-correct of 1,118, roughly 3\.3× the cost of AdaGATE \(338\) and 8\.3× that of SEAL\-RAG \(134\)\. SEAL\-RAG is the most token efficient at 136–140 tokens, a direct consequence of single document selection, but at the cost of recall established in Section [5\.1](https://arxiv.org/html/2605.05245#S5.SS1)\.

AdaGATE uses 360 tokens on clean data, 2\.6× fewer than Adaptive\-$k$ while achieving substantially higher F1\. Under redundancy, token usage drops adaptively to 220–232 tokens as its redundancy penalty concentrates the evidence set on fewer, higher utility passages, simultaneously reducing cost and improving evidence quality\.

### 5\.3 ARES Grounding Scores

![Refer to caption](https://arxiv.org/html/2605.05245v1/x4.png)

Figure 4: ARES grounding scores \(CR = Context Relevance, AF = Answer Faithfulness, AR = Answer Relevance\)\. Red = best; green = worst\. SEAL\-RAG scores are consistently the lowest despite high retrieval precision\.

Figure [4](https://arxiv.org/html/2605.05245#S5.F4) reveals a striking divergence between retrieval precision and ARES grounding scores\. Adaptive\-$k$ leads on all three ARES dimensions \(CR=0\.67, AF=0\.63, AR=0\.59 on clean\) despite collapsed retrieval precision, because its large context of 8–14 documents inflates faithfulness scores by increasing the probability that at least one passage supports the generated answer\. This highlights a fundamental limitation of document conditioned evaluation under high\-recall retrieval: ARES scores reflect context availability rather than evidence precision\.

SEAL\-RAG produces the lowest ARES scores across all conditions \(CR=0\.24–0\.26, AF=0\.22–0\.23 on clean\)\. When the single selected document fails to contain the complete reasoning chain, the generator abstains rather than hallucinating, and the ARES judge penalizes abstentions as not faithful, explaining why SEAL\-RAG’s CR substantially exceeds its AF\.

AdaGATE achieves intermediate ARES scores \(CR=0\.58, AF=0\.51, AR=0\.51 on clean\), substantially higher than SEAL\-RAG\. The gap between AdaGATE and SEAL\-RAG on ARES is larger than the gap on retrieval F1, suggesting that multi\-document assembly provides meaningfully better grounding support and reduces abstention rates\.

### 5\.4 Robustness Under Stress\-Test Conditions

#### Redundancy injection\.

Under redundancy \($\rho = 0.5$\), accuracy drops sharply for Basic RAG \(58\.5% → 47\.5%\) and Self\-RAG \(60\.5% → 49\.0%\)\. Adaptive\-$k$’s accuracy counterintuitively increases to 72\.5% as its high\-recall strategy benefits from a pool containing many topically relevant paraphrases, but precision collapses to P=0\.109 and ARES scores drop sharply\. SEAL\-RAG accuracy is relatively stable \(58\.5%–65\.0%\) with high precision \(P=0\.868–0\.876\), but recall drops further \(R=0\.448–0\.455\) as the retriever must surface gold documents from a pool twice the size, and ARES scores stagnate\.

AdaGATE’s F1 improves substantially under redundancy \(62\.3% → 71\.2% at $L = 3$, and 62\.3% → 70\.9% at $L = 1$\), driven by the redundancy penalty term\. When the candidate pool is dominated by paraphrase variants, the novelty and redundancy terms jointly suppress their utility scores, causing AdaGATE to select fewer but more complementary documents\. Token usage drops adaptively from 360 to 220 tokens, and ARES scores remain stable at $L = 3$ \(CR=0\.53, AF=0\.48\), while Basic RAG’s drop sharply \(CR=0\.50 → 0\.36\)\.

#### Noise injection\.

Under noise, all controllers degrade in accuracy\. Noise passages retain high embedding similarity because syntax distortion does not fundamentally alter semantic representations, causing them to appear within the top\-$k$\. AdaGATE’s accuracy drops to 54\.0% \($L = 1$\), its weakest result, but F1 \(62\.7%\) matches clean performance, indicating that the utility scoring function partially penalizes corrupted passages through lower $\mathrm{GapCov}$ and $\mathrm{Rel}_Q$ scores\. Under noise at $L = 3$, AdaGATE achieves the highest CR among non\-Adaptive\-$k$ controllers \(0\.57\), suggesting that utility based filtering partially maintains context relevance even under retrieval degradation\.

### 5\.5 Pipeline\-Level Analysis

Table [1](https://arxiv.org/html/2605.05245#S5.T1) presents three representative examples from clean evaluation logs, spanning three distinct behavioral patterns: systematic single document recall loss, micro\-query failure without fallback, and over conservative gap detection\.

Table 1: Pipeline\-level comparison of SEAL\-RAG and AdaGATE on three representative HotpotQA examples \(clean, $L = 1$\)\. Q1 shows SEAL\-RAG’s systematic single\-document selection and its recall cost even when both controllers answer correctly\. Q3 shows AdaGATE’s $H_q$ fallback channel recovering from a micro\-query that returns zero new documents for SEAL\-RAG\. Q21 shows AdaGATE’s conservative gap detection causing abstention while SEAL\-RAG’s less stringent sufficiency check produces the correct answer from partial evidence\.

Q1 illustrates the most common behavioral pattern: SEAL\-RAG’s entity selection discards the Ed Wood passage despite answering correctly, while AdaGATE passes both gold documents\. This recall gap is inconsequential for simple comparisons but compounds on questions requiring multi\-hop bridge reasoning\. Q3 shows AdaGATE’s $H_q$ fallback recovering from a failed micro\-query that returns zero new documents for SEAL\-RAG, a failure mode that occurs whenever the gap specification is too abstract to match any indexed passage\. Q21 exposes the opposite failure: AdaGATE’s stringent gap detection causes abstention when indirect evidence would suffice, while SEAL\-RAG’s less conservative sufficiency check generates the correct answer from partial evidence, illustrating a fundamental tradeoff between hallucination prevention and abstention rate\.

#### Summary\.

AdaGATE achieves a better precision\-recall balance than fixed\-$k$ and add\-only baselines at comparable token cost, and is more robust under stress\-test conditions\. The pipeline analysis further reveals that SEAL\-RAG’s apparent precision advantage stems from an implementation that departs from its stated design, while AdaGATE’s multi\-document assembly and fallback retrieval provide more robust evidence coverage at the cost of occasional over conservative abstention\.

## 6 Discussion

#### Hypothesis confirmation\.

Both central hypotheses were confirmed\. AdaGATE achieves the highest F1 across all three conditions \(62\.3% clean, 71\.2% redundancy, 62\.7% noise\) while using 2\.6× fewer tokens than Adaptive\-$k$\. F1 under redundancy improves relative to clean, and F1 under noise matches clean performance despite accuracy dropping, confirming that the utility scoring function maintains evidence completeness even when individual passage quality deteriorates\.

#### Relation to prior work\.

The precision\-recall tradeoff between SEAL\-RAG and Adaptive\-k mirrors the broader tension between context precision and answer coverage in the RAG literature \(Cuconasu et al\.,[2024](https://arxiv.org/html/2605.05245#bib.bib2)\)\. We additionally identify a concrete implementation gap in SEAL\-RAG: its stated utility\-based ranking is replaced in practice by an LLM entity selection step that collapses the evidence set to one document by construction\. The ARES analysis further reveals that Adaptive\-k’s inflated scores reflect context availability rather than evidence quality, consistent with concerns about document\-conditioned evaluation \(Saad\-Falcon et al\.,[2023](https://arxiv.org/html/2605.05245#bib.bib8)\)\.

#### Limitations and future work\.

The current evaluation uses $k=3$ retrieval, which limits the candidate pool and prevents the token budget from binding; larger $k$ or longer repair chains would engage the budget more actively\. Evaluation is also restricted to HotpotQA’s distractor setting, and performance on web\-scale retrieval may differ\. The utility weights $(\lambda_1, \ldots, \lambda_5)$ were set heuristically and could be learned from supervision\. Finally, AdaGATE’s conservative gap detection leads to over\-abstention on questions where indirect evidence would suffice; calibrating the sufficiency threshold to question type is a natural direction for future work\.
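To make the role of the five weights concrete, the sketch below scores a candidate passage from five signals named after the dimensions the paper lists \(gap coverage, corroboration, novelty, redundancy, question relevance\)\. The linear form, the treatment of redundancy as a subtracted penalty, and the default weight values are assumptions chosen for illustration; they are not taken from the paper or its code\.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Signals:
    """Per-passage signals, each assumed to lie in [0, 1]."""
    gap_coverage: float
    corroboration: float
    novelty: float
    redundancy: float
    relevance: float


def utility(
    s: Signals,
    weights: Tuple[float, float, float, float, float] = (1.0, 0.5, 0.5, 0.5, 1.0),
) -> float:
    """Weighted score; the weights play the role of (lambda_1..lambda_5)."""
    l1, l2, l3, l4, l5 = weights  # hypothetical values, not the paper's
    return (
        l1 * s.gap_coverage
        + l2 * s.corroboration
        + l3 * s.novelty
        - l4 * s.redundancy  # assumed penalty term
        + l5 * s.relevance
    )
```

Under this form, learning the weights from supervision would amount to fitting five scalars, for example by regressing on whether a passage is a gold supporting document\.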

## 7 Conclusion

We presented AdaGATE, a training\-free evidence controller for multi\-hop RAG that combines entity\-centric gap tracking, a five\-dimensional utility scoring function, adaptive capacity control, and a question\-anchored fallback retrieval channel\. Evaluated on HotpotQA across clean, redundancy\-injected, and noise\-injected conditions, AdaGATE achieves the highest evidence F1 in all three settings while using 2\.6× fewer tokens than Adaptive\-k\. A pipeline\-level analysis reveals that SEAL\-RAG’s precision advantage is an artifact of an implementation that departs from its stated design, and that AdaGATE’s multi\-document assembly provides more robust evidence coverage at the cost of occasional over\-conservative abstention\. These results support viewing multi\-hop RAG as a token\-budgeted evidence repair problem and suggest that explicit gap modeling, redundancy\-aware scoring, and fallback retrieval together constitute a principled approach to robust evidence assembly under realistic deployment constraints\.

## Acknowledgments

This work was carried out as a course project for DS\-GA / LING\-GA 1012 \(Natural Language Understanding and Computational Semantics\) at New York University, Spring 2026\. We thank the course instructor, Prof\. Tal Linzen, and the teaching staff for their feedback during in\-semester presentations\. We also thank the authors of SEAL\-RAG \(Lahmy and Yozevitch,[2025](https://arxiv.org/html/2605.05245#bib.bib5)\) for releasing their code, which served as the starting point for our implementation\.

## References

- Asai et al\. \(2023\) Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi\. 2023\. [Self\-rag: Learning to retrieve, generate, and critique through self\-reflection](https://arxiv.org/abs/2310.11511)\.
- Cuconasu et al\. \(2024\) Florin Cuconasu, Giovanni Trappolini, Federico Siciliano, Simone Filice, Cesare Campagnano, Yoelle Maarek, Nicola Tonellotto, and Fabrizio Silvestri\. 2024\. [The power of noise: Redefining retrieval for rag systems](https://doi.org/10.1145/3626772.3657834)\. In *Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval*, pages 719–729, Washington, DC, USA\. Association for Computing Machinery\.
- Fan et al\. \(2024\) Wenqi Fan, Yujuan Ding, Liangbo Ning, Shijie Wang, Hengyun Li, Dawei Yin, Tat\-Seng Chua, and Qing Li\. 2024\. [A survey on RAG meeting LLMs: Towards retrieval\-augmented large language models](https://doi.org/10.1145/3637528.3671470)\. In *Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining*, pages 6491–6501, New York, NY, USA\. Association for Computing Machinery\.
- Jeong et al\. \(2024\) Soyeong Jeong, Jinheon Baek, Sukmin Cho, Sung Ju Hwang, and Jong C\. Park\. 2024\. [Adaptive\-rag: Learning to adapt retrieval\-augmented large language models through question complexity](https://arxiv.org/abs/2403.14403)\.
- Lahmy and Yozevitch \(2025\) Moshe Lahmy and Roi Yozevitch\. 2025\. [Replace, don’t expand: Mitigating context dilution in multi\-hop rag via fixed\-budget evidence assembly](https://arxiv.org/abs/2512.10787)\.
- Liu et al\. \(2023\) Nelson F\. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang\. 2023\. [Lost in the middle: How language models use long contexts](https://arxiv.org/abs/2307.03172)\.
- Peng et al\. \(2025\) Chao Peng, Bin Wang, Zhilei Long, and Jinfang Sheng\. 2025\. [Adagres: Adaptive greedy context selection via redundancy\-aware scoring for token\-budgeted rag](https://arxiv.org/abs/2512.25052)\.
- Saad\-Falcon et al\. \(2023\) Jon Saad\-Falcon, Omar Khattab, Christopher Potts, and Matei Zaharia\. 2023\. [Ares: An automated evaluation framework for retrieval\-augmented generation systems](https://arxiv.org/abs/2311.09476)\. *Preprint*, arXiv:2311\.09476\.
- Taguchi et al\. \(2025\) Chihiro Taguchi, Seiji Maekawa, and Nikita Bhutani\. 2025\. [Efficient context selection for long\-context qa: No tuning, no iteration, just adaptive\-k](https://arxiv.org/abs/2506.08479)\.
- Welbl et al\. \(2018\) Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel\. 2018\. Constructing datasets for multi\-hop reading comprehension across documents\. In *Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing*, pages 1933–1943\.
- Yan et al\. \(2024\) Shi\-Qi Yan, Jia\-Chen Gu, Yun Zhu, and Zhen\-Hua Ling\. 2024\. [Corrective retrieval augmented generation](https://arxiv.org/abs/2401.15884)\.
- Yang et al\. \(2018\) Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W\. Cohen, Ruslan Salakhutdinov, and Christopher D\. Manning\. 2018\. Hotpotqa: A dataset for diverse, explainable multi\-hop question answering\. In *Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing*, pages 2369–2380\.

## Appendix A Full Evaluation Results

Table [2](https://arxiv.org/html/2605.05245#A1.T2) reports complete numerical results for all controllers across all conditions and metrics\.

Table 2: Full evaluation results \(HotpotQA distractor, $k=3$, $N=200$\)\. Accuracy \(%\) judged by GPT\-4o\. P = Precision, R = Recall, F1 against gold titles\. Token efficiency: avg\_tokens / avg\_docs / tokens\_per\_correct\. CR = Context Relevance, AF = Answer Faithfulness, AR = Answer Relevance \(ARES UES/IDP, GPT\-4o\)\. Bold = best per metric per condition\.
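For readers reproducing the evidence metrics in the caption, a minimal sketch of set\-based precision, recall, and F1 over document titles is given below\. The function names are hypothetical, and the definition of `tokens_per_correct` as total input tokens divided by the number of correct answers is an assumption inferred from the metric name, not confirmed by the paper\.

```python
from typing import Iterable, Tuple


def evidence_prf(selected: Iterable[str], gold: Iterable[str]) -> Tuple[float, float, float]:
    """Set-based precision, recall, and F1 of selected titles vs. gold titles."""
    sel, gld = set(selected), set(gold)
    hits = len(sel & gld)
    precision = hits / len(sel) if sel else 0.0
    recall = hits / len(gld) if gld else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def tokens_per_correct(total_tokens: int, num_correct: int) -> float:
    """Assumed definition: input tokens spent per correctly answered question."""
    return total_tokens / num_correct if num_correct else float("inf")
```

A controller that passes a single gold title when two are required gets precision 1\.0 but recall 0\.5 under this definition, which is the single\-document pattern discussed for Q1\.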
