What Survives Into Context: A Diagnostic for Budget-Constrained Multi-Hop RAG and When Submodular Evidence Packing Improves It

arXiv cs.CL Papers

Summary

This paper introduces answer-in-context, a diagnostic metric for budget-constrained multi-hop RAG that measures whether the gold answer survives in the packed reader context, and proposes a submodular evidence packing method that improves over heuristics under specific conditions.

arXiv:2607.00725v1 Announce Type: new Abstract: Retrieval-augmented generation (RAG) under a fixed reader-context budget forces a selection problem: of the evidence retrieved, only a fraction can be shown to the reader. We argue that document recall -- the standard retrieval metric -- is the wrong quantity to optimize in this regime, and we make two contributions. First, as a general contribution, we introduce answer-in-context, a diagnostic that measures whether a gold answer survives as a contiguous span in the packed reader context (not the retrieved set). It predicts answer F1 better than recall (r=0.39-0.55 vs. about 0.31), separates answer quality roughly five-fold (0.60 vs. 0.12 on HotpotQA), and carries information beyond retrieval: it adds Delta R squared=0.17 over recall and shows a 4.6x EM gap even among questions where all gold was retrieved. We also confirm it interventionally: on 2WikiMultiHopQA a packing change that raises coverage but not answer-in-context yields no accuracy gain. Second, as a conditional contribution, we cast reader-context construction as budgeted monotone submodular maximization and build a packer that jointly optimizes relevance, query coverage, representativeness, and diversity. On HotpotQA with a 160-token budget and a 3B reader it beats a strong focused heuristic, MMR, and naive packing -- by up to +5.1 F1 at equal-or-lower token cost, across three seeds. Crucially, we map the scope of this win honestly: it requires the conjunction of (i) multi-hop complementary structure, (ii) retrieval that surfaces the evidence, (iii) a binding but not extreme budget, and (iv) a reader weak enough that evidence density, not reading capacity, is the bottleneck. A quantization-controlled reader-scale ladder (3B to 7B to 14B) shows the edge over the heuristic is absorbed by 7B and significantly reverses by 14B, while the diagnostic explains every boundary with a single variable.

Cached at: 07/02/26, 05:38 AM

# A Diagnostic for Budget-Constrained Multi-Hop RAG and When Submodular Evidence Packing Improves It
Source: [https://arxiv.org/html/2607.00725](https://arxiv.org/html/2607.00725)
###### Abstract

Retrieval-augmented generation (RAG) under a fixed reader-context budget forces a selection problem: of the evidence retrieved, only a fraction can be shown to the reader. We argue that document recall, the standard retrieval metric, is the wrong quantity to optimize in this regime, and we make two contributions. First, as a *general* contribution, we introduce answer-in-context, a diagnostic that measures whether a gold answer survives as a contiguous span in the *packed* reader context (not the retrieved set). It predicts answer F1 better than recall ($r=0.39$–$0.55$ vs. $\sim 0.31$), separates answer quality roughly five-fold ($0.60$ vs. $0.12$ on HotpotQA), and carries information *beyond* retrieval: it adds $\Delta R^2=0.17$ over recall and shows a $4.6\times$ EM gap even among questions where all gold was retrieved. We also confirm it *interventionally*: on 2WikiMultiHopQA a packing change that raises coverage but not answer-in-context yields no accuracy gain. Second, as a *conditional* contribution, we cast reader-context construction as budgeted monotone submodular maximization and build a packer that jointly optimizes relevance, query coverage, representativeness, and diversity. On HotpotQA with a 160-token budget and a 3B reader it beats a strong focused heuristic, MMR, and naive packing by up to $+5.1$ F1 at equal-or-lower token cost, across three seeds. Crucially, we map the scope of this win honestly: it requires the conjunction of (i) multi-hop complementary structure, (ii) retrieval that surfaces the evidence, (iii) a binding but not extreme budget, and (iv) a reader weak enough that evidence density, not reading capacity, is the bottleneck. A quantization-controlled reader-scale ladder (3B $\to$ 7B $\to$ 14B) shows the edge over the heuristic is absorbed by 7B and significantly *reverses* by 14B, while the diagnostic explains every boundary with a single variable.


Ananto Nayan Bala, Ahsanullah University of Science and Technology, nayan.ananto@gmail.com

## 1 Introduction

A retrieval-augmented reader has a finite context window, and in practice an even smaller *evidence budget*: the share of that window allocated to retrieved passages. Once retrieval returns more relevant text than fits, the system must decide what to keep. This selection step is usually treated as an afterthought (concatenate the top-$k$, truncate to fit; Lewis et al., [2020](https://arxiv.org/html/2607.00725#bib.bib1); Ram et al., [2023](https://arxiv.org/html/2607.00725#bib.bib7)), yet under a tight budget it is the step that decides whether the reader ever sees the answer.

The community’s default retrieval metric, recall@$k$, is computed on the *retrieved document set*. But the reader never consumes the retrieved set; it consumes the *packed context*. When packing discards evidence to fit a budget, recall and what-the-reader-sees diverge. The divergence is acute for multi-hop questions (Yang et al., [2018](https://arxiv.org/html/2607.00725#bib.bib12); Trivedi et al., [2022](https://arxiv.org/html/2607.00725#bib.bib13)), where the answer depends on combining evidence from several documents: retrieving all of them is necessary but not sufficient, because the packer may keep a redundant pair and drop the bridge. Figure [1](https://arxiv.org/html/2607.00725#S1.F1) makes the gap concrete.

This paper starts from a measurement gap and ends with a method. We first ask: *what property of the reader context actually predicts answer quality under a budget?* We define answer-in-context (does a gold answer appear verbatim in the packed context?) and show it predicts answer F1 far better than retrieval recall on every dataset we test (§[3](https://arxiv.org/html/2607.00725#S3)). This reframes the budgeted-RAG objective from “retrieve the gold documents” to “pack so the answer survives.” We then ask: *can a principled packer move that quantity?* We formulate reader-context construction as budgeted monotone submodular maximization (§[4](https://arxiv.org/html/2607.00725#S4)) and show on HotpotQA it delivers a statistically clean win over heuristic packing, MMR, and naive concatenation across three seeds (§[5](https://arxiv.org/html/2607.00725#S5)). A per-question decomposition ties the win to the diagnostic: the packer helps precisely by assembling complementary multi-hop evidence into the reader context.

Finally (and we view this as much a contribution as the method), we scope the win honestly (§[6](https://arxiv.org/html/2607.00725#S6)). Through controlled experiments on RAGBench, MuSiQue, a budget sweep, and a reader-scale ladder, we identify four conditions that must co-occur for principled packing to beat the best heuristic, and we show concrete settings where each fails. On MuSiQue we try the obvious fix for the failing condition (more retrieval) and it changes nothing, turning a soft “does not transfer” into a precise boundary; and a quantization-controlled reader-scale ladder answers the “a stronger reader just absorbs your packing” objection with data: the edge over the heuristic is absorbed by 7B and significantly reverses by 14B, while the packer’s mechanism and its win over naive packing persist. The diagnostic predicts every one of these patterns.

#### Contributions.

1. A diagnostic (general). Answer-in-context, a reader-context-level metric that predicts budgeted-RAG quality better than recall on span-answer datasets, with demonstrated *incremental validity* over recall ($\Delta R^2=+0.17$; a $4.6\times$ EM separation that survives even when all gold is retrieved) and *interventional* support on 2Wiki.
2. A method (conditional). A budgeted submodular evidence packer that significantly improves HotpotQA answer quality over heuristic, MMR, and naive packers at equal-or-lower token cost, with a mechanistic per-question explanation.
3. A scope map (the honest core). A four-condition account of when principled packing beats the best heuristic, each condition demonstrated to fail in a controlled setting, including a retrieval-unlock ablation on MuSiQue and a quantization-controlled reader-scale ladder (3B $\to$ 7B $\to$ 14B) that locates the reader scale at which curation stops paying off and begins to cost.

We deliberately do *not* claim that graph-structured evidence or submodular packing universally improves RAG. The evidence supports a narrow, mechanistically explained claim plus a diagnostic that generalizes, which we believe is more useful than a broad claim that does not survive replication.

[Figure 1 diagram: query → Retriever → retrieved set (recall@$k$ scored here) → Packer ($\leq B$ tokens; gold #2 dropped) → context (answer-in-context scored here) → Reader → answer]

Figure 1: Recall is scored on the *retrieved set*; the reader consumes the *packed context*. Under a budget the packer can drop a retrieved gold document (here “gold #2”), so high recall need not mean the answer survives. Answer-in-context measures exactly what reaches the reader.

## 2 Related Work

#### Retrieval-augmented generation.

RAG couples a (typically dense; Karpukhin et al., [2020](https://arxiv.org/html/2607.00725#bib.bib3)) retriever with a reader LM (Lewis et al., [2020](https://arxiv.org/html/2607.00725#bib.bib1); Guu et al., [2020](https://arxiv.org/html/2607.00725#bib.bib2); Izacard and Grave, [2021](https://arxiv.org/html/2607.00725#bib.bib4); Izacard et al., [2023](https://arxiv.org/html/2607.00725#bib.bib5)) and now spans retrieval from trillions of tokens (Borgeaud et al., [2022](https://arxiv.org/html/2607.00725#bib.bib6)), in-context retrieval (Ram et al., [2023](https://arxiv.org/html/2607.00725#bib.bib7)), black-box augmentation (Shi et al., [2024](https://arxiv.org/html/2607.00725#bib.bib8)), joint instruction tuning (Lin et al., [2024](https://arxiv.org/html/2607.00725#bib.bib10)), and self-reflective variants (Asai et al., [2024](https://arxiv.org/html/2607.00725#bib.bib9)); see Gao et al. ([2023](https://arxiv.org/html/2607.00725#bib.bib11)) for a survey. Most of this work reports retrieval recall and end-task accuracy *separately* and treats context construction as fixed top-$k$ concatenation. Our diagnostic targets the quantity in between (what the packed context actually contains), which becomes the binding variable once a budget forces selection.

#### Multi-hop question answering.

HotpotQA (Yang et al., [2018](https://arxiv.org/html/2607.00725#bib.bib12)), MuSiQue (Trivedi et al., [2022](https://arxiv.org/html/2607.00725#bib.bib13)), 2WikiMultiHopQA (Ho et al., [2020](https://arxiv.org/html/2607.00725#bib.bib14)), and WikiHop (Welbl et al., [2018](https://arxiv.org/html/2607.00725#bib.bib15)) require composing evidence across documents. A large line of work attacks the retrieval side of this difficulty with multi-hop dense retrieval (Xiong et al., [2021](https://arxiv.org/html/2607.00725#bib.bib18)), interleaved retrieval-and-reasoning (Trivedi et al., [2023](https://arxiv.org/html/2607.00725#bib.bib16); Press et al., [2023](https://arxiv.org/html/2607.00725#bib.bib17)) built on chain-of-thought prompting (Wei et al., [2022](https://arxiv.org/html/2607.00725#bib.bib53)), iterative retrieval-generation (Shao et al., [2023](https://arxiv.org/html/2607.00725#bib.bib54); Jiang et al., [2023b](https://arxiv.org/html/2607.00725#bib.bib20)), and program-style composition (Khattab et al., [2022](https://arxiv.org/html/2607.00725#bib.bib19)). We use these datasets not to improve retrieval but to *vary* whether the complementary evidence is present and surfaced, which is what determines whether a packer can help.

#### Context selection and compression.

Reducing reader context via reranking, selection, or compression is well studied. The canonical redundancy-aware reranker is Maximal Marginal Relevance (MMR) (Carbonell and Goldstein, [1998](https://arxiv.org/html/2607.00725#bib.bib28)), our direct baseline. Recent methods compress or filter retrieved context: RECOMP (Xu et al., [2024a](https://arxiv.org/html/2607.00725#bib.bib30)), LLMLingua (Jiang et al., [2023a](https://arxiv.org/html/2607.00725#bib.bib31)), Selective Context (Li et al., [2023](https://arxiv.org/html/2607.00725#bib.bib32)), context filtering (Wang et al., [2023](https://arxiv.org/html/2607.00725#bib.bib33)), and robustness to irrelevant passages (Yoran et al., [2024](https://arxiv.org/html/2607.00725#bib.bib34)). “Lost in the middle” effects (Liu et al., [2024](https://arxiv.org/html/2607.00725#bib.bib29)) and long-context studies (Bai et al., [2024](https://arxiv.org/html/2607.00725#bib.bib35); Xu et al., [2024b](https://arxiv.org/html/2607.00725#bib.bib36)) show that simply enlarging the window is not a substitute for choosing what goes in it. Our packer differs in that its objective is tied to an explicit, measurable answer-density quantity (the diagnostic), and our central message is a scope map for *when* principled selection helps at all.

#### Submodular optimization for selection.

Coverage-and-diversity objectives with the cost-scaled greedy algorithm and its constant-factor guarantee (Nemhauser et al., [1978](https://arxiv.org/html/2607.00725#bib.bib39)) were introduced for extractive summarization by Lin and Bilmes ([2011](https://arxiv.org/html/2607.00725#bib.bib37), [2010](https://arxiv.org/html/2607.00725#bib.bib38)); see Krause and Golovin ([2014](https://arxiv.org/html/2607.00725#bib.bib40)); Bilmes ([2022](https://arxiv.org/html/2607.00725#bib.bib41)) for broader treatments. We apply that machinery to *reader-context evidence packing* for RAG and tie the objective to the answer-in-context quantity our diagnostic measures.

#### Retrievers and readers.

We use a bi-encoder retriever (Reimers and Gurevych, [2019](https://arxiv.org/html/2607.00725#bib.bib22); Xiao et al., [2024](https://arxiv.org/html/2607.00725#bib.bib25)) of the kind evaluated on MTEB (Muennighoff et al., [2023](https://arxiv.org/html/2607.00725#bib.bib26)) and BEIR (Thakur et al., [2021](https://arxiv.org/html/2607.00725#bib.bib27)), with classic sparse (Robertson and Zaragoza, [2009](https://arxiv.org/html/2607.00725#bib.bib21)), late-interaction (Khattab and Zaharia, [2020](https://arxiv.org/html/2607.00725#bib.bib24)), and cross-encoder (Nogueira and Cho, [2019](https://arxiv.org/html/2607.00725#bib.bib23)) retrieval as the surrounding context. Readers are instruction-tuned LLMs (Qwen Team, [2025](https://arxiv.org/html/2607.00725#bib.bib42); Touvron et al., [2023](https://arxiv.org/html/2607.00725#bib.bib43); Brown et al., [2020](https://arxiv.org/html/2607.00725#bib.bib44)); the larger rungs of our reader ladder use 4-bit NF4 quantization (Dettmers et al., [2023](https://arxiv.org/html/2607.00725#bib.bib45), [2022](https://arxiv.org/html/2607.00725#bib.bib46)) to fit commodity GPUs, which is why we include a precision control.

#### RAG evaluation.

EM/F1 (Rajpurkar et al., [2016](https://arxiv.org/html/2607.00725#bib.bib47)) measure answer quality, while RAG-specific frameworks score faithfulness and context relevance (Es et al., [2024](https://arxiv.org/html/2607.00725#bib.bib48); Saad-Falcon et al., [2024](https://arxiv.org/html/2607.00725#bib.bib49); Chen et al., [2024](https://arxiv.org/html/2607.00725#bib.bib50)) over knowledge-intensive suites (Petroni et al., [2021](https://arxiv.org/html/2607.00725#bib.bib51); Mallen et al., [2023](https://arxiv.org/html/2607.00725#bib.bib52)). These score the *retrieved* context or the *final* answer; answer-in-context instead measures the packed context the reader sees, and we show it has incremental validity over recall for predicting end-task quality.

## 3 The Answer-in-Context Diagnostic

### 3.1 Definition

Given a question with gold answer set $A$ and a *materialized reader context* $C$ (the concatenation of packed snippets actually shown to the reader), we define:

- answer-in-context $=1$ if some normalized $a \in A$ occurs as a contiguous token subsequence of normalized $C$, else $0$;
- gold-doc reader coverage: fraction of gold documents contributing $\geq 1$ snippet to $C$; all-gold-in-reader: whether *all* of them do;
- gold-token density: fraction of $C$’s tokens drawn from gold documents.

These are computed on the *packed* run, not the retrieved set. This is the key difference from recall@$k$, which is scored on retrieved document ids *before* packing. Answer-in-context is a necessary condition for an extractive-style reader to be correct, and we hypothesize it is the mediator explaining why higher recall need not raise answer quality under a budget.
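The three quantities defined above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the SQuAD-style normalization, whitespace tokenization, the snippet record layout (`doc_id`, `text`), and all function names are assumptions introduced here.

```python
import re
import string
from typing import Dict, Iterable, List, Sequence

_PUNCT = set(string.punctuation)


def normalize(text: str) -> List[str]:
    """Assumed normalization: lowercase, drop punctuation and articles, split on whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in _PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def contains_span(haystack: Sequence[str], needle: Sequence[str]) -> bool:
    """True if `needle` occurs as a contiguous token subsequence of `haystack`."""
    n = len(needle)
    if n == 0:
        return False
    return any(list(haystack[i:i + n]) == list(needle)
               for i in range(len(haystack) - n + 1))


def answer_in_context(gold_answers: Iterable[str], snippets: List[Dict[str, str]]) -> int:
    """1 if any normalized gold answer appears verbatim in the packed context, else 0."""
    # Snippets are joined with a space, so a span may straddle a snippet boundary.
    context = normalize(" ".join(s["text"] for s in snippets))
    return int(any(contains_span(context, normalize(a)) for a in gold_answers))


def gold_doc_coverage(gold_doc_ids: Iterable[str], snippets: List[Dict[str, str]]) -> float:
    """Fraction of gold documents that contribute at least one packed snippet."""
    gold = set(gold_doc_ids)
    if not gold:
        return 0.0
    return len({s["doc_id"] for s in snippets} & gold) / len(gold)


def gold_token_density(gold_doc_ids: Iterable[str], snippets: List[Dict[str, str]]) -> float:
    """Fraction of packed (whitespace) tokens that come from gold documents."""
    gold = set(gold_doc_ids)
    total = sum(len(s["text"].split()) for s in snippets)
    if total == 0:
        return 0.0
    return sum(len(s["text"].split()) for s in snippets if s["doc_id"] in gold) / total
```

All-gold-in-reader is then simply `gold_doc_coverage(...) == 1.0`. Note that every function takes the packed snippets, never the retrieved list, which is the distinction from recall@$k$ drawn above.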

### 3.2 Answer-in-context predicts quality; recall does not

Table 1: Feature–quality correlations on HotpotQA (seed 42, 500 questions, $n=2{,}500$ policy $\times$ question rows, budget 160). Answer-in-context is the strongest single predictor, above both retrieval metrics.

Table [1](https://arxiv.org/html/2607.00725#S3.T1) pools all policy $\times$ question rows on HotpotQA and correlates each diagnostic with answer quality. Answer-in-context is the strongest single predictor, above both retrieval metrics and reader-level coverage. Conditioning directly: mean F1 is $0.596$ when a gold answer is in the reader context versus $0.123$ when it is not (a $+0.47$ gap). This resolves the “lower recall, better answers” paradox: under a budget, what matters is whether the answer *survives into context*, not how many gold documents were retrieved.

### 3.3 Incremental validity: not recall in disguise

![Refer to caption](https://arxiv.org/html/2607.00725v1/aic_validity.png)

Figure 2: Among HotpotQA questions where *all* gold paragraphs were retrieved (recall@5 $=1$), whether packing keeps the answer in context is still decisive: F1 $0.61$ vs. $0.20$, EM $0.50$ vs. $0.11$. $27\%$ of these retrieval-perfect questions drop the answer during packing. Clustered bootstrap on question id, three seeds.

A natural objection is that answer-in-context is near-tautological with correctness, or a proxy for recall. Two analyses refute this. Both pool HotpotQA per-question rows across three seeds $\{42, 13, 7\}$ (10,500 rows, 1,500 questions) with inference *cluster-robust on question id*.

(a) Incremental validity over recall. A model of F1 on recall@5 alone explains $R^2=0.086$; adding answer-in-context raises this to $R^2=0.257$, an increment of $\Delta R^2=+0.17$. The standardized coefficient on answer-in-context ($\beta=+0.21$) is roughly $4\times$ that on recall ($\beta=+0.05$), and the partial correlation of answer-in-context with F1 controlling for recall is $+0.43$. Answer-in-context and recall@5 themselves correlate only $+0.41$, far from the $\approx 1$ that “it is just recall” would require.
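As a consistency check on the figures in (a), the increment and the partial correlation can be related by a standard identity. The worked lines below assume both models are ordinary least-squares fits, which the text does not state explicitly; the subscripts are labels introduced here.

```latex
\Delta R^2 = R^2_{\text{recall}+\text{AIC}} - R^2_{\text{recall}} = 0.257 - 0.086 = 0.171 \approx 0.17

r^2_{\text{F1},\,\text{AIC}\,\mid\,\text{recall}}
  = \frac{\Delta R^2}{1 - R^2_{\text{recall}}}
  = \frac{0.171}{0.914} \approx 0.187
\quad\Longrightarrow\quad
r_{\text{F1},\,\text{AIC}\,\mid\,\text{recall}} \approx 0.43
```

Under that assumption the reported $\Delta R^2$ and the reported partial correlation of $+0.43$ agree with each other.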

(b) It captures the packing step, orthogonal to retrieval. Restrict to questions where retrieval already succeeded, with all gold in the top-5 ($n=7{,}739$). Even here, 27% still drop the answer during packing (Figure [2](https://arxiv.org/html/2607.00725#S3.F2)). Within this retrieval-perfect subset, whether packing keeps the answer is decisive: F1 $0.61$ vs. $0.20$ and EM $0.50$ vs. $0.11$ (a $3.0\times$ / $4.6\times$ gap, tight clustered-bootstrap CIs). This is the cleanest evidence that answer-in-context measures the *packing* step rather than restating retrieval or correctness. (A minority of the 27% are answers that never appear verbatim even in gold passages, which is paraphrase rather than packing failure, so this modestly overstates packing’s share; the predictive-validity conclusion is unaffected.)

### 3.4 Generalization and an interventional test

Table 2: Answer-in-context–F1 correlation across five datasets. Strongest on the two datasets where the packer shows no win, so it is not an artifact of the method. Degenerate on ExpertQA (answers never appear verbatim).

Table [2](https://arxiv.org/html/2607.00725#S3.T2) shows the correlation is not specific to HotpotQA or to our packer; it is in fact strongest on MuSiQue and 2Wiki, where the packer shows no win. This is the key evidence that the diagnostic is a dataset-independent mediator, not a side effect of the method.

#### An interventional test on 2Wiki.

§[3.3](https://arxiv.org/html/2607.00725#S3.SS3) is observational; 2WikiMultiHopQA lets us test the diagnostic *interventionally*. We ran the exact HotpotQA factorial (3B reader, budget 160, seeds $\{42, 13, 7\}$, 500 questions) on 2Wiki, whose retrieval clears the surfacing bar (all-gold@5 $=0.43$). The submodular packer assembles strictly more gold than the focused heuristic (gold-doc coverage $+0.054$, all three seeds), yet it changes answer-in-context by only $-0.007$ and F1 by $-0.008$ (paired bootstrap $p=0.44$, a clean null). Coverage moves; answer-in-context does not; accuracy follows answer-in-context, not coverage. The mechanism is that on 2Wiki’s compositional questions the answer-bearing document is usually the one the heuristic already ranks first, so the *extra* gold the packer adds is bridging evidence that scaffolds reasoning without containing the answer string. This is the interventional counterpart to §[3.3](https://arxiv.org/html/2607.00725#S3.SS3): move coverage but not answer-in-context, and quality does not move. (For long free-form answers such as ExpertQA the verbatim-span diagnostic is degenerate; a semantic/entailment variant would be needed, which we leave to future work.)

## 4 Method: Budgeted Submodular Evidence Packing

### 4.1 Objective

Given retrieved evidence for a query and a hard reader-token budget $B$, we build a candidate set of source-grounded snippets and select a subset $S$ maximizing

$$
F(S) = w_{\mathrm{rel}}\,\mathrm{Rel}(S) + w_{\mathrm{qry}}\,\mathrm{QueryCov}(S) + w_{\mathrm{cov}}\,\mathrm{Repr}(S) + w_{\mathrm{div}}\,\mathrm{Div}(S) \tag{1}
$$

subject to $\mathrm{cost}(S) \leq B$ and a snippet cap. Each term is monotone and submodular, normalized to $[0,1]$: Rel (modular) is the same per-snippet lexical relevance the focused heuristic uses, so heuristic and packer see identical candidates and singleton scores, isolating the *selection rule*; QueryCov is a set-cover over distinct query content terms; Repr is a saturated facility-location term, $\sum_{i} \min\big(\sum_{j \in S} \mathrm{sim}(i,j),\ \alpha \deg_i\big)$, that rewards covering candidate mass but saturates so it cannot be gamed by near-duplicates; Div is a concave-over-documents term, $\sum_{d} \sqrt{\text{relevance mass of } S \text{ in } d}$, spreading selection across sources. We lead with relevance ($w_{\mathrm{rel}}=1.0$, $w_{\mathrm{qry}}=0.5$, $w_{\mathrm{cov}}=0.4$, $w_{\mathrm{div}}=0.3$, $\alpha=0.3$); the other three terms act as coverage/redundancy regularizers that push complementary, answer-bearing evidence into the budget.
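To make the four terms of Eq. (1) concrete, here is a minimal Python sketch of the objective. Several details are assumptions made for illustration, since the text does not spell them out: each term is normalized by its value on the full candidate set, $\deg_i$ is taken to be candidate $i$'s total similarity to all candidates, and the candidate record layout (`doc_id`, `rel`, `terms`) and the name `objective` are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Set

# Weights and saturation constant as quoted in the text.
W_REL, W_QRY, W_COV, W_DIV, ALPHA = 1.0, 0.5, 0.4, 0.3, 0.3


def _doc_mass(indices: Iterable[int], cands: List[dict]) -> Dict[str, float]:
    """Sum of per-snippet relevance grouped by source document."""
    mass: Dict[str, float] = defaultdict(float)
    for j in indices:
        mass[cands[j]["doc_id"]] += cands[j]["rel"]
    return mass


def objective(S: Set[int], cands: List[dict],
              sim: Sequence[Sequence[float]], query_terms: Set[str]) -> float:
    """F(S) for a set S of candidate indices (normalizers are assumptions)."""
    everything = range(len(cands))

    # Rel: modular sum of singleton relevance, divided by the total relevance.
    total_rel = sum(c["rel"] for c in cands)
    rel = sum(cands[j]["rel"] for j in S) / total_rel if total_rel else 0.0

    # QueryCov: fraction of distinct query terms covered by the selection.
    covered = set().union(*(cands[j]["terms"] for j in S)) if S else set()
    qry = len(covered & query_terms) / len(query_terms) if query_terms else 0.0

    # Repr: saturated facility location; deg_i assumed to be i's similarity mass.
    deg = [sum(sim[i][j] for j in everything) for i in everything]
    cap = sum(ALPHA * d for d in deg)
    covered_mass = sum(min(sum(sim[i][j] for j in S), ALPHA * deg[i])
                       for i in everything)
    rep = covered_mass / cap if cap else 0.0

    # Div: square root of selected relevance mass per document.
    full = sum(math.sqrt(m) for m in _doc_mass(everything, cands).values())
    div = sum(math.sqrt(m) for m in _doc_mass(S, cands).values())
    div = div / full if full else 0.0

    return W_REL * rel + W_QRY * qry + W_COV * rep + W_DIV * div
```

With two near-duplicate snippets from one document and a third, less relevant snippet from another, adding the third to the top-ranked snippet scores higher than adding the duplicate: the saturating and concave terms outweigh the duplicate's raw relevance. That is the behaviour the regularizers are described as providing.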

### 4.2 Algorithm

We maximize $F$ with cost-scaled (per-token) greedy: at each step add the feasible snippet with the largest marginal-gain-per-token ratio. This is followed by the Lin–Bilmes singleton fallback: if the single best feasible snippet outscores the greedy set, return it instead. This is the standard constant-factor template for budgeted monotone submodular maximization (Lin and Bilmes, [2011](https://arxiv.org/html/2607.00725#bib.bib37); Nemhauser et al., [1978](https://arxiv.org/html/2607.00725#bib.bib39)). The contribution is not the optimizer (textbook) but (a) applying it to reader-context packing, (b) the four-term objective tied to answer density, and (c) the controlled evaluation isolating the selection rule from the candidate features.
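The greedy-plus-fallback template described above can be sketched as follows. This is an illustrative version under stated assumptions, not the authors' code: `pack` is a hypothetical name, `f` is any monotone set function over candidate indices (for example the objective of Eq. (1)), and `costs` holds per-snippet token counts.

```python
from typing import Callable, List, Sequence, Set


def pack(costs: Sequence[int], f: Callable[[Set[int]], float],
         budget: int, max_snippets: int) -> List[int]:
    """Cost-scaled greedy with a singleton fallback; returns chosen indices."""
    selected: List[int] = []
    chosen: Set[int] = set()
    spent = 0
    value = f(chosen)

    while len(selected) < max_snippets:
        best, best_ratio, best_value = None, 0.0, value
        for j, cost in enumerate(costs):
            if j in chosen or spent + cost > budget:
                continue  # already used, or would break the hard token budget
            new_value = f(chosen | {j})
            ratio = (new_value - value) / max(cost, 1)  # marginal gain per token
            if ratio > best_ratio:
                best, best_ratio, best_value = j, ratio, new_value
        if best is None:
            break  # nothing feasible adds value
        selected.append(best)
        chosen.add(best)
        spent += costs[best]
        value = best_value

    # Singleton fallback: a single expensive snippet may beat the greedy set.
    feasible = [j for j, cost in enumerate(costs) if cost <= budget]
    if feasible and max_snippets >= 1:
        top = max(feasible, key=lambda j: f({j}))
        if f({top}) > value:
            return [top]
    return selected
```

The fallback matters in cases such as a 1-token snippet worth 2 and a 10-token snippet worth 10 under a 10-token budget: per-token greedy takes the cheap snippet first and can then no longer afford the valuable one, so comparing against the best single snippet recovers it.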

### 4.3 Baselines and the factorial

Every packer consumes the *same* candidates, so comparisons isolate the objective. Naive packed: greedily concatenate by relevance until the budget is hit. Focused heuristic: the project’s prior best packer (prefers new query-term coverage across distinct documents, but checks the budget only after the fact and never normalizes gain by length). MMR (Carbonell and Goldstein, [1998](https://arxiv.org/html/2607.00725#bib.bib28)): $\arg\max_{i} \big[\lambda\,\mathrm{rel}(i) - (1-\lambda) \max_{j \in S} \mathrm{sim}(i,j)\big]$, $\lambda=0.7$, the natural “isn’t this just redundancy reduction?” control. Because the same packers apply to flat chunk retrieval or to ACE graph-structured evidence (a source-linked claim/entity graph from earlier project stages), we evaluate a $\{\text{chunk}, \text{ACE}\} \times \{\text{focused}, \text{MMR}, \text{submodular}\}$ factorial plus a naive-packed anchor and a per-question oracle, separating “does the packer help” from “does the representation help.”
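For comparison with the submodular packer, the MMR rule quoted above can be written as a short selection loop. How the paper's MMR baseline enforces the token budget is not specified, so the stopping rule here (keep adding until no remaining snippet fits) is an assumption, as are the name `mmr_pack` and the argument layout.

```python
from typing import List, Sequence


def mmr_pack(rel: Sequence[float], sim: Sequence[Sequence[float]],
             costs: Sequence[int], budget: int, lam: float = 0.7) -> List[int]:
    """Greedy MMR selection under a token budget (budget handling is assumed)."""
    selected: List[int] = []
    remaining = set(range(len(rel)))
    spent = 0

    def score(i: int) -> float:
        # Redundancy is the highest similarity to anything already selected.
        redundancy = max((sim[i][j] for j in selected), default=0.0)
        return lam * rel[i] - (1 - lam) * redundancy

    while True:
        feasible = sorted(i for i in remaining if spent + costs[i] <= budget)
        if not feasible:
            break
        best = max(feasible, key=score)
        selected.append(best)
        remaining.discard(best)
        spent += costs[best]
    return selected
```

Unlike the objective in Eq. (1), this score has no query-coverage or per-document term and does not divide gain by snippet length; it only trades relevance against similarity to what is already selected.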

## 5 Results: The HotpotQA Win

#### Setup.

All runs share a pipeline: bge-small-en-v1.5 embeddings truncated to 320 dimensions, Qwen2.5-3B-Instruct reader, on dual T4 GPUs. HotpotQA uses 500 questions; the headline is replicated across seeds $\{42, 13, 7\}$. The primary budget is 160 reader tokens. Significance is paired bootstrap (10,000 resamples, 95% CI); multi-seed tests pool (seed, question) instances.
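A paired bootstrap of the kind described in the setup can be sketched with the standard library alone. The percentile interval and the function name `paired_bootstrap` are assumptions; the text specifies only the number of resamples and the 95% level.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap(a: Sequence[float], b: Sequence[float],
                     n_resamples: int = 10_000, seed: int = 0
                     ) -> Tuple[float, float, float]:
    """Mean paired difference (a - b) with a 95% percentile bootstrap CI."""
    if len(a) != len(b) or not a:
        raise ValueError("need two equal-length, non-empty score lists")
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]  # one difference per paired instance
    n = len(diffs)
    observed = sum(diffs) / n

    means = []
    for _ in range(n_resamples):
        sample = rng.choices(diffs, k=n)  # resample instances with replacement
        means.append(sum(sample) / n)
    means.sort()

    lo = means[int(0.025 * n_resamples)]
    hi = means[min(int(0.975 * n_resamples), n_resamples - 1)]
    return observed, lo, hi
```

Here `a` and `b` would be per-instance F1 scores for two packers on the same instances; for the pooled multi-seed tests each (seed, question) pair would be one entry. A difference is conventionally read as significant at the 5% level when the interval excludes zero.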

Table 3: Three-seed means, HotpotQA-500, budget 160, 3B reader. chunk\_submod is the best fixed policy on *every* seed, at *fewer* tokens.

#### The packer wins across three seeds.

In Table [3](https://arxiv.org/html/2607.00725#S5.T3), chunk\_submod is the best fixed policy on every seed at $\approx 145$ tokens versus $\approx 152$ for the baselines. Pooled three-seed bootstrap ($n=1{,}500$): submod vs. focused $+0.022$ F1 $[+0.002, +0.041]$; submod vs. naive $+0.051$ F1 $[+0.030, +0.072]$ ($+0.053$ EM); submod vs. MMR $+0.042$ F1; MMR vs. focused $-0.020$ F1 $[-0.034, -0.005]$. Three points: (1) the win is at *lower* cost, not more context; (2) the ordering is submod $>$ focused $>$ packed $>$ mmr; (3) the “it is just MMR” objection is empirically dead: plain MMR is *significantly worse* than the focused heuristic, so generic redundancy reduction hurts and only the full coverage+representativeness+diversity objective wins.

#### Honest twist: packing helps chunk, not ACE.

The packer significantly *hurts* ACE: ace\_submod vs. ace\_focused is $-0.021$ F1 $[-0.039, -0.003]$, and under submodular packing chunk beats ACE. ACE already compresses and de-duplicates at the graph level, so little redundancy remains for the packer to exploit; graph compression and principled packing are partial substitutes. This relocates the contribution from the *representation* to the *packing objective*, a finding only the factorial surfaces.

#### Mechanism: complementary multi-hop assembly.

A per-question decomposition (seed 42) attributes 81% of the submod–focused gain to 37 questions where the packer *newly placed a gold answer into the reader context* ($\approx +0.39$ F1 each). The route is better complementary coverage (all gold documents reach the context on 289 questions under submod vs. 256 under focused), not higher raw token density. The packer wins by moving exactly the quantity the diagnostic measures. These results use a 3B reader; §[6.5](https://arxiv.org/html/2607.00725#S6.SS5) shows the advantage *over the focused heuristic* is specific to this scale, while the win over naive packing and the mechanism persist.

#### Headroom, and why we do not claim a router.

The per-question oracle reaches F1 $\approx 0.60$ vs. the best fixed policy’s $\approx 0.45$. But chunk\_submod is already (tied-)best on 79.5% of questions; the remaining $\approx 20\%$ is an “answer-in-context lottery” whose deciding variable is unobservable at inference time, and an offline router over retrieval features collapses toward the best fixed policy. We therefore report the oracle as *headroom*, not a deployed method.

## 6 When Does Principled Packing Help? A Scope Map

The HotpotQA win is real but *not universal*. We ran controlled experiments to find its boundaries and arrived at four conditions that must co-occur, each presented with the experiment that fails it.

### 6.1 Condition 1: complementary structure

On RAGBench CovidQA ($n=246$) and ExpertQA ($n=203$), test split, with the same factorial at budget 160, submod vs. focused is not significant (CovidQA $-0.010$ F1, $p=0.30$; ExpertQA $+0.005$, $p=0.15$); on CovidQA the focused heuristic is the best chunk packer and ACE regains an edge. These tasks are single-pass with largely all-gold context, so there is no complementary multi-hop structure for the objective to assemble. (Answer-in-context still tracks quality, $r=0.39$ on CovidQA.)

### 6.2 Condition 2: retrieval that surfaces the evidence

MuSiQue is genuinely multi-hop but retrieval-bottlenecked: recall@5 $=0.506$ yet all-gold@5 $=0.184$, so only 18% of questions have all gold retrieved. Submod vs. focused is $+0.011$ F1 ($p=0.34$), and naive packing is just as good; ace\_focused is the best fixed policy. The packer cannot assemble evidence retrieval never surfaced. Yet the diagnostic is *strongest* here ($r=0.54$, Table [2](https://arxiv.org/html/2607.00725#S3.T2)): answer-in-context still governs quality; retrieval simply rarely achieves it.

### 6.3 Ruling out the obvious fix

Table 4: Widening MuSiQue retrieval (top-$k$ $5 \to 12$, nodes $48 \to 64$, expand $5 \to 8$) moves all-gold coverage by zero basis points. The bottleneck is qualitative, not a matter of depth.

A reviewer’s natural objection to §[6.2](https://arxiv.org/html/2607.00725#S6.SS2) is “you just did not retrieve enough.” We tested this: re-running the full MuSiQue factorial with substantially wider retrieval left all-gold coverage *unchanged* (Table [4](https://arxiv.org/html/2607.00725#S6.T4)), and the packer gap stayed null ($+0.003$ F1 at budget 160). The bottleneck is therefore qualitative (the bi-encoder cannot navigate 2–4 hop compositional chains regardless of pool size), which converts a soft negative into a precise statement: this condition needs a *qualitatively different* retriever (iterative or chain-of-thought multi-hop; Trivedi et al., [2023](https://arxiv.org/html/2607.00725#bib.bib16); Xiong et al., [2021](https://arxiv.org/html/2607.00725#bib.bib18)), not more depth.

### 6.4 Condition 3: binding-but-not-extreme budget

![Refer to caption](https://arxiv.org/html/2607.00725v1/budget_sweep.png)

Figure 3: HotpotQA budget sweep (seed 42; $B=160$ is the three-seed result). The submod−focused gap is an inverted-U, significant only at $B \approx 160$ ($\Delta$F1 $+0.035$, $p=0.04$): too tight and nothing complementary fits, too loose and the heuristic catches up. Against naive packing (band) submod wins at *every* budget. Per-budget F1 in Table [7](https://arxiv.org/html/2607.00725#A1.T7).

We predicted the submod advantage would grow monotonically as the budget tightens. It does not (Fig. [3](https://arxiv.org/html/2607.00725#S6.F3)): the gap is an inverted-U peaking at $\approx 160$. At 96 only 2–3 snippets fit (no room for complementarity); at 224 nearly everything fits (the heuristic catches up). Two cleaner facts survive: submod beats naive packing at *every* budget ($\Delta$F1 $+0.044$ to $+0.055$, all $p \leq 0.022$), and submod@160 matches focused@224 quality ($p=0.80$) at $\approx 30\%$ fewer tokens (145 vs. 215), an iso-quality efficiency result.

### 6.5 Condition 4: a reader that is the bottleneck

![Refer to caption](https://arxiv.org/html/2607.00725v1/reader_ladder.png)

Figure 4: Reader-scale ladder (HotpotQA, budget 160, seeds $\{42, 13\}$ pooled). The packer's edge over the *focused heuristic* (blue) is positive at 3B, null at 7B, and significantly *negative* at 14B (${}^{*}p<0.05$); the 7B fp16-vs-4-bit control (hollow diamond) overlaps the fp16 point, so the trend is scale, not quantization. The edge over *naive packing* (red) stays significantly positive at every rung.

The sharpest objection to §[5](https://arxiv.org/html/2607.00725#S5) is scaling: *a stronger reader can recover the answer from messier context, so a packer that merely tidies it is irrelevant at scale.* Rather than test one larger reader, we trace the advantage along a reader-scale ladder (Qwen2.5 at 3B, 7B, and 14B), re-running the exact factorial of §[5](https://arxiv.org/html/2607.00725#S5) and changing only the reader. Because 14B needs 4-bit (NF4) quantization (Dettmers et al., [2023](https://arxiv.org/html/2607.00725#bib.bib45)) to fit dual T4s, we add a same-size precision control (7B in fp16 *and* 4-bit) so any trend is attributable to scale, not quantization.

Figure [4](https://arxiv.org/html/2607.00725#S6.F4) and Table [5](https://arxiv.org/html/2607.00725#S6.T5) give two clean readings. (1) Scale, not quantization. The control is decisive: 4-bit 7B ($-0.008$, $p=0.55$) reproduces fp16 7B ($-0.010$, $p=0.45$) almost exactly, with the same sign, magnitude, null result, and best fixed policy. (2) Absorption then reversal. At 3B the packer beats the focused heuristic ($+0.022$); at 7B the contrast is a symmetric null at both precisions; at 14B the focused heuristic significantly beats the packer ($-0.029$ F1, $p=0.013$). Curation stops paying at $\approx7$B and begins to *cost* by 14B. Throughout, `chunk_submod` still packs strictly more gold (coverage $\approx0.78$ vs. $0.73$) and still beats *naive* packing significantly at every rung ($+0.044$ to $+0.055$ F1, $p\leq0.001$). Reader capability is a *mediator*: the same density edge is passed through readers of rising sensitivity, and once a reader can extract the answer from the focused pack, denser gold buys nothing and the packing overhead (a few extra distractor documents) is a small liability.

Table 5: Reader-scale ladder, pooled 2-seed paired bootstrap ($n=1{,}000$; 3B is the three-seed headline). The precision control rules out quantization.

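
The contrasts in Table 5 are paired-bootstrap comparisons of two packing policies on the same questions. This section does not spell out the resampling procedure, so the following is only a generic sketch of a paired bootstrap on per-question score differences. The function name, the number of resamples, the percentile interval, and the two-sided p-value construction are all assumptions for illustration, not the authors' exact protocol.

```python
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=1000, seed=0):
    """Paired bootstrap on per-question scores of two policies.

    Returns (observed mean difference, 95% percentile interval, two-sided p).
    Assumption: scores_a[i] and scores_b[i] belong to the same question.
    """
    assert len(scores_a) == len(scores_b) and len(scores_a) > 0
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = sum(diffs) / n

    # Resample questions with replacement, keeping each pair intact.
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()

    lo = means[int(0.025 * n_resamples)]
    hi = means[int(0.975 * n_resamples) - 1]

    # Two-sided p: twice the smaller tail mass on either side of zero.
    frac_le = sum(m <= 0 for m in means) / n_resamples
    frac_ge = sum(m >= 0 for m in means) / n_resamples
    p = min(1.0, 2 * min(frac_le, frac_ge))
    return observed, (lo, hi), p
```

Resampling the per-question differences, rather than each policy's scores independently, is what makes the test paired: question difficulty cancels out of the contrast.
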
### 6.6 Synthesis

1. Multi-hop, complementary structure? If no: no gain (RAGBench, single-pass).
2. Retrieval surfaces the evidence? If no: no gain (MuSiQue, all-gold@5 $=.18$).
3. Budget binding, but not extreme? If no: no gain ($B=96$ or $224$).
4. Reader is the bottleneck? If no: *reversal* ($\gtrsim$7B; flips by 14B).

If all four are yes: packer wins (HotpotQA, $B\approx160$, 3B reader).

Figure 5: When does principled packing beat the best heuristic? All four conditions must hold; each "no" is a regime we test where the win disappears: RAGBench (§[6.1](https://arxiv.org/html/2607.00725#S6.SS1)), MuSiQue (§[6.2](https://arxiv.org/html/2607.00725#S6.SS2)–[6.3](https://arxiv.org/html/2607.00725#S6.SS3)), the budget extremes (§[6.4](https://arxiv.org/html/2607.00725#S6.SS4)), and larger readers (§[6.5](https://arxiv.org/html/2607.00725#S6.SS5)). Conditions 1–3 gate the win on or off; condition 4 *reverses* it.

HotpotQA at budget $\approx160$ with a 3B reader is where all four conditions hold (Fig. [5](https://arxiv.org/html/2607.00725#S6.F5)), and there the win is large, significant, and three-seed robust: a narrow but *precise and mechanistically complete* scope. Conditions 1–3 are properties of the task and budget; when any of them fails, the packer cannot help at all. Condition 4 is different in kind: the mechanism still operates (it packs strictly denser gold), but a strong enough reader stops *needing* the completeness and by 14B mildly *prefers* the cleaner pack. In every case the diagnostic is the unifying variable: each boundary is a distinct reason the packer fails to raise answer-in-context, and accuracy tracks answer-in-context throughout (an interventional dissociation confirmed directly in §[3.4](https://arxiv.org/html/2607.00725#S3.SS4)).

## 7 Discussion

#### Why answer-in-context, not recall.

Recall is scored on a set the reader never sees. Under a budget, the binding question is whether the answer survives packing. Answer-in-context makes the budgeted-RAG objective observable and turns "retrieve better" into the sharper "pack so the answer survives." It is cheap (a token-subsequence check) and, where gold answers are short spans, broadly applicable.
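
To make the "token-subsequence check" concrete, here is a minimal sketch that tests whether a gold answer appears as a contiguous token span in the packed reader context. The function names and the SQuAD-style normalization (lowercasing, stripping punctuation and articles) are assumptions for illustration; the paper's exact tokenization and matching rules may differ.

```python
import re
import string

def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace.

    Assumption: SQuAD-style normalization; not confirmed by the source.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()

def answer_in_context(answer: str, packed_context: str) -> bool:
    """True if the normalized answer is a contiguous token span of the context.

    The check runs on the packed context (what the reader sees),
    not on the full retrieved set.
    """
    ans = normalize(answer)
    ctx = normalize(packed_context)
    if not ans:
        return False
    n = len(ans)
    return any(ctx[i:i + n] == ans for i in range(len(ctx) - n + 1))
```

Running the same check on the retrieved set and on the packed context separates questions whose answer was retrieved but dropped during packing from those where it was never retrieved at all.
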

#### Why the submodular packer works when it works.

The win comes neither from extra context (submod uses *fewer* tokens) nor from generic de-duplication (MMR loses). It comes from the coverage + representativeness + diversity objective assembling *complementary* multi-hop evidence: 81% of the gain is on questions where the answer newly enters context. The diagnostic and the method describe the same phenomenon from two directions.
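
As a rough illustration of how a budgeted submodular packer can favor complementary evidence over redundant evidence, the sketch below runs a cost-scaled greedy selection under a token budget. It is not the paper's objective: the `Snippet` fields, the three terms (modular relevance, query-term coverage, and a square-root-per-cluster diversity reward in the style of Lin and Bilmes, 2011), the unit weights, and the single-item safeguard are all assumptions chosen to keep the example short. Relevance scores are assumed non-negative and token costs positive.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Snippet:
    sid: str
    tokens: int             # cost against the reader budget (assumed > 0)
    relevance: float        # hypothetical retriever score, assumed >= 0
    query_terms: frozenset  # query terms this snippet mentions
    cluster: int            # hypothetical topical cluster id

def objective(selected, n_query_terms, w_rel=1.0, w_cov=1.0, w_div=1.0):
    """Monotone submodular score: relevance + coverage + diversity reward."""
    rel = sum(s.relevance for s in selected)
    covered = set()
    for s in selected:
        covered |= s.query_terms
    cov = len(covered) / max(n_query_terms, 1)
    per_cluster = {}
    for s in selected:
        per_cluster[s.cluster] = per_cluster.get(s.cluster, 0.0) + s.relevance
    # The concave sqrt gives diminishing returns within a cluster.
    div = sum(math.sqrt(v) for v in per_cluster.values())
    return w_rel * rel + w_cov * cov + w_div * div

def greedy_pack(candidates, budget, n_query_terms):
    """Cost-scaled greedy: repeatedly add the best gain-per-token snippet that fits."""
    selected, used = [], 0
    remaining = list(candidates)
    current = objective(selected, n_query_terms)
    while remaining:
        best, best_ratio, best_val = None, 0.0, current
        for s in remaining:
            if used + s.tokens > budget:
                continue
            val = objective(selected + [s], n_query_terms)
            ratio = (val - current) / s.tokens
            if ratio > best_ratio:
                best, best_ratio, best_val = s, ratio, val
        if best is None:
            break
        selected.append(best)
        used += best.tokens
        current = best_val
        remaining.remove(best)
    # Safeguard: fall back to the best single affordable snippet if it scores higher.
    singles = [s for s in candidates if s.tokens <= budget]
    if singles:
        top = max(singles, key=lambda s: objective([s], n_query_terms))
        if objective([top], n_query_terms) > current:
            return [top]
    return selected
```

With two near-duplicate high-relevance snippets and one lower-relevance snippet that covers a different query term, this greedy picks one of the duplicates and then the complementary snippet, which is the qualitative behavior the paragraph above describes.
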

#### Why the honest scope is the point.

The factorial surfaced a finding we would otherwise have overclaimed: packing helps chunk, not ACE, because graph compression already removes the redundancy the packer exploits. And the four-condition scope (a falsified monotonicity prediction, a retrieval-unlock ablation that ruled out the easy fix, and a quantization-controlled reader-scale ladder) is the kind of boundary-mapping that makes a conditional claim trustworthy. Locating *where* the packer stops paying (and by 14B starts to cost) tells a practitioner exactly when to reach for it (small, efficient readers under tight budgets) and when to prefer the simple heuristic.

## Limitations

The headline win is demonstrated on *one* dataset (HotpotQA) at one budget regime; the cross-dataset experiments are negatives/boundaries by design, so the positive claim rests on HotpotQA. The reader ladder spans 3B/7B/14B but within a *single* model family (Qwen2.5) and a single embedder (bge-small-en); whether the diagnostic's predictive power and the packer's mechanism hold for other reader families, stronger or instruction-tuned retrievers, and 32B+ readers is untested. Some sweeps (budget 96/128/224; the MuSiQue runs) are single-seed; only the budget-160 headline is three-seed. Answer-in-context is span-based and therefore degenerate for long free-form answers (ExpertQA), where a semantic/entailment variant is needed. The ACE graph construction is heuristic, so the "packing substitutes for graph compression" reading should be taken with that caveat. Finally, we measure EM/F1 and context properties, not attribution faithfulness (Es et al., [2024](https://arxiv.org/html/2607.00725#bib.bib48); Saad-Falcon et al., [2024](https://arxiv.org/html/2607.00725#bib.bib49)); a faithfulness-aware version of answer-in-context is left to future work.

## 8 Conclusion

Budget-constrained multi-hop RAG is bottlenecked not by how many gold documents are retrieved but by whether the answer survives packing into the reader context. We introduced answer-in-context, a diagnostic that captures this and predicts answer quality better than retrieval recall across five datasets, confirmed both observationally ($\Delta R^{2}=+0.17$ over recall) and interventionally (a 2Wiki manipulation that moves coverage but not answer-in-context leaves accuracy flat). We introduced a budgeted submodular evidence packer that, with a 3B reader on HotpotQA, significantly and robustly improves answer quality at equal-or-lower token cost by assembling complementary multi-hop evidence. And we mapped the scope of that win to four conditions, each shown to remove the win when it fails, including a quantization-controlled reader-scale ladder (3B $\to$ 7B $\to$ 14B) showing the edge over the best heuristic is absorbed by 7B and significantly reverses by 14B, while the packer's mechanism and its win over naive packing persist throughout. The result is a general diagnostic plus a conditional, mechanistically explained method, sharply located where it pays off: evidence-bottlenecked, not reader-bottlenecked, budgeted multi-hop QA.

## References

- A. Asai, Z. Wu, Y. Wang, A. Sil, and H. Hajishirzi (2024) Self-RAG: learning to retrieve, generate, and critique through self-reflection. In International Conference on Learning Representations (ICLR). Cited by: [§2](https://arxiv.org/html/2607.00725#S2.SS0.SSS0.Px1.p1.1).
- Y. Bai, X. Lv, J. Zhang, H. Lyu, J. Tang, Z. Huang, Z. Du, X. Liu, A. Zeng, L. Hou, Y. Dong, J. Tang, and J. Li (2024) LongBench: a bilingual, multitask benchmark for long context understanding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), pp. 3119–3137. Cited by: [§2](https://arxiv.org/html/2607.00725#S2.SS0.SSS0.Px3.p1.1).
- J. Bilmes (2022) Submodularity in machine learning and artificial intelligence. arXiv preprint arXiv:2202.00132. Cited by: [§2](https://arxiv.org/html/2607.00725#S2.SS0.SSS0.Px4.p1.1).
- S. Borgeaud, A. Mensch, J. Hoffmann, T. Cai, E. Rutherford, K. Millican, et al. (2022) Improving language models by retrieving from trillions of tokens. In Proceedings of the 39th International Conference on Machine Learning (ICML), pp. 2206–2240. Cited by: [§2](https://arxiv.org/html/2607.00725#S2.SS0.SSS0.Px1.p1.1).
- T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, et al. (2020) Language models are few-shot learners. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 33, pp. 1877–1901. Cited by: [§2](https://arxiv.org/html/2607.00725#S2.SS0.SSS0.Px5.p1.1).
- J. Carbonell and J. Goldstein (1998) The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pp. 335–336. Cited by: [§2](https://arxiv.org/html/2607.00725#S2.SS0.SSS0.Px3.p1.1), [§4.3](https://arxiv.org/html/2607.00725#S4.SS3.p1.5).
- J. Chen, H. Lin, X. Han, and L. Sun (2024) Benchmarking large language models in retrieval-augmented generation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38, pp. 17754–17762. Cited by: [§2](https://arxiv.org/html/2607.00725#S2.SS0.SSS0.Px6.p1.1).
- T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer (2022) LLM.int8(): 8-bit matrix multiplication for transformers at scale. In Advances in Neural Information Processing Systems (NeurIPS). Cited by: [§2](https://arxiv.org/html/2607.00725#S2.SS0.SSS0.Px5.p1.1).
- T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer (2023) QLoRA: efficient finetuning of quantized LLMs. In Advances in Neural Information Processing Systems (NeurIPS). Cited by: [§2](https://arxiv.org/html/2607.00725#S2.SS0.SSS0.Px5.p1.1), [§6.5](https://arxiv.org/html/2607.00725#S6.SS5.p1.1).
- S. Es, J. James, L. Espinosa Anke, and S. Schockaert (2024) RAGAS: automated evaluation of retrieval augmented generation. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (EACL): System Demonstrations, pp. 150–158. Cited by: [§2](https://arxiv.org/html/2607.00725#S2.SS0.SSS0.Px6.p1.1), [Limitations](https://arxiv.org/html/2607.00725#Sx1.p1.1).
- Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, and H. Wang (2023) Retrieval-augmented generation for large language models: a survey. arXiv preprint arXiv:2312.10997. Cited by: [§2](https://arxiv.org/html/2607.00725#S2.SS0.SSS0.Px1.p1.1).
- K. Guu, K. Lee, Z. Tung, P. Pasupat, and M. Chang (2020) REALM: retrieval-augmented language model pre-training. In Proceedings of the 37th International Conference on Machine Learning (ICML), pp. 3929–3938. Cited by: [§2](https://arxiv.org/html/2607.00725#S2.SS0.SSS0.Px1.p1.1).
- X. Ho, A. Duong Nguyen, S. Sugawara, and A. Aizawa (2020) Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. In Proceedings of the 28th International Conference on Computational Linguistics (COLING), pp. 6609–6625. Cited by: [§2](https://arxiv.org/html/2607.00725#S2.SS0.SSS0.Px2.p1.1).
- G. Izacard and E. Grave (2021) Leveraging passage retrieval with generative models for open domain question answering. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pp. 874–880. Cited by: [§2](https://arxiv.org/html/2607.00725#S2.SS0.SSS0.Px1.p1.1).
- G. Izacard, P. Lewis, M. Lomeli, L. Hosseini, F. Petroni, T. Schick, J. Dwivedi-Yu, A. Joulin, S. Riedel, and E. Grave (2023) Atlas: few-shot learning with retrieval augmented language models. Journal of Machine Learning Research (JMLR) 24(251), pp. 1–43. Cited by: [§2](https://arxiv.org/html/2607.00725#S2.SS0.SSS0.Px1.p1.1).
- H. Jiang, Q. Wu, C. Lin, Y. Yang, and L. Qiu (2023a) LLMLingua: compressing prompts for accelerated inference of large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 13358–13376. Cited by: [§2](https://arxiv.org/html/2607.00725#S2.SS0.SSS0.Px3.p1.1).
- Z. Jiang, F. Xu, L. Gao, Z. Sun, Q. Liu, J. Dwivedi-Yu, Y. Yang, J. Callan, and G. Neubig (2023b) Active retrieval augmented generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 7969–7992. Cited by: [§2](https://arxiv.org/html/2607.00725#S2.SS0.SSS0.Px2.p1.1).
- V. Karpukhin, B. Oğuz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, and W. Yih (2020) Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6769–6781. Cited by: [§2](https://arxiv.org/html/2607.00725#S2.SS0.SSS0.Px1.p1.1).
- O. Khattab, K. Santhanam, X. L. Li, D. Hall, P. Liang, C. Potts, and M. Zaharia (2022) Demonstrate-search-predict: composing retrieval and language models for knowledge-intensive NLP. arXiv preprint arXiv:2212.14024. Cited by: [§2](https://arxiv.org/html/2607.00725#S2.SS0.SSS0.Px2.p1.1).
- O. Khattab and M. Zaharia (2020) ColBERT: efficient and effective passage search via contextualized late interaction over BERT. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pp. 39–48. Cited by: [§2](https://arxiv.org/html/2607.00725#S2.SS0.SSS0.Px5.p1.1).
- A. Krause and D. Golovin (2014) Submodular function maximization. In Tractability: Practical Approaches to Hard Problems, pp. 71–104. Cited by: [§2](https://arxiv.org/html/2607.00725#S2.SS0.SSS0.Px4.p1.1).
- P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih, T. Rocktäschel, S. Riedel, and D. Kiela (2020) Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 33, pp. 9459–9474. Cited by: [§1](https://arxiv.org/html/2607.00725#S1.p1.1), [§2](https://arxiv.org/html/2607.00725#S2.SS0.SSS0.Px1.p1.1).
- Y. Li, B. Dong, F. Guerin, and C. Lin (2023) Compressing context to enhance inference efficiency of large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6342–6353. Cited by: [§2](https://arxiv.org/html/2607.00725#S2.SS0.SSS0.Px3.p1.1).
- H. Lin and J. Bilmes (2010) Multi-document summarization via budgeted maximization of submodular functions. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT), pp. 912–920. Cited by: [§2](https://arxiv.org/html/2607.00725#S2.SS0.SSS0.Px4.p1.1).
- H. Lin and J. Bilmes (2011) A class of submodular functions for document summarization. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT), pp. 510–520. Cited by: [§2](https://arxiv.org/html/2607.00725#S2.SS0.SSS0.Px4.p1.1), [§4.2](https://arxiv.org/html/2607.00725#S4.SS2.p1.1).
- X. V. Lin, X. Chen, M. Chen, W. Shi, M. Lomeli, R. James, P. Rodriguez, J. Kahn, G. Szilvasy, M. Lewis, L. Zettlemoyer, and S. Yih (2024) RA-DIT: retrieval-augmented dual instruction tuning. In International Conference on Learning Representations (ICLR). Cited by: [§2](https://arxiv.org/html/2607.00725#S2.SS0.SSS0.Px1.p1.1).
- N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang (2024) Lost in the middle: how language models use long contexts. Transactions of the Association for Computational Linguistics (TACL) 12, pp. 157–173. Cited by: [§2](https://arxiv.org/html/2607.00725#S2.SS0.SSS0.Px3.p1.1).
- A. Mallen, A. Asai, V. Zhong, R. Das, D. Khashabi, and H. Hajishirzi (2023) When not to trust language models: investigating the effectiveness of parametric and non-parametric memories. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), pp. 9802–9822. Cited by: [§2](https://arxiv.org/html/2607.00725#S2.SS0.SSS0.Px6.p1.1).
- N. Muennighoff, N. Tazi, L. Magne, and N. Reimers (2023) MTEB: massive text embedding benchmark. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pp. 2014–2037. Cited by: [§2](https://arxiv.org/html/2607.00725#S2.SS0.SSS0.Px5.p1.1).
- G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher (1978) An analysis of approximations for maximizing submodular set functions—I. Mathematical Programming 14(1), pp. 265–294. Cited by: [§2](https://arxiv.org/html/2607.00725#S2.SS0.SSS0.Px4.p1.1), [§4.2](https://arxiv.org/html/2607.00725#S4.SS2.p1.1).
- R. Nogueira and K. Cho (2019) Passage re-ranking with BERT. arXiv preprint arXiv:1901.04085. Cited by: [§2](https://arxiv.org/html/2607.00725#S2.SS0.SSS0.Px5.p1.1).
- F. Petroni, A. Piktus, A. Fan, P. Lewis, M. Yazdani, N. De Cao, J. Thorne, Y. Jernite, V. Karpukhin, J. Maillard, V. Plachouras, T. Rocktäschel, and S. Riedel (2021) KILT: a benchmark for knowledge intensive language tasks. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 2523–2544. Cited by: [§2](https://arxiv.org/html/2607.00725#S2.SS0.SSS0.Px6.p1.1).
- O. Press, M. Zhang, S. Min, L. Schmidt, N. A. Smith, and M. Lewis (2023) Measuring and narrowing the compositionality gap in language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 5687–5711. Cited by: [§2](https://arxiv.org/html/2607.00725#S2.SS0.SSS0.Px2.p1.1).
- Qwen Team (2025) Qwen2.5 technical report. arXiv preprint arXiv:2412.15115. Cited by: [§2](https://arxiv.org/html/2607.00725#S2.SS0.SSS0.Px5.p1.1).
- P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 2383–2392. Cited by: [§2](https://arxiv.org/html/2607.00725#S2.SS0.SSS0.Px6.p1.1).
- O. Ram, Y. Levine, I. Dalmedigos, D. Muhlgay, A. Shashua, K. Leyton-Brown, and Y. Shoham (2023) In-context retrieval-augmented language models. Transactions of the Association for Computational Linguistics (TACL) 11, pp. 1316–1331. Cited by: [§1](https://arxiv.org/html/2607.00725#S1.p1.1), [§2](https://arxiv.org/html/2607.00725#S2.SS0.SSS0.Px1.p1.1).
- N. Reimers and I. Gurevych (2019) Sentence-BERT: sentence embeddings using siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP-IJCNLP), pp. 3982–3992. Cited by: [§2](https://arxiv.org/html/2607.00725#S2.SS0.SSS0.Px5.p1.1).
- S. Robertson and H. Zaragoza (2009) The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval 3(4), pp. 333–389. Cited by: [§2](https://arxiv.org/html/2607.00725#S2.SS0.SSS0.Px5.p1.1).
- J. Saad-Falcon, O. Khattab, C. Potts, and M. Zaharia (2024) ARES: an automated evaluation framework for retrieval-augmented generation systems. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 338–354. Cited by: [§2](https://arxiv.org/html/2607.00725#S2.SS0.SSS0.Px6.p1.1), [Limitations](https://arxiv.org/html/2607.00725#Sx1.p1.1).
- Z. Shao, Y. Gong, Y. Shen, M. Huang, N. Duan, and W. Chen (2023) Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 9248–9274. Cited by: [§2](https://arxiv.org/html/2607.00725#S2.SS0.SSS0.Px2.p1.1).
- W. Shi, S. Min, M. Yasunaga, M. Seo, R. James, M. Lewis, L. Zettlemoyer, and W. Yih (2024) REPLUG: retrieval-augmented black-box language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 8364–8377. Cited by: [§2](https://arxiv.org/html/2607.00725#S2.SS0.SSS0.Px1.p1.1).
- N. Thakur, N. Reimers, A. Rücklé, A. Srivastava, and I. Gurevych (2021) BEIR: a heterogeneous benchmark for zero-shot evaluation of information retrieval models. In Advances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track. Cited by: [§2](https://arxiv.org/html/2607.00725#S2.SS0.SSS0.Px5.p1.1).
- H. Touvron, L. Martin, K. Stone, et al. (2023) Llama 2: open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288. Cited by: [§2](https://arxiv.org/html/2607.00725#S2.SS0.SSS0.Px5.p1.1).
- H. Trivedi, N. Balasubramanian, T. Khot, and A. Sabharwal (2022) MuSiQue: multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics (TACL) 10, pp. 539–554. Cited by: [§1](https://arxiv.org/html/2607.00725#S1.p2.1), [§2](https://arxiv.org/html/2607.00725#S2.SS0.SSS0.Px2.p1.1).
- H. Trivedi, N. Balasubramanian, T. Khot, and A. Sabharwal (2023) Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), pp. 10014–10037. Cited by: [§2](https://arxiv.org/html/2607.00725#S2.SS0.SSS0.Px2.p1.1), [§6.3](https://arxiv.org/html/2607.00725#S6.SS3.p1.1).
- Z. Wang, J. Araki, Z. Jiang, M. R. Parvez, and G. Neubig (2023) Learning to filter context for retrieval-augmented generation. arXiv preprint arXiv:2311.08377. Cited by: [§2](https://arxiv.org/html/2607.00725#S2.SS0.SSS0.Px3.p1.1).
- J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, and D. Zhou (2022) Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems (NeurIPS). Cited by: [§2](https://arxiv.org/html/2607.00725#S2.SS0.SSS0.Px2.p1.1).
- J. Welbl, P. Stenetorp, and S. Riedel (2018) Constructing datasets for multi-hop reading comprehension across documents. Transactions of the Association for Computational Linguistics (TACL) 6, pp. 287–302. Cited by: [§2](https://arxiv.org/html/2607.00725#S2.SS0.SSS0.Px2.p1.1).
- S. Xiao, Z. Liu, P. Zhang, N. Muennighoff, D. Lian, and J. Nie (2024) C-Pack: packed resources for general Chinese embeddings. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pp. 641–649. Cited by: [§2](https://arxiv.org/html/2607.00725#S2.SS0.SSS0.Px5.p1.1).
- W. Xiong, X. L. Li, S. Iyer, J. Du, P. Lewis, W. Y. Wang, Y. Mehdad, W. Yih, S. Riedel, D. Kiela, and B. Oğuz (2021) Answering complex open-domain questions with multi-hop dense retrieval. In International Conference on Learning Representations (ICLR). Cited by: [§2](https://arxiv.org/html/2607.00725#S2.SS0.SSS0.Px2.p1.1), [§6.3](https://arxiv.org/html/2607.00725#S6.SS3.p1.1).
- F. Xu, W. Shi, and E. Choi (2024a) RECOMP: improving retrieval-augmented LMs with context compression and selective augmentation. In International Conference on Learning Representations (ICLR). Cited by: [§2](https://arxiv.org/html/2607.00725#S2.SS0.SSS0.Px3.p1.1).
- P. Xu, W. Ping, X. Wu, L. McAfee, C. Zhu, Z. Liu, S. Subramanian, E. Bakhturina, M. Shoeybi, and B. Catanzaro (2024b) Retrieval meets long context large language models. In International Conference on Learning Representations (ICLR). Cited by: [§2](https://arxiv.org/html/2607.00725#S2.SS0.SSS0.Px3.p1.1).
- Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. Cohen, R. Salakhutdinov, and C. D. Manning (2018) HotpotQA: a dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 2369–2380. Cited by: [§1](https://arxiv.org/html/2607.00725#S1.p2.1), [§2](https://arxiv.org/html/2607.00725#S2.SS0.SSS0.Px2.p1.1).
- O. Yoran, T. Wolfson, O. Ram, and J. Berant (2024) Making retrieval-augmented language models robust to irrelevant context. In International Conference on Learning Representations (ICLR). Cited by: [§2](https://arxiv.org/html/2607.00725#S2.SS0.SSS0.Px3.p1.1).

## Appendix A Single-Seed Reference Table

Table [6](https://arxiv.org/html/2607.00725#A1.T6) gives the per-policy means underlying the answer-in-context mediation (§[3.2](https://arxiv.org/html/2607.00725#S3.SS2)) and decomposition (§[5](https://arxiv.org/html/2607.00725#S5)), computed on seed 42's 500 questions.

Table 6: Seed-42 per-policy means, HotpotQA-500, budget 160, 3B reader. g-cov $=$ gold-doc reader coverage; all-g $=$ all-gold-in-reader.

Table 7: Per-budget F1 underlying the budget sweep (Fig. [3](https://arxiv.org/html/2607.00725#S6.F3); seed 42, $B=160$ is the three-seed result). The submod $-$ focused gap is an inverted-U peaking at $\approx160$, not monotone.

## Appendix B Reader-Scale Reference Tables

The packing/diagnostic columns are reader-independent by construction, so they are identical across rungs; only EM/F1 move. Table [8](https://arxiv.org/html/2607.00725#A2.T8) (7B fp16) and Table [9](https://arxiv.org/html/2607.00725#A2.T9) (14B 4-bit) are the two ends of the ladder. The 7B 4-bit control reproduces 7B fp16 (submod $-$ focused $-0.007$ F1, $p=0.55$; same best policy on both seeds), with absolute F1 $\approx1$–2 points lower (the quantization tax) but the contrast unchanged.

Table 8: 7B fp16. Per-seed best: seed 42 $\to$ `chunk_submod` (0.396); seed 13 $\to$ `chunk_focused` (0.407).

Table 9: 14B 4-bit. Per-seed best: seed 42 $\to$ `chunk_focused` (0.459); seed 13 $\to$ `ace_focused` (0.448). The focused policies are best, the opposite of 3B, with identical packing underneath.

## Appendix C 2WikiMultiHopQA Interventional Check

3B reader, budget 160, seeds $\{42, 13, 7\}$, 500 questions. Retrieval gate: recall@5 $=0.718$, all-gold@5 $=0.43$. Key contrast, pooled 3-seed bootstrap ($n=1{,}500$): `chunk_submod` $-$ `chunk_focused` $=-0.008$ F1 $[-0.027, +0.012]$, $p=0.44$, with coverage $+0.054$ but answer-in-context $-0.007$. Coverage and answer-in-context move in opposite directions, and F1 follows answer-in-context. Conditional F1 is $0.56$ when the answer is in context versus $0.08$ when not.
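
The conditional F1 figures above are group means of per-question F1, split by whether the answer is in the packed context. A minimal sketch of that computation follows, assuming whitespace-tokenized, lowercased token F1 and a precomputed boolean flag per question. The function names and the record layout are hypothetical, and the paper's scorer may normalize answers differently.

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted and a gold answer.

    Assumption: lowercased whitespace tokens, no further normalization.
    """
    pred, ref = prediction.lower().split(), gold.lower().split()
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def conditional_mean_f1(records):
    """Mean F1 split by the answer-in-context flag.

    records: iterable of (f1, answer_in_context) pairs, one per question.
    Returns {True: mean or None, False: mean or None}.
    """
    groups = {True: [], False: []}
    for f1, in_ctx in records:
        groups[bool(in_ctx)].append(f1)
    return {k: (sum(v) / len(v) if v else None) for k, v in groups.items()}
```

Reporting the two group means side by side, as the appendix does, shows how strongly accuracy depends on the diagnostic without fitting any model.
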
