A Four-Condition Diagnostic Protocol for Evidence Utilization in Long-Context and Retrieval-Augmented Language Models

arXiv cs.CL Papers

Summary

This paper introduces a four-condition diagnostic protocol to separate no-evidence answerability, oracle-evidence recoverability, full-context utilization, and retrieval-conditioned utilization in long-context and retrieval-augmented language models, tested on five open-weight models across multiple datasets.

arXiv:2606.06758v1. Abstract: Final-answer accuracy, retrieval recall, and citation overlap do not by themselves identify whether a long-context or retrieval-augmented language model used the evidence it was given. A model can answer from parametric memory, fail despite receiving the right passages, or cite evidence without converting it into the requested answer. This paper proposes a matched four-condition evidence-availability protocol (no evidence, full context, retrieved evidence, and oracle-evidence reference) for diagnosing evidence utilization under fixed examples, prompts, score fields, retrieval settings, and validity checks. ONCU is used as a protocol-bound estimator of recovered oracle-reference evidence advantage and is computed only for denominator-valid groups; denominator-free answer, evidence, retrieval, and failure-audit metrics are reported separately. The empirical study evaluates five local open-weight models from the Qwen, Gemma, Llama, and Mistral families across Controlled-ONCU-safe16K, HotpotQA-ONCU, and 2WikiMultiHopQA-ONCU, with 18,000 ONCU-compatible predictions. The main finding is a task-dependent bottleneck split: controlled synthetic settings primarily expose full-context utilization failures, whereas the tested realistic multi-hop settings primarily expose retrieval-chain coverage failures in denominator-free answer and evidence metrics, with ONCU supporting the same direction on oracle-improving groups. The contribution is a diagnostic protocol for separating no-evidence answerability, oracle-evidence recoverability, full-context utilization, and retrieval-conditioned utilization, rather than a single-score leaderboard for long-context or retrieval-augmented systems.

# A Four-Condition Diagnostic Protocol for Evidence Utilization in Long-Context and Retrieval-Augmented Language Models
Source: [https://arxiv.org/html/2606.06758](https://arxiv.org/html/2606.06758)

###### Abstract\.

Background: Final\-answer accuracy, retrieval recall, and citation overlap do not by themselves identify whether a long\-context or retrieval\-augmented language model used the evidence it was given\. A model can answer from parametric memory, fail despite receiving the right passages, or cite evidence without converting it into the requested answer\.

Objectives: We study how a matched diagnostic protocol can separate no\-evidence answerability, oracle\-evidence recoverability, full\-context utilization, and retrieval\-conditioned utilization when contextual evidence is supplied under specified evaluation conditions\.

Methods: We propose a matched four\-condition evidence\-availability protocol \(no evidence, full context, retrieved evidence, and oracle\-evidence reference\) that binds each score term to a distinct diagnostic role under fixed examples, prompts, score fields, and validity checks\. ONCU is used as the protocol\-bound estimator of recovered oracle\-reference evidence advantage and is interpreted only for denominator\-valid groups\. The empirical study covers five local open\-weight models from the Qwen, Gemma, Llama, and Mistral families, with 18,000 ONCU\-compatible predictions over Controlled\-ONCU\-safe16K, HotpotQA\-ONCU, and 2WikiMultiHopQA\-ONCU\.

Results: Under the tested local\-model, dataset, and retriever conditions, the main finding is a task\-dependent bottleneck split\. Controlled synthetic settings primarily expose full\-context utilization failures, whereas the tested realistic multi\-hop settings expose retrieval\-chain coverage failures in denominator\-free answer/evidence metrics, with ONCU supporting the same direction on oracle\-improving groups\. Dense@16 and hybrid@16 retrieved inputs narrow some gaps but do not overturn the full\-context\-over\-retrieved pattern in the tested realistic multi\-hop protocol\.

Conclusions: The contribution is a protocol\-level diagnostic framework for the tested local open\-weight and reconstructed QA settings\. It separates no\-evidence answerability, oracle\-reference recoverability, full\-context utilization, and retrieval\-conditioned utilization without treating ONCU as a universal ranking score\.

## 1\.Introduction

Long\-context language models and retrieval\-augmented systems are often evaluated by final answer accuracy\. Accuracy is necessary, but it does not identify how a model used evidence\. A model can answer a question correctly without reading the supplied context, fail despite receiving the correct evidence, or produce a correct\-looking answer while citing irrelevant passages\. Conversely, a retrieve\-then\-read pipeline can improve final accuracy because retrieval removes distractors, or it can reduce accuracy because retrieval drops one hop of a multi\-hop evidence chain\. These cases are scientifically different, but final accuracy alone collapses them\.

We ask a diagnostic question: when contextual evidence is made available to a model, how much of the recoverable evidence\-derived advantage is actually recovered by the tested input condition? The question is not equivalent to asking whether a model has a long context window, whether a retriever has high recall, or whether an answer is correct\. It requires comparing at least three quantities for the same model and examples: what the model can answer without evidence, what it can answer when the relevant evidence is isolated, and what it can answer under the actual full\-context or retrieved\-evidence condition\.

We use *evidence utilization* operationally in this paper\. The protocol measures observable condition\-level behavior: how answer and evidence scores change when evidence availability is changed under fixed prompts, decoding controls, retrieval settings, and scoring rules\. It does not claim mechanistic causal attribution to internal model states, attention paths, or hidden computations for individual answers\.

We therefore frame evidence\-utilization evaluation as a matched diagnostic protocol\. The protocol fixes four evidence\-availability conditions: no evidence, full context, retrieved evidence, and oracle\-evidence reference\. The answer contract, decoding policy, retrieval settings, and scoring pipeline are held fixed across conditions\. ONCU is the protocol’s baseline\-adjusted recovered\-advantage estimator: it normalizes a contextual condition’s score between the no\-evidence baseline and the oracle\-evidence reference\. Its diagnostic content comes from the protocol binding, not from the algebraic ratio alone: no\-evidence answerability, recoverable oracle\-evidence advantage, full\-context utilization, and retrieval\-conditioned utilization are jointly estimated under the same model, examples, score field, grouping scheme, and denominator\-validity checks\.

This distinction matters because several common metrics do not identify evidence utilization on their own\. Raw answer F1 does not subtract no\-evidence answerability\. Evidence F1 does not verify that cited evidence was converted into the final answer\. Retrieval recall measures pre\-reader availability, not reader use\. Oracle gap ignores what the model could already answer without context\. The proposed framework is designed to keep these factors separate: evidence availability, full\-context localization, retrieval\-chain coverage, multi\-hop integration, answer conversion, and output\-format stability\.

For a broad JAIR readership, the intended use of the framework is diagnostic\. AI researchers can use the protocol to ask whether an apparent contextual improvement comes from evidence use, no\-evidence answerability, retrieval\-chain coverage, long\-context localization, or output\-format stability\. This matters for model developers, retrieval\-system designers, benchmark authors, and evaluators because the same final answer score can correspond to different mechanisms and therefore to different research conclusions\.

The reported evaluation contains three ONCU\-compatible benchmark components: Controlled\-ONCU\-safe16K, HotpotQA\-ONCU, and 2WikiMultiHopQA\-ONCU\. These datasets support the same four\-condition protocol because no\-evidence, full\-context, retrieved\-evidence, and oracle\-evidence inputs can be constructed from the same underlying examples\. We evaluate a primary three\-model panel, Qwen2\.5\-14B, Qwen3\-14B, and Gemma3\-12B, and add a model\-family extension with Llama3\.1\-8B and Mistral\-Small3\.1\-24B\. The main text gives priority to the core four\-condition results, denominator\-validity audit, model\-family extension, matched dense@16/hybrid@16 ONCU sensitivity, and controlled length–position scaling as mechanism evidence\. Retrieval\-only checks, reader\-facing sweeps, cross\-encoder reranking, external BABILong/RULER\-lite validation, and failure\-taxonomy validation are reported as supporting audits rather than separate benchmark claims\.

##### Central thesis\.

The contribution is not a new universal score, but a matched diagnostic protocol that makes no\-evidence answerability, oracle\-reference gain, full\-context utilization, and retrieval\-conditioned utilization jointly observable under the same examples\. ONCU is one estimator inside this protocol: for a specified score field and contextual condition, it estimates the recovered fraction of an operational oracle\-reference evidence advantage after no\-evidence answerability is subtracted\. The diagnostic value comes from observing the four evidence roles together with a denominator\-validity audit, which separates answer priors from evidence\-derived gains and retrieval coverage from reader\-side conversion without turning the result into a leaderboard score or a claim of mechanistic causal attribution\.

##### Diagnostic proposition\.

For a score field $S$ and a contextual condition $c$, the recovered evidence\-advantage target $R_c=(S_c-S_{\mathrm{no}})/(S_{\mathrm{oracle}}-S_{\mathrm{no}})$ is identifiable as a condition\-level diagnostic only when $S_{\mathrm{no}}$, $S_c$, $S_{\mathrm{oracle}}$, the grouping scheme, and the denominator\-validity condition $S_{\mathrm{oracle}}>S_{\mathrm{no}}$ are observed under the same model and examples\. If any of these terms is absent, two evaluations can agree on the reported metric while disagreeing on whether the condition recovered evidence\-derived advantage\. This is the empirical and formal gap filled by the four\-condition protocol\.

The paper makes five contributions:

- It specifies a fixed four\-condition evidence\-availability protocol that jointly observes no\-evidence, full\-context, retrieved\-evidence, and oracle\-evidence behavior under the same examples, model, answer contract, retrieval controls, score field, and grouping policy\.
- It defines a denominator\-validity regime for oracle\-referenced utilization: non\-positive oracle\-over\-baseline denominators are explicitly marked, raw and clipped utilization regimes are separated, and denominator\-free answer, evidence, retrieval, and failure\-pattern audit metrics are reported alongside the normalized estimate\.
- It uses ONCU as the protocol\-bound estimator of recovered oracle\-reference evidence advantage, while making the broader diagnostic object the matched protocol plus validity audit rather than a standalone metric or model\-ranking score\.
- It gives a formal joint\-observability argument and counterexamples showing why answer accuracy, evidence F1, retrieval recall, oracle gap, context gain, or a normalized contrast without matched evidence roles cannot individually distinguish the failure modes targeted by the protocol\.
- It provides aggregate failure\-pattern audits under the tested local open\-weight models, reconstructed oracle\-compatible QA components, and retriever settings, indicating that controlled synthetic settings primarily expose full\-context utilization failures whereas the tested HotpotQA\- and 2WikiMultiHopQA\-derived retrieve\-then\-read settings often expose evidence\-chain coverage failures\.

##### Why this is not just normalized gain\.

A normalized gain supplies a ratio; it does not by itself determine what the numerator and denominator mean\. In this paper the ratio is meaningful only because the protocol fixes four evidence roles on the same examples: no evidence, full context, retrieved evidence, and oracle\-evidence reference\. The denominator\-validity audit is part of the estimand, because groups with no oracle\-over\-baseline advantage cannot support a recovered\-advantage interpretation\. The output is therefore a failure\-localizing diagnostic contrast, paired with denominator\-free answer, evidence, and retrieval metrics, rather than a standalone normalized score or model leaderboard\.

## 2\.Related Work

This section positions the paper relative to evaluation practice in question answering, long\-context modeling, retrieval, grounding, and normalized performance contrasts\. It also identifies the closest JAIR articles by structure and evaluation motivation, because the present contribution is intended as a journal\-style diagnostic framework rather than as another task\-specific benchmark report\.

### 2\.1\.Closed\-Book, Open\-Book, and No\-Context Evaluation

Closed\-book question answering evaluates what a model can answer without task\-specific context, while open\-book and retrieval\-augmented settings provide external passages before generation\. This distinction is central to the present paper because no\-evidence answerability is not a nuisance variable: it is one of the quantities that must be estimated\. Standard open\-book accuracy can be high even when part of the score comes from parametric memory, dataset priors, or question artifacts\. The no\-evidence condition in our protocol therefore serves a different role from a conventional closed\-book baseline\. It is not used to rank models; it estimates the amount of answer performance that should not be credited to the supplied context\.

This positioning differs from ordinary open\-book QA evaluation on datasets such as SQuAD or HotpotQA, where exact match and token F1 remain the primary answer metrics \(rajpurkar2016squad;yang2018hotpotqa\)\. Those metrics are necessary but not sufficient for evidence\-utilization diagnosis\. They do not identify whether the answer was already available without context, whether the full context supplied usable evidence that the model failed to localize, or whether a retrieved context omitted part of the evidence chain\. ONCU addresses this narrower diagnostic target by comparing each contextual condition against both a no\-evidence baseline and an isolated\-evidence reference for the same model and examples\.

### 2\.2\.Long\-Context Benchmarks and Context\-Sensitivity Tests

Long\-context benchmarks evaluate whether models can process long inputs, retrieve needles, aggregate distributed facts, or answer questions over long documents \(bai2024longbench;kuratov2024babilong;hsieh2024ruler;kamradt2023needle\)\. LongBench provides heterogeneous long\-context tasks; BABILong stresses reasoning over facts embedded in long documents; RULER extends needle\-style tests to multi\-hop tracing and aggregation\. These benchmarks are important predecessors because they expose failures that are invisible in short\-context QA\. They also motivate the central diagnostic concern of this paper: long context length and final answer accuracy do not, by themselves, identify whether evidence\-derived advantage has been recovered\.

Position\-sensitivity and needle\-style tests are especially relevant because they reveal that relevant information can be harder to use at some context lengths or positions than others \(liu2024lost\)\. The present protocol uses that insight but asks a narrower accounting question\. A position\-sensitivity score can show that performance degrades when evidence is placed early, middle, or far from the query, but it need not separate no\-evidence answerability, isolated\-evidence answerability, full\-context reading, retrieved\-context reading, and denominator validity for the same examples\. ONCU is therefore not a replacement for position\-sensitive benchmarks; it is a way to express such effects as recovered oracle\-reference advantage under a matched evidence\-availability protocol\.

The present protocol complements such benchmarks by asking a question that most long\-context benchmarks do not instantiate directly: for the same examples and the same model, how much performance is obtained with no evidence, with the full context, with a retrieved subset, and with isolated oracle evidence? The matched four\-condition design turns that question into an auditable contrast that separates answer priors, full\-context localization, retrieval\-chain coverage, and answer conversion in the same denominator\-normalized analysis\.

### 2\.3\.Retrieval Evaluation Versus Reader\-Side Utilization

Retrieval\-augmented generation separates evidence selection from answer generation \(lewis2020rag;karpukhin2020dpr\)\. Classical and neural retrieval metrics, including lexical BM25\-style retrieval, dense retrieval, rank fusion, and recall at a fixed budget, measure whether relevant passages appear before the reader sees the prompt \(robertson2009bm25;cormack2009rrf;reimers2019sbert\)\. These metrics are indispensable for diagnosing evidence availability, but they are pre\-reader quantities\. A retriever can expose the necessary passages while the reader still fails to integrate or convert them into the correct answer; conversely, a weaker retrieval score can sometimes provide enough local context for answer generation\.

This distinction is why the paper reports both retrieval\-only and reader\-facing retriever\-family analyses\. The retrieval\-only analysis measures evidence\-chain availability, ranking, and distractor exposure before generation\. The reader\-facing validation reruns answer generation on lexical, dense, and hybrid retrieved contexts to test whether retrieval improvements survive downstream reading\. Neither analysis is allowed to replace the core four\-condition ONCU protocol unless the no\-evidence, full\-context, retrieved\-evidence, and oracle\-evidence references are matched for the same model, dataset, retriever family, and score field\.

### 2\.4\.Attribution, Citation Faithfulness, and Evidence Grounding

Attribution and citation\-grounding work asks whether a generated claim is supported by cited sources\. Frameworks such as Attributable to Identified Sources and RAG evaluation methods such as RAGAS separate answer quality, context relevance, and faithfulness in retrieval\-augmented systems \(rashkin2023ais;es2024ragas\)\. These works address a different but related problem\. They test whether generated text is grounded in evidence, whereas ONCU asks how much of the recoverable oracle\-reference answer advantage is obtained under a specified evidence\-availability condition\.

The two perspectives are complementary\. A model can cite relevant passages but produce the wrong answer, yielding reasonable evidence overlap and low answer utilization\. A model can answer correctly with weak citations, yielding high answer accuracy but weak grounding\. A retriever can return all gold passages while the reader fails\. Conversely, a faithful generated explanation can still be uninformative about whether the answer improvement came from the supplied context or from a no\-evidence prior\. The diagnostic protocol in this paper therefore keeps answer metrics, evidence metrics, retrieval diagnostics, attribution\-style grounding, and ONCU separate rather than treating any one of them as a complete measure of evidence use\.

### 2\.5\.Normalized Gain, Oracle Gaps, and Baseline\-Adjusted Evaluation

ONCU connects normalized\-gain style reasoning with a matched intervention design for long\-context and retrieval\-augmented QA\. Normalized contrasts and relative\-improvement measures have a long history in evaluation, including information retrieval and pre/post improvement analysis \(jarvelin2002cumulated;hake1998interactive\)\. The methodological contribution here is to bind a normalized contrast to four controlled evidence roles: no evidence estimates answer priors, oracle evidence estimates the isolated\-evidence reference, and full\-context or retrieved\-evidence conditions test whether that recoverable advantage survives the actual input regime\. The estimator, intervention design, denominator\-validity filter, and audit layer are treated as a single diagnostic package\.

Protocol binding changes the interpretation of the normalized contrast\. Rather than applying a generic ratio to arbitrary score pairs, the framework requires joint observation under matched examples and model settings, makes non\-positive oracle\-over\-baseline denominators explicit, and reports oracle\-referenced utilization together with denominator\-free answer, evidence, retrieval, and failure diagnostics\. Without those protocol constraints, a normalized score can inherit the same ambiguities as ordinary context gain or oracle\-gap reporting\.

This design matters because the same final score can arise from different mechanisms\. A context gain $S_c-S_{\mathrm{no}}$ captures improvement over the no\-evidence baseline but leaves the recoverable oracle\-reference advantage implicit\. An oracle gap $S_{\mathrm{oracle}}-S_c$ captures distance from isolated evidence but leaves no\-evidence answerability implicit\. Retrieval recall measures pre\-reader evidence availability, while evidence F1 and answer F1 evaluate different parts of the pipeline\. ONCU combines the relevant score contrasts inside a specified protocol and reports them with denominator\-validity and audit requirements\.
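The relation between the two contrasts and the normalized ratio can be written out directly. The following is a short algebraic sketch rather than a statement from the paper: the symbols $G_c$ (context gain) and $D_c$ (oracle gap) are introduced here only for illustration, and the ratio is the raw ONCU defined in Section 3.3.

```latex
G_c = S_c - S_{\mathrm{no}}, \qquad
D_c = S_{\mathrm{oracle}} - S_c, \qquad
S_{\mathrm{oracle}} - S_{\mathrm{no}} = G_c + D_c,
\\[4pt]
\mathrm{ONCU}_{\mathrm{raw}}(c)
  = \frac{S_c - S_{\mathrm{no}}}{S_{\mathrm{oracle}} - S_{\mathrm{no}}}
  = \frac{G_c}{G_c + D_c},
  \qquad \text{valid only when } G_c + D_c > 0 .
```

Read this way, reporting $G_c$ alone leaves $D_c$ unconstrained, and reporting $D_c$ alone leaves $G_c$ unconstrained, so neither contrast fixes the ratio without the other.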

### 2\.6\.Closest JAIR Articles and Journal\-Level Positioning

The closest JAIR articles are close in evaluation motivation and journal\-level structure rather than in proposing the same long\-context metric\. The technical lineage of this paper comes from long\-context evaluation, retrieval evaluation, grounding and attribution evaluation, and normalized performance contrasts; the JAIR positioning is methodological\. \(gehrmann2023cracked\) survey obstacles in generated\-text evaluation and argue that surface\-level automatic metrics often fail to measure the properties researchers intend to study\. The present paper makes the same kind of evaluation\-methodology intervention, but it narrows the target to evidence utilization in long\-context and retrieval\-augmented question answering and provides a concrete four\-condition protocol with released artifacts\.

\(lamalfa2024lmaas\) analyze accessibility, reproducibility, reliability, and trustworthiness challenges of language\-model\-as\-a\-service systems\. Our work differs by using local open\-weight models under fixed inference controls in the reported experiments so that diagnostic evidence\-availability contrasts can be audited within that model scope\. Finally, \(gundersen2024reproducibility\) articulate JAIR’s reproducibility mechanisms, including structured abstracts, checklists, and reproducibility\-oriented reporting\. The present submission follows that methodological structure but contributes a substantive empirical diagnostic framework: ONCU is paired with denominator\-validity checks, failure audits, external validations, and explicit protocol documentation\.

### 2\.7\.Positioning of the Present Protocol

Table [1](https://arxiv.org/html/2606.06758#S2.T1) states the thesis operationally\. Adjacent evaluation families observe useful variables, but each omits at least one component of the full diagnostic protocol\. The present contribution is not reducible to any single row in the table: it combines the four evidence conditions, retriever/reader separation, denominator\-valid grouping, oracle\-referenced utilization, and companion denominator\-free diagnostics\.

Table 1\. Why the Four\-Condition Diagnostic Protocol Is Not Reducible to a Single Prior Evaluation Line\. $S_{\mathrm{no}}$ is no\-evidence answerability, $S_c$ is a contextual condition score, and $S_{\mathrm{oracle}}$ is the oracle\-evidence reference\. The protocol is defined by joint observation under matched examples, model, score field, retriever/reader setting, grouping scheme, and denominator\-validity audit\.

## 3\.Problem Formulation

### 3\.1\.Long\-Context Question Answering

We consider a long\-context question\-answering setting\. Each example consists of a question $q$, a long context $C$, a gold answer $a^{\ast}$, and a set of oracle evidence passages $E^{\ast}$ when such annotations are available\. The model receives the question and a condition\-specific input context and produces a predicted answer $\hat{a}$ and, when required by the output contract, a set of cited evidence passages $\hat{E}$:

$$x=(q,C,a^{\ast},E^{\ast}),\qquad y=(\hat{a},\hat{E}). \tag{1}$$
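As a rough illustration of how the tuple in Eq. (1) maps onto the four evidence-availability conditions, the sketch below builds all four inputs from one example. It is not the paper's implementation: the class names, the `build_condition_inputs` helper, the retriever signature, the toy `first_k` retriever, and the assumption that the context is pre-split into passages with oracle evidence stored as passage indices are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass(frozen=True)
class Example:
    """One QA example x = (q, C, a*, E*)."""
    question: str
    context: List[str]          # C, assumed here to be pre-split into passages
    gold_answer: str            # a*
    oracle_evidence: List[int]  # E*, stored as indices into `context` (assumption)


@dataclass(frozen=True)
class Prediction:
    """Model output y = (a_hat, E_hat)."""
    answer: str
    cited_evidence: List[int]


# Hypothetical retriever signature: (question, passages, k) -> passage indices.
Retriever = Callable[[str, Sequence[str], int], List[int]]


def build_condition_inputs(ex: Example, retrieve: Retriever, k: int) -> Dict[str, List[str]]:
    """Build the four evidence-availability inputs from the same example."""
    retrieved = retrieve(ex.question, ex.context, k)
    return {
        "no_evidence": [],
        "full_context": list(ex.context),
        "retrieved": [ex.context[i] for i in retrieved],
        "oracle": [ex.context[i] for i in ex.oracle_evidence],
    }


def first_k(question: str, passages: Sequence[str], k: int) -> List[int]:
    """Toy stand-in retriever: returns the first k passage indices."""
    return list(range(min(k, len(passages))))


# Hypothetical two-hop example; the second hop sits in the last passage.
example = Example(
    question="Where was the author of Book X born?",
    context=["Book X was written by A.", "Unrelated passage.", "A was born in City Y."],
    gold_answer="City Y",
    oracle_evidence=[0, 2],
)
inputs = build_condition_inputs(example, first_k, k=2)
```

In this toy case the retrieved input contains the first hop but drops the second, which is the kind of evidence-chain coverage loss that the retrieved-evidence condition is meant to expose relative to the oracle-evidence reference.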

### 3\.2\.Evidence Utilization Failures

We define evidence utilization as the ability to transform available contextual evidence into the correct final answer\. A failure can occur at different stages\. The model may fail to locate relevant evidence, select only a subset of the required evidence, select distractors, fail to integrate multiple passages, or convert correctly identified evidence into the wrong answer form\. The diagnostic protocol is designed to distinguish these mechanisms rather than collapse them into a single final\-accuracy score\.

### 3\.3\.Oracle\-Reference Normalized Context Utilization

Let $S_{\mathrm{no}}$ denote the score under the no\-evidence condition, and let $S_{\mathrm{oracle}}$ denote the score under the oracle\-evidence reference condition\. For any evaluated contextual condition $c$, such as full context or retrieved evidence, let $S_c$ denote the corresponding score\. ONCU is defined as a protocol\-bound ratio:

$$\mathrm{ONCU}_{\mathrm{raw}}(c)=\frac{S_c-S_{\mathrm{no}}}{S_{\mathrm{oracle}}-S_{\mathrm{no}}}. \tag{2}$$

A value near 1 indicates that condition $c$ recovers most of the oracle\-reference advantage\. A value near 0 indicates that the condition contributes little beyond the no\-evidence baseline\. If $S_{\mathrm{oracle}}\leq S_{\mathrm{no}}$, ONCU is treated as invalid for that group because the oracle\-evidence reference does not establish a meaningful evidence\-derived advantage\. For reporting aggregate results, we use a clipped version,

$$\mathrm{ONCU}_{\mathrm{clip}}(c)=\min\left(1,\max\left(0,\mathrm{ONCU}_{\mathrm{raw}}(c)\right)\right), \tag{3}$$

while retaining raw ONCU values in the released per\-group outputs\. Because clipping can suppress above\-oracle and below\-baseline behavior, the main text also includes raw\-vs\-clipped audit rows rather than relying only on clipped aggregates\. In the experiments, ONCU is computed over metadata groups and averaged over valid groups\. Consequently, ONCU should not be read as an unconditional dataset\-wide average; dataset\-level statements require companion answer, evidence, retrieval, and robustness metrics that do not depend on the oracle\-over\-baseline denominator\.
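The definitions above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the released evaluation code: the function names, the dictionary layout of per-group scores, and the choice to average the clipped values over valid groups are assumptions made here, and the group scores are invented for the example.

```python
from typing import Dict, List, Optional


def oncu_raw(s_c: float, s_no: float, s_oracle: float) -> Optional[float]:
    """Raw ratio of Eq. (2); None when S_oracle <= S_no (denominator-invalid group)."""
    denom = s_oracle - s_no
    if denom <= 0:
        return None
    return (s_c - s_no) / denom


def oncu_clip(raw: float) -> float:
    """Clipped value of Eq. (3): restrict raw ONCU to [0, 1]."""
    return min(1.0, max(0.0, raw))


def aggregate_oncu(groups: Dict[str, Dict[str, float]], condition: str) -> Dict[str, object]:
    """Average clipped ONCU over valid groups; keep raw values and invalid group names."""
    raw_by_group: Dict[str, float] = {}
    invalid: List[str] = []
    for name, scores in groups.items():
        raw = oncu_raw(scores[condition], scores["no"], scores["oracle"])
        if raw is None:
            invalid.append(name)
        else:
            raw_by_group[name] = raw
    clipped = [oncu_clip(v) for v in raw_by_group.values()]
    mean_clip = sum(clipped) / len(clipped) if clipped else None
    return {"mean_clip": mean_clip, "raw": raw_by_group, "invalid": invalid}


# Hypothetical per-group mean scores (illustrative only, not results from the paper).
groups = {
    "g1": {"no": 0.20, "full": 0.50, "oracle": 0.80},  # within the [0, 1] range
    "g2": {"no": 0.40, "full": 0.90, "oracle": 0.80},  # above-oracle behavior
    "g3": {"no": 0.60, "full": 0.70, "oracle": 0.60},  # non-positive denominator
}
report = aggregate_oncu(groups, "full")
```

Keeping the raw per-group values and the list of invalid groups next to the clipped mean mirrors the audit requirement in the text: the clipped aggregate hides the above-oracle group `g2` and says nothing about `g3`, so both have to be reported separately.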

##### Interpretation boundary\.

ONCU is conditional on the score field, grouping scheme, oracle\-evidence construction, and denominator\-valid subset\. The oracle\-evidence condition is an operational reference, not a perfect ceiling on attainable performance; this boundary is developed in Section [4\.7](https://arxiv.org/html/2606.06758#S4.SS7)\. The clipped value in Eq\. \([3](https://arxiv.org/html/2606.06758#S3.E3)\) supports a compact recovered\-fraction summary, but raw ONCU remains the primary diagnostic audit for above\-oracle behavior, below\-baseline behavior, and unstable denominators\. Any dataset\-level conclusion therefore requires denominator\-free answer/evidence metrics and retrieval diagnostics alongside ONCU\.

## 4\.Theoretical Properties of Oracle\-Referenced Evaluation

This section formalizes the protocol\-level contribution\. ONCU is the normalized diagnostic estimator inside a fixed evidence\-availability protocol, but its interpretation depends on reporting the matched evidence roles, validity filters, and supporting audits together\. The section states the properties that make the normalization interpretable and clarifies the conditions under which the estimator yields a well\-defined evidence\-utilization contrast\.

### 4\.1\.From Normalized Gain to Diagnostic Estimation

Equation \([2](https://arxiv.org/html/2606.06758#S3.E2)\) turns normalized gain into a diagnostic estimator by assigning each term a controlled evidence\-availability role\. The no\-evidence condition estimates what the same model can answer without contextual support; the oracle\-evidence condition estimates what becomes possible when the required evidence is isolated; and the full\-context and retrieved\-evidence conditions test whether that recoverable advantage survives the actual input regime\. The same model, examples, scoring field, prompt contract, decoding policy, and validity audit are held fixed, which makes the contrast interpretable as a condition\-level utilization estimate rather than an arbitrary score difference\.

The protocol\-level view distinguishes ONCU from reporting a raw context gain, an oracle gap, or a retrieval recall score alone\. A raw gain measures improvement over no evidence; an oracle gap measures distance from an isolated\-evidence reference; retrieval recall measures pre\-reader availability; and answer or evidence metrics evaluate different downstream products\. ONCU integrates the no\-evidence and oracle\-reference sides of the comparison in one denominator\-normalized estimate, while companion answer, evidence, retrieval, and failure\-pattern audit metrics preserve the underlying mechanisms\.

The resulting framework has three inseparable parts\. First, the diagnostic estimator specifies the normalized contrast between contextual and reference conditions\. Second, the protocol fixes how no\-evidence, full\-context, retrieved\-evidence, and oracle\-evidence inputs are constructed from the same examples\. Third, the empirical audit layer reports denominators, invalid groups, raw scores, evidence overlap, retrieval coverage, parse failures, and aggregate failure\-pattern behavior\. Together, these parts give the normalized contrast diagnostic content: a low recovered\-advantage score can be examined through answer priors, retrieval\-chain loss, full\-context localization failure, answer conversion, and denominator stability under the same experimental contract\.

##### Joint\-observability proposition\.

Let $R_c$ denote the recovered evidence\-advantage target for condition $c$\. If an evaluation does not jointly observe $S_{\mathrm{no}}$, $S_c$, and $S_{\mathrm{oracle}}$ on the same examples, with the same model, score field, answer contract, and denominator\-validity rule, then $R_c$ is not identified by the reported quantities\. In particular, there exist paired evaluations that agree on a final contextual score, a context gain, an oracle gap, a retrieval\-recall score, or an evidence\-overlap score while assigning different values to $R_c$\. The four\-condition protocol makes this target observable by fixing the three score terms and the denominator\-valid subset before aggregation\.

##### Proof sketch\.

Holding $S_c$ fixed while varying $S_{\mathrm{no}}$ changes the recovered fraction without changing the contextual answer score\. Holding $S_c - S_{\mathrm{no}}$ fixed while varying $S_{\mathrm{oracle}} - S_{\mathrm{no}}$ changes the recovered fraction without changing context gain\. Holding $S_{\mathrm{oracle}} - S_c$ fixed while varying $S_{\mathrm{no}}$ changes the recovered fraction without changing oracle gap\. Finally, retrieval recall and evidence overlap can remain high when the reader fails to convert the evidence into the answer\. The counterexamples below instantiate these cases and show why the diagnostic object requires joint observation rather than any one metric alone\.

### 4\.2\.Why Existing Metrics Do Not Identify Evidence Utilization

The diagnostic target in this paper is not ordinary improvement over a baseline\. It is the recovered fraction of an oracle\-reference evidence advantage under a matched evidence\-availability intervention\. Several standard metrics cannot identify this target, even when they are individually useful\.

##### Accuracy counterexample\.

Consider two systems on the same dataset\. System A obtains $S_{\mathrm{no}}=0.70$, $S_{\mathrm{full}}=0.75$, and $S_{\mathrm{oracle}}=0.90$\. System B obtains $S_{\mathrm{no}}=0.10$, $S_{\mathrm{full}}=0.75$, and $S_{\mathrm{oracle}}=0.90$\. Both have identical full\-context scores, but their recovered evidence advantage differs sharply:

$$\frac{0.75-0.70}{0.90-0.70}=0.25,\qquad\frac{0.75-0.10}{0.90-0.10}=0.8125.$$

Full\-context accuracy alone cannot distinguish answer priors from contextual evidence use\.
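The arithmetic in this counterexample can be checked with a minimal sketch. The helper name `oncu_raw` is illustrative and is not taken from the released artifact; it only evaluates the ratio $(S_c - S_{\mathrm{no}})/(S_{\mathrm{oracle}} - S_{\mathrm{no}})$ for the two systems described above.

```python
def oncu_raw(s_no: float, s_c: float, s_oracle: float) -> float:
    """Recovered fraction of the oracle-reference advantage (hypothetical helper)."""
    denom = s_oracle - s_no
    if denom <= 0:
        raise ValueError("denominator-invalid: requires s_oracle > s_no")
    return (s_c - s_no) / denom


# Systems A and B share the same full-context score (0.75) and oracle score (0.90)
# but differ in their no-evidence baseline.
system_a = oncu_raw(s_no=0.70, s_c=0.75, s_oracle=0.90)
system_b = oncu_raw(s_no=0.10, s_c=0.75, s_oracle=0.90)
print(f"A: {system_a:.4f}  B: {system_b:.4f}")
```

The same contextual score yields very different recovered fractions once the no-evidence baseline is taken into account.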

##### Retrieval\-recall counterexample\.

Suppose a retriever returns all oracle passages for a multi\-hop question, so retrieval recall is 1\.0, but the reader answers incorrectly because it fails to integrate the passages\. Retrieval recall correctly indicates evidence availability before generation, but it does not measure reader\-side utilization\. Conversely, a retriever may retrieve only one of several annotated passages but include enough neighboring context for the reader to answer\. Retrieval recall and reader utilization are therefore non\-equivalent diagnostic quantities\.

##### Evidence\-F1 counterexample\.

A model may cite the correct evidence passages while producing the wrong entity, date, comparison direction, or arithmetic value\. Evidence F1 would be high, while answer F1 would be low\. Such cases are common in multi\-hop settings where selecting the evidence and converting it into the requested answer form are separate operations\. Evidence overlap cannot replace an answer\-side utilization estimator\.

##### Oracle\-gap and context\-gain counterexamples\.

The oracle gap $S_{\mathrm{oracle}} - S_c$ measures distance from an isolated\-evidence reference but ignores what the model could answer without evidence\. The context gain $S_c - S_{\mathrm{no}}$ adjusts for answer priors but ignores how much recoverable evidence advantage was available\. Two groups can have identical context gain and very different oracle\-reference denominators, or identical oracle gaps and very different no\-evidence baselines\. Both references are required for the recovery question asked here\.
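A small numeric sketch can make both cases concrete. The score triples below are invented for illustration only (they are not results from the paper), and the `Scores` class is a hypothetical container.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scores:
    """Hypothetical container for the three matched scores of one group."""
    s_no: float
    s_c: float
    s_oracle: float

    @property
    def context_gain(self) -> float:
        return self.s_c - self.s_no

    @property
    def oracle_gap(self) -> float:
        return self.s_oracle - self.s_c

    @property
    def recovered(self) -> float:
        return (self.s_c - self.s_no) / (self.s_oracle - self.s_no)


# Same context gain (0.20), different oracle-reference denominators.
g1 = Scores(s_no=0.20, s_c=0.40, s_oracle=0.50)
g2 = Scores(s_no=0.20, s_c=0.40, s_oracle=1.00)

# Same oracle gap (0.20), different no-evidence baselines.
g3 = Scores(s_no=0.10, s_c=0.70, s_oracle=0.90)
g4 = Scores(s_no=0.60, s_c=0.70, s_oracle=0.90)

for name, g in [("g1", g1), ("g2", g2), ("g3", g3), ("g4", g4)]:
    print(name, round(g.context_gain, 3), round(g.oracle_gap, 3), round(g.recovered, 3))
```

In this illustration, `g1` and `g2` agree on context gain yet recover about two thirds versus one quarter of the available advantage, and `g3` and `g4` agree on oracle gap yet recover three quarters versus one third.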

These counterexamples establish the need for a matched protocol rather than a single new score\. ONCU is interpretable only because the no\-evidence, contextual, and oracle\-reference conditions are constructed over the same examples, score field, model, and grouping scheme\. It remains conditional on the denominator\-valid subset and must be paired with denominator\-free companion metrics\.

Table 2\. Formal Failure Modes Behind the Joint\-Observability Proposition\. Each metric captures a useful quantity but leaves at least one term of the recovered\-evidence\-advantage target unidentified\.

### 4\.3\.ONCU and Causal Evidence Use

ONCU provides condition\-level behavioral evidence about how answer and evidence scores change when evidence availability is controlled\. The protocol changes observable evidence availability and measures the resulting answer and evidence behavior\. A higher full\-context or retrieved\-evidence ONCU value indicates that the tested condition recovers more of the oracle\-evidence advantage under the fixed prompting, decoding, retrieval, and scoring pipeline\.

This scope is aligned with the claims in this paper\. Terms such as evidence utilization, retrieval bottleneck, and answer conversion are used as behavioral diagnostics under controlled condition contrasts\. Mechanistic causal attribution—for example, identifying the hidden states, attention paths, or internal computations that caused a specific answer—is a distinct research question that can be studied with additional interventions such as counterfactual passage replacement, activation\-level analysis, or causal mediation tests\. The present contribution supplies the diagnostic estimation and empirical audit framework for observable condition\-level performance\.

### 4\.4\.Positive Affine Invariance

Let $S$ be any scalar task score used in the diagnostic protocol, such as relaxed answer F1, strict answer F1, exact match, or evidence F1\. Consider a positive affine transformation of the score,

$$S^{\prime}=aS+b,\qquad a>0. \tag{4}$$

For any contextual condition $c$, the transformed ONCU value is

$$
\begin{align}
\mathrm{ONCU}^{\prime}_{\mathrm{raw}}(c) &= \frac{S^{\prime}_{c}-S^{\prime}_{\mathrm{no}}}{S^{\prime}_{\mathrm{oracle}}-S^{\prime}_{\mathrm{no}}} \tag{5}\\
&= \frac{(aS_{c}+b)-(aS_{\mathrm{no}}+b)}{(aS_{\mathrm{oracle}}+b)-(aS_{\mathrm{no}}+b)} \tag{6}\\
&= \frac{a(S_{c}-S_{\mathrm{no}})}{a(S_{\mathrm{oracle}}-S_{\mathrm{no}})} \tag{7}\\
&= \mathrm{ONCU}_{\mathrm{raw}}(c). \tag{8}
\end{align}
$$

ONCU is therefore invariant to positive affine rescaling of the underlying score\. This matters because the interpretation of ONCU depends on relative recovery of evidence\-derived advantage rather than on the arbitrary scale or offset of a particular score field\. The property does not imply that all score fields are interchangeable; it only shows that once a score field has been chosen, linear rescaling does not change the ONCU diagnosis\.
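The invariance can also be checked numerically. The sketch below uses made-up score values and a few arbitrary positive affine maps; none of these numbers come from the paper, and the helper names are illustrative.

```python
import math


def oncu_raw(s_no: float, s_c: float, s_oracle: float) -> float:
    """Raw ratio; assumes a denominator-valid group (s_oracle > s_no)."""
    return (s_c - s_no) / (s_oracle - s_no)


def affine(score: float, a: float, b: float) -> float:
    """Positive affine rescaling S' = a * S + b with a > 0."""
    return a * score + b


s_no, s_c, s_oracle = 0.30, 0.55, 0.80  # illustrative scores
base = oncu_raw(s_no, s_c, s_oracle)

# (a, b) pairs: percent scale, shifted scale, compressed-and-offset scale.
for a, b in [(100.0, 0.0), (2.0, -1.0), (0.5, 3.0)]:
    transformed = oncu_raw(affine(s_no, a, b), affine(s_c, a, b), affine(s_oracle, a, b))
    assert math.isclose(base, transformed, rel_tol=1e-9)
```

The check passes only because $a > 0$ cancels in the ratio and $b$ cancels in both differences; a non-affine transformation of the score would not preserve the value in general.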

### 4\.5\.Baseline\-Adjusted Interpretation

The numerator in Eq\. \([2](https://arxiv.org/html/2606.06758#S3.E2)\), $S_c - S_{\mathrm{no}}$, estimates the performance gain obtained by condition $c$ beyond the no\-evidence baseline\. The denominator, $S_{\mathrm{oracle}} - S_{\mathrm{no}}$, estimates the evidence\-derived advantage available to the same model, on the same dataset, under the same score field and grouping scheme\. ONCU therefore answers the following diagnostic question:

> How much of the oracle\-evidence advantage beyond the no\-evidence baseline is recovered by the evaluated contextual condition?

This baseline adjustment is essential for realistic question\-answering datasets such as HotpotQA, where a model can sometimes answer from parametric knowledge, dataset priors, or question artifacts even when no context is provided\. Without subtracting the no\-evidence baseline, a high full\-context score could be incorrectly interpreted as strong context utilization when part of the score is already attainable without evidence\.

### 4\.6\.Denominator Validity Condition

ONCU is meaningful only when the oracle\-evidence reference provides a positive evidence\-derived advantage:

$$S_{\mathrm{oracle}}>S_{\mathrm{no}}. \tag{9}$$

If $S_{\mathrm{oracle}}\leq S_{\mathrm{no}}$, the denominator in Eq\. \([2](https://arxiv.org/html/2606.06758#S3.E2)\) is zero or negative, and the ratio no longer measures recovery of an available evidence advantage\. Such cases are not necessarily failures of the model or errors in the data\. They can occur when the model already answers the group well without evidence, when the oracle\-evidence snippet is too narrow, when aliases or neighboring context outside the oracle snippet are useful, or when the score field is insensitive to the additional evidence\. We therefore mark these groups as invalid for ONCU aggregation rather than forcing a ratio whose sign or magnitude would no longer answer the recovered\-advantage question\.

The denominator condition defines the scope of each ONCU estimate\. If a metadata group is denominator\-invalid, ONCU is not treated as missing evidence for the model; it indicates that the isolated oracle\-evidence reference did not create a positive recoverable advantage over the no\-evidence baseline for that group under the chosen score field and grouping scheme\. Every ONCU table therefore reports valid\-group counts before ONCU is interpreted, and the result sections pair ONCU with denominator\-free answer F1, evidence F1, parse\-failure counts, bootstrap intervals, and larger\-sample robustness checks\. In the HotpotQA setting, where valid groups are comparatively few in the 200\-sample matrix, the paper reports an explicit denominator audit and states the claim in two layers: sample\-level answer and evidence metrics support the full\-context\-over\-retrieved direction over all evaluated samples, while ONCU supports the same direction on oracle\-improving groups\.
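The validity rule can be sketched as a partition step that runs before any ONCU value is interpreted. The group names and score triples below are hypothetical, and the dictionary layout is an assumption about how per-group scores might be stored rather than the format of the released files.

```python
def partition_groups(groups):
    """Split metadata groups by the denominator-validity rule S_oracle > S_no.

    `groups` maps a group name to (s_no, s_c, s_oracle). Returns the raw ONCU
    ratio for each valid group and the list of invalid group names.
    """
    valid, invalid = {}, []
    for name, (s_no, s_c, s_oracle) in groups.items():
        if s_oracle > s_no:
            valid[name] = (s_c - s_no) / (s_oracle - s_no)
        else:
            invalid.append(name)  # reported in the audit, never given a ratio
    return valid, invalid


# Hypothetical groups: one valid, one with a zero denominator, one negative.
groups = {
    "len4k_front": (0.20, 0.50, 0.80),
    "len8k_middle": (0.90, 0.90, 0.90),
    "len16k_end": (0.70, 0.65, 0.60),
}
valid, invalid = partition_groups(groups)
print(f"valid groups: {len(valid)} / {len(groups)}; invalid: {invalid}")
```

Reporting the valid-group count alongside the ratio keeps the scope of each estimate visible, which is the role the denominator audit plays in the text above.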

### 4\.7\.Oracle Reference Is Not an Upper Bound

The oracle\-evidence condition is a reference condition, not a guaranteed upper bound\. It isolates the annotated supporting evidence, but it may omit local neighboring sentences, redundant descriptions, aliases, or contextual cues that help answer generation\. Retrieved chunks may include the oracle passages plus adjacent text, and full contexts may contain auxiliary evidence not included in the oracle snippet\. A raw ONCU value above 1 is therefore not a mathematical impossibility and should not be silently clipped away as noise\. The interpretation is consequently “recovered advantage relative to the operational oracle\-evidence reference,” not recovery relative to a perfect answerability ceiling\.

This point changes the interpretation of the estimator\. The term*oracle\-reference normalized*means normalized relative to an oracle\-evidence*reference*, not normalized by a perfect ceiling on attainable performance\. The paper therefore reports clipped ONCU for recovered\-fraction readability while preserving raw ONCU and above\-oracle counts as audit signals\. Above\-oracle behavior can indicate helpful auxiliary context, oracle\-snippet incompleteness, or sensitivity to input formulation\. Below\-baseline behavior can indicate context\-induced distraction or output\-format degradation\. Both regimes are part of the diagnostic evidence rather than exceptions to be hidden\.

### 4\.8\.Raw Versus Clipped ONCU

Raw ONCU values are retained because they carry diagnostic information that clipped aggregates can hide\. A raw value greater than 1 means that condition $c$ outperforms the oracle\-evidence reference for the corresponding group\. This can happen when the full context or retrieved chunks contain useful ancillary information that is absent from the oracle\-evidence snippet, or when answer generation is sensitive to input formulation\. A raw value below 0 means that condition $c$ performs below the no\-evidence baseline, indicating that the provided context may distract the model or induce an incorrect answer\. These two regimes are not nuisance values: both are failure or boundary modes that help distinguish oracle\-reference incompleteness, context\-induced degradation, and retrieval\-induced auxiliary evidence\.

For aggregate reporting, we use the clipped value in Eq\. \([3](https://arxiv.org/html/2606.06758#S3.E3)\)\. Clipping maps the group\-level ratio to the interval $[0,1]$ and supports the interpretation of aggregate ONCU as a recovered fraction of the oracle\-reference advantage\. This reporting convention is secondary to the raw diagnostic signal\. To avoid making clipped aggregates look cleaner than the underlying behavior, the paper reports clipped ONCU together with raw\-score columns, valid\-group counts, denominator audits, and a main\-text raw\-vs\-clipped clipping\-regime audit\. The aggregate audit is only a compact preview of clipping behavior; the underlying interpretation depends on the per\-group raw ONCU, valid\-group indicators, and score denominators that show whether aggregate results are driven by in\-range utilization patterns, above\-oracle cases, below\-baseline cases, or unstable denominators\.

The raw\-vs\-clipped audit therefore tracks three regimes that clipping can collapse: below\-baseline contextual behavior \($\mathrm{ONCU}_{\mathrm{raw}}<0$\), in\-range recovered advantage \($0\leq\mathrm{ONCU}_{\mathrm{raw}}\leq 1$\), and above\-reference contextual behavior \($\mathrm{ONCU}_{\mathrm{raw}}>1$\)\. These regimes are inspected in the reported raw\-vs\-clipped audits and preserved in the released per\-group files\.
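The three regimes and the clipping step can be sketched as follows. The raw values are invented for illustration, and the regime labels are hypothetical names, not identifiers from the released per-group files.

```python
from collections import Counter


def regime(raw: float) -> str:
    """Label a raw ONCU value with one of the three audit regimes."""
    if raw < 0:
        return "below_baseline"
    if raw > 1:
        return "above_reference"
    return "in_range"


def clip01(raw: float) -> float:
    """Clip a raw ratio to the interval [0, 1] for aggregate reporting."""
    return min(1.0, max(0.0, raw))


raw_values = [-0.4, 0.0, 0.35, 1.0, 1.6]  # illustrative per-group raw ONCU
regime_counts = Counter(regime(v) for v in raw_values)
clipped = [clip01(v) for v in raw_values]

print(dict(regime_counts))
print(clipped)
```

After clipping, the below-baseline group is indistinguishable from a group at exactly 0 and the above-reference group from a group at exactly 1, which is why the regime counts are kept next to the clipped aggregate.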

### 4\.9\.ONCU as a Within\-Model Diagnostic Ratio

ONCU is a within\-model, within\-task diagnostic ratio\. It is interpreted together with the underlying no\-evidence, contextual, oracle\-evidence, answer, and evidence scores\. This reporting convention is important because ONCU measures recovered oracle\-reference advantage for a specified model, dataset, score field, and condition set\. A smaller model can recover a large fraction of its own oracle\-reference advantage, while a stronger model can expose a larger recoverable gap that is harder for its full\-context condition to realize\.

This distinction supports the experimental analysis in the paper\. The goal is to identify where the performance bottleneck lies: evidence localization in the long context, retrieval coverage, multi\-hop integration, answer conversion, or the absence of a positive oracle\-over\-baseline denominator\. ONCU contributes the normalized bottleneck view, while absolute task performance and evidence\-quality metrics report the raw answer and grounding behavior\.

## 5\.Benchmark Construction

### 5\.1\.Overview

We construct a diagnostic evaluation suite for long\-context evidence utilization\. The reported ONCU evaluation uses three oracle\-compatible benchmark components: a controlled synthetic component, a HotpotQA\-derived multi\-hop component, and a 2WikiMultiHopQA\-derived multi\-hop component\. These components include evidence annotations that allow the same example to be evaluated under no\-evidence, full\-context, retrieved\-evidence, and oracle\-evidence conditions\. We additionally include BABILong\-200 and RULER\-lite\-240 as external answer\-performance validation settings\. These external settings are useful for testing reasoning\-in\-a\-haystack and synthetic long\-context behavior, respectively, but they are not used for ONCU computation in this study because the current adapters do not provide the full oracle\-reference condition set in the same format as the oracle\-compatible components\.

Figure [1](https://arxiv.org/html/2606.06758#S5.F1) summarizes the core diagnostic protocol\. External BABILong\-200 and RULER\-lite\-240 checks are reported separately as answer\-performance validation because they do not instantiate the complete oracle\-reference protocol\.

- Matched evaluation unit: same examples, model, prompts, answer contract, score field, and grouping scheme\.
- No evidence \(question only\): $S_{\mathrm{no}}$\.
- Full context \(complete long input\): $S_{\mathrm{full}}$\.
- Retrieved evidence \(compact retrieved input\): $S_{\mathrm{ret}}$\.
- Oracle\-evidence reference \(isolated supporting evidence\): $S_{\mathrm{oracle}}$\.
- Denominator\-valid audit: interpret ONCU only when $S_{\mathrm{oracle}}>S_{\mathrm{no}}$\.
- Protocol\-bound outputs: recovered oracle\-reference advantage \(ONCU\) plus denominator\-free answer/evidence metrics, retrieval diagnostics, and failure summaries\.

Figure 1\. Matched four\-condition diagnostic protocol\. The four evidence\-availability conditions are observed on the same examples under one scoring contract\. ONCU is computed only after the denominator\-validity audit and is interpreted with denominator\-free answer/evidence metrics, retrieval diagnostics, and failure summaries\.

Table 3\. Benchmark Components and Their Roles in the Diagnostic Framework\.

### 5\.2\.Sample Selection and Stratification

All reported datasets are materialized as fixed processed JSONL files before model evaluation and are referenced by configuration files during inference\. Sample inclusion is determined only by the dataset\-construction pipeline and metadata filters, not by model outputs\. This prevents post\-hoc selection of examples after observing model behavior\.

For the controlled benchmark, the full generator supports 4K, 8K, 16K, and 32K context settings\. The primary cross\-model comparison uses the Controlled\-ONCU\-safe16K\-200 subset, which is drawn from the 4K, 8K, and 16K controlled settings after excluding 32K contexts\. The exclusion is applied before evaluation to avoid backend\-specific truncation near the maximum context window when task instructions, question text, passage identifiers, and structured\-output constraints are included\. Each controlled sample retains metadata for context length, evidence position, distractor similarity, and reasoning type\. The subset is selected to cover the available metadata cells in the safe 16K range, and the ONCU aggregation is later performed over valid metadata groups rather than directly over examples so that larger cells do not dominate the diagnostic summary\.

For HotpotQA\-ONCU, examples are first converted into passage\-identified long\-context instances\. Supporting facts are aligned to visible passages, and an example is retained only when the required supporting facts can be reliably mapped to oracle evidence passage identifiers\. The gold evidence identifiers and distractor labels are stored in metadata and are not exposed to the model\. The HotpotQA\-ONCU\-200 set is used in the balanced 3\-model by 2\-dataset main matrix\. The HotpotQA\-500 robustness set is constructed with the same filtering rules, the same passage\-identifier format, the same top\-$k=3$ main retrieval protocol, and a fixed random seed of 42\. The 500\-sample setting therefore changes sample size, not the task definition or evaluation contract\.

Across both benchmark components, the no\-evidence, full\-context, retrieved\-evidence, and oracle\-evidence conditions are produced from the same underlying examples\. Within a model–dataset run, condition differences therefore reflect evidence availability rather than changes in the evaluated sample pool\.

For 2WikiMultiHopQA\-ONCU\-500, examples are converted from the 2WikiMultiHopQA multi\-hop question\-answering dataset, which provides evidence information containing reasoning paths for multi\-hop questions \(ho2020wikimultihopqa\)\. We construct 500 passage\-identified long\-context instances using the same four\-condition diagnostic protocol as the other ONCU\-compatible components\. Each retained example contains a gold answer, oracle evidence passages, a full passage\-annotated context, and distractor passages\. The resulting 500\-sample set is used as an additional realistic multi\-hop validation component rather than as a replacement for the balanced 200\-sample core matrix\.

For BABILong\-200, we construct an external validation set from four context configurations, 0k, 1k, 2k, and 4k, and five task types, qa1, qa2, qa3, qa6, and qa7, using 10 examples per task–configuration cell\. This yields 200 examples before model evaluation\. BABILong samples are evaluated under no\-evidence, full\-context, and retrieved\-evidence conditions using the same deterministic decoding and lexical retrieval settings as the main experiments\. Because the current BABILong adapter does not provide oracle\-evidence annotations compatible with ONCU, no oracle\-evidence condition is constructed and BABILong is reported only as external answer\-performance validation\.

For RULER\-lite\-240, we construct an external synthetic long\-context validation set with three task families, `retrieval_key_value`, `multi_hop_trace`, and `aggregation_sum`, four context lengths, 4K, 8K, 16K, and 32K, and 20 examples per task–length cell\. This yields 240 examples before model evaluation\. RULER\-lite samples are evaluated under full\-context and retrieved\-context conditions with the same three evaluated models and a fixed top\-$k=3$ lexical retrieval setting\. Because this adapter does not instantiate no\-evidence and oracle\-evidence references compatible with ONCU, RULER\-lite is reported only as external answer\-performance validation\.

### 5\.3\.Controlled\-ONCU

The controlled component is designed to isolate the factors affecting long\-context evidence utilization\. It systematically varies four dimensions: context length, evidence position, distractor similarity, and reasoning type\.

- Context length: 4K, 8K, 16K, and 32K tokens\.
- Evidence position: front, middle, end, and scattered in the main controlled generator; decile midpoint positions in the length–position scaling extension\.
- Distractor similarity: none, low, high, and conflicting\.
- Reasoning type: single\-hop, multi\-hop, comparison, and arithmetic\.

This design enables controlled analysis of where and why long\-context models fail\.

The main cross\-model controlled comparison uses the safe16K subset described above\. To test whether the aggregate controlled full\-context bottleneck hides systematic scaling behavior, we additionally construct a controlled length–position scaling extension\. This extension crosses four context lengths, 4K, 8K, 16K, and 32K, with ten decile evidence\-position buckets, `pos_00` through `pos_09`, four distractor\-similarity settings, four reasoning types, and five random seeds per cell\. The resulting processed input contains 3,200 samples\. It is used as a diagnostic extension rather than as a replacement for the balanced three\-model safe16K comparison\.
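The size of the scaling extension follows from the crossed design, which a short sketch can enumerate. The field names, the seed values 0 to 4, and the mapping from bucket `pos_i` to the relative position $(i + 0.5)/10$ are assumptions made for illustration; the source states only the factor levels, the bucket names, and the use of decile midpoints.

```python
from itertools import product

CONTEXT_LENGTHS = ["4K", "8K", "16K", "32K"]
POSITION_BUCKETS = [f"pos_{i:02d}" for i in range(10)]  # pos_00 .. pos_09
DISTRACTOR_SIMILARITY = ["none", "low", "high", "conflicting"]
REASONING_TYPES = ["single-hop", "multi-hop", "comparison", "arithmetic"]
SEEDS = range(5)  # assumed seed values; the source only says five seeds per cell


def decile_midpoint(bucket: str) -> float:
    """Assumed mapping from a decile bucket name to its relative midpoint."""
    index = int(bucket.split("_")[1])
    return (index + 0.5) / 10


grid = [
    {
        "context_length": length,
        "position_bucket": bucket,
        "relative_position": decile_midpoint(bucket),
        "distractor_similarity": distractor,
        "reasoning_type": reasoning,
        "seed": seed,
    }
    for length, bucket, distractor, reasoning, seed in product(
        CONTEXT_LENGTHS, POSITION_BUCKETS, DISTRACTOR_SIMILARITY, REASONING_TYPES, SEEDS
    )
]
print(len(grid))
```

The product $4 \times 10 \times 4 \times 4 \times 5$ gives the 3,200 samples reported for the processed input.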

### 5\.4\.HotpotQA\-ONCU

The HotpotQA\-derived component provides a realistic multi\-hop question\-answering setting \(yang2018hotpotqa\)\. We align supporting facts to passages and retain only examples whose supporting facts can be reliably mapped to oracle evidence passages\. These samples allow ONCU computation in a real multi\-hop setting\.

### 5\.5\.2WikiMultiHopQA\-ONCU

The 2WikiMultiHopQA\-derived component provides a second realistic multi\-hop setting\. 2WikiMultiHopQA was designed to evaluate reasoning steps by combining structured and unstructured information and by providing evidence information containing reasoning paths for multi\-hop questions \(ho2020wikimultihopqa\)\. This makes it well aligned with oracle\-referenced evaluation: the oracle\-evidence reference can be constructed from the annotated evidence path, while the full\-context and retrieved\-evidence conditions test whether the model or retriever preserves and uses the required evidence chain\. We report a 500\-sample 2WikiMultiHopQA\-ONCU validation set for all three evaluated models\.

### 5\.6\.External Long\-Context Validation

LongBench, BABILong, and RULER are important long\-context benchmarks for evaluating model robustness beyond a single controlled task family \(bai2024longbench; kuratov2024babilong; hsieh2024ruler\)\. In this study, we report BABILong\-200 as an external reasoning\-in\-a\-haystack validation setting and RULER\-lite\-240 as an external synthetic long\-context validation setting\. LongBench is treated as future external validation because its heterogeneous task types require additional adapter design to define comparable evidence references and metrics\. The current BABILong and RULER\-lite adapters do not instantiate the same oracle\-referenced four\-condition protocol as Controlled\-ONCU, HotpotQA\-ONCU, and 2WikiMultiHopQA\-ONCU; therefore, both external settings are interpreted separately from the oracle\-referenced core results\.

### 5\.7\.Passage Annotation and Leakage Prevention

All visible passages are assigned neutral passage identifiers such as `[passage_id: p0001]`\. Gold evidence and distractor labels are stored only in metadata and are not exposed to the model\. This prevents label leakage and ensures that models must infer relevance from passage content rather than explicit evidence markers\.
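The separation between model-visible text and hidden labels can be sketched as below. Only the identifier format `[passage_id: p0001]` comes from the text; the function name, the metadata keys, and the example passages are hypothetical.

```python
def annotate_passages(passages, gold_indices):
    """Build the model-visible context and a separate, hidden metadata record.

    `passages` is a list of passage strings; `gold_indices` holds the 0-based
    positions of gold evidence passages. Labels go only into the metadata.
    """
    visible_blocks = []
    metadata = {"gold_passage_ids": [], "distractor_passage_ids": []}
    for index, text in enumerate(passages):
        passage_id = f"p{index + 1:04d}"  # neutral identifier, carries no label
        visible_blocks.append(f"[passage_id: {passage_id}] {text}")
        key = "gold_passage_ids" if index in gold_indices else "distractor_passage_ids"
        metadata[key].append(passage_id)
    return "\n\n".join(visible_blocks), metadata


context, metadata = annotate_passages(
    ["Alpha was founded in 1990.", "Beta is a river.", "Alpha's founder moved to Beta."],
    gold_indices={0, 2},
)
print(context)
```

Only `context` would be placed in the prompt; `metadata` stays on the evaluation side and is used later to score cited passage identifiers.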

## 6\.Diagnostic Protocol

### 6\.1\.Core Diagnostic Conditions

The main evaluation uses fixed diagnostic conditions rather than prompt optimization\. Each condition changes the evidence available to the model while keeping the answer contract, decoding policy, and evaluation metrics fixed\.

#### 6\.1\.1\.No\-Evidence Condition

The no\-evidence condition provides only the question\. It estimates answer priors and parametric knowledge\. This condition is essential for realistic question\-answering datasets, where a model may answer some questions without using the supplied context\.

#### 6\.1\.2\.Full\-Context Condition

The full\-context condition provides the complete long context and asks the model to answer while citing supporting passage identifiers\. This is the primary condition for measuring whether evidence embedded in a long input is used effectively\.

#### 6\.1\.3\.Retrieved\-Evidence Condition

The retrieved\-evidence condition applies a deterministic lexical retriever to select a compact evidence set before answer generation, following the role of lexical retrieval baselines such as BM25 in information retrieval \(robertson2009bm25\)\. In the main diagnostic matrix, retrieval is configured with top\-$k=3$, chunk size 220, and overlap 40\. Candidate chunks are produced from the same passage\-annotated context used in the full\-context condition, and retrieval is applied before answer generation using the question as the query\. The selected chunks are then passed to the same answer\-generation contract used by the other contextual conditions\. This condition is not presented as a new prompting method or as a claim about the strongest deployable RAG pipeline\. It is a matched diagnostic condition that asks whether a fixed compact retrieved context recovers the evidence\-derived advantage observed under the no\-evidence, full\-context, and oracle\-evidence references\. In the retrieval\-budget ablation, only top\-$k$ is varied, while chunk size, overlap, decoding policy, output contract, and evaluation pipeline are held fixed\.
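A minimal sketch of a deterministic chunk-and-rank step with these settings is shown below. Several points are assumptions: the text does not state the unit of the chunk size and overlap (whitespace tokens are assumed here), and the scorer is a plain query-term overlap count used as a stand-in, not BM25 and not the retriever from the released artifact.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercased word tokens (assumed tokenization)."""
    return re.findall(r"\w+", text.lower())


def chunk_tokens(tokens: list[str], size: int = 220, overlap: int = 40) -> list[list[str]]:
    """Sliding-window chunks of `size` tokens that share `overlap` tokens."""
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the context
    return chunks


def retrieve_top_k(question: str, chunks: list[list[str]], k: int = 3) -> list[int]:
    """Return indices of the k chunks with the most query-term occurrences."""
    query_terms = set(tokenize(question))
    scored = [
        (-sum(1 for tok in chunk if tok in query_terms), idx)
        for idx, chunk in enumerate(chunks)
    ]
    scored.sort()  # higher score first; ties broken by original chunk order
    return [idx for _, idx in scored[:k]]


tokens = [f"w{i}" for i in range(500)]  # synthetic 500-token context
chunks = chunk_tokens(tokens)
top = retrieve_top_k("w390 w391", chunks, k=2)
print(len(chunks), top)
```

Because ties are broken by chunk order and no randomness is involved, repeated runs return the same chunks, which is the property the matched protocol relies on.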

Retrieval family and retrieval depth are treated as explicit protocol variables rather than hidden implementation details\. We therefore run three retriever\-family checks\. The first is a retrieval\-only audit that holds the processed examples, passage segmentation, chunk size, overlap, and evidence\-overlap metrics fixed while varying the retrieval family: lexical retrieval, off\-the\-shelf dense sentence\-embedding retrieval, hybrid lexical–dense rank fusion using reciprocal ranks, a deterministic iterative query\-expansion baseline, and an oracle retrieval reference\. This audit diagnoses evidence\-chain availability, ranking, and distractor exposure before downstream answer generation\. The second is a matched retriever\-family ONCU sensitivity experiment that reruns the complete no\-evidence, full\-context, retrieved\-evidence, and oracle\-evidence protocol for dense@16 and hybrid@16 retrieved inputs on HotpotQA\-ONCU\-200 and 2WikiMultiHopQA\-ONCU\-500\. The third is a broader reader\-facing validation that reruns answer generation for lexical, dense, and hybrid retrieved contexts at top\-$k\in\{3,8,16\}$ and reports answer/evidence metrics across budgets\. The matched sensitivity experiment tests whether the denominator\-valid ONCU contrast and denominator\-free answer/evidence contrast remain stable under stronger retrieved inputs; the broader reader\-facing validation tests whether retrieval\-family improvements transfer to downstream answer generation\.
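Rank fusion using reciprocal ranks can be sketched as follows. This is a generic reciprocal rank fusion routine, not the paper's implementation: the smoothing constant `k=60` is the value commonly used in the literature and is an assumption here, as are the chunk identifiers and the alphabetical tie-break.

```python
def reciprocal_rank_fusion(rankings, k: int = 60):
    """Fuse ranked lists by summing 1 / (k + rank) over the lists.

    `rankings` is a list of ranked lists of chunk ids (best first).
    Ties in the fused score are broken by chunk id for determinism.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


lexical_ranking = ["c3", "c1", "c7"]  # hypothetical lexical retriever output
dense_ranking = ["c1", "c9", "c3"]    # hypothetical dense retriever output
fused = reciprocal_rank_fusion([lexical_ranking, dense_ranking])
print(fused)
```

Chunks ranked well by both retrievers rise to the top, while chunks seen by only one retriever keep a smaller fused score; only ranks are used, so the raw lexical and dense scores never need to be put on a common scale.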

#### 6\.1\.4\.Oracle\-Evidence Condition

The oracle\-evidence condition provides only the gold supporting evidence\. It serves as an oracle\-evidence reference for ONCU normalization and verifies whether the model can answer when the necessary evidence is isolated\. Because retrieved chunks may contain oracle passages plus adjacent local context, the oracle\-evidence condition is an empirical reference rather than a guaranteed upper bound for every individual group\.

### 6\.2\.Auxiliary Diagnostic Probes

Auxiliary probes are used only for failure analysis\. The concise\-reasoning probe tests whether a fixed reasoning\-style instruction changes utilization behavior\. The evidence\-selection probe separates evidence selection from answer generation\. The evidence\-sufficiency probe adds a verification and expansion step to reveal missing\-support failures\. These probes are not optimized per model or per dataset and are not used as the basis for the main ONCU claims\.

### 6\.3\.Fixed Protocol and Reproducibility

All instruction templates are fixed before model comparison and shared across models and datasets\. We do not tune templates for individual models, datasets, or failed examples\. Local Ollama experiments use deterministic decoding with temperature set to 0 and an explicit context\-window configuration of 32,768 tokens\. The retrieved\-evidence condition uses a fixed deterministic lexical retrieval configuration\. Each run records configuration files, per\-sample metrics, ONCU summaries, and protocol metadata for reproducibility\.
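The decoding settings named above can be expressed as a request payload. The sketch assumes the Ollama REST API's `/api/generate` request shape, with `temperature` and `num_ctx` passed under `options`; the model tag and prompt are placeholders, and the payload is only constructed, not sent, so the field names should be checked against the Ollama version in use.

```python
import json


def build_generate_payload(model_tag: str, prompt: str) -> dict:
    """Assemble a deterministic-decoding request body (assumed Ollama shape)."""
    return {
        "model": model_tag,
        "prompt": prompt,
        "stream": False,
        "options": {
            "temperature": 0,   # deterministic decoding, as stated in the text
            "num_ctx": 32768,   # explicit context-window configuration
        },
    }


# "example-model:latest" is a placeholder tag, not one of the evaluated models.
payload = build_generate_payload("example-model:latest", "Question: ...")
print(json.dumps(payload, indent=2))
```

Keeping these options in a single builder makes it easier to confirm that every condition of a matched run uses the same decoding configuration.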

##### Runtime and model\-environment record\.

The released artifact includes `RUNTIME_REPRODUCIBILITY_RECORD.md` and `runtime_reproducibility_record.json`, generated by `scripts/export_runtime_record.py`\. The record captures the available runtime metadata for the reproduction host, including operating system, Python executable, package versions, PyTorch/CUDA status, GPU name and memory, `nvidia-smi` output, deterministic inference controls, model tags requested by the protocol, context\-window settings, and any unavailable runtime commands\. The submitted Runpod record reports Linux/x86\_64, Python 3\.11\.10, PyTorch 2\.4\.1\+cu124, an NVIDIA A40 GPU with 44\.43 GB reported memory, NVIDIA driver 570\.211\.01, and CUDA 12\.8\. Model identifiers, retrieval settings, decoding controls, context length, and output paths are also fixed in the released configuration files and reproduction README\.

##### Hyperparameter\-search boundary\.

The experiments are diagnostic protocol runs rather than open\-ended hyperparameter optimization\. Final reported settings are fixed in `configs/`; the explored sensitivity dimensions are explicitly enumerated in the manuscript and artifacts: retrieval budgets, lexical/dense/hybrid retriever families, dense@16 and hybrid@16 matched ONCU sensitivity, cross\-encoder reranking candidate/final budgets, controlled context length and evidence position, model\-family extension, and external BABILong/RULER\-lite validation settings\. No additional unpublished prompt tuning, model\-specific decoding search, or per\-dataset failed\-example tuning is used to select the main reported results\.

## 7\.Evaluation Metrics

### 7\.1\.Answer Metrics

We report both strict and relaxed answer metrics\. Strict exact match and strict F1 require the predicted answer to match the gold answer precisely\. Relaxed metrics apply normalization such as lowercasing, accent removal, punctuation removal, and synthetic suffix removal when appropriate\.
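The strict/relaxed distinction can be illustrated with a minimal sketch\. It assumes whitespace tokenization and ASCII punctuation stripping; the function names are illustrative rather than taken from the released code, and the dataset\-specific synthetic suffix removal is omitted because its rule is not given here\.

```python
import string
import unicodedata
from collections import Counter


def normalize_relaxed(text: str) -> str:
    """Lowercase, strip accents, drop ASCII punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    no_punct = "".join(ch for ch in no_accents if ch not in string.punctuation)
    return " ".join(no_punct.split())


def exact_match(pred: str, gold: str, relaxed: bool = False) -> float:
    """1.0 if the (optionally normalized) strings are identical, else 0.0."""
    if relaxed:
        pred, gold = normalize_relaxed(pred), normalize_relaxed(gold)
    return float(pred == gold)


def token_f1(pred: str, gold: str, relaxed: bool = False) -> float:
    """Token-overlap F1 over whitespace tokens (assumed tokenization)."""
    if relaxed:
        pred, gold = normalize_relaxed(pred), normalize_relaxed(gold)
    pred_tokens, gold_tokens = pred.split(), gold.split()
    if not pred_tokens or not gold_tokens:
        # Convention assumed here: two empty answers agree, otherwise score 0.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Under this sketch, a prediction that differs from the gold answer only in case, accents, or punctuation scores 0 on the strict metrics and 1 on the relaxed ones\.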

### 7\.2\.Evidence Metrics

We evaluate evidence selection using evidence precision, recall, and F1:

$$P_{E}=\frac{|\hat{E}\cap E^{\ast}|}{|\hat{E}|}, \tag{10}$$

$$R_{E}=\frac{|\hat{E}\cap E^{\ast}|}{|E^{\ast}|}, \tag{11}$$

$$F1_{E}=\frac{2P_{E}R_{E}}{P_{E}+R_{E}}. \tag{12}$$
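As a small worked sketch of Eqs\. \(10\)–\(12\), the function below treats the predicted and gold evidence as sets of passage identifiers\. The identifier strings and the zero\-score convention for empty sets are assumptions made for illustration\.

```python
def evidence_prf(predicted_ids, gold_ids):
    """Evidence precision, recall, and F1 over sets of passage identifiers."""
    predicted, gold = set(predicted_ids), set(gold_ids)
    hits = len(predicted & gold)
    # Assumed convention: an empty predicted or gold set yields a score of 0.
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Two gold passages, three cited passages, both gold passages among them.
p, r, f1 = evidence_prf(["p1", "p2", "p7"], ["p1", "p2"])
```

Here precision is 2/3, recall is 1, and F1 is 0\.8: citing one extra passage lowers precision without affecting recall\.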

### 7\.3\.ONCU Metrics

We compute ONCU using multiple score fields, including strict answer F1, relaxed answer F1, strict exact match, and relaxed exact match\. Unless otherwise specified, ONCU\-Relaxed\-F1 refers to ONCU computed with relaxed answer F1 as $S$ in Eq\. \([2](https://arxiv.org/html/2606.06758#S3.E2)\)\. This is the primary ONCU metric because it accounts for semantically correct answers that differ only in formatting or synthetic identifiers\. The alternative strict\-F1 and exact\-match variants are treated as score\-field sensitivity checks; the main claims are stated only where ONCU is interpreted together with denominator\-free answer/evidence metrics and denominator\-validity audits\.

### 7\.4\.Companion Metrics and Audit Outputs

ONCU is interpreted together with denominator\-free answer and evidence metrics, retrieval diagnostics, bootstrap intervals, paired contrasts, and failure\-pattern audits\. The main text uses those quantities only where they support the central diagnostic argument; the complete alternative\-metric tables and representative case studies are moved to Appendix [A\.1](https://arxiv.org/html/2606.06758#A1.SS1) to keep the exposition focused\.

### 7\.5\.Bootstrap Confidence Intervals

To assess statistical reliability, we compute non\-parametric bootstrap confidence intervals for the final 200\-sample matrix, the HotpotQA\-500 robustness checks, and the BABILong\-200 external validation\. For answer and evidence metrics, examples are resampled with replacement within each model–dataset–condition group\. For ONCU, valid metadata groups are resampled with replacement within each model–dataset–condition group, because ONCU is computed as a group\-normalized quantity\. The final 200\-sample analysis, the HotpotQA\-500 robustness analysis, and the BABILong\-200 answer\-performance analysis use 5,000 bootstrap replicates, and intervals are reported as two\-sided 95% percentile intervals\. These intervals are used only as reliability checks for the diagnostic trends and are not used to tune prompts, retrieval settings, or model\-specific decoding policies\.
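The percentile bootstrap described above can be sketched as follows\. This is a minimal standard\-library illustration, not the released implementation; the function names, the seed, and the linear\-interpolation percentile rule are assumptions\.

```python
import random
import statistics


def percentile(sorted_values, q):
    """Linear-interpolation percentile of a sorted list, q in [0, 1]."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (pos - lo) * (sorted_values[hi] - sorted_values[lo])


def bootstrap_ci(values, n_replicates=5000, alpha=0.05, seed=0):
    """Two-sided percentile interval for the mean of `values`.

    `values` holds per-example scores for answer/evidence metrics, or
    per-group ONCU values when valid metadata groups are the resampling unit.
    """
    rng = random.Random(seed)
    n = len(values)
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_replicates)
    )
    return percentile(means, alpha / 2), percentile(means, 1 - alpha / 2)
```

Calling `bootstrap_ci` once per model–dataset–condition group keeps the resampling within groups, as the text specifies\.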

### 7\.6\.Paired Effect Sizes and Multiple\-Comparison Control

The core diagnostic protocol is a repeated\-measures design: the same examples are evaluated under no\-evidence, full\-context, retrieved\-evidence, and oracle\-evidence conditions\. We therefore report paired condition contrasts in addition to aggregate means\. For a contrast such as retrieved evidence minus full context, we align examples by sample identifier, compute the per\-sample score difference, and report the paired mean difference, a paired bootstrap 95% confidence interval, and a standardized paired effect size\. The standardized effect size is the mean paired difference divided by the standard deviation of the paired differences\. These quantities are intended to measure effect magnitude, not merely statistical detectability\.
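A minimal sketch of one paired contrast is given below\. It assumes per\-sample scores are stored in dictionaries keyed by sample identifier and uses the sample \(n−1\) standard deviation; the text does not state which standard\-deviation convention is used, so that choice is an assumption\.

```python
import statistics


def paired_contrast(scores_a, scores_b):
    """Paired mean difference and standardized effect size for A minus B.

    scores_a, scores_b: dicts mapping sample identifier -> score.
    Only identifiers present in both conditions are compared.
    """
    shared_ids = sorted(scores_a.keys() & scores_b.keys())
    diffs = [scores_a[i] - scores_b[i] for i in shared_ids]
    mean_diff = statistics.fmean(diffs)
    # Sample standard deviation of the paired differences (assumed convention).
    sd = statistics.stdev(diffs) if len(diffs) > 1 else 0.0
    effect = mean_diff / sd if sd > 0 else float("nan")
    return {"n": len(diffs), "mean_diff": mean_diff, "effect_size": effect}
```

A paired bootstrap interval follows by resampling the list of per\-sample differences with replacement and recomputing the mean for each replicate\.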

For the controlled length–position scaling analysis, the same generated cells are evaluated across context lengths and evidence\-position buckets\. We therefore compute paired length and position contrasts over matched position or length–position cells, and we supplement them with regression\-style diagnostics over aggregated ONCU cells\. For the retrieval\-family ablation, we compute paired contrasts over sample identifiers because each retriever and top\-$k$ setting is evaluated on the same underlying examples\.

When multiple confirmatory contrasts are reported within the same analysis family, we apply Holm adjustment to the normal\-approximation diagnostic $p$\-values \(holm1979sequential\)\. For exploratory regression\-style diagnostics, we additionally report Benjamini–Hochberg false\-discovery\-rate adjusted values \(benjamini1995fdr\)\. The interpretation of the results emphasizes effect sizes and confidence intervals rather than isolated $p$\-values, following the general caution that statistical significance alone does not measure effect size or scientific importance \(wasserstein2016asa\)\.
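The two adjustments can be sketched in a few lines\. The functions below implement the standard Holm step\-down and Benjamini–Hochberg step\-up procedures on a list of raw $p$\-values; how the raw normal\-approximation values are obtained is outside this sketch, and the function names are illustrative\.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        running_max = max(running_max, (m - rank) * p_values[idx])
        adjusted[idx] = min(1.0, running_max)
    return adjusted


def benjamini_hochberg_adjust(p_values):
    """Benjamini-Hochberg FDR-adjusted values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):
        idx = order[rank]
        # Walk from the largest p-value down, enforcing monotonicity.
        running_min = min(running_min, p_values[idx] * m / (rank + 1))
        adjusted[idx] = running_min
    return adjusted
```

For raw values 0\.01, 0\.04, and 0\.03, Holm gives 0\.03, 0\.06, and 0\.06, while Benjamini–Hochberg gives 0\.03, 0\.04, and 0\.04, which shows the less conservative behavior of the false\-discovery\-rate adjustment\.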

### 7\.7\.Failure\-Pattern Audit Summary

Failure\-pattern labels are used as aggregate descriptive diagnostics rather than item\-level causal ground truth\. The operational rules and validation audit are reported in Appendix [A\.3](https://arxiv.org/html/2606.06758#A1.SS3) and Appendix [A\.4](https://arxiv.org/html/2606.06758#A1.SS4); the main text relies primarily on answer/evidence metrics and uses label counts only as supporting evidence\.

## 8\.Experiments

The experiment section keeps the main diagnostic chain in the body and moves secondary checks to the appendix\. The main text focuses on the primary four\-condition matrix, ONCU validity behavior, realistic multi\-hop robustness, model\-family extension, matched dense/hybrid sensitivity, and controlled length–position scaling as mechanism evidence\. Complete confidence\-interval tables, failure\-pattern audit details, raw\-versus\-clipped rows, retrieval sweeps, and external validation tables are preserved in the appendix material rather than treated as independent contributions\.

### 8\.1\.Experimental Setup

The setup instantiates the matched four\-condition protocol across controlled and realistic QA settings\. All main claims are scoped to local open\-weight models, reconstructed oracle\-compatible QA protocols, and the tested retriever families\.

The primary core experiments evaluate three open\-weight models: Qwen2\.5\-14B \(qwen2025qwen25\), Qwen3\-14B \(qwen2025qwen3\), and Gemma3\-12B \(gemma2025gemma3\)\. We first evaluate each model on two 200\-sample datasets, Controlled\-ONCU\-safe16K and HotpotQA\-ONCU, under four fixed core diagnostic conditions: no evidence, full context, retrieved evidence, and oracle\-evidence reference\. The completed balanced matrix therefore contains $3\times 2\times 200\times 4=4800$ model predictions\. We then evaluate the same three models on 2WikiMultiHopQA\-ONCU\-500 under the same four conditions, adding $3\times 500\times 4=6000$ predictions\. The original oracle\-compatible primary panel therefore contains 10,800 fixed\-condition predictions\. The 2WikiMultiHopQA run is reported as a larger realistic multi\-hop validation component, while the 200\-sample Controlled\-ONCU and HotpotQA\-ONCU datasets remain the balanced cross\-dataset matrix\.

To test model\-family robustness within the local open\-weight setting, we add a separate model\-family extension with Llama3\.1\-8B \(meta2024llama31\) and Mistral\-Small3\.1\-24B \(mistral2025small31\)\. Both models are run locally through the same Ollama\-backed protocol as the primary panel\. The extension repeats the same four diagnostic conditions on Controlled\-ONCU\-safe16K\-200, HotpotQA\-ONCU\-200, and 2WikiMultiHopQA\-ONCU\-500, adding $2\times(200+200+500)\times 4=7200$ fixed\-condition predictions\. The ONCU\-compatible model\-generation evidence in this paper therefore contains 18,000 predictions across five local open\-weight models spanning the Qwen, Gemma, Llama, and Mistral families\. The model\-family extension is reported separately from the original three\-model balanced matrix because it was added as a local open\-weight robustness check rather than used to tune the protocol\.
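The prediction counts for the ONCU\-compatible panel can be checked with a short tally\. The variable names are illustrative; the factors are the model, dataset\-size, and condition counts stated in the text\.

```python
CONDITIONS = 4        # no evidence, full context, retrieved, oracle reference
PRIMARY_MODELS = 3    # Qwen2.5-14B, Qwen3-14B, Gemma3-12B
EXTENSION_MODELS = 2  # Llama3.1-8B, Mistral-Small3.1-24B

# Balanced matrix: two 200-sample datasets per primary model.
balanced_matrix = PRIMARY_MODELS * 2 * 200 * CONDITIONS
# Larger 2WikiMultiHopQA-ONCU-500 validation for the same three models.
two_wiki_500 = PRIMARY_MODELS * 500 * CONDITIONS
primary_panel = balanced_matrix + two_wiki_500
# Model-family extension over all three datasets.
extension = EXTENSION_MODELS * (200 + 200 + 500) * CONDITIONS
oncu_compatible_total = primary_panel + extension

print(balanced_matrix, two_wiki_500, primary_panel, extension, oncu_compatible_total)
```

The tally reproduces the 4,800, 6,000, 10,800, 7,200, and 18,000 figures reported above\.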

The controlled safe16K subset contains samples drawn from 4K, 8K, and 16K contexts\. The 32K setting is excluded from the core model comparison because a 32K context plus task instructions, question text, and structured\-output constraints can approach backend context limits and introduce truncation\-related confounds\. This restriction reduces truncation confounds when comparing similarly sized local models\.

The ONCU\-compatible model\-generation experiments use deterministic decoding with temperature set to 0, maximum generation length set to 1024 tokens, and an explicit Ollama context\-window configuration of 32,768 tokens \(ollama2026\)\. The main retrieved\-evidence condition uses the fixed lexical retrieval setting described above: top\-$k=3$, chunk size 220, and overlap 40\. For Qwen3\-14B, thinking output is disabled at the API level to preserve the same structured\-output contract used for the other models\. We treat this as inference control rather than prompt modification: the task instructions and diagnostic conditions remain unchanged\.
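As an illustration of these inference controls, the sketch below builds a request body in the shape accepted by Ollama's generate endpoint\. The text does not state which endpoint or client the released scripts use, so the endpoint choice, the helper name, and the model tag in the usage note are assumptions; only the three numeric settings and the thinking switch come from the text\.

```python
def build_ollama_request(model_tag: str, prompt: str, disable_thinking: bool = False) -> dict:
    """Request body for a deterministic, non-streaming Ollama generation call."""
    payload = {
        "model": model_tag,
        "prompt": prompt,
        "stream": False,
        "options": {
            "temperature": 0,     # deterministic decoding
            "num_ctx": 32768,     # explicit context-window configuration
            "num_predict": 1024,  # maximum generation length in tokens
        },
    }
    if disable_thinking:
        # API-level switch; the prompt text itself is left unchanged.
        payload["think"] = False
    return payload
```

For example, `build_ollama_request("qwen3:14b", prompt, disable_thinking=True)` would correspond to the Qwen3\-14B setting, with the tag shown here as a placeholder rather than the tag recorded in the released configuration\.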

Structured\-output parse failures are retained and scored as incorrect\. Four of the six final 200\-sample primary runs have zero parse failures\. Qwen3\-14B on HotpotQA\-ONCU\-200 has one parse failure out of 800 predictions\. Gemma3\-12B on Controlled\-ONCU\-safe16K\-200 has six parse failures out of 800 predictions, all under the full\-context condition\. Gemma3\-12B on HotpotQA\-ONCU\-200 has zero parse failures\. In the primary 2WikiMultiHopQA\-ONCU\-500 runs, Qwen2\.5\-14B and Gemma3\-12B each have one parse failure out of 2000 predictions, while Qwen3\-14B has zero parse failures\. In the Llama/Mistral extension, the controlled runs have zero parse failures; the HotpotQA runs have two parse failures for Llama3\.1\-8B and one for Mistral\-Small3\.1\-24B; and the 2WikiMultiHopQA runs have three parse failures for Llama3\.1\-8B and one for Mistral\-Small3\.1\-24B\. Keeping parse failures in the denominator avoids selectively discarding difficult outputs and keeps the model comparisons conservative\.
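The retain\-and\-score\-as\-incorrect rule can be sketched as follows\. The JSON field name `answer` and the function names are hypothetical, since the structured\-output contract is defined elsewhere in the paper\.

```python
import json


def score_prediction(raw_output: str, gold_answer: str, score_fn) -> dict:
    """Score one model output, keeping parse failures in the denominator."""
    try:
        answer = json.loads(raw_output)["answer"]  # hypothetical field name
        if not isinstance(answer, str):
            raise TypeError("answer field is not a string")
    except (json.JSONDecodeError, KeyError, TypeError):
        # The prediction is retained and counted as incorrect, not dropped.
        return {"score": 0.0, "parse_failure": True}
    return {"score": score_fn(answer, gold_answer), "parse_failure": False}


def mean_score(results) -> float:
    """Mean over all predictions, including those that failed to parse."""
    return sum(r["score"] for r in results) / len(results)
```

Because failed outputs contribute a zero rather than being removed, a model cannot raise its mean score by emitting malformed output on difficult examples\.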

In addition to the ONCU\-compatible core matrix, we evaluate BABILong\-200 as an external answer\-performance validation setting\. BABILong\-200 uses the same three models and the no\-evidence, full\-context, and retrieved\-evidence conditions, yielding $3\times 200\times 3=1800$ additional predictions\. Because oracle evidence is unavailable in the current BABILong adapter, this external validation is excluded from ONCU aggregation and failure evidence\-overlap interpretation\.

We also evaluate RULER\-lite\-240 as an additional external answer\-performance validation setting\. RULER\-lite\-240 uses the same three models and two conditions, full context and retrieved context, yielding $3\times 240\times 2=1440$ additional predictions\. The RULER\-lite run uses deterministic short\-answer JSON generation, top\-$k=3$ lexical retrieval for the retrieved\-context condition, and exact\-match/token\-F1 answer scoring\. Because the adapted RULER\-lite setting does not instantiate no\-evidence and oracle\-evidence references, it is excluded from ONCU aggregation and evidence\-overlap interpretation\.

We also run a retrieval\-only retriever\-family ablation on HotpotQA\-ONCU\-200 and 2WikiMultiHopQA\-ONCU\-500\. This ablation evaluates lexical, dense, hybrid rank\-fusion, deterministic iterative, and oracle retrieval variants at top\-$k\in\{3,5,8,16\}$\. Since this analysis scores retrieved passage sets directly and does not call a reader model, it is not counted in the 10,800 oracle\-compatible model predictions\. Its role is to test, within the evaluated retriever families and budgets, whether the retrieved\-evidence bottleneck is explained by lexical retrieval alone, by retrieval budget, or by the difficulty of recovering complete multi\-hop evidence chains without exposing the reader to many distractors\.
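One way to score retrieved passage sets directly, without a reader model, is sketched below\. The three quantities and their names are illustrative assumptions rather than the paper's exact retrieval diagnostics; they are chosen to show the trade\-off between chain coverage and distractor exposure as the budget grows\.

```python
def retrieval_coverage(ranked_ids, gold_ids, k):
    """Score the top-k retrieved passages against the gold supporting set."""
    top_k = set(ranked_ids[:k])
    gold = set(gold_ids)
    found = len(top_k & gold)
    return {
        "recall": found / len(gold) if gold else 0.0,
        # True only when every supporting passage of the chain is retrieved.
        "full_chain": bool(gold) and gold <= top_k,
        # Retrieved passages that are not part of the gold evidence.
        "distractors": len(top_k - gold),
    }


# Toy ranking in which the second hop only appears at rank 5.
ranking = ["d1", "g1", "d2", "d3", "g2", "d4"]
for budget in (3, 5, 8, 16):
    print(budget, retrieval_coverage(ranking, ["g1", "g2"], budget))
```

In the toy ranking, the chain is incomplete at a budget of 3 and complete at 5, but the larger budget also admits an additional distractor\.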

To test whether the retrieved\-evidence results are robust to stronger retriever families under the same diagnostic references, we run a matched retriever\-family ONCU sensitivity experiment\. The experiment covers HotpotQA\-ONCU\-200 and 2WikiMultiHopQA\-ONCU\-500 for Qwen2\.5\-14B, Qwen3\-14B, and Gemma3\-12B\. For each model–dataset pair, we rerun the full four\-condition protocol twice: once with dense retrieved evidence at top\-$k=16$ and once with hybrid lexical–dense rank\-fusion retrieved evidence at top\-$k=16$\. This adds $3\times(200+500)\times 2\times 4=16{,}800$ fixed\-condition predictions\. The no\-evidence, full\-context, and oracle\-evidence references are regenerated under the same answer contract, decoding policy, chunk size, overlap, and scoring pipeline; only the retrieved\-evidence input family changes\. The sensitivity experiment is not used to tune the main lexical top\-$k=3$ protocol\. It estimates whether the full\-context\-versus\-retrieved contrast remains stable when the retrieved input is strengthened by dense or hybrid retrieval and a larger retrieval budget\.

To close the remaining gap between pre\-reader retrieval coverage and system\-level retrieve\-then\-read behavior, we also run a broader reader\-facing retriever\-family validation\. This validation evaluates lexical, dense, and hybrid retrieved contexts at top\-$k\in\{3,8,16\}$ for all three answer models on HotpotQA\-ONCU\-200 and 2WikiMultiHopQA\-ONCU\-500, yielding $3\times(200+500)\times 3\times 3=18{,}900$ reader predictions\. The fixed answer contract, decoding policy, chunk size, overlap, and scoring pipeline are preserved\. This auxiliary sweep is reported with answer F1, evidence F1, parse failures, and retrieval diagnostics rather than as a separate full ONCU matrix across all budgets, because only the dense@16 and hybrid@16 settings are rerun with complete matched no\-evidence and oracle\-evidence references\.

To further test whether the retrieved\-context conclusion depends on the absence of reranking, we add a two\-stage cross\-encoder audit, reported in Appendix [A\.11](https://arxiv.org/html/2606.06758#A1.SS11)\. The audit uses hybrid lexical–dense retrieval as the first stage and cross\-encoder/ms\-marco\-MiniLM\-L6\-v2 as the reranker\. It covers seven candidate/final\-budget variants for all five evaluated local models on HotpotQA\-ONCU\-200 and 2WikiMultiHopQA\-ONCU\-500, adding $5\times 7\times(200+500)=24{,}500$ reader predictions\. The resulting ONCU values are sample\-level sensitivity diagnostics joined to existing no\-evidence and oracle\-evidence references, not replacements for the metadata\-group ONCU tables used in the main four\-condition matrix\.

Finally, we run a controlled length–position scaling extension on the 3,200\-sample controlled scaling input for Qwen2\.5\-14B, Qwen3\-14B, and Gemma3\-12B\. Each model is evaluated under no\-evidence, full\-context, retrieved\-evidence, and oracle\-evidence conditions, yielding $3\times 3{,}200\times 4=38{,}400$ additional fixed\-condition predictions\. The scaling extension is counted separately from the 10,800 oracle\-compatible main predictions because it is a focused mechanistic probe rather than a replacement for the balanced core matrix\. Its purpose is to test whether full\-context ONCU varies systematically with context length and evidence position under known evidence structure and whether the qualitative pattern replicates across both the Qwen\-family models and Gemma3\-12B\.

### 8\.2\.Retrieval Baseline Interpretation and Sensitivity Preview

The main retrieved\-evidence rows should be read as a matched lexical@3 diagnostic intervention, not as a claim that lexical@3 is the strongest retrieval\-augmented pipeline\. Retrieval sensitivity is assessed through a complete dense@16/hybrid@16 four\-condition ONCU matrix, broader reader\-facing sweeps, and a five\-model cross\-encoder reranking audit\. Only the matched dense@16/hybrid@16 reruns support main\-text ONCU sensitivity claims; the broader sweeps diagnose retrieval behavior and are reported as auxiliary evidence\.

This distinction is important for interpreting the realistic multi\-hop results: improving retrieval family or increasing top\-$k$ can improve evidence\-chain coverage and downstream answer F1, while the same expansion can also expose the reader to more distractors\.

The broader reader\-facing sweep confirms that stronger or larger\-budget retrieved inputs can improve downstream retrieved\-context performance, but the best setting is model\- and dataset\-dependent and larger budgets increase distractor exposure\. The five\-model cross\-encoder reranking audit gives the same message: reranking narrows some retrieved\-context gaps, yet answer F1, evidence F1, and ONCU do not always select the same reranked setting\. These results are therefore used as auxiliary retrieval\-sensitivity evidence, with full tables in Appendix [A\.10](https://arxiv.org/html/2606.06758#A1.SS10) and Appendix [A\.11](https://arxiv.org/html/2606.06758#A1.SS11)\. The main retrieval claim retained in the body is the matched dense@16/hybrid@16 ONCU sensitivity experiment, because it reruns the complete four\-condition protocol\.

### 8\.3\.Main 200\-Sample Answer and Evidence Results

The core 200\-sample matrix shows a task\-dependent bottleneck split\. Controlled examples favor compact retrieved or oracle evidence over full context, while HotpotQA\-derived examples favor full context over the fixed retrieved condition\. These first results use denominator\-free answer and evidence scores over all evaluated samples; ONCU is interpreted only after the denominator\-valid group audit in the next subsection\.

Table [4](https://arxiv.org/html/2606.06758#S8.T4) reports the final 200\-sample results\. The controlled setting reveals a consistent full\-context utilization bottleneck across all three models\. Qwen2\.5\-14B, Qwen3\-14B, and Gemma3\-12B obtain full\-context relaxed F1 values of 0\.538, 0\.526, and 0\.514, respectively\. In contrast, their retrieved\-evidence relaxed F1 values are 0\.975, 0\.993, and 0\.845, and their oracle\-evidence reference scores are 0\.903, 0\.995, and 0\.990\. In the controlled safe16K setting, all three models can therefore answer from compact or isolated evidence substantially better than from the same evidence embedded in a long controlled context\.

The HotpotQA\-derived setting shows the opposite pattern\. For all three models, full\-context answering outperforms retrieved\-evidence evaluation\. Qwen2\.5\-14B obtains full\-context relaxed F1 of 0\.733 compared with 0\.590 under retrieved evidence; Qwen3\-14B obtains 0\.689 compared with 0\.558; and Gemma3\-12B obtains 0\.668 compared with 0\.561\. Evidence F1 values show the same direction: full\-context evidence F1 is consistently higher than retrieved\-evidence F1\. The fixed compact retrieved condition can therefore remove or under\-rank supporting facts required for the tested HotpotQA multi\-hop setting\. The reader\-facing retrieval validation in Appendix [A\.10](https://arxiv.org/html/2606.06758#A1.SS10) shows that stronger or larger\-budget retrieved inputs can improve retrieved answer F1, but the best retrieved configuration is not uniform across models and datasets and must be interpreted together with distractor exposure\.

Table 4\. Final 200\-Sample Core Diagnostic Results\. Each row summarizes 200 examples evaluated under four core conditions\. Full denotes full\-context input, Ret\. denotes retrieved evidence, and Oracle\-ref\. denotes the oracle\-evidence reference\. Parse errors are retained and scored as incorrect\.

### 8\.4\.Denominator\-Valid Group Audit and Sensitivity Checks

ONCU validity is treated as a result in its own right rather than as a hidden preprocessing step\. The valid/invalid group audit in Table [5](https://arxiv.org/html/2606.06758#S8.T5) and the HotpotQA filter\-sensitivity table below specify where the denominator supports a recovered\-advantage interpretation\. HotpotQA ONCU is interpreted only over oracle\-improving metadata groups, while the dataset\-level HotpotQA direction is supported by denominator\-free answer and evidence metrics\.

Table [5](https://arxiv.org/html/2606.06758#S8.T5) gives the valid\-group counts for the main 200\-sample matrix and the larger HotpotQA\-500 robustness runs before the ONCU table is interpreted\. The controlled safe16K benchmark has broad denominator coverage: all three models have at least 133 valid groups out of 134\. HotpotQA\-ONCU\-200 is more restrictive, with 28–29 valid groups out of 38\.

The larger HotpotQA\-500 runs audit whether the valid\-group coverage changes with sample size under the same metadata grouping scheme and evaluation contract\. Unique HotpotQA groups increase from 38 to 46\. Valid groups increase from 29 to 43 for Qwen2\.5\-14B, from 28 to 41 for Qwen3\-14B, and from 28 to 38 for Gemma3\-12B\. The remaining invalid groups have the same denominator status: the oracle\-evidence reference does not exceed the no\-evidence baseline under relaxed answer F1\. This means the denominator\-invalid groups are not parse failures or failed evidence mappings; they are groups for which the model is already comparatively answerable without evidence, or the isolated oracle snippet does not create additional recoverable advantage\.

Table 5\. Denominator\-Validity Audit for ONCU Aggregation\. Valid groups satisfy $S_{\mathrm{oracle}}>S_{\mathrm{no}}$ under the relaxed\-F1 score field\. The HotpotQA\-500 runs are larger\-sample robustness checks, not replacements for the balanced 200\-sample matrix\.

Table [6](https://arxiv.org/html/2606.06758#S8.T6) makes the HotpotQA\-ONCU\-200 denominator boundary explicit at main\-text level\. The invalid groups are all comparison\-type metadata groups under the current grouping scheme\. Their answer\-type mix is dominated by entity questions, with smaller numbers of yes/no, string, and numeric answers\. The raw invalid\-group relaxed\-F1 columns show the scores before ONCU filtering; they make the denominator boundary visible in the main results\. The final three columns compare three ways to summarize the full\-minus\-retrieved direction: a sample\-level relaxed\-F1 contrast over all examples, the usual unweighted group\-averaged ONCU contrast over denominator\-valid groups, and a denominator\-weighted ONCU contrast over the same valid groups\.

Table 6\. HotpotQA\-ONCU\-200 Denominator\-Filter Sensitivity\. Invalid groups are those with $S_{\mathrm{oracle}}\leq S_{\mathrm{no}}$\. Raw invalid\-group F1 is reported as No/Full/Ret\./Oracle\. The answer\-type profile counts samples inside invalid groups; E=entity, N=number, S=string, Y=yes/no\. The three contrast columns report Full/Ret\./$\Delta$ for sample\-level relaxed F1, unweighted valid\-group ONCU, and denominator\-weighted valid\-group ONCU\.

This audit sharpens the HotpotQA interpretation\. First, the denominator\-invalid groups are concentrated in a recognizable metadata region rather than arising from parse failures or unavailable oracle passages\. Second, sample\-level answer F1 over all HotpotQA\-ONCU\-200 examples supports full context over retrieved evidence for all three models, independently of ONCU filtering\. Third, both unweighted valid\-group ONCU and denominator\-weighted valid\-group ONCU preserve the same full\-over\-retrieved direction\. The resulting claim is therefore intentionally scoped: sample\-level answer and evidence metrics support full\-over\-retrieved behavior on HotpotQA overall, and ONCU supports the same direction on oracle\-improving groups\.

### 8\.5\.Oracle\-Reference Normalized Context Utilization

After denominator validity is made explicit, ONCU supports the same task\-dependent split\. The group\-normalized relaxed\-F1 ONCU table is interpreted in light of Tables [5](https://arxiv.org/html/2606.06758#S8.T5) and [6](https://arxiv.org/html/2606.06758#S8.T6)\. The values are diagnostics over oracle\-improving groups, not unconditional dataset\-level scores\.

The score columns in Table [7](https://arxiv.org/html/2606.06758#S8.T7) are group averages, so they may differ slightly from the example\-level means in Table [4](https://arxiv.org/html/2606.06758#S8.T4)\. ONCU is averaged only over metadata groups for which the oracle\-evidence reference condition exceeds the no\-evidence condition\.
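The group\-level aggregation can be sketched as follows\. The ratio form used here, recovered advantage over oracle\-reference advantage with clipping to the unit interval before averaging, is an assumption inferred from the validity condition and the raw\-versus\-clipped audit; the paper's Eq\. \(2\) remains the authoritative definition, and the dictionary keys are illustrative\.

```python
def oncu_ratio(s_cond, s_no, s_oracle):
    """Raw recovered fraction, or None when the denominator is not valid."""
    denominator = s_oracle - s_no
    if denominator <= 0:
        return None  # oracle reference does not exceed the no-evidence score
    return (s_cond - s_no) / denominator


def group_oncu(groups, condition):
    """Average clipped ONCU over denominator-valid metadata groups only.

    groups: dict mapping group id -> dict of group-mean scores with keys
    "no", "oracle", and the requested condition (for example "full" or "ret").
    """
    clipped = []
    for scores in groups.values():
        raw = oncu_ratio(scores[condition], scores["no"], scores["oracle"])
        if raw is not None:
            clipped.append(max(0.0, min(1.0, raw)))  # assumed clipping to [0, 1]
    return {
        "oncu": sum(clipped) / len(clipped) if clipped else None,
        "valid_groups": len(clipped),
        "invalid_groups": len(groups) - len(clipped),
    }
```

Reporting the valid and invalid group counts alongside the average keeps the denominator boundary visible whenever an ONCU value is quoted\.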

The controlled results show broad denominator coverage and a stable full\-context utilization bottleneck\. Full\-context ONCU is 0\.583 for Qwen2\.5\-14B, 0\.535 for Qwen3\-14B, and 0\.515 for Gemma3\-12B\. In contrast, retrieved\-evidence ONCU is 0\.981, 0\.994, and 0\.842, respectively\. In this setting, both ONCU and denominator\-free example\-level F1 point in the same direction: compact evidence recovers much more of the oracle\-reference advantage than the full long context\.

The HotpotQA\-derived results show the reverse relationship on oracle\-improving groups\. Full\-context ONCU is 0\.906 for Qwen2\.5\-14B, 0\.787 for Qwen3\-14B, and 0\.719 for Gemma3\-12B, while retrieved\-evidence ONCU is 0\.639, 0\.557, and 0\.536\. This conditional ONCU result is aligned with the denominator\-free evidence already reported in Table [4](https://arxiv.org/html/2606.06758#S8.T4): raw answer F1 and evidence F1 both favor full context over retrieved evidence for all three models\. The broader HotpotQA conclusion therefore rests on two compatible layers: sample\-level answer/evidence metrics over all evaluated examples, and ONCU over denominator\-valid oracle\-improving groups\.

Table 7\. Final 200\-Sample ONCU\-Relaxed\-F1 Results\. $S_{\mathrm{full}}$ and $S_{\mathrm{ret}}$ denote group\-averaged relaxed F1 for the full\-context and retrieved\-evidence conditions\.

### 8\.6\.Aggregate Raw\-vs\-Clipped ONCU Audit

The main controlled\-versus\-realistic split is not an artifact of clipping\. The raw\-ratio audit in Appendix [A\.2](https://arxiv.org/html/2606.06758#A1.SS2) shows one aggregate above\-oracle case, Qwen2\.5\-14B on Controlled\-safe16K under retrieved evidence, with a raw ratio of 1\.079 before clipping\. Clipped values support the recovered\-fraction summary, while raw values remain the audit trail for above\-oracle or below\-baseline behavior\.

### 8\.7\.Statistical and Failure\-Audit Summary

Bootstrap intervals, paired contrasts, and aggregate failure\-pattern summaries support the mean\-table conclusions\. Detailed numeric audit tables are placed in Appendix [A\.5](https://arxiv.org/html/2606.06758#A1.SS5), Appendix [A\.6](https://arxiv.org/html/2606.06758#A1.SS6), and Appendix [A\.7](https://arxiv.org/html/2606.06758#A1.SS7) so that the main text does not treat each audit as an independent result\.

### 8\.8\.Cross\-Model Findings

Stronger isolated\-evidence performance does not automatically imply stronger full\-context utilization\. In the 200\-sample controlled setting, Qwen3\-14B and Gemma3\-12B reach oracle\-reference relaxed F1 near 1\.0, but their full\-context ONCU values remain 0\.535 and 0\.515\. Qwen2\.5\-14B has lower oracle\-reference F1 but slightly higher full\-context ONCU at 0\.583\. This is a within\-protocol diagnostic comparison, not a general model ranking\.

The Gemma3\-12B controlled result adds a cross\-family perspective\. Gemma3\-12B has a high oracle\-reference relaxed F1 of 0\.990, but its retrieved\-evidence ONCU is 0\.842, lower than the Qwen\-family values above 0\.98\. ONCU separates the full\-context utilization failure from model\-specific sensitivity to compact retrieved evidence, a distinction that aggregate answer accuracy would obscure\.

### 8\.9\.Dataset\-Dependent Bottlenecks

The dominant bottleneck is dataset\- and pipeline\-dependent\. In Controlled\-ONCU, retrieved\-evidence evaluation dominates full\-context answering in both example\-level metrics and broad\-coverage ONCU\. In HotpotQA\-ONCU, denominator\-free answer/evidence metrics favor full context, and ONCU agrees over oracle\-improving groups\. The HotpotQA retrieved\-evidence result diagnoses the tested deterministic lexical retrieval condition, not all RAG pipelines\.

This difference illustrates why ONCU should be interpreted together with evidence F1 and dataset structure\. A low full\-context ONCU may reflect failure to use available evidence, whereas a low retrieved\-evidence ONCU may reflect a retrieval\-recall bottleneck rather than a generation bottleneck\. The HotpotQA results also show why no\-evidence normalization is necessary: all three models obtain non\-trivial no\-evidence relaxed F1, indicating that parametric knowledge or question priors can otherwise be mistaken for context utilization\.

### 8\.10\.Three\-Model Controlled Length–Position Scaling

The controlled full\-context deficit varies systematically with context length and evidence position\. The scaling extension uses the same 3,200 generated samples for Qwen2\.5\-14B, Qwen3\-14B, and Gemma3\-12B, crossing four context lengths, ten decile evidence\-position buckets, four reasoning types, four distractor\-similarity settings, and five seeds per cell\. This extension serves as mechanism evidence for the controlled benchmark, not as a replacement for the balanced cross\-dataset matrix\.
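The factorial design can be enumerated to confirm the sample count\. In the sketch below, the position labels follow the pos\_08 and pos\_09 naming used for the 32K profile and the four lengths are taken to be 4K, 8K, 16K, and 32K; the reasoning\-type and distractor\-similarity names are placeholders, since the actual labels are defined in the released scaling artifacts\.

```python
from itertools import product

CONTEXT_LENGTHS = ["4K", "8K", "16K", "32K"]
POSITION_BUCKETS = [f"pos_{i:02d}" for i in range(10)]     # decile buckets
REASONING_TYPES = [f"reasoning_{i}" for i in range(4)]     # placeholder names
DISTRACTOR_LEVELS = [f"distractor_{i}" for i in range(4)]  # placeholder names
SEEDS = range(5)

# One generated sample per combination of the five design factors.
cells = list(
    product(CONTEXT_LENGTHS, POSITION_BUCKETS, REASONING_TYPES, DISTRACTOR_LEVELS, SEEDS)
)
samples_per_length_position = len(cells) // (len(CONTEXT_LENGTHS) * len(POSITION_BUCKETS))

print(len(cells), samples_per_length_position)
```

The product gives 3,200 samples, so each length–position cell aggregates 80 samples across reasoning type, distractor similarity, and seed\.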

Table [8](https://arxiv.org/html/2606.06758#S8.T8) summarizes the length effect\. Across all three models, mean full\-context ONCU declines sharply as context length increases\. Qwen2\.5\-14B falls from 0\.999 at 4K to 0\.163 at 32K\. Qwen3\-14B falls from 0\.980 to 0\.150\. Gemma3\-12B starts lower at 4K, with mean full\-context ONCU of 0\.791, but converges to the same low 32K regime, with mean full\-context ONCU of 0\.157\. The controlled full\-context deficit is therefore not limited to one model family\.

Table 8\. Three\-Model Controlled Length Scaling\. Each full\-context cell reports clipped ONCU\-Relaxed\-F1 averaged over the ten evidence\-position buckets for a fixed context length\. Retrieved\-evidence ONCU is included as an internal compact\-evidence competence check\.

Table [9](https://arxiv.org/html/2606.06758#S8.T9) reports the 32K position profile, where the length effect is most severe\. The qualitative pattern is shared across models: early and middle positions recover little oracle\-referenced advantage, while the final evidence decile remains the strongest\. Qwen2\.5\-14B and Qwen3\-14B show an especially sharp recency\-skewed collapse: at 32K, clipped full\-context ONCU is near zero through most early and middle deciles, partially recovers at pos\_08, and reaches 1\.000 at pos\_09\. Gemma3\-12B shows a smoother but directionally consistent profile, with low early and middle ONCU values, partial recovery at pos\_08, and its strongest 32K value at pos\_09\.

Table 9\. 32K Full\-Context ONCU by Evidence Position\. Each cell reports clipped ONCU\-Relaxed\-F1 at the longest controlled context length\. The full length–position grid is released in the controlled scaling artifacts\.

The retrieved\-evidence condition provides the key internal validity check\. For Qwen2\.5\-14B and Qwen3\-14B, retrieved\-evidence ONCU remains essentially saturated across context lengths\. Gemma3\-12B also benefits substantially from compact evidence, but its retrieved\-evidence ONCU remains around 0\.84 rather than 1\.00\. This difference is diagnostically important\. It shows that all three models suffer a full\-context length–position utilization collapse, but Gemma3\-12B also retains an additional compact\-evidence bottleneck after localization has been simplified\.

Failure\-type analysis supports this interpretation\. In the Gemma3\-12B scaling run, the retrieved\-evidence condition has only 21 evidence\-localization failures out of 3,200 examples, but still has 802 evidence\-integration failures and 823 answer\-conversion failures\. The oracle\-evidence condition has 1,747 successes and 1,453 answer\-conversion failures\. Gemma3\-12B’s lower retrieved\-evidence ONCU is therefore not primarily caused by missing compact evidence; it reflects a reader\-side conversion and integration limitation\. In contrast, the Qwen\-family models recover nearly all oracle\-reference advantage when the evidence is compactly supplied, so their controlled scaling deficit is more directly attributable to full\-context evidence localization and utilization\.

We treat the scaling extension as a three\-model mechanism audit\. It is consistent with prior observations that long\-context models can be sensitive to the position of relevant information, but the ONCU framing asks a sharper question: whether a model recovers the same oracle\-evidence advantage when evidence is embedded in the full context versus compactly supplied\. The main conclusion is not that all models behave identically\. Rather, the shared qualitative trend is that full\-context ONCU degrades with length and evidence position, while the recovered compact\-evidence advantage differs by model family\.

### 8\.11\. Retrieval and External\-Validation Summary

Retrieval\-budget, retrieval\-only, reader\-facing, and external\-validation checks support the main diagnosis without expanding the central claim\. The central retrieval result retained in the main text is the matched dense@16/hybrid@16 ONCU sensitivity experiment, because it reruns the full no\-evidence, full\-context, retrieved\-evidence, and oracle\-evidence protocol\. Appendix [A\.8](https://arxiv.org/html/2606.06758#A1.SS8)–[A\.13](https://arxiv.org/html/2606.06758#A1.SS13) gives the detailed retrieval budgets, retrieval\-only ablations, reader\-facing validation, BABILong, and RULER\-lite tables\.

### 8\.12\. HotpotQA\-500 Robustness and Valid\-Group Audit

The HotpotQA full\-context\-over\-retrieved pattern and denominator\-validity behavior persist when the sample size is increased\. The HotpotQA\-500 robustness check covers Qwen2\.5\-14B, Qwen3\-14B, and Gemma3\-12B\. These runs audit stability and selection bias; they do not replace the balanced 3\-by\-2 main matrix\.

Table [10](https://arxiv.org/html/2606.06758#S8.T10) compares the 200\-sample and 500\-sample HotpotQA runs under the same top\-$k$ \($k=3$\) diagnostic protocol\. For Qwen2\.5\-14B, the larger run contains 500 examples and 2,000 total predictions with zero parse failures\. The sample\-level pattern remains stable: full\-context relaxed F1 is 0\.733, while retrieved\-evidence relaxed F1 is 0\.600\. Evidence F1 also continues to favor full\-context input, with full\-context evidence F1 of 0\.676 compared with retrieved\-evidence evidence F1 of 0\.465\. The ONCU pattern is preserved as well: full\-context ONCU is 0\.845, while retrieved\-evidence ONCU is 0\.657\.

Qwen3\-14B shows the same qualitative trend\. In the 500\-sample run, full\-context relaxed F1 is 0\.715, compared with retrieved\-evidence relaxed F1 of 0\.582\. Full\-context evidence F1 is 0\.699, compared with retrieved\-evidence evidence F1 of 0\.456\. The group\-normalized ONCU pattern remains stable: full\-context ONCU is 0\.787, while retrieved\-evidence ONCU is 0\.593\. Gemma3\-12B also preserves the same direction: full\-context relaxed F1 is 0\.675 compared with retrieved\-evidence relaxed F1 of 0\.576, full\-context evidence F1 is 0\.587 compared with retrieved\-evidence evidence F1 of 0\.399, and full\-context ONCU is 0\.671 compared with retrieved\-evidence ONCU of 0\.483\. Increasing the HotpotQA sample size preserves the full\-context\-over\-retrieved relationship across all three evaluated models\.

The larger runs also clarify the valid\-group issue\. Under the same metadata grouping scheme, the number of unique HotpotQA metadata groups increases from 38 to 46 for all three models\. The number of valid ONCU groups increases from 29 to 43 for Qwen2\.5\-14B, from 28 to 41 for Qwen3\-14B, and from 28 to 38 for Gemma3\-12B\. The number of invalid groups decreases from 9 to 3 for Qwen2\.5\-14B, from 10 to 5 for Qwen3\-14B, and from 10 to 8 for Gemma3\-12B\. In all cases, invalid groups are caused by the same criterion: the oracle\-evidence reference is not above the no\-evidence baseline\. This indicates that invalid HotpotQA ONCU groups are not caused by failed outputs or missing evidence mappings\. Rather, they reflect cases where the model’s no\-evidence baseline is already strong enough that the oracle\-evidence condition does not provide a positive denominator for ONCU\.
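The validity criterion described above reduces to a single comparison per metadata group. The following is a minimal sketch of such an audit, assuming each group is summarized by its mean no\-evidence and oracle\-evidence scores; the class, field names, group identifiers, and demo scores are hypothetical, and only the `reported_500` counts come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroupScores:
    """Mean scores for one metadata group (field names are illustrative)."""
    group_id: str
    s_no: float      # no-evidence condition
    s_oracle: float  # oracle-evidence reference


def is_denominator_valid(g: GroupScores) -> bool:
    """A group is ONCU-valid only when oracle evidence beats the no-evidence baseline."""
    return g.s_oracle > g.s_no


def audit(groups: list[GroupScores]) -> dict[str, int]:
    """Count valid and invalid groups under the oracle-over-baseline criterion."""
    valid = sum(is_denominator_valid(g) for g in groups)
    return {"groups": len(groups), "valid": valid, "invalid": len(groups) - valid}


# Hypothetical scores, not values from the paper.
demo = [
    GroupScores("bridge/easy", s_no=0.20, s_oracle=0.85),
    GroupScores("bridge/hard", s_no=0.10, s_oracle=0.60),
    GroupScores("comparison/easy", s_no=0.90, s_oracle=0.90),  # tie: no positive denominator
    GroupScores("comparison/hard", s_no=0.75, s_oracle=0.70),  # oracle below baseline
]
demo_audit = audit(demo)
print(demo_audit)

# Reported 500-sample counts (valid, invalid); each pair should sum to 46 groups.
reported_500 = {"Qwen2.5-14B": (43, 3), "Qwen3-14B": (41, 5), "Gemma3-12B": (38, 8)}
for model, (valid, invalid) in reported_500.items():
    assert valid + invalid == 46, model
```

Note that a tie between the oracle\-evidence and no\-evidence scores is treated as invalid here, matching the "not above the no\-evidence baseline" wording.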

We also compute group\-level bootstrap confidence intervals for the 500\-sample ONCU values\. For Qwen2\.5\-14B, full\-context ONCU is 0\.845 \[0\.769, 0\.909\], while retrieved\-evidence ONCU is 0\.657 \[0\.575, 0\.733\]\. For Qwen3\-14B, full\-context ONCU is 0\.787 \[0\.710, 0\.854\], while retrieved\-evidence ONCU is 0\.593 \[0\.510, 0\.675\]\. For Gemma3\-12B, full\-context ONCU is 0\.671 \[0\.578, 0\.759\], while retrieved\-evidence ONCU is 0\.483 \[0\.385, 0\.579\]\. These intervals reproduce the same direction as the 200\-sample matrix and provide larger\-sample support for the HotpotQA conclusion across all three models\.
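For readers who want to see what a group\-level bootstrap looks like in practice, here is a minimal sketch. It assumes a percentile bootstrap that resamples denominator\-valid groups with replacement and averages their clipped ONCU values; the released bootstrap scripts may use a different resampling scheme, resample count, or seed. The function name and the demo values are hypothetical.

```python
import random
from statistics import fmean


def bootstrap_mean_ci(group_values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-group values.

    Groups (not individual examples) are the resampling unit.
    Returns (point_estimate, lower_bound, upper_bound).
    """
    rng = random.Random(seed)
    n = len(group_values)
    resampled_means = sorted(
        fmean(rng.choices(group_values, k=n)) for _ in range(n_resamples)
    )
    lower = resampled_means[int(round((alpha / 2) * (n_resamples - 1)))]
    upper = resampled_means[int(round((1 - alpha / 2) * (n_resamples - 1)))]
    return fmean(group_values), lower, upper


# Hypothetical clipped ONCU values for 12 denominator-valid groups.
demo_groups = [1.0, 0.9, 0.85, 0.4, 0.75, 1.0, 0.6, 0.95, 0.3, 0.8, 1.0, 0.7]
point, lower, upper = bootstrap_mean_ci(demo_groups)
print(f"ONCU = {point:.3f} [{lower:.3f}, {upper:.3f}]")
```

Resampling at the group level keeps the interval aligned with the unit over which ONCU is averaged, so a few large groups do not dominate the uncertainty estimate.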

Table 10\. HotpotQA\-500 Robustness and Valid\-Group Audit for Qwen2\.5\-14B, Qwen3\-14B, and Gemma3\-12B\. The 500\-sample runs preserve the sample\-level full\-context\-over\-retrieved direction and increase valid ONCU groups\. Bracketed values are 95% bootstrap confidence intervals for ONCU\-Relaxed\-F1 over denominator\-valid groups\.

This robustness check sharpens the HotpotQA interpretation in three ways\. First, the larger samples reproduce the same qualitative answer and evidence trends for all three models, supporting the claim that the retrieved\-evidence bottleneck is not an artifact of the 200\-sample evaluation\. Second, the valid\-group audit shows that ONCU validity is governed by the oracle\-over\-no\-evidence denominator rather than by parse failures or failed evidence mapping\. Third, raw full\-context and retrieved\-evidence F1 preserve the same direction before ONCU denominator filtering is applied\. The HotpotQA claim is therefore stated in two layers: sample\-level answer and evidence metrics support full\-over\-retrieved behavior over the evaluated samples, while ONCU supports the same direction on oracle\-improving groups\.

### 8\.13\. 2WikiMultiHopQA\-ONCU\-500 Realistic Multi\-hop Validation

The 2WikiMultiHopQA\-ONCU\-500 results provide a second realistic multi\-hop setting in which full context outperforms the tested retrieved\-evidence condition\. This validation concerns the reconstructed 2Wiki protocol and deterministic lexical retrieved inputs, not all multi\-hop retrieval systems\.

Across all three models, full\-context input outperforms retrieved\-evidence input on relaxed answer F1\. Qwen2\.5\-14B obtains 0\.549 relaxed F1 under full context compared with 0\.378 under retrieved evidence; Qwen3\-14B obtains 0\.560 compared with 0\.400; and Gemma3\-12B obtains 0\.537 compared with 0\.395\. Evidence F1 shows the same direction: full\-context evidence F1 ranges from 0\.594 to 0\.663, whereas retrieved\-evidence evidence F1 ranges from 0\.316 to 0\.352\. The ONCU\-Relaxed\-F1 results preserve this pattern\. Full\-context ONCU is 0\.610 for Qwen2\.5\-14B, 0\.534 for Qwen3\-14B, and 0\.369 for Gemma3\-12B, while retrieved\-evidence ONCU is 0\.381, 0\.367, and 0\.215, respectively\. Failure\-pattern analysis is consistent with the gap: retrieved evidence produces substantially more evidence\-localization failures, ranging from 40\.6% to 44\.0%, whereas full\-context localization failures remain between 1\.2% and 10\.4%\. These results support the interpretation that, in realistic multi\-hop settings, deterministic lexical retrieve\-then\-read evaluation can become bottlenecked by evidence\-chain coverage, even when the full\-context condition still imposes a non\-trivial integration burden on the model\.

Table 11\. 2WikiMultiHopQA\-ONCU\-500 Core Results\.

Table 12\. 2WikiMultiHopQA\-ONCU\-500 ONCU Results\.

Table 13\. 2WikiMultiHopQA\-ONCU\-500 Failure\-Type Breakdown for Contextual Conditions\. Rates are percentages over 500 examples per row\. Loc\., Sel\., Int\., Conv\., Succ\., and Parse denote evidence localization failure, evidence selection failure, evidence integration failure, answer conversion failure, categorical success, and structured\-output parsing failure, respectively\.

### 8\.14\. Model\-Family Extension: Llama and Mistral

The task\-dependent bottleneck split is not limited to the original Qwen/Gemma panel within the tested local open\-weight scope\. We rerun the complete four\-condition protocol with Llama3\.1\-8B and Mistral\-Small3\.1\-24B using the same processed inputs, evidence conditions, decoding policy, answer contract, lexical retrieval setting, and scoring pipeline\. The extension is a local model\-family robustness check, not evidence about frontier proprietary systems or all larger open\-weight models\.

Table [14](https://arxiv.org/html/2606.06758#S8.T14) reports the extension\. The controlled result preserves the original synthetic\-context pattern\. Llama3\.1\-8B obtains full\-context ONCU 0\.760 and retrieved\-evidence ONCU 0\.938, while Mistral\-Small3\.1\-24B obtains full\-context ONCU 0\.625 and retrieved\-evidence ONCU 0\.996\. Even with a new Llama\-family model and a larger Mistral\-family model, compact retrieved evidence recovers more of the oracle\-reference advantage than the full long context in the controlled setting\.

The realistic multi\-hop settings preserve the opposite pattern\. On HotpotQA\-ONCU\-200, full\-context ONCU is 0\.742 for Llama3\.1\-8B and 0\.809 for Mistral\-Small3\.1\-24B, compared with retrieved\-evidence ONCU 0\.568 and 0\.539\. On 2WikiMultiHopQA\-ONCU\-500, full\-context ONCU is 0\.371 and 0\.596, compared with retrieved\-evidence ONCU 0\.254 and 0\.308\. The stronger absolute performance of Mistral\-Small3\.1\-24B on the realistic datasets does not remove the retrieval\-chain bottleneck: retrieved input remains below full context when the deterministic retriever under\-supplies or under\-ranks multi\-hop evidence\.

Table 14\. Model\-Family Extension Results\. The extension adds Llama3\.1\-8B and Mistral\-Small3\.1\-24B to the same four\-condition ONCU\-compatible protocol\. F1 columns are example\-level relaxed answer F1\. ONCU columns are group\-averaged clipped ONCU computed from relaxed answer F1 over denominator\-valid metadata groups\.

The extension strengthens the paper’s model\-family robustness within the tested local open\-weight panel\. Mistral\-Small3\.1\-24B is stronger than Llama3\.1\-8B in absolute full\-context F1 on the realistic multi\-hop datasets, yet both models show the same qualitative condition\-level diagnosis: controlled synthetic contexts expose full\-context localization and utilization failures, whereas the tested realistic multi\-hop retrieve\-then\-read settings are constrained by retrieval\-chain coverage before the reader can use the evidence\.

### 8\.15\. Matched Retriever\-Family ONCU Sensitivity

Stronger retrieved inputs narrow but do not eliminate the realistic multi\-hop full\-context\-over\-retrieved pattern in the matched ONCU setting\. The dense@16 and hybrid@16 sensitivity experiment reruns the complete four\-condition protocol on the same model–dataset pairs\. Retrieval\-only coverage is not treated as reader utilization; the matched rerun is used because it preserves no\-evidence, full\-context, retrieved\-evidence, and oracle\-evidence references for each setting\.

Table 15\. Matched Dense/Hybrid ONCU Sensitivity\.

Table [15](https://arxiv.org/html/2606.06758#S8.T15) shows that stronger retrieved inputs narrow but do not eliminate the realistic multi\-hop gap\. On HotpotQA\-ONCU\-200, full\-context F1 remains higher than dense@16 and hybrid@16 retrieved F1 for all three models\. The same ordering holds for ONCU: Qwen2\.5\-14B obtains full\-context ONCU 0\.906 compared with retrieved ONCU 0\.736 for dense@16 and 0\.762 for hybrid@16; Qwen3\-14B obtains 0\.787 compared with 0\.654 and 0\.626; and Gemma3\-12B obtains 0\.719 compared with 0\.624 and 0\.681\. Hybrid retrieval improves the retrieved condition over dense@16 for Qwen2\.5\-14B and Gemma3\-12B, though not for Qwen3\-14B, and in no case does it overtake the full\-context reference\.

The 2WikiMultiHopQA\-ONCU\-500 results are more nuanced but preserve the same diagnostic conclusion within the tested dense@16 and hybrid@16 settings\. These retrieved inputs substantially improve the retrieved\-evidence condition relative to the original lexical@3 reader\-facing baseline, yet full\-context ONCU remains higher for all three models\. The gap is largest for Qwen2\.5\-14B and Qwen3\-14B, while Gemma3\-12B nearly closes the gap under dense@16\. This pattern is important: it shows that the earlier retrieval conclusion should not be read as a claim about lexical retrieval alone\. Stronger retrieval can materially improve retrieve\-then\-read behavior, but the complete matched protocol still separates pre\-reader evidence availability from reader\-side utilization and answer conversion\.

Dense and hybrid retrieval change the magnitude of the retrieved\-evidence bottleneck and improve retrieval\-conditioned performance in several rows\. The matched reruns nevertheless preserve the need for the four\-condition contrast\. They also show why retrieval\-only and reader\-facing audits must both be reported: a retrieval family can improve chain coverage, evidence F1, and answer F1 without necessarily recovering the same oracle\-reference advantage as full\-context input\.

## 9\. Discussion

The main scientific claim is that final accuracy, retrieval coverage, and citation overlap do not directly identify the behavioral evidence\-utilization contrast studied here\. That contrast becomes observable only through matched evidence\-availability conditions whose denominator\-valid scope is reported before condition\-level ONCU values are interpreted\. The no\-evidence condition estimates answerability that should not be credited to the supplied context; the oracle\-evidence reference estimates the recoverable advantage of isolated evidence; and the full\-context and retrieved\-evidence conditions test whether that advantage survives the actual input regime\. ONCU is the conditional estimator used for this contrast, while the accompanying audits determine whether the denominator and companion evidence are trustworthy\.
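For reference, the contrast described in this paragraph can be written compactly. The form below is a reconstruction from the verbal description of ONCU in this paper (a baseline\-adjusted fraction of the oracle\-reference advantage, defined only for a positive denominator, with clipping used as a reporting convention); the symbol $S_c$ for the score under a contextual condition $c$ is notation introduced here for illustration, while $S_{\mathrm{no}}$ and $S_{\mathrm{oracle}}$ follow the Limitations section.

$$
\mathrm{ONCU}_c \;=\; \frac{S_c - S_{\mathrm{no}}}{S_{\mathrm{oracle}} - S_{\mathrm{no}}},
\qquad c \in \{\text{full context},\ \text{retrieved evidence}\},
\qquad \text{defined only when } S_{\mathrm{oracle}} > S_{\mathrm{no}},
$$

$$
\mathrm{ONCU}_c^{\mathrm{clip}} \;=\; \min\bigl(1,\ \max\bigl(0,\ \mathrm{ONCU}_c\bigr)\bigr).
$$

Under this reading, the numerator is the advantage a contextual condition adds over no\-evidence answerability, and the denominator is the advantage that isolated oracle evidence makes recoverable.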

The tested failures should therefore be interpreted at the level of evidence presentation, retrieval, and reading, rather than as a simple long\-context\-versus\-RAG comparison\. In controlled synthetic settings, compact or isolated evidence is often sufficient for correct answering, but the same evidence embedded in a long context yields substantially lower recovered oracle\-reference advantage\. This pattern is consistent with full\-context localization or utilization failure\. In the tested realistic multi\-hop settings, full context often outperforms deterministic retrieved evidence, which indicates that the bottleneck can move upstream to retrieval\-chain coverage\. The central result is not that full context or retrieval is uniformly better, but that the dominant bottleneck is task\- and pipeline\-dependent under the evaluated conditions\.

The auxiliary analyses support this interpretation without replacing the core protocol\. Model\-family extension tests whether the controlled\-versus\-realistic split persists within a broader local open\-weight panel\. Retrieval\-only analyses diagnose pre\-reader evidence availability; the matched dense@16/hybrid@16 ONCU sensitivity experiment tests stronger retrieved inputs under complete references; reader\-facing and cross\-encoder reranking audits test whether retrieval improvements survive answer generation across evaluated budgets; and failure\-pattern audits provide aggregate descriptive support rather than item\-level causal labels\. Each component addresses a different possible confound, but the paper’s primary inferential unit remains the matched four\-condition contrast\.

Several inferences are deliberately outside this claim\. ONCU is not a universal model\-ranking score, the oracle\-evidence condition is not a guaranteed upper bound, and denominator\-valid ONCU is not an unconditional dataset\-wide utilization estimate\. The strongest supported claim is behavioral and condition\-level: under the tested fixed evidence\-availability interventions, the evaluated long\-context and retrieve\-then\-read systems differ in how much recoverable evidence advantage survives evidence presentation, retrieval, reading, and output formatting\.

## 10\. Limitations

The study measures observable condition\-level behavior rather than mechanistic causal evidence use\. ONCU measures how scores change when evidence availability is changed under a fixed protocol\. It does not prove that a particular internal computation, attention path, or hidden state causally used a specific passage\. Such claims would require additional interventions such as counterfactual passage replacement, activation patching, causal mediation, or mechanistic tracing\.

The oracle\-evidence condition is an isolated\-evidence reference, not a perfect upper bound\. Annotated evidence may omit helpful neighboring context, aliases, or redundant support, while retrieved chunks and full contexts can contain auxiliary information\. Different oracle\-evidence extraction policies could change the denominator\-valid subset and the frequency of above\-oracle or below\-baseline raw ONCU behavior\. For this reason, raw ONCU above 1 and below 0 are preserved as diagnostic regimes, and clipped ONCU is used only as a reporting convention for recovered\-fraction summaries\.
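The distinction between raw and clipped ONCU can be illustrated with a short sketch. It assumes the ratio form $(S_c - S_{\mathrm{no}}) / (S_{\mathrm{oracle}} - S_{\mathrm{no}})$ implied by the paper's description; the function names, regime labels, and example scores are hypothetical and are not taken from the released `longcue/evaluation/oncu.py`.

```python
from typing import Optional


def raw_oncu(s_cond: float, s_no: float, s_oracle: float) -> Optional[float]:
    """Baseline-adjusted recovered fraction; None when the denominator is not positive."""
    denom = s_oracle - s_no
    if denom <= 0:
        return None
    return (s_cond - s_no) / denom


def clipped_oncu(raw: float) -> float:
    """Reporting convention: restrict the recovered fraction to [0, 1]."""
    return min(1.0, max(0.0, raw))


def regime(raw: Optional[float]) -> str:
    """Label the diagnostic regime that clipping would otherwise hide."""
    if raw is None:
        return "denominator-invalid"
    if raw > 1.0:
        return "above-oracle"
    if raw < 0.0:
        return "below-baseline"
    return "recovered-fraction"


# Hypothetical (s_cond, s_no, s_oracle) triples, not values from the paper.
examples = {
    "context beats isolated evidence": (0.70, 0.20, 0.60),
    "context hurts vs. no evidence": (0.10, 0.20, 0.60),
    "partial recovery": (0.50, 0.20, 0.60),
    "oracle does not help": (0.50, 0.60, 0.60),
}
labels = {name: regime(raw_oncu(*scores)) for name, scores in examples.items()}
for name, label in labels.items():
    print(f"{name:<34} -> {label}")
```

Keeping the raw value alongside the clipped one lets an audit count how often a contextual condition exceeds the oracle reference or falls below the no\-evidence baseline, instead of folding both cases into 1 and 0.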

ONCU aggregation is conditional on denominator\-valid groups satisfying $S_{\mathrm{oracle}} > S_{\mathrm{no}}$\. This condition is broad in Controlled\-ONCU\-safe16K but narrower in HotpotQA\-ONCU\-200\. The HotpotQA conclusions should therefore be read in two layers: ONCU supports claims over oracle\-improving groups, while dataset\-level conclusions rely on denominator\-free answer/evidence scores, HotpotQA\-500 robustness, retrieval\-budget checks, and failure summaries\.

The model panel covers five local open\-weight models across Qwen, Gemma, Llama, and Mistral families, but it does not cover hosted proprietary frontier systems, larger open\-weight variants such as 70B\-scale models, learned retrievers, supervised multi\-hop retrieval policies, or domain\-specific long\-context systems\. The empirical conclusions should therefore be read as strongest for the evaluated local open\-weight long\-context and retrieve\-then\-read behavior under deterministic decoding and fixed structured\-output contracts\.

The automatic failure\-pattern labels should be interpreted as aggregate descriptive support rather than item\-level causal labeling\. The human audit shows moderate annotator agreement but substantially lower rule\-versus\-final\-human agreement\. We therefore use the labels to summarize broad failure patterns and rely on continuous answer, evidence, retrieval, and ONCU metrics for primary claims\.

The retriever\-family sensitivity experiment addresses the most direct lexical\-baseline concern by rerunning the complete four\-condition protocol for dense@16 and hybrid@16 retrieved inputs\. It is still not an exhaustive retriever benchmark\. It does not cover learned retrievers, supervised multi\-hop retrieval policies, retriever fine\-tuning, or every top\-$k$ budget under full ONCU references\. The broader reader\-facing validation and five\-model cross\-encoder reranking audit therefore remain auxiliary sensitivity analyses, and future work should extend the matched protocol to learned retrieval systems and more retrieval budgets\.

Finally, the external BABILong\-200 and RULER\-lite\-240 experiments are answer\-performance validations, not ONCU benchmarks\. They do not instantiate matched no\-evidence and oracle\-evidence reference conditions\. Extending the full four\-condition protocol to these and other long\-context benchmarks is future work\.

## Data and Code Availability

The implementation, fixed configurations, processed\-data builders, released processed inputs, frozen experiment summaries, confidence intervals, and reproduction instructions are available in the fixed public GitHub release: [https://github\.com/Haizhoux0517/long\_context\_cue/releases/tag/v1\.0\.1\-jair](https://github.com/Haizhoux0517/long_context_cue/releases/tag/v1.0.1-jair)\. The release is tagged as v1\.0\.1\-jair and corresponds to the artifact snapshot used for this submission\. Source code is released under the MIT License\. Documentation, figures, tables, README files, and supplementary materials are released under CC BY 4\.0\. Third\-party datasets and benchmark resources retain their original licenses and terms of use, as documented in DATA\_LICENSES\.md\.

The release contains the protocol code, fixed run configurations, core processed JSONL inputs, frozen result artifacts, runtime and model\-environment records, and scripts for recomputing the reported tables\. The submitted snapshot was verified with scripts/check\_release\_artifacts\.py \-\-strict\-data \-\-strict\-clean\. Table 16 provides the reviewer\-facing map from reported experiment families to repository paths, including which auxiliary inputs are generated by released builders rather than checked in as core inputs\. Runtime records include hardware and software metadata, deterministic inference controls, and unavailable\-tool diagnostics; exact Ollama, model digest, and quantization entries are recorded when available on the reproduction host\.

Table 16\. Reviewer\-Facing Code and Data Artifact Map\. The paths are aligned with the submitted repository layout\. “Released” denotes files or directories present in the repository release\. The processed JSONL files are released for auditability and are also regenerable from the builder scripts and public source datasets\.

| Artifact group | Repository path\(s\) | Status | Reviewer audit role |
| --- | --- | --- | --- |
| Experiment runner | longcue/run\_experiment\.py; longcue/ | Released | Executes the fixed diagnostic protocol and writes per\-sample predictions and metrics\. |
| Protocol and ONCU computation | scripts/validate\_diagnostic\_protocol\.py; scripts/recompute\_oncu\.py; longcue/evaluation/oncu\.py | Released | Validates fixed protocol fields and recomputes raw and clipped ONCU summaries from per\-sample metrics\. |
| Dataset builders and adapters | scripts/build\_controlled\_cue\.py; scripts/build\_controlled\_scaling\_cue\.py; scripts/build\_hotpotqa\_cue\.py; scripts/build\_2wiki\_cue\.py; scripts/build\_babilong\_cue\.py; scripts/build\_ruler\_lite\.py; longcue/data/ | Released | Builds controlled, controlled\-scaling, HotpotQA, 2WikiMultiHopQA, BABILong, and RULER\-lite processed inputs\. The BABILong and RULER\-lite adapters mark the external sets as not ONCU\-compatible because the full oracle\-reference condition protocol is unavailable\. |
| Run configurations | configs/\*\_200\_core\_final\.yaml; configs/hotpotqa\_\*\_500\_core\_robust\.yaml; configs/twowiki\_\*\_500\_core\.yaml; configs/model\_family\_extension/\*\.yaml; configs/babilong\_\*\_200\_external\.yaml; configs/ablations/retriever\_family\_\*\.yaml; configs/ablations/reader\_facing\_retfam\_\*\.yaml; configs/retriever\_family\_oncu\_sensitivity/\*\.yaml; configs/scaling/controlled\_scaling\_\*\.yaml; scripts/run\_ruler\_lite\_external\.py | Released | Records dataset paths, model names, retrieval settings, deterministic decoding settings, ablation settings, output directories, and the RULER\-lite external validation runner\. |
| Released processed JSONL inputs | data/processed/controlled\_oncu\_200\_safe16k\.jsonl; data/processed/hotpotqa\_cue\_200\.jsonl; data/processed/hotpotqa\_cue\_500\.jsonl; data/processed/twowiki\_cue\_500\.jsonl; data/processed/babilong\_cue\_200\_external\.jsonl | Released | Provides the materialized inputs for the core ONCU\-compatible runs, HotpotQA robustness run, and BABILong external validation; these files are included for auditability and can also be regenerated from the builder scripts\. |
| Generated auxiliary inputs | data/processed/controlled\_scaling\_3200\.jsonl; data/processed/ruler\_lite\_240\.jsonl | Generated | Recreated by the released controlled\-scaling and RULER\-lite builder scripts before rerunning those auxiliary audits; the corresponding frozen summary artifacts are released under experiment\_backups/\. |
| Frozen result artifacts | experiment\_backups/sci200\_final\_3model\_20260525/; experiment\_backups/twowiki\_500\_validation\_20260527/; experiment\_backups/model\_family\_extension\_20260601/; experiment\_backups/hotpotqa\_500\_robustness\_20260525/; experiment\_backups/babilong\_200\_external\_20260526/; experiment\_backups/ruler\_lite\_external\_20260530\_final/; experiment\_backups/retriever\_family\_ablation\_20260527/; experiment\_backups/retriever\_family\_oncu\_sensitivity\_20260602/; experiment\_backups/reader\_facing\_retriever\_family\_20260530/; experiment\_backups/controlled\_scaling\_20260527/; experiment\_backups/failure\_taxonomy\_human\_validation\_20260530/ | Released | Stores the result CSV files, exported LaTeX tables, confidence intervals, model\-family extension outputs, external\-validation summaries, retrieval\-family ablation summaries, matched retriever\-family ONCU sensitivity summaries, reader\-facing retriever\-family validation summaries, controlled scaling summaries, human\-validation audit outputs, and final summaries used in the paper, including the completed 2WikiMultiHopQA\-ONCU\-500 outputs\. |
| Cross\-encoder reranking audit | configs/rerank\_sensitivity/; RUN\_CE\_RERANK\_CONFIG\_LIST\.sh; scripts/summarize\_ce\_rerank\_five\_model\.py; experiment\_backups/rerank\_sensitivity\_20260602/five\_model\_ce\_rerank\_summary/ | Released summary/config | Provides the seven\-setting, five\-model cross\-encoder reranking appendix audit over HotpotQA\-ONCU\-200 and 2WikiMultiHopQA\-ONCU\-500 through released configurations, runner list, summarization script, and lightweight summary CSV/TEX files\. |
| Confidence intervals, retrieval ablations, and failure summaries | scripts/bootstrap\_sci200\_final\_ci\.py; scripts/bootstrap\_hotpotqa500\_robustness\_ci\.py; scripts/bootstrap\_babilong200\_external\_ci\.py; scripts/summarize\_ruler\_lite\_external\.py; scripts/summarize\_sci200\_failure\_breakdown\.py; scripts/recompute\_twowiki500\_tables\.py; scripts/run\_retriever\_family\_ablation\.py; scripts/prepare\_retriever\_family\_oncu\_sensitivity\.py; scripts/summarize\_reader\_facing\_retriever\_results\.py; scripts/summarize\_controlled\_scaling\.py | Released | Regenerates bootstrap confidence intervals, external\-validation summaries, final failure\-type summaries, 2Wiki derived tables, retrieval\-family ablation summaries, matched retriever\-family ONCU sensitivity configurations, reader\-facing retriever\-family validation summaries, and controlled scaling summaries from completed outputs\. |
| Failure\-taxonomy human validation | scripts/export\_failure\_taxonomy\_audit\.py; scripts/summarize\_failure\_taxonomy\_audit\.py; experiment\_backups/failure\_taxonomy\_human\_validation\_20260530/ | Released | Stores the stratified blind audit sample, annotation codebook, anonymous annotator files, adjudicated final human labels, agreement summaries, confusion matrices, exported LaTeX tables, and manifest for the human\-validation audit\. |
| Runtime and model environment record | scripts/export\_runtime\_record\.py; RUNTIME\_REPRODUCIBILITY\_RECORD\.md; runtime\_reproducibility\_record\.json | Released | Records deterministic inference controls, Python/package versions, GPU/VRAM, PyTorch/CUDA, NVIDIA driver/CUDA information, local model tags requested by the protocol, context\-window configuration, and unavailable runtime commands for reproducibility audit\. |
| Release audit checker | scripts/check\_release\_artifacts\.py | Released | Verifies the declared repository artifacts, including core files, released processed inputs, frozen result directories, and auxiliary summary/config artifacts\. The submitted release passes the strict audit command python scripts/check\_release\_artifacts\.py \-\-strict\-data \-\-strict\-clean\. |
| Reproduction and dependencies | README\_REPRODUCE\.md; ARTIFACT\_MANIFEST\.md; requirements\.txt; pyproject\.toml | Released | Documents environment setup, model names, execution commands, expected outputs, dependency metadata, and the mapping from paper tables to repository artifacts\. |

Processed inputs\. The processed evaluation files are JSONL records with explicit sample identifiers, questions, passage\-annotated contexts, gold answers, and metadata fields\. The ONCU\-compatible datasets also include oracle\-evidence identifiers, which are used to construct the oracle\-evidence condition and to compute evidence\-overlap metrics\. BABILong\-200 and RULER\-lite\-240 are kept separate because their current adapters do not provide the full no\-evidence and oracle\-evidence reference protocol required for ONCU\. Original public datasets, including HotpotQA, 2WikiMultiHopQA, BABILong, and RULER\-style benchmark resources, should be obtained from their official providers subject to their licenses and terms of use\.
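To make the record description concrete, the sketch below shows one way a processed record could be turned into the four evidence conditions. Every field name (`sample_id`, `context`, `passage_id`, `oracle_evidence_ids`, and so on), the example record, and the helper `build_conditions` are hypothetical; the released JSONL schema and builders may differ and should be treated as authoritative.

```python
import json

# One illustrative record; all field names and values are hypothetical.
record_line = json.dumps({
    "sample_id": "demo-0001",
    "question": "Which city hosts the archive?",
    "context": [
        {"passage_id": "p0", "text": "The archive moved in 1990."},
        {"passage_id": "p1", "text": "The archive is hosted in Lyon."},
        {"passage_id": "p2", "text": "Unrelated filler passage."},
    ],
    "gold_answer": "Lyon",
    "oracle_evidence_ids": ["p1"],
    "metadata": {"question_type": "bridge"},
})


def build_conditions(record: dict, retrieved_ids: list[str]) -> dict[str, list[str]]:
    """Return the passage texts shown to the reader under each of the four conditions."""
    by_id = {p["passage_id"]: p["text"] for p in record["context"]}
    return {
        "no_evidence": [],
        "full_context": list(by_id.values()),
        "retrieved_evidence": [by_id[i] for i in retrieved_ids if i in by_id],
        "oracle_evidence": [by_id[i] for i in record["oracle_evidence_ids"]],
    }


record = json.loads(record_line)
retrieved_ids = ["p0", "p2"]  # a retriever output that misses the oracle passage
conditions = build_conditions(record, retrieved_ids)

# Chain coverage: did retrieval expose every oracle passage to the reader?
oracle_covered = set(record["oracle_evidence_ids"]) <= set(retrieved_ids)
print("oracle passages covered by retrieval:", oracle_covered)
```

In this toy case the full context still contains the oracle passage while the retrieved input does not, which is the kind of pre\-reader coverage gap the retrieval\-conditioned condition is meant to expose.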

Regeneration paths\. The repository records exact commands and checksums in README\_REPRODUCE\.md and ARTIFACT\_MANIFEST\.md\. The main paper reports audit locations rather than printing long checksums inline\. The principal regeneration paths are:

- Runtime and model environment record: generated by scripts/export\_runtime\_record\.py as RUNTIME\_REPRODUCIBILITY\_RECORD\.md and runtime\_reproducibility\_record\.json; the submitted record captures the Runpod A40 reproduction host, deterministic inference controls, PyTorch/CUDA metadata, NVIDIA driver/CUDA information, and requested local model tags\.
- 2WikiMultiHopQA\-ONCU\-500: generated by scripts/build\_2wiki\_cue\.py with the fixed seed reported in the artifact manifest\.
- Retriever\-family ablation: produced by scripts/run\_retriever\_family\_ablation\.py and archived under experiment\_backups/retriever\_family\_ablation\_20260527/\.
- Retriever\-family ONCU sensitivity: generated by scripts/prepare\_retriever\_family\_oncu\_sensitivity\.py, run with configs/retriever\_family\_oncu\_sensitivity/\*\.yaml, and archived under experiment\_backups/retriever\_family\_oncu\_sensitivity\_20260602/\.
- Five\-model cross\-encoder reranking audit: configured with configs/rerank\_sensitivity/, run with RUN\_CE\_RERANK\_CONFIG\_LIST\.sh, summarized by scripts/summarize\_ce\_rerank\_five\_model\.py, and represented in the release by lightweight summary tables under experiment\_backups/rerank\_sensitivity\_20260602/five\_model\_ce\_rerank\_summary/\.
- RULER\-lite\-240: generated by scripts/build\_ruler\_lite\.py, evaluated by scripts/run\_ruler\_lite\_external\.py, summarized by scripts/summarize\_ruler\_lite\_external\.py, and archived under experiment\_backups/ruler\_lite\_external\_20260530\_final/\. The generated input file is recreated by the builder before rerunning the audit rather than treated as a checked\-in core input\. The Qwen3\-14B run uses the API\-level thinking\-output control described in Section 8\.
- Controlled scaling: generated by scripts/build\_controlled\_scaling\_cue\.py, summarized by scripts/summarize\_controlled\_scaling\.py, and archived under experiment\_backups/controlled\_scaling\_20260527/\. The generated 3,200\-sample input is regenerated by the released builder when rerunning this auxiliary audit\.
- Failure\-taxonomy human validation: exported by scripts/export\_failure\_taxonomy\_audit\.py, summarized by scripts/summarize\_failure\_taxonomy\_audit\.py, and archived under experiment\_backups/failure\_taxonomy\_human\_validation\_20260530/\.

If anonymous review is required, the repository URL should be replaced by an anonymized archival link or anonymous repository snapshot during submission and restored after review\. Publication metadata fields such as DOI, volume, article number, and associate editor should be left blank or suppressed in the review PDF until assigned by JAIR\.

## 11\. Conclusion

Long\-context and retrieval\-augmented evaluation can be treated as a diagnostic protocol\-and\-estimation problem\. Final answer accuracy alone cannot determine whether an answer reflects no\-evidence answerability, whether a retriever preserved the necessary evidence chain, or whether the reader converted available evidence into the requested answer\. The proposed four\-condition protocol—no evidence, full context, retrieved evidence, and oracle\-evidence reference—makes those contrasts observable, and ONCU provides a baseline\-adjusted estimate of recovered oracle\-reference advantage when its denominator is valid\.

Across the five tested local open\-weight models, the same qualitative split appears\. Controlled synthetic settings primarily expose full\-context utilization failures: compact or isolated evidence is usable, but the same evidence embedded in long inputs is not reliably converted into oracle\-reference advantage\. In the tested HotpotQA and 2WikiMultiHopQA settings, deterministic retrieve\-then\-read input can lose support that remains available in full context, exposing retrieval\-chain coverage failures\. Scaling, model\-family extension, retrieval\-family, reader\-facing, external\-validation, and human\-audit analyses support this bounded interpretation while also clarifying its limits\.

The framework complements answer, evidence, and retrieval metrics by adding a matched diagnostic protocol, denominator\-validity regime, and aggregate failure\-pattern audit for separating no\-evidence answerability, evidence availability, reader utilization, denominator validity, and output stability\. Future work should extend the matched four\-condition protocol to stronger frontier and larger open\-weight models, learned multi\-hop retrievers and rerankers, domain\-specific long\-context tasks, and mechanistic interventions that can test causal evidence use directly\.

## References

## Appendix A Supplementary Audit Material

The main text is intentionally restricted to the evidence chain needed for the JAIR submission argument: protocol, estimator validity, primary ONCU results, controlled and realistic robustness checks, model\-family coverage, and the matched retriever\-family ONCU sensitivity experiment\. The appendices below preserve detailed audit material that is useful for reviewers but would otherwise make the main article read like an experiment report\.

### A\.1\. Comparison with Alternative Diagnostic Scores

ONCU is intended to complement, not replace, standard answer, evidence, and retrieval metrics\. Raw answer F1 and exact match measure whether the final answer is correct, but they do not distinguish contextual evidence use from no\-evidence answer priors or parametric knowledge\. Evidence F1 measures whether cited passages overlap with oracle evidence, but a model can cite relevant evidence and still fail to integrate or convert it into the correct answer\. Retrieval recall@$k$ measures whether the retriever exposes the required evidence before answer generation, but it does not measure whether the reader uses that evidence\. Oracle\-gap and context\-gain scores each control one side of the diagnostic comparison, but not both\. ONCU is designed for the narrower question of how much oracle\-evidence advantage a contextual condition recovers after adjusting for what the same model can answer without evidence\. Because ONCU filters non\-positive denominators, the empirical sections also report unnormalized example\-level answer and evidence scores; these raw\-score analyses serve as denominator\-free sensitivity checks on the direction of the reported effects\.

Table [17](https://arxiv.org/html/2606.06758#A1.T17) summarizes these distinctions\. The comparison is important because several failure patterns in this paper would be hard to interpret from any single raw metric\. A high full\-context answer F1 can partly reflect no\-evidence answerability; high evidence overlap can coexist with answer\-conversion failure; and high retrieval recall can still lead to low reader utilization\. ONCU addresses a distinct diagnostic question by normalizing a contextual score between the no\-evidence baseline and the oracle\-evidence reference\.
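The contrast among these scores can be made concrete with a minimal sketch\. The function name, the definitions of context gain and oracle gap as simple differences, the clipping to the unit interval, and all numeric inputs are illustrative assumptions rather than the released implementation\. The example shows two hypothetical groups with the same contextual F1 but very different recovered oracle\-reference advantage\.

```python
def diagnostic_scores(s_no, s_ctx, s_oracle):
    """Return (context_gain, oracle_gap, oncu) for one group of scores.

    s_no     : score with no evidence
    s_ctx    : score under a contextual condition (full context or retrieved)
    s_oracle : score with isolated oracle evidence
    """
    context_gain = s_ctx - s_no      # adjusts for the no-evidence baseline only
    oracle_gap = s_oracle - s_ctx    # adjusts for the oracle reference only
    denom = s_oracle - s_no          # oracle-evidence advantage
    if denom <= 0:
        oncu = None                  # denominator-invalid group: ONCU not reported
    else:
        # Assumed clipping to [0, 1] for the recovered-fraction reading.
        oncu = min(1.0, max(0.0, (s_ctx - s_no) / denom))
    return context_gain, oracle_gap, oncu


# Hypothetical groups that share a contextual F1 of 0.70.
answerable_without_evidence = diagnostic_scores(s_no=0.60, s_ctx=0.70, s_oracle=0.95)
needs_evidence = diagnostic_scores(s_no=0.10, s_ctx=0.70, s_oracle=0.80)

print(answerable_without_evidence)
print(needs_evidence)
```

Under these assumed inputs the first group recovers under a third of the oracle\-evidence advantage, while the second recovers most of it, even though raw contextual F1 is identical\.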

Table 17\. Comparison of ONCU with Alternative Diagnostic Scores\. Baseline indicates whether the score adjusts for no\-evidence answerability, Oracle\-ref\. indicates whether it accounts for an isolated\-evidence reference, and Evidence indicates whether the score directly evaluates evidence availability or overlap\.

Table [18](https://arxiv.org/html/2606.06758#A1.T18) gives representative cases from the released artifacts\. The controlled examples show why raw answer F1 alone is insufficient: Qwen3\-14B has non\-trivial full\-context F1 on the controlled safe16K setting, but ONCU shows that the full\-context condition recovers only about half of the oracle\-evidence advantage, whereas compact retrieved evidence nearly saturates the oracle reference\. The realistic multi\-hop examples show the opposite phenomenon: full context can preserve evidence\-chain coverage better than deterministic retrieve\-then\-read input, so retrieval can reduce ONCU even when the retrieved context is shorter\. The controlled 32K cases further show why position\-aware ONCU is useful: early evidence can collapse to zero recovered advantage while the final evidence decile remains recoverable\. Finally, retrieval\-only rows illustrate that retrieval recall and chain coverage are pre\-reader diagnostics; they identify evidence availability bottlenecks but do not by themselves measure reader utilization\.

Table 18\. Metric\-Comparison Case Studies\. Raw answer F1, evidence F1 or chain coverage, context gain, oracle gap, and ONCU answer different diagnostic questions\. Dashes indicate metrics that are not defined for retrieval\-only rows or for aggregated controlled\-scaling cells without evidence\-overlap summaries\.

### A\.2\. Aggregate Raw\-vs\-Clipped ONCU Audit

The main ONCU tables report clipped group\-averaged ONCU because clipped values support the recovered\-fraction interpretation\. The raw\-ratio audit in Table [19](https://arxiv.org/html/2606.06758#A1.T19) makes visible when a contextual condition exceeds the oracle\-evidence reference or falls below the no\-evidence baseline before clipping\. Raw ratios are computed from the group\-averaged score columns in Table [7](https://arxiv.org/html/2606.06758#S8.T7) and are interpreted as clipping\-sensitivity diagnostics rather than as replacements for the official group\-averaged ONCU values\.
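A small sketch may help separate the two quantities compared in this audit: the mean of per\-group clipped ratios, and a single raw ratio formed from averaged score columns\. The group tuples are hypothetical, and two choices here are assumptions not confirmed by the text: clipping to the unit interval, and averaging the score columns over the same denominator\-valid groups\.

```python
def raw_ratio(s_no, s_ctx, s_oracle):
    """Unclipped ratio (S_c - S_no) / (S_oracle - S_no)."""
    return (s_ctx - s_no) / (s_oracle - s_no)


def clip01(x):
    """Clip to [0, 1] (assumed clipping range)."""
    return min(1.0, max(0.0, x))


def mean(xs):
    return sum(xs) / len(xs)


# Hypothetical (S_no, S_c, S_oracle) triples, one per metadata group.
groups = [
    (0.20, 0.90, 0.80),  # context exceeds the oracle reference (raw ratio > 1)
    (0.30, 0.20, 0.70),  # context falls below the baseline (raw ratio < 0)
    (0.10, 0.50, 0.90),  # ordinary case
    (0.60, 0.70, 0.60),  # non-positive denominator: excluded from ONCU
]

valid = [g for g in groups if g[2] - g[0] > 0]

# Clipped group-averaged value: clip each group, then average.
clipped_group_avg = mean([clip01(raw_ratio(*g)) for g in valid])

# Aggregate raw ratio: average each score column, then form one ratio.
avg_no, avg_ctx, avg_oracle = (mean([g[i] for g in valid]) for i in range(3))
aggregate_raw = raw_ratio(avg_no, avg_ctx, avg_oracle)

print(round(clipped_group_avg, 3), round(aggregate_raw, 3))
```

The two numbers differ whenever individual groups fall outside the unit interval, which is the sensitivity the audit is meant to expose\.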

Table 19\. Aggregate Raw\-vs\-Clipped ONCU Audit for the Final 200\-Sample Matrix\. Raw ratios are computed from the group\-averaged score columns in Table [7](https://arxiv.org/html/2606.06758#S8.T7) as $(S_{c}-S_{\mathrm{no}})/(S_{\mathrm{oracle}}-S_{\mathrm{no}})$\. They are used only as a clipping\-sensitivity audit; the official ONCU results remain the group\-averaged values reported in Table [7](https://arxiv.org/html/2606.06758#S8.T7)\.

### A\.3\. Failure Diagnosis

We diagnose failures according to the relationship between answer correctness and evidence correctness\. Table [20](https://arxiv.org/html/2606.06758#A1.T20) summarizes the operational rules used to assign the main failure categories\. The labels are intended as categorical diagnostic annotations rather than replacements for continuous answer F1 or evidence F1\. In particular, the success rate in this taxonomy can be stricter than relaxed answer F1 because it depends jointly on answer and evidence behavior\.

Table 20\. Operational Rules for Failure\-Type Assignment\.

### A\.4\. Human Validation of Failure\-Type Assignment

The failure taxonomy in Table [20](https://arxiv.org/html/2606.06758#A1.T20) is rule\-based and is therefore used as a scalable diagnostic approximation rather than as ground\-truth causal attribution\. To evaluate whether the operational labels align with human judgments, we conducted a blinded human\-validation audit on a stratified sample of 300 failed predictions\. The sample was drawn across datasets, models, evidence\-availability conditions, and rule\-assigned failure categories\. Two annotators independently assigned one of six labels: localization, selection, integration, conversion, parse\-format, or ambiguous\. One annotator was the author and the other was an anonymous independent annotator who requested not to be named\. During the initial annotation pass, annotators were blind to the rule\-based labels and to each other’s labels\.

Before adjudication, the annotators agreed on 213 of 300 items, yielding raw agreement of 0\.710 and Cohen’s $\kappa=0.588$ \(cohen1960coefficient\)\. The remaining 87 disagreements were resolved by blind adjudication to obtain final human labels\. Against these adjudicated labels, the rule\-based taxonomy matched 154 of 300 items, corresponding to raw agreement of 0\.513 and Cohen’s $\kappa=0.355$\. Table [21](https://arxiv.org/html/2606.06758#A1.T21) summarizes the validation results, and Table [22](https://arxiv.org/html/2606.06758#A1.T22) reports the label distributions\.
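For readers who want to recompute agreement from the released label files, the following is a minimal sketch of raw agreement and Cohen's kappa for two label sequences\. The function name and the toy labels are illustrative assumptions; the reported kappa values depend on the full label marginals and cannot be reproduced from the agreement counts alone\.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Return (raw_agreement, kappa) for two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from the product of each rater's label marginals.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return observed, (observed - expected) / (1.0 - expected)


# Toy labels only; the real audit uses 300 items and six categories.
annotator_1 = ["loc", "loc", "sel", "int", "conv", "loc"]
annotator_2 = ["loc", "sel", "sel", "int", "conv", "int"]
agreement, kappa = cohens_kappa(annotator_1, annotator_2)
print(round(agreement, 3), round(kappa, 3))

# Raw agreement figures quoted in the text follow directly from the counts.
print(round(213 / 300, 3), round(154 / 300, 3))
```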

The human\-validation audit supports two interpretations\. First, the annotators show moderate agreement on a difficult six\-way failure\-labeling problem, which indicates that the taxonomy captures recognizable failure modes rather than arbitrary categories\. Second, the lower rule\-versus\-final\-human agreement shows that the automatic rule labels should not be interpreted as item\-level causal ground truth\. We therefore use the automatic taxonomy for aggregate diagnostic analysis, preserve continuous answer and evidence metrics as the primary measurements, and release the human\-validation labels, adjudication outcomes, agreement summaries, and confusion matrices for auditability\.

Table 21\. Human Validation of Failure\-Type Assignment\. Two annotators independently labeled a stratified audit sample of failed predictions\. Disagreements were resolved by blind adjudication to produce final human labels\.

Table 22\. Failure\-Label Distributions in the Human\-Validation Audit\. Counts are over the same 300 audited failed predictions\.

### A\.5\. Final 200\-Sample Bootstrap Confidence Intervals

Table [23](https://arxiv.org/html/2606.06758#A1.T23) reports group\-level bootstrap confidence intervals for ONCU\-Relaxed\-F1 in the final 200\-sample matrix\. The intervals support the two main diagnostic patterns\. In the controlled setting, retrieved\-evidence ONCU remains substantially higher than full\-context ONCU for all three models: Qwen2\.5\-14B obtains 0\.981 \[0\.955, 1\.000\] under retrieved evidence compared with 0\.583 \[0\.506, 0\.663\] under full context; Qwen3\-14B obtains 0\.994 \[0\.985, 1\.000\] compared with 0\.535 \[0\.451, 0\.619\]; and Gemma3\-12B obtains 0\.842 \[0\.782, 0\.897\] compared with 0\.515 \[0\.438, 0\.591\]\. These intervals reinforce the conclusion that the controlled benchmark exposes a full\-context utilization bottleneck rather than a general inability to answer from compact evidence\.

The HotpotQA\-derived setting shows the reverse pattern on denominator\-valid groups\. Full\-context ONCU is consistently higher than retrieved\-evidence ONCU: Qwen2\.5\-14B obtains 0\.906 \[0\.843, 0\.958\] compared with 0\.639 \[0\.511, 0\.755\], Qwen3\-14B obtains 0\.787 \[0\.679, 0\.881\] compared with 0\.557 \[0\.441, 0\.675\], and Gemma3\-12B obtains 0\.719 \[0\.604, 0\.819\] compared with 0\.536 \[0\.428, 0\.645\]\. Together with the lower retrieved\-evidence F1 values in Table [4](https://arxiv.org/html/2606.06758#S8.T4), these intervals support the scoped interpretation that retrieved evidence can become the bottleneck in realistic multi\-hop samples: sample\-level answer/evidence metrics support the direction over evaluated examples, and ONCU supports it over oracle\-improving groups\.

Table 23\. Bootstrap 95% Confidence Intervals for Final 200\-Sample ONCU\-Relaxed\-F1\. ONCU is bootstrapped over valid metadata groups using 5,000 bootstrap resamples\. Full denotes full\-context input and Ret\. denotes retrieved evidence\.
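The group\-level resampling described in this caption can be sketched as follows\. The percentile interval, the fixed seed, the function name, and the per\-group values are assumptions for illustration; the text specifies only that resampling is over valid metadata groups with 5,000 resamples\.

```python
import random


def bootstrap_ci(group_values, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-group ONCU values."""
    rng = random.Random(seed)
    n = len(group_values)
    # Resample groups with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(group_values, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(group_values) / n, lower, upper


# Hypothetical clipped ONCU values for ten denominator-valid groups.
values = [1.0, 0.8, 0.0, 0.5, 1.0, 0.9, 0.3, 0.7, 1.0, 0.6]
point, lower, upper = bootstrap_ci(values)
print(f"{point:.3f} [{lower:.3f}, {upper:.3f}]")
```

Because each resample draws whole groups, the interval reflects between\-group variability rather than per\-example noise\.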
### A\.6\. Final 200\-Sample Failure\-Type Analysis

Table [24](https://arxiv.org/html/2606.06758#A1.T24) reports the final 200\-sample failure\-type breakdown for the two contextual conditions\. The taxonomy is used as a diagnostic annotation and should be interpreted together with continuous answer and evidence metrics\. Its categorical success label is stricter than relaxed answer F1 because it depends jointly on answer and evidence behavior\.

The controlled setting shows a consistent reduction in evidence\-localization failures under retrieved evidence\. For Qwen2\.5\-14B, localization failures decrease from 36\.0% under full\-context input to 0\.0% under retrieved evidence\. Qwen3\-14B shows a similar decrease from 23\.5% to 0\.0%, and Gemma3\-12B decreases from 10\.5% to 0\.0%\. This is consistent with the interpretation that the controlled benchmark exposes a full\-context evidence\-utilization bottleneck: when the relevant evidence is compactly provided, localization\-like failures are largely removed by the automatic taxonomy\.

The HotpotQA\-derived setting shows the opposite pattern\. Full\-context input has very low localization failure rates, ranging from 0\.5% to 1\.5% across models\. In contrast, retrieved\-evidence input increases localization failures to 23\.0% for Qwen2\.5\-14B, 22\.5% for Qwen3\-14B, and 26\.0% for Gemma3\-12B\. This is consistent with the retrieval\-coverage interpretation of the HotpotQA results: deterministic lexical retrieval can discard or truncate supporting facts required for multi\-hop reasoning\. The automatic labels are used as aggregate descriptive support, not as item\-level causal ground truth\.

The Gemma3\-12B controlled run further illustrates the diagnostic value of the framework\. Although retrieved evidence removes localization failures, Gemma3\-12B retains higher integration failures than the Qwen\-family models\. Its full\-context condition also contains a small number of structured\-output parsing failures\. The failure taxonomy complements ONCU by describing whether low utilization scores are associated, in aggregate, with missing evidence, evidence integration, answer conversion, or structured\-output instability\.

Table 24\. Final 200\-Sample Failure\-Type Breakdown for Contextual Conditions\. Rates are percentages over 200 examples per row\. Loc\., Sel\., Int\., Conv\., Succ\., and Parse denote evidence localization failure, evidence selection failure, evidence integration failure, answer conversion failure, categorical success, and structured\-output parsing failure, respectively\. The success label is stricter than relaxed answer F1 and is used only as a diagnostic category\.

### A\.7\. Statistical Modeling of Diagnostic Effects

The preceding tables are descriptive summaries over fixed diagnostic conditions\. To ensure that the main conclusions are not driven by unpaired table comparisons, we add a statistical support layer that respects the repeated\-measures structure of the evaluation\. The analysis is not used to select prompts, models, datasets, or retrieval settings; it is a post\-hoc audit of the fixed released artifacts\.

Table [25](https://arxiv.org/html/2606.06758#A1.T25) reports paired effect\-size estimates for the central claims\. In the controlled 200\-sample matrix, retrieved evidence outperforms full\-context input for all three models, with paired relaxed\-F1 gains of 0\.331 for Gemma3\-12B, 0\.438 for Qwen2\.5\-14B, and 0\.467 for Qwen3\-14B\. In HotpotQA\-ONCU\-200 and 2WikiMultiHopQA\-ONCU\-500, the sign reverses: full\-context input outperforms retrieved evidence, supporting the interpretation that realistic multi\-hop retrieved\-evidence failures are often retrieval\-coverage failures rather than reader\-side failures alone\. The same table also quantifies the controlled scaling effects\. Across all three models, the 4K–32K full\-context ONCU drop is large, and the 32K final\-decile advantage over early and middle deciles remains large after paired resampling\.

Table 25\. Statistical Support for Main Diagnostic Effects\. Estimates are paired mean differences unless otherwise noted\. Confidence intervals are paired bootstrap 95% intervals\. $p_{\mathrm{Holm}}$ controls family\-wise error within the stated analysis family\.

Table [26](https://arxiv.org/html/2606.06758#A1.T26) reports regression\-style checks\. The controlled scaling regressions are fit over aggregated length–position ONCU cells\. Their purpose is to test whether the length and position patterns remain visible when summarized as continuous effects rather than as heatmaps\. The log\-context\-length coefficient is negative for all three models, while the position\-fraction coefficient is positive, indicating lower ONCU at longer contexts and stronger utilization near later evidence positions\. The retriever\-family regression supports the retrieval\-ablation interpretation: larger retrieval budgets increase full\-chain coverage, while retriever\-family effects differ by dataset\. These regressions are diagnostic support for effect direction and magnitude; they do not replace the paired contrasts or the released per\-sample and per\-group metrics\.

Table 26\. Regression\-Style Statistical Checks\. Controlled scaling rows are OLS diagnostics over aggregated length–position ONCU cells\. Retriever rows are OLS diagnostics over retrieval\-summary cells\. These models support effect direction and magnitude rather than replacing the paired analyses\.

Together, the paired contrasts, adjusted significance diagnostics, and regression\-style checks strengthen the main interpretation without changing its scope\. Controlled long\-context failures are statistically large full\-context utilization failures\. Realistic multi\-hop retrieve\-then\-read failures are statistically consistent with retrieval\-coverage limitations\. The controlled scaling extension is not merely a visual heatmap pattern: length and evidence position are associated with systematic variation in oracle\-referenced utilization across all three evaluated models\.
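The Holm adjustment behind $p_{\mathrm{Holm}}$ in Table 25 is the standard step\-down procedure, sketched below\. The function name and the input p\-values are hypothetical; which contrasts form each analysis family is defined by the paper's tables, not by this example\.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The rank-th smallest p-value is multiplied by (m - rank), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values for four contrasts in one analysis family.
raw_p = [0.010, 0.040, 0.030, 0.005]
print([round(p, 3) for p in holm_adjust(raw_p)])
```

Each adjusted value can be compared directly with the family\-wise significance level\.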

### A\.8\. HotpotQA Retrieval\-Budget Sensitivity

The lower retrieved\-evidence performance on HotpotQA\-ONCU\-200 can reflect retrieval budget as well as reader\-side utilization\. To separate these factors, we run a retrieval\-budget sensitivity analysis for Qwen2\.5\-14B and Qwen3\-14B on HotpotQA\-ONCU\-200, varying the deterministic lexical retrieval budget from top\-$k=3$ to top\-$k=5$ and top\-$k=8$ while keeping the dataset, output contract, decoding policy, and evaluation pipeline fixed\. The top\-$k=3$ setting corresponds to the main final\-matrix configuration\.

Table [27](https://arxiv.org/html/2606.06758#A1.T27) shows that increasing the lexical retrieval budget partially improves retrieved\-evidence performance, but does not close the gap to full\-context ONCU\. For Qwen2\.5\-14B, moving from top\-$k=3$ to top\-$k=8$ increases retrieved\-evidence F1 from 0\.478 to 0\.520 and retrieved\-evidence ONCU from 0\.639 to 0\.723\. For Qwen3\-14B, the same change increases retrieved\-evidence F1 from 0\.469 to 0\.530 and retrieved\-evidence ONCU from 0\.557 to 0\.634\. These improvements indicate that the HotpotQA retrieved\-evidence bottleneck is partly related to evidence coverage\.

However, the retrieved\-evidence condition remains below the corresponding full\-context ONCU values from Table [7](https://arxiv.org/html/2606.06758#S8.T7)\. For Qwen2\.5\-14B, top\-$k=8$ retrieved\-evidence ONCU is 0\.723, compared with full\-context ONCU of 0\.906\. For Qwen3\-14B, top\-$k=8$ retrieved\-evidence ONCU is 0\.634, compared with full\-context ONCU of 0\.787\. Increasing the lexical top\-$k$ budget mitigates but does not eliminate the retrieved\-evidence bottleneck\. Realistic multi\-hop retrieval therefore requires better evidence ranking, multi\-hop coverage, and distractor control rather than simply retrieving more lexical chunks\.

Table 27\. HotpotQA Retrieval\-Budget Sensitivity for Qwen2\.5\-14B and Qwen3\-14B\. The table reports retrieved\-evidence condition performance on HotpotQA\-ONCU\-200 under different lexical top\-$k$ budgets\. ONCU is computed with relaxed answer F1\.

The ablation therefore refines the interpretation of the HotpotQA results\. Both models improve when the retrieval budget is expanded, showing that retrieval depth is a real protocol variable\. At the same time, the improvement is incomplete: the top\-$k=8$ retrieved\-evidence condition still falls short of full\-context ONCU\. The results point to a broader multi\-hop retrieval problem: the retriever must recover both supporting facts, rank them sufficiently high, and avoid adding distracting evidence that complicates downstream reasoning\. The full\-context condition remains stronger because it preserves all candidate supporting facts, even though it places a larger utilization burden on the model\.

### A\.9\. Retriever\-Family Retrieval Ablation

The preceding HotpotQA top\-$k$ analysis varies only the lexical retrieval budget\. The next protocol variable is retriever family\. We therefore run a retrieval\-only ablation on HotpotQA\-ONCU\-200 and 2WikiMultiHopQA\-ONCU\-500 without changing the materialized examples, passage identifiers, chunk size, overlap, or evidence\-overlap metrics\.

The evaluated retrieval families are lexical retrieval, dense sentence\-embedding retrieval, hybrid lexical–dense retrieval using Reciprocal Rank Fusion, a deterministic iterative query\-expansion baseline, and oracle retrieval\. The dense retriever uses a fixed off\-the\-shelf sentence\-embedding model and is not trained on either benchmark\. The hybrid retriever combines lexical and dense rankings using reciprocal ranks, avoiding score\-scale calibration between sparse and dense retrieval\. The iterative retriever is a deterministic expansion baseline rather than a learned multi\-hop retriever: it retrieves initial passages from the question, expands the retrieval query with selected retrieved text, and merges the ranked outputs\. The oracle condition is not a deployable retriever; it is included as an evidence\-chain coverage reference\.
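Reciprocal Rank Fusion, as used by the hybrid retriever, can be sketched in a few lines\. The smoothing constant `k=60` is the value commonly used in the literature and is an assumption here, as are the function name, the tie\-breaking rule, and the passage identifiers; the paper does not state its constant\.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of passage ids by summing 1 / (k + rank).

    rankings : list of ranked id lists, best first
    k        : smoothing constant (assumed; 60 is a common default)
    """
    scores = {}
    for ranking in rankings:
        for rank, passage_id in enumerate(ranking, start=1):
            scores[passage_id] = scores.get(passage_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id so the output is deterministic.
    return sorted(scores, key=lambda pid: (-scores[pid], pid))


# Hypothetical lexical and dense rankings over passage identifiers.
lexical = ["p1", "p2", "p3"]
dense = ["p3", "p1", "p4"]
print(reciprocal_rank_fusion([lexical, dense]))
```

Only ranks enter the fused score, which is why no calibration between sparse and dense score scales is needed\.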

Table [28](https://arxiv.org/html/2606.06758#A1.T28) reports the compact main\-paper view of the ablation for top\-$k=3$ and top\-$k=16$\. The complete top\-$k=3,5,8,16$ retrieval\-only outputs are released as CSV artifacts\. On HotpotQA\-ONCU\-200, lexical retrieval is competitive at small retrieval budgets: at top\-$k=3$, lexical retrieval obtains full\-chain coverage of 0\.540, compared with 0\.405 for dense retrieval, 0\.455 for hybrid retrieval, and 0\.430 for deterministic iterative retrieval\. At the larger top\-$k=16$ budget, hybrid retrieval obtains the highest non\-oracle full\-chain coverage, 0\.805, but also exposes the reader to a substantially higher distractor rate than the small\-budget settings\.

The 2WikiMultiHopQA\-ONCU\-500 pattern is different\. Dense retrieval substantially improves full\-chain coverage relative to lexical retrieval: dense top\-$k=3$ obtains full\-chain coverage of 0\.472 compared with 0\.368 for lexical top\-$k=3$, and dense top\-$k=16$ obtains 0\.756 compared with 0\.524 for lexical top\-$k=16$\. However, this coverage improvement comes with increased distractor exposure: dense top\-$k=16$ has a distractor identifier rate of 0\.511, compared with 0\.260 for lexical top\-$k=16$\. The oracle rows confirm that the evidence mappings themselves are not the limiting factor; oracle retrieval reaches complete or near\-complete full\-chain coverage once enough oracle passages are permitted\. Realistic retrieved\-evidence failures therefore reflect a trade\-off among evidence\-chain coverage, ranking quality, retrieval budget, and distractor exposure, rather than a single failure of lexical retrieval\.

Table 28\. Retriever\-Family Retrieval\-Only Ablation on HotpotQA\-ONCU\-200 and 2WikiMultiHopQA\-ONCU\-500\. Recall, F1, and full\-chain coverage are computed against oracle evidence identifiers\. Distr\. denotes the distractor identifier rate among retrieved passages\. Avg\. Pass\. is the average number of unique retrieved passages after chunk\-to\-passage aggregation\.

The ablation refines the interpretation of the earlier retrieved\-evidence results\. On HotpotQA, the main lexical retriever is a competitive small\-budget diagnostic condition: it is stronger than dense and hybrid retrieval at the main top\-$k=3$ setting and remains competitive at top\-$k=8$\. On 2WikiMultiHopQA, dense retrieval is clearly stronger than lexical retrieval in full\-chain coverage, showing that retriever family matters\. Across both datasets, increasing top\-$k$ improves recall and full\-chain coverage but also increases distractor exposure\. The main retrieved\-evidence bottleneck should be interpreted as a dataset\-dependent evidence\-chain retrieval problem rather than as a single failure mode of deterministic lexical retrieval\.
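The retrieval\-only diagnostics named in the Table 28 caption can be sketched per example as set comparisons against oracle evidence identifiers\. This is a hedged illustration: the function name and identifiers are hypothetical, and treating every retrieved non\-oracle identifier as a distractor is an assumption about how the distractor rate is defined\.

```python
def retrieval_diagnostics(retrieved_ids, oracle_ids):
    """Per-example recall, F1, full-chain coverage, and distractor rate."""
    retrieved, oracle = set(retrieved_ids), set(oracle_ids)
    hits = retrieved & oracle
    recall = len(hits) / len(oracle) if oracle else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall > 0
        else 0.0
    )
    return {
        "recall": recall,
        "f1": f1,
        # 1.0 only when every oracle passage in the chain is retrieved.
        "full_chain": 1.0 if oracle and oracle <= retrieved else 0.0,
        # Assumed definition: share of retrieved ids outside the oracle set.
        "distractor_rate": 1.0 - precision if retrieved else 0.0,
    }


# Two hypothetical two-hop examples at a budget of three passages.
complete = retrieval_diagnostics(["a", "b", "x"], ["a", "b"])
partial = retrieval_diagnostics(["a", "x", "y"], ["a", "b"])
chain_coverage = (complete["full_chain"] + partial["full_chain"]) / 2
print(complete, partial, chain_coverage)
```

The second example has nonzero recall but zero full\-chain coverage, which is the gap that matters for multi\-hop questions\.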

### A\.10\. Reader\-Facing Retriever\-Family Validation

The matched sensitivity experiment tests dense@16 and hybrid@16 under the full four\-condition protocol\. We additionally retain the broader reader\-facing validation because it answers a different question: across lexical, dense, and hybrid retrieval and across top\-$k\in\{3,8,16\}$, which retrieved input actually works best for the downstream reader? This validation passes retrieved contexts to Qwen2\.5\-14B, Qwen3\-14B, and Gemma3\-12B under the same answer contract and deterministic decoding policy\. It covers HotpotQA\-ONCU\-200 and 2WikiMultiHopQA\-ONCU\-500 and contains 18,900 reader predictions\.

This validation is reported with answer/evidence metrics rather than as a full ONCU matrix because it varies both retriever family and retrieval budget\. The matched ONCU sensitivity above supplies the complete four\-condition references for dense@16 and hybrid@16; this broader sweep supplies budget\- and family\-level reader\-facing context for interpreting those results\. We report answer F1, evidence F1, parse failures, and the corresponding retrieval\-only chain\-coverage and distractor diagnostics\.

Table [29](https://arxiv.org/html/2606.06758#A1.T29) summarizes the best reader\-facing retrieved setting for each model–dataset pair and compares it with the main lexical top\-$k=3$ retrieved setting\. The pattern is not a uniform victory for a single retriever\. On HotpotQA\-ONCU\-200, the best reader\-facing configuration is lexical top\-$k=16$ for Gemma3\-12B, hybrid top\-$k=16$ for Qwen2\.5\-14B, and dense top\-$k=16$ for Qwen3\-14B\. On 2WikiMultiHopQA\-ONCU\-500, dense retrieval is the best setting for Gemma3\-12B and Qwen2\.5\-14B, while hybrid retrieval is best for Qwen3\-14B\. In all six model–dataset pairs, the best setting uses top\-$k=16$, but the larger budget also increases distractor exposure\. Retrieval\-family improvements can transfer to reader\-side answer performance, but the transfer is conditional on the dataset, model, retrieval family, and budget\.

Table 29\. Reader\-Facing Retriever\-Family Validation\. Lexical@3 is the main retrieved\-evidence setting used in the core protocol\. Best setting is selected by relaxed answer F1 among lexical, dense, and hybrid retrieval at top\-$k\in\{3,8,16\}$\. Chain coverage and distractor rate are retrieval\-only diagnostics for the selected best setting\. This auxiliary validation is reported as a budget\-level reader\-facing sweep rather than as a complete ONCU matrix across all budgets\.

The reader\-facing validation changes the paper’s retrieval claim in an important way\. It supports the view that retrieval\-only coverage is an incomplete proxy for retrieve\-then\-read performance: dense retrieval is clearly useful on 2WikiMultiHopQA, but HotpotQA remains model\-dependent, and the largest answer gains appear only when the retrieval budget is expanded\. The result therefore strengthens the diagnostic framing rather than producing a simple recommendation to replace lexical retrieval with dense retrieval\. A retrieval system should be audited at both stages: whether it exposes the full evidence chain before reading, and whether the reader converts the exposed evidence into a correct answer\.

### A\.11\. Cross\-Encoder Reranking Audit

To test whether the retrieved\-context conclusion depends on the absence of a reranking stage, we add a two\-stage cross\-encoder reranking audit for all five evaluated local models\. The audit uses hybrid lexical–dense retrieval as a first\-stage candidate generator, followed by cross\-encoder/ms\-marco\-MiniLM\-L6\-v2 to rescore query–chunk pairs\. We instantiate seven reranked retrieved\-context variants: CE@32$\rightarrow\{8,16\}$, CE@64$\rightarrow\{8,16,24\}$, and CE@128$\rightarrow\{16,24\}$, where the first number is the first\-stage candidate pool and the second is the final reader budget\. For this reader\-facing reranking audit, ONCU is computed by joining each reranked retrieved\-condition prediction with the existing no\-evidence and oracle\-evidence references by sample identifier; the values are interpreted as sample\-level sensitivity diagnostics rather than replacements for the metadata\-group ONCU tables used in the main four\-condition matrix\.
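The pool\-then\-budget structure of a CE@$m\rightarrow k$ variant can be sketched independently of any particular model\. In the sketch below, `overlap_score` is a hypothetical token\-overlap stand\-in for the cross\-encoder and is not the scorer used in the audit; the function names, query, and chunks are likewise illustrative assumptions\. Only the list of seven \(pool, budget\) pairs comes from the text\.

```python
# The seven (candidate pool m, reader budget k) variants named in the text.
VARIANTS = [(32, 8), (32, 16), (64, 8), (64, 16), (64, 24), (128, 16), (128, 24)]


def overlap_score(query, chunk):
    """Hypothetical stand-in scorer: shared lowercase tokens.

    A real audit would call a cross-encoder on each (query, chunk) pair here.
    """
    return len(set(query.lower().split()) & set(chunk.lower().split()))


def rerank(query, first_stage, pool_size, reader_budget, score_pair):
    """Keep the top pool_size candidates, rescore them, return reader_budget."""
    pool = first_stage[:pool_size]
    # sorted() is stable, so ties keep their first-stage order.
    reranked = sorted(pool, key=lambda chunk: -score_pair(query, chunk))
    return reranked[:reader_budget]


# Toy first-stage ranking, best first, with a small pool and budget.
query = "who founded acme"
first_stage = [
    "weather report today",
    "acme was founded by jane",
    "acme sells tools",
    "jane founded acme in 1990",
]
print(rerank(query, first_stage, pool_size=3, reader_budget=2,
             score_pair=overlap_score))
```

In the toy run, the fourth chunk never reaches the reader because it falls outside the candidate pool, which illustrates why pool size and reader budget are varied separately\.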

Table [30](https://arxiv.org/html/2606.06758#A1.T30) reports the best answer\-F1 reranked setting for each model–dataset pair\. Reranking narrows retrieved\-context gaps in several rows, but it does not reduce retrieval\-augmented behavior to a single best setting\. On HotpotQA\-200, the selected rows vary across CE@64 and CE@128 candidate pools depending on the model\. On 2Wiki\-500, all five best answer\-F1 rows use CE@128$\rightarrow$24\. The ONCU\-best rows are not always identical to the answer\-F1\-best rows, reinforcing the diagnostic distinction among answer conversion, evidence overlap, and oracle\-referenced recovery\.

Table 30\. Five\-Model Cross\-Encoder Reranking Audit\. CE@$m\rightarrow k$ denotes hybrid first\-stage retrieval with $m$ candidates, cross\-encoder reranking, and final reader budget $k$\. The table reports the best answer\-F1 reranked setting for each model–dataset pair; ONCU is computed by joining each reranked retrieved prediction with the corresponding no\-evidence and oracle\-evidence references by sample identifier\.

### A\.12\. External BABILong\-200 Validation

To address whether the observed diagnostic patterns extend beyond the oracle\-compatible benchmark components, we add a BABILong\-200 external validation experiment\. BABILong is a reasoning\-in\-a\-haystack benchmark designed to test whether language models can use facts distributed across long documents \(kuratov2024babilong\)\. Because the current BABILong adapter does not provide oracle\-evidence annotations compatible with ONCU, this experiment is reported as answer\-performance validation rather than as an additional ONCU benchmark\.

The BABILong\-200 setting contains 200 examples constructed from four context configurations, 0k, 1k, 2k, and 4k, and five task types, qa1, qa2, qa3, qa6, and qa7, with 10 examples per task–configuration cell\. Each model is evaluated under three fixed conditions: no evidence, full context, and retrieved evidence\. The same deterministic decoding and lexical retrieval settings are used as in the main experiments\. Since oracle evidence is unavailable in the current adapter, evidence F1 and ONCU are not interpreted for this validation setting\.

Table [31](https://arxiv.org/html/2606.06758#A1.T31) reports relaxed answer F1 with 95% bootstrap confidence intervals over examples\. Across all three models, full\-context input obtains higher relaxed F1 than retrieved evidence\. Qwen2\.5\-14B obtains 0\.558 relaxed F1 under full context compared with 0\.517 under retrieved evidence; Qwen3\-14B obtains 0\.479 compared with 0\.454; and Gemma3\-12B obtains 0\.523 compared with 0\.493\. No parse failures occur in any BABILong run\. The no\-evidence condition is zero for Qwen2\.5\-14B and Qwen3\-14B and remains substantially lower than the contextual conditions for Gemma3\-12B, indicating that the selected BABILong examples generally require contextual information\.

These results provide external support for the full\-context\-over\-retrieved answer\-performance pattern\. In an independent reasoning\-in\-a\-haystack setting, deterministic lexical retrieval again appears to underperform the full\-context condition on answer F1\. However, because BABILong\-200 lacks oracle\-evidence annotations in the current adapter and the confidence intervals for full\-context and retrieved\-evidence performance overlap, the results should be interpreted as directional external validation rather than as an ONCU\-based or significance\-based claim\.

Table 31\. External BABILong\-200 Validation\. BABILong is reported as answer\-performance validation rather than as an ONCU benchmark because the current adapter does not provide oracle\-evidence annotations\. Bracketed values are 95% bootstrap confidence intervals for relaxed answer F1\.

### A\.13\. External RULER\-lite\-240 Validation

To complement BABILong with a synthetic long\-context external benchmark, we evaluate RULER\-lite\-240 under full\-context and retrieved\-context reading regimes\. RULER\-style tasks stress whether a model can recover and manipulate task\-relevant information from long inputs \(hsieh2024ruler\)\. This experiment is not reported as ONCU: the adapted setting does not define the no\-evidence and oracle\-evidence reference conditions required for oracle\-reference normalization\.

The set contains 240 examples from three task families: key–value retrieval, multi\-hop trace following, and aggregation by summation\. Each task family is evaluated at 4K, 8K, 16K, and 32K with 20 examples per task–length cell\. The retrieved\-context condition uses top\-$k$ lexical retrieval with $k=3$ over the same input context\. For Qwen3\-14B under Ollama, structured thinking is disabled during this short\-answer evaluation so that the generation budget is assigned to the final JSON response field rather than to the auxiliary thinking field\.
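To make the retrieved-context condition concrete, the sketch below selects the top three passages by a simple lexical score. The scoring function is an assumption: the text says only that retrieval is lexical and deterministic, so query-term frequency is used here as a stand-in, and `tokenize`, `top_k_lexical`, and the toy passages are hypothetical names and data rather than the paper's retriever.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase alphanumeric tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def top_k_lexical(query, passages, k=3):
    """Return indices of the k passages with the highest query-term count.

    Assumption: term-frequency overlap stands in for the unspecified
    lexical scorer used in the paper.
    """
    query_terms = set(tokenize(query))
    scored = []
    for idx, passage in enumerate(passages):
        counts = Counter(tokenize(passage))
        score = sum(counts[term] for term in query_terms)
        scored.append((score, idx))
    # Descending score; ties broken by original position for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]


# Toy passages, not benchmark data.
passages = [
    "The key alpha maps to value 17.",
    "Unrelated filler about weather.",
    "Key beta maps to value 42.",
    "More filler text here.",
    "The value for key gamma is 9.",
]
query = "What value does key beta map to?"
selected = top_k_lexical(query, passages, k=3)
print(selected)
```

Breaking ties by position keeps the selection deterministic, which matters when the same retrieval setting must be held fixed across models.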

Table [32](https://arxiv.org/html/2606.06758#A1.T32) reports exact match, answer F1, and parse\-failure rates\. Retrieved context improves exact match for all three models: Qwen2\.5\-14B improves from 0\.771 to 0\.871, Qwen3\-14B from 0\.758 to 0\.896, and Gemma3\-12B from 0\.521 to 0\.713\. Averaged across models, exact match increases from 0\.683 to 0\.826\. Because the full\-context condition also produces parse failures for Qwen2\.5\-14B and Gemma3\-12B, the table reports parse\-failure rates explicitly\. Treating parse failures as incorrect is the main evaluation convention\. If the denominator is restricted to parsed outputs only, full\-context exact match is approximately 0\.801 for Qwen2\.5\-14B, 0\.758 for Qwen3\-14B, and 0\.622 for Gemma3\-12B, while retrieved\-context exact match is unchanged because no retrieved\-context parse failures occur\. The retrieved\-context advantage therefore remains for all three models, although part of the raw full\-context deficit reflects output\-format instability under longer inputs\.
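The two denominators discussed above differ only in whether parse failures stay in the count. The sketch below computes both; the function name and the example counts are hypothetical and chosen for round arithmetic, not taken from Table 32.

```python
def exact_match_rates(n_total, n_correct, n_parse_fail):
    """Return (EM with parse failures as incorrect, EM over parsed outputs).

    A parse failure can never be a correct answer, so correct answers and
    parse failures together cannot exceed the total.
    """
    if n_correct + n_parse_fail > n_total:
        raise ValueError("correct answers plus parse failures exceed total")
    em_all = n_correct / n_total
    n_parsed = n_total - n_parse_fail
    em_parsed = n_correct / n_parsed if n_parsed else float("nan")
    return em_all, em_parsed


# Hypothetical counts for illustration only.
em_all, em_parsed = exact_match_rates(n_total=240, n_correct=180, n_parse_fail=15)
print(round(em_all, 3), round(em_parsed, 3))

# With zero parse failures the two conventions coincide.
em_all_clean, em_parsed_clean = exact_match_rates(240, 180, 0)
print(em_all_clean == em_parsed_clean)
```

The parsed-only rate can never be lower than the main-convention rate, which is why restricting the denominator narrows the full-context deficit without reversing it in the reported results.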

The result should be read as external answer\-performance validation, not as oracle\-referenced evidence utilization\. It strengthens a narrower conclusion: retrieved\-versus\-full\-context behavior is task\-structured\. Synthetic long\-context tasks can expose full\-context localization and formatting burdens, while realistic multi\-hop and reasoning\-in\-a\-haystack settings can expose retrieval\-chain coverage or compression bottlenecks\.

Table 32\. External RULER\-lite\-240 Validation\. RULER\-lite is reported as answer\-performance validation rather than as an ONCU benchmark because the adapted setting does not define the no\-evidence and oracle\-evidence reference conditions required for oracle\-reference normalization\. EM denotes exact match\.

## Appendix B\. JAIR Reproducibility Checklist

This appendix follows the JAIR reproducibility\-checklist structure for the present empirical diagnostic study\. The table gives a short answer and points to the manuscript or repository locations that substantiate the answer\. Statuses distinguish complete “Yes” entries from *Scoped Yes* entries, where the release is complete for the submitted protocol but not a general\-purpose re\-release of upstream datasets, and *Summary audit* entries, where the artifact supports verification through frozen outputs and summaries rather than rerunning every auxiliary analysis end to end\.

Table 33\. JAIR Reproducibility Checklist Responses\.

| Checklist item | Status | Location and details |
| --- | --- | --- |
| **All articles** | | |
| All claims investigated in this work are clearly stated\. | Yes | Abstract, Section 1, Section 11\. |
| Clear explanations are given for how the reported work substantiates the claims\. | Yes | Sections 3–9; Tables [4](https://arxiv.org/html/2606.06758#S8.T4), [7](https://arxiv.org/html/2606.06758#S8.T7), and the repository artifact map table \(`tab:repo_artifact_map`\)\. |
| Limitations or technical assumptions are stated clearly and explicitly\. | Yes | Sections 3\.3, 4, 8\.1, and 10\. |
| Conceptual outlines and important implementation details of introduced AI methods are provided\. | Yes | Sections 3–7; repository `longcue/` and `scripts/`\. |
| Motivation is provided for design choices, including algorithms, implementation choices, parameters, datasets, and experimental protocols beyond metrics\. | Yes | Sections 4–8 and the artifact manifest\. |
| **Theoretical contributions** | | |
| Does this paper make theoretical contributions? | Yes | The contribution is a diagnostic estimator/protocol, not a new learning algorithm\. |
| All assumptions and restrictions are stated clearly and formally\. | Yes | Sections 3–4\. |
| All novel claims are stated formally where appropriate\. | Yes | Section 1 states the diagnostic proposition; Equations [2](https://arxiv.org/html/2606.06758#S3.E2)–[3](https://arxiv.org/html/2606.06758#S3.E3) and Sections 4\.1–4\.7 formalize the estimator, validity condition, and interpretive boundary\. |
| Proofs of all non\-trivial formal claims are provided in sufficient detail\. | Yes | Section 4 provides the joint\-observability proof sketch, positive\-affine\-invariance derivation, denominator\-validity condition, and metric counterexamples; no theorem\-heavy algorithmic claim is made\. |
| Complex formalism is motivated and explained clearly\. | Yes | Sections 3–4\. |
| Mathematical notation and formalism serve clarity and precision\. | Yes | Sections 3–4 use only the notation needed for ONCU and validity conditions\. |
| Appropriate citations are given for non\-trivial theoretical tools and techniques\. | Yes | Sections 2 and 4\. |
| **Computational experiments** | | |
| Does this paper include computational experiments? | Yes | Sections 5–9\. |
| All source code required for conducting experiments is included in an online appendix or will be made publicly available\. | Scoped Yes | `longcue/`, `scripts/`, `configs/`, and `README_REPRODUCE.md` cover the submitted diagnostic protocol and released audit artifacts\. |
| The source code comes with a license that allows free usage for reproducibility purposes\. | Yes | Source code is released under the MIT License in the public `v1.0.1-jair` release\. |
| The source code comes with a license that allows free usage for research purposes in general\. | Yes | The repository includes an MIT `LICENSE` file for source code\. |
| Raw, unaggregated data from all experiments is included or will be made publicly available\. | Scoped Yes | Processed inputs, frozen outputs, per\-sample metrics, and summary files are mapped in the repository artifact map table \(`tab:repo_artifact_map`\); third\-party source corpora are referenced through their public upstream locations and terms\. |
| The unaggregated data comes with a license that allows free usage for reproducibility purposes\. | Scoped Yes | Project\-generated processed inputs, derived summaries, tables, figures, and supplementary materials are released for audit under the documented repository licenses; third\-party source datasets retain their upstream terms in `DATA_LICENSES.md`\. |
| The unaggregated data comes with a license that allows free usage for research purposes in general\. | Scoped Yes | Project\-generated artifacts are released under the documented repository licenses; third\-party benchmark resources are cited and governed by their public upstream licenses and terms listed in `DATA_LICENSES.md`\. |
| Random\-number generation and seeds are described sufficiently for replication\. | Yes | Sections 5, 7, 8; repository build scripts and README\. |
| The execution environment is described, including hardware/software where relevant\. | Scoped Yes | Sections 6 and 8, `README_REPRODUCE.md`, `RUNTIME_REPRODUCIBILITY_RECORD.md`, and `runtime_reproducibility_record.json` record deterministic controls, Python/package versions, PyTorch 2\.4\.1\+cu124, NVIDIA A40 hardware, driver 570\.211\.01, CUDA 12\.8, and model/configuration paths for the recorded local execution environment\. |
| Evaluation metrics are clearly explained and motivated\. | Yes | Section 7 and Appendix A\. |
| The number of algorithm runs used to compute each result is reported\. | Yes | Sections 5 and 8; table captions specify sample counts and prediction totals\. |
| Reported results have not been cherry\-picked by silently ignoring unsuccessful experiments\. | Summary audit | Section 8 reports parse failures, invalid groups, confidence intervals, and failure\-pattern audits; the release maps frozen outputs and summaries used to reproduce the reported tables\. |
| Analysis goes beyond single\-dimensional summaries and includes variation/confidence/distributional information\. | Summary audit | Bootstrap intervals, paired effect sizes, valid\-group audits, and failure\-pattern audits are reported in Sections 8 and Appendix A, with full auxiliary tables kept in the release artifacts\. |
| All hyperparameter settings are reported with rationale or selection method\. | Scoped Yes | Sections 6–8 and `configs/` report the fixed diagnostic settings and tested sensitivity settings used for the submitted protocol\. |
| The number and range of hyperparameter settings explored before final experiments are indicated\. | Scoped Yes | Sections 6–8 and `configs/` enumerate the fixed diagnostic settings and explored sensitivity ranges: retrieval budgets, retriever families, dense/hybrid@16 matched sensitivity, CE reranking candidate/final budgets, context lengths, evidence positions, and model\-family extension\. |
| Appropriate statistical tests are used in the presence of noise effects\. | Yes | Section 7\.4 and Appendix A\. |
| **Datasets** | | |
| Does this work rely on one or more datasets? | Yes | Section 5\. |
| All newly introduced datasets are included or will be made publicly available with long\-term accessibility\. | Scoped Yes | Controlled\-ONCU and processed ONCU inputs are included in the public `v1.0.1-jair` release; external benchmark\-derived inputs remain governed by upstream access terms\. |
| Newly introduced datasets come with a license for reproducibility purposes\. | Scoped Yes | Newly generated project documentation and supplementary materials are CC BY 4\.0; dataset\-derived artifacts are governed by `DATA_LICENSES.md`\. |
| Newly introduced datasets come with a license for research purposes in general\. | Scoped Yes | Newly generated project materials are CC BY 4\.0; external benchmark\-derived materials retain upstream terms\. |
| All datasets drawn from literature or public sources are accompanied by appropriate citations\. | Yes | Section 5 and References\. |
| All datasets drawn from existing literature are publicly available\. | Scoped Yes | HotpotQA, 2WikiMultiHopQA, BABILong, and RULER\-style resources are public subject to their providers’ terms; the release maps the processed inputs used in this study where redistribution is appropriate\. |
| All new datasets and non\-public datasets are described in detail, including statistics and construction/annotation process where relevant\. | Yes | Section 5 and the repository artifact map table \(`tab:repo_artifact_map`\)\. |
| All preprocessing, augmentation, batching, or splitting methods are described in detail\. | Scoped Yes | Sections 5–6 and repository build scripts describe the submitted protocol; frozen processed inputs are included or mapped so reviewers can audit the exact evaluated instances\. |
| **Brief explanation** | | |
| Explanation of partial or scope\-qualified answers\. | Yes | No item is left as an unexplained Partial\. *Scoped Yes* marks items complete for the submitted diagnostic protocol but bounded by upstream dataset licenses, recorded local execution, or release\-asset scope; *Summary audit* marks items supported by frozen outputs and summaries rather than by requiring every auxiliary run to be regenerated during review\. |
