Same Question, Different Source, Different Answer: Auditing Source-Dependence in Medical Multi-Source RAG

arXiv cs.CL Papers

Summary

This paper introduces a framework for auditing source-dependence in medical multi-source RAG systems, releasing the TransplantQA benchmark, HERO-QA retrieval strategy, and a structured-output judge to measure inter-source answer relationships. It demonstrates that better retrieval reveals more disagreement than previously estimated, and argues for shifting NLP evaluation from answer correctness to inter-source relationship analysis.

arXiv:2605.29084v1 Announce Type: new

Cached at: 05/29/26, 09:15 AM

# Same Question, Different Source, Different Answer: Auditing Source-Dependence in Medical Multi-Source RAG
Source: [https://arxiv.org/html/2605.29084](https://arxiv.org/html/2605.29084)
Yubo Li, Rema Padman, Ramayya Krishnan \(Carnegie Mellon University\), \{yubol, rpadman, rk2x\}@andrew\.cmu\.edu

###### Abstract

A retrieval\-augmented generation \(RAG\) system deployed over a multi\-author institutional corpus can give a different answer to the same question depending on which source it retrieves, a failure mode the dominant single\-gold\-answer paradigm cannot diagnose\. We argue that *source\-dependence* is a missing axis of NLP evaluation, and that auditing it means shifting the unit of evaluation from answer correctness to the *inter\-source relationship*\. We make this concrete in transplant patient education, where institutional sources demonstrably disagree, releasing three artefacts: TransplantQA, a benchmark of real patient questions, each answered by grounding generation in multiple institutional handbooks as candidate sources; HERO\-QA, a hierarchical retrieval strategy that grounds and audits each answer; and a structured\-output judge that scores inter\-source relationships on a validated 5\-label taxonomy\. At scale, better retrieval reveals far more disagreement than prior estimates suggested: those estimates understated its *prevalence*, not its intensity\. The framework is domain\-agnostic and transfers to legal and educational RAG: measuring source\-dependence is a responsibility for deployed multi\-source NLP generally\.


## 1 Introduction

A patient three months past a heart transplant types a question into an institutional Q&A system: *“When can I travel internationally again?”* \(Footnote 1: adapted from a real patient post on a transplant forum included in our benchmark\.\) Behind the system, a RAG pipeline retrieves passages from the patient\-education handbook of the institution that performed the surgery\. The answer is grounded, cited, and confidently delivered\. Had the same query been grounded in a peer institution’s handbook, the recommended waiting period might have been three, six, or twelve months, delivered with identical confidence and fluency and with no indication that the guidance is institution\-specific rather than universal\.

This kind of *inter\-source heterogeneity* is endemic to medical RAG\. Patient\-facing institutional documents reflect local protocols, editorial choices, and decades of accumulated risk\-management caution; they are not interchangeable\. Yet the dominant benchmarks for medical question answering, namely MedQA \(Jin et al\., [2021](https://arxiv.org/html/2605.29084#bib.bib1)\), MedMCQA \(Pal et al\., [2022](https://arxiv.org/html/2605.29084#bib.bib2)\), PubMedQA \(Jin et al\., [2019](https://arxiv.org/html/2605.29084#bib.bib3)\), and BioASQ \(Tsatsaronis et al\., [2015](https://arxiv.org/html/2605.29084#bib.bib4)\), assume one correct answer per question and cannot diagnose whether the answer a patient sees is contingent on which document the retriever happened to return\.

We argue this exposes a missing axis of NLP evaluation\. As RAG becomes deployed infrastructure over multi\-author institutional corpora \(in medicine, but equally in law and education\), the field needs to measure *source\-dependence*: whether the answer a user receives is contingent on which source the retriever happened to return\. We frame this as a new mission for evaluation research, and operationalise it by shifting the unit of analysis from single\-answer correctness to the *inter\-source relationship*: given the same question, what is the structured relationship between the answer a generator produces when grounded in document $A$ versus document $B$? This paper makes four contributions toward that shift, using transplant patient education as a case study in which institutional sources demonstrably disagree\.

1. An evaluation\-paradigm argument \(§[1](https://arxiv.org/html/2605.29084#S1), §[7](https://arxiv.org/html/2605.29084#S7)\): the single\-gold\-answer paradigm cannot diagnose source\-dependence, the dominant failure mode of deployed multi\-source RAG; closing the gap requires evaluating the inter\-source relationship, not refining single\-gold benchmarks\.
2. TransplantQA \(§[3](https://arxiv.org/html/2605.29084#S3)\): a benchmark operationalising this shift, comprising 1,115 real patient questions, each answered by grounding generation in 102 transplant patient\-education handbooks \(the candidate sources\) from 23 U\.S\. centers across five organ types, partitioned into a *general* subset \(answered by every handbook\) and an *organ\-specific* subset, enabling both full\-corpus and stratified inter\-source comparison\.
3. HERO\-QA \(§[4\.2](https://arxiv.org/html/2605.29084#S4.SS2)\): a hierarchical evidence retrieval and orchestration strategy for handbook\-grounded clinical QA, using full\-document context for short handbooks \(eliminating retrieval\-miss failures\) and section\-aware hierarchical retrieval with reranking for longer ones, with explicit retrieval metadata for grounding audit\.
4. Empirical characterization at scale \(§[6](https://arxiv.org/html/2605.29084#S6)\): the full output of a production run over the benchmark \(48,056 grounded answers, 5,730,465 pairwise comparisons\), released for reuse\. The inter\-source relationship is measured by a structured\-output judge \(the evaluation instrument; §[4\.3](https://arxiv.org/html/2605.29084#S4.SS3)\) validated against human annotators at $\kappa = 0.842$ \(§[5](https://arxiv.org/html/2605.29084#S5)\)\.

Our characterization also yields a methodological observation: comparing the reference run against an earlier 14B run with a lower\-capacity retriever, the average handbook absence rate drops 13\.6 pp while per\-pair divergence is essentially unchanged \(§[6\.4](https://arxiv.org/html/2605.29084#S6.SS4)\). Prior estimates therefore understated the *prevalence* of disagreement, not its *intensity*\. Crucially, the framework is not medicine\-specific: legal RAG \(retrieving over federal/state/circuit precedent\) and educational RAG \(retrieving over state\-stratified curriculum standards\) deploy over the same kind of multi\-source corpora and inherit the same blind spot, and the three components \(multi\-source benchmark, inter\-source taxonomy, structured\-output judge\) transfer directly to both \(§[7](https://arxiv.org/html/2605.29084#S7)\)\. Measuring source\-dependence is thus a mission for deployed multi\-source NLP broadly, not a medical\-domain convenience\.

## 2 Related Work

#### Medical QA benchmarks\.

Medical\-QA evaluation treats QA as single\-best\-answer prediction: MedQA \(Jin et al\., [2021](https://arxiv.org/html/2605.29084#bib.bib1)\), MedMCQA \(Pal et al\., [2022](https://arxiv.org/html/2605.29084#bib.bib2)\), PubMedQA \(Jin et al\., [2019](https://arxiv.org/html/2605.29084#bib.bib3)\), and BioASQ \(Tsatsaronis et al\., [2015](https://arxiv.org/html/2605.29084#bib.bib4)\) score against curated gold answers, and patient\-facing extensions \(Ben Abacha et al\., [2017](https://arxiv.org/html/2605.29084#bib.bib5); Zeng et al\., [2020](https://arxiv.org/html/2605.29084#bib.bib6); Singhal et al\., [2023](https://arxiv.org/html/2605.29084#bib.bib7)\) retain the single\-gold assumption\. TransplantQA instead makes the *relationship* between answers grounded in different documents the unit of analysis; to our knowledge no prior medical QA benchmark tests inter\-source heterogeneity at this scale\.

#### LLM\-as\-judge and cross\-document inconsistency\.

LLM\-as\-judge protocols \(Zheng et al\., [2023](https://arxiv.org/html/2605.29084#bib.bib8); Zhu et al\., [2025](https://arxiv.org/html/2605.29084#bib.bib11); Kim et al\., [2024](https://arxiv.org/html/2605.29084#bib.bib10); Liu et al\., [2023](https://arxiv.org/html/2605.29084#bib.bib9)\) typically return a single scalar or label; our judge instead co\-emits narrative metadata \(`divergence_topic`, `clinical_significance`\), enabling the taxonomy and severity analyses of §[6](https://arxiv.org/html/2605.29084#S6) at essentially unchanged per\-pair cost\. Separately, contradiction detection via NLI \(Schuster et al\., [2022](https://arxiv.org/html/2605.29084#bib.bib12)\), factuality decomposition \(Min et al\., [2023](https://arxiv.org/html/2605.29084#bib.bib13)\), and RAG\-hallucination evaluation \(Niu et al\., [2024](https://arxiv.org/html/2605.29084#bib.bib14)\) target a binary signal against a reference; we instead treat each answer as faithful to its source and ask whether two sources *themselves* agree, with a 5\-label taxonomy that surfaces Complementary/Divergent variation a binary lens misses\.

#### Institutional variation in medicine\.

Wennberg and Gittelsohn \([1973](https://arxiv.org/html/2605.29084#bib.bib15)\) documented small\-area variation in clinical practice unexplained by patient characteristics, launching a long literature on clinical\-practice variation\. Patient\-facing educational material is the visible boundary of this institutional variation; TransplantQA provides an NLP\-tractable instrument for measuring it\.

## 3 The TransplantQA Benchmark

TransplantQA pairs a corpus of patient\-education handbooks from U\.S\. transplant centers with a question set drawn from real patient information\-seeking behavior, so that a RAG system’s answer to any benchmark question can be grounded in \(and evaluated against\) multiple plausible institutional sources\. Unlike single\-gold medical QA benchmarks, the unit of analysis in TransplantQA is the inter\-source *relationship* between answers grounded in different documents\.

### 3\.1 Handbook Corpus

We collected 102 patient\-education handbooks from 23 major U\.S\. solid\-organ transplant centers, representing 16 of the 20 largest programs by procedure volume\. The corpus spans five organ types — heart \(26\), lung \(26\), kidney \(22\), liver \(17\), and pancreas \(11\) — and the contributing institutions are geographically distributed across the United States, comprising both large academic medical centers and community\-based transplant programs\. All documents were obtained as PDFs from institutional websites and patient education portals\.

Centers organize patient education differently: some provide separate documents for the pre\-transplant phase \(evaluation, listing, waiting\) and the post\-transplant phase \(recovery, medications, long\-term follow\-up\), while others issue a single combined handbook\. We treat each phase\-specific document as a distinct unit, yielding 37 pre\-transplant, 39 post\-transplant, and 26 combined handbooks\. Each is assigned an identifier encoding organ, institution, and care phase \(e\.g\., `heart_baylor_combined`\)\. Table [1](https://arxiv.org/html/2605.29084#S3.T1) summarizes the corpus\.

Table 1: TransplantQA handbook corpus by organ\. *Centers* is the number of distinct contributing institutions\.

### 3\.2 Question Set

We curated 1,115 patient questions to serve as the evaluation set for cross\-center comparison \(Figure [1](https://arxiv.org/html/2605.29084#S3.F1)\)\. Questions were *harvested from real online transplant communities and platforms*, including patient forums and social media \(e\.g\., Reddit transplant subreddits, Mayo Clinic Connect, Inspire\), patient\-advocacy organizations \(National Kidney Foundation, American Liver Foundation\), and institutional Q&A pages, using transplant\- and symptom\-keyword search to surface genuine information needs\. The 3,000\+ harvested candidates were then \(i\) de\-duplicated \(cosine similarity $> 0.85$ plus manual review\), \(ii\) double\-checked for quality and relevance, and \(iii\) *anonymized and rephrased* to strip user\-identifying content and make each question self\-contained, yielding the released 1,115 \(mean length 23\.6 words\)\. Source breakdown and inclusion criteria are in Appendix [A](https://arxiv.org/html/2605.29084#A1)\.
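The de\-duplication step can be sketched as follows\. This is only an illustration under stated assumptions: the text gives the 0\.85 cosine threshold but not the vector representation, so the bag\-of\-words vectoriser and the function names below are hypothetical stand\-ins, and flagged pairs are returned for manual review instead of being dropped automatically\.

```python
import math
import re
from collections import Counter

SIMILARITY_THRESHOLD = 0.85  # threshold reported in the text


def bag_of_words(text: str) -> Counter:
    """Toy vectoriser (assumption): lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def flag_near_duplicates(questions: list[str]) -> list[tuple[int, int, float]]:
    """Return (i, j, similarity) for every pair above the threshold."""
    vectors = [bag_of_words(q) for q in questions]
    flagged = []
    for i in range(len(questions)):
        for j in range(i + 1, len(questions)):
            sim = cosine(vectors[i], vectors[j])
            if sim > SIMILARITY_THRESHOLD:
                flagged.append((i, j, sim))
    return flagged


candidates = [
    "When can I travel after transplant?",
    "When can I travel after my transplant?",
    "What foods should I avoid?",
]
for i, j, sim in flag_near_duplicates(candidates):
    print(f"review pair ({i}, {j}): cosine={sim:.3f}")
```

A dense sentence embedding would be a natural replacement for `bag_of_words`; the pairwise thresholding logic would stay the same\.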

![Refer to caption](https://arxiv.org/html/2605.29084v1/images/transplantQA.png)

Figure 1: TransplantQA construction\. Patient questions are harvested from real online transplant communities and platforms \(patient forums and social media, patient\-advocacy organizations, and institutional Q&A\) via transplant\- and symptom\-keyword search, then de\-duplicated, quality/relevance\-checked, and anonymized and rephrased to remove user\-identifying information, yielding 1,115 questions \(311 general, answered by every handbook, \+ 804 organ\-specific\), paired with 102 patient\-education handbooks from 23 U\.S\. centers across five organ types\.

Each question is annotated with: \(i\) an *organ\-type label* \(heart, kidney, liver, lung, pancreas, or *general*\); \(ii\) one or more clinical topic categories drawn from a 13\-topic taxonomy \(Appendix [B](https://arxiv.org/html/2605.29084#A2)\); and \(iii\) fine\-grained sub\-topic tags \(43 unique\)\. Questions are multi\-labeled to reflect cross\-cutting concerns\.

#### General vs\. organ\-specific split\.

A central design choice is the partition of the question set into a *general* subset \(311 questions, 27\.9%\) and an *organ\-specific* subset \(804 questions across five organ types\)\. General questions address topics relevant to all transplant recipients \(immunosuppressant side effects, reproductive health, mental health\) and are answered by *every* handbook in the corpus, producing $\binom{102}{2} = 5{,}151$ pairwise comparisons per question\. Organ\-specific questions are answered only by handbooks of the matching organ type, producing $\binom{N_o}{2}$ comparisons, where $N_o \in \{11, 17, 22, 26, 26\}$\. The two subsets together support both full\-corpus and stratified inter\-source analyses\.
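The per\-question comparison counts follow directly from the handbook counts in §3\.1\. The short sketch below reproduces that arithmetic; the dictionary and function names are illustrative and not part of the released code\.

```python
from math import comb

# Handbook counts per organ, as reported in Section 3.1.
HANDBOOKS_PER_ORGAN = {"heart": 26, "lung": 26, "kidney": 22, "liver": 17, "pancreas": 11}
TOTAL_HANDBOOKS = sum(HANDBOOKS_PER_ORGAN.values())  # 102


def pairs_per_question(organ_label: str) -> int:
    """Number of unordered handbook pairs compared for one question."""
    if organ_label == "general":
        n = TOTAL_HANDBOOKS  # general questions are answered by every handbook
    else:
        n = HANDBOOKS_PER_ORGAN[organ_label]  # only matching-organ handbooks
    return comb(n, 2)


for label in ["general", *HANDBOOKS_PER_ORGAN]:
    print(f"{label:>8}: {pairs_per_question(label)} pairs per question")
```

A general question is therefore roughly 16 times as expensive to compare as a heart or lung question \(5,151 versus 325 pairs\), which is why the absence pre\-screen of §4\.3 matters for cost\.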

### 3\.3 Anonymization and Release

Because questions are harvested from public forums and social media, every released question was anonymized and rephrased to remove any user\-identifying content from the original post \(Appendix [A](https://arxiv.org/html/2605.29084#A1)\); the released benchmark also uses anonymized handbook identifiers\. Center names in handbook IDs are retained because transplant centers are public institutions and the analyses we enable are explicitly cross\-institutional\. Release\-location metadata is anonymized for review; the planned release package includes the benchmark, the raw handbook\-extraction output, the question annotations, and the full pairwise\-comparison outputs\. Original PDFs are not redistributed but are listed by URL for independent retrieval\. Appendix [C](https://arxiv.org/html/2605.29084#A3) provides a Datasheet\-style data card \(Gebru et al\., [2021](https://arxiv.org/html/2605.29084#bib.bib22)\)\.

## 4 Pipeline Architecture

Our pipeline is a three\-stage process that takes the benchmark question set and the handbook corpus as input and produces, for every benchmark question, a structured matrix of pairwise inter\-handbook relationships\. It runs on open\-weight LLMs \(Qwen3\-32B for both generation and judging in our reference run\) and is designed for resumable execution on heterogeneous SLURM clusters\. The methodological core of this section is *HERO\-QA*, the hierarchical evidence\-retrieval strategy used in Stage 2 \(§[4\.2](https://arxiv.org/html/2605.29084#S4.SS2), Figure [2](https://arxiv.org/html/2605.29084#S4.F2)\); the structured pairwise judge in Stage 3 \(§[4\.3](https://arxiv.org/html/2605.29084#S4.SS3)\) is the measurement instrument that operationalises the inter\-source evaluation\.

### 4\.1 Stage 1: Structured Extraction

Raw PDF handbooks are converted to structured JSON using LlamaParse \(LlamaIndex, [2024](https://arxiv.org/html/2605.29084#bib.bib21)\), preserving section headings, paragraph boundaries, and page metadata\. The per\-handbook output contains organ type, institution, care phase, source path, full text, and a section list with headings, body text, and page numbers\. This structure enables section\-aware chunking in Stage 2\. Extraction is idempotent\.

### 4\.2 Stage 2: HERO\-QA Retrieval\-Augmented Generation

HERO\-QA \(Hierarchical Evidence Retrieval and Orchestration for Handbook\-grounded clinical QA\) is the retrieval strategy used in Stage 2 \(Figure [2](https://arxiv.org/html/2605.29084#S4.F2)\)\. It is a recall\-first *multi\-layer* retrieval system designed for the institutional\-handbook setting, in which a query descends through a length\-routing gate, a hierarchical document model, four parallel first\-stage retrievers, rank fusion, cross\-encoder reranking, and parent\-section expansion\. Throughout, HERO\-QA exposes retrieval metadata \(which mode produced the context, which sections were touched\) so downstream evaluation can audit whether an answer was grounded in full\-document or retrieved evidence\.

![Refer to caption](https://arxiv.org/html/2605.29084v1/images/hero.png)

Figure 2: HERO\-QA: a multi\-layer retrieval system\. A query is routed by handbook length: short handbooks bypass retrieval and use full\-document context \(Route A\); long handbooks descend through a hierarchical document model \(document $\rightarrow$ sections $\rightarrow$ child chunks\), four parallel first\-stage retrievers \(dense FAISS, child BM25, section\-body navigation, title navigation\), RRF fusion, cross\-encoder reranking, and parent\-section expansion\. The top evidence grounds Qwen3\-32B generation; retrieval metadata is retained for audit, and a low\-evidence signal triggers full\-document fallback\.

Routing and document model \(Layers 0–1\)\. Short handbooks \(full text $\leq 80$k chars\) are passed in full and retrieval is skipped, eliminating retrieval misses for short documents\. Longer handbooks are decomposed into *parent sections* \(preserving headings/pages\) and overlapping *child chunks* \(160 words, 32\-word overlap, each prefixed with its parent heading\); this document $\rightarrow$ section $\rightarrow$ chunk hierarchy is the substrate for retrieval and expansion\.
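The routing gate and child\-chunk construction described above can be sketched as follows\. The three constants come from the text; everything else is an assumption made for illustration, including the function names, whitespace tokenisation, reading "80k chars" as 80,000 characters, and the `"heading: text"` prefix format\.

```python
FULL_DOC_CHAR_LIMIT = 80_000  # "80k chars" routing threshold (read as 80,000)
CHUNK_WORDS = 160             # child-chunk size from the text
OVERLAP_WORDS = 32            # overlap between consecutive child chunks


def route(full_text: str) -> str:
    """Layer 0: short handbooks skip retrieval and are passed in full."""
    return "full_document" if len(full_text) <= FULL_DOC_CHAR_LIMIT else "hierarchical"


def child_chunks(heading: str, body: str,
                 size: int = CHUNK_WORDS, overlap: int = OVERLAP_WORDS) -> list[str]:
    """Layer 1: split one parent section into overlapping, heading-prefixed chunks."""
    words = body.split()  # assumption: whitespace tokenisation
    stride = size - overlap
    chunks = []
    start = 0
    while start < len(words):
        window = words[start:start + size]
        chunks.append(f"{heading}: {' '.join(window)}")  # prefix format is hypothetical
        if start + size >= len(words):
            break  # the final window already reached the end of the section
        start += stride
    return chunks


section_body = " ".join(f"w{i}" for i in range(300))
for chunk in child_chunks("Travel after transplant", section_body):
    tokens = chunk.split(": ", 1)[1].split()
    print(tokens[0], "...", tokens[-1], f"({len(tokens)} words)")
```

With these settings consecutive chunks start 128 words apart, so a 300\-word section yields three chunks, the last one shorter than the rest\.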

Four parallel retrievers \+ fusion \+ rerank \(Layers 2–4\)\. Against the expanded query, HERO\-QA runs four first\-stage retrievers: dense child\-chunk retrieval \(FAISS \(Douze et al\., [2026](https://arxiv.org/html/2605.29084#bib.bib17)\) with `BAAI/bge-large-en-v1.5` \(Xiao et al\., [2024](https://arxiv.org/html/2605.29084#bib.bib18)\)\), sparse child\-chunk BM25 \(Robertson and Zaragoza, [2009](https://arxiv.org/html/2605.29084#bib.bib16)\), *section\-body navigation* \(BM25 over section text, hits mapped to child chunks\), and *title navigation* \(BM25 over section headings, catching topic matches when body wording differs\)\. The four rankings are combined by Reciprocal Rank Fusion \($k_{\mathrm{RRF}} = 60$; Cormack et al\., [2009](https://arxiv.org/html/2605.29084#bib.bib19); navigation signals down\-weighted\) and reranked with a MiniLM cross\-encoder \(Wang et al\., [2020](https://arxiv.org/html/2605.29084#bib.bib20)\)\.
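The fusion step can be illustrated with a minimal weighted Reciprocal Rank Fusion sketch\. The constant $k_{\mathrm{RRF}} = 60$ is from the text; the retriever names, the 0\.5 navigation weights, and the tie\-breaking rule are assumptions, since the text says only that navigation signals are down\-weighted\.

```python
from collections import defaultdict

K_RRF = 60  # fusion constant reported in the text

# Hypothetical weights: the text gives no values for the down-weighting.
RETRIEVER_WEIGHTS = {"dense": 1.0, "bm25": 1.0, "section_nav": 0.5, "title_nav": 0.5}


def rrf_fuse(rankings: dict[str, list[str]],
             weights: dict[str, float] = RETRIEVER_WEIGHTS,
             k: int = K_RRF) -> list[str]:
    """Fuse ranked chunk-id lists; each hit contributes weight / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for retriever, ranked_ids in rankings.items():
        weight = weights.get(retriever, 1.0)
        for rank, chunk_id in enumerate(ranked_ids, start=1):
            scores[chunk_id] += weight / (k + rank)
    # Sort by descending score; break ties by chunk id for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


example_rankings = {
    "dense": ["c1", "c2", "c3"],
    "bm25": ["c2", "c1"],
    "section_nav": ["c3", "c2"],
    "title_nav": ["c4"],
}
print(rrf_fuse(example_rankings))
```

In the pipeline described above, the fused list would then be passed to the cross\-encoder reranker; that step is omitted here because it depends on a trained model\.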

Parent\-section expansion \(Layer 5\)\. Top child chunks are expanded back to their parent sections plus immediate neighbours, so the generator receives coherent section\-level context; the top\-5 expanded passages form the evidence\. An evidence\-sufficiency check triggers full\-document fallback when retrieved evidence is weak\.

Answer generation\. For each \(question, handbook\) pair the retrieved passages are supplied to Qwen3\-32B at temperature 0 with a fixed prompt \(Appendix [D](https://arxiv.org/html/2605.29084#A4)\) instructing the model to \(a\) rely exclusively on the provided context, \(b\) return a standardized `NOT ADDRESSED` prefix when the handbook contains no relevant information rather than fabricate, and \(c\) cite the supporting section heading when one exists\. The stage produces 48,056 grounded answers in the reference run\.

### 4\.3 Stage 3: Structured Pairwise Judgment

#### Absence pre\-screen\.

Each answer is first screened for absence: a fast heuristic checks for the canonical `NOT ADDRESSED` prefix, and answers that escape the heuristic are passed to a binary classifier \(also Qwen3\-32B\) using a structured YES/NO prompt\. Absence is cached per \(handbook, question\) pair, so each handbook is screened once across all comparisons it participates in\. Any pair containing at least one absent answer is immediately assigned the Absent label, skipping the comparison call\.
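A minimal sketch of this pre\-screen is shown below\. The prefix string and the per\-\(handbook, question\) caching follow the text; the class and function names are hypothetical, and the `classifier` callable is a stand\-in for the Qwen3\-32B YES/NO call, which is not reproduced here\.

```python
from typing import Callable, Optional

ABSENT_PREFIX = "NOT ADDRESSED"  # canonical prefix from the generation prompt


class AbsenceScreen:
    """Heuristic-first absence check, cached per (handbook, question) pair."""

    def __init__(self, classifier: Callable[[str], bool]):
        # `classifier` is a hypothetical stand-in for the binary LLM call.
        self._classifier = classifier
        self._cache: dict[tuple[str, str], bool] = {}

    def is_absent(self, handbook_id: str, question_id: str, answer: str) -> bool:
        key = (handbook_id, question_id)
        if key not in self._cache:
            if answer.lstrip().startswith(ABSENT_PREFIX):
                self._cache[key] = True  # fast path, no model call
            else:
                self._cache[key] = self._classifier(answer)
        return self._cache[key]


def prescreen_pair(screen: AbsenceScreen, question_id: str,
                   a: tuple[str, str], b: tuple[str, str]) -> Optional[str]:
    """Return "Absent" if either answer is absent; None means the judge must be called.

    Each of `a` and `b` is a (handbook_id, answer_text) tuple.
    """
    absent_a = screen.is_absent(a[0], question_id, a[1])
    absent_b = screen.is_absent(b[0], question_id, b[1])
    return "Absent" if (absent_a or absent_b) else None
```

Because the cache key excludes the partner handbook, a handbook answering a general question is screened once even though it takes part in 101 comparisons for that question\.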

#### Five\-label taxonomy\.

For every pair of non\-absent answers, the judge classifies their relationship into one of five categories with operational definitions \(Table [2](https://arxiv.org/html/2605.29084#S4.T2)\)\. The taxonomy is designed to be \(a\) clinically interpretable, \(b\) jointly exhaustive over the relationships we observed during pilot annotation, and \(c\) ordered along a coverage–agreement axis from no information \(Absent\) through full alignment \(Consistent\), additive but compatible content \(Complementary\), substantive but bounded disagreement \(Divergent\), to outright opposition \(Contradictory\)\.

Table 2: Five\-label taxonomy for pairwise comparison of center\-specific answers\. Examples are drawn from the released benchmark\.

#### Structured output beyond the label\.

A standard LLM\-as\-judge protocol would return only the classification\. Our judge instead returns a structured JSON record per pair containing five fields:

1. `classification`: one of the five labels;
2. `reasoning`: a 2–3 sentence clinical justification;
3. `divergence_topic`: a short noun phrase naming the *locus* of disagreement \(emitted only when `classification` $\notin \{\textsc{Consistent}, \textsc{Absent}\}$\);
4. `clinical_significance` $\in \{\mathrm{low}, \mathrm{medium}, \mathrm{high}\}$: judge\-assessed severity \(emitted only for Divergent and Contradictory\);
5. `judge_metadata`: input/output token counts and decoding latency\.

The two narrative fields are the key methodological enabler of the downstream analyses described in §[6](https://arxiv.org/html/2605.29084#S6)\. Clustering 34,706 `divergence_topic` strings yields a 991\-node taxonomy of disagreement themes; the `clinical_significance` field permits stakes\-adjusted aggregation\. The judge prompt and the full output schema are in Appendix [E](https://arxiv.org/html/2605.29084#A5); inference is greedy \(temperature 0\) for reproducibility\.
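The five\-field record and its conditional\-emission rules can be sketched as a validated data class\. The field names and the emission conditions come from the list above; the exact label spellings, the use of `null` for omitted fields, and the helper names are assumptions, since the authoritative schema is in Appendix E and is not reproduced here\.

```python
import json
from dataclasses import dataclass, field
from typing import Optional

LABELS = ("Absent", "Consistent", "Complementary", "Divergent", "Contradictory")
SIGNIFICANCE = ("low", "medium", "high")


@dataclass
class JudgeRecord:
    classification: str
    reasoning: str
    divergence_topic: Optional[str] = None
    clinical_significance: Optional[str] = None
    judge_metadata: dict = field(default_factory=dict)

    def validate(self) -> None:
        """Enforce the conditional-emission rules described in the text."""
        if self.classification not in LABELS:
            raise ValueError(f"unknown label: {self.classification!r}")
        topic_expected = self.classification not in ("Consistent", "Absent")
        if (self.divergence_topic is not None) != topic_expected:
            raise ValueError("divergence_topic presence does not match the label")
        if self.classification in ("Divergent", "Contradictory"):
            if self.clinical_significance not in SIGNIFICANCE:
                raise ValueError("clinical_significance must be low, medium, or high")
        elif self.clinical_significance is not None:
            raise ValueError("clinical_significance is only emitted for disagreement")


def parse_record(raw: str) -> JudgeRecord:
    """Parse one JSON object emitted by the judge and validate it."""
    record = JudgeRecord(**json.loads(raw))
    record.validate()
    return record


def is_valid(raw: str) -> bool:
    """Convenience wrapper: True if `raw` parses and passes validation."""
    try:
        parse_record(raw)
        return True
    except (ValueError, TypeError):
        return False
```

Rejecting malformed records at parse time keeps schema violations out of the clustering and severity analyses downstream, which consume the two narrative fields directly\.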

#### Comparison matrix\.

For a question answered by $N$ handbooks, the $\binom{N}{2}$ pairwise records and the integer matrix $\mathbf{M} \in \{0, \ldots, 4\}^{N \times N}$ encoding the labels are written together as a single per\-question JSON file\. Diagonal entries are Consistent by convention\. Per\-question artefacts are independent and idempotent, enabling resume\-safe incremental execution\.
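A sketch of how such a matrix could be assembled from pairwise labels follows\. The integer code assigned to each label and the symmetric fill are assumptions: the text states only that entries lie in $\{0, \ldots, 4\}$ and that the diagonal is Consistent\.

```python
from itertools import combinations

# Hypothetical encoding that follows the coverage-agreement ordering of the taxonomy.
LABEL_TO_INT = {"Absent": 0, "Consistent": 1, "Complementary": 2,
                "Divergent": 3, "Contradictory": 4}


def build_matrix(handbook_ids: list[str],
                 pair_labels: dict[frozenset, str]) -> list[list[int]]:
    """Return the N x N label matrix for one question.

    `pair_labels` maps frozenset({id_a, id_b}) to the label of that pair.
    """
    index = {hid: i for i, hid in enumerate(handbook_ids)}
    n = len(handbook_ids)
    # Diagonal entries are Consistent by convention; off-diagonals are overwritten.
    matrix = [[LABEL_TO_INT["Consistent"]] * n for _ in range(n)]
    for a, b in combinations(handbook_ids, 2):
        code = LABEL_TO_INT[pair_labels[frozenset((a, b))]]
        matrix[index[a]][index[b]] = code
        matrix[index[b]][index[a]] = code  # assumption: the relation is symmetric
    return matrix


ids = ["h1", "h2", "h3"]
labels = {
    frozenset(("h1", "h2")): "Complementary",
    frozenset(("h1", "h3")): "Absent",
    frozenset(("h2", "h3")): "Divergent",
}
for row in build_matrix(ids, labels):
    print(row)
```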

### 4\.4 Implementation and Scale

The released pipeline runs over the full benchmark on a heterogeneous SLURM cluster \(PSC Bridges\-2, NVIDIA H100 80 GB\) with a sharded executor that splits the question set into 10 *general* shards and 10 *non\-general* shards per pipeline stage; each shard is resumable at the matrix\-file granularity for comparison and the question\-file granularity for generation\. The complete production run produces 48,056 answers \(Stage 2\) and 5,730,465 pairwise comparisons \(Stage 3\), of which 4,519,245 pre\-screen as Absent and 1,211,220 require an LLM\-judge call\. Total wall\-time and compute cost are reported in Appendix [F](https://arxiv.org/html/2605.29084#A6)\. To our knowledge this is the largest documented application of LLM\-as\-judge to a single medical heterogeneity benchmark\.
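One common way to obtain this kind of file\-granularity resumability is to skip work whose output file already exists and to publish new files atomically\. The sketch below illustrates that pattern only; the function names, the modulo shard assignment, and the file layout are assumptions and do not describe the released executor\.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Callable

N_SHARDS = 10  # shards per subset and stage, as stated in the text


def shard_of(question_index: int, n_shards: int = N_SHARDS) -> int:
    """Hypothetical shard assignment: round-robin by question index."""
    return question_index % n_shards


def write_once(path: Path, compute: Callable[[], dict]) -> bool:
    """Write compute() to `path` unless the artefact already exists.

    Returns True if the artefact was computed now, False if it was reused.
    """
    if path.exists():
        return False  # resume: a finished artefact is never recomputed
    payload = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary file in the same directory, then rename atomically,
    # so an interrupted job never leaves a half-written JSON file behind.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(payload, handle)
    os.replace(tmp_name, path)
    return True
```

A shard that is restarted after a time limit would call `write_once` for each of its questions and only pay for the ones still missing\.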

## 5 Validating the Evaluation Instrument

The structured\-output judge is the measurement instrument through which we read inter\-source relationships; its trustworthiness underwrites every finding in §[6](https://arxiv.org/html/2605.29084#S6)\. We validate it along two axes: agreement with human clinical annotators \(§[5\.1](https://arxiv.org/html/2605.29084#S5.SS1)\) and an ablation against the natural alternative protocol \(a label\-only judge followed by a post\-hoc extractor\), confirming that the structured single\-call design is required, not a convenience \(§[5\.2](https://arxiv.org/html/2605.29084#S5.SS2)\)\.

### 5\.1 Human–judge agreement

We validate the structured\-output judge against human annotators on a stratified sample of 200 pairwise records \(40 per non\-absent label, plus 40 Absent controls\); Contradictory is over\-sampled at 46% of all contradictions in the production run for power on the rare class\. Annotators see the original question and both handbook answers; the judge’s label, reasoning, divergence topic, and clinical\-significance rating are withheld\. Two annotators rate each pair following the operational definitions in Table [2](https://arxiv.org/html/2605.29084#S4.T2); protocol and rubric are in Appendix [I](https://arxiv.org/html/2605.29084#A9)\.

#### Results\.

Both annotators completed all 200 pairs\. Inter\-annotator agreement is Cohen’s $\kappa = 0.655$ \(raw agreement 73\.0%\), substantial under Landis–Koch\. The two annotators agreed on 146/200 pairs; we treat their joint\-agreed label as the human\-majority gold\. On those 146 pairs the judge agrees with the majority 87\.7% of the time, yielding judge\-vs\-majority $\kappa = 0.842$ \(almost perfect\), weighted F1 of 0\.876, and macro F1 of 0\.841\. Per\-label F1: Absent 1\.00, Contradictory 0\.99, Consistent 0\.83, Complementary 0\.70, Divergent 0\.69\.
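For readers unfamiliar with the statistic, Cohen’s $\kappa$ corrects raw agreement $p_o$ for the agreement $p_e$ expected by chance from each rater’s label frequencies: $\kappa = (p_o - p_e) / (1 - p_e)$\. The sketch below computes it from two label sequences; it is the standard unweighted formula, and the toy data are invented for illustration and are not the study’s annotations\.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Unweighted Cohen's kappa for two raters over the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Chance agreement: product of marginal label frequencies, summed over labels.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a.keys() | freq_b.keys()) / (n * n)
    if p_e == 1.0:
        return 1.0  # degenerate case: both raters used a single identical label
    return (p_o - p_e) / (1.0 - p_e)


# Invented toy labels (not the paper's data): 9 of 10 items agree.
toy_a = ["Consistent"] * 5 + ["Divergent"] * 5
toy_b = ["Consistent"] * 4 + ["Divergent"] * 6
print(round(cohens_kappa(toy_a, toy_b), 3))
```

In the toy example $p_o = 0.9$ and $p_e = 0.5$, so $\kappa = 0.8$; the gap between 90% raw agreement and $\kappa = 0.8$ mirrors the gap between the 73\.0% raw agreement and $\kappa = 0.655$ reported above\.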

#### Failure\-mode taxonomy\.

Of 18 judge errors against the majority, 14 \(78%\) involve the Complementary label and its two neighbouring boundaries: 8 cases where the majority calls Complementary but the judge calls Divergent, and 6 where the majority calls Complementary but the judge calls Consistent\. The judge’s discrimination is robust at the extremes \(presence/absence; flat contradictions\) but soft in the middle of the coverage–agreement axis, consistent with the taxonomy’s design intent that Complementary sits between Consistent and Divergent\.

#### Clinical significance\.

On 49 paired Divergent/Contradictory pairs where all three raters \(the judge and annotators A and B\) rated significance, judge\-vs\-human $\kappa = 0.385$, fair but not strong\. The judge’s grades are directionally correct \(no systematic *low*/*high* flips\) but the fine\-grained gradations should be treated as a population\-level signal, not a per\-pair adjudication\.

### 5\.2 Structured vs\. label\-only judge: an ablation

A natural alternative to our structured single\-call judge is a label\-only judge followed by a post\-hoc extractor that conditions on \(question, answer$_a$, answer$_b$, label\) to recover `divergence_topic` and `clinical_significance` in a second call\. We test the two protocols \(Condition A: structured single\-call, ours; Condition B: label\-only \+ post\-hoc\) on the same 200\-pair sample\. Three findings emerge\. \(i\) Categorical agreement is $\kappa = 0.669$, but the disagreement concentrates on the most consequential class: of A’s 40 Divergent pairs, B agrees on only 4 and downgrades 31 \(78%\) to Complementary\. \(ii\) Clinical significance is unrecoverable post\-hoc: on $n = 44$ paired Divergent/Contradictory pairs B returns *high* for all 44 \($\kappa = 0$ against A’s mixed *high*/*medium*\)\. \(iii\) Topic strings on agreed\-label pairs are semantically equivalent and cluster identically under the §[6](https://arxiv.org/html/2605.29084#S6) pipeline\. Condition B is $\approx 5$–$6\times$ faster per pair but loses the Divergent/Complementary discrimination and severity gradation\. Structured single\-call output is therefore a design requirement of the framework, not a convenience\.

## 6 Benchmark Characterization

We apply our pipeline to the full TransplantQA benchmark using Qwen3\-32B as both generator and judge, reporting global and stratified label distributions, the per\-organ heterogeneity profile, and a system\-level comparison\.

### 6\.1 Global Label Distribution

Of the 5,730,465 pairwise comparisons, 4,519,245 \(78\.9%\) pre\-screen as Absent because at least one handbook returned `NOT ADDRESSED`\. Of the remaining 1,211,220 LLM\-judged pairs, Complementary dominates \(75\.4%\), followed by Divergent \(12\.9%\), Consistent \(7\.1%\), and Contradictory \($<0.1\%$\)\. Explicit contradiction is therefore rare; the dominant mode of disagreement is two centers covering different aspects of the same question \(Complementary\) or giving substantively different recommendations \(Divergent\)\.

### 6\.2 Per\-Organ Heterogeneity

Table [3](https://arxiv.org/html/2605.29084#S6.T3) reports per\-organ rates: the absence rate $r_{\mathrm{abs}}$, the per\-pair divergence rate $R_{\mathrm{div}}$ \(fraction of non\-absent pairs labelled Divergent or Contradictory\), the per\-pair consistency rate $R_{\mathrm{con}}$, and the proportion of questions in each organ for which at least one pair is divergent \($\mathrm{pct}_{\mathrm{any\,div}}$\)\.

Table 3: Per\-organ heterogeneity rates from our reference production run\. Per\-pair rates are averaged over non\-absent pairs\.

Absence dominates across all organs \(60–78%\): even within the matching\-organ subsets, the average handbook addresses only one third to half of relevant patient questions\. Per\-pair divergence rates cluster between 0\.14 and 0\.19, with pancreas and general questions sitting at the top of the range\. The prevalence metric $\mathrm{pct}_{\mathrm{any\,div}}$ exhibits broader spread \(30–56%\), reflecting that pancreas and liver questions are more often answered by a small subset of handbooks \(so even when divergence exists, it concentrates within a few questions\)\.
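The per\-pair and prevalence metrics defined in this subsection can be computed from per\-question label lists as sketched below\. Two choices are assumptions, since the text does not specify them: per\-pair rates are pooled over all non\-absent pairs \(as opposed to averaged per question\), and the function and key names are illustrative\. The absence rate is left out because its unit of aggregation is not stated here\.

```python
DIVERGENT_LABELS = {"Divergent", "Contradictory"}


def heterogeneity_rates(pair_labels_by_question: dict[str, list[str]]) -> dict[str, float]:
    """Compute R_div, R_con, and pct_any_div for one organ stratum.

    `pair_labels_by_question` maps a question id to the labels of all its pairs.
    """
    # Pool every non-absent pair across questions (assumption: pooled, not macro-averaged).
    pooled = [label
              for labels in pair_labels_by_question.values()
              for label in labels
              if label != "Absent"]
    n_pairs = len(pooled)
    r_div = sum(label in DIVERGENT_LABELS for label in pooled) / n_pairs if n_pairs else 0.0
    r_con = sum(label == "Consistent" for label in pooled) / n_pairs if n_pairs else 0.0

    # Prevalence: share of questions with at least one divergent or contradictory pair.
    n_questions = len(pair_labels_by_question)
    with_divergence = sum(
        any(label in DIVERGENT_LABELS for label in labels)
        for labels in pair_labels_by_question.values()
    )
    pct_any_div = with_divergence / n_questions if n_questions else 0.0
    return {"R_div": r_div, "R_con": r_con, "pct_any_div": pct_any_div}


toy = {
    "q1": ["Absent", "Complementary", "Divergent"],
    "q2": ["Consistent", "Consistent", "Absent"],
    "q3": ["Absent", "Absent", "Absent"],
}
print(heterogeneity_rates(toy))
```

The toy input shows why the two metrics can move independently: the per\-pair rate is computed over four non\-absent pairs, while the prevalence is computed over three questions, one of which has no non\-absent pairs at all\.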

### 6\.3 Per\-Handbook Coverage Spread

Per\-handbook absence rates span 0\.45 to 0\.99 \(mean 0\.74\), a $2\times$ spread between the most\-comprehensive and most\-silent handbooks\. The handbook $\times$ question\-organ heatmap \(Appendix Figure [3](https://arxiv.org/html/2605.29084#A7.F3)\) shows the expected block\-diagonal pattern but also systematic editorial differences: some handbooks are broadly comprehensive across all columns, while others are silent even within their own organ\.

### 6.4 System-Level Comparison: 14B Earlier Run vs. 32B Reference Run

A previous run over the same benchmark used a hybrid-retrieval pipeline with Qwen3-14B as both generator and judge. Comparing it to the 32B reference run (per-organ deltas in Appendix [H](https://arxiv.org/html/2605.29084#A8)) isolates the effect of the pipeline upgrade. Three observations stand out: (i) absence drops 12–19 pp across every organ (mean $\Delta r_{\mathrm{abs}} = -0.136$) as better retrieval surfaces passages the earlier pipeline missed; (ii) per-pair divergence rates are roughly unchanged or modestly lower (mean $\Delta R_{\mathrm{div}} = -0.031$; the stronger judge is not more aggressive); (iii) the proportion of questions showing *any* divergence rises substantially (mean $+15.9$ pp), driven mechanically by the absence drop. The per-pair rate reported by earlier baselines ($\approx 20\%$) is thus stable, but the *prevalence* of disagreement was substantially understated because absence was hiding it: stronger pipelines reveal latent disagreement rather than manufacturing it.
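
One way to see why prevalence can rise "mechanically" while the per-pair rate stays flat is a back-of-the-envelope model. The sketch below assumes that pairs for a question are independent, which real handbook pairs are not, so it illustrates the direction of the effect rather than reproducing the reported numbers. The function name `p_any_divergent`, the pair count of 10, and the before/after rates are all hypothetical; only the sizes of the two deltas are borrowed from the text.

```python
def p_any_divergent(r_abs, r_div, n_pairs):
    """Probability that at least one of n_pairs is divergent.

    Assumes independent pairs: a pair is divergent only if it is non-absent
    (probability 1 - r_abs) and then judged divergent (probability r_div).
    """
    p_pair = (1.0 - r_abs) * r_div
    return 1.0 - (1.0 - p_pair) ** n_pairs


# Hypothetical baseline, then apply the mean deltas from the text
# (absence -0.136, per-pair divergence -0.031).
before = p_any_divergent(r_abs=0.850, r_div=0.200, n_pairs=10)
after = p_any_divergent(r_abs=0.850 - 0.136, r_div=0.200 - 0.031, n_pairs=10)
print(f"before={before:.3f} after={after:.3f}")
```

In this toy setting the drop in absence outweighs the small drop in the per-pair rate, so the share of questions with any divergent pair goes up even though no individual comparison became more likely to be divergent.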

### 6.5 Downstream Uses Enabled by Structured Output

The two narrative fields support analyses that classifier-only judges cannot. Embedding the 16,113 unique `divergence_topic` strings and clustering them yields a 991-theme taxonomy of *what* sources disagree about (largest themes: post-transplant pregnancy timing, blood-test frequency, rejection symptoms, dental-care timing); the `clinical_significance` field permits severity-weighted re-aggregation, which empirically tracks unweighted disagreement frequency closely (Spearman $\rho > 0.99$ at the question, topic, and handbook levels) and is most useful for surfacing individual high-stakes pairs. These analyses are enabled by the structured judge output, not by the labels alone.
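
A severity-weighted re-aggregation of this kind might look like the sketch below. The weights in `SEVERITY_WEIGHT` are an assumption (the text does not state the weighting), the helper names are hypothetical, and the toy judgments are made up; the Spearman coefficient is computed by hand as the Pearson correlation of average ranks so the block stays dependency-free.

```python
def ranks(values):
    """Average 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


SEVERITY_WEIGHT = {"low": 1, "medium": 2, "high": 3}  # assumed weights


def aggregate(judgments):
    """Return topics with their unweighted counts and severity-weighted sums."""
    counts, weighted = {}, {}
    for j in judgments:
        sev = j.get("clinical_significance")
        if sev is None:  # non-divergent pairs carry no severity
            continue
        topic = j["divergence_topic"]
        counts[topic] = counts.get(topic, 0) + 1
        weighted[topic] = weighted.get(topic, 0) + SEVERITY_WEIGHT[sev]
    topics = sorted(counts)
    return topics, [counts[t] for t in topics], [weighted[t] for t in topics]


# Toy judge outputs (made up for illustration).
judgments = [
    {"divergence_topic": "topic_a", "clinical_significance": "low"},
    {"divergence_topic": "topic_a", "clinical_significance": "low"},
    {"divergence_topic": "topic_a", "clinical_significance": "high"},
    {"divergence_topic": "topic_b", "clinical_significance": "medium"},
    {"divergence_topic": "topic_c", "clinical_significance": "low"},
    {"divergence_topic": "topic_c", "clinical_significance": "high"},
    {"divergence_topic": None, "clinical_significance": None},
]
topics, counts, weighted = aggregate(judgments)
rho = spearman(counts, weighted)
print(topics, counts, weighted, rho)
```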

## 7 Discussion

#### Generalisation to non\-medical deployed RAG\.

The framework’s three slots (multi-source benchmark, inter-source taxonomy, structured-output judge) are domain-agnostic. *Legal RAG* (Westlaw AI, Lexis+ AI, Harvey) retrieves over jurisdictional layers and firm-specific research, yet single-gold benchmarks (LegalBench, LexGLUE) cannot surface whether a query grounded in California versus Texas precedent diverges in client-actionable ways. *Educational RAG* retrieves over state-stratified standards (Common Core, NGSS) and publisher-specific expositions, while ScienceQA/GSM8K cannot surface whether a student’s answer depends on which state’s materials were indexed. Each instantiates the same slots with a domain-appropriate taxonomy: this paper’s empirical contribution is medical; its methodological contribution is for deployed RAG generally.

#### Judge limitations\.

An LLM judge inherits known biases (Zheng et al., [2023](https://arxiv.org/html/2605.29084#bib.bib8); Kim et al., [2024](https://arxiv.org/html/2605.29084#bib.bib10)): *self-preference* when generator and judge share a family (pair-symmetric framing mitigates but does not eliminate this; §[5](https://arxiv.org/html/2605.29084#S5) measures $\kappa = 0.842$ agreement), *length/citation artefacts*, and *cost* (Appendix [F](https://arxiv.org/html/2605.29084#A6)).

## 8 Conclusion

We introduced TransplantQA, the HERO-QA retrieval system, and a structured-output LLM-as-judge as instruments for measuring inter-source heterogeneity in deployed medical RAG; all artefacts (48,056 answers, 5.73M pairwise comparisons, judge–majority $\kappa = 0.842$) are released. Empirically, prior estimates understated the *prevalence* of disagreement, not its intensity: absence was hiding it. Methodologically, structured single-call judging is a requirement, not a convenience: post-hoc extraction loses the Divergent/Complementary discrimination and severity gradation the framework depends on.

## Limitations

The empirical instantiation is confined to U.S. solid-organ transplant patient education (English, 2024–2025 snapshot); legal and educational transferability (§[7](https://arxiv.org/html/2605.29084#S7)) is conceptual. The judge is an LLM; the 200-pair validation measures population-level agreement but cannot detect sub-axis biases (institution, organ, answer length) (Zheng et al., [2023](https://arxiv.org/html/2605.29084#bib.bib8); Kim et al., [2024](https://arxiv.org/html/2605.29084#bib.bib10)); the released per-pair JSON preserves judge reasoning for individual-decision audit. Apparent inter-source divergence can also be inflated by retrieval failures rather than true disagreement; the absence pre-screen partially mitigates this.

## References

- Overview of the medical question answering task at TREC 2017 LiveQA. National Institute of Standards and Technology. External Links: [Document](https://dx.doi.org/10.6028/NIST.SP.500-324.qa-overview), [Link](https://doi.org/10.6028/NIST.SP.500-324.qa-overview). Cited by: [§2](https://arxiv.org/html/2605.29084#S2.SS0.SSS0.Px1.p1.1).
- G. V. Cormack, C. L. A. Clarke, and S. Büttcher (2009) Reciprocal rank fusion outperforms Condorcet and individual rank learning methods. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 758–759. External Links: [Document](https://dx.doi.org/10.1145/1571941.1572114), [Link](https://doi.org/10.1145/1571941.1572114). Cited by: [§4.2](https://arxiv.org/html/2605.29084#S4.SS2.p3.1).
- M. Douze, A. Guzhva, C. Deng, J. Johnson, G. Szilvasy, P. Mazaré, M. Lomeli, L. Hosseini, and H. Jégou (2026) The Faiss library. IEEE Transactions on Big Data. Note: Early access. External Links: [Document](https://dx.doi.org/10.1109/TBDATA.2025.3618474), [Link](https://doi.org/10.1109/TBDATA.2025.3618474). Cited by: [§4.2](https://arxiv.org/html/2605.29084#S4.SS2.p3.1).
- T. Gebru, J. Morgenstern, B. Vecchione, J. W. Vaughan, H. Wallach, H. Daumé III, and K. Crawford (2021) Datasheets for datasets. Communications of the ACM 64(12), pp. 86–92. External Links: [Document](https://dx.doi.org/10.1145/3458723), [Link](https://doi.org/10.1145/3458723). Cited by: [Appendix C](https://arxiv.org/html/2605.29084#A3.p1.1), [§3.3](https://arxiv.org/html/2605.29084#S3.SS3.p1.1).
- D. Jin, E. Pan, N. Oufattole, W. Weng, H. Fang, and P. Szolovits (2021) What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Applied Sciences 11(14), pp. 6421. External Links: [Document](https://dx.doi.org/10.3390/app11146421), [Link](https://doi.org/10.3390/app11146421). Cited by: [§1](https://arxiv.org/html/2605.29084#S1.p2.1), [§2](https://arxiv.org/html/2605.29084#S2.SS0.SSS0.Px1.p1.1).
- Q. Jin, B. Dhingra, Z. Liu, W. Cohen, and X. Lu (2019) PubMedQA: a dataset for biomedical research question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 2567–2577. External Links: [Document](https://dx.doi.org/10.18653/v1/D19-1259), [Link](https://aclanthology.org/D19-1259/). Cited by: [§1](https://arxiv.org/html/2605.29084#S1.p2.1), [§2](https://arxiv.org/html/2605.29084#S2.SS0.SSS0.Px1.p1.1).
- S. Kim, J. Suk, S. Longpre, B. Y. Lin, J. Shin, S. Welleck, G. Neubig, M. Lee, K. Lee, and M. Seo (2024) Prometheus 2: an open source language model specialized in evaluating other language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, Florida, USA, pp. 4334–4353. External Links: [Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.248), [Link](https://aclanthology.org/2024.emnlp-main.248/). Cited by: [§2](https://arxiv.org/html/2605.29084#S2.SS0.SSS0.Px2.p1.1), [§7](https://arxiv.org/html/2605.29084#S7.SS0.SSS0.Px2.p1.1), [Limitations](https://arxiv.org/html/2605.29084#Sx1.p1.1).
- Y. Liu, D. Iter, Y. Xu, S. Wang, R. Xu, and C. Zhu (2023) G-Eval: NLG evaluation using GPT-4 with better human alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, pp. 2511–2522. External Links: [Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.153), [Link](https://aclanthology.org/2023.emnlp-main.153/). Cited by: [§2](https://arxiv.org/html/2605.29084#S2.SS0.SSS0.Px2.p1.1).
- LlamaIndex (2024) LlamaParse: a document parsing service for structured PDF extraction. Note: [https://www.llamaindex.ai/llamaparse](https://www.llamaindex.ai/llamaparse). Accessed: 2026-05-26. Cited by: [§4.1](https://arxiv.org/html/2605.29084#S4.SS1.p1.1).
- S. Min, K. Krishna, X. Lyu, M. Lewis, W. Yih, P. Koh, M. Iyyer, L. Zettlemoyer, and H. Hajishirzi (2023) FActScore: fine-grained atomic evaluation of factual precision in long form text generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, pp. 12076–12100. External Links: [Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.741), [Link](https://aclanthology.org/2023.emnlp-main.741/). Cited by: [§2](https://arxiv.org/html/2605.29084#S2.SS0.SSS0.Px2.p1.1).
- C. Niu, Y. Wu, J. Zhu, S. Xu, K. Shum, R. Zhong, J. Song, and T. Zhang (2024) RAGTruth: a hallucination corpus for developing trustworthy retrieval-augmented language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, pp. 10862–10878. External Links: [Document](https://dx.doi.org/10.18653/v1/2024.acl-long.585), [Link](https://aclanthology.org/2024.acl-long.585/). Cited by: [§2](https://arxiv.org/html/2605.29084#S2.SS0.SSS0.Px2.p1.1).
- A. Pal, L. K. Umapathi, and M. Sankarasubbu (2022) MedMCQA: a large-scale multi-subject multi-choice dataset for medical domain question answering. In Proceedings of the Conference on Health, Inference, and Learning, Proceedings of Machine Learning Research, Vol. 174, pp. 248–260. External Links: [Link](https://proceedings.mlr.press/v174/pal22a.html). Cited by: [§1](https://arxiv.org/html/2605.29084#S1.p2.1), [§2](https://arxiv.org/html/2605.29084#S2.SS0.SSS0.Px1.p1.1).
- S. Robertson and H. Zaragoza (2009) The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval 4(1–2), pp. 1–174. External Links: [Document](https://dx.doi.org/10.1561/1500000019), [Link](https://doi.org/10.1561/1500000019). Cited by: [§4.2](https://arxiv.org/html/2605.29084#S4.SS2.p3.1).
- T. Schuster, S. Chen, S. Buthpitiya, A. Fabrikant, and D. Metzler (2022) Stretching sentence-pair NLI models to reason over long documents and clusters. In Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, pp. 394–412. External Links: [Document](https://dx.doi.org/10.18653/v1/2022.findings-emnlp.28), [Link](https://aclanthology.org/2022.findings-emnlp.28/). Cited by: [§2](https://arxiv.org/html/2605.29084#S2.SS0.SSS0.Px2.p1.1).
- K. Singhal, S. Azizi, T. Tu, S. S. Mahdavi, J. Wei, H. W. Chung, N. Scales, A. Tanwani, H. Cole-Lewis, S. Pfohl, P. Payne, M. Seneviratne, P. Gamble, C. Kelly, A. Babiker, N. Schärli, A. Chowdhery, P. Mansfield, D. Demner-Fushman, B. Agüera y Arcas, D. Webster, G. S. Corrado, Y. Matias, K. Chou, J. Gottweis, N. Tomasev, Y. Liu, A. Rajkomar, J. Barral, C. Semturs, A. Karthikesalingam, and V. Natarajan (2023) Large language models encode clinical knowledge. Nature 620(7972), pp. 172–180. External Links: [Document](https://dx.doi.org/10.1038/s41586-023-06291-2), [Link](https://doi.org/10.1038/s41586-023-06291-2). Cited by: [§2](https://arxiv.org/html/2605.29084#S2.SS0.SSS0.Px1.p1.1).
- G. Tsatsaronis, G. Balikas, P. Malakasiotis, I. Partalas, M. Zschunke, M. R. Alvers, D. Weissenborn, A. Krithara, S. Petridis, D. Polychronopoulos, Y. Almirantis, J. Pavlopoulos, N. Baskiotis, P. Gallinari, T. Artiéres, A. N. Ngomo, N. Heino, E. Gaussier, L. Barrio-Alvers, M. Schroeder, I. Androutsopoulos, and G. Paliouras (2015) An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinformatics 16(1), pp. 138. External Links: [Document](https://dx.doi.org/10.1186/s12859-015-0564-6), [Link](https://doi.org/10.1186/s12859-015-0564-6). Cited by: [§1](https://arxiv.org/html/2605.29084#S1.p2.1), [§2](https://arxiv.org/html/2605.29084#S2.SS0.SSS0.Px1.p1.1).
- W. Wang, F. Wei, L. Dong, H. Bao, N. Yang, and M. Zhou (2020) MiniLM: deep self-attention distillation for task-agnostic compression of pre-trained transformers. In Advances in Neural Information Processing Systems, Vol. 33, pp. 5776–5788. External Links: [Link](https://proceedings.neurips.cc/paper_files/paper/2020/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html). Cited by: [§4.2](https://arxiv.org/html/2605.29084#S4.SS2.p3.1).
- J. Wennberg and A. Gittelsohn (1973) Small area variations in health care delivery. Science 182(4117), pp. 1102–1108. External Links: [Document](https://dx.doi.org/10.1126/science.182.4117.1102), [Link](https://doi.org/10.1126/science.182.4117.1102). Cited by: [§2](https://arxiv.org/html/2605.29084#S2.SS0.SSS0.Px3.p1.1).
- S. Xiao, Z. Liu, P. Zhang, N. Muennighoff, D. Lian, and J. Nie (2024) C-Pack: packed resources for general Chinese embeddings. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 641–649. External Links: [Document](https://dx.doi.org/10.1145/3626772.3657878), [Link](https://doi.org/10.1145/3626772.3657878). Cited by: [§4.2](https://arxiv.org/html/2605.29084#S4.SS2.p3.1).
- G. Zeng, W. Yang, Z. Ju, Y. Yang, S. Wang, R. Zhang, M. Zhou, J. Zeng, X. Dong, R. Zhang, H. Fang, P. Zhu, S. Chen, and P. Xie (2020) MedDialog: large-scale medical dialogue datasets. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 9241–9250. External Links: [Document](https://dx.doi.org/10.18653/v1/2020.emnlp-main.743), [Link](https://aclanthology.org/2020.emnlp-main.743/). Cited by: [§2](https://arxiv.org/html/2605.29084#S2.SS0.SSS0.Px1.p1.1).
- L. Zheng, W. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica (2023) Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems, Vol. 36, pp. 46595–46623. External Links: [Link](https://proceedings.neurips.cc/paper_files/paper/2023/hash/91f18a1287b398d378ef22505bf41832-Abstract-Datasets_and_Benchmarks.html). Cited by: [§2](https://arxiv.org/html/2605.29084#S2.SS0.SSS0.Px2.p1.1), [§7](https://arxiv.org/html/2605.29084#S7.SS0.SSS0.Px2.p1.1), [Limitations](https://arxiv.org/html/2605.29084#Sx1.p1.1).
- L. Zhu, X. Wang, and X. Wang (2025) JudgeLM: fine-tuned large language models are scalable judges. In The Thirteenth International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=xsELpEPn4A). Cited by: [§2](https://arxiv.org/html/2605.29084#S2.SS0.SSS0.Px2.p1.1).

## Appendix A Question sources and inclusion criteria

The 1,115 released questions were drawn from an initial pool of 3,000+ candidates collected from four families of public, patient-facing sources. Table [4](https://arxiv.org/html/2605.29084#A1.T4) reports the top-10 source names in the final benchmark.

Table 4: Top-10 source names for the released question set, by number of contributing questions.

Source families (final shares): institutional Q&A pages (31.2%), community forums such as Reddit and Mayo Clinic Connect (25.1%), patient-facing medical organizations (24.9%), and a long tail of government health agencies and patient advocacy sites (18.8%). 69.9% of questions are geolocated to the United States.

Collection and inclusion. Candidate questions were harvested from the source platforms above using transplant- and symptom-keyword search. A candidate was retained if it (a) was *relevant* to transplant patient education (excluding administrative or off-topic questions) and (b) was *non-duplicative* of an earlier-retained question (cosine deduplication at threshold 0.85 followed by manual review of near-duplicates). Every retained question was then *anonymized and rephrased* to (c) strip personally identifying information about the asker or named individuals and (d) make the question *self-contained* (interpretable without surrounding conversational context).
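
The cosine-deduplication step could be implemented as a greedy pass like the one below. This is a minimal sketch: the embedding model used for deduplication is not named in the text, so the vectors here are toy values, and the function name `deduplicate`, the `>=` comparison, and the "compare against earlier kept items" order are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def deduplicate(embeddings, threshold=0.85):
    """Keep an item unless it is too similar to an earlier kept item.

    Returns (kept indices, flagged (index, matched index) pairs). Flagged
    pairs correspond to the near-duplicates sent to manual review.
    """
    kept, flagged = [], []
    for i, vec in enumerate(embeddings):
        match = next(
            (k for k in kept if cosine(vec, embeddings[k]) >= threshold), None
        )
        if match is None:
            kept.append(i)
        else:
            flagged.append((i, match))
    return kept, flagged


# Toy 2-d "embeddings" (illustrative only).
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
kept, flagged = deduplicate(embeddings)
print(kept, flagged)
```

The pass is quadratic in the number of candidates, which is unproblematic at the scale of a few thousand questions.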

## Appendix B Topic taxonomy

Each question is annotated with one or more of 13 top-level topic categories. Table [5](https://arxiv.org/html/2605.29084#A2.T5) lists the categories and their share of the question set (multi-label, percentages can sum to >100%).

Table 5: 13-topic taxonomy for the question set. Multi-label.

A second tier of 43 fine-grained sub-topic tags refines these categories (e.g., *Medications → Tacrolimus interactions*; *Reproductive Health → Mycophenolate timing before pregnancy*). The sub-topic list is included in the released annotation file.

## Appendix C Data card (Datasheet for Datasets)

Following the recommendations of Gebru et al. ([2021](https://arxiv.org/html/2605.29084#bib.bib22)), we provide a structured data card.

Motivation. Created to enable evaluation of medical RAG systems on a corpus with genuine institutional heterogeneity, and to enable analysis of that heterogeneity itself.

Composition. 1,115 patient-derived questions; 102 transplant patient-education handbooks from 23 U.S. centres across 5 organ types; 48,056 grounded answers from the reference production run; 5,730,465 pairwise comparisons (1,211,220 LLM-judged, 4,519,245 absence-pre-screened); per-question matrices; per-shard summaries.

Collection. Questions collected over \[date range\] from public sources listed in Appendix [A](https://arxiv.org/html/2605.29084#A1). Handbooks downloaded as PDFs from public institutional websites in 2024–2025. No interaction with patients or clinicians for data collection.

Preprocessing. Questions lightly paraphrased for anonymisation and self-containment. Handbooks extracted from PDF via LlamaParse; chunked at section boundaries with 512-token sub-chunking. Answers and judgments produced by Qwen3-32B at temperature 0.

Uses. Intended for evaluating medical RAG systems’ behaviour under multi-source corpora, for measuring institutional heterogeneity in patient education, and as a benchmark for new LLM judges. Not intended for ranking individual transplant centres or for direct clinical decision support.

Distribution. Release-location metadata is anonymized for review. Original handbook PDFs are not redistributed but are listed by URL.

Maintenance. Maintained by the authors, with annual updates planned when new handbook revisions are detected.
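
The Preprocessing entry above mentions chunking at section boundaries with 512-token sub-chunking. A minimal sketch of that idea follows, under two assumptions that the source does not confirm: that the parsed handbook text marks sections with Markdown-style `#` headings, and that whitespace-separated words can stand in for tokens (a real pipeline would presumably count tokens with the model tokenizer). The function name `chunk_handbook` is hypothetical.

```python
import re

HEADING = re.compile(r"^#{1,6}\s+(.*)$")  # assumed Markdown-style section headings


def chunk_handbook(markdown_text, max_tokens=512):
    """Split text into sections, then into windows of at most max_tokens words."""
    sections, title, body = [], "", []
    for line in markdown_text.splitlines():
        m = HEADING.match(line)
        if m:
            if body:  # close the previous section
                sections.append((title, " ".join(body)))
            title, body = m.group(1).strip(), []
        elif line.strip():
            body.append(line.strip())
    if body:
        sections.append((title, " ".join(body)))

    chunks = []
    for title, text in sections:
        tokens = text.split()  # whitespace "tokens" as a stand-in
        for start in range(0, len(tokens), max_tokens):
            chunks.append(
                {"section": title, "text": " ".join(tokens[start:start + max_tokens])}
            )
    return chunks


doc = "# Medications\n" + " ".join(["w"] * 600) + "\n# Diet\nshort text"
chunks = chunk_handbook(doc)
print([(c["section"], len(c["text"].split())) for c in chunks])
```

Carrying the section title on every chunk is what would let a generator cite the supporting section heading, as the answer-generation prompt requires.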

## Appendix D Answer-generation prompt

The reference production run uses the HERO-QA system prompt (the `HERO_QA_SYSTEM_PROMPT` below) paired with the `USER_TEMPLATE` for evidence framing. The earlier hybrid-retrieval baseline used a comparable system prompt without the section-citation requirement.

> System: You are a clinical information assistant using HERO-QA evidence. Answer the patient’s question based ONLY on the provided handbook evidence from this specific transplant center. Follow these rules strictly:
>
> 1. If the evidence answers the question, give the answer using only that evidence.
> 2. Cite the supporting section heading, and page if provided. If pages are unknown, cite the section heading only.
> 3. If the evidence does not answer the question, respond exactly: "NOT ADDRESSED: This handbook does not contain information on this topic."
> 4. Do not use outside medical knowledge. Do not fill gaps with general transplant advice.
>
> User:
>
> \#\# Handbook Context
>
> {context}
>
> \#\# Patient Question
>
> {question}

Generation runs with greedy decoding (temperature 0) and `max_new_tokens=512`; `<think>...</think>` reasoning blocks are stripped before the answer is persisted.
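
Stripping the reasoning blocks can be done with a single regular expression. The sketch below is illustrative (the function name `strip_reasoning` is hypothetical) and only handles well-formed blocks; a block left unclosed because generation hit the token limit would need separate handling.

```python
import re

# Non-greedy match so that several blocks are removed independently;
# DOTALL lets "." span the newlines inside a reasoning block.
THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)


def strip_reasoning(raw):
    """Remove <think>...</think> blocks and surrounding whitespace."""
    return THINK_BLOCK.sub("", raw).strip()


raw = "<think>\nCheck the evidence first.\n</think>\n\nNOT ADDRESSED: This handbook does not contain information on this topic."
clean = strip_reasoning(raw)
print(clean)
```

Stripping before persistence matters for the pre-screen: a prefix check for `NOT ADDRESSED` would otherwise miss refusals that are preceded by a reasoning block.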

## Appendix E Judge prompt and output schema

Our judge uses two prompts: a binary absence-detection prompt (used only when the heuristic `NOT ADDRESSED` prefix is not detected) and the main comparison prompt.

#### Absence\-detection prompt\.

> You are a clinical information assistant. Read the following response that was generated from a transplant center handbook and determine whether it effectively states that the handbook does NOT contain information on the topic.
>
> Response: {answer}
>
> Does this response indicate the handbook does not address the question? Answer with exactly one word: YES or NO
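
The two-stage absence check (heuristic prefix first, LLM prompt only as a fallback) can be sketched as below. The callable `ask_llm` is a hypothetical stand-in for whatever client sends a prompt and returns the model's text, and `ABSENCE_PROMPT` is an abbreviated placeholder for the prompt above, not its exact wording.

```python
# Abbreviated placeholder for the absence-detection prompt shown above.
ABSENCE_PROMPT = (
    "Response: {answer}\n"
    "Does this response indicate the handbook does not address the question? "
    "Answer with exactly one word: YES or NO"
)


def is_absent(answer, ask_llm):
    """Return True if the answer is a refusal.

    `ask_llm` is a hypothetical callable: prompt string in, reply string out.
    It is only invoked when the cheap prefix heuristic does not fire.
    """
    if answer.lstrip().upper().startswith("NOT ADDRESSED"):
        return True
    reply = ask_llm(ABSENCE_PROMPT.format(answer=answer))
    return reply.strip().upper().startswith("YES")


# Stub model that records its calls, for illustration only.
calls = []


def stub_llm(prompt):
    calls.append(prompt)
    return "YES"


print(is_absent("NOT ADDRESSED: no information.", stub_llm), len(calls))
print(is_absent("The handbook is silent on this.", stub_llm), len(calls))
```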

#### Comparison prompt\.

> You are a clinical expert evaluating whether two transplant center handbooks give consistent guidance on the same patient question.
>
> \#\# Task
>
> Compare Answer A and Answer B and classify their relationship as exactly one of: ABSENT / CONSISTENT / COMPLEMENTARY / DIVERGENT / CONTRADICTORY
>
> \#\# Definitions
>
> - ABSENT: One or both answers indicate the handbook does not contain information on the topic, so no meaningful comparison can be made.
> - CONSISTENT: Both answers provide substantive clinical content and give the same clinical recommendation.
> - COMPLEMENTARY: Both answers provide substantive clinical content that is compatible, but they differ in level of detail.
> - DIVERGENT: Both answers provide substantive clinical content but differ in a clinically meaningful way (different thresholds, timelines, or recommendations that would lead to different patient behavior).
> - CONTRADICTORY: Both answers provide substantive clinical content that gives directly opposing guidance.
>
> IMPORTANT: If either answer states the handbook does not address the topic, or provides no substantive clinical content, you MUST classify the pair as ABSENT.
>
> \#\# Input
>
> Question: {question}
>
> Answer A ({center\_a}): {answer\_a}
>
> Answer B ({center\_b}): {answer\_b}
>
> \#\# Output (JSON only, no other text)
>
> {{ "classification": "\<label\>", "reasoning": "\<2-3 sentence clinical justification\>", "divergence\_topic": "\<specific sub-topic of divergence, if applicable, else null\>", "clinical\_significance": "\<low/medium/high if divergent or contradictory, else null\>" }}

#### Output schema and parsing\.

Judge outputs are parsed as JSON; if parsing fails, a fallback extractor scans the raw text for a recognised label and assigns the remaining fields to `null`. Across the 1,211,220 LLM-judged pairs in the reference run, JSON parsing succeeded on >99.5% of calls.
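
A parse-then-fallback routine of this shape might look as follows. It is a sketch, not the released parser: the function name `parse_judgment`, the choice to take the first recognised label in the raw text, and the use of word boundaries (so that "inconsistent" does not match CONSISTENT) are all assumptions.

```python
import json
import re

LABELS = ("ABSENT", "CONSISTENT", "COMPLEMENTARY", "DIVERGENT", "CONTRADICTORY")
FIELDS = ("classification", "reasoning", "divergence_topic", "clinical_significance")
LABEL_RE = re.compile(r"\b(" + "|".join(LABELS) + r")\b")


def parse_judgment(raw):
    """Parse a judge reply; fall back to label scanning if JSON parsing fails."""
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match:
        try:
            data = json.loads(match.group(0))
            label = str(data.get("classification", "")).upper()
            if label in LABELS:
                out = {f: data.get(f) for f in FIELDS}
                out["classification"] = label
                return out
        except json.JSONDecodeError:
            pass  # fall through to the label scan
    # Fallback: first recognised label in the text, other fields set to None.
    found = LABEL_RE.search(raw.upper())
    out = dict.fromkeys(FIELDS)
    out["classification"] = found.group(1) if found else None
    return out


ok = parse_judgment(
    '{"classification": "divergent", "reasoning": "r", '
    '"divergence_topic": "t", "clinical_significance": "high"}'
)
fallback = parse_judgment("I would call this pair COMPLEMENTARY overall.")
print(ok["classification"], fallback)
```

As the conclusion argues, the fallback path recovers only the label, so pairs that reach it lose the `divergence_topic` and `clinical_significance` fields that the downstream analyses rely on.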

## Appendix F Compute cost

Production runs used NVIDIA H100 80 GB GPUs on PSC Bridges-2 via a SLURM allocation. Wall-time figures aggregate the 20 generation shards and 20 comparison shards from the released reference run.

Table 6: Approximate compute cost of the reference production run.

At an indicative H100-80 GB cloud rate of \$3–4/hour, the total reference-run cost is approximately \$1.3K–\$1.8K. The pipeline is fully resumable: a stalled or pre-empted shard can be re-launched without recomputing its already-persisted per-question artefacts. Smaller domains (10–20 handbooks) are runnable on a single H100 in under 24 hours.

## Appendix G Per-handbook coverage heatmap

![Refer to caption](https://arxiv.org/html/2605.29084v1/images/handbook_by_organ_q.png)

Figure 3: Handbook $\times$ question-organ absence rate. Rows are the 102 handbooks (grouped and colour-coded by organ); columns are the six question-organ groups. Red = the handbook is silent on that organ’s questions. The block-diagonal structure reflects that organ-specific handbooks answer mainly their own-organ and general questions; rows that are pale across all columns (e.g., several Mayo Clinic, UChicago, Houston Methodist handbooks) are broadly comprehensive.

## Appendix H System-level delta, 14B vs. 32B

Table [7](https://arxiv.org/html/2605.29084#A8.T7) reports the per-organ deltas underlying the system-level comparison in §[6.4](https://arxiv.org/html/2605.29084#S6.SS4).

Table 7: System delta: 32B reference run (HERO-QA retrieval + 32B judge) minus 14B earlier run (hybrid retrieval + 14B judge). The pipeline upgrade systematically lowers absence without inflating per-pair divergence; instead, the *prevalence* of divergence rises.

## Appendix I Annotation protocol

The full validation protocol (sample design, annotator-facing rubric with operational tiebreakers, clinical-significance definitions, calibration plan, quality assurance, and scoring metrics) is provided as supplementary material under `drafts/annotation_study/PROTOCOL.md`. The 200-pair stratified sample (`sample_v1/annotation_sample_full.csv`), two shuffled annotator-facing packets (`packets/annotator_{A,B}.csv`), and the deterministic sampler (`src/analysis/build_annotation_sample.py`) are released alongside the benchmark for full reproducibility.
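
For readers unfamiliar with deterministic stratified sampling, the general idea can be sketched as below. This is not the released `build_annotation_sample.py`: the stratification variable (judge label), the equal per-stratum allocation, the record fields, and the function name `stratified_sample` are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(pairs, per_stratum, seed=0):
    """Draw up to per_stratum items from each label stratum, reproducibly.

    pairs: list of dicts with hypothetical keys 'pair_id' and 'label'.
    """
    rng = random.Random(seed)  # local generator, so the global state is untouched
    by_label = defaultdict(list)
    for p in pairs:
        by_label[p["label"]].append(p)

    sample = []
    for label in sorted(by_label):  # fixed stratum order
        pool = sorted(by_label[label], key=lambda p: p["pair_id"])  # fixed item order
        sample.extend(rng.sample(pool, min(per_stratum, len(pool))))
    return sample


# Toy population: 30 divergent, 30 consistent, 2 contradictory pairs.
pairs = (
    [{"pair_id": i, "label": "DIVERGENT"} for i in range(30)]
    + [{"pair_id": 100 + i, "label": "CONSISTENT"} for i in range(30)]
    + [{"pair_id": 200 + i, "label": "CONTRADICTORY"} for i in range(2)]
)
first = stratified_sample(pairs, per_stratum=5, seed=42)
second = stratified_sample(list(reversed(pairs)), per_stratum=5, seed=42)
print(len(first), first == second)
```

Sorting both the strata and the items before drawing makes the result depend only on the seed and the data, not on the order in which records happen to be loaded.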
