ClinicalBench: Stress-Testing Assertion-Aware Retrieval for Cross-Admission Clinical QA on MIMIC-IV
Summary
This paper introduces ClinicalBench and the EpiKG system, evaluating assertion-aware retrieval for clinical question answering on MIMIC-IV data across multiple LLMs. It demonstrates that handling negation and temporality in retrieval significantly improves performance over standard baselines.
# Stress-Testing Assertion-Aware Retrieval for Cross-Admission Clinical QA on MIMIC-IV
Source: [https://arxiv.org/html/2605.11143](https://arxiv.org/html/2605.11143)
Alex Stinard, MD Department of Clinical Sciences, College of Medicine University of Central Florida, Orlando, FL 32816 alex\.stinard@ucf\.edu
Preprint — arXiv version
Abstract
Objective. Reasoning benchmarks measure clinical performance on clean inputs. We evaluate the step before reasoning: retrieval over real EHR notes, where negation, temporality, and family-versus-patient attribution can flip a correct answer to a wrong one.
Materials and Methods. EpiKG carries an assertion label and a temporality tag with every fact in a patient knowledge graph, then routes retrieval by question intent. ClinicalBench is a 400-question test over 43 MIMIC-IV patients across 9 assertion-sensitive categories. A 7-condition ablation tests each piece of EpiKG across six LLMs (Opus 4.6, GPT-OSS 20B, MedGemma 27B, Gemma 4 31B, MedGemma 1.5 4B, Qwen 3.5 35B). Three physicians blindly adjudicated 100 paired items.
Results. Author-blind primary endpoint: leave-author-out paired exact McNemar on 50 items (Hird $\times$ Nadeem unanimous strict), $\Delta = +22.0\,\text{pp}$ $[+5.1\,\text{pp}, +31.5\,\text{pp}]$ (95% Newcombe CI), $p = 0.0192$. The architectural novelty is C2b (Contriever dense-RAG) $\to$ C4g\_kw (intent-aware KG-RAG) on the change-excluded $n = 362$ endpoint: $+8.84\,\text{pp}$ (paired McNemar $p = 1.79 \times 10^{-3}$); $+12.43\,\text{pp}$ under oracle intent. Sensitivity analyses: three-rater physician majority $+24.0\,\text{pp}$ ($p = 0.0075$; Fleiss’ $\kappa = 0.413$; subject to single-author circularity since the author is R1); deterministic keyword proxy $+39.5\,\text{pp}$ over LLM-alone (reproducibility tool, not a clinical correctness claim). The audit found 56% of auto-generated references defective.
Discussion. Across the six models, the gain shrinks as the LLM-alone baseline rises ($\beta = -1.123$, $r = -0.921$, $p = 0.009$). With $n = 6$ this looks more like regression to the mean than encoding substituting for model size. The author built the system, generated the initial gold standard, and performed the internal audit. The primary endpoint uses external physician ratings with the author left out.
Conclusion. Carrying assertion labels and routing by question intent improve cross-admission clinical QA across six LLMs. ClinicalBench and the evaluation artifacts are public.
## 1Background and Significance
Large language models match or exceed physician-level performance on medical licensing exams \[[1](https://arxiv.org/html/2605.11143#bib.bib1),[2](https://arxiv.org/html/2605.11143#bib.bib2),[3](https://arxiv.org/html/2605.11143#bib.bib3)\], and reasoning benchmarks like HealthBench Professional, MedQA, and USMLE-style items measure that last mile of clinical reasoning given clean vignettes. Real EHR use exposes a complementary, undermeasured layer: *retrieval faithfulness* on messy charts, where negation, temporal drift, source conflict, and semantic compression must be navigated before reasoning. The harder question is not whether AI can reason like a physician but whether it can read like one; physicians, of course, do both. A single sentence (“patient denies chest pain, sister had MI at 45, will consider statin if lipids remain elevated”) encodes negation, family attribution, hypothetical intent, and an implicit present condition. Clinical NLP detects these assertions accurately \[[4](https://arxiv.org/html/2605.11143#bib.bib4),[5](https://arxiv.org/html/2605.11143#bib.bib5)\], but RAG pipelines flatten the context, conflating “patient denies” with “patient has.” This is the *epistemic propagation gap*, part of a broader structural-representation gap (assertion typing, temporal indexing, and experiencer attribution preserved through to retrieval) that reasoning benchmarks do not probe.
To the best of current knowledge, no patient-level clinical KG-RAG system jointly preserves assertion state on graph edges and routes retrieval by question intent. OMOP excludes negated conditions from CONDITION\_OCCURRENCE \[[6](https://arxiv.org/html/2605.11143#bib.bib6)\], and FHIR provides verificationStatus for Condition resources only. Existing graph-augmented RAG systems, including GraphRAG \[[7](https://arxiv.org/html/2605.11143#bib.bib7)\], GFM-RAG \[[8](https://arxiv.org/html/2605.11143#bib.bib8)\], KARE \[[9](https://arxiv.org/html/2605.11143#bib.bib9)\], and Medical-Graph-RAG \[[10](https://arxiv.org/html/2605.11143#bib.bib10)\], build KGs that discard the metadata distinguishing “patient has diabetes” from “rule out diabetes.” A parallel *temporal integration gap* exists: clinical events admit bi-temporal storage (valid + transaction time, in the Snodgrass tradition; cf. Zep \[[11](https://arxiv.org/html/2605.11143#bib.bib11)\]) plus an NLP-asserted temporality label $\tau_a \in \{\text{Past}, \text{Current}, \text{Future}\}$, yet existing systems model at most a subset \[[12](https://arxiv.org/html/2605.11143#bib.bib12),[11](https://arxiv.org/html/2605.11143#bib.bib11)\].
The core empirical finding is interactional: assertion preservation alone does not improve aggregate accuracy unless retrieval is also routed by question type\. Three contributions are made:
1. ClinicalBench. A 400-question single-site, same-record stress test over 43 MIMIC-IV patients (convenience sample; 32 with two admissions, 11 single-admission) and 9 assertion-sensitive categories, exposing category $\times$ condition interactions that aggregate scores hide. It targets retrieval faithfulness on real charts rather than reasoning on clean vignettes, complementing exam-style benchmarks at a different layer. SliceBench, a small supporting case study on record complexity, is also introduced.
2. EpiKG and the epistemic propagation gap. The loss of assertion metadata across clinical NLP pipelines is formalized, an information-theoretic loss bound is derived (Section [3.3](https://arxiv.org/html/2605.11143#S3.SS3), Appendix [B](https://arxiv.org/html/2605.11143#A2)), and a patient-level clinical KG-RAG system is implemented that preserves assertion and temporal metadata while routing retrieval by question intent.
3. Author-blind primary endpoint and architectural novelty. The author-blind primary is a paired test: leave-author-out exact McNemar on $n = 50$ unanimous-strict items adjudicated by two external physicians, yielding $\Delta = +22.0\,\text{pp}$ (95% Newcombe CI $[+5.1, +31.5]$, $p = 0.0192$). The architectural novelty is the paired delta of intent-aware KG-RAG over a strong dense-RAG baseline (Contriever): C2b $\to$ C4g\_kw gives $+8.84\,\text{pp}$ on the change-excluded $n = 362$ endpoint (McNemar $p = 1.79 \times 10^{-3}$; oracle $+12.43\,\text{pp}$). Secondary sensitivities are demoted: three-rater majority $+24.0\,\text{pp}$ (single-author circularity since the author is one rater) and a deterministic reproducibility proxy (keyword evaluator) $+39.5\,\text{pp}$ (not a clinical-correctness claim). Cross-model convergence across $n = 6$ models is descriptive only: a linear regression relating the C1 baseline to the C1 $\to$ C4g\_oracle delta yields $\beta = -1.123$, $r = -0.921$, $p = 0.009$, consistent with regression to the mean rather than encoding substituting for parameter count.
The author designed the benchmark, built the system, and conducted the internal evaluation; this circularity is structurally mitigated by frozen evaluation artifacts, external physician evaluation, and cross-model replication, but readers should weight claims accordingly (Section [4.1](https://arxiv.org/html/2605.11143#S4.SS1.SSS0.Px4)).
Together these yield a benchmark-supported design hypothesis: preserve epistemic metadata, route retrieval by intent, and evaluate by category $\times$ condition interaction rather than aggregate score. The central research question (*when does structured epistemic context help, hurt, or break even?*) is answered interactionally on a single-site, in-distribution stress test designed for retrieval faithfulness rather than cross-site generalization.
### 1\.1Related Work
Prior work is organized along four axes (extended discussion and Table [6](https://arxiv.org/html/2605.11143#A4.T6) in Appendix [D.1](https://arxiv.org/html/2605.11143#A4.SS1)).
#### Clinical reasoning benchmarks\.
Reasoning evaluations are complementary. HealthBench Professional \[[13](https://arxiv.org/html/2605.11143#bib.bib13)\], MedQA \[[14](https://arxiv.org/html/2605.11143#bib.bib14)\], and MedPaLM 2 \[[1](https://arxiv.org/html/2605.11143#bib.bib1)\] score reasoning on vignettes with facts pre-supplied; EpiKG measures retrieval faithfulness on real longitudinal EHRs with dispersed facts and negation, temporal, and source ambiguity. The two probe different stages: the last mile (reasoning given clean inputs) versus the first mile (reading the right patient from messy charts).
#### Medical RAG and clinical QA\.
Graph-augmented retrieval is a leading paradigm: GraphRAG \[[7](https://arxiv.org/html/2605.11143#bib.bib7)\], GFM-RAG \[[8](https://arxiv.org/html/2605.11143#bib.bib8)\], and Medical-Graph-RAG \[[10](https://arxiv.org/html/2605.11143#bib.bib10)\] build population-level graphs but do not propagate note-derived assertion or temporal metadata (bi-temporal storage with NLP-asserted scope label, in our framing). Existing benchmarks such as MedPaLM 2 \[[1](https://arxiv.org/html/2605.11143#bib.bib1)\], MIRAGE \[[15](https://arxiv.org/html/2605.11143#bib.bib15)\], and emrQA \[[16](https://arxiv.org/html/2605.11143#bib.bib16)\] target factual recall or grounded retrieval, not assertion-faithful longitudinal QA over real EHRs.
#### Clinical KG construction\.
Multi-LLM KG-RAG \[[17](https://arxiv.org/html/2605.11143#bib.bib17)\], AutoRD \[[18](https://arxiv.org/html/2605.11143#bib.bib18)\], and RECAP-KG \[[19](https://arxiv.org/html/2605.11143#bib.bib19)\] apply LLMs to clinical KG construction but do not propagate assertion status into the final graph.
#### Assertion detection and temporal KGs\.
NegEx \[[20](https://arxiv.org/html/2605.11143#bib.bib20)\], ConText \[[21](https://arxiv.org/html/2605.11143#bib.bib21)\], and Gul et al. \[[5](https://arxiv.org/html/2605.11143#bib.bib5)\] treat assertion detection as terminal annotation; MedTKG \[[12](https://arxiv.org/html/2605.11143#bib.bib12)\] and Graphiti \[[11](https://arxiv.org/html/2605.11143#bib.bib11)\] implement temporal KGs but lack epistemic propagation. Two structural gaps emerge: an *epistemic propagation gap* (assertion labels are not persisted into KGs) and a *temporal integration gap* (temporal formalisms are annotation layers, not retrieval-participating edge attributes). EpiKG closes both by carrying assertion and temporal metadata as first-class properties through every pipeline stage.
## 2Objective
To evaluate whether preserving assertion and temporal metadata in a patient\-level clinical knowledge graph, then routing retrieval by question intent, improves cross\-admission clinical question answering over electronic health records\.
## 3Materials and Methods
### 3\.1Method Overview
EpiKG implements three ideas (Figure [1](https://arxiv.org/html/2605.11143#S3.F1)): (1) end-to-end epistemic preservation, carrying assertion labels through extraction, OMOP mapping, KG materialization, and retrieval; (2) bi-temporal edge storage (valid time, transaction time; in the Snodgrass tradition, cf. Graphiti \[[11](https://arxiv.org/html/2605.11143#bib.bib11)\]) plus an NLP-asserted temporality label $\tau_a \in \{\text{Past}, \text{Current}, \text{Future}\}$ derived from clinical-text scope (data-modeling clarification in Appendix [C.1](https://arxiv.org/html/2605.11143#A3.SS1)); and (3) intent-aware routing matching graph traversal to question type. The first two are infrastructure; the third is where the performance gain originates.

Figure 1: EpiKG system workflow with concrete data examples. Top: 9-stage pipeline from clinical note to answer, with the gold $\alpha$ ribbon tracing assertion preservation end-to-end. Middle: actual data at each stage, in which a discharge note with negation, family history, and conditional language is extracted into assertion-labeled mentions, materialized as KG edges with temporality, and filtered by intent-aware routing. Bottom: the four routing strategies with formal operations. The example shows how a Current\_State query filters out conditional edges while preserving confirmed medications. Alt text: Multi-row workflow diagram. The top row shows clinical note ingestion through extraction, OMOP mapping, graph construction, retrieval, and answer generation. A highlighted assertion label is preserved across stages. Middle panels show example note text, extracted mentions, graph edges, and routed evidence. The bottom row compares routing operations for default, change, current-state, and historical queries.
### 3\.2Epistemic Assertion Schema
Clinical notes contain qualified statements \(“no evidence of pneumonia,” “possible CHF,” “mother had breast cancer”\) that standard representations discard: OMOP excludes negated conditions\[[6](https://arxiv.org/html/2605.11143#bib.bib6)\]; FHIR limits assertion metadata\. EpiKG defines a seven\-value assertion taxonomy:
$$\alpha \in \{\text{Pres.}, \text{Abs.}, \text{Poss.}, \text{Cond.}, \text{Hypo.}, \text{Fam.Hx.}, \text{Hist.}\} \tag{1}$$

extending the i2b2 six-class taxonomy \[[4](https://arxiv.org/html/2605.11143#bib.bib4)\] by separating Historical from Family\_History (Appendix [Q](https://arxiv.org/html/2605.11143#A17)). A rule-based classifier (122 scope-aware trigger patterns) assigns $\alpha$ with a confidence score, propagated through every stage. Each edge carries bi-temporal metadata (valid + transaction time) plus an NLP-asserted temporality label $\tau_a$, with Allen-style interval relations stored as edge metadata (data-modeling clarification in Appendix [C.1](https://arxiv.org/html/2605.11143#A3.SS1)).
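A minimal sketch of how an edge carrying this metadata might be represented. Only the seven assertion values, the confidence score, the bi-temporal pair, and $\tau_a$ come from the text; the class names, field names, and the expansion of the abbreviated labels (following i2b2 naming) are illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional


class Assertion(Enum):
    """Seven-value assertion taxonomy of Eq. (1); spelled-out names are assumed."""
    PRESENT = "Present"
    ABSENT = "Absent"
    POSSIBLE = "Possible"
    CONDITIONAL = "Conditional"
    HYPOTHETICAL = "Hypothetical"
    FAMILY_HISTORY = "Family_History"
    HISTORICAL = "Historical"


class Temporality(Enum):
    """NLP-asserted temporality label tau_a."""
    PAST = "Past"
    CURRENT = "Current"
    FUTURE = "Future"


@dataclass(frozen=True)
class EpiEdge:
    # All field names are hypothetical, chosen for illustration only.
    hadm_id: int                     # admission the mention came from
    concept: str                     # e.g. a mapped concept name
    assertion: Assertion
    confidence: float                # classifier confidence in [0, 1]
    tau_a: Temporality
    valid_from: Optional[datetime]   # valid time start (None = unknown)
    valid_to: Optional[datetime]     # valid time end (None = still open)
    recorded_at: datetime            # transaction time


# "patient denies chest pain" becomes an Absent edge, not a dropped mention.
edge = EpiEdge(
    hadm_id=1, concept="chest pain", assertion=Assertion.ABSENT,
    confidence=0.9, tau_a=Temporality.CURRENT,
    valid_from=None, valid_to=None, recorded_at=datetime(2020, 1, 1),
)
```

Keeping the label on the edge itself, rather than in a side table, is what lets a later retrieval step filter on it without re-reading the note.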
### 3\.3Formal Epistemic Preservation
The epistemic invariant is formalized as a testable pipeline property (Appendix [B](https://arxiv.org/html/2605.11143#A2)). An assertion-blind pipeline collapses all labels to Present, reducing assertion entropy to zero \[[22](https://arxiv.org/html/2605.11143#bib.bib22)\]; its faithfulness bound for a concept $c$ is $1 - f_{\textit{np}}(c)$, where $f_{\textit{np}}(c)$ is the fraction of non-present mentions of $c$. This bound is substantially below 1 for concepts like pneumonia or diabetes. Empirical consequences are measured via category-stratified accuracy in Section [4](https://arxiv.org/html/2605.11143#S4).
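To make the bound concrete, the sketch below computes $f_{\textit{np}}$, the resulting bound, and the assertion entropy for one concept. It assumes the bound is read as "a pipeline that stores every mention as Present can be faithful on at most the truly present share"; the mention counts are invented for illustration and are not from the paper.

```python
import math
from collections import Counter


def non_present_fraction(labels):
    """f_np: share of mentions whose assertion label is not 'Present'."""
    if not labels:
        raise ValueError("need at least one mention")
    return sum(1 for a in labels if a != "Present") / len(labels)


def assertion_entropy(labels):
    """Shannon entropy (bits) of the assertion-label distribution."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


# Hypothetical mentions of a single concept (counts are made up).
mentions = ["Present"] * 4 + ["Absent"] * 5 + ["Possible"] * 1

f_np = non_present_fraction(mentions)
bound = 1 - f_np                       # faithfulness bound of a blind pipeline
blind = ["Present"] * len(mentions)    # what an assertion-blind pipeline stores
```

With these made-up counts the blind pipeline's label distribution has zero entropy, while the original mentions do not, which is the information loss the section formalizes.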
### 3\.4Intent\-Aware Retrieval \(C4g\)
The base retrieval pipeline uses bidirectional BFS over patient KG edges and OMOP vocabulary relationships \(Appendix[C\.2](https://arxiv.org/html/2605.11143#A3.SS2)\), but treats all questions uniformly\. Different clinical question types require fundamentally different graph operations:*change*requires cross\-admission set differencing,*current\-state*needs the most recent valid edges,*historical*must recover resolved conditions\.
#### Intent classifier\.
A rule-based classifier maps each question to Change, Current\_State, Historical, or Default (Algorithm [1](https://arxiv.org/html/2605.11143#algorithm1), Appendix [S](https://arxiv.org/html/2605.11143#A19)). Primary results use keyword-only classification (production-realistic); oracle classification with category metadata is an upper bound. Keyword classification reduces Opus C4g from 68.5% to 60.2% ($-8.3\,\text{pp}$; Section [4.1](https://arxiv.org/html/2605.11143#S4.SS1.SSS0.Px2)).
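A keyword-only intent classifier of this kind could look roughly like the sketch below. The trigger patterns and their priority order are hypothetical stand-ins; the paper's actual rules are those of its Algorithm 1 and are not reproduced here.

```python
import re

# Hypothetical trigger patterns, checked in order; first match wins.
INTENT_PATTERNS = [
    ("Change", re.compile(
        r"\b(chang\w*|new(ly)?|started|stopped|discontinued|differ\w*)\b", re.I)),
    ("Historical", re.compile(
        r"\b(history of|previous(ly)?|prior|past|resolved|ever)\b", re.I)),
    ("Current_State", re.compile(
        r"\b(current(ly)?|now|active|at present|still)\b", re.I)),
]


def classify_intent(question: str) -> str:
    """Return the first matching intent label, else 'Default'."""
    for intent, pattern in INTENT_PATTERNS:
        if pattern.search(question):
            return intent
    return "Default"
```

A question that matches no pattern falls through to Default, so a missed keyword degrades to uniform retrieval rather than to a wrong route.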
#### Routing strategies\.
Change partitions edges by hadm\_id and computes set differences across admission pairs. Current\_State filters edges to $\tau_a = \text{Current}$ or open validity. Historical selects $\tau_a = \text{Past}$ edges, augmented by admission-based inference (concepts in earlier but not the latest admission are labeled “resolved”). Each intent triggers a type-specific prompt template (Appendix [U](https://arxiv.org/html/2605.11143#A21)). Figure [2](https://arxiv.org/html/2605.11143#S3.F2) shows a historical question answered incorrectly under C1 (no evidence) and C4 (stale PRESENT label) but correctly under C4g’s temporal filtering.
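The three routing operations can be sketched over plain dictionaries as follows. The edge keys are hypothetical, and the sketch assumes admissions can be ordered by `hadm_id`, which is a simplification (a real pipeline would order by admission time).

```python
from collections import defaultdict


def route_edges(intent, edges):
    """Select one patient's KG edges according to question intent.

    `edges` is a list of dicts with hypothetical keys:
    'hadm_id', 'concept', 'tau_a' ('Past' | 'Current' | 'Future'), 'valid_to'.
    """
    by_adm = defaultdict(set)
    for e in edges:
        by_adm[e["hadm_id"]].add(e["concept"])
    admissions = sorted(by_adm)  # assumed chronological order

    if intent == "Change":
        # Set differences between consecutive admissions.
        return [
            {"from": prev, "to": curr,
             "added": by_adm[curr] - by_adm[prev],
             "removed": by_adm[prev] - by_adm[curr]}
            for prev, curr in zip(admissions, admissions[1:])
        ]

    if intent == "Current_State":
        # Keep edges asserted as current or whose validity is still open.
        return [e for e in edges
                if e["tau_a"] == "Current" or e["valid_to"] is None]

    if intent == "Historical":
        past = [e for e in edges if e["tau_a"] == "Past"]
        # Admission-based inference: seen earlier, absent from the latest.
        latest = by_adm[admissions[-1]] if admissions else set()
        resolved = [dict(e, status="resolved") for e in edges
                    if e["tau_a"] != "Past"
                    and e["hadm_id"] != admissions[-1]
                    and e["concept"] not in latest]
        return past + resolved

    return list(edges)  # Default: no intent-specific filtering
```

Note that the Current\_State filter, taken literally, still passes an edge whose label is a stale Current; the admission-based inference in the Historical branch is what recovers such a concept as resolved.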

Figure 2: Worked example with retrieved knowledge subgraphs. A Historical question is answered under three conditions. C1 (left) has no patient data. C4 (center) retrieves an unfiltered BFS subgraph where all edges are labeled Present, including a stale label for cholelithiasis (red). C4g (right) applies Historical routing, filtering to resolved status with temporal evidence from two admissions. Mini-graph insets show the actual retrieved subgraph structure. Alt text: Three side-by-side answer cards for one historical question. The left C1 card lacks patient evidence, the center C4 card retrieves a stale present edge, and the right C4g card filters by historical intent to recover resolved-status evidence across two admissions.
Existing clinical QA benchmarks evaluate factual recall (MedQA \[[14](https://arxiv.org/html/2605.11143#bib.bib14)\]) or agent task completion \[[23](https://arxiv.org/html/2605.11143#bib.bib23)\] but do not test epistemic qualifiers or cross-admission reasoning. ClinicalBench (unrelated to identically named benchmarks in other clinical NLP subfields) is a same-record retrieval-faithfulness stress test on MIMIC-IV \[[24](https://arxiv.org/html/2605.11143#bib.bib24)\]; SliceBench is a small supporting case study.
### 3\.5ClinicalBench: Assertion\-Sensitive Clinical QA
ClinicalBench comprises 400 questions over 43 MIMIC-IV patients across two tasks (A: 200 negation-aware retrieval; B: 200 temporal reasoning) and 9 categories (negation, conditional, uncertainty, family\_history, sequence, current\_state, duration, historical, change). The v2 reference set began as auto-generated NLP labels with 54 physician corrections and is provisional. Questions are authored from the same de-identified charts later ingested, making this an in-distribution retrieval-faithfulness stress test, not a test of cross-site generalization. Four ablation conditions progressively add components; two bookends (C6 all-notes, C7 deterministic-KG) probe design boundaries (Table [1](https://arxiv.org/html/2605.11143#S3.T1)).
#### Cohort structure\.
The cohort comprises 43 MIMIC\-IV patients selected as a convenience sample \(no formal stratification or random sampling\); 11 patients have a single hospital admission, 32 have two admissions, and 0 have three or more\. The ‘cross\-admission’ framing applies cleanly to the 32\-patient two\-admission subset; on the 11\-patient single\-admission subset, retrieval still operates over multiple notes within an admission\. Demographics in Appendix[Y](https://arxiv.org/html/2605.11143#A25)\.
#### Linguistic categories, not computable phenotypes\.
ClinicalBench’s 9 categories (negation, conditional, uncertainty, family\_history, sequence, current\_state, duration, historical, change) are linguistic constructs adapted from the i2b2 assertion taxonomy \[[4](https://arxiv.org/html/2605.11143#bib.bib4)\]. They are NOT computable phenotypes in the PheKB / eMERGE / OHDSI sense; the per-category C1 $\to$ C4g deltas should not be interpreted as phenotype-validation evidence. Phenotype-validatable evaluation requires OMOP/SNOMED concept-sets, multi-site PPV, and chart-review-validated cohorts (Newton et al., *JAMIA* 2013; Hripcsak & Ryan, *JBI* 2019), none of which this work provides.
#### Experiencer\-attribution defects\.
Post-hoc verification identified 8 items (qids in Appendix [P.2](https://arxiv.org/html/2605.11143#A16.SS2)) with experiencer-attribution defects: the source section is Family History, but the gold expected\_answer asserts the disease as a current or historical condition of the patient. These are excluded from the change-excluded keyword endpoint ($n = 362$, now reported as a sensitivity comparator; see Endpoints below), reducing the keyword delta from $+40.0\,\text{pp}$ to $+39.5\,\text{pp}$. The items remain in the released v2 gold for transparency; v3 corrections are planned post-publication.
Table 1: ClinicalBench conditions. C1–C4g: ablation ladder; C6/C7: bookend baselines; C1b/C4g+: extensions ($n = 240$).

| ID | Short Name | Condition | Retrieval | Assertion | Temporal |
|---|---|---|---|---|---|
| C1 | LLM-alone | LLM Alone | None | None | None |
| C2 | TF-IDF RAG | + Vanilla RAG (TF-IDF) | TF-IDF doc chunks | None | None |
| C2b | Dense RAG | + Vanilla RAG (dense) | Contriever doc chunks | None | None |
| C3 | KG-RAG | + KG-RAG (no assertions) | Graph + Doc | None | None |
| C4 | KG-RAG+Assert | + Epistemic KG-RAG | Graph + Doc | Full (7-class) | Bi-temporal+label |
| C4g | KG-RAG+Route | + Intent-Aware KG-RAG | Graph + Doc (type-specific) | Full (7-class) | Bi-temporal+label |
| C6 | Long Context | Long Context | All notes | None | None |
| C7 | Deterministic KG | Deterministic KG | KG lookup (no LLM) | Full | Bi-temporal+label |
| C1b | Discharge Only | Discharge Summary | Discharge doc only | None | None |
| C4g+ | KG-RAG+Notes | KG-RAG + Full Notes | Graph + Doc + All notes | Full (7-class) | Bi-temporal+label |
#### Endpoints\.
The *primary endpoint* is the leave-author-out paired exact McNemar test on $n = 50$ matched (C1, C4g) qid pairs from the three-rater external adjudication, restricted to Hird $\times$ Nadeem unanimous strict ratings (Section [4.7](https://arxiv.org/html/2605.11143#S4.SS7.SSS0.Px1)). This is the substantive author-blind comparison; it eliminates single-author circularity at the cost of statistical power ($n = 362 \to n = 50$). *Secondary / sensitivity*: (i) deterministic keyword reproducibility proxy on the change-excluded $n = 362$ subset (C4g\_keyword vs. C1; reported with patient-level cluster bootstrap CIs over 43 patients; not a clinical-correctness claim); (ii) three-rater majority vote ($n = 100$, subject to single-author circularity); (iii) oracle C4g upper bound; (iv) hard cross-admission subset (change $\cup$ current\_state $\cup$ historical, $n = 122$, post-selection-inference caveat); (v) C1b vs. C4g+ extension ($n = 240$); (vi) full $n = 400$. *Diagnostic*: per-category C1 $\to$ C4g deltas and the C3 $\to$ C4 $\to$ C4g decomposition. The change category is excluded from secondary (i) due to known reference defects (Section [4.6](https://arxiv.org/html/2605.11143#S4.SS6)). Cohort demographics in Appendix [Y](https://arxiv.org/html/2605.11143#A25).
#### Endpoint pre\-registration and provenance\.
The leave-author-out paired exact McNemar statistic ($b = 4$, $c = 15$, $p = 0.0192$) was computed and reported as a sensitivity analysis in commit 0c510d7 (v85.1, 2026-04-26), prior to the JAMIA-10 external review; its underlying three-rater adjudication data were frozen earlier (commit 7b388db, v74) and were not modified subsequently. Promotion to primary endpoint occurred in v86 (2026-04-27) post-hoc, in response to external-reviewer feedback identifying single-author circularity as the dominant threat to the originally-declared $n = 362$ keyword primary (commit 96746e9, v82, 2026-04-26). Because both endpoints are deterministic computations over already-frozen data, the promotion does not involve additional data collection or model re-running; it does, however, change the headline claim and the statistical-power profile, and we therefore report the leave-author-out primary alongside the originally-declared $n = 362$ keyword endpoint as a sensitivity analysis. The 8-item experiencer-attribution exclusion ($n = 370 \to 362$) was identified post-hoc during external review (the $0.5\,\text{pp}$ impact is documented in §[4.1](https://arxiv.org/html/2605.11143#S4.SS1)). The hard cross-admission $n = 122$ subset was selected post-hoc and carries a post-selection-inference caveat in the Results section. Pre-registration was not performed on AsPredicted or OSF; future versions of this benchmark will pre-register endpoints prior to data collection.
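The reported $p$-value and effect size can be checked from the discordant counts alone. The sketch below assumes $c$ counts pairs where only C4g was rated correct and $b$ pairs where only C1 was; the function name is illustrative. The Newcombe interval is not reproduced because it needs the full $2 \times 2$ table, which is not given in this section.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact (binomial) McNemar p-value from discordant counts."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


b, c, n_pairs = 4, 15, 50
p_value = mcnemar_exact(b, c)          # about 0.0192
delta_pp = 100 * (c - b) / n_pairs     # 22.0 percentage points
```

Only the 19 discordant pairs drive the test; the 31 concordant pairs affect the size of the paired difference but not the $p$-value.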
### 3\.6SliceBench: Complexity\-Stratified Case Study
SliceBench is a small case study (6 MIMIC-IV patients, 144 questions, three complexity tiers) testing whether KG-augmented retrieval scales with record complexity. Five conditions (B0–B4) form a monotone context progression; the critical comparison is B2 $\to$ B3, which adds structured KG context with assertion metadata while holding documents fixed (Appendix [K](https://arxiv.org/html/2605.11143#A11)).
### 3\.7Evaluation Protocol
#### ClinicalBench
uses a*deterministic keyword evaluator*\(v2\)—exact word\-boundary matching with abstention\-detection gate \(Appendix[P](https://arxiv.org/html/2605.11143#A16)\)\. The physician\-audited subset provides the most credible human accuracy estimate\. Primary answering model: Claude Opus 4\.6; cross\-model: MedGemma 27B, GPT\-OSS 20B, Qwen3\.5 35B, Gemma 4 31B, MedGemma 1\.5 4B\.
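A word-boundary keyword check with an abstention gate might be sketched as below. The abstention phrases and the all-keywords-must-match rule are assumptions made for illustration; the evaluator's actual rules are described in the paper's Appendix P.

```python
import re

# Hypothetical abstention cues; not the paper's actual list.
ABSTENTION = re.compile(
    r"\b(cannot determine|not enough information|"
    r"insufficient information|unable to answer)\b",
    re.I,
)


def keyword_score(answer: str, keywords: list[str]) -> bool:
    """True if the answer does not abstain and contains every keyword
    as a whole word or phrase (case-insensitive)."""
    if ABSTENTION.search(answer):
        return False
    return all(
        re.search(rf"\b{re.escape(k)}\b", answer, re.I) for k in keywords
    )
```

Word boundaries stop `pain` from matching `painting`, but a check of this shape carries no polarity test: an answer saying the patient denies chest pain still matches the keyword `chest pain`.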
#### SliceBench
uses LLM\-as\-judge with separated answering/judging models\[[15](https://arxiv.org/html/2605.11143#bib.bib15)\]\(Claude Sonnet 4\.5 answers, Opus 4\.6 judges\)\.
#### Statistical reporting\.
The author-blind primary endpoint (leave-author-out paired exact McNemar, $n = 50$) is reported with two-sided exact binomial $p$-values and a 95% Newcombe CI on the paired difference; this endpoint does not depend on any bootstrap. For the keyword sensitivity endpoint and other secondary contrasts we report BCa bootstrap 95% CIs ($n = 2{,}000$ resamples, seed 42) \[[25](https://arxiv.org/html/2605.11143#bib.bib25)\] with patient-level cluster bootstrap (43 patients) primary and question-level secondary (caveat: $n = 43$ clusters is at the low end of cluster-bootstrap reliability; cf. Cameron & Miller \[[26](https://arxiv.org/html/2605.11143#bib.bib26)\]). McNemar’s test \[[27](https://arxiv.org/html/2605.11143#bib.bib27)\] with Benjamini–Hochberg FDR correction is used for paired condition comparisons and cross-model contrasts. Safety score with asymmetric weighting ($w = 2.0$) in Appendix [E](https://arxiv.org/html/2605.11143#A5).
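The patient-level resampling idea can be illustrated with a short sketch. It uses plain percentile intervals for brevity, not the BCa intervals the paper reports, and the input layout (per-question differences grouped by patient) is an assumption.

```python
import random


def cluster_bootstrap_ci(deltas_by_patient, n_boot=2000, seed=42, alpha=0.05):
    """Percentile CI for the mean paired difference, resampling whole patients.

    `deltas_by_patient` maps a patient id to a non-empty list of per-question
    differences (+1, 0, or -1) between two conditions.
    """
    rng = random.Random(seed)
    patients = list(deltas_by_patient)
    stats = []
    for _ in range(n_boot):
        # Draw patients with replacement; keep all of each patient's questions.
        sample = [rng.choice(patients) for _ in patients]
        pooled = [d for p in sample for d in deltas_by_patient[p]]
        stats.append(sum(pooled) / len(pooled))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

Resampling patients rather than questions keeps each patient's correlated questions together, which is why the interval is typically wider than a question-level bootstrap would give.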
### 3\.8Physician Adjudication Protocol
The author (board-certified emergency physician, system designer) conducted a blinded internal audit of 120 paired C1/C4g questions with randomized A/B labels, rating each on five dimensions: reference correctness, model correctness, score fairness, safety, utility (Appendix [O](https://arxiv.org/html/2605.11143#A15)). The adjudication supports C4g $>$ C1 ($+35.0\,\text{pp}$ strict, $+31.7\,\text{pp}$ lenient; paired exact McNemar $p < 10^{-8}$ strict) and revealed a 56% reference-answer defect rate.
#### External physician adjudication\.
Two independent physicians (senior attending, 20+ years; resident) completed the same blinded 100-item protocol, yielding a three-rater majority vote with Fleiss’ $\kappa$ and exact-binomial McNemar $p$-values (Section [4.7](https://arxiv.org/html/2605.11143#S4.SS7)). This is in-distribution physician adjudication, not multi-site phenotype validation.
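For reference, Fleiss' $\kappa$ for a fixed number of raters can be computed as in the sketch below. The input layout (one row of category counts per item) is an assumption about how the ratings might be tabulated, and no rating data from the paper is used.

```python
def fleiss_kappa(table):
    """Fleiss' kappa from per-item category counts.

    Each row gives how many raters chose each category for one item;
    every row must sum to the same number of raters (at least 2).
    Undefined (division by zero) if all ratings fall in one category.
    """
    n_items = len(table)
    n_raters = sum(table[0])
    n_cats = len(table[0])
    # Share of all ratings that fall in each category.
    p_cat = [sum(row[j] for row in table) / (n_items * n_raters)
             for j in range(n_cats)]
    # Observed agreement per item, then averaged.
    p_item = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
              for row in table]
    p_obs = sum(p_item) / n_items
    p_exp = sum(p * p for p in p_cat)   # chance agreement
    return (p_obs - p_exp) / (1 - p_exp)
```

For three raters and a correct/incorrect judgment, each row would be a pair such as `[3, 0]` (unanimous) or `[2, 1]` (split).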
## 4Results
EpiKG is evaluated with ClinicalBench (same-record retrieval-faithfulness stress test) and SliceBench (small case study on patient complexity). ClinicalBench full-set results use a deterministic keyword proxy; the physician-adjudicated subset provides the most credible human accuracy estimate; SliceBench uses LLM-as-judge (Section [3.7](https://arxiv.org/html/2605.11143#S3.SS7)). Claude Opus 4.6 is the primary answering model; cross-model evaluation spans MedGemma 27B, GPT-OSS 20B, Qwen3.5 35B, Gemma 4 31B, and MedGemma 1.5 4B (Appendix [P](https://arxiv.org/html/2605.11143#A16)). Both benchmarks report BCa bootstrap 95% CIs ($n = 2{,}000$ resamples, seed 42) \[[25](https://arxiv.org/html/2605.11143#bib.bib25)\].
### 4\.1ClinicalBench: Primary Ablation
Table 2: ClinicalBench full-set *proxy* results (400 questions, Claude Opus 4.6, keyword evaluator v2 with abstention detection). Reproducibility scaffolding only, not the primary endpoint (see Section [4.7](https://arxiv.org/html/2605.11143#S4.SS7.SSS0.Px1)). Ablation ladder (C1–C4g) progressively adds components; C4 isolates assertion metadata from intent routing; C6 and C7 are bookend baselines. C4g$_{\text{kw}}$ (keyword routing, secondary/sensitivity) and C4g$_{\text{oracle}}$ (oracle routing, upper bound) are shown separately. Sig: \* denotes BCa CI excluding zero (caveat: $n = 43$ patients is at the low end of cluster-bootstrap reliability; cf. Cameron & Miller \[[26](https://arxiv.org/html/2605.11143#bib.bib26)\]). Best in **bold**.

| ID | Condition | Accuracy | $\Delta$ vs C1 |
|---|---|---|---|
| C1 | LLM Alone | 21.8% | n/a |
| C2 | + Vanilla RAG (TF-IDF) | 52.0% | $+30.2\,\text{pp}$\* |
| C2b | + Vanilla RAG (dense) | 50.8% | $+29.0\,\text{pp}$\* |
| C3 | + KG-RAG (no assertions) | 50.0% | $+28.2\,\text{pp}$\* |
| C4 | + Assertions (no routing) | 46.2% | $+24.5\,\text{pp}$\* |
| C4g$_{\text{kw}}$ | + Intent-Aware KG-RAG (keyword) | 60.2% | $+38.5\,\text{pp}$\* |
| C4g$_{\text{oracle}}$ | + Intent-Aware KG-RAG (oracle) | **68.5%** | $+46.8\,\text{pp}$\* |
| C6 | Long Context (all notes) | 59.2% | $+37.5\,\text{pp}$\* |
| C7 | Deterministic KG (no LLM) | n/a† | n/a |
Figure 3: Ablation results. (a) Waterfall chart showing incremental accuracy changes (Opus, $n = 400$). Retrieval provides the largest gain ($+30.2\,\text{pp}$); switching from flat to KG-structured retrieval is neutral ($-2.0\,\text{pp}$); assertions without routing *hurt* ($-3.8\,\text{pp}$); keyword routing recovers and extends ($+14.0\,\text{pp}$); oracle routing adds $+8.3\,\text{pp}$. (b) Qualitative progression of model answers across conditions. (c) KG-RAG benefit generalizes across all six models ($+20$ to $+47\,\text{pp}$ over C1). Alt text: Three-panel ablation figure. A waterfall plot shows accuracy rising with retrieval, falling slightly with assertions alone, and rising again with intent routing. A qualitative answer panel compares condition outputs. A cross-model panel shows KG-RAG gains for every tested model.
ClinicalBench provides three evaluators that yield directionally consistent estimates: physician three-rater majority $+24.0\,\text{pp}$ ($p = 0.0075$; sensitivity, subject to single-author circularity; Section [4.7](https://arxiv.org/html/2605.11143#S4.SS7)), internal author adjudication $+35.0\,\text{pp}$ strict (descriptive, single-rater; Section [4.6](https://arxiv.org/html/2605.11143#S4.SS6)), and a deterministic keyword reproducibility proxy $+39.5\,\text{pp}$ (NOT a clinical correctness claim). The keyword evaluator is reproducibility scaffolding (shallow keyword matching with no polarity check, favoring C4g’s structured-answer style) and is reported here for replicability, not as the substantive comparison (Appendix [P.3](https://arxiv.org/html/2605.11143#A16.SS3)). On the change-excluded $n = 362$ proxy endpoint (8 experiencer-attribution-defective items excluded), keyword C4g reaches 62.4% versus C1 22.9% ($+39.5\,\text{pp}$; McNemar $p = 2.44 \times 10^{-30}$); oracle provides a $+43.1\,\text{pp}$ upper bound (66.0%). C6 (long context) scores 59.2%, which is $9.3\,\text{pp}$ below C4g$_{\text{oracle}}$ ($p = 0.001$), with the gap concentrated in current-state (C6 18.0% vs. C4g 70.0%). † C7 returns template refusals on $>98\%$ of questions and is semantically 0% (Appendix [X](https://arxiv.org/html/2605.11143#A24)). The author-blind primary endpoint is the leave-author-out paired exact McNemar reported in Section [4.7](https://arxiv.org/html/2605.11143#S4.SS7.SSS0.Px1).
#### Ablation decomposition\.
C2 (TF-IDF) and C2b (Contriever dense) score comparably (52.0% vs. 50.8%; $p=0.62$), and C3 (50.0%) is similar: retrieval method does not explain the C2$\to$C4g gap. C4 scores 46.2%, *below* C3 ($-3.8\,\text{pp}$, n.s.); intent routing recovers and extends ($+14.0\,\text{pp}$ keyword, $+22.3\,\text{pp}$ oracle; $p<10^{-6}$). Per-category, assertions alone help assertion-sensitive categories (negation $+22.7\,\text{pp}$, uncertainty $+15.0\,\text{pp}$) but degrade temporal ones (historical $-30.0\,\text{pp}$, sequence $-45.0\,\text{pp}$); routing reverses these (Appendix [X](https://arxiv.org/html/2605.11143#A24)). The C2b$\to$C4g architectural delta vs. the dense-RAG baseline is reported separately below.
#### Intent classification sensitivity\.
Keyword-only C4g (60.2%) outperforms C4 without routing ($+14.0\,\text{pp}$) and the best non-KG baseline C2 ($+8.2\,\text{pp}$), indicating that the architecture's value does not require an oracle classifier (per-category classifier accuracy in Appendix Table [23](https://arxiv.org/html/2605.11143#A19.T23)).

Figure 4: Oracle vs. keyword-only intent routing across the four models with both routing variants. Dumbbells span from keyword C4g (light blue) to oracle C4g (dark blue); gray diamonds show C1 baselines. The dashed line marks C4 (no routing, 46.2%). Most models gain modestly from oracle routing; Qwen3.5 (run with `think:false` due to an Ollama repetition-penalty issue, Appendix [P.1](https://arxiv.org/html/2605.11143#A16.SS1)) does not benefit from oracle routing in this configuration. Alt text: Dumbbell chart comparing keyword and oracle routing accuracy by model. Each row includes a low C1 baseline point and higher C4g points. Most models improve with oracle routing.
#### Evaluator hierarchy\.
LLM-as-judge ($+28.5\,\text{pp}$; Appendix [V](https://arxiv.org/html/2605.11143#A22)) corroborates the physician and keyword estimates. The keyword evaluator is too strict in 40% of cases vs. too lenient in 5% (7.5:1 ratio).
#### Circularity disclosure\.
The author designed ClinicalBench, built EpiKG, generated initial reference answers, and conducted internal adjudication, a degree of role overlap that could bias results. Three structural mitigations bound this risk: a frozen public release of the evaluator and predictions; confirmation of the C4g benefit by two external physicians under blinded majority vote ($+24.0\,\text{pp}$); and significant benefit in all five additional LLMs ($+20.4$ to $+43.1\,\text{pp}$ oracle). Independent replication on a separately authored benchmark is the definitive test.
#### Hard cross\-admission subset\.
On the 122-item subset requiring synthesis across $\geq 2$ admissions, C4g$_{\text{oracle}}$ reaches 72.1% versus C1 14.8% ($+57.4\,\text{pp}$; 95% CI: [$+47.5\,\text{pp}$, $+66.0\,\text{pp}$]).
#### Architectural novelty: structured intent\-aware retrieval over a strong dense\-RAG baseline\.
C2b (Contriever dense RAG) $\to$ C4g$_{\text{kw}}$ (intent-aware KG-RAG) on $n=362$ yields $+8.84\,\text{pp}$ (McNemar $p=1.79\times 10^{-3}$); oracle classification yields $+12.43\,\text{pp}$. On the full $n=400$, the deltas are $+9.50\,\text{pp}$ keyword / $+17.75\,\text{pp}$ oracle. This isolates the structural-retrieval-with-routing contribution over a reasonable dense-retrieval baseline; it is the defensible architectural novelty number, separating retrieval-vs-no-retrieval from structured-retrieval-vs-flat-retrieval.
### 4.2 Discharge Summary vs. KG-RAG (Cross-Model Extension)
A clinically realistic comparison pits discharge summary alone (C1b) against EpiKG with all notes (C4g+full) on $n=240$ across four models. Three of four models reach significance under cluster bootstrap; GPT-OSS shows a directional but non-significant gain: Opus $+12.5\,\text{pp}$ (57.5% $\to$ 70.0%), Qwen3.5 $+10.4\,\text{pp}$ (58.3% $\to$ 68.8%), MedGemma $+8.8\,\text{pp}$ (55.0% $\to$ 63.8%), GPT-OSS $+1.7\,\text{pp}$ (60.4% $\to$ 62.1%, $p=0.32$). The range $+1.7$ to $+12.5\,\text{pp}$ is consistent with structured retrieval over full clinical notes generalizing across parameter and training differences (per-category breakdowns in Appendix Table [16](https://arxiv.org/html/2605.11143#A14.T16)).
### 4.3 Category $\times$ Condition Interaction

Figure 5: ClinicalBench per-category accuracies across six conditions (Claude Opus 4.6, evaluator v2). This plot is descriptive only: category counts vary, no multiplicity correction is applied, and categories with known evaluation defects, especially change, can inflate apparent gaps. C4 is lower than C3 overall, while C4g is highest overall. Alt text: Radar chart showing per-category accuracy for six ClinicalBench conditions. C4g covers the largest area overall, especially on cross-admission categories. C4 is smaller than C3 in several temporal categories, illustrating the assertion-without-routing regression.
Per-category accuracy (Figure [5](https://arxiv.org/html/2605.11143#S4.F5); Appendix Table [7](https://arxiv.org/html/2605.11143#A6.T7)) reveals a *category $\times$ condition interaction* (diagnostic only; $n=20$–30 per category, no multiplicity correction). C4g improves all 9 categories over C1, with the largest gains in cross-admission synthesis. C4 helps assertion-sensitive categories but degrades temporal ones, a regression resolved by intent routing (Appendix Figure [7](https://arxiv.org/html/2605.11143#A10.F7)); C6 lags C4g most on current state ($-52\,\text{pp}$) and conditional ($-5\,\text{pp}$), with parity elsewhere.
### 4.4 Cross-Model Evaluation
Table 3: ClinicalBench cross-model results (change-excluded keyword sensitivity endpoint, $n=362$, keyword evaluator v2; the substantive primary endpoint is the leave-author-out paired exact McNemar in Section [4.7](https://arxiv.org/html/2605.11143#S4.SS7.SSS0.Px1)). C4g$_{\text{oracle}}$ shown as upper bound; all six models benefit (oracle deltas $+20.4$ to $+43.1\,\text{pp}$, all $q<10^{-10}$ after BH-FDR).

All six models benefit (Table [3](https://arxiv.org/html/2605.11143#S4.T3); oracle deltas $+20.4$ to $+43.1\,\text{pp}$, all $q<10^{-10}$ after BH-FDR), with benefit inverse to baseline strength (Appendix Table [7](https://arxiv.org/html/2605.11143#A6.T7)).
To assess whether the convergence of C4g$_{\text{oracle}}$ accuracies (range 55.8–66.0%) despite C1 spanning 21.8–39.5% reflects encoding partially substituting for parameter count, we regressed the C1 $\to$ C4g$_{\text{oracle}}$ delta on the C1 baseline. The fit gives slope $\beta=-1.123$, Pearson $r=-0.921$, $p=0.009$ ($n=6$ models). The strong negative slope is consistent with regression to the mean rather than encoding-substitution; the substitution hypothesis is therefore not supported by this evidence and is removed from the contribution claims (§[1](https://arxiv.org/html/2605.11143#S1)).
### 4.5 SliceBench
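Why a slope near $-1$ points to a ceiling rather than substitution can be seen with a few lines of arithmetic. The sketch below is illustrative only: the accuracy values are hypothetical placeholders, not the paper's per-model results, and `ols_slope_and_r` is a helper name introduced here. If every model lands on the same post-retrieval accuracy regardless of its baseline, the delta equals that constant minus the baseline, so regressing delta on baseline gives a slope of exactly $-1$; a uniform additive gain would instead give a slope near 0.

```python
import math

def ols_slope_and_r(xs, ys):
    """Least-squares slope of ys on xs and the Pearson correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    r = sxy / math.sqrt(sxx * syy) if syy > 0 else float("nan")
    return slope, r

# Hypothetical baselines (percent) for six models; NOT the paper's data.
baseline = [20.0, 24.0, 28.0, 32.0, 36.0, 40.0]

# Scenario A: every model ends at the same ceiling, whatever its baseline.
ceiling = 65.0
delta_ceiling = [ceiling - b for b in baseline]
slope_a, r_a = ols_slope_and_r(baseline, delta_ceiling)

# Scenario B: roughly uniform additive gain, independent of baseline.
delta_uniform = [10.0, 11.0, 10.0, 11.0, 10.0, 11.0]
slope_b, r_b = ols_slope_and_r(baseline, delta_uniform)

print(f"ceiling scenario: slope={slope_a:.3f}, r={r_a:.3f}")
print(f"uniform-gain scenario: slope={slope_b:.3f}, r={r_b:.3f}")
```

With only six points, a fitted slope close to $-1$ cannot distinguish a genuine equalizing effect from this mechanical coupling between a delta and its own baseline.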
A small supporting case study \(SliceBench, 6 patients, 144 questions\) is consistent with a complexity\-dependent KG effect but does not reach aggregate significance \(Appendix[I](https://arxiv.org/html/2605.11143#A9)\)\.
### 4.6 Physician Adjudication
A blinded internal audit of 120 paired questions (Section [3.8](https://arxiv.org/html/2605.11143#S3.SS8), Appendix [X](https://arxiv.org/html/2605.11143#A24)) shows C4g at 62.5% strict / 84.2% lenient vs. C1 at 27.5% / 52.5%. Deltas: $+35.0\,\text{pp}$ strict (95% paired Wald CI: [$+24.8\,\text{pp}$, $+45.2\,\text{pp}$]; paired exact McNemar $p<10^{-8}$) and $+31.7\,\text{pp}$ lenient; on a 3-level ordinal score (correct=1, partial=0.5, incorrect=0), C4g is higher in 64/120, tied in 46, and lower in 10 (sign test $p<0.0001$). The keyword evaluator overestimates the strict delta by ${\sim}5\,\text{pp}$ relative to physician judgment; all three methods agree on direction.
#### Evaluator agreement\.
The keyword evaluator agrees with the physician on 54.2% of items (Cohen's $\kappa=0.18$) [[28](https://arxiv.org/html/2605.11143#bib.bib28)]; on the 106 items with physician-confirmed correct references, the strict delta rises to $+41.4\,\text{pp}$, so gold-standard defects attenuate, not inflate, the benefit. C4g improves or ties on all 9 categories ($p<0.002$); safe rate $+15.8\,\text{pp}$, helpful rate $+36.7\,\text{pp}$ (Appendix [X](https://arxiv.org/html/2605.11143#A24)).
### 4.7 External Physician Adjudication (Three-Rater)
Three physicians—Reviewer 1 \(A\.S\., senior internal, 20\+ yr\), Reviewer 2 \(C\.H\., senior external, 20\+ yr\), and Reviewer 3 \(S\.N\., external resident\)—independently rated the same blinded 100\-item subset \(50 C1, 50 C4g\)\.
Table 4: Three-rater external physician adjudication ($n=100$ paired items, 50 C1 / 50 C4g, blinded). Per-reviewer C4g $-$ C1 strict deltas are shown with exact-binomial McNemar $p$-values. The three-rater majority vote is reported as a sensitivity comparator; the author-blind primary endpoint is the leave-author-out paired exact McNemar (Section [4.7](https://arxiv.org/html/2605.11143#S4.SS7.SSS0.Px1)).

#### Author-blind primary endpoint: leave-author-out paired exact McNemar.
The author-blind primary endpoint is the leave-author-out paired exact McNemar on 50 matched (C1, C4g) qid pairs from the three-rater external adjudication, restricted to Hird $\times$ Nadeem unanimous strict ratings (no inclusion of the author). C1 12/50 (24.0%) $\to$ C4g 23/50 (46.0%), $\Delta=+22.0\,\text{pp}$ [95% Newcombe CI: $+5.1\,\text{pp}$, $+31.5\,\text{pp}$], two-sided exact McNemar $p=0.0192$ ($b=4$ favor C1, $c=15$ favor C4g). This is the substantive author-blind comparison: paired, not subject to gold-standard circularity (the test is whether independent raters agree on system output), and computed from the pre-existing v85.1 commit (0c510d7, 2026-04-26). Promotion of this statistic from a sensitivity analysis to the primary endpoint occurred post-hoc in v86 (2026-04-27), in response to external-reviewer feedback identifying single-author circularity as the dominant threat to the originally declared $n=362$ keyword primary; both endpoints are deterministic computations over already-frozen three-rater data (commit 7b388db, v74), and pre-registration was not performed on AsPredicted or OSF (Section [3.5](https://arxiv.org/html/2605.11143#S3.SS5)). Inter-non-author Cohen's $\kappa=0.36$ (Hird $\times$ Nadeem) versus $0.43$–$0.49$ for author-involving pairs.
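The reported $p$-value and delta follow directly from the discordant-pair counts given above. The sketch below recomputes them with the standard two-sided exact (binomial) form of McNemar's test; the function name `exact_mcnemar_two_sided` is introduced here for illustration and is not taken from the ClinicalBench release.

```python
from math import comb

def exact_mcnemar_two_sided(b, c):
    """Two-sided exact McNemar p-value from the discordant-pair counts.

    Under the null, each discordant pair favors either condition with
    probability 0.5, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

b, c, n_pairs = 4, 15, 50           # counts reported in the text
p_value = exact_mcnemar_two_sided(b, c)
delta_pp = 100 * (c - b) / n_pairs  # paired difference in percentage points

print(f"delta = {delta_pp:+.1f} pp, exact McNemar p = {p_value:.4f}")
# prints: delta = +22.0 pp, exact McNemar p = 0.0192
```

Only the 19 discordant pairs enter the test; the 31 pairs on which the two conditions received the same unanimous rating affect the delta's denominator but not the $p$-value.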
#### Three\-rater majority \(sensitivity comparator\)\.
Under majority vote (Table [4](https://arxiv.org/html/2605.11143#S4.T4)), C4g is correct on 32/50 vs. C1 on 20/50, $+24.0\,\text{pp}$ ($p=0.0075$); change-excluded $+27.3\,\text{pp}$. No reviewer shows an inversion on any non-*change* category. Fleiss' $\kappa=0.413$ (strict) / 0.615 (gold correctness) [[28](https://arxiv.org/html/2605.11143#bib.bib28)]; pairwise Cohen's $\kappa$ 0.36–0.49. External reviewers found 61–64% reference defect rates (vs. 56% internal). This majority result is subject to single-author circularity (Reviewer 1 is an author) and is reported as a sensitivity comparator to the primary endpoint above.
#### Structurally\-independent rater\.
The only structurally-independent rater (Reviewer 3, no prior author relationship) showed $+2.0\,\text{pp}$ (calibration: 65/100 vs. 39–46/100 for seniors); with 23 discordant pairs, the minimum detectable effect is ${\sim}22\,\text{pp}$, so the resident's data are consistent with effects up to that magnitude (Appendix [Z.1](https://arxiv.org/html/2605.11143#A26.SS1)).
## 5 Discussion
#### Why long context underperforms\.
Long context forces the model to do two things at once: extract structure (negation, temporality, experiencer) and reason toward the answer. KG-RAG does the structural work upstream and hands the LLM a small set of typed facts. C6 (59.2%) trails C4g$_{\text{oracle}}$ (68.5%) by $9.3\,\text{pp}$ ($p=0.001$), and the gap is concentrated in current_state (C6 18.0% vs. C4g 70.0%). GPT-OSS 20B narrows its gap to Opus 4.6 once both get structured context (59.4% vs. 66.0%). The keyword evaluator is a reproducibility tool, not a clinical-correctness claim: it scores keyword presence and skips polarity for uncertainty, family history, conditional, current_state, and historical questions. These shallow rules favor C4g's structured answers over C1's abstentions, so the keyword $+39.5\,\text{pp}$ overestimates the true gain. The physician adjudication ($+24$ to $+36\,\text{pp}$) is the comparison that matters; the keyword number is its deterministic proxy.
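The polarity blind spot of keyword scoring is easy to reproduce in miniature. The following is a deliberately naive sketch, not the ClinicalBench evaluator: `keyword_match`, the reference keyword, and the three answers are all hypothetical, chosen only to show how presence-only matching treats an affirmed answer, a negated answer, and an abstention.

```python
def keyword_match(answer: str, keywords: list[str]) -> bool:
    """Score an answer as correct if any reference keyword appears in it.

    Deliberately naive: there is no negation or polarity handling.
    """
    text = answer.lower()
    return any(k.lower() in text for k in keywords)

# Hypothetical reference keyword for a hypothetical item.
reference_keywords = ["heart failure"]

affirmed = "The patient has chronic heart failure."
negated = "There is no evidence of heart failure."
abstain = "I cannot determine this from the information provided."

scores = {
    "affirmed": keyword_match(affirmed, reference_keywords),
    "negated": keyword_match(negated, reference_keywords),
    "abstain": keyword_match(abstain, reference_keywords),
}
print(scores)
# prints: {'affirmed': True, 'negated': True, 'abstain': False}
```

Under such a rule, any answer that names the concept is credited regardless of polarity while an abstention never is, which is the direction of bias described above.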
#### Cross\-model convergence\.
All six models gain (Table [3](https://arxiv.org/html/2605.11143#S4.T3); oracle deltas $+20.4$ to $+43.1\,\text{pp}$, all $q<10^{-10}$ after BH-FDR). C4g$_{\text{oracle}}$ accuracies land in a narrow $55.8$–$66.0\%$ band even though C1 baselines span $21.8$–$39.5\%$. Regressing the C1 $\to$ C4g delta on the C1 baseline gives $\beta=-1.123$, $r=-0.921$, $p=0.009$. C4g$_{\text{oracle}}$ sits close to the keyword evaluator's measurable ceiling (${\sim}70\%$), and Tu (*BMJ* 2005) showed that paired pre-post designs can produce slopes near $-1$ as a ceiling artifact, so we cannot tell whether structured retrieval is substituting for model size or whether we are seeing regression to the mean. With $n=6$ models, this is descriptive, not inferential. One sign that representation alone is not free: C4 (assertions without routing) drops $3.8\,\text{pp}$ below C3 ($p=0.26$, n.s.; Appendix Figure [7](https://arxiv.org/html/2605.11143#A10.F7)). Routing in C4g recovers and extends the gain.
#### Possible implications beyond clinical NLP\.
Adding metadata without aligning retrieval to it can hurt performance, as the C3 $\to$ C4 step shows. The C3 $\to$ C4 $\to$ C4g ablation may be useful as a test pattern wherever a system extracts structured annotations but does not integrate them into retrieval.
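One way to picture the difference between carrying annotations and retrieving by them is the toy sketch below. It is not EpiKG's implementation: the `Fact` fields, the keyword cues, the intent names, and the filtering rule are all assumptions made for illustration. Without routing, every annotated fact is returned and the model must sort out temporality itself; with routing, the question's intent selects which temporality tag is retrieved.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fact:
    concept: str
    assertion: str    # e.g. "present" or "absent" (hypothetical labels)
    temporality: str  # "Past", "Current", or "Future"

# Hypothetical keyword cues for two intents; real routing rules may differ.
INTENT_CUES = {
    "historical": ("history", "previously", "prior"),
    "current_state": ("currently", "now", "active"),
}
INTENT_TO_TEMPORALITY = {"historical": "Past", "current_state": "Current"}

def classify_intent(question):
    """Return the first intent whose cue appears in the question, else None."""
    q = question.lower()
    for intent, cues in INTENT_CUES.items():
        if any(cue in q for cue in cues):
            return intent
    return None  # unknown intent: fall back to unfiltered retrieval

def retrieve(facts, question, use_routing):
    """Return facts for the question; filter by temporality only when routed."""
    intent = classify_intent(question) if use_routing else None
    wanted = INTENT_TO_TEMPORALITY.get(intent)
    if wanted is None:
        return list(facts)
    return [f for f in facts if f.temporality == wanted]

facts = [
    Fact("pneumonia", "present", "Past"),
    Fact("atrial fibrillation", "present", "Current"),
]
question = "Does the patient currently have pneumonia?"
unrouted = retrieve(facts, question, use_routing=False)
routed = retrieve(facts, question, use_routing=True)
print(len(unrouted), [f.concept for f in routed])
```

The same annotated facts are available in both calls; only the routed call uses the temporality tag to decide what reaches the model.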
#### Evaluator calibration and reference\-answer quality\.
The three evaluators agree directionally: keyword $+39.5\,\text{pp}$, physician $+24$ to $+36\,\text{pp}$, LLM-as-judge $+28.5\,\text{pp}$. The keyword and LLM-judge magnitudes differ, but both compare the same C1 and C4g answers against the same gold standard, so any per-answer evaluator bias cancels in the within-model C1 $\to$ C4g paired delta; what survives is an architectural signal that the physician adjudication then independently corroborates. Physician review found 56% of v2 reference answers defective. The cause is a systematic NLP assertion-classifier error: the classifier read "history of CHF" as "resolved" when clinical usage means "chronic." The model is correct 59% of the time when the reference is wrong, and 46% of the time when it is right, so noisy labels appear to underestimate true performance rather than inflate it. We do not report a v3 gold rescore in this manuscript: the corrections were identified during physician adjudication, applying them and re-running every model $\times$ condition would constitute a second freezing of the benchmark, and we prefer to keep v2 as the public test set with v3 reserved for an independent follow-up.
#### Bi\-temporal versus tri\-temporal modeling\.
We use the term "tri-temporal" loosely. The system stores bi-temporal edges (valid time and transaction time, in the Snodgrass tradition; cf. Graphiti [[11](https://arxiv.org/html/2605.11143#bib.bib11)]) and adds an NLP-derived temporality label $\tau_{a}\in\{\text{Past}, \text{Current}, \text{Future}\}$. The label comes from clinical NLP, not database time, so the data model is bi-temporal in the strict sense. C4 enables the assertion and temporality axes together, so we cannot separate the temporal contribution to the C4g gain. That separation is future work.
#### Statistical caveats\.
BCa cluster bootstrap with 43 patients is at the low end of reliability (Cameron & Miller [[26](https://arxiv.org/html/2605.11143#bib.bib26)] note that $\geq 50$ clusters is preferable); CIs should be interpreted with this caveat. The hard cross-admission $n=122$ subset was selected post-hoc and merits a post-selection-inference caveat (cf. Berk et al. [[29](https://arxiv.org/html/2605.11143#bib.bib29)]). Cross-model BH-FDR is reported; the more conservative Benjamini-Yekutieli (2001) procedure under PRDS-violating dependence yields equivalent conclusions (BY adjustment factor $2.45\times$; all $q<2.5\times 10^{-10}$).
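The $2.45\times$ factor is the Benjamini-Yekutieli harmonic penalty $c(m)=\sum_{i=1}^{m} 1/i$ evaluated at $m=6$ models. A minimal sketch of both adjustments follows; the function names are introduced here, and the example $p$-values are hypothetical placeholders rather than the paper's per-model values.

```python
def harmonic_factor(m):
    """Benjamini-Yekutieli penalty c(m) = sum_{i=1..m} 1/i."""
    return sum(1.0 / i for i in range(1, m + 1))

def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted q-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        q[idx] = running_min
    return q

def by_adjust(pvalues):
    """Benjamini-Yekutieli q-values: BH q-values scaled by c(m), capped at 1."""
    c_m = harmonic_factor(len(pvalues))
    return [min(1.0, qv * c_m) for qv in bh_adjust(pvalues)]

print(f"c(6) = {harmonic_factor(6):.2f}")

# Hypothetical p-values for illustration only (not the paper's results).
example_p = [0.01, 0.02, 0.03, 0.04]
print(bh_adjust(example_p), by_adjust(example_p))
```

Because BY only rescales the BH values by a constant, a BH bound of $q<10^{-10}$ across six tests translates to roughly $q<2.5\times 10^{-10}$ under BY, matching the figure quoted above.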
#### Multi\-site phenotype\-validation roadmap\.
The current cohort comprises 43 MIMIC-IV patients (BIDMC ICU). Multi-site validation in the eMERGE / OHDSI tradition would require $\geq 5$ sites with per-site PPV. Minimum-viable transportability checks (MIMIC-III $\leftrightarrow$ MIMIC-IV intra-cohort, eICU, Synthea narrative) are flagged as future work. NLP-portability literature [[30](https://arxiv.org/html/2605.11143#bib.bib30)] projects 10–20 pp PPV degradation on transport, which we have not measured.
#### Deployment\-realism scoping\.
ClinicalBench Opus C1/C3/C4g require ${\sim}22$ min per condition per patient via API; the system has not been packaged for sub-second EHR-sidebar latency, FURM-style governance review [[31](https://arxiv.org/html/2605.11143#bib.bib31)], CHAI Assurance Reporting Checklist alignment, or post-deployment algorithmovigilance [[32](https://arxiv.org/html/2605.11143#bib.bib32)]. These are deployment prerequisites, not paper findings; a candidate SaMD/PCCP/CHAI/algorithmovigilance scaffold is sketched in Appendix [Z.5](https://arxiv.org/html/2605.11143#A26.SS5).
#### Threats to validity and claim scope\.
ClinicalBench-v2 is a provisional in-distribution stress test (43 patients, 400 questions, MIMIC-IV [[24](https://arxiv.org/html/2605.11143#bib.bib24)]) measuring retrieval fidelity, not external generalization or phenotype validation in the eMERGE/PheKB sense. The change-excluded keyword endpoint (now reported as a sensitivity, $n=362$) excludes 8 experiencer-attribution-defective items (Appendix [P.2](https://arxiv.org/html/2605.11143#A16.SS2)); they remain in the released v2 gold for transparency. Evaluator uncertainty is substantial (54% keyword–physician agreement); conclusions focus on directional agreement across three evaluators. The author's combined role as designer, builder, and primary evaluator creates a circularity risk that is structurally mitigated but not eliminated (Section [4.1](https://arxiv.org/html/2605.11143#S4.SS1.SSS0.Px4)); inter-non-author Cohen's $\kappa=0.36$ (Hird $\times$ Nadeem) further attenuates the external-adjudication signal; the leave-author-out estimate (Hird $\times$ Nadeem unanimous strict) is $+22.0\,\text{pp}$ (paired exact McNemar $p=0.0192$; 4 vs. 15 discordant pairs) versus $+24.0\,\text{pp}$ for the three-rater majority with the author included, supporting both direction and significance. Model dependence is real ($+20.4$ to $+43.1\,\text{pp}$ oracle, after BH-FDR). This work performs in-distribution evaluation and physician adjudication, not multi-site phenotype validation; it does *not* establish clinical deployment readiness, and multi-site replication and a prospective clinician-with-system comparison are encouraged. Detailed threats and broader-impact discussion appear in Appendix [Z](https://arxiv.org/html/2605.11143#A26).
## 6 Conclusion
The author-blind primary endpoint is the leave-author-out paired exact McNemar (Hird $\times$ Nadeem unanimous strict, $n=50$; promoted post-hoc to primary in v86 from a pre-existing v85.1 sensitivity in response to external review; not pre-registered): $+22.0\,\text{pp}$ [$+5.1\,\text{pp}$, $+31.5\,\text{pp}$], $p=0.0192$. The architectural novelty is C2b (Contriever dense-RAG) $\to$ C4g$_{\text{kw}}$ (intent-aware KG-RAG) on the change-excluded $n=362$ endpoint: $+8.84\,\text{pp}$ (McNemar $p=1.79\times 10^{-3}$). Sensitivities: three-rater physician majority $+24.0\,\text{pp}$ ($p=0.0075$; subject to single-author circularity since the author is R1); deterministic keyword proxy $+39.5\,\text{pp}$ (reproducibility proxy only, not a clinical correctness claim). Cross-model accuracies converge with a strong negative slope vs. the C1 baseline ($\beta=-1.123$, $r=-0.921$, $p=0.009$), consistent with regression to the mean rather than encoding substituting for parameter count. The 56% reference-answer defect rate underscores a methodological lesson: automated NLP-pipeline benchmarks require physician adjudication. Reasoning benchmarks like HealthBench measure the last mile given clean vignettes; retrieval-faithfulness on real charts is the first mile, and representation is the undermeasured prerequisite.
## Acknowledgments
The author thanks non\-author contributors Cindy Hird, MD and Shaheera Nadeem, MD for independent physician adjudication of ClinicalBench items, and Yu Tian, PhD \(University of Central Florida\) for feedback on an early draft\. MIMIC\-IV data were accessed under PhysioNet Credentialed Health Data Use Agreement\.
#### Funding\.
This research received no specific grant from any funding agency in the public, commercial, or not\-for\-profit sectors\.
#### Conflict of Interest\.
The author serves as founder of Sulci\.ai, a clinical\-AI startup that is exploring deployment of EpiKG\-derived technology\. No commercial relationship funded this manuscript or the underlying experiments\. The University of Central Florida is the institution of academic record for this work\.
#### Author Contributions\.
Alex Stinard, MD: Conceptualization, Methodology, Software, Validation, Formal Analysis, Investigation, Data Curation, Writing – Original Draft, Writing – Review & Editing, Visualization, Project Administration\.
#### Data Availability\.
#### Use of AI Tools\.
Claude Code \(Anthropic, Claude Opus 4\.6\) was used as a programming assistant during system development, data analysis, and manuscript preparation\. All AI\-generated content was reviewed, verified, and edited by the author\. Claude Opus 4\.6 is also the primary answering model evaluated in ClinicalBench experiments; this dual role is disclosed throughout the paper\.
## References
- Singhal et al\. \[2023\]Karan Singhal, Shekoofeh Azizi, Tao Tu, S\. Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole\-Lewis, Stephen Pfohl, et al\.Towards expert\-level medical question answering with large language models\.*Nature*, 620:399–404, 2023\.doi:10\.1038/s41586\-023\-06291\-2\.
- Saab et al\. \[2024\]Khaled Saab, Tao Tu, Xavier Amatriain, et al\.Capabilities of Gemini models in medicine\.*arXiv preprint arXiv:2404\.18416*, 2024\.
- Tu et al\. \[2024\]Tao Tu, Anil Palepu, Mike Schaekermann, Khaled Saab, Jan Freyberg, Ryutaro Tanno, Amy Wang, Brenna Li, Mohamed Amin, Nenad Tober, et al\.Towards conversational diagnostic AI\.*Nature*, 2024\.
- Uzuner et al\. \[2011\]Özlem Uzuner, Brett R\. South, Shuying Shen, and Scott L\. DuVall\.2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text\.*Journal of the American Medical Informatics Association*, 18\(5\):552–556, 2011\.
- Kocaman et al\. \[2025\]Veysel Kocaman, Yigit Gul, M\. Aytug Kaya, Hasham Ul Haq, Mehmet Butgul, Cabir Celik, and David Talby\.Beyond negation detection: Comprehensive assertion detection models for clinical NLP\.In*Text2Story Workshop at European Conference on Information Retrieval \(ECIR\)*, 2025\.
- OHDSI Collaborative \[2024\]OHDSI Collaborative\.OMOP common data model v5\.4\.*Observational Health Data Sciences and Informatics*, 2024\.
- Edge et al\. \[2024\]Darren Edge et al\.From local to global: A graph RAG approach to query\-focused summarization\.*arXiv preprint arXiv:2404\.16130*, 2024\.
- Luo et al\. \[2025\]Tianjun Luo et al\.GFM\-RAG: Graph foundation model for retrieval augmented generation\.In*Advances in Neural Information Processing Systems*, 2025\.
- Jiang et al\. \[2025a\]Peng Jiang et al\.KARE: Knowledge graph augmented reasoning via LLMs for clinical decision support\.In*International Conference on Learning Representations*, 2025a\.
- Wu et al\. \[2025\]Junde Wu et al\.Medical\-Graph\-RAG: Towards safe medical large language model via graph retrieval\-augmented generation\.In*Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics*, 2025\.
- Rasmussen et al\. \[2025\]Preston Rasmussen, Pavlo Paliychuk, Travis Beauvais, Jack Ryan, and Daniel Chalef\.Zep: A temporal knowledge graph architecture for agent memory\.*arXiv preprint arXiv:2501\.13956*, 2025\.
- Postiglione et al\. \[2024\]Marco Postiglione, Daniel Bean, Zeljko Kraljevic, Richard Dobson, and Vincenzo Moscato\.Predicting future disorders via temporal knowledge graphs and medical ontologies\.*IEEE Journal of Biomedical and Health Informatics*, 28\(7\):4238–4248, 2024\.doi:10\.1109/JBHI\.2024\.3390419\.
- OpenAI \[2026\]OpenAI\.HealthBench Professional: Evaluating clinical reasoning in large language models\.[https://openai\.com/research/healthbench](https://openai.com/research/healthbench), 2026\.Accessed 26 April 2026\.
- Jin et al\. \[2021\]Di Jin, Eileen Pan, Nassim Oufattole, Wei\-Hung Weng, Hanyi Fang, and Peter Szolovits\.What disease does this patient have? a large\-scale open domain question answering dataset from medical exams\.*Applied Sciences*, 11\(14\):6421, 2021\.doi:10\.3390/app11146421\.
- Xiong et al\. \[2024\]Guangzhi Xiong et al\.MIRAGE: Medical information retrieval\-augmented generation evaluation\.In*Findings of the Association for Computational Linguistics: ACL*, 2024\.
- Pampari et al\. \[2018\]Anusri Pampari, Preethi Raghavan, Jennifer Liang, and Jian Peng\.emrQA: A large corpus for question answering on electronic medical records\.In*Proceedings of the Conference on Empirical Methods in Natural Language Processing*, pages 2357–2368, 2018\.
- Chen et al\. \[2026\]Wei Chen et al\.Multi\-LLM KG\-RAG: End\-to\-end clinical knowledge graph construction\.*arXiv preprint arXiv:2601\.01844*, 2026\.
- Li et al\. \[2024\]Lang Li et al\.AutoRD: An automatic and end\-to\-end system for rare disease knowledge graph construction\.*JMIR Medical Informatics*, 12, 2024\.
- Mekhtieva et al\. \[2023\]Rakhilya Lee Mekhtieva, Brandon Forbes, Dalal Alrajeh, Brendan Delaney, and Alessandra Russo\.RECAP\-KG: Mining knowledge graphs from raw GP notes for remote COVID\-19 assessment in primary care\.*arXiv preprint arXiv:2306\.17175*, 2023\.
- Chapman et al\. \[2001\]Wendy W\. Chapman, Will Bridewell, Paul Hanbury, Gregory F\. Cooper, and Bruce G\. Buchanan\.A simple algorithm for identifying negated findings and diseases in discharge summaries\.*Journal of Biomedical Informatics*, 34\(5\):301–310, 2001\.
- Harkema et al\. \[2009\]Henk Harkema, John N\. Dowling, Tyler Thornblade, and Wendy W\. Chapman\.ConText: An algorithm for determining negation, experiencer, and temporal status from clinical reports\.*Journal of Biomedical Informatics*, 42\(5\):839–851, 2009\.
- Shannon \[1948\]Claude E\. Shannon\.A mathematical theory of communication\.*Bell System Technical Journal*, 27\(3\):379–423, 1948\.
- Jiang et al\. \[2025b\]Yixing Jiang, Kameron C\. Black, Gloria Geng, Danny Park, James Zou, Andrew Y\. Ng, and Jonathan H\. Chen\.MedAgentBench: A virtual EHR environment to benchmark medical LLM agents\.*NEJM AI*, 2\(9\), 2025b\.doi:10\.1056/AIdbp2500144\.
- Johnson et al\. \[2023\]Alistair E\.W\. Johnson, Lucas Bulgarelli, Lu Shen, et al\.MIMIC\-IV, a freely accessible electronic health record dataset\.*Scientific Data*, 10:1, 2023\.
- Efron and Tibshirani \[1993\]Bradley Efron and Robert J\. Tibshirani\.*An Introduction to the Bootstrap*\.Chapman and Hall/CRC, 1993\.
- Cameron and Miller \[2015\]A\. Colin Cameron and Douglas L\. Miller\.A practitioner’s guide to cluster\-robust inference\.*Journal of Human Resources*, 50\(2\):317–372, 2015\.
- McNemar \[1947\]Quinn McNemar\.Note on the sampling error of the difference between correlated proportions or percentages\.*Psychometrika*, 12\(2\):153–157, 1947\.
- Landis and Koch \[1977\]J\. Richard Landis and Gary G\. Koch\.The measurement of observer agreement for categorical data\.*Biometrics*, 33\(1\):159–174, 1977\.
- Berk et al\. \[2013\]Richard Berk, Lawrence Brown, Andreas Buja, Kai Zhang, and Linda Zhao\.Valid post\-selection inference\.*Annals of Statistics*, 41\(2\):802–837, 2013\.
- Bittar et al\. \[2023\]Andre Bittar, Sumithra Velupillai, Johnny Downs, Rosemary Sedgwick, and Rina Dutta\.Portability of natural language processing methods to detect suicidality from unstructured clinical text in US and UK electronic health records\.*Journal of the American Medical Informatics Association Open*, 6\(3\):ooad078, 2023\.doi:10\.1093/jamiaopen/ooad078\.
- Shah et al\. \[2024\]Nigam H\. Shah, John D\. Halamka, Suchi Saria, Michael Pencina, Troy Tazbaz, Micky Tripathi, Alison Callahan, Hailey Hildahl, and Brian Anderson\.A nationwide network of health ai assurance laboratories\.*JAMA*, 331\(3\):245–249, 2024\.doi:10\.1001/jama\.2023\.26930\.
- Embi \[2021\]Peter J\. Embi\.Algorithmovigilance—advancing methods to analyze and monitor artificial intelligence–driven health care for effectiveness and equity\.*JAMA Network Open*, 4\(4\):e214622, 2021\.doi:10\.1001/jamanetworkopen\.2021\.4622\.
- Stinard \[2026\]Alex Stinard\.\[dataset\] ClinicalBench: Assertion\-sensitive clinical question answering benchmark\.[https://huggingface\.co/datasets/alexstinard/epikg\-clinicalbench](https://huggingface.co/datasets/alexstinard/epikg-clinicalbench), 2026\.Accessed 25 April 2026\.
- Ceusters and Smith \[2015\]Werner Ceusters and Barry Smith\.Aboutness: Towards foundations for the information artifact ontology\.*International Conference on Biomedical Ontology*, 2015\.
- Snodgrass \[2000\]Richard T\. Snodgrass\.*Developing Time\-Oriented Database Applications in SQL*\.Morgan Kaufmann, 2000\.
- Allen \[1983\]James F\. Allen\.Maintaining knowledge about temporal intervals\.*Communications of the ACM*, 26\(11\):832–843, 1983\.
- Li et al\. \[2020\]Fei Li, Jianfu Hong, Cui Tao, et al\.TEO: A time event ontology for clinical narratives\.*Journal of the American Medical Informatics Association*, 27\(10\):1560–1568, 2020\.
- Huang et al\. \[2023\]Yan Huang, Xiaojin Li, and Guo\-Qiang Zhang\.Temporal cohort logic\.*AMIA Annual Symposium Proceedings*, 2022:1237–1246, 2023\.
- Lu et al\. \[2025\]Yifan Lu, Tianyu Fu, et al\.DoctorRAG: Medical RAG emulating doctor\-like reasoning\.In*Advances in Neural Information Processing Systems*, 2025\.
- Wang et al\. \[2025\]Yusheng Wang et al\.MedRAG: Enhancing medical diagnosis through retrieval\-augmented generation with knowledge graph\-elicited reasoning\.*Proceedings of The Web Conference*, 2025\.
- Thio et al\. \[2025\]Samuel Thio, Matthew Lewis, Spiros Denaxas, and Richard J\. B\. Dobson\.Unlocking electronic health records: A hybrid graph RAG approach to safe clinical AI for patient QA\.*arXiv preprint arXiv:2602\.00009*, 2025\.
- Peng et al\. \[2025\]Boci Peng, Yun Zhu, Yongchao Liu, Xiaohe Bo, Haizhou Shi, Chuntao Hong, Yan Zhang, and Siliang Tang\.Graph retrieval\-augmented generation: A survey\.*ACM Transactions on Information Systems*, 2025\.doi:10\.1145/3777378\.
- Cao et al\. \[2026\]Lang Cao, Qingyu Chen, and Yue Guo\.EHR\-RAG: Bridging long\-horizon structured electronic health records and large language models via enhanced retrieval\-augmented generation\.*arXiv preprint arXiv:2601\.21340*, 2026\.
- Gao et al\. \[2025\]Yanjun Gao, Ruizhe Li, Emma Croxford, John Caskey, Brian W\. Patterson, Matthew Churpek, Timothy Miller, Dmitriy Dligach, and Majid Afshar\.Leveraging medical knowledge graphs into large language models for diagnosis prediction: Design and application study\.*JMIR AI*, 4\(1\):e58670, 2025\.doi:10\.2196/58670\.
- Izacard et al\. \[2022\]Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave\.Unsupervised dense information retrieval with contrastive learning\.*Transactions on Machine Learning Research*, 2022\.
- Peng et al\. \[2018\]Yifan Peng, Xiaosong Wang, Le Lu, Mohammadhadi Bagheri, Ronald Summers, and Zhiyong Lu\.NegBio: A high\-performance tool for negation and uncertainty detection in radiology reports\.*AMIA Summits on Translational Science Proceedings*, 2018\.
## Appendix A Notation Summary
Table 5: Key notation used throughout the paper.
## Appendix B Formal Epistemic Preservation
This appendix formalizes the epistemic invariant maintained by the EpiKG pipeline, building on the principle that aboutness (the relationship between information artifacts and the entities they represent) must be preserved across transformations \[[34](https://arxiv.org/html/2605.11143#bib.bib34)\].
###### Definition 1 (Epistemic State).
For a clinical mention $m$, the *epistemic state* is the tuple $e(m)=(c,\alpha,\xi,\tau)$, where $c$ is the OMOP concept identifier, $\alpha\in\mathcal{A}$ is the 7-value assertion label (Eq. [1](https://arxiv.org/html/2605.11143#S3.E1)), $\xi\in\{\textsc{Patient},\textsc{Family}\}$ is the experiencer, and $\tau\in\{\textsc{Current},\textsc{Past},\textsc{Future}\}$ is the temporality. A pipeline $P$ *epistemically preserves* mention $m$ if $P(e(m))=e(m)$; that is, the epistemic state at the output of every pipeline stage is identical to the state at extraction.
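The definition can be read as an equality check on an immutable record. The sketch below is illustrative only: the class and function names, the string encoding of the assertion label, and the placeholder concept identifier are assumptions, not part of the EpiKG codebase.

```python
from dataclasses import dataclass
from enum import Enum


class Experiencer(Enum):
    PATIENT = "Patient"
    FAMILY = "Family"


class Temporality(Enum):
    CURRENT = "Current"
    PAST = "Past"
    FUTURE = "Future"


@dataclass(frozen=True)
class EpistemicState:
    """The tuple e(m) = (c, alpha, xi, tau) from Definition 1."""
    concept_id: int            # c: OMOP concept identifier
    assertion: str             # alpha: one of the 7 assertion classes (names assumed)
    experiencer: Experiencer   # xi
    temporality: Temporality   # tau


def preserves(stage, state):
    """True if a pipeline stage returns the epistemic state unchanged."""
    return stage(state) == state


def identity_stage(state):
    # A stage that passes the state through untouched.
    return state


def assertion_blind_stage(state):
    # Hypothetical lossy stage: forces every mention to "Present".
    return EpistemicState(state.concept_id, "Present",
                          state.experiencer, state.temporality)


# 12345 is a placeholder, not a real OMOP concept id.
negated = EpistemicState(12345, "Absent", Experiencer.PATIENT, Temporality.CURRENT)
print(preserves(identity_stage, negated))         # True
print(preserves(assertion_blind_stage, negated))  # False
```

Using a frozen dataclass makes the comparison field-by-field, so a change to any of the four components counts as a violation.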
###### Proposition 2 (Assertion Entropy Loss).
For concept $c$ in patient $\pi$’s record, let $A_c^{\pi}$ denote the assertion distribution with empirical frequencies $p(\alpha_i)=n_i/\sum_j n_j$ over the $|\mathcal{A}|$ assertion classes. The assertion entropy is

$$H(A_c^{\pi}) = -\sum_i p(\alpha_i)\log p(\alpha_i). \tag{2}$$

An assertion-blind pipeline collapses all mentions to $\alpha=\textsc{Present}$, yielding a degenerate distribution with $H=0$. The information loss is $\Delta H = H(A_c^{\pi}) \geq 0$ \[[22](https://arxiv.org/html/2605.11143#bib.bib22)\], with $\Delta H>0$ strictly whenever the mentions of $c$ span more than one assertion class.
###### Proof.
Under the assertion-blind mapping $\phi:\alpha_i\mapsto\textsc{Present}$ for all $i$, the output distribution assigns probability 1 to $\textsc{Present}$ and 0 to all other classes, so $H(\phi(A_c^{\pi}))=0$. By non-negativity of Shannon entropy, $\Delta H = H(A_c^{\pi})-0 = H(A_c^{\pi}) \geq 0$, with equality iff all mentions share a single assertion class (in particular, when all mentions already carry $\alpha=\textsc{Present}$). ∎
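As a small numerical illustration of Eq. 2, the sketch below computes the empirical assertion entropy in nats for a list of labels. The function names and the label strings other than Present are assumptions made for the example.

```python
import math
from collections import Counter


def assertion_entropy(labels):
    """Shannon entropy (nats) of the empirical assertion distribution."""
    counts = Counter(labels)
    total = sum(counts.values())
    # p * log(1/p) is the same quantity as -p * log(p), written so each term is >= 0.
    return sum((n / total) * math.log(total / n) for n in counts.values())


def entropy_loss(labels):
    """Delta H when every label is collapsed to 'Present' (collapsed entropy is 0)."""
    collapsed = ["Present"] * len(labels)
    return assertion_entropy(labels) - assertion_entropy(collapsed)


mixed = ["Present", "Absent", "Absent", "Possible"]
print(assertion_entropy(mixed))
print(entropy_loss(["Present"] * 4))  # 0.0
```

A record whose mentions all fall in one class has zero entropy, so the loss measured this way is positive only when the labels are mixed.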
###### Corollary 3 (Faithfulness Bound).
Let $f_{\textit{np}}(c)$ denote the fraction of mentions of concept $c$ carrying a non-present assertion ($\alpha\neq\textsc{Present}$). Without assertion labels, the maximum assertion-faithful accuracy for any downstream task conditioned on $c$ is bounded by $1-f_{\textit{np}}(c)$. In clinical records where negated and uncertain mentions are prevalent (e.g., “no pneumonia,” “possible CHF”), this bound can be substantially below 1.
###### Proof.
Without assertion labels, any predictor must assign a single assertion class to all mentions of $c$. Choosing $\textsc{Present}$ (the majority class in clinical text) yields accuracy $1-f_{\textit{np}}(c)$; the $f_{\textit{np}}(c)$ non-present mentions are necessarily misclassified. ∎
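The bound in Corollary 3 is a one-line computation once the labels are available. The following sketch uses hypothetical function names and made-up labels purely to show the arithmetic.

```python
def non_present_fraction(labels):
    """f_np(c): fraction of mentions whose assertion is not 'Present'."""
    if not labels:
        raise ValueError("at least one mention is required")
    return sum(1 for a in labels if a != "Present") / len(labels)


def faithfulness_bound(labels):
    """Accuracy reached by labelling every mention 'Present': 1 - f_np(c)."""
    return 1.0 - non_present_fraction(labels)


# Illustrative mentions of one concept: two present, one negated, one uncertain.
mentions = ["Present", "Absent", "Present", "Possible"]
print(faithfulness_bound(mentions))  # 0.5
```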
## Appendix C Extended System Design
### C.1 Temporal Knowledge Graph Model (Bi-Temporal Storage + NLP-Asserted Label)
Clinical events unfold across multiple time dimensions that prior systems model incompletely \[[12](https://arxiv.org/html/2605.11143#bib.bib12),[11](https://arxiv.org/html/2605.11143#bib.bib11)\]. Bi-temporal databases \[[35](https://arxiv.org/html/2605.11143#bib.bib35)\] distinguish valid time from transaction time; EpiKG’s per-edge representation adds an NLP-asserted temporality label $\tau_a$ and stores all three on every KG edge (data-modeling clarification below):
###### Definition 4 (Temporal Edge: bi-temporal storage with NLP-asserted label).
An edge $e=(s,p,o,\alpha,\boldsymbol{\tau}_v,\boldsymbol{\tau}_t,\tau_a,r,c)$ where:
- $s,o$ are source and target nodes; $p$ is the predicate (one of 24 edge types);
- $\alpha\in\mathcal{A}$ is the assertion label (Eq. [1](https://arxiv.org/html/2605.11143#S3.E1));
- $\boldsymbol{\tau}_v=(\textit{event\_date},\textit{valid\_from},\textit{valid\_to})$ is the *valid time* interval: when the relationship held in the real world;
- $\boldsymbol{\tau}_t=(\textit{recorded\_at},\textit{doc\_date},\textit{created\_at})$ is the *transaction time*: when it was recorded;
- $\tau_a\in\{\textsc{Current},\textsc{Past},\textsc{Future}\}$ is the *NLP-asserted temporality*;
- $r\in\mathcal{R}$ is an Allen interval algebra relation \[[36](https://arxiv.org/html/2605.11143#bib.bib36)\]; $c\in[0,1]$ is temporal confidence.

$\mathcal{R}$ comprises seven of Allen’s 13 canonical relations \[[36](https://arxiv.org/html/2605.11143#bib.bib36)\] (merging symmetric pairs like Before/Meets) plus Concurrent and Unknown (nine total; full mapping in Table [22](https://arxiv.org/html/2605.11143#A18.T22)). Unlike TEO \[[37](https://arxiv.org/html/2605.11143#bib.bib37)\] and TCL \[[38](https://arxiv.org/html/2605.11143#bib.bib38)\], which use Allen’s relations as annotation-ontology classes or modal operators, EpiKG stores $r$ and $c$ directly on materialized edges; this is intended to enable temporal-interval filtering during traversal, although the current intent-aware retrieval algorithm queries the categorical $\tau_a$ label rather than Allen-relation predicates (see clarification below).
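One way to picture the edge tuple in Definition 4 is as a typed record with two time bundles and a derived label. The sketch below is an assumption-laden illustration: the field types, the class names, and the validation rules are invented here, and only the field names inside the two time bundles come from the definition.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import Optional

ALLOWED_TEMPORALITY = {"Current", "Past", "Future"}


@dataclass(frozen=True)
class ValidTime:
    """tau_v: when the relationship held in the real world."""
    event_date: Optional[date]
    valid_from: Optional[date]
    valid_to: Optional[date]


@dataclass(frozen=True)
class TransactionTime:
    """tau_t: when the relationship was recorded."""
    recorded_at: Optional[datetime]
    doc_date: Optional[date]
    created_at: Optional[datetime]


@dataclass(frozen=True)
class TemporalEdge:
    source: str                  # s
    predicate: str               # p (one of the 24 edge types; names not given here)
    target: str                  # o
    assertion: str               # alpha
    valid_time: ValidTime        # tau_v
    transaction_time: TransactionTime  # tau_t
    asserted_temporality: str    # tau_a
    allen_relation: str          # r
    confidence: float            # c in [0, 1]

    def __post_init__(self):
        if self.asserted_temporality not in ALLOWED_TEMPORALITY:
            raise ValueError("tau_a must be Current, Past, or Future")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")


# Illustrative edge; node and predicate names are hypothetical.
edge = TemporalEdge(
    "patient:1", "has_condition", "concept:pneumonia", "Absent",
    ValidTime(date(2020, 1, 5), None, None),
    TransactionTime(datetime(2020, 1, 6, 9, 0), date(2020, 1, 6), None),
    "Current", "Unknown", 0.9,
)
```

Keeping the stored times and the NLP-derived label in separate fields mirrors the bi-temporal-plus-label reading of the model.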
#### Data\-modeling clarification \(bi\-temporal storage \+ derived label\)\.
What we describe as “tri-temporal” is more precisely a *bi-temporal storage layer* (valid time $\boldsymbol{\tau}_v$ + transaction time $\boldsymbol{\tau}_t$, in the Snodgrass tradition \[[35](https://arxiv.org/html/2605.11143#bib.bib35)\]; cf. Graphiti \[[11](https://arxiv.org/html/2605.11143#bib.bib11)\]) plus an *NLP-asserted temporality label* $\tau_a\in\{\textsc{Past},\textsc{Current},\textsc{Future}\}$. The label is derived from clinical NLP rather than from database time, so on data-modeling grounds the model is bi-temporal-with-derived-attribute, not strictly tri-temporal. Allen-style interval relations $r\in\mathcal{R}$ are stored on edges as metadata, but the current intent-aware retrieval algorithm queries the categorical $\tau_a$ label rather than Allen-relation predicates; using stored intervals in retrieval (e.g., to enforce Before/Overlaps constraints) is flagged as future work.
### C.2 Graph-Augmented Retrieval Details
Given a clinical question $q$ and patient $\pi$, the base retrieval pipeline: (1) extracts concepts from $q$ via NLP + OMOP enrichment; (2) traverses 2–3 hops via bidirectional breadth-first search (BFS) over patient KG edges and OMOP vocabulary relationships (20M+ edges) via PostgreSQL common table expressions (CTEs) (Appendix [L](https://arxiv.org/html/2605.11143#A12.SS0.SSS0.Px3)), pruning edges with $c<0.3$; (3) groups edges into four temporal views (event timeline, current state, historical, conflicts); (4) retrieves matching guideline sections; (5) scores edges by

$$\text{score}(e)=c(e)+\mathbb{1}[\text{type}(e)\in Q_{\text{rel}}]\cdot 0.2+\mathbb{1}[\tau_a(e)=\textsc{Current}]\cdot 0.1 \tag{3}$$

where the bonuses for question-relevant edge types ($0.2$) and current temporality ($0.1$) were set by manual tuning on a development set of 20 questions, and serializes the surviving subgraph into structured text preserving assertion labels (e.g., “Absent: pneumonia”); and (6) composes graph evidence, temporal context, guidelines, and source documents into a single prompt.
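The pruning threshold in step (2) and the scoring rule in Eq. 3 can be sketched together. This is a minimal illustration, assuming edges are plain dictionaries with hypothetical keys `confidence`, `type`, and `temporality`; the edge-type names are invented.

```python
def score_edge(confidence, edge_type, asserted_temporality, relevant_types):
    """Eq. 3: confidence plus 0.2 for a question-relevant type and 0.1 for Current."""
    score = confidence
    if edge_type in relevant_types:
        score += 0.2
    if asserted_temporality == "Current":
        score += 0.1
    return score


def rank_edges(edges, relevant_types, min_confidence=0.3):
    """Drop edges with c < 0.3, then sort the rest by descending score."""
    kept = [e for e in edges if e["confidence"] >= min_confidence]
    return sorted(
        kept,
        key=lambda e: score_edge(e["confidence"], e["type"],
                                 e["temporality"], relevant_types),
        reverse=True,
    )


# Hypothetical edges for illustration only.
edges = [
    {"confidence": 0.9, "type": "other", "temporality": "Past"},
    {"confidence": 0.75, "type": "has_medication", "temporality": "Current"},
    {"confidence": 0.2, "type": "has_medication", "temporality": "Current"},
]
ranked = rank_edges(edges, relevant_types={"has_medication"})
```

In this toy input the second edge outranks the first because both bonuses apply, while the third is pruned before scoring.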
## Appendix D Gap Analysis
Table 6: Capability comparison across representative systems. ✓ = supported, ∘ = partial, × = not supported.

| System | Patient KG | OMOP mapping | Assertion (7-class) | Bi-temp.+label | Assert.-aware RAG | Experiencer | Allen’s algebra | Multi-hop (≥3) |
|---|---|---|---|---|---|---|---|---|
| DoctorRAG \[[39](https://arxiv.org/html/2605.11143#bib.bib39)\] | × | × | × | × | × | × | × | × |
| GFM-RAG \[[8](https://arxiv.org/html/2605.11143#bib.bib8)\] | ∘ | × | × | × | × | × | × | ✓ |
| MedRAG \[[40](https://arxiv.org/html/2605.11143#bib.bib40)\] | ∘ | × | × | × | × | × | × | ∘ |
| KARE \[[9](https://arxiv.org/html/2605.11143#bib.bib9)\] | ∘ | × | × | × | ∘ | × | × | ✓ |
| Multi-LLM KG \[[17](https://arxiv.org/html/2605.11143#bib.bib17)\] | × | ✓ | ∘ | × | × | × | × | × |
| MedTKG \[[12](https://arxiv.org/html/2605.11143#bib.bib12)\] | ✓ | × | × | ∘ | × | × | × | ✓ |
| Graphiti \[[11](https://arxiv.org/html/2605.11143#bib.bib11)\] | × | × | × | ∘ | × | × | × | ✓ |
| MediGRAF \[[41](https://arxiv.org/html/2605.11143#bib.bib41)\] | ✓ | × | × | × | × | × | × | ✓ |
| EpiKG (this work) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ∘ |

Partial marks indicate limited coverage: GFM-RAG, MedRAG, and KARE build population-level or hierarchical KGs (not per-patient cross-admission graphs); KARE uses KG-augmented retrieval but without assertion awareness; Multi-LLM KG models uncertainty via entropy but not the full assertion taxonomy; MedTKG and Graphiti carry bi-temporal annotations but lack the assertion-aware retrieval of EpiKG. MediGRAF is the closest patient-level competitor, constructing per-patient medical graphs from clinical notes, but it does not preserve assertion status or temporal relations as edge attributes. EpiKG’s partial mark on multi-hop traversal ($\geq 3$ hops) reflects a deliberate architectural trade-off: PostgreSQL CTE-based traversal provides ACID compliance but degrades beyond 2 hops (Appendix [L](https://arxiv.org/html/2605.11143#A12.SS0.SSS0.Px3)).
### D.1 Extended Related Work Discussion
This subsection expands the four-axis related work overview in Section [1.1](https://arxiv.org/html/2605.11143#S1.SS1).
#### Medical RAG systems.
Graph-augmented retrieval has emerged as a leading paradigm for medical QA. GraphRAG \[[7](https://arxiv.org/html/2605.11143#bib.bib7)\] introduced community-based summarization; GFM-RAG \[[8](https://arxiv.org/html/2605.11143#bib.bib8)\] trained a graph foundation model across 60 KGs; KARE \[[9](https://arxiv.org/html/2605.11143#bib.bib9)\] adapted community retrieval to clinical decision support; Medical-Graph-RAG \[[10](https://arxiv.org/html/2605.11143#bib.bib10)\] links documents via triple graphs. However, no prior clinical KG-RAG system jointly propagates note-derived assertion classes together with bi-temporal storage plus an NLP-asserted temporality label as first-class graph properties through extraction, storage, and retrieval. GFM-RAG and KARE build population-level graphs rather than patient-level graphs from clinical text, and a recent survey \[[42](https://arxiv.org/html/2605.11143#bib.bib42)\] does not identify systems that carry epistemic metadata through the full retrieval stack.
#### Clinical QA benchmarks and systems.
MedPaLM 2 \[[1](https://arxiv.org/html/2605.11143#bib.bib1)\], Med-Gemini \[[2](https://arxiv.org/html/2605.11143#bib.bib2)\], and AMIE \[[3](https://arxiv.org/html/2605.11143#bib.bib3)\] target *medical knowledge recall* from parametric knowledge or literature. ClinicalBench targets a narrower task: *assertion-faithful cross-admission reasoning* over real EHR records, where the challenge is knowing whether *this patient* is still on metformin, whether a condition was ruled out or confirmed, and how the clinical picture changed across admissions. Prior EHR QA benchmarks (emrQA \[[16](https://arxiv.org/html/2605.11143#bib.bib16)\], EHR-RAG \[[43](https://arxiv.org/html/2605.11143#bib.bib43)\], MIRAGE \[[15](https://arxiv.org/html/2605.11143#bib.bib15)\]) evaluate medically grounded retrieval but not assertion-sensitive, cross-admission QA with category $\times$ condition ablations and physician adjudication.
#### Clinical KG construction.
Multi-LLM KG-RAG \[[17](https://arxiv.org/html/2605.11143#bib.bib17)\] uses multi-agent prompting with schema-constrained extraction for oncology; AutoRD \[[18](https://arxiv.org/html/2605.11143#bib.bib18)\] and RECAP-KG \[[19](https://arxiv.org/html/2605.11143#bib.bib19)\] apply LLMs to rare disease and GP notes respectively. All share a common limitation: assertion status is not propagated into the final graph.
#### Assertion detection.
NegEx \[[20](https://arxiv.org/html/2605.11143#bib.bib20)\] introduced trigger-based negation; ConText \[[21](https://arxiv.org/html/2605.11143#bib.bib21)\] extended it with temporality and experiencer; Gul et al. \[[5](https://arxiv.org/html/2605.11143#bib.bib5)\] fine-tuned LLMs to 0.962 accuracy on the i2b2/VA taxonomy \[[4](https://arxiv.org/html/2605.11143#bib.bib4)\]. All treat assertion detection as a terminal annotation task: labels are not carried into knowledge graphs or retrieval systems, and are not temporally situated across encounters.
#### Temporal knowledge graphs.
MedTKG \[[12](https://arxiv.org/html/2605.11143#bib.bib12)\] constructs temporal KGs with time-stamped snapshots (event time only); Graphiti \[[11](https://arxiv.org/html/2605.11143#bib.bib11)\] implements bitemporal edges but lacks clinical ontology alignment.
## Appendix E Safety Score
The safety score $S=1-\frac{1}{N}\sum_{i=1}^{N}w_i\cdot\mathbb{1}[\hat{y}_i\neq y_i]$ applies $w_i=2.0$ for false-positive assertion errors (reporting a negated or absent condition as present) and $w_i=1.0$ otherwise, capturing asymmetric clinical risk where acting on a falsely affirmed condition is more dangerous than missing a present one. The value $w=2.0$ is treated as a reasonable default; sensitivity to this choice is a direction for future work.
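A short sketch of the weighted score follows. It assumes predictions and references are assertion-label strings and treats any error where the prediction is Present but the reference is not as a false positive; the function name and the label encoding are assumptions made for the example.

```python
def safety_score(predicted, gold, fp_weight=2.0):
    """S = 1 - (1/N) * sum of w_i over errors, with w_i = fp_weight for false positives."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("predicted and gold must be non-empty and equal in length")
    penalty = 0.0
    for y_hat, y in zip(predicted, gold):
        if y_hat == y:
            continue
        # False positive: a non-present condition reported as present.
        false_positive = (y_hat == "Present" and y != "Present")
        penalty += fp_weight if false_positive else 1.0
    return 1.0 - penalty / len(gold)


predicted = ["Present", "Absent", "Present", "Present"]
gold = ["Present", "Present", "Absent", "Present"]
print(safety_score(predicted, gold))  # 0.25
```

Because false positives carry weight 2, the score can fall below zero when more than half of the items are false positives; under this formula that is expected behaviour rather than a bug.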
## Appendix F Cross-Model Per-Category Results
Table 7: ClinicalBench per-category accuracy (%) by model under C1 and C4g (intent-aware KG-RAG, evaluator v2) for Opus, MedGemma 27B, GPT-OSS, Gemma 4 31B, MedGemma 1.5 4B, and Qwen3.5. These per-category improvements are descriptive; several categories are small, and the change category has known label/evaluator defects discussed in Section [4.6](https://arxiv.org/html/2605.11143#S4.SS6).
## Appendix G MedGemma Full Per-Category Results
Table 8: MedGemma 27B ClinicalBench per-category accuracy (%, keyword evaluator v2) for three conditions. Per-category changes are descriptive; several categories are small, and the change category has known label/evaluator defects discussed in Section [4.6](https://arxiv.org/html/2605.11143#S4.SS6).
### G.1 MedGemma 1.5 4B Full Per-Category Results
Table 9: MedGemma 1.5 4B ClinicalBench per-category accuracy (%, keyword evaluator v2) for three conditions: C1 (LLM alone), C4g keyword-only classification, and C4g oracle classification. Keyword$\to$oracle delta: $+3.0\,\text{pp}$ overall, concentrated in current state ($+18\,\text{pp}$) and family history ($+10\,\text{pp}$); all other categories are identical between oracle and keyword. Per-category changes are descriptive; several categories are small, and the change category has known label/evaluator defects discussed in Section [4.6](https://arxiv.org/html/2605.11143#S4.SS6).
## Appendix H Experiencer Attribute Propagation Ablation
The experiencer attribute distinguishes patient conditions from family member conditions; without it, the graph conflates the two, causing family history misattribution.
Table 10: Impact of experiencer attribute propagation on Opus C4 (400 questions, prior evaluator run). Categories with no change omitted.

The fix improved family history by $+10.0\,\text{pp}$ with zero regressions on guard categories (negation, conditional, duration, sequence all unchanged), confirming that the experiencer attribute is load-bearing for assertion-sensitive reasoning.
## Appendix I SliceBench
SliceBench (6 patients, 144 questions, 3 complexity tiers; LLM-as-judge evaluation) is consistent with a complexity-dependent KG effect: the incremental KG layer (B2$\rightarrow$B3) contributes $+2.2\,\text{pp}$ overall (CI: [-1.5pp, +5.9pp]), not reaching aggregate significance, while Tier C (15+ encounters) gains $+5.0\,\text{pp}$ vs. Tier A (1–2 notes) $+0.6\,\text{pp}$ (Table [14](https://arxiv.org/html/2605.11143#A13.T14)). Because the overall B2$\rightarrow$B3 comparison is not statistically distinguishable from zero and tier-level results are descriptive only ($n=2$ patients per tier), this pattern is treated as exploratory.

Figure 6: SliceBench exploratory results ($n=2$ patients per tier, 24 questions each). Tier-specific B2$\rightarrow$B3 deltas are point estimates only and should not be over-interpreted; the overall B2$\rightarrow$B3 delta is $+2.2\,\text{pp}$ (95% CI [-1.5pp, +5.9pp]).
## Appendix J Per-Category Delta Chart and Transition Analysis
Figure 7: Per-category deltas for three paired contrasts (C3 vs. C4, C4 vs. C4g, and C1 vs. C4g). These contrasts are descriptive, not a formal additive decomposition, and categories with small $n$ or known evaluator/reference-answer defects are shown for transparency only.

Table [11](https://arxiv.org/html/2605.11143#A10.T11) reports question-level transitions between C3 (KG-RAG without assertions) and C4 (with assertions, no routing), with C4g recovery rates.

Table 11: Per-category C3$\to$C4 transition analysis (Claude Opus 4.6, $n=400$). *Regr.*: C3 correct $\to$ C4 incorrect; *Impr.*: C3 incorrect $\to$ C4 correct; *Recov.*: regressions recovered by C4g.

| Category | $n$ | C3 | C4 | Regr. | Impr. | Recov. |
|---|---|---|---|---|---|---|
| *Temporal categories (net $-52$):* | | | | | | |
| Current state | 50 | 54% | 30% | 20 | 8 | 17 |
| Sequence | 40 | 55% | 10% | 19 | 1 | 19 |
| Historical | 50 | 42% | 12% | 18 | 3 | 15 |
| Duration | 30 | 57% | 47% | 10 | 7 | 8 |
| Change | 30 | 20% | 7% | 6 | 2 | 5 |
| *Assertion-sensitive categories (net $+37$):* | | | | | | |
| Negation | 110 | 66% | 89% | 6 | 31 | 5 |
| Family history | 30 | 43% | 50% | 8 | 10 | 7 |
| Uncertainty | 40 | 38% | 52% | 7 | 13 | 6 |
| Conditional | 20 | 30% | 50% | 1 | 5 | 1 |
| Total | 400 | 50% | 46% | 95 | 80 | 83 |

Overall, 87.4% of regressions (83/95) are recovered by C4g. Temporal categories account for 73/95 regressions but only 21/80 improvements; assertion-sensitive categories show the reverse pattern (22/95 regressions, 59/80 improvements). This confirms that assertions help on epistemic questions but suppress needed evidence for temporal synthesis without intent-matched routing.
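The transition counts in Table 11 follow from per-question correctness flags under the three conditions. The sketch below shows one way to derive them; the function name and the toy flags are assumptions, and only the final arithmetic check uses numbers reported in the table.

```python
def transition_counts(c3_correct, c4_correct, c4g_correct):
    """Count regressions, improvements, and regressions recovered by C4g."""
    regressions = [i for i, (a, b) in enumerate(zip(c3_correct, c4_correct))
                   if a and not b]
    improvements = [i for i, (a, b) in enumerate(zip(c3_correct, c4_correct))
                    if not a and b]
    recovered = [i for i in regressions if c4g_correct[i]]
    return len(regressions), len(improvements), len(recovered)


# Toy flags for five questions (illustrative, not benchmark data).
c3 = [True, True, False, False, True]
c4 = [False, True, True, False, False]
c4g = [True, True, True, False, False]
regr, impr, recov = transition_counts(c3, c4, c4g)
print(regr, impr, recov)  # 2 1 1

# Recovery rate from the reported totals: 83 of 95 regressions.
print(round(100 * 83 / 95, 1))  # 87.4
```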
#### Qualitative examples\.
C4 regressions exhibit a consistent pattern: the evidence context includes raw assertion summaries that distract the LLM from the question’s intent\. For family\-history questions, C4 states “The patient has Lymphoma” \(leaking family\-history findings to patient state\); for sequence questions, C4 lists assertion types instead of temporal ordering; for historical questions, C4 notes “history of depression” and concludes it is no longer active\. In each case, C4g’s intent\-aware retrieval eliminates the distraction by routing to category\-specific evidence \(cross\-admission comparison, timeline traversal, or current\-state filtering\)\.
## Appendix K SliceBench Conditions
SliceBench selects patients in three tiers: Tier A (2 patients, 1–2 encounters), Tier B (2 patients, 5–10), and Tier C (2 patients, 15+). Each patient receives 24 questions spanning hard cross-admission categories (cross-encounter medication timelines, problem list reconciliation, causal chain tracing), yielding 144 total.
Table 12: SliceBench conditions. B0–B4 form a monotone context progression.
## Appendix L System Implementation Details
#### LLM baseline.
On MedQA-USMLE (965 questions), Claude Opus 4.5 achieves 81.6% accuracy (5.1 pp below GPT-4 at 86.7% and 4.9 pp below Med-PaLM 2 at 86.5% \[[23](https://arxiv.org/html/2605.11143#bib.bib23)\]), establishing that the Claude model family, comprising Sonnet 4.5 (SliceBench answerer) and Opus 4.6 (ClinicalBench primary, SliceBench judge), provides a reasonable LLM baseline (Table [13](https://arxiv.org/html/2605.11143#A12.T13)).
Table 13: MedQA-USMLE results. The LLM-alone baseline is competitive.
#### KG scale and traversal latency\.
The deployed system processes 145 documents across 85 patients, materializing 3,100 KG nodes and 8,803 edges\. Graph traversal latency is sub\-millisecond at this scale: 0\.57 ms \(1\-hop\), 0\.75 ms \(2\-hop\)\. The full system integrates 201 clinical calculators, 1,202 guideline sections, 20M\+ OMOP vocabulary relationships, 122 assertion trigger patterns, and 24 edge types across 13 node types\.
#### Multi\-hop traversal\.
On the DR\.KNOWS diagnostic reasoning benchmark\[[44](https://arxiv.org/html/2605.11143#bib.bib44)\], which measures KG traversal accuracy using PostgreSQL\-backed clinical data, EpiKG achieves 0\.420 overall \(50% at 1\-hop, 25% at 2\-hop, 0% at 3\-hop\)\. Multi\-hop degradation reflects a deliberate architectural trade\-off: PostgreSQL CTE\-based traversal provides ACID compliance but scales poorly beyond 2 hops\.
## Appendix M SliceBench Tier-Stratified Results
Table 14: SliceBench results stratified by patient complexity tier. B2$\rightarrow$B3 $\Delta$ shows the incremental KG contribution.
## Appendix N ClinicalBench Full Per-Category Results (Opus)
Table 15: Complete ClinicalBench per-category accuracy (%, Claude Opus 4.6, keyword evaluator v2) for all conditions, computed over the full 400-item set. The change-excluded keyword endpoint (sensitivity comparator; the primary endpoint is the leave-author-out McNemar in Section [4.7](https://arxiv.org/html/2605.11143#S4.SS7.SSS0.Px1)) excludes the 30 change-category items *and* 8 experiencer-attribution-defective items (Appendix [P.2](https://arxiv.org/html/2605.11143#A16.SS2)), yielding $n=362$; per-category counts above are for the full set. C2b (Contriever dense retrieval) scores comparably to C2 (TF-IDF), confirming the retrieval method does not drive the C2$\to$C4g gap. C4 adds assertion metadata to C3 without intent routing; it helps assertion-sensitive categories (negation: $+22.7\,\text{pp}$) but hurts temporal categories (historical: $-30\,\text{pp}$, sequence: $-45\,\text{pp}$). Only with intent routing (C4g) does the full system reach 68.5%. †C7 is reported as “— (evaluator artifact, semantic 0%)” in Table [2](https://arxiv.org/html/2605.11143#S4.T2); raw 27.2% reflects template refusals coincidentally matching negation keywords (see Section [4.1](https://arxiv.org/html/2605.11143#S4.SS1)).
### N.1 Extension Condition Per-Category Results
Table 16: Per-category accuracy (%) for extension conditions ($n=240$, frozen evaluator + canonical questions_v2 gold). C1b: discharge summary only. C4g+full: intent-aware KG-RAG with all clinical notes. The extension subset spans 4 categories: negation ($n=110$), current state ($n=50$), historical ($n=50$), duration ($n=30$).

All four models gain on the extension endpoint, with duration showing the largest gain across architectures (Opus $+10\,\text{pp}$, Qwen3.5 $+36.7\,\text{pp}$, GPT-OSS $+43.4\,\text{pp}$, MedGemma $+40\,\text{pp}$), indicating cross-admission temporal recall benefits substantially from structured graph retrieval. Opus and Qwen3.5 also gain on historical (Opus $+22\,\text{pp}$, Qwen3.5 $+20\,\text{pp}$); GPT-OSS shows a slight decline on historical ($-8\,\text{pp}$) and current state ($-12\,\text{pp}$), consistent with smaller-model difficulty stably extracting cross-admission state. The reference-answer defects in historical (Section [4.6](https://arxiv.org/html/2605.11143#S4.SS6)) attenuate measured gains on this category across all four models.
## Appendix O Physician Adjudication Protocol and Summary
#### Design.
A single board-certified emergency physician (A.S.) independently scored ClinicalBench answers under two conditions (C1 and C4g) using a five-dimensional rubric (reference-answer correctness, model-answer correctness, auto-score fairness, clinical safety, clinical utility). The reviewer was blinded to condition assignment (conditions randomized as A/B), so this should be interpreted as a blinded internal expert audit rather than independent external validation. The paired adjudication covers 120 unique questions $\times$ 2 conditions. Item-level reference-answer and safety summaries use the full adjudication record set, which includes one repeated blinded-condition review and therefore sums to 241 records. Free-text physician notes were recorded for 165/241 items (68.5%). Reviews were conducted using a custom scoring interface with access to full de-identified MIMIC-IV discharge summaries. Condition assignment was randomized and approximately balanced across blinded labels; this verifies allocation balance, not successful deblinding prevention, because C4g outputs can be more structured than C1 outputs.
#### Endpoints\.
\(1\) Human–keyword evaluator concordance rate \(fraction of automated scores confirmed by physician\)\. \(2\) Physician\-rated C4g vs\. C1 accuracy, safety, and utility\. \(3\) Reference\-answer error rate and defect taxonomy\.
#### Reference\-answer error analysis\.
Of the 241 audited records, only 44% of v2 reference answers were rated fully correct; 29\.5% were outright wrong and 19\.5% needed revision\. Errors are concentrated in change \(0% correct—NLP conflated inpatient orders with discharge medications\), historical \(56\.7% wrong—“history of X” misclassified as resolved\), and uncertainty \(37\.5% wrong—causal uncertainty conflated with existential uncertainty\)\. Because the detected defects are question\-level rather than condition\-specific, they are less likely to reverse the direction of the C1–C4g comparison\. They do, however, materially affect absolute accuracies and some category\-level magnitudes, especially for change and historical questions\.
Table 17: Reference-answer quality by category (physician adjudication of v2 reference set). Defect rate = fraction of reference answers rated less than fully correct. Categories with the highest defect rates drive most evaluator disagreements.
#### Auto\-evaluator concordance\.
The keyword evaluator agreed with physician judgment in 54.2% of cases ($n=240$). When it disagreed, it was overwhelmingly too strict: 97 false negatives (40.4%) vs. 13 false positives (5.4%), a 7.5:1 strict-to-lenient ratio. The majority of evaluator errors (63% of false negatives) trace to reference-answer errors rather than evaluator logic.
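The concordance figures above are mutually consistent, which a few lines of arithmetic confirm. The sketch assumes every disagreement is either a false negative or a false positive; the function name is hypothetical.

```python
def concordance_summary(n, false_negatives, false_positives):
    """Agreement rate, error rates, and strict-to-lenient ratio for a binary evaluator."""
    agreements = n - false_negatives - false_positives
    return {
        "agreement": agreements / n,
        "false_negative_rate": false_negatives / n,
        "false_positive_rate": false_positives / n,
        "strict_to_lenient": false_negatives / false_positives,
    }


# Counts reported in the text: n = 240, 97 false negatives, 13 false positives.
summary = concordance_summary(240, 97, 13)
print({k: round(v, 3) for k, v in summary.items()})
```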
#### Disclosure and limitations\.
The reviewing physician (A.S.) is the author and system designer (see Section [3.8](https://arxiv.org/html/2605.11143#S3.SS8) for primary disclosure). Single-reviewer design limits inter-rater reliability assessment; three-rater external validation results (including two independent external physicians) are reported in Section [4.7](https://arxiv.org/html/2605.11143#S4.SS7).
#### Status.
This protocol was designed before full result synthesis. Full adjudication results ($n=120$ paired) are reported in Section [4.6](https://arxiv.org/html/2605.11143#S4.SS6); complete per-category results in Appendix [X](https://arxiv.org/html/2605.11143#A24).
## Appendix P Reproducibility Details
#### Models.
ClinicalBench primary ablation: Claude Opus 4.6 (claude-opus-4-6, keyword evaluator v2). Cross-model: MedGemma 27B (alibayram/medgemma:27b, 4-bit GGUF), GPT-OSS 20B (4-bit GGUF), Qwen3.5 35B, Gemma 4 31B (gemma4:31b, 4-bit GGUF, `num_predict=2048`), and MedGemma 1.5 4B (alibayram/medgemma15:4b, 4-bit GGUF, `num_predict=2048`, with `<unused94>`/`<unused95>` stop tokens) via Ollama; same evaluator. SliceBench: Claude Sonnet 4.5 (claude-sonnet-4-5-20250929) answers, Opus 4.6 judges. Temperature 0 throughout; 4-bit quantization introduces GPU non-determinism ($\pm 10\,\text{pp}$ run-to-run for MedGemma), controlled via within-run paired comparisons.
#### Evaluation\.
The deterministic keyword evaluator performs exact word-boundary matching (regex `\bKEYWORD\b`) against reference-answer assertion and temporal keywords. Per-category keyword sets: negation (“no,” “denies,” “absent,” “negative”), uncertainty (“possible,” “suspected,” “may,” “likely,” “consider”), temporal (“before,” “after,” “during,” “changed,” “new”), plus category-specific terms. Evidence preambles (echoed graph context) are stripped before matching.
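Word-boundary matching of this kind can be sketched with Python's `re` module. The helper below is illustrative: its name, the case-insensitive flag, and the way matches are returned are assumptions, and preamble stripping is not shown.

```python
import re

NEGATION_KEYWORDS = ["no", "denies", "absent", "negative"]


def matched_keywords(answer, keywords):
    """Return the keywords that occur in the answer as whole words."""
    hits = []
    for keyword in keywords:
        # re.escape guards against keywords containing regex metacharacters.
        pattern = r"\b" + re.escape(keyword) + r"\b"
        if re.search(pattern, answer, flags=re.IGNORECASE):
            hits.append(keyword)
    return hits


print(matched_keywords("Patient denies chest pain.", NEGATION_KEYWORDS))   # ['denies']
print(matched_keywords("A benign nodule was noted.", NEGATION_KEYWORDS))   # []
```

The second call shows why the boundary anchors matter: "nodule" and "noted" both begin with "no" but do not count as the keyword.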
#### Retrieval\.
C2’s vanilla RAG uses TF-IDF over chunked clinical notes (512-token chunks, 64-token overlap), retrieving the top-5 chunks by cosine similarity. C2b replaces TF-IDF with Contriever \[[45](https://arxiv.org/html/2605.11143#bib.bib45)\] dense embeddings (256-token chunks, 64-token overlap, top-$k$ up to 6,000 characters); C2b scores 50.8% ($\Delta=-1.2\,\text{pp}$ vs. C2; $p=0.62$), confirming the retrieval method does not explain the C2$\to$C4g gap. These are the closest analogs to existing medical RAG systems (DoctorRAG, MedRAG) on patient-level cross-admission data. ClinicalBench conditions map to SliceBench: C1 $\approx$ B0, C2 $\approx$ B2, C4g $\approx$ B3, C4g+full $\approx$ B4.
#### Bootstrap\.
$n=2{,}000$ resamples, seed 42, BCa method \[[25](https://arxiv.org/html/2605.11143#bib.bib25)\]. Patient-level cluster bootstrap (resampling the 43 patients with replacement, including all their questions per draw) is the inferential method for keyword-endpoint CIs. Question-level resampling is reported as a secondary sensitivity analysis. (Caveat: with $n=43$ clusters this is at the low end of cluster-bootstrap reliability; cf. Cameron & Miller \[[26](https://arxiv.org/html/2605.11143#bib.bib26)\]; the substantive primary endpoint is the leave-author-out paired exact McNemar in Section [4.7](https://arxiv.org/html/2605.11143#S4.SS7.SSS0.Px1), which does not depend on cluster bootstrap.) Patient-level CIs are slightly wider but all reported endpoints remain significant (Table [18](https://arxiv.org/html/2605.11143#A16.T18)).
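The patient-level resampling scheme can be sketched as follows. This is a simplified illustration that uses a plain percentile interval rather than the BCa correction named above, and the function name, data layout, and toy values are assumptions.

```python
import random


def cluster_bootstrap_ci(deltas_by_patient, n_resamples=2000, seed=42, alpha=0.05):
    """Percentile CI for the mean per-question delta, resampling whole patients.

    deltas_by_patient maps a patient id to that patient's per-question
    paired differences (for example +1, 0, or -1).
    """
    rng = random.Random(seed)
    patients = list(deltas_by_patient)
    stats = []
    for _ in range(n_resamples):
        # Draw patients with replacement; each draw brings all of its questions.
        draw = [rng.choice(patients) for _ in patients]
        values = [d for p in draw for d in deltas_by_patient[p]]
        stats.append(sum(values) / len(values))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Toy data for three patients (illustrative, not benchmark results).
toy = {"p1": [1, 0, 1], "p2": [0, 0], "p3": [1, 1, 1, 1]}
print(cluster_bootstrap_ci(toy, n_resamples=200))
```

Resampling at the patient level keeps correlated questions from the same chart together, which is why such intervals tend to be wider than question-level ones.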
#### McNemar’s test\.
Bootstrap CIs are supplemented with McNemar’s test for paired nominal data, comparing discordant pairs (C1 wrong/C4g right vs. C1 right/C4g wrong). All three pairwise comparisons are significant: C1 vs. C4g ($\chi^2=155.8$, $p<10^{-6}$; discordant: 18 vs. 204), C1 vs. C3 ($\chi^2=73.8$, $p<10^{-6}$; 30 vs. 143), and C3 vs. C4g ($\chi^2=47.2$, $p<10^{-10}$; 20 vs. 93). On the hard cross-admission subset (secondary endpoint, $n=122$: change $\cup$ current_state $\cup$ historical, after excluding the 8 experiencer-attribution-defective items in Appendix [P.2](https://arxiv.org/html/2605.11143#A16.SS2)), C4g$_{\text{oracle}}$ vs. C1 yields $\Delta=+57.4\,\text{pp}$, $p<10^{-15}$.
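The reported $\chi^2$ values can be reproduced from the discordant counts with the uncorrected McNemar statistic $(b-c)^2/(b+c)$. The sketch below also includes a two-sided exact (binomial) variant; the function names are assumptions, and the exact version is shown for illustration rather than as the paper's code.

```python
import math


def mcnemar_chi2(b, c):
    """Uncorrected McNemar statistic from the two discordant counts."""
    return (b - c) ** 2 / (b + c)


def mcnemar_exact_p(b, c):
    """Two-sided exact McNemar p-value: binomial test with p = 0.5 on b + c pairs."""
    n = b + c
    k = min(b, c)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Discordant counts reported in the text.
for label, b, c in [("C1 vs. C4g", 18, 204),
                    ("C1 vs. C3", 30, 143),
                    ("C3 vs. C4g", 20, 93)]:
    print(label, round(mcnemar_chi2(b, c), 1))
```

The exact form avoids the chi-square approximation, which matters mainly when the number of discordant pairs is small.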
#### BH\-FDR adjustment for cross\-model contrasts\.
Across the six cross\-model C4g$_{\text{oracle}}$ vs\. C1 McNemar tests reported in Table [18](https://arxiv.org/html/2605.11143#A16.T18) and Section [4\.4](https://arxiv.org/html/2605.11143#S4.SS4), Benjamini–Hochberg \(BH\) false\-discovery\-rate adjustment yields the following q\-values: Opus $q=1.84\times 10^{-32}$, GPT\-OSS $q=3.14\times 10^{-27}$, MedGemma 27B $q=1.22\times 10^{-17}$, Gemma 4 31B $q=3.81\times 10^{-17}$, Qwen3\.5 35B $q=2.00\times 10^{-11}$, MedGemma 1\.5 4B $q=9.32\times 10^{-11}$; all $q<10^{-10}$\. The more conservative Benjamini–Yekutieli \(2001\) procedure \(which holds under arbitrary dependence, including PRDS violations of paired McNemars on overlapping items\) yields equivalent conclusions: the BY adjustment factor is $2.45\times$ over BH for $m=6$ cross\-model contrasts, with all BY q\-values $<2.5\times 10^{-10}$\.
Table 18: Statistical summary: patient\-level \(cluster\) and question\-level BCa bootstrap 95% CIs for the change\-excluded keyword sensitivity endpoint and other secondary contrasts\. The author\-blind primary endpoint is the leave\-author\-out paired exact McNemar in Section [4\.7](https://arxiv.org/html/2605.11143#S4.SS7.SSS0.Px1); the keyword endpoint is change\-excluded \($n=362$, after excluding the 8 experiencer\-attribution\-defective items in Appendix [P\.2](https://arxiv.org/html/2605.11143#A16.SS2)\) keyword C4g vs\. C1\.
#### Data\.
De\-identified MIMIC\-IV clinical notes accessed under a PhysioNet Credentialed Health Data Use Agreement \(v3\.1\)\.
#### PhysioNet DUA compliance\.
The HuggingFace release of the ClinicalBench artifact \(DOI [10\.57967/hf/8549](https://doi.org/10.57967/hf/8549)\) contains: \(a\) question text \(authored from MIMIC\-IV charts but rephrased; not raw note text\), \(b\) reference answers \(paraphrased clinical findings; not direct note quotations\), and \(c\) raw model predictions \(commit da5f5b1 stripped raw note excerpts from adjudication items prior to public release\)\. MIMIC\-IV note text remains exclusively under PhysioNet credentialed access\. Any user replicating the system requires separate PhysioNet authorization\.
#### Released artifacts\.
ClinicalBench questions, reference answers \(v1 and v2\), raw model predictions for all evaluated conditions, the deterministic keyword evaluator, and physician adjudication data are publicly released \(Hugging Face DOI[10\.57967/hf/8549](https://doi.org/10.57967/hf/8549)\)\. The full EpiKG application stack \(graph construction, intent\-aware routing implementation, retrieval algorithm\) is NOT released in this work\. Application\-code release is planned for follow\-on work; reviewers and downstream users should treat the system contribution here as a probe rather than a reproducible architectural artifact\.
#### Compute\.
ClinicalBench Opus C1/C3/C4g: $\sim$22 min each \(API\); Opus C6: $\sim$3\.5 h \(API\); C7: $<1$ min \(no LLM\); MedGemma 27B C1–C4g: $\sim$3\.6 h \(single GPU\); MedGemma 27B C6: $\sim$1\.5 h; Gemma 4 31B C1/C4g: $\sim$10 h \(Apple Silicon, Ollama\); MedGemma 1\.5 4B C1/C4g: $\sim$1\.5 h \(Apple Silicon, Ollama\); SliceBench: $\sim$2 h \(API\)\.
#### Evaluator evolution\.
The keyword evaluator underwent three versions: v0 \(substring matching\), v1 \(word\-boundary matching \+ evidence preamble stripping\), and v2 \(\+ abstention detection gate \+ domain\-specific keyword requirements for sequence and change\)\. The v1 evaluator awarded false positives when models responded with “insufficient information in the notes” because negation and temporal keywords in the refusal text matched reference\-answer patterns\. The v2 evaluator adds an abstention detection layer: answers matching abstention patterns \(e\.g\., “cannot determine,” “not mentioned,” “insufficient evidence”\) are scored as incorrect unless they contain clinical claim patterns \(e\.g\., “patient does not,” “denies”\)\. Additionally, v2 requires sequence answers to contain ordering keywords \(“first,” “then,” “before”\) and change answers to contain change keywords \(“added,” “removed,” “discontinued”\), preventing term\-overlap\-only false positives\. The duration category’s minimum\-0\.5 score floor for matching duration keywords was removed\. The v1 $\to$ v2 transition reduced C1 accuracy \(from $\sim$50% to 21\.8% for Opus\) by correctly classifying abstention responses as incorrect; C4g accuracy decreased modestly \(from $\sim$76% to 68\.5%\) because the model abstains less frequently when retrieval context is provided\. All main\-text numbers use v2\.
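The two v2 gates can be illustrated with a short sketch\. It is not the released evaluator: the pattern lists contain only the examples quoted above, and the names passes\_v2\_gates, ABSTENTION, CLAIM, and REQUIRED are hypothetical\. The function only decides whether an answer is still eligible for keyword scoring\.

```python
import re

# Example patterns quoted in the text; the real lists are assumed to be longer.
ABSTENTION = ["cannot determine", "not mentioned", "insufficient evidence"]
CLAIM = ["patient does not", "denies"]
REQUIRED = {
    "sequence": ["first", "then", "before"],
    "change": ["added", "removed", "discontinued"],
}


def _has_match(text, patterns):
    """Word-boundary match of any pattern against lower-cased text."""
    return any(re.search(r"\b" + re.escape(p) + r"\b", text) for p in patterns)


def passes_v2_gates(answer, category):
    """Return False if the answer should be scored incorrect outright."""
    text = answer.lower()
    # Gate 1: abstentions fail unless they also make a clinical claim.
    if _has_match(text, ABSTENTION) and not _has_match(text, CLAIM):
        return False
    # Gate 2: sequence/change answers must contain a category keyword.
    required = REQUIRED.get(category)
    if required is not None and not _has_match(text, required):
        return False
    return True
```

An answer such as “Cannot determine from the notes” is rejected by the first gate, while “Cannot determine onset, but the patient denies chest pain” passes because it carries a clinical claim\.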
#### Reproducibility package\.
A standalone reproducibility package is included in the supplementary materials \(epikg\-benchmark/\) and available at [https://huggingface\.co/datasets/alexstinard/epikg\-clinicalbench](https://huggingface.co/datasets/alexstinard/epikg-clinicalbench)\. It contains all 400 ClinicalBench questions with reference answers, raw model predictions for all LLM\-based conditions \(Opus C1/C2/C3/C4/C4g/C6, MedGemma 27B C1/C4g, GPT\-OSS C1/C4g, Qwen C1/C4g, Gemma 4 31B C1/C4g, MedGemma 1\.5 4B C1/C4g\), scored outputs for the deterministic baseline \(C7\), and the keyword evaluator v2 with abstention detection\. The evaluator itself requires only Python 3\.10\+ with no external dependencies; reproduce\.py additionally requires NumPy and SciPy for bootstrap CIs\. All ClinicalBench accuracy numbers in Tables [2](https://arxiv.org/html/2605.11143#S4.T2), [7](https://arxiv.org/html/2605.11143#A6.T7), and [3](https://arxiv.org/html/2605.11143#S4.T3) are reproducible from this package\. MIMIC\-IV patient identifiers are included so that PhysioNet\-credentialed reviewers can trace predictions back to source clinical notes\. The dataset is published with Croissant metadata for machine\-readable dataset discovery\.
#### Results provenance\.
Table [19](https://arxiv.org/html/2605.11143#A16.T19) maps each reported result to its source checkpoint file\. MedGemma 27B C4g has 2 empty predictions \(timeouts\); these are scored as incorrect \($n=400$\)\. All cross\-model C1 and C4g results \(Table [3](https://arxiv.org/html/2605.11143#S4.T3)\) are from the same system snapshot \(February 2026\), except Gemma 4 31B \(April 2026, added after the initial cross\-model batch\) and MedGemma 1\.5 4B \(April 2026, after fixing a checkpoint truncation bug and adding <unused94\>/<unused95\> Gemma 3 stop tokens in the Ollama Modelfile\); intermediate ablation conditions \(C2/C3/C4/C6\) for non\-Opus models were collected in a later batch \(March 2026\) after system updates and are included in the reproducibility package for completeness but are not used in any paper table\.
Table 19: Results provenance: checkpoint files for each condition $\times$ model combination\. All scored with keyword evaluator v2\.

| Condition | Model | Checkpoint file | $n$ |
| --- | --- | --- | --- |
| C1 | Opus 4\.6 | opus/C1\_llm\_alone\.jsonl | 400 |
| C2 | Opus 4\.6 | opus/C2\_vanilla\_rag\.jsonl | 400 |
| C2b | Opus 4\.6 | opus/C2b\_dense\_rag\.jsonl | 400 |
| C3 | Opus 4\.6 | opus/C3\_kg\_rag\.jsonl | 400 |
| C4 | Opus 4\.6 | opus/C4\_epistemic\_kg\_rag\.jsonl | 400 |
| C4g | Opus 4\.6 | opus/C4g\_intent\_aware\.jsonl | 400 |
| C6 | Opus 4\.6 | opus/C6\_long\_context\.jsonl | 400 |
| C7 | — | opus/C7\_deterministic\.jsonl | 400 |
| C1 | MedGemma 27B | medgemma/C1\_llm\_alone\.jsonl | 400 |
| C4g | MedGemma 27B | medgemma/C4g\_intent\_aware\.jsonl | 400\* |
| C1 | GPT\-OSS 20B | gptoss/C1\_llm\_alone\.jsonl | 400 |
| C4g | GPT\-OSS 20B | gptoss/C4g\_intent\_aware\.jsonl | 400 |
| C1 | Qwen3\.5 35B | qwen35/C1\_llm\_alone\.jsonl | 400 |
| C4g | Qwen3\.5 35B | qwen35/C4g\_intent\_aware\.jsonl | 400 |
| C1 | Gemma 4 31B | gemma4/C1\_llm\_alone\.jsonl | 400 |
| C4g | Gemma 4 31B | gemma4/C4g\_intent\_aware\.jsonl | 400 |
| C1 | MedGemma 1\.5 4B | medgemma15/C1\_llm\_alone\.jsonl | 400 |
| C4g | MedGemma 1\.5 4B | medgemma15/C4g\_intent\_aware\.jsonl | 400 |

\*MedGemma 27B C4g: 2 timeouts; empty answers scored as incorrect \($n=400$\)\.
### P\.1 Infrastructure Bugs Discovered and Fixed During Evaluation
During final review, three infrastructure issues were discovered and corrected that affected data integrity in earlier runs\. They are documented here in the interest of methodological transparency; all three have since been fixed, and the residual impact on reported numbers is described for each\.
#### Bug 1: Checkpoint serialization truncation \(500\-character limit\)\.
A line in qa\_experiment\_executor\.py wrote predicted\_answer\[:500\] to checkpoint files, silently truncating every saved answer to 500 characters\. Original in\-run scoring \(the legacy internal evaluator\) operated on full answers, but the frozen\-evaluator rescoring, which is the source of all numbers in this paper, operated on the truncated strings\. The truncation rate scaled with per\-model answer verbosity: GPT\-OSS $\approx 0.2\%$ of answers affected, MedGemma 27B $\approx 0.5$–$2.8\%$ \(all negligible, $<1\,\text{pp}$ impact on frozen rescore\), Qwen 3\.5 $\approx 0.8$–$4.2\%$ \($<1\,\text{pp}$\), Opus 4\.6 $\approx 3.5$–$16\%$ \(estimated $1$–$3\,\text{pp}$ downward bias on C4g oracle\), and MedGemma 1\.5 4B $\approx 7.5$–$35.5\%$ \(estimated $2$–$3\,\text{pp}$ downward bias\)\. Because the bug was applied uniformly across models and conditions, relative rankings are preserved; absolute scores for the most verbose models \(Opus, MedGemma 1\.5\) would shift slightly upward if fully rerun\. The \[:500\] slice was removed and MedGemma 1\.5 \(the most affected model\) was rerun end\-to\-end to obtain clean data; other models retain their original checkpoint data with this documented downward bias\.
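The per\-model truncation rates quoted above are the kind of figure a check like the following would produce\. This is a minimal sketch assuming access to the full, untruncated answers; truncation\_rate and the sample strings are hypothetical and not part of the released code\.

```python
def truncation_rate(answers, limit=500):
    """Fraction of answers that a `[:limit]` slice would shorten."""
    if not answers:
        return 0.0
    return sum(len(a) > limit for a in answers) / len(answers)


# Hypothetical predictions: one short, one over the limit, one exactly at it.
answers = ["short answer", "x" * 501, "y" * 500]
saved = [a[:500] for a in answers]  # what the buggy line persisted
rate = truncation_rate(answers)
```

Only the 501\-character answer is altered; an answer of exactly 500 characters survives intact, so the rate counts strictly longer answers\.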
#### Bug 2: MedGemma 1\.5 special\-token duplication\.
MedGemma 1\.5 4B \(Gemma 3 architecture\) emitted the special tokens <unused94\> and <unused95\> mid\-generation, causing the model to regenerate the same answer twice within a single response\. The frozen evaluator then scored on the concatenated duplicate text, effectively giving the model two independent chances to match reference keywords\. Manual inspection of per\-question outputs showed that $26\%$ of MedGemma 1\.5 C4g oracle answers contained these tokens, artificially inflating frozen\-evaluator scores by approximately $2$–$5\,\text{pp}$\. Only MedGemma 1\.5 was affected; the tokens are Gemma 3 specific\. The fix was trivial \(adding <unused94\> and <unused95\> to the Ollama Modelfile stop\-token list\), but the benchmarking impact was significant\. All MedGemma 1\.5 numbers in this paper were collected after the fix was applied and the model was rerun across all three reported conditions\.
#### Bug 3: Ollama silently discards Qwen 3\.5 repetition penalties in thinking mode\.
Qwen 3\.5 35B is designed as an extended\-reasoning model; its model card explicitly recommends presence\_penalty=1\.5 to prevent pathological repetition loops during chain\-of\-thought generation\. However, Ollama silently discards the repeat\_penalty, presence\_penalty, and frequency\_penalty options when Qwen is used with think: true \(Ollama issue \#14493\)\. At temperature=0 the model then enters infinite reasoning loops \(e\.g\., “Wait, I’ll write: ‘X’\. Okay, I’ll write: ‘Y’\. Wait, I’ll write: ‘X’…”\) and consumes the entire token budget without producing any visible content\. The issue was verified by testing eight sampling configurations through the Ollama API \(repeat\_penalty from 1\.3 to 1\.8, frequency\_penalty=0\.5, presence\_penalty=0\.5, temperature $\in\{0.3, 0.5\}$, mirostat=2, and stacked combinations\); all eight produced identical 29,083\-character thinking output with empty content, confirming that Ollama is not forwarding these options to Qwen’s thinking generation\. Related open Ollama issues include \#14493 \(Qwen 3\.5 tool calling non\-functional and repetition penalties silently ignored\), \#14421 \(qwen3\.5:35b looping\), \#10976 \(thinking \+ tools \+ qwen3 $\Rightarrow$ empty output\), \#14716 \(qwen3\.5 vision output routed to thinking field\), and \#10927 \(LLM stuck in infinite loop of thinking\)\.
As a workaround, thinking was disabled \(think: false\) for all Qwen runs reported in this paper\. This produces direct answers but may not reflect Qwen’s optimal reasoning performance\. Qwen’s reported numbers should therefore be interpreted as “Qwen 3\.5 35B without reasoning enabled” rather than “Qwen 3\.5 35B at full capability\.” This workaround has a visible downstream effect on Qwen’s oracle\-vs\-keyword comparison: a $-1.2\,\text{pp}$ oracle inversion at the aggregate level that does not match the pattern of the other five tested models\. Per\-category analysis shows Qwen benefits substantially from oracle routing on historical questions \($+16\,\text{pp}$\) but suffers on current\_state \($-10\,\text{pp}$\) and conditional \($-15\,\text{pp}$\) categories\. One hypothesis is that Qwen’s constrained non\-thinking mode interacts poorly with oracle’s focused current\_state retrieval, which discards historical context that non\-thinking Qwen appears to rely on\. Whether this inversion would persist with thinking enabled \(via vLLM or Qwen’s native DashScope API, which do honor repetition penalties\) is an open question left for future work\.
### P\.2 Experiencer\-Attribution Defective Items
During post\-hoc verification, 8 items were identified whose source section="Family History" but whose gold expected\_answer asserts the disease as a current or historical condition of the patient\. The defect arises from the upstream NLP pipeline mis\-propagating an experiencer flag during gold\-answer generation: the section tag was correctly retained but the answer string nonetheless described the patient\. These items are excluded from the change\-excluded keyword endpoint \($n=362$, sensitivity comparator\)\. The headline impact of removing them is $+40.0\,\text{pp} \to +39.5\,\text{pp}$ on that endpoint\. They remain in the released v2 gold standard for transparency\.
Table 20: Eight experiencer\-attribution\-defective items excluded from the change\-excluded keyword sensitivity endpoint\. All have section="Family History" but gold answers describing the patient\.
### P\.3 Evaluator Polarity: Extended Disclosure
The keyword evaluator uses category\-specific keyword lists and matches by word\-boundary regex\. For three categories—uncertainty, family\_history, and conditional—the rule is structurally:
> is\_correct = \_has\_match\(predicted\_lower, patterns\)
where patterns is the category\-defining keyword list \(e\.g\., \[’if’, ’conditional’, ’pending’, ’depending’, ’only if’\] for conditional\)\. The gold expected\_answer is not consulted: any prediction containing a category keyword is scored correct\.
For current\_state and historical, the rule matches keyword presence \("current", "active", "present" for current\_state; "was", "former", "resolved", "history" for historical\) without a polarity check\. Consequently, a prediction asserting *“NOT FOUND IN CURRENT RECORDS”* matches the keyword "current" and is scored correct against a gold of “currently active”: directionally opposed but lexically overlapping\.
The implication is that C4g’s structured\-answer style \(which routinely echoes the queried category, e\.g\., “Current state: …”\) is mechanically advantaged over C1’s abstention style \(“insufficient information”\), even when neither answer carries the right clinical content\. This contributes to the keyword evaluator’s measured 7\.5:1 strict\-vs\-lenient asymmetry \(Table[28](https://arxiv.org/html/2605.11143#A24.T28)\) and is the principal reason we treat the keyword evaluator as a deterministic reproducibility proxy rather than a substantive truth criterion\. Physician adjudication and LLM\-as\-judge are the substantively interpretable evaluators; their deltas \(Section[4\.6](https://arxiv.org/html/2605.11143#S4.SS6)\) are the comparisons we ask readers to weight\.
## Appendix Q Assertion Category Definitions
Table 21: The 7\-value assertion taxonomy, extending i2b2 by separating historical from family\_history\.
## Appendix R Temporal Relation Mapping
The nine temporal relations $\mathcal{R}$ used on KG edges are derived from Allen’s 13 canonical interval relations \[[36](https://arxiv.org/html/2605.11143#bib.bib36)\] by merging symmetric pairs\.
Table 22: Mapping from Allen’s 13 interval relations to the 9 temporal relation values stored on KG edges\.
## Appendix S Intent\-Aware Routing Algorithm
The C4g intent classifier operates in two modes\. In the *oracle* mode used for benchmark evaluation, the question’s category metadata determines the intent directly\. In the *keyword\-only* mode used for deployment, a rule\-based classifier infers intent from keyword patterns in the question text \(e\.g\., “changed”, “new since” trigger Change; “currently”, “active problem” trigger Current\_State\)\. The keyword classifier achieves 68% overall accuracy but only 20\.0% on questions requiring targeted routing; per\-category accuracy is reported in Table [23](https://arxiv.org/html/2605.11143#A19.T23)\. All benchmark results in the main text use oracle classification unless otherwise noted\. Algorithm [1](https://arxiv.org/html/2605.11143#algorithm1) details the full retrieval procedure\.
Algorithm 1: Intent\-Aware Retrieval \(C4g\)

Input: question $q$, patient $\pi$

Output: structured evidence $E$

- $\mathcal{C} \leftarrow \textsc{ExtractConcepts}(q)$ // NLP \+ OMOP enrichment
- $\iota \leftarrow \textsc{ClassifyIntent}(q)$ // $\iota \in \{\textsc{Change}, \textsc{CurrSt}, \textsc{Hist}, \textsc{Default}\}$
- if $\iota = \textsc{Change}$ then
  - Partition edges by admission: $\mathcal{E}_{k} \leftarrow \{e \mid \texttt{hadm\_id}(e) = k\}$
  - foreach admission pair $(k, k^{\prime})$ with $k < k^{\prime}$ do
    - $\mathcal{A} \leftarrow \mathcal{C}_{k^{\prime}} \setminus \mathcal{C}_{k}$; $\mathcal{R} \leftarrow \mathcal{C}_{k} \setminus \mathcal{C}_{k^{\prime}}$; $\mathcal{S} \leftarrow \mathcal{C}_{k} \cap \mathcal{C}_{k^{\prime}}$
  - $E_{g} \leftarrow \textsc{FormatChange}(\mathcal{A}, \mathcal{R}, \mathcal{S})$
- else if $\iota = \textsc{CurrSt}$ then
  - $E_{g} \leftarrow \textsc{FilterEdges}(\pi, \mathcal{C},\; \tau_{a} = \textsc{Current} \lor \text{open validity})$
  - Deduplicate by concept; emit “Not Found” for missing $c \in \mathcal{C}$
- else if $\iota = \textsc{Hist}$ then
  - $E_{g} \leftarrow \textsc{FilterEdges}(\pi, \mathcal{C},\; \tau_{a} = \textsc{Past})$
  - Augment: concepts in earlier admissions but absent from latest $\to$ “resolved”
- else
  - $E_{g} \leftarrow \textsc{BidirectionalBFS}(\pi, \mathcal{C},\; \text{hops} = 2\text{--}3,\; c_{\min} = 0.3)$
- $E_{d} \leftarrow \textsc{RetrieveDocuments}(\pi, \mathcal{C})$
- return $\textsc{Compose}(E_{g}, E_{d},\; \text{template}(\iota))$

### S\.1 Keyword\-Only Intent Classifier Accuracy
Table [23](https://arxiv.org/html/2605.11143#A19.T23) reports per\-category accuracy of the keyword\-only intent classifier alongside the oracle–keyword accuracy gap on Opus C4g\. The keyword classifier achieves 68% overall classification accuracy but only 20\.0% on questions requiring targeted routing \(historical: 0%, family history: 0%, current state: 34%, change: 50%\)\. Categories routed to Default \(negation, uncertainty, conditional, duration, sequence\) are unaffected by classification errors because both oracle and keyword paths use the same default BFS traversal\.
Table 23: Per\-category keyword intent classifier accuracy and its impact on Opus C4g QA accuracy \($n=400$\)\. “Classifier acc\.” is the fraction of questions where the keyword classifier matches oracle intent\. Categories marked “Default” are routed identically under both classifiers\.

Figure 8: Keyword intent classifier analysis \(Opus\)\. \(A\) Per\-category classifier accuracy: categories requiring no targeted routing \(top, green\) achieve 100%; categories needing routing show 0–50% keyword accuracy \(bottom, red/orange\)\. \(B\) Downstream QA impact: categories near the parity line \(dashed\) have small oracle–keyword gaps; change \($-70\,\text{pp}$\) is the major outlier, while duration \($+10\,\text{pp}$\) benefits from keyword routing\.
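A keyword\-only classifier of the kind evaluated here might look like the sketch below\. It is an assumption\-laden illustration: the trigger lists contain only the four example phrases given earlier in this appendix, and classify\_intent and INTENT\_KEYWORDS are hypothetical names, since the routing implementation is not released\.

```python
import re

# Example triggers quoted in the appendix; the deployed lists are assumed
# to be larger and are not reproduced here.
INTENT_KEYWORDS = {
    "Change": ["changed", "new since"],
    "Current_State": ["currently", "active problem"],
}


def classify_intent(question):
    """Return the first intent whose trigger appears, else 'Default'."""
    text = question.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(re.search(r"\b" + re.escape(k) + r"\b", text) for k in keywords):
            return intent
    return "Default"
```

Any question without a matching trigger falls through to Default, which is harmless for categories that use default BFS traversal anyway but costs accuracy wherever targeted routing was needed\.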
## Appendix T Bookend and Diagnostic Conditions
#### C5 \(Full System\)\.
C5 extends C4 with guideline retrieval \(1,202 sections\) and clinical calculators \(201, e\.g\., CHA2DS2\-VASc, MELD\)\. In a prior evaluator run, C5 scored below C4g, likely because non\-relevant components \(guidelines and calculators\) dilute the context window without contributing to ClinicalBench question types\. C5 is excluded from the current ablation ladder because it conflates retrieval architecture with additional knowledge sources\.
#### C6 \(Long Context\)\.
All patient documents are concatenated chronologically and presented to the LLM\. For Opus \(200K token window\), all notes fit; for MedGemma \(8K window\), later documents are truncated\. C6 achieves 59\.2% \(Opus\) overall, well above C1 \(21\.8%\) and modestly below C4g$_{\text{oracle}}$ \(68\.5%, $-9.3\,\text{pp}$, $p=0.001$\)\. The gap concentrates in current state \(C6 18\.0% vs\. C4g 70\.0%, $-52\,\text{pp}$\); on temporal categories C6 is comparable \(historical 62\.0%, sequence 82\.5%\) or weaker only modestly \(duration 43\.3% vs\. C4g 63\.3%\)\. This indicates brute\-force long context handles factual recall and temporal retrieval reasonably well but underperforms structured retrieval on intent\-sensitive queries, where the epistemic KG’s explicit assertion typing distinguishes resolved from active conditions in ways implicit raw\-text reading does not\.
#### C7 \(Deterministic KG\)\.
KG edges matching query concepts are returned directly without LLM reasoning\. C7 nominally scores 27\.2% overall, but this is an evaluator artifact: C7 returns template refusals \(“No relevant knowledge graph edges found”\) for $>98\%$ of questions, and the word “No” in the template coincidentally matches negation\-category keywords \(negation: 99\.1%, all other categories: 0%\)\. Semantic accuracy is 0%, confirming that structured data without an LLM reasoner cannot answer clinical questions\.
## Appendix U Illustrative Retrieval Example
To illustrate why intent\-aware routing matters, consider a change question: *“What medications changed between this patient’s first and second admissions?”*
#### C1 \(LLM alone\)\.
The model sees only the current note, which mentions metoprolol, atorvastatin, and “discontinued lisinopril\.” Without prior admission records, it cannot determine what was *added* vs\. *continued*, and may hallucinate prior medication lists\.
#### C4g \(intent\-aware KG\-RAG\)\.
The intent classifier routes to Change retrieval, which partitions KG edges by hadm\_id and computes set differences:
- Added \(admission 2 only\): atorvastatin 40mg \[Present\]
- Removed \(admission 1 only\): lisinopril 10mg \[Present $\to$ Absent\]
- Continued: metoprolol 25mg \[Present, both admissions\]
The assertion labels \(Present,Absent\) disambiguate: “discontinued lisinopril” is not a current medication but a historical one whose status changed—exactly the distinction the epistemic schema preserves\.
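The set differences in this example can be written out directly\. The sketch below uses the three medications from the example and a hypothetical helper, diff\_admissions; it ignores doses and assertion labels, which the actual Change branch carries on each KG edge\.

```python
def diff_admissions(earlier, later):
    """Partition concepts into added, removed, and continued sets."""
    return {
        "added": later - earlier,       # in the later admission only
        "removed": earlier - later,     # in the earlier admission only
        "continued": earlier & later,   # present in both admissions
    }


# Concept sets taken from the illustrative example above.
adm1 = {"lisinopril", "metoprolol"}
adm2 = {"atorvastatin", "metoprolol"}
change = diff_admissions(adm1, adm2)
```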
## Appendix V LLM\-as\-Judge Concordance
To complement the deterministic keyword evaluator, 800 predictions \(400 questions $\times$ 2 conditions: C1 and C4g\) were rescored using Claude Opus 4\.6 as an LLM judge\. The judge prompt presents the question, reference answer, and system answer, and requests a score of 1 \(correct\), 0\.5 \(partially correct\), or 0 \(incorrect\) with a one\-sentence justification\. Table [24](https://arxiv.org/html/2605.11143#A22.T24) summarizes the concordance\.
Table 24: Evaluator concordance comparison\. LLM judge scores $\geq 0.5$ treated as correct for binary comparison\. Physician concordance computed on the $n=30$ pilot adjudication subset\.

#### Key findings\.
The LLM judge achieves higher physician concordance \(56\.7% vs\. 46\.7%\) and is less prone to false strictness \(36\.7% vs\. 43\.3%\), confirming that the keyword evaluator’s conservative bias inflates the measured delta\. Under the LLM judge, C1 rises from 21\.8% to 28\.5% \(the judge credits clinically reasonable hedging that keyword matching penalizes\), while C4g drops from 68\.5% to 57\.0% \(the judge penalizes partial answers that pass keyword matching via term overlap\)\. The resulting C4g $-$ C1 delta under LLM judge \($+28.5\,\text{pp}$\) is lower than the keyword delta \($+46.8\,\text{pp}$\) but remains substantial and directionally consistent\. The per\-condition asymmetry \(C1 gains, C4g loses\) is consistent with the physician finding that the keyword evaluator disproportionately penalizes C1 abstentions\.
#### Per\-category analysis\.
The LLM judge is notably stricter on change \(C4g: 62% mean score vs\. 100% keyword\) because it penalizes partial medication lists that contain correct change keywords but miss specific drugs\. By contrast, the judge stays close to the keyword evaluator on historical \(C4g: 62% vs\. 60% keyword\) and negation \(C4g: 79% vs\. 81% keyword\), where its semantic understanding can credit correct answers that use varied phrasing\.
#### Limitations\.
Using the same model family \(Opus\) for both answering and judging introduces potential self\-preference bias\. The judge also shows moderate agreement with the keyword evaluator \($\kappa=0.572$\), indicating they capture partially overlapping but distinct aspects of answer quality\.
## Appendix W Assertion Classifier Evaluation
The rule\-based assertion classifier uses 122 calibrated trigger patterns extending the NegEx \[[20](https://arxiv.org/html/2605.11143#bib.bib20)\] and ConText \[[21](https://arxiv.org/html/2605.11143#bib.bib21)\] frameworks to a 7\-class taxonomy \(Table [21](https://arxiv.org/html/2605.11143#A17.T21)\)\. Table [25](https://arxiv.org/html/2605.11143#A23.T25) summarizes the pattern inventory and confidence ranges by category\.
Table 25: Assertion trigger pattern inventory\. Confidence ranges reflect per\-trigger calibration\.

#### Corpus statistics\.
Across the 43 ClinicalBench patients, the knowledge graph contains 3,943 edges linked to clinical facts\. Of these, 618 \(15\.7%\) carry non\-present assertions: 442 absent \(11\.2%\), 94 possible \(2\.4%\), 71 historical \(1\.8%\), 8 conditional \(0\.2%\), and 3 hypothetical \(0\.1%\)\. At the mention level, 1,428 of 12,379 mentions \(11\.5%\) are non\-present\. This non\-present fraction $f_{\textit{np}}=0.157$ provides the empirical bound from Corollary [3](https://arxiv.org/html/2605.11143#Thmdefinition3): an assertion\-blind pipeline cannot exceed $1-f_{\textit{np}}=84.3\%$ assertion\-faithful accuracy on concepts where negated or uncertain mentions are present\.
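The edge\-level fraction and the resulting ceiling follow from the counts above; the short check below recomputes them\. Variable names are illustrative only\.

```python
# Edge counts quoted in the corpus statistics.
edges_total = 3943
non_present = {
    "absent": 442,
    "possible": 94,
    "historical": 71,
    "conditional": 8,
    "hypothetical": 3,
}

n_np = sum(non_present.values())  # edges carrying a non-present assertion
f_np = n_np / edges_total         # non-present fraction
ceiling = 1 - f_np                # bound for an assertion-blind pipeline
```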
#### Intrinsic evaluation\.
A sample of 189 mentions was drawn from the 43 ClinicalBench patients, stratified by predicted assertion type \(50 present, 51 absent, 15 possible, 28 historical, 19 conditional, 4 hypothetical, 10 family history\); a physician \(A\.S\.\) then annotated reference\-standard labels in a blinded, randomized review\. Table [26](https://arxiv.org/html/2605.11143#A23.T26) reports per\-class precision, recall, and F1\.
Table 26: Intrinsic assertion classifier evaluation on 189 physician\-annotated MIMIC\-IV mentions \(stratified sample from 43 ClinicalBench patients\)\.

Overall accuracy is 89\.4% \(169/189; 95% Wilson CI: \[84\.2%, 93\.0%\]\) with Cohen’s $\kappa=0.867$ \(strong agreement\)\. Negation detection achieves F1 = 0\.970, consistent with published benchmarks: NegEx P = 94\.5%/R = 77\.8% \[[20](https://arxiv.org/html/2605.11143#bib.bib20)\], NegBio P = 96\.3%/R = 85\.7% \[[46](https://arxiv.org/html/2605.11143#bib.bib46)\]\. The dominant error pattern \(11/20 errors\) is over\-triggering uncertainty: the classifier assigns possible to mentions near hedging language \(“likely,” “concerning for”\) that physicians judge as present\. This explains the low precision for the possible class \(P = 0\.50\) despite perfect recall\.
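The Wilson interval quoted for overall classifier accuracy can be reproduced from the counts 169/189\. The sketch below implements the standard Wilson score interval without continuity correction; wilson\_ci is an illustrative name, not a function from the released package\.

```python
from math import sqrt


def wilson_ci(successes, n, z=1.959964):
    """Wilson score interval (no continuity correction) for a proportion."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Overall accuracy counts from the intrinsic evaluation.
lo, hi = wilson_ci(169, 189)
```

The bounds come out at roughly 0\.842 and 0\.930, matching the reported \[84\.2%, 93\.0%\]\.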
#### Functional evaluation\.
Rather than relying solely on intrinsic metrics, the C4 ablation provides a functional evaluation of assertion quality\. C4 adds assertion metadata to C3’s graph without intent routing: the classifier’s output is directly consumed by the retrieval pipeline\. On assertion\-sensitive categories \(negation, conditional, uncertainty, family history\), C4 outperforms C3 by an average of $+16.1\,\text{pp}$, confirming that the classifier produces usable assertions for these categories\. On temporal categories \(historical, sequence, change, current state, duration\), C4 underperforms C3 by an average of $-24.5\,\text{pp}$, not because assertions are incorrect, but because uniform BFS traversal cannot exploit them effectively\. The C4 $\to$ C4g routing correction recovers these losses and amplifies the gains, achieving $+22.0\,\text{pp}$ overall \($p<10^{-10}$\)\.
## Appendix X Physician Adjudication: Full Results and Reference Answer Evolution
This section reports complete results from the blinded physician adjudication \($n=120$ paired questions\) and documents the reference\-answer evolution\.
### X\.1 Per\-Category Physician Accuracy
Table 27: Physician\-judged accuracy by category and condition \($n$ per condition\)\. Strict = correct only; lenient = correct \+ partially correct\. Categories sorted by C4g strict accuracy\.

The keyword evaluator dramatically under\-scores C1 on conditional \(0% vs\. 80% physician lenient\) and family history \(0% vs\. 90%\), where the model gives clinically reasonable hedged answers that lack specific keywords\. The change category exhibits the opposite pattern: the keyword evaluator reports C4g at 100%, but the physician rates it at 0% strict / 46\.7% lenient, because keyword matching catches medication names without verifying comparison logic\. Family history shows no physician\-judged delta \(both conditions score $\sim$90% lenient\), suggesting that the LLM already handles this category well without KG\-RAG when evaluated by a physician\. The keyword evaluator masks this finding because it scores C1 at 0% on this category\.
### X\.2 Evaluator Agreement
Table 28: Overall keyword evaluator agreement with physician judgment \($n=240$ items from full adjudication\)\. The evaluator is overwhelmingly too strict \(7\.5:1 strict\-to\-lenient ratio\), and the majority of errors trace to reference\-answer defects rather than evaluator logic\.

| Evaluator Outcome | $n$ | % |
| --- | --- | --- |
| Agrees with physician | 130 | 54\.2% |
| Too strict \(false negative\) | 97 | 40\.4% |
| Too lenient \(false positive\) | 13 | 5\.4% |
| Strict:lenient ratio | 7\.5:1 | |

The majority of evaluator errors \(63% of false negatives, 85% of false positives\) trace to reference\-answer errors rather than evaluator logic: when the reference answer is correct, the evaluator achieves 64\.2% physician agreement with only a 1\.9% false\-positive rate\. The low $\kappa$ \(0\.18\) despite 54\.2% raw agreement confirms the evaluator’s errors are systematic \(overwhelmingly too strict\) rather than random\. Table [29](https://arxiv.org/html/2605.11143#A24.T29) breaks this down by category\.
Table 29:Keyword evaluator agreement with physician judgment by category\. Categories where the evaluator fails worst are highlighted\.
### X.3 Safety and Clinical Utility by Condition
Table 30: Physician-judged clinical safety and utility by condition (n = 120 paired questions). C4g is safer and more useful than C1; the “misleading” rate is similar between conditions.

| Dimension | Rating | C1 | C4g | Δ |
| --- | --- | --- | --- | --- |
| Safety | Safe | 60.8% | 76.7% | +15.8 pp |
| Safety | Minor concern | 33.3% | 20.8% | −12.5 pp |
| Safety | Potentially harmful | 5.8% | 2.5% | −3.3 pp |
| Utility | Helpful | 30.8% | 67.5% | +36.7 pp |
| Utility | Neutral | 20.0% | 15.0% | −5.0 pp |
| Utility | Not useful | 34.2% | 2.5% | −31.7 pp |
| Utility | Misleading | 15.0% | 15.0% | 0.0 pp |

Table [31](https://arxiv.org/html/2605.11143#A24.T31) breaks safety down by category.

Table 31: Clinical safety ratings by category in the audited record set. Categories sorted by safety concern rate.

Change is the most safety-concerning category: 83.9% of change items have safety concerns (minor or harmful), mirroring the reference-answer quality and model accuracy findings.
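The Δ column of Table 30 is not always the difference of the two rounded percentages (76.7 − 60.8 = 15.9, whereas the table reports +15.8 pp). One plausible explanation, sketched below, is that the deltas were computed from underlying counts out of 120 and rounded afterwards. The integer counts are an assumption reconstructed from the percentages; they are not values given in the paper.

```python
N = 120  # paired questions per condition (Table 30 caption)

# Reported percentages (C1, C4g) and reported delta in pp, copied from Table 30.
table_30 = {
    "Safe": (60.8, 76.7, 15.8),
    "Minor concern": (33.3, 20.8, -12.5),
    "Potentially harmful": (5.8, 2.5, -3.3),
    "Helpful": (30.8, 67.5, 36.7),
    "Neutral": (20.0, 15.0, -5.0),
    "Not useful": (34.2, 2.5, -31.7),
    "Misleading": (15.0, 15.0, 0.0),
}


def to_count(percent, n=N):
    """ASSUMPTION: nearest integer count that yields the reported percentage."""
    return round(percent * n / 100)


recomputed = {}
for rating, (c1_pct, c4g_pct, _reported) in table_30.items():
    c1, c4g = to_count(c1_pct), to_count(c4g_pct)
    # Delta computed from counts first, rounded to one decimal afterwards.
    recomputed[rating] = (c1, c4g, round(100 * (c4g - c1) / N, 1))

for rating, (c1, c4g, delta) in recomputed.items():
    print(f"{rating:20s} C1={c1:3d}  C4g={c4g:3d}  delta={delta:+.1f} pp")
```

With counts reconstructed this way, every recomputed delta matches the reported column, and the three safety rows sum to 120 in each condition.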
### X.4 Qualitative Themes from Physician Notes
Free-text notes (165/241 items, 68.5%) reveal five recurring themes:
1. Reference answers systematically wrong for medication change questions (103 notes mentioning reference-answer issues): Every reference answer in the change category conflates inpatient medication orders (heparin, IV antibiotics, CIWA protocol) with discharge medications.
2. C1 hallucinates from limited context (15 notes, exclusively C1): Without retrieval, C1 fabricates admission IDs, medication names, and clinical scenarios. Zero C4g items received this complaint.
3. NLP assertion classifier propagates errors (8 notes): Boilerplate discharge instructions (“call if fever >101.5”) tagged as clinical findings; “h/o recently diagnosed metastatic cancer” tagged as historical.
4. Safety-critical errors (10 items flagged as potentially harmful): Code status errors (model “hallucinates DNR confirmation” when chart says full code), active cancer missed from medication list, anticoagulation misclassified.
5. Model praised when reference answers were wrong (63 notes): Reviewer noted the model gave clinically correct answers that the automated benchmark reference penalized.
### X.5 Reference Answer Version History
The ClinicalBench reference answers have undergone iterative refinement:
- v1 (auto-generated reference set): Reference answers created by LLM from MIMIC-IV notes via the NLP extraction pipeline. No physician review. 400 questions.
- v2 (partially corrected reference set): From the initial n = 30 pilot, 54 corrections were applied to the most egregious errors. (Footnote 2: The shipped `corrections.json` file documents 53 explicit corrections; the diff between the v1 and v2 `expected_answer` fields contains 54 differing items, because one item was adjusted after `corrections.json` was written, during a final reconciliation pass.) This is the version used for all reported numbers. Both v1 and v2 are released.
- v3 (planned physician-validated release): Full corrections from the n = 120 adjudication plus external validation. Triage summary: 45 questions KEEP (37%), 55 FIX\_GOLD (46%), 20 REPLACE\_QUESTION (17%). Of the 55 FIX\_GOLD items, 46 have drafted proposed corrections. This future release is not used for any numbers reported in this paper.
### X.6 Systematic Error Taxonomy
Five systematic failure modes recur across the 78 problematic questions (of 120 adjudicated). Percentages below are relative to all 78; the five listed counts total 73 (94%):
1. NLP assertion classifier error (28 questions, 36%): The dominant failure. Manifests as: “history of heart failure” → “heart failure is resolved” (clinical idiom means active chronic condition); “edema, likely due to noncompliance” → “edema is uncertain” (causal vs. existential uncertainty conflation); experiencer tag reversal (patient’s atrial fibrillation labeled as family history).
2. Wrong answer / inverted truth (16 questions, 21%): The reference answer states the opposite of the chart. Example: the reference says pitting edema is absent when PE documents “2+ pitting edema bilaterally.”
3. Non-clinical entity extraction (11 questions, 14%): NLP extracted boilerplate (“call if fever >101”), devices (Foley catheter as diagnosis), lab values (blood sugar as diagnosis), or section headers (“Allergies” as medical condition).
4. Medication list conflation (10 questions, 13%): Change questions compared wrong lists: inpatient orders (heparin, IV antibiotics) vs. discharge medications, or admission med-rec vs. discharge list. PRN-only medications (CIWA Valium) counted as prescribed.
5. Fabricated temporal relationship (8 questions, 10%): Sequence questions claimed ordering not supported by the chart: both conditions in the same admission with no temporal anchoring, or based on NLP-extracted entities from negated text.
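As a quick arithmetic check on the taxonomy, the reported percentages are reproduced when each count is divided by the 78 problematic questions, while the five counts themselves sum to 73. The snippet below only tallies the figures quoted in the list; the variable names are illustrative.

```python
problematic = 78  # problematic questions out of 120 adjudicated

# Counts copied from the five taxonomy entries above.
taxonomy = {
    "NLP assertion classifier error": 28,
    "Wrong answer / inverted truth": 16,
    "Non-clinical entity extraction": 11,
    "Medication list conflation": 10,
    "Fabricated temporal relationship": 8,
}

# Share of each failure mode relative to all problematic questions.
shares = {name: round(100 * count / problematic) for name, count in taxonomy.items()}

listed = sum(taxonomy.values())   # questions covered by the five modes
unlisted = problematic - listed   # questions not attributed to any listed mode

for name, share in shares.items():
    print(f"{name:34s} {taxonomy[name]:3d}  {share:3d}%")
print(f"listed={listed}, unlisted={unlisted}")
```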
### X.7 Impact on Reported Numbers
Because the detected defects are question\-level rather than condition\-specific, they are less likely to reverse the direction of the C1–C4g comparison\. They do, however, materially affect absolute accuracies and some category\-level magnitudes, especially for change and historical questions\. The keyword evaluator achieves 64\.2% physician agreement and only 1\.9% false\-positive rate when restricted to questions with correct reference answers, confirming that evaluator errors are dominated by reference\-answer noise rather than matching logic\.
## Appendix Y Cohort Demographics and Subgroup Analysis
The 43 ClinicalBench patients are drawn from MIMIC-IV \[[24](https://arxiv.org/html/2605.11143#bib.bib24)\], which is sourced from a single academic medical center (Beth Israel Deaconess Medical Center, Boston). The cohort skews female (60.5%), White (76.7%), and Medicare-insured (44.2%), reflecting the source institution’s patient mix.
#### Subgroup accuracy\.
Table [32](https://arxiv.org/html/2605.11143#A25.T32) reports C1 and C4g accuracy by demographic group. Because questions are distributed unevenly across patients and group sizes are small (n ≤ 26 patients per stratum), these comparisons are severely underpowered; they are reported for transparency, not for drawing subgroup conclusions. No statistically significant interaction between demographic group and condition was observed, but the analysis cannot rule out meaningful effect modification.
Table 32: ClinicalBench accuracy by demographic subgroup (underpowered; reported for transparency). n_q = number of questions in each stratum.
#### Generalizability limitations\.
The single\-site, predominantly White cohort limits external validity\. MIMIC\-IV emergency department notes are heavily templated; performance on narrative\-heavy specialties \(psychiatry, palliative care\) or community hospital documentation styles is unknown\. Multi\-site validation with demographically diverse cohorts is needed before deployment claims can be made\.
## Appendix Z Ethics, Broader Impacts, and Detailed Threat Analysis
This work uses de-identified clinical data from MIMIC-IV \[[24](https://arxiv.org/html/2605.11143#bib.bib24)\] under a PhysioNet Credentialed Health Data Use Agreement. No patient re-identification was attempted. The system is designed for clinical decision *support*, not autonomous clinical decision-making.
#### Broader impacts\.
Improved assertion\-faithful retrieval could reduce clinical errors caused by negation or family\-history misattribution, particularly in high\-volume settings where physicians cannot review every note\. However, deployment risks include over\-reliance on automated epistemic labeling \(false confidence in assertion status\), brittleness to out\-of\-distribution clinical language, and the potential for structured outputs to appear more authoritative than their accuracy warrants\. Responsible deployment requires prospective evaluation, integration with clinician workflows, and clear communication of system limitations\. Because ClinicalBench is single\-site, predominantly White, and built from heavily templated MIMIC\-IV documentation, a system or benchmark tuned on it could overfit note style and underperform on other hospitals, specialties, or populations\. This creates a coverage and fairness risk: strong results on this stress test could be mistaken for portable performance when they may primarily reflect source\-institution conventions\. ClinicalBench should be used for failure analysis and ablation, not as evidence of deployment readiness or demographic robustness\.
#### Evaluator bias characterization\.
The keyword evaluator has known limitations. Negation scoring checks keyword presence without verifying direction (e.g., “pneumonia is absent” and “patient has pneumonia” could both match). Longer model outputs naturally contain more keyword matches, creating a verbosity bias (Opus averages 337 characters vs. GPT-OSS 139 characters). Echo-stripping may differentially affect structured vs. prose outputs. These biases primarily affect *cross-model* comparisons; within-model ablations (C1 vs. C4g) use the same model’s output format across conditions, so evaluator biases are expected to largely cancel.
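The direction-blindness described above is easy to reproduce with a toy matcher. The function below is a hypothetical stand-in written for illustration, not the ClinicalBench evaluator; it only shows why presence-based matching cannot separate an affirmed finding from a negated one, and why longer outputs tend to accumulate matches.

```python
import re


def keyword_hits(answer, keywords):
    """Return the keywords present in the answer (case-insensitive, whole word)."""
    text = answer.lower()
    return [
        kw for kw in keywords
        if re.search(rf"\b{re.escape(kw.lower())}\b", text)
    ]


# Direction: the affirmed and the negated statement match the same keyword.
keywords = ["pneumonia"]
affirmed = "Patient has pneumonia."
negated = "Pneumonia is absent."
print(keyword_hits(affirmed, keywords))
print(keyword_hits(negated, keywords))

# Verbosity: a longer answer that mentions more terms collects more hits,
# whether or not each term is asserted in the right direction.
many = ["pneumonia", "fever", "cough", "edema"]
terse = "No pneumonia."
verbose = "No pneumonia; denies fever and cough, and edema has resolved."
print(len(keyword_hits(terse, many)), len(keyword_hits(verbose, many)))
```

A direction-aware scorer would need the assertion status of each matched mention, which is the same information EpiKG attaches to facts at retrieval time.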
#### Reference\-answer correction methodology\.
Benchmark label quality is substantially imperfect: full physician adjudication (n = 120 questions, 241 audited records) found a 56% defect rate in provisional reference answers. Defects are not random but trace to five systematic pipeline failures: NLP assertion classifier errors (36% of defects), inverted-truth reference answers (21%), non-clinical entity extraction (14%), medication list conflation (13%), and fabricated temporal relationships (10%). The NLP assertion classifier is the dominant source, systematically misinterpreting “history of X” as “X is resolved” and conflating causal uncertainty with existential uncertainty (Appendix [X](https://arxiv.org/html/2605.11143#A24)). Reference-answer corrections were made by the lead physician (A.S.), who also designed the system. To quantify impact, all conditions were rescored against both the original (v1) and partially corrected (v2) reference sets: v1 → v2 corrections improved all models similarly (+0.5 to +1.2 pp), preserving within-model deltas in that historical comparison. All results reported in this paper use the v2 reference set (54 corrections from an initial n = 30 pilot, reconciled against the v1 → v2 `expected_answer` diff); both v1 and v2 are released for reproducibility. A future v3 release will incorporate consensus corrections from the multi-reviewer adjudication once that study is complete; no v3 results are used in this manuscript. The completed three-rater adjudication (two independent external physicians × 100 items) addresses single-rater bias; results appear in Section [4.7](https://arxiv.org/html/2605.11143#S4.SS7).
#### Scope and framing\.
This study is an ablation analysis rather than a head-to-head leaderboard against existing medical RAG systems \[[8](https://arxiv.org/html/2605.11143#bib.bib8),[9](https://arxiv.org/html/2605.11143#bib.bib9),[39](https://arxiv.org/html/2605.11143#bib.bib39)\]; C2 is used as a proxy baseline for document-level retrieval on this cross-admission task. Runtime variance from quantized local inference (e.g., MedGemma ±10 pp across runs) and single-site provenance further limit external validity.
### Z.1 Extended External Adjudication Details
#### Rater-calibration detail.
Reviewer 3 rates 65/100 model answers as “correct” (vs. 46/100 for Reviewer 1 and 39/100 for Reviewer 2) and uses “partially correct” only 10/100 times (vs. 22/100 and 29/100 respectively): borderline answers are collapsed into “correct” on both conditions, compressing the between-condition delta. Excluding *change*, her delta rises to +6.8 pp (still underpowered, p = 0.65). Ordinal linear-weighted pairwise Cohen’s κ on the 3-class model-answer scale averages 0.463 (quadratic-weighted 0.527) across the three reviewer pairs. The structurally-independent inter-non-author Cohen’s κ (Hird × Nadeem, binary strict) is 0.36, versus 0.43–0.49 for author-involving pairs and Fleiss’ κ = 0.413 on the full three-rater binary-strict scale; the lower non-author κ is consistent with author influence on the other two reviewer pairs and is one reason the three-rater majority-vote +24.0 pp result should be weighted with this dependency in mind. All 100 items contribute to κ; the planned 10-item calibration phase was replaced by written instructions only, so no items were excluded.
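For readers less familiar with the linear- and quadratic-weighted κ variants quoted above, the following is a minimal sketch of the standard two-rater weighted Cohen's κ. It is not the paper's analysis code, and the per-item ratings are not reproduced here: the example ratings and the 0/1/2 coding (incorrect, partially correct, correct) are hypothetical.

```python
def weighted_kappa(rater_a, rater_b, n_classes, weighting="linear"):
    """Cohen's weighted kappa for two raters on an ordinal scale coded 0..n_classes-1."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    power = 1 if weighting == "linear" else 2

    # Observed joint proportions and each rater's marginal proportions.
    observed = [[0.0] * n_classes for _ in range(n_classes)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1 / n
    marg_a = [sum(row) for row in observed]
    marg_b = [sum(observed[i][j] for i in range(n_classes)) for j in range(n_classes)]

    # Disagreement weights grow with the distance between the two categories.
    observed_dis = expected_dis = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            w = (abs(i - j) / (n_classes - 1)) ** power
            observed_dis += w * observed[i][j]
            expected_dis += w * marg_a[i] * marg_b[j]
    if expected_dis == 0:
        raise ValueError("kappa is undefined when both raters use one identical category")
    return 1.0 - observed_dis / expected_dis


# HYPOTHETICAL ratings (0 = incorrect, 1 = partially correct, 2 = correct).
reviewer_x = [2, 2, 1, 0, 2, 1, 0, 2]
reviewer_y = [2, 1, 1, 0, 2, 2, 1, 2]
print(weighted_kappa(reviewer_x, reviewer_y, 3, "linear"))
print(weighted_kappa(reviewer_x, reviewer_y, 3, "quadratic"))
```

Quadratic weights penalize a correct-versus-incorrect disagreement four times as heavily as an adjacent-category one, while linear weights penalize it twice as heavily, which is why the two variants can differ on the same ratings.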
#### Reviewer 3 gold\-standard pattern\.
Reviewer 3’s unusual combination of the strictest gold\-standard ratings \(31/100 fully correct\) alongside the most lenient model\-answer ratings \(65/100 correct\) is internally coherent: she frequently judges the reference answer as wrong while still finding the model’s answer clinically reasonable—the same phenomenon the internal audit reports, now independently replicated\.
#### Scoring assumptions\.
Under lenient scoring (“correct or partially correct”), the 3-rater majority-vote delta is +12.0 pp (p = 0.18, n.s.) on the full set and +15.9 pp (p = 0.09) with *change* excluded: directional but not statistically significant. The three-rater validation characterizes the physician-perceived magnitude under three scoring assumptions: strict 2/3 majority (+24.0 pp, significant), lenient 2/3 majority (+12.0 pp, n.s.), and per-reviewer deltas (+2 to +36 pp).
### Z.2 Leave-Author-Out Sensitivity for Three-Rater Majority
To assess sensitivity of the +24.0 pp three-rater majority result to author inclusion (Reviewer 1 = A.S., the lead author), we recomputed the C1 vs. C4g delta using only the two structurally-independent external raters (Hird and Nadeem). Table [33](https://arxiv.org/html/2605.11143#A26.T33) summarizes the alternative aggregations.
Table 33: Leave-author-out sensitivity for the three-rater external validation (n = 50 items per condition, strict scoring). The author-involving 3-rater majority (top row, paper headline) is contrasted with author-excluded aggregations.

Inter-non-author Cohen’s κ = 0.36 (Hird × Nadeem) versus 0.43–0.49 for author-involving pairs. The substantive direction survives author exclusion and the magnitude is preserved (−2 pp, from +24.0 pp to +22.0 pp under unanimous external agreement); the paired exact McNemar test on the 50 matched questions gives 4 discordant pairs favoring C1 vs. 15 favoring C4g, two-sided p = 0.0192, retaining significance at α = 0.05.
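The exact McNemar figure above can be reproduced from the two discordant counts alone, since the test reduces to a two-sided binomial test with success probability 0.5 on the discordant pairs. The sketch below uses only the Python standard library; the function name is illustrative, and the Newcombe confidence interval is not recomputed because it also needs the concordant cell counts, which are not given here.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the two discordant-pair counts."""
    n = b + c
    k = min(b, c)
    # Probability of a split at least this lopsided under H0: p = 0.5.
    one_tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * one_tail)


favoring_c1 = 4     # discordant pairs where only C1 was rated correct
favoring_c4g = 15   # discordant pairs where only C4g was rated correct
n_questions = 50    # matched questions

p_value = mcnemar_exact(favoring_c1, favoring_c4g)
delta_pp = 100 * (favoring_c4g - favoring_c1) / n_questions

print(f"delta = {delta_pp:+.1f} pp, exact two-sided p = {p_value:.4f}")
```

The paired accuracy difference depends only on the discordant pairs, so the same two counts yield both the reported delta and the reported p-value.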
### Z.3 Inter-Rater Agreement on the Gold Standard
Inter-rater agreement was measured on *system output ratings* (Fleiss κ = 0.413 on C1/C4g/equivalent labels; pairwise Cohen κ ∈ {0.36, 0.43, 0.49}, see Appendix [Z.1](https://arxiv.org/html/2605.11143#A26.SS1) and Section [4.7](https://arxiv.org/html/2605.11143#S4.SS7)). Agreement on the *correctness of the gold standard answers themselves* was not formally measured in this study, a known gap in the benchmark methodology that limits direct quantification of gold validity.
Indirectly, the external raters separately found 61–64% reference defect rates on the v2 gold standard (Section [4.6](https://arxiv.org/html/2605.11143#S4.SS6)), implying low gold-validity agreement between the v2 reference and external clinical judgment. A v3 gold standard with multi-rater authoring (each item independently authored by ≥ 2 raters with consensus reconciliation) and explicit IAA on the reference answers themselves is planned for post-publication release; no v3 numbers are reported in this paper.
### Z.4 Component-Level F1 of the Assertion Classifier
The 122-pattern rule-based assertion classifier (Appendix [W](https://arxiv.org/html/2605.11143#A23)) was *not* externally benchmarked on i2b2-2010 or n2c2-2010 held-out sets in this work; component-level F1 is reported only from internal validation on a physician-annotated stratified sample of 189 ClinicalBench mentions (accuracy 89.4%, weighted F1 0.902, Cohen κ = 0.867; Table [26](https://arxiv.org/html/2605.11143#A23.T26)). External component-level evaluation in the i2b2/n2c2 tradition is flagged as future work and would strengthen the assertion-preservation claim by separating intrinsic classifier quality from downstream KG-RAG retrieval gains.
### Z.5 Regulatory and Governance Framing
#### Scope\.
This work does NOT establish clinical deployment readiness\. The framings below are governance scaffolding for any future deployment effort, not claims about the current paper\. EpiKG is evaluated here as a research probe; no intended\-use statement, no IRB\-approved deployment protocol, and no prospective clinical study is in scope\.
#### SaMD risk class \(candidate\)\.
If deployed for patient\-level clinical answers from EHR retrieval, EpiKG would plausibly fall under FDA SaMD Class IIb–IIIa under IMDRF criteria \(high\-impact decision support over high\-severity conditions, where individual answers can influence diagnosis or therapy\)\. Class assignment depends on intended use, clinical context, and clinician oversight model: a tool used for care\-team chart\-review triage with mandatory clinician adjudication would sit lower on the risk spectrum than one driving patient\-facing answers\. The system as evaluated is research\-only and does not have an intended\-use statement\.
#### PCCP elements \(per FDA December 2024 final guidance\)\.
An AI\-enabled SaMD deploying EpiKG would specify a Predetermined Change Control Plan covering modifiable components: \(a\) the intent\-aware retrieval policy \(routing rules that select between assertion\-, temporal\-, and entity\-typed retrieval slices\), \(b\) the 122\-pattern assertion classifier \(rule additions, deletions, or modifications\), \(c\) the OMOP vocabulary version \(e\.g\., concept additions in successive Athena releases\), and \(d\) the base LLM versions across the cross\-model stack \(Claude, MedGemma, Qwen, Gemma 4, GPT\-OSS\)\. Each component requires a Description of Modifications, a Modification Protocol, and an Impact Assessment under the December 2024 final guidance\. None of these governance artifacts are present in this work; they are flagged as deployment prerequisites\.
#### CHAI Assurance Reporting Checklist self\-mapping\.
The CHAI 2024 Assurance Reporting Checklist and the Joint Commission–CHAI September 2025 governance guidance enumerate categories that any deployment effort must address\. A high\-level self\-assessment against those categories follows:
- *Governance:* no external clinical-AI governance committee, model risk-management board, or institutional review structure is engaged.
- *Privacy:* de-identified MIMIC-IV under PhysioNet credentialed access; the HuggingFace release contains rephrased questions, paraphrased reference answers, and predictions without raw note text (Appendix [P](https://arxiv.org/html/2605.11143#A16)).
- *Transparency:* model versions, evaluator code, raw predictions, and physician adjudication data are publicly released; the full application stack is not (Appendix [P](https://arxiv.org/html/2605.11143#A16)).
- *Data security:* HuggingFace public release; commit `da5f5b1` stripped raw note excerpts.
- *Safety event reporting:* no post-deployment safety surveillance protocol; this is research-only.
- *Risk and bias assessment:* demographic table and equity gap reported (Appendix [Y](https://arxiv.org/html/2605.11143#A25)); no structured external bias review.
The paper provides honest disclosure on most categories; structured external review and a formal assurance\-checklist filing are absent\. Any deployment effort would need to close those gaps before clinical release\.
#### Algorithmovigilance plan \(sketch only\)\.
Per Embí (JAMIA 2021) and Davis, Embí, and Matheny (JAMIA 2024), a clinical-AI algorithmovigilance program for EpiKG would minimally include: (a) drift monitoring, e.g., monthly per-category accuracy on a held-out chart sample, with drift alerts triggered when any category accuracy moves more than 5 pp relative to a rolling baseline; (b) periodic re-validation, i.e., annual re-validation against a refreshed gold standard, with attention to upstream NLP-pipeline regressions; and (c) equity surveillance, i.e., per-demographic-stratum accuracy with alerting on >5 pp gaps between strata. None of these are implemented in this work; they are flagged as deployment prerequisites and are not in scope for the present manuscript.
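To make the drift and equity thresholds concrete, a monitoring check along these lines might look as follows. This is a hypothetical sketch only: nothing of the kind is implemented in this work, the function names, the three-period rolling window, and all accuracy values are assumptions made for illustration, and only the 5 pp thresholds come from the text above.

```python
from statistics import mean


def drift_alerts(history, current, threshold_pp=5.0, window=3):
    """Flag categories whose accuracy moved more than threshold_pp from a rolling baseline.

    history: category -> list of past accuracies in percent (oldest first)
    current: category -> latest accuracy in percent
    """
    alerts = {}
    for category, accuracy in current.items():
        recent = history.get(category, [])[-window:]
        if not recent:
            continue  # no baseline yet for this category
        shift = accuracy - mean(recent)
        if abs(shift) > threshold_pp:
            alerts[category] = round(shift, 1)
    return alerts


def equity_gap(stratum_accuracy, threshold_pp=5.0):
    """Return (gap in pp, alert flag) for the spread between best and worst stratum."""
    gap = max(stratum_accuracy.values()) - min(stratum_accuracy.values())
    return round(gap, 1), gap > threshold_pp


# HYPOTHETICAL numbers, for illustration only.
history = {"negation": [80.0, 82.0, 81.0], "change": [50.0, 52.0, 51.0]}
current = {"negation": 74.0, "change": 53.0}
print(drift_alerts(history, current))
print(equity_gap({"stratum_a": 70.0, "stratum_b": 61.0}))
```

With strata as small as those in this cohort, a fixed point threshold would fire on sampling noise alone, so a real program would likely pair it with an interval or minimum-sample rule.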
#### Harm\-pathway analysis\.
The 10 “potentially harmful” items identified by physician adjudication (Appendix [X](https://arxiv.org/html/2605.11143#A24); Table [31](https://arxiv.org/html/2605.11143#A24.T31)) span code status, oncology, and anticoagulation. For governance purposes these can be cast in an FDA-style probability × severity matrix (Table [34](https://arxiv.org/html/2605.11143#A26.T34)). The matrix is illustrative; deployment would require structured FMEA with multidisciplinary review.
Table 34: Illustrative probability × severity matrix for the 10 potentially harmful items observed in the internal 120-paired-item physician adjudication (n = 240 single-condition ratings; Table [31](https://arxiv.org/html/2605.11143#A24.T31)). Probability is the observed rate within the audited subset; deployment-context base rates would differ.
#### Equity gap reframing\.
The 9.9 pp Non-White vs. White accuracy gap on the C4g endpoint is reported descriptively in the demographic table (Appendix [Y](https://arxiv.org/html/2605.11143#A25)) with the disclaimer “severely underpowered (n = 10 Non-White).” We additionally frame this as an equity signal warranting investigation in any future deployment effort, per CHAI/FDA expectations on intended-use-population fairness. The current sample size cannot exclude either a true equity gap or sampling noise; deployment would require demographically diverse multi-site validation with pre-specified per-stratum accuracy targets and stratum-conditional recalibration if gaps persist.
#### Joint Commission–CHAI September 2025 alignment\.
The Joint Commission–CHAI September 2025 “Responsible Use of AI in Health Care” guidance is the operative governance standard for clinical\-AI deployment going forward\. EpiKG is not deployed in any healthcare setting in scope of this paper, and the present manuscript should not be read as a Joint Commission compliance document\. Any institutional deployment of EpiKG \(or a derivative\) would be expected to map its governance, monitoring, and reporting practices to that guidance prior to clinical release\.