Towards Error-Free EHRs: Reasoning-Intensive Consistency Verification Between Clinical Notes and Structured Tables in Electronic Health Records


Summary

This paper introduces EHR-ReasonCon, a reasoning-intensive benchmark for consistency verification between clinical notes and structured tables in electronic health records, and EHR-Inspector, an LLM-based framework that achieves state-of-the-art performance in detecting discrepancies.

arXiv:2605.26463v1


# Towards Error-Free EHRs: Reasoning-Intensive Consistency Verification Between Clinical Notes and Structured Tables in Electronic Health Records
Source: [https://arxiv.org/html/2605.26463](https://arxiv.org/html/2605.26463)
Yeonsu Kwon¹, Jiho Kim¹, Junseong Choi¹, Paloma Rabaey², Minseo Kim¹, Sujeong Im¹, Jeewon Yang¹, Jun-Min Lee¹, Sangji Lee³, Jiwon Kim⁴, Hangyul Yoon¹, Hyunwook Kwon⁵, Edward Choi¹

¹KAIST, ²Ghent University, ³Samsung Medical Center, ⁴Samsung Changwon Hospital, ⁵Asan Medical Center

{yeonsu.k, jiho.kim, edwardchoi}@kaist.ac.kr

###### Abstract

Data consistency between unstructured clinical notes and structured tables in Electronic Health Records (EHRs) is essential for patient safety and clinical decision-making. However, existing work on note–table consistency verification mainly relies on surface-level matching of numeric values or simple events. Such approaches fail to capture the reasoning underlying real-world EHR documentation, including clinical interpretation, event relations, and temporal changes. To address this gap, we introduce EHR-ReasonCon, a reasoning-intensive benchmark for note–table consistency verification. Built on MIMIC-III with expert-guided annotations, it comprises 8,048 entities derived from clinical notes and provides high-quality ground-truth labels. The annotation protocol is supported by specialized table-exploration tools to ensure systematic evidence retrieval and reliable consistency assessment. We also propose EHR-Inspector, an LLM-based framework that segments notes, extracts anchor entities and temporal references, and uses table-exploration tools to verify consistency against structured tables. Evaluated using expert-validated LLM-as-a-judge metrics under harsh and lenient criteria, EHR-Inspector achieves state-of-the-art performance across multiple model backbones. Analyses further demonstrate the effectiveness of its components and highlight differences from human verification.

## 1 Introduction

![Refer to caption](https://arxiv.org/html/2605.26463v1/x1.png)
Figure 1: Overview of reasoning-intensive note–table consistency verification. The examples highlight the need for reasoning-intensive verification that goes beyond surface-level alignment between clinical notes and structured tables.

Within Electronic Health Record (EHR) systems, patient information is documented through two primary modalities: structured tables (e.g., vital signs, prescriptions) and unstructured clinical notes (e.g., physician notes) Seinen et al. ([2024](https://arxiv.org/html/2605.26463#bib.bib37)). Clinicians combine objective findings from structured data with contextual information from notes to guide diagnostic and therapeutic decisions, which are then recorded back into the EHR. Due to this interdependence, reliability and consistency between these data types are critical. However, in practice, discrepancies frequently arise from administratively driven system architectures and documentation practices Payne et al. ([2018](https://arxiv.org/html/2605.26463#bib.bib29)); Villa and Cabezas ([2014](https://arxiv.org/html/2605.26463#bib.bib27)), potentially compromising patient safety and creating legal risks Demsash et al. ([2023](https://arxiv.org/html/2605.26463#bib.bib32)); Tsou et al. ([2017](https://arxiv.org/html/2605.26463#bib.bib31)).

Detecting these discrepancies is therefore essential, yet manual verification is impractical due to its time and cost demands. This has motivated the development of automated approaches. Prior work has investigated inconsistencies between clinical notes and structured tables, primarily focusing on specific domains such as allergies or medications Li et al. ([2015](https://arxiv.org/html/2605.26463#bib.bib18)); Lo et al. ([2022](https://arxiv.org/html/2605.26463#bib.bib20)); Rinott et al. ([2012](https://arxiv.org/html/2605.26463#bib.bib19)). More recently, EHRCon Kwon et al. ([2024](https://arxiv.org/html/2605.26463#bib.bib33)) extended this line of work to relational databases. However, these approaches rely on surface-level verification, such as checking whether numerical values (e.g., WBC 10.0) or discrete events (e.g., administration of vancomycin) mentioned in clinical notes are also recorded in structured tables. While such approaches provide a useful starting point, they often fail to capture the contextual and nuanced nature of real-world clinical documentation.

Real-world EHR documentation fundamentally requires advanced reasoning for accurate note–table consistency verification beyond surface-level alignment. For example, clinical notes often describe interpreted patient states, whereas structured tables record the underlying measurements (see Figure [1](https://arxiv.org/html/2605.26463#S1.F1)-(1)) Gao et al. ([2024](https://arxiv.org/html/2605.26463#bib.bib24)); Raghavan et al. ([2014](https://arxiv.org/html/2605.26463#bib.bib21)). Verifying these statements therefore requires assessing whether the measurements satisfy the clinical criteria supporting the interpretation. Moreover, clinical notes often describe relationships among multiple clinical events (see Figure [1](https://arxiv.org/html/2605.26463#S1.F1)-(2)). Verifying such statements requires checking whether these event relationships are consistently supported by structured records, rather than validating individual table entries in isolation Khetan et al. ([2022](https://arxiv.org/html/2605.26463#bib.bib23)); Wang et al. ([2018](https://arxiv.org/html/2605.26463#bib.bib26)). Furthermore, clinical notes often describe changes in patient status over time and subsequent interventions (see Figure [1](https://arxiv.org/html/2605.26463#S1.F1)-(3)). Verifying such statements requires assessing trends, time spans, and corresponding treatments rather than relying on a single time point in structured tables Pan et al. ([2020](https://arxiv.org/html/2605.26463#bib.bib25)); Yu et al. ([2024](https://arxiv.org/html/2605.26463#bib.bib22)). However, existing approaches fall short in capturing these reasoning-intensive aspects of note–table consistency.

To bridge this critical gap, we introduce EHR-ReasonCon, a reasoning-intensive benchmark for note–table consistency verification. Built on MIMIC-III Johnson et al. ([2016](https://arxiv.org/html/2605.26463#bib.bib17)), the dataset comprises 8,048 entities derived from clinical notes. It is designed to reflect real-world clinical documentation practices, guided by a rigorous annotation protocol developed in close collaboration with four clinical experts (a board-certified radiation oncologist, a board-certified general surgeon, a resident in anesthesiology, and an EHR technician). This protocol was iteratively refined through pilot studies to ensure both clarity and robustness. To support annotation under this protocol, we developed eight specialized table-exploration tools for systematic retrieval of structured evidence. Using these tools, eight annotators familiar with EHR systems carried out the annotation, consulting authoritative medical references as needed. To ensure high reliability, we implemented a multi-stage quality control process, including dual annotation, disagreement resolution, and final adjudication by physicians. As a result, EHR-ReasonCon achieves a high level of inter-annotator agreement (i.e., 0.897 for NER, 0.888 for consistency labeling), establishing a reliable ground truth for evaluating reasoning-based consistency verification.

To address this benchmark, we also propose EHR-Inspector, an LLM-based framework that mirrors the annotation workflow. EHR-Inspector segments clinical notes into event-centric spans, extracts anchor entities along with their temporal context, and verifies consistency with structured EHR data using table-exploration tools. For evaluation, we adopt LLM-as-a-judge evaluators, validated against clinical expert judgments, and assess the framework under two strictness levels: Harsh and Lenient. Experimental results show that EHR-Inspector consistently achieves state-of-the-art performance across multiple model backbones, substantially improving recall and precision. Ablation analyses demonstrate the effectiveness of its components, and reasoning-trace evaluations highlight the differences between LLM and human verification.

## 2 Related Works

##### Consistency Check Between Clinical Notes and Structured Tables in EHRs

Discrepancies between clinical notes and tables have long been recognized as a critical issue that can lead to medical errors Kwon et al. ([2024](https://arxiv.org/html/2605.26463#bib.bib33)); Li et al. ([2015](https://arxiv.org/html/2605.26463#bib.bib18)); Lo et al. ([2022](https://arxiv.org/html/2605.26463#bib.bib20)); Rinott et al. ([2012](https://arxiv.org/html/2605.26463#bib.bib19)). Early studies on consistency checking primarily focused on reconciliation within specific domains to align information across disparate data sources. For example, Rinott et al. ([2012](https://arxiv.org/html/2605.26463#bib.bib19)) detected inconsistencies in sarcoma discharge summaries using an ensemble of classifiers, Li et al. ([2015](https://arxiv.org/html/2605.26463#bib.bib18)) proposed a hybrid ML and rule-based approach for pediatric medication discrepancies, and Lo et al. ([2022](https://arxiv.org/html/2605.26463#bib.bib20)) applied NLP methods to reconcile allergy information between clinical notes and structured lists. However, these approaches typically relied on extracting coded entities from notes and comparing them with structured tables, without defining consistency verification as a general task or releasing datasets. To address this limitation, EHRCon Kwon et al. ([2024](https://arxiv.org/html/2605.26463#bib.bib33)) introduced a benchmark for verifying consistency between clinical notes and relational databases, constructed on MIMIC-III Johnson et al. ([2016](https://arxiv.org/html/2605.26463#bib.bib17)). The dataset includes manual annotations linking entities in clinical notes to entries in multiple tables via SQL query execution. However, EHRCon performs verification in a surface-level manner, assessing whether specific values or simple events in notes match structured records. In contrast, EHR-ReasonCon introduces a reasoning-intensive benchmark for assessing note–table consistency.

##### Tool-Augmented Table Reasoning Agents

Recent work on table QA augments LLM reasoning with external tools for filtering, aggregation, and numerical computation over tabular data Jiang et al. ([2025](https://arxiv.org/html/2605.26463#bib.bib15)); Lu et al. ([2025](https://arxiv.org/html/2605.26463#bib.bib16)); Wang et al. ([2025](https://arxiv.org/html/2605.26463#bib.bib12)); Xiong et al. ([2025](https://arxiv.org/html/2605.26463#bib.bib14)); Zhou et al. ([2025](https://arxiv.org/html/2605.26463#bib.bib13)). Some approaches integrate program generation and execution into the reasoning process Lu et al. ([2025](https://arxiv.org/html/2605.26463#bib.bib16)), while others develop autonomous agents that perform iterative planning, action, and reflection Jiang et al. ([2025](https://arxiv.org/html/2605.26463#bib.bib15)), or organize these capabilities into modular or multi-agent workflows for complex table reasoning Xiong et al. ([2025](https://arxiv.org/html/2605.26463#bib.bib14)); Zhou et al. ([2025](https://arxiv.org/html/2605.26463#bib.bib13)). Spreadsheet agents further extend tool-based reasoning to large multi-table environments and support both question answering and spreadsheet manipulation Wang et al. ([2025](https://arxiv.org/html/2605.26463#bib.bib12)). Inspired by these advances, we propose EHR-Inspector, a tool-augmented framework for reasoning-intensive consistency verification between clinical notes and large-scale structured tables.

## 3 EHR-ReasonCon

EHR-ReasonCon is a high-quality benchmark for reasoning-intensive consistency verification, comprising 8,048 annotated entities linked to 14 tables in MIMIC-III (Chartevents, Labevents, Prescriptions, Inputevents\_cv, Inputevents\_mv, Outputevents, Procedureevents\_mv, Microbiologyevents, Diagnoses\_icd, Procedures\_icd, D\_items, D\_icd\_diagnoses, D\_icd\_procedures, and D\_labitems). Although MIMIC-IV Johnson et al. ([2020](https://arxiv.org/html/2605.26463#bib.bib40)) is newer, we use MIMIC-III because MIMIC-IV lacks diverse note types (e.g., physician and nursing notes) and removes note dates instead of shifting them. These entities are derived from 105 clinical notes across three note types: discharge summaries, physician notes, and nursing notes. Table [1](https://arxiv.org/html/2605.26463#S3.T1) reports statistics for EHR-ReasonCon, with additional details in Appendix [A](https://arxiv.org/html/2605.26463#A1). Figure [2](https://arxiv.org/html/2605.26463#S3.F2) shows the annotation process, and the steps involved are described below.

Table 1: Dataset statistics of EHR-ReasonCon. Con. and Incon. represent the number of entities with consistent and inconsistent labels, respectively.

![Refer to caption](https://arxiv.org/html/2605.26463v1/x2.png)
Figure 2: Overview of the reasoning-intensive annotation process for note–table consistency verification. The pipeline includes protocol and tool development (Stage 0), anchor entity identification (Stage 1), tool-assisted evidence retrieval (Stage 2), and entity-level consistency verification (Stage 3). Finally, a dataset-level quality control step (Stage 4) is performed to resolve disagreements and refine annotations after all entities are annotated.

##### Stage 0: Pre-Annotation Setup

The goal of this stage is to establish the annotation protocol and tools needed to construct a high-quality benchmark reflecting clinical contexts. The protocol, developed with medical practitioners, specifies how narrative expressions in clinical notes are mapped to structured table fields and provides criteria for interpreting temporal trends, handling ambiguous clinical judgments, and ensuring annotation consistency (see Appendix [B](https://arxiv.org/html/2605.26463#A2) for the detailed protocol). However, real-world EHR data contain edge cases not fully anticipated by predefined guidelines. To address this, we conducted a pilot annotation study to refine the protocol. During the pilot phase, we analyzed how annotators searched for relevant evidence in structured tables and formalized recurring search patterns into modular functionalities. This process produced eight table-exploration tools (see Appendix [C](https://arxiv.org/html/2605.26463#A3) for further details) that support efficient exploration of complex structured tables and are organized into three functional categories:

- Entity-to-Table Item Alignment: The tools in this category support the alignment of entities mentioned in clinical notes with corresponding items in structured tables. The same clinical information may appear under different names or levels of abstraction (e.g., “White Blood Cells” and “WBC”), so these tools retrieve potentially relevant table items based on both lexical similarity and conceptual relatedness.
- Database Exploration and Value Profiling: The tools in this category support exploration of the EHR database schema and content. Since clinical concepts may be distributed across multiple tables, the tools support exploration of relevant table groups and summarize typical values for each item (e.g., Stool Amount: \[Small, Medium, Large\] indicates that Stool Amount is categorical rather than numerical), enabling annotators to quickly interpret the role of different fields.
- Temporal and Conditional Record Retrieval: The tools in this category support verification of clinical statements that involve temporal changes or specific conditions. The tools allow annotators to retrieve records from structured tables based on time windows and value constraints, enabling inspection of whether the structured data support trends or events described in clinical notes.
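To make two of these tool categories concrete, the following is a minimal sketch, not the paper's implementation (the actual tools are described in Appendix C). It assumes a toy in-memory record list and hypothetical function names `align_entity` and `retrieve`; alignment here uses lexical similarity only, so it would not link “White Blood Cells” to “WBC”, which needs the conceptual relatedness the paper mentions.

```python
from datetime import datetime
from difflib import SequenceMatcher

# Hypothetical toy rows: (item name, value, chart time).
RECORDS = [
    ("WBC", 10.0, datetime(2130, 1, 1, 8)),
    ("WBC", 14.2, datetime(2130, 1, 2, 8)),
    ("WBC", 9.1, datetime(2130, 1, 5, 8)),
    ("Stool Amount", "Small", datetime(2130, 1, 1, 9)),
]


def align_entity(mention, items, threshold=0.5):
    """Entity-to-table alignment: rank table items by lexical similarity."""
    scored = [
        (SequenceMatcher(None, mention.lower(), item.lower()).ratio(), item)
        for item in items
    ]
    # Keep items above the (assumed) similarity threshold, best match first.
    return [item for score, item in sorted(scored, reverse=True) if score >= threshold]


def retrieve(records, item, start, end, condition=lambda value: True):
    """Temporal and conditional retrieval: time window plus value constraint."""
    return [
        (value, time)
        for name, value, time in records
        if name == item and start <= time <= end and condition(value)
    ]
```

For example, `retrieve(RECORDS, "WBC", datetime(2130, 1, 1), datetime(2130, 1, 3), lambda v: v > 11)` returns only the elevated reading within the window, which is the kind of evidence needed to check a note statement such as a rising white cell count.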

##### Stage 1: Anchor Entity Identification

The goal of this stage is to identify anchor entities in clinical notes that correspond to items in structured tables. These anchors serve as entry points for exploring records in tables such as medications and diagnoses.

##### Stage 2: Tool-Assisted Table Exploration

The goal of this stage is to retrieve evidence from structured tables corresponding to the anchor entities. Annotators review the anchor entities, their attributes (e.g., dosage, route), and their temporal context to understand the clinical trends described in the notes. They then use the predefined tools (from Stage 0) to query structured tables and retrieve records that support the clinical narratives, which are then used as evidence for consistency assessment.

##### Stage 3: Consistency Verification

The goal of this stage is to verify whether the narrative content associated with each anchor entity is supported by the structured evidence. Anchor entities are labeled as Consistent if the corresponding information in the clinical notes is supported by structured records; otherwise, they are labeled as Inconsistent. To ensure reliable labeling, annotators perform reasoning-intensive analysis involving temporal reasoning, commonsense reasoning, and medical interpretation. When necessary, they consult established medical references (UpToDate, MedlinePlus, Cleveland Clinic, Mayo Clinic) or physicians.

##### Stage 4: Annotation Reliability

The goal of this stage is to ensure the integrity and reliability of the dataset through a multi-step verification process. The annotation was conducted by eight trained annotators. For each clinical note, two annotators were randomly assigned to independently perform the annotation and then resolve disagreements through mutual reconciliation. For complex clinical judgments, annotators consulted medical professionals. After the initial annotations were completed, an independent reviewer conducted a review of all 105 clinical notes to ensure dataset-wide consistency. The inter-annotator agreement for NER and consistency labeling was 0.897 and 0.888, respectively (see Appendix [D](https://arxiv.org/html/2605.26463#A4) for details).

![Refer to caption](https://arxiv.org/html/2605.26463v1/x3.png)
Figure 3: Overview of EHR-Inspector. The framework (1) segments clinical notes and resolves temporal references, (2) extracts anchor entities using patient records and ontology guidance, (3) filters entities based on temporal context, and (4) verifies note–table consistency using tool-augmented table exploration.

## 4 EHR-Inspector

We propose EHR-Inspector, an LLM-based framework for verifying consistency between clinical notes and structured tables that mirrors the annotation workflow. Figure [3](https://arxiv.org/html/2605.26463#S3.F3) provides an overview of EHR-Inspector, which operates through a sequence of modules described in the following subsections. The implementation details, including the prompts, are provided in Appendix [E](https://arxiv.org/html/2605.26463#A5).

### 4.1 Note Segmentation and Temporal Reference Extraction

Clinical notes often contain multiple clinical events within a single document, which can hinder LLM reasoning when the entire note is processed as a single context. To address this, we first perform LLM-based note segmentation. Given a clinical note $x$, the LLM partitions it into topic-coherent segments $X = \{x_1, x_2, \ldots, x_T\}$, enabling localized reasoning in each segment. However, segmentation may separate relative temporal expressions (e.g., Postoperative Day 3) from their reference events (e.g., Surgery Date), which may appear in different segments. To maintain consistent temporal interpretation, the LLM extracts global temporal references from the entire note and resolves relative temporal expressions within segments into absolute time.
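The resolution step can be illustrated with a small rule-based sketch. In the paper this is done by an LLM; the function `resolve_relative`, the `references` dictionary, and the convention that postoperative day N falls N days after the surgery date are all assumptions made for illustration.

```python
import re
from datetime import date, timedelta


def resolve_relative(expr, references):
    """Map 'Postoperative Day N' (or 'POD N') to an absolute date.

    `references` holds global temporal references extracted from the whole
    note, e.g. {"surgery": date(...)}. Returns None when the expression or
    its reference event is not recognized.
    """
    match = re.fullmatch(r"(?:postoperative day|pod)\s*#?\s*(\d+)", expr.strip().lower())
    if match is None or "surgery" not in references:
        return None
    # Assumed convention: POD N is N days after the surgery date.
    return references["surgery"] + timedelta(days=int(match.group(1)))
```

With a surgery date taken from one segment, an expression such as “Postoperative Day 3” found in a different segment can then be compared against table timestamps.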

### 4.2 Multi-Stage LLM-Based Entity Extraction

This module aims to extract only note entities that can be aligned with structured table records, rather than all clinical mentions. A simple approach would prompt the LLM with the entire structured item set $L$ and ask it to identify corresponding expressions in the note. However, since $L$ contains over 14,000 items, this is computationally inefficient and may degrade performance. To address this issue, we adopt a two-step extraction strategy to comprehensively identify note entities that can be aligned with structured table items. First, we perform patient-specific extraction using $L_P \subseteq L$, the subset of structured items recorded for patient $P$. Given a note segment and $L_P$, the LLM extracts entities aligned with the patient’s structured records, denoted as $E_{\text{patient}}$. Second, to capture entities that may be absent from the structured records of $P$ but are valid within the broader EHR ontology, we perform ontology-guided coarse-to-fine extraction. Instead of presenting the large complement $L \setminus L_P$, the LLM first selects relevant high-level groups and subgroups, and then extracts entities within the selected subgroups, yielding $E_{\text{ontology}}$. Finally, we obtain the extractable entity set by merging the outputs of the two stages: $E_{\text{extract}} = E_{\text{patient}} \cup E_{\text{ontology}}$. Details are provided below.

##### Patient-specific Entity Extraction

In this stage, the LLM identifies note entities corresponding to structured items recorded for patient $P$. For each note segment $x_t \in X$, the LLM is provided with: (1) the note segment $x_t$, (2) the patient-specific item set $L_P$, and (3) a value distribution profile $\mathrm{TopV}(l)$ for each item $l \in L_P$. The profile $\mathrm{TopV}(l)$ is defined as the top-$K$ most frequent values of $l$ in the overall structured tables, and serves as an auxiliary signal that helps the LLM better understand the semantic meaning and value type of each item. For example, given $\mathrm{TopV}(\mathrm{SBP}) = [110, 120, 130]$, the LLM can infer from the distribution and format of these values that the item SBP is numerical, rather than a categorical or textual attribute. Based on these inputs, the LLM extracts entity names in each segment $x_t$ that correspond to items in $L_P$, and aggregates them across segments to form $E_{\text{patient}} = \bigcup_{x_t \in X} E^{\text{pat}}(x_t)$.
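The value profile is straightforward to compute. The sketch below is one plausible reading of the top-K definition above using `collections.Counter`; the function names `top_v` and `looks_numeric`, the default `k=3`, and the numeric check (which stands in for the inference the LLM makes from the profile) are illustrative assumptions.

```python
from collections import Counter


def top_v(values, k=3):
    """Return the k most frequent values of an item across the tables."""
    return [value for value, _ in Counter(values).most_common(k)]


def looks_numeric(profile):
    """Heuristic stand-in for the LLM's reading of a profile's value type."""
    def is_number(value):
        try:
            float(value)
            return True
        except (TypeError, ValueError):
            return False

    return bool(profile) and all(is_number(value) for value in profile)
```

A profile such as `top_v(["Small", "Medium", "Small"])` fails the numeric check, matching the categorical reading of Stool Amount described in Section 3.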

##### Ontology-guided Hierarchical Entity Extraction

This stage targets entities outside $L_P$ by narrowing the item set ($L \setminus L_P$) through a pre-constructed hierarchical ontology $O$. The ontology consists of 12 high-level groups, $O = \{G_1, \dots, G_{12}\}$. Each group $G_i$ (e.g., Hemodynamics) is divided into subgroups $G_i = \{S_{i,j}\}_{j=1}^{J_i}$, and each subgroup $S_{i,j}$ (e.g., Vitals) is associated with a set of items $L_{i,j}$ (e.g., {Heart rate, SpO2, …}). Using the ontology as a coarse-to-fine search space, for each note segment $x_t$, the LLM first identifies relevant groups $G(x_t) \subseteq O$ and subsequently selects subgroups $S(x_t) \subseteq \bigcup_{G_i \in G(x_t)} G_i$. The LLM then extracts entities $E^{\text{ont}}(x_t)$ that correspond to the items in $L(x_t) = \bigl(\bigcup_{S_{i,j} \in S(x_t)} L_{i,j}\bigr) \setminus L_P$. The final ontology-based entity set is obtained by aggregating entities across all segments $x_t \in X$: $E_{\text{ontology}} = \bigcup_{x_t \in X} E^{\text{ont}}(x_t)$. Details on the ontology construction and the full list of groups and subgroups are provided in Appendix [E.1.2](https://arxiv.org/html/2605.26463#A5.SS1.SSS2).
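The candidate item set $L(x_t)$ is a union over the selected subgroups followed by a set difference. The sketch below shows only that set arithmetic; group and subgroup selection is done by the LLM in the paper and is passed in here as an argument. The toy ontology is hypothetical apart from the Hemodynamics, Vitals, Heart rate, and SpO2 names taken from the text.

```python
# Hypothetical two-level ontology: group -> subgroup -> item set L_{i,j}.
ONTOLOGY = {
    "Hemodynamics": {
        "Vitals": {"Heart rate", "SpO2"},
        "Pressures": {"SBP", "CVP"},  # illustrative subgroup, not from the paper
    },
    "Labs": {  # illustrative group, not from the paper
        "Hematology": {"WBC", "Hemoglobin"},
    },
}


def candidate_items(selected_subgroups, patient_items, ontology=ONTOLOGY):
    """Compute L(x_t): union of items in the selected subgroups, minus L_P."""
    items = set()
    for group, subgroup in selected_subgroups:
        items |= ontology[group][subgroup]
    # Items already recorded for the patient are handled by the first stage.
    return items - patient_items


def merge_entities(e_patient, e_ontology):
    """E_extract is the union of the two stages' outputs."""
    return set(e_patient) | set(e_ontology)
```

Because only the selected subgroups are expanded, the LLM is shown a small slice of the item vocabulary rather than the full complement of the patient's items.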

### 4.3 Temporal Reasoning and Scope Filtering

Structured tables in EHRs typically contain events from the current hospitalization, whereas narrative clinical notes may additionally contain past history or future plans. To avoid incorrectly flagging past or planned entities as inconsistencies, we retain only entities associated with the current hospitalization from $E_{\text{extract}}$. For each entity $e \in E_{\text{extract}}$, an LLM extracts temporal cues (e.g., occurrence time, duration, end time) and assigns a temporal status $C(e) \in \{\text{Past}, \text{Active}, \text{Plan}\}$ using these cues and the global temporal references identified in §[4.1](https://arxiv.org/html/2605.26463#S4.SS1). The scope-filtered set is $E_{\text{scope}} = \{e \in E_{\text{extract}} \mid C(e) = \text{Active}\}$, which is used for consistency verification.
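As a rough illustration of the filter, the sketch below replaces the LLM's status assignment with a simple date rule. The rule itself (before admission is Past, after the note time is Plan, otherwise Active) and the names `temporal_status` and `scope_filter` are assumptions, not the paper's procedure.

```python
from datetime import datetime


def temporal_status(event_time, admit_time, note_time):
    """Rule-based stand-in for the LLM-assigned status C(e)."""
    if event_time < admit_time:
        return "Past"
    if event_time > note_time:
        return "Plan"
    return "Active"


def scope_filter(entities, admit_time, note_time):
    """Return E_scope: names of entities whose status is Active.

    `entities` is a list of (name, resolved event time) pairs.
    """
    return [
        name
        for name, event_time in entities
        if temporal_status(event_time, admit_time, note_time) == "Active"
    ]
```

Only the entities that survive this filter are passed on, so a prior surgery in the history section or a planned follow-up scan is not flagged as missing from the tables.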

### 4.4 Tool-Augmented Consistency Verification

Structured tables in EHRs often contain high-frequency time-series records, making it infeasible to include all relevant structured records in the limited LLM context. To address this, we enable the LLM to interact with the tables through table-exploration tools (defined in §[3](https://arxiv.org/html/2605.26463#S3)) that retrieve relevant subsets of the data. These tools also expose schema and value representations, allowing the LLM to interpret how clinical concepts are recorded in structured form and perform accurate verification for each entity $e \in E_{\text{scope}}$. However, verifying each entity independently can be computationally inefficient, as multiple entities extracted from the same segment $x_t$ often originate from a single clinical observation. For example, the entities ‘Abd’, ‘soft’, and ‘distended’ all refer to the same abdominal exam in the sentence “Abd: soft, distended.” In such cases, independent verification would lead to redundant tool calls and repeated access to the same table entries. To tackle this issue, we introduce a validation cache that stores the $m$ most recently validated entities and their evidence in a sliding window, enabling reuse for related entities. Finally, each entity $e$ is assigned a consistent or inconsistent label based on its alignment with the structured records.
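A sliding window over the last $m$ validated entities maps naturally onto a bounded deque. The sketch below is a minimal illustration; the class name `ValidationCache`, the stored fields, and the choice to treat entities from the same segment as related are assumptions, since the text does not specify how relatedness is determined.

```python
from collections import deque


class ValidationCache:
    """Keep the m most recently validated entities and their evidence."""

    def __init__(self, m=3):
        # A deque with maxlen drops the oldest entry once m is exceeded.
        self._window = deque(maxlen=m)

    def add(self, entity, segment_id, evidence, label):
        self._window.append(
            {"entity": entity, "segment_id": segment_id,
             "evidence": evidence, "label": label}
        )

    def related(self, segment_id):
        """Return cached results from the same segment (assumed relatedness)."""
        return [row for row in self._window if row["segment_id"] == segment_id]

    def __len__(self):
        return len(self._window)
```

In the “Abd: soft, distended.” example, the evidence retrieved for ‘Abd’ would still be in the window when ‘soft’ and ‘distended’ are verified, so the same table rows need not be fetched again.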

## 5 Experiments

### 5.1 Experimental Setting

We partitioned the dataset into test and validation sets with a 4:1 ratio. The test set was used for the main experiments, while the validation set was used to develop EHR-Inspector.

##### Baseline

We compare EHR-Inspector with CheckEHR Kwon et al. ([2024](https://arxiv.org/html/2605.26463#bib.bib33)), the state-of-the-art framework for verifying consistency between clinical notes and structured tables. CheckEHR is an LLM-based eight-stage pipeline that extracts entities from clinical notes and generates SQL queries to validate them against tables. To the best of our knowledge, CheckEHR is the only prior framework that addresses this comparable note–table verification setting, and thus serves as our primary baseline. General-purpose table reasoning agents Jiang et al. ([2025](https://arxiv.org/html/2605.26463#bib.bib15)); Lu et al. ([2025](https://arxiv.org/html/2605.26463#bib.bib16)) are not directly applicable to our task because they assume an explicit question or claim and a predefined table context. In contrast, our task requires inferring verification units, temporal scope, and supporting or missing evidence from long clinical notes, and then grounding them across relational tables.

##### Base LLMs

To ensure a fair comparison, we use four LLMs as base models for both CheckEHR and EHR-Inspector: Gemini 2.5 Flash Comanici et al. ([2025](https://arxiv.org/html/2605.26463#bib.bib11)), a proprietary model known for strong reasoning performance with high cost efficiency; Qwen3-32B Yang et al. ([2025](https://arxiv.org/html/2605.26463#bib.bib10)) and GPT-OSS 20B Agarwal et al. ([2025](https://arxiv.org/html/2605.26463#bib.bib9)), open-source reasoning models; and MedGemma 27B Sellergren et al. ([2025](https://arxiv.org/html/2605.26463#bib.bib8)), an open-source model specialized for medical-domain tasks.

Table 2: Performance comparison between CheckEHR and EHR-Inspector. Results are reported as recall (Rec) and precision (Prec) under both Lenient and Harsh evaluation. Values below 40 are shaded orange and above 40 blue, with intensity indicating distance from 40. Due to API and computational costs, our primary evaluation is based on a single inference pass. To ensure statistical validity, we conducted an additional robustness check on the validation set, performing three independent inferences, each evaluated three times by the LLM-as-a-judge. Detailed results and standard deviations are reported in Appendix [G](https://arxiv.org/html/2605.26463#A7).

### 5.2 Evaluation

##### Metrics

We evaluate the frameworks at the entity level. For each note, we compute Recall, Precision, and F1 and report the average scores. Recall is defined as the number of correctly classified entities divided by the number of ground-truth entities in the note. Precision is defined as the number of correctly classified entities divided by the number of entities recognized by the framework.
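These definitions can be written out as a short sketch. F1 is taken to be the usual harmonic mean of precision and recall, and returning 0.0 when a denominator is zero is an assumption, since the text does not say how empty notes are handled. The function names are illustrative.

```python
def note_scores(n_correct, n_gold, n_predicted):
    """Per-note recall, precision, and F1 from entity counts.

    n_correct:   entities the framework classified correctly
    n_gold:      ground-truth entities in the note
    n_predicted: entities recognized by the framework
    """
    recall = n_correct / n_gold if n_gold else 0.0
    precision = n_correct / n_predicted if n_predicted else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return recall, precision, f1


def average_scores(per_note):
    """Average (recall, precision, F1) tuples over notes."""
    count = len(per_note)
    return tuple(sum(scores[i] for scores in per_note) / count for i in range(3))
```

Averaging per-note scores weights every note equally, so a short note with few entities counts as much as a long discharge summary.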

##### LLM-as-a-judge

Direct rule-based comparison between framework outputs and gold annotations is unreliable due to differences in span boundaries and granularity. For example, in the phrase “lung sounds: clear, no crackles,” a human annotator may treat “clear” and “no crackles” as separate verifications for the entity “lung sounds,” whereas a framework may extract “lung” as the entity and associate both values jointly or separately. Such discrepancies in span boundaries and granularity make rule-based evaluation challenging. To address this issue, we adopt an LLM-as-a-judge evaluator based on Gemini 2.5 Pro Comanici et al. ([2025](https://arxiv.org/html/2605.26463#bib.bib11)), which compares framework outputs and gold annotations semantically. In addition to assessing agreement with expert-guided gold annotations, we further consider the fact that equivalent clinical findings can be expressed in different ways and mapped to different structured EHR fields Cohen et al. ([2019](https://arxiv.org/html/2605.26463#bib.bib50)); Newman-Griffis et al. ([2021](https://arxiv.org/html/2605.26463#bib.bib51)); Rosenbloom et al. ([2011](https://arxiv.org/html/2605.26463#bib.bib49)). For example, when a clinical note states “No rash,” both the gold annotation and the framework output may extract “rash” as the entity, yet map it to different table item–value pairs. The gold annotation may use the table item “Rash” with the value “None,” directly representing the absence of rash, whereas the framework may use the table item “Skin integrity” with the value “intact,” interpreting the same expression as a broader skin-status finding. Although these outputs differ structurally, both capture the same clinical information: no rash was observed.

To evaluate this, we define two complementary evaluation settings: Harsh and Lenient. The Harsh evaluator enforces exact agreement with gold annotations, while the Lenient evaluator accepts clinically plausible variations even when they diverge from gold annotations. To validate the reliability of the LLM-based judgments, a subset of evaluation samples was independently reviewed by human annotators. The Harsh setting achieved 99.46% agreement with author annotations, whereas the Lenient setting, involving subjective clinical judgment, reached 95.35% agreement among four practitioners. We also addressed potential evaluator bias via cross-model validation with GPT-5 Singh et al. ([2025](https://arxiv.org/html/2605.26463#bib.bib1)). Additional details are provided in Appendix [F](https://arxiv.org/html/2605.26463#A6).

### 5\.3 Main Results

Table [2](https://arxiv.org/html/2605.26463#S5.T2) shows that EHR\-Inspector generally outperforms CheckEHR across a range of backbone LLMs under both evaluation settings, demonstrating its effectiveness across diverse model families\. The performance gap becomes more pronounced when using models with strong reasoning capabilities, such as Qwen3 32B or Gemini 2\.5 Flash, indicating that EHR\-Inspector effectively leverages advanced reasoning capabilities\. Stronger LLMs also exhibit a larger gap between Lenient and Harsh evaluation, suggesting that they produce more independent verification rationales that differ from ground\-truth annotations, often capturing clinically valid interpretations beyond the annotated labels\. Interestingly, although MedGemma 27B is specialized for medical tasks, it does not outperform general\-purpose reasoning models and even underperforms in the case of CheckEHR\. This suggests that the task requires not only domain\-specific medical knowledge but also strong structured reasoning over tabular evidence, where improvements in reasoning capability can play a critical role\.

![Refer to caption](https://arxiv.org/html/2605.26463v1/figures/error_type_comparison.png)

Figure 4: Error distribution by model in EHR\-Inspector\.

Error Analysis by Model\. To analyze error cases in EHR\-Inspector across models, we sampled 100 errors per model and categorized them into five types\. As shown in Figure [4](https://arxiv.org/html/2605.26463#S5.F4), premature conclusions were the most frequent across all models, indicating a systematic under\-search issue where models terminate verification before collecting sufficient evidence\. This highlights the need for deeper table exploration mechanisms\. Additionally, stronger models rarely exhibit validation cache errors, implying more reliable tracking of intermediate verification states\. A more detailed error analysis is provided in Appendix [H](https://arxiv.org/html/2605.26463#A8)\.

Table 3: Ablation study on note segmentation and scope filtering\. Bold indicates the highest performance\.

Table 4: Performance by entity extraction method\. Bold and underline denote the best and second\-best results, respectively\.

### 5\.4 Ablation Study

To analyze the contribution of each component, we conduct an ablation study on the key modules of EHR\-Inspector using Gemini 2\.5 Flash as the base LLM\. All experiments and further analyses are performed on the validation set\.

Note Segmentation\. To examine the effect of note segmentation, we remove the segmentation stage and perform entity extraction and verification on the entire clinical note\. As shown in Table [4](https://arxiv.org/html/2605.26463#S5.T4), the F1 score decreases in both settings\. This suggests that segmentation improves reliability by structuring long clinical notes into manageable units\. Without segmentation, the model must process longer contexts at once, which reduces extraction stability and leads to more verification errors\.

Entity Extraction\. To analyze the impact of the entity extraction stage, we compare our method with three NER baselines while keeping the rest of the framework unchanged: Trained NER, BERT Ensemble, and CheckEHR NER\. Trained NER fine\-tunes MedGemma 27B on note–entity pairs from EHR\-ReasonCon; BERT Ensemble aggregates predictions from three biomedical BERT\-based models trained on different datasets Beltagy et al\. \([2019](https://arxiv.org/html/2605.26463#bib.bib4)\); Mattupalli \([2023](https://arxiv.org/html/2605.26463#bib.bib6)\); Uzuner et al\. \([2011](https://arxiv.org/html/2605.26463#bib.bib5)\); and CheckEHR NER follows the entity extraction method used in the original CheckEHR pipeline\. Additional implementation details are provided in Appendix [I](https://arxiv.org/html/2605.26463#A9)\. As shown in Table [4](https://arxiv.org/html/2605.26463#S5.T4), our method achieves the highest F1 scores in both evaluation settings\. The baselines exhibit a Precision–Recall trade\-off: Trained NER achieves high Precision but relatively low Recall due to limited generalization to diverse entity expressions, whereas BERT Ensemble improves Recall by aggregating multiple models but often predicts additional entities not present in the ground truth, reducing Precision\.

Scope Filtering\. To examine the contribution of scope filtering, we conduct an ablation experiment by removing the filtering stage from the framework\. As shown in Table [4](https://arxiv.org/html/2605.26463#S5.T4), removing temporal filtering decreases F1 scores in both evaluation settings\. Without filtering, the framework extracts more entities, increasing Recall but introducing many irrelevant entities that reduce Precision\. As a result, the overall F1 score declines\. These results highlight the role of scope filtering in controlling excessive entity extraction and maintaining a balance between Recall and Precision\.

Tool Types\. Similar to prior studies Qin et al\. \([2023](https://arxiv.org/html/2605.26463#bib.bib2)\); Xu et al\. \([2025](https://arxiv.org/html/2605.26463#bib.bib3)\), the toolset of EHR\-Inspector is designed by humans\. We evaluate its effectiveness by ablating each tool category \(§[3](https://arxiv.org/html/2605.26463#S3)\)\. Interestingly, removing certain tools improved performance, as shown in Table [5](https://arxiv.org/html/2605.26463#S5.T5)\. This can be attributed to differences between the reasoning processes of human annotators and LLMs\. Human annotators, with limited working memory, tend to use tools that search for table items or explore the database structure in order to narrow the search space before examining large tables\. In contrast, LLMs, with much larger working memory \(e\.g\., processing up to 1M tokens\), do not require such space\-reduction strategies\. For LLMs, tools designed to search for items or explore the database structure can introduce unnecessary complexity and lead to premature conclusions\. These findings suggest that agents should dynamically adjust their toolsets to optimize performance\. Detailed experimental results are provided in Appendix [J](https://arxiv.org/html/2605.26463#A10)\.

### 5\.5 Further Analysis

Tool Usage: Annotators vs\. EHR\-Inspector\. We compare the tool usage paths of human annotators and the Gemini 2\.5 Flash\-based EHR\-Inspector\. The results, plotted for frequently occurring tool usage traces during consistency verification, are shown in Figure [5](https://arxiv.org/html/2605.26463#S5.F5)\. Human annotators exhibit relatively limited and less varied tool usage traces, suggesting that they identify optimized tool usage traces and apply them conservatively to minimize risks Machina and Siniscalchi \([2014](https://arxiv.org/html/2605.26463#bib.bib41)\)\. In contrast, EHR\-Inspector exhibits a broader but less controlled exploration of tool usage paths, resulting in more complex and dispersed patterns\. Therefore, the framework could be improved by moving from less controlled exploration to a data\-driven approach that optimizes tool usage traces based on past outcomes\. Detailed trace analysis by model is described in Appendix [K](https://arxiv.org/html/2605.26463#A11)\.
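One way to quantify the contrast between concentrated and dispersed tool usage is to count distinct traces and measure the entropy of their distribution. The sketch below is an illustration under stated assumptions, not the analysis used in the paper: the function `trace_profile`, the tool names, and the toy traces are all hypothetical.

```python
from collections import Counter
from math import log2


def trace_profile(traces):
    """Summarize how concentrated a collection of tool-usage traces is."""
    # Each trace is an ordered sequence of tool names; tuples make them hashable.
    counts = Counter(tuple(t) for t in traces)
    total = sum(counts.values())
    # Shannon entropy in bits: higher values mean more dispersed trace usage.
    entropy = -sum((c / total) * log2(c / total) for c in counts.values())
    return {
        "distinct_traces": len(counts),
        "most_common": counts.most_common(1)[0],
        "entropy_bits": entropy,
    }


# Toy traces with made-up tool names, for illustration only.
human_traces = [["search_item", "get_rows"]] * 3 + [["get_rows"]]
agent_traces = [
    ["get_rows"],
    ["search_item", "get_rows"],
    ["list_tables", "get_rows"],
    ["get_rows", "get_rows"],
]

human_profile = trace_profile(human_traces)
agent_profile = trace_profile(agent_traces)
```

In this toy setup the human traces concentrate on one dominant path, while the agent traces are spread evenly over four paths, which yields a higher entropy.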

Generalization Across Datasets and Database Schemas\. We evaluate EHR\-Inspector’s generalization ability across both different datasets and database schemas\. On EHRCon Kwon et al\. \([2024](https://arxiv.org/html/2605.26463#bib.bib33)\), a surface\-level consistency verification dataset, EHR\-Inspector achieves a Lenient F1 score of 86\.41, outperforming CheckEHR by 24\.99, indicating strong cross\-dataset generalization\. We further assess robustness to both different database schemas and note structures using MIMIC\-IV discharge summaries Johnson et al\. \([2020](https://arxiv.org/html/2605.26463#bib.bib40)\)\. Despite increased complexity from additional tables \(i\.e\., emar and emar\_detail\), Lenient F1 drops only modestly from 71\.03 to 68\.67 \(for MIMIC\-IV, we annotated discharge summaries and compared model performance with the same setting in MIMIC\-III\)\. Additionally, to examine reliance on prior knowledge of the MIMIC database \(see Appendix [L](https://arxiv.org/html/2605.26463#A12)\), we construct a perturbed schema with unseen table and column names\. In this setting, Lenient F1 decreases from 74\.44 to 69\.73, suggesting some benefit from prior familiarity; nevertheless, EHR\-Inspector maintains reasonable performance, demonstrating robustness to both dataset shifts and schema variations\. Detailed results are provided in Appendix [M](https://arxiv.org/html/2605.26463#A13)\.
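The size of these drops can be made explicit with simple arithmetic on the reported Lenient F1 values. The snippet below is only a worked calculation; the helper names are hypothetical, the relative drops are derived here rather than reported in the paper, and the CheckEHR score is inferred under the assumption that the 24.99 gap is measured in F1 points.

```python
def point_drop(before: float, after: float) -> float:
    """Absolute drop in F1 points."""
    return round(before - after, 2)


def relative_drop_pct(before: float, after: float) -> float:
    """Drop expressed as a percentage of the starting score."""
    return round(100 * (before - after) / before, 2)


# Lenient F1 values quoted in the text.
mimic3_f1, mimic4_f1 = 71.03, 68.67          # MIMIC-III vs. MIMIC-IV setting
original_f1, perturbed_f1 = 74.44, 69.73     # original vs. perturbed schema

schema_shift = (point_drop(mimic3_f1, mimic4_f1),
                relative_drop_pct(mimic3_f1, mimic4_f1))
name_perturbation = (point_drop(original_f1, perturbed_f1),
                     relative_drop_pct(original_f1, perturbed_f1))

# Assumption: "outperforming CheckEHR by 24.99" refers to absolute F1 points.
checkehr_ehrcon_f1 = round(86.41 - 24.99, 2)
```

Under these assumptions, the MIMIC-IV setting costs about 2.4 points (roughly 3% relative), while the perturbed schema costs about 4.7 points (roughly 6% relative).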

Table 5: Tool category ablation results under Lenient evaluation metrics\.

![Refer to caption](https://arxiv.org/html/2605.26463v1/x4.png)

Figure 5: Tool usage trace comparison between human annotation and EHR\-Inspector\.

## 6 Conclusion and Limitations

In this study, we introduce EHR\-ReasonCon, a reasoning\-intensive benchmark to verify consistency between clinical notes and structured tables by addressing real\-world EHR documentation challenges\. To tackle this benchmark, we propose EHR\-Inspector, an LLM\-based framework that mirrors the human annotation workflow through tool\-augmented table reasoning\. Experimental results show that EHR\-Inspector achieves state\-of\-the\-art performance across diverse model backbones\.

While our dataset curation and evaluation design were rigorous, certain limitations exist\. The limited number of clinical notes used to build EHR\-ReasonCon may constrain its representativeness across varied clinical scenarios\. Furthermore, its reliance on preprocessed MIMIC data could fail to capture the true complexity of real\-world clinical errors\. Finally, the LLM\-as\-a\-judge paradigm, though highly scalable and consistent with expert consensus, inherently risks introducing model\-driven biases\. To address these limitations, future work should focus on expanding data scale and diversity while incorporating scalable expert\-in\-the\-loop evaluation frameworks to improve overall robustness\.

## References

- \[1\] S\. Agarwal, L\. Ahmad, J\. Ai, S\. Altman, A\. Applebaum, E\. Arbus, R\. K\. Arora, Y\. Bai, B\. Baker, H\. Bao, et al\. \(2025\) Gpt\-oss\-120b & gpt\-oss\-20b model card\. arXiv preprint arXiv:2508\.10925\.
- \[2\] I\. Beltagy, K\. Lo, and A\. Cohan \(2019\) SciBERT: a pretrained language model for scientific text\. In Proceedings of EMNLP\-IJCNLP, pp\. 3615–3620\.
- \[3\] G\. R\. Cohen, C\. P\. Friedman, A\. M\. Ryan, C\. R\. Richardson, and J\. Adler\-Milstein \(2019\) Variation in physicians’ electronic health record documentation and potential patient harm from that variation\. 34\(11\), pp\. 2355–2367\.
- \[4\] G\. Comanici, E\. Bieber, M\. Schaekermann, I\. Pasupat, N\. Sachdeva, I\. Dhillon, M\. Blistein, O\. Ram, D\. Zhang, E\. Rosen, et al\. \(2025\) Gemini 2\.5: pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities\. arXiv preprint arXiv:2507\.06261\.
- \[5\] A\. W\. Demsash, S\. Y\. Kassie, et al\. \(2023\) Health professionals’ routine practice documentation and its associated factors in a resource\-limited setting: a cross\-sectional study\. BMJ Health & Care Informatics 30\(1\), pp\. e100699\. External Links: [Document](https://dx.doi.org/10.1136/bmjhci-2022-100699)\.
- \[6\] Y\. Gao, D\. Mahajan, Ö\. Uzuner, and M\. Yetisgen \(2024\-02\) Clinical natural language processing for secondary uses\. Journal of Biomedical Informatics 150, pp\. 104596\. External Links: [Document](https://dx.doi.org/10.1016/j.jbi.2024.104596)\.
- \[7\] C\. Hsu, S\. Karnwal, S\. Mullainathan, Z\. Obermeyer, and C\. Tan \(2020\) Characterizing the value of information in medical notes\. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp\. 2062–2072\.
- \[8\] C\. Jiang, M\. Cheng, X\. Tao, Q\. Mao, J\. Ouyang, and Q\. Liu \(2025\) Tablemind: an autonomous programmatic agent for tool\-augmented table reasoning\. arXiv preprint arXiv:2509\.06278\.
- \[9\] A\. Johnson, L\. Bulgarelli, T\. Pollard, S\. Horng, L\. A\. Celi, and R\. Mark \(2020\) Mimic\-iv\. pp\. 49–55\.
- \[10\] A\. Johnson, T\. Pollard, and R\. Mark \(2016\-09\) MIMIC\-III Clinical Database\. PhysioNet\. Note: Version 1\.4\. External Links: [Document](https://dx.doi.org/10.13026/C2XW26), [Link](https://doi.org/10.13026/C2XW26)\.
- \[11\] V\. Khetan, M\. I\. Rizvi, J\. Huber, P\. Bartusiak, B\. Sacaleanu, and A\. Fano \(2022\) MIMICause: representation and automatic extraction of causal relation types from clinical notes\. In Findings of the association for computational linguistics: ACL 2022, pp\. 764–773\.
- \[12\] Y\. Kwon, J\. Kim, G\. Lee, S\. Bae, D\. Kyung, W\. Cha, T\. Pollard, A\. Johnson, and E\. Choi \(2024\) EHRCon: dataset for checking consistency between unstructured notes and structured tables in electronic health records\. In Advances in Neural Information Processing Systems \(NeurIPS\) Datasets and Benchmarks Track\.
- \[13\] Q\. Li, S\. A\. Spooner, M\. Kaiser, N\. Lingren, J\. Robbins, T\. Lingren, H\. Tang, I\. Solti, and Y\. Ni \(2015\-05\-06\) An end\-to\-end hybrid algorithm for automated medication discrepancy detection\. BMC Medical Informatics and Decision Making 15\(1\), pp\. 37\. External Links: [Document](https://dx.doi.org/10.1186/s12911-015-0160-8), [Link](https://pmc.ncbi.nlm.nih.gov/articles/PMC4427951/)\.
- \[14\] Y\. Lo, S\. Varghese, S\. Blackley, D\. L\. Seger, K\. G\. Blumenthal, F\. R\. Goss, and L\. Zhou \(2022\) Reconciling allergy information in the electronic health record after a drug challenge using natural language processing\. Frontiers in Allergy 3, pp\. 904923\. External Links: [Document](https://dx.doi.org/10.3389/falgy.2022.904923)\.
- \[15\] X\. Lu, L\. Pan, Y\. Ma, P\. Nakov, and M\. Kan \(2025\-04\) TART: an open\-source tool\-augmented framework for explainable table\-based reasoning\. In Findings of the Association for Computational Linguistics: NAACL 2025, L\. Chiruzzo, A\. Ritter, and L\. Wang \(Eds\.\), Albuquerque, New Mexico, pp\. 4323–4339\. External Links: [Link](https://aclanthology.org/2025.findings-naacl.244/), [Document](https://dx.doi.org/10.18653/v1/2025.findings-naacl.244), ISBN 979\-8\-89176\-195\-7\.
- \[16\] M\. Machina and M\. Siniscalchi \(2014\-12\) Ambiguity and ambiguity aversion\. Vol\. 1\. External Links: ISBN 9780444536853, [Document](https://dx.doi.org/10.1016/B978-0-444-53685-3.00013-1)\.
- \[17\] L\. Mathiesen, T\. B\. M\. Nguyen, I\. Dæhlen, M\. Mowé, and M\. Lea \(2024\) Effect of integrated medicines management on quality of discharge medication information—a secondary endpoint in a randomized controlled trial\. 36\(4\), pp\. mzae100\.
- \[18\] S\. Mattupalli \(2023\) Clinical ner model\. Note: [https://huggingface\.co/blaze999/clinical\-ner](https://huggingface.co/blaze999/clinical-ner)\. HuggingFace model for clinical named entity recognition, fine\-tuned from DeBERTa\-v3\-base\.
- \[19\] D\. Newman\-Griffis, G\. Divita, B\. Desmet, A\. Zirikly, C\. P\. Rosé, and E\. Fosler\-Lussier \(2021\) Ambiguity in medical concept normalization: an analysis of types and coverage in electronic health record datasets\. 28\(3\), pp\. 516–532\.
- \[20\] X\. Pan, B\. Chen, H\. Weng, Y\. Gong, and Y\. Qu \(2020\-07\-27\) Temporal expression classification and normalization from chinese narrative clinical texts: pattern learning approach\. JMIR Med Inform 8\(7\), pp\. e17652\. External Links: ISSN 2291\-9694, [Document](https://dx.doi.org/10.2196/17652), [Link](https://medinform.jmir.org/2020/7/e17652), [Link](https://doi.org/10.2196/17652), [Link](http://www.ncbi.nlm.nih.gov/pubmed/32716307)\.
- \[21\] S\. J\. Patel and C\. P\. Landrigan \(2019\) Communication at transitions of care\. 66\(4\), pp\. 751–773\.
- \[22\] T\. H\. Payne, W\. D\. Alonso, J\. A\. Markiel, K\. Lybarger, R\. Lordon, M\. Yetisgen, J\. M\. Zech, and A\. A\. White \(2018\) Using voice to create inpatient progress notes: effects on note timeliness, quality, and physician satisfaction\. JAMIA open 1\(2\), pp\. 218–226\.
- \[23\] Y\. Qin, S\. Liang, Y\. Ye, K\. Zhu, L\. Yan, Y\. Lu, Y\. Lin, X\. Cong, X\. Tang, B\. Qian, et al\. \(2023\) Toolllm: facilitating large language models to master 16000\+ real\-world apis\. arXiv preprint arXiv:2307\.16789\.
- \[24\] P\. Raghavan, J\. L\. Chen, E\. Fosler\-Lussier, and A\. M\. Lai \(2014\-04\) How essential are unstructured clinical narratives and information fusion to clinical trial recruitment?\. In AMIA Joint Summits on Translational Science Proceedings, Vol\. 2014, pp\. 218–223\.
- \[25\] A\. Rajkomar, E\. Loreaux, Y\. Liu, J\. Kemp, B\. Li, M\. Chen, Y\. Zhang, A\. Mohiuddin, and J\. Gottweis \(2022\) Deciphering clinical abbreviations with a privacy protecting machine learning system\. 13\(1\), pp\. 7456\.
- \[26\] R\. Rinott, M\. Torresani, R\. Bertulli, A\. Goldsteen, P\. Casali, B\. Carmeli, and N\. Slonim \(2012\) Automatic detection of inconsistencies between free text and coded data in sarcoma discharge letters\. In Studies in Health Technology and Informatics, Vol\. 180, pp\. 661–666\. External Links: [Document](https://dx.doi.org/10.3233/978-1-61499-101-4-661)\.
- \[27\] S\. T\. Rosenbloom, J\. C\. Denny, H\. Xu, N\. Lorenzi, W\. W\. Stead, and K\. B\. Johnson \(2011\) Data from clinical notes: a perspective on the tension between structure and flexible documentation\. 18\(2\), pp\. 181–186\.
- \[28\] T\. M\. Seinen, J\. A\. Kors, E\. M\. van Mulligen, and P\. R\. Rijnbeek \(2024\) Using structured codes and free\-text notes to measure information complementarity in electronic health records: feasibility and validation study\. Journal of Medical Internet Research 27\. External Links: [Link](https://api.semanticscholar.org/CorpusID:274208819)\.
- \[29\] A\. Sellergren, S\. Kazemzadeh, T\. Jaroensri, A\. Kiraly, M\. Traverse, T\. Kohlberger, S\. Xu, F\. Jamil, C\. Hughes, C\. Lau, et al\. \(2025\) Medgemma technical report\. arXiv preprint arXiv:2507\.05201\.
- \[30\] A\. Singh, A\. Fry, A\. Perelman, A\. Tart, A\. Ganesh, A\. El\-Kishky, A\. McLaughlin, A\. Low, A\. Ostrow, A\. Ananthram, et al\. \(2025\) Openai gpt\-5 system card\. arXiv preprint arXiv:2601\.03267\.
- \[31\] W\. F\. Styler IV, S\. Bethard, S\. Finan, M\. Palmer, S\. Pradhan, P\. C\. De Groen, B\. Erickson, T\. Miller, C\. Lin, G\. Savova, et al\. \(2014\) Temporal annotation in the clinical domain\. 2, pp\. 143–154\.
- \[32\] A\. Y\. Tsou, C\. U\. Lehmann, J\. Michel, R\. Solomon, L\. Possanza, and T\. Gandhi \(2017\) Safe practices for copy and paste in the ehr: systematic review, recommendations, and novel model for health it collaboration\. Applied Clinical Informatics 8\(1\), pp\. 12–34\. External Links: [Document](https://dx.doi.org/10.4338/ACI-2016-09-R-0150)\.
- \[33\] O\. Uzuner, B\. R\. South, S\. Shen, and S\. L\. DuVall \(2011\) 2010 i2b2/va challenge on concepts, assertions, and relations in clinical text\. Journal of the American Medical Informatics Association 18\(5\), pp\. 552–556\.
- \[34\] L\. B\. Villa and I\. Cabezas \(2014\) A review on usability features for designing electronic health records\. In 2014 IEEE 16th International Conference on e\-Health Networking, Applications and Services \(Healthcom\), pp\. 49–54\. External Links: [Document](https://dx.doi.org/10.1109/HealthCom.2014.7001812)\.
- \[35\] Y\. Wang, L\. Wang, M\. Rastegar\-Mojarad, S\. Moon, F\. Shen, N\. Afzal, S\. Liu, Y\. Zeng, S\. Mehrabi, S\. Sohn, and H\. Liu \(2018\) Clinical information extraction applications: a literature review\. Journal of Biomedical Informatics 77, pp\. 34–49\. External Links: ISSN 1532\-0464, [Document](https://dx.doi.org/10.1016/j.jbi.2017.11.011), [Link](https://www.sciencedirect.com/science/article/pii/S1532046417302563)\.
- \[36\] Z\. Wang, J\. Su, M\. Zhou, H\. Zeng, M\. Jia, X\. Lv, H\. Dong, X\. Ma, S\. Han, and D\. Zhang \(2025\) SheetBrain: a neuro\-symbolic agent for accurate reasoning over complex and large spreadsheets\. arXiv preprint arXiv:2510\.19247\.
- \[37\] T\. Wolf, L\. Debut, V\. Sanh, J\. Chaumond, C\. Delangue, A\. Moi, P\. Cistac, T\. Rault, R\. Louf, M\. Funtowicz, J\. Davison, S\. Shleifer, P\. von Platen, C\. Ma, Y\. Jernite, J\. Plu, C\. Xu, T\. Le Scao, S\. Gugger, M\. Drame, Q\. Lhoest, and A\. Rush \(2020\-10\) Transformers: state\-of\-the\-art natural language processing\. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Q\. Liu and D\. Schlangen \(Eds\.\), Online, pp\. 38–45\. External Links: [Link](https://aclanthology.org/2020.emnlp-demos.6/), [Document](https://dx.doi.org/10.18653/v1/2020.emnlp-demos.6)\.
- \[38\] S\. Xiong, Z\. He, Z\. He, Y\. Zhao, C\. Pan, J\. Zhang, S\. Song, and Y\. Li \(2025\) TableZoomer: a collaborative agent framework for large\-scale table question answering\. Vicinagearth 2\(1\), pp\. 1–23\.
- \[39\] W\. Xu, C\. Huang, S\. Gao, and S\. Shang \(2025\) LLM\-based agents for tool learning: a survey\. Data Science and Engineering 10, pp\. 533–563\. External Links: [Document](https://dx.doi.org/10.1007/s41019-025-00296-9)\.
- \[40\] A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv, et al\. \(2025\) Qwen3 technical report\. arXiv preprint arXiv:2505\.09388\.
- \[41\] D\. Yu, R\. W\. Stidham, and V\. G\. V\. Vydiswaran \(2024\-01\) A systematic temporal extraction pipeline for medical concepts in clinical notes\. In AMIA Annual Symposium Proceedings, Vol\. 2023, pp\. 1314–1323\.
- \[42\] R\. Zhang, F\. Chu, D\. Chen, and X\. Shang \(2018\) A text structuring method for chinese medical text based on temporal information\. 15\(3\), pp\. 402\.
- \[43\] Y\. Zhou, M\. Zhang, K\. Li, M\. Wang, Q\. Liu, Q\. Wang, J\. Liu, F\. Liu, S\. Li, W\. Li, et al\. \(2025\) Mixture\-of\-minds: multi\-agent reinforcement learning for table understanding\. arXiv preprint arXiv:2510\.20176\.

## Supplementary Contents

## Appendix A Dataset Details

EHR\-ReasonCon is a reasoning\-intensive consistency verification dataset built on the MIMIC\-III database\. This section provides detailed information about the tables and columns covered by the dataset\.

### A\.1 Tables and Columns in EHR\-ReasonCon

EHR\-ReasonCon performs consistency checks between three types of clinical notes \(discharge summaries, physician notes, and nursing notes\) and 14 tables from MIMIC\-III\. The tables used for this process are Chartevents, Labevents, Prescriptions, Inputevents\_cv, Inputevents\_mv, Outputevents, Procedureevents\_mv, Microbiologyevents, Diagnoses\_icd, Procedures\_icd, D\_items, D\_icd\_diagnoses, D\_icd\_procedures, and D\_labitems\. To ensure a comprehensive verification process, EHR\-ReasonCon does not impose any restrictions on the scale or types of columns when verifying consistency between the notes and the tables\. The final labeling results in a dataset that includes specific column entries from each of the tables, as detailed in Table [6](https://arxiv.org/html/2605.26463#A1.T6)\.

### A\.2 Three Types of Notes in EHR\-ReasonCon

We construct EHR\-ReasonCon using 105 clinical notes, following the setting of EHRCon \[[12](https://arxiv.org/html/2605.26463#bib.bib33)\]\. To closely reflect real\-world clinical practice, we selected three widely used types of inpatient clinical documentation: discharge summaries, physician notes, and nursing notes \[[7](https://arxiv.org/html/2605.26463#bib.bib42)\], capturing the characteristics of diverse clinical settings\. In addition, we performed data filtering to improve analytical accuracy\. The original notes may include information outside the scope of this study, such as family history or prior visit records\. Therefore, these elements were excluded, and the dataset was refined to focus only on core information directly related to the current hospitalization process\.

### A\.3 Comparison with Surface\-Level Consistency Verification Benchmark

Going beyond a surface\-level approach, EHR\-ReasonCon is designed to faithfully reflect the complex context inherent in actual EHR documentation by integrating three key characteristics: \(1\) Compositional, \(2\) Time\-series, and \(3\) Interpretive Information\. To emphasize the strengths of this approach, we compare it with EHRCon, the only study that clearly defines both the task and a benchmark for verifying consistency between clinical notes and tables\. In the following sections, we explain how EHR\-ReasonCon captures these characteristics in a more reasoning\-intensive manner\.

##### Compositional Clinical Information

EHR\-ReasonCon goes beyond simple numerical values or discrete event comparisons by incorporating a broader range of complex attributes\. Existing benchmarks like EHRCon have primarily focused on fragmented data such as time \(CHARTTIME\) or numerical values \(VALUENUM\)\. In contrast, EHR\-ReasonCon expands the validation scope to 14 tables and 58 columns \(see Table [6](https://arxiv.org/html/2605.26463#A1.T6)\)\. This includes complex attributes that were previously overlooked, such as event endtime \(ENDTIME\), procedure location \(LOCATION\), route of administration \(ROUTE\), and qualitative interpretations \(INTERPRETATION\) stored in the microbiologyevents table, which capture susceptibility results for antibiotic names \(AB\_NAME\)\. As a result, EHR\-ReasonCon captures not just simple matches but the deeper consistency across the entire clinical documentation\.

##### Temporally Evolving Information

EHR data is inherently time\-series data, and time information is a key factor in understanding a patient’s clinical trajectory\. EHRCon recognized this importance and utilized time information to validate the consistency between notes and table data\. However, it simplified the time aspect into three categories: standard format, narrative expression \(e\.g\., admission date\), and missing time information\. While this approach may be useful for verifying data consistency, it falls short of capturing the complex temporal context embedded in clinical records\. Medical records often contain ambiguous time references such as “post\-surgery” or “early morning,” as well as more difficult\-to\-define time spans or intervals, making standardization challenging\. In response, EHR\-ReasonCon has adopted a reasoning\-intensive approach, moving away from rigid classification systems\. By analyzing the overall context of the document, it captures the subtle timing and sequential relationships between individual clinical events, accurately and comprehensively reconstructing fragmented information into a patient’s complete time\-series record\.

Table 6: Tables and columns used in EHR\-ReasonCon\.

##### Clinical Interpreted Information

Clinical notes and tables capture the same patient data in two distinct formats: unstructured narrative and standardized tabular forms\. While previous benchmarks like EHRCon addressed these differences through lexical mappings \(e\.g\., abbreviations and brand names\), actual clinical reasoning demands a deeper interpretive bridge\. For example, confirming a diagnosis of ‘anemia’ in the notes requires the knowledge to interpret the ‘hemoglobin \(Hgb\)’ levels recorded in the tables\. To bridge this gap, this study integrates authoritative clinical references with expert insights\. By incorporating this interpretive layer, it enables a more precise and consistent review that aligns with real\-world clinical practices\. The clinical references used in this process are summarized in the clinical reference table\.

#### A\.3\.1 Statistics

Table [1](https://arxiv.org/html/2605.26463#S3.T1) presents the key statistics of EHR\-ReasonCon\. The dataset consists of 105 clinical notes, split into 83 notes for the test set and 22 notes for the validation set, including 38 discharge summaries, 33 physician notes, and 34 nursing notes, with a total of 8,048 annotated entities from MIMIC\-III\. On average, each note contains 76\.65 entities and 1,953 tokens\. The annotations include 5,848 consistent and 2,200 inconsistent labels, providing a reliable ground truth for consistency verification\. Among the three note types, discharge summaries are the most complex, with the longest average length of 2,789 tokens and the largest number of inconsistent labels \(1,457\)\. This reflects the difficulty of maintaining factual consistency in longitudinal clinical narratives\.

#### A\.3\.2 Fine\-grained Analysis of Note–Table Discrepancies

To better understand the inconsistencies identified in EHR\-ReasonCon, we analyze their patterns across clinical notes and structured tables\.

##### Column\-level Inconsistency Patterns

As shown in Figure [6](https://arxiv.org/html/2605.26463#A1.F6), the most prevalent type of column\-level inconsistency is time error, accounting for 50\.3% of all inconsistent cases\. This indicates that discrepancies between clinical notes and structured tables frequently arise from mismatches in the timing or duration of clinical events\. Temporal information plays a critical role in interpreting the sequence of examinations, medications, procedures, and patient status changes \[[31](https://arxiv.org/html/2605.26463#bib.bib43)\]\. Therefore, inconsistencies in the time column can hinder the reconstruction of a patient’s clinical trajectory and the determination of whether specific treatments occurred before or after particular events \[[42](https://arxiv.org/html/2605.26463#bib.bib44)\]\. Value error is the second most common type, accounting for 30\.5% of inconsistencies\. This category includes clinically important information such as laboratory values, vital signs, medication dosages, and output measurements\. Even when the same clinical entity appears in both the note and the table, discrepancies in its associated time or value may lead to different clinical interpretations\. These findings highlight the importance of detecting both temporal and value\-level inconsistencies to support reliable interpretation of clinical data\.

##### Mismatch Patterns by Number of Columns

Figure [7](https://arxiv.org/html/2605.26463#A1.F7) presents the number of inconsistent columns observed for each entity\. Most inconsistencies are confined to a single column, representing 81\.7% of inconsistent entities\. Cases involving two and three inconsistent columns account for 14\.2% and 3\.2%, respectively, while inconsistencies spanning four or more columns are relatively rare\. This pattern indicates that note–table discrepancies are more likely to arise from partial attribute\-level mismatches than from complete disagreement at the entity level\.

##### Patterns of Inconsistency and Missing Evidence

We further compare column inconsistency with missing evidence, as shown in Figure [8](https://arxiv.org/html/2605.26463#A1.F8)\. The two discrepancy types occur at similar rates, accounting for 51\.3% and 48\.7% of cases, respectively\. This result suggests that note–table discrepancies are not limited to cases where structured table values conflict with narrative documentation\. A comparable proportion of discrepancies also arises when information described in clinical notes lacks corresponding structured evidence\. Notably, the distribution of these two discrepancy types varies across note categories\. In discharge summaries, column inconsistency and missing evidence appear in nearly equal proportions \(49\.9% and 50\.1%, respectively\), suggesting that discharge summaries may contain both information that conflicts with structured tables and information that cannot be verified within them\. In contrast, nursing notes show a higher proportion of missing evidence \(57\.5%\), indicating that bedside observations and nursing interventions are often not fully captured in structured tables\. Conversely, physician notes exhibit a higher proportion of column inconsistency \(58\.8%\), implying that discrepancies in physician documentation are more likely to arise from attribute\-level conflicts between note narratives and existing structured records, rather than from the absence of supporting evidence\. If such omissions or inconsistencies persist, downstream systems that rely solely on structured data may fail to capture important aspects of the patient’s condition or treatment trajectory, or may instead depend on incorrect evidence\. Therefore, it is essential to verify both whether the information described in clinical notes is adequately supported by structured tables and whether existing structured evidence is fully consistent with the narrative documentation\.

##### Inconsistency Patterns by Note Type

Figure [9](https://arxiv.org/html/2605.26463#A1.F9) shows that the distribution of column\-level inconsistency differs across note types\. In discharge summaries, time errors account for the largest proportion, at 54\.1%\. Because discharge summaries provide a comprehensive overview of diagnoses, tests, treatments, and clinical course at the time of discharge, temporal inconsistencies can affect the interpretation of the patient’s overall hospitalization and treatment history \[[21](https://arxiv.org/html/2605.26463#bib.bib46)\]\. Moreover, discharge summaries are used to communicate the hospitalization course during post\-discharge care or patient transfer \[[17](https://arxiv.org/html/2605.26463#bib.bib45)\], making temporal accuracy particularly important for downstream clinical interpretation\. In nursing notes, value errors are the most prevalent inconsistency type, accounting for 52\.3%\. Nursing notes typically include bedside\-level information such as patient observations, vital signs, intake and output, pain assessments, and responses to interventions\. Because these data are directly related to patient monitoring and nursing care, value\-level inconsistencies can lead to incorrect assessments of patient status when structured tables are used as evidence\. In physician notes, time errors and value errors constitute the majority of inconsistencies, accounting for 50\.1% and 37\.9%, respectively\. Physician notes document daily patient progress, interpretations of test results, treatment responses, and medication changes\. Therefore, both temporal and value information are central to verifying whether clinical decisions and treatment trajectories are accurately aligned with evidence from structured tables\.

![[Uncaptioned image]](https://arxiv.org/html/2605.26463v1/figures/append_fig/plot1_error_categories.png)

Figure 6: Distribution of column\-level inconsistency types, with time errors being the most prevalent\.

![[Uncaptioned image]](https://arxiv.org/html/2605.26463v1/figures/append_fig/plot2_col_count_dist.png)

Figure 7: Distribution of the number of inconsistent columns per entity, showing most inconsistencies occur in a single column\.

![[Uncaptioned image]](https://arxiv.org/html/2605.26463v1/figures/append_fig/plot3_incon_vs_missing.png)

Figure 8: Comparison of column inconsistency and missing evidence across note types, illustrating different patterns of discrepancy depending on the note category\.

![[Uncaptioned image]](https://arxiv.org/html/2605.26463v1/figures/append_fig/plot4_error_heatmap.png)

Figure 9: Distribution of inconsistency types by note type, highlighting time errors in discharge and physician notes and value errors in nursing notes\.

## Appendix B Annotation Protocol

### B\.1 Task

Identify entities within clinical notes and compare their associated values with actual tables to detect discrepancies between the notes and the tables\. For efficient labeling, we used Streamlit \([https://streamlit\.io/](https://streamlit.io/)\), and a screenshot of the interface is shown in Figure [10](https://arxiv.org/html/2605.26463#A2.F10)\.

### B\.2 Annotation Protocols

#### B\.2\.1 Definition of Clinical Note

To ensure accurate labeling, it is important to understand the types of clinical notes used in our study\. These include the Discharge Summary, Physician Note, and Nursing Note\.

##### Discharge Summary

The Discharge Summary is written at the time of the patient’s discharge, summarizing the events during their hospitalization\. It may include information from before admission, such as past medical history, as well as details from after discharge\. For example, information about "admission medications" may be recorded to continue treatment for medications the patient was taking prior to hospitalization\.

##### Physician Note

The Physician Note is written by the physician during daily rounds\. It describes the patient’s condition and outlines the next steps for diagnosis and treatment\.

##### Nursing Note

The Nursing Note is written by a nurse and documents the patient’s condition\. These notes are often recorded multiple times a day to provide ongoing updates\.

![Refer to caption](https://arxiv.org/html/2605.26463v1/x5.png)

Figure 10: A screenshot of Streamlit provided to labelers for annotation\.

### B\.3 Definition of Entity

The goal of entity extraction in this study is to extract all entities that can be matched to the item names from 14 clinical tables, including D\_ITEMS, DIAGNOSES\_ICD, D\_ICD\_DIAGNOSES, D\_ICD\_PROCEDURES, D\_LABITEMS, INPUTEVENTS\_MV, CHARTEVENTS, MICROBIOLOGYEVENTS, INPUTEVENTS\_CV, OUTPUTEVENTS, LABEVENTS, PROCEDUREEVENTS\_ICD, PRESCRIPTIONS, and PROCEDUREEVENTS\_MV\.

##### Entity Group Considerations

In particular, when general drug categories such as “Antibiotics” or “Beta\-blockers” or test panels like “ABG” or “Chem\-7” appear in the clinical notes, they are expanded into detailed items for analysis\. For example, if “Chem\-7” is mentioned, the entity values from the LABEVENTS table, such as Sodium, Potassium, Chloride, Bicarbonate, Blood Urea Nitrogen, Creatinine, and Glucose, should be compared with each respective value\. To find these detailed items, searches are limited to four established sources: MedlinePlus \([https://MedlinePlus\.gov/](https://medlineplus.gov/)\), Cleveland Clinic \([https://my\.clevelandclinic\.org/](https://my.clevelandclinic.org/)\), Mayo Clinic \([https://www\.mayoclinic\.org/](https://www.mayoclinic.org/)\), and UpToDate \([https://www\.uptodate\.com/contents/search](https://www.uptodate.com/contents/search)\)\. If the search results are insufficient or additional medical knowledge is required, please consult with a physician\.

##### Entity Mapping in Clinical Notes

In clinical notes, information is often written as free text, which means the entities might not always be clearly listed or categorized in the tables\. For example, a note that mentions “headaches” might correspond to different entries in the database, such as “pain location” or “type of pain\.” Even if these entries aren’t explicitly labeled, it’s essential to match the data to the correct item in the tables whenever possible\. To map entities accurately, it’s important to understand how each entity is stored in the database\. Since information in clinical notes can sometimes be incomplete or unclear, you need to interpret it carefully and know how to connect it correctly\. If something is unclear, it’s important to consult with a physician to ensure the mapping is done correctly\.

##### Entity Scope

To prevent incorrect discrepancies, we do not extract the following types of information as entities \(see Figure [11](https://arxiv.org/html/2605.26463#A2.F11)\):

- Past Information: The past medical history, family history, previous hospital admissions, emergency room visits, and medications administered during hospitalization recorded in clinical notes are all considered past information\. Since this information is not stored in the tables, it is not extracted as entities\.
- Future Plans: Information such as ‘discharge plans’ or ‘next steps to be taken’ mentioned in discharge summaries or physician notes are categorized as future plans\. These plans may not be executed due to changes in the patient’s condition, and therefore, are not considered discrepancies if not carried out\. Future plans are not extracted as entities, including those recorded as ‘plan’ in nursing notes\.
- Information Unrelated to Database: If the information recorded in clinical notes is not linked to a specific record or item in the tables, it is not extracted as an entity\. Additionally, information such as transfer details, which are not related to the tables we handle, or items that only mention total amounts without specifying what those amounts refer to, are also not extracted\.

##### Insurance\-related Entities

Items recorded under “Discharge Diagnoses” or “Major Surgical/Invasive Procedures” in the discharge summary are typically written for insurance claim purposes by hospitals\. Entities listed under these items should be limited to discrepancy checks related to insurance claims, specifically for PROCEDUREEVENTS\_ICD, DIAGNOSES\_ICD, D\_ICD\_DIAGNOSES, and D\_ICD\_PROCEDURES\.

![Refer to caption](https://arxiv.org/html/2605.26463v1/x6.png)

Figure 11: Scope of EHR\-ReasonCon\.

### B\.4 Definition of Entity Attribution

##### Time

Structured tables in EHR are essentially time\-series data, making the timing of events crucial for understanding a patient’s clinical journey\. When reviewing clinical notes, it is important to determine the time an event occurred\. For example, if a note says “3 days after surgery,” the surgery date should be extracted, converted into a standard time format, and then used to calculate the exact date that is 3 days after the surgery\. The key is to carefully consider the context of the note and extract the most accurate time possible\. While the guidelines below are helpful, the main goal is to interpret the context of the note and determine the most appropriate timestamp\.

- Handling Ambiguous Time Information: If explicit time information is not provided, the time is inferred based on the note’s nature\. For example, a discharge summary typically summarizes the patient’s admission and discharge records\. If the exact time is unclear, the entities in the discharge summary should be checked against the tables within the patient’s hospitalization period, and if they match, it can be considered consistent\. Physician or nursing notes are usually written daily, so in this case, the charting date is treated as the event date\. That is, if the entity information is recorded in the tables on the charting date, it can be considered consistent\.
- Mapping Time Based on Admission and Discharge Dates: When specified as “admission date” and “discharge date,” the time expression may vary depending on the physician\. Therefore, a reasonable medical range should be applied, for instance, one day before admission and one day before discharge\. Consistency checks should be conducted within this time frame\.
- Medication Timing: For medication\-related tables, such as inputevents or prescription tables, both the start and end times are recorded\. If both the start and end times are explicitly provided, these should be checked against the tables’ startdate or starttime and enddate or endtime fields to ensure they match accurately\. If a duration is recorded in the clinical note, verify that the medication was administered correctly during the specified period\.
- Verifying Time\-Related Terms in EHR: When terms like “morning” or “midnight” are used in a clinical note, verify whether the records exist accurately within the corresponding time frame\.
- Note: While these guidelines should be followed, the main goal is to understand the context of the note and extract the most accurate timestamp possible\.

### B\.5 Discrepancy Detection Process

The core of this task is to compare the entities extracted from clinical notes with the table records to check for discrepancies\. To do this, the following three main approaches can be used:

##### Matching Based on Common Sense

Even if an entity in the clinical note does not exactly match the one in the tables, it can still be considered a match based on common sense\. For example, the clinical note may mention “hand,” while the tables record “finger\.” Since both refer to the same part of the body, they can be considered a match\.

##### Matching Based on Medical Knowledge

Clinical notes describe a patient’s condition in free text, while the tables use a standardized format\. For instance, if the note says ‘edema 2\+,’ it might be recorded in the tables as ‘palpable edema\.’ In such cases, it is essential to understand the medical meaning of these terms \(e\.g\.,2\+ and palpable\) and refer to authoritative sources such as UpToDate, MedlinePlus, Mayo Clinic, and Cleveland Clinic to make an accurate match\. If these sources do not provide sufficient information, it is important to consult with a physician for clinical interpretation\.

##### Exact Matching

Some clinical note entries may be copied from the tables based on medical practices\. In this case, the entities in the clinical note may match the table records exactly\. For example, if the note mentions a WBC count of 10\.0, the same value may appear in the tables\. However, EHR system errors could lead to discrepancies, so it’s crucial to carefully check for consistency\.

### B\.6 Single or Multiple Row Matching

##### Single Row Matching

Some entities can be compared with a single row in the tables to check for discrepancies\. This applies to events that occur at a specific point in time\. For example, if the note says “WBC stable,” and the tables show a stable value, it can be considered a match\. In this case, as the event happens at a single point in time, finding this record in the tables once would be enough to confirm the match\. However, time should also be carefully considered in these cases\.

##### Multiple Row Matching

Clinical notes often describe a patient’s condition over time\. When this happens, the trend across multiple records in the tables needs to be checked\. For instance, if the note says, “No fever from hospital day 3 to 5, but fever started on day 6,” the table records must show no fever from days 3\-5, and the fever should be recorded starting from day 6\.

- Note: Blood pressure \(BP\) values may be written as a range, for example 100/60 \(80\) \- 140/100 \(120\)\. In this notation, the reading before the hyphen is the minimum blood pressure, the reading after the hyphen is the maximum blood pressure, and the value in parentheses is the mean value\. The notation therefore records the mean together with the minimum and maximum of a patient’s blood pressure measurements\. When comparing with table data, the mean value should be searched as the mean BP in the table, while the minimum and maximum values should be considered as the BP measurement range\.
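The range notation in the note above can be split into its components before comparing against table rows. The following is a minimal sketch, assuming the notation always follows the pattern `min_sys/min_dia (mean) - max_sys/max_dia (mean)`; the names `BPReading`, `BPRange`, and `parse_bp_range` are hypothetical and not part of the annotation tooling.

```python
import re
from typing import NamedTuple


class BPReading(NamedTuple):
    systolic: int
    diastolic: int
    mean: int


class BPRange(NamedTuple):
    minimum: BPReading
    maximum: BPReading


# One reading looks like "100/60 (80)"; a range is two readings joined by a hyphen.
_READING = r"(\d+)\s*/\s*(\d+)\s*\((\d+)\)"
_RANGE = re.compile(rf"^\s*{_READING}\s*-\s*{_READING}\s*$")


def parse_bp_range(text: str) -> BPRange:
    """Parse a note-style BP range into minimum and maximum readings."""
    match = _RANGE.match(text)
    if match is None:
        raise ValueError(f"unrecognized BP range notation: {text!r}")
    values = [int(group) for group in match.groups()]
    return BPRange(BPReading(*values[:3]), BPReading(*values[3:]))


bp = parse_bp_range("100/60 (80) - 140/100 (120)")
# bp.minimum and bp.maximum bound the systolic/diastolic rows to look up;
# the .mean fields are the values to compare against the mean BP item.
```

Keeping the mean separate from the systolic and diastolic bounds mirrors the guideline: each component is checked against a different table item.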

### B\.7 Example

- WBC 20\.0 \*
  - At this point, you need to confirm that the numeric value of WBC is 20\.0, and that the FLAG column correctly shows the value as “ABNORMAL” in the table\.
- Anemia on admission
  - Here, you need to check that the hemoglobin value \(valuenum\) listed in the table falls within the range indicating anemia, from the day before admission to the day after admission\.
- Sputum culture identifying Streptococcus sensitive to Cefazolinex\.
  - In this case, the table should show that the specimen is “sputum” and the organism is “Streptococcus”\. Additionally, you need to confirm that the sensitivity test result \(interpretation\) for “Cefazolinex” \(ab\_name\) shows “sensitive\.”

### B\.8 Consistency Check

Labeling should be strictly limited to consistent and inconsistent\. A label of consistent indicates that all information described in the clinical note is accurately aligned with the table\. If even a single column contains conflicting information or if there is no supporting record in the table, the case should be labeled as inconsistent\. In addition, please specify the rows in the table that serve as evidence for your consistent or inconsistent decision\.

Notes for Consistency Checking:

- Use of medical knowledge: If medical knowledge is used to make a judgment, please provide the corresponding reference\. Examples of medical knowledge used in EHR\-ReasonCon are provided in Table 7\.
- Use of commonsense knowledge: If commonsense knowledge is used, please explicitly state that commonsense knowledge was applied\. For example, if the note states that the patient injured their right arm, while the table records an injury to the right finger, this may be considered consistent based on commonsense knowledge\.

Table 7: Clinical Reference Table

| Subject | Reference | Content |
| --- | --- | --- |
| Normal Heart Rate for Adults \(BPM\) | Mayo Clinic | A normal resting heart rate for adults is 60–100 beats per minute; well\-trained athletes may have lower rates\. |
| Regular Temperature | Mayo Clinic | Around 37°C is normal, and 38°C or higher is considered a fever\. |
| Stool Color | Mayo Clinic | Stool color is mainly influenced by diet and bile, and colors such as brown, green, and yellow are generally considered normal, whereas bright red or black stool may indicate possible bleeding\. |
| WBC Normal Range | MedlinePlus | The normal range is approximately 4,500 to 11,000 per microliter; lower levels may indicate weakened immunity or underlying disease, while higher levels may suggest infection, inflammation, or cancer\. |
| Levofloxacin | MedlinePlus | Levofloxacin is an antibiotic used to treat bacterial infections, not viruses like colds or flu, and carries risks of serious side effects that require caution\. |
| Respiratory Range | Cleveland Clinic | RR is normally 12–18 breaths per minute, and values below 12 or above 25 at rest are considered abnormal\. |
| Melena | Cleveland Clinic | Melena \(black stool\) is a symptom of internal bleeding, usually in your upper gastrointestinal \(GI\) tract\. The blood turns black as it travels through your digestive system before coming out in your poop\. |
| Tachypnic | Cleveland Clinic | Tachypnea is quick, shallow breathing\. This makes you feel like you’re not getting enough air\. This symptom can affect anyone at any age and is common among newborns and people with respiratory conditions\. Treating the underlying cause prevents this symptom\. |
| Normal Sinus Rhythm | UpToDate | NSR is a normal rhythm originating from the sinoatrial node, while sinus tachycardia has the same origin but with a heart rate above 100 bpm\. |
| Hypertension Criteria | MedlinePlus | Hypertension is when blood pressure is consistently 130/80 mmHg or higher\. |
| Furosemide | MedlinePlus | Furosemide is a diuretic \(“water pill”\) used to treat high blood pressure and reduce excess fluid \(edema\) in the body\. |
| Blood Oxygen Level | Cleveland Clinic | Normal blood oxygen saturation is 95–100%; below 92% requires caution, and 88% or lower is an emergency level\. |
| Clobetasol Ointment | Cleveland Clinic | Clobetasol propionate is an ointment you can rub on your skin to treat eczema and psoriasis\. It reduces swelling, redness, itching and rashes caused by these skin conditions\. It’s a type of topical steroid medication\. Brand names of this medication are Cormax, Embeline and Temovate\. |
| T3 normal range | Cleveland Clinic | The normal T3 range in adults is 79 to 165 ng/dL for total T3 and 2\.3 to 4\.1 pg/mL for free T3\. |
| Atrial fibrillation \(AFib\) | Mayo Clinic | Atrial fibrillation, also called AFib\. This is the most common type of tachycardia\. Chaotic, irregular electrical signals start in the upper chambers of the heart, called the atria\. |
| Beta Blockers | Cleveland Clinic | Beta\-blockers are a class of medicines most commonly used for problems involving your heart and circulatory system\. They can also help treat conditions related to your brain and nervous system\. Beta\-blockers work by slowing down certain types of cell activity\. This can help manage your blood pressure, heart rate and more\. |

## Appendix C Table\-Exploration Tools

This process produced eight table\-exploration tools that support efficient exploration of complex EHR databases and are organized into three functional categories:

### C\.1 Entity\-to\-Table Item Alignment

The tools in this category support the alignment of entities mentioned in clinical notes with corresponding items in structured tables\. The same clinical information may appear under different names or levels of abstraction \(e\.g\., “White Blood Cells” and “WBC”\), so these tools retrieve potentially relevant table items based on both lexical similarity and conceptual relatedness\.

##### Lexical\_Search

This tool employs an N\-gram–based approach to retrieve items from structured tables that are lexically similar to a given query entity \(see Figure [12](https://arxiv.org/html/2605.26463#A3.F12)\-\(1\)\)\. Rather than relying solely on simple string matching, it incorporates the C4\-WSRS medical abbreviation dataset \[[25](https://arxiv.org/html/2605.26463#bib.bib47)\], allowing abbreviations such as “WBC” or “BP” to retrieve their full forms, “White Blood Cell” and “Blood Pressure\.” This enables more accurate retrieval by accounting not only for surface\-level text similarity but also for abbreviation expansion\.
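To make the idea concrete, the sketch below ranks table items by character trigram overlap after expanding abbreviations. It is an illustration under stated assumptions rather than the tool's implementation: the Jaccard score, the trigram size, and the two-entry `ABBREVIATIONS` dictionary (a toy stand-in for the abbreviation dataset) are all choices made here for the example.

```python
def char_ngrams(text: str, n: int = 3) -> set[str]:
    """Return the set of character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower().strip()} "
    return {padded[i:i + n] for i in range(max(len(padded) - n + 1, 1))}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity between two n-gram sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy stand-in for an abbreviation resource; entries are illustrative only.
ABBREVIATIONS = {"wbc": "white blood cell", "bp": "blood pressure"}


def lexical_search(query: str, items: list[str], top_k: int = 3, n: int = 3):
    """Rank items by the best n-gram overlap with the query or its expansion."""
    expanded = ABBREVIATIONS.get(query.lower().strip(), query)
    candidates = {query, expanded}
    scored = []
    for item in items:
        item_grams = char_ngrams(item, n)
        score = max(jaccard(char_ngrams(c, n), item_grams) for c in candidates)
        scored.append((score, item))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(item, round(score, 3)) for score, item in scored[:top_k]]


ITEMS = ["White Blood Cells", "Blood Pressure systolic", "Hemoglobin"]
wbc_hits = lexical_search("WBC", ITEMS)
bp_hits = lexical_search("BP", ITEMS)
```

Without the expansion step, "WBC" shares no trigrams with "White Blood Cells", which is why abbreviation handling matters for purely lexical retrieval.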

##### Semantic\_Search

This tool supports semantic search to capture similarities that are difficult to detect with lexical methods alone \(see Figure [12](https://arxiv.org/html/2605.26463#A3.F12)\-\(2\)\)\. It leverages the Gemini 2\.5 Flash model to identify items that are semantically similar to a given query entity by analyzing contextual meaning rather than surface\-level text\. For example, the term “alert,” which describes a patient’s condition, is conceptually related to “level of consciousness,” but this relationship may not be identified through N\-gram matching\. This tool addresses this limitation by considering context and meaning, linking expressions that differ lexically but represent the same or similar clinical concepts\.

![Refer to caption](https://arxiv.org/html/2605.26463v1/x7.png)

Figure 12: Conceptual illustration comparing N\-gram–based lexical search and semantic search, highlighting how string similarity versus conceptual relevance leads to different retrieval results\.

![Refer to caption](https://arxiv.org/html/2605.26463v1/x8.png)

Figure 13: Conceptual illustration comparing Get Item Value Distribution and Analyze Category Trend\.

![Refer to caption](https://arxiv.org/html/2605.26463v1/x9.png)

Figure 14: Conceptual illustration comparing Get Item Status History and Get Item Value History\.

![Refer to caption](https://arxiv.org/html/2605.26463v1/x10.png)

Figure 15: Conceptual illustration comparing Analyze Value Trend and View General Timeline\.

### C\.2 Database Exploration and Value Profiling

The tools in this category support exploration of the EHR database schema and content\. Since clinical concepts may be distributed across multiple tables, the tools support exploration of relevant table groups and summarize typical values for each item, enabling annotators to quickly interpret the role of different fields\.

##### Get\_Item\_Value\_Distribution

This tool provides insight into the distribution of values associated with specific items within tables \(see Figure [13](https://arxiv.org/html/2605.26463#A3.F13)\-\(3\)\)\. It allows users to view the top K most frequent values for each item, helping to characterize the nature of the data\. In this study, K was set to 10\. For instance, if “SBP” values frequently appear as 110, 120, or 130, this indicates that the item represents continuous numerical data\. This tool enables users to determine whether an item is numerical, categorical, or follows a specific pattern, and to assess whether the retrieved item appropriately reflects the information recorded in clinical notes\.
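A top-K value profile of this kind can be computed with a frequency counter. The sketch below assumes records arrive as `(item_name, value)` string pairs; the function names and the 0.8 threshold used to call an item numeric are hypothetical choices for illustration.

```python
from collections import Counter


def top_k_values(records, item: str, k: int = 10):
    """Return the k most frequent values recorded for an item, with counts."""
    counts = Counter(value for name, value in records if name == item)
    return counts.most_common(k)


def _is_number(value: str) -> bool:
    try:
        float(value)
        return True
    except ValueError:
        return False


def looks_numeric(top_values, threshold: float = 0.8) -> bool:
    """Heuristic: treat an item as numeric if most occurrences parse as numbers."""
    total = sum(count for _, count in top_values)
    if total == 0:
        return False
    numeric = sum(count for value, count in top_values if _is_number(value))
    return numeric / total >= threshold


records = [
    ("SBP", "110"), ("SBP", "120"), ("SBP", "110"), ("SBP", "130"),
    ("Orientation", "Oriented x3"), ("Orientation", "Oriented x3"),
]
sbp_profile = top_k_values(records, "SBP")
orientation_profile = top_k_values(records, "Orientation")
```

Here `looks_numeric(sbp_profile)` holds while `looks_numeric(orientation_profile)` does not, which is the kind of distinction the value profile is meant to surface.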

##### Analyze\_Category\_Trend

This tool helps users understand the structure of items across multiple tables in a database \(see Figure [13](https://arxiv.org/html/2605.26463#A3.F13)\-\(4\)\)\. Since items are distributed across tables based on their characteristics, it is not always intuitive to determine where a specific item resides\. This tool analyzes how each table is organized into categories and what items belong to each category, providing insight into the location and context of entities\. For example, the chartevents table may include categories such as Labs and General\. The Labs category contains items like “anion gap” or “CK\-MB,” while the General category includes items related to consciousness, such as “Level of Consciousness” or “Oriented\.” This helps users more efficiently identify the appropriate table for a given entity\.

### C\.3 Temporal and Conditional Record Retrieval

The tools in this category support verification of clinical statements that involve temporal changes or specific conditions\. The tools allow annotators to retrieve records from structured tables based on time windows and value constraints, enabling inspection of whether the structured data support trends or events described in clinical notes\.

##### Get\_Item\_Status\_History

This tool enables users to examine entity information over time \(see Figure [14](https://arxiv.org/html/2605.26463#A3.F14)\-\(5\)\)\. Time can be specified in three different ways to refine the search\. First, users can query based on an exact standard timestamp, such as “06\-24,” to check whether records exist at a specific point in time\. This approach is suitable when explicit time information is directly recorded in the data\. Second, this tool supports searches based on time expressions, such as “admission,” which are described narratively in clinical notes\. Because these expressions do not correspond to precise timestamps and must be interpreted from context, the search is performed by defining a time window around the inferred point, typically extending from one day before to one day after the reference time, in order to retrieve relevant surrounding records\. Third, this tool allows searches based on duration\. Users can define a start time \(time1\) and an end time \(time2\), and for more robust retrieval, the search range is typically expanded to include the period from one day before the start time to one day after the end time, ensuring that all relevant data within the interval are captured\. For example, when a user searches \(Admission, spo2\) using Get\_Item\_Status\_History, the tool retrieves spo2 results recorded during that admission\.
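The three time modes reduce to three ways of building a search window. The following sketch assumes a fixed one-day padding, as described above, and records shaped as `(timestamp, value)` pairs; the function names are hypothetical and the real tool's interface may differ.

```python
from datetime import date, datetime, time, timedelta

PAD = timedelta(days=1)  # assumed padding, following the one-day window in the text


def exact_window(day: date) -> tuple[datetime, datetime]:
    """Mode 1: an explicit date covers that calendar day only."""
    return datetime.combine(day, time.min), datetime.combine(day, time.max)


def narrative_window(reference: datetime) -> tuple[datetime, datetime]:
    """Mode 2: a narrative anchor (e.g. admission) padded by one day each side."""
    return reference - PAD, reference + PAD


def duration_window(time1: datetime, time2: datetime) -> tuple[datetime, datetime]:
    """Mode 3: an interval widened by one day before time1 and after time2."""
    if time2 < time1:
        raise ValueError("time2 must not precede time1")
    return time1 - PAD, time2 + PAD


def records_in_window(records, window):
    """Keep (timestamp, value) records whose timestamp falls inside the window."""
    start, end = window
    return [record for record in records if start <= record[0] <= end]


admission = datetime(2100, 1, 2, 14, 30)
spo2 = [
    (datetime(2100, 1, 1, 20, 0), 95),
    (datetime(2100, 1, 2, 15, 0), 97),
    (datetime(2100, 1, 5, 9, 0), 93),
]
near_admission = records_in_window(spo2, narrative_window(admission))
```

The padding trades precision for recall: a wider window retrieves more candidate rows, leaving the final consistency judgment to the verifier.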

##### Get\_Item\_Value\_History

This tool supports more fine\-grained queries than Get\_Item\_Status\_History \(see Figure [14](https://arxiv.org/html/2605.26463#A3.F14)\-\(6\)\)\. It allows users to specify not only the item but also its associated values as search conditions\. By using operators such as “more,” “less,” and “between,” users can define value ranges more precisely and retrieve data that meet specific criteria\. For example, when a user searches \(Admission, spo2, 90\[more\]\) using Get\_Item\_Value\_History, the tool retrieves only spo2 results greater than 90 recorded during that admission\.
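The value operators can be illustrated with a small predicate. This is a sketch only: "more" and "less" are treated as strict comparisons to match the "greater than 90" example, while the inclusive bounds for "between" are an assumption, and `value_matches` and `filter_by_value` are hypothetical names.

```python
def value_matches(value: float, operator: str, *bounds: float) -> bool:
    """Apply a 'more', 'less', or 'between' condition to a single value."""
    if operator == "more" and len(bounds) == 1:
        return value > bounds[0]
    if operator == "less" and len(bounds) == 1:
        return value < bounds[0]
    if operator == "between" and len(bounds) == 2:
        low, high = bounds
        return low <= value <= high  # inclusive bounds are an assumption
    raise ValueError(f"unsupported condition: {operator!r} with bounds {bounds!r}")


def filter_by_value(values, operator: str, *bounds: float):
    """Keep only the values that satisfy the condition."""
    return [v for v in values if value_matches(v, operator, *bounds)]


spo2_values = [88, 90, 93, 97]
above_90 = filter_by_value(spo2_values, "more", 90)
in_range = filter_by_value(spo2_values, "between", 90, 95)
```

In practice such a filter would be applied after the time window has already narrowed the rows for the item.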

##### Analyze\_Value\_Trend

This tool enables the analysis of value trends over time, independent of specific items \(see Figure [15](https://arxiv.org/html/2605.26463#A3.F15)\-\(7\)\)\. Rather than focusing on values at a single time point, it examines how values evolve across time, capturing patterns such as increases, decreases, or stability\. Users can specify a time range to analyze trends within a particular interval, allowing for a more contextual understanding of value changes\. In addition to absolute values, this tool considers dynamic characteristics such as the rate of change and variability\. It is designed to move beyond item\-specific status queries and instead support a broader understanding of temporal value patterns across the data\. For example, when a user searches \(Admission, 90\[more\]\) using Analyze\_Value\_Trend, the tool retrieves all results greater than 90 recorded during that admission\.
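One simple way to summarize direction, rate of change, and variability for a series of timestamped values is sketched below. The source does not specify how the tool computes these quantities, so the first-to-last comparison, the `tolerance` parameter, and the name `summarize_trend` are assumptions made purely for illustration.

```python
from datetime import datetime
from statistics import pstdev


def summarize_trend(series, tolerance: float = 1.0) -> dict:
    """Summarize a list of (timestamp, value) pairs as direction, rate, variability."""
    if len(series) < 2:
        raise ValueError("at least two observations are needed to describe a trend")
    ordered = sorted(series)  # tuples sort by timestamp first
    (t_first, v_first), (t_last, v_last) = ordered[0], ordered[-1]
    hours = (t_last - t_first).total_seconds() / 3600
    delta = v_last - v_first
    if abs(delta) <= tolerance:
        direction = "stable"
    elif delta > 0:
        direction = "increasing"
    else:
        direction = "decreasing"
    return {
        "direction": direction,
        "rate_per_hour": delta / hours if hours else 0.0,
        "variability": pstdev([value for _, value in ordered]),
    }


spo2_series = [
    (datetime(2100, 1, 1, 16, 0), 88.0),
    (datetime(2100, 1, 1, 8, 0), 96.0),
    (datetime(2100, 1, 1, 12, 0), 92.0),
]
trend = summarize_trend(spo2_series)
```

A first-to-last comparison is deliberately crude; a fitted slope or a per-interval analysis would be more robust to noisy measurements.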

##### View\_General\_Timeline

This tool allows users to retrieve all records associated with a specific time point within a selected table \(see Figure [15](https://arxiv.org/html/2605.26463#A3.F15)\-\(8\)\)\. For example, if “2026\-05\-03” is set as the reference time for the chartevents table, all patient records documented in the chartevents table on that date can be retrieved\.

## Appendix D Inter Annotator Agreement

After completing the annotation process, we measured inter\-annotator agreement by comparing the annotations produced by two annotators for each note\. For entity recognition, we defined recall and precision based on the proportion of overlapping entities identified by the two annotators, and computed the F1 score as their harmonic mean\. The resulting F1 score was 0\.897, indicating a high level of agreement\. In addition, for entities extracted by both annotators, we compared their consistency labels and found an agreement rate of 0\.888\. These high agreement scores support the overall quality and reliability of the annotation\.
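The two agreement measures can be illustrated with a short sketch. It assumes each annotator's output is a mapping from entity to consistency label and that "overlapping" means an exact match on the entity key; the paper does not state its matching rule, so both are assumptions, as are the function name and the toy data.

```python
def annotator_agreement(ann_a, ann_b):
    """Compare two annotators' {entity: consistency_label} mappings.

    Annotator A is treated as the reference for recall and B for
    precision (an arbitrary, symmetric-in-F1 choice).
    """
    shared = ann_a.keys() & ann_b.keys()
    precision = len(shared) / len(ann_b) if ann_b else 0.0
    recall = len(shared) / len(ann_a) if ann_a else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    # Label agreement is measured only on entities both annotators extracted.
    label_agreement = (
        sum(ann_a[e] == ann_b[e] for e in shared) / len(shared) if shared else 0.0
    )
    return {"precision": precision, "recall": recall,
            "f1": f1, "label_agreement": label_agreement}


# Hypothetical, fictional annotations for one note.
annotator_a = {"spo2": "Consistent", "abd": "Consistent", "temp": "Inconsistent"}
annotator_b = {"spo2": "Consistent", "abd": "Inconsistent",
               "hr": "Consistent", "bp": "Consistent"}
scores = annotator_agreement(annotator_a, annotator_b)
```

With two shared entities out of three and four, precision is 2/4, recall is 2/3, and F1 is 4/7.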

## Appendix E Detailed Implementation of EHR\-Inspector

### E\.1 Entity Extraction

EHR\-Inspector identifies entities in clinical notes that can be verified using EHR data\. It extracts two kinds of entities in two steps: \(1\) entities corresponding to items recorded in the patient’s structured tables \($E_{patient}$\), and \(2\) entities defined in the EHR schema but absent from the patient’s structured records \($E_{ontology}$\)\. The final anchor entity set is defined as $E_{extract}=E_{patient}\cup E_{ontology}$\.

#### E\.1\.1 Patient\-specific Extraction

As illustrated in Figure [16](https://arxiv.org/html/2605.26463#A5.F16), we first predefine the global item set $L$ and the corresponding value distribution profile $TopV(l)$ for each item using the full MIMIC\-III \[[9](https://arxiv.org/html/2605.26463#bib.bib40)\] database\. For a given patient $P$, we then construct a patient\-specific subset $L_{P}\subseteq L$ by selecting only the items recorded in the patient’s structured tables\. Notably, the value distribution $TopV(l)$ is not recomputed per patient but directly reused from the globally defined statistics\. For each clinical note segment $x_{t}$, the LLM is provided with $x_{t}$, $L_{P}$, and $\{TopV(l)\}_{l\in L_{P}}$, and extracts entity names that correspond to items in $L_{P}$\. The final patient\-specific entity set is obtained by aggregating the extracted entities across all segments, i\.e\., $E_{\mathrm{patient}}=\bigcup_{x_{t}\in X}E^{\mathrm{pat}}(x_{t})$\.
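The relationship between the global statistics and the patient-specific subset can be sketched as follows. The record layout `(patient_id, item, value)`, the function names, and the cutoff `k` are assumptions for illustration; the paper does not specify how many top values are kept.

```python
from collections import Counter


def build_top_values(records, k=5):
    """Global TopV(l): the k most frequent values per item over all patients."""
    counts = {}
    for _, item, value in records:
        counts.setdefault(item, Counter())[value] += 1
    return {item: [v for v, _ in c.most_common(k)] for item, c in counts.items()}


def patient_items(records, patient_id):
    """L_P: the items that appear in this patient's structured records."""
    return {item for pid, item, _ in records if pid == patient_id}


def patient_context(records, patient_id, top_values):
    """Pair each item in L_P with its globally computed TopV(l)."""
    return {item: top_values[item]
            for item in sorted(patient_items(records, patient_id))}


# Hypothetical, fictional records: (patient_id, item, value).
records = [
    ("p1", "Potassium Chloride", "10"),
    ("p1", "Potassium Chloride", "10"),
    ("p2", "Potassium Chloride", "20"),
    ("p2", "Skin Temperature", "Warm"),
    ("p2", "Skin Temperature", "Warm"),
    ("p1", "Heart Rate", "80"),
]
top_values = build_top_values(records, k=2)
context = patient_context(records, "p1", top_values)
```

In this toy data, patient `p1` never has the value "20" recorded, yet it appears in the context passed to the LLM because the value profile is reused from the global statistics rather than recomputed per patient.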

![Refer to caption](https://arxiv.org/html/2605.26463v1/x11.png) Figure 16: Overview of patient\-specific entity extraction\.

#### E\.1\.2 Ontology\-Guided Extraction

Figure [17](https://arxiv.org/html/2605.26463#A5.F17) illustrates the construction of the predefined hierarchical ontology and its use in ontology\-guided entity extraction\. This section first describes how the predefined hierarchical ontology is constructed and then explains how the constructed ontology is used to extract ontology\-based entities from clinical notes\.

##### Predefined Hierarchical Ontology Construction

In this study, we automatically construct a hierarchical ontology using an LLM, based on the full set of items appearing in the structured tables\. Since real\-world EHR tables contain more than 10,000 items, it is impractical to provide the entire item set to the LLM at once\. Therefore, we divide the items into smaller batches using a sliding window of size $w$ and process them sequentially\. At each step, the LLM receives the $w$ items in the current window and assigns each item to an appropriate group\. The prompt also includes the list of groups generated from previous windows\. This allows the LLM to map newly observed items to existing groups, or to create a new group when none of the existing groups is suitable\. By repeating this prompt\-update process over the entire item set, we progressively construct the group\-level ontology\. For example, suppose a window contains items such as Systolic BP, Heart Rate, Temperature, and Guedel\. The LLM may assign Systolic BP and Heart Rate to the Hemodynamics group, Temperature to the Vitals group, and Guedel to the Procedure group\. When new items are processed in subsequent windows, the LLM refers to the previously generated groups and either assigns the new items to existing groups or creates additional groups when necessary\.
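The windowed construction loop can be sketched as below. The LLM call is replaced by a hypothetical keyword-based stand-in (`toy_assign`), and the window is assumed to advance by $w$ items per step (non-overlapping batches); neither detail is confirmed by the paper.

```python
def build_group_ontology(items, w, assign):
    """Assign every item to a group, one window of w items at a time.

    `assign(batch, known_groups)` stands in for the LLM call and must
    return {item: group}. Groups created in earlier windows are passed
    to later calls so they can be reused.
    """
    groups = []    # group names in order of creation
    mapping = {}   # item -> group
    for start in range(0, len(items), w):
        batch = items[start:start + w]
        result = assign(batch, list(groups))
        for item in batch:
            group = result[item]
            if group not in groups:
                groups.append(group)   # the stand-in LLM created a new group
            mapping[item] = group
    return mapping, groups


def toy_assign(batch, known_groups):
    """Hypothetical keyword rules used in place of an LLM."""
    rules = {"BP": "Hemodynamics", "Heart Rate": "Hemodynamics",
             "Temperature": "Vitals"}
    return {item: next((g for key, g in rules.items() if key in item), "Procedure")
            for item in batch}


items = ["Systolic BP", "Heart Rate", "Temperature", "Guedel", "Diastolic BP"]
mapping, groups = build_group_ontology(items, w=2, assign=toy_assign)
```

The subgroup level could reuse the same loop with item–group pairs as inputs.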

After constructing the group\-level ontology, we build the subgroup\-level ontology using the same sliding\-window strategy\. In this stage, the LLM receives item–group pairs rather than item names alone\. That is, each input includes both an item and its assigned group, such as Systolic BP–Hemodynamics\. The LLM then maps each item–group pair to an appropriate subgroup, or creates a new subgroup if no existing subgroup is suitable\. The generated subgroup list is also included in subsequent prompts, allowing the subgroup\-level ontology to be progressively expanded over the full item set\. For example, Systolic BP–Hemodynamics and Heart Rate–Hemodynamics may be assigned to the Vitals subgroup, whereas Guedel–Procedure may be assigned to the Airway subgroup\. The final hierarchical ontology is reviewed by the authors to identify and correct inappropriate mappings or redundant categories\. The prompt used for group\- and subgroup\-level ontology construction is provided in Figure [21](https://arxiv.org/html/2605.26463#A5.F21)\. The full list of groups and subgroups is provided in Table 8\.

##### Ontology\-Guided Entity Extraction

Given the predefined hierarchical ontology $O$ and segmented clinical notes $x_{t}$, the LLM extracts the ontology\-based entity set $E_{\mathrm{ontology}}$\. The goal of this step is to recover entities that are mentioned in clinical notes but are absent from the corresponding patient’s structured records\. Specifically, for each clinical note segment, the LLM first selects the groups that are relevant to the segment\. Within the selected groups, it then selects the most relevant subgroups\. Finally, item\-level entity extraction is performed using only the candidate items associated with the selected subgroups\. In other words, instead of searching over the entire EHR item set, the ontology guides the LLM to progressively narrow the candidate space from groups to subgroups and then to item\-level entities\. For example, suppose a clinical note segment contains the phrase “Temp: 100”\. The LLM first recognizes that “Temp” refers to a vital\-sign\-related mention and may select relevant groups such as Vitals or Hemodynamics\. It then selects the relevant subgroup within the selected groups, such as the Vitals subgroup\. Finally, extraction is performed using only the item candidates associated with that subgroup\. Since this subgroup contains the item Temperature, the mention “Temp” is mapped to Temperature in the predefined ontology\. Therefore, Temp is added to the ontology\-based entity set $E_{\mathrm{ontology}}$\.
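The group-to-subgroup-to-item narrowing can be sketched with a toy ontology built from the examples in this section. The selector functions below are hypothetical keyword rules standing in for the LLM's group and subgroup choices, and the prefix-based linking rule is likewise an assumption made only to keep the sketch runnable.

```python
# Toy ontology: group -> subgroup -> items (taken from the running examples).
ONTOLOGY = {
    "Hemodynamics": {"Vitals": ["Systolic BP", "Heart Rate"]},
    "Vitals": {"Vitals": ["Temperature"]},
    "Procedure": {"Airway": ["Guedel"]},
}

# Hypothetical cue words used in place of the LLM's group selection.
CUES = {"temp": ["Vitals", "Hemodynamics"], "guedel": ["Procedure"]}


def toy_select_groups(segment, groups):
    text = segment.lower()
    chosen = []
    for cue, cue_groups in CUES.items():
        if cue in text:
            chosen.extend(g for g in cue_groups if g in groups and g not in chosen)
    return chosen


def toy_select_subgroups(segment, group, subgroups):
    return subgroups  # keep every subgroup of a selected group


def narrow_candidates(segment, ontology, select_groups, select_subgroups):
    """Collect candidate items only from the selected groups and subgroups."""
    candidates = []
    for group in select_groups(segment, list(ontology)):
        for subgroup in select_subgroups(segment, group, list(ontology[group])):
            candidates.extend(ontology[group][subgroup])
    return candidates


def link_mention(mention, candidates):
    """Assumed linking rule: the item name starts with the mention."""
    return [c for c in candidates if c.lower().startswith(mention.lower())]


candidates = narrow_candidates("Temp: 100", ONTOLOGY,
                               toy_select_groups, toy_select_subgroups)
linked = link_mention("Temp", candidates)
```

Only three of the four items reach the item-level step here, which mirrors the intent of narrowing the candidate space before extraction.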

![Refer to caption](https://arxiv.org/html/2605.26463v1/x12.png) Figure 17: Overview of ontology\-guided entity extraction\.

Table 8: Groups and subgroups used in EHR\-ReasonCon\.

GROUP; SUB GROUP

PROCEDURE; ACCESS & LINES; AIRWAY & VENTILATION; RENAL & DIALYSIS; CARDIAC; DIAGNOSTICS \(IMAGING/LABS/CULTURES/SWABS/ECG·EEG\); CARE FLOW, COMMS, SAFETY ADMIN; GI&IR; NEURO & ICP; OB/GYN; MSK; GENERAL/PLASTICS/ENT/EYE/DENTAL; MEDICATIONS, FLUIDS & NUTRITION; FLUIDS & NUTRITION; CARDIO & HEMATOLOGY; INFECTIOUS DISEASES & IMMUNOLOGY; CNS, PAIN & SEDATION; ENDOCRINE, RENAL & GI; RESPIRATORY, ENT/OPHTH & DERM; FORMULATIONS, DEVICES, STUDY CODES & AMBIGUOUS; OUTPUT AND REMOVAL; DIALYSIS\_ULTRAFILTRATION; URINARY\_OUTPUT; PLEURAL\_CHEST\_TUBES; ABDOMINAL\_PARACENTESIS; CSF\_DRAINAGE; GI\_TUBES\_OUTPUT; WOUND\_LOCAL\_DRAINS; BLOODLOSS\_APHERESIS\_OPS; DRAIN\_DEVICES; BOWEL\_OSTOMY\_OUTPUT; EMESIS\_ORAL\_GASTRIC; HEPATOBILIARY\_PANCREATIC\_DRAINS; LOCATION\-BASED OUTPUT; CARDIAC OUTPUT; ADJUSTMENT/AGGREGATED VALUES; AMBIGUOUS VALUES; MICROBIOLOGY; SPECIMEN; ORGANISM; ANTIBACTERIUM; RESPIRATORY / VENTILATION & GAS MONITORING; RESPIRATORY / VENTILATION & GAS MONITORING; HEMODYNAMICS & VITALS; HEMODYNAMICS & VITALS; SKIN / WOUNDS & DEVICES / SAFETY; LABS / ABG\-CHEMISTRY; NEURO/PAIN/FUNCTIONAL; SKIN / WOUNDS & DEVICES / SAFETY; AMBIGUOUS, ADMINISTRATIVE / ADMISSION & DEMOGRAPHICS; NEURO/PAIN/FUNCTIONAL; ALLERGY; ALLERGY; DIAGNOSIS; TUBERCULOSIS; ENTERIC & FOODBORNE BACTERIAL INFECTIONS; PARASITIC & VECTOR\-BORNE PROTOZOAL/TICK\-BORNE DISEASES; SEXUALLY TRANSMITTED INFECTIONS \(STIS\) & HIV; OTHER BACTERIAL & SEPTIC INFECTIONS; RESPIRATORY BACTERIAL DISEASES; ZOONOTIC & MYCOBACTERIAL DISEASES; VIRAL DISEASES; PRION & SLOW VIRUS DISEASES; NON\-INFECTIOUS / NEOPLASTIC DISEASES; LABORATORY; HEMATOLOGY \(CEREBROSPINAL FLUID \(CSF\)\); HEMATOLOGY \(JOINT FLUID\); HEMATOLOGY \(OTHER BODY FLUID\); HEMATOLOGY \(STOOL\); BLOOD GAS \(BLOOD\); CHEMISTRY \(ASCITES\); CHEMISTRY \(CEREBROSPINAL FLUID \(CSF\)\); CHEMISTRY \(OTHER BODY FLUID\); CHEMISTRY \(STOOL\); HEMATOLOGY \(BLOOD\); CHEMISTRY \(OTHER BODY FLUID\); HEMATOLOGY \(URINE\); HEMATOLOGY \(JOINT FLUID\); BLOOD GAS \(OTHER BODY FLUID\); CHEMISTRY \(CEREBROSPINAL FLUID \(CSF\)\); CHEMISTRY \(STOOL\); HEMATOLOGY \(PLEURAL\); HEMATOLOGY \(URINE\); BLOOD GAS \(BLOOD\); BLOOD GAS \(OTHER BODY FLUID\); CHEMISTRY \(BLOOD\); CHEMISTRY \(JOINT FLUID\); HEMATOLOGY \(OTHER BODY FLUID\); CHEMISTRY \(PLEURAL\); CHEMISTRY \(URINE\); HEMATOLOGY \(ASCITES\); CHEMISTRY \(ASCITES\); HEMATOLOGY \(BONE MARROW\); CHEMISTRY \(BLOOD\); CHEMISTRY \(URINE\); HEMATOLOGY \(BLOOD\)

### E\.2 Validation Cache

In this section, we describe the operational process of the validation cache using the example shown in Figure [28](https://arxiv.org/html/2605.26463#A5.F28)\. From the clinical note sentence “Abd: soft, distended\.”, the entity extraction step identifies three entities: Abd, soft, and distended\. These entities are not independent; rather, they originate from a single abdominal examination\. As a result, verifying them independently would lead to repeated retrieval of the same structured data\. First, for the entity Abd, the LLM invokes a table exploration tool to retrieve the relevant structured record \(e\.g\., Abdomen Assessment\)\. The retrieved record indicates that, at the time of admission, the abdomen was documented as “soft” and “distended\.” This result is stored in the validation cache, which maintains the entity \(Abd\), temporal context \(Admission\), and associated attributes \(Soft, Distended\) as reusable evidence for subsequent verification steps\. When verifying the entity soft, the model first consults the validation cache\. Since the cached result already contains the abdominal examination with attributes “soft” and “distended,” the model can immediately confirm that soft has been previously validated\. Therefore, the verification is completed without issuing an additional tool call\. The same process applies to distended, which is also verified through cache lookup alone, without further access to the structured data\. In this way, the validation cache enables multiple entities derived from the same clinical observation to share verification results, reducing redundant tool calls and improving computational efficiency\. The cache maintains the most recent $m$ validation results in a sliding window, maximizing reuse within temporally localized contexts\. We set the validation cache size to five in all experiments\.
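The cache behavior in this example can be sketched with a bounded queue. This is a simplified illustration under stated assumptions: in EHR\-Inspector the LLM decides whether cached evidence covers an entity, whereas the sketch uses exact string matching; the class and function names and the fake tool are hypothetical. Only the window size of five comes from the text.

```python
from collections import deque


class ValidationCache:
    """Keeps the most recent `size` validation results (sliding window)."""

    def __init__(self, size=5):
        self._entries = deque(maxlen=size)  # oldest entry is evicted first

    def add(self, entity, time, attributes):
        self._entries.append(
            (entity.lower(), time, frozenset(a.lower() for a in attributes)))

    def lookup(self, term, time):
        """Return cached evidence whose entity or attributes match `term`."""
        term = term.lower()
        for entity, cached_time, attributes in reversed(self._entries):
            if cached_time == time and (term == entity or term in attributes):
                return entity, cached_time, attributes
        return None


def verify(term, time, cache, tool):
    """Return (evidence, used_tool); call the tool only on a cache miss."""
    hit = cache.lookup(term, time)
    if hit is not None:
        return hit, False
    attributes = tool(term, time)       # stand-in for a table-exploration tool
    cache.add(term, time, attributes)
    return cache.lookup(term, time), True


calls = []

def fake_tool(term, time):
    """Hypothetical tool returning fictional Abdomen Assessment attributes."""
    calls.append(term)
    return ["Soft", "Distended"]


cache = ValidationCache(size=5)
results = [verify(t, "Admission", cache, fake_tool)
           for t in ("Abd", "soft", "distended")]
```

In this run the tool is called once, for Abd, and the other two entities are resolved from the cache.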

### E\.3 Prompt

The prompts used in EHR\-Inspector are presented in Figures [18](https://arxiv.org/html/2605.26463#A5.F18) through [27](https://arxiv.org/html/2605.26463#A5.F27)\. In addition, to protect patient privacy, all example data values included in the paper have been replaced with fictional ones rather than real data\.

Prompt Template for Note Segmentation

Task: Segment the content into sections based on relevant clinical topics\.

Instruction:
•Given a clinical note, please categorize and segment the content into subtexts based on relevant clinical topics \(e\.g\., Discharge Medication, Major Surgical or Invasive Procedure, Discharge Diagnosis, Brief Hospital Course, Pertinent Results, etc\.\)\.
•Specifically, in the Hospital Course or sections describing the patient’s clinical progression, please use temporal cues \(e\.g\., “Hospital day,” specific dates, or relative time expressions like “two days after surgery” or “the following day”\) as the basis for segmentation\.
•If events occurring on the same day or over a continuous period are described across multiple paragraphs, they should still be grouped into a single range based on the timeline\.
•Remember, Even if content related to the same clinical category \(e\.g\., Major Surgical or Invasive Procedure\) is spread across multiple paragraphs, dates, or formatting sections, it should be grouped into a single subtext range if it is semantically related\.
•The segmentation should not be based on dates or paragraph breaks, but rather on semantic relevance or clinical topics \(e\.g\., Major Surgical or Invasive Procedure, Discharge Diagnosis, Brief Hospital Course, Pertinent Results, etc\.\)\.

For example:
•95\-109 describes events on hospital day 7\. Even though the information is spread across different paragraphs, it all occurred on the same day and should be grouped together\.
•In contrast, 111\-113 refers to hospital day 8, which is a new day and should therefore be placed in a separate range\.

For each subtext, output only the line number range in the following format:

Output format\)
\- 0\-3
\- 4\-7
\- 8\-10

Example 1:
45: Hospital Summary:
46: 58yo male with history of COPD admitted for worsening
47: dyspnea and productive cough\. Started on antibiotics
48: and oxygen therapy with gradual improvement\.
49:
50: 1\. Respiratory distress: Initially required ICU care
51: due to hypoxia but stabilized after treatment
52: and transferred to general ward

Output format\)
\- 45\-48
\- 50\-52

Clinical Note: “<<<<CLINICAL\_NOTE\>\>\>\>”

Output:

IMPORTANT:
•If you fail to include these initial lines starting from line 0, your output will be considered entirely incorrect\.
•This is a critical error and will result in a full penalty\.

Figure 18: Prompt Template for Note Segmentation\.

Prompt Template for Extract Time Reference

Task: You are given a clinical summary from a patient’s hospital stay\. Your task is to extract only those clinical events that serve as temporal anchors, which can be used to infer the dates of other events\.

Instruction:
•If there are events described with relative time expressions \(e\.g\., “a few days later,” “on the nth day,” “the next day”\), extract only the anchor event\(s\) that allow these events to be assigned an exact date\.
•Criteria for events to be extracted:
\- The event must serve as a practical anchor for determining the timing of another event\.
\- Include Admission and Discharge if they are used to infer other event dates\.
•Output format
\- yyyy\-mm\-dd \- Event description
\- yyyy\-mm\-dd \- Event description
•The following should be excluded from the output:
\- Events described only with relative expressions such as “3 days later” or “the next day”
\- Events that have a clear date but are not used to infer the date of any other events

Example 1:
The patient was admitted on 2172\-03\-12 with shortness of breath and chills\. Initial labs were drawn on the day of admission\.
On 2172\-03\-15, bacterial pneumonia was confirmed and ceftriaxone was initiated\.
Respiratory symptoms improved three days after starting treatment\.
The patient was discharged on 2172\-03\-20\. Ibuprofen was recommended at discharge\.

Output:
\- 2172\-03\-12 \- Admission
\- 2172\-03\-15 \- ceftriaxone was initiated
\- 2172\-03\-20 \- Discharge

Explanation:
\- 2172\-03\-12 \- Admission: Initial labs are described as occurring “on the day of admission,” so the admission date serves as a reference point → included\.
\- 2172\-03\-15 \- ceftriaxone was initiated: Symptom improvement is defined relative to treatment start \(“three days after”\), making this an anchor → included\.
\- 2172\-03\-20 \- Discharge: Medication recommendation is tied to discharge timing → included\.
\- Initial labs: Although linked to admission day, they are not used to determine timing of other events → excluded\.
\- Pneumonia confirmation: Has a specific date but does not anchor other events → excluded\.
\- Ibuprofen recommendation: Occurs “at discharge” and depends on that date; not independently anchoring → excluded\.

Clinical Note: “<<<<CLINICAL\_NOTE\>\>\>\>”

Output:

Figure 19: Prompt Template for Extract Time Reference\.

Prompt Template for Patient Specific Entity Extraction

Instruction:
•In the given Clinical Note, use a pre\-defined list of “entities and their representative associated values” as the “answer key”\.
•If there is statistical or semantic similarity, extract the corresponding entity from the clinical note\.
•Attach to each extracted entity the representative values \(e\.g\., dose, measurement and unit, procedure site, etc\.\) that most frequently co\-occur across many patient records\.

Explanation:
•Explanation based on examples of pre\-defined “entities and their representative associated values”: Potassium Chloride: \[‘2’, ‘7’, ‘10’, ‘5’, ‘8’, ‘50’, ‘4’\]
→ \[Entity identification\] In patient A’s EHR record, “Potassium Chloride” was mentioned\.
→ \[Associated value extraction\] Analysis of the entire patient database shows that the values/doses most frequently recorded together with Potassium Chloride are ‘2’, ‘7’, ‘10’, ‘5’, ‘8’, ‘50’, ‘4’\.
→ \[Interpretation\] For this drug, numeric values or doses are recorded in the EHR\.
Output format\)
\- clinical note line number\. entity in the note \(same level as an entity in the EHR table\): \[associated value1\] \- \[Linking entity1 \(ehr table\), linking entity2 \(ehr table\),…\]
\- clinical note line number\. entity in the note \(same level as an entity in the EHR table\): \[associated value2\] \- \[Linking entity1 \(ehr table\), linking entity2 \(ehr table\),…\]
\- clinical note line number\. another entity in the note \(same level as an entity in the EHR table\): \[associated value1\] \- \[Linking entity1 \(ehr table\), linking entity2 \(ehr table\),…\]

Example 1:
Pre\-defined list of entities and their representative associated values \(examples\):
Abdominal X\-Ray: \[\]
Abdominal Assessment: \[‘Firm Distended’, ‘Firm’, ‘Obese’, ‘Soft’, ‘Non\-Distended’\]
Skin Temperature: \[‘Hot’, ‘Warm’, ‘Cold’, ‘Cool’\]
Temperature Site: \[‘Blood’, ‘Rectal’, ‘Oral’, ‘Tympanic’\]

Clinical note\)
0\. Intravenous fluids were initiated
1\. Vitals: T: 97 BP: 94/49 on 0\.06 levophed

Output:
Line number 0\. Intravenous fluids: \[\] \- \[IV site 1\]
→ Rationale: The initiation of IV can be associated with IV Site 1\.
Line number 1\. T: \[97\] \- \[Temperature Fahrenheit\]
→ Rationale: In the pre\-defined list, the Temperature Fahrenheit entity holds temperature values\. The “T” recorded in the note is an abbreviation for temperature and is semantically linked to this entity\. Therefore, the numeric value “97” recorded with “T” is extracted as the value for Temperature Fahrenheit\.

Extract entities from the clinical note based on a predefined list of entities and their representative values\.

Pre\-defined list of entities and their representative associated values\)
“<<<<PREDEFINED\_VALUES\>\>\>\>”

Clinical Note:
“<<<<CLINICAL\_NOTE\>\>\>\>”

Figure 20: Prompt Template for Patient Specific Entity Extraction\.

Prompt Template for Mapping Groups and Subgroups

You are an expert in clinical coding of medical entities\. Your task is to categorize each procedure into EXACTLY ONE of the predefined “<<<<Group\_or\_Subgroup\>\>\>\>” based on its value and title\.

PREDEFINED “<<<<Group\_or\_Subgroup\>\>\>\>” \(Choose strictly from this list\):
json\.dumps\(PROCEDURE\_SUBGROUPS, indent=2\)

ITEMS TO CATEGORIZE:
json\.dumps\(items\_to\_classify, indent=2\)

INSTRUCTIONS:
•Analyze each procedure title and assign it to the most relevant subgroup\.
•If it covers multiple areas, choose the primary anatomical or functional focus\.
•OUTPUT FORMAT: Provide ONLY a valid JSON dictionary where keys are the exact item IDs \(the string numbers provided\) and values are the chosen “<<<<Group\_or\_Subgroup\>\>\>\>” names\. Do not include markdown formatting\.

Figure 21: Prompt Template for Mapping Groups and Subgroups\.

Prompt Template for Selecting Groups and Subgroups

MULTI CHOICE QUESTION

Read the given clinical note and select all categories below that explicitly appear in the note\. No inference\. Your output format, punctuation \(commas, spaces, brackets\) must match the example exactly\. List labels in alphabetical order\.

Answer Format Example
A\. Neuro/Pain/Functional, B\. Medications & Infusions

Clinical Note
“<<<<CLINICAL\_NOTE\>\>\>\>”

Selectable categories
“<<<<SELECTED\_CATEGORY\>\>\>\>”

•In your answer, include only the selected labels, nothing else\.
•Just give me answer with answer format without any messages\.
•DO NOT GENERATE REASON\. ONLY GIVE ME THE ANSWER\!
•Answer:

Figure 22: Prompt Template for Selecting Groups and Subgroups\.

Prompt Template for Extracting $E_{\mathrm{ontology}}$

Task: Please find and extract entities in the clinical note that are related to the items in the given list\.

Output format:
Line number n\. entity \- \(item1, item2,…\) \- \(associated value\)
Line number n\. entity2 \- \(item4\) \- \(associated value\)

Previously extracted entity list:
format: \(entity, line number, associated value, the form in which the entity is recorded in the EHR\)
\(\(‘Lisinopril’, ‘10’, ‘’, ‘Lisinopril’\)\),‘Temperature’, ‘11’, ‘101’, \(‘Temperature’\)\)\)

Example Task:
Item list to check \(example\):
\(Temperature, Non Invasive Blood Pressure systolic, Non Invasive Blood Pressure diastolic, Abdomen, Lisinopril\)

Example:
10: The patient developed hypertension, was prescribed Lisinopril,
11: and their BP improved to 130/85 mmHg
12: Temperature was 101

Output:
Line number 10\. hypertension\- \(Non Invasive Blood Pressure systolic, Non Invasive Blood Pressure diastolic\) \- \(\)
Line number 11\. BP\- \(Non Invasive Blood Pressure systolic\) \- \(130\)
Line number 12\. BP\- \(Non Invasive Blood Pressure diastolic\) \- \(85\)
\*Do not extract Lisinopril and Temperature, as they have already been extracted\.

Let’s do the task\!

LIST to check:
“<<<<CHECKED\_LIST\>\>\>\>”

Clinical Note: “<<<<Group\_or\_Subgroup\>\>\>\>” names\. Do not include markdown formatting\.

REMEMBER:
•Do not extract any entities that were previously extracted\.
•Previously extracted entity list: format: \(entity, line number, associated value, the form in which the entity is recorded in the EHR\) “<<<<LAST\_EXTRACTED\>\>\>\>”

Figure 23: Prompt Template for Extracting $E_{\mathrm{ontology}}$\.

Prompt Template for Scope Filtering

Instructions:
•Base your answers only on the information explicitly stated in the note\.
•Do not rely on prior medical knowledge, assumptions, or general practices\.
•Your reasoning must be directly supported by evidence from the text\.

Important:
•If you use your own prior medical knowledge, assumptions, or general practices that are not clearly supported by the text itself, your answer will be invalid\.
•Always respond strictly based on the given entity\.

Output format:
number\.
Reason:
Answer: \(Yes, No\)

•Did the event of <<<<ENTITY\>\>\>\> occur in the Emergency Room \(ED\)?
•Is <<<<ENTITY\>\>\>\> part of the Past Medical History?
•Is <<<<ENTITY\>\>\>\> an event from a previous hospitalization?
•Is <<<<ENTITY\>\>\>\> a future plan, suggestion, or recommendation?
a\. future plan, suggestion, or recommendation \(YES\)
\- Not yet performed → an action that may or may not occur\.
\- Often indicated by verbs such as suggest, recommend, consider, monitor\.
\- Conditional or optional; leaves room for change\.
\- Describes anticipated changes or actions, not the current state\.
\- At the time of documentation, execution is not yet confirmed\.
b\. Not consider as future plan, suggestion, or recommendation \(NO\)
\- Actions already decided as part of a fixed future schedule or care pathway\.
\- Actions to be carried out without any option for change\.
\- Statements in the future tense that serve only as a notification
•Does <<<<ENTITY\>\>\>\> reflect the opinion or preference of the clinician, patient, medical team, or patient’s family?
•Is <<<<ENTITY\>\>\>\> a treatment or medication mentioned as discontinued, but the timing of discontinuation is unclear?
•Is <<<<ENTITY\>\>\>\> a Diagnosis? A diagnosis is a formally identified disease or medical condition, not just a symptom or sign\.
•Is <<<<ENTITY\>\>\>\> related to total input, total output?
YES:
\- Mentions of the complete quantity \(e\.g\., total blood volume, total input, total output, or other totals\) where the full amount is explicitly stated or implied\.
NO:
\- Mentions that do not indicate a total amount \(e\.g\., She remained NPO\)\.
\- Cases where total is part of a parameter or test name but does not represent an actual total quantity \(e\.g\., Total bilirubin was 0\.4\)\.
•Does <<<<ENTITY\>\>\>\> indicate the patient’s refusal of a procedure or treatment?
•Does <<<<ENTITY\>\>\>\> include a medical interpretation summarizing abnormal \(or normal\) findings in imaging studies \(e\.g\., CT, Xray, EGD\)?

Reason:

Figure 24: Prompt Template for Scope Filtering\.

System Prompt for Tool Calling

Task Overview
This task involves verifying the consistency between a patient’s clinical notes and electronic health record \(EHR\) tables\. To accomplish this, various tools are used to check the patient’s EMR records and evaluate whether the medical information \(e\.g\., values, medications, observation results, etc\.\) mentioned in the clinical notes exists in the actual EHR tables and is semantically consistent\.
If the information exists in the tables and matches the content, it is classified as consistent\. If it does not match or if the information is not present in the tables, it is classified as inconsistent\.

Introduction to Tools

Global\-based Tools
Tools for exploring entire tables without specific patient information\.
1\. Item\_Search
•N\-gram based cosine similarity for label matching\.
2\. Top\_Values\_for\_Entities
•Aggregates the top\-N values that frequently appear for a specific item in a particular table across the entire dataset\.

Patient\-Centric Tools
Tools that set time conditions based on an individual patient’s admission time to verify whether specific events actually occurred\.
3\. Semantic\_Search
•Searches for similar items within a patient’s table based on input keywords such as test names or medication names using meaning\-based matching\.
4\. Table\_Value\_Time
•Searches for data that simultaneously satisfies both value and time conditions based on individual patient criteria\.
5\. Table\_Category\_Time
•Verifies the existence of events within a patient’s time conditions based on an item’s category\.
6\. Table\_Time
•Filters based solely on time conditions without value or category conditions\.
7\. Table\_Selected\_Item\_Time
•You must use the item names exactly as they appear in the name column \(i\.e\., LABEL or DRUG\) of the table obtained through the previous search steps, because items need to be searched using those exact expressions\.
8\. Table\_Selected\_Item\_Value\_Time
•You must use the item names exactly as they appear in the name column \(i\.e\., LABEL or DRUG\) of the table obtained through the previous search steps, because items need to be searched using those exact expressions\.

Use the available tools to check whether the information about specific entities such as symptoms, test results, medications, and other clinical facts recorded in the clinical notes matches the actual EHR table data\. In this process, entities may be recorded in the tables using identical terms, synonyms, or abbreviations\. Evaluate the semantic consistency between the clinical notes and the EHR tables, and determine whether they are consistent or inconsistent\.

Figure 25: System Prompt for Tool Calling\.

Prompt Template for Tool Calling

Input Format
Line number: <<<<line\_number\>\>\>\>,
Candidate entity: <<<<candidate\_entity\>\>\>\>,
Related information: <<<<candidate\_value\>\>\>\>
The date \(\+time\) this event likely occurred: <<<<TIME\>\>\>\>

Clinical Note
<<<<clinical\_note\>\>\>\>

Query\) What tool do you want to use first? Just choose one \(only tool name\)\.

Output Format
Reason: \[The reason for selecting this tool\]
Selected tool: \[The name of the tool you selected\]

Figure 26: Prompt Template for Tool Calling\.

Prompt Template for Final Verification

Task Overview
You are given a clinical note and a structured EHR table\. Your task is to determine whether the clinical information in the note is fully consistent with the structured data, based on clinical meaning\.
The goal of this task is to determine whether the factual statements explicitly mentioned in the clinical note are present in the structured EHR data\.
Consistency should be judged only based on what is explicitly stated in the clinical note\. Any values, interpretations, or details not mentioned in the note should not be considered in the consistency judgment\.

Task Rules
1\. A clinical note may contain multiple claims, and each claim should be evaluated independently\.
2\. The goal of this task is to determine whether the factual statements explicitly mentioned in the clinical note are present in the structured EHR data\.
3\. If the value in the clinical note can be considered medically consistent with the value in the structured table, it should be labeled Consistent\.
4\. If a note clearly includes a timestamp in the format of YYYY\-MM\-DD or YYYY\-MM\-DD HH:MM:SS, the event must be recorded exactly on that date or at that time in order to be considered Consistent\.
5\. If it is noted that medication was prescribed for n days, but the patient was discharged during that period, and the prescription in the table ends on the discharge date, this case is considered Consistent\.

Output Requirements
For each claim, provide the following items\.
1\. Single Fact
•For clinical notes with multiple facts, clearly state the individual factual claim being evaluated in this block\.
2\. Consistency status
•Choose one of the following: Consistent / Inconsistent
3\. Reason
•Provide a clinical explanation of why the claim is labeled Consistent or Inconsistent\.
4\. Evidence index
•For Consistent or Contradictory Evidence Inconsistent claims, specify the row index or indices from the EHR table that support or contradict the claim\.
•For Information Missing Inconsistent \(NEI\) claims, if the structured table contains no relevant data at all for the claim, leave the evidence index blank\.

Types of Inconsistent Judgments
1\. Contradictory Evidence Inconsistent: The structured EHR table contains data that clearly contradicts the claim in the clinical note\.
2\. Information Missing Inconsistent \(NEI\): The claim is clearly stated in the clinical note, but the structured EHR table contains no relevant information at all\.

<<<<Examples\>\>\>\>

Figure 27: Prompt Template for Final Verification\.

![Refer to caption](https://arxiv.org/html/2605.26463v1/x13.png) Figure 28: A conceptual illustration of Validation Cache\.

## Appendix F LLM\-as\-a\-judge Evaluators

### F\.1 Harsh Evaluator

The Harsh evaluator is designed to assess strict agreement with the gold consistency labels\. Given a set of predicted entity–value pairs and corresponding gold annotations, the evaluator determines whether the classification is consistent with the gold annotation and assigns a binary label \(Correct or Incorrect\)\. In this setting, a classification result is considered correct only if the assigned consistency label exactly matches the gold label\. While the evaluator may internally account for minor variations in entity spans or value expressions when interpreting the classification, the final decision is based strictly on agreement with the gold consistency label\.

We implement this evaluator using Gemini 2\.5 Pro, prompted to act as a deterministic classifier\. The model is instructed to output a binary decision reflecting whether the classification is consistent with the gold annotation under a strict interpretation\. The full prompt used for this evaluator is provided in Figure [29](https://arxiv.org/html/2605.26463#A6.F29)\. To validate the reliability of the Harsh evaluator, we randomly sample 300 evaluation instances and have them independently reviewed by the authors\. The evaluation results are compared against human judgments, achieving 99\.46% agreement\. This result indicates that the evaluator reliably reproduces human decisions under well\-defined consistency criteria\.

### F\.2 Lenient Evaluator

The Lenient evaluator is designed to assess the validity of the reasoning underlying the predicted classification, rather than strict agreement with the gold annotation\. Unlike the Harsh setting, the evaluator focuses on whether the classification is supported by a defensible and clinically appropriate rationale, rather than enforcing surface\-level agreement with a single reference annotation\.

We employ Gemini 2\.5 Pro as the evaluator, with prompts emphasizing justification and reasoning validity\. The model is instructed to assess whether the predicted classification is supported by a sound reasoning process, rather than strictly matching the gold annotation\. The full prompt used for this evaluator is provided in Figure [30](https://arxiv.org/html/2605.26463#A6.F30)\. To validate this evaluator, we collect a separate set of 200 samples and obtain independent judgments from four clinical practitioners\. Agreement is computed individually between the LLM evaluator and each practitioner, and the final score is obtained by averaging across practitioners, resulting in 95\.35% agreement\. This demonstrates that the evaluator aligns well with expert judgment in assessing the validity of clinical reasoning, even in cases where multiple interpretations are possible\.
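The averaging step described above can be sketched as follows. This is a minimal illustration, not the authors' evaluation script: the function names `agreement` and `mean_agreement` and the toy labels are assumptions introduced here, and agreement is taken to mean the simple percentage of matching labels.

```python
from statistics import mean

def agreement(judge_labels, human_labels):
    """Percentage of instances on which two label lists agree."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return 100.0 * matches / len(judge_labels)

def mean_agreement(judge_labels, practitioner_labels):
    """Compute judge-vs-practitioner agreement per practitioner, then average."""
    return mean(agreement(judge_labels, labels) for labels in practitioner_labels)

# Toy data (hypothetical): 4 instances reviewed by 2 practitioners.
judge = ["Correct", "Incorrect", "Correct", "Correct"]
practitioners = [
    ["Correct", "Incorrect", "Correct", "Incorrect"],  # agrees on 3 of 4
    ["Correct", "Incorrect", "Correct", "Correct"],    # agrees on 4 of 4
]
print(mean_agreement(judge, practitioners))
```

Averaging per-practitioner scores gives every practitioner equal weight, which differs from pooling all judgments only when practitioners review different numbers of samples.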

### F\.3 Evaluator Bias

To assess potential bias introduced by reliance on a single LLM evaluator, we perform cross\-model validation using GPT\-5 as an alternative judge\. The same evaluation prompts and decision criteria are applied across both models to ensure consistency\. When using GPT\-5 as the evaluator, the measured performance of the framework is comparable to that obtained with Gemini 2\.5 Pro, as shown in Table [9](https://arxiv.org/html/2605.26463#A6.T9)\. This indicates that no significant model\-specific bias is observed in the evaluation\.

### F\.4 Evaluation Criteria for Clinical Practitioners

#### F\.4\.1 Objective

The purpose of this evaluation is to enable clinical experts to conduct a final review of AI\-generated assessments verifying consistency between patients’ unstructured clinical notes and structured EHR tables\. Reviewers are asked to examine the AI model’s prediction \(PREDICT\), the reference answer \(GOLD\), and the AI evaluator’s reasoning comparing the two \(RESULT\)\. Based on this review, please determine the final validity of the assessment by marking the Eval field as either Correct or Incorrect\.

#### F\.4\.2 Data Structure

The data provided to reviewers consists of the following fields:

- PREDICT: The consistency judgment made by the AI verification model after reading the clinical note and querying the EHR table\. This includes the model’s determination of Consistent or Inconsistent, along with its detailed reasoning process\.
- GOLD: The reference answer, or ground truth, previously established by the research team\.
- RESULT: The detailed reasoning generated by an AI evaluator, or LLM\-as\-a\-judge, comparing PREDICT with GOLD and providing an initial assessment of whether the prediction is valid\.
- EVAL: The field to be completed or confirmed by the clinical reviewer\. Please determine whether the reasoning in RESULT is clinically appropriate, and mark it as Correct or Incorrect\.

#### F\.4\.3 Detailed Evaluation Criteria

##### Strictness and Clinical Acceptability

- Evaluation criterion: Even if the text span or specific item name extracted in PREDICT does not exactly match that in GOLD, the prediction should be accepted as correct if it conveys the same meaning in the clinical context and the model’s reasoning is reasonable\.
- Example: If GOLD evaluates “heart rhythm,” but PREDICT logically verifies tachycardia based on the “heart rate” value, this should be marked as Eval: Correct\.

##### Acceptance of Omitted Redundant Checks \(Validation Cache\)

- Evaluation criterion: For composite indicators such as blood pressure \(BP: 94/49\), where multiple values are assessed together, the model includes a function that allows it to remember values already checked in previous steps and omit redundant follow\-up checks for efficiency, for example: “Consistency check was already completed\.”
- Judgment: If it is confirmed that the model omitted a redundant check based on a previous verification step, this should be considered normal reasoning rather than an error or omission\. In such cases, mark Eval: Correct\.
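To make the caching behavior concrete for reviewers, the following is a minimal sketch of how such a cache could work. It is an assumption-laden illustration, not the EHR\-Inspector implementation: the class `ValidationCache`, the `(entity, value)` key, and the stand-in `fake_verify` function are all hypothetical.

```python
class ValidationCache:
    """Remembers verdicts for note values that were already verified."""

    def __init__(self):
        self._verdicts = {}

    def check(self, entity, value, verify):
        # Hypothetical key: case-insensitive entity name plus the stated value.
        key = (entity.lower(), value)
        if key in self._verdicts:
            # Reuse the earlier verdict instead of querying the tables again.
            return self._verdicts[key], "Consistency check was already completed."
        verdict = verify(entity, value)
        self._verdicts[key] = verdict
        return verdict, "Checked against the table."

calls = []

def fake_verify(entity, value):
    """Stand-in for a real table lookup; records each call it receives."""
    calls.append((entity, value))
    return "Consistent"

cache = ValidationCache()
first = cache.check("BP", "94/49", fake_verify)
second = cache.check("bp", "94/49", fake_verify)  # served from the cache
print(first, second, len(calls))
```

The second call returns the stored verdict with the "already completed" message, which is the pattern reviewers are asked to accept as normal reasoning.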

##### Rejection of Clear Medical Errors and Misjudgments

- Evaluation criterion: The following cases should be marked as incorrect even if the AI evaluator \(RESULT\) judged them to be correct:
  - The model \(PREDICT\) claims consistency by fabricating EHR records that do not exist, i\.e\., hallucination\.
  - The model relies on an incorrect medical standard, such as judging an abnormal value as normal\.
  - The comparison logic in RESULT itself contradicts medical facts\.
  - The AI evaluator claims consistency based on an incorrect EHR record\.
- Judgment: If any such critical error is identified, mark Eval: Incorrect\.

Prompt Template for Harsh Evaluator

You are an assistant that compares ground truth data to prediction results\.

\(Ground Truth Data\) represents an individual data item from the ground\-truth set, and <Prediction Results\> are the outputs of a process in which a specific model identifies entities within notes and performs a consistency check against a reference table\.

The ground\-truth content may appear in different surface forms but still refer to the same underlying information\. Even if the line numbers are different, as long as they refer to the same event, they should be considered as comparable\.

Please determine whether the given ground\-truth item and any of the prediction results refer to the same content, and if so, verify that the consistency check was performed correctly\.

Even if the reason for the prediction is different from the ground truth record, it is correct as long as the consistency is the same\.

If any given ground\-truth event is not associated with any of the prediction results, it should be considered incorrect\.

Note that in some cases the consistency check was already performed earlier and was therefore skipped\. These cases are listed under <Consistency check was already completed\>, and if those entries refer to the same ground\-truth content and the consistency\-check result matches, they should be marked as “Correct”\.

The required output format is:

Reason: \(REASON\)

Result: \(Correct or Incorrect\)

Figure 29: Prompt Template for Harsh Evaluator\.

Table 9: Comparison of evaluation results between Gemini 2\.5 Pro and GPT\-5 under identical settings, showing consistent performance across Lenient and Harsh evaluators\.

Prompt Template for Lenient Evaluator

You are an assistant that compares ground truth data to prediction results\.

\(Ground Truth Data\) represents an individual data item from the ground\-truth set, and \(Prediction Results\) are the outputs of a process in which a specific model identifies entities within notes and performs a consistency check against a reference table\.

The ground\-truth content may appear in different surface forms but still refer to the same underlying information\. Even if the line numbers are different, as long as they refer to the same event, they should be considered as comparable\.

Please determine whether the given ground\-truth item and any of the prediction results refer to the same content, and if so, verify that the consistency check was performed correctly\.

Even if the reason for the prediction is different from the ground truth record, it is correct as long as the consistency is the same\.

If any given ground\-truth event is not associated with any of the prediction results, it should be considered incorrect\.

Important: However, if the reason provided in the prediction result is sufficiently reasonable and logically acceptable, it should still be considered correct\.

For example, if the ground truth checks for “Bradycardia” by examining whether “heart rhythm” shows bradycardic patterns and marks it as inconsistent, but the prediction checks “heart rate” and finds a bradycardic\-level rate, marking it as consistent, this is still correct because the reasoning is reasonable\.

Similarly, if the ground truth marks “right wrist” as consistent because “right arm” exists in the EHR table, but the prediction marks it as inconsistent since “wrist” and “arm” are different, this is also considered correct because the reasoning is sufficiently logical\.

Note that in some cases the consistency check was already performed earlier and was therefore skipped\. These cases are listed under \(Consistency check was already completed\), and if those entries refer to the same ground\-truth content and the consistency\-check result matches, they should be marked as “Correct”\.

The required output format is:

Reason: \(Reason\)

Result: \(Correct or Incorrect\)

Figure 30: Prompt Template for Lenient Evaluator\.
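Both templates request the same two-field output, so a judge response can be turned into a binary label with a small parser. The sketch below is one possible way to do this and is not taken from the paper: the function name `parse_judge_output` and the decision to raise an error on a missing `Result:` line are assumptions.

```python
import re

# Matches a line such as "Result: Correct" or "Result: Incorrect".
RESULT_RE = re.compile(r"^Result:\s*(Correct|Incorrect)\b", re.MULTILINE)
# Captures everything after "Reason:" up to the "Result:" line or end of text.
REASON_RE = re.compile(
    r"^Reason:\s*(.*?)\s*(?=^Result:|\Z)", re.MULTILINE | re.DOTALL
)

def parse_judge_output(text):
    """Extract the judge's reason and binary verdict from its raw output."""
    result = RESULT_RE.search(text)
    if result is None:
        raise ValueError("no 'Result: Correct|Incorrect' line found")
    reason = REASON_RE.search(text)
    return {
        "reason": reason.group(1) if reason else "",
        "is_correct": result.group(1) == "Correct",
    }

sample = "Reason: Both refer to heart rate.\nResult: Correct"
parsed = parse_judge_output(sample)
print(parsed["is_correct"], parsed["reason"])
```

Failing loudly on malformed output keeps unparseable judge responses from being silently counted as Incorrect.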

## Appendix G Statistical Evaluation

To statistically evaluate the framework, we conducted three runs on the validation set of EHR\-ReasonCon using EHR\-Inspector\. In addition, to assess the statistical reliability of the LLM\-based evaluator, each of the three runs was independently evaluated three times using the LLM\-as\-a\-judge approach\. As shown in Table [10](https://arxiv.org/html/2605.26463#A7.T10), the results are reported in terms of mean and standard deviation\.
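The three-runs-by-three-evaluations design separates two sources of variance: the judge and the framework. A minimal sketch of the aggregation is shown below; the scores are hypothetical placeholders, the function name `summarize` is introduced here, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, stdev

def summarize(scores):
    """scores[i][j] is the score of framework run i under judge repetition j."""
    per_run_mean = [mean(run) for run in scores]
    per_run_std = [stdev(run) for run in scores]  # judge variability within a run
    return {
        "per_run_mean": per_run_mean,
        "per_run_std": per_run_std,
        "overall_mean": mean(per_run_mean),
        "across_run_std": stdev(per_run_mean),    # framework variability across runs
    }

# Hypothetical F1 scores, not results from the paper.
scores = [
    [70.0, 71.0, 72.0],
    [69.0, 70.0, 71.0],
    [71.0, 72.0, 73.0],
]
summary = summarize(scores)
print(summary)
```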

Table 10: Detailed evaluation results across runs with per\-run averages and standard deviations under Lenient and Harsh settings\.

## Appendix H Error Analysis by Model

To better understand model\-specific failure patterns in EHR\-Inspector, we randomly sampled 100 error cases per model and conducted a manual analysis\. Each error was assigned to one of five types: Premature Conclusion Error, Tool Usage Error, Temporal Reasoning Error, Medical Knowledge Error, and Validation Cache Error\.

##### Premature Conclusion Error

As shown in Figure [4](https://arxiv.org/html/2605.26463#S5.F4), this is the most frequent error across all models\. It occurs when a model terminates the verification process before collecting sufficient supporting evidence\. In practice, models often stop after limited exploration \(e\.g\., semantic search or lexical search\) without further table exploration\. This behavior indicates a systematic under\-search issue and highlights the need for mechanisms that encourage deeper exploration or enforce sufficient evidence gathering\.

##### Tool Usage Error

This error arises when models fail to use available tools consistently\. For example, a model may identify “Heart Rate” in a table but subsequently issue a query using an inconsistent term such as “HR,” leading to failed retrieval\. These errors reflect limitations in query formulation and tool interaction, and are particularly prominent in Gemini\-2\.5\-Flash and Qwen3\.

##### Temporal Reasoning Error

This error refers to failures in correctly interpreting time\-related information in clinical notes\. Such notes often contain events that are not chronologically ordered and are densely interwoven, making it difficult to determine when a specific condition or event occurred\. Our analysis shows that models still struggle with such temporally complex and unstructured data\.

Notably, GPT\-OSS exhibits relatively fewer Temporal Reasoning Errors, as well as fewer Tool Usage Errors\. However, this is primarily because it shows a significantly higher rate of Premature Conclusion Errors, often terminating before reaching stages where such errors would occur\. This suggests that lower downstream error rates do not necessarily indicate stronger reasoning ability, but rather earlier failure in the reasoning pipeline\.

##### Medical Knowledge Error

This error occurs when models rely on incorrect or insufficient internal medical knowledge\. While MedGemma, a medically specialized model, shows fewer such errors compared to general\-purpose models, these errors are not entirely eliminated\. This indicates that reliance on parametric knowledge alone is insufficient and that incorporating external knowledge retrieval could further improve performance\.

##### Validation Cache Error

This error involves failures in tracking or reusing intermediate verification states\. It is relatively rare overall and almost negligible in stronger models, suggesting that they maintain more reliable intermediate representations during multi\-step reasoning\.

## Appendix I NER Baselines Implementation

To analyze the impact of the entity extraction stage, we compare our method with three NER baselines while keeping the rest of the framework unchanged: Trained NER, BERT Ensemble, and CheckEHR NER\.

Trained NER fine\-tunes MedGemma 27B on note–entity pairs from EHR\-ReasonCon\. After note segmentation in EHR\-Inspector, each note is divided into segments of varying lengths, and the model is trained to extract entities from these segments\. To construct the training data, each note is split into chunks containing $n$ lines, where each chunk is paired with the entities corresponding to those lines\. We vary $n$ from 2 to the full length of the note, allowing the model to learn from inputs of different granularities\. This setup encourages the model to generalize across varying input lengths and to reliably extract the corresponding entity sets\. Fine\-tuning is performed for one epoch with a learning rate of $2\times 10^{-5}$, an effective batch size of 8 through gradient accumulation, cosine learning\-rate scheduling with a warmup ratio of 0\.03, and a maximum sequence length of 2048 tokens\.
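The chunk construction described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' preprocessing code: chunks are assumed to be consecutive and non-overlapping (the text does not say whether a sliding window is used), and the function name `build_training_pairs` and the per-line entity lists are hypothetical.

```python
def build_training_pairs(note_lines, line_entities):
    """Pair every n-line chunk (n = 2 .. note length) with its entities.

    note_lines:    list of note lines
    line_entities: list of entity lists, aligned with note_lines by index
    """
    pairs = []
    for n in range(2, len(note_lines) + 1):
        # Assumption: consecutive, non-overlapping chunks of n lines each.
        for start in range(0, len(note_lines), n):
            chunk = note_lines[start:start + n]
            entities = [e for ents in line_entities[start:start + n] for e in ents]
            pairs.append(("\n".join(chunk), entities))
    return pairs

# Toy note (hypothetical content).
lines = ["BP 94/49 on admission.", "Started on metoprolol.", "Afebrile overnight."]
entities = [["BP 94/49"], ["metoprolol"], ["Afebrile"]]
pairs = build_training_pairs(lines, entities)
for text, ents in pairs:
    print(repr(text), ents)
```

With a three-line note this yields two chunks for n = 2 (the last one shorter) and one chunk for n = 3, so the model sees the same entities at several input lengths.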

BERT Ensemble aggregates predictions from three biomedical BERT\-based models trained on different datasets \[[2](https://arxiv.org/html/2605.26463#bib.bib4),[18](https://arxiv.org/html/2605.26463#bib.bib6),[33](https://arxiv.org/html/2605.26463#bib.bib5)\]\. In our implementation, we use BERT\-based NER models provided by Hugging Face \([https://huggingface\.co](https://huggingface.co/)\), spaCy \([https://spacy\.io](https://spacy.io/)\), and Stanza \([https://stanfordnlp\.github\.io/stanza/](https://stanfordnlp.github.io/stanza/)\)\.

CheckEHR NER follows the entity extraction method used in the original CheckEHR pipeline\.

## Appendix J Tool Ablation Experiment

Detailed ablation results for the tool category are provided in Table [11](https://arxiv.org/html/2605.26463#A10.T11)\.

Table 11: Detailed ablation results for tool categories\.

## Appendix K Tool Trace Analysis by Model

Figure [31](https://arxiv.org/html/2605.26463#A11.F31) compares the tool\-usage traces generated by three backbone models, GPT\-OSS 20B \[[1](https://arxiv.org/html/2605.26463#bib.bib9)\], Qwen3\-32B \[[40](https://arxiv.org/html/2605.26463#bib.bib10)\], and MedGemma 27B \[[29](https://arxiv.org/html/2605.26463#bib.bib8)\], during consistency verification\. Overall, all three models exhibit denser and more dispersed transition patterns than the human annotators shown in Figure [5](https://arxiv.org/html/2605.26463#S5.F5)\. This suggests that LLM\-based agents tend to explore a broader range of tool combinations across search, history retrieval, and trend analysis, rather than converging on a small set of stable tool\-usage paths\. However, the degree of dispersion differs across models\. GPT\-OSS shows the most concentrated structure among the three models, with prominent paths centered around Semantic\_Search, Lexical\_Search, Get\_Item\_Status\_History, and Get\_Item\_Value\_History\. This indicates that GPT\-OSS tends to follow retrieval\-to\-history verification patterns more consistently than the other models\. In contrast, Qwen exhibits the densest and most widely dispersed transition structure, with frequent transitions across nearly all major tools\. This suggests that Qwen is more likely to explore a wide range of possible tool combinations rather than converge on a specific verification path\. MedGemma also shows a high\-density transition structure despite being specialized for the medical domain\. In particular, repeated verification patterns are observed around Lexical\_Search, Analyze\_Category\_Trend, and Get\_Item\_Status\_History, where retrieval is followed by trend analysis or history inspection\. This indicates that domain\-specific knowledge does not necessarily translate into efficient tool\-use planning in relational\-table\-based verification settings\.
Taken together, although the models differ in their tool\-use behaviors, all of them generate broader and less compact tool\-usage traces than human annotators\.
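A path frequency graph of this kind can be derived by counting consecutive tool pairs in each trace. The sketch below assumes each trace is an ordered list of tool names; the traces are made up for illustration, `transition_counts` is a name introduced here, and the number of distinct transitions is used only as a rough, assumed proxy for the dispersion discussed above.

```python
from collections import Counter

def transition_counts(traces):
    """Count how often each (tool, next_tool) pair occurs across all traces."""
    counts = Counter()
    for trace in traces:
        counts.update(zip(trace, trace[1:]))
    return counts

# Hypothetical traces using tool names mentioned in the text.
traces = [
    ["Semantic_Search", "Get_Item_Value_History"],
    ["Lexical_Search", "Semantic_Search", "Get_Item_Value_History"],
    ["Lexical_Search", "Analyze_Category_Trend", "Get_Item_Status_History"],
]
counts = transition_counts(traces)
print(counts.most_common(1))
print("distinct transitions:", len(counts))
```

A model that converges on a few stable paths produces few distinct transitions with high counts, whereas a dispersed model produces many transitions with low counts.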

![Refer to caption](https://arxiv.org/html/2605.26463v1/x14.png)

Figure 31: Comparison of path frequency graphs across models\.

## Appendix L Prior Knowledge of the MIMIC Schema

To analyze the extent of Gemini’s internal knowledge of the MIMIC schema, we present the prompt and its corresponding response\.

### L\.1 Gemini Prompt

The following content is the response generated by the Gemini model when prompted with: “Tell me what you know about the EHR database MIMIC without using web search\. Describe its table structure and the columns it contains\.”

### L\.2 Gemini Response

The Medical Information Mart for Intensive Care \(MIMIC\) is one of the most significant open\-access datasets in the world of healthcare informatics\. It is maintained by the MIT Laboratory for Computational Physiology and contains de\-identified health\-related data from patients admitted to critical care units at the Beth Israel Deaconess Medical Center\.

While there are several versions \(MIMIC\-II, MIMIC\-III, and the current MIMIC\-IV\), the structure generally follows a relational database model designed to capture the complex timeline of an ICU stay\.

Database Architecture

The database is organized into several modules\. The primary structure is built around a few “core” tables that define patients and their hospital stays, which then link to various “event” tables containing clinical data\.

1\. Core Tables \(The Foundation\)

These tables define the “who” and “when” of the data\.

- PATIENTS: Contains one row per patient\.
  - Columns: subject\_id, gender, dob, dod \(date of death\)\.
- ADMISSIONS: Tracks each unique hospital visit\.
  - Columns: subject\_id, hadm\_id, admittime, dischtime, admission\_type, insurance, religion, marital\_status, ethnicity\.
- ICUSTAYS: Defines stays within an ICU \(since one hospital admission can involve multiple ICU transfers\)\.
  - Columns: subject\_id, hadm\_id, icustay\_id, first\_careunit, last\_careunit, intime, outime, los \(length of stay\)\.

2\. Event Tables \(The Clinical Data\)

These are the largest tables, containing longitudinal data points timestamped to the patient’s stay\.

- CHARTEVENTS: The largest table\. Contains all charted data for a patient \(vitals, sensor data, etc\.\)\.
  - Columns: subject\_id, hadm\_id, icustay\_id, itemid, charttime, valuenum, valueuom \(unit of measurement\)\.
- LABEVENTS: Contains results from laboratory tests \(blood work, chemistry\)\.
  - Columns: subject\_id, hadm\_id, itemid, charttime, value, valuenum, flag \(marks “abnormal” results\)\.
- PRESCRIPTIONS: Medications ordered for the patient\.
  - Columns: subject\_id, hadm\_id, startdate, enddate, drug, dose\_val\_rx, dose\_unit\_rx, route\.
- INPUTEVENTS / OUTPUTEVENTS: Tracks fluids coming in \(IVs, nutrition\) and fluids going out \(urine, drains\)\.

3\. Dictionary Tables

To keep the database normalized, specific codes are mapped to human\-readable labels in “D” tables\.

- D\_ICD\_DIAGNOSES: Maps ICD codes to diagnosis descriptions\.
- D\_ITEMS: Maps the itemid found in CHARTEVENTS to what it actually represents \(e\.g\., Heart Rate, GCS, Temperature\)\.
- D\_LABITEMS: Maps the itemid in LABEVENTS to the specific lab test name\.

Key Identifiers

To join these tables effectively, you rely on a hierarchy of IDs:

1. subject\_id: Unique to a person\.
2. hadm\_id: Unique to a single hospital admission\.
3. icustay\_id: Unique to a single unit stay within that admission\.

Important Note on Privacy

Though the data is incredibly detailed, it is de\-identified in compliance with HIPAA\. Dates are shifted into the future by a random offset for each patient \(though the intervals between events remain chronologically accurate\), and patients over the age of 89 have their ages obscured to prevent identification of the “oldest old\.”

## Appendix M Generalization of EHR\-Inspector

To evaluate whether EHR\-Inspector generalizes to different datasets and database schemas, we conduct additional experiments under three settings\.

### M\.1 Evaluation on EHRCon

To assess whether EHR\-Inspector generalizes to other datasets, we evaluate it on EHRCon, a surface\-level consistency verification dataset\. The results show that EHR\-Inspector achieves a Lenient F1 score of 86\.41, outperforming CheckEHR by 24\.99 points\. This indicates that EHR\-Inspector generalizes well even when applied to datasets with different task formulations\. Detailed results are reported in Table [13](https://arxiv.org/html/2605.26463#A13.T13)\.

### M\.2 Evaluation on MIMIC\-IV

To evaluate robustness under different note formats and database schemas, we conduct experiments on MIMIC\-IV\. To this end, we construct a MIMIC\-IV validation set under a matched annotation scale, manually annotating seven discharge summaries following the EHR\-ReasonCon annotation protocol\. Compared to MIMIC\-III, this setting introduces additional complexity, including new tables \(e\.g\., emar and emar\_detail\)\. Under this setting, performance shows a modest decrease\. The Lenient F1 score drops from 71\.03 to 68\.67, because the drop in Recall \(75\.59 → 68\.18\) outweighs the gain in Precision \(66\.99 → 69\.18\)\. The Harsh F1 score also decreases, as Recall falls \(65\.56 → 58\.27\) while Precision is nearly unchanged \(56\.18 → 56\.72\)\. The MIMIC\-IV tables and columns we used are listed in Table [12](https://arxiv.org/html/2605.26463#A13.T12)\.
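The relationship between the reported Precision, Recall, and F1 values can be checked with the standard harmonic-mean formula. The sketch below assumes F1 is computed directly from the reported Precision and Recall; the Lenient values reproduce the stated F1 scores to within rounding, which supports that assumption. The Harsh F1 values it prints (roughly 60.5 and 57.5) are derived here and are not reported in the text.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (both on a 0-100 scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision/Recall pairs taken from the paragraph above.
lenient_mimic3 = f1(66.99, 75.59)
lenient_mimic4 = f1(69.18, 68.18)
harsh_mimic3 = f1(56.18, 65.56)   # derived, not reported in the text
harsh_mimic4 = f1(56.72, 58.27)   # derived, not reported in the text

for name, value in [
    ("Lenient MIMIC-III", lenient_mimic3),
    ("Lenient MIMIC-IV", lenient_mimic4),
    ("Harsh MIMIC-III", harsh_mimic3),
    ("Harsh MIMIC-IV", harsh_mimic4),
]:
    print(f"{name}: {value:.2f}")
```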

Table 12: Tables and columns of MIMIC\-IV\.

### M\.3 Evaluation on Perturbed MIMIC\-III

To examine whether EHR\-Inspector relies excessively on the base LLM’s prior knowledge of the MIMIC schema, we construct a perturbed database by replacing table and column names with new identifiers\. This enables evaluation under an unseen schema\. As a result, the Lenient F1 score decreases from 74\.44 to 69\.73, suggesting that familiarity with the original schema provides some benefit\. Nevertheless, EHR\-Inspector still achieves a reasonably high F1 score, indicating robustness to schema variations\. Detailed results are reported in Table [14](https://arxiv.org/html/2605.26463#A13.T14), and the perturbed table and column names are listed in Table 15\.
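One way to build such a perturbed database is to copy each table under its new name while aliasing every column. The sketch below uses Python's built-in `sqlite3` with a toy in-memory table and a subset of the CHARTEVENTS mapping from Table 15; the function `perturb_schema`, the toy row, and the choice of SQLite are assumptions made for illustration, not a description of how the authors constructed their database.

```python
import sqlite3

# Subset of the CHARTEVENTS mapping listed in Table 15.
TABLE_MAP = {"CHARTEVENTS": "VITALS_TIMESERIES"}
COLUMN_MAP = {
    "CHARTEVENTS": {
        "ROW_ID": "ID",
        "ITEMID": "ITEM_ID",
        "CHARTTIME": "OBSERVED_AT",
        "VALUENUM": "VALUE_NUM",
        "VALUEUOM": "VALUE_UNIT",
    }
}

def perturb_schema(conn):
    """Copy each mapped table under its new name with renamed columns."""
    for old_table, new_table in TABLE_MAP.items():
        select_list = ", ".join(
            f'"{old}" AS "{new}"' for old, new in COLUMN_MAP[old_table].items()
        )
        conn.execute(
            f'CREATE TABLE "{new_table}" AS SELECT {select_list} FROM "{old_table}"'
        )
        conn.execute(f'DROP TABLE "{old_table}"')

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE CHARTEVENTS "
    "(ROW_ID INTEGER, ITEMID INTEGER, CHARTTIME TEXT, VALUENUM REAL, VALUEUOM TEXT)"
)
# Toy row with made-up values.
conn.execute("INSERT INTO CHARTEVENTS VALUES (1, 211, '2101-01-01 08:00:00', 88.0, 'bpm')")
perturb_schema(conn)

columns = [row[1] for row in conn.execute('PRAGMA table_info("VITALS_TIMESERIES")')]
print(columns)
```

Because only identifiers change and the rows are copied unchanged, any performance difference between the two databases can be attributed to schema familiarity rather than to the data.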

Table 13: Comparison of EHR\-Inspector and CheckEHR on the EHRCon benchmark for note–table consistency verification, evaluated across note types using Lenient and Harsh metrics\.

Table 14: Evaluation results of EHR\-Inspector across two database settings \(original vs\. perturbed\) under identical experimental conditions, reported for both Lenient and Harsh evaluators\.

Table 15: Tables and columns used with perturbation mappings\.

| TABLE | COLUMN | PURT\_TABLE | PURT\_COLUMN |
| --- | --- | --- | --- |
| CHARTEVENTS | ROW\_ID | VITALS\_TIMESERIES | ID |
| | ITEMID | | ITEM\_ID |
| | CHARTTIME | | OBSERVED\_AT |
| | VALUE | | VALUE\_TEXT |
| | VALUENUM | | VALUE\_NUM |
| | VALUEUOM | | VALUE\_UNIT |
| MICROBIOLOGYEVENTS | ROW\_ID | MICRO\_RESULTS | ID |
| | CHARTTIME | | COLLECTED\_AT |
| | SPEC\_ITEMID | | SPECIMEN\_ITEM\_ID |
| | SPEC\_TYPE\_DESC | | SPECIMEN\_TYPE |
| | ORG\_ITEMID | | ORGANISM\_ITEM\_ID |
| | ORG\_NAME | | ORGANISM\_NAME |
| | AB\_ITEMID | | ANTIBIOTIC\_ITEM\_ID |
| | AB\_NAME | | ANTIBIOTIC\_NAME |
| | INTERPRETATION | | SUSCEPTIBILITY |
| D\_ICD\_DIAGNOSIS | ROW\_ID | ICD\_DIAGNOSIS\_REF | ID |
| | ICD9\_CODE | | ICD9CODE |
| | SHORT\_TITLE | | DX\_SHORT\_NAME |
| | LONG\_TITLE | | DX\_LONG\_NAME |
| D\_ICD\_PROCEDURES | ROW\_ID | ICD\_PROCEDURE\_REF | ID |
| | ICD9\_CODE | | ICD9CODE |
| | SHORT\_TITLE | | PROC\_SHORT\_NAME |
| | LONG\_TITLE | | PROC\_LONG\_NAME |
| DIAGNOSIS\_ICD | ROW\_ID | PATIENT\_DIAGNOSES | ID |
| | ICD9\_CODE | | ICD\_CODE |
| PROCEDURE\_ICD | ROW\_ID | PATIENT\_PROCEDURES | ID |
| | ICD9\_CODE | | ICD\_CODE |
| LABEVENTS | ROW\_ID | LAB\_RESULTS | ID |
| | ITEMID | | LAB\_ITEM\_ID |
| | CHARTTIME | | MEASURED\_AT |
| | VALUE | | VALUE\_TEXT |
| | VALUENUM | | VALUE\_NUM |
| | VALUEUOM | | VALUE\_UNIT |
| | FLAG | | ABNORMAL\_FLAG |
| D\_LABITEMS | ROW\_ID | LAB\_TEST\_REF | ID |
| | ITEMID | | LAB\_ITEM\_ID |
| | LABEL | | LAB\_TEST\_NAME |
| | FLUID | | SPECIMEN\_FLUID |
| D\_ITEMS | ROW\_ID | CLINICAL\_ITEM\_REF | ID |
| | ITEMID | | ITEM\_ID |
| PROCEDUREEVENTS\_MV | ROW\_ID | PROCEDURES\_MV\_LOG | ID |
| | STARTTIME | | START\_AT |
| | ENDTIME | | END\_AT |
| | ITEMID | | ITEM\_ID |
| | LOCATION | | PROCEDURE\_LOCATION |
| PRESCRIPTIONS | ROW\_ID | MEDICATION\_ORDERS | ID |
| | STARTDATE | | START\_AT |
| | ENDDATE | | END\_AT |
| | DRUG | | DRUG\_NAME |
| | PROD\_STRENGTH | | PRODUCT\_STRENGTH |
| | DOSE\_VAL\_RX | | DOSE\_VALUE |
| | DOSE\_UNIT\_RX | | DOSE\_UNIT |
| | FORM\_VAL\_DISP | | DISPENSE\_VALUE |
| | FORM\_UNIT\_DISP | | DISPENSE\_UNIT |
| | ROUTE | | ADMIN\_ROUTE |
| INPUTEVENTS\_MV | ROW\_ID | FLUID\_INPUTS\_MV | ID |
| | ITEMID | | ITEM\_ID |
| | STARTTIME | | START\_AT |
| | ENDTIME | | END\_AT |
| | AMOUNT | | AMOUNT\_VALUE |
| | AMOUNTUOM | | AMOUNT\_UNIT |
| | RATE | | RATE\_VALUE |
| | RATEUOM | | RATE\_UNIT |
| | ORIGINALAMOUNT | | ORIG\_AMOUNT\_VALUE |
| | ORIGINALRATE | | ORIG\_RATE\_VALUE |
| INPUTEVENTS\_CV | ROW\_ID | FLUID\_INPUTS\_CV | ID |
| | ITEMID | | ITEM\_ID |
| | CHARTTIME | | CHART\_AT |
| | AMOUNT | | AMOUNT\_VALUE |
| | AMOUNTUOM | | AMOUNT\_UNIT |
| | RATE | | RATE\_VALUE |
| | RATEUOM | | RATE\_UNIT |
| | ORIGINALAMOUNT | | ORIG\_AMOUNT\_VALUE |
| | ORIGINALAMOUNTUOM | | ORIG\_AMOUNT\_UNIT |
| | ORIGINALROUTE | | ORIG\_ROUTE |
| | ORIGINALRATE | | ORIG\_RATE\_VALUE |
| | ORIGINALRATEUOM | | ORIG\_RATE\_UNIT |
| OUTPUTEVENTS | ROW\_ID | OUTPUT\_FLUIDS | ID |
| | ITEMID | | ITEM\_ID |
| | CHARTTIME | | RECORDED\_AT |
| | VALUE | | VALUE |
| | VALUEUOM | | VALUE\_UNIT |
| DIAGNOSES\_ICD | ROW\_ID | DIAGNOSIS\_ICD\_CODES | ID |
| | ICD9\_CODE | | ICD9CODE |
| PROCEDURES\_ICD | ROW\_ID | PROCEDURE\_ICD\_CODES | ID |
| | ICD9\_CODE | | ICD9CODE |
| D\_LABITEMS | ROW\_ID | LAB\_TEST\_INFO | ID |
| | ITEMID | | ITEM\_ID |
| | LABEL | | TEST\_NAME |
| | FLUID | | SAMPLE\_FLUID |
| | CATEGORY | | TEST\_CATEGORY |
| D\_ITEMS | ROW\_ID | ITEM\_INFORMATION | ID |
| | ITEMID | | ITEM\_ID |
| | LABEL | | ITEM\_NAME |
| | ABBREVIATION | | ITEM\_ABBREVIATION |
| | DBSOURCE | | DATASOURCE |
| | LINKSTO | | LINKED\_TABLE |
| | CATEGORY | | ITEM\_CATEGORY |
| | UNITNAME | | UNIT\_NAME |
| | PARAM\_TYPE | | PARAMETER\_TYPE |
| | CONCEPTID | | CONCEPT\_ID |

## Appendix N Sample Data of EHR\-ReasonCon

The sample data of EHR\-ReasonCon is described in Figure [32](https://arxiv.org/html/2605.26463#A14.F32)\.

![Refer to caption](https://arxiv.org/html/2605.26463v1/x15.png)

Figure 32: Sample cases from EHR\-ReasonCon\. The Hypotensive example illustrates a consistent case, whereas the soft mechanical diet example demonstrates an inconsistent case caused by omission\. Commonsense\_medical\_none indicates whether consistency verification relied on commonsense reasoning \(c\), medical knowledge \(m\), or no deep reasoning\. When medical knowledge is used, the corresponding reference is recorded in the medical\_knowledge\_source field\. The consistency\_check\_path denotes the annotator\-labeled reasoning trace, and evidence\_row\_ids records the row IDs in the structured tables that support the consistency or inconsistency judgment\. To prevent patient identification, some content in the notes and tables was added, removed, or modified\.

## Appendix O Experiment Setup Details

### O\.1 Dataset Resources

- MIMIC\-III \[[10](https://arxiv.org/html/2605.26463#bib.bib17)\]
  - URL: https://physionet\.org/content/mimiciii/1\.4/
  - License: PhysioNet Credentialed Health Data License 1\.5\.0
- MIMIC\-IV \[[9](https://arxiv.org/html/2605.26463#bib.bib40)\]
  - URL: https://physionet\.org/content/mimiciv/3\.1/
  - License: PhysioNet Credentialed Health Data License 1\.5\.0
- EHRCon \[[12](https://arxiv.org/html/2605.26463#bib.bib33)\]
  - URL: https://physionet\.org/content/ehrcon\-consistency\-of\-notes/1\.0\.0/
  - License: PhysioNet Credentialed Health Data License 1\.5\.0

### O\.2 Model Resources

This study categorizes the evaluated models into two primary types: proprietary and open\-source\. The proprietary lineup consists of Gemini\-2\.5\-Pro, Gemini\-2\.5\-Flash, and GPT\-5\. Each was accessed via its provider’s API and used strictly within the provider’s terms of service\. In all experiments, GPT\-5 was accessed via HIPAA\-compliant Azure OpenAI services, and the Gemini 2\.5 models were accessed via Google Vertex AI\. Table [16](https://arxiv.org/html/2605.26463#A15.T16) summarizes the estimated average inference cost per note, in U\.S\. dollars, for each proprietary model\. The specific model versions are as follows:

- Gemini\-2\.5\-Flash \[[4](https://arxiv.org/html/2605.26463#bib.bib11)\]
- Gemini\-2\.5\-Pro \[[4](https://arxiv.org/html/2605.26463#bib.bib11)\]
- GPT\-5 \[[30](https://arxiv.org/html/2605.26463#bib.bib1)\]

Table 16: Estimated inference costs for proprietary models\.

The open\-source models were obtained from Hugging Face \[[37](https://arxiv.org/html/2605.26463#bib.bib48)\]\. The corresponding Hugging Face repositories and licenses for each model are listed below:

- Qwen3\-32B \[[40](https://arxiv.org/html/2605.26463#bib.bib10)\]
  - Hugging Face Path: Qwen/Qwen3\-32B
  - License: Apache license 2\.0
- GPT\-OSS 20B \[[1](https://arxiv.org/html/2605.26463#bib.bib9)\]
  - Hugging Face Path: openai/gpt\-oss\-20b
  - License: Apache license 2\.0
- MedGemma 27B \[[29](https://arxiv.org/html/2605.26463#bib.bib8)\]
  - Hugging Face Path: google/medgemma\-27b\-it
  - License: health\-ai\-developer\-foundations

### O\.3 Experiment Details

##### Dataset

We split the 105 clinical notes into 83 for the test set and 22 for the validation set\. The main experiments were conducted on the test set, and the validation set was used to develop EHR\-Inspector\. For MIMIC\-IV, we additionally annotated seven discharge summary notes and used them in our experiments\.

##### Compute Requirements and Hyperparameters

On average, open\-source models required approximately 9 hours per note to complete our task, while Gemini\-2\.5\-Flash required approximately 3 hours per note\. This is substantially shorter than human annotation, which required approximately 12 hours per note on average, even when performed by annotators familiar with EHRs\. For model\-based evaluation, Gemini\-2\.5\-Pro required approximately 1 hour per note, and GPT\-5 also required approximately 1 hour per note on average\. For open\-source models, inference was performed on a single NVIDIA A100 GPU with a maximum generation length of 2048 tokens, temperature set to 0\.5, and top\-p set to 0\.95\. Gemini models were run using the default inference settings\.

##### Additional NER Models and Fine\-tuning Details

Additional NER models were used for the entity extraction ablation study\. Specifically, the BERT Ensemble baseline aggregates predictions from biomedical BERT\-based NER models, while the Trained NER baseline fine\-tunes MedGemma 27B on note–entity pairs from EHR\-ReasonCon\. The additional model resources and fine\-tuning hyperparameters are as follows:

- Clinical BERT Model
  - Hugging Face Path: blaze999/clinical\-ner
  - License: MIT
- spaCy
  - https://spacy\.io/
- Stanza
  - https://stanfordnlp\.github\.io/stanza/
- MedGemma 27B \[[29](https://arxiv.org/html/2605.26463#bib.bib8)\]
  - Hugging Face Path: google/medgemma\-27b\-it
  - Hyperparameters:
    - Epochs: 1
    - Learning rate: $2\times 10^{-5}$
    - Batch size: 8
    - Cosine learning\-rate scheduling with a warmup ratio of 0\.03
    - Maximum sequence length of 2048 tokens
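The learning-rate schedule implied by these hyperparameters can be sketched in a few lines. This is an illustration under assumptions the text does not confirm: warmup is taken to be linear from zero, the cosine phase is taken to decay to zero by the final step, and the function name `learning_rate` and the step count of 1000 are hypothetical.

```python
import math

def learning_rate(step, total_steps, peak_lr=2e-5, warmup_ratio=0.03):
    """Linear warmup to peak_lr, then cosine decay to zero (assumed shape)."""
    warmup_steps = max(1, round(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total = 1000  # hypothetical number of optimizer steps in the single epoch
for step in (0, 15, 30, 515, 1000):
    print(step, learning_rate(step, total))
```

With a warmup ratio of 0.03, the first 3% of optimizer steps ramp the rate up to the peak, and the remaining steps follow the cosine curve.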

## Appendix P Author Statement

Any legal or ethical issues, including potential rights violations related to EHR\-ReasonCon, are the sole responsibility of the authors\.
