EHRBench: An Automated and Reliable EHR-based Benchmark for Clinical Decision Making with LLMs

Summary

EHRBench is an automated and reliable benchmark for evaluating LLMs on clinical decision-making tasks using real-world electronic health records, covering nearly 1M QA items across diagnosis, treatment, and prognosis tasks.

arXiv:2605.30637v1 Announce Type: new

# EHRBench: An Automated and Reliable EHR-based Benchmark for Clinical Decision Making with LLMs
Source: [https://arxiv.org/html/2605.30637](https://arxiv.org/html/2605.30637)
\(2026\)

###### Abstract\.

Clinical decision\-making \(CDM\) is central to real\-world clinical workflows, where clinicians infer diagnoses, select treatments, or anticipate future health outcomes under incomplete evidence\. LLMs are increasingly used to support these decisions due to strong language capabilities, broad biomedical knowledge, and efficiency, yet the reliability of LLMs on real\-world clinical decision tasks remains insufficiently understood\. To evaluate CDM models, especially LLM\-based models, an ideal and practical medical decision benchmark should be constructed via an automated yet reliable pipeline to ensure both scale and quality\. Moreover, the grounding of a CDM benchmark in real patient EHRs can better support evaluation on practical CDM tasks that require substantive biomedical knowledge and clinical inference\. To fill the gaps, we introduce EHRBench, an automated and reliable EHR\-grounded benchmark for evaluating LLM\-based clinical decision\-making at scale\. To ensure scalability and reliability, EHRBench is constructed through an EHR–LLM–knowledge\-base \(KB\) interaction pipeline\. For efficiency, we use a specialized LLM to automatically convert encounter\-level EHR trajectories into structured templates and deterministically instantiate the templates into QA items\. In parallel, we apply systematic KB\-based verification and enrichment to filter hallucinated or ambiguous relations and to improve reliability\. Using this pipeline, we construct nearly 1M \(960,067\) QA items spanning three core inference\-required clinical decision tasks: diagnosis, treatment, and prognosis\. We benchmark more than 30 representative LLMs on EHRBench and provide detailed analyses of performance and robustness\. 
The results show consistent capability trends across settings, further validating the reliability of EHRBench and highlighting actionable gaps toward clinically reliable LLM systems\. The guideline for source code and data of EHRBench is available on GitHub: [https://github\.com/constantjxyz/EHRBench](https://github.com/constantjxyz/EHRBench)\.

Large language models; Electronic health records; Clinical decision making; Medical question answering; Benchmark; Knowledge base verification

Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V\.2 \(KDD 2026\), August 9–13, 2026, Jeju Island, Republic of Korea\. ISBN: 979\-8\-4007\-2259\-2/2026/08\. DOI: 10\.1145/3770855\.3817571\. Copyright: cc\.

CCS Concepts: Applied computing → Health care information systems; Information systems → Data mining; Computing methodologies → Natural language processing; Computing methodologies → Artificial intelligence\.

## 1\.Introduction

Clinical decision\-making \(CDM\) is a fundamental part of real\-world clinical workflows, where clinicians must infer diagnoses, determine treatments, or forecast future clinical states from incomplete evidence\(Subbiah,[2023](https://arxiv.org/html/2605.30637#bib.bib1); Harishet al\.,[2021](https://arxiv.org/html/2605.30637#bib.bib2); Masic,[2022](https://arxiv.org/html/2605.30637#bib.bib3); Pelacciaet al\.,[2017](https://arxiv.org/html/2605.30637#bib.bib4)\)\. For instance, given the observed diagnoses of an encounter, an*in\-encounter diagnosis completion*decision requires inferring concurrent conditions, such as identifying chronic kidney disease when type 2 diabetes and diabetic nephropathy are present\. Similarly, an*in\-encounter treatment selection*decision involves selecting appropriate treatments, such as identifying the necessary anticoagulation for a patient with atrial fibrillation\. Furthermore, a*next\-encounter prognosis prediction*decision requires anticipating potential downstream outcomes or diagnoses in subsequent encounters, such as forecasting ischemic stroke risk in patients with hypertension and hyperlipidemia\. These decisions directly affect patient care and outcomes, carrying substantial clinical significance for patient safety and well\-being\(Panagiotiet al\.,[2019](https://arxiv.org/html/2605.30637#bib.bib5); Vaseyet al\.,[2021](https://arxiv.org/html/2605.30637#bib.bib6)\)\.

Large language models \(LLMs\) are increasingly deployed to support these clinical decisions\(Zhou and others,[2025](https://arxiv.org/html/2605.30637#bib.bib9); Molinetet al\.,[2024](https://arxiv.org/html/2605.30637#bib.bib10); Jiaet al\.,[2025](https://arxiv.org/html/2605.30637#bib.bib11); Onianiet al\.,[2024](https://arxiv.org/html/2605.30637#bib.bib12); Xieet al\.,[2025a](https://arxiv.org/html/2605.30637#bib.bib13); Huniket al\.,[2025](https://arxiv.org/html/2605.30637#bib.bib14)\), owing to their robust language understanding capabilities, broad biomedical knowledge acquired during pre\-training, and superior efficiency relative to traditional manual workflows\(Singhalet al\.,[2023](https://arxiv.org/html/2605.30637#bib.bib7); Kumaret al\.,[2023](https://arxiv.org/html/2605.30637#bib.bib8); Bhasuranet al\.,[2025](https://arxiv.org/html/2605.30637#bib.bib120); Luet al\.,[2025](https://arxiv.org/html/2605.30637#bib.bib125); Wanget al\.,[2025a](https://arxiv.org/html/2605.30637#bib.bib132)\)\. This rapid progress raises a central question: how reliably do LLMs perform on core clinical decision tasks when the evidence reflects patient\-specific, real\-world clinical data? Benchmarks are essential for addressing this question, as they enable controlled, reproducible comparisons and provide guidance for the development of safer CDM systems\.

Building these benchmarks requires an automated and reliable construction pipeline\. Historically, many medical QA resources have achieved high quality through substantial domain expertise and meticulous manual curation\(Malaviyaet al\.,[2024](https://arxiv.org/html/2605.30637#bib.bib15); Singhalet al\.,[2023](https://arxiv.org/html/2605.30637#bib.bib7); Zhouet al\.,[2025](https://arxiv.org/html/2605.30637#bib.bib18); Wornowet al\.,[2023](https://arxiv.org/html/2605.30637#bib.bib19)\)\. However, the high cost associated with manual effort typically limits these benchmarks to a small number of patient records, which restricts the scale and diversity of evaluation\(Yanet al\.,[2024](https://arxiv.org/html/2605.30637#bib.bib16); Bosmaet al\.,[2025](https://arxiv.org/html/2605.30637#bib.bib17)\)\. Since CDM is inherently complex and multi\-faceted, large\-scale benchmarks are essential for comprehensive evaluation, which in turn necessitates the transition toward automated construction pipelines\.

Recent studies have explored the use of LLMs themselves to scale up benchmark creation by generating questions under specific constraints\(Longet al\.,[2024](https://arxiv.org/html/2605.30637#bib.bib20); Artsiet al\.,[2024](https://arxiv.org/html/2605.30637#bib.bib21); Sileoet al\.,[2024](https://arxiv.org/html/2605.30637#bib.bib22)\)\. While this produces a large volume of data, it raises quality issues since LLMs can hallucinate\. Consequently, ensuring that LLM\-generated benchmarks are clinically realistic and unambiguous requires more than formatting constraints alone; it necessitates systematic validation \(e\.g\., via external knowledge bases\) to mitigate hallucinated clinical relations and ambiguous answers\(Huanget al\.,[2025a](https://arxiv.org/html/2605.30637#bib.bib23); Niuet al\.,[2024](https://arxiv.org/html/2605.30637#bib.bib24)\)\.

Beyond constructing an automated and reliable pipeline, the data source of the benchmark is also important\. Grounding a CDM benchmark in patients’ real electronic health records \(EHRs\) facilitates more authentic evaluations of practical CDM tasks\. Currently, most existing medical benchmarks are derived from well\-formed narrative sources such as exams, textbooks, clinical guidelines, and clinical notes\(Jinet al\.,[2021](https://arxiv.org/html/2605.30637#bib.bib26); Palet al\.,[2022](https://arxiv.org/html/2605.30637#bib.bib27); Liuet al\.,[2024a](https://arxiv.org/html/2605.30637#bib.bib32); Kweonet al\.,[2024](https://arxiv.org/html/2605.30637#bib.bib33); Mehandruet al\.,[2025](https://arxiv.org/html/2605.30637#bib.bib35); Liuet al\.,[2024b](https://arxiv.org/html/2605.30637#bib.bib36); Kimet al\.,[2024](https://arxiv.org/html/2605.30637#bib.bib37); Dadaet al\.,[2025](https://arxiv.org/html/2605.30637#bib.bib28); Zhanget al\.,[2025a](https://arxiv.org/html/2605.30637#bib.bib34); Zuoet al\.,[2025](https://arxiv.org/html/2605.30637#bib.bib30); Wanget al\.,[2025b](https://arxiv.org/html/2605.30637#bib.bib31)\)\. These sources often make clinical reasoning explicit—for example, by directly stating the rationale for diagnoses or treatments—thereby reducing the need for inference\. In contrast, clinicians routinely reason over longitudinal EHRs, where the underlying clinical logic is not pre\-digested but must be inferred from patterns of structured events\. Unlike general medical sources that emphasize idealized and broadly applicable knowledge, EHRs capture personalized, longitudinal, real\-world clinical events and care patterns at scale\(Knevel and Liao,[2023](https://arxiv.org/html/2605.30637#bib.bib38); Xieet al\.,[2026](https://arxiv.org/html/2605.30637#bib.bib128); Zhanget al\.,[2025b](https://arxiv.org/html/2605.30637#bib.bib133)\)\. 
Furthermore, compared with free\-text clinical notes, which are costly to curate and typically focus on a limited set of salient details, the structured tabular components of EHRs have far higher volume, cover a broader range of clinical concepts, and reflect substantially greater variability in real\-world practice\(Kimet al\.,[2023](https://arxiv.org/html/2605.30637#bib.bib39)\)\.

Despite this potential, directly leveraging raw structured EHRs for benchmark construction remains challenging\. Clinical relationships in EHRs are largely implicit and must be inferred from temporally ordered events, while fragmentation across coding systems complicates faithful transformation into natural\-language prompts without introducing artifacts or label leakage\(Wuet al\.,[2025b](https://arxiv.org/html/2605.30637#bib.bib130); Xieet al\.,[2024a](https://arxiv.org/html/2605.30637#bib.bib118)\)\. In addition, EHR trajectories are often extremely long, making it difficult to convert raw records into LLM\-feasible inputs while preserving data fidelity\(Zhanget al\.,[2024b](https://arxiv.org/html/2605.30637#bib.bib119); Shaoet al\.,[2026](https://arxiv.org/html/2605.30637#bib.bib131)\)\. As a result, existing EHR\-based benchmarks often emphasize reading\-comprehension or information\-retrieval tasks \(e\.g\., “what treatment did the patient receive during this visit”\(Leeet al\.,[2022](https://arxiv.org/html/2605.30637#bib.bib42); Xuet al\.,[2025](https://arxiv.org/html/2605.30637#bib.bib41)\)\), rather than core CDM tasks that require substantive biomedical knowledge and clinical inference, such as deciding what should be prescribed given a diagnosis\.

![Refer to caption](https://arxiv.org/html/2605.30637v1/x1.png)

Figure 1\. Overview of EHRBench\. EHRBench automatically and reliably transforms raw structured EHR trajectories into QA benchmarks via an EHR\-LLM\-KB interaction pipeline and evaluates representative LLMs on three core clinical decision tasks: diagnosis, treatment, and prognosis\.

To bridge these gaps, we introduce EHRBench, an automated and reliable benchmark grounded in real\-world electronic health records \(EHRs\) for evaluating the clinical decision\-making capabilities of LLMs\. As illustrated in Figure [1](https://arxiv.org/html/2605.30637#S1.F1), our framework systematically transforms raw structured EHR trajectories into a benchmark that is both large\-scale and high\-quality through a multi\-stage pipeline that integrates EHR data, LLMs, and external biomedical knowledge bases \(KBs\)\. Specifically, we use LLMs to generate question templates \(including clinical relations, questions, and answers\) from EHR trajectories, which are concurrently validated \(for clinical relations\) and enriched \(with entity definitions and retrieved evidence\) using external KBs to ensure clinical reliability\. These generated templates are deterministically instantiated into multiple types of QA items to ensure diversity\. Using EHRBench, we evaluate representative LLMs on three core CDM tasks that require substantive biomedical knowledge and clinical inference, covering in\-encounter diagnosis completion \(diagnosis\), in\-encounter treatment selection \(treatment\), and next\-encounter outcome prediction \(prognosis\)\. We further analyze model performance in terms of accuracy, efficiency, and robustness\.

Our contributions are summarized as follows:

- We construct EHRBench, a large\-scale, EHR\-grounded QA benchmark for evaluating LLMs’ clinical decision\-making capabilities, comprising nearly 1 million QA items \(960,067\)\. To the best of our knowledge, EHRBench is the first benchmark built directly from raw structured EHR trajectories that leverages LLMs for question template generation while enforcing systematic verification for clinical reliability\.
- We propose an automated and reliable benchmark construction framework based on EHR–LLM–KB interactions, where LLMs enable scalable template generation, KBs provide principled validation and enrichment, and EHR trajectories supply realistic longitudinal clinical evidence\.
- We formulate clinical decision making as conditional inference over partially observed EHR data and design three representative tasks: diagnosis completion, treatment selection, and next\-encounter prognosis, which require substantive biomedical knowledge and clinical inference over implicit clinical relations and longitudinal patient trajectories\.
- We systematically benchmark more than 30 representative LLMs on EHRBench and conduct comprehensive analyses of their accuracy, efficiency, and robustness, providing actionable insights for developing and evaluating clinically reliable LLM systems\.

## 2\.Related Work

Medical QA Benchmarks\.Medical QA benchmarks are essential for measuring the biomedical knowledge and reasoning capabilities of clinical decision\-supporting models, including LLMs\(Xiaoet al\.,[2025](https://arxiv.org/html/2605.30637#bib.bib100)\)\. A large body of work constructs high\-quality QA resources through expert curation or carefully designed evaluation protocols\. These approaches typically improve correctness and reduce ambiguity, but they often limit dataset scale due to annotation cost and the need for domain expertise, including MedAlign\(Fleminget al\.,[2024](https://arxiv.org/html/2605.30637#bib.bib53)\), SD\-Bench\(Noriet al\.,[2025](https://arxiv.org/html/2605.30637#bib.bib67)\), ExpertQA\(Malaviyaet al\.,[2024](https://arxiv.org/html/2605.30637#bib.bib15)\), and MedThink\-Bench\(Zhouet al\.,[2025](https://arxiv.org/html/2605.30637#bib.bib18)\), which typically contain several hundred expert\-annotated QA pairs\. Most existing large\-scale medical benchmarks are derived from general narrative sources such as exams, textbooks, and clinical guidelines, including MedQA\(Jinet al\.,[2021](https://arxiv.org/html/2605.30637#bib.bib26)\), MedMCQA\(Palet al\.,[2022](https://arxiv.org/html/2605.30637#bib.bib27)\), ClinicBench\(Liuet al\.,[2024a](https://arxiv.org/html/2605.30637#bib.bib32)\), MedXpertQA\(Zuoet al\.,[2025](https://arxiv.org/html/2605.30637#bib.bib30)\), MedChain\(Liuet al\.,[2024b](https://arxiv.org/html/2605.30637#bib.bib36)\), MedExQA\(Kimet al\.,[2024](https://arxiv.org/html/2605.30637#bib.bib37)\), LLM\-Eval\-Med\(Zhanget al\.,[2025a](https://arxiv.org/html/2605.30637#bib.bib34)\), TrialPanorama\(Wanget al\.,[2025b](https://arxiv.org/html/2605.30637#bib.bib31)\), CHBench\(Guoet al\.,[2024](https://arxiv.org/html/2605.30637#bib.bib45)\), CMB\(Wanget al\.,[2024a](https://arxiv.org/html/2605.30637#bib.bib47)\), MedOdyssey\(Fanet al\.,[2025a](https://arxiv.org/html/2605.30637#bib.bib57)\), MedS\-Bench\(Wuet 
al\.,[2025a](https://arxiv.org/html/2605.30637#bib.bib60)\), MultiFacetEval\(Zhouet al\.,[2024](https://arxiv.org/html/2605.30637#bib.bib62)\), ReasonMed\(Sunet al\.,[2025](https://arxiv.org/html/2605.30637#bib.bib66)\), XMedBench\(Wanget al\.,[2024b](https://arxiv.org/html/2605.30637#bib.bib69)\), and related evaluations of medical reasoning and generalization\. In addition, several benchmarks are grounded in healthcare practice\-generated clinical notes, case reports, or dialogue\-style clinical interactions, such as MediSumQA\(Dadaet al\.,[2025](https://arxiv.org/html/2605.30637#bib.bib28)\), EHRNoteQA\(Kweonet al\.,[2024](https://arxiv.org/html/2605.30637#bib.bib33)\), ER\-REASON\(Mehandruet al\.,[2025](https://arxiv.org/html/2605.30637#bib.bib35)\), CPUCase\(Peretset al\.,[2025](https://arxiv.org/html/2605.30637#bib.bib48)\), LongHealth\(Adamset al\.,[2025](https://arxiv.org/html/2605.30637#bib.bib52)\), MedR\-Bench\(Qiuet al\.,[2025](https://arxiv.org/html/2605.30637#bib.bib58)\), MMMU\(Yueet al\.,[2024](https://arxiv.org/html/2605.30637#bib.bib61)\), HealthBench\(Aroraet al\.,[2025](https://arxiv.org/html/2605.30637#bib.bib51)\), DiagnosisArena\(Zhuet al\.,[2025](https://arxiv.org/html/2605.30637#bib.bib110)\), and CRAFT\-MD\(Johriet al\.,[2024](https://arxiv.org/html/2605.30637#bib.bib111)\)\. Safety\-centered medical benchmarks further evaluate risk, harmfulness, and reliability in clinical contexts, such as MedSafetyBench\(Hanet al\.,[2024](https://arxiv.org/html/2605.30637#bib.bib59)\)and MedRisk or related risk\-oriented agents\(Liuet al\.,[2025a](https://arxiv.org/html/2605.30637#bib.bib109)\)\. 
Beyond general QA, some benchmarks focus on specialized competencies such as medical calculation\(Khandekaret al\.,[2024](https://arxiv.org/html/2605.30637#bib.bib54)\), concept\-centric QA\(Shoham and Rappoport,[2024](https://arxiv.org/html/2605.30637#bib.bib55)\), or epidemiological question answering\(Weiet al\.,[2026](https://arxiv.org/html/2605.30637#bib.bib129)\)\. A separate but related direction builds agentic or interactive environments for sequential diagnosis and decision support, including MEDIQ\(Liet al\.,[2024b](https://arxiv.org/html/2605.30637#bib.bib56)\), AI Hospital\(Fanet al\.,[2025b](https://arxiv.org/html/2605.30637#bib.bib63)\), AgentClinic\(Schmidgallet al\.,[2024](https://arxiv.org/html/2605.30637#bib.bib70)\), MAQUE\(Gonget al\.,[2025](https://arxiv.org/html/2605.30637#bib.bib112)\), VivaBench\(Chiuet al\.,[2025](https://arxiv.org/html/2605.30637#bib.bib113)\), AgentHospital\(Liet al\.,[2024a](https://arxiv.org/html/2605.30637#bib.bib114)\), MMD\-Eval\(Liuet al\.,[2025c](https://arxiv.org/html/2605.30637#bib.bib115)\), and AMIE\(Tuet al\.,[2025](https://arxiv.org/html/2605.30637#bib.bib116)\)\. Moreover, multimodal information, including ECG, genomics, imaging, and other medical data, is becoming increasingly important for CDM by providing complementary evidence about patient physiology and disease status\(Wanget al\.,[2026b](https://arxiv.org/html/2605.30637#bib.bib127); Xieet al\.,[2024b](https://arxiv.org/html/2605.30637#bib.bib121),[2022](https://arxiv.org/html/2605.30637#bib.bib122); Wanget al\.,[2026a](https://arxiv.org/html/2605.30637#bib.bib126); Hanet al\.,[2026](https://arxiv.org/html/2605.30637#bib.bib134)\)\. 
This trend has motivated multimodal medical benchmarks such as Asclepius\(Liuet al\.,[2025b](https://arxiv.org/html/2605.30637#bib.bib43)\), CLIMB\(Daiet al\.,[2025](https://arxiv.org/html/2605.30637#bib.bib46)\), EHRXQA\(Baeet al\.,[2023](https://arxiv.org/html/2605.30637#bib.bib49)\), GMAI\-MMBench\(Yeet al\.,[2024](https://arxiv.org/html/2605.30637#bib.bib50)\), OmniMedVQA\(Huet al\.,[2024](https://arxiv.org/html/2605.30637#bib.bib64)\), and PMC\-VQA\(Zhanget al\.,[2023](https://arxiv.org/html/2605.30637#bib.bib65)\)\. Despite this breadth, few benchmarks are built directly from raw structured EHR trajectories in a way that preserves real\-world patterns required for CDM\.

EHR QA Benchmarks\. A growing body of work leverages raw EHR data to construct QA datasets and benchmarks\(Bardhanet al\.,[2022](https://arxiv.org/html/2605.30637#bib.bib101)\)\. However, most existing resources primarily assess the ability of a model to retrieve explicit facts from large, redundant tabular records, rather than to infer clinical decisions from longitudinal context\(Xieet al\.,[2025b](https://arxiv.org/html/2605.30637#bib.bib117)\)\. A representative line of work frames EHR QA as text\-to\-SQL parsing or database querying, including benchmarks such as EHRSQL\(Leeet al\.,[2022](https://arxiv.org/html/2605.30637#bib.bib42)\)and emrQA\(Pampariet al\.,[2018](https://arxiv.org/html/2605.30637#bib.bib102)\), as well as systems that map questions to executable queries \(e\.g\., emrKBQA\(Raghavanet al\.,[2021](https://arxiv.org/html/2605.30637#bib.bib103)\), MIMICSQL\(Wanget al\.,[2020](https://arxiv.org/html/2605.30637#bib.bib104)\)\) or agentic coding workflows\(Xuet al\.,[2025](https://arxiv.org/html/2605.30637#bib.bib41)\)\. Complementary knowledge\-graph\-based approaches, such as ClinicalKBQA\(Wanget al\.,[2022](https://arxiv.org/html/2605.30637#bib.bib105)\)and MIMIC\-SPARQL\(Parket al\.,[2021](https://arxiv.org/html/2605.30637#bib.bib106)\), and temporal\-reasoning benchmarks like TIMER\(Cuiet al\.,[2025](https://arxiv.org/html/2605.30637#bib.bib68)\), further enable relational and time\-aware querying\. Finally, concurrent efforts increasingly evaluate LLMs on clinical decision tasks \(e\.g\., EHR\-R1\(Liaoet al\.,[2025](https://arxiv.org/html/2605.30637#bib.bib108)\)\), underscoring the importance and urgency of our work\. Unlike our approach, however, this prior work does not emphasize an automated and reliable EHR\-LLM\-KB pipeline that explicitly extracts clinical relations from raw structured EHRs and then systematically verifies and filters them using large\-scale knowledge bases\.

Positioning of Our Work\. Complementary to prior benchmarks that rely on curated narratives or emphasize retrieval\-oriented EHR QA, our work targets realistic CDM evaluation by \(i\) grounding the benchmark in raw structured EHR trajectories, \(ii\) formulating three core CDM tasks that require substantive biomedical knowledge and clinical inference beyond information access, \(iii\) using LLMs to extract implicit clinical logic from raw EHR data for efficiency, and \(iv\) enforcing systematic verification and enrichment via biomedical KBs to maintain reliability\.

## 3\.EHRBench Construction Methodology

![Refer to caption](https://arxiv.org/html/2605.30637v1/x2.png)

Figure 2\. Construction pipeline of EHRBench\. Starting from raw structured EHRs, we preprocess and normalize encounter\-level clinical events into a standardized representation\. We then generate structured templates integrating EHR signals, LLM\-based extraction, and KB verification and enrichment\. Finally, the QA generation module deterministically instantiates each template into multiple QA items for downstream evaluation\. Overall, the pipeline is LLM\-driven for scalability, KB\-verified for reliability, and EHR\-grounded for clinical relevance\.

### 3\.1\.Problem Definition & Framework

Our goal is to transform structured EHRs into a clinically grounded QA benchmark for evaluating LLMs through an automated and reliable pipeline\. Figure [2](https://arxiv.org/html/2605.30637#S3.F2) presents an overview of the construction framework\. The pipeline first preprocesses raw structured EHRs and normalizes clinical events into a standardized encounter\-level representation\. It then constructs templates through an automated EHR\-LLM\-KB interaction pipeline that extracts clinically meaningful signals and validates and enriches them through KB evidence\. Finally, the pipeline instantiates each template into multiple QA variants and question formats, yielding task\-specific QA items for evaluation\. In summary, the pipeline is LLM\-powered for scale, KB\-checked for reliability, and grounded in real EHR data for clinical relevance\. We describe each step in detail below\.

EHR data collection & representation\. Let

$$\mathcal{E}=\left\{\mathcal{E}^{(1)},\mathcal{E}^{(2)},\dots,\mathcal{E}^{(N)}\right\} \tag{1}$$

denote a cohort of EHRs from $N$ encounters, where $\mathcal{E}^{(n)}$ is the structured record associated with encounter $n$\. We assume each encounter $\mathcal{E}^{(n)}$ is associated with a patient identifier $\pi(n)$, and encounters are ordered chronologically within each patient\.

Each encounter\-level EHR $\mathcal{E}^{(n)}$ is represented as a set of recorded clinical events:

$$\mathcal{E}^{(n)}=\left\{e^{(n)}_{1},e^{(n)}_{2},\dots,e^{(n)}_{M_{n}}\right\}, \tag{2}$$

where $M_{n}$ is the number of events observed in encounter $n$\.

Each event $e^{(n)}_{m}$ is represented as

$$e^{(n)}_{m}=\left(d^{(n)}_{m},t^{(n)}_{m},a^{(n)}_{m}\right), \tag{3}$$

where $d^{(n)}_{m}$ is a textual description of the clinical event \(e\.g\., diagnosis or prescription\), $t^{(n)}_{m}$ is a timestamp, and $a^{(n)}_{m}$ denotes additional attributes such as medical codes or numerical values\.

In practice, we follow common conventions by treating encounters as the basic temporal unit, aggregating clinical events at the encounter level rather than collapsing them to the patient level or relying on fine\-grained timestamps\. Purely patient\-level aggregation is overly coarse because it ignores temporal structure and distinctions across encounters, while fine\-grained timestamps are fragmented and reflect administrative logging rather than clinical onset \(e\.g\., many diagnosis events are summarized as billing codes at discharge\)\. Using encounters as the primary temporal unit aligns representations with documentation and CDM, enabling consistent aggregation over meaningful windows while allowing downstream tasks to focus on within\-encounter evidence or longitudinal history\.

Additional details regarding data sources and cohort construction are provided in Section [3\.2](https://arxiv.org/html/2605.30637#S3.SS2)\.
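To make the encounter-level representation concrete, the following minimal Python sketch mirrors Equations (1)-(3) and the chronological ordering of encounters within each patient. It is an illustration only: the class names `Event` and `Encounter`, the `start` field used for ordering, and the helper functions are hypothetical and are not taken from the EHRBench code base.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, Iterator, List, Tuple


@dataclass
class Event:
    """One clinical event e = (d, t, a), following Equation (3)."""
    description: str                                        # d: e.g. a diagnosis name
    timestamp: datetime                                     # t: when the event was logged
    attributes: Dict[str, Any] = field(default_factory=dict)  # a: codes, numeric values


@dataclass
class Encounter:
    """One encounter-level record, following Equation (2)."""
    encounter_id: str
    patient_id: str                  # plays the role of pi(n)
    start: datetime                  # hypothetical field used only for ordering
    events: List[Event] = field(default_factory=list)


def group_by_patient(encounters: List[Encounter]) -> Dict[str, List[Encounter]]:
    """Return {patient_id: encounters sorted chronologically}."""
    by_patient: Dict[str, List[Encounter]] = defaultdict(list)
    for enc in encounters:
        by_patient[enc.patient_id].append(enc)
    for encs in by_patient.values():
        encs.sort(key=lambda e: e.start)
    return dict(by_patient)


def consecutive_pairs(encounters: List[Encounter]) -> Iterator[Tuple[Encounter, Encounter]]:
    """Yield (encounter n, encounter n+1) pairs belonging to the same patient."""
    for encs in group_by_patient(encounters).values():
        yield from zip(encs, encs[1:])
```

Keeping the encounter as the unit of aggregation, as argued above, means that within-encounter tasks read a single `Encounter`, while longitudinal tasks can iterate over `consecutive_pairs`.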

Template generation\. A template generation function is

$$g:\mathcal{E}\rightarrow\mathcal{P}, \tag{4}$$

which maps the encounter cohort $\mathcal{E}$ to $K$ structured QA templates:

$$\mathcal{P}=\left\{P_{k}\right\}_{k=1}^{K}. \tag{5}$$

The construction of $\mathcal{P}$ is a multi\-stage interaction among EHR data, LLMs, and biomedical knowledge bases \(KBs\)\. Each template $P_{k}$ defines a clinically grounded blueprint that can be deterministically instantiated into one or more QA items\. Each $P_{k}$ comprises:

- A template context $C_{k}\subset\mathcal{E}^{(n)}$, constructed by an LLM by selecting relevant events from an encounter record $\mathcal{E}^{(n)}$\.
- A clinical relation $R_{k}=(x_{k},r_{k},y_{k})$, where $x_{k}$ and $y_{k}$ are the subject and object entities and $r_{k}$ is the relation predicate that links them, such as $(\textit{Hypertension},\textit{Cause},\textit{Stroke})$\. Each relation is extracted from EHR events using an LLM and verified against KBs to ensure clinical validity\.
- A set of latent attributes $A_{k}$, generated by the LLM or retrieved from KBs, including entity definitions, evidence or rationale, candidate distractors, and the clinical topic associated with the relation\.

All attributes in $P_{k}$ are exposed to the LLM during the subsequent QA generation stage to provide guidance\. Additional details of the template generation procedure are provided in Section [3\.3](https://arxiv.org/html/2605.30637#S3.SS3)\.
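As a rough illustration of the template structure and of KB-based filtering, the sketch below represents a template as a context, a relation triple, and a dictionary of latent attributes, and keeps only templates whose relation appears in a knowledge base. Everything here is assumed for illustration: the names `Relation`, `Template`, `TOY_KB`, and `filter_templates` are hypothetical, the two KB triples are toy entries, and exact-match lookup merely stands in for the verification procedure that the paper describes in Section 3.3.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, FrozenSet, List, Tuple


@dataclass(frozen=True)
class Relation:
    """Clinical relation R_k = (x_k, r_k, y_k)."""
    subject: str
    predicate: str
    obj: str

    def key(self) -> Tuple[str, str, str]:
        # Normalize case so that lookup is insensitive to capitalization.
        return (self.subject.lower(), self.predicate.lower(), self.obj.lower())


@dataclass
class Template:
    """Template P_k = (C_k, R_k, A_k)."""
    context: List[str]                                       # C_k: selected event descriptions
    relation: Relation                                       # R_k
    attributes: Dict[str, Any] = field(default_factory=dict)  # A_k: definitions, distractors, topic


# Toy stand-in for a biomedical knowledge base (hypothetical entries).
TOY_KB: FrozenSet[Tuple[str, str, str]] = frozenset({
    ("hypertension", "cause", "stroke"),
    ("atrial fibrillation", "treated_by", "anticoagulant"),
})


def filter_templates(templates: List[Template],
                     kb: FrozenSet[Tuple[str, str, str]] = TOY_KB) -> List[Template]:
    """Keep only templates whose relation is supported by the KB."""
    return [t for t in templates if t.relation.key() in kb]
```

A real pipeline would need entity normalization across coding systems before any lookup; the lowercase comparison above is only a placeholder for that step.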

QA generation\. A transformation function is defined as

$$f:\mathcal{P}\rightarrow\mathcal{I},\quad\text{where }\mathcal{I}=\left\{\left(S_{j},Q_{j},B_{j}\right)\right\}_{j=1}^{J}, \tag{6}$$

which maps the constructed templates to a collection of $J$ constructed QA items\. Each QA item consists of:

- a textual scenario $S_{j}$, which is a natural\-language paragraph that verbalizes some background clinical events from a patient encounter;
- a natural\-language question $Q_{j}$ constructed from the template;
- a metadata bundle $B_{j}$, including the choices, correct answer, clinical rationale, associated medical topic, and underlying clinical relations\.

More details of the QA generation are provided in Section [C\.4](https://arxiv.org/html/2605.30637#A3.SS4)\.
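One way to picture deterministic instantiation is the sketch below, which turns template fields into a multiple-choice item $(S_{j}, Q_{j}, B_{j})$ so that the same inputs always yield the same choices in the same order. This is an assumption-laden illustration rather than the authors' procedure: the function name `instantiate_mcq`, the hash-based seeding, the scenario wording, and the four-choice format are all hypothetical.

```python
import hashlib
import random
from typing import Any, Dict, List, Tuple


def instantiate_mcq(context: List[str], question: str, answer: str,
                    distractors: List[str],
                    n_choices: int = 4) -> Tuple[str, str, Dict[str, Any]]:
    """Build one multiple-choice item (S, Q, B) reproducibly."""
    # Canonicalize the distractor pool so input order does not matter.
    pool = sorted(set(distractors) - {answer})
    if len(pool) < n_choices - 1:
        raise ValueError("not enough distinct distractors")

    # Derive the seed from the item content, so reruns give identical items.
    seed_text = "|".join([question, answer, *pool])
    seed = int(hashlib.sha256(seed_text.encode("utf-8")).hexdigest(), 16)
    rng = random.Random(seed)

    choices = rng.sample(pool, n_choices - 1) + [answer]
    rng.shuffle(choices)

    scenario = "The patient's encounter includes: " + "; ".join(context) + "."
    bundle = {
        "choices": choices,
        "answer": answer,
        "answer_index": choices.index(answer),
    }
    return scenario, question, bundle
```

Seeding from the item content rather than from a global counter keeps each item reproducible in isolation, which is convenient when items are regenerated in a different order.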

Clinical decision tasks\. Using the formulation above, a collection of QA items $\mathcal{I}$ is constructed to target three core clinical decision tasks that require medical knowledge and inference\. Each task corresponds to a conditional inference objective grounded in encounter\-level EHR data $\mathcal{E}$, where the model receives a scenario $\mathcal{S}^{(n)}$ composed of a subset of observed events\.

\(I\) Diagnosis decision \(in\-encounter diagnosis completion\)\. This task evaluates intra\-encounter diagnostic inference by predicting a missing diagnosis from other diagnoses recorded in the same encounter \(referred to as “diagnosis decision” in this study for brevity\)\. Given an encounter $n$ with diagnosis set $\mathcal{D}^{(n)}$, we withhold a target diagnosis $d^{(n)}_{\mathrm{tgt}}\in\mathcal{D}^{(n)}$ and create a scenario diagnosis subset

$$\mathcal{S}^{(n)}\subseteq\mathcal{D}^{(n)}\setminus\left\{d^{(n)}_{\mathrm{tgt}}\right\}. \tag{7}$$

The model is asked to infer the missing diagnosis:

$$d^{(n)}_{\mathrm{tgt}}\;\sim\;p\!\left(d\;\middle|\;\mathcal{S}^{(n)}\right). \tag{8}$$

Accordingly, the scenario description $S_{j}$ verbalizes $\mathcal{S}^{(n)}$, and the question asks for the most likely co\-occurring diagnosis\. This task measures whether the model captures clinically plausible co\-morbidity patterns and diagnostic co\-occurrence structure within a single encounter\.

\(II\) Treatment decision \(in\-encounter treatment selection\)\. This task models encounter\-level treatment selection \(referred to as “treatment decision” in this study for brevity\)\. Given encounter $n$, a scenario diagnosis set is constructed as

$$\mathcal{S}^{(n)}\subseteq\mathcal{D}^{(n)}\tag{9}$$

and the model is required to infer a target treatment $t^{(n)}_{\mathrm{tgt}}$ prescribed or performed during the same encounter:

$$t^{(n)}_{\mathrm{tgt}}\;\sim\;p\!\left(t\;\middle|\;\mathcal{S}^{(n)}\right).\tag{10}$$

Here, the scenario description $S_j$ verbalizes $\mathcal{S}^{(n)}$, and the question asks the model to select an appropriate treatment from $\mathcal{T}^{(n)}$ \(i\.e\., a prescription or a procedure\)\.

\(III\) Prognosis decision \(next\-encounter outcome prediction\)\. This task evaluates longitudinal reasoning over consecutive encounters to anticipate future diagnoses \(referred to as “prognosis decision” in this study for brevity\)\. Given two consecutive encounters $n$ and $n+1$ for the same patient, a scenario event set is constructed as

$$\mathcal{S}^{(n)}\subseteq\mathcal{D}^{(n)}\cup\mathcal{T}^{(n)}\tag{11}$$

from encounter $n$, and the model is required to predict a target diagnosis $d^{(n+1)}_{\mathrm{tgt}}$ in the subsequent encounter:

$$d^{(n+1)}_{\mathrm{tgt}}\;\sim\;p\!\left(d\;\middle|\;\mathcal{S}^{(n)}\right).\tag{12}$$

Here $\mathcal{D}^{(n)}$ and $\mathcal{T}^{(n)}$ denote the diagnoses and treatments \(including procedures and prescriptions\) observed at encounter $n$\. For each QA item $j$, the scenario description $S_j$ is a natural\-language rendering of $\mathcal{S}^{(n)}$, and the question asks for a diagnosis that appears at encounter $n+1$\. This task evaluates whether the model captures disease progression and treatment\-related effects over time under partial observation\.

Overall, the resulting benchmark is designed to systematically evaluate LLMs’ ability to perform clinically grounded reasoning and decision\-making over structured, longitudinal EHR data under partial observation\. Details of the evaluation protocol are summarized in Appendix[C\.7](https://arxiv.org/html/2605.30637#A3.SS7)\.
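The three task definitions differ only in which event pool the scenario is drawn from and where the target comes from. The sketch below illustrates the set constraints in Eqs\. \(7\) to \(12\) under a simplifying assumption: targets and scenario subsets are chosen by random sampling, whereas the paper selects them through LLM\-based relation extraction \(Section 3\.3\)\. The function names and the subset size `k` are hypothetical\.

```python
import random


def diagnosis_item(diagnoses, k, rng):
    """Withhold a target diagnosis and draw S from the remaining ones (Eqs. 7-8)."""
    target = rng.choice(sorted(diagnoses))
    pool = sorted(set(diagnoses) - {target})
    scenario = set(rng.sample(pool, min(k, len(pool))))
    return scenario, target


def treatment_item(diagnoses, treatments, k, rng):
    """Draw S from D(n); the target is a treatment from T(n) (Eqs. 9-10)."""
    pool = sorted(diagnoses)
    scenario = set(rng.sample(pool, min(k, len(pool))))
    return scenario, rng.choice(sorted(treatments))


def prognosis_item(diagnoses, treatments, next_diagnoses, k, rng):
    """Draw S from D(n) and T(n); the target is a diagnosis at n+1 (Eqs. 11-12)."""
    pool = sorted(set(diagnoses) | set(treatments))
    scenario = set(rng.sample(pool, min(k, len(pool))))
    return scenario, rng.choice(sorted(next_diagnoses))
```

In the diagnosis task the withheld target is guaranteed to be absent from the scenario, which is what makes the item a completion problem rather than a lookup\.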

### 3\.2\.Data Collection & Preprocessing

Our benchmark utilizes structured EHR trajectories from three real\-world sources: MIMIC\-III, MIMIC\-IV, and PROMOTE\. MIMIC\-III \(Version 1\.4\) is a widely\-used, publicly available critical\-care dataset from intensive care units at Beth Israel Deaconess Medical Center between 2001 and 2012 \(Johnson et al\., [2016](https://arxiv.org/html/2605.30637#bib.bib25)\)\. MIMIC\-IV \(Version 3\.1\) is a newer release that extends MIMIC\-III with updated hospital data from the same institution \(2008–2022\) \(Johnson et al\., [2023](https://arxiv.org/html/2605.30637#bib.bib71)\)\. To further reduce potential contamination from public corpora and evaluate language models in a setting less prone to data leakage, we additionally include PROMOTE, a private dataset from Emory Healthcare spanning 2012–2021 \(Xie et al\., [2025c](https://arxiv.org/html/2605.30637#bib.bib72); Wu et al\., [2025c](https://arxiv.org/html/2605.30637#bib.bib73)\)\.

Across all sources, we treat an inpatient stay as the basic encounter unit\. For each encounter, we extract billing\-code\-derived clinical events, including diagnoses and treatments, where treatments cover both medical procedures and medication prescriptions\. During preprocessing, we normalize heterogeneous source schemas into a unified event representation with \(i\) a standardized event type in \{diagnosis, procedure, prescription\}, \(ii\) a human\-readable event description mapped from raw clinical codes \(e\.g\., ICD\), and \(iii\) a consistent encounter timeline ordering\. We mainly use PyHealth \(Yang et al\., [2023](https://arxiv.org/html/2605.30637#bib.bib74)\) to extract information from raw EHRs and perform preprocessing\.
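The unified event representation described above can be sketched as a small record plus a normalization pass\. This is an illustration in plain Python and does not use or reproduce the PyHealth API; the `Event` fields, the raw row layout, and both lookup tables are hypothetical stand\-ins for the real source schemas and code dictionaries\.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    encounter_id: str
    order: int        # position on the encounter timeline
    event_type: str   # one of: diagnosis, procedure, prescription
    description: str  # human-readable text mapped from the raw code


# Hypothetical lookups; a real pipeline would load full code dictionaries.
TABLE_TO_TYPE = {
    "diagnoses": "diagnosis",
    "procedures": "procedure",
    "prescriptions": "prescription",
}
CODE_DESCRIPTIONS = {"E11.9": "Type 2 diabetes mellitus without complications"}


def normalize(raw_rows):
    """Map heterogeneous raw rows to Event records in timeline order."""
    events = [
        Event(
            encounter_id=row["encounter_id"],
            order=row["order"],
            event_type=TABLE_TO_TYPE[row["table"]],
            # Fall back to the raw code when no description is known.
            description=CODE_DESCRIPTIONS.get(row["code"], row["code"]),
        )
        for row in raw_rows
    ]
    return sorted(events, key=lambda e: (e.encounter_id, e.order))
```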

From the preprocessed EHRs, we define an *EHR instance* as the minimal input unit presented to the LLM\. For the diagnosis and treatment tasks, each instance $u^{(i)}$ corresponds to a single encounter $\mathcal{E}^{(n)}$\. For the prognosis task, each instance $u^{(i)}$ corresponds to a pair of consecutive encounters $(\mathcal{E}^{(n)},\mathcal{E}^{(n+1)})$ from the same patient\. Detailed cohort statistics are summarized in Appendix [B](https://arxiv.org/html/2605.30637#A2)\.
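As a rough illustration of the two instance types, the sketch below enumerates single\-encounter instances and consecutive\-encounter pairs per patient\. The input layout and the `admit_order` key are assumptions made for the example\.

```python
def build_instances(encounters_by_patient):
    """Return (single-encounter instances, consecutive-encounter pairs)."""
    singles, pairs = [], []
    for encounters in encounters_by_patient.values():
        ordered = sorted(encounters, key=lambda e: e["admit_order"])
        singles.extend(ordered)                  # diagnosis and treatment tasks
        pairs.extend(zip(ordered, ordered[1:]))  # prognosis task: (n, n+1)
    return singles, pairs
```

A patient with a single inpatient stay therefore contributes to the diagnosis and treatment tasks but yields no prognosis instance\.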

### 3\.3\.Template Generation

After preprocessing the EHR cohort, structured templates $\mathcal{P}$ are constructed from $\mathcal{E}$\. Each template $P_k$ specifies a clinically grounded blueprint that can be deterministically instantiated into QA items, following the formulation in Section [3\.1](https://arxiv.org/html/2605.30637#S3.SS1)\. Specifically, $P_k$ includes a template context $C_k$, a target clinical relation $R_k=(x_k,r_k,y_k)$, and latent attributes $A_k$ that support downstream generation\. Template construction is implemented through a multi\-stage interaction among EHR data, a medical LLM, and a biomedical knowledge base\.

Stage 1 \(Relation extraction: EHR → LLM → KB\)\. For each EHR input instance $u^{(i)}$, an instruction\-fine\-tuned medical LLM \(specifically, HuatuoGPT\-o1\-8B \(Chen et al\., [2024](https://arxiv.org/html/2605.30637#bib.bib80)\)\) is prompted to extract clinically salient relations from the patient record under strict JSON output constraints\. The objective is to capture implicit clinical logic encoded in structured EHR data\. Extracted relations are deduplicated and assigned unique identifiers, producing candidate relation triplets $R_k=(x_k,r_k,y_k)$ that later QA items target, such as \(Hyperglycemia, Treat\-with, Insulin\), together with an associated rationale\. The LLM is also prompted to extract a small set of auxiliary context events, such as Tracheostomy and Hemothorax, which are aggregated into $C_k$\. These auxiliary events are constrained to have no lexical or semantic overlap with any entity in $R_k$, meaning they cannot directly repeat or paraphrase entities appearing in the target relation triplet\. Outputs of this stage are passed to the KB for verification\. More details about the extraction of clinical relations and context from raw EHR data are presented in Appendix [C\.1](https://arxiv.org/html/2605.30637#A3.SS1)\.

Stage 2 \(Relation verification and enrichment: KB → LLM\)\. In this stage, relations extracted in Stage 1 are validated and enriched through external biomedical evidence by querying a composite KB that integrates UMLS \(Bodenreider, [2004](https://arxiv.org/html/2605.30637#bib.bib76)\), SemMedDB \(Kilicoglu et al\., [2012](https://arxiv.org/html/2605.30637#bib.bib75)\), DrugBank \(Wishart et al\., [2008](https://arxiv.org/html/2605.30637#bib.bib77)\), PubMed \(Canese and Weis, [2013](https://arxiv.org/html/2605.30637#bib.bib78)\), and ICD \(O’malley et al\., [2005](https://arxiv.org/html/2605.30637#bib.bib79)\)\. UMLS provides standardized concept identifiers \(CUIs\) and textual definitions across biomedical vocabularies \(Bodenreider, [2004](https://arxiv.org/html/2605.30637#bib.bib76)\)\. SemMedDB provides semantic relations \(e\.g\., Cause and Treat\-with\) extracted from PubMed abstracts, thereby offering literature\-supported evidence for relations between biomedical concepts \(Kilicoglu et al\., [2012](https://arxiv.org/html/2605.30637#bib.bib75)\)\. Event strings extracted from EHRs are first resolved to standardized concepts through the UMLS API, mapping each entity to a UMLS CUI with source vocabularies such as ICD and DrugBank\. After concept linking, evidence for relations is retrieved through SemMedDB\.

Given a clinical relation $R_k=(x_k,r_k,y_k)$ extracted by the LLM from a patient EHR record, we verify its validity by checking for supporting evidence in SemMedDB, which contains KB relation triplets automatically extracted from PubMed abstracts\. A candidate relation $R_k$ is retained only if it satisfies all three criteria:

- •Positive support: SemMedDB contains evidence supporting the relation, such as $(x_k,\textit{Cause},y_k)$ or $(x_k,\textit{Treat-with},y_k)$\.
- •No negative evidence: SemMedDB does not contain contradictory relations, such as $(x_k,\textit{Neg-cause},y_k)$\.
- •No conflicting background evidence: No contradictory relation is found with respect to a predefined set of background concepts $C_k$, such as $(C_k,\textit{Neg-cause},y_k)$\.

These checks ensure that each retained relation reflects a clinically valid association supported by biomedical knowledge, thereby reducing the risk of hallucinated relations introduced by LLMs\.
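The three retention criteria can be expressed as a single predicate over a set of KB triples\. The sketch below is a simplified stand\-in: the KB is an in\-memory set rather than SemMedDB, the predicate strings follow the paper's notation rather than the actual SemMedDB schema, and `NEGATIVE_PREDICATES` and `is_verified` are hypothetical names\.

```python
# Assumed set of contradictory predicates, written in the paper's notation.
NEGATIVE_PREDICATES = {"Neg-cause", "Neg-treat-with"}


def is_verified(relation, context, kb):
    """Return True only if the relation passes all three checks.

    relation: (x_k, r_k, y_k); context: iterable of background concepts C_k;
    kb: set of (subject, predicate, object) triples.
    """
    x, r, y = relation
    # 1. Positive support: the triple itself is present in the KB.
    has_support = (x, r, y) in kb
    # 2. No negative evidence between the subject and the object.
    has_negative = any((x, p, y) in kb for p in NEGATIVE_PREDICATES)
    # 3. No conflicting evidence between any background concept and the object.
    has_conflict = any(
        (c, p, y) in kb for c in context for p in NEGATIVE_PREDICATES
    )
    return has_support and not has_negative and not has_conflict
```

A relation with no KB entry at all is rejected by the first check, so absence of evidence is treated the same as a failed verification\.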

To support downstream QA generation, entity definitions are retrieved from UMLS, and evidence sentences are retrieved from PubMed through SemMedDB\. The verified relations, together with their associated definitions and evidence, are stored in structured templates $P_k$ and used in subsequent QA construction\. More details about how to use the KB are provided in Appendix [C\.2](https://arxiv.org/html/2605.30637#A3.SS2)\.

Stage 3 \(Template completion: LLM → KB\)\. In this stage, each verified template $P_k$ is completed by prompting the LLM to generate additional structured attributes under a strict JSON schema\. For each verified relation $R_k=(x_k,r_k,y_k)$, the LLM produces: \(i\) a set of distractor candidates that compete with the object entity $y_k$ in the relation, e\.g\., Magnesium sulfate and Furosemide as distractors for Insulin; these candidates are generated by the LLM or sampled from EHR data of other patients to capture both model knowledge and real\-world patterns; \(ii\) a high\-level clinical condition topic associated with the relation, e\.g\., Hyperglycemia; and \(iii\) a concise rationale that summarizes the relation together with KB\-retrieved evidence, e\.g\., “The prescription of Insulin is likely due to the need to treat Hyperglycemia, as Insulin effectively regulates blood sugar levels\.” The resulting attributes are stored in the updated template $P_k$ and used for downstream QA generation\. Additional details of the generated templates are provided in Appendix [C\.3](https://arxiv.org/html/2605.30637#A3.SS3)\.
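Enforcing a strict JSON schema on the completion output typically means parsing the raw response and rejecting anything with missing or mistyped fields\. The sketch below shows one way this could look; the field names `distractors`, `topic`, and `rationale` are assumptions based on the three attributes listed above, not the schema actually used in the paper\.

```python
import json

# Hypothetical schema: field name -> expected Python type.
REQUIRED_FIELDS = {"distractors": list, "topic": str, "rationale": str}


def parse_completion(raw: str) -> dict:
    """Parse an LLM response and check it against the assumed schema."""
    data = json.loads(raw)  # raises a ValueError subclass on invalid JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for key, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(key), expected_type):
            raise ValueError(f"missing or malformed field: {key}")
    return data
```

Responses that fail validation could be regenerated or dropped; the paper does not specify which\.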

Stage 4 \(Template filtering: KB → Template Output\)\. In the final stage, unqualified or misleading distractors within templates are removed via KB verification to preserve an unambiguous answer set\. Following the procedure from Stage 2, each distractor term is resolved to a UMLS CUI, and SemMedDB is queried for supporting predicate evidence\. A distractor is filtered out if it forms any clinically supported relation that would render it a plausible correct answer given the question context\. Specifically, a distractor is removed if:

- •SemMedDB provides positive evidence linking the distractor to the subject entity $x_k$ under a compatible predicate, such as $(x_k,\text{Cause},\text{distractor})$\.
- •SemMedDB provides positive evidence linking the distractor to any auxiliary context event in $C_k$ under a compatible predicate, such as $(C_k,\text{Cause},\text{distractor})$\.

After filtering, between three and five distractors are retained per template; templates failing to meet the minimum distractor count are discarded to ensure QA quality\.
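The filtering rule and the retention bounds can be sketched as follows\. The KB is again a toy set of triples rather than SemMedDB, `compatible` is an assumed set of predicate names, and truncating to the first `max_keep` survivors is an assumption, since the text does not say how the upper bound is enforced\.

```python
def filter_distractors(distractors, subject, context, kb, compatible,
                       min_keep=3, max_keep=5):
    """Drop distractors that the KB links to the subject or to any context event.

    Returns the retained distractors, or None when fewer than min_keep
    survive (the template would then be discarded).
    """
    kept = []
    for d in distractors:
        linked_to_subject = any((subject, p, d) in kb for p in compatible)
        linked_to_context = any(
            (c, p, d) in kb for c in context for p in compatible
        )
        if not (linked_to_subject or linked_to_context):
            kept.append(d)
    if len(kept) < min_keep:
        return None
    return kept[:max_keep]
```

The bounds of three and five line up with the option counts used later: a $c$\-choice question needs $c-1$ distractors, so three to five distractors support 4\-, 5\-, and 6\-choice questions\.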

### 3\.4\.QA Generation

Each template $P_k$ provides \(i\) a context $C_k$, \(ii\) a verified clinical relation $R_k=(x_k,r_k,y_k)$, and \(iii\) latent attributes $A_k$, including entity definitions, supporting evidence or rationale, candidate distractors, and an associated clinical topic\. The templates are provided to an LLM to instantiate multiple types of QA items\.

For each QA item $I_j$, an event\-complete scenario $S_j$ is constructed by augmenting the template context with the relation subject entity, i\.e\., $S_j\leftarrow C_k\cup\{x_k\}$\. This design grounds $S_j$ in observed encounter\-level clinical events while explicitly tying the scenario to the verified relation, thereby providing faithful background information for question construction\.

Within each template, multiple\-choice questions \(MCQs\) are instantiated with an option count $c\in\{4,5,6\}$\. For each task, a task\-specific question skeleton is used to construct $Q_j$; for example, in the prognosis task, the question is phrased as *“Given the prior clinical history summarized above, what diagnosis may occur at the next encounter?”* During evaluation, the tested LLM receives $(S_j,Q_j)$ as input\. For a given $c$, one correct answer is designated and the remaining $c-1$ options are filled with distractors retrieved from the template\. The explanation is taken from the template rationale, generated by integrating real\-world EHR patterns, internal LLM knowledge, and KB\-retrieved evidence\. To increase diversity, each question is paraphrased and each choice set is permuted to create multiple MCQ versions\. Additional details of the question skeleton and question paraphrasing are provided in Appendix [C\.4](https://arxiv.org/html/2605.30637#A3.SS4)\.
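Option assembly for a given $c$ amounts to combining the correct answer with $c-1$ distractors and permuting the result\. The following is a minimal sketch with a hypothetical `build_mcq` helper; sampling distractors at random is an assumption, as the paper does not describe how they are chosen when more than $c-1$ are available\.

```python
import random


def build_mcq(answer, distractors, c, rng):
    """Return (options, index of the correct option) for a c-choice question."""
    if len(distractors) < c - 1:
        raise ValueError("not enough distractors for this option count")
    options = [answer] + rng.sample(distractors, c - 1)
    rng.shuffle(options)  # permute the choice set
    return options, options.index(answer)
```

Calling the function with different seeded `random.Random` instances produces different permutations of the same choice set, which mirrors the idea of multiple MCQ versions per template\.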

Open\-ended questions \(OEQs\) are also constructed to elicit a free\-text response and a corresponding explanation aligned with the target clinical relation $R_k=(x_k,r_k,y_k)$\. These questions follow the same three clinical decision tasks and task\-specific skeletons to form $Q_j$; for example, an OEQ is phrased as *“Given the prior clinical history summarized above, what event may lead to acute kidney failure at the next encounter? Why?”* The gold\-standard answer is defined as the rationale generated during template completion by integrating real\-world EHR patterns, internal model knowledge, and KB\-retrieved evidence\.

For each template $P_k$, we instantiate four paraphrased versions of the 4\-choice MCQ, five versions of the 5\-choice MCQ, six versions of the 6\-choice MCQ, and one OEQ\. As a result, each template yields at most 16 QA items \($4+5+6+1$\) across four question types, enabling evaluation under different response constraints\. Finally, the generated benchmark contains 960,067 QA items\. QA statistics are provided in Appendix [B](https://arxiv.org/html/2605.30637#A2)\.

## 4\.Experiments

### 4\.1\.Benchmarking LLMs on EHRBench Across Clinical Decision Tasks

In our main experiments, we evaluate a comprehensive set of 31 representative LLMs on the constructed benchmark dataset\. Details of all LLMs used in this study are provided in Appendix [J](https://arxiv.org/html/2605.30637#A10)\. The LLM used for EHRBench generation \(HuatuoGPT\-o1\-8B\) is excluded from evaluation to avoid bias\. The evaluated models fall into three primary groups:

1. \(a\) Open source general LLMs that serve as critical performance baselines and widely accessible tools for comparative analysis: \(a\.1\) glm4\-9b and glm4\-32b \(Glm et al\., [2024](https://arxiv.org/html/2605.30637#bib.bib90)\); \(a\.2\) llama3\-8b, llama3\-70b, llama3\.1\-8b, llama3\.2\-3b, llama3\.3\-70b \(Grattafiori et al\., [2024](https://arxiv.org/html/2605.30637#bib.bib91)\); \(a\.3\) mistral\-7b, mistral\-small3\-24b \(Jiang et al\., [2023](https://arxiv.org/html/2605.30637#bib.bib88)\), and ministral\-8b \(Liu et al\., [2026](https://arxiv.org/html/2605.30637#bib.bib89)\); \(a\.4\) qwen2\.5\-3b, qwen2\.5\-7b and qwen2\.5\-32b \(Yang et al\., [2024](https://arxiv.org/html/2605.30637#bib.bib85)\); \(a\.5\) qwen3\-4b, qwen3\-8b, and qwen3\-32b \(Yang et al\., [2025](https://arxiv.org/html/2605.30637#bib.bib86)\); \(a\.6\) smollm3\-3b \(Bakouch et al\., [2025](https://arxiv.org/html/2605.30637#bib.bib98)\); \(a\.7\) yi\-1\.5\-9b and yi\-1\.5\-34b \(Young et al\., [2024](https://arxiv.org/html/2605.30637#bib.bib99)\)\.
2. \(b\) Medical LLMs that are pretrained on healthcare corpora and specialized for tackling medical tasks: \(b\.1\) doctor\-r1\-8b \(Lai et al\., [2025](https://arxiv.org/html/2605.30637#bib.bib82)\); \(b\.2\) med42\-8b \(Christophe et al\., [2024](https://arxiv.org/html/2605.30637#bib.bib95)\); \(b\.3\) ultramedical\-8b \(Zhang et al\., [2024a](https://arxiv.org/html/2605.30637#bib.bib96)\); \(b\.4\) m1\-7b\-23k and m1\-32b\-1k \(Huang et al\., [2025b](https://arxiv.org/html/2605.30637#bib.bib81)\)\.
3. \(c\) HIPAA compliant API\-based LLMs that ensure the secure processing of protected health information: \(c\.1\) gpt\-4\.1\-nano, gpt\-4\.1\-mini, and gpt\-4\.1 \(Achiam et al\., [2023](https://arxiv.org/html/2605.30637#bib.bib93)\); \(c\.2\) gpt\-5\-nano, gpt\-5\-mini, gpt\-5, gpt\-5\.2 \(Singh et al\., [2025](https://arxiv.org/html/2605.30637#bib.bib94)\)\.

Table 1\. Benchmarking LLMs on EHRBench across tasks, data sources, and question types\. We use abbreviations Dx/Tx/Px for diagnosis/treatment/prognosis decision task, MIII/MIV/PRO for MIMIC\-III/MIMIC\-IV/PROMOTE, and 4C/5C/6C for 4/5/6\-choice MCQs\. Columns are grouped as overall accuracy, rank statistics \(Avg and SD\), task accuracy \(Dx, Tx, Px\), source accuracy \(MIII, MIV, PRO\), and type accuracy \(4C, 5C, 6C\)\. Within each column, we mark the top\-8 results \(ranked first 25%\) with rank superscripts $^{\#1}$ to $^{\#8}$\.

| Model | Overall Acc (%) ↑ | Rank Avg ↓ | Rank SD | Dx (%) | Tx (%) | Px (%) | MIII (%) | MIV (%) | PRO (%) | 4C (%) | 5C (%) | 6C (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| *Open source general LLMs* | | | | | | | | | | | | |
| glm4-9b | 59.62 | 16.70 | 2.52 | 58.36 | 72.61 | 47.90 | 61.92 | 61.11 | 56.41 | 64.76 | 59.47 | 54.64 |
| glm4-32b | 66.12$^{\#7}$ | 7.05$^{\#7}$ | 2.44 | 67.09$^{\#6}$ | 77.90$^{\#6}$ | 53.36 | 66.27 | 66.45$^{\#8}$ | 65.85$^{\#4}$ | 70.81$^{\#8}$ | 65.89$^{\#8}$ | 61.66$^{\#7}$ |
| llama3-8b | 48.90 | 24.23 | 1.13 | 44.61 | 63.44 | 38.63 | 51.19 | 48.60 | 47.74 | 54.87 | 48.00 | 43.82 |
| llama3-70b | 63.35 | 10.72 | 3.38 | 62.21 | 77.63$^{\#7}$ | 50.20 | 65.39 | 63.72 | 61.20 | 68.41 | 63.19 | 58.45 |
| llama3.1-8b | 56.76 | 19.41 | 3.74 | 53.82 | 73.45 | 43.02 | 57.50 | 56.77 | 55.79 | 62.67 | 56.66 | 50.97 |
| llama3.2-3b | 49.85 | 23.67 | 1.22 | 43.18 | 65.41 | 40.96 | 50.16 | 51.20 | 48.46 | 55.79 | 48.78 | 44.99 |
| llama3.3-70b | 67.28$^{\#4}$ | 5.23$^{\#4}$ | 2.08 | 68.35$^{\#3}$ | 79.05$^{\#4}$ | 54.44$^{\#7}$ | 68.74$^{\#5}$ | 67.94$^{\#4}$ | 65.35$^{\#6}$ | 71.98$^{\#4}$ | 67.07$^{\#4}$ | 62.79$^{\#4}$ |
| mistral-7b | 38.23 | 28.04 | 2.13 | 36.59 | 41.90 | 36.21 | 38.54 | 38.25 | 37.85 | 40.16 | 38.56 | 35.98 |
| ministral-8b | 56.48 | 20.32 | 1.64 | 53.11 | 71.97 | 44.37 | 58.54 | 57.39 | 54.31 | 61.84 | 55.79 | 51.82 |
| mistral-small3-24b | 65.01 | 9.52 | 4.49 | 66.20 | 75.02 | 53.81$^{\#8}$ | 67.00$^{\#8}$ | 65.36 | 63.38 | 69.41 | 64.19 | 61.42 |
| qwen2.5-3b | 37.87 | 28.98 | 1.07 | 34.94 | 47.75 | 30.93 | 39.72 | 39.21 | 35.23 | 47.93 | 36.09 | 29.59 |
| qwen2.5-7b | 57.74 | 18.99 | 2.30 | 56.27 | 72.04 | 44.91 | 59.76 | 59.25 | 54.98 | 62.22 | 57.47 | 53.52 |
| qwen2.5-32b | 64.97 | 8.81 | 2.95 | 66.48$^{\#7}$ | 76.87 | 51.54 | 65.22 | 67.00$^{\#7}$ | 63.05 | 69.80 | 64.50 | 60.59 |
| qwen3-4b | 60.63 | 14.99 | 2.77 | 59.67 | 73.46 | 48.76 | 62.01 | 61.73 | 58.39 | 66.37 | 60.24 | 55.27 |
| qwen3-8b | 60.87 | 14.27 | 2.84 | 58.19 | 74.49 | 49.93 | 62.26 | 62.11 | 58.74 | 67.09 | 60.21 | 55.31 |
| qwen3-32b | 66.78$^{\#6}$ | 6.54$^{\#6}$ | 2.14 | 67.97$^{\#5}$ | 77.34$^{\#8}$ | 55.04$^{\#6}$ | 68.48$^{\#7}$ | 67.18$^{\#6}$ | 65.55$^{\#5}$ | 71.33$^{\#6}$ | 66.47$^{\#6}$ | 62.55$^{\#5}$ |
| smollm3-3b | 45.82 | 25.84 | 1.34 | 40.79 | 58.29 | 38.39 | 46.31 | 46.08 | 45.04 | 51.01 | 45.09 | 41.37 |
| yi-1.5-9b | 45.51 | 25.88 | 1.37 | 41.97 | 57.52 | 37.05 | 46.87 | 45.50 | 44.84 | 53.15 | 45.64 | 37.74 |
| yi-1.5-34b | 58.94 | 17.65 | 2.65 | 56.70 | 72.25 | 47.86 | 60.72 | 60.07 | 56.56 | 64.48 | 58.61 | 53.71 |
| *Medical LLMs* | | | | | | | | | | | | |
| doctor-r1-8b | 61.07 | 14.06 | 2.32 | 58.74 | 74.49 | 49.98 | 61.94 | 62.26 | 59.01 | 67.07 | 60.56 | 55.57 |
| med42-8b | 36.48 | 29.31 | 1.04 | 33.48 | 45.54 | 30.41 | 36.88 | 35.00 | 37.45 | 38.69 | 38.41 | 32.34 |
| ultramedical-8b | 29.02 | 30.60 | 0.69 | 19.14 | 43.09 | 24.83 | 31.17 | 29.56 | 27.99 | 38.76 | 28.26 | 20.05 |
| m1-7b-23k | 46.08 | 26.01 | 1.82 | 38.42 | 63.07 | 36.74 | 46.50 | 46.99 | 45.14 | 50.03 | 45.75 | 42.45 |
| m1-32b-1k | 63.21 | 11.47 | 3.64 | 63.07 | 74.84 | 51.73 | 62.87 | 65.49 | 61.46 | 68.49 | 62.80 | 58.35 |
| *HIPAA compliant API-based LLMs* | | | | | | | | | | | | |
| gpt-4.1-nano | 60.48 | 15.09 | 2.28 | 58.02 | 74.03 | 49.39 | 61.59 | 61.89 | 58.28 | 65.42 | 60.39 | 55.63 |
| gpt-4.1-mini | 66.79$^{\#5}$ | 6.28$^{\#5}$ | 1.91 | 66.41$^{\#8}$ | 77.90$^{\#5}$ | 56.05$^{\#5}$ | 68.58$^{\#6}$ | 67.36$^{\#5}$ | 64.45$^{\#7}$ | 71.34$^{\#5}$ | 66.76$^{\#5}$ | 62.26$^{\#6}$ |
| gpt-4.1 | 69.43$^{\#2}$ | 2.51$^{\#2}$ | 1.32 | 69.87$^{\#2}$ | 80.10$^{\#3}$ | 58.33$^{\#3}$ | 70.59$^{\#2}$ | 69.77$^{\#2}$ | 67.87$^{\#3}$ | 73.97$^{\#2}$ | 69.21$^{\#2}$ | 65.11$^{\#2}$ |
| gpt-5-nano | 57.80 | 19.31 | 1.93 | 56.39 | 70.86 | 46.16 | 58.64 | 58.72 | 55.68 | 63.27 | 57.86 | 52.29 |
| gpt-5-mini | 66.12$^{\#8}$ | 7.84$^{\#8}$ | 3.48 | 65.40 | 76.17 | 56.79$^{\#4}$ | 69.08$^{\#4}$ | 66.05 | 63.88$^{\#8}$ | 70.87$^{\#7}$ | 65.98$^{\#7}$ | 61.51$^{\#8}$ |
| gpt-5 | 69.06$^{\#3}$ | 3.21$^{\#3}$ | 1.86 | 68.26$^{\#4}$ | 80.45$^{\#1}$ | 58.46$^{\#2}$ | 70.16$^{\#3}$ | 69.18$^{\#3}$ | 68.26$^{\#2}$ | 73.64$^{\#3}$ | 68.92$^{\#3}$ | 64.61$^{\#3}$ |
| gpt-5.2 | 70.91$^{\#1}$ | 1.69$^{\#1}$ | 1.10 | 72.02$^{\#1}$ | 80.13$^{\#2}$ | 60.59$^{\#1}$ | 71.50$^{\#1}$ | 71.06$^{\#1}$ | 70.70$^{\#1}$ | 75.40$^{\#1}$ | 70.53$^{\#1}$ | 66.81$^{\#1}$ |

We benchmark 31 representative LLMs on EHRBench to assess their clinical decision\-making capability under a unified inference protocol\. Specifically, we evaluate three core tasks \(Diagnosis/Treatment/Prognosis\), three data sources \(MIMIC\-III/MIMIC\-IV/PROMOTE\), and three multiple\-choice settings \(4\-choice/5\-choice/6\-choice\)\. We report accuracy at multiple granularities \(task\-level, source\-level, and type\-level\), and additionally compute each model’s overall accuracy and its rank statistics \(mean and standard deviation\) across settings, where per\-setting ranks are obtained by sorting models by accuracy within each setting and then aggregating ranks across all settings\. Further details regarding the experiment protocol, including batching, deterministic decoding, hardware, and the fixed evaluation subset, are provided in Appendix [E\.1](https://arxiv.org/html/2605.30637#A5.SS1)\. The aggregated accuracy results are summarized in Table [1](https://arxiv.org/html/2605.30637#S4.T1)\. Additional results for cost analysis and error analysis are provided in Appendix [E\.2](https://arxiv.org/html/2605.30637#A5.SS2) and [E\.3](https://arxiv.org/html/2605.30637#A5.SS3)\.
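The rank statistics described above can be computed as in the sketch below\. It assumes ordinal ranks with no special tie handling and uses the population standard deviation; the paper does not state either choice, nor exactly which settings are aggregated, so `rank_stats` and its input layout are illustrative only\.

```python
from statistics import mean, pstdev


def rank_stats(acc_by_setting):
    """Map each model to (mean rank, rank SD) across settings.

    acc_by_setting: {setting name: {model name: accuracy}}.
    Rank 1 is the most accurate model within a setting.
    """
    ranks = {}
    for accuracies in acc_by_setting.values():
        ordered = sorted(accuracies, key=accuracies.get, reverse=True)
        for position, model in enumerate(ordered, start=1):
            ranks.setdefault(model, []).append(position)
    return {model: (mean(r), pstdev(r)) for model, r in ranks.items()}


# Made-up accuracies for three placeholder models, not results from the paper.
toy = {
    "setting_1": {"model_a": 0.9, "model_b": 0.8, "model_c": 0.7},
    "setting_2": {"model_a": 0.8, "model_b": 0.9, "model_c": 0.7},
}
stats = rank_stats(toy)
```

A low rank SD, as reported for gpt\-5\.2, means a model occupies nearly the same position in every setting rather than excelling in some and lagging in others\.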

Overall, the benchmark results are broadly consistent with established model capability trends, with the highest\-ranked systems corresponding to the most capable and recently released models in our evaluation set, validating the construction pipeline of EHRBench\. Specifically, gpt\-5\.2, gpt\-4\.1, gpt\-5, llama3\.3\-70b, gpt\-4\.1\-mini, qwen3\-32b, glm4\-32b, and gpt\-5\-mini emerge as the strongest performers\. Among them, gpt\-5\.2 achieves the highest overall accuracy of 70\.91% with the best average rank of 1\.69 and a low rank standard deviation of 1\.10, indicating stable performance across tasks, sources, and question types\. Meanwhile, the leading open\-source models remain highly competitive: llama3\.3\-70b attains 67\.28% and qwen3\-32b attains 66\.78%, narrowing the gap to API\-based models to only 3\-4 absolute points\. Beyond the leaderboard, the relative ordering within model families follows expected scaling and generation trends\. For example, within the Qwen series, qwen3\-32b substantially outperforms smaller counterparts such as qwen3\-8b \(60\.87%\) and qwen3\-4b \(60\.63%\), and also improves over the previous\-generation qwen2\.5\-32b \(64\.97%\)\. Collectively, these patterns suggest that EHRBench reliably captures meaningful capability differences consistent with model capacity\.

Performance varies substantially across clinical decision tasks: treatment selection consistently yields the highest accuracy, whereas prognosis prediction is the most challenging\. Overall, the average accuracy across all models and all questions follows the ordering Tx > Dx > Px \(69\.33% > 55\.02% > 46\.67%\)\. This pattern is clinically intuitive\. Treatment selection often depends on relatively direct, well\-documented associations between medications and their indications, which are explicitly described in drug labels and consolidated in clinical practice guidelines\. In contrast, diagnosis and prognosis tasks emphasize disease\-to\-disease causal and progression relations, which are typically less explicit, more confounded by comorbidities, and harder to infer from limited encounter evidence\. Prognosis further requires anticipating conditions beyond the current visit by integrating longitudinal trajectories and subtle risk factors, making it inherently more difficult than in\-encounter decision\-making\. Despite this difficulty, diagnosis completion and especially prognosis are critical for real\-world care, underscoring the need to strengthen LLMs for longitudinal reasoning and forward\-looking clinical prediction to support clinicians\.

Dataset\-source\-level accuracies exhibit only moderate variation compared with the stronger task\- and type\-level effects\. When aggregating across all tasks and all evaluated models, the accuracies on MIMIC\-III/MIMIC\-IV/PROMOTE are 58\.26%/57\.69%/55\.45%, suggesting that performance on the two public datasets and the private dataset is broadly comparable\. The consistent trends across both public and private data support the robustness of our pipeline, indicating that EHRBench can be instantiated on heterogeneous EHR sources while yielding benchmarks that capture CDM behaviors with similar difficulty and discriminative power\.

Increasing the number of multiple\-choice options leads to a clear and consistent decline in accuracy\. The average accuracy across all models and all questions decreases from 62\.29% \(4C\) to 56\.69% \(5C\) to 52\.04% \(6C\)\. Such monotonic trends are expected under a well\-constructed multiple\-choice benchmark, as additional options increase confusability and reduce the probability of correct selection under uncertainty\. The consistent degradation across models therefore provides evidence that the EHRBench pipeline produces valid question instances whose difficulty is appropriately controlled by the number of answer choices\.
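As a reference point \(a standard property of multiple\-choice questions, not a result from the paper\), uniform random guessing over $c$ options has expected accuracy $1/c$:

$$\mathrm{Acc}_{\mathrm{chance}}(c)=\frac{1}{c}:\qquad \frac{1}{4}=25\%,\quad \frac{1}{5}=20\%,\quad \frac{1}{6}\approx 16.7\%.$$

The reported averages of 62\.29%, 56\.69%, and 52\.04% therefore stay well above the corresponding chance level at every option count\.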

When comparing medical LLMs to the general\-purpose base models they are adapted from, we do not observe consistent gains from medical fine\-tuning on EHRBench\. For example, m1\-32b\-1k achieves 63\.21% overall accuracy, close to its base model qwen2\.5\-32b at 64\.97%\. Appendix [E\.4](https://arxiv.org/html/2605.30637#A5.SS4) provides a more detailed breakdown\. Similar observations have also been reported in prior work \(Xu et al\., [2025](https://arxiv.org/html/2605.30637#bib.bib41); Dorfner et al\., [2025](https://arxiv.org/html/2605.30637#bib.bib124)\)\. These results suggest that current medical\-domain specialization still leaves important gaps for EHR\-grounded clinical decision\-making\. In EHRBench, strong performance requires reasoning over patients’ real longitudinal EHR context and answering questions that demand both biomedical knowledge and nontrivial inference, including disentangling confounded relations such as disease–disease progression and disease–treatment associations\. Improving these capabilities likely requires training signals beyond domain text exposure, such as large\-scale clinical case supervision and decision\-focused objectives, for which EHR\-grounded resources like EHRBench may provide a useful foundation\.

We also conduct additional analyses to examine whether the main experiment results are sensitive to benchmark construction choices or can be explained by shallow matching heuristics\. Specifically, Appendix[E\.6](https://arxiv.org/html/2605.30637#A5.SS6)evaluates whether the benchmark results are affected by the LLM used for QA generation, and Appendix[E\.7](https://arxiv.org/html/2605.30637#A5.SS7)studies whether changing the number of local EHR context events alters the observed performance trends\. In addition, Appendix[E\.5](https://arxiv.org/html/2605.30637#A5.SS5)compares LLMs with embedding\-based non\-LLM retrieval baselines under the same zero\-shot QA setting\. These analyses show that the main findings are robust to key construction choices and that strong performance on EHRBench cannot be reduced to simple question\-option semantic similarity matching\.

### 4\.2\.Additional Analyses and Validation

Beyond the main benchmark results in Section[4\.1](https://arxiv.org/html/2605.30637#S4.SS1), which focus on comparing representative LLMs under a unified zero\-shot multiple\-choice setting, we conduct several experiments with different settings in the Appendix to validate the reliability of the evaluation protocol and the stability of the main conclusions\.

- In Appendix [F](https://arxiv.org/html/2605.30637#A6), we separately benchmark reasoning\-oriented LLM configurations because explicit intermediate reasoning substantially increases token usage and would otherwise confound the direct model comparison in Section [4\.1](https://arxiv.org/html/2605.30637#S4.SS1)\. This controlled study characterizes the accuracy\-efficiency trade\-off under different reasoning\-effort settings, showing that additional reasoning generally improves performance while incurring higher token cost\. This trend is consistent with expected scaling behavior and further supports the validity of the EHRBench pipeline\.
- In Appendix [G](https://arxiv.org/html/2605.30637#A7), we evaluate multiple paraphrased question versions to test whether model performance is sensitive to surface\-level wording\. The consistently low variability and high prediction consistency across versions indicate that single\-version evaluation in the main experiment provides a stable proxy for underlying CDM capability\.
- In Appendix [H](https://arxiv.org/html/2605.30637#A8), we run an additional evaluation on the extended question set covering all verified QA templates to examine whether the fixed\-subset protocol introduces sampling artifacts\. The near\-identical model ranking and trend patterns suggest that the selected subset in the main experiment provides sufficient coverage for fair comparison at scale, further supporting the reliability of the generation pipeline\.
- In Appendix [I](https://arxiv.org/html/2605.30637#A9), we evaluate paraphrased open\-ended questions to probe free\-form clinical reasoning beyond multiple\-choice selection\. The performance trends remain consistent with the main benchmark, supporting the reliability of the open\-ended question pipeline\.

Collectively, these analyses show that the conclusions drawn from the main benchmark are robust to key evaluation design choices\. They also provide additional evidence that EHRBench offers a stable and reliable framework for evaluating EHR\-grounded clinical decision\-making capabilities\.

## 5\. Conclusion

In this work, we develop EHRBench via an automated and reliable pipeline based on *EHR–LLM–KB* interaction\. The pipeline \(i\) converts encounter\-level EHR trajectories into structured templates, \(ii\) deterministically instantiates these templates into large\-scale QA items with controlled variants for robust evaluation, and \(iii\) applies KB\-based verification and enrichment to improve reliability\.
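Step (ii) can be pictured with a small sketch of deterministic template instantiation. This is not the authors' code: the template schema (`id`, `question`, `slots`, `answer`, `distractors`), the hash-derived seed, and the example template are hypothetical choices that show one way to make item generation reproducible while still producing controlled variants.

```python
import hashlib
import random


def instantiate(template: dict, variant: int = 0) -> dict:
    """Turn a structured template into a multiple-choice QA item.

    The option order depends only on the template id and the variant index,
    so repeated runs produce identical items.
    """
    key = f"{template['id']}:{variant}".encode("utf-8")
    seed = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    rng = random.Random(seed)

    options = [template["answer"], *template["distractors"]]
    rng.shuffle(options)
    labels = "ABCDEFGH"[: len(options)]

    return {
        "question": template["question"].format(**template["slots"]),
        "options": dict(zip(labels, options)),
        "answer": labels[options.index(template["answer"])],
    }


# Hypothetical template, written for illustration only.
TEMPLATE = {
    "id": "tmpl-0001",
    "question": "A patient with {condition} is admitted. "
                "Which medication is most likely to be prescribed?",
    "slots": {"condition": "type 2 diabetes"},
    "answer": "metformin",
    "distractors": ["warfarin", "albuterol", "levothyroxine"],
}

ITEM = instantiate(TEMPLATE, variant=0)
```

Calling `instantiate(TEMPLATE, variant=1)` yields another reproducible variant of the same item, which may or may not present the options in a different order. In a full pipeline of the kind described above, KB-based verification would filter templates before this step.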

Under a unified inference protocol, benchmarking more than 30 representative LLMs on EHRBench yields consistent trends that validate the benchmark\. Recently released high\-capability models achieve the strongest performance, treatment selection is consistently easier than the other two tasks, dataset effects remain modest, and current medical fine\-tuning does not deliver consistent gains over the corresponding general\-purpose base models\. Additional analyses further confirm the reliability of the EHRBench construction pipeline and the evaluation protocol used in the main experiments\. Collectively, these results provide actionable insights that can inform the design and evaluation of clinically reliable LLM systems in EHR\-grounded medical decision making\.

Overall, EHRBench provides an automated and reliable benchmark with 960,067 QA items for evaluating LLM\-based clinical decision making grounded in real\-world structured EHR trajectories\. We hope that EHRBench will serve as a practical testbed to accelerate the development of clinically reliable LLM systems and to facilitate transparent and reproducible progress in EHR\-grounded medical decision making\.

## Ethical Statement

This study was conducted in full compliance with established ethical and data governance standards\. MIMIC\-III and MIMIC\-IV are publicly available credentialed datasets accessed under the PhysioNet Credentialed Data Use Agreement and all relevant data usage policies\. PROMOTE is a private dataset that was fully de\-identified prior to use, and its use was approved by the Emory Institutional Review Board \(IRB Protocol 2025P010425\)\.

All raw EHR data were processed locally on HIPAA\-compliant systems into structured templates and QA pairs and were not directly exposed to the evaluated LLMs\. The released benchmark contains only de\-identified QA items rather than raw patient records\. We additionally designed the benchmark construction pipeline to reduce information leakage through controlled context generation, KB verification, and filtering procedures\. No patient re\-identification was attempted at any stage of the study\.

## Acknowledgement

This research was partially supported by internal funds and GPU resources provided by Emory University, the U\.S\. National Science Foundation \(Awards 2442172, 2312502, and 2319449\), and the U\.S\. National Institutes of Health \(Awards K25DK135913, RF1NS139325, R01DK143456, U18DP006922, and R01HL166233\)\.

## References

- J\. Achiam, S\. Adler, S\. Agarwal, L\. Ahmad, I\. Akkaya, F\. L\. Aleman,et al\.\(2023\)Gpt\-4 technical report\.arXiv preprint arXiv:2303\.08774\.Cited by:[item \(c\)](https://arxiv.org/html/2605.30637#S4.I1.i3.p1.1)\.
- L\. Adams, F\. Busch, T\. Han, J\. Excoffier, M\. Ortala, A\. Löser, H\. J\. Aerts, J\. N\. Kather, D\. Truhn, and K\. Bressem \(2025\)Longhealth: a question answering benchmark with long clinical documents\.Journal of Healthcare Informatics Research,pp\. 1–17\.Cited by:[§2](https://arxiv.org/html/2605.30637#S2.p1.1)\.
- S\. Agarwal, L\. Ahmad, J\. Ai, S\. Altman, A\. Applebaum, E\. Arbus,et al\.\(2025\)Gpt\-oss\-120b & gpt\-oss\-20b model card\.arXiv preprint arXiv:2508\.10925\.Cited by:[Appendix F](https://arxiv.org/html/2605.30637#A6.p2.1)\.
- R\. K\. Arora, J\. Wei, R\. S\. Hicks, P\. Bowman, J\. Quiñonero\-Candela, F\. Tsimpourlas,et al\.\(2025\)Healthbench: evaluating large language models towards improved human health\.arXiv preprint arXiv:2505\.08775\.Cited by:[§2](https://arxiv.org/html/2605.30637#S2.p1.1)\.
- Y\. Artsi, V\. Sorin, E\. Konen, B\. S\. Glicksberg, G\. Nadkarni, and E\. Klang \(2024\)Large language models for generating medical examinations: systematic review\.BMC medical education24\(1\),pp\. 354\.Cited by:[§1](https://arxiv.org/html/2605.30637#S1.p4.1)\.
- S\. Bae, D\. Kyung, J\. Ryu, E\. Cho, G\. Lee, S\. Kweon,et al\.\(2023\)Ehrxqa: a multi\-modal question answering dataset for electronic health records with chest x\-ray images\.Advances in Neural Information Processing Systems36,pp\. 3867–3880\.Cited by:[§2](https://arxiv.org/html/2605.30637#S2.p1.1)\.
- E\. Bakouch, L\. B\. Allal, A\. Lozhkov, N\. Tazi, L\. Tunstall, C\. M\. Patino,et al\.\(2025\)SmolLM3: smol, multilingual, long\-context reasoner\.Hugging Face Blog\.Cited by:[item \(a\)](https://arxiv.org/html/2605.30637#S4.I1.i1.p1.1)\.
- J\. Bardhan, A\. Colas, K\. Roberts, and D\. Z\. Wang \(2022\)Drugehrqa: a question answering dataset on structured and unstructured electronic health records for medicine related queries\.arXiv preprint arXiv:2205\.01290\.Cited by:[§2](https://arxiv.org/html/2605.30637#S2.p2.1)\.
- B\. Bhasuran, Q\. Jin, Y\. Xie, C\. Yang, K\. Hanna, J\. Costa, C\. Shavor, W\. Han, Z\. Lu, and Z\. He \(2025\)Preliminary analysis of the impact of lab results on large language model generated differential diagnoses\.npj Digital Medicine8\(1\),pp\. 166\.Cited by:[§1](https://arxiv.org/html/2605.30637#S1.p2.1)\.
- O\. Bodenreider \(2004\)The unified medical language system \(umls\): integrating biomedical terminology\.Nucleic acids research32\(suppl\_1\),pp\. D267–D270\.Cited by:[§3\.3](https://arxiv.org/html/2605.30637#S3.SS3.p3.1)\.
- J\. S\. Bosma, K\. Dercksen, L\. Builtjes, R\. André, C\. Roest, S\. J\. Fransen, C\. R\. Noordman, M\. Navarro\-Padilla, J\. Lefkes, N\. Alves,et al\.\(2025\)The dragon benchmark for clinical nlp\.npj Digital Medicine8\(1\),pp\. 289\.Cited by:[§1](https://arxiv.org/html/2605.30637#S1.p3.1)\.
- K\. Canese and S\. Weis \(2013\)PubMed: the bibliographic database\.The NCBI handbook2\(1\),pp\. 2013\.Cited by:[§3\.3](https://arxiv.org/html/2605.30637#S3.SS3.p3.1)\.
- J\. Chen, Z\. Cai, K\. Ji, X\. Wang, W\. Liu, R\. Wang, J\. Hou, and B\. Wang \(2024\)Huatuogpt\-o1, towards medical complex reasoning with llms\.arXiv preprint arXiv:2412\.18925\.Cited by:[§3\.3](https://arxiv.org/html/2605.30637#S3.SS3.p2.6)\.
- C\. Chiu, S\. Pitis, and M\. van der Schaar \(2025\)Simulating viva voce examinations to evaluate clinical reasoning in large language models\.arXiv preprint arXiv:2510\.10278\.Cited by:[§2](https://arxiv.org/html/2605.30637#S2.p1.1)\.
- C\. Christophe, P\. K\. Kanithi, T\. Raha, S\. Khan, and M\. A\. Pimentel \(2024\)Med42\-v2: a suite of clinical llms\.arXiv preprint arXiv:2408\.06142\.Cited by:[item \(b\)](https://arxiv.org/html/2605.30637#S4.I1.i2.p1.1)\.
- H\. Cui, A\. Unell, B\. Chen, J\. A\. Fries, E\. Alsentzer, S\. Koyejo, and N\. H\. Shah \(2025\)Timer: temporal instruction modeling and evaluation for longitudinal clinical records\.npj Digital Medicine8\(1\),pp\. 577\.Cited by:[§2](https://arxiv.org/html/2605.30637#S2.p2.1)\.
- A\. Dada, O\. Koraş, M\. Bauer, A\. Butler, K\. Smith, J\. Kleesiek, and J\. Friedrich \(2025\)Medisumqa: patient\-oriented question\-answer generation from discharge letters\.InProceedings of the Second Workshop on Patient\-Oriented Language Processing \(CL4Health\),pp\. 124–136\.Cited by:[§1](https://arxiv.org/html/2605.30637#S1.p5.1),[§2](https://arxiv.org/html/2605.30637#S2.p1.1)\.
- W\. Dai, P\. Chen, M\. Lu, D\. Li, H\. Wei, H\. Cui, and P\. P\. Liang \(2025\)Climb: data foundations for large scale multimodal clinical foundation models\.arXiv preprint arXiv:2503\.07667\.Cited by:[§2](https://arxiv.org/html/2605.30637#S2.p1.1)\.
- F\. J\. Dorfner, A\. Dada, F\. Busch, M\. R\. Makowski, T\. Han, D\. Truhn,et al\.\(2024\)Biomedical large languages models seem not to be superior to generalist models on unseen medical data\.arXiv preprint arXiv:2408\.13833\.Cited by:[§E\.4](https://arxiv.org/html/2605.30637#A5.SS4.p4.1)\.
- F\. J\. Dorfner, A\. Dada, F\. Busch, M\. R\. Makowski, T\. Han, D\. Truhn,et al\.\(2025\)Evaluating the effectiveness of biomedical fine\-tuning for large language models on clinical tasks\.Journal of the American Medical Informatics Association32\(6\),pp\. 1015–1024\.Cited by:[§E\.4](https://arxiv.org/html/2605.30637#A5.SS4.p4.1),[§4\.1](https://arxiv.org/html/2605.30637#S4.SS1.p7.1)\.
- Y\. Fan, H\. Sun, K\. Xue, X\. Zhang, S\. Zhang, and T\. Ruan \(2025a\)Medodyssey: a medical domain benchmark for long context evaluation up to 200k tokens\.InFindings of the Association for Computational Linguistics: NAACL 2025,pp\. 32–56\.Cited by:[§2](https://arxiv.org/html/2605.30637#S2.p1.1)\.
- Z\. Fan, L\. Wei, J\. Tang, W\. Chen, W\. Siyuan, Z\. Wei, and F\. Huang \(2025b\)Ai hospital: benchmarking large language models in a multi\-agent medical interaction simulator\.InProceedings of the 31st International Conference on Computational Linguistics,pp\. 10183–10213\.Cited by:[§2](https://arxiv.org/html/2605.30637#S2.p1.1)\.
- S\. L\. Fleming, A\. Lozano, W\. J\. Haberkorn, J\. A\. Jindal, E\. Reis, R\. Thapa, L\. Blankemeier, J\. Z\. Genkins, E\. Steinberg, A\. Nayak,et al\.\(2024\)Medalign: a clinician\-generated dataset for instruction following with electronic medical records\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.38,pp\. 22021–22030\.Cited by:[§2](https://arxiv.org/html/2605.30637#S2.p1.1)\.
- Team GLM, A\. Zeng, B\. Xu, B\. Wang, C\. Zhang, D\. Yin,et al\.\(2024\)Chatglm: a family of large language models from glm\-130b to glm\-4 all tools\.arXiv preprint arXiv:2406\.12793\.Cited by:[item \(a\)](https://arxiv.org/html/2605.30637#S4.I1.i1.p1.1)\.
- L\. Gong, A\. Wang, Y\. Lai, W\. Ma, and Y\. Liu \(2025\)The dialogue that heals: a comprehensive evaluation of doctor agents’ inquiry capability\.arXiv preprint arXiv:2509\.24958\.Cited by:[§2](https://arxiv.org/html/2605.30637#S2.p1.1)\.
- A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle,et al\.\(2024\)The llama 3 herd of models\.arXiv preprint arXiv:2407\.21783\.Cited by:[item \(a\)](https://arxiv.org/html/2605.30637#S4.I1.i1.p1.1)\.
- C\. Guo, N\. Xu, Y\. Chang, and Y\. Wu \(2024\)Chbench: a chinese dataset for evaluating health in large language models\.arXiv preprint arXiv:2409\.15766\.Cited by:[§2](https://arxiv.org/html/2605.30637#S2.p1.1)\.
- K\. Han, S\. Zhao, Y\. Su, X\. Li, Y\. Yuan, L\. He, and C\. Yang \(2026\)Towards a virtual neuroscientist: autonomous neuroimaging analysis via multi\-agent collaboration\.arXiv preprint arXiv:2605\.09366\.Cited by:[§2](https://arxiv.org/html/2605.30637#S2.p1.1)\.
- T\. Han, A\. Kumar, C\. Agarwal, and H\. Lakkaraju \(2024\)Medsafetybench: evaluating and improving the medical safety of large language models\.Advances in Neural Information Processing Systems37,pp\. 33423–33454\.Cited by:[§2](https://arxiv.org/html/2605.30637#S2.p1.1)\.
- V\. Harish, F\. Morgado, A\. D\. Stern, and S\. Das \(2021\)Artificial intelligence and clinical decision making: the new nature of medical uncertainty\.Academic Medicine96\(1\),pp\. 31–36\.Cited by:[§1](https://arxiv.org/html/2605.30637#S1.p1.1)\.
- Y\. Hu, T\. Li, Q\. Lu, W\. Shao, J\. He, Y\. Qiao, and P\. Luo \(2024\)Omnimedvqa: a new large\-scale comprehensive evaluation benchmark for medical lvlm\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 22170–22183\.Cited by:[§2](https://arxiv.org/html/2605.30637#S2.p1.1)\.
- L\. Huang, W\. Yu, W\. Ma, W\. Zhong, Z\. Feng, H\. Wang,et al\.\(2025a\)A survey on hallucination in large language models: principles, taxonomy, challenges, and open questions\.ACM Transactions on Information Systems43\(2\),pp\. 1–55\.Cited by:[§1](https://arxiv.org/html/2605.30637#S1.p4.1)\.
- X\. Huang, J\. Wu, H\. Liu, X\. Tang, and Y\. Zhou \(2025b\)M1: unleash the potential of test\-time scaling for medical reasoning with large language models\.arXiv preprint arXiv:2504\.00869\.Cited by:[item \(b\)](https://arxiv.org/html/2605.30637#S4.I1.i2.p1.1)\.
- L\. Hunik, A\. Chaabouni, T\. van Laarhoven, T\. C\. O\. Hartman, R\. T\. Leijenaar, J\. W\. Cals, A\. A\. Uijen, and H\. J\. Schers \(2025\)Diagnostic prediction models for primary care, based on ai and electronic health records: systematic review\.JMIR Medical Informatics13\(1\),pp\. e62862\.Cited by:[§1](https://arxiv.org/html/2605.30637#S1.p2.1)\.
- M\. Jia, J\. Duan, Y\. Song, and J\. Wang \(2025\)Medikal: integrating knowledge graphs as assistants of llms for enhanced clinical diagnosis on emrs\.InProceedings of the 31st International Conference on Computational Linguistics,pp\. 9278–9298\.Cited by:[§1](https://arxiv.org/html/2605.30637#S1.p2.1)\.
- A\. Jiang, A\. Sablayrolles, A\. Mensch, C\. Bamford, D\. Chaplot, D\. de Las Casas,et al\.\(2023\)Mistral 7b\.arXiv preprint arXiv:2310\.06825\.Cited by:[item \(a\)](https://arxiv.org/html/2605.30637#S4.I1.i1.p1.1)\.
- D\. Jin, E\. Pan, N\. Oufattole, W\. Weng, H\. Fang, and P\. Szolovits \(2021\)What disease does this patient have? a large\-scale open domain question answering dataset from medical exams\.Applied Sciences11\(14\),pp\. 6421\.Cited by:[§1](https://arxiv.org/html/2605.30637#S1.p5.1),[§2](https://arxiv.org/html/2605.30637#S2.p1.1)\.
- A\. E\. Johnson, L\. Bulgarelli, L\. Shen, A\. Gayles, A\. Shammout, S\. Horng, T\. J\. Pollard, S\. Hao, B\. Moody, B\. Gow,et al\.\(2023\)MIMIC\-iv, a freely accessible electronic health record dataset\.Scientific data10\(1\),pp\. 1\.Cited by:[Appendix B](https://arxiv.org/html/2605.30637#A2.p1.1),[§3\.2](https://arxiv.org/html/2605.30637#S3.SS2.p1.1)\.
- A\. E\. Johnson, T\. J\. Pollard, L\. Shen, L\. H\. Lehman, M\. Feng, M\. Ghassemi, B\. Moody, P\. Szolovits, L\. Anthony Celi, and R\. G\. Mark \(2016\)MIMIC\-iii, a freely accessible critical care database\.Scientific data3\(1\),pp\. 1–9\.Cited by:[Appendix B](https://arxiv.org/html/2605.30637#A2.p1.1),[§3\.2](https://arxiv.org/html/2605.30637#S3.SS2.p1.1)\.
- S\. Johri, J\. Jeong, B\. A\. Tran, D\. I\. Schlessinger, S\. Wongvibulsin, Z\. R\. Cai, R\. Daneshjou, and P\. Rajpurkar \(2024\)CRAFT\-md: a conversational evaluation framework for comprehensive assessment of clinical llms\.InAAAI 2024 Spring Symposium on Clinical Foundation Models\.Cited by:[§2](https://arxiv.org/html/2605.30637#S2.p1.1)\.
- N\. Khandekar, Q\. Jin, G\. Xiong, S\. Dunn, S\. Applebaum, Z\. Anwar, M\. Sarfo\-Gyamfi, C\. Safranek, A\. Anwar, A\. Zhang,et al\.\(2024\)Medcalc\-bench: evaluating large language models for medical calculations\.Advances in Neural Information Processing Systems37,pp\. 84730–84745\.Cited by:[§2](https://arxiv.org/html/2605.30637#S2.p1.1)\.
- H\. Kilicoglu, D\. Shin, M\. Fiszman, G\. Rosemblat, and T\. C\. Rindflesch \(2012\)SemMedDB: a pubmed\-scale repository of biomedical semantic predications\.Bioinformatics28\(23\),pp\. 3158–3160\.Cited by:[§3\.3](https://arxiv.org/html/2605.30637#S3.SS3.p3.1)\.
- M\. K\. Kim, C\. Rouphael, J\. McMichael, N\. Welch, and S\. Dasarathy \(2023\)Challenges in and opportunities for electronic health record\-based data analysis and interpretation\.Gut and liver18\(2\),pp\. 201\.Cited by:[§1](https://arxiv.org/html/2605.30637#S1.p5.1)\.
- Y\. Kim, J\. Wu, Y\. Abdulle, and H\. Wu \(2024\)MedExQA: medical question answering benchmark with multiple explanations\.arXiv preprint arXiv:2406\.06331\.Cited by:[§1](https://arxiv.org/html/2605.30637#S1.p5.1),[§2](https://arxiv.org/html/2605.30637#S2.p1.1)\.
- R\. Knevel and K\. P\. Liao \(2023\)From real\-world electronic health record data to real\-world results using artificial intelligence\.Annals of the Rheumatic Diseases82\(3\),pp\. 306–311\.Cited by:[§1](https://arxiv.org/html/2605.30637#S1.p5.1)\.
- Y\. Kumar, A\. Koul, R\. Singla, and M\. F\. Ijaz \(2023\)Artificial intelligence in disease diagnosis: a systematic literature review, synthesizing framework and future research agenda\.Journal of ambient intelligence and humanized computing14\(7\),pp\. 8459–8486\.Cited by:[§1](https://arxiv.org/html/2605.30637#S1.p2.1)\.
- S\. Kweon, J\. Kim, H\. Kwak, D\. Cha, H\. Yoon, K\. Kim, J\. Yang, S\. Won, and E\. Choi \(2024\)Ehrnoteqa: an llm benchmark for real\-world clinical practice using discharge summaries\.Advances in Neural Information Processing Systems37,pp\. 124575–124611\.Cited by:[§1](https://arxiv.org/html/2605.30637#S1.p5.1),[§2](https://arxiv.org/html/2605.30637#S2.p1.1)\.
- Y\. Lai, K\. Liu, Z\. Wang, W\. Ma, and Y\. Liu \(2025\)Doctor\-r1: mastering clinical inquiry with experiential agentic reinforcement learning\.arXiv preprint arXiv:2510\.04284\.Cited by:[item \(b\)](https://arxiv.org/html/2605.30637#S4.I1.i2.p1.1)\.
- G\. Lee, H\. Hwang, S\. Bae, Y\. Kwon, W\. Shin, S\. Yang, M\. Seo, J\. Kim, and E\. Choi \(2022\)Ehrsql: a practical text\-to\-sql benchmark for electronic health records\.Advances in Neural Information Processing Systems35,pp\. 15589–15601\.Cited by:[§1](https://arxiv.org/html/2605.30637#S1.p6.1),[§2](https://arxiv.org/html/2605.30637#S2.p2.1)\.
- J\. Li, Y\. Lai, W\. Li, J\. Ren, M\. Zhang, X\. Kang, S\. Wang, P\. Li, Y\. Zhang, W\. Ma,et al\.\(2024a\)Agent hospital: a simulacrum of hospital with evolvable medical agents\.arXiv preprint arXiv:2405\.02957\.Cited by:[§2](https://arxiv.org/html/2605.30637#S2.p1.1)\.
- S\. Li, V\. Balachandran, S\. Feng, J\. Ilgen, E\. Pierson, P\. W\. W\. Koh, and Y\. Tsvetkov \(2024b\)Mediq: question\-asking llms and a benchmark for reliable interactive clinical reasoning\.Advances in Neural Information Processing Systems37,pp\. 28858–28888\.Cited by:[§2](https://arxiv.org/html/2605.30637#S2.p1.1)\.
- Y\. Liao, C\. Wu, J\. Liu, S\. Jiang, P\. Qiu, H\. Wang, Y\. Yue, S\. Zhen, J\. Wang, Q\. Fan,et al\.\(2025\)EHR\-r1: a reasoning\-enhanced foundational language model for electronic health record analysis\.arXiv preprint arXiv:2510\.25628\.Cited by:[§2](https://arxiv.org/html/2605.30637#S2.p2.1)\.
- A\. H\. Liu, K\. Khandelwal, S\. Subramanian, V\. Jouault, A\. Rastogi, A\. Sadé,et al\.\(2026\)Ministral 3\.arXiv preprint arXiv:2601\.08584\.Cited by:[item \(a\)](https://arxiv.org/html/2605.30637#S4.I1.i1.p1.1)\.
- F\. Liu, Z\. Li, H\. Zhou, Q\. Yin, J\. Yang, X\. Tang, C\. Luo, M\. Zeng, H\. Jiang, Y\. Gao,et al\.\(2024a\)Large language models in the clinic: a comprehensive benchmark\.arXiv preprint arXiv:2405\.00716\.Cited by:[§1](https://arxiv.org/html/2605.30637#S1.p5.1),[§2](https://arxiv.org/html/2605.30637#S2.p1.1)\.
- F\. Liu, J\. Wu, H\. Zhou, X\. Gu, S\. Molaei, A\. Thakur, L\. Clifton, H\. Wu, and D\. A\. Clifton \(2025a\)RiskAgent: autonomous medical ai copilot for generalist risk prediction\.arXiv preprint arXiv:2503\.03802\.Cited by:[§2](https://arxiv.org/html/2605.30637#S2.p1.1)\.
- J\. Liu, W\. Wang, Z\. Ma, G\. Huang, Y\. SU, K\. Chang, W\. Chen, H\. Li, L\. Shen, and M\. Lyu \(2024b\)Medchain: bridging the gap between llm agents and clinical practice through interactive sequential benchmarking\.arXiv preprint arXiv:2412\.01605\.Cited by:[§1](https://arxiv.org/html/2605.30637#S1.p5.1),[§2](https://arxiv.org/html/2605.30637#S2.p1.1)\.
- J\. Liu, W\. Wang, S\. Yihang, J\. Huang, Y\. Zhang, C\. Li, W\. Chen, X\. Xing, K\. Chang, L\. Shen,et al\.\(2025b\)Asclepius: a spectrum evaluation benchmark for medical multi\-modal large language models\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 24181–24201\.Cited by:[§2](https://arxiv.org/html/2605.30637#S2.p1.1)\.
- R\. Liu, K\. Xue, X\. Zhang, and S\. Zhang \(2025c\)Interactive evaluation for medical llms via task\-oriented dialogue system\.InProceedings of the 31st International Conference on Computational Linguistics,pp\. 4871–4896\.Cited by:[§2](https://arxiv.org/html/2605.30637#S2.p1.1)\.
- L\. Long, R\. Wang, R\. Xiao, J\. Zhao, X\. Ding, G\. Chen, and H\. Wang \(2024\)On llms\-driven synthetic data generation, curation, and evaluation: a survey\.arXiv preprint arXiv:2406\.15126\.Cited by:[§1](https://arxiv.org/html/2605.30637#S1.p4.1)\.
- M\. Lu, Y\. Xie, Z\. Bi, S\. Cao, and X\. Wang \(2025\)CROSSAGENTIE: cross\-type and cross\-task multi\-agent llm collaboration for zero\-shot information extraction\.InFindings of the Association for Computational Linguistics: ACL 2025,pp\. 13953–13977\.Cited by:[§1](https://arxiv.org/html/2605.30637#S1.p2.1)\.
- C\. Malaviya, S\. Lee, S\. Chen, E\. Sieber, M\. Yatskar, and D\. Roth \(2024\)Expertqa: expert\-curated questions and attributed answers\.InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),pp\. 3025–3045\.Cited by:[§1](https://arxiv.org/html/2605.30637#S1.p3.1),[§2](https://arxiv.org/html/2605.30637#S2.p1.1)\.
- I\. Masic \(2022\)Medical decision making\-an overview\.Acta Informatica Medica30\(3\),pp\. 230\.Cited by:[§1](https://arxiv.org/html/2605.30637#S1.p1.1)\.
- N\. Mehandru, N\. Golchini, D\. Bamman, T\. Zack, M\. F\. Molina, and A\. Alaa \(2025\)ER\-reason: a benchmark dataset for llm\-based clinical reasoning in the emergency room\.arXiv preprint arXiv:2505\.22919\.Cited by:[§1](https://arxiv.org/html/2605.30637#S1.p5.1),[§2](https://arxiv.org/html/2605.30637#S2.p1.1)\.
- B\. Molinet, S\. Marro, E\. Cabrio, and S\. Villata \(2024\)Explanatory argumentation in natural language for correct and incorrect medical diagnoses\.Journal of Biomedical Semantics15\(1\),pp\. 8\.Cited by:[§1](https://arxiv.org/html/2605.30637#S1.p2.1)\.
- C\. Niu, Y\. Wu, J\. Zhu, S\. Xu, K\. Shum, R\. Zhong, J\. Song, and T\. Zhang \(2024\)Ragtruth: a hallucination corpus for developing trustworthy retrieval\-augmented language models\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 10862–10878\.Cited by:[§1](https://arxiv.org/html/2605.30637#S1.p4.1)\.
- H\. Nori, M\. Daswani, C\. Kelly, S\. Lundberg, M\. T\. Ribeiro, M\. Wilson, X\. Liu, V\. Sounderajah, J\. Carlson, M\. P\. Lungren,et al\.\(2025\)Sequential diagnosis with language models\.arXiv preprint arXiv:2506\.22405\.Cited by:[§2](https://arxiv.org/html/2605.30637#S2.p1.1)\.
- K\. J\. O’malley, K\. F\. Cook, M\. D\. Price, K\. R\. Wildes, J\. F\. Hurdle, and C\. M\. Ashton \(2005\)Measuring diagnoses: icd code accuracy\.Health services research40\(5p2\),pp\. 1620–1639\.Cited by:[§3\.3](https://arxiv.org/html/2605.30637#S3.SS3.p3.1)\.
- D\. Oniani, X\. Wu, S\. Visweswaran, S\. Kapoor, S\. Kooragayalu, K\. Polanska, and Y\. Wang \(2024\)Enhancing large language models for clinical decision support by incorporating clinical practice guidelines\.In2024 IEEE 12th International Conference on Healthcare Informatics \(ICHI\),pp\. 694–702\.Cited by:[§1](https://arxiv.org/html/2605.30637#S1.p2.1)\.
- A\. Pal, L\. K\. Umapathi, and M\. Sankarasubbu \(2022\)Medmcqa: a large\-scale multi\-subject multi\-choice dataset for medical domain question answering\.InConference on health, inference, and learning,pp\. 248–260\.Cited by:[§1](https://arxiv.org/html/2605.30637#S1.p5.1),[§2](https://arxiv.org/html/2605.30637#S2.p1.1)\.
- A\. Pampari, P\. Raghavan, J\. Liang, and J\. Peng \(2018\)Emrqa: a large corpus for question answering on electronic medical records\.arXiv preprint arXiv:1809\.00732\.Cited by:[§2](https://arxiv.org/html/2605.30637#S2.p2.1)\.
- M\. Panagioti, K\. Khan, R\. N\. Keers, A\. Abuzour, D\. Phipps, E\. Kontopantelis, P\. Bower, S\. Campbell, R\. Haneef, A\. J\. Avery,et al\.\(2019\)Prevalence, severity, and nature of preventable patient harm across medical care settings: systematic review and meta\-analysis\.bmj366\.Cited by:[§1](https://arxiv.org/html/2605.30637#S1.p1.1)\.
- J\. Park, Y\. Cho, H\. Lee, J\. Choo, and E\. Choi \(2021\)Knowledge graph\-based question answering with electronic health records\.InMachine Learning for Healthcare Conference,pp\. 36–53\.Cited by:[§2](https://arxiv.org/html/2605.30637#S2.p2.1)\.
- T\. Pelaccia, J\. Tardif, E\. Triby, and B\. Charlin \(2017\)A novel approach to study medical decision making in the clinical setting: the “own\-point\-of\-view” perspective\.Academic Emergency Medicine24\(7\),pp\. 785–795\.Cited by:[§1](https://arxiv.org/html/2605.30637#S1.p1.1)\.
- O\. Perets, O\. B\. Shoham, N\. Grinberg, and N\. Rappoport \(2025\)CUPCase: clinically uncommon patient cases and diagnoses dataset\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.39,pp\. 28293–28301\.Cited by:[§2](https://arxiv.org/html/2605.30637#S2.p1.1)\.
- P\. Qiu, C\. Wu, S\. Liu, W\. Zhao, Z\. Chen, H\. Gu, C\. Peng, Y\. Zhang, Y\. Wang, and W\. Xie \(2025\)Quantifying the reasoning abilities of llms on real\-world clinical cases\.arXiv preprint arXiv:2503\.04691\.Cited by:[§2](https://arxiv.org/html/2605.30637#S2.p1.1)\.
- P\. Raghavan, J\. J\. Liang, D\. Mahajan, R\. Chandra, and P\. Szolovits \(2021\)Emrkbqa: a clinical knowledge\-base question answering dataset\.Cited by:[§2](https://arxiv.org/html/2605.30637#S2.p2.1)\.
- S\. Schmidgall, R\. Ziaei, C\. Harris, E\. Reis, J\. Jopling, and M\. Moor \(2024\)AgentClinic: a multimodal agent benchmark to evaluate ai in simulated clinical environments\.arXiv preprint arXiv:2405\.07960\.Cited by:[§2](https://arxiv.org/html/2605.30637#S2.p1.1)\.
- M\. Shao, Y\. Xie, C\. Yang, and J\. Lu \(2026\)LLM\-mine: large language model based alzheimer’s disease and related dementias phenotypes mining from clinical notes\.arXiv preprint arXiv:2603\.13673\.Cited by:[§1](https://arxiv.org/html/2605.30637#S1.p6.1)\.
- O\. B\. Shoham and N\. Rappoport \(2024\)MedConceptsQA: open source medical concepts qa benchmark\.Computers in Biology and Medicine182,pp\. 109089\.Cited by:[§2](https://arxiv.org/html/2605.30637#S2.p1.1)\.
- D\. Sileo, K\. Uma, and M\. F\. Moens \(2024\)Generating multiple\-choice questions for medical question answering with distractors and cue\-masking\.InProceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation \(LREC\-COLING 2024\),pp\. 7647–7653\.Cited by:[§1](https://arxiv.org/html/2605.30637#S1.p4.1)\.
- A\. Singh, A\. Fry, A\. Perelman, A\. Tart, A\. Ganesh, A\. El\-Kishky,et al\.\(2025\)Openai gpt\-5 system card\.arXiv preprint arXiv:2601\.03267\.Cited by:[Appendix F](https://arxiv.org/html/2605.30637#A6.p2.1),[item \(c\)](https://arxiv.org/html/2605.30637#S4.I1.i3.p1.1)\.
- K\. Singhal, S\. Azizi, T\. Tu, S\. S\. Mahdavi, J\. Wei, H\. W\. Chung, N\. Scales, A\. Tanwani, H\. Cole\-Lewis, S\. Pfohl,et al\.\(2023\)Large language models encode clinical knowledge\.Nature620\(7972\),pp\. 172–180\.Cited by:[§1](https://arxiv.org/html/2605.30637#S1.p2.1),[§1](https://arxiv.org/html/2605.30637#S1.p3.1)\.
- V\. Subbiah \(2023\)The next generation of evidence\-based medicine\.Nature medicine29\(1\),pp\. 49–58\.Cited by:[§1](https://arxiv.org/html/2605.30637#S1.p1.1)\.
- Y\. Sun, X\. Qian, W\. Xu, H\. Zhang, C\. Xiao, L\. Li, D\. Zhao, W\. Huang, T\. Xu, Q\. Bai,et al\.\(2025\)Reasonmed: a 370k multi\-agent generated dataset for advancing medical reasoning\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,pp\. 26457–26478\.Cited by:[§2](https://arxiv.org/html/2605.30637#S2.p1.1)\.
- T\. Tu, M\. Schaekermann, A\. Palepu, K\. Saab, J\. Freyberg, R\. Tanno, A\. Wang, B\. Li, M\. Amin, Y\. Cheng,et al\.\(2025\)Towards conversational diagnostic artificial intelligence\.Nature,pp\. 1–9\.Cited by:[§2](https://arxiv.org/html/2605.30637#S2.p1.1)\.
- B\. Vasey, S\. Ursprung, B\. Beddoe, E\. H\. Taylor, N\. Marlow, N\. Bilbro, P\. Watkinson, and P\. McCulloch \(2021\)Association of clinician diagnostic performance with machine learning–based decision support systems: a systematic review\.JAMA network open4\(3\),pp\. e211276–e211276\.Cited by:[§1](https://arxiv.org/html/2605.30637#S1.p1.1)\.
- P\. Wang, T\. Shi, K\. Agarwal, S\. Choudhury, and C\. K\. Reddy \(2022\)Attention\-based aspect reasoning for knowledge base question answering on clinical notes\.InProceedings of the 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics,pp\. 1–6\.Cited by:[§2](https://arxiv.org/html/2605.30637#S2.p2.1)\.
- P\. Wang, T\. Shi, and C\. K\. Reddy \(2020\)Text\-to\-sql generation for question answering on electronic medical records\.InProceedings of The Web Conference 2020,pp\. 350–361\.Cited by:[§2](https://arxiv.org/html/2605.30637#S2.p2.1)\.
- R\. Wang, Y\. Xie, X\. Hu, C\. Yang, and J\. Lu \(2025a\)BioMedJImpact: a comprehensive dataset and llm pipeline for ai engagement and scientific impact analysis of biomedical journals\.arXiv preprint arXiv:2511\.12821\.Cited by:[§1](https://arxiv.org/html/2605.30637#S1.p2.1)\.
- X\. Wang, C\. Chang, D\. Cao, K\. Han, F\. Sun, Y\. Huang, M\. Wang, C\. Xu, X\. Luo, R\. Yan,et al\.\(2026a\)Position: beyond prediction: toward verifiable physiological waveform reasoning with foundation models and agentic llms\.Cited by:[§2](https://arxiv.org/html/2605.30637#S2.p1.1)\.
- X\. Wang, K\. Han, Y\. Xu, X\. Luo, Y\. Sun, W\. Wang, and C\. Yang \(2026b\)SE\-diff: simulator and experience enhanced diffusion model for comprehensive ecg generation\.InThe Fourteenth International Conference on Learning Representations\.Cited by:[§2](https://arxiv.org/html/2605.30637#S2.p1.1)\.
- X\. Wang, G\. Chen, S\. Dingjie, Z\. Zhiyi, Z\. Chen, Q\. Xiao, J\. Chen, F\. Jiang, J\. Li, X\. Wan,et al\.\(2024a\)Cmb: a comprehensive medical benchmark in chinese\.InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),pp\. 6184–6205\.Cited by:[§2](https://arxiv.org/html/2605.30637#S2.p1.1)\.
- X\. Wang, N\. Chen, J\. Chen, Y\. Wang, G\. Zhen, C\. Zhang, X\. Wu, Y\. Hu, A\. Gao, X\. Wan,et al\.\(2024b\)Apollo: a lightweight multilingual medical llm towards democratizing medical ai to 6b people\.arXiv preprint arXiv:2403\.03640\.Cited by:[§2](https://arxiv.org/html/2605.30637#S2.p1.1)\.
- Z\. Wang, Q\. Jin, J\. Lin, J\. Gao, J\. Pradeepkumar, P\. Jiang, B\. Danek, Z\. Lu, and J\. Sun \(2025b\)TrialPanorama: database and benchmark for systematic review and design of clinical trials\.arXiv preprint arXiv:2505\.16097\.Cited by:[§1](https://arxiv.org/html/2605.30637#S1.p5.1),[§2](https://arxiv.org/html/2605.30637#S2.p1.1)\.
- M\. Wei, D\. Min, Z\. Liu, Y\. Xie, G\. Wu, Z\. Zhang, C\. Yang, M\. S\. Lau, Q\. He, L\. Cheng,et al\.\(2026\)EpiQAL: benchmarking large language models in epidemiological question answering for enhanced alignment and reasoning\.arXiv preprint arXiv:2601\.03471\.Cited by:[§2](https://arxiv.org/html/2605.30637#S2.p1.1)\.
- D\. S\. Wishart, C\. Knox, A\. C\. Guo, D\. Cheng, S\. Shrivastava, D\. Tzur, B\. Gautam, and M\. Hassanali \(2008\)DrugBank: a knowledgebase for drugs, drug actions and drug targets\.Nucleic acids research36\(suppl\_1\),pp\. D901–D906\.Cited by:[§3\.3](https://arxiv.org/html/2605.30637#S3.SS3.p3.1)\.
- M\. Wornow, R\. Thapa, E\. Steinberg, J\. Fries, and N\. Shah \(2023\)Ehrshot: an ehr benchmark for few\-shot evaluation of foundation models\.Advances in Neural Information Processing Systems36,pp\. 67125–67137\.Cited by:[§1](https://arxiv.org/html/2605.30637#S1.p3.1)\.
- C\. Wu, P\. Qiu, J\. Liu, H\. Gu, N\. Li, Y\. Zhang, Y\. Wang, and W\. Xie \(2025a\)Towards evaluating and building versatile large language models for medicine\.npj Digital Medicine8\(1\),pp\. 58\.Cited by:[§2](https://arxiv.org/html/2605.30637#S2.p1.1)\.
- G\. Wu, Y\. Xie, H\. Wu, Z\. He, H\. Shao, X\. Hu, and C\. Yang \(2025b\) Utilizing large language models for zero\-shot medical ontology extension from clinical notes\. arXiv preprint arXiv:2511\.16548\. Cited by: [§1](https://arxiv.org/html/2605.30637#S1.p6.1)\.
- Y\. Wu, F\. Nahab, Y\. Ge, Y\. Xie, H\. Aboul\-Nour, C\. Yang, and X\. Hu \(2025c\) Prediction of post\-stroke af in esus patients is enhanced by combining expert\-derived predictors and embedding of full diagnostic codes using pre\-trained hypergraph neural networks\. STROKE 56\. Cited by: [Appendix B](https://arxiv.org/html/2605.30637#A2.p1.1), [§3\.2](https://arxiv.org/html/2605.30637#S3.SS2.p1.1)\.
- Y\. Xiao, C\. Yang, M\. Mai, X\. Hu, and K\. Shu \(2025\) Beyond medqa: towards real\-world clinical decision making in the era of llms\. arXiv preprint arXiv:2510\.20001\. Cited by: [§2](https://arxiv.org/html/2605.30637#S2.p1.1)\.
- Y\. Xie, H\. Cui, Z\. Zhang, J\. Lu, K\. Shu, F\. Nahab, X\. Hu, and C\. Yang \(2025a\) KERAP: a knowledge\-enhanced reasoning approach for accurate zero\-shot diagnosis prediction using multi\-agent llms\. arXiv preprint arXiv:2507\.02773\. Cited by: [§1](https://arxiv.org/html/2605.30637#S1.p2.1)\.
- Y\. Xie, X\. Han, R\. Xu, X\. Hu, J\. Lu, and C\. Yang \(2025b\) HypKG: hypergraph\-based knowledge graph contextualization for precision healthcare\. In International Semantic Web Conference, pp\. 328–348\. Cited by: [§2](https://arxiv.org/html/2605.30637#S2.p2.1)\.
- Y\. Xie, J\. Lu, J\. Ho, F\. Nahab, X\. Hu, and C\. Yang \(2024a\) PromptLink: leveraging large language models for cross\-source biomedical concept linking\. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp\. 2589–2593\. Cited by: [§1](https://arxiv.org/html/2605.30637#S1.p6.1)\.
- Y\. Xie, F\. Nahab, Y\. Ge, Y\. Wu, J\. Saurman, C\. Yang, and X\. Hu \(2025c\) Abstract wp175: predicting post\-stroke cognitive impairment \(psci\) using multiple machine learning approaches\. Stroke 56\(Suppl\_1\), pp\. AWP175–AWP175\. Cited by: [Appendix B](https://arxiv.org/html/2605.30637#A2.p1.1), [§3\.2](https://arxiv.org/html/2605.30637#S3.SS2.p1.1)\.
- Y\. Xie, G\. Niu, Q\. Da, W\. Dai, and Y\. Yang \(2022\) Survival prediction for gastric cancer via multimodal learning of whole slide images and gene expression\. In 2022 IEEE international conference on bioinformatics and biomedicine \(BIBM\), pp\. 1311–1316\. Cited by: [§2](https://arxiv.org/html/2605.30637#S2.p1.1)\.
- Y\. Xie, Q\. Sang, Q\. Da, G\. Niu, S\. Deng, H\. Feng, Y\. Chen, Y\. Li, B\. Liu, Y\. Yang, et al\. \(2024b\) Improving diagnosis and outcome prediction of gastric cancer via multimodal learning using whole slide pathological images and gene expression\. Artificial intelligence in medicine 152, pp\. 102871\. Cited by: [§2](https://arxiv.org/html/2605.30637#S2.p1.1)\.
- Y\. Xie, Y\. Wu, R\. Wang, F\. Nahab, X\. Hu, and C\. Yang \(2026\) Enhanced atrial fibrillation prediction in esus patients with hypergraph\-based pre\-training\. arXiv preprint arXiv:2603\.13297\. Cited by: [§1](https://arxiv.org/html/2605.30637#S1.p5.1)\.
- R\. Xu, Y\. Zhuang, Y\. Zhong, Y\. Yu, X\. Tang, H\. Wu, M\. D\. Wang, P\. Ruan, D\. Yang, T\. Wang, et al\. \(2025\) Medagentgym: training llm agents for code\-based medical reasoning at scale\. In The Second Workshop on GenAI for Health: Potential, Trust, and Policy Compliance\. Cited by: [§E\.4](https://arxiv.org/html/2605.30637#A5.SS4.p4.1), [§1](https://arxiv.org/html/2605.30637#S1.p6.1), [§2](https://arxiv.org/html/2605.30637#S2.p2.1), [§4\.1](https://arxiv.org/html/2605.30637#S4.SS1.p7.1)\.
- L\. K\. Yan, Q\. Niu, M\. Li, Y\. Zhang, C\. H\. Yin, C\. Fei, B\. Peng, Z\. Bi, P\. Feng, K\. Chen, et al\. \(2024\) Large language model benchmarks in medical tasks\. arXiv preprint arXiv:2410\.21348\. Cited by: [§1](https://arxiv.org/html/2605.30637#S1.p3.1)\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv, et al\. \(2025\) Qwen3 technical report\. arXiv preprint arXiv:2505\.09388\. Cited by: [item \(a\)](https://arxiv.org/html/2605.30637#S4.I1.i1.p1.1)\.
- A\. Yang, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Li, D\. Liu, F\. Huang, H\. Wei, et al\. \(2024\) Qwen2\.5 technical report\. arXiv preprint arXiv:2412\.15115\. Cited by: [item \(a\)](https://arxiv.org/html/2605.30637#S4.I1.i1.p1.1)\.
- C\. Yang, Z\. Wu, P\. Jiang, Z\. Lin, J\. Gao, B\. Danek, and J\. Sun \(2023\) PyHealth: a deep learning toolkit for healthcare predictive modeling\. In Proceedings of the 29th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining \(KDD\) 2023\. External Links: [Link](https://github.com/sunlabuiuc/PyHealth)\. Cited by: [§3\.2](https://arxiv.org/html/2605.30637#S3.SS2.p2.1)\.
- J\. Ye, G\. Wang, Y\. Li, Z\. Deng, W\. Li, T\. Li, H\. Duan, Z\. Huang, Y\. Su, B\. Wang, et al\. \(2024\) Gmai\-mmbench: a comprehensive multimodal evaluation benchmark towards general medical ai\. Advances in Neural Information Processing Systems 37, pp\. 94327–94427\. Cited by: [§2](https://arxiv.org/html/2605.30637#S2.p1.1)\.
- A\. Young, B\. Chen, C\. Li, C\. Huang, G\. Zhang, G\. Zhang, et al\. \(2024\) Yi: open foundation models by 01\. ai\. arXiv preprint arXiv:2403\.04652\. Cited by: [item \(a\)](https://arxiv.org/html/2605.30637#S4.I1.i1.p1.1)\.
- X\. Yue, Y\. Ni, K\. Zhang, T\. Zheng, R\. Liu, G\. Zhang, S\. Stevens, D\. Jiang, W\. Ren, Y\. Sun, et al\. \(2024\) Mmmu: a massive multi\-discipline multimodal understanding and reasoning benchmark for expert agi\. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp\. 9556–9567\. Cited by: [§2](https://arxiv.org/html/2605.30637#S2.p1.1)\.
- K\. Zhang, S\. Zeng, E\. Hua, N\. Ding, Z\. Chen, Z\. Ma, H\. Li, G\. Cui, B\. Qi, X\. Zhu, et al\. \(2024a\) Ultramedical: building specialized generalists in biomedicine\. Advances in Neural Information Processing Systems 37, pp\. 26045–26081\. Cited by: [item \(b\)](https://arxiv.org/html/2605.30637#S4.I1.i2.p1.1)\.
- M\. Zhang, Y\. Shen, Z\. Li, H\. Sha, B\. Hu, Y\. Wang, C\. Huang, S\. Liu, J\. Tong, C\. Jiang, et al\. \(2025a\) LLMEval\-med: a real\-world clinical benchmark for medical llms with physician validation\. arXiv preprint arXiv:2506\.04078\. Cited by: [§1](https://arxiv.org/html/2605.30637#S1.p5.1), [§2](https://arxiv.org/html/2605.30637#S2.p1.1)\.
- X\. Zhang, C\. Wu, Z\. Zhao, W\. Lin, Y\. Zhang, Y\. Wang, and W\. Xie \(2023\) Pmc\-vqa: visual instruction tuning for medical visual question answering\. arXiv preprint arXiv:2305\.10415\. Cited by: [§2](https://arxiv.org/html/2605.30637#S2.p1.1)\.
- Z\. Zhang, H\. Cui, R\. Xu, Y\. Xie, J\. C\. Ho, and C\. Yang \(2024b\) Tacco: task\-guided co\-clustering of clinical concepts and patient visits for disease subtyping based on ehr data\. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp\. 6324–6334\. Cited by: [§1](https://arxiv.org/html/2605.30637#S1.p6.1)\.
- Z\. Zhang, W\. Lily, M\. Weimin, L\. Chang, S\. Hui, V\. S\. Yan, G\. Jingchuan, B\. Jiang, Y\. Rui, and Y\. Carl \(2025b\) Type 2 diabetes subtyping via phenotype and genotype co\-learning\. Studies in health technology and informatics 329, pp\. 1064\. Cited by: [§1](https://arxiv.org/html/2605.30637#S1.p5.1)\.
- S\. Zhou, W\. Xie, J\. Li, Z\. Zhan, M\. Song, H\. Yang, C\. Espinoza, L\. Welton, X\. Mai, Y\. Jin, et al\. \(2025\) Automating expert\-level medical reasoning evaluation of large language models\. npj Digital Medicine\. Cited by: [§1](https://arxiv.org/html/2605.30637#S1.p3.1), [§2](https://arxiv.org/html/2605.30637#S2.p1.1)\.
- Y\. Zhou, X\. Liu, C\. Ning, and J\. Wu \(2024\) Multifaceteval: multifaceted evaluation to probe llms in mastering medical knowledge\. arXiv preprint arXiv:2406\.02919\. Cited by: [§2](https://arxiv.org/html/2605.30637#S2.p1.1)\.
- Z\. Zhou et al\. \(2025\) Large language models for disease diagnosis: a scoping review\. NPJ Digital Medicine\. Cited by: [§1](https://arxiv.org/html/2605.30637#S1.p2.1)\.
- Y\. Zhu, Z\. Huang, L\. Mu, Y\. Huang, W\. Nie, J\. Liu, S\. Zhang, P\. Liu, and X\. Zhang \(2025\) DiagnosisArena: benchmarking diagnostic reasoning for large language models\. arXiv preprint arXiv:2505\.14107\. Cited by: [§2](https://arxiv.org/html/2605.30637#S2.p1.1)\.
- Y\. Zuo, S\. Qu, Y\. Li, Z\. Chen, X\. Zhu, E\. Hua, K\. Zhang, N\. Ding, and B\. Zhou \(2025\) Medxpertqa: benchmarking expert\-level medical reasoning and understanding\. arXiv preprint arXiv:2501\.18362\. Cited by: [§1](https://arxiv.org/html/2605.30637#S1.p5.1), [§2](https://arxiv.org/html/2605.30637#S2.p1.1)\.

## Appendices

## Appendix A Notation Table

Table 2\. Notations used in the paper\.

## Appendix B EHRBench QA Statistics

For EHRBench, raw EHR records are drawn from three real\-world sources: MIMIC\-III, MIMIC\-IV, and PROMOTE\. MIMIC\-III \(Version 1\.4\) is a publicly available critical\-care dataset containing 38,597 distinct patients and 53,423 hospital admissions from intensive care units at Beth Israel Deaconess Medical Center between 2001 and 2012 \(Johnson et al\., [2016](https://arxiv.org/html/2605.30637#bib.bib25)\)\. MIMIC\-IV \(Version 3\.1\) is a newer release that extends MIMIC\-III with updated hospital data from the same institution \(2008–2022\), including 364,627 patients and 546,028 hospital admissions \(Johnson et al\., [2023](https://arxiv.org/html/2605.30637#bib.bib71)\)\. To further reduce potential contamination from public corpora and to evaluate language models in a setting less prone to data leakage, PROMOTE is additionally included as a private dataset from Emory Healthcare spanning 2012–2021, covering 18,561 patients and 912,706 clinical records \(Xie et al\., [2025c](https://arxiv.org/html/2605.30637#bib.bib72); Wu et al\., [2025c](https://arxiv.org/html/2605.30637#bib.bib73)\)\.

To ensure that each constructed QA item is grounded in sufficiently informative clinical documentation, we retain only encounters with rich structured signals\. Concretely, an encounter is included only if it contains at least five diagnosis events and at least three treatment events \(where treatments include both prescriptions and procedures\)\. This filtering reduces the risk of constructing questions from sparse or under\-specified visits and improves the reliability of the downstream relation extraction and QA instantiation\.
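The encounter filter described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Encounter` container and its field names are hypothetical, and whether repeated events are deduplicated before counting is not stated in the text, so the sketch counts them as given.

```python
from dataclasses import dataclass, field

MIN_DIAGNOSES = 5   # threshold stated in the text
MIN_TREATMENTS = 3  # prescriptions and procedures are counted together


@dataclass
class Encounter:
    # Hypothetical container; field names are illustrative only.
    diagnoses: list = field(default_factory=list)
    prescriptions: list = field(default_factory=list)
    procedures: list = field(default_factory=list)


def is_informative(enc: Encounter) -> bool:
    """Return True if the encounter meets both minimum event counts."""
    n_treatments = len(enc.prescriptions) + len(enc.procedures)
    return len(enc.diagnoses) >= MIN_DIAGNOSES and n_treatments >= MIN_TREATMENTS
```

An encounter with five diagnoses, two prescriptions, and one procedure passes; one with four diagnoses is dropped regardless of how many treatments it contains.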

For each data source and task, up to the first 10,000 EHR instances are extracted and processed by the construction pipeline\. End\-to\-end construction requires approximately 560 H100 GPU\-hours\. Across all sources and tasks, 465,748 candidate clinical relations are first extracted from EHR trajectories, and 62,786 knowledge\-base\-verified QA templates are retained \(13\.5% retention\)\. These templates are then instantiated into 960,067 QA items spanning three decision\-making tasks \(diagnosis \(Dx\), treatment \(Tx\), and prognosis \(Px\)\) and four question formats: 4/5/6\-choice multiple\-choice questions \(MCQs\) and open\-ended questions \(OEQs\)\. Table [3](https://arxiv.org/html/2605.30637#A2.T3) summarizes the breakdown by data source and task\. Treatment accounts for the largest share of questions \(450,501\), followed by prognosis \(259,123\) and diagnosis \(250,443\)\. Aggregated by data source, EHRBench contains 323,193 questions from MIMIC\-III, 259,089 from MIMIC\-IV, and 377,785 from PROMOTE, totaling 960,067 questions\.

Table 3\. EHRBench construction statistics by task, data source, and question format\. “Candidate Relations” counts relation candidates extracted from EHR trajectories before knowledge\-base \(KB\) verification, and “Verified Templates” counts KB\-supported templates retained for instantiation\. “MCQ\-4/5/6” report the numbers of 4/5/6\-choice MCQs, and “OEQ” reports the number of open\-ended questions\.

| Task | Data Source | Candidate Relations | Verified Templates | MCQ\-4 | MCQ\-5 | MCQ\-6 | OEQ | Total |
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| Diagnosis | MIMIC\-III | 32,338 | 4,406 | 17,624 | 21,315 | 24,264 | 4,329 | 67,532 |
| Diagnosis | MIMIC\-IV | 36,869 | 4,323 | 17,288 | 21,085 | 24,438 | 4,291 | 67,102 |
| Diagnosis | PROMOTE | 56,699 | 7,259 | 29,024 | 36,250 | 43,278 | 7,257 | 115,809 |
| Diagnosis | Task\-Total | 125,906 | 15,988 | 63,936 | 78,650 | 91,980 | 15,877 | 250,443 |
| Treatment | MIMIC\-III | 53,637 | 9,878 | 39,428 | 49,230 | 58,926 | 9,873 | 157,457 |
| Treatment | MIMIC\-IV | 38,140 | 8,178 | 32,672 | 40,775 | 48,762 | 8,170 | 130,379 |
| Treatment | PROMOTE | 45,899 | 10,173 | 40,672 | 50,830 | 60,990 | 10,173 | 162,665 |
| Treatment | Task\-Total | 137,676 | 28,229 | 112,772 | 140,835 | 168,678 | 28,216 | 450,501 |
| Prognosis | MIMIC\-III | 71,896 | 7,207 | 28,784 | 30,820 | 31,404 | 7,196 | 98,204 |
| Prognosis | MIMIC\-IV | 60,816 | 4,452 | 17,800 | 19,470 | 19,890 | 4,448 | 61,608 |
| Prognosis | PROMOTE | 69,454 | 6,910 | 27,632 | 31,325 | 33,444 | 6,910 | 99,311 |
| Prognosis | Task\-Total | 202,166 | 18,569 | 74,216 | 81,615 | 84,738 | 18,554 | 259,123 |
| Full | EHRBench | 465,748 | 62,786 | 250,924 | 301,100 | 345,396 | 62,647 | 960,067 |
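As a quick sanity check on the reported statistics, the per-task and per-source totals and the retention rate can be recomputed from the figures quoted in this appendix. The variable names below are illustrative; the numbers are taken directly from the text.

```python
# Question counts as reported in this appendix.
by_task = {"Diagnosis": 250_443, "Treatment": 450_501, "Prognosis": 259_123}
by_source = {"MIMIC-III": 323_193, "MIMIC-IV": 259_089, "PROMOTE": 377_785}

candidate_relations = 465_748
verified_templates = 62_786

total_by_task = sum(by_task.values())
total_by_source = sum(by_source.values())

# Share of candidate relations that survive KB verification, in percent.
retention_percent = 100.0 * verified_templates / candidate_relations
```

Both sums equal 960,067, and the retention rate rounds to 13.5%, matching the figures stated above.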

From the length statistics, each question stem is relatively detailed, averaging 332\.4 characters \(about 46\.8 words\)\. The answer options are much more concise: each option averages 28\.6 characters \(about 3\.5 words\), indicating that choices are typically short concept\-level phrases\. The reason field is also substantial, averaging 251\.1 characters \(about 36\.4 words\), suggesting there is enough space for coherent justifications rather than single\-sentence fragments\.

## Appendix C Details of Methodology

This appendix provides additional implementation details for the benchmark construction pipeline described in Section [3](https://arxiv.org/html/2605.30637#S3), including encounter filtering, relation/context constraints, concept linking via UMLS, evidence retrieval via SemMedDB, task\-specific relation definitions, and distractor generation and verification\.

### C\.1\. Clinical Relations and Context Extraction from EHR

#### Relation cap\.

For each EHR instance, the prompting stage may yield multiple candidate relations\. To control template complexity and limit noise from overly long candidate lists, we retain at most 15 candidate relations per instance after deduplication\.

#### Context construction and size\.

For each retained relation template $P_k$, we generate a small auxiliary context set $C_k$ consisting of exactly two events per instance, as described in Section [3\.3](https://arxiv.org/html/2605.30637#S3.SS3)\. During QA instantiation, each question scenario incorporates three events in total: the two context events plus the relation\-subject entity, i\.e\.,

$$S_j \leftarrow C_k \cup \{x_k\}, \quad \text{with } |C_k| = 2. \tag{13}$$

This design yields concise, controlled scenarios while preserving grounding in observed encounter events\.

#### Information leakage prevention\.

When extracting clinical relations and contextual events from raw EHR data, we enforce strict non\-overlap constraints between context events and clinical relations to prevent information leakage, since both may later be turned into question stems and answer choices\. This design ensures that each generated question requires clinical reasoning rather than surface\-level pattern matching\.
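The scenario assembly and the non-overlap constraint can be illustrated together. The sketch below is a simplified reading under stated assumptions: events are plain strings, overlap is tested by exact string equality (the actual matching rule is not specified in the text), and `build_scenario` is a hypothetical helper name.

```python
def build_scenario(context_events, subject, obj):
    """Assemble a scenario from two context events plus the relation subject.

    Raises ValueError if the context does not contain exactly two events or
    if a context event coincides with the relation subject or object.
    """
    context = list(context_events)
    if len(context) != 2:
        raise ValueError("context set must contain exactly two events")
    # Assumed leakage check: context must not repeat either relation entity.
    if subject in context or obj in context:
        raise ValueError("context overlaps with the relation entities")
    return context + [subject]
```

For example, `build_scenario(["anemia", "hypertension"], "heart failure", "acute kidney failure")` returns a three-event scenario, while passing a context that already contains the object entity is rejected.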

#### Task\-Specific Relation Definitions

Each template contains a verified relation $R_k = (x_k, r_k, y_k)$ and is associated with one of three clinical decision tasks\. We define the permissible entity roles and relation label spaces as follows\.

\(I\) Diagnosis decision\. For the diagnosis task, both $x_k$ and $y_k$ are diagnoses recorded in the *same* encounter\. LLMs focus on relations reflecting shared pathophysiologic mechanisms or common comorbidity patterns, supporting questions that ask which additional diagnosis is likely to be identified given other diagnoses already present in the encounter\. The relation label is constrained to the set:

$$r_k \in \{\textit{cause},\ \textit{affect},\ \textit{associate-with}\}.$$

\(II\) Treatment decision\. For the treatment task, $x_k$ is a diagnosis from the encounter and $y_k$ is a treatment action \(prescription or procedure\) from the same encounter\. The relation label is constrained to:

$$r_k \in \{\textit{treat-with-drug},\ \textit{treat-with-procedure}\},$$

with evidence matched through USAGE\_REL predicates and entity typing determined by the event category \(prescription vs\. procedure\) after preprocessing\.

\(III\) Prognosis decision\. For the prognosis task, $x_k$ is a diagnosis or treatment from the *prior* encounter, and $y_k$ is a diagnosis in the *subsequent* encounter\. LLMs focus on direct complications, risk\-modifying effects, or clinically plausible associations that link earlier conditions or interventions to later outcomes\. The relation label is constrained to

$$r_k \in \{\textit{cause},\ \textit{affect},\ \textit{associate-with}\},$$

where these labels map to SemMedDB predicate groups via CAUSE\_REL, AFFECT\_REL, and ASSOC\_REL \(and their negative counterparts for contradiction checks\)\.
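The three task definitions above amount to a small table of permissible entity roles and labels. The sketch below encodes them as a lookup; the event-type strings (`"diagnosis"`, `"prescription"`, `"procedure"`) and the function name are assumptions made for illustration, and the same-encounter versus prior/subsequent-encounter conditions are left out for brevity.

```python
CLINICAL_LABELS = {"cause", "affect", "associate-with"}
TREATMENT_TYPES = {"prescription", "procedure"}

# Permissible subject types, object types, and labels for each task.
TASK_RULES = {
    "diagnosis": ({"diagnosis"}, {"diagnosis"}, CLINICAL_LABELS),
    "treatment": ({"diagnosis"}, TREATMENT_TYPES,
                  {"treat-with-drug", "treat-with-procedure"}),
    "prognosis": ({"diagnosis"} | TREATMENT_TYPES, {"diagnosis"}, CLINICAL_LABELS),
}

# For treatment, the label must agree with the object's event category.
LABEL_TO_OBJECT_TYPE = {
    "treat-with-drug": "prescription",
    "treat-with-procedure": "procedure",
}


def is_permissible(task, subject_type, label, object_type):
    """Check a candidate relation against the role and label constraints."""
    if task not in TASK_RULES:
        return False
    subjects, objects, labels = TASK_RULES[task]
    if subject_type not in subjects or object_type not in objects:
        return False
    if label not in labels:
        return False
    expected = LABEL_TO_OBJECT_TYPE.get(label)
    return expected is None or expected == object_type
```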

### C\.2\. KB Usage

#### Concept linking\.

Concept linking is performed using the UMLS RESTful web services \(UMLS web search\) provided by the UMLS API \([UMLS API Link](https://documentation.uts.nlm.nih.gov/rest/home.html)\)\. The current UMLS web version at query time is used\. Each entity mention extracted from EHRs or produced during template completion \(e\.g\., relation entities and distractor candidates\) is resolved to a UMLS Concept Unique Identifier \(CUI\) using UMLS search\.

To improve mapping precision, the search space is constrained to source vocabularies when appropriate:

- For prescriptions, source vocabularies aligned with DrugBank are prioritized\.
- For procedures and diagnoses, source vocabularies aligned with ICD are prioritized\.

The resolved CUIs serve as canonical identifiers for downstream KB retrieval and predicate matching in SemMedDB\.

#### SemMedDB Dataset\.

The SemMedDB dataset is downloaded \([SemMedDB Link](https://lhncbc.nlm.nih.gov/temp/SemRep_SemMedDB_SKR/SemMedDB_download.html)\), and the PREDICATION and SENTENCE files are used for relation verification and enrichment\. In the PREDICATION file, duplicate records are removed based on the SUBJECT, OBJECT, and PREDICATE columns to obtain unique KB relations\. Predicates are then grouped into positive and negative sets that correspond to the target labels for verification and filtering:

$$
\begin{aligned}
\text{CAUSE\_REL} &= \{\textit{CAUSES},\ \textit{PRODUCES},\ \textit{CONVERTS\_TO}\},\\
\text{AFFECT\_REL} &= \{\textit{PREDISPOSES},\ \textit{COMPLICATES},\ \textit{STIMULATES},\ \textit{AUGMENTS},\ \textit{AFFECTS}\},\\
\text{ASSOC\_REL} &= \{\textit{ASSOCIATED\_WITH},\ \textit{COEXISTS\_WITH},\ \textit{INTERACTS\_WITH}\},\\
\text{USAGE\_REL} &= \{\textit{TREATS},\ \textit{DIAGNOSES},\ \textit{ADMINISTERED\_TO},\ \textit{USED\_FOR},\\
&\qquad \textit{MEASUREMENT\_OF},\ \textit{METHOD\_OF},\ \textit{PREVENTS},\ \textit{INHIBITS},\ \textit{DISRUPTS}\},
\end{aligned}
$$

and their corresponding negative forms:

$$
\begin{aligned}
\text{NEG\_CAUSE\_REL} &= \{\textit{NEG\_CAUSES},\ \textit{NEG\_PRODUCES},\ \textit{NEG\_CONVERTS\_TO}\},\\
\text{NEG\_AFFECT\_REL} &= \{\textit{NEG\_PREDISPOSES},\ \textit{NEG\_COMPLICATES},\ \textit{NEG\_STIMULATES},\ \textit{NEG\_AUGMENTS},\ \textit{NEG\_AFFECTS}\},\\
\text{NEG\_ASSOC\_REL} &= \{\textit{NEG\_ASSOCIATED\_WITH},\ \textit{NEG\_COEXISTS\_WITH},\ \textit{NEG\_INTERACTS\_WITH}\},\\
\text{NEG\_USAGE\_REL} &= \{\textit{NEG\_TREATS},\ \textit{NEG\_DIAGNOSES},\ \textit{NEG\_ADMINISTERED\_TO},\ \textit{NEG\_USED\_FOR},\\
&\qquad \textit{NEG\_MEASUREMENT\_OF},\ \textit{NEG\_METHOD\_OF},\ \textit{NEG\_PREVENTS},\ \textit{NEG\_INHIBITS},\ \textit{NEG\_DISRUPTS}\}.
\end{aligned}
$$

After evidence is identified, the corresponding SENTENCE\_ID is used to retrieve supporting PubMed sentences from the SENTENCE file\.
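The predicate groups above suggest a simple decision rule for verification: keep a relation only if the KB contains a positive predicate from the label's group and no negated predicate from that group. The sketch below assumes the KB lookup for a subject/object CUI pair has already produced a set of predicate strings; the function name, the label-to-group mapping for the two treatment labels, and the choice to treat any negated predicate in the group as a contradiction are illustrative assumptions.

```python
CAUSE_REL = {"CAUSES", "PRODUCES", "CONVERTS_TO"}
AFFECT_REL = {"PREDISPOSES", "COMPLICATES", "STIMULATES", "AUGMENTS", "AFFECTS"}
ASSOC_REL = {"ASSOCIATED_WITH", "COEXISTS_WITH", "INTERACTS_WITH"}
USAGE_REL = {"TREATS", "DIAGNOSES", "ADMINISTERED_TO", "USED_FOR",
             "MEASUREMENT_OF", "METHOD_OF", "PREVENTS", "INHIBITS", "DISRUPTS"}

# Relation labels mapped to their positive predicate groups.
LABEL_TO_GROUP = {
    "cause": CAUSE_REL,
    "affect": AFFECT_REL,
    "associate-with": ASSOC_REL,
    "treat-with-drug": USAGE_REL,
    "treat-with-procedure": USAGE_REL,
}


def negated(group):
    """Build the NEG_ counterparts of a positive predicate group."""
    return {"NEG_" + predicate for predicate in group}


def is_verified(label, kb_predicates):
    """Keep a relation only with positive and without negative KB evidence."""
    positive = LABEL_TO_GROUP[label]
    found = set(kb_predicates)
    has_positive = bool(positive & found)
    has_negative = bool(negated(positive) & found)
    return has_positive and not has_negative
```

Under this rule, a pair with only `CAUSES` in the KB verifies the label `cause`, a pair with both `CAUSES` and `NEG_CAUSES` is rejected as contradicted, and a pair with no predicate from the group is rejected as unsupported.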

### C\.3\. QA Template Completion

#### Distractor generation\.

For each verified relation template, an LLM generates 10 distractor candidates based on internal knowledge and structured template attributes\. In addition, observed clinical events are sampled to expand the candidate pool, yielding up to 25 total distractor candidates per template\.
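A minimal sketch of how such a candidate pool could be assembled is given below. The text specifies only the sizes (10 LLM candidates, up to 25 in total); the sampling scheme, the deduplication, the exclusion of the correct answer, and the helper name `build_distractor_pool` are assumptions.

```python
import random


def build_distractor_pool(llm_candidates, observed_events, answer,
                          n_llm=10, max_total=25, seed=0):
    """Combine LLM-generated candidates with sampled observed events."""
    pool = []
    # Take up to n_llm distinct LLM candidates, skipping the correct answer.
    for cand in llm_candidates:
        if len(pool) == n_llm:
            break
        if cand != answer and cand not in pool:
            pool.append(cand)
    # Fill the remaining slots with distinct observed events in seeded random order.
    extras = [e for e in dict.fromkeys(observed_events)
              if e != answer and e not in pool]
    random.Random(seed).shuffle(extras)
    pool.extend(extras[: max_total - len(pool)])
    return pool
```

The fixed seed keeps the sampled pool reproducible across runs, which matters because later filtering steps operate on this pool.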

#### Topic generation\.

For each relation template, a high\-level clinical condition topic is generated to summarize the clinical focus of the relation\. For diagnosis and prognosis tasks, the topic is derived from $y_k$ \(the object entity\)\. For treatment tasks, the topic is derived from $x_k$ \(the subject entity\), because the object entity is a treatment rather than a diagnosis\. Statistics of the generated topics are reported in Appendix [D](https://arxiv.org/html/2605.30637#A4)\.

#### Rationale generation\.

The LLM is prompted to review the extracted clinical relations and the retrieved KB evidence \(e\.g\., entity definitions, positive evidence, and supporting sentences\) to generate a concise clinical rationale\. The resulting rationales are stored in the templates and can be directly used as explanations for the MCQs and OEQs derived from the corresponding templates\.

#### Template diversity constraint per patient\.

To avoid over\-representing a single patient trajectory and to promote diversity across conditions and care patterns, the number of verified templates per patient is restricted\. Each patient contributes at most three templates across all eligible encounters, selected to maximize coverage of distinct relations and reduce redundancy\.
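The text does not specify how the three templates are selected beyond maximizing coverage of distinct relations. One possible greedy reading, shown purely as an illustration, first picks one template per unseen relation label and then fills any remaining slots; the tuple layout and function name are hypothetical.

```python
def select_templates(templates, cap=3):
    """Pick at most `cap` templates for one patient, favoring distinct labels.

    Each template is a (subject, label, object) tuple.
    """
    chosen, seen_labels = [], set()
    # First pass: one template per distinct relation label.
    for template in templates:
        if len(chosen) == cap:
            break
        if template[1] not in seen_labels:
            chosen.append(template)
            seen_labels.add(template[1])
    # Second pass: fill any remaining slots with templates not yet chosen.
    for template in templates:
        if len(chosen) == cap:
            break
        if template not in chosen:
            chosen.append(template)
    return chosen
```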

#### Minimum distractor threshold\.

To ensure that MCQs include meaningful alternatives and reduce ambiguity, templates with fewer than three verified distractors after filtering are discarded\. As a result, each instantiated MCQ contains 4, 5, or 6 choices\. This constraint is applied before the final QA instantiation stage\.
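The threshold can be expressed as a small rule. The sketch assumes that a $c$-choice MCQ needs one correct answer plus $c-1$ verified distractors, so a template with exactly three distractors supports only the 4-choice format; the text does not state how the number of choices is assigned per template, so this mapping is an assumption.

```python
def supported_mcq_sizes(n_verified_distractors):
    """Return the MCQ sizes (number of choices) a template can support."""
    if n_verified_distractors < 3:
        return []  # template is discarded
    # A c-choice question uses the correct answer plus c - 1 distractors.
    return [c for c in (4, 5, 6) if n_verified_distractors >= c - 1]
```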

### C\.4\. QA Generation

#### Question Skeleton

- Diagnosis MCQ: Based on the clinical context summarized above, which additional diagnosis is most likely to be present or identified during this visit?
- Treatment MCQ: Given the clinical context summarized above, which treatment is most likely to be prescribed during this visit?
- Prognosis MCQ: Given the prior clinical history summarized above, which diagnosis is most likely to be present or identified at the next visit?
- Diagnosis OEQ: Given the diagnoses of the patient, what may lead to this target diagnosis during the same visit? Why?
- Treatment OEQ: Given the diagnoses of the patient, what may lead to this target treatment during the same visit? Why?
- Prognosis OEQ: Given the prior visit of the patient, what may lead to this target diagnosis at the next visit? Why?

#### Question Paraphrasing

To improve robustness and linguistic diversity, an instruction\-following LLM is used to generate $V$ paraphrased versions of the same scenario while preserving clinical intent\. Concretely, $V = 6$ paraphrased scenarios $\{S_j^{(v)}\}_{v=1}^{V}$ are generated, and each scenario is paired with a fixed ask\-only question to form the corresponding question instances\. Prompts are designed to produce surface\-level paraphrases of the same clinical context by rephrasing wording and sentence structure without introducing new information, clinical interpretation, or emphasis across events\. Furthermore, the LLM is prohibited from using abstraction or causal language, and similar length and structure are maintained across versions to ensure that variation is limited to linguistic form rather than semantic content\. This controlled paraphrasing strategy enables robust evaluation under deterministic rewording while preventing information leakage and preserving clinical meaning\.

#### Choice Permutation

To mitigate sensitivity to answer\-position bias, $c$ permutation variants are generated for each instance using a fixed random seed to ensure reproducibility\. For a $c$\-choice MCQ, each variant permutes the order of the $c$ answer options while updating the ground\-truth label to match the new position of the correct answer\. As a result, across the $c$ variants, each option appears exactly once in each position, eliminating systematic advantages associated with any specific answer index\. This design prevents models from exploiting positional heuristics, enforces position\-invariant evaluation, and enables a more faithful assessment of model reasoning ability rather than sensitivity to answer ordering\.
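One way to obtain $c$ variants in which every option occupies every position exactly once is to take a seeded shuffle of the options and then apply all $c$ cyclic rotations of it. The text does not say which construction is used, so the rotation scheme and the function name below are assumptions; the sketch only demonstrates that the stated property is achievable with a fixed seed.

```python
import random


def permutation_variants(options, answer_index, seed=0):
    """Return c (options, label) variants forming a Latin square over positions."""
    c = len(options)
    order = list(range(c))
    random.Random(seed).shuffle(order)  # seeded base order for reproducibility
    variants = []
    for shift in range(c):
        # Cyclic rotation of the base order by `shift` positions.
        perm = order[shift:] + order[:shift]
        shuffled = [options[i] for i in perm]
        # The label is the new position of the originally correct option.
        variants.append((shuffled, perm.index(answer_index)))
    return variants
```

Because the rows are rotations of one sequence, each column contains every option once, and the correct answer lands on each index exactly once across the $c$ variants.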

### C\.5\. Trade\-off in Knowledge\-base Verification

The benchmark construction pipeline adopts a precision\-first verification strategy: uncertain or weakly supported relations are discarded rather than retained\. This design choice is intended to improve the reliability and clinical validity of retained QA items at scale\. Among candidate relations not retained, approximately 81% lacked supporting KB evidence, 12% were associated with explicit negative evidence, 5% were removed by the per\-patient diversity cap, and 2% failed distractor\-quality filtering\. These statistics indicate that most rejected candidates arise from insufficient supporting evidence rather than direct contradiction\.

To estimate potential false negatives introduced by the conservative filtering strategy, we manually audited 225 rejected candidate relations \(25 sampled items for each of the 9 combinations of 3 tasks and 3 data sources\)\. Approximately 24% were judged clinically plausible despite not being retained\. These misses likely arise from incomplete KB coverage for medically reasonable relations, granularity mismatches between near\-synonymous CUIs, and entity\-linking loss introduced during API\-based normalization\.

This behavior reflects the intended trade\-off of the current pipeline\. EHRBench prioritizes higher precision and stronger grounding of retained relations over maximizing recall, acknowledging that some clinically plausible but weakly supported relations may be excluded during verification\.

### C\.6\. Quality Control

LLM\-based generation enables scalable and automated benchmark construction; however, reliability in the clinical domain requires systematic safeguards\. We therefore implement multi\-stage quality control throughout the construction of EHRBench to improve correctness, clarity, and fairness\. These safeguards are tightly integrated into the *EHR–LLM–KB* pipeline and operate at multiple stages of template generation, instantiation, and validation\. The quality control pipeline consists of the following components:

- Terminology normalization\. Raw EHR events are mapped to normalized, human\-readable descriptions and standardized biomedical codes \(UMLS CUIs\)\. This step bridges heterogeneity across EHR systems and external knowledge bases, reducing spurious variation caused by synonyms, abbreviations, and data\-source\-specific naming conventions\. Normalization ensures consistent grounding of clinical concepts across data sources and enables reliable downstream KB verification\.
- Knowledge\-base verification of clinical relations\. To reduce hallucinations and improve clinical validity, all LLM\-extracted relations are verified and enriched using external biomedical knowledge bases, including UMLS, SemMedDB, PubMed, ICD, and DrugBank\. Relations are retained only if they are supported by positive evidence and are not contradicted by known negative or conflicting relations\. Unsupported or contradictory relations are discarded during Stage 2 of template generation, ensuring that retained relations reflect clinically plausible associations grounded in biomedical knowledge\.
- Knowledge\-base filtering of templates\. To preserve an unambiguous choice structure in multiple\-choice questions, candidate distractors are further pruned through KB\-based filtering\. Clinically plausible alternatives that are also supported by knowledge bases given the anchor concept and auxiliary context are removed, reducing the risk of multiple correct answers\. This filtering step, applied in Stage 4 of template generation, enforces single\-answer correctness while maintaining clinical realism\.
- Leakage prevention\. To minimize information leakage from the context to the answer options, we enforce forbidden\-term constraints and non\-overlap rules between question stems and choices\. In particular, extracted clinical relations are required not to trivially overlap with events explicitly mentioned in the EHR context\. These constraints prevent models from exploiting surface cues or lexical overlap instead of performing genuine clinical reasoning\.
- Format and structural validation\. All LLM outputs are required to conform to a predefined JSON schema and are parsed using robust validation routines\. At the QA\-item level, each record is checked to ensure that required fields are present and non\-empty, that option sets are well\-formed, and that duplicate or malformed items are removed\. This validation step guarantees structural consistency across the entire benchmark\.
- Diversity and fairness controls\. EHRBench is constructed at large scale, spanning multiple clinical decision tasks, sources, and question types, resulting in a total of 960,067 QA items\. To mitigate evaluation bias, multiple deterministic variants are generated through paraphrasing and answer\-option permutations, ensuring that model performance reflects robust reasoning rather than sensitivity to surface form\. This design promotes fair and stable comparison across models\.

Together, these safeguards ensure that the generated QA items fully leverage the scalability of LLM\-based generation and the richness of real\-world EHR data, while maintaining high standards of clinical correctness, structural clarity, and evaluation fairness\.
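The format and structural validation step listed among the safeguards can be illustrated with a small checker for a single MCQ record. The actual JSON schema is not given in the text, so the field names (`scenario`, `question`, `choices`, `answer`) and the specific checks are hypothetical; only the standard-library `json` module is used.

```python
import json

# Hypothetical required fields; the benchmark's real schema is not published here.
REQUIRED_FIELDS = ("scenario", "question", "choices", "answer")


def validate_mcq(raw):
    """Parse one JSON record and return it, or None if it is malformed."""
    try:
        item = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(item, dict):
        return None
    # Required fields must be present and non-empty.
    for name in REQUIRED_FIELDS:
        if name not in item or item[name] in ("", [], None):
            return None
    choices = item["choices"]
    # Option sets must be 4 to 6 distinct strings that include the answer.
    if not isinstance(choices, list) or len(choices) not in (4, 5, 6):
        return None
    if not all(isinstance(c, str) for c in choices):
        return None
    if len(set(choices)) != len(choices) or item["answer"] not in choices:
        return None
    return item
```

Records that fail any check return `None` and would be dropped before the item enters the benchmark.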

### C\.7\.EHRBench Usage Protocol

The generated QA items are used to evaluate the CDM capability of LLMs\. For each QA itemIj=\(Sj,Qj,Bj\)I\_\{j\}=\(S\_\{j\},Q\_\{j\},B\_\{j\}\), the model receives only the natural\-language input\(Sj,Qj\)\(S\_\{j\},Q\_\{j\}\)\. All metadataBjB\_\{j\}, including the answer, choices, evidence, and clinical relations, are withheld from the model and used exclusively for scoring and post\-hoc analysis\. The model is required to produce an answer in the specified format for the corresponding question type\.

Because the benchmark contains parallel variants and multiple question types, flexible evaluation settings are supported. For efficiency and broader coverage of clinical concepts, evaluation can be conducted on a fixed subset of variants, such as a single version of the 4-choice questions (e.g., Version 1), which provides a controlled and non-redundant evaluation set. For robustness, evaluation can be conducted across multiple variants and aggregated, for example, by combining all four variants of the 4-choice MCQs, all five variants of the 5-choice MCQs, and all six variants of the 6-choice MCQs, thereby reducing sensitivity to wording and answer-position effects. For MCQs, *accuracy (ACC)* is used as the primary metric. For OEQs, metrics include *Coverage* (whether the answer covers the target clinical relation), *ROUGE* (comparison with the reference rationale), and *BERTScore*.
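A minimal sketch of the MCQ side of this protocol is given below, assuming each item is stored as a dictionary with keys `id`, `S`, `Q`, and `B` (these key names are hypothetical). The model sees only the prompt built from the scenario and the question, while the gold answer in the metadata is consulted only at scoring time. Aggregation over variants is taken here as the mean of per-variant accuracies.

```python
def build_prompt(item):
    """Only the scenario S_j and question Q_j reach the model; B_j is withheld."""
    return f"{item['S']}\n\n{item['Q']}"


def accuracy(items, predictions):
    """Fraction of items whose predicted option matches the withheld gold answer."""
    correct = sum(predictions[item["id"]] == item["B"]["answer"] for item in items)
    return correct / len(items)


def aggregated_accuracy(variants, predictions):
    """Mean of per-variant accuracies; `variants` maps a variant name to its items."""
    per_variant = [accuracy(items, predictions) for items in variants.values()]
    return sum(per_variant) / len(per_variant)
```

Single-variant evaluation corresponds to calling `accuracy` on one variant's items; the robust setting corresponds to `aggregated_accuracy` over all variants of a choice size.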

## Appendix D Topic Analysis

We also extract question “topics” and analyze their distribution. Topics are generated during the template generation step (see Section [3.3](https://arxiv.org/html/2605.30637#S3.SS3) and Appendix [C.3](https://arxiv.org/html/2605.30637#A3.SS3)). Each topic represents the primary clinical condition that a question targets, i.e., a “diagnosis” in structured EHR data. After extracting topics from EHRBench, each topic is linked to an ICD-10-CM code, which is truncated to its 3-character category. The truncated codes are then used to aggregate the number of questions (Count) and the data source share (Percent), together with ICD-10-CM category descriptions. In total, 3,808 unique ICD codes are mapped, and 752 unique codes remain after 3-character truncation, indicating broad coverage across ICD-10-CM categories and diverse clinical relations in EHRBench. Table [4](https://arxiv.org/html/2605.30637#A4.T4) summarizes the top-20 ICD-10-CM 3-character categories in the full data source.
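The truncation and aggregation step can be illustrated with a short sketch. It assumes the topic-to-code linking has already produced one ICD-10-CM code string per question; the function names are illustrative and the example codes are toy input, not benchmark data.

```python
from collections import Counter


def icd_category(code: str) -> str:
    """Truncate an ICD-10-CM code to its 3-character category, e.g. 'I50.9' -> 'I50'."""
    return code.replace(".", "").upper()[:3]


def topic_distribution(codes):
    """Return (category, Count, Percent) rows sorted by descending count."""
    counts = Counter(icd_category(code) for code in codes)
    total = sum(counts.values())
    return [(cat, n, 100.0 * n / total) for cat, n in counts.most_common()]


# Toy input: one code per question.
rows = topic_distribution(["I50.9", "I50.22", "E87.1", "I10"])
```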

Table 4. Top-20 most frequent ICD-10-CM categories in EHRBench. We report ICD-10 codes truncated to three characters, along with the total number of questions (Count), the corresponding data source share (Percent), and code descriptions.

From Table [4](https://arxiv.org/html/2605.30637#A4.T4), the most frequent ICD-10-CM 3-character categories correspond to clinically common conditions and acute decompensation syndromes that routinely drive EHR-based decision making, which supports that the EHRBench construction pipeline yields clinically sensible topics and code mappings. In particular, the head of the distribution is dominated by high-prevalence cardiometabolic and acute-care presentations, such as heart failure (I50, 7.11%), fluid/electrolyte and acid–base disorders (E87, 6.20%), functional intestinal disorders (K59, 5.26%), and essential (primary) hypertension (I10, 4.61%). Failure-state phenotypes are also prominent, including acute kidney failure (N17, 3.21%), chronic kidney disease (N18, 2.21%), and respiratory failure (J96, 2.21%), consistent with the fact that many inpatient encounters center on physiologic instability and organ dysfunction. Meanwhile, the presence of comorbidity- and symptom-burden categories, such as volume depletion (E86, 1.95%), pain (G89, 1.16%), anemia (D64, 1.09%), and mental health conditions (F41/F32, 1.90%/1.89%), suggests that the extracted topics cover both primary diagnoses and common accompanying problems reflected in longitudinal records. At the same time, the distribution exhibits a long-tail pattern: the top-20 ICD-10-CM 3-character categories cover 51% of all questions, implying that nearly half of the benchmark is distributed across a wide range of lower-frequency categories. This long tail is desirable for evaluation because it reduces over-reliance on a small set of frequent conditions and encourages models to generalize beyond the most common clinical scenarios.

Table 5. Top-20 most frequent ICD-10-CM categories in EHRBench (from MIMIC-III). We report ICD-10 codes truncated to three characters, along with the total number of questions (Count), the corresponding data source share (Percent), and code descriptions.

Table 6. Top-20 most frequent ICD-10-CM categories in EHRBench (from MIMIC-IV). We report ICD-10 codes truncated to three characters, along with the total number of questions (Count), the corresponding data source share (Percent), and code descriptions.

Table 7. Top-20 most frequent ICD-10-CM categories in EHRBench (from PROMOTE). We report ICD-10 codes truncated to three characters, along with the total number of questions (Count), the corresponding data source share (Percent), and code descriptions.

We also analyze question topics from each data source (MIMIC-III, MIMIC-IV, and PROMOTE). Tables [5](https://arxiv.org/html/2605.30637#A4.T5)–[7](https://arxiv.org/html/2605.30637#A4.T7) report the top-20 ICD-10-CM categories (truncated to the ICD-10-CM 3-character level) within each data source, including aggregated counts, data source shares, and code descriptions. Across all three sources, the head of the distribution is broadly consistent with the full-dataset pattern: common cardiometabolic and decompensation-related categories repeatedly appear among the most frequent topics, such as heart failure (I50), essential (primary) hypertension (I10), electrolyte and acid–base disorders (E87), functional intestinal disorders (K59), acute and chronic kidney disease (N17/N18), atrial fibrillation and related arrhythmias (I48/I49), and angina pectoris (I20). This cross-source agreement suggests that the extracted topics capture clinically prevalent conditions that are routinely documented in structured EHRs, and further supports that the EHRBench topic extraction and ICD mapping procedures are clinically sensible rather than being driven by data source-specific artifacts. At the same time, each data source exhibits distinct secondary emphases consistent with its underlying cohort and documentation patterns, including higher prevalence of respiratory failure (J96) and sepsis (A41) in MIMIC-III, symptom-oriented categories such as nausea/vomiting (R11) and abdominal pain (R10) in MIMIC-IV, and additional cardiovascular and peri-procedural complications (e.g., I63, I95, I51, T87) in PROMOTE.

Table 8. Top-20 most frequent ICD-10-CM categories in EHRBench (from the diagnosis subset). We report ICD-10 codes truncated to three characters, along with the total number of questions (Count), the corresponding data source share (Percent), and code descriptions.

Table 9. Top-20 most frequent ICD-10-CM categories in EHRBench (from the treatment subset). We report ICD-10 codes truncated to three characters, along with the total number of questions (Count), the corresponding data source share (Percent), and code descriptions.

Table 10. Top-20 most frequent ICD-10-CM categories in EHRBench (from the prognosis subset). We report ICD-10 codes truncated to three characters, along with the total number of questions (Count), the corresponding data source share (Percent), and code descriptions.

We also analyze question topics separately for the three clinical decision tasks (diagnosis, treatment, and prognosis). Tables [8](https://arxiv.org/html/2605.30637#A4.T8)–[10](https://arxiv.org/html/2605.30637#A4.T10) summarize the top-20 ICD-10-CM categories (ICD-10-CM 3-character level) within each task subset. Overall, the three subsets share a consistent head dominated by common cardiometabolic and decompensation-related conditions, such as heart failure (I50), fluid/electrolyte and acid–base disorders (E87), essential (primary) hypertension (I10), atrial fibrillation and related arrhythmias (I48/I49), acute kidney failure (N17), chronic kidney disease (N18), and respiratory failure (J96). These patterns indicate that the extracted topics reflect clinically plausible task-specific distributions while remaining broadly consistent across tasks.

## Appendix E Extended Contents of the Main Experiment: Benchmarking LLMs on EHRBench Across Clinical Decision Tasks

### E.1. Implementation Details

When running experiments, we process questions in batches of ten using instruction prompts to improve efficiency and ensure stable outputs. Inputs are typically around 8,000 tokens, with a maximum context length of 10,240 tokens. To further reduce latency, we enable early stopping once the model produces a complete JSON-formatted answer. All experiments are implemented in Python 3.11.5 and executed on NVIDIA H200 GPUs with CUDA 12.4. For Azure OpenAI Service, we use API version “2025-03-01-preview”. We adopt deterministic decoding to ensure reproducibility; for example, when running the open-source general LLM “llama3-8b”, we set “do-sample=False, temperature=0, top-k=1”. All models are evaluated using the same prompt template.

To ensure a fair comparison across tasks, sources, and choice types, which may differ in the number of generated QA instances, we construct a fixed evaluation subset for every setting: for each source-task-type combination, we take the first 3,000 questions from the first version. With three data sources, three tasks, and three choice types, each LLM is therefore evaluated on 3 × 3 × 3 × 3,000 = 81,000 questions in total.
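The subset construction amounts to taking the first questions in each source-task-type cell of the first version. A minimal sketch follows; the item fields (`source`, `task`, `n_choices`, `version`) are hypothetical names used only for illustration, and "first" is taken to mean the order in which items are stored.

```python
from collections import defaultdict


def fixed_subset(items, per_cell=3000, version=1):
    """Keep the first `per_cell` items of the given version in every
    (source, task, choice-type) cell, preserving input order."""
    cells = defaultdict(list)
    for item in items:
        if item["version"] != version:
            continue
        key = (item["source"], item["task"], item["n_choices"])
        if len(cells[key]) < per_cell:
            cells[key].append(item)
    return dict(cells)
```

With 3 sources, 3 tasks, and 3 choice types, this yields 27 cells; at 3,000 items per cell, that gives the 81,000 questions per model stated above.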

### E.2. Cost Details

We analyze the computational and monetary costs of the main experiment in Section [4.1](https://arxiv.org/html/2605.30637#S4.SS1). Table [11](https://arxiv.org/html/2605.30637#A5.T11) reports total token usage, end-to-end runtime, API cost (when applicable), throughput defined as tokens per hour, and overall accuracy. Total token usage is tightly clustered around 10–11M tokens across most models, with only a few higher-token runs (e.g., mistral-7b at 12.87M and yi-1.5-34b at 13.04M), indicating that the main comparison is conducted under a largely matched token budget. In contrast, runtime varies widely from 1.01h (smollm3-3b) to 20.87h (llama3.3-70b), so throughput is largely time-determined in this setting. As a result, throughput spans more than an order of magnitude (roughly 20×), from 0.50 M Tokens/h (llama3.3-70b) to 10.21 M Tokens/h (smollm3-3b), suggesting that system-level efficiency differences dominate over token-count differences under the same evaluation protocol. API-based models often achieve higher throughput than similarly sized self-hosted models, but this efficiency comes with non-negligible monetary cost.
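As a quick arithmetic check on the figures quoted above, throughput multiplied by runtime recovers the token budget of each run. The sketch below uses only the throughput and runtime values stated in this section for the fastest and slowest models; the function name is illustrative.

```python
def implied_tokens_m(throughput_m_per_h: float, runtime_h: float) -> float:
    """Total tokens (in millions) implied by throughput = tokens / time."""
    return throughput_m_per_h * runtime_h


fast = implied_tokens_m(10.21, 1.01)   # smollm3-3b
slow = implied_tokens_m(0.50, 20.87)   # llama3.3-70b
ratio = 10.21 / 0.50                   # throughput gap between the two runs
```

Both runs land near 10.3–10.4M tokens, which agrees with the matched token budget, while throughput differs by a factor of about 20, so the gap is driven by runtime rather than by token count.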

The most accurate API-based models are also among the most expensive and the slowest in practice. For example, gpt-5.2 achieves the highest overall accuracy (70.91%) but incurs the largest API cost (\$71.03) and runs at 1.00 M Tokens/h with a 12.17h runtime. The next tier, gpt-4.1 (69.43%) and gpt-5 (69.06%), reduces the cost relative to gpt-5.2 but still requires \$40.67–\$42.37, with 4.02–6.45h runtime and 1.57–2.52 M Tokens/h throughput. These results indicate a practical trade-off: improvements at the top end of accuracy can require disproportionately larger time and monetary budgets.

To further illustrate this trade-off, Figure [3](https://arxiv.org/html/2605.30637#A5.F3) plots overall accuracy against throughput. The highest accuracies concentrate at relatively low throughput, and the upper envelope is formed by comparatively slow models in this benchmark setup. Across all models, higher throughput does not imply higher accuracy, and the high-throughput region is primarily populated by weaker models, indicating that efficiency and effectiveness can be decoupled in practice. A mid-throughput band around 4–6 M Tokens/h provides a pragmatic operating point when both quality and runtime matter. For example, mistral-small3-24b reaches 65.01% at 4.51 M Tokens/h, and doctor-r1-8b attains 61.07% at 4.25 M Tokens/h. On the API side, gpt-4.1-nano runs at 5.88 M Tokens/h with 60.48% accuracy, offering materially higher throughput than frontier models but with a noticeable accuracy gap to gpt-4.1 and gpt-5.2. Within model families, scaling typically increases accuracy while reducing throughput. In the Llama series, accuracy rises from 48.90% (llama3-8b) to 63.35% (llama3-70b) and 67.28% (llama3.3-70b), while throughput drops from 4.19 to 1.35 and further to 0.50 M Tokens/h. A similar pattern appears in Qwen: qwen3-4b and qwen3-8b achieve 60.63% and 60.87% at 4.36 and 3.22 M Tokens/h, whereas qwen3-32b improves to 66.78% but slows to 1.31 M Tokens/h. These within-family trends reinforce that accuracy gains from scaling are accompanied by systematically lower throughput under a fixed evaluation protocol.

Table 11. Cost and efficiency of benchmarking LLMs on EHRBench. We report total token usage (Tokens (M)), end-to-end runtime (Time (h)), API cost (Money (\$); not applicable to self-hosted models), throughput defined as total tokens divided by total time (Tokens (M)/h), and overall accuracy (Overall Acc (%) ↑).

![Refer to caption](https://arxiv.org/html/2605.30637v1/x3.png)

Figure 3. Overall accuracy versus throughput (M Tokens/h) for all evaluated models. Throughput is computed as total tokens divided by end-to-end runtime under the same evaluation protocol. For visualization, medical LLMs are grouped by the family of their corresponding base models: med42, ultramedical, and doctor-r1 are grouped under Llama, while m1 is grouped under Qwen.

### E.3. Error Analysis

We analyze errors from all evaluated models and categorize them into three types: (i) prediction errors (Prediction Wrong), which reflect knowledge or reasoning failures under correct formatting; (ii) missing structured outputs (No JSON), where the model does not return a valid JSON object; and (iii) malformed structured outputs (Output Malformed), where the output is not parseable under the required schema. Table [12](https://arxiv.org/html/2605.30637#A5.T12) summarizes the per-model error breakdown.
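One way to operationalize these three categories is sketched below. The boundary between No JSON and Output Malformed is an assumption for the example: outputs with no parseable JSON object count as No JSON, and parseable objects that violate the expected schema count as Output Malformed. The `answer` key is likewise a hypothetical schema.

```python
import json
import re


def classify(raw_output: str, gold: str, valid_options) -> str:
    """Assign a model output to Correct, Prediction Wrong, No JSON, or Output Malformed."""
    match = re.search(r"\{.*\}", raw_output, flags=re.DOTALL)
    if match is None:
        return "No JSON"
    try:
        obj = json.loads(match.group(0))
    except json.JSONDecodeError:
        return "No JSON"  # braces present, but not a valid JSON object
    # Schema check (assumed): a dict with a string "answer" naming a valid option.
    answer = obj.get("answer") if isinstance(obj, dict) else None
    if not isinstance(answer, str) or answer not in valid_options:
        return "Output Malformed"
    return "Correct" if answer == gold else "Prediction Wrong"
```

Separating format failures from wrong predictions in this way is what allows the table to distinguish decision quality from output compliance.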

Table 12. Error breakdown on EHRBench. We report overall accuracy (Accuracy), missing structured outputs (No JSON), malformed structured outputs (Output Malformed), and prediction errors under valid formatting (Prediction Wrong). All values are percentages (%).

Across HIPAA-compliant API-based LLMs, format-related failures are essentially eliminated: No JSON is 0.00 and Output Malformed is 0.00–0.01 for gpt-4.1-nano, gpt-4.1-mini, gpt-4.1, gpt-5-nano, gpt-5-mini, gpt-5, and gpt-5.2. Consequently, residual errors are dominated by Prediction Wrong, which ranges from 29.09% (gpt-5.2) to 42.18% (gpt-5-nano), mirroring the accuracy ordering (70.91% to 57.80%). This indicates that, for high-capability API models, the evaluation pipeline is not bottlenecked by response formatting, and measured performance primarily reflects decision quality rather than output compliance.

Most open\-source general LLMs show near\-perfect compliance \(No JSON at 0\.00 and Output Malformed close to 0\.00\), but several models exhibit non\-negligible formatting failures that can materially affect end\-to\-end reliability\. For example, mistral\-7b shows 18\.12% Output Malformed, and qwen2\.5\-3b shows 13\.16%, both coinciding with low accuracies of 38\.23% and 37\.87%, respectively\. A smaller but notable No JSON rate appears in qwen3\-8b \(1\.36%\), qwen3\-32b \(0\.44%\), and yi\-1\.5\-9b \(0\.73%\), indicating that even strong families can occasionally violate structured\-output constraints\. In contrast, top open\-source performers combine high accuracy with clean outputs, such as llama3\.3\-70b \(67\.28% accuracy, 0\.00 No JSON, 0\.00 Output Malformed\) and mistral\-small3\-24b \(65\.01%, 0\.00, 0\.00\)\.

Compared with general-purpose models, medical LLMs show markedly higher rates of structured-output failures, suggesting weaker instruction following under the same JSON-constrained protocol. med42-8b has 25.28% Output Malformed with 36.48% accuracy, and ultramedical-8b has 17.44% Output Malformed and one of the highest No JSON rates in the table at 1.04%, with 29.02% accuracy. Even the strongest medical model by accuracy, m1-32b-1k (63.21%), exhibits 0.44% Output Malformed, while doctor-r1-8b shows a non-trivial No JSON rate of 0.70% despite 61.07% accuracy. These results imply that, for medical LLMs, reliability issues arise from both decision errors and output-format instability.

### E.4. Breakdown Analysis of Medical LLMs versus Base Models

The main experiment in Section [4.1](https://arxiv.org/html/2605.30637#S4.SS1) shows that medical-domain adaptation does not consistently improve performance on EHRBench. To further analyze this phenomenon, we compare medical LLMs with their corresponding base models using question-level aligned evaluation averaged across 4C/5C/6C MCQ variants. Table [13](https://arxiv.org/html/2605.30637#A5.T13) summarizes the overall and task-specific performance differences.

Table 13. Comparison between medical LLMs and their corresponding base models on EHRBench. We report overall accuracy and task-specific accuracy differences for diagnosis (Dx), treatment (Tx), and prognosis (Px). Overall results are reported as medical model / base model ($\Delta$), and task-specific columns report the absolute accuracy difference $\Delta$ in percentage points.

Three consistent patterns emerge across these comparisons. First, current medical-domain fine-tuning does not reliably improve grounded EHR reasoning performance. This observation further highlights the difficulty of EHR-grounded clinical decision-making, since most existing medical LLM adaptation pipelines are not specifically optimized for reasoning over longitudinal structured EHR contexts. Second, the largest performance degradations are generally observed on Dx tasks, which often require disentangling confounded disease–disease relations and comorbidity patterns. Third, larger medical models narrow the gap relative to their base models, but do not consistently reverse the overall trend.

At the same time, we observe several narrow topic-level exceptions where medical adaptation provides localized gains. For example, m1-32b-1k improves Px performance on topics such as *Angioedema* (+31.60), *Bradycardia* (+18.78), and *Hyperkalemia* (+16.67). m1-7b-23k shows localized improvements on *Heart failure* Tx (+14.58) and *Obstructive Sleep Apnea* Px (+13.68). med42-8b also improves on several Dx topics, including *Cold intolerance* (+14.71) and *Osteoporosis, unspecified* (+14.58). These exceptions suggest that domain-specific tuning may help in narrow clinical niches, even though it does not consistently improve general EHR-grounded clinical decision-making performance.

Overall, these results suggest that strong performance on EHRBench requires capabilities beyond biomedical terminology familiarity or medical text exposure alone. Models must reason over real longitudinal EHR contexts and resolve clinically confounded relations, including disease progression patterns and disease–treatment associations. Improving these capabilities may require training signals beyond conventional domain adaptation, such as large-scale clinical case supervision and decision-focused objectives. EHR-grounded resources such as EHRBench may therefore provide a useful foundation for future clinical reasoning-oriented model development. Similar observations have also been reported in prior work (Dorfner et al., [2025](https://arxiv.org/html/2605.30637#bib.bib124); Xu et al., [2025](https://arxiv.org/html/2605.30637#bib.bib41); Dorfner et al., [2024](https://arxiv.org/html/2605.30637#bib.bib123)).

### E.5. Comparison with Embedding-based non-LLM Baselines

To provide additional reference points beyond LLM-based evaluation, we compare LLMs with several embedding-based retrieval baselines on EHRBench under the same zero-shot QA setting. For each question, the baseline encodes the question together with each candidate option and selects the option with the highest cosine similarity. We evaluate these methods on the same 27,000 6C questions used in the main experiment in Section [4.1](https://arxiv.org/html/2605.30637#S4.SS1).
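The selection rule of such a baseline can be sketched as below. Two assumptions are made for the example: similarity is computed between the question embedding and each option embedding, and a bag-of-words counter (`toy_encode`) stands in for a real encoder such as a biomedical sentence-embedding model.

```python
import math
from collections import Counter


def toy_encode(text: str) -> Counter:
    """Stand-in encoder (assumption): sparse bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def pick_option(question: str, options, encode=toy_encode) -> int:
    """Return the index of the option most similar to the question."""
    q = encode(question)
    scores = [cosine(q, encode(option)) for option in options]
    return max(range(len(options)), key=scores.__getitem__)
```

A rule of this kind rewards lexical or semantic overlap between the stem and an option, which is the surface cue that the leakage-prevention constraints in the construction pipeline are designed to remove.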

Table 14. Comparison between embedding-based retrieval baselines and LLMs on EHRBench. We report overall accuracy and breakdowns by clinical decision task and data source. Dx, Tx, and Px denote diagnosis, treatment, and prognosis, respectively; MIII, MIV, and PRO denote MIMIC-III, MIMIC-IV, and PROMOTE, respectively. All values are percentages (%).

Embedding-based retrieval baselines consistently underperform reasoning-capable LLMs across all evaluation settings. Even the strongest biomedical encoder baseline, PubMedBERT, achieves only 32.8% overall accuracy, substantially below general-purpose LLMs such as llama3-8b (43.8%) and qwen3-8b (55.3%). The gap becomes even larger for the strongest API-based model, gpt-5.2 (66.8%).

The largest performance differences are observed on treatment questions\. PubMedBERT reaches only 26\.2% accuracy on Tx, whereas qwen3\-8b and gpt\-5\.2 achieve 70\.4% and 77\.3%, respectively\. This pattern suggests that many treatment questions in EHRBench cannot be solved through simple semantic similarity or terminology matching alone\. Instead, successful prediction often requires reasoning over clinically grounded relations between diagnoses, interventions, and longitudinal patient context\.

Overall, these results support the design objective of EHRBench as a benchmark for clinical reasoning rather than retrieval\-oriented matching\. Strong performance requires models to integrate biomedical knowledge with contextual inference over structured EHR\-derived scenarios, which remains challenging for embedding\-only retrieval approaches\.

### E.6. Robustness to QA Generation LLM Choice

In the main construction pipeline of EHRBench, we use HuatuoGPT\-o1\-8B as the primary source LLM for generating QA templates and question instances\. Although this choice provides a consistent generation protocol, it may raise a potential LLM bias concern\. To examine this issue and strengthen the robustness and validity of EHRBench, we conduct an additional source\-model bias analysis\.

Specifically, we regenerate held-out 4C subsets from the same 400 patients using three different medical LLMs as source generators: HuatuoGPT-o1-7B, HuatuoGPT-o1-8B, and m1-7b-23k. Each regenerated subset covers three data sources (MIII/MIV/PRO) and three clinical decision tasks (Dx/Tx/Px). We then evaluate six strong open-source LLMs on each regenerated subset. This design allows us to test whether the relative ordering of evaluated models remains stable when the QA generation model changes while the underlying patient set, task coverage, and evaluation protocol are held fixed. The results are summarized in Table [15](https://arxiv.org/html/2605.30637#A5.T15).

Table 15. Performance comparison by changing QA generation LLM choice on EHRBench. Since HuatuoGPT-o1-8B is used as the primary LLM for EHRBench generation in the main construction pipeline, we regenerate held-out 4C subsets from the same 400 patients using three QA generation LLMs and evaluate six open-source LLMs on each subset. All values are accuracy percentages (%).

The absolute accuracies vary across source generators, suggesting that different source LLMs can produce QA subsets with different difficulty levels. However, the relative model ordering remains highly stable. Kendall’s W across the three source-generator settings is 0.937 ($p=0.015$), and the pairwise Spearman correlations are 0.829, 0.943, and 0.943. These results indicate that, although the source generator affects the absolute difficulty of the regenerated subset, the main comparative conclusions are robust to the choice of source LLM. Therefore, the observed model rankings in EHRBench are unlikely to be an artifact of relying on HuatuoGPT-o1-8B as a single benchmark-construction model.
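For readers who want to reproduce the agreement statistics, the sketch below computes Kendall's W and Spearman's rank correlation for untied rankings. The three rankings are toy inputs, not the ranks from Table 15; they were chosen because two adjacent swaps among six models reproduce statistics of the same magnitude as those reported.

```python
def kendalls_w(rankings):
    """Kendall's coefficient of concordance for m untied rankings of n items."""
    m, n = len(rankings), len(rankings[0])
    totals = [sum(r[i] for r in rankings) for i in range(n)]
    mean_total = m * (n + 1) / 2
    s = sum((t - mean_total) ** 2 for t in totals)
    return 12 * s / (m ** 2 * (n ** 3 - n))


def spearman(a, b):
    """Spearman rank correlation for two untied rankings of the same items."""
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


# Toy rankings of six models under three generators (illustrative only).
gen_a = [1, 2, 3, 4, 5, 6]
gen_b = [2, 1, 3, 4, 5, 6]  # top two models swapped
gen_c = [1, 3, 2, 4, 5, 6]  # second and third models swapped
w = kendalls_w([gen_a, gen_b, gen_c])
```

For untied ranks, W equals $((m-1)\bar{\rho}+1)/m$, where $\bar{\rho}$ is the mean pairwise Spearman correlation; with $m=3$ and the three reported correlations, this gives about 0.937, so the reported W and Spearman values are mutually consistent.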

### E.7. Robustness to Context Event Size

We further evaluate whether the main conclusions are sensitive to the number of context events used to construct the question scenario\. In the main construction pipeline of EHRBench, each QA scenario is generated from a compact EHR context consisting of two context events together with the relation\-subject entity\. This design follows the benchmark objective of evaluating clinical decision\-making under partial observation, rather than long\-context retrieval over a large number of EHR events\.

To assess whether this design choice affects the relative model comparison, we construct an aligned 4C subset with 300 templates across three clinical decision tasks (Dx/Tx/Px) and three data sources (MIII/MIV/PRO). We then vary the number of context events from 2 to 4 to 6 while keeping the templates, answer choices, evaluated models, and inference protocol fixed. The results are shown in Table [16](https://arxiv.org/html/2605.30637#A5.T16).

Table 16. Performance comparison by changing context event size on EHRBench. In the main construction pipeline, each QA scenario uses two context events together with the relation-subject entity. We evaluate an aligned 4C subset while varying the number of context events from 2 to 4 to 6. All values are accuracy percentages (%).

The relative model ordering remains the same across all three context-event settings, suggesting that the main conclusions are not driven by the specific use of two context events in the main benchmark. Larger 32B/70B models are comparatively robust when additional local context is provided, whereas smaller 7B/8B models show slightly greater sensitivity. For example, qwen3-32b remains nearly unchanged from 46.9% with 2 events to 46.7% with 6 events, while qwen3-8b decreases from 40.5% to 37.9%. Overall, these results indicate that increasing the amount of local EHR context does not materially change the model ranking pattern, further supporting the robustness of the main evaluation protocol.

## Appendix F Additional Experiment: Testing Reasoning LLMs on EHRBench

Recent reasoning\-oriented LLMs explicitly produce intermediate reasoning traces during inference, which substantially increases token usage and may interact with context\-length constraints\. To prevent these token\-intensive behaviors from confounding the main benchmarking results and ensure a fair comparison, we report a separate analysis that characterizes accuracy\-efficiency trade\-offs under different reasoning\-effort configurations\. All experiments in this section use a maximum context length of 10,240 tokens\.

We evaluate ten configurations in total. For gpt-5-nano and gpt-5-mini (Singh et al., [2025](https://arxiv.org/html/2605.30637#bib.bib94)), we test four reasoning-effort settings (minimal/low/medium/high), where minimal matches the setting used in the main experiments. We additionally include gpt-oss-20b and gpt-oss-120b (Agarwal et al., [2025](https://arxiv.org/html/2605.30637#bib.bib87)), which also enforce explicit reasoning output and therefore incur substantially higher token costs than the standard models evaluated in the main benchmark. To control evaluation cost while maintaining broad coverage, all reasoning models are served through the HIPAA-compliant Azure API, and evaluation is performed on a fixed subset of 1,000 QA items for each source–task–MCQ-type combination. This protocol yields $1{,}000\times 3\times 3\times 3=27{,}000$ questions per configuration. Figure [4](https://arxiv.org/html/2605.30637#A6.F4) summarizes overall accuracy and total token usage across these configurations. Other settings are kept the same as in the main experiment.

![Refer to caption](https://arxiv.org/html/2605.30637v1/x4.png)

Figure 4. Overall performance and token cost of reasoning model configurations on EHRBench. Each point corresponds to one configuration and is annotated with total token usage (in millions) and overall accuracy.

Figure [4](https://arxiv.org/html/2605.30637#A6.F4) indicates that higher model capacity and greater reasoning effort generally correspond to higher accuracy and higher token cost, which is consistent with the expected scaling trends and supports the validity of the EHRBench construction pipeline. At matched effort levels, gpt-5-mini outperforms gpt-5-nano, and gpt-oss-120b slightly exceeds gpt-oss-20b. Within gpt-5-mini, increasing effort from minimal to low and then to medium yields monotonic gains. A notable exception occurs for gpt-5-nano: the high-effort setting underperforms the medium setting despite consuming substantially more tokens (19.7M vs. 11.1M). Output inspection indicates that gpt-5-nano-high more frequently fails to return a final decision due to exceeding context or token limits (approximately 10% of cases), which plausibly explains the accuracy degradation at the highest effort level.

The results also exhibit a consistent performance-cost trade-off, as higher reasoning effort incurs diminishing returns. In the pairs below, the first value is total token usage in millions and the second is overall accuracy in percent. For gpt-5-mini, moving from minimal (3.8, 66.0) to low (5.0, 67.4) and then to medium (6.6, 69.1) yields steady improvements, whereas the step from medium to high increases accuracy by only 0.8 points (69.1 → 69.9) while increasing token usage by 4.9M (6.6 → 11.5). This pattern suggests that medium-level reasoning can provide a more efficient operating point when token budgets are constrained.
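The diminishing returns can be made explicit by computing the accuracy gained per additional million tokens at each step. The sketch uses only the four gpt-5-mini operating points quoted above; the function name is illustrative.

```python
# (effort, total tokens in millions, overall accuracy in %) for gpt-5-mini
points = [
    ("minimal", 3.8, 66.0),
    ("low", 5.0, 67.4),
    ("medium", 6.6, 69.1),
    ("high", 11.5, 69.9),
]


def marginal_gains(points):
    """Accuracy points gained per extra million tokens for each effort step."""
    gains = []
    for (_, t0, a0), (name, t1, a1) in zip(points, points[1:]):
        gains.append((name, round((a1 - a0) / (t1 - t0), 2)))
    return gains


gains = marginal_gains(points)
```

The first two steps return roughly one accuracy point per million tokens (about 1.17 and 1.06), while the step to high effort returns about 0.16, which is why medium effort looks like the more efficient operating point.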

## Appendix G Additional Experiment: Testing LLMs with Multiple Versions of Questions from EHRBench

Table 17. Performance on multiple deterministic question versions in EHRBench. Acc (%) denotes mean accuracy; V-Std (pp) denotes the standard deviation of accuracy across versions of the same question; V-Cons. (%) denotes the fraction of questions whose predicted option remains identical across versions of the same question. Overall aggregates results across 4/5/6-choice MCQs.

To further evaluate robustness to multiple deterministic question versions in EHRBench, we conduct an extended evaluation. Each version paraphrases the clinical context while preserving the same clinical meaning; meanwhile, answer options are systematically permuted so that each option, including the correct answer, appears in each position exactly once across versions. We select ten representative open-source LLMs spanning model scales to support a fair and comprehensive comparison: small models (llama3.2-3b, qwen2.5-3b, qwen3-4b), mid-sized models (glm4-9b, llama3-8b, ministral-8b, qwen2.5-7b, qwen3-8b), and large models (qwen2.5-32b, qwen3-32b). We evaluate the first 1,000 questions for 4-choice, 5-choice, and 6-choice MCQs, with 4/5/6 versions, respectively. In total, each model is evaluated on $1{,}000\times 15\times 3\times 3=135{,}000$ questions, where 15 = 4 + 5 + 6 is the total number of versions and the two factors of 3 correspond to the three data sources and the three tasks. Other settings are kept the same as in the main experiment.
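One simple scheme that satisfies the stated position constraint is a cyclic rotation of the option list, which forms a Latin square over positions. The text does not say which permutation scheme is actually used, so the sketch below is only one way to meet the constraint.

```python
def rotated_versions(options):
    """Build c orderings of c options by cyclic rotation, so that every
    option occupies every position exactly once across versions."""
    c = len(options)
    return [[options[(pos + v) % c] for pos in range(c)] for v in range(c)]


versions = rotated_versions(["w", "x", "y", "z"])
# Positions held by option "w" across the four versions.
positions_of_w = sorted(version.index("w") for version in versions)
```

Because a $c$-choice question needs exactly $c$ rotations, this matches the 4/5/6 versions used for 4/5/6-choice MCQs.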

We evaluate models using three metrics: accuracy (Acc), variability across versions (V-Std), and prediction consistency across versions (V-Cons.). Let $c\in\{4,5,6\}$ denote the choice size, and let $V_c=c$ denote the number of deterministic versions for $c$-choice questions (paraphrase + answer permutation). For each base question $q$, the model produces one predicted option $\hat{y}_q^{(v)}\in\{A,\dots\}$ under version $v\in\{1,\dots,V_c\}$.

For each version $v$, define the version-level accuracy as

$$\mathrm{Acc}^{(v)} \;=\; \frac{1}{|\mathcal{Q}|}\sum_{q\in\mathcal{Q}}\mathbb{I}\!\left[\hat{y}_q^{(v)}=y_q\right], \tag{14}$$

where $\mathcal{Q}$ is the evaluated set of base questions and $y_q$ is the gold answer. We report $\mathrm{Acc}$ as the mean of $\mathrm{Acc}^{(v)}$ over versions.

To quantify robustness to version perturbations, we measure the standard deviation of version\-level accuracies:

$$\mathrm{V\text{-}Std} \;=\; \sqrt{\frac{1}{V_c}\sum_{v=1}^{V_c}\left(\mathrm{Acc}^{(v)}-\overline{\mathrm{Acc}}\right)^{2}},\quad \overline{\mathrm{Acc}} \;=\; \frac{1}{V_c}\sum_{v=1}^{V_c}\mathrm{Acc}^{(v)}. \tag{15}$$

A smaller $\mathrm{V\text{-}Std}$ indicates more stable performance across paraphrase/permutation versions under the same choice size.

We further measure whether a model makes the *same* prediction across versions for each base question. Define a per-question consistency indicator:

$$\mathrm{Cons}(q) \;=\; \mathbb{I}\!\left[\hat{y}_q^{(1)}=\hat{y}_q^{(2)}=\cdots=\hat{y}_q^{(V_c)}\right]. \tag{16}$$

The overall consistency is then

$$\mathrm{V\text{-}Cons.} \;=\; \frac{1}{|\mathcal{Q}|}\sum_{q\in\mathcal{Q}}\mathrm{Cons}(q). \tag{17}$$

Higher $\mathrm{V\text{-}Cons.}$ implies that predictions are less sensitive to version perturbations, complementing $\mathrm{V\text{-}Std}$, which measures accuracy fluctuation at the aggregate level.
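
The three metrics in Eqs. (14) to (17) can be computed as in the sketch below. It assumes that predictions have already been mapped back from the permuted letter positions to a canonical option identifier, so that a single gold label per base question applies to every version; this mapping step is an assumption, as is the function name `version_metrics`. The standard deviation divides by $V_c$ (population form), as in Eq. (15).

```python
import math

def version_metrics(preds, gold):
    """Compute (Acc, V-Std, V-Cons.) for one choice size.

    preds: dict mapping base question id -> list of V_c predictions,
           one per version, in canonical option identifiers (assumed).
    gold:  dict mapping base question id -> gold option identifier.
    All three values are returned as fractions in [0, 1].
    """
    questions = list(gold)
    n_versions = len(preds[questions[0]])
    # Eq. (14): accuracy of each version over the base questions.
    acc_v = [
        sum(preds[q][v] == gold[q] for q in questions) / len(questions)
        for v in range(n_versions)
    ]
    mean_acc = sum(acc_v) / n_versions
    # Eq. (15): population standard deviation of version-level accuracies.
    v_std = math.sqrt(sum((a - mean_acc) ** 2 for a in acc_v) / n_versions)
    # Eqs. (16)-(17): fraction of questions with identical predictions.
    v_cons = sum(len(set(preds[q])) == 1 for q in questions) / len(questions)
    return mean_acc, v_std, v_cons
```

Note that V-Cons. counts a question as consistent even when the repeated prediction is wrong, which is why it complements rather than duplicates accuracy.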

The evaluation results are reported in Table [17](https://arxiv.org/html/2605.30637#A7.T17). Across all settings, qwen3-32b achieves the highest overall accuracy at 65.98%, followed by qwen2.5-32b at 64.32%. The relative ordering among models is consistent with that in the main experiment, which further supports the correctness of the EHRBench construction pipeline.

Moreover, both V-Std and V-Cons. indicate strong stability across versions, suggesting that evaluation on a single version in the main experiment is reasonable. For example, qwen2.5-32b attains the highest overall consistency (V-Cons.) of 88.36% with the lowest overall variability (V-Std) of 1.64 pp, while qwen3-32b remains close in consistency (87.15%) with a V-Std of 1.73 pp. These results indicate that the evaluated LLMs are robust to deterministic paraphrasing and answer-option permutations, reducing the likelihood that the reported performance is driven by an arbitrarily chosen question version rather than the underlying model capability for CDM.

Increasing the number of answer options consistently reduces accuracy and consistency across all models, and the degradation is substantially larger for weaker models. For the strongest models, the accuracy drop from 4C to 6C is approximately ten percentage points; for example, qwen3-32b decreases from 70.77% (4C) to 61.20% (6C) (a 9.57-point drop), and qwen2.5-32b decreases from 69.29% to 59.80% (a 9.49-point drop). In contrast, qwen2.5-3b exhibits a much sharper decline from 46.16% to 28.41% (a 17.75-point drop). A similar pattern appears in V-Cons.: qwen2.5-32b drops modestly from 90.31% to 86.55%, whereas qwen2.5-3b drops substantially from 67.66% to 53.59%. These results suggest that increasing choice cardinality amplifies both difficulty and sensitivity to model capacity, particularly for reliability.

The results reveal a strong qualitative coupling between accuracy and stability: higher\-accuracy models generally exhibit lower V\-Std and higher V\-Cons\. For example, qwen2\.5\-32b and qwen3\-32b jointly occupy the top tier in accuracy while maintaining high consistency \(above 87\) and low variability \(at or below 1\.73\)\. Nevertheless, the comparison between these two models indicates a nuanced tradeoff: qwen3\-32b yields the best overall accuracy \(65\.98%\), whereas qwen2\.5\-32b achieves slightly higher overall consistency \(88\.36 vs\. 87\.15\) and the lowest overall V\-Std \(1\.64\)\. This separation between peak accuracy and peak reliability suggests that both metrics are necessary for characterizing clinically relevant robustness, especially under harder settings such as 6C, where both accuracy and consistency decline across all models\.

## Appendix H Additional Experiment: Testing LLMs with Extended Questions from EHRBench

Table 18. Additional evaluation on extended questions from EHRBench across tasks, sources, and question types. We use abbreviations Dx/Tx/Px for diagnosis/treatment/prognosis decision task, MIII/MIV/PRO for MIMIC-III/MIMIC-IV/PROMOTE, and 4C/5C/6C for 4/5/6-choice MCQs. We additionally report evaluation cost in tokens (M) and time (h).

In the main experiment, to ensure efficiency and a fair comparison across tasks, data sources, and question types, we evaluate a fixed subset consisting of the first 3,000 questions for each task–source–type combination. To further assess whether this subset-based protocol faithfully reflects model behavior at scale, we conduct an additional evaluation over the extended question set, which covers all extracted clinical relations in EHRBench. We select ten representative open-source LLMs spanning a wide range of model scales: small models (llama3.2-3b, qwen2.5-3b, qwen3-4b), mid-sized models (glm4-9b, llama3-8b, ministral-8b, qwen2.5-7b, qwen3-8b), and large models (qwen2.5-32b, qwen3-32b). For each model, we evaluate all verified templates of each multiple-choice format (4-choice, 5-choice, and 6-choice), resulting in a total of 180,517 evaluated questions per model. All other settings are identical to those in the main experiment. We report the results in Table [18](https://arxiv.org/html/2605.30637#A8.T18).

Overall, the extended evaluation yields highly consistent conclusions with the main experiment\. In particular, both the overall accuracy and the relative ranking of models closely match those observed under the subset protocol \(only 0\.15% overall accuracy difference across all models\), indicating that the subset\-based results are not driven by sampling artifacts or a particular slice of questions\. Across the ten representative LLMs spanning small to large scales, the same top\-performing models remain at the top and the same weaker models remain at the bottom, with only minor fluctuations in absolute accuracy\. This stability suggests that the first 3,000 questions per task–source–type provide sufficient coverage of the underlying relation and question distributions, and that model comparisons are robust to expanding the evaluation set\. Consequently, the fixed\-subset design offers a practical yet reliable proxy for extended\-scale benchmarking, enabling efficient experimentation while preserving the key comparative conclusions about model capability\.
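
One way to check the kind of agreement described above is to compare per-model accuracies and rankings between the two protocols. The sketch below is illustrative only: the paper does not spell out how the 0.15% difference is computed, so the mean absolute difference used here is an assumption, and the function names are hypothetical.

```python
def ranking(acc):
    """Model names ordered from highest to lowest accuracy."""
    return sorted(acc, key=acc.get, reverse=True)

def protocol_agreement(subset_acc, extended_acc):
    """Compare two evaluation protocols over the same set of models.

    subset_acc / extended_acc: dict mapping model name -> accuracy (%).
    Returns (mean absolute accuracy difference in pp, rankings identical?).
    Assumption: mean absolute difference is only one plausible reading
    of the "overall accuracy difference" reported in the text.
    """
    models = sorted(subset_acc)
    mean_abs_diff = sum(
        abs(subset_acc[m] - extended_acc[m]) for m in models
    ) / len(models)
    same_order = ranking(subset_acc) == ranking(extended_acc)
    return mean_abs_diff, same_order
```

A small mean difference together with an identical ranking would correspond to the conclusion drawn here, namely that the fixed subset is a reasonable proxy for the extended set.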

Beyond overall performance, the same structural patterns persist across granular slices: treatment remains the easiest task for every evaluated model, whereas prognosis is consistently the most challenging, indicating a stable task-level difficulty imbalance rather than model-specific noise. Source-level differences are also small, suggesting limited sensitivity to data source choice under our evaluation setting. Finally, accuracy monotonically decreases as the number of answer options increases (MCQ-4 > MCQ-5 > MCQ-6), consistent with the main experiment. Taken together, these results further validate that the main experimental design captures the key performance trends and comparative conclusions of extended-scale evaluation on EHRBench.

## Appendix I Additional Experiment: Testing LLMs with Open-ended Questions from EHRBench

Table 19. Performance on open-ended questions (OEQs) of EHRBench. We report RC, ROUGE-1, ROUGE-L, and BERTScore (all in %), together with token usage in millions and runtime in hours.

To further evaluate LLMs on paraphrased open-ended questions (OEQs) in EHRBench, we conduct an extended evaluation using ten representative open-source LLMs spanning different model scales to support a fair and comprehensive comparison, including small models (llama3.2-3b, qwen2.5-3b, qwen3-4b), mid-sized models (glm4-9b, llama3-8b, ministral-8b, qwen2.5-7b, qwen3-8b), and large models (qwen2.5-32b, qwen3-32b). Considering efficiency, we evaluate the first 1,000 OEQs for each source-task setting; in total, each model is evaluated on $1{,}000 \times 3 \times 3 = 9{,}000$ questions. Other settings are kept the same as in the main experiment.

We report four automatic metrics for OEQ evaluation: RC, ROUGE-1, ROUGE-L, and BERTScore. Specifically, for each OEQ item $I_j = (S_j, Q_j, B_j)$ derived from a template $P_k$, the model produces a free-text answer $\hat{a}_j$, which is compared against the reference rationale $a_j$ stored in the template. RC measures whether $\hat{a}_j$ covers the target clinical relation $R_k = (x_k, r_k, y_k)$ by checking whether the answer recovers the intended entity $x_k$ (or a clinically equivalent surface form) under concept normalization. ROUGE-1 and ROUGE-L quantify lexical overlap between $\hat{a}_j$ and $a_j$, where ROUGE-1 emphasizes unigram-level overlap and ROUGE-L captures sequence-level similarity via the longest common subsequence. BERTScore measures semantic similarity between $\hat{a}_j$ and $a_j$ using contextual token embeddings (from the BERT model `bert-base-uncased`) and soft matching, which makes it less sensitive to paraphrasing than ROUGE. To quantify efficiency, we aggregate the total number of (prompt + completion) tokens consumed across all evaluated OEQs and the total wall-clock runtime to obtain Tokens (M) and Time (h), respectively. The results are presented in Table [19](https://arxiv.org/html/2605.30637#A9.T19).
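
The lexical metrics above can be illustrated with a small pure-Python sketch. Several choices here are assumptions rather than the paper's implementation: whitespace tokenization, reporting the F1 variant of ROUGE, and approximating concept normalization for RC with a hand-supplied list of surface forms (the paper refers to concept normalization without giving details). BERTScore is omitted because it requires a pretrained model.

```python
from collections import Counter

def tokens(text):
    # Assumption: lowercase whitespace tokenization.
    return text.lower().split()

def f1(overlap, n_candidate, n_reference):
    if overlap == 0:
        return 0.0
    precision, recall = overlap / n_candidate, overlap / n_reference
    return 2 * precision * recall / (precision + recall)

def rouge_1(candidate, reference):
    """Unigram-overlap F1 between candidate and reference."""
    c, r = tokens(candidate), tokens(reference)
    overlap = sum((Counter(c) & Counter(r)).values())
    return f1(overlap, len(c), len(r))

def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def rouge_l(candidate, reference):
    """LCS-based F1 between candidate and reference."""
    c, r = tokens(candidate), tokens(reference)
    return f1(lcs_length(c, r), len(c), len(r))

def relation_covered(answer, surface_forms):
    """Hypothetical stand-in for RC: does the answer mention the target
    entity under any of the given surface forms (whole-token match)?"""
    text = " " + " ".join(tokens(answer)) + " "
    return any(" " + " ".join(tokens(s)) + " " in text for s in surface_forms)
```

Averaging `relation_covered` over all items would give an RC-style percentage, while the ROUGE functions are averaged over answer and reference pairs.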

The results in Table [19](https://arxiv.org/html/2605.30637#A9.T19) show a clear scale-dependent trend that is consistent with the main experiments: larger models achieve substantially better OEQ quality across all metrics, while small models perform poorly. In particular, qwen2.5-32b achieves the strongest overall quality, reaching 68.24% RC, 34.05% ROUGE-1, 30.82% ROUGE-L, and 56.25% BERTScore, which outperforms all other evaluated models. Notably, qwen3-32b underperforms qwen2.5-32b across all reported metrics, indicating that model scaling alone does not guarantee superior OEQ performance across model families. Overall, these results align with the observations in the main experiments and further support that EHRBench can reliably differentiate the open-ended clinical reasoning capabilities of LLMs at different scales.

OEQs also reveal a strong quality\-efficiency trade\-off\. The fastest runtimes \(0\.73–0\.74h\) are achieved by llama3\.2\-3b and qwen2\.5\-3b, but both exhibit severe quality loss: llama3\.2\-3b drops to 10\.87% RC and 21\.20% BERTScore, while qwen2\.5\-3b is even lower in RC \(4\.39%\) despite a moderate BERTScore \(40\.25%\)\. These results indicate that speed alone does not guarantee usable OEQ performance and suggest a clear lower\-capacity regime where models generate quickly but fail to meet accuracy\-oriented criteria\. In contrast, mid\-sized models provide more reliable quality with modest cost \(e\.g\., ministral\-8b reaches 39\.86% RC with 1\.13h runtime, and qwen3\-8b reaches 61\.86% RC with 1\.77h runtime\)\. Finally, although qwen3\-32b approaches the best RC \(66\.22% versus 68\.24% for qwen2\.5\-32b\), it incurs substantially higher cost \(1\.97M tokens and 7\.20h\), which makes it less attractive under runtime constraints\. Overall, these findings suggest that OEQs are more demanding and benefit more from stronger models, and that model scale should be selected to balance quality and efficiency\.

## Appendix J LLMs Utilized in EHRBench

In this paper, we leverage more than 30 representative LLMs released between 2023 and 2025\. The set of evaluated LLMs is large and up\-to\-date, supporting meaningful conclusions about current LLM performance trends on EHRBench\. Their detailed descriptions and access links are provided below\.

- glm4-9b: GLM-4 instruction model with 9B parameters, released as a general-purpose bilingual (Chinese–English) LLM for conversational generation and instruction following. HuggingFace: [https://huggingface.co/zai-org/GLM-4-9B-0414](https://huggingface.co/zai-org/GLM-4-9B-0414)
- glm4-32b: Larger GLM-4 instruction model with 32B parameters, providing higher capacity than the 9B variant and typically used when stronger generation quality is desired under similar prompting. HuggingFace: [https://huggingface.co/zai-org/GLM-4-32B-0414](https://huggingface.co/zai-org/GLM-4-32B-0414)
- llama3-8b: Llama 3 instruction model with 8B parameters, a general-purpose open-weight LLM commonly used as an efficient baseline for instruction following and text generation. HuggingFace: [https://huggingface.co/meta-llama/Meta-Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B)
- llama3-70b: Llama 3 instruction model with 70B parameters, a larger-capacity variant designed to improve performance on knowledge-intensive generation and complex instruction-following workloads. HuggingFace: [https://huggingface.co/meta-llama/Meta-Llama-3-70B](https://huggingface.co/meta-llama/Meta-Llama-3-70B)
- llama3.1-8b: Llama 3.1 instruction model with 8B parameters, an updated release in the Llama family that is used as a drop-in general-purpose model under the same prompting interface. HuggingFace: [https://huggingface.co/meta-llama/Meta-Llama-3.1-8B](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B)
- llama3.2-3b: Llama 3.2 instruction model with 3B parameters, a lightweight variant intended for low-latency or resource-constrained inference while retaining basic instruction-following capabilities. HuggingFace: [https://huggingface.co/meta-llama/Llama-3.2-3B](https://huggingface.co/meta-llama/Llama-3.2-3B)
- llama3.3-70b: Llama 3.3 instruction model with 70B parameters, a later Llama release that maintains the same open-weight instruction interface, and is evaluated here as a high-capacity general-purpose model. HuggingFace: [https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct)
- mistral-7b: Mistral instruction model with 7B parameters, widely used as a compact general-purpose baseline that offers strong practical throughput under open-weight deployment. HuggingFace: [https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)
- ministral-8b: Ministral instruction model with 8B parameters from the Mistral family, evaluated as a mid-sized open-weight model emphasizing practical instruction-following performance. HuggingFace: [https://huggingface.co/mistralai/Ministral-3-8B-Instruct-2512](https://huggingface.co/mistralai/Ministral-3-8B-Instruct-2512)
- qwen2.5-3b: Qwen2.5 model with 3B parameters, a small multilingual checkpoint commonly used for lightweight inference and as a compact baseline within the Qwen family. HuggingFace: [https://huggingface.co/Qwen/Qwen2.5-3B](https://huggingface.co/Qwen/Qwen2.5-3B)
- qwen2.5-7b: Qwen2.5 model with 7B parameters, a mid-sized multilingual model that supports instruction-style prompting and serves as a standard open-weight baseline. HuggingFace: [https://huggingface.co/Qwen/Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B)
- qwen2.5-32b: Qwen2.5 model with 32B parameters, a higher-capacity multilingual checkpoint typically used for improved response quality and more complex generation tasks relative to smaller Qwen variants. HuggingFace: [https://huggingface.co/Qwen/Qwen2.5-32B](https://huggingface.co/Qwen/Qwen2.5-32B)
- qwen3-4b: Qwen3 model with 4B parameters, evaluated as a newer-generation multilingual model in the Qwen series under standard instruction prompting. HuggingFace: [https://huggingface.co/Qwen/Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B)
- qwen3-8b: Qwen3 model with 8B parameters, evaluated as a mid-sized Qwen3 checkpoint that balances capacity and efficiency for multilingual instruction-style generation. HuggingFace: [https://huggingface.co/Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B)
- qwen3-32b: Qwen3 model with 32B parameters, evaluated as a large Qwen3 checkpoint representing a higher-capacity multilingual baseline under the same prompting and decoding setup. HuggingFace: [https://huggingface.co/Qwen/Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B)
- smollm3-3b: SmolLM3 model with 3B parameters, a lightweight open-weight model used to study low-resource performance and efficiency under the same evaluation protocol. HuggingFace: [https://huggingface.co/HuggingFaceTB/SmolLM3-3B](https://huggingface.co/HuggingFaceTB/SmolLM3-3B)
- yi-1.5-9b: Yi-1.5 bilingual (Chinese–English) model with 9B parameters, used as an additional open-weight general-purpose baseline with strong Chinese/English coverage. HuggingFace: [https://huggingface.co/01-ai/Yi-1.5-9B](https://huggingface.co/01-ai/Yi-1.5-9B)
- yi-1.5-34b: Yi-1.5 model with 34B parameters, a larger-capacity Yi checkpoint included to compare scaling behavior under the same evaluation pipeline. HuggingFace: [https://huggingface.co/01-ai/Yi-1.5-34B](https://huggingface.co/01-ai/Yi-1.5-34B)
- doctor-r1-8b: Doctor-R1 medical model released as a domain-focused checkpoint intended for clinical reasoning and instruction-style medical generation, fine-tuned on the Qwen3-8B model. HuggingFace: [https://huggingface.co/unicornftk/Doctor-R1](https://huggingface.co/unicornftk/Doctor-R1)
- med42-8b: Med42 clinical model (8B) fine-tuned on the Llama3-8B model for medical and biomedical language understanding and instruction following. HuggingFace: [https://huggingface.co/m42-health/Llama3-Med42-8B](https://huggingface.co/m42-health/Llama3-Med42-8B)
- ultramedical-8b: UltraMedical instruction-tuned medical model (8B) based on the Llama3-8B model, designed for medical QA-style prompting and clinical instruction following. HuggingFace: [https://huggingface.co/TsinghuaC3I/Llama-3-8B-UltraMedical](https://huggingface.co/TsinghuaC3I/Llama-3-8B-UltraMedical)
- m1-7b-23k: m1 long-context medical model (7B; 23K variant) based on Qwen2.5-7B, included to study the impact of long-context capacity in clinical-style prompting. HuggingFace: [https://huggingface.co/UCSC-VLAA/m1-7B-23K](https://huggingface.co/UCSC-VLAA/m1-7B-23K)
- m1-32b-1k: m1 long-context medical model (32B; 1K variant) based on Qwen2.5-32B, included to represent a higher-capacity medical checkpoint with an extended context interface. HuggingFace: [https://huggingface.co/UCSC-VLAA/m1-32B-1K](https://huggingface.co/UCSC-VLAA/m1-32B-1K)
- huatuogpt-o1-8b: HuatuoGPT-o1 medical reasoning model (8B) fine-tuned on Llama3-8B, included as a medical-domain checkpoint with instruction-style interfaces and clinically oriented training. HuggingFace: [https://huggingface.co/FreedomIntelligence/HuatuoGPT-o1-8B](https://huggingface.co/FreedomIntelligence/HuatuoGPT-o1-8B)
- gpt-oss-20b: Open-weight GPT-OSS model with 20B parameters, included as an additional open-weight baseline with a GPT-style architecture and publicly released weights. HuggingFace: [https://huggingface.co/openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b)
- gpt-4.1-nano: Proprietary GPT-4.1 family model accessed via Azure OpenAI (deployment: gpt-4.1-nano), included as a low-latency API model under the same prompting and evaluation protocol.
- gpt-4.1-mini: Proprietary GPT-4.1 family model accessed via Azure OpenAI (deployment: gpt-4.1-mini), included as a cost-efficient API model for instruction-style generation in our evaluation setting.
- gpt-4.1: Proprietary GPT-4.1 family model accessed via Azure OpenAI (deployment: gpt-4.1), included as a higher-capacity API model for strong general-purpose instruction following and generation.
- gpt-5-nano: Proprietary GPT-5 family model accessed via Azure OpenAI (deployment: gpt-5-nano), included as a compact API model in the latest available GPT series within our subscription at evaluation time.
- gpt-5-mini: Proprietary GPT-5 family model accessed via Azure OpenAI (deployment: gpt-5-mini), included as a mid-sized API model representing the same GPT series under a higher-capacity configuration than gpt-5-nano.
- gpt-5: Proprietary GPT-5 family model accessed via Azure OpenAI (deployment: gpt-5-chat), included as a large-sized API model representing the same GPT series under a higher-capacity configuration than gpt-5-nano.
- gpt-5.2: Proprietary GPT-5 family model accessed via Azure OpenAI (deployment: gpt-5.2-chat), included as a large-sized API model representing the same GPT series under a higher-capacity configuration than gpt-5-nano.
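
For scripting the appendix experiments, the ten open-source models used in Appendices G to I can be kept in a small registry. This is a convenience sketch, not part of the released benchmark: the repository identifiers are copied from the links above, the scale grouping follows the text, and the names `APPENDIX_MODELS` and `hf_url` are hypothetical.

```python
# Short name -> (scale group, HuggingFace repository id from the list above).
APPENDIX_MODELS = {
    "llama3.2-3b": ("small", "meta-llama/Llama-3.2-3B"),
    "qwen2.5-3b": ("small", "Qwen/Qwen2.5-3B"),
    "qwen3-4b": ("small", "Qwen/Qwen3-4B"),
    "glm4-9b": ("mid", "zai-org/GLM-4-9B-0414"),
    "llama3-8b": ("mid", "meta-llama/Meta-Llama-3-8B"),
    "ministral-8b": ("mid", "mistralai/Ministral-3-8B-Instruct-2512"),
    "qwen2.5-7b": ("mid", "Qwen/Qwen2.5-7B"),
    "qwen3-8b": ("mid", "Qwen/Qwen3-8B"),
    "qwen2.5-32b": ("large", "Qwen/Qwen2.5-32B"),
    "qwen3-32b": ("large", "Qwen/Qwen3-32B"),
}

def hf_url(name):
    """Return the HuggingFace page URL for a registered short name."""
    _, repo_id = APPENDIX_MODELS[name]
    return f"https://huggingface.co/{repo_id}"

def models_by_scale(scale):
    """List short names in one scale group: 'small', 'mid', or 'large'."""
    return [n for n, (s, _) in APPENDIX_MODELS.items() if s == scale]
```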

## Appendix K Limitations

While EHRBench enables large\-scale, reliable EHR\-grounded evaluation of LLMs for clinical decision making, it has limitations\. First, its construction uses only structured diagnoses, prescriptions, and procedures, excluding informative modalities \(e\.g\., demographics, vital signs, laboratory tests, and imaging\)\. Second, to make KB verification feasible and limit leakage between scenario context and the queried relation, each template uses a small, fixed context window\. We focus on encounter\-level settings as they are more reliable and grounded, whereas long\-range cross\-visit relations are weaker and harder to validate\. Accordingly, our prognosis task is framed as next\-encounter risk prediction rather than calibrated time\-to\-event forecasting, reflecting uncertain real\-world visit timing\. Third, although EHRBench contains 960,067 questions, full\-set benchmarking is omitted because inference cost and runtime would be prohibitive for many models, making comparisons impractical\. We therefore evaluate a capped subset per data source and task for feasible, fair comparisons across diverse models\. Finally, KB support trades recall for precision by favoring relations covered by resources \(e\.g\., SemMedDB and UMLS\-linked concepts\), potentially under\-representing rare, emerging, institution\-specific, or context\-dependent practices\. Future work can extend EHRBench by adding modalities, relaxing fixed\-context assumptions with leakage\-aware controls, supporting reliable multi\-visit reasoning, and broadening verification coverage through KBs and evidence sources\.

## Appendix L Prompt Templates

We summarize the prompts used for relation extraction (Table [20](https://arxiv.org/html/2605.30637#A12.T20)), template completion (Table [21](https://arxiv.org/html/2605.30637#A12.T21)), QA generation for MCQ (Table [22](https://arxiv.org/html/2605.30637#A12.T22)) and OEQ (Table [23](https://arxiv.org/html/2605.30637#A12.T23)), and evaluation for MCQs and OEQs (Tables [24](https://arxiv.org/html/2605.30637#A12.T24) and [25](https://arxiv.org/html/2605.30637#A12.T25), respectively).

Table 20. Prompt template for task-grounded clinical relation extraction, used in Stage 1 of template generation.

(A) Common Instruction Skeleton

(B) Strict Output Schema (must match exactly), JSON: `{ "raw_relations": [ {"entity_1": "...", "relation": "...", "entity_2": "...", "rationale": "..."}, … ], "context_events": ["...", "..."] }`

(C) Task-Specific Block (choose exactly one)
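
Because the schema in Table 20 must match exactly, a pipeline would typically validate each model response before using it. The following is a minimal sketch of such a check using only the key names shown in the schema; the function name `parse_relation_output` and the rejection policy (raising `ValueError`) are assumptions, not the authors' implementation.

```python
import json

RELATION_KEYS = {"entity_1", "relation", "entity_2", "rationale"}
TOP_LEVEL_KEYS = {"raw_relations", "context_events"}

def parse_relation_output(raw):
    """Parse and validate one Stage 1 response against the Table 20 schema.

    Returns (relations, context_events); raises ValueError on any mismatch.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as err:
        raise ValueError(f"not valid JSON: {err}") from err
    if not isinstance(data, dict) or set(data) != TOP_LEVEL_KEYS:
        raise ValueError("top-level keys do not match the schema")
    relations, events = data["raw_relations"], data["context_events"]
    if not isinstance(relations, list) or not isinstance(events, list):
        raise ValueError("raw_relations and context_events must be lists")
    for rel in relations:
        if not isinstance(rel, dict) or set(rel) != RELATION_KEYS:
            raise ValueError("relation keys do not match the schema")
        if not all(isinstance(v, str) for v in rel.values()):
            raise ValueError("relation fields must be strings")
    if not all(isinstance(e, str) for e in events):
        raise ValueError("context events must be strings")
    return relations, events
```

Rejecting malformed responses at this stage keeps later steps, such as KB-based verification, from operating on partially parsed relations.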

Table 21. Prompt template for template completion, used in Stage 3 of template generation.

DISTRACTOR_NUMBER: 10 (TEN)

(A) Common Instruction Skeleton

(B) Strict Output Schema (must match exactly), JSON: `{ "base_questions": [ { "answer": "...", "topic": "...", "distractors": [{"entity_name": "...", "type": "reversed"}, …] } ] }`

(C) Task-Specific Block (choose exactly one)

Table 22. Prompt template for MCQ QA generation.

(A) Common Paraphrasing Instruction Skeleton

(B) Strict Output Schema (must match exactly), JSON: `{ "question_versions": [ {"version": 1, "context": "...", "question": "..."}, … ] }`

(C) Task-Specific Block (choose exactly one)

Table 23. Prompt template for OEQ generation.

(A) Common Reason-Only Instruction Skeleton

(B) Strict Output Schema (must match exactly), JSON: `{"reason": "..."}`

(C) Task-Specific Block (choose exactly one)

Table 24. Prompt template for evaluating MCQs.

(A) Common Evaluation Instruction Skeleton

(B) Required Output Schema (must match exactly), JSON: `{ "answers": [ "A", "C", "B", … ] }`

Table 25. Prompt template for evaluating OEQs.

(A) Common Evaluation Instruction Skeleton

(B) Required Output Schema (must match exactly), JSON: `{ "answers": ["...", "...", "..."], "reasons": ["...", "...", "..."] }`
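
The evaluation schemas in Tables 24 and 25 return lists, which suggests that several questions are answered per response. A hedged sketch of how such responses could be parsed is given below. The batch-size check, the use of letters A to F for up to six choices, and the policy of mapping an out-of-range answer to `None` (scored as incorrect) are all assumptions; the function names are hypothetical.

```python
import json

def parse_mcq_answers(raw, n_questions, n_choices):
    """Parse a Table 24 style response into a list of option letters.

    Answers outside the valid letter range become None (assumed policy).
    """
    answers = json.loads(raw)["answers"]
    if not isinstance(answers, list) or len(answers) != n_questions:
        raise ValueError("unexpected number of answers")
    valid = set("ABCDEF"[:n_choices])
    return [a if isinstance(a, str) and a in valid else None for a in answers]

def mcq_accuracy(predicted, gold):
    """Fraction of positions where the parsed answer equals the gold letter."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

def parse_oeq_answers(raw, n_questions):
    """Parse a Table 25 style response into (answer, reason) pairs."""
    data = json.loads(raw)
    answers, reasons = data["answers"], data["reasons"]
    if len(answers) != n_questions or len(reasons) != n_questions:
        raise ValueError("answers and reasons must match the batch size")
    return list(zip(answers, reasons))
```

Treating an unparseable or out-of-range answer as incorrect, rather than dropping it, keeps the denominator fixed so that accuracies remain comparable across models.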
