Are LLMs Ready to Assist Physicians? PhysAssistBench for Interactive Doctor-Patient-EHR Assistance
Summary
Introduces PhysAssistBench, a benchmark for evaluating LLMs in interactive doctor-patient-EHR assistance. Experiments show current models are unreliable in this setting, highlighting the need for coordinated capabilities.
# Are LLMs Ready to Assist Physicians? PhysAssistBench for Interactive Doctor-Patient-EHR Assistance
Source: [https://arxiv.org/html/2606.18613](https://arxiv.org/html/2606.18613)
Tianming Du¹, Peijie Yu², Sihan Shang³, Danli Shi⁴, My Linh Nguyen¹, Shengbo Gao¹, Guangyuan Li¹, Yinghong Yu¹, Yan Jiang¹, Qianlong Zhao¹, Behzad Bozorgtabar⁵, Shaoxiong Ji¹, Jiazhen Pan⁶, Daniel Rueckert⁶, Jiancheng Yang¹
¹Aalto University, ²Tencent, ³Harbin Institute of Technology, Shenzhen, ⁴Hong Kong Polytechnic University, ⁵Aarhus University, ⁶Technical University of Munich
{du.tianming, jiancheng.yang}@aalto.fi
###### Abstract
The most plausible near-term role of medical LLMs is to assist rather than replace physicians, yet current evaluations often test isolated capabilities: clinical knowledge, EHR system interaction, or patient communication. Physician assistance instead requires coordinating these capabilities within the same interaction, where physicians issue underspecified requests, patients describe symptoms ambiguously, and EHR systems demand precise tool use. We introduce PhysAssistBench, a benchmark for interactive doctor-patient-EHR assistance. Built from real MIMIC-IV cases, PhysAssistBench uses a scalable pipeline to construct agentic patients: interactive, record-grounded agents that turn static EHR records into multi-turn clinical scenarios while preserving clinical factuality. PhysAssistBench provides a curated bilingual evaluation set of 1,296 manually reviewed and physician-validated turns. Experiments with leading LLMs show that current models remain unreliable in this setting, which exposes a key bottleneck for clinical LLMs: reliable assistance requires coordination across knowledge, communication, and systems, not isolated gains in any of them.
Corresponding author: Jiancheng Yang.
## 1 Introduction
Figure 1: Left: PhysAssistBench evaluates LLMs as physician assistants, not physicians: the assistant follows physician requests while interacting with a record-grounded FHIR-based EHR system and a dialogue patient. A multi-agent pipeline transforms static MIMIC-IV records into this agentic patient environment, exposing them through standardized FHIR interfaces rather than direct record access. Right: A representative hypertension case in PhysAssistBench. The 4-turn session progresses from explicit lookup to implicit assistance involving patient dialogue, clinical reasoning, and EHR actions, with each turn paired with its interaction flow and rubric criteria.
LLMs have generated substantial optimism for clinical AI Thirunavukarasu et al. ([2023](https://arxiv.org/html/2606.18613#bib.bib21)); Moor et al. ([2023](https://arxiv.org/html/2606.18613#bib.bib22)); Rajpurkar et al. ([2022](https://arxiv.org/html/2606.18613#bib.bib23)). Much of this optimism comes from their strong performance on medical examinations and question-answering benchmarks Kung et al. ([2023](https://arxiv.org/html/2606.18613#bib.bib17)); Singhal et al. ([2023](https://arxiv.org/html/2606.18613#bib.bib20)), motivating visions of LLMs as a new “front door to healthcare” NHS England-South East ([2024](https://arxiv.org/html/2606.18613#bib.bib24)); Kyle ([2025](https://arxiv.org/html/2606.18613#bib.bib25)). However, recent studies suggest that such performance may not transfer to interactive clinical use. Laban et al. ([2025](https://arxiv.org/html/2606.18613#bib.bib16)) found that shifting from fully specified single-turn prompts to multi-turn, under-specified interactions caused an average 39% performance drop. Similarly, Bean et al. ([2026](https://arxiv.org/html/2606.18613#bib.bib7)) found that LLMs performed strongly when tested alone but failed to improve user performance in a randomized medical self-assessment study, identifying user interaction as a key barrier.
These findings reveal a gap in current evaluation practice: practical failures may arise not only from insufficient medical knowledge, but also from failures of interaction. As summarized in Table [1](https://arxiv.org/html/2606.18613#S1.T1), prior medical LLM benchmarks mostly evaluate three isolated roles: knowledge, where models act as medical experts; system, where models retrieve or manipulate clinical records; and communication, where models interact with patients or generate clinical text. These roles are useful, but they miss the most plausible near-term deployment setting: assisting physicians under human oversight, as emphasized by ethical and regulatory expectations World Health Organization ([2021](https://arxiv.org/html/2606.18613#bib.bib5)); U.S. Food and Drug Administration ([2025](https://arxiv.org/html/2606.18613#bib.bib6)). Technically, this setting is also where recent studies expose a key bottleneck: interaction Laban et al. ([2025](https://arxiv.org/html/2606.18613#bib.bib16)); Bean et al. ([2026](https://arxiv.org/html/2606.18613#bib.bib7)). Physician assistance is not static question answering, but interactive coordination across incomplete physician intent, ambiguous patient information, and precise EHR actions.
Figure [1](https://arxiv.org/html/2606.18613#S1.F1) illustrates the setting studied in this paper. Even for medical professionals, physician requests are often context-dependent, elliptical, and spread across turns. The assistant must map these implicit requests to two distinct interfaces: EHR systems requiring precise tool calls, and patients providing colloquial, incomplete, and clinically imprecise information. It must decide when to query the EHR, when to ask the patient, and how to integrate evidence into a physician-facing response. As discussed in Section [2](https://arxiv.org/html/2606.18613#S2), this combination is not captured by existing benchmarks.
We introduce PhysAssistBench, a benchmark for interactive doctor-patient-EHR assistance. Evaluating such interactions requires more than static clinical records: the benchmark must provide patients who can respond across turns, EHR systems that can be queried through structured tools, and physician requests that evolve with context. To make this scalable while preserving clinical grounding, PhysAssistBench repurposes real MIMIC-IV Johnson et al. ([2023](https://arxiv.org/html/2606.18613#bib.bib4)) cases through a multi-agent synthetic data pipeline. The pipeline plans clinically plausible scenarios from eligible records and constructs agentic patients: interactive, record-grounded agents that turn static EHR cases into multi-turn clinical scenarios. Unlike unconstrained patient simulation, these agents are grounded in existing EHR evidence; unsupported cases are filtered rather than counterfactually rewritten.
The resulting benchmark contains 324 multi-turn sessions, each reviewed by 8 trained annotators and validated by a physician. It spans 4 clinical scenarios (Diagnostic Workup, Med Safety, Treatment Response, and Discharge Planning), 4 tasks (Information Lookup, Data Gathering, Clinical Reasoning, and Write/Update), and 3 physician-query implicitness subtypes: Nominal Anaphora (na), Predicate Ellipsis (pe), and Abstract Event Anaphora (ae). Each session uses a standardized FHIR R4 tool set and is evaluated in English and Chinese. Turn-level rubrics are provided for stable and interpretable assessment. Figure [1](https://arxiv.org/html/2606.18613#S1.F1) shows a representative 4-turn session, illustrating how implicit physician requests, patient ambiguity, and EHR precision co-occur within the same workflow.
Our contributions are threefold. First, we formulate interactive physician assistance as a coordination problem across clinical knowledge, patient communication, and EHR systems, a setting overlooked by existing evaluations of isolated roles. Second, we develop a scalable multi-agent synthetic data pipeline that repurposes static MIMIC-IV records into agentic patients, enabling clinically grounded multi-turn doctor-patient-EHR scenarios. Third, we release PhysAssistBench, a manually reviewed and physician-validated benchmark of 324 sessions and 1,296 turns, and show that leading LLMs are not yet reliable physician assistants, especially when resolving physician intent, handling patient ambiguity, issuing grounded EHR queries, and integrating evidence across sources. We will fully release our dataset and code to the research community (details in Appendix [B](https://arxiv.org/html/2606.18613#A2)).
Table 1: Comparison of existing EHR and medical agent benchmarks. Evaluation Focus denotes the primary capability tested: *knowledge* for medical expertise, *system* for EHR or tool use, and *communication* for patient dialogue. PhysAssistBench is the only benchmark with an integrated *assistance* focus across all three interaction dimensions. Implicit Queries, Patient Interaction, Tool Use, and Turns denote underspecified physician requests, simulated-patient interaction, executable tools (with FHIR noted when applicable), and single- or multi-turn evaluation, respectively.
## 2 Related Work
### 2.1 EHR and Medical Agent Benchmarks
Table [1](https://arxiv.org/html/2606.18613#S1.T1) compares existing benchmarks by evaluation focus and interaction dimensions. Most prior benchmarks evaluate isolated capabilities rather than integrated physician-assistant workflows.
A large body of work focuses on knowledge, testing medical reasoning over questions, notes, records, or agent-style clinical tasks Jin et al. ([2019](https://arxiv.org/html/2606.18613#bib.bib8)); Shi et al. ([2024](https://arxiv.org/html/2606.18613#bib.bib29)); Kweon et al. ([2024](https://arxiv.org/html/2606.18613#bib.bib37)); Chen et al. ([2024](https://arxiv.org/html/2606.18613#bib.bib42)); Mehandru et al. ([2025](https://arxiv.org/html/2606.18613#bib.bib30)); Wang et al. ([2025](https://arxiv.org/html/2606.18613#bib.bib31)); Zhou et al. ([2025](https://arxiv.org/html/2606.18613#bib.bib32)); Tang et al. ([2025](https://arxiv.org/html/2606.18613#bib.bib35)). These benchmarks assess clinical expertise, but mostly assume fully specified, static inputs without patient or EHR interaction across turns. Another line focuses on system interaction, evaluating EHR retrieval or clinical tool use Lee et al. ([2022](https://arxiv.org/html/2606.18613#bib.bib40)); Bae et al. ([2023](https://arxiv.org/html/2606.18613#bib.bib38)); Jiang et al. ([2025](https://arxiv.org/html/2606.18613#bib.bib27)); Lee et al. ([2025](https://arxiv.org/html/2606.18613#bib.bib28)); Xu et al. ([2026a](https://arxiv.org/html/2606.18613#bib.bib41)). However, these benchmarks typically treat EHR access as standalone tool use and do not test implicit references, argument elision, or cross-turn carry-over. A third line focuses on communication, where models interact with patients or generate clinical text. AgentClinic Schmidgall et al. ([2026](https://arxiv.org/html/2606.18613#bib.bib34)) supports multi-turn doctor-patient interaction and is closest to the dialogue side of our setting, but does not jointly evaluate structured EHR tool use, patient dialogue, and implicit physician requests.
Overall, existing benchmarks cover important pieces of clinical LLM evaluation, but not the integrated assistance setting, where an LLM must coordinate knowledge, communication, and system interaction under evolving physician instructions. Luo et al. ([2026](https://arxiv.org/html/2606.18613#bib.bib76)) also articulate a research vision for evaluating clinical LLMs through realistic EHR interfaces in dynamic, interactive clinical settings, sharing our critique of current evaluation paradigms. As shown in Table [1](https://arxiv.org/html/2606.18613#S1.T1), PhysAssistBench is the only benchmark covering implicit queries, patient interaction, FHIR-based EHR tool use, and multi-turn evaluation together. This coverage is made possible by a scalable multi-agent synthetic data pipeline that turns static MIMIC-IV records into interactive doctor-patient-EHR scenarios.
### 2.2 Medical Synthetic Data
Medical synthetic data has become an important strategy for addressing data scarcity, privacy, and bias in healthcare. Most prior work follows a generative synthesis paradigm, creating realistic medical images, clinical text, tabular EHR data, or multimodal patient records for training and evaluation via generative models Fansi Tchango et al. ([2022](https://arxiv.org/html/2606.18613#bib.bib54)); Li et al. ([2023b](https://arxiv.org/html/2606.18613#bib.bib56)); Seo and Lee ([2024](https://arxiv.org/html/2606.18613#bib.bib55)); Zhang et al. ([2025](https://arxiv.org/html/2606.18613#bib.bib2), [2026](https://arxiv.org/html/2606.18613#bib.bib1)). Beyond sample generation, recent work also studies agentic data synthesis, where LLMs or multi-agent systems construct clinical dialogues through prompt-encoded constraints Das et al. ([2024](https://arxiv.org/html/2606.18613#bib.bib57)) or clinician-patient role simulation grounded in EHR-derived evidence Wang et al. ([2024](https://arxiv.org/html/2606.18613#bib.bib58)); ALMutairi et al. ([2024](https://arxiv.org/html/2606.18613#bib.bib59)); Xu et al. ([2026b](https://arxiv.org/html/2606.18613#bib.bib60)).
PhysAssistBench differs by using synthetic data to construct an interactive evaluation environment. Its scalable multi-agent synthetic data pipeline plans clinically plausible scenarios from real records and constructs agentic patients, which are grounded in existing EHR evidence rather than unconstrained simulation; unsupported cases are filtered instead of counterfactually rewritten. The resulting evaluation set is further manually reviewed and physician-validated for reliability.
### 2.3 Tool Use and Function Calling
General-domain tool-use benchmarks evaluate API selection, function calling, and tool-mediated task completion Li et al. ([2023a](https://arxiv.org/html/2606.18613#bib.bib51)); Qin et al. ([2023](https://arxiv.org/html/2606.18613#bib.bib52)); Krishna et al. ([2025](https://arxiv.org/html/2606.18613#bib.bib53)); Yu et al. ([2026](https://arxiv.org/html/2606.18613#bib.bib50)). While useful for studying agent tool use, they mostly assume general-purpose domains and do not model clinical tool schemas, FHIR-based EHR actions, or implicit physician queries. PhysAssistBench instead embeds tool use in an interactive doctor-patient-EHR workflow requiring coordination between FHIR-based EHR tools, ambiguous patient dialogue, and evolving physician instructions.
### 2.4 Ellipsis and Anaphora in Dialogue
Ellipsis and anaphora are pervasive in natural dialogue and have long been studied in linguistics and NLP Gerber and Chai ([2010](https://arxiv.org/html/2606.18613#bib.bib47)); Lee et al. ([2017](https://arxiv.org/html/2606.18613#bib.bib45), [2018](https://arxiv.org/html/2606.18613#bib.bib46)); Marasović et al. ([2017](https://arxiv.org/html/2606.18613#bib.bib48)); Kolhatkar and Hirst ([2014](https://arxiv.org/html/2606.18613#bib.bib49)). They cover phenomena such as omitted predicates or arguments, entity references, and abstract event references, all of which require context-dependent interpretation. In clinical communication, especially physician instructions, such implicitness is common: clinicians often rely on shared context, abbreviate repeated actions, and refer back to prior findings across turns. Despite this, existing EHR and medical agent benchmarks typically formulate queries as explicit, standalone instructions, leaving implicit physician requests overlooked.
## 3 PhysAssistBench
### 3.1 Overview
PhysAssistBench is a multi-turn benchmark built on real MIMIC-IV patient data that evaluates LLMs on realistic doctor-patient-EHR interaction. It comprises 324 sessions and 1,296 turns, evaluated in both English and Chinese using a standardized FHIR R4 tool set Mandel et al. ([2016](https://arxiv.org/html/2606.18613#bib.bib62)).
### 3.2 Benchmark Structure
#### Sessions and Turns.
The primary unit of evaluation is a *session*: four consecutive turns (Turn 0–Turn 3) simulating a single clinical encounter. Turns are not independent: each builds on previously retrieved data and established context, requiring the assistant to track what has already been answered and to reason across the full conversation.
#### Task Types.
Each turn is labeled with one of four task types reflecting the clinical intent of the physician’s query. Information Lookup (il) retrieves a single clinical fact via one EHR read tool. Data Gathering (dg) collects information from two or more sources (EHR tools, patient-interview tools), with calls that may be parallel or adaptive. Clinical Reasoning (cr) combines retrieved findings with medical knowledge to produce a clinical recommendation, optionally involving tool calls. Write/Update (wu) executes a FHIR write operation whose parameters are inferred from session context. Turn 0 is always il, anchoring the session in a concrete data request; Turn 3 is always dg, cr, or wu.
#### Physician-Query Implicitness Types.
Task type captures *what* is being requested; the implicitness type captures *how* the physician expresses it. As shared context accumulates, physicians naturally abbreviate: referring back to prior entities by pronoun, dropping predicates, or compressing the clinical picture into a brief phrase. PhysAssistBench encodes three implicitness types assigned to Turns 1–3. Nominal Anaphora (na): a pronoun or noun phrase refers to a specific entity from a prior turn. Predicate Ellipsis (pe): the verb and arguments are omitted; only the new focus item is stated. Abstract Event Anaphora (ae): a phrase such as “given all this” refers to the accumulated clinical picture. Examples are in Appendix [E1](https://arxiv.org/html/2606.18613#A5.T1).
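The structural constraints on task and implicitness types can be sketched as a small validator. This is a minimal illustration rather than the benchmark's own code: the `Turn` class, its field names, and `validate_session` are hypothetical, and treating Turn 0 as carrying no implicitness label is an assumption drawn from the statement that implicitness types are assigned to Turns 1–3. Only the type codes and the Turn 0 / Turn 3 rules come from the text.

```python
from dataclasses import dataclass
from typing import List, Optional

TASK_TYPES = {"il", "dg", "cr", "wu"}       # task-type codes from the text
IMPLICITNESS_TYPES = {"na", "pe", "ae"}     # implicitness codes from the text


@dataclass
class Turn:
    """Hypothetical container for one physician turn."""
    index: int                      # position in the session, 0..3
    task: str                       # one of TASK_TYPES
    implicitness: Optional[str]     # assumed None for Turn 0, else one of IMPLICITNESS_TYPES


def validate_session(turns: List[Turn]) -> List[str]:
    """Return the list of violated constraints (empty if the session is well formed)."""
    errors: List[str] = []
    if len(turns) != 4:
        errors.append("a session must have exactly four turns")
        return errors
    for expected, turn in enumerate(turns):
        if turn.index != expected:
            errors.append(f"turn {expected} has index {turn.index}")
        if turn.task not in TASK_TYPES:
            errors.append(f"turn {expected} has unknown task {turn.task!r}")
        if expected == 0 and turn.implicitness is not None:
            errors.append("turn 0 should not carry an implicitness type")
        if expected > 0 and turn.implicitness not in IMPLICITNESS_TYPES:
            errors.append(f"turn {expected} needs an implicitness type")
    if turns[0].task != "il":
        errors.append("turn 0 must be il")
    if turns[3].task not in {"dg", "cr", "wu"}:
        errors.append("turn 3 must be dg, cr, or wu")
    return errors
```

Returning a list of violations instead of raising on the first one makes it easier to audit a generated session in a single pass.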
#### Tool Set.
The assistant draws on two tool families. EHR tools are FHIR R4 read and write operations over MIMIC-IV resources. Patient-interview tools elicit subjective information absent from structured records (chief complaint, symptom history, medication adherence, and functional status), with responses pre-generated from the same admission records as the EHR snapshot and held fixed at evaluation time. Treating patient interaction as a tool call unifies EHR queries and patient questions under a single decision interface and enables deterministic evaluation via response replay. The full tool inventory is provided in Appendix [D](https://arxiv.org/html/2606.18613#A4).
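Response replay can be pictured as a lookup into a fixed table of pre-generated patient replies. The sketch below is an illustration under stated assumptions: the `ReplayPatient` class, the tool names, the keying by (tool, topic), the fallback reply, and the toy replies are all hypothetical and are not taken from the benchmark's tool inventory.

```python
from typing import Dict, Tuple

# Hypothetical replay table: (tool name, topic) -> pre-generated patient reply.
ReplayTable = Dict[Tuple[str, str], str]


class ReplayPatient:
    """Answers patient-interview tool calls from a fixed, pre-generated table."""

    def __init__(self, table: ReplayTable, fallback: str = "I'm not sure.") -> None:
        self._table = dict(table)   # private copy so replies stay fixed during evaluation
        self._fallback = fallback   # assumed behaviour for questions with no stored reply

    def call(self, tool: str, topic: str) -> str:
        # The same (tool, topic) pair always yields the same reply: nothing is sampled.
        return self._table.get((tool, topic), self._fallback)


# Toy replies for illustration only; they are not benchmark data.
patient = ReplayPatient({
    ("ask_symptom_history", "headache"): "It started about three days ago.",
    ("ask_medication_adherence", "lisinopril"): "I skip it some mornings.",
})
```

Because every reply is read from the table, two evaluation runs that issue the same patient questions observe identical answers, which is what makes scoring reproducible.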
#### Clinical Scenarios.
The task and implicitness types are instantiated across four clinical scenarios spanning the physician’s core decision cycle: diagnostic workup (Diagnostic Workup), medication safety review (Med Safety), treatment response monitoring (Treatment Response), and discharge planning (Discharge Planning). Each scenario constrains which FHIR resources and tool combinations are exercised, ensuring structural distinctiveness. The scenario design is inspired by Jiang et al. ([2025](https://arxiv.org/html/2606.18613#bib.bib27)); Lee et al. ([2025](https://arxiv.org/html/2606.18613#bib.bib28)); Liu et al. ([2026](https://arxiv.org/html/2606.18613#bib.bib33)).
#### Data-Richness Tiers.
Each session is assigned one of three *data-richness tiers* reflecting the depth of structured evidence in the MIMIC-IV record; the tiers are a coverage dimension rather than a difficulty ranking. *Data-sparse* sessions require a single lab value or prescription. *Data-moderate* sessions require a trend-bearing time series or a co-occurring drug-lab pair. *Data-rich* sessions require multi-dimensional evidence such as multiple drug-monitoring pairs or multi-system differential findings. Tier eligibility is verified offline against raw MIMIC-IV CSV files without LLM involvement (Appendix [H](https://arxiv.org/html/2606.18613#A8)). Session distribution by scenario and tier is in Appendix [G](https://arxiv.org/html/2606.18613#A7).
Figure 2: Multi-agent data synthesis pipeline. A static MIMIC-IV record is transformed into a grounded, multi-turn benchmark entry in the agentic patient environment through patient pre-filtering, session-level planning, and turn-level generation with embedded quality control.
### 3.3 Evaluation
#### Task Formalization.
We formulate each PhysAssistBench session as a finite-horizon partially observable Markov decision process (POMDP) $(\mathcal{S},\mathcal{A},\mathcal{O},T,R,\mathcal{U})$, with state space $\mathcal{S}$, action space $\mathcal{A}$, observation space $\mathcal{O}$, transition function $T:\mathcal{S}\times\mathcal{A}\to\mathcal{S}\times\mathcal{O}$, reward $R:\mathcal{S}\to[0,1]$, and physician-instruction space $\mathcal{U}$. The state decomposes as $\mathcal{S}=\mathcal{S}_{\text{ehr}}\cup\mathcal{S}_{\text{pat}}\cup\mathcal{S}_{\text{ctx}}$: $\mathcal{S}_{\text{ehr}}$ is the MIMIC-IV patient snapshot frozen at $t_{\text{anchor}}$ and hidden from the agent; $\mathcal{S}_{\text{pat}}$ is the patient-interview state, constructed from the same admission-period records as $\mathcal{S}_{\text{ehr}}$ and fixed at pipeline initialization time; and $\mathcal{S}_{\text{ctx}}$ is the accumulated dialogue history $H_{t}=((q_{0},y_{0}),\ldots,(q_{t-1},y_{t-1}))$. The action space partitions as $\mathcal{A}=\mathcal{A}_{\text{ehr}}\cup\mathcal{A}_{\text{pat}}\cup\mathcal{A}_{\text{nl}}$: FHIR R4 tool calls (read and write), patient-interview tool calls, and the natural-language answer $y_{t}$ returned to the physician. The instruction space $\mathcal{U}$ contains the physician’s queries $q_{t}$. A session is the tuple $\sigma=(t_{\text{anchor}},q_{0},q_{1},q_{2},q_{3})$ paired with gold tool calls and gold answers; at each turn the agent produces $(\tau_{t},y_{t})$, where $\tau_{t}$ is the executed tool-call trace and $y_{t}$ is the response.
#### Rubric-Based Scoring.
Each turn is scored by an LLM judge on several independent rubric items designed for the turn’s task type. Items cover factual accuracy of reported values, correctness of clinical interpretation, appropriate integration of prior session context, and absence of hallucinated facts. The score for turn $t$ is the fraction of items passed, $r_{t}\in[0,1]$; the session-level Rubric Score is $\bar{r}(\sigma)=\tfrac{1}{4}\sum_{t=0}^{3}r_{t}$, and the corpus-level mean rubric score (mRS) reported in Table [2](https://arxiv.org/html/2606.18613#S5.T2) is $\mathbb{E}_{\sigma}[\bar{r}(\sigma)]$.
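The three levels of rubric aggregation can be written out directly. This is a minimal sketch of the definitions in the paragraph above, assuming the judge's per-item verdicts are already available as booleans; the function names are illustrative and not taken from the benchmark's code.

```python
from statistics import mean
from typing import List, Sequence


def turn_score(items_passed: Sequence[bool]) -> float:
    """r_t: fraction of rubric items the judge marked as passed."""
    if not items_passed:
        raise ValueError("a turn needs at least one rubric item")
    return sum(items_passed) / len(items_passed)


def session_score(turn_scores: Sequence[float]) -> float:
    """Session-level Rubric Score: mean of the four turn scores."""
    if len(turn_scores) != 4:
        raise ValueError("a session has exactly four turns")
    return sum(turn_scores) / 4


def mean_rubric_score(sessions: List[Sequence[float]]) -> float:
    """mRS: average session score over the corpus."""
    return mean(session_score(s) for s in sessions)
```

Since every session has exactly four turns, averaging session scores gives the same value as averaging all turn scores directly.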
#### Pass@Turn and Pass@Session.
The continuous rubric score is binarized at threshold $\tau$ to yield two reliability metrics. Pass@Turn (@T$_{\tau}$) is the fraction of turns whose rubric score meets the threshold:
$$\text{@T}_{\tau}=\frac{\sum_{\sigma\in\mathcal{D}}\sum_{t=0}^{3}\mathbf{1}[r_{t}\geq\tau]}{4|\mathcal{D}|}.$$
Pass@Session (@S$_{\tau}$) is the fraction of sessions in which all four turns meet the threshold:
$$\text{@S}_{\tau}=\frac{\sum_{\sigma\in\mathcal{D}}\mathbf{1}\!\left[\min_{t\in\{0,1,2,3\}}r_{t}\geq\tau\right]}{|\mathcal{D}|}.$$
We report $\tau\in\{0.60,0.75\}$; $|\mathcal{D}|$ is the number of sessions. @S is the more demanding metric: a single weak turn fails the entire session, reflecting the multi-turn, context-dependent character of clinical practice.
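The difference between the two pass metrics is easiest to see on a small example. The sketch below implements the two definitions over lists of per-turn rubric scores; the function names and the toy scores are illustrative assumptions, not benchmark results.

```python
from typing import List, Sequence


def pass_at_turn(sessions: List[Sequence[float]], tau: float) -> float:
    """@T_tau: fraction of all turns with rubric score >= tau."""
    total_turns = sum(len(s) for s in sessions)
    passed = sum(1 for s in sessions for r in s if r >= tau)
    return passed / total_turns


def pass_at_session(sessions: List[Sequence[float]], tau: float) -> float:
    """@S_tau: fraction of sessions whose weakest turn still meets tau."""
    passed = sum(1 for s in sessions if min(s) >= tau)
    return passed / len(sessions)


# Toy scores (illustrative only, not benchmark results).
toy = [
    [1.00, 0.80, 0.75, 0.60],   # passes at 0.60, fails at 0.75
    [0.90, 0.90, 0.80, 0.75],   # passes at both thresholds
    [1.00, 1.00, 1.00, 0.50],   # one weak turn fails the whole session
]
```

On these toy scores, 11 of 12 turns pass at a threshold of 0.60 while only 2 of 3 sessions do, which shows how a single weak turn weighs far more heavily on Pass@Session than on Pass@Turn.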
#### Tool Use Evaluation.
Tool invocation is evaluated implicitly through rubric-based scoring: a response grounded in correct EHR values necessarily required correct tool calls to retrieve them. This also accommodates cases where a model deviates from the gold tool trajectory but still produces a clinically correct answer, as the rubric rewards correctness rather than procedural conformity.
## 4 Agentic Patient Environment from Static EHR Records
We design the pipeline around a generative principle: each patient’s EHR record, rather than serving as a static lookup table, acts as the *driving participant* of an *agentic patient* environment. Given one admission’s data, a coordinated set of agents decides what clinical questions the record can support, plans the required EHR and patient-tool interactions, executes real FHIR queries against the actual data, and produces verifiable gold responses without human authorship of the scenario itself. The same record can therefore generate different conversations across scenarios, data-richness tiers, and languages, making the pipeline scalable while remaining grounded in what the specific admission contains. In this sense, the pipeline turns static EHR records into interactive *agentic patients* capable of driving complex doctor-patient-EHR interactions.
As illustrated in Figure [2](https://arxiv.org/html/2606.18613#S3.F2), the pipeline comprises 3 stages. (1) Patient pre-filtering applies a two-stage offline filter: first file-size thresholds, then scenario- and tier-specific content checks, ensuring every retained patient has sufficient EHR evidence before any LLM is invoked. (2) Session planning reads the patient’s EHR snapshot and produces a coherent four-turn clinical arc specifying the per-turn topic and the FHIR or patient tools to use, transforming isolated tabular data into a connected clinical narrative. (3) Turn-level generation instantiates each planned turn through multi-agent cooperation with tool-call planning, FHIR execution against real data, and gold answer generation, with three embedded quality gates rejecting hallucinated, structurally invalid, or clinically unsafe outputs.
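The two-stage pre-filter in stage (1) can be sketched as a cheap size check followed by a content check. Everything concrete here is an assumption made for illustration: the `PatientRecord` fields, the `prefilter` signature, and the example check are hypothetical, and the actual thresholds and scenario- and tier-specific rules are described in the paper's appendix rather than reproduced here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PatientRecord:
    """Hypothetical summary of one admission's exported files."""
    patient_id: str
    file_size_bytes: int
    lab_count: int
    prescription_count: int


# Hypothetical scenario- and tier-specific content check.
ContentCheck = Callable[[PatientRecord], bool]


def prefilter(records: List[PatientRecord], min_bytes: int, check: ContentCheck) -> List[PatientRecord]:
    """Stage 1: cheap file-size threshold. Stage 2: content check on the survivors."""
    large_enough = [r for r in records if r.file_size_bytes >= min_bytes]
    return [r for r in large_enough if check(r)]
```

Running the size threshold first means the more expensive content check only touches records that could plausibly hold enough evidence, and neither stage involves an LLM.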
The pipeline scales by instantiating multiple sessions from a single patient record across scenarios, data-richness tiers, and languages; quality is enforced by a three-stage checker that filters hallucinations, structural errors, and unsafe content. Full pipeline details (Figure [H1](https://arxiv.org/html/2606.18613#A8.F1)), prompt templates, and per-gate failure statistics are provided in Appendix [H](https://arxiv.org/html/2606.18613#A8).
## 5 Experiments
### 5.1 Models
We benchmark 5 closed-source models: GPT-5.4 and GPT-5.4-mini Singh et al. ([2025](https://arxiv.org/html/2606.18613#bib.bib72)), Claude-Opus-4.7 Anthropic ([2026](https://arxiv.org/html/2606.18613#bib.bib69)), Gemini-3.1-Pro Google ([2026](https://arxiv.org/html/2606.18613#bib.bib71)), and Seed-1.8 ByteDance Seed Team ([2026](https://arxiv.org/html/2606.18613#bib.bib73)); and 9 open-weight models: DeepSeek-V4-Pro and DeepSeek-V4-Flash DeepSeek-AI ([2026](https://arxiv.org/html/2606.18613#bib.bib70)), the Qwen3.5 series Qwen Team ([2025](https://arxiv.org/html/2606.18613#bib.bib67)), GLM-5 GLM-5 Team ([2026](https://arxiv.org/html/2606.18613#bib.bib63)), Kimi-K2.6 Kimi Team ([2025](https://arxiv.org/html/2606.18613#bib.bib65)), and MiniMax MiniMax ([2026](https://arxiv.org/html/2606.18613#bib.bib75)). For each model we run the full English and Chinese benchmark (324 sessions $\times$ 4 turns $\times$ 2 languages = 2,592 turns) under identical evaluation conditions.
#### Eval Configuration & Judge Model.
All models run with reasoning (“thinking”) mode enabled (`reasoning_effort=high` for the GPT-5 series), `temperature=0.2`, and a maximum of 16 tool calls per turn, with all 17 EHR and patient-interview tools available at every turn. Rubric scoring is performed by a single GPT-5.4-mini judge that is held fixed across all evaluated models. The judge receives the model’s answer, the EHR ground truth, and the rubric items, and returns a binary score with reasoning for each item.
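The per-turn tool budget can be illustrated with a small driver loop. This is a sketch under stated assumptions, not the benchmark's harness: the `policy` and `execute` callables, the `ToolCall` and `Step` types, and the choice to end the turn without an answer when the cap is reached are all hypothetical; only the cap of 16 calls comes from the configuration above.

```python
from typing import Callable, Dict, List, Optional, Tuple

MAX_TOOL_CALLS = 16  # per-turn cap stated in the evaluation configuration

# Hypothetical types: a tool call is (tool name, arguments); a policy step either
# requests another tool call or returns the final natural-language answer.
ToolCall = Tuple[str, Dict[str, str]]
Step = Tuple[Optional[ToolCall], Optional[str]]
Trace = List[Tuple[ToolCall, str]]


def run_turn(
    policy: Callable[[Trace], Step],
    execute: Callable[[ToolCall], str],
) -> Tuple[Trace, Optional[str]]:
    """Run one turn: extend the tool-call trace until the policy answers or the cap is hit."""
    trace: Trace = []
    while True:
        call, answer = policy(trace)
        if call is None:
            return trace, answer          # the policy chose to answer
        if len(trace) >= MAX_TOOL_CALLS:
            return trace, None            # assumed behaviour: budget exhausted, no answer
        trace.append((call, execute(call)))
```

The returned pair mirrors the per-turn output in the task formalization: the executed tool-call trace together with the response given to the physician.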
### 5.2 Main Results
Table 2: Performance on PhysAssistBench across 1,296 turns over 324 clinical sessions (4 scenarios $\times$ 3 data-richness tiers $\times$ 27 sessions each). mRS: mean rubric score (%) over all 1,296 turns. Pass@Turn (@T): the fraction of turns with rubric score $\geq\tau$ (%). Pass@Session (@S): the fraction of complete 4-turn sessions where every turn passes threshold $\tau$ (%). Numbers highlighted in red and blue denote the best and second-best results, respectively.
Table [2](https://arxiv.org/html/2606.18613#S5.T2) reports the performance of 14 LLMs on PhysAssistBench. Closed-source and open-weight models perform comparably on mRS, with an average gap of only 1.5 pp on English mRS (63.9 vs. 62.4). GLM-5 achieves the highest mRS among all models (69.4 EN, 71.5 ZH), narrowly outperforming Claude-Opus-4.7 (68.3 EN, 69.9 ZH) on this metric. However, Pass@Session tells a different story: Claude-Opus-4.7 leads all models on session-level consistency, reaching 23.5% (EN) and 26.9% (ZH) at $\tau=0.60$, and 8.0% (EN) and 9.0% (ZH) at $\tau=0.75$. This gap between turn-level and session-level rankings highlights that sustaining quality across all four turns within a session is a distinct capability, not captured by mean rubric score alone. Within the Qwen3.5 family, mRS scales clearly with model size (66.3 → 58.3 → 55.2 → 48.7 from 35B to 4B), with the 4B model recording 0.0% Pass@Session at $\tau=0.75$.
#### Implicitness Type and Task Type: Heatmap Analysis.
Figure 3: Rubric score (%) by implicitness type $\times$ task type (EN), for 14 models and their average. Information Lookup stays robust across all implicitness types, while Data Gathering (multi-tool composition) and Clinical Reasoning (knowledge-grounded inference) are the consistent weak points across models. Both demand reasoning beyond single-point retrieval. Results for Chinese (ZH) are presented in Appendix [I2](https://arxiv.org/html/2606.18613#A9.F2).
Figure [3](https://arxiv.org/html/2606.18613#S5.F3) decomposes the rubric score by implicitness (rows) and task (columns) across all 14 models. The heatmap reveals three patterns. First, task type defines a stable difficulty hierarchy: IL 82.6 > WU 76.1 > CR 54.9 > DG 47.5, with DG as the universal bottleneck; its narrow cross-model standard deviation (5.6–7.6) indicates that multi-tool composition resists model scale. Second, implicitness interacts non-uniformly with task type: PE is the weakest row (59.6) but collapses specifically on WU (55.6), where dropping the verb creates syntactic ambiguity between a write instruction and a verification query; NA and AE preserve the write-verb in the antecedent, recovering to 87.2 and 90.7, the two highest cells on average. Third, cross-model variance is lowest on AE $\times$ WU ($\sigma=4.0$) and highest on NA $\times$ IL ($\sigma=13.6$), concentrating the capability gap in implicit DG and CR cells; Claude-Opus-4.7 and GLM-5 are the most balanced systems, while Qwen3.5-4B shows the widest within-model spread (98 on NA $\times$ WU vs. 27 on AE $\times$ DG), suggesting surface-pattern acquisition without underlying composition skill.
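The per-cell averages and cross-model standard deviations discussed above can be computed from per-model cell scores as follows. This is a minimal sketch: the input layout and the `cell_stats` name are illustrative, and the use of the population standard deviation is an assumption, since the text does not state which estimator was used.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, List, Tuple

Cell = Tuple[str, str]  # (implicitness type, task type), e.g. ("pe", "wu")


def cell_stats(per_model: Dict[str, Dict[Cell, float]]) -> Dict[Cell, Tuple[float, float]]:
    """Return {cell: (mean across models, population std across models)}."""
    by_cell: Dict[Cell, List[float]] = defaultdict(list)
    for scores in per_model.values():
        for cell, value in scores.items():
            by_cell[cell].append(value)
    return {cell: (mean(vals), pstdev(vals)) for cell, vals in by_cell.items()}
```

A cell with a low mean and a low standard deviation is one where all models struggle alike, while a high standard deviation marks a cell that separates stronger from weaker models.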
#### Language\-Conditioned Tool\-Invocation Bias\.
All models trained by non\-Chinese organizations achieve higher scores on Chinese than on English sessions \(up to \+11\.4 pp Pass@Session for Gemini 3\.1 Pro\)\. A turn\-level breakdown for Gemini 3\.1 Pro shows that 23% of turns where Chinese passed but English failed involve insufficient tool use \(the English session skips EHR calls that are correctly invoked in Chinese\), and 3% are outright refusals that never occur in the Chinese sessions\. We attribute this to a language\-conditioned safety prior: English training data reinforces disclaimers about AI systems not accessing clinical records, suppressing EHR tool calls in English sessions\. This motivates reporting multilingual results separately\.
#### A Narrow Flagship Band\.
Among the flagship models, performance compresses into a narrow band: the top contenders \(GPT\-5\.4\-high, GLM\-5, Claude\-Opus\-4\.7, Kimi\-K2\.6, Seed\-1\.8\) fall within ~3 points on EN despite differences in scale and architecture\. We attribute this to the information filtering inherent in the data construction pipeline: since EHR evidence is fetched on demand, parametric knowledge differences are removed at the data\-construction stage, so models differ in how they*compose*tool calls, not in what they*know*\. Lightweight models stay clearly behind, with Qwen3\.5\-4B trailing GLM\-5 \(EN 69\.4\) by ~20 points\. As retrieval already removed the knowledge axis, this gap isolates a reasoning deficit: smaller models obtain the same evidence but fail to reliably chain tool calls and integrate the results\.
#### Invariance Across Data Richness Tiers\.
Performance is broadly invariant to the L1–L3 stratification\. Across evaluated models \(Appendix[I2](https://arxiv.org/html/2606.18613#A9.T2)\), mRS varies by less than ~4 points between L1 and L3, and the ranking is not monotonic for any model \(L3 occasionally exceeds L1, e\.g\., Seed\-1\.8 EN, Qwen3\.5\-35B EN\)\. This is a positive finding: under the FHIR tool abstraction, EHR scale is not the dominant bottleneck\. Models navigate information\-dense encounters via targeted tool calls just as effectively as sparse ones; the dominant axes of variance are instead task type and implicit\-query type \(Figure[3](https://arxiv.org/html/2606.18613#S5.F3)\)\.
### 5\.3Effect of Implicit Queries
Table 3: Explicit vs\. implicit ablation \(per subtype\)\. Each model is evaluated twice over the entire benchmark: once with every query restored to a *fully explicit* paraphrase \(Expl\.\), and once with the original *implicit* physician queries \(Impl\.\); all other inputs \(patient data, tools, gold answers\) are held fixed\. Δ = Impl\. − Expl\. is the performance lost to implicit phrasing \(negative = implicitness hurts\)\. Columns report the comparison restricted to turns of each implicit subtype \(na, pe, ae\)\. The magnitude of Δ consistently grows with subtype difficulty \(pe < na < ae\)\. All values are rubric scores \(%\)\.

PhysAssistBench preserves both implicit and explicit versions of each query, allowing us to isolate the effect of implicit phrasing while keeping patient data, tools, and gold answers fixed\. As shown in Table[3](https://arxiv.org/html/2606.18613#S5.T3), the implicit penalty Δ is negative in 21 of 24 model × subtype cells, showing that implicitness introduces a real difficulty rather than annotation noise: making queries explicit almost always improves accuracy\. The penalty increases with anaphoric complexity, averaging −2\.2 for Predicate Ellipsis \(pe\), −2\.8 for Nominal Anaphora \(na\), and −5\.3 for Abstract Event Anaphora \(ae\), the only subtype where every model degrades\. This ordering is intuitive: predicate ellipsis omits a locally recoverable verb, nominal anaphora refers to a specific entity, while ae requires reconstructing an accumulated clinical state across multiple turns\. Implicit penalties also separate models more clearly than raw accuracy\. Stronger systems \(e\.g\., DeepSeek\-V4\-Pro, GPT\-5\.4\-high\) remain within −4\.0 across all subtypes, whereas smaller open models are much more fragile \(e\.g\., Qwen3\.5\-27B: −9\.8 on na; DeepSeek\-V4\-Flash: −9\.9 on ae\)\.
Since real physicians do not usually restate full context, this explicit\-implicit gap highlights a fundamental weakness of current EHR assistants in everyday clinical dialogue\.
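The ablation reduces to a per\-cell subtraction, Δ = Impl\. − Expl\., followed by a count of negative cells and a per\-subtype average\. A minimal sketch under stated assumptions: the dictionary layout keyed by `(model, subtype)`, the helper names, and all scores below are hypothetical and are not the paper's numbers\.

```python
from collections import defaultdict


def implicit_penalty(explicit, implicit):
    """Return delta = Impl. - Expl. for every (model, subtype) cell present in both runs."""
    return {cell: implicit[cell] - explicit[cell] for cell in explicit if cell in implicit}


def mean_by_subtype(deltas):
    """Average the penalty over models, separately for each implicit subtype."""
    grouped = defaultdict(list)
    for (_model, subtype), delta in deltas.items():
        grouped[subtype].append(delta)
    return {subtype: sum(vals) / len(vals) for subtype, vals in grouped.items()}


# Hypothetical rubric scores (%) for two placeholder models; not the paper's numbers.
explicit_scores = {("model_a", "pe"): 70.0, ("model_a", "ae"): 66.0,
                   ("model_b", "pe"): 60.0, ("model_b", "ae"): 58.0}
implicit_scores = {("model_a", "pe"): 69.0, ("model_a", "ae"): 62.0,
                   ("model_b", "pe"): 61.0, ("model_b", "ae"): 50.0}

deltas = implicit_penalty(explicit_scores, implicit_scores)
negative_cells = sum(d < 0 for d in deltas.values())
print(negative_cells, "of", len(deltas), "cells are negative")
print(mean_by_subtype(deltas))
```

Because patient data, tools, and gold answers are held fixed between the two runs, any non\-zero Δ in such a comparison can be attributed to query phrasing alone\.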
### 5\.4Effect of Patient Communication
Figure 4: Comparison of rubric scores between EHR\-only and patient\-interview turns\.

Figure[4](https://arxiv.org/html/2606.18613#S5.F4) compares model performance on structured EHR\-access turns versus patient\-interview turns\. Across the benchmark \(EN\+ZH\), each model is evaluated on approximately 1,944 EHR\-only turns and 648 patient\-interview turns\. All fourteen models show a substantial drop on patient\-interview turns, with an average decrease of 26\.4 points\. EHR\-turn scores range from 59% to 78%, while patient\-interview scores drop to 29%–51%\.
The gap reflects a key difference in task structure\. EHR turns mainly require retrieving or filtering well\-defined records, where the correct tool usage is largely determined by the query\. In contrast, patient\-interview turns require models to elicit subjective information through natural dialogue and determine medically relevant follow\-up questions, combining conversational ability with clinical reasoning\. The gap is slightly smaller for stronger proprietary models \(20\.6–26\.8 points\) than for smaller open\-weight models \(22\.8–30\.5 points\), but remains substantial across all models\. While AgentClinic Schmidgall et al\. \([2026](https://arxiv.org/html/2606.18613#bib.bib34)\) evaluates doctor\-patient dialogue only, our results show that EHR access does not reduce this difficulty; integrating both sources only compounds the challenge\. Because the gap persists for every model, patient communication appears to be a capability bottleneck that is largely independent of general model capacity\.
## 6Conclusion
We introduced PhysAssistBench, the first benchmark that jointly evaluates LLMs on three co\-occurring challenges of physician\-EHR interaction: implicit physician queries, structured FHIR\-based EHR tool use, and ambiguous patient communication\. Built from real MIMIC\-IV cases, PhysAssistBench uses a scalable multi\-agent synthetic data pipeline to construct agentic patients for interactive doctor\-patient\-EHR scenarios\. Our results show that even the strongest LLMs remain far from reliable as physician assistants, highlighting interaction and coordination as key bottlenecks for clinical LLMs\. We release the benchmark and code to support research on LLM clinical deployments\.
## Limitations
#### Scenario Coverage\.
PhysAssistBench spans four clinical scenarios chosen to cover a representative range of EHR reasoning and tool\-use demands\. Real\-world clinical workflows include many additional contexts, such as critical\-care management, specialist referral, and post\-discharge follow\-up, that fall outside the current scope\. We plan to extend the scenario set in future releases, especially to higher\-stakes and time\-sensitive settings\.
#### Scale of Expert Validation\.
PhysAssistBench uses automated quality gates, rubric\-based scoring, and staged human review as its quality\-control process\. A sampled subset of sessions was reviewed by clinicians and confirmed to be clinically coherent, but full turn\-level physician validation across all 1,296 turns remains constrained by expert availability\. Importantly, our two\-round staged annotation shows that clinician and annotator feedback can be fed back into the pipeline as refined prompt constraints and gate rules, rather than treated as one\-off corrections \(see Appendix[C](https://arxiv.org/html/2606.18613#A3)\)\. Future releases will expand expert validation and further use such feedback to improve the pipeline\.
#### Model Coverage and Task Depth\.
PhysAssistBench is designed to stress\-test multi\-turn reasoning, structured tool use, and implicit query understanding rather than biomedical knowledge recall alone\. We therefore focus on frontier general\-purpose LLMs with strong tool\-use capabilities\. Domain\-specific medical LLMs, such as Med\-PaLM 2 and HuatuoGPT, are important baselines, but many are primarily optimized for biomedical QA or clinical knowledge tasks, are not publicly accessible, or are not readily adapted to FHIR\-style tool interfaces\. Evaluating medical specialist models with comparable tool\-use scaffolds is an important direction for future work\.
#### Agent Design Generality\.
Although agentic patients are grounded in real records, they remain tailored to predefined scenarios and task types\. In principle, static patient records could be transformed into richer interactive environments beyond dialogue, including more complex simulations such as in\-silico clinical trials\. Our current pipeline, however, uses a manually designed agent workflow for doctor\-patient\-EHR assistance and does not yet include a meta\-pipeline for automatically designing new agent structures for substantially different scenarios\. Developing such adaptive agent\-pipeline design is an interesting direction for future work\.
#### Data Source\.
All patients are drawn from MIMIC\-IV, a US academic medical dataset with a heavy ICU focus, so results may not generalize to documentation styles or disease prevalence in other healthcare systems\.
#### Language\.
PhysAssistBench is currently available in English and Chinese; supporting additional languages would require re\-grounding clinical terminology and patient persona behaviour in language\-specific medical corpora\.
## Ethical Considerations
PhysAssistBench is built from MIMIC\-IV, a publicly available de\-identified clinical dataset, and we followed its standard data\-use and access requirements\. The benchmark is intended only for research evaluation, not for clinical deployment or autonomous decision\-making\. As a secondary\-use benchmark with synthetic interactions, PhysAssistBench may inherit biases from the source records and introduce additional biases through scenario planning, patient personas, LLM\-generated dialogue, and record filtering\. In particular, because the pipeline selects records that can support predefined scenarios, it may introduce implicit cohort\-selection effects, which should be considered for clinically sensitive evaluations\. We mitigate these risks through record grounding, filtering, and manual review, but the benchmark should still be interpreted as an evaluation resource rather than a representation of real patient experience\.
## Acknowledgments
#### Use of AI Assistance\.
During manuscript preparation, we used LLMs only for grammar correction, language refinement, and limited assistance in literature search\. All cited references were manually checked and verified by the authors\. The authors reviewed and edited all AI\-assisted text and take full responsibility for the content of the manuscript\.
Other uses of LLMs are part of the scientific design of this work and are explicitly described in the relevant sections\. First, LLMs are the evaluated models in all experiments \(Section[5](https://arxiv.org/html/2606.18613#S5)\)\. Second, LLMs are used as components of the multi\-agent benchmark construction pipeline, including the Session Planner, Doctor Agent, Patient Agent, and quality checker \(Section[3](https://arxiv.org/html/2606.18613#S3)\)\. Third, GPT\-5\.4\-mini is used as the automated rubric judge for turn\-level scoring \(Section[3\.3](https://arxiv.org/html/2606.18613#S3.SS3)\), with human agreement validation reported in Appendix[C\.3](https://arxiv.org/html/2606.18613#A3.SS3)\.
## References
- M\. ALMutairi, L\. AlKulaib, M\. Aktas, S\. Alsalamah, and C\. Lu \(2024\)Synthetic arabic medical dialogues using advanced multi\-agent llm techniques\.InProceedings of The Second Arabic Natural Language Processing Conference,pp\. 11–26\.Cited by:[§2\.2](https://arxiv.org/html/2606.18613#S2.SS2.p1.1)\.
- Anthropic \(2026\)Claude Opus 4\.7 system card\.Technical reportAnthropic\.External Links:[Link](https://cdn.sanity.io/files/4zrzovbb/website/037f06850df7fbe871e206dad004c3db5fd50340.pdf)Cited by:[§5\.1](https://arxiv.org/html/2606.18613#S5.SS1.p1.2),[Table 2](https://arxiv.org/html/2606.18613#S5.T2.8.11.3.1)\.
- S\. Bae, D\. Kyung, J\. Ryu, E\. Cho, G\. Lee, S\. Kweon, J\. Oh, L\. Ji, E\. Chang, T\. Kim,et al\.\(2023\)Ehrxqa: a multi\-modal question answering dataset for electronic health records with chest x\-ray images\.Advances in Neural Information Processing Systems36,pp\. 3867–3880\.Cited by:[Table 1](https://arxiv.org/html/2606.18613#S1.T1.1.4.3.1),[§2\.1](https://arxiv.org/html/2606.18613#S2.SS1.p2.1)\.
- A\. M\. Bean, R\. E\. Payne, G\. Parsons, H\. R\. Kirk, J\. Ciro, R\. Mosquera\-Gómez, S\. Hincapié M, A\. S\. Ekanayaka, L\. Tarassenko, L\. Rocher,et al\.\(2026\)Reliability of llms as medical assistants for the general public: a randomized preregistered study\.Nature Medicine,pp\. 1–7\.Cited by:[§1](https://arxiv.org/html/2606.18613#S1.p1.1),[§1](https://arxiv.org/html/2606.18613#S1.p2.1)\.
- ByteDance Seed Team \(2026\)Seed1\.8 model card: towards generalized real\-world agency\.arXiv preprint arXiv:2603\.20633\.Cited by:[§5\.1](https://arxiv.org/html/2606.18613#S5.SS1.p1.2),[Table 2](https://arxiv.org/html/2606.18613#S5.T2.8.11.3.1)\.
- C\. Chen, J\. Yu, S\. Chen, C\. Liu, Z\. Wan, D\. Bitterman, F\. Wang, and K\. Shu \(2024\)Clinicalbench: can llms beat traditional ml models in clinical prediction?\.arXiv preprint arXiv:2411\.06469\.Cited by:[Table 1](https://arxiv.org/html/2606.18613#S1.T1.1.7.6.1),[§2\.1](https://arxiv.org/html/2606.18613#S2.SS1.p2.1)\.
- T\. Das, D\. Albassam, and J\. Sun \(2024\)Synthetic patient\-physician dialogue generation from clinical notes using llm\.arXiv preprint arXiv:2408\.06285\.Cited by:[§2\.2](https://arxiv.org/html/2606.18613#S2.SS2.p1.1)\.
- DeepSeek\-AI \(2026\)DeepSeek\-V4: towards highly efficient million\-token context intelligence\.Technical reportDeepSeek\.External Links:[Link](https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf)Cited by:[§5\.1](https://arxiv.org/html/2606.18613#S5.SS1.p1.2),[Table 2](https://arxiv.org/html/2606.18613#S5.T2.8.17.9.1)\.
- A\. Fansi Tchango, R\. Goel, Z\. Wen, J\. Martel, and J\. Ghosn \(2022\)Ddxplus: a new dataset for automatic medical diagnosis\.Advances in neural information processing systems35,pp\. 31306–31318\.Cited by:[§2\.2](https://arxiv.org/html/2606.18613#S2.SS2.p1.1)\.
- M\. Gerber and J\. Chai \(2010\)Beyond nombank: a study of implicit arguments for nominal predicates\.InProceedings of the 48th Annual Meeting of the Association for Computational Linguistics,pp\. 1583–1592\.Cited by:[§2\.4](https://arxiv.org/html/2606.18613#S2.SS4.p1.1)\.
- GLM\-5 Team \(2026\)GLM\-5: from vibe coding to agentic engineering\.arXiv preprint arXiv:2602\.15763\.Cited by:[§5\.1](https://arxiv.org/html/2606.18613#S5.SS1.p1.2),[Table 2](https://arxiv.org/html/2606.18613#S5.T2.8.17.9.1)\.
- Google DeepMind \(2025\)Gemini 2\.5: pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities\.arXiv preprint arXiv:2507\.06261\.Cited by:[Table 2](https://arxiv.org/html/2606.18613#S5.T2.8.11.3.1)\.
- Google \(2026\)Gemini 3\.1 Pro: a smarter model for your most complex tasks\.Note:[https://blog\.google/innovation\-and\-ai/models\-and\-research/gemini\-models/gemini\-3\-1\-pro/](https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/)Cited by:[§5\.1](https://arxiv.org/html/2606.18613#S5.SS1.p1.2)\.
- Y\. Jiang, K\. C\. Black, G\. Geng, D\. Park, J\. Zou, A\. Y\. Ng, and J\. H\. Chen \(2025\)MedAgentBench: a virtual ehr environment to benchmark medical llm agents\.Nejm Ai2\(9\),pp\. AIdbp2500144\.Cited by:[Table 1](https://arxiv.org/html/2606.18613#S1.T1.1.12.11.1),[§2\.1](https://arxiv.org/html/2606.18613#S2.SS1.p2.1),[§3\.2](https://arxiv.org/html/2606.18613#S3.SS2.SSS0.Px5.p1.1)\.
- Q\. Jin, B\. Dhingra, Z\. Liu, W\. Cohen, and X\. Lu \(2019\)Pubmedqa: a dataset for biomedical research question answering\.InProceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing \(EMNLP\-IJCNLP\),pp\. 2567–2577\.Cited by:[Table 1](https://arxiv.org/html/2606.18613#S1.T1.1.2.1.1),[§2\.1](https://arxiv.org/html/2606.18613#S2.SS1.p2.1)\.
- A\. E\. Johnson, L\. Bulgarelli, L\. Shen, A\. Gayles, A\. Shammout, S\. Horng, T\. J\. Pollard, S\. Hao, B\. Moody, B\. Gow,et al\.\(2023\)MIMIC\-iv, a freely accessible electronic health record dataset\.Scientific data10\(1\),pp\. 1\.Cited by:[§A\.1](https://arxiv.org/html/2606.18613#A1.SS1.p1.1),[§A\.2](https://arxiv.org/html/2606.18613#A1.SS2.p1.1),[§A\.3](https://arxiv.org/html/2606.18613#A1.SS3.p1.1),[Appendix B](https://arxiv.org/html/2606.18613#A2.p1.1),[§1](https://arxiv.org/html/2606.18613#S1.p4.1)\.
- Kimi Team \(2025\)Kimi K2: open agentic intelligence\.arXiv preprint arXiv:2507\.20534\.Cited by:[§5\.1](https://arxiv.org/html/2606.18613#S5.SS1.p1.2)\.
- Kimi Team \(2026\)Kimi K2\.6: advancing open\-source coding\.Note:[https://www\.kimi\.com/blog/kimi\-k2\-6](https://www.kimi.com/blog/kimi-k2-6)Accessed: 2026\-05\-24Cited by:[Table 2](https://arxiv.org/html/2606.18613#S5.T2.8.17.9.1)\.
- V\. Kolhatkar and G\. Hirst \(2014\)Resolving shell nouns\.InProceedings of the 2014 conference on empirical methods in natural language processing \(EMNLP\),pp\. 499–510\.Cited by:[§2\.4](https://arxiv.org/html/2606.18613#S2.SS4.p1.1)\.
- S\. Krishna, K\. Krishna, A\. Mohananey, S\. Schwarcz, A\. Stambler, S\. Upadhyay, and M\. Faruqui \(2025\)Fact, fetch, and reason: a unified evaluation of retrieval\-augmented generation\.InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),pp\. 4745–4759\.Cited by:[§2\.3](https://arxiv.org/html/2606.18613#S2.SS3.p1.1)\.
- T\. H\. Kung, M\. Cheatham, A\. Medenilla, C\. Sillos, L\. D\. Leon, C\. Elepaño, M\. Madriaga, R\. Aggabao, G\. Diaz\-Candido, J\. Maningo, and V\. Tseng \(2023\)Performance of ChatGPT on USMLE: potential for AI\-assisted medical education using large language models\.PLOS Digital Health2\(2\),pp\. e0000198\.External Links:[Document](https://dx.doi.org/10.1371/journal.pdig.0000198)Cited by:[§1](https://arxiv.org/html/2606.18613#S1.p1.1)\.
- S\. Kweon, J\. Kim, H\. Kwak, D\. Cha, H\. Yoon, K\. Kim, S\. Won, and E\. Choi \(2024\)Ehrnoteqa: a patient\-specific question answering benchmark for evaluating large language models in clinical settings\.arXiv preprint arXiv:2402\.16040\.Cited by:[Table 1](https://arxiv.org/html/2606.18613#S1.T1.1.6.5.1),[§2\.1](https://arxiv.org/html/2606.18613#S2.SS1.p2.1)\.
- P\. Kyle \(2025\)AI opportunities action plan\.Technical reportUK Government\.External Links:[Link](https://www.gov.uk/government/publications/ai-opportunities-action-plan)Cited by:[§1](https://arxiv.org/html/2606.18613#S1.p1.1)\.
- P\. Laban, H\. Hayashi, Y\. Zhou, and J\. Neville \(2025\)Llms get lost in multi\-turn conversation\.arXiv preprint arXiv:2505\.06120\.Cited by:[§1](https://arxiv.org/html/2606.18613#S1.p1.1),[§1](https://arxiv.org/html/2606.18613#S1.p2.1)\.
- G\. Lee, E\. Bach, E\. Yang, T\. Pollard, A\. Johnson, E\. Choi, Y\. Jia, and J\. H\. Lee \(2025\)FHIR\-AgentBench: benchmarking LLM agents for realistic interoperable EHR question answering\.InProceedings of Machine Learning for Health \(ML4H\),Proceedings of Machine Learning Research, Vol\.297\.Note:arXiv:2509\.19319Cited by:[Table 1](https://arxiv.org/html/2606.18613#S1.T1.1.13.12.1),[§2\.1](https://arxiv.org/html/2606.18613#S2.SS1.p2.1),[§3\.2](https://arxiv.org/html/2606.18613#S3.SS2.SSS0.Px5.p1.1)\.
- G\. Lee, H\. Hwang, S\. Bae, Y\. Kwon, W\. Shin, S\. Yang, M\. Seo, J\. Kim, and E\. Choi \(2022\)Ehrsql: a practical text\-to\-sql benchmark for electronic health records\.Advances in Neural Information Processing Systems35,pp\. 15589–15601\.Cited by:[Table 1](https://arxiv.org/html/2606.18613#S1.T1.1.3.2.1),[§2\.1](https://arxiv.org/html/2606.18613#S2.SS1.p2.1)\.
- K\. Lee, L\. He, M\. Lewis, and L\. Zettlemoyer \(2017\)End\-to\-end neural coreference resolution\.InProceedings of the 2017 conference on empirical methods in natural language processing,pp\. 188–197\.Cited by:[§2\.4](https://arxiv.org/html/2606.18613#S2.SS4.p1.1)\.
- K\. Lee, L\. He, and L\. Zettlemoyer \(2018\)Higher\-order coreference resolution with coarse\-to\-fine inference\.InProceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 \(Short Papers\),pp\. 687–692\.Cited by:[§2\.4](https://arxiv.org/html/2606.18613#S2.SS4.p1.1)\.
- M\. Li, Y\. Zhao, B\. Yu, F\. Song, H\. Li, H\. Yu, Z\. Li, F\. Huang, and Y\. Li \(2023a\)Api\-bank: a comprehensive benchmark for tool\-augmented llms\.InProceedings of the 2023 conference on empirical methods in natural language processing,pp\. 3102–3116\.Cited by:[§2\.3](https://arxiv.org/html/2606.18613#S2.SS3.p1.1)\.
- Z\. Li, H\. Zhu, Z\. Lu, and M\. Yin \(2023b\)Synthetic data generation with large language models for text classification: potential and limitations\.InProceedings of the 2023 conference on empirical methods in natural language processing,pp\. 10443–10461\.Cited by:[§2\.2](https://arxiv.org/html/2606.18613#S2.SS2.p1.1)\.
- Y\. Liu, Z\. I\. Carrero, X\. Jiang, D\. Ferber, G\. Wölflein, L\. Zhang, S\. Jayabalan, T\. Lenz, Z\. Hui, and J\. N\. Kather \(2026\)Benchmarking large language model\-based agent systems for clinical decision tasks\.npj Digital Medicine\.Cited by:[§3\.2](https://arxiv.org/html/2606.18613#S3.SS2.SSS0.Px5.p1.1)\.
- L\. Luo, S\. E\. Kim, X\. Zhang, J\. M\. Kernbach, R\. Kenia, J\. N\. Acosta, L\. A\. Nathanson, A\. D\. Haimovich, A\. Rodman, E\. Goh,et al\.\(2026\)A clinical environment simulator for dynamic ai evaluation\.Nature medicine,pp\. 1–8\.Cited by:[§2\.1](https://arxiv.org/html/2606.18613#S2.SS1.p3.1)\.
- J\. C\. Mandel, D\. A\. Kreda, K\. D\. Mandl, I\. S\. Kohane, and R\. B\. Ramoni \(2016\)SMART on fhir: a standards\-based, interoperable apps platform for electronic health records\.Journal of the american medical informatics association23\(5\),pp\. 899–908\.Cited by:[§3\.1](https://arxiv.org/html/2606.18613#S3.SS1.p1.1)\.
- A\. Marasović, L\. Born, J\. Opitz, and A\. Frank \(2017\)A mention\-ranking model for abstract anaphora resolution\.InProceedings of the 2017 Conference on Empirical Methods in Natural Language Processing,pp\. 221–232\.Cited by:[§2\.4](https://arxiv.org/html/2606.18613#S2.SS4.p1.1)\.
- N\. Mehandru, N\. Golchini, D\. Bamman, T\. Zack, M\. F\. Molina, and A\. Alaa \(2025\)Er\-reason: a benchmark dataset for llm\-based clinical reasoning in the emergency room\.arXiv preprint arXiv:2505\.22919\.Cited by:[Table 1](https://arxiv.org/html/2606.18613#S1.T1.1.8.7.1),[§2\.1](https://arxiv.org/html/2606.18613#S2.SS1.p2.1)\.
- MiniMax \(2026\)MiniMax M2\.7: early echoes of self\-evolution\.Note:[https://www\.minimax\.io/news/minimax\-m27\-en](https://www.minimax.io/news/minimax-m27-en)Cited by:[§5\.1](https://arxiv.org/html/2606.18613#S5.SS1.p1.2),[Table 2](https://arxiv.org/html/2606.18613#S5.T2.8.17.9.1)\.
- M\. Moor, O\. Banerjee, Z\. S\. H\. Abad, H\. M\. Krumholz, J\. Leskovec, E\. J\. Topol, and P\. Rajpurkar \(2023\)Foundation models for generalist medical artificial intelligence\.Nature616,pp\. 259–265\.External Links:[Document](https://dx.doi.org/10.1038/s41586-023-05881-4)Cited by:[§1](https://arxiv.org/html/2606.18613#S1.p1.1)\.
- NHS England\-South East \(2024\)Digital access – a ‘front door’ to the NHS\.Technical reportNHS England\.Cited by:[§1](https://arxiv.org/html/2606.18613#S1.p1.1)\.
- Y\. Qin, S\. Liang, Y\. Ye, K\. Zhu, L\. Yan, Y\. Lu, Y\. Lin, X\. Cong, X\. Tang, B\. Qian,et al\.\(2023\)Toolllm: facilitating large language models to master 16000\+ real\-world apis\.arXiv preprint arXiv:2307\.16789\.Cited by:[§2\.3](https://arxiv.org/html/2606.18613#S2.SS3.p1.1)\.
- Qwen Team \(2025\)Qwen3 technical report\.arXiv preprint arXiv:2505\.09388\.Cited by:[§5\.1](https://arxiv.org/html/2606.18613#S5.SS1.p1.2)\.
- Qwen Team \(2026\)Qwen3\.5\-omni technical report\.arXiv preprint arXiv:2604\.15804\.Cited by:[Table 2](https://arxiv.org/html/2606.18613#S5.T2.8.17.9.1)\.
- P\. Rajpurkar, E\. Chen, O\. Banerjee, and E\. J\. Topol \(2022\)AI in health and medicine\.Nature Medicine28,pp\. 31–38\.External Links:[Document](https://dx.doi.org/10.1038/s41591-021-01614-0)Cited by:[§1](https://arxiv.org/html/2606.18613#S1.p1.1)\.
- S\. Schmidgall, R\. Ziaei, C\. Harris,et al\.\(2026\)AgentClinic: a multimodal benchmark for tool\-using clinical AI agents\.npj Digital Medicine\.External Links:[Document](https://dx.doi.org/10.1038/s41746-026-02674-7)Cited by:[Table 1](https://arxiv.org/html/2606.18613#S1.T1.1.14.13.1),[§2\.1](https://arxiv.org/html/2606.18613#S2.SS1.p2.1),[§5\.4](https://arxiv.org/html/2606.18613#S5.SS4.p2.1)\.
- S\. Seo and G\. G\. Lee \(2024\)DiagESC: dialogue synthesis for integrating depression diagnosis into emotional support conversation\.InProceedings of the 25th Annual Meeting of the Special interest Group on Discourse and Dialogue,pp\. 686–698\.Cited by:[§2\.2](https://arxiv.org/html/2606.18613#S2.SS2.p1.1)\.
- W\. Shi, R\. Xu, Y\. Zhuang, Y\. Yu, J\. Zhang, H\. Wu, Y\. Zhu, J\. C\. Ho, C\. Yang, and M\. D\. Wang \(2024\)EHRAgent: code empowers large language models for few\-shot complex tabular reasoning on electronic health records\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 22315–22339\.External Links:[Link](https://aclanthology.org/2024.emnlp-main.1245/),[Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.1245)Cited by:[Table 1](https://arxiv.org/html/2606.18613#S1.T1.1.5.4.1),[§2\.1](https://arxiv.org/html/2606.18613#S2.SS1.p2.1)\.
- A\. Singh, A\. Fry, A\. Perelman, A\. Tart, A\. Ganesh, A\. El\-Kishky, A\. McLaughlin, A\. Low, A\. Ostrow, A\. Ananthram,et al\.\(2025\)Openai gpt\-5 system card\.arXiv preprint arXiv:2601\.03267\.Cited by:[§5\.1](https://arxiv.org/html/2606.18613#S5.SS1.p1.2),[Table 2](https://arxiv.org/html/2606.18613#S5.T2.8.11.3.1)\.
- K\. Singhal, T\. Tu, J\. Gottweis, R\. Sayres, E\. Wulczyn, L\. Hou, K\. Clark, S\. Pfohl, H\. Cole\-Lewis, D\. Neal, M\. Seneviratne, P\. Gamble, C\. Kelly, N\. Schärli, A\. Chowdhery, P\. Mansfield, B\. Agüera y Arcas, D\. Webster, G\. S\. Corrado, Y\. Matias, K\. Concannon, Y\. Liu, S\. Ghosh, V\. Natarajan, A\. Karthikesalingam, and J\. Barral \(2023\)Towards expert\-level medical question answering with large language models\.Computing Research RepositoryarXiv:2305\.09617\.External Links:[Link](https://arxiv.org/abs/2305.09617)Cited by:[§1](https://arxiv.org/html/2606.18613#S1.p1.1)\.
- X\. Tang, D\. Shao, J\. Sohn, J\. Chen, J\. Zhang, J\. Xiang, F\. Wu, Y\. Zhao, C\. Wu, W\. Shi,et al\.\(2025\)Medagentsbench: benchmarking thinking models and agent frameworks for complex medical reasoning\.arXiv preprint arXiv:2503\.07459\.Cited by:[Table 1](https://arxiv.org/html/2606.18613#S1.T1.1.11.10.1),[§2\.1](https://arxiv.org/html/2606.18613#S2.SS1.p2.1)\.
- A\. J\. Thirunavukarasu, D\. S\. J\. Ting, K\. Elangovan, L\. Gutierrez, T\. F\. Tan, and D\. S\. W\. Ting \(2023\)Large language models in medicine\.Nature Medicine29,pp\. 1930–1940\.External Links:[Document](https://dx.doi.org/10.1038/s41591-023-02448-8)Cited by:[§1](https://arxiv.org/html/2606.18613#S1.p1.1)\.
- U\.S\. Food and Drug Administration \(2025\)Artificial intelligence in software as a medical device\.Note:Content current as of March 25, 2025; accessed May 25, 2026Cited by:[§1](https://arxiv.org/html/2606.18613#S1.p2.1)\.
- J\. Wang, Z\. Yao, Z\. Yang, H\. Zhou, R\. Li, X\. Wang, Y\. Xu, and H\. Yu \(2024\)Notechat: a dataset of synthetic patient\-physician conversations conditioned on clinical notes\.InFindings of the Association for Computational Linguistics: ACL 2024,pp\. 15183–15201\.Cited by:[§2\.2](https://arxiv.org/html/2606.18613#S2.SS2.p1.1)\.
- S\. Wang, Z\. Tang, H\. Yang, Q\. Gong, T\. Gu, H\. Ma,et al\.\(2025\)A novel evaluation benchmark for medical LLMs illuminating safety and effectiveness in clinical domains\.npj Digital Medicine\.External Links:[Document](https://dx.doi.org/10.1038/s41746-025-02277-8)Cited by:[Table 1](https://arxiv.org/html/2606.18613#S1.T1.1.9.8.1),[§2\.1](https://arxiv.org/html/2606.18613#S2.SS1.p2.1)\.
- World Health Organization \(2021\)Ethics and governance of artificial intelligence for health: who guidance\.Note:[https://www\.who\.int/publications/i/item/9789240029200](https://www.who.int/publications/i/item/9789240029200)Published June 28, 2021; accessed May 25, 2026External Links:ISBN 9789240029200Cited by:[§1](https://arxiv.org/html/2606.18613#S1.p2.1)\.
- R\. Xu, Y\. Zhuang, Y\. Zhong, Y\. Yu, Z\. Wang, X\. Tang, H\. Wu, M\. D\. Wang, P\. Ruan, D\. Yang, T\. Wang, G\. Xiao, X\. Liu, C\. Yang, Y\. Xie, and W\. Shi \(2026a\)MedAgentGym: a scalable agentic training environment for code\-centric reasoning in biomedical data science\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=jHDZEUgS4r)Cited by:[Table 1](https://arxiv.org/html/2606.18613#S1.T1.1.15.14.1),[§2\.1](https://arxiv.org/html/2606.18613#S2.SS1.p2.1)\.
- S\. Xu, Y\. Wang, X\. Jia, Z\. Wu, K\. Liu, and A\. X\. Dong \(2026b\)RCBSF: a multi\-agent framework for automated contract revision via stackelberg game\.arXiv preprint arXiv:2604\.10740\.Cited by:[§2\.2](https://arxiv.org/html/2606.18613#S2.SS2.p1.1)\.
- P\. Yu, W\. Liu, Y\. Yang, J\. Li, Z\. Zhang, X\. Feng, and F\. Zhang \(2026\)Benchmarking llm tool\-use in the wild\.arXiv preprint arXiv:2604\.06185\.Cited by:[§2\.3](https://arxiv.org/html/2606.18613#S2.SS3.p1.1)\.
- A\. Zhang, T\. Ding, S\. J\. Wagner, C\. Tian, M\. Y\. Lu, R\. Pettit, J\. E\. Lewis, A\. Misrahi, D\. Mo, L\. P\. Le,et al\.\(2026\)A multimodal and temporal foundation model for virtual patient representations at healthcare system scale\.arXiv Preprint\.Cited by:[§2\.2](https://arxiv.org/html/2606.18613#S2.SS2.p1.1)\.
- H\. Zhang, Y\. Liu, J\. Yang, S\. Wan, X\. Wang, W\. Peng, and P\. Fua \(2025\)Lefusion: controllable pathology synthesis via lesion\-focused diffusion models\.InInternational Conference on Learning Representations,Vol\.2025,pp\. 13232–13253\.Cited by:[§2\.2](https://arxiv.org/html/2606.18613#S2.SS2.p1.1)\.
- S\. Zhou, W\. Xie, J\. Li, Z\. Zhan, M\. Song, H\. Yang, C\. Espinoza, L\. Welton, X\. Mai, Y\. Jin,et al\.\(2025\)Automating expert\-level medical reasoning evaluation of large language models\.npj Digital Medicine\.Cited by:[Table 1](https://arxiv.org/html/2606.18613#S1.T1.1.10.9.1),[§2\.1](https://arxiv.org/html/2606.18613#S2.SS1.p2.1)\.
## Appendix AData, Licensing, and Ethical Considerations
### A\.1Data Licence and Distribution
PhysAssistBench is built on MIMIC\-IV Johnson et al\. \([2023](https://arxiv.org/html/2606.18613#bib.bib4)\), released on PhysioNet under the PhysioNet Credentialed Health Data Licence 1\.5\.0\. Access requires completing a recognised human\-subjects research training programme \(e\.g\., the CITI “Data or Specimens Only Research” course\) and signing a data\-use agreement on PhysioNet\. Because every PhysAssistBench session is a derivative of MIMIC\-IV patient records, the benchmark session files inherit the same licence conditions and may not be redistributed publicly\.
We release all artefacts independently under separate terms:
- Code and evaluation harness \(tool API, judge prompts, scoring scripts\): MIT Licence, freely distributable without restriction\.
- Benchmark session files \(JSONL containing patient\-derived data\): distributed through a PhysioNet\-linked repository; requesters must hold an active MIMIC\-IV data\-use agreement before access is granted\.
- Pre\-computed evaluation results \(model outputs, rubric scores\): released openly as they contain no patient\-level data\.
### A\.2Consistency with Intended Use
MIMIC\-IV was created to support clinical and translational research, quality improvement, and the development of clinical decision\-support tools Johnson et al\. \([2023](https://arxiv.org/html/2606.18613#bib.bib4)\)\. Constructing a benchmark to evaluate LLMs on physician–EHR interaction is fully consistent with that intent\.
We impose the following restrictions on downstream use of PhysAssistBench:
- •The benchmark is intended for *research and evaluation only* and must not be deployed in real clinical workflows\.
- •Benchmark sessions must not be used as training data for evaluated models, to prevent leaderboard contamination\.
- •Derivative works must respect the original PhysioNet access conditions and must not be used outside research contexts\.
### A\.3Privacy and De\-identification
MIMIC\-IV is de\-identified by the MIT Laboratory for Computational Physiology following HIPAA Safe Harbor standards Johnson et al\. \([2023](https://arxiv.org/html/2606.18613#bib.bib4)\):
- •Direct identifiers removed: patient names, geographic subdivisions finer than state, telephone numbers, and device identifiers are absent from the released data\.
- •Date shifting: all dates are offset by a random, per\-patient amount anchored to the patient’s anchor\_year, making true calendar dates unrecoverable\.
- •Age capping: patients older than 89 are grouped into a single 90\+ anchor\-year bin to prevent age\-based re\-identification\.
PhysAssistBench preserves all of these protections and introduces no additional identifiers\. Session files contain MIMIC\-IV subject\_id values, which are surrogate keys with no mapping to real\-world identities outside the MIMIC\-IV access\-controlled environment\. Clinical note content reproduced in sessions \(discharge summaries, radiology reports\) retains no names or explicit dates\.
Despite these protections, re\-identification risk is non\-zero: discharge summaries may describe unusual disease combinations that, combined with external auxiliary information, could narrow a patient’s identity\. We therefore require credentialed access for all session files, consistent with PhysioNet terms\.
We additionally note that MIMIC\-IV clinical text reflects real physician language, which may contain implicit demographic biases or culturally specific clinical framing\. Systematic content auditing beyond the PhysioNet de\-identification pipeline was not performed; this is acknowledged as a limitation in §[6](https://arxiv.org/html/2606.18613#S6)\.
### A\.4Artefact Documentation and Coverage
#### Domain and source\.
All patient data originate from Beth Israel Deaconess Medical Center \(BIDMC\), Boston, MA, USA, collected approximately 2008–2019\. BIDMC is a large academic tertiary\-care centre with a heavy ICU and internal\-medicine case mix\. Benchmark performance may not generalise to community hospitals, non\-US healthcare systems, or populations with substantially different disease prevalence\.
#### Languages\.
All sessions are provided in English and Mandarin Chinese\. English queries are generated directly from MIMIC\-IV data; Chinese queries are produced via a round\-trip translation pipeline with semantic verification\. Underlying clinical notes \(discharge summaries, radiology reports\) remain in English throughout, as MIMIC\-IV contains no Chinese\-language source records\. Patient\-simulation responses are available in both languages\.
### A\.5Dataset Statistics
Full statistics are reported in Appendix [G](https://arxiv.org/html/2606.18613#A7)\. In brief: PhysAssistBench comprises 324 sessions and 1,296 turns per language \(2,592 turn instances across EN and ZH combined\)\. Sessions are distributed equally across four scenarios and three difficulty levels \(27 sessions per scenario\-difficulty cell\)\. There is no train/development split: PhysAssistBench is a pure evaluation benchmark\. Implicitness subtypes \(na, pe, ae\) are distributed approximately uniformly across Turns 1–3\.
## Appendix BData and Code Availability
Our dataset is derived from MIMIC\-IV Johnson et al\. \([2023](https://arxiv.org/html/2606.18613#bib.bib4)\) under the PhysioNet Credentialed Health Data License\. Following the re\-distribution requirements of the original license, the dataset will be released via PhysioNet\. Access requires a valid PhysioNet credentialed account and acceptance of the corresponding data use agreement\. The released dataset includes: \(1\) de\-identified EHR snapshots extracted from MIMIC\-IV admissions used as session inputs; \(2\) per\-subject patient records partitioned by subject\_id, containing structured clinical data \(diagnoses, medications, observations, procedures\); \(3\) 324 annotated clinical sessions \(1,296 turns\) in both English and Chinese; \(4\) gold\-standard tool\-call trajectories and reference answers for each turn; and \(5\) evaluation rubrics per turn with per\-item pass/fail annotations\. All protected health information \(PHI\) has been removed\.
The codebase will be released on GitHub under the Apache License 2\.0\. It will include scripts to extract and partition raw MIMIC\-IV records by subject\_id, convert them into EHR snapshots, run the scalable multi\-agent synthetic data pipeline for constructing agentic patients \(Section [4](https://arxiv.org/html/2606.18613#S4)\), reproduce reported evaluations, and provide sample data for setup without requiring full MIMIC\-IV access\.
## Appendix CHuman Evaluation Details
### C\.1Expert Reviewer Backgrounds
#### Clinical Expert\.
We recruited one volunteer senior physician \(female\), board\-certified in her country of practice and with 15 years of clinical experience\. Participation was voluntary and uncompensated \(no monetary reward\)\. The participant used English and Chinese as the languages of instruction\.
#### Trained Annotators\.
We recruited 8 volunteer trained annotators \(two female, six male\), each with several years of clinical AI and NLP research experience\. Those with more than 8 years of NLP research experience are regarded as NLP experts\. Participation was voluntary and uncompensated \(no monetary reward\)\. The participants used English and Chinese as the languages of instruction\.
### C\.2Two\-Stage Review Process
The dataset was reviewed by the eight trained annotators and validated by the physician described above, in two stages\.
#### Stage 1: Clinical Plausibility Review\.
After the pipeline produced the initial dataset, we sampled 25% of the sessions for clinical review\. The clinical expert assessed each sampled dialogue for clinical plausibility, i\.e\., whether the interaction could realistically occur in practice and whether the content contained any internal medical contradictions\. On the sampled subset, 95% of sessions were judged clinically plausible\. Two NLP experts then collected the sessions flagged as implausible, distilled the clinical expert’s comments into additional pipeline prompt constraints, and regenerated those sessions\.
#### Stage 2: Field\-Level Verification\.
In the second stage, all eight trained annotators verified the correctness of each annotated field of every session, covering both the previously approved sessions and the regenerated ones\. Three fields were checked: \(i\) the implicitness type, i\.e\., whether the assigned type matches the actual implicit query; \(ii\) the rubric, i\.e\., whether each item corresponds one\-to\-one with its gold answer; and \(iii\) the gold tool calls, i\.e\., whether their parameters are correct\. The implicitness\-type assignment matched the query in 87% of cases, and only 8% of rubric items required revision\. The annotators then reviewed and corrected all flagged fields\.
### C\.3Judger Model Reliability
To validate the reliability of the GPT\-5\.4\-mini judge, we measured agreement between its rubric scores and human expert annotations on a sample of 128 turns drawn from 32 sessions across all four scenarios\. One human expert and the judge scored each turn independently on the same rubric items\. The judge achieved an overall item\-level agreement of 94% with the human annotations, indicating that its scoring closely tracks expert clinical judgment on this sample\.
## Appendix DEHR Tool Inventory
EHRToolBench exposes 18 tools to the evaluated agent, organized into four groups: EHR read tools, EHR write tools, patient\-interview tools, and one control tool\. All EHR tools follow the FHIR R4 naming convention \(ResourceType\.operation\); patient\-interview tools use the patient\.\* namespace\. Each tool call requires a subject\_id \(MIMIC\-IV patient identifier\); most read tools also accept an optional hadm\_id to scope results to a single admission\.
### D\.1EHR Read Tools \(9\)
Table D1: EHR read tools \(FHIR R4 naming\)\. All tools require subject\_id; most accept an optional hadm\_id to scope results to a single hospital admission\.

As shown in Table [D1](https://arxiv.org/html/2606.18613#A4.T1), these 9 read\-only tools expose MIMIC\-IV structured records via a FHIR R4–compatible interface, covering patient demographics, encounters, diagnoses, observations \(lab, vitals, microbiology\), medications, diagnostic reports, clinical notes, and care plans\. Observation\.search is the most heavily used tool, unifying laboratory, vital\-sign, and microbiology data under one parameterised call\.
### D\.2EHR Write Tools \(3\)
Table D2: EHR write tools\. All tools additionally require subject\_id\.

As shown in Table [D2](https://arxiv.org/html/2606.18613#A4.T2), these 3 tools simulate EHR write operations and are used primarily in Write/Update turns \(T3\) of the Discharge Planning scenario, and also appear in Diagnostic Workup, Medication Safety, and Treatment Response turns that require ordering or flagging actions\.
### D\.3Patient Interview Tools \(5\)
Table D3: Patient\-interview tools \(patient\.\* namespace\)\. All tools require subject\_id and session\_id\.

As shown in Table [D3](https://arxiv.org/html/2606.18613#A4.T3), patient\-interview tools expose a simulated patient agent to the evaluated model\. Each call is routed to a patient LLM that generates a natural\-language response grounded in the patient’s MIMIC\-IV record\. All patient tools require subject\_id and session\_id\.
### D\.4Control Tool \(1\)
| Tool name | Description |
| --- | --- |
| prepare\_to\_answer | Signals that sufficient information has been gathered and the agent is ready to produce the final answer to the user\. The optional answer\_type parameter distinguishes tool\-grounded answers \(tool\) from knowledge\-only responses \(chat\)\. This tool must appear as the final action of every turn; omitting it is counted as an incomplete plan\. |

Table D4: Control tool\. No mandatory parameters\.

prepare\_to\_answer \(Table [D4](https://arxiv.org/html/2606.18613#A4.T4)\) is a mandatory bookkeeping call that every turn must end with\. It separates the *information\-gathering* phase from the *answer\-generation* phase: the benchmark records whether the agent issues this call, and omitting it is penalised as an incomplete plan regardless of the quality of the final answer\.
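The structural requirements stated in this appendix \(EHR and patient tool calls carry subject\_id, patient\.\* tools also carry session\_id, and each turn ends with prepare\_to\_answer\) can be checked mechanically\. The following is a minimal Python sketch, not the benchmark’s scoring code: the function name check\_turn\_actions and the dictionary layout of a tool call \(name, arguments\) are assumptions made for illustration\.

```python
from typing import Any

CONTROL_TOOL = "prepare_to_answer"


def check_turn_actions(actions: list[dict[str, Any]]) -> list[str]:
    """Return the structural problems found in one turn's tool calls.

    Each action is assumed (hypothetically) to look like
    {"name": "Observation.search", "arguments": {...}}.
    """
    problems: list[str] = []

    # The control tool must be the final action of every turn.
    if not actions or actions[-1]["name"] != CONTROL_TOOL:
        problems.append("incomplete plan: turn does not end with prepare_to_answer")

    for call in actions:
        name = call["name"]
        args = call.get("arguments", {})
        if name == CONTROL_TOOL:
            continue  # the control tool has no mandatory parameters
        if "subject_id" not in args:
            problems.append(f"{name}: missing subject_id")
        # Patient-interview tools additionally need the session identifier.
        if name.startswith("patient.") and "session_id" not in args:
            problems.append(f"{name}: missing session_id")
    return problems


# Illustrative turn; the identifier value is made up.
example_turn = [
    {"name": "Observation.search", "arguments": {"subject_id": 1, "item_name": "creatinine"}},
    {"name": "prepare_to_answer", "arguments": {"answer_type": "tool"}},
]
example_problems = check_turn_actions(example_turn)
```

An empty result means the turn satisfies these three structural rules; it says nothing about whether the chosen tools are clinically appropriate\.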
### D\.5Tool Usage by Scenario
| Group | Tool | Diag\. Workup | Discharge Plan | Med\. Safety | Treat\. Response |
| --- | --- | --- | --- | --- | --- |
| EHR Read | Patient\.read | ✓ | ✗ | ✓ | ✗ |
| EHR Read | Encounter\.search | ✗ | ✓ | ✗ | ✗ |
| EHR Read | Condition\.search | ✓ | ✓ | ✓ | ✓ |
| EHR Read | Observation\.search | ✓ | ✓ | ✓ | ✓ |
| EHR Read | MedicationRequest\.search | ✓ | ✓ | ✓ | ✓ |
| EHR Read | MedicationAdministration\.search | ✓ | ✓ | ✓ | ✓ |
| EHR Read | DiagnosticReport\.search | ✓ | ✓ | ✗ | ✗ |
| EHR Read | DocumentReference\.search | ✗ | ✓ | ✗ | ✗ |
| EHR Read | CarePlan\.search | ✗ | ✓ | ✗ | ✗ |
| EHR Write | MedicationRequest\.create | ✓ | ✓ | ✓ | ✓ |
| EHR Write | ServiceRequest\.create | ✓ | ✓ | ✗ | ✓ |
| EHR Write | Flag\.create | ✓ | ✓ | ✓ | ✓ |
| Patient Interview | patient\.get\_symptom\_history | ✓ | ✓ | ✓ | ✓ |
| Patient Interview | patient\.get\_medication\_adherence | ✓ | ✓ | ✓ | ✓ |
| Patient Interview | patient\.get\_social\_history | ✓ | ✓ | ✓ | ✗ |
| Patient Interview | patient\.get\_functional\_status | ✗ | ✓ | ✗ | ✗ |
| Patient Interview | patient\.get\_pain\_assessment | ✗ | ✗ | ✗ | ✓ |
| Control | prepare\_to\_answer | ✓ | ✓ | ✓ | ✓ |

Table D5: Tool usage across the four benchmark scenarios\. ✓ = appears in the gold\-standard trajectory of at least one session; ✗ = not required by any gold trajectory in that scenario\.

Table [D5](https://arxiv.org/html/2606.18613#A4.T5) shows which tools appear as gold\-standard actions in at least one session of each benchmark scenario\. During *data generation*, the reference trajectories were constructed under scenario\-specific constraints: only clinically relevant tools were included in the gold plans to ensure grounded and parsimonious annotations\. During *evaluation*, however, all 18 tools are exposed to the model without restriction, so that the benchmark can assess whether the model selects the appropriate tools, avoids unnecessary calls, and handles the full tool vocabulary rather than a curated subset\.
## Appendix EPhysician Query Implicitness Example
Table [E1](https://arxiv.org/html/2606.18613#A5.T1) shows representative examples of the three physician\-query implicitness types\.
Table E1: The three physician\-query implicitness types in PhysAssistBench, illustrated with a CKD–metformin scenario\. In the Prior Context column, “→” denotes the tool output returned by the agent after executing the query\. For the ae row, prior turns are summarized directly and the earlier queries are omitted for brevity\. Explicit queries are semantically complete physician requests, whereas implicit queries omit recoverable information and more closely resemble real clinical conversation\. ae is generally more challenging than na and pe\.
## Appendix FFull Session Example
Figures [F1](https://arxiv.org/html/2606.18613#A6.F1) and [F2](https://arxiv.org/html/2606.18613#A6.F2) present two representative full\-session examples illustrating the tool calls, explicit and implicit query reformulations, and the corresponding evaluation rubrics\.
Legend \(Figures F1 and F2\): il Information Lookup, dg Data Gathering, cr Clinical Reasoning, wu Write/Update \|\| Impl\.: pe Predicate Ellipsis, na Nominal Anaphora, ae Abstract Event Anaphora, — Explicit

Figure F1: Session A \(discharge planning\), shown in both languages: \(a\) English and \(b\) Chinese\. Each turn lists both the *implicit* query actually posed to the model and its *explicit* paraphrase \(used in the explicit\-query ablation\), so the contrast is visible per turn: e\.g\., T1 “Creatinine?” vs\. “What’s the latest creatinine?” \(pe\), and T3 “reduce *it* to 250 mg” vs\. “reduce *metformin* to 250 mg” \(na\)\. The patient has HIV \(undetectable viral load\), type\-2 diabetes, and CKD\. The mixed dg turn \(T2\) is pivotal: the patient interview reveals that the patient has *silently stopped* metformin due to nausea, reframing the question from renal safety to tolerability, information absent from the structured EHR\. Patient responses \(italic quotes\) come from the patient\-simulation agent grounded in MIMIC\-IV discharge notes; the Chinese panel uses Chinese patient replies\. Tool calls \(monospace\) remain in English as issued against the FHIR API\.

Figure F2: Session B \(medication safety\), shown in both languages: \(a\) English and \(b\) Chinese\. Each turn lists both the *implicit* query and its *explicit* paraphrase\. The patient is a 51\-year\-old man with hyperkalemia \(K = 5\.2 mEq/L\)\. The four il/cr turns form a single safety thread in which each successive implicit question \(“*Given all this…*”, “*What about furosemide?*”\) refers back to the accumulating clinical picture, while the explicit column restates that context in full, exercising abstract\-event \(ae\) and predicate\-ellipsis \(pe\) implicitness\. The mixed turn \(T1\) confirms lisinopril adherence via the patient interview\. Patient responses \(italic quotes\) come from the patient\-simulation agent grounded in MIMIC\-IV discharge notes; the Chinese panel uses Chinese patient replies\. Tool calls \(monospace\) remain in English as issued against the FHIR API\.
## Appendix GBenchmark Statistics
Table[G1](https://arxiv.org/html/2606.18613#A7.T1)reports session counts by clinical scenario and data\-richness tier\. Table[G2](https://arxiv.org/html/2606.18613#A7.T2)reports turn counts by task type and turn position\. Table[G3](https://arxiv.org/html/2606.18613#A7.T3)reports implicitness\-type counts over Turns 1–3\.
Table G1: Session distribution by clinical scenario and data\-richness tier\.

Table G2: Turn distribution by task type and turn position\. Turn 0 is fixed as il; wu is restricted to Turn 3\.

Table G3: Implicitness\-type distribution over Turns 1–3 \(972 turns\)\. Turn 0 carries no implicitness type\.

PhysAssistBench contains 324 sessions and 1,296 turns drawn evenly from four clinical scenarios \(81 sessions each\) and three data\-richness tiers \(108 sessions each\), yielding balanced coverage across both dimensions\. Across all turns, Information Lookup accounts for 540 turns \(41\.7%\), reflecting its role as the fixed anchor turn \(Turn 0\) as well as its appearance in later positions; Data Gathering and Clinical Reasoning each contribute 324 turns \(25\.0%\), and Write/Update accounts for the remaining 108 turns \(8\.3%\), restricted to Turn 3\. Among the 972 turns carrying an implicitness type \(Turns 1–3\), the three types \(Nominal Anaphora, Predicate Ellipsis, and Abstract Event Anaphora\) are approximately uniformly distributed \(≈324 each\), enforced by a global balance counter during generation\.
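The counts above are mutually consistent, which a few lines of Python can confirm\. The snippet only re\-derives figures quoted in this appendix; the variable names are illustrative\.

```python
SESSIONS = 324
TURNS_PER_SESSION = 4  # Turns 0-3
total_turns = SESSIONS * TURNS_PER_SESSION

# Turn counts by task type, as quoted in the text.
task_turns = {"il": 540, "dg": 324, "cr": 324, "wu": 108}
assert sum(task_turns.values()) == total_turns

# Percentage share of each task type, rounded to one decimal place.
shares = {task: round(100 * n / total_turns, 1) for task, n in task_turns.items()}

sessions_per_scenario = SESSIONS // 4   # four clinical scenarios
sessions_per_tier = SESSIONS // 3       # three data-richness tiers

# Turn 0 carries no implicitness type, so three turns per session do.
implicit_turns = SESSIONS * 3
turns_per_implicit_type = implicit_turns / 3

print(total_turns, shares, sessions_per_scenario, sessions_per_tier,
      implicit_turns, turns_per_implicit_type)
```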
## Appendix HData Generation Pipeline
### H\.1Patient Pre\-Filtering
Patients are pre\-filtered in two stages\. Stage 1 applies file\-size thresholds to exclude patients with sparse records\. Stage 2 enforces scenario\-specific content criteria: each scenario requires that the relevant FHIR resource types contain sufficient records within the target admission \(e\.g\., at least two time\-stamped infection markers for treatment\_response; at least three active prescriptions and a discharge summary for discharge\_planning\)\. Data\-richness tier further refines eligibility: the *data\-moderate* tier requires at least two time\-stamped values for the same lab to support trend queries; the *data\-rich* tier additionally requires multiple drug\-lab monitoring pairs to be present\.
### H\.2EHR Snapshot
For each entry, all available data for the target patient\-admission is extracted from MIMIC\-IV and consolidated into a structured text snapshot injected into every downstream agent prompt\. The snapshot contains two key annotated blocks\. The Queryable Items block lists every queryable FHIR resource with data\-availability annotations: items with only a single record are restricted to Information Lookup turns, while items with two or more records are additionally eligible for trend\-based Data Gathering and Clinical Reasoning turns\. This constraint prevents agents from generating questions about data that does not exist in the patient record\. The Clinical Scoring Opportunities block identifies clinical scores computable from the available data \(e\.g\., SOFA, SIRS, CHA2DS2\-VASc\); the session planner is required to incorporate a scoring\-based turn when this block is present\.
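The data\-availability rule for the Queryable Items block amounts to a small mapping from record count to permitted task types\. The sketch below illustrates that rule under stated assumptions: the function name, the task\-type labels as strings, and the example record counts are hypothetical, and Write/Update turns are left out because the text does not tie them to record counts\.

```python
def eligible_task_types(record_count: int) -> set[str]:
    """Task types that may target a queryable item with `record_count` records."""
    if record_count < 1:
        return set()             # no data: no question may target this item
    if record_count == 1:
        return {"IL"}            # single record: Information Lookup only
    return {"IL", "DG", "CR"}    # two or more: trend-based DG and CR also allowed


# Hypothetical record counts for one admission (not real MIMIC-IV data).
queryable_items = {"creatinine": 4, "troponin": 1, "lactate": 0}
eligibility = {item: eligible_task_types(n) for item, n in queryable_items.items()}
```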
### H\.3Session Planner
Before turn\-level generation begins, the Session Planner produces a coherent four\-turn clinical narrative plan specifying, for each turn, the topic to investigate, the recommended tool call\(s\), and the tool source \(ehr/mixed/patient\)\. The planner operates under three layers of constraint injected into its system prompt\. Scenario constraints specify which FHIR resource types are permitted or forbidden per scenario, and which workup patterns are required\. Data\-richness tier constraints are summarised in Table [H1](https://arxiv.org/html/2606.18613#A8.T1)\. Tool diversity constraints require each session to span at least two distinct FHIR resource types, with any single resource type appearing in at most two turns\. At most one turn per session may be assigned a patient interview \(tool\_source=mixed or patient\)\.
Table H1: Data\-richness tier constraints injected into the session planner\.
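The tool\-diversity and patient\-interview constraints are simple enough to verify on a finished plan\. The following Python sketch shows one way to do so; it is not the authors’ planner code\. The turn layout \(a dict with tool\_source and a resource\_types list\) and the function name check\_session\_plan are assumptions for illustration\.

```python
from collections import Counter


def check_session_plan(turns: list[dict]) -> list[str]:
    """Check a four-turn plan against the diversity and interview limits."""
    problems: list[str] = []

    # Count, per FHIR resource type, the number of turns that use it.
    usage = Counter(r for turn in turns for r in set(turn["resource_types"]))

    if len(usage) < 2:
        problems.append("fewer than two distinct FHIR resource types")
    for resource, n_turns in usage.items():
        if n_turns > 2:
            problems.append(f"{resource} appears in {n_turns} turns (max 2)")

    # At most one turn may involve the patient interview.
    interviews = sum(turn["tool_source"] in ("mixed", "patient") for turn in turns)
    if interviews > 1:
        problems.append("more than one patient-interview turn")
    return problems


# Hypothetical plan that satisfies all three constraints.
example_plan = [
    {"tool_source": "ehr", "resource_types": ["Observation"]},
    {"tool_source": "ehr", "resource_types": ["Observation", "MedicationRequest"]},
    {"tool_source": "mixed", "resource_types": ["Condition"]},
    {"tool_source": "ehr", "resource_types": ["MedicationRequest"]},
]
example_plan_problems = check_session_plan(example_plan)
```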
### H\.4Two\-Stage Question Generation
#### Stage 1: Explicit question\.
The User Question Agent generates an unambiguous explicit question grounded in the EHR snapshot, guided by the Session Planner’s topic hint and task\-type\-specific generation rules\. Data Gathering questions must not pre\-state any lab value or drug name, ensuring the Planner Agent is required to retrieve data via tool calls rather than reading it from the question\.
#### Stage 2: Implicit transformation\.
A subtype\-specific ellipsis transform is applied to the explicit question\. NA \(Nominal Anaphora\) replaces named entities with pronouns or deictic expressions referring to prior\-turn mentions\. PE \(Predicate Ellipsis\) deletes the main predicate, leaving a noun\-phrase fragment that implies the same query action\. AE \(Abstract Event Anaphora\) compresses the preceding clinical situation into an abstract event expression\. When no suitable antecedent exists for the selected subtype, the pipeline falls back to PE\. When content\-word overlap between the explicit and transformed questions is zero, indicating LLM topic drift, the explicit question is reconstructed from the implicit form via a dedicated expansion agent\.
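The zero\-overlap trigger for the expansion agent can be approximated with a set intersection over content words\. The sketch below is one plausible reading, not the pipeline’s implementation: the tokenizer, the small stop\-word list, and the names content\_words and needs\_expansion are all assumptions, since the paper does not specify how content words are identified\.

```python
import re

# Hypothetical stop-word list; the actual definition of "content word" is not given.
STOPWORDS = frozenset({
    "a", "an", "the", "is", "are", "was", "be", "s", "it", "that", "this", "those",
    "what", "how", "does", "do", "can", "you", "her", "his", "she", "he", "on", "of",
    "to", "for", "in", "and", "or", "all", "any",
})


def content_words(text: str) -> set[str]:
    """Lower-cased alphanumeric tokens that are not stop words."""
    return {tok for tok in re.findall(r"[a-z0-9]+", text.lower()) if tok not in STOPWORDS}


def needs_expansion(explicit: str, transformed: str) -> bool:
    """True when the two questions share no content word (suspected topic drift)."""
    return not (content_words(explicit) & content_words(transformed))


# A faithful predicate-ellipsis rewrite keeps the topic noun, so no fallback is needed.
faithful = needs_expansion("What's the latest creatinine?", "Creatinine?")
# A rewrite that lost every content word would trigger the expansion agent.
drifted = needs_expansion("How is the WBC trending?", "Given all that, what next?")
```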
### H\.5Quality Gates
Each turn passes through three sequential quality gates; failure at any gate triggers a retry up to three times before the patient is skipped\.
#### Gate 1 – Plan validation\.
Rule\-based structural checks verify tool cardinality \(Table[H2](https://arxiv.org/html/2606.18613#A8.T2)\) and parameter completeness\.
Table H2: Rule\-based plan validation criteria\.
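As an illustration of what a rule\-based cardinality check might look like, the Python sketch below encodes the Action\_List sizes stated in the Planner Agent prompt \(Figure H5\): exactly 2 items for Information Lookup, Clinical Reasoning, and Write/Update, and 3–5 items for Data Gathering\. That these values coincide with Table H2 is an assumption, as are the function name validate\_plan and the representation of a plan as a list of tool names\.

```python
# (min, max) number of Action_List items per task type, including prepare_to_answer.
ACTION_LIST_SIZE = {
    "IL": (2, 2),
    "DG": (3, 5),
    "CR": (2, 2),
    "WU": (2, 2),
}


def validate_plan(task_type: str, action_list: list[str], tool_source: str = "ehr") -> list[str]:
    """Return rule violations for one turn plan given as a list of tool names."""
    errors: list[str] = []
    low, high = ACTION_LIST_SIZE[task_type]
    if not low <= len(action_list) <= high:
        errors.append(f"{task_type}: expected {low}-{high} actions, got {len(action_list)}")
    if not action_list or action_list[-1] != "prepare_to_answer":
        errors.append("Action_List must end with prepare_to_answer")
    # Mixed-source Data Gathering turns must include at least one patient tool.
    if task_type == "DG" and tool_source == "mixed":
        if not any(name.startswith("patient.") for name in action_list):
            errors.append("mixed DG turn lacks a patient.* call")
    return errors


il_errors = validate_plan("IL", ["Observation.search", "prepare_to_answer"])
```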
#### Gate 2 – Observation validation\.
A deterministic checker verifies that key EHR tool calls returned non\-empty results \(FHIR bundle total \> 0\)\. Failure indicates the generated question is unanswerable for this patient, and a different question is regenerated\.
#### Gate 3 – Answer validation\.
An LLM judge audits the gold answer for hallucination, numerical inconsistency with the tool observations, clinical safety, and completeness\. Safety violations or clear hallucination trigger rejection\.
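The gate sequence with bounded retries can be summarised as a short control loop\. The Python sketch below is a simplification under stated assumptions: “retry up to three times” is read as one initial attempt plus up to three retries, gates are modelled as boolean functions over a candidate turn, and the names generate\_turn and gate\_observations and the observations field are hypothetical\. Only Gate 2 is implemented, since it is the one fully specified in the text\.

```python
from typing import Callable, Optional

MAX_RETRIES = 3  # assumption: one initial attempt plus up to three retries

Gate = Callable[[dict], bool]


def gate_observations(turn: dict) -> bool:
    """Gate 2: every recorded EHR result bundle must be non-empty (total > 0)."""
    return all(bundle.get("total", 0) > 0 for bundle in turn["observations"])


def generate_turn(generate: Callable[[int], dict], gates: list[Gate]) -> Optional[dict]:
    """Return the first candidate that passes every gate, or None to skip the patient."""
    for attempt in range(1 + MAX_RETRIES):
        candidate = generate(attempt)
        # all() short-circuits, so gates run in order and stop at the first failure.
        if all(gate(candidate) for gate in gates):
            return candidate
    return None


def fake_generate(attempt: int) -> dict:
    """Stand-in generator: the first two attempts return an empty bundle."""
    return {"observations": [{"total": 0 if attempt < 2 else 3}]}


accepted = generate_turn(fake_generate, [gate_observations])
```

In a fuller version, Gate 1 and Gate 3 would be further entries in the gates list, placed before and after gate\_observations respectively\.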
### H\.6Pipeline Statistics
Table H3: Pipeline statistics for the final benchmark\. Gate failure rates are estimated from pilot generation logs\.

Figure H1: Data Generation Pipeline\. Numbered modules \(1–8\) are LLM\-based agents; the Tool Executor and Tool Checker \(Validate Observations\) are rule\-based components\.

The data generation pipeline is shown in Figure [H1](https://arxiv.org/html/2606.18613#A8.F1)\.
User Agent \| Stage 1: Explicit Question Generation

\[System Prompt\] = """
Please act as a busy clinician \(doctor, nurse, or clinical pharmacist\) quickly typing a question to an AI assistant with access to the patient’s EHR and patient interview tools\.

Critical rules:
1\. Short\. ≤2 sentences\. Turn 0 ≤25 words; follow\-up turns ≤15 words \(fragments OK\)\.
2\. Casual\. Informal spoken bedside language; not formal medical writing\.
3\. No context repetition\. Do not re\-state findings established in prior turns\.
4\. No preamble\. Do not start with “Given that…” or “Based on…”
5\. No tool names, no Markdown, no JSON\.
6\. Never pre\-state lab values\. Ask for the value — never include it in the question\.
× With that eGFR of 52, does metformin need adjusting?
✓ Based on the eGFR, does metformin need adjusting?

Return only the question text, nothing else\.
"""

\[User Prompt\] = """
Clinical Scenario: \{\{scenario\}\}
Task Type: \{\{IL / DG / CR\}\}

\[Task\-Type Instructions\]
• IL \(Information Lookup\): Exactly 1 EHR tool\. Ask about one specific data point using bedside language\. Vary data types \(labs / meds / vitals / radiology / diagnoses\)\.
• DG \(Data Gathering\): ≥2 EHR tools\. NEVER pre\-state any lab value or drug name in the question — ask for both items; the planner must retrieve them\.
Tier 1 \(preferred\): Obs × Obs \(parallel / trend\), Obs × Med, Med × Med, Obs × Cond, Med × Admin
Tier 2 \(fallback\): Obs × Admin, Med × Cond, 3\-tool combos
• CR \(Clinical Reasoning\): 1 EHR fetch \+ clinical knowledge reasoning\. NEVER include the actual lab value in the question\.

Tool Source: \{\{ehr / patient / mixed\}\}
Turn Subtype: \{\{NA / PE / AE\}\} \(applied in Stage 2\)

\[Session Plan\]
\{\{Topic\}\} ⋅ \{\{Tool hint\}\} ⋅ Required: question must ask for both \{\{item A\}\} and \{\{item B\}\}

\[EHR Snapshot\]
\{\{Structured MIMIC\-IV patient data: lab results, medications, vitals, diagnoses, clinical scores\}\}
Rule: Only ask about items listed in the EHR Snapshot\. Do not ask about absent data\.

\[Conversation History\]
\{\{Prior physician queries and assistant responses\}\}

\[Antecedents\]
\{\{Entities and events from prior turns — candidates for ellipsis / anaphora in Stage 2\}\}

\[Failed Questions\]
\{\{Previously generated questions that returned empty EHR results — do not repeat\}\}

Generate the EXPLICIT clinical question \(Stage 2 ellipsis transform applied next\):
"""

Figure H2: Prompt for the User Agent at Stage 1\.

User Agent \| Stage 2: Implicit Transformation

\[System Prompt\] = """
You are a linguistic rewriter for a clinical QA benchmark\. You will receive:
1\. An explicit user question \(from Stage 1\)
2\. The conversation history \(prior turns\)
3\. A transformation rule specifying which implicitness subtype to apply

Your task: rewrite the explicit question into its elliptic/anaphoric form\.

Rules:
• Keep the clinical meaning identical\.
• Apply only the transformation described — do not add new information\.
• Maintain the casual bedside tone\.
• Return only the rewritten question text\.
• Critical for DG: if the original asks for two data items, the rewritten form must still require both — never collapse a 2\-item question into a 1\-item question\.
"""

\[User Prompt\] = """
Explicit question \(Stage 1\): "\{\{explicit\_question\}\}"

\[Conversation History\] \(recent turns\)
\{\{Prior physician queries and assistant responses\}\}

\[Antecedents\]
\{\{Entities and events from prior turns eligible for ellipsis or anaphoric reference\}\}

\[DG \+ NA Constraint\] \(injected only when task type = DG and subtype = NA\)
Only the following entities \(established in prior turns\) may be replaced with a pronoun: "\{\{recoverable\_entities\}\}"\. All other named entities must remain explicit — the planner needs them to know what to fetch\. The rewritten question must still require ≥2 data items\.

\[Transformation Rule\] — \{\{subtype\}\}
• NA \(Nominal Anaphora\): Remove an entity established in a prior turn\.
Choose the most natural surface form:
Form A — Pronominalization: replace with a pronoun or demonstrative \(it, that, this, those\)\.
e\.g\., Turn 0 found K = 6\.2 mEq/L → “Does it warrant holding the diuretic?”
Form B — Argument deletion: omit the entity entirely, keeping the predicate; the omission must be unambiguously recoverable from prior turns\.
e\.g\., Turn 0 checked hemoglobin trend → “How’s the trend?”
Rule: the omitted/pronominalized entity must appear in a previous turn\. If neither form feels natural, return the original question unchanged\.
• PE \(Predicate Ellipsis\): Drop the entire verb phrase / question stem \(What’s her, Can you check, How is, Pull, etc\.\), leaving only the topic noun or a bare fragment\. The omitted action is inferred from prior tool calls\.
“What’s the creatinine?” → “Creatinine?”
“How is the WBC trending?” → “WBC trend?”
“What medications is she on?” → “Current meds?”
Rule: strip the predicate completely — do not merely shorten the sentence\. The result should feel like a quick bedside fragment, not a grammatical question\.
• AE \(Abstract Event Anaphora\): Refer back to a complex clinical situation using an abstract noun or event expression\.
e\.g\., Turns 0–1 established a DKA workup → “Given all that, how aggressive should the insulin correction be?”
Rule: the abstract reference must have clear prior grounding in ≥2 prior turns\.

Rewrite the explicit question into its elliptic/anaphoric form:
"""

Fallback — Expansion Agent \(triggered when content\-word overlap between Stage 1 and the transformed question is zero, indicating topic drift\)

\[User Prompt\] = """
Conversation history: \{\{last 3 turns\}\}
The physician used an abbreviated form: "\{\{transformed\_question\}\}"
Expand this into the full explicit clinical question \(one sentence, ≤20 words\):
"""

Figure H3: Prompt for the User Agent at Stage 2\.

Session Planner Agent

\[System Prompt\] = """
You are a clinical conversation planner for an EHR benchmark dataset\. Given a patient’s EHR snapshot and a task\-type sequence, generate a session plan that tells one coherent clinical story across all four turns\. Output only valid JSON — no prose, no markdown\.

\[Output Schema\]
• clinical\_situation: one\-sentence patient summary
• investigation\_arc: T0 arc phrase → T1 → T2 → T3 \(format below\)
• turn\_intents: list of 4 full intent sentences \(format below\)
• turns: list of 4 dicts, each with turn, task\_type, topic, tool\_hint, tool\_source, workup\_pattern \(DG only\)

\[Rules\]
1\. Every topic must appear in the \[QUERYABLE ITEMS\] block of the EHR snapshot\. Items with 1 result are restricted to Information Lookup turns only\.
2\. Topics must not repeat across turns — each turn adds new information\.
3\. Turns form a progressive clinical investigation, not random questions\.
4\. turn\_intents\[i\] must match topic and tool\_hint in turns\[i\]\.
5\. Tool diversity: each FHIR resource type appears in at most 2 turns; the session must span ≥2 distinct resource types\.
6\. Clinical scoring priority: if a CLINICAL SCORING OPPORTUNITIES section is present, at least one DG or CR turn must compute the listed score \(retrieve all required components in parallel\)\.

\[arc / intent formats\]
• IL: "T\{i\}\[R\] retrieve \{item\} — \{clinical purpose\}"
• DG: "T\{i\}\[W\] \{item A\} × \{item B\} — \{clinical question\}"
• CR: "T\{i\}\[KG\] interpret \{item\} — \{clinical decision\}"
• WU: "T\{i\}\[A\] \{write operation\} — \{clinical justification\}"

\[Patient Interview Option\] \(at most one turn per session\)
Set tool\_source="mixed" \(EHR \+ patient\) or "patient" \(patient only\)\. Available tools: patient\.get\_symptom\_history, patient\.get\_medication\_adherence, patient\.get\_functional\_status, patient\.get\_social\_history\.

\[DG Tool Patterns\]
Tier 1 \(preferred\): Obs × Obs, Obs × MedReq, MedReq × MedReq, Obs × Cond, MedReq × MedAdmin
Tier 2 \(fallback\): Obs × MedAdmin, MedReq × Cond, 3\-tool combos

\[Write/Update Turn — T3 only\]
tool\_hint is exactly one write call with concrete parameters drawn from T0–T2 findings\. e\.g\., MedicationRequest\.create\(medication=X, dose=Y, route=Z, frequency=W, indication=\.\.\.\)

\[Coverage Hint\] \(injected when underused tools detected\)
\{\{Tools underused in current dataset — prefer when clinically appropriate\}\}
"""

\[User Prompt\] = """
Clinical scenario: \{\{scenario\}\}
Task sequence: Turn 0 \[\{\{type\}\}\] → Turn 1 \[\{\{type\}\}\] → Turn 2 \[\{\{type\}\}\] → Turn 3 \[\{\{type\}\}\]

\[EHR Snapshot\]
\{\{Patient MIMIC\-IV data: queryable items \+ clinical scoring opportunities\}\}

Generate the session plan JSON:
"""

Figure H4: Prompts for the Session Planner Agent\.

Planner Agent \(Turn Planner\)

\[System Prompt\] = """
You are a clinical planning agent\. You decide which tools to call to answer a clinician’s question about a specific patient\.

\[Output Schema\]
Output a JSON object with exactly these fields: Task\_Finish \(always false\), Thought \(one\-sentence reasoning\), Plan \(brief tool call description\), Action\_List \(list of tool calls ending with prepare\_to\_answer\)\.

\[Task\-Type Rules\]
1\. Information Lookup: Action\_List has exactly 2 items \(1 tool \+ prepare\_to\_answer\)\.
tool\_source=ehr → one EHR tool\. tool\_source=patient → one patient tool\.
2\. Data Gathering: Action\_List has 3–5 items \(≥2 tools \+ prepare\_to\_answer\)\.
Parallel mode: independent tools called together\.
Adaptive mode: first tool result determines which second tool to call\.
tool\_source=mixed: mandatory — must include ≥1 patient\.get\_xxx call\.
Clinical scoring: retrieve all required components in parallel \(e\.g\. SOFA: Observation\.search\(platelet\) \+ Observation\.search\(bilirubin\) \+ Observation\.search\(creatinine\) \+ MedicationAdministration\.search\(vasopressor\)\)\.
3\. Clinical Reasoning: Action\_List has exactly 2 items \(1 tool \+ prepare\_to\_answer\)\.
Fetch one specific patient parameter; the Answer Agent applies clinical knowledge\. Do NOT call multiple tools\.
4\. Write/Update: Action\_List has exactly 2 items \(1 read tool to verify current state \+ prepare\_to\_answer\)\.
The Answer Agent will then emit a write tool call \(MedicationRequest\.create, ServiceRequest\.create, or Flag\.create\)\.
Do not call the write tool directly in the plan\.

\[General Rules\]
• Always end Action\_List with prepare\_to\_answer\.
• Patient tools require both subject\_id and session\_id\.
• Only use tools from the provided Available Tools list\.
• Always include subject\_id in EHR tool arguments\.
• Use exactly the parameter names shown in tool definitions \(e\.g\. item\_name not test\_name\)\.

Output only the JSON object, no other text\.
"""

\[User Prompt\] = """
\[Available Tools\]
\{\{FHIR R4 tool list with schemas\}\}

\[Patient Context\]
\{\{subject\_id, hadm\_id, session\_id, patient summary\}\}

\[Session Plan Hint\]
\{\{tool\_hint from Session Planner for this turn\}\}

\[Conversation History\]
\{\{Prior physician queries, tool calls, and assistant responses\}\}

\[Previous Observation Failure\] \(injected on retry\)
\{\{Tool that returned empty — do not call same tool with same parameters\}\}

Physician question: "\{\{user\_question\}\}"

Output the tool call plan:
"""

Figure H5: Prompts for the Planner Agent\.

Patient Agent

\[System Prompt\] = """
You are simulating a patient in a clinical interview with a doctor\.

Patient personality \(instantiated per session from PHM YAML\):
• Health literacy: \{\{low / medium / high\}\}
  - low — uses simple everyday words, avoids medical terms, may misunderstand jargon
  - medium — understands basic concepts, asks for clarification on complex terms
  - high — medically literate, uses correct terminology, describes symptoms precisely
- Medication adherence: {{good / uncertain / poor}}
  - good — takes all medications as prescribed
  - uncertain — sometimes forgets doses or is unsure about schedules
  - poor — often misses doses, has stopped some medications, or never filled prescriptions
- Anxiety level: {{low / medium / high}}
  - low — calm and matter-of-fact
  - medium — somewhat anxious
  - high — visibly worried, may emphasize worst symptoms

Rules:

1. Respond in natural spoken language as the patient.
2. Stay strictly in character based on the personality above.
3. Base your response only on the provided PHM data nodes — do not invent symptoms or medications.
4. Do not use medical jargon if `health_literacy=low`.
5. If asked about a medication you never filled, express this naturally.
6. For symptom history, follow OPQRST: Onset, Provocation, Quality, Radiation, Severity, Timing.
7. Keep responses concise (2–5 sentences) unless probed for details.
8. Stay consistent with what was already disclosed in prior conversation turns.

[WithheldFlags — `critical_withheld` persona only]

Critical information (e.g. a recently stopped anticoagulant) is suppressed from initial responses. It is revealed only when the physician’s follow-up query explicitly targets the relevant drug. Once revealed, the information remains disclosed for all subsequent turns.
"""

[User Prompt] = """
[PHM Data Nodes] (retrieved and filtered by `WithheldFlags` before injection)
{{Relevant diagnoses, medications, lab trends, warning signs from PHM YAML}}

[Prior Conversation Context]
{{Symptom log from previous turns — ensures consistency across turns}}

[Clinical Query]
Query type: one of:

- `get_chief_complaint` — what brought the patient in today
- `get_symptom_history` — symptom onset/quality/severity (OPQRST); `query=<keyword>`
- `get_medication_adherence` — adherence for a specific drug; `drug=<name>`
- `get_functional_status` — mobility, ADL, activity limitations
- `get_social_history` — smoking, alcohol, living situation, occupation
- `get_pain_assessment` — pain location, character, severity scale

Query: "{{symptom keyword or drug name, if applicable}}"

Respond as the patient:
"""

Figure H6: Prompts for the Patient Agent

Checker Planner (Gate 1 — Plan Validation)

[System Prompt] = """
You are a clinical planning validator. Given a user question, task type, tool source, and an action plan, check whether the plan is correct.

Return JSON: `{"valid": <bool>, "reason": "<brief explanation>"}`

Validation rules:

1. Information Lookup: exactly 1 non-`prepare_to_answer` tool.
   - `tool_source=ehr` → EHR tool; `tool_source=patient` → `patient.xxx` tool.
2. Data Gathering: 2–4 non-`prepare_to_answer` tools.
   - `tool_source=mixed`: may combine EHR and `patient.xxx` tools.
3. Clinical Reasoning: ≥ 1 non-`prepare_to_answer` tool.
   - Typically 1 tool fetching a clinical parameter; 2 allowed when correlating two data points.
4. Write/Update: exactly 1 write tool (`MedicationRequest.create`, `ServiceRequest.create`, or `Flag.create`) + `prepare_to_answer`. No read/search tools allowed.
5. All tools must exist in the `Available Tools` list.
6. `subject_id` must be present in EHR tool arguments when the patient is known.
7. Patient tools (`patient.xxx`): both `subject_id` and `session_id` required.
8. Tool arguments must match their schema (no missing required parameters).
9. The tools chosen must be relevant to the question asked.
10. `Action_List` must end with `prepare_to_answer`.

Output only the JSON, no other text.
"""

[User Prompt] = """
User question: "{{user_question}}"
Task type: {{Information Lookup / Data Gathering / Clinical Reasoning / Write/Update}}
Tool source: {{ehr / patient / mixed / write}}
Plan: {{Action_List from Planner Agent}}
Available tools: {{tool name list}}

Is this plan correct and appropriate? Output JSON.
"""

Two-stage validation: Rule-based structural checks run first (tool count per task type, `subject_id` presence, tool name membership). Only plans passing all structural checks proceed to the LLM semantic check. On `valid=false`, the Planner Agent is re-invoked with the rejection reason injected as a warning (up to 3 retries per turn).

Figure H7: Prompts for the Planner Checker Agent

Clinical Checker (Gate 3 — Answer Validation)

[System Prompt] = """
You are a clinical quality reviewer.
Given a clinician’s question, the tool observations (real EHR data), and the AI assistant’s answer, check for:

1. Hallucination: Does the answer cite values not present in the observations?
2. Contradiction: Does the answer contradict values in the observations?
3. Safety: Does the answer make any obviously dangerous clinical recommendations?
4. Completeness: Does the answer address the actual question asked?

Return JSON:

- `valid`: boolean
- `hallucination`: boolean
- `contradiction`: boolean
- `safety_issue`: boolean
- `incomplete`: boolean
- `issues`: list of issue strings
- `score`: integer 0–10

Be lenient — minor omissions are acceptable. Flag only clear errors. A `safety_issue=true` unconditionally sets `valid=false`.
"""

[User Prompt] = """
Clinical Task: {{task_type / scenario}}
Question asked: "{{user_question}}"

[EHR Data Retrieved] (ground truth — truncated to 800 chars per tool call)
`[{{tool_name}}]: {{FHIR R4 Bundle JSON}}`

AI Assistant’s answer:
{{generated_answer}}

Validate this answer. Output JSON.
"""

Retry logic: on `valid=false`, the failed answer is discarded and the Answer Agent is re-invoked (up to 3 retries). If all retries fail, the patient is skipped and the next candidate is selected.

Figure H8: Prompts for the Clinical Checker Agent

Answer Agent

[System Prompt — Information Lookup] = """
You are an AI clinical assistant reporting EHR data to a clinician.

Format rules:

- Report only the directly retrieved value(s).
- Format each item as: `[Item]: [Value] [Unit] (↑/↓/normal)` — one line per item.
- No introductory sentences, no closing remarks, no clinical commentary unless the question explicitly asks for interpretation.
- If data is missing: `[Item]: not found in EHR`
- Maximum 2 lines total.

"""

[System Prompt — Data Gathering] = """
You are an AI clinical assistant synthesizing multi-source EHR findings.

Format rules:

- Bullet list, maximum 3 bullets: `• [Finding]: [clinical implication]`
- No introductory or closing sentences.
- If a tool returned no data: `• [item]: not available in EHR`

Clinical scoring (when retrieved data contains scoring components):

- Compute score inline using the exact value-to-subscore lookup tables (SOFA, SIRS, MELD, Child-Pugh, CURB-65, CHA2DS2-VASc, HAS-BLED, Ranson, Cockcroft-Gault, Wells PE).
- Format: component = value → X pts; sum total → category.
- Critical: do not default to 2 pts for "abnormal" — look up the exact range.
- List missing components as assumed 0 (note in answer).

"""

[System Prompt — Clinical Reasoning] = """
You are an AI clinical assistant combining a retrieved patient value with clinical knowledge.

Format rules:

- Respond in exactly 2 sentences — no more.
- Sentence 1: state the retrieved patient value with units and whether it is normal/abnormal.
- Sentence 2: give one specific, actionable clinical recommendation based on that value.
- Do not give generic advice.
  Do not add a third sentence.

"""

[User Prompt] = """ (shared across all task types)
Task type: {{Information Lookup / Data Gathering / Clinical Reasoning}}

[Conversation History]
{{Prior physician queries and assistant responses}}

[Tool Observations] (real MIMIC-IV data returned by executed tool calls)
`[{{tool_name}}]: {{FHIR R4 Bundle JSON — ground truth EHR values}}`

Physician question: "{{user_question}}"

Generate the clinical response:
"""

Figure H9: Prompts for the Answer Agent

Rubric Generator Agent

[System Prompt — General Turns (IL / DG / CR)] = """
You are a clinical benchmark rubric designer for an EHR-based QA evaluation. Given a clinical question, the EHR data retrieved, and a reference answer, generate 3–6 atomic rubric criteria to evaluate another LLM’s response.

Design rules:

1. Each item describes an outcome or clinical goal — never a tool call, API name, or process step.
2. Ground items in actual EHR values.
   Write: “The answer correctly cites creatinine as 0.9 mg/dL” — not “mentions the creatinine value”
3. Each item must be independently evaluable as YES or NO.
4. Include ≥ 1 reasoning or recommendation item (not just fact retrieval).
5. For safety-critical decisions, include one item checking a dangerous recommendation is absent.
6. Do not mention tool names, function names, or system internals.
7. Clinical accuracy: verify the reference answer’s conclusions before echoing them. If a claim is debatable, write the rubric to check the reasoning process, not the specific conclusion.
8. Mixed/patient turns: cover both dimensions — (a) EHR data cited and interpreted correctly; (b) patient-reported symptoms/adherence quoted and clinically interpreted.

Item count by task type:

- IL: 3 items — value cited, value interpreted, conclusion stated
- DG: 4–5 items — each value cited, relationship stated, conclusion
- CR: 5–6 items — value cited, threshold applied, reasoning chain, recommendation, safety check
- Mixed/Patient: 5–6 items — 2–3 on EHR findings, 2–3 on patient-reported findings

Output only a valid JSON array of strings. No prose, no markdown.

Example: `["The answer cites creatinine as 0.9 mg/dL", "The answer concludes no metformin adjustment is needed"]`
"""

[System Prompt — Write/Update Turns (T3 Action)] = """
You design rubric items evaluating a model’s FHIR write tool call (`MedicationRequest.create`, `ServiceRequest.create`, or `Flag.create`).

Design rules:

1. Each item names a specific field of `tool_call.arguments` (e.g. `medication`, `dose`, `route`, `frequency`, `indication`).
2. Each item has a clear PASS / FAIL criterion checkable from the field value.
3. Explicitly allow clinically equivalent values (drug synonyms, dose ranges, frequency synonyms).
4. Include exactly one negative safety item that FAILS when a dangerous value is present (e.g. dose ≥ contraindicated threshold, wrong drug class, unjustified `stat` priority).
5. Do not write items about clinical reasoning or prose justification.

"""

[User Prompt] = """ (shared across all turn types)
Clinical question: "{{user_question}}"

[EHR Data Retrieved]
`[{{tool_name}}]: {{actual FHIR R4 Bundle values used in the gold answer}}`

[Reference Answer]
{{gold_answer generated by Answer Agent}}

Generate the rubric criteria:
"""

Figure H10: Prompts for the Rubric Generation Agent
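The rule-based first stage of the Gate 1 plan validation (tool count per task type, `subject_id` presence, tool name membership) can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `structural_check`, the `{"tool": ..., "args": ...}` action layout, and the added check on the terminal action are assumptions made here, while the count bounds are copied from the validation rules in the Checker Planner prompt.

```python
from typing import Any, Optional

FINAL_ACTION = "prepare_to_answer"

# (min, max) number of non-prepare_to_answer tools per task type, copied from
# the Checker Planner validation rules; max=None means no upper bound.
TOOL_COUNT_BOUNDS: dict[str, tuple[int, Optional[int]]] = {
    "Information Lookup": (1, 1),
    "Data Gathering": (2, 4),
    "Clinical Reasoning": (1, None),
    "Write/Update": (1, 1),
}


def structural_check(
    task_type: str,
    action_list: list[dict[str, Any]],
    available_tools: set[str],
) -> tuple[bool, str]:
    """Rule-based pre-check run before any LLM check; returns (valid, reason).

    Assumed (hypothetical) action layout: {"tool": str, "args": dict}.
    """
    if task_type not in TOOL_COUNT_BOUNDS:
        return False, f"unknown task type: {task_type}"
    if not action_list or action_list[-1]["tool"] != FINAL_ACTION:
        return False, "Action_List must end with prepare_to_answer"

    # Count only the real tool calls, not the terminal marker.
    tools = [a for a in action_list if a["tool"] != FINAL_ACTION]
    low, high = TOOL_COUNT_BOUNDS[task_type]
    if len(tools) < low or (high is not None and len(tools) > high):
        return False, f"{task_type}: got {len(tools)} tools"

    for action in tools:
        name, args = action["tool"], action.get("args", {})
        if name not in available_tools:
            return False, f"tool not in Available Tools: {name}"
        if "subject_id" not in args:
            return False, f"{name}: missing subject_id"
    return True, "ok"


# Hypothetical plan for an Information Lookup turn (made-up subject_id).
plan = [
    {"tool": "Observation.search", "args": {"subject_id": 1, "item_name": "creatinine"}},
    {"tool": "prepare_to_answer", "args": {}},
]
print(structural_check("Information Lookup", plan, {"Observation.search"}))
```

Returning the rejection reason alongside the boolean mirrors the described retry step, in which the reason is injected into the next Planner Agent call as a warning. Running the cheap checks first means the LLM semantic check is only invoked for plans that are already structurally well formed.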
## Appendix I Full Experiment Results
### I.1 Performance by Clinical Scenario
Table [I1](https://arxiv.org/html/2606.18613#A9.T1) reports rubric scores broken down by the four clinical scenarios. Discharge Planning is consistently the hardest scenario across all models (column average 57.9%), likely because it requires integrating longitudinal EHR context, patient preferences, and multi-step care coordination rather than a single lookup. Treatment Response and Medication Safety are comparatively easier (68.8% and 67.7%), and show the largest spread between strong and weak models. Notably, Qwen3.5-35B-A3B achieves 74.9% on Medication Safety, on par with GLM-5 and well above its overall average (66.0%), suggesting that some open-weight models have uneven scenario-level strengths.
Table I1: Rubric score (%) per clinical scenario, English and Chinese separately. Diag. = Diagnostic Workup, Dischg. = Discharge Planning, Med. = Medication Safety, Treat. = Treatment Response. Numbers in red and blue are the best and second-best per column within each language group.
### I.2 Performance by Data Richness
Table [I2](https://arxiv.org/html/2606.18613#A9.T2) reports rubric scores across the three data richness tiers (High / Medium / Low EHR record completeness). Across all models, the performance gap between tiers is small, typically within 2–3 pp, indicating that current models do not strongly exploit additional EHR context when it is available. The counter-intuitive pattern that High-richness sessions score slightly above Low-richness sessions (65.1% vs. 64.1% on average) suggests that denser records provide useful disambiguation cues that offset the added complexity. Weaker models (Qwen3.5-9B, Qwen3.5-4B) show a monotone decline from High to Low, while stronger models exhibit no consistent trend, implying that record completeness matters more when overall capacity is limited.
Table I2: Rubric score (%) per data richness tier, English and Chinese separately. High / Medium / Low correspond to EHR record completeness levels 1–3. Numbers in red and blue are the best and second-best per column within each language group.
### I.3 Pass Rate Curves across Thresholds
Figure [I1](https://arxiv.org/html/2606.18613#A9.F1) plots Pass@Turn and Pass@Session as a function of threshold $\tau \in [0,1]$ for all fourteen models, shown separately for English and Chinese.
Three patterns are consistent across both languages. First, Pass@Session decays far faster than Pass@Turn as $\tau$ increases, reflecting the multiplicative penalty of requiring every turn in a session to pass: at $\tau = 0.60$, Claude-Opus-4.7 achieves 67.7% Pass@Turn (EN) but only 23.5% Pass@Session, a ratio of 0.35. By $\tau = 0.75$ the ratio collapses further, with most models dropping below 10% Pass@Session even when their Pass@Turn remains above 40%. Second, the spread between strong and weak models is amplified at the session level: Qwen3.5-4B records 1.5% Pass@Session at $\tau = 0.60$ (EN), roughly 20 pp below Claude (23.5%) and GLM-5 (21.3%), despite a smaller difference at the turn level. Third, Chinese scores are consistently slightly higher than English at the same threshold across all models (e.g., Pass@Session at $\tau = 0.60$ rises from 23.5% in English to 26.9% in Chinese for Claude, and from 21.3% to 26.5% for GLM-5), a pattern that persists across the full $\tau$ range.
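The relationship between the two metrics can be made concrete with a small Python sketch. It relies on the statement above that a session passes only if every one of its turns passes, and additionally assumes (not confirmed in this section) that a turn passes when its rubric score is at least $\tau$. The function names and the toy scores are illustrative, not taken from the benchmark.

```python
def pass_at_turn(sessions, tau):
    """Fraction of all turns whose rubric score (in [0, 1]) is >= tau."""
    turns = [score for session in sessions for score in session]
    return sum(score >= tau for score in turns) / len(turns)


def pass_at_session(sessions, tau):
    """Fraction of sessions in which every turn scores >= tau."""
    passed = sum(all(score >= tau for score in session) for session in sessions)
    return passed / len(sessions)


# Toy data: three sessions of four turns each (made-up scores, not paper results).
toy_sessions = [
    [0.80, 0.70, 0.90, 0.65],
    [0.90, 0.50, 0.80, 0.70],
    [0.40, 0.90, 0.60, 0.30],
]

for tau in (0.60, 0.75):
    print(tau, pass_at_turn(toy_sessions, tau), pass_at_session(toy_sessions, tau))
```

On the toy data, raising $\tau$ from 0.60 to 0.75 lowers Pass@Turn from 0.75 to about 0.42, while Pass@Session falls from 1/3 to 0. As a rough sanity check on the reported numbers, if the four turns of a session passed independently with probability 0.677, the session-level rate would be $0.677^4 \approx 0.21$, close to the 23.5% reported for Claude-Opus-4.7; independence between turns is a simplifying assumption here.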
Figure I1: Pass@Turn (left column) and Pass@Session (right column) as a function of threshold $\tau$ for English (top) and Chinese (bottom). Solid lines = closed-source models; dashed lines = open-weight models. Vertical dotted lines mark $\tau = 0.60$ and $\tau = 0.75$.

Figure I2: Rubric score (%) by implicitness type × task type (ZH), for 14 models and their average.