AIPatient Arena: EHR-grounded evaluation of large language models in end-to-end clinical consultation workflows
Summary
Introduces AIPatient Arena, an EHR-grounded evaluation framework that assesses LLMs across eight dimensions of clinical competence in multi-turn consultations. Models score well on interview questioning, ethical conduct, and clarity of explanations, but poorly on handling ambiguous patient responses, information coverage, and diagnostic accuracy.
# AIPatient Arena: EHR-grounded evaluation of large language models in end-to-end clinical consultation workflows

Source: [https://arxiv.org/abs/2606.17474](https://arxiv.org/abs/2606.17474)

Authors: [Jiahui Niu](https://arxiv.org/search/cs?searchtype=author&query=Niu,+J), [Huizi Yu](https://arxiv.org/search/cs?searchtype=author&query=Yu,+H), [Wenkong Wang](https://arxiv.org/search/cs?searchtype=author&query=Wang,+W), [Guangxin Dai](https://arxiv.org/search/cs?searchtype=author&query=Dai,+G), [Jingxian He](https://arxiv.org/search/cs?searchtype=author&query=He,+J), [Xiang Li](https://arxiv.org/search/cs?searchtype=author&query=Li,+X), [Zhiying Liang](https://arxiv.org/search/cs?searchtype=author&query=Liang,+Z), [Xinxin Lin](https://arxiv.org/search/cs?searchtype=author&query=Lin,+X), [Kent CY So](https://arxiv.org/search/cs?searchtype=author&query=So,+K+C), [Bryan YP Yan](https://arxiv.org/search/cs?searchtype=author&query=Yan,+B+Y), [Yun Kwok Wing](https://arxiv.org/search/cs?searchtype=author&query=Wing,+Y+K), [Yanqiu Xing](https://arxiv.org/search/cs?searchtype=author&query=Xing,+Y), [Xin Ma](https://arxiv.org/search/cs?searchtype=author&query=Ma,+X), [Lizhou Fan](https://arxiv.org/search/cs?searchtype=author&query=Fan,+L)

[View PDF](https://arxiv.org/pdf/2606.17474)

> Abstract: Large language models (LLMs) are increasingly considered for use in clinical consultation tasks, yet most medical evaluations remain static, single-turn, or narrowly outcome-based, limiting their ability to reflect the sequential, uncertain, and interactive nature of real-world care. Here, we propose AIPatient Arena, an EHR-grounded evaluation framework for assessing the clinical utility of LLMs across eight dimensions of clinical competence. The framework integrates EHR data into patient-specific knowledge graphs, enabling multi-turn physician-patient interactions. We applied AIPatient Arena to a primary cohort of 437 patients and two out-of-distribution validation cohorts of 119 and 67 patients. We observed that LLMs performed well in medical interview questioning skills (QS; mean scores, 4.43-4.99/5), ethical and professional conduct (ET; 4.38-4.93/5), and clarity and transparency of clinical explanations (EX; 3.80-4.72/5). Performance was moderate in information integration (II; 3.19-4.21/5) and medication safety and justification (MS; 3.13-3.78/5), but persistent weaknesses were observed in handling of ambiguous patient responses (HR; 2.57-3.32/5), information coverage (IC; 2.08-3.02/5), and diagnostic accuracy and reasoning (Dx; 2.63-3.55/5). Process-based evaluation revealed recurrent interaction failures, including repetitive questioning, omission of past medical history, and inadequate handling of uncertainty. Richer conversational context improved diagnostic reasoning but yielded limited gains in treatment planning. These findings indicate that final-answer accuracy alone is insufficient for evaluating clinical readiness and highlight the importance of assessing how models gather, interpret, and communicate information throughout a consultation. AIPatient Arena provides an EHR-grounded framework for workflow-oriented pre-deployment evaluation of medical LLMs.

## Submission history

From: Lizhou Fan [[view email](https://arxiv.org/show-email/9c8e7af0/2606.17474)]

**[v1]** Tue, 16 Jun 2026 03:35:17 UTC (12,193 KB)
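
The eight score ranges reported in the abstract are easier to compare side by side. The following is a minimal Python sketch that records them and groups the dimensions by the abstract's three performance levels. The names `DimensionScore`, `REPORTED`, `spread`, and `by_tier` are hypothetical, and the tier labels `strong`, `moderate`, and `weak` are shorthand for the abstract's wording ("performed well", "moderate", "persistent weaknesses"); none of this comes from the authors' code.

```python
from typing import Dict, List, NamedTuple


class DimensionScore(NamedTuple):
    """One evaluation dimension with the mean-score range quoted in the abstract."""
    code: str
    name: str
    low: float   # lower end of the reported range (out of 5)
    high: float  # upper end of the reported range (out of 5)
    tier: str    # shorthand label (an assumption) for the abstract's wording


# Values copied from the abstract; tier labels are illustrative shorthand.
REPORTED: List[DimensionScore] = [
    DimensionScore("QS", "Medical interview questioning skills", 4.43, 4.99, "strong"),
    DimensionScore("ET", "Ethical and professional conduct", 4.38, 4.93, "strong"),
    DimensionScore("EX", "Clarity and transparency of clinical explanations", 3.80, 4.72, "strong"),
    DimensionScore("II", "Information integration", 3.19, 4.21, "moderate"),
    DimensionScore("MS", "Medication safety and justification", 3.13, 3.78, "moderate"),
    DimensionScore("HR", "Handling of ambiguous patient responses", 2.57, 3.32, "weak"),
    DimensionScore("IC", "Information coverage", 2.08, 3.02, "weak"),
    DimensionScore("Dx", "Diagnostic accuracy and reasoning", 2.63, 3.55, "weak"),
]


def spread(dim: DimensionScore) -> float:
    """Width of the reported range, rounded to two decimals."""
    return round(dim.high - dim.low, 2)


def by_tier(scores: List[DimensionScore]) -> Dict[str, List[str]]:
    """Group dimension codes by tier, preserving the input order."""
    groups: Dict[str, List[str]] = {}
    for dim in scores:
        groups.setdefault(dim.tier, []).append(dim.code)
    return groups


if __name__ == "__main__":
    for dim in REPORTED:
        print(f"{dim.code:<3} {dim.low:.2f}-{dim.high:.2f}  "
              f"spread={spread(dim):.2f}  {dim.tier:<8}  {dim.name}")
    print(by_tier(REPORTED))
```

The spread is only the width of each quoted range, not a statistic reported by the paper; it shows how much the reported mean scores vary within a dimension.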
Similar Articles
ChatHealthAI: Aligning Electronic Health Record Representations with Large Language Models for Grounded Clinical Reasoning
ChatHealthAI is a multimodal reasoning framework that aligns structured EHR representations with a frozen LLM to enable grounded clinical reasoning while maintaining predictive performance.
Evaluating Large Language Models in Dynamic Clinical Decision-Making with Standardized Patient Cases
Researchers introduce MedSP1000, a 1,638-case interactive benchmark derived from standardized patient scenarios to evaluate LLMs as dynamic clinical agents across multi-turn encounters. Results show even the best model (GPT-5.5) completes only 60.4% of expert rubric items, suggesting current LLMs are not yet reliable enough for clinical practice.
A Multi-Domain Red Teaming Framework for Safety, Robustness, and Fairness Evaluation of Medical Large Language Models
This paper presents a multi-domain red teaming framework for evaluating safety, robustness, and fairness of medical LLMs across 690 clinically grounded scenarios. Results show that high aggregate accuracy can mask critical failures, and hybrid evaluation with clinician oversight is necessary for credible safety assessment.
Are LLMs Ready to Assist Physicians? PhysAssistBench for Interactive Doctor-Patient-EHR Assistance
Introduces PhysAssistBench, a benchmark for evaluating LLMs in interactive doctor-patient-EHR assistance. Experiments show current models are unreliable in this setting, highlighting the need for coordinated capabilities.
Language Models as Interfaces, Not Oracles: A Hybrid LLM-ML System for Pediatric Appendicitis
This paper presents ClaMPAPP, a hybrid architecture that uses an LLM as an interface to extract features from clinical narratives, which are then passed to an XGBoost classifier for pediatric appendicitis diagnosis, demonstrating improved robustness and safety over end-to-end LLM baselines.