ClinicalMC: A Benchmark for Multi-Course Clinical Decision-Making with Large Language Models

arXiv cs.AI 06/03/26, 04:00 AM Papers
benchmark clinical-decision-making large-language-models healthcare evaluation multi-course
Summary
ClinicalMC is a benchmark designed to evaluate large language models in multi-course clinical decision-making, featuring datasets in Chinese and English and a multi-agent evaluation framework.
arXiv:2606.03157v1 Announce Type: new Abstract: Large language models (LLMs) have been widely adopted in healthcare, yet they still encounter significant challenges in complex clinical decision-making scenarios. Existing benchmarks primarily assess LLM performance in single-course settings and lack systematic evaluation in multi-course scenarios, where a patient's condition evolves over time. To address this gap, we propose ClinicalMC, a benchmark for multi-course clinical decision-making. It includes 1,275 Chinese and 5,804 English samples across four stages from admission to discharge. These stages cover triage, first-course examination/diagnosis/treatment, subsequent multi-course examination/assessment/treatment, and final diagnosis. In ClinicalMC, patients in the English dataset undergo an average of 5.11 clinical courses, whereas those in the Chinese dataset undergo 3.42. To assess LLM performance, we construct a multi-agent evaluation framework that includes patient, examiner, and doctor agents. Based on the benchmark and framework, we design two experimental settings -- a single-turn static setting and a multi-turn dynamic setting -- and assess three categories of LLMs: 1) closed-source LLMs like GPT5-mini; 2) open-source LLMs like DeepSeek-V3.2; and 3) medical LLMs like HuatuoGPT-o1. Through extensive evaluation, we aim to better understand LLM performance in the medical domain and support its effective deployment in healthcare.
Cached at: 06/03/26, 09:43 AM
# ClinicalMC: A Benchmark for Multi-Course Clinical Decision-Making with Large Language Models
Source: [https://arxiv.org/html/2606.03157](https://arxiv.org/html/2606.03157)
Ruihui Hou♢, Siyi Zhu♢, Ziyue Huai♢, Guangya Yu♢, Yongqi Fan♢, Chunming Wang♣, Tong Ruan♢

♢ East China University of Science and Technology, Shanghai, China; ♣ Renji Hospital Affiliated to Shanghai Jiaotong University School of Medicine, Shanghai, China.

###### Abstract

Large language models (LLMs) have been widely adopted in healthcare, yet they still encounter significant challenges in complex clinical decision-making scenarios. Existing benchmarks primarily assess LLM performance in single-course settings and lack systematic evaluation in multi-course scenarios, where a patient’s condition evolves over time. To address this gap, we propose ClinicalMC, a benchmark for multi-course clinical decision-making. It includes 1,275 Chinese and 5,804 English samples across four stages from admission to discharge. These stages cover triage, first-course examination/diagnosis/treatment, subsequent multi-course examination/assessment/treatment, and final diagnosis. In ClinicalMC, patients in the English dataset undergo an average of 5.11 clinical courses, whereas those in the Chinese dataset undergo 3.42. To assess LLM performance, we construct a multi-agent evaluation framework that includes patient, examiner, and doctor agents. Based on the benchmark and framework, we design two experimental settings (a single-turn static setting and a multi-turn dynamic setting) and assess three categories of LLMs: 1) closed-source LLMs like GPT-5-mini; 2) open-source LLMs like DeepSeek-V3.2; and 3) medical LLMs like HuatuoGPT-o1. Through extensive evaluation, we aim to better understand LLM performance in the medical domain and support its effective deployment in healthcare.¹

¹ Data and code are available at [https://github.com/hzyuezh/ClinicalMPD](https://github.com/hzyuezh/ClinicalMPD).


## 1 Introduction

Large language models (LLMs) have shown strong performance in various medical NLP tasks, including information extraction Zhan et al. ([2025](https://arxiv.org/html/2606.03157#bib.bib47)), text generation Lin et al. ([2023](https://arxiv.org/html/2606.03157#bib.bib2)), and question answering Jin et al. ([2021](https://arxiv.org/html/2606.03157#bib.bib29)). However, their reliability remains limited in complex clinical decision-making scenarios Hager et al. ([2024](https://arxiv.org/html/2606.03157#bib.bib12)), which require the continuous integration of heterogeneous data (e.g., vital signs, laboratory results) and real-time reasoning under evolving patient conditions Sutton et al. ([2020](https://arxiv.org/html/2606.03157#bib.bib43)). This limitation highlights the need to systematically evaluate LLM applications in multi-course² clinical decision-making.

² A course is a continuous record of a patient’s condition and treatment during hospitalization, including key details such as vital signs, surgeries, and major clinical changes.

![Refer to caption](https://arxiv.org/html/2606.03157v1/x1.png)

Figure 1: The solid boxes highlight the distinctions between our clinical decision-making tasks and previous benchmarks. Both “First Course Decision” and “Daily Course Decision” consist of three subtasks each, while the dashed boxes provide their detailed descriptions.

Table 1: Overview of clinical decision-making benchmarks. “Dept.”, “CAS.”, “A2D.”, and “Multi-C.” stand for “department”, “assessment task”, “admission to discharge process”, and “multiple courses decision”, respectively. “continuous assessment” indicates whether the patient’s condition is continuously assessed.

Clinical decision-making is a multi-stage, iterative process that often spans several treatment courses Hager et al. ([2024](https://arxiv.org/html/2606.03157#bib.bib12)). Upon admission, clinicians first determine the most suitable department for each patient based on their primary presenting symptoms. During the first course, they gather relevant clinical information and recommend necessary examinations to guide preliminary diagnostic and treatment decisions. If the patient’s condition fails to improve, additional examinations are conducted in subsequent courses to reassess the clinical condition and promptly adjust the treatment plan. This iterative process continues until the patient’s condition stabilizes and discharge criteria are met. The overall process is illustrated in Fig. [1](https://arxiv.org/html/2606.03157#S1.F1).

Several benchmarks have been proposed for clinical decision-making, which can broadly be categorized into exam-based and clinical case-based benchmarks. Exam-based benchmarks, such as MedQA Jin et al. ([2021](https://arxiv.org/html/2606.03157#bib.bib29)), MedMCQA Pal et al. ([2022](https://arxiv.org/html/2606.03157#bib.bib30)), PubMedQA Jin et al. ([2019](https://arxiv.org/html/2606.03157#bib.bib31)), and MMLU Hendrycks et al. ([2021](https://arxiv.org/html/2606.03157#bib.bib32)), primarily consist of Q&A pairs extracted from medical books and literature, aiming to evaluate the domain knowledge of LLMs. However, they are largely biased toward theoretical knowledge and fail to align with actual clinical decision scenarios. Clinical case-based benchmarks such as Clinicallab Yan et al. ([2024](https://arxiv.org/html/2606.03157#bib.bib11)), AI Hospital Fan et al. ([2025b](https://arxiv.org/html/2606.03157#bib.bib14)), and MedJourney Wu et al. ([2024](https://arxiv.org/html/2606.03157#bib.bib17)) aim to simulate real-world clinical scenarios. However, they typically focus on single-course decision-making, involving only a single round of diagnosis and treatment, and overlook the crucial process of reassessing and adjusting treatment plans when a patient fails to improve across multiple courses. In this work, we address this gap by modeling multi-course decision-making scenarios that better reflect real clinical practice. For ease of comparison, we summarize the differences between our benchmark and the most relevant clinical benchmarks in Table [1](https://arxiv.org/html/2606.03157#S1.T1).

In this paper, we therefore introduce ClinicalMC, a novel benchmark for evaluating the multi-course clinical decision-making capabilities of LLMs. To construct this benchmark, we collect clinical records that encompass multiple changes in patient conditions and incorporate condition assessment tasks into each key decision point throughout the clinical course. In addition, we design a three-round annotation workflow to ensure high-quality and consistent annotations. Using this approach, we build 1,275 Chinese samples (covering 16 departments) and 5,804 English samples (covering 24 departments) from MedEureka Fan et al. ([2025a](https://arxiv.org/html/2606.03157#bib.bib42)) and PMC-Patients Zhao et al. ([2022](https://arxiv.org/html/2606.03157#bib.bib1)). To facilitate systematic evaluation on ClinicalMC, we develop a multi-agent evaluation framework comprising a patient agent, an examiner agent, and a doctor agent. The patient agent provides the primary symptoms. The examiner agent provides feedback on the examination results. The doctor agent makes decisions at each stage of the workflow based on the patient’s evolving condition. Using this benchmark and framework, we construct two experimental settings (a single-turn static setting and a multi-turn dynamic setting) and conduct a comprehensive evaluation with a range of doctor agents, including closed-source LLMs such as GPT-4o-mini Hurst et al. ([2024](https://arxiv.org/html/2606.03157#bib.bib9)), open-source LLMs such as DeepSeek-V3.2 Liu et al. ([2025a](https://arxiv.org/html/2606.03157#bib.bib3)), and medical LLMs such as HuatuoGPT-o1 Chen et al. ([2023](https://arxiv.org/html/2606.03157#bib.bib26)).

In summary, our contributions include:

- We introduce a novel benchmark for multi-course clinical decision-making, ClinicalMC. The benchmark comprises 1,275 Chinese samples across 16 departments and 5,804 English samples across 24 departments.
- The main characteristic of ClinicalMC is its inclusion of multiple clinical courses for each patient, enabling a more realistic representation of how a patient’s condition evolves over time. In the English dataset, patients have an average of 5.11 clinical courses, whereas in the Chinese dataset, the average is 3.42.
- We evaluate medical LLMs as well as closed- and open-source LLMs on ClinicalMC. Even a state-of-the-art medical model such as the instruction-tuned HuatuoGPT-o1 (7B) achieves average scores of only 43.40% and 47.77% on the Chinese and English data, respectively, far below human performance (85.00% and 87.51%). We further provide detailed analyses and suggest future research directions.

## 2 Problem Formulation

In this work, we evaluate the complete clinical process from patient admission to discharge. Each clinical task can be formally defined as:

Triage ($TR$): This task requires the doctor to select the most suitable department $dp$ from a set of candidate departments $ds$, given the patient’s chief complaint $cc$ and basic information $bi$. Formally, this is represented as: $dp = TR(cc, bi, ds)$.

Examination Recommendation ($ER$): This task involves predicting the necessary auxiliary examinations $ex$ based on the patient’s chief complaint, present history $ph_1$, past history $ph_2$, and physical examination $pe$. Formally, this can be represented as: $ex = ER(cc, bi, ph_1, ph_2, pe, dp)$. For examination recommendations across multiple courses, the input includes the patient’s chief complaint of the current course, along with all prior patient information. This can be represented as: $ex' = ER(emr, pc, cc', pe')$, where $cc'$, $ex'$, and $pe'$ represent the chief complaint, examination recommendation, and physical examination in the current course, and $pc$ and $emr$ represent the previous course and the patient’s admission information.

Clinical Diagnosis ($CD$): This task requires the doctor to determine the patient’s preliminary diagnosis $pd$, the corresponding diagnostic basis $pb$, and the differential diagnosis $dd$, based on the patient’s chief complaint, present history, past history, physical examination, and auxiliary examinations. It can be formally represented as: $pd, pb, dd = CD(cc, bi, ph_1, ph_2, pe, dp, ex)$.

Assessment ($AS$): This task requires the doctor to assess the patient’s condition, based on the chief complaint and physical examination of the current course. The assessment may involve updating an existing diagnosis or introducing a new one. Formally, the task is defined as: $as' = AS(cc', pe', ex', emr)$, where $as'$ represents the clinical assessment for the current course.

Treatment Planning ($TP$): This task involves predicting the optimal treatment plan based on the patient’s chief complaint, present history, past history, physical examination, auxiliary examinations, preliminary diagnosis, diagnostic basis, and differential diagnosis. It can be formally expressed as: $tp = TP(emr)$. For treatment planning across multiple courses, the input also includes the current course’s data. This can be represented as $tp' = TP(emr, cc', pe', ex', as')$.

Final Diagnosis ($FD$): This task requires the doctor to determine the final diagnosis $fd$ and its supporting basis $fb$ based on the entire clinical trajectory. This task can be formally represented as: $fd, fb = FD(emr, pn)$, where $pn = [pc_1, pc_2, \ldots, pc_n]$ is the sequence of $n$ courses. Each course $pc_i$ $(1 \leq i \leq n)$ includes the chief complaint, physical examination, auxiliary examination, assessment, and treatment plan: $pc_i = (cc', ex', pe', as', tp')$.
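The notation above can be mirrored in a small data model. The following is a minimal sketch, not the authors' released code: the class names `Course` and `Trajectory`, the field names, and the helper functions are hypothetical, and treating $pc$ as "all courses before the current one" is an assumption based on the phrase "all prior patient information".

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Course:
    """One course pc_i = (cc', ex', pe', as', tp')."""
    chief_complaint: str        # cc'
    examinations: List[str]     # ex'
    physical_exam: str          # pe'
    assessment: str             # as'
    treatment_plan: str         # tp'


@dataclass
class Trajectory:
    """Admission information (emr) plus the ordered courses pn = [pc_1, ..., pc_n]."""
    emr: dict                   # admission fields such as cc, bi, ph_1, ph_2, pe, dp, ex
    courses: List[Course] = field(default_factory=list)


def prior_courses(traj: Trajectory, i: int) -> List[Course]:
    """Courses completed before course i (1-based); assumed to be what 'pc' carries."""
    return traj.courses[: i - 1]


def final_diagnosis_input(traj: Trajectory) -> dict:
    """Arguments of FD(emr, pn): the admission record and every course."""
    return {"emr": traj.emr, "pn": list(traj.courses)}
```

Under this layout, the per-course tasks ($ER$, $AS$, $TP$) read from `emr` and `prior_courses`, while only $FD$ sees the full list.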

## 3 ClinicalMC Construction

In this section, we provide a detailed description of the data collection and processing, quality control, and data statistics and analysis\.

![Refer to caption](https://arxiv.org/html/2606.03157v1/x2.png)

Figure 2: The department distribution of the Chinese and English datasets.

### 3.1 Data Collection and Processing

For the Chinese data, we use Electronic Health Records (EHRs) from MedEureka as the original data source. To obtain strictly anonymized and high-quality EHRs, we process the data in two stages. In the first stage, we identify EHRs containing personal information (e.g., names, phone numbers) using regular expressions, and replace sensitive data with placeholders (e.g., “Patient A”) or randomly generated values, resulting in 6,947 EHRs. In the second stage, we further filter the data to retain only complete and high-quality EHRs. We first remove EHRs lacking key information (e.g., chief complaints, diagnoses, or treatment processes), retaining 5,106 EHRs. We then exclude EHRs with a final outcome of death, leaving 4,179 EHRs, and finally eliminate duplicate records through fine-grained demographic matching (e.g., gender and occupation). After this stage, we obtain 3,317 high-quality EHRs, each containing multiple treatment courses.

For the English data, we use 167,034 anonymized case reports from PMC-Patients as the original data source. To obtain high-quality multi-course reports, we conduct three screening steps. First, we use the GPT-4o model to remove reports that lack multiple courses or contain incomplete clinical courses (e.g., no improvement or death), retaining 37,357 reports. Second, we remove reports missing key fields such as admission and final diagnosis, or those labeled as “undiagnosed”, leaving 15,572 reports. Finally, we exclude non-human data (e.g., treatment reports for animals). After this rigorous screening process, we ultimately retain 6,748 reports.

Additionally, to ensure compliance with ethical standards, three clinicians from a Grade 3A hospital conduct a thorough ethical review of the final dataset, confirming that no ethical or moral guidelines are violated.
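The regular-expression anonymization and key-field filtering described for the Chinese data could look roughly like the sketch below. The patterns, the placeholder strings, and the field names in `REQUIRED_FIELDS` are illustrative assumptions; the paper does not publish its actual expressions or schema.

```python
import re

# Hypothetical patterns: an 11-digit mobile number and a "Name: <token>" field.
PHONE = re.compile(r"\b1\d{10}\b")
NAME_FIELD = re.compile(r"(Name:\s*)(\S+)")

# Hypothetical field names for the "key information" check.
REQUIRED_FIELDS = ("chief_complaint", "diagnosis", "treatment_process")


def deidentify(text: str) -> str:
    """Replace phone numbers and the name field with placeholders."""
    text = PHONE.sub("[PHONE]", text)
    return NAME_FIELD.sub(r"\1Patient A", text)


def has_key_fields(record: dict) -> bool:
    """Keep a record only if every required field is present and non-empty."""
    return all(record.get(name) for name in REQUIRED_FIELDS)
```

A real pipeline would need far broader patterns (addresses, ID numbers, dates) and manual spot checks; this only shows the shape of the two stages.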

Table 2:Statistics of our constructed dataset\.
### 3.2 Quality Control

To construct ClinicalMC, we assemble a professional annotation team comprising three inspectors and two reviewers. The dataset is first automatically segmented from multi-course EHRs using an LLM. Subsequently, three clinically trained inspectors perform an initial verification, followed by a dual review conducted by two senior clinicians. The detailed annotation workflow is provided in Appendix [A.1](https://arxiv.org/html/2606.03157#A1.SS1). After a rigorous two-stage quality review, we obtain 1,275 high-quality Chinese EHRs and 5,804 high-quality English EHRs.

To further ensure data integrity and clinical relevance, we conduct an additional quality-control procedure involving three senior clinicians, each with over ten years of clinical experience and independent of the annotation reviewers. For this assessment, we randomly sample 3,000 cases from the English dataset (51.68%) and 1,000 cases from the Chinese dataset (78.43%). We design a standardized scoring framework that presents complete case information and requires clinicians to assess six binary quality dimensions: 1) rationality of course segmentation, 2) accuracy of triage, 3) correctness of diagnostic results, 4) appropriateness of treatment plans, 5) accuracy of clinical assessments, and 6) accuracy of examination recommendations. Clinicians make a “yes/no” judgment for each dimension, and a case is deemed valid only when all criteria are satisfied. Evaluation results show that 93.3% of sampled cases meet the predefined quality standards. The pass rates for individual criteria range from 91.9% to 96.3%, indicating consistently high overall quality. The Cohen’s kappa Banerjee et al. ([1999](https://arxiv.org/html/2606.03157#bib.bib51)) for inter-reviewer agreement is 0.85, demonstrating strong consistency among reviewers. For the remaining 6.7% of cases that do not meet the standards, we perform manual corrections to ensure the reliability and completeness of the dataset.
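As an illustration of the two quantities used in this review, the sketch below implements the "valid only if all six criteria pass" rule and the standard two-rater Cohen's kappa. How the authors aggregate agreement across three clinicians is not stated, so the pairwise form here is an assumption, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence


def case_is_valid(judgments: Dict[str, bool]) -> bool:
    """A case passes only when every binary quality dimension is 'yes'."""
    return all(judgments.values())


def cohens_kappa(rater_a: Sequence, rater_b: Sequence) -> float:
    """Two-rater Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = set(count_a) | set(count_b)
    p_e = sum(count_a[k] * count_b[k] for k in labels) / (n * n)
    # Note: undefined (division by zero) when p_e == 1, i.e. both raters use one label.
    return (p_o - p_e) / (1 - p_e)
```

For example, with ratings `[1, 1, 0, 0]` and `[1, 1, 0, 1]`, observed agreement is 0.75, chance agreement is 0.5, and kappa is 0.5.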

### 3.3 Data Statistics and Analysis

We conduct an in-depth statistical analysis of clinical decision-making from two perspectives.

1) Department distribution. Fig. [2](https://arxiv.org/html/2606.03157#S3.F2) presents the department distribution in both the Chinese and English EHR datasets. In the English dataset, the “Cardiovascular Medicine” department contains the most samples (893 EHRs), whereas the “Anus & Intestine Surgery” department has the fewest (10 EHRs). In the Chinese dataset, the “Surgery” department has the largest sample size (391 EHRs), while the “Chinese Medicine” and “Stomatology” departments have the smallest, with only 6 EHRs each. Both datasets are therefore imbalanced across departments, which reflects the real-world situation in clinical data and is likely due to differences in clinical demand across departments.

2) Number of courses. As shown in Table [2](https://arxiv.org/html/2606.03157#S3.T2), the English dataset has an average of 5.11 courses per patient, ranging from 2 to 11. In comparison, the Chinese dataset has an average of 3.42 courses per patient, with a range of 2 to 10.

![Refer to caption](https://arxiv.org/html/2606.03157v1/x3.png)

Figure 3: The SimHospital framework includes a doctor agent, an examiner agent, and a patient agent. In different tasks, different roles will engage in dialogues. When the patient shows improvement and is ready for discharge, the final diagnosis task is performed; otherwise, the patient continues into a new course. The bold text indicates the information that has been newly added in the current task compared to the previous task.

## 4 Evaluation Framework

Inspired by AI Hospital Fan et al. ([2025b](https://arxiv.org/html/2606.03157#bib.bib14)), we develop an evaluation framework, SimHospital, which consists of a doctor agent, a patient agent, and an examiner agent. GPT-4o-mini is used for the patient and examiner agents, while various LLMs are employed as the doctor agent to assess clinical decision-making performance. We also conduct an ablation study of the patient and examiner models in Appendix [A.7](https://arxiv.org/html/2606.03157#A1.SS7).

### 4.1 Agent Behavior Setting for All Roles

Examiner. The examiner agent is responsible for providing relevant examination results upon request from the patient agent. If the requested examination has corresponding results available, the examiner agent returns those results to the doctor. Otherwise, it responds with an indication that no such examination has been conducted.

Patient. The patient agent’s main task is to interact with the doctor and the examiner agents. To match the actual situation, we add the chief complaint, present history, past history, and physical examination to the prompts of the patient agent, but do not specify the diagnosis or treatment plan. If the doctor suggests performing a specific examination, the patient agent follows the suggestion and passes the examination request to the examiner agent.

Doctor. The doctor agent’s primary task is to gather and analyze patient information to complete clinical decision-making tasks, including triage, examination recommendation, clinical diagnosis, assessment, treatment planning, and final diagnosis.

### 4.2 Clinical Workflow

The SimHospital framework simulates the entire process from admission to recovery and discharge by constructing multiple agents, as illustrated in Fig. [3](https://arxiv.org/html/2606.03157#S3.F3). The interaction begins with the patient agent and proceeds through four stages. In the first stage, the patient agent presents a chief complaint, and the doctor agent recommends the appropriate department. In the second stage, the doctor interacts with both the patient and examiner agents, recommends necessary examinations, and makes clinical diagnoses and treatment plans. In the third stage, the patient agent enters the multi-course phase, during which the doctor agent sequentially performs tasks such as examination recommendation, assessment, and treatment planning based on the patient’s complaints for the current course. This process repeats until the patient recovers and is ready for discharge. In the fourth stage, the doctor agent provides a discharge diagnosis based on the patient’s complete medical information.
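The four stages can be summarized as a simple control loop. The sketch below is a simplified stand-in, not SimHospital itself: the patient and examiner agents are replaced by a replay of recorded courses, the doctor is any callable taking a task name and the record so far, and the task labels and the name `run_episode` are hypothetical.

```python
from typing import Callable, Dict, List


def run_episode(doctor: Callable[[str, Dict], str],
                admission: Dict,
                courses: List[Dict]) -> Dict:
    """Walk one patient through triage, first course, daily courses, and discharge."""
    record = {"admission": dict(admission), "courses": []}
    out: Dict = {}

    # Stage 1: triage from the chief complaint and basic information.
    out["triage"] = doctor("triage", record)

    # Stage 2: first-course examination, diagnosis, and treatment.
    out["first_course"] = {
        task: doctor(task, record)
        for task in ("examination", "diagnosis", "treatment")
    }

    # Stage 3: one round of examination, assessment, and treatment per course,
    # repeated until the recorded courses are exhausted (the patient is discharged).
    out["courses"] = []
    for course in courses:
        record["courses"].append(dict(course))
        out["courses"].append({
            task: doctor(task, record)
            for task in ("examination", "assessment", "treatment")
        })

    # Stage 4: final diagnosis over the complete record.
    out["final_diagnosis"] = doctor("final_diagnosis", record)
    return out
```

With `n` recorded courses the doctor is queried `1 + 3 + 3n + 1` times, which makes the cost of long trajectories easy to estimate.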

Table 3: Experimental results on English data (%). “T”, “E”, “PD”, “PB”, “DD”, “TP”, “FD”, and “FB” refer to triage, examination recall, preliminary diagnosis, preliminary diagnosis basis, differential diagnosis, treatment planning, final diagnosis, and final diagnosis basis, respectively. “CE”, “CA”, and “CT” represent the examination recommendation, assessment, and treatment planning for each course, respectively.

| Model | T_Acc | E_Recall | PD_F1 | PB_Score | DD_Score | TP_IoU | CE_Recall | CA_IoU | CT_IoU | FD_F1 | FB_Score | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Medical LLMs | | | | | | | | | | | | |
| Apollo2-7B | 61.06 | 72.76 | 29.93 | 61.99 | 40.95 | 5.37 | 37.13 | 23.98 | 10.06 | 65.27 | 74.61 | 43.92 |
| Asclepius-Llama2-13B | 0.02 | 0.00 | 0.00 | 44.05 | 38.30 | 1.42 | 0.00 | 22.66 | 1.02 | 0.00 | 31.25 | 12.61 |
| Asclepius-Llama2-7B | 0.02 | 0.00 | 0.00 | 43.65 | 38.45 | 1.42 | 0.00 | 22.49 | 1.01 | 0.00 | 31.29 | 12.58 |
| MedGemma | 63.02 | 19.88 | 22.40 | 76.99 | 62.79 | 10.07 | 24.66 | 50.80 | 2.10 | 70.38 | 85.85 | 44.45 |
| Baichuan-M2 | 61.73 | 25.50 | 24.82 | 74.85 | 51.76 | 9.36 | 25.47 | 54.02 | 2.27 | 80.16 | 86.33 | 45.12 |
| HuatuoGPT-o1-7B | 58.55 | 59.00 | 20.01 | 58.44 | 51.30 | 8.21 | 58.41 | 51.67 | 1.64 | 76.13 | 82.10 | 47.77 |
| Open-source LLMs | | | | | | | | | | | | |
| Llama-3.3-70B | 63.66 | 17.33 | 20.51 | 71.97 | 51.83 | 10.87 | 15.60 | 70.16 | 5.74 | 82.19 | 79.81 | 44.52 |
| Llama-3.2-3B | 46.36 | 34.77 | 14.85 | 48.76 | 38.91 | 5.96 | 26.15 | 48.05 | 3.74 | 63.76 | 69.87 | 36.47 |
| Mistral-7B-v0.3 | 37.35 | 59.37 | 16.47 | 65.53 | 46.47 | 8.45 | 45.89 | 58.97 | 4.85 | 72.96 | 79.37 | 45.06 |
| Mixtral-8x22B | 59.84 | 42.64 | 27.67 | 71.29 | 50.60 | 10.91 | 39.70 | 70.53 | 6.25 | 87.84 | 80.49 | 49.80 |
| Falcon3-7B | 50.76 | 43.81 | 14.67 | 60.57 | 47.12 | 7.47 | 42.08 | 65.48 | 4.36 | 67.57 | 76.51 | 43.67 |
| Qwen2.5-72B | 62.67 | 17.73 | 27.27 | 74.57 | 50.66 | 11.10 | 22.92 | 59.08 | 5.57 | 77.49 | 87.56 | 45.15 |
| Qwen2.5-32B | 62.79 | 19.06 | 22.13 | 74.42 | 52.38 | 11.72 | 24.72 | 66.11 | 6.39 | 79.75 | 86.26 | 45.98 |
| Qwen2.5-14B | 65.13 | 14.98 | 25.85 | 73.88 | 52.28 | 10.40 | 19.18 | 61.67 | 4.63 | 90.59 | 80.61 | 45.38 |
| Qwen2.5-7B | 59.30 | 52.15 | 10.94 | 59.79 | 49.03 | 8.13 | 39.32 | 55.46 | 5.36 | 77.36 | 77.98 | 44.98 |
| Qwen3-Next-80B-A3B | 57.15 | 22.96 | 29.67 | 77.91 | 57.23 | 12.40 | 17.16 | 59.05 | 4.12 | 79.56 | 84.35 | 45.59 |
| DeepSeek-V3.2-Chat | 62.90 | 20.68 | 27.69 | 81.82 | 60.60 | 10.95 | 23.49 | 42.87 | 5.28 | 69.62 | 86.28 | 44.74 |
| DeepSeek-V3.2-Reason | 53.11 | 13.95 | 28.08 | 78.43 | 56.17 | 12.89 | 13.04 | 57.68 | 5.88 | 70.85 | 87.17 | 43.39 |
| Closed-source LLMs | | | | | | | | | | | | |
| GPT-4o-mini | 65.73 | 17.74 | 34.73 | 72.80 | 59.08 | 11.62 | 17.16 | 54.44 | 2.30 | 93.13 | 79.75 | 46.23 |
| GPT-5-mini | 61.71 | 28.16 | 31.86 | 83.33 | 61.71 | 13.20 | 25.84 | 60.01 | 4.71 | 58.46 | 86.65 | 46.88 |
| Qwen-turbo | 63.78 | 42.24 | 32.96 | 72.30 | 55.08 | 9.97 | 51.41 | 46.68 | 2.09 | 89.96 | 79.98 | 49.68 |
| Other Method | | | | | | | | | | | | |
| Human (sampling) | 90.00 | 86.39 | 88.75 | 84.85 | 82.80 | 83.06 | 83.05 | 87.22 | 92.09 | 90.95 | 93.45 | 87.51 |

## 5 Experiments

In this section, we evaluate state-of-the-art models on our newly constructed ClinicalMC benchmark to assess their performance and identify the underlying challenges.

### 5.1 Experimental Setup

Baseline Model. We evaluate four categories of LLMs: 1) Medical LLMs, including MedGemma Sellergren et al. ([2025](https://arxiv.org/html/2606.03157#bib.bib52)), Baichuan-M2 Dou et al. ([2025](https://arxiv.org/html/2606.03157#bib.bib53)), HuatuoGPT-o1, and Apollo2-7B Zheng et al. ([2024](https://arxiv.org/html/2606.03157#bib.bib27)). Additionally, as some medical LLMs demonstrate strong performance across different languages, we use HuatuoGPT2 (7B, 13B, and 34B) Chen et al. ([2023](https://arxiv.org/html/2606.03157#bib.bib26)) for Chinese datasets and Asclepius-Llama2 (7B and 13B) Kweon et al. ([2024a](https://arxiv.org/html/2606.03157#bib.bib28)) for English datasets. Notably, Asclepius-Llama2 was trained on the PMC-Patients dataset, making it well-suited for assessing potential data leakage risks. 2) Open-source LLMs, including Falcon3-7B Almazrouei et al. ([2023](https://arxiv.org/html/2606.03157#bib.bib19)), Qwen2.5 (ranging from 7B to 72B), DeepSeek-V3.2 (Chat and Reason) Liu et al. ([2025a](https://arxiv.org/html/2606.03157#bib.bib3)), Llama-3.3-70B Grattafiori et al. ([2024](https://arxiv.org/html/2606.03157#bib.bib23)), Llama-3.2-3B, Mistral-7B Jiang et al. ([2023](https://arxiv.org/html/2606.03157#bib.bib25)), Mixtral-8x22B Jiang et al. ([2024](https://arxiv.org/html/2606.03157#bib.bib24)), and Qwen3-Next-80B-A3B³. Since the Llama series models exhibit certain capabilities in processing Chinese, we also evaluate their performance on Chinese datasets. 3) Closed-source LLMs, such as GPT-4o-mini Hurst et al. ([2024](https://arxiv.org/html/2606.03157#bib.bib9)), GPT-5-mini, and Qwen-turbo. 4) Other Method. We randomly select 100 samples and invite a medical student who does not participate in the data annotation process to answer the questions.

³ [https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct](https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct)

Table 4: Experimental results on Chinese data (%).

| Model | T_Acc | E_Recall | PD_F1 | PB_Score | DD_Score | TP_IoU | CE_Recall | CA_IoU | CT_IoU | FD_F1 | FB_Score | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Medical LLMs | | | | | | | | | | | | |
| Apollo2-7B | 52.71 | 29.05 | 33.85 | 66.65 | 35.01 | 5.40 | 29.39 | 46.54 | 0.98 | 25.82 | 54.75 | 34.56 |
| HuatuoGPT2-7B | 42.82 | 27.72 | 0.26 | 49.80 | 29.99 | 2.47 | 40.79 | 28.83 | 0.46 | 0.47 | 24.36 | 22.54 |
| HuatuoGPT2-13B | 45.13 | 26.63 | 0.00 | 48.26 | 29.87 | 1.99 | 52.94 | 23.99 | 0.06 | 0.00 | 37.69 | 24.23 |
| HuatuoGPT2-34B | 57.59 | 29.65 | 26.18 | 61.53 | 34.43 | 5.04 | 42.84 | 35.31 | 0.73 | 22.24 | 37.23 | 32.07 |
| MedGemma | 65.65 | 28.64 | 30.13 | 76.78 | 46.34 | 6.91 | 13.73 | 79.74 | 2.92 | 58.48 | 83.72 | 44.82 |
| HuatuoGPT-o1-7B | 65.73 | 30.93 | 33.31 | 70.48 | 39.36 | 5.78 | 17.24 | 71.77 | 2.89 | 62.55 | 77.38 | 43.40 |
| Baichuan-M2 | 54.20 | 32.74 | 35.67 | 78.37 | 47.03 | 5.11 | 18.81 | 76.34 | 2.80 | 68.45 | 80.93 | 45.50 |
| Open-source LLMs | | | | | | | | | | | | |
| Llama-3.3-70B | 52.39 | 35.52 | 33.72 | 73.04 | 41.11 | 6.38 | 25.28 | 50.81 | 1.97 | 29.12 | 67.94 | 37.93 |
| Llama-3.2-3B | 46.20 | 23.79 | 10.63 | 38.32 | 24.33 | 1.84 | 50.67 | 21.55 | 0.18 | 3.47 | 21.13 | 22.01 |
| Mistral-7B-v0.3 | 39.14 | 23.90 | 14.52 | 34.15 | 22.78 | 2.28 | 24.60 | 23.67 | 0.49 | 13.84 | 27.75 | 20.65 |
| Mixtral-8x22B | 53.10 | 27.23 | 26.15 | 60.19 | 25.16 | 4.75 | 22.97 | 36.26 | 1.14 | 23.75 | 52.00 | 30.25 |
| Falcon3-7B | 41.10 | 29.05 | 12.60 | 34.78 | 22.71 | 1.47 | 37.09 | 23.62 | 0.41 | 6.93 | 24.85 | 21.33 |
| Qwen2.5-72B-Chat | 57.41 | 23.74 | 37.96 | 73.98 | 40.05 | 5.94 | 20.88 | 49.17 | 2.11 | 31.88 | 70.32 | 37.59 |
| Qwen2.5-32B-Chat | 59.92 | 23.06 | 34.98 | 76.61 | 43.45 | 7.67 | 18.24 | 60.09 | 2.84 | 30.69 | 64.42 | 38.36 |
| Qwen2.5-14B-Chat | 53.57 | 21.16 | 36.20 | 74.73 | 37.99 | 7.84 | 23.68 | 63.22 | 2.63 | 30.54 | 64.36 | 37.81 |
| Qwen2.5-7B-Chat | 62.35 | 28.87 | 35.35 | 72.49 | 38.53 | 6.34 | 31.33 | 50.66 | 1.40 | 28.90 | 64.75 | 38.27 |
| Qwen3-Next-80B-A3B | 68.31 | 28.53 | 31.25 | 82.46 | 61.30 | 6.38 | 11.98 | 84.89 | 2.75 | 63.80 | 88.17 | 48.17 |
| DeepSeek-V3.2-Chat | 67.06 | 25.46 | 30.87 | 77.04 | 63.44 | 7.57 | 15.55 | 77.94 | 3.07 | 54.61 | 79.22 | 45.62 |
| DeepSeek-V3.2-Reason | 59.50 | 23.94 | 37.98 | 77.10 | 58.65 | 6.86 | 13.67 | 81.02 | 3.24 | 67.36 | 81.08 | 46.40 |
| Closed-source LLMs | | | | | | | | | | | | |
| GPT-4o-mini | 54.43 | 20.65 | 32.58 | 69.47 | 36.41 | 5.81 | 15.46 | 53.97 | 1.56 | 28.28 | 64.66 | 34.84 |
| GPT-5-mini | 59.62 | 25.92 | 12.00 | 83.76 | 49.48 | 6.76 | 9.66 | 84.34 | 4.19 | 44.60 | 82.16 | 42.04 |
| Qwen-turbo | 55.69 | 25.88 | 33.80 | 73.29 | 38.68 | 7.49 | 26.13 | 62.31 | 2.71 | 29.77 | 69.08 | 38.62 |
| Other Method | | | | | | | | | | | | |
| Human (sampling) | 90.91 | 85.61 | 87.68 | 82.91 | 88.14 | 79.41 | 86.70 | 82.76 | 84.53 | 83.71 | 82.73 | 85.00 |

Evaluation Metrics\. For the triage task, we use Accuracy \(Acc\) as the evaluation metric\. For examination recommendation, we adopt Recall, and for disease diagnosis \(covering both clinical and final diagnosis\) we use the F1 score\. To evaluate examination and diagnosis entities, we construct a standardized synonym list by first collecting synonyms from the Medeureka\_corpus Fan et al\. \([2025a](https://arxiv.org/html/2606.03157#bib.bib42)\)\. To address terminological inconsistencies that arise from differences in model training data, we select the largest model from each of 10 LLM series to independently generate synonym lists\. These are merged with the Medeureka corpus and refined by three clinicians to ensure consistency and clinical validity\. For diagnosis basis, we employ an LLM to assess \(1\) whether the medical reasoning process is logically coherent, and \(2\) whether the provided evidence sufficiently and effectively supports the predicted diagnosis\. The scores for preliminary and final diagnosis are denoted as PB\_Score and FB\_Score, respectively\. For differential diagnosis, we evaluate whether the set of predicted differential diagnoses adequately covers other clinically significant potential conditions; this metric is denoted as DD\_Score\. The detailed evaluation prompts for these LLM\-scored metrics are provided in Appendix [A\.11](https://arxiv.org/html/2606.03157#A1.SS11)\. For the assessment and treatment planning in each clinical course, we follow the approach of MedChain Liu et al\. \([2025b](https://arxiv.org/html/2606.03157#bib.bib13)\) by decomposing model outputs into structured clinical entities\. We then compute the Intersection over Union \(IoU\) between these entities and the gold\-standard key interventions\. This metric emphasizes the coverage of critical clinical actions rather than surface\-level wording; the assessment and treatment scores are denoted as CA\_IoU and CT\_IoU, respectively\.
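The set\-based metrics above can be sketched as follows\. This is a minimal illustration, not the authors' evaluation code: the synonym table, the helper names, and the exact\-match\-after\-normalization rule are all assumptions made for the example\.

```python
from typing import Dict, Iterable, Set

# Hypothetical synonym table: maps surface forms to one canonical entity name.
# The real list is built from the Medeureka corpus plus LLM-generated synonyms.
SYNONYMS: Dict[str, str] = {
    "cbc": "complete blood count",
    "full blood count": "complete blood count",
    "mi": "myocardial infarction",
    "heart attack": "myocardial infarction",
}


def normalize(entities: Iterable[str]) -> Set[str]:
    """Lower-case, strip, and map each entity to its canonical form."""
    out: Set[str] = set()
    for entity in entities:
        key = entity.strip().lower()
        out.add(SYNONYMS.get(key, key))
    return out


def recall(pred: Iterable[str], gold: Iterable[str]) -> float:
    """Share of gold entities that the prediction covers."""
    p, g = normalize(pred), normalize(gold)
    return len(p & g) / len(g) if g else 0.0


def f1(pred: Iterable[str], gold: Iterable[str]) -> float:
    """Harmonic mean of precision and recall over normalized entity sets."""
    p, g = normalize(pred), normalize(gold)
    tp = len(p & g)
    if tp == 0:
        return 0.0
    precision, rec = tp / len(p), tp / len(g)
    return 2 * precision * rec / (precision + rec)


def iou(pred: Iterable[str], gold: Iterable[str]) -> float:
    """Intersection over Union of normalized entity sets."""
    p, g = normalize(pred), normalize(gold)
    union = p | g
    return len(p & g) / len(union) if union else 0.0
```

Normalizing both sides before comparison is what lets a prediction such as "CBC" match a gold label of "complete blood count"\.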

Implementation Details\. We design two experimental settings: a single\-turn static setting and a multi\-turn dynamic setting\. In the former, the ground\-truth annotations from preceding tasks are provided as inputs to subsequent tasks\. In the latter, the model outputs of preceding tasks are used as inputs to subsequent tasks\. Further implementation details are presented in Appendix [A\.2](https://arxiv.org/html/2606.03157#A1.SS2)\.
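The difference between the two settings comes down to what is carried forward between tasks\. The sketch below is a hedged illustration: the task list, the function name, and the model interface are hypothetical, and the actual framework involves patient, examiner, and doctor agents that are not modeled here\.

```python
from typing import Callable, Dict, List

# Illustrative task order; the benchmark's real workflow has more stages.
TASKS: List[str] = ["triage", "examination", "diagnosis", "treatment"]


def run_case(
    model: Callable[[str, Dict[str, str]], str],
    gold: Dict[str, str],
    dynamic: bool,
) -> Dict[str, str]:
    """Run the tasks in order; `context` holds results of preceding tasks."""
    context: Dict[str, str] = {}
    outputs: Dict[str, str] = {}
    for task in TASKS:
        outputs[task] = model(task, dict(context))
        # Static setting: carry forward the gold annotation.
        # Dynamic setting: carry forward the model's own output.
        context[task] = outputs[task] if dynamic else gold[task]
    return outputs
```

In the dynamic setting an early mistake propagates into every later prompt, which is why the two settings can produce different scores for the same model\.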

### 5\.2 Main Results

We systematically evaluate all baseline LLMs on ClinicalMC under the single\-turn setting\. The English and Chinese results are reported in Table [3](https://arxiv.org/html/2606.03157#S4.T3) and Table [4](https://arxiv.org/html/2606.03157#S5.T4), respectively\. For the multi\-turn setting, we select one representative model from each of the medical, closed\-source, and open\-source LLM categories\. The detailed multi\-turn results are presented in Appendix [A\.5](https://arxiv.org/html/2606.03157#A1.SS5)\.

From Table [3](https://arxiv.org/html/2606.03157#S4.T3) and Table [4](https://arxiv.org/html/2606.03157#S5.T4), we observe the following\. 1\) All LLMs perform poorly on both the English and Chinese datasets, leaving substantial room for improvement compared to human performance \(Avg scores of 85\.00% and 87\.51%, respectively\)\. The best\-performing model achieves only 49\.68% on the English dataset and 48\.17% on the Chinese dataset, highlighting the significant challenge posed by our ClinicalMC benchmark\. 2\) LLMs perform worse in multi\-course settings than in single\-course settings\. Specifically, Llama\-3\.3\-70B achieves a TP\_IoU of 6\.38% on Chinese data and 10\.87% on English data, exceeding its CT\_IoU by 4\.41 and 5\.13 percentage points, respectively\. This decline is primarily due to the increasing complexity of clinical information as the number of courses grows\. Patients’ records often contain redundant or repeated examinations and treatments, making it more challenging for the model to accurately assess the current condition and generate up\-to\-date treatment plans\. 3\) Notably, although Asclepius\-Llama2 is trained on the PMC\-Patients dataset, it performs poorly on the English subset of ClinicalMC\. Specifically, the 7B and 13B variants achieve Avg scores of only 12\.58% and 12\.61%, respectively\. This is primarily because ClinicalMC, through multi\-trajectory decomposition and multiple rounds of human review, reconstructs the medical records into reasoning tasks that require cross\-trajectory information integration and explicit clinical decision\-making\. In contrast, Asclepius\-Llama2 focuses more on medical record generation and local semantic modeling, limiting its effectiveness in such complex clinical reasoning scenarios\. These results further highlight the challenging nature of ClinicalMC for evaluating clinical reasoning capabilities\.
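As a sanity check on how the Avg column is aggregated, the rows of the results table for the closed\-source models and the human baseline are consistent with an unweighted mean of the eleven per\-task scores\. The snippet below is only a check under that assumption; the dictionary and function names are introduced here for illustration, and the tolerance absorbs rounding of the published per\-task values\.

```python
# Per-task scores and the reported Avg, copied from the results table rows.
ROWS = {
    "GPT-4o-mini": ([54.43, 20.65, 32.58, 69.47, 36.41, 5.81, 15.46,
                     53.97, 1.56, 28.28, 64.66], 34.84),
    "GPT-5-mini": ([59.62, 25.92, 12.00, 83.76, 49.48, 6.76, 9.66,
                    84.34, 4.19, 44.60, 82.16], 42.04),
    "Qwen-turbo": ([55.69, 25.88, 33.80, 73.29, 38.68, 7.49, 26.13,
                    62.31, 2.71, 29.77, 69.08], 38.62),
    "Human (sampling)": ([90.91, 85.61, 87.68, 82.91, 88.14, 79.41, 86.70,
                          82.76, 84.53, 83.71, 82.73], 85.00),
}


def check_avg(scores, reported, tol=0.02):
    """Return (matches, computed mean) for one table row."""
    computed = sum(scores) / len(scores)
    return abs(computed - reported) <= tol, computed


for name, (scores, reported) in ROWS.items():
    ok, computed = check_avg(scores, reported)
    print(f"{name}: computed={computed:.3f} reported={reported:.2f} ok={ok}")
```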

### 5\.3 Error Type

To guide future research in clinical decision\-making for LLMs, we manually analyze and classify 200 error samples generated by LLMs on the Chinese and English datasets of ClinicalMC\. These errors are categorized into five types: \(a\) Redundant Diagnostic and Treatment Plan \(RDTP\): The model generates an excessive number of unnecessary diagnostic tests and treatment plans\. \(b\) Failure to Detect Subtle but Critical Changes \(FDSC\): The model fails to recognize subtle yet clinically significant changes in a patient’s condition, such as slight fluctuations in laboratory results, which may lead to delayed or inappropriate adjustments in diagnosis or treatment plans\. \(c\) Incorrect Clinical Diagnosis \(ICD\): Due to a lack of domain\-specific medical knowledge or misinterpretation of clinical information, the model produces incorrect diagnostic conclusions\. \(d\) Incorrect Reasoning Chain \(IRC\): The diagnostic rationale produced by the model does not align with the actual clinical condition\. \(e\) Other Errors: All errors that do not fall into the preceding categories\. The error distribution is shown in Fig\. [4](https://arxiv.org/html/2606.03157#S5.F4), with illustrative Chinese and English examples included in Appendix [A\.8](https://arxiv.org/html/2606.03157#A1.SS8)\.
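For readers who want to reproduce this kind of analysis on their own model outputs, the taxonomy can be encoded and tallied as in the sketch below\. The enum mirrors the five categories named above; the function name is hypothetical, and any labels passed in are the user's own annotations, not the distribution reported in the paper\.

```python
from collections import Counter
from enum import Enum
from typing import Dict, Iterable


class ErrorType(Enum):
    RDTP = "Redundant Diagnostic and Treatment Plan"
    FDSC = "Failure to Detect Subtle but Critical Changes"
    ICD = "Incorrect Clinical Diagnosis"
    IRC = "Incorrect Reasoning Chain"
    OTHER = "Other Errors"


def distribution(labels: Iterable[ErrorType]) -> Dict[str, float]:
    """Return the share of each error type, in percent, over labelled samples."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {t.name: 0.0 for t in ErrorType}
    return {t.name: 100.0 * counts.get(t, 0) / total for t in ErrorType}
```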

![Refer to caption](https://arxiv.org/html/2606.03157v1/x4.png)

Figure 4: Distribution of error types\.

## 6 Related Work

Clinical decision\-making benchmark\. Clinical decision\-making tasks refer to assisting doctors in making the most appropriate diagnosis and treatment decisions by continuously analyzing the patient’s chief complaints, medical history, examination results, and other information Hager et al\. \([2024](https://arxiv.org/html/2606.03157#bib.bib12)\)\. Existing clinical decision\-making benchmarks can be broadly classified into two types: exam\-based and clinical case\-based benchmarks\. Exam\-based benchmarks, such as MedQA Jin et al\. \([2021](https://arxiv.org/html/2606.03157#bib.bib29)\), MedMCQA Pal et al\. \([2022](https://arxiv.org/html/2606.03157#bib.bib30)\), PubMedQA Jin et al\. \([2019](https://arxiv.org/html/2606.03157#bib.bib31)\), and MMLU Hendrycks et al\. \([2021](https://arxiv.org/html/2606.03157#bib.bib32)\), consist primarily of Q&A pairs extracted from medical books and literature\. However, a gap remains between these benchmarks and actual clinical decision\-making\. Recent studies have therefore proposed clinical case\-based benchmarks, such as MedChain Liu et al\. \([2025b](https://arxiv.org/html/2606.03157#bib.bib13)\), ClinicalLab Yan et al\. \([2024](https://arxiv.org/html/2606.03157#bib.bib11)\), MSDiagnosis Hou et al\. \([2024](https://arxiv.org/html/2606.03157#bib.bib33)\), AI Hospital Fan et al\. \([2025b](https://arxiv.org/html/2606.03157#bib.bib14)\), and MedJourney Wu et al\. \([2024](https://arxiv.org/html/2606.03157#bib.bib17)\)\. However, these benchmarks mostly focus on a single course or simulate decision\-making in outpatient settings\. For example, ClinicalLab evaluates tasks such as department guidance, clinical diagnosis, and treatment planning, but does not involve continuous assessment of patients after treatment until they recover and are discharged\. Therefore, this study focuses on the clinical decision\-making performance of models across multiple courses after patient admission\.

Agent for medical decision\-making\. Research on intelligent agents for medical decision\-making can be divided into single\-agent Li et al\. \([2024a](https://arxiv.org/html/2606.03157#bib.bib38)\); Chen et al\. \([2024](https://arxiv.org/html/2606.03157#bib.bib34)\); Hou et al\. \([2026](https://arxiv.org/html/2606.03157#bib.bib4)\) and multi\-agent Tang et al\. \([2024](https://arxiv.org/html/2606.03157#bib.bib36)\); Li et al\. \([2024b](https://arxiv.org/html/2606.03157#bib.bib37)\) methods\. In single\-agent research, CoD Chen et al\. \([2024](https://arxiv.org/html/2606.03157#bib.bib34)\) assesses potential candidate diseases by planning to inquire about the patient’s latent symptoms and generates a diagnostic chain from symptoms to possible diseases\. In multi\-agent research, medical decision\-making problems are typically tackled through a multi\-agent task division and collaboration paradigm, such as in frameworks like MDAgents Kim et al\. \([2024](https://arxiv.org/html/2606.03157#bib.bib35)\), MedAgents Tang et al\. \([2024](https://arxiv.org/html/2606.03157#bib.bib36)\), and Agent Hospital Li et al\. \([2024b](https://arxiv.org/html/2606.03157#bib.bib37)\)\. MDAgents is a multi\-agent framework that utilizes adaptive decision\-making mechanisms to tackle medical decision\-making challenges\. It operates through multiple phases, including analyzing problem complexity, dynamically recruiting experts, and employing reasoning and decision\-making processes at various stages to solve medical Q&A\.

## 7 Conclusion

We introduce ClinicalMC, a benchmark comprising both Chinese and English datasets that encompass the full patient journey from admission to discharge\. The journey is divided into four stages: triage, first\-course examination/diagnosis/treatment, subsequent multi\-course examination/assessment/treatment, and final diagnosis\. To evaluate model performance in multi\-course clinical decision\-making, we develop a multi\-agent framework involving patient, examiner, and doctor agents\. Based on the dataset and framework, we define two experimental settings, single\-turn and multi\-turn, and evaluate medical, closed\-source, and open\-source LLMs with extensive experimental analysis\. The results show that ClinicalMC is a challenging dataset that warrants further research and exploration\.

## Limitations

This paper has two primary limitations that offer avenues for future research: First, the lack of multimodal information\. The raw data used in this study primarily consist of textual medical records collected during hospitalization and do not cover multimodal data across multiple courses, such as medical images and time\-series physiological signals\. In future work, we plan to investigate the integration of medical imaging with text\-based reasoning to support clinical decision\-making over heterogeneous, multi\-source data\. Second, an imbalanced department distribution\. The current dataset is mainly derived from a single data source, leading to imbalanced distributions across clinical departments\. Although this imbalance partially reflects real\-world clinical practice, it may still affect the model’s generalization performance in underrepresented departments\. In future work, we will incorporate data from multiple healthcare systems to expand coverage and mitigate department\-level imbalance\.

## Ethical Consideration

Our ClinicalMC benchmark is based on PMC\-Patients and MedEureka, which are licensed under the Creative Commons Attribution 4\.0 License\. Accordingly, we release ClinicalMC under the CC\-BY 4\.0 license\. In addition, we have meticulously reviewed our dataset to ensure it does not contain any harmful content, including gender bias, racial discrimination, or inappropriate material\.

## Acknowledgments

This paper was supported by the Shanghai Science and Technology Innovation Action Plan in Computational Biology \(No\. 24JS2840200\)\.

## References

- E\. Almazrouei, H\. Alobeidli, A\. Alshamsi, A\. Cappelli, R\. Cojocaru, M\. Debbah, É\. Goffinet, D\. Hesslow, J\. Launay, Q\. Malartic, et al\. \(2023\) The falcon series of open language models\. arXiv preprint arXiv:2311\.16867\. Cited by: [§5\.1](https://arxiv.org/html/2606.03157#S5.SS1.p1.1)\.
- M\. Banerjee, M\. Capozzoli, L\. McSweeney, and D\. Sinha \(1999\) Beyond kappa: a review of interrater agreement measures\. Canadian journal of statistics 27 \(1\), pp\. 3–23\. Cited by: [§3\.2](https://arxiv.org/html/2606.03157#S3.SS2.p1.1)\.
- J\. Chen, C\. Gui, A\. Gao, K\. Ji, X\. Wang, X\. Wan, and B\. Wang \(2024\) CoD, towards an interpretable medical agent using chain of diagnosis\. arXiv preprint arXiv:2407\.13301\. Cited by: [§6](https://arxiv.org/html/2606.03157#S6.p2.1)\.
- J\. Chen, X\. Wang, K\. Ji, A\. Gao, F\. Jiang, S\. Chen, H\. Zhang, D\. Song, W\. Xie, C\. Kong, et al\. \(2023\) Huatuogpt\-ii, one\-stage training for medical adaption of llms\. arXiv preprint arXiv:2311\.09774\. Cited by: [§1](https://arxiv.org/html/2606.03157#S1.p4.1), [§5\.1](https://arxiv.org/html/2606.03157#S5.SS1.p1.1)\.
- Z\. Chen, Z\. Peng, X\. Liang, C\. Wang, P\. Liang, L\. Zeng, M\. Ju, and Y\. Yuan \(2025\) Map: evaluation and multi\-agent enhancement of large language models for inpatient pathways\. arXiv preprint arXiv:2503\.13205\. Cited by: [Table 1](https://arxiv.org/html/2606.03157#S1.T1.1.1.8.8.1)\.
- C\. Dou, C\. Liu, F\. Yang, F\. Li, J\. Jia, M\. Chen, Q\. Ju, S\. Wang, S\. Dang, T\. Li, et al\. \(2025\) Baichuan\-m2: scaling medical capability with large verifier system\. arXiv preprint arXiv:2509\.02208\. Cited by: [§5\.1](https://arxiv.org/html/2606.03157#S5.SS1.p1.1)\.
- Y\. Fan, N\. Wang, K\. Xue, J\. Liu, and T\. Ruan \(2025a\) MedEureka: a medical domain benchmark for multi\-granularity and multi\-data\-type embedding\-based retrieval\. In Findings of the Association for Computational Linguistics: NAACL 2025, L\. Chiruzzo, A\. Ritter, and L\. Wang \(Eds\.\), Albuquerque, New Mexico, pp\. 2825–2851\. External Links: [Link](https://aclanthology.org/2025.findings-naacl.154/), ISBN 979\-8\-89176\-195\-7\. Cited by: [§1](https://arxiv.org/html/2606.03157#S1.p4.1), [§5\.1](https://arxiv.org/html/2606.03157#S5.SS1.p2.8)\.
- Z\. Fan, L\. Wei, J\. Tang, W\. Chen, W\. Siyuan, Z\. Wei, and F\. Huang \(2025b\) AI hospital: benchmarking large language models in a multi\-agent medical interaction simulator\. In Proceedings of the 31st International Conference on Computational Linguistics, O\. Rambow, L\. Wanner, M\. Apidianaki, H\. Al\-Khalifa, B\. D\. Eugenio, and S\. Schockaert \(Eds\.\), Abu Dhabi, UAE, pp\. 10183–10213\. External Links: [Link](https://aclanthology.org/2025.coling-main.680/)\. Cited by: [Table 1](https://arxiv.org/html/2606.03157#S1.T1.1.1.6.6.1), [§1](https://arxiv.org/html/2606.03157#S1.p3.1), [§4](https://arxiv.org/html/2606.03157#S4.p1.1), [§6](https://arxiv.org/html/2606.03157#S6.p1.1)\.
- Y\. Gao, T\. Miller, D\. Xu, D\. Dligach, M\. M\. Churpek, and M\. Afshar \(2022\) Summarizing patients’ problems from hospital progress notes using pre\-trained sequence\-to\-sequence models\. In Proceedings of COLING\. International Conference on Computational Linguistics, Vol\. 2022, pp\. 2979\. Cited by: [§A\.1](https://arxiv.org/html/2606.03157#A1.SS1.p2.1)\.
- A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Vaughan, et al\. \(2024\) The llama 3 herd of models\. arXiv preprint arXiv:2407\.21783\. Cited by: [§5\.1](https://arxiv.org/html/2606.03157#S5.SS1.p1.1)\.
- P\. Hager, F\. Jungmann, R\. Holland, K\. Bhagat, I\. Hubrecht, M\. Knauer, J\. Vielhauer, M\. Makowski, R\. Braren, G\. Kaissis, et al\. \(2024\) Evaluation and mitigation of the limitations of large language models in clinical decision\-making\. Nature medicine 30 \(9\), pp\. 2613–2622\. Cited by: [Table 1](https://arxiv.org/html/2606.03157#S1.T1.1.1.4.4.1), [§1](https://arxiv.org/html/2606.03157#S1.p1.1), [§1](https://arxiv.org/html/2606.03157#S1.p2.1), [§6](https://arxiv.org/html/2606.03157#S6.p1.1)\.
- D\. Hendrycks, C\. Burns, S\. Basart, A\. Zou, M\. Mazeika, D\. Song, and J\. Steinhardt \(2021\) Measuring massive multitask language understanding\. Proceedings of the International Conference on Learning Representations \(ICLR\)\. Cited by: [§1](https://arxiv.org/html/2606.03157#S1.p3.1), [§6](https://arxiv.org/html/2606.03157#S6.p1.1)\.
- R\. Hou, S\. Chen, Y\. Fan, L\. Zhu, J\. Sun, J\. Liu, and T\. Ruan \(2024\) MSDiagnosis: an emr\-based dataset for clinical multi\-step diagnosis\. arXiv preprint arXiv:2408\.10039\. Cited by: [§6](https://arxiv.org/html/2606.03157#S6.p1.1)\.
- R\. Hou, D\. Xue, H\. Sun, P\. He, W\. Zhang, and T\. Ruan \(2026\) CDAFlow: enhancing llm clinical decision\-making through agentic workflow\. Expert Systems with Applications, pp\. 131806\. Cited by: [§6](https://arxiv.org/html/2606.03157#S6.p2.1)\.
- A\. Hurst, A\. Lerer, A\. P\. Goucher, A\. Perelman, A\. Ramesh, A\. Clark, A\. Ostrow, A\. Welihinda, A\. Hayes, A\. Radford, et al\. \(2024\) Gpt\-4o system card\. arXiv preprint arXiv:2410\.21276\. Cited by: [§1](https://arxiv.org/html/2606.03157#S1.p4.1), [§5\.1](https://arxiv.org/html/2606.03157#S5.SS1.p1.1)\.
- A\. Q\. Jiang, A\. Sablayrolles, A\. Roux, A\. Mensch, B\. Savary, C\. Bamford, D\. S\. Chaplot, D\. d\. l\. Casas, E\. B\. Hanna, F\. Bressand, et al\. \(2024\) Mixtral of experts\. arXiv preprint arXiv:2401\.04088\. Cited by: [§5\.1](https://arxiv.org/html/2606.03157#S5.SS1.p1.1)\.
- A\. Q\. Jiang, A\. Sablayrolles, A\. Mensch, C\. Bamford, D\. S\. Chaplot, D\. de las Casas, F\. Bressand, G\. Lengyel, G\. Lample, L\. Saulnier, L\. R\. Lavaud, M\. Lachaux, P\. Stock, T\. L\. Scao, T\. Lavril, T\. Wang, T\. Lacroix, and W\. E\. Sayed \(2023\) Mistral 7b\. External Links: 2310\.06825, [Link](https://arxiv.org/abs/2310.06825)\. Cited by: [§5\.1](https://arxiv.org/html/2606.03157#S5.SS1.p1.1)\.
- D\. Jin, E\. Pan, N\. Oufattole, W\. Weng, H\. Fang, and P\. Szolovits \(2021\) What disease does this patient have? a large\-scale open domain question answering dataset from medical exams\. Applied Sciences 11 \(14\), pp\. 6421\. Cited by: [§1](https://arxiv.org/html/2606.03157#S1.p1.1), [§1](https://arxiv.org/html/2606.03157#S1.p3.1), [§6](https://arxiv.org/html/2606.03157#S6.p1.1)\.
- Q\. Jin, B\. Dhingra, Z\. Liu, W\. W\. Cohen, and X\. Lu \(2019\) Pubmedqa: a dataset for biomedical research question answering\. arXiv preprint arXiv:1909\.06146\. Cited by: [§1](https://arxiv.org/html/2606.03157#S1.p3.1), [§6](https://arxiv.org/html/2606.03157#S6.p1.1)\.
- A\. E\. Johnson, L\. Bulgarelli, L\. Shen, A\. Gayles, A\. Shammout, S\. Horng, T\. J\. Pollard, S\. Hao, B\. Moody, B\. Gow, et al\. \(2023\) MIMIC\-iv, a freely accessible electronic health record dataset\. Scientific data 10 \(1\), pp\. 1\. Cited by: [§A\.1](https://arxiv.org/html/2606.03157#A1.SS1.p2.1)\.
- A\. E\. Johnson, T\. J\. Pollard, L\. Shen, L\. H\. Lehman, M\. Feng, M\. Ghassemi, B\. Moody, P\. Szolovits, L\. Anthony Celi, and R\. G\. Mark \(2016\) MIMIC\-iii, a freely accessible critical care database\. Scientific data 3 \(1\), pp\. 1–9\. Cited by: [§A\.1](https://arxiv.org/html/2606.03157#A1.SS1.p2.1)\.
- S\. Johri, J\. Jeong, B\. A\. Tran, D\. I\. Schlessinger, S\. Wongvibulsin, L\. A\. Barnes, H\. Zhou, Z\. R\. Cai, E\. M\. Van Allen, D\. Kim, et al\. \(2025\) An evaluation framework for clinical use of large language models in patient interaction tasks\. Nature Medicine, pp\. 1–10\. Cited by: [Table 1](https://arxiv.org/html/2606.03157#S1.T1.1.1.7.7.1)\.
- Y\. Kim, C\. Park, H\. Jeong, Y\. S\. Chan, X\. Xu, D\. McDuff, H\. Lee, M\. Ghassemi, C\. Breazeal, and H\. W\. Park \(2024\) MDAgents: an adaptive collaboration of LLMs for medical decision\-making\. In The Thirty\-eighth Annual Conference on Neural Information Processing Systems\. External Links: [Link](https://openreview.net/forum?id=EKdk4vxKO4)\. Cited by: [§6](https://arxiv.org/html/2606.03157#S6.p2.1)\.
- S\. Kweon, J\. Kim, J\. Kim, S\. Im, E\. Cho, S\. Bae, J\. Oh, G\. Lee, J\. H\. Moon, S\. C\. You, S\. Baek, C\. H\. Han, Y\. B\. Jung, Y\. Jo, and E\. Choi \(2024a\) Publicly shareable clinical large language model built on synthetic clinical notes\. In Findings of the Association for Computational Linguistics: ACL 2024, L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\), Bangkok, Thailand, pp\. 5148–5168\. External Links: [Link](https://aclanthology.org/2024.findings-acl.305/), [Document](https://dx.doi.org/10.18653/v1/2024.findings-acl.305)\. Cited by: [§5\.1](https://arxiv.org/html/2606.03157#S5.SS1.p1.1)\.
- S\. Kweon, J\. Kim, J\. Kim, S\. Im, E\. Cho, S\. Bae, J\. Oh, G\. Lee, J\. H\. Moon, S\. C\. You, S\. Baek, C\. H\. Han, Y\. B\. Jung, Y\. Jo, and E\. Choi \(2024b\) Publicly shareable clinical large language model built on synthetic clinical notes\. In Findings of the Association for Computational Linguistics: ACL 2024, L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\), Bangkok, Thailand, pp\. 5148–5168\. External Links: [Link](https://aclanthology.org/2024.findings-acl.305/), [Document](https://dx.doi.org/10.18653/v1/2024.findings-acl.305)\. Cited by: [§A\.1](https://arxiv.org/html/2606.03157#A1.SS1.p2.1)\.
- B\. Li, T\. Yan, Y\. Pan, J\. Luo, R\. Ji, J\. Ding, Z\. Xu, S\. Liu, H\. Dong, Z\. Lin, and Y\. Wang \(2024a\) MMedAgent: learning to use medical tools with multi\-modal agent\. In Findings of the Association for Computational Linguistics: EMNLP 2024, Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\), Miami, Florida, USA, pp\. 8745–8760\. External Links: [Link](https://aclanthology.org/2024.findings-emnlp.510), [Document](https://dx.doi.org/10.18653/v1/2024.findings-emnlp.510)\. Cited by: [§6](https://arxiv.org/html/2606.03157#S6.p2.1)\.
- J\. Li, Y\. Lai, W\. Li, J\. Ren, M\. Zhang, X\. Kang, S\. Wang, P\. Li, Y\. Zhang, W\. Ma, et al\. \(2024b\) Agent hospital: a simulacrum of hospital with evolvable medical agents\. arXiv preprint arXiv:2405\.02957\. Cited by: [§6](https://arxiv.org/html/2606.03157#S6.p2.1)\.
- Y\. Lin, T\. Ruan, J\. Liu, and H\. Wang \(2023\) A survey on neural data\-to\-text generation\. IEEE Transactions on Knowledge and Data Engineering 36 \(4\), pp\. 1431–1449\. Cited by: [§1](https://arxiv.org/html/2606.03157#S1.p1.1)\.
- A\. Liu, A\. Mei, B\. Lin, B\. Xue, B\. Wang, B\. Xu, B\. Wu, B\. Zhang, C\. Lin, C\. Dong, et al\. \(2025a\) Deepseek\-v3\.2: pushing the frontier of open large language models\. arXiv preprint arXiv:2512\.02556\. Cited by: [§1](https://arxiv.org/html/2606.03157#S1.p4.1), [§5\.1](https://arxiv.org/html/2606.03157#S5.SS1.p1.1)\.
- J\. Liu, W\. Wang, Z\. Ma, G\. Huang, S\. Yihang, K\. Chang, H\. Li, L\. Shen, M\. Lyu, and W\. Chen \(2025b\) MedChain: bridging the gap between LLM agents and clinical practice with interactive sequence\. In The Thirty\-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track\. External Links: [Link](https://openreview.net/forum?id=YvuufwkFJY)\. Cited by: [Table 1](https://arxiv.org/html/2606.03157#S1.T1.1.1.5.5.1), [§5\.1](https://arxiv.org/html/2606.03157#S5.SS1.p2.8), [§6](https://arxiv.org/html/2606.03157#S6.p1.1)\.
- A\. Pal, L\. K\. Umapathi, and M\. Sankarasubbu \(2022\) Medmcqa: a large\-scale multi\-subject multi\-choice dataset for medical domain question answering\. In Conference on health, inference, and learning, pp\. 248–260\. Cited by: [§1](https://arxiv.org/html/2606.03157#S1.p3.1), [§6](https://arxiv.org/html/2606.03157#S6.p1.1)\.
- P\. Qiu, C\. Wu, S\. Liu, W\. Zhao, Y\. Zhang, Y\. Wang, and W\. Xie \(2025\) Quantifying the reasoning abilities of llms on real\-world clinical cases\. arXiv preprint arXiv:2503\.04691\. Cited by: [Table 1](https://arxiv.org/html/2606.03157#S1.T1.1.1.2.2.1)\.
- A\. Sellergren, S\. Kazemzadeh, T\. Jaroensri, A\. Kiraly, M\. Traverse, T\. Kohlberger, S\. Xu, F\. Jamil, C\. Hughes, C\. Lau, et al\. \(2025\) Medgemma technical report\. arXiv preprint arXiv:2507\.05201\. Cited by: [§5\.1](https://arxiv.org/html/2606.03157#S5.SS1.p1.1)\.
- R\. T\. Sutton, D\. Pincock, D\. C\. Baumgart, D\. C\. Sadowski, R\. N\. Fedorak, and K\. I\. Kroeker \(2020\) An overview of clinical decision support systems: benefits, risks, and strategies for success\. NPJ digital medicine 3 \(1\), pp\. 17\. Cited by: [§1](https://arxiv.org/html/2606.03157#S1.p1.1)\.
- X\. Tang, A\. Zou, Z\. Zhang, Z\. Li, Y\. Zhao, X\. Zhang, A\. Cohan, and M\. Gerstein \(2024\) MedAgents: large language models as collaborators for zero\-shot medical reasoning\. In Findings of the Association for Computational Linguistics ACL 2024, L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\), Bangkok, Thailand and virtual meeting, pp\. 599–621\. External Links: [Link](https://aclanthology.org/2024.findings-acl.33), [Document](https://dx.doi.org/10.18653/v1/2024.findings-acl.33)\. Cited by: [§6](https://arxiv.org/html/2606.03157#S6.p2.1)\.
- H\. Touvron, T\. Lavril, G\. Izacard, X\. Martinet, M\. Lachaux, T\. Lacroix, B\. Rozière, N\. Goyal, E\. Hambro, F\. Azhar, et al\. \(2023\) Llama: open and efficient foundation language models\. arXiv preprint arXiv:2302\.13971\. Cited by: [§A\.1](https://arxiv.org/html/2606.03157#A1.SS1.p2.1)\.
- Ö\. Uzuner, Y\. Luo, and P\. Szolovits \(2007\) Evaluating the state\-of\-the\-art in automatic de\-identification\. Journal of the American Medical Informatics Association 14 \(5\), pp\. 550–563\. Cited by: [§A\.1](https://arxiv.org/html/2606.03157#A1.SS1.p2.1)\.
- B\. Wang, J\. Chang, Y\. Qian, G\. Chen, J\. Chen, Z\. Jiang, J\. Zhang, Y\. Nakashima, and H\. Nagahara \(2024\) DiReCT: diagnostic reasoning for clinical notes via large language models\. In The Thirty\-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track\. Cited by: [§A\.1](https://arxiv.org/html/2606.03157#A1.SS1.p2.1), [Table 1](https://arxiv.org/html/2606.03157#S1.T1.1.1.10.10.1)\.
- X\. Wu, Y\. Zhao, Y\. Zhang, J\. Wu, Z\. Zhu, Y\. Zhang, Y\. Ouyang, Z\. Zhang, H\. Wang, J\. Yang, et al\. \(2024\) MedJourney: benchmark and evaluation of large language models over patient clinical journey\. Advances in Neural Information Processing Systems 37, pp\. 87621–87646\. Cited by: [Table 1](https://arxiv.org/html/2606.03157#S1.T1.1.1.9.9.1), [§1](https://arxiv.org/html/2606.03157#S1.p3.1), [§6](https://arxiv.org/html/2606.03157#S6.p1.1)\.
- W\. Yan, H\. Liu, T\. Wu, Q\. Chen, W\. Wang, H\. Chai, J\. Wang, W\. Zhao, Y\. Zhang, R\. Zhang, et al\. \(2024\) ClinicalLab: aligning agents for multi\-departmental clinical diagnostics in the real world\. arXiv preprint arXiv:2406\.13890\. Cited by: [Table 1](https://arxiv.org/html/2606.03157#S1.T1.1.1.3.3.1), [§1](https://arxiv.org/html/2606.03157#S1.p3.1), [§6](https://arxiv.org/html/2606.03157#S6.p1.1)\.
- Z\. Zhan, S\. Zhou, M\. Li, and R\. Zhang \(2025\) RAMIE: retrieval\-augmented multi\-task information extraction with large language models on dietary supplements\. Journal of the American Medical Informatics Association 32 \(3\), pp\. 545–554\. External Links: ISSN 1527\-974X, [Document](https://dx.doi.org/10.1093/jamia/ocaf002), [Link](https://doi.org/10.1093/jamia/ocaf002), https://academic\.oup\.com/jamia/article\-pdf/32/3/545/61415205/ocaf002\.pdf\. Cited by: [§1](https://arxiv.org/html/2606.03157#S1.p1.1)\.
- Z\. Zhao, Q\. Jin, F\. Chen, T\. Peng, and S\. Yu \(2022\) Pmc\-patients: a large\-scale dataset of patient summaries and relations for benchmarking retrieval\-based clinical decision support systems\. arXiv preprint arXiv:2202\.13876\. Cited by: [§1](https://arxiv.org/html/2606.03157#S1.p4.1)\.
- G\. Zheng, X\. Wang, J\. Liang, N\. Chen, Y\. Zheng, and B\. Wang \(2024\) Efficiently democratizing medical llms for 50 languages via a mixture of language family experts\. arXiv preprint arXiv:2410\.10626\. Cited by: [§5\.1](https://arxiv.org/html/2606.03157#S5.SS1.p1.1)\.

## Appendix A

### A\.1 Data Annotation

To create a high\-quality benchmark, we organize a professional team of three inspectors and two reviewers, all trained in specialized medical knowledge\. The annotation procedure includes first\-round annotation, second\-round checking, and third\-round review\.

First\-round annotation\. Since PMC\-Patients contains case reports instead of complete EHRs, we use the GPT\-4o model to convert these summaries into full EHRs containing multiple courses\. For both the Chinese and English datasets, the EHRs contain key information such as the primary and final diagnoses, but still lack the initial treatment plan and multiple progress notes\. Therefore, we first input the patient’s primary diagnosis, auxiliary examinations, and chief complaint into the GPT\-4o model to generate the initial treatment plan\. We then prompt the model to segment the treatment process into multiple progress notes based on temporal information and clinical status changes, with each note following the standard SOAP format Gao et al\. \([2022](https://arxiv.org/html/2606.03157#bib.bib48)\); Wang et al\. \([2024](https://arxiv.org/html/2606.03157#bib.bib46)\)\. To mitigate hallucinations during generation, we explicitly instruct the model in the prompt to avoid generating clinical information \(such as examination results and time intervals\) not present in the record summary\. Additionally, inspired by Asclepius Kweon et al\. \([2024b](https://arxiv.org/html/2606.03157#bib.bib5)\), we evaluate the similarity between the converted English EHRs and real EHRs using perplexity\. Specifically, we fine\-tune LLaMA\-7B Touvron et al\. \([2023](https://arxiv.org/html/2606.03157#bib.bib6)\) on 57,000 real discharge summaries from the MIMIC\-III database Johnson et al\. \([2016](https://arxiv.org/html/2606.03157#bib.bib7)\)\. Using the same model, we then measure the perplexity of 500 discharge summaries from two hospital datasets, MIMIC\-IV Johnson et al\. \([2023](https://arxiv.org/html/2606.03157#bib.bib50)\) and i2b2 Uzuner et al\. \([2007](https://arxiv.org/html/2606.03157#bib.bib8)\), as well as 500 case reports from PMC\-Patients\. Finally, we evaluate the perplexity of the clinical cases synthesized from PMC\-Patients\. Results show perplexity scores of 3\.144 for MIMIC\-IV and 5\.916 for i2b2\. In comparison, the original PMC\-Patients data yields 72\.471, while the GPT\-4o\-converted EHRs achieve a much lower score of 6\.064\. These results indicate that our synthetic notes are substantially closer to real hospital data than the original case reports\.
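For reference, perplexity is the exponential of the mean negative log\-likelihood per token, so lower values mean the fine\-tuned model finds the text more predictable\. The sketch below computes it from per\-token natural\-log probabilities; how those log\-probabilities are obtained from the fine\-tuned LLaMA\-7B is not shown, and the token\-weighted corpus aggregation is an assumption, since the text does not state how documents are pooled\.

```python
import math
from typing import Iterable, Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp(mean negative log-likelihood), with natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def corpus_perplexity(docs: Iterable[Sequence[float]]) -> float:
    """Token-weighted perplexity over several documents (assumed pooling)."""
    flat = [lp for doc in docs for lp in doc]
    return perplexity(flat)
```

For example, a text in which every token receives probability 0\.25 has a perplexity of 4\.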

Second\-round checking\. We invite a review team of three clinically trained medical students to perform quality checks on the annotations generated by GPT\-4o\. Any sample unanimously deemed invalid by all three reviewers is directly discarded\. If only one or two reviewers raise concerns, the sample is manually re\-annotated and retained only after all three reviewers agree that the revised annotation is appropriate\. The criteria for determining annotation validity focus on three key aspects: 1\) Consistency in the number of course records compared with the original EHR, including whether the model hallucinates nonexistent entries or omits essential course records; 2\) Completeness and accuracy of examination information, ensuring that key results \(e\.g\., laboratory findings\) appear in the generated course records and remain faithful to the original data; 3\) Correctness of field semantics, such as ensuring that the “chief complaint” reflects the patient’s subjective description of symptoms rather than objective examination findings\. Additionally, we employ a batch\-based iterative validation mechanism: each batch must achieve over 90% accuracy in the aggregated evaluation of the three reviewers before progressing to the next stage\. This process effectively filters out structural inconsistencies, hallucinated content, and medical reasoning errors in the synthetic data, thereby establishing a reliable foundation for subsequent expert review\.
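The decision rule for a single sample and the batch gate can be written down compactly\. This is a sketch of the procedure as described, with hypothetical function names and return labels; the re\-annotation step itself is manual and is only represented by a label here\.

```python
from typing import Sequence


def triage_sample(votes_invalid: Sequence[bool]) -> str:
    """votes_invalid[i] is True if reviewer i flags the sample as invalid."""
    if len(votes_invalid) != 3:
        raise ValueError("expected exactly three reviewer votes")
    flagged = sum(votes_invalid)
    if flagged == 3:
        return "discard"        # unanimously invalid
    if flagged >= 1:
        return "re-annotate"    # one or two reviewers raised concerns
    return "keep"


def batch_passes(accepted: int, total: int, threshold: float = 0.90) -> bool:
    """A batch advances only if its accuracy is strictly above the threshold."""
    return total > 0 and accepted / total > threshold
```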

Third\-round review\. We submit the preliminarily inspected EHRs to two clinicians for dual expert review\. The clinicians randomly sample 30% of the cases for quality assessment and systematically evaluate whether each case narrative aligns with real clinical workflows \(e\.g\., examination sequences, diagnostic reasoning, and treatment decisions\) and whether any medical inaccuracies or potential safety risks are present\. Any sample deemed unsatisfactory is returned to the previous stage for revision by the inspection team and then resubmitted for expert review\. We repeatedly implement this iterative cycle of expert feedback → manual correction → re\-review until the sampling accuracy consistently reaches 95% or higher\. After multiple rounds of iteration and dual clinical review, we ultimately obtain 1,275 high\-quality Chinese cases and 5,804 high\-quality English cases\. All cases pass rigorous evaluations of medical consistency, data safety, and factual accuracy\.
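The sampling loop described above might look like the following\. It is a simplified sketch: the callables standing in for the clinicians' judgment and the inspectors' revision are hypothetical, the round cap and seed are arbitrary, and the loop stops at the first round that reaches the target rather than requiring it "consistently" over several rounds\.

```python
import random
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def expert_review_loop(
    cases: Sequence[T],
    is_acceptable: Callable[[T], bool],
    revise: Callable[[T], T],
    sample_rate: float = 0.30,
    target: float = 0.95,
    max_rounds: int = 10,
    seed: int = 0,
) -> List[T]:
    """Sample, review, and revise until sampled accuracy reaches the target."""
    rng = random.Random(seed)
    pool = list(cases)
    if not pool:
        return pool
    for _ in range(max_rounds):
        k = max(1, round(sample_rate * len(pool)))
        sampled = rng.sample(range(len(pool)), k)
        rejected = [i for i in sampled if not is_acceptable(pool[i])]
        if 1 - len(rejected) / k >= target:
            return pool
        # Return rejected cases for manual correction, then re-review.
        for i in rejected:
            pool[i] = revise(pool[i])
    raise RuntimeError("target sampling accuracy not reached")
```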

Contributors\. Medical students and clinicians are primarily recruited from the internship programs and clinical departments of a Grade 3A hospital and jointly participate in the data\-annotation process\. Compensation is provided at $5–20/hr for medical students and $50–100/hr for clinicians, based on task difficulty and required expertise\.

Table 5: The evaluation of LLM performance on ClinicalMC English data using GPT\-4, with a maximum score of 10 points\. “Comp\.”, “Prof\.” and “Auth\.” denote “Comprehensiveness”, “Professionalism” and “Authenticity”, respectively\.

### A\.2 Implementation Details

In this paper, we adopt two experimental settings, with all experiments conducted under a zero\-shot setting\. In the first experimental setting, for downstream tasks in the workflow, we provide the ground\-truth annotations from preceding tasks as inputs, rather than using the model\-generated outputs\. In the second experimental setting, model responses from earlier tasks are directly used as inputs for subsequent tasks\. To enhance the stability and reliability of the results and reduce the impact of randomness, each experiment is repeated three times, and the average performance is reported\. For all experiments, the model temperature is set to 0\.01\. All experiments are conducted on four NVIDIA A800 GPUs \(80 GB\)\. For the open\-source and medical LLMs, we deploy them using the vLLM framework \([https://github\.com/vllm\-project/vllm](https://github.com/vllm-project/vllm)\)\. For closed\-source LLMs, we use their official APIs for evaluation\. We also use the official API \([https://platform\.DeepSeek\.com/usage](https://platform.deepseek.com/usage)\) for DeepSeek\-V3 and DeepSeek\-R1, because their parameter sizes are too large to deploy locally\.
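The repeat\-and\-average protocol can be sketched as a small harness\. The temperature and repeat count are the values stated above; the function name and the shape of the per\-run result \(a metric\-name to score mapping\) are assumptions for illustration, and the model call itself is left to the caller\.

```python
from statistics import mean
from typing import Callable, Dict

TEMPERATURE = 0.01  # value stated in the text
N_REPEATS = 3       # value stated in the text


def averaged_scores(
    run_once: Callable[[float, int], Dict[str, float]],
    n_repeats: int = N_REPEATS,
    temperature: float = TEMPERATURE,
) -> Dict[str, float]:
    """Run the evaluation n_repeats times and average each metric."""
    runs = [run_once(temperature, i) for i in range(n_repeats)]
    return {metric: mean(r[metric] for r in runs) for metric in runs[0]}
```

Here `run_once` receives the temperature and the repeat index and returns one set of scores, so the same harness works whether the model is served locally or through an API\.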

### A.3 LLM Evaluation

In this section, we primarily use GPT-4 to evaluate the performance of LLMs on ClinicalMC. The evaluation covers the preliminary diagnosis basis, differential diagnosis, first treatment plan, assessment and treatment in the subsequent courses, and final diagnosis basis. To account for potential instability in GPT-4’s responses, we conduct three evaluations for each model on each benchmark and calculate the average score. The specific prompts used are shown in Fig. [5](https://arxiv.org/html/2606.03157#A1.F5). The experimental results on Chinese data and English data are shown in Table [6](https://arxiv.org/html/2606.03157#A1.T6) and Table [5](https://arxiv.org/html/2606.03157#A1.T5), respectively. The results show that DeepSeek-V3 performs the best on both Chinese and English data. Specifically, DeepSeek-V3 achieves a Total score of 9.36 on English data and 9.46 on Chinese data.
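Averaging three judge runs per response can be sketched as follows. The `judge` callable is a hypothetical wrapper around the GPT-4 evaluation prompt, and the dimension names are taken from the Table 5 caption. How the Total score is derived from the dimensions is not stated here, so it is left out.

```python
from statistics import mean
from typing import Callable, Dict

# Dimension names follow the Table 5 caption.
DIMENSIONS = ["comprehensiveness", "professionalism", "authenticity"]

# Hypothetical judge interface: (model response, reference answer) -> score per dimension.
Judge = Callable[[str, str], Dict[str, float]]


def average_judge_scores(
    judge: Judge, response: str, reference: str, runs: int = 3
) -> Dict[str, float]:
    """Query the judge several times and average each dimension over the runs."""
    results = [judge(response, reference) for _ in range(runs)]
    return {dim: mean(r[dim] for r in results) for dim in DIMENSIONS}
```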

![Refer to caption](https://arxiv.org/html/2606.03157v1/x5.png)
Figure 5: Prompt Template of GPT-4 Evaluation.

Table 6: The evaluation of LLM performance on ClinicalMC Chinese data using GPT-4.

### A.4 Human Evaluation

In this section, to evaluate the quality and accuracy of the models’ decision results, we invite three medical experts with over ten years of clinical experience to perform a manual evaluation. We randomly select 50 Chinese and 50 English EHRs, and anonymize each EHR so that the evaluators cannot identify the model used. Each EHR is evaluated by two different experts in a double-blind cross-assessment setup. The evaluators score the decision results on four dimensions: comprehensiveness, professionalism, authenticity, and safety. The scoring criteria align with the LLM evaluation standards outlined in Appendix [A.3](https://arxiv.org/html/2606.03157#A1.SS3). The manual evaluation results on the Chinese and English datasets are shown in Table [8](https://arxiv.org/html/2606.03157#A1.T8) and Table [7](https://arxiv.org/html/2606.03157#A1.T7), respectively. The tables show that the DeepSeek-V3 model performs the best, which is consistent with the ranking obtained from the LLM evaluation in Appendix [A.3](https://arxiv.org/html/2606.03157#A1.SS3). Specifically, the DeepSeek-V3.2 model achieves a Total score of 9.00 on the English data and 8.75 on the Chinese data.
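One way to assign two of the three experts to every record while keeping workloads balanced is to cycle through all expert pairs. This is a sketch under assumptions: the round-robin scheme, the function name, and the identifiers are illustrative and are not described in the paper.

```python
import itertools
import random
from typing import Dict, List, Sequence, Tuple


def assign_reviewers(
    record_ids: Sequence[str],
    experts: Sequence[str],
    per_record: int = 2,
    seed: int = 0,
) -> Dict[str, Tuple[str, ...]]:
    """Assign `per_record` distinct experts to each record.

    Records are shuffled first so the pair a record receives does not depend
    on its original order, then expert pairs are handed out round-robin.
    """
    pairs: List[Tuple[str, ...]] = list(itertools.combinations(experts, per_record))
    ids = list(record_ids)
    random.Random(seed).shuffle(ids)
    return {rid: pairs[i % len(pairs)] for i, rid in enumerate(ids)}
```

With three experts and 100 records, each expert reviews 66 or 67 records under this scheme.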

Table 7: Human evaluation of LLM performance on ClinicalMC English dataset, with a maximum score of 10 points.

Table 8: The human evaluation of LLM performance on ClinicalMC Chinese data.

Table 9: Results under the Multi-turn Experimental Setting on Chinese and English Data (%).

| Model | T_Acc | E_Recall | PD_F1 | PB_Score | DD_Score | TP_IoU | CE_Recall | CA_IoU | CT_IoU | FD_F1 | FB_Score | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **English** | | | | | | | | | | | | |
| Apollo2-7B | 55.00 | 47.50 | 27.06 | 53.00 | 31.40 | 4.74 | 33.35 | 23.70 | 1.62 | 8.45 | 45.80 | 30.15 |
| Mixtral-8x22B | 47.00 | 43.83 | 26.91 | 68.60 | 43.80 | 10.26 | 38.42 | 35.26 | 2.85 | 24.83 | 63.00 | 36.80 |
| Qwen-turbo | 56.00 | 45.28 | 29.13 | 74.00 | 44.60 | 11.25 | 49.73 | 39.46 | 2.65 | 27.67 | 67.00 | 40.62 |
| **Chinese** | | | | | | | | | | | | |
| HuatuoGPT2-34B | 44.86 | 62.78 | 28.34 | 66.36 | 33.83 | 5.48 | 29.24 | 59.17 | 0.79 | 0.00 | 68.79 | 36.33 |
| Qwen2.5-7B | 59.81 | 62.86 | 35.13 | 69.35 | 35.14 | 5.56 | 20.38 | 61.21 | 1.17 | 15.74 | 66.36 | 39.34 |
| GPT-4o-mini | 63.55 | 50.97 | 24.88 | 69.72 | 40.56 | 4.36 | 23.67 | 64.71 | 1.37 | 16.03 | 62.43 | 38.39 |
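The Avg column of Table 9 is numerically consistent with an unweighted mean of the eleven metric columns. The check below uses the Apollo2-7B row. The function name is illustrative, and the averaging rule is inferred from the numbers rather than stated in the text.

```python
# Metric columns of Table 9, in order (Avg excluded).
METRICS = [
    "T_Acc", "E_Recall", "PD_F1", "PB_Score", "DD_Score", "TP_IoU",
    "CE_Recall", "CA_IoU", "CT_IoU", "FD_F1", "FB_Score",
]

# Apollo2-7B row of Table 9 (English, multi-turn setting).
apollo2_7b = dict(zip(
    METRICS,
    [55.00, 47.50, 27.06, 53.00, 31.40, 4.74, 33.35, 23.70, 1.62, 8.45, 45.80],
))


def average_score(row: dict) -> float:
    """Unweighted mean over the eleven metrics, rounded to two decimals."""
    return round(sum(row[m] for m in METRICS) / len(METRICS), 2)


print(average_score(apollo2_7b))  # 30.15, matching the Avg column
```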

Table 10: Statistics of the Number of Courses in the Chinese and English Datasets.

### A.5 Evaluation in Multi-turn Dynamic Environment

To evaluate model performance in dynamic environments, we use the responses generated by the models in previous tasks as input for subsequent stages of the clinical workflow, rather than relying on ground-truth answers. Based on the results of the static evaluation, we select representative models for comparison: HuatuoGPT2-34B, Qwen2.5-7B, and GPT-4o-mini for the Chinese dataset, and Apollo2-7B, Mixtral-8x22B, and Qwen-turbo for the English dataset. Because disease progression in dynamic settings is inherently uncertain, and task sequence lengths are therefore unpredictable, we randomly sample 100 cases for experimentation. During evaluation, each stage’s output is compared against the gold standard. If an error occurs in any previous task, all subsequent tasks for that case are marked as invalid, simulating the cascading effect of errors in real-world applications. The experimental results are summarized in Table [9](https://arxiv.org/html/2606.03157#A1.T9). On the English dataset, all models decline in the dynamic setting, primarily because early-stage errors propagate downstream and degrade later decisions: Apollo2-7B, Mixtral-8x22B, and Qwen-turbo show performance drops of 13.77%, 13%, and 9.06%, respectively, compared to static evaluation. On the Chinese dataset we observe the opposite trend, with performance improving slightly under dynamic evaluation: HuatuoGPT2-34B, Qwen2.5-7B, and GPT-4o-mini achieve gains of 4.26%, 1.07%, and 3.55%, respectively. This discrepancy can be attributed to differences in data distribution and task complexity between the Chinese and English settings.

Specifically, Chinese cases tend to involve shorter clinical trajectories and more concise information chains, so the context generated in earlier turns is more likely to serve as a complementary cue for subsequent reasoning. In contrast, the English dataset generally features longer disease courses and more complex cases, where errors introduced in earlier stages are more prone to accumulate and propagate, leading to more pronounced performance degradation.
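The cascading rule described above (once an earlier task is wrong, later tasks of the same case are invalid) can be sketched as below. The `score` and `is_error` callables are hypothetical. How invalid tasks enter the aggregated metrics is not specified in the text, so they are simply returned as `None`.

```python
from typing import Callable, Dict, List, Optional


def score_with_cascade(
    tasks: List[str],
    predictions: Dict[str, str],
    gold: Dict[str, str],
    score: Callable[[str, str], float],  # hypothetical per-task metric
    is_error: Callable[[float], bool],   # hypothetical rule deciding what counts as an error
) -> Dict[str, Optional[float]]:
    """Score tasks in workflow order; after the first error, mark the rest invalid."""
    results: Dict[str, Optional[float]] = {}
    failed = False
    for task in tasks:
        if failed:
            results[task] = None  # invalid because an earlier task was wrong
            continue
        value = score(predictions[task], gold[task])
        results[task] = value
        failed = is_error(value)
    return results
```

A short usage example with exact-match scoring:

```python
tasks = ["triage", "diagnosis", "treatment"]
gold = {"triage": "ER", "diagnosis": "pneumonia", "treatment": "antibiotics"}
pred = {"triage": "ER", "diagnosis": "flu", "treatment": "antibiotics"}
exact = lambda p, g: 1.0 if p == g else 0.0
print(score_with_cascade(tasks, pred, gold, exact, lambda s: s < 1.0))
# {'triage': 1.0, 'diagnosis': 0.0, 'treatment': None}
```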

![Refer to caption](https://arxiv.org/html/2606.03157v1/x6.png)
Figure 6: The performance of Examination Recall, Assessment Score, and Treatment Score for each course in the Chinese multi-course dataset.

![Refer to caption](https://arxiv.org/html/2606.03157v1/x7.png)
Figure 7: The performance of Examination Recall, Assessment Score, and Treatment Score for each course in the English multi-course dataset.

### A.6 Analysis of Course Quantity Effects on LLM Performance

In this section, to analyze the impact of course quantity on LLM performance, we first compile statistics on the data corresponding to different numbers of courses in both the Chinese and English datasets, as shown in Table [10](https://arxiv.org/html/2606.03157#A1.T10). The table reveals an imbalance in course distribution. We therefore select, from each dataset, a subset that balances a relatively high number of courses against a sufficient data volume. Specifically, for the Chinese data we analyze cases with 3 courses, with results shown in Fig. [6](https://arxiv.org/html/2606.03157#A1.F6). For the English data we analyze cases with 6 courses, with results shown in Fig. [7](https://arxiv.org/html/2606.03157#A1.F7). The results show that, in both datasets, as the number of courses increases, the performance of most LLMs on the examination recommendation and treatment planning tasks gradually declines, while their performance on the assessment task improves. This is primarily because, as courses accumulate, the patient’s medical history becomes longer and more complex, which may lead to redundant examinations or treatment plans and thereby weaken the model’s decisions about the patient’s current progress. In the assessment task, however, the accumulated courses help the model better evaluate the patient’s condition.

### A.7 Analysis of Different Examiner Models

In this section, we evaluate the impact of the backbone models used by the patient and examiner agents on the performance of doctor agents. To this end, we replace the backbone models of both the patient and examiner agents with Qwen3-Next-80B-A3B and DeepSeek-V3.2-Chat, respectively. During the experiments, we keep the prompt templates and EHRs strictly unchanged and re-evaluate the baseline doctor models under this setting. Note that, within the SimHospital framework, both the patient and examiner agents are strictly constrained by structured medical records and standardized test results. Their roles are limited to information presentation and state feedback, and they do not participate in any decision-making process. Such replacements are therefore intended solely to assess the sensitivity of the evaluation framework to different backbone models, without altering the underlying decision logic of the task. The results on the English dataset using Qwen3-Next-80B-A3B and DeepSeek-V3.2-Chat as examiners are reported in Table [11](https://arxiv.org/html/2606.03157#A1.T11) and Table [12](https://arxiv.org/html/2606.03157#A1.T12), respectively. Correspondingly, the results on the Chinese dataset are shown in Table [13](https://arxiv.org/html/2606.03157#A1.T13) and Table [14](https://arxiv.org/html/2606.03157#A1.T14). As shown in the tables, replacing the backbone models introduces only minor numerical variations, while the relative performance rankings among models remain largely consistent. Moreover, no systematic bias toward any specific model is observed. These findings indicate that the benchmark is stable and robust across different backbone model configurations.
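One way to quantify how well rankings are preserved across examiner backbones is a rank correlation over the Avg columns. The sketch below computes Spearman's coefficient on a six-model subset, with Avg values copied from Tables 11 and 12. The choice of metric and of subset is an illustrative assumption, not an analysis from the paper, and the simple formula used assumes there are no tied values.

```python
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[int]:
    """Rank values in descending order (1 = best). Assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation via 1 - 6*sum(d^2) / (n*(n^2 - 1)); no-ties formula."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


models = ["Apollo2-7B", "Baichuan-M2", "MedGemma", "GPT-4o-mini", "GPT5-mini", "Qwen-turbo"]
# Avg column, English data, Qwen3-Next-80B-A3B as examiner (Table 11).
qwen_examiner_avg = [41.04, 45.43, 45.30, 47.65, 45.80, 47.16]
# Avg column, English data, DeepSeek-V3.2-Chat as examiner (Table 12).
deepseek_examiner_avg = [38.81, 44.50, 43.50, 46.09, 43.87, 46.82]

rho = spearman(qwen_examiner_avg, deepseek_examiner_avg)
print(f"Spearman rho on {len(models)} models: {rho:.3f}")
```

A value close to 1 indicates that the two examiner backbones order the doctor models similarly.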

Table 11: Evaluation results of baseline models on English data assessed by Qwen3-Next-80B-A3B as the examiner model.

| Model | T_Acc | E_Recall | PD_F1 | PB_Score | DD_Score | TP_IoU | CE_Recall | CA_IoU | CT_IoU | FD_F1 | FB_Score | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **Medical LLMs** | | | | | | | | | | | | |
| Apollo2-7B | 60.64 | 33.19 | 28.48 | 60.37 | 40.49 | 5.82 | 33.27 | 57.62 | 3.94 | 46.95 | 80.63 | 41.04 |
| Asclepius-Llama2-13B | 0.04 | 0.00 | 0.00 | 42.12 | 36.34 | 1.26 | 0.00 | 21.32 | 1.12 | 0.00 | 0.00 | 9.29 |
| Asclepius-Llama2-7B | 0.04 | 0.00 | 0.00 | 42.12 | 36.34 | 1.26 | 0.00 | 21.32 | 1.12 | 0.00 | 0.00 | 9.29 |
| Baichuan-M2 | 61.80 | 20.05 | 25.51 | 76.40 | 56.01 | 9.52 | 21.11 | 54.49 | 6.31 | 82.53 | 86.02 | 45.43 |
| MedGemma | 62.88 | 20.75 | 21.37 | 77.06 | 64.26 | 10.20 | 24.16 | 52.56 | 6.73 | 71.35 | 87.02 | 45.30 |
| HuatuoGPT-o1-7B | 59.11 | 29.68 | 19.96 | 60.77 | 52.71 | 8.07 | 27.33 | 61.58 | 5.15 | 76.19 | 81.41 | 43.81 |
| **Open-source LLMs** | | | | | | | | | | | | |
| Llama-3.3-70B | 62.95 | 14.16 | 22.71 | 74.68 | 51.17 | 10.59 | 15.55 | 64.73 | 6.23 | 79.44 | 86.18 | 44.40 |
| Llama-3.2-3B | 49.16 | 13.93 | 22.83 | 54.84 | 48.70 | 6.88 | 18.62 | 51.28 | 8.64 | 78.90 | 83.10 | 39.72 |
| Mistral-7B | 37.26 | 22.78 | 19.42 | 67.97 | 47.59 | 9.15 | 27.95 | 59.60 | 5.94 | 71.75 | 82.75 | 41.11 |
| Mixtral-8x22B | 60.52 | 19.04 | 33.93 | 72.90 | 49.69 | 12.51 | 21.72 | 63.09 | 7.71 | 86.96 | 84.13 | 46.56 |
| Falcon3-7B | 51.34 | 22.14 | 21.82 | 61.95 | 48.39 | 7.76 | 23.46 | 59.31 | 4.77 | 63.25 | 85.83 | 40.91 |
| Qwen2.5-72B | 63.48 | 16.71 | 27.23 | 76.34 | 47.34 | 10.03 | 18.48 | 58.62 | 5.65 | 84.56 | 85.64 | 44.92 |
| Qwen2.5-32B | 63.28 | 18.01 | 18.24 | 74.95 | 54.55 | 10.22 | 20.76 | 66.12 | 6.15 | 80.70 | 84.41 | 45.22 |
| Qwen2.5-14B | 64.40 | 19.28 | 25.76 | 74.86 | 48.13 | 9.41 | 19.56 | 64.24 | 5.82 | 84.30 | 84.45 | 45.47 |
| Qwen2.5-7B | 59.90 | 24.97 | 23.33 | 64.40 | 47.40 | 9.31 | 23.63 | 61.73 | 6.29 | 77.68 | 85.60 | 44.02 |
| Qwen3-Next-80B-A3B | 63.17 | 20.61 | 27.90 | 81.97 | 60.99 | 10.80 | 23.37 | 42.85 | 5.33 | 69.49 | 86.30 | 44.80 |
| DeepSeek-V3.2-Chat | 52.33 | 22.74 | 28.25 | 78.78 | 57.25 | 12.11 | 22.76 | 57.25 | 5.59 | 70.62 | 86.80 | 44.95 |
| DeepSeek-V3.2-Reason | 56.67 | 18.90 | 25.19 | 81.00 | 61.17 | 13.58 | 19.56 | 57.46 | 7.12 | 67.32 | 87.08 | 45.00 |
| **Closed-source LLMs** | | | | | | | | | | | | |
| GPT-4o-mini | 64.30 | 19.22 | 35.96 | 72.10 | 54.24 | 11.74 | 22.06 | 62.98 | 6.35 | 93.39 | 81.84 | 47.65 |
| GPT5-mini | 60.99 | 21.44 | 28.89 | 86.73 | 60.74 | 12.57 | 24.90 | 62.74 | 6.23 | 51.59 | 86.98 | 45.80 |
| Qwen-turbo | 63.90 | 29.73 | 31.96 | 71.50 | 50.92 | 10.31 | 26.65 | 56.08 | 6.08 | 87.42 | 84.24 | 47.16 |

Table 12: Evaluation results of baseline models on English data assessed by DeepSeek-V3.2-Chat as the examiner model.

| Model | T_Acc | E_Recall | PD_F1 | PB_Score | DD_Score | TP_IoU | CE_Recall | CA_IoU | CT_IoU | FD_F1 | FB_Score | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **Medical LLMs** | | | | | | | | | | | | |
| Apollo2-7B | 60.74 | 20.59 | 28.40 | 60.57 | 40.92 | 5.92 | 20.73 | 57.57 | 3.96 | 46.76 | 80.70 | 38.81 |
| Asclepius-Llama2-13B | 0.02 | 0.00 | 0.00 | 32.23 | 36.20 | 1.44 | 0.00 | 21.62 | 1.02 | 0.00 | 0.00 | 8.41 |
| Asclepius-Llama2-7B | 0.02 | 0.00 | 0.00 | 32.06 | 37.34 | 1.42 | 0.00 | 22.57 | 1.04 | 0.00 | 0.00 | 8.59 |
| MedGemma | 62.86 | 11.91 | 21.38 | 76.75 | 64.19 | 10.07 | 13.76 | 52.61 | 6.74 | 71.31 | 86.93 | 43.50 |
| Baichuan-M2 | 61.88 | 15.76 | 25.52 | 76.46 | 55.83 | 9.48 | 15.17 | 54.49 | 6.35 | 82.42 | 86.13 | 44.50 |
| HuatuoGPT-o1-7B | 59.11 | 15.76 | 20.06 | 60.62 | 52.69 | 8.09 | 16.57 | 61.41 | 4.98 | 76.09 | 81.45 | 41.53 |
| **Open-source LLMs** | | | | | | | | | | | | |
| Llama-3.3-70B | 62.37 | 9.23 | 24.01 | 74.50 | 49.84 | 10.72 | 12.00 | 64.09 | 6.23 | 6.23 | 84.89 | 36.74 |
| Llama-3.2-3B | 49.73 | 9.74 | 23.33 | 54.77 | 48.07 | 7.17 | 13.65 | 51.17 | 8.67 | 8.67 | 83.25 | 32.57 |
| Mistral-7B | 33.33 | 10.74 | 29.05 | 68.44 | 45.44 | 10.09 | 16.87 | 61.52 | 5.91 | 78.36 | 82.22 | 40.18 |
| Mixtral-8x22B | 59.87 | 11.32 | 33.53 | 72.97 | 49.28 | 13.27 | 16.22 | 63.47 | 7.75 | 7.75 | 84.19 | 38.15 |
| Falcon3-7B | 52.88 | 14.07 | 21.98 | 62.71 | 47.90 | 8.08 | 15.28 | 59.95 | 4.69 | 4.69 | 87.10 | 34.48 |
| Qwen2.5-72B | 63.55 | 9.24 | 27.10 | 76.08 | 47.28 | 10.08 | 12.21 | 58.65 | 5.60 | 5.60 | 85.65 | 36.46 |
| Qwen2.5-32B | 63.26 | 8.87 | 18.28 | 74.92 | 54.33 | 10.22 | 9.57 | 66.17 | 6.13 | 80.56 | 84.41 | 43.34 |
| Qwen2.5-14B | 64.39 | 10.41 | 25.67 | 74.94 | 48.02 | 9.30 | 9.22 | 64.29 | 5.82 | 84.19 | 84.45 | 43.70 |
| Qwen2.5-7B | 60.35 | 15.52 | 20.16 | 66.23 | 51.64 | 8.96 | 15.18 | 61.86 | 6.39 | 6.39 | 84.35 | 36.09 |
| Qwen3-Next-80B-A3B | 63.23 | 11.46 | 27.87 | 81.80 | 60.70 | 10.81 | 14.66 | 42.87 | 5.41 | 69.76 | 86.33 | 43.17 |
| DeepSeek-V3.2-Chat | 53.43 | 14.60 | 28.24 | 78.65 | 56.65 | 12.27 | 14.56 | 57.87 | 5.72 | 5.72 | 86.60 | 37.66 |
| DeepSeek-V3.2-Reason | 57.22 | 9.72 | 25.88 | 81.28 | 60.67 | 13.89 | 10.91 | 57.07 | 7.27 | 66.92 | 87.08 | 43.45 |
| **Closed-source LLMs** | | | | | | | | | | | | |
| GPT-4o-mini | 59.26 | 11.21 | 37.57 | 72.04 | 58.59 | 11.63 | 17.54 | 62.46 | 6.52 | 89.32 | 80.82 | 46.09 |
| GPT5-mini | 57.78 | 9.74 | 29.16 | 87.33 | 60.33 | 11.45 | 13.63 | 66.76 | 4.28 | 55.15 | 87.00 | 43.87 |
| Qwen-turbo | 62.90 | 28.63 | 31.29 | 71.01 | 51.03 | 9.84 | 27.17 | 55.87 | 6.09 | 86.92 | 84.24 | 46.82 |

Table 13: Evaluation results of baseline models on Chinese data assessed by Qwen3-Next-80B-A3B as the examiner model.

| Model | T_Acc | E_Recall | PD_F1 | PB_Score | DD_Score | TP_IoU | CE_Recall | CA_IoU | CT_IoU | FD_F1 | FB_Score | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **Medical LLMs** | | | | | | | | | | | | |
| Apollo2-7B | 56.63 | 31.63 | 34.59 | 70.98 | 42.84 | 7.14 | 16.05 | 59.36 | 2.21 | 70.16 | 76.66 | 42.57 |
| HuatuoGPT2-7B | 39.92 | 26.43 | 5.70 | 60.89 | 41.07 | 4.83 | 12.68 | 57.02 | 2.05 | 21.27 | 70.09 | 31.09 |
| HuatuoGPT2-13B | 61.57 | 33.64 | 0.00 | 62.42 | 38.60 | 6.47 | 17.27 | 60.36 | 2.24 | 3.86 | 70.81 | 32.48 |
| HuatuoGPT2-34B | 46.98 | 30.30 | 31.07 | 69.41 | 36.75 | 7.07 | 16.53 | 63.24 | 3.03 | 65.36 | 77.84 | 40.69 |
| MedGemma | 65.33 | 13.28 | 29.93 | 77.15 | 47.34 | 6.66 | 9.95 | 79.29 | 2.88 | 58.56 | 83.65 | 43.09 |
| HuatuoGPT-o1-7B | 67.14 | 27.15 | 32.25 | 71.18 | 38.95 | 5.35 | 11.32 | 68.78 | 2.76 | 60.99 | 77.05 | 42.08 |
| Baichuan-M2 | 53.65 | 25.74 | 33.94 | 78.87 | 46.65 | 5.32 | 13.26 | 77.71 | 2.60 | 65.97 | 81.40 | 44.10 |
| **Open-source LLMs** | | | | | | | | | | | | |
| Llama-3.3-70B | 64.94 | 17.25 | 29.61 | 73.55 | 37.30 | 5.59 | 9.42 | 70.52 | 2.33 | 68.27 | 84.85 | 42.15 |
| Llama-3.2-3B | 40.39 | 9.54 | 12.56 | 55.84 | 35.29 | 3.84 | 6.86 | 46.11 | 1.45 | 53.61 | 59.84 | 29.58 |
| Mistral-7B | 40.31 | 18.05 | 21.87 | 61.69 | 30.78 | 3.80 | 9.78 | 54.16 | 2.05 | 63.50 | 64.64 | 33.69 |
| Mixtral-8x22B | 62.90 | 28.49 | 27.37 | 70.12 | 31.55 | 5.97 | 13.99 | 70.67 | 2.43 | 67.39 | 80.22 | 41.92 |
| Falcon3-7B | 52.16 | 14.03 | 19.36 | 60.93 | 40.05 | 3.09 | 7.28 | 51.61 | 1.86 | 63.46 | 70.18 | 34.91 |
| Qwen2.5-72B | 65.02 | 31.51 | 36.43 | 78.56 | 53.41 | 8.58 | 14.99 | 76.17 | 2.91 | 67.67 | 83.75 | 47.18 |
| Qwen2.5-32B | 59.14 | 34.66 | 36.69 | 79.06 | 45.54 | 7.08 | 12.43 | 74.76 | 3.74 | 71.09 | 86.51 | 46.43 |
| Qwen2.5-14B | 64.16 | 24.17 | 37.60 | 78.76 | 41.62 | 6.57 | 11.48 | 74.18 | 2.86 | 69.49 | 80.53 | 44.67 |
| Qwen2.5-7B | 64.39 | 36.32 | 33.59 | 70.84 | 39.36 | 6.65 | 16.53 | 67.05 | 2.57 | 71.05 | 72.88 | 43.75 |
| Qwen3-Next-80B-A3B | 67.69 | 22.91 | 31.28 | 81.40 | 60.56 | 6.24 | 9.31 | 83.68 | 2.93 | 63.56 | 87.33 | 46.99 |
| DeepSeek-v3.2-Chat | 63.06 | 14.64 | 29.82 | 75.83 | 61.71 | 7.66 | 6.72 | 79.59 | 3.38 | 53.45 | 81.29 | 43.38 |
| DeepSeek-v3.2-Reason | 66.90 | 16.00 | 31.87 | 80.82 | 66.75 | 8.07 | 6.62 | 84.45 | 3.54 | 56.49 | 86.46 | 46.18 |
| **Closed-source LLMs** | | | | | | | | | | | | |
| GPT-4o-mini | 63.38 | 16.98 | 28.55 | 71.46 | 40.56 | 5.75 | 8.40 | 71.26 | 2.51 | 71.72 | 83.76 | 42.21 |
| GPT-5-mini | 61.97 | 3.21 | 11.59 | 83.19 | 50.89 | 6.12 | 1.01 | 83.77 | 3.90 | 44.01 | 82.35 | 39.27 |
| Qwen-turbo | 53.99 | 31.63 | 26.61 | 79.72 | 53.52 | 6.10 | 7.78 | 73.77 | 3.16 | 74.86 | 83.10 | 44.93 |

Table 14: Evaluation results of baseline models on Chinese data assessed by DeepSeek-V3.2 as the examiner model.

| Model | T_Acc | E_Recall | PD_F1 | PB_Score | DD_Score | TP_IoU | CE_Recall | CA_IoU | CT_IoU | FD_F1 | FB_Score | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **Medical LLMs** | | | | | | | | | | | | |
| Apollo2-7B | 57.74 | 33.57 | 34.59 | 70.88 | 43.29 | 7.45 | 18.25 | 59.65 | 2.31 | 70.31 | 76.58 | 43.15 |
| HuatuoGPT2-7B | 40.63 | 29.06 | 5.43 | 60.94 | 41.52 | 5.00 | 19.50 | 56.74 | 2.24 | 22.42 | 70.93 | 32.22 |
| HuatuoGPT2-13B | 61.33 | 35.65 | 0.00 | 62.59 | 38.26 | 6.55 | 21.14 | 60.15 | 2.20 | 3.70 | 70.02 | 32.87 |
| HuatuoGPT2-34B | 45.18 | 29.60 | 29.42 | 66.49 | 35.14 | 6.30 | 18.42 | 60.74 | 2.75 | 62.36 | 75.37 | 39.25 |
| MedGemma | 65.49 | 16.49 | 30.15 | 77.22 | 47.67 | 7.08 | 11.02 | 79.10 | 2.95 | 58.71 | 83.87 | 43.61 |
| HuatuoGPT-o1-7B | 66.67 | 22.62 | 32.91 | 70.53 | 39.22 | 5.61 | 14.49 | 68.52 | 2.57 | 62.25 | 76.88 | 42.02 |
| Baichuan-M2 | 53.25 | 26.06 | 35.54 | 78.82 | 46.98 | 5.29 | 14.99 | 76.47 | 2.49 | 65.20 | 82.16 | 44.30 |
| **Open-source LLMs** | | | | | | | | | | | | |
| Llama-3.3-70B | 65.02 | 18.44 | 29.43 | 72.93 | 37.76 | 5.53 | 12.93 | 69.98 | 2.35 | 67.70 | 84.67 | 42.43 |
| Llama-3.2-3B | 41.41 | 14.19 | 12.45 | 55.91 | 34.68 | 4.05 | 8.21 | 45.78 | 1.54 | 53.10 | 60.64 | 30.18 |
| Mistral-7B | 40.63 | 18.62 | 22.09 | 61.68 | 30.23 | 3.75 | 10.95 | 53.93 | 1.89 | 63.68 | 64.06 | 33.77 |
| Mixtral-8x22B | 62.67 | 30.21 | 27.56 | 70.85 | 32.03 | 5.91 | 15.17 | 70.93 | 2.38 | 66.84 | 80.71 | 42.30 |
| Falcon3-7B | 51.84 | 19.12 | 19.52 | 60.11 | 39.39 | 3.19 | 15.81 | 52.93 | 1.86 | 62.52 | 70.46 | 36.07 |
| Qwen2.5-72B | 65.10 | 27.79 | 36.78 | 78.37 | 53.95 | 8.36 | 15.77 | 76.56 | 2.89 | 67.79 | 83.59 | 47.00 |
| Qwen2.5-32B | 59.06 | 31.12 | 36.78 | 79.07 | 44.74 | 7.31 | 13.05 | 74.60 | 3.53 | 70.87 | 86.57 | 46.06 |
| Qwen2.5-14B | 63.92 | 23.20 | 37.32 | 78.13 | 41.71 | 6.76 | 11.76 | 73.53 | 2.92 | 69.14 | 80.11 | 44.41 |
| Qwen2.5-7B | 64.39 | 31.76 | 33.42 | 70.56 | 38.92 | 6.74 | 16.78 | 67.41 | 2.46 | 71.05 | 73.11 | 43.33 |
| Qwen3-Next-80B-A3B | 68.08 | 21.97 | 31.57 | 82.10 | 60.80 | 6.12 | 10.13 | 84.74 | 2.94 | 63.74 | 88.30 | 47.32 |
| DeepSeek-v3.2-Chat | 66.20 | 21.46 | 31.35 | 80.55 | 65.63 | 8.00 | 11.95 | 85.18 | 3.46 | 56.75 | 86.93 | 47.04 |
| DeepSeek-v3.2-Reason | 66.98 | 21.58 | 32.06 | 80.85 | 65.32 | 7.95 | 11.62 | 85.14 | 3.68 | 56.62 | 86.10 | 47.08 |
| **Closed-source LLMs** | | | | | | | | | | | | |
| GPT-4o-mini | 63.38 | 19.52 | 28.88 | 71.55 | 40.00 | 5.75 | 10.95 | 71.21 | 2.07 | 71.93 | 84.32 | 42.69 |
| GPT-5-mini | 58.22 | 4.81 | 12.70 | 83.19 | 50.42 | 6.30 | 5.87 | 83.49 | 3.90 | 44.73 | 83.00 | 39.69 |
| Qwen-turbo | 52.58 | 30.76 | 27.67 | 78.03 | 52.86 | 5.84 | 9.72 | 72.17 | 2.83 | 75.30 | 83.47 | 44.66 |

### A.8 Error Case

In this section, we present Chinese and English error samples of LLMs on ClinicalMC. Both error examples come from the DeepSeek-V3 model. The Chinese error sample is shown in Fig. [8](https://arxiv.org/html/2606.03157#A1.F8). The English error sample is shown in Fig. [9](https://arxiv.org/html/2606.03157#A1.F9).

![Refer to caption](https://arxiv.org/html/2606.03157v1/x8.png)
Figure 8: Examples of the three error types for Chinese data in ClinicalMC. The incorrect rationale, # comments, and evidence are highlighted.

![Refer to caption](https://arxiv.org/html/2606.03157v1/x9.png)
Figure 9: Examples of the three error types for English data in ClinicalMC. The incorrect rationale, # comments, and evidence are highlighted.

### A.9 Prompt of SimHospital Framework

In this section, we provide a detailed description of the prompts for the three agents introduced in the SimHospital evaluation framework. The prompt for the doctor agent is shown in Fig. [10](https://arxiv.org/html/2606.03157#A1.F10). The prompt for the examiner agent is shown in Fig. [11](https://arxiv.org/html/2606.03157#A1.F11). The prompt for the patient agent is shown in Fig. [12](https://arxiv.org/html/2606.03157#A1.F12).

![Refer to caption](https://arxiv.org/html/2606.03157v1/x10.png)
Figure 10: Prompt of the doctor agent.

![Refer to caption](https://arxiv.org/html/2606.03157v1/x11.png)
Figure 11: Prompt of the examiner agent.

![Refer to caption](https://arxiv.org/html/2606.03157v1/x12.png)
Figure 12: Prompt of the patient agent.

### A.10 Prompt of ClinicalMC Annotation

In this section, we provide a detailed description of the prompts used during the ClinicalMC annotation process. To minimize hallucination during annotation, the model is explicitly instructed to “strictly extract the following information from the original medical records without adding, deleting, or modifying any content”. The prompt for data annotation is shown in Fig. [13](https://arxiv.org/html/2606.03157#A1.F13).

![Refer to caption](https://arxiv.org/html/2606.03157v1/x13.png)
Figure 13: Prompt for data annotation.

### A.11 Evaluation Prompts and ClinicalMC Examples

In this section, we present the evaluation prompts as well as example Chinese and English EHRs from ClinicalMC. The prompts used for evaluation are shown in Fig. [14](https://arxiv.org/html/2606.03157#A1.F14). The Chinese EHR is shown in Fig. [15](https://arxiv.org/html/2606.03157#A1.F15), and the English EHR is shown in Fig. [16](https://arxiv.org/html/2606.03157#A1.F16).

![Refer to caption](https://arxiv.org/html/2606.03157v1/x14.png)
Figure 14: Prompts for evaluating differential diagnosis, diagnostic basis, and assessment.

![Refer to caption](https://arxiv.org/html/2606.03157v1/x15.png)
Figure 15: Chinese EHR example.

![Refer to caption](https://arxiv.org/html/2606.03157v1/x16.png)
Figure 16: English EHR example.