Evaluating Large Language Models in Dynamic Clinical Decision-Making with Standardized Patient Cases
Summary
Researchers introduce MedSP1000, a 1,638-case interactive benchmark derived from standardized patient scenarios to evaluate LLMs as dynamic clinical agents across multi-turn encounters. Results show even the best model (GPT-5.5) completes only 60.4% of expert rubric items, suggesting current LLMs are not yet reliable enough for clinical practice.
Similar Articles
ClinicalMC: A Benchmark for Multi-Course Clinical Decision-Making with Large Language Models
ClinicalMC is a benchmark designed to evaluate large language models in multi-course clinical decision-making, featuring datasets in Chinese and English and a multi-agent evaluation framework.
MedGuideX: Internalizing Decision Logic from Executable Guidelines into Large Language Models for Clinical Reasoning
MedGuideX transforms clinical practice guidelines into executable decision logic to generate factual and counterfactual QA data for training medical LLMs, achieving a 10.28% relative improvement in average accuracy across clinical reasoning benchmarks.
MedAction: Towards Active Multi-turn Clinical Diagnostic LLMs
This paper introduces MedAction, a framework for training LLMs on active, multi-turn clinical diagnosis by simulating iterative test ordering and hypothesis updates. It presents a new dataset, MedAction-32K, and demonstrates state-of-the-art performance for open-source models on medical benchmarks.
A Multi-Domain Red Teaming Framework for Safety, Robustness, and Fairness Evaluation of Medical Large Language Models
This paper presents a multi-domain red teaming framework for evaluating safety, robustness, and fairness of medical LLMs across 690 clinically grounded scenarios. Results show that high aggregate accuracy can mask critical failures, and hybrid evaluation with clinician oversight is necessary for credible safety assessment.
MEDSYN: Benchmarking Multi-Evidence Synthesis in Complex Clinical Cases for Multimodal Large Language Models
MEDSYN is a multilingual multimodal benchmark for evaluating MLLMs on complex clinical cases with up to 7 distinct visual evidence types per case. The study reveals that while frontier models match human experts on differential diagnosis generation, all MLLMs show significant gaps in final diagnosis selection due to poor synthesis of heterogeneous clinical evidence.