Tag
ClinicalMC is a benchmark for evaluating large language models on multi-course clinical decision-making, with datasets in Chinese and English and a multi-agent evaluation framework.
This study examines how LLM-based AI raters score clinical AI outputs under different scoring protocols in complex type 2 diabetes pharmacotherapy, and finds that rubric-anchored scoring provides greater discriminative power than rubric-free scoring.
EHRBench is an automated and reliable benchmark for evaluating LLMs on clinical decision-making tasks using real-world electronic health records, covering nearly 1M QA items across diagnosis, treatment, and prognosis tasks.