CLExEval: A Human-in-the-Loop Framework for Qualitative Evaluation of LLM Clinical Reasoning

arXiv cs.CL 07/01/26, 04:00 AM Papers
Summary
CLExEval introduces a human-in-the-loop framework for evaluating LLM clinical reasoning under progressive information masking, revealing failure patterns such as verbosity bias, hidden knowledge paradox, and reasoning-to-output mismatch in models like GPT-4o-mini and HuatuoGPT-o1.
arXiv:2606.31608v1 Announce Type: new Abstract: Large Language Models (LLMs) achieve strong results on many medical benchmarks, but their clinical reasoning remains difficult to evaluate reliably. A central risk is an evaluation illusion: fluent and well-structured explanations can appear clinically convincing even when the final diagnosis is incorrect. We introduce CLExEval, a human-in-the-loop framework for evaluating LLM clinical reasoning under progressive information masking. CLExEval combines 5,600 expert-physician annotations with 200 clinical reasoning traces derived from 40 rare diagnostic cases. Our analysis identifies three recurring failure patterns: (i) verbosity bias, where GPT-4o-mini's diagnostic accuracy drops from 95.0% to 32.5% under information scarcity; (ii) a hidden knowledge paradox, where a specialist model reaches 92.5% maximum diagnostic potential but fails to retrieve that knowledge reliably in verbose contexts; and (iii) a 68.6% reasoning-to-output mismatch, where correct diagnoses appear in reasoning traces but are not reflected in final answers. We further evaluate the LLM-as-a-Judge paradigm on a human-verified failure set (n = 142). GPT-4o-mini approved 47.9% of clinically incorrect outputs, while HuatuoGPT-o1 approved all validly scored failures and showed a positive self-preference bias. These results suggest that standalone automated clinical evaluations can substantially overestimate clinical reliability without expert-grounded validation.
Original Article
View Cached Full Text
Cached at: 07/01/26, 05:34 AM
# A Human-in-the-Loop Frameworkfor Qualitative Evaluation of LLM Clinical Reasoning
Source: [https://arxiv.org/html/2606.31608](https://arxiv.org/html/2606.31608)
Ajmal M\.1,2Abin Roy3,∗Afthab Salam Kanniyan3,∗Jawadh Abdul Kabeer3,∗Jerin James3,∗ Preslav Nakov1Zhuohan Xie1 1MBZUAI2IIT Madras3Calicut Medical College ajmal\.m@mbzuai\.ac\.aezhuohan\.xie@mbzuai\.ac\.ae ∗Equal contribution\. [Project](https://24f2004489.github.io/CLExEval-Project-Page/)[RARECASE\-2000](https://huggingface.co/datasets/AjmalMIITM/RARECASE-2000)[Code](https://github.com/24f2004489/CLExEval)

###### Abstract

Large Language Models \(LLMs\) achieve strong results on many medical benchmarks, but their clinical reasoning remains difficult to evaluate reliably\. A central risk is anevaluation illusion: fluent and well\-structured explanations can appear clinically convincing even when the final diagnosis is incorrect\. We introduce CLExEval, a human\-in\-the\-loop framework for evaluating LLM clinical reasoning under progressive information masking\. CLExEval combines 5,600 expert\-physician annotations with 200 clinical reasoning traces derived from 40 rare diagnostic cases\. Our analysis identifies three recurring failure patterns: \(i\)verbosity bias, where GPT\-4o\-mini’s diagnostic accuracy drops from 95\.0% to 32\.5% under information scarcity; \(ii\) ahidden knowledge paradox, where a specialist model reaches 92\.5% maximum diagnostic potential but fails to retrieve that knowledge reliably in verbose contexts; and \(iii\) a 68\.6% reasoning\-to\-output mismatch, where correct diagnoses appear in reasoning traces but are not reflected in final answers\. We further evaluate the LLM\-as\-a\-Judge paradigm on a human\-verified failure set \(n=142n=142\)\. GPT\-4o\-mini approved 47\.9% of clinically incorrect outputs, while HuatuoGPT\-o1 approved all validly scored failures and showed a positive self\-preference bias\. These results suggest that standalone automated clinical evaluations can substantially overestimate clinical reliability without expert\-grounded validation\.

![[Uncaptioned image]](https://arxiv.org/html/2606.31608v1/robo.png)CLExEval: A Human\-in\-the\-Loop Frameworkfor Qualitative Evaluation of LLM Clinical ReasoningAjmal M\.1,2Abin Roy3,∗Afthab Salam Kanniyan3,∗Jawadh Abdul Kabeer3,∗Jerin James3,∗Preslav Nakov1Zhuohan Xie11MBZUAI2IIT Madras3Calicut Medical Collegeajmal\.m@mbzuai\.ac\.aezhuohan\.xie@mbzuai\.ac\.ae∗Equal contribution\.[Project](https://24f2004489.github.io/CLExEval-Project-Page/)[RARECASE\-2000](https://huggingface.co/datasets/AjmalMIITM/RARECASE-2000)[Code](https://github.com/24f2004489/CLExEval)

![Refer to caption](https://arxiv.org/html/2606.31608v1/Fig_1st_Page.png)Figure 1:Reasoning\-to\-output mismatch in a clinical case\.A HuatuoGPT\-o1\-8B example where the reasoning trace contains pyloric\-atresia cues, but the final answer commits to duodenal atresia\. Automated judges assign full credit \(1\.001\.00\), whereas human experts score the diagnosis as incorrect \(0\.000\.00\)\.## 1Introduction

Large Language Models \(LLMs\) have demonstrated broad capabilities across multiple tasks, including reasoning, generation, and domain\-specific applications\(Zhaoet al\.,[2023](https://arxiv.org/html/2606.31608#bib.bib50); Xieet al\.,[2023](https://arxiv.org/html/2606.31608#bib.bib51)\)\. These capabilities have accelerated interest in applying LLMs to clinical workflows, prompting a surge of benchmarks for evaluating medical reasoning, including interactive testing\(Chiuet al\.,[2025](https://arxiv.org/html/2606.31608#bib.bib10)\), structured reasoning tasks\(Wanget al\.,[2024](https://arxiv.org/html/2606.31608#bib.bib11)\), and dynamic information seeking\(Liet al\.,[2024](https://arxiv.org/html/2606.31608#bib.bib3)\)\. To scale evaluation, recent work increasingly adopts theLLM\-as\-a\-Judgeparadigm, relying on frontier models to automatically score clinical outputs\(Qiuet al\.,[2025](https://arxiv.org/html/2606.31608#bib.bib12); Zhanget al\.,[2025](https://arxiv.org/html/2606.31608#bib.bib7)\)\. Despite this progress, current evaluation pipelines have an important blind spot: theEvaluation Illusion\(Agrawalet al\.,[2025](https://arxiv.org/html/2606.31608#bib.bib2)\), where automated judges can reward fluent and well\-structured reasoning despite weak clinical validity\. As illustrated in Figure[1](https://arxiv.org/html/2606.31608#S0.F1), to simulate the LLM\-as\-a\-Judge paradigm, automated evaluators were provided with the explicit ground\-truth diagnosis alongside the model’s complete reasoning trace\. Because the reasoning appears confident and coherent, the automated judge assigns a perfect score to a clinically incorrect diagnosis, whereas the human expert identifies the diagnostic failure despite the fluent presentation\. Beyond this issue, existing benchmarks primarily focus on final diagnostic accuracy, but often do not explainwhyreasoning fails, whether due to missing knowledge, unstable retrieval, or misaligned decision\-making under information scarcity, which is inherent to real\-world clinical practice\.

To address these limitations, we introduceCLExEval, a human\-in\-the\-loop evaluation framework for clinical reasoning under uncertainty\. Built on the clinician\-curated RARECASE\-200 benchmark, CLExEval applies progressive information masking \(Levels 0–3\) to simulate realistic diagnostic scenarios where clinicians must reason with incomplete and evolving evidence\. This design prevents models from relying on surface\-level cues and enables controlled stress\-testing of reasoning behavior\. Combined with 5,600 expert annotations, CLExEval provides a fine\-grained lens to evaluate not only whether models fail, buthowandwhythose failures occur\. Using this framework, we show that common evaluation practices can overestimate model capability\. Automated judges frequently approve incorrect clinical reasoning in our consensus failure set, and many model failures arise not only from lack of knowledge, but also from misalignment between internal reasoning and final outputs, as well as sensitivity to context\.

Our contributions are as follows: \(i\)CLExEval, a human\-in\-the\-loop framework combining progressive information masking with expert annotation to evaluate clinical reasoning under uncertainty; \(ii\) a formalization of theEvaluation Illusion,Δ=Communication−Precision\\Delta=\\text\{Communication\}\-\\text\{Precision\}, and a Hallucination Approval Rate \(HAR\) evaluation of standalone LLM judges; and \(iii\) a mechanistic analysis using ROM, ISS, and MVR to separate knowledge deficits, reasoning\-output misalignment, and context sensitivity, showing that models may contain relevant latent knowledge but fail to express it reliably under information scarcity\.

## 2Related Work

Work / BenchmarkTask TypeInformation SettingEvaluation SignalMechanistic InsightEvaluatorStatic medical QA and exam\-style benchmarksMedQA\(Jinet al\.,[2021](https://arxiv.org/html/2606.31608#bib.bib8)\)Medical exam MCQAStatic / completeFinal\-answer accuracyNone; answer\-key scoringAutomaticMedMCQA\(Palet al\.,[2022](https://arxiv.org/html/2606.31608#bib.bib25)\)Multi\-subject medical MCQAStatic / completeFinal\-answer accuracyNone; answer\-key scoringAutomaticMMLU Clinical\(Hendryckset al\.,[2021](https://arxiv.org/html/2606.31608#bib.bib26)\)Broad medical knowledge QAStatic / completeFinal\-answer accuracyNone; answer\-key scoringAutomaticChen et al\.\(Chenet al\.,[2025](https://arxiv.org/html/2606.31608#bib.bib21)\)Challenging medical QA \+ explanationsStatic / completeMCQA accuracy \+ explanation metricsExplanation quality, limited degradation insightAutomatic \+ limited humanClinical reasoning, uncertainty, and physician\-grounded evaluationKanjee et al\.\(Kanjeeet al\.,[2023](https://arxiv.org/html/2606.31608#bib.bib22)\)NEJM diagnostic challengesStatic / completeDifferential diagnosis qualityFinal differential onlyPhysiciansCabral et al\.\(Cabralet al\.,[2024](https://arxiv.org/html/2606.31608#bib.bib23)\)Sequential clinical reasoningInformation accumulationR\-IDEA reasoning scoreReasoning documentation qualityPhysiciansMcCoy et al\.\(McCoyet al\.,[2025](https://arxiv.org/html/2606.31608#bib.bib24)\)Script Concordance TestingNew information addedClinician\-concordant Likert updatesProbabilistic updating under uncertaintyExpert panelBhasuran et al\.\(Bhasuranet al\.,[2026](https://arxiv.org/html/2606.31608#bib.bib19)\)Clinical causal reasoningStatic / completeFinal answer \+ reasoning qualityCausal reasoning levelMedical expertsEHR, agentic, and automated evaluation frameworksMEDALIGN\(Fleminget al\.,[2024](https://arxiv.org/html/2606.31608#bib.bib28)\)EHR instruction followingStatic EHR contextClinical correctness / preferenceOutput quality, not reasoning collapseCliniciansMedAgentBench\(Jianget al\.,[2025](https://arxiv.org/html/2606.31608#bib.bib29)\)EHR agent tasksInteractive tool useFinal API/action successTool\-use success, not reasoning traceRule\-basedMedR\-Bench\(Qiuet al\.,[2025](https://arxiv.org/html/2606.31608#bib.bib12)\)Clinical workflow reasoningStage\-wise interactiveEfficiency, factuality, completenessReasoning\-step qualityAgentic evaluatorHealthBench\(Aroraet al\.,[2025](https://arxiv.org/html/2606.31608#bib.bib14)\)Open\-ended health conversationsStatic / multi\-turnPhysician\-authored rubricsClinically meaningful response qualityRubric\-basedCARDBiomedBench
\(Bianchiet al\.,[2026](https://arxiv.org/html/2606.31608#bib.bib16)\)Biomedical research QAStatic / completeRQR \+ abstention safetySafety–accuracy trade\-offAutomated judgeCLExEval \(Ours\)Rare diagnostic reasoning tracesProgressive information masking7D expert rubric \+ judge auditROM, ISS, MVR, HAR, degradationΔ\\DeltaHuman panel, 5,600 annotationsTable 1:Positioning of CLExEval\.Unlike prior clinical LLM benchmarks, CLExEval combines progressive information masking with expert annotation to localize reasoning degradation, reasoning\-to\-output mismatch, and judge approval of clinically incorrect outputs\.### 2\.1Evolution of Clinical Task Design

Static medical QA benchmarks such as MedQA\(Jinet al\.,[2021](https://arxiv.org/html/2606.31608#bib.bib8)\), MedMCQA\(Palet al\.,[2022](https://arxiv.org/html/2606.31608#bib.bib25)\), MMLU clinical subsets\(Hendryckset al\.,[2021](https://arxiv.org/html/2606.31608#bib.bib26)\), MedicationQA\(Abachaet al\.,[2019](https://arxiv.org/html/2606.31608#bib.bib27)\), and recent licensing\-exam evaluations such as the Chinese NMLE study\(Wanget al\.,[2026](https://arxiv.org/html/2606.31608#bib.bib15)\)test medical knowledge using information\-complete inputs and fixed answer keys\. These benchmarks are useful for standardized comparison, but they primarily evaluate final\-answer selection or retrieval rather than open\-ended reasoning traces, evidence\-grounded verification\(Xieet al\.,[2025b](https://arxiv.org/html/2606.31608#bib.bib52)\), verifiable intermediate steps\(Xieet al\.,[2025a](https://arxiv.org/html/2606.31608#bib.bib53)\), diagnostic uncertainty, or failure modes under missing information\.Chenet al\.\([2025](https://arxiv.org/html/2606.31608#bib.bib21)\)extend this paradigm through JAMA Clinical Challenge and Medbullets, adding expert\-written explanations for challenging medical QA, but these tasks remain static evaluations of information\-complete cases\.

Subsequent work moves toward clinical reasoning and physician\-grounded evaluation\. Clinical studies have assessed LLMs on complex diagnostic cases using physician judgment, including NEJM clinicopathological conference cases\(Kanjeeet al\.,[2023](https://arxiv.org/html/2606.31608#bib.bib22)\)and R\-IDEA comparisons with attending physicians and residents\(Cabralet al\.,[2024](https://arxiv.org/html/2606.31608#bib.bib23)\)\. Other benchmarks probe uncertainty or interaction through Script Concordance Testing\(McCoyet al\.,[2025](https://arxiv.org/html/2606.31608#bib.bib24)\), causal laboratory\-test scenarios\(Bhasuranet al\.,[2026](https://arxiv.org/html/2606.31608#bib.bib19)\), dynamic viva voce examinations\(Chiuet al\.,[2025](https://arxiv.org/html/2606.31608#bib.bib10)\), proactive information seeking\(Liet al\.,[2024](https://arxiv.org/html/2606.31608#bib.bib3)\), diagnostic reasoning over clinical notes\(Wanget al\.,[2024](https://arxiv.org/html/2606.31608#bib.bib11)\), EHR\-derived questions\(Zhanget al\.,[2025](https://arxiv.org/html/2606.31608#bib.bib7)\), clinician\-generated EHR instructions\(Fleminget al\.,[2024](https://arxiv.org/html/2606.31608#bib.bib28)\), virtual EHR agent tasks\(Jianget al\.,[2025](https://arxiv.org/html/2606.31608#bib.bib29)\), and sequential clinical workflow evaluation\(Raoet al\.,[2026](https://arxiv.org/html/2606.31608#bib.bib17)\)\. These works move beyond simple final\-answer accuracy, but they generally evaluate static cases, information accumulation, tool\-use success, or overall reasoning quality rather than systematically measuring how the same case degrades as diagnostic evidence is progressively removed\.

Unlike prior frameworks, CLExEval pairs four\-level progressive information masking with a 5,600\-point expert audit, enabling within\-case analysis of whether failures arise from missing knowledge, instability under information scarcity, or reasoning\-to\-output mismatch\.

### 2\.2Evaluation and the Judge’s Illusion

Scalable evaluation through automated metrics or judge models carries its own risk: fluent, well\-structured responses can be over\-rewarded despite weak clinical validity, a pattern thatAgrawalet al\.\([2025](https://arxiv.org/html/2606.31608#bib.bib2)\)term the evaluation illusion\.Aroraet al\.\([2025](https://arxiv.org/html/2606.31608#bib.bib14)\)demonstrate the value of physician\-authored rubrics for open\-ended healthcare evaluation, while related work has documented knowledge–reasoning dissociation in clinical NLI\(Jullienet al\.,[2025](https://arxiv.org/html/2606.31608#bib.bib31)\), substantial variability in LLM\-judge agreement across NLP tasks\(Bavarescoet al\.,[2025](https://arxiv.org/html/2606.31608#bib.bib30)\), and misalignment between medical safety judges and human annotations on complex safety dimensions\(Diekmannet al\.,[2025](https://arxiv.org/html/2606.31608#bib.bib32)\)\.

Many recent frameworks still rely on automated metrics or judge models to scale evaluation, including MedR\-Bench\(Qiuet al\.,[2025](https://arxiv.org/html/2606.31608#bib.bib12)\), LLMEval\-Med\(Zhanget al\.,[2025](https://arxiv.org/html/2606.31608#bib.bib7)\), and CARDBiomedBench\(Bianchiet al\.,[2026](https://arxiv.org/html/2606.31608#bib.bib16)\)\. Such judgments can correlate poorly with clinical validity and mask flawed reasoning behaviors\(Sim and Chen,[2025](https://arxiv.org/html/2606.31608#bib.bib5); Bediet al\.,[2025](https://arxiv.org/html/2606.31608#bib.bib4)\); MedR\-Bench, for instance, shows that performance collapses when critical reasoning steps are omitted\(Qiuet al\.,[2025](https://arxiv.org/html/2606.31608#bib.bib12)\)\. Studies on pattern disruption\(Bediet al\.,[2025](https://arxiv.org/html/2606.31608#bib.bib4)\)and knowledge conflict\(Wuet al\.,[2025](https://arxiv.org/html/2606.31608#bib.bib6)\)further demonstrate that models degrade when they cannot rely on surface\-level cues\. Together, these limitations motivate CLExEval: an expert\-grounded framework that stress\-tests progressive masking, expert scoring, and judge\-failure analysis on the same matched diagnostic cases\.

## 3Methodology

We introduce CLExEval, a human\-in\-the\-loop evaluation framework designed to assess clinical reasoning under information scarcity\. Our approach combines progressive information masking with expert annotation to evaluate not only final diagnostic accuracy, but also the mechanisms underlying reasoning failures\. An overview of the pipeline is shown in Figure[2](https://arxiv.org/html/2606.31608#S3.F2)\.

![Refer to caption](https://arxiv.org/html/2606.31608v1/sections/Fig00001.png)Figure 2:Overview of the CLExEval pipeline, including case curation, progressive information masking, model evaluation, and expert assessment\.### 3\.1Dataset

We initially curate 500 diagnostically rare clinical narratives from the MultiCaRe dataset\(Nievas Offidani and Delrieux,[2023](https://arxiv.org/html/2606.31608#bib.bib1)\)to construct RARECASE\-2000, a resource of 2,000 progressively masked cases\. For the present deep mechanistic audit, we use 40 cases selected by four clinicians for complexity and suitability for reasoning evaluation\. Each case is standardized into a structured format containing key findings, diagnostic progression, and outcomes\. Rather than serving as a broad model leaderboard, CLExEval is designed as a depth\-oriented mechanistic audit: the 40 source cases are expanded through progressive masking into 200 matched evaluation instances and 5,600 expert scores, enabling within\-case analysis of reasoning degradation\. Further details on data curation and demographics are provided in Appendix[A](https://arxiv.org/html/2606.31608#A1)\.

### 3\.2Progressive Information Masking

We apply progressive information masking to simulate diagnostic uncertainty through clinician\-guided abstraction rather than random deletion\. For each case, clinicians identified the gold diagnosis and ranked the clinical cues most important for supporting that diagnosis or excluding major differentials\. Each case was then converted into five matched versions: ❶ Full Case, the complete original narrative; ❷ Level 0, with only the explicit diagnosis name removed; ❸ Level 1, with the strongest diagnosis\-supporting clue removed; ❹ Level 2, with the two strongest cues removed or abstracted; and ❺ Level 3, the most abstract version, where major diagnostic cues such as demographics, temporal details, or disease\-specific findings were removed or generalized when clinically relevant\. Across all levels, clinicians preserved grammatical coherence and the broad clinical scenario, yielding 200 matched evaluation instances \(40 cases×\\times5 levels\) for controlled analysis of reasoning under varying information availability\.

### 3\.3Model Evaluation

We evaluate all models using a unified prompting setup that requires structured reasoning outputs consisting of aThinkingsection followed by aFinal Response\. This design ensures comparability across models in terms of reasoning steps and final predictions\. We generate all outputs deterministically and anonymize them prior to evaluation\. Prompting details are provided in Appendix[B](https://arxiv.org/html/2606.31608#A2)\.

### 3\.4Human Evaluation and CLExEval Rubric

We introduce and develop the seven\-dimensional CLExEval rubric as part of this work\. We developed the rubric through iterative review of clinical literature, pilot annotation, and feedback from senior clinicians to separate surface fluency from clinical validity\. The final rubric assesses Diagnostic Precision, Differential Reasoning Quality, Evidence Integration/Grounding, Diagnostic Justification Depth, Completeness vs\. Overload, Clinical Plausibility/Soundness, and Communication/Interpretability\. Annotators score each dimension on a five\-point ordinal scale \(0\.00, 0\.25, 0\.50, 0\.75, 1\.00\), allowing them to distinguish clinically incorrect outputs, partially grounded reasoning, and fully clinically valid responses\. Full scoring definitions and calibration details are provided in Appendix[C](https://arxiv.org/html/2606.31608#A3)\.

### 3\.5Reasoning Metrics

To analyze reasoning behavior, we use three core metrics: Information Scarcity Sensitivity \(ISS\), Monotonicity Violation Rate \(MVR\), and Reasoning\-to\-Output Mismatch \(ROM\)\. ISS measures how much diagnostic accuracy drops as clinical information is progressively removed; higher ISS indicates greater sensitivity to information scarcity\. MVR captures non\-monotonic instability, i\.e\., cases where performance improves after information is removed; higher MVR indicates less stable reasoning\. ROM identifies failures where the correct diagnosis appears in the model’s reasoning trace but is omitted or changed in the final answer; higher ROM indicates greater reasoning\-to\-output misalignment\. We also report Max Diagnostic Accuracy \(MDA\), the best accuracy achieved across masking levels, as an estimate of latent diagnostic potential\.

We use several qualitative terms to interpret these metrics\. TheHidden Knowledge Paradoxrefers to cases where a model appears to possess the correct diagnosis but fails to use it reliably in the final answer\.Attention Dispersiondescribes cases where additional context distracts the model across competing cues, reducing diagnostic focus\. ACrossover Eventoccurs when the relative performance of two models reverses at a particular masking level\. Formal definitions are provided in Appendix[D](https://arxiv.org/html/2606.31608#A4)\.

## 4Results and Analysis

### 4\.1The Evaluation Illusion

![Refer to caption](https://arxiv.org/html/2606.31608v1/x1.png)Figure 3:The evaluation illusion\.Comparison of communication quality vs\. diagnostic precision\. The clustering in the top\-left quadrant \(the “Illusion Zone”\) highlights cases where models are highly fluent \(\>0\.8\>0\.8\) but diagnostically wrong \(<0\.5<0\.5\)\.As visualized in Figure[3](https://arxiv.org/html/2606.31608#S4.F3),Diagnostic Precisiondrops sharply at Level 3 \(0\.453\), whereas GPT\-4o\-mini’sCommunicationscore remains high \(0\.881\)\. We formalize this disconnect as the Illusion Gap \(Δ=Communication−Diagnostic Precision\\Delta=\\textit\{Communication\}\-\\textit\{Diagnostic Precision\}\), which identifies cases where fluent presentation coexists with weak diagnostic validity\. The correlation betweenCommunicationandDiagnostic Precisionis modest \(ρ=0\.482\\rho=0\.482\), suggesting that linguistic quality is an unreliable proxy for medical correctness\.

This pattern is consistent with a fluency bias in general\-purpose instruction\-following models\. When clinical information is sparse, GPT\-4o\-mini often continues to produce confident, well\-structured rationales even when the underlying diagnostic inference is incorrect\. Such outputs can be difficult for automated judges to penalize, because the same surface features that improve readability can also obscure clinical errors\. This motivates expert\-grounded evaluation rather than relying on surface fluency or standalone automated judges\.

### 4\.2Limitations of Automated Judges

The evaluation illusion also affects automated evaluation\. To test this, we constructed a consensus failure set from human expert annotations and measured whether LLM judges would incorrectly approve known clinical failures\.

#### 4\.2\.1The Consensus Failure Set

We constructed the consensus failure set from expert annotations to serve as a ground\-truth negative set\. Specifically, we isolated and aggregated evaluation instances across all dimensions and masking levels where human experts assigned a low score \(Human Score≤0\.25\\leq 0\.25\), resulting inn=142n=142unique consensus failures\. We then evaluated these outputs with four LLM judges \(GPT\-4o\-mini, HuatuoGPT\-o1\-8B, DeepSeek\-R1\-Distill\-Llama\-8B, and Llama\-3\.1\-8B\-Instruct\)\. Due to formatting failures and invalid judge responses, HuatuoGPT\-o1 and DeepSeek\-R1 produced valid scores for 138 and 141 cases, respectively\. Each judge was prompted with the full clinical case, the exact CLExEval rubric, the complete model output, and the gold\-standard diagnosis\. We define the Hallucination Approval Rate \(HAR\) as the percentage of these known failures that receive a passing judge score \(Judge Score≥0\.75\\geq 0\.75\) despite the judge having access to the gold\-standard diagnosis\.

#### 4\.2\.2Hallucination Approval Rate

Our analysis shows that automated judges frequently assign passing scores to human\-verified failures\. Contrary to the hypothesis that domain\-specific models would act as stricter evaluators, the specialized medical model had the highest HAR in this setting \(Table[2](https://arxiv.org/html/2606.31608#S4.T2)\)\.

Table 2:Hallucination approval rate\.Higher HAR indicates that an automated judge more often assigns passing scores to human\-verified clinical failures\.These results indicate that automated judges are not reliable enough to serve as standalone arbiters in clinical evaluation settings\. Even the judge with the lowest HAR in our experiment \(GPT\-4o\-mini\) approves nearly half of the consensus failures, suggesting that LLM\-based evaluation should be paired with expert human validation\. Detailed analysis is provided in Appendix[E](https://arxiv.org/html/2606.31608#A5)\.

### 4\.3Supporting Human Evaluation

To validate clinical correctness beyond automated metrics, we conducted expert evaluation of 5,600 scores across 40 cases and 5 masking levels\. Annotations show high inter\-rater reliability \(ICC = 0\.802\), indicating consistent expert judgment\.

#### 4\.3\.1Performance Disparity

The generalist GPT\-4o\-mini significantly outperformed the biomedically fine\-tuned HuatuoGPT across all dimensions \(Table[4](https://arxiv.org/html/2606.31608#S4.T4)\)\. GPT\-4o\-mini achieved a mean overall score of 0\.867 \(±\\pm0\.20\) vs\. HuatuoGPT’s 0\.699 \(±\\pm0\.25\) \(UU=5,472,088,p<0\.001p<0\.001,d=0\.98d=0\.98\)\. The largest disparity occurred inDifferential Reasoning\(d=1\.53d=1\.53\), where HuatuoGPT \(0\.552\) often fixated on single incorrect diagnoses early in reasoning chains\.

This disparity reflects aHidden Knowledge Paradoxcoupled withAttention Dispersion: when confronted with verbose, full\-context clinical narratives, the specialist model appears to be distracted by competing cues and fails to retrieve alternative diagnoses, despite reaching high maximum diagnostic potential elsewhere\. Conversely, GPT\-4o\-mini’s advantage in dimensions such asJustification DepthandCommunicationis partly attributable to instruction tuning for fluent and structured explanations\. The generalist model maintains a high baseline of structural coherence, generating comprehensive differential lists and plausible rationales even when the underlying clinical deduction is flawed\. This creates aGeneralist’s Illusion, where strong formatting and articulation can inflate perceived clinical capability\. Detailed dimensional profiles are provided in Appendix[F](https://arxiv.org/html/2606.31608#A6)\.

Table 3:Expert evaluation results\.Higher scores are better for all dimensions; shaded cells highlight the best\-performing model and strongest observed scores\.
Table 4:Stability and reasoning gap analysis\.Higher is better for accuracy, MSS, and MDA; lower is better for ISS, MVR, and hallucination\. Shading highlights metric direction: better values in blue/teal and worse instability values in red/orange\.

#### 4\.3\.2The Reasoning Gap

We identified a strong positive correlation betweenJustification DepthandDiagnostic Precision\(Spearmanρ=0\.692\\rho=0\.692\), confirming that diagnostic success is tightly coupled with causal explanation depth\. Notably, as visualized in Figure[4](https://arxiv.org/html/2606.31608#S4.F4), GPT\-4o\-mini’sJustification Depthnever dropped below 0\.50, whereas HuatuoGPT frequently scored 0\.25 \(Irrelevant\)\.

This divergence highlights a difference in how generalist and specialist models handle uncertainty\. GPT\-4o\-mini’s conversational instruction tuning encourages structurally complete and articulate rationales, even when the diagnostic conclusion is incorrect\. This creates a “Floor Effect,” where plausible explanations can mask clinical failures\. In contrast, HuatuoGPT is less likely to inflate its reasoning traces with generic clinical language; when it fails to retrieve the correct diagnosis, its reasoning structure more visibly degrades\. The generalist model’s tendency to synthesize coherent but shallow justifications therefore contributes to theEvaluation Illusion, making some errors harder to detect from surface presentation alone\.

![Refer to caption](https://arxiv.org/html/2606.31608v1/x2.png)Figure 4:The reasoning gap\.Scatter plot of Justification Depth vs\. Diagnostic Precision\. Note the “Floor Effect” \(blue points\) where GPT\-4o\-mini maintains high justification scores \(\>0\.5\>0\.5\) even when the diagnosis is incorrect, whereas HuatuoGPT \(red points\) correctly reflects its ignorance with low scores\.

### 4\.4Degradation & Paradoxes

To quantify the mechanisms behind the expert findings, we tracked performance stability across all 200 evaluation instances\. Table[4](https://arxiv.org/html/2606.31608#S4.T4)summarizes the stability profiles for both models\. We identified a strong dependency on explicit clinical descriptors\. GPT\-4o\-mini demonstrated a verbosity bias, with accuracy decreasing from 95\.0% in the Narrative setting to 32\.5% at Level 3, corresponding to a 62\.5 percentage\-point Information Scarcity Sensitivity \(ISS\)\. A two\-way ANOVA interaction analysis showed that GPT\-4o\-mini is more sensitive to information loss \(partialη2=0\.35\\eta^\{2\}=0\.35\) than HuatuoGPT \(partialη2=0\.24\\eta^\{2\}=0\.24\)\. This suggests that some diagnostic success depends on explicit descriptive cues rather than robust causal reasoning\.

Conversely, HuatuoGPT exhibited a retrieval failure pattern\. While its Narrative accuracy was lower \(77\.5%\), its Max Diagnostic Potential \(MDA\) was 92\.5%\. This reveals a Hidden Knowledge Paradox: the open\-source model can reach the correct diagnosis for most cases, but does not retrieve or express it reliably in full, noisy narratives\. At Level 2 \(Partial Information\), a crossover event occurred where GPT\-4o\-mini’s performance fell below the open\-source baseline\. This suggests that under higher uncertainty, the specialist model’s domain prior can provide relative stabilization\. We quantified stability using the Monotonicity Violation Rate \(MVR\)\. We found that 27\.5% of GPT\-4o\-mini’s correct answers at Level 3 were non\-monotonic recoveries, where the model failed withmoreinformation but succeeded withless\. This indicates Attention Dispersion, where verbose context may distract the model from retrieving knowledge it can otherwise express \(reasoning fingerprints are visualized in Appendix[G](https://arxiv.org/html/2606.31608#A7)\)\.

Despite architectural differences, both models exhibited identical hallucination rates \(24\.2%\) when stability failed, fabricating specific disease entities to fit the reduced context\. The accuracy degradation patterns and paradoxes observed above raise a critical question:*Do models fail because they lack knowledge, or because they possess knowledge but fail to commit to it?*To distinguish between these failure modes, we conducted a systematic Reasoning\-to\-Output Mismatch \(ROM\) analysis, examining whether the correct diagnosis appeared in the model’s internal reasoning trace despite being absent from the final output\.

### 4\.5Reasoning\-to\-Output Mismatch Analysis

To investigate the divergence between internal latent knowledge and external diagnostic output, we implemented the Reasoning\-to\-Output Mismatch \(ROM\) protocol defined\. For every diagnostic failure across all 200 evaluation instances \(40 cases×\\times5 masking levels\), we programmatically analyzed the model’s Thinking trace to determine whether the correct gold standard diagnosis was considered and subsequently discarded\.

##### Overall ROM Findings

Table[5](https://arxiv.org/html/2606.31608#S4.T5)summarizes the ROM analysis results for both models across all failure instances\. GPT\-4o\-mini exhibited a ROM of 68\.6% across 86 diagnostic failures, indicating that over two\-thirds of its errors involved decision\-making uncertainty rather than a complete absence of the correct diagnosis from the reasoning trace\. In contrast, HuatuoGPT’s ROM of 44\.0% across 100 failures suggests that its diagnostic errors were more evenly distributed between knowledge gaps \(56%\) and reasoning\-output misalignment \(44%\)\. This 24\.6 percentage\-point ROM gap indicates different failure profiles: GPT\-4o\-mini more often considers the correct diagnosis but does not commit to it, while HuatuoGPT’s failures more often lack the correct diagnosis in the reasoning trace\.

Table 5:Reasoning\-to\-output mismatch \(ROM\) analysis\.Blue highlighting indicates the high overall ROM in GPT\-4o\-mini\.Teal valuesindicate that the correct diagnosis appears in the reasoning trace for all failures at that level\. TheLevel 1 dropmarks the lowest ROM for the generalist model\.
##### GPT\-4o\-mini’s U\-Shaped ROM Curve:

GPT\-4o\-mini’s ROM exhibited a non\-monotonic U\-shaped pattern, peaking at the Narrative level \(100%\) and dropping sharply to its lowest point at Level 1 \(53\.3%\), representing the highest rate of failures where the correct diagnosis was absent from the reasoning trace\. This suggests the model experiences a sharp loss of diagnostic anchors when initial clinical cues are removed, then partially adapts under sustained information constraints\.

##### HuatuoGPT’s Progressive Degradation:

In contrast, HuatuoGPT showed a general downward trend, dropping to 32\.3% at Level 3\. This indicates that its failures are driven by progressive knowledge degradation rather than decision\-making hesitation; as information is removed, the model’s ability to retrieve the correct diagnosis from its latent knowledge base deteriorates proportionally \(trajectory visualized in Appendix[F](https://arxiv.org/html/2606.31608#A6)\.\)

##### ROM and the Hidden Knowledge Paradox

The ROM analysis helps explain the Hidden Knowledge Paradox observed in Section[4\.4](https://arxiv.org/html/2606.31608#S4.SS4)\. HuatuoGPT’s high Max Diagnostic Potential \(MDA: 92\.5%\) but lower Narrative accuracy is clarified by its moderate ROM: the model can reach the correct diagnosis under some masking conditions, but fails to access it reliably in full, verbose narratives\. Conversely, GPT\-4o\-mini’s perfect Narrative ROM \(100%\) indicates that the correct diagnosis appears in its reasoning trace for all Narrative\-level failures, but is not consistently reflected in the final answer\.

## 5Discussion

Our findings provide evidence for the “Evaluation Illusion”\(Agrawalet al\.,[2025](https://arxiv.org/html/2606.31608#bib.bib2)\): models can maintain surface fluency across masking levels even as diagnostic validity degrades under information scarcity\.

##### Implications for Benchmarking

Our results complement information\-complete benchmarks such as MedQA\(Jinet al\.,[2021](https://arxiv.org/html/2606.31608#bib.bib8)\)\. HuatuoGPT’s high Max Diagnostic Potential \(MDA: 92\.5%\) suggests that one bottleneck is contextual reasoning stability under uncertainty, not only static medical knowledge\. Additionally, GPT\-4o\-mini’s high Monotonicity Violation Rate \(27\.5%\) indicates that diagnostic performance can depend on surface cues and context presentation\. Benchmarks should therefore evaluate models under information scarcity to distinguish stable clinical reasoning from pattern association\.

##### Implications for Model Development

The ROM analysis suggests that different model families may require different interventions\. For GPT\-4o\-mini, a high ROM \(68\.6%\) indicates that the correct diagnosis often appears in the reasoning trace but is not selected in the final answer, suggesting a need for better confidence calibration and reasoning\-to\-answer alignment\. Conversely, HuatuoGPT’s lower ROM \(44\.0%\) indicates that 56% of its failures lack the correct diagnosis in the reasoning trace, suggesting that retrieval\-augmented generation or expanded rare\-disease training data may be more relevant\.

##### Limits of Standalone LLM\-as\-a\-Judge Evaluation

Our consensus\-failure analysis shows that relying on automated evaluators as standalone arbiters in clinical medical NLP can be misleading\. The generalist judge \(GPT\-4o\-mini\) demonstrated a 47\.9% Hallucination Approval Rate \(HAR\), indicating that it can assign passing scores to fluent but clinically incorrect reasoning\. The specialized judge \(HuatuoGPT\-o1\) exhibited a 100% HAR alongside a positive self\-preference bias \(\+0\.096\), suggesting that domain\-specific judge models may also validate medical\-sounding text over clinical correctness\. Consequently, evaluations that rely only on automated judges may overestimate clinical utility due to fluency bias, reinforcing the need for human\-in\-the\-loop expert validation\.

## 6Conclusions and Future Work

We introduced CLExEval, a human\-in\-the\-loop framework for evaluating LLM clinical reasoning under progressive information masking\. Across 5,600 expert annotations, CLExEval shows that fluent clinical explanations can remain persuasive even when diagnostic precision degrades, creating an evaluation illusion for both readers and automated judges\. Our mechanistic analysis separates three failure modes: sensitivity to information scarcity, hidden knowledge that is not reliably retrieved or expressed, and reasoning\-to\-output mismatch\. On a human\-verified consensus failure set, standalone LLM judges approved a substantial fraction of clinically incorrect outputs, with HAR reaching 100% for one judge under validly scored cases\. These findings support the use of expert\-grounded validation when evaluating clinical reasoning systems, especially in settings where surface fluency can obscure diagnostic correctness\.

Future work should test whether intervention\-optimized reasoning methods reduce the failures exposed by CLExEval, including deliberative search via Tree of Thoughts\(Yaoet al\.,[2023](https://arxiv.org/html/2606.31608#bib.bib33)\), self\-reflection for hallucination mitigation\(Jiet al\.,[2023](https://arxiv.org/html/2606.31608#bib.bib34)\), and symbolic verification through Symbolic Chain\-of\-Thought\(Xuet al\.,[2024](https://arxiv.org/html/2606.31608#bib.bib35)\)\. Because these methods generate longer or multi\-branch reasoning traces, evaluating them would require a new expert\-annotation cycle over thousands of additional outputs\. Future research should also focus on developing reliable, clinically grounded automated metrics that correlate with expert human baselines, implementing architecture\-specific interventions, and expanding this uncertainty\-based framework to other high\-stakes domains\.

## Limitations

While our evaluation framework provides fine\-grained evidence through 5,600 expert assessments, this depth required a trade\-off with scale\. Generating this dataset and annotation required an estimated 1,000 hours of cognitive labor across a four\-member medically trained panel, consisting of two licensed physicians and two senior clinical interns with 4\.5 years of medical education; the inclusion of senior interns rather than only licensed clinicians is an important limitation\. This labor\-intensive process limited our scope to two models and 40 source clinical narratives, expanded through progressive masking into 200 annotated case instances; replicating this pipeline for every new frontier model remains highly resource\-intensive\. CLExEval should therefore be viewed as a depth\-oriented mechanistic evaluation rather than a broad model leaderboard\. Furthermore, our deliberate selection of diagnostically rare, multisystem cases stress\-tests causal reasoning but may not reflect model performance on routine, unambiguous clinical scenarios\. Finally, while the progressive masking methodology and CLExEval rubric are conceptually generalizable to other domains requiring reasoning under uncertainty \(e\.g\., law, intelligence analysis\), our empirical validation remains confined to clinical medicine, motivating future research into reliable, scalable automated metrics that correlate with our human baselines\.

## Ethical Statement and Broad Impact

##### Data License

We will release the RARECASE\-2000 benchmark and study artifacts via GitHub, including the 2,000 progressively masked clinical case variants, CLExEval rubric, and the model outputs, reasoning traces, and 5,600 expert annotations used in our main evaluation\. The source narratives are derived from the MultiCaRe dataset\(Nievas Offidani and Delrieux,[2023](https://arxiv.org/html/2606.31608#bib.bib1)\), which is distributed under a Creative Commons Zero \(CC0 1\.0\) license\. We release our curated and annotated artifacts under a CC\-BY 4\.0 license to support reuse, redistribution, and replication\.

## References

- Bridging the gap between consumers’ medication questions and trusted answers\.\.264,pp\. 25–29\.Cited by:[§2\.1](https://arxiv.org/html/2606.31608#S2.SS1.p1.1)\.
- M\. Agrawal, I\. Y\. Chen, F\. Gulamali, and S\. Joshi \(2025\)The evaluation illusion of large language models in medicine\.npj Digital Medicine8\(1\),pp\. 600\.External Links:[Document](https://dx.doi.org/10.1038/s41746-025-01963-x),[Link](https://www.nature.com/articles/s41746-025-01963-x)Cited by:[§1](https://arxiv.org/html/2606.31608#S1.p1.1),[§2\.2](https://arxiv.org/html/2606.31608#S2.SS2.p1.1),[§5](https://arxiv.org/html/2606.31608#S5.p1.1)\.
- R\. K\. Arora, J\. Wei, R\. S\. Hicks, P\. Bowman, J\. Quiñonero\-Candela, F\. Tsimpourlas, M\. Sharman, M\. Shah, A\. Vallone, A\. Beutel, J\. Heidecke, and K\. Singhal \(2025\)HealthBench: evaluating large language models towards improved human health\.External Links:2505\.08775,[Link](https://arxiv.org/abs/2505.08775)Cited by:[§2\.2](https://arxiv.org/html/2606.31608#S2.SS2.p1.1),[Table 1](https://arxiv.org/html/2606.31608#S2.T1.1.17.16.1.1.1)\.
- A\. Bavaresco, R\. Bernardi, L\. Bertolazzi, D\. Elliott, R\. Fernández, A\. Gatt, E\. Ghaleb, M\. Giulianelli, M\. Hanna, A\. Koller,et al\.\(2025\)LLMs instead of human judges? a large scale empirical study across 20 NLP evaluation tasks\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 2: Short Papers\),Vienna, Austria,pp\. 238–255\.External Links:[Document](https://dx.doi.org/10.18653/v1/2025.acl-short.20),[Link](https://aclanthology.org/2025.acl-short.20/)Cited by:[§2\.2](https://arxiv.org/html/2606.31608#S2.SS2.p1.1)\.
- S\. Bedi, Y\. Jiang, P\. Chung, S\. Koyejo, and N\. Shah \(2025\)Fidelity of medical reasoning in large language models\.JAMA Network Open8\(8\),pp\. e2526021\.External Links:[Document](https://dx.doi.org/10.1001/jamanetworkopen.2025.26021),[Link](https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2837372)Cited by:[§2\.2](https://arxiv.org/html/2606.31608#S2.SS2.p2.1)\.
- B\. Bhasuran, M\. Prosperi, K\. Hanna, J\. Petrilli, C\. J\. Washington, and Z\. He \(2026\)Evaluation of causal reasoning for large language models in contextualized clinical scenarios of laboratory test interpretation\.External Links:[Document](https://dx.doi.org/10.1038/s41746-026-02632-3),[Link](https://www.nature.com/articles/s41746-026-02632-3)Cited by:[§2\.1](https://arxiv.org/html/2606.31608#S2.SS1.p2.1),[Table 1](https://arxiv.org/html/2606.31608#S2.T1.1.12.11.1.1.1)\.
- O\. Bianchi, M\. Willey, C\. X\. Alvarado, B\. Danek, M\. Khani, N\. Kuznetsov, A\. Dadu, S\. Shah, M\. J\. Koretsky, M\. B\. Makarious,et al\.\(2026\)CARDBiomedBench: a benchmark for evaluating the performance of large language models in biomedical research\.8\(1\),pp\. 100943\.External Links:[Document](https://dx.doi.org/10.1016/j.landig.2025.100943),[Link](https://pubmed.ncbi.nlm.nih.gov/41622090/)Cited by:[§2\.2](https://arxiv.org/html/2606.31608#S2.SS2.p2.1),[Table 1](https://arxiv.org/html/2606.31608#S2.T1.1.18.17.1.1.1.1)\.
- S\. Cabral, D\. Restrepo, Z\. Kanjee, P\. Wilson, B\. Crowe, R\. Abdulnour, and A\. Rodman \(2024\)Clinical reasoning of a generative artificial intelligence model compared with physicians\.184\(5\),pp\. 581–583\.Cited by:[§2\.1](https://arxiv.org/html/2606.31608#S2.SS1.p2.1),[Table 1](https://arxiv.org/html/2606.31608#S2.T1.1.10.9.1.1.1)\.
- H\. Chen, Z\. Fang, Y\. Singla, and M\. Dredze \(2025\)Benchmarking large language models on answering and explaining challenging medical questions\.InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),Albuquerque, New Mexico,pp\. 3563–3599\.External Links:[Document](https://dx.doi.org/10.18653/v1/2025.naacl-long.182),[Link](https://aclanthology.org/2025.naacl-long.182/)Cited by:[§2\.1](https://arxiv.org/html/2606.31608#S2.SS1.p1.1),[Table 1](https://arxiv.org/html/2606.31608#S2.T1.1.7.6.1.1.1)\.
- C\. Chiu, S\. Pitis, and M\. van der Schaar \(2025\)Simulating viva voce examinations to evaluate clinical reasoning in large language models\.InAdvances in Neural Information Processing Systems,Vol\.38\.Note:NeurIPS 2025External Links:[Link](https://arxiv.org/abs/2510.10278)Cited by:[§1](https://arxiv.org/html/2606.31608#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.31608#S2.SS1.p2.1)\.
- J\. Cohen \(1960\)A coefficient of agreement for nominal scales\.20\(1\),pp\. 37–46\.External Links:[Document](https://dx.doi.org/10.1177/001316446002000104),[Link](https://doi.org/10.1177/001316446002000104),https://doi\.org/10\.1177/001316446002000104Cited by:[§C\.1](https://arxiv.org/html/2606.31608#A3.SS1.p1.2)\.
- Y\. Diekmann, C\. Fensore, R\. Carrillo\-Larco, E\. C\. Rosales, S\. Shiromani, R\. Pai, M\. Shah, and J\. Ho \(2025\)LLMs as medical safety judges: evaluating alignment with human annotation in patient\-facing QA\.InProceedings of the 24th Workshop on Biomedical Language Processing,Vienna, Austria,pp\. 217–224\.External Links:[Document](https://dx.doi.org/10.18653/v1/2025.bionlp-1.19),[Link](https://aclanthology.org/2025.bionlp-1.19/)Cited by:[§2\.2](https://arxiv.org/html/2606.31608#S2.SS2.p1.1)\.
- A\. Field \(2024\)Discovering statistics using IBM SPSS statistics\.Sage publications limited\.Cited by:[§C\.1](https://arxiv.org/html/2606.31608#A3.SS1.p1.2)\.
- S\. L\. Fleming, A\. Lozano, W\. J\. Haberkorn, J\. A\. Jindal, E\. Reis, R\. Thapa, L\. Blankemeier, J\. Z\. Genkins, E\. Steinberg, A\. Nayak,et al\.\(2024\)MedAlign: a clinician\-generated dataset for instruction following with electronic medical records\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.38,Vancouver, Canada,pp\. 22021–22030\.External Links:[Document](https://dx.doi.org/10.1609/aaai.v38i20.30205),[Link](https://ojs.aaai.org/index.php/AAAI/article/view/30205)Cited by:[§2\.1](https://arxiv.org/html/2606.31608#S2.SS1.p2.1),[Table 1](https://arxiv.org/html/2606.31608#S2.T1.1.14.13.1.1.1)\.
- D\. Hendrycks, C\. Burns, S\. Basart, A\. Zou, M\. Mazeika, D\. Song, and J\. Steinhardt \(2021\)Measuring massive multitask language understanding\.InInternational Conference on Learning Representations,Virtual Event\.External Links:[Link](https://openreview.net/forum?id=d7KBjmI3GmQ)Cited by:[§2\.1](https://arxiv.org/html/2606.31608#S2.SS1.p1.1),[Table 1](https://arxiv.org/html/2606.31608#S2.T1.1.6.5.1.1.1)\.
- Z\. Ji, T\. Yu, Y\. Xu, N\. Lee, E\. Ishii, and P\. Fung \(2023\)Towards mitigating LLM hallucination via self reflection\.InFindings of the Association for Computational Linguistics: EMNLP 2023,Singapore,pp\. 1827–1843\.External Links:[Document](https://dx.doi.org/10.18653/v1/2023.findings-emnlp.123),[Link](https://aclanthology.org/2023.findings-emnlp.123/)Cited by:[§6](https://arxiv.org/html/2606.31608#S6.p2.1)\.
- Y\. Jiang, K\. C\. Black, G\. Geng, D\. Park, J\. Zou, A\. Y\. Ng, and J\. H\. Chen \(2025\)MedAgentBench: a virtual EHR environment to benchmark medical LLM agents\.2\(9\),pp\. AIdbp2500144\.External Links:[Document](https://dx.doi.org/10.1056/AIdbp2500144),[Link](https://ai.nejm.org/doi/full/10.1056/AIdbp2500144),https://ai\.nejm\.org/doi/pdf/10\.1056/AIdbp2500144Cited by:[§2\.1](https://arxiv.org/html/2606.31608#S2.SS1.p2.1),[Table 1](https://arxiv.org/html/2606.31608#S2.T1.1.15.14.1.1.1)\.
- D\. Jin, E\. Pan, N\. Oufattole, W\. Weng, H\. Fang, and P\. Szolovits \(2021\)What disease does this patient have? a large\-scale open domain question answering dataset from medical exams\.Applied SciencesNature CommunicationsScientific ReportsThe Lancet Digital HealthJAMA Network Opennpj Digital Medicinenpj Digital Medicinenpj Digital MedicineJAMAJAMA Internal MedicineNEJM AIMedInfoNEJM AIAdvances in neural information processing systemsarXiv preprint arXiv:2507\.05201Educational and Psychological MeasurementBiometricsJAMA Network OpeneLifearXiv preprint arXiv:2505\.11733v2arXiv preprint arXiv:2506\.09513arXiv preprint arXiv:2303\.18223arXiv preprint arXiv:2506\.02515arXiv preprint arXiv:2506\.01793arXiv preprint arXiv:2505\.23802Frontiers in Artificial IntelligencemedRxivarXiv preprint arXiv:2312\.07399arXiv preprint arXiv:2505\.22919arXiv preprint arXiv:2503\.04691Nature MedicineChestFrontiers in Artificial IntelligenceJournal of the American Medical Informatics AssociationData in BriefJAMA Network OpeneLifenpj Digital MedicineJournal of Medical Internet ResearchPLoS Computational Biology11\(14\),pp\. 6421\.External Links:[Link](https://www.mdpi.com/2076-3417/11/14/6421),ISSN 2076\-3417Cited by:[§2\.1](https://arxiv.org/html/2606.31608#S2.SS1.p1.1),[Table 1](https://arxiv.org/html/2606.31608#S2.T1.1.4.3.1.1.1),[§5](https://arxiv.org/html/2606.31608#S5.SS0.SSS0.Px1.p1.1)\.
- M\. Jullien, M\. Valentino, and A\. Freitas \(2025\)The knowledge\-reasoning dissociation: fundamental limitations of LLMs in clinical natural language inference\.External Links:2508\.10777,[Document](https://dx.doi.org/10.48550/arXiv.2508.10777),[Link](https://arxiv.org/abs/2508.10777)Cited by:[§2\.2](https://arxiv.org/html/2606.31608#S2.SS2.p1.1)\.
- Z\. Kanjee, B\. Crowe, and A\. Rodman \(2023\)Accuracy of a generative artificial intelligence model in a complex diagnostic challenge\.330\(1\),pp\. 78–80\.Cited by:[§2\.1](https://arxiv.org/html/2606.31608#S2.SS1.p2.1),[Table 1](https://arxiv.org/html/2606.31608#S2.T1.1.9.8.1.1.1)\.
- K\. Krippendorff \(2011\)Computing Krippendorff’s Alpha\-Reliability\.Note:University of Pennsylvania ScholarlyCommonsExternal Links:[Link](https://repository.upenn.edu/asc_papers/43/)Cited by:[§C\.1](https://arxiv.org/html/2606.31608#A3.SS1.p1.2)\.
- S\. S\. Li, V\. Balachandran, S\. Feng, J\. S\. Ilgen, E\. Pierson, P\. W\. Koh, and Y\. Tsvetkov \(2024\)MediQ: question\-asking LLMs and a benchmark for reliable interactive clinical reasoning\.InAdvances in Neural Information Processing Systems,Vol\.37,Vancouver, Canada,pp\. 28858–28888\.External Links:[Document](https://dx.doi.org/10.52202/079017-0908),[Link](https://proceedings.neurips.cc/paper_files/paper/2024/file/32b80425554e081204e5988ab1c97e9a-Paper-Conference.pdf)Cited by:[§1](https://arxiv.org/html/2606.31608#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.31608#S2.SS1.p2.1)\.
- L\. G\. McCoy, R\. Swamy, N\. Sagar, M\. Wang, S\. Bacchi, J\. M\. N\. Fong, N\. C\. Tan, K\. Tan, T\. A\. Buckley, P\. Brodeur,et al\.\(2025\)Assessment of large language models in clinical reasoning: a novel benchmarking study\.2\(10\),pp\. AIdbp2500120\.External Links:[Document](https://dx.doi.org/10.1056/AIdbp2500120),[Link](https://ai.nejm.org/doi/10.1056/AIdbp2500120)Cited by:[§2\.1](https://arxiv.org/html/2606.31608#S2.SS1.p2.1),[Table 1](https://arxiv.org/html/2606.31608#S2.T1.1.11.10.1.1.1)\.
- M\. Nievas Offidani and C\. Delrieux \(2023\)The MultiCaRe dataset: a multimodal case report dataset with clinical cases, labeled images and captions from open access PMC articles\.Zenodo\.External Links:[Document](https://dx.doi.org/10.5281/zenodo.10079370),[Link](https://doi.org/10.5281/zenodo.10079370)Cited by:[Appendix A](https://arxiv.org/html/2606.31608#A1.p1.1),[§3\.1](https://arxiv.org/html/2606.31608#S3.SS1.p1.1),[Data License](https://arxiv.org/html/2606.31608#Sx2.SS0.SSS0.Px1.p1.1)\.
- A\. Pal, L\. K\. Umapathi, and M\. Sankarasubbu \(2022\)MedMCQA: a large\-scale multi\-subject multi\-choice dataset for medical domain question answering\.InProceedings of the Conference on Health, Inference, and Learning,Proceedings of Machine Learning Research, Vol\.174,Virtual Event,pp\. 248–260\.External Links:[Link](https://proceedings.mlr.press/v174/pal22a.html)Cited by:[§2\.1](https://arxiv.org/html/2606.31608#S2.SS1.p1.1),[Table 1](https://arxiv.org/html/2606.31608#S2.T1.1.5.4.1.1.1)\.
- P\. Qiu, C\. Wu, S\. Liu, Y\. Fan, W\. Zhao, Z\. Chen, H\. Gu, C\. Peng, Y\. Zhang, Y\. Wang, and W\. Xie \(2025\)Quantifying the reasoning abilities of LLMs on clinical cases\.16\(1\),pp\. 9799\.External Links:[Document](https://dx.doi.org/10.1038/s41467-025-64769-1)Cited by:[§1](https://arxiv.org/html/2606.31608#S1.p1.1),[§2\.2](https://arxiv.org/html/2606.31608#S2.SS2.p2.1),[Table 1](https://arxiv.org/html/2606.31608#S2.T1.1.16.15.1.1.1)\.
- A\. S\. Rao, K\. P\. Esmail, R\. S\. Lee, S\. Jiang, B\. Arraiza Carlo, J\. Gill, P\. Khanna, E\. Kalmowitz, B\. Montagnese, K\. Heydari,et al\.\(2026\)Large language model performance and clinical reasoning tasks\.9\(4\),pp\. e264003\.External Links:[Document](https://dx.doi.org/10.1001/jamanetworkopen.2026.4003),[Link](https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2847679)Cited by:[§2\.1](https://arxiv.org/html/2606.31608#S2.SS1.p2.1)\.
- S\. Z\. Y\. Sim and T\. Chen \(2025\)Critique of impure reason: unveiling the reasoning behaviour of medical large language models\.eLife14,pp\. e106187\.External Links:[Document](https://dx.doi.org/10.7554/eLife.106187),[Link](https://elifesciences.org/articles/106187)Cited by:[§2\.2](https://arxiv.org/html/2606.31608#S2.SS2.p2.1)\.
- J\. W\. Tukey \(1949\)Comparing individual means in the analysis of variance\.5\(2\),pp\. 99–114\.External Links:ISSN 0006341X, 15410420,[Link](http://www.jstor.org/stable/3001913)Cited by:[§C\.1](https://arxiv.org/html/2606.31608#A3.SS1.p1.2)\.
- B\. Wang, J\. Chang, Y\. Qian, G\. Chen, J\. Chen, Z\. Jiang, J\. Zhang, Y\. Nakashima, and H\. Nagahara \(2024\)DiReCT: diagnostic reasoning for clinical notes via large language models\.InAdvances in Neural Information Processing Systems,Vol\.37,Vancouver, Canada,pp\. 74999–75011\.Note:Datasets and Benchmarks TrackExternal Links:[Document](https://dx.doi.org/10.52202/079017-2386)Cited by:[§1](https://arxiv.org/html/2606.31608#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.31608#S2.SS1.p2.1)\.
- X\. Wang, Z\. Long, B\. Zhu, Y\. Cao, H\. Tang, K\. He, and S\. Zhang \(2026\)Evaluation of DeepSeek\-R1 and ChatGPT\-4o on the Chinese national medical licensing examination: a multi\-year comparative study\.16\(1\),pp\. 2237\.External Links:[Document](https://dx.doi.org/10.1038/s41598-025-31874-6),[Link](https://www.nature.com/articles/s41598-025-31874-6)Cited by:[§2\.1](https://arxiv.org/html/2606.31608#S2.SS1.p1.1)\.
- W\. Wu, X\. Xu, C\. Gao, X\. Diao, S\. Li, L\. A\. Salas, and J\. Gui \(2025\)Assessing and mitigating medical knowledge drift and conflicts in large language models\.InFindings of the Association for Computational Linguistics: EMNLP 2025,Suzhou, China,pp\. 707–730\.External Links:[Link](https://aclanthology.org/2025.findings-emnlp.38/),[Document](https://dx.doi.org/10.18653/v1/2025.findings-emnlp.38)Cited by:[§2\.2](https://arxiv.org/html/2606.31608#S2.SS2.p2.1)\.
- Z\. Xie, T\. Cohn, and J\. H\. Lau \(2023\)The next chapter: a study of large language models in storytelling\.InProceedings of the 16th International Natural Language Generation Conference,Prague, Czechia,pp\. 323–351\.External Links:[Link](https://aclanthology.org/2023.inlg-main.23/),[Document](https://dx.doi.org/10.18653/v1/2023.inlg-main.23)Cited by:[§1](https://arxiv.org/html/2606.31608#S1.p1.1)\.
- Z\. Xie, D\. Orel, R\. Thareja, D\. Sahnan, H\. Madmoun, F\. Zhang, D\. Banerjee, G\. Georgiev, X\. Peng, L\. Qian, J\. Huang, J\. Su, A\. Singh, R\. Xing, R\. Elbadry, C\. Xu, H\. Li, F\. Koto, I\. Koychev, T\. Chakraborty, Y\. Wang, S\. Lahlou, V\. Stoyanov, S\. Ananiadou, and P\. Nakov \(2025a\)FinChain: a symbolic benchmark for verifiable chain\-of\-thought financial reasoning\.External Links:[Link](https://arxiv.org/abs/2506.02515),[Document](https://dx.doi.org/10.48550/arXiv.2506.02515)Cited by:[§2\.1](https://arxiv.org/html/2606.31608#S2.SS1.p1.1)\.
- Z\. Xie, R\. Xing, Y\. Wang, J\. Geng, H\. Iqbal, D\. Sahnan, I\. Gurevych, and P\. Nakov \(2025b\)FIRE: fact\-checking with iterative retrieval and verification\.InFindings of the Association for Computational Linguistics: NAACL 2025,Albuquerque, New Mexico,pp\. 2901–2914\.External Links:[Link](https://aclanthology.org/2025.findings-naacl.158/),[Document](https://dx.doi.org/10.18653/v1/2025.findings-naacl.158)Cited by:[§2\.1](https://arxiv.org/html/2606.31608#S2.SS1.p1.1)\.
- J\. Xu, H\. Fei, L\. Pan, Q\. Liu, M\. Lee, and W\. Hsu \(2024\)Faithful logical reasoning via symbolic chain\-of\-thought\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),Bangkok, Thailand,pp\. 13326–13365\.External Links:[Document](https://dx.doi.org/10.18653/v1/2024.acl-long.720),[Link](https://aclanthology.org/2024.acl-long.720/)Cited by:[§6](https://arxiv.org/html/2606.31608#S6.p2.1)\.
- S\. Yao, D\. Yu, J\. Zhao, I\. Shafran, T\. Griffiths, Y\. Cao, and K\. Narasimhan \(2023\)Tree of thoughts: deliberate problem solving with large language models\.36,pp\. 11809–11822\.Cited by:[§6](https://arxiv.org/html/2606.31608#S6.p2.1)\.
- M\. Zhang, Y\. Shen, Z\. Li, H\. Sha, B\. Hu, Y\. Wang, C\. Huang, S\. Liu, J\. Tong, C\. Jiang, M\. Chai, Z\. Xi, S\. Dou, T\. Gui, Q\. Zhang, and X\. Huang \(2025\)LLMEval\-Med: a real\-world clinical benchmark for medical LLMs with physician validation\.InFindings of the Association for Computational Linguistics: EMNLP 2025,Suzhou, China,pp\. 4888–4914\.External Links:[Link](https://aclanthology.org/2025.findings-emnlp.263/),[Document](https://dx.doi.org/10.18653/v1/2025.findings-emnlp.263)Cited by:[§1](https://arxiv.org/html/2606.31608#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.31608#S2.SS1.p2.1),[§2\.2](https://arxiv.org/html/2606.31608#S2.SS2.p2.1)\.
- W\. X\. Zhao, K\. Zhou, J\. Li, T\. Tang, X\. Wang, Y\. Hou, Y\. Min, B\. Zhang, J\. Zhang, Z\. Dong, Y\. Du, C\. Yang, Y\. Chen, Z\. Chen, J\. Jiang, R\. Ren, Y\. Li, X\. Tang, Z\. Liu, P\. Liu, J\. Nie, and J\. Wen \(2023\)A survey of large language models\.External Links:[Link](https://arxiv.org/abs/2303.18223),[Document](https://dx.doi.org/10.48550/arXiv.2303.18223)Cited by:[§1](https://arxiv.org/html/2606.31608#S1.p1.1)\.

## Appendix ADataset Details

We use de\-identified clinical narratives from the MultiCaRe dataset\(Nievas Offidani and Delrieux,[2023](https://arxiv.org/html/2606.31608#bib.bib1)\), an open\-access corpus\. From an initial pool of 10,000 narratives, clinicians conducted a multi\-stage review to curate 500 diagnostically rare cases for RARECASE\-2000, from which 40 structurally rich cases were selected for the RARECASE\-200 evaluation subset\. Rather than routine scenarios, we prioritized atypical diagnostic trajectories, overlapping symptom clusters, and clear narrative flows\.

To prepare these cases for our evaluation framework, we restructured each selected narrative into a standardized format comprising demographics, key findings, diagnostic evolution, and outcome details\. These structured cases served as the basis for constructing the five masking levels\. The final 40 cases span a diverse demographic from pediatric to geriatric populations \(55% male, 45% female\)\.

### A\.1Dataset Demographics

The age and gender distribution of the 40 rare clinical narratives selected for the RARECASE\-200 benchmark is provided in Figure[5](https://arxiv.org/html/2606.31608#A1.F5)\. The cases were specifically curated to ensure coverage across a wide spectrum of patient profiles, from pediatric to geriatric populations\. This demographic diversity ensures the models’ diagnostic reasoning is rigorously stress\-tested across varying clinical contexts\. The selection criteria for the 40 RARECASE narratives included: \(1\) atypical or uncommon diagnostic trajectories, \(2\) multi\-system involvement or overlapping symptom clusters, \(3\) the presence of complete demographic and clinical details, \(4\) a clear narrative flow enabling multi\-level abstraction and masking, and \(5\) verified de\-identification and adherence to CC0 licensing conditions\.

![Refer to caption](https://arxiv.org/html/2606.31608v1/sections/Fig5.png)Figure 5:Age and gender distribution of 40 rare clinical narratives selected for evaluation\. Cases span pediatric to geriatric populations\.

## Appendix BPrompting Details

### B\.1Model Selection and Exclusion Criteria

To determine the most representative models for the RARECASE\-200 benchmark, we conducted a pilot evaluation on a subset of five cases\. Our initial candidate pool includedMedGemma\-4B,II\-Medical\-8B,DeepSeek\-R1\-Distill\-Llama\-8B, andGPT\-5\.1\-chat\.

While these models showed promising general performance, several were excluded from the final evaluation based on expert feedback:

- •GPT\-5\.1\-chat:Despite its frontier capabilities, it consistently refused to generate the structured chain\-of\-thought required for our reasoning evaluation, stating:“I cannot provide step\-by\-step internal diagnostic reasoning, but I can give a concise, clinically relevant explanation\.”
- •II\-Medical\-8B, MedGemma & DeepSeek\-R1 \(8B\):While functional, these models exhibited redundant performance profiles compared to HuatuoGPT\-o1 in the pilot phase\.

Ultimately, expert clinicians recommended focusing onHuatuoGPT\-o1\-8BandGPT\-4o\-minito provide the most discriminative contrast between a natively “slow\-thinking” medical specialist and a high\-performing generalist under the intensive human\-in\-the\-loop scoring requirements of our framework \(1,000 expert hours\)\.

### B\.2LLM Prompting and Output Generation

To ensure rigorous comparability while respecting architectural differences, we employed a core unified prompt strategy\. Both models were evaluated using an identical structured user prompt that enforced aThinkingsection followed by aFinal Response, ensuring they were assessed on the exact same reasoning subtasks\. Because GPT\-4o\-mini is a general\-purpose model, we applied system\-level scaffolding via a senior diagnostician persona and a one\-shot example\. In contrast, HuatuoGPT\-o1\-8B received no external system prompt, relying instead on its intrinsic slow\-thinking mechanism\. All outputs were generated deterministically, decoded, and anonymized for blinded expert scoring\.

### B\.3Prompting Templates

To ensure complete reproducibility, we provide the exact prompt templates utilized in our evaluation framework\. Both models received an identical user prompt to enforce comparable task constraints, while the generalist model \(GPT\-4o\-mini\) additionally received system\-level scaffolding to induce structured clinical reasoning\.

#### B\.3\.1HuatuoGPT\-o1\-8B \(Specialist Model\)

As a specialized medical model with an intrinsic slow\-thinking mechanism, HuatuoGPT\-o1\-8B was evaluated using only the structured user prompt, with no additional system prompt\.

User Prompt for HuatuoGPT\-o1\-8BUser Prompt:You are a medical AI assistant trained to perform detailed, step\-by\-step diagnostic reasoning\.Clinical Case: casePlease analyze this case using systematic clinical reasoning\. Structure your response as:\#\# Thinking \- List the key clinical findings and symptoms \- Generate 3\-5 differential diagnoses with rationale \- Systematically evaluate each diagnosis against the evidence \- Eliminate unlikely diagnoses with clear reasoning \- Arrive at your most likely working diagnosis\#\# Final Response \- One\-sentence summary with the most likely diagnosis and critical supporting evidence \- Clearly state any uncertainties or ambiguities in the case

#### B\.3\.2GPT\-4o\-mini \(Generalist Model\)

To approximate the clinical reasoning behavior native to the specialist model, GPT\-4o\-mini was provided with the identical user prompt above, supplemented by a comprehensive system prompt\. This system prompt established a senior diagnostician persona and included a one\-shot example of high\-quality reasoning\.\(see Figure[6](https://arxiv.org/html/2606.31608#A2.F6)\)

User Prompt and System Prompt for GPT\-4o\-miniUser Prompt:Identical to the HuatuoGPT\-o1\-8B user prompt\.System Prompt:You are a senior clinical diagnostician with years of experience in medicine\. You excel at explaining your clinical reasoning step\-by\-step, as if teaching a medical student or junior resident\.example output: \#\# ThinkingAlright, let’s think about what we’re dealing with here\. We’ve got a 53\-year\-old woman who has been experiencing some pretty troubling symptoms over the past decade—abdominal pain, swelling, and constant vomiting\. That sounds like it might have something serious going on inside her belly\. She doesn’t mention feeling nauseous or losing weight, which is interesting because those would typically accompany more severe conditions\. At first glance, I’m thinking there’s definitely something unusual happening in her abdomen\.Now, looking at the imaging results, there’s a large mass in her left upper abdomen\. It measures about 10 by 10 centimeters and looks kind of messy on both the ultrasound and CT scans—it’s described as having varied echo patterns and low density compared to other organs\. This makes me wonder if it’s either a cystic structure or maybe even a tumor\. But wait, after they gave her a contrast agent during the CT scan, the mass didn’t really light up much; it stayed relatively dark compared to the rest of the area\. Hmm, that suggests it isn’t aggressively growing or highly vascularized, so perhaps it’s not cancerous?During surgery, they find this encapsulated mass near where the spleen meets the liver\. They decide to remove the whole thing along with part of the spleen\. When they take a closer look at the tissue under the microscope, they see these blood vessels with different\-sized openings covered by flat and taller endothelial cells\. Oh\! And there are little finger\-like projections reaching out into these spaces\. Sounds almost like a specific type of growth pattern\. Plus, when they test the cells, they’re positive for certain markers like CD31 and CD68, indicating they line blood vessels, while being negative for others like CD34, suggesting they aren’t typical blood vessel cells\. So far, everything seems to point towards a benign condition called a splenic littoral cell angioma\.So, putting all this together—the age of the patient, the nature of the mass, its appearance on imaging, and especially how the cells behave—I’m confident that this is indeed a benign splenic littoral cell angioma\. Everything checks out with the histological features too\. Yeah, that feels right\. Still, I can’t shake off the thought that there might be another explanation lurking somewhere, just in case\. Maybe something less common? Like a hemangioma or even a rare form of sarcoma? Those could also fit the profile, though nothing else quite matches the specifics of this case\. For now, I’ll stick with my conclusion, but keep an eye open for anything new that comes up\.\#\# Final ResponseThe most likely diagnosis for this 53\-year\-old woman based on the provided information is \*\*benign splenic littoral cell angioma\*\*\.This conclusion is supported by several key points: 1\. \*\*Symptoms\*\*: Her long\-standing abdominal discomfort and vomiting suggest a chronic issue rather than an acute one\. 2\. \*\*Imaging Findings\*\*: The mass appears ill\-defined and heterogeneous on both ultrasound and CT scans, without aggressive enhancement post\-contrast, hinting at a benign or slowly\-growing entity\. 3\. \*\*Surgical and Histopathological Examination\*\*: The resected mass shows characteristics consistent with a splenic littoral cell angioma, including the presence of variably sized vascular channels lined with endothelial cells and the absence of atypia or mitotic figures\.While other possibilities such as hemangiomas or rare types of sarcomas exist, their presentation does not align perfectly with the specific details observed in this case\. Therefore, given the combination of clinical, radiographic, and pathological data, benign splenic littoral cell angioma remains the most plausible diagnosis\. However, it’s important to remain vigilant and consider alternative explanations should additional information arise\.When you analyze a case: \- Think aloud in a natural, conversational tone\. \- Walk through your thought process sequentially, explaining why you consider each piece of evidence\. \- Verbalize your uncertainties, alternative interpretations, and reasoning adjustments as you go\. \- Use phrases like "Let me think about this\.\.\.", "At first, I wondered if\.\.\.", "But wait, this doesn’t quite fit because\.\.\.", "Putting this all together\.\.\." \- Be thorough and transparent in your reasoning\. Don’t just list facts—explain how they influence your diagnostic thinking\. \- Show your work: describe how you weight evidence, rule out alternatives, and converge on the most likely diagnosis\.Your goal is to produce a rich, educational chain\-of\-thought narrative that helps others understand not just WHAT you conclude, but HOW and WHY you arrived at that conclusion\.Figure 6:System Prompt for GPT\-4o\-mini\.

## Appendix CCLExEval Rubric

To ensure high inter\-rater reliability and granular assessment of clinical reasoning, all human experts and automated evaluators utilized the standardized CLExEval rubric, visually summarized in Figure[7](https://arxiv.org/html/2606.31608#A5.F7)\. Each dimension is scored on a discrete scale: 0\.00, 0\.25, 0\.50, 0\.75, or 1\.00\.

To maintain strict scoring standards, evaluators were provided with the following dimension definitions during the calibration phase:

- •Diagnostic Precision:Accuracy and specificity of the final diagnosis\. \(0\.00 = Wrong diagnosis; 1\.00 = Fully correct, precise, and well\-justified\)\.
- •Differential Reasoning Quality:Breadth and depth of alternative diagnoses considered\. \(0\.00 = Single diagnosis stated without justification; 1\.00 = Fully systematic differential evaluating 3–5 alternatives\)\.
- •Evidence Integration / Grounding:Use of case evidence and avoidance of hallucinations\. \(0\.00 = Ignores case data; 1\.00 = Fully evidence\-based reasoning\)\.
- •Diagnostic Justification Depth:Quality, depth, and completeness of the causal explanations linking evidence to the diagnosis\.
- •Completeness vs\. Overload:Coverage of key clinical findings without the inclusion of distracting or irrelevant information\.
- •Clinical Plausibility / Soundness:Medical validity and strict adherence to established clinical standards and pathophysiology\.
- •Communication / Interpretability:Clarity, structural coherence, and professional presentation of the output\.

### C\.1Human Evaluation and Statistical Design

For clinical validation, our panel comprised two licensed physicians and two senior clinical interns, each with 4\.5 years of formal medical education\. After a structured calibration phase, two independent expert pairs double\-scored balanced subsets of the narratives to reduce fatigue across the 5,600 annotations\. We quantified inter\-rater reliability across this dual\-review design using Cohen’sκ\\kappa\(Cohen,[1960](https://arxiv.org/html/2606.31608#bib.bib40)\)and Krippendorff’sα\\alpha\(Krippendorff,[2011](https://arxiv.org/html/2606.31608#bib.bib41)\)\. Finally, we applied two\-way mixed ANOVA tests\(Field,[2024](https://arxiv.org/html/2606.31608#bib.bib43)\)and post\-hoc Tukey HSD tests\(Tukey,[1949](https://arxiv.org/html/2606.31608#bib.bib42)\)to evaluate performance degradation and structural differences under uncertainty\.

## Appendix DMetric Definitions

Here we provide the formal definitions for the CLExEval metrics used to quantify reasoning stability, information sensitivity, latent diagnostic potential, and judge reliability\.

##### Information Scarcity Sensitivity \(ISS\)\.

Information Scarcity Sensitivity measures how much diagnostic accuracy degrades as clinical information is progressively removed:

ISS=AccNarrative−AccL3\.\\mathrm\{ISS\}=\\mathrm\{Acc\}\_\{\\mathrm\{Narrative\}\}\-\\mathrm\{Acc\}\_\{L3\}\.\(1\)A higher ISS indicates greater sensitivity to missing information and suggests reliance on explicit clinical cues rather than robust causal reasoning\.

##### Reasoning Gap\.

For modelmmand rubric dimensiondd, we define the reasoning gap as:

Δm,d=Scorem,d\(Narrative\)−Scorem,d\(L3\)\.\\Delta\_\{m,d\}=\\mathrm\{Score\}\_\{m,d\}^\{\(\\mathrm\{Narrative\}\)\}\-\\mathrm\{Score\}\_\{m,d\}^\{\(L3\)\}\.\(2\)This measures how much a qualitative reasoning dimension declines between the full case and the most information\-scarce version\.

##### Label Dependence Factor \(LDF\)\.

The Label Dependence Factor measures inconsistency between the unmasked Narrative and Level 0, where the explicit diagnosis label is removed but most clinical evidence remains:

LDF=AccNarrative−AccL0\.\\mathrm\{LDF\}=\\mathrm\{Acc\}\_\{\\mathrm\{Narrative\}\}\-\\mathrm\{Acc\}\_\{L0\}\.\(3\)A higher LDF suggests dependence on explicit diagnostic labels or surface\-level extraction rather than independent clinical deduction\.

##### Monotonicity Violation Rate \(MVR\)\.

MVR measures how often a model violates the expected pattern that performance should not improve when clinical information is removed\. Letci,l∈\{0,1\}c\_\{i,l\}\\in\\\{0,1\\\}indicate whether the model correctly diagnoses caseiiat masking levelll\. A monotonicity violation occurs when:

ci,l=0andci,l\+1=1\.c\_\{i,l\}=0\\quad\\text\{and\}\\quad c\_\{i,l\+1\}=1\.\(4\)We define:

MVR=∑i∑l𝕀\(ci,l=0∧ci,l\+1=1\)∑i∑l1\.\\mathrm\{MVR\}=\\frac\{\\sum\_\{i\}\\sum\_\{l\}\\mathbb\{I\}\(c\_\{i,l\}=0\\land c\_\{i,l\+1\}=1\)\}\{\\sum\_\{i\}\\sum\_\{l\}1\}\.\(5\)A higher MVR indicates greater reasoning instability, because the model succeeds with less information after failing with more information\.

##### Maximum Diagnostic Potential \(MDA\)\.

MDA estimates latent diagnostic potential by measuring whether the model reaches the correct diagnosis at any masking level:

MDA=1N∑i=1N𝕀\(maxl⁡ci,l=1\)\.\\mathrm\{MDA\}=\\frac\{1\}\{N\}\\sum\_\{i=1\}^\{N\}\\mathbb\{I\}\\left\(\\max\_\{l\}c\_\{i,l\}=1\\right\)\.\(6\)High MDA with lower Narrative accuracy suggests that the model may possess the relevant diagnostic knowledge but fails to retrieve or express it reliably under some context conditions\.

##### Reasoning\-to\-Output Mismatch \(ROM\)\.

ROM measures cases where the model internally considers the correct diagnosis but does not commit to it in the final answer\. Letℱ\\mathcal\{F\}be the set of failed instances where the final prediction differs from the gold diagnosisGiG\_\{i\}, and letTiT\_\{i\}denote the model’s internalThinkingtrace:

ROM=∑i∈ℱ𝕀\(Gi∈Ti\)\|ℱ\|\.\\mathrm\{ROM\}=\\frac\{\\sum\_\{i\\in\\mathcal\{F\}\}\\mathbb\{I\}\(G\_\{i\}\\in T\_\{i\}\)\}\{\|\\mathcal\{F\}\|\}\.\(7\)A high ROM indicates reasoning\-output misalignment, where correct latent knowledge appears in the reasoning trace but is omitted, changed, or suppressed in the final response\. A low ROM indicates that failures are more likely due to genuine knowledge gaps or hallucinated reasoning\.

##### Hallucination Approval Rate \(HAR\)\.

HAR measures the fraction of human\-verified clinical failures that an automated judge incorrectly approves:

HAR=∑i∈ℱhuman𝕀\(Ji≥τ\)\|ℱhuman\|,\\mathrm\{HAR\}=\\frac\{\\sum\_\{i\\in\\mathcal\{F\}\_\{\\mathrm\{human\}\}\}\\mathbb\{I\}\(J\_\{i\}\\geq\\tau\)\}\{\|\\mathcal\{F\}\_\{\\mathrm\{human\}\}\|\},\(8\)whereℱhuman\\mathcal\{F\}\_\{\\mathrm\{human\}\}is the set of expert\-verified failure cases,JiJ\_\{i\}is the automated judge score, andτ\\tauis the approval threshold\. In our experiments,τ=0\.75\\tau=0\.75\. A higher HAR indicates that an automated judge is more likely to approve clinically incorrect outputs\.

##### Hidden Knowledge Paradox\.

The Hidden Knowledge Paradox occurs when a model has high MDA but lower final diagnostic accuracy in full\-context or narrative settings\. This indicates that the model can reach the correct diagnosis under at least one information condition, but fails to express it reliably when the full clinical narrative contains distracting or competing cues\.

##### Attention Dispersion\.

Attention Dispersion refers to cases where additional clinical context reduces diagnostic focus by spreading the model’s reasoning across competing cues or distractors\. Empirically, it appears as non\-monotonic behavior: the model may fail under fuller context but recover the correct diagnosis after some information is removed\.

##### Crossover Event\.

A Crossover Event occurs at a masking level where the relative performance of two models reverses\. For example, if modelaaoutperforms modelbbat lower masking levels but falls belowbbat higher masking levels, the crossing point marks a regime where robustness to information scarcity differs between models\.

## Appendix EExtended Analysis of Judge Behavior

A dimensional breakdown of the Hallucination Approval Rate \(HAR\) scores reveals four recurring patterns in automated judge behavior:

1\.GPT\-4o\-mini: fluency\-sensitive judging\.On Diagnostic Precision, GPT\-4o\-mini is comparatively strict, approving only 15\.5% of wrong diagnoses\. However, on Differential Reasoning, it approves 97\.0% of incorrect reasoning chains, suggesting sensitivity to fluent chain\-of\-thought structure rather than clinical logic alone\.

2\.DeepSeek\-R1\-Distill\-Llama\-8B: stricter reasoning critique\.This is the strictest judge on reasoning \(HAR=33\.3%HAR=33\.3\\%\), suggesting that its reasoning\-oriented training may help it critique logic more effectively\. However, it approves 100% of failures on Communication, indicating that fluent presentation remains difficult to penalize\.

3\.HuatuoGPT\-o1\-8B: permissive specialist judging\.The medical specialist does not function as a strict differentiator in this evaluation, with a 100% HAR across almost all dimensions\. This suggests that specialized medical tuning does not necessarily produce stricter evaluation behavior\.

4\.Llama\-3\.1\-8B\-Instruct: compressed scoring\.This model exhibits central\-tendency bias, approving 88\.7% of all failures, including 81\.0% of wrong diagnoses\. Unlike HuatuoGPT\-o1\-8B, which often assigns maximum scores, Llama\-3\.1\-8B\-Instruct compresses many outputs into a passing range, functioning as a variance suppressor\.

### E\.1The Self\-Preference Test

We extended our analysis to measure Self\-Evaluation Bias by calculating the difference between a model’s Self\-Score and its Peer\-Average Score \(the resulting bias gaps are quantified in Table[6](https://arxiv.org/html/2606.31608#A5.T6)\)\.

Bias=Mean\(Self\-Score\)−Mean\(Peer\-Scores\)\\text\{Bias\}=\\text\{Mean\}\(\\text\{Self\-Score\}\)\-\\text\{Mean\}\(\\text\{Peer\-Scores\}\)\(9\)
Table 6:Quantifying self\-preference bias\.The specialized medical model exhibits stronger self\-preference bias \(red\) than the generalist model \(orange\)\.These results caution against replacing human annotation with standalone LLM judges in high\-stakes domains\. Even the lowest\-HAR judge in our experiment \(GPT\-4o\-mini\) assigns passing scores to nearly half of the consensus failures\. While LLMs may assist with scalable screening or formatting checks, they do not yet provide a reliable substitute for expert assessment of medical correctness or clinical reasoning\.

![Refer to caption](https://arxiv.org/html/2606.31608v1/sections/Fig4.png)Figure 7:The CLExEval framework\.A 7\-dimensional clinical reasoning assessment rubric used to evaluate model outputs\. The framework separates surface\-level diagnostic accuracy from deep reasoning coherence, with each dimension scored on a scale from 0\.00 \(poor\) to 1\.00 \(excellent\)\.

## Appendix FExtended Expert Evaluation Profiles

Dimensional Performance Disparity:As visualized in Figure[8](https://arxiv.org/html/2606.31608#A6.F8), GPT\-4o\-mini maintains a strong baseline of structural coherence across all seven CLExEval dimensions\. The largest performance gap occurs inDifferential Reasoning\(0\.88 vs\. 0\.55\), reflecting the specialist model’s tendency to prematurely fixate on single incorrect diagnoses under verbose clinical context\. This dimensional breakdown also illustrates theGeneralist’s Illusiondiscussed in the main text: GPT\-4o\-mini’s highCommunicationscore \(0\.94\) can inflate perceived competence despite a lowerDiagnostic Precisionscore \(0\.71\)\.

![Refer to caption](https://arxiv.org/html/2606.31608v1/plot_model_comparison.png)Figure 8:Expert evaluation profile\.Comparison of GPT\-4o\-mini \(blue\) vs\. HuatuoGPT \(red\) across 7 expert rubric dimensions\. The generalist model consistently outperforms the specialist, particularly inDifferential ReasoningandCommunication\.GPT\-4o\-mini: The U\-Shaped ROM Curve\.GPT\-4o\-mini’s ROM exhibited a non\-monotonic U\-shaped pattern, peaking at the Narrative level \(100%\) and dropping sharply to its lowest point at Level 1 \(53\.3%\), representing the highest rate of failures where the correct diagnosis was absent from the reasoning trace\. This suggests that the model loses diagnostic anchors when initial cues are removed, then partially adapts under sustained information constraints \(see Figure[9](https://arxiv.org/html/2606.31608#A6.F9)\)\.

HuatuoGPT: Progressive Knowledge Degradation\.In contrast, HuatuoGPT showed a general downward trend, dropping to 32\.3% at Level 3\. This indicates that its failures are driven by progressive knowledge degradation rather than decision\-making hesitation; as information is removed, the model’s ability to retrieve the correct diagnosis from its latent knowledge base deteriorates proportionally\.

![Refer to caption](https://arxiv.org/html/2606.31608v1/rom_trajectory.png)Figure 9:ROM trajectory across masking levels\. The visualization tracks Reasoning\-to\-Output Mismatch as clinical information is systematically removed\. GPT\-4o\-mini \(blue\) demonstrates a non\-monotonic “U\-shaped” curve, peaking at 100% in the Narrative level before dropping to its lowest ROM at Level 1 \(53\.3%\)\. HuatuoGPT \(purple\) exhibits a general downward trend, indicating that information loss reduces the model’s retrieval reliability\.
## Appendix GExtended Reasoning Fingerprints

We quantified stability using the Monotonicity Violation Rate \(MVR\)\. We found that 27\.5% of GPT\-4o\-mini’s correct answers at Level 3 were non\-monotonic recoveries, where the model failed withmoreinformation but succeeded withless\. This indicates Attention Dispersion, where verbose context may distract the model from retrieving knowledge it can otherwise express \(visualized in Figure[10](https://arxiv.org/html/2606.31608#A7.F10)\)\.

![Refer to caption](https://arxiv.org/html/2606.31608v1/04_fingerprint_gpt_4o_mini.png)\(a\)GPT\-4o\-mini: High Instability\.
![Refer to caption](https://arxiv.org/html/2606.31608v1/04_fingerprint_huatuogpt_o1.png)\(b\)HuatuoGPT: Hidden Knowledge Recovery\.

Figure 10:Reasoning fingerprints \(heatmap analysis\)\.Green indicates correct diagnosis; red indicates failure\. \(a\) GPT\-4o\-mini exhibits “checkerboard” instability \(green→\\tored→\\togreen\), indicating non\-monotonic reasoning behavior\. \(b\) HuatuoGPT frequently fails the Full Narrative \(red column 1\) but recovers the correct diagnosis at Level 0 \(green column 2\), visualizing theHidden Knowledge Paradoxwhere verbose context contributes to retrieval failure\.
## Appendix HFluency Bias Under Level 0 Masking

To qualitatively illustrate theEvaluation IllusionandFluency Biasdiscussed, we present the complete model output for the case referenced in Figure[1](https://arxiv.org/html/2606.31608#S0.F1)\.

In this instance, the ground\-truth diagnosis \(Type C pyloric atresia\) was masked under the Level 0 condition\. Without the explicit diagnosis, both models \(HuatuoGPT\-o1\-8B and GPT\-4o\-mini\) produced an incorrect clinical conclusion \(duodenal atresia\)\. However, as shown in Figure[11](https://arxiv.org/html/2606.31608#A8.F11)and Figure[12](https://arxiv.org/html/2606.31608#A8.F12), the incorrect conclusion is embedded in a highly articulate and structurally coherent reasoning trace\. We highlight the models’ unsupported confident assertions inred\. This authoritative presentation helps explain why automated LLM\-as\-a\-Judge evaluators can incorrectly assign a perfect1\.00/1\.001\.00/1\.00diagnostic score\.

HuatuoGPT\-o1\-8B output \(Level 0 masking\)\#\#\# AI MODEL OUTPUT:\#\# ThinkingAlright, let’s think about what we’re dealing with here\. We’ve got a little guy who just came into the world five days ago, weighs around 1700 grams, and has been having some trouble from the get\-go—non\-bilious vomiting right from birth\. That’s already setting off alarm bells because it suggests something might be blocking his digestive tract somewhere past the stomach but before bile enters the picture\.Now, he hasn’t passed meconium either, which can sometimes happen if there’s an issue like Hirschsprung’s disease or even intestinal atresia\. So, I’m thinking that could definitely play a part here too\.When they did the first X\-ray, all we saw was a ’single bubble’ of air in the stomach, meaning nothing else seemed to have air beyond that point\. This makes sense if there’s a blockage further down, confirming my suspicions about possible atresia\.But then, when they took another look with an X\-ray a couple of days later, things changed—they now see a ’double bubble\.’Ah, so now there’s air not only in the stomach but also in the proximal duodenum\! This tells us the obstruction isn’t complete; there must be a pouch\-like formation allowing some air through\.The fact that the ultrasound didn’t show anything unusual kind of rules out other conditions like midgut volvulus, where you’d expect fluid levels or abnormal blood flow signs\. And given these clues, both duodenal atresia and pyloric atresia seem plausible\.So, during the operation, they find exactly what we suspected—a duodenal atresia with that classic ’apple\-core sign,’ where the third portion of the duodenum is blocked\.They fix it with a duodenogastrostomy, putting in a special tube to help him eat properly until everything heals\.After the surgery, he starts doing well—he gains weight nicely, and once the tube is out, he transitions smoothly to regular food\. By the time he leaves the hospital, he’s looking much better, and after following up over a year, he seems completely fine\!All these details really make duodenal atresia stand out as the main problem\. It fits perfectly with how he looked, acted, and responded to treatment\.Yeah, I feel pretty confident saying that’s what went wrong here\.\#\# Final ResponseBased on the provided information,the most likely diagnosis for this neonate isduodenal atresia\.Figure 11:HuatuoGPT\-o1\-8B output \(Level 0 output\)\.
Gpt\-4o\-mini output \(Level 0 masking\)\#\#\# AI MODEL OUTPUT:\#\# ThinkingAlright, let’s break this case down step\-by\-step\.\#\#\# Key Clinical Findings and Symptoms1\.Age and Birth Weight: A 5\-day\-old male, low birth weight at 1700g\. 2\.Clinical Presentation: Presenting with non\-bilious vomiting since birth and has not emitted meconium\. 3\.Physical Examination: Appears in poor general condition, features a ’trisomic facies,’ and has a distended abdomen \("plat abdomen"\)\. 4\.Imaging Findings: \- Initial abdominal X\-ray showed a single bubble of gastric air with no gas beyond it—often indicative of a proximal obstruction\. \- Second X\-ray revealed a "double bubble" sign, which is characteristic of duodenal atresia\. 5\.Lab Results: Biological check\-up was normal\.\#\#\# Differential Diagnoses1\.Duodenal Atresia:The most natural consideration given the classic presentation of double bubble on imaging and clinical findings of non\-bilious vomiting and failure to pass meconium\.2\.Pyloric Stenosis: Possible due to the age \(early infant\) and vomiting, but it’s typically associated with projectile vomiting and usually presents after a few weeks, which does not fit here\. 3\.Hirschsprung Disease: Possible given the failure to pass meconium, but this usually causes a more distal obstruction leading to abdominal distension and is less typical in the upper GI tract\. 4\.Malrotation with Midgut Volvulus: This can present with similar symptoms but is generally associated with bilious vomiting—here, the vomiting is non\-bilious, which makes this diagnosis less likely\. 5\.Intestinal Atresia \(other than duodenum\): Possible if a different segment of the intestines were obstructed, but would have been less likely given the prominent double bubble sign, particularly localized to the duodenum\.\#\#\# Evaluation of Diagnoses\-Duodenal Atresia:Given the key symptoms \(non\-bilious vomiting, double bubble on X\-ray\) and the surgical findings that confirmed the diagnosis, this remains the most compelling diagnosis\.\-Pyloric Stenosis: Rules out based on age of onset and imaging findings\. \-Hirschsprung Disease: Less likely because it does not correlate with the X\-ray findings and the absence of bowel gas does not suggest a distal obstruction consistent with Hirschsprung\. \-Malrotation with Midgut Volvulus: Less likely due to non\-bilious vomiting\. The X\-ray findings do not support this\. \-Intestinal Atresia: Other atresias are possible but less common; the imaging and clinical picture strongly suggest a proximal issue localized to the duodenum\.\#\#\# Elimination of Unlikely DiagnosesAfter evaluating the above conditions, it is clear that: \- Pyloric stenosis and malrotation are eliminated due to inconsistency with the clinical picture\. \- Hirschsprung’s Disease does not match the typical signs we see, which leads us to lean away from it as well\.\#\#\# ConclusionPutting all of this together, the findings are most consistent withduodenal atresia, especially in the context of the associated "trisomic facies" which can sometimes link to syndromic conditions like Down syndrome, known to have higher rates of duodenal atresia\.\#\# Final ResponseThe most likely diagnosis for this 5\-day\-old male isduodenal atresia, supported by the classic double bubble sign on abdominal X\-rays, non\-bilious vomiting since birth, and the absence of meconium passage\.The immediate clinical picture and surgical findings confirm the diagnosis of duodenal atresia\.Figure 12:Gpt\-4o\-mini output \(Level 0 output\)
CLExEval: A Human-in-the-Loop Framework for Qualitative Evaluation of LLM Clinical Reasoning

Similar Articles

LLMEval-Logic: A Solver-Verified Chinese Benchmark for Logical Reasoning of LLMs with Adversarial Hardening

DLawBench: Evaluating LLMs Through Multi-Turn Legal Consultation

From Benchmarking to Reasoning: A Dual-Aspect, Large-Scale Evaluation of LLMs on Vietnamese Legal Text

MedGuideX: Internalizing Decision Logic from Executable Guidelines into Large Language Models for Clinical Reasoning

Let LLMs Judge Each Other: Multi-Agent Peer-Reviewed Reasoning for Medical Question Answering

Submit Feedback

Similar Articles

LLMEval-Logic: A Solver-Verified Chinese Benchmark for Logical Reasoning of LLMs with Adversarial Hardening
DLawBench: Evaluating LLMs Through Multi-Turn Legal Consultation
From Benchmarking to Reasoning: A Dual-Aspect, Large-Scale Evaluation of LLMs on Vietnamese Legal Text
MedGuideX: Internalizing Decision Logic from Executable Guidelines into Large Language Models for Clinical Reasoning
Let LLMs Judge Each Other: Multi-Agent Peer-Reviewed Reasoning for Medical Question Answering