MedBench v5: A Dynamic, Process-Oriented, and Hallucination-Aware Benchmark for Clinical Multimodal Models
Summary
MedBench v5 is a dynamic, process-oriented benchmark for clinical multimodal models that integrates hallucination detection and stress testing, moving beyond static QA to evaluate reasoning and stability under information-flow stressors.
# MedBench v5: A Dynamic, Process-Oriented, and Hallucination-Aware Benchmark for Clinical Multimodal Models
Source: [https://arxiv.org/html/2606.24155](https://arxiv.org/html/2606.24155)
Ding Jinru; Jiang Chuchu \(co\-first author, jiangchuchu@pjlab\.org\.cn; Shanghai Artificial Intelligence Laboratory\); Lu Lu \(co\-first author, lulu@pjlab\.org\.cn; Shanghai Artificial Intelligence Laboratory\); Pang Wenrao \(Shanghai Artificial Intelligence Laboratory\); Bian Mouxiao \(Shanghai Artificial Intelligence Laboratory\); Gao Zhuangzhi \(Shanghai Artificial Intelligence Laboratory\); Chen Jiangyuan \(Shanghai Artificial Intelligence Laboratory\); Peng Xinwei \(Shanghai Artificial Intelligence Laboratory\); Chen Ruiyao \(Shanghai Artificial Intelligence Laboratory\); Ren Sijie \(Shanghai Artificial Intelligence Laboratory\); Lu Renjie \(Shanghai Artificial Intelligence Laboratory\); Han Bin \(Shanghai Artificial Intelligence Laboratory\); Liu Meiling \(Shanghai Artificial Intelligence Laboratory\); Xu Jie \(corresponding author, xujie@pjlab\.org\.cn; Shanghai Artificial Intelligence Laboratory\)
###### Abstract
Existing medical AI benchmarks lack process visibility, atomic skill evaluation, and integrated hallucination detection\. We introduce MedBench v5, a redesigned benchmark for clinical multimodal models \(language, vision–language, and agent systems\) that moves from static QA to dynamic, process\-oriented evaluation\. MedBench v5 features: \(1\) a dual\-dimensional framework combining Clinical Cognitive Responsiveness \(14 sub\-dimensions\) and Medical Atomic Skills \(4 agent environments\), covering 63 tasks; \(2\) three switchable information\-flow stressors \(omission, contradiction, evidence delay\) for factorized degradation analysis; \(3\) a dynamic process audit protocol with five reasoning nodes that produces model\-specific failure fingerprints; \(4\) hallucination propagation monitoring across initiation, propagation, anchoring, and contradiction interaction—capturing silent hallucination\. Experiments on frontier models show that strong overall task performance does not guarantee process stability: stressors mainly disrupt contradiction detection, diagnosis updating, hallucination propagation, and contradiction\-based self\-correction, while final evidence grounding can remain superficially stable\. MedBench v5 provides a unified infrastructure for capability profiling, controllable stress testing, process auditing, and hallucination trajectory analysis in clinical AI evaluation\.
*Keywords:* Multimodal Model Evaluation, Information\-Flow Stressor, Process Audit Protocol, Hallucination Propagation Monitoring, Clinical AI Benchmark
## 1 Introduction
Large language models \(LLMs\) and multimodal foundation models have shown growing potential in medical applications, including online pre\-consultation, clinical documentation, intelligent follow\-up, patient education, medical question answering, and clinical decision support \(Singhal et al\., [2023](https://arxiv.org/html/2606.24155#bib.bib51); Jung, [2025](https://arxiv.org/html/2606.24155#bib.bib53); Liu et al\., [2024](https://arxiv.org/html/2606.24155#bib.bib54); song2025large; Aydin et al\., [2024](https://arxiv.org/html/2606.24155#bib.bib55); Acosta et al\., [2022](https://arxiv.org/html/2606.24155#bib.bib50); Moor et al\., [2023](https://arxiv.org/html/2606.24155#bib.bib52)\)\. However, real\-world medical practice is not a static question\-answering task\. It is inherently dynamic and iterative: physicians must reason under uncertainty, actively elicit missing history, reconcile conflicting evidence, update diagnostic hypotheses, and make sequential decisions as new information becomes available \(Sooknanan and Seemungal, [2019](https://arxiv.org/html/2606.24155#bib.bib3); Ball et al\., [2015](https://arxiv.org/html/2606.24155#bib.bib4); Meyer et al\., [2021](https://arxiv.org/html/2606.24155#bib.bib5); Weinstein et al\., [2017](https://arxiv.org/html/2606.24155#bib.bib56); Thampy et al\., [2019](https://arxiv.org/html/2606.24155#bib.bib57); McCoy et al\., [2025](https://arxiv.org/html/2606.24155#bib.bib58)\)\.
In contrast, most medical LLM benchmarks have predominantly adopted a static, single\-turn question\-answering \(QA\) paradigm, where the model receives a complete case description or exam\-style prompt and produces a single final answer\. Representative examples include MedQA \(Zhang and Chung, [2021](https://arxiv.org/html/2606.24155#bib.bib59)\), CMExam \(Liu et al\., [2023](https://arxiv.org/html/2606.24155#bib.bib15)\), MedMCQA \(Pal et al\., [2022](https://arxiv.org/html/2606.24155#bib.bib60)\), and CBLUE \(Zhang et al\., [2022](https://arxiv.org/html/2606.24155#bib.bib16)\)\. Although these benchmarks have played an important role in measuring medical knowledge, language understanding, and exam\-style reasoning, they provide limited evidence of whether a model can operate safely in realistic clinical workflows \(Kim and Yoon, [2025](https://arxiv.org/html/2606.24155#bib.bib62); Sun et al\., [2025](https://arxiv.org/html/2606.24155#bib.bib63); Bielick et al\., [2026](https://arxiv.org/html/2606.24155#bib.bib64)\)\. Recent studies have increasingly highlighted this mismatch between static benchmark performance and clinical readiness \(Jiang et al\., [2026](https://arxiv.org/html/2606.24155#bib.bib2); Chen et al\., [2025](https://arxiv.org/html/2606.24155#bib.bib65); Wu et al\., [2026](https://arxiv.org/html/2606.24155#bib.bib66)\)\. For example, systematic evidence suggests a persistent knowledge–practice gap: models that achieve strong performance on knowledge\-based medical exams may perform substantially worse on practice\-oriented or safety\-critical tasks \(Gong et al\., [2025](https://arxiv.org/html/2606.24155#bib.bib1)\)\. Similarly, when evaluation shifts from static cases to multi\-turn or agent\-based clinical interaction, diagnostic performance can degrade markedly, revealing failures that are hidden by single\-turn QA evaluation \(Sangwon et al\., [2025](https://arxiv.org/html/2606.24155#bib.bib7); Schmidgall et al\., [2026](https://arxiv.org/html/2606.24155#bib.bib6)\)\.
Recognizing these limitations, recent benchmarks have begun to move toward practice\-oriented and interactive evaluation \(Liu et al\., [2025](https://arxiv.org/html/2606.24155#bib.bib69)\)\. One line of work introduces multi\-turn diagnostic dialogue tasks in which models must actively ask questions, gather missing information, and decide when sufficient evidence has been collected\. Benchmarks such as MediQ \(Li et al\., [2024](https://arxiv.org/html/2606.24155#bib.bib35)\), Q4Dx \(Werthaim et al\., [2026](https://arxiv.org/html/2606.24155#bib.bib17)\), and VivaBench \(Chiu et al\., [2026](https://arxiv.org/html/2606.24155#bib.bib19)\) evaluate whether models can conduct sequential information seeking rather than merely answer a fully specified prompt\. Another line of work studies robustness under incomplete, hidden, or adversarial patient information\. For instance, MedConceal \(Han et al\., [2026](https://arxiv.org/html/2606.24155#bib.bib68)\) evaluates hidden\-concern reasoning in medical dialogue, while MedDialBench \(Luo et al\., [2026](https://arxiv.org/html/2606.24155#bib.bib67)\) examines diagnostic robustness under parametrically controlled non\-cooperative patient behaviors\. In parallel, agentic simulation environments \(Liu et al\., [2026](https://arxiv.org/html/2606.24155#bib.bib70)\) such as MedAgentBench \(Jiang et al\., [2025](https://arxiv.org/html/2606.24155#bib.bib22)\), AgentClinic \(Schmidgall et al\., [2026](https://arxiv.org/html/2606.24155#bib.bib6)\), ClinEnv \(Lu et al\., [2026b](https://arxiv.org/html/2606.24155#bib.bib20)\), and MeDxAgent \(Sanghvi et al\., [2026](https://arxiv.org/html/2606.24155#bib.bib18)\) embed models in more realistic clinical workflows \(Yan et al\., [2026](https://arxiv.org/html/2606.24155#bib.bib71)\), requiring them to retrieve information, interact with simulated patients or electronic health records, consult tools or specialist agents, and perform sequential clinical actions\.
Safety evaluation has also become increasingly important as medical LLMs move closer to deployment \(Asgari et al\., [2025](https://arxiv.org/html/2606.24155#bib.bib72); Roustan and Bastardot, [2025](https://arxiv.org/html/2606.24155#bib.bib73)\)\. In particular, hallucination is a critical risk in clinical settings, because unsupported or fabricated claims may appear plausible while leading to unsafe diagnostic or therapeutic decisions \(Zhu et al\., [2025](https://arxiv.org/html/2606.24155#bib.bib26); Kim et al\., [2025](https://arxiv.org/html/2606.24155#bib.bib74)\)\. Existing hallucination\-oriented benchmarks, including Med\-HALT \(Pal et al\., [2023](https://arxiv.org/html/2606.24155#bib.bib27)\), MedHallu \(Pandit et al\., [2025](https://arxiv.org/html/2606.24155#bib.bib28)\), and multimodal benchmarks such as Med\-HallMark \(Chen et al\., [2024](https://arxiv.org/html/2606.24155#bib.bib32)\) and MedVH \(Gu et al\., [2026](https://arxiv.org/html/2606.24155#bib.bib33)\), evaluate whether models can detect or avoid factual inaccuracies in medical responses\. These efforts provide valuable tools for measuring factual reliability, especially at the response or final\-answer level\.
Despite this progress, existing practice\-oriented benchmarks remain largely observational rather than diagnostic \(Chen et al\., [2025](https://arxiv.org/html/2606.24155#bib.bib65)\)\. They can reveal that model performance degrades in interactive or safety\-critical settings, but they often cannot explain where and why the degradation occurs \(Sun et al\., [2025](https://arxiv.org/html/2606.24155#bib.bib63); Zhou et al\., [2025](https://arxiv.org/html/2606.24155#bib.bib76); Wang et al\., [2025](https://arxiv.org/html/2606.24155#bib.bib77)\)\. We identify four key limitations\. First, many benchmarks still rely on end\-to\-end outcome scores, making it difficult to localize failures to specific reasoning stages such as recognizing missing information, asking appropriate follow\-up questions, detecting contradictions, updating diagnoses, or grounding conclusions in evidence \(Qiu et al\., [2025](https://arxiv.org/html/2606.24155#bib.bib75)\)\. Second, although recent interactive benchmarks introduce incomplete or adversarial information, few provide a controllable information\-flow design that systematically toggles omission, contradiction, and delay to distinguish general task difficulty from specific cognitive failure modes \(Li et al\., [2026](https://arxiv.org/html/2606.24155#bib.bib78); Yan et al\., [2025](https://arxiv.org/html/2606.24155#bib.bib79)\)\. Third, existing evaluations often under\-specify the atomic operational skills required by clinical AI systems, such as structured data manipulation \(Shi et al\., [2024](https://arxiv.org/html/2606.24155#bib.bib80)\), retrieval\-augmented reasoning \(Xiong et al\., [2024](https://arxiv.org/html/2606.24155#bib.bib81)\), long\-horizon research synthesis \(Huang et al\., [2025](https://arxiv.org/html/2606.24155#bib.bib82)\), and adversarial safety defense \(Zhang et al\., [2025](https://arxiv.org/html/2606.24155#bib.bib83)\) in executable or sandboxed environments\. Fourth, hallucination evaluation is usually treated as a standalone factuality task, decoupled from the main clinical reasoning trajectory \(Asgari et al\., [2025](https://arxiv.org/html/2606.24155#bib.bib72)\)\. As a result, current benchmarks rarely trace how unsupported facts emerge, persist across turns, interact with contradictions, and eventually contaminate final diagnostic conclusions \(Lu et al\., [2026a](https://arxiv.org/html/2606.24155#bib.bib84); Yang et al\., [2026](https://arxiv.org/html/2606.24155#bib.bib85)\)\.
To address these gaps, we introduce MedBench v5, a holistic benchmark for clinical multimodal model evaluation that moves beyond static QA toward dynamic, process\-oriented, and hallucination\-aware assessment\. MedBench v5 combines a dual\-dimensional capability framework with a stress\-audit\-tracing protocol, enabling both broad capability coverage and fine\-grained failure attribution\. Specifically, MedBench v5 introduces four key components:
- Dual\-Dimensional Evaluation Framework\. We organize clinical model evaluation along two complementary dimensions: Clinical Cognitive Responsiveness \(CCR\) and Medical Atomic Skills \(MAS\)\. CCR covers 14 clinical capability dimensions spanning medical QA, natural language understanding and generation, clinical reasoning, multimodal perception, decision support, interaction, memory, tool use, safety, and multi\-agent collaboration\. MAS further instantiates four executable agent\-based environments \(DataAgent, RAGAgent, DeepResearch, and SafetyAgent\) to evaluate structured data interaction, retrieval\-augmented generation, long\-horizon evidence synthesis, and adversarial safety defense\. Together, these dimensions define 18 capability areas across 63 clinical tasks\.
- Switchable Information\-Flow Stressors\. We design three independently togglable stressors \(information omission, contradiction injection, and evidence delay\) to systematically perturb clinical information flow\. By comparing no\-stress, single\-stressor, and multi\-stressor conditions, this design enables controlled attribution of performance degradation to specific information\-flow disruptions rather than treating interactive difficulty as an undifferentiated source of error\.
- Dynamic Five\-Node Process Audit\. We establish a five\-node audit protocol that evaluates model behavior across information gap detection, follow\-up strategy, contradiction detection, diagnosis update, and evidence grounding\. Instead of scoring only the final answer, the audit records process\-level behavioral traces and generates a reasoning failure profile for each model, revealing where the clinical reasoning chain breaks under different stress conditions\.
- Hallucination Propagation Monitoring\. Complementing the five\-node audit, we monitor hallucination propagation throughout the multi\-turn trajectory\. This module tracks four progressive dimensions \(initiation, propagation, anchoring, and hallucination–contradiction interaction\) to quantify when unsupported claims first appear, whether they persist or cross\-contaminate later reasoning, whether they become anchored in the final diagnostic evidence chain, and whether explicit contradictions suppress or induce further fabrication\.
By integrating broad capability evaluation, executable atomic skill testing, controllable information\-flow stressors, process\-level auditing, and trajectory\-level hallucination monitoring, MedBench v5 provides a clinically grounded and diagnostically transparent benchmark for medical LLMs and multimodal clinical AI systems\. Rather than merely asking whether a model produces the correct final answer, MedBench v5 evaluates how the answer is reached, where the reasoning process fails, and how unsupported information propagates under realistic clinical uncertainty\.
Figure 1: Overview of the MedBench v5 evaluation framework\. MedBench v5 is organized into two complementary dimensions\. The left panel summarizes Clinical Cognitive Responsiveness \(CCR\), which covers 14 clinical capability dimensions spanning language\-based reasoning, multimodal perception and decision support, agentic interaction, memory, tool use, safety, and multi\-agent collaboration\. The right panel presents Medical Atomic Skills \(MAS\), which instantiate four executable agent\-based evaluation environments: DataAgent for clinical data querying, RAGAgent for retrieval\-augmented medical question answering, DeepResearch for long\-horizon evidence synthesis, and SafetyAgent for adversarial safety evaluation\. Together, CCR and MAS define 18 capability areas across 63 clinical tasks and provide a holistic benchmark for evaluating clinical multimodal models\.
## 2 Methodology
To move beyond traditional static evaluation, we propose a unified process\-diagnostic protocol for clinical multimodal model assessment\. The methodology consists of two layers\. First, we introduce a dual\-dimensional evaluation framework that defines the capability space of MedBench v5 through Clinical Cognitive Responsiveness \(CCR\) and Medical Atomic Skills \(MAS\)\. CCR captures broad clinical reasoning and interaction capabilities across language, vision\-language, and agentic settings, while MAS isolates four core operational skills required for clinical AI systems to interact with data, knowledge bases, long\-horizon tasks, and safety constraints\. Second, we introduce a coupled stress\-audit\-tracing mechanism that challenges model reasoning with controllable information\-flow stressors, audits behavioral deviations across five reasoning nodes, and traces the propagation of unsupported facts throughout multi\-turn trajectories\. Together, these two layers define both the evaluation space and the diagnostic protocol for identifying not only whether a model fails, but also where and how the failure emerges\.
### 2\.1 The Dual\-Dimensional Evaluation Framework
MedBench v5 organizes clinical model evaluation along two complementary dimensions\. The first dimension, Clinical Cognitive Responsiveness, measures high\-level clinical capabilities across diverse modalities and interaction paradigms\. The second dimension, Medical Atomic Skills, evaluates executable skill modules that are difficult to assess through static question answering alone\. As shown in Figure [1](https://arxiv.org/html/2606.24155#S1.F1), the complete framework contains 14 CCR dimensions and 4 MAS skill\-oriented testbeds\.
#### Clinical Cognitive Responsiveness
Clinical Cognitive Responsiveness \(CCR\) characterizes a medical AI system’s ability to understand clinical inputs, generate clinically appropriate outputs, reason over multimodal evidence, interact with users, and adapt to evolving clinical contexts\. Within MedBench v5, CCR is operationalized through three complementary evaluation tracks that collectively span 14 capability dimensions and 52 datasets\.
- The LLM track focuses on text\-based clinical capabilities\. It covers five dimensions: medical knowledge question answering, medical natural language understanding, medical natural language generation, clinical reasoning and decision\-making, and medical safety and ethics\. Representative tasks include exam\-style medical reasoning, specialty consultation, medication guidance, clinical entity extraction, prescription review, clinical record generation, patient\-friendly explanation, differential diagnosis, treatment planning, outcome prediction, risk assessment, regulatory compliance, and ethical decision\-making\.
- The multimodal track evaluates vision\-language perception, cross\-modal reasoning, and multimodal clinical decision support\. It covers medical vision\-language perception and document OCR, multimodal semantic reasoning, and multimodal clinical decision support\. Representative tasks include lesion detection, medical image classification, OCR from report images, visual question answering, report generation, image quality control, longitudinal image understanding, 3D multi\-timepoint reasoning, multimodal differential diagnosis, treatment planning, disease course analysis, and telemedicine dialogue generation\.
- The agentic interaction track evaluates interactive, tool\-augmented, and context\-aware clinical reasoning\. It covers clinical goal decomposition and reflection, medical tool use and tool learning, clinical context awareness and interaction, long\-term memory and context management, medical multi\-agent collaboration, and medical safety, ethics, and compliance\. Representative tasks include clinical pathway planning, goal decomposition, error reflection, information retrieval API calling, clinical operation API calling, role\-adaptive dialogue, long\-term conversational tracking, long\-document question answering, multi\-system coordination across diagnostic and therapeutic scenarios, and compliance\-aware interaction\.
The complete task taxonomy, dataset descriptions, and evaluation metrics for each CCR dimension are provided in the Appendix \(table label tab:medbench\_v5\_ccr\)\.
#### Medical Atomic Skills
Medical Atomic Skills \(MAS\) complement CCR by isolating four executable skill\-oriented testbeds that are central to real\-world clinical AI systems but are not fully captured by static QA benchmarks\. While CCR defines the breadth of clinical cognitive capabilities, MAS focuses on how models execute core operational procedures, including structured data interaction, evidence retrieval, long\-horizon research synthesis, and adversarial safety defense\. The four MAS modules are DataAgent, RAGAgent, DeepResearch, and SafetyAgent, as illustrated in Figure [1](https://arxiv.org/html/2606.24155#S1.F1)\.
- DataAgent evaluates clinical data interaction over structured and semi\-structured sources, including MySQL, PostgreSQL, CSV files, and unstructured clinical text\. Given a user request, the agent performs multi\-turn natural\-language\-to\-SQL interaction, executes database queries, conducts trend or anomaly analysis, and generates interpretable responses\. Each task is evaluated over 3–5 interaction rounds\. Metrics include accuracy \(Acc\) and F1 score\.
- RAGAgent evaluates retrieval\-augmented clinical question answering over a constructed medical knowledge base\. The agent performs query rewriting, vector\-based retrieval, evidence re\-ranking, conflict\-aware evidence integration, and answer generation with source attribution\. This module tests whether the model can retrieve relevant evidence, reconcile conflicting information, and generate grounded responses\. Metrics include mean average precision \(MAP\) and Agent\-as\-a\-Judge evaluation\.
- DeepResearch evaluates long\-horizon medical research planning and synthesis across multi\-source literature, including PubMed, web resources, and other academic repositories\. Given a user topic, the agent parses the research question, gathers information, performs logical and causal reasoning, constructs evidence chains, and generates a structured research report\. Metrics include semantic recall \(SemRec\) and Agent\-as\-a\-Judge evaluation\.
- SafetyAgent evaluates adversarial robustness and compliance under red\-team interactions\. Based on the OpenRT framework \(Wang et al\., [2026](https://arxiv.org/html/2606.24155#bib.bib48)\), the agent uses an attack sample library, autonomously selects attack strategies, generates attack prompts, and executes them in a sandboxed environment against the target model\. The evaluation covers harmful medical misinformation, dangerous tool commands, malicious instructions, privilege escalation, privacy leakage, and medical ethics violations\. Metrics include interception ratio and Agent\-as\-a\-Judge evaluation\.
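The RAGAgent bullet names mean average precision (MAP) as a metric but this section does not give its formula. As a rough illustration only, the sketch below uses the standard information-retrieval definition with binary relevance, normalizing by the number of relevant items per query. The function names and the choice to score queries without relevant items as 0 are assumptions, not details taken from the benchmark.

```python
from typing import Hashable, Iterable, Sequence, Set, Tuple


def average_precision(ranked: Sequence[Hashable], relevant: Set[Hashable]) -> float:
    """Average precision of one ranked list against a set of relevant items."""
    if not relevant:
        return 0.0  # assumption: a query with no relevant items scores 0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this cut-off
    return precision_sum / len(relevant)


def mean_average_precision(
    runs: Iterable[Tuple[Sequence[Hashable], Set[Hashable]]]
) -> float:
    """Mean of per-query average precision over (ranked, relevant) pairs."""
    scores = [average_precision(ranked, relevant) for ranked, relevant in runs]
    return sum(scores) / len(scores) if scores else 0.0
```

A ranking that places relevant evidence earlier receives a higher score, which is why MAP suits the re-ranking step described for RAGAgent.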
Together, CCR and MAS provide complementary views of clinical model capability\. CCR measures broad responsiveness across clinical reasoning, communication, multimodal understanding, and interaction, whereas MAS evaluates whether the model can reliably execute core operational skills required in data\-driven, retrieval\-augmented, research\-oriented, and safety\-critical clinical settings\.
### 2\.2 The Dynamic Process Audit Framework
Traditional evaluation is largely outcome\-oriented: it checks only whether the final answer is correct, much like a judge who reads nothing but the verdict\. However, in clinical reasoning tasks, an erroneous conclusion may arise from different process\-level failures, such as overlooking missing information, accepting contradictory evidence, failing to revise a diagnosis when new evidence appears, or grounding the final answer in unsupported facts\. To enable root\-cause analysis beyond final accuracy, we design a dynamic process audit framework that actively embeds diagnostic probes into the information flow, observes the model’s multi\-turn reasoning trajectory, records behavioral traces, and quantifies deviations from expected clinical reasoning behaviors\.
Figure 2: Switchable Information\-Flow Stressors\. This figure illustrates the four basic information\-flow conditions used to construct dynamic clinical reasoning scenarios: baseline, information omission \($O$\), contradiction injection \($I$\), and evidence delay \($D$\)\. Using a case vignette of a 62\-year\-old female patient with epigastric symptoms, the figure shows how the same original instance can be transformed by withholding key evidence, injecting inconsistent clinical facts, or delaying the release of laboratory and imaging findings across turns\. The baseline condition provides all necessary information in order, whereas the stressed conditions probe whether the model can detect missing evidence, identify contradictions, maintain diagnostic uncertainty, and update its reasoning when new evidence becomes available\. Beyond the single\-stressor conditions shown here, the full protocol further includes pairwise combinations \($O{+}I$, $O{+}D$, $I{+}D$\) and a triple combination \($O{+}I{+}D$\), enabling controlled analysis of both isolated and compounded information\-flow disruptions\.
As illustrated in Figure [2](https://arxiv.org/html/2606.24155#S2.F2), we instantiate this framework through three controlled information\-flow stressors\. The first stressor, Information Omission \($O$\), removes key objective evidence from the initial case description, such as laboratory or imaging findings\. This setting is designed to test whether the model can identify missing but clinically necessary information, avoid silent hallucination, and formulate appropriate follow\-up requests before committing to a diagnosis\. The second stressor, Contradiction Injection \($I$\), introduces a deliberately inconsistent clinical statement into an otherwise coherent case\. For example, a case may simultaneously contain evidence suggesting negative hepatitis markers and an injected statement indicating positive hepatitis B surface antigen\. This setting examines whether the model can detect internal inconsistency, re\-weight conflicting evidence, and recommend confirmatory testing rather than uncritically accepting the injected contradiction\. The third stressor, Evidence Delay \($D$\), changes the temporal order of information release by presenting only partial clinical information at the beginning and providing laboratory or imaging evidence in later turns\. This setting evaluates whether the model can maintain diagnostic uncertainty, proactively identify needed investigations, and update its reasoning when delayed evidence becomes available\.
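To make the three stressors concrete, the following is a minimal sketch of how a case could be perturbed, assuming a case is stored as a list of keyed statements with a release turn. The `Evidence` class, the `apply_stressors` and `to_turns` helpers, the evidence keys, and the example statements are all hypothetical; the paper does not describe its data format or tooling here.

```python
from dataclasses import dataclass, replace
from typing import Dict, FrozenSet, List, Optional, Sequence


@dataclass(frozen=True)
class Evidence:
    key: str       # hypothetical identifier, e.g. "hep_markers"
    text: str      # statement shown to the model
    turn: int = 1  # turn at which the statement is released


def apply_stressors(
    case: Sequence[Evidence],
    omit: FrozenSet[str] = frozenset(),
    delay: FrozenSet[str] = frozenset(),
    contradiction: Optional[str] = None,
    late_turn: int = 2,
) -> List[Evidence]:
    """Return a perturbed copy of `case`; the original is left untouched."""
    # Omission (O): withhold the selected evidence entirely.
    items = [e for e in case if e.key not in omit]
    # Delay (D): move the selected evidence to a later turn.
    items = [replace(e, turn=late_turn) if e.key in delay else e for e in items]
    # Injection (I): add one inconsistent statement to the first turn.
    if contradiction is not None:
        items.append(Evidence("injected", contradiction, turn=1))
    return items


def to_turns(items: Sequence[Evidence]) -> Dict[int, List[str]]:
    """Group statements by release turn, keeping their relative order."""
    turns: Dict[int, List[str]] = {}
    for e in sorted(items, key=lambda e: e.turn):  # sorted() is stable
        turns.setdefault(e.turn, []).append(e.text)
    return turns


# Illustrative case loosely following the vignette and hepatitis example above.
case = [
    Evidence("history", "62-year-old woman with epigastric symptoms."),
    Evidence("hep_markers", "Hepatitis markers are negative."),
    Evidence("imaging", "Imaging report (placeholder text)."),
]
stressed = apply_stressors(
    case,
    omit=frozenset({"imaging"}),
    delay=frozenset({"hep_markers"}),
    contradiction="Hepatitis B surface antigen is positive.",
)
turns = to_turns(stressed)
```

Because each stressor is a separate argument, any of the single, pairwise, or triple conditions can be produced from the same original case, which is what makes the degradation analysis factorized.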
These three stressors are aligned with five process\-level audit nodes: Information Gap Detection, Follow\-up Strategy, Contradiction Detection, Diagnosis Update, and Evidence Grounding, as defined in Figure [3](https://arxiv.org/html/2606.24155#S2.F3) and Table [1](https://arxiv.org/html/2606.24155#S2.T1)\. Specifically, information omission primarily probes whether the model detects missing evidence and requests appropriate follow\-up information; contradiction injection probes whether the model recognizes and handles inconsistent evidence; and evidence delay probes whether the model can revise its diagnosis and ground the final conclusion in newly supplied evidence\. In this way, the framework evaluates not only the final answer, but also the model’s intermediate behavior under incomplete, inconsistent, and temporally delayed clinical information flows\.
Figure 3: Dynamic Five\-Node Audit Framework with Hallucination Propagation Monitoring\. The figure summarizes our process\-level audit protocol for dynamic clinical reasoning scenarios\. The upper panel shows the five sequential audit nodes: Information Gap Detection, Follow\-up Strategy, Contradiction Detection, Diagnosis Update, and Evidence Grounding\. Each node is associated with targeted metrics that quantify whether the model detects missing information, asks clinically useful follow\-up questions, recognizes inconsistencies, updates its diagnosis when new evidence appears, and grounds its final conclusion in the provided context\. The lower panel shows the complementary hallucination propagation monitoring module, which tracks how unsupported claims are initiated, propagated across turns, anchored in diagnostic reasoning, and affected by injected contradictions\. Together, these two components enable fine\-grained localization of process\-level failures beyond final\-answer correctness\.
Table 1: Five\-node process audit protocol for evaluating dynamic clinical reasoning behaviors\. The table defines each audit node, its reasoning phase, and the core query used for judge\-based process evaluation\.
#### Dynamic audit dataset construction\.
Building on the above framework, we construct a dynamic process\-audit dataset from the three evaluation tracks of MedBench v5: the LLM track, the multimodal track, and the agent track\. Because process auditing requires traceable interaction trajectories rather than single\-shot responses, we prioritize open\-ended question\-answering and generation tasks that can be naturally reconstructed into multi\-turn clinical interactions\. For each selected dataset, we randomly sample five instances\. In total, the LLM track contributes 10 datasets with 50 instances, the multimodal track contributes 3 datasets with 15 instances, and the agent track contributes 5 datasets with 25 instances, resulting in 18 selected datasets and 90 original evaluation instances\.
Each original instance is then converted into a family of controlled information\-flow scenarios according to the stressor design described above\. Specifically, we generate eight variants for each instance: a baseline condition without any stressor, three single\-stressor conditions, namely information omission \($O$\), contradiction injection \($I$\), and evidence delay \($D$\), three pairwise combinations \($O{+}I$, $O{+}D$, and $I{+}D$\), and one triple\-stressor condition \($O{+}I{+}D$\)\. The baseline condition preserves the original information flow, whereas the stressed conditions modify the completeness, consistency, or temporal ordering of clinical evidence\. The combined conditions are included to examine interaction effects among stressors, such as whether missing information masks injected contradictions, whether early unsupported assumptions persist after delayed evidence is supplied, or whether contradictions bias subsequent diagnostic updates\.
Consequently, the 90 original evaluation instances are expanded into 720 dynamic stress\-testing scenarios, enabling controlled comparisons across no\-stress, single\-stressor, and multi\-stressor settings\. For each scenario, we record the full multi\-turn interaction trajectory, the stressor configuration, the timing of evidence release, node\-level audit behaviors, and hallucination\-related events\. These records support subsequent process\-level analysis of where reasoning deviations occur, how they evolve across turns, and whether unsupported information is propagated into the final clinical conclusion\.
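The scenario count follows from enumerating every subset of the three stressors for each sampled instance. A short sketch of that arithmetic, using the per-track instance counts stated above (the helper names `stress_conditions` and `label` are illustrative):

```python
from itertools import combinations

STRESSORS = ("O", "I", "D")  # omission, contradiction injection, evidence delay


def stress_conditions(stressors=STRESSORS):
    """All subsets of the stressors, from the empty baseline to the full set."""
    return [
        frozenset(combo)
        for size in range(len(stressors) + 1)
        for combo in combinations(stressors, size)
    ]


def label(condition):
    """Render a condition as 'baseline' or e.g. 'O+I', in O, I, D order."""
    return "+".join(s for s in STRESSORS if s in condition) or "baseline"


# Instance counts per track as reported in the text.
instances_per_track = {"LLM": 50, "multimodal": 15, "agent": 25}
n_instances = sum(instances_per_track.values())  # 90 original instances
conditions = stress_conditions()                 # 2**3 = 8 conditions
n_scenarios = n_instances * len(conditions)      # 90 * 8 = 720 scenarios
```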
#### Linking stressors to five audit nodes\.
We formalize each dynamic scenario as a target-model execution followed by a judge-based process audit. Let $x_i$ denote the $i$-th original evaluation instance and let $s\in\mathcal{S}$ denote one of the eight stress conditions, where
$$\mathcal{S}=\{\emptyset,O,I,D,O{+}I,O{+}D,I{+}D,O{+}I{+}D\}.$$
After applying stress condition $s$, the original instance is converted into a multi-turn user input sequence
$$U_i^s=(u_{i,1}^s,\ldots,u_{i,T_i^s}^s),$$
where different turns may contain the initial case description, follow-up prompts, delayed evidence, or injected contradictory information. The target model $M$ interacts only with this perturbed sequence and produces a response at each turn, yielding the complete observable trajectory
$$\tau_i^s=\{(u_{i,t}^s,y_{i,t}^s)\}_{t=1}^{T_i^s},$$
where $y_{i,t}^s$ is the model response at turn $t$.
A separate judge model $J$ then evaluates the trajectory using the scenario metadata $m_i^s$, the gold-standard reference $g_i^s$, and the predefined audit protocol $p_i^s$:
$$a_i^s=J(\tau_i^s,m_i^s,g_i^s,p_i^s).$$
Here, $m_i^s$ records the stressor configuration, withheld evidence, injected contradictions, delayed information, and their release turns, while $p_i^s$ specifies the expected node-level audit targets. The judge output $a_i^s$ consists of structured annotations for five audit nodes. Unless otherwise specified, each metric is computed only over applicable scenarios, and cases with empty denominators are treated as not applicable and excluded from the corresponding aggregate.
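The applicability rule can be sketched as follows. This is a minimal illustration that assumes the aggregate is an unweighted mean of per-scenario ratios; the source does not say whether aggregation is macro-averaged or pooled, and the helper names are hypothetical.

```python
from typing import Iterable, Optional


def safe_ratio(numerator: int, denominator: int) -> Optional[float]:
    """Return numerator / denominator, or None when the metric is not applicable."""
    if denominator == 0:
        return None
    return numerator / denominator


def aggregate(values: Iterable[Optional[float]]) -> Optional[float]:
    """Average only the applicable values; None if no scenario is applicable."""
    applicable = [v for v in values if v is not None]
    if not applicable:
        return None
    return sum(applicable) / len(applicable)


# Three scenarios; the second has no applicable items (empty denominator).
per_scenario = [safe_ratio(1, 2), safe_ratio(0, 0), safe_ratio(3, 4)]
overall = aggregate(per_scenario)
print(per_scenario, overall)
```

Returning `None` instead of `0.0` keeps a non-applicable scenario from being counted as a failure.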
For Information Gap Detection, we assess whether the model recognizes clinically necessary missing information before committing to a conclusion. Let $K_i^s$ be the set of predefined key information gaps and $\hat{K}_i^s$ be the subset detected by the model and verified by the judge. The gap detection ratio is
$$\mathrm{GDR}_i^s=\frac{|\hat{K}_i^s\cap K_i^s|}{|K_i^s|}.$$
We also measure whether the model actively requests additional information. Let $q_i^s=1$ if the model asks at least one clinically relevant follow-up question before reaching a conclusion, and $q_i^s=0$ otherwise. Over scenarios requiring follow-up, denoted by $\mathcal{R}$, the inquiry ratio is
$$\mathrm{IR}=\frac{1}{|\mathcal{R}|}\sum_{(i,s)\in\mathcal{R}}q_i^s.$$
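As a concrete reading of the two formulas, the sketch below computes GDR for one scenario and IR over a small set of scenarios. The gap labels and flags are invented for illustration, and in the benchmark the detected set comes from the judge, not from string matching.

```python
def gap_detection_ratio(predefined_gaps, detected_gaps):
    """GDR: share of predefined key gaps that the model detected."""
    if not predefined_gaps:
        return None  # not applicable: no predefined gaps in this scenario
    return len(detected_gaps & predefined_gaps) / len(predefined_gaps)


def inquiry_ratio(asked_flags):
    """IR: share of follow-up-requiring scenarios where the model asked at least once."""
    if not asked_flags:
        return None
    return sum(asked_flags) / len(asked_flags)


# Hypothetical scenario: three key gaps, two of them detected, plus one unrelated item.
predefined = {"allergy history", "serum creatinine", "symptom onset time"}
detected = {"serum creatinine", "symptom onset time", "travel history"}
gdr = gap_detection_ratio(predefined, detected)

# Hypothetical q flags for four scenarios in R.
ir = inquiry_ratio([1, 0, 1, 1])
print(gdr, ir)
```

The intersection in the numerator means that items the model raises outside the predefined set neither help nor hurt GDR.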
For Follow-up Strategy, we evaluate the relevance and efficiency of the model’s follow-up questions. Let $Q_i^s$ be the set of follow-up questions asked by the model, $Q_{i,\mathrm{rel}}^s$ the subset judged clinically relevant, and $Q_{i,\mathrm{red}}^s$ the subset judged redundant, irrelevant, repeated, or already answered by the context. We define
$$\mathrm{PIR}_i^s=\frac{|Q_{i,\mathrm{rel}}^s|}{|Q_i^s|},\qquad\mathrm{RIR}_i^s=\frac{|Q_{i,\mathrm{red}}^s|}{|Q_i^s|},$$
where $\mathrm{PIR}$ is the precision inquiry ratio and $\mathrm{RIR}$ is the redundant inquiry ratio.
For Contradiction Detection, we measure whether the model identifies inconsistent information introduced by the contradiction stressor. Let $C_i^s$ be the set of predefined contradictions and $\hat{C}_i^s$ be the subset explicitly detected by the model and verified by the judge. The contradiction detection ratio and contradiction ignorance ratio are
$$\mathrm{CDR}_i^s=\frac{|\hat{C}_i^s\cap C_i^s|}{|C_i^s|},\qquad\mathrm{CIR}_i^s=\frac{|C_i^s\setminus\hat{C}_i^s|}{|C_i^s|}.$$
These metrics quantify, respectively, whether contradictions are recognized and whether they are ignored during subsequent reasoning.
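One consequence of the two set definitions, worked out here as a check and not stated in the source: the detected and undetected contradictions partition $C_i^s$, so the two ratios are complementary whenever $|C_i^s|>0$.

$$
\mathrm{CDR}_i^s+\mathrm{CIR}_i^s=\frac{|\hat{C}_i^s\cap C_i^s|+|C_i^s\setminus\hat{C}_i^s|}{|C_i^s|}=\frac{|C_i^s|}{|C_i^s|}=1 .
$$

For example, if three contradictions are injected and the model explicitly flags two of them, then $\mathrm{CDR}_i^s=2/3$ and $\mathrm{CIR}_i^s=1/3$.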
For Diagnosis Update, we evaluate whether the model revises its diagnostic hypothesis when delayed, corrected, or conflicting evidence becomes available. Let $\mathcal{U}_i^s$ be the set of predefined update opportunities. For each update opportunity $t\in\mathcal{U}_i^s$, let $r_{i,t}^s=1$ if the model updates its diagnostic reasoning in a clinically justified direction, and $r_{i,t}^s=0$ otherwise. The rational update ratio is
$$\mathrm{RUR}_i^s=\frac{1}{|\mathcal{U}_i^s|}\sum_{t\in\mathcal{U}_i^s}r_{i,t}^s.$$
We further measure two update failures. Premature closure occurs when the model reaches a definitive conclusion before sufficient evidence is available. Let $c_i^s=1$ if premature closure is observed and $0$ otherwise. Over the applicable scenario set $\mathcal{P}$, the premature closure ratio is
$$\mathrm{PCR}=\frac{1}{|\mathcal{P}|}\sum_{(i,s)\in\mathcal{P}}c_i^s.$$
Stubborn maintenance occurs when the model fails to revise an earlier hypothesis after later evidence weakens or contradicts it. Let $\mathcal{B}_i^s\subseteq\mathcal{U}_i^s$ be the subset of update opportunities requiring substantial revision, and let $b_{i,t}^s=1$ if the model unjustifiably maintains the prior hypothesis. The stubborn maintenance ratio is
$$\mathrm{SMR}_i^s=\frac{1}{|\mathcal{B}_i^s|}\sum_{t\in\mathcal{B}_i^s}b_{i,t}^s.$$
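All three diagnosis-update metrics are means of binary judge flags taken over different index sets. The sketch below makes that explicit on an invented scenario; the turn numbers and flag values are hypothetical, and `mean_flag` is an illustrative helper only.

```python
def mean_flag(flags):
    """Mean of 0/1 indicators, or None when there is nothing to score."""
    flags = list(flags)
    if not flags:
        return None
    return sum(flags) / len(flags)


# Hypothetical scenario with update opportunities U at turns 2, 4, and 5.
r = {2: 1, 4: 0, 5: 1}   # r[t] = 1 if the update at turn t was clinically justified
substantial = {4, 5}     # B, the subset of U that requires substantial revision
b = {4: 1, 5: 0}         # b[t] = 1 if the prior hypothesis was unjustifiably kept

rur = mean_flag(r.values())                 # per-scenario, over U
smr = mean_flag(b[t] for t in substantial)  # per-scenario, over B

# PCR is pooled over scenarios: one closure flag c per applicable scenario in P.
c = [0, 1, 0, 0]
pcr = mean_flag(c)
print(rur, smr, pcr)
```

Note the different units: RUR and SMR are computed within one scenario across turns, whereas PCR is computed across scenarios.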
For Evidence Grounding, we evaluate whether the final clinical conclusion is supported by evidence actually provided in the scenario. Let $E_i^s$ be the set of evidence statements cited or relied upon in the final response, $E_{i,\mathrm{sup}}^s$ the subset supported by the scenario context, and $E_{i,\mathrm{fab}}^s$ the subset unsupported, contradicted, or fabricated. We define evidence faithfulness and fabricated citation ratio as
$$\mathrm{EF}_i^s=\frac{|E_{i,\mathrm{sup}}^s|}{|E_i^s|},\qquad\mathrm{FCR}_i^s=\frac{|E_{i,\mathrm{fab}}^s|}{|E_i^s|}.$$
While evidence grounding focuses on the support status of the final conclusion, trajectory-level hallucination behaviors are analyzed separately in the hallucination propagation module described below.
#### Hallucination propagation monitoring\.
Beyond the five\-node behavioral audit, we further monitor hallucination propagation across the recorded multi\-turn trajectory\. This module is analyzed separately from node\-level scoring because hallucinations may emerge at any stage of the interaction, persist across multiple turns, interact with contradictions, and eventually contaminate the final clinical conclusion\. The structured design of our dynamic scenarios enables such tracking, since each case specifies turn\-level information release rules, expected model behaviors, gold\-standard responses, and predefined error traps\.
For each scenario $(x_i,s)$ and turn $t$, let $R_{i,t}^s$ denote the set of clinical facts that have been released up to turn $t$, and let $Z_{i,t}^s$ denote the set of factual claims extracted from the model response at that turn. The judge identifies unsupported claims as
$$H_{i,t}^s=\{z\in Z_{i,t}^s\mid z\not\preceq R_{i,t}^s\},$$
where $z\not\preceq R_{i,t}^s$ indicates that the claim is not supported by the information available at that turn, is contradicted by the scenario context, or is inconsistent with the gold-standard trajectory.
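The turn-level definition can be illustrated with a toy trajectory. The sketch assumes claims and facts are plain strings and uses exact membership as a stand-in for the judge's support relation; it covers only the "not supported by released information" case, not contradiction or gold-trajectory checks. All names and values are hypothetical.

```python
def exact_match(claim, released):
    """Toy support relation: a claim counts as supported only if it was released verbatim."""
    return claim in released


def unsupported_claims(released_by_turn, claims_by_turn, is_supported=exact_match):
    """Return H_t for each turn: claims not supported by facts released up to that turn."""
    released = set()
    unsupported = []
    for newly_released, claims in zip(released_by_turn, claims_by_turn):
        released |= set(newly_released)  # R_t accumulates across turns
        unsupported.append({c for c in claims if not is_supported(c, released)})
    return unsupported


# Evidence delay: the CRP value is only released at turn 3.
released_by_turn = [{"fever", "cough"}, set(), {"crp=120"}]
claims_by_turn = [{"fever", "crp=95"}, {"cough", "crp=95"}, {"crp=120"}]

H = unsupported_claims(released_by_turn, claims_by_turn)
print(H)
```

In this toy case the fabricated value `crp=95` appears at turn 1 and is reused at turn 2, which is the kind of cross-turn persistence the propagation metrics are designed to capture.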
Based on these turn-level hallucination events, we organize hallucination monitoring into four dimensions. Initiation captures whether unsupported numerical values or clinical facts are generated when relevant information is absent or withheld. Propagation measures whether an unsupported claim introduced in an earlier turn persists or is reused as a premise in later reasoning. Anchoring examines whether hallucinated content becomes part of the final diagnostic evidence chain or causes omission of genuine contradictory evidence. Hallucination–contradiction interaction evaluates whether injected contradictions suppress unsupported assumptions through clarification and revision, or instead trigger new fabricated explanations. These four dimensions are quantified by eight hallucination propagation metrics: numerical fabrication ratio (NFR), unsubstantiated fact ratio (UFR), hallucination persistence ratio (HPR), hallucination cross-contamination ratio (HCCR), definitive hallucination dependency ratio (DHDR), critical hallucination omission (CHO), contradiction-induced hallucination suppression ratio (CIHSR), and contradiction-induced hallucination generation ratio (CIHGR). Detailed definitions and formulas are provided in Appendix [C](https://arxiv.org/html/2606.24155#A3).
## 3 Experiments and Results
### 3.1 Experimental Setup
#### Dual\-Dimensional Evaluation
We evaluate a broad set of general\-purpose and medical\-oriented models on two complementary components of MedBench v5: Clinical Cognitive Responsiveness \(CCR\) and Medical Atomic Skills \(MAS\)\. CCR includes three tracks: text\-based LLM evaluation, multimodal clinical evaluation, and agentic interaction evaluation\. MAS evaluates four executable agent environments: DataAgent, RAGAgent, DeepResearch, and SafetyAgent\. All task\-level scores are normalized to a 0–100 scale, with higher values indicating better performance\. For each CCR track, we report the macro\-average across tasks: 36 tasks for CCR\-LLM, 12 tasks for CCR\-multimodal, and 11 tasks for CCR\-agent\. For MAS, we report both environment\-level scores and the macro\-average across the four environments\. Because different model variants are evaluated in different tracks, we report track\-level results separately rather than merging them into a single overall leaderboard\.
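For reference, a macro-average weights every task (or environment) equally regardless of how many instances it contains. The sketch below shows this for the four MAS environments; the scores are hypothetical placeholders, not results from the paper.

```python
def macro_average(task_scores):
    """Unweighted mean of task-level scores on the 0-100 scale."""
    if not task_scores:
        raise ValueError("at least one task score is required")
    return sum(task_scores.values()) / len(task_scores)


# Hypothetical environment-level scores, for illustration only.
mas_scores = {
    "DataAgent": 80.0,
    "RAGAgent": 90.0,
    "DeepResearch": 70.0,
    "SafetyAgent": 60.0,
}
mas_avg = macro_average(mas_scores)
print(mas_avg)
```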
#### Dynamic Process Audit
For the dynamic process-audit experiments, we use the process-audit subset constructed from the three CCR tracks. The subset contains 18 datasets and 90 instances in total: 50 instances from the LLM track, 15 instances from the multimodal track, and 25 instances from the agent track. For each track, perturbation and five-node audit experiments are conducted on the top three models ranked by task-level SOTA rate, i.e., the fraction of datasets on which a model achieves the best score within that track. For the LLM track, the top three models are Claude Opus 4.7, Qwen3.7-Max-Preview, and Gemini-3.1-Pro-Preview; for the multimodal track, they are Claude Opus 4.7, Gemini-3.1-Pro-Preview, and GPT-5.5; for the agent track, they are Kimi-K2.6, Claude Opus 4.7, and Qwen3.7-Max-Preview.
### 3.2 Capability Profiling on CCR
Table [2](https://arxiv.org/html/2606.24155#S3.T2) summarizes the macro-average performance across the three CCR tracks. The results show distinct patterns across text-based, multimodal, and agentic clinical evaluation.
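The SOTA-rate ranking can be sketched as below. The model and dataset names are placeholders, and two choices are assumptions not specified in the source: ties on a dataset credit every top-scoring model, and ties in SOTA rate are broken alphabetically.

```python
def sota_rate(scores):
    """scores: {dataset: {model: score}} -> {model: fraction of datasets where it is best}."""
    models = {m for per_model in scores.values() for m in per_model}
    wins = {m: 0 for m in models}
    for per_model in scores.values():
        best = max(per_model.values())
        for model, score in per_model.items():
            if score == best:
                wins[model] += 1  # assumption: ties credit every top model
    return {m: wins[m] / len(scores) for m in models}


def top_k(rates, k=3):
    """Rank by descending SOTA rate; break ties by model name (assumption)."""
    return sorted(rates, key=lambda m: (-rates[m], m))[:k]


# Hypothetical scores for four placeholder models on four placeholder datasets.
scores = {
    "d1": {"A": 90, "B": 85, "C": 70, "D": 60},
    "d2": {"A": 80, "B": 88, "C": 75, "D": 60},
    "d3": {"A": 92, "B": 81, "C": 79, "D": 60},
    "d4": {"A": 70, "B": 72, "C": 95, "D": 60},
}
rates = sota_rate(scores)
selected = top_k(rates)
print(rates, selected)
```

Ranking by SOTA rate rewards being best on many datasets, so it can differ from a ranking by macro-average score.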
On the CCR\-LLM track, leading frontier models achieve closely clustered scores\. Claude Opus 4\.7 obtains the highest score, while Qwen3\.7\-Max\-Preview, Gemini\-3\.1\-Pro\-Preview, Kimi\-K2\.6, GLM\-5\.1, GPT\-5\.5, Grok\-4\.20 Beta, Doubao\-Seed\-2\.0\-pro, and DeepSeek\-V4\-Pro all fall within a relatively narrow range\. This suggests that text\-based clinical reasoning tasks provide limited separation among the strongest general\-purpose models\.
By contrast, the CCR\-multimodal track remains substantially more challenging\. The best\-performing models reach only around 50 points on average, and the performance gap across models is larger than in the LLM track\. Doubao\-Seed\-2\.0\-pro, Gemini\-3\.5\-Flash, GPT\-5\.5, and Qwen3\.7\-Plus obtain the strongest multimodal results, but all models still show clear room for improvement on tasks involving visual perception, OCR, longitudinal image understanding, 3D multi\-timepoint reasoning, and multimodal clinical decision support\.
The CCR\-agent track shows generally high scores among frontier models, with Claude Opus 4\.7, Kimi\-K2\.6, Qwen3\.7\-Max\-Preview, and GPT\-5\.5 ranking near the top\. However, these results should be interpreted as broad interaction\-level responsiveness rather than complete agentic reliability\. Whether models can execute specific operational skills is further examined by MAS\.
Table 2: Macro-average performance on the CCR tracks. Scores are normalized to a 0–100 scale. “–” indicates that the corresponding model variant was not evaluated on that track. Full task-level results are provided in Appendix [D.1](https://arxiv.org/html/2606.24155#A4.SS1).
### 3.3 Capability Profiling on MAS
Table [3](https://arxiv.org/html/2606.24155#S3.T3) reports performance on the four MAS environments. GPT-5.5 achieves the highest MAS average, mainly due to strong performance on RAGAgent and SafetyAgent. Qwen3.7-Max-Preview ranks second overall and obtains the best DeepResearch score, indicating strong long-horizon evidence synthesis. Kimi-K2.6 achieves the best DataAgent score and remains competitive on RAGAgent, but performs less strongly on DeepResearch. Doubao-Seed-2.0-pro performs well on RAGAgent and DeepResearch, but is limited by weaker SafetyAgent performance.
Across MAS environments, RAGAgent scores are generally high, suggesting that current frontier models can benefit from retrieval\-augmented evidence pipelines\. In contrast, DeepResearch and SafetyAgent show larger model\-specific differences, indicating that long\-horizon synthesis and adversarial safety defense remain important stress points\. These results show that agentic medical capability should not be inferred from broad interaction scores alone; structured data manipulation, retrieval grounding, research synthesis, and safety defense need to be evaluated as distinct atomic skills\.
The CCR and MAS results establish the broad capability profiles of evaluated models under standard evaluation conditions\. However, aggregate scores do not reveal whether models remain reliable when clinical information is incomplete, delayed, or contradictory\. We therefore next examine model robustness under controlled information\-flow stressors and audit their behavior across process\-level reasoning nodes\.
Table 3: Performance on the four Medical Atomic Skills environments. Scores are normalized to a 0–100 scale. MAS Avg. denotes the macro-average across DataAgent, RAGAgent, DeepResearch, and SafetyAgent.
### 3.4 Effects of Three Stressors on Five Node Metrics
Figure 4: Delta-to-baseline heatmaps of process-audit metrics under stress conditions. Each heatmap shows the performance change ($\Delta$) of eight stress conditions (rows) relative to the baseline (first row). For metrics where higher is better ($\uparrow$), $\Delta=\text{metric}_{\text{stress}}-\text{metric}_{\text{baseline}}$; for metrics where lower is better ($\downarrow$), $\Delta=\text{metric}_{\text{baseline}}-\text{metric}_{\text{stress}}$. Positive $\Delta$ (blue) indicates gain, negative $\Delta$ (red) indicates loss; baseline $\Delta$ is always zero.

Table 4: Process-audit evaluation results on the multimodal track under different information-flow stress conditions. Metrics are grouped by the five audit nodes. Upward arrows indicate higher-is-better metrics, while downward arrows indicate lower-is-better metrics.

We used the baseline condition as the reference and analyzed the effects of three information-flow stressors and their combinations on the five process-audit nodes. Table [4](https://arxiv.org/html/2606.24155#S3.T4) shows the results on the multimodal track (results for the LLM and agent tracks are given in Appendix [D.2](https://arxiv.org/html/2606.24155#A4.SS2)). The delta heatmaps in Figure [4](https://arxiv.org/html/2606.24155#S3.F4) show that information omission, contradiction injection, and evidence delay do not lead to a uniform degradation across all metrics. Instead, they produce node-specific changes. Overall, Contradiction Detection and Diagnosis Update are the most sensitive nodes, whereas Evidence Grounding remains relatively stable across most conditions.
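The sign convention in the Figure 4 caption can be written as a single function, so that a negative value always means a loss relative to baseline regardless of the metric's direction. The metric values below are hypothetical and serve only to show the convention.

```python
def delta_to_baseline(metric_stress, metric_baseline, higher_is_better):
    """Direction-normalized change: positive means gain, negative means loss."""
    if higher_is_better:
        return metric_stress - metric_baseline
    return metric_baseline - metric_stress


# Hypothetical values: CDR is higher-is-better, SMR is lower-is-better.
delta_cdr = delta_to_baseline(metric_stress=0.65, metric_baseline=0.80, higher_is_better=True)
delta_smr = delta_to_baseline(metric_stress=0.25, metric_baseline=0.10, higher_is_better=False)
print(round(delta_cdr, 2), round(delta_smr, 2))
```

Here a drop in CDR and a rise in SMR both map to a negative delta, which is why one color scale can be shared across all metrics in a heatmap.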
For single stressors, $O$ mainly affects the construction of the initial evidential basis. Under this condition, some models show improved information-gap recognition, but this improvement does not consistently propagate to downstream reasoning nodes. In contrast, contradiction detection and diagnosis update metrics are more likely to decline. The effect of $I$ is generally weaker, and in some cases it is associated with improved information-gap awareness or greater diagnostic openness. This suggests that an explicit contradiction does not necessarily cause global degradation and may sometimes induce a more cautious reasoning pattern. Compared with $I$, $D$ has a stronger effect on dynamic reasoning nodes, particularly those related to cross-turn evidence integration and diagnostic revision after new evidence becomes available.
Among the combined stress conditions, $O{+}D$ produces the most prominent impact. Under this setting, all three models show varying degrees of decline in Contradiction Detection and Diagnosis Update, indicating that the combination of missing information and delayed evidence substantially increases the difficulty of dynamic reasoning. By contrast, $O{+}I$, $I{+}D$, and $O{+}I{+}D$ do not exhibit a simple additive decline. In several cases, the presence of an explicit contradiction appears to increase model sensitivity to case uncertainty, partially offsetting some of the negative effects introduced by omission or delay.
### 3.5 Hallucination Propagation under Information-Flow Stressors
The hallucination monitoring results show that information-flow stressors mainly affect hallucination propagation and contradiction-related self-correction, rather than initial hallucination generation (detailed results are given in Figure [5](https://arxiv.org/html/2606.24155#A4.F5) and Appendix [D.3](https://arxiv.org/html/2606.24155#A4.SS3)). Across the three models, NFR and UFR change only slightly under most stress conditions, suggesting that omission, contradiction, and delay do not consistently increase the immediate generation of unsupported numerical or factual claims. In contrast, HPR and HCCR show more frequent degradation, indicating that once unsupported claims are introduced, models are more likely to preserve them across turns and reuse them in subsequent diagnostic reasoning.
Among the stressors, both information omission and evidence delay weaken hallucination control, especially through degraded HPR and HCCR. The effect of explicit contradiction is more mixed: it may sometimes make the model more cautious, but it does not reliably prevent hallucination propagation. Among combined conditions, $O{+}D$ is the most disruptive setting. All three models show clear degradation in HPR, HCCR, and CIHSR under this condition, indicating that when evidence is both missing and delayed, models have greater difficulty suppressing exposed hallucinations and revising subsequent reasoning.
The hallucination propagation results described above identify what goes wrong: unsupported claims mostly spread through cross-turn persistence and reasoning contamination. The five-node process audit provides complementary evidence on why this occurs. Across all models and stressors, the most pronounced process-level degradations appear in Contradiction Detection and Diagnosis Update (Figure [4](https://arxiv.org/html/2606.24155#S3.F4)). When a model fails to recognize an injected contradiction or to adjust its belief in light of delayed evidence, pre-existing unsupported claims are more likely to persist and contaminate subsequent reasoning. In other words, deficits in contradiction-based self-correction and belief revision constitute the primary mechanism through which initial hallucinations evolve into anchored diagnostic errors. By coupling trajectory-level hallucination tracking with node-level behavioral auditing, we provide a complete diagnostic loop that connects observed hallucination symptoms to their cognitive antecedents.
### 3.6 Reliability Analysis
To assess the reliability of the automatic evaluation, we conducted an algorithm–human agreement analysis on 440 samples randomly selected from the open-domain datasets of CCR. Each sample was independently rated by three human annotators on a 5-point Likert scale by comparing the model answer with the paired reference answer. The detailed Likert-5 rubric and scoring instructions are provided in Appendix Table [7](https://arxiv.org/html/2606.24155#A2.T7). The inter-rater reliability among the three annotators, measured by intraclass correlation (ICC(A,k)) (McGraw and Wong, [1996](https://arxiv.org/html/2606.24155#bib.bib89)), was 0.74 (95% CI: 0.66–0.79), indicating good consistency of human judgments.
For agreement analysis, the continuous algorithm scores ranging from 0 to 100 were discretized into five equal-width levels to match the human 5-point Likert ratings. Spearman’s rank correlation between the raw algorithm scores and human ratings was $\rho=0.26$ ($p<0.001$). Quadratic weighted Cohen’s $\kappa$ (Cohen, [1968](https://arxiv.org/html/2606.24155#bib.bib87)) was $0.32$ (95% bootstrap CI (Tibshirani and Efron, [1993](https://arxiv.org/html/2606.24155#bib.bib88)): $0.20$–$0.43$), indicating fair agreement between the automatic evaluation and human judgments (Landis and Koch, [1977](https://arxiv.org/html/2606.24155#bib.bib86)).
## 4 Related Works
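The discretization and agreement statistic can be sketched in plain Python. This is an illustrative reimplementation, not the authors' script: the bin-boundary handling (left-closed bins, with 100 mapped to level 5) is an assumption, and the ratings in the example are invented.

```python
def discretize(score, n_levels=5, lo=0.0, hi=100.0):
    """Map a score in [lo, hi] to a level 1..n_levels using equal-width bins."""
    width = (hi - lo) / n_levels
    level = int((score - lo) // width) + 1
    return min(max(level, 1), n_levels)  # clamp so that score == hi lands in the top level


def quadratic_weighted_kappa(a, b, n_levels=5):
    """Cohen's kappa with quadratic disagreement weights for ratings in 1..n_levels."""
    n = len(a)
    observed = [[0] * n_levels for _ in range(n_levels)]
    for x, y in zip(a, b):
        observed[x - 1][y - 1] += 1
    row = [sum(r) for r in observed]
    col = [sum(observed[i][j] for i in range(n_levels)) for j in range(n_levels)]
    num = den = 0.0
    for i in range(n_levels):
        for j in range(n_levels):
            w = (i - j) ** 2 / (n_levels - 1) ** 2
            num += w * observed[i][j]       # observed weighted disagreement
            den += w * row[i] * col[j] / n  # disagreement expected by chance
    if den == 0:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return 1.0 - num / den


levels = [discretize(s) for s in (0, 19.9, 20, 55, 100)]
kappa_same = quadratic_weighted_kappa([1, 2, 3, 4, 5], [1, 2, 3, 4, 5])
kappa_reversed = quadratic_weighted_kappa([1, 2, 3, 4, 5], [5, 4, 3, 2, 1])
print(levels, kappa_same, kappa_reversed)
```

Quadratic weights penalize a disagreement of four Likert levels sixteen times more than a disagreement of one level, which suits ordinal scales where near-misses should count as partial agreement.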
Existing work has increasingly recognized that evaluating medical LLMs solely on static, outcome-based metrics is insufficient, and that process-oriented assessment under conditions of diagnostic uncertainty is essential. Long et al. (Long et al., [2026](https://arxiv.org/html/2606.24155#bib.bib41)) introduced EviMed with Information Coverage Rate (ICR) to quantify evidence elicitation in interactive consultation, finding that strong diagnostic reasoning does not guarantee effective information collection, while Li et al. (Li et al., [2024](https://arxiv.org/html/2606.24155#bib.bib35)) proposed MediQ, where the system refrains from making diagnostic decisions when unconfident and elicits missing details via follow-up questions. Regarding diagnostic updating under sequential evidence, Pan et al. (Pan et al., [2026](https://arxiv.org/html/2606.24155#bib.bib42)) introduced DDX-TRACE, demonstrating that final diagnosis scores can misrepresent workup quality, as models may guess plausible diagnoses without essential evidence or update uncertainty poorly, while a measurement study (Wang, [2026](https://arxiv.org/html/2606.24155#bib.bib43)) documented "Convergence Regression", in which models correctly identify diagnoses at intermediate stages but abandon them when subsequent evidence triggers pattern-matching to alternatives, creating a 30% Access-Stability Dissociation invisible under single-shot evaluation. For evidence grounding, Ma et al. (Ma et al., [2026](https://arxiv.org/html/2606.24155#bib.bib44)) introduced CiteVQA with Strict Attributed Accuracy (SAA), revealing "Attribution Hallucination", where models produce correct answers while citing incorrect evidence, and Fan et al. (Fan et al., [2026](https://arxiv.org/html/2606.24155#bib.bib45)) proposed HalluHard, operationalizing groundedness through inline citations across high-stakes domains including medical guidelines.
Collectively, these studies have advanced process\-level evaluation in specific dimensions, yet no existing framework systematically audits information gap awareness, questioning strategy, contradiction detection, diagnosis updating, and evidence fidelity within a unified architecture\.
## 5 Conclusion
We introduced MedBench v5, a dynamic, process-oriented, and hallucination-aware benchmark for clinical multimodal AI evaluation. Unlike static medical QA benchmarks that focus mainly on final-answer correctness, MedBench v5 combines Clinical Cognitive Responsiveness and Medical Atomic Skills to evaluate both broad clinical capabilities and executable agentic skills across 18 capability areas and 63 tasks.
MedBench v5 further introduces a stress\-audit\-tracing protocol with three switchable information\-flow stressors: information omission, contradiction injection, and evidence delay\. Through a five\-node process audit, the benchmark localizes failures in information gap detection, follow\-up strategy, contradiction detection, diagnosis update, and evidence grounding\. Our results show that stressors cause node\-specific degradation, with contradiction detection and diagnosis update being especially sensitive, while final evidence grounding can remain superficially stable\.
We also proposed hallucination propagation monitoring to trace unsupported claims across multi\-turn trajectories\. The results show that hallucination risk mainly arises from cross\-turn persistence, reasoning contamination, and failed contradiction\-based correction, rather than from initial fabrication alone\. Overall, MedBench v5 bridges static medical QA and realistic clinical workflow evaluation, providing a unified framework for capability profiling, stress testing, process auditing, and hallucination trajectory analysis\.
## 6 Acknowledgment
Supported by Shanghai Artificial Intelligence Laboratory.
## References
- J. N. Acosta, G. J. Falcone, P. Rajpurkar, and E. J. Topol (2022) Multimodal biomedical AI. Nature Medicine 28(9), pp. 1773–1784. Cited by: [§1](https://arxiv.org/html/2606.24155#S1.p1.1).
- E. Asgari, N. Montaña-Brown, M. Dubois, S. Khalil, J. Balloch, J. A. Yeung, and D. Pimenta (2025) A framework to assess clinical safety and hallucination rates of LLMs for medical text summarisation. npj Digital Medicine 8(1), pp. 274. Cited by: [§1](https://arxiv.org/html/2606.24155#S1.p4.1), [§1](https://arxiv.org/html/2606.24155#S1.p5.1).
- S. Aydin, M. Karabacak, V. Vlachos, and K. Margetis (2024) Large language models in patient education: a scoping review of applications in medicine. Frontiers in Medicine 11, pp. 1477898. Cited by: [§1](https://arxiv.org/html/2606.24155#S1.p1.1).
- J. R. Ball, B. T. Miller, and E. P. Balogh (2015) Improving diagnosis in health care. National Academies Press. Cited by: [§1](https://arxiv.org/html/2606.24155#S1.p1.1).
- C. G. Bielick, A. Awwad, J. Ellen, L. Jalilian, L. G. McCoy, V. Mishra, E. Osmanlliu, S. R. Pfohl, and L. A. Celi (2026) Moving beyond the benchmarks: five foundational principles for meaningful AI evaluation in healthcare. PLOS Digital Health 5(5), pp. e0001115. Cited by: [§1](https://arxiv.org/html/2606.24155#S1.p2.1).
- J. Chen, D. Yang, T. Wu, Y. Jiang, X. Hou, M. Li, S. Wang, D. Xiao, K. Li, and L. Zhang (2024) Detecting and evaluating medical hallucinations in large vision language models. arXiv preprint arXiv:2406.10185. Cited by: [§1](https://arxiv.org/html/2606.24155#S1.p4.1).
- W. Chen, G. Yu, Y. Cheung, M. Ding, J. Liu, Z. Ma, W. Wang, and L. Shen (2025) Beyond the leaderboard: rethinking medical benchmarks for large language models. arXiv preprint arXiv:2508.04325. Cited by: [§1](https://arxiv.org/html/2606.24155#S1.p2.1), [§1](https://arxiv.org/html/2606.24155#S1.p5.1).
- C. Chiu, S. Pitis, and M. van der Schaar (2026) Simulating viva voce examinations to evaluate clinical reasoning in large language models. Advances in Neural Information Processing Systems 38. Cited by: [§1](https://arxiv.org/html/2606.24155#S1.p3.1).
- J. Cohen (1968) Weighted kappa: nominal scale agreement provision for scaled disagreement or partial credit. Psychological Bulletin 70(4), pp. 213. Cited by: [§3.6](https://arxiv.org/html/2606.24155#S3.SS6.p2.6).
- D. Fan, S. Delsad, N. Flammarion, and M. Andriushchenko (2026) HalluHard: a hard multi-turn hallucination benchmark. External Links: 2602.01031, [Link](https://arxiv.org/abs/2602.01031). Cited by: [§4](https://arxiv.org/html/2606.24155#S4.p1.1).
- E. J. Gong, C. S. Bang, J. J. Lee, and G. H. Baik (2025) Knowledge-practice performance gap in clinical large language models: systematic review of 39 benchmarks. Journal of Medical Internet Research 27, pp. e84120. Cited by: [§1](https://arxiv.org/html/2606.24155#S1.p2.1).
- Z. Gu, J. Chen, F. Liu, C. Yin, and P. Zhang (2026) MedVH: toward systematic evaluation of hallucination for large vision language models in the medical context. Advanced Intelligent Systems 8(1), pp. 2500255. Cited by: [§1](https://arxiv.org/html/2606.24155#S1.p4.1).
- Y. Han, J. Chan, J. Chen, M. Ai, S. Du, and Y. Guo (2026) MedConceal: a benchmark for clinical hidden-concern reasoning under partial observability. arXiv preprint arXiv:2604.08788. Cited by: [§1](https://arxiv.org/html/2606.24155#S1.p3.1).
- Y. Huang, Y. Chen, H. Zhang, K. Li, H. Zhou, M. Fang, L. Yang, X. Li, L. Shang, S. Xu, et al. (2025) Deep research agents: a systematic examination and roadmap. arXiv preprint arXiv:2506.18096. Cited by: [§1](https://arxiv.org/html/2606.24155#S1.p5.1).
- Y. Jiang, K. C. Black, G. Geng, D. Park, J. Zou, A. Y. Ng, and J. H. Chen (2025) MedAgentBench: a virtual EHR environment to benchmark medical LLM agents. NEJM AI 2(9), pp. AIdbp2500144. Cited by: [§1](https://arxiv.org/html/2606.24155#S1.p3.1).
- Z. Jiang, H. Chen, Y. Wu, Y. Qin, C. Pei, D. Zeng, B. Sheng, and T. Y. Wong (2026) Beyond multiple-choice questions: rethinking evaluation frameworks for large language models for clinical medicine. Elsevier. Cited by: [§1](https://arxiv.org/html/2606.24155#S1.p2.1).
- K. Jung (2025) Large language models in medicine: clinical applications, technical challenges, and ethical considerations. Healthcare Informatics Research 31(2), pp. 114–124. Cited by: [§1](https://arxiv.org/html/2606.24155#S1.p1.1).
- S. Kim and H. Yoon (2025) Questioning our questions: how well do medical QA benchmarks evaluate clinical capabilities of language models? In Proceedings of the 24th Workshop on Biomedical Language Processing, pp. 274–296. Cited by: [§1](https://arxiv.org/html/2606.24155#S1.p2.1).
- Y. Kim, H. Jeong, S. Chen, S. S. Li, C. Park, M. Lu, K. Alhamoud, J. Mun, C. Grau, M. Jung, et al. (2025) Medical hallucinations in foundation models and their impact on healthcare. arXiv preprint arXiv:2503.05777. Cited by: [§1](https://arxiv.org/html/2606.24155#S1.p4.1).
- J. R. Landis and G. G. Koch (1977) The measurement of observer agreement for categorical data. Biometrics, pp. 159–174. Cited by: [§3.6](https://arxiv.org/html/2606.24155#S3.SS6.p2.6).
- S. S. Li, V. Balachandran, S. Feng, J. S. Ilgen, E. Pierson, P. W. Koh, and Y. Tsvetkov (2024) MediQ: question-asking LLMs and a benchmark for reliable interactive clinical reasoning. Advances in Neural Information Processing Systems 37, pp. 28858–28888. Cited by: [§1](https://arxiv.org/html/2606.24155#S1.p3.1), [§4](https://arxiv.org/html/2606.24155#S4.p1.1).
- Y. Li, X. Jie, W. Ruan, X. Zhang, H. Zhu, Y. Gao, C. Du, and R. Liu (2026) Beyond idealized patients: evaluating LLMs under challenging patient behaviors in medical consultations. arXiv preprint arXiv:2603.29373. Cited by: [§1](https://arxiv.org/html/2606.24155#S1.p5.1).
- J. Liu, W. Wang, Z. Ma, G. Huang, Y. Su, K. Chang, H. Li, L. Shen, M. R. Lyu, and W. Chen (2026) MedChain: bridging the gap between LLM agents and clinical practice with interactive sequence. Advances in Neural Information Processing Systems 38. Cited by: [§1](https://arxiv.org/html/2606.24155#S1.p3.1).
- J. Liu, P. Zhou, Y. Hua, D. Chong, Z. Tian, A. Liu, H. Wang, C. You, Z. Guo, L. Zhu, et al. (2023) Benchmarking large language models on CMExam, a comprehensive Chinese medical exam dataset. Advances in Neural Information Processing Systems 36, pp. 52430–52452. Cited by: [§1](https://arxiv.org/html/2606.24155#S1.p2.1).
- R. Liu, K. Xue, X. Zhang, and S. Zhang (2025) Interactive evaluation for medical LLMs via task-oriented dialogue system. In Proceedings of the 31st International Conference on Computational Linguistics, pp. 4871–4896. Cited by: [§1](https://arxiv.org/html/2606.24155#S1.p3.1).
- S. Liu, A. P. Wright, A. B. Mccoy, S. S. Huang, J. Z. Genkins, J. F. Peterson, Y. A. Kumah-Crystal, W. Martinez, B. Carew, D. Mize, et al. (2024) Using large language model to guide patients to create efficient and comprehensive clinical care message. Journal of the American Medical Informatics Association 31(8), pp. 1665–1670. Cited by: [§1](https://arxiv.org/html/2606.24155#S1.p1.1).
- Z. Long, Z. Bao, and Z. Wei (2026) Strong reasoning isn’t enough: evaluating evidence elicitation in interactive diagnosis. External Links: 2601.19773, [Link](https://arxiv.org/abs/2601.19773). Cited by: [§4](https://arxiv.org/html/2606.24155#S4.p1.1).
- J. Lu, J. Liu, X. Zheng, M. Yang, J. Wang, P. Wang, and Y. Zhang (2026a) MHB: medical hallucination benchmark for large language models in complex clinical tasks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 40, pp. 38971–38978. Cited by: [§1](https://arxiv.org/html/2606.24155#S1.p5.1).
- Y. Lu, Y. Lin, W. Shi, J. B. Tamo, X. Zhao, J. Wang, and M. D. Wang (2026b) ClinEnv: an interactive multi-stage long horizon EHR environment for agents. External Links: 2606.02568, [Link](https://arxiv.org/abs/2606.02568). Cited by: [§1](https://arxiv.org/html/2606.24155#S1.p3.1).
- X. Luo, X. Jiang, and J. Wu (2026) MedDialBench: benchmarking LLM diagnostic robustness under parametric adversarial patient behaviors. arXiv preprint arXiv:2604.06846. Cited by: [§1](https://arxiv.org/html/2606.24155#S1.p3.1).
- D. Ma, J. Li, Z. Wang, Y. Wang, J. Kong, W. Zeng, J. Xiao, J. Yang, W. Zhang, B. Wang, and C. He (2026) CiteVQA: benchmarking evidence attribution for trustworthy document intelligence. External Links: 2605.12882, [Link](https://arxiv.org/abs/2605.12882). Cited by: [§4](https://arxiv.org/html/2606.24155#S4.p1.1).
- L. G. McCoy, R. Swamy, N. Sagar, M. Wang, S. Bacchi, J. M. N. Fong, N. C. Tan, K. Tan, T. A. Buckley, P. Brodeur, et al. (2025) Assessment of large language models in clinical reasoning: a novel benchmarking study. NEJM AI 2(10), pp. AIdbp2500120. Cited by: [§1](https://arxiv.org/html/2606.24155#S1.p1.1).
- K\. O\. McGraw and S\. P\. Wong \(1996\)Forming inferences about some intraclass correlation coefficients\.\.Psychological methods1\(1\),pp\. 30\.Cited by:[§3\.6](https://arxiv.org/html/2606.24155#S3.SS6.p1.1)\.
- A\. N\. Meyer, T\. D\. Giardina, L\. Khawaja, and H\. Singh \(2021\)Patient and clinician experiences of uncertainty in the diagnostic process: current understanding and future directions\.Patient Education and Counseling104\(11\),pp\. 2606–2615\.Cited by:[§1](https://arxiv.org/html/2606.24155#S1.p1.1)\.
- M\. Moor, O\. Banerjee, Z\. S\. H\. Abad, H\. M\. Krumholz, J\. Leskovec, E\. J\. Topol, and P\. Rajpurkar \(2023\)Foundation models for generalist medical artificial intelligence\.Nature616\(7956\),pp\. 259–265\.Cited by:[§1](https://arxiv.org/html/2606.24155#S1.p1.1)\.
- A\. Pal, L\. K\. Umapathi, and M\. Sankarasubbu \(2022\)Medmcqa: a large\-scale multi\-subject multi\-choice dataset for medical domain question answering\.InConference on health, inference, and learning,pp\. 248–260\.Cited by:[§1](https://arxiv.org/html/2606.24155#S1.p2.1)\.
- A\. Pal, L\. K\. Umapathi, and M\. Sankarasubbu \(2023\)Med\-halt: medical domain hallucination test for large language models\.InProceedings of the 27th Conference on Computational Natural Language Learning \(CoNLL\),pp\. 314–334\.Cited by:[§1](https://arxiv.org/html/2606.24155#S1.p4.1)\.
- J\. Pan, W\. Shen, J\. Li, J\. Canisius, F\. Bitzer, P\. Roßmüller, J\. Yang, V\. Kreutzinger, D\. Rueckert, and B\. Wiestler \(2026\)DDX\-trace: a benchmark for medical diagnostic trajectories in vlms\.External Links:2605\.23629,[Link](https://arxiv.org/abs/2605.23629)Cited by:[§4](https://arxiv.org/html/2606.24155#S4.p1.1)\.
- S\. Pandit, J\. Xu, J\. Hong, Z\. Wang, T\. Chen, K\. Xu, and Y\. Ding \(2025\)Medhallu: a comprehensive benchmark for detecting medical hallucinations in large language models\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,pp\. 2858–2873\.Cited by:[§1](https://arxiv.org/html/2606.24155#S1.p4.1)\.
- P\. Qiu, C\. Wu, S\. Liu, Y\. Fan, W\. Zhao, Z\. Chen, H\. Gu, C\. Peng, Y\. Zhang, Y\. Wang,et al\.\(2025\)Quantifying the reasoning abilities of llms on clinical cases\.Nature Communications16\(1\),pp\. 9799\.Cited by:[§1](https://arxiv.org/html/2606.24155#S1.p5.1)\.
- D\. Roustan and F\. Bastardot \(2025\)The clinicians’ guide to large language models: a general perspective with a focus on hallucinations\.Interactive journal of medical research14\(1\),pp\. e59823\.Cited by:[§1](https://arxiv.org/html/2606.24155#S1.p4.1)\.
- A\. Sanghvi, N\. Akash, R\. Imam, A\. Sharma, and M\. Jain \(2026\)MeDxAgent: multi\-agent consultation for interactive medical diagnosis\.External Links:2606\.03416,[Link](https://arxiv.org/abs/2606.03416)Cited by:[§1](https://arxiv.org/html/2606.24155#S1.p3.1)\.
- K\. L\. Sangwon, J\. Zhang, R\. Steele, J\. Stryker, J\. V\. Lee, J\. Choi, K\. Vishwanath, D\. A\. Alber, D\. Kondziolka, M\. Mankowski,et al\.\(2025\)Evaluating large language model diagnostic performance on jama clinical challenges via a multi\-agent conversational framework\.medRxiv,pp\. 2025–08\.Cited by:[§1](https://arxiv.org/html/2606.24155#S1.p2.1)\.
- S\. Schmidgall, R\. Ziaei, C\. Harris, J\. W\. Kim, E\. P\. Reis, J\. Jopling, and M\. Moor \(2026\)AgentClinic: a multimodal benchmark for tool\-using clinical ai agents\.npj Digital Medicine\.Cited by:[§1](https://arxiv.org/html/2606.24155#S1.p2.1),[§1](https://arxiv.org/html/2606.24155#S1.p3.1)\.
- W\. Shi, R\. Xu, Y\. Zhuang, Y\. Yu, J\. Zhang, H\. Wu, Y\. Zhu, J\. C\. Ho, C\. Yang, and M\. D\. Wang \(2024\)Ehragent: code empowers large language models for few\-shot complex tabular reasoning on electronic health records\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,pp\. 22315–22339\.Cited by:[§1](https://arxiv.org/html/2606.24155#S1.p5.1)\.
- K\. Singhal, S\. Azizi, T\. Tu, S\. S\. Mahdavi, J\. Wei, H\. W\. Chung, N\. Scales, A\. Tanwani, H\. Cole\-Lewis, S\. Pfohl,et al\.\(2023\)Large language models encode clinical knowledge\.Nature620\(7972\),pp\. 172–180\.Cited by:[§1](https://arxiv.org/html/2606.24155#S1.p1.1)\.
- J\. Sooknanan and T\. Seemungal \(2019\)Not so elementary–the reasoning behind a medical diagnosis\.MedEdPublish8,pp\. 234\.Cited by:[§1](https://arxiv.org/html/2606.24155#S1.p1.1)\.
- L\. Sun, C\. Gibbons, J\. Hernández\-Orallo, X\. Wang, L\. Jiang, D\. Stillwell, F\. Luo, and X\. Xie \(2025\)Beyond benchmarks: evaluating generalist medical artificial intelligence with psychometrics\.Journal of medical Internet research27,pp\. e70901\.Cited by:[§1](https://arxiv.org/html/2606.24155#S1.p2.1),[§1](https://arxiv.org/html/2606.24155#S1.p5.1)\.
- H\. Thampy, E\. Willert, and S\. Ramani \(2019\)Assessing clinical reasoning: targeting the higher levels of the pyramid\.Journal of general internal medicine34\(8\),pp\. 1631–1636\.Cited by:[§1](https://arxiv.org/html/2606.24155#S1.p1.1)\.
- R\. J\. Tibshirani and B\. Efron \(1993\)An introduction to the bootstrap\.Monographs on statistics and applied probability57\(1\),pp\. 1–436\.Cited by:[§3\.6](https://arxiv.org/html/2606.24155#S3.SS6.p2.6)\.
- B\. Wang, I\. Xia, Y\. Zhang, J\. Wang, F\. Ouyang, S\. Han, A\. Cohan, H\. Yu, and Z\. Yao \(2025\)From scores to steps: diagnosing and improving llm performance in evidence\-based medical calculations\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,pp\. 10820–10844\.Cited by:[§1](https://arxiv.org/html/2606.24155#S1.p5.1)\.
- S\. X\. Wang \(2026\)Measuring the unmeasurable: a diagnostic sensor for ai reasoning pathology in sequential clinical decision\-making\.medRxiv,pp\. 2026–03\.Cited by:[§4](https://arxiv.org/html/2606.24155#S4.p1.1)\.
- X\. Wang, Y\. Chen, J\. Li, Y\. Wang, Y\. Yao, T\. Gu, J\. Li, Y\. Teng, Y\. Wang, and X\. Hu \(2026\)Openrt: an open\-source red teaming framework for multimodal llms\.arXiv preprint arXiv:2601\.01592\.Cited by:[4th item](https://arxiv.org/html/2606.24155#S2.I2.i4.p1.1)\.
- A\. Weinstein, S\. Gupta, R\. Pinto\-Powell, J\. Jackson, J\. Appel, D\. Roussel, and M\. Daniel \(2017\)Diagnosing and remediating clinical reasoning difficulties: a faculty development workshop\.MedEdPORTAL13,pp\. 10650\.Cited by:[§1](https://arxiv.org/html/2606.24155#S1.p1.1)\.
- M\. Werthaim, M\. Kimhi, A\. Apartsin, and Y\. Aperstein \(2026\)A benchmark for evaluating diagnostic questioning efficiency of llms in patient conversations\.Scientific Reports\.Cited by:[§1](https://arxiv.org/html/2606.24155#S1.p3.1)\.
- J\. Wu, B\. Gu, R\. Zhou, K\. Xie, D\. Snyder, Y\. Jiang, V\. Carducci, R\. Wyss, R\. J\. Desai, E\. Alsentzer,et al\.\(2026\)BRIDGE: benchmarking large language models for understanding real\-world clinical practice texts\.Nature Biomedical Engineering,pp\. 1–16\.Cited by:[§1](https://arxiv.org/html/2606.24155#S1.p2.1)\.
- G\. Xiong, Q\. Jin, Z\. Lu, and A\. Zhang \(2024\)Benchmarking retrieval\-augmented generation for medicine\.InFindings of the Association for Computational Linguistics: ACL 2024,pp\. 6233–6251\.Cited by:[§1](https://arxiv.org/html/2606.24155#S1.p5.1)\.
- C\. Yan, X\. Fu, Y\. Xiong, T\. Wang, S\. C\. Hui, J\. Wu, and X\. Liu \(2025\)LLM sensitivity evaluation framework for clinical diagnosis\.InProceedings of the 31st International Conference on Computational Linguistics,pp\. 3083–3094\.Cited by:[§1](https://arxiv.org/html/2606.24155#S1.p5.1)\.
- W\. Yan, H\. Liu, T\. Wu, Q\. Chen, W\. Wang, H\. Chai, and J\. Wang \(2026\)Clinicallab: aligning agents for multi\-departmental clinical diagnostics in the real world\.Advances in Neural Information Processing Systems38\.Cited by:[§1](https://arxiv.org/html/2606.24155#S1.p3.1)\.
- S\. Yang, H\. Yuan, W\. Zhang, J\. Wang, Y\. Qian, W\. Chen, F\. Wang, and L\. Zhu \(2026\)ClinHallu: a benchmark for diagnosing stage\-wise hallucinations in medical mllm reasoning\.arXiv preprint arXiv:2606\.14697\.Cited by:[§1](https://arxiv.org/html/2606.24155#S1.p5.1)\.
- H\. Zhang, J\. Huang, K\. Mei, Y\. Yao, Z\. Wang, C\. Zhan, H\. Wang, and Y\. Zhang \(2025\)Agent security bench \(asb\): formalizing and benchmarking attacks and defenses in llm\-based agents\.InInternational Conference on Learning Representations,Vol\.2025,pp\. 35331–35366\.Cited by:[§1](https://arxiv.org/html/2606.24155#S1.p5.1)\.
- N\. Zhang, M\. Chen, Z\. Bi, X\. Liang, L\. Li, X\. Shang, K\. Yin, C\. Tan, J\. Xu, F\. Huang,et al\.\(2022\)Cblue: a chinese biomedical language understanding evaluation benchmark\.InProceedings of the 60th annual meeting of the association for computational linguistics \(volume 1: long papers\),pp\. 7888–7915\.Cited by:[§1](https://arxiv.org/html/2606.24155#S1.p2.1)\.
- R\. Zhang and A\. C\. Chung \(2021\)MedQ: lossless ultra\-low\-bit neural network quantization for medical image segmentation\.Medical Image Analysis73,pp\. 102200\.Cited by:[§1](https://arxiv.org/html/2606.24155#S1.p2.1)\.
- S\. Zhou, W\. Xie, J\. Li, Z\. Zhan, M\. Song, H\. Yang, C\. Espinoza, L\. Welton, X\. Mai, Y\. Jin,et al\.\(2025\)Automating expert\-level medical reasoning evaluation of large language models\.npj Digital Medicine\.Cited by:[§1](https://arxiv.org/html/2606.24155#S1.p5.1)\.
- Z\. Zhu, Y\. Zhang, X\. Zhuang, F\. Zhang, Z\. Wan, Y\. Chen, Q\. QingqingLong, Y\. Zheng, and X\. Wu \(2025\)Can we trust ai doctors? a survey of medical hallucination in large language and large vision\-language models\.InFindings of the Association for Computational Linguistics: ACL 2025,pp\. 6748–6769\.Cited by:[§1](https://arxiv.org/html/2606.24155#S1.p4.1)\.
## Appendix A Multimodal Tasks for Clinical Cognitive Responsiveness and Medical Atomic Skills
### A\.1 Clinical Cognitive Responsiveness
The datasets for Clinical Cognitive Responsiveness are listed in Table 5\.
Table 5: Overview of Clinical Cognitive Responsiveness

| Dimension | Dataset | Metrics | Description |
| --- | --- | --- | --- |
| Medical Knowledge QA | MedExam | Accuracy | Covers basic, professional, and public medical subjects\. All items are objective multiple\-choice questions, including K\-type \(best single\-answer\) and S\-type \(clinical vignette best\-answer\) formats\. |
| Medical Knowledge QA | MedHC | Macro\-Recall & LLM\-as\-a\-Judge | Health consultation dataset covering common medical examinations \(internal medicine, surgery, lab tests, genetic testing\) across cardiovascular, oncology, metabolic, and nutritional diseases\. |
| Medical Knowledge QA | MedMC | Macro\-Recall & LLM\-as\-a\-Judge | Medication consultation dataset spanning 31 clinical departments and 320 diseases, covering disease names, drug names, indications, and treatment regimens\. |
| Medical Knowledge QA | MedSpeQA | Macro\-Recall & LLM\-as\-a\-Judge | Specialty\-specific QA dataset covering oncology, cardiology, respiratory, gastroenterology, and imaging\. Questions involve symptoms, signs, history, medications, and family history\. |
| Medical Knowledge QA | MedHG | Accuracy | Triage/department recommendation dataset\. Given doctor–patient dialogues about symptoms and history, the model must recommend the appropriate clinical department from a predefined set\. |
| Medical Knowledge QA | MedLitQA | Macro\-Recall & LLM\-as\-a\-Judge | Medical literature comprehension and reasoning QA, evaluating the model’s ability to extract knowledge and perform logical reasoning based on real clinical literature excerpts\. |
| Medical Knowledge QA | MedRehab | Macro\-Recall & LLM\-as\-a\-Judge | Post\-discharge rehabilitation management dataset, evaluating the model’s ability to deliver personalized rehabilitation plans and guidance across rehabilitation domains\. |
| Medical Knowledge QA | MedRxPlan | Macro\-Recall & LLM\-as\-a\-Judge | Precision medication education dataset in a case\-based QA format, covering 10 major organ systems \(respiratory, circulatory, digestive, urinary, neurological, endocrine, reproductive, musculoskeletal, immune, hematological\)\. |
| Medical Knowledge QA | MedPsychCare | Macro\-Recall & LLM\-as\-a\-Judge | Open\-domain QA dataset focusing on psychological counseling and emotional support, covering common mental health issues, crisis intervention, chronic psychological problems, and cross\-scenario psychological needs\. |
| Medical Knowledge QA | MedPsychQA | Macro\-Recall & LLM\-as\-a\-Judge | Psychological knowledge QA dataset designed for public mental health literacy, evaluating accurate responses across 12 psychology knowledge domains\. |
| Medical Language Generation | MedRecordGen | Macro\-Recall & LLM\-as\-a\-Judge | Open\-domain QA evaluating the generation of structured, standardized clinical records from doctor–patient interactions, covering outpatient records, admission notes, and discharge summaries\. |
| Medical Language Generation | MedPopular | LLM\-as\-a\-Judge | Health science communication generation, covering five core public health domains: common disease education, chronic disease management, public health, special populations, and general health knowledge\. |
| Medical Language Generation | MedSummary | Macro\-Recall & LLM\-as\-a\-Judge | Clinical document summarization evaluating both completeness of clinical information and conciseness, covering inpatient records, outpatient records, lab/imaging reports, surgical notes, and discharge summaries\. |
| Medical Language Generation | MedExplain | Macro\-Recall & LLM\-as\-a\-Judge | Medical terminology explanation for patient comprehension, balancing scientific accuracy with layperson readability to bridge doctor–patient communication gaps\. |
| Medical Language Generation | MedTeach | LLM\-as\-a\-Judge | Clinical teaching case generation from real patient records, covering five scenario types while ensuring information completeness, privacy protection, and pedagogical alignment\. |
| Complex Medical Reasoning | CMB\-Clin\-extended | Macro\-Recall & LLM\-as\-a\-Judge | Based on complex real\-world clinical records, evaluating the model’s ability to apply medical knowledge for diagnosis and treatment in authentic clinical scenarios\. |
| Complex Medical Reasoning | DDx\-advanced | Accuracy | Multiple\-choice questions derived from real patient records \(per SCARE guidelines\), covering demographics, symptoms, clinical concerns, treatment/surgical history, medications, allergies, family history, and lifestyle factors\. Correct answers may include one or multiple options\. |
| Complex Medical Reasoning | MedTreat | Macro\-Recall & LLM\-as\-a\-Judge | Precision treatment planning for specific diseases in complex clinical scenarios, evaluating standardized and personalized therapeutic recommendations\. |
| Complex Medical Reasoning | MedOutcome | Accuracy & Macro\-Recall & LLM\-as\-a\-Judge | Clinical outcome prediction \(cured, improved, unchanged, deceased\) based on patient information, key interventions, and disease\-specific clinical guidelines\. |
| Complex Medical Reasoning | MedAnalysis | Macro\-Recall & LLM\-as\-a\-Judge | Personalized risk assessment using clinical scoring systems and formulas \(e\.g\., APACHE II, CHADS2\) to evaluate disease progression, treatment complications, and prognosis risks\. |
| Complex Medical Reasoning | MedDiag | Macro\-Recall & LLM\-as\-a\-Judge | Primary care diagnosis recommendation covering common respiratory, digestive, and community\-acquired diseases, requiring diagnosis and evidence based on patient information and basic examinations\. |
| Complex Medical Reasoning | MedDiffer | Macro\-Recall & LLM\-as\-a\-Judge | Differential diagnosis recommendation for primary care, covering common diseases and requiring differential results with supporting evidence\. |
| Complex Medical Reasoning | MedCare | Macro\-Recall & LLM\-as\-a\-Judge | Appropriate treatment and management recommendations for general practitioners, covering symptomatic treatment of common diseases and long\-term chronic disease management\. |
| Complex Medical Reasoning | MedPrimary | Macro\-Recall & LLM\-as\-a\-Judge | Clinical decision support for general practitioners in inpatient settings, requiring further diagnostic and therapeutic plans based on patient records and basic examination results\. |
| Complex Medical Reasoning | MedPHM | Macro\-Recall & LLM\-as\-a\-Judge | Personalized health management for chronic disease patients, covering diet/exercise planning, lifestyle intervention, acute exacerbation prevention/treatment, and treatment efficacy management for hypertension, diabetes, and COPD\. |
| Medical Language Understanding | SMDoc | Accuracy | Medical text structuring from real clinical documents \(patient demographics, symptom descriptions, lab/imaging results\), evaluating extraction of specific clinical entities such as vital signs and examination findings\. |
| Medical Language Understanding | MedRxCheck | Accuracy & Macro\-Recall & LLM\-as\-a\-Judge | Pre\-prescription intelligent review covering single\-choice, multiple\-choice, and open\-ended questions, evaluating the model’s ability to audit prescription rationality, safety, and compliance\. |
| Medical Language Understanding | MedInsureCheck | Macro\-Recall & LLM\-as\-a\-Judge | Automated audit of medical insurance claims for compliance and reasonableness based on simulated insurance data and healthcare insurance regulations\. |
| Medical Language Understanding | MedInsureCalc | Accuracy | Medical insurance fee calculation and payment management across diverse scenarios, including basic settlement, complex rules \(e\.g\., Category\-B self\-pay first, deductible\-ceiling interaction\), robustness testing, and edge cases\. |
| Medical Language Understanding | MedChartQC | Macro\-Recall & LLM\-as\-a\-Judge | Medical document quality control evaluating completeness, standardization, and logical consistency across admission records, progress notes, surgical records, discharge summaries, and outpatient records\. |
| Medical Language Understanding | MedReportQC | Micro\-F1 & Macro\-Recall & LLM\-as\-a\-Judge | CT imaging report quality control, evaluating the model’s knowledge of common disease CT manifestations and diagnostic accuracy to reduce manual QC workload\. |
| Medical Language Understanding | MedPathQC | Macro\-Recall & LLM\-as\-a\-Judge | Clinical pathway quality control across the full chain of admission determination, process monitoring, and variance management, supporting standardized and homogeneous care delivery\. |
| Medical Language Understanding | MedTerm | Macro\-Recall & LLM\-as\-a\-Judge | Precision interpretation of core clinical and research medical terminology for healthcare professionals\. |
| Medical Language Understanding | MedSynonym | Macro\-Recall & LLM\-as\-a\-Judge | Multi\-scenario medical synonym matching \(clinical consultations, record writing, surgical communication, lab report interpretation, research writing, academic presentations, doctor–patient communication\) across six domains: basic medicine, clinical diagnosis, disease pathology, therapeutic intervention, pharmacology, and laboratory medicine\. |
| Healthcare Safety & Ethics | MedSafety | Accuracy | Multiple\-choice questions based on healthcare quality and safety core regulations, laws, and industry standards, covering the full spectrum of clinical safety scenarios\. |
| Healthcare Safety & Ethics | MedEthics | Accuracy | Single\-choice questions built from classic medical ethics textbooks and domestic/international policies, covering clinical ethics, research ethics, genetics and reproductive ethics, psychiatric ethics, end\-of\-life care ethics, interpersonal ethics, public health ethics, traditional Chinese medicine ethics, health management ethics, organ transplantation ethics, and pandemic ethics\. |
| Medical Visual Perception & Text Extraction | MedDetect | IoU & Accuracy | Target detection in medical images \(CT, MRI, etc\.\) guided by clinical text descriptions, evaluating the model’s ability to localize and identify imaging targets\. |
| Medical Visual Perception & Text Extraction | MedClass | Accuracy | Multi\-modal image classification fusing medical imaging \(CT, MRI\) with clinical text \(chief complaints, history summaries, lab indicators\) for precise image\-level classification\. |
| Medical Visual Perception & Text Extraction | MedOCR | 1\-N\.E\.D\. \(Normalized Edit Distance\) | Named entity recognition on medical imaging reports; accurate text recognition from report images is the prerequisite for downstream content understanding\. |
| Cross\-modal Semantic Understanding & Reasoning | MedVQA | Macro\-Recall & LLM\-as\-a\-Judge | Visual question answering combining imaging report text with patient demographics to provide preliminary diagnosis and further reasoning recommendations\. |
| Cross\-modal Semantic Understanding & Reasoning | MedGen | Macro\-Recall & LLM\-as\-a\-Judge | Multi\-modal report generation from paired image–report data \(ultrasound, X\-ray, pathology, endoscopy\), evaluating both content accuracy and linguistic fluency of generated clinical reports\. |
| Cross\-modal Semantic Understanding & Reasoning | MedQC | Micro\-F1 | Chest X\-ray image quality control covering 11 QC dimensions including artifacts and improper positioning, using constrained\-domain answers\. |
| Cross\-modal Semantic Understanding & Reasoning | MedSeqIm | Macro\-Recall & LLM\-as\-a\-Judge | Longitudinal imaging sequence understanding with clinical and temporal annotations, evaluating analysis of imaging changes, treatment response prediction, and multi\-task temporal reasoning\. |
| Cross\-modal Semantic Understanding & Reasoning | Med3DMTVQA | Accuracy | 3D multi\-timepoint visual QA based on real radiology reports \(CT plain scan, contrast\-enhanced CT, and follow\-up scans\), evaluating 3D volume data \+ multi\-sequence \+ temporal comparison understanding\. |
| Clinical Decision Support & Reasoning | MedDiffDx | Macro\-Recall & LLM\-as\-a\-Judge | Multi\-modal clinical case differential diagnosis, evaluating the generation of probabilistic differential lists with supporting evidence and diagnostic accuracy\. |
| Clinical Decision Support & Reasoning | MedTherapy | Macro\-Recall & LLM\-as\-a\-Judge | Multi\-modal treatment planning with annotated treatment regimens and outcomes, evaluating personalized therapy recommendations with multi\-dimensional justification\. |
| Clinical Decision Support & Reasoning | MedCourse | Macro\-Recall & LLM\-as\-a\-Judge | Chronic disease longitudinal follow\-up with multi\-modal data \(imaging, treatment records, complete disease course annotations\), evaluating disease progression analysis and individualized modeling\. |
| Clinical Decision Support & Reasoning | MedRealMM | LLM\-as\-a\-Judge \(case\-specific rubric\) | Real\-world telemedicine dialogues from de\-identified Chinese online consultation platforms, with multi\-turn conversations and patient\-uploaded medical images; models generate physician responses at key decision nodes\. |
| Clinical Task Planning & Reasoning | MedDecomp | LLM\-as\-a\-Judge | Clinical goal decomposition across five major clinical scenarios \(outpatient, emergency, etc\.\), evaluating the model’s ability to decompose abstract clinical goals into executable, logical, and comprehensive task sequences\. |
| Clinical Task Planning & Reasoning | MedPathPlan | LLM\-as\-a\-Judge | Multi\-department complex clinical pathway planning with full\-process data and multiple care pathways, evaluating the generation of compliant and personalized clinical pathways\. |
| Clinical Task Planning & Reasoning | MedCOT | LLM\-as\-a\-Judge | Chain\-of\-thought reasoning on multi\-step complex clinical cases across outpatient, emergency, and chronic disease management scenarios, evaluating logical consistency and accuracy of reasoning\. |
| Clinical Task Planning & Reasoning | MedReflect | LLM\-as\-a\-Judge | Error\-annotated clinical decision cases across four major scenarios, with initial plans, error descriptions, and expert corrections, evaluating the model’s ability to identify biases/errors and propose reasonable improvements\. |
| Medical Tool Invocation & Execution | MedRetAPI | LLM\-as\-a\-Judge | Clinical information retrieval across 8 major clinical information\-need scenarios for both providers and patients, evaluating query generation accuracy for information access\. |
| Medical Tool Invocation & Execution | MedCallAPI | LLM\-as\-a\-Judge | External system API invocation for clinical operations across 6 scenarios with operational and parameter requirements, evaluating compliant API call generation\. |
| Medical Scenario Perception & Interaction | MedIntentID | Accuracy | Multi\-scenario doctor–patient dialogue intent recognition covering 6 dialogue types with full context, evaluating classification accuracy and contextual understanding\. |
| Medical Scenario Perception & Interaction | MedRoleAdapt | LLM\-as\-a\-Judge | Multi\-role medical dialogue adaptation \(patient, physician, etc\.\) with role\-specific information and communication goals, evaluating role\-appropriate and adaptive response generation\. |
| Memory & Context Retention | MedLongConv | LLM\-as\-a\-Judge | Long\-term conversational tracking across three interaction types \(chronic disease management, common disease management, and rehabilitation guidance\), each covering multiple diseases\. |
| Memory & Context Retention | MedLongQA | LLM\-as\-a\-Judge | Long\-document medical QA covering clinical records, research literature, and complex queries, evaluating the model’s ability to synthesize information across lengthy documents with deep comprehension and answer accuracy\. |
| Medical Multi\-Agent Collaboration | MedCollab | LLM\-as\-a\-Judge | Multi\-system collaborative medical scenarios covering five collaboration modes: diagnostic assistance, treatment execution, chronic disease management, emergency coordination, and rehabilitation guidance, evaluating task decomposition and system coordination\. |
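Table 5 lists "1\-N\.E\.D\." as the MedOCR metric without giving its exact normalization\. The sketch below shows one common reading, assuming character\-level Levenshtein distance divided by the length of the longer string; the function names `levenshtein` and `one_minus_ned` are illustrative and not taken from the benchmark code\.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def one_minus_ned(prediction: str, reference: str) -> float:
    """Return 1 - NED. Assumption: normalize by the longer string's length."""
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 1.0  # two empty strings are treated as a perfect match
    return 1.0 - levenshtein(prediction, reference) / longest
```

Under this normalization the score lies in \[0, 1\], with 1 meaning the recognized text matches the reference exactly\.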
### A\.2 Medical Atomic Skills
The Medical Atomic Skills datasets are described in Table [6](https://arxiv.org/html/2606.24155#A1.T6)\.
Table 6: Overview of Agent Atomic Skill Evaluation Datasets
## Appendix B Likert\-5 for Human Rating
The Likert\-5 scale used for human rating is detailed in Table [7](https://arxiv.org/html/2606.24155#A2.T7)\.
Table 7: Medical Relevance Rating Scale \(Likert\-5\)

Scoring Instructions:
- Rating object: Compare each model answer with its paired reference answer, question by question\.
- Blinding: Raters must not know which model produced an answer; they compare text content only\.
- Handling uncertainty: If the reference answer is incomplete, use recognized medical guidelines or routine clinical practice as a supplementary basis\. Additional correct information provided by the model is not penalized, but consistency scoring remains based on the core overlap\.
- Use of scores: The rating is an ordinal categorical variable\. Subsequent analyses include inter\-rater consistency \(e\.g\., Kappa coefficient\) and correlation with automatic metrics \(e\.g\., Spearman correlation coefficient\)\.
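The two analyses named in the scoring instructions can be sketched in plain Python\. This is a minimal illustration under stated assumptions: the appendix does not say which Kappa variant is used, so the unweighted two\-rater Cohen's kappa is assumed here \(a weighted variant is also common for ordinal scales\), and Spearman correlation is computed as the Pearson correlation of tie\-averaged ranks\. All function names are hypothetical\.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(rater_a: Sequence[int], rater_b: Sequence[int]) -> float:
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("both raters must score the same non-empty item list")
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement from each rater's marginal category frequencies.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1.0 - expected)


def average_ranks(values: Sequence[float]) -> list[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation computed as Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5
```

In practice, `cohen_kappa` would be applied to two raters' Likert scores on the same answers, and `spearman` to the human scores paired with an automatic metric such as Macro\-Recall\.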
## Appendix C Monitoring Hallucination Propagation
In addition to the five\-node process audit, we monitor how unsupported claims are initiated, propagated, anchored, and modulated by contradictions under different information\-flow stressors\. Unlike conventional hallucination evaluation, which typically checks only whether the final answer contains unsupported content, our protocol traces hallucination behavior throughout the full multi\-turn trajectory\. This allows us to distinguish early fabrication, downstream reuse of hallucinated content, final\-decision contamination, and hallucinations induced or suppressed by explicit contradictions\.
For each scenario $(x_i, s)$ and turn $t$, the judge extracts factual claims from the model response and compares them against the information available up to that turn\. Let $R_{i,t}^{s}$ be the set of released clinical facts, $Z_{i,t}^{s}$ the set of factual claims extracted from the model response, and $H_{i,t}^{s}$ the subset of unsupported claims:

$$H_{i,t}^{s}=\{z\in Z_{i,t}^{s}\mid z\not\preceq R_{i,t}^{s}\}.$$

Here, $z\not\preceq R_{i,t}^{s}$ means that the claim is not supported by the available scenario context, is contradicted by the context, or is inconsistent with the gold\-standard trajectory\. Based on these hallucination events, the judge produces count\-based annotations that are converted into eight ratio\-based metrics grouped into four dimensions: initiation, propagation, anchoring, and hallucination–contradiction interaction\. When the denominator of a metric is zero, the metric is treated as not applicable for that sample and excluded from macro\-averaging\.
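The set definition above can be illustrated with a small sketch\. In the benchmark the support relation is decided by a judge; here a toy exact\-match predicate stands in for it, and the released facts are assumed to accumulate turn by turn\. The names `exact_match_support`, `unsupported_claims`, and `trace_unsupported` are hypothetical\.

```python
from typing import Callable, Dict, List, Set

SupportFn = Callable[[str, Set[str]], bool]


def exact_match_support(claim: str, released: Set[str]) -> bool:
    """Toy stand-in for the judge: supported only if released verbatim."""
    return claim in released


def unsupported_claims(claims: Set[str], released: Set[str],
                       is_supported: SupportFn = exact_match_support) -> Set[str]:
    """H = {z in Z | z is not supported by R} for a single turn."""
    return {z for z in claims if not is_supported(z, released)}


def trace_unsupported(released_by_turn: List[Set[str]],
                      claims_by_turn: List[Set[str]]) -> Dict[int, Set[str]]:
    """Compare each turn's claims with the facts released up to that turn."""
    available: Set[str] = set()
    trace: Dict[int, Set[str]] = {}
    for t, (new_facts, claims) in enumerate(
            zip(released_by_turn, claims_by_turn), start=1):
        available |= new_facts  # R grows as the scenario releases evidence
        trace[t] = unsupported_claims(claims, available)
    return trace
```

With an evidence\-delay stressor, a claim such as a laboratory value counts as unsupported at the turn where it is asserted before release, even if the same value is released later\.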
#### Initiation\.
The initiation dimension measures whether hallucinations are generated when relevant information is absent, withheld, or not yet released\. We use two metrics\. The numerical fabrication ratio \(NFR\) captures fabricated clinical numbers, such as invented vital signs, laboratory values, imaging measurements, or medication doses\. The unsubstantiated fact ratio \(UFR\) captures unsupported non\-numerical factual claims, such as invented medical history, medication use, allergies, procedures, or prior diagnoses:

$$\mathrm{NFR}=\frac{\#\mathrm{fabricated\_numeric\_claims}}{\#\mathrm{total\_numeric\_claims}}, \tag{1}$$

$$\mathrm{UFR}=\frac{\#\mathrm{unsupported\_factual\_claims}}{\#\mathrm{total\_factual\_claims}}. \tag{2}$$
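As a minimal sketch of how the judge's count annotations could be turned into the two initiation ratios, the code below uses the count names from Eqs\. \(1\) and \(2\) as dictionary keys and returns `None` when a denominator is zero, matching the not\-applicable rule stated earlier\. The dictionary layout and the helper names `safe_ratio` and `initiation_metrics` are assumptions made for illustration\.

```python
from typing import Dict, Optional


def safe_ratio(numerator: int, denominator: int) -> Optional[float]:
    """Return numerator / denominator, or None when the metric is not applicable."""
    if denominator == 0:
        return None
    return numerator / denominator


def initiation_metrics(counts: Dict[str, int]) -> Dict[str, Optional[float]]:
    """Compute NFR and UFR from per-sample count annotations."""
    return {
        "NFR": safe_ratio(counts["fabricated_numeric_claims"],
                          counts["total_numeric_claims"]),
        "UFR": safe_ratio(counts["unsupported_factual_claims"],
                          counts["total_factual_claims"]),
    }
```

A sample whose response contains no numeric claims therefore yields `NFR = None` rather than `0.0`, so it does not make the model look safer than the evidence warrants\.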
#### Propagation\.
The propagation dimension measures whether hallucinated content persists across turns or contaminates later reasoning\. The hallucination persistence ratio \(HPR\) measures the proportion of initiated hallucinations that reappear in subsequent turns\. The hallucination cross\-contamination ratio \(HCCR\) measures whether hallucinated claims are reused as premises for later diagnostic reasoning:

$$\mathrm{HPR}=\frac{\#\mathrm{persistent\_hallucinations}}{\#\mathrm{initiated\_hallucinations}}, \tag{3}$$

$$\mathrm{HCCR}=\frac{\#\mathrm{cross\_contamination\_events}}{\#\mathrm{cross\_contamination\_opportunities}}. \tag{4}$$
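One way to read Eq\. \(3\) is sketched below: a hallucination is "initiated" at the first turn where it appears as an unsupported claim and "persistent" if the same claim appears again in any later turn\. Matching claims by exact string is a simplifying assumption \(a judge would also need to match paraphrases\), and the function name is hypothetical\.

```python
from typing import Dict, List, Optional, Set


def hallucination_persistence_ratio(
        unsupported_by_turn: List[Set[str]]) -> Optional[float]:
    """HPR = persistent hallucinations / initiated hallucinations."""
    first_seen: Dict[str, int] = {}
    persistent: Set[str] = set()
    for turn, claims in enumerate(unsupported_by_turn):
        for claim in claims:
            if claim in first_seen:
                persistent.add(claim)     # reappears after its initiating turn
            else:
                first_seen[claim] = turn  # first appearance initiates it
    if not first_seen:
        return None  # no hallucination was initiated: not applicable
    return len(persistent) / len(first_seen)
```

For a three\-turn trajectory in which a fabricated blood pressure is repeated in every turn and a fabricated medication appears once, one of two initiated hallucinations persists, giving an HPR of 0\.5\.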
#### Anchoring\.
The anchoring dimension measures whether hallucinated content becomes fixed in the final diagnostic conclusion\. The definitive hallucination dependency ratio \(DHDR\) measures the fraction of final evidence items that are hallucinated or unsupported\. The critical hallucination omission \(CHO\) measures whether the model omits genuine critical evidence, especially evidence that would contradict or weaken a hallucinated reasoning path:

$$\mathrm{DHDR}=\frac{\#\mathrm{hallucinated\_final\_evidence}}{\#\mathrm{final\_evidence\_items}}, \tag{5}$$

$$\mathrm{CHO}=\frac{\#\mathrm{critical\_hallucination\_omissions}}{\#\mathrm{critical\_evidence\_opportunities}}. \tag{6}$$
#### Hallucination–Contradiction Interaction\.
The final dimension evaluates how hallucinations interact with explicit contradictions\. Thecontradiction\-induced hallucination suppression ratio\(CIHSR\) measures whether contradictions help the model suppress or revise previously exposed hallucinations, for example by asking for clarification, acknowledging inconsistency, or withdrawing unsupported assumptions\. In contrast, thecontradiction\-induced hallucination generation ratio\(CIHGR\) measures whether contradictions trigger new hallucinations, such as fabricated explanations introduced to reconcile inconsistent evidence:
$$\mathrm{CIHSR}=\frac{\#\mathrm{contradiction\_suppressed\_hallucinations}}{\#\mathrm{hallucinations\_exposed\_by\_contradictions}}, \tag{7}$$

$$\mathrm{CIHGR}=\frac{\#\mathrm{contradiction\_induced\_new\_hallucinations}}{\#\mathrm{contradiction\_events}}. \tag{8}$$
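The following sketch shows one way the two contradiction-interaction ratios could be computed from per-event annotations. The `ContradictionEvent` record and its fields are hypothetical, introduced only for this example; the source does not specify the annotation schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ContradictionEvent:
    """Hypothetical per-event annotation; field names are illustrative."""
    exposed: int      # earlier hallucinations the contradiction exposes
    suppressed: int   # of those, how many the model revised or withdrew
    new: int          # new hallucinations introduced to reconcile the conflict


def cihsr(events: list[ContradictionEvent]) -> Optional[float]:
    """Eq. (7): suppressed / exposed hallucinations (higher is better)."""
    exposed = sum(e.exposed for e in events)
    return sum(e.suppressed for e in events) / exposed if exposed else None


def cihgr(events: list[ContradictionEvent]) -> Optional[float]:
    """Eq. (8): new hallucinations per contradiction event (lower is better)."""
    return sum(e.new for e in events) / len(events) if events else None


events = [
    ContradictionEvent(exposed=2, suppressed=1, new=0),
    ContradictionEvent(exposed=0, suppressed=0, new=1),
    ContradictionEvent(exposed=2, suppressed=2, new=0),
    ContradictionEvent(exposed=0, suppressed=0, new=0),
]
print(cihsr(events))  # 3 suppressed out of 4 exposed
print(cihgr(events))  # 1 new hallucination over 4 events
```

Because the numerator of Eq. (8) counts hallucinations while the denominator counts events, this reading allows values above 1 when a single contradiction triggers several fabrications. Whether the benchmark caps or binarizes the count per event is not stated in the source.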
Table [8](https://arxiv.org/html/2606.24155#A3.T8) summarizes the eight hallucination propagation metrics. Except for CIHSR, where a higher value indicates stronger self-correction after contradiction exposure, lower values indicate fewer hallucination-related failures.
Table 8: Hallucination propagation metrics used in the dynamic audit protocol.

For each stress condition, we first compute the above metrics at the sample level and then aggregate them across all valid samples. Undefined cases caused by zero denominators are excluded from the corresponding metric average rather than being counted as zero. This prevents scenarios without hallucination opportunities, contradiction events, or numeric claims from artificially lowering the estimated hallucination risk.
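The exclusion rule for undefined cases can be sketched as below. This is an illustrative reading of the aggregation step, assuming that each sample is a mapping from metric name to value and that an undefined ratio is stored as `None`; both conventions are assumptions of this example, not details from the paper.

```python
from statistics import mean
from typing import Optional

Sample = dict[str, Optional[float]]  # metric name -> value, None if undefined


def aggregate(samples: list[Sample]) -> dict[str, Optional[float]]:
    """Average each metric over the samples where it is defined."""
    names = sorted({name for s in samples for name in s})
    result: dict[str, Optional[float]] = {}
    for name in names:
        # Skip samples whose denominator was zero instead of treating them as 0.0.
        defined = [s[name] for s in samples if s.get(name) is not None]
        result[name] = mean(defined) if defined else None
    return result


samples = [
    {"HPR": 0.5, "CIHGR": None},   # no contradiction events in this sample
    {"HPR": None, "CIHGR": 0.25},  # no hallucination was initiated
    {"HPR": 1.0, "CIHGR": 0.75},
]
print(aggregate(samples))
```

In this toy input, HPR averages to 0.75 over the two samples where it is defined. Counting the undefined sample as zero would instead report 0.5, understating the persistence risk in exactly the way the paragraph above warns against.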
## Appendix D Additional Results
Table 9: Proportion of datasets where each model achieved the top score.

### D.1 Detailed Results of CCR
Table 10: Performance of models on LLM track tasks.

| Task | DeepSeek-V4-Pro | Doubao-Seed-2.0-pro | Qwen3.7-Max-Preview | Kimi-K2.6 | GLM-5.1 | GPT-5.5 | Claude Opus 4.7 | Gemini-3.1-Pro-Preview | Grok-4.20 Beta | MedGemma 1.5 | MedReason | HuatuoGPT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Model size | 1.6T | N/A | N/A | 1T | 744B | N/A | N/A | N/A | N/A | 4B | 8B | 72B |
| MedExam | 90.24 | 95.19 | 90.78 | 92.65 | 92.65 | 91.18 | 87.83 | 91.98 | 86.50 | 45.86 | 41.31 | 85.16 |
| MedHC | 63.65 | 64.94 | 66.21 | 66.05 | 65.35 | 63.48 | 68.22 | 64.81 | 62.48 | 47.87 | 36.98 | 49.51 |
| MedMC | 85.14 | 82.99 | 86.61 | 84.98 | 85.64 | 78.31 | 82.97 | 86.74 | 80.43 | 51.38 | 34.05 | 56.18 |
| MedSpeQA | 75.35 | 75.17 | 77.93 | 76.57 | 77.42 | 75.49 | 77.28 | 76.28 | 74.33 | 53.46 | 43.01 | 51.53 |
| MedHG | 62.00 | 64.00 | 57.50 | 55.00 | 67.50 | 67.50 | 54.50 | 63.50 | 73.50 | 82.50 | 47.50 | 40.50 |
| MedLitQA | 53.78 | 62.67 | 65.65 | 69.48 | 68.36 | 59.74 | 72.53 | 62.11 | 65.32 | 19.79 | 28.20 | 48.47 |
| MedRehab | 62.76 | 62.27 | 64.87 | 63.69 | 64.31 | 60.54 | 65.26 | 63.81 | 62.30 | 40.69 | 34.90 | 42.60 |
| MedRxPlan | 75.94 | 74.94 | 77.22 | 78.60 | 76.45 | 77.24 | 76.71 | 75.26 | 74.25 | 55.57 | 35.87 | 50.74 |
| MedPsychCare | 60.54 | 59.67 | 62.68 | 61.20 | 62.21 | 61.54 | 61.53 | 62.18 | 59.19 | 56.14 | 42.82 | 46.81 |
| MedPsychQA | 68.80 | 68.62 | 72.82 | 69.01 | 72.12 | 68.10 | 70.23 | 71.47 | 68.63 | 53.02 | 43.63 | 49.72 |
| MedRecordGen | 72.98 | 77.36 | 76.87 | 79.11 | 77.60 | 79.69 | 81.01 | 76.38 | 74.68 | 59.08 | 54.71 | 63.54 |
| MedPopular | 81.84 | 76.87 | 78.44 | 80.44 | 77.66 | 76.33 | 84.16 | 79.80 | 81.28 | 63.87 | 59.72 | 63.96 |
| MedSummary | 75.29 | 75.90 | 74.34 | 76.31 | 75.45 | 71.86 | 81.25 | 76.04 | 80.90 | 70.07 | 48.22 | 74.05 |
| MedExplain | 56.88 | 54.39 | 57.43 | 56.26 | 56.79 | 58.28 | 57.47 | 56.86 | 55.46 | 42.45 | 40.05 | 44.59 |
| MedTeach | 98.27 | 96.67 | 99.47 | 100.00 | 99.20 | 99.87 | 100.00 | 96.27 | 98.67 | 38.93 | 66.93 | 72.40 |
| CMB-Clin-extended | 71.01 | 71.28 | 72.22 | 72.17 | 72.33 | 72.10 | 73.08 | 71.38 | 70.27 | 55.97 | 48.84 | 61.97 |
| DDx-advanced | 35.33 | 34.67 | 45.33 | 38.00 | 34.00 | 21.33 | 29.33 | 29.33 | 37.33 | 1.33 | 1.33 | 9.33 |
| MedTreat | 59.09 | 57.47 | 57.95 | 60.95 | 60.00 | 60.96 | 61.86 | 60.59 | 57.02 | 36.22 | 32.92 | 41.54 |
| MedOutcome | 46.67 | 44.73 | 46.40 | 46.87 | 47.07 | 45.53 | 47.99 | 46.67 | 50.12 | 38.55 | 39.96 | 38.44 |
| MedAnalysis | 94.33 | 94.27 | 96.93 | 97.33 | 97.00 | 95.33 | 96.07 | 98.00 | 95.13 | 68.67 | 59.73 | 79.80 |
| MedDiag | 78.27 | 78.64 | 79.26 | 78.80 | 79.53 | 78.09 | 81.29 | 78.63 | 77.06 | 64.42 | 51.32 | 66.84 |
| MedDiffer | 45.24 | 44.40 | 43.77 | 48.34 | 44.40 | 49.42 | 47.98 | 44.18 | 44.32 | 34.40 | 28.09 | 32.66 |
| MedCare | 79.90 | 75.47 | 81.39 | 79.75 | 81.87 | 80.24 | 82.12 | 83.10 | 78.34 | 75.02 | 59.76 | 72.25 |
| MedPrimary | 78.66 | 76.53 | 76.99 | 80.96 | 79.84 | 80.96 | 84.76 | 80.67 | 74.13 | 45.63 | 39.00 | 48.55 |
| MedPHM | 42.04 | 47.17 | 49.34 | 46.60 | 48.63 | 44.35 | 47.36 | 46.61 | 43.76 | 33.29 | 16.84 | 27.82 |
| SMDoc | 67.97 | 70.47 | 71.98 | 72.03 | 67.31 | 69.21 | 71.81 | 71.95 | 72.52 | 35.23 | 16.40 | 68.03 |
| MedRxCheck | 40.51 | 39.69 | 41.07 | 38.75 | 39.44 | 39.70 | 37.12 | 39.67 | 37.71 | 6.65 | 6.09 | 26.26 |
| MedInsureCheck | 67.86 | 63.73 | 67.08 | 64.84 | 64.39 | 72.50 | 74.70 | 68.72 | 73.21 | 63.11 | 62.20 | 65.74 |
| MedInsureCalc | 75.62 | 80.38 | 76.19 | 77.05 | 75.81 | 76.29 | 76.29 | 76.29 | 78.67 | 28.67 | 30.57 | 74.57 |
| MedChartQC | 44.29 | 48.94 | 55.46 | 54.69 | 53.69 | 52.30 | 55.65 | 54.01 | 52.23 | 27.46 | 19.01 | 28.24 |
| MedReportQC | 62.94 | 66.51 | 65.40 | 65.84 | 63.33 | 63.88 | 66.16 | 66.54 | 64.51 | 38.14 | 31.03 | 51.21 |
| MedPathQC | 66.56 | 58.33 | 70.99 | 68.28 | 70.73 | 70.43 | 74.09 | 69.43 | 66.73 | 46.37 | 42.34 | 42.89 |
| MedTerm | 71.62 | 71.73 | 72.61 | 71.43 | 72.51 | 71.05 | 71.74 | 71.81 | 69.02 | 54.57 | 45.62 | 51.38 |
| MedSynonym | 70.20 | 81.27 | 82.07 | 74.80 | 77.13 | 73.73 | 77.47 | 80.87 | 70.67 | 38.93 | 32.67 | 59.00 |
| MedSafety | 47.33 | 54.67 | 62.67 | 61.33 | 46.67 | 68.67 | 58.67 | 67.33 | 47.33 | 3.33 | 0.00 | 39.33 |
| MedEthics | 42.00 | 44.00 | 50.00 | 46.67 | 42.67 | 50.00 | 53.33 | 60.67 | 49.33 | 5.33 | 0.00 | 40.67 |

Table 11: Performance of models on multimodal medical tasks.

| Task | DeepSeek-V4-Pro | Doubao-Seed-2.0-pro | Qwen3.7-Plus | Kimi-K2.6 | GLM-5.1 | GPT-5.5 | Claude Opus 4.7 | Gemini-3.5-Flash | Grok-4.20 Beta | MedGemma 1.5 |
|---|---|---|---|---|---|---|---|---|---|---|
| Model size | 1600B | N/A | 35B | 1000B | 744B | N/A | N/A | N/A | N/A | 4B |
| MedDetect | 8.74 | 27.63 | 12.89 | 8.88 | 2.13 | 10.64 | 4.59 | 27.87 | 4.98 | 11.38 |
| MedClass | 57.86 | 65.71 | 68.57 | 67.86 | 55.00 | 65.71 | 72.86 | 69.29 | 9.29 | 3.57 |
| MedOCR | 1.00 | 65.33 | 64.46 | 60.65 | 0.97 | 68.37 | 64.95 | 57.95 | 59.79 | 20.77 |
| MedVQA | 32.90 | 47.11 | 44.57 | 32.15 | 45.38 | 32.93 | 38.74 | 40.70 | 26.42 | 23.43 |
| MedGen | 35.20 | 41.25 | 41.30 | 40.44 | 37.12 | 42.16 | 42.85 | 43.08 | 37.48 | 23.16 |
| MedQC | 11.85 | 16.34 | 13.95 | 15.12 | 13.25 | 14.35 | 12.58 | 7.35 | 17.77 | 0.86 |
| MedSeqIm | 36.12 | 42.14 | 60.66 | 41.99 | 40.99 | 42.93 | 42.80 | 42.36 | 39.81 | 25.52 |
| MedDiffDx | 64.08 | 65.97 | 41.99 | 48.79 | 65.80 | 63.99 | 67.61 | 63.05 | 60.02 | 46.90 |
| MedTherapy | 57.54 | 56.46 | 42.88 | 65.75 | 59.99 | 59.24 | 62.92 | 58.52 | 56.73 | 46.45 |
| MedCourse | 69.18 | 67.09 | 67.29 | 69.25 | 73.22 | 71.07 | 72.35 | 68.72 | 68.50 | 57.39 |
| MedRealMM | 81.93 | 82.74 | 60.59 | 68.19 | 84.97 | 86.22 | 53.15 | 84.20 | 5.12 | 46.15 |
| Med3DMTVQA | 37.74 | 42.02 | 71.67 | 37.74 | 29.33 | 43.22 | 40.31 | 45.11 | 35.16 | 31.05 |

Table 12: Performance of models on agent track tasks.

| Task | DeepSeek-V4-Pro | Doubao-Seed-2.0-pro | Qwen3.7-Max-Preview | Kimi-K2.6 | GLM-5.1 | GPT-5.5 | Claude Opus 4.7 | Gemini-3.1-Pro-Preview | Grok-4.20 Beta | MedGemma 1.5 | MedReason | HuatuoGPT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Model size | 1.6T | N/A | N/A | 1T | 744B | N/A | N/A | N/A | N/A | 4B | 8B | 72B |
| MedDecomp | 95.47 | 96.53 | 98.13 | 99.33 | 97.47 | 96.13 | 99.73 | 93.20 | 97.33 | 79.20 | 69.33 | 72.27 |
| MedPathPlan | 94.67 | 92.67 | 94.27 | 92.40 | 85.07 | 95.33 | 93.20 | 90.40 | 89.47 | 82.00 | 66.00 | 77.47 |
| MedCOT | 99.80 | 99.80 | 98.80 | 100.00 | 99.90 | 99.90 | 99.90 | 95.50 | 100.00 | 93.00 | 76.30 | 80.30 |
| MedReflect | 96.20 | 92.50 | 96.60 | 99.70 | 96.30 | 99.00 | 99.30 | 94.60 | 96.60 | 77.90 | 66.10 | 73.20 |
| MedRetAPI | 98.68 | 95.13 | 99.21 | 99.87 | 97.50 | 95.00 | 99.47 | 96.97 | 89.74 | 81.71 | 77.24 | 79.61 |
| MedCallAPI | 82.52 | 95.23 | 89.93 | 96.29 | 91.26 | 89.67 | 94.44 | 92.45 | 91.79 | 85.43 | 77.35 | 88.74 |
| MedIntentID | 75.17 | 81.21 | 91.95 | 83.89 | 91.28 | 90.60 | 86.58 | 91.95 | 86.58 | 0.00 | 0.00 | 69.80 |
| MedRoleAdapt | 98.80 | 89.50 | 98.70 | 99.60 | 97.20 | 97.20 | 98.40 | 98.10 | 97.40 | 75.00 | 75.00 | 79.80 |
| MedLongConv | 98.80 | 90.50 | 98.90 | 99.00 | 98.10 | 96.70 | 99.20 | 99.10 | 97.70 | 73.50 | 57.50 | 79.90 |
| MedLongQA | 84.27 | 83.87 | 86.67 | 92.00 | 89.73 | 86.53 | 93.07 | 86.40 | 94.00 | 74.67 | 72.53 | 80.40 |
| MedCollab | 100.00 | 99.87 | 100.00 | 99.33 | 100.00 | 99.87 | 100.00 | 100.00 | 100.00 | 79.87 | 67.73 | 77.07 |
### D.2 Detailed Results of the Stressor Audit
Table 13: Process-audit evaluation results on the LLM track under different information-flow stress conditions. Metrics are grouped by the five audit nodes. Upward arrows indicate higher-is-better metrics, while downward arrows indicate lower-is-better metrics.

Table 14: Process-audit evaluation results on the agent track under different information-flow stress conditions. Metrics are grouped by the five audit nodes. Upward arrows indicate higher-is-better metrics, while downward arrows indicate lower-is-better metrics.
### D.3 Detailed Results of Hallucination Propagation Monitoring
Figure 5: Delta-to-baseline heatmaps of hallucination propagation metrics under stress conditions. Each heatmap shows the performance change ($\Delta$) of eight stress conditions (rows) relative to the baseline (first row). For metrics where higher is better ($\uparrow$), $\Delta=\text{metric}_{\text{stress}}-\text{metric}_{\text{baseline}}$; for metrics where lower is better ($\downarrow$), $\Delta=\text{metric}_{\text{baseline}}-\text{metric}_{\text{stress}}$. Positive $\Delta$ (blue) indicates gain, negative $\Delta$ (red) indicates loss; baseline $\Delta$ is always zero.

Table 15: Hallucination propagation evaluation results on the LLM track under different information-flow stress conditions. Metrics are grouped by the four hallucination dimensions.

Table 16: Hallucination propagation evaluation results on the multimodal track under different information-flow stress conditions. Metrics are grouped by the four hallucination dimensions.

Table 17: Hallucination propagation evaluation results on the agent track under different information-flow stress conditions. Metrics are grouped by the four hallucination dimensions.
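The sign convention in the Figure 5 caption can be sketched as a small helper, so that a positive delta always means improvement regardless of the metric's direction. The function name and the `HIGHER_IS_BETTER` set are illustrative choices for this example; the set's contents follow the statement in Appendix C that CIHSR is the only higher-is-better hallucination metric.

```python
HIGHER_IS_BETTER = {"CIHSR"}  # all other hallucination metrics are lower-is-better


def delta_to_baseline(metric: str, baseline: float, stress: float) -> float:
    """Signed change under stress, oriented so positive means gain."""
    if metric in HIGHER_IS_BETTER:
        return stress - baseline
    return baseline - stress


# Self-correction drops under stress: a loss, so the delta is negative.
print(delta_to_baseline("CIHSR", baseline=0.5, stress=0.25))
# Persistence rises under stress: also a loss, so the delta is negative too.
print(delta_to_baseline("HPR", baseline=0.25, stress=0.5))
# The baseline row compared with itself is always zero.
print(delta_to_baseline("HPR", baseline=0.5, stress=0.5))
```

Orienting both directions the same way lets a single color scale (blue for gain, red for loss) be read consistently across all columns of the heatmap.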