
# MEDSYN: Benchmarking Multi-Evidence Synthesis in Complex Clinical Cases for Multimodal Large Language Models

Source: https://arxiv.org/html/2602.21950

Boqi Chen¹,², Xudong Liu³,², Jiachuan Peng²,⁴, Marianne Frey-Marti⁵, Kyle Lam⁶, Bang Zheng⁷, Lin Li⁴, Jianing Qiu²

¹ETH Zurich, ²MBZUAI, ³Amazon, ⁴University of Oxford, ⁵University of Bern, ⁶Imperial College London, ⁷Peking University

²Equal contribution. Correspondence: [email protected]

## Abstract

Multimodal large language models (MLLMs) have shown great potential in medical applications, yet existing benchmarks inadequately capture real-world clinical complexity. We introduce MEDSYN, a multilingual, multimodal benchmark of highly complex clinical cases with up to 7 distinct visual clinical evidence (CE) types per case. Mirroring clinical workflow, we evaluate 18 MLLMs on differential diagnosis (DDx) generation and final diagnosis (FDx) selection. While frontier models often match or even outperform human experts on DDx generation, all MLLMs exhibit a much larger DDx–FDx performance gap compared to expert clinicians, indicating a failure mode in synthesis of heterogeneous CE types. Ablations attribute this failure to (i) overreliance on less discriminative textual CE (*e.g.*, medical history) and (ii) a cross-modal CE utilization gap. We introduce *Evidence Sensitivity* to quantify the latter and show that a smaller gap correlates with higher diagnostic accuracy. Finally, we demonstrate how it can be used to guide interventions to improve model performance.

![MEDSYN Benchmark Overview](https://arxiv.org/html/2602.21950v3/figures/MEDSYN.png)

## 1 Introduction

![Figure 1a](figure1a) ![Figure 1b](figure1b)

**Figure 1:** (a) Clinicians curate a broad differential diagnosis (DDx) list before determining a final diagnosis (FDx) via evidence synthesis. (b) Models exhibit a substantial gap between DDx coverage rate and FDx accuracy, far exceeding that observed in human experts.

Multimodal Large Language Models (MLLMs) have demonstrated great potential in advancing clinical applications, yet the benchmarks used to evaluate them remain limited and fragmented. Early benchmarks primarily target single-image visual question answering (VQA) such as basic object recognition. More recent efforts move toward more realistic settings by requiring integrative reasoning over multiple images, but several limitations persist.

First, despite including multiple images, individual questions draw all of their images from a single clinical-evidence (CE) type, *e.g.*, cross-sectional CT scans. In clinical practice, clinicians typically leverage heterogeneous CE types, spanning laboratory tests, imaging from multiple modalities, microscopy images, and even omics data. This is particularly relevant for accurate diagnosis in complex clinical cases, such as those involving multimorbidity. While MedXpertQA MM contains a multi-CE subset, it constitutes only a small fraction of the benchmark, with an average of 2.74 CE types per case.

Second, most benchmarks emphasize selecting a final diagnosis (FDx). However, real diagnostic workflows typically begin with generating a differential diagnosis (DDx), *i.e.*, a set of plausible conditions consistent with findings from one or more CE types, and then determine the FDx by synthesizing evidence across all available CE types.

Finally, most existing benchmarks are English-only, limiting the assessment of MLLMs' multilingual capabilities in clinical contexts.

To address these limitations, we introduce MEDSYN, a multilingual, multimodal benchmark of complex clinical cases, where each question contains, on average, **3.97 CE types** and **8.42 images** drawn from up to **7 distinct CE types**. Mirroring real-world diagnostic workflows, our benchmark evaluates models on two tasks: (i) **DDx generation** and (ii) **FDx selection**.

We benchmark 18 state-of-the-art MLLMs, including both general-purpose models (proprietary and open-source) and domain-specific medical models, with open-source scales ranging from 2B to 78B parameters.

We further conduct two ablation studies using two medical MLLMs and their base model: (i) perturbing textual CE by removing it or replacing it with length-matched random token strings; and (ii) a leave-one-out analysis in which each CE type is withheld in turn and the resulting update in the model's answer posterior is measured, comparing the same CE type provided as raw images against the expert-derived diagnostic findings extracted from it.
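As an illustration of ablation (i), below is a minimal sketch of the length-matched random-token replacement, assuming a HuggingFace tokenizer; the paper does not specify the tokenizer or sampling scheme, so both are placeholders here.

```python
import random

from transformers import AutoTokenizer


def randomize_textual_ce(text: str, tokenizer, seed: int = 0) -> str:
    """Replace textual CE with a length-matched string of random tokens."""
    rng = random.Random(seed)
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    # Sample the same number of token ids uniformly from the vocabulary so the
    # perturbed input preserves sequence length but carries no clinical signal.
    random_ids = [rng.randrange(tokenizer.vocab_size) for _ in token_ids]
    return tokenizer.decode(random_ids, skip_special_tokens=True)


# Hypothetical usage; the tokenizer checkpoint is illustrative, not from the paper.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
history = "A 62-year-old man with a 3-week history of progressive dyspnea."
print(randomize_textual_ce(history, tokenizer))
```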

Our key findings are summarized as follows:

- **1.** We show that leading models outperform expert clinicians on DDx generation but underperform on FDx selection, suggesting a capability gap between identifying plausible conditions from heterogeneous CE types and synthesizing them into a singular, accurate diagnosis. This gap varies by language, highlighting concerning cross-lingual disparities.

- **2.** We find that MLLMs overrely on textual inputs, skewing evidence weighting toward less discriminative CE (*e.g.*, medical history). Removing such evidence increases attention to image tokens and, counterintuitively, improves diagnostic accuracy despite less CE being available.

- **3.** We demonstrate that visual understanding remains a major bottleneck: cross-modal misalignment distorts how MLLMs calibrate different CE types, yielding a *cross-modal CE utilization gap*. We introduce a novel metric, termed *Evidence Sensitivity*, to quantify this gap and show that a smaller gap correlates with higher diagnostic accuracy. We further show that this metric provides actionable guidance for targeted interventions to improve model performance.
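This excerpt does not spell out the formula for *Evidence Sensitivity*, but the description of the leave-one-out analysis (a shift in the answer posterior, computed separately for raw-image and textual-findings inputs) suggests something like the sketch below; the total-variation distance and the absolute-difference gap are our assumptions, not the paper's definition.

```python
import numpy as np


def evidence_sensitivity(p_full: np.ndarray, p_without: np.ndarray) -> float:
    """Leave-one-out sensitivity to one CE type: the shift in the answer
    posterior over MCQ options when that CE type is withheld, measured
    here as total-variation distance (an assumed choice of divergence)."""
    return 0.5 * float(np.abs(p_full - p_without).sum())


# Posteriors under the image condition (CE given as raw images) and the text
# condition (same CE given as expert-derived findings); numbers are toy values.
p_img_full, p_img_wo = np.array([0.40, 0.30, 0.20, 0.10]), np.array([0.35, 0.32, 0.21, 0.12])
p_txt_full, p_txt_wo = np.array([0.70, 0.15, 0.10, 0.05]), np.array([0.30, 0.40, 0.20, 0.10])

s_image = evidence_sensitivity(p_img_full, p_img_wo)  # small: images barely used
s_text = evidence_sensitivity(p_txt_full, p_txt_wo)   # large: text findings drive the answer
gap = abs(s_image - s_text)  # cross-modal CE utilization gap for this CE type
print(f"image {s_image:.3f}, text {s_text:.3f}, gap {gap:.3f}")
```

Under this reading, a smaller gap means the model extracts comparable diagnostic signal from a CE type whether it arrives as pixels or as prose.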

## 2 Related Work

| Benchmark | # Images | Multilingual | Avg. # Images per Case | Avg. # Evidence Types per Case |
|---|---|---|---|---|
| VQA-Rad | 204 | ✗ | 1 | 1 |
| VQA-Med | 500 | ✗ | 1 | 1 |
| Slake | 96 | ✓ | 1 | 1 |
| PMC-VQA | 29k | ✗ | 1 | 1 |
| OmniMedVQA | 118k | ✗ | 1 | 1 |
| GMAI-MMBench | 21k | ✗ | 1 | 1 |
| MedXpertQA MM | 2.8k | ✗ | 2.1 | 2.74 |
| MEDSYN | 3.6k | ✓ | 8.42 | 3.97 |

**Table 1:** Comparison of MEDSYN with existing multimodal medical benchmarks.

### Multimodal Large Language Models

General-purpose MLLMs have demonstrated remarkable zero-shot capabilities on medical tasks, including diagnosing complex clinical cases, owing to the extensive clinical knowledge encoded in their LLM backbones. To further improve performance, recent work has explored domain-specific adaptation by fine-tuning MLLMs on specialized medical data spanning diverse CE types.

Despite these advances, current MLLMs remain prone to hallucinations and biases that impede safe deployment in real-world clinical environments. These challenges highlight the need for comprehensive benchmarks that evaluate models' ability in complex and realistic diagnostic settings.

### Multimodal Medical Benchmarks

Existing medical benchmarks fall short of reflecting real-world clinical demands. Early benchmarks primarily target single-image VQA in narrow domains such as radiology and pathology. Recently, more generalized benchmarks such as PMC-VQA, OmniMedVQA, and GMAI-MMBench have been proposed to assess MLLMs' capabilities across diverse CE types. For instance, OmniMedVQA covers 12 CE types, including multiple imaging modalities (*e.g.*, CT, MRI, X-ray, ultrasound), microscopy, and specialized imaging such as colonoscopy.

However, individual questions in these benchmarks remain isolated single-image snapshots tied to a single CE type. Although MedXpertQA MM introduces multi-image, multi-CE VQA, this constitutes a small fraction of the benchmark and includes a limited number of CE types per question.

Finally, most benchmarks are English-only, limiting evaluation of multilingual capabilities of MLLMs in clinical contexts. A detailed comparison is provided in Table 1.

## 3 Benchmark

![Figure 2](figure2)

**Figure 2:** Example final diagnosis selection tasks in English (top) and Chinese (bottom). Colors mark different visual clinical evidence (CE) types referenced in the question, with corresponding expert-derived diagnostic findings; gray denotes textual CE. In our experiment, each CE type is input as either raw images or text findings, not both.

### 3.1 Data Collection and Preprocessing

#### Data Collection

We collect consecutive English cases from the *New England Journal of Medicine Case Record* series and Chinese case reports published in the *National Medical Journal of China* from November 2015 to October 2025. For each case, we extract background and initial discussion up to DDx, along with all referenced tables and figures.

In particular, the background contains textual CE including medical history and physical examination findings. The referenced tables and figures comprise visual CE from diagnostic investigations (*i.e.*, objective, technical tests ordered by clinicians), such as laboratory studies, diagnostic imaging (*e.g.*, CT, MRI and X-ray), microscopy and electrophysiological measurements (*e.g.*, EEG).

Finally, the initial discussion presents the expert-derived diagnostic findings from visual CE. The ground-truth (GT) FDx is taken from the FDx section or, if not explicitly stated, determined by a clinician based on the full report.

All collected cases are subject to expert validation, and cases are excluded if (i) any individual visual CE is not diagnostically interpretable (*e.g.*, low image quality, incomplete anatomical coverage, or artifacts), or (ii) the complete set of CE is diagnostically insufficient (*i.e.*, the clinician is not able to confirm the FDx from all provided evidence).

After validation, we obtain 452 cases in total (398 in English and 54 in Chinese).

**Figure 3** summarizes visual CE type distribution **(a)** and counts per case **(b)**.

#### Data Preprocessing

For each case, we manually crop and label images by their CE type. For the same evidence acquired at different time points, we add a reference to when each was obtained. Image captions are added to the discussion, from which we employ GPT-5 to extract and organize expert-derived diagnostic findings by CE type through in-context learning. This establishes a one-to-one mapping from visual CE to textual interpretation.

Since a single CE can be reviewed by multiple clinicians, we utilize GPT-5 to summarize individual interpretations into a singular, cohesive summary to mitigate redundancy. The template prompts and examples of processed data are in Appendix A.2.
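A minimal sketch of this extraction step follows, assuming the OpenAI Python client; the actual template prompts (with in-context examples) are in Appendix A.2, so the prompt wording and the `gpt-5` model identifier here are illustrative.

```python
from openai import OpenAI

client = OpenAI()

EXTRACTION_PROMPT = (
    "You are given the discussion of a clinical case report, including image "
    "captions. For the clinical evidence type '{ce_type}', extract the "
    "expert-derived diagnostic findings and merge interpretations from "
    "multiple clinicians into a single cohesive summary without redundancy. "
    "Report findings only; do not state a diagnosis."
)


def extract_findings(discussion: str, ce_type: str) -> str:
    # In practice the prompt would also carry in-context examples of
    # correctly extracted findings (see Appendix A.2 of the paper).
    response = client.chat.completions.create(
        model="gpt-5",  # identifier assumed from the model name in the paper
        messages=[
            {"role": "system", "content": EXTRACTION_PROMPT.format(ce_type=ce_type)},
            {"role": "user", "content": discussion},
        ],
    )
    return response.choices[0].message.content
```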

Finally, we conduct a manual quality check by cross-referencing the processed data with the original reports to ensure completeness and mitigate hallucinations, and a randomly selected 20% subset is further validated by clinicians.

![Figure 3a](figure3a) ![Figure 3b](figure3b)

**Figure 3:** (a) Distribution of visual clinical evidence (CE) types. (b) Number of visual CE types per case.

### 3.2 Benchmark Design

To mirror real-world clinical workflow, we design two distinct tasks: (i) **DDx generation**, which involves information gathering and hypothesis generation, and (ii) **FDx selection**, which requires evidence synthesis and diagnostic verification.

#### DDx Generation

DDx generation is an open-ended task in which MLLMs are asked to present a differential list ranked from the most probable to the least probable diagnosis.

#### FDx Selection

For FDx selection, we construct closed-ended multiple-choice questions (MCQs) where the single correct answer is the GT FDx and distractors are drawn from the model-generated DDx. We employ GPT-5 to generate a differential list and then select distractors according to two criteria:

(i) distractors must exclude the FDx and any synonym or wording variant thereof;

(ii) distractors should be clinically proximate to the FDx (*e.g.*, a closely related histologic subtype).

We apply an adversarial refinement process where we iteratively identify and eliminate potential shortcuts in how models distinguish correct from incorrect diagnoses. Finally, the MCQs are reviewed by expert clinicians based on content validity and clinical relevance to ensure that discriminative CE explicitly disqualifies all distractors, thereby eliminating diagnostic ambiguity among the options.
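To make criterion (i) concrete, here is a minimal sketch of distractor selection from a model-generated DDx list; the string-normalization synonym check is a crude stand-in for the GPT-5 filtering and expert review the paper actually uses, and the diagnoses are invented for illustration.

```python
import re


def normalize(dx: str) -> str:
    """Crude normalization for wording-variant matching; the paper relies on
    GPT-5 plus expert review rather than string matching."""
    return re.sub(r"[^a-z0-9 ]", "", dx.lower()).strip()


def select_distractors(ddx_ranked: list[str], fdx: str,
                       fdx_synonyms: set[str], k: int = 3) -> list[str]:
    """Pick the top-k candidates from a ranked DDx list that are not the FDx
    or a known synonym of it. Because the DDx is ranked from most to least
    probable, top candidates tend to be clinically proximate to the FDx."""
    excluded = {normalize(fdx)} | {normalize(s) for s in fdx_synonyms}
    distractors = [dx for dx in ddx_ranked if normalize(dx) not in excluded]
    return distractors[:k]


# Illustrative example (diagnoses invented for demonstration).
ddx = ["IgG4-related disease", "Sarcoidosis", "Lymphoma",
       "IgG4 related disease", "Tuberculosis"]
print(select_distractors(ddx, fdx="IgG4-related disease",
                         fdx_synonyms={"IgG4 related disease"}))
# -> ['Sarcoidosis', 'Lymphoma', 'Tuberculosis']
```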

**Figure 2** illustrates two cases from FDx selection tasks with color-coded CE types. Each visual CE is paired with an expert-derived summary of diagnostic findings from it. During inference, a given CE type is provided as either visual or textual input, not both. Examples of DDx generation tasks are shown in Appendix A.3.

#### Metrics

For DDx generation, we follow an established automated evaluation framework and employ GPT-5 as a judge to score generated DDx on a 0–5 scale. We report the coverage rate, defined as the percentage of cases whose DDx includes the FDx, treating scores ≥ 4 (*i.e.*, the DDx comprises the exact FDx or a highly synonymous condition) as positive coverage, a threshold set on clinicians' advice.

For FDx selection, we report overall accuracy.
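A minimal sketch of the two reported metrics as described above; the judge scores and answers below are illustrative only.

```python
def coverage_rate(judge_scores: list[int], threshold: int = 4) -> float:
    """DDx coverage rate: fraction of cases whose GPT-5 judge score on the
    0-5 scale is at least `threshold`, i.e. the generated DDx contains the
    FDx or a highly synonymous condition."""
    return sum(s >= threshold for s in judge_scores) / len(judge_scores)


def fdx_accuracy(predictions: list[str], answers: list[str]) -> float:
    """FDx selection accuracy over multiple-choice answers."""
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)


print(coverage_rate([5, 4, 2, 3, 5]))                  # -> 0.6
print(fdx_accuracy(["B", "A", "C"], ["B", "D", "C"]))  # -> 0.666...
```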

## 4 Experiments

### 4.1 Evaluation

We evaluate a diverse set of MLLMs. Proprietary models include representative GPT and Gemini variants as well as Claude 4.5 Opus. Open-source models span 2B–72B parameters, covering widely used families such as Qwen, InternVL, and DeepSeek-VL. We also include three medical MLLMs: HuatuoGPT, Lingshu, and Med-Mantis.

The evaluation is conducted with the VLMEvalKit framework on 8 NVIDIA A6000 GPUs. All models are evaluated in a zero-shot setting. We also recruit two senior physicians to evaluate the English subset of the benchmark.

### 4.2 Main Results

| Type | Model | English DDx Cov. (%) | English FDx Acc. (%) | Chinese DDx Cov. (%) | Chinese FDx Acc. (%) | Overall DDx Cov. (%) | Overall FDx Acc. (%) |
|---|---|---|---|---|---|---|---|
| Expert | Senior Physician | 77.13 | 72.11 | — | — | — | — |

**Table 2:** DDx coverage rate and FDx selection accuracy on the English and Chinese subsets.
