Beyond Scalar Scores: Exploring LLM-based Metrics for Clinical Significance Evaluation in Radiology Reports

arXiv cs.CL 06/18/26, 04:00 AM Papers
Summary
This paper investigates LLM-based metrics for evaluating clinical significance in radiology report generation. It identifies discrimination bias in existing LLM evaluators and proposes training lightweight interpretable metrics to improve the balance between error detection and tolerance of harmless variations.
arXiv:2606.18797v1 Announce Type: new Abstract: Reliable evaluation of generated radiology reports requires strict clinical accuracy, as omitted critical findings or mischaracterized radiographic observations can directly affect patient care. Existing metrics obscure this requirement by reducing report quality to a medically ungrounded scalar. Although Large Language Models (LLMs) possess rich medical knowledge, they likewise struggle to draw a reliable boundary between clinically significant errors and harmless variation. We study this boundary using ReEvalMed benchmark as testbed and evaluate metric-level clinical significance from detecting true clinical errors ("Discrimination") and tolerating insignificant variations ("Robustness"). Across 8 LLM evaluators under one-pass and two-pass settings, we identify a widespread discrimination bias: models effectively detect errors but also over-penalize harmless rephrasings. To mitigate this, we synthesize 4k report pairs and train lightweight interpretable metrics on Qwen3-8B and MedGemma-4B. Our trained metric sharpens the clinical significance boundary, surpassing 32B-scale medical LLMs and remaining competitive with proprietary models. Crucially, the more costly two-pass setting fails to consistently improve overall performance and mainly trades discrimination for robustness. These findings suggest one-pass trained metrics as the practical choice for cost-sensitive deployment, with two-pass inference reserved for settings where D-R balance is critical. We will release the dataset and metric.
Original Article
View Cached Full Text
Cached at: 06/18/26, 05:46 AM
# Beyond Scalar Scores: Exploring LLM-based Metrics for Clinical Significance Evaluation in Radiology Reports
Source: [https://arxiv.org/html/2606.18797](https://arxiv.org/html/2606.18797)
Qingyu Lu1Ruochen Li211footnotemark:1Liang Ding3Yufei Xia4Youxiang Zhu5Dacheng Tao1 1Nanyang Technological University2Technical University of Munich3Alibaba 4University of Glasgow5University of Massachusetts Boston qingyu\.lu\.ai@gmail\.com

###### Abstract

Reliable evaluation of generated radiology reports requires strict clinical accuracy, as omitted critical findings or mischaracterized radiographic observations can directly affect patient care\. Existing metrics obscure this requirement by reducing report quality to a medically ungrounded scalar\. Although Large Language Models \(LLMs\) possess rich medical knowledge, they likewise struggle to draw a reliable boundary between clinically significant errors and harmless variation\. We study this boundary usingReEvalMedbenchmark as testbed and evaluate metric\-level clinical significance from detecting true clinical errors \("*Discrimination*"\) and tolerating insignificant variations \("*Robustness*"\)\. Across 8 LLM evaluators under one\-pass and two\-pass settings, we identify a widespread*discrimination bias*: models effectively detect errors but also over\-penalize harmless rephrasings\. To mitigate this, we synthesize 4k report pairs and train lightweight interpretable metrics on Qwen3\-8B and MedGemma\-4B\. Our trained metric sharpens the clinical significance boundary, surpassing 32B\-scale medical LLMs and remaining competitive with proprietary models\. Crucially, the more costly two\-pass setting fails to consistently improve overall performance and mainly trades discrimination for robustness\. These findings suggest one\-pass trained metrics as the practical choice for cost\-sensitive deployment, with two\-pass inference reserved for settings where D–R balance is critical\. We will release the dataset and metric\.

Beyond Scalar Scores: Exploring LLM\-based Metrics for Clinical Significance Evaluation in Radiology Reports

Qingyu Lu1††thanks:Equal contribution\.Ruochen Li211footnotemark:1Liang Ding3Yufei Xia4Youxiang Zhu5Dacheng Tao11Nanyang Technological University2Technical University of Munich3Alibaba4University of Glasgow5University of Massachusetts Bostonqingyu\.lu\.ai@gmail\.com

## 1Introduction

![Refer to caption](https://arxiv.org/html/2606.18797v1/x1.png)Figure 1:Discrimination–Robustness accuracy of 8 LLM\-as\-evaluators, 2 medical metrics, and our post\-trained models on ReEvalMed\.Discrimination bias: most LLMs fall below the diagonal \(“D\>\>R”\)\.Radiology reports describe imaging findings\(e\.g\., lesion characteristics, anatomical abnormalities\) that directly guide clinical diagnosis and treatmentTannoet al\.\([2025](https://arxiv.org/html/2606.18797#bib.bib34)\), making clinically faithful evaluation essential for generated reports\. Although vision\-language models \(VLMs\) can generate such reports automatically from medical imagesBannuret al\.\([2024](https://arxiv.org/html/2606.18797#bib.bib32)\); Chenet al\.\([2024](https://arxiv.org/html/2606.18797#bib.bib33)\), reliable evaluation remains an open challenge\. A clinician\-trusted evaluation metric should detect errors that could significantlyaffect clinical decisions while tolerating stylistic or clinically insignificant variation\. However, as recent studyLiet al\.\([2025a](https://arxiv.org/html/2606.18797#bib.bib12)\)demonstrate, both traditional lexical metrics and LLM\-based medical report metrics that correlate strongly with human judgement \(e\.g\. GREENOstmeieret al\.\([2024](https://arxiv.org/html/2606.18797#bib.bib27)\)and RaTEScoreZhaoet al\.\([2024](https://arxiv.org/html/2606.18797#bib.bib30)\)\) often fail to separate clinically significant errors from minor variations\. This boundary blurring undermines their reliability as indicators of clinical acceptability\.

LLMs offer a path toward the interpretable evaluation that prior metrics lack: beyond scalar scoring, they can identify error spans, classify clinical error aspects, and assess clinical significance, thereby providing fine\-grained feedback\. However, empirical validation of LLMs as radiology report evaluators remains limited, and it is unclear whether their interpretive capacity translates into reliable clinical judgment\. To this end, we study this question from two perspectives: the zero\-shot behavior of current open\-source and proprietary LLMs, and whether targeted data augmentation and fine\-tuning can close the performance gap\.

Motivated by recent LLM\-based evaluators for generated texts\(Luet al\.,[2024](https://arxiv.org/html/2606.18797#bib.bib37); Kocmi and Federmann,[2023](https://arxiv.org/html/2606.18797#bib.bib38)\), we design one\-pass and two\-pass prompts for radiology evaluation \(Figure[2](https://arxiv.org/html/2606.18797#S1.F2)\), where the former directly outputs structured error annotations and the latter separates ➀ error span detection from ➁ clinical significance judgment\. As shown in Figure[1](https://arxiv.org/html/2606.18797#S1.F1), we observe a consistentdiscrimination bias:current LLM evaluators struggle to distinguish clinically significant discrepancies from harmless report variations, resulting in high discrimination but low robustness accuracy\.

![Refer to caption](https://arxiv.org/html/2606.18797v1/x2.png)Figure 2:Score\-based metrics \(e\.g\. BERTScore\) assign comparable scores to clinically significant errors and harmless rephrasings\. Our LLM\-as\-evaluator outputs structured severity judgments, better distinguishing significant errors from insignificant variations\.To mitigate such discrimination bias and better delineate the boundary of clinical significance, we synthesize a balanced dataset reports annotated along 12 aspects and three error types \(omission, fabrication, inaccuracy error\) with Claude Sonnet, with clinician verification to ensure data quality\. To further validate the effectiveness of our synthesized dataset, we train a radiology report evaluation metric based on Qwen3\-8B and MedGemma\-4B\. Our trained metric significantly improve the clinical significance boundary, outperforming 32B\-scale medical LLMs and remaining competitive with proprietary LLMs\. We further find that two\-pass inference does not eliminate errors, but instead redistributes them between discrimination and robustness\. Our contributions are three\-fold:

- •Analysis of LLM\-based Evaluators\.We systematically evaluate 11 score\-based metrics and 8 LLM\-based metrics on ReEvalMed using one\-pass and two\-pass LLM prompting strategies, analyze the error pattern and revealing a consistent discrimination bias among most LLM metrics\.
- •Clinically grounded data synthesis\.We synthesize a balanced dataset of 4k radiology report pairs covering all 12 aspects of ReEvalMed’s error taxonomy \(spanning omission, fabrication, and factual error\), annotated at the span level and verified for clinical validity\.
- •Lightweight interpretable metric\.Building on the synthesized data, we train Qwen3\-8B via supervised fine\-tuning \(SFT\) and reinforcement learning \(RL\) techniques using both one\-pass and two\-pass prompting format, achieving 78\.5% discrimination and 70\.5% robustness accuracy, surpassing medical LLMs at the 32B scale \(such as Lingshu\-32BXuet al\.\([2025b](https://arxiv.org/html/2606.18797#bib.bib35)\)and Hulu\-Med\-32BJianget al\.\([2025a](https://arxiv.org/html/2606.18797#bib.bib36)\)\)\.

Table 1:ReEvalMed error taxonomy and test set composition\. The benchmark spans 12 error aspects across two evaluation dimensions,*Discrimination*\(detecting clinically significant errors\) and*Robustness*\(tolerating clinically insignificant variations\), each with 200 samples \(400 total\)\.
## 2Preliminary

#### Task formulation

A radiology report evaluation metric takes a reference report \("REF"\) and a generated candidate report \("TGT"\) as input, and outputs a quality judgment\. Traditional metrics, including lexical metrics \(BLEUPapineniet al\.\([2002](https://arxiv.org/html/2606.18797#bib.bib13)\), ROUGE\-LLin \([2004](https://arxiv.org/html/2606.18797#bib.bib15)\)\), embedding\-based metrics \(BERTScoreZhanget al\.\([2019](https://arxiv.org/html/2606.18797#bib.bib20)\)\), and clinical NLP metrics \(RadGraphJainet al\.\([2021](https://arxiv.org/html/2606.18797#bib.bib19)\), CheXbertSmitet al\.\([2020](https://arxiv.org/html/2606.18797#bib.bib18)\)\), all reduce this judgment to a singlecontinuous score\. However, a scalar score cannot distinguish clinically significant errors from insignificant ones, as both may incur the same penalty under any scalar metric\. We therefore adopt astructured textual outputthat explicitly labels each discrepancy as*significant*or*insignificant*, while also identifying its error span and clinical aspect\. This provides fine\-grained attribution beyond what scalar scoring can express\.

#### ReEvalMed benchmark

To assess whether a metric aligns with clinical judgment, ReEvalMedLiet al\.\([2025a](https://arxiv.org/html/2606.18797#bib.bib12)\)provides a fine\-grained meta\-evaluation benchmark for radiology report evaluation\. Given paired reference and candidate reports with clinician\-defined significance labels, ReEvalMed tests whether a metric can distinguish clinically significant errors from clinically insignificant variations\. Its criteria were co\-developed with clinicians to reflect clinically relevant discrepancies in radiology reports\. ReEvalMed further organizes discrepancies along two axes:*Error Type*\(Omission, Fabrication, Inaccuracy\) and*Error Aspect*\(e\.g\., Location, Severity, Negation\), as summarized in Table[1](https://arxiv.org/html/2606.18797#S1.T1)\. The test set contains 400 report pairs drawn from MIMIC\-CXRJohnsonet al\.\([2019](https://arxiv.org/html/2606.18797#bib.bib17)\)\.

#### Discrimination and Robustness

ReEvalMed introduce two metrics along two complementary dimensions to measure the quality of evaluation:

- •Discrimination\(200 test samples\): the ability to detect errors that could materially affect diagnosis or treatment, such as changing lesion characterization from benign to malignant or substantially altering lesion size\.
- •Robustness\(200 test samples\): the ability to remain unpenalised by clinically harmless variations, such as synonymous wording, equivalent anatomical descriptions, or negligible measurement differences\.

A clinically significance\-sensitive metric should achieve high accuracy on both dimensions simultaneously, i\.e\., correctly flagging clinically significant errors while tolerating harmless variations\.

## 3Analysis on LLM\-based Metrics

### 3\.1LLM\-as\-Evaluator Methodology

Motivated by recent LLM\-based evaluators for generated texts\(Luet al\.,[2024](https://arxiv.org/html/2606.18797#bib.bib37); Kocmi and Federmann,[2023](https://arxiv.org/html/2606.18797#bib.bib38)\), we design structured prompts that instruct an LLM to compare a REF and TGT pair and output fine\-grained error annotations rather than a scalar score\. Specifically, the evaluator identifies error spans and assigns each a severity level from three classes:Critical\(internal contradictions that severely undermine report trust\),Significant\(errors that meaningfully alter clinical decision\-making\), andInsignificant\(stylistic variations or clinically harmless deviations\)\. We explore two inference paradigms for this task\.

#### One\-pass

The LLM receives REF and TGT in a single prompt and directly outputs a JSON object containing three severity buckets \(critical,significant, andinsignificant\), each mapping error spans to their error aspects, together with a free\-text explanation\.

#### Two\-pass

One\-pass inference couples span detection and severity judgment into a single output, which may cause compounding errors\. To address this, we decouple inference into two passes:

- •Pass 1 \(Span Detection\): the LLM identifies all discrepancies between REF and TGT, outputting a JSON array of error spans with their aspects \(e\.g\.,pneumothorax \-\- Description\)\.
- •Pass 2 \(Severity Judgment\): for each detected span, a second call outputs exactly one word \(Critical,Significant, orInsignificant\), conditioned on aspect\-specific criteria\.

#### D/R classification rule

Note that the three severity levels are*error\-level*labels assigned to individual spans, whereas D and R are*metric\-level*scores that measure how well a metric’s predictions align with clinical ground truth\. To bridge the two levels, we aggregate span\-level severities into a binary report\-level prediction\. Denoting a TGT–REF pair as\(t,r\)\(t,r\), letncn\_\{\\text\{c\}\},nsn\_\{\\text\{s\}\},nin\_\{\\text\{i\}\}be the number ofCritical,Significant, andInsignificantspans identified:

cls\(t,r\)=\{sig\.nc\+ns\>0,ins\.nc=ns=0,ni\>0\.\\textsc\{cls\}\(t,r\)=\\begin\{cases\}\\textit\{sig\.\}&n\_\{c\}\+n\_\{s\}\>0,\\\\ \\textit\{ins\.\}&n\_\{c\}=n\_\{s\}=0,\\;n\_\{i\}\>0\.\\end\{cases\}\(1\)TheDiscrimination score\(D\) is the accuracy of predictingsig\.on the Discrimination subset, where all pairs contain clinically significant errors; theRobustness score\(R\) is the accuracy of predictinginsig\.on the Robustness subset, where all pairs contain only clinically insignificant variations\.

### 3\.2Experimental Setup

TypeMetricMaximin ThresholdD \(↑\\uparrow\)R \(↑\\uparrow\)Avg \(↑\\uparrow\)Gap \(↓\\downarrow\)NLPBLEU19\.019\.019\.0 \(±\\pm4\.0\)0\.0BERTScore24\.024\.024\.0 \(±\\pm4\.2\)0\.0AlignScore51\.051\.051\.0 \(±\\pm5\.0\)0\.0Med\.RadGraph28\.528\.528\.5 \(±\\pm4\.5\)0\.0RadBERTScore26\.526\.526\.5 \(±\\pm4\.2\)0\.0RaTEScore35\.535\.535\.5 \(±\\pm4\.8\)0\.0CheXbert47\.049\.048\.0 \(±\\pm5\.0\)2\.0LLMGREEN53\.541\.547\.5 \(±\\pm4\.8\)12\.0RadFact80\.056\.068\.0 \(±\\pm4\.5\)24\.0FineRadScore86\.568\.577\.5 \(±\\pm4\.0\)18\.0CRIMSON75\.080\.577\.8\(±\\pm4\.2\)5\.5Table 2:Comparison of score\-based metrics on ReEvalMed\. D and R scores are in accuracy \(%\)\. Maximin threshold is applied to obtain balanced results\. Best results are inbold\. Numbers in parentheses denote 95% bootstrap CIs for Avg\.#### Baselines

We evaluate 11 score\-based metrics that output continuous scores: 3 NLP metrics \(BLEUPapineniet al\.\([2002](https://arxiv.org/html/2606.18797#bib.bib13)\), BERTScoreZhanget al\.\([2019](https://arxiv.org/html/2606.18797#bib.bib20)\), and AlignScoreZhaet al\.\([2023](https://arxiv.org/html/2606.18797#bib.bib40)\)\); 4 medical metrics \(RadGraphJainet al\.\([2021](https://arxiv.org/html/2606.18797#bib.bib19)\), RaTEScoreZhaoet al\.\([2024](https://arxiv.org/html/2606.18797#bib.bib30)\), CheXbertSmitet al\.\([2020](https://arxiv.org/html/2606.18797#bib.bib18)\)\), and RadBERTScore, which replaces the generic BERTScore encoder with a radiology\-domain encoder from RadEvalXuet al\.\([2025a](https://arxiv.org/html/2606.18797#bib.bib9)\); and 4 LLM\-based metrics \(GREENOstmeieret al\.\([2024](https://arxiv.org/html/2606.18797#bib.bib27)\), which prompts GPT\-4 for finding\-level error annotations, and CRIMSONBaharoonet al\.\([2026](https://arxiv.org/html/2606.18797#bib.bib41)\), which fine\-tunes MedGemma on 140K report pairs with GPT\-5\-generated severity labels via LoRA\. FineRadScoreHuanget al\.\([2024](https://arxiv.org/html/2606.18797#bib.bib10)\)is an LLM\-based line\-by\-line correction metric; RadFactBannuret al\.\([2024](https://arxiv.org/html/2606.18797#bib.bib32)\)is an LLM\-based entailment metric suite for radiology report evaluation, and we use its logical precision/recall scores\. We instantiate RadFact and FineRadScore with GPT\-5\.1 for our experiments\.

#### LLM\-as\-evaluators

We evaluate 3 proprietary LLMs \(GPT\-5\.1, Claude Sonnet 4\.5, Gemini 3 Pro\) and 5 open\-source LLMs \(Qwen3\-MaxYanget al\.\([2025](https://arxiv.org/html/2606.18797#bib.bib47)\), LingShu\-32BXuet al\.\([2025b](https://arxiv.org/html/2606.18797#bib.bib35)\), Hulu\-Med\-32BJianget al\.\([2025a](https://arxiv.org/html/2606.18797#bib.bib36)\), Qwen3\-8BYanget al\.\([2025](https://arxiv.org/html/2606.18797#bib.bib47)\), and MedGemma\-4BSellergrenet al\.\([2025](https://arxiv.org/html/2606.18797#bib.bib49)\)\) using our one\-pass and two\-pass prompts\.

#### Prompt

Prompt templates for bothone\-passandtwo\-passparadigms are provided in Appendix[I](https://arxiv.org/html/2606.18797#A9)\.

#### Hyperparameters

All LLM model inferences use greedy decoding \(temperature = 0\.0\) with a maximum of 1024 output tokens\.

#### Device and platform

Proprietary models are accessed via their official APIs\. Open\-source models \(≤\\leq32B\) are served with vLLMKwonet al\.\([2023](https://arxiv.org/html/2606.18797#bib.bib46)\)on a single NVIDIA A800 GPU\. The same device is used for all training experiments in Section[5](https://arxiv.org/html/2606.18797#S5)\.

TypeModel1\-Pass2\-PassD \(↑\\uparrow\)R \(↑\\uparrow\)Avg \(↑\\uparrow\)Gap \(↓\\downarrow\)D \(↑\\uparrow\)R \(↑\\uparrow\)Avg \(↑\\uparrow\)Gap \(↓\\downarrow\)ProprietaryGPT\-5\.198\.542\.570\.5 \(±\\pm3\.5\)56\.095\.069\.582\.3\(±\\pm3\.5\)25\.5Claude Sonnet 4\.591\.584\.087\.8\(±\\pm3\.2\)7\.598\.051\.074\.5 \(±\\pm3\.5\)47\.0Gemini 3 Pro94\.057\.575\.8 \(±\\pm3\.8\)36\.565\.079\.572\.3 \(±\\pm4\.2\)14\.5Open\-sourceQwen3\-Max97\.053\.575\.3 \(±\\pm3\.8\)43\.596\.545\.070\.8 \(±\\pm3\.8\)51\.5LingShu\-32B99\.564\.081\.8 \(±\\pm3\.3\)35\.597\.55\.551\.5 \(±\\pm2\.0\)92\.0Hulu\-Med\-32B100\.023\.061\.5 \(±\\pm3\.0\)77\.096\.052\.574\.3 \(±\\pm3\.8\)43\.5Qwen3\-8B96\.58\.052\.3 \(±\\pm2\.3\)88\.595\.59\.052\.3 \(±\\pm2\.5\)86\.5MedGemma\-4B49\.09\.029\.0 \(±\\pm4\.0\)40\.067\.067\.567\.3 \(±\\pm4\.8\)0\.5Table 3:Comparison of LLM\-as\-evaluator on ReEvalMed benchmark\. D and R scores are in accuracy \(%\)\. Numbers in parentheses denote the half\-width of the 95% bootstrap CI for Avg\. Best results are inbold\.

### 3\.3Meta\-Evaluation

#### Score\-based metrics

To align the accuracy scale with LLM\-based metrics, score\-based metrics require a decision threshold to convert continuous values into binary D/R predictions\. We sweep all thresholds and apply themaximin criterionmaxθ⁡min⁡\(Dθ,Rθ\)\\max\_\{\\theta\}\\min\(D\_\{\\theta\},R\_\{\\theta\}\), selecting the operating point that maximises the worse of the two dimensions111For FineRadScoreHuanget al\.\([2024](https://arxiv.org/html/2606.18797#bib.bib10)\)with severity\-level labels, the selected threshold maps severity levels<2<2to insignificant errors and levels≥2\\geq 2to significant errors\.\.

#### Average and Gap

We reportAvg=\(D\+R\)/2=\(\\text\{D\}\+\\text\{R\}\)/2for overall accuracy andGap=\|D−R\|=\|\\text\{D\}\-\\text\{R\}\|for D–R imbalance\. An ideal evaluator achieves high Avg with low Gap\. We report the 95% confidence interval \(CI\) of Avg using bootstrap resampling over test cases\. Detailed results in Appendix[B\.3](https://arxiv.org/html/2606.18797#A2.SS3)\.

### 3\.4Results

#### Score\-based metrics fail to distinguish significance

Score\-based metrics require a decision threshold to convert continuous scores into binary significant/insignificant predictions\. We select the maximin thresholdmaxθ⁡min⁡\(Dθ,Rθ\)\\max\_\{\\theta\}\\min\(D\_\{\\theta\},R\_\{\\theta\}\)that maximises the worse of the two dimensions, balancing D and R accuracy \(see Appendix[G](https://arxiv.org/html/2606.18797#A7), Figure[9](https://arxiv.org/html/2606.18797#A7.F9)for full trade\-off curves\)\. As shown in Table[2](https://arxiv.org/html/2606.18797#S3.T2)\(left\), most non\-LLM metrics remain below 55% on both D and R \(e\.g\. BLEU 19\.0/19\.0, CheXbert 47\.0/49\.0\), indicating they cannot reliably separate significant errors from harmless variations\. LLM\-based score metrics perform notably better: CRIMSON achieves the best maximin point \(75\.0/80\.5\), substantially outperforming GREEN \(53\.5/41\.5\) and all non\-LLM baselines\.

#### LLM\-as\-Evaluators detect errors but over\-penalize harmless variation

Compared with score\-based metrics, LLM\-as\-Evaluators achieve substantially higher Discrimination\. As shown in Table[3](https://arxiv.org/html/2606.18797#S3.T3), most LLMs reach D≥\\geq94% under 1\-Pass, far exceeding the best score\-based result \(CRIMSON, 75\.0%\)\. However, this sensitivity does not imply a reliable clinical\-significance boundary\. Robustness varies widely, ranging from 8\.0% \(Qwen3\-8B\) to 84\.0% \(Claude Sonnet 4\.5\), with most models below 60%\. Claude Sonnet 4\.5 is the only model that performs strongly on both dimensions \(D=91\.5%, R=84\.0%\); all other LLM evaluators remain substantially imbalanced, typically detecting clinically significant errors while over\-penalizing clinically insignificant variation\.

#### 1\-Pass vs\. 2\-Pass

Switching from 1\-Pass to 2\-Pass does not uniformly improve the Discrimination–Robustness balance\. Instead, it shifts the bias pattern in model\-dependent ways\. Some models show a reduced Gap: GPT\-5\.1 narrows from 56\.0 to 25\.5, Hulu\-Med\-32B from 77\.0 to 43\.5, and Gemini 3 Pro from 36\.5 to 14\.5\. In contrast, others become more imbalanced: Claude Sonnet 4\.5 increases from 7\.5 to 47\.0, and LingShu\-32B from 35\.5 to 92\.0\. These results suggest that decoupling span detection from significance judgment does not eliminate errors, but redistributes them between Discrimination and Robustness\.

#### Discrimination bias

Another consistent pattern emerges across both paradigms: nearly all models exhibit a large Gap, with D substantially exceeding R\. This reflects a systematic tendency to label clinically harmless variations as significant errors, which we term*discrimination bias*\. Without explicit supervision on the clinical\-significance boundary, models favor sensitivity over selectivity, detecting clinically meaningful errors but failing to reliably separate them from insignificant variation\. Since 2\-Pass inference only redistributes this bias rather than resolving it, we next explore targeted data augmentation and fine\-tuning in Sections[4](https://arxiv.org/html/2606.18797#S4)–[5](https://arxiv.org/html/2606.18797#S5)\.

![Refer to caption](https://arxiv.org/html/2606.18797v1/x3.png)Figure 3:Overview of our data synthesis and training pipeline\.Left:Reference reports are stratified\-sampled from MIMIC\-CXR and paired with error specifications for controlled error injection\.Right:Base models are fine\-tuned through SFT and RL under both One\-Pass and Two\-Pass evaluation paradigms\.

## 4Report Synthesis via Clinical Error Injection

Section[3](https://arxiv.org/html/2606.18797#S3)shows that discrimination bias is pervasive across LLM evaluators, yet no large\-scale resource explicitly supervises the boundary between clinically significant errors and harmless report variations\. We therefore construct 4,000 REF–TGT medical report pairs with fine\-grained annotations across 12 error aspects \(omission, fabrication, and inaccuracy\) and balanced D–R supervision\. This resource is designed to mitigate the bias when training lightweight metrics and to serve as a reusable resource, as shown in Figure[3](https://arxiv.org/html/2606.18797#S3.F3)\(left\)\.

### 4\.1Stratified Sampling

Error categories naturally differ in their length requirements:*Stylistic Variation*typically requires longer, multi\-sentence reports, whereas*Severity Omission*can occur in short single\-finding reports\. Following stratified sampling practice in machine translation evaluationSaldías Fuenteset al\.\([2022](https://arxiv.org/html/2606.18797#bib.bib48)\), we sample 2,000 reference reports from the 227K MIMIC\-CXR reportsJohnsonet al\.\([2019](https://arxiv.org/html/2606.18797#bib.bib17)\)using four character\-length buckets: 0–200, 200–350, 350–500, and\>\>500\. Category\-specific keyword filters derived from the error specifications ensure lexical relevance\. The resulting length distribution across the 12 error aspects is shown in Appendix[E](https://arxiv.org/html/2606.18797#A5)\.

### 4\.2Error Specifications

A central challenge is distinguishing clinically significant textual differences from harmless rewrites\. We address this by prompting Claude Sonnet 4\.5 to analyse the ReEvalMed test samples for each error category and distil structured*error specifications*, each containing: \(1\) a precise clinical error definition, \(2\) common medical manifestation patterns in CXR reports, \(3\) discrimination text patterns showing changes that alter clinical meaning \(e\.g\. “left upper lobe nodule”→\\to“pulmonary nodule”\), and \(4\) robustness text patterns showing clinically harmless reformulations \(e\.g\. “hepatic lesion in segment 7”→\\to“right hepatic lobe lesion”\)\. To avoid potential test\-set leakage, we deliberately prohibit direct reuse or paraphrasing of ReEvalMed examples during generation, using the specifications only as abstract category\-level guidance for diverse, novel realizations\. We present the error injection prompt \(Appendix[I](https://arxiv.org/html/2606.18797#A9)\) and representative examples in Appendix[H](https://arxiv.org/html/2606.18797#A8)\.

### 4\.3Error Injection and Report Synthesis

Each reference report is paired with one Discrimination injection \(clinically significant error\) and one Robustness injection\(clinically harmless variation\)\.222We generate a single D/R pair per reference to maximize distributional diversity and reduce overfitting to repeated reference patterns\.Across all error categories, with 100 D and 100 R samples per category, the corpus contains4,000 \(REF, TGT\) pairs\. Claude Sonnet 4\.5 generates each TGT conditioned on the sampled REF and the corresponding error specification, and each sample is annotated with span\-level labels \(error spans, type, severity\) to support two\-pass training\.

### 4\.4Quality Control via Automatic Sanity Checks and Human Evaluation

To ensure generation quality, we add automatic sanity checks \(e\.g\. non\-empty output, reasonable length ratio\) to detect obviously invalid generations and trigger immediate A qualified clinician further reviewed 400 randomly sampled pairs from the 4,000 synthesized samples, consisting of 200 Discrimination and 200 Robustness pairs, to assess clinical validity and label quality\.

Overall,93\.0%of the reviewed samples passed clinician verification, with 28 of 400 pairs flagged\. Manual inspection showed that the flagged cases mainly fell into three categories: severity mislabelling \(the assigned severity disagrees with clinical judgement\), injection failure \(error injection produces no meaningful change\), and aspect mismatch \(the injected change is purely stylistic, deviating from the assigned error type\)\. Representative flagged cases are provided in Appendix[F](https://arxiv.org/html/2606.18797#A6)as practical references for constructing clinically grounded datasets\. We will also release the complete clinician review report and detailed statistical analysis to support transparency and reproducibility\.

## 5Lightweight Metric Training

### 5\.1Training Stages

We evaluate the synthesized 4K corpus by training a lightweight radiology report metric and testing it on ReEvalMed, measuring both evaluation accuracy and discrimination bias\. The corpus is balanced 1:1 between Discrimination and Robustness samples to discourage degenerate always\-positive or always\-negative predictions\. As shown in Figure[3](https://arxiv.org/html/2606.18797#S3.F3)\(right\), both one\-pass and two\-pass variants follow the same two\-stage pipeline: supervised fine\-tuning \(SFT\) for output\-format learning, followed by reinforcement learning to align predictions with the clinical significance boundary\.

#### Base models

We select two lightweight LLMs of different scales:Qwen3\-8BYanget al\.\([2025](https://arxiv.org/html/2606.18797#bib.bib47)\), a general\-purpose 8B instruction\-tuned LLM, andMedGemma\-4BSellergrenet al\.\([2025](https://arxiv.org/html/2606.18797#bib.bib49)\), a 4B model pre\-trained on biomedical corpora\. This allows us to examine whether domain\-specific pre\-training complements our clinically grounded data\.

#### SFT

The SFT stage trains the model to produce the target output format and acquire basic error detection ability\. For one\-pass, the model learns to produce a structured JSON object containing severity buckets and error spans from a single REF\+TGT input\. For two\-pass, we construct separate training samples for Pass 1 and Pass 2, then fine\-tune the same base model on both parts so that it masters error span extraction \(Pass 1: a JSON array of error spans with clinical aspects\) and severity classification \(Pass 2: labelling each span as Critical, Significant, or Insignificant\)\.

#### RL

SFT learns the output format and may improve performance, but does not explicitly optimize the clinical significance boundary\. We apply DPORafailovet al\.\([2023](https://arxiv.org/html/2606.18797#bib.bib22)\)to explicitly optimize this boundary\. We apply DPORafailovet al\.\([2023](https://arxiv.org/html/2606.18797#bib.bib22)\)to explicitly optimize this boundary\. Using the same 4K corpus, we construct preference pairs by treating the original ground\-truth severity label as the*chosen*response and its flipped counterpart \(e\.g\., Significant→\\rightarrowInsignificant\) as the*rejected*response\. Note that for two\-pass, since SFT already learns span extraction reliably, DPO is applied only to Pass 2 severity classification\.

#### Training framework

All models are trained using LoRAHuet al\.\([2022](https://arxiv.org/html/2606.18797#bib.bib43)\)via LLaMA\-FactoryZhenget al\.\([2024](https://arxiv.org/html/2606.18797#bib.bib45)\)\. Detailed hyperparameters are provided in Appendix[A](https://arxiv.org/html/2606.18797#A1)\.

### 5\.2Results

ModelSettingD \(↑\\uparrow\)R \(↑\\uparrow\)Avg \(↑\\uparrow\)Gap \(↓\\downarrow\)One\-PassQwen3Base96\.58\.052\.3 \(±\\pm2\.3\)88\.5\+ SFT90\.019\.054\.5 \(±\\pm3\.5\)71\.0\+ SFT \+ RL86\.039\.562\.8 \(±\\pm4\.2\)46\.5MedGemmaBase49\.09\.029\.0 \(±\\pm4\.0\)40\.0\+ SFT60\.593\.076\.8 \(±\\pm4\.0\)32\.5\+ SFT \+ RL75\.580\.077\.8\(±\\pm4\.2\)4\.5Two\-PassQwen3Base95\.59\.052\.3 \(±\\pm2\.5\)86\.5\+ SFT84\.052\.068\.0 \(±\\pm4\.3\)32\.0\+ SFT \+ RL78\.570\.574\.5 \(±\\pm4\.2\)8\.0MedGemmaBase67\.067\.567\.3 \(±\\pm4\.8\)0\.5\+ SFT67\.084\.075\.5 \(±\\pm4\.2\)17\.0\+ SFT \+ RL66\.590\.578\.5\(±\\pm4\.0\)24\.0Table 4:Training results on ReEvalMed based on Qwen3\-8BYanget al\.\([2025](https://arxiv.org/html/2606.18797#bib.bib47)\)and MedGemma\-4BSellergrenet al\.\([2025](https://arxiv.org/html/2606.18797#bib.bib49)\)\. D and R scores are in accuracy \(%\)\. Best results are inbold\.Table[4](https://arxiv.org/html/2606.18797#S5.T4)summarises the training results\.

#### Best results

After post\-training, MedGemma\-4B under 1\-Pass achieves the best D–R alignment \(D=75\.5%, R=80\.0%, Gap=4\.5\), while Qwen3\-8B under 2\-Pass reaches D=78\.5%, R=70\.5% \(Gap=8\.0\)\. Both surpass 32B\-scale medical LLMs such as LingShu\-32B and Hulu\-Med\-32B \(Table[3](https://arxiv.org/html/2606.18797#S3.T3)\), and approach Claude Sonnet 4\.5 \(D=91\.5%, R=84\.0%\) at a fraction of the inference cost\.

#### SFT and RL serve complementary roles

SFT teaches models to produce structured outputs, but does not explicitly optimize the clinical significance boundary: Qwen3\-8B remains heavily D\-biased after SFT \(1\-Pass Gap: 88\.5→\\to71\.0; 2\-Pass Gap: 86\.5→\\to32\.0\), while MedGemma\-4B overcorrects toward R\-bias under 1\-Pass \(D=60\.5%, R=93\.0%\)\. RL then narrows the Gap from both directions: for Qwen3\-8B it boosts R while preserving D \(2\-Pass Gap: 32\.0→\\to8\.0\); for MedGemma\-4B it recovers D from 60\.5% to 75\.5% while maintaining high R \(Gap: 32\.5→\\to4\.5\)\. The two stages are therefore complementary: SFT establishes the output format, while RL explicitly optimizes the D–R decision boundary\.

#### 1\-Pass vs\. 2\-Pass depends on the model

For Qwen3\-8B, 2\-Pass substantially outperforms 1\-Pass \(Gap 8\.0 vs\. 46\.5\), as decoupling span detection from severity judgment helps the model avoid conflating*where*an error is with*how serious*it is\. For MedGemma\-4B, however, 1\-Pass yields better alignment \(Gap 4\.5 vs\. 24\.0\): under 2\-Pass, R continues to rise \(84\.0→\\to90\.5%\) but D stagnates at 66\.5%, amplifying R\-bias with each training stage\. This suggests the the optimal pipeline depends on the base model’s inherent D/R bias\.

#### Effectiveness of data and training

Across both models and paradigms, training on our synthesized data consistently reduces the Gap, demonstrating the effectiveness of targeted data augmentation for D–R alignment\. Even with lightweight LoRA training on models as small as 4B, the resulting metrics achieve competitive performance with proprietary LLMs, validating both the quality of our synthesized dataset and the training framework\.

## 6Related Work

#### Radiology report generation

Paired image–report datasets such as CheXpert PlusChambonet al\.\([2024](https://arxiv.org/html/2606.18797#bib.bib7)\), PadChestBustoset al\.\([2020](https://arxiv.org/html/2606.18797#bib.bib4)\), and the IU X\-Ray collection distributed through Open\-iDemner\-Fushmanet al\.\([2016](https://arxiv.org/html/2606.18797#bib.bib5)\)have enabled radiology report generation models, including recent multimodal LLMs such asμ2\\mu^\{2\}LLMLiet al\.\([2025b](https://arxiv.org/html/2606.18797#bib.bib8)\)\. Unlike this generation\-focused literature, our work evaluates whether automatic metrics can identify clinically significant errors generated by report generation LLMs while tolerating clinically insignificant report variation\.

#### Radiology report evaluation

Lexical metricsPapineniet al\.\([2002](https://arxiv.org/html/2606.18797#bib.bib13)\); Lin \([2004](https://arxiv.org/html/2606.18797#bib.bib15)\); Banerjee and Lavie \([2005](https://arxiv.org/html/2606.18797#bib.bib16)\)remain common despite limitations for clinical text\. RadEvalXuet al\.\([2025a](https://arxiv.org/html/2606.18797#bib.bib9)\)unifies lexical, contextual, clinical concept\-based, and LLM\-based radiology text metrics\. CREPEChoet al\.\([2025](https://arxiv.org/html/2606.18797#bib.bib14)\), CheXbertSmitet al\.\([2020](https://arxiv.org/html/2606.18797#bib.bib18)\), and RadGraphJainet al\.\([2021](https://arxiv.org/html/2606.18797#bib.bib19)\)provide error counts or structured representations but still reduce quality to scalar scores\. GREENOstmeieret al\.\([2024](https://arxiv.org/html/2606.18797#bib.bib27)\), RGRGTanidaet al\.\([2023](https://arxiv.org/html/2606.18797#bib.bib28)\), and VERTBolognaet al\.\([2026](https://arxiv.org/html/2606.18797#bib.bib11)\)use LLMs as judges, while CLEARJianget al\.\([2025b](https://arxiv.org/html/2606.18797#bib.bib3)\)evaluates CheXpert conditions and fine\-grained attributes; none jointly decompose error types and D–R behavior\. RaTEScoreZhaoet al\.\([2024](https://arxiv.org/html/2606.18797#bib.bib30)\)and MAIRAHylandet al\.\([2023](https://arxiv.org/html/2606.18797#bib.bib29)\)target factual grounding but remain scalar\. ReEvalMedLiet al\.\([2025a](https://arxiv.org/html/2606.18797#bib.bib12)\)introduces D–R evaluation and reveals failures of existing metrics; we build on it with targeted data construction and post\-training\.

#### LLM\-as\-judge

Zhenget al\.\([2023](https://arxiv.org/html/2606.18797#bib.bib23)\)established the LLM\-as\-judge paradigm for general NLP tasks\. In the medical domain,Singhalet al\.\([2023](https://arxiv.org/html/2606.18797#bib.bib24)\)demonstrated expert\-level performance on clinical QA\. However, LLMs used as report evaluators tend to over\-penalise stylistic variationsLiuet al\.\([2023b](https://arxiv.org/html/2606.18797#bib.bib21)\), a failure mode our Robustness dimension systematically quantifies\.

#### LLM post\-training

Reinforcement learning from human feedbackOuyanget al\.\([2022](https://arxiv.org/html/2606.18797#bib.bib42)\)and DPORafailovet al\.\([2023](https://arxiv.org/html/2606.18797#bib.bib22)\)are standard alignment methods, while LoRAHuet al\.\([2022](https://arxiv.org/html/2606.18797#bib.bib43)\)and LLaMA\-FactoryZhenget al\.\([2024](https://arxiv.org/html/2606.18797#bib.bib45)\)make SFT–RL post\-training efficient and accessible\. Recent work further simplifies preference optimisation by removing the reference modelHonget al\.\([2024](https://arxiv.org/html/2606.18797#bib.bib44)\)\. We use this training stack to build a lightweight D–R\-aligned clinical report evaluator\.

## 7Conclusion

We systematically evaluate LLM\-based metrics for clinical significance in radiology reports, revealing a consistent*discrimination bias*: LLMs readily detect true clinical errors but struggle to tolerate harmless variations\. To mitigate this, we synthesize 4k report pairs and train lightweight metrics via SFT and RL, sharpening the significance boundary to surpass 32B\-scale medical LLMs while remaining competitive with proprietary LLMs\. We also find two\-pass inference redistributes rather than eliminates errors between discrimination and robustness\.

## Limitations

The limitations of this study can be summarised as follows:

- •Quality control\.The most rigorous validation of synthesized report pairs would require expert clinician annotation, which is prohibitively expensive at scale\. As a practical compromise, we randomly sampled 400 generated pairs for manual verification \(Section[4](https://arxiv.org/html/2606.18797#S4)\), leaving full\-scale clinician review for future work\.
- •Training algorithm\.The primary goal of this paper is to explore the proposed LLM\-based evaluation framework and demonstrate the effectiveness of our synthesized data; metric training serves as a validation vehicle rather than the core contribution\. We therefore adopt DPO as a straightforward offline preference optimisation method\. Modern reinforcement learning algorithms \(e\.g\. PPO, GRPO\) typically require large\-scale rollouts that exceed our current data budget; exploring these and other advanced training pipelines is a promising direction for future work\.
- •Domain coverage\.Our models are trained and evaluated on MIMIC\-CXR \(English, US clinical setting\) and Open\-i for domain generalization; out\-of\-distribution generalisation to other languages, imaging modalities, or clinical environments remains to be studied\.
- •Inference cost\.The two\-pass approach doubles inference cost relative to one\-pass\. In high\-throughput settings, one\-pass DPO may therefore provide a more favorable trade\-off between efficiency and D/R alignment\.

## Ethics Statement

We take ethical considerations seriously and strictly adhere to the ACL Code of Ethics\. All data used in this study are derived from publicly available research datasets\. For MIMIC\-CXR, we complied with the PhysioNet credentialing process and data\-use agreement requirements, including completion of the required human\-subjects research training and access under the approved data\-use terms\. No additional protected patient information was collected or used beyond the de\-identified data available through the authorized dataset access\. A clinician co\-author participated in the human verification of synthesized data \(Section[4](https://arxiv.org/html/2606.18797#S4)\) to ensure clinical validity\. Our proposed approaches and metrics do not include statements that induce the model to generate harmful information; the evaluation framework focuses solely on assessing the clinical significance of textual differences in radiology reports\. We ensure that the findings and conclusions of this paper are reported accurately and objectively\.

## References

- M\. Baharoon, T\. Heintz, S\. Raissi, M\. Alabbad, M\. Alhammad, H\. AlOmaish, S\. E\. Kim, O\. Banerjee, and P\. Rajpurkar \(2026\)CRIMSON: a clinically\-grounded llm\-based metric for generative radiology report evaluation\.arXiv preprint arXiv:2603\.06183\.External Links:[Link](https://arxiv.org/abs/2603.06183)Cited by:[§3\.2](https://arxiv.org/html/2606.18797#S3.SS2.SSS0.Px1.p1.1)\.
- S\. Banerjee and A\. Lavie \(2005\)METEOR: an automatic metric for mt evaluation with improved correlation with human judgments\.InProceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization,pp\. 65–72\.External Links:[Link](https://aclanthology.org/W05-0909/)Cited by:[§6](https://arxiv.org/html/2606.18797#S6.SS0.SSS0.Px2.p1.1)\.
- S\. Bannur, K\. Bouzid, D\. C\. Castro, A\. Schwaighofer, A\. Thieme, S\. Bond\-Taylor, M\. Ilse, F\. Pérez\-García, V\. Salvatelli, H\. Sharma,et al\.\(2024\)Maira\-2: grounded radiology report generation\.arXiv preprint arXiv:2406\.04449\.External Links:[Link](https://arxiv.org/abs/2406.04449)Cited by:[§1](https://arxiv.org/html/2606.18797#S1.p1.1),[§3\.2](https://arxiv.org/html/2606.18797#S3.SS2.SSS0.Px1.p1.1)\.
- F\. Bologna, J\. Corbeil, M\. Wilkens, and A\. Ben Abacha \(2026\)VERT: reliable llm judges for radiology report evaluation\.arXiv preprint arXiv:2604\.03376\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2604.03376),[Link](https://arxiv.org/abs/2604.03376)Cited by:[§6](https://arxiv.org/html/2606.18797#S6.SS0.SSS0.Px2.p1.1)\.
- A\. Bustos, A\. Pertusa, J\. Salinas, and M\. de la Iglesia\-Vayá \(2020\)PadChest: a large chest x\-ray image dataset with multi\-label annotated reports\.Medical Image Analysis66,pp\. 101797\.External Links:[Document](https://dx.doi.org/10.1016/j.media.2020.101797),[Link](https://arxiv.org/abs/1901.07441)Cited by:[§6](https://arxiv.org/html/2606.18797#S6.SS0.SSS0.Px1.p1.1)\.
- P\. Chambon, J\. Delbrouck, T\. Sounack, S\. Huang, Z\. Chen, M\. Varma, S\. Q\. Truong, C\. T\. Chuong, and C\. P\. Langlotz \(2024\)CheXpert plus: augmenting a large chest x\-ray dataset with text radiology reports, patient demographics and additional image formats\.arXiv preprint arXiv:2405\.19538\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2405.19538),[Link](https://arxiv.org/abs/2405.19538)Cited by:[§6](https://arxiv.org/html/2606.18797#S6.SS0.SSS0.Px1.p1.1)\.
- Z\. Chen, M\. Varma, J\. Delbrouck, M\. Paschali, L\. Blankemeier, D\. Van Veen, J\. M\. J\. Valanarasu, A\. Youssef, J\. P\. Cohen, E\. P\. Reis,et al\.\(2024\)Chexagent: towards a foundation model for chest x\-ray interpretation\.InAAAI 2024 Spring Symposium on Clinical Foundation Models,External Links:[Link](https://arxiv.org/abs/2401.12208)Cited by:[§1](https://arxiv.org/html/2606.18797#S1.p1.1)\.
- G\. Cho, S\. Jang, H\. Ko, I\. Baek, and C\. M\. Park \(2025\)CREPE: rapid chest x\-ray report evaluation by predicting multi\-category error counts\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,pp\. 21749–21766\.External Links:[Link](https://aclanthology.org/2025.emnlp-main.1102.pdf)Cited by:[§6](https://arxiv.org/html/2606.18797#S6.SS0.SSS0.Px2.p1.1)\.
- D\. Demner\-Fushman, M\. D\. Kohli, M\. B\. Rosenman, S\. E\. Shooshan, L\. Rodriguez, S\. Antani, G\. R\. Thoma, and C\. J\. McDonald \(2016\)Preparing a collection of radiology examinations for distribution and retrieval\.Journal of the American Medical Informatics Association23\(2\),pp\. 304–310\.External Links:[Document](https://dx.doi.org/10.1093/jamia/ocv080)Cited by:[§6](https://arxiv.org/html/2606.18797#S6.SS0.SSS0.Px1.p1.1)\.
- J\. Hong, N\. Lee, and J\. Thorne \(2024\)Orpo: monolithic preference optimization without reference model\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,pp\. 11170–11189\.External Links:[Link](https://aclanthology.org/2024.emnlp-main.626/)Cited by:[§6](https://arxiv.org/html/2606.18797#S6.SS0.SSS0.Px4.p1.1)\.
- E\. J\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, and W\. Chen \(2022\)LoRA: low\-rank adaptation of large language models\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=nZeVKeeFYf9),2106\.09685Cited by:[Appendix A](https://arxiv.org/html/2606.18797#A1.p1.1),[§5\.1](https://arxiv.org/html/2606.18797#S5.SS1.SSS0.Px4.p1.1),[§6](https://arxiv.org/html/2606.18797#S6.SS0.SSS0.Px4.p1.1)\.
- A\. Huang, O\. Banerjee, K\. Wu, E\. P\. Reis, and P\. Rajpurkar \(2024\)FineRadScore: a radiology report line\-by\-line evaluation technique generating corrections with severity scores\.InProceedings of the 9th Machine Learning for Healthcare Conference,Proceedings of Machine Learning Research, Vol\.252\.External Links:[Link](https://proceedings.mlr.press/v252/huang24a.html)Cited by:[§3\.2](https://arxiv.org/html/2606.18797#S3.SS2.SSS0.Px1.p1.1),[footnote 1](https://arxiv.org/html/2606.18797#footnote1)\.
- S\. L\. Hyland, S\. Bannur, K\. Bouzid, D\. C\. Castro, M\. Ranjit, A\. Schwaighofer, F\. Pérez\-García, V\. Salvatelli, S\. Srivastav, A\. Thieme,et al\.\(2023\)Maira\-1: a specialised large multimodal model for radiology report generation\.arXiv preprint arXiv:2311\.13668\.External Links:[Link](https://arxiv.org/abs/2311.13668)Cited by:[§6](https://arxiv.org/html/2606.18797#S6.SS0.SSS0.Px2.p1.1)\.
- S\. Jain, A\. Agrawal, A\. Saporta, S\. Q\. Truong, D\. N\. Duong, T\. Bui, P\. Chambon, Y\. Zhang, M\. P\. Lungren, A\. Y\. Ng,et al\.\(2021\)Radgraph: extracting clinical entities and relations from radiology reports\.arXiv preprint arXiv:2106\.14463\.External Links:[Link](https://arxiv.org/abs/2106.14463)Cited by:[§2](https://arxiv.org/html/2606.18797#S2.SS0.SSS0.Px1.p1.1),[§3\.2](https://arxiv.org/html/2606.18797#S3.SS2.SSS0.Px1.p1.1),[§6](https://arxiv.org/html/2606.18797#S6.SS0.SSS0.Px2.p1.1)\.
- S\. Jiang, Y\. Wang, S\. Song, T\. Hu, C\. Zhou, B\. Pu, Y\. Zhang, Z\. Yang, Y\. Feng, J\. T\. Zhou,et al\.\(2025a\)Hulu\-med: a transparent generalist model towards holistic medical vision\-language understanding\.arXiv preprint arXiv:2510\.08668\.External Links:[Link](https://arxiv.org/abs/2510.08668)Cited by:[3rd item](https://arxiv.org/html/2606.18797#S1.I1.i3.p1.1),[§3\.2](https://arxiv.org/html/2606.18797#S3.SS2.SSS0.Px2.p1.1)\.
- Y\. Jiang, C\. Chen, S\. Wang, F\. Li, Z\. Tang, B\. M\. Mervak, L\. Chelala, C\. M\. Straus, R\. Chahine, S\. G\. A\. Iii, and C\. Tan \(2025b\)CLEAR: a clinically grounded tabular framework for radiology report evaluation\.InFindings of the Association for Computational Linguistics: EMNLP 2025,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 15914–15933\.External Links:[Link](https://aclanthology.org/2025.findings-emnlp.862/),[Document](https://dx.doi.org/10.18653/v1/2025.findings-emnlp.862)Cited by:[§6](https://arxiv.org/html/2606.18797#S6.SS0.SSS0.Px2.p1.1)\.
- A\. E\. Johnson, T\. J\. Pollard, N\. R\. Greenbaum, M\. P\. Lungren, C\. Deng, Y\. Peng, Z\. Lu, R\. G\. Mark, S\. J\. Berkowitz, and S\. Horng \(2019\)MIMIC\-cxr\-jpg, a large publicly available database of labeled chest radiographs\.arXiv preprint arXiv:1901\.07042\.External Links:[Link](https://arxiv.org/abs/1901.07042)Cited by:[§2](https://arxiv.org/html/2606.18797#S2.SS0.SSS0.Px2.p1.1),[§4\.1](https://arxiv.org/html/2606.18797#S4.SS1.p1.1)\.
- T\. Kocmi and C\. Federmann \(2023\)GEMBA\-mqm: detecting translation quality error spans with gpt\-4\.InProceedings of the Eighth Conference on Machine Translation,pp\. 768–775\.External Links:[Link](https://aclanthology.org/2023.wmt-1.64/)Cited by:[§1](https://arxiv.org/html/2606.18797#S1.p3.1),[§3\.1](https://arxiv.org/html/2606.18797#S3.SS1.p1.1)\.
- W\. Kwon, Z\. Li, S\. Zhuang, Y\. Sheng, L\. Zheng, C\. H\. Yu, J\. Gonzalez, H\. Zhang, and I\. Stoica \(2023\)Efficient memory management for large language model serving with pagedattention\.InProceedings of the 29th symposium on operating systems principles,pp\. 611–626\.External Links:[Link](https://arxiv.org/abs/2309.06180)Cited by:[§3\.2](https://arxiv.org/html/2606.18797#S3.SS2.SSS0.Px5.p1.1)\.
- R\. Li, J\. Li, B\. Jian, K\. Yuan, and Y\. Zhu \(2025a\)Reevalmed: rethinking medical report evaluation by aligning metrics with real\-world clinical judgment\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,pp\. 11823–11837\.External Links:[Link](https://aclanthology.org/2025.emnlp-main.598/)Cited by:[Table 10](https://arxiv.org/html/2606.18797#A4.T10),[§1](https://arxiv.org/html/2606.18797#S1.p1.1),[§2](https://arxiv.org/html/2606.18797#S2.SS0.SSS0.Px2.p1.1),[§6](https://arxiv.org/html/2606.18797#S6.SS0.SSS0.Px2.p1.1)\.
- S\. Li, P\. Qin, H\. Wu, D\. Nie, A\. J\. Thirunavukarasu, J\. Yu, and L\. Zhang \(2025b\)μ2\{\\mu\}^\{2\}tokenizer: Differentiable multi\-scale multi\-modal tokenizer for radiology report generation\.InInternational Conference on Medical Image Computing and Computer\-Assisted Intervention,External Links:[Link](https://arxiv.org/abs/2507.00316)Cited by:[§6](https://arxiv.org/html/2606.18797#S6.SS0.SSS0.Px1.p1.1)\.
- C\. Lin \(2004\)Rouge: a package for automatic evaluation of summaries\.InText summarization branches out,pp\. 74–81\.External Links:[Link](https://aclanthology.org/W04-1013/)Cited by:[§2](https://arxiv.org/html/2606.18797#S2.SS0.SSS0.Px1.p1.1),[§6](https://arxiv.org/html/2606.18797#S6.SS0.SSS0.Px2.p1.1)\.
- Y\. Liu, D\. Iter, Y\. Xu, S\. Wang, R\. Xu, and C\. Zhu \(2023a\)G\-eval: NLG evaluation using gpt\-4 with better human alignment\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\),Singapore,pp\. 2511–2522\.External Links:[Link](https://aclanthology.org/2023.emnlp-main.153/),[Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.153)Cited by:[§C\.2](https://arxiv.org/html/2606.18797#A3.SS2.p1.1)\.
- Y\. Liu, D\. Iter, Y\. Xu, S\. Wang, R\. Xu, and C\. Zhu \(2023b\)G\-eval: nlg evaluation using gpt\-4 with better human alignment\.InProceedings of the 2023 conference on empirical methods in natural language processing,pp\. 2511–2522\.External Links:[Link](https://aclanthology.org/2023.emnlp-main.153/)Cited by:[§6](https://arxiv.org/html/2606.18797#S6.SS0.SSS0.Px3.p1.1)\.
- Q\. Lu, L\. Ding, K\. Zhang, J\. Zhang, and D\. Tao \(2025\)MQM\-APE: toward high\-quality error annotation predictors with automatic post\-editing in LLM translation evaluators\.InProceedings of the 31st International Conference on Computational Linguistics,O\. Rambow, L\. Wanner, M\. Apidianaki, H\. Al\-Khalifa, B\. D\. Eugenio, and S\. Schockaert \(Eds\.\),Abu Dhabi, UAE,pp\. 5570–5587\.External Links:[Link](https://aclanthology.org/2025.coling-main.374/)Cited by:[§C\.6](https://arxiv.org/html/2606.18797#A3.SS6.p1.1)\.
- Q\. Lu, B\. Qiu, L\. Ding, K\. Zhang, T\. Kocmi, and D\. Tao \(2024\)Error analysis prompting enables human\-like translation evaluation in large language models\.InFindings of the Association for Computational Linguistics: ACL 2024,pp\. 8801–8816\.External Links:[Link](https://aclanthology.org/2024.findings-acl.520/)Cited by:[§1](https://arxiv.org/html/2606.18797#S1.p3.1),[§3\.1](https://arxiv.org/html/2606.18797#S3.SS1.p1.1)\.
- S\. Ostmeier, J\. Xu, Z\. Chen, M\. Varma, L\. Blankemeier, C\. Bluethgen, A\. E\. M\. Md, M\. Moseley, C\. Langlotz, A\. S\. Chaudhari,et al\.\(2024\)Green: generative radiology report evaluation and error notation\.InFindings of the association for computational linguistics: EMNLP 2024,pp\. 374–390\.External Links:[Link](https://aclanthology.org/2024.findings-emnlp.21/)Cited by:[§1](https://arxiv.org/html/2606.18797#S1.p1.1),[§3\.2](https://arxiv.org/html/2606.18797#S3.SS2.SSS0.Px1.p1.1),[§6](https://arxiv.org/html/2606.18797#S6.SS0.SSS0.Px2.p1.1)\.
- L\. Ouyang, J\. Wu, X\. Jiang, D\. Almeida, C\. Wainwright, P\. Mishkin, C\. Zhang, S\. Agarwal, K\. Slama, A\. Ray,et al\.\(2022\)Training language models to follow instructions with human feedback\.Advances in neural information processing systems35,pp\. 27730–27744\.External Links:[Link](https://arxiv.org/abs/2203.02155)Cited by:[§6](https://arxiv.org/html/2606.18797#S6.SS0.SSS0.Px4.p1.1)\.
- K\. Papineni, S\. Roukos, T\. Ward, and W\. Zhu \(2002\)Bleu: a method for automatic evaluation of machine translation\.InProceedings of the 40th annual meeting of the Association for Computational Linguistics,pp\. 311–318\.External Links:[Link](https://aclanthology.org/P02-1040/)Cited by:[§2](https://arxiv.org/html/2606.18797#S2.SS0.SSS0.Px1.p1.1),[§3\.2](https://arxiv.org/html/2606.18797#S3.SS2.SSS0.Px1.p1.1),[§6](https://arxiv.org/html/2606.18797#S6.SS0.SSS0.Px2.p1.1)\.
- R\. Rafailov, A\. Sharma, E\. Mitchell, C\. D\. Manning, S\. Ermon, and C\. Finn \(2023\)Direct preference optimization: your language model is secretly a reward model\.Advances in neural information processing systems36,pp\. 53728–53741\.External Links:[Link](https://papers.nips.cc/paper_files/paper/2023/hash/a85b405ed65c6477a4fe8302b5e06ce7-Abstract-Conference.html)Cited by:[§5\.1](https://arxiv.org/html/2606.18797#S5.SS1.SSS0.Px3.p1.1),[§6](https://arxiv.org/html/2606.18797#S6.SS0.SSS0.Px4.p1.1)\.
- B\. Saldías Fuentes, G\. Foster, M\. Freitag, and Q\. Tan \(2022\)Toward more effective human evaluation for machine translation\.InProceedings of the 2nd Workshop on Human Evaluation of NLP Systems \(HumEval\),A\. Belz, M\. Popović, E\. Reiter, and A\. Shimorina \(Eds\.\),Dublin, Ireland,pp\. 76–89\.External Links:[Link](https://aclanthology.org/2022.humeval-1.7/),[Document](https://dx.doi.org/10.18653/v1/2022.humeval-1.7)Cited by:[§4\.1](https://arxiv.org/html/2606.18797#S4.SS1.p1.1)\.
- A\. Sellergren, S\. Kazemzadeh, T\. Jaroensri, A\. Kiraly, M\. Traverse, T\. Kohlberger, S\. Xu, F\. Jamil, C\. Hughes, C\. Lau,et al\.\(2025\)Medgemma technical report\.arXiv preprint arXiv:2507\.05201\.External Links:[Link](https://arxiv.org/abs/2507.05201)Cited by:[Table 9](https://arxiv.org/html/2606.18797#A2.T9),[§3\.2](https://arxiv.org/html/2606.18797#S3.SS2.SSS0.Px2.p1.1),[§5\.1](https://arxiv.org/html/2606.18797#S5.SS1.SSS0.Px1.p1.1),[Table 4](https://arxiv.org/html/2606.18797#S5.T4)\.
- K\. Singhal, S\. Azizi, T\. Tu, S\. S\. Mahdavi, J\. Wei, H\. W\. Chung, N\. Scales, A\. Tanwani, H\. Cole\-Lewis, S\. Pfohl,et al\.\(2023\)Large language models encode clinical knowledge\.Nature620\(7972\),pp\. 172–180\.External Links:[Link](https://www.nature.com/articles/s41586-023-06291-2)Cited by:[§6](https://arxiv.org/html/2606.18797#S6.SS0.SSS0.Px3.p1.1)\.
- A\. Smit, S\. Jain, P\. Rajpurkar, A\. Pareek, A\. Y\. Ng, and M\. Lungren \(2020\)Combining automatic labelers and expert annotations for accurate radiology report labeling using bert\.InProceedings of the 2020 conference on empirical methods in natural language processing \(EMNLP\),pp\. 1500–1519\.External Links:[Link](https://aclanthology.org/2020.emnlp-main.117/)Cited by:[§2](https://arxiv.org/html/2606.18797#S2.SS0.SSS0.Px1.p1.1),[§3\.2](https://arxiv.org/html/2606.18797#S3.SS2.SSS0.Px1.p1.1),[§6](https://arxiv.org/html/2606.18797#S6.SS0.SSS0.Px2.p1.1)\.
- T\. Tanida, P\. Müller, G\. Kaissis, and D\. Rueckert \(2023\)Interactive and explainable region\-guided radiology report generation\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,pp\. 7433–7442\.External Links:[Link](https://openaccess.thecvf.com/content/CVPR2023/html/Tanida_Interactive_and_Explainable_Region-Guided_Radiology_Report_Generation_CVPR_2023_paper.html)Cited by:[§6](https://arxiv.org/html/2606.18797#S6.SS0.SSS0.Px2.p1.1)\.
- R\. Tanno, D\. G\. Barrett, A\. Sellergren, S\. Ghaisas, S\. Dathathri, A\. See, J\. Welbl, C\. Lau, T\. Tu, S\. Azizi,et al\.\(2025\)Collaboration between clinicians and vision–language models in radiology report generation\.Nature Medicine31\(2\),pp\. 599–608\.External Links:[Link](https://www.nature.com/articles/s41591-024-03302-1)Cited by:[§1](https://arxiv.org/html/2606.18797#S1.p1.1)\.
- J\. Xu, X\. Zhang, J\. Abderezaei, J\. Bauml, R\. Boodoo, F\. Haghighi, A\. Ganjizadeh, E\. Brattain, D\. Van Veen, Z\. Meng, D\. W\. Eyre, and J\. Delbrouck \(2025a\)RadEval: a framework for radiology text evaluation\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations,Suzhou, China,pp\. 546–557\.External Links:[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-demos.40),[Link](https://aclanthology.org/2025.emnlp-demos.40.pdf)Cited by:[§3\.2](https://arxiv.org/html/2606.18797#S3.SS2.SSS0.Px1.p1.1),[§6](https://arxiv.org/html/2606.18797#S6.SS0.SSS0.Px2.p1.1)\.
- W\. Xu, H\. P\. Chan, L\. Li, M\. Aljunied, R\. Yuan, J\. Wang, C\. Xiao, G\. Chen, C\. Liu, Z\. Li,et al\.\(2025b\)Lingshu: a generalist foundation model for unified multimodal medical understanding and reasoning\.arXiv preprint arXiv:2506\.07044\.External Links:[Link](https://arxiv.org/abs/2506.07044)Cited by:[3rd item](https://arxiv.org/html/2606.18797#S1.I1.i3.p1.1),[§3\.2](https://arxiv.org/html/2606.18797#S3.SS2.SSS0.Px2.p1.1)\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv,et al\.\(2025\)Qwen3 technical report\.arXiv preprint arXiv:2505\.09388\.External Links:[Link](https://arxiv.org/abs/2505.09388)Cited by:[Table 9](https://arxiv.org/html/2606.18797#A2.T9),[§3\.2](https://arxiv.org/html/2606.18797#S3.SS2.SSS0.Px2.p1.1),[§5\.1](https://arxiv.org/html/2606.18797#S5.SS1.SSS0.Px1.p1.1),[Table 4](https://arxiv.org/html/2606.18797#S5.T4)\.
- Y\. Zha, Y\. Yang, R\. Li, and Z\. Hu \(2023\)AlignScore: evaluating factual consistency with a unified alignment function\.InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 11328–11348\.External Links:[Link](https://aclanthology.org/2023.acl-long.634/)Cited by:[§3\.2](https://arxiv.org/html/2606.18797#S3.SS2.SSS0.Px1.p1.1)\.
- T\. Zhang, V\. Kishore, F\. Wu, K\. Q\. Weinberger, and Y\. Artzi \(2019\)Bertscore: evaluating text generation with bert\.arXiv preprint arXiv:1904\.09675\.External Links:[Link](https://arxiv.org/abs/1904.09675)Cited by:[§2](https://arxiv.org/html/2606.18797#S2.SS0.SSS0.Px1.p1.1),[§3\.2](https://arxiv.org/html/2606.18797#S3.SS2.SSS0.Px1.p1.1)\.
- W\. Zhao, C\. Wu, X\. Zhang, Y\. Zhang, Y\. Wang, and W\. Xie \(2024\)Ratescore: a metric for radiology report generation\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,pp\. 15004–15019\.External Links:[Link](https://aclanthology.org/2024.emnlp-main.836/)Cited by:[§1](https://arxiv.org/html/2606.18797#S1.p1.1),[§3\.2](https://arxiv.org/html/2606.18797#S3.SS2.SSS0.Px1.p1.1),[§6](https://arxiv.org/html/2606.18797#S6.SS0.SSS0.Px2.p1.1)\.
- L\. Zheng, W\. Chiang, Y\. Sheng, S\. Zhuang, Z\. Wu, Y\. Zhuang, Z\. Lin, Z\. Li, D\. Li, E\. Xing,et al\.\(2023\)Judging llm\-as\-a\-judge with mt\-bench and chatbot arena\.Advances in neural information processing systems36,pp\. 46595–46623\.External Links:[Link](https://papers.nips.cc/paper_files/paper/2023/hash/91f18a1287b398d378ef22505bf41832-Abstract-Datasets_and_Benchmarks.html)Cited by:[§C\.2](https://arxiv.org/html/2606.18797#A3.SS2.p1.1),[§6](https://arxiv.org/html/2606.18797#S6.SS0.SSS0.Px3.p1.1)\.
- Y\. Zheng, R\. Zhang, J\. Zhang, Y\. Ye, and Z\. Luo \(2024\)Llamafactory: unified efficient fine\-tuning of 100\+ language models\.InProceedings of the 62nd annual meeting of the association for computational linguistics \(volume 3: system demonstrations\),pp\. 400–410\.External Links:[Link](https://aclanthology.org/2024.acl-demos.38/)Cited by:[Appendix A](https://arxiv.org/html/2606.18797#A1.p1.1),[§5\.1](https://arxiv.org/html/2606.18797#S5.SS1.SSS0.Px4.p1.1),[§6](https://arxiv.org/html/2606.18797#S6.SS0.SSS0.Px4.p1.1)\.

## Appendix AModel and Training Hyperparameters

Table[5](https://arxiv.org/html/2606.18797#A1.T5)lists the model configuration and training hyperparameters used in all experiments\. We fine\-tune two base models \(Qwen3\-8B and MedGemma\-4B\) using LoRAHuet al\.\([2022](https://arxiv.org/html/2606.18797#bib.bib43)\)via LLaMA\-FactoryZhenget al\.\([2024](https://arxiv.org/html/2606.18797#bib.bib45)\)\. SFT and DPO share the same LoRA and optimiser settings; DPO\-specific parameters \(preference construction\) are listed separately\.

Table 5:Model and training hyperparameters for all SFT and DPO experiments\. LoRA and optimiser settings are shared across stages; DPO\-specific parameters govern preference pair construction via rejection sampling\.
## Appendix BAdditional Analysis

### B\.1Generalization to Open\-i

Table 6:Comparison of evaluators on the Open\-i 50\-sample benchmark\. D and R scores are in accuracy \(%\)\. Avg is the mean of D and R, and Gap is the absolute difference between D and R\.We further constructed a small Open\-i benchmark to test evaluator performance on an external chest X\-ray dataset\. Following the ReEvalMed pipeline, we sampled 50 cases from Open\-i and applied controlled data augmentation to create 25 significant\-error samples for discrimination and 25 insignificant\-error samples for robustness\. To avoid potential same\-model bias in the generation of evaluation samples, we used GPT\-5\.5 for this augmentation process; GPT\-5\.5 is not included among the evaluator models assessed in this paper\. All augmented samples were reviewed by clinicians to ensure that the injected errors were correct, clinically meaningful when intended, and free from duplicates or ambiguous cases\.

As shown in Table[6](https://arxiv.org/html/2606.18797#A2.T6), most methods achieve strong results on Open\-i\. One likely reason is that Open\-i reports are generally simple and contain relatively short sentences\. This makes both significant\-error detection and robustness to minor wording changes easier than in ReEvalMed\. However, the results still reveal differences among evaluators\. Qwen3\-8B shows strong discrimination but weaker robustness, indicating a tendency to over\-detect minor changes as significant errors\. After SFT and RL, Qwen3\-8B achieves better balance\. MedGemma\-4B performs well on the base model and reaches perfect performance after SFT and RL\.

### B\.2Error Aspect Analysis

Beyond the overall D–R gap, we further ask*which*types of clinical perturbations are responsible for evaluator errors\. Specifically, we conduct an aspect\-level confusion analysis by comparing the ground\-truth error aspect assigned during dataset construction with the aspect predicted by each LLM evaluator\. This analysis allows us to move beyond aggregate discrimination and robustness accuracy and inspect whether models confuse, over\-predict, or under\-recognize particular clinical error categories, such as negation, location, uncertainty, or stylistic variation\.

#### LLM Evaluators

Figure[5](https://arxiv.org/html/2606.18797#A2.F5)visualizes the aspect\-level flows for eight LLM evaluators, with one\-pass predictions on the left, ground\-truth aspects in the middle, and two\-pass predictions on the right\. A common pattern is that model predictions collapse toward a small set of broad clinical aspects, especially Description, Location, Severity, and Comparison/Progression, while fine\-grained categories such as Modality, Noise, Uncertainty, Terminology, and Stylistic Variation are often under\-recognized\. For example, GPT\-5\.1 and Claude Sonnet 4\.5 both over\-predict Description under two\-pass inference \(60→\\to113 and 60→\\to96, respectively\), whereas Qwen3\-8B and MedGemma\-4B show even stronger concentration, mapping many cases to Description or Location\. Qwen3\-Max is comparatively better calibrated, but still shifts substantially toward Location in the two\-pass setting \(60→\\to99\)\.

The Sankey diagrams also show that two\-pass inference does not simply “fix” one\-pass errors; rather, it redistributes them\. In several models, the second pass reduces some over\-predicted categories but introduces new failure modes, including large Unclassified flows for Gemini 3 Pro and LingShu\-32B, suggesting that span extraction and severity judging can propagate formatting or classification errors\. Across models, explicit factual aspects such as Location, Severity, and Negation are more frequently detected, while subtle imaging\- or wording\-specific aspects such as Modality, Noise, and Terminology remain difficult\. This aspect\-level evidence supports our central finding that current LLM evaluators tend to over\-detect discrepancies but remain under\-calibrated in distinguishing clinically meaningful errors from harmless variations\.

#### Training Comparison\.

Figure[6](https://arxiv.org/html/2606.18797#A2.F6)shows that Qwen3\-8B exhibits strong aspect\-level bias before post\-training\. In the one\-pass setting, the base model over\-predicts broad and visually salient categories such as Location and Description, while rarely assigning fine\-grained aspects such as Noise, Terminology, or Uncertainty\. SFT improves the output distribution only partially: it increases detection of Negation but still collapses many cases into Location and Description\. RL produces a more redistributed one\-pass prediction pattern, reducing the extreme Location bias and increasing flows into Comparison/Progression, Negation, and other clinically meaningful aspects\. In the two\-pass setting, however, Base and SFT both heavily collapse into Description, suggesting that simply teaching the two\-stage format does not resolve aspect confusion\. The RL stage substantially changes this behavior by shifting mass away from Description and toward Location and Comparison/Progression, which is consistent with the improved D–R balance reported in Table[4](https://arxiv.org/html/2606.18797#S5.T4)\.

Figure[7](https://arxiv.org/html/2606.18797#A2.F7)shows a different pattern for MedGemma\-4B\. In one\-pass inference, the base model is dominated by Location predictions, indicating a strong tendency to interpret diverse report discrepancies as anatomical\-site errors\. After SFT, the distribution becomes much broader, with substantial flows into Comparison/Progression, Severity, Description, Uncertainty, and Size/Distance\. Adding RL further smooths the prediction distribution and reduces the Location collapse, which helps explain why MedGemma\-4B achieves the best one\-pass D–R alignment after SFT\+RL\. In contrast, the two\-pass MedGemma variants remain highly concentrated on Description, with SFT and RL additionally increasing Terminology predictions\. This suggests that MedGemma benefits strongly from post\-training in the one\-pass setting, whereas its two\-pass behavior is more constrained by the aspect labels produced during span extraction\.

### B\.3Statistical Analysis

#### Confidence intervals\.

We report 95% bootstrap confidence intervals \(CIs\) for all evaluation scores in Table[7](https://arxiv.org/html/2606.18797#A2.T7), Table[8](https://arxiv.org/html/2606.18797#A2.T8)and Table[9](https://arxiv.org/html/2606.18797#A2.T9)\. For each metric or model, we resample test cases with replacement and recompute Discrimination \(D\), Robustness \(R\), Avg, and Gap over 10,000 bootstrap replicates\. We report the half\-width of the 95% percentile bootstrap CI in parentheses\. All CIs are reported in percentage points\.

Across these tables, the 95% bootstrap CIs are generally moderate, with Avg half\-widths mostly around 3–5 percentage points\. The large gaps between conventional similarity metrics and the best LLM\-based evaluators are therefore unlikely to be explained by sampling noise alone\. By contrast, several top\-performing LLM\-based metrics and models have overlapping CIs, so their small numerical differences should be interpreted cautiously rather than as definitive rankings\.

#### Pairwise Significance Test\.

We further conduct pairwise paired significance tests to compare metrics and evaluators on Avg in Figure[4](https://arxiv.org/html/2606.18797#A2.F4)\. For each pair of methods, we compute the per\-example correctness difference on the same discrimination and robustness test cases, and test whether the paired Avg difference is significantly different from zero using a paired randomization test\. Since all methods are evaluated on the same test instances, this paired test controls for example\-level difficulty and is more appropriate than an independent\-sample comparison\.

The significance matrix shows that broad differences between weak similarity\-based metrics and stronger LLM\-based evaluators are statistically significant after Holm correction\. However, several top\-performing methods have no significant differences from each other, including CRIMSON, FineRadScore, and our MedGemma SFT\+RL evaluator\. This suggests that their small numerical Avg differences should be interpreted cautiously, while the improvements from trained evaluators over their base counterparts are statistically reliable\.

![Refer to caption](https://arxiv.org/html/2606.18797v1/x4.png)Figure 4:Pairwise paired significance tests on Avg across metrics and evaluators\. Each cell compares the row method against the column method using a paired randomization test over matched test cases\. Blue indicates that the row method significantly outperforms the column method, red indicates that it performs significantly worse, and gray indicates no significant difference after Holm correction \(p≥0\.05p\\geq 0\.05\)\.TypeMetricMaximin ThresholdD \(↑\\uparrow\)R \(↑\\uparrow\)Avg \(↑\\uparrow\)Gap \(↓\\downarrow\)NLPBLEU19\.0 \(±\\pm5\.5\)19\.0 \(±\\pm5\.5\)19\.0 \(±\\pm4\.0\)0\.0 \(±\\pm8\.0\)BERTScore24\.0 \(±\\pm6\.0\)24\.0 \(±\\pm6\.0\)24\.0 \(±\\pm4\.2\)0\.0 \(±\\pm8\.5\)AlignScore51\.0 \(±\\pm7\.0\)51\.0 \(±\\pm7\.0\)51\.0 \(±\\pm5\.0\)0\.0 \(±\\pm10\.5\)Med\.RadGraph28\.5 \(±\\pm6\.5\)28\.5 \(±\\pm6\.0\)28\.5 \(±\\pm4\.5\)0\.0 \(±\\pm10\.0\)RadBERTScore26\.5 \(±\\pm6\.0\)26\.5 \(±\\pm6\.5\)26\.5 \(±\\pm4\.2\)0\.0 \(±\\pm10\.0\)RaTEScore35\.5 \(±\\pm7\.0\)35\.5 \(±\\pm7\.0\)35\.5 \(±\\pm4\.8\)0\.0 \(±\\pm10\.0\)CheXbert47\.0 \(±\\pm7\.0\)49\.0 \(±\\pm7\.0\)48\.0 \(±\\pm5\.0\)2\.0 \(±\\pm10\.0\)LLMGREEN53\.5 \(±\\pm7\.0\)41\.5 \(±\\pm7\.0\)47\.5 \(±\\pm4\.8\)12\.0 \(±\\pm10\.0\)RadFact80\.0 \(±\\pm5\.5\)56\.0 \(±\\pm7\.0\)68\.0 \(±\\pm4\.5\)24\.0 \(±\\pm9\.5\)FineRadScore86\.5\(±\\pm5\.0\)68\.5 \(±\\pm6\.5\)77\.5 \(±\\pm4\.0\)18\.0 \(±\\pm8\.0\)CRIMSON75\.0 \(±\\pm6\.0\)80\.5\(±\\pm5\.5\)77\.8\(±\\pm4\.2\)5\.5 \(±\\pm8\.0\)Table 7:Comparison of score\-based metrics on ReEvalMed\. With Numbers in parentheses denote the half\-width of 95% bootstrap CIs\.TypeModel1\-Pass2\-PassD \(↑\\uparrow\)R \(↑\\uparrow\)Avg \(↑\\uparrow\)Gap \(↓\\downarrow\)D \(↑\\uparrow\)R \(↑\\uparrow\)Avg \(↑\\uparrow\)Gap \(↓\\downarrow\)ProprietaryGPT\-5\.198\.5 \(±\\pm2\.0\)42\.5 \(±\\pm7\.0\)70\.5 \(±\\pm3\.5\)56\.0 \(±\\pm7\.0\)95\.0 \(±\\pm3\.0\)69\.5 \(±\\pm6\.5\)82\.3\(±\\pm3\.5\)25\.5 \(±\\pm7\.0\)Claude Sonnet 4\.591\.5 \(±\\pm4\.0\)84\.0\(±\\pm5\.5\)87\.8\(±\\pm3\.2\)7\.5\(±\\pm6\.5\)98\.0\(±\\pm2\.0\)51\.0 \(±\\pm7\.0\)74\.5 \(±\\pm3\.5\)47\.0 \(±\\pm7\.0\)Gemini 3 Pro94\.0 \(±\\pm3\.5\)57\.5 \(±\\pm6\.5\)75\.8 \(±\\pm3\.8\)36\.5 \(±\\pm7\.5\)65\.0 \(±\\pm6\.5\)79\.5\(±\\pm6\.0\)72\.3 \(±\\pm4\.2\)14\.5 \(±\\pm9\.0\)Open\-sourceQwen3\-Max97\.0 \(±\\pm2\.5\)53\.5 \(±\\pm7\.0\)75\.3 \(±\\pm3\.8\)43\.5 \(±\\pm7\.5\)96\.5 \(±\\pm2\.5\)45\.0 \(±\\pm7\.0\)70\.8 \(±\\pm3\.8\)51\.5 \(±\\pm7\.5\)LingShu\-32B99\.5 \(±\\pm1\.0\)64\.0 \(±\\pm6\.5\)81\.8 \(±\\pm3\.3\)35\.5 \(±\\pm6\.5\)97\.5 \(±\\pm2\.5\)5\.5 \(±\\pm3\.5\)51\.5 \(±\\pm2\.0\)92\.0 \(±\\pm4\.0\)Hulu\-Med\-32B100\.0\(±\\pm0\.0\)23\.0 \(±\\pm6\.0\)61\.5 \(±\\pm3\.0\)77\.0 \(±\\pm6\.0\)96\.0 \(±\\pm3\.0\)52\.5 \(±\\pm7\.0\)74\.3 \(±\\pm3\.8\)43\.5 \(±\\pm7\.5\)Qwen3\-8B96\.5 \(±\\pm2\.5\)8\.0 \(±\\pm4\.0\)52\.3 \(±\\pm2\.3\)88\.5 \(±\\pm5\.0\)95\.5 \(±\\pm3\.0\)9\.0 \(±\\pm4\.0\)52\.3 \(±\\pm2\.5\)86\.5 \(±\\pm5\.0\)MedGemma\-4B49\.0 \(±\\pm7\.0\)9\.0 \(±\\pm4\.0\)29\.0 \(±\\pm4\.0\)40\.0 \(±\\pm8\.0\)67\.0 \(±\\pm6\.5\)67\.5 \(±\\pm6\.5\)67\.3 \(±\\pm4\.8\)0\.5\(±\\pm10\.0\)Table 8:Comparison of LLM\-as\-evaluator on ReEvalMed benchmark\. D and R scores are in accuracy \(%\)\. Numbers in parentheses denote the half\-width of 95% bootstrap CIs\. Best results are inbold\.Table 9:Training results on ReEvalMed based on Qwen3\-8BYanget al\.\([2025](https://arxiv.org/html/2606.18797#bib.bib47)\)and MedGemma\-4BSellergrenet al\.\([2025](https://arxiv.org/html/2606.18797#bib.bib49)\)\. D and R scores are in accuracy \(%\)\. Numbers in parentheses denote the half\-width of 95% bootstrap CIs\. Best results are inbold\.![Refer to caption](https://arxiv.org/html/2606.18797v1/x5.png)![Refer to caption](https://arxiv.org/html/2606.18797v1/x6.png)![Refer to caption](https://arxiv.org/html/2606.18797v1/x7.png)![Refer to caption](https://arxiv.org/html/2606.18797v1/x8.png)![Refer to caption](https://arxiv.org/html/2606.18797v1/x9.png)![Refer to caption](https://arxiv.org/html/2606.18797v1/x10.png)![Refer to caption](https://arxiv.org/html/2606.18797v1/x11.png)![Refer to caption](https://arxiv.org/html/2606.18797v1/x12.png)Figure 5:Aspect\-flow Sankey diagrams across eight LLM evaluators\. Each diagram shows one\-pass predicted aspects on the left, ground\-truth aspects in the middle, and two\-pass predicted aspects on the right\. Row totals are displayed on all axes\.![Refer to caption](https://arxiv.org/html/2606.18797v1/x13.png)

\(a\) Qwen3\-8B, 1\-pass

![Refer to caption](https://arxiv.org/html/2606.18797v1/x14.png)

\(b\) Qwen3\-8B, 2\-pass

Figure 6:Aspect\-level Sankey comparison across training stages for Qwen3\-8B\. Each panel visualizes the flow from ground\-truth error aspects to model\-predicted aspects under Base, SFT, and SFT\+RL settings\.![Refer to caption](https://arxiv.org/html/2606.18797v1/x15.png)

\(a\) MedGemma\-4B, 1\-pass

![Refer to caption](https://arxiv.org/html/2606.18797v1/x16.png)

\(b\) MedGemma\-4B, 2\-pass

Figure 7:Aspect\-level Sankey comparison across training stages for MedGemma\-4B\. Each panel visualizes the flow from ground\-truth error aspects to model\-predicted aspects under Base, SFT, and SFT\+RL settings\.

## Appendix CDiscussion

### C\.1Main Contribution and Practical Value

#### Contribution\.

Our contribution goes beyond applying ReEvalMed as a benchmark\. While ReEvalMed reports D and R mainly through averaged metric scores, we reformulate clinical\-significance evaluation as a classification problem, whichreduces the influence of outlier scores and makes the D–R behavior of LLM\-as\-judge metrics more explicit\. This redefinition allows the discrimination bias of LLM evaluators to be measured more directly: a reliable evaluator should both identify clinically significant errors and tolerate insignificant variations\. A second contribution is showing thatsignificance\-oriented data augmentation can improve this balance\. By synthesizing report pairs with controlled significance levels and applying a standard SFT–DPO post\-training pipeline, we can calibrate lightweight LLM evaluators on the D–R boundary without designing a complex task\-specific training algorithm\.

### C\.2Same\-Model Bias on Claude

Same\-model bias is a potential concern in LLM\-as\-judge studiesZhenget al\.\([2023](https://arxiv.org/html/2606.18797#bib.bib23)\); Liuet al\.\([2023a](https://arxiv.org/html/2606.18797#bib.bib2)\), especially when model\-generated outputs are evaluated by the same or closely related model\. However, this concern does not apply to our main evaluation setting\. All main results, including Tables[3](https://arxiv.org/html/2606.18797#S3.T3)and[4](https://arxiv.org/html/2606.18797#S5.T4), are evaluated on ReEvalMed, a clinically annotated benchmark thatis not generated by Claude\. Thus, our evaluation does not form a closed loop in which Claude\-generated data are judged again by Claude\.Claude is used only to construct supervision signals for data augmentation, which are then used to post\-train lightweight Qwen and MedGemma models, not to optimize Claude itself\. Using a stronger model to generate supervision and transfer this capability to smaller models is a common distillation practice, and should not be interpreted as same\-model bias in the evaluation\.

To further reduce the possibility that model\-generated errors are later judged by the same model, we used GPT\-5\.5 rather than Claude when constructing the Open\-i evaluation set\. GPT\-5\.5 is not included among the evaluator models in this paper\. This design minimizes potential same\-model bias in the external evaluation and further supports that our trained metrics do not merely imitate Claude’s judgments, but improve their ability to distinguish clinically significant errors from insignificant variations\.

### C\.3Potential Data Leakage

Potential data leakage from MIMIC\-CXR is unlikely to directly affect the main conclusions of this work\. Both the ReEvalMed test set and the samples constructed in our study are based on MIMIC\-CXR reports with controlled injected clinical errors\. Even if an LLM had been exposed to the original ground\-truth MIMIC\-CXR reports during pretraining, such exposure would not imply an ability to correctly identify diverse injected errors or determine their clinical significance, which is the central focus of our evaluation\.

Moreover, ReEvalMed was released only at the end of 2025, whereas the evaluated LLMs were trained before that release, making direct benchmark leakage unlikely\. To further reduce concerns about dataset\-specific leakage, we additionally introduce an external Open\-i evaluation set\. Since Open\-i is independent of MIMIC\-CXR, this held\-out evaluation provides complementary evidence that the observed behavior is not solely driven by memorization of MIMIC\-CXR reports\.

Although limited domain coverage remains a limitation, as discussed in the Limitation section, the added Open\-i analysis helps assess whether the conclusions generalize beyond the primary MIMIC\-CXR\-derived setting\.

### C\.4CRIMSON and MedGemma Fine\-tuning

Compared with CRIMSON, our MedGemma\-4B \+ SFT \+ RL evaluator changes the error profile from a more balanced score\-based evaluator to a more robustness\-oriented evaluator\. CRIMSON achieves stronger discrimination accuracy \(75\.0 vs\. 66\.5\), whereas our model achieves substantially higher robustness accuracy \(90\.5 vs\. 80\.5\)\. As a result, the overall Avg is numerically similar, with our model slightly higher \(78\.5 vs\. 77\.8\), but the paired significance test indicates that this Avg difference is not statistically significant\. This suggests that the main gain of our approach is not a large increase in aggregate Avg over CRIMSON, but a shift toward better preservation of clinically insignificant variations\.

The improvement is most visible on robustness cases, where our model correctly accepts more benign perturbations than CRIMSON\. In particular, robustness accuracy improves substantially for fabrication\-type cases \(50\.0 to 100\.0\), severity\-related cases \(73\.3 to 96\.7\), contradiction cases \(30\.0 to 80\.0\), and terminology/style\-related cases\. This indicates that SFT \+ RL makes the evaluator less likely to over\-penalize clinically harmless edits\. However, this comes with a trade\-off in discrimination: CRIMSON remains better at detecting significant errors overall, especially for size/distance, terminology, uncertainty, and severity\-related significant errors\. Thus, our method improves robustness and reduces false alarms on insignificant changes, while CRIMSON remains more sensitive to certain clinically significant discrepancies\.

### C\.5Critical and Significant Severity Levels

Our evaluation maps bothcriticalandsignificanterrors to the clinically significant class\. This design follows the practical goal of identifying report differences that may affect clinical interpretation or downstream decisions\. However, the boundary between critical and significant can itself be subjective and context\-dependent\. For this reason, we treat the central task as distinguishing clinically meaningful from clinically insignificant differences, while leaving finer\-grained severity calibration as an important direction for future work\.

### C\.6Potential Error Propagation

Our current two\-pass design is itself motivated by common practice in machine translation evaluation\. As pointed out in related work such as MQM\-APELuet al\.\([2025](https://arxiv.org/html/2606.18797#bib.bib1)\), in error span detection, completely missing an error span is relatively rare, while producing too many redundant spans is often a more common issue\. Based on this observation, pass2 does not handle spans missed by pass1, but it can suppress redundant or unreasonable spans\. We will clarify this design motivation and add the corresponding citation more explicitly in the revised version\.

### C\.7Why Use a Standard Post\-training Design

We use a standard SFT followed by DPO training design because our goal is to test whether targeted supervision can sharpen the clinical\-significance boundary, rather than to introduce a new optimization algorithm\. SFT teaches the model the structured output format and exposes it to controlled clinical\-error patterns\. DPO then directly optimizes preference pairs that reflect the desired D–R behavior\. This simple and reproducible design makes the effect of the data and supervision signal easier to interpret\.

The same principle also applies to our prompting design\. Our prompts are intended to provide a clear and consistent evaluation protocol for testing whether the proposed clinical\-significance formulation is effective, rather than to exhaustively optimize prompt engineering strategies\. A systematic exploration of alternative prompts, prompt ensembles, and model\-specific prompting designs is therefore left for future work\.

## Appendix DError Category Descriptions

Table[10](https://arxiv.org/html/2606.18797#A4.T10)provides detailed descriptions of the error categories in ReEvalMed, organised by Error Aspect and Error Type\.

TypeAspectDescriptionSignificant ExampleInsignificant ExampleO / F / ELocationErrors in the precise anatomical site of a finding \(e\.g\., laterality, lobe, region\)\.REF:left\-sided rib fractures
TGT:right rib fracturesREF:left retrocardiac opacity
TGT:opacity behind the heart on the left sideO / F / ESeverityErrors in describing the extent or clinical seriousness of a finding\.REF:Heart is mildly enlarged
TGT:Severely enlarged heartREF:Severe cardiomegaly
TGT:Moderate\-to\-severe cardiomegalyO / F / EDescriptionErrors in morphological characteristics such as shape, margins, or appearance\.REF:irregular mass with spiculated margins
TGT:round, smooth massREF:patchy ill\-defined opacities
TGT:faint and poorly marginated opacitiesO / F / EComp\./Prog\.Errors in describing interval changes compared to prior imaging\.REF:Pulmonary edema has improved
TGT:Pulmonary edema has worsenedREF:No interval change in pleural effusion
TGT:Pleural effusion is essentially unchangedSNegationIncorrect presence or absence of a finding\.REF:No evidence of pneumothorax
TGT:Pneumothorax is presentREF:No pleural effusion is seen
TGT:There is no definite pleural effusionSModalityConflicts with the imaging modality\.REF:Refer to prior CT torso for details
TGT:Refer to prior abdominal ultrasoundREF:Consider chest CT for further evaluation
TGT:CT can be considered for further assessmentSSize/DistanceErrors in quantitative measurements\.REF:3\-cm mass in the lingula
TGT:8\-cm mass in the lingulaREF:ET tube within 1 cm of the carina
TGT:ET tube within 0\.9 cm of the carinaSContradictionInternal logical inconsistencies within the same report\.REF:The lungs are clear\.
TGT:The lungs are clear\. There is consolidation in the right baseREF:No evidence of larger pleural effusions
TGT:No evidence of larger pleural effusions\. Minimal effusions may existSUncertaintyIncorrect use of hedging terms conveying diagnostic uncertainty\.REF:Whether this is pneumonia is radiographically indeterminate
TGT:Pneumonia existsREF:A possible infiltrate is suggested
TGT:An infiltrate is likely presentSTerminologyInaccurate or unclear medical terminology\.REF:A cavitary lesion, suggesting tuberculosis
TGT:A hole, suggesting infectionREF:A 3\-cm mass
TGT:A 3\-cm lesionSNoiseGrammatical mistakes, typographical errors, or other linguistic noise\.REF:3\-cm mass in the lingula has grown
TGT:3\-cm lingula margins has been growing irregularlyREF:subtle opacity may represent atelectasis
TGT:subtble opaciti may represent atelectasiSStylistic Var\.Variations in phrasing that do not alter clinical meaning\.REF:Bilateral left greater than right pleural effusion
TGT:Fluid accumulation on both sides of the chest, more on the rightREF:lung… pulmonary edema… pleural effusions
TGT:pleural effusions… lung… pulmonary edemaTable 10:Descriptions and examples of the error categories in ReEvalMedLiet al\.\([2025a](https://arxiv.org/html/2606.18797#bib.bib12)\), organized over 12 clinical aspects\. Each aspect is associated with up to three error types: Omission \(O\), Fabrication \(F\), and Inaccuracy \(E\)\. S: Inaccuracy, fabrication, and omission errors are randomly and evenly distributed\.Each category appears in both the Discrimination and Robustness test sets\. Examples are shown as REF–TGT pairs, where significant examples represent clinically meaningful errors and insignificant examples represent harmless variations\.
## Appendix ELength Distribution of Synthesized Data

Figure[8](https://arxiv.org/html/2606.18797#A5.F8)shows the length distribution of our synthesized data compared to the ReEvalMed test set, broken down by error category and stratified across four length buckets\. The generated corpus covers all buckets in every category, confirming that stratified sampling produces diverse report lengths\.

![Refer to caption](https://arxiv.org/html/2606.18797v1/length_distribution_comparison.png)Figure 8:Length distribution comparison between the ReEvalMed test set \(top axis\) and our generated training corpus \(bottom axis\), stacked by length bucket across all 12 aspects\.
## Appendix FHuman Validation on Report Synthesis

To verify the clinical fidelity of our synthesized data, we randomly sample 100 pairs \(50 D \+ 50 R\) covering the full range of error types and have a clinician manually review each pair\. For each sample, the clinician assesses whether: \(1\) the injected error is clinically plausible and correctly categorised, \(2\) the severity label \(significant vs\. insignificant\) is appropriate, and \(3\) the generated report reads naturally without obvious artefacts\.

#### Case studies

Table[11](https://arxiv.org/html/2606.18797#A6.T11)presents representative examples from the human validation, illustrating both successful and problematic generations\.

Table 11:Representative cases from human validation of synthesized data\. Cases are categorised by failure type:Severity Mislabel\(severity label disagrees with clinical judgement\),Injection Failure\(error injection produces no meaningful change\), andAspect Mismatch\(injected change is purely stylistic, deviating from the assigned error type\)\.

## Appendix GScore\-based Metric Results

Unlike LLM\-based metrics that output binary significant/insignificant predictions, score\-based metrics produce continuous scores requiring a decision threshold to convert to D/R accuracy\. In the main paper \(Table[3](https://arxiv.org/html/2606.18797#S3.T3)\), we report results at themaximin thresholdmaxθ⁡min⁡\(Dθ,Rθ\)\\max\_\{\\theta\}\\min\(D\_\{\\theta\},R\_\{\\theta\}\), which selects the operating point that maximises the worse of the two dimensions\.

Figure[9](https://arxiv.org/html/2606.18797#A7.F9)shows the full D–R trade\-off curves for all 8 score\-based metrics\. Each curve traces the \(D, R\) accuracy pair as the decision threshold varies; the filled circle marks the maximin operating point reported in Table[3](https://arxiv.org/html/2606.18797#S3.T3)\. CRIMSON achieves the best maximin point \(75\.0, 80\.5\), substantially outperforming all other score\-based metrics\. NLP metrics \(BLEU, BERTScore, AlignScore\) cluster in the low\-accuracy region, while medical metrics \(RadGraph, RaTEScore, CheXbert\) offer moderate improvements\.

![Refer to caption](https://arxiv.org/html/2606.18797v1/x17.png)Figure 9:D–R trade\-off curves for all score\-based metrics\. Each curve traces \(D, R\) accuracy across decision thresholds; filled circles mark the maximin operating pointmaxθ⁡min⁡\(Dθ,Rθ\)\\max\_\{\\theta\}\\min\(D\_\{\\theta\},R\_\{\\theta\}\)reported in Table[3](https://arxiv.org/html/2606.18797#S3.T3)\.
## Appendix HError Specification Examples

We provide two representative error specifications generated by Claude Sonnet 4\.5 \(Section[4](https://arxiv.org/html/2606.18797#S4)\)\. Each specification contains a clinical definition, medical patterns, and contrastive text patterns that distinguish significant errors from harmless variations\.

Error Specification: Location – OmissionDefinition:Omission of anatomical location information present in the reference report, including missing specific anatomical sites, laterality \(left/right\), spatial relationships, or regional descriptors\.Medical Patterns:•Missing laterality specification \(left/right, bilateral\)•Missing specific anatomical region or lobe identification•Missing spatial descriptors \(upper/lower, anterior/posterior\)Discrimination Text Patterns\(significant\):•“left upper lobe nodule”→\\to“pulmonary nodule” \(missing critical laterality and lobe\)•“bilateral pleural effusions”→\\to“pleural effusion” \(missing bilateral nature\)•“fracture of L3 vertebral body”→\\to“lumbar fracture” \(missing specific vertebral level\)Robustness Text Patterns\(insignificant\):•“hepatic lesion in segment 7”→\\to“right hepatic lobe lesion” \(alternative valid descriptor\)•“distal esophagus”→\\to“lower esophagus” \(synonymous location terms\)•“left lung base”→\\to“left lower lung” \(equivalent regional description\)

Error Specification: Description – FabricationDefinition:Adding findings, observations, or anatomical descriptions in TGT that do not exist in REF, including introducing new pathological findings, normal structures not mentioned, or additional descriptive details\.Medical Patterns:•Adding new pathological findings not present in REF \(e\.g\. pleural effusion, nodules\)•Adding quantitative details or measurements that do not exist in REF•Introducing temporal or comparative information not in REFDiscrimination Text Patterns\(significant\):•REF: “clear lungs”→\\toTGT adds “bilateral pleural effusions”•REF: “normal heart”→\\toTGT adds “cardiomegaly”•REF: “opacity present”→\\toTGT adds “large 5cm mass”Robustness Text Patterns\(insignificant\):•REF: “no acute findings”→\\toTGT: “heart, lungs, and mediastinum show no acute findings”•REF: “unremarkable”→\\toTGT lists “bones intact, soft tissues normal”•REF gives findings→\\toTGT adds “adequate inspiration” as contextual information

## Appendix IPrompt Templates

We provide the full prompt templates used for inference \(Section[3](https://arxiv.org/html/2606.18797#S3)\) and data synthesis \(Section[4](https://arxiv.org/html/2606.18797#S4)\)\. In all templates,\{REF\}and\{TGT\}are replaced with the ground\-truth and target reports, respectively\.

One\-Pass Evaluation PromptSYSTEM:You are a clinical evaluation agent for chest X\-ray reports\. You will be given: 1\) a ground\-truth report written by professional radiologists, and 2\) a target report to be evaluated\.Your task is to:•Identify all clinically significant and insignificant errors in the target report\.•Classify each error into one of the predefined error categories\.•Determine the error type \(Omission / Fabrication / Inaccuracy\) when applicable\.•Provide a concise explanation of the errors and their classifications\.Follow all definitions and constraints provided in subsequent instructions\.USER:You must evaluate the target report against the ground\-truth report and produce a structured JSON output\.Your output must be a JSON object with the following structure:•“critical”: a dictionary mapping error spans to error aspects for critical errors•“significant”: a dictionary mapping error spans to error aspects for significant errors•“insignificant”: a dictionary mapping error spans to error aspects for insignificant errors•“explanation”: a concise explanation of the errors identified and the rationale for the categorizationDefinition of significance:•Critical: Refer to internal inconsistencies or logical contradictions, such as describing the presence of a structure previously stated to be absent, that can severely undermine clinician trust in the report and are considered critical failures\.•Significant: Errors that meaningfully alter, mislead, or distort clinical decision\-making\.•Insignificant: Errors that represent stylistic variation, minor wording changes, or clinically harmless deviations\.Error Aspects:Below are the 12 error categories and their descriptions:\{error\_aspect\_info\}Every error must be assigned exactly one of these aspects\.Error Types:For the “Comparison/Progression”, “Description”, “Location”, “Severity” aspects, you must further specify the error type using one of:\{error\_type\_info\}For all other error aspects, you must NOT include an error type; only the aspect label should be provided\.Example:•“rib fractures”: “Location \- Omission”•“no pneumothorax”: “Negation” \(no error type needed\)If no errors are identified, the “significant” and “insignificant” fields should be empty dictionaries\.Reports to evaluate:Ground\-Truth Report:\{REF\}Target Report:\{TGT\}Please output only a single valid JSON object\.IMPORTANT:•Do NOT wrap the JSON in Markdown code fences\.•Do NOT include any language labels such as “json” before the JSON\.•The output must start directly with \{\{ and end with \}\}\.•Do NOT include any extra text, comments, or explanations outside the JSON\.

Two\-Pass: Pass 1 \(Span Detection\)SYSTEM:You are a clinical evaluator for chest X\-ray reports\. Given a ground\-truth report \(REF\) and a target report \(TGT\), identify all differences\.For each difference, output a JSON object with:•“span”: the exact text span that changed \(from TGT for Fabrication/Inaccuracy/Style; from REF for Omission\)•“aspect”: one of the 12 error aspectsValid aspect codes: Location \- Omission, Location \- Fabrication, Location \- Inaccuracy, Severity \- Omission, Severity \- Fabrication, Severity \- Inaccuracy, Description \- Omission, Description \- Fabrication, Description \- Inaccuracy, Comparison/Progression \- Omission, Comparison/Progression \- Fabrication, Comparison/Progression \- Inaccuracy, Negation, Modality, Size/Distance, Contradiction, Uncertainty, Terminology, Noise, Stylistic VariationOutput a JSON array\. If no differences, output \[\]\. Do NOT wrap in markdown\. Start directly with \[ and end with \]\.USER:Ground\-Truth Report:\{REF\}Target Report:\{TGT\}Identify all differences and output the JSON array of \{span, aspect\} objects\.

Two\-Pass: Pass 2 \(Severity Judgment\)SYSTEM:You are a clinical evaluator for chest X\-ray reports\. You will be given a REF/TGT report pair, a specific error span, its aspect, and judgment criteria\. Determine whether the error is SIGNIFICANT or INSIGNIFICANT\.Output a single JSON object: \{“severity”: “significant”/“insignificant”/“critical”, “explanation”: “…”\}Do NOT wrap in markdown\. Start with \{ and end with \}\.USER:Ground\-Truth Report:\{REF\}Target Report:\{TGT\}Identified Error Span: “\{span\}”Error Aspect:\{aspect\}Error Type:\{etype\_label\}Judgment Criteria for \[\{aspect\}\]:\{criteria\}Is this error SIGNIFICANT or INSIGNIFICANT? Output only the JSON\.

Error Injection Prompt \(Data Synthesis\)SYSTEM:You are a medical imaging report error generation system specialized in radiology reports\.USER:Error Type Definition:Error Category:\{error\_type\}\(\{error\_category\_name\}\)Error Aspect:\{error\_aspect\}\{error\_definition\}Reference Medical Context:The following patterns show how this error type manifests in real medical reports:\{medical\_patterns\}Severity Level:\{severity\_description\}Example Text Patterns:\{selected\_text\_patterns\}Task:\{severity\_instruction\}Reference Report:\{REF\}Instructions:•Introduce the specified error type into the reference report•Follow the text patterns shown above as guidance•Keep the rest of the report unchanged•Maintain medical plausibility and report structure•Preserve the overall report length \(±\\pm20%\)•Ensure the severity matches the specified levelOutput ONLY the modified target report text, without any explanations or formatting\.
Beyond Scalar Scores: Exploring LLM-based Metrics for Clinical Significance Evaluation in Radiology Reports

Similar Articles

ReportQA: QA-Based Radiology Report Evaluation

Auditing Multimodal LLM Raters: Central Tendency Bias in Clinical Ordinal Scoring

Agreement Metrics for LLM-as-Judge Evaluation: What to Report and Why

When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels

Review Arcade: On the Human Alignment and Gameability of LLM Reviews

Submit Feedback

Similar Articles

ReportQA: QA-Based Radiology Report Evaluation
Auditing Multimodal LLM Raters: Central Tendency Bias in Clinical Ordinal Scoring
Agreement Metrics for LLM-as-Judge Evaluation: What to Report and Why
When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels
Review Arcade: On the Human Alignment and Gameability of LLM Reviews