SDR: Set-Distance Rewards for Radiology Report Generation
Summary
This paper proposes set-distance rewards for reinforcement learning in chest X-ray report generation, using embedding-based set-to-set distances between generated and reference reports. Post-training with these rewards via GRPO consistently outperforms supervised fine-tuning and exact-match rewards, and enables efficient test-time scaling.
Cached at: 06/02/26, 03:47 PM
# SDR: Set-Distance Rewards for Radiology Report Generation
Source: [https://arxiv.org/html/2606.00440](https://arxiv.org/html/2606.00440)
H. Ibrahim Gulluk¹ (gulluk@stanford.edu), Max Van Puyvelde∗²,³ (maxvpuyv@stanford.edu), Wim Van Criekinge³ (wim.vancriekinge@ugent.be), Olivier Gevaert² (ogevaert@stanford.edu)

¹Department of Electrical Engineering, Stanford University; ²Department of Biomedical Data Science, Stanford University School of Medicine; ³Department of Mathematical Modelling, Statistics & Bioinformatics, Ghent University
###### Abstract
Reinforcement learning with verifiable rewards has rapidly advanced reasoning in vision–language models. However, for chest X-ray report generation, the standard rewards (i.e., exact-match accuracy and step-level process rewards) are incompatible because the reports consist of unordered and orthogonal findings rather than a causal reasoning chain. We address this gap with a set-based view: each report is split into sentences and embedded by a frozen sentence transformer, yielding unordered embedding sets. We propose using set-to-set distances between generated and reference embeddings as continuous, permutation-invariant rewards. Across two datasets and three vision–language models (Qwen3-VL-2B/4B, Gemma3-4B), post-training with set-distance rewards via GRPO consistently outperforms supervised fine-tuning and exact-match GRPO on all headline metrics (average relative improvements of 6.80%, 7.82% and 4.45% on BERTScore, RadGraph F1 and CheXbert F1, respectively). The same set distances also enable test-time best-of-$N$ selection: scoring candidates by their distance to training-report embeddings outperforms random selection on our trained models as well as on three closed-source LLMs (Mistral-Small, Gemini-2.5 Flash-Lite, GPT-4o-mini), with an average relative improvement of 16.4% on BERTScore. Used as a streaming signal, they support a more efficient form of test-time scaling: pruning low-scoring candidates mid-generation reduces generated tokens by over 50% while preserving the Findings quality of full best-of-$N$ selection. Together, these results establish set-distance rewards as a unified signal for both post-training and test-time scaling in chest X-ray report generation. Our code is publicly [available](https://anonymous.4open.science/r/Set-Distance-Rewards-CXR-BFDA).
## 1 Introduction
Medical image reports play a central role in clinical workflows, including diagnosis, treatment planning, and patient monitoring. Therefore, improving the efficiency and accuracy of medical image reporting using AI models has attracted increasing attention. Researchers have developed vision–language models for medical image report generation across various imaging modalities (Li et al., [2025](https://arxiv.org/html/2606.00440#bib.bib4); Liu et al., [2021](https://arxiv.org/html/2606.00440#bib.bib5); Hamamci et al., [2024](https://arxiv.org/html/2606.00440#bib.bib6)).
Similar to other medical imaging modalities, chest X-ray report generation constitutes a critical component of clinical workflows, as chest radiography is among the most commonly performed and widely accessible imaging techniques in medicine. Assisted report generation systems have the potential to reduce radiologist workload while improving reporting consistency and accuracy. Consequently, chest X-ray report generation using vision–language models has been extensively studied Liu et al. ([2019](https://arxiv.org/html/2606.00440#bib.bib1)); Li et al. ([2023b](https://arxiv.org/html/2606.00440#bib.bib2)); Endo et al. ([2021](https://arxiv.org/html/2606.00440#bib.bib3)).
In addition, recent advances in the reasoning capabilities of both language-only and vision–language models have demonstrated improved performance on complex tasks such as mathematical problem solving, coding, and medical visual question answering Shao et al. ([2024](https://arxiv.org/html/2606.00440#bib.bib7)); Luo et al. ([2024](https://arxiv.org/html/2606.00440#bib.bib9)); Li et al. ([2023a](https://arxiv.org/html/2606.00440#bib.bib8)); Wu et al. ([2025](https://arxiv.org/html/2606.00440#bib.bib10)). Reinforcement learning–based fine-tuning has shown promising results in further enhancing the reasoning abilities of these models Rafailov et al. ([2023](https://arxiv.org/html/2606.00440#bib.bib11)); Schulman et al. ([2017](https://arxiv.org/html/2606.00440#bib.bib12)); Shao et al. ([2024](https://arxiv.org/html/2606.00440#bib.bib7)).
Specifically, Group Relative Policy Optimization (GRPO) has been shown to achieve competitive performance without requiring explicit preferred and non-preferred pairs. This type of reward-based reinforcement learning nevertheless raises an important question about the design of the reward function, i.e., how to appropriately reward or penalize a language model based on its generated outputs. Binary reward functions based on the correctness of the outputs are commonly used. Such discrete supervision can induce noise, which motivates approaches that assign partial rewards to intermediate steps or to the overall generation process even when the final answer is incorrect. This has led to the development of process reward models Khalifa et al. ([2025](https://arxiv.org/html/2606.00440#bib.bib13)); Zhang et al. ([2025b](https://arxiv.org/html/2606.00440#bib.bib14)); Lightman et al. ([2024](https://arxiv.org/html/2606.00440#bib.bib15)).
However, assigning rewards to each step in the reasoning process may not be feasible, as step\-level annotations are often unavailable, and verifying each step using external sources can be computationally expensive\. Moreover, in chest X\-ray reports, clinicians provide findings that do not necessarily form a causal or sequential structure that can be interpreted as a chain of thought, making step\-by\-step verification less meaningful\. Instead, these findings are often independent of one another and may be presented in an arbitrary order\.
To address these challenges, we propose a set distance–based reward formulation\. Specifically, we obtain embeddings of sentences from both the ground truth report and the generated report, and compute distances between these two sets of vectors\. These distances are then used as a reward signal during GRPO training\. In this way, we provide a continuous reward signal that accounts for the unordered and independent nature of chest X\-ray findings\.
##### Contributions\.
Our main contributions are summarized as follows:
- Set-distance reward functions for GRPO post-training. We address the infeasibility of process reward modelling for radiology reports by treating each report as an unordered set of sentence embeddings and using set-to-set distances (Chamfer and Hausdorff over cosine distance) as a continuous, permutation-invariant reward signal during GRPO. Across three vision–language backbones and two report-generation benchmarks, our set-distance reward consistently outperforms both the SFT baseline and discrete exact-match GRPO post-training on the evaluation metrics.
- Set-distance–guided test-time response selection. We further use the same family of set distances as a test-time scaling / best-of-$N$ selection signal: for each test image we sample $K$ candidate reports from the model and pick the one whose embedding set is closest to the embedding sets of the training reports. This inference-time procedure improves closed-source generalist LLMs (GPT, Gemini, Mistral) over a random-selection baseline, averaged over multiple candidates per sample.
- Distance-based on-the-fly pruning of generations. As an extension of the above, we compute the running set distance between the partially generated text and the training distribution during inference, and prune candidates whose distance crosses a threshold before they are fully generated. This early-stopping scheme attains comparable quality with substantially fewer generated tokens, demonstrating that the same set-distance signal that drives our reward can also be used to lower the compute cost of test-time scaling.
## 2 Related Work
Medical vision–language models have been proposed to expand the applications of AI in the medical domain. MedViLL, a BERT-based model introduced in Moon et al. ([2022](https://arxiv.org/html/2606.00440#bib.bib22)), can perform tasks such as medical diagnosis, image–report retrieval, and medical visual question answering. Med-Flamingo adapts the Flamingo architecture to medical image–text data for tasks including medical VQA and rationale generation Moor et al. ([2023](https://arxiv.org/html/2606.00440#bib.bib23)).
Enhancing reasoning in medical vision–language models has gained attention following the reasoning improvements in general models. The MedReason dataset was proposed to support this line of work Wu et al. ([2025](https://arxiv.org/html/2606.00440#bib.bib10)). Med-R1 models are generalist vision–language models trained with reinforcement learning on diverse medical image modalities Lai et al. ([2026](https://arxiv.org/html/2606.00440#bib.bib25)). Similarly, MedVLM-R1, which is trained with GRPO, improves medical image reasoning Pan et al. ([2025](https://arxiv.org/html/2606.00440#bib.bib26)).
Beyond medical models, reward design in RL post-training, particularly for GRPO, remains an active area of research. While discrete correctness-based rewards have shown strong gains, especially in mathematical reasoning Shao et al. ([2024](https://arxiv.org/html/2606.00440#bib.bib7)), continuous rewards are being investigated to reduce the noise introduced by binary supervision, since partially correct intermediate reasoning steps may still be valuable even when the final answer is incorrect Khalifa et al. ([2025](https://arxiv.org/html/2606.00440#bib.bib13)).
To address these challenges, She et al. ([2025](https://arxiv.org/html/2606.00440#bib.bib27)) proposed reasoning-driven process reward modeling. Entropy-Regularized Process Reward Modeling (ER-PRM) Zhang et al. ([2024](https://arxiv.org/html/2606.00440#bib.bib28)) adds a KL-regularized Markov decision process formulation so that the model remains close to its initial distribution during process reward modeling. EDU-PRM, on the other hand, applies entropy-driven sampling to generate reasoning steps Cao et al. ([2025](https://arxiv.org/html/2606.00440#bib.bib29)).
## 3 Method
We first fine-tune vision–language models using SFT on chest X-ray reports and then post-train them via Group Relative Policy Optimization (GRPO) (Shao et al., [2024](https://arxiv.org/html/2606.00440#bib.bib7)) with a format reward that constrains the output to a structured reasoning template, and additional set-based semantic rewards that score the clinical content of the generated report against the reference report. The key design choice behind the semantic reward is to treat each section of a report as an unordered set of sentence embeddings rather than as a single sequence, reflecting the observation that individual chest X-ray findings are permutation-invariant and generally orthogonal rather than forming a causal chain.
### 3.1 Sentence-level report representation

[Figure 1 schematic: an example chest X-ray report with a Findings and an Impression section is split into sentences, and each sentence is passed through the frozen all-mpnet-base-v2 sentence transformer, giving embeddings $\mathbf{e}^{F}_{1},\dots,\mathbf{e}^{F}_{5}$ and $\mathbf{e}^{I}_{1},\dots,\mathbf{e}^{I}_{4}$ that together form $\mathcal{E}(r)\subset\mathbb{R}^{d}$.]

Figure 1: Sentence-level encoding of a chest X-ray report. Each study pairs a radiograph with a free-text report composed of a Findings and an Impression section. We split both sections into individual sentences and embed each sentence independently with the frozen pre-trained all-mpnet-base-v2 sentence transformer, producing one $d$-dimensional vector per sentence. The resulting unordered collection of sentence embeddings $\mathcal{E}(r)=\{\mathbf{e}^{F}_{1},\dots,\mathbf{e}^{F}_{5},\mathbf{e}^{I}_{1},\dots,\mathbf{e}^{I}_{4}\}\subset\mathbb{R}^{d}$ serves as the ground-truth report representation in our set-reward pipeline.

A chest X-ray report $y$ consists of two sections, a Findings section $y^{F}$ and an Impression section $y^{I}$. We split each section into individual sentences using a standard sentence segmenter, yielding $y^{F}=(s^{F}_{1},\dots,s^{F}_{n_{F}})$ and $y^{I}=(s^{I}_{1},\dots,s^{I}_{n_{I}})$, where the sentence counts $n_{F},n_{I}\in\mathbb{N}$ may vary across studies. Each sentence $s$ is mapped to a fixed-dimensional semantic embedding $\mathbf{e}=E_{\phi}(s)\in\mathbb{R}^{d}$ by a frozen pre-trained sentence transformer $E_{\phi}$ (specifically all-mpnet-base-v2, Reimers and Gurevych ([2019](https://arxiv.org/html/2606.00440#bib.bib30))). The report is then represented by the two sets of embeddings

$$
\mathcal{E}^{F}(y)=\bigl\{\,E_{\phi}(s^{F}_{i}) : 1\leq i\leq n_{F}\,\bigr\},\qquad \mathcal{E}^{I}(y)=\bigl\{\,E_{\phi}(s^{I}_{j}) : 1\leq j\leq n_{I}\,\bigr\}, \tag{1}
$$

both subsets of $\mathbb{R}^{d}$ (Figure [1](https://arxiv.org/html/2606.00440#S3.F1)). Because $\mathcal{E}^{F}(y)$ and $\mathcal{E}^{I}(y)$ are sets, they are invariant under permutations of the underlying sentences, which captures the clinical intuition that the listing order of individual findings carries no diagnostic meaning. Throughout this section the unhatted symbol $y$ denotes the ground-truth reference report paired with the input X-ray, and $\hat{y}$ denotes a generation produced by the model.
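The set representation of Eq. (1) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `split_sentences` is a naive regex splitter standing in for the unspecified sentence segmenter, and `toy_encode` is a hypothetical two-dimensional stand-in for the frozen sentence transformer, chosen only so the example runs without external dependencies.

```python
import re


def split_sentences(section):
    """Naive splitter on '.', '!' or '?' followed by whitespace.

    A stand-in for a real sentence segmenter; it will mis-split abbreviations.
    """
    parts = re.split(r"(?<=[.!?])\s+", section.strip())
    return [p for p in parts if p]


def embedding_set(section, encode):
    """Map a report section to the set of its sentence embeddings (Eq. 1).

    `encode` is any callable from a sentence to a sequence of floats.
    """
    return frozenset(tuple(encode(s)) for s in split_sentences(section))


def toy_encode(sentence):
    """Hypothetical 2-D stand-in for the frozen sentence encoder."""
    return (float(len(sentence)), float(sentence.count(" ")))


a = embedding_set("The heart size is normal. No pleural effusion.", toy_encode)
b = embedding_set("No pleural effusion. The heart size is normal.", toy_encode)
print(a == b)  # True: reordering the sentences leaves the set unchanged
```

In practice `encode` would wrap the actual sentence encoder; the point of the sketch is only that a set-valued representation discards sentence order.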
##### Format reward\.
To make the generated report easily parseable, we require the output $\hat{y}$ to follow the template `<think>` $\hat{y}^{F}$ `</think>` `<answer>` $\hat{y}^{I}$ `</answer>`, where $\hat{y}^{F}$ and $\hat{y}^{I}$ are the generated Findings and Impression respectively, with each tag occurring exactly once, in the specified order, and enclosing non-empty content. The format reward is the binary indicator $R_{\mathrm{fmt}}(\hat{y})=\mathrm{valid}(\hat{y})\in\{0,1\}$; when $R_{\mathrm{fmt}}=1$ the two section strings can be extracted unambiguously from $\hat{y}$ and fed into the semantic reward defined below.
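A minimal sketch of such a validity check is shown below. The function names are hypothetical, and one detail is an assumption not stated in the text: the sketch rejects any non-whitespace text outside the two tagged blocks.

```python
import re

# Assumed template: one <think>...</think> block followed by one
# <answer>...</answer> block, with only whitespace allowed around them.
_TEMPLATE = re.compile(
    r"\s*<think>(.*?)</think>\s*<answer>(.*?)</answer>\s*", re.DOTALL
)
_TAGS = ("<think>", "</think>", "<answer>", "</answer>")


def parse_report(text):
    """Return (findings, impression) if the template is satisfied, else None."""
    # Each tag must occur exactly once.
    if any(text.count(tag) != 1 for tag in _TAGS):
        return None
    match = _TEMPLATE.fullmatch(text)  # also enforces the tag order
    if match is None:
        return None
    findings, impression = match.group(1).strip(), match.group(2).strip()
    # Both sections must enclose non-empty content.
    if not findings or not impression:
        return None
    return findings, impression


def format_reward(text):
    """Binary format indicator R_fmt in {0, 1}."""
    return 1 if parse_report(text) is not None else 0


print(format_reward("<think>Lungs clear.</think> <answer>No acute disease.</answer>"))  # 1
print(format_reward("<answer>No acute disease.</answer><think>Lungs clear.</think>"))  # 0
```

Returning the parsed sections alongside the indicator means the same function can supply the strings needed by the semantic reward.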
### 3.2 Set-based semantic reward
Given a generated report $\hat{y}$ whose sections $\hat{y}^{F}$ and $\hat{y}^{I}$ have been extracted according to the template above, we form their sets of sentence embeddings $\mathcal{E}^{F}(\hat{y})$ and $\mathcal{E}^{I}(\hat{y})$ using exactly the same encoder $E_{\phi}$ as for the reference report in Eq. ([1](https://arxiv.org/html/2606.00440#S3.E1)). This yields, for each training example, two pairs of embedding sets,

$$
\bigl(\mathcal{E}^{F}(\hat{y}),\,\mathcal{E}^{F}(y)\bigr),\qquad \bigl(\mathcal{E}^{I}(\hat{y}),\,\mathcal{E}^{I}(y)\bigr), \tag{2}
$$

each consisting of one generated and one reference set of vectors in $\mathbb{R}^{d}$.

Let

$$
\mathcal{D}\,:\,2^{\mathbb{R}^{d}}\times 2^{\mathbb{R}^{d}}\;\longrightarrow\;[0,1]
$$

be any symmetric set-to-set distance normalised to the unit interval, that is, $\mathcal{D}(\mathcal{A},\mathcal{B})=\mathcal{D}(\mathcal{B},\mathcal{A})$ and $\mathcal{D}(\mathcal{A},\mathcal{A})=0$ for all non-empty finite $\mathcal{A},\mathcal{B}\subset\mathbb{R}^{d}$. Concrete choices for $\mathcal{D}$ are described in Appendix [A](https://arxiv.org/html/2606.00440#A1). For each section $S\in\{F,I\}$ we define the section-level semantic reward as the similarity induced by $\mathcal{D}$,

$$
R^{S}_{\mathrm{sem}}(\hat{y},y)\;=\;1-\mathcal{D}\bigl(\mathcal{E}^{S}(\hat{y}),\,\mathcal{E}^{S}(y)\bigr)\;\in\;[0,1],\qquad S\in\{F,I\}. \tag{3}
$$

The total set-based semantic reward is the sum of the two section-level rewards,

$$
R_{\mathrm{sem}}(\hat{y},y)\;=\;R^{F}_{\mathrm{sem}}(\hat{y},y)+R^{I}_{\mathrm{sem}}(\hat{y},y)\;\in\;[0,2]. \tag{4}
$$
We instantiate $\mathcal{D}$ with set-to-set distances built on top of the cosine distance between unit-norm sentence embeddings: for two embeddings $\mathbf{u},\mathbf{v}\in\mathbb{R}^{d}$ we use $d(\mathbf{u},\mathbf{v})=\tfrac{1}{2}(1-\mathbf{u}^{\top}\mathbf{v})\in[0,1]$, which is well matched to the unit-normalised outputs of $E_{\phi}$. Throughout the rest of this subsection we use $\hat{\mathbf{e}}\in\mathcal{E}^{S}(\hat{y})$ to denote the embedding of a sentence in the generated section $\hat{y}^{S}$ and $\mathbf{e}\in\mathcal{E}^{S}(y)$ for that of a sentence in the matched reference section $y^{S}$, with cardinalities $n=|\mathcal{E}^{S}(\hat{y})|$ and $m=|\mathcal{E}^{S}(y)|$. With this notation, for each section $S\in\{F,I\}$, the *Chamfer* set distance averages the nearest-neighbour cost in each direction,

$$
\mathcal{D}_{\mathrm{Chamfer}}\bigl(\mathcal{E}^{S}(\hat{y}),\,\mathcal{E}^{S}(y)\bigr)\;=\;\tfrac{1}{2}\left(\frac{1}{n}\sum_{\hat{\mathbf{e}}\in\mathcal{E}^{S}(\hat{y})}\min_{\mathbf{e}\in\mathcal{E}^{S}(y)}d(\hat{\mathbf{e}},\mathbf{e})\;+\;\frac{1}{m}\sum_{\mathbf{e}\in\mathcal{E}^{S}(y)}\min_{\hat{\mathbf{e}}\in\mathcal{E}^{S}(\hat{y})}d(\hat{\mathbf{e}},\mathbf{e})\right), \tag{5}
$$

so that the first term rewards every generated sentence for matching some reference sentence and the second term penalises reference sentences not covered by any generated sentence. The *Hausdorff* set distance replaces the means with maxima, returning the worst-case uncovered sentence in either direction,

$$
\mathcal{D}_{\mathrm{Hausdorff}}\bigl(\mathcal{E}^{S}(\hat{y}),\,\mathcal{E}^{S}(y)\bigr)\;=\;\max\left\{\max_{\hat{\mathbf{e}}\in\mathcal{E}^{S}(\hat{y})}\min_{\mathbf{e}\in\mathcal{E}^{S}(y)}d(\hat{\mathbf{e}},\mathbf{e}),\;\max_{\mathbf{e}\in\mathcal{E}^{S}(y)}\min_{\hat{\mathbf{e}}\in\mathcal{E}^{S}(\hat{y})}d(\hat{\mathbf{e}},\mathbf{e})\right\}, \tag{6}
$$

which is never smaller than the Chamfer distance (each maximum is at least the corresponding mean), so that a single uncovered sentence on either side is enough to dominate the reward. Because these rewards are used in post-training, after the model has already been trained with SFT, the Hausdorff-based reward can be seen as penalizing the model when it generates outliers or contradictory observations. Both metrics are symmetric, invariant to the orderings of $\mathcal{E}^{S}(\hat{y})$ and $\mathcal{E}^{S}(y)$ by construction, and lie in $[0,1]$ since the base distance $d$ does. Concretely, $1-\mathcal{D}_{\mathrm{Chamfer}}$ behaves like a soft coverage reward (it grows whenever any additional generated sentence finds a close reference match), whereas $1-\mathcal{D}_{\mathrm{Hausdorff}}$ behaves like a worst-case recall reward that only saturates once every clinically relevant finding has been mentioned. Empirically we find that these two choices are the most useful set rewards for GRPO post-training, and we report results for them throughout Sec. [4](https://arxiv.org/html/2606.00440#S4).
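The two distances of Eqs. (5) and (6) can be sketched in a few lines of dependency-free Python. This is an illustrative sketch rather than the released implementation: the function names are hypothetical, embeddings are assumed to be unit-norm tuples of floats, and the toy vectors are two-dimensional stand-ins for real sentence embeddings.

```python
def cosine_distance(u, v):
    """d(u, v) = (1 - u.v) / 2 for unit-norm vectors, so d lies in [0, 1]."""
    return 0.5 * (1.0 - sum(a * b for a, b in zip(u, v)))


def _nearest(source, target):
    """Nearest-neighbour distance from each vector in source to the set target."""
    return [min(cosine_distance(s, t) for t in target) for s in source]


def chamfer(gen, ref):
    """Eq. (5): mean nearest-neighbour cost, averaged over both directions."""
    g2r, r2g = _nearest(gen, ref), _nearest(ref, gen)
    return 0.5 * (sum(g2r) / len(g2r) + sum(r2g) / len(r2g))


def hausdorff(gen, ref):
    """Eq. (6): worst nearest-neighbour cost over both directions."""
    return max(max(_nearest(gen, ref)), max(_nearest(ref, gen)))


# Toy unit vectors: the generation covers the first reference sentence
# and misses the second (orthogonal) one.
ref = [(1.0, 0.0), (0.0, 1.0)]
gen = [(1.0, 0.0)]

print(chamfer(gen, ref))    # 0.125
print(hausdorff(gen, ref))  # 0.5
```

In this toy case the single missed reference sentence costs the Chamfer reward $1-0.125=0.875$ only a little, while the Hausdorff reward drops to $1-0.5=0.5$, which matches the worst-case behaviour described above.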
##### Combined reward\.
The final scalar reward passed to GRPO is the weighted sum $R(\hat{y},y)=\lambda_{\mathrm{fmt}}\,R_{\mathrm{fmt}}(\hat{y})+\lambda_{\mathrm{sem}}\,R_{\mathrm{sem}}(\hat{y},y)$ with non-negative weights $\lambda_{\mathrm{fmt}},\lambda_{\mathrm{sem}}\geq 0$. When $R_{\mathrm{fmt}}(\hat{y})=0$ the section strings cannot be reliably extracted, and we set $R_{\mathrm{sem}}(\hat{y},y):=0$ so that a malformed generation is penalised through both reward components simultaneously. In the experiments, we simply set $\lambda_{\mathrm{fmt}}=\lambda_{\mathrm{sem}}=1$.
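As a small sketch of how the pieces combine, the function below takes the format indicator and the two per-section set distances as already-computed inputs. The function and argument names are hypothetical; only the weighting and the zeroing of the semantic term for malformed outputs follow the text.

```python
def combined_reward(format_ok, d_findings, d_impression, lam_fmt=1.0, lam_sem=1.0):
    """R = lam_fmt * R_fmt + lam_sem * R_sem, with R_sem := 0 if the format is invalid.

    d_findings and d_impression are set distances in [0, 1],
    e.g. Chamfer or Hausdorff between generated and reference sections.
    """
    r_fmt = 1.0 if format_ok else 0.0
    # Eqs. (3)-(4): per-section similarity 1 - D, summed over both sections.
    r_sem = (1.0 - d_findings) + (1.0 - d_impression) if format_ok else 0.0
    return lam_fmt * r_fmt + lam_sem * r_sem


print(combined_reward(True, 0.25, 0.5))   # 2.25
print(combined_reward(False, 0.25, 0.5))  # 0.0
```

With unit weights the reward therefore ranges over $[0,3]$: up to 1 from the format term and up to 2 from the semantic term.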
### 3.3 Inference-time response selection

[Figure 2, panel (a): schematic of the embedding sets $\mathcal{E}^{F}(\hat{y})$ (points $\hat{\mathbf{e}}_{1},\hat{\mathbf{e}}_{2},\hat{\mathbf{e}}_{3}$) and $\mathcal{E}^{F}(y)$ (points $\mathbf{e}_{1},\dots,\mathbf{e}_{4}$), with the nearest-neighbour distance $d(\hat{\mathbf{e}}_{2},\mathbf{e}_{2})$ labelled.]

(a) Set-distance computation between two embedding sets. Each dot is one sentence embedding; the two ellipses are the full embedding sets of a generated report (amber) and the matched reference report (slate). Dashed lines are the nearest-neighbour cosine distances that Chamfer / Hausdorff aggregate (Eqs. ([5](https://arxiv.org/html/2606.00440#S3.E5)) and ([6](https://arxiv.org/html/2606.00440#S3.E6))).

[Figure 2, panel (b): 2D schematic of the embedding space $\mathbb{R}^{d}$ showing the training reports $\{r^{(t)}\}_{t=1}^{N}$, candidates $\hat{y}^{(1)},\dots,\hat{y}^{(4)}$ with $\hat{y}^{(k)}\sim\pi(\cdot\mid x)$, and the selected response $\hat{y}^{\star}=\arg\min_{k}\mathfrak{D}(\hat{y}^{(k)})$.]

(b) Inference-time response selection in embedding space. Each dot is itself an embedding *set* of the form drawn in (a): amber dots are candidates $\hat{y}^{(k)}$, small slate dots are training reports $r^{(t)}$, and the rust star is the selected response $\hat{y}^{\star}$.

Figure 2: Set distances and inference-time response selection. (a) Visualization of a single set-distance computation between a generated report (three sentence embeddings) and a matched reference report (four sentence embeddings). (b) The full inference-time selection pipeline of Eq. ([10](https://arxiv.org/html/2606.00440#S3.E10)).

The set-based distance metrics introduced in Appendix [A](https://arxiv.org/html/2606.00440#A1) are not restricted to reward design during GRPO fine-tuning; they can equally well serve as a purely *inference-time* selection criterion that does not require any gradient updates. In this section we describe a simple but effective test-time pipeline in which a generative policy $\pi$ (either our GRPO-fine-tuned VLM or a frozen closed-source generalist LLM such as GPT-4o or Gemini) is queried $K$ times on the same chest X-ray, and the candidate report that is most consistent with the training distribution of real radiology reports is retained.

Concretely, let $\mathcal{T}=\{\,r^{(t)}\,\}_{t=1}^{N}$ denote the training corpus, with each report $r^{(t)}$ represented by its two section embedding sets $\mathcal{E}^{F}(r^{(t)})$ and $\mathcal{E}^{I}(r^{(t)})$ as defined in Section [3.1](https://arxiv.org/html/2606.00440#S3.SS1). For a chest X-ray $x$ at test time, we draw $K$ independent generations

$$
\hat{y}^{(1)},\,\hat{y}^{(2)},\,\dots,\,\hat{y}^{(K)}\;\overset{\mathrm{i.i.d.}}{\sim}\;\pi(\,\cdot\mid x\,),
$$

extract the Findings and Impression sections from each $\hat{y}^{(k)}$ (via the `<think>`/`<answer>` template of Sec. [3](https://arxiv.org/html/2606.00440#S3)), and compute their sentence-embedding sets $\mathcal{E}^{S}(\hat{y}^{(k)})$ for $S\in\{F,I\}$.
##### Distance from a generation to the training distribution\.
Since $\mathcal{T}$ is itself a set of sets of embeddings, scoring a single candidate against $\mathcal{T}$ requires a second aggregation on top of the set-to-set metric $\mathcal{D}(\cdot,\cdot)$ of Appendix [A](https://arxiv.org/html/2606.00440#A1). We first compute, for each section $S\in\{F,I\}$ and each training report $t$, the per-report distance

$$
\mathfrak{D}_{S,\,t}(\hat{y})\;=\;\mathcal{D}\bigl(\mathcal{E}^{S}(\hat{y}),\,\mathcal{E}^{S}(r^{(t)})\bigr),\qquad t=1,\dots,N. \tag{7}
$$

The distance from the generation $\hat{y}$ to the training distribution $\mathcal{T}$ is then any of the following three aggregations,

$$
\mathfrak{D}^{\min}_{S}(\hat{y})=\min_{1\leq t\leq N}\mathfrak{D}_{S,\,t}(\hat{y}),\quad \mathfrak{D}^{\mathrm{avg}}_{S}(\hat{y})=\frac{1}{N}\sum_{t=1}^{N}\mathfrak{D}_{S,\,t}(\hat{y}),\quad \mathfrak{D}^{\mathrm{kNN}}_{S}(\hat{y})=\frac{1}{k}\sum_{t\in\mathcal{N}_{k}(\hat{y})}\mathfrak{D}_{S,\,t}(\hat{y}), \tag{8}
$$

where $\mathcal{N}_{k}(\hat{y})\subseteq\{1,\dots,N\}$ indexes the $k$ training reports with the smallest $\mathfrak{D}_{S,\,t}(\hat{y})$. Each aggregation encodes a different notion of “closeness to the training distribution”: $\mathfrak{D}^{\min}$ asks whether $\hat{y}$ resembles *any* single real report, $\mathfrak{D}^{\mathrm{avg}}$ scores against the whole corpus uniformly, and $\mathfrak{D}^{\mathrm{kNN}}$ is a noise-robust middle ground.
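The three aggregations of Eq. (8) reduce to a few lines once the per-report distances of Eq. (7) are available as a list. The sketch below uses hypothetical names, and capping $k$ at $N$ when fewer than $k$ training reports exist is an assumption made here, not something the text specifies.

```python
def aggregate(distances, mode="min", k=5):
    """Collapse per-report distances D_{S,t}, t = 1..N, into one score (Eq. 8)."""
    if not distances:
        raise ValueError("need at least one training-report distance")
    if mode == "min":
        return min(distances)
    if mode == "avg":
        return sum(distances) / len(distances)
    if mode == "knn":
        # Mean over the k smallest distances; k is capped at N (assumption).
        nearest = sorted(distances)[: min(k, len(distances))]
        return sum(nearest) / len(nearest)
    raise ValueError(f"unknown aggregation mode: {mode}")


per_report = [0.5, 0.25, 0.75, 1.0]
print(aggregate(per_report, "min"))       # 0.25
print(aggregate(per_report, "avg"))       # 0.625
print(aggregate(per_report, "knn", k=2))  # 0.375
```

The example shows the ordering one would expect: the minimum is the most optimistic score, the corpus-wide average the most conservative, and the kNN mean sits between them.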
##### Selection rule\.
The total distance of a candidate to the training distribution is the sum of its two section-level distances,

$$
\mathfrak{D}(\hat{y})\;=\;\mathfrak{D}_{F}(\hat{y})+\mathfrak{D}_{I}(\hat{y}), \tag{9}
$$

matching the additive structure of the semantic reward in Eq. ([4](https://arxiv.org/html/2606.00440#S3.E4)). The selected response is the candidate with the smallest such distance,

$$
\hat{y}^{\star}\;=\;\operatorname*{arg\,min}_{k\in\{1,\dots,K\}}\;\mathfrak{D}\bigl(\hat{y}^{(k)}\bigr). \tag{10}
$$

Any of the distance metrics $\mathcal{D}$ from Appendix [A](https://arxiv.org/html/2606.00440#A1) can be plugged into Eq. ([7](https://arxiv.org/html/2606.00440#S3.E7)) and combined with any of the three aggregations of Eq. ([8](https://arxiv.org/html/2606.00440#S3.E8)), yielding a large family of selection policies whose relative merits we study empirically in Section [5](https://arxiv.org/html/2606.00440#S5).
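The selection rule of Eqs. (9) and (10) amounts to an argmin over summed section scores. In the sketch below the aggregated Findings and Impression distances of each candidate are assumed to be precomputed; the function name and the first-index tie-breaking are illustrative choices rather than details taken from the text.

```python
def select_response(section_scores):
    """Return the index k minimising D_F + D_I (Eqs. 9-10).

    section_scores[k] = (D_F, D_I) holds the aggregated distances of
    candidate k to the training corpus. Ties go to the lowest index.
    """
    totals = [d_f + d_i for d_f, d_i in section_scores]
    return min(range(len(totals)), key=totals.__getitem__)


# Three hypothetical candidates with (Findings, Impression) distances.
scores = [(0.5, 0.25), (0.125, 0.25), (0.25, 0.5)]
print(select_response(scores))  # 1
```

Because each candidate is scored independently, the per-candidate distances can be computed in parallel before this final reduction.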
Because the pipeline is purely inference-time, no parameter of the generation policy $\pi$ is updated, and the only training-time artefact it relies on is the pre-computed set $\{\,\mathcal{E}^{S}(r^{(t)})\,\}_{t,S}$ of reference embeddings, which depends only on the frozen sentence encoder $E_{\phi}$ and can be cached on disk once per training corpus. This makes the approach applicable to closed-source generalist LLMs (GPT-4o, Gemini) for which gradients are unavailable and domain-specific fine-tuning is not possible. Generalist LLMs are known to hallucinate clinically implausible findings, drift off-topic, or produce phrasings that are absent from genuine radiology practice. By preferring, among $K$ stochastic candidates, the one whose sentence embeddings fall into regions of $\mathbb{R}^{d}$ that are closest to the real reports in $\mathcal{T}$, the selection rule in Eq. ([10](https://arxiv.org/html/2606.00440#S3.E10)) exploits the training corpus as a gradient-free prior on clinically plausible output. Although our method adds the cost of sentence embedding and distance calculations, these can be run in parallel without a GPU. We report, in Section [5](https://arxiv.org/html/2606.00440#S5), the performance gain that this simple test-time procedure yields for closed-source generalist LLMs, across every combination of set-distance metric $\mathcal{D}\in\{\mathrm{Chamfer},\,\mathrm{Hausdorff},\,\mathrm{OT},\,\mathrm{POT}\}$ and aggregation in $\{\mathrm{min},\,\mathrm{avg},\,\mathrm{kNN}\}$.
In addition to known set-distance metrics, we derive Hungarian-matching-based set distances with additional modifications for the samples left unmatched by the Hungarian matching, denoted $\mathrm{Hung\text{-}NN}$ and $\mathrm{Hung\text{-}Pen}$. The details are provided in Appendix [A](https://arxiv.org/html/2606.00440#A1).
## 4Experiments
Throughout the experiments we use the MIMIC\-CXR and ReXGradient datasets Johnson et al\. \([2024](https://arxiv.org/html/2606.00440#bib.bib31)\); Zhang et al\. \([2025a](https://arxiv.org/html/2606.00440#bib.bib32)\)\. We first fine\-tune \(SFT\) and then post\-train three vision–language models, Qwen3\-VL\-2B, Qwen3\-VL\-4B and Gemma3\-4B Yang et al\. \([2025](https://arxiv.org/html/2606.00440#bib.bib34)\); Team et al\. \([2024](https://arxiv.org/html/2606.00440#bib.bib33)\), on the chest X\-ray report generation task with GRPO under the reward configurations we described\. Throughout this section we abbreviate them as: SFT \(the supervised fine\-tuned checkpoint without GRPO\); $R_{\mathrm{fmt}}$ \(the format reward alone, i\.e\. the binary template indicator of Sec\.[3](https://arxiv.org/html/2606.00440#S3)\); $R_{\mathrm{exact}}$ \(format and a discrete exact\-match accuracy reward\); and our two proposed set\-based semantic rewards combined with the format reward, $R_{\mathrm{Cham}}$ and $R_{\mathrm{Haus}}$ \(Chamfer and Hausdorff set distances respectively, defined in App\.[A](https://arxiv.org/html/2606.00440#A1)\)\. Tabs\.[1](https://arxiv.org/html/2606.00440#S4.T1) and [2](https://arxiv.org/html/2606.00440#S4.T2) report mean test\-set scores over 5 random seeds on headline metrics covering embedding\-based \(BERTScore F1, COMET\) and clinical \(RadGraph averaged F1, CheXbert macro F1\) families on the Findings section\. The final Mean block in each table averages each \(reward, metric\) cell across the models in that dataset\. On ReXGradient, the Chamfer\-based reward yields the highest performance across all trained models and all evaluation metrics\. On MIMIC\-CXR, both the Chamfer\- and Hausdorff\-based rewards reach the highest scores\. These results suggest that set\-distance rewards, applied on top of supervised fine\-tuning, consistently improve report generation\.
Additional results, including Impression\-section results and further evaluation metrics, are provided in App\.[D](https://arxiv.org/html/2606.00440#A4)\. The experimental setup is described in App\.[C](https://arxiv.org/html/2606.00440#A3)\.
Table 1: GRPO post\-training results on ReXGradient \(Findings\)\. Results for different reward functions are provided\. Chamfer\-distance\-based rewarding outperforms all the other methods on average for all the evaluation metrics\.

Table 2: GRPO post\-training results on MIMIC\-CXR \(Findings\)\. Results for different reward functions are provided\. Hausdorff\-distance\-based rewarding has the highest performance on average\.
## 5Inference\-time response selection
Beyond the GRPO post\-training results of Sec\.[4](https://arxiv.org/html/2606.00440#S4), we also evaluate the inference\-time response selection pipeline of Sec\.[3\.3](https://arxiv.org/html/2606.00440#S3.SS3) on 13 models: our GRPO\-fine\-tuned Qwen3\-VL\-4B variants and the closed\-source generalist LLMs Mistral\-Small, Gemini Flash\-Lite, GPT\-4o mini and GPT\-5 mini\. For each closed\-source model we evaluate two distinct prompt templates, denoted \[p1\] and \[p2\] in the tables; the verbatim prompts are reproduced in App\.[B](https://arxiv.org/html/2606.00440#A2)\. For every test sample we draw multiple generations from the model and select one with each \(distance metric, aggregation\) combination, using the distance metrics of App\.[A](https://arxiv.org/html/2606.00440#A1)\. Selected responses are scored on a suite of NLP and clinical metrics and compared against a random\-selection baseline averaged over multiple runs\.
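The selection step described above can be sketched as follows\. This is a minimal illustration under stated assumptions, not the authors' implementation: `select_response`, `toy_chamfer` and `mean` are hypothetical names, the toy embeddings are scalars rather than sentence\-transformer vectors, and the aggregation over training references \(here `min` or a mean\) is assumed, since the exact aggregations are defined in Sec\. 3\.3\.

```python
def toy_chamfer(A, B):
    """Symmetric Chamfer distance on sets of scalars (toy stand-in for embeddings)."""
    to_b = sum(min(abs(a - b) for b in B) for a in A) / len(A)
    to_a = sum(min(abs(a - b) for a in A) for b in B) / len(B)
    return 0.5 * (to_a + to_b)


def mean(values):
    """Arithmetic mean of an iterable (assumed aggregation, for illustration)."""
    values = list(values)
    return sum(values) / len(values)


def select_response(candidate_sets, reference_sets, set_distance, aggregate=min):
    """Return the index of the candidate closest to the cached reference sets.

    Each candidate is scored by aggregating its set distance to every
    reference set; the lowest aggregated distance wins.
    """
    scores = [
        aggregate(set_distance(cand, ref) for ref in reference_sets)
        for cand in candidate_sets
    ]
    best = min(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Hypothetical cached training references and K = 3 candidate generations.
references = [[0.0, 1.0], [0.1, 0.9]]
candidates = [[0.5], [0.0, 1.0], [2.0, 3.0]]

best_min, scores_min = select_response(candidates, references, toy_chamfer, aggregate=min)
best_mean, scores_mean = select_response(candidates, references, toy_chamfer, aggregate=mean)
```

Because the reference sets are fixed, the only per\-sample cost is one distance evaluation per \(candidate, reference\) pair, which is why caching the reference embeddings matters in practice\.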
Table 3: Headline results \(Findings\)\. For every model and each of five clinically meaningful NLP metrics we report the best score obtained by any \(distance metric, aggregation\) selection policy\. The matched random\-selection baseline is shown in italics under each model row, and the percentage improvement over random is given in parentheses\. Bold marks the best value per column\.

Tab\.[3](https://arxiv.org/html/2606.00440#S5.T3) summarises the largest improvements over random response selection across all combinations of distance metric and aggregation method\. With a suitable metric and aggregation method, choosing the candidate generation closest to the training distribution improves performance substantially over random selection\. Analogous results for the Impression section are reported in App\.[E](https://arxiv.org/html/2606.00440#A5) \(Table [23](https://arxiv.org/html/2606.00440#A5.T23)\), along with per\-model breakdowns for every evaluation metric\. In addition, a method $\times$ metric heatmap visualising the same data \(Fig\.[3](https://arxiv.org/html/2606.00440#A5.F3)\) is deferred to App\.[E](https://arxiv.org/html/2606.00440#A5)\.
### 5\.1Distance\-guided pruning of generations
The inference\-time selection rule of Sec\.[3\.3](https://arxiv.org/html/2606.00440#S3.SS3) requires generating all $K$ candidates to completion before any of them can be scored against the training distribution, so the per\-test\-image generation cost is exactly $K$ times the cost of a single decoding\. We now describe an extension that uses the same training\-distribution distance, but applied to partial generations during decoding, to prune unpromising candidates before they are completed and thereby reduce total token usage\.
Concretely, all $K$ candidates first generate their opening sentence in lock\-step\. From the second sentence onwards, before each new sentence is decoded, we encode every still\-active candidate $\hat{y}^{(k)}$ into its sentence\-embedding sets $\mathcal{E}^{S}\bigl(\hat{y}^{(k)}_{:t}\bigr)$ over the first $t$ generated sentences and score it against the training distribution by $\mathfrak{D}\bigl(\hat{y}^{(k)}\bigr)$ \(Eq\. \([9](https://arxiv.org/html/2606.00440#S3.E9)\), evaluated on the partial output\)\. The bottom\-scoring fraction $\rho$ of the active candidates is then dropped from further decoding, so their token generation simply stops\. Decoding continues sentence by sentence with this prune\-then\-decode loop until a single surviving candidate remains, which is then decoded to its end\-of\-sequence token and returned as the selected response\. In our experiments, we set the pruning fraction $\rho=0.5$\.
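The prune\-then\-decode loop can be sketched as below\. This is a minimal sketch under assumptions, not the authors' Alg\. 1: `next_sentence` and `score` are hypothetical callables standing in for the generator and for the training\-distribution distance, the number of candidates dropped per round is assumed to be $\lfloor \rho \cdot |\text{active}| \rfloor$ with at least one drop, and a candidate that reaches end\-of\-sequence early is assumed to stay in the pool with its final text\.

```python
import math


def prune_then_decode(num_candidates, next_sentence, score, rho=0.5):
    """Sentence-level pruning of candidate generations.

    next_sentence(k, sentences) -> next sentence of candidate k, or None at EOS.
    score(sentences) -> distance of a partial candidate to the training
    distribution (lower is better).
    Assumes every candidate produces at least one sentence.
    """
    # All candidates decode their opening sentence in lock-step.
    active = {k: [next_sentence(k, [])] for k in range(num_candidates)}
    finished = set()

    while len(active) > 1:
        # Rank partial candidates from best (smallest distance) to worst.
        ranked = sorted(active, key=lambda k: score(active[k]))
        n_drop = max(1, math.floor(rho * len(ranked)))
        for k in ranked[len(ranked) - n_drop:]:
            del active[k]  # pruned candidates stop generating tokens
        # Survivors decode one more sentence each.
        for k in active:
            if k in finished:
                continue
            sentence = next_sentence(k, active[k])
            if sentence is None:
                finished.add(k)
            else:
                active[k].append(sentence)

    winner, sentences = next(iter(active.items()))
    # Decode the single survivor to its end-of-sequence token.
    while winner not in finished:
        sentence = next_sentence(winner, sentences)
        if sentence is None:
            finished.add(winner)
        else:
            sentences.append(sentence)
    return winner, sentences


# Toy generator: each candidate follows a fixed script of sentences.
scripts = {0: ["a", "b", "c"], 1: ["d", "e"], 2: ["f", "g", "h", "i"], 3: ["j"], 4: ["k", "l", "m"]}
toy_distance = {"a": 0.3, "d": 0.1, "f": 0.2, "j": 0.9, "k": 0.5}


def next_sentence(k, sentences):
    script = scripts[k]
    return script[len(sentences)] if len(sentences) < len(script) else None


def score(sentences):
    # Toy score keyed on the opening sentence only.
    return toy_distance[sentences[0]]


winner, report = prune_then_decode(5, next_sentence, score, rho=0.5)
```

With $K=5$ and $\rho=0.5$ under this rounding rule, the pool shrinks as 5, 3, 2, 1, so at most one candidate is ever decoded past its fourth sentence\.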
Because pruned candidates stop producing tokens at the moment they are eliminated, this scheme strictly reduces the total number of generated tokens compared with the full\-generation\-then\-select pipeline of Sec\.[3\.3](https://arxiv.org/html/2606.00440#S3.SS3)\. We compare it on the Findings section against \(i\) random selection among the $K$ candidates and \(ii\) the standard distance\-based selection of Sec\.[3\.3](https://arxiv.org/html/2606.00440#S3.SS3) on complete candidates\. Tab\.[4](https://arxiv.org/html/2606.00440#S5.T4) reports BERTScore F1, RadGraph F1 and CheXbert F1 \(the three headline metrics also used in the GRPO tables\) for every \(model, distance metric\) combination, together with the percentage of tokens saved by pruning relative to full generation\. The full per\-metric breakdown and Impression numbers are in App\.[G](https://arxiv.org/html/2606.00440#A7)\.
As a proof of concept, we apply this pruning policy to five closed\-source LLMs \(Mistral\-Small, Gemini 2\.5 Flash\-Lite, Gemini 3\.1 Flash\-Lite, GPT\-4o mini and GPT\-5 mini\) and report the token savings they would yield in an equivalent open\-weights deployment with the same inference budget\. The procedure is also summarised as Alg\.[1](https://arxiv.org/html/2606.00440#alg1)in App\.[G](https://arxiv.org/html/2606.00440#A7)\.
Pruning requires additional set\-distance calculations each time a sentence is added during generation, which is one of the limitations of our work\. However, the sentence transformer is a much smaller model \(420MB in size\), so saving tokens from the larger generators is still a net gain\. As Tab\.[4](https://arxiv.org/html/2606.00440#S5.T4) shows, pruning beats random selection on all metrics across all models, with average relative improvements of $+12.7\%$, $+17.1\%$ and $+6.2\%$ on BS\-F1, RG\-F1 and CXB\-F1, while saving 42\.1–60\.1% of the generation tokens\. It does not always match the standard full\-generation policy, but the gap on the metrics is small and the token saving is substantial\.
Table 4: Distance\-guided pruning of generations \(Findings\)\. For every \(model, distance\) pair we report the percentage of generation tokens saved by the pruning policy and three headline metrics scored under three selection policies: Random \(uniform random pick among the $K$ stochastic candidates\), Standard \(full\-generation pipeline of Sec\.[3\.3](https://arxiv.org/html/2606.00440#S3.SS3), distance\-based selection on complete candidates\) and Pruning \(distance\-guided early\-pruning during decoding, this work\)\. Bold marks the column\-best of \{random, standard, pruning\} within each metric block\.
## 6Conclusion
We introduced set\-distance rewards as a unified signal for chest X\-ray report generation\. By representing each report as unordered sets of sentence embeddings and scoring generated reports against references with set\-to\-set distances, we obtain a continuous, permutation\-invariant reward\. At training time, GRPO post\-training with our Chamfer\- and Hausdorff\-based rewards consistently outperformed both supervised fine\-tuning and discrete exact\-match GRPO across three vision–language backbones on two datasets\. At inference time, the same family of set distances served as a gradient\-free best\-of\-$N$ selection criterion, improving both our GRPO\-fine\-tuned models and a panel of closed\-source generalist LLMs over a random\-selection baseline\. Finally, evaluated on partial decodings, the same signal supported a sentence\-level pruning policy that retained the quality benefits of full best\-of\-$N$ selection while cutting roughly half of the generated tokens\.
## References
- \[1\] L\. Cao, R\. Chen, Y\. Zou, C\. Peng, H\. Xu, Y\. Wang, W\. Ning, Q\. Chen, M\. Peng, Z\. Chen, et al\. \(2025\) More bang for the buck: process reward modeling with entropy\-driven uncertainty\. arXiv preprint arXiv:2503\.22233\. Cited by: [§2](https://arxiv.org/html/2606.00440#S2.p4.1)\.
- \[2\] M\. Endo, R\. Krishnan, V\. Krishna, A\. Y\. Ng, and P\. Rajpurkar \(2021\) Retrieval\-based chest x\-ray report generation using a pre\-trained contrastive language\-image model\. In Machine learning for health, pp\. 209–219\. Cited by: [§1](https://arxiv.org/html/2606.00440#S1.p2.1)\.
- \[3\] I\. E\. Hamamci, S\. Er, and B\. Menze \(2024\) Ct2rep: automated radiology report generation for 3d medical imaging\. In International Conference on Medical Image Computing and Computer\-Assisted Intervention, pp\. 476–486\. Cited by: [§1](https://arxiv.org/html/2606.00440#S1.p1.1)\.
- \[4\] A\. Johnson, T\. Pollard, R\. Mark, S\. Berkowitz, and S\. Horng \(2024\-07\) MIMIC\-CXR Database\. PhysioNet\. Note: Version 2\.1\.0\. External Links: [Document](https://dx.doi.org/10.13026/4jqj-jw95), [Link](https://doi.org/10.13026/4jqj-jw95)\. Cited by: [§4](https://arxiv.org/html/2606.00440#S4.p1.7)\.
- \[5\] M\. Khalifa, R\. Agarwal, L\. Logeswaran, J\. Kim, H\. Peng, M\. Lee, H\. Lee, and L\. Wang \(2025\) Process reward models that think\. arXiv preprint arXiv:2504\.16828\. Cited by: [§1](https://arxiv.org/html/2606.00440#S1.p4.1), [§2](https://arxiv.org/html/2606.00440#S2.p3.1)\.
- \[6\] Y\. Lai, J\. Zhong, M\. Li, S\. Zhao, Y\. Li, K\. Psounis, and X\. Yang \(2026\) Med\-r1: reinforcement learning for generalizable medical reasoning in vision\-language models\. IEEE Transactions on Medical Imaging\. Cited by: [§2](https://arxiv.org/html/2606.00440#S2.p2.1)\.
- \[7\] C\. Li, K\. Chang, C\. Yang, H\. Wu, W\. Chen, H\. Bansal, L\. Chen, Y\. Yang, Y\. Chen, S\. Chen, et al\. \(2025\) Towards a holistic framework for multimodal llm in 3d brain ct radiology report generation\. Nature Communications 16 \(1\), pp\. 2258\. Cited by: [§1](https://arxiv.org/html/2606.00440#S1.p1.1)\.
- \[8\] C\. Li, J\. Liang, A\. Zeng, X\. Chen, K\. Hausman, D\. Sadigh, S\. Levine, L\. Fei\-Fei, F\. Xia, and B\. Ichter \(2023\) Chain of code: reasoning with a language model\-augmented code emulator\. arXiv preprint arXiv:2312\.04474\. Cited by: [§1](https://arxiv.org/html/2606.00440#S1.p3.1)\.
- \[9\] M\. Li, B\. Lin, Z\. Chen, H\. Lin, X\. Liang, and X\. Chang \(2023\) Dynamic graph enhanced contrastive learning for chest x\-ray report generation\. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp\. 3334–3343\. Cited by: [§1](https://arxiv.org/html/2606.00440#S1.p2.1)\.
- \[10\] H\. Lightman, V\. Kosaraju, Y\. Burda, H\. Edwards, B\. Baker, T\. Lee, J\. Leike, J\. Schulman, I\. Sutskever, and K\. Cobbe \(2024\) Let’s verify step by step\. In International Conference on Learning Representations, Vol\. 2024, pp\. 39578–39601\. Cited by: [§1](https://arxiv.org/html/2606.00440#S1.p4.1)\.
- \[11\] G\. Liu, Y\. Liao, F\. Wang, B\. Zhang, L\. Zhang, X\. Liang, X\. Wan, S\. Li, Z\. Li, S\. Zhang, et al\. \(2021\) Medical\-vlbert: medical visual language bert for covid\-19 ct report generation with alternate learning\. IEEE transactions on neural networks and learning systems 32 \(9\), pp\. 3786–3797\. Cited by: [§1](https://arxiv.org/html/2606.00440#S1.p1.1)\.
- \[12\] G\. Liu, T\. H\. Hsu, M\. McDermott, W\. Boag, W\. Weng, P\. Szolovits, and M\. Ghassemi \(2019\) Clinically accurate chest x\-ray report generation\. In Machine learning for healthcare conference, pp\. 249–269\. Cited by: [§1](https://arxiv.org/html/2606.00440#S1.p2.1)\.
- \[13\] L\. Luo, Y\. Liu, R\. Liu, S\. Phatale, M\. Guo, H\. Lara, Y\. Li, L\. Shu, Y\. Zhu, L\. Meng, et al\. \(2024\) Improve mathematical reasoning in language models by automated process supervision\. arXiv preprint arXiv:2406\.06592\. Cited by: [§1](https://arxiv.org/html/2606.00440#S1.p3.1)\.
- \[14\] J\. H\. Moon, H\. Lee, W\. Shin, Y\. Kim, and E\. Choi \(2022\) Multi\-modal understanding and generation for medical images and text via vision\-language pre\-training\. IEEE Journal of Biomedical and Health Informatics 26 \(12\), pp\. 6070–6080\. Cited by: [§2](https://arxiv.org/html/2606.00440#S2.p1.1)\.
- \[15\] M\. Moor, Q\. Huang, S\. Wu, M\. Yasunaga, Y\. Dalmia, J\. Leskovec, C\. Zakka, E\. P\. Reis, and P\. Rajpurkar \(2023\) Med\-flamingo: a multimodal medical few\-shot learner\. In Machine learning for health \(ML4H\), pp\. 353–367\. Cited by: [§2](https://arxiv.org/html/2606.00440#S2.p1.1)\.
- \[16\] J\. Pan, C\. Liu, J\. Wu, F\. Liu, J\. Zhu, H\. B\. Li, C\. Chen, C\. Ouyang, and D\. Rueckert \(2025\) Medvlm\-r1: incentivizing medical reasoning capability of vision\-language models \(vlms\) via reinforcement learning\. In International Conference on Medical Image Computing and Computer\-Assisted Intervention, pp\. 337–347\. Cited by: [§2](https://arxiv.org/html/2606.00440#S2.p2.1)\.
- \[17\] R\. Rafailov, A\. Sharma, E\. Mitchell, C\. D\. Manning, S\. Ermon, and C\. Finn \(2023\) Direct preference optimization: your language model is secretly a reward model\. Advances in neural information processing systems 36, pp\. 53728–53741\. Cited by: [§1](https://arxiv.org/html/2606.00440#S1.p3.1)\.
- \[18\] N\. Reimers and I\. Gurevych \(2019\-11\) Sentence\-bert: sentence embeddings using siamese bert\-networks\. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing\. External Links: [Link](https://arxiv.org/abs/1908.10084)\. Cited by: [§3\.1](https://arxiv.org/html/2606.00440#S3.SS1.p1.9)\.
- \[19\] J\. Schulman, F\. Wolski, P\. Dhariwal, A\. Radford, and O\. Klimov \(2017\) Proximal policy optimization algorithms\. arXiv preprint arXiv:1707\.06347\. Cited by: [§1](https://arxiv.org/html/2606.00440#S1.p3.1)\.
- \[20\] Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. Li, Y\. Wu, et al\. \(2024\) Deepseekmath: pushing the limits of mathematical reasoning in open language models\. arXiv preprint arXiv:2402\.03300\. Cited by: [§1](https://arxiv.org/html/2606.00440#S1.p3.1), [§2](https://arxiv.org/html/2606.00440#S2.p3.1), [§3](https://arxiv.org/html/2606.00440#S3.p1.1)\.
- \[21\] S\. She, J\. Liu, Y\. Liu, J\. Chen, X\. Huang, and S\. Huang \(2025\) R\-prm: reasoning\-driven process reward modeling\. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp\. 13449–13462\. Cited by: [§2](https://arxiv.org/html/2606.00440#S2.p4.1)\.
- \[22\] G\. Team, T\. Mesnard, C\. Hardin, R\. Dadashi, S\. Bhupatiraju, S\. Pathak, L\. Sifre, M\. Rivière, M\. S\. Kale, J\. Love, et al\. \(2024\) Gemma: open models based on gemini research and technology\. arXiv preprint arXiv:2403\.08295\. Cited by: [§4](https://arxiv.org/html/2606.00440#S4.p1.7)\.
- \[23\] J\. Wu, W\. Deng, X\. Li, S\. Liu, T\. Mi, Y\. Peng, Z\. Xu, Y\. Liu, H\. Cho, C\. Choi, et al\. \(2025\) Medreason: eliciting factual medical reasoning steps in llms via knowledge graphs\. arXiv preprint arXiv:2504\.00993\. Cited by: [§1](https://arxiv.org/html/2606.00440#S1.p3.1), [§2](https://arxiv.org/html/2606.00440#S2.p2.1)\.
- \[24\] A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv, et al\. \(2025\) Qwen3 technical report\. arXiv preprint arXiv:2505\.09388\. Cited by: [§4](https://arxiv.org/html/2606.00440#S4.p1.7)\.
- \[25\] H\. Zhang, P\. Wang, S\. Diao, Y\. Lin, R\. Pan, H\. Dong, D\. Zhang, P\. Molchanov, and T\. Zhang \(2024\) Entropy\-regularized process reward model\. arXiv preprint arXiv:2412\.11006\. Cited by: [§2](https://arxiv.org/html/2606.00440#S2.p4.1)\.
- \[26\] X\. Zhang, J\. N\. Acosta, J\. Miller, O\. Huang, and P\. Rajpurkar \(2025\) Rexgradient\-160k: a large\-scale publicly available dataset of chest radiographs with free\-text reports\. arXiv preprint arXiv:2505\.00228\. Cited by: [§4](https://arxiv.org/html/2606.00440#S4.p1.7)\.
- \[27\] Z\. Zhang, C\. Zheng, Y\. Wu, B\. Zhang, R\. Lin, B\. Yu, D\. Liu, J\. Zhou, and J\. Lin \(2025\) The lessons of developing process reward models in mathematical reasoning\. In Findings of the Association for Computational Linguistics: ACL 2025, pp\. 10495–10516\. Cited by: [§1](https://arxiv.org/html/2606.00440#S1.p4.1)\.
## Appendix ASet\-to\-set distance metrics
All metrics defined below operate on two finite, non\-empty sets $\mathcal{A}=\{\mathbf{a}_{1},\dots,\mathbf{a}_{n}\}$ and $\mathcal{B}=\{\mathbf{b}_{1},\dots,\mathbf{b}_{m}\}$ of unit\-norm sentence embeddings in $\mathbb{R}^{d}$, where the cardinalities $n=|\mathcal{A}|$ and $m=|\mathcal{B}|$ may differ\. Let $d:\mathbb{R}^{d}\times\mathbb{R}^{d}\to[0,1]$ be a base point\-to\-point distance; throughout this work we use the cosine distance

$$d(\mathbf{u},\mathbf{v}) \;=\; \tfrac{1}{2}\left(1 \;-\; \frac{\mathbf{u}^{\top}\mathbf{v}}{\|\mathbf{u}\|_{2}\,\|\mathbf{v}\|_{2}}\right) \;\in\; [0,1], \tag{11}$$

which is well matched to the unit\-normalised outputs of the sentence transformer $E_{\phi}$\. The pairwise cost matrix is $M\in\mathbb{R}^{n\times m}$ with $M_{ij}=d(\mathbf{a}_{i},\mathbf{b}_{j})$, and for the optimal\-transport based metrics we additionally rescale it to the unit interval, $\widetilde{M}=M/\max_{ij}M_{ij}$, so that the resulting transport cost lies in $[0,1]$\. Each metric $\mathcal{D}$ is designed so that $\mathcal{D}(\mathcal{A},\mathcal{A})=0$ and larger values correspond to greater dissimilarity, which translates into the similarity reward $1-\mathcal{D}$ used in Section[3\.2](https://arxiv.org/html/2606.00440#S3.SS2)\.
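As an illustration of Eq\. \(11\) and the cost matrix, the following sketch builds $M$ and its rescaled version for two toy embedding sets\. The function names `cosine_distance`, `cost_matrix` and `rescale` are hypothetical, and the two\-dimensional vectors stand in for real sentence embeddings\.

```python
import math


def cosine_distance(u, v):
    """Half cosine distance in [0, 1], following Eq. (11)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 0.5 * (1.0 - dot / (norm_u * norm_v))


def cost_matrix(A, B):
    """Pairwise cost matrix with M[i][j] = d(a_i, b_j)."""
    return [[cosine_distance(a, b) for b in B] for a in A]


def rescale(M):
    """Divide by the largest entry (the rescaling used for OT-style metrics)."""
    peak = max(max(row) for row in M)
    if peak == 0:
        return [row[:] for row in M]
    return [[x / peak for x in row] for row in M]


# Toy sets: n = 2 generated sentences, m = 3 reference sentences.
A = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]]
M = cost_matrix(A, B)
M_tilde = rescale(M)
```

Identical directions give cost 0, orthogonal directions give 0\.5 and opposite directions give 1, so the factor $\tfrac{1}{2}$ is what keeps every entry inside $[0,1]$\.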
##### Chamfer distance\.
The \(symmetric\) Chamfer distance averages, over each set, the nearest\-neighbour cost to the other set:

$$\mathcal{D}_{\mathrm{Chamfer}}(\mathcal{A},\mathcal{B}) \;=\; \tfrac{1}{2}\left(\frac{1}{n}\sum_{i=1}^{n}\min_{1\leq j\leq m}M_{ij} \;+\; \frac{1}{m}\sum_{j=1}^{m}\min_{1\leq i\leq n}M_{ij}\right). \tag{12}$$

The first term rewards every generated sentence for being close to *some* reference sentence; the second term penalises reference sentences that are not covered by any generated sentence\. Chamfer does not enforce a one\-to\-one correspondence and can re\-use the same target sentence for multiple sources\.
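A minimal sketch of Eq\. \(12\), taking a precomputed cost matrix as a list of rows\. The function name `chamfer_distance` and the toy matrix are illustrative assumptions\.

```python
def chamfer_distance(M):
    """Symmetric Chamfer distance from an n x m cost matrix, as in Eq. (12)."""
    n, m = len(M), len(M[0])
    # Mean nearest-neighbour cost from each row element (generated sentence).
    row_term = sum(min(row) for row in M) / n
    # Mean nearest-neighbour cost from each column element (reference sentence).
    col_term = sum(min(M[i][j] for i in range(n)) for j in range(m)) / m
    return 0.5 * (row_term + col_term)


# Toy cost matrix: 2 generated sentences (rows), 3 reference sentences (columns).
M = [[0.0, 1.0, 0.5],
     [0.5, 0.5, 0.0]]
d_chamfer = chamfer_distance(M)
reward = 1.0 - d_chamfer
```

In this toy case both generated sentences have an exact match, so only the uncovered middle reference sentence contributes, and it is diluted by the averaging over three references\.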
##### Hausdorff distance\.
The \(symmetric\) Hausdorff distance replaces the averages in Eq\. \([12](https://arxiv.org/html/2606.00440#A1.E12)\) with maxima, yielding the worst\-case nearest\-neighbour cost in either direction:

$$\mathcal{D}_{\mathrm{Hausdorff}}(\mathcal{A},\mathcal{B}) \;=\; \max\left\{\max_{1\leq i\leq n}\min_{1\leq j\leq m}M_{ij},\;\max_{1\leq j\leq m}\min_{1\leq i\leq n}M_{ij}\right\}. \tag{13}$$

A single uncovered sentence on either side is enough to dominate the score, which makes Hausdorff a harsher reward than Chamfer: since a maximum is never smaller than the corresponding average, $\mathcal{D}_{\mathrm{Hausdorff}}\geq\mathcal{D}_{\mathrm{Chamfer}}$ always holds\.
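Eq\. \(13\) can be sketched in the same style as a function of the cost matrix\. The name `hausdorff_distance` and the toy matrix are illustrative assumptions\.

```python
def hausdorff_distance(M):
    """Symmetric Hausdorff distance from an n x m cost matrix, as in Eq. (13)."""
    n, m = len(M), len(M[0])
    # Worst nearest-neighbour cost over rows (generated sentences).
    worst_row = max(min(row) for row in M)
    # Worst nearest-neighbour cost over columns (reference sentences).
    worst_col = max(min(M[i][j] for i in range(n)) for j in range(m))
    return max(worst_row, worst_col)


# Toy cost matrix: 2 generated sentences (rows), 3 reference sentences (columns).
M = [[0.0, 1.0, 0.5],
     [0.5, 0.5, 0.0]]
d_hausdorff = hausdorff_distance(M)
```

Here the single uncovered reference sentence \(middle column, nearest cost 0\.5\) sets the whole score, whereas an averaged distance would spread that cost over all sentences\.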
##### Optimal transport \(Wasserstein\) distance\.
We assign the uniform discrete measure to each set, $\mathbf{p}=\tfrac{1}{n}\mathbf{1}_{n}$ and $\mathbf{q}=\tfrac{1}{m}\mathbf{1}_{m}$, and define the transport polytope

$$\Gamma(\mathbf{p},\mathbf{q}) \;=\; \bigl\{\,\gamma\in\mathbb{R}_{\geq 0}^{\,n\times m} \;:\; \gamma\,\mathbf{1}_{m}=\mathbf{p},\;\; \gamma^{\top}\mathbf{1}_{n}=\mathbf{q}\,\bigr\}.$$

The optimal\-transport \(Wasserstein\) distance is then

$$\mathcal{D}_{\mathrm{OT}}(\mathcal{A},\mathcal{B}) \;=\; \min_{\gamma\in\Gamma(\mathbf{p},\mathbf{q})}\;\sum_{i=1}^{n}\sum_{j=1}^{m}\gamma_{ij}\,\widetilde{M}_{ij}. \tag{14}$$

OT recovers a soft, mass\-preserving many\-to\-many alignment between the two sets and penalises systematic differences in their distribution even when individual nearest\-neighbour distances are small\.
##### Sinkhorn \(entropy\-regularised\) distance\.
To obtain a smoother, faster\-to\-compute proxy for OT we add an entropic penalty $H(\gamma)=-\sum_{i,j}\gamma_{ij}\bigl(\log\gamma_{ij}-1\bigr)$:

$$\mathcal{D}^{\,\varepsilon}_{\mathrm{Sinkhorn}}(\mathcal{A},\mathcal{B}) \;=\; \min_{\gamma\in\Gamma(\mathbf{p},\mathbf{q})}\;\sum_{i,j}\gamma_{ij}\,\widetilde{M}_{ij} \;-\; \varepsilon\,H(\gamma),\qquad \varepsilon>0, \tag{15}$$

solved by Sinkhorn–Knopp iterations\. We evaluate three regimes $\varepsilon\in\{0.01,\,0.1,\,0.5\}$: smaller $\varepsilon$ approaches exact OT at higher iteration cost, while larger $\varepsilon$ yields a smoother and more strongly regularised estimate\.
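The Sinkhorn–Knopp iterations for Eq\. \(15\) can be sketched in plain Python as below\. This is an illustrative sketch under assumptions: the names `sinkhorn_plan` and `transport_cost` are hypothetical, the iteration count is fixed rather than chosen by a convergence test, and the returned cost is only the transport term $\sum_{i,j}\gamma_{ij}\widetilde{M}_{ij}$ of the fitted plan, without the $-\varepsilon H(\gamma)$ term\.

```python
import math


def sinkhorn_plan(M, eps=0.1, n_iters=2000):
    """Entropy-regularised transport plan between uniform marginals.

    Alternately rescales rows and columns of the Gibbs kernel exp(-M / eps)
    so that the plan's marginals match p = 1/n and q = 1/m.
    """
    n, m = len(M), len(M[0])
    p = [1.0 / n] * n
    q = [1.0 / m] * m
    K = [[math.exp(-M[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        u = [p[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [q[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


def transport_cost(gamma, M):
    """Transport term sum_ij gamma_ij * M_ij of a given plan."""
    return sum(g * c for g_row, c_row in zip(gamma, M) for g, c in zip(g_row, c_row))


# Toy cost matrix, already in [0, 1] with maximum entry 1.
M = [[0.0, 1.0, 0.5],
     [0.5, 0.5, 0.0]]
gamma = sinkhorn_plan(M, eps=0.1)
cost = transport_cost(gamma, M)
row_sums = [sum(row) for row in gamma]
col_sums = [sum(gamma[i][j] for i in range(2)) for j in range(3)]
```

For very small $\varepsilon$ the kernel entries $\exp(-\widetilde{M}_{ij}/\varepsilon)$ become tiny, so practical implementations usually run the same updates in the log domain\.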
##### Unbalanced optimal transport\.
When $|\mathcal{A}|\neq|\mathcal{B}|$ or when individual sentences on one side genuinely have no counterpart on the other, the hard marginal constraints in Eq\. \([14](https://arxiv.org/html/2606.00440#A1.E14)\) can be inappropriate\. Unbalanced OT relaxes them with soft Kullback–Leibler penalties of strength $\tau>0$:

$$\begin{aligned}\mathcal{D}^{\,\varepsilon,\tau}_{\mathrm{UOT}}(\mathcal{A},\mathcal{B}) \;=\; \min_{\gamma\in\mathbb{R}_{\geq 0}^{\,n\times m}}\; & \sum_{i,j}\gamma_{ij}\,\widetilde{M}_{ij} \;-\; \varepsilon\,H(\gamma) \\ & \;+\; \tau\,\mathrm{KL}\bigl(\gamma\mathbf{1}_{m}\,\big\|\,\mathbf{p}\bigr) \;+\; \tau\,\mathrm{KL}\bigl(\gamma^{\top}\mathbf{1}_{n}\,\big\|\,\mathbf{q}\bigr).\end{aligned} \tag{16}$$

We fix $\varepsilon=0.1$ and report two relaxation regimes $\tau\in\{0.5,\,1.0\}$: smaller $\tau$ tolerates larger discrepancies in set size by allowing more mass to be created or destroyed\.
##### Gromov–Wasserstein distance\.
Gromov–Wasserstein \(GW\) compares the *intrinsic geometry* of the two sets rather than their absolute coordinates: it asks how well the pairwise distances inside $\mathcal{A}$ can be mapped to the pairwise distances inside $\mathcal{B}$\. Let $C^{\mathcal{A}}\in\mathbb{R}^{n\times n}$ and $C^{\mathcal{B}}\in\mathbb{R}^{m\times m}$ be the rescaled intra\-set distance matrices, with $C^{\mathcal{A}}_{ik}=d(\mathbf{a}_{i},\mathbf{a}_{k})\,/\,\max_{i'k'}d(\mathbf{a}_{i'},\mathbf{a}_{k'})$ and $C^{\mathcal{B}}$ analogously\. The \(entropic\) GW distance is

$$\mathcal{D}^{\,\varepsilon}_{\mathrm{GW}}(\mathcal{A},\mathcal{B}) \;=\; \min_{\gamma\in\Gamma(\mathbf{p},\mathbf{q})}\;\sum_{i,j,k,\ell}\bigl(C^{\mathcal{A}}_{ik}-C^{\mathcal{B}}_{j\ell}\bigr)^{2}\,\gamma_{ij}\,\gamma_{k\ell} \;-\; \varepsilon\,H(\gamma), \tag{17}$$

which we solve with $\varepsilon=0.1$\. GW is invariant under isometries of either set and is therefore well suited to comparing reports whose absolute embedding positions may shift while their internal sentence\-to\-sentence structure is preserved\. We treat the case $\min(n,m)<2$, in which the intra\-set geometry is trivial, as undefined and fall back to a zero similarity\.
##### Hungarian matching with nearest\-neighbour fallback\.
In contrast to OT\-style soft alignments, Hungarian matching enforces a strict one\-to\-one correspondence on the smaller set\. Without loss of generality assume $n\leq m$, and let

$$\pi^{*} \;=\; \operatorname*{arg\,min}_{\pi:\{1,\dots,n\}\hookrightarrow\{1,\dots,m\}}\;\sum_{i=1}^{n}M_{i,\pi(i)} \tag{18}$$

be the optimal injective assignment returned by the Hungarian algorithm\. Writing $\mathcal{U}=\{1,\dots,m\}\setminus\pi^{*}(\{1,\dots,n\})$ for the unmatched indices on the larger side, the nearest\-neighbour variant attributes to each unmatched element only the cost of its closest neighbour on the smaller side and normalises by $\max(n,m)$ so that the score is comparable across set sizes:

$$\mathcal{D}_{\mathrm{Hung\text{-}NN}}(\mathcal{A},\mathcal{B}) \;=\; \frac{1}{\max(n,m)}\left[\;\sum_{i=1}^{n}M_{i,\pi^{*}(i)} \;+\; \sum_{j\in\mathcal{U}}\min_{1\leq i\leq n}M_{ij}\;\right]. \tag{19}$$
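Eq\. \(19\) can be illustrated with the sketch below\. Note the swapped\-in technique: instead of the Hungarian algorithm, this sketch finds the optimal injective assignment by brute force over `itertools.permutations`, which gives the same optimum but is only practical for the small sentence counts of a toy example\. The function name `hungarian_nn_distance` is hypothetical\.

```python
from itertools import permutations


def hungarian_nn_distance(M):
    """Assignment distance with nearest-neighbour fallback, as in Eq. (19).

    Brute-force search over injective assignments stands in for the
    Hungarian algorithm (same optimum, exponential cost).
    """
    n, m = len(M), len(M[0])
    if n > m:
        # Transpose so that rows always index the smaller set.
        M = [list(col) for col in zip(*M)]
        n, m = m, n
    best_cost, best_pi = None, None
    for pi in permutations(range(m), n):
        cost = sum(M[i][pi[i]] for i in range(n))
        if best_cost is None or cost < best_cost:
            best_cost, best_pi = cost, pi
    # Unmatched elements on the larger side pay their nearest-neighbour cost.
    unmatched = set(range(m)) - set(best_pi)
    fallback = sum(min(M[i][j] for i in range(n)) for j in unmatched)
    return (best_cost + fallback) / m  # m equals max(n, m) after the transpose


# Toy cost matrix: 2 generated sentences (rows), 3 reference sentences (columns).
M = [[0.0, 1.0, 0.5],
     [0.5, 0.5, 0.0]]
d_hung_nn = hungarian_nn_distance(M)
```

For realistic set sizes a polynomial\-time solver such as SciPy's `scipy.optimize.linear_sum_assignment`, which accepts rectangular cost matrices, would replace the brute\-force loop\.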
##### Hungarian matching with count penalty\.
An alternative treatment of unmatched elements replaces the per\-element nearest\-neighbour cost in Eq\. \([19](https://arxiv.org/html/2606.00440#A1.E19)\) with a fixed penalty $\alpha>0$:

$$\mathcal{D}^{\,\alpha}_{\mathrm{Hung\text{-}Pen}}(\mathcal{A},\mathcal{B}) \;=\; \frac{1}{\min(n,m)}\sum_{i=1}^{\min(n,m)}M_{i,\pi^{*}(i)} \;+\; \alpha\,\bigl|\,n-m\,\bigr|. \tag{20}$$

Here the first term is the mean cost of the matched pairs and the second term penalises any imbalance in cardinality, independently of the semantic quality of the matched sentences\. We evaluate $\alpha\in\{0.1,\,0.5\}$: larger $\alpha$ penalises reports with the wrong number of sentences more aggressively\.
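As a worked instance of Eq\. \(20\), consider a hypothetical report pair with $n=2$, $m=3$ and matched costs $0.1$ and $0.3$ \(these numbers are illustrative, not taken from the experiments\):

```latex
\mathcal{D}^{\,0.1}_{\mathrm{Hung\text{-}Pen}}
  = \tfrac{1}{2}\,(0.1 + 0.3) + 0.1\cdot|2-3| = 0.2 + 0.1 = 0.3,
\qquad
\mathcal{D}^{\,0.5}_{\mathrm{Hung\text{-}Pen}}
  = \tfrac{1}{2}\,(0.1 + 0.3) + 0.5\cdot|2-3| = 0.2 + 0.5 = 0.7 .
```

Because the penalty grows linearly in $|n-m|$, the distance as written is not bounded by 1: with $\alpha=0.5$ and a cardinality gap of three sentences, the penalty term alone is $1.5$, so the similarity $1-\mathcal{D}$ would be negative unless it is clipped\.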
##### Partial optimal transport\.
Partial OT \(POT\) relaxes the mass\-preservation constraint of OT by transporting only a fraction $\rho\in(0,1]$ of the total mass:

$$\mathcal{D}^{\,\rho}_{\mathrm{POT}}(\mathcal{A},\mathcal{B}) \;=\; \min_{\substack{\gamma\in\mathbb{R}_{\geq 0}^{\,n\times m}\\ \gamma\,\mathbf{1}_{m}\leq\mathbf{p}\\ \gamma^{\top}\mathbf{1}_{n}\leq\mathbf{q}\\ \mathbf{1}_{n}^{\top}\,\gamma\,\mathbf{1}_{m}=\rho}}\;\sum_{i,j}\gamma_{ij}\,\widetilde{M}_{ij}. \tag{21}$$

Intuitively, a fraction $1-\rho$ of the mass on each side may be *discarded* at zero cost, capturing the intuition that not every generated sentence requires a counterpart in the reference \(and vice versa\)\. We use three settings: an adaptive $\rho=\min(n,m)/\max(n,m)$, which is the largest ratio that avoids forcing spurious matches under asymmetric cardinalities, and two fixed values $\rho\in\{0.5,\,0.8\}$\.
##### Summary\.
Together, these nine families of metrics cover three complementary views of set similarity\.*Nearest\-neighbour*metrics \(Chamfer, Hausdorff\) measure local coverage and are cheap to compute but ignore one\-to\-one constraints\.*Transport\-based*metrics \(OT, Partial OT\) explicitly model how mass must be moved between the two sets and naturally handle continuous, asymmetric or geometrically structured discrepancies\.*Assignment\-based*metrics \(Hungarian\-NN, Hungarian\-Pen\) commit to a discrete one\-to\-one correspondence on the smaller set and treat cardinality mismatches with an explicit penalty\. We compare all of them as candidate semantic rewards in our GRPO fine\-tuning experiments\.
## Appendix BPrompts used for closed\-source LLM evaluations
For each closed\-source model evaluated in the response\-selection experiments \(Mistral\-Small, Gemini 2\.5 Flash\-Lite, Gemini 3\.1 Flash\-Lite, GPT\-4o mini and GPT\-5 mini\), we draw $K$ stochastic completions per test sample from the API under one of two prompt templates, denoted \[p1\] and \[p2\] in the result tables\. The two prompts are reproduced verbatim below; both enforce the same output schema \(Findings: <text\> followed by Impression: <text\>\), but \[p2\] additionally provides five in\-context exemplars\.
### B\.1Prompt 1 \(zero\-shot\)
Prompt 1: zero\-shot

You are a radiology report generation model specialized in chest X\-rays\.

Generate a concise clinical report for the given image\.

Strict requirements:

- Output ONLY in this exact format: `Findings: <text\>` `Impression: <text\>`
- Do NOT include explanations, disclaimers, or any extra text\.
- Do NOT include phrases such as ‘‘consult a doctor’’ or ‘‘this is not medical advice’’\.
- Use professional radiology language\.
- Keep it concise and structured\.

Output:
### B\.2Prompt 2 \(five\-shot\)
Prompt 2: five\-shot

You are a radiology report generation model\. Given a chest X\-ray image, generate a concise radiology report\.

Follow these strict rules:

- Output ONLY in this format: `Findings: <text\>` `Impression: <text\>`
- Do NOT include any explanations, disclaimers, or additional commentary\.
- Do NOT say things like ‘‘consult a doctor’’ or ‘‘this is not medical advice’’\.
- Match the writing style, tone, and structure of the examples below\.
- Be concise and clinically accurate\.

Here are example reports:

Example 0

`Findings:` Mild cardiomegaly\. No edema\. No consolidation or effusion\. No pneumothorax\.

`Impression:` Mild cardiomegaly\.

Example 1

`Findings:` No pneumonia is seen\. Minimal peribronchial thickening is noted\. The heart is within normal limits in size\. No bony abnormality is seen\.

`Impression:` No pneumonia\. Mild peribronchial thickening\.

Example 2

`Findings:` The heart size and mediastinal contours are within normal limits\. Both lungs are clear\. The visualized skeletal structures are unremarkable\.

`Impression:` No active cardiopulmonary disease\.

Example 3

`Findings:` The lungs are well\-expanded\. The interstitial markings are increased bilaterally\. Patchy areas of confluence are noted in the mid to lower left lung and at the right lung base\. The heart and pulmonary vascularity are normal\. The mediastinum is normal in width\. There is multilevel degenerative disc disease of the thoracic spine\.

`Impression:` Bilateral interstitial pneumonia with patchy areas of alveolar infiltrate\. No pulmonary edema\. No pleural effusion\. Followup PA and lateral chest X\-ray is recommended in 3\-\-4 weeks following trial of antibiotic therapy to ensure resolution and exclude underlying malignancy\.

Example 4

`Findings:` The heart size and mediastinal contours are within normal limits\. Both lungs are clear\. The visualized skeletal structures are unremarkable\.

`Impression:` No active disease\.

Now generate the report for the given chest X\-ray image\.

Output:
## Appendix CExperimental setup
### C\.1Datasets
For MIMIC\-CXR we use 179,778 training samples and 45,364 validation samples\. For ReXGradient we use all \(image, report\) pairs, which gives 238,968 training and 17,007 validation samples, following the split provided with the original dataset\.
### C\.2SFT \- Parameters
- GPU: 1xH100
- Effective batch size: 96
- Optimiser: AdamW
- Learning rate: 1e\-4
- Number of workers: 8
### C\.3GRPO post\-training \- Parameters
- GPU: 2xH100 Nvidia GPUs
- Effective batch size: 64
- Optimiser: AdamW
- Learning rate: 2e\-5
- Group size $G$: 8
- Number of workers: 8
- Reward weights: $\lambda_{\mathrm{fmt}}=1$, $\lambda_{\mathrm{sem}}=1$
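To make the role of the reward weights and the group size concrete, the sketch below combines a format reward with a set\-distance similarity and then normalises rewards within one GRPO group\. It rests on assumptions that should be checked against Sec\. 3: the total reward is taken to be the weighted sum $\lambda_{\mathrm{fmt}}R_{\mathrm{fmt}}+\lambda_{\mathrm{sem}}(1-\mathcal{D})$, the function names are hypothetical, and the group\-relative advantage uses the population standard deviation \(implementations differ on this detail\)\.

```python
import statistics


def combined_reward(follows_template, set_distance, lambda_fmt=1.0, lambda_sem=1.0):
    """Weighted sum of a binary format reward and the similarity 1 - D (assumed form)."""
    r_fmt = 1.0 if follows_template else 0.0
    r_sem = 1.0 - set_distance
    return lambda_fmt * r_fmt + lambda_sem * r_sem


def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: rewards standardised within one sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# One hypothetical group of G = 8 sampled reports: (follows template?, set distance).
group = [(True, 0.10), (True, 0.25), (False, 0.20), (True, 0.40),
         (True, 0.15), (False, 0.50), (True, 0.30), (True, 0.05)]
rewards = [combined_reward(ok, dist) for ok, dist in group]
advantages = group_advantages(rewards)
```

Because the set\-distance term is continuous, reports in the same group rarely tie, which gives the group normalisation a non\-degenerate spread to work with even when all samples follow the template\.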
### C\.4Inference\-time response selection
For every test image we draw $K=5$ stochastic generations from the target policy at its default temperature\. The training\-corpus reference embeddings $\{\mathcal{E}^{S}(r^{(t)})\}_{t,S}$ are pre\-computed once per dataset for $N=5000$ randomly selected samples and cached on disk, which speeds up the sample\-to\-training\-distribution distance calculations\. Moreover, we limit the number of test samples in the response\-selection experiments to $N=1000$ to limit API costs\.
## Appendix D Full GRPO post\-training results
This appendix reports, for every \(dataset, model, report section\), the mean and sample standard deviation across 5 random seeds for every NLP metric we evaluated\. Bold marks the column\-best reward function within each table\.
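As a small illustration of how one table cell and the bold marker could be derived, the sketch below computes the mean and the sample standard deviation (denominator $n-1$) over per-seed scores and picks the reward function with the highest mean. The helper names and the reward-function keys used when calling them are hypothetical, and it assumes that higher is better for every metric.

```python
import statistics


def summarise_seeds(scores):
    """Return (mean, sample std) of one metric across random seeds."""
    # statistics.stdev uses the n-1 denominator, i.e. the sample standard deviation.
    return statistics.mean(scores), statistics.stdev(scores)


def column_best(cells):
    """Name of the reward function with the highest mean score.

    `cells` maps a reward-function name to its list of per-seed scores.
    """
    return max(cells, key=lambda name: statistics.mean(cells[name]))
```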
### D\.1 Headline tables for the Impression section
We first reproduce, for the Impression section, the same four\-metric headline tables shown in the main paper for the Findings section \(Tabs\. [1](https://arxiv.org/html/2606.00440#S4.T1) and [2](https://arxiv.org/html/2606.00440#S4.T2)\)\. Each row pair reports the mean and sample standard deviation over 5 random seeds; the final *Mean* block averages each \(reward, metric\) cell across the models in that dataset\.

Table 5: GRPO post\-training results on ReXGradient \(Impression\)\. Mean and sample std over 5 random seeds\. The final *Mean* block averages across models\. Bold = column\-best reward function per row\.

Table 6: GRPO post\-training results on MIMIC\-CXR \(Impression\)\. Mean and sample std over 5 random seeds\. The final *Mean* block averages across models\. Bold = column\-best reward function per row\.
### D\.2 ReXGradient
#### D\.2\.1 Findings
Table 7: GRPO post\-training results on ReXGradient \(Findings\)\. BLEU \(sentence & corpus, $n = 1 \dots 4$\)\. Mean and sample std over 5 random seeds\. The final *Mean* block averages each \(reward, metric\) cell across the three models\. Bold = column\-best reward function per row\.

Table 8: GRPO post\-training results on ReXGradient \(Findings\)\. ROUGE \(F1 for $n = 1$, $n = 2$, $L$\)\. Mean and sample std over 5 random seeds\. The final *Mean* block averages each \(reward, metric\) cell across the three models\. Bold = column\-best reward function per row\.

Table 9: GRPO post\-training results on ReXGradient \(Findings\)\. Embedding\-based / lexical \(METEOR, BERTScore F1, COMET, ChrF\+\+\)\. Mean and sample std over 5 random seeds\. The final *Mean* block averages each \(reward, metric\) cell across the three models\. Bold = column\-best reward function per row\.

Table 10: GRPO post\-training results on ReXGradient \(Findings\)\. Clinical \(RadGraph, CheXbert\)\. Mean and sample std over 5 random seeds\. The final *Mean* block averages each \(reward, metric\) cell across the three models\. Bold = column\-best reward function per row\.
#### D\.2\.2 Impression
Table 11: GRPO post\-training results on ReXGradient \(Impression\)\. BLEU \(sentence & corpus, $n = 1 \dots 4$\)\. Mean and sample std over 5 random seeds\. The final *Mean* block averages each \(reward, metric\) cell across the three models\. Bold = column\-best reward function per row\.

Table 12: GRPO post\-training results on ReXGradient \(Impression\)\. ROUGE \(F1 for $n = 1$, $n = 2$, $L$\)\. Mean and sample std over 5 random seeds\. The final *Mean* block averages each \(reward, metric\) cell across the three models\. Bold = column\-best reward function per row\.

Table 13: GRPO post\-training results on ReXGradient \(Impression\)\. Embedding\-based / lexical \(METEOR, BERTScore F1, COMET, ChrF\+\+\)\. Mean and sample std over 5 random seeds\. The final *Mean* block averages each \(reward, metric\) cell across the three models\. Bold = column\-best reward function per row\.

Table 14: GRPO post\-training results on ReXGradient \(Impression\)\. Clinical \(RadGraph, CheXbert\)\. Mean and sample std over 5 random seeds\. The final *Mean* block averages each \(reward, metric\) cell across the three models\. Bold = column\-best reward function per row\.
### D\.3 MIMIC\-CXR
#### D\.3\.1 Findings
Table 15: GRPO post\-training results on MIMIC\-CXR \(Findings\)\. BLEU \(sentence & corpus, $n = 1 \dots 4$\)\. Mean and sample std over 5 random seeds\. The final *Mean* block averages each \(reward, metric\) cell across the three models\. Bold = column\-best reward function per row\.

Table 16: GRPO post\-training results on MIMIC\-CXR \(Findings\)\. ROUGE \(F1 for $n = 1$, $n = 2$, $L$\)\. Mean and sample std over 5 random seeds\. The final *Mean* block averages each \(reward, metric\) cell across the three models\. Bold = column\-best reward function per row\.

Table 17: GRPO post\-training results on MIMIC\-CXR \(Findings\)\. Embedding\-based / lexical \(METEOR, BERTScore F1, COMET, ChrF\+\+\)\. Mean and sample std over 5 random seeds\. The final *Mean* block averages each \(reward, metric\) cell across the three models\. Bold = column\-best reward function per row\.

Table 18: GRPO post\-training results on MIMIC\-CXR \(Findings\)\. Clinical \(RadGraph, CheXbert\)\. Mean and sample std over 5 random seeds\. The final *Mean* block averages each \(reward, metric\) cell across the three models\. Bold = column\-best reward function per row\.
#### D\.3\.2 Impression
Table 19: GRPO post\-training results on MIMIC\-CXR \(Impression\)\. BLEU \(sentence & corpus, $n = 1 \dots 4$\)\. Mean and sample std over 5 random seeds\. The final *Mean* block averages each \(reward, metric\) cell across the three models\. Bold = column\-best reward function per row\.

Table 20: GRPO post\-training results on MIMIC\-CXR \(Impression\)\. ROUGE \(F1 for $n = 1$, $n = 2$, $L$\)\. Mean and sample std over 5 random seeds\. The final *Mean* block averages each \(reward, metric\) cell across the three models\. Bold = column\-best reward function per row\.

Table 21: GRPO post\-training results on MIMIC\-CXR \(Impression\)\. Embedding\-based / lexical \(METEOR, BERTScore F1, COMET, ChrF\+\+\)\. Mean and sample std over 5 random seeds\. The final *Mean* block averages each \(reward, metric\) cell across the three models\. Bold = column\-best reward function per row\.

Table 22: GRPO post\-training results on MIMIC\-CXR \(Impression\)\. Clinical \(RadGraph, CheXbert\)\. Mean and sample std over 5 random seeds\. The final *Mean* block averages each \(reward, metric\) cell across the three models\. Bold = column\-best reward function per row\.
## Appendix E Full selection\-results breakdown
This appendix reports, for every model in our experiments and for every NLP metric we computed, the absolute score of every \(distance metric, aggregation\) selection policy alongside the random\-selection baseline \(13 runs total\)\. We first reproduce the headline\-summary table for the Impression section that complements Tab\. [3](https://arxiv.org/html/2606.00440#S5.T3) of the main paper \(Findings\); the cross\-model overall ranking and per\-model breakdowns follow\.

Table 23: Headline results \(Impressions\)\. For every model and each of five clinically meaningful NLP metrics we report the best score obtained by any \(distance metric, aggregation\) selection policy\. The matched random\-selection baseline is shown in italics under each model row, and the percentage improvement over random is given in parentheses\. Bold marks the best value per column\.

### E\.1 Visualisation of the headline data
Fig\. [3](https://arxiv.org/html/2606.00440#A5.F3) renders the same data as Tab\. [3](https://arxiv.org/html/2606.00440#S5.T3) \(and its Impression counterpart Tab\. [23](https://arxiv.org/html/2606.00440#A5.T23)\) as a method $\times$ metric heatmap\.


Figure 3: Method $\times$ metric heatmap\. Mean percentage improvement over random selection averaged across all 13 models, on five clinically meaningful metrics\. Rows are \(distance metric, aggregation\) pairs grouped by distance\-metric family; columns are NLP metrics\. Teal cells beat random, coral cells lose to it\.
## Appendix F Stratified clinical\-meaningfulness analysis
This appendix reports, for every model, every \(distance metric, aggregation\) selection policy, every ground\-truth\-defined stratum and every per\-sample metric, both the absolute score and the difference vs\. the random baseline of the same stratum\. Joined\-report \(CheXbert macro F1 and per\-pathology F1\) tables follow at the end\. Random / oracle / naive baselines are banded for reference \(random is computed on the same per\-sample metrics used here, which are pooled across the joined Findings $\cup$ Impression report\)\.
### F\.1 Stratified tables \(four metrics\)
The five tables in this subsection report the per\-model best\-policy view and the matched percent\-improvement\-over\-random tables across all patients and within each ground\-truth\-defined stratum, on the four main metrics \(BS\-F1, ROUGE\-L F1, METEOR and RG\-F1\)\. Per\-model and per\-pathology breakdowns follow\.
Table 24: Stratified results\. Each row pair shows, for one model, the score of its best selection policy and the matched random baseline \(italic\), broken down by ground\-truth abnormality stratum \(no\-finding / single finding / multiple findings; $n_A = 504$, $n_B = 206$, $n_C = 290$\)\. The best policy is the \(distance, aggregation\) combination that maximises mean improvement over random across all \(stratum $\times$ metric\) cells\. Subscripts: % improvement over random in that stratum\. RG\-F1 is per\-sample RadGraph entity F1; the corpus\-level relation\-aware variant reported in Tab\. [3](https://arxiv.org/html/2606.00440#S5.T3) is not defined per\-sample\.

Table 25: All patients\. Mean $\pm$ std percentage improvement over random selection per \(distance metric, aggregation\) policy\. Bold: column\-best mean\.

Table 26: No\-Finding stratum\. Mean $\pm$ std percentage improvement over random selection per \(distance metric, aggregation\) policy\. Bold: column\-best mean\.

Table 27: Single\-finding stratum\. Mean $\pm$ std percentage improvement over random selection per \(distance metric, aggregation\) policy\. Bold: column\-best mean\.

Table 28: Multi\-finding stratum\. Mean $\pm$ std percentage improvement over random selection per \(distance metric, aggregation\) policy\. Bold: column\-best mean\.
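The stratified percent-improvement figures can be illustrated with a short sketch. It assumes, hypothetically, that the improvement is computed from the mean per-sample score of the selection policy and of the random baseline within each stratum, and that the denominator is the absolute value of the baseline so the sign stays meaningful when a score such as BERTScore F1 is negative. Neither choice is confirmed by the source, and the function names are illustrative.

```python
import numpy as np


def pct_improvement(policy_mean: float, random_mean: float) -> float:
    """Relative improvement over the random baseline, in percent (assumed definition)."""
    return 100.0 * (policy_mean - random_mean) / abs(random_mean)


def stratified_improvement(policy_scores, random_scores, strata):
    """Percent improvement over random for all samples and within each stratum."""
    policy = np.asarray(policy_scores, dtype=float)
    random = np.asarray(random_scores, dtype=float)
    labels = np.asarray(strata)
    out = {"all": pct_improvement(policy.mean(), random.mean())}
    for name in sorted(set(labels.tolist())):
        mask = labels == name
        out[name] = pct_improvement(policy[mask].mean(), random[mask].mean())
    return out
```

A positive entry corresponds to a cell in which the selection policy beats random selection. The "all" entry pools every sample, so strata are implicitly weighted by their sizes.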
### F\.2 Heatmaps of percent improvement over random \(all patients\)
For the two main\-paper clinical metrics \(BERTScore F1 and RadGraph F1\) we visualise, in Figs\. [4](https://arxiv.org/html/2606.00440#A6.F4) and [5](https://arxiv.org/html/2606.00440#A6.F5), the same per\-\(model, distance\) values that feed Tab\. [25](https://arxiv.org/html/2606.00440#A6.T25), without the cross\-model averaging\. The aggregation is fixed to Avg throughout\. Teal cells indicate that the selection policy beats random selection for that \(model, distance\) combination; coral cells indicate the opposite\.

Figure 4: Percent improvement over random \(all patients\) – BERTScore F1 with Avg aggregation\. Cell value = the per\-\(model, distance\) percentage improvement over random selection on BERTScore F1, computed across the entire test set with the Avg aggregation\. Rows are distance metrics; columns are models\. Teal cells beat random, coral cells lose to it\. Companion to Tab\. [25](https://arxiv.org/html/2606.00440#A6.T25) \(which averages each column across the same per\-model values\)\.

Figure 5: Percent improvement over random \(all patients\) – RadGraph F1 with Avg aggregation\. Cell value = the per\-\(model, distance\) percentage improvement over random selection on RadGraph F1, computed across the entire test set with the Avg aggregation\. Rows are distance metrics; columns are models\. Teal cells beat random, coral cells lose to it\. Companion to Tab\. [25](https://arxiv.org/html/2606.00440#A6.T25) \(which averages each column across the same per\-model values\)\.
## Appendix G Full pruning results
This appendix complements Tab\. [4](https://arxiv.org/html/2606.00440#S5.T4) of the main paper with every NLP and clinical metric we evaluated, for both the Findings and Impression sections, under the same three selection policies \(Random, Standard, Pruning\) and across every \(model, distance metric\) combination\.
### G\.1 Pruning procedure
Alg\. [1](https://arxiv.org/html/2606.00440#alg1) summarises the distance\-guided pruning policy of Sec\. [5\.1](https://arxiv.org/html/2606.00440#S5.SS1) in pseudocode\. The procedure samples the opening sentence of $K$ candidates in lock\-step, then alternates a “decode\-one\-sentence / score against training distribution / drop the worst\-scoring fraction $p$ of survivors” loop until a single candidate remains; that survivor is decoded to its end\-of\-sequence token and returned\.
Algorithm 1: Distance\-guided pruning of stochastic generations\.

1: Input: X\-ray $x$; generative policy $\pi$; training corpus $\mathcal{T} = \{r^{(t)}\}_{t=1}^{N}$; candidate budget $K$; pruning fraction $p \in (0,1)$

2: Output: selected response $\hat{y}^{\star}$

3: $\mathcal{C} \leftarrow$ sample the first sentence of $K$ candidates from $\pi(\cdot \mid x)$

4: $t \leftarrow 1$ $\triangleright$ number of fully decoded sentences per candidate

5: while $|\mathcal{C}| > 1$ do

6: $\quad t \leftarrow t + 1$

7: $\quad$ for each active candidate $\hat{y}^{(k)} \in \mathcal{C}$ do

8: $\quad\quad$ decode the $t$\-th sentence of $\hat{y}^{(k)}$ from $\pi$

9: $\quad\quad$ form partial sentence\-embedding sets $\mathcal{E}^{S}\bigl(\hat{y}^{(k)}_{:t}\bigr)$ for $S \in \{F, I\}$

10: $\quad\quad$ compute the partial\-output score $\mathfrak{D}\bigl(\hat{y}^{(k)}_{:t}\bigr)$ $\triangleright$ Eq\. \([9](https://arxiv.org/html/2606.00440#S3.E9)\)

11: $\quad$ end for

12: $\quad$ drop the highest\-scoring $\lceil p\,|\mathcal{C}| \rceil$ candidates from $\mathcal{C}$

13: end while

14: $\hat{y}^{\star} \leftarrow$ decode the surviving candidate to its end\-of\-sequence token

15: return $\hat{y}^{\star}$
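The loop of Alg. 1 can be sketched in Python as below. This is a minimal illustration under stated assumptions rather than the released implementation: the policy is abstracted as a hypothetical callable `next_sentence(prefix)` that returns the next sentence or `None` at end-of-sequence, and the partial-output score as a hypothetical callable `score(prefix)` where lower means closer to the training distribution. Two details go beyond the pseudocode and are assumptions of this sketch: a candidate that finishes early is kept and scored as is, and at least one candidate always survives a pruning round.

```python
import math
from typing import Callable, List, Optional

Prefix = List[str]


def _extend(cand: dict, next_sentence: Callable[[Prefix], Optional[str]]) -> None:
    """Decode one more sentence for a candidate, or mark it finished at end-of-sequence."""
    if cand["done"]:
        return
    nxt = next_sentence(cand["text"])
    if nxt is None:
        cand["done"] = True
    else:
        cand["text"].append(nxt)


def prune_generate(next_sentence: Callable[[Prefix], Optional[str]],
                   score: Callable[[Prefix], float],
                   K: int, p: float, max_sentences: int = 32) -> Prefix:
    """Distance-guided pruning: returns the sentences of the surviving candidate."""
    if K < 1 or not 0.0 < p < 1.0:
        raise ValueError("need K >= 1 and 0 < p < 1")
    # Step 3: sample the first sentence of K candidates.
    cands = [{"text": [], "done": False} for _ in range(K)]
    for c in cands:
        _extend(c, next_sentence)
    # Steps 5-13: decode one sentence each, score, drop the worst ceil(p * |C|).
    while len(cands) > 1:
        for c in cands:
            _extend(c, next_sentence)
        cands.sort(key=lambda c: score(c["text"]))  # ascending distance
        n_drop = min(math.ceil(p * len(cands)), len(cands) - 1)  # guard: keep one
        cands = cands[: len(cands) - n_drop]
    # Step 14: decode the survivor to its end-of-sequence marker.
    best = cands[0]
    while not best["done"] and len(best["text"]) < max_sentences:
        _extend(best, next_sentence)
    return best["text"]
```

Because every round removes at least one candidate, the loop ends after at most $K - 1$ rounds. The token savings come from the dropped candidates never being decoded past the round in which they were pruned.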
Table 29: Distance\-guided pruning of generations \(Findings\)\. For every \(model, distance\) pair we report the percentage of generation tokens saved by the pruning policy and three headline metrics scored under three selection policies: Random \(uniform random pick among the $K$ stochastic candidates\), Standard \(full\-generation pipeline of Sec\. [3\.3](https://arxiv.org/html/2606.00440#S3.SS3), distance\-based selection on complete candidates\) and Pruning \(distance\-guided early\-pruning during decoding, this work\)\. Bold marks the column\-best of \{random, standard, pruning\} within each metric block\.
## Appendix H Qualitative examples
We provide a set of qualitative examples that illustrate when the distance\-to\-training\-distribution selection rule \(used at inference time for closed\-source LLMs and exploratory ablations\) picks a candidate that is also the closest to the ground truth report in BERTScore\-F1\. The cases below are drawn from the chest\-X\-ray report\-generation task on ReXGradient\. For each example we show the input image, the ground\-truth report, the candidate selected by the distance rule, and one of the rejected candidates that scored lower in BERTScore\-F1 against the ground truth\.

Model: Qwen3\-VL\-4B GRPO $R_{\mathrm{exact}}$
Distance metric: Partial OT \($\rho = 0.5$\)
Aggregation: min
Number of candidates: $K = 10$

Each panel below shows a candidate report with its BERTScore\-F1 against the joined ground\-truth report and its distance to the training distribution under the chosen distance metric and aggregation\.

Ground truth report
Findings: The lungs are well\-aerated and clear\. There is no evidence of focal opacification, pleural effusion or pneumothorax\. The heart is normal in size; the mediastinal contour is within normal limits\. No acute osseous abnormalities are seen\.
Impression: No acute cardiopulmonary process seen\.

Selected candidate \(BERTScore\-F1: 0\.903, Dist: 2\.94e\-12\)
Findings: The lungs are well\-aerated and clear\. There is no evidence of focal opacification, pleural effusion or pneumothorax\. The heart is normal in size; the mediastinal contour is within normal limits\. No acute osseous abnormalities are seen\.
Impression: 1\. No acute cardiopulmonary process seen\. 2\. No displaced rib fractures identified\.

Rejected candidate \#1 \(BERTScore\-F1: 0\.211, Dist: 0\.872\)
Findings: The lungs are well expanded and demonstrate no evidence of focal airspace disease or pneumothorax\. The cardiopericardial silhouette is enlarged\. The Port\-A\-Cath tip is in the superior vena cava\. The patient is rotated on the frontal view\. The visualized portions of the bowel gas pattern are grossly unremarkable\.
Impression: 1\. Cardiomegaly without evidence of acute cardiopulmonary disease\. 2\. Port\-A\-Cath tip in the superior vena cava\.

Rejected candidate \#3 \(BERTScore\-F1: 0\.113, Dist: 0\.689\)
Findings: Umbilical venous catheter tip projects at the inferior T9 level, within the inferior right atrium\. Orogastric tube tip projects over the gastric bubble\. Lung volumes are low\. No focal airspace consolidation or pulmonary edema\. No pleural effusion or pneumothorax\. Normal cardiomediastinal silhouette\. The bowel gas pattern is nonobstructive\. No radiopaque calculi\.
Impression: 1\. Umbilical venous catheter tip projects at the inferior T9 level, within the inferior right atrium\. Consider retraction of 1 cm\. 2\. Low lung volumes\.

Rejected candidate \#2 \(BERTScore\-F1: 0\.044, Dist: 0\.707\)
Findings: Umbilical artery catheter tip projects at the T6 level\. Umbilical vein catheter tip projects at the T7 level\. Both lungs are clear\. The visualized skeletal structures are unremarkable\.
Impression: 1\. Umbilical artery catheter tip projects at the T6 level\. Umbilical vein catheter tip projects at the T7 level\. 2\. No acute cardiopulmonary disease\.

Figure 6: The selection rule picks the candidate with the lowest distance to the training distribution; that candidate also has the highest BERTScore\-F1 against the ground truth\. The three rejected alternatives shown were among the candidates with a larger distance to the training distribution under the chosen \(metric, agg\) pair, and they correspondingly score lower in BERTScore\-F1\.
Model: Gemini 2\.5 Flash\-Lite
Distance metric: Hungarian \+ Pen \($\alpha = 0.5$\)
Aggregation: avg
Number of candidates: $K = 5$

Each panel below shows a candidate report with its BERTScore\-F1 against the joined ground\-truth report and its distance to the training distribution under the chosen distance metric and aggregation\.

Ground truth report
Findings: Cardiac silhouette is normal in size and configuration\. Changes from the recent CABG surgery are noted\. No mediastinal widening\. No mediastinal or hilar masses or evidence adenopathy\. Lungs are clear\. No pleural effusion or pneumothorax\. Skeletal structures are intact\.
Impression: 1\. No active cardiopulmonary disease\. 2\. Postsurgical atelectasis and effusions have resolved since the prior exam\.

Selected candidate \(BERTScore\-F1: 0\.364, Dist: 1\.91\)
Findings: Surgical clips and wires are noted anteriorly, likely related to sternal closure\. The lungs are clear\. The cardiac silhouette and mediastinal contours are within normal limits\. No pleural effusion or pneumothorax is identified\. Visualized bones are unremarkable\.
Impression: Post\-sternotomy changes\. No acute cardiopulmonary abnormality\.

Rejected candidate \#1 \(BERTScore\-F1: 0\.216, Dist: 2\.23\)
Findings: Multiple sternal wires are identified anteriorly\. The lungs are clear\. The cardiomediastinal silhouette is normal\.
Impression: Post\-sternotomy wires\. No acute cardiopulmonary abnormality\.

Rejected candidate \#2 \(BERTScore\-F1: 0\.142, Dist: 2\.06\)
Findings: Lateral view of the chest shows a metallic surgical wire wrapping around the sternum, consistent with sternal fixation\. The lungs are clear\. The heart and mediastinum appear within normal limits for this projection\. The visualized spine shows degenerative changes\.
Impression: Post\-sternotomy wires\. No acute cardiopulmonary process is identified\.

Rejected candidate \#3 \(BERTScore\-F1: 0\.222, Dist: 1\.98\)
Findings: Multiple surgical wires are noted along the anterior sternum\. The lungs are clear\. The heart and mediastinal contours are within normal limits\. No acute bony abnormality is identified\.
Impression: Post\-sternotomy wires\. No acute cardiopulmonary process\.

Figure 7: The selection rule picks the candidate with the lowest distance to the training distribution; that candidate also has the highest BERTScore\-F1 against the ground truth\. The three rejected alternatives shown were among the candidates with a larger distance to the training distribution under the chosen \(metric, agg\) pair, and they correspondingly score lower in BERTScore\-F1\.
Model: Gemini 2\.5 Flash\-Lite
Distance metric: Hausdorff
Aggregation: min
Number of candidates: $K = 5$

Each panel below shows a candidate report with its BERTScore\-F1 against the joined ground\-truth report and its distance to the training distribution under the chosen distance metric and aggregation\.

Ground truth report
Findings: Examination limited by positioning and technique\. The costophrenic sulci are excluded from the field of view\. The cardiac silhouette is enlarged\. There are congestive changes bilaterally\. It is likely pleural fluid bilaterally as well\. Bony thorax is intact\.
Impression: Limited examination\. CHF pattern\. Enlarged cardiac silhouette, congestive changes, and probable effusions\.

Selected candidate \(BERTScore\-F1: 0\.203, Dist: 0\.599\)
Findings: Diffuse bilateral airspace opacities are noted, more prominent at the bases, consistent with pulmonary edema\. The cardiac silhouette is enlarged\. There is a right pleural effusion\. The visualized bony structures are unremarkable\.
Impression: Pulmonary edema, likely cardiogenic\. Right pleural effusion\.

Rejected candidate \#3 \(BERTScore\-F1: 0\.078, Dist: 0\.77\)
Findings: There are diffuse bilateral opacities, most prominent in the lower lung zones, suggestive of airspace disease\. The cardiomediastinal silhouette is enlarged and indistinct, potentially related to underlying disease or effusion\. A nasogastric tube is present, terminating in the stomach\.
Impression: Diffuse bilateral airspace disease, etiology uncertain, but concerning for pneumonia or ARDS\. Cardiomegaly and possible pericardial effusion\.

Rejected candidate \#1 \(BERTScore\-F1: 0\.029, Dist: 0\.827\)
Findings: Diffuse bilateral opacities are present throughout the lungs, more pronounced at the bases\. The cardiac silhouette is enlarged\. Subcutaneous air is noted in the right axilla and upper chest wall\.
Impression: Diffuse pneumonia or ARDS\. Cardiomegaly\. Subcutaneous emphysema\.

Rejected candidate \#2 \(BERTScore\-F1: \-0\.033, Dist: 0\.778\)
Findings: Diffuse bilateral airspace opacities, more prominent in the lower lobes\. cardiomegaly\. The visualized left upper extremity demonstrates a malpositioned central venous catheter\.
Impression: Diffuse bilateral airspace disease, concerning for pneumonia or acute respiratory distress syndrome\. Cardiomegaly\.
Figure 8: The selection rule picks the candidate with the lowest distance to the training distribution; that candidate also has the highest BERTScore\-F1 against the ground truth\. The three rejected alternatives shown were among the candidates with a larger distance to the training distribution under the chosen \(metric, agg\) pair, and they correspondingly score lower in BERTScore\-F1\.