Discrete Diffusion Language Models for Interactive Radiology Report Drafting

arXiv cs.AI 07/03/26, 04:00 AM Papers
diffusion-models language-models radiology medical-ai report-generation interactive-drafting
Summary
This paper adapts a diffusion language model for interactive radiology report drafting, showing it matches autoregressive models in accuracy while offering unique infill capabilities that allow radiologists to fix report fragments and have the model fill in the text between them.
arXiv:2607.01436v1 Announce Type: new Abstract: Diffusion language models, which generate text by denoising a token canvas bidirectionally instead of emitting tokens left to right, have become competitive with autoregressive (AR) generation. Medical foundation models, however, remain almost entirely autoregressive. We adapt a mixture-of-experts diffusion language model, DiffusionGemma-26B, and benchmark it against its same-size AR sibling Gemma-4-26B under an identical LoRA recipe on medical visual question answering datasets, scored by a verbosity-robust LLM judge. Diffusion matches or exceeds AR on all of them, and the finetuned model (3.8B active) is competitive with frontier vision-language models; its decoding is also 3.5-4.4x faster. Beyond this parity, the diffusion model offers a drafting capability AR lacks: any-order infill. Because the canvas is denoised bidirectionally, a radiologist can fix report fragments and have the model fill the text between them, an operation inherent to diffusion but not to autoregression, which is subpar at it. This suits real reports, which are often terse or inconsistent across clinicians and institutions.
# Discrete Diffusion Language Models for Interactive Radiology Report Drafting
Source: [https://arxiv.org/html/2607.01436](https://arxiv.org/html/2607.01436)
Max Van Puyvelde\* (1, 2), H. Ibrahim Gulluk\* (3), Wim Van Criekinge† (2), Olivier Gevaert† (1)

maxvpuyv@stanford.edu, gulluk@stanford.edu, wim.vancriekinge@ugent.be, ogevaert@stanford.edu

(1) Department of Biomedical Data Science, Stanford University School of Medicine; (2) Department of Mathematical Modelling, Statistics & Bioinformatics, Ghent University; (3) Department of Electrical Engineering, Stanford University

###### Abstract

Diffusion language models, which generate text by denoising a token canvas bidirectionally instead of emitting tokens left to right, have become competitive with autoregressive (AR) generation. Medical foundation models, however, remain almost entirely autoregressive. We adapt a mixture-of-experts diffusion language model, DiffusionGemma-26B, and benchmark it against its same-size AR sibling Gemma-4-26B under an identical LoRA recipe on medical visual question answering datasets, scored by a verbosity-robust LLM judge. Diffusion matches or exceeds AR on all of them, and the finetuned model (3.8B active) is competitive with frontier vision-language models; its decoding is also 3.5–4.4× faster. Beyond this parity, the diffusion model offers a drafting capability AR lacks: any-order infill. Because the canvas is denoised bidirectionally, a radiologist can fix report fragments and have the model fill the text between them, an operation inherent to diffusion but not to autoregression, which is subpar at it. This suits real reports, which are often terse or inconsistent across clinicians and institutions.

\*Joint first authors. †Joint senior authors.

## 1 Introduction

Autoregressive (AR) generation, which produces text one token at a time from left to right, underlies nearly all large language and vision-language models. Discrete diffusion language models \[[1](https://arxiv.org/html/2607.01436#bib.bib4),[19](https://arxiv.org/html/2607.01436#bib.bib5),[18](https://arxiv.org/html/2607.01436#bib.bib3)\] are a recent alternative: they generate a sequence by iteratively denoising a fixed token canvas, with each position attending to the entire canvas rather than only to preceding tokens. On general text these models are competitive with autoregressive models of comparable size \[[18](https://arxiv.org/html/2607.01436#bib.bib3),[22](https://arxiv.org/html/2607.01436#bib.bib6)\], which makes them a plausible backbone for domains that have so far relied on autoregression. One open instance, DiffusionGemma-26B \[[6](https://arxiv.org/html/2607.01436#bib.bib1)\], couples this denoising decoder with a native multimodal encoder and belongs to a model family that also includes a same-size autoregressive model, Gemma-4-26B \[[5](https://arxiv.org/html/2607.01436#bib.bib2)\]; the two share size, family, and lineage, and differ chiefly in their generative paradigm.

Existing medical foundation models, however, are almost exclusively autoregressive. Radiology report generation (RRG), the task of drafting a report from an image, is dominated by AR models \[[12](https://arxiv.org/html/2607.01436#bib.bib11),[2](https://arxiv.org/html/2607.01436#bib.bib12),[25](https://arxiv.org/html/2607.01436#bib.bib13),[9](https://arxiv.org/html/2607.01436#bib.bib16),[7](https://arxiv.org/html/2607.01436#bib.bib17),[8](https://arxiv.org/html/2607.01436#bib.bib18)\], as are medical vision-language assistants \[[15](https://arxiv.org/html/2607.01436#bib.bib23)\]. Whether a diffusion language model is viable as a medical foundation model, both accurate enough and useful in the clinical workflow, is largely untested. A few diffusion models already generate CXR reports \[[23](https://arxiv.org/html/2607.01436#bib.bib8),[17](https://arxiv.org/html/2607.01436#bib.bib9),[4](https://arxiv.org/html/2607.01436#bib.bib10)\], but produce complete reports only and do not address interactive drafting.

We finetune both the diffusion model and its autoregressive sibling on paired image-text data from medical visual-question-answering datasets. The LoRA recipe is identical for both (same backbone size, vision tower, LoRA targets, and data), so the generative paradigm is the only deliberate variable. We then benchmark the two against each other and against frontier vision-language models with a verbosity-robust LLM judge.

Beyond accuracy, the two paradigms differ in what they can be conditioned on\. Reporting practice varies: negative and normal findings are stated explicitly in some settings and omitted in others, and section conventions differ across institutions\. A tool that completes or normalizes a report around content the radiologist has already entered, at arbitrary positions, is therefore a useful drafting operation\. Because a diffusion decoder denoises the whole canvas bidirectionally, it can fill such a gap from the fixed text on both sides\. An autoregressive model, conditioning each token only on preceding text, cannot: a fragment fixed after the gap cannot inform the text filled before it\. We call this any\-order infill\.

We make three contributions. (i) A diffusion language model is a competitive medical foundation model. DiffusionGemma-26B equals or exceeds its autoregressive sibling on medical VQA and rivals frontier vision-language models while decoding 3.5–4.4× faster ([Section 4.1](https://arxiv.org/html/2607.01436#S4.SS1), [Section 4.2](https://arxiv.org/html/2607.01436#S4.SS2)), in a matched comparison that varies only the generative paradigm. (ii) Any-order infill is a conditioning capability inherent to diffusion. We cast infill as sampling a report given fragments fixed at arbitrary positions ([Section 3.3](https://arxiv.org/html/2607.01436#S3.SS3)) and show on MIMIC-CXR that the diffusion model exploits context on both sides of a gap far more effectively than its autoregressive sibling ([Section 4.3](https://arxiv.org/html/2607.01436#S4.SS3)). (iii) We release our code and finetuned checkpoints (code: [https://github.com/mxvp/discrete\_diffusion\_RRG](https://github.com/mxvp/discrete_diffusion_RRG); checkpoints: [https://huggingface.co/gevaertlab/diffusiongemma-radiology-vqa](https://huggingface.co/gevaertlab/diffusiongemma-radiology-vqa)).

## 2 Related Work

#### Diffusion for medical RRG and infill\.

RRG is dominated by autoregressive models such as MAIRA \[[12](https://arxiv.org/html/2607.01436#bib.bib11),[2](https://arxiv.org/html/2607.01436#bib.bib12)\] and ReXrank \[[25](https://arxiv.org/html/2607.01436#bib.bib13)\]. Discrete diffusion \[[1](https://arxiv.org/html/2607.01436#bib.bib4),[19](https://arxiv.org/html/2607.01436#bib.bib5),[18](https://arxiv.org/html/2607.01436#bib.bib3)\] denoises a token canvas bidirectionally, and several systems apply it to CXR report generation: *AnchorDiff* \[[23](https://arxiv.org/html/2607.01436#bib.bib8)\] (vision-conditioned LLaDA-8B, claimed as the first masked diffusion for RRG), *MeDiM* \[[17](https://arxiv.org/html/2607.01436#bib.bib9)\] (unified any-to-any generation), and ECHO \[[4](https://arxiv.org/html/2607.01436#bib.bib10)\] (one-step distillation). All use bidirectionality only to improve full generation, and none isolate the paradigm against a matched autoregressive backbone or expose interactive infill. Generic diffusion infill is established \[[22](https://arxiv.org/html/2607.01436#bib.bib6),[10](https://arxiv.org/html/2607.01436#bib.bib7)\] but not framed as clinical drafting, and existing interactive report tools condition on a region \[[20](https://arxiv.org/html/2607.01436#bib.bib15)\] or a prefix \[[21](https://arxiv.org/html/2607.01436#bib.bib14)\], not on fragments fixed at arbitrary positions.

#### Medical VQA and LLM\-as\-judge\.

VQA-RAD \[[14](https://arxiv.org/html/2607.01436#bib.bib20)\], SLAKE \[[16](https://arxiv.org/html/2607.01436#bib.bib21)\], and VQA-Med \[[3](https://arxiv.org/html/2607.01436#bib.bib22)\] pair radiology images with short open- and closed-ended questions. Because exact-match scoring penalizes valid paraphrases, open-ended medical VQA is now evaluated with an LLM judge \[[15](https://arxiv.org/html/2607.01436#bib.bib23),[26](https://arxiv.org/html/2607.01436#bib.bib24)\], which we adopt ([Section 4.1](https://arxiv.org/html/2607.01436#S4.SS1)).

## 3 Method

### 3.1 Matched Backbones

We compare diffusion and autoregression with everything else held fixed. The diffusion model is DiffusionGemma-26B \[[6](https://arxiv.org/html/2607.01436#bib.bib1)\], a discrete diffusion language model, and its AR sibling is Gemma-4-26B \[[5](https://arxiv.org/html/2607.01436#bib.bib2)\]; both are mixture-of-experts models with 25.2B total and 3.8B active parameters and a SigLIP-lineage \[[24](https://arxiv.org/html/2607.01436#bib.bib26)\] vision encoder (∼280 image tokens). We adapt each backbone with low-rank adaptation (LoRA) \[[11](https://arxiv.org/html/2607.01436#bib.bib25)\]: rank-64 updates ($\alpha = 128$) on the attention and shared-MLP projections, with the 128 experts, the router, and the vision tower frozen. The experts hold most of the weights, so adapting only the shared projections updates the model at a small fraction of the cost of a full finetune, and the identical recipe and data across both backbones leave the generative paradigm as the only deliberate variable. The optimizer is the lone exception: each paradigm keeps the AdamW settings established for its objective, since a shared one underfits one of the two losses. Full hyperparameters are in [Appendix A](https://arxiv.org/html/2607.01436#A1).
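To make the rank-64, $\alpha = 128$ setting concrete, the following is a minimal sketch of a generic LoRA-adapted linear layer in the standard formulation of Hu et al., not the authors' training code. The frozen weight is left untouched and the trainable update is scaled by $\alpha / r$, which is 2 for the values quoted above. The function names and the 4096-wide projection in the usage lines are hypothetical, chosen only for illustration.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_linear(x, w_frozen, lora_a, lora_b, alpha):
    """Compute W x + (alpha / r) * B (A x).

    w_frozen: d_out x d_in, never updated.
    lora_a:   r x d_in, trainable.
    lora_b:   d_out x r, trainable (zero at initialization in standard LoRA).
    """
    rank = len(lora_a)
    scale = alpha / rank
    base = matvec(w_frozen, x)
    delta = matvec(lora_b, matvec(lora_a, x))
    return [b + scale * d for b, d in zip(base, delta)]


def trainable_fraction(d_in, d_out, rank):
    """Share of a d_out x d_in projection's parameter count that LoRA trains."""
    return rank * (d_in + d_out) / (d_in * d_out)


# Scaling factor for the recipe quoted in the text (rank 64, alpha 128).
print(128 / 64)
# Hypothetical 4096 x 4096 projection; the real layer widths are not given here.
print(trainable_fraction(4096, 4096, 64))
```

With `lora_b` at zero the layer reproduces the frozen projection exactly, which is why adaptation starts from the pretrained behaviour.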

### 3.2 Image-Conditioned Adaptation

We condition on the image and diffuse the text target; the image is never generated. Both paradigms are supervised only on the target tokens, with the image and prompt held fixed, and share the same target string: the report (Findings and Impression) for drafting and infill, or a short answer for VQA. A full report fits in one 256-token canvas, so intra-report attention is bidirectional end to end, which any-order infill requires.

Each paradigm is finetuned with its standard supervised objective, the only difference between the two runs. The diffusion model uses the uniform-state dLLM objective \[[1](https://arxiv.org/html/2607.01436#bib.bib4),[6](https://arxiv.org/html/2607.01436#bib.bib1)\]: a random fraction of the target tokens is replaced with uniform draws from the vocabulary (random tokens rather than a \[MASK\] symbol), and the model is trained to recover the originals. The AR sibling uses next-token cross-entropy on the same targets.
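The uniform-state corruption described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' data pipeline: the corruption rate is drawn uniformly per example and applied independently per position, and the name `corrupt_uniform` is hypothetical. How the recovery loss is weighted across positions is not specified in this section and is left out.

```python
import random


def corrupt_uniform(target, vocab_size, rng):
    """Replace a random fraction of target tokens with uniform vocabulary draws.

    Returns the noisy canvas and a per-position flag marking replaced tokens.
    Unlike masked diffusion, a replaced token is an ordinary vocabulary id,
    so the canvas itself does not reveal which positions were corrupted.
    """
    rate = rng.random()  # assumed: corruption rate sampled uniformly in [0, 1)
    noisy, corrupted = [], []
    for token in target:
        if rng.random() < rate:
            noisy.append(rng.randrange(vocab_size))  # may coincide with the original
            corrupted.append(True)
        else:
            noisy.append(token)
            corrupted.append(False)
    return noisy, corrupted


rng = random.Random(0)
target_ids = [5, 6, 7, 8, 9, 10]  # toy token ids standing in for a report
noisy_ids, flags = corrupt_uniform(target_ids, vocab_size=50, rng=rng)
print(noisy_ids, flags)
```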

### 3.3 Any-Order Infill

Infill fixes part of the report and has the model fill the rest, conditioned on what is fixed. A radiologist who leaves a gap in a draft, for instance, fixes the text on either side of it. Let $F$ be the fixed positions, $\mathbf{a}$ the tokens placed there, and $\bar{F}$ the positions left to fill; infill samples the free positions from the model's conditional

$$\mathbf{x}_{\bar{F}} \,\sim\, p_{\theta}\big(\mathbf{x}_{\bar{F}} \,\mid\, \mathbf{x}_{F} = \mathbf{a},\, c\big), \tag{1}$$

its report distribution restricted to outputs that carry $\mathbf{a}$ at $F$ (here $c$ is the conditioning input, i.e. the image and prompt of Section 3.2). A diffusion decoder samples this conditional directly, with no retraining: at each denoising step we re-impose $\mathbf{x}_{F} = \mathbf{a}$, before the update so the model predicts the free positions while seeing the fixed ones, and after it so they survive the step's re-randomization. Because attention within the canvas is bidirectional, a free position conditions on fixed tokens to its right as much as to its left, so the gap is filled from context on both sides.
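The clamp-before, clamp-after loop just described can be sketched with a stand-in denoiser. This is not the authors' sampler (which the paper gives in its Appendix C); `infill` and `right_to_left_toy` are hypothetical names, and the toy denoiser simply sets each position to one less than its right neighbour so that the effect of right-side context is visible.

```python
import random


def infill(canvas_len, fixed, denoise_step, vocab_size, num_steps, rng):
    """Sample a canvas while holding the tokens in `fixed` (position -> id) in place."""

    def clamp(canvas):
        clamped = list(canvas)
        for position, token in fixed.items():
            clamped[position] = token
        return clamped

    # Start from uniform noise over the vocabulary.
    canvas = [rng.randrange(vocab_size) for _ in range(canvas_len)]
    for step in range(num_steps):
        canvas = clamp(canvas)               # before: the model sees the fixed tokens
        canvas = denoise_step(canvas, step)  # the update may overwrite any position
        canvas = clamp(canvas)               # after: the fixed tokens survive the step
    return canvas


def right_to_left_toy(canvas, step):
    """Toy stand-in for a denoiser: each position becomes (right neighbour - 1)."""
    updated = list(canvas)
    for i in range(len(canvas) - 1):
        updated[i] = (canvas[i + 1] - 1) % 100
    return updated


rng = random.Random(0)
filled = infill(canvas_len=5, fixed={4: 9}, denoise_step=right_to_left_toy,
                vocab_size=100, num_steps=4, rng=rng)
print(filled)  # [5, 6, 7, 8, 9]
```

Only the last position is fixed, yet every earlier position ends up determined by it. A strictly left-to-right decoder has no comparable route for information to flow from a later fixed token to earlier free ones.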

An autoregressive model factors left to right, $p_{\theta}(\mathbf{x} \mid c) = \prod_{i} p_{\theta}(x_{i} \mid x_{<i}, c)$, and cannot sample this conditional: a token never sees the positions after it, so text fixed after the gap cannot shape the fill before it. The paradigms differ in what they can be conditioned on, not in writing quality; [Section 4.3](https://arxiv.org/html/2607.01436#S4.SS3) measures this difference, and [Appendix C](https://arxiv.org/html/2607.01436#A3) gives the sampler.

## 4 Experiments

We evaluate the adapted backbones three ways: medical VQA accuracy, inference speed, and any\-order infill\.

### 4.1 Medical VQA

We compare diffusion and AR on medical VQA and place both against frontier vision-language models (VLMs). We evaluate on three medical-VQA datasets: VQA-RAD \[[14](https://arxiv.org/html/2607.01436#bib.bib20)\], SLAKE \[[16](https://arxiv.org/html/2607.01436#bib.bib21)\], and VQA-Med-2019 \[[3](https://arxiv.org/html/2607.01436#bib.bib22)\] (sizes in [Appendix B](https://arxiv.org/html/2607.01436#A2)), each pairing an image and question with a short open or closed answer.

We evaluate each backbone on every dataset both zero-shot (*base*) and after per-dataset finetuning (*finetuned*), and compare against three frontier VLMs (Gemini-3.5-Flash, GPT-4.1-mini, and Claude-Sonnet-4.6). Finetuning adapts the backbone with LoRA on the dataset ([Appendix A](https://arxiv.org/html/2607.01436#A1)). The frontier models are run zero-shot in a single forward pass, without extended reasoning, and every model answers the same 350 held-out questions per dataset.

We score with an LLM judge. Standard exact-match accuracy is unsuitable for a cross-model comparison here: base and frontier models answer in full sentences and score near zero regardless of correctness ([Fig. 1](https://arxiv.org/html/2607.01436#S4.F1)). We therefore score semantic correctness: a fixed judge (Claude Sonnet 4.6) returns a binary semantic-equivalence verdict per (question, reference, answer) triple, allowing paraphrase and added explanation \[[15](https://arxiv.org/html/2607.01436#bib.bib23),[26](https://arxiv.org/html/2607.01436#bib.bib24)\], the standard for open-ended medical VQA.
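A small sketch illustrates why exact match breaks down on verbose answers. The normalization rule and the verbose answer string are assumptions made for illustration, and the judge itself is not reproduced: `judge_accuracy` only averages binary verdicts that an external judge is assumed to have already returned.

```python
import string


def normalize(text):
    """Lowercase, strip punctuation, and collapse whitespace."""
    stripped = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(stripped.split())


def exact_match(answer, reference):
    """True only when the normalized strings are identical."""
    return normalize(answer) == normalize(reference)


def judge_accuracy(verdicts):
    """Mean of binary semantic-equivalence verdicts (one per question)."""
    return sum(verdicts) / len(verdicts)


reference = "pancreatic ductal adenocarcinoma"
terse = "Pancreatic ductal adenocarcinoma."
verbose = "The CT scan shows a pancreatic ductal adenocarcinoma."  # hypothetical answer

print(exact_match(terse, reference))    # True
print(exact_match(verbose, reference))  # False, although the content is correct
print(judge_accuracy([True, False, True, True]))  # 0.75
```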

[Table 1](https://arxiv.org/html/2607.01436#S4.T1) reports LLM-judge accuracy for all models, and [Fig. 2](https://arxiv.org/html/2607.01436#S4.F2) plots it.

![Refer to caption](https://arxiv.org/html/2607.01436v1/figures/qual/vqa_vqamed_4.jpg)

Q. what is abnormal in the ct scan? GT. pancreatic ductal adenocarcinoma

Figure 1: A medical-VQA example (VQA-Med). Every model's answer to "what is abnormal in the CT scan?" (reference: *pancreatic ductal adenocarcinoma*), with the LLM judge's verdict (✓ correct, × incorrect). Base and frontier models reply in full sentences that exact-match scoring would reject regardless of content; here only the finetuned diffusion model answers correctly.

Table 1: LLM-judge accuracy (Claude Sonnet 4.6, semantic equivalence), n = 350 items per dataset. *diff*/*AR* are our two backbones (DiffusionGemma / Gemma-4), evaluated zero-shot (*base*) and after per-dataset finetuning. Frontier VLMs (Gemini-3.5-Flash, GPT-4.1-mini, Claude-Sonnet-4.6) are zero-shot, single forward pass. Bold: best per dataset, *separately* among our models and among the frontier VLMs. †Claude-Sonnet-4.6 is also the judge model.

![Refer to caption](https://arxiv.org/html/2607.01436v1/x1.png)

Figure 2: LLM-judge accuracy (Claude Sonnet 4.6). (a) Base vs. finetuned, for diffusion and AR. (b) The finetuned 26B model (3.8B active) against three frontier non-reasoning VLMs. †Claude-Sonnet-4.6 is the judge model.

#### Finetuning.

Finetuning improves LLM-judge accuracy for both paradigms, most on SLAKE: +0.163 for diffusion (0.700 → 0.863) and +0.143 for AR (0.674 → 0.817). AR gains +0.126 on VQA-RAD, and the VQA-Med gains are marginal. Base diffusion already reaches 0.61–0.70.

#### Diffusion versus AR\.

Finetuned diffusion equals or exceeds finetuned AR on the judge metric for all three datasets, and base diffusion exceeds base AR on all three ([Table 1](https://arxiv.org/html/2607.01436#S4.T1)). On per-item McNemar tests over the judge verdicts (n = 350), the difference is significant on SLAKE finetuned (+0.046, p = 0.026) and VQA-RAD base (+0.091, p < 0.001); the other four diffusion-vs.-AR comparisons are not significant. The difference is concentrated on closed (yes/no) questions, where the answer format is irrelevant (e.g. on VQA-RAD finetuned, closed-question accuracy is 0.825 for diffusion vs. 0.757 for AR). That a uniform-state denoising model matches its next-token sibling at equal scale, on questions that turn on fine-grained image grounding, indicates that the diffusion paradigm is a viable substrate for a medical foundation model, on which the infill capability of [Section 4.3](https://arxiv.org/html/2607.01436#S4.SS3) builds.
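For readers unfamiliar with the per-item McNemar test, here is a minimal sketch of its exact (binomial) form applied to paired binary verdicts. Which variant the authors used (exact or chi-square, with or without continuity correction) is not stated here, so this is one reasonable reading, and the verdict lists in the usage lines are toy data, not results from the paper.

```python
from math import comb


def mcnemar_exact(verdicts_a, verdicts_b):
    """Two-sided exact McNemar test on paired binary outcomes.

    Only discordant pairs matter: b counts items model A got right and
    model B got wrong, c counts the reverse. Under the null hypothesis the
    discordant items split as Binomial(b + c, 0.5).
    """
    b = sum(1 for x, y in zip(verdicts_a, verdicts_b) if x and not y)
    c = sum(1 for x, y in zip(verdicts_a, verdicts_b) if y and not x)
    n = b + c
    if n == 0:
        return b, c, 1.0
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy data: model A is right on 10 items where model B is wrong, never the reverse.
model_a = [True] * 10 + [True] * 5
model_b = [False] * 10 + [True] * 5
print(mcnemar_exact(model_a, model_b))  # (10, 0, 0.001953125)
```

Items on which both models agree contribute nothing, which is why two models with similar overall accuracy can still differ significantly, or not, depending on how their errors overlap.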

#### Frontier VLMs\.

The finetuned 26B model (3.8B active) is competitive with the three frontier VLMs ([Table 1](https://arxiv.org/html/2607.01436#S4.T1); [Fig. 2](https://arxiv.org/html/2607.01436#S4.F2)b). Finetuned diffusion has the highest judge accuracy on SLAKE (0.863); Gemini-3.5-Flash is highest on VQA-RAD (0.777) and VQA-Med (0.683). Finetuned diffusion exceeds GPT-4.1-mini on all three datasets; only Gemini-3.5-Flash clearly surpasses it, on VQA-RAD and VQA-Med, while the judge model itself edges it on VQA-RAD (0.654 vs. 0.649, within noise at n = 350). Example per-model answers appear in [Section D.1](https://arxiv.org/html/2607.01436#A4.SS1).

### 4.2 Inference Speed

Latency matters for interactive drafting: the model must produce a draft fast enough to be regenerated as the radiologist works\. We characterize inference speed for the two decoders on matched hardware\.

The cost structures differ. AR cost scales with decoded tokens: each token is one sequential forward pass (with KV caching), so latency grows with report length. Diffusion cost is set by the denoising-step budget over the 256-token canvas: each step is one forward pass updating all unaccepted positions in parallel, independent of length. Because latency therefore depends on token count and step budget rather than on the report's content, we measure a generic ∼256-token generation rather than a specific dataset.
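The two cost structures reduce to a simple back-of-the-envelope model, sketched below. The per-token and per-step timings are hypothetical placeholders, not measurements from the paper; only the shape of the formulas follows the text.

```python
def ar_latency(num_tokens, seconds_per_token):
    """AR decoding: one sequential forward pass per generated token."""
    return num_tokens * seconds_per_token


def diffusion_latency(num_steps, seconds_per_step):
    """Diffusion decoding: one forward pass per denoising step over the whole canvas."""
    return num_steps * seconds_per_step


def speedup(ar_seconds, diffusion_seconds):
    """Latency ratio relative to AR (values above 1 mean diffusion is faster)."""
    return ar_seconds / diffusion_seconds


# Hypothetical timings, for illustration only.
ar = ar_latency(num_tokens=256, seconds_per_token=0.02)
diff = diffusion_latency(num_steps=32, seconds_per_step=0.04)
print(ar, diff, speedup(ar, diff))

# Halving the report length halves AR latency; the diffusion figure is unchanged
# because it depends on the step budget, not on how many tokens are emitted.
print(ar_latency(128, 0.02), diffusion_latency(32, 0.04))
```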

DiffusionGemma-26B drafts 3.5–4.4× faster than its AR sibling and at 5.7–7.1× higher throughput ([Table 2](https://arxiv.org/html/2607.01436#S4.T2)); AR is timed at its natural, shorter output while diffusion fills the full canvas, so the comparison is generous to AR.

Table 2: Inference speed on one H100 (bf16, ∼256-token generation). AR is greedy decode; DiffusionGemma is swept over the denoising-step budget. Speedup is latency relative to AR.

### 4.3 Any-Order Infill

[Section 3.3](https://arxiv.org/html/2607.01436#S3.SS3) cast infill as sampling the conditional of [Eq. 1](https://arxiv.org/html/2607.01436#S3.E1). Here we evaluate the capability it affords that autoregression lacks: filling a gap in the report from the fixed text on *both* sides. A radiologist editing one sentence of a draft, for instance, wants the surrounding text updated from both directions, where an AR model would regenerate only what follows the edit. [Figure 3](https://arxiv.org/html/2607.01436#S4.F3) contrasts the two paradigms.

![Refer to caption](https://arxiv.org/html/2607.01436v1/figures/fig_infill_src.png)

Figure 3: Completing a gap from both sides. One sentence of a chest X-ray report is masked (the gap) and filled from the surrounding fixed fragments. Top: the diffusion model draws on fragments on either side and recovers the sentence correctly. Bottom: the autoregressive sibling sees only the fragments before it, the rest greyed out, and reconstructs it incorrectly. Real MIMIC-CXR example.

We mask one complete sentence (deterministically, near the middle) of each held-out MIMIC-CXR \[[13](https://arxiv.org/html/2607.01436#bib.bib19)\] report (n = 249, one canvas) and fill the resulting gap under two conditions, scoring each fill against the masked sentence by token-F1 and by the LLM judge of [Section 4.1](https://arxiv.org/html/2607.01436#S4.SS1) (semantic equivalence to the reference sentence). The *bidirectional* condition clamps the fragments on both sides of the gap; *left-only* clamps only the left, emulating an AR view. We run this for the diffusion model and for AR; in the AR *bidirectional* condition the right-side context is supplied in the prompt, the only way an autoregressive model can condition on it. The within-model gain from adding the right context measures bidirectional exploitation, and the model × context interaction measures the capability asymmetry. Both models are the released (base) checkpoints, so the result reflects architecture rather than finetuning.
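Token-F1 between a fill and the masked sentence can be computed as the harmonic mean of token precision and recall over their multiset overlap. The sketch below assumes lowercased whitespace tokenization; the paper's exact tokenization and normalization are not given in this section, and the example sentences are made up.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall over multiset overlap."""
    pred_tokens = prediction.lower().split()  # assumed: whitespace tokenization
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Made-up sentences: one shared token out of three predicted and two reference tokens.
print(round(token_f1("no pleural effusion", "no pneumothorax"), 3))  # 0.4
print(token_f1("no pleural effusion", "No pleural effusion"))        # 1.0
```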

Table 3: Infill ablation on MIMIC-CXR (n = 249): fill a masked sentence with context on both sides vs. left only. Token-F1 and LLM-judge accuracy of the fill against the masked sentence; Δ is the gain from adding the right-side context (paired test). AR *bidir.* is given both sides in its prompt. \*: $p < 10^{-3}$; n.s.: not significant.

[Table 3](https://arxiv.org/html/2607.01436#S4.T3) reports the 2×2 design (model × context). The diffusion model uses the right-side context strongly: adding it raises token-F1 by +0.109 (paired t-test, $p < 10^{-10}$, 95% CI [+0.077, +0.141]) and judge accuracy by +0.129 ($p = 2 \times 10^{-5}$). AR does not: even when prompted with both sides, the right context does not significantly help it (+0.031 token-F1, p = 0.08; −0.016 judge, n.s.). The model × context interaction is significant on both metrics (+0.078 token-F1, $p = 2 \times 10^{-4}$; +0.145 judge, $p = 3 \times 10^{-4}$): diffusion benefits about 3.5× more from bidirectional context. Example fills are shown in [Section D.2](https://arxiv.org/html/2607.01436#A4.SS2).
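The interaction terms follow from the within-model gains as a difference of differences, which can be checked directly from the numbers quoted above. Reading the "about 3.5×" figure as the ratio of the two token-F1 gains is an interpretation that matches the arithmetic, not something the text states explicitly.

```python
def interaction(gain_diffusion, gain_ar):
    """Model x context interaction: how much more diffusion gains than AR."""
    return gain_diffusion - gain_ar


# Gains from adding right-side context, as quoted in the text.
f1_gain_diffusion, f1_gain_ar = 0.109, 0.031
judge_gain_diffusion, judge_gain_ar = 0.129, -0.016

f1_interaction = round(interaction(f1_gain_diffusion, f1_gain_ar), 3)
judge_interaction = round(interaction(judge_gain_diffusion, judge_gain_ar), 3)
f1_gain_ratio = round(f1_gain_diffusion / f1_gain_ar, 1)

print(f1_interaction)     # 0.078
print(judge_interaction)  # 0.145
print(f1_gain_ratio)      # 3.5
```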

## 5 Conclusion

We studied discrete diffusion versus autoregression for chest X-ray report drafting with two same-size, same-family backbones, so the generative paradigm is the only variable. On a matched medical-VQA benchmark scored by a verbosity-robust LLM judge, the diffusion model matches or exceeds its AR sibling and is competitive with frontier vision-language models, while decoding 3.5–4.4× faster. Beyond this, it adds a drafting capability AR lacks, any-order infill: a sampler modification lets a radiologist fix report fragments and have the diffusion model fill the gaps between them. On MIMIC-CXR it exploits context on both sides of a gap (+0.109 token-F1, +0.129 judge accuracy) while AR does not, even when the right-side context is in its prompt. We release our code and finetuned checkpoints.

## References

- \[1\] (2021) Structured denoising diffusion models in discrete state-spaces. In Advances in Neural Information Processing Systems (NeurIPS). [Document](https://dx.doi.org/10.48550/arXiv.2107.03006)
- \[2\] S. Bannur, K. Bouzid, D. C. Castro, A. Schwaighofer, A. Thieme, S. Bond-Taylor, M. Ilse, F. Pérez-García, V. Salvatelli, H. Sharma, et al. (2024) MAIRA-2: grounded radiology report generation. arXiv preprint arXiv:2406.04449. [Document](https://dx.doi.org/10.48550/arXiv.2406.04449)
- \[3\] A. Ben Abacha, S. A. Hasan, V. V. Datla, J. Liu, D. Demner-Fushman, and H. Müller (2019) VQA-Med: overview of the medical visual question answering task at ImageCLEF 2019. In CLEF 2019 Working Notes, CEUR Workshop Proceedings.
- \[4\] L. Chen, T. You, H. Liu, Z. Bao, J. Jiao, X. Han, Z. Ou, T. Sun, X. Mou, X. Jin, and Y. Xu (2026) ECHO: efficient chest x-ray report generation with one-step block diffusion. arXiv preprint arXiv:2604.09450. [Document](https://dx.doi.org/10.48550/arXiv.2604.09450)
- \[5\] Gemma Team, Google DeepMind (2026) Gemma 4: open multimodal models. Model card, [https://huggingface.co/google/gemma-4-26B-A4B-it](https://huggingface.co/google/gemma-4-26B-A4B-it)
- \[6\] Google DeepMind (2026) DiffusionGemma: block discrete-diffusion language models. Model card, [https://huggingface.co/google/diffusiongemma-26B-A4B-it](https://huggingface.co/google/diffusiongemma-26B-A4B-it)
- \[7\] H. I. Gulluk and O. Gevaert (2026) SemEnrich: self-supervised semantic enrichment of radiology reports for vision-language learning. arXiv preprint arXiv:2604.09887. [Document](https://dx.doi.org/10.48550/arXiv.2604.09887)
- \[8\] H. I. Gulluk, M. Van Puyvelde, and O. Gevaert (2026) OpenMedQ: broad open pretraining for medical vision-language models. arXiv preprint arXiv:2606.12953. [Document](https://dx.doi.org/10.48550/arXiv.2606.12953)
- \[9\] H. I. Gulluk, M. Van Puyvelde, W. Van Criekinge, and O. Gevaert (2026) SDR: set-distance rewards for radiology report generation. arXiv preprint arXiv:2606.00440. [Document](https://dx.doi.org/10.48550/arXiv.2606.00440)
- \[10\] HKU NLP Group (2026) DreamOn: diffusion language models for code infilling beyond fixed-size canvas. arXiv preprint arXiv:2602.01326. [Document](https://dx.doi.org/10.48550/arXiv.2602.01326)
- \[11\] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen (2022) LoRA: low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR). [Document](https://dx.doi.org/10.48550/arXiv.2106.09685)
- \[12\] S. L. Hyland, S. Bannur, K. Bouzid, D. C. Castro, M. Ranjit, A. Schwaighofer, F. Pérez-García, V. Salvatelli, S. Srivastav, A. Thieme, et al. (2023) MAIRA-1: a specialised large multimodal model for radiology report generation. arXiv preprint arXiv:2311.13668. [Document](https://dx.doi.org/10.48550/arXiv.2311.13668)
- \[13\] A. E. W. Johnson, T. J. Pollard, S. J. Berkowitz, N. R. Greenbaum, M. P. Lungren, C. Deng, R. G. Mark, and S. Horng (2019) MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Scientific Data 6(1), pp. 317. [Document](https://dx.doi.org/10.1038/s41597-019-0322-0)
- \[14\] J. J. Lau, S. Gayen, A. Ben Abacha, and D. Demner-Fushman (2018) A dataset of clinically generated visual questions and answers about radiology images. Scientific Data 5(1), pp. 180251. [Document](https://dx.doi.org/10.1038/sdata.2018.251)
- \[15\] C. Li, C. Wong, S. Zhang, N. Usuyama, H. Liu, J. Yang, T. Naumann, H. Poon, and J. Gao (2023) LLaVA-Med: training a large language-and-vision assistant for biomedicine in one day. In Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks. [Document](https://dx.doi.org/10.48550/arXiv.2306.00890)
- \[16\] B. Liu, L. Zhan, L. Xu, L. Ma, Y. Yang, and X. Wu (2021) SLAKE: a semantically-labeled knowledge-enhanced dataset for medical visual question answering. In IEEE International Symposium on Biomedical Imaging (ISBI). [Document](https://dx.doi.org/10.1109/ISBI48211.2021.9434010)
- \[17\] J. Mao, Y. Wang, L. Chen, C. Zhao, Y. Tang, D. Yang, L. Qu, D. Xu, and Y. Zhou (2025) Discrete diffusion models with MLLMs for unified medical multimodal generation. arXiv preprint arXiv:2510.06131. [Document](https://dx.doi.org/10.48550/arXiv.2510.06131)
- \[18\] S. Nie, F. Zhu, Z. You, X. Zhang, J. Ou, J. Hu, J. Zhou, Y. Lin, J. Wen, and C. Li (2025) Large language diffusion models. arXiv preprint arXiv:2502.09992. [Document](https://dx.doi.org/10.48550/arXiv.2502.09992)
- \[19\] S. S. Sahoo, M. Arriola, Y. Schiff, A. Gokaslan, E. Marroquin, J. T. Chiu, A. Rush, and V. Kuleshov (2024) Simple and effective masked diffusion language models. In Advances in Neural Information Processing Systems (NeurIPS). [Document](https://dx.doi.org/10.48550/arXiv.2406.07524)
- \[20\] T. Tanida, P. Müller, G. Kaissis, and D. Rueckert (2023) Interactive and explainable region-guided radiology report generation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). [Document](https://dx.doi.org/10.1109/CVPR52729.2023.00718)
- \[21\] S. Wang et al. (2024) CopilotCAD: empowering radiologists with report completion models and quantitative evidence from medical image foundation models. arXiv preprint arXiv:2404.07424. [Document](https://dx.doi.org/10.48550/arXiv.2404.07424)
- \[22\] J. Ye, Z. Xie, L. Zheng, J. Gao, Z. Wu, X. Jiang, Z. Li, and L. Kong (2025) Dream 7b: diffusion large language models. arXiv preprint arXiv:2508.15487. [Document](https://dx.doi.org/10.48550/arXiv.2508.15487)
- \[23\] S. Yu, J. Wang, and G. Lu (2026) AnchorDiff: topology-aware masked diffusion with confidence-based rewriting for radiology report generation. arXiv preprint arXiv:2605.17071. [Document](https://dx.doi.org/10.48550/arXiv.2605.17071)
- \[24\] X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer (2023) Sigmoid loss for language image pre-training. In International Conference on Computer Vision (ICCV). [Document](https://dx.doi.org/10.48550/arXiv.2303.15343)
- \[25\]X\. Zhang, H\. Zhou, X\. Yang,et al\.\(2024\)ReXrank: a public leaderboard for ai\-powered radiology report generation\.Note:arXiv preprint arXiv:2411\.15122External Links:[Document](https://dx.doi.org/10.48550/arXiv.2411.15122)Cited by:[§1](https://arxiv.org/html/2607.01436#S1.p2.1),[§2](https://arxiv.org/html/2607.01436#S2.SS0.SSS0.Px1.p1.1)\.
- \[26\]L\. Zheng, W\. Chiang, Y\. Sheng, S\. Zhuang, Z\. Wu, Y\. Zhuang, Z\. Lin, Z\. Li, D\. Li, E\. P\. Xing, H\. Zhang, J\. E\. Gonzalez, and I\. Stoica\(2023\)Judging LLM\-as\-a\-judge with MT\-Bench and Chatbot Arena\.InAdvances in Neural Information Processing Systems \(NeurIPS\) Datasets and Benchmarks Track,External Links:[Document](https://dx.doi.org/10.48550/arXiv.2306.05685)Cited by:[§2](https://arxiv.org/html/2607.01436#S2.SS0.SSS0.Px2.p1.1),[§4\.1](https://arxiv.org/html/2607.01436#S4.SS1.p3.1)\.

## Appendix A Backbones and Adaptation Recipe

[Table A1](https://arxiv.org/html/2607.01436#A1.T1) lists the two backbones and their \(identical\) adaptation recipe; only the generative paradigm and its established optimizer differ\.

Table A1: The two backbones and their adaptation\. Same family, size, vision tower, LoRA targets, and data \(“same” = identical to the diffusion column\); only the generative paradigm and its optimizer differ\. Vision is frozen for both\.

## Appendix B Datasets

[Table A2](https://arxiv.org/html/2607.01436#A2.T2) lists the three medical\-VQA datasets and their sizes\.

Table A2: Medical\-VQA datasets\. Sizes are train / test QA pairs; evaluation uses a fixed random subset of n = 350 test items per dataset\.
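
A fixed evaluation subset of this kind can be drawn reproducibly by seeding a local random generator. The following is a minimal sketch, assuming test items are addressed by integer index; the helper name `fixed_eval_subset`, the seed value, and the placeholder test-set size are illustrative assumptions, not details from the paper.

```python
import random


def fixed_eval_subset(num_test_items: int, n: int = 350, seed: int = 0) -> list[int]:
    """Return a sorted, reproducible sample of n test-item indices."""
    if n > num_test_items:
        raise ValueError("subset size exceeds the test set")
    # A local generator keeps the draw independent of global random state.
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_test_items), n))


# 1000 is a placeholder test-set size, not a figure from Table A2.
subset = fixed_eval_subset(1000)
```

Reusing the same seed for every model keeps the comparison on identical items.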
## Appendix C Infill Sampler

We patch the uncompiled outer denoising step rather than the compiled inner sampler, because the inner sampler is reassigned as an instance attribute and therefore shadows any class\-level patch\. The patch clamps the fixed positions on the incoming canvas and on both outgoing canvases at each step \([Fig\. A1](https://arxiv.org/html/2607.01436#A3.F1)\)\.
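
The shadowing mentioned here is ordinary Python attribute lookup: a callable stored on an instance is found before a function defined on the class, so a class-level patch of that name never runs. The toy below illustrates the effect; the class and method names (`ToySampler`, `inner_sample`, `outer_step`) are hypothetical stand-ins and do not correspond to the model's real code.

```python
class ToySampler:
    def __init__(self):
        # Stand-in for a compiled function being reassigned onto the instance.
        self.inner_sample = lambda: "compiled inner sampler"

    def inner_sample(self):
        return "original inner sampler"

    def outer_step(self):
        return self.inner_sample()


sampler = ToySampler()

# A class-level patch of the inner sampler is shadowed by the instance attribute.
ToySampler.inner_sample = lambda self: "patched inner sampler"
shadowed = sampler.outer_step()

# Patching the outer step, which lives only on the class, does take effect.
original_outer = ToySampler.outer_step
ToySampler.outer_step = lambda self: "patched: " + original_outer(self)
patched = sampler.outer_step()
```

Here `shadowed` still comes from the instance-level callable, while `patched` shows the outer wrapper running around it.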

```
# Clamp user-fixed tokens at the fixed positions on every denoising step.
# Patch the uncompiled outer _denoising_step, not the compiled inner sampler.
# Abridged: orig_step is the original _denoising_step, saved before patching.
def patched_step(self, *a, current_canvas=..., **kw):
    cc = where(fixed_mask, fixed_tokens, current_canvas)  # condition on fixed
    cur, argmax, *rest = orig_step(self, *a, current_canvas=cc, **kw)
    cur    = where(fixed_mask, fixed_tokens, cur)         # survive renoise
    argmax = where(fixed_mask, fixed_tokens, argmax)
    return (cur, argmax, *rest)
```

Figure A1: The any\-order infill procedure \(abridged\)\. fixed\_tokens / fixed\_mask are \[B, L\]; the wrapper clamps the fixed positions on the incoming canvas and both outgoing canvases at each step\.
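
The clamping pattern can be exercised end to end on a toy canvas. This is a sketch under stated assumptions, not the paper's sampler: `toy_step` is a random stand-in for the real denoiser, `MASK` is a hypothetical mask id, and a single sequence (no batch dimension) is represented as a plain list.

```python
import random

MASK = -1  # hypothetical id for a still-masked position


def clamp(canvas, fixed_mask, fixed_tokens):
    """Elementwise select: fixed token where the mask is set, canvas token elsewhere."""
    return [f if m else c for c, m, f in zip(canvas, fixed_mask, fixed_tokens)]


def toy_step(canvas, rng, vocab_size=10, renoise_prob=0.3):
    """Stand-in denoiser: predicts a token everywhere, then re-masks some positions.

    It ignores the canvas contents for brevity; a real model would condition on them.
    """
    argmax = [rng.randrange(vocab_size) for _ in canvas]
    cur = [MASK if rng.random() < renoise_prob else t for t in argmax]
    return cur, argmax


def clamped_step(canvas, fixed_mask, fixed_tokens, rng):
    cc = clamp(canvas, fixed_mask, fixed_tokens)      # condition on the fixed text
    cur, argmax = toy_step(cc, rng)
    cur = clamp(cur, fixed_mask, fixed_tokens)        # fixed text survives renoising
    argmax = clamp(argmax, fixed_mask, fixed_tokens)
    return cur, argmax


fixed_tokens = [7, 7, 0, 0, 0, 9, 9]
fixed_mask = [True, True, False, False, False, True, True]
rng = random.Random(0)
canvas = [MASK] * len(fixed_tokens)
for _ in range(8):
    canvas, prediction = clamped_step(canvas, fixed_mask, fixed_tokens, rng)
```

After every step the two fixed spans hold their user-supplied tokens on both outputs, and only the three gap positions are left to the denoiser.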
## Appendix D Example Outputs

### D\.1 Medical VQA

Held\-out items from each dataset, with every model’s answer and the LLM judge’s verdict \(✓ correct, × incorrect\)\. Base and frontier models answer in full sentences, which exact\-match scoring penalizes regardless of correctness; the judge scores meaning \([Section 4\.1](https://arxiv.org/html/2607.01436#S4.SS1)\)\. Long answers are abbreviated with \[…\]; one sample per dataset is shown in full to illustrate this verbosity\.
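
To see why exact-match scoring penalizes a verbose but correct answer, consider a normalized string comparison applied to two of the finetuned answers listed below. This is a sketch: the `normalize` function is an assumed, common normalization rather than the paper's scoring code, and the LLM judge itself is not reproduced here.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def exact_match(prediction: str, ground_truth: str) -> bool:
    return normalize(prediction) == normalize(ground_truth)


ground_truth = "3"
terse = "3"
verbose = "There are 3 kinds of abnormalities in this image."

terse_ok = exact_match(terse, ground_truth)      # True
verbose_ok = exact_match(verbose, ground_truth)  # False, although the answer is correct
```

Both answers state the same count, but only the terse one survives string equality, which is the gap a meaning-based judge is meant to close.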

![[Uncaptioned image]](https://arxiv.org/html/2607.01436v1/figures/qual/vqa_vqarad_1.jpg)

- Q\. Is there ventricular enlargement?
- GT\. No

![[Uncaptioned image]](https://arxiv.org/html/2607.01436v1/figures/qual/vqa_vqarad_2.jpg)

- Q\. is there a mass
- GT\. yes
- diff \(base\) ✓ This chest X\-ray shows a rounded opacity \(a white area\) in the upper left lung\. \[…\]
- AR \(base\) ✓ This chest X\-ray shows a well\-defined opacity \(a "spot" or density\) in the upper part of the left lung\. \[…\]
- diff \(ft\) ✓ yes
- AR \(ft\) ✓ Yes
- Gemini\-3\.5\-Flash × Based on the provided chest X\-ray, there is no evidence of a definitive pulmonary mass\.
- GPT\-4\.1\-mini × No, there is no visible mass in this chest X\-ray\.
- Sonnet\-4\.6 × No, there is no discrete mass visible on this chest X\-ray\. \[…\]

![[Uncaptioned image]](https://arxiv.org/html/2607.01436v1/figures/qual/vqa_vqarad_3.jpg)

- Q\. Describe the borders of the lesion\.
- GT\. Smooth
- diff \(base\) × The provided chest X\-ray does not show any clearly defined lesions\. The lung fields appear clear, and the heart size and shape are within normal limits\. \[…\]
- AR \(base\) × I cannot provide a description of a lesion because no lesion is visible in this chest X\-ray\. The image appears to show normal anatomy for a chest radiograph\. \[…\]
- diff \(ft\) ✓ Well\-defined
- AR \(ft\) ✓ The lesion has well\-defined borders\.
- Gemini\-3\.5\-Flash × Based on the provided chest radiograph, there is no visible lung lesion or abnormal mass\. \[…\]
- GPT\-4\.1\-mini × The borders of the lesion are irregular and poorly defined\.
- Sonnet\-4\.6 × The lesion \(located in the right hilum/perihilar region\) has well\-defined, lobulated borders, suggesting a discrete mass rather than an infiltrative process\. \[…\]

![[Uncaptioned image]](https://arxiv.org/html/2607.01436v1/figures/qual/vqa_vqarad_4.jpg)

- Q\. From what structure is the large hypodense mass emanating from?
- GT\. The base of the cecum

![[Uncaptioned image]](https://arxiv.org/html/2607.01436v1/figures/qual/vqa_slake_1.jpg)

- Q\. How many kinds of abnormalities are there in this image?
- GT\. 3
- diff \(base\) × Based on the single image provided, there is one primary area of abnormality visible: a large, hyperintense \(bright\) lesion located in the right hemisphere, specifically involving the thalamus/basal ganglia region\. A definitive medical diagnosis requires a review of the entire imaging study, multiple sequences, and the patient’s clinical history by a qualified professional\. If this image belongs to you or someone you know, you should consult with a physician for a formal interpretation and clinical plan\.
- AR \(base\) × Based on the provided image, which appears to be an axial MRI scan of a brain, there is a large, prominent abnormality visible\. The image shows a significant area of high signal intensity \(bright white area\) located in the central/right region of the brain\. This represents a single major type of abnormality, which could indicate several different medical conditions such as a tumor, hemorrhage, or an area of edema/infarction\. Important Medical Disclaimer: I am an AI, not a doctor\. \[…\]
- diff \(ft\) ✓ 3
- AR \(ft\) ✓ There are 3 kinds of abnormalities in this image\.
- Gemini\-3\.5\-Flash × Based on the provided brain MRI, there are two main types of abnormalities visible: 1\. A large intracranial mass/tumor \(the well\-defined, hyperintense lesion in the deep right hemisphere/thalamic region\)\. 2\. Obstructive ventriculomegaly/hydrocephalus \(abnormal enlargement and dilation of the lateral ventricles, particularly the occipital and frontal horns, caused by mass effect and CSF obstruction\)\.
- GPT\-4\.1\-mini × There are two kinds of abnormalities in this image\.
- Sonnet\-4\.6 × Looking at this brain MRI \(DWI sequence\), I can identify 2 kinds of abnormalities: 1\. Restricted diffusion \(bright signal\) — indicating acute ischemia/infarction in the posterior fossa/cerebellum region 2\. Mass effect/enlarged structure — suggesting associated edema or structural enlargement in that area

![[Uncaptioned image]](https://arxiv.org/html/2607.01436v1/figures/qual/vqa_slake_2.jpg)

- Q\. How many kinds of abnormalities are there in this image?
- GT\. 3

![[Uncaptioned image]](https://arxiv.org/html/2607.01436v1/figures/qual/vqa_slake_3.jpg)

- Q\. How many organs are there in this image?
- GT\. 2
- diff \(base\) × I cannot count the number of organs in this image\. A chest X\-ray is a medical imaging tool used to visualize internal structures, but it is not used to count organs\. \[…\]
- AR \(base\) × I cannot provide a medical interpretation or diagnosis from this image\. \[…\]
- diff \(ft\) ✓ 2
- AR \(ft\) ✓ 2
- Gemini\-3\.5\-Flash × Based on this chest X\-ray, we can identify 3 major visible organs: the lungs \(left and right\), the heart, and the stomach \(visible via the g \[…\]
- GPT\-4\.1\-mini × There are two organs visible in this image: the lungs\.
- Sonnet\-4\.6 × In this chest X\-ray, there are multiple organs visible, but the main identifiable ones are: 1\. Lungs \(2 — left and right\) 2\. Heart 3\. Liver \(partially visible at the right lower border\) 4\. \[…\]

![[Uncaptioned image]](https://arxiv.org/html/2607.01436v1/figures/qual/vqa_slake_4.jpg)

- Q\. How many organs are there in this image?
- GT\. 6
- diff \(base\) × I cannot count the number of organs from a single CT scan slice\. A CT scan consists of many sequential images that are necessary to fully identify and evaluate all internal structures\. \[…\]
- AR \(base\) × I cannot provide a count of organs in this image\. Interpreting medical imaging requires specialized training and context from a complete scan, which a single image cannot provide\. \[…\]
- diff \(ft\) ✓ 6
- AR \(ft\) ✓ There are 6 organs in this image\.
- Gemini\-3\.5\-Flash × Based on this abdominal CT slice, there are 4 major distinct organs visualized: 1\. Spleen \(on the right side of the image / anatomical left\) 2\. \[…\]
- GPT\-4\.1\-mini × There are four organs visible in this image: the liver, both kidneys, and portions of the intestines\.
- Sonnet\-4\.6 × In this abdominal CT scan, the visible organs include: 1\. Kidneys \(bilateral\) — 2 2\. Bowel loops \(small/large intestine\) 3\. Vertebral body \(bone structure\) 4\. \[…\]

![[Uncaptioned image]](https://arxiv.org/html/2607.01436v1/figures/qual/vqa_slake_5.jpg)

- Q\. Is this a study of the neck?
- GT\. No
- diff \(base\) × Yes, this is an axial CT scan of the lower head and upper neck region\. \[…\]
- AR \(base\) × This single image is an axial CT slice through the level of the upper neck/base of the skull, showing structures such as the mandible, pharynx, and the base of the brain\. \[…\]
- diff \(ft\) ✓ No
- AR \(ft\) ✓ No
- Gemini\-3\.5\-Flash × Yes\.
- GPT\-4\.1\-mini × Yes\.
- Sonnet\-4\.6 × Yes, this is a CT study of the neck/head junction region — specifically an axial CT image at the level of the skull base/upper neck, showi \[…\]

![[Uncaptioned image]](https://arxiv.org/html/2607.01436v1/figures/qual/vqa_vqamed_3.jpg)

- Q\. what is the primary abnormality in this image?
- GT\. juvenile pilocytic astrocytoma
- diff \(base\) ✓ This T2\-weighted coronal MRI of the brain shows a large, well\-defined hyperintense \(bright\) area located in the cerebellum\. The appearance and location of this lesion are concerning for a cerebellar tumor\. \[…\]
- AR \(base\) × This image is a coronal T2\-weighted MRI of the brain\. The primary abnormality is a large, well\-circumscribed, hyperintense \(bright\) lesion located in the midline of the cerebellum \(the posterior fossa\)\. \[…\]
- diff \(ft\) ✓ pillary astrocytoma
- AR \(ft\) ✓ The primary abnormality in this image is a juvenile pilocytic astrocytoma \(jpa\)\. Juvenile pilocytic astrocytoma is a benign, slow\-growing neoplasm of the central nervous system\. \[…\]
- Gemini\-3\.5\-Flash × Based on the provided coronal T2\-weighted brain MRI, the primary abnormality is a large, hyperintense mass in the brainstem \(specifically wi \[…\]
- GPT\-4\.1\-mini × The primary abnormality in this image is bilateral symmetrical hyperintense lesions in the basal ganglia, indicative of a metabolic or toxic \[…\]
- Sonnet\-4\.6 × The primary abnormality in this image is bilateral enlargement of the lateral ventricles \(hydrocephalus\)\. \[…\]

![[Uncaptioned image]](https://arxiv.org/html/2607.01436v1/figures/qual/vqa_vqamed_5.jpg)

- Q\. what is most alarming about this ultrasound?
- GT\. papillary carcinoma of the thyroid

### D\.2 Any\-Order Infill

Four held\-out MIMIC\-CXR reports, each with one sentence masked\. Its position is marked \(⋯\) and the masked sentence is shown in the teal box below; the four fills form a \{diffusion, AR\} × \{bidirectional, left\-only\} grid\. *Bidirectional* supplies the fixed text on both sides of the gap \(for AR, in its prompt\), *left\-only* only the left\. Only the bidirectional diffusion fill reconstructs the masked sentence; the others, including AR with both sides in its prompt, cannot condition on the right\-side context\.
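
The two context conditions can be made concrete by splitting a report around its masked sentence. The sketch below assumes the report is already split into sentences; the helper `infill_conditions` and its dictionary layout are hypothetical, and how each model consumes the contexts (canvas clamping for diffusion, a prompt for AR) is not shown.

```python
def infill_conditions(sentences: list[str], masked_index: int) -> dict[str, dict[str, str]]:
    """Build the fixed text for the bidirectional and left-only conditions."""
    left = " ".join(sentences[:masked_index])
    right = " ".join(sentences[masked_index + 1:])
    return {
        "bidirectional": {"left": left, "right": right},  # fixed text on both sides
        "left_only": {"left": left, "right": ""},         # right-side context withheld
    }


# Sentences of the first example report in this section.
report = [
    "The heart is normal in size.",
    "The mediastinal and hilar contours appear within normal limits.",
    "There is no pleural effusion or pneumothorax.",
    "The lungs appear clear.",
    "Bony structures are unremarkable.",
    "No evidence of acute disease.",
]
conditions = infill_conditions(report, masked_index=2)
target = report[2]  # the masked sentence each fill is compared against
```

Giving each of the two conditions to both the diffusion and the AR model yields the four fills of the grid.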

![[Uncaptioned image]](https://arxiv.org/html/2607.01436v1/figures/qual/infill_midsent_1.jpg)

Report\. The heart is normal in size\. The mediastinal and hilar contours appear within normal limits\. ⋯ The lungs appear clear\. Bony structures are unremarkable\. No evidence of acute disease\.

Masked sentence: There is no pleural effusion or pneumothorax\.

![[Uncaptioned image]](https://arxiv.org/html/2607.01436v1/figures/qual/infill_midsent_2.jpg)

Report\. ⋯ No airspace opacification\. No pneumothorax\. No pulmonary edema\. Mild density seen in the left costophrenic angle which may represent atelectasis or a small pleural effusion\. Suture material projecting over the right hilar area\. Narrowing of the subglottic trachea is probably due to recent intubation\. The cardiomediastinal shadow is unchanged\. No airspace opacification\. No pneumothorax\. No pulmonary edema\. Mild density seen in the left costophrenic angle which may represent atelectasis or a small pleural effusion\. Suture material projecting over the right hilar area\. Narrowing of the subglottic trachea is probably due to recent intubation\.

Masked sentence: The cardiomediastinal shadow is unchanged\.

![[Uncaptioned image]](https://arxiv.org/html/2607.01436v1/figures/qual/infill_midsent_3.jpg)

Report\. There are moderate bilateral pleural effusions with overlying atelectasis, underlying basilar consolidation is not excluded in the appropriate clinical setting\. The cardiomediastinal silhouette is grossly stable\. ⋯ No evidence of pneumothorax is seen\. Bilateral pleural effusions with overlying atelectasis, underlying consolidation not excluded\. Possible mild vascular congestion\.

Masked sentence: There is minimal pulmonary vascular congestion\.

![[Uncaptioned image]](https://arxiv.org/html/2607.01436v1/figures/qual/infill_midsent_4.jpg)

Report\. The previous bilateral pleural effusions have resolved\. Substantial apical thickening bilaterally and lung scarring, the sequela of likely radiation therapy is unchanged\. ⋯ No evidence of pneumonia\. Resolution of pleural effusions\. Sequela of radiation induced changes including biapical scarring and fibrosis\.

Masked sentence: The cardiac size is normal\.