Attention Consistent Longitudinal Medical Visual Question Answering Guided by Vision Foundation Models

arXiv cs.AI 06/08/26, 04:00 AM Papers
Summary
Proposes an attention-guided encoder-decoder for longitudinal medical visual question answering, using a frozen DINO-based mask generator and auxiliary losses to improve consistency and interpretability, achieving strong results on the Medical-Diff-VQA benchmark.
arXiv:2606.06534v1 Announce Type: cross Abstract: Longitudinal medical visual question answering (VQA) requires reasoning about anatomical differences between an image of a current time point and an image of a referred time point. We propose an attention-guided encoder-decoder for this task with chest X-rays. Instead of conventional direct contrast, we propose to include a lightweight affine registration module to reduce nuisance motion by co-registering the current image to the reference image with a small registration regularizer. The registered image pair is fed into the image encoder, followed by a frozen DINO-based mask generator and a trainable adaptive mask generator to produce masks applied to the original image pairs. The masked image pairs are again fed into the image encoder and concatenated with text features as the input to a multimodal transformer-based decoder to generate final answers. To facilitate learning stabilization and clarify the change signal, inspired by DINO-v3, we include additional auxiliary objectives, including a mask rebuilding loss, a pairwise Gram-style consistency loss, and a KoLeo uniformity loss, which enhances the geometry of the representation. On the Medical-Diff-VQA benchmark, the model delivers strong BLEU, ROUGE-L, CIDEr, and METEOR scores while offering intrinsic interpretability through the shared saliency mask. These results support saliency-conditioned generation with mild pre-alignment as a principled framework for longitudinal reasoning in medical VQA. Our training strategy also illustrates the potential of a paradigm in utilizing image foundation models in biomedicine: optimizing both supervised and unsupervised learning objectives simultaneously.
Original Article
View Cached Full Text
Cached at: 06/08/26, 09:16 AM
# Attention Consistent Longitudinal Medical Visual Question Answering Guided by Vision Foundation Models
Source: [https://arxiv.org/html/2606.06534](https://arxiv.org/html/2606.06534)
Qianru Zhang, Georges El Fakhri, Xiaofeng Liu Yale Biomedical Imaging Institute New Haven, CT 06510 xiaofeng\.liu@yale\.edu

###### Abstract

Longitudinal medical visual question answering \(VQA\) requires reasoning about anatomical differences between an image of a current time point and an image of a referred time point\. We propose an attention\-guided encoder\-decoder for this task with chest X\-rays\. Instead of conventional direct contrast, we propose to include a lightweight affine registration module to reduce nuisance motion by co\-registering the current image to the reference image with a small registration regularizer\. The registered image pair is fed into the image encoder, followed by a frozen DINO\-based mask generator and a trainable adaptive mask generator to produce masks applied to the original image pairs\. The masked image pairs are again fed into the image encoder and concatenated with text features as the input to a multimodal transformer\-based decoder to generate final answers\. To facilitate learning stabilization and clarify the change signal, inspired by DINO\-v3, we include additional auxiliary objectives, including a mask rebuilding loss, a pairwise Gram\-style consistency loss, and a KoLeo uniformity loss, which enhances the geometry of the representation\. On the Medical\-Diff\-VQA benchmark, the model delivers strong BLEU, ROUGE\-L, CIDEr, and METEOR scores while offering intrinsic interpretability through the shared saliency mask\. These results support saliency\-conditioned generation with mild pre\-alignment as a principled framework for longitudinal reasoning in medical VQA\. Our training strategy also illustrates the potential of a paradigm in utilizing image foundation models in biomedicine: optimizing both supervised and unsupervised learning objectives simultaneously\.

## 1Introduction

Medical Visual Question Answering \(VQA\) aims to answer open\-ended clinical questions based on medical images, serving as a critical bridge from visual perception to clinical decision support\[[22](https://arxiv.org/html/2606.06534#bib.bib22)\]\. In recent years, many medical VQA approaches have relied on pretrained visual or multimodal models\[[11](https://arxiv.org/html/2606.06534#bib.bib11),[19](https://arxiv.org/html/2606.06534#bib.bib19),[50](https://arxiv.org/html/2606.06534#bib.bib50)\]\. However, most of these works focus on a single time point that follows the natural image VQA tasks, whereas radiologists routinely need to compare current and previous studies to localize change, judge progression, and reconcile apparent discrepancies\.

Longitudinal visual question answering \(Diff\-VQA\) operationalizes this workflow by conditioning answers on paired images acquired at two time points, where the difference is often the signal of interest rather than the absolute appearance\[[10](https://arxiv.org/html/2606.06534#bib.bib10)\]\. Recent benchmarks and methods for longitudinal chest X\-rays have made this task concrete by supplying paired images, questions, and change\-focused answers\[[9](https://arxiv.org/html/2606.06534#bib.bib9),[28](https://arxiv.org/html/2606.06534#bib.bib28),[49](https://arxiv.org/html/2606.06534#bib.bib49)\]\. Building on these resources, several approaches adapt vision–language models or design task\-specific architectures to better capture temporal discrepancies, including prior work that emphasizes longitudinal pretraining\[[5](https://arxiv.org/html/2606.06534#bib.bib5)\], residual alignment in the feature or pixel space\[[25](https://arxiv.org/html/2606.06534#bib.bib25)\], or region\-level retrieval and mixing\[[48](https://arxiv.org/html/2606.06534#bib.bib48)\]\. However, their attention at different time points is not explicitly encouraged to be consistent\. Moreover, current approaches mainly focus on supervised fine\-tuning, and the potential of introducing unsupervised objectives is not explored yet\. Furthermore, they also suffer from the untransparency caused by the black box property of deep learning, which may cause disbelief and concerns from related stakeholders\.

Saliency maps are a type of saliency visualization used to interpret deep learning models\. In medical imaging tasks, they are widely employed to present verifiable evidence to clinicians and enhance model interpretability and trustworthiness\[[2](https://arxiv.org/html/2606.06534#bib.bib2),[22](https://arxiv.org/html/2606.06534#bib.bib22)\]\. However, existing medical VQA models often treat saliency as a post\-hoc explanation\[[15](https://arxiv.org/html/2606.06534#bib.bib15),[18](https://arxiv.org/html/2606.06534#bib.bib18),[24](https://arxiv.org/html/2606.06534#bib.bib24)\]rather than incorporating it as intrinsic supervision during training\. In longitudinal settings, this is a missed opportunity because a consistent focus on corresponding anatomy across the two time points is essential to answering difference\-type questions faithfully\.

To mitigate the above gaps, we introduce an attention\-guided generative framework specifically designed for chest radiograph temporal comparison \([Figure1](https://arxiv.org/html/2606.06534#S2.F1)\)\. The method has two design principles: \(i\) make the two images geometrically comparable and \(ii\) ensure that what the model says it cares about also determines where it looks at both time points, inspired by natural image co\-attention\[[42](https://arxiv.org/html/2606.06534#bib.bib42),[6](https://arxiv.org/html/2606.06534#bib.bib6)\]\. Specifically, we have these modules\.∙\\bulletMicro pre\-alignment\.A lightweight CNN\-based module applies a near\-identity affine warp to the current study to mitigate small pose and scale variations without overfitting or erasing true changes\[[14](https://arxiv.org/html/2606.06534#bib.bib14)\]\.∙\\bulletDual\-path mask generation\.One path employs the self\-supervised visual prior DINO model\[[31](https://arxiv.org/html/2606.06534#bib.bib31)\]to provide robust lesion candidates, while the other path uses an adaptive mask generator to produce sample\-adaptive masks from encoder features, with a hyperparameter changed during the training process,λ\\lambdacontrolling their relative proportions\.∙\\bulletMultigranularity training objectives;Language modeling lossLlmL\_\{\\text\{lm\}\}optimizes answer quality; Mask consistency lossLmask\_main/refL\_\{\\text\{mask\\\_main/ref\}\}constrains the difference between the inner products \(of image features and the mask feature\) and masked image features; a light\-head prediction lossLpred\_main/refL\_\{\\text\{pred\\\_main/ref\}\}enables mask rebuilding; Gram\-style encourages similar patch\-to\-patch relationships and spatial structure similarity between images at different visits; and distribution normalization regularizationLKoLeoL\_\{\\text\{KoLeo\}\}enhances representational separability between samples and open\-set robustness\.

The main contributions can be summarized as:

∙\\bulletWe formalize a simple and effective way to enforce*spatially consistent attention*across paired images by using a shared attention mask as a training signal for Diff\-VQA\. Specifically, we propose a plug\-and\-play mask generator that combines DINO priors with adaptive feature\-driven masks without requiring additional annotations, balancing stability and sample adaptability\.

∙\\bulletWe use a comprehensive set of training objectives that cover classification and language semantics, representation alignments, spatial alignments, and attention alignments\.

∙\\bulletWe demonstrate competitive performance and incorporate interpretability and explainability in our framework, without the need for post\-hoc saliency analysis, providing both textual answers and visual analysis of lesions\. It alleviates the cognitive load on medical practitioners while mitigating distrust stemming from the black\-box nature of deep learning models, demonstrating significant practical value\.

## 2Related works

![Refer to caption](https://arxiv.org/html/2606.06534v1/x1.png)Figure 1:Illustration for our longitudinal medical VQA framework\.A\.First, perform approximately identical affine pre\-registration \(with parameter regularizationLregL\_\{\\text\{reg\}\}\) on the main and reference images, then extract features via the image encoder\. The registered images are fed into DINO and an adaptive mask generator, respectively; both generate temporal masks under shared weights, which are then fused via the union/fuse function and weighted averaging \(weightsλ\\lambdaand1−λ1\-\\lambda\) to produce a change\-aware shared mask applied uniformly to both registered images\. The masked images are re\-encoded, and the resulting two\-time\-point features are concatenated sequentially with the problem encoding using special separator tokens to form a multimodal prefix\. Finally, the generative decoder outputs the answer\.B\.Mask consistency lossesLmask\_mainL\_\{\\text\{mask\\\_main\}\}andLmask\_refL\_\{\\text\{mask\\\_ref\}\}constrain the distance between the masked images’ features and the product of the mask feature and images’ features\.C\.Language modelling lossLlmL\_\{\\text\{lm\}\}directly supervises answer generation\.D\.Lightweight MLP prediction heads provide auxiliary supervision viaLpred\_mainL\_\{\\text\{pred\\\_main\}\}andLpred\_refL\_\{\\text\{pred\\\_ref\}\}respectively enhancing the semantic meaning of the masks\.E\.Gram\-style similarity constraints are applied to both full and masked features \(LgramL\_\{\\text\{gram\}\}andLgram\_maskL\_\{\\text\{gram\\\_mask\}\}\) to enforce similar patch structure between the main image and the reference image\.F\.KoLeo regularizationLKoLeoL\_\{\\text\{KoLeo\}\}promotes representational separability across samples\.Difference\-aware medical VQA\.Medical\-Diff\-VQA\[[10](https://arxiv.org/html/2606.06534#bib.bib10),[9](https://arxiv.org/html/2606.06534#bib.bib9)\]provides a large\-scale benchmark of paired chest radiographs and has become a major evaluation dataset\. Methodologically, early approaches often employed transfer learning from general image difference description \(IDC\) models as a strong baseline: MCCFormers\[[47](https://arxiv.org/html/2606.06534#bib.bib47)\]utilized a transformer encoder\-decoder architecture, performing multi\-head attention similarity comparisons on patches from the two images\. IDCPCL\[[46](https://arxiv.org/html/2606.06534#bib.bib46)\]aligns visual differences and text through self\-supervised pre\-training and contrastive learning, mitigating label scarcity\. For medical applications, EKAID\[[10](https://arxiv.org/html/2606.06534#bib.bib10)\]pioneered a systematic approach to differential Med\-VQA, introducing graph\-based representations based on expert knowledge\. Subsequent approaches advanced along multiple trajectories: RegioMix\[[48](https://arxiv.org/html/2606.06534#bib.bib48)\]employs region\-level retrieval augmentation to retrieve question\-relevant image regions prior to generation\. PLURAL\[[5](https://arxiv.org/html/2606.06534#bib.bib5)\]adapted Diff\-VQA using a visual\-language model pre\-trained in two stages: natural text\-image to longitudinal chest radiographs; ReAl\[[25](https://arxiv.org/html/2606.06534#bib.bib25)\]combines generative responses with residual input and residual alignment of characteristics to explicitly highlight differences between two time phases; VED\[[27](https://arxiv.org/html/2606.06534#bib.bib27)\]introduced embeddings of image differentiation, learning a distinct d\-dimensional vector for each main/reference image and applying them to all visual tokens, allowing cross\-attention decoding to distinguish images throughout the pipeline\.

Saliency and segmentation in medical images\.Research in this area demonstrates a converging trend from interpretable visualization towards spatial supervision and unified foundational models\. Within chest radiography scenarios, a system benchmark\[[35](https://arxiv.org/html/2606.06534#bib.bib35)\]reveals that multiple saliency methods \(including Grad\-CAM\[[37](https://arxiv.org/html/2606.06534#bib.bib37)\]\) exhibit limited accuracy and stability in lesion localization\.Wollek et al\.\[[44](https://arxiv.org/html/2606.06534#bib.bib44)\]proposed an attention\-based transformer method for saliency generation in models for pneumothorax classification\. Regarding segmentation models, nnU\-Net\[[13](https://arxiv.org/html/2606.06534#bib.bib13)\]provides a robust baseline for multimodal tasks through its self\-configuration process\. Subsequent approaches integrating or replacing U\-Net with transformers \(TransUNet\[[4](https://arxiv.org/html/2606.06534#bib.bib4)\], Swin\-UNet\[[3](https://arxiv.org/html/2606.06534#bib.bib3)\], UNETR\[[8](https://arxiv.org/html/2606.06534#bib.bib8)\]\) further enhance global dependencies and multi\-scale modelling\. Concurrently, general segmentation foundational models are rapidly entering medical applications: MedSAM\[[26](https://arxiv.org/html/2606.06534#bib.bib26)\]demonstrates zero/few\-shot generalization on million\-scale medical datasets, while SAM2\[[33](https://arxiv.org/html/2606.06534#bib.bib33)\]and MedSAM2\[[51](https://arxiv.org/html/2606.06534#bib.bib51)\]extend promptable segmentation to 2D/3D and video domains\. Regarding the representation backbone, DINO\-based models are becoming a solid foundation for medical segmentation and saliency\.

DINO backbones in radiology\.DINOv2\[[29](https://arxiv.org/html/2606.06534#bib.bib29)\]has been employed for training\-free deformable medical image registration \(DINO\-Reg\), securing first place in the OncoReg challenge\[[39](https://arxiv.org/html/2606.06534#bib.bib39)\]\. This demonstrates that semantic knowledge learned from natural images generalizes to cross\-organ geometric alignment scenarios in medical data\. For single\-modality data like chest radiographs, RAD\-DINO demonstrates strong competitiveness across classification/segmentation and image\-text alignment tasks\[[31](https://arxiv.org/html/2606.06534#bib.bib31)\]\. For multimodal scenarios such as MRI, MM\-DINOv2 introduces multimodal patch embedding and whole\-modality masking DINOv2\[[36](https://arxiv.org/html/2606.06534#bib.bib36)\]\. Furthermore, interpretability efforts integrating DINOv2 representations \(e\.g\., combining ViT\-CX causal explanations with self\-supervised features\) provide evidence for clinical traceability\[[12](https://arxiv.org/html/2606.06534#bib.bib12)\]\. Building upon DINOv3\[[38](https://arxiv.org/html/2606.06534#bib.bib38)\], SegDINO\[[45](https://arxiv.org/html/2606.06534#bib.bib45)\]achieves strong competitiveness across multiple medical segmentation benchmarks using a frozen DINOv3 with a lightweight decoder paradigm, while MedDINOv3\[[20](https://arxiv.org/html/2606.06534#bib.bib20)\]attains or surpasses SOTA on segmentation tasks through multi\-scale token aggregation and domain\-adaptive pre\-training on 3\.87 million CT slices\. In general, DINOv2 provides robust general representations and cross\-task transferability, while DINOv3 further enhances high\-resolution medical segmentation\.

## 3Methods

The pipeline has two components, including a micro image registration module and a keyword\-conditioned saliency extraction module, followed by image–text encoders and a multimodal decoder\.

### 3\.1Micro image registration module

Given a main imageImain∈ℝ3×H×WI\_\{\\text\{main\}\}\\in\\mathbb\{R\}^\{3\\times H\\times W\}and a reference imageIref∈ℝ3×H×WI\_\{\\text\{ref\}\}\\in\\mathbb\{R\}^\{3\\times H\\times W\}, a shallow CNN predicts 2D affine parametersΘ=\[A𝐭\]∈ℝ2×3\\Theta=\[A\\;\\mathbf\{t\}\]\\in\\mathbb\{R\}^\{2\\times 3\}\. We warp only the main image with a differentiable grid sampler:

𝐱=A𝐱tgt\+𝐭\.\\mathbf\{x\}=A\\,\\mathbf\{x\}\_\{\\mathrm\{tgt\}\}\+\\mathbf\{t\}\.\(1\)
To keep the transform near identity and avoid erasing true anatomical change, we regularize

ℒreg=wsml‖Θ−I‖F2\+wdet\(det\(A\)−1\)2\+wtran‖𝐭‖22\\mathcal\{L\}\_\{\\text\{reg\}\}=w\_\{\\text\{sml\}\}\\\|\\Theta\-I\\\|\_\{F\}^\{2\}\+w\_\{\\det\}\(\\det\(A\)\-1\)^\{2\}\+w\_\{\\text\{tran\}\}\\\|\\mathbf\{t\}\\\|\_\{2\}^\{2\}\(2\)withwsml=10−4w\_\{\\text\{sml\}\}=10^\{\-4\},wdet=10−5w\_\{\\det\}=10^\{\-5\}, andwtran=10−6w\_\{\\text\{tran\}\}=10^\{\-6\}\. The registered image isI^main\\widehat\{I\}\_\{\\text\{main\}\}\.

### 3\.2Mask generating module

Letϕ\(⋅\)\\phi\(\\cdot\)be the shared image encoder and projector\. We first extract the original image features:

fmain=ϕ\(I^main\)∈ℝN×C,fref=ϕ\(Iref\)∈ℝN×C\.f\_\{\\text\{main\}\}=\\phi\(\\widehat\{I\}\_\{\\text\{main\}\}\)\\in\\mathbb\{R\}^\{N\\times C\},\\quad f\_\{\\text\{ref\}\}=\\phi\(I\_\{\\text\{ref\}\}\)\\in\\mathbb\{R\}^\{N\\times C\}\.\(3\)
From a frozen DINO branch, we obtain attention maps and a union prior as shown in[Equation4](https://arxiv.org/html/2606.06534#S3.E4)\. We use the cosine similarity between the embeddings of the CLS token and the patch token to represent the models’ attention on that specific patch and then reshape the spatial format\.

Amain,Aref∈\[0,1\]H×W,U=max⁡\(Amain,Aref\)\.A\_\{\\text\{main\}\},A\_\{\\text\{ref\}\}\\in\[0,1\]^\{H\\times W\},\\quad U=\\max\\\!\\left\(A\_\{\\text\{main\}\},A\_\{\\text\{ref\}\}\\right\)\.\(4\)
A lightweight 3\-layer MLP headg\(⋅\)g\(\\cdot\)produces per\-token mask probabilities, and a fusion moduleh\(⋅\)h\(\\cdot\)using a 1\-layer CNN gated head to fuse the union, intersection, and difference of these two masks’ features\.

mmain=σ\(g\(fmain\)\),mref=σ\(g\(fref\)\),m\_\{\\text\{main\}\}=\\sigma\\\!\\big\(g\(f\_\{\\text\{main\}\}\)\\big\),\\quad m\_\{\\text\{ref\}\}=\\sigma\\\!\\big\(g\(f\_\{\\text\{ref\}\}\)\\big\),\(5\)F=σ\(h\(mmain,mref\)\)\.F=\\sigma\\\!\\big\(h\(m\_\{\\text\{main\}\},m\_\{\\text\{ref\}\}\)\\big\)\.\(6\)
After reshaping the tokens into the image grid, the final mask combines the prior and adaptive components\. In the experiment, we setλ=1\\lambda=1for the initial andλ=0\.5\\lambda=0\.5for the end\. We use the cosine function for the intermediate value change\.

M=λU\+\(1−λ\)F,λ∈\[0,1\]M=\\lambda\\,U\+\(1\-\\lambda\)\\,F,\\quad\\lambda\\in\[0,1\]\(7\)
We applyMMto both visits and re\-encode the masked images using the same image encoder:

Imain′=M⊙I^main,Iref′=M⊙Iref\.I^\{\\prime\}\_\{\\text\{main\}\}=M\\odot\\widehat\{I\}\_\{\\text\{main\}\},\\quad I^\{\\prime\}\_\{\\text\{ref\}\}=M\\odot I\_\{\\text\{ref\}\}\.\(8\)
fmask\_main\\displaystyle f\_\{\\text\{mask\\\_main\}\}=ϕ\(Imain′\),fmask\\displaystyle=\\phi\\\!\\big\(I^\{\\prime\}\_\{\\text\{main\}\}\\big\),\\quad f\_\{\\text\{mask\}\}=ϕ\(M\)\\displaystyle=\\phi\\\!\\big\(M\\big\)\(9\)fmask\_ref\\displaystyle f\_\{\\text\{mask\\\_ref\}\}=ϕ\(Iref′\)\\displaystyle=\\phi\\\!\\big\(I^\{\\prime\}\_\{\\text\{ref\}\}\\big\)

### 3\.3Image encoder and projector

We use a Swin\-base model \(patch size of 4 × 4 and window size of 12 × 12\)\[[23](https://arxiv.org/html/2606.06534#bib.bib23)\]first pretrained on ImageNet\-21k\[[34](https://arxiv.org/html/2606.06534#bib.bib34)\]and further trained on classification tasks on the MIMIC\-CXR dataset with CheXpert labels\[[27](https://arxiv.org/html/2606.06534#bib.bib27)\]as our image encoder backbone\. Its penultimate feature map,ℝN×C\\mathbb\{R\}^\{N\\times C\}, directly provides a token sequence compatible with GPT\-2\[[32](https://arxiv.org/html/2606.06534#bib.bib32)\]input\. A projector with one linear layer, one 8\-head transformer encoder\[[40](https://arxiv.org/html/2606.06534#bib.bib40)\], and a two\-layer MLP maps image tokens to the text\-representational space\.

### 3\.4Text encoder

Questions are tokenized with embeddings shared by the decoder, then added with a learnable positional embedding, and passed through 6 12\-head Transformer encoder layers\.

### 3\.5Multimodal decoder

A GPT\-2 small\[[32](https://arxiv.org/html/2606.06534#bib.bib32)\]decoder from HuggingFace\[[43](https://arxiv.org/html/2606.06534#bib.bib43)\]consumes masked image tokens and question tokens to generate the answer\. We add special tokens:⟨pad⟩\\langle\\text\{pad\}\\rangle,⟨img⟩\\langle\\text\{img\}\\rangle,⟨qtn⟩\\langle\\text\{qtn\}\\rangle, and⟨ans⟩\\langle\\text\{ans\}\\rangle\. Denote the representation of the question asftextf\_\{\\text\{text\}\}, using\[\]\[\]to represent the concatenation; the input sequenceCCis

\[⟨img⟩,fmask\_main,⟨img⟩,fmask\_ref,⟨qtn⟩,ftext,⟨ans⟩\]\.\[\\langle\\text\{img\}\\rangle,f\_\{\\text\{mask\\\_main\}\},\\langle\\text\{img\}\\rangle,f\_\{\\text\{mask\\\_ref\}\},\\langle\\text\{qtn\}\\rangle,f\_\{\\text\{text\}\},\\langle\\text\{ans\}\\rangle\]\.\(10\)

### 3\.6Training

We optimize the decoder for answer quality and add four auxiliary terms to \(i\) tie the masked pass to its gated counterpart, \(ii\) keep each visit individually diagnosable and constrain the information loss from applying masks, \(iii\) enforce longitudinal consistency, and \(iv\) avoid representation collapse\. All losses are summed with the hyperparameter weights described at the end of this subsection\.NNdenotes the number of samples in the training set\.

Language modeling\.Teacher\-forcing cross\-entropyCE\\mathrm\{CE\}on ground\-truth answersyy:

ℒlm=CE\(Decoder\(C\),y\)\.\\mathcal\{L\}\_\{\\text\{lm\}\}=\\mathrm\{CE\}\\\!\\left\(\\mathrm\{Decoder\}\(C\),\\,y\\right\)\.\(11\)
Mask consistency\.The masked responses should equal the original responses gated by the final maskMM:

ℒmask\_main\\displaystyle\\mathcal\{L\}\_\{\\text\{mask\\\_main\}\}=1N∑i=1N‖fmask\_main,i−Mifmain,i‖22\.\\displaystyle=\\frac\{1\}\{N\}\\sum\_\{i=1\}^\{N\}\\Big\\\|f\_\{\\text\{mask\\\_main\},i\}\-M\_\{i\}\\,f\_\{\\text\{main\},i\}\\Big\\\|\_\{2\}^\{2\}\.\(12\)ℒmask\_ref\\displaystyle\\mathcal\{L\}\_\{\\text\{mask\\\_ref\}\}=1N∑i=1N‖fmask\_ref,i−Mifref,i‖22\.\\displaystyle=\\frac\{1\}\{N\}\\sum\_\{i=1\}^\{N\}\\Big\\\|f\_\{\\text\{mask\\\_ref\},i\}\-M\_\{i\}\\,f\_\{\\text\{ref\},i\}\\Big\\\|\_\{2\}^\{2\}\.
Reconstruction of the light head maskA tiny headP\(⋅\)P\(\\cdot\)regresses the masked tokens back to their pre\-mask counterparts to preserve diagnosticability:

fpred\_main=P\(fmask\_main\),fpred\_ref=P\(fmask\_ref\),f\_\{\\text\{pred\\\_main\}\}=P\\\!\\big\(f\_\{\\text\{mask\\\_main\}\}\\big\),\\quad f\_\{\\text\{pred\\\_ref\}\}=P\\\!\\big\(f\_\{\\text\{mask\\\_ref\}\}\\big\),\(13\)
ℒpred\_main\\displaystyle\\mathcal\{L\}\_\{\\text\{pred\\\_main\}\}=1N∑i=1N‖fpred\_main,i−fmain,i‖22,\\displaystyle=\\frac\{1\}\{N\}\\sum\_\{i=1\}^\{N\}\\Big\\\|f\_\{\\text\{pred\\\_main\},i\}\-f\_\{\\text\{main\},i\}\\Big\\\|\_\{2\}^\{2\},\(14\)ℒpred\_ref\\displaystyle\\mathcal\{L\}\_\{\\text\{pred\\\_ref\}\}=1N∑i=1N‖fpred\_ref,i−fref,i‖22\.\\displaystyle=\\frac\{1\}\{N\}\\sum\_\{i=1\}^\{N\}\\Big\\\|f\_\{\\text\{pred\\\_ref\},i\}\-f\_\{\\text\{ref\},i\}\\Big\\\|\_\{2\}^\{2\}\.
Gram\-style longitudinal consistency\.Inspired by the gram anchoring adapted inSiméoni et al\.\[[38](https://arxiv.org/html/2606.06534#bib.bib38)\], we also consider a gram\-style objective here\. Specifically, rather than computing the gram loss between teacher and student models, we computed the gram loss between the main image and the reference image to preserve the patch\-to\-patch relationship across different visits\.

ℒgram\\displaystyle\\mathcal\{L\}\_\{\\text\{gram\}\}=‖G\(fmain\)−G\(fref\)‖F2,\\displaystyle=\\Big\\\|G\(f\_\{\\text\{main\}\}\)\-G\(f\_\{\\text\{ref\}\}\)\\Big\\\|\_\{F\}^\{2\},\(15\)ℒgram,mask\\displaystyle\\mathcal\{L\}\_\{\\text\{gram,mask\}\}=‖G\(fmask\_main\)−G\(fmask\_ref\)‖F2,\\displaystyle=\\Big\\\|G\(f\_\{\\text\{mask\\\_main\}\}\)\-G\(f\_\{\\text\{mask\\\_ref\}\}\)\\Big\\\|\_\{F\}^\{2\},
Gram⁡\(X\)=1NX^X^⊤,X^=X∥X∥2∈ℝN×C\.\\operatorname\{Gram\}\(X\)\\;=\\;\\frac\{1\}\{N\}\\,\\hat\{X\}\\hat\{X\}^\{\\top\},\\quad\\hat\{X\}\\;=\\;\\frac\{X\}\{\\lVert X\\rVert\_\{2\}\}\\in\\mathbb\{R\}^\{N\\times C\}\.\(16\)
KoLeo dispersion\.We penalize small nearest\-neighbor distances within each batchBBto avoid sample collapse:

ℒKoLeo\\displaystyle\\mathcal\{L\}\_\{\\text\{KoLeo\}\}=−1B∑i=1Blog⁡\(minj≠i⁡‖z^i−z^j‖2\+ε\),\\displaystyle=\-\\frac\{1\}\{B\}\\sum\_\{i=1\}^\{B\}\\log\\\!\\Big\(\\min\_\{j\\neq i\}\\big\\\|\\hat\{z\}\_\{i\}\-\\hat\{z\}\_\{j\}\\big\\\|\_\{2\}\+\\varepsilon\\Big\),\(17\)z^i\\displaystyle\\hat\{z\}\_\{i\}=zi‖zi‖2\.\\displaystyle=\\frac\{z\_\{i\}\}\{\\\|z\_\{i\}\\\|\_\{2\}\}\.
Total objective\.We combine language modeling, registration, and auxiliaries to form the total loss\. In our settings, we haveαmask=αpred=αgram=0\.1\\alpha\_\{\\text\{mask\}\}=\\alpha\_\{\\text\{pred\}\}=\\alpha\_\{\\text\{gram\}\}=0\.1, andαkl=0\.001\\alpha\_\{\\text\{kl\}\}=0\.001\.

ℒtotal\\displaystyle\\mathcal\{L\}\_\{\\text\{total\}\}=ℒlm\+ℒreg\+αmask\(ℒmask\_main\+ℒmask\_ref\)\\displaystyle=\\mathcal\{L\}\_\{\\text\{lm\}\}\+\\mathcal\{L\}\_\{\\text\{reg\}\}\+\\alpha\_\{\\text\{mask\}\}\\\!\\big\(\\mathcal\{L\}\_\{\\text\{mask\\\_main\}\}\+\\mathcal\{L\}\_\{\\text\{mask\\\_ref\}\}\\big\)\(18\)\+αpred\(ℒpred\_m\+ℒpred\_r\)\\displaystyle\\quad\+\\alpha\_\{\\text\{pred\}\}\\\!\\big\(\\mathcal\{L\}\_\{\\text\{pred\\\_m\}\}\+\\mathcal\{L\}\_\{\\text\{pred\\\_r\}\}\\big\)\+αgram\(ℒgram\+ℒgram\_mask\)\+αklℒKoLeo\.\\displaystyle\\quad\+\\alpha\_\{\\text\{gram\}\}\\\!\\big\(\\mathcal\{L\}\_\{\\text\{gram\}\}\+\\mathcal\{L\}\_\{\\text\{gram\\\_mask\}\}\\big\)\+\\alpha\_\{\\text\{kl\}\}\\mathcal\{L\}\_\{\\text\{KoLeo\}\}\.
Two stages of training\.To avoid interfering with or compromising existing semantic information within the pretrained image encoder during the initial stages of multi\-task learning, we divided the training into two phases\. In the first phase, we froze the image encoder’s weights and trained the model for four epochs, allowing the remaining components to learn their respective functionalities\. In the second phase, we unfroze the image encoder and trained the entire model for a further four epochs\.

### 3\.7Inference

During inference, we use the entire learned framework to generate the final answers\.

### 3\.8Rationality analysis of masking

Notice that ifM≡𝟏M\\equiv\\mathbf\{1\}, the model is reduced to the baseline without masking\. Therefore, the masked model strictly contains the unmasked model as a special case\. Denote the question asqqand the answer asyy\.

Masking as sufficient statistics\.For longitudinal questions, we assume that the answer depends only on changes inside a sparse anatomical supportR⊂\{1,…,H\}×\{1,…,W\}R\\subset\\\{1,\\dots,H\\\}\\times\\\{1,\\dots,W\\\}:

p\(y\|I^main,Iref,q\)=p\(y\|I^main\(R\),Iref\(R\),q\),p\(y\\,\|\\,\\hat\{I\}\_\{\\text\{main\}\},I\_\{\\text\{ref\}\},q\)=p\\big\(y\\,\\big\|\\hat\{I\}\_\{\\text\{main\}\}^\{\(R\)\},I\_\{\\text\{ref\}\}^\{\(R\)\},q\\big\),\(19\)whereI^\(R\)\\hat\{I\}^\{\(R\)\}keeps only pixels inRR\. Define the ideal binary mask

Mij⋆=\{1,\(i,j\)∈R,0,otherwise,M^\{\\star\}\_\{ij\}=\\begin\{cases\}1,&\(i,j\)\\in R,\\\\\[2\.0pt\] 0,&\\text\{otherwise\},\\end\{cases\}\(20\)and letImain′⁣⋆=M⋆⊙I^mainI\_\{\\text\{main\}\}^\{\\prime\\star\}=M^\{\\star\}\\odot\\hat\{I\}\_\{\\text\{main\}\}andIref′⁣⋆=M⋆⊙IrefI\_\{\\text\{ref\}\}^\{\\prime\\star\}=M^\{\\star\}\\odot I\_\{\\text\{ref\}\}\. Then\(Imain′⁣⋆,Iref′⁣⋆\)\(I\_\{\\text\{main\}\}^\{\\prime\\star\},I\_\{\\text\{ref\}\}^\{\\prime\\star\}\)is a sufficient statistic foryygiven\(I^main,Iref,q\)\(\\hat\{I\}\_\{\\text\{main\}\},I\_\{\\text\{ref\}\},q\):

p\(y\|I^main,Iref,q\)=p\(y\|Imain′⁣⋆,Iref′⁣⋆,q\)\.p\(y\\,\|\\,\\hat\{I\}\_\{\\text\{main\}\},I\_\{\\text\{ref\}\},q\)=p\\big\(y\\,\\big\|I\_\{\\text\{main\}\}^\{\\prime\\star\},I\_\{\\text\{ref\}\}^\{\\prime\\star\},q\\big\)\.\(21\)Therefore, a Bayesian optimal predictor exists that relies solely on the masked image\. The objective of learning the maskMMis to approximateM⋆M^\{\\star\}, filtering out task\-irrelevant background without compromising information related toyy\.

Signal\-to\-noise and longitudinal differences\.Write each registered image as a signal plus background noise,

I^main=smain\+nmain,Iref=sref\+nref,\\hat\{I\}\_\{\\text\{main\}\}=s\_\{\\text\{main\}\}\+n\_\{\\text\{main\}\},\\qquad I\_\{\\text\{ref\}\}=s\_\{\\text\{ref\}\}\+n\_\{\\text\{ref\}\},\(22\)where\(smain,sref\)\(s\_\{\\text\{main\}\},s\_\{\\text\{ref\}\}\)encodes disease\-related variations withinRR, whilstnmain,nrefn\_\{\\text\{main\}\},n\_\{\\text\{ref\}\}denotes instrument and background noise, respectively\. ForMMapproachingM⋆M^\{\\star\}, we have

Imain′=M⊙I^main≈smain\+\(M⊙nmain\),I^\{\\prime\}\_\{\\text\{main\}\}=M\\odot\\hat\{I\}\_\{\\text\{main\}\}\\approx s\_\{\\text\{main\}\}\+\(M\\odot n\_\{\\text\{main\}\}\),\(23\)The noise energy decreases approximately in proportion to the area ratio\|R\|/\|Ω\|\|R\|/\|\\Omega\|\(whereΩ\\Omegadenotes the support set of the entire image\)\. In high\-dimensional models, both reduced effective input dimensions and improved signal\-to\-noise ratios mitigate overfitting risks\. Consequently, the mask serves as a structured regularization term, provided that it does not erroneously exclude the lesion regions\.

Shared masks for longitudinal differences\.Diff\-VQA answers are driven by longitudinal changes rather than absolute findings, so the relevant quantity is the difference

ΔI=I^main−Iref\.\\Delta I=\\hat\{I\}\_\{\\text\{main\}\}\-I\_\{\\text\{ref\}\}\.\(24\)Applying the*same*maskMMto both time points yields

ΔI′=Imain′−Iref′=M⊙\(I^main−Iref\)=M⊙ΔI\.\\Delta I^\{\\prime\}=I^\{\\prime\}\_\{\\text\{main\}\}\-I^\{\\prime\}\_\{\\text\{ref\}\}=M\\odot\(\\hat\{I\}\_\{\\text\{main\}\}\-I\_\{\\text\{ref\}\}\)=M\\odot\\Delta I\.\(25\)This can be viewed as a projection operatorPMP\_\{M\}acting on the joint input\(I^main,Iref\)\(\\hat\{I\}\_\{\\text\{main\}\},I\_\{\\text\{ref\}\}\):

PM\(I^main,Iref\)=\(M⊙I^main,M⊙Iref\),P\_\{M\}\(\\hat\{I\}\_\{\\text\{main\}\},I\_\{\\text\{ref\}\}\)=\(M\\odot\\hat\{I\}\_\{\\text\{main\}\},\\,M\\odot I\_\{\\text\{ref\}\}\),\(26\)which preserves voxel\-wise differences within the masked anatomical coordinates while attenuating all others\. With our registration module, this enforces an inductive bias that the model must compare the corresponding anatomy across time, rather than relying on arbitrary global statistics\.DINO\-guided prior and adaptive refinement\.In the implementation,MMis obtained by combining the unsupervised priorUUgenerated by DINO and the task\-adaptive maskFF\.UUconstrains the mask to semantically and anatomically plausible regions, whilstFFtask\-specializes the prior under the joint influence ofℒlm\\mathcal\{L\}\_\{\\text\{lm\}\}, mask consistency loss, Gram\-style constraints, and KoLeo regularization\. Overall, this may be viewed as a straightforward estimation in the mask space: the prior derives from DINO, while the posterior is refined by the longitudinal VQA objective, yielding a shared mask that is both anatomically plausible and tightly aligned with the differential response\.

## 4Experiments

We use the longitudinal chest radiograph Diff\-VQA dataset, Medical\-Diff\-VQA\[[10](https://arxiv.org/html/2606.06534#bib.bib10),[9](https://arxiv.org/html/2606.06534#bib.bib9)\], which constructs samples from paired studies of the same subject at two time points together with a difference\-focused question–answer pair\. The dataset is derived from MIMIC\-CXR\[[16](https://arxiv.org/html/2606.06534#bib.bib16)\]and MIMIC\-CXR\-JPG\[[17](https://arxiv.org/html/2606.06534#bib.bib17)\]and was obtained from PhysioNet\[[7](https://arxiv.org/html/2606.06534#bib.bib7)\]\. All usage follows the PhysioNet credentialed\-access license and de\-identification guidelines\. The corpus contains 164,223 samples, split into 131,556 for training, 16,278 for validation, and 16,389 for testing\. Further experimental details can be found in the supplementary materials\. Following are details for each component shown in in[Figure1](https://arxiv.org/html/2606.06534#S2.F1)\.

In the data preparation stage, we resized the input images as three\-channel 384 × 384 pixel images\. To mitigate overfitting of limited training data, we adjust brightness, contrast, and sharpness to diversify the training data\. All codes are mainly implemented in two widely used python libraries, including PyTorch \(2\.6\.0\+cu124\) and Hugging Face Transformers \(4\.56\.2\)\. The image encoder is from Hugging Face \(microsoft/swin\-base\-patch4\-window12\-384\-in22k\) and was further trained on one NVIDIA A100\-SXM4\-80G GPU\. The full model was trained on four NVIDIA A100\-SXM4\-80G GPUs\. Batch size is set to 10 for training and 8 for validation\. The optimizer is AdamW with learning rate1\.5×10−41\.5\\times 10^\{\-4\}and weight decay0\.050\.05\. The DINO backbone used in mask generation is from RAD\-DINO\[[31](https://arxiv.org/html/2606.06534#bib.bib31)\]\.

### 4\.1Quantitative Comparisons

We adopt common generation metrics in VQA, BLEU\-1\[[30](https://arxiv.org/html/2606.06534#bib.bib30)\]\(1\-gram precision with a brevity penalty\), METEOR\[[1](https://arxiv.org/html/2606.06534#bib.bib1)\]\(stem matching with an emphasis on recall\), ROUGE\-L\[[21](https://arxiv.org/html/2606.06534#bib.bib21)\]\(overlap and longest common subsequence\), and CIDEr\[[41](https://arxiv.org/html/2606.06534#bib.bib41)\]\(a TF\-IDF based consensus metric\) to evaluate different aspects such as surface\-level matching, semantic alignment, and consistency with human references\. To emphasize the medical keyword and semantic meaning, also considering the scale of the metrics, we use CIDEr to select the best model at the end of training\.

Table 1:Evaluation on four metrics on Medical\-Diff\-VQA comparing ours with others\.Our approach demonstrates complementary advantages over existing methods across multiple evaluation metrics on the Medical\-Diff\-VQA task, exhibiting particularly significant superiority at the semantic level\. Specifically, our approach achieves the highest BLEU\-1 score of 0\.747 on first\-order n\-grams, substantially outperforming VED \(0\.716\)\. This indicates more comprehensive coverage of critical medical entities and difference\-relevant vocabulary by our model\. More notably, on the METEOR metric, which prioritizes semantic matching and disambiguation, our method achieves 0\.700, substantially surpassing all comparators, demonstrating the model’s distinct advantage in capturing clinically critical information and generating semantically equivalent descriptions\. On ROUGE\-L, which is more sensitive to overall syntactic structure and paragraph coherence, our method achieved 0\.703, substantially outperforming VED \(0\.670\) and other baselines\. Regarding CIDEr metrics, our score of 2\.011, while slightly below VED \(2\.119\), substantially surpasses methods such as RegioMix \(1\.804\) and PLURAL \(1\.832\)\. This indicates robust expressive power in generating clinically relevant answers that integrate medical context with temporal variations in images\. Overall, our method substantially surpasses existing differential VQA models in semantic adequacy and clinical relevance\. It maintains parity with current state\-of\-the\-art baselines in lexical accuracy and overall linguistic quality while achieving leadership in key metrics\.

### 4\.2Example and Qualitative Visualization

To further evaluate the actual effectiveness of the proposed method in practical application, we considered sampling from the test set to compare the genuine discrepancies between the model\-generated answers and the ground truth\. As shown in[Figure2](https://arxiv.org/html/2606.06534#S4.F2), although we achieved state\-of\-the\-art performance on evaluation metrics, the generated answers could still differ from the actual answers\. This aligns with findings inMarhuenda et al\.\[[27](https://arxiv.org/html/2606.06534#bib.bib27)\], highlighting the need for new metrics tailored to the medical domain\. A novel metric may require assigning greater weight to key medical terms and their semantic ordering, rather than treating them as ordinary tokens\. Additionally, to further demonstrate how the proposed adaptive mask captures lesion regions, we overlaid the mask onto the original image\.[Figure2](https://arxiv.org/html/2606.06534#S4.F2)indicates our model’s mask effectively identifies and focuses on critical areas in both images, underscoring the method’s validity and exceptional interpretability\. It is noteworthy that this artificial intelligence system is not intended to replace medical practitioners’ diagnosis and treatment but rather to serve as an auxiliary tool for clinicians\. Therefore, beyond providing textual prompts, this mask also offers reference and convenience for doctors interpreting images\. The transparency it affords can enhance medical practitioners’ trust in the model\. It should be noted that some extramural points of interest remain visible beneath the mask, though these did not adversely affect the model’s final performance\. This suggests the model may utilize non\-anatomical regions as a shortcut for inference\. However, such covariates may exert differing influences on the model across other data distributions\. Consequently, to enhance this model’s generalizability and robustness, future research may explore further constraining and redistributing the model’s attention using DINOv3\. To further demonstrate the rationale and interpretability, additional examples of RAD\-DINO and adaptive masks are provided in supplementary materials\.

![Refer to caption](https://arxiv.org/html/2606.06534v1/x2.png)Figure 2:Examples of questions and their corresponding answers generated by our model and the ground truth\. Correct predictions are highlighted inblue, while incorrect predictions are highlighted inred\.
### 4\.3Ablation studies

In this section, we systematically analyze the impact of three key design choices on Diff\-VQA performance through ablation experiments, with results presented in[Table2](https://arxiv.org/html/2606.06534#S4.T2)\. Specifically, we compare four configurations: the full model incorporating all components; a model without the initial freezing of the image encoder; a model without the DINO\-inspired unsupervised objectives; and a model without the saliency attention masks\. Compared to the full model, performance metrics generally declined when the image encoder was not frozen earlier in training\. This indicates that two\-stage optimization helps stabilize visual representations, providing more discriminative features for subsequent modules and allowing sufficient time for them to learn their duties\. Removing the DINO\-inspired unsupervised objective resulted in a consistent decline in metrics such as BLEU, METEOR, and CIDEr scores, demonstrating that in the limited\-annotation differential chest X\-ray VQA scenario, additional unsupervised representation constraints help strengthen image\-text alignment and enhance answer quality\. Performance degraded markedly when inference was performed directly on the full image without applying the saliency attention mask\. This demonstrates that saliency guidance is crucial for directing the model’s focus towards lesions and regions of longitudinal change, thereby enabling the effective utilization of temporal variation information to generate accurate and reliable responses\. Besides, this may also be partially caused by the fact that the mask consistency loss and mask rebuilding loss cannot be included in this scenario\. Collectively, these results underscore the complementary roles and critical value of staged freezing, unsupervised objectives, and saliency attention masks in enhancing performance on Diff\-VQA task\.

Table 2:Results of ablation experiments to investigate different components’ contribution\.−\-refers to removing the component from the proposed framework\.

## 5Discussion and Conclusion

This work proposes a generative framework for addressing longitudinal medical imaging differences, unifying approximate affine pre\-registration, shared attention mask constraints, and multimodal generative decoding within an end\-to\-end system\. First, a lightweight alignment is performed between the primary and reference examinations via the registration module, mitigating interference from patient position and acquisition conditions\. Building upon this, a shared attention mask is generated using DINO priors and an adaptive feature\-driven mask generator, simultaneously acting on both images\. The Union/Fuse dual\-channel architecture explicitly decouples common regions of interest and regions of change, enhancing sensitivity to lesion variations and spatial consistency without requiring additional pixel\-level annotations; Subsequently, masked and unmasked image features undergo joint image encoding and projection\. These are concatenated with question representations extracted by a text encoder to form a multimodal prefix\. A generative decoder then outputs a differential descriptive answer, enabling traceable modelling of local lesion evolution and end\-to\-end response generation\.

Nevertheless, this framework retains several limitations\. Firstly, the introduction of DINO branches, adaptive mask generators, and multiple auxiliary losses inevitably increases computational overhead and implementation complexity during training\. Further exploration is required to achieve a more optimal trade\-off between efficiency and performance in resource\-constrained scenarios\. Secondly, the current registration module only supports approximately identical affine transformations\. For cases with significant positional variations or markedly altered imaging conditions, its alignment capability may prove insufficient\. Thirdly, as analyzed earlier, masks generated by the DINO branch still exhibit considerable noise\. Future work could consider pre\-training a segmentation head focused on the lesion\. From an empirical perspective, we validated model performance on the MIMIC\-CXR Dataset, without systematically evaluating generalization capabilities across other disease types, imaging modalities, or multi\-center datasets\.

Despite the aforementioned limitations, this study holds significant theoretical and practical implications: We present a straightforward yet effective implementation of explicitly constrained spatial attention on longitudinal chest radiograph pairs\. By sharing masks and employing a dual\-channel Union/Fuse architecture, we separately model anatomical regions requiring attention in both examinations and locally altered lesions, optimizing them collaboratively within a unified framework\. This enables the mask generator to inherit the stable spatial prior provided by DINO while flexibly adjusting at the sample level through adaptive feature\-driven mechanisms\. Coupled with a comprehensive training objective that covers semantic modeling, representation alignment, and attention/feature distribution properties, the model achieves competitive question\-answering performance while directly generating semantically consistent attention masks\. This eliminates the need for additional post\-processing significance analysis, providing clinicians with integrated textual responses and visual lesion highlights\. Consequently, it reduces reading burden and mitigates distrust in black\-box models\. Lastly, it also provides a novel and principled approach to using existing image foundation models for biomedical research and applications\.

## 6Acknowledgment

This work is partially supported by the NIH grants P41EB022544, R21EB034911, and NVIDIA Academic Grant Program\.

## References

- Banerjee and Lavie \[2005\]Satanjeev Banerjee and Alon Lavie\.METEOR: An automatic metric for MT evaluation with improved correlation with human judgments\.In*Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization*, pages 65–72, Ann Arbor, Michigan, 2005\. Association for Computational Linguistics\.
- \[2\]Yusuf Brima and Marcellin Atemkeng\.Saliency\-driven explainable deep learning in medical imaging: Bridging visual explainability and statistical quantitative analysis\.17\(1\):18\.
- Cao et al\. \[2021\]Hu Cao, Yueyue Wang, Joy Chen, Dongsheng Jiang, Xiaopeng Zhang, Qi Tian, and Manning Wang\.Swin\-unet: Unet\-like pure transformer for medical image segmentation, 2021\.
- Chen et al\. \[2021\]Jieneng Chen, Yongyi Lu, Qihang Yu, Xiangde Luo, Ehsan Adeli, Yan Wang, Le Lu, Alan L\. Yuille, and Yuyin Zhou\.Transunet: Transformers make strong encoders for medical image segmentation, 2021\.
- Cho et al\. \[2024\]Yeongjae Cho, Taehee Kim, Heejun Shin, Sungzoon Cho, and Dongmyung Shin\.Pretraining vision\-language model for difference visual question answering in longitudinal chest x\-rays, 2024\.
- Gao et al\. \[2020\]Guangshuai Gao, Wenting Zhao, Qingjie Liu, and Yunhong Wang\.Co\-saliency detection with co\-attention fully convolutional network, 2020\.
- Goldberger et al\. \[2000\]Ary L\. Goldberger, Luis A\. N\. Amaral, Leon Glass, Jeffrey M\. Hausdorff, Plamen Ch\. Ivanov, Roger G\. Mark, Joseph E\. Mietus, George B\. Moody, Chung\-Kang Peng, and H\. Eugene Stanley\.Physiobank, physiotoolkit, and physionet\.*Circulation*, 101\(23\):e215–e220, 2000\.
- Hatamizadeh et al\. \[2021\]Ali Hatamizadeh, Yucheng Tang, Vishwesh Nath, Dong Yang, Andriy Myronenko, Bennett Landman, Holger Roth, and Daguang Xu\.Unetr: Transformers for 3d medical image segmentation, 2021\.
- \[9\]Xinyue Hu, Lin Gu, Qiyuan An, Mengliang Zhang, liangchen liu, Kazuma Kobayashi, Tatsuya Harada, Ronald Summers, and Yingying Zhu\.Medical\-Diff\-VQA: A Large\-Scale Medical Dataset for Difference Visual Question Answering on Chest X\-Ray Images\.
- Hu et al\. \[2023\]Xinyue Hu, Lin Gu, Qiyuan An, Mengliang Zhang, Liangchen Liu, Kazuma Kobayashi, Tatsuya Harada, Ronald M\. Summers, and Yingying Zhu\.Expert knowledge\-aware image difference graph representation learning for difference\-aware medical visual question answering\.In*Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining*, page 4156–4165, New York, NY, USA, 2023\. Association for Computing Machinery\.
- Huang et al\. \[2021\]Shih\-Cheng Huang, Liyue Shen, Matthew P\. Lungren, and Serena Yeung\.Gloria: A multimodal global\-local representation learning framework for label\-efficient medical image recognition\.In*2021 IEEE/CVF International Conference on Computer Vision \(ICCV\)*, pages 3922–3931, 2021\.
- \[12\]Alaa Hussien, Abdelkareem Elkhateb, Mai Saeed, Nourhan M\. Elsabawy, Alaa Ebraheem Elnakeeb, and Nora Elrashidy\.Explainable self\-supervised learning for medical image diagnosis based on DINO V2 model and semantic search\.15\(1\):32174\.
- \[13\]Fabian Isensee, Paul F\. Jaeger, Simon A\. A\. Kohl, Jens Petersen, and Klaus H\. Maier\-Hein\.nnU\-Net: A self\-configuring method for deep learning\-based biomedical image segmentation\.18\(2\):203–211\.
- Jaderberg et al\. \[2016\]Max Jaderberg, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu\.Spatial transformer networks, 2016\.
- Jin et al\. \[2021\]Weina Jin, Xiaoxiao Li, and Ghassan Hamarneh\.One map does not fit all: Evaluating saliency map explanation on multi\-modal medical images, 2021\.
- \[16\]Alistair E\. W\. Johnson, Tom J\. Pollard, Seth J\. Berkowitz, Nathaniel R\. Greenbaum, Matthew P\. Lungren, Chih\-ying Deng, Roger G\. Mark, and Steven Horng\.MIMIC\-CXR, a de\-identified publicly available database of chest radiographs with free\-text reports\.6\(1\):317\.
- Johnson et al\. \[2019\]Alistair E\. W\. Johnson, Tom J\. Pollard, Nathaniel R\. Greenbaum, Matthew P\. Lungren, Chih ying Deng, Yifan Peng, Zhiyong Lu, Roger G\. Mark, Seth J\. Berkowitz, and Steven Horng\.Mimic\-cxr\-jpg, a large publicly available database of labeled chest radiographs, 2019\.
- Lanfredi et al\. \[2023\]Ricardo Bigolin Lanfredi, Ambuj Arora, Trafton Drew, Joyce D\. Schroeder, and Tolga Tasdizen\.Comparing radiologists’ gaze and saliency maps generated by interpretability methods for chest x\-rays, 2023\.
- Li et al\. \[2023\]Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao\.Llava\-med: Training a large language\-and\-vision assistant for biomedicine in one day, 2023\.
- Li et al\. \[2025\]Yuheng Li, Yizhou Wu, Yuxiang Lai, Mingzhe Hu, and Xiaofeng Yang\.Meddinov3: How to adapt vision foundation models for medical image segmentation?, 2025\.
- Lin \[2004\]Chin\-Yew Lin\.ROUGE: A package for automatic evaluation of summaries\.In*Text Summarization Branches Out*, pages 74–81, Barcelona, Spain, 2004\. Association for Computational Linguistics\.
- \[22\]Zhihong Lin, Donghao Zhang, Qingyi Tao, Danli Shi, Gholamreza Haffari, Qi Wu, Mingguang He, and Zongyuan Ge\.Medical visual question answering: A survey\.143:102611\.
- Liu et al\. \[2021\]Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo\.Swin transformer: Hierarchical vision transformer using shifted windows\.*CoRR*, abs/2103\.14030, 2021\.
- Lou et al\. \[2025\]Jianxun Lou, Huasheng Wang, Xinbo Wu, John Cho Hui Ng, Richard White, Kaveri A\. Thakoor, Padraig Corcoran, Ying Chen, and Hantao Liu\.Chest x\-ray visual saliency modeling: Eye\-tracking dataset and saliency prediction model\.*IEEE Transactions on Neural Networks and Learning Systems*, 36\(9\):16920–16930, 2025\.
- Lu et al\. \[2024\]Zilin Lu, Yutong Xie, Qingjie Zeng, Mengkang Lu, Qi Wu, and Yong Xia\.Spot the Difference: Difference Visual Question Answering with Residual Alignment \.In*proceedings of Medical Image Computing and Computer Assisted Intervention – MICCAI 2024*\. Springer Nature Switzerland, 2024\.
- \[26\]Jun Ma, Yuting He, Feifei Li, Lin Han, Chenyu You, and Bo Wang\.Segment anything in medical images\.15\(1\):654\.
- Marhuenda et al\. \[2025\]Luis\-Jesus Marhuenda, Miquel Obrador\-Reina, Mohamed Aas\-Alas, Alberto Albiol, and Roberto Paredes\.Unveiling differences: A vision encoder\-decoder model for difference medical visual question answering\.In*Medical Imaging with Deep Learning*, 2025\.
- Oh et al\. \[2019\]Dong Yul Oh, Jihang Kim, and Kyong Joon Lee\.Longitudinal change detection on chest x\-rays using geometric correlation maps\.page 748–756, Berlin, Heidelberg, 2019\. Springer\-Verlag\.
- Oquab et al\. \[2024\]Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El\-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po\-Yao Huang, Shang\-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, and Piotr Bojanowski\.Dinov2: Learning robust visual features without supervision, 2024\.
- Papineni et al\. \[2002\]Kishore Papineni, Salim Roukos, Todd Ward, and Wei\-Jing Zhu\.Bleu: a method for automatic evaluation of machine translation\.In*Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics*, pages 311–318, Philadelphia, Pennsylvania, USA, 2002\. Association for Computational Linguistics\.
- Pérez\-García et al\. \[2025\]Fernando Pérez\-García, Harshita Sharma, Sam Bond\-Taylor, Kenza Bouzid, Valentina Salvatelli, Maximilian Ilse, Shruthi Bannur, Daniel C\. Castro, Anton Schwaighofer, Matthew P\. Lungren, Maria Teodora Wetscherek, Noel Codella, Stephanie L\. Hyland, Javier Alvarez\-Valle, and Ozan Oktay\.Exploring scalable medical image encoders beyond text supervision\.*Nature Machine Intelligence*, 2025\.
- Radford et al\. \[2019\]Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever\.Language models are unsupervised multitask learners\.2019\.
- Ravi et al\. \[2024\]Nikhila Ravi, Valentin Gabeur, Yuan\-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao\-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer\.Sam 2: Segment anything in images and videos\.*arXiv preprint arXiv:2408\.00714*, 2024\.
- Ridnik et al\. \[2021\]Tal Ridnik, Emanuel Ben\-Baruch, Asaf Noy, and Lihi Zelnik\.Imagenet\-21k pretraining for the masses\.In*Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks*, 2021\.
- \[35\]Adriel Saporta, Xiaotong Gui, Ashwin Agrawal, Anuj Pareek, Steven Q\. H\. Truong, Chanh D\. T\. Nguyen, Van\-Doan Ngo, Jayne Seekins, Francis G\. Blankenberg, Andrew Y\. Ng, Matthew P\. Lungren, and Pranav Rajpurkar\.Benchmarking saliency methods for chest X\-ray interpretation\.4\(10\):867–878\.
- Scholz et al\. \[2025\]Daniel Scholz, Ayhan Can Erdur, Viktoria Ehm, Anke Meyer\-Baese, Jan C\. Peeken, Daniel Rueckert, and Benedikt Wiestler\.MM\-DINOv2: Adapting Foundation Models for Multi\-Modal Medical Image Analysis \.In*proceedings of Medical Image Computing and Computer Assisted Intervention – MICCAI 2025*\. Springer Nature Switzerland, 2025\.
- Selvaraju et al\. \[2019\]Ramprasaath R\. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra\.Grad\-cam: Visual explanations from deep networks via gradient\-based localization\.*International Journal of Computer Vision*, 128\(2\):336–359, 2019\.
- Siméoni et al\. \[2025\]Oriane Siméoni, Huy V\. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Julien Mairal, Hervé Jégou, Patrick Labatut, and Piotr Bojanowski\.DINOv3, 2025\.
- Song et al\. \[2024\]Xinrui Song, Xuanang Xu, and Pingkun Yan\.General purpose image encoder dinov2 for medical image registration, 2024\.
- Vaswani et al\. \[2023\]Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N\. Gomez, Lukasz Kaiser, and Illia Polosukhin\.Attention is all you need, 2023\.
- Vedantam et al\. \[2015\]Ramakrishna Vedantam, C\. Lawrence Zitnick, and Devi Parikh\.Cider: Consensus\-based image description evaluation, 2015\.
- Wang et al\. \[2022\]Qizao Wang, Xuelin Qian, Yanwei Fu, and Xiangyang Xue\.Co\-attention aligned mutual cross\-attention for cloth\-changing person re\-identification\.page 351–368, Berlin, Heidelberg, 2022\. Springer\-Verlag\.
- Wolf et al\. \[2020\]Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush\.Transformers: State\-of\-the\-art natural language processing\.In*Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations*, pages 38–45, Online, 2020\. Association for Computational Linguistics\.
- Wollek et al\. \[2023\]Alessandro Wollek, Robert Graf, Saša Čečatka, Nicola Fink, Theresa Willem, Bastian O\. Sabel, and Tobias Lasser\.Attention\-based saliency maps improve interpretability of pneumothorax classification\.*Radiology: Artificial Intelligence*, 5\(2\):e220187, 2023\.PMID: 37035429\.
- Yang et al\. \[2025\]Sicheng Yang, Hongqiu Wang, Zhaohu Xing, Sixiang Chen, and Lei Zhu\.Segdino: An efficient design for medical and natural image segmentation with dino\-v3, 2025\.
- Yao et al\. \[2018\]Ting Yao, Yingwei Pan, Yehao Li, and Tao Mei\.Exploring visual relationship for image captioning, 2018\.
- Yue Qiu and Satoh \[2021\]Kodai Nakashima Ryota Suzuki Kenji Iwata Hirokatsu Kataoka Yue Qiu, Shintaro Yamamoto and Yutaka Satoh\.Describing and localizing multiple changes with transformers, 2021\.
- Yung et al\. \[2024\]Ka\-Wai Yung, Jayaram Sivaraj, Danail Stoyanov, Stavros Loukogeorgakis, and Evangelos B\. Mazomenos\.Region\-Specific Retrieval Augmentation for Longitudinal Visual Question Answering: A Mix\-and\-Match Paradigm \.In*proceedings of Medical Image Computing and Computer Assisted Intervention – MICCAI 2024*\. Springer Nature Switzerland, 2024\.
- \[49\]Juan Manuel Zambrano Chaves, Shih\-Cheng Huang, Yanbo Xu, Hanwen Xu, Naoto Usuyama, Sheng Zhang, Fei Wang, Yujia Xie, Mahmoud Khademi, Ziyi Yang, Hany Awadalla, Julia Gong, Houdong Hu, Jianwei Yang, Chunyuan Li, Jianfeng Gao, Yu Gu, Cliff Wong, Mu Wei, Tristan Naumann, Muhao Chen, Matthew P\. Lungren, Akshay Chaudhari, Serena Yeung\-Levy, Curtis P\. Langlotz, Sheng Wang, and Hoifung Poon\.A clinically accessible small multimodal radiology model and evaluation metric for chest X\-ray findings\.16\(1\):3108\.
- Zhang et al\. \[2024\]Xiaoman Zhang, Chaoyi Wu, Ziheng Zhao, Weixiong Lin, Ya Zhang, Yanfeng Wang, and Weidi Xie\.PMC\-VQA: Visual instruction tuning for medical visual question answering, 2024\.
- Zhu et al\. \[2024\]Jiayuan Zhu, Abdullah Hamdi, Yunli Qi, Yueming Jin, and Junde Wu\.Medical sam 2: Segment medical images as video via segment anything model 2, 2024\.
Attention Consistent Longitudinal Medical Visual Question Answering Guided by Vision Foundation Models

Similar Articles

Self-Evolving Visual Questioner

Brain-IT-VQA: From Brain Signals to Answers

Stateful Visual Encoders for Vision-Language Models

SuperMemory-VQA: An Egocentric Visual Question-Answering Benchmark for Long-Horizon Memory

Aloe-Vision: Robust Vision-Language Models for Healthcare

Submit Feedback

Similar Articles

Self-Evolving Visual Questioner
Brain-IT-VQA: From Brain Signals to Answers
Stateful Visual Encoders for Vision-Language Models
SuperMemory-VQA: An Egocentric Visual Question-Answering Benchmark for Long-Horizon Memory
Aloe-Vision: Robust Vision-Language Models for Healthcare