Tag
Proposes an attention-guided encoder-decoder for longitudinal medical visual question answering, using a frozen DINO-based mask generator and auxiliary losses to improve consistency and interpretability, achieving strong results on the Medical-Diff-VQA benchmark.