REVEAL++: Differentiable Phenotypic Grouping for Vision-Language Retinal Modeling of Alzheimer's Disease Risk
Summary
This paper introduces REVEAL++, a differentiable phenotypic grouping method for vision-language contrastive learning, applied to retinal fundus images and clinical risk narratives for Alzheimer's disease risk prediction, outperforming discrete grouping baselines.
# REVEAL++: Differentiable Phenotypic Grouping for Vision–Language Retinal Modeling of Alzheimer’s Disease Risk
Source: [https://arxiv.org/html/2606.19522](https://arxiv.org/html/2606.19522)
Seowung Leem, Zeyun Zhao, Ruogu Fang (corresponding author: ruogu.fang@ufl.edu)

1 University of Virginia, Charlottesville, VA, USA

2 J. Crayton Pruitt Family Department of Biomedical Engineering, Herbert Wertheim College of Engineering, University of Florida, Gainesville, FL, USA
###### Abstract
The retina offers a noninvasive window into neurodegenerative disease, capturing subtle structural patterns associated with a risk of future cognitive decline\. Vision–language alignment frameworks such as REVEAL have shown that pairing retinal fundus images with structured clinical risk narratives improves early prediction of Alzheimer’s disease \(AD\)\. A key design choice in these approaches is the use of phenotypic grouping, where individuals with similar risk profiles are treated as multi\-positive pairs during contrastive learning\. However, existing methods operationalize phenotypic similarity as a discrete construct, relying on hard group assignments that impose rigid supervision and decouple group formation from representation learning\. We propose a continuous formulation of phenotypic structure within contrastive learning\. Rather than assigning samples to fixed clusters, we model inter\-subject similarity as a differentiable weighting function derived from intra\-modality embedding similarities in both retinal images and risk profiles\. These weights define soft multi\-positive relationships through a continuous aggregation operator, enabling graded supervision that reflects the spectrum nature of disease risk\. We further introduce a soft\-target contrastive objective that jointly learns cross\-modal alignment and phenotypic structure in an end\-to\-end manner\. Evaluated on UK Biobank retinal imaging data for incident AD prediction, the proposed framework consistently outperforms discrete group\-based contrastive learning and standard vision–language baselines\. By treating phenotypic similarity as a learnable, continuous signal rather than a fixed grouping rule, our approach provides a principled and robust foundation for population\-scale neurodegenerative risk modeling from multi\-modal retinal and clinical data\.
## 1 Introduction
Figure 1: Architecture of the proposed differentiable phenotypic weighting framework for group-aware contrastive learning. Image and text embeddings are aligned via a similarity-weighted multi-positive contrastive loss, where continuous phenotypic weights replace hard grouping to model the heterogeneous spectrum of Alzheimer’s disease risk.

Alzheimer’s disease (AD) is a progressive neurodegenerative disorder characterized by a long preclinical phase during which pathological changes accumulate before clinical symptoms\[[6](https://arxiv.org/html/2606.19522#bib.bib1)\]. Advances in brain imaging and plasma-based biomarkers have substantially improved the ability to detect disease-related pathology. However, these approaches may remain costly, invasive, or impractical for large-scale population screening. Complementary modalities that are scalable and noninvasive therefore play an important role in early risk stratification. The retina has emerged as one such modality, as its structure and microvasculature share developmental and physiological links with the central nervous system and are associated with neurodegenerative and vascular processes relevant to AD\[[3](https://arxiv.org/html/2606.19522#bib.bib2)\]. In parallel, systemic and lifestyle-related risk factors, including cardiometabolic health and sleep patterns, capture longitudinal exposures that contribute to dementia risk decades before diagnosis\[[4](https://arxiv.org/html/2606.19522#bib.bib3),[2](https://arxiv.org/html/2606.19522#bib.bib4)\]. Rather than serving as standalone diagnostic markers, these signals offer complementary population-level information that may help characterize early disease susceptibility.
Recent advances in vision–language models (VLMs) have enabled joint representation learning across heterogeneous data modalities through contrastive alignment. Inspired by CLIP-style architectures, medical VLMs have increasingly been adapted to retinal imaging, leveraging large-scale pretraining to learn clinically meaningful visual representations\[[7](https://arxiv.org/html/2606.19522#bib.bib17),[17](https://arxiv.org/html/2606.19522#bib.bib18),[21](https://arxiv.org/html/2606.19522#bib.bib9)\]. Building on this paradigm, the REVEAL framework aligned retinal fundus images with individualized clinical risk narratives derived from structured health data, enabling multi-modal modeling of neurodegenerative risk\[[13](https://arxiv.org/html/2606.19522#bib.bib5)\]. A central innovation of REVEAL was group-aware contrastive learning (GACL), which encouraged subjects with similar phenotypic profiles to act as multi-positive pairs during training. This strategy improved robustness to individual-level noise and promoted learning of shared disease-relevant structure, leading to improved downstream AD risk prediction compared with uni-modal and standard pairwise contrastive approaches.
Despite these advantages, phenotypic grouping in existing GACL formulations is constructed through discrete similarity thresholds, implicitly assuming that individuals belong to well\-separated risk categories\. From a biological perspective, however, neurodegenerative risk evolves along continuous and overlapping trajectories shaped by heterogeneous genetic, vascular, metabolic, and lifestyle factors\. Individuals often exhibit partial similarity across multiple phenotypic axes rather than membership in a single homogeneous group\. Hard group assignments may therefore introduce artificial boundaries that fail to reflect the graded and spectrum\-like nature of disease vulnerability, while preventing the grouping process itself from adapting during representation learning\.
In this work, we introduce a differentiable phenotypic weighting framework that treats inter\-subject similarity as a continuous supervisory signal within multi\-modal contrastive learning\. Rather than relying on threshold\-based clustering, similarity structures are computed directly from retinal image embeddings and clinical risk\-profile embeddings and combined through a soft aggregation operator to produce continuous group\-membership weights\. These weights define a soft multi\-positive contrastive objective in which supervision strength varies smoothly according to phenotypic proximity\. By modeling phenotypic relationships as a differentiable attention\-like process, the proposed framework enables joint learning of representation alignment and population\-level structure, more faithfully capturing the continuous and heterogeneous biological variability underlying neurodegenerative risk\.
Our contributions are three\-fold:
- Differentiable phenotypic weighting: We replace hard threshold-based grouping in group-aware contrastive learning with continuous phenotypic similarity weights derived from retinal and clinical embeddings, enabling smooth data-driven cohort modeling that better captures heterogeneous Alzheimer’s disease risk.
- Soft multi-positive contrastive learning: We introduce a soft-target contrastive objective that incorporates phenotypic similarity into cross-modal alignment, enabling graded multi-positive supervision instead of binary pair assignments.
- State-of-the-art Alzheimer’s risk prediction from retinal imaging: We achieve new state-of-the-art performance on UK Biobank retinal imaging for incident Alzheimer’s disease prediction, outperforming existing vision–language and group-aware contrastive learning methods.
## 2 Methods
### 2.1 Overview of REVEAL++
REVEAL++ learns joint image–text representations of retinal fundus images and structured clinical reports under a group-aware contrastive objective. Given a minibatch of $N$ subjects, each subject $p$ is associated with a retinal image and a clinical report. Image and text encoders produce modality-specific embeddings, which are projected into a shared latent space. To incorporate phenotypic structure into contrastive supervision, we compute intra-modality similarity matrices that capture retinal-image similarity and risk-profile similarity between subjects. These similarities are transformed into a differentiable phenotypic weighting mask $W\in[0,1]^{N\times N}$, which acts as a soft pairwise target matrix in a multi-positive contrastive loss.
### 2.2 Clinical Report Generation
To enable alignment between retinal images and systemic risk factors within a vision–language framework, structured questionnaire data were converted into synthetic clinical narratives compatible with pretrained text encoders\. Using the LLaMA\-3\.1 API as the text generation engine, each participant’s tabular risk\-factor profile was mapped into a standardized clinical\-style summary\[[9](https://arxiv.org/html/2606.19522#bib.bib7)\]\. For each subject, the LLM received a predefined documentation template, the subject’s structured demographic, behavioral, cognitive, and lifestyle variables, and explicit instructions to generate a concise report without inferring missing values\. The template was adapted from the “Patient Information” section of the CARE clinical case reporting guidelines, ensuring consistency with established medical documentation conventions\[[8](https://arxiv.org/html/2606.19522#bib.bib8)\]\. To minimize variability and preserve numerical fidelity, the prompt enforced a one\-to\-one mapping between tabular entries and template fields, with unavailable values explicitly marked rather than imputed\. This controlled translation process produces semantically enriched text representations that enable structured health information to be embedded within a shared multi\-modal latent space\.
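The one-to-one mapping constraint described above can be illustrated with a minimal sketch. It covers only the deterministic step of pairing tabular entries with template fields and marking unavailable values; it does not call an LLM. The field names, the `Not available` marker, and the `fill_template` helper are hypothetical, since the actual template fields are not listed in this section.

```python
# Hypothetical field names: the real template fields are not given in the text.
TEMPLATE_FIELDS = ["age", "sex", "smoking_status", "sleep_duration_hours"]
MISSING = "Not available"  # assumed marker for unavailable values


def fill_template(record, fields=TEMPLATE_FIELDS, missing=MISSING):
    """Map each tabular entry to exactly one template line, never imputing."""
    lines = []
    for name in fields:
        value = record.get(name)
        # None and empty strings count as unavailable; a legitimate 0 is kept.
        text = missing if value is None or value == "" else str(value)
        label = name.replace("_", " ").capitalize()
        lines.append(f"{label}: {text}")
    return "\n".join(lines)


example = {"age": 63, "sex": "Female", "sleep_duration_hours": 7.5}
print(fill_template(example))
```

In this sketch every template field appears exactly once in the output, so a missing questionnaire item stays visible to the text generator instead of being silently dropped or guessed.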
### 2.3 Image and Text Encoders
Let $\mathbf{x}_{p}$ denote the retinal image and $\mathbf{t}_{p}$ the associated clinical report for subject $p$. The image encoder $E_{I}(\cdot)$ and text encoder $E_{T}(\cdot)$ produce modality-specific embeddings. In our implementation, we instantiate $E_{I}$ as RETFound\[[21](https://arxiv.org/html/2606.19522#bib.bib9)\] and $E_{T}$ as GatorTron\[[19](https://arxiv.org/html/2606.19522#bib.bib10)\]. Each encoder is followed by a lightweight linear projection layer to map features into a shared embedding space of dimension $d$.

$$\mathbf{z}^{I}_{p}=E_{I}(\mathbf{x}_{p}),\qquad \mathbf{z}^{T}_{p}=E_{T}(\mathbf{t}_{p}). \tag{1}$$

Both embeddings are projected into a shared $d$-dimensional space and $\ell_{2}$-normalized, yielding $\hat{\mathbf{z}}^{I}_{p}=\mathbf{z}^{I}_{p}/\lVert\mathbf{z}^{I}_{p}\rVert_{2}$ and $\hat{\mathbf{z}}^{T}_{p}=\mathbf{z}^{T}_{p}/\lVert\mathbf{z}^{T}_{p}\rVert_{2}$. A learnable logit scale parameter $s$ controls the contrastive temperature, with $\tau=\exp(-s)$.
### 2.4 Intra-Modality Similarity for Phenotypic Grouping
To capture phenotypic similarity between subjects, we construct intra-modality similarity matrices based on cosine similarity between normalized representations. For subjects $p, q$, let $\hat{\mathbf{z}}^{I}_{p}$ denote the normalized image embedding produced by the image encoder, and let $\hat{\mathbf{z}}^{T}_{p}$ denote the normalized text-derived risk-profile embedding.

$$S_{ii}(p,q)=\langle\hat{\mathbf{z}}^{I}_{p},\hat{\mathbf{z}}^{I}_{q}\rangle \tag{2}$$

$$S_{tt}(p,q)=\langle\hat{\mathbf{z}}^{T}_{p},\hat{\mathbf{z}}^{T}_{q}\rangle \tag{3}$$

Here, $S_{ii}$ captures similarity in the learned retinal image embeddings, while $S_{tt}$ captures similarity between clinical report embeddings.
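As a minimal sketch of Eqs. (2) and (3), the following pure-Python code normalizes a small batch of embeddings and computes the pairwise cosine similarity matrix. The toy two-dimensional embeddings and the helper names `l2_normalize` and `cosine_similarity_matrix` are illustrative assumptions; a real implementation would typically use batched tensor operations on the projected encoder outputs.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean norm (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity_matrix(embeddings):
    """Return S with S[p][q] = <z_hat_p, z_hat_q> for all subject pairs."""
    z = [l2_normalize(v) for v in embeddings]
    return [[sum(a * b for a, b in zip(zp, zq)) for zq in z] for zp in z]


# Toy image embeddings for three subjects (illustrative values only).
image_embeddings = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
S_ii = cosine_similarity_matrix(image_embeddings)
for row in S_ii:
    print([round(value, 4) for value in row])
```

The same function applied to the text embeddings would give $S_{tt}$. Both matrices are symmetric with ones on the diagonal, because each subject is maximally similar to itself.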
### 2.5 Differentiable Phenotypic Weighting
We transform these similarities into soft membership signals using sigmoid gating with thresholds $\tau_{F},\tau_{T}$ and learnable sharpness parameters $g_{F},g_{T}$:

$$a_{F}(p,q)=\sigma\!\left(\frac{S_{ii}(p,q)-\tau_{F}}{g_{F}}\right),\qquad a_{T}(p,q)=\sigma\!\left(\frac{S_{tt}(p,q)-\tau_{T}}{g_{T}}\right), \tag{4}$$

where $\sigma(\cdot)$ denotes the logistic sigmoid function. Finally, we combine the two signals using a differentiable probabilistic union operator to obtain the phenotypic weighting score

$$W_{pq}=1-\bigl(1-a_{F}(p,q)\bigr)\bigl(1-a_{T}(p,q)\bigr),\qquad W_{pq}\in[0,1]. \tag{5}$$

Pairs with larger $W_{pq}$ are treated as more strongly aligned in phenotype space and receive higher weight as positives in the multi-positive contrastive objective.
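The gating and union steps of Eqs. (4) and (5) can be sketched as follows. The threshold and sharpness values in the example are placeholders chosen for illustration, not values reported in the paper, and the function names are assumptions.

```python
import math


def sigmoid(x):
    """Numerically stable logistic sigmoid."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def phenotypic_weights(S_ii, S_tt, tau_F, tau_T, g_F, g_T):
    """Combine image and text similarities into soft weights W in [0, 1]."""
    n = len(S_ii)
    W = [[0.0] * n for _ in range(n)]
    for p in range(n):
        for q in range(n):
            a_F = sigmoid((S_ii[p][q] - tau_F) / g_F)  # image gate, Eq. (4)
            a_T = sigmoid((S_tt[p][q] - tau_T) / g_T)  # text gate, Eq. (4)
            W[p][q] = 1.0 - (1.0 - a_F) * (1.0 - a_T)  # probabilistic union, Eq. (5)
    return W


# Placeholder similarities and hyperparameters (illustrative only).
S_ii = [[1.0, 0.5], [0.5, 1.0]]
S_tt = [[1.0, 0.9], [0.9, 1.0]]
W = phenotypic_weights(S_ii, S_tt, tau_F=0.7, tau_T=0.7, g_F=0.1, g_T=0.1)
print([[round(w, 3) for w in row] for row in W])
```

Because the union satisfies $W_{pq}\ge\max(a_F, a_T)$, a pair counts as a strong positive when it is similar in either modality. In the example, the off-diagonal pair is dissimilar in image space but similar in text space and still receives a weight close to 0.9.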
### 2.6 Phenotypic Similarity-Weighted Multi-Positive Contrastive Loss
Cross-modal similarity between image and text embeddings is defined as:

$$S_{it}(p,q)=\left\langle\hat{\mathbf{z}}^{I}_{p},\hat{\mathbf{z}}^{T}_{q}\right\rangle. \tag{6}$$

Logits are computed using temperature scaling with a learnable logit scale parameter $s$ and a learnable bias term $\beta$:

$$\ell_{pq}=\frac{S_{it}(p,q)}{\tau}-\beta,\qquad \tau=\exp(-s). \tag{7}$$

We optimize a soft-target multi-positive contrastive objective:

$$\mathcal{L}_{\mathrm{MP}}=\frac{1}{N^{2}}\sum_{p=1}^{N}\sum_{q=1}^{N}\left[W_{pq}\,\log\!\bigl(1+\exp(-\ell_{pq})\bigr)+\bigl(1-W_{pq}\bigr)\,\log\!\bigl(1+\exp(\ell_{pq})\bigr)\right]. \tag{8}$$

When $W_{pq}$ approaches 1, the pair $(p,q)$ is treated as a positive match; when $W_{pq}$ approaches 0, it is treated as a negative pair. Intermediate values allow soft supervision based on phenotypic similarity.
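A minimal sketch of Eqs. (7) and (8) in pure Python follows. It treats $W$ as a fixed target matrix for clarity, whereas the framework described here also lets gradients flow through $W$. The inputs and the function names are illustrative assumptions.

```python
import math


def softplus(x):
    """Stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def soft_multi_positive_loss(S_it, W, s, beta):
    """Soft-target sigmoid contrastive loss averaged over all N*N pairs."""
    tau = math.exp(-s)  # temperature from the learnable scale parameter s
    n = len(S_it)
    total = 0.0
    for p in range(n):
        for q in range(n):
            logit = S_it[p][q] / tau - beta  # Eq. (7)
            # Eq. (8): W weights the positive term, (1 - W) the negative term.
            total += W[p][q] * softplus(-logit) + (1.0 - W[p][q]) * softplus(logit)
    return total / (n * n)


# Toy example: perfectly aligned pairs with identity targets (illustrative values).
identity = [[1.0, 0.0], [0.0, 1.0]]
loss_matched = soft_multi_positive_loss(identity, identity, s=math.log(10.0), beta=5.0)
print(round(loss_matched, 4))
```

With $W$ held fixed, the gradient of each summand with respect to its logit is $(\sigma(\ell_{pq})-W_{pq})/N^{2}$, so the loss pulls the predicted match probability $\sigma(\ell_{pq})$ toward the soft target $W_{pq}$.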
## 3 Experiments
### 3.1 Dataset and Preprocessing
Table 1: Cohort characteristics across data splits.

A comprehensive set of demographic, behavioral, cognitive, and lifestyle variables was extracted from the UK Biobank\[[5](https://arxiv.org/html/2606.19522#bib.bib6)\] baseline assessment and compiled as candidate risk factors based on established epidemiological and biomarker evidence linking modifiable exposures to Alzheimer’s disease and dementia risk\[[14](https://arxiv.org/html/2606.19522#bib.bib13),[2](https://arxiv.org/html/2606.19522#bib.bib4),[18](https://arxiv.org/html/2606.19522#bib.bib12),[10](https://arxiv.org/html/2606.19522#bib.bib14),[11](https://arxiv.org/html/2606.19522#bib.bib15),[16](https://arxiv.org/html/2606.19522#bib.bib16)\]. These include factors associated with amyloid and tau pathology, sleep disturbance, cardiometabolic health, and other modifiable determinants of neurodegeneration.
Color fundus photographs \(CFPs\) from the initial UK Biobank assessment visit were used for image\-based modeling\. Images underwent automated quality control to exclude low\-quality scans, retaining only high\-quality CFPs for downstream analysis\[[22](https://arxiv.org/html/2606.19522#bib.bib23)\]\. Preprocessed CFPs are input into a RETFound\-initialized vision encoder, which is fine\-tuned during training\[[21](https://arxiv.org/html/2606.19522#bib.bib9)\]\. Each image was resized to match the input resolution of the pretrained RETFound encoder and normalized using standard channel\-wise mean and standard deviation values consistent with its pretraining setup\. To ensure consistent anatomical orientation across subjects, right\-eye images were horizontally flipped prior to encoding\.
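The orientation and normalization steps can be sketched on a tiny image stored as nested lists of RGB tuples. Resizing is omitted, and the channel statistics below are the common ImageNet values, which is an assumption about the pretraining setup and is not stated in this section. The function names are illustrative.

```python
# Assumed channel statistics (ImageNet defaults); not confirmed by the text.
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)


def flip_horizontal(image):
    """Mirror an image left to right; image is a list of rows of (R, G, B) tuples."""
    return [list(reversed(row)) for row in image]


def normalize(image, mean=MEAN, std=STD):
    """Apply channel-wise (value - mean) / std to pixels scaled to [0, 1]."""
    return [
        [tuple((c - m) / s for c, m, s in zip(pixel, mean, std)) for pixel in row]
        for row in image
    ]


def preprocess(image, eye):
    """Flip right-eye images so both eyes share one orientation, then normalize."""
    if eye == "right":
        image = flip_horizontal(image)
    return normalize(image)


# A 1x2 toy image: a red pixel followed by a green pixel.
toy = [[(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]]
print(preprocess(toy, eye="right")[0][0])
```

After the flip, the green pixel comes first in the right-eye output, so anatomical landmarks appear on the same side of the image for both eyes.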
### 3.2 Implementation Details
RETFound and GatorTron were used as image and text encoders. The vision encoder is initialized with RETFound weights and fine-tuned end-to-end, while the text encoder is kept frozen. Lightweight linear projections map both modalities into a shared $d=1024$-dimensional space. Embeddings were $\ell_{2}$-normalized prior to similarity computation. The batch size was 128. Optimization used AdamW with hyperparameters selected via Optuna\[[1](https://arxiv.org/html/2606.19522#bib.bib21)\]. The final learning rate was $2.42\times 10^{-4}$, with $\epsilon=8.61\times 10^{-7}$ and weight decay $0.0232$. Phenotypic similarity thresholds were initialized from empirical intra-modality cosine similarity distributions computed on 85% of the development set, restricting the search to the upper interquartile range.
## 4 Results
To evaluate the proposed framework, we compared our method against strong retinal and biomedical foundation models. We included RETFound\[[21](https://arxiv.org/html/2606.19522#bib.bib9)\], a large-scale retinal image foundation model; RET-CLIP\[[7](https://arxiv.org/html/2606.19522#bib.bib17)\], a retinal image–text contrastive pretraining framework; and MM-Retinal\[[17](https://arxiv.org/html/2606.19522#bib.bib18)\], a knowledge-enhanced retinal vision–language model. We additionally evaluated two general biomedical multi-modal foundation models: PMC-CLIP\[[15](https://arxiv.org/html/2606.19522#bib.bib20)\] and BiomedCLIP\[[20](https://arxiv.org/html/2606.19522#bib.bib19)\]. Because RETFound is an image-only encoder, we paired it with GatorTron\[[19](https://arxiv.org/html/2606.19522#bib.bib10)\] to construct multi-modal representations, concatenating image and text embeddings for downstream classification. In addition to vision–language baselines, we trained tabular SVM models using structured clinical variables and CFP-derived image features to assess whether performance gains stem from semantic narrative modeling or solely from image foundation representations. All methods were evaluated under an identical multi-modal SVM protocol. Each experiment was repeated across 10 random seeds, and we report mean $\pm$ standard deviation performance.

Table 2: Comparison of multi-modal and baseline methods for incident AD prediction. Results are reported as mean $\pm$ standard deviation across folds. Bold and underline represent the best and the second-best results.

In the incident AD prediction task (Table [2](https://arxiv.org/html/2606.19522#S4.T2)), our phenotypic-weighted multi-positive contrastive framework consistently outperformed all comparison methods, indicating that soft, differentiable phenotypic alignment leads to more coherent multi-modal representations. Rather than relying on single positive pairs or hard grouping, the proposed formulation allows subjects with similar risk profiles to contribute proportionally during training, yielding stronger downstream discrimination. While pretrained vision–language baselines such as RETFound+GatorTron and RET-CLIP capture meaningful retinal–text correspondences, they do not explicitly model phenotypic structure, which may be important for long-horizon neurodegenerative risk prediction.
These findings suggest that modeling phenotypic similarity as a continuous, differentiable weighting mechanism enables smoother transitions between positive and negative supervision, leading to more coherent multi\-modal embedding spaces and improved long\-horizon neurodegenerative risk prediction\.
## 5 Discussion
Alzheimer’s disease is increasingly understood not as a binary condition but as a long-term neurodegenerative process that evolves over years prior to diagnosis. Pathological changes including amyloid deposition, tau accumulation, vascular dysfunction, and systemic metabolic dysregulation emerge progressively and interact across multiple biological scales before cognitive symptoms become apparent\[[6](https://arxiv.org/html/2606.19522#bib.bib1),[16](https://arxiv.org/html/2606.19522#bib.bib16)\]. Retinal microvascular alterations and structural remodeling likewise develop along a continuum, reflecting cumulative exposure to systemic and neurodegenerative risk factors. Therefore, similarity between individuals along disease-relevant dimensions is continuous rather than discretely separable.
Hard similarity thresholds impose artificial boundaries on this biological continuum by assigning subjects to fixed phenotypic groups. While such grouping can strengthen contrastive supervision, it implicitly assumes well-defined subtype partitions. Such discrete boundaries may not exist during the preclinical stages of neurodegenerative disease, where pathological processes evolve gradually and heterogeneously across individuals. This mismatch can limit the ability of representation learning methods to capture subtle transitions in risk states. In contrast, the proposed differentiable phenotypic weighting mechanism allows phenotypic similarity to modulate supervision strength continuously. Participants with partially overlapping risk profiles or subtly similar retinal signatures contribute proportionally during training, enabling smoother organization of the shared embedding space while preserving meaningful inter-subject variation.
This formulation more closely reflects the pathophysiology of preclinical AD, where risk accumulates gradually and manifests heterogeneously across individuals\[[12](https://arxiv.org/html/2606.19522#bib.bib22)\]\. By relaxing discrete grouping into continuous supervision, the model is able to represent intermediate phenotypic states that may correspond to early pathological changes\. The resulting embedding geometry reflects a continuum of risk rather than rigid clusters, providing a representation that is both biologically plausible and better suited for early risk stratification\.
## 6 Conclusion
We presented REVEAL\+\+, a differentiable phenotypic alignment framework for multi\-modal learning from retinal imaging and clinical risk narratives in preclinical Alzheimer’s disease prediction\. By replacing discrete threshold\-based grouping with continuous similarity\-driven supervision, the proposed approach enables phenotypic relationships to be learned jointly with representation alignment, allowing population structure to emerge directly from data\. This formulation better captures the gradual and heterogeneous nature of neurodegenerative disease progression and leads to improved risk prediction performance\. More broadly, differentiable phenotypic alignment offers a strategy for modeling structured variability in multi\-modal biomedical data, with potential applications spanning chronic disease risk prediction, precision medicine, longitudinal health modeling, and large\-scale population health analytics across diverse clinical domains\.
## References
- \[1\] T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama (2019). Optuna: a next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD ’19), New York, NY, USA, pp. 2623–2631. [Document](https://dx.doi.org/10.1145/3292500.3330701)
- \[2\] M. Aktan Süzgün, Q. Tang, and A. Stefani (2025). Sleep abnormalities and risk of Alzheimer’s disease. Current Neurology and Neuroscience Reports 25(1), pp. 67. [Document](https://dx.doi.org/10.1007/s11910-025-01451-5)
- \[3\] H. Banna, M. Slayo, J. Armitage, B. Del Rosal, L. Vocale, and S. Spencer (2024). Imaging the eye as a window to brain health: frontier approaches and future directions. Journal of Neuroinflammation 21(1), pp. 309. [Document](https://dx.doi.org/10.1186/s12974-024-03304-3)
- \[4\] C. Bueno Lopez, A. Iona, D. Avery, I. Turnbull, L. Yang, H. Du, Y. Chen, N. Zhang, J. Chen, P. Pei, J. Lv, C. Yu, D. Sun, L. Li, D. Bennett, C. van Duijn, R. Clarke, Z. Chen, and F. Bragg (2025). Cardiometabolic health and risk of dementia and brain atrophy: a community-based prospective cohort study of 0.5 million adults in China. The Lancet Regional Health – Western Pacific 64, pp. 101743. [Document](https://dx.doi.org/10.1016/j.lanwpc.2025.101743)
- \[5\] C. Bycroft, C. Freeman, D. Petkova, G. Band, L. T. Elliott, K. Sharp, A. Motyer, D. Vukcevic, O. Delaneau, J. O’Connell, A. Cortes, S. Welsh, A. Young, M. Effingham, G. McVean, S. Leslie, N. Allen, P. Donnelly, and J. Marchini (2018). The UK Biobank resource with deep phenotyping and genomic data. Nature 562(7726), pp. 203–209. [Document](https://dx.doi.org/10.1038/s41586-018-0579-z)
- \[6\] K. H. Chow and T. Abel (2026). Neurodevelopmental origins of age-related neurodegenerative diseases. eBioMedicine 124, pp. 106151. [Document](https://dx.doi.org/10.1016/j.ebiom.2026.106151)
- \[7\] J. Du, J. Guo, W. Zhang, S. Yang, H. Liu, H. Li, and N. Wang (2024). RET-CLIP: a retinal image foundation model pre-trained with clinical diagnostic reports. arXiv preprint arXiv:2405.14137.
- \[8\] J. J. Gagnier, G. Kienle, D. G. Altman, D. Moher, H. Sox, D. Riley, and the CARE Group (2013-09). The CARE guidelines: consensus-based clinical case reporting guideline development. Global Advances in Health and Medicine 2(5), pp. 38–43. [Document](https://dx.doi.org/10.7453/gahmj.2013.008)
- \[9\] A. Grattafiori, A. Dubey, A. Jauhri, et al. (2024). The Llama 3 herd of models. arXiv preprint arXiv:2407.21783. [Link](https://arxiv.org/abs/2407.21783)
- \[10\] K. M. Hayden, M. M. Mielke, J. K. Evans, R. Neiberg, D. Molina-Henry, M. Culkin, S. Marcovina, K. C. Johnson, O. T. Carmichael, S. R. Rapp, B. C. Sachs, J. Ding, H. Shappell, L. Wagenknecht, J. A. Luchsinger, and M. A. Espeland (2024-01). Association between modifiable risk factors and levels of blood-based biomarkers of Alzheimer’s and related dementias in the Look AHEAD cohort. JAR Life 13, pp. 1–21. [Document](https://dx.doi.org/10.14283/jarlife.2024.1)
- \[11\] Z. Huszár, A. Solomon, M. A. Engh, V. Koszovácz, T. Terebessy, Z. Molnár, P. Hegyi, A. Horváth, F. Mangialasche, M. Kivipelto, and G. Csukly (2024-10). Association of modifiable risk factors with progression to dementia in relation to amyloid and tau pathology. Alzheimer’s Research & Therapy 16, pp. 238. [Document](https://dx.doi.org/10.1186/s13195-024-01602-9)
- \[12\] C. R. Jack Jr., D. A. Bennett, K. Blennow, M. C. Carrillo, B. Dunn, S. B. Haeberlein, D. M. Holtzman, W. Jagust, F. Jessen, J. Karlawish, E. Liu, J. L. Molinuevo, T. Montine, C. Phelps, K. P. Rankin, C. C. Rowe, P. Scheltens, E. Siemers, H. M. Snyder, and R. Sperling (2018). NIA-AA research framework: toward a biological definition of Alzheimer’s disease. Alzheimer’s & Dementia 14(4), pp. 535–562. [Document](https://dx.doi.org/10.1016/j.jalz.2018.02.018)
- \[13\] S. Leem, L. Gu, C. You, K. Gong, and R. Fang (2026). REVEAL: multimodal vision–language alignment of retinal morphometry and clinical risks for incident AD and dementia prediction. In Medical Imaging with Deep Learning. Note: Accepted by MIDL 2026. Proceedings of Machine Learning Research (PMLR). [Link](https://openreview.net/pdf?id=aOKAXRHXVw)
- \[14\] A. I. Leshner, S. Landis, C. Stroud, and A. Downey (Eds.) (2017-09). Preventing cognitive decline and dementia: a way forward. National Academies Press, Washington, DC. [Document](https://dx.doi.org/10.17226/24782)
- \[15\] W. Lin, Z. Zhao, X. Zhang, C. Wu, Y. Zhang, Y. Wang, and W. Xie (2023). PMC-CLIP: contrastive language-image pre-training using biomedical documents. arXiv preprint arXiv:2303.07240.
- \[16\] G. Livingston, J. Huntley, K. Y. Liu, S. G. Costafreda, G. Selbæk, S. Alladi, D. Ames, S. Banerjee, A. Burns, C. Brayne, N. C. Fox, C. P. Ferri, L. N. Gitlin, R. Howard, H. C. Kales, M. Kivimäki, E. B. Larson, N. Nakasujja, K. Rockwood, Q. Samus, K. Shirai, A. Singh-Manoux, L. S. Schneider, S. Walsh, Y. Yao, A. Sommerlad, and N. Mukadam (2024-08). Dementia prevention, intervention, and care: 2024 report of the Lancet standing Commission. The Lancet 404(10452), pp. 572–628. [Document](https://dx.doi.org/10.1016/S0140-6736%2824%2901296-0)
- \[17\] R. Wu, C. Zhang, J. Zhang, Y. Zhou, T. Zhou, and H. Fu (2024). MM-Retinal: knowledge-enhanced foundational pretraining with fundus image-text expertise. arXiv preprint arXiv:2405.11793.
- \[18\] J. Xiong, R. Bhimani, and L. Carney-Anderson (2023-06). Review of risk factors associated with biomarkers for Alzheimer disease. Journal of Neuroscience Nursing 55(3), pp. 103–109. [Document](https://dx.doi.org/10.1097/JNN.0000000000000705)
- \[19\] X. Yang, A. Chen, N. PourNejatian, H. C. Shin, K. E. Smith, C. Parisien, C. Compas, C. Martin, A. B. Costa, M. G. Flores, Y. Zhang, T. Magoc, C. A. Harle, G. Lipori, D. A. Mitchell, W. R. Hogan, E. A. Shenkman, J. Bian, and Y. Wu (2022-12). A large language model for electronic health records. npj Digital Medicine 5(1), pp. 1–9. [Document](https://dx.doi.org/10.1038/s41746-022-00742-2)
- \[20\] S. Zhang, Y. Xu, N. Usuyama, H. Xu, J. Bagga, R. Tinn, S. Preston, R. Rao, M. Wei, N. Valluri, C. Wong, A. Tupini, Y. Wang, M. Mazzola, S. Shukla, L. Liden, J. Gao, A. Crabtree, B. Piening, C. Bifulco, M. P. Lungren, T. Naumann, S. Wang, and H. Poon (2025). BiomedCLIP: a multimodal biomedical foundation model pretrained from scientific image-text pairs. arXiv preprint arXiv:2303.00915.
- \[21\] Y. Zhou, M. A. Chia, S. K. Wagner, M. S. Ayhan, D. J. Williamson, R. R. Struyven, T. Liu, M. Xu, M. G. Lozano, P. Woodward-Court, Y. Kihara, A. Altmann, A. Y. Lee, E. J. Topol, A. K. Denniston, D. C. Alexander, and P. A. Keane (2023-10). A foundation model for generalizable disease detection from retinal images. Nature 622(7981), pp. 156–163. [Document](https://dx.doi.org/10.1038/s41586-023-06555-x)
- \[22\] Y. Zhou, S. K. Wagner, M. A. Chia, A. Zhao, P. Woodward-Court, M. Xu, R. Struyven, D. C. Alexander, and P. A. Keane (2022-07). AutoMorph: automated retinal vascular morphology quantification via a deep learning pipeline. Translational Vision Science & Technology 11(7), pp. 12. ISSN 2164-2591. [Document](https://dx.doi.org/10.1167/tvst.11.7.12), [Link](https://doi.org/10.1167/tvst.11.7.12)