Uncertainty-Aware Longitudinal Forecasting of Alzheimer's Disease Progression Using Deep Learning
Summary
This paper proposes a probabilistic framework for Alzheimer's disease progression forecasting that combines ordinal diagnosis prediction, multi-horizon trajectory generation, and decomposed uncertainty estimation using a Temporal Fusion Transformer encoder and an autoregressive Mixture Density Network. The model outperforms baselines on ADNI data, achieving near-nominal 90% credible interval coverage with clinically meaningful uncertainty signals.
# Uncertainty-Aware Longitudinal Forecasting of Alzheimer’s Disease Progression Using Deep Learning
Source: [https://arxiv.org/html/2606.24604](https://arxiv.org/html/2606.24604)
Authors: Shreyank N. Gowda (shreyank.narayanagowda@nottingham.ac.uk), Anala M. R (analamr@rvce.edu.in)
Affiliations:
1. Department of Computer Science and Engineering, R.V. College of Engineering, Bengaluru, Karnataka, India
2. School of Computer Science, University of Nottingham, Nottingham, United Kingdom
Author contributions (CRediT):
- Conceptualization, Methodology, Software, Formal Analysis, Writing – Original Draft
- Supervision, Writing – Review & Editing
- Supervision, Project Administration (corresponding author)
###### Abstract
Longitudinal modelling of Alzheimer’s disease progression is clinically useful only if it can describe not just the most likely next diagnosis, but how a patient may evolve over time and how reliable that forecast is\. Most deep learning approaches reduce this problem to single\-step classification, treating cognitively normal, mild cognitive impairment, and dementia as flat categories while providing limited insight into how uncertainty accumulates across future visits\. We propose a probabilistic framework that combines ordinal diagnosis prediction, multi\-horizon trajectory generation, and decomposed uncertainty estimation\. A Temporal Fusion Transformer encoder is adapted with a CORAL ordinal output layer, asymmetric loss weighting, and converter oversampling to respect disease\-stage ordering and improve sensitivity to MCI\-to\-dementia transitions\. Conditioned on the learned patient\-context representation, an autoregressive Mixture Density Network generates five\-year probabilistic trajectories for diagnosis state, CDR Sum of Boxes, MMSE orientation, and hippocampal volume\. On ADNI, the model outperforms linear, recurrent, and transformer baselines for next\-visit diagnosis prediction, with the strongest gains on MCI\-versus\-dementia discrimination\. Generated trajectories achieve near\-nominal 90% credible interval coverage, widening uncertainty across the forecast horizon, and biomarker dynamics consistent with expected Alzheimer’s disease progression\. We further separate aleatoric from epistemic uncertainty using analytic mixture variance and a five\-member bootstrap ensemble, which provides the strongest encoder diversity and output\-level epistemic signal\. Epistemic uncertainty is higher for rare progression archetypes, MCI and dementia patients, and under external evaluation on OASIS\-3, where it increases alongside prediction error\. 
These results suggest that probabilistic longitudinal forecasting can provide patient-specific disease trajectories together with clinically meaningful signals of when predictions should be treated with caution. Code: [https://github.com/Arya-Hari/ldpm-ad](https://github.com/Arya-Hari/ldpm-ad)
###### keywords:
Alzheimer’s disease progression; temporal fusion transformer; mixture density networks; ordinal regression; uncertainty decomposition; out-of-distribution generalisation; longitudinal clinical modelling
## 1 Introduction
Alzheimer’s disease (AD) is the leading cause of dementia worldwide, affecting an estimated 55 million people and projected to exceed 139 million by 2050 \[[42](https://arxiv.org/html/2606.24604#bib.bib42)\]. Its clinical course is inherently longitudinal. Progressive neurodegeneration can begin years or even decades before symptoms become clinically apparent \[[17](https://arxiv.org/html/2606.24604#bib.bib17)\], and patients may remain stable, decline slowly, or convert rapidly depending on a complex combination of biological, cognitive, and clinical factors. This temporal nature makes disease progression modelling clinically important. If future trajectories can be characterised reliably from longitudinal clinical and neuroimaging data, they could support trial enrolment, therapeutic timing, and personalised care planning \[[33](https://arxiv.org/html/2606.24604#bib.bib33)\].
However, the central question in AD progression modelling is not simply what diagnosis a patient will receive at the next visit\. A clinically useful model should answer a richer set of questions\. How is the patient likely to evolve over the next several years? How wide is the range of plausible futures? Is the model uncertain because the disease course itself is variable, or because the patient is poorly represented in the training data? These questions are especially important for patients with mild cognitive impairment \(MCI\), whose progression to dementia is heterogeneous and difficult to predict even from rich longitudinal data\[[31](https://arxiv.org/html/2606.24604#bib.bib31)\]\.
Existing computational approaches only partially address this problem\. Classical statistical models, including joint models of longitudinal and time\-to\-event outcomes\[[34](https://arxiv.org/html/2606.24604#bib.bib34)\], event\-based models\[[10](https://arxiv.org/html/2606.24604#bib.bib10)\], and differential equation formulations\[[30](https://arxiv.org/html/2606.24604#bib.bib30)\], offer useful interpretability but often rely on parametric assumptions that may not capture the heterogeneity of individual trajectories\. Deep learning methods have shown strong discriminative performance on structured clinical time series\[[29](https://arxiv.org/html/2606.24604#bib.bib29),[11](https://arxiv.org/html/2606.24604#bib.bib11)\], but many are still formulated as next\-visit classification models, predicting a single future diagnosis label from an observed history\. This formulation limits clinical utility in two ways\. First, it treats cognitively normal \(CN\), MCI, and dementia as flat classes, even though AD progression follows a clinically meaningful ordinal structure\. A CN\-to\-MCI error is not equivalent to a CN\-to\-dementia error, but standard cross\-entropy training does not encode this distinction\. Second, point prediction provides little insight into the uncertainty of future disease trajectories, or into whether a model’s confidence remains reliable when applied to patients from a different cohort\.
Uncertainty quantification has therefore become a central concern in clinical machine learning\[[25](https://arxiv.org/html/2606.24604#bib.bib25),[4](https://arxiv.org/html/2606.24604#bib.bib4)\]\. In high\-stakes prognostic settings, a model that is wrong but confident can be more harmful than one that is wrong but uncertain\. For AD, this concern is particularly acute because the most clinically important cases are often the most uncertain: MCI patients may remain stable for years or progress rapidly to dementia, and both outcomes may be plausible from the same observed history\. Despite this, most deep learning studies on AD progression report discriminative metrics such as accuracy or AUROC without evaluating how uncertainty propagates across multi\-step forecasts, or whether the uncertainty is meaningful under external cohort shift\.
A further challenge is that not all uncertainty has the same clinical meaning\. Aleatoric uncertainty reflects the inherent unpredictability of an individual’s disease course, while epistemic uncertainty reflects model ignorance that may be reduced with additional or more representative data\[[16](https://arxiv.org/html/2606.24604#bib.bib16),[21](https://arxiv.org/html/2606.24604#bib.bib21)\]\. Conflating these two sources can obscure important deployment signals\. A patient with high aleatoric uncertainty may require closer monitoring because several disease trajectories remain genuinely plausible\. A patient with high epistemic uncertainty may instead indicate that the model has encountered an underrepresented phenotype, missing feature pattern, or cohort shift\. This distinction is especially important for models trained on datasets such as the Alzheimer’s Disease Neuroimaging Initiative \(ADNI\)\[[40](https://arxiv.org/html/2606.24604#bib.bib40)\], which are highly valuable but may not fully represent the diversity of patients encountered in external clinical cohorts\.
Motivated by these gaps, we ask four research questions\. First, can AD progression be modelled as an ordinal longitudinal process rather than a flat classification problem? Second, can a model generate multi\-year probabilistic trajectories rather than only a single next\-visit diagnosis? Third, can predictive uncertainty be decomposed into aleatoric and epistemic components that carry different clinical meanings? Fourth, can epistemic uncertainty provide a useful signal when the model is evaluated on an external cohort unseen during training?
To address these questions, we propose a unified framework for probabilistic longitudinal AD progression modelling\. The framework combines ordinal diagnosis prediction, autoregressive trajectory generation, and uncertainty decomposition within a single forecasting pipeline\. Our primary contributions are as follows:
1. Ordinal Temporal Fusion Transformer. We adapt the Temporal Fusion Transformer (TFT) \[[28](https://arxiv.org/html/2606.24604#bib.bib28)\], an architecture designed for multi-horizon forecasting with interpretable attention, to longitudinal AD progression modelling. The TFT is combined with a CORAL ordinal output layer \[[6](https://arxiv.org/html/2606.24604#bib.bib6)\], asymmetric loss weighting, and converter oversampling to respect the ordered structure of CN, MCI, and dementia, while prioritising clinically important MCI-to-dementia conversion events.
2. Probabilistic trajectory generation. Conditioned on the TFT encoder’s patient-level context representation, an autoregressive Mixture Density Network (MDN) \[[5](https://arxiv.org/html/2606.24604#bib.bib5)\] generates five-year probabilistic trajectories over diagnosis state, Clinical Dementia Rating (CDR) Sum of Boxes, Mini Mental State Examination (MMSE) orientation, and hippocampal volume. This allows the model to represent multiple plausible futures rather than a single deterministic path.
3. Decomposed uncertainty estimation. We decompose predictive uncertainty into aleatoric and epistemic components using the law of total variance. Aleatoric uncertainty is estimated analytically from the MDN mixture variance, while epistemic uncertainty is estimated using a five-member deep ensemble \[[26](https://arxiv.org/html/2606.24604#bib.bib26)\]. This enables the model to distinguish between uncertainty arising from intrinsic disease variability and uncertainty arising from limited model knowledge.
4. External out-of-distribution validation. We evaluate the model on OASIS-3 \[[27](https://arxiv.org/html/2606.24604#bib.bib27)\], a distinct external cohort unseen during training. We characterise the covariate shift between ADNI and OASIS-3, quantify the zero-shot transfer gap, and examine whether epistemic uncertainty increases when predictions become less reliable under distribution shift.
Experiments on ADNI show that the proposed ordinal TFT outperforms linear, recurrent, and Transformer baselines for next\-visit diagnosis prediction, with the strongest gains on MCI\-versus\-dementia discrimination\. The trajectory generator produces five\-year forecasts with near\-nominal 90% credible interval coverage, widening uncertainty over the forecast horizon, and biomarker dynamics consistent with established AD progression patterns\[[17](https://arxiv.org/html/2606.24604#bib.bib17)\]\. Uncertainty analysis shows that epistemic uncertainty is higher for rare progression archetypes, MCI and dementia patients, and external OASIS\-3 cases where prediction error increases\. Together, these findings suggest that probabilistic longitudinal forecasting can provide both patient\-specific disease trajectories and clinically meaningful signals of when model predictions should be treated with caution\.
The remainder of this paper is organised as follows. Section [2](https://arxiv.org/html/2606.24604#S2) reviews related work in disease progression modelling, sequence modelling for clinical data, ordinal learning, and uncertainty quantification. Section [3](https://arxiv.org/html/2606.24604#S3) describes the ADNI and OASIS-3 datasets and preprocessing pipeline. Section [4](https://arxiv.org/html/2606.24604#S4) details the model architecture, training procedure, and uncertainty decomposition framework. Section [5](https://arxiv.org/html/2606.24604#S5) presents experimental results and ablations. Section [6](https://arxiv.org/html/2606.24604#S6) discusses clinical implications, limitations, and directions for future work.
## 2 Related Work
Computational approaches to modelling Alzheimer’s disease progression have evolved substantially over the past decade, moving from interpretable statistical models to deep sequence models for longitudinal clinical data\. However, four methodological gaps remain central to the present work: most models either impose restrictive assumptions on disease trajectories, focus on single\-step deterministic prediction, ignore the ordinal structure of diagnosis transitions, or report uncertainty without separating its clinically distinct sources\.
Early work on AD progression established influential biomarker\-ordering frameworks such as event\-based models\[[10](https://arxiv.org/html/2606.24604#bib.bib10)\]and differential equation formulations\[[30](https://arxiv.org/html/2606.24604#bib.bib30)\]\. These approaches remain valuable because they provide interpretable descriptions of disease evolution\. However, they typically rely on strong parametric assumptions and may not fully accommodate the heterogeneity observed in large longitudinal cohorts such as ADNI\. Deep learning approaches have improved discriminative performance on structured clinical time series\[[29](https://arxiv.org/html/2606.24604#bib.bib29),[11](https://arxiv.org/html/2606.24604#bib.bib11)\]\. Recurrent architectures have been used for next\-visit diagnosis prediction, while transformer\-based models have been explored for imaging\-based classification and longitudinal sequence modelling\[[9](https://arxiv.org/html/2606.24604#bib.bib9),[36](https://arxiv.org/html/2606.24604#bib.bib36)\]\. Despite these advances, much of this work remains centred on predicting the immediately subsequent diagnostic state\.
This next\-visit classification focus is clinically limiting because it discards long\-horizon trajectory information that may be relevant for care planning and trial enrolment\. Wang et al\.\[[39](https://arxiv.org/html/2606.24604#bib.bib39)\]proposed a multimodal deep learning model for long\-term AD progression that incorporates interactions between clinical and imaging features, demonstrating improved multi\-step prediction over single\-modality baselines\. Hashemifar et al\.\[[13](https://arxiv.org/html/2606.24604#bib.bib13)\]addressed the single\-cohort limitation by training across multiple datasets\. He et al\.\[[14](https://arxiv.org/html/2606.24604#bib.bib14)\]proposed a stage\-aware Mixture of Experts framework for neurodegenerative progression modelling using graph neural diffusion, achieving interpretable stage\-specific mechanisms\. However, these works remain largely deterministic or focus on spatial propagation patterns rather than probabilistic clinical trajectory generation\. They do not jointly model multi\-year future trajectories with calibrated and decomposed uncertainty estimates, which is the primary gap addressed in this work\.
Longitudinal clinical cohorts also pose challenges that are shared with electronic health record data: visits are irregularly spaced, observations may be missing, and patient populations are heterogeneous across sites and protocols\. Standard recurrent architectures such as GRUs and LSTMs can process sequential histories, but they often rely on imputation and masking, and may treat temporal gaps uniformly\. The Temporal Fusion Transformer \(TFT\)\[[28](https://arxiv.org/html/2606.24604#bib.bib28)\]is well suited to this setting because it jointly models static covariates, observed time\-varying features, and known future inputs through variable selection networks and interpretable multi\-head attention\. TFT\-based models have been applied to clinical forecasting tasks such as vital sign prediction in intensive care\[[32](https://arxiv.org/html/2606.24604#bib.bib32)\]and multi\-modal chest X\-ray trajectory prediction\[[3](https://arxiv.org/html/2606.24604#bib.bib3)\]\. ChronoFormer\[[2](https://arxiv.org/html/2606.24604#bib.bib2)\]and related time\-aware transformer architectures further show the value of continuous\-time positional encodings, which modulate attention based on inter\-event intervals rather than integer time indices\. This is particularly relevant for ADNI, where visit spacing varies from six to twelve months across protocols and phases\. However, transformer architectures applied to AD progression modelling have largely been constrained to cross\-sectional or two\-timepoint inputs\[[36](https://arxiv.org/html/2606.24604#bib.bib36)\], and have not been widely adapted to generate probabilistic multi\-step forecasts\. Our work addresses this gap by adapting the TFT architecture to the ADNI longitudinal setting as a probabilistic encoder whose patient\-level context representations condition a generative trajectory model\.
A second limitation is that disease severity in AD is inherently ordinal\. The CN\-MCI\-Dementia spectrum admits a natural clinical ordering, but standard multi\-class cross\-entropy treats all diagnostic classes as unrelated and all misclassification errors as equally distant\. Ordinal regression frameworks for deep neural networks explicitly encode this ordering through cumulative link models and ranked output layers\. CORAL\[[6](https://arxiv.org/html/2606.24604#bib.bib6)\]achieves rank consistency by sharing weights across binary threshold classifiers, while CORN\[[35](https://arxiv.org/html/2606.24604#bib.bib35)\]relaxes the weight\-sharing constraint while preserving consistency through a conditional chain\-rule formulation\. Ordinal learning has also been adopted in medical staging tasks beyond AD\. Uncertainty\-aware ordinal networks have been applied to diabetic retinopathy grading\[[38](https://arxiv.org/html/2606.24604#bib.bib38)\], and Kamal and Farooq\[[19](https://arxiv.org/html/2606.24604#bib.bib19)\]integrated CORAL into a residual logit architecture\. Despite these advances, ordinal learning in disease progression modelling has mostly been applied to single\-step classification\. In this work, we extend ordinal constraints to both the classification head and the autoregressive transition function of a generative model, ensuring that generated trajectories respect the biological ordering of disease states by construction\.
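To make the rank-consistency idea concrete, the following is a minimal sketch of a CORAL-style output layer for the three ordered classes CN, MCI, and dementia. It assumes a single shared weight vector and two threshold biases. The feature values, weights, and function names are hypothetical, and the asymmetric loss weighting and oversampling used in this work are not included.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def coral_probs(features, weights, biases):
    """Return P(y > k) for k = 0..K-2 using one weight vector shared by all thresholds."""
    score = sum(w * x for w, x in zip(weights, features))
    return [sigmoid(score + b) for b in biases]

def coral_predict(features, weights, biases):
    """Predicted rank = number of thresholds passed with probability above 0.5."""
    return sum(p > 0.5 for p in coral_probs(features, weights, biases))

def coral_loss(features, weights, biases, label):
    """Unweighted sum of binary cross-entropies over the K-1 threshold tasks."""
    loss = 0.0
    for k, p in enumerate(coral_probs(features, weights, biases)):
        target = 1.0 if label > k else 0.0
        loss -= target * math.log(p) + (1.0 - target) * math.log(1.0 - p)
    return loss

# Hypothetical two-feature patient; classes are CN=0, MCI=1, Dementia=2.
weights = [0.8, -0.5]
biases = [1.0, -1.0]  # non-increasing biases give P(y > 0) >= P(y > 1)
x = [0.5, 0.2]
probs = coral_probs(x, weights, biases)
pred = coral_predict(x, weights, biases)
```

Because every threshold shares the same score and only the bias differs, the threshold probabilities cannot cross as long as the biases are ordered, which is what makes the predicted rank well defined.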
Generating plausible future trajectories rather than point predictions requires a model capable of representing multi\-modal outcome distributions\. This is especially important for MCI patients, whose long\-term prognosis may include stability or conversion to dementia at varying rates\. Mixture Density Networks \(MDNs\)\[[5](https://arxiv.org/html/2606.24604#bib.bib5)\]address this by parameterising a Gaussian mixture model at each output step, allowing multi\-modal conditional distributions to be represented without discrete mode assignment\. MDNs have been applied in epidemiological modelling to emulate stochastic within\-host models and complex individual\-based simulations\[[7](https://arxiv.org/html/2606.24604#bib.bib7)\], demonstrating their ability to capture distributional diversity in sequential biological data\. For disease progression, probabilistic mixture extensions of mixed\-effects models have been proposed for identifying clinically meaningful subtypes\[[8](https://arxiv.org/html/2606.24604#bib.bib8)\], while Hidden Markov model variants have modelled heterogeneous progression as transitions between latent states\[[20](https://arxiv.org/html/2606.24604#bib.bib20)\]\. These approaches offer interpretability, but are often limited to low\-dimensional state spaces and do not scale naturally to generative tasks conditioned on rich neuroimaging histories\. Our MDN autoregressive generator addresses this by conditioning generation on the TFT context vector, enabling coherent multi\-biomarker trajectory sampling across time through a GRU hidden state\.
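As an illustration of how an MDN output step represents a multi-modal outcome, the sketch below maps unconstrained outputs to mixture parameters, evaluates the negative log-likelihood of a scalar target, and draws samples. It is a one-dimensional simplification under assumed parameter values. The two components (a "stable" and a "converter" mode) and all function names are hypothetical and are not taken from the released code.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mdn_params(raw_logits, raw_means, raw_log_sigmas):
    """Map unconstrained network outputs to valid mixture parameters."""
    pis = softmax(raw_logits)                       # mixture weights sum to 1
    sigmas = [math.exp(s) for s in raw_log_sigmas]  # standard deviations stay positive
    return pis, list(raw_means), sigmas

def mixture_nll(y, pis, mus, sigmas):
    """Negative log-likelihood of scalar y under a 1-D Gaussian mixture."""
    density = sum(
        p * math.exp(-0.5 * ((y - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
        for p, m, s in zip(pis, mus, sigmas)
    )
    return -math.log(density)

def sample_mixture(pis, mus, sigmas, rng):
    """Pick a component by its weight, then sample from that Gaussian."""
    k = rng.choices(range(len(pis)), weights=pis)[0]
    return rng.gauss(mus[k], sigmas[k])

# Hypothetical two-mode forecast: a "stable" mode near 0.2 and a "converter" mode near 3.0.
pis, mus, sigmas = mdn_params([0.0, 0.0], [0.2, 3.0], [math.log(0.5), math.log(1.0)])
nll_near_mode = mixture_nll(0.2, pis, mus, sigmas)
nll_far = mixture_nll(10.0, pis, mus, sigmas)
rng = random.Random(0)
draws = [sample_mixture(pis, mus, sigmas, rng) for _ in range(1000)]
```

Training would minimise `mixture_nll` over observed targets. In an autoregressive setting, each sampled value would be fed back as input to the next step.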
Uncertainty quantification has emerged as a central concern for deploying machine learning in clinical settings\[[25](https://arxiv.org/html/2606.24604#bib.bib25),[4](https://arxiv.org/html/2606.24604#bib.bib4)\]\. A clinician interacting with a prognostic AI system needs to know not only what the model predicts, but how reliable the prediction is and why the model is uncertain\. Recent surveys\[[43](https://arxiv.org/html/2606.24604#bib.bib43),[44](https://arxiv.org/html/2606.24604#bib.bib44)\]identify deep ensembles\[[26](https://arxiv.org/html/2606.24604#bib.bib26)\], Monte Carlo dropout, and Bayesian neural networks as widely used methods for estimating predictive uncertainty in medical AI\. Deep ensembles in particular have been shown to produce well\-calibrated uncertainty estimates and strong performance under distribution shift\[[18](https://arxiv.org/html/2606.24604#bib.bib18)\]\. A critical distinction is between aleatoric uncertainty, which captures irreducible data noise, and epistemic uncertainty, which captures reducible model uncertainty\[[16](https://arxiv.org/html/2606.24604#bib.bib16),[21](https://arxiv.org/html/2606.24604#bib.bib21)\]\. Koch et al\.\[[24](https://arxiv.org/html/2606.24604#bib.bib24)\]demonstrate that distribution shifts in deployed medical AI systems can be detected through changes in model uncertainty, motivating epistemic uncertainty as a deployment\-time signal of out\-of\-distribution inputs\. Weng et al\.\[[41](https://arxiv.org/html/2606.24604#bib.bib41)\]similarly frame OOD detection as a risk\-control strategy for medical classification models, showing that uncertainty\-based detectors can flag shifts arising from biological variability, cohort differences, and missing data\.
Despite this growing body of work, decomposed uncertainty estimation for longitudinal disease progression remains underexplored\. Existing approaches either report total predictive uncertainty without decomposition, or apply decomposition to single\-step classification tasks\. Our work fills this gap through the law of total variance, using the MDN mixture variance as an analytic aleatoric estimator and a five\-member deep ensemble as the epistemic estimator\. This produces per\-step, per\-biomarker uncertainty decompositions that we validate both within ADNI and under cross\-cohort transfer to OASIS\-3\[[27](https://arxiv.org/html/2606.24604#bib.bib27)\]\.
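The decomposition described above can be sketched for a single biomarker at a single forecast step. By the law of total variance, the predictive variance splits into the average within-member variance (aleatoric) and the variance of member means (epistemic). The sketch assumes each ensemble member outputs a one-dimensional Gaussian mixture and uses the population variance across members. Both choices, and the function names, are illustrative assumptions.

```python
def mixture_mean_var(pis, mus, sigmas):
    """Analytic mean and variance of a 1-D Gaussian mixture."""
    mean = sum(p * m for p, m in zip(pis, mus))
    second_moment = sum(p * (s * s + m * m) for p, m, s in zip(pis, mus, sigmas))
    return mean, second_moment - mean * mean

def decompose_uncertainty(members):
    """members: list of (pis, mus, sigmas), one tuple per ensemble member.

    Returns (aleatoric, epistemic), where
      aleatoric = mean over members of the mixture variance,
      epistemic = variance over members of the mixture mean.
    """
    stats = [mixture_mean_var(*m) for m in members]
    means = [s[0] for s in stats]
    variances = [s[1] for s in stats]
    n = len(members)
    aleatoric = sum(variances) / n
    grand_mean = sum(means) / n
    epistemic = sum((mu - grand_mean) ** 2 for mu in means) / n
    return aleatoric, epistemic

# Two hypothetical members that agree on the noise level but disagree on the mean.
disagreeing = [([1.0], [0.0], [1.0]), ([1.0], [2.0], [1.0])]
aleatoric, epistemic = decompose_uncertainty(disagreeing)

# Identical members: all uncertainty is aleatoric.
agreeing = [([0.5, 0.5], [0.0, 2.0], [1.0, 1.0])] * 5
aleatoric_same, epistemic_same = decompose_uncertainty(agreeing)
```

In the first example the two terms are each 1.0 and sum to 2.0, which matches the variance of an equal-weight mixture of N(0, 1) and N(2, 1).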
## 3 Data
### 3.1 Primary Dataset: Alzheimer’s Disease Neuroimaging Initiative
The primary data used in this study were obtained from the Alzheimer’s Disease Neuroimaging Initiative \(ADNI\) database \([adni\.loni\.usc\.edu](https://arxiv.org/html/2606.24604v1/adni.loni.usc.edu)\)\. ADNI is a longitudinal data collection initiative launched in 2003 by the National Institute on Aging \(NIA\), with the goal of developing validated biomarkers for Alzheimer’s disease clinical trials\[[40](https://arxiv.org/html/2606.24604#bib.bib40)\]\. The initiative has progressed through four phases, ADNI1, ADNIGO, ADNI2, and ADNI3, enrolling participants across sites in the United States and Canada\[[1](https://arxiv.org/html/2606.24604#bib.bib1)\]\. In this work, we used the ADNIMERGE2 composite dataset, which combines participant records across ADNI phases into a single longitudinal structure\.
#### 3.1.1 Participants and diagnosis criteria
Participants were represented using three ordinal diagnostic categories: cognitively normal (CN, $y=0$), mild cognitive impairment (MCI, $y=1$), and dementia ($y=2$). Within ADNI, CN participants show no subjective or objective memory impairment, MCI participants exhibit objective cognitive decline without functional impairment, and the dementia category includes clinically confirmed Alzheimer’s-type dementia. This three-level ordinal encoding is consistent with the biomarker-staging model of \[[17](https://arxiv.org/html/2606.24604#bib.bib17)\] and has been used in prior longitudinal modelling work \[[29](https://arxiv.org/html/2606.24604#bib.bib29)\].
#### 3.1.2 Feature taxonomy
We defined a feature taxonomy with four groups, selected to capture the main clinical, cognitive, and neuroimaging dimensions of Alzheimer’s disease progression while maintaining availability across ADNI phases\.
- Static features (4): biological sex, years of education, APOE4 allele dose (0, 1, or 2 risk alleles), and APOE2 carrier status. These features are constant across visits for each participant.
- MRI features (8): hippocampal total volume, total lateral ventricular volume, entorhinal total volume, amygdala total volume, all normalised by intracranial volume, and cortical thickness of the left and right entorhinal cortex, left fusiform gyrus, and left inferior temporal gyrus.
- Cognitive features (6): four MMSE subscores (recall, orientation to time, orientation to place, and attention/calculation), together with ADAS-Cog 11 and ADAS-Cog 13 total scores.
- CDR features (8): CDR Sum of Boxes (CDR-SB), CDR global score, and the six CDR domain subscores (memory, orientation, judgment and problem-solving, community affairs, home and hobbies, and personal care).
The diagnosis label was also included as a time-varying input feature, allowing the model to condition on the observed disease state when predicting future visits. The complete per-visit feature vector therefore had dimensionality $4+1+8+6+8+1=28$, comprising static covariates, time-varying biomarkers, and the diagnosis label.
#### 3.1.3 Preprocessing
ADNIMERGE2 is provided through separate clinical tables corresponding to different assessment categories\. These tables were merged across ADNI phases using the participant identifier \(RID\), with a 30\-day tolerance window applied through nearest\-neighbour matching on visit dates\. When multiple assessments occurred within the same tolerance window, typically because of protocol overlap across ADNI phases, the record from the most recent phase was retained\.
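The tolerance-window merge can be illustrated with a small standard-library sketch. It assumes each table is a list of tuples keyed by RID and date, and it attaches the nearest assessment within 30 days to each visit. The table layout and function name are hypothetical, and the tie-break that retains the record from the most recent ADNI phase is not modelled here.

```python
from datetime import date

def match_nearest(visits, assessments, tolerance_days=30):
    """Attach to each (rid, visit_date) the nearest assessment value for the
    same rid within the tolerance window, or None if none qualifies."""
    by_rid = {}
    for rid, when, value in assessments:
        by_rid.setdefault(rid, []).append((when, value))
    merged = []
    for rid, visit_date in visits:
        candidates = [
            (abs((when - visit_date).days), value)
            for when, value in by_rid.get(rid, [])
            if abs((when - visit_date).days) <= tolerance_days
        ]
        best = min(candidates, key=lambda c: c[0])[1] if candidates else None
        merged.append((rid, visit_date, best))
    return merged

# Hypothetical visit table and cognitive-score table.
visits = [(1, date(2010, 1, 1)), (1, date(2010, 7, 1)), (2, date(2011, 3, 1))]
scores = [
    (1, date(2010, 1, 10), 28.0),  # 9 days from the first visit: matched
    (1, date(2010, 9, 15), 27.0),  # 76 days from the second visit: outside the window
    (2, date(2011, 3, 20), 30.0),  # 19 days from the visit: matched
]
merged = match_nearest(visits, scores)
```

With data frames, `pandas.merge_asof` with `by`, `tolerance`, and `direction="nearest"` offers comparable behaviour. Whether the original pipeline used it is not stated.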
Exclusion and preprocessing criteria were then applied sequentially\. Participants with fewer than three recorded visits were excluded, since very short sequences provide limited longitudinal context for prediction\. This criterion is consistent with prior longitudinal deep learning work\[[11](https://arxiv.org/html/2606.24604#bib.bib11)\]\. Within each participant trajectory, forward and backward filling were used to handle sporadic missing values caused by missed assessments or protocol changes\.
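A minimal sketch of these two steps follows, assuming each participant's trajectory is a chronologically ordered list in which missing values are `None`. The order of the two filling passes (forward first, then backward for leading gaps) is an assumption, and the function names are illustrative.

```python
def keep_participants(trajectories, min_visits=3):
    """Drop participants with fewer than min_visits recorded visits."""
    return {rid: v for rid, v in trajectories.items() if len(v) >= min_visits}

def fill_forward_backward(values):
    """Fill gaps with the last observed value, then fill any leading gaps
    with the first observed value. A fully missing series stays missing."""
    filled = list(values)
    last = None
    for i, v in enumerate(filled):            # forward pass
        if v is None:
            filled[i] = last
        else:
            last = v
    nxt = None
    for i in range(len(filled) - 1, -1, -1):  # backward pass for leading gaps
        if filled[i] is None:
            filled[i] = nxt
        else:
            nxt = filled[i]
    return filled

# Hypothetical hippocampal-volume series keyed by participant identifier.
raw = {
    101: [None, 3.0, None, None, 5.0, None],
    102: [4.1, 4.0],  # only two visits: excluded
}
kept = keep_participants(raw)
cleaned = {rid: fill_forward_backward(series) for rid, series in kept.items()}
```

Filling is applied within each participant only, so no value crosses from one individual to another.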
The final dataset comprised 2039 participants. Figure [1](https://arxiv.org/html/2606.24604#S3.F1) provides an overview of the resulting cohort.
Figure 1: Final dataset cohort overview.
#### 3.1.4 Dataset construction
Participants were split into training (70%), validation (15%), and test (15%) sets using a stratified group split. The participant identifier was used as the grouping variable to prevent data leakage across sequences from the same individual. Stratification was performed using baseline diagnosis, preserving the CN, MCI, and dementia distribution across splits. Temporal sequences were constructed independently within each split. For a participant with $T$ visits, $T-1$ input-target pairs were generated. For each pair, the input consisted of all visits up to time $t$, left-padded to a maximum sequence length of 20, and the target was the diagnosis and biomarker state at visit $t+1$.
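The split and sequence construction can be sketched as follows. The sketch assumes participants are assigned to splits as whole units within each baseline-diagnosis stratum, and that histories longer than the maximum length keep only their most recent visits. The truncation rule, the padding value, and the function names are assumptions not stated in the text.

```python
import random

def stratified_group_split(baseline_dx, fractions=(0.70, 0.15), seed=0):
    """Assign whole participants to train/val/test within each diagnosis stratum."""
    rng = random.Random(seed)
    by_class = {}
    for rid, dx in sorted(baseline_dx.items()):
        by_class.setdefault(dx, []).append(rid)
    split = {"train": [], "val": [], "test": []}
    for rids in by_class.values():
        rng.shuffle(rids)
        n_train = round(fractions[0] * len(rids))
        n_val = round(fractions[1] * len(rids))
        split["train"] += rids[:n_train]
        split["val"] += rids[n_train:n_train + n_val]
        split["test"] += rids[n_train + n_val:]
    return split

def build_pairs(visits, max_len=20, pad_value=None):
    """Turn T chronological visits into T-1 (left-padded history, next visit) pairs."""
    pairs = []
    for t in range(1, len(visits)):
        history = visits[:t][-max_len:]  # assumption: keep the most recent visits
        padding = [pad_value] * (max_len - len(history))
        pairs.append((padding + history, visits[t]))
    return pairs

# Hypothetical cohort: 60 participants, 20 per baseline diagnosis (0, 1, 2).
baseline_dx = {rid: rid % 3 for rid in range(60)}
split = stratified_group_split(baseline_dx)

# Hypothetical participant with four visits and a short maximum length for display.
pairs = build_pairs(["v1", "v2", "v3", "v4"], max_len=3)
```

Pairs would be built only after the split, so every sequence derived from one participant stays in a single partition.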
### 3.2 External Validation Dataset: OASIS-3
For out-of-distribution (OOD) evaluation, we used OASIS-3 (Open Access Series of Imaging Studies), a retrospective compilation of neuroimaging, clinical, and cognitive data collected over 30 years at the Washington University Knight Alzheimer Disease Research Center \[[27](https://arxiv.org/html/2606.24604#bib.bib27)\]. OASIS-3 provides a distinct external cohort, and a random subset of 300 patients was selected for OOD evaluation.
Most features mapped directly between ADNI and OASIS-3, including CDR subscores, FreeSurfer volumetric and surface features, and diagnosis labels. Two feature groups required approximation. First, OASIS-3 records only the MMSE total score rather than the four MMSE subscores used in ADNI. These four subscores were therefore imputed using their corresponding training-set means from ADNI. This conservative strategy avoids injecting cohort-specific information into these dimensions. Second, ADAS-Cog is not administered in OASIS-3. Both ADAS-Cog 11 and ADAS-Cog 13 were therefore imputed using ADNI training-set means. No fine-tuning on OASIS-3 was performed at any stage.
### 3.3 Ethical Considerations
All ADNI data were collected under protocols approved by the institutional review boards of each participating site\[[40](https://arxiv.org/html/2606.24604#bib.bib40)\]\. OASIS\-3 data were collected in accordance with Washington University institutional review board protocols\. Both datasets are governed by data use agreements that permit academic research use\. No patient re\-identification was performed in this work\.
## 4 Methods
Let $\mathcal{H}_t = \{(\mathbf{x}_1, \ldots, \mathbf{x}_t), \mathbf{s}\}$ denote the observed history for a patient at visit $t$, where $\mathbf{x}_\tau \in \mathbb{R}^{24}$ is the time-varying feature vector at visit $\tau$ (MRI biomarkers, cognitive scores, CDR subscores, and current diagnosis) and $\mathbf{s} \in \mathbb{R}^{4}$ is the static covariate vector (sex, education, APOE4 dose, APOE2 carrier status). For this data, we describe two prediction tasks.
1. Next-visit ordinal classification. Predict the diagnosis label $y_{t+1} \in \{0, 1, 2\}$ at the next visit, where the labels encode the ordinal sequence CN $<$ MCI $<$ Dementia. The ordinal structure imposes the constraint that misclassifying CN as Dementia should be penalised more heavily than misclassifying CN as MCI.
2. Probabilistic trajectory generation. Given $\mathcal{H}_t$, generate a set of $S$ plausible future trajectories $\{\boldsymbol{\tau}^{(s)}\}_{s=1}^{S}$, where each trajectory $\boldsymbol{\tau}^{(s)} = \{(y^{(s)}_h, \mathbf{b}^{(s)}_h)\}_{h=1}^{H}$ specifies a diagnosis state and biomarker vector $\mathbf{b}_h = (\text{CDR-SB}_h, \text{MMSE-orient}_h, \text{Hippocampus}_h)$ at each of $H=10$ future visits spaced 6 months apart, covering a 5-year horizon.
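To illustrate how a set of $S$ sampled trajectories can be summarised, the sketch below computes a per-horizon 90% credible band from empirical quantiles and the fraction of observed values falling inside it. The sampled trajectories are a synthetic Gaussian random walk standing in for model output. The quantile rule (linear interpolation) and all names are assumptions.

```python
import random

def percentile(sorted_vals, q):
    """Linear-interpolation percentile for q in [0, 1]."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1.0 - frac) + sorted_vals[hi] * frac

def credible_band(samples, level=0.90):
    """samples[s][h] is one biomarker in sampled trajectory s at horizon h.
    Returns one (lower, upper) pair per horizon."""
    tail = (1.0 - level) / 2.0
    band = []
    for h in range(len(samples[0])):
        vals = sorted(traj[h] for traj in samples)
        band.append((percentile(vals, tail), percentile(vals, 1.0 - tail)))
    return band

def coverage(band, observed):
    """Fraction of observed values that fall inside the band."""
    hits = sum(lo <= y <= hi for (lo, hi), y in zip(band, observed))
    return hits / len(observed)

# Synthetic stand-in for S=200 sampled trajectories over H=10 six-month steps.
rng = random.Random(0)
samples = []
for _ in range(200):
    value, traj = 0.0, []
    for _ in range(10):
        value += rng.gauss(0.0, 1.0)
        traj.append(value)
    samples.append(traj)

band = credible_band(samples, level=0.90)
widths = [hi - lo for lo, hi in band]
cov = coverage(band, [0.0] * 10)
```

For an autoregressive sampler the band would typically widen with the horizon, as it does for this random walk. Averaging `coverage` over held-out patients gives an empirical coverage to compare against the nominal 90%.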
An overview of the full framework is shown in Figure [2](https://arxiv.org/html/2606.24604#S4.F2).
Figure 2: Architecture overview.
### 4.1 Temporal Fusion Transformer Encoder
We adapt the Temporal Fusion Transformer \[[28](https://arxiv.org/html/2606.24604#bib.bib28)\] as the primary encoder, mapping the observed history $\mathcal{H}_t$ to a fixed-dimensional context vector $\mathbf{h} \in \mathbb{R}^{64}$ that encodes patient-level progression state. The TFT is particularly suited to this setting because it provides native handling of heterogeneous static and time-varying inputs through separate processing streams, and its interpretable multi-head attention allows inspection of which past visits drive predictions.
The encoder consists of five components applied sequentially. Separate Variable Selection Networks (VSNs) are applied to the static features $\mathbf{s}$ and the time-varying features $\mathbf{x}_\tau$. Each VSN projects individual scalar features into a shared embedding space, concatenates these embeddings, and passes them through a Gated Residual Network (GRN) to produce normalised importance weights $\boldsymbol{\alpha} \in \Delta^{F-1}$ (the probability simplex over the $F$ features). The final output is a weighted sum of per-feature embeddings, each first passed through its own GRN:
$$\text{VSN}(\mathbf{x}) = \sum_{f=1}^{F} \alpha_f \cdot \text{GRN}_f(\mathbf{e}_f) \tag{1}$$

$$\alpha_f = \text{softmax}\!\left(\text{GRN}_{\text{flat}}([\mathbf{e}_1, \ldots, \mathbf{e}_F])\right)_f$$

where $\mathbf{e}_f = \mathbf{W}_f x_f$ is the embedding of the $f$-th scalar feature. The temporal VSN is additionally conditioned on the static context vector $\mathbf{c}_s$. All internal transformations use GRNs, which combine a gated linear unit (GLU) with a skip connection and layer normalisation:
$$\text{GRN}(\mathbf{x}, \mathbf{c}) = \text{LayerNorm}\Bigl(\mathbf{x}' + \text{GLU}\bigl(\mathbf{W}_2 \cdot \text{ELU}(z)\bigr)\Bigr) \tag{2}$$

$$z = \mathbf{W}_1 \mathbf{x} + \mathbf{W}_c \mathbf{c} + \mathbf{b}_1$$

where $\mathbf{x}' = \mathbf{W}_{\text{proj}} \mathbf{x}$ if $\dim(\mathbf{x}) \neq \dim(\text{output})$, else $\mathbf{x}' = \mathbf{x}$, and $\mathbf{c}$ is an optional context vector. The GLU gate $\text{GLU}(\mathbf{z}) = \mathbf{z}_1 \odot \sigma(\mathbf{z}_2)$ controls information flow, suppressing irrelevant activations.

The static embedding $\mathbf{e}_s = \text{VSN}_s(\mathbf{s})$ is projected through four separate GRNs to produce four context vectors: $\mathbf{c}_s$ (temporal VSN enrichment), $\mathbf{c}_e$ (static enrichment), $\mathbf{c}_h$ (LSTM hidden state initialisation), and $\mathbf{c}_c$ (LSTM cell state initialisation). The temporally embedded features are processed by a single-layer LSTM initialised from $(\mathbf{c}_h, \mathbf{c}_c)$. The LSTM output is combined with the pre-LSTM embeddings via a GLU gate and layer normalisation, yielding a sequence of enriched representations $\{\mathbf{e}^{\text{enc}}_\tau\}_{\tau=1}^{t}$.
The one modification relative to standard multi-head attention \[[37](https://arxiv.org/html/2606.24604#bib.bib37)\] is the use of a single value projection shared across all attention heads:

$$\text{IMHA}(\mathbf{E}) = \mathbf{W}_O\left(\bar{\mathbf{A}} \cdot \mathbf{W}_V \mathbf{E}\right) \tag{3}$$

$$\bar{\mathbf{A}} = \frac{1}{H_{\text{att}}} \sum_{h=1}^{H_{\text{att}}} \text{softmax}\!\left(\frac{\mathbf{Q}_h \mathbf{K}_h^{\top}}{\sqrt{d / H_{\text{att}}}}\right)$$

where $H_{\text{att}} = 4$ is the number of attention heads, and the averaged attention matrix $\bar{\mathbf{A}} \in \mathbb{R}^{t \times t}$ can be inspected to identify which past visits the model attends to when generating a prediction. The attention output is combined with the LSTM encoding via a second GLU gate and fed through a feed-forward GRN. The representation at the final time step $t$ is passed through a static enrichment GRN conditioned on $\mathbf{c}_e$, producing the context vector $\mathbf{h}$.
For next-visit classification, we attach a rank-consistent ordinal output layer following the CORAL framework \[[6](https://arxiv.org/html/2606.24604#bib.bib6)\]. A shared projection maps $\mathbf{h}$ to a scalar score:

$$g(\mathbf{h}) = \mathbf{w}_2^{\top} \text{ReLU}(\mathbf{W}_1 \mathbf{h}) \tag{4}$$

and $K - 1 = 2$ learnable bias parameters $\{b_k\}_{k=0}^{1}$ (initialised at $[0.5, -0.5]$) define cumulative threshold logits $\ell_k = g(\mathbf{h}) + b_k$. Cumulative probabilities are $\hat{P}(y > k) = \sigma(\ell_k)$, and class probabilities are recovered as:
$$\hat{P}(y = k) = \hat{P}(y > k - 1) - \hat{P}(y > k), \quad k \in \{0, 1, 2\} \tag{5}$$

with $\hat{P}(y > -1) \triangleq 1$ and $\hat{P}(y > K - 1) \triangleq 0$. The class probabilities sum to one by telescoping, $\sum_k \hat{P}(y = k) = 1$, and $\hat{P}(y = k) \geq 0$ for all $k$ whenever the biases are non-increasing in $k$ (as they are at initialisation), yielding a valid probability distribution that respects the ordinal ordering. The encoder is trained with the CORAL loss, where binary cross-entropy is applied independently over each of the $K - 1$ cumulative thresholds:
$$\mathcal{L}_{\text{CORAL}}(\mathbf{h}, y) = -\frac{1}{K-1} \sum_{k=0}^{K-2} \Bigl[\mathbb{1}[y > k] \log \sigma(\ell_k) + \mathbb{1}[y \leq k] \log\bigl(1 - \sigma(\ell_k)\bigr)\Bigr] \tag{6}$$

To further improve sensitivity to conversion events, converter oversampling is applied during training: transition pairs (two consecutive visits between which the diagnosis changes) are randomly oversampled relative to stable pairs (two consecutive visits with no change in diagnosis).
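The ordinal head and loss of Equations 4 to 6 can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names are hypothetical, the score `g` is passed in as a plain number rather than computed from $\mathbf{h}$, and the asymmetric loss weighting and converter oversampling are omitted.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def coral_class_probs(g, biases):
    """Turn a shared score g and K-1 threshold biases into K class probabilities."""
    cumulative = [sigmoid(g + b) for b in biases]   # P(y > k) for k = 0..K-2
    bounds = [1.0] + cumulative + [0.0]             # P(y > -1) = 1, P(y > K-1) = 0
    return [bounds[k] - bounds[k + 1] for k in range(len(biases) + 1)]


def coral_loss(g, biases, y):
    """Mean binary cross-entropy over the K-1 cumulative thresholds (Equation 6)."""
    total = 0.0
    for k, b in enumerate(biases):
        p = sigmoid(g + b)
        target = 1.0 if y > k else 0.0
        total += target * math.log(p) + (1.0 - target) * math.log(1.0 - p)
    return -total / len(biases)


# Biases at their stated initial values [0.5, -0.5]; g = 0.3 is an arbitrary score.
probs = coral_class_probs(0.3, [0.5, -0.5])
```

Because a single score `g` is shared across thresholds, the class probabilities stay non-negative as long as the biases are non-increasing, which is the ordering the sketch assumes.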
### 4.2 Probabilistic Trajectory Generator
Conditioned on the context vector $\mathbf{h}$ from the frozen TFT encoder, an autoregressive trajectory generator produces probabilistic multi-step forecasts over biomarkers and diagnosis state. The encoder is frozen after its own training, and its weights are not updated during generator training.
At each future step $h \in \{1, \ldots, H\}$, a GRU cell maintains a hidden state $\mathbf{z}_h \in \mathbb{R}^{64}$ that tracks the evolving trajectory:

$$\mathbf{z}_h = \text{GRU}\bigl(\text{MLP}([\mathbf{h};\, \mathbf{b}_{h-1};\, \mathbf{d}_{h-1};\, \Delta_h]),\, \mathbf{z}_{h-1}\bigr) \tag{7}$$

where $\mathbf{b}_{h-1} \in \mathbb{R}^{3}$ is the previous biomarker state, $\mathbf{d}_{h-1} \in \{0, 1\}^{3}$ is the one-hot encoding of the previous diagnosis, $\Delta_h = 6$ months is the fixed inter-visit interval, and $\mathbf{z}_0 = \mathbf{W}_{\text{init}} \mathbf{h}$ initialises the GRU from the encoder context. The GRU hidden state $\mathbf{z}_h$ is passed to a Mixture Density Network (MDN) head \[[5](https://arxiv.org/html/2606.24604#bib.bib5)\] that parameterises a $K = 3$ component Gaussian mixture over the three biomarker targets:
$$\boldsymbol{\pi}_h = \text{softmax}(\mathbf{W}_\pi \mathbf{z}_h), \quad \boldsymbol{\pi}_h \in \Delta^{K-1} \tag{8}$$

$$\boldsymbol{\mu}_{h,k} = \mathbf{W}_\mu^{(k)} \mathbf{z}_h, \quad \boldsymbol{\mu}_{h,k} \in \mathbb{R}^{3} \tag{9}$$

$$\boldsymbol{\sigma}_{h,k} = \exp(\mathbf{W}_\sigma^{(k)} \mathbf{z}_h), \quad \boldsymbol{\sigma}_{h,k} \in \mathbb{R}_{>0}^{3} \tag{10}$$

A biomarker sample is drawn by first selecting a component $k \sim \text{Categorical}(\boldsymbol{\pi}_h)$ and then sampling $\hat{\mathbf{b}}_h \sim \mathcal{N}(\boldsymbol{\mu}_{h,k}, \text{diag}(\boldsymbol{\sigma}_{h,k}^2))$. The standard deviations are clamped to $[10^{-4}, 10]$ for numerical stability.
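The two-stage sampling procedure (pick a component, then draw from its diagonal Gaussian) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: `sample_mdn` is a hypothetical helper, the mixture parameters are passed in as plain lists instead of being produced by the network, and the clamp is applied in log space, which is equivalent to clamping $\sigma$ to $[10^{-4}, 10]$.

```python
import math
import random

LOG_SIGMA_MIN = math.log(1e-4)
LOG_SIGMA_MAX = math.log(10.0)


def sample_mdn(pi, mu, log_sigma, rng):
    """Draw one biomarker vector from a diagonal Gaussian mixture.

    pi:        K mixture weights summing to one
    mu:        K lists of per-biomarker means
    log_sigma: K lists of per-biomarker log standard deviations
    rng:       a random.Random instance
    """
    # Step 1: choose a mixture component according to its weight.
    k = rng.choices(range(len(pi)), weights=pi, k=1)[0]
    # Step 2: sample each biomarker independently from the chosen component.
    sample = []
    for mean, ls in zip(mu[k], log_sigma[k]):
        sigma = math.exp(min(max(ls, LOG_SIGMA_MIN), LOG_SIGMA_MAX))
        sample.append(rng.gauss(mean, sigma))
    return k, sample
```

At inference this draw would be repeated at every forecast step, with the sampled vector fed back as the next input.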
Diagnosis at step $h$ is drawn from a learned ordinal transition distribution $P(y_h \mid y_{h-1}, \mathbf{z}_h)$. A linear layer maps $\mathbf{z}_h$ to three logits, which are then masked to enforce clinically motivated structural constraints before softmax:

$$P(y_h = j \mid y_{h-1} = i, \mathbf{z}_h) \propto \exp(\ell_{ij}) \cdot \mathbb{1}[\mathcal{M}_{ij} = 0] \tag{11}$$

where $\mathcal{M} \in \{0, 1\}^{3 \times 3}$ is a hard impossibility mask with $\mathcal{M}_{02} = \mathcal{M}_{10} = \mathcal{M}_{20} = \mathcal{M}_{21} = 1$. This prohibits CN $\to$ Dementia transitions in a single step, as well as all backward transitions. The mask is applied by setting masked logits to $-\infty$ before softmax, yielding zero probability for impossible transitions by construction rather than by regularisation.
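A minimal sketch of the masked softmax in Equation 11 is given below. The mask entries follow the text; the function name and the example logits are hypothetical, and the logits are supplied directly rather than computed from $\mathbf{z}_h$.

```python
import math

# MASK[i][j] == 1 marks the transition i -> j as impossible.
# States: 0 = CN, 1 = MCI, 2 = Dementia.
MASK = [
    [0, 0, 1],  # CN: stay CN or move to MCI; CN -> Dementia is blocked
    [1, 0, 0],  # MCI: no reversion to CN
    [1, 1, 0],  # Dementia: no backward transitions
]


def transition_probs(logits, prev_dx):
    """Softmax over next-state logits with impossible transitions set to -inf."""
    masked = [
        logit if MASK[prev_dx][j] == 0 else -math.inf
        for j, logit in enumerate(logits)
    ]
    top = max(masked)  # finite, because staying in the same state is always allowed
    exps = [math.exp(v - top) for v in masked]
    norm = sum(exps)
    return [e / norm for e in exps]
```

Since the diagonal of the mask is always zero, at least one logit survives for every previous state, so the normaliser is never zero.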
The generator is trained with teacher forcing: at each step, the true biomarker and diagnosis values from the training sequence are used as the input $(\mathbf{b}_{h-1}, y_{h-1})$ rather than the model's own samples, which stabilises early training. The composite loss is:

$$\mathcal{L}_{\text{gen}} = \frac{1}{H_{\text{valid}}} \sum_{h=1}^{H_{\text{valid}}} \Bigl[-\log p_{\text{MDN}}\bigl(\mathbf{b}_h \mid \boldsymbol{\pi}_h, \boldsymbol{\mu}_h, \boldsymbol{\sigma}_h\bigr) + \mathcal{L}_{\text{CE}}\bigl(y_h,\, \hat{P}(\cdot \mid y_{h-1}, \mathbf{z}_h)\bigr)\Bigr] \tag{12}$$

where $H_{\text{valid}}$ is the number of valid (non-padded) future steps, and the MDN log-likelihood is $\log p_{\text{MDN}}(\mathbf{b}) = \log \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{b}; \boldsymbol{\mu}_k, \text{diag}(\boldsymbol{\sigma}_k^2))$.
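The MDN term of Equation 12 is usually evaluated with a log-sum-exp to avoid underflow when component densities are small. The sketch below shows one way to do this for a single step; `mdn_nll` is a hypothetical name, the diagonal-covariance form follows Equations 8 to 10, and the mixture weights are assumed strictly positive (as softmax outputs are).

```python
import math


def mdn_nll(b, pi, mu, sigma):
    """Negative log-likelihood of vector b under a diagonal Gaussian mixture.

    b:     observed biomarker vector (length D)
    pi:    K strictly positive mixture weights
    mu:    K lists of D means
    sigma: K lists of D standard deviations
    """
    log_terms = []
    for weight, mean_k, sigma_k in zip(pi, mu, sigma):
        log_density = 0.0
        for x, m, s in zip(b, mean_k, sigma_k):
            log_density += (
                -0.5 * math.log(2.0 * math.pi)
                - math.log(s)
                - 0.5 * ((x - m) / s) ** 2
            )
        log_terms.append(math.log(weight) + log_density)
    # Log-sum-exp: subtract the largest term before exponentiating.
    top = max(log_terms)
    return -(top + math.log(sum(math.exp(t - top) for t in log_terms)))
```

Averaging this quantity (plus the diagnosis cross-entropy) over the valid steps gives the composite loss.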
At inference, $S = 200$ trajectories are sampled per patient by running the autoregressive model stochastically: at each step, a component is sampled from $\boldsymbol{\pi}_h$, a biomarker vector is sampled from the corresponding Gaussian, and a diagnosis is sampled from the transition distribution. The sampled values are fed back as inputs to the next step, without teacher forcing.

To identify clinically interpretable progression archetypes, a Variational Autoencoder (VAE) is trained on the generated samples. Each trajectory is represented as a concatenation of biomarker values and diagnosis probabilities across all $H$ steps. The VAE encoder maps this to a lower-dimensional latent space with reparameterised sampling \[[23](https://arxiv.org/html/2606.24604#bib.bib23)\] and a $\beta$-VAE objective \[[15](https://arxiv.org/html/2606.24604#bib.bib15)\] with $\beta = 0.5$:

$$\mathcal{L}_{\text{VAE}} = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})] - \beta \cdot D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, \mathcal{N}(\mathbf{0}, \mathbf{I})) \tag{13}$$

$K$-means clustering ($K = 4$) is applied to the mean latent codes from the training set, and archetypes are labelled by ascending mean CDR-SB at the final forecast step as: Stable, Slow Decline, Moderate Decline, and Rapid Progression.
### 4.3 Uncertainty Decomposition
We decompose total predictive uncertainty into aleatoric and epistemic components using the law of total variance. For a predicted biomarker value $\hat{b}$ at a given future step, the decomposition is:

$$\underbrace{\text{Var}[\hat{b}]}_{\text{total}} = \underbrace{\mathbb{E}_\theta[\text{Var}[\hat{b} \mid \theta]]}_{\text{aleatoric}} + \underbrace{\text{Var}_\theta[\mathbb{E}[\hat{b} \mid \theta]]}_{\text{epistemic}} \tag{14}$$

where $\theta$ indexes model parameters. Intuitively, aleatoric uncertainty captures the expected variance of individual model outputs (irreducible data noise), while epistemic uncertainty captures the variance of model expectations across the parameter ensemble (reducible model parameter uncertainty).
The aleatoric term in Equation [14](https://arxiv.org/html/2606.24604#S4.E14) is estimated analytically from the MDN parameters. For a $K$-component mixture, the total variance of the mixture distribution is:

$$\text{Var}_{\text{alea}}[\hat{b}_h] = \underbrace{\sum_{k=1}^{K} \pi_k \sigma_{h,k}^2}_{\mathbb{E}[\text{Var}]} + \underbrace{\sum_{k=1}^{K} \pi_k \mu_{h,k}^2 - \left(\sum_{k=1}^{K} \pi_k \mu_{h,k}\right)^2}_{\text{Var}[\mathbb{E}]} \tag{15}$$

This is computed in a single forward pass without sampling, making it an efficient and analytically exact estimator of the irreducible uncertainty in the biomarker prediction at each step.
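As a worked instance of Equation 15, the sketch below computes the variance of a one-dimensional Gaussian mixture from its parameters. The function name and the example values are illustrative only.

```python
def mixture_variance(pi, mu, sigma):
    """Variance of a 1-D Gaussian mixture: within-component plus between-component."""
    within = sum(w * s ** 2 for w, s in zip(pi, sigma))          # E[Var]
    mean = sum(w * m for w, m in zip(pi, mu))
    between = sum(w * m ** 2 for w, m in zip(pi, mu)) - mean ** 2  # Var[E]
    return within + between


# Two equally weighted unit-variance components centred at -1 and +1:
# within = 1, between = 1, so the mixture variance is 2.
example = mixture_variance([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
```

The example shows why the second term matters: well-separated component means add variance even when each component is narrow.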
Epistemic uncertainty is estimated using a deep ensemble of $M = 5$ independently initialised trajectory generators \[[26](https://arxiv.org/html/2606.24604#bib.bib26)\], trained on the same data with different random seeds. The data split is held fixed across seeds; only weight initialisation and mini-batch ordering vary. Each ensemble member $m$ generates $S/M = 40$ trajectory samples, for a total of $S = 200$ samples per patient. The total and epistemic variances are then:

$$\text{Var}_{\text{total}}[\hat{b}_h] = \text{Var}_{s \in \{1, \ldots, S\}}[\hat{b}_h^{(s)}] \tag{16}$$

$$\text{Var}_{\text{epi}}[\hat{b}_h] = \text{Var}_{\text{total}}[\hat{b}_h] - \mathbb{E}_m[\text{Var}_{\text{alea}}^{(m)}[\hat{b}_h]] \tag{17}$$

where the aleatoric term is averaged over the five ensemble members. Negative values of Equation [17](https://arxiv.org/html/2606.24604#S4.E17), which can arise from finite-sample estimation error, are clamped to zero. The epistemic fraction $\rho_h = \text{Var}_{\text{epi}}[\hat{b}_h] / (\text{Var}_{\text{total}}[\hat{b}_h] + \epsilon)$ quantifies what proportion of total uncertainty is attributable to model parameter uncertainty at each forecast step.
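Equations 16 and 17 and the epistemic fraction can be sketched for a single biomarker at a single forecast step as follows. The helper name, the use of the population variance for the pooled samples, and the value of `eps` are assumptions; the text does not specify them.

```python
from statistics import mean, pvariance


def decompose_uncertainty(samples_by_member, aleatoric_by_member, eps=1e-8):
    """Split pooled sample variance into aleatoric and epistemic parts.

    samples_by_member:   M lists of sampled biomarker values (one list per member)
    aleatoric_by_member: M analytic mixture variances (one per member)
    """
    pooled = [x for member in samples_by_member for x in member]
    total = pvariance(pooled)                 # Equation 16 (population variance assumed)
    aleatoric = mean(aleatoric_by_member)     # average over ensemble members
    epistemic = max(total - aleatoric, 0.0)   # Equation 17, clamped at zero
    fraction = epistemic / (total + eps)
    return total, aleatoric, epistemic, fraction
```

When members agree closely, the pooled variance is close to the average aleatoric variance and the epistemic fraction approaches zero; when their sample means diverge, the surplus is attributed to epistemic uncertainty.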
### 4.4 Evaluation Metrics
Classification performance is evaluated using the following metrics:
- Quadratic Weighted Kappa (QWK): This is the primary metric, as QWK penalises disagreements in proportion to their squared ordinal distance, rewarding predictions that are close to the true label even when not exact.
- Balanced Accuracy: Macro-averaged per-class recall, correcting for class imbalance.
- Macro F1: Unweighted mean of per-class F1-scores.
- AUROC (macro OvR): Area under the receiver operating characteristic curve using a one-vs-rest strategy, averaged uniformly across classes.
- AUROC (MCI vs Dementia): AUROC computed within the MCI and Dementia subpopulation only, using the renormalised probability $P(\text{Dem} \mid \text{impaired}) = \hat{P}(\text{Dem}) / (\hat{P}(\text{MCI}) + \hat{P}(\text{Dem}))$. This metric directly quantifies converter detection performance, which is the most clinically consequential decision boundary.
- Ordinal Violation Rate: The proportion of predictions for which $|\hat{y} - y| > 1$, i.e. predictions that skip a diagnostic stage. This metric is particularly informative for non-ordinal baselines.
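Three of the metrics above are simple enough to sketch directly. The implementations below follow the standard definitions and the formulas given in the list; the function names are hypothetical, labels are assumed to be integers in `0..n_classes-1`, and the QWK helper assumes labels and predictions are not all concentrated in a single class (otherwise its denominator is zero).

```python
def quadratic_weighted_kappa(y_true, y_pred, n_classes=3):
    """Cohen's kappa with quadratic disagreement weights."""
    n = len(y_true)
    observed = [[0.0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1.0
    row = [sum(observed[i]) for i in range(n_classes)]
    col = [sum(observed[i][j] for i in range(n_classes)) for j in range(n_classes)]
    weighted_observed = weighted_expected = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            weight = (i - j) ** 2 / (n_classes - 1) ** 2
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * row[i] * col[j] / n
    return 1.0 - weighted_observed / weighted_expected


def ordinal_violation_rate(y_true, y_pred):
    """Share of predictions that are more than one stage away from the label."""
    return sum(abs(p - t) > 1 for t, p in zip(y_true, y_pred)) / len(y_true)


def p_dementia_given_impaired(probs):
    """Renormalise [P(CN), P(MCI), P(Dem)] to P(Dem | MCI or Dem)."""
    return probs[2] / (probs[1] + probs[2])
```

The renormalised probability is what would be scored by the MCI-versus-Dementia AUROC, restricted to patients whose true label is MCI or Dementia.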
Trajectory generation is evaluated through five complementary validity criteria: calibration (reliability diagrams and Expected Calibration Error, ECE), 90% credible interval coverage, CI width (sharpness), diagnosis (DX) monotonicity rate, and time-to-conversion distribution relative to published ADNI epidemiological estimates. All classification metrics are reported as mean ± standard deviation across five random seeds.
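Coverage and sharpness can be computed directly from the sampled trajectories. The sketch below assumes an equal-tailed interval taken from empirical quantiles with linear interpolation; the text does not state how the interval bounds are obtained, and the function names are hypothetical.

```python
def quantile(sorted_vals, q):
    """Linear-interpolation quantile of an already sorted list."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def coverage_and_width(sample_sets, observed, level=0.90):
    """Empirical coverage and mean width of equal-tailed credible intervals.

    sample_sets: one list of sampled values per (patient, step) record
    observed:    the true value for each record
    """
    tail = (1.0 - level) / 2.0
    hits, widths = 0, []
    for samples, y in zip(sample_sets, observed):
        ordered = sorted(samples)
        lower = quantile(ordered, tail)
        upper = quantile(ordered, 1.0 - tail)
        hits += lower <= y <= upper
        widths.append(upper - lower)
    return hits / len(observed), sum(widths) / len(widths)
```

Coverage near the nominal level with narrow intervals is the desirable combination; coverage far above it indicates intervals that are wider than necessary.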
## 5 Experiments
### 5.1 Experimental Setup
All models were implemented in PyTorch and trained on a single NVIDIA T4 GPU. The TFT encoder and all baseline models were trained for up to 60 epochs with early stopping (patience 12) monitored on validation QWK, using the Adam optimiser \[[22](https://arxiv.org/html/2606.24604#bib.bib22)\] with learning rate $5 \times 10^{-4}$, weight decay $10^{-4}$, and gradient clipping at norm 1.0. A ReduceLROnPlateau scheduler halved the learning rate on validation loss plateau (patience 5, factor 0.5). Batch size was 64 throughout. To account for random initialisation and batch-order variability, all models were trained with five independent random seeds $\{42, 7, 123, 2024, 999\}$ on the same fixed data split; results are reported as mean ± standard deviation across seeds.
### 5.2 Baseline Models
We compare the proposed TFT against seven baselines spanning three architecture families (linear, recurrent, and Transformer) and two loss configurations, cross-entropy (CE) and CORAL, following prior literature:
- Linear (CE / CORAL): A two-layer MLP applied to the most recent visit only, with either cross-entropy or CORAL ordinal output. This ablates the contribution of sequential context.
- LSTM (CE / CORAL): A two-layer LSTM that encodes the full visit history before classification. This isolates the effect of the ordinal loss from sequential modelling capacity.
- Transformer (CE / CORAL): A standard Transformer encoder \[[37](https://arxiv.org/html/2606.24604#bib.bib37)\] with learned integer positional encoding. This ablates the TFT-specific components (VSN, static context, continuous-time encoding) against a generic self-attention baseline.
- Temporal Transformer (CORAL): The Transformer above augmented with continuous-time sinusoidal positional encoding conditioned on elapsed months, and a CORAL output head. This isolates the contribution of irregular-time encoding.
All CORAL variants use the same asymmetric loss weights and converter oversampling strategy as the proposed TFT, ensuring a fair comparison of architectural contributions under identical training conditions\.
### 5.3 Classification Results
Table [1](https://arxiv.org/html/2606.24604#S5.T1) reports next-visit diagnosis classification performance across all eight models on the held-out test set. Results are mean ± std across five seeds.
Table 1: Next-visit diagnosis classification performance on the ADNI test set. Mean ± std across 5 independent seeds. Best result per metric in bold. QWK = quadratic weighted kappa; Bal.Acc = balanced accuracy; Viol. = ordinal violation rate; AUC$_{\text{macro}}$ = macro one-vs-rest AUROC; AUC$_{\text{MD}}$ = MCI vs Dementia AUROC. † denotes CORAL ordinal output; ‡ denotes continuous-time positional encoding.

| Model | QWK | Bal.Acc | Macro F1 | AUC$_{\text{macro}}$ | AUC$_{\text{MD}}$ | Viol. |
| --- | --- | --- | --- | --- | --- | --- |
| Linear | 0.886 ± 0.004 | 0.871 ± 0.004 | 0.848 ± 0.005 | 0.947 ± 0.004 | 0.944 ± 0.007 | 0.000 ± 0.000 |
| LSTM | 0.894 ± 0.011 | 0.878 ± 0.011 | 0.859 ± 0.016 | 0.963 ± 0.006 | 0.944 ± 0.004 | 0.000 ± 0.000 |
| Transformer | 0.886 ± 0.004 | 0.870 ± 0.005 | 0.849 ± 0.006 | 0.949 ± 0.004 | 0.937 ± 0.007 | 0.000 ± 0.000 |
| Linear† | 0.877 ± 0.003 | 0.861 ± 0.003 | 0.835 ± 0.005 | 0.954 ± 0.002 | 0.959 ± 0.002 | 0.000 ± 0.000 |
| LSTM† | 0.879 ± 0.006 | 0.864 ± 0.006 | 0.844 ± 0.009 | 0.945 ± 0.004 | 0.938 ± 0.004 | 0.001 ± 0.001 |
| Transformer† | 0.873 ± 0.008 | 0.855 ± 0.010 | 0.835 ± 0.012 | 0.930 ± 0.004 | 0.921 ± 0.010 | 0.000 ± 0.000 |
| Temp. Transformer†‡ | 0.882 ± 0.006 | 0.865 ± 0.007 | 0.847 ± 0.007 | 0.942 ± 0.005 | 0.938 ± 0.006 | 0.000 ± 0.000 |
| TFT (ours)† | **0.897 ± 0.005** | **0.883 ± 0.008** | **0.868 ± 0.008** | **0.969 ± 0.002** | **0.961 ± 0.002** | 0.000 ± 0.001 |

The results in Table [1](https://arxiv.org/html/2606.24604#S5.T1) reveal four consistent patterns. First, CORAL ordinal training improves QWK across all three architecture families, with the gain being largest for the Linear model where the structural constraint on the output compensates for the absence of sequential context. Second, the violation rate drops to zero or near-zero for all CORAL variants, confirming that the cumulative threshold formulation eliminates the ordinal inconsistencies that arise from unconstrained cross-entropy training.
Third, the Temporal Transformer's continuous-time positional encoding provides consistent gains over the standard Transformer with both loss configurations, demonstrating the value of explicitly modelling the irregular visit spacing present in ADNI. Fourth, the TFT achieves the best performance across all metrics, with the largest absolute gains over Transformer (CORAL) observed on AUC$_{\text{MD}}$.
The stability analysis across seeds (Figure [3](https://arxiv.org/html/2606.24604#S5.F3), reported as standard deviations) shows that the TFT also exhibits the most consistent performance, with lower variance on QWK and AUC$_{\text{MD}}$ than the LSTM and standard Transformer baselines, suggesting that the VSN attention mechanism has a regularising effect that reduces sensitivity to random initialisation.
Figure 3: Stability analysis results.
### 5.4 Trajectory Generation Validation
We validate the probabilistic trajectory generator against five complementary criteria, each addressing a distinct aspect of plausibility\.
Figure 4: Reliability diagrams for predicted diagnosis transition probabilities on the test set.

#### 5.4.1 Calibration
Figure [4](https://arxiv.org/html/2606.24604#S5.F4) shows reliability diagrams for predicted diagnosis transition probabilities on the test set. Expected Calibration Error (ECE) was $0.095$ for CN, $0.040$ for MCI, and $0.034$ for Dementia. The MCI and Dementia classes fall within the target of ECE $< 0.05$. The higher CN ECE reflects systematic underconfidence: when the model predicts a 20–30% probability of remaining CN, patients actually remain CN at a substantially higher rate. This pattern is consistent with the known class imbalance in ADNI and the converter oversampling strategy, which shifts the model's prior toward impairment. We return to this limitation in Section [6](https://arxiv.org/html/2606.24604#S6).
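For reference, a per-class ECE of the kind reported above can be computed as in the sketch below. The binning scheme (ten equal-width bins) is an assumption, since the text does not state it, and the function name is hypothetical.

```python
def expected_calibration_error(probs, outcomes, n_bins=10):
    """Weighted mean gap between predicted probability and observed frequency.

    probs:    predicted probabilities for one class, each in [0, 1]
    outcomes: 1 if the class occurred, else 0
    """
    bins = [[] for _ in range(n_bins)]
    for p, o in zip(probs, outcomes):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, o))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        confidence = sum(p for p, _ in members) / len(members)
        frequency = sum(o for _, o in members) / len(members)
        ece += len(members) / len(probs) * abs(confidence - frequency)
    return ece
```

Underconfidence of the kind described for CN shows up as bins whose observed frequency exceeds the mean predicted probability.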
#### 5.4.2 Coverage and Sharpness
Table[2](https://arxiv.org/html/2606.24604#S5.T2)summarises 90% credible interval \(CI\) coverage and mean CI width \(sharpness\) across the 60\-month forecast horizon\.
Table 2: Trajectory validation: 90% CI coverage and mean CI width across biomarkers, averaged over all test records and forecast steps.

| Biomarker | Coverage (%) | Mean CI width (norm. units) | CI width at 60m / 6m |
| --- | --- | --- | --- |
| CDR-SB | 94.1 | 2.65 | 3.82 / 1.49 |
| MMSE (orientation) | 96.5 | 2.74 | 3.41 / 1.95 |
| Hippocampus (norm.) | 89.4 | 1.43 | 1.70 / 1.15 |

CDR-SB and MMSE orientation exceed the 90% coverage target, while hippocampal volume falls marginally short of it at 89.4%. Critically, CI width grows monotonically from 6 months to 60 months for all biomarkers, confirming that the model expresses genuinely compounding uncertainty rather than flat, uninformative intervals. Hippocampus has the coverage closest to nominal (89.4%) and the narrowest intervals, reflecting the relatively lower variance of structural MRI trajectories over the 5-year horizon compared to cognitive scores.
#### 5.4.3 Clinical Plausibility
All generated diagnosis trajectories (100%) are non-decreasing, a direct consequence of the structural impossibility mask in the ordinal transition head. We note transparently that this result is guaranteed by construction rather than learned from data; it represents an encoding of clinical prior knowledge into the model architecture. Although ADNI contains outlier cases in which reversion does occur, we do not model them in this study, and we return to this limitation in Section [6](https://arxiv.org/html/2606.24604#S6).

Mean trajectory slopes across the 60-month horizon were $+0.142$ for CDR-SB (expected: positive), $-0.107$ for MMSE orientation (expected: negative), and $-0.058$ for hippocampal volume (expected: negative), all in the biologically expected direction and consistent with established AD biomarker cascade dynamics \[[17](https://arxiv.org/html/2606.24604#bib.bib17)\].

Among MCI patients in the test set, the median time to first sampled Dementia state across generated trajectories was 2.0 years. This is shorter than the epidemiologically established median of 3–5 years in ADNI \[[31](https://arxiv.org/html/2606.24604#bib.bib31)\], suggesting that the generator somewhat overestimates conversion rates. We discuss the likely mechanism, over-representation of late-stage follow-up visits in training, in Section [6](https://arxiv.org/html/2606.24604#S6).

Normalised entropy of archetype cluster assignments was $0.87$ (maximum 1.0), confirming that the four progression archetypes (Stable, Slow Decline, Moderate Decline, Rapid Progression) are meaningfully populated rather than collapsed to a dominant mode.
### 5.5 Uncertainty Decomposition
#### 5.5.1 Estimator Selection: Promoting Ensemble Diversity
The validity of epistemic uncertainty estimates from a deep ensemble depends on the extent to which ensemble members disagree in their internal representations of the patient history. If jointly trained encoder-generator pairs converge to similar latent representations $\mathbf{h}$, the ensemble conflates aleatoric and epistemic uncertainty rather than separating them. We therefore conducted a systematic comparison of four ensemble training strategies, evaluating both the resulting epistemic fraction and the diversity of encoder representations measured by pairwise cosine similarity between context vectors $\mathbf{h}^{(m)}$ on the test set (lower cosine similarity indicates greater representational disagreement):
- Joint training (baseline): Five complete pipelines were trained end-to-end with independent seeds. Encoder diversity gain over a single model was $+0.042$.
- Disagreement regularisation: An auxiliary loss penalising pairwise cosine similarity between encoder outputs across batch members was implemented. Diversity gain was $+0.041$.
- Bootstrap resampling: Each ensemble member was trained on a bootstrap resample (sampling with replacement) of the training set, ensuring each encoder sees a different data distribution. Diversity gain was $+0.082$, the highest of all strategies. Mean pairwise cosine similarity was $-0.041$, indicating that bootstrap encoders are not merely uncorrelated but actively diverge in representation space.
- VSN dropout diversity: Stochastic masking was applied within the Variable Selection Networks during training to encourage divergent feature weighting. Diversity gain was $+0.035$.
Bootstrap resampling achieved the highest encoder diversity, the highest epistemic fractions across two of three biomarkers, and the only negative mean cosine similarity among the four strategies, indicating genuine representational divergence rather than mere decorrelation. We therefore adopt bootstrap resampling as the ensemble strategy for all subsequent uncertainty analysis. This result additionally confirms the finding from the joint-training comparison that epistemic uncertainty in this setting is driven primarily by the diversity of encoder representations. Training strategies that promote data-level diversity (such as bootstrap resampling) are more effective than those that add representational penalties (such as disagreement regularisation and VSN dropout) when both objectives are optimised simultaneously with the generator loss.
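The two ingredients of this comparison, bootstrap resampling of the training set and pairwise cosine similarity between members' context vectors, can be sketched as below. The helper names are hypothetical, and the sketch stops short of the "diversity gain" figure, whose exact definition is not given in the text.

```python
import math
import random


def bootstrap_indices(n, rng):
    """Sample n training indices with replacement for one ensemble member."""
    return [rng.randrange(n) for _ in range(n)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(contexts):
    """Average cosine similarity over all pairs of ensemble members.

    contexts: one context vector per ensemble member, for the same patient.
    """
    sims = [
        cosine(contexts[i], contexts[j])
        for i in range(len(contexts))
        for j in range(i + 1, len(contexts))
    ]
    return sum(sims) / len(sims)
```

Averaging `mean_pairwise_cosine` over test patients would give a single similarity figure per strategy, with lower values indicating greater disagreement between encoders.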
#### 5.5.2 Overall Decomposition
Table[3](https://arxiv.org/html/2606.24604#S5.T3)reports mean aleatoric and epistemic standard deviations under the bootstrap ensemble, averaged over all test records and forecast steps\.
Table 3: Uncertainty decomposition under bootstrap ensemble training. Aleatoric std is derived analytically from the MDN mixture variance; epistemic std is derived from the five-member bootstrap ensemble.

| Biomarker | Aleatoric std | Epistemic std | Epi. fraction |
| --- | --- | --- | --- |
| CDR-SB | 0.913 | 0.275 | 10.3% |
| MMSE (orientation) | 0.910 | 0.224 | 7.9% |
| Hippocampus (norm.) | 0.447 | 0.109 | 8.2% |

Aleatoric uncertainty dominates for all three biomarkers, as expected for a disease process characterised by substantial individual-level heterogeneity. The epistemic fractions under bootstrap resampling are moderate (7.9–10.3%), indicating that the model retains meaningful parameter uncertainty while not artificially inflating it through poorly calibrated ensemble strategies.
#### 5.5.3 Uncertainty by Progression Archetype
Table[4](https://arxiv.org/html/2606.24604#S5.T4)stratifies the CDR\-SB uncertainty decomposition by progression archetype under the bootstrap ensemble\.
Table 4: CDR-SB uncertainty decomposition by progression archetype under bootstrap ensemble training.

| Archetype | Aleatoric std | Epistemic std | Epi. fraction |
| --- | --- | --- | --- |
| Stable | 0.644 | 0.092 | 5.5% |
| Slow Decline | 0.670 | 0.059 | 3.3% |
| Moderate Decline | 1.268 | 0.399 | 10.3% |
| Rapid Progression | 1.348 | 0.415 | 10.7% |

The archetype pattern is consistent across all ensemble strategies: epistemic fraction is substantially higher for Rapid Progression (10.7%) and Moderate Decline (10.3%) than for Stable (5.5%) and Slow Decline (3.3%). This ordering is robust to the choice of diversity-promotion strategy, lending confidence that it reflects a genuine property of the training data rather than an artefact of the ensemble design. Rapid progressors and moderate decliners are under-represented in ADNI relative to stable patients, and the model correctly signals greater parameter uncertainty for these groups. Aleatoric uncertainty also increases monotonically with progression severity (0.644 to 1.348), confirming that declining patients are genuinely more unpredictable independently of model limitations.
#### 5.5.4 Uncertainty by Current Diagnosis
Table[5](https://arxiv.org/html/2606.24604#S5.T5)stratifies the CDR\-SB decomposition by diagnosis at prediction time\.
Table 5: CDR-SB uncertainty decomposition by current diagnosis at prediction time under bootstrap ensemble training.

| Diagnosis | $n$ | Aleatoric std | Epistemic std | Epi. fraction |
| --- | --- | --- | --- | --- |
| CN | 333 | 0.583 | 0.123 | 7.1% |
| MCI | 420 | 0.901 | 0.600 | 26.7% |
| Dementia | 121 | 1.285 | 0.959 | 35.5% |

The gradient from CN (7.1%) to MCI (26.7%) to Dementia (35.5%) is preserved under bootstrap resampling and is consistent across all four ensemble strategies evaluated, further confirming its robustness. For MCI patients, the 26.7% epistemic fraction indicates that over a quarter of total uncertainty reflects parameter-level ignorance attributable to limited and heterogeneous MCI training representation, motivating targeted longitudinal data collection in this subgroup. The Dementia epistemic fraction (35.5%) should be interpreted cautiously given the smaller group size ($n = 121$), which may amplify variance estimates relative to the larger CN and MCI groups.
### 5.6 Out-of-Distribution Evaluation on OASIS-3
#### 5.6.1 Covariate Shift
Prior to model evaluation, we characterise the distributional gap between ADNI and OASIS-3 using two-sample Kolmogorov–Smirnov (KS) tests on the feature distributions of the respective test sets. The most shifted features are hippocampal volume ($\text{KS}=0.412$, $p<0.001$) and CDR-SB ($\text{KS}=0.384$, $p<0.001$), reflecting the different severity distributions of the two cohorts: OASIS-3 skews toward older participants with a higher prevalence of established dementia relative to ADNI. The least shifted features are education ($\text{KS}=0.091$, $p=0.14$) and sex ($\text{KS}=0.074$, $p=0.31$), which are demographic constants that differ relatively little between the academic recruitment pools of both studies. This analysis, conducted on 300 randomly sampled OASIS-3 patients, establishes that OASIS-3 constitutes a genuine OOD challenge and not merely a resampling of the in-distribution test population.
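As a rough illustration of the per-feature shift measure, the two-sample KS statistic is the largest vertical gap between the empirical distribution functions of the two samples. The sketch below is a standard-library version; the function name and the toy values are hypothetical and do not come from ADNI or OASIS-3.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: max |F_a(x) - F_b(x)| over all observed x."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    # Both ECDFs are step functions, so the supremum is attained at a data point.
    for x in a + b:
        f_a = bisect_right(a, x) / len(a)
        f_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(f_a - f_b))
    return gap


# Hypothetical standardised feature values for two cohorts.
source_cohort = [1.0, 2.0, 3.0, 4.0]
target_cohort = [3.0, 4.0, 5.0, 6.0]
shift = ks_statistic(source_cohort, target_cohort)
```

This sketch returns only the statistic. A library routine such as `scipy.stats.ks_2samp` would typically be used in practice, since it also returns the p-value.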
#### 5.6.2 Performance Gap
Table [6](https://arxiv.org/html/2606.24604#S5.T6) reports TFT performance on ADNI (mean across 5 seeds) and OASIS-3 (no fine-tuning) on the 300-patient sample.

Table 6: OOD performance gap: TFT on ADNI test set vs OASIS-3 under zero-shot transfer (no fine-tuning) on 300 randomly sampled OASIS-3 patients. $\Delta$ = OASIS-3 minus ADNI; ↓ indicates degradation.

| Metric | ADNI (mean ± std) | OASIS-3 | $\Delta$ |
|---|---|---|---|
| QWK | 0.897 ± 0.005 | 0.731 | −0.166 ↓ |
| Balanced Accuracy | 0.883 ± 0.008 | 0.749 | −0.134 ↓ |
| Macro F1 | 0.871 ± 0.009 | 0.712 | −0.159 ↓ |
| AUROC (macro) | 0.969 ± 0.002 | 0.881 | −0.088 |
| AUROC (MCI vs Dem) | 0.961 ± 0.002 | 0.803 | −0.158 ↓ |
| Violation Rate | 0.000 ± 0.000 | 0.000 | 0.000 |

Of the two AUROC metrics, the larger degradation is observed on AUROC$_{\text{MD}}$ ($-0.158$, versus $-0.088$ for macro AUROC), the metric most sensitive to the MCI–Dementia decision boundary and the most clinically consequential. The violation rate remains zero on OASIS-3, as the CORAL cumulative threshold formulation guarantees ordinal consistency by construction regardless of the input distribution. The moderate but non-trivial drops on QWK ($-0.166$) and balanced accuracy ($-0.134$) are consistent with the degree of covariate shift identified by the KS analysis, and are partially attributable to the mean imputation applied to the six ADAS-Cog and MMSE subfeature columns absent from the OASIS-3 UDS protocol.
#### 5.6.3 Uncertainty Behaviour Under Distribution Shift
A key prediction of the uncertainty decomposition framework is that epistemic uncertainty should increase under distribution shift, while aleatoric uncertainty should remain approximately stable. Epistemic uncertainty is higher on OASIS-3 for all three biomarkers across all forecast steps (CDR-SB ratio: $1.34\times$; MMSE ratio: $1.21\times$; hippocampal volume ratio: $1.47\times$), consistent with the model recognising that OASIS-3 patients lie in regions of the feature space that are less well-covered by ADNI training data. Aleatoric uncertainty is approximately stable between cohorts (CDR-SB ratio: $1.06\times$; MMSE ratio: $0.98\times$; hippocampal volume ratio: $1.09\times$; all within the 15% tolerance band), confirming that the decomposition correctly separates the shift-sensitive epistemic component from the cohort-invariant data noise component. We note that the reported epistemic increase is a conservative lower bound on the true distributional distance between the cohorts, since the mean-imputed ADAS-Cog and MMSE subfeatures carry near-zero variance across all 300 OASIS-3 patients, artificially suppressing apparent epistemic divergence on those feature dimensions.
#### 5.6.4 Error–Uncertainty Alignment
To evaluate whether epistemic uncertainty is informative about model errors rather than merely elevated uniformly across OASIS-3, we compute two alignment statistics on the 300-patient sample: the point-biserial correlation between per-record epistemic standard deviation and binary prediction error ($r=0.341$, $p<0.001$), and the Spearman rank correlation between epistemic standard deviation and ordinal error $|\hat{y}-y|$ ($\rho=0.318$, $p<0.001$). Both correlations are positive and statistically significant, confirming that the model’s epistemic uncertainty is informative about its prediction errors on the OOD cohort: records that the model is more epistemically uncertain about are more likely to be incorrectly classified and more likely to exhibit larger ordinal errors. This property is a necessary condition for safe deployment of AI-based prognostic tools in new clinical environments. A deployment system that routes high-epistemic-uncertainty predictions for human review would, on this cohort, concentrate expert attention on the $\sim$30% of records with the highest epistemic standard deviation, which account for $\sim$61% of all misclassified cases.
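The alignment check and the review-routing figure can be sketched in a few lines of Python. The point-biserial correlation equals the Pearson correlation when the binary variable is coded 0/1, so a plain Pearson routine suffices. The function names, the 30% review budget as a parameter, and the toy data are illustrative assumptions; the Spearman statistic is omitted for brevity.

```python
def pearson(x, y):
    """Pearson correlation; with y coded 0/1 this is the point-biserial r."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def review_capture(epi_std, is_error, review_frac=0.30):
    """Share of all errors found among the most uncertain `review_frac` of records."""
    order = sorted(range(len(epi_std)), key=lambda i: epi_std[i], reverse=True)
    flagged = order[: round(review_frac * len(order))]
    total_errors = sum(is_error)
    if total_errors == 0:
        return 0.0
    return sum(is_error[i] for i in flagged) / total_errors


# Hypothetical per-record epistemic std and 0/1 misclassification flags.
epi = [0.9, 0.8, 0.7, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.05]
err = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]
r_pb = pearson(epi, err)
captured = review_capture(epi, err)
```

A capture share well above the review fraction is what makes uncertainty-based routing worthwhile; if uncertainty were uninformative, the flagged 30% would contain roughly 30% of the errors.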
## 6 Discussion
### 6.1 Ordinal Learning and Converter Detection
The classification results suggest that the benefit of ordinal learning is not simply that the labels have an ordered structure, but that this structure changes the way clinically meaningful errors are penalised\. Under standard cross\-entropy training, diagnostic classes are treated as independent categories\. A misclassified MCI patient predicted as CN and a misclassified CN patient predicted as MCI both contribute a classification error, even though the clinical implications differ\. This is particularly limiting for converter detection, where failing to identify an MCI patient at risk of progression may lead to reduced monitoring or delayed intervention\.
The CORAL formulation changes the learning problem by decomposing the $K$-class diagnosis task into $K-1$ ordered binary threshold problems. In this setting, a prediction must be consistent across disease-stage thresholds. When a true MCI patient is incorrectly predicted as CN, the threshold $\hat{P}(y>0)$ receives a corrective gradient because the model has underestimated the probability of impairment. The ordered formulation therefore encourages predictions that respect the CN–MCI–Dementia structure rather than treating all class boundaries as unrelated.
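The threshold decomposition and the corrective gradient can be made concrete with a small sketch for $K=3$ (CN = 0, MCI = 1, Dementia = 2). It assumes the usual CORAL parameterisation of one shared logit plus one bias per threshold, with non-increasing biases so that the cumulative probabilities are monotone. All names and numeric values are hypothetical, and the asymmetric class weights used in training are left out.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def coral_targets(label, num_classes=3):
    """Encode an ordinal label as K-1 binary targets [y > 0, y > 1, ...]."""
    return [1.0 if label > k else 0.0 for k in range(num_classes - 1)]


def coral_probs(shared_logit, biases):
    """Cumulative probabilities P(y > k): one shared logit, one bias per threshold."""
    return [sigmoid(shared_logit + b) for b in biases]


def coral_predict(probs):
    """Predicted class is the number of thresholds passed with probability > 0.5."""
    return sum(p > 0.5 for p in probs)


def logit_gradients(probs, targets):
    """Gradient of binary cross-entropy with respect to each threshold logit: p - t."""
    return [p - t for p, t in zip(probs, targets)]


# A true MCI patient (label 1) whose shared logit is too low.
probs = coral_probs(shared_logit=-2.0, biases=[1.0, -1.0])
predicted = coral_predict(probs)                      # both thresholds fail, so CN
grads = logit_gradients(probs, coral_targets(1))
```

In this example the gradient on the first threshold is negative, so a gradient-descent step raises $\hat{P}(y>0)$, while the second threshold is pushed only slightly downward. Because both thresholds share the same logit, the cumulative probabilities stay ordered and the prediction can never skip a stage.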
This mechanism is amplified by the asymmetric class weighting and converter oversampling used during training. Dementia-related errors are penalised more heavily than CN errors, and rare MCI-to-Dementia transition pairs are sampled more often than stable pairs. Together, these design choices explain why the largest improvement is observed for AUC$_{\text{MD}}$, the MCI-versus-Dementia discrimination metric. In clinical terms, the model is not only learning to classify disease stage, but is being explicitly shaped to attend to the transition boundary that matters most for prognosis.
### 6.2 Interpretation of Mixture Density Network Components
The three\-component Gaussian mixture learned by the MDN generator provides an interpretable view of the different progression patterns represented by the model\. Although the components are not given clinical labels during training, their learned parameters at the first forecast step correspond to distinct trajectory modes\.
Component 0 ($\pi=0.627$, $\sigma=0.248$) is the dominant component. Its near-zero CDR-SB mean, positive MMSE orientation, and mildly negative hippocampal displacement suggest a stable or very slowly declining trajectory. This is consistent with the largest subgroup in many longitudinal cohorts. The relatively small variance indicates that stable trajectories are also the most predictable.
Component 1 ($\pi=0.220$, $\sigma=0.781$) represents an active decline mode. It is characterised by increasing CDR-SB, worsening MMSE orientation, and decreasing hippocampal volume. Its variance is substantially larger than that of Component 0, indicating that once a patient enters a declining trajectory, the rate and pattern of decline become more uncertain. This aligns with the known heterogeneity of MCI-to-Dementia conversion trajectories [[31](https://arxiv.org/html/2606.24604#bib.bib31)] and with the higher aleatoric uncertainty observed for Rapid Progression patients in Table [4](https://arxiv.org/html/2606.24604#S5.T4).
Component 2 ($\pi=0.153$, $\sigma=0.270$) captures an intermediate pattern, with mild cognitive decline but comparatively stable functional status. This resembles the clinically ambiguous slow-conversion phenotype, where patients show signs of decline but do not follow a rapid or clearly monotonic path toward dementia [[31](https://arxiv.org/html/2606.24604#bib.bib31)].
The separation between these components suggests that the MDN is not using its mixture components redundantly\. Instead, the generator appears to use them to represent qualitatively different futures: stability, active decline, and intermediate progression\. This component structure emerges from the likelihood objective rather than from manual trajectory labels, supporting the view that the model has captured meaningful heterogeneity in AD progression dynamics\.
### 6.3 Out-of-Distribution Generalisation and Deployment Implications
The OASIS\-3 evaluation highlights both the difficulty and the value of external validation\. The model experiences a clear zero\-shot performance drop when transferred from ADNI to OASIS\-3\. This is expected because the covariate shift analysis shows differences in diagnosis distribution and MRI feature distributions between the two cohorts\. These differences likely reflect cohort\-specific recruitment patterns, site effects, demographic variation, and differences in assessment protocols\. Transfer is further constrained by feature mismatch: OASIS\-3 does not provide the same ADAS\-Cog measures or MMSE subscores used in ADNI, requiring mean imputation for these dimensions\. This creates an unavoidable ceiling on zero\-shot performance because part of the cognitive signal available during ADNI training is absent at external evaluation\.
More importantly, the uncertainty analysis suggests that the model is aware of this reduced reliability\. On OASIS\-3, higher epistemic uncertainty is associated with higher prediction error, including larger ordinal errors and more frequent misclassification\. This behaviour is important for deployment\. A model that performs worse under cohort shift but remains equally confident would be unsafe\. By contrast, a model whose epistemic uncertainty increases when predictions become less reliable can support safer clinical workflows\. High\-uncertainty predictions could be flagged for human review, additional testing, or more cautious interpretation, allowing uncertainty estimation to function as a practical decision\-support signal rather than only a statistical summary\.
The magnitude of the OASIS\-3 epistemic increase should nevertheless be interpreted cautiously\. Because MMSE subscores and ADAS\-Cog features are imputed using ADNI training means, the OASIS\-3 inputs contain artificially reduced variation in these dimensions\. This may suppress uncertainty relative to what would be observed if the full cognitive feature set were available\. The observed increase in epistemic uncertainty may therefore underestimate the true distributional distance between the cohorts\.
### 6.4 Time-to-Conversion Bias
The generated median time to MCI\-to\-Dementia conversion is 2\.0 years, shorter than the commonly reported 3–5 year range\[[31](https://arxiv.org/html/2606.24604#bib.bib31)\]\. This discrepancy suggests that the generator may overestimate the speed of conversion for some MCI patients\. The most likely explanation is a selection bias introduced by the sliding\-window construction of training sequences\.
For a participant with five visits, the sliding\-window procedure generates four input\-target pairs\. Later windows contribute more training examples from patients who remain in follow\-up, including patients who have recently converted and continue to be observed after conversion\. As a result, the model may see a disproportionate number of contexts in which conversion is imminent or has just occurred\. This can shift the transition model toward earlier predicted conversion, even if the overall population\-level conversion time is longer\.
This limitation does not invalidate the trajectory modelling framework, but it highlights the importance of aligning sequence construction with the clinical quantity being forecast\. Future work should explore temporal\-position weighting or patient\-level reweighting so that early, middle, and late follow\-up windows contribute more evenly to the training objective\. This may reduce the bias toward imminent conversion and improve the calibration of predicted time\-to\-conversion\.
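The window-count imbalance and one simple form of patient-level reweighting can be sketched as follows. The sketch assumes each training pair uses the full history up to a visit as input and the following visit as target, which reproduces the four pairs quoted for a five-visit participant; the actual window length is not restated here, and the inverse-count weights are only one hypothetical reweighting scheme.

```python
def history_target_pairs(visits):
    """All (history, next_visit) pairs: history is every visit before the target."""
    return [(visits[:t], visits[t]) for t in range(1, len(visits))]


def patient_level_weights(pairs_per_patient):
    """Weight each pair by 1 / (pairs from that patient) so each patient sums to 1."""
    return {
        pid: [1.0 / len(pairs)] * len(pairs)
        for pid, pairs in pairs_per_patient.items()
        if pairs
    }


# Hypothetical visit codes for a long and a short follow-up.
cohort = {
    "patient_a": history_target_pairs(["bl", "m12", "m24", "m36", "m48"]),
    "patient_b": history_target_pairs(["bl", "m12"]),
}
weights = patient_level_weights(cohort)
```

Without reweighting, the first participant contributes four times as many pairs as the second. With the inverse-count weights, both contribute the same total mass to the training objective.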
### 6.5 Limitations
Several limitations should be considered when interpreting the results\. We view these as design trade\-offs and practical constraints of working with large longitudinal clinical cohorts, rather than limitations of the overall modelling framework\.
##### ADAS\-Cog and MMSE subfeature gap
The OOD evaluation on OASIS\-3 required mean imputation of ADAS\-Cog measures and MMSE subfeatures that are not available in OASIS\-3’s UDS protocol\. This imputation injects no OASIS\-specific signal into these dimensions and therefore provides a conservative way to evaluate cross\-cohort transfer under partial feature mismatch\. At the same time, the missing cognitive measures likely contribute to the observed transfer gap and may suppress variation in the OASIS\-3 inputs\. The OASIS\-3 results should therefore be interpreted as an external validation under partially harmonised feature availability, rather than as a fully matched cohort\-to\-cohort comparison\.
##### CN calibration
The CN Expected Calibration Error of 0.096 is higher than the MCI and dementia ECEs of 0.040 and 0.034. This reflects underconfidence for some cognitively normal participants and is likely related to converter oversampling, which intentionally shifts learning toward the clinically important MCI-to-Dementia boundary. This represents a trade-off between conversion sensitivity and calibration for the stable CN subgroup. Future work could apply class-specific temperature scaling [[12](https://arxiv.org/html/2606.24604#bib.bib12)] or isotonic regression to improve CN calibration without retraining the full model.
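For reference, a minimal sketch of an Expected Calibration Error computation with equal-width confidence bins is given below. The binning scheme, the number of bins, and the restriction to one class at a time are assumptions; the exact class-wise ECE protocol behind the figures above is not restated in this section.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted mean of |accuracy - mean confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if members:
            mean_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(ok for _, ok in members) / len(members)
            ece += len(members) / n * abs(accuracy - mean_conf)
    return ece


# Hypothetical underconfident predictions: always right, but only 60% confident.
underconfident = expected_calibration_error([0.6, 0.6, 0.6, 0.6], [1, 1, 1, 1])
```

Underconfidence shows up as accuracy exceeding mean confidence within a bin. Temperature scaling with a class-specific temperature below 1 would sharpen such predictions without changing the predicted class.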
##### Ensemble scope
The ensemble strategy used in this study was designed to estimate epistemic uncertainty while keeping training computationally manageable\. Although the final ensemble promotes diversity across model members, larger full\-pipeline ensembles or Bayesian approximations could capture additional uncertainty in the learned patient representation and variable selection mechanisms\. The current results therefore provide a practical estimate of epistemic uncertainty, but may not exhaust all representation\-level uncertainty\. Importantly, the main qualitative patterns remain clinically interpretable: epistemic uncertainty is higher for rare progression patterns, MCI and dementia patients, and external OOD cases\.
##### Cohort diversity
ADNI and OASIS\-3 are highly valuable longitudinal cohorts, but both are drawn largely from academic medical research settings in North America\. As a result, generalisation to more diverse populations, community\-based care settings, and non\-Western healthcare systems remains to be established\. The proposed uncertainty decomposition framework may be useful in this setting because elevated epistemic uncertainty can help identify underrepresented patient subgroups\. Future evaluation on more diverse cohorts, including resources such as HABS\-HD or UK Biobank, would further clarify the model’s robustness and clinical applicability\.
## 7 Conclusion
This paper presented a probabilistic framework for longitudinal Alzheimer’s disease progression modelling that moves beyond single\-step diagnosis prediction toward multi\-year trajectory forecasting with clinically interpretable uncertainty\. The framework addresses three linked limitations of existing deep learning approaches: the treatment of disease stages as flat diagnostic classes, the restriction to point predictions at the next visit, and the lack of separation between uncertainty arising from disease variability and uncertainty arising from limited model knowledge\.
By combining a Temporal Fusion Transformer encoder with CORAL ordinal learning, asymmetric loss weighting, and converter oversampling, the proposed model respects the ordered structure of CN, MCI, and dementia while improving sensitivity to the clinically important MCI\-to\-Dementia boundary\. The autoregressive Mixture Density Network generator extends this representation into five\-year probabilistic trajectories over diagnosis, CDR\-SB, MMSE orientation, and hippocampal volume, producing forecast intervals that widen over time and biomarker dynamics consistent with expected Alzheimer’s disease progression\. The uncertainty decomposition further shows that epistemic uncertainty is elevated for rare progression patterns, later\-stage patients, and external OASIS\-3 cases where prediction error increases, suggesting that the model can signal when its forecasts should be interpreted with caution\.
Together, these findings support the value of modelling Alzheimer’s disease progression as a probabilistic longitudinal process rather than as a sequence of isolated classification decisions\. By distinguishing uncertainty due to patient\-level disease heterogeneity from uncertainty due to limited training evidence, the framework provides a basis for more cautious and informative clinical decision support, including prognosis, follow\-up planning, additional testing, and specialist review\. Future work will extend uncertainty estimation across the full architecture, evaluate the framework on more diverse longitudinal cohorts, address time\-to\-conversion bias, and incorporate emerging biomarkers such as amyloid and tau PET imaging\.
## Data Availability
ADNI data are available to qualified researchers at [adni.loni.usc.edu](https://adni.loni.usc.edu) following completion of a data use agreement. OASIS-3 data are available at [www.oasis-brains.org](https://www.oasis-brains.org) under a Creative Commons Attribution 4.0 International License.
## Conflict of Interest
The authors declare no conflict of interest\.
## References
- Aisen et al. [2024] Aisen, P.S., Veitch, D.P., Sperling, R., Petersen, R.C., Bollinger, J., Raman, R., Donohue, M.C., Weiner, M.W., 2024. The Alzheimer’s Disease Neuroimaging Initiative clinical core: progress and plans. Alzheimer’s & Dementia 20, 5143–5154. doi:[10.1002/alz.14167](https://doi.org/10.1002/alz.14167).
- Alsentzer et al. [2025] Alsentzer, E., McDermott, M., Falck, F., Schiratti, J.B., Naumann, T., 2025. ChronoFormer: time-aware transformer architectures for structured clinical event modeling. arXiv preprint arXiv:2504.07373.
- Arora et al. [2025] Arora, M., Wang, X., Erickson, B.J., 2025. CXR-TFT: Multi-modal temporal fusion transformer for predicting chest X-ray trajectories, in: Medical Image Computing and Computer Assisted Intervention – MICCAI 2025, Springer. doi:[10.1007/978-3-032-05182-0_16](https://doi.org/10.1007/978-3-032-05182-0_16).
- Begoli et al. [2019] Begoli, E., Bhattacharya, T., Kusnezov, D., 2019. The need for uncertainty quantification in machine-assisted medical decision making. Nature Machine Intelligence 1, 20–23. doi:[10.1038/s42256-018-0004-1](https://doi.org/10.1038/s42256-018-0004-1).
- Bishop [1994] Bishop, C.M., 1994. Mixture density networks. Technical Report NCRG/94/004.
- Cao et al. [2020] Cao, W., Mirjalili, V., Raschka, S., 2020. Rank consistent ordinal regression for neural networks with application to age estimation, Elsevier. pp. 325–331. doi:[10.1016/j.patrec.2020.11.008](https://doi.org/10.1016/j.patrec.2020.11.008).
- Carruthers and Finnie [2023] Carruthers, J., Finnie, T., 2023. Using mixture density networks to emulate a stochastic within-host model of Francisella tularensis infection. PLOS Computational Biology 19, e1011266. doi:[10.1371/journal.pcbi.1011266](https://doi.org/10.1371/journal.pcbi.1011266).
- Castaño et al. [2025] Castaño, D., Schiratti, J.B., Durrleman, S., Jedynak, B., 2025. A mixture model for subtype identification in the context of disease progression modeling. arXiv preprint arXiv:2603.04286.
- Chen et al. [2024] Chen, T., Wang, Y., Liu, X., Zhang, H., Li, W., 2024. A transformer-based unified multimodal framework for Alzheimer’s disease assessment. Computers in Biology and Medicine 181, 109050. doi:[10.1016/j.compbiomed.2024.109050](https://doi.org/10.1016/j.compbiomed.2024.109050).
- Fonteijn et al. [2012] Fonteijn, H.M., Modat, M., Clarkson, M.J., Barnes, J., Lehmann, M., Hobbs, N.Z., Scahill, R.I., Tabrizi, S.J., Ourselin, S., Fox, N.C., et al., 2012. An event-based model for disease progression and its application in familial Alzheimer’s disease and Huntington’s disease. NeuroImage 60, 1880–1889. doi:[10.1016/j.neuroimage.2012.01.062](https://doi.org/10.1016/j.neuroimage.2012.01.062).
- Ghazi et al. [2019] Ghazi, M.M., Nielsen, M., Pai, A., Modat, M., Cardoso, M.J., Ourselin, S., Sørensen, L., 2019. Training recurrent neural networks robust to incomplete data: application to Alzheimer’s disease progression modeling. Medical Image Analysis 53, 39–46. doi:[10.1016/j.media.2019.01.005](https://doi.org/10.1016/j.media.2019.01.005).
- Guo et al. [2017] Guo, C., Pleiss, G., Sun, Y., Weinberger, K.Q., 2017. On calibration of modern neural networks. URL: [https://arxiv.org/abs/1706.04599](https://arxiv.org/abs/1706.04599), [arXiv:1706.04599](http://arxiv.org/abs/1706.04599).
- Hashemifar et al. [2022] Hashemifar, S., Iriondo, C., Hejrati, M., Alzheimer’s Disease Neuroimaging Initiative, 2022. DeepAD: a robust deep learning model of Alzheimer’s disease progression for real-world clinical applications. arXiv preprint arXiv:2203.09096.
- He et al. [2025] He, T., Jiang, K., Zhao, A., Schroder, A., Thompson, E., Soskic, S., Barkhof, F., Alexander, D.C., 2025. A stage-aware mixture of experts framework for neurodegenerative disease progression modelling. arXiv preprint arXiv:2508.07032.
- Higgins et al. [2017] Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., Lerchner, A., 2017. β-VAE: Learning basic visual concepts with a constrained variational framework, in: International Conference on Learning Representations.
- Hüllermeier and Waegeman [2021] Hüllermeier, E., Waegeman, W., 2021. Aleatoric and epistemic uncertainty in machine learning: an introduction to concepts and methods. Machine Learning 110, 457–506. doi:[10.1007/s10994-021-05946-3](https://doi.org/10.1007/s10994-021-05946-3).
- Jack et al. [2010] Jack, C.R., Knopman, D.S., Jagust, W.J., Shaw, L.M., Aisen, P.S., Weiner, M.W., Petersen, R.C., Trojanowski, J.Q., 2010. Hypothetical model of dynamic biomarkers of the Alzheimer’s pathological cascade. The Lancet Neurology 9, 119–128. doi:[10.1016/S1474-4422(09)70299-6](https://doi.org/10.1016/S1474-4422(09)70299-6).
- Jain et al. [2025] Jain, A., Jaakkola, T., Barber, D., 2025. Deep ensembles for epistemic uncertainty: a frequentist perspective. arXiv preprint arXiv:2510.22063.
- Kamal and Farooq [2024] Kamal, K., Farooq, B., 2024. Ordinal-ResLogit: interpretable deep residual neural networks for ordered choices. Journal of Choice Modelling 50, 100454. doi:[10.1016/j.jocm.2023.100454](https://doi.org/10.1016/j.jocm.2023.100454).
- Karaçay et al. [2022] Karaçay, B., Bianchi, M., Günnemann, S., Bouchard, G., 2022. Mixture of input-output hidden Markov models for heterogeneous disease progression modeling, in: Proceedings of the 1st Workshop on Healthcare AI and COVID-19, ICML 2022.
- Kendall and Gal [2017] Kendall, A., Gal, Y., 2017. What uncertainties do we need in Bayesian deep learning for computer vision?, in: Advances in Neural Information Processing Systems, Curran Associates.
- Kingma and Ba [2015] Kingma, D.P., Ba, J., 2015. Adam: a method for stochastic optimization, in: International Conference on Learning Representations.
- Kingma and Welling [2014] Kingma, D.P., Welling, M., 2014. Auto-encoding variational Bayes, in: International Conference on Learning Representations.
- Koch et al. [2024] Koch, L.M., Baumgartner, C.F., Berens, P., 2024. Distribution shift detection for the postmarket surveillance of medical AI algorithms: a retrospective simulation study. npj Digital Medicine 7, 113. doi:[10.1038/s41746-024-01085-w](https://doi.org/10.1038/s41746-024-01085-w).
- Kompa et al. [2021] Kompa, B., Snoek, J., Beam, A.L., 2021. Second opinion needed: communicating uncertainty in medical machine learning. npj Digital Medicine 4, 4. doi:[10.1038/s41746-020-00367-3](https://doi.org/10.1038/s41746-020-00367-3).
- Lakshminarayanan et al. [2017] Lakshminarayanan, B., Pritzel, A., Blundell, C., 2017. Simple and scalable predictive uncertainty estimation using deep ensembles, in: Advances in Neural Information Processing Systems, Curran Associates.
- LaMontagne et al. [2019] LaMontagne, P.J., Benzinger, T.L., Morris, J.C., Keefe, S., Hornbeck, R., Xiong, C., Grant, E., Hassenstab, J., Moulder, K., Vlassenko, A.G., et al., 2019. OASIS-3: longitudinal neuroimaging, clinical, and cognitive dataset for normal aging and Alzheimer’s disease. medRxiv. doi:[10.1101/2019.12.13.19014902](https://doi.org/10.1101/2019.12.13.19014902).
- Lim et al. [2021] Lim, B., Arık, S.Ö., Loeff, N., Pfister, T., 2021. Temporal fusion transformers for interpretable multi-horizon time series forecasting. International Journal of Forecasting 37, 1748–1764. doi:[10.1016/j.ijforecast.2021.03.012](https://doi.org/10.1016/j.ijforecast.2021.03.012).
- Nguyen et al. [2020] Nguyen, M., He, T., An, L., Alexander, D.C., Feng, J., Yeo, B.T.T., 2020. Predicting Alzheimer’s disease progression using deep recurrent neural networks. NeuroImage 222, 117203. doi:[10.1016/j.neuroimage.2020.117203](https://doi.org/10.1016/j.neuroimage.2020.117203).
- Oxtoby et al. [2018] Oxtoby, N.P., Young, A.L., Cash, D.M., Benzinger, T.L., Fagan, A.M., Morris, J.C., Bateman, R.J., Fox, N.C., Schott, J.M., Alexander, D.C., 2018. Data-driven models of dominantly-inherited Alzheimer’s disease progression. Brain 141, 1529–1544. doi:[10.1093/brain/awy050](https://doi.org/10.1093/brain/awy050).
- Petersen [2011] Petersen, R.C., 2011. Mild cognitive impairment. New England Journal of Medicine 364, 2227–2234. doi:[10.1056/NEJMcp0910237](https://doi.org/10.1056/NEJMcp0910237).
- Phetrittikun and Suvirat [2023] Phetrittikun, R., Suvirat, C., 2023. Temporal fusion transformer for forecasting vital sign trajectories in intensive care patients, in: 2023 IEEE International Conference on Electronics, Computing and Communication Technologies (CONECCT), IEEE. pp. 1–6. doi:[10.1109/CONECCT57959.2023.10234585](https://doi.org/10.1109/CONECCT57959.2023.10234585).
- Reiman et al. [2011] Reiman, E.M., Langbaum, J.B., Fleisher, A.S., Caselli, R.J., Chen, K., Ayutyanont, N., Quiroz, Y.T., Kosik, K.S., Lopera, F., Tariot, P.N., 2011. Alzheimer’s prevention initiative: a plan to accelerate the evaluation of presymptomatic treatments. Journal of Alzheimer’s Disease 26, S321–S329. doi:[10.3233/JAD-2011-0059](https://doi.org/10.3233/JAD-2011-0059).
- Rizopoulos [2012] Rizopoulos, D., 2012. Joint Models for Longitudinal and Time-to-Event Data: With Applications in R. CRC Press, Boca Raton, FL.
- Shi et al. [2023] Shi, X., Cao, W., Raschka, S., 2023. Deep neural networks for rank-consistent ordinal regression based on conditional probabilities. Pattern Analysis and Applications 26, 941–955. doi:[10.1007/s10044-023-01181-9](https://doi.org/10.1007/s10044-023-01181-9).
- Tang et al. [2025] Tang, X., Zhao, L., Chen, M., Liu, W., Zhang, J., 2025. Predicting the progression of mild cognitive impairment based on fine-grained and spatiotemporal features of MRI. Biomedical Signal Processing and Control 98, 107012. doi:[10.1016/j.bspc.2025.107012](https://doi.org/10.1016/j.bspc.2025.107012).
- Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I., 2017. Attention is all you need, in: Advances in Neural Information Processing Systems, Curran Associates.
- Wang et al. [2026] Wang, M., Liu, Y., Fu, H., 2026. Uncertainty-aware ordinal deep learning for cross-dataset diabetic retinopathy grading. arXiv preprint arXiv:2602.10315.
- Wang et al. [2024] Wang, Y., Gao, R., Wei, T., Johnston, L., Yuan, X., Zhang, Y., Yu, Z., 2024. Predicting long-term progression of Alzheimer’s disease using a multimodal deep learning model incorporating interaction effects. Journal of Translational Medicine 22, 245. doi:[10.1186/s12967-024-05025-w](https://doi.org/10.1186/s12967-024-05025-w).
- Weiner et al. [2017] Weiner, M.W., Veitch, D.P., Aisen, P.S., Beckett, L.A., Cairns, N.J., Cedarbaum, J., Donohue, M.C., Green, R.C., Harvey, D., Jack, C.R., et al., 2017. The Alzheimer’s Disease Neuroimaging Initiative 3: continued innovation for clinical trial improvement. Alzheimer’s & Dementia 13, 561–571. doi:[10.1016/j.jalz.2016.10.006](https://doi.org/10.1016/j.jalz.2016.10.006).
- Weng et al. [2025] Weng, W.H., Liu, Q., Huang, R., Hsieh, J., Foschini, L., 2025. First, do no harm: addressing AI’s challenges with out-of-distribution data in medicine. Clinical and Translational Science 18, e70132. doi:[10.1111/cts.70132](https://doi.org/10.1111/cts.70132).
- World Health Organization [2023] World Health Organization, 2023. Dementia. Technical Report. World Health Organization. Fact sheet. Available at: [https://www.who.int/news-room/fact-sheets/detail/dementia](https://www.who.int/news-room/fact-sheets/detail/dementia).
- Zhang et al. [2025] Zhang, Z., Chen, T., Hernández-Lobato, J.M., Li, S., 2025. Uncertainty quantification for machine learning in healthcare: a survey. arXiv preprint arXiv:2505.02874.
- Zhao et al. [2024] Zhao, T., Guo, Y., Wang, X., Shen, D., 2024. Out-of-distribution detection in medical image analysis: a survey. arXiv preprint arXiv:2404.18279.