Forecasting Medium-Horizon Alzheimer's Disease Progression: Residual Gap-Aware Transformers for 24-Month CDR-SB Change from ADNI Clinical and Biomarker Histories
Summary
This paper proposes a residual gap-aware transformer that combines a mixed-effects statistical reference with transformer-based residual learning to forecast 24-month CDR-SB change from ADNI clinical and biomarker histories, achieving reduced MSE and improved correlation over baselines.
# Forecasting Medium-Horizon Alzheimer’s Disease Progression: Residual Gap-Aware Transformers for 24-Month CDR-SB Change from ADNI Clinical and Biomarker Histories
Source: [https://arxiv.org/html/2605.16319](https://arxiv.org/html/2605.16319)
Ran Tong1,\*,†, Tong Wang2,\*,†, Lanruo Wang3, Yin Ni4

1 Department of Mathematical Sciences, University of Texas at Dallas, Richardson, TX 75080, United States

2 Department of Plant Science and Landscape Architecture, University of Connecticut, Storrs, CT, United States

3 Naveen Jindal School of Management, University of Texas at Dallas, Richardson, TX 75080, United States

4 Zhejiang Provincial People’s Hospital, Zhejiang, China

\*These authors contributed equally\. †Corresponding authors\.
###### Abstract
Medium\-horizon Alzheimer’s disease progression prediction is difficult because future clinical scores can remain strongly tied to baseline severity, while longitudinal biomarker histories are irregular and incompletely observed\. We develop an anchor\-based analysis of 24\-month Clinical Dementia Rating Sum of Boxes \(CDR\-SB\) change using harmonized tables derived from the Alzheimer’s Disease Neuroimaging Initiative \(ADNI\)\. Each labeled sample is anchored at a mild cognitive impairment visit, uses only clinical and biomarker history observed at or before that anchor, and defines the response as CDR\-SB at the future visit closest to 24 months within an 18–30 month window minus anchor CDR\-SB\. The analytic cohort contains 2,600 labeled anchors from 858 participants and 7,276 longitudinal rows, with cognition, function, demographics, diagnosis, APOE4 allele count \(number of apolipoprotein E epsilon 4 alleles\), structural magnetic resonance imaging summaries, and cerebrospinal fluid biomarkers aligned by actual visit date\.
We propose a residual gap\-aware transformer that combines a mixed\-effects statistical reference with transformer\-based residual learning from pre\-anchor longitudinal clinical and biomarker histories\. The model uses participant\-level random intercepts in the mixed\-effects reference, observation\-level triplet tokenization for irregular histories, and a learned nonnegative time\-gap penalty inside self\-attention\. The final prediction is the sum of the mixed\-effects fixed\-effect prediction and the learned transformer residual\. We compare the proposed model with a Bayesian\-information\-criterion\-selected linear mixed\-effects baseline, GRU\-D, and STraTS under repeated participant\-level train–test splits\. Across five participant\-level random seeds, the proposed model achieves the best mean test performance across all reported metrics, reducing MSE by 13\.1% and increasing prediction–observation correlation by 26\.4% relative to the mixed\-effects baseline\. It also improves over both GRU\-D and STraTS in mean error and correlation\. These results show that statistical anchoring and gap\-aware residual learning provide a useful structure for medium\-horizon Alzheimer’s disease progression prediction from longitudinal clinical and biomarker ADNI histories\.
## 1 Introduction
Alzheimer’s disease progression is a longitudinal process\. Cognitive status, functional impairment, diagnosis, and biomarker burden change over repeated visits, often under irregular follow\-up and incomplete biomarker coverage\. This structure makes the Alzheimer’s Disease Neuroimaging Initiative \(ADNI\) an important resource for studying disease progression rather than only cross\-sectional diagnosis\. In a broad review of ADNI publications, Weiner et al\. \([2017](https://arxiv.org/html/2605.16319#bib.bib34)\) showed that ADNI had become a major platform for biomarker validation and longitudinal disease characterization\. Veitch et al\. \([2024](https://arxiv.org/html/2605.16319#bib.bib31)\) documented that later ADNI studies continued to support longitudinal modeling, biomarker research, and treatment\-era methodological development\. Looking across two decades of work, Okonkwo et al\. \([2025](https://arxiv.org/html/2605.16319#bib.bib22)\) described ADNI as a mature infrastructure for prognosis research, biomarker development, and cross\-cohort translation\. These reviews place ADNI at the center of modern longitudinal Alzheimer’s disease research\.
Recent work in Alzheimer’s disease prediction has also moved toward future\-oriented progression modeling\. Marinescu et al\. \([2021](https://arxiv.org/html/2605.16319#bib.bib20)\) made this direction explicit through the TADPOLE challenge, which organized prediction around future diagnosis and future clinical measurements from ADNI histories\. Nguyen et al\. \([2020](https://arxiv.org/html/2605.16319#bib.bib21)\) showed that recurrent neural networks can forecast future diagnosis, cognition, and ventricular measures from longitudinal ADNI data\. Al Olaimat et al\. \([2023](https://arxiv.org/html/2605.16319#bib.bib2)\) focused more directly on progression from mild cognitive impairment by developing deep models for future Alzheimer’s disease progression\. Zhang et al\. \([2024](https://arxiv.org/html/2605.16319#bib.bib37)\) used longitudinal multi\-source data to identify disease\-related progression patterns associated with prognosis\. Lee et al\. \([2024](https://arxiv.org/html/2605.16319#bib.bib18)\) developed a machine learning framework for predicting dementia conversion\. These studies show that future\-oriented Alzheimer’s disease prediction is feasible and clinically relevant, while also showing that the target definition and evaluation design strongly shape what a model is being asked to learn\.
A central difficulty is that reported predictive gains can be hard to interpret when cohort construction and outcome definition differ across studies\. Outcome horizon, anchor state, follow\-up window, biomarker handling, and validation protocol all affect the meaning of a model comparison\. Grueso and Viejo\-Sobera \([2021](https://arxiv.org/html/2605.16319#bib.bib11)\) documented substantial variation in inclusion criteria, feature construction, and validation among studies of progression from mild cognitive impairment to Alzheimer’s disease dementia\. Ahmadzadeh et al\. \([2023](https://arxiv.org/html/2605.16319#bib.bib1)\) reached a similar conclusion for neuroimaging\-based transition prediction and emphasized continuing methodological unevenness\. Singh et al\. \([2024](https://arxiv.org/html/2605.16319#bib.bib25)\) described forecasting\-oriented transition studies as a rapidly growing area whose target definitions and evaluation strategies still differ widely\. Kumar et al\. \([2021](https://arxiv.org/html/2605.16319#bib.bib15)\) reviewed clinical\-data machine learning studies and highlighted heterogeneity in data sources, tasks, and validation settings\. Malik et al\. \([2024](https://arxiv.org/html/2605.16319#bib.bib19)\) extended the same concern to the broader Alzheimer’s disease prediction literature\. These reviews motivate a study design in which the anchor definition, prediction horizon, endpoint, covariate history, and participant\-level evaluation rule are stated before model comparison\.
The endpoint is especially important\. Prior work has shown that the Clinical Dementia Rating Sum of Boxes \(CDR\-SB\) is a meaningful longitudinal outcome in Alzheimer’s disease\. Cedarbaum et al\. \([2013](https://arxiv.org/html/2605.16319#bib.bib6)\) argued that CDR\-SB is a suitable primary outcome because it combines cognitive and functional decline\. Williams et al\. \([2013](https://arxiv.org/html/2605.16319#bib.bib35)\) showed that CDR\-SB tracks Alzheimer’s disease progression over time and carries clinically meaningful longitudinal information\. Andrews et al\. \([2019](https://arxiv.org/html/2605.16319#bib.bib4)\) studied clinically meaningful change in Alzheimer’s disease outcome measures, including CDR\-SB\. Jamalian et al\. \([2023](https://arxiv.org/html/2605.16319#bib.bib13)\) modeled longitudinal CDR\-SB trajectories using clinical trial and ADNI data\. These studies make CDR\-SB a natural outcome family for progression modeling\. The remaining design question is which version of CDR\-SB gives the clearest medium\-horizon progression target\. In the cohort used here, raw 24\-month CDR\-SB remains strongly associated with anchor CDR\-SB, whereas 24\-month CDR\-SB change is much less explained by the anchor value alone\. This empirical contrast motivates the primary response in the present study\. A change score places the prediction target on worsening after the anchor rather than on baseline disease burden carried into the future score\.
Clinical and biomarker history creates a second practical challenge\. Structural magnetic resonance imaging and cerebrospinal fluid biomarkers are informative for Alzheimer’s disease progression, but they are often unavailable at the exact anchor visit\. Lee et al\. \([2019](https://arxiv.org/html/2605.16319#bib.bib17)\) showed that combining different data sources can improve Alzheimer’s disease progression prediction\. Ding et al\. \([2023](https://arxiv.org/html/2605.16319#bib.bib9)\) further showed that longitudinal and multi\-source information improves prediction of progression from mild cognitive impairment to Alzheimer’s disease\. Zhang et al\. \([2024](https://arxiv.org/html/2605.16319#bib.bib37)\) used longitudinal multi\-source data to study disease\-related progression patterns, and Lee et al\. \([2024](https://arxiv.org/html/2605.16319#bib.bib18)\) used multiple data sources for dementia conversion prediction\. These studies support the use of clinical and biomarker histories\. At the same time, a strict same\-visit complete\-case rule can sharply reduce the usable cohort\. For a medium\-horizon longitudinal analysis, historically observed biomarker information remains clinically relevant, and its recency should be represented rather than ignored\.
A third challenge is the choice of statistical reference\. In an anchor\-based longitudinal analysis, the same participant can contribute multiple eligible mild cognitive impairment anchors\. This creates within\-subject dependence that a simple anchor\-level regression does not represent\. A mixed\-effects model with participant\-level random intercepts is therefore a more appropriate statistical comparator for this study design\. The proposed neural model is also built around this principle: it learns a residual component beyond the mixed\-effects fixed\-effect prediction, rather than replacing the longitudinal statistical reference with a black\-box sequence model\.
This study is built from these considerations\. We construct a participant\-level, anchor\-based analysis of 24\-month CDR\-SB change using harmonized longitudinal tables derived from ADNI\. Each labeled sample is anchored at a visit where the participant is diagnosed with mild cognitive impairment and includes only information observed up to that visit, including repeated clinical scores and available biomarker history\. The outcome is defined at the future visit closest to 24 months within an 18–30 month window\. To evaluate models in a way that respects repeated measures, we compare the proposed model against a mixed\-effects statistical baseline with participant random intercepts, together with recurrent and transformer\-based comparators for irregular clinical time series\.
The contribution of the paper is threefold\. First, it defines a clinically interpretable medium\-horizon progression analysis based on 24\-month CDR\-SB change, with explicit anchor selection, follow\-up window construction, biomarker\-history handling, and participant\-level evaluation\. Second, it uses a mixed\-effects repeated\-measures model as the statistical reference, which makes the comparison more appropriate for repeated anchors from the same participant\. Third, it proposes a residual gap\-aware transformer that combines the mixed\-effects fixed\-effect prediction with a transformer residual learned from irregular pre\-anchor clinical and biomarker histories, using a nonnegative time\-gap penalty inside self\-attention\. This design links the prediction task, statistical reference, and neural architecture to the same longitudinal data structure\. Under this shared quantitative CDR\-SB\-change analysis, the proposed model achieves the best repeated\-seed mean performance among the compared model families\.
## 2 Related Work
### 2\.1 Longitudinal Alzheimer’s disease forecasting and progression targets
Recent Alzheimer’s disease prediction studies have increasingly moved from current\-state classification toward explicit forecasting of future disease status and future clinical outcomes\. Marinescu et al\. \([2021](https://arxiv.org/html/2605.16319#bib.bib20)\) played an important role in this shift by organizing the TADPOLE challenge around future diagnosis and future clinical measurements from ADNI histories\. That work helped clarify that, once the prediction target moves into the future, cohort definition, horizon choice, and response construction become part of the scientific problem\.
Nguyen et al\. \([2020](https://arxiv.org/html/2605.16319#bib.bib21)\) showed that recurrent neural networks can use longitudinal ADNI histories to forecast future diagnosis, cognition, and ventricular measures\. Their work demonstrated that repeated observations contain predictive information beyond a single anchor visit and that neural sequence models can use that structure\. Their study considered several future targets, which helped establish feasibility while leaving room for a more focused question about how to define one clinically interpretable medium\-horizon progression response for direct model comparison\.
Al Olaimat et al\. \([2023](https://arxiv.org/html/2605.16319#bib.bib2)\) moved closer to the clinical setting considered here by focusing on progression from mild cognitive impairment to Alzheimer’s disease over future visits\. That work placed MCI progression directly at the center of model development and showed that longitudinal deep models can be trained for future progression\. Zhang et al\. \([2024](https://arxiv.org/html/2605.16319#bib.bib37)\) used longitudinal multi\-source data to identify disease\-related progression patterns associated with prognosis\. Lee et al\. \([2024](https://arxiv.org/html/2605.16319#bib.bib18)\) developed a machine learning framework aimed at dementia conversion prediction\. Ding et al\. \([2023](https://arxiv.org/html/2605.16319#bib.bib9)\) showed that longitudinal and multi\-source information improves prediction of progression from MCI to Alzheimer’s disease relative to more limited formulations\. Together, these studies show that longitudinal prediction is feasible and clinically relevant, while also showing that the exact target definition remains central\.
Prior reviews reinforce this point\. Grueso and Viejo\-Sobera \([2021](https://arxiv.org/html/2605.16319#bib.bib11)\) documented variation in inclusion criteria, feature construction, and validation protocols among studies of progression from mild cognitive impairment to Alzheimer’s disease dementia\. Ahmadzadeh et al\. \([2023](https://arxiv.org/html/2605.16319#bib.bib1)\) reached a similar conclusion for neuroimaging\-based transition prediction\. Singh et al\. \([2024](https://arxiv.org/html/2605.16319#bib.bib25)\) described forecasting\-oriented transition studies as a growing area whose target definitions and evaluation strategies still differ substantially\. Kumar et al\. \([2021](https://arxiv.org/html/2605.16319#bib.bib15)\) and Malik et al\. \([2024](https://arxiv.org/html/2605.16319#bib.bib19)\) extended the same concern to broader clinical\-data and machine\-learning studies of Alzheimer’s disease prediction\. These reviews matter for the present paper because model performance is often tied to cohort restriction, outcome definition, and validation design\. The present study addresses this issue by specifying the horizon, anchor state, response, covariate history, and participant\-level split rule before comparing model families\.
Quantitative CDR\-SB\-change outcomes are important in this setting\. CDR\-SB is an ordered composite clinical scale, but its change over time is widely used to summarize clinical worsening\. Cedarbaum et al\. \([2013](https://arxiv.org/html/2605.16319#bib.bib6)\) examined the psychometric properties of CDR\-SB in ADNI and argued that it is a strong candidate primary outcome because it reflects both cognitive and functional decline\. Williams et al\. \([2013](https://arxiv.org/html/2605.16319#bib.bib35)\) showed that CDR\-SB tracks disease progression over time\. Andrews et al\. \([2019](https://arxiv.org/html/2605.16319#bib.bib4)\) studied minimal clinically important differences in Alzheimer’s disease outcome measures and gave clinical meaning to changes on scales such as CDR\-SB\. Jamalian et al\. \([2023](https://arxiv.org/html/2605.16319#bib.bib13)\) modeled longitudinal CDR\-SB trajectories using clinical trial and ADNI data\. These papers support 24\-month CDR\-SB change as a meaningful quantitative progression endpoint\.
### 2\.2 Classical longitudinal statistical models and clinically interpretable comparators
Classical longitudinal modeling provides the statistical basis for the comparator used in this study\. In biomedical research, repeated\-measures models remain a standard way to describe within\-subject dependence, separate population\-level and subject\-level structure, and support interpretable effect assessment\. Laird and Ware \([1982](https://arxiv.org/html/2605.16319#bib.bib16)\) gave the foundational random\-effects formulation for longitudinal data\. Verbeke and Molenberghs \([2000](https://arxiv.org/html/2605.16319#bib.bib33)\) developed the linear mixed\-model framework more fully for repeated continuous outcomes\. Fitzmaurice et al\. \([2011](https://arxiv.org/html/2605.16319#bib.bib10)\) remains a standard reference for practical longitudinal\-data analysis\. These works matter directly for the present study because the same participant can contribute multiple eligible anchors, which creates within\-subject dependence\.
This point affects the choice of comparator\. In a repeated\-anchor progression analysis, the statistical baseline should reflect the repeated\-measures structure of the data\. A purely anchor\-level regression can serve as a descriptive reference, while a mixed\-effects model is the more appropriate longitudinal comparator\. This is why the present study uses a mixed\-effects baseline with participant random intercepts rather than a simpler cross\-sectional regression reference\.
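For concreteness, a random-intercept linear mixed model of the kind described above can be written as follows. The notation is an assumption introduced here for illustration (participant index i, anchor index j within participant, covariate vector x, fixed effects β, random intercept b) and is not taken from the paper's own model specification.

```latex
y_{ij} = \mathbf{x}_{ij}^{\top}\boldsymbol{\beta} + b_i + \varepsilon_{ij},
\qquad b_i \sim \mathcal{N}(0, \sigma_b^2),
\qquad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
```

Under this form, anchors from the same participant share the intercept b_i, which is what induces within-subject correlation. For a participant not seen during training, the natural prediction uses only the fixed-effect part, x^T β.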
The classical longitudinal modeling tradition also connects with Alzheimer’s disease progression modeling itself\. Jedynak et al\. \([2012](https://arxiv.org/html/2605.16319#bib.bib14)\) proposed a computational disease progression score for the ADNI cohort and treated progression as a latent longitudinal process rather than as a sequence of disconnected classifications\. Jamalian et al\. \([2023](https://arxiv.org/html/2605.16319#bib.bib13)\) modeled longitudinal CDR\-SB trajectories directly, again emphasizing that progression can be represented as an evolving process rather than only as a future label\. A related statistical direction is joint modeling of longitudinal and event data, for which Rizopoulos \([2012](https://arxiv.org/html/2605.16319#bib.bib23)\) provides the standard treatment\. The current paper uses a fixed\-horizon change score, while the joint\-modeling framework reflects the broader statistical principle that repeated longitudinal measurements are informative about future clinical outcomes\.
### 2\.3 Irregular clinical sequence models and transformer\-based time\-series methods
The machine learning work most relevant to the proposed model comes from irregular clinical sequence modeling\. Baytas et al\. \([2017](https://arxiv.org/html/2605.16319#bib.bib5)\) introduced T\-LSTM and showed that elapsed time can be incorporated directly into recurrent dynamics when observations arrive at irregular intervals\. This work is relevant here because it treats irregular timing as a structural feature of the sequence\.
Che et al\. \([2018](https://arxiv.org/html/2605.16319#bib.bib7)\) developed GRU\-D, a recurrent model for multivariate clinical time series with missing values\. Its main contribution was to treat both elapsed time and missingness as informative\. GRU\-D allows the model to learn how stale information decays and how missingness patterns carry signal\. That idea is directly relevant to ADNI\-style clinical and biomarker histories, where measurements are observed irregularly and biomarker availability changes across visits\. GRU\-D is therefore one of the most appropriate non\-transformer comparators for the data structure used in this paper\.
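The decay idea can be illustrated with a minimal scalar sketch of the GRU-D input-decay rule, in which a missing value is replaced by a blend of the last observed value and the variable's empirical mean, weighted by a factor that shrinks as elapsed time grows. The function names, the scalar parameters `w` and `b`, and the numbers are hypothetical. In the actual model the decay parameters are learned vectors, and a similar decay is applied to the hidden state.

```python
import math


def decay_rate(delta, w, b):
    """GRU-D style decay factor: gamma = exp(-max(0, w * delta + b)), in (0, 1]."""
    return math.exp(-max(0.0, w * delta + b))


def decayed_input(x_obs, mask, x_last, x_mean, delta, w, b):
    """Value passed to the recurrent cell for one variable at one time step.

    mask == 1 means the variable is observed now, so x_obs is used directly.
    Otherwise the last observation decays toward the empirical mean.
    """
    if mask == 1:
        return x_obs
    gamma = decay_rate(delta, w, b)
    return gamma * x_last + (1.0 - gamma) * x_mean


# Hypothetical values: last observation 2.0, training mean 1.0.
fresh = decayed_input(None, 0, x_last=2.0, x_mean=1.0, delta=0.0, w=0.1, b=0.0)
stale = decayed_input(None, 0, x_last=2.0, x_mean=1.0, delta=36.0, w=0.1, b=0.0)
```

With no elapsed time the last value is carried forward unchanged, while after a long gap the imputed value sits close to the mean.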
Al Olaimat and Bozdag \([2024](https://arxiv.org/html/2605.16319#bib.bib3)\) continued this line of work by combining time\-awareness with attention in irregular electronic health record sequences\. That paper is useful here because it reinforces the idea that temporal distance should affect interaction strength explicitly\. This idea is related to the gap\-aware attention term used in the proposed model, although the present paper applies it to a fixed\-horizon Alzheimer’s disease progression analysis and combines it with a mixed\-effects residual structure\.
Transformer\-based sequence modeling forms the other main methodological branch\. Vaswani et al\. \([2017](https://arxiv.org/html/2605.16319#bib.bib30)\) introduced the self\-attention architecture underlying modern transformer models\. For irregular clinical time series, visits and variables are often observed on different schedules, so a dense regular\-grid representation can be poorly matched to the data\. Tipirneni and Reddy \([2021](https://arxiv.org/html/2605.16319#bib.bib27)\) addressed this problem through STraTS, which represents multivariate clinical history as observation\-level triplets rather than as a fully aligned matrix\. STraTS is therefore an important comparator in the present study because it shares the triplet\-style representation used by the proposed model\.
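A minimal sketch of the triplet idea follows: each observed measurement becomes one (time, variable, value) record, and unobserved entries are simply absent rather than imputed. The visit structure, variable names, and numbers are hypothetical and chosen only to illustrate the representation.

```python
def to_triplets(visits):
    """Flatten visits into (time, variable, value) triplets, skipping unobserved entries."""
    triplets = []
    for visit in visits:
        for name, value in visit["values"].items():
            if value is not None:  # unobserved measurements produce no token
                triplets.append((visit["month"], name, value))
    # Sort by time, then variable name, for a deterministic order.
    return sorted(triplets, key=lambda t: (t[0], t[1]))


# Hypothetical two-visit history; imaging is missing at the second visit.
history = [
    {"month": 0.0, "values": {"CDRSB": 1.0, "MMSE": 27.0, "HIPPOCAMPUS": 6800.0}},
    {"month": 6.5, "values": {"CDRSB": 1.5, "MMSE": 26.0, "HIPPOCAMPUS": None}},
]
triplets = to_triplets(history)
```

The sequence length then tracks the number of actual observations, so sparse biomarker modalities add few tokens instead of large blocks of missing cells.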
More general transformer work for multivariate time series has also helped define the design space\. Zerveas et al\. \([2021](https://arxiv.org/html/2605.16319#bib.bib36)\) developed a transformer\-based framework for multivariate time\-series representation learning and showed that transformer models can be adapted beyond language data\. The question for the present setting is which structure makes attention better matched to irregular, clinically organized histories\.
The proposed method enters at that point\. It keeps the observation\-level triplet representation associated with STraTS, preserves a mixed\-effects statistical reference, and modifies attention through an explicit penalty on temporal distance\. The model should therefore be read as a structured statistical\-neural model for irregular longitudinal progression prediction\. Its main design choices correspond to the data features emphasized throughout this section: repeated anchors within participant, irregular measurement timing, incomplete biomarker coverage, and a fixed\-horizon quantitative progression response\.
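One way to picture an explicit penalty on temporal distance is to subtract a nonnegative multiple of the absolute time gap from each attention logit before the softmax. The sketch below is an illustration under stated assumptions, not the paper's exact parameterization: the linear dependence on the gap, the softplus used to keep the coefficient nonnegative, and all names are hypothetical choices.

```python
import math


def softplus(x):
    """Smooth map from any real number to a positive number."""
    return math.log1p(math.exp(x))


def gap_penalized_attention(scores, times, raw_penalty):
    """Row-wise softmax of scores[i][j] - softplus(raw_penalty) * |t_i - t_j|."""
    lam = softplus(raw_penalty)  # nonnegative by construction (assumed form)
    weights = []
    for i, row in enumerate(scores):
        logits = [s - lam * abs(times[i] - times[j]) for j, s in enumerate(row)]
        top = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(v - top) for v in logits]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Three tokens observed at months 0, 6, and 24, with equal content scores.
times = [0.0, 6.0, 24.0]
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
attn = gap_penalized_attention(scores, times, raw_penalty=-2.0)
```

With equal content scores, each token attends most to itself and progressively less to tokens that are further away in time. A penalty coefficient near zero recovers ordinary softmax attention.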
## 3 Methodology: Cohort Construction, Outcome Definition, and Longitudinal Data Structure
This section describes the analytic construction used before model fitting\. We define the study cohort, the anchor and follow\-up rules, the 24\-month CDR\-SB\-change response, and the clinical and biomarker covariate structure\. These steps are part of the methodology because they determine the prediction target, the information available at each anchor, and the participant\-level structure used in the subsequent model comparison\.
### 3\.1 Study cohort and follow\-up design
All analyses use harmonized longitudinal tables derived from the Alzheimer’s Disease Neuroimaging Initiative \(ADNI\)\. The analytic dataset is built by aligning participant identifier, visit code, and actual visit date across diagnosis records, cognitive and functional assessments, demographics, APOE4 allele count \(number of apolipoprotein E epsilon 4 alleles\), structural magnetic resonance imaging regional volume tables, and cerebrospinal fluid biomarker tables\. Actual visit date is used because nominal visit labels alone do not resolve missed visits, asynchronous modality collection, or irregular follow\-up\. In a medium\-horizon progression study, timing determines which observations belong to the available history, which future visits are eligible for outcome assignment, and how informative historical biomarker measurements remain at the anchor\.
The study is anchored at visits where the participant is diagnosed with mild cognitive impairment \(MCI\)\. This anchor state is clinically useful because future worsening remains plausible and clinically important, yet there is still substantial heterogeneity in cognitive burden, functional status, and biomarker pattern across subjects\. For each anchor visit, the study searches for future follow\-up visits occurring between 18 and 30 months after the anchor\. Among those eligible visits, the visit closest to 24 months is selected as the outcome visit\. This rule yields a clinically interpretable medium\-horizon design while remaining feasible under irregular ADNI follow\-up\.
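The outcome-visit rule above can be sketched as a small selection function. This is a minimal illustration, assuming visit times are expressed in months on a common scale, that the 18 and 30 month endpoints are inclusive, and that ties go to the earlier listed visit. None of these details is stated in the text, and the function and variable names are hypothetical.

```python
def select_outcome_visit(anchor_month, future_visits, target=24.0, lo=18.0, hi=30.0):
    """Pick the visit closest to `target` months after the anchor.

    future_visits is a list of (month, cdrsb) pairs. Only visits whose gap from
    the anchor lies in [lo, hi] are eligible. Returns None if none qualify.
    """
    eligible = [
        (month, cdrsb)
        for month, cdrsb in future_visits
        if lo <= month - anchor_month <= hi
    ]
    if not eligible:
        return None
    return min(eligible, key=lambda v: abs((v[0] - anchor_month) - target))


# Hypothetical follow-up schedule for one anchor at month 0.
visits = [(12.1, 2.0), (19.0, 2.5), (25.2, 3.0), (36.4, 4.0)]
chosen = select_outcome_visit(0.0, visits)
no_match = select_outcome_visit(0.0, [(6.0, 1.0), (40.0, 5.0)])
```

Here the visits at 19.0 and 25.2 months are both eligible, and the 25.2 month visit is selected because its gap is closest to 24 months.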
The primary CDR\-SB\-change cohort retains anchor visits for which both anchor and selected future CDR\-SB values are observed\. Under this rule, the resulting analytic cohort contains 2,600 labeled anchor visits from 858 participants and 7,276 longitudinal rows\. The median target gap is 24\.11 months\. Same\-visit structural magnetic resonance imaging availability is 17\.1%, whereas any\-prior structural magnetic resonance imaging availability rises to 36\.4%\. The corresponding numbers for cerebrospinal fluid are 22\.0% for same\-visit availability and 60\.5% for any\-prior availability\. These quantities show why strict same\-visit biomarker coverage would be poorly matched to the data structure\. Historical biomarker information is much more broadly available than same\-visit biomarker information alone\. Table [1](https://arxiv.org/html/2605.16319#S3.T1) summarizes the labeled cohort size, repeated\-anchor structure, target timing, and same\-visit versus any\-prior biomarker availability used in the primary 24\-month CDR\-SB\-change analysis\.
Table 1: Profile of the primary 24\-month CDR\-SB\-change cohort derived from ADNI\. Percentages are relative to labeled anchor visits\.
### 3\.2 Primary outcome and statistical rationale
The primary outcome is the 24\-month Clinical Dementia Rating Sum of Boxes \(CDR\-SB\) change,
$$\Delta\mathrm{CDR\mbox{-}SB}_{i,24m}=\mathrm{CDR\mbox{-}SB}_{i,24m}-\mathrm{CDR\mbox{-}SB}_{i,0},\tag{1}$$
whenever both values are observed\. Throughout the paper, this quantity is referred to as the *24\-month CDR\-SB change*\. This endpoint measures worsening magnitude over a clinically interpretable follow\-up interval\. Because CDR\-SB is an ordered composite clinical scale, the response is interpreted as a quantitative change score rather than as a strictly continuous biological measurement\.
The choice of CDR\-SB is supported by prior clinical and longitudinal work\. Cedarbaum et al\. \([2013](https://arxiv.org/html/2605.16319#bib.bib6)\) argued that CDR\-SB is a suitable primary outcome for Alzheimer’s disease trials because it combines cognitive and functional decline in a single clinical measure\. Williams et al\. \([2013](https://arxiv.org/html/2605.16319#bib.bib35)\) showed that CDR\-SB scores track Alzheimer’s disease progression over time, supporting their use in longitudinal progression analysis\. Andrews et al\. \([2019](https://arxiv.org/html/2605.16319#bib.bib4)\) studied clinically meaningful change in Alzheimer’s disease outcome measures, including CDR\-SB, which helps connect score changes with clinical interpretation\. Jamalian et al\. \([2023](https://arxiv.org/html/2605.16319#bib.bib13)\) modeled longitudinal CDR\-SB trajectories using clinical trial and ADNI data, further supporting CDR\-SB as a progression endpoint in longitudinal modeling\.
The key design question is which version of CDR\-SB gives the clearest medium\-horizon progression target\. A natural alternative is the raw future CDR\-SB level at the selected 24\-month visit\. That quantity is clinically interpretable, but it remains strongly tied to the anchor value\. In the present cohort, current CDR\-SB alone has correlation 0\.667 with raw 24\-month CDR\-SB and explains 44\.4% of its variance\. By contrast, current CDR\-SB has correlation only 0\.175 with 24\-month CDR\-SB change and explains 3\.1% of its variance\. This contrast is central to the design of the study\. A raw future score still carries substantial information about where the subject started\. A change score moves the target toward worsening after the anchor\.
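As a quick consistency check on these figures, and assuming the reported variance explained comes from a single-predictor linear fit (an assumption, since the fitting procedure is not spelled out at this point), the coefficient of determination equals the squared Pearson correlation:

```latex
R^2 = r^2, \qquad 0.667^2 \approx 0.445, \qquad 0.175^2 \approx 0.031
```

Both squares agree with the reported 44.4% and 3.1% up to rounding of the correlations. On that reading, the anchor value accounts for a substantial share of the variance in the raw future score but only a small share of the variance in the change score.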
Each labeled sample therefore represents a clinically interpretable question asked at an MCI visit: given the subject’s repeated clinical history and available biomarker history up to this point, how much worsening is recorded approximately two years later? That question is narrower than a generic future\-state forecast and better aligned with the quantitative progression problem studied in this paper\.
### 3\.3 Covariate structure, biomarker history, and derived variables
The covariate structure is organized around the same principle as the outcome: the data used for prediction should reflect what is observed before the anchor and should preserve the longitudinal clinical and biomarker structure of the cohort\. The analysis therefore includes repeated clinical measures, anchor\-level characteristics, historical biomarker summaries, and derived variables that describe missingness and timing\.
Longitudinal clinical variables include CDR\-SB, MMSE, ADAS13, and FAQ measured over repeated visits before the anchor\. These variables provide the densest part of the subject history and capture both disease burden and its temporal evolution\. Structural magnetic resonance imaging summaries include regional and global measurements such as whole brain, ventricles, hippocampus, entorhinal cortex, fusiform gyrus, middle temporal cortex, and inferior temporal cortex whenever observed before the anchor\. Cerebrospinal fluid history includes amyloid\-β, tau, and phosphorylated tau when available\. These biomarker histories are retained because prior values remain scientifically informative even when they are not observed at the exact anchor visit\.
Anchor\-level and slowly varying variables include age, sex, education, APOE4 allele count, baseline diagnostic context, months from first visit, and exact target\-gap months\. These quantities provide stable clinical context and disease\-history information that complement the repeated measures\. They are also the natural covariates for the mixed\-effects reference model used later in the paper\.
Derived variables are used to represent the irregular structure of follow\-up explicitly\. For statistical modeling, the analysis includes missingness indicators and modality\-recency variables that record how long it has been since structural magnetic resonance imaging or cerebrospinal fluid was last observed\. For sequence models, the same history is represented through observation\-level triplets and elapsed\-time tensors\. This construction keeps timing and modality availability explicit rather than hiding them inside strict complete\-case filtering\. Table [2](https://arxiv.org/html/2605.16319#S3.T2) lists the clinical, biomarker, timing, missingness, sequence\-format, and response variables used to construct the statistical and neural model inputs\.
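The missingness and recency variables described above can be sketched for a single modality at a single anchor. This is a minimal illustration with hypothetical names: how an unobserved modality's recency is encoded (here `None`) is an assumption, since the text does not specify it.

```python
def recency_features(anchor_month, observation_months):
    """Return (missing_flag, months_since_last) for one modality at one anchor.

    Only observations at or before the anchor count as available history.
    """
    prior = [m for m in observation_months if m <= anchor_month]
    if not prior:
        return 1, None  # modality never observed before the anchor
    return 0, anchor_month - max(prior)


# Hypothetical anchor at month 30: imaging at months 0, 12.5, and 48,
# cerebrospinal fluid only at month 36 (after the anchor).
mri_flag, mri_recency = recency_features(30.0, [0.0, 12.5, 48.0])
csf_flag, csf_recency = recency_features(30.0, [36.0])
```

The imaging observation at month 48 is ignored because it falls after the anchor, so the recency is measured from month 12.5. The cerebrospinal fluid measurement is treated as missing for the same reason.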
Table 2: Covariate and outcome structure for the 24-month CDR-SB-change analysis. Variables are grouped into longitudinal clinical predictors, longitudinal biomarker predictors, anchor-level predictors, derived timing and missingness variables, sequence-format variables, and the response.

| Block | Variables included | Temporal status | Role in the analysis |
| --- | --- | --- | --- |
| **Longitudinal clinical and functional predictors** | | | |
| Clinical severity | CDR-SB, MMSE, ADAS13 | Repeated pre-anchor measurements | Capture cognitive burden and its pre-anchor trajectory. These variables provide the main longitudinal clinical signal. |
| Functional status | FAQ | Repeated pre-anchor measurements | Captures functional impairment before the anchor and complements cognitive scores. |
| **Longitudinal biomarker predictors** | | | |
| Structural magnetic resonance imaging history | Whole brain, ventricles, hippocampus, entorhinal cortex, fusiform gyrus, middle temporal cortex, inferior temporal cortex | Observed at or before anchor when available | Summarizes regional and global structural neurodegeneration. Prior values are retained when same-visit magnetic resonance imaging is unavailable. |
| Cerebrospinal fluid history | Amyloid-$\beta$, tau, phosphorylated tau | Observed at or before anchor when available | Represents molecular pathology before the anchor and complements clinical and structural magnetic resonance imaging information. |
| **Anchor-level and slowly varying predictors** | | | |
| Demographic variables | Age, sex, education | Anchor-level or effectively fixed | Provide background clinical adjustment and account for demographic heterogeneity. |
| Genetic variable | APOE4 allele count | Fixed | Encodes inherited risk information related to Alzheimer's disease progression. |
| Anchor diagnostic context | MCI anchor indicator and available diagnosis coding | Anchor-level | Records the clinical state from which future worsening is modeled and preserves the diagnostic coding used during cohort construction. |
| Disease-history timing | Months from first visit, exact target-gap months | Anchor-level derived quantities | Encodes duration of observed follow-up history and the exact distance between anchor and outcome visit. |
| **Derived variables for irregular clinical and biomarker follow-up** | | | |
| History-aware magnetic resonance imaging summaries | Most recent prior structural magnetic resonance imaging summaries | Derived from pre-anchor history | Preserves structural biomarker information when same-visit imaging is unavailable. |
| History-aware cerebrospinal fluid summaries | Most recent prior amyloid-$\beta$, tau, and phosphorylated tau summaries | Derived from pre-anchor history | Preserves molecular biomarker information when same-visit cerebrospinal fluid is unavailable. |
| Missingness indicators | Variable-specific missingness flags for clinical, structural magnetic resonance imaging, and cerebrospinal fluid features | Derived from observation pattern | Records which measurements are observed and allows missing-data patterns to enter the statistical and neural models. |
| Modality recency | Months since structural magnetic resonance imaging was last observed; months since cerebrospinal fluid was last observed | Derived from observation timing | Distinguishes recent biomarker measurements from older carried-forward history. |
| **Sequence-format variables for neural models** | | | |
| Observation-level triplets | Observation time, variable identity, and observed value, denoted $(\tau_j, k_j, v_j)$ | Derived from longitudinal history | Input representation for STraTS and the proposed gap-aware transformer. |
| Elapsed-time tensors | Time since last observation by variable | Derived from longitudinal history | Input representation for GRU-D and other time-aware recurrent models. |
| **Response** | | | |
| Primary response | 24-month CDR-SB change | Future value minus anchor value | Quantitative regression target for medium-horizon clinical worsening. |

Table [3](https://arxiv.org/html/2605.16319#S3.T3) summarizes the analytic construction of the study, including the data source, anchor definition, available history, outcome window, response construction, biomarker handling, repeated-anchor structure, and evaluation unit.

Table 3: Analytic construction of the 24-month CDR-SB-change study.
## 4 Proposed Model and Structural Properties
This section presents the proposed residual gap-aware transformer for 24-month CDR-SB change. The model combines a mixed-effects statistical reference with a transformer encoder for irregular longitudinal clinical and biomarker histories. The mixed-effects component provides a longitudinal statistical reference based on anchor-level covariates, while the transformer component learns residual trajectory information from the pre-anchor history. The section first defines the model, then states the structural assumptions and propositions used to characterize the residual decomposition and the time-gap attention mechanism. Detailed proofs are provided in Appendix [A](https://arxiv.org/html/2605.16319#A1).

Figure 1: Overview of the proposed residual gap-aware transformer for predicting 24-month CDR-SB change. The upper branch uses anchor-level covariates to obtain the mixed-effects statistical reference prediction $g_i^{\mathrm{stat}}$. The lower branch encodes pre-anchor longitudinal clinical and biomarker history through observation-level triplet tokenization and a gap-aware transformer, producing the residual prediction $r_\theta(H_i)$. The final prediction is the sum $g_i^{\mathrm{stat}} + r_\theta(H_i)$.

### 4.1 Notation and prediction target
For notational simplicity, this section indexes labeled anchor samples by a single index $i$. Participant identifiers are used when fitting the mixed-effects reference model and when defining participant-level train–test splits in Section [5](https://arxiv.org/html/2605.16319#S5).
For anchor $i$, let

$$H_i = \{(\tau_{ij}, k_{ij}, v_{ij})\}_{j=1}^{m_i}$$

denote the longitudinal history observed at or before the anchor. Here $\tau_{ij} \in \mathbb{R}$ is the observation time measured relative to the anchor, $k_{ij} \in \{1, \dots, K\}$ is the variable identity, and $v_{ij} \in \mathbb{R}$ is the observed value. Let $p_i$ denote the participant identifier, let $x_i \in \mathbb{R}^{p}$ denote the anchor-level covariate vector used by the mixed-effects reference model, and let

$$y_i = \Delta\mathrm{CDR\mbox{-}SB}_{i,24m}$$

be the observed 24-month CDR-SB change.

The mixed-effects reference model is fit on the training set only, using participant identifiers, anchor-level covariates, and observed responses. Let

$$g_i^{\mathrm{stat}} = g_{\hat{\beta}}(x_i)$$

denote its fixed-effect prediction for anchor $i$. In all residual-learning and test-prediction steps, $g_i^{\mathrm{stat}}$ denotes the marginal fixed-effect prediction from the fitted mixed-effects reference model. Participant-specific random intercepts are used to fit the reference model in the training set, but they are not used as test-subject information.

The proposed model predicts

$$\hat{y}_i = g_i^{\mathrm{stat}} + r_\theta(H_i), \tag{2}$$

where $r_\theta(H_i)$ is a residual component learned from the pre-anchor longitudinal history. This decomposition gives the model two components: a statistical reference term and a sequence-based residual term.

Define the residual response

$$u_i = y_i - g_i^{\mathrm{stat}}. \tag{3}$$

Then fitting the model in ([2](https://arxiv.org/html/2605.16319#S4.E2)) is equivalent to fitting $u_i$ using the longitudinal history $H_i$. This equivalence is formalized in Proposition [1](https://arxiv.org/html/2605.16319#Thmproposition1).
### 4.2 Observation-level longitudinal tokenization
Each observed triplet $(\tau_{ij}, k_{ij}, v_{ij})$ is mapped to an initial token embedding

$$z_{ij}^{(0)} = e_\tau(\tau_{ij}) + e_k(k_{ij}) + W_v v_{ij} + b_v, \tag{4}$$

where $e_\tau(\cdot)$ is a time embedding, $e_k(\cdot)$ is a variable-identity embedding, and $W_v v_{ij} + b_v$ embeds the observed measurement value. Let

$$Z_i^{(0)} = \big(z_{i1}^{(0)}, \dots, z_{im_i}^{(0)}\big)^{\top} \in \mathbb{R}^{m_i \times d}$$

be the initial token matrix. This representation uses observed events directly and avoids forcing all variables onto a dense common time grid.
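The tokenization step can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: the helper names `to_triplets` and `embed_triplets`, the parameter dictionary, and in particular the form of the time embedding (a `tanh` of a linear map of $\tau$) are hypothetical, since the paper does not specify $e_\tau$. Measurement values are assumed to be standardized beforehand.

```python
import numpy as np

def to_triplets(visits, anchor_month, var_index):
    """Flatten visits into (tau, k, v) triplets observed at or before the anchor.

    visits: list of (month, {variable_name: value or None}).
    tau is measured relative to the anchor, so tau <= 0 for every token.
    """
    triplets = []
    for month, measurements in visits:
        if month > anchor_month:
            continue  # future visits are never part of the history
        for name, value in measurements.items():
            if value is None:
                continue  # unobserved entries produce no token
            triplets.append((float(month - anchor_month), var_index[name], float(value)))
    return triplets

def embed_triplets(triplets, params):
    """z = e_tau(tau) + e_k(k) + W_v * v + b_v, one row per observed triplet."""
    tau = np.array([t[0] for t in triplets])[:, None]   # shape (m, 1)
    k = np.array([t[1] for t in triplets], dtype=int)   # shape (m,)
    v = np.array([t[2] for t in triplets])[:, None]     # shape (m, 1)
    e_tau = np.tanh(tau * params["w_tau"] + params["b_tau"])  # assumed time embedding
    e_k = params["E_k"][k]                                    # lookup table for variable identity
    e_v = v * params["W_v"] + params["b_v"]                   # linear value embedding
    return e_tau + e_k + e_v

var_index = {"CDRSB": 0, "MMSE": 1, "HIPPO": 2}
visits = [  # made-up standardized values
    (0, {"CDRSB": 0.2, "MMSE": 0.5}),
    (6, {"CDRSB": 0.4, "MMSE": None}),
    (12, {"CDRSB": 0.7, "HIPPO": -0.4}),
    (36, {"CDRSB": 1.9}),  # after the anchor, so it is dropped
]
d = 4
rng = np.random.default_rng(0)
params = {
    "w_tau": rng.normal(size=d), "b_tau": np.zeros(d),
    "E_k": rng.normal(size=(len(var_index), d)),
    "W_v": rng.normal(size=d), "b_v": np.zeros(d),
}
triplets = to_triplets(visits, anchor_month=12, var_index=var_index)
Z0 = embed_triplets(triplets, params)  # initial token matrix, shape (m_i, d)
```

The number of rows in `Z0` equals the number of observed measurements, so a participant with sparse biomarker history simply contributes fewer tokens instead of a grid padded with imputed values.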
### 4.3 Gap-aware self-attention
The encoder follows the standard multi-head self-attention structure, with a time-gap penalty added to the pre-softmax attention score. For layer $\ell$ and head $h$, define

$$q_{ia}^{(\ell,h)} = W_Q^{(\ell,h)} z_{ia}^{(\ell-1)}, \qquad k_{ib}^{(\ell,h)} = W_K^{(\ell,h)} z_{ib}^{(\ell-1)}, \qquad v_{ib}^{(\ell,h)} = W_V^{(\ell,h)} z_{ib}^{(\ell-1)}.$$

The attention score from query token $a$ to key token $b$ is

$$s_{iab}^{(\ell,h)} = \frac{\langle q_{ia}^{(\ell,h)}, k_{ib}^{(\ell,h)} \rangle}{\sqrt{d_h}} - \lambda_{\ell,h}\,|\tau_{ia} - \tau_{ib}|, \tag{5}$$

where

$$\lambda_{\ell,h} = \mathrm{softplus}(\eta_{\ell,h}) \geq 0. \tag{6}$$

The corresponding attention weight is

$$\alpha_{iab}^{(\ell,h)} = \frac{\exp\big(s_{iab}^{(\ell,h)}\big)}{\sum_{c=1}^{m_i} \exp\big(s_{iac}^{(\ell,h)}\big)}. \tag{7}$$

The output of head $h$ at token $a$ is

$$o_{ia}^{(\ell,h)} = \sum_{b=1}^{m_i} \alpha_{iab}^{(\ell,h)} v_{ib}^{(\ell,h)}. \tag{8}$$

The multi-head output is

$$o_{ia}^{(\ell)} = W_O^{(\ell)}\,\mathrm{Concat}\big(o_{ia}^{(\ell,1)}, \dots, o_{ia}^{(\ell,H)}\big).$$

The token representation is updated through a residual feed-forward block,

$$z_{ia}^{(\ell)} = \mathrm{FFN}^{(\ell)}\!\left(z_{ia}^{(\ell-1)} + o_{ia}^{(\ell)}\right). \tag{9}$$

The nonnegative parameter $\lambda_{\ell,h}$ controls the strength of temporal attenuation in layer $\ell$ and head $h$. For fixed content similarity, larger temporal distance lowers the pre-softmax score. Propositions [3](https://arxiv.org/html/2605.16319#Thmproposition3) and [4](https://arxiv.org/html/2605.16319#Thmproposition4) state this property for the score and the resulting softmax weight.
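A single gap-aware attention head can be written in a few lines of NumPy. The sketch below follows equations (5) to (8) for one layer and one head; the function name `gap_aware_head`, the row-vector convention (`Z @ W_Q` in place of $W_Q z$), and the toy inputs are assumptions made for illustration.

```python
import numpy as np

def softplus(x):
    # Numerically stable log(1 + exp(x)); always nonnegative.
    return np.logaddexp(0.0, x)

def gap_aware_head(Z, tau, W_Q, W_K, W_V, eta):
    """One attention head with a learned time-gap penalty on the score."""
    Q, K, V = Z @ W_Q, Z @ W_K, Z @ W_V
    d_h = Q.shape[1]
    gaps = np.abs(tau[:, None] - tau[None, :])               # |tau_a - tau_b|
    scores = Q @ K.T / np.sqrt(d_h) - softplus(eta) * gaps   # equation (5)
    scores = scores - scores.max(axis=1, keepdims=True)      # stabilize the softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)   # equation (7)
    return weights @ V, weights                              # equation (8)

rng = np.random.default_rng(0)
d, d_h = 4, 2
Z = np.ones((3, d))                   # identical content for every token
tau = np.array([-12.0, -6.0, 0.0])    # months relative to the anchor
W_Q, W_K, W_V = (rng.normal(size=(d, d_h)) for _ in range(3))
out, alpha = gap_aware_head(Z, tau, W_Q, W_K, W_V, eta=0.0)
```

Because all three tokens carry identical content here, the only thing that separates the keys is the time gap. The anchor-time query (the last row of `alpha`) therefore puts the most weight on itself, less on the token 6 months earlier, and the least on the token 12 months earlier.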
### 4.4 Pooling and residual regression head
After $L$ transformer layers, the encoded tokens are

$$Z_i^{(L)} = \big(z_{i1}^{(L)}, \dots, z_{im_i}^{(L)}\big)^{\top}.$$

A learned pooling layer converts the variable-length history into a fixed-dimensional representation. Define

$$\pi_{ij} = \frac{\exp\!\big(w_p^{\top}\tanh(W_p z_{ij}^{(L)} + b_p)\big)}{\sum_{c=1}^{m_i} \exp\!\big(w_p^{\top}\tanh(W_p z_{ic}^{(L)} + b_p)\big)}, \tag{10}$$

and

$$h_i = \sum_{j=1}^{m_i} \pi_{ij} z_{ij}^{(L)}. \tag{11}$$

The residual prediction is

$$r_\theta(H_i) = w_r^{\top} h_i + b_r. \tag{12}$$

Combining ([2](https://arxiv.org/html/2605.16319#S4.E2)) and ([12](https://arxiv.org/html/2605.16319#S4.E12)) gives the final predictor

$$\hat{y}_i = g_i^{\mathrm{stat}} + w_r^{\top} h_i + b_r. \tag{13}$$
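The pooling layer and residual head in equations (10) to (13) can be sketched as follows. The function name `pool_and_predict`, the pooling width `d_p`, and the random inputs are assumptions for illustration; the code uses a row-vector convention, so `Z_L @ W_p` plays the role of $W_p z_{ij}^{(L)}$.

```python
import numpy as np

def pool_and_predict(Z_L, g_stat, W_p, b_p, w_p, w_r, b_r):
    """Attention pooling over encoded tokens, then a linear residual head."""
    logits = np.tanh(Z_L @ W_p + b_p) @ w_p   # one pooling logit per token
    logits = logits - logits.max()            # stabilize the softmax
    pi = np.exp(logits)
    pi = pi / pi.sum()                        # equation (10)
    h = pi @ Z_L                              # equation (11)
    residual = float(w_r @ h + b_r)           # equation (12)
    return g_stat + residual, residual, pi    # equation (13)

rng = np.random.default_rng(1)
m, d, d_p = 5, 4, 3
Z_L = rng.normal(size=(m, d))   # stand-in for encoder output
W_p, b_p, w_p = rng.normal(size=(d, d_p)), np.zeros(d_p), rng.normal(size=d_p)

# A generic head adds a history-dependent correction to the reference.
y_hat, residual, pi = pool_and_predict(
    Z_L, 1.2, W_p, b_p, w_p, w_r=rng.normal(size=d), b_r=0.0
)
# A zero head returns the statistical reference unchanged.
y_ref, zero_residual, _ = pool_and_predict(
    Z_L, 1.2, W_p, b_p, w_p, w_r=np.zeros(d), b_r=0.0
)
```

The second call shows why a linear head is convenient for this decomposition: setting $w_r = 0$ and $b_r = 0$ reproduces the mixed-effects prediction exactly, which is the zero-residual case used in Assumption 2.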
### 4.5 Training objective
For anchor $i$, the squared-error loss is

$$\ell_i(\theta) = \left(y_i - g_i^{\mathrm{stat}} - r_\theta(H_i)\right)^2. \tag{14}$$

The empirical objective is

$$\mathcal{L}_n(\theta) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - g_i^{\mathrm{stat}} - r_\theta(H_i)\right)^2. \tag{15}$$

Using the residual response $u_i$ in ([3](https://arxiv.org/html/2605.16319#S4.E3)), this objective can also be written as

$$\mathcal{L}_n(\theta) = \frac{1}{n}\sum_{i=1}^{n}\left(u_i - r_\theta(H_i)\right)^2. \tag{16}$$

This shows that the transformer component is trained as a residual regression model relative to the fitted mixed-effects reference.
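The identity between the two forms of the objective is easy to confirm numerically. The arrays below are random placeholders rather than study data, and the variable names are chosen only for this check.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
y = rng.normal(size=n)        # placeholder observed 24-month changes
g_stat = rng.normal(size=n)   # placeholder fixed-effect reference predictions
r = rng.normal(size=n)        # placeholder transformer residual outputs

u = y - g_stat                                    # residual response, equation (3)
loss_full = np.mean((y - g_stat - r) ** 2)        # equation (15)
loss_residual = np.mean((u - r) ** 2)             # equation (16)

# With a zero residual the objective reduces to the reference model's own MSE.
loss_reference = np.mean((y - g_stat) ** 2)
loss_zero_residual = np.mean((u - np.zeros(n)) ** 2)
```

In practice this means the reference predictions can be computed once, subtracted from the targets, and the transformer trained on `u` with an ordinary squared-error loss.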
### 4.6 Training algorithm
Algorithm [1](https://arxiv.org/html/2605.16319#alg1) summarizes the training procedure. The mixed-effects reference is fit first on the training anchors only. Its fixed-effect predictions are then used as reference values. The gap-aware transformer is trained under the residual regression objective in ([15](https://arxiv.org/html/2605.16319#S4.E15)).

Algorithm 1: Training the proposed residual gap-aware transformer for 24-month CDR-SB change

1. Input: training anchors $\{(p_i, H_i, x_i, y_i)\}_{i=1}^{n}$ with participant identifiers $p_i$, observation-level histories $H_i$, anchor-level covariates $x_i$, and targets $y_i = \Delta\mathrm{CDR\mbox{-}SB}_{i,24m}$; epochs $E$; batch size $B$; learning rate $\eta$
2. Fit the training-only mixed-effects reference model on $\{(p_i, x_i, y_i)\}_{i=1}^{n}$
3. Compute fixed-effect reference predictions $g_i^{\mathrm{stat}}$ for all training anchors
4. Initialize transformer parameters $\theta$ and gap-penalty parameters $\{\eta_{\ell,h}\}$
5. for $e = 1, \dots, E$ do
6. Shuffle the training anchors
7. for each mini-batch $\mathcal{B}$ of size $B$ do
8. Convert each history $H_i$ in $\mathcal{B}$ to observation-level triplets $(\tau_j, k_j, v_j)$
9. Embed triplets and run the transformer encoder
10. Compute gap-aware attention scores using $s_{ab}^{(\ell,h)} = \frac{\langle q_a^{(\ell,h)}, k_b^{(\ell,h)} \rangle}{\sqrt{d_h}} - \mathrm{softplus}(\eta_{\ell,h})\,|\tau_a - \tau_b|$
11. Produce residual predictions $r_\theta(H_i)$ for all $i \in \mathcal{B}$
12. Form final predictions $\hat{y}_i = g_i^{\mathrm{stat}} + r_\theta(H_i)$
13. Compute mini-batch loss $\mathcal{L}_{\mathcal{B}} = \frac{1}{|\mathcal{B}|}\sum_{i \in \mathcal{B}} (\hat{y}_i - y_i)^2$
14. Update $\theta$ and $\{\eta_{\ell,h}\}$ by backpropagation with learning rate $\eta$
15. end for
16. Evaluate validation error and apply early stopping when validation performance stops improving
17. end for
18. return trained model $\hat{y}_i = g_i^{\mathrm{stat}} + r_\theta(H_i)$
### 4.7 Assumptions and structural properties
The following assumptions state the regularity conditions used to characterize the proposed model. The results are deterministic properties of the empirical objective and the model map. They formalize two features of the architecture: residualization relative to the mixed-effects reference model and temporal attenuation induced by the gap-aware attention score. Complete proofs are provided in Appendix [A](https://arxiv.org/html/2605.16319#A1).
###### Assumption 1 (Finite observed response and finite reference prediction).
For every training anchor $i$, the observed response $y_i$ is finite, and the fitted statistical reference prediction $g_i^{\mathrm{stat}}$ is finite.
###### Assumption 2 (Feasible zero residual).
The residual function class

$$\mathcal{R} = \{r_\theta : \theta \in \Theta\}$$

contains the zero function. That is, there exists $\theta_0 \in \Theta$ such that

$$r_{\theta_0}(H) = 0$$

for every admissible history $H$.
###### Assumption 3 (Nonnegative temporal gap penalty).
For every layer $\ell$ and head $h$,

$$\lambda_{\ell,h} = \mathrm{softplus}(\eta_{\ell,h}) \geq 0.$$

###### Assumption 4 (Bounded admissible histories and regularity of the residual map).
The admissible history space $\mathcal{H}$ is equipped with a metric $d_{\mathcal{H}}$. Each history has at most $m_{\max} < \infty$ observed tokens, and observed times and values lie in bounded sets after preprocessing. Variable identities are equipped with the discrete metric. The tokenization map from histories to initial embeddings is Lipschitz on $\mathcal{H}$. For fixed parameter values, all linear maps in the attention blocks, feed-forward blocks, pooling layer, and residual head have finite operator norms on the admissible domain. The activation functions used in the model are Lipschitz on the relevant bounded range.
###### Proposition 1 (Residual-risk equivalence).
For fixed statistical reference predictions $\{g_i^{\mathrm{stat}}\}_{i=1}^{n}$, minimizing

$$\mathcal{L}_n(\theta) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - g_i^{\mathrm{stat}} - r_\theta(H_i)\right)^2$$

over $\theta$ is equivalent to minimizing

$$\frac{1}{n}\sum_{i=1}^{n}\left(u_i - r_\theta(H_i)\right)^2, \qquad u_i = y_i - g_i^{\mathrm{stat}}.$$

###### Proposition 2 (Feasibility of the statistical reference in the empirical objective).
Under Assumption [2](https://arxiv.org/html/2605.16319#Thmassumption2),

$$\inf_{\theta \in \Theta} \mathcal{L}_n(\theta) \leq \frac{1}{n}\sum_{i=1}^{n}\left(y_i - g_i^{\mathrm{stat}}\right)^2. \tag{17}$$
###### Proposition 3 (Temporal monotonicity of the gap-aware score).
Fix a layer $\ell$, a head $h$, a query token $a$, and a key token $b$. Holding the content term

$$c_{ab}^{(\ell,h)} = \frac{\langle q_a^{(\ell,h)}, k_b^{(\ell,h)} \rangle}{\sqrt{d_h}}$$

fixed, the pre-softmax attention score

$$s_{ab}^{(\ell,h)}(d) = c_{ab}^{(\ell,h)} - \lambda_{\ell,h}\, d, \qquad d = |\tau_a - \tau_b|,$$

is nonincreasing in $d$ and strictly decreasing whenever $\lambda_{\ell,h} > 0$.
###### Proposition 4 (Temporal monotonicity of the softmax attention weight).
Fix a query token $a$, a head $h$, and all key scores except that of one key token $b$. Suppose $\lambda_{\ell,h} > 0$ and the content term for token $b$ is held fixed. Then the corresponding softmax attention weight

$$\alpha_b(d) = \frac{\exp(c_b - \lambda_{\ell,h} d)}{\sum_{c \neq b} \exp(s_c) + \exp(c_b - \lambda_{\ell,h} d)}$$

is nonincreasing as $d = |\tau_a - \tau_b|$ increases. If token $b$ competes with at least one other key token, the decrease is strict.
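One way to see the monotonicity of the softmax weight, offered as a short sketch rather than as the appendix proof, is to differentiate with respect to the gap. The symbols $S$ and $s$ below are shorthand introduced here: $S = \sum_{c \neq b} \exp(s_c)$ is the fixed total mass of the competing keys and $s = c_b - \lambda_{\ell,h} d$ is the score of token $b$.

$$\alpha_b(d) = \frac{e^{s}}{S + e^{s}}, \qquad \frac{\mathrm{d}\alpha_b}{\mathrm{d}s} = \alpha_b(1 - \alpha_b), \qquad \frac{\mathrm{d}s}{\mathrm{d}d} = -\lambda_{\ell,h},$$

$$\frac{\mathrm{d}\alpha_b}{\mathrm{d}d} = -\lambda_{\ell,h}\,\alpha_b(d)\,\big(1 - \alpha_b(d)\big) \leq 0.$$

Since $1 - \alpha_b(d) = S / (S + e^{s})$, the derivative is strictly negative exactly when $\lambda_{\ell,h} > 0$ and $S > 0$, that is, when at least one other key competes with token $b$. If token $b$ is the only key, then $\alpha_b = 1$ for every gap and the weight cannot change.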
###### Proposition 5 (Stability of the residual predictor).
Under Assumption [4](https://arxiv.org/html/2605.16319#Thmassumption4), the residual predictor $r_\theta(\cdot)$ is Lipschitz on the admissible history domain. That is, there exists a finite constant $L_\theta$ such that

$$|r_\theta(H) - r_\theta(H')| \leq L_\theta\, d_{\mathcal{H}}(H, H') \tag{18}$$

for any two admissible histories $H, H' \in \mathcal{H}$.
These propositions give the formal role of the two structural components in the model. Propositions [1](https://arxiv.org/html/2605.16319#Thmproposition1) and [2](https://arxiv.org/html/2605.16319#Thmproposition2) show that the transformer is trained as a residual regression model relative to the mixed-effects reference and that the empirical objective contains the statistical reference as a feasible special case. Propositions [3](https://arxiv.org/html/2605.16319#Thmproposition3) and [4](https://arxiv.org/html/2605.16319#Thmproposition4) show that the learned nonnegative gap penalty induces temporal attenuation in the attention mechanism. Proposition [5](https://arxiv.org/html/2605.16319#Thmproposition5) gives a stability property of the residual predictor under boundedness and Lipschitz conditions. These results clarify the structure of the proposed model; they should be interpreted as structural properties rather than as a full finite-sample generalization theory.
## 5 Experiments: Comparator Models and Evaluation Protocol
This section describes the experimental design used for the 24\-month CDR\-SB\-change analysis\. All models are evaluated under the same anchor construction, participant\-level splitting rule, training–validation–test logic, and test metrics\. The comparator set is chosen to cover three modeling families that are relevant to the proposed method: longitudinal mixed\-effects modeling, missingness\-aware recurrent modeling, and transformer\-based modeling for irregular clinical time series\.
### 5.1 Comparator models
#### Linear mixed-effects baseline.
The first comparator is a linear mixed-effects model. Mixed-effects models are standard for repeated-measures data because they separate population-level fixed effects from participant-level random variation. Laird and Ware ([1982](https://arxiv.org/html/2605.16319#bib.bib16)) introduced the random-effects formulation for longitudinal data, Verbeke and Molenberghs ([2000](https://arxiv.org/html/2605.16319#bib.bib33)) gave a full treatment of linear mixed models for repeated outcomes, and Fitzmaurice et al. ([2011](https://arxiv.org/html/2605.16319#bib.bib10)) emphasized their use for within-subject dependence in applied longitudinal studies.
For anchor $j$ from participant $i$, the general linear mixed-effects model is written as

$$y_{ij} = x_{ij}^{\top}\beta + z_{ij}^{\top} b_i + \varepsilon_{ij}, \tag{19}$$

where

$$y_{ij} = \Delta\mathrm{CDR\mbox{-}SB}_{ij,24m}.$$

Here $x_{ij}$ is the fixed-effect covariate vector, $\beta$ is the fixed-effect coefficient vector, $z_{ij}$ is the random-effect design vector, $b_i$ is the participant-specific random-effect vector, and $\varepsilon_{ij}$ is the residual error. The distributional assumptions are

$$b_i \sim N(0, D), \qquad \varepsilon_{ij} \sim N(0, \sigma^2), \qquad b_i \perp \varepsilon_{ij}. \tag{20}$$

In the implemented baseline, the random-effect structure is a participant-level random intercept. Thus $z_{ij} = 1$, $b_i = b_{0i}$, and the fitted model is

$$y_{ij} = \beta_0 + x_{ij}^{\top}\beta + b_{0i} + \varepsilon_{ij}, \qquad b_{0i} \sim N(0, \sigma_b^2). \tag{21}$$

This form keeps the repeated-anchor dependence explicit while avoiding a more complex random-slope structure that would be harder to estimate stably under the participant-level train–test split.
The fixed-effect covariates include current cognitive and functional measurements, target-gap information, demographics, APOE4 allele count, anchor diagnostic context, structural magnetic resonance imaging history summaries, cerebrospinal fluid biomarker history summaries, missingness indicators, and modality-recency covariates. Candidate fixed-effect sets are selected within the training partition only by the Bayesian information criterion,

$$\mathrm{BIC}(M) = -2\,\ell_M(\hat{\theta}_M) + q_M \log(n_{\mathrm{train}}), \tag{22}$$

where $\ell_M(\hat{\theta}_M)$ is the maximized log-likelihood for candidate model $M$, $q_M$ is the number of estimated parameters, and $n_{\mathrm{train}}$ is the number of training anchors. The Bayesian information criterion is used as a likelihood-based model selection rule with an explicit penalty for model dimension (Schwarz, [1978](https://arxiv.org/html/2605.16319#bib.bib24)). The participant random intercept is retained throughout the selection procedure.
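The selection rule in (22) amounts to a small computation once each candidate model has been fit. The sketch below assumes the maximized log-likelihoods are already available; the function names, candidate labels, log-likelihood values, and training size are all made up for illustration and are not results from the paper.

```python
import math

def bic(log_likelihood, n_params, n_train):
    """BIC(M) = -2 * loglik + q_M * log(n_train); lower is better."""
    return -2.0 * log_likelihood + n_params * math.log(n_train)

def select_by_bic(candidates, n_train):
    """candidates: {name: (maximized log-likelihood, number of estimated parameters)}."""
    scores = {name: bic(ll, q, n_train) for name, (ll, q) in candidates.items()}
    best = min(scores, key=scores.get)
    return best, scores

# Hypothetical candidates fit on the training partition only.
candidates = {
    "clinical_only": (-3500.0, 8),
    "clinical_plus_mri": (-3480.0, 15),
}
best, scores = select_by_bic(candidates, n_train=2000)
```

In this made-up comparison the larger model improves the log-likelihood by 20 units, but its seven extra parameters cost about $7 \log(2000) \approx 53$ in penalty, so the smaller fixed-effect set is selected.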
For held-out test participants, prediction uses the marginal fixed-effect component,

$$\hat{y}_{ij}^{\mathrm{stat}} = \hat{\beta}_0 + x_{ij}^{\top}\hat{\beta}. \tag{23}$$

The participant-specific random intercept is not used for test prediction because the held-out participants are not observed during training. Equivalently, the prediction uses $E(b_{0i}) = 0$ for new participants. This rule keeps the evaluation aligned with the participant-level split and avoids using subject-specific outcome information from the test set.
#### GRU-D.
The second comparator is GRU-D, a recurrent model designed for multivariate time series with missing values and irregular observation times (Che et al., [2018](https://arxiv.org/html/2605.16319#bib.bib7)). It is included because the present data contain both missingness and variable-specific elapsed time since last observation.
Let $x_t \in \mathbb{R}^{D}$ denote the observed feature vector at time $t$, $m_t \in \{0,1\}^{D}$ the observation mask, and $\delta_t \in \mathbb{R}_{+}^{D}$ the elapsed time since each variable was last observed. GRU-D learns variable-specific input decay and hidden-state decay terms,

$$\gamma_t^{x} = \exp\!\left\{-\max\left(0,\, W_\gamma^{x}\delta_t + b_\gamma^{x}\right)\right\}, \tag{24}$$

$$\gamma_t^{h} = \exp\!\left\{-\max\left(0,\, W_\gamma^{h}\delta_t + b_\gamma^{h}\right)\right\}. \tag{25}$$

The input decay is used to form a completed input vector,

$$\tilde{x}_t = m_t \odot x_t + (1 - m_t) \odot \left[\gamma_t^{x} \odot x_t^{\mathrm{last}} + (1 - \gamma_t^{x}) \odot \bar{x}\right], \tag{26}$$

where $x_t^{\mathrm{last}}$ is the most recent observed value for each variable, $\bar{x}$ is the empirical mean vector computed from the training data, and $\odot$ denotes elementwise multiplication. The hidden state is decayed as

$$\tilde{h}_{t-1} = \gamma_t^{h} \odot h_{t-1}. \tag{27}$$

The recurrent update then follows a gated recurrent unit form using the completed input, the decayed hidden state, and the observation mask:

$$r_t = \sigma(W_r \tilde{x}_t + U_r \tilde{h}_{t-1} + V_r m_t + b_r), \tag{28}$$

$$q_t = \sigma(W_q \tilde{x}_t + U_q \tilde{h}_{t-1} + V_q m_t + b_q), \tag{29}$$

$$\tilde{h}_t^{\,*} = \tanh(W_h \tilde{x}_t + U_h (r_t \odot \tilde{h}_{t-1}) + V_h m_t + b_h), \tag{30}$$

$$h_t = (1 - q_t) \odot \tilde{h}_{t-1} + q_t \odot \tilde{h}_t^{\,*}. \tag{31}$$

The final hidden state is passed to a regression head to predict 24-month CDR-SB change. GRU-D is a strong non-transformer comparator because its architecture directly encodes two features of the study data: stale measurements and informative missingness.
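The input-decay step in equations (24) and (26) can be illustrated on a single time step. This sketch assumes a diagonal decay weight, so that `w_gamma` acts elementwise on the elapsed times; the function name `grud_complete_input` and all numeric values are hypothetical.

```python
import numpy as np

def grud_complete_input(x, m, delta, x_last, x_mean, w_gamma, b_gamma):
    """Complete a partially observed input vector with decayed carry-forward.

    Observed entries (m == 1) pass through unchanged. Missing entries are a
    blend of the last observed value and the training mean, with the blend
    shifting toward the mean as the elapsed time delta grows.
    """
    gamma = np.exp(-np.maximum(0.0, w_gamma * delta + b_gamma))   # equation (24), diagonal W
    imputed = gamma * x_last + (1.0 - gamma) * x_mean
    x_tilde = np.where(m == 1, x, imputed)                        # equation (26)
    return x_tilde, gamma

x = np.array([2.0, np.nan])        # second variable is missing at this visit
m = np.array([1, 0])               # observation mask
delta = np.array([0.0, 12.0])      # months since each variable was last observed
x_last = np.array([2.0, 5.0])      # most recent observed values
x_mean = np.array([1.0, 3.0])      # training means
w_gamma = np.array([0.1, 0.1])
b_gamma = np.zeros(2)

x_tilde, gamma = grud_complete_input(x, m, delta, x_last, x_mean, w_gamma, b_gamma)
```

The missing entry lands between the training mean (3.0) and the last observed value (5.0). With a longer gap or a larger decay weight it moves closer to the mean, which is how the model discounts stale measurements.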
#### STraTS.
The third comparator is STraTS, a transformer-based model for irregularly sampled multivariate clinical time series (Tipirneni and Reddy, [2021](https://arxiv.org/html/2605.16319#bib.bib27)). STraTS is included because it uses observation-level triplets rather than forcing all variables onto a dense visit-by-variable grid.

For a history represented by observed triplets

$$H_i = \{(\tau_{ij}, k_{ij}, v_{ij})\}_{j=1}^{m_i},$$

STraTS maps each triplet to an embedding

$$e_{ij} = e_\tau(\tau_{ij}) + e_k(k_{ij}) + e_v(v_{ij}), \tag{32}$$

where $e_\tau$, $e_k$, and $e_v$ encode observation time, variable identity, and observed value. The embedded sequence is then processed by self-attention. In a standard attention head, the pre-softmax score is

$$s_{iab} = \frac{\langle W_Q e_{ia}, W_K e_{ib} \rangle}{\sqrt{d_h}}, \tag{33}$$

and the attention weight is

$$\alpha_{iab} = \frac{\exp(s_{iab})}{\sum_{c=1}^{m_i} \exp(s_{iac})}. \tag{34}$$

The resulting token representations are pooled and passed to a regression head. This comparator is important because it isolates the value of the proposed residualization and gap-aware attention beyond the use of a transformer-family architecture itself. The proposed model shares the triplet-style longitudinal representation with STraTS, but adds the mixed-effects reference term and a learned time-gap penalty inside the attention score.
#### Proposed residual gap\-aware transformer\.
The proposed model is the structured statistical-neural model defined in Section [4](https://arxiv.org/html/2605.16319#S4). It first fits the mixed-effects reference model on the training partition, then learns a transformer-based residual from the pre-anchor longitudinal clinical and biomarker history. In the experiment, this model is evaluated against the three comparators above under the same participant-level splits and the same test metrics.
### 5.2 Participant-level splitting and repeated-seed design
All experiments use participant\-level splitting\. This choice is essential because a single participant can contribute more than one eligible MCI anchor\. A visit\-level split would allow earlier and later anchors from the same participant to appear in different data partitions, which would make the test set partially identifiable from the training set\. The split is therefore performed at the participant level before any anchor\-level model fitting or validation step is carried out\.
For each random seed, 80% of participants are assigned to the training and validation pool, and the remaining 20% are held out as the test set\. Validation participants are selected only from the training pool\. After the participant split is fixed, all eligible anchors from a participant stay in the same partition\. Thus, every training, validation, and test set contains complete participant\-level anchor histories rather than isolated visits\. The repeated\-seed analysis uses five participant\-level random seeds: 42, 43, 44, 45, and 46\.
This design gives two advantages\. First, it protects the test evaluation from within\-subject leakage\. Second, it reduces the dependence of the final results on a single random split\. The reported performance summaries are therefore based on repeated participant\-level train–validation–test partitions rather than on one arbitrary partition of the cohort\.
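A participant-level split of this kind can be sketched with the standard library alone. The 80/20 division and the seeds 42 to 46 come from the text; the validation fraction `val_frac`, the function names, and the synthetic participant identifiers are assumptions, since the paper does not state how large the validation subset is.

```python
import random

def participant_level_split(participant_ids, seed, test_frac=0.2, val_frac=0.1):
    """Split unique participants, never individual anchors.

    val_frac is a hypothetical share of the training pool held out for
    validation; the test set is drawn first and is never touched again.
    """
    participants = sorted(set(participant_ids))  # sort for reproducibility
    rng = random.Random(seed)
    rng.shuffle(participants)
    n_test = round(test_frac * len(participants))
    test = set(participants[:n_test])
    pool = participants[n_test:]
    n_val = round(val_frac * len(pool))
    val = set(pool[:n_val])
    train = set(pool[n_val:])
    return train, val, test

def assign_anchors(anchor_participants, train, val, test):
    """Every anchor inherits the partition of its participant."""
    return ["train" if p in train else "val" if p in val else "test"
            for p in anchor_participants]

# 160 synthetic anchors from 50 synthetic participants (several anchors each).
anchor_participants = [f"P{i % 50:03d}" for i in range(160)]
splits = {seed: participant_level_split(anchor_participants, seed)
          for seed in (42, 43, 44, 45, 46)}
train, val, test = splits[42]
labels = assign_anchors(anchor_participants, train, val, test)
```

Because the partition is decided on the set of unique participant identifiers, two anchors from the same person can never end up on opposite sides of the train–test boundary, which is the leakage the section is designed to prevent.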
### 5.3 Hyperparameters and implementation
Implementation details are summarized in Table [4](https://arxiv.org/html/2605.16319#S5.T4). The linear mixed-effects model is used as the longitudinal statistical reference. GRU-D and STraTS are used as neural comparators for irregular clinical time series, and the proposed model is evaluated using the architecture defined in Section [4](https://arxiv.org/html/2605.16319#S4). All model fitting, feature selection, and early stopping decisions are carried out within the training and validation partitions only.

Table 4: Implementation settings for the compared model families.

The settings are kept moderate across the neural models. The purpose of the experiment is to compare model families under the same longitudinal progression task, rather than to perform an exhaustive architecture search. The proposed model is therefore evaluated with a small transformer encoder and compared against recurrent and transformer baselines trained under the same participant-level split logic.
### 5.4 Evaluation metrics
Evaluation is performed on held-out test anchors from held-out participants. Let $\mathcal{I}_{\mathrm{test}}$ denote the test anchor set and let

$$N_{\mathrm{test}}=|\mathcal{I}_{\mathrm{test}}|.$$

For each test anchor $(i,j)$, let $y_{ij}$ be the observed 24-month CDR-SB change and $\hat{y}_{ij}$ be the model prediction. The prediction error is

$$e_{ij}=y_{ij}-\hat{y}_{ij}.$$

We report mean squared error,

$$\mathrm{MSE}=\frac{1}{N_{\mathrm{test}}}\sum_{(i,j)\in\mathcal{I}_{\mathrm{test}}}e_{ij}^{2},\tag{35}$$

mean absolute error,

$$\mathrm{MAE}=\frac{1}{N_{\mathrm{test}}}\sum_{(i,j)\in\mathcal{I}_{\mathrm{test}}}|e_{ij}|,\tag{36}$$

root mean squared error,

$$\mathrm{RMSE}=\sqrt{\mathrm{MSE}},\tag{37}$$

and Pearson prediction–observation correlation,

$$\mathrm{Corr}=\frac{\sum_{(i,j)\in\mathcal{I}_{\mathrm{test}}}(y_{ij}-\bar{y}_{\mathrm{test}})(\hat{y}_{ij}-\bar{\hat{y}}_{\mathrm{test}})}{\sqrt{\sum_{(i,j)\in\mathcal{I}_{\mathrm{test}}}(y_{ij}-\bar{y}_{\mathrm{test}})^{2}}\sqrt{\sum_{(i,j)\in\mathcal{I}_{\mathrm{test}}}(\hat{y}_{ij}-\bar{\hat{y}}_{\mathrm{test}})^{2}}},\tag{38}$$

where

$$\bar{y}_{\mathrm{test}}=\frac{1}{N_{\mathrm{test}}}\sum_{(i,j)\in\mathcal{I}_{\mathrm{test}}}y_{ij},\qquad\bar{\hat{y}}_{\mathrm{test}}=\frac{1}{N_{\mathrm{test}}}\sum_{(i,j)\in\mathcal{I}_{\mathrm{test}}}\hat{y}_{ij}.$$

The error metrics measure numerical accuracy of predicted CDR-SB change, while the correlation measures how well the model preserves the ordering of test anchors by progression magnitude.
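The four metrics in Eqs. (35)–(38) can be computed directly from paired observations and predictions. The sketch below is an illustration only: the function name `regression_metrics` and the toy values are assumptions and do not come from the ADNI analysis.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MSE, MAE, RMSE and Pearson correlation for paired lists."""
    n = len(y_true)
    errors = [y - p for y, p in zip(y_true, y_pred)]   # e = y - y_hat
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(mse)

    y_bar = sum(y_true) / n
    p_bar = sum(y_pred) / n
    cov = sum((y - y_bar) * (p - p_bar) for y, p in zip(y_true, y_pred))
    var_y = sum((y - y_bar) ** 2 for y in y_true)
    var_p = sum((p - p_bar) ** 2 for p in y_pred)
    # Pearson correlation; undefined (division by zero) if either series is constant.
    corr = cov / math.sqrt(var_y * var_p)
    return {"MSE": mse, "MAE": mae, "RMSE": rmse, "Corr": corr}

# Hypothetical toy values of observed and predicted 24-month change (not ADNI data).
y_obs = [0.0, 0.5, 1.0, 2.5]
y_hat = [0.5, 0.5, 1.5, 2.0]
metrics = regression_metrics(y_obs, y_hat)
```

Note that the correlation is invariant to shifting or rescaling the predictions, so it complements the error metrics rather than duplicating them.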
For each metric, repeated-seed results are summarized by the mean and an approximate 95% split-level interval across the five participant-level random seeds. If $M_{s}$ denotes the metric value for seed $s\in\{1,\dots,S\}$, with $S=5$, then

$$\bar{M}=\frac{1}{S}\sum_{s=1}^{S}M_{s},\qquad\mathrm{SE}(\bar{M})=\frac{1}{\sqrt{S}}\left\{\frac{1}{S-1}\sum_{s=1}^{S}(M_{s}-\bar{M})^{2}\right\}^{1/2}.$$

The reported interval is

$$\bar{M}\pm t_{S-1,0.975}\,\mathrm{SE}(\bar{M}),\tag{39}$$

where $t_{S-1,0.975}$ is the 97.5th percentile of the $t$ distribution with $S-1$ degrees of freedom. These intervals are used as descriptive summaries of split-level stability, rather than as formal sampling-based confidence intervals.
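As a worked illustration of Eq. (39), the following sketch computes the mean and the split-level interval for five per-seed metric values. The per-seed numbers are placeholders, not results from the paper, and the critical value is hard-coded for $S=5$ (four degrees of freedom) to keep the example free of external dependencies.

```python
import math
import statistics

# 97.5th percentile of the t distribution with 4 degrees of freedom (valid for S = 5 only).
T_CRIT_DF4 = 2.776445

def split_level_interval(values, t_crit=T_CRIT_DF4):
    """Return (mean, lower, upper) of the approximate 95% split-level interval."""
    s = len(values)
    mean = statistics.fmean(values)
    se = statistics.stdev(values) / math.sqrt(s)   # stdev uses the S - 1 denominator
    half_width = t_crit * se
    return mean, mean - half_width, mean + half_width

# Hypothetical per-seed MSE values for seeds 42-46 (placeholders only).
toy_mse = [1.0, 1.2, 0.9, 1.1, 1.3]
mean, lo, hi = split_level_interval(toy_mse)
```

With only five splits the critical value is about 2.78 rather than 1.96, so the interval is noticeably wider than a normal-approximation interval would be.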
## 6 Results
### 6.1 Distribution of the 24-month CDR-SB-change target
We first examine the empirical distribution of the response variable. Figure [2](https://arxiv.org/html/2605.16319#S6.F2) shows the distribution of 24-month CDR-SB change across the labeled anchors used in the primary analysis. The outcome is centered near mild worsening, with mean 0.69 and median 0.50, and the distribution has a clear right tail. This shape is clinically meaningful. Many MCI anchors show small changes over the follow-up interval, while a smaller group shows larger worsening. A useful model therefore has to predict both common mild changes and less frequent larger increases in CDR-SB.
The response distribution also helps explain why the task is more demanding than predicting raw future severity. A model that mainly carries forward current disease burden can perform well on raw future CDR-SB, but 24-month CDR-SB change asks the model to predict subsequent worsening after the anchor. The distribution in Figure [2](https://arxiv.org/html/2605.16319#S6.F2) therefore motivates the use of both error metrics and correlation: error metrics evaluate numerical closeness to observed change, while correlation evaluates whether the model preserves relative ordering across smaller and larger worsening patterns.
Figure 2: Distribution of 24-month CDR-SB change in the primary analysis cohort. The distribution is centered near mild worsening and has a right tail, indicating that the analysis contains both common small changes and less frequent larger worsening events.
### 6.2 Primary performance comparison
Table [5](https://arxiv.org/html/2605.16319#S6.T5) reports repeated-seed test performance across the four compared model families under participant-level splitting. The proposed residual gap-aware transformer achieves the best mean performance across all four criteria: mean squared error, mean absolute error, root mean squared error, and prediction–observation correlation. The mixed-effects baseline provides the repeated-measures statistical reference, GRU-D provides a strong missingness-aware recurrent comparator, and STraTS provides the observation-level transformer comparator. Within this comparison set, the proposed method gives the strongest mean performance profile.
Table 5: Repeated-seed test performance on 24-month CDR-SB change. Values are mean ± approximate 95% split-level interval over participant-level seeds 42, 43, 44, 45, and 46. Lower is better for error metrics; higher is better for correlation.
Figure [3](https://arxiv.org/html/2605.16319#S6.F3) gives the same comparison visually. The proposed method has the lowest mean error on MSE, MAE, and RMSE, and it has the highest mean correlation. This visual summary is useful because it shows that the proposed method improves both numerical accuracy and agreement with the observed ordering of progression magnitude.
Figure 3: Main performance comparison for the 24-month CDR-SB-change task. Bars show repeated-seed mean performance, with approximate 95% split-level intervals across participant-level seeds. The proposed residual gap-aware transformer gives the best mean result across all four metrics.
Several aspects of the comparison are important. The mixed-effects baseline is a genuine longitudinal statistical comparator because it accounts for repeated anchors through participant random intercepts during training. GRU-D is also a strong baseline because it directly models missingness and elapsed time in irregular clinical sequences. STraTS is the closest transformer-family comparator because it uses observation-level triplets. The proposed model improves over all three, which supports the value of combining the mixed-effects reference, residual learning, and gap-aware attention in a single model.
### 6.3 Magnitude of improvement
Table [6](https://arxiv.org/html/2605.16319#S6.T6) summarizes the relative improvement of the proposed residual gap-aware transformer over each competing model family. Relative to the mixed-effects baseline, the proposed method reduces mean squared error by 13.1%, mean absolute error by 11.4%, and root mean squared error by 6.7%, while increasing prediction–observation correlation by 26.4%. These are the largest relative gains because the mixed-effects model is a strong but lower-flexibility statistical reference.
Relative to GRU\-D, the proposed method still improves every metric, although the margins are smaller\. This is important because GRU\-D is specifically designed for irregular time series with missing values\. Relative to STraTS, the proposed method again improves every metric, which shows that the advantage comes from more than using a transformer architecture\. The improvement over STraTS is especially relevant because both STraTS and the proposed model use observation\-level triplet representations of irregular longitudinal histories\.
Table 6: Relative improvement of the proposed residual gap-aware transformer over competing model families, computed from the repeated-seed means in Table [5](https://arxiv.org/html/2605.16319#S6.T5). Negative values indicate lower error; positive values indicate higher correlation.
These results give a direct interpretation of the contribution. The proposed method improves substantially over the mixed-effects statistical reference, which indicates that the pre-anchor longitudinal history contains residual signal beyond the fixed-effect statistical prediction. It also improves over GRU-D, a missingness-aware recurrent model, and over STraTS, an observation-level transformer model. The pattern therefore supports the combined structure of the proposed method: statistical anchoring first, residual sequence learning second, and time-gap-aware attention inside the residual learner.
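The sign convention of Table 6 corresponds to a signed percentage change relative to the competing model. The sketch below assumes that definition, which the text implies but does not write out, and uses placeholder metric values rather than the entries of Table 5. It also checks, under the simplifying assumption that RMSE behaves like the square root of MSE at the level of seed means, that the reported 13.1% MSE reduction is roughly in line with the reported 6.7% RMSE reduction.

```python
import math

def relative_change_pct(proposed, baseline):
    """Signed percentage change of the proposed model relative to a baseline.

    Negative means the proposed value is lower (good for error metrics);
    positive means it is higher (good for correlation).
    """
    return 100.0 * (proposed - baseline) / baseline

# Placeholder repeated-seed means (illustrative only, NOT values from Table 5).
toy_baseline = {"MSE": 2.00, "Corr": 0.40}
toy_proposed = {"MSE": 1.80, "Corr": 0.46}
changes = {k: relative_change_pct(toy_proposed[k], toy_baseline[k]) for k in toy_baseline}

# Rough consistency check using the percentages quoted in the text: an MSE
# reduction of 13.1% implies an RMSE reduction of 1 - sqrt(1 - 0.131) if RMSE
# were exactly the square root of the mean MSE. This is only approximate,
# because the RMSE column averages per-seed RMSE values.
implied_rmse_reduction = 100.0 * (1.0 - math.sqrt(1.0 - 0.131))
```

The implied value is a little under 7%, close to the reported 6.7%; the small difference is expected because averaging per-seed RMSE is not the same as taking the square root of the averaged MSE.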
### 6.4 Repeated-seed stability
The repeated-seed analysis examines whether the performance advantage is stable across participant-level splits. Figure [4](https://arxiv.org/html/2605.16319#S6.F4) displays the metric values for seeds 42, 43, 44, 45, and 46. The proposed method is consistently competitive across all seeds and is usually the best or among the best methods for each metric. Its strongest pattern appears in correlation, where it is highest across most seeds and highest on average.
The stability plot also shows that absolute performance levels vary across participant splits for every model. For example, the absolute error levels change across seeds, especially for MSE and RMSE. This variation is expected in a participant-level split with repeated anchors and heterogeneous progression patterns. The key point is that the proposed method maintains a strong position despite this split-to-split variability. This supports the conclusion that the improvement in Table [5](https://arxiv.org/html/2605.16319#S6.T5) is not driven by a single favorable partition.
Figure 4: Repeated-seed stability across participant-level splits. Each panel shows test performance across seeds 42, 43, 44, 45, and 46. The proposed residual gap-aware transformer remains consistently competitive and has the best mean performance across the reported metrics.
### 6.5 Main empirical message: statistical anchoring and gap-aware residual learning
The main empirical message is that the proposed model gains performance by combining two complementary sources of structure\. The mixed\-effects reference provides a statistical anchor for repeated\-anchor longitudinal data\. It uses anchor\-level covariates and participant random intercepts during training to represent the repeated\-measures structure of the cohort\. The transformer residual then focuses on the part of 24\-month CDR\-SB change that remains after this statistical reference has been fitted\. This division of labor gives the neural component a clearer role than direct end\-to\-end prediction from the full history\.
The improvement over the mixed\-effects baseline shows that the longitudinal clinical and biomarker history contains predictive information beyond the fixed\-effect statistical prediction\. This is the first important finding\. The mixed\-effects model already accounts for repeated anchors and uses the same anchor\-level clinical and biomarker summaries\. The residual gain therefore suggests that the pre\-anchor trajectory itself, including timing and irregular observation patterns, adds useful information for medium\-horizon progression prediction\.
The improvement over GRU\-D gives a second message\. GRU\-D is a strong comparator for irregular clinical time series because it explicitly models missingness and elapsed time\. The proposed method achieves a better repeated\-seed mean profile while using a different strategy: it represents the history as observation\-level triplets and lets attention weights depend directly on temporal distance\. This indicates that gap\-aware attention is a useful way to use irregular timing in the present CDR\-SB\-change task\.
The improvement over STraTS gives a third message\. STraTS is the closest transformer\-family comparator because it also uses observation\-level triplets\. The proposed method adds two elements on top of this representation: residualization against a mixed\-effects reference and a learned nonnegative time\-gap penalty in attention\. The gain over STraTS therefore supports the proposed architecture as more than a standard transformer applied to the same inputs\.
Taken together, the results support the central claim of the paper: medium\-horizon Alzheimer’s disease progression prediction benefits from aligning the statistical reference, neural residual learner, and time representation with the longitudinal data structure\. The 24\-month CDR\-SB\-change response focuses the task on worsening after the MCI anchor\. Participant\-level splitting protects the evaluation from within\-subject leakage\. The mixed\-effects component supplies the longitudinal statistical reference\. The gap\-aware transformer then learns residual trajectory information from irregular clinical and biomarker histories\. This combination gives the proposed model its empirical advantage\.
## 7 Discussion
This study shows that medium\-horizon Alzheimer’s disease progression can be modeled more effectively when the prediction target, covariate history, statistical reference, and neural architecture are matched to the same longitudinal data structure\. The main empirical result is that the proposed residual gap\-aware transformer achieves the best repeated\-seed mean performance across MSE, MAE, RMSE, and prediction–observation correlation for 24\-month CDR\-SB change\. This result is informative because the comparison includes a mixed\-effects repeated\-measures baseline, a missingness\-aware recurrent neural model, and an observation\-level transformer model for irregular clinical time series\. The improvement therefore reflects more than generic model flexibility\.
The first implication concerns the endpoint\. The 24\-month CDR\-SB\-change response places the prediction target on worsening after the anchor visit\. This differs from predicting raw future CDR\-SB, which remains strongly tied to current CDR\-SB in the present cohort\. The change\-score formulation makes the task more directly about medium\-horizon progression magnitude\. Because CDR\-SB is an ordered composite clinical scale, the response should be interpreted as a quantitative change score rather than as a strictly continuous biological measurement\. This interpretation supports regression modeling while remaining faithful to the clinical nature of the outcome\.
The second implication concerns longitudinal clinical and biomarker history construction\. The cohort profile shows that same\-visit structural magnetic resonance imaging and cerebrospinal fluid measurements are limited at the anchor visit, while any\-prior availability is broader\. The study therefore uses historically observed biomarker information through prior summaries and recency variables\. This design better reflects the way longitudinal Alzheimer’s disease data are collected\. It also avoids restricting the analysis to a narrow subset with strict same\-visit biomarker coverage\.
The third implication concerns the role of the mixed\-effects reference model\. The proposed model is built as a residual learner relative to a longitudinal statistical baseline\. This gives the neural component a clear role: it models trajectory information beyond the anchor\-level fixed\-effect structure captured by the mixed\-effects model\. The empirical gain over the mixed\-effects baseline suggests that such residual sequence information is useful\. The gain over STraTS further suggests that the advantage comes from the combination of residualization and time\-gap\-aware attention rather than from using a transformer alone\.
The comparison with GRU\-D is also informative\. GRU\-D is designed for irregular time series with missing values, and it remains a strong comparator in this study\. The proposed model improves over GRU\-D on the repeated\-seed mean of all reported metrics, but the margin is smaller than the gain over the mixed\-effects baseline\. This pattern is scientifically useful because it shows that missingness\-aware recurrent modeling is already well matched to this data structure, while the proposed model adds value by combining time\-aware representation learning with an explicit statistical reference\.
The results therefore support a specific methodological conclusion\. For irregular longitudinal clinical and biomarker histories in Alzheimer’s disease, a useful modeling strategy is to align four elements: a clinically interpretable medium\-horizon response, participant\-level evaluation, a longitudinal statistical comparator, and a neural architecture that directly represents irregular timing\. Under this alignment, the proposed residual gap\-aware transformer gives the strongest repeated\-seed mean performance in the present ADNI\-based analysis\.
## 8 Limitations
Several limitations should guide the interpretation of the results. First, the analysis is based on ADNI. ADNI is deeply characterized and carefully curated, but it is still a research cohort. Its participants, measurement schedule, and data quality differ from many routine-care settings. The present results therefore establish performance within an ADNI-derived progression analysis. Broader clinical transportability requires temporal validation, external validation, or internal–external validation across independent cohorts (Collins et al., [2015](https://arxiv.org/html/2605.16319#bib.bib8); Steyerberg and Harrell, [2016](https://arxiv.org/html/2605.16319#bib.bib26)).
Second, the primary response is based on CDR\-SB change\. CDR\-SB is clinically meaningful and widely used, but it remains an ordered composite score with bounded values and scale\-specific measurement properties\. Treating 24\-month CDR\-SB change as a quantitative regression response is reasonable for the present prediction task, yet future work should also examine robustness to alternative outcome definitions, including raw future CDR\-SB, clinically meaningful worsening thresholds, diagnosis conversion, and longer\-horizon change\.
Third, the biomarker inputs are table-derived summaries rather than raw imaging or raw molecular data. This choice keeps the paper focused on longitudinal prognosis from harmonized clinical and biomarker histories, but it leaves open the question of whether raw magnetic resonance imaging, positron emission tomography, or learned image representations could add signal beyond the regional summaries used here. This is especially relevant because modern Alzheimer’s disease research increasingly frames disease status through biological markers of amyloid, tau, and neurodegeneration (Jack et al., [2018](https://arxiv.org/html/2605.16319#bib.bib12)).
Fourth, the biomarker\-history construction uses most recent prior biomarker summaries and modality\-recency variables\. This is a practical and interpretable way to retain historical information under irregular follow\-up, but it remains a summary of the full biomarker trajectory\. More flexible approaches could model biomarker evolution directly, represent uncertainty in stale measurements, or combine longitudinal biomarker trajectories with clinical endpoints in a joint modeling framework\.
Fifth, the repeated\-seed analysis uses five participant\-level random seeds\. This improves over a single split and shows that the proposed model remains strong across multiple partitions, but the intervals across seeds should be interpreted as split\-level stability summaries rather than full sampling uncertainty\. Larger repeated\-split studies, bootstrap evaluation, or external validation would give a stronger assessment of stability\.
Finally, the theoretical results in this paper are structural properties of the proposed empirical objective and attention mechanism\. They clarify the residual\-risk equivalence, the recovery of the statistical reference as a feasible model, temporal attenuation from the nonnegative gap penalty, and Lipschitz stability under boundedness conditions\. These results help explain the architecture, but they do not provide a full finite\-sample generalization theorem for the learned deep model\.
## 9 Conclusion and Future Work
This paper develops a medium\-horizon Alzheimer’s disease progression analysis centered on 24\-month CDR\-SB change from longitudinal clinical and biomarker ADNI histories\. The proposed residual gap\-aware transformer combines a mixed\-effects statistical reference, observation\-level longitudinal tokenization, and a learned nonnegative time\-gap penalty inside attention\. Under participant\-level repeated\-seed evaluation, the proposed model achieves the strongest mean performance across all reported metrics and improves over a mixed\-effects baseline, GRU\-D, and STraTS\.
The main conclusion is that the value of the proposed method comes from the alignment between the clinical prediction problem and the model structure\. The response is a quantitative change score over a clinically interpretable two\-year horizon\. The inputs preserve irregular longitudinal histories rather than forcing strict same\-visit biomarker coverage\. The baseline comparison includes a repeated\-measures statistical reference\. The proposed model then learns residual trajectory signal with attention weights that account for temporal distance\. This combination gives a clear contribution to longitudinal Alzheimer’s disease progression modeling\.
Future work should first test external validity. The next stage should evaluate the model across additional Alzheimer’s disease cohorts, temporal ADNI splits, and more routine-care datasets. This direction is especially important for statistical-neural longitudinal modeling, because our recent Parkinson’s disease study on longitudinal voice biomarkers found that neural flexibility in small clinical cohorts requires careful validation against interpretable statistical references (Tong et al., [2026](https://arxiv.org/html/2605.16319#bib.bib28)). More broadly, clinical prediction models require validation beyond the development sample before their performance can be interpreted as transportable (Collins et al., [2015](https://arxiv.org/html/2605.16319#bib.bib8); Steyerberg and Harrell, [2016](https://arxiv.org/html/2605.16319#bib.bib26)).
A second direction is calibration and clinical utility. The present paper reports error metrics and prediction–observation correlation, which are appropriate for a regression analysis of CDR-SB change. Future work should also assess calibration of predicted progression magnitude, calibration within risk or progression strata, and clinical decision value. Calibration is a central requirement for trustworthy prediction models (Van Calster et al., [2019](https://arxiv.org/html/2605.16319#bib.bib29)), and decision-curve analysis can help assess whether predicted progression adds value for clinically relevant decisions (Vickers and Elkin, [2006](https://arxiv.org/html/2605.16319#bib.bib32)).
A third direction is modality expansion. The current analysis uses structural magnetic resonance imaging summaries and cerebrospinal fluid biomarkers. Future models could incorporate raw imaging, positron emission tomography, learned image embeddings, or richer biomarker trajectories. This would connect the statistical-neural progression model more closely to biological staging frameworks for Alzheimer’s disease (Jack et al., [2018](https://arxiv.org/html/2605.16319#bib.bib12)). Such extensions should be evaluated carefully because richer modalities can increase predictive signal while also increasing missingness, computational burden, and risk of overfitting.
A fourth direction is endpoint sensitivity\. The 24\-month CDR\-SB\-change endpoint is a useful primary response for the present paper, but the broader scientific question includes multiple horizons and multiple forms of progression\. Future studies should compare 12\-month, 24\-month, and 36\-month change; raw future CDR\-SB; diagnosis conversion; and time\-to\-progression outcomes\. These analyses would clarify whether the proposed residual gap\-aware structure is specifically strongest for quantitative change or whether it also improves other clinically relevant endpoints\.
Finally, future work should improve interpretability of the learned residual component\. The mixed\-effects reference already provides an interpretable statistical component\. The next step is to characterize which time points, variables, and data sources contribute most to the residual transformer prediction\. Attention patterns, source\-specific ablations, and counterfactual removal of historical biomarkers could help explain whether the model gains mainly from cognitive trajectories, biomarker recency, structural neurodegeneration summaries, cerebrospinal fluid history, or their interactions over time\.
Overall, the present study supports a focused conclusion: structured statistical\-neural modeling can improve 24\-month CDR\-SB\-change prediction from longitudinal clinical and biomarker ADNI histories when the outcome, covariate history, baseline model, and evaluation protocol are defined consistently\. Future work should test the scope, clinical reliability, and biological interpretation of that finding\.
## Appendix A Proofs of Structural Properties
This appendix gives the proofs of Propositions [1](https://arxiv.org/html/2605.16319#Thmproposition1)–[5](https://arxiv.org/html/2605.16319#Thmproposition5). The results are deterministic properties of the empirical objective and the proposed model map under the structural assumptions stated in Section [4](https://arxiv.org/html/2605.16319#S4).
### A.1 Proof of Proposition [1](https://arxiv.org/html/2605.16319#Thmproposition1)
For fixed statistical reference predictions, define

$$u_{i}=y_{i}-g_{i}^{\mathrm{stat}}.$$

For every $\theta\in\Theta$ and every training anchor $i$,

$$y_{i}-g_{i}^{\mathrm{stat}}-r_{\theta}(H_{i})=u_{i}-r_{\theta}(H_{i}).$$

Squaring both sides gives

$$\left(y_{i}-g_{i}^{\mathrm{stat}}-r_{\theta}(H_{i})\right)^{2}=\left(u_{i}-r_{\theta}(H_{i})\right)^{2}.$$

Averaging over $i=1,\dots,n$ yields

$$\frac{1}{n}\sum_{i=1}^{n}\left(y_{i}-g_{i}^{\mathrm{stat}}-r_{\theta}(H_{i})\right)^{2}=\frac{1}{n}\sum_{i=1}^{n}\left(u_{i}-r_{\theta}(H_{i})\right)^{2}.$$

Thus the original empirical objective and the residual empirical objective are identical for every $\theta$. Therefore the two optimization problems have the same objective values, the same infimum, and the same set of minimizers. This proves Proposition [1](https://arxiv.org/html/2605.16319#Thmproposition1). $\square$
### A.2 Proof of Proposition [2](https://arxiv.org/html/2605.16319#Thmproposition2)
By Assumption [2](https://arxiv.org/html/2605.16319#Thmassumption2), there exists $\theta_{0}\in\Theta$ such that

$$r_{\theta_{0}}(H)=0$$

for every admissible history $H$. In particular,

$$r_{\theta_{0}}(H_{i})=0$$

for every training history $H_{i}$. Evaluating the empirical loss at this feasible parameter value gives

$$\mathcal{L}_{n}(\theta_{0})=\frac{1}{n}\sum_{i=1}^{n}\left(y_{i}-g_{i}^{\mathrm{stat}}-r_{\theta_{0}}(H_{i})\right)^{2}.$$

Using $r_{\theta_{0}}(H_{i})=0$,

$$\mathcal{L}_{n}(\theta_{0})=\frac{1}{n}\sum_{i=1}^{n}\left(y_{i}-g_{i}^{\mathrm{stat}}\right)^{2}.$$

Since the infimum over $\Theta$ is no larger than the value of the objective at any feasible parameter point,

$$\inf_{\theta\in\Theta}\mathcal{L}_{n}(\theta)\leq\mathcal{L}_{n}(\theta_{0})=\frac{1}{n}\sum_{i=1}^{n}\left(y_{i}-g_{i}^{\mathrm{stat}}\right)^{2}.$$

This proves Proposition [2](https://arxiv.org/html/2605.16319#Thmproposition2). $\square$
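A small numerical check can make Propositions 1 and 2 concrete. The sketch below replaces the transformer residual with a one-parameter stand-in $r_{\theta}(H_{i})=\theta x_{i}$, where $x_{i}$ is a hypothetical scalar summary of the history; this stand-in, the variable names, and all numbers are assumptions made for illustration. Setting $\theta=0$ recovers the statistical reference, and the least-squares $\theta$ cannot have a larger training loss.

```python
def empirical_loss(y, g_stat, residual):
    """Mean squared error of the combined prediction g_stat + residual."""
    n = len(y)
    return sum((yi - gi - ri) ** 2 for yi, gi, ri in zip(y, g_stat, residual)) / n

# Hypothetical toy values (not ADNI data).
y      = [0.5, 1.0, 0.0, 2.0]    # observed 24-month change
g_stat = [0.4, 0.6, 0.2, 1.2]    # statistical reference predictions
x      = [0.2, 1.0, -0.5, 1.5]   # hypothetical scalar history summary

# Residual targets u_i = y_i - g_i^stat, as in Proposition 1.
u = [yi - gi for yi, gi in zip(y, g_stat)]

# theta = 0 gives a zero residual, so the loss equals the reference loss.
loss_reference = empirical_loss(y, g_stat, [0.0] * len(y))

# Least-squares theta for the stand-in residual r_theta(H_i) = theta * x_i.
theta_star = sum(ui * xi for ui, xi in zip(u, x)) / sum(xi * xi for xi in x)
loss_fitted = empirical_loss(y, g_stat, [theta_star * xi for xi in x])

# Proposition 1: fitting the residual to u gives the same objective value.
loss_on_u = sum((ui - theta_star * xi) ** 2 for ui, xi in zip(u, x)) / len(u)
```

The inequality concerns the training objective only; it says nothing about held-out performance, which is why the paper evaluates on held-out participants.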
### A.3 Proof of Proposition [3](https://arxiv.org/html/2605.16319#Thmproposition3)
Fix a layer $\ell$, a head $h$, a query token $a$, and a key token $b$. Holding the content term fixed, write

$$s_{ab}^{(\ell,h)}(d)=c_{ab}^{(\ell,h)}-\lambda_{\ell,h}d,\qquad d\geq 0.$$

By Assumption [3](https://arxiv.org/html/2605.16319#Thmassumption3),

$$\lambda_{\ell,h}\geq 0.$$

For any $d_{2}>d_{1}\geq 0$,

$$s_{ab}^{(\ell,h)}(d_{2})-s_{ab}^{(\ell,h)}(d_{1})=-\lambda_{\ell,h}(d_{2}-d_{1}).$$

Because $d_{2}-d_{1}>0$ and $\lambda_{\ell,h}\geq 0$,

$$s_{ab}^{(\ell,h)}(d_{2})-s_{ab}^{(\ell,h)}(d_{1})\leq 0.$$

Hence $s_{ab}^{(\ell,h)}(d)$ is nonincreasing in $d$. If $\lambda_{\ell,h}>0$, then

$$-\lambda_{\ell,h}(d_{2}-d_{1})<0$$

for every $d_{2}>d_{1}$, so the score is strictly decreasing. This proves Proposition [3](https://arxiv.org/html/2605.16319#Thmproposition3). $\square$
### A.4 Proof of Proposition [4](https://arxiv.org/html/2605.16319#Thmproposition4)
Fix a query token, a head, and all key scores except the score of key token $b$. Let

$$d=|\tau_{a}-\tau_{b}|$$

and write the score of token $b$ as

$$s_{b}(d)=c_{b}-\lambda d,$$

where $c_{b}$ is fixed and $\lambda=\lambda_{\ell,h}>0$. Let

$$A=\sum_{c\neq b}\exp(s_{c}),$$

where each $s_{c}$ for $c\neq b$ is fixed. Define

$$z(d)=\exp(c_{b}-\lambda d).$$

The softmax weight of token $b$ is then

$$\alpha_{b}(d)=\frac{z(d)}{A+z(d)}.$$

Since

$$z^{\prime}(d)=-\lambda\exp(c_{b}-\lambda d)=-\lambda z(d),$$

differentiating $\alpha_{b}(d)$ gives

$$\alpha_{b}^{\prime}(d)=\frac{z^{\prime}(d)(A+z(d))-z(d)z^{\prime}(d)}{(A+z(d))^{2}}.$$

The terms involving $z(d)z^{\prime}(d)$ cancel, so

$$\alpha_{b}^{\prime}(d)=\frac{Az^{\prime}(d)}{(A+z(d))^{2}}.$$

Substituting $z^{\prime}(d)=-\lambda z(d)$ gives

$$\alpha_{b}^{\prime}(d)=-\lambda\frac{Az(d)}{(A+z(d))^{2}}.$$

Because $\lambda>0$, $A\geq 0$, and $z(d)>0$,

$$\alpha_{b}^{\prime}(d)\leq 0.$$

Thus the attention weight is nonincreasing in $d$. If token $b$ competes with at least one other key token, then $A>0$, and therefore

$$\alpha_{b}^{\prime}(d)<0.$$

The decrease is then strict. This proves Proposition [4](https://arxiv.org/html/2605.16319#Thmproposition4). $\square$
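The attenuation in Propositions 3 and 4 can be observed numerically. The sketch below is a simplified single-query, single-head illustration with scalar content scores; the function name `gap_aware_weights`, the content scores, the penalty $\lambda=0.5$, and the time values are all assumptions and do not reproduce the trained model.

```python
import math

def gap_aware_weights(content_scores, query_time, key_times, lam):
    """Softmax over scores c_b - lam * |tau_a - tau_b| for one query token."""
    scores = [c - lam * abs(query_time - t) for c, t in zip(content_scores, key_times)]
    m = max(scores)                              # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

content = [0.3, 0.3, 0.3]    # hypothetical content scores, held fixed
lam = 0.5                    # hypothetical nonnegative gap penalty
query_time = 24.0            # hypothetical query time stamp

# Move only key token b (index 0) further from the query; the other keys stay fixed,
# which matches the setting of Proposition 4.
weights_b = []
for gap in (0.0, 6.0, 12.0, 24.0):
    key_times = [query_time - gap, 18.0, 12.0]
    weights_b.append(gap_aware_weights(content, query_time, key_times, lam)[0])
```

With `lam = 0.0` the penalty vanishes and the three equal content scores give uniform weights, which is the case the nonincreasing (rather than strictly decreasing) statement covers.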
### A\.5Proof of Proposition[5](https://arxiv.org/html/2605.16319#Thmproposition5)
Let $H,H'\in\mathcal{H}$ be two admissible histories. By Assumption [4](https://arxiv.org/html/2605.16319#Thmassumption4), each history has at most $m_{\max}<\infty$ observed tokens, and observed times and values lie in bounded sets after preprocessing. Variable identities are equipped with the discrete metric. Therefore each admissible history can be represented, after padding or another fixed-length embedding convention, in a bounded finite-dimensional domain.
By Assumption [4](https://arxiv.org/html/2605.16319#Thmassumption4), the tokenization map is Lipschitz on $\mathcal{H}$. Hence there exists a finite constant $L_0$ such that

$$\|Z^{(0)}(H)-Z^{(0)}(H')\|\leq L_0\,d_{\mathcal{H}}(H,H').$$
For a fixed layer $\ell$ and head $h$, the query, key, and value maps are linear:

$$Q=ZW_Q^{(\ell,h)},\qquad K=ZW_K^{(\ell,h)},\qquad V=ZW_V^{(\ell,h)}.$$

The corresponding weight matrices have finite operator norms by Assumption [4](https://arxiv.org/html/2605.16319#Thmassumption4); hence these maps are Lipschitz on the admissible domain. The content score map

$$(Q,K)\mapsto QK^{\top}/\sqrt{d_h}$$

is bilinear and is Lipschitz on bounded domains. The temporal penalty

$$(\tau_a,\tau_b)\mapsto\lambda_{\ell,h}|\tau_a-\tau_b|$$

is Lipschitz in the time coordinates for finite $\lambda_{\ell,h}$. Therefore the complete attention score map is Lipschitz on the admissible domain.
The row-wise softmax map is Lipschitz on any fixed finite-dimensional bounded score set. For a score vector $s\in\mathbb{R}^m$, its Jacobian has entries

$$\frac{\partial\alpha_j}{\partial s_k}=\alpha_j\left(\mathbf{1}\{j=k\}-\alpha_k\right).$$

Since $0\leq\alpha_j\leq 1$, the Jacobian entries are bounded. Because $m\leq m_{\max}$, the corresponding operator norm is uniformly bounded over admissible histories. Thus the softmax map is Lipschitz on the score sets generated by the model.
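The Jacobian formula and its boundedness can be illustrated with a short numerical check. This is a sketch under assumptions, not code from the paper: the helper names are hypothetical. It builds the matrix with entries $\alpha_j(\mathbf{1}\{j=k\}-\alpha_k)$ and reports the maximum absolute row sum. For row $j$ that sum equals $2\alpha_j(1-\alpha_j)$, which is at most $1/2$ for any score vector.

```python
import math


def softmax(scores):
    """Numerically stable softmax of a list of scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_jacobian(scores):
    """Matrix with entries alpha_j * (1{j == k} - alpha_k)."""
    a = softmax(scores)
    n = len(a)
    return [
        [a[j] * ((1.0 if j == k else 0.0) - a[k]) for k in range(n)]
        for j in range(n)
    ]


def max_abs_row_sum(matrix):
    """Induced infinity-norm: the largest absolute row sum."""
    return max(sum(abs(x) for x in row) for row in matrix)


# Hypothetical score vector, used only for illustration.
jac = softmax_jacobian([2.0, -1.0, 0.5, 0.0])
print("max abs row sum:", max_abs_row_sum(jac))
```

Each row of the Jacobian sums to zero because the softmax weights always sum to one, so a perturbation of the scores only redistributes attention mass.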
The attention head output is

$$O^{(\ell,h)}=\alpha^{(\ell,h)}V^{(\ell,h)}.$$

On the bounded admissible domain, this output is a product of bounded Lipschitz maps and is therefore Lipschitz. Concatenation over finitely many heads and multiplication by $W_O^{(\ell)}$ preserve Lipschitz continuity. Residual addition and the feed-forward update are Lipschitz by Assumption [4](https://arxiv.org/html/2605.16319#Thmassumption4). Hence each transformer layer defines a Lipschitz map

$$Z^{(\ell-1)}\mapsto Z^{(\ell)}.$$

Because the encoder has finitely many layers, the full encoder map

$$Z^{(0)}\mapsto Z^{(L)}$$

is Lipschitz with some finite constant $L_{\mathrm{enc}}$.
The pooling score

$$z\mapsto w_p^{\top}\tanh(W_pz+b_p)$$

is a composition of Lipschitz maps and is therefore Lipschitz. The pooling softmax is Lipschitz by the bounded-Jacobian argument above, again using the fact that the number of tokens is bounded by $m_{\max}$. The weighted sum

$$h=\sum_j\pi_jz_j$$

is Lipschitz on the bounded admissible domain because both the weights $\pi_j$ and the encoded tokens $z_j$ are bounded Lipschitz functions of the input history. Finally, the residual head

$$h\mapsto w_r^{\top}h+b_r$$

is affine and hence Lipschitz.
Combining the Lipschitz constants from tokenization, the encoder, pooling, and residual head, there exists a finite constant

$$L_\theta=L_{\mathrm{head}}L_{\mathrm{pool}}L_{\mathrm{enc}}L_0$$

such that

$$|r_\theta(H)-r_\theta(H')|\leq L_\theta\,d_{\mathcal{H}}(H,H').$$

This proves Proposition [5](https://arxiv.org/html/2605.16319#Thmproposition5). $\square$
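The final step uses the fact that a composition of Lipschitz maps is Lipschitz with constant at most the product of the individual constants. The toy sketch below illustrates that product bound on scalar stand-ins. It is not the paper's model: every function, constant, and parameter value here is a hypothetical placeholder chosen so that each stage has a known Lipschitz constant.

```python
import math
import random


def tokenization(x):
    """Stand-in tokenization, Lipschitz constant L_0 = 2."""
    return 2.0 * x


def encoder(z):
    """Stand-in encoder; tanh is 1-Lipschitz, so L_enc = 1."""
    return math.tanh(z)


def pooling(z):
    """Stand-in pooling, Lipschitz constant L_pool = 0.5."""
    return 0.5 * z


def residual_head(h, w_r=-3.0, b_r=0.1):
    """Affine head w_r * h + b_r, Lipschitz constant L_head = |w_r| = 3."""
    return w_r * h + b_r


# Product bound L_theta = L_head * L_pool * L_enc * L_0 for the stand-ins.
L_THETA = 3.0 * 0.5 * 1.0 * 2.0


def r_theta(x):
    """Composed toy residual map."""
    return residual_head(pooling(encoder(tokenization(x))))


def max_observed_ratio(n_pairs=1000, seed=0):
    """Largest |r(x) - r(y)| / |x - y| over random pairs in [-2, 2]."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(n_pairs):
        x, y = rng.uniform(-2.0, 2.0), rng.uniform(-2.0, 2.0)
        if x != y:
            worst = max(worst, abs(r_theta(x) - r_theta(y)) / abs(x - y))
    return worst


print("observed ratio:", max_observed_ratio(), "bound:", L_THETA)
```

The observed ratio stays below the product bound. In general the product is only an upper bound on the true Lipschitz constant of the composed map.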