BrainG3N: A Dual-Purpose Tokenizer for Controllable 3D Brain MRI Generation

arXiv cs.AI Papers

Summary

Introduces BrainG3N, a dual-purpose tokenizer for 3D brain MRI latent diffusion using a frozen masked autoencoder encoder for clinically informative embeddings and a CNN decoder for reconstruction, achieving state-of-the-art performance on a 23-task benchmark and enabling controllable generation and longitudinal forecasting.

arXiv:2606.19651v1 Announce Type: new Abstract: Three-dimensional (3D) brain MRI is central to clinical neurology and neuro-oncology, where generative models could augment under-represented cohorts, simulate disease trajectories, and support privacy-preserving data sharing. Latent diffusion has been the go-to solution for modeling imaging data, but it places two competing demands on the tokenizer: encoder embeddings must retain the clinical information that downstream tasks act on, and the decoder must reconstruct anatomically faithful volumes. Existing reconstruction-driven tokenizers achieve the second at the expense of the first. To address this, we introduce a fully volumetric masked-autoencoder (MAE) based tokenizer for 3D brain MRI latent diffusion, decoupling encoder and decoder: a frozen 3D MAE encoder produces clinically informative embeddings, while a dedicated CNN decoder reconstructs voxels from a linear projection of those embeddings. We pretrain the encoder on 35,309 volumes from 18 public cohorts spanning four modalities, ten disease categories, and 200+ acquisition sites, and demonstrate its dual utility in two settings. First, on a 23-task linear-probing benchmark, the encoder outperforms or matches SOTA models (i.e., BrainIAC, BrainSegFounder, and MedicalNet) on 21 of 23 tasks. Second, a conditional diffusion transformer (DiT) trained on these clinically informative embeddings supports both conditional generation across six variables and patient-specific longitudinal forecasting. Together these results establish a single 3D brain-MRI embedding space capable of both downstream clinical tasks and controllable generation.

Cached at: 06/20/26, 02:31 PM

# BrainG3N: A Dual-Purpose Tokenizer for Controllable 3D Brain MRI Generation
Source: [https://arxiv.org/html/2606.19651](https://arxiv.org/html/2606.19651)
Max Van Puyvelde∗1,2 \(maxvpuyv@stanford\.edu\), H\. Ibrahim Gulluk∗3 \(gulluk@stanford\.edu\), Wim Van Criekinge2 \(wim\.vancriekinge@ugent\.be\), Olivier Gevaert1 \(ogevaert@stanford\.edu\)

1 Department of Biomedical Data Science, Stanford University School of Medicine; 2 Department of Mathematical Modelling, Statistics & Bioinformatics, Ghent University; 3 Department of Electrical Engineering, Stanford University

###### Abstract

Three\-dimensional \(3D\) brain MRI is central to clinical neurology and neuro\-oncology, where generative models could augment under\-represented cohorts, simulate disease trajectories, and support privacy\-preserving data sharing\. Latent diffusion has been the go\-to solution for modeling imaging data, but it places two competing demands on the tokenizer: encoder embeddings must retain the clinical information that downstream tasks act on, and the decoder must reconstruct anatomically faithful volumes\. Existing reconstruction\-driven tokenizers achieve the second at the expense of the first\. To address this, we introduce a fully volumetric masked\-autoencoder \(MAE\) based tokenizer for 3D brain MRI latent diffusion, decoupling encoder and decoder: a frozen 3D MAE encoder produces clinically informative embeddings, while a dedicated CNN decoder reconstructs voxels from a linear projection of those embeddings\. We pretrain the encoder on 35,309 volumes from 18 public cohorts spanning four modalities, ten disease categories, and 200\+ acquisition sites, and demonstrate its dual utility in two settings\. First, on a 23\-task linear\-probing benchmark, the encoder outperforms or matches SOTA models \(i\.e\., BrainIAC, BrainSegFounder, and MedicalNet\) on 21 of 23 tasks\. Second, a conditional diffusion transformer \(DiT\) trained on these clinically informative embeddings supports both conditional generation across six variables and patient\-specific longitudinal forecasting\. Together these results establish a single 3D brain\-MRI embedding space capable of both downstream clinical tasks and controllable generation\.

∗Equal contribution\.

## 1 Introduction

In neurology and neuro\-oncology, brain MRI informs clinical decisions ranging from tumor diagnosis and treatment planning to staging and monitoring of neurodegenerative diseases such as Alzheimer’s and Parkinson’s, and supports population\-scale research on brain development and aging\. Generative models on 3D brain MRI could extend this practice in several directions: augmenting under\-represented patient cohorts, producing patient\-specific digital twins \[[35](https://arxiv.org/html/2606.19651#bib.bib58)\] that simulate counterfactual disease trajectories, and enabling privacy\-preserving cohort sharing across institutions where regulatory and logistical barriers currently prevent access to real imaging\. Realizing these applications requires generative models at full 3D resolution, yet much of the field still operates on 2D slices\. Because direct voxel\-space generation is computationally infeasible at that scale, generative pipelines have broadly converged on latent diffusion \[[34](https://arxiv.org/html/2606.19651#bib.bib55)\]: an encoder–decoder tokenizer first compresses volumes into a low\-dimensional latent space, and a diffusion model is then trained on those embeddings\. Conditional generation in this setting places two distinct demands on the tokenizer\. First, the encoder embeddings must carry the clinical information needed for both conditional generation and downstream clinical tasks\. Second, the decoder must reconstruct anatomically faithful voxel volumes\.
Existing 3D radiographic latent diffusion pipelines \[[31](https://arxiv.org/html/2606.19651#bib.bib14),[18](https://arxiv.org/html/2606.19651#bib.bib8),[44](https://arxiv.org/html/2606.19651#bib.bib15),[41](https://arxiv.org/html/2606.19651#bib.bib16),[17](https://arxiv.org/html/2606.19651#bib.bib17)\] train a single encoder–decoder against a reconstruction objective, which biases the encoder toward voxel fidelity at the expense of clinical content; the resulting latent space is typically evaluated only on voxel\-level reconstruction metrics\.

We propose a dual\-purpose self\-supervised approach for 3D brain MRI in which a frozen masked\-autoencoder \(MAE\) encoder produces an embedding space that serves two roles: a clinical representation for downstream tasks, and the feature space of a conditional diffusion transformer \(DiT\)\. To use this embedding space within a latent\-diffusion pipeline, the encoder is paired with a CNN decoder via a linear projection of the embeddings \(§[2](https://arxiv.org/html/2606.19651#S2)\)\. Using an MAE as a tokenizer for downstream diffusion has recent 2D precedent \[[9](https://arxiv.org/html/2606.19651#bib.bib4)\]; whether the same approach transfers to 3D radiographic data, where corpora are orders of magnitude smaller, the input dimensionality is higher, and the relevant axes are subvisual phenotypes rather than visible object categories, is the question we address here\.

Our work makes two main contributions\. First, the frozen MAE encoder, pretrained on 35,309 brain MRI volumes from 18 cohorts, produces clinically informative embeddings: on a 23\-task linear\-probing benchmark \(§[3\.2](https://arxiv.org/html/2606.19651#S3.SS2)\) the frozen encoder outperforms or matches BrainIAC \[[36](https://arxiv.org/html/2606.19651#bib.bib13)\], BrainSegFounder \[[11](https://arxiv.org/html/2606.19651#bib.bib12)\], and MedicalNet \[[10](https://arxiv.org/html/2606.19651#bib.bib39)\] on 21 of 23 tasks; for example, the encoder reaches AUC $0.937$ on isocitrate dehydrogenase 1 \(IDH1\) mutation status prediction, a key genomic biomarker for glioma diagnosis and treatment stratification; AUC $0.921$ on tumor grade classification; a mean absolute error of $4.43$ years on brain age regression; and AUC $0.967$ on sex prediction\. Second, a conditional diffusion transformer \(DiT\) \[[29](https://arxiv.org/html/2606.19651#bib.bib2),[24](https://arxiv.org/html/2606.19651#bib.bib5)\] trained on the same embeddings supports controllable generation across six variables \(§[3\.3](https://arxiv.org/html/2606.19651#S3.SS3)\) and patient\-specific longitudinal forecasting \(§[3\.4](https://arxiv.org/html/2606.19651#S3.SS4)\); generated embeddings are mapped back to high\-fidelity 3D voxel volumes by the CNN decoder\. In both cases, the requested attributes of generated samples are correctly recovered by classifiers trained on real data: for example, Pearson $r=0.93$ on cross\-sectional age conditioning and Pearson $r=0.72$ on longitudinal age progression\. This transfer test connects the embedding’s encoding of clinical phenotypes to its generative controllability\.

## 2 Method

#### Pretraining corpus\.

Our pretraining corpus comprises 35,309 brain MRI volumes from 17,399 unique subjects across 18 public cohorts and 200\+ acquisition sites\. Volumes span four modalities \(T1, T2, fluid\-attenuated inversion recovery \[FLAIR\], and T1 contrast\-enhanced \[T1c\]\) and ten clinical categories covering healthy controls, neurodegenerative disease \(Alzheimer’s, Parkinson’s\), neurodevelopmental conditions, psychiatric disorders, and brain tumors; subject ages range from 5 to 98 years\. A single harmonized preprocessing pipeline registers every volume to the SRI24 atlas \[[33](https://arxiv.org/html/2606.19651#bib.bib20)\] via ANTs affine registration \[[4](https://arxiv.org/html/2606.19651#bib.bib22)\], performs skull stripping with HD\-BET \[[22](https://arxiv.org/html/2606.19651#bib.bib21)\], and corrects intensity inhomogeneity with N4 bias\-field correction \[[38](https://arxiv.org/html/2606.19651#bib.bib23)\], producing $160\times192\times160$ volumes at $1$ mm isotropic spacing\. Cohort\-specific entry points handle raw inputs at different preprocessing stages \(already\-stripped, defaced, native DICOM, etc\.\) without double\-processing\. The full dataset card and per\-cohort pipelines are in Appendices [A](https://arxiv.org/html/2606.19651#A1) and [B](https://arxiv.org/html/2606.19651#A2)\.

#### MAE encoder\.

The encoder is a 3D masked autoencoder \[[19](https://arxiv.org/html/2606.19651#bib.bib1)\] built on a 12\-layer vision transformer \[[15](https://arxiv.org/html/2606.19651#bib.bib24)\] with hidden dimension 1152 and $16^3$ patches, producing 1200 tokens per volume\. During pretraining, 70% of patches are randomly masked: the encoder processes only the 360 visible patches, and a separate transformer decoder reconstructs the 840 masked patches from the encoder output under a per\-patch mean\-squared\-error loss\. Reconstructing 840 missing patches from 360 visible ones requires modeling long\-range anatomical context, which forces the encoder to capture global structural relationships rather than local voxel statistics\. This property is what we exploit downstream, both for clinical prediction and as the input space for the diffusion tokenizer\.
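The token counts quoted above follow from the volume and patch sizes. The NumPy sketch below is an illustration only, not the authors' code: it splits a random stand-in volume into $16^3$ patches and draws a 70% random mask. The function names `patchify` and `random_mask` are hypothetical, and the paper does not specify how its mask is sampled beyond "randomly".

```python
import numpy as np

VOLUME_SHAPE = (160, 192, 160)  # voxels at 1 mm isotropic spacing (from the text)
PATCH = 16                      # cubic patch edge (from the text)
MASK_RATIO = 0.70               # fraction of patches hidden during pretraining


def patchify(volume: np.ndarray, patch: int = PATCH) -> np.ndarray:
    """Split a (D, H, W) volume into flattened, non-overlapping cubic patches."""
    d, h, w = (s // patch for s in volume.shape)
    blocks = volume.reshape(d, patch, h, patch, w, patch)
    blocks = blocks.transpose(0, 2, 4, 1, 3, 5)  # grid axes first, voxel axes last
    return blocks.reshape(d * h * w, patch ** 3)


def random_mask(num_tokens: int, ratio: float, rng: np.random.Generator):
    """Return sorted index arrays (visible, masked) for one volume."""
    num_masked = int(round(num_tokens * ratio))
    perm = rng.permutation(num_tokens)
    return np.sort(perm[num_masked:]), np.sort(perm[:num_masked])


rng = np.random.default_rng(0)
volume = rng.standard_normal(VOLUME_SHAPE).astype(np.float32)  # stand-in for an MRI
tokens = patchify(volume)                                       # 10 x 12 x 10 grid
visible_idx, masked_idx = random_mask(len(tokens), MASK_RATIO, rng)
print(tokens.shape, len(visible_idx), len(masked_idx))
```

Only the visible patches would be passed to the encoder; the masked indices identify the patches a reconstruction decoder has to predict.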

\(a\) Two\-phase tokenizer

![Refer to caption](https://arxiv.org/html/2606.19651v1/figures/fig_tokenizer.png)


![Refer to caption](https://arxiv.org/html/2606.19651v1/figures/fig_dit.png)\(b\) Conditional flow\-matching DiT


Figure 1: Architecture\. \(a\) Phase 1 pretrains a 3D MAE encoder on 70% masked\-patch reconstruction; Phase 2 freezes the encoder and trains a linear projection $P\in\mathbb{R}^{1152\times32}$ \+ 3D CNN decoder under a voxel $\ell_1$ loss\. The same frozen feature space $z'=zP$ is consumed by the probe and produced by the DiT\. \(b\) Noised tokens $\mathbf{x}_t$ pass through a 12\-block DiT stack with adaLN\-Zero modulation by a conditioning vector $\mathbf{c}$, producing the 32\-channel velocity $\hat{\mathbf{v}}$\. Categorical conditions use embedding lookups with a $(K+1)^{\text{st}}$ null slot for CFG dropout; age uses a sinusoidal \+ MLP head\.

#### Two\-phase tokenizer \(Fig\. [1](https://arxiv.org/html/2606.19651#S2.F1)\(a\)\)\.

The tokenizer couples the frozen MAE encoder to a 3D CNN decoder through a linear projection\. The pretrained encoder is frozen; a linear projection $P\in\mathbb{R}^{1152\times d'}$ compresses each of the 1200 tokens from $1152$ to $d'=32$ channels; and a 3D CNN decoder $\phi$ reconstructs voxels from the projected tokens under an $\ell_1$ loss:

$$z=\mathrm{Enc}(x)\;\text{[frozen]},\qquad z'=zP,\qquad \hat{x}=\phi(z'). \tag{1}$$

Jointly training encoder and decoder against the same reconstruction objective, as in CNN\-VAE tokenizers, drifts the encoder toward local intensity fidelity and degrades its clinical content\. The bottleneck $d'=32$ is a deliberate trade\-off: a smaller bottleneck reduces the diffusion model’s input dimensionality, while still preserving most of the encoder’s clinical content \(sweep in Table [5](https://arxiv.org/html/2606.19651#A4.T5)\)\.
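To make the shapes in Eq. (1) concrete, the sketch below applies a projection of the stated size to a random stand-in for the encoder output and counts the elements the diffusion model would see per volume. The matrix `P` is random here purely for illustration (in the paper it is learned), and the CNN decoder $\phi$ is not modeled.

```python
import numpy as np

NUM_TOKENS, ENC_DIM, BOTTLENECK = 1200, 1152, 32  # values stated in the text

rng = np.random.default_rng(0)
z = rng.standard_normal((NUM_TOKENS, ENC_DIM))  # stand-in for the frozen Enc(x)
# Assumption: random projection for illustration; the paper trains P in Phase 2.
P = rng.standard_normal((ENC_DIM, BOTTLENECK)) / np.sqrt(ENC_DIM)
z_proj = z @ P                                  # z' = zP, shape (1200, 32)

raw_elems = z.size          # elements per volume before projection
proj_elems = z_proj.size    # elements per volume after projection
ratio = proj_elems / raw_elems
print(z_proj.shape, raw_elems, proj_elems, f"{ratio:.2%}")
```

The per-volume element count drops from roughly 1.4M to roughly 38K, which is the input size the DiT is trained on.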

#### Linear probing\.

We evaluate the clinical content of the frozen encoder embeddings via linear probing, the standard protocol in self\-supervised representation learning: a single linear classifier is fit per task on top of the frozen embeddings, with no fine\-tuning of the encoder\. The input to each probe is the encoder’s 1200\-token output mean\-pooled to a single $d=1152$ vector per volume\. We use logistic regression for classification tasks and ridge regression for regression tasks, evaluated under 5\-fold subject\-grouped stratified cross\-validation so that no subject appears in both the training and test splits of any fold\. The full 23\-task panel and per\-modality breakdown are in Appendix [E](https://arxiv.org/html/2606.19651#A5)\.
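A minimal sketch of the regression side of this protocol, under stated assumptions: features are mean-pooled tokens, folds are assigned per subject so that no subject straddles a train/test split, and the probe is closed-form ridge regression. The toy data, the `alpha` value, and all function names are hypothetical; stratification and the logistic-regression probes used for classification are omitted for brevity.

```python
import numpy as np


def mean_pool(tokens: np.ndarray) -> np.ndarray:
    """(num_volumes, num_tokens, dim) -> (num_volumes, dim)."""
    return tokens.mean(axis=1)


def grouped_folds(subject_ids: np.ndarray, n_folds: int = 5, seed: int = 0) -> np.ndarray:
    """Assign a fold to each subject, then broadcast it to that subject's volumes."""
    rng = np.random.default_rng(seed)
    subjects = np.unique(subject_ids)
    rng.shuffle(subjects)
    fold_of = {s: i % n_folds for i, s in enumerate(subjects)}
    return np.array([fold_of[s] for s in subject_ids])


def ridge_fit(X: np.ndarray, y: np.ndarray, alpha: float = 1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ (y - y_mean))
    return w, y_mean - x_mean @ w


def probe_mae(features, targets, subject_ids, n_folds: int = 5):
    """Mean and std of the per-fold mean absolute error."""
    folds = grouped_folds(subject_ids, n_folds)
    errors = []
    for k in range(n_folds):
        test = folds == k
        w, b = ridge_fit(features[~test], targets[~test])
        errors.append(np.abs(features[test] @ w + b - targets[test]).mean())
    return float(np.mean(errors)), float(np.std(errors))


# Toy data (hypothetical): 100 subjects with 2 volumes each, small token grid.
rng = np.random.default_rng(0)
subject_ids = np.repeat(np.arange(100), 2)
features = mean_pool(rng.standard_normal((200, 12, 16)))
targets = features @ rng.standard_normal(16) + 0.01 * rng.standard_normal(200)
mae_mean, mae_std = probe_mae(features, targets, subject_ids)
print(f"MAE {mae_mean:.3f} +/- {mae_std:.3f}")
```

Grouping by subject matters because the corpus has about twice as many volumes as subjects; splitting by volume would let repeat scans of one person leak across the split.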

#### Conditional latent diffusion \(Fig\.[1](https://arxiv.org/html/2606.19651#S2.F1)\(b\)\)\.

A flow\-matching diffusion transformer \(DiT\) \[[29](https://arxiv.org/html/2606.19651#bib.bib2),[24](https://arxiv.org/html/2606.19651#bib.bib5)\] is trained on the projected token sequences $z'\in\mathbb{R}^{1200\times32}$\. The DiT has 12 blocks, hidden dimension $1152$, and $18$ attention heads, and is trained with the flow\-matching objective

$$x_t=(1-t)\,x_1+t\,\varepsilon,\qquad \mathcal{L}=\lVert v_\theta(x_t,t,\mathbf{c})-(\varepsilon-x_1)\rVert_2^2, \tag{2}$$

where $x_1$ is a real latent, $\varepsilon\sim\mathcal{N}(0,I)$, and the conditioning vector $\mathbf{c}$ is the sum of six condition embeddings routed through adaLN\-Zero modulation \[[29](https://arxiv.org/html/2606.19651#bib.bib2)\], $\mathrm{modulate}(h;\mathbf{c})=h\,(1+\mathrm{scale}(\mathbf{c}))+\mathrm{shift}(\mathbf{c})$\. The six conditions are disease \(8\-way\), sex, modality \(4\-way, never dropped\), acquisition site \(19\-way\), age \(continuous\), and IDH1 mutation status \(binary\)\. Classifier\-free guidance \(CFG\) \[[20](https://arxiv.org/html/2606.19651#bib.bib26),[12](https://arxiv.org/html/2606.19651#bib.bib25)\] is implemented by independently replacing each condition with a null embedding at probability $p=0.1$ during training; at sampling time the velocity is extrapolated per condition as $v=v_{\text{uncond}}+s\,(v_{\text{cond}}-v_{\text{uncond}})$ with $s=2.0$ \(CFG\-scale sensitivity in Appendix [G](https://arxiv.org/html/2606.19651#A7)\)\. Modality is never dropped because it is always specified at inference, so a null\-modality branch would only waste capacity\. The longitudinal variant of the DiT \(§[3\.4](https://arxiv.org/html/2606.19651#S3.SS4)\) reuses the same frozen tokenizer and adaLN\-Zero conditioning, replacing the noise\-to\-data interpolant with a baseline\-to\-followup latent bridge for patient\-specific forecasting at requested time horizons\.
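The pieces of Eq. (2) can be written out directly. The sketch below is a NumPy illustration of the interpolant, the velocity target, the modulation formula, the guidance extrapolation, and condition dropout for one categorical variable; it contains no network, and the function names and the 0-based null index are assumptions made for the example.

```python
import numpy as np


def interpolate(x1, eps, t):
    """x_t = (1 - t) x_1 + t eps: the real latent at t=0, pure noise at t=1."""
    return (1.0 - t) * x1 + t * eps


def velocity_target(x1, eps):
    """Time derivative of the interpolant, the regression target in Eq. (2)."""
    return eps - x1


def flow_matching_loss(v_pred, x1, eps):
    """Squared L2 distance between predicted and target velocity."""
    return float(np.sum((v_pred - velocity_target(x1, eps)) ** 2))


def modulate(h, scale, shift):
    """adaLN-style modulation: h (1 + scale) + shift."""
    return h * (1.0 + scale) + shift


def cfg_velocity(v_cond, v_uncond, s=2.0):
    """Classifier-free guidance: v_uncond + s (v_cond - v_uncond)."""
    return v_uncond + s * (v_cond - v_uncond)


def drop_label(label, num_classes, p, rng):
    """With probability p, swap a categorical label for the extra null slot.

    Assumption: classes are indexed 0..K-1 and the null slot is index K.
    """
    return num_classes if rng.random() < p else label


rng = np.random.default_rng(0)
x1 = rng.standard_normal((1200, 32))   # stand-in for a real projected latent z'
eps = rng.standard_normal((1200, 32))
x_half = interpolate(x1, eps, 0.5)
loss_at_optimum = flow_matching_loss(velocity_target(x1, eps), x1, eps)
dropped = [drop_label(3, 8, 0.1, rng) for _ in range(10000)]
print(loss_at_optimum, np.mean(np.array(dropped) == 8))
```

With `s = 1.0` the guidance formula returns the conditional velocity unchanged, and with zero `scale` and `shift` the modulation is the identity.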

## 3 Experiments

Section[3\.1](https://arxiv.org/html/2606.19651#S3.SS1)validates the architectural choice on a small\-scale benchmark; Sections[3\.2](https://arxiv.org/html/2606.19651#S3.SS2)–[3\.4](https://arxiv.org/html/2606.19651#S3.SS4)evaluate the resulting embeddings in three settings: cross\-sectional probing, conditional generation, and longitudinal forecasting\.

### 3\.1 Architectural validation at matched scale

Before scaling to the full corpus, we validated two architectural choices on a 1100\-volume tumor cohort \(UCSF\-PDGM \+ UPENN\-GBM\): the projection bottleneck $d'$, and the choice of MAE–CNN versus CNN\-VAE tokenizer\.

#### Bottleneck dimension $d'$\.

We sweep the projection dimension over $d'\in\{32,128,512\}$ and train the Phase\-2 decoder at each value\. Reconstruction quality increases monotonically with $d'$, with diminishing returns above $d'=128$\. To assess whether the projection preserves the clinical information that downstream tasks depend on, we additionally probe the projected features $z'$ at each $d'$ with linear classifiers\. The IDH1 probing AUC at $d'=32$ is $0.861$, $0.022$ below the AUC obtained by probing the raw 1152\-dim encoder embeddings \($0.883$\) at less than 3% of the total dimensionality \(38K vs 1\.4M\); the corresponding gap on WHO tumor grade is $0.055$\. We therefore adopt $d'=32$ for the remainder of the paper: the smaller bottleneck reduces the diffusion model’s input dimensionality and training cost, while the linear projection preserves enough of the encoder’s clinical content in $z'$ to support downstream conditioning \(full sweep in Appendix [D](https://arxiv.org/html/2606.19651#A4), Table [5](https://arxiv.org/html/2606.19651#A4.T5)\)\.

#### Tokenizer architecture\.

We compare the frozen MAE–CNN tokenizer against an AutoencoderKL \(AKL\) \[[31](https://arxiv.org/html/2606.19651#bib.bib14)\], the canonical 3D medical generative tokenizer, trained from scratch on the same cohort at matched compute\. At the matched\-dimensionality comparison \($d'=512$ vs AKL, both 614K total elements\), MAE outperforms AKL on linear probing by $+0.064$ AUC on IDH1 and $+0.046$ AUC on WHO tumor grade\. The same pattern holds at smaller MAE bottlenecks: at our chosen $d'=32$ \(38K vs AKL’s 614K\), MAE outperforms AKL on IDH1 \($+0.060$ AUC\) and matches it on tumor grade\. MAE is also more compute\-efficient at training: the 70% masking ratio means only $360$ of $1200$ tokens enter the encoder per step, supporting larger per\-GPU batches than full\-volume CNN\-VAE training at the same memory budget\. Based on this small\-scale validation, we then pretrained the full MAE–CNN tokenizer on the entire 35,309\-volume corpus; this full\-scale tokenizer is the one used throughout the rest of the paper, with representative reconstructions shown in Figure [2](https://arxiv.org/html/2606.19651#S3.F2)\.

![Refer to caption](https://arxiv.org/html/2606.19651v1/figures/fig_recon.png)

Figure 2: Voxel reconstructions from the frozen\-MAE \+ CNN\-decoder tokenizer at $d'=32$\. Rows: a healthy young adult and an Alzheimer’s case\. Within each view \(axial, coronal, 3D rendered\), left = ground truth, right = reconstruction\. Cortical folding, ventricular geometry, and overall morphology are preserved across age and disease state\.

### 3\.2 Frozen linear probing of clinical variables

Table 1: Head\-to\-head linear probing on the 35,309\-volume corpus\. All four encoders evaluated on identical splits with identical probe code; the modality used per task \(in parentheses\) is fixed across encoders and chosen for clinical relevance\. AUC \($\uparrow$\) for classification, mean absolute error \(MAE, $\downarrow$\) for regression, mean $\pm$ std across 5 subject\-grouped folds\. Shown: 8 above\-floor tasks; full 23\-task panel \(best modality per encoder\) in Appendix [F](https://arxiv.org/html/2606.19651#A6)\. Abbreviations: CDR \(Clinical Dementia Rating\), MMSE \(Mini\-Mental State Examination\)\.

#### Head\-to\-head vs published brain\-MRI foundation models\.

We re\-ran frozen\-feature linear probes for the three published 3D brain\-MRI foundation models with public encoders, BrainIAC \[[36](https://arxiv.org/html/2606.19651#bib.bib13)\], BrainSegFounder \[[11](https://arxiv.org/html/2606.19651#bib.bib12)\], and MedicalNet \[[10](https://arxiv.org/html/2606.19651#bib.bib39)\], on our 35,309\-volume corpus using identical splits and probe code; a more recent fourth, BrainDINO \[[43](https://arxiv.org/html/2606.19651#bib.bib57)\], has not released encoder weights or pretraining code\. On 8 representative clinical and acquisition\-related tasks \(Table [1](https://arxiv.org/html/2606.19651#S3.T1)\), our frozen encoder outperforms all three competitors on 7 of 8 tasks; on the 8th \(modality recovery\) all encoders are essentially at ceiling\. Across the full 23\-task panel \(Appendix [F](https://arxiv.org/html/2606.19651#A6), 15 classification \+ 8 regression\), we outperform or match competitors on 21 of 23 tasks; the two non\-wins, Geriatric Depression Scale \(GDS\) and the third part of the Unified Parkinson’s Disease Rating Scale \(UPDRS\-III\), are near\-floor regression tasks for every encoder\. As a sanity check, our reproduction of BrainIAC’s frozen\-feature brain\-age mean absolute error on T1 is $7.33$ y, close to its published $7.51$ y, confirming the implementation follows BrainIAC’s protocol\. The most recent brain\-MRI foundation model, BrainDINO, is a 2D slice\-based DINOv3\-style model; the authors report brain\-age mean absolute error $5.54$ y and IDH1 AUC $0.901$ at 100% supervision under a lightweight task\-head fine\-tune, vs $4.43$ y and $0.937$ from our frozen linear probes on the corresponding tasks, positioning it below our fully volumetric self\-supervised approach\.

### 3\.3 Probe\-to\-controllability transfer

Table 2: DiT generation evaluation under CFG \($s=2.0$\)\. \(a\) Controllability: real\-data probes applied to generated samples\. *Real*: probe AUC on real held\-out data \(Pearson $r$ for age\), the ceiling\. *Cond*: agreement with the requested class\. *Null*: probe output on CFG\-null samples\. *Prior*: dataset class frequency\. Modality is never dropped during training, hence no null branch\. Disease, sex, and age use T1; IDH1 uses T1c \(glioblastoma \[GBM\] subset\)\. \(b\) 3D\-FID per condition and pooled across 14 arms; per\-arm decomposition in Appendix [H](https://arxiv.org/html/2606.19651#A8)\. Disease classes: HC = healthy control, AD = Alzheimer’s disease, MCI = mild cognitive impairment, PD = Parkinson’s disease\.

\(a\) Cross\-sectional controllability

\(b\) 3D\-FID \($\downarrow$\)

#### Protocol\.

For each DiT condition $c$, we \(i\) train a probe for $c$ on real volumes, \(ii\) draw 64 samples from the DiT under each requested class of $c$ and a matching set under null conditioning, \(iii\) apply the frozen real\-data probe to both sets\. A controllable condition recovers the requested attribute under *cond* and falls back to the class prior \(or dataset mean\) under *null*; the *cond*–*null* gap measures the conditional signal the sampler preserves\.
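The bookkeeping in step (iii) is simple to state in code. The sketch below takes probe predictions for a conditioned set and a null-conditioned set and reports the agreement with the requested class in each, plus their difference. The prediction lists are made-up placeholders, not results from the paper, and the function names are hypothetical.

```python
def agreement(predictions, requested_class):
    """Fraction of probe predictions that equal the requested class."""
    return sum(p == requested_class for p in predictions) / len(predictions)


def cond_null_gap(cond_predictions, null_predictions, requested_class):
    """Return (cond agreement, null agreement, gap) for one requested class."""
    cond = agreement(cond_predictions, requested_class)
    null = agreement(null_predictions, requested_class)
    return cond, null, cond - null


# Hypothetical probe outputs for 64 conditioned and 64 null-conditioned samples.
cond_predictions = ["AD"] * 63 + ["HC"] * 1
null_predictions = ["HC"] * 59 + ["AD"] * 5
cond, null, gap = cond_null_gap(cond_predictions, null_predictions, "AD")
print(f"cond={cond:.3f} null={null:.3f} gap={gap:.3f}")
```

A large gap indicates that the probe's output follows the requested label rather than what the unconditional sampler produces by default.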

#### Results\.

Real\-data probes reliably recover the requested clinical attributes from samples generated under classifier\-free guidance \(Table [2\(a\)](https://arxiv.org/html/2606.19651#S3.T2.st1)\)\. Disease HC\-vs\-AD recovers at 0\.99 agreement under *cond* and collapses to the class prior under *null* \(92% HC vs 91% prior\); control extends to the other three disease contrasts \(MCI vs AD: Cond 0\.90; HC vs MCI: 0\.80; HC vs PD: 0\.82\)\. Under *null*, HC vs PD and MCI vs AD collapse to the majority class as expected, while HC vs MCI shows a residual MCI bias \(0\.67 MCI vs 0\.30 MCI prior\), suggesting the unconditional model preferentially produces disease\-leaning samples on this contrast\. Sex reaches 0\.93 against a 0\.72 *null* baseline; modality, which is never dropped during training, is perfectly recovered \(cond agreement $1.00$ across the four classes\)\. Age tracks the requested sweep at Pearson $r=0.93$, with predicted means $35.4\,/\,54.4\,/\,76.0$ for requested $30\,/\,50\,/\,70$ years and null generations centered at the dataset mean of $55.5$ y\. IDH1 is the hardest axis \(mean cond agreement $0.52$ against an 88% wildtype prior; balanced\-probe variant $0.56$\), reflecting that classifier\-free guidance struggles to steer toward the rare mutant class on the small T1c tumor subset rather than a probe artefact; the null branch collapses fully to wildtype\. Figure [3](https://arxiv.org/html/2606.19651#S3.F3) shows same\-noise counterfactual sweeps: holding the noise seed fixed and varying one conditioning attribute at a time produces the expected visible change \(ventricular enlargement under AD, modality contrast switching, GBM enhancing tumor mass on T1c\) while preserving the overall anatomical layout\.

![Refer to caption](https://arxiv.org/html/2606.19651v1/figures/fig_counterfactual.png)

Figure 3: Same\-noise counterfactual generation under the conditional DiT \(CFG $s=1.5$\)\. Each row fixes the initial latent noise \(seeds 109 and 113\); columns sweep one set of conditioning attributes at a time\. Leftmost column: 3D rendering of the unconditioned baseline\. The remaining seven columns sweep *age* \(HC, 30 vs\. 75 y\), *disease* \(HC vs\. AD at 75 y\), *modality* \(T1, T2, FLAIR\), and *tumor* \(GBM IDH1\-wildtype on T1c\)\. Within a row, identity\-preserving anatomy is largely retained while the swapped attribute drives the visible change\.

#### Generation fidelity\.

Pooled across all 14 conditional arms, the FID between generated samples and the tokenizer’s reconstructions \(*gen\-vs\-recons*\) is $34.4$, and between generated samples and raw real volumes \(*gen\-vs\-real*\) is $107.3$ \(Table [2\(a\)](https://arxiv.org/html/2606.19651#S3.T2.st1); $n_{\text{real}}=2000$ vs $n_{\text{gen}}=1088$\)\. Both are competitive with prior 3D medical generative work \(FIDs $40$–$120$ \[[31](https://arxiv.org/html/2606.19651#bib.bib14)\]\)\. The tokenizer’s own reconstruction floor against the real volumes \(*recons\-vs\-real*\) is $77.4$; since gen\-vs\-recons is tighter than this floor, most of the gen\-vs\-real gap reflects the tokenizer’s compression of high\-frequency scanner\-specific detail rather than generator quality\. A sweep over the CFG scale $s$ \(Appendix [G](https://arxiv.org/html/2606.19651#A7)\) confirms $s=2.0$ sits on a near\-optimal plateau, and a nearest\-neighbor audit in latent space \[[8](https://arxiv.org/html/2606.19651#bib.bib56)\] finds no evidence of training\-set memorization: generated latents sit $1.55\times$ farther from their nearest training latent than training latents do from each other, with zero of $1088$ generated samples falling below the $5^{\text{th}}$\-percentile real\-to\-real neighbor distance \(Appendix [H](https://arxiv.org/html/2606.19651#A8)\)\.
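One way to run a nearest-neighbor audit of this kind is sketched below on toy vectors. It compares each generated latent's distance to its nearest training latent against the training set's own nearest-neighbor distances. The Euclidean metric, the use of medians for the ratio, and the function names are assumptions; the paper's exact procedure is in its Appendix H and is not reproduced here.

```python
import numpy as np


def pairwise_dist(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Euclidean distances between rows of a and rows of b."""
    sq = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.sqrt(np.maximum(sq, 0.0))


def memorization_audit(train: np.ndarray, generated: np.ndarray, percentile: float = 5.0):
    d_tt = pairwise_dist(train, train)
    np.fill_diagonal(d_tt, np.inf)          # ignore each training point's self-match
    nn_train = d_tt.min(axis=1)             # real-to-real nearest-neighbor distances
    nn_gen = pairwise_dist(generated, train).min(axis=1)
    threshold = np.percentile(nn_train, percentile)
    return {
        "distance_ratio": float(np.median(nn_gen) / np.median(nn_train)),
        "num_below_threshold": int((nn_gen < threshold).sum()),
    }


# Toy latents (hypothetical): independent draws versus near-copies of training rows.
rng = np.random.default_rng(0)
train = rng.standard_normal((200, 8))
fresh = rng.standard_normal((50, 8))
copies = train[:10] + 1e-6 * rng.standard_normal((10, 8))
fresh_report = memorization_audit(train, fresh)
copy_report = memorization_audit(train, copies)
print(fresh_report, copy_report)
```

Near-copies of training rows fall under the percentile threshold and drive the ratio toward zero, which is the signature such an audit is designed to catch.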

### 3\.4 Patient\-specific longitudinal forecasting

Sections [3\.2](https://arxiv.org/html/2606.19651#S3.SS2) and [3\.3](https://arxiv.org/html/2606.19651#S3.SS3) evaluated the frozen embeddings cross\-sectionally and as the input space for conditional generation\. We now test whether the same frozen tokenizer supports intra\-patient temporal extrapolation: given a single baseline latent and a requested $\Delta t$, does the model produce a follow\-up scan whose predicted brain age, as read by a real\-data brain\-age probe, increases by $\Delta t$?

#### Setup\.

Using the frozen tokenizer of §[2](https://arxiv.org/html/2606.19651#S2) unchanged, we train a longitudinal flow\-matching DiT on $\sim$25k ADNI T1 baseline–followup latent pairs \(subject\-level split, 2\.9k validation pairs\)\. The model is conditioned on diagnosis \(healthy control, mild cognitive impairment, or Alzheimer’s disease\), sex, baseline age, Clinical Dementia Rating \(CDR\) score, and the time horizon $\Delta t$ in years\. The training interpolant couples baseline $x^b$ and follow\-up $x^{t_f}$ with a vanishing\-endpoint Brownian envelope \[[2](https://arxiv.org/html/2606.19651#bib.bib6)\]:

$$x_t=\underbrace{(1-t)\,x^b+t\,x^{t_f}}_{\text{interpolant}}+\underbrace{\sigma\sqrt{t(1-t)}\,\varepsilon}_{\text{envelope},\;\sigma=0.5}.$$

The interpolant alone admits a closed\-form shortcut that lets the velocity ignore $\Delta t$; the envelope is zero at both endpoints \(so $x_t$ equals $x^b$ at $t=0$ and $x^{t_f}$ at $t=1$\) but non\-zero in the interior, breaking the shortcut and forcing the velocity to use $\Delta t$\. Sampling is deterministic Euler integration; the noise is training\-only\. We use CFG $s=1.0$ rather than $s=2.0$ as in §[3\.3](https://arxiv.org/html/2606.19651#S3.SS3), since higher guidance scales amplify bridge\-sampler artefacts in CSF; $s=1.0$ was selected based on visual quality \(CFG sweep in Appendix [G](https://arxiv.org/html/2606.19651#A7)\)\.
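The endpoint behavior of the bridge is easy to check numerically. The sketch below evaluates the interpolant plus the $\sigma\sqrt{t(1-t)}$ envelope on toy latents and includes a plain Euler integrator of the kind described for sampling. The velocity function passed to the integrator is a hypothetical constant-velocity stand-in, not the trained DiT, and the function names are illustrative.

```python
import numpy as np

SIGMA = 0.5  # envelope scale stated in the text


def bridge_point(x_b, x_f, t, eps, sigma=SIGMA):
    """Baseline-to-followup interpolant plus a noise envelope that vanishes at t=0 and t=1."""
    interpolant = (1.0 - t) * x_b + t * x_f
    envelope = sigma * np.sqrt(t * (1.0 - t)) * eps
    return interpolant + envelope


def euler_integrate(velocity_fn, x_start, num_steps=100):
    """Deterministic Euler integration of dx/dt = velocity_fn(x, t) from t=0 to t=1."""
    x = np.array(x_start, dtype=float)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x


rng = np.random.default_rng(0)
x_b = rng.standard_normal((1200, 32))               # stand-in baseline latent
x_f = x_b + 0.1 * rng.standard_normal((1200, 32))   # stand-in follow-up latent
eps = rng.standard_normal((1200, 32))

x_mid = bridge_point(x_b, x_f, 0.5, eps)
# Hypothetical velocity: the straight-line displacement, ignoring x and t.
x_end = euler_integrate(lambda x, t: x_f - x_b, x_b)
print(np.abs(x_end - x_f).max())
```

At $t=0.5$ the envelope contributes $0.25\,\varepsilon$, its largest value, which is where the training signal differs most from the bare interpolant.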

#### Evaluation and results\.

For each of 64 held\-out validation baselines \(no subject overlap with training\), we forecast latents at $\Delta t\in\{0,1,2,5,10\}$ y under classifier\-free guidance \($s=1.0$, 100\-step Euler integration\), decode through the frozen tokenizer, and apply a brain\-age probe trained on ADNI T1 \(mean absolute error of $3.97$ years on held\-out cross\-sectional data\)\. The model recovers approximately 27% of true aging in magnitude \(slope $0.268$, Pearson $r=0.716$ between predicted and requested $\Delta t$\), with the anatomical change correctly localized to ventricles and sulci \(Figure [4](https://arxiv.org/html/2606.19651#S3.F4)\)\. The magnitude attenuation is consistent with the stochastic\-bridge regularizer interpolating toward the conditional mean\. Per\-$\Delta t$ sweep visualizations are in Appendix [I](https://arxiv.org/html/2606.19651#A9)\.
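A slope and a Pearson correlation of this kind can be computed as below. The sketch assumes the slope is an ordinary least-squares fit of the probe's predicted age change on the requested $\Delta t$, which the text does not state explicitly; the synthetic predictions and the function name are hypothetical and are not the paper's data.

```python
import numpy as np


def aging_slope_and_r(requested_dt: np.ndarray, predicted_delta_age: np.ndarray):
    """Least-squares slope and Pearson r of predicted age change versus requested horizon."""
    slope, _intercept = np.polyfit(requested_dt, predicted_delta_age, deg=1)
    r = np.corrcoef(requested_dt, predicted_delta_age)[0, 1]
    return float(slope), float(r)


# Synthetic example (hypothetical): 64 baselines, each forecast at five horizons,
# with a damped response plus probe noise.
rng = np.random.default_rng(0)
horizons = np.array([0.0, 1.0, 2.0, 5.0, 10.0])
requested = np.tile(horizons, 64)
predicted = 0.25 * requested + rng.normal(0.0, 1.0, size=requested.shape)
slope, r = aging_slope_and_r(requested, predicted)
print(f"slope={slope:.3f} r={r:.3f}")
```

A slope of 1 would mean the probe reads one extra year of brain age per requested year; a slope below 1 with a high correlation means the direction is tracked but the magnitude is damped.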

![Refer to caption](https://arxiv.org/html/2606.19651v1/figures/fig_real_vs_forecast.png)

Figure 4: Real vs model\-forecast longitudinal change for held\-out ADNI subjects\. Each row: baseline; $|\text{real diff}|$; real follow\-up at $\Delta t$; $|\text{forecast diff}|$; model forecast at $\Delta t$\. All volumes pass through the $d'=32$ MAE–CNN tokenizer\. Diff cells use independent intensity scaling \(hot colormap\); absolute magnitudes \(z\-score\) printed beneath\. Forecast diffs capture the same anatomical loci \(ventricle, sulci edges\) as real diffs at damped magnitude \(slope $0.268$ in §[3\.4](https://arxiv.org/html/2606.19651#S3.SS4)\)\.

#### Scope\.

Specialised longitudinal brain\-MRI LDMs \[[32](https://arxiv.org/html/2606.19651#bib.bib40),[40](https://arxiv.org/html/2606.19651#bib.bib41)\] target volumetric ROI fidelity under task\-specific losses and use CNN\-VAE tokenizers\. In contrast, the longitudinal variant here reuses the same frozen self\-supervised embeddings used for cross\-sectional probing and conditional generation, showing that this single embedding space also supports intra\-patient temporal forecasting without retraining the encoder\.

## 4 Discussion

Our work makes two main contributions\. First, a self\-supervised two\-phase MAE\-CNN tokenizer for 3D brain MRI in which a frozen MAE encoder is decoupled from a dedicated CNN decoder: the encoder produces clinically informative embeddings \(outperforming or matching the three published 3D brain\-MRI foundation baselines on 21 of 23 linear\-probing tasks; the two non\-wins, GDS depression and UPDRS\-III, are near\-floor regression for every encoder\), and the decoder reconstructs voxels faithfully\. Second, a conditional flow\-matching DiT trained on those frozen embeddings supports controllable generation: real\-data probes applied to DiT samples recover the requested condition under classifier\-free guidance, with *cond*–*null* gaps consistent with each probe’s real\-data signal\. This supports our hypothesis that the MAE serves as a viable tokenizer for downstream diffusion in 3D brain MRI\.

Both contributions are delivered by a single MAE encoder, MAE\-CNN tokenizer, and DiT that span all four structural MRI modalities \(T1, T2, FLAIR, T1c\) across pretraining, probing, and conditional generation, in contrast to alternative approaches that rely on 2D slicewise pretraining or train separate per\-modality models\.

#### Broader impact\.

The variables recovered by the linear probes \(§[3\.2](https://arxiv.org/html/2606.19651#S3.SS2)\) drive routine clinical decisions in neurology and neuro\-oncology that today rely on invasive or costly ancillary procedures: for instance, IDH1 mutation status currently requires tumor biopsy, apolipoprotein E \(APOE\) genotyping requires blood draw, and cognitive scores such as CDR, MMSE, and the Montreal Cognitive Assessment \(MoCA\) require dedicated examiner time\. Predicting these directly from MRI could therefore provide an image\-only readout, with two operating points: linear probing on frozen embeddings \(no GPU required at inference; sites without dedicated infrastructure can fit and apply task\-specific classifiers on cached features\) or supervised fine\-tuning for higher performance when GPU compute is available\. Brain\-age regression similarly yields a continuous biomarker \(the brain\-age gap\) that could serve as an indicator of accelerated neurodegeneration risk\. The same variables condition the DiT \(§[3\.3](https://arxiv.org/html/2606.19651#S3.SS3)\), allowing patient\-specific longitudinal forecasting \(§[3\.4](https://arxiv.org/html/2606.19651#S3.SS4)\), counterfactual generation by changing conditioning labels \(e\.g\., “what would this patient look like under alternative disease status”\), and clinically grounded synthetic data for augmentation of under\-represented patient cohorts and privacy\-preserving cohort sharing\.

#### Limitations\.

Findings are established on brain MRI; transfer to other 3D modalities (CT, whole-body MRI, ultrasound) is untested. On longitudinal forecasting (§[3.4](https://arxiv.org/html/2606.19651#S3.SS4)), the bridge sampler recovers the trajectory direction and anatomical loci of true aging but only $\sim 27\%$ of its magnitude (slope $0.268$), consistent with the stochastic-bridge regularizer interpolating toward the conditional mean. The AKL comparison that motivates the two-phase choice is at 1100 volumes (Appendix [D](https://arxiv.org/html/2606.19651#A4)); the architectural verdict at 35,309-volume scale rests on the head-to-head against published foundation models rather than a re-run AKL. Pretraining distributions differ across the published baselines, with cohort overlap with our corpus ranging from substantial to none, so a portion of the head-to-head gaps conflates architecture and method with distribution mismatch. Acquisition confounders are detectable on the embeddings (field-strength and scanner-vendor probes exceed 0.94 AUC), so cohort-matched evaluation is required for any downstream application.

#### Reproducibility\.

Code, model weights, and deterministic splits will be released under a research-use license. Approximate total training: 200 H100-hours; probing runs on a 16-core CPU in $\approx 2$ h per task. Hyperparameters are listed in Appendix [C](https://arxiv.org/html/2606.19651#A3).

## References

- \[1\] (2017) Multimodal neuroimaging in schizophrenia: description and dissemination. Neuroinformatics 15(4), pp. 343–364. Cited by: [Table 3](https://arxiv.org/html/2606.19651#A1.T3.8.14.14.1).
- \[2\] M. S. Albergo, M. Goldstein, N. M. Boffi, R. Ranganath, and E. Vanden-Eijnden (2024) Stochastic interpolants with data-dependent couplings. In ICML. Note: arXiv:2310.03725. Cited by: [Table 4](https://arxiv.org/html/2606.19651#A3.T4.35.35.35.1.1.1), [§3.4](https://arxiv.org/html/2606.19651#S3.SS4.SSS0.Px1.p1.4).
- \[3\] L. M. Alexander et al. (2017) An open resource for transdiagnostic research in pediatric mental health and learning disorders. Scientific Data. Cited by: [Table 3](https://arxiv.org/html/2606.19651#A1.T3.8.7.7.1).
- \[4\] B. B. Avants, C. L. Epstein, M. Grossman, and J. C. Gee (2008) Symmetric diffeomorphic image registration with cross-correlation. Medical Image Analysis. Cited by: [Appendix B](https://arxiv.org/html/2606.19651#A2.p1.3), [§2](https://arxiv.org/html/2606.19651#S2.SS0.SSS0.Px1.p1.2).
- \[5\] S. Bakas et al. (2022) The University of Pennsylvania glioblastoma (UPenn-GBM) cohort. Scientific Data. Cited by: [Table 3](https://arxiv.org/html/2606.19651#A1.T3.8.5.5.1).
- \[6\] B. B. Biswal et al. (2010) Toward discovery science of human brain function. PNAS. Cited by: [Table 3](https://arxiv.org/html/2606.19651#A1.T3.8.11.11.1).
- \[7\] E. Calabrese et al. (2022) The UCSF-PDGM: a public radiology-pathology dataset for diffuse glioma. Radiology: Artificial Intelligence. Cited by: [Table 3](https://arxiv.org/html/2606.19651#A1.T3.8.8.8.1).
- \[8\] N. Carlini, J. Hayes, M. Nasr, M. Jagielski, V. Sehwag, F. Tramèr, B. Balle, D. Ippolito, and E. Wallace (2023) Extracting training data from diffusion models. In USENIX Security. Cited by: [§3.3](https://arxiv.org/html/2606.19651#S3.SS3.SSS0.Px3.p1.12).
- \[9\] H. Chen, Y. Han, F. Chen, X. Li, Y. Wang, J. Wang, Z. Wang, Z. Liu, D. Zou, and B. Raj (2025) Masked autoencoders are effective tokenizers for diffusion models. In ICML. Note: arXiv:2502.03444. Cited by: [§1](https://arxiv.org/html/2606.19651#S1.p2.1).
- \[10\] S. Chen, K. Ma, and Y. Zheng (2019) Med3D: transfer learning for 3d medical image analysis. arXiv:1904.00625. Cited by: [§1](https://arxiv.org/html/2606.19651#S1.p3.6), [§3.2](https://arxiv.org/html/2606.19651#S3.SS2.SSS0.Px1.p1.6).
- \[11\] J. Cox, P. Liu, S. E. Stolte, Y. Yang, K. Liu, K. B. See, H. Ju, and R. Fang (2024) BrainSegFounder: towards 3D foundation models for neuroimage segmentation. Medical Image Analysis. Note: arXiv:2406.10395. Cited by: [§1](https://arxiv.org/html/2606.19651#S1.p3.6), [§3.2](https://arxiv.org/html/2606.19651#S3.SS2.SSS0.Px1.p1.6).
- \[12\] P. Dhariwal and A. Nichol (2021) Diffusion models beat GANs on image synthesis. In NeurIPS. Cited by: [§2](https://arxiv.org/html/2606.19651#S2.SS0.SSS0.Px5.p1.10).
- \[13\] A. Di Martino et al. (2014) The autism brain imaging data exchange. Molecular Psychiatry. Cited by: [Table 3](https://arxiv.org/html/2606.19651#A1.T3.8.13.13.1).
- \[14\] A. Di Martino et al. (2017) Enhancing studies of the connectome in autism using the Autism Brain Imaging Data Exchange II. Scientific Data. Cited by: [Table 3](https://arxiv.org/html/2606.19651#A1.T3.8.10.10.1).
- \[15\] A. Dosovitskiy et al. (2021) An image is worth 16x16 words: transformers for image recognition at scale. In ICLR. Cited by: [§2](https://arxiv.org/html/2606.19651#S2.SS0.SSS0.Px2.p1.1).
- \[16\] R. L. Gollub et al. (2013) The MCIC collection: a shared repository of multi-modal, multi-site brain image data from a clinical investigation of schizophrenia. Neuroinformatics 11(3), pp. 367–388. Cited by: [Table 3](https://arxiv.org/html/2606.19651#A1.T3.8.14.14.1).
- \[17\] I. E. Hamamci, S. Er, F. Almas, A. G. Simsek, S. N. Esirgun, I. Dogan, M. F. Dasdelen, B. Wittmann, E. Simsar, M. Simsek, et al. (2024) GenerateCT: text-conditional generation of 3D chest CT volumes. In ECCV. Note: arXiv:2305.16037. Cited by: [§1](https://arxiv.org/html/2606.19651#S1.p1.1).
- \[18\] I. E. Hamamci, S. Er, S. Shit, H. Reynaud, D. Yang, P. Guo, M. Edgar, D. Xu, B. Kainz, and B. Menze (2025) Better tokens for better 3D: advancing vision-language modeling in 3D medical imaging. In NeurIPS. Note: arXiv:2510.20639. Cited by: [§1](https://arxiv.org/html/2606.19651#S1.p1.1).
- \[19\] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick (2022) Masked autoencoders are scalable vision learners. In CVPR. Cited by: [§2](https://arxiv.org/html/2606.19651#S2.SS0.SSS0.Px2.p1.1).
- \[20\] J. Ho and T. Salimans (2021) Classifier-free diffusion guidance. NeurIPS Workshop. Cited by: [§2](https://arxiv.org/html/2606.19651#S2.SS0.SSS0.Px5.p1.10).
- \[21\] A. J. Holmes et al. (2015) Brain Genomics Superstruct Project initial data release with structural, functional, and behavioral measures. Scientific Data. Cited by: [Table 3](https://arxiv.org/html/2606.19651#A1.T3.8.9.9.1).
- \[22\] F. Isensee et al. (2019) Automated brain extraction of multisequence MRI using artificial neural networks. Human Brain Mapping. Cited by: [Appendix B](https://arxiv.org/html/2606.19651#A2.p1.3), [§2](https://arxiv.org/html/2606.19651#S2.SS0.SSS0.Px1.p1.2).
- \[23\] IXI Project. IXI dataset. Note: [https://brain-development.org/ixi-dataset/](https://brain-development.org/ixi-dataset/), accessed 2025. Cited by: [Table 3](https://arxiv.org/html/2606.19651#A1.T3.8.12.12.1).
- \[24\] Y. Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le (2023) Flow matching for generative modeling. In ICLR. Cited by: [§1](https://arxiv.org/html/2606.19651#S1.p3.6), [§2](https://arxiv.org/html/2606.19651#S2.SS0.SSS0.Px5.p1.3).
- \[25\] D. S. Marcus et al. (2007) Open access series of imaging studies (OASIS). Journal of Cognitive Neuroscience. Cited by: [Table 3](https://arxiv.org/html/2606.19651#A1.T3.8.17.17.1).
- \[26\] D. S. Marcus et al. (2010) Open access series of imaging studies (OASIS): longitudinal MRI data in nondemented and demented older adults. Journal of Cognitive Neuroscience. Cited by: [Table 3](https://arxiv.org/html/2606.19651#A1.T3.8.18.18.1).
- \[27\] K. Marek et al. (2011) The Parkinson progression marker initiative (PPMI). Progress in Neurobiology. Cited by: [Table 3](https://arxiv.org/html/2606.19651#A1.T3.8.3.3.1).
- \[28\] K. B. Nooner et al. (2012) The NKI-Rockland sample: a model for accelerating the pace of discovery science in psychiatry. Frontiers in Neuroscience. Cited by: [Table 3](https://arxiv.org/html/2606.19651#A1.T3.8.19.19.1), [Table 3](https://arxiv.org/html/2606.19651#A1.T3.8.4.4.1).
- \[29\] W. Peebles and S. Xie (2023) Scalable diffusion models with transformers. In ICCV. Cited by: [§1](https://arxiv.org/html/2606.19651#S1.p3.6), [§2](https://arxiv.org/html/2606.19651#S2.SS0.SSS0.Px5.p1.10), [§2](https://arxiv.org/html/2606.19651#S2.SS0.SSS0.Px5.p1.3).
- \[30\] R. C. Petersen et al. (2010) Alzheimer’s Disease Neuroimaging Initiative (ADNI): clinical characterization. Neurology 74(3), pp. 201–209. Cited by: [Table 3](https://arxiv.org/html/2606.19651#A1.T3.8.2.2.1).
- \[31\] W. H. L. Pinaya, P. Tudosiu, J. Dafflon, P. F. Da Costa, V. Fernandez, P. Nachev, S. Ourselin, and M. J. Cardoso (2022) Brain imaging generation with latent diffusion models. arXiv:2209.07162. Cited by: [Appendix D](https://arxiv.org/html/2606.19651#A4.p1.3), [§1](https://arxiv.org/html/2606.19651#S1.p1.1), [§3.1](https://arxiv.org/html/2606.19651#S3.SS1.SSS0.Px2.p1.7), [§3.3](https://arxiv.org/html/2606.19651#S3.SS3.SSS0.Px3.p1.12).
- \[32\] L. Puglisi, D. C. Alexander, and D. Ravì (2024) Enhancing spatiotemporal disease progression models via latent diffusion and prior knowledge. In MICCAI. Note: arXiv:2405.03328. Cited by: [§3.4](https://arxiv.org/html/2606.19651#S3.SS4.SSS0.Px3.p1.1).
- \[33\] T. Rohlfing, N. M. Zahr, E. V. Sullivan, and A. Pfefferbaum (2010) The SRI24 multichannel atlas of normal adult human brain structure. Human Brain Mapping 31(5). Cited by: [Appendix B](https://arxiv.org/html/2606.19651#A2.p1.3), [§2](https://arxiv.org/html/2606.19651#S2.SS0.SSS0.Px1.p1.2).
- \[34\] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer (2022) High-resolution image synthesis with latent diffusion models. In CVPR. Cited by: [§1](https://arxiv.org/html/2606.19651#S1.p1.1).
- \[35\] C. Sadée, S. Testa, T. Barba, K. Hartmann, M. Schuessler, A. Thieme, G. M. Church, I. Okoye, T. Hernandez-Boussard, L. Hood, I. Shmulevich, E. Kuhl, and O. Gevaert (2025) Medical digital twins: enabling precision medicine and medical artificial intelligence. The Lancet Digital Health 7(7), pp. e100864. External Links: [Document](https://dx.doi.org/10.1016/j.landig.2025.02.004). Cited by: [§1](https://arxiv.org/html/2606.19651#S1.p1.1).
- \[36\] D. Tak, B. A. Garomsa, A. Zapaishchykova, T. L. Chaunzwa, J. C. C. Pardo, Z. Ye, J. Zielke, Y. Ravipati, S. Pai, S. Vajapeyam, et al. (2026) A generalizable foundation model for analysis of human brain MRI. Nature Neuroscience 29, pp. 945–956. Cited by: [§1](https://arxiv.org/html/2606.19651#S1.p3.6), [§3.2](https://arxiv.org/html/2606.19651#S3.SS2.SSS0.Px1.p1.6).
- \[37\] The ADHD-200 Consortium (2012) The ADHD-200 consortium: a model to advance the translational potential of neuroimaging in clinical neuroscience. Frontiers in Systems Neuroscience. Cited by: [Table 3](https://arxiv.org/html/2606.19651#A1.T3.8.15.15.1).
- \[38\] N. J. Tustison et al. (2010) N4ITK: improved N3 bias correction. IEEE TMI. Cited by: [Appendix B](https://arxiv.org/html/2606.19651#A2.p1.3), [§2](https://arxiv.org/html/2606.19651#S2.SS0.SSS0.Px1.p1.2).
- \[39\] D. C. Van Essen et al. (2013) The WU-Minn human connectome project: an overview. NeuroImage. Cited by: [Table 3](https://arxiv.org/html/2606.19651#A1.T3.8.6.6.1).
- \[40\] C. Wan, B. Jafrasteh, E. Adeli, M. Zhang, and Q. Zhao (2026) Anatomically guided latent diffusion for brain MRI progression modeling. arXiv:2601.14584. Cited by: [§3.4](https://arxiv.org/html/2606.19651#S3.SS4.SSS0.Px3.p1.1).
- \[41\] H. Wang, Z. Liu, K. Sun, X. Wang, D. Shen, and Z. Cui (2024) 3D MedDiffusion: a 3D medical latent diffusion model for controllable and high-quality medical image generation. arXiv:2412.13059. Cited by: [§1](https://arxiv.org/html/2606.19651#S1.p1.1).
- \[42\] L. Wang et al. (2016) SchizConnect: mediating neuroimaging databases on schizophrenia and related disorders for large-scale integration. NeuroImage 124, pp. 1155–1167. Cited by: [Table 3](https://arxiv.org/html/2606.19651#A1.T3.8.14.14.1).
- \[43\] Y. Wu, S. Wang, Y. Li, M. Safari, M. Hu, C. Chang, H. Veeraraghavan, and X. Yang (2026) BrainDINO: a brain MRI foundation model for generalizable clinical representation learning. arXiv preprint arXiv:2604.27277. Cited by: [§3.2](https://arxiv.org/html/2606.19651#S3.SS2.SSS0.Px1.p1.6).
- \[44\] Y. Yu, Y. Gu, S. Zhang, and X. Zhang (2024) MedDiff-FM: a diffusion-based foundation model for versatile medical image applications. arXiv:2410.15432. Cited by: [§1](https://arxiv.org/html/2606.19651#S1.p1.1).
- \[45\] X. Zuo et al. (2014) An open science resource for establishing reliability and reproducibility in functional connectomics. Scientific Data. Cited by: [Table 3](https://arxiv.org/html/2606.19651#A1.T3.8.16.16.1).

## Appendix A Dataset card

#### Aggregation\.

35,309 preprocessed brain MRI volumes from 17,399 unique subjects across 18 public cohorts and 200\+ acquisition sites worldwide\. Four imaging modalities \(T1, T2, FLAIR, T1c\)\. Ages 5–98\. 6,576 subjects have longitudinal \(multi\-visit\) scans\.

Table 3: Per-cohort composition. Dominant diagnoses listed; full per-subject labels in `volumes.csv`. Diagnosis abbreviations: HC (healthy control), AD (Alzheimer’s disease), MCI (mild cognitive impairment), PD (Parkinson’s disease), GBM (glioblastoma), ASD (autism spectrum disorder), ADHD (attention-deficit/hyperactivity disorder), SCZ (schizophrenia).

| Cohort | Volumes | Subjects | Modalities | Dominant diagnoses |
| --- | --- | --- | --- | --- |
| ADNI \[[30](https://arxiv.org/html/2606.19651#bib.bib51)\] | 10,910 | 2,592 | T1 | HC, MCI, AD |
| PPMI \[[27](https://arxiv.org/html/2606.19651#bib.bib30)\] | 3,908 | 1,484 | T1, T2, FLAIR | HC, PD |
| NKI-RS \[[28](https://arxiv.org/html/2606.19651#bib.bib42)\] | 2,455 | 1,326 | T1 | HC |
| UPENN-GBM \[[5](https://arxiv.org/html/2606.19651#bib.bib34)\] | 2,444 | 611 | T1/T2/FLAIR/T1c | GBM |
| HCP-YA \[[39](https://arxiv.org/html/2606.19651#bib.bib32)\] | 2,226 | 1,113 | T1, T2 | HC |
| HBN \[[3](https://arxiv.org/html/2606.19651#bib.bib43)\] | 2,135 | 2,135 | T1 | Paediatric |
| UCSF-PDGM \[[7](https://arxiv.org/html/2606.19651#bib.bib33)\] | 1,980 | 495 | T1/T2/FLAIR/T1c | GBM, Glioma |
| BGSP \[[21](https://arxiv.org/html/2606.19651#bib.bib44)\] | 1,636 | 1,568 | T1 | HC |
| ABIDE-II \[[14](https://arxiv.org/html/2606.19651#bib.bib45)\] | 1,430 | 1,082 | T1 | HC, ASD |
| FCON1000 \[[6](https://arxiv.org/html/2606.19651#bib.bib46)\] | 1,197 | 1,197 | T1 | HC |
| IXI \[[23](https://arxiv.org/html/2606.19651#bib.bib50)\] | 1,159 | 582 | T1, T2 | HC |
| ABIDE-I \[[13](https://arxiv.org/html/2606.19651#bib.bib31)\] | 984 | 947 | T1 | HC, ASD |
| SCHIZO \[[1](https://arxiv.org/html/2606.19651#bib.bib52), [16](https://arxiv.org/html/2606.19651#bib.bib53), [42](https://arxiv.org/html/2606.19651#bib.bib54)\] | 670 | 335 | T1, T2 | HC, SCZ |
| ADHD-200 \[[37](https://arxiv.org/html/2606.19651#bib.bib47)\] | 598 | 598 | T1 | HC, ADHD |
| CoRR \[[45](https://arxiv.org/html/2606.19651#bib.bib48)\] | 546 | 546 | T1 | HC |
| OASIS-1 \[[25](https://arxiv.org/html/2606.19651#bib.bib28)\] | 436 | 416 | T1 | HC, MCI, AD |
| OASIS-2 \[[26](https://arxiv.org/html/2606.19651#bib.bib49)\] | 373 | 150 | T1 | HC, MCI, AD |
| NKI-RS 2 \[[28](https://arxiv.org/html/2606.19651#bib.bib42)\] | 222 | 222 | T1 | HC |
| Total | 35,309 | 17,399 | 4 modalities | 10 classes |

#### Disease coverage.

HC 15,274, MCI 5,808, GBM 4,028, PD 3,381, paediatric\-mixed 2,130, AD 1,458, ASD 1,061, Glioma \(non\-GBM\) 396, SCZ 328, ADHD 247\. Healthy subjects dominate \(43% of the corpus\) by design: the probing benchmark requires a large HC pool to separate clinical signal from healthy\-brain variability, and the generative model benefits from a well\-represented null distribution\.
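The shares quoted above can be checked with a few lines of arithmetic. The sketch below uses only the counts listed in this paragraph and the corpus size from the aggregation paragraph; treating the listed counts as per-volume (rather than per-subject) is an assumption, suggested by the fact that they sum to roughly the 96% diagnosis coverage reported under clinical metadata.

```python
# Per-diagnosis counts as listed in the "Disease coverage" paragraph.
counts = {
    "HC": 15274, "MCI": 5808, "GBM": 4028, "PD": 3381,
    "paediatric-mixed": 2130, "AD": 1458, "ASD": 1061,
    "Glioma (non-GBM)": 396, "SCZ": 328, "ADHD": 247,
}
TOTAL_VOLUMES = 35309  # corpus size from the aggregation paragraph

labelled = sum(counts.values())              # volumes with a diagnosis label
hc_share = counts["HC"] / TOTAL_VOLUMES      # healthy-control share of the corpus
labelled_share = labelled / TOTAL_VOLUMES    # assumption: counts are per volume

print(f"labelled volumes: {labelled}")
print(f"HC share of corpus: {hc_share:.1%}")
print(f"labelled share of corpus: {labelled_share:.1%}")
```

The ten listed classes sum to 34,111, so about 1,200 volumes carry no diagnosis label under this reading.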

#### Clinical metadata\.

Coverage of key fields: age\-at\-scan 90%, sex 90%, dx 96%, CDR 34% \(99% within AD/MCI\), MMSE 41% \(99% within AD/MCI\), MoCA 18%, APOE 41%, tumor grade 8% \(100% within tumor cohorts\), IDH1 7% \(91% within tumor cohorts\), MGMT 6% \(69% within tumor cohorts\), site 78%\.

## Appendix B Preprocessing pipeline

All volumes pass through a single harmonized pipeline before entering the model. Raw NIfTI/DICOM inputs first undergo N4 bias-field correction \[[38](https://arxiv.org/html/2606.19651#bib.bib23)\] to remove low-frequency intensity inhomogeneity from RF coil sensitivity profiles. HD-BET \[[22](https://arxiv.org/html/2606.19651#bib.bib21)\] then performs skull stripping, removing skull, meninges, and extracranial tissue. The resulting skull-stripped brain is affine-registered to the SRI24 atlas \[[33](https://arxiv.org/html/2606.19651#bib.bib20), [4](https://arxiv.org/html/2606.19651#bib.bib22)\] at $240 \times 240 \times 155$ voxels, $1$ mm isotropic, LPS orientation. For multi-modality studies, T1 is registered as the “center” modality and T2/FLAIR/T1c are co-registered through T1’s transform (avoiding independent re-registration that would create cross-modal alignment drift). Negative voxels from registration interpolation are clipped to zero. Intensity normalization is not applied at preprocessing time; per-volume z-scoring is performed in the training dataloader. For modeling, volumes are center-cropped/padded to $160 \times 192 \times 160$.
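The last three steps (clipping, center crop/pad, per-volume z-scoring) can be sketched in NumPy as follows. This is a minimal illustration, not the released pipeline: the function names are hypothetical, zero-padding is assumed for axes shorter than the target, and the z-score is assumed to run over all voxels including background, none of which is specified above.

```python
import numpy as np

TARGET_SHAPE = (160, 192, 160)  # model input size stated in Appendix B


def center_crop_or_pad(vol: np.ndarray, target=TARGET_SHAPE) -> np.ndarray:
    """Center-crop axes that are too long and zero-pad axes that are too short."""
    out = vol
    for axis, want in enumerate(target):
        have = out.shape[axis]
        if have > want:  # crop symmetrically around the centre
            start = (have - want) // 2
            out = np.take(out, np.arange(start, start + want), axis=axis)
        elif have < want:  # pad symmetrically with zeros (assumed background value)
            before = (want - have) // 2
            pad = [(0, 0)] * out.ndim
            pad[axis] = (before, want - have - before)
            out = np.pad(out, pad, mode="constant")
    return out


def zscore(vol: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Per-volume z-scoring over all voxels (assumption: background included)."""
    return (vol - vol.mean()) / (vol.std() + eps)


# Random stand-in for a registered 240 x 240 x 155 volume.
raw = np.random.default_rng(0).normal(size=(240, 240, 155)).astype(np.float32)
raw = np.clip(raw, 0, None)           # negative interpolation values clipped to zero
x = zscore(center_crop_or_pad(raw))   # normalisation happens after cropping here
print(x.shape)
```

Two axes are cropped (240 to 160 and 240 to 192) while the third is padded (155 to 160), which is why the helper handles both cases per axis.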

## Appendix C Hyperparameters

Table 4: Training hyperparameters for each stage.

| Component | Hyperparameter | Value |
| --- | --- | --- |
| MAE encoder (Phase 1) | Architecture | ViT, 12 layers, hidden $1152$, MLP $4608$, 16 heads |
| | Decoder | ViT, 8 layers, hidden $1152$, MLP $4608$, 16 heads |
| | Patch size | $16^{3}$; 1200 tokens / volume |
| | Masking ratio | 0.70 |
| | Batch / GPUs | $16$ / 4$\times$H100 |
| | Optimizer | AdamW, lr $1 \times 10^{-4}$, cosine |
| | Epochs | 100 |
| | Final train / val loss | $0.117$ / $0.121$ patch MSE |
| Phase 2 tokenizer (frozen enc. + proj. + CNN dec.) | Projection | linear $P \in \mathbb{R}^{1152 \times 32}$ |
| | CNN decoder | ResBlocks + attn, $4{\times}{\uparrow}2$, 43.7M params |
| | Loss | voxel $\ell_{1}$ |
| | Batch / GPUs | $12$ / 4$\times$H100, bf16 |
| | Optimizer | AdamW, lr $1 \times 10^{-4}$ |
| | Epochs | 80, early-stop patience 40 |
| Conditional DiT (§[3.3](https://arxiv.org/html/2606.19651#S3.SS3)) | Architecture | DiT-L: 12 blocks, hidden $1152$, $18$ heads, MLP $4\times$ |
| | Conditions | 6 (disease, sex, modality [never-drop], site, age, IDH1) |
| | CFG dropout | per-condition $p = 0.1$ |
| | Batch / GPUs | $28$ / 4$\times$H100, bf16 |
| | Optimizer | AdamW, lr $5 \times 10^{-5}$, wd $0.01$, clip $0.5$ |
| | EMA decay | $0.9999$ |
| | Sampling | Euler ODE, 50 steps, CFG $s = 2.0$ |
| Longitudinal DiT (§[3.4](https://arxiv.org/html/2606.19651#S3.SS4)) | Architecture | shares Conditional-DiT backbone (12 blocks, hidden $1152$, $18$ heads) |
| | Training data | $\sim 25$k ADNI T1 baseline–followup latent pairs (subject-level split) |
| | Conditions | 5 (disease, sex, baseline age, CDR, $\Delta t$ in years) |
| | Interpolant | vanishing-endpoint Brownian bridge, $\sigma = 0.5$ \[[2](https://arxiv.org/html/2606.19651#bib.bib6)\] |
| | CFG dropout | per-condition $p = 0.1$ ($\Delta t$ and disease never-drop) |
| | Batch / GPUs | $28$ / $1\times$H100, bf16 |
| | Optimizer | AdamW, lr $5 \times 10^{-5}$, wd $0.01$, clip $0.5$, EMA $0.9999$ |
| | Sampling | deterministic Euler ODE, 100 steps, CFG $s = 1.0$, EMA epoch-75 |
| Probing | Features | frozen encoder, mean-pool over 1200 tokens, $d = 1152$ |
| | Classifier | logistic regression (max_iter=1000) |
| | Regressor | ridge ($\alpha = 1.0$) |
| | CV | 5-fold stratified group $k$-fold (subject-grouped) |
| | Seed | 2025 |
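The probing rows of Table 4 (mean-pooled frozen features, ridge with $\alpha = 1.0$, subject-grouped 5-fold CV) can be illustrated with a NumPy-only sketch. It is a simplification under stated assumptions: the data are synthetic, token and feature dimensions are shrunk for speed, the folds are grouped by subject but not stratified, and only the regression probe is shown. The helper names `ridge_fit` and `grouped_folds` are hypothetical and not taken from the authors' code.

```python
import numpy as np

rng = np.random.default_rng(2025)

# Synthetic stand-ins (assumptions): 200 volumes from 50 subjects, with the
# token count and embedding width shrunk from 1200 x 1152 to 12 x 8.
n_vol, n_tok, d = 200, 12, 8
tokens = rng.normal(size=(n_vol, n_tok, d))
subjects = np.repeat(np.arange(50), 4)            # 4 visits per subject
features = tokens.mean(axis=1)                    # mean-pool over tokens -> (n_vol, d)
y = features @ rng.normal(size=d) + 0.01 * rng.normal(size=n_vol)  # toy target


def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge with an unpenalised intercept (obtained by centring)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ (y - y_mean))
    return w, y_mean - x_mean @ w


def grouped_folds(groups, k=5, seed=2025):
    """Yield (train_idx, test_idx) so that no subject appears on both sides."""
    uniq = np.random.default_rng(seed).permutation(np.unique(groups))
    for held_out in np.array_split(uniq, k):
        test = np.isin(groups, held_out)
        yield np.where(~test)[0], np.where(test)[0]


scores = []
for tr, te in grouped_folds(subjects):
    w, b = ridge_fit(features[tr], y[tr], alpha=1.0)
    pred = features[te] @ w + b
    ss_res = np.sum((y[te] - pred) ** 2)
    ss_tot = np.sum((y[te] - y[te].mean()) ** 2)
    scores.append(1.0 - ss_res / ss_tot)          # held-out R^2 for this fold

print(f"mean R^2 over folds: {np.mean(scores):.3f}")
```

Grouping by subject matters here because 6,576 subjects have multiple visits; splitting by volume would place scans of the same person in both train and test folds.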

Approximate total training cost: $\approx 200$ H100-hours (100 epochs MAE pretraining + 80 epochs Phase 2 + $\approx 1100$ DiT epochs). Probing runs on a 16-core CPU in $\approx 2$ h per task.

## Appendix D AKL baseline training details

The AutoencoderKL (AKL) baseline reported in Table [5](https://arxiv.org/html/2606.19651#A4.T5) is the canonical CNN-VAE tokenizer architecture from the 3D medical generative literature \[[31](https://arxiv.org/html/2606.19651#bib.bib14)\]. We train AKL from scratch on the same 1100-volume tumor cohort (UCSF-PDGM + UPENN-GBM) as the MAE+CNN tokenizer, with reconstruction + KL loss; the resulting latent has 614K total elements, matching the MAE+CNN $d'=512$ configuration. Held-out reconstruction metrics are reported on the 20% test split; clinical probing uses 5-fold subject-grouped cross-validation with logistic regression.
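The 614K figure follows from the token grid. A short check, assuming the $160 \times 192 \times 160$ input from Appendix B and the $16^3$ patch size from Table 4 (the helper name `latent_elements` is illustrative):

```python
from math import prod

VOLUME_SHAPE = (160, 192, 160)  # cropped/padded input (Appendix B)
PATCH = 16                      # cubic patch edge (Table 4)

grid = tuple(s // PATCH for s in VOLUME_SHAPE)  # patches along each axis
n_tokens = prod(grid)                           # tokens per volume


def latent_elements(d_prime: int) -> int:
    """Total latent size for a projection bottleneck of width d_prime."""
    return n_tokens * d_prime


print(grid, n_tokens)          # (10, 12, 10) 1200
print(latent_elements(512))    # 614400
print(latent_elements(32))     # 38400
```

The same arithmetic gives the 38,400-dimensional flattened latent used for the $d'=32$ tokenizer in the memorization audit.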

Table 5: Architectural validation on the 1100-volume tumor cohort: MAE+CNN at projection bottlenecks $d'$ vs an AutoencoderKL (AKL) baseline (matched compute). Reconstruction on held-out volumes; probing AUC under 5-fold logistic regression. Bold: best probing per column.

#### Reconstruction-vs-probing trade-off.

On voxel reconstruction, AKL is competitive with MAE+CNN at $d'=32$ (PSNR 33.62 vs 32.54), but MAE exceeds AKL at $d' \geq 128$ (Table [5](https://arxiv.org/html/2606.19651#A4.T5)). On clinical probing, MAE outperforms AKL on IDH1 at every bottleneck and on WHO tumor grade at every bottleneck above $d'=32$, where the two are essentially tied. The MAE-as-tokenizer advantage is driven by what the encoder embedding captures rather than by raw voxel detail preservation.

## Appendix E Full 23-task probing breakdown (per modality)

Table 6: Frozen linear-probe performance per task and modality split, 35,309-volume corpus, 5-fold stratified group $k$-fold grouped by subject. AUC for classification, $R^{2}$ for regression. Dashes: too few samples in that modality subset. Bold = best modality per task. Abbreviations: APOE (apolipoprotein E), IDH1 (isocitrate dehydrogenase 1), MGMT (O6-methylguanine-DNA methyltransferase promoter methylation), CDR (Clinical Dementia Rating), MMSE (Mini-Mental State Examination), MoCA (Montreal Cognitive Assessment), GDS (Geriatric Depression Scale), UPDRS-III (Unified Parkinson’s Disease Rating Scale, motor), SCOPA-AUT (Scale for Outcomes in PD-Autonomic).

## Appendix F Head-to-head competitor probing: full 23-task breakdown

Complement to Table [1](https://arxiv.org/html/2606.19651#S3.T1): all 23 probed tasks, four frozen encoders on identical splits and probe code (35,309-volume corpus, 5-fold stratified subject-grouped CV). Each bar shows the best-performing modality slice for that encoder on that task, so each encoder is reported at its strongest configuration. Table [1](https://arxiv.org/html/2606.19651#S3.T1) fixes the modality slice across encoders on 8 representative tasks for a directly comparable head-to-head.

![Refer to caption](https://arxiv.org/html/2606.19651v1/figures/fig_competitors_full.png) Figure 5: Frozen-feature linear probing across all 23 tasks, best modality per encoder. Left: 15 classification tasks, AUC ($\uparrow$). Right: 8 regression tasks, $R^{2}$ ($\uparrow$). Ours outperforms or matches every competitor on 21 of 23 tasks; the two non-wins (GDS depression, UPDRS-III) are near-floor regression for every encoder.

## Appendix G CFG scale sensitivity

We sweep CFG scale $s \in \{1.5,\, 2.0,\, 3.0\}$ at $n = 32$ samples per arm (Table [7](https://arxiv.org/html/2606.19651#A7.T7)). CFG $s = 1.5$ and $s = 2.0$ are within noise on every controllable condition; at $s = 3.0$ the continuous age axis degrades (Pearson $r$ drops $0.94 \to 0.82$) as strong guidance over-extrapolates away from the conditional manifold. IDH1 stays near chance throughout the sweep (mean agreement $0.51$–$0.52$), reflecting CFG’s difficulty steering toward the rare mutant class on the small T1c tumor subset rather than a CFG-scale effect. We use $s = 2.0$ throughout the paper as a point on the near-optimal plateau.
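To make the over-extrapolation argument concrete, the sketch below combines a conditional and a null velocity with the standard classifier-free guidance rule and integrates with fixed-step Euler. Everything apart from the guidance rule and the Euler loop is a toy assumption: the two "networks" are replaced by straight-line velocities pointing at fixed targets, and the exact parameterisation used by the DiT is not given in this appendix.

```python
import numpy as np


def guided_velocity(v_cond, v_null, s):
    """Classifier-free guidance: extrapolate from the null toward the conditional field."""
    return v_null + s * (v_cond - v_null)


def euler_sample(x0, target_cond, target_null, s, steps=50):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x, dt = x0.copy(), 1.0 / steps
    for i in range(steps):
        t = i * dt
        # Toy stand-ins for a trained model (assumption): straight-line
        # velocities that would carry x to each target by t = 1.
        v_c = (target_cond - x) / (1.0 - t)
        v_n = (target_null - x) / (1.0 - t)
        x = x + dt * guided_velocity(v_c, v_n, s)
    return x


x0 = np.random.default_rng(0).normal(size=3)   # noise sample
cond = np.array([1.0, 2.0, 3.0])               # toy conditional target
null = np.zeros(3)                             # toy unconditional target

for s in (1.0, 2.0, 3.0):
    print(s, np.round(euler_sample(x0, cond, null, s), 3))
```

In this toy setting the endpoint is `null + s * (cond - null)`: at $s = 1$ the sample lands on the conditional target, and larger scales push it proportionally past that target, which is the qualitative behaviour the age-axis degradation at $s = 3.0$ is attributed to.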

Table 7: Probe recovery of conditional attributes under different CFG scales. Columns: probe agreement with requested class (disease HC-vs-AD, sex F-vs-M, IDH1 wt-vs-mut) and Pearson $r$ between requested and predicted age on the 30 / 50 / 70 sweep, with the per-requested-age predicted means.

## Appendix H Generation fidelity: full decomposition

#### Protocol\.

3D-FID is computed via Inception-V3 on 2D slices. For each volume we extract $5$ central slices per anatomical view ($\times 3$ views) $= 15$ slices/volume. Real reference sets are drawn from the 35,309-volume corpus under the stated filter; generated samples are drawn from the conditional DiT under CFG $s = 2.0$, $50$-step Euler ODE. For each controllability arm we report three FIDs: *gen-vs-real* (generated vs real volumes), *gen-vs-recons* (generated vs the same real volumes passed through the tokenizer and reconstructed), and *recons-vs-real* (reconstructions vs real, i.e. the tokenizer floor).
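The slice-extraction step can be sketched as below. The protocol does not say how the "central" slices are chosen, so the contiguous window around each axis midpoint is an assumption, as is the mapping of array axes to anatomical views; the function name `central_slices` is illustrative.

```python
import numpy as np


def central_slices(vol: np.ndarray, per_view: int = 5) -> list:
    """Return `per_view` contiguous slices around the midpoint of each of the 3 axes."""
    slices = []
    for axis in range(3):                  # one anatomical view per array axis
        mid = vol.shape[axis] // 2
        start = mid - per_view // 2        # window centred on the midpoint
        for idx in range(start, start + per_view):
            slices.append(np.take(vol, idx, axis=axis))
    return slices


vol = np.zeros((160, 192, 160), dtype=np.float32)   # stand-in volume
stack = central_slices(vol)
print(len(stack))                                   # 15
print(sorted({s.shape for s in stack}))
```

Because the three views have different in-plane shapes, the slices would still need resizing to the Inception-V3 input size before feature extraction; that step is omitted here.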

Table 8: FID decomposition across controllability arms. *gen-vs-recons* measures how well the generator matches the tokenizer’s conditional latent distribution; *recons-vs-real* is the tokenizer floor. Lower is better.

#### Memorization audit.

Nearest-neighbor $L_2$ distances are computed in the $1200 \times 32$ token latent ($38{,}400$-dim flattened), real pool $n = 33{,}639$, synthetic $n = 1088$. Real-to-real NN: mean $37.2$, median $33.4$, $5^{\text{th}}$ percentile $14.4$, min $2.95$, max $116.3$. Synthetic-to-real NN: mean $57.7$, median $60.4$, min $23.6$, max $75.0$. Ratio of means $= 1.55$; zero synthetic samples fall below the $5^{\text{th}}$-percentile real-to-real threshold. Verdict: no memorization.
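A brute-force version of this audit fits in a few lines of NumPy. The sketch below runs on small random stand-ins rather than the real $38{,}400$-dimensional latents, and the helper name `nn_l2` is hypothetical; at the stated pool sizes a chunked or approximate nearest-neighbor search would likely be needed, which is not shown.

```python
import numpy as np


def nn_l2(queries: np.ndarray, pool: np.ndarray, exclude_self: bool = False) -> np.ndarray:
    """L2 distance from each query row to its nearest row in `pool` (brute force)."""
    # Squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    d2 = (queries ** 2).sum(1)[:, None] + (pool ** 2).sum(1)[None, :] - 2.0 * queries @ pool.T
    d2 = np.maximum(d2, 0.0)               # guard against tiny negative round-off
    if exclude_self:                       # real-to-real: skip the zero self-distance
        np.fill_diagonal(d2, np.inf)
    return np.sqrt(d2.min(axis=1))


rng = np.random.default_rng(0)
real = rng.normal(size=(500, 64))          # stand-in for flattened real latents
synthetic = rng.normal(size=(100, 64))     # stand-in for generated latents

real_nn = nn_l2(real, real, exclude_self=True)
syn_nn = nn_l2(synthetic, real)
threshold = np.percentile(real_nn, 5)      # 5th-percentile real-to-real distance
n_below = int((syn_nn < threshold).sum())  # samples closer than typical real pairs

print(f"ratio of means: {syn_nn.mean() / real_nn.mean():.2f}")
print(f"synthetic samples below threshold: {n_below}")
```

With the reported means, the ratio is $57.7 / 37.2 \approx 1.55$; a memorised sample would show up as a synthetic-to-real distance near zero, well under the $14.4$ threshold.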

![Refer to caption](https://arxiv.org/html/2606.19651v1/figures/memorization_hist.png) Figure 6: Nearest-neighbor $L_2$ distributions in the $38{,}400$-dim latent space. Real-to-real (slate) versus generated-to-real (coral). The synthetic distribution sits entirely to the right of the real distribution’s lower tail; none of the $1088$ generated samples falls below the $5^{\text{th}}$-percentile real-to-real threshold (dashed line, $14.4$).

## Appendix I Longitudinal sweep visualization

Complement to §[3.4](https://arxiv.org/html/2606.19651#S3.SS4): per-$\Delta t$ self-consistency of the bridge sampler. The same two held-out ADNI cases shown in Figure [4](https://arxiv.org/html/2606.19651#S3.F4) are bridge-sampled at requested $\Delta t \in \{1, 2, 5\}$ y, and each forecast is compared to the $\Delta t = 0$ sample to isolate $\Delta t$-driven structural change (subtracting the round-trip sampler noise floor). Figure [7](https://arxiv.org/html/2606.19651#A9.F7) shows that the difference maps grow monotonically with requested $\Delta t$ and concentrate along the same cortex and ventricle loci that the real-vs-forecast comparison in Figure [4](https://arxiv.org/html/2606.19651#S3.F4) highlights.

![Refer to caption](https://arxiv.org/html/2606.19651v1/figures/fig_longit.png) Figure 7: Longitudinal forecasting at fixed baseline. Two held-out ADNI cases (HC 75-yr-old male, AD 77-yr-old male) bridge-sampled at requested $\Delta t \in \{1, 2, 5\}$ y. Each pair: forecasted axial slice (left) and the absolute voxel difference $|\,\text{sample}(\Delta t) - \text{sample}(0)\,|$ (right, hot colormap, $\gamma = 1.5$); subtracting the round-trip noise floor isolates $\Delta t$-driven structural change. Difference maps grow with requested $\Delta t$ and concentrate along cortex and ventricles, the expected loci of age-related change. T1.
