Uncovering Trajectory and Topological Signatures in Multimodal Pediatric Sleep Embeddings
Summary
This paper investigates the latent structure of multimodal embeddings from a masked autoencoder for pediatric sleep analysis. It shows that augmenting embeddings with geometric, topological, and clinical features improves prediction and calibration for sleep-related events.
# Uncovering Trajectory and Topological Signatures in Multimodal Pediatric Sleep Embeddings
Source: [https://arxiv.org/html/2605.14156](https://arxiv.org/html/2605.14156)
Volume 297, 2025\. Workshop: Machine Learning for Health \(ML4H\) 2025
Scott Ye \(scott\.ye@ucsf\.edu\), Department of Radiology, San Francisco\. Work initiated during the author’s M\.S\. paper in Biostatistics at UNC–Chapel Hill and continued in collaboration with the UNC School of Data Science and Society\. Short version presented at TS4H Workshop at NeurIPS 2025\.

Harlin Lee \(harlin@unc\.edu\), School of Data Science and Society, University of North Carolina at Chapel Hill
###### Abstract
While generative models have shown promise in pediatric sleep analysis, the latent structure of their multimodal embeddings remains poorly understood\. This work investigates session\-wide diagnostic information contained in the sequences of 30\-second pediatric PSG epochs embedded by a multimodal masked autoencoder\. We test whether augmenting embeddings with \(i\) PHATE\-derived per\-epoch coordinates and whole\-night movement descriptors, \(ii\) persistent homology summaries of the embedding cloud, and \(iii\) EHR yields task\-relevant signals\. Simple linear and MLP models, chosen for interpretability rather than state\-of\-the\-art performance, show that geometric, topological, and clinical features each provide complementary gains\. For binary predictions, feature importance is task\-dependent, and more expressive late\-fusion models generally perform better, with AUPRC improving 0\.26→0\.34 for desaturation, 0\.31→0\.48 for EEG arousal, 0\.09→0\.22 for hypopnea, and 0\.05→0\.14 for apnea\. We also report Brier score and Expected Calibration Error, where the full fusion model yields the best calibration across all four binary tasks\. Our study reveals that latent geometry/topology and EHR offer complementary, interpretable signals beyond embeddings, improving calibration and robustness under extreme imbalance\.
###### keywords:
pediatric sleep, polysomnography, PHATE, physiological time\-series, topological data analysis, trajectory analysis, EHR, multimodal representation learning
##### Data and Code Availability
The Nationwide Children’s Hospital Sleep DataBank is available at NSRR \(zhang2018national\) and PhysioNet \(goldberger2000physiobank\)\. All preprocessing, feature extraction, and model\-training scripts are available at [https://github\.com/scottye009/PedSleep\-TTA](https://github.com/scottye009/PedSleep-TTA)\.
##### Institutional Review Board \(IRB\)
This work analyzes a publicly released, de\-identified dataset and does not constitute human subjects research\. Therefore, IRB approval is not required\.
## 1 Introduction
Sleep disorders in the pediatric population significantly impact developmental health, cognition, behavior, and cardiometabolic outcomes\(ANDERS19979;Marcus2012;american2007aasm\)\. Understanding pediatric sleep presents unique challenges compared to adults: respiratory events are shorter and subtler, arousals and hypopneas are harder to detect, and clinical scoring criteria differ\(american2007aasm;berry2012rules\)\.
Figure 1: Parallel PHATE views for one sleep session: left colored by epoch index \(= time of night\), right by sleep stages\. Each point is the multimodal embedding of 30 seconds of sleep \(= 1 epoch\)\. The embedding model was trained without access to the epoch index information\. Still, the 2\-D diffusion map reveals a smooth, time\-ordered trajectory whose regions align with expert staging\.

Overnight polysomnography \(PSG\) provides rich multimodal recordings for clinical diagnosis and, most recently, for generative learning \(pandey2024\)\. These models have been shown to identify pediatric sleep stages and events, but they raise a deeper question: what do they actually encode? Much less attention has been paid to understanding whether their latent structure captures higher\-level information about disease burden, sleep continuity, or clinical severity\. This gap is particularly important in pediatrics, where small differences in scoring rules and the lower prevalence of events, which varies between demographic strata \(Appendix Tables [5](https://arxiv.org/html/2605.14156#A2) and [6](https://arxiv.org/html/2605.14156#A2.T6)\), can lead to substantial differences in diagnosis and treatment decisions \(Marcus2012;berry2012rules\)\.
We present a motivating example in Figure[1](https://arxiv.org/html/2605.14156#S1.F1)\. It visualizes PedSleepMAE\(pandey2024\)embeddings, which are fixed, multimodal representations learned generatively from raw pediatric PSG via a masked autoencoder\(he2022masked\)\. PedSleepMAE was trained by treating every 30 seconds of PSG as an independent sample, i\.e\. reconstructing masked signals without knowing time of night or patient identity\. Yet Figure[1](https://arxiv.org/html/2605.14156#S1.F1)reveals that the embeddings captured time\-dependent structure despite not knowing it explicitly in training\. PHATE\(Moon2019\)maps each night to a smooth, time‐ordered path whose geometry matches expert stages: lighter stages at the entrance, N3 near the center, and REM along peripheral arcs\. Across sessions we observe consistent curvature, drift, and occasional bifurcations that align with canonical sleep progressions\.
This motivates our novel research question of investigating the *session\-wide diagnostic information* contained in the *sequences of multimodal generative embeddings*\. To answer this question, we ask whether the embeddings’ \(a\) latent trajectory information, \(b\) topological shape, and \(c\) augmentation with EHR can \(i\) reflect disease burden across AHI strata and \(ii\) improve detection of apnea, hypopnea, desaturation, EEG arousal, and five sleep stages\.
Concretely, we map per\-epoch PedSleepMAE embeddings to 2\-D PHATE and treat each study as a smooth latent trajectory, yielding \(i\) *trajectory\-local* per\-epoch coordinates/derivatives and \(ii\) *trajectory\-global* movement/fragmentation summaries\. In parallel, we compute persistent homology directly on the original 7,680\-D embedding cloud and summarize $H_0$ and $H_1$ characteristics with a compact and stable panel\. Routine EHR \(age/sex and common pediatric comorbidities\) provides low\-overhead clinical context that can reduce confounding and improve generalization when fused with signal features\.
Unlike prior work that primarily treats embeddings as inputs for classification, we study their session\-wide trajectory and topological signatures\. Our contributions are:
- Trajectory analysis: We perform a session\-wide study of pediatric PSG embeddings, moving beyond per\-epoch classification and visualization to analyze latent *trajectories* that capture time\-dependent sleep dynamics\.
- Topological characterization: We apply manifold learning and persistent homology directly to high\-dimensional PedSleepMAE embeddings, yielding stable, compact signatures of sleep continuity and fragmentation\.
- Clinical fusion: We show that augmenting embeddings with geometric features and routine EHR covariates improves generalization across AHI strata and enhances sleep event detection\.
Together, these results suggest a new approach to interpreting generative models in sleep medicine\.
## 2 Background and Related Work
##### Terminology\.
A *session* is a sleep study or PSG\. An *epoch* is 30 seconds of sleep; this is a clinical term unrelated to training iterations in machine learning\. A *subject* is a patient with one or more PSGs\.
##### Physiological signals\.
Overnight PSG typically includes EEG/EOG/EMG, airflow, thoracoabdominal effort, ECG, and SpO2 at 100–500 Hz \(channels vary by site\)\. EEG/EOG/EMG support sleep staging; airflow/effort capture apneas/hypopneas; SpO2 reflects desaturation burden\. Our labels follow this physiology: staging from EEG/EOG/EMG, respiratory events from airflow/effort, desaturation from SpO2, and arousals from EEG conventions\.
##### Deep learning in pediatric sleep\.
Most studies focus on per\-epoch classification of sleep stages or respiratory events \(supratak2017deepsleepnet;phan2021xsleepnet;lee2022automatic\) with the aim of automating resource\-intensive manual labeling\. With the rise of AI, an increasing number of works explore self\-supervised learning or generative modeling \(banville2021uncovering\), including in foundation models \(pmlr\-v235\-thapa24a\) and in pediatric sleep \(pandey2024\)\. Our work analyzes the geometry and topology of embeddings generated by such models\.
##### Manifold learning in physiological signals\.
Manifold learning is widely used to visualize high\-dimensional trajectories\(Becht2019\)\. PHATE’s diffusion geometry preserves local neighborhoods while maintaining global progression and denoises noisy biological measurements, making it suitable for sleep dynamics\(kuchroo2020\)\. Prior PSG work more often models raw or time–frequency inputs with sequence architectures, e\.g\., SleepTransformer\(phan2022sleeptransformer\)\. Our approach goes beyond visualization \(e\.g\.,\(banville2021uncovering\)\) and leverages*trajectory geometry in representation space*to summarize whole\-night dynamics\.
PHATE preserved both local continuity and global progression better than UMAP in Appendix[A](https://arxiv.org/html/2605.14156#A1)Figure[A](https://arxiv.org/html/2605.14156#A1); PHATE formed a smooth temporal manifold while UMAP fragmented it into disconnected clusters\. In an ablation study \(Appendix[A](https://arxiv.org/html/2605.14156#A1)Table[A](https://arxiv.org/html/2605.14156#A1)\), swapping PHATE with UMAP reduced AUPRC on three of four binary tasks while slightly hurting sleep scoring F1\.
##### Topological data analysis \(TDA\)\.
Persistent homology offers stable vectorizations that capture multiscale loop/cluster structure for learning \(bubenik2015statistical;adams2017persistence;atienza2018stability\)\. Prior work on pediatric sleep EEG has related such structure to respiratory burden and desaturation \(sathyanarayana2025topological\)\. We extend this to latent spaces learned from multimodal PSG, adding complementary structure beyond pointwise embeddings\.
Table 1:EHR, trajectory, and TDA\-related features considered in this study\.
##### Electronic Health Records \(EHR\)\.
Combining signal representations with structured EHR via late fusion is a common and effective pattern in clinical prediction\(10\.1093;huang2020multimodal\)\. Our results align with this pattern: EHR helps for the rarest outcome \(apnea\), while trajectory and topology features contribute more for desaturation, hypopnea, and EEG arousal\.
## 3 Methods
Our methodological pipeline is designed to test whether multimodal pediatric sleep embeddings encode clinically meaningful structure\. After describing the PSG and EHR dataset, we outline the derivation of geometric and topological features, perform an AHI\-stratified feature analysis to examine their relationship with disease severity, and then evaluate their predictive ability through late\-fusion models\.
### 3\.1 PSG and EHR Data
We use pediatric overnight polysomnography \(PSG\) recordings from the Nationwide Children’s Hospital Sleep DataBank \(NCHSDB\) \(Lee2022\)\. The analysis set includes 2,522 PSGs \(2,379 unique subjects\), each identified by a \(subject ID, session ID\) pair\. Recordings are divided into consecutive 30\-second epochs\. Each epoch is represented by a 7,680\-D PedSleepMAE embedding \($120 \times 64$\) learned generatively from raw PSG channels \(pandey2024\)\. Modalities considered by PedSleepMAE include 7\-channel EEG, 2\-channel EOG, EMG, snoring, respiratory effort, airflow, oxygen saturation, and CO2 level\. Labels for sleep stage, apnea, hypopnea, desaturation, and EEG arousal align one\-to\-one with each epoch\. We use subject\-wise, stratified splits \(70/10/20% train/val/test\) per label and 5\-fold subject\-wise cross\-validation\.
Structured EHR from NCHSDB are also linked to each subject ID\. Routine EHR provides clinical context that can reduce confounding and improve generalization when fused with PSG features\. Therefore, we include a demographic and comorbidity set in our analysis \(Table[1](https://arxiv.org/html/2605.14156#S2.T1)\)\. All EHR variables are session\-level, as subjects with multiple PSGs may have different values for, e\.g\., age\.
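The subject\-wise split described in this section can be sketched as follows\. This is a minimal illustration, not the authors' script: the `session_subjects` mapping, the function name, and the seed are hypothetical, and per\-label stratification is omitted for brevity\. The point it shows is that whole subjects, not sessions or epochs, are assigned to a split, so a subject with several PSGs never appears on both sides\.

```python
import random
from collections import defaultdict


def subject_wise_split(session_subjects, fractions=(0.7, 0.1, 0.2), seed=0):
    """Assign whole subjects to train/val/test so no subject spans two splits.

    session_subjects: dict mapping session ID -> subject ID (hypothetical layout).
    Returns a dict mapping split name -> sorted list of session IDs.
    """
    subjects = sorted(set(session_subjects.values()))
    rng = random.Random(seed)
    rng.shuffle(subjects)

    n = len(subjects)
    n_train = int(round(fractions[0] * n))
    n_val = int(round(fractions[1] * n))

    # Map every subject to exactly one split based on its shuffled position.
    split_of = {}
    for i, subj in enumerate(subjects):
        if i < n_train:
            split_of[subj] = "train"
        elif i < n_train + n_val:
            split_of[subj] = "val"
        else:
            split_of[subj] = "test"

    # Sessions inherit the split of their subject.
    out = defaultdict(list)
    for session, subj in session_subjects.items():
        out[split_of[subj]].append(session)
    return {name: sorted(out[name]) for name in ("train", "val", "test")}


# Toy example: 10 subjects, the first three of which have two sessions each.
sessions = {f"s{i}_a": f"subj{i}" for i in range(10)}
sessions.update({f"s{i}_b": f"subj{i}" for i in range(3)})
splits = subject_wise_split(sessions)
```

Because the fractions apply to subjects, the session\-level proportions can deviate slightly from 70/10/20 when some subjects contribute more than one PSG\.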
##### Subgroup prevalence\.
We compute per\-epoch positive rates by age group, sex, and race on the study population and report them for all labels in Appendix [B](https://arxiv.org/html/2605.14156#A2), Tables [5](https://arxiv.org/html/2605.14156#A2) and [6](https://arxiv.org/html/2605.14156#A2.T6)\.
### 3\.2 Feature Sets
Our features mirror the ablation order: per\-epoch PedSleepMAE embeddings as baselines, EHR \(Sec\.[3\.1](https://arxiv.org/html/2605.14156#S3.SS1)\), then \(i\) PHATE\-based trajectory features and \(ii\) topological descriptors\.
##### PHATE trajectory features\.
2\-D PHATE is fit on training sessions and applied out of sample to validation/test\. We use \(a\) *trajectory\-local* per\-epoch coordinates/derivatives and \(b\) *trajectory\-global* session summaries of movement/fragmentation: mean and max inter\-epoch delta distance, mean turning angle, directional entropy of turns, tortuosity \(path\-length vs\. end\-to\-end\), and a change\-point count on the step\-length series using `ruptures` with PELT \(Pruned Exact Linear Time\) \(truong2020selective;Killick2012PELT\)\. Session\-level quantities are broadcast to all epochs of that session\.
#### PHATE trajectory quantities
Let $p_t = [p_t^x, p_t^y] \in \mathbb{R}^2$ denote the PHATE coordinates at epoch $t$, where $t$ runs from 1 to $T$ in a given session\.

$$
\begin{aligned}
\delta_t &= \lVert p_t - p_{t-1} \rVert_2 \\
\mathrm{cum}_t &= \sum_{i=2}^{t} \delta_i \\
\theta_t &= \operatorname{atan2}(p_t^y - p_{t-1}^y,\, p_t^x - p_{t-1}^x) \\
\mathrm{turn}_t &= \theta_t - \theta_{t-1} \text{ in } (-\pi, \pi] \\
\mathrm{curv}_t &= \tfrac{|\mathrm{turn}_t|}{\delta_t + \varepsilon} \\
\mathrm{dist\_start}_t &= \lVert p_t - p_1 \rVert_2 \\
\mathrm{dir\_entropy} &= -\sum_b \hat{p}_b \log \hat{p}_b \\
\mathrm{tortuosity} &= \tfrac{\sum_t \delta_t}{\lVert p_T - p_1 \rVert_2 + \varepsilon} \\
n_{\text{segments}} &= \#\{\text{PELT changepoints on } \delta_t\}
\end{aligned}
$$
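As a rough illustration of the quantities above, the following sketch computes the step lengths, wrapped turning angles, curvature, directional entropy, and tortuosity for one session using only the standard library\. Several details are assumptions rather than the authors' settings: the value of `EPS`, the number of angular bins \(`n_bins=8`\), and the use of the absolute turning angle for the mean\. The PELT change\-point count is left out because it depends on the `ruptures` configuration, which is not given in the text\.

```python
import math

EPS = 1e-8  # assumed stand-in for the epsilon in the formulas


def wrap_angle(a):
    """Wrap an angle to the interval (-pi, pi]."""
    a = math.fmod(a, 2 * math.pi)  # result lies in (-2*pi, 2*pi)
    if a > math.pi:
        a -= 2 * math.pi
    elif a <= -math.pi:
        a += 2 * math.pi
    return a


def trajectory_features(points, n_bins=8):
    """points: list of (x, y) PHATE coordinates for one session, in epoch order."""
    if len(points) < 3:
        raise ValueError("need at least 3 epochs to define a turning angle")

    steps, headings = [], []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        steps.append(math.hypot(x1 - x0, y1 - y0))     # delta_t
        headings.append(math.atan2(y1 - y0, x1 - x0))  # theta_t
    turns = [wrap_angle(b - a) for a, b in zip(headings, headings[1:])]
    # curv_t pairs each turn with the step that ends at the same epoch.
    curvature = [abs(t) / (d + EPS) for t, d in zip(turns, steps[1:])]

    # Directional entropy: Shannon entropy of turn angles in equal-width bins.
    counts = [0] * n_bins
    for t in turns:
        b = min(int((t + math.pi) / (2 * math.pi) * n_bins), n_bins - 1)
        counts[b] += 1
    total = sum(counts)
    dir_entropy = -sum(c / total * math.log(c / total) for c in counts if c)

    path_len = sum(steps)
    (xs, ys), (xe, ye) = points[0], points[-1]
    return {
        "mean_step": path_len / len(steps),
        "max_step": max(steps),
        "mean_abs_turn": sum(abs(t) for t in turns) / len(turns),
        "dir_entropy": dir_entropy,
        "tortuosity": path_len / (math.hypot(xe - xs, ye - ys) + EPS),
        "curvature": curvature,  # per-epoch (trajectory-local) values
    }


# An L-shaped path: one unit right, then one unit up.
feats = trajectory_features([(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)])
```

The scalar entries are the session\-level summaries that would be broadcast to every epoch, while `curvature` is a per\-epoch series\.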
##### Topological features\.
We extracted a wide panel of persistence\-derived statistics on the original 7,680\-D PedSleepMAE point cloud\. For stability and interpretability in the late\-fusion model, we retained six robust statistics as the TDA features: $H_0$ sum persistence, $H_0$ number of bars, $H_1$ number of bars, $H_1$ max persistence, Betti–1 $L^2$ norm, and the $H_1$/$H_0$ lifetime ratio\. These capture cluster spread/fragmentation \($H_0$\), loop prevalence/strength \($H_1$/Betti–1 energy\), and loop\-vs\-cluster balance, producing stable, fixed\-length vectors for learning \(bubenik2015statistical;adams2017persistence;atienza2018stability\)\.
To provide intuition: $H_0$ statistics measure how dispersed or fragmented the embedding cloud is, reflecting continuity of sleep trajectories\. $H_1$ descriptors capture the presence and persistence of loops in the latent space, which correspond to recurrent cycles\. Ratios such as $H_1/H_0$ summarize the balance between fragmentation and cyclic structure\.
#### Topological descriptors
Let $\mathcal{X} = \{x_i\}$ be the 7,680\-D embedding cloud; compute Vietoris–Rips persistence with $H_0$ and $H_1$ barcode lifetimes $\{\ell^{(0)}_j\}$, $\{\ell^{(1)}_k\}$\.

$$
\begin{aligned}
H_0\_\mathrm{sum\_pers} &= \sum_j \ell^{(0)}_j \\
H_0\_\mathrm{n\_bars} &= \#\{\ell^{(0)}_j > 0\} \\
H_1\_\mathrm{n\_bars} &= \#\{\ell^{(1)}_k > 0\} \\
H_1\_\mathrm{max\_pers} &= \max_k \ell^{(1)}_k \\
\text{Betti--}L^2 &= \lVert \beta_1(r) \rVert_2 \\
\mathrm{ratio\_sum}\_H_1\_H_0 &= \tfrac{\sum_k \ell^{(1)}_k}{\sum_j \ell^{(0)}_j + \varepsilon}
\end{aligned}
$$
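The six descriptors can be assembled from barcode lifetimes as in the sketch below\. It is illustrative only and makes several assumptions: the finite $H_0$ lifetimes are obtained from minimum\-spanning\-tree edge lengths \(which coincide with Vietoris–Rips $H_0$ deaths when the filtration scale is the pairwise distance\), the $H_1$ \(birth, death\) pairs are taken as given from a persistent\-homology library, and the Betti\-curve norm is integrated exactly over the bar endpoints\. The function names and the toy inputs are hypothetical\.

```python
import math


def h0_lifetimes(points):
    """Finite H0 bar lengths of a Vietoris-Rips filtration.

    Every point is born at scale 0 and a component dies when it merges with
    another, so the finite lifetimes equal the minimum-spanning-tree edge
    lengths (Kruskal's algorithm with union-find).
    """
    n = len(points)
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i in range(n) for j in range(i + 1, n)
    )
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    lifetimes = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            lifetimes.append(d)
    return lifetimes


def betti_l2_norm(intervals):
    """L2 norm of the Betti curve beta(r) = number of bars alive at scale r."""
    events = sorted([(b, 1) for b, d in intervals] + [(d, -1) for b, d in intervals])
    alive, prev, integral = 0, None, 0.0
    for r, step in events:
        if prev is not None:
            integral += alive ** 2 * (r - prev)
        alive += step
        prev = r
    return math.sqrt(integral)


def tda_panel(h0_life, h1_intervals, eps=1e-8):
    h0 = [l for l in h0_life if l > 0]
    h1 = [d - b for b, d in h1_intervals if d > b]
    return {
        "H0_sum_pers": sum(h0),
        "H0_n_bars": len(h0),
        "H1_n_bars": len(h1),
        "H1_max_pers": max(h1, default=0.0),
        "betti1_L2": betti_l2_norm(h1_intervals),
        "ratio_sum_H1_H0": sum(h1) / (sum(h0) + eps),
    }


cloud = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]  # toy point cloud
h1_bars = [(1.0, 2.0), (1.5, 2.0)]            # made-up (birth, death) pairs
panel = tda_panel(h0_lifetimes(cloud), h1_bars)
```

The all\-pairs distance computation here is quadratic in the number of epochs, so it is meant for intuition on small clouds rather than as a scalable implementation\.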
### 3\.3 AHI\-stratified feature analysis
The apnea–hypopnea index \(AHI\) is a standard measure of sleep\-disordered breathing, defined as the average number of apneas and hypopneas per hour of sleep \(american2007aasm;berry2012rules\)\. Higher AHI values indicate more severe sleep\-disordered breathing\. We grouped sessions by pediatric AHI thresholds into healthy \($<1$\), mild \(1–5\), moderate \(5–10\), and severe \($\geq 10$\), following commonly used pediatric criteria \(Marcus2012\)\.
For each session\-level candidate we ran a Kruskal–Wallis omnibus test \(kruskal1952\), Dunn post\-hoc comparisons \(dunn1964\) with Holm correction \(holm1979\), reported Cliff’s $\delta$ as an effect size \(cliff1996\), and visualized box/ECDF/KDE distributions with adjusted $q$ values\.
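A small sketch of the AHI definition and the four strata is given below\. The function names are hypothetical, and the handling of the shared boundaries \(treating each range as closed on the left, so an AHI of exactly 5 counts as moderate\) is an assumption, since the text lists the ranges without specifying which side includes the endpoint\.

```python
def ahi_from_counts(n_apnea, n_hypopnea, sleep_hours):
    """Average number of apneas and hypopneas per hour of sleep."""
    if sleep_hours <= 0:
        raise ValueError("sleep_hours must be positive")
    return (n_apnea + n_hypopnea) / sleep_hours


def ahi_stratum(ahi):
    """Map an AHI value to a pediatric severity group (left-closed bins assumed)."""
    if ahi < 0:
        raise ValueError("AHI cannot be negative")
    if ahi < 1:
        return "healthy"
    if ahi < 5:
        return "mild"
    if ahi < 10:
        return "moderate"
    return "severe"


# 6 apneas and 18 hypopneas over 8 hours of sleep give an AHI of 3.0.
example_ahi = ahi_from_counts(6, 18, 8.0)
example_group = ahi_stratum(example_ahi)
```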
Because AHI is defined per session, the screen applied only to trajectory\-global PHATE features and to TDA summaries\. EHR features were included as a pre\-specified confounder block and also subjected to the same testing for completeness, but were not used for feature selection in order to avoid label leakage and to preserve a stable set of clinical covariates across all tasks\. All association tests are observational; reported effects are correlational and do not imply causality\.
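Two pieces of the screening procedure, the Cliff's delta effect size and the Holm step\-down adjustment, are simple enough to sketch directly\. The code below is a plain standard\-library illustration with hypothetical function names; the Kruskal–Wallis and Dunn tests themselves are omitted and would normally come from a statistics package\.

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all cross-group pairs."""
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))


def holm_adjust(pvals):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is scaled by m - k + 1.
        adj = min(1.0, (m - rank) * pvals[i])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, adj)
        adjusted[i] = running_max
    return adjusted


# Toy example: a group shifted upward relative to another.
delta = cliffs_delta([3, 4, 5], [1, 2, 3])
adjusted = holm_adjust([0.01, 0.04, 0.03])
```

Cliff's delta ranges from \-1 to 1, with 0 meaning the two groups overlap completely in rank terms, which makes it a natural companion to rank\-based tests such as Kruskal–Wallis\.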
Figure 2: AHI associations for a representative metric \($\mathrm{mean}(\delta_t)$\)\. Top\-left: ECDF; top\-right: mean $\pm$ 95% CI; bottom: KDE density\. Groups: healthy \($<1$\), mild \(1–5\), moderate \(5–10\), severe \($\geq 10$\)\.

Figure 3: Precision–recall curves for the four binary tasks on the test set\. Each panel overlays all ablation models \(M0–M3\), with legend reporting AUPRC\.
### 3\.4 Diagnostic models
We benchmarked a family of late\-fusion classifiers on identical train/val/test splits\. Given the extreme class imbalance \(Table [4](https://arxiv.org/html/2605.14156#A2)\), we used class\-weighted and focal losses\. Our goal was not to maximize raw performance with deep or complex architectures, but to use deliberately simple models \(linear probes and shallow MLPs\) to isolate and interpret the incremental value of each feature family\. See Appendix [C](https://arxiv.org/html/2605.14156#A3) for implementation details\.
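For readers unfamiliar with the two loss families mentioned above, the sketch below writes out a per\-example binary focal loss and a class\-weighted cross\-entropy\. The hyperparameter values \(`gamma=2.0`, `alpha=0.25`, and the `pos_weight` in the example\) are illustrative assumptions, not the settings used in this study, which are described in the appendix\.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-12):
    """Binary focal loss for one predicted probability p and label y in {0, 1}.

    FL = -alpha_t * (1 - p_t)^gamma * log(p_t), where p_t is the probability
    assigned to the true class. Well-classified examples are down-weighted.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, eps))


def weighted_bce(p, y, pos_weight, eps=1e-12):
    """Binary cross-entropy with the positive class scaled by pos_weight."""
    if y == 1:
        return -pos_weight * math.log(max(p, eps))
    return -math.log(max(1.0 - p, eps))


# A confident correct positive contributes far less than a marginal one.
easy = focal_loss(0.9, 1)
hard = focal_loss(0.6, 1)
rare_positive = weighted_bce(0.5, 1, pos_weight=10.0)
```

With `gamma=0` the focal term disappears and the loss reduces to an alpha\-weighted cross\-entropy, which is one way to see the two losses as members of the same family\.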
##### Model variants\.
We use two baselines\. M0 \(Emb\-linear\) applies a single linear layer directly to each 7,680\-D PedSleepMAE embedding, setting a lower bound on representation quality \(pandey2024\)\. M0\.1 \(Emb\-MLP\) increases capacity by passing embeddings through a shallow encoder MLP\.
M1 \(Emb\+EHR\) uses a two\-branch late\-fusion MLP: embeddings and structured EHR features \(23\-D; Table [1](https://arxiv.org/html/2605.14156#S2.T1)\) are encoded separately and concatenated\. M1\.1 \(Emb\+PHATE\) fuses embeddings with both trajectory\-local and trajectory\-global PHATE features\. M1\.2 \(Emb\+TDA\) fuses embeddings with persistent\-homology descriptors from the embedding cloud\. M2 \(Emb\+EHR\+PHATE\) adds both PHATE branches to M1\. M2\.1 \(Emb\+EHR\+TDA\) adds TDA instead\. M3 \(Emb\+EHR\+PHATE\+TDA\) incorporates all five branches in a full late\-fusion MLP\. Ablations are ordered for interpretability and deployment\. This design allows us to isolate the incremental diagnostic value of clinical context \(EHR\), trajectory geometry \(PHATE\), and topology \(TDA\)\.
## 4 Results
We present results in two stages\. First, we perform an AHI\-stratified analysis to examine how geometric, topological, and EHR features vary across clinical severity levels, assessing whether latent structure encodes disease\-related information\. We then evaluate the predictive performance of late\-fusion models across multiple diagnostic tasks, comparing ablation models to determine the individual and complementary contributions of each feature family\.
### 4\.1 Clinical Association with AHI
Trajectory movement, topology, and routine EHR features all co\-vary with AHI\. See Appendix [D](https://arxiv.org/html/2605.14156#A4) for the full omnibus and contrast tables\. Permutation Kruskal–Wallis omnibus tests \(Table [D](https://arxiv.org/html/2605.14156#A4)\) remained significant for all six topological descriptors and multiple PHATE movement metrics\. Severe nights exhibited reduced topological richness \(fewer connected components and loops, lower Betti energies\), larger average and more variable manifold steps, and higher $H_1\_\mathrm{max\_pers}$\. Healthy nights showed the opposite pattern, with greater topological diversity and smoother, less variable trajectories\. Pairwise contrasts \(Table [D](https://arxiv.org/html/2605.14156#A4)\) reinforce these findings, showing statistically significant effect sizes in the expected directions \(e\.g\., negative $\Delta$ for topological counts in severe cases, positive $\Delta$ for movement variance\)\. The pattern indicates that the metrics are not just differentiating extremes, but capturing a graded trajectory of disease burden\.
Figure [2](https://arxiv.org/html/2605.14156#S3.F2) illustrates these trends for a representative PHATE\-derived movement metric \($\mathrm{mean}(\delta_t)$\)\. The empirical CDF shows a general rightward shift as severity increases: healthy nights cluster at lower mean steps, while severe nights accumulate probability mass later\. The mean $\pm$ 95% CI plot shows broad overlap among healthy, mild, and moderate groups, with clearer separation for the severe group\. The KDE density estimate reinforces this pattern, showing both a rightward shift and a broader distribution for severe cases, including a heavier right tail that reflects greater heterogeneity\. These views suggest that higher AHI severity is associated with larger and more variable movement steps, with the strongest distinction seen between severe and non\-severe groups, consistent with the omnibus and contrast test results\.
Beyond movement and topology, our extended screen revealed that several EHR variables also stratify by AHI\. Omnibus tests \(Table [D](https://arxiv.org/html/2605.14156#A4)\) flagged obesity, diabetes, hypertension, anxiety, male sex, and race \(Black, White\) among the strongest signals \($q < 0.005$\), with additional demographic and comorbidity features \(e\.g\., age, depression/mood disorder, GERD\) passing at $q < 0.05$\.
Contrasts \(Table [D](https://arxiv.org/html/2605.14156#A4)\) clarify these effects\. We omit the median\-difference effect \($\Delta = \mathrm{median}(\text{group}) - \mathrm{median}(\text{others})$\) in the table because values were small; nonetheless, several features show significant distributional shifts as captured by positive Cliff’s $\delta$\. Healthy cases were enriched for female sex, lower cardiometabolic burden, and White race, while severe cases showed higher prevalence of hypertension and Black race \(with White race underrepresented\)\. These associations suggest that routine demographic and comorbidity indicators carry complementary disease\-burden information alongside PSG\-derived descriptors, and reinforce their role as a stable context/confounder block for predictive modeling\.
### 4\.2 Predictive Performance
We report AUPRC as the primary discrimination metric for highly imbalanced clinical tasks\(saito2015precision\), complemented by balanced accuracy, macro–F1, and ROC–AUC \(Tables in Appendix[E](https://arxiv.org/html/2605.14156#A5)\)\. To assess calibration, which is critical for clinical interpretability, we additionally report the Brier score\(glenn1950verification\)and Expected Calibration Error \(ECE\)\(naeini2015obtaining\)\. Lower Brier/ECE values indicate better\-calibrated probability estimates\. Full cross\-validated metrics are summarized in Appendix[F](https://arxiv.org/html/2605.14156#A6): Table[11](https://arxiv.org/html/2605.14156#A6.T11)\(staging\), Table[12](https://arxiv.org/html/2605.14156#A6.T12)\(desaturation\), Table[13](https://arxiv.org/html/2605.14156#A6.T13)\(EEG arousal\), Table[14](https://arxiv.org/html/2605.14156#A6.T14)\(apnea\), and Table[15](https://arxiv.org/html/2605.14156#A6.T15)\(hypopnea\)\. Figure[3](https://arxiv.org/html/2605.14156#S3.F3)overlays PR curves for the binary tasks, illustrating clear separation from the linear\-probe baseline once contextual branches are added\.
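The three headline metrics can be written out in a few lines, which may help when reading the tables\. The sketch below is an assumption\-laden illustration: ECE is computed with equal\-width bins on the predicted positive\-class probability \(`n_bins=10` is an assumed default\), AUPRC is approximated by step\-wise average precision without special handling of tied scores, and the function names are hypothetical\. The exact variants used in this study may differ\.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted mean gap between observed positive rate and mean confidence."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for b in bins:
        if b:
            conf = sum(p for p, _ in b) / len(b)
            rate = sum(y for _, y in b) / len(b)
            ece += len(b) / n * abs(rate - conf)
    return ece


def average_precision(probs, labels):
    """Step-wise area under the precision-recall curve (ties not merged)."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    n_pos = sum(labels)
    tp, ap = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            tp += 1
            ap += tp / rank  # precision at each recall step
    return ap / n_pos


probs = [0.9, 0.8, 0.3, 0.1]
labels = [1, 0, 1, 0]
brier = brier_score(probs, labels)
ece = expected_calibration_error(probs, labels)
ap = average_precision(probs, labels)
```

For a rare label, a random ranking yields an average precision near the positive prevalence, which is why AUPRC values that look small in absolute terms can still represent large relative gains\.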
##### Sleep staging \(5\-class\)\.
Performance improves steadily across the ablation ladder: M0 \(linear baseline\) reaches F1 0\.655±0\.001; M0\.1 \(MLP baseline\) improves to 0\.664±0\.004\. Adding context gives consistent lifts, with M2 \(Emb\+EHR\+PHATE\) at 0\.679±0\.003 and the full M3 at 0\.678±0\.004\. Balanced accuracy follows the same trend \(M3: 0\.762±0\.004\)\.
##### Desaturation\.
AUPRC increases from 0\.257±0\.022 \(M0\) to 0\.343±0\.028 with M1\.2 \(Emb\+TDA\)\. The full M3 remains competitive \(0\.342±0\.022\) and yields the best calibration \(Brier 0\.138, ECE 0\.183\)\. Improvements are mainly driven by the addition of PHATE and topological features, which agrees with Sec\.[4\.1](https://arxiv.org/html/2605.14156#S4.SS1)where variable sleep trajectories differentiated AHI strata\. The lower Brier/ECE values indicate not only better discrimination but also more reliable probability estimates for clinical interpretation\.
##### EEG arousal\.
Trajectory features dominate this task\. AUPRC rises from 0\.309±0\.013 \(M0\) to 0\.476±0\.015 \(M1\.1\), and the full M3 performs best overall at 0\.478±0\.014 with the strongest calibration \(Brier 0\.084, ECE 0\.130\)\. These results reinforce the role of trajectory features, consistent with their AHI monotone shifts\.
##### Apnea\.
The rarest label benefits the most from the clinical context\. AUPRC increases from 0\.043±0\.013 \(M0\) to 0\.142±0\.029 with M2 \(Emb\+EHR\+PHATE\)\. Although M3’s discrimination is comparable \(0\.136±0\.036\), it exhibits the lowest Brier \(0\.014\) and ECE \(0\.016\)\. These gains arise mainly from demographic and comorbidity features that encode baseline disease burden \(hypertension, obesity, and race\) which in Sec\.[4\.1](https://arxiv.org/html/2605.14156#S4.SS1)were significantly associated with AHI severity\. These patterns suggest the importance of routine EHR context when predicting rare but clinically consequential events\.
##### Hypopnea\.
Signals are complementary: AUPRC moves from 0\.088±0\.011 \(M0\) to 0\.219±0\.020 with M2\.1 \(Emb\+EHR\+TDA\)\. The full M3 provides the best calibration \(Brier 0\.067, ECE 0\.103\) while remaining competitive in AUPRC \(0\.186±0\.022\)\. Improvements here reflect the joint effect of topological and clinical context: topology contributing sensitivity to subtle respiratory variability, and EHR stabilizing predictions across subjects with differing baseline risk\. The trend aligns with Sec\.[4\.1](https://arxiv.org/html/2605.14156#S4.SS1), where both TDA and EHR features showed graded associations with AHI\.
## 5 Discussion
Our results support our hypothesis that latent geometry and topology encode physiologically meaningful structure in pediatric sleep\. Importantly, our results should not be interpreted as incremental gains in classification accuracy\. Rather, they show that generative embeddings contain rich session\-wide structure: by analyzing trajectories and topology, we uncover interpretable signatures of sleep fragmentation and disease severity that complement conventional machine learning models\. The present approach provides one interpretive tool for inspecting learned embedding spaces and generating research hypotheses\. It is therefore not a clinical decision system and requires further study on external datasets and workflow integration\.
##### No single modality dominates\.
Across predictive tasks, no single modality dominates\. EHR provides the most value for the rarest and most clinically anchored label \(apnea, best AUPRC with Emb\+EHR\+PHATE\), PHATE trajectory features are most impactful for EEG arousals, and TDA is strongest for desaturations\. Hypopnea benefits from combining EHR with TDA \(best AUPRC with Emb\+EHR\+TDA\)\. Notably, the full fusion model \(M3\) consistently delivers the best calibration \(lowest Brier/ECE\) across all four binary tasks, even when another ablation narrowly wins AUPRC\.
##### Why not apply PHATE/TDA directly to raw signals?
A natural question is why our analyses focus on learned embeddings rather than the original PSG waveforms\.
Raw epochs are extremely high\-dimensional: a single 30\-second segment spans tens of thousands of samples across 16 multimodal channels with distinct noise characteristics\. In this space, computation is prohibitive and distance metrics are dominated by artifacts rather than physiology, making the geometry unreliable for manifold learning or TDA\.
PedSleepMAE embeddings instead provide compact, multimodal representations trained to reconstruct missing segments, which enforces denoising and cross\-channel learning\. Distances in this latent space reflect meaningful similarity and yield neighborhoods that are stable, smooth, and well\-suited for our analysis\.
Moreover, analysis in the embedding space aligns with what many downstream classifiers “see,” providing a way to peer into the model’s feature space\. While our study focuses on PedSleepMAE, the same approach can be applied to embeddings from any model, making our framework a general tool for interpreting how machine learning models re\-structure complex sleep signals\.
##### Limitations and future work\.
Our study has several limitations: \(i\) PHATE and persistent homology depend on design choices \(distance, scale, filtration\), which may affect effect sizes; \(ii\) we use simple linear/shallow MLP models, aiding interpretability but likely underestimating peak performance; and \(iii\) AHI\-stratified analyses are associative and should not be interpreted causally\.
Future work will explore end\-to\-end integration of manifold and topological structure, and test generalization across datasets and embedding models\. We will extend the late\-fusion MLP to a*dynamic fusion*architecture that assigns context\-dependent weights to branches \(embeddings, PHATE, TDA, EHR\), which allows the model to adaptively route information across modalities and could improve both calibration and robustness\.
## 6 Conclusion
We investigated the diagnostic information contained in the sequences of per\-epoch PedSleepMAE embeddings, and presented a late\-fusion pipeline that augments the embeddings with PHATE\-based temporal descriptors, persistent\-homology summaries of latent geometry, and structured EHR context\. On 2\.5k\+ pediatric sleep studies, all three feature families showed AHI\-stratified associations, with topology and movement metrics tracking disease severity and several EHR variables stratifying independently\.
In predictive tasks, adding context consistently improved calibration and interpretability over embedding\-only baselines, with different modalities contributing in a task\-specific manner\. The full late\-fusion model \(M3\) achieved the best calibration \(lowest Brier/ECE\) across all four binary tasks and the top AUPRC for EEG arousal \(0\.478\), while desaturation peaked with Emb\+TDA \(M1\.2; 0\.343\) and hypopnea with Emb\+EHR\+TDA \(M2\.1; 0\.219\); apnea favored Emb\+EHR\+PHATE \(M2; 0\.142\)\.
Together, these findings suggest that latent trajectory, topology, and clinical context capture complementary dimensions of pediatric sleep\. More broadly, our framework provides a lens into the structure of generative embeddings, offering one way to interpret what multimodal sleep models encode, rather than merely how well they classify\.
## Acknowledgements
The authors thank Saurav Raj Pandey for his help with the dataset and the Longleaf cluster\. Additionally, the first author thanks Baiming Zou for early guidance on the paper’s direction, and the second author thanks Jeremy Purvis for suggesting PHATE\.
## References
## Appendix A UMAP vs\. PHATE
Figure 4: Comparison of PHATE and UMAP embeddings for the same held\-out PSG, colored by epoch index\. PHATE reveals a smooth, time\-continuous trajectory that preserves global temporal structure, whereas UMAP fragments the data into local clusters with weaker overall ordering\.
Table 2: Comparison of PHATE \(M1\.1\) and UMAP on a single held\-out session\-wise split\. Each manifold was fit on the training set and used to transform validation/test sessions\.
## Appendix B Prevalence
Table [B](https://arxiv.org/html/2605.14156#A2) summarizes class distributions across train/validation/test splits\. Subgroup prevalence is included in Table [B](https://arxiv.org/html/2605.14156#A2) to further highlight age and demographic heterogeneity\.
Table 4: Class distributions by split\. Sleep staging shows % for \{W, REM, N1, N2, N3\}; binary tasks list positive prevalence \(%\)\.
Table 5: Subgroup prevalence and subgroup share of sessions for Apnea and Desaturation tasks\.
Table 6: Subgroup prevalence and subgroup share of sessions for EEG Arousal and Hypopnea tasks\.
## Appendix C Implementation Details
##### Branch encoders\.
Each modality is encoded separately by a shallow block: a linear layer mapping the raw input dimension to 128 units, followed by layer normalization and ReLU activation:
$$z^{(k)}=\mathrm{ReLU}\!\Big(\mathrm{LN}\!\big(W^{(k)}x^{(k)}+b^{(k)}\big)\Big),\qquad z^{(k)}\in\mathbb{R}^{128}.$$
The five possible inputs are: embeddings \(7,680\-D\), EHR \(23\-D\), PHATE\-point \(6\-D\), PHATE\-time \(6\-D\), and TDA \(6\-D\)\. Each branch yields a 128\-D latent representation\.
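As an illustration of the branch encoder above, the following is a minimal pure\-Python sketch of one Linear→LayerNorm→ReLU block\. It is not the authors' implementation: the helper names `make_branch` and `encode`, the random initialisation, and the use of a layer normalization without learnable gain and shift are all assumptions made for brevity\.

```python
import math
import random


def make_branch(in_dim, out_dim=128, seed=0):
    """Randomly initialise one branch: W is out_dim x in_dim, b has out_dim zeros.

    The initialisation scheme is an assumption; the paper does not specify it.
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(in_dim)
    W = [[rng.uniform(-scale, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    b = [0.0] * out_dim
    return W, b


def encode(x, W, b, eps=1e-5):
    """Compute z = ReLU(LN(W x + b)) for a single input vector x."""
    # Linear layer: one dot product per output unit.
    h = [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]
    # Layer normalization over the output units (no learnable gain/shift here).
    mean = sum(h) / len(h)
    var = sum((v - mean) ** 2 for v in h) / len(h)
    normed = [(v - mean) / math.sqrt(var + eps) for v in h]
    # ReLU activation.
    return [max(0.0, v) for v in normed]


# Toy 6-D input, matching the stated width of the PHATE-point branch.
W, b = make_branch(in_dim=6)
z = encode([0.3, -1.2, 0.5, 0.0, 2.1, -0.7], W, b)
print(len(z))
```

The same function applies unchanged to the other branches by setting `in_dim` to 7,680, 23, or 6; only the input width differs, while every branch emits a 128\-D latent\.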
##### Fusion and classifier head\.
Encoded features are concatenated into a single latent vector\. Depending on branch inventory, this vector ranges from 128\-D \(M0\.1\) to 640\-D \(M3\)\. It is passed through a two\-layer classifier head: Linear→256 units, ReLU, Dropout\(0\.30\), and Linear→$K$ logits\. Thus each model differs only in which branches are included; the head is otherwise matched across ablations\.
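The width arithmetic above can be checked with a short sketch\. The branch labels and the helper names `fused_dim` and `head_param_count` are hypothetical, and the parameter count is a derived illustration that assumes both linear layers carry biases; it is not a figure reported in the paper\.

```python
BRANCH_DIM = 128   # per-branch latent width stated in the text
HEAD_HIDDEN = 256  # hidden width of the classifier head stated in the text


def fused_dim(branches):
    """Width of the concatenated latent vector for a given branch inventory."""
    return BRANCH_DIM * len(branches)


def head_param_count(branches, num_classes):
    """Weights plus biases of Linear(d -> 256) followed by Linear(256 -> K)."""
    d = fused_dim(branches)
    first = d * HEAD_HIDDEN + HEAD_HIDDEN
    second = HEAD_HIDDEN * num_classes + num_classes
    return first + second


# Hypothetical labels for the five branches listed in the text.
all_branches = ["embeddings", "ehr", "phate_point", "phate_time", "tda"]

print(fused_dim(["embeddings"]))   # single-branch model
print(fused_dim(all_branches))     # full five-branch model
print(head_param_count(all_branches, num_classes=5))  # five sleep stages
```

Only the input width of the first head layer changes with the branch inventory, which is why the ablations remain comparable\.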
##### Training protocol\.
We used subject\-wise splits to avoid data leakage across sessions from the same participant\. All models were trained with AdamW \(learning rate $10^{-3}$, weight decay $10^{-5}$\), batch size 256, and automatic mixed precision\. Learning rate decay was controlled by a ReduceLROnPlateau scheduler \(factor 0\.5, patience 3\), with early stopping after 8 non\-improving validations\. We performed 5\-fold subject\-wise cross\-validation using the same training recipe and report mean ± SD across folds\. Binary decision thresholds were chosen by maximizing validation F1; for multiclass sleep staging we report macro–F1\.
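The threshold\-selection step can be sketched as follows\. This is one plausible reading, assuming the candidate thresholds are the distinct predicted probabilities on the validation set and that a sample is called positive when its probability is at least the threshold; the paper does not state the search grid, and the function names and toy data are hypothetical\.

```python
def f1_at_threshold(probs, labels, thr):
    """F1 of the positive class when predicting 1 for every prob >= thr."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= thr and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= thr and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < thr and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_f1_threshold(probs, labels):
    """Return the candidate threshold with the highest validation F1."""
    candidates = sorted(set(probs))
    return max(candidates, key=lambda t: f1_at_threshold(probs, labels, t))


# Toy validation predictions (illustrative values only).
val_probs = [0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.80]
val_labels = [0, 0, 0, 1, 0, 1, 1]

thr = best_f1_threshold(val_probs, val_labels)
print(thr, f1_at_threshold(val_probs, val_labels, thr))
```

The selected threshold would then be frozen and applied to the test set, so that test labels never influence the decision rule\.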
##### Normalization\.
All features were standardized with train\-only statistics\. For stability, values were clipped to $[-8, 8]$\. Epoch\-level features \(embeddings, PHATE\-point\) were normalized across all epochs, while session\-level vectors \(EHR, PHATE\-time, TDA\) were normalized across sessions and broadcast to all epochs\. Non\-finite values were set to zero before normalization\.
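A minimal sketch of this normalization for a single feature column follows\. It assumes that clipping is applied after standardization and that a zero standard deviation is replaced by 1 to avoid division by zero; neither detail is stated in the text, and the function names are hypothetical\.

```python
import math


def clean(column):
    """Replace NaN and +/-inf with zero before any statistics are computed."""
    return [v if math.isfinite(v) else 0.0 for v in column]


def fit_stats(train_column):
    """Mean and standard deviation estimated from the training split only."""
    vals = clean(train_column)
    mean = sum(vals) / len(vals)
    var = sum((v - mean) ** 2 for v in vals) / len(vals)
    std = math.sqrt(var)
    return mean, (std if std > 0 else 1.0)  # guard against constant features


def transform(column, mean, std, clip=8.0):
    """Standardize with train statistics, then clip to [-clip, clip]."""
    return [max(-clip, min(clip, (v - mean) / std)) for v in clean(column)]


train = [1.0, 2.0, 3.0, float("nan"), 4.0]
test = [2.0, float("inf"), 1000.0]

mean, std = fit_stats(train)
out = transform(test, mean, std)
print(out)
```

Fitting `mean` and `std` on the training split and reusing them for validation and test data is what keeps the statistics train\-only\.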
##### Loss functions and imbalance handling\.
To handle severe class imbalance, we applied class\-weighted losses with $w_k \propto 1/\text{freq}_k$\. Binary tasks used focal cross\-entropy with focusing parameter $\gamma=1.5$ \(lin2017focal\), while sleep staging used weighted cross\-entropy\.
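The class\-weighted focal loss can be illustrated per sample as $-w_y\,(1-p_t)^{\gamma}\log p_t$, where $p_t$ is the probability assigned to the true class\. The sketch below assumes the weights are scaled as $N/(K\,n_k)$, which is one common way to realise $w_k \propto 1/\text{freq}_k$; the paper's exact scaling is not given, and the function names are hypothetical\.

```python
import math


def inverse_frequency_weights(labels, num_classes=2):
    """w_k = N / (K * n_k); classes absent from the data get weight 0."""
    n = len(labels)
    counts = [sum(1 for y in labels if y == k) for k in range(num_classes)]
    return [n / (num_classes * c) if c else 0.0 for c in counts]


def focal_loss(p_pos, y, weights, gamma=1.5, eps=1e-12):
    """Weighted binary focal loss for one sample with positive-class prob p_pos."""
    p_t = p_pos if y == 1 else 1.0 - p_pos
    p_t = min(max(p_t, eps), 1.0)  # avoid log(0)
    return -weights[y] * (1.0 - p_t) ** gamma * math.log(p_t)


# Toy labels with 10% positives.
labels = [0] * 9 + [1]
weights = inverse_frequency_weights(labels)

easy = focal_loss(0.9, 1, weights)  # confident and correct: strongly down-weighted
hard = focal_loss(0.1, 1, weights)  # confident and wrong: dominates the loss
print(weights, easy, hard)
```

Setting `gamma=0.0` recovers ordinary weighted cross\-entropy, the form used for sleep staging\.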
##### Architecture schematic\.
For reference, Figure [6](https://arxiv.org/html/2605.14156#A3.F6) \(in the Appendix\) visualizes the full M3 late\-fusion MLP\.
Figure 6: Late\-fusion MLP \(M3\)\. Each branch input is encoded with Linear→LayerNorm→ReLU \(128\-D\)\. Latents are concatenated \(5×128=640\-D\) and passed through a classifier head: Linear 640→256, ReLU, Dropout\(0\.30\), and Linear 256→K\. Per\-epoch branches are Embeddings and PHATE\-point; session\-level branches are EHR, PHATE\-time, and TDA \(broadcast across epochs\)\.
## Appendix D AHI Tests
Table 7: Permutation Kruskal–Wallis omnibus tests across AHI groups for pre\-specified session\-level descriptors \(Table [1](https://arxiv.org/html/2605.14156#S2.T1)\) \(significant results only; $q<0.05$\)\. Larger $H$ \(with small $q$ after Holm correction\) indicates stronger distributional differences across AHI strata\.
Table 8: Permutation Mann–Whitney contrasts for AHI extremes on pre\-specified descriptors \(significant results only; $q<0.05$\)\. Effect is median\(group\) − median\(others\); $\delta$ is Cliff’s delta\.
Table 9: Permutation Kruskal–Wallis omnibus tests across AHI groups for all 23 EHR features\. Larger $H$ \(with small $q$ after Holm correction\) indicates stronger distributional differences across AHI strata\.
Table 10: Permutation Mann–Whitney contrasts for AHI extremes on EHR features \(significant results only; $q<0.05$\)\. Effects \($\Delta$ = median\(group\) − median\(others\)\) are small and omitted for brevity; $\delta$ is Cliff’s delta\.
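For readers unfamiliar with the two quantities named in these captions, the sketch below computes Cliff's delta and Holm\-adjusted p\-values using their standard definitions\. It does not reproduce the permutation tests themselves, the function names are hypothetical, and the toy values are illustrative rather than taken from the study\.

```python
def cliffs_delta(xs, ys):
    """(#pairs with x > y minus #pairs with x < y) / (len(xs) * len(ys))."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


def holm_adjust(pvals):
    """Holm step-down adjustment, returned in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * pvals[idx])
        # Enforce monotonicity of the adjusted values.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Toy descriptor values for one AHI group versus all other sessions.
group = [5, 6, 7, 8]
others = [1, 2, 3, 6]

delta = cliffs_delta(group, others)
adj = holm_adjust([0.01, 0.04, 0.03])
print(delta, adj)
```

Cliff's delta ranges from −1 to 1 and is insensitive to the scale of the descriptor, which makes it a natural companion to rank\-based tests such as Mann–Whitney\.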
## Appendix E Additional Predictive Results
Figure 7: ROC curves for the four binary tasks on the test set\. Each panel overlays all ablation models, with legend reporting ROC–AUC\. These curves complement the precision–recall plots in Figure [3](https://arxiv.org/html/2605.14156#S3.F3) and illustrate consistent improvements beyond the linear probe baseline once contextual branches are added\.
## Appendix F Full Test\-Set Metrics
Reading guide\. Below are the full test\-set results with 5\-fold subject\-wise cross\-validation\. Table [11](https://arxiv.org/html/2605.14156#A6.T11) reports multiclass sleep staging results \(balanced accuracy and macro–F1 as mean ± SD\)\. Tables [12](https://arxiv.org/html/2605.14156#A6.T12)–[15](https://arxiv.org/html/2605.14156#A6.T15) list Accuracy, F1, ROC–AUC, AUPRC \(reported as mean ± SD\), and the calibration metrics Brier score and ECE for the four binary tasks, following the same ablation order\. Lower Brier and ECE values indicate better\-calibrated probability estimates, which are especially important for clinical interpretability\.
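The two calibration metrics can be sketched as follows\. The Brier score is the mean squared difference between predicted probability and label\. For ECE, this sketch assumes 10 equal\-width bins over the positive\-class probability; the paper's binning scheme is not stated in this section, so the bin count and the function names are assumptions\.

```python
def brier_score(probs, labels):
    """Mean squared error between predicted probabilities and 0/1 labels."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted mean gap between mean predicted prob and observed positive rate."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        conf = sum(p for p, _ in members) / len(members)
        freq = sum(y for _, y in members) / len(members)
        ece += len(members) / n * abs(freq - conf)
    return ece


# Toy predictions: well calibrated at 0.1, overconfident at 0.9.
probs = [0.1, 0.1, 0.9, 0.9]
labels = [0, 0, 1, 0]

bs = brier_score(probs, labels)
ece = expected_calibration_error(probs, labels)
print(bs, ece)
```

Both metrics are computed on probabilities rather than thresholded decisions, so they can improve even when Accuracy or F1 does not\.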
Table 11: Multiclass sleep staging\.
Table 12: Desaturation\.
Table 13: EEG arousal\.
Table 14: Apnea\.
Table 15: Hypopnea\.