Provably Learning Diffusion Models under the Manifold Hypothesis: Collapse and Refine

arXiv cs.LG 05/21/26, 04:00 AM Papers
Summary
This paper identifies a collapse-and-refine mechanism in diffusion models under the manifold hypothesis, proposing Score-induced Latent Diffusion (SiLD) that provably avoids the curse of dimensionality. Experiments show SiLD matches or outperforms VAE-based latent diffusion models.
arXiv:2605.20235v1 Announce Type: new
Cached at: 05/21/26, 06:20 AM
# Provably Learning Diffusion Models under the Manifold Hypothesis: Collapse and Refine
Source: [https://arxiv.org/html/2605.20235](https://arxiv.org/html/2605.20235)
Wei Huang, RIKEN AIP & The Institute of Statistical Mathematics, wei.huang.vr@riken.jp; Andi Han, University of Sydney, andi.han@sydney.edu.au; Mingyuan Bai, Agency for Science, Technology and Research & The Institute of Statistical Mathematics, Bai_Mingyuan_from.Riken@a-star.edu.sg; Huanjian Zhou, The University of Tokyo, zhou-huanjian185@g.ecc.u-tokyo.ac.jp; Qixin Zhang, Nanyang Technological University, qixin.zhang@ntu.edu.sg; Taiji Suzuki, The University of Tokyo & RIKEN AIP, taiji@mist.i.u-tokyo.ac.jp; Kenji Fukumizu, The Institute of Statistical Mathematics, fukumizu@ism.ac.jp

###### Abstract

Diffusion models generate high-dimensional data with remarkable quality, yet how their training efficiently learns the score function, bypassing the curse of dimensionality when data is supported on low-dimensional manifolds, remains theoretically unexplained. We identify a collapse-and-refine mechanism driven by the geometry of the score function itself: at small noise scales, the diverging singularity of the score drives a rapid dimensional collapse of the induced denoising map onto the data manifold projection; at moderate noise scales, training refines the intrinsic density on the learned manifold. We instantiate this principle as Score-induced Latent Diffusion (SiLD), a two-stage framework in which both manifold learning and density estimation emerge from a single denoising score matching objective, replacing the heuristic KL regularization of VAE-based latent diffusion models. We prove that the resulting sample complexity depends on the intrinsic dimension rather than the ambient dimension. Experiments on Stacked MNIST, CelebA variants, and molecular generation benchmarks show that SiLD matches or outperforms VAE-based LDMs in generation quality and consistently improves reconstruction, validating our theoretical predictions.

## 1 Introduction

Diffusion models have emerged as a dominant paradigm for generative modeling, demonstrating remarkable capability in synthesizing high-fidelity samples from complex, high-dimensional data distributions Sohl-Dickstein et al. ([2015](https://arxiv.org/html/2605.20235#bib.bib53)); Ho et al. ([2020](https://arxiv.org/html/2605.20235#bib.bib6)); Song and Ermon ([2019](https://arxiv.org/html/2605.20235#bib.bib1)); Song et al. ([2020](https://arxiv.org/html/2605.20235#bib.bib54)). The connection between these models and score matching, particularly through the lens of denoising autoencoders, has been firmly established Vincent ([2011](https://arxiv.org/html/2605.20235#bib.bib3)); Ho et al. ([2020](https://arxiv.org/html/2605.20235#bib.bib6)). Despite their empirical success, the theoretical foundations enabling efficient learning in high-dimensional spaces remain a subject of intense inquiry. A central puzzle lies in the “curse of dimensionality”: theoretically, learning a probability distribution in an ambient space of dimension $d$ typically requires a sample complexity exponential in $d$ Wainwright ([2019](https://arxiv.org/html/2605.20235#bib.bib55)); Biroli et al. ([2024](https://arxiv.org/html/2605.20235#bib.bib97)). The prevailing resolution to this paradox is the manifold hypothesis, which posits that real-world data, while embedded in a high-dimensional ambient space, (approximately) resides on a low-dimensional manifold of intrinsic dimension $k \ll d$ Fefferman et al. ([2016](https://arxiv.org/html/2605.20235#bib.bib56)); Loaiza-Ganem et al. ([2024](https://arxiv.org/html/2605.20235#bib.bib48)).

Recent theoretical works have leveraged this low-dimensional structure to establish improved bounds on statistical estimation and sampling complexity De Bortoli ([2022](https://arxiv.org/html/2605.20235#bib.bib47)); Chen et al. ([2023](https://arxiv.org/html/2605.20235#bib.bib7)); Oko et al. ([2023](https://arxiv.org/html/2605.20235#bib.bib8)); Li et al. ([2026](https://arxiv.org/html/2605.20235#bib.bib46)); Tang and Yang ([2024](https://arxiv.org/html/2605.20235#bib.bib57)); Potaptchik et al. ([2024](https://arxiv.org/html/2605.20235#bib.bib58)); Azangulov et al. ([2024](https://arxiv.org/html/2605.20235#bib.bib59)), proving that sample complexity depends on the intrinsic dimension $k$ rather than $d$ when the score is well-approximated. Notably, Li and Yan ([2024](https://arxiv.org/html/2605.20235#bib.bib60)) and Huang et al. ([2026](https://arxiv.org/html/2605.20235#bib.bib61)) showed that the DDPM sampler automatically adapts to unknown low-dimensional structure, achieving iteration complexity scaling nearly linearly in $k$ without any prior knowledge of the manifold. From a structural perspective, Pidstrigach ([2022](https://arxiv.org/html/2605.20235#bib.bib63)) and Stanczuk et al. ([2024](https://arxiv.org/html/2605.20235#bib.bib64)) proved that trained diffusion models detect and encode the data manifold by approximating its normal bundle, while Farghly et al. ([2025](https://arxiv.org/html/2605.20235#bib.bib65)) showed that score smoothing implicitly regularizes toward manifold-adaptive solutions. More recently, Boffi et al. ([2024](https://arxiv.org/html/2605.20235#bib.bib42)); Gao and Li ([2024](https://arxiv.org/html/2605.20235#bib.bib40)); Kumar et al. ([2026](https://arxiv.org/html/2605.20235#bib.bib62)) investigated how diffusion and flow matching models adapt to low-dimensional structures.

However, these analyses predominantly focus on the properties of the converged score estimator, treating the optimization process abstractly or relying on specific architectural assumptions such as single-layer networks Boffi et al. ([2024](https://arxiv.org/html/2605.20235#bib.bib42)). Recent works have begun to probe the training dynamics of diffusion models: Shah et al. ([2023](https://arxiv.org/html/2605.20235#bib.bib66)) provided the first provably efficient result linking gradient descent on the DDPM objective to recovering mixture model parameters, and Wang et al. ([2025](https://arxiv.org/html/2605.20235#bib.bib67)) proved that optimizing the diffusion training loss with a low-rank parameterization is equivalent to a subspace clustering problem. Yet these results are confined to restricted model classes and do not characterize the fine-grained weight evolution of general deep networks. From the broader neural network optimization perspective, mean-field theory Mei et al. ([2018](https://arxiv.org/html/2605.20235#bib.bib68)); Chizat ([2022](https://arxiv.org/html/2605.20235#bib.bib69)); Suzuki et al. ([2023](https://arxiv.org/html/2605.20235#bib.bib71)) provides a powerful framework for analyzing two-layer network training dynamics, and feature learning theory Damian et al. ([2022](https://arxiv.org/html/2605.20235#bib.bib43)); Mousavi-Hosseini et al. ([2022](https://arxiv.org/html/2605.20235#bib.bib72)); Abbe et al. ([2023](https://arxiv.org/html/2605.20235#bib.bib73)) has shown that gradient-based training discovers low-dimensional relevant subspaces through saddle-to-saddle dynamics. How these mechanisms manifest specifically in the score matching setting, and whether the optimization process itself exploits the low-dimensional geometric structure of the data, remains largely unexplored. This raises a fundamental question that current theory cannot answer:

*How does a neural network, initialized isotropically, adaptively discover the low-dimensional data support and efficiently learn the intrinsic distribution density amidst high-dimensional noise?*

We answer this question by proposing Score-induced Latent Diffusion (SiLD), a theoretically grounded framework that characterizes the gradient descent dynamics of score matching on low-dimensional manifolds. The key insight is that the singularity of the score function at small noise levels naturally induces a *two-stage* learning mechanism: the network first discovers the geometry of the data manifold, and then refines the intrinsic probability density within it. This mechanism in turn suggests a two-stage training strategy: first learn the manifold with low-noise score matching, then learn the density on the manifold. Both stages are trained under a single DDPM objective, without any auxiliary losses or heuristic regularization; the latent representation is induced by the score function itself rather than imposed by a separate encoder. Our main contributions are as follows:

- Convergence Guarantees. We prove quantitative convergence rates for both stages. In Stage 1, a mean-field gradient flow analysis shows that the geometric alignment risk decays exponentially fast. In Stage 2, we establish generalization bounds via Random Feature regression on the low-dimensional manifold, proving that the excess risk depends polynomially on the intrinsic dimension $k$ and the sample size $n$, independent of the ambient dimension $d$.
- End-to-End Sample Complexity. We establish an end-to-end sampling guarantee. The manifold-regime contribution achieves a Wasserstein-2 rate depending only on the intrinsic dimension. The high-noise contribution, handled by an auxiliary random-feature head, contributes a polynomial-in-$d$ term that is exponentially damped by the integration time. Together, this avoids the curse of ambient dimensionality.
- Empirical Validation. We validate our theoretical predictions on Stacked MNIST, CelebA, and molecular generation benchmarks, demonstrating that SiLD matches or outperforms VAE-based latent diffusion models Rombach et al. ([2022](https://arxiv.org/html/2605.20235#bib.bib74)) in generation quality, confirming that the score matching objective alone is sufficient to drive both manifold learning and density estimation.

## 2 Related Work

Statistical Theory of Diffusion on Manifolds. Beyond the convergence and adaptivity results discussed in the introduction, several works have examined more fine-grained aspects of diffusion models on manifolds. Benton et al. ([2023](https://arxiv.org/html/2605.20235#bib.bib79)) proved nearly $d$-linear convergence bounds for diffusion models via stochastic localization, establishing that the number of reverse steps scales nearly linearly in the ambient dimension $d$. Most recently, Chakraborty et al. ([2026](https://arxiv.org/html/2605.20235#bib.bib52)) introduced the $(p,q)$-Wasserstein dimension and proved the first Wasserstein-$p$ convergence guarantee for diffusion models under a finite-moment condition only, without compact support, manifold, or smooth density assumptions, achieving the sharpest known rates to date. Chandramoorthy and de Clercq ([2025](https://arxiv.org/html/2605.20235#bib.bib44)) showed that even with inexact score estimates, generated samples tend to drift along rather than away from the manifold, while Fukumizu et al. ([2026](https://arxiv.org/html/2605.20235#bib.bib75)) proved that OT-CFM dynamics on manifold-supported targets contract exponentially in normal directions and remain neutral along tangential directions. Liu et al. ([2025](https://arxiv.org/html/2605.20235#bib.bib49)) identified the score singularity in the normal direction as a hindrance to sampling accuracy and proposed methods to mitigate it. From an analytical perspective, George and Macris ([2026](https://arxiv.org/html/2605.20235#bib.bib96)) derived asymptotically exact learning curves for denoising score matching with random feature networks on manifold data, confirming that sample complexity scales linearly with the intrinsic dimension for linear manifolds while showing that this benefit diminishes for nonlinear manifolds, a limitation that our two-stage decoupling is designed to address. Our work complements this body of literature by shifting focus from the statistical convergence properties to the optimization dynamics.

Training Dynamics of Generative Models. The theoretical study of how neural networks learn during diffusion training remains substantially less developed than their statistical theory. Han et al. ([2024b](https://arxiv.org/html/2605.20235#bib.bib80)) leveraged the Neural Tangent Kernel (NTK) Jacot et al. ([2018](https://arxiv.org/html/2605.20235#bib.bib81)) to establish the first generalization bounds for gradient-descent-trained networks on the score matching objective, though NTK analyses cannot capture the feature learning dynamics that arise in practice. Shah et al. ([2023](https://arxiv.org/html/2605.20235#bib.bib66)) and Wang et al. ([2025](https://arxiv.org/html/2605.20235#bib.bib67)) showed that gradient descent on the DDPM objective recovers low-dimensional structure, respectively linking it to Gaussian mixture recovery and subspace clustering. Han et al. ([2024a](https://arxiv.org/html/2605.20235#bib.bib82)) and Li et al. ([2025](https://arxiv.org/html/2605.20235#bib.bib83)) further analyzed feature learning and representation dynamics in diffusion models through low-dimensional data models. Complementary lines of work analyze diffusion training via high-dimensional asymptotics and convex optimization Cui and Zdeborová ([2023](https://arxiv.org/html/2605.20235#bib.bib92)); Cui et al. ([2025](https://arxiv.org/html/2605.20235#bib.bib94)); Zhang and Pilanci ([2024](https://arxiv.org/html/2605.20235#bib.bib93)); Zeno et al. ([2025](https://arxiv.org/html/2605.20235#bib.bib95)) or establish coarse-to-fine spectral dynamics in sampling Wang and Vastola ([2023](https://arxiv.org/html/2605.20235#bib.bib90)); Wang and Pehlevan ([2025](https://arxiv.org/html/2605.20235#bib.bib89)). Recently, Bonnaire et al. ([2025](https://arxiv.org/html/2605.20235#bib.bib11)) identified an implicit dynamical regularization mechanism via two training timescales; our work provides the geometric counterpart, proving that the “collapse-then-refine” mechanism is the structural driver behind why memorization is dynamically postponed and showing that the score singularity drives an analogous geometry-before-density hierarchy.

Latent Diffusion Models. Latent diffusion models (LDMs) Rombach et al. ([2022](https://arxiv.org/html/2605.20235#bib.bib74)) achieve state-of-the-art generation quality by operating in a compressed latent space learned by a VAE Kingma and Welling ([2013](https://arxiv.org/html/2605.20235#bib.bib84)), rather than directly in the high-dimensional pixel space. While highly effective in practice, this approach introduces a fundamental tension: the VAE encoder is trained with a heuristic KL regularization term to encourage a well-structured latent space, which is independent of and potentially misaligned with the score matching objective used in the diffusion stage. Several works have attempted to bridge this gap by jointly training the encoder and diffusion model Vahdat et al. ([2021](https://arxiv.org/html/2605.20235#bib.bib85)) or by revisiting how the latent itself is trained Heek et al. ([2026](https://arxiv.org/html/2605.20235#bib.bib88)). Our work, SiLD, offers a principled alternative: both manifold learning and density estimation emerge naturally from the score matching objective at different noise scales, eliminating the need for KL regularization entirely. This provides a theoretical justification for why the latent space induced by the score function is geometrically well-suited for diffusion, and consistently improves reconstruction quality as validated in our experiments.

## 3 Score-induced Latent Diffusion Model

### 3.1 Preliminaries

Notation. We use $\|\cdot\|$ to denote the Euclidean norm for vectors and the Frobenius norm for matrices unless otherwise specified. For a matrix $A$, $\|A\|_{\mathrm{op}}$ denotes its operator norm. We employ standard asymptotic notations such as $O(\cdot)$, $\Omega(\cdot)$, and $o(\cdot)$. We write $f \asymp g$ if $f = O(g)$ and $g = O(f)$.

Data distribution. Let $p_{\mathrm{data}}$ denote the unknown target distribution supported on a $k$-dimensional compact smooth manifold $\mathcal{M}$ embedded in $\mathbb{R}^{d}$, with $k \ll d$. For any point $x$ in the tubular neighborhood of $\mathcal{M}$, we denote by $\Pi_{\mathcal{M}}(x)$ its projection onto $\mathcal{M}$ and by $d_{\mathcal{M}}(x) = \inf_{z \in \mathcal{M}} \|x - z\|$ its distance to $\mathcal{M}$. The tangent and normal spaces of $\mathcal{M}$ at a point $z$ are denoted $T_{z}\mathcal{M}$ and $N_{z}\mathcal{M}$, respectively. We denote by $\tau > 0$ the reach of $\mathcal{M}$, i.e., the largest radius such that the projection $\Pi_{\mathcal{M}}$ is uniquely defined within the tubular neighborhood $U_{\tau} = \{x \in \mathbb{R}^{d} : d_{\mathcal{M}}(x) < \tau\}$.

Diffusion process. We consider the Variance-Preserving (VP) forward process Ho et al. ([2020](https://arxiv.org/html/2605.20235#bib.bib6)), which corrupts a clean sample $x_{0} \sim p_{\mathrm{data}}$ as

$$
x_{t} = \sqrt{\bar{\alpha}_{t}}\,x_{0} + \sqrt{1-\bar{\alpha}_{t}}\,\varepsilon, \quad \varepsilon \sim \mathcal{N}(0, I_{d}),
$$

where $\{\bar{\alpha}_{t}\}_{t \in [0,T]}$ is a monotonically decreasing noise schedule. We denote the marginal distribution of $x_{t}$ by $p_{t}$, and write $h(t) := 1 - \bar{\alpha}_{t}$ for the noise variance at time $t$. The score function of $p_{t}$ is $s^{*}(x,t) := \nabla_{x} \log p_{t}(x)$.

Score matching objective. Following Ho et al. ([2020](https://arxiv.org/html/2605.20235#bib.bib6)); Vincent ([2011](https://arxiv.org/html/2605.20235#bib.bib3)), the score function is estimated by minimizing the denoising score matching (DSM) objective:

$$
\mathcal{L}(\theta) = \mathbb{E}_{t, x_{0}, \varepsilon}\left[\left\| s_{\theta}(x_{t}, t) + \varepsilon / \sqrt{h(t)} \right\|^{2}\right],
$$

where the expectation is over $t \sim \mathrm{Unif}[0,T]$, $x_{0} \sim p_{\mathrm{data}}$, and $\varepsilon \sim \mathcal{N}(0, I_{d})$. It is well established Vincent ([2011](https://arxiv.org/html/2605.20235#bib.bib3)) that minimizing $\mathcal{L}(\theta)$ is equivalent to minimizing the explicit score matching error $\mathbb{E}[\| s_{\theta}(x_{t}, t) - s^{*}(x_{t}, t) \|^{2}]$ up to an additive constant that does not depend on $\theta$.
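As a rough illustration of the objective above, the following NumPy sketch estimates the DSM loss at a single fixed noise level instead of averaging over $t \sim \mathrm{Unif}[0,T]$. The helper names `vp_forward` and `dsm_loss`, the value `alpha_bar=0.5`, and the standard Gaussian toy data are assumptions made for this example; none of them come from the paper.

```python
import numpy as np

def vp_forward(x0, alpha_bar, rng):
    """Sample x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps (VP process)."""
    eps = rng.standard_normal(x0.shape)
    h = 1.0 - alpha_bar                      # noise variance h(t)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(h) * eps, eps, h

def dsm_loss(score_fn, x0, alpha_bar, rng):
    """Monte Carlo estimate of E || s(x_t) + eps / sqrt(h) ||^2 at one noise level."""
    xt, eps, h = vp_forward(x0, alpha_bar, rng)
    residual = score_fn(xt, h) + eps / np.sqrt(h)
    return np.mean(np.sum(residual**2, axis=1))

# Toy data (assumption): x0 ~ N(0, I_d). The VP marginal is then N(0, I_d) for
# every noise level, so the exact score is s*(x, t) = -x.
data_rng = np.random.default_rng(0)
x0 = data_rng.standard_normal((20000, 3))

def true_score(x, h):
    return -x

def zero_score(x, h):
    return np.zeros_like(x)

# Reuse the same noise seed so both candidates see identical perturbations.
loss_true = dsm_loss(true_score, x0, alpha_bar=0.5, rng=np.random.default_rng(1))
loss_zero = dsm_loss(zero_score, x0, alpha_bar=0.5, rng=np.random.default_rng(1))
print(f"DSM loss, exact score: {loss_true:.3f}")
print(f"DSM loss, zero score:  {loss_zero:.3f}")
```

In this toy setting the exact score attains the smaller DSM value, and the gap between the two losses matches the explicit score matching error of the zero predictor, which is the equivalence up to a constant stated above.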

### 3.2 Manifold Hypothesis and Score Singularity

Manifold hypothesis. We operate under the manifold hypothesis, which posits that $p_{\mathrm{data}}$ is supported on a $k$-dimensional compact smooth manifold $\mathcal{M} \subset \mathbb{R}^{d}$ with $k \ll d$. For any point $x$ in the tubular neighborhood $U_{\tau}$, the perturbed distribution $p_{t}$ admits the following decomposition of its score.

###### Proposition 3.1 (Score decomposition; De Bortoli ([2022](https://arxiv.org/html/2605.20235#bib.bib47)); Li et al. ([2026](https://arxiv.org/html/2605.20235#bib.bib46))).

Assume $p_{0} \in C^{2}(\mathcal{M})$ with $0 < p_{\min} \leq p_{0} \leq p_{\max} < \infty$. For $x \in U_{\tau/2}$ and $h(t) \leq c\,\tau^{2}$ with a sufficiently small universal constant $c \in (0,1)$, the score function $s^{*}(x,t) := \nabla_{x} \log p_{t}(x)$ admits the scale-separated decomposition

$$
s^{*}(x,t) = \underbrace{-\,\frac{x - \Pi_{\mathcal{M}}(x)}{h(t)}}_{\text{(I): } O(h(t)^{-1})} + \underbrace{\nabla_{x} \log p_{0}\bigl(\Pi_{\mathcal{M}}(x)\bigr) + \nabla_{x} H\bigl(\Pi_{\mathcal{M}}(x),\, x - \Pi_{\mathcal{M}}(x)\bigr)}_{\text{(II): } O(1)} + O(h(t)), \tag{1}
$$

where $H : \{(z,\nu) : z \in \mathcal{M},\, \nu \in N_{z}\mathcal{M}\} \to \mathbb{R}$ is a smooth function.

Term *(I)* is the *normal restoring force*, pointing from $x$ toward its projection on $\mathcal{M}$. Term *(II)* is the *intrinsic-density and geometric correction*: the first summand $\nabla_{x} \log p_{0}(\Pi_{\mathcal{M}}(x))$ denotes the Euclidean gradient of the composition $x \mapsto \log p_{0}(\Pi_{\mathcal{M}}(x))$, which lies in $T_{\Pi_{\mathcal{M}}(x)}\mathcal{M}$ and encodes the intrinsic density gradient; the second summand accounts for the geometry of the embedding. Explicitly, $H$ combines two contributions, $H(z,\nu) = -\tfrac{1}{2}\,\langle \nu, z \rangle - \tfrac{1}{2}\,\log\det\bigl(I_{k} - B_{\nu}\bigr)$, where the first term is a *VP-shrinkage correction* arising from the rescaling $y = x/\sqrt{1-h(t)}$, and the second is an *extrinsic-curvature correction* with $(B_{\nu})_{ij} := \langle \nu, \mathrm{I\!I}(e_{i}, e_{j}) \rangle$ and $\mathrm{I\!I}$ the second fundamental form of $\mathcal{M}$.
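To make the two terms concrete, the sketch below checks the decomposition numerically in a toy case where every quantity has a closed form: data $x_0 = Pz$ with $z \sim \mathcal{N}(0, I_k)$ on a $k$-dimensional linear subspace with orthonormal basis $P$. This is an assumption chosen for illustration only. A linear subspace is neither compact nor does it have a density bounded away from zero, so it falls outside the hypotheses of Proposition 3.1, but it has $\Pi_{\mathcal{M}}(x) = PP^{\top}x$, vanishing curvature ($B_{\nu} = 0$) and $\langle \nu, z \rangle = 0$, so $H \equiv 0$ and the decomposition holds with no remainder.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, h = 6, 2, 0.01                       # ambient dim, intrinsic dim, noise variance h(t)

# Orthonormal basis of a random k-dimensional subspace (the toy "manifold").
P, _ = np.linalg.qr(rng.standard_normal((d, k)))
proj = P @ P.T                             # Pi_M(x) = proj @ x

# x_t = sqrt(1 - h) * P z + sqrt(h) * eps with z ~ N(0, I_k), so p_t = N(0, cov).
cov = (1.0 - h) * proj + h * np.eye(d)

x = rng.standard_normal(d)
exact_score = -np.linalg.solve(cov, x)     # score of a centered Gaussian: -cov^{-1} x

term_I = -(x - proj @ x) / h               # normal restoring force, O(1/h)
term_II = -(proj @ x)                      # gradient of x -> log p0(Pi_M(x)); H = 0 here

print("norm of term (I): ", np.linalg.norm(term_I))
print("norm of term (II):", np.linalg.norm(term_II))
print("max gap to exact score:", np.max(np.abs(exact_score - (term_I + term_II))))
```

The normal term is larger than the tangential term by roughly a factor $1/h$ in this example, which is the scale separation the proposition describes.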

#### Implications of scale separation.

Proposition [3.1](https://arxiv.org/html/2605.20235#S3.Thmtheorem1) reveals a fundamental asymmetry in the learning problem. As $h(t) \to 0$, the normal restoring force dominates the score by a factor of $O(h(t)^{-1})$, creating a strong geometric signal that dwarfs the tangential density information. This *scale separation* Li et al. ([2026](https://arxiv.org/html/2605.20235#bib.bib46)) has two consequences. First, learning the geometry of $\mathcal{M}$ is statistically easier than learning the intrinsic density $p_{0}$: at small noise scales, the DSM loss is dominated by the projection term and provides a strong gradient signal toward approximating $\Pi_{\mathcal{M}}(\cdot)$ before the residual density term becomes relevant. Second, the dominant term *(I)* is the gradient of a scalar potential,

$$
-\,\frac{x - \Pi_{\mathcal{M}}(x)}{h(t)} \;=\; -\,\nabla_{x}\left(\frac{d_{\mathcal{M}}(x)^{2}}{2\,h(t)}\right),
$$

by Federer’s identity $\nabla_{x}\bigl(\tfrac{1}{2}\,d_{\mathcal{M}}^{2}(x)\bigr) = x - \Pi_{\mathcal{M}}(x)$. This conservative structure motivates a conservative-form architecture (Section [3.3](https://arxiv.org/html/2605.20235#S3.SS3)), whose output is structurally constrained to conservative vector fields and therefore aligned with the geometric category of the target score at small noise.
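Federer’s identity can be checked by hand on a simple manifold. The sketch below uses the unit circle in $\mathbb{R}^{2}$ (reach $\tau = 1$), where $\Pi_{\mathcal{M}}(x) = x/\|x\|$ and $d_{\mathcal{M}}(x) = \bigl|\|x\| - 1\bigr|$, and compares a finite-difference gradient of $\tfrac{1}{2} d_{\mathcal{M}}^{2}$ with $x - \Pi_{\mathcal{M}}(x)$. The circle, the test point, and the helper names are assumptions chosen for illustration.

```python
import numpy as np

def project_circle(x):
    """Nearest-point projection onto the unit circle (defined for x != 0)."""
    return x / np.linalg.norm(x)

def half_sq_dist(x):
    """0.5 * d_M(x)^2 for the unit circle."""
    return 0.5 * (np.linalg.norm(x) - 1.0) ** 2

def numerical_grad(f, x, step=1e-6):
    """Central finite-difference gradient of a scalar function f at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = step
        grad[i] = (f(x + e) - f(x - e)) / (2.0 * step)
    return grad

x = np.array([0.9, 0.7])                   # distance to the circle is below the reach
lhs = numerical_grad(half_sq_dist, x)      # gradient of 0.5 * d_M^2
rhs = x - project_circle(x)                # x - Pi_M(x)
print("finite-difference gradient:", lhs)
print("x - Pi_M(x):               ", rhs)
```

Dividing both sides by $h(t)$ and flipping the sign recovers term *(I)* as the negative gradient of the potential $d_{\mathcal{M}}(x)^{2} / (2h(t))$.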

#### The optimization challenge.

The same singularity that enables efficient geometry learning poses an obstacle for simultaneous density learning: at small noise scales, the $O(h(t)^{-1})$ magnitude of the normal component dominates the DSM loss, drowning out the $O(1)$ density signal Liu et al. ([2025](https://arxiv.org/html/2605.20235#bib.bib49)). A single network trained end-to-end on the DSM objective must therefore resolve two tasks operating at incompatible scales, leading to inefficient optimization. This motivates our Score-induced Latent Diffusion (SiLD) framework, introduced in Section [3.3](https://arxiv.org/html/2605.20235#S3.SS3), which decouples the two stages by exploiting the scale separation structure of Proposition [3.1](https://arxiv.org/html/2605.20235#S3.Thmtheorem1).
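The imbalance can be quantified in a toy model. The sketch below assumes Gaussian data on a random $k$-dimensional linear subspace of $\mathbb{R}^{d}$ (an illustrative choice outside the compactness assumption of the paper) and estimates the mean squared norms of the normal and tangential parts of the exact score at several noise levels. The function name `mean_sq_norms` and all constants are hypothetical.

```python
import numpy as np

def mean_sq_norms(h, d=50, k=2, n=4000, seed=0):
    """Mean squared norm of the normal and tangential score parts at noise variance h."""
    rng = np.random.default_rng(seed)
    P, _ = np.linalg.qr(rng.standard_normal((d, k)))   # orthonormal subspace basis
    z = rng.standard_normal((n, k))
    eps = rng.standard_normal((n, d))
    x = np.sqrt(1.0 - h) * z @ P.T + np.sqrt(h) * eps  # VP-perturbed samples
    x_tan = x @ P @ P.T                                # projection onto the subspace
    normal = -(x - x_tan) / h                          # term (I)
    tangential = -x_tan                                # term (II) for this toy model
    return (np.mean(np.sum(normal**2, axis=1)),
            np.mean(np.sum(tangential**2, axis=1)))

for h in (1e-1, 1e-2, 1e-3):
    normal_sq, tangential_sq = mean_sq_norms(h)
    print(f"h={h:g}  normal={normal_sq:.1f}  tangential={tangential_sq:.2f}  "
          f"ratio={normal_sq / tangential_sq:.1f}")
```

In this model the expected squared normal part is $(d-k)/h$ while the tangential part stays near $k$, so the ratio grows like $1/h$; a squared-error loss at small $h$ is therefore almost entirely driven by the normal component.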

### 3.3 Score-induced Latent Diffusion: A Two-Stage Framework

The scale separation in Proposition [3.1](https://arxiv.org/html/2605.20235#S3.Thmtheorem1) motivates a principled decoupling of the score learning problem into two sequential stages, each targeting one term in the decomposition ([1](https://arxiv.org/html/2605.20235#S3.E1)). We call this framework Score-induced Latent Diffusion (SiLD) (with the generic algorithm presented in Algorithm [1](https://arxiv.org/html/2605.20235#alg1)), reflecting that the latent representation is induced by the geometric structure of the score function itself, rather than imposed by an auxiliary objective. For tractability of the analysis, Stage 1 uses a two-layer conservative-form network whose gradient-of-potential structure matches the conservative target score at small noise, and Stage 2 a random-feature network operating on the Stage 1 projection.

Stage 1: Geometric Alignment. At small noise scales $h(t_{1}) \ll 1$, the DSM loss is dominated by the normal restoring force *(I)*. Since this dominant term is a conservative vector field, we constrain our network to the same class by parametrizing it as the gradient of a scalar potential. Concretely, we train a *conservative-form* two-layer network:

$$
f_{1}(x;\theta) \;=\; \frac{1}{h(t_{1})}\Bigl(W\,\mathrm{diag}(a)\,\sigma(W^{\top}x + b) \;-\; x\Bigr), \tag{2}
$$

with parameters $\theta = (W, a, b)$, where $W = [w_{1}, \ldots, w_{m}] \in \mathbb{R}^{d \times m}$, $a \in \mathbb{R}^{m}$, $b \in \mathbb{R}^{m}$, and $\sigma(\cdot)$ is a nonlinear activation. Per-neuron, $f_{1}(x;\theta) = \tfrac{1}{h(t_{1})}\bigl(\sum_{j=1}^{m} a_{j}\,w_{j}\,\sigma(w_{j}^{\top}x + b_{j}) - x\bigr)$: each neuron contributes a vector along its input direction $w_{j}$, scaled by $a_{j}\,\sigma(w_{j}^{\top}x + b_{j})$. This direction-parallel structure is precisely what makes $f_{1}$ a conservative vector field: as we show in Lemma [4.1](https://arxiv.org/html/2605.20235#S4.Thmtheorem1), $f_{1} = -\nabla_{x}\Phi_{\mathrm{net}}$ for an explicit scalar potential $\Phi_{\mathrm{net}}$, matching the category of the target score at small noise. Minimizing the DSM objective drives the induced projection map to align with $\Pi_{\mathcal{M}}(x)$:

$$
\hat{x} \;:=\; h(t_{1})\,f_{1}(x;\theta) + x \;=\; W\,\mathrm{diag}(a)\,\sigma(W^{\top}x + b) \;\approx\; \Pi_{\mathcal{M}}(x). \tag{3}
$$
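A minimal NumPy sketch of equations (2) and (3) follows, assuming $\sigma = \tanh$ and randomly drawn, untrained parameters; both choices, as well as the helper names, are assumptions for illustration. A vector field on $\mathbb{R}^{d}$ is conservative exactly when its Jacobian is symmetric, so the sketch checks the symmetry of a finite-difference Jacobian of $f_{1}$.

```python
import numpy as np

def f1(x, W, a, b, h1):
    """Conservative-form network (2): (W diag(a) tanh(W^T x + b) - x) / h1."""
    return (W @ (a * np.tanh(W.T @ x + b)) - x) / h1

def induced_projection(x, W, a, b, h1):
    """Induced map (3): x_hat = h1 * f1(x) + x."""
    return h1 * f1(x, W, a, b, h1) + x

def jacobian(f, x, step=1e-6):
    """Central finite-difference Jacobian; column i holds d f / d x_i."""
    cols = []
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = step
        cols.append((f(x + e) - f(x - e)) / (2.0 * step))
    return np.stack(cols, axis=1)

rng = np.random.default_rng(0)
d, m, h1 = 5, 16, 0.01
W = rng.standard_normal((d, m)) / np.sqrt(d)   # input directions w_j (columns)
a = rng.choice([-1.0, 1.0], size=m) / m        # output coefficients
b = rng.standard_normal(m)                     # biases
x = rng.standard_normal(d)

J = jacobian(lambda v: f1(v, W, a, b, h1), x)
x_hat = induced_projection(x, W, a, b, h1)
print("max |J - J^T|:", np.max(np.abs(J - J.T)))
```

The symmetry follows from the tied weights: the Jacobian equals $\bigl(W\,\mathrm{diag}(a \odot \sigma'(W^{\top}x + b))\,W^{\top} - I\bigr)/h(t_{1})$, which is symmetric for any parameter values, so the constraint is architectural rather than learned.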
Stage 2: Density Estimation. Once Stage 1 has converged and $W$ is frozen, we decouple the singular normal component by constructing the full score network as:

$$
s_{\theta}(x, t_{2}) = -\frac{x - \hat{x}}{h(t_{2})} + f_{2}(\hat{x}, t_{2}; \theta_{2}), \tag{4}
$$

where $f_{2}(\hat{x}, t_{2}; \theta_{2}) = U\,\Phi(\hat{x}, t_{2})$ is a Random Feature network with frozen features, and $U \in \mathbb{R}^{d \times m}$ is the only trainable parameter. By construction, the singular normal components cancel exactly, and Stage 2 reduces to estimating the $O(1)$ residual intrinsic score:

$$
\min_{\theta_{2}} \mathbb{E}_{t_{2},\, x \sim p_{t_{2}}} \left\| f_{2}(\hat{x}, t_{2}; \theta_{2}) - s^{*}_{\mathrm{res}}(\hat{x}, t_{2}) \right\|^{2}, \tag{5}
$$

where $s^{*}_{\mathrm{res}}(\hat{x}, t) := \nabla_{x} \log p_{0}(\hat{x}) + \nabla_{x} H(\hat{x}, x - \hat{x})$ is the residual intrinsic score, which takes values in the tangent bundle of $\mathcal{M}$ and is uniformly bounded on $\mathcal{M}$.
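Because only $U$ is trainable, the regression (5) with a ridge penalty has a closed-form solution. The sketch below illustrates this on a toy problem with several assumptions that are not from the paper: Gaussian data on a linear subspace, the exact projection standing in for a converged Stage 1, the known residual score $-\hat{x}$ of that toy model as regression target, and $\tanh$ random features with hypothetical frozen weights `V` and `c`. In practice the residual score is not observed and would be replaced by a denoising target.

```python
import numpy as np

def random_features(x_hat, t, V, c):
    """Frozen features Phi(x_hat, t) = tanh(V [x_hat; t] + c), returned with shape (n, m)."""
    inp = np.concatenate([x_hat, np.full((x_hat.shape[0], 1), t)], axis=1)
    return np.tanh(inp @ V.T + c)

def fit_ridge(Phi, Y, lam):
    """Closed-form minimizer of ||Phi U^T - Y||_F^2 + lam * ||U||_F^2 over U (d x m)."""
    m = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(m), Phi.T @ Y).T

def composed_score(x, x_hat, t, h, U, V, c):
    """Equation (4): s(x, t) = -(x - x_hat) / h + U Phi(x_hat, t)."""
    return -(x - x_hat) / h + random_features(x_hat, t, V, c) @ U.T

rng = np.random.default_rng(0)
d, k, m, n = 6, 2, 64, 2000
t2, h2, lam = 0.5, 0.1, 1e-3
P, _ = np.linalg.qr(rng.standard_normal((d, k)))
x = (np.sqrt(1.0 - h2) * rng.standard_normal((n, k)) @ P.T
     + np.sqrt(h2) * rng.standard_normal((n, d)))
x_hat = x @ P @ P.T                      # stand-in for a converged Stage 1 projection
Y = -x_hat                               # known residual score of this toy model

V = rng.standard_normal((m, d + 1))      # frozen random first-layer weights
c = rng.standard_normal(m)               # frozen random biases
Phi = random_features(x_hat, t2, V, c)
U = fit_ridge(Phi, Y, lam)               # the only trained parameter

train_mse = np.mean(np.sum((Phi @ U.T - Y) ** 2, axis=1))
baseline_mse = np.mean(np.sum(Y ** 2, axis=1))
scores = composed_score(x, x_hat, t2, h2, U, V, c)
print(f"train MSE {train_mse:.4f} vs. zero-predictor MSE {baseline_mse:.4f}")
```

Since the normal part of the composed score is supplied analytically by $-(x - \hat{x})/h(t_{2})$, the error of the composed score in this toy model equals the regression residual of $f_{2}$ alone.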

Algorithm 1 Score-induced Latent Diffusion (SiLD) Training

1. Dataset $\mathcal{D} = \{x_{i}\}_{i=1}^{n} \subset \mathbb{R}^{d}$, noise levels $h(t_{1}) \ll h(t_{2})$, learning rates $\eta_{1}, \eta_{2}$, regularization $\lambda$
2. Stage 1 (Geometric Alignment, at noise level $h(t_{1})$):
3. Train $f_{1} \in \mathcal{F}_{1}$ via gradient descent on the DSM loss to learn the manifold projection
4. Compute $\hat{x} := x + h(t_{1}) \cdot f_{1}(x) \approx \Pi_{\mathcal{M}}(x)$
5. Stage 2 (Density Estimation, at noise level $h(t_{2})$):
6. Train $f_{2} \in \mathcal{F}_{2}$ on the residual score: $\hat{f}_{2} = \arg\min_{f_{2} \in \mathcal{F}_{2}} \mathbb{E}_{t,x} \| f_{2}(\hat{x}, t) - s^{*}_{\mathrm{res}}(\hat{x}, t) \|^{2} + \lambda R(f_{2})$
7. return Score network $s_{\theta}(x,t) = -\dfrac{x - \hat{x}}{h(t)} + \hat{f}_{2}(\hat{x}, t)$

## 4 Theoretical Analysis

We now provide rigorous theoretical guarantees for the two stages of SiLD. All proofs are deferred to Appendix [B](https://arxiv.org/html/2605.20235#A2).

### 4.1 Stage 1: Dimensional Collapse

We first justify the architectural choice of ([2](https://arxiv.org/html/2605.20235#S3.E2)). The key observation is that the dominant term in the score decomposition ([1](https://arxiv.org/html/2605.20235#S3.E1)) is a conservative vector field, and our conservative-form network is structurally constrained to the same functional class.

###### Lemma 4.1 (Structural Alignment).

Let $\sigma\in C^{2}(\mathbb{R})$ be non-polynomial with primitive $\Psi(z):=\int_{0}^{z}\sigma(u)\,du$. Consider the conservative-form network ([2](https://arxiv.org/html/2605.20235#S3.E2)) with parameters $\theta=(W,a,b)$, and the neural potential

$$\Phi_{\mathrm{net}}(x;\theta)\;:=\;\frac{\|x\|^{2}}{2\,h(t)}\;-\;\frac{1}{h(t)}\sum_{j=1}^{m}a_{j}\,\Psi(w_{j}^{\top}x+b_{j}).$$

Then $f_{1}(\,\cdot\,;\theta)=-\nabla_{x}\Phi_{\mathrm{net}}(\,\cdot\,;\theta)$ on $\mathbb{R}^{d}$ for every $\theta$. Moreover, for any $\varepsilon>0$, there exist $m=m(\varepsilon,\mathcal{M},\tau,\sigma)\in\mathbb{N}$ and parameters $\theta=(W,a,b)$ with $W\in\mathbb{R}^{d\times m}$ and $a,b\in\mathbb{R}^{m}$ such that

$$\sup_{x\in\overline{U_{\tau/2}}}\bigl\|W\,\mathrm{diag}(a)\,\sigma(W^{\top}x+b)\;-\;\Pi_{\mathcal{M}}(x)\bigr\|\;<\;\varepsilon.\tag{6}$$
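The gradient identity in Lemma 4.1 can be checked numerically. The sketch below is an illustration under stated assumptions: it takes $\sigma=\tanh$ (so that $\Psi(z)=\log\cosh z$), random parameters, and a fixed value `h` standing in for $h(t)$, and compares $f_{1}(x)=-(x-W\,\mathrm{diag}(a)\,\sigma(W^{\top}x+b))/h$ against a central finite difference of $\Phi_{\mathrm{net}}$.

```python
import numpy as np

def potential(x, W, a, b, h):
    """Phi_net(x) = |x|^2 / (2h) - (1/h) * sum_j a_j * Psi(w_j^T x + b_j), Psi = log cosh."""
    z = W.T @ x + b
    psi = np.logaddexp(z, -z) - np.log(2.0)     # numerically stable log cosh(z)
    return x @ x / (2.0 * h) - (a @ psi) / h

def f1(x, W, a, b, h):
    """Conservative-form network: -(x - W diag(a) tanh(W^T x + b)) / h."""
    x_hat = W @ (a * np.tanh(W.T @ x + b))
    return -(x - x_hat) / h

rng = np.random.default_rng(0)
d, m, h = 5, 8, 0.1                              # illustrative sizes and noise level
W = rng.normal(size=(d, m))
a = rng.choice([-1.0, 1.0], size=m)
b = rng.normal(size=m)
x = rng.normal(size=d)

# Central finite-difference gradient of the potential, one coordinate at a time.
eps = 1e-5
grad = np.array([
    (potential(x + eps * e, W, a, b, h) - potential(x - eps * e, W, a, b, h)) / (2 * eps)
    for e in np.eye(d)
])
max_err = np.max(np.abs(f1(x, W, a, b, h) + grad))  # f1 should equal -grad
```

The check covers only the first claim of the lemma (that the network is a gradient field); the approximation statement (6) is an existence result and is not tested here.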

We then analyze the gradient flow dynamics of Stage 1 under the mean-field framework Mei et al. ([2018](https://arxiv.org/html/2605.20235#bib.bib68)); Chizat ([2022](https://arxiv.org/html/2605.20235#bib.bib69)); Suzuki et al. ([2023](https://arxiv.org/html/2605.20235#bib.bib71)), which captures feature learning and enjoys favorable convergence guarantees. For tractability of the analysis, we adopt a random-feature-style simplification standard in mean-field studies of two-layer networks Mei et al. ([2018](https://arxiv.org/html/2605.20235#bib.bib68)); Chizat ([2022](https://arxiv.org/html/2605.20235#bib.bib69)): the output coefficients $a$ and biases $b$ are frozen at their initial values, and only the input weights $w$ evolve under gradient flow. Lemma [4.1](https://arxiv.org/html/2605.20235#S4.Thmtheorem1) establishes expressivity of the full conservative-form class; for the frozen-$(a,b)$ dynamics analyzed below, the required approximation and reachability properties are captured by the non-degeneracy condition in Assumption [4.2](https://arxiv.org/html/2605.20235#S4.Thmtheorem2).

Let $q\in\mathcal{P}_{2}(\mathbb{R}^{d+2})$ denote the joint distribution over neuron parameters $\theta=(w,a,b)\in\mathbb{R}^{d}\times\mathbb{R}\times\mathbb{R}$, initialized at $q_{0}=\nu_{w}\otimes\nu_{a}\otimes\nu_{b}$, where $\nu_{w}=\mathcal{N}(0,\sigma_{w}^{2}I_{d})$ with $\sigma_{w}>0$, and $\nu_{a},\nu_{b}$ are bounded-support distributions on $\mathbb{R}$ (e.g., Rademacher) with $|a|\leq A$ and $|b|\leq B$ almost surely, and second moments $\sigma_{a}^{2},\sigma_{b}^{2}>0$. The mean-field neural network is

$$P_{q}(x):=\int_{\mathbb{R}^{d+2}}a\,w\,\sigma(w^{\top}x+b)\,q(\mathrm{d}\theta),\tag{7}$$

which corresponds to the limit of $h(t_{1})\,f_{1}(x;\theta)+x$ in ([3](https://arxiv.org/html/2605.20235#S3.E3)). We minimize the DSM loss $L_{t}(q):=\tfrac{1}{2}\,\mathbb{E}_{x\sim p_{t}}\bigl\|f_{1}(x;q)-s^{*}(x,t)\bigr\|^{2}$ via the Wasserstein gradient flow restricted to the trainable direction:

$$\partial_{s}q_{s}\;=\;\nabla_{w}\cdot\!\left(q_{s}\,\nabla_{w}\frac{\delta L_{t}}{\delta q}(q_{s})\right),\tag{8}$$

so the $(a,b)$-marginal of $q_{s}$ remains $\nu_{a}\otimes\nu_{b}$ for all $s\geq 0$. The Stage 1 alignment risk is defined as

$$F_{t}(q):=\tfrac{1}{2}\,\mathbb{E}_{x\sim p_{t}}\left\|P_{q}(x)-\Pi_{\mathcal{M}}(x)\right\|^{2}.\tag{9}$$

Our analysis tracks $F_{t}$ along this flow, rather than $L_{t}$ directly, since $F_{t}$ isolates the geometric alignment error.
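To make (7) and (9) concrete, the following sketch replaces $q$ by an empirical measure over $m$ neurons and estimates $F_{t}$ by Monte Carlo. Everything specific here is an assumption made for illustration: the activation is $\tanh$, the manifold is a coordinate subspace (so $\Pi_{\mathcal{M}}$ is a linear projector), and noisy samples are drawn as $x = x_{0} + \sqrt{h}\,\varepsilon$.

```python
import numpy as np

def mean_field_output(X, w, a, b):
    """Particle approximation of P_q(x): (1/m) * sum_j a_j * w_j * tanh(w_j^T x + b_j).

    X: (n, d) inputs, w: (m, d) input weights, a, b: (m,) frozen coefficients.
    """
    act = np.tanh(X @ w.T + b)              # (n, m)
    return (act * a) @ w / w.shape[0]       # (n, d)

def alignment_risk(X, proj, w, a, b):
    """Monte Carlo estimate of F_t(q) = 0.5 * E || P_q(x) - Pi_M(x) ||^2."""
    resid = mean_field_output(X, w, a, b) - proj(X)
    return 0.5 * np.mean(np.sum(resid ** 2, axis=1))

rng = np.random.default_rng(0)
d, k, m, n, h = 8, 2, 64, 500, 0.01         # illustrative sizes and noise level
B = np.eye(d)[:, :k]                        # basis of the (assumed) linear manifold
proj = lambda X: X @ B @ B.T                # Pi_M for a linear subspace

X0 = rng.normal(size=(n, k)) @ B.T          # clean points on the subspace
X = X0 + np.sqrt(h) * rng.normal(size=(n, d))

w = 0.1 * rng.normal(size=(m, d))           # trainable in Stage 1
a = rng.choice([-1.0, 1.0], size=m)         # frozen (Rademacher)
b = rng.choice([-1.0, 1.0], size=m)         # frozen (Rademacher)

risk_init = alignment_risk(X, proj, w, a, b)
risk_zero = alignment_risk(X, proj, np.zeros((m, d)), a, b)  # P_q = 0 baseline
```

The `risk_zero` baseline equals half the empirical second moment of the projected samples, which is a useful sanity check on the estimator.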

###### Assumption 4.2 (Geometric Non-degeneracy).

There exists $\nu>0$, *independent of $h(t)$*, such that for all $s\geq 0$,

$$\int_{\mathbb{R}^{d+2}}\left\|\nabla_{w}\frac{\delta F_{t}}{\delta q}(q_{s})(\theta)\right\|^{2}q_{s}(\mathrm{d}\theta)\;\geq\;\nu\,F_{t}(q_{s}).\tag{10}$$

Assumption [4.2](https://arxiv.org/html/2605.20235#S4.Thmtheorem2) is a functional Polyak-Łojasiewicz inequality for $F_{t}$ along the $w$-restricted Wasserstein flow, comparing the gradient norm of $F_{t}$ at $q_{s}$ to its functional value. Since $(a,b)$ are frozen, this is effectively a PL condition on the $w$-marginal conditioned on $(a,b)$. Assumption [4.2](https://arxiv.org/html/2605.20235#S4.Thmtheorem2) ensures that any remaining projection error is detectable through the trainable feature directions $w$, conditional on the frozen labels $(a,b)$. We verify it explicitly for linear manifolds in Proposition [B.3](https://arxiv.org/html/2605.20235#A2.Thmtheorem3).

###### Assumption 4.3 (Second-Moment Confinement).

The second moment of the trainable weights $m_{2}(q_{s}):=\int\|w\|^{2}\,q_{s}(\mathrm{d}\theta)$ is uniformly bounded along the flow: $\sup_{s\geq 0}m_{2}(q_{s})\leq M_{2}<\infty$.

Assumption [4.3](https://arxiv.org/html/2605.20235#S4.Thmtheorem3) is a standard regularity condition in mean-field analyses of two-layer networks Mei et al. ([2018](https://arxiv.org/html/2605.20235#bib.bib68)); Chizat ([2022](https://arxiv.org/html/2605.20235#bib.bib69)); Suzuki et al. ([2023](https://arxiv.org/html/2605.20235#bib.bib71)), and we verify that it holds along the gradient flow under mild $\ell_{2}$ regularization of $w$. Since $(a,b)$ are frozen at bounded-support distributions, the corresponding moments $\int a^{2}q_{s}=\sigma_{a}^{2}$ and $\int b^{2}q_{s}=\sigma_{b}^{2}$ are constant in $s$, and all mixed moments involving $a,b$ inherit the bounds $|a|\leq A$, $|b|\leq B$ automatically.

###### Theorem 4.5 (Dimensional Collapse).

Assume $\mathcal{M}\subset\mathbb{R}^{d}$ is a compact $C^{\infty}$ submanifold of intrinsic dimension $k$ with reach $\tau>0$, $p_{0}\in C^{2}(\mathcal{M})$ with $0<p_{\min}\leq p_{0}\leq p_{\max}<\infty$ uniformly, and $\sigma\in C^{2}(\mathbb{R})$ is non-polynomial with $\max(\|\sigma\|_{\infty},\|\sigma^{\prime}\|_{\infty},\|\sigma^{\prime\prime}\|_{\infty})\leq C_{\sigma}$. Fix $h(t_{1})\leq c\,\tau^{2}$ for a sufficiently small universal constant $c\in(0,1)$. Under Assumptions [4.2](https://arxiv.org/html/2605.20235#S4.Thmtheorem2) and [4.3](https://arxiv.org/html/2605.20235#S4.Thmtheorem3), the Stage-1 alignment risk satisfies

$$F_{t}(q_{s})\;\leq\;F_{t}(q_{0})\,\exp\!\left(-\frac{\nu\,s}{2\,h(t_{1})^{2}}\right)\;+\;\frac{C_{t}\,h(t_{1})^{2}}{\nu},\tag{11}$$

where $F_{t}(q_{0})\leq\tfrac{1}{2}\operatorname{Tr}(\Sigma_{\mathcal{M}})+C_{\sigma}^{2}\,\sigma_{a}^{2}\,\sigma_{w}^{2}\,d$ with $\Sigma_{\mathcal{M}}:=\mathbb{E}[\Pi_{\mathcal{M}}(x)\Pi_{\mathcal{M}}(x)^{\top}]$, and $C_{t}>0$ depends on $\mathcal{M}$, $\tau$, $C_{\sigma}$, $A$, and $M_{2}$ but not on $s$. Consequently, $F_{t}(q_{s})\to C_{t}h(t_{1})^{2}/\nu$ as $s\to\infty$, and the expected projection error satisfies $\mathbb{E}\|P_{q_{s}}(x)-\Pi_{\mathcal{M}}(x)\|=O(h(t_{1}))$.
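The two terms of the bound (11) behave quite differently as $h(t_{1})$ shrinks: the exponential rate $\nu/(2h(t_{1})^{2})$ grows and the floor $C_{t}h(t_{1})^{2}/\nu$ falls. The short sketch below only evaluates the right-hand side of (11); the numerical values of $F_{t}(q_{0})$, $\nu$, and $C_{t}$ are hypothetical placeholders, not quantities estimated in the paper.

```python
import numpy as np

def collapse_bound(s, F0, nu, h1, C_t):
    """Right-hand side of (11): F0 * exp(-nu * s / (2 h1^2)) + C_t * h1^2 / nu."""
    return F0 * np.exp(-nu * s / (2.0 * h1 ** 2)) + C_t * h1 ** 2 / nu

def bound_floor(nu, h1, C_t):
    """Limit of the bound as s -> infinity."""
    return C_t * h1 ** 2 / nu

# Hypothetical constants, chosen only to show the scaling in h1.
F0, nu, C_t = 10.0, 1.0, 2.0

at_start = collapse_bound(0.0, F0, nu, h1=0.1, C_t=C_t)    # F0 plus the floor
late = collapse_bound(1.0, F0, nu, h1=0.1, C_t=C_t)        # transient has decayed
floor_coarse = bound_floor(nu, h1=0.1, C_t=C_t)
floor_fine = bound_floor(nu, h1=0.05, C_t=C_t)             # halving h1 quarters the floor
```

Under these placeholder values the transient term at $s=1$ is of order $e^{-50}$, so the bound is numerically indistinguishable from its floor.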

### 4.2 Stage 2: Density Estimation on the Manifold

Having established that Stage 1 produces an $\epsilon_{\mathrm{proj}}$-accurate projection $\hat{x}\approx\Pi_{\mathcal{M}}(x)$, we now analyze Stage 2. By construction of $s_{\theta}$ in ([4](https://arxiv.org/html/2605.20235#S3.E4)), the singular normal component cancels exactly, and the optimization reduces to estimating the $O(1)$ residual intrinsic score $s^{*}_{\mathrm{res}}(\hat{x},t)$ on the $k$-dimensional manifold $\mathcal{M}$. We model $f_{2}$ as a Random Feature (RF) network with spatio-temporal features:

$$f_{2}(\hat{x},t_{2};U)=U\Phi(\hat{x},t_{2}),\quad\Phi(\hat{x},t_{2})=\frac{1}{\sqrt{m}}\sigma\left(V_{x}^{\top}\hat{x}+b\right)\in\mathbb{R}^{m},\tag{12}$$

where $V_{x}\in\mathbb{R}^{d\times m}$ and $b$ are the frozen random feature weights and biases, and $U\in\mathbb{R}^{d\times m}$ is the only trainable parameter. Training Stage 2 thus reduces to a convex vector-valued ridge regression:

$$\hat{U}=\arg\min_{U}\frac{1}{n}\sum_{i=1}^{n}\bigl\|U\Phi(\hat{x}_{t_{2}},t_{2})-s^{*}_{\mathrm{res}}(\hat{x}_{t_{2}},t_{2})\bigr\|^{2}+\lambda\|U\|_{F}^{2}.\tag{13}$$
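Because (13) is a ridge regression in $U$, it has a closed-form minimizer. Stacking the features as rows of $\Phi\in\mathbb{R}^{n\times m}$ and the targets as rows of $Y\in\mathbb{R}^{n\times d}$, setting the gradient to zero gives $\hat{U}=\bigl(\tfrac{1}{n}Y^{\top}\Phi\bigr)\bigl(\tfrac{1}{n}\Phi^{\top}\Phi+\lambda I_{m}\bigr)^{-1}$. The sketch below illustrates this with NumPy; the $\tanh$ activation, the Gaussian feature weights, and the synthetic targets `Y = -X_hat` are assumptions for the example and stand in for the true residual score, which is not available in closed form in general.

```python
import numpy as np

def random_features(X_hat, V, b):
    """Phi(x_hat) = tanh(V^T x_hat + b) / sqrt(m), evaluated row-wise; returns (n, m)."""
    m = V.shape[1]
    return np.tanh(X_hat @ V + b) / np.sqrt(m)

def fit_ridge(Phi, Y, lam):
    """Minimize (1/n) * sum_i ||U phi_i - y_i||^2 + lam * ||U||_F^2 over U in R^{d x m}."""
    n, m = Phi.shape
    gram = Phi.T @ Phi / n + lam * np.eye(m)        # (m, m), symmetric positive definite
    U_T = np.linalg.solve(gram, Phi.T @ Y / n)      # (m, d)
    return U_T.T                                    # (d, m)

rng = np.random.default_rng(0)
n, d, m, lam = 200, 4, 32, 1e-2                     # illustrative sizes
X_hat = rng.normal(size=(n, d))                     # stand-in for projected samples
Y = -X_hat                                          # placeholder regression targets
V = rng.normal(size=(d, m))                         # frozen random feature weights
b = rng.uniform(-1.0, 1.0, size=m)                  # frozen biases

Phi = random_features(X_hat, V, b)
U_hat = fit_ridge(Phi, Y, lam)

# First-order optimality: the gradient of the objective should vanish at U_hat.
grad = 2.0 * (U_hat @ Phi.T - Y.T) @ Phi / n + 2.0 * lam * U_hat
```

Solving the $m\times m$ linear system avoids forming an explicit inverse, and the strictly positive $\lambda$ keeps the system well conditioned even when $m>n$.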
###### Theorem 4.7 (Generalization of Stage 2).

Fix two noise levels $h(t_{1})\ll h(t_{2})$, both in the manifold regime. Suppose further that the residual score satisfies the source condition $s^{*}_{\mathrm{res}}=T_{K}^{\alpha}g$ with $\|g\|_{L^{2}(\mathcal{M})}\leq G$ and smoothness $\alpha>k/(2r)$, and let $\hat{U}$ be the regularized empirical-risk minimizer of ([13](https://arxiv.org/html/2605.20235#S4.E13)) over $n$ i.i.d. samples at noise level $h(t_{2})$. Then with probability at least $1-\delta$,

$$\mathbb{E}_{x_{t_{2}}}\!\left\|s_{\hat{\theta}}(x_{t_{2}},t_{2})-s^{*}(x_{t_{2}},t_{2})\right\|^{2}\leq\underbrace{\frac{C_{\mathrm{int}}^{2}\log(1/\delta)}{n}}_{\text{estimation}}+\underbrace{O\bigl(m^{-(2\alpha r/k-1)}\bigr)}_{\text{approximation}}+\underbrace{C_{\mathrm{Lip}}^{2}\left(\frac{h(t_{1})}{h(t_{2})}\right)^{\!2}}_{\text{Stage 1 residual}},\tag{14}$$

where the constants $C_{\mathrm{int}},C_{\mathrm{Lip}}>0$ depend on $(G,C_{\sigma},\lambda,\mathcal{M})$ but *not* on the ambient dimension $d$.

### 4.3 End-to-End Sampling Guarantee

Theorems [4.5](https://arxiv.org/html/2605.20235#S4.Thmtheorem5) and [4.7](https://arxiv.org/html/2605.20235#S4.Thmtheorem7) bound the score estimation error within the manifold regime. We now lift these results to an end-to-end guarantee on the generated distribution by analyzing the reverse SDE over the full integration path. Following De Bortoli ([2022](https://arxiv.org/html/2605.20235#bib.bib47)); Benton et al. ([2023](https://arxiv.org/html/2605.20235#bib.bib79)), we partition $[t_{\min},T]$ at $t_{\max}$ with $h(t_{\max})\asymp\tau^{2}$. *Phase I* $[t_{\min},t_{\max}]$ (*manifold regime*): $p_{t}$ concentrates on the tubular neighborhood $U_{\tau}$, the score exhibits the $O(h(t)^{-1})$ singularity, and the two-stage SiLD architecture (Eq. [4](https://arxiv.org/html/2605.20235#S3.E4)) applies. *Phase II* $[t_{\max},T]$ (*Gaussian regime*): $p_{t}$ no longer concentrates near $\mathcal{M}$, and the manifold-projection ansatz of Eq. [4](https://arxiv.org/html/2605.20235#S3.E4) is no longer geometrically appropriate.

#### High-noise extension.

To cover Phase II, we extend the score network with a complementary head, gated by a hard time indicator $\alpha(t):=\mathbf{1}[h(t)\leq\tau^{2}]$:

$$s_{\theta}^{\mathrm{full}}(x,t)=\alpha(t)\cdot s_{\theta}^{\mathrm{SiLD}}(x,t)+(1-\alpha(t))\cdot s_{\theta}^{\mathrm{HN}}(x,t),\tag{15}$$

where $s_{\theta}^{\mathrm{SiLD}}$ is the Stage 1+2 output of Eq. [4](https://arxiv.org/html/2605.20235#S3.E4) and $s_{\theta}^{\mathrm{HN}}$ is a multiplicatively time-modulated random-feature head:

$$s_{\theta}^{\mathrm{HN}}(x,t)=\sum_{\ell=0}^{L-1}\phi_{\ell}(t)\cdot U_{\ell}\cdot\tfrac{1}{\sqrt{m}}\,\sigma\!\left(V_{\ell}^{\top}x\right),\tag{16}$$

with $\{\phi_{\ell}\}_{\ell=0}^{L-1}$ a fixed Fourier basis on $[0,T]$, $V_{\ell}\in\mathbb{R}^{d\times m}$ frozen random features, and $U_{\ell}\in\mathbb{R}^{d\times m}$ the only trainable parameters, fitted by ridge regression on the DSM loss restricted to Phase II. The ansatz ([16](https://arxiv.org/html/2605.20235#S4.E16)) is multiplicatively separable, so the induced kernel factorizes as $K_{t}\otimes K_{x}$ and the spatial eigenvalue analysis of Lemma [B.6](https://arxiv.org/html/2605.20235#A2.Thmtheorem6) extends without modification. At high noise, $s^{*}(x,t)$ is approximately affine in $x$ with smooth time-dependent coefficients, so ([16](https://arxiv.org/html/2605.20235#S4.E16)) achieves a parametric Phase II generalization bound that is *linear* in $d$ rather than exponential, with a residual $O(e^{-T})$ accounting for the gap between $p_{T}$ and $\mathcal{N}(0,I_{d})$; see Lemma [B.9](https://arxiv.org/html/2605.20235#A2.Thmtheorem9) in Appendix [B.5](https://arxiv.org/html/2605.20235#A2.SS5).
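The gated network (15) and the high-noise head (16) can be sketched in a few lines. The code below is an illustration under assumptions: the particular Fourier basis (a constant followed by cosine/sine pairs), the $\tanh$ activation, the schedule `h(t) = t`, and the stub `sild` score are hypothetical choices, since the text only specifies "a fixed Fourier basis on $[0,T]$".

```python
import numpy as np

def fourier_basis(t, T, L):
    """Assumed basis: phi_0 = 1, then cos/sin pairs with frequencies 1, 2, ... on [0, T]."""
    phi = np.empty(L)
    for l in range(L):
        angle = 2.0 * np.pi * ((l + 1) // 2) * t / T
        if l == 0:
            phi[l] = 1.0
        elif l % 2 == 1:
            phi[l] = np.cos(angle)
        else:
            phi[l] = np.sin(angle)
    return phi

def hn_head(x, t, U, V, T):
    """s_HN(x, t) = sum_l phi_l(t) * U_l @ tanh(V_l^T x) / sqrt(m); U, V have shape (L, d, m)."""
    L, d, m = U.shape
    phi = fourier_basis(t, T, L)
    feats = np.tanh(np.einsum('ldm,d->lm', V, x)) / np.sqrt(m)   # (L, m)
    return np.einsum('l,ldm,lm->d', phi, U, feats)               # (d,)

def full_score(x, t, sild, hn, h, tau):
    """Hard gate of (15): use the SiLD score when h(t) <= tau^2, else the high-noise head."""
    alpha = 1.0 if h(t) <= tau ** 2 else 0.0
    return alpha * sild(x, t) + (1.0 - alpha) * hn(x, t)

rng = np.random.default_rng(0)
L, d, m, T, tau = 5, 3, 16, 1.0, 0.5
U = rng.normal(size=(L, d, m))          # trainable in the paper; random here
V = rng.normal(size=(L, d, m))          # frozen random features
h = lambda t: t                         # placeholder noise schedule
sild = lambda x, t: np.ones_like(x)     # stub standing in for the Stage 1+2 score
hn = lambda x, t: hn_head(x, t, U, V, T)

x = rng.normal(size=d)
low = full_score(x, 0.1, sild, hn, h, tau)    # h = 0.1 <= tau^2 = 0.25: SiLD branch
high = full_score(x, 0.9, sild, hn, h, tau)   # h = 0.9 >  tau^2: high-noise branch
```

Because the gate is a hard indicator, exactly one branch contributes at any given time, so the two heads can be fitted independently on their respective time ranges.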

###### Theorem 4.10 (End-to-End Sampling Guarantee).

Let $p_{\mathrm{data}}$ be supported on a $k$-dimensional manifold $\mathcal{M}\subset\mathbb{R}^{d}$, and let $p_{\mathrm{gen},t_{\min}}$ be the distribution generated by simulating the reverse SDE with $s_{\hat{\theta}}^{\mathrm{full}}$ from $T$ down to $t_{\min}$. Choosing $t_{\min}\propto 1/n$ and $T\propto\log n$, with probability at least $1-\delta$:

$$W_{2}\bigl(p_{\mathrm{data}},\,p_{\mathrm{gen},t_{\min}}\bigr)\;\lesssim\;\underbrace{\frac{\mathrm{poly}(k)}{n^{1/4}}}_{\text{Phase I (manifold)}}\;+\;\underbrace{\frac{\sqrt{C_{\mathrm{HN}}\cdot d}}{n^{1/4}}}_{\text{Phase II (Gaussian)}},\tag{17}$$

where $C_{\mathrm{HN}}$ depends on $(\mathcal{M},\sigma,L)$ but not on $k$.

## 5 Experiments

We evaluate SiLD (Algorithm [1](https://arxiv.org/html/2605.20235#alg1)) on synthetic low-dimensional manifolds and real-world image and molecular generation benchmarks. We compare SiLD against LDM Rombach et al. ([2022](https://arxiv.org/html/2605.20235#bib.bib74)), a standard latent diffusion model with a VAE encoder trained with KL regularization. All models share the same backbone and training budget to ensure a fair comparison. For images, we report Fréchet Inception Distance (FID) Heusel et al. ([2017](https://arxiv.org/html/2605.20235#bib.bib86)) for generation quality, and mean squared error (Recon MSE) and LPIPS Zhang et al. ([2018](https://arxiv.org/html/2605.20235#bib.bib87)) for reconstruction quality. For molecules, we report validity, uniqueness, novelty (fraction of valid generations absent from the training set), internal diversity (IntDiv), drug-likeness (QED) Bickerton et al. ([2012](https://arxiv.org/html/2605.20235#bib.bib98)), and Fréchet ChemNet Distance (FCD) Preuer et al. ([2018](https://arxiv.org/html/2605.20235#bib.bib99)), alongside reconstruction MSE.

Toy experiment. We validate the dimensional-collapse mechanism on a synthetic mixture of Gaussians on a $k=5$ linear subspace of $\mathbb{R}^{100}$, which admits an analytical score decomposition for direct comparison (setup and hyperparameters in Appendix [C.1](https://arxiv.org/html/2605.20235#A3.SS1)). Figure [1](https://arxiv.org/html/2605.20235#S5.F1) shows the training dynamics: in Stage 1, the orthogonal contraction error (green) collapses rapidly while the manifold component error (red) remains flat; after the stage switch, Stage 2 exclusively refines the manifold component with $W$ frozen. The sharp transition empirically confirms Theorem [4.5](https://arxiv.org/html/2605.20235#S4.Thmtheorem5).
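The analytical score decomposition for this kind of toy problem can be written down and checked numerically. The sketch below is not the paper's experimental code: it assumes a forward process $x_{t}=x_{0}+\sqrt{h}\,\varepsilon$, an isotropic mixture with equal component variance `s2` in subspace coordinates, and arbitrary illustrative values for `h`, `s2`, and the number of components. Under these assumptions the score splits into a tangent part (the mixture score in subspace coordinates) and a normal part $-x_{\perp}/h$, which the code compares against a finite-difference gradient of the log-density computed directly in $\mathbb{R}^{d}$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n_comp = 100, 5, 3
h, s2 = 0.05, 0.2                                  # assumed noise level and component variance
B, _ = np.linalg.qr(rng.normal(size=(d, k)))       # orthonormal basis of the subspace, (d, k)
mus = rng.normal(size=(n_comp, k))                 # component means in subspace coordinates
pis = np.full(n_comp, 1.0 / n_comp)                # mixture weights
cov = s2 * B @ B.T + h * np.eye(d)                 # shared covariance of each noised component

def log_density(x):
    """log p_t(x) up to an additive constant, computed directly in R^d."""
    diffs = x - mus @ B.T                                         # (n_comp, d)
    quad = np.einsum('cd,dc->c', diffs, np.linalg.solve(cov, diffs.T))
    logits = np.log(pis) - 0.5 * quad                             # equal covariances: normalizers cancel
    mx = logits.max()
    return mx + np.log(np.exp(logits - mx).sum())

def score_decomposed(x):
    """Return (tangent, normal) parts of the score under the stated assumptions."""
    z = B.T @ x                                                   # subspace coordinates
    x_perp = x - B @ z                                            # normal component
    logits = np.log(pis) - 0.5 * np.sum((z - mus) ** 2, axis=1) / (s2 + h)
    r = np.exp(logits - logits.max())
    r /= r.sum()                                                  # component responsibilities
    tangent = B @ (r @ (mus - z)) / (s2 + h)
    normal = -x_perp / h
    return tangent, normal

# One noisy sample near the subspace.
x = B @ (mus[0] + np.sqrt(s2) * rng.normal(size=k)) + np.sqrt(h) * rng.normal(size=d)
tangent, normal = score_decomposed(x)

eps = 1e-5
fd = np.array([(log_density(x + eps * e) - log_density(x - eps * e)) / (2 * eps)
               for e in np.eye(d)])
err = np.max(np.abs(tangent + normal - fd))
```

The normal part carries the $1/h$ singularity and has no component along the subspace, which is the structure the Stage 1 network is designed to capture.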

![Refer to caption](https://arxiv.org/html/2605.20235v1/figures/linear_result.png)

Figure 1: Two-stage learning dynamics in the mixture-of-Gaussians-on-manifold experiment. *Left:* total MSE loss (black), manifold component error (red), and orthogonal contraction error (green, dashed); the vertical dotted line marks the Stage 1 $\to$ Stage 2 switch. *Middle:* the learned 1D score profile along the manifold matches the analytical MoG score after Stage 2. *Right:* the learned score along normal directions matches the theoretical contraction $-x_{\perp}/h_{t}$, confirming correct isotropic behavior off the manifold.

Stacked MNIST. We evaluate on Stacked MNIST, a 1000-mode benchmark that stacks three random MNIST digits into the R/G/B channels ($d=2352$). The data manifold is non-smooth (discrete digit identities per channel), so the VAE's $\mathcal{N}(0,I)$ prior is misaligned with the data geometry, making this a diagnostic test for the Stage-1 objective. SiLD and LDM-CNN share an *identical* CNN encoder, decoder, and latent score network; only the Stage-1 loss differs. Both methods achieve full $1000/1000$ mode coverage (Table [1](https://arxiv.org/html/2605.20235#S5.T1)), ruling out mode collapse; at matched capacity, SiLD attains $2.0\times$ lower FID ($7.94\pm 0.33$ vs. $16.11\pm 0.09$, mean $\pm$ std over 3 seeds) and $1.77\times$ lower reconstruction MSE. The reconstruction gap is visible *before the diffusion head is applied*, pinpointing the KL regularizer rather than the diffusion process as the bottleneck: KL forces the encoder to smooth the thin data manifold into a Gaussian posterior, losing geometric fidelity. Qualitative samples in Appendix [C.2](https://arxiv.org/html/2605.20235#A3.SS2) confirm the same picture.

Table 1: Stacked MNIST (1000-mode benchmark). SiLD and LDM-CNN share an identical CNN encoder, decoder, and latent score network; only the Stage-1 autoencoder objective differs. Both methods achieve full 1000/1000 mode coverage, so the $2.0\times$ FID gap and $1.77\times$ reconstruction-MSE gap reflect latent geometry rather than mode collapse.

Table 2: Generation and reconstruction quality on CelebA ($64\times 64$). All models share identical encoder, decoder, and score network architecture. Only the autoencoder training objective differs.

CelebA. Table [2](https://arxiv.org/html/2605.20235#S5.T2) compares SiLD and LDM on CelebA ($64\times 64$) at matched architectural capacity, isolating the effect of the Stage-1 objective. At matched regularization strategy, SiLD consistently outperforms LDM: with MMD regularization, SiLD attains $42.52$ FID vs. $43.58$ for KL-regularized LDM; with GAN + EMA, SiLD reaches $39.57$ vs. $40.69$. Reconstruction MSE is lower for SiLD across every setting. The pure, unregularized SiLD variant attains the best reconstruction among KL-free methods ($0.00238$ vs. LDM's $0.00265$) but higher FID ($46.22$), reflecting that the latent distribution can contain irregular structures absent any smoothing prior; light auxiliary regularization (MMD or GAN) recovers generation quality while preserving the reconstruction advantage. Similar trends hold on CelebA-HQ ($64\times 64\times 3$, intrinsic dimension $k_{95}=223$); see Appendix [C.4](https://arxiv.org/html/2605.20235#A3.SS4).

Table 3: Unconditional molecule generation results on four MoleculeNet benchmarks. All methods are evaluated on the same protocol (10K samples, RDKit validity, Fréchet ChemNet Distance against a held-out test set). Our reproductions of prior methods use the SELFIES representation for fair comparison. Best result per column per dataset in bold (among generative methods, excluding real data). ‡ SELFIES guarantees syntactic validity. § CharRNN/Transformer AR achieve low FCD by partially memorizing training molecules (novelty $<8\%$ on QM9).

Molecular generation. We evaluate on four MoleculeNet benchmarks (QM9, MUV, HIV, PCBA) using the SELFIES representation Krenn et al. ([2020](https://arxiv.org/html/2605.20235#bib.bib100)), which guarantees syntactic validity by construction and thus isolates *distributional* learning as the object of comparison. SiLD and LDM share identical MLP encoder, decoder, and latent score network ($\sim$13M parameters); only the Stage-1 objective differs. We additionally compare against CharRNN, a 2-layer LSTM at matching scale, and report prior string-based methods on QM9 for external reference.

Table [3](https://arxiv.org/html/2605.20235#S5.T3) reports the results. On QM9 (small molecules), SiLD and LDM both match the reference distribution closely, with SiLD slightly ahead on uniqueness ($97.44\%$ vs. $96.93\%$) and IntDiv ($0.918$ vs. $0.915$). The comparison separates dramatically on the larger, drug-like datasets (HIV, MUV, PCBA): LDM undergoes distributional collapse, with uniqueness dropping to $0.96\%$ on HIV, $95.60\%$ on MUV, and $0.57\%$ on PCBA, and IntDiv collapsing to $0.014$–$0.811$ (versus real-data values of $0.86$–$0.90$). In contrast, SiLD preserves the distribution: uniqueness $98.57\%$–$99.42\%$, IntDiv $0.890$–$0.904$, all within close range of the real training distribution. FCD shows the same pattern: on PCBA, SiLD achieves $17.79$ versus LDM's $77.10$ (real-data self-reference $9.12$). CharRNN, the autoregressive baseline, matches the real-data distribution tightly but mostly by *memorizing*: novelty is $1.05\%$ on QM9 and $4.39\%$ on HIV, meaning it reproduces training molecules rather than generating new ones (except on the large PCBA, where novelty reaches $64.8\%$). SiLD, in contrast, achieves near-perfect novelty ($99.9\%$–$100\%$) while preserving distributional structure, directly demonstrating that the Stage-1 objective produces a latent faithful to the data manifold rather than collapsing onto it or memorizing from it.
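For readers less familiar with the molecular metrics, the sketch below shows one common way to compute uniqueness and novelty from lists of canonical molecule strings. The novelty definition follows the one stated in this section (fraction of valid generations absent from the training set); the uniqueness definition (fraction of distinct strings among valid generations) is an assumption, as the exact evaluation code is not given here, and the toy strings are placeholders rather than real molecules.

```python
def uniqueness(valid_generated):
    """Fraction of distinct strings among the valid generations (assumed definition)."""
    return len(set(valid_generated)) / len(valid_generated)

def novelty(valid_generated, training_set):
    """Fraction of valid generations that do not appear in the training set."""
    train = set(training_set)
    return sum(mol not in train for mol in valid_generated) / len(valid_generated)

# Placeholder strings standing in for canonicalized molecules.
generated = ["A", "A", "B", "C"]
training = ["A", "D"]

u = uniqueness(generated)          # 3 distinct out of 4 generations
nov = novelty(generated, training) # "B" and "C" are absent from the training set
```

A model that memorizes its training set scores high on uniqueness but low on novelty, which is the pattern reported for CharRNN above.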

## 6 Conclusion

We introduced Score-induced Latent Diffusion (SiLD), a theoretically grounded framework in which manifold learning and density estimation emerge from a single DSM objective at different noise scales. The score singularity at small noise, rather than being an obstacle, drives a rapid dimensional collapse onto the data manifold; subsequent training at moderate noise reduces to ridge regression on a $k$-dimensional support, yielding an end-to-end sample complexity governed by the intrinsic dimension. Experiments on Stacked MNIST, CelebA, and four molecular benchmarks validate the predicted two-stage dynamic, most strikingly on drug-like molecular datasets, where KL-regularized LDMs undergo catastrophic distributional collapse while SiLD preserves the intrinsic distribution. More broadly, SiLD reframes the role of regularization in latent diffusion: the geometric structure of the score itself, rather than an externally imposed prior, is sufficient to organize the latent space.

## Acknowledgments

We thank Atsushi Nitanda and Guoji Fu for the useful discussion. WH was supported by JSPS KAKENHI (24K20848) and JST BOOST (JPMJBY24G6). TS was partially supported by JSPS KAKENHI (24K02905) and JST CREST (JPMJCR2015). This research is supported by the National Research Foundation, Singapore and the Ministry of Digital Development and Information under the AI Visiting Professorship Programme (award number AIVP-2024-004). Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of National Research Foundation, Singapore and the Ministry of Digital Development and Information. KF was supported by JSPS KAKENHI Grant-in-Aid for Transformative Research Areas (A) 22H05106.

## References

- Sgd learning on neural networks: leap complexity and saddle\-to\-saddle dynamics\.InThe Thirty Sixth Annual Conference on Learning Theory,pp\. 2552–2623\.Cited by:[§1](https://arxiv.org/html/2605.20235#S1.p3.1)\.
- I\. Azangulov, G\. Deligiannidis, and J\. Rousseau \(2024\)Convergence of diffusion models under the manifold hypothesis in high\-dimensions\.arXiv preprint arXiv:2409\.18804\.Cited by:[§1](https://arxiv.org/html/2605.20235#S1.p2.3),[Remark 4\.11](https://arxiv.org/html/2605.20235#S4.Thmtheorem11.p1.8)\.
- J\. Benton, V\. De Bortoli, A\. Doucet, and G\. Deligiannidis \(2023\)Nearly $d$\-linear convergence bounds for diffusion models via stochastic localization\.arXiv preprint arXiv:2308\.03686\.Cited by:[§B\.5](https://arxiv.org/html/2605.20235#A2.SS5.SSS0.Px3.1.p1.2),[§B\.5](https://arxiv.org/html/2605.20235#A2.SS5.SSS0.Px3.4.p4.2),[§2](https://arxiv.org/html/2605.20235#S2.p1.3),[§4\.3](https://arxiv.org/html/2605.20235#S4.SS3.p1.10),[Remark 4\.11](https://arxiv.org/html/2605.20235#S4.Thmtheorem11.p1.8)\.
- G\. R\. Bickerton, G\. V\. Paolini, J\. Besnard, S\. Muresan, and A\. L\. Hopkins \(2012\)Quantifying the chemical beauty of drugs\.Nature chemistry4\(2\),pp\. 90–98\.Cited by:[§5](https://arxiv.org/html/2605.20235#S5.p1.1)\.
- G\. Biroli, T\. Bonnaire, V\. De Bortoli, and M\. Mézard \(2024\)Dynamical regimes of diffusion models\.Nature Communications15\(1\),pp\. 9957\.Cited by:[§1](https://arxiv.org/html/2605.20235#S1.p1.3)\.
- N\. M\. Boffi, A\. Jacot, S\. Tu, and I\. Ziemann \(2024\)Shallow diffusion networks provably learn hidden low\-dimensional structure\.arXiv preprint arXiv:2410\.11275\.Cited by:[§1](https://arxiv.org/html/2605.20235#S1.p2.3),[§1](https://arxiv.org/html/2605.20235#S1.p3.1)\.
- T\. Bonnaire, R\. Urfin, G\. Biroli, and M\. Mézard \(2025\)Why diffusion models don’t memorize: the role of implicit dynamical regularization in training\.arXiv preprint arXiv:2505\.17638\.Cited by:[§2](https://arxiv.org/html/2605.20235#S2.p2.1)\.
- S\. Chakraborty, Q\. Berthet, and P\. L\. Bartlett \(2026\)Generalization properties of score\-matching diffusion models for intrinsically low\-dimensional data\.arXiv preprint arXiv:2603\.03700\.Cited by:[§2](https://arxiv.org/html/2605.20235#S2.p1.3),[Remark 4\.9](https://arxiv.org/html/2605.20235#S4.Thmtheorem9.p1.1)\.
- N\. Chandramoorthy and A\. de Clercq \(2025\)When and how can inexact generative models still sample from the data manifold?\.arXiv preprint arXiv:2508\.07581\.Cited by:[§2](https://arxiv.org/html/2605.20235#S2.p1.3)\.
- M\. Chen, K\. Huang, T\. Zhao, and M\. Wang \(2023\)Score approximation, estimation and distribution recovery of diffusion models on low\-dimensional data\.InInternational Conference on Machine Learning,pp\. 4672–4712\.Cited by:[§1](https://arxiv.org/html/2605.20235#S1.p2.3),[Remark 4\.11](https://arxiv.org/html/2605.20235#S4.Thmtheorem11.p1.8),[Remark 4\.9](https://arxiv.org/html/2605.20235#S4.Thmtheorem9.p1.1)\.
- L\. Chizat \(2022\)Mean\-field langevin dynamics: exponential convergence and annealing\.arXiv preprint arXiv:2202\.01009\.Cited by:[§1](https://arxiv.org/html/2605.20235#S1.p3.1),[§4\.1](https://arxiv.org/html/2605.20235#S4.SS1.p2.4),[§4\.1](https://arxiv.org/html/2605.20235#S4.SS1.p5.9),[Remark 4\.4](https://arxiv.org/html/2605.20235#S4.Thmtheorem4.p1.6)\.
- H\. Cui, C\. Pehlevan, and Y\. M\. Lu \(2025\)A precise asymptotic analysis of learning diffusion models: theory and insights\.arXiv e\-prints,pp\. arXiv–2501\.Cited by:[§2](https://arxiv.org/html/2605.20235#S2.p2.1)\.
- H\. Cui and L\. Zdeborová \(2023\)High\-dimensional asymptotics of denoising autoencoders\.Advances in Neural Information Processing Systems36,pp\. 11850–11890\.Cited by:[§2](https://arxiv.org/html/2605.20235#S2.p2.1)\.
- A\. Damian, J\. Lee, and M\. Soltanolkotabi \(2022\)Neural networks can learn representations with gradient descent\.InConference on Learning Theory,pp\. 5413–5452\.Cited by:[§1](https://arxiv.org/html/2605.20235#S1.p3.1)\.
- V\. De Bortoli \(2022\)Convergence of denoising diffusion models under the manifold hypothesis\.Transactions on Machine Learning Research\.Cited by:[§B\.5](https://arxiv.org/html/2605.20235#A2.SS5.SSS0.Px3.2.p2.3),[§1](https://arxiv.org/html/2605.20235#S1.p2.3),[Proposition 3\.1](https://arxiv.org/html/2605.20235#S3.Thmtheorem1),[§4\.3](https://arxiv.org/html/2605.20235#S4.SS3.p1.10)\.
- T\. Farghly, P\. Potaptchik, S\. Howard, G\. Deligiannidis, and J\. Pidstrigach \(2025\)Diffusion models and the manifold hypothesis: log\-domain smoothing is geometry adaptive\.arXiv preprint arXiv:2510\.02305\.Cited by:[§1](https://arxiv.org/html/2605.20235#S1.p2.3)\.
- H\. Federer \(1959\)Curvature measures\.Transactions of the American Mathematical Society93\(3\),pp\. 418–491\.Cited by:[§B\.2](https://arxiv.org/html/2605.20235#A2.SS2.2.p2.6)\.
- C\. Fefferman, S\. Mitter, and H\. Narayanan \(2016\)Testing the manifold hypothesis\.Journal of the American Mathematical Society29\(4\),pp\. 983–1049\.Cited by:[§1](https://arxiv.org/html/2605.20235#S1.p1.3)\.
- K\. Fukumizu, W\. Huang, H\. Bao, S\. Xu, and N\. Chandramoothy \(2026\)Flow matching from viewpoint of proximal operators\.arXiv preprint arXiv:2602\.12683\.Cited by:[§2](https://arxiv.org/html/2605.20235#S2.p1.3)\.
- W\. Gao and M\. Li \(2024\)How do flow matching models memorize and generalize in sample data subspaces?\.arXiv preprint arXiv:2410\.23594\.Cited by:[§1](https://arxiv.org/html/2605.20235#S1.p2.3)\.
- A\. J\. George and N\. Macris \(2026\)Asymptotic learning curves for diffusion models with random features score and manifold data\.arXiv preprint arXiv:2603\.22962\.Cited by:[§2](https://arxiv.org/html/2605.20235#S2.p1.3),[Remark 4\.9](https://arxiv.org/html/2605.20235#S4.Thmtheorem9.p1.1)\.
- R\. Gómez\-Bombarelli, J\. N\. Wei, D\. Duvenaud, J\. M\. Hernández\-Lobato, B\. Sánchez\-Lengeling, D\. Sheberla, J\. Aguilera\-Iparraguirre, T\. D\. Hirzel, R\. P\. Adams, and A\. Aspuru\-Guzik \(2018\)Automatic chemical design using a data\-driven continuous representation of molecules\.ACS central science4\(2\),pp\. 268–276\.Cited by:[Table 3](https://arxiv.org/html/2605.20235#S5.T3.17.11.11.2),[Table 3](https://arxiv.org/html/2605.20235#S5.T3.25.19.19.2),[Table 3](https://arxiv.org/html/2605.20235#S5.T3.33.27.27.2),[Table 3](https://arxiv.org/html/2605.20235#S5.T3.41.35.35.2)\.
- A\. Han, W\. Huang, Y\. Cao, and D\. Zou \(2024a\)On the feature learning in diffusion models\.arXiv preprint arXiv:2412\.01021\.Cited by:[§2](https://arxiv.org/html/2605.20235#S2.p2.1)\.
- Y\. Han, M\. Razaviyayn, and R\. Xu \(2024b\)Neural network\-based score estimation in diffusion models: optimization and generalization\.arXiv preprint arXiv:2401\.15604\.Cited by:[§2](https://arxiv.org/html/2605.20235#S2.p2.1)\.
- J\. Heek, E\. Hoogeboom, T\. Mensink, and T\. Salimans \(2026\)Unified latents \(ul\): how to train your latents\.arXiv preprint arXiv:2602\.17270\.Cited by:[§2](https://arxiv.org/html/2605.20235#S2.p3.1)\.
- M\. Heusel, H\. Ramsauer, T\. Unterthiner, B\. Nessler, and S\. Hochreiter \(2017\)Gans trained by a two time\-scale update rule converge to a local nash equilibrium\.Advances in neural information processing systems30\.Cited by:[§5](https://arxiv.org/html/2605.20235#S5.p1.1)\.
- J\. Ho, A\. Jain, and P\. Abbeel \(2020\)Denoising diffusion probabilistic models\.Advances in neural information processing systems33,pp\. 6840–6851\.Cited by:[§1](https://arxiv.org/html/2605.20235#S1.p1.3),[§3\.1](https://arxiv.org/html/2605.20235#S3.SS1.p3.9),[§3\.1](https://arxiv.org/html/2605.20235#S3.SS1.p4.6)\.
- Z\. Huang, Y\. Wei, and Y\. Chen \(2026\)Denoising diffusion probabilistic models are optimally adaptive to unknown low dimensionality\.Mathematics of Operations Research\.Cited by:[§1](https://arxiv.org/html/2605.20235#S1.p2.3),[Remark 4\.11](https://arxiv.org/html/2605.20235#S4.Thmtheorem11.p1.8)\.
- A\. Jacot, F\. Gabriel, and C\. Hongler \(2018\)Neural tangent kernel: convergence and generalization in neural networks\.Advances in neural information processing systems31\.Cited by:[§2](https://arxiv.org/html/2605.20235#S2.p2.1)\.
- D\. P\. Kingma and M\. Welling \(2013\)Auto\-encoding variational bayes\.arXiv preprint arXiv:1312\.6114\.Cited by:[§2](https://arxiv.org/html/2605.20235#S2.p3.1)\.
- M\. Krenn, F\. Häse, A\. Nigam, P\. Friederich, and A\. Aspuru\-Guzik \(2020\)Self\-referencing embedded strings \(selfies\): a 100% robust molecular string representation\.Machine Learning: Science and Technology1\(4\),pp\. 045024\.Cited by:[§5](https://arxiv.org/html/2605.20235#S5.p5.1)\.
- S\. Kumar, Y\. Wang, and L\. Lin \(2026\)Flow matching is adaptive to manifold structures\.arXiv preprint arXiv:2602\.22486\.Cited by:[§1](https://arxiv.org/html/2605.20235#S1.p2.3)\.
- G\. Li and Y\. Yan \(2024\)Adapting to unknown low\-dimensional structures in score\-based diffusion models\.Advances in Neural Information Processing Systems37,pp\. 126297–126331\.Cited by:[§1](https://arxiv.org/html/2605.20235#S1.p2.3),[Remark 4\.11](https://arxiv.org/html/2605.20235#S4.Thmtheorem11.p1.8)\.
- X\. Li, Z\. Shen, Y\. Hsieh, and N\. He \(2026\)When scores learn geometry: rate separations under the manifold hypothesis\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=34V0IZytle)Cited by:[§1](https://arxiv.org/html/2605.20235#S1.p2.3),[§3\.2](https://arxiv.org/html/2605.20235#S3.SS2.SSS0.Px1.p1.7),[Proposition 3\.1](https://arxiv.org/html/2605.20235#S3.Thmtheorem1),[Remark 4\.6](https://arxiv.org/html/2605.20235#S4.Thmtheorem6.p1.5)\.
- X\. Li, Z\. Zhang, X\. Li, S\. Chen, Z\. Zhu, P\. Wang, and Q\. Qu \(2025\)Understanding representation dynamics of diffusion models via low\-dimensional modeling\.arXiv preprint arXiv:2502\.05743\.Cited by:[§2](https://arxiv.org/html/2605.20235#S2.p2.1)\.
- Z\. Liu, W\. Zhang, and T\. Li \(2025\)Improving the euclidean diffusion generation of manifold data by mitigating score function singularity\.arXiv preprint arXiv:2505\.09922\.Cited by:[§2](https://arxiv.org/html/2605.20235#S2.p1.3),[§3\.2](https://arxiv.org/html/2605.20235#S3.SS2.SSS0.Px2.p1.2)\.
- G\. Loaiza\-Ganem, B\. L\. Ross, R\. Hosseinzadeh, A\. L\. Caterini, and J\. C\. Cresswell \(2024\)Deep generative models through the lens of the manifold hypothesis: a survey and new connections\.arXiv preprint arXiv:2404\.02954\.Cited by:[§1](https://arxiv.org/html/2605.20235#S1.p1.3)\.
- S\. Mei, A\. Montanari, and P\. Nguyen \(2018\)A mean field view of the landscape of two\-layer neural networks\.Proceedings of the National Academy of Sciences115\(33\),pp\. E7665–E7671\.Cited by:[§1](https://arxiv.org/html/2605.20235#S1.p3.1),[§4\.1](https://arxiv.org/html/2605.20235#S4.SS1.p2.4),[§4\.1](https://arxiv.org/html/2605.20235#S4.SS1.p5.9),[Remark 4\.4](https://arxiv.org/html/2605.20235#S4.Thmtheorem4.p1.6)\.
- A\. Mousavi\-Hosseini, S\. Park, M\. Girotti, I\. Mitliagkas, and M\. A\. Erdogdu \(2022\)Neural networks efficiently learn low\-dimensional representations with sgd\.arXiv preprint arXiv:2209\.14863\.Cited by:[§1](https://arxiv.org/html/2605.20235#S1.p3.1)\.
- E\. Noutahi, C\. Gabellini, M\. Craig, J\. S\. Lim, and P\. Tossou \(2024\)Gotta be safe: a new framework for molecular design\.Digital Discovery3\(4\),pp\. 796–804\.Cited by:[Table 3](https://arxiv.org/html/2605.20235#S5.T3.15.9.9.1),[Table 3](https://arxiv.org/html/2605.20235#S5.T3.23.17.17.1),[Table 3](https://arxiv.org/html/2605.20235#S5.T3.31.25.25.1),[Table 3](https://arxiv.org/html/2605.20235#S5.T3.39.33.33.1)\.
- K\. Oko, S\. Akiyama, and T\. Suzuki \(2023\)Diffusion models are minimax optimal distribution estimators\.InInternational Conference on Machine Learning,pp\. 26517–26582\.Cited by:[§1](https://arxiv.org/html/2605.20235#S1.p2.3),[Remark 4\.11](https://arxiv.org/html/2605.20235#S4.Thmtheorem11.p1.8),[Remark 4\.9](https://arxiv.org/html/2605.20235#S4.Thmtheorem9.p1.1)\.
- J\. Pidstrigach \(2022\)Score\-based generative models detect manifolds\.Advances in Neural Information Processing Systems35,pp\. 35852–35865\.Cited by:[§1](https://arxiv.org/html/2605.20235#S1.p2.3)\.
- A\. Pinkus \(1999\)Approximation theory of the mlp model in neural networks\.Acta numerica8,pp\. 143–195\.Cited by:[§B\.2](https://arxiv.org/html/2605.20235#A2.SS2.4.p4.15)\.
- P\. Potaptchik, I\. Azangulov, and G\. Deligiannidis \(2024\)Linear convergence of diffusion models under the manifold hypothesis\.arXiv preprint arXiv:2410\.09046\.Cited by:[§1](https://arxiv.org/html/2605.20235#S1.p2.3)\.
- K\. Preuer, P\. Renz, T\. Unterthiner, S\. Hochreiter, and G\. Klambauer \(2018\)Fréchet chemnet distance: a metric for generative models for molecules in drug discovery\.Journal of chemical information and modeling58\(9\),pp\. 1736–1741\.Cited by:[§5](https://arxiv.org/html/2605.20235#S5.p1.1)\.
- O\. Prykhodko, S\. V\. Johansson, P\. Kotsias, J\. Arús\-Pous, E\. J\. Bjerrum, O\. Engkvist, and H\. Chen \(2019\)A de novo molecular generation method using latent vector based generative adversarial network\.Journal of cheminformatics11\(1\),pp\. 74\.Cited by:[Table 3](https://arxiv.org/html/2605.20235#S5.T3.18.12.12.2),[Table 3](https://arxiv.org/html/2605.20235#S5.T3.26.20.20.2),[Table 3](https://arxiv.org/html/2605.20235#S5.T3.34.28.28.2),[Table 3](https://arxiv.org/html/2605.20235#S5.T3.42.36.36.2)\.
- R\. Rombach, A\. Blattmann, D\. Lorenz, P\. Esser, and B\. Ommer \(2022\)High\-resolution image synthesis with latent diffusion models\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,pp\. 10684–10695\.Cited by:[3rd item](https://arxiv.org/html/2605.20235#S1.I1.i3.p1.1),[§2](https://arxiv.org/html/2605.20235#S2.p3.1),[Table 3](https://arxiv.org/html/2605.20235#S5.T3.19.13.13.2),[Table 3](https://arxiv.org/html/2605.20235#S5.T3.27.21.21.2),[Table 3](https://arxiv.org/html/2605.20235#S5.T3.35.29.29.2),[Table 3](https://arxiv.org/html/2605.20235#S5.T3.43.37.37.2),[§5](https://arxiv.org/html/2605.20235#S5.p1.1)\.
- M\. H\. Segler, T\. Kogej, C\. Tyrchan, and M\. P\. Waller \(2018\)Generating focused molecule libraries for drug discovery with recurrent neural networks\.ACS central science4\(1\),pp\. 120–131\.Cited by:[Table 3](https://arxiv.org/html/2605.20235#S5.T3.13.7.7.1),[Table 3](https://arxiv.org/html/2605.20235#S5.T3.21.15.15.1),[Table 3](https://arxiv.org/html/2605.20235#S5.T3.29.23.23.1),[Table 3](https://arxiv.org/html/2605.20235#S5.T3.37.31.31.1)\.
- K\. Shah, S\. Chen, and A\. Klivans \(2023\)Learning mixtures of gaussians using the ddpm objective\.Advances in Neural Information Processing Systems36,pp\. 19636–19649\.Cited by:[§1](https://arxiv.org/html/2605.20235#S1.p3.1),[§2](https://arxiv.org/html/2605.20235#S2.p2.1)\.
- J\. Sohl\-Dickstein, E\. Weiss, N\. Maheswaranathan, and S\. Ganguli \(2015\)Deep unsupervised learning using nonequilibrium thermodynamics\.InInternational conference on machine learning,pp\. 2256–2265\.Cited by:[§1](https://arxiv.org/html/2605.20235#S1.p1.3)\.
- Y\. Song and S\. Ermon \(2019\)Generative modeling by estimating gradients of the data distribution\.Advances in neural information processing systems32\.Cited by:[§1](https://arxiv.org/html/2605.20235#S1.p1.3)\.
- Y\. Song, J\. Sohl\-Dickstein, D\. P\. Kingma, A\. Kumar, S\. Ermon, and B\. Poole \(2020\)Score\-based generative modeling through stochastic differential equations\.arXiv preprint arXiv:2011\.13456\.Cited by:[§1](https://arxiv.org/html/2605.20235#S1.p1.3)\.
- J\. P\. Stanczuk, G\. Batzolis, T\. Deveney, and C\. Schönlieb \(2024\)Diffusion models encode the intrinsic dimension of data manifolds\.InForty\-first International Conference on Machine Learning,Cited by:[§1](https://arxiv.org/html/2605.20235#S1.p2.3)\.
- T\. Suzuki, D\. Wu, and A\. Nitanda \(2023\)Convergence of mean\-field langevin dynamics: time\-space discretization, stochastic gradient, and variance reduction\.Advances in Neural Information Processing Systems36,pp\. 15545–15577\.Cited by:[§1](https://arxiv.org/html/2605.20235#S1.p3.1),[§4\.1](https://arxiv.org/html/2605.20235#S4.SS1.p2.4),[§4\.1](https://arxiv.org/html/2605.20235#S4.SS1.p5.9)\.
- R\. Tang and Y\. Yang \(2024\)Adaptivity of diffusion models to manifold structures\.InInternational conference on artificial intelligence and statistics,pp\. 1648–1656\.Cited by:[§1](https://arxiv.org/html/2605.20235#S1.p2.3),[Remark 4\.11](https://arxiv.org/html/2605.20235#S4.Thmtheorem11.p1.8),[Remark 4\.9](https://arxiv.org/html/2605.20235#S4.Thmtheorem9.p1.1)\.
- A\. Vahdat, K\. Kreis, and J\. Kautz \(2021\)Score\-based generative modeling in latent space\.Advances in neural information processing systems34,pp\. 11287–11302\.Cited by:[§2](https://arxiv.org/html/2605.20235#S2.p3.1)\.
- P\. Vincent \(2011\)A connection between score matching and denoising autoencoders\.Neural computation23\(7\),pp\. 1661–1674\.Cited by:[§1](https://arxiv.org/html/2605.20235#S1.p1.3),[§3\.1](https://arxiv.org/html/2605.20235#S3.SS1.p4.6)\.
- M\. J\. Wainwright \(2019\)High\-dimensional statistics: a non\-asymptotic viewpoint\.Vol\.48,Cambridge university press\.Cited by:[§1](https://arxiv.org/html/2605.20235#S1.p1.3)\.
- B\. Wang and C\. Pehlevan \(2025\)An analytical theory of spectral bias in the learning dynamics of diffusion models\.arXiv preprint arXiv:2503\.03206\.Cited by:[§2](https://arxiv.org/html/2605.20235#S2.p2.1)\.
- B\. Wang and J\. J\. Vastola \(2023\)Diffusion models generate images like painters: an analytical theory of outline first, details later\.arXiv preprint arXiv:2303\.02490\.Cited by:[§2](https://arxiv.org/html/2605.20235#S2.p2.1)\.
- P\. Wang, H\. Zhang, Z\. Zhang, S\. Chen, Y\. Ma, and Q\. Qu \(2025\)Diffusion models learn low\-dimensional distributions via subspace clustering\.In2025 IEEE 10th International Workshop on Computational Advances in Multi\-Sensor Adaptive Processing \(CAMSAP\),pp\. 211–215\.Cited by:[§1](https://arxiv.org/html/2605.20235#S1.p3.1),[§2](https://arxiv.org/html/2605.20235#S2.p2.1)\.
- Y\. Wang, H\. Zhao, S\. Sciabola, and W\. Wang \(2023\)CMolGPT: a conditional generative pre\-trained transformer for target\-specific de novo molecular generation\.Molecules28\(11\),pp\. 4430\.Cited by:[Table 3](https://arxiv.org/html/2605.20235#S5.T3.15.9.9.1),[Table 3](https://arxiv.org/html/2605.20235#S5.T3.23.17.17.1),[Table 3](https://arxiv.org/html/2605.20235#S5.T3.31.25.25.1),[Table 3](https://arxiv.org/html/2605.20235#S5.T3.39.33.33.1)\.
- C\. Zeno, H\. Manor, G\. Ongie, N\. Weinberger, T\. Michaeli, and D\. Soudry \(2025\)When diffusion models memorize: inductive biases in probability flow of minimum\-norm shallow neural nets\.arXiv preprint arXiv:2506\.19031\.Cited by:[§2](https://arxiv.org/html/2605.20235#S2.p2.1)\.
- F\. Zhang and M\. Pilanci \(2024\)Analyzing neural network\-based generative diffusion models through convex optimization\.arXiv preprint arXiv:2402\.01965\.Cited by:[§2](https://arxiv.org/html/2605.20235#S2.p2.1)\.
- R\. Zhang, P\. Isola, A\. A\. Efros, E\. Shechtman, and O\. Wang \(2018\)The unreasonable effectiveness of deep features as a perceptual metric\.InProceedings of the IEEE conference on computer vision and pattern recognition,pp\. 586–595\.Cited by:[§5](https://arxiv.org/html/2605.20235#S5.p1.1)\.

## Appendix A Limitations

Our work is deliberately scoped to characterize the training dynamics of score matching when data lies on a low-dimensional manifold, the regime where the score singularity and intrinsic geometry interact most directly. The complementary problem of statistical optimality at the level of converged estimators, which has been well studied in prior work, is intentionally not the focus of our analysis. On the empirical side, we evaluate SiLD as a framework rather than as a system optimized for any single benchmark; integration with large-scale architectures and domain-specific design (e.g., text-to-image, video, scientific simulation) are natural extensions left to future work.

## Appendix B Proofs of Main Results

We collect the proofs of all theoretical results in this appendix, organized in the order they appear in the main text.

### B.1 Proof of Proposition [3.1](https://arxiv.org/html/2605.20235#S3.Thmtheorem1)

###### Proof of Proposition [3.1](https://arxiv.org/html/2605.20235#S3.Thmtheorem1).

We reduce the VP marginal to a heat-kernel convolution on $\mathcal{M}$ via a rescaling, apply Laplace's method in Fermi coordinates, and return to $x$.

Step 1: Marginal density. Under the VP forward process, $x_t \mid x_0 = z \sim \mathcal{N}(a_t z,\, h(t) I_d)$ with $a_t := \sqrt{1 - h(t)}$, so

$$p_t(x) = (2\pi h(t))^{-d/2} \int_{\mathcal{M}} p_0(z)\, \exp\!\left(-\frac{\|x - a_t z\|^2}{2h(t)}\right) d\mathrm{Vol}_{\mathcal{M}}(z). \tag{18}$$

Step 2: Rescaling to a heat kernel on $\mathcal{M}$. Set $y := x/a_t$ and $\sigma^2 := h(t)/a_t^2$. Using $\|x - a_t z\|^2 = a_t^2 \|y - z\|^2$, equation ([18](https://arxiv.org/html/2605.20235#A2.E18)) becomes

$$p_t(x) = a_t^{-d}\, q_{\sigma^2}(y), \qquad q_{\sigma^2}(y) := \frac{1}{(2\pi\sigma^2)^{d/2}} \int_{\mathcal{M}} p_0(z)\, e^{-\|y - z\|^2/(2\sigma^2)}\, d\mathrm{Vol}_{\mathcal{M}}(z).$$

Thus $q_{\sigma^2}$ is the (Euclidean) heat-kernel convolution of $p_0$ on $\mathcal{M}$ at scale $\sigma^2$, and

$$\log p_t(x) = -d \log a_t + \log q_{\sigma^2}(x/a_t). \tag{19}$$

Since $\sigma^2 = h(t) + O(h(t)^2)$, the regime $h(t) \ll 1$ is equivalent to $\sigma^2 \ll 1$.

Step 3: Fermi coordinates on $\mathcal{M}$. Fix $y \in U_\tau$ and let $z_y^* := \Pi_{\mathcal{M}}(y)$, $\nu_y := y - z_y^* \in N_{z_y^*}\mathcal{M}$. Pick Fermi (geodesic normal) coordinates $u = (u^1, \ldots, u^k)$ on $\mathcal{M}$ centered at $z_y^*$, with $\{e_1, \ldots, e_k\}$ an orthonormal frame of $T_{z_y^*}\mathcal{M}$. The standard expansions are

$$z(u) = z_y^* + \sum_i e_i u^i + \tfrac{1}{2} \sum_{i,j} \mathrm{II}(e_i, e_j)\, u^i u^j + O(|u|^3), \tag{20}$$

$$\sqrt{\det g(u)} = 1 - \tfrac{1}{6} \mathrm{Ric}_{ij}(z_y^*)\, u^i u^j + O(|u|^3), \tag{21}$$

where $\mathrm{II}: T_{z_y^*}\mathcal{M} \times T_{z_y^*}\mathcal{M} \to N_{z_y^*}\mathcal{M}$ is the vector-valued second fundamental form. Because $\nu_y \perp T_{z_y^*}\mathcal{M}$ and $\mathrm{II}$ is normal-valued, a direct computation gives

$$\|y - z(u)\|^2 = \|\nu_y\|^2 + u^\top (I_k - B_{\nu_y})\, u + O(|u|^3), \qquad (B_{\nu_y})_{ij} := \langle \nu_y,\, \mathrm{II}(e_i, e_j) \rangle. \tag{22}$$

Positive definiteness of $I_k - B_{\nu_y}$ on $U_\tau$ follows from $\|B_{\nu_y}\|_{\mathrm{op}} \leq \|\nu_y\|/\tau < 1$, since the principal curvatures of $\mathcal{M}$ are bounded by $1/\tau$.

Step 4: Laplace approximation of $q_{\sigma^2}$. Substituting ([22](https://arxiv.org/html/2605.20235#A2.E22)) into the integral, expanding $p_0(z(u)) = p_0(z_y^*) + \nabla_{\mathcal{M}} p_0(z_y^*) \cdot u + O(|u|^2)$, and absorbing the volume-element correction into the $O(|u|^2)$ remainder, the first-order term in $u$ vanishes by symmetry. The Gaussian integral $\int_{\mathbb{R}^k} e^{-u^\top (I_k - B_{\nu_y}) u/(2\sigma^2)}\, du = (2\pi\sigma^2)^{k/2} \det(I_k - B_{\nu_y})^{-1/2}$ then yields

$$q_{\sigma^2}(y) = (2\pi\sigma^2)^{-(d-k)/2}\, p_0(z_y^*)\, \frac{e^{-d_{\mathcal{M}}(y)^2/(2\sigma^2)}}{\det(I_k - B_{\nu_y})^{1/2}}\, \bigl(1 + O(\sigma^2)\bigr). \tag{23}$$
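A minimal numerical sketch of (23), under assumptions that are not part of the paper: the manifold is the unit circle in $\mathbb{R}^2$ ($d = 2$, $k = 1$, $\tau = 1$), $p_0$ is uniform, and $y$ sits at radius $\rho$. In that toy case $\|y - z(u)\|^2 - (\rho - 1)^2 = 2\rho(1 - \cos u)$, so $I_k - B_{\nu_y} = \rho$, and the factors shared by the exact integral and the Laplace formula cancel. The function name `laplace_ratio` is hypothetical.

```python
import math

def laplace_ratio(rho, sigma2, n=20000):
    """Ratio q_exact / q_laplace for uniform data on the unit circle in R^2.

    The common factor p0 * exp(-(rho - 1)^2 / (2 sigma2)) / (2 pi sigma2)
    cancels, so only the tangential integral is compared.
    """
    step = 2.0 * math.pi / n
    integral = 0.0
    for i in range(n):
        u = -math.pi + i * step
        # ||y - z(u)||^2 - (rho - 1)^2 = 2 rho (1 - cos u), divided by 2 sigma2
        integral += math.exp(-rho * (1.0 - math.cos(u)) / sigma2)
    integral *= step  # periodic trapezoid rule
    # Gaussian integral of exp(-rho u^2 / (2 sigma2)) over the real line
    gaussian = math.sqrt(2.0 * math.pi * sigma2 / rho)
    return integral / gaussian

if __name__ == "__main__":
    for sigma2 in (0.04, 0.01, 0.0025):
        print(sigma2, laplace_ratio(1.2, sigma2) - 1.0)
```

In this example the relative error should shrink roughly in proportion to $\sigma^2$, which is the behavior the $1 + O(\sigma^2)$ factor describes.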
Step 5: Pullback to the $x$-variable. Combining ([19](https://arxiv.org/html/2605.20235#A2.E19)) and ([23](https://arxiv.org/html/2605.20235#A2.E23)), and collecting $x$-independent terms into $C(t)$,

$$\log p_t(x) = -\frac{d_{\mathcal{M}}(x/a_t)^2}{2\sigma^2} + \log p_0\!\bigl(\Pi_{\mathcal{M}}(x/a_t)\bigr) - \tfrac{1}{2} \log\det\bigl(I_k - B_{\nu_{x/a_t}}\bigr) + C(t) + O(h(t)). \tag{24}$$

We expand each term to leading order in $h(t)$. Using $1/a_t - 1 = h(t)/2 + O(h(t)^2)$ and Federer's identity $\nabla\bigl[\tfrac{1}{2} d_{\mathcal{M}}^2\bigr](x) = x - \Pi_{\mathcal{M}}(x) = \nu_x$,

$$d_{\mathcal{M}}(x/a_t)^2 = d_{\mathcal{M}}(x)^2 + h(t)\, \langle \nu_x,\, x \rangle + O(h(t)^2).$$

Combined with $1/\sigma^2 = 1/h(t) - 1 + O(h(t))$,

$$-\frac{d_{\mathcal{M}}(x/a_t)^2}{2\sigma^2} = -\frac{d_{\mathcal{M}}(x)^2}{2h(t)} + \tfrac{1}{2} d_{\mathcal{M}}(x)^2 - \tfrac{1}{2} \langle \nu_x, x \rangle + O(h(t)).$$

Using $x = \Pi_{\mathcal{M}}(x) + \nu_x$ and $\langle \nu_x, \nu_x \rangle = d_{\mathcal{M}}(x)^2$, the two $O(1)$ pieces collapse:

$$\tfrac{1}{2} d_{\mathcal{M}}(x)^2 - \tfrac{1}{2} \langle \nu_x, x \rangle = -\tfrac{1}{2} \langle \nu_x,\, \Pi_{\mathcal{M}}(x) \rangle.$$

The remaining two terms in ([24](https://arxiv.org/html/2605.20235#A2.E24)) differ from their $a_t = 1$ counterparts by $O(h(t))$ (since $\Pi_{\mathcal{M}}$ and $\nu_{\cdot}$ are $C^1$ on $U_\tau$ and $x/a_t - x = O(h(t))$), so they contribute only to the remainder. Collecting:

$$\log p_t(x) = -\frac{d_{\mathcal{M}}(x)^2}{2h(t)} + \log p_0(\Pi_{\mathcal{M}}(x)) + H(\Pi_{\mathcal{M}}(x), \nu_x) + C(t) + O(h(t)), \tag{25}$$

where the $O(1)$ correction term is

$$H(z, \nu) := -\tfrac{1}{2} \langle \nu,\, z \rangle - \tfrac{1}{2} \log\det\bigl(I_k - B_\nu\bigr), \qquad z \in \mathcal{M},\ \nu \in N_z\mathcal{M}. \tag{26}$$

Crucially, $H$ depends only on $(\Pi_{\mathcal{M}}(x),\, \nu_x)$, as required. The first summand in ([26](https://arxiv.org/html/2605.20235#A2.E26)) is the VP-shrinkage contribution, while the second is the extrinsic-curvature correction.

Step 6: Taking the gradient. Applying $\nabla_x$ to ([25](https://arxiv.org/html/2605.20235#A2.E25)) and using Federer's identity once more,

$$s^*(x, t) = -\frac{x - \Pi_{\mathcal{M}}(x)}{h(t)} + \nabla_x \log p_0\!\bigl(\Pi_{\mathcal{M}}(x)\bigr) + \nabla_x H\!\bigl(\Pi_{\mathcal{M}}(x),\, \nu_x\bigr) + O(h(t)),$$

which is the decomposition in the statement, with remainder $o(1)$ as $h(t) \to 0$. Term (I), $-(x - \Pi_{\mathcal{M}}(x))/h(t)$, is the conservative normal restoring force of order $O(h(t)^{-1})$; term (II), $\nabla \log p_0 + \nabla H$, is the $O(1)$ tangential score plus boundary corrections. ∎
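The decomposition can be checked numerically in a toy case that is an assumption of this sketch, not an experiment from the paper: data uniform on the unit circle in $\mathbb{R}^2$. By symmetry the score is radial. For $x$ at radius $\rho$ one has $\langle \nu_x, \Pi_{\mathcal{M}}(x) \rangle = \rho - 1$ and $\det(I_k - B_{\nu_x}) = \rho$, so (26) gives $H = -(\rho - 1)/2 - \tfrac{1}{2}\log\rho$ and the predicted radial score is $-(\rho - 1)/h - 1/2 - 1/(2\rho)$. All function names below are hypothetical.

```python
import math

def log_pt(rho, h, n=20000):
    """log p_t at radius rho, up to an additive constant independent of x,
    for data uniform on the unit circle and x_t | x_0 = z ~ N(a_t z, h I_2)."""
    a = math.sqrt(1.0 - h)
    step = 2.0 * math.pi / n
    total = 0.0
    for i in range(n):
        u = -math.pi + i * step
        # ||x - a z(u)||^2 = (rho - a)^2 + 2 a rho (1 - cos u); the first
        # part is factored out of the integral for numerical stability.
        total += math.exp(-a * rho * (1.0 - math.cos(u)) / h)
    return -(rho - a) ** 2 / (2.0 * h) + math.log(total * step)

def radial_score_numeric(rho, h, eps=1e-5):
    """Central finite difference of log p_t in the radial direction."""
    return (log_pt(rho + eps, h) - log_pt(rho - eps, h)) / (2.0 * eps)

def radial_score_predicted(rho, h):
    """Term (I) plus the radial derivative of H for the unit circle."""
    return -(rho - 1.0) / h - 0.5 - 0.5 / rho

if __name__ == "__main__":
    for h in (0.04, 0.01):
        gap = radial_score_numeric(1.1, h) - radial_score_predicted(1.1, h)
        print(h, gap)
```

In this example the gap between the two quantities should decrease as $h$ decreases, in line with the $O(h(t))$ remainder.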

### B.2 Proof of Lemma [4.1](https://arxiv.org/html/2605.20235#S4.Thmtheorem1)

###### Proof of Lemma [4.1](https://arxiv.org/html/2605.20235#S4.Thmtheorem1).

*Gradient identity.* Since $\Psi'(z) = \sigma(z)$, the chain rule gives $\nabla_x \Psi(w_j^\top x + b_j) = \sigma(w_j^\top x + b_j)\, w_j$. Differentiating $\Phi_{\mathrm{net}}$ termwise,

$$\nabla_x \Phi_{\mathrm{net}}(x; \theta) = \frac{x}{h(t)} - \frac{1}{h(t)} \sum_{j=1}^{m} a_j\, \sigma(w_j^\top x + b_j)\, w_j = \frac{x}{h(t)} - \frac{1}{h(t)}\, W\, \mathrm{diag}(a)\, \sigma(W^\top x + b) = -f_1(x; \theta),$$

which establishes the identity on all of $\mathbb{R}^d$.
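The gradient identity can be checked by finite differences. The sketch below assumes a specific activation pair that the source does not fix, $\sigma = \tanh$ with primitive $\Psi = \log\cosh$, and stores $W$ as a list of neuron vectors $w_j$ (the columns of $W \in \mathbb{R}^{d \times m}$). The names `phi_net`, `f1` and `grad_fd` are hypothetical.

```python
import math

def phi_net(x, W, a, b, h):
    """Potential ||x||^2/(2h) - (1/h) sum_j a_j Psi(w_j^T x + b_j), Psi = log cosh."""
    quad = sum(xi * xi for xi in x) / (2.0 * h)
    ridge = 0.0
    for w_j, a_j, b_j in zip(W, a, b):
        pre = sum(wi * xi for wi, xi in zip(w_j, x)) + b_j
        ridge += a_j * math.log(math.cosh(pre))
    return quad - ridge / h

def f1(x, W, a, b, h):
    """Score network (1/h) (sum_j a_j sigma(w_j^T x + b_j) w_j - x), sigma = tanh."""
    out = [-xi for xi in x]
    for w_j, a_j, b_j in zip(W, a, b):
        pre = sum(wi * xi for wi, xi in zip(w_j, x)) + b_j
        act = a_j * math.tanh(pre)
        out = [o + act * wi for o, wi in zip(out, w_j)]
    return [o / h for o in out]

def grad_fd(fun, x, eps=1e-6):
    """Central finite-difference gradient of a scalar function."""
    grad = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += eps
        xm[i] -= eps
        grad.append((fun(xp) - fun(xm)) / (2.0 * eps))
    return grad

if __name__ == "__main__":
    W = [[0.3, -0.7], [1.1, 0.4], [-0.5, 0.9]]  # three neurons in R^2
    a, b, h = [0.8, -0.6, 0.5], [0.1, -0.2, 0.3], 0.05
    x = [0.4, -1.3]
    print(grad_fd(lambda z: phi_net(z, W, a, b, h), x))
    print([-s for s in f1(x, W, a, b, h)])
```

The two printed vectors should agree up to finite-difference error, which is the statement $\nabla_x \Phi_{\mathrm{net}} = -f_1$ for this particular choice of activation.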

*Projection approximation.* Define the target potential $\Phi^*(x, t) := d_{\mathcal{M}}(x)^2/(2\, h(t))$ and the smooth part

$$G^*(x) := \tfrac{1}{2} \|x\|^2 - \tfrac{1}{2} d_{\mathcal{M}}(x)^2 = \langle x, \Pi_{\mathcal{M}}(x) \rangle - \tfrac{1}{2} \|\Pi_{\mathcal{M}}(x)\|^2, \qquad x \in U_\tau,$$

where the second equality follows by expanding $d_{\mathcal{M}}(x)^2 = \|x - \Pi_{\mathcal{M}}(x)\|^2 = \|x\|^2 - 2\langle x, \Pi_{\mathcal{M}}(x) \rangle + \|\Pi_{\mathcal{M}}(x)\|^2$. Correspondingly, define the ridge sum

$$G_\theta(x) := \sum_{j=1}^{m} a_j\, \Psi(w_j^\top x + b_j),$$

so that $\Phi_{\mathrm{net}}(x; \theta) = \|x\|^2/(2\, h(t)) - G_\theta(x)/h(t)$ and $\Phi^*(x, t) = \|x\|^2/(2\, h(t)) - G^*(x)/h(t)$. Taking gradients,

$$\nabla_x G_\theta(x) = \sum_{j=1}^{m} a_j\, \sigma(w_j^\top x + b_j)\, w_j = W\, \mathrm{diag}(a)\, \sigma(W^\top x + b), \qquad \nabla_x G^*(x) = \Pi_{\mathcal{M}}(x), \tag{27}$$

where the second identity follows from Federer's formula $\nabla_x[\tfrac{1}{2} d_{\mathcal{M}}(x)^2] = x - \Pi_{\mathcal{M}}(x)$ on $U_\tau$ [[17](https://arxiv.org/html/2605.20235#bib.bib108)]. Therefore

$$W\, \mathrm{diag}(a)\, \sigma(W^\top x + b) - \Pi_{\mathcal{M}}(x) = \nabla_x G_\theta(x) - \nabla_x G^*(x),$$

and ([6](https://arxiv.org/html/2605.20235#S4.E6)) reduces to approximating $G^*$ by ridge sums $G_\theta$ in the $C^1$-norm on $K := \overline{U_{\tau/2}}$.
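The identity $\nabla_x G^*(x) = \Pi_{\mathcal{M}}(x)$ is easy to verify in a toy case. The sketch below assumes the unit circle in $\mathbb{R}^2$ (an illustrative choice, not an example from the paper), where $d_{\mathcal{M}}(x) = |\,\|x\| - 1\,|$ and $\Pi_{\mathcal{M}}(x) = x/\|x\|$; the helper names are hypothetical.

```python
import math

def g_star(x):
    """G*(x) = ||x||^2/2 - d_M(x)^2/2 for the unit circle in R^2."""
    r = math.hypot(x[0], x[1])
    return 0.5 * r * r - 0.5 * (r - 1.0) ** 2

def project_circle(x):
    """Nearest-point projection onto the unit circle (defined for x != 0)."""
    r = math.hypot(x[0], x[1])
    return [x[0] / r, x[1] / r]

def grad_fd(fun, x, eps=1e-6):
    """Central finite-difference gradient of a scalar function."""
    grad = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += eps
        xm[i] -= eps
        grad.append((fun(xp) - fun(xm)) / (2.0 * eps))
    return grad

if __name__ == "__main__":
    x = [0.9, -0.8]  # inside the tube U_tau with tau = 1
    print(grad_fd(g_star, x))
    print(project_circle(x))
```

Here $G^*(x) = \|x\| - \tfrac{1}{2}$, so its gradient is $x/\|x\|$, matching the projection.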

The set $K$ is compact since $\mathcal{M}$ is compact and $K$ is the closed tube of radius $\tau/2$ around it. Under the standing assumption $\mathcal{M} \in C^\infty$, Federer's structure theorem gives $\Pi_{\mathcal{M}} \in C^\infty(U_\tau)$, hence $G^* \in C^\infty(K)$. In particular $G^* \in C^1(K)$.

Since $\sigma \in C^2$ is non-polynomial, so is its primitive $\Psi \in C^3$; indeed, if $\Psi$ were polynomial then $\sigma = \Psi'$ would be polynomial, contradicting the hypothesis. The conservative-form parametrization $\theta = (W, a, b)$ produces precisely the standard ridge dictionary

$$\mathcal{R}(\Psi) := \Big\{ \textstyle\sum_{j=1}^{m} a_j\, \Psi(w_j^\top x + b_j) \,:\, m \in \mathbb{N},\; a_j \in \mathbb{R},\; w_j \in \mathbb{R}^d,\; b_j \in \mathbb{R} \Big\}$$

for the non-polynomial activation $\Psi \in C^3$. By the $C^k$ density theorem for ridge functions [[43](https://arxiv.org/html/2605.20235#bib.bib109), Thm. 4.1], $\mathcal{R}(\Psi)$ is dense in $C^1(K)$ under the norm $\|g\|_{C^1(K)} := \sup_K |g| + \sup_K \|\nabla g\|$. Consequently, for any $\delta > 0$ there exist $m \in \mathbb{N}$ and parameters $\theta = (W, a, b) \in \mathbb{R}^{d \times m} \times \mathbb{R}^m \times \mathbb{R}^m$ with $\|G_\theta - G^*\|_{C^1(K)} < \delta$. Choosing $\delta := \varepsilon$,

$$\sup_{x \in K} \|W\, \mathrm{diag}(a)\, \sigma(W^\top x + b) - \Pi_{\mathcal{M}}(x)\| = \sup_{x \in K} \|\nabla G_\theta(x) - \nabla G^*(x)\| \leq \|G_\theta - G^*\|_{C^1(K)} < \varepsilon,$$

which is ([6](https://arxiv.org/html/2605.20235#S4.E6)). ∎

### B.3 Proof of Theorem [4.5](https://arxiv.org/html/2605.20235#S4.Thmtheorem5)

###### Proof of Theorem [4.5](https://arxiv.org/html/2605.20235#S4.Thmtheorem5).

Step 1: Initial risk bound. At $s = 0$, parameters are drawn from $q_0 = \nu_w \otimes \nu_a \otimes \nu_b$ with $\nu_w = \mathcal{N}(0, \sigma_w^2 I_d)$ and $|a| \leq A$ almost surely under $\nu_a$. By Jensen's inequality and $\|\sigma\|_\infty \leq C_\sigma$,

$$\|P_{q_0}(x)\|^2 = \Bigl\| \int a\, w\, \sigma(w^\top x + b)\, q_0(\mathrm{d}\theta) \Bigr\|^2 \leq \int a^2 \|w\|^2 \sigma(w^\top x + b)^2\, q_0(\mathrm{d}\theta) \leq C_\sigma^2\, \sigma_a^2\, \sigma_w^2\, d,$$

where the last step uses independence of $(w, a)$ under $q_0$: $\int a^2 \|w\|^2\, q_0(\mathrm{d}\theta) = \sigma_a^2 \cdot \sigma_w^2 d$. This bound is independent of $x$, so by the triangle inequality and $\mathbb{E}\|\Pi_{\mathcal{M}}(x)\|^2 = \operatorname{Tr}(\Sigma_{\mathcal{M}})$,

$$F_t(q_0) = \tfrac{1}{2}\, \mathbb{E}\|P_{q_0}(x) - \Pi_{\mathcal{M}}(x)\|^2 \leq \mathbb{E}\|P_{q_0}(x)\|^2 + \mathbb{E}\|\Pi_{\mathcal{M}}(x)\|^2 \leq C_\sigma^2\, \sigma_a^2\, \sigma_w^2\, d + \operatorname{Tr}(\Sigma_{\mathcal{M}}).$$
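The chain of inequalities in Step 1 can be illustrated on an empirical version of $q_0$. The sketch below makes several assumptions the source does not specify: $\sigma = \tanh$ (so $C_\sigma = 1$), $\nu_a$ uniform on $[-A, A]$, and $\nu_b$ standard normal. It returns the three quantities $\|P_{q_0}(x)\|^2$, $\int a^2\|w\|^2\sigma^2\, \mathrm{d}q_0$ and $\int a^2\|w\|^2\, \mathrm{d}q_0$ computed over $m$ sampled neurons; the function name is hypothetical.

```python
import math
import random

def initial_projection_bound(x, m=2000, sigma_w=0.5, A=1.0, seed=0):
    """Empirical check of ||P(x)||^2 <= E[a^2 ||w||^2 sigma^2] <= E[a^2 ||w||^2]."""
    rng = random.Random(seed)
    d = len(x)
    mean_vec = [0.0] * d   # empirical P_{q0}(x)
    mean_sq = 0.0          # empirical integral of a^2 ||w||^2 sigma(.)^2
    mean_a2w2 = 0.0        # empirical integral of a^2 ||w||^2
    for _ in range(m):
        w = [rng.gauss(0.0, sigma_w) for _ in range(d)]  # nu_w = N(0, sigma_w^2 I_d)
        a = rng.uniform(-A, A)                           # assumed nu_a, |a| <= A
        b = rng.gauss(0.0, 1.0)                          # assumed nu_b
        act = math.tanh(sum(wi * xi for wi, xi in zip(w, x)) + b)
        w2 = sum(wi * wi for wi in w)
        for i in range(d):
            mean_vec[i] += a * w[i] * act / m
        mean_sq += a * a * w2 * act * act / m
        mean_a2w2 += a * a * w2 / m
    lhs = sum(v * v for v in mean_vec)
    return lhs, mean_sq, mean_a2w2

if __name__ == "__main__":
    print(initial_projection_bound([0.6, -0.8, 0.3]))
```

Under these assumptions the third quantity estimates $\sigma_a^2 \sigma_w^2 d$ with $\sigma_a^2 = A^2/3$, and the first is typically far smaller than the other two because the summands have mean zero at initialization.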
Step 2: Loss decomposition. Let $R_q(x) := f_1(x; q) - s^*(x, t)$ be the score residual. By Proposition [3.1](https://arxiv.org/html/2605.20235#S3.Thmtheorem1), $s^*(x, t) = -\frac{x - \Pi_{\mathcal{M}}(x)}{h(t)} + r(x, t)$ where $\|r(x, t)\| \leq C_R$ uniformly on $U_{\tau/2}$ for a constant $C_R > 0$ depending on $\mathcal{M}$ and $p_0$. Since $f_1(x; q) = \frac{1}{h(t)}(P_q(x) - x)$, the residual is

$$R_q(x) = \frac{P_q(x) - \Pi_{\mathcal{M}}(x)}{h(t)} - r(x, t).$$

Squaring and taking expectations (using that $p_t$ concentrates on $U_{\tau/2}$ up to an $e^{-\tau^2/(8h(t))}$ tail):

$$L_t(q) = \frac{1}{h(t)^2}\, F_t(q) - \frac{1}{h(t)}\, \widetilde{R}_t(q) + \tfrac{1}{2}\, \mathbb{E}\|r\|^2, \tag{28}$$

where $\widetilde{R}_t(q) := \mathbb{E}\bigl[\langle P_q(x) - \Pi_{\mathcal{M}}(x),\, r(x, t) \rangle\bigr]$ is the scale-crossing term. The first variation of $L_t$ along the trainable $w$-direction satisfies

$$\nabla_w \frac{\delta L_t}{\delta q}(q)(\theta) = \frac{1}{h(t)^2}\, \nabla_w \frac{\delta F_t}{\delta q}(q)(\theta) - \frac{1}{h(t)}\, \nabla_w \frac{\delta \widetilde{R}_t}{\delta q}(q)(\theta). \tag{29}$$
Step 3: Energy dissipation. The chain rule along the $w$-restricted Wasserstein gradient flow ([8](https://arxiv.org/html/2605.20235#S4.E8)) of $L_t$ gives

$$\begin{aligned}
\frac{d}{ds} F_t(q_s) &= -\int_{\mathbb{R}^{d+2}} \Bigl\langle \nabla_w \tfrac{\delta F_t}{\delta q},\; \nabla_w \tfrac{\delta L_t}{\delta q} \Bigr\rangle\, q_s(\mathrm{d}\theta) \\
&= -\frac{1}{h(t)^2} \int \Bigl\| \nabla_w \tfrac{\delta F_t}{\delta q} \Bigr\|^2 q_s(\mathrm{d}\theta) + \frac{1}{h(t)} \int \Bigl\langle \nabla_w \tfrac{\delta F_t}{\delta q},\; \nabla_w \tfrac{\delta \widetilde{R}_t}{\delta q} \Bigr\rangle\, q_s(\mathrm{d}\theta),
\end{aligned}$$

where we used integration by parts (in $w$) for the $w$-restricted continuity equation and substituted ([29](https://arxiv.org/html/2605.20235#A2.E29)).

Step 4: Young's inequality. For the cross-term, apply $\tfrac{1}{h(t)} \langle A, B \rangle \leq \tfrac{1}{2h(t)^2} \|A\|^2 + \tfrac{1}{2} \|B\|^2$ pointwise, yielding

$$\frac{d}{ds} F_t(q_s) \leq -\frac{1}{2h(t)^2} \int \Bigl\| \nabla_w \tfrac{\delta F_t}{\delta q} \Bigr\|^2 q_s(\mathrm{d}\theta) + \frac{1}{2} \int \Bigl\| \nabla_w \tfrac{\delta \widetilde{R}_t}{\delta q} \Bigr\|^2 q_s(\mathrm{d}\theta). \tag{30}$$
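For completeness, the pointwise inequality used in Step 4 follows from expanding a square; the short derivation below is standard and uses only the symbols $A$, $B$ and $h(t)$ from that step.

```latex
0 \;\le\; \tfrac{1}{2}\Bigl\| \tfrac{1}{h(t)}\,A - B \Bigr\|^{2}
  \;=\; \tfrac{1}{2h(t)^{2}}\,\|A\|^{2} \;-\; \tfrac{1}{h(t)}\,\langle A, B\rangle \;+\; \tfrac{1}{2}\,\|B\|^{2}
\quad\Longrightarrow\quad
\tfrac{1}{h(t)}\,\langle A, B\rangle \;\le\; \tfrac{1}{2h(t)^{2}}\,\|A\|^{2} + \tfrac{1}{2}\,\|B\|^{2}.
```

Taking $A = \nabla_w \tfrac{\delta F_t}{\delta q}$ and $B = \nabla_w \tfrac{\delta \widetilde{R}_t}{\delta q}$ and integrating against $q_s$, the dissipation coefficient $-1/h(t)^2$ is reduced by $1/(2h(t)^2)$, which leaves the $-1/(2h(t)^2)$ factor in (30).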
Step 5: Bound the remainder. The first variation of $\widetilde{R}_t$ at $\theta = (w, a, b)$ is $\frac{\delta \widetilde{R}_t}{\delta q}(\theta) = a\, \mathbb{E}_x\bigl[\sigma(w^\top x + b)\, \langle w, r(x, t) \rangle\bigr]$. Differentiating in $w$ and using $|\sigma|, |\sigma'| \leq C_\sigma$,

$$\nabla_w \frac{\delta \widetilde{R}_t}{\delta q}(\theta) = a\, \mathbb{E}_x\bigl[\sigma'(w^\top x + b)\, x\, \langle r(x, t), w \rangle + \sigma(w^\top x + b)\, r(x, t)\bigr],$$

so $\bigl\| \nabla_w \tfrac{\delta \widetilde{R}_t}{\delta q}(\theta) \bigr\| \leq |a|\, C_\sigma\, C_R\, (\|w\| \cdot \mathbb{E}[\|x\|] + 1)$. Since $|a| \leq A$ almost surely along the flow ($a$ is frozen at $\nu_a$), and by Assumption [4.3](https://arxiv.org/html/2605.20235#S4.Thmtheorem3) $\int \|w\|^2 q_s(\mathrm{d}\theta) \leq M_2$ uniformly in $s$,

$$\int \Bigl\| \nabla_w \tfrac{\delta \widetilde{R}_t}{\delta q} \Bigr\|^2 q_s(\mathrm{d}\theta) \leq 2\, A^2\, C_\sigma^2\, C_R^2 \bigl(\mathbb{E}[\|x\|^2]\, M_2 + 1\bigr) =: C_t,$$

where the inequality uses $(p + q)^2 \leq 2p^2 + 2q^2$ followed by Jensen $(\mathbb{E}\|x\|)^2 \leq \mathbb{E}\|x\|^2$. The constant $C_t > 0$ depends on $\mathcal{M}$, $\tau$, $C_\sigma$, $A$, and $M_2$, but not on $s$.

Step 6: Non-degeneracy and Grönwall. Applying the Geometric Non-degeneracy Condition ([10](https://arxiv.org/html/2605.20235#S4.E10)) to the first integral in ([30](https://arxiv.org/html/2605.20235#A2.E30)) yields the linear differential inequality

$$\frac{d}{ds}F_t(q_s) \;\leq\; -\frac{\nu}{2\,h(t)^2}\,F_t(q_s) \;+\; \frac{C_t}{2}.$$

Grönwall’s lemma from time $0$ to $s$ gives

$$F_t(q_s) \;\leq\; F_t(q_0)\,e^{-\nu s/(2h(t)^2)} + \frac{C_t\,h(t)^2}{\nu}\,\bigl(1-e^{-\nu s/(2h(t)^2)}\bigr) \;\leq\; F_t(q_0)\,e^{-\nu s/(2h(t)^2)} + \frac{C_t\,h(t)^2}{\nu},$$

establishing ([11](https://arxiv.org/html/2605.20235#S4.E11)). Since $F_t(q_s)=\tfrac{1}{2}\mathbb{E}\|P_{q_s}(x)-\Pi_{\mathcal{M}}(x)\|^2$, the $L^2$ error of the learned projection converges to $O(h(t))$ as $s\to\infty$. ∎
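
The Grönwall step can be checked numerically. The sketch below is only an illustration under assumed values: `F0`, `nu`, `h`, and `C` are arbitrary numbers standing in for $F_t(q_0)$, $\nu$, $h(t)$, and $C_t$, and the helper names are hypothetical. It integrates the equality case of the differential inequality with forward Euler and compares the trajectory with the closed-form bound and with the floor $C_t h(t)^2/\nu$.

```python
import numpy as np

def gronwall_bound(F0, nu, h, C, s):
    """Closed form F0*exp(-rate*s) + (C*h^2/nu)*(1 - exp(-rate*s)), with rate = nu/(2h^2)."""
    rate = nu / (2.0 * h**2)
    return F0 * np.exp(-rate * s) + (C * h**2 / nu) * (1.0 - np.exp(-rate * s))

def integrate_equality_case(F0, nu, h, C, s_max, n_steps=20000):
    """Forward-Euler integration of F' = -rate*F + C/2, the extremal case of the inequality."""
    rate = nu / (2.0 * h**2)
    ds = s_max / n_steps
    F = F0
    traj = [F]
    for _ in range(n_steps):
        F = F + ds * (-rate * F + 0.5 * C)
        traj.append(F)
    return np.linspace(0.0, s_max, n_steps + 1), np.array(traj)

# Illustrative values only (not taken from the paper).
F0, nu, h, C = 1.0, 0.5, 0.1, 2.0
s_grid, F_num = integrate_equality_case(F0, nu, h, C, s_max=1.0)
F_bound = gronwall_bound(F0, nu, h, C, s_grid)
floor = C * h**2 / nu   # limiting value as s -> infinity

print("max |Euler - closed form|:", np.max(np.abs(F_num - F_bound)))
print("final value vs floor:", F_num[-1], floor)
```

Halving `h` in this sketch divides the floor by four, which matches the $h(t)^2$ scaling of the squared error and hence the $O(h(t))$ scaling of the $L^2$ error.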

###### Lemma B.1 (Second-Moment Confinement).

Under the hypotheses of Theorem [4.5](https://arxiv.org/html/2605.20235#S4.Thmtheorem5), suppose additionally that the training objective includes an $\ell_2$ weight regularizer with coefficient $\lambda_w>0$, i.e., the flow ([8](https://arxiv.org/html/2605.20235#S4.E8)) is replaced by

$$\partial_s q_s = \nabla_w\cdot\!\left(q_s\,\nabla_w\frac{\delta\widetilde{L}_t}{\delta q}(q_s)\right),\qquad \widetilde{L}_t(q) := L_t(q) + \frac{\lambda_w}{2}\int\|w\|^2\,q(dw). \tag{31}$$

Then the second moment satisfies

$$m_2(q_s) \;\leq\; \max\!\bigl(m_2(q_0),\; C_{\mathrm{conf}}/\lambda_w\bigr)\qquad\text{for all } s\geq 0, \tag{32}$$

where $C_{\mathrm{conf}}$ depends on $C_\sigma$, $\mathbb{E}[\|x\|^2]$, $h(t)$, and $\|s^*(\cdot,t)\|_{L^2(p_t)}$, but not on $s$.

###### Proof\.

The regularized velocity field is $v(w) = -\nabla_w\frac{\delta L_t}{\delta q}(w) - \lambda_w w$. Computing the evolution of the second moment,

$$\frac{d}{ds}m_2(q_s) = -2\int \bigl\langle w,\,\nabla_w\tfrac{\delta L_t}{\delta q}(w)\bigr\rangle\,q_s(dw) \;-\; 2\lambda_w\,m_2(q_s). \tag{33}$$

For the first term, the functional derivative of $L_t$ is $\frac{\delta L_t}{\delta q}(w) = \frac{1}{h(t)}\,\mathbb{E}_x\bigl[\sigma(w^\top x)\,\langle R_q(x),\,w\rangle\bigr]$, where $R_q(x) := f_1(x;q) - s^*(x,t)$ is the residual. Its $w$-gradient satisfies

$$\nabla_w\frac{\delta L_t}{\delta q}(w) = \frac{1}{h(t)}\,\mathbb{E}_x\bigl[\sigma'(w^\top x)\,x\,\langle R_q,w\rangle + \sigma(w^\top x)\,R_q\bigr].$$

Taking the inner product with $w$ and using $|\sigma|\leq C_\sigma$, $|\sigma'|\leq C_\sigma$:

$$\bigl|\bigl\langle w,\,\nabla_w\tfrac{\delta L_t}{\delta q}(w)\bigr\rangle\bigr| \;\leq\; \frac{C_\sigma}{h(t)}\,\bigl(\|w\|^2+\|w\|\bigr)\,\mathbb{E}\bigl[\|R_q(x)\|\,(\|x\|+1)\bigr].$$

Since $\|R_q\|_{L^2}^2 = 2L_t(q_s)$ is non-increasing along the gradient flow of $\widetilde{L}_t$, and $\mathbb{E}[(\|x\|+1)^2]$ is a fixed constant $C_{x,t}^2$ under $p_t$, integrating against $q_s$ yields

$$\left|\int \bigl\langle w,\,\nabla_w\tfrac{\delta L_t}{\delta q}\bigr\rangle\,q_s(dw)\right| \;\leq\; \frac{C_\sigma C_{x,t}\sqrt{2L_t(q_0)}}{h(t)}\,\bigl(m_2(q_s)+\sqrt{m_2(q_s)}\bigr) \;\leq\; C'\bigl(m_2(q_s)+1\bigr),$$

where $C' := C_\sigma C_{x,t}\sqrt{2L_t(q_0)}/h(t)$. Substituting back into ([33](https://arxiv.org/html/2605.20235#A2.E33)):

$$\frac{d}{ds}m_2 \;\leq\; 2C'(m_2+1) - 2\lambda_w m_2 = -(2\lambda_w-2C')\,m_2 + 2C'.$$

Choosing $\lambda_w>C'$ (i.e., $\lambda_w$ depends on $h(t)$, $C_\sigma$, and $L_t(q_0)$), the coefficient of $m_2$ is negative, and Grönwall’s lemma gives ([32](https://arxiv.org/html/2605.20235#A2.E32)) with $C_{\mathrm{conf}}=2C'$. ∎

###### Proposition B.3 (Non-degeneracy for Linear Manifolds).

Let $\mathcal{M}=\mathrm{range}(A)$ for some $A\in\mathbb{R}^{d\times k}$ with $A^\top A=I_k$, so that $\Pi_{\mathcal{M}}(x)=AA^\top x$. Let $\Sigma_t:=\mathbb{E}_{x\sim p_t}[xx^\top]=(1-h(t))\Sigma_0+h(t)I_d$, where $\Sigma_0=\mathbb{E}[x_0x_0^\top]$. Assume $\sigma\in C^2(\mathbb{R})$ is non-polynomial with $\|\sigma\|_\infty,\|\sigma'\|_\infty\leq C_\sigma$, the trainable weights satisfy $\int\|w\|^2 q_s(\mathrm{d}\theta)\leq M_2$ for all $s\geq 0$, and the frozen output coefficients have second moment $\sigma_a^2:=\mathbb{E}_{\nu_a}[a^2]>0$.

Then the Geometric Non-degeneracy Condition ([10](https://arxiv.org/html/2605.20235#S4.E10)) holds with

$$\nu \;=\; \sigma_a^2\,\cdot\,\lambda_{\min}(\mathcal{T}_t)\,\cdot\,\mu_\sigma^2, \tag{34}$$

where $\lambda_{\min}(\mathcal{T}_t)$ is the smallest eigenvalue of the *feature covariance operator*

$$\mathcal{T}_t[g](w,b) := \mathbb{E}_{x\sim p_t}\bigl[\sigma(w^\top x+b)\,\langle g,x\rangle\,x\bigr],$$

acting on vector fields aligned with $(\mathrm{range}(A))^\perp$ and indexed by $(w,b)$ under the $(w,b)$-marginal of $q_s$, and $\mu_\sigma>0$ is a constant depending only on $\sigma$ and the spectral bounds of $\Sigma_t$. The factor $\sigma_a^2$ originates from the multiplicative $a$ in $\nabla_w\tfrac{\delta F_t}{\delta q}(\theta)$ (cf. Step 5 of the proof of Theorem [4.5](https://arxiv.org/html/2605.20235#S4.Thmtheorem5)).

In particular, $\nu$ is bounded away from zero uniformly in $h(t)$, in the following sense: for any compact interval $[h_{\min},h_{\max}]\subset(0,1)$,

$$\inf_{h(t)\in[h_{\min},h_{\max}]}\nu(h(t)) \;>\; 0, \tag{35}$$

since $\Sigma_t$ is positive definite and varies continuously in $h(t)$, and the spectral gap of $\mathcal{T}_t$ depends continuously on $\Sigma_t$.

###### Proof sketch\.

The argument follows the three-step structure of the original case ($w$-only flow), adapted to the Option B setting in which $(a,b)$ are frozen at $\nu_a\otimes\nu_b$ and only $w$ evolves.

Step 1: Reduction to a kernel eigenvalue problem. For the linear manifold, $\Delta_q(x):=P_q(x)-AA^\top x$ and $F_t(q)=\tfrac{1}{2}\mathbb{E}\|\Delta_q(x)\|^2$. The first variation along the trainable direction is

$$\nabla_w\frac{\delta F_t}{\delta q}(\theta) \;=\; a\,\mathbb{E}_x\bigl[\sigma'(w^\top x+b)\,x\,\langle\Delta_q,w\rangle + \sigma(w^\top x+b)\,\Delta_q\bigr],$$

so squaring produces an outer $a^2$ factor. Since $a$ is frozen at $\nu_a$ (independent of $(w,b)$ at $s=0$ and unchanged thereafter) and $\sigma_a^2=\mathbb{E}[a^2]>0$, this factor contributes $\sigma_a^2$ to $\nu$. The residual $(w,b)$-integral then reduces to the spectral gap of the integral operator with kernel

$$K_t\bigl((w,b),(w',b')\bigr) := \mathbb{E}_{x\sim p_t}\bigl[\sigma(w^\top x+b)\,\sigma({w'}^\top x+b')\,(w^\top w')\bigr]$$

acting on $L^2$ of the $(w,b)$-marginal of $q_s$.

Step 2: Spectral gap from non-degeneracy of $\sigma$. Since $\sigma$ is non-polynomial and $C^2$, the family $\{x\mapsto\sigma(w^\top x+b):(w,b)\in\mathrm{supp}(\nu_w\otimes\nu_b)\}$ spans a dense subspace of $L^2(p_t)$ (by the non-polynomial universality theorem for ridge functions, with translation parameter $b$). Combined with the positive definiteness of $\Sigma_t$ for all $h(t)>0$, this ensures $\lambda_{\min}(\mathcal{T}_t)>0$. Concretely, for the Gaussian distribution $p_t$ with covariance $\Sigma_t$, the Hermite expansion of $\sigma(\cdot+b)$ has non-vanishing coefficients at all orders (since $\sigma$ is non-polynomial), and each Hermite component contributes a strictly positive term to the spectral decomposition of $K_t$.

Step 3: Uniformity in $h(t)$. The eigenvalues of $K_t$ are continuous functions of $\Sigma_t$ (since they are expectations of continuous functions of the data covariance). On the compact set $h(t)\in[h_{\min},h_{\max}]$, $\Sigma_t$ ranges over a compact family of positive definite matrices, so $\lambda_{\min}(\mathcal{T}_t)$ attains a positive minimum. Therefore $\nu$ can be chosen uniformly over $h(t)\in[h_{\min},h_{\max}]$. ∎

### B.4 Proof of Theorem [4.7](https://arxiv.org/html/2605.20235#S4.Thmtheorem7)

#### Bounded feature energy\.

A crucial property of the random feature (RF) network is that its feature energy is entirely independent of the ambient dimension $d$.

###### Lemma B.5 (Bounded Spatio-Temporal Feature Energy).

Assume the activation $\sigma\in C^2(\mathbb{R})$ satisfies $\|\sigma\|_\infty\leq C_\sigma$ for some constant $C_\sigma>0$. Then for any $V_x\in\mathbb{R}^{d\times m}$ and $b$, the random feature map $\Phi(\hat{x},t):=\frac{1}{\sqrt{m}}\,\sigma\bigl(V_x^\top\hat{x}+b\bigr)\in\mathbb{R}^m$ satisfies

$$\sup_{\hat{x}\in\mathcal{M},\,t}\|\Phi(\hat{x},t)\|^2 \leq C_\sigma^2 =: C_{ST}, \tag{36}$$

where $C_{ST}$ depends only on $\sigma$ and is independent of the ambient dimension $d$, the feature dimension $m$, the manifold $\mathcal{M}$, and the diffusion time $t$.

#### Kernel eigenvalue decay\.

The generalization of Stage 2 hinges on the spectral properties of the induced kernel $K(\hat{x},\hat{x}')=\mathbb{E}_V[\sigma(V^\top\hat{x})\,\sigma(V^\top\hat{x}')]$ restricted to $\mathcal{M}$. Since $\mathcal{M}$ is a compact $k$-dimensional Riemannian manifold, Weyl’s law controls the eigenvalue decay.

###### Lemma B.6 (Eigenvalue Decay on Manifold).

Let $(\mathcal{M},g)$ be a compact $k$-dimensional $C^\infty$ Riemannian manifold. Consider the kernel

$$K(\hat{x},\hat{x}') := \mathbb{E}_V\bigl[\sigma(V^\top\hat{x})\,\sigma(V^\top\hat{x}')\bigr], \tag{37}$$

where $V$ is drawn from a distribution with density on $\mathbb{R}^{d\times m}$, and assume $\sigma\in C^r(\mathbb{R})$ for some $r\geq 2$ with bounded derivatives up to order $r$. Then:

1. (Kernel regularity.) The kernel $K$ belongs to $C^r(\mathcal{M}\times\mathcal{M})$, hence to $H^{r-k/2}(\mathcal{M}\times\mathcal{M})$ by Sobolev embedding.
2. (Eigenvalue decay.) The eigenvalues $\{\lambda_j\}_{j\geq 1}$ of the induced integral operator $T_K:L^2(\mathcal{M})\to L^2(\mathcal{M})$ satisfy $$\lambda_j \;\leq\; C_{\mathcal{M},r}\,\|K\|_{C^r}\;j^{-r/k}, \tag{38}$$ where $C_{\mathcal{M},r}$ depends only on $(\mathcal{M},g)$ and $r$, not on the ambient dimension $d$.

The polynomial decay rate $j^{-r/k}$ depends on the *intrinsic* dimension $k$ rather than $d$, which is the key to controlling the ambient dependence in Stage 2.

###### Proof of Lemma B.5.

By definition of $\Phi$,

$$\|\Phi(\hat{x},t)\|^2 = \frac{1}{m}\sum_{i=1}^m\sigma\bigl((V_x^\top\hat{x})_i+b_i\bigr)^2 \leq \frac{1}{m}\sum_{i=1}^m\|\sigma\|_\infty^2 = \|\sigma\|_\infty^2 \leq C_\sigma^2,$$

where the inequality applies the pointwise bound $|\sigma(z)|\leq C_\sigma$ termwise. The resulting bound is independent of $(\hat{x},t)$, $(V_x,V_t)$, and $m$, so taking the supremum over $\hat{x}\in\mathcal{M}$ and $t$ in the time interval under consideration yields ([36](https://arxiv.org/html/2605.20235#A2.E36)). ∎
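
The bound in Lemma B.5 is easy to observe numerically. The following sketch is an illustration only: it assumes $\sigma=\tanh$ (so that $C_\sigma=1$), Gaussian weights with a deliberately large scale, and arbitrary inputs rather than points on a manifold; the function name `feature_map` and all sizes are hypothetical choices. It records the largest feature energy $\|\Phi\|^2$ seen for several ambient dimensions $d$ and feature counts $m$.

```python
import numpy as np

def feature_map(x_hat, V_x, b):
    """Phi = sigma(V_x^T x_hat + b) / sqrt(m), with sigma = tanh as an assumed activation."""
    m = V_x.shape[1]
    return np.tanh(V_x.T @ x_hat + b) / np.sqrt(m)

rng = np.random.default_rng(0)
C_sigma = 1.0        # sup-norm bound of tanh
energies = {}        # (d, m) -> largest observed ||Phi||^2
for d in (10, 1000):
    for m in (16, 512):
        V_x = 5.0 * rng.normal(size=(d, m))   # large weights push tanh toward saturation
        b = rng.normal(size=m)
        xs = rng.normal(size=(50, d))
        energies[(d, m)] = max(float(np.sum(feature_map(x, V_x, b) ** 2)) for x in xs)

for key, value in energies.items():
    print(key, round(value, 4))
```

Every recorded value stays at or below $C_\sigma^2=1$ regardless of $d$ and $m$, because the $1/\sqrt{m}$ normalization turns the squared norm into an average of $m$ terms, each bounded by $C_\sigma^2$.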

###### Proof of Lemma B.6.

Part (i): Kernel regularity. Since $\sigma\in C^r(\mathbb{R})$ with bounded derivatives, the map $\hat{x}\mapsto\sigma(V^\top\hat{x})$ is $C^r$ on $\mathcal{M}$ for each fixed $V$, with derivatives bounded uniformly in $V$ by $\max_{0\leq j\leq r}\|\sigma^{(j)}\|_\infty\cdot\|V\|_{\mathrm{op}}^j$. Since the distribution of $V$ has finite moments of all orders, dominated convergence ensures that $K(\hat{x},\hat{x}')=\mathbb{E}_V[\sigma(V^\top\hat{x})\,\sigma(V^\top\hat{x}')]$ inherits $C^r$ regularity jointly in $(\hat{x},\hat{x}')\in\mathcal{M}\times\mathcal{M}$. The Sobolev embedding $C^r(\mathcal{M})\hookrightarrow H^\tau(\mathcal{M})$ for $\tau\leq r-k/2$ then gives $K\in H^\tau(\mathcal{M}\times\mathcal{M})$.

Part (ii): Eigenvalue decay. The argument proceeds via the factorization of $T_K$ through Sobolev spaces.

*Step 1: Mapping property.* Since $K(\cdot,\hat{x}')\in C^r(\mathcal{M})$ uniformly in $\hat{x}'$, the integral operator $T_K$ maps $L^2(\mathcal{M})$ boundedly into $C^r(\mathcal{M})\hookrightarrow H^r(\mathcal{M})$:

$$\|T_Kf\|_{H^r(\mathcal{M})} \;\leq\; C\,\|K\|_{C^r(\mathcal{M}\times\mathcal{M})}\,\|f\|_{L^2(\mathcal{M})}.$$

*Step 2: Singular values of the Sobolev embedding.* The inclusion $\iota:H^r(\mathcal{M})\hookrightarrow L^2(\mathcal{M})$ is compact, and its singular values $\{\mu_j\}_{j\geq 1}$ satisfy $\mu_j\asymp j^{-r/k}$. This follows from the spectral theory of the Laplace–Beltrami operator $-\Delta_{\mathcal{M}}$ on the compact manifold $\mathcal{M}$: Weyl’s law gives that the eigenvalues of $-\Delta_{\mathcal{M}}$ grow as $\lambda_j^\Delta\asymp j^{2/k}$, so the eigenvalues of $(I-\Delta_{\mathcal{M}})^{-r/2}$ (which characterize $H^r\hookrightarrow L^2$) decay as $j^{-r/k}$.

*Step 3: Factorization.* Since $T_K=\iota\circ(T_K:L^2\to H^r)$, the multiplicative property of singular values gives

$$\lambda_j(T_K) \;\leq\; \|T_K:L^2\to H^r\|_{\mathrm{op}}\;\cdot\;\mu_j \;\leq\; C\,\|K\|_{C^r}\;j^{-r/k},$$

which is ([38](https://arxiv.org/html/2605.20235#A2.E38)). The constant depends on $(\mathcal{M},g)$ through the Weyl constant and the Sobolev embedding constant, but not on the ambient dimension $d$. ∎
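
A small numerical illustration of the dimension-free spectrum, under several assumptions that are not part of the lemma: the manifold is a unit circle ($k=1$) placed isometrically in $\mathbb{R}^d$, the activation is $\sigma=\mathrm{erf}$, and each weight vector is $v\sim\mathcal{N}(0,I_d)$, for which the expectation $\mathbb{E}_v[\mathrm{erf}(v^\top x)\,\mathrm{erf}(v^\top x')]$ has the known arcsine closed form used in `erf_kernel`. All function names are hypothetical. The sketch compares the empirical kernel spectrum for $d=3$ and $d=50$.

```python
import numpy as np

def circle_in_Rd(n_points, d, rng):
    """n_points equally spaced on the unit circle, embedded isometrically in R^d."""
    theta = 2.0 * np.pi * np.arange(n_points) / n_points
    base = np.stack([np.cos(theta), np.sin(theta)], axis=1)   # shape (n, 2)
    Q, _ = np.linalg.qr(rng.normal(size=(d, 2)))              # d x 2, orthonormal columns
    return base @ Q.T                                         # shape (n, d)

def erf_kernel(X):
    """Closed form of E_v[erf(v.x) erf(v.x')] for v ~ N(0, I_d) (arcsine kernel)."""
    G = X @ X.T
    diag = np.diag(G)
    denom = np.sqrt((1.0 + 2.0 * diag)[:, None] * (1.0 + 2.0 * diag)[None, :])
    return (2.0 / np.pi) * np.arcsin(2.0 * G / denom)

def spectrum(d, n_points=64, seed=0):
    """Eigenvalues of the empirical integral operator K/n, sorted in decreasing order."""
    X = circle_in_Rd(n_points, d, np.random.default_rng(seed))
    eig = np.linalg.eigvalsh(erf_kernel(X) / n_points)
    return np.sort(eig)[::-1]

eig_low, eig_high = spectrum(d=3), spectrum(d=50)
print("leading eigenvalues, d=3 :", eig_low[:6])
print("leading eigenvalues, d=50:", eig_high[:6])
```

The two spectra coincide up to floating-point error because the kernel depends on the data only through inner products, which the isometric embedding preserves. Since erf is analytic, the decay in this toy case is much faster than the polynomial rate $j^{-r/k}$, which is an upper bound for $C^r$ activations.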

###### Proof of Theorem [4.7](https://arxiv.org/html/2605.20235#S4.Thmtheorem7).

The proof proceeds in four steps\.

Step (i): Rademacher complexity (estimation error). Consider the hypothesis class $\mathcal{F}_R=\{x\mapsto U\Phi(x,t):\|U\|_F\leq B_U\}$. By Lemma [B.5](https://arxiv.org/html/2605.20235#A2.Thmtheorem5), $\|\Phi(\hat{x},t)\|^2\leq C_\sigma^2$ independently of $d$, $m$, $\mathcal{M}$, and $t$. The empirical Rademacher complexity of $\mathcal{F}_R$ is therefore bounded by

$$\hat{\mathcal{R}}_n(\mathcal{F}_R) \;\leq\; \frac{B_U}{\sqrt{n}}\sqrt{\frac{1}{n}\sum_{i=1}^n\|\Phi(\hat{x}_i,t_i)\|^2} \;\leq\; \frac{B_U\,C_\sigma}{\sqrt{n}} \;=\; \frac{C_{\mathrm{int}}}{\sqrt{n}}, \tag{39}$$

independently of $d$.

It remains to show that $B_U$ is $d$-independent. The ridge regression solution is $\hat{U}=\mathbb{E}[s^*_{\mathrm{res}}\,\Phi^\top](\mathbb{E}[\Phi\,\Phi^\top]+\lambda I)^{-1}$. By the Cauchy–Schwarz inequality,

$$\|\hat{U}\|_F \;\leq\; \frac{1}{\lambda}\,\bigl\|\mathbb{E}[s^*_{\mathrm{res}}\,\Phi^\top]\bigr\|_F \;\leq\; \frac{1}{\lambda}\,\sqrt{\mathbb{E}\|s^*_{\mathrm{res}}\|^2}\;\cdot\;\sqrt{\mathbb{E}\|\Phi\|^2} \;\leq\; \frac{\|s^*_{\mathrm{res}}\|_{L^2}\,C_\sigma}{\lambda}.$$

The key observation is that $s^*_{\mathrm{res}}(\hat{x},t)\in T_{\hat{x}}\mathcal{M}$ is a tangential vector field on $\mathcal{M}$, so its $L^2$-norm is controlled by intrinsic quantities: under the source condition, $\|s^*_{\mathrm{res}}\|_{L^2}\leq\|T_K^s\|_{\mathrm{op}}\cdot G\leq\lambda_1^s\,G$, where $\lambda_1$ is the leading eigenvalue of $T_K$. Therefore $B_U\leq\lambda_1^s\,G\,C_\sigma/\lambda$, which depends on $(\mathcal{M},p_0,\sigma,\lambda)$ but not on $d$.

Standard Rademacher-to-generalization conversion with a union bound over the time discretization then gives the estimation error $C_{\mathrm{int}}^2\log(1/\delta)/n$.
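
The two inequalities in (39) can be illustrated with a Monte Carlo estimate. The sketch below makes simplifying assumptions: it uses a scalar-output version of the class, $f(x)=\langle u,\Phi(x)\rangle$ with $\|u\|\leq B_U$, for which the supremum over $u$ has the explicit value $(B_U/n)\,\|\sum_i\varepsilon_i\Phi_i\|$; it takes $\sigma=\tanh$ (so $C_\sigma=1$); and it uses arbitrary Gaussian inputs and illustrative sizes. None of these choices come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m, B_U, C_sigma = 200, 20, 64, 2.0, 1.0   # illustrative sizes; tanh gives C_sigma = 1

V = rng.normal(size=(d, m))
b = rng.normal(size=m)
X = rng.normal(size=(n, d)) / np.sqrt(d)
Phi = np.tanh(X @ V + b) / np.sqrt(m)            # (n, m); every row has squared norm <= 1

# Scalar-output class: sup over ||u|| <= B_U of (1/n) sum_i eps_i <u, Phi_i>
# equals (B_U / n) * || sum_i eps_i Phi_i ||, so no optimization is needed.
eps = rng.choice([-1.0, 1.0], size=(2000, n))    # Rademacher sign vectors
rademacher_mc = B_U / n * np.mean(np.linalg.norm(eps @ Phi, axis=1))

# Data-dependent bound (middle term of (39)) and the uniform bound B_U * C_sigma / sqrt(n).
data_bound = B_U / np.sqrt(n) * np.sqrt(np.mean(np.sum(Phi**2, axis=1)))
uniform_bound = B_U * C_sigma / np.sqrt(n)

print(rademacher_mc, data_bound, uniform_bound)
```

The first gap is Jensen's inequality applied to $\mathbb{E}\|\sum_i\varepsilon_i\Phi_i\|$, and the second is Lemma B.5. Changing `d` in the sketch leaves `uniform_bound` untouched, which is the dimension-free property that Step (i) relies on.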

Step (ii): Approximation error. By Lemma [B.6](https://arxiv.org/html/2605.20235#A2.Thmtheorem6) and the source condition, the Random Feature approximation error is controlled by the spectral tail:

$$\epsilon_{\mathrm{approx}}(m,k) \;=\; \sum_{j>cm}\lambda_j^{2\alpha} \;\leq\; C_\lambda^{2\alpha}\sum_{j>cm}j^{-2\alpha r/k} \;=\; O\bigl(m^{-(2\alpha r/k-1)}\bigr),$$

where the convergence of the sum uses $2\alpha r/k>1$. The decay rate depends only on $k$ (through the eigenvalue exponent $r/k$) and not on $d$.
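
The tail estimate in Step (ii) is the standard integral comparison $\sum_{j>M}j^{-p}\leq M^{-(p-1)}/(p-1)$ for $p>1$, applied with $p=2\alpha r/k$ and $M=cm$. A minimal numerical sketch, assuming the illustrative value $p=3$ and hypothetical helper names, with the infinite sum truncated at a large index:

```python
import numpy as np

def spectral_tail(M, p, j_max=1_000_000):
    """Truncated tail sum over M < j <= j_max of j^(-p); a lower estimate of the full tail."""
    j = np.arange(M + 1, j_max + 1, dtype=np.float64)
    return float(np.sum(j ** (-p)))

def integral_bound(M, p):
    """Integral comparison bound M^(-(p-1)) / (p-1) for the full tail, valid when p > 1."""
    return M ** (-(p - 1.0)) / (p - 1.0)

p = 3.0   # stands in for 2*alpha*r/k; any value above 1 works
tails = {M: spectral_tail(M, p) for M in (100, 200, 400)}
bounds = {M: integral_bound(M, p) for M in tails}

for M in tails:
    print(M, tails[M], bounds[M])
print("tail ratio when M doubles:", tails[200] / tails[100])
```

Doubling $M$ shrinks the tail by roughly $2^{-(p-1)}$, which is the $m^{-(2\alpha r/k-1)}$ rate stated above; a smaller intrinsic dimension $k$ gives a larger $p$ and therefore faster decay.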

Step (iii): Stage 1 error propagation. The imperfect projection $\hat{x}\approx\Pi_{\mathcal{M}}(x)$ introduces two distinct error contributions. By the score architecture ([1](https://arxiv.org/html/2605.20235#S3.E1)),

$$s_\theta(x,t)-s^*(x,t) = \underbrace{\frac{\Pi_{\mathcal{M}}(x)-\hat{x}}{h(t)}}_{\text{(a) normal mismatch}} + \underbrace{\bigl(f_2(\hat{x},t)-s^*_{\mathrm{res}}(\hat{x},t)\bigr)}_{\text{Stage 2 error at }\hat{x}} + \underbrace{\bigl(s^*_{\mathrm{res}}(\hat{x},t)-s^*_{\mathrm{res}}(\Pi_{\mathcal{M}}(x),t)\bigr)}_{\text{(b) tangential feature perturbation}}.$$

*Term (a): Normal mismatch.* This is the dominant Stage 1 contribution. Taking the expected squared norm, with $\epsilon_{\mathrm{proj}}^2$ denoting the mean squared projection error $\mathbb{E}\|\Pi_{\mathcal{M}}(x)-\hat{x}\|^2$:

$$\mathbb{E}\left\|\frac{\Pi_{\mathcal{M}}(x)-\hat{x}}{h(t)}\right\|^2 = \frac{\epsilon_{\mathrm{proj}}^2}{h(t)^2}.$$

*Term (b): Tangential feature perturbation.* Since $s^*_{\mathrm{res}}$ is a $C^1$ vector field on $\mathcal{M}$ with Lipschitz constant $\mathrm{Lip}(s^*_{\mathrm{res}})$ depending on the curvature and density gradients of $\mathcal{M}$:

$$\mathbb{E}\bigl\|s^*_{\mathrm{res}}(\hat{x},t)-s^*_{\mathrm{res}}(\Pi_{\mathcal{M}}(x),t)\bigr\|^2 \;\leq\; \mathrm{Lip}(s^*_{\mathrm{res}})^2\,\epsilon_{\mathrm{proj}}^2.$$

*Cross term.* The cross term between (a) and the Stage 2 error is controlled by the AM–GM inequality, $2\langle A,B\rangle\leq\|A\|^2+\|B\|^2$, so it can be absorbed into the other terms at the cost of constant factors.

Combining, the total Stage 1 error is

$$\underbrace{\frac{\epsilon_{\mathrm{proj}}^2}{h(t)^2}}_{\text{dominant}} \;+\; \underbrace{\mathrm{Lip}(s^*_{\mathrm{res}})^2\,\epsilon_{\mathrm{proj}}^2}_{\text{lower order since }h(t)\ll 1} \;\leq\; \frac{C_{\mathrm{Lip}}^2\,\epsilon_{\mathrm{proj}}^2}{h(t)^2},$$

where $C_{\mathrm{Lip}}:=\sqrt{1+h(t)^2\,\mathrm{Lip}(s^*_{\mathrm{res}})^2}\geq 1$ absorbs both contributions.

Step (iv): Combining. Summing the three terms from Steps (i)–(iii) yields ([14](https://arxiv.org/html/2605.20235#S4.E14)). ∎

### B.5 Proof of Theorem [4.10](https://arxiv.org/html/2605.20235#S4.Thmtheorem10)

This section establishes the Phase II generalization bound and the end-to-end Wasserstein-2 sampling guarantee. The argument proceeds in three pieces: an affine structural decomposition of the high-noise score (Lemma [B.8](https://arxiv.org/html/2605.20235#A2.Thmtheorem8)), a parametric Phase II generalization bound for the multiplicative random-feature head (Lemma [B.9](https://arxiv.org/html/2605.20235#A2.Thmtheorem9)), and the Girsanov-based end-to-end argument that combines Phase I and Phase II.

#### High\-noise score structure\.

A crucial property of the score function in the Gaussian regime is that it is approximately *affine* in $x$, with smooth time-dependent coefficients. This linear-in-$x$ structure is what enables the multiplicative random-feature ansatz ([16](https://arxiv.org/html/2605.20235#S4.E16)) to achieve a parametric (rather than nonparametric) generalization bound.

###### Lemma B.8 (Affine approximation of the high-noise score).

Assume $\mathcal{M}$ has finite diameter $R_{\mathcal{M}}$. For any $t$ with $h(t)\in[\tau^2,1)$, write $a_t:=\sqrt{1-h(t)}$. Then the score function admits the decomposition

$$s^*(x,t) = -\frac{x}{h(t)} + \frac{a_t}{h(t)}\mu_t + \frac{a_t^2}{h(t)^2}C_t\,x + R_t(x), \tag{40}$$

where $\mu_t:=\mathbb{E}_{z\sim p_0}[z]\in\mathbb{R}^d$, $C_t:=\mathrm{Cov}_{z\sim p_0}(z)\in\mathbb{R}^{d\times d}$ are the first two moments of $p_0$, and the residual satisfies

$$\sup_{x\in\mathrm{supp}(p_t)}\|R_t(x)\| \;\leq\; C_{\mathcal{M}}^{(1)}\cdot\frac{a_t^3}{h(t)^3}\cdot R_{\mathcal{M}}^3, \tag{41}$$

for a universal constant $C_{\mathcal{M}}^{(1)}>0$.

###### Proof\.

The marginal density satisfies

$$p_t(x) = (2\pi h(t))^{-d/2}\exp\!\left(-\frac{\|x\|^2}{2h(t)}\right)\cdot\Phi_t(x),\qquad \Phi_t(x) := \mathbb{E}_{z\sim p_0}\!\left[\exp\!\left(\frac{a_t\langle x,z\rangle}{h(t)}-\frac{a_t^2\|z\|^2}{2h(t)}\right)\right]. \tag{42}$$

Taking gradients,

$$s^*(x,t) = -\frac{x}{h(t)} + \nabla_x\log\Phi_t(x). \tag{43}$$

Setting $\xi:=a_tx/h(t)$, $\Phi_t$ is the moment generating function of the tilted distribution $p_0(z)\exp(-a_t^2\|z\|^2/(2h(t)))$ evaluated at $\xi$. Cumulant expansion of $\log\Phi_t$ around $\xi=0$ gives

$$\log\Phi_t(\xi) = \langle\mu_t,\xi\rangle + \tfrac{1}{2}\xi^\top C_t\xi + O(\|\xi\|^3), \tag{44}$$

with cubic remainder bounded by the third moment of $p_0$, which is in turn bounded by $R_{\mathcal{M}}^3$ since $p_0$ is supported on $\mathcal{M}\subset B(0,R_{\mathcal{M}})$. Differentiating in $x$ and substituting $\xi=a_tx/h(t)$ yields ([40](https://arxiv.org/html/2605.20235#A2.E40)) with the cubic remainder controlled as in ([41](https://arxiv.org/html/2605.20235#A2.E41)). ∎
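
The exact identity (43) can be verified numerically when $p_0$ is a finite mixture of point masses, in which case $\nabla_x\log\Phi_t(x)=\tfrac{a_t}{h(t)}\,\mathbb{E}[z\mid x]$ with posterior weights proportional to the integrand of $\Phi_t$ in (42). The sketch below is an illustration under that assumption; the atoms `Z`, weights `w`, noise level `h`, and all function names are hypothetical. It compares the decomposed score with a central finite difference of $\log p_t$.

```python
import numpy as np

def log_p_t(x, Z, w, h):
    """log of p_t = sum_j w_j N(a_t z_j, h I_d), with a_t = sqrt(1 - h)."""
    a = np.sqrt(1.0 - h)
    d = x.shape[0]
    sq = np.sum((x[None, :] - a * Z) ** 2, axis=1)
    log_terms = np.log(w) - sq / (2.0 * h) - 0.5 * d * np.log(2.0 * np.pi * h)
    top = log_terms.max()
    return top + np.log(np.sum(np.exp(log_terms - top)))   # stable log-sum-exp

def score_decomposed(x, Z, w, h):
    """-x/h + grad_x log Phi_t(x), using grad log Phi_t = (a_t/h) * E[z | x]."""
    a = np.sqrt(1.0 - h)
    logits = np.log(w) + a * (Z @ x) / h - a**2 * np.sum(Z**2, axis=1) / (2.0 * h)
    post = np.exp(logits - logits.max())
    post /= post.sum()
    return -x / h + (a / h) * (post @ Z)

def score_finite_diff(x, Z, w, h, eps=1e-5):
    """Central finite-difference gradient of log p_t, used as an independent check."""
    grad = np.zeros_like(x)
    for i in range(x.shape[0]):
        e = np.zeros_like(x)
        e[i] = eps
        grad[i] = (log_p_t(x + e, Z, w, h) - log_p_t(x - e, Z, w, h)) / (2.0 * eps)
    return grad

rng = np.random.default_rng(0)
Z = rng.normal(size=(5, 3))
Z /= np.linalg.norm(Z, axis=1, keepdims=True)   # atoms on the unit sphere (bounded support)
w = np.full(5, 0.2)                             # uniform mixture weights
h, x = 0.5, rng.normal(size=3)

s_formula = score_decomposed(x, Z, w, h)
s_numeric = score_finite_diff(x, Z, w, h)
print(s_formula, s_numeric)
```

The agreement reflects only the exact decomposition (43); the affine approximation (40) additionally truncates the cumulant expansion (44) and therefore carries the remainder $R_t(x)$.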

#### Phase II generalization\.

The kernel induced by ([16](https://arxiv.org/html/2605.20235#S4.E16)) factorizes as $K_t\otimes K_x$ due to the multiplicative structure. The spatial component $K_x$ inherits the eigenvalue decay of Lemma [B.6](https://arxiv.org/html/2605.20235#A2.Thmtheorem6) on $\mathcal{M}$ (or, in Phase I, on a bounded ball in $\mathbb{R}^d$), while the temporal component $K_t$ is finite-dimensional ($L$ Fourier modes). The combination yields a hypothesis class whose Rademacher complexity is parametric in $n$ with a $d$-dependent norm scale that propagates linearly into the bound.

###### Lemma B.9 (Phase II generalization).

Assume $\mathcal{M}$ has finite diameter $R_{\mathcal{M}}$, the activation $\sigma\in C^2$ satisfies $\|\sigma\|_\infty,\|\sigma'\|_\infty\leq C_\sigma$, and the Fourier basis $\{\phi_\ell\}_{\ell=0}^{L-1}$ satisfies $L\geq L_0$ for a constant $L_0$ depending only on the noise schedule. Let $\hat{U}=(\hat{U}_0,\dots,\hat{U}_{L-1})$ be the regularized empirical-risk minimizer

$$\hat{U} \;=\; \operatorname*{arg\,min}_U\frac{1}{n}\sum_{i=1}^n\big\|s^{\mathrm{HN}}_\theta(x_{t_i},t_i)-s^*(x_{t_i},t_i)\big\|^2 + \lambda\sum_\ell\|U_\ell\|_F^2, \tag{45}$$

over $n$ i.i.d. samples $(x_{t_i},t_i)$ with $t_i\sim\mathrm{Unif}([t_{\max},T])$ and $x_{t_i}\sim p_{t_i}$. Then with probability at least $1-\delta$,

$$\mathbb{E}_{t\in[t_{\max},T]}\,\mathbb{E}_{x_t\sim p_t}\big\|s_{\hat{\theta}}^{\mathrm{HN}}(x_t,t)-s^*(x_t,t)\big\|^2 \;\leq\; \frac{C_{\mathrm{HN}}\cdot d\cdot\log(1/\delta)}{n} + O(e^{-T}), \tag{46}$$

where $C_{\mathrm{HN}}$ depends on $(R_{\mathcal{M}},C_\sigma,L,\lambda)$ but not on the intrinsic dimension $k$.

###### Proof\.

The proof proceeds in three steps: approximation, estimation, and the residual at $T$.

Step 1 (Approximation). By Lemma [B.8](https://arxiv.org/html/2605.20235#A2.Thmtheorem8), the dominant terms of $s^*(x,t)$ in Phase I form the affine target

$$s^*_{\mathrm{aff}}(x,t) := -\frac{x}{h(t)} + \frac{a_t}{h(t)}\mu_t + \frac{a_t^2}{h(t)^2}C_t\,x = g_0(t) + g_1(t)\cdot x, \tag{47}$$

with $g_0(t)=a_t\mu_t/h(t)$ and $g_1(t)=-I_d/h(t)+a_t^2C_t/h(t)^2$. Both $g_0$ and $g_1$ are real-analytic in $t$ on $[t_{\max},T]$ under standard noise schedules (linear or cosine $\beta_t$), so their Fourier expansions on $[t_{\max},T]$ decay super-polynomially. Consequently, there exists $L_0=L_0(\beta_t)$ such that for $L\geq L_0$,

$$\min_U\sup_{(x,t)}\big\|s^{\mathrm{HN}}_\theta(x,t)-s^*_{\mathrm{aff}}(x,t)\big\| \;\leq\; O(L^{-r_{\mathrm{sched}}}), \tag{48}$$

where $r_{\mathrm{sched}}$ is the analytic decay rate of the Fourier coefficients. Combining with the cubic remainder from Lemma [B.8](https://arxiv.org/html/2605.20235#A2.Thmtheorem8), which is uniformly bounded by $R_{\mathcal{M}}^3$ on $\mathrm{supp}(p_t)$ since $a_t^3/h(t)^3\leq 1$ in Phase I, the total approximation error is absorbed into $C_{\mathrm{HN}}$.

Step 2 (Estimation via Rademacher complexity). The hypothesis class is

$$\mathcal{F}_{\mathrm{HN}}:=\left\{s^{\mathrm{HN}}_{\theta}\;:\;\sum_{\ell}\|U_{\ell}\|_{F}^{2}\leq B^{2}\right\}, \tag{49}$$

with $B$ controlled by $\lambda$ via the standard ridge bound $B\leq\|s^{*}_{\mathrm{aff}}\|_{L^{2}}/\sqrt{\lambda}$. Since $\|\mu_{t}\|\leq R_{\mathcal{M}}$ and $\|C_{t}\|_{F}\leq\sqrt{d}\cdot R_{\mathcal{M}}^{2}$ (the latter using that $C_{t}$ has $d$ eigenvalues each bounded by $R_{\mathcal{M}}^{2}$),

$$\mathbb{E}_{t,x_{t}}\|s^{*}_{\mathrm{aff}}(x_{t},t)\|^{2}\;\leq\;C\cdot d\cdot\mathrm{poly}(R_{\mathcal{M}},h_{\min}^{-1}), \tag{50}$$

where $h_{\min}:=h(t_{\max})\asymp\tau^{2}$ is bounded away from zero throughout Phase I. By Lemma [B.5](https://arxiv.org/html/2605.20235#A2.Thmtheorem5) (adapted to the spatial input domain restricted to a ball of radius $O(\sqrt{d\cdot h_{\max}})$), the feature map has bounded $\ell^{2}$ norm independently of $d$ and $m$. The empirical Rademacher complexity is therefore

$$\widehat{\mathfrak{R}}_{n}(\mathcal{F}_{\mathrm{HN}})\;\leq\;\frac{B\cdot C_{\sigma}\sqrt{L}}{\sqrt{n}}. \tag{51}$$

Standard Rademacher-to-generalization conversion yields excess risk $O(B^{2}C_{\sigma}^{2}L\log(1/\delta)/n)$. Substituting $B^{2}=O(d/\lambda)$ from the affine target's $L^{2}$ norm gives the parametric bound

$$\mathbb{E}\big\|s^{\mathrm{HN}}_{\hat{\theta}}(x_{t},t)-s^{*}_{\mathrm{aff}}(x_{t},t)\big\|^{2}\;\leq\;\frac{C_{\mathrm{HN}}\cdot d\cdot\log(1/\delta)}{n}. \tag{52}$$

Step 3 (Residual at $T$). Under the VP-SDE, the marginal $p_{T}$ converges to $\mathcal{N}(0,I_{d})$ exponentially fast. Specifically,

$$\mathrm{KL}(p_{T}\,\|\,\mathcal{N}(0,I_{d}))\;\leq\;C_{\mathcal{M}}^{(2)}\cdot e^{-T}, \tag{53}$$

which translates by Pinsker's inequality and Lemma [B.8](https://arxiv.org/html/2605.20235#A2.Thmtheorem8) into an $O(e^{-T})$ contribution in ([46](https://arxiv.org/html/2605.20235#A2.E46)).

Combining Steps 1–3 yields the stated bound\. ∎
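
The exponential rate used in Step 3 can be sanity-checked in closed form on the simplest possible data distribution. The sketch below is an illustration only, not part of the proof: it assumes a constant schedule $\beta_t\equiv 1$ (so that $a_T=e^{-T/2}$ and $h(T)=1-e^{-T}$) and a point-mass data distribution at a hypothetical point `x0`, for which $p_T=\mathcal{N}(e^{-T/2}x_0,(1-e^{-T})I_d)$ and the Gaussian KL formula applies exactly. The function name is made up for this example.

```python
import numpy as np

def kl_point_mass_to_standard_normal(x0, T):
    """KL( N(e^{-T/2} x0, (1 - e^{-T}) I_d) || N(0, I_d) ).

    Assumptions of this sketch (not taken from the paper): constant
    schedule beta_t = 1 and a point-mass data distribution at x0.
    """
    d = x0.size
    u = np.exp(-T)  # u = e^{-T}
    # Gaussian KL: 0.5 * (d*s2 + ||m||^2 - d - d*log(s2)) with
    # s2 = 1 - u and ||m||^2 = u * ||x0||^2, simplified below.
    return 0.5 * (u * (x0 @ x0 - d) - d * np.log1p(-u))

x0 = np.zeros(100)
x0[0] = 2.0  # ||x0|| = 2
for T in (2.0, 5.0, 10.0):
    kl = kl_point_mass_to_standard_normal(x0, T)
    # kl * e^T approaches ||x0||^2 / 2 as T grows
    print(T, kl, kl * np.exp(T))
```

Expanding $-\log(1-e^{-T})$ shows that, in this special case, the KL equals $\tfrac{1}{2}e^{-T}\|x_0\|^2+O(d\,e^{-2T})$, which is the $e^{-T}$ behaviour that (53) asserts for general compactly supported data.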

#### End\-to\-end argument\.

Theorem [4.10](https://arxiv.org/html/2605.20235#S4.Thmtheorem10) now follows by decomposing the reverse-SDE Girsanov bound across the two phases and applying the KL-to-$W_{2}$ interpolation inequality.

###### Proof of Theorem [4.10](https://arxiv.org/html/2605.20235#S4.Thmtheorem10).

We bound $\mathrm{KL}(p_{\mathrm{data},t_{\min}}\,\|\,p_{\mathrm{gen},t_{\min}})$ via Girsanov's theorem, then convert to $W_{2}$ through the standard interpolation inequality of \[[3](https://arxiv.org/html/2605.20235#bib.bib79)\].

Step 1 (Girsanov decomposition). Let $(\overleftarrow{x}_{t})_{t\in[t_{\min},T]}$ denote the true reverse process and $(\overleftarrow{y}_{t})$ the simulated process driven by $s^{\mathrm{full}}_{\hat{\theta}}$. Standard Girsanov estimates \[[15](https://arxiv.org/html/2605.20235#bib.bib47)\] yield

$$\mathrm{KL}\bigl(p_{\mathrm{data},t_{\min}}\,\|\,p_{\mathrm{gen},t_{\min}}\bigr)\;\leq\;\frac{1}{2}\int_{t_{\min}}^{T}\mathbb{E}_{x_{t}\sim p_{t}}\big\|s^{\mathrm{full}}_{\hat{\theta}}(x_{t},t)-s^{*}(x_{t},t)\big\|^{2}\,dt. \tag{54}$$

Step 2 (Phase decomposition). Using $\alpha(t)=\mathbf{1}[h(t)\leq\tau^{2}]$, the gated architecture ([15](https://arxiv.org/html/2605.20235#S4.E15)) coincides with $s^{\mathrm{SiLD}}_{\hat{\theta}}$ on $[t_{\min},t_{\max}]$ and with $s^{\mathrm{HN}}_{\hat{\theta}}$ on $[t_{\max},T]$. Splitting the integral in ([54](https://arxiv.org/html/2605.20235#A2.E54)) at $t_{\max}$,

$$\begin{aligned}
\mathrm{KL}\bigl(p_{\mathrm{data},t_{\min}}\,\|\,p_{\mathrm{gen},t_{\min}}\bigr)
&\leq \tfrac{1}{2}\int_{t_{\min}}^{t_{\max}}\mathbb{E}\big\|s^{\mathrm{SiLD}}_{\hat{\theta}}-s^{*}\big\|^{2}dt+\tfrac{1}{2}\int_{t_{\max}}^{T}\mathbb{E}\big\|s^{\mathrm{HN}}_{\hat{\theta}}-s^{*}\big\|^{2}dt \\
&\leq \underbrace{O\!\left(\frac{\mathrm{poly}(k)\log(1/\delta)}{\sqrt{n}}\right)}_{\text{Phase I, by Theorems 4.5 and 4.7}}+\underbrace{O\!\left(\frac{C_{\mathrm{HN}}\cdot d\cdot\log(1/\delta)}{n}+e^{-T}\right)}_{\text{Phase II, by Lemma B.9}}.
\end{aligned} \tag{55}$$

The Phase II term carries the $e^{-T}$ damping because $p_{t}\to\mathcal{N}(0,I_{d})$ exponentially as $t\to T$, so the score-error integrand on $[t_{\max},T]$ is dominated by the contribution near $T$, where the affine approximation is exact up to $O(e^{-T})$.

Step 3 (KL-to-$W_{2}$ conversion). With $t_{\min}\propto 1/n$, the interpolation inequality \[[3](https://arxiv.org/html/2605.20235#bib.bib79), Lemma 3\] gives

$$W_{2}(p_{\mathrm{data}},p_{\mathrm{gen},t_{\min}})\;\lesssim\;\sqrt{\mathrm{KL}}\,\cdot n^{-1/4}+n^{-1/2}. \tag{56}$$

Substituting Step 2 yields ([17](https://arxiv.org/html/2605.20235#S4.E17)), completing the proof. ∎

## Appendix C Experimental Details and Additional Results

### C.1 Toy experiment: full setup

We provide the complete specification of the synthetic experiment\.

#### Data generation\.

We consider a manifold-plus-noise model in $\mathbb{R}^{d}$ with ambient dimension $d=100$ and intrinsic dimension $k=5$. Let $A\in\mathbb{R}^{d\times k}$ be a fixed orthonormal basis of the true manifold subspace, drawn uniformly from the Stiefel manifold once at the start of the experiment. Clean samples are generated as $x_{0}=Az$, where the latent code $z\in\mathbb{R}^{k}$ follows a mixture of three Gaussians:

$$z\;\sim\;\sum_{c=1}^{3}\pi_{c}\,\mathcal{N}(\mu_{c},0.5^{2}\,I_{k}),$$

with uniform mixing weights $\pi_{c}=1/3$ and means $\{\mu_{c}\}_{c=1}^{3}$ placed at equidistant points on a circle of radius $2$ in the first two coordinates of $\mathbb{R}^{k}$. Noisy observations are

$$x_{t}\;=\;x_{0}+\sigma_{t}\,\varepsilon,\qquad\varepsilon\sim\mathcal{N}(0,I_{d}),$$

with $\sigma_{t}=0.1$, a small-noise regime in which the score singularity is pronounced and Proposition [3.1](https://arxiv.org/html/2605.20235#S3.Thmtheorem1) applies.
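
The data-generating process above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than the authors' code: the function name, the seed handling, and the QR-with-sign-correction recipe for drawing $A$ uniformly from the Stiefel manifold are choices made for this example, as is the placement of the three means at angles $0$, $2\pi/3$, $4\pi/3$.

```python
import numpy as np

def make_toy_data(n, d=100, k=5, sigma_t=0.1, radius=2.0, latent_std=0.5, seed=0):
    """Sample (A, mu, z, x0, xt) from the manifold-plus-noise toy model."""
    rng = np.random.default_rng(seed)

    # Orthonormal basis A (d x k): QR of a Gaussian matrix; fixing the signs
    # of R's diagonal makes the draw uniform on the Stiefel manifold.
    Q, R = np.linalg.qr(rng.standard_normal((d, k)))
    A = Q * np.sign(np.diag(R))

    # Three means, equally spaced on a circle in the first two latent coords.
    angles = 2.0 * np.pi * np.arange(3) / 3.0
    mu = np.zeros((3, k))
    mu[:, 0] = radius * np.cos(angles)
    mu[:, 1] = radius * np.sin(angles)

    # Uniform mixture weights: pick a component, then add isotropic noise.
    comp = rng.integers(0, 3, size=n)
    z = mu[comp] + latent_std * rng.standard_normal((n, k))

    x0 = z @ A.T                                      # clean samples x0 = A z
    xt = x0 + sigma_t * rng.standard_normal((n, d))   # noisy observations
    return A, mu, z, x0, xt

A, mu, z, x0, xt = make_toy_data(n=4096)
print(A.shape, x0.shape, xt.shape)
```

Because $A$ has orthonormal columns, `x0 @ A` recovers the latent code `z`, which is convenient when checking the diagnostics against ground truth.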

#### Model\.

We train the two-stage score model of Section [3.3](https://arxiv.org/html/2605.20235#S3.SS3) with hidden width $m=200$. Stage 1 uses the conservative-form network ([2](https://arxiv.org/html/2605.20235#S3.E2)) with $\tanh$ activation. Stage 2 is a Random Feature network with frozen spatial features $V_{x}\in\mathbb{R}^{d\times m}$ drawn i.i.d. from $\mathcal{N}(0,I_{d}/d)$.

#### Optimization\.

Both stages are trained with Adam. Stage 1 uses learning rate $\eta_{1}=10^{-3}$; Stage 2 uses $\eta_{2}=5\times 10^{-3}$. Batch size is $4096$ for both stages. Stage 1 is trained until the orthogonal-component error plateaus, after which $W$ is frozen and Stage 2 is trained on the ridge-regression objective of ([13](https://arxiv.org/html/2605.20235#S4.E13)).
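
To make the Stage 2 setup concrete, the following sketch fits a random-feature model with frozen features by ridge regression. It is a generic illustration and should not be read as the objective of ([13](https://arxiv.org/html/2605.20235#S4.E13)): the $\tanh$ feature nonlinearity, the absence of time features, the regression targets, the value of `lam`, and the closed-form solve (used here in place of the Adam training described above) are all assumptions of this example. Only the feature distribution, columns of $V_x$ drawn from $\mathcal{N}(0,I_d/d)$, and the width $m=200$ follow the text.

```python
import numpy as np

def random_feature_ridge(X, Y, m=200, lam=1e-3, seed=0):
    """Ridge regression on frozen random features (illustrative sketch).

    X: (n, d) inputs, Y: (n, p) regression targets.
    Returns the frozen feature matrix V_x (d, m) and the readout U (m, p).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape

    # Frozen spatial features: each column is drawn from N(0, I_d / d).
    V_x = rng.standard_normal((d, m)) / np.sqrt(d)

    # Feature map; tanh is an assumed choice of nonlinearity.
    Phi = np.tanh(X @ V_x)                            # (n, m)

    # Minimiser of (1/2n)||Phi U - Y||_F^2 + (lam/2)||U||_F^2.
    gram = Phi.T @ Phi / n + lam * np.eye(m)
    U = np.linalg.solve(gram, Phi.T @ Y / n)
    return V_x, U

def predict(X, V_x, U):
    """Evaluate the fitted random-feature model on new inputs."""
    return np.tanh(X @ V_x) @ U
```

Since only the readout `U` is trained, the objective is convex in the trainable parameters, which is why a closed-form solve can stand in for gradient-based training in a small example like this one.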

#### Diagnostics\.

Because the data-generating process is analytically tractable, we decompose the total score estimation error $\mathbb{E}\|\hat{s}(x_{t})-s^{*}(x_{t})\|_{2}^{2}$ into its manifold and orthogonal components exactly, using the ground-truth projection $\Pi_{\mathcal{M}}=AA^{\top}$. Writing $x_{t}=AA^{\top}x_{t}+(I-AA^{\top})x_{t}$, the orthogonal component of the target score is $-(1/h_{t})(I-AA^{\top})x_{t}$, and the manifold component is the MoG score evaluated at $A^{\top}x_{0}$. The red and green curves in Figure [1](https://arxiv.org/html/2605.20235#S5.F1) (left) track these two components independently throughout training, providing a direct empirical probe of the stage-by-stage learning dynamics predicted by Theorem [4.5](https://arxiv.org/html/2605.20235#S4.Thmtheorem5).
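
The decomposition can be sketched as follows. This is an illustrative reconstruction, not the authors' diagnostic code. It assumes $h_t=\sigma_t^2$ for the toy model, in which case the noisy marginal is a Gaussian mixture in $\mathbb{R}^d$ whose score splits into an orthogonal part $-(I-AA^{\top})x_t/\sigma_t^2$ and a manifold part given by the score of the noised latent mixture (component variance $0.5^2+\sigma_t^2$). The sketch evaluates the latter at the latent coordinates $A^{\top}x_t$ of the noisy point, which is also an assumption. Function names are hypothetical.

```python
import numpy as np

def toy_score(x, A, mu, sigma_t=0.1, latent_std=0.5):
    """Exact score of the noisy toy marginal, assuming uniform mixture weights.

    x: (n, d) noisy points, A: (d, k) orthonormal basis, mu: (C, k) means.
    """
    u = x @ A                                   # latent coordinates A^T x
    v = latent_std**2 + sigma_t**2              # noised latent variance
    diff = mu[None, :, :] - u[:, None, :]       # (n, C, k): mu_c - A^T x
    logits = -0.5 * (diff**2).sum(-1) / v
    logits -= logits.max(axis=1, keepdims=True) # stabilise the softmax
    r = np.exp(logits)
    r /= r.sum(axis=1, keepdims=True)           # posterior responsibilities
    s_man = ((r[:, :, None] * diff).sum(1) / v) @ A.T
    s_orth = -(x - u @ A.T) / sigma_t**2        # -(I - A A^T) x / sigma_t^2
    return s_man + s_orth

def split_error(s_hat, s_star, A):
    """Mean squared score error split into (manifold, orthogonal) parts."""
    err = s_hat - s_star
    err_man = (err @ A) @ A.T                   # projection onto span(A)
    err_orth = err - err_man
    return np.mean((err_man**2).sum(1)), np.mean((err_orth**2).sum(1))
```

Because $AA^{\top}$ is an orthogonal projection, the two returned values sum exactly to the total mean squared error, so each component can be tracked separately during training without double counting.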

### C.2 Stacked MNIST qualitative samples

Figure [2](https://arxiv.org/html/2605.20235#A3.F2) shows uncurated samples from both methods on Stacked MNIST, alongside real samples for reference. Both methods achieve full $1000/1000$ mode coverage (Table [1](https://arxiv.org/html/2605.20235#S5.T1)).

![Refer to caption](https://arxiv.org/html/2605.20235v1/figures/stacked_mnist_composite_sild.png)

Figure 2: Uncurated samples from Stacked MNIST. Each image is three random MNIST digits stacked into the R/G/B channels (1000 possible modes). *Left:* real samples. *Middle:* SiLD. *Right:* LDM-CNN.

### C.3 CelebA denoising comparison

Figure [3](https://arxiv.org/html/2605.20235#A3.F3) shows the learned encoder-decoder applied to Gaussian-corrupted CelebA inputs without any diffusion steps, illustrating the Stage-1 objective in isolation. The SiLD Stage-1 loss trains the encoder-decoder as a denoising operator on noisy inputs, while LDM's VAE objective optimizes reconstruction of clean inputs with KL regularization.

![Refer to caption](https://arxiv.org/html/2605.20235v1/figures/paper_celeba_reconstruction.png)

Figure 3: Denoising and reconstruction on CelebA ($64\times 64$). Encoder-decoder applied to Gaussian-corrupted inputs ($\sigma=0.1$), no diffusion steps. *Row 1:* Original. *Row 2:* Noisy input. *Row 3:* LDM+GAN. *Row 4:* SiLD+GAN.

### C.4 CelebA-HQ results

We extend the CelebA comparison to CelebA-HQ ($64\times 64\times 3$, ambient dimension $d=12{,}288$, estimated intrinsic dimension $k_{95}=223$ via PCA at 95% variance). All models use latent dimension 256. Table [4](https://arxiv.org/html/2605.20235#A3.T4) reports results across two model sizes and two training budgets.

SiLD consistently outperforms LDM-CNN on reconstruction MSE across all settings, with the gap present at the smaller network ($0.00440$ vs. $0.00503$) and persisting at the larger network ($0.00345$ vs. $0.00396$). At $10\times$ training, both methods converge to near-identical reconstruction MSE ($0.00061$ vs. $0.00063$), suggesting that reconstruction capacity saturates at sufficient optimization; at this scale LDM-CNN achieves slightly better Eps Loss ($0.261$ vs. $0.296$). The overall pattern is consistent with CelebA: SiLD's Stage-1 objective produces a latent more faithful to the data manifold for reconstruction, while the generation metric depends additionally on the latent's amenability to diffusion.

Table 4: CelebA-HQ results ($64\times 64\times 3$, $d=12{,}288$, intrinsic dim $k_{95}=223$). All models use latent dim 256.

### C.5 Compute Resources

All experiments were conducted on a single NVIDIA A100 (40 GB) GPU; times below are wall-clock training time per configuration, excluding scheduler queue waits. Stacked MNIST (Table [1](https://arxiv.org/html/2605.20235#S5.T1)) takes $\approx 50$ min. CelebA ($64\times 64$, Table [2](https://arxiv.org/html/2605.20235#S5.T2)) ranges from $1$–$2$ h for the LDM/SiLD baselines, $\approx 3.3$ h for the MMD $\alpha$-sweep, and $\approx 12.4$ h for each GAN-regularized row, totalling $\approx 22$ GPU-hours. MoleculeNet (Table [3](https://arxiv.org/html/2605.20235#S5.T3), four datasets) takes $0.4$–$2.6$ h per (method, dataset) configuration, totalling $\approx 16.5$ GPU-hours. CelebA-HQ (Table [4](https://arxiv.org/html/2605.20235#A3.T4)) base configurations run in $1$–$2$ h; the $10\times$ training row ($100$k AE + $300$k diffusion steps, $104$M parameters) is the most expensive single configuration at $\approx 17$ h per method, with the table totalling $\approx 49$ GPU-hours. The reported experiments thus account for $\approx 90$ GPU-hours; including ablations, hyperparameter sweeps, and preliminary architecture exploration, total measured wall time across all jobs is approximately $320$ GPU-hours.