Spectral Guidance for Flexible and Efficient Control of Diffusion Models

arXiv cs.LG 05/29/26, 04:00 AM Papers
diffusion-models guidance spectral-guidance generative-ai machine-learning research
Summary
Introduces Spectral Guidance, a framework for controlling diffusion models by leveraging low-dimensional representations of the diffusion process, enabling flexible and stable control without task-specific retraining or backpropagation through the denoiser.
arXiv:2605.28900v1 Announce Type: new Abstract: We introduce Spectral Guidance, a framework for controlling diffusion models by leveraging the intrinsic geometry of the generative process. As data is progressively corrupted by noise, only a small number of features remain informative for control. We characterize them as the singular functions of a conditional expectation operator and show that they can be learned via a self-supervised objective. Once recovered, this basis enables the projection of arbitrary guidance signals, such as labels, CLIP embeddings, or masks, directly onto the sampling trajectory. This approach allows for stable, high-fidelity control without retraining or denoiser backpropagation during sampling. Empirically, we improve conditional accuracy on CIFAR-10 by 37 percentage points over the strongest training-free baseline while offering $4\times$ faster sampling. Moreover, the same representations that support label and CLIP guidance also enable spatial control, such as mask-based guidance, without auxiliary models. Finally, our framework reveals a phase transition in the generative process, pinpointing the optimal time window for effective guidance.
Cached at: 05/29/26, 09:13 AM
# Spectral Guidance for Flexible and Efficient Control of Diffusion Models
Source: [https://arxiv.org/html/2605.28900](https://arxiv.org/html/2605.28900)
## 1 Introduction

Generative modeling has seen tremendous progress with diffusion-based approaches (Sohl-Dickstein et al., [2015](https://arxiv.org/html/2605.28900#bib.bib44)), which now produce high-fidelity samples across images, audio, and other modalities (Ho et al., [2020](https://arxiv.org/html/2605.28900#bib.bib14); Song et al., [2020b](https://arxiv.org/html/2605.28900#bib.bib18); Rombach et al., [2022](https://arxiv.org/html/2605.28900#bib.bib51)). These models operate by reversing a progressive corruption process that gradually transforms structured data into noise. While the forward and reverse dynamics of generation are well understood, the utility of these models hinges on *guidance*: sampling according to user specifications, which can take the form of labels (Dhariwal and Nichol, [2021](https://arxiv.org/html/2605.28900#bib.bib33)), text prompts (Saharia et al., [2022](https://arxiv.org/html/2605.28900#bib.bib55)), or custom objectives (Zhang et al., [2023](https://arxiv.org/html/2605.28900#bib.bib58)).

The key challenge is how to impose such guidance in practice. Successful mainstream approaches rely on training models to be conditional from the outset (Dhariwal and Nichol, [2021](https://arxiv.org/html/2605.28900#bib.bib33); Ho and Salimans, [2022](https://arxiv.org/html/2605.28900#bib.bib25)), so that the desired guidance can be injected directly during sampling. This strategy yields strong and stable control, but tightly couples the model to a fixed set of conditions and requires retraining or additional models whenever the specification changes. An alternative is to start from an unconditional model and enforce the desired behavior only at sampling time by optimizing a user-defined objective (Chung et al., [2022](https://arxiv.org/html/2605.28900#bib.bib29); Ye et al., [2024](https://arxiv.org/html/2605.28900#bib.bib34)). While this offers greater flexibility, it typically requires differentiating through the denoiser during sampling and approximating an intractable posterior distribution. In practice, this leads to higher computational cost and often unstable control, especially for complex objectives.

In this work, we propose Spectral Guidance, a framework that enables flexible control by leveraging the intrinsic structure of the diffusion process. As data is gradually corrupted by noise, fine-grained details are lost while coarse semantic features persist. We show that these features form a natural, low-dimensional coordinate system that tracks how information propagates through the diffusion. By learning this representation, we can project arbitrary guidance objectives, from simple labels to masks, directly into the generative trajectory. This allows for stable control without the need for task-specific retraining or denoiser gradients.

Our technical approach is based on a low-rank approximation of the conditional expectation operator across diffusion timesteps. Because high-frequency details are progressively destroyed, the leading singular functions of this operator form a time-indexed low-dimensional basis that captures persistent axes of variation over diffusion time. We show that a self-supervised learning (SSL) objective with orthogonality constraints (Bardes et al., [2021](https://arxiv.org/html/2605.28900#bib.bib3)) is a variational estimator for the operator's leading singular functions, with independently diffused views of the same sample acting as augmentations. Once these representations are learned, guidance reduces to a simple linear projection onto this basis.

This formulation provides an efficient and flexible guidance mechanism. It removes the need for backpropagation through the denoiser during sampling, requiring only a one-time training to learn the intrinsic coordinates of the diffusion process. These representations reveal which features are recoverable and when guidance is effective. Further, being task-agnostic, they support arbitrary downstream control objectives without retraining.

Empirically, we achieve consistent gains across label, attribute, and CLIP guidance; notably, on CIFAR-10, we surpass the strongest training-free baseline by 37 percentage points in accuracy while improving FID and sampling $4\times$ faster. In addition, our representations generalize to spatial control like mask-guided generation without auxiliary models. Finally, they reveal a spectral phase transition that pinpoints the optimal time window for effective guidance.

Our contributions are summarized as follows:

- We propose Spectral Guidance, which reframes diffusion guidance as a projection onto a coordinate system aligned with the generative dynamics, enabling flexible control without retraining the diffusion model.
- We introduce an SSL objective that estimates the spectral decomposition of the diffusion operator. This representation yields a lightweight guidance algorithm, decoupled from the gradients of the denoiser.
- We outperform training-free baselines by over 37 percentage points in accuracy and $4\times$ in speed, while enabling complex controls like mask guidance without auxiliary models. Further, our representations reveal a phase transition during the reverse process that aligns with the optimal time window for effective guidance.

## 2 Related Work

We review key advances in conditional generation below.

#### Diffusion guidance.

To introduce conditioning or amplify specific signals during sampling, Classifier Guidance (CG) (Dhariwal and Nichol, [2021](https://arxiv.org/html/2605.28900#bib.bib33)) employs an external classifier trained on noisy data, using its gradients to steer the generative trajectory. Classifier-Free Guidance (CFG) (Ho and Salimans, [2022](https://arxiv.org/html/2605.28900#bib.bib25)) alleviates the need for external classifiers by jointly training a conditional and unconditional model, subsequently interpolating their score estimates during sampling. While CFG has become the *de facto* standard for modern architectures, ranging from Stable Diffusion (Rombach et al., [2022](https://arxiv.org/html/2605.28900#bib.bib51)) to Flow Matching models (Lipman et al., [2022](https://arxiv.org/html/2605.28900#bib.bib52)), it typically applies a constant guidance scale, remaining agnostic to the dynamics of the diffusion process. For a comprehensive survey, we refer to Zhan et al. ([2024](https://arxiv.org/html/2605.28900#bib.bib50)). To address the limitations of static guidance, recent works (Koulischer et al., [2025](https://arxiv.org/html/2605.28900#bib.bib53); Kynkäänniemi et al., [2024](https://arxiv.org/html/2605.28900#bib.bib43)) leverage insights into diffusion dynamics and phase transitions (Handke et al., [2025](https://arxiv.org/html/2605.28900#bib.bib42); Raya and Ambrogioni, [2023](https://arxiv.org/html/2605.28900#bib.bib54)) to target time windows where features are most controllable, thereby optimizing generation quality.

#### Training-free guidance.

A distinct line of research focuses on controlling pre-trained diffusion models via external loss functions, in a plug-and-play fashion, eliminating the need for retraining. Initially developed for inverse problems (Kawar et al., [2021](https://arxiv.org/html/2605.28900#bib.bib47); Choi et al., [2021](https://arxiv.org/html/2605.28900#bib.bib48); Kawar et al., [2022](https://arxiv.org/html/2605.28900#bib.bib49)), this paradigm includes Diffusion Posterior Sampling (DPS) (Chung et al., [2022](https://arxiv.org/html/2605.28900#bib.bib29)), which guides sampling with the gradients of a loss function evaluated on a point estimate of the posterior distribution. For general control, Loss-Guided Diffusion (LGD) (Song et al., [2023](https://arxiv.org/html/2605.28900#bib.bib31)) refines DPS by estimating the conditional expectation via Monte Carlo sampling, while MPGD (He et al., [2024](https://arxiv.org/html/2605.28900#bib.bib13)) leverages the manifold hypothesis to constrain guidance to low-dimensional data manifolds. Universal Guidance for Diffusion (UGD) (Bansal et al., [2023](https://arxiv.org/html/2605.28900#bib.bib30)) and FreeDoM (Yu et al., [2023](https://arxiv.org/html/2605.28900#bib.bib32)) further strengthen guidance through iterative "time-travel" strategies and adaptive schedules across diffusion timesteps. More recently, Training-Free Guidance (TFG) (Ye et al., [2024](https://arxiv.org/html/2605.28900#bib.bib34)) unifies many of these approaches under a common guidance algorithm.

#### Editing directions in diffusion models.

NoiseCLR (Dalva and Yanardag, [2024](https://arxiv.org/html/2605.28900#bib.bib4)) discovers interpretable directions in the noise space of pre-trained diffusion models via a contrastive self-supervised objective; while it shares the use of SSL-style training, it targets latent-space *editing* rather than information-preserving structure. More closely related, Park et al. ([2023](https://arxiv.org/html/2605.28900#bib.bib5)) and Chen et al. ([2024](https://arxiv.org/html/2605.28900#bib.bib2)) use the spectral decomposition of the denoiser's Jacobian as a post hoc tool for identifying semantic editing directions.

In contrast, we avoid training task-dependent conditional denoisers and relying on point estimates. Instead, we propose to guide unconditional models by mapping guidance signals onto the diffusion model's spectral coordinates.

## 3 Preliminaries

#### Diffusion models.

Let $X_0 \sim p_0$ be a data random variable with support $\mathcal{X}$. Denoising Diffusion Probabilistic Models (DDPMs) (Ho et al., [2020](https://arxiv.org/html/2605.28900#bib.bib14)) define a forward process that gradually perturbs $X_0$ into Gaussian noise using a variance schedule $\{\alpha_t\}_{t=1}^{T}$, where $\alpha_t \in (0,1)$ and $\bar{\alpha}_t := \prod_{s=1}^{t} \alpha_s$. This process allows for direct sampling of any noisy latent $x_t$ at timestep $t$ via

$$p_t(x_t \mid x_0) = \mathcal{N}\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t) I\right). \tag{1}$$

To reverse this process, a neural network $\epsilon_\theta(x_t, t)$ is trained to predict the noise $\epsilon$ added to $x_0$. This training objective is equivalent to denoising score matching (Song and Ermon, [2019](https://arxiv.org/html/2605.28900#bib.bib45)), as the optimal denoiser is related to the score of the marginal distribution by

$$\nabla_{x_t} \log p_t(x_t) = -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1-\bar{\alpha}_t}}. \tag{2}$$

While the original DDPM formulation requires a stochastic Markov chain, Denoising Diffusion Implicit Models (DDIMs) (Song et al., [2020a](https://arxiv.org/html/2605.28900#bib.bib10)) provide a faster, non-Markovian alternative, via the update rule

$$x_{t-1} = \sqrt{\bar{\alpha}_{t-1}} \left( \frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}} \right) + \sqrt{1-\bar{\alpha}_{t-1}-\sigma_t^2}\;\epsilon_\theta(x_t, t) + \sigma_t \varepsilon, \tag{3}$$

where $\varepsilon \sim \mathcal{N}(0, I)$.
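
To make Eq. (1) and Eq. (3) concrete, the following is a minimal NumPy sketch of forward noising and a single DDIM update. The linear beta schedule, the function names, and the use of the true noise in place of a trained $\epsilon_\theta$ are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def make_alpha_bar(T=1000, beta_start=1e-4, beta_end=2e-2):
    # Assumed linear beta schedule (not specified in the text):
    # alpha_t = 1 - beta_t and alpha_bar_t = prod_{s<=t} alpha_s.
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)  # entry t-1 holds alpha_bar_t

def forward_sample(x0, ab_t, rng):
    # Eq. (1): x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(ab_t) * x0 + np.sqrt(1.0 - ab_t) * eps, eps

def ddim_step(x_t, eps_pred, ab_t, ab_prev, sigma_t=0.0, rng=None):
    # Eq. (3). eps_pred stands in for eps_theta(x_t, t);
    # rng is only needed when sigma_t > 0 (stochastic variant).
    x0_hat = (x_t - np.sqrt(1.0 - ab_t) * eps_pred) / np.sqrt(ab_t)
    direction = np.sqrt(1.0 - ab_prev - sigma_t**2) * eps_pred
    noise = sigma_t * rng.standard_normal(x_t.shape) if sigma_t > 0.0 else 0.0
    return np.sqrt(ab_prev) * x0_hat + direction + noise

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = rng.standard_normal((4, 3))
x_t, eps = forward_sample(x0, alpha_bar[499], rng)
# With an oracle noise prediction, the deterministic step lands exactly on
# the same noise realization at the previous noise level.
x_prev = ddim_step(x_t, eps, alpha_bar[499], alpha_bar[498])
```

With $\sigma_t = 0$ the update is deterministic, which is the setting usually meant by "DDIM sampling"; a positive $\sigma_t$ reintroduces stochasticity.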

#### Diffusion guidance.

The practical utility of the generative reverse-time process depends on the ability to guide the unconditional trajectory in Eq. ([3](https://arxiv.org/html/2605.28900#S3.E3)), given a conditioning signal $y$ (such as a class label, text prompt, or task objective). This corresponds to sampling from the conditional distribution $p(x_0 \mid y)$. By Bayes' rule we can decompose the conditional score as

$$\nabla_{x_t} \log p_t(x_t \mid y) = \nabla_{x_t} \log p_t(x_t) + \nabla_{x_t} \log p_t(y \mid x_t).$$

While $\nabla_{x_t} \log p_t(x_t)$ is provided by the unconditional model, the guidance term $\nabla_{x_t} \log p_t(y \mid x_t)$ requires a time-dependent predictor (Dhariwal and Nichol, [2021](https://arxiv.org/html/2605.28900#bib.bib33)),

$$p_t(y \mid x_t) = \mathbb{E}_{X_0 \sim p_t(\cdot \mid x_t)}\left[p(y \mid X_0)\right]. \tag{4}$$

Hence, this approach restricts control of the generation to the conditioning signal $y$ used for training $p_t(y \mid x_t)$.

#### Training-free guidance.

In contrast to training-based approaches, training-free guidance aims at steering the unconditional trajectory in Eq. ([3](https://arxiv.org/html/2605.28900#S3.E3)) using any clean data signal $p(y \mid x_0)$ by estimating $p_t(y \mid x_t)$ during sampling. As Eq. ([4](https://arxiv.org/html/2605.28900#S3.E4)) is generally intractable, methods such as DPS (Chung et al., [2022](https://arxiv.org/html/2605.28900#bib.bib29)) rely on the denoiser's point estimate of the posterior mean $\hat{x}_0(x_t) \approx \mathbb{E}[X_0 \mid X_t = x_t]$,

$$\mathbb{E}_{X_0 \sim p_t(\cdot \mid x_t)}\left[p(y \mid X_0)\right] \approx p\left(y \mid \hat{x}_0(x_t)\right). \tag{5}$$

This substitution is exact only when $p(y \mid x_0)$ is affine in $x_0$, which rarely holds for semantic tasks. At large noise levels the posterior mean may even lie outside the data manifold (He et al., [2024](https://arxiv.org/html/2605.28900#bib.bib13)), leading to misaligned gradients. Further, differentiating $\hat{x}_0(x_t)$ requires backpropagating through the denoiser at every step, which is computationally expensive and prone to vanishing gradients.
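
The gap behind Eq. (5) can be illustrated with a toy calculation. The sketch below assumes a hypothetical two-point posterior and a hypothetical sigmoid classifier, neither of which comes from the paper; it only shows that pushing the posterior mean through a non-affine predictor can differ sharply from the true posterior expectation, while an affine predictor incurs no error.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical posterior p_t(x0 | x_t): two equally likely clean samples.
atoms = np.array([-1.0, 1.0])
weights = np.array([0.5, 0.5])
post_mean = float(weights @ atoms)  # E[X0 | x_t] = 0, which is neither atom

def clf(x0):
    # Hypothetical non-affine predictor p(y | x0).
    return sigmoid(10.0 * (x0 - 0.5))

def affine(x0):
    # Affine signal, for which the point estimate is exact.
    return 0.3 * x0 + 0.5

exact = float(weights @ clf(atoms))  # left-hand side of Eq. (5)
point = float(clf(post_mean))        # right-hand side of Eq. (5)
affine_gap = abs(float(weights @ affine(atoms)) - affine(post_mean))
print(f"exact={exact:.4f}  point-estimate={point:.4f}  affine gap={affine_gap:.1e}")
```

In this toy case the point estimate evaluates the classifier at a location that carries no posterior mass, which is the one-dimensional analogue of the posterior mean leaving the data manifold.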

![[Uncaptioned image]](https://arxiv.org/html/2605.28900v1/x1.png)
Figure 3(a): Spectral Guidance on a Gaussian mixture prior. Contours depict $\log p_0(x_0)$. (a) Prior samples colored by component. (b, c) Samples generated by spectral guidance (white) conditioned on different label subsets, using a fixed set of $K=30$ spectral modes. The same features enable sampling from $p(x_0 \mid y \in \mathcal{Y})$ for arbitrary conditioning sets $\mathcal{Y}$.

## 4 Spectral Representation for Guidance

A key observation motivates our approach: as diffusion noise increases, information about the data is progressively destroyed, and only a small number of features remain recoverable. Consequently, at each diffusion timestep, there exists a low-dimensional set of intrinsic directions along which guidance can effectively act. We propose to guide samples along these diffusion-stable directions (Fig. [3(a)](https://arxiv.org/html/2605.28900#S3.F3.sf1)).

We can view the expectation in Eq. ([4](https://arxiv.org/html/2605.28900#S3.E4)) as the action of an operator from the clean data space $\mathcal{H}_0 = L^2(p_0)$ to the noisy data space $\mathcal{H}_t = L^2(p_t)$. This yields $p_t(y \mid x_t) = (T_t\, p(y \mid \cdot))(x_t)$, where $T_t : \mathcal{H}_0 \to \mathcal{H}_t$ is the conditional expectation of a function of the clean data $f(x_0)$ given $x_t$,

$$(T_t f)(x_t) := \mathbb{E}_{X_0 \sim p_t(\cdot \mid x_t)}\left[f(X_0)\right]. \tag{6}$$

This operator retains components of $f$ that are recoverable from the noisy observation $x_t$. To identify them, we consider the covariance operator $T_t T_t^{\ast}$, where the adjoint operator $T_t^{\ast} : \mathcal{H}_t \to \mathcal{H}_0$ corresponds to the forward process

$$(T_t^{\ast} g)(x_0) := \mathbb{E}_{X_t \sim p_t(\cdot \mid x_0)}\left[g(X_t)\right]. \tag{7}$$

The covariance operator $T_t T_t^{\ast}$ represents a round-trip information bottleneck, smoothing features of noisy data via the adjoint and then reconstructing them,

$$(T_t T_t^{\ast} f)(\tilde{x}_t) = \mathbb{E}_{X_0 \sim p_t(\cdot \mid \tilde{x}_t)}\left[\mathbb{E}_{X_t \sim p_t(\cdot \mid X_0)}\left[f(X_t)\right]\right]. \tag{8}$$

The principal modes of $T_t T_t^{\ast}$ correspond to the directions in the noisy data that survive the round-trip. These are the intrinsic coordinates along which guidance can act. At high noise levels, only a few modes remain significant, allowing $T_t$ to be approximated by a low-rank decomposition.

### 4.1 Spectral Decomposition

If $p_0$ has compact support, $T_t T_t^{\ast}$ is compact (App. [A.2](https://arxiv.org/html/2605.28900#A1.SS2), Theorem [A.4](https://arxiv.org/html/2605.28900#A1.Thmtheorem4)) and self-adjoint. Thus, it admits an eigenvalue decomposition and $T_t$ a singular value decomposition, i.e., there exist singular values $\{\sigma_{t,k}\}_{k=1}^{\infty}$, and orthonormal bases $\{\psi_{t,k}\}_{k=1}^{\infty} \subset \mathcal{H}_0$ and $\{\phi_{t,k}\}_{k=1}^{\infty} \subset \mathcal{H}_t$ such that

$$(T_t f)(x_t) = \sum_{k=1}^{\infty} \sigma_{t,k}\, \phi_{t,k}(x_t)\, \mathbb{E}_{p_0}\left[f(X_0)\, \psi_{t,k}(X_0)\right], \tag{9}$$

with $\sigma_{t,1} = 1$ and $\phi_{t,1} \equiv \psi_{t,1} \equiv 1$ corresponding to the constant mode (App. [A.3](https://arxiv.org/html/2605.28900#A1.SS3), Proposition [A.5](https://arxiv.org/html/2605.28900#A1.Thmtheorem5)). The right singular functions $\psi_{t,k} \in \mathcal{H}_0$ capture clean-space features preserved by diffusion. The left singular functions $\phi_{t,k} \in \mathcal{H}_t$ represent their optimal reconstructions from noisy data, corresponding to eigenfunctions of the covariance operator $T_t T_t^{\ast}$. The singular values $\sigma_{t,k}$ quantify robustness: large $\sigma_{t,k}$ indicate modes that are recoverable from noise, while smaller $\sigma_{t,k}$ correspond to modes lost during diffusion.
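
The decomposition in Eq. (9) can be explored numerically on a small discrete stand-in. The sketch below assumes a one-dimensional prior on five atoms and a discretized version of the forward kernel in Eq. (1); the grid, the atoms, and the function name `diffusion_svd` are illustrative choices, not part of the paper. In this finite setting $T_t$ becomes a matrix whose SVD can be computed directly.

```python
import numpy as np

def diffusion_svd(alpha_bar, atoms, prior, grid):
    """SVD of a discretized T_t (Eq. 6) for a prior on finitely many atoms."""
    mean = np.sqrt(alpha_bar) * atoms[:, None]
    kernel = np.exp(-(grid[None, :] - mean) ** 2 / (2.0 * (1.0 - alpha_bar)))
    kernel /= kernel.sum(axis=1, keepdims=True)  # p_t(x_t | x0) on the grid
    joint = prior[:, None] * kernel              # p(x0, x_t)
    p_t = joint.sum(axis=0)                      # marginal p_t(x_t)
    # Matrix of T_t between orthonormal bases of L2(p0) and L2(p_t).
    Q = joint.T / np.sqrt(p_t[:, None] * prior[None, :])
    U, s, Vt = np.linalg.svd(Q, full_matrices=False)
    phi = U / np.sqrt(p_t[:, None])       # phi_{t,k}(x_t), one column per mode
    psi = Vt.T / np.sqrt(prior[:, None])  # psi_{t,k}(x0), one column per mode
    return s, phi, psi

atoms = np.linspace(-1.0, 1.0, 5)   # support of the toy prior
prior = np.full(5, 0.2)             # uniform p0
grid = np.linspace(-4.0, 4.0, 161)  # discretized x_t axis

s_low, _, psi_low = diffusion_svd(0.8, atoms, prior, grid)  # mild noise
s_high, _, _ = diffusion_svd(0.1, atoms, prior, grid)       # heavy noise
print("alpha_bar=0.8:", np.round(s_low, 3))
print("alpha_bar=0.1:", np.round(s_high, 3))
```

In this toy model the leading singular value equals one with a constant singular function (up to sign), and the remaining singular values shrink as $\bar{\alpha}_t$ decreases, in line with the robustness interpretation above.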

### 4.2 Spectral Guidance

The SVD in Eq. ([9](https://arxiv.org/html/2605.28900#S4.E9)) gives a natural decomposition of the posterior expectation of any guidance signal into basis functions $\{\phi_{t,k}\}_{k \geq 1}$, determined by the unconditional diffusion process, weighted by guidance-dependent coefficients $c_{t,k}$.

###### Proposition 4.1.

For any $h \in \mathcal{H}_0$, its conditional expectation at time $t$ admits the expansion

$$\mathbb{E}_{X_0 \sim p_t(\cdot \mid x_t)}\left[h(X_0)\right] = \sum_{k=1}^{\infty} c_{t,k}\, \phi_{t,k}(x_t), \tag{10}$$

where $c_{t,k} := \mathbb{E}_{X_0, X_t}\left[h(X_0)\, \phi_{t,k}(X_t)\right]$.

We approximate the series in Eq. ([10](https://arxiv.org/html/2605.28900#S4.E10)) by truncating it to its leading $K$ terms. This low-rank approximation incurs an $L^2(p_t)$ error bounded by $\sigma_{t,K+1}^2\, \|h\|_{p_0}^2$ (App. [A.4](https://arxiv.org/html/2605.28900#A1.SS4), Proposition [A.11](https://arxiv.org/html/2605.28900#A1.Thmtheorem11)). This error is small at high noise levels, because the diffusion process ensures that

$$\sigma_{t,k}^2 \leq \mathbb{E}_{p_0}\left[\chi^2\left(p_t(\cdot \mid X_0)\, \big\|\, p_t\right)\right], \quad k \geq 2, \tag{11}$$

where $\chi^2$ denotes the chi-squared divergence. Because both $p_t(x_t \mid x_0)$ and $p_t(x_t)$ converge to the standard normal as $\bar{\alpha}_t \to 0$, the right-hand side vanishes, forcing every singular value beyond the first to zero (Proposition [A.7](https://arxiv.org/html/2605.28900#A1.Thmtheorem7)). Thus, at high noise levels, the spectrum concentrates on a few leading modes and the truncated expansion closely approximates the true posterior expectation of the guidance signal.
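
As a worked illustration of Eq. (11), consider a one-dimensional standard Gaussian prior $p_0 = \mathcal{N}(0, 1)$. This special case is an assumption made here for illustration and is not taken from the paper; in particular, a Gaussian prior is not compactly supported, so the compactness result of Section 4.1 is not what justifies the decomposition in this example. Under Eq. (1), the pair $(X_0, X_t)$ is jointly Gaussian with unit variances and correlation $\sqrt{\bar{\alpha}_t}$, and by the classical Mehler expansion the singular functions are the normalized Hermite polynomials, with

$$\sigma_{t,k} = \bar{\alpha}_t^{(k-1)/2}, \quad k \geq 1, \qquad \sum_{k \geq 2} \sigma_{t,k}^2 = \sum_{n \geq 1} \bar{\alpha}_t^{\,n} = \frac{\bar{\alpha}_t}{1-\bar{\alpha}_t} = \mathbb{E}_{p_0}\left[\chi^2\left(p_t(\cdot \mid X_0)\, \big\|\, p_t\right)\right].$$

Each $\sigma_{t,k}^2$ with $k \geq 2$ is one term of this sum, which recovers the bound in Eq. (11) for this example, and the whole sum vanishes as $\bar{\alpha}_t \to 0$. The truncation bound of Proposition A.11 then reads $\sigma_{t,K+1}^2 \|h\|_{p_0}^2 = \bar{\alpha}_t^{K} \|h\|_{p_0}^2$; for instance, at $\bar{\alpha}_t = 0.1$ and $K = 3$ it equals $10^{-3}\, \|h\|_{p_0}^2$.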

### 4.3 Learning the Diffusion Spectrum

As $\phi_{t,k} \in \mathcal{H}_t$ are eigenfunctions of the covariance operator $T_t T_t^{\ast}$, they maximize the correlation between diffused views $(x_t, \tilde{x}_t) \sim \zeta$ of the same clean sample $x_0$, with

$$\zeta(x_t, \tilde{x}_t) := \int_{\mathcal{X}} p_t(x_t \mid x_0)\, p_t(\tilde{x}_t \mid x_0)\, p_0(x_0)\, dx_0. \tag{12}$$

This allows us to learn the leading left singular subspace $\mathrm{span}\{\phi_{t,k}\}_{k=2}^{K+1}$ of $T_t$ using an SSL objective.

###### Theorem 4.2 (Variational characterization).

Let $f = (f_1, \dots, f_K)^{\top}$ with $\mathbb{E}_{p_t}[f] = 0$. Define the covariance

$$\boldsymbol{\Sigma}_t(f) := \mathbb{E}_{p_t}\left[f(X_t)\, f(X_t)^{\top}\right] \succ 0 \tag{13}$$

and the cross-covariance

$$\mathbf{C}_t(f) := \mathbb{E}_{\zeta}\left[f(X_t)\, f(\tilde{X}_t)^{\top}\right]. \tag{14}$$

Then,

$$\max_{f}\; \operatorname{Tr}\left(\mathbf{C}_t(f)\, \boldsymbol{\Sigma}_t(f)^{-1}\right) = \sum_{k=2}^{K+1} \sigma_{t,k}^2 \tag{15}$$

and any maximizer $f^{\star}$ satisfies $\mathrm{span}\{f^{\star}_k\}_{k=1}^{K} = \mathrm{span}\{\phi_{t,k}\}_{k=2}^{K+1}$.
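
Theorem 4.2 can be sanity-checked on a small discrete model where all expectations are exact sums. The sketch below is an illustration under assumed settings (a five-atom prior, a gridded $x_t$ axis, $K = 2$), not the paper's experimental setup. It evaluates the trace objective at the true singular functions and at random centered candidates.

```python
import numpy as np

# Discretized 1-D stand-in for Eq. (1); atoms, prior and grid are illustrative.
alpha_bar, K = 0.5, 2
atoms = np.linspace(-1.0, 1.0, 5)   # support of p0
prior = np.full(5, 0.2)             # p0
grid = np.linspace(-4.0, 4.0, 161)  # discretized x_t axis

kernel = np.exp(-(grid[None, :] - np.sqrt(alpha_bar) * atoms[:, None]) ** 2
                / (2.0 * (1.0 - alpha_bar)))
kernel /= kernel.sum(axis=1, keepdims=True)   # p_t(x_t | x0)
p_t = prior @ kernel                          # marginal p_t(x_t)
zeta = kernel.T @ (prior[:, None] * kernel)   # Eq. (12): law of two views

def objective(F):
    """Tr(C_t(f) Sigma_t(f)^{-1}) for f tabulated as F[x_t, k]."""
    F = F - p_t @ F                    # enforce E_{p_t}[f] = 0
    Sigma = F.T @ (p_t[:, None] * F)   # Eq. (13)
    C = F.T @ zeta @ F                 # Eq. (14)
    # trace(Sigma^{-1} C) equals trace(C Sigma^{-1}) by cyclicity.
    return float(np.trace(np.linalg.solve(Sigma, C)))

# Singular values and left singular functions of the discretized T_t.
Q = (prior[:, None] * kernel).T / np.sqrt(p_t[:, None] * prior[None, :])
U, s, _ = np.linalg.svd(Q, full_matrices=False)
phi = U / np.sqrt(p_t[:, None])

best = objective(phi[:, 1:K + 1])      # modes k = 2, ..., K+1
target = float(np.sum(s[1:K + 1] ** 2))
rng = np.random.default_rng(0)
rand = max(objective(rng.standard_normal((grid.size, K))) for _ in range(20))
print(f"at phi: {best:.6f}  sum of sigma^2: {target:.6f}  best random f: {rand:.6f}")
```

The objective at the singular functions matches the right-hand side of Eq. (15), and no random candidate exceeds it, as the variational characterization predicts.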

#### Connection to SSL objectives.

SSL methods such as VICReg (Bardes et al., [2021](https://arxiv.org/html/2605.28900#bib.bib3)) and Barlow Twins (Zbontar et al., [2021](https://arxiv.org/html/2605.28900#bib.bib12)) rely on hand-crafted augmentations (cropping, color jittering) to define invariance. We replace these heuristics with the diffusion process itself. The optimization problem in Eq. ([15](https://arxiv.org/html/2605.28900#S4.E15)) is the Rayleigh-Ritz characterization of the top-$K$ eigenspace, equivalent to a Kernel PCA with the kernel defined by the joint distribution $\zeta$. Maximizing the diagonal of $\mathbf{C}_t(f)$ enforces that the representation is invariant under the noise process, while the whitening transformation $\boldsymbol{\Sigma}_t(f)^{-1}$ prevents collapse into a constant solution.

## 5 Learning to Guide

The complete Spectral Guidance algorithm consists of an offline stage and a sampling stage; the latter augments an unconditional diffusion sampler with a guidance signal applied at a designated set of diffusion timesteps $\mathcal{T}$. We first describe how we learn the leading left singular functions of the conditional expectation operator. We then show how they are used for guidance.

### 5.1 Learning Singular Functions

As the constant mode is known a priori ($\phi_{t,1} \equiv 1$), we parameterize the next $K$ left singular functions of $T_t$ by a ResNet (He et al., [2016](https://arxiv.org/html/2605.28900#bib.bib60)) with time-modulation using FiLM layers (Perez et al., [2018](https://arxiv.org/html/2605.28900#bib.bib35)). The network $f_\phi : \mathcal{X} \times \mathbb{R}_{>0} \to \mathbb{R}^{K}$ receives a noisy sample and its timestep as input. Given a mini-batch $\{x_0^{(i)}\}_{i=1}^{B} \sim p_0$ and a timestep $t \in \mathcal{T}$, we draw coupled noisy samples $(x_t, \tilde{x}_t)$ according to

$$x_t^{(i)},\; \tilde{x}_t^{(i)} \overset{\text{i.i.d.}}{\sim} p_t\big(\cdot \mid x_0^{(i)}\big), \tag{16}$$

where $p_t(x_t \mid x_0)$ is the forward diffusion process. We evaluate $f_\phi$ on these two views, yielding

$$\mathbf{Z} = f_\phi(x_t, t), \quad \tilde{\mathbf{Z}} = f_\phi(\tilde{x}_t, t) \in \mathbb{R}^{B \times K}. \tag{17}$$

#### Whitening.

We implement the whitening transformation $\boldsymbol{\Sigma}_t(f)^{-1}$ in Theorem [4.2](https://arxiv.org/html/2605.28900#S4.Thmtheorem2) via the eigendecomposition of the empirical covariance. Let $\boldsymbol{\mu} \in \mathbb{R}^{K}$ denote the column-mean of $\mathbf{Z}$ and write

$$\hat{\boldsymbol{\Sigma}} := \frac{1}{B-1} (\mathbf{Z} - \boldsymbol{\mu})^{\top} (\mathbf{Z} - \boldsymbol{\mu}) = \mathbf{V} \boldsymbol{\Lambda} \mathbf{V}^{\top}. \tag{18}$$

We define the whitening matrix as

$$\mathbf{W} := \mathbf{V} (\boldsymbol{\Lambda} + \xi \mathbf{I})^{-1/2}, \tag{19}$$

where $\xi > 0$ is a small ridge hyperparameter ensuring $\mathbf{W}$ is well defined when $\hat{\boldsymbol{\Sigma}}$ is rank-deficient.

#### Loss.

Let $\mathbf{Z}^{w} := (\mathbf{Z} - \boldsymbol{\mu})\, \mathbf{W}$ and $\tilde{\mathbf{Z}}^{w} := (\tilde{\mathbf{Z}} - \boldsymbol{\mu})\, \mathbf{W}$ denote the whitened views. A Monte Carlo approximation of the objective in Eq. ([15](https://arxiv.org/html/2605.28900#S4.E15)) yields the loss

$$L = -\frac{1}{K(B-1)} \operatorname{Tr}\left((\mathbf{Z}^{w})^{\top}\, \tilde{\mathbf{Z}}^{w}\right), \tag{20}$$

which is equivalent to the objective in Theorem [4.2](https://arxiv.org/html/2605.28900#S4.Thmtheorem2), up to the ridge regularization and the constant factor $1/K$. Following common SSL practice, we apply a stop-gradient ($\mathrm{sg}$) to $\mathbf{Z}$, which also detaches the whitening statistics $\boldsymbol{\mu}$ and $\mathbf{W}$; this empirically stabilizes training. The full procedure, which is independent of the U-Net denoiser, is given in Algorithm [1](https://arxiv.org/html/2605.28900#alg1).

**Algorithm 1** Training loss for $f_\phi$

- **Input:** Prior distribution $p_0$, diffusion schedule $\{\bar{\alpha}_t\}_{t=1}^{T}$
- Mini-batch $\{x_0^{(i)}\}_{i=1}^{B} \sim p_0$
- Sample timestep $t \sim \mathrm{Uniform}(\mathcal{T})$
- Sample noise $\{\epsilon^{(i)}\}_{i=1}^{B},\ \{\tilde{\epsilon}^{(i)}\}_{i=1}^{B} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$
- $x_t^{(i)} \leftarrow \sqrt{\bar{\alpha}_t}\, x_0^{(i)} + \sqrt{1-\bar{\alpha}_t}\, \epsilon^{(i)}$
- $\tilde{x}_t^{(i)} \leftarrow \sqrt{\bar{\alpha}_t}\, x_0^{(i)} + \sqrt{1-\bar{\alpha}_t}\, \tilde{\epsilon}^{(i)}$
- $\mathbf{Z} \leftarrow \mathrm{sg}(f_\phi(x_t, t))$, $\tilde{\mathbf{Z}} \leftarrow f_\phi(\tilde{x}_t, t)$
- $\boldsymbol{\mu} \leftarrow \mathrm{col\text{-}mean}(\mathbf{Z})$
- $\hat{\boldsymbol{\Sigma}} \leftarrow (\mathbf{Z} - \boldsymbol{\mu})^{\top} (\mathbf{Z} - \boldsymbol{\mu}) / (B-1)$
- $\mathbf{V}, \boldsymbol{\Lambda} \leftarrow \mathrm{eigh}(\hat{\boldsymbol{\Sigma}})$
- $\mathbf{W} \leftarrow \mathbf{V}\, (\boldsymbol{\Lambda} + \xi \mathbf{I})^{-1/2}$
- $\mathbf{Z}^{w} \leftarrow (\mathbf{Z} - \boldsymbol{\mu}) \mathbf{W}$, $\tilde{\mathbf{Z}}^{w} \leftarrow (\tilde{\mathbf{Z}} - \boldsymbol{\mu}) \mathbf{W}$
- **Return:** $L = -\operatorname{Tr}\bigl((\mathbf{Z}^{w})^{\top} \tilde{\mathbf{Z}}^{w}\bigr) / (K(B-1))$
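
The whitening and loss steps of Algorithm 1 can be sketched in NumPy as follows. This is a forward-pass illustration only: the function name `spectral_ssl_loss` and the default ridge value are assumptions, the feature matrices are random stand-ins for the outputs of $f_\phi$, and since NumPy has no automatic differentiation the stop-gradient on $\mathbf{Z}$ has no effect here (in an autodiff framework one would detach $\mathbf{Z}$ before computing the statistics).

```python
import numpy as np

def spectral_ssl_loss(Z, Z_tilde, xi=1e-4):
    """Value of the loss in Eq. (20) for feature matrices of shape (B, K).

    xi is the ridge term of Eq. (19); its default here is an assumption.
    """
    B, K = Z.shape
    mu = Z.mean(axis=0, keepdims=True)   # column mean of Z
    Zc = Z - mu
    Sigma_hat = Zc.T @ Zc / (B - 1)      # Eq. (18)
    lam, V = np.linalg.eigh(Sigma_hat)   # Sigma_hat = V diag(lam) V^T
    W = V / np.sqrt(lam + xi)            # Eq. (19): scales column j by (lam_j + xi)^(-1/2)
    Zw = Zc @ W                          # whitened first view
    Zw_tilde = (Z_tilde - mu) @ W        # second view, same statistics
    return -np.trace(Zw.T @ Zw_tilde) / (K * (B - 1))

rng = np.random.default_rng(0)
Z = rng.standard_normal((512, 4))        # stand-in for f_phi(x_t, t)
loss_same = spectral_ssl_loss(Z, Z)      # perfectly correlated views
loss_indep = spectral_ssl_loss(Z, rng.standard_normal((512, 4)))
print(f"identical views: {loss_same:.4f}  independent views: {loss_indep:.4f}")
```

With identical views the loss approaches its lower limit of $-1$, while independent views give a value near zero, which matches the reading of Eq. (15) as a sum of squared correlations between views.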

#### Reference statistics.

The training loss whitens the output of $f_{\phi}$ with batch statistics recomputed at every step. To obtain a deployable estimator that can be evaluated on a single noisy sample and produce population-whitened features, we compute, post-training, a whitening transformation per timestep on a reference set $\mathcal{D}_{\text{ref}}=\{x_{0}^{(i)}\}_{i=1}^{M}$. For each $t\in\mathcal{T}$, we draw noisy versions $x_{t}^{(i)}\sim p_{t}(\cdot\mid x_{0}^{(i)})$ and encode them with $f_{\phi}$ to obtain a feature matrix $\mathbf{Z}_{t}\in\mathbb{R}^{M\times K}$. We set $\boldsymbol{\mu}_{t}$ to its column-mean and compute $\mathbf{W}_{t}$ from Eq. ([19](https://arxiv.org/html/2605.28900#S5.E19)) on the centered features. The whitened network, with the constant mode appended, is then

$$f_{\phi}^{w}(x_{t},t):=\begin{bmatrix}1&\bigl(f_{\phi}(x_{t},t)-\boldsymbol{\mu}_{t}\bigr)\,\mathbf{W}_{t}\end{bmatrix},\tag{21}$$

and we cache its evaluation on the reference set,

$$\boldsymbol{\Phi}_{t}:=\begin{bmatrix}\mathbf{1}&(\mathbf{Z}_{t}-\boldsymbol{\mu}_{t})\,\mathbf{W}_{t}\end{bmatrix}\in\mathbb{R}^{M\times(K+1)},\tag{22}$$

for use in guidance. By construction, $\boldsymbol{\Phi}_{t}$ has orthogonal columns; on a fresh sample $x_{t}$, the outputs of $f_{\phi}^{w}$ are approximately zero-mean with identity covariance under $p_{t}$.
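A minimal NumPy sketch of the reference-statistics step for one timestep is given below. It assumes that the whitening matrix takes the same form as in the training loss, $\mathbf{W}=\mathbf{V}(\boldsymbol{\Lambda}+\xi\mathbf{I})^{-1/2}$, with an $(M-1)$ covariance normalization; the function names and the random stand-in for the encoded features are hypothetical.

```python
import numpy as np

def reference_whitening(Z_t, xi=1e-4):
    """Compute mu_t, W_t and the cached matrix Phi_t of Eq. (22) from features Z_t (M x K)."""
    M, K = Z_t.shape
    mu_t = Z_t.mean(axis=0)
    Zc = Z_t - mu_t
    lam, V = np.linalg.eigh(Zc.T @ Zc / (M - 1))      # (eigenvalues, eigenvectors)
    W_t = V / np.sqrt(np.maximum(lam, 0.0) + xi)      # assumed form of the whitening matrix
    Phi_t = np.hstack([np.ones((M, 1)), Zc @ W_t])    # constant mode + whitened features
    return mu_t, W_t, Phi_t

def whitened_features(z, mu_t, W_t):
    """Eq. (21) applied to a single raw feature vector z = f_phi(x_t, t) of shape (K,)."""
    return np.concatenate([[1.0], (z - mu_t) @ W_t])

# Toy stand-in for encoded reference features (assumption): correlated random columns.
rng = np.random.default_rng(0)
Z_t = rng.standard_normal((500, 6)) @ rng.standard_normal((6, 6))
mu_t, W_t, Phi_t = reference_whitening(Z_t)
gram = Phi_t.T @ Phi_t                                # diagonal up to round-off
```

The Gram matrix `gram` is diagonal because the constant column is orthogonal to every centered column and the whitened columns are mutually decorrelated, which matches the orthogonality statement above.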

### 5.2 Guidance

We consider clean-data guidance signals $h(x_{0})\in\mathbb{R}^{D_{h}}$. Examples include the probability $p(y\mid x_{0})$ of a target class $y$ ($D_{h}=1$), a CLIP image embedding ($D_{h}=D_{\mathrm{CLIP}}$), or a flattened binary segmentation mask over the $W\times H$ pixel grid ($D_{h}=WH$).

#### Guidance coefficients.

To guide generation, we first estimate the spectral coefficients $c_{t,k}\in\mathbb{R}^{D_{h}}$ from Proposition [4.1](https://arxiv.org/html/2605.28900#S4.Thmtheorem1). Evaluating $h$ on the clean samples of $\mathcal{D}_{\text{ref}}$, we construct $\mathbf{H}\in\mathbb{R}^{M\times D_{h}}$ and combine it with the cached reference matrix $\boldsymbol{\Phi}_{t}$ from Eq. ([22](https://arxiv.org/html/2605.28900#S5.E22)) to obtain the Monte Carlo estimate

$$\hat{\mathbf{c}}_{t}=\frac{1}{M}\,\boldsymbol{\Phi}_{t}^{\top}\mathbf{H}\;\in\;\mathbb{R}^{(K+1)\times D_{h}},\tag{23}$$

whose $k$-th row $\hat{c}_{t,k}\in\mathbb{R}^{D_{h}}$ estimates the coefficient $c_{t,k}$.
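The estimate in Eq. (23) is a single matrix product. The sketch below illustrates it for a class-indicator signal; the reference matrix is a random stand-in with a leading column of ones, and the label array and the choice of class 3 are purely hypothetical.

```python
import numpy as np

def guidance_coefficients(Phi_t, H):
    """Monte Carlo estimate of Eq. (23): c_hat_t = Phi_t^T H / M, shape (K + 1, D_h)."""
    M = Phi_t.shape[0]
    return Phi_t.T @ H / M

# Toy reference set (assumption): random stand-in for the cached matrix Phi_t.
rng = np.random.default_rng(0)
M, K = 1000, 8
Phi_t = np.hstack([np.ones((M, 1)), rng.standard_normal((M, K))])
labels = rng.integers(0, 10, size=M)
H = (labels == 3).astype(float)[:, None]   # h(x_0) = indicator of class 3, so D_h = 1
c_hat = guidance_coefficients(Phi_t, H)
```

Because the first column of $\boldsymbol{\Phi}_{t}$ is constant, the first row of `c_hat` is simply the reference-set mean of the signal (here, the empirical frequency of the target class); the remaining rows measure how each whitened feature co-varies with it.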

#### Sampling.

Algorithm [2](https://arxiv.org/html/2605.28900#alg2) summarizes the sampling stage, which consists of a standard DDIM step followed by computing and applying a guidance vector $g$ with strength $\kappa$. For a noisy sample $x_{t}$, truncating the spectral expansion of Proposition [4.1](https://arxiv.org/html/2605.28900#S4.Thmtheorem1) to the leading $K+1$ terms yields the estimate $\hat{\mathbf{c}}_{t}^{\top}f_{\phi}^{w}(x_{t},t)\approx\mathbb{E}_{X_{0}\sim p_{t}(\cdot\mid x_{t})}[h(X_{0})]$. We define the guidance vector $g(x_{t})$ as

$$g(x_{t})=\nabla_{x_{t}}\mathcal{L}\left(\hat{\mathbf{c}}_{t}^{\top}f_{\phi}^{w}(x_{t},t)\right),\tag{24}$$

for a guidance-dependent loss function $\mathcal{L}$. If $h(x_{0})$ is a class probability, $\hat{\mathbf{c}}_{t}^{\top}f_{\phi}^{w}(x_{t},t)\approx p_{t}(y\mid x_{t})$ and we use the log-likelihood. Since the truncated series approximation of the density may locally violate positivity, we set $\nabla_{x_{t}}\mathcal{L}(z)=(\nabla_{x_{t}}z)/z$. For CLIP guidance, Eq. ([23](https://arxiv.org/html/2605.28900#S5.E23)) yields the expected CLIP embedding of the clean image. We employ the cosine similarity with the normalized text embedding $\mathbf{e}_{\text{text}}$ as $\mathcal{L}(\mathbf{z})=\mathbf{z}^{\top}\mathbf{e}_{\text{text}}/\|\mathbf{z}\|$. For mask guidance, Eq. ([23](https://arxiv.org/html/2605.28900#S5.E23)) yields the expected clean mask and we set $\mathcal{L}(\mathbf{z})=-\|\mathbf{z}-\mathbf{z}_{\text{target}}\|^{2}$, for a target mask $\mathbf{z}_{\text{target}}$.
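By the chain rule, the guidance vector of Eq. (24) factors into the Jacobian of $\mathbf{z}=\hat{\mathbf{c}}_{t}^{\top}f_{\phi}^{w}(x_{t},t)$ with respect to $x_{t}$ (obtained by differentiating through $f_{\phi}$, not shown here) and the derivative of $\mathcal{L}$ with respect to $\mathbf{z}$. The sketch below writes out only the second factor for the three losses and checks it against finite differences; all function names are hypothetical.

```python
import numpy as np

def dlogp_dz(z):
    """Class guidance: derivative of log z, i.e. the 1/z factor in (grad z) / z."""
    return 1.0 / z

def dcosine_dz(z, e_text):
    """CLIP guidance: derivative of z . e_text / ||z|| with respect to z."""
    n = np.linalg.norm(z)
    return e_text / n - (z @ e_text) * z / n**3

def dmask_dz(z, z_target):
    """Mask guidance: derivative of -||z - z_target||^2 with respect to z."""
    return -2.0 * (z - z_target)

def numerical_grad(loss, z, eps=1e-6):
    """Central finite differences, used only to sanity-check the formulas above."""
    g = np.zeros_like(z)
    for i in range(z.size):
        d = np.zeros_like(z)
        d[i] = eps
        g[i] = (loss(z + d) - loss(z - d)) / (2.0 * eps)
    return g

rng = np.random.default_rng(0)
z = rng.standard_normal(5)
e_text = rng.standard_normal(5)
e_text /= np.linalg.norm(e_text)          # normalized text embedding (toy stand-in)
z_target = rng.standard_normal(5)

cos_check = numerical_grad(lambda v: v @ e_text / np.linalg.norm(v), z)
mask_check = numerical_grad(lambda v: -np.sum((v - z_target) ** 2), z)
```

Given a Jacobian `J` of shape `(D_h, d)`, the guidance vector would then be `J.T @ dL_dz`; in practice a reverse-mode autodiff call computes this product directly.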

#### Complexity comparison.

Table [1](https://arxiv.org/html/2605.28900#S5.T1) provides a complexity comparison between Spectral Guidance, CG and TFG. Spectral Guidance amortizes computational cost by shifting the heavy optimization to a one-time offline phase, avoiding the $\mathcal{O}(N_{\text{rec}}\cdot(\nabla_{x}\epsilon_{\theta}+\nabla_{x}f_{\text{cls}}))$ bottleneck of training-free methods that require backpropagating through the denoiser and classifier. By using a lightweight network $f_{\phi}$ (16M parameters vs. the denoiser's 114M on CelebA-HQ), the per-step inference overhead is reduced to a shallow gradient $\nabla_{x}f_{\phi}$, with per-step latency comparable to classifier guidance.

Algorithm 2 Spectral Guidance

Input: Timesteps $\mathcal{T}$; Strength $\kappa$; Denoiser $\epsilon_{\theta}$; DDIM scheduler $\mathcal{S}$; Coefficients $\{\hat{\mathbf{c}}_{t}\}_{t\in\mathcal{T}}$; Pre-trained $f_{\phi}$

For $t$ in $\mathrm{reverse}(\mathcal{T})$:

- Predict noise: $\epsilon\leftarrow\epsilon_{\theta}(x,t)$
- Denoising step: $x\leftarrow\mathcal{S}(\epsilon,x,t)$
- Guidance: $g\leftarrow\nabla_{x}\mathcal{L}\left(\hat{\mathbf{c}}_{t}^{\top}f_{\phi}^{w}(x,t)\right)$
- Update latent: $x\leftarrow x+\kappa\sqrt{1-\bar{\alpha}_{t}}\,g$

Output: Final sample $x$
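The loop of Algorithm 2 can be sketched end to end on a toy problem. Everything below the loop is an assumption made for illustration: a linear beta schedule, data distributed as $\mathcal{N}(0,\mathbf{I})$ (for which the exact noise predictor is $\sqrt{1-\bar{\alpha}_{t}}\,x$), and a one-dimensional linear read-out standing in for $\hat{\mathbf{c}}_{t}^{\top}f_{\phi}^{w}(x,t)$ with a squared-error loss. The DDIM update is the standard deterministic one.

```python
import numpy as np

def ddim_step(eps, x, abar_t, abar_prev):
    """Deterministic DDIM update (eta = 0) from noise level abar_t to abar_prev."""
    x0_hat = (x - np.sqrt(1.0 - abar_t) * eps) / np.sqrt(abar_t)
    return np.sqrt(abar_prev) * x0_hat + np.sqrt(1.0 - abar_prev) * eps

def spectral_guidance_sample(x, timesteps, abar, kappa, denoiser, guidance_grad):
    """Loop of Algorithm 2: denoise, then nudge x along the guidance vector g."""
    for i in reversed(range(len(timesteps))):
        t = timesteps[i]
        abar_prev = abar[timesteps[i - 1]] if i > 0 else 1.0
        eps = denoiser(x, t)                            # predict noise
        x = ddim_step(eps, x, abar[t], abar_prev)       # denoising step
        g = guidance_grad(x, t)                         # guidance vector
        x = x + kappa * np.sqrt(1.0 - abar[t]) * g      # update latent
    return x

# --- Toy stand-ins (assumptions, not the paper's models) ---
abar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))  # linear beta schedule
timesteps = np.arange(0, 1000, 10)                       # 100 DDIM steps

def toy_denoiser(x, t):
    # Exact noise predictor when the data distribution is N(0, I).
    return np.sqrt(1.0 - abar[t]) * x

direction = np.array([1.0, 0.0, 0.0, 0.0])               # stands in for the learned read-out
target = 3.0

def toy_guidance_grad(x, t):
    # Gradient of L(z) = -(z - target)^2 with z = direction . x.
    z = direction @ x
    return -2.0 * (z - target) * direction

rng = np.random.default_rng(0)
x_T = rng.standard_normal(4)
x_guided = spectral_guidance_sample(x_T, timesteps, abar, 0.05, toy_denoiser, toy_guidance_grad)
x_plain = spectral_guidance_sample(x_T, timesteps, abar, 0.0, toy_denoiser, toy_guidance_grad)
```

In this toy, the guided sample's read-out `direction @ x_guided` ends close to `target`, while the coordinates orthogonal to `direction` match the unguided run, since the guidance vector only has a component along the read-out direction.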

Table 1: Computational complexity comparison of guidance approaches. $\epsilon_{\theta}$ and $f_{\phi}$ represent forward passes of the denoiser and the spectral network, respectively. $N_{\text{rec}}$ denotes the number of recursive steps required by UGD, FreeDoM and TFG.

## 6 Experiments

We evaluate Spectral Guidance along four axes: (i) its effectiveness and flexibility relative to training-free guidance baselines across categorical, text, and spatial conditioning tasks (§[6.2](https://arxiv.org/html/2605.28900#S6.SS2)); (ii) the sensitivity of guidance to the spectral rank $K$ and guidance strength $\kappa$ (§[6.3](https://arxiv.org/html/2605.28900#S6.SS3)); (iii) the connection between the spectrum of $T_{t}T_{t}^{\ast}$ and the temporal window in which guidance is most effective (§[6.4](https://arxiv.org/html/2605.28900#S6.SS4)); and (iv) sampling efficiency (§[6.5](https://arxiv.org/html/2605.28900#S6.SS5)).

### 6.1 Experimental Setup

#### Models and data.

We evaluate conditional image generation on CIFAR-10 (Krizhevsky et al., [2009](https://arxiv.org/html/2605.28900#bib.bib37)), CelebA-HQ (Karras et al., [2017](https://arxiv.org/html/2605.28900#bib.bib36)), and ImageNet (Deng et al., [2009](https://arxiv.org/html/2605.28900#bib.bib1)), using unconditional DDPM U-Nets with 36M (CIFAR-10), 114M (CelebA-HQ), and 121M (ImageNet) parameters. All experiments use a DDIM sampler. The spectral networks $f_{\phi}$ are time-conditioned ResNets (He et al., [2016](https://arxiv.org/html/2605.28900#bib.bib60)) with FiLM modulation (Perez et al., [2018](https://arxiv.org/html/2605.28900#bib.bib35)). The output dimensions are set to $K=512$ on CIFAR-10 and CelebA-HQ, and $K=2000$ on ImageNet. The $f_{\phi}$ networks are substantially lighter than the corresponding generative U-Nets: 9M parameters on CIFAR-10, 16M on CelebA-HQ, and 91M on ImageNet. Full details are provided in Appendix [C.1](https://arxiv.org/html/2605.28900#A3.SS1).

#### Baselines.

We compare our approach against state-of-the-art training-free guidance methods: DPS (Chung et al., [2022](https://arxiv.org/html/2605.28900#bib.bib29)), LGD (Song et al., [2023](https://arxiv.org/html/2605.28900#bib.bib31)), FreeDoM (Yu et al., [2023](https://arxiv.org/html/2605.28900#bib.bib32)), UGD (Bansal et al., [2023](https://arxiv.org/html/2605.28900#bib.bib30)), MPGD (He et al., [2024](https://arxiv.org/html/2605.28900#bib.bib13)), and TFG (Ye et al., [2024](https://arxiv.org/html/2605.28900#bib.bib34)), using the implementation and hyperparameters provided in the TFG framework.

#### Tasks.

We evaluate three conditioning modalities.

- Categorical. We guide toward the 10 classes of CIFAR-10, combinations of attributes on CelebA-HQ (Gender + Age and Gender + Hair color), and 4 classes of ImageNet. The clean-data signal $h(x_{0})$ is the one-hot label available with each dataset; baselines backpropagate through external classifiers.
- Text (CLIP). On CelebA-HQ, we evaluate generation conditioned on 15 text prompts, e.g., “a young woman wearing sunglasses”. In our method, $h(x_{0})$ is the CLIP embedding of the clean image; baselines backpropagate cosine similarity through CLIP’s image encoder.
- Mask. We evaluate mask guidance on CelebA-HQ by setting the clean-data signal $h(x_{0})$ of our framework to a hair-segmentation mask. We guide all models using the gradients of the MSE between the target mask and the clean mask estimate; baselines backpropagate through an external face-parsing model.

Crucially, our approach relies on the same spectral representations $\{\boldsymbol{\Phi}_{t}\}_{t\in\mathcal{T}}$ from Eq. ([22](https://arxiv.org/html/2605.28900#S5.E22)) to support all three tasks. Only the guidance signal $h$ is swapped at sampling time.

#### Metrics.

We report fidelity and guidance validity. Fidelity is measured via intra-class FID (Heusel et al., [2017](https://arxiv.org/html/2605.28900#bib.bib59)) against target-class training images on CIFAR-10 and ImageNet, and via the log Kernel Inception Distance (log KID) on CelebA-HQ. For categorical guidance, validity is the top-1 accuracy of a pre-trained classifier on generated samples. For text guidance, we measure image–prompt alignment with VQAScore (Lin et al., [2024](https://arxiv.org/html/2605.28900#bib.bib39)) (range $[0,1]$, higher is better), using llava-v1.5-7b. For mask guidance, we report the Intersection-over-Union (IoU) between the target hair mask and the mask predicted by an independent segmentation model.
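For reference, the IoU used for mask validity is the ratio of the overlap to the union of two binary masks. A minimal sketch follows; the function name and the convention for two empty masks are assumptions, and the segmentation model that produces the predicted mask is not shown.

```python
import numpy as np

def mask_iou(pred, target):
    """Intersection-over-Union of two binary masks of identical shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0  # convention (assumption): two empty masks agree perfectly
    return np.logical_and(pred, target).sum() / union

# One overlapping pixel out of three covered pixels.
example = mask_iou([[1, 1], [0, 0]], [[1, 0], [1, 0]])
```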

### 6.2 Overall Results

Table [2](https://arxiv.org/html/2605.28900#S6.T2) reports validity and fidelity across all guidance tasks. Spectral Guidance attains the highest validity in every setting, with competitive fidelity on most tasks.

On CIFAR-10, Spectral Guidance reaches 89.4% accuracy, 37 percentage points above the strongest baseline, while also improving FID. On CelebA-HQ, it leads on both attribute-combination tasks; LGD achieves a better log KID but at a substantial cost in accuracy. On ImageNet, Spectral Guidance reaches 41.6% accuracy (vs. 40.9% for TFG).

For CLIP guidance, Spectral Guidance attains the best VQAScore, while the training-free baselines drop markedly relative to their categorical performance. Fig. [5(b)](https://arxiv.org/html/2605.28900#S6.F5.sf2) shows clearer attribute realization and fewer off-target generations than DPS, particularly for prompts involving localized attributes (sunglasses, beards).

For mask guidance, Spectral Guidance reaches an IoU of 0.80 (vs. 0.78 for TFG and FreeDoM), reusing the same $\{\boldsymbol{\Phi}_{t}\}_{t\in\mathcal{T}}$ learned for the categorical and CLIP tasks. Qualitative examples in Fig. [5](https://arxiv.org/html/2605.28900#S6.F5) show high-fidelity faces generated by Spectral Guidance, with hairlines aligning with the target masks (red boundary).

Table 2: Guidance evaluation. Validity is measured via accuracy, IoU, or VQAScore (↑); fidelity is measured via FID or log KID (↓). NG = No guidance (unconditional sampling).

| Dataset / Task | Metric | NG | DPS | LGD | FreeDoM | MPGD | UGD | TFG | Ours |
|---|---|---|---|---|---|---|---|---|---|
| CIFAR-10 / Labels | Accuracy (↑) | 10.0 | 50.1 | 32.2 | 34.8 | 38.0 | 45.9 | 52.0 | 89.4 |
| | FID (↓) | 98.1 | 172 | 102 | 135 | 88.3 | 94.2 | 91.7 | 70.7 |
| CelebA-HQ / Gender + Age | Accuracy (↑) | 25.0 | 71.0 | 52.0 | 68.7 | 68.6 | 75.1 | 75.2 | 91.5 |
| | log KID (↓) | -3.22 | -4.26 | -5.10 | -3.89 | -4.79 | -4.37 | -3.86 | -2.98 |
| CelebA-HQ / Gender + Hair | Accuracy (↑) | 22.4 | 73.0 | 55.0 | 67.1 | 63.9 | 71.3 | 76.0 | 88.3 |
| | log KID (↓) | -3.13 | -3.90 | -5.00 | -3.50 | -4.33 | -4.12 | -3.60 | -3.34 |
| ImageNet / Labels | Accuracy (↑) | 0.0 | 38.8 | 11.5 | 19.7 | 6.8 | 25.5 | 40.9 | 41.6 |
| | FID (↓) | 209 | 193 | 210 | 200 | 239 | 205 | 176 | 183 |
| CelebA-HQ / Mask | IoU (↑) | 0.38 | 0.75 | 0.45 | 0.78 | 0.48 | 0.69 | 0.78 | 0.80 |
| | log KID (↓) | -4.62 | -3.17 | -4.08 | -3.00 | -4.50 | -3.62 | -3.15 | -2.96 |
| CelebA-HQ / CLIP | VQAScore (↑) | 0.34 | 0.59 | 0.50 | 0.57 | 0.40 | 0.44 | 0.62 | 0.64 |

(a) Qualitative comparison on CIFAR-10. Spectral Guidance (a) generates high-fidelity samples with clear class semantics.

(b) Qualitative comparison of CLIP guidance. Top rows: Spectral Guidance (ours); Bottom rows: DPS.

![Refer to caption](https://arxiv.org/html/2605.28900v1/fig/mask_guidance.jpeg)

Figure 5: Mask guidance. Images generated with Spectral Guidance and target hair masks, indicated by the red boundary.

(a) CIFAR-10 analysis. (a) Sweeping guidance strength $\kappa$, Spectral Guidance achieves a better accuracy-FID frontier than training-free baselines. (b) Sweeping rank $K$ at fixed $\kappa$: accuracy improves with diminishing returns, while FID is non-monotonic, mirroring the fidelity-diversity trade-off in (a).

(b) Spectrum of the covariance operator $T_{t}T_{t}^{\ast}$. We visualize the eigenvalues of the covariance operator $T_{t}T_{t}^{\ast}$ over diffusion time $t\in[0,1000]$, sorted by index. The top row displays the corresponding posterior mean reconstruction $\hat{x}_{0}=\mathbb{E}[X_{0}\mid x_{t}]$.

### 6.3 Rank and Guidance Strength Ablations

#### Accuracy vs. FID.

In the $\kappa$ sweep of the CIFAR-10 analysis figure, we analyze the trade-off between guidance strength $\kappa$ and sample quality. Spectral Guidance achieves significantly higher accuracy than the training-free baselines at equivalent FID levels. While supervised Classifier Guidance (CG) yields better accuracy for the same FID, it requires training a dedicated noise-aware classifier for every new task. In contrast, Spectral Guidance approaches this performance using a single, unsupervised representation. At extreme guidance scales (rightmost points), we observe a degradation of FID, as guidance dominates the unconditional score, driving the sampling trajectory off the data manifold.

#### Sensitivity to rank ($K$).

The rank sweep of the CIFAR-10 analysis figure ablates the dimension of the spectral approximation. Accuracy improves monotonically in $K$, rising sharply between 8 and 128 and saturating thereafter. This behavior is predicted by Proposition [A.11](https://arxiv.org/html/2605.28900#A1.Thmtheorem11), which bounds the $L^{2}(p_{t})$ error of the rank-$K$ approximation of $\mathbb{E}[h(X_{0})\mid x_{t}]$: the rapid decay of the singular spectrum (§[4](https://arxiv.org/html/2605.28900#S4)) ensures the leading modes capture the bulk of the recoverable class information. Beyond this regime, $K$ takes on the role of a guidance-strength knob, with each added mode raising the effective scale at fixed guidance strength $\kappa$. This sharpens class typicality, further improving accuracy at the cost of within-class diversity, mirroring the well-known fidelity-diversity trade-off of classifier-free guidance and lifting intra-class FID. Despite the worsening FID, which reflects lower class diversity, samples remain high-fidelity at $K=512$, as shown in Fig. [11(c)](https://arxiv.org/html/2605.28900#A3.F11.sf3) (Appendix [C](https://arxiv.org/html/2605.28900#A3)).

(c) Guidance windows. Guidance is most effective during the phase transition of the spectrum of $T_{t}T_{t}^{\ast}$.

### 6.4 Diffusion Spectrum and Guidance Window

As shown in Fig. [7(b)](https://arxiv.org/html/2605.28900#S6.F7.sf2), the spectra of CIFAR-10 and CelebA-HQ undergo a transition over time, separating three distinct regimes. For small $t$, the singular values of $T_{t}$ are close to 1. The operator is close to the identity, indicating that the forward process is nearly information-preserving. In this regime, the posterior mean resembles a clean image. For large $t$, the singular values concentrate near zero, as the forward process has erased all information. Between these two extremes we observe a time interval ($t\approx 400$ for CIFAR-10, $t\approx 700$ for CelebA-HQ) in which the spectrum transitions, coinciding with the formation of a semantically meaningful posterior mean. This provides a possible explanation for the lower accuracy of posterior-mean-based guidance on CIFAR-10 (Table [2](https://arxiv.org/html/2605.28900#S6.T2)): the operator rank collapses midway through the forward process, leaving no stable guidance signal during early reverse steps. In contrast, CelebA-HQ retains information until higher noise levels.
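The three regimes can be reproduced in closed form on a toy case that is not taken from the paper. Assume one-dimensional Gaussian data $X_{0}\sim\mathcal{N}(0,\lambda)$, the forward process $X_{t}=\sqrt{\bar{\alpha}_{t}}X_{0}+\sqrt{1-\bar{\alpha}_{t}}\,\varepsilon$, a linear beta schedule, and that $T_{t}$ is the conditional expectation operator between functions of $X_{0}$ and $X_{t}$. For jointly Gaussian variables, the leading non-constant singular value of that operator is the correlation $\rho_{t}$ between $X_{0}$ and $X_{t}$, which the sketch below evaluates.

```python
import numpy as np

def linear_beta_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal level alpha_bar_t for a linear beta schedule (assumed schedule)."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def gaussian_leading_singular_value(abar, lam):
    """Correlation between X0 ~ N(0, lam) and X_t = sqrt(abar) X0 + sqrt(1 - abar) eps."""
    return np.sqrt(abar * lam / (abar * lam + 1.0 - abar))

abar = linear_beta_alpha_bar()
rho_high_var = gaussian_leading_singular_value(abar, lam=1.0)    # high-variance direction
rho_low_var = gaussian_leading_singular_value(abar, lam=0.01)    # low-variance direction

# First timestep at which each direction drops below correlation 0.5.
t_half_high = int(np.argmax(rho_high_var < 0.5))
t_half_low = int(np.argmax(rho_low_var < 0.5))
```

In this toy, $\rho_{t}$ starts near 1, decays monotonically, and ends near 0, with low-variance directions losing their information earlier than high-variance ones. It only illustrates the qualitative shape of the transition; the transition times quoted above for CIFAR-10 and CelebA-HQ come from the learned spectra, not from this formula.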

To empirically validate this interpretation, we perform a sliding window experiment to identify the time interval when guidance is most effective. For a given $\tau$, we generate images while applying guidance only when the timestep $t$ falls within $[\tau-100,\tau+100]$. Fig. [7(c)](https://arxiv.org/html/2605.28900#S6.F7.sf3) plots the accuracy as a function of $\tau$. In the same plot, we overlay the normalized trace of the truncated covariance operator, which equals the mean of the covariance eigenvalues and serves as a proxy for the amount of information retained by the forward process. We observe that guidance effectiveness correlates with the spectral transition. The operator’s spectrum thus serves as a reliable indicator for the guidance window: the period where the diffusion process is sufficiently flexible to be guided but sufficiently structured to retain semantic information.
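The two quantities in this experiment are simple to state in code. The sketch below assumes a 100-step DDIM grid over $t\in[0,1000)$ and uses hypothetical helper names; the window test would gate the guidance update inside the sampling loop.

```python
import numpy as np

def in_guidance_window(t, tau, half_width=100):
    """True when timestep t lies in [tau - half_width, tau + half_width]."""
    return tau - half_width <= t <= tau + half_width

def normalized_trace(eigenvalues):
    """Trace of the truncated covariance operator divided by its rank: the mean eigenvalue."""
    return float(np.mean(eigenvalues))

timesteps = np.arange(0, 1000, 10)                 # assumed 100-step DDIM grid
active = [int(t) for t in timesteps if in_guidance_window(t, tau=400)]
info_proxy = normalized_trace([1.0, 0.5, 0.0])     # toy eigenvalues, gives 0.5
```

With this grid and $\tau=400$, guidance is applied on the 21 timesteps from 300 to 500 and skipped elsewhere.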

Table 3: Computational cost on CelebA-HQ. NG: No guidance. Batch size 1, DDIM 100 steps.

### 6.5 Sampling Efficiency

A key advantage of Spectral Guidance is its amortized cost. Unlike training-free methods that require backpropagation through the denoiser during sampling, our method moves the bulk of the computations offline. As shown in Table [3](https://arxiv.org/html/2605.28900#S6.T3), while we incur an initial cost of $\sim$10 GPU hours to train $f_{\phi}$ on CelebA-HQ, this cost is amortized over future generations. In the sampling phase, our method requires only the gradients of $f_{\phi}$ (16M parameters), resulting in a per-step latency of 21.7 ms, comparable to unconditional sampling (19.2 ms) and nearly $4\times$ faster than TFG with $N_{\text{rec}}=1$ (81.2 ms). The “Effective time” row demonstrates this trade-off for 10,000 images. Even including the one-time training cost, Spectral Guidance is more efficient than TFG.
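The amortization argument can be checked with back-of-the-envelope arithmetic from the figures quoted in this paragraph. The sketch assumes that the quoted latencies are the full cost of one sampling step, that images are generated sequentially at batch size 1 with 100 steps each, and that the training cost is exactly 10 GPU hours; the resulting numbers are a recomputation under those assumptions, not the values reported in Table 3.

```python
def effective_hours(n_images, steps, ms_per_step, training_hours=0.0):
    """Total wall-clock hours: one-time training plus sequential sampling."""
    sampling_hours = n_images * steps * ms_per_step / 1000.0 / 3600.0
    return training_hours + sampling_hours

N_IMAGES, STEPS = 10_000, 100
ours = effective_hours(N_IMAGES, STEPS, 21.7, training_hours=10.0)   # about 16.0 h
tfg = effective_hours(N_IMAGES, STEPS, 81.2)                         # about 22.6 h
unguided = effective_hours(N_IMAGES, STEPS, 19.2)                    # about 5.3 h

# Number of images after which the training cost is recovered relative to TFG.
saving_per_image_s = STEPS * (81.2 - 21.7) / 1000.0
break_even_images = 10.0 * 3600.0 / saving_per_image_s               # about 6,050 images
```

Under these assumptions, the one-time training cost is recovered after roughly 6,000 generated images, which is consistent with the claim that Spectral Guidance is cheaper than TFG at 10,000 images.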

## 7 Limitations

#### Scale and latent diffusion.

Our experiments focus on pixel-space diffusion models at moderate scale; we do not directly evaluate Spectral Guidance on latent diffusion models or large-scale text-to-image foundation models. The framework, however, makes no assumption that $x_{0}$ lives in pixel space: the operator $T_{t}$ and its spectral decomposition are defined relative to the diffusion process itself. Learning $f_{\phi}$ in the latent space of an autoencoder and applying Spectral Guidance to a latent diffusion model is therefore a straightforward extension.

#### Reference data for coefficient estimation.

Estimating $\hat{\mathbf{c}}_{t}$ in Eq. ([23](https://arxiv.org/html/2605.28900#S5.E23)) requires evaluating both $\boldsymbol{\Phi}_{t}$ and the guidance signal $h(x_{0})$ on a reference set $\mathcal{D}_{\text{ref}}$ drawn from $p_{0}$, whereas training-free baselines rely on off-the-shelf losses or pretrained models. This can be a practical constraint when training data is unavailable or labeled examples are scarce. Two observations mitigate it. First, $\mathcal{D}_{\text{ref}}$ need not be large: $\hat{\mathbf{c}}_{t}$ estimates only $K+1$ coefficients, and the cached $\boldsymbol{\Phi}_{t}$ is reused across downstream tasks (Table [2](https://arxiv.org/html/2605.28900#S6.T2)). Second, when only the diffusion model is available, $\mathcal{D}_{\text{ref}}$ can in principle be drawn from the unconditional model itself.

#### Pixel-level inverse problems.

For a linear inverse problem $y=Ax_{0}+\eta$, conditional sampling requires the measurement constraint $Ax_{0}\approx y$ to be satisfied at the pixel level. Our method restricts guidance to the span of the top-$K$ singular functions of $T_{t}$, a $K$-dimensional subspace of $\mathcal{H}_{t}$. For semantic signals such as labels, attributes, CLIP embeddings, and coarse masks, this low-rank structure is precisely what enables stable, denoiser-gradient-free control. For inverse problems, however, the constraint count $C$ typically far exceeds $K$: inpainting half a $256\times 256\times 3$ image imposes $C=98{,}304$ independent constraints, leaving the $K$-dimensional subspace insufficient to drive $\|y-Ax_{0}\|^{2}$ to zero. Thus, we view Spectral Guidance as complementary to posterior mean-based guidance methods such as DPS.
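The constraint count quoted above follows from counting the observed pixel values. The short sketch below reproduces it and compares it with $K=512$, the rank used on CelebA-HQ in the experiments; the helper name is hypothetical.

```python
def inpainting_constraints(height, width, channels, observed_fraction):
    """Number of scalar measurement constraints when a fraction of the pixels is observed."""
    return int(height * width * channels * observed_fraction)

C = inpainting_constraints(256, 256, 3, 0.5)   # 98304
K = 512
constraints_per_mode = C // K                  # 192
```

With 192 constraints per available spectral mode, a $K$-dimensional guidance signal cannot satisfy all of them simultaneously in general.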

## 8 Conclusion

We introduced Spectral Guidance, a framework that reframes conditional generation as a projection onto the intrinsic coordinates of the unconditional diffusion process. Because this basis is task-agnostic, our method provides a flexible and efficient guidance mechanism that eliminates both the need for expensive per-step denoiser backpropagation used in training-free guidance, and the task-specific retraining required for classifier guidance. Empirically, we demonstrate significant improvements in sampling efficiency and in controllability, from classes and attributes to text prompts and masks, over existing baselines. Beyond performance, our approach provides new insights into the internal structure of diffusion models. By learning the spectral decomposition of the diffusion operator, we make the evolution of information over time explicit, which allows us to determine when control is most effective.

## Acknowledgments

This work is funded by LARSyS FCT funding UID/50009/2025 (DOI: 10.54499/UID/50009/2025) and LA/P/0083/2020 (DOI: 10.54499/LA/P/0083/2020). G. Moreira is also supported via grant SFRH/BD/151466/2021 through the Carnegie Mellon Portugal program. J. P. Costeira and M. Marques are also supported by the PT Smart Retail project (PRR -02/C05-i11/2024.C645440011-00000062), through IAPMEI - Agência para a Competitividade e Inovação.

## Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

## References

- A. Bansal, H. Chu, A. Schwarzschild, S. Sengupta, M. Goldblum, J. Geiping, and T. Goldstein (2023) Universal guidance for diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 843–852.
- A. Bardes, J. Ponce, and Y. LeCun (2021) Vicreg: variance-invariance-covariance regularization for self-supervised learning. arXiv preprint arXiv:2105.04906.
- S. Chen, H. Zhang, M. Guo, Y. Lu, P. Wang, and Q. Qu (2024) Exploring low-dimensional subspace in diffusion models for controllable image editing. Advances in neural information processing systems 37, pp. 27340–27371.
- J. Choi, S. Kim, Y. Jeong, Y. Gwon, and S. Yoon (2021) Ilvr: conditioning method for denoising diffusion probabilistic models. arXiv preprint arXiv:2108.02938.
- H. Chung, J. Kim, M. T. Mccann, M. L. Klasky, and J. C. Ye (2022) Diffusion posterior sampling for general noisy inverse problems. arXiv preprint arXiv:2209.14687.
- R. R. Coifman and S. Lafon (2006) Diffusion maps. Applied and computational harmonic analysis 21(1), pp. 5–30.
- Y. Dalva and P. Yanardag (2024) Noiseclr: a contrastive learning approach for unsupervised discovery of interpretable directions in diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 24209–24218.
- J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255.
- P. Dhariwal and A. Nichol (2021) Diffusion models beat gans on image synthesis. Advances in neural information processing systems 34, pp. 8780–8794.
- F. Handke, F. Koulischer, G. Raya, and L. Ambrogioni (2025) Measuring semantic information production in generative diffusion models. arXiv preprint arXiv:2506.10433.
- K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778.
- Y. He, N. Murata, C. Lai, Y. Takida, T. Uesaka, D. Kim, W. Liao, Y. Mitsufuji, Z. Kolter, R. Salakhutdinov, et al. (2024) Manifold preserving guided diffusion. In International Conference on Learning Representations, Vol. 2024, pp. 44819–44850.
- M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems 30.
- J. Ho, A. Jain, and P. Abbeel (2020) Denoising diffusion probabilistic models. Advances in neural information processing systems 33, pp. 6840–6851.
- J. Ho and T. Salimans (2022) Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598.
- T. Karras, T. Aila, S. Laine, and J. Lehtinen (2017) Progressive growing of gans for improved quality, stability, and variation. CoRR abs/1710.10196. [Link](http://arxiv.org/abs/1710.10196).
- B. Kawar, M. Elad, S. Ermon, and J. Song (2022) Denoising diffusion restoration models. Advances in neural information processing systems 35, pp. 23593–23606.
- B. Kawar, G. Vaksman, and M. Elad (2021) Snips: solving noisy inverse problems stochastically. Advances in Neural Information Processing Systems 34, pp. 21757–21769.
- F. Koulischer, F. Handke, J. Deleu, T. Demeester, and L. Ambrogioni (2025) Feedback guidance of diffusion models. arXiv preprint arXiv:2506.06085.
- A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images.
- T. Kynkäänniemi, M. Aittala, T. Karras, S. Laine, T. Aila, and J. Lehtinen (2024) Applying guidance in a limited interval improves sample and distribution quality in diffusion models. Advances in Neural Information Processing Systems 37, pp. 122458–122483.
- Z. Lin, D. Pathak, B. Li, J. Li, X. Xia, G. Neubig, P. Zhang, and D. Ramanan (2024) Evaluating text-to-visual generation with image-to-text generation. In European Conference on Computer Vision, pp. 366–384.
- Y. Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le (2022) Flow matching for generative modeling. arXiv preprint arXiv:2210.02747.
- Y. Park, M. Kwon, J. Choi, J. Jo, and Y. Uh (2023) Understanding the latent space of diffusion models through the lens of riemannian geometry. Advances in Neural Information Processing Systems 36, pp. 24129–24142.
- E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville (2018) Film: visual reasoning with a general conditioning layer. In Proceedings of the AAAI conference on artificial intelligence, Vol. 32.
- G. Raya and L. Ambrogioni (2023) Spontaneous symmetry breaking in generative diffusion models. Advances in Neural Information Processing Systems 36, pp. 66377–66389.
- R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer (2022) High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10684–10695.
- C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton, K. Ghasemipour, R. Gontijo Lopes, B. Karagol Ayan, T. Salimans, et al. (2022) Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems 35, pp. 36479–36494.
- J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli (2015) Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pp. 2256–2265.
- J. Song, C. Meng, and S. Ermon (2020a) Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502.
- J. Song, Q. Zhang, H. Yin, M. Mardani, M. Liu, J. Kautz, Y. Chen, and A. Vahdat (2023) Loss-guided diffusion models for plug-and-play controllable generation. In International Conference on Machine Learning, pp. 32483–32498.
- Y. Song and S. Ermon (2019) Generative modeling by estimating gradients of the data distribution. Advances in neural information processing systems 32.
- Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole (2020b) Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456.
- J. A. Tropp (2015) An introduction to matrix concentration inequalities. Foundations and trends® in machine learning 8(1-2), pp. 1–230.
- H. Ye, H. Lin, J. Han, M. Xu, S. Liu, Y. Liang, J. Ma, J. Y. Zou, and S. Ermon (2024) Tfg: unified training-free guidance for diffusion models. Advances in Neural Information Processing Systems 37, pp. 22370–22417.
- J. Yu, Y. Wang, C. Zhao, B. Ghanem, and J. Zhang (2023) Freedom: training-free energy-guided conditional diffusion model. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 23174–23184.
- J. Zbontar, L. Jing, I. Misra, Y. LeCun, and S. Deny (2021) Barlow twins: self-supervised learning via redundancy reduction. In International conference on machine learning, pp. 12310–12320.
- Z. Zhan, D. Chen, J. Mei, Z. Zhao, J. Chen, C. Chen, S. Lyu, and C. Wang (2024) Conditional image synthesis with diffusion models: a survey. arXiv preprint arXiv:2409.19365.
- L\. Zhang, A\. Rao, and M\. Agrawala \(2023\)Adding conditional control to text\-to\-image diffusion models\.InProceedings of the IEEE/CVF international conference on computer vision,pp\. 3836–3847\.Cited by:[§1](https://arxiv.org/html/2605.28900#S1.p1.1)\.

## Appendix A Theoretical Results

We denote by $\mu_0$ and $\mu_t$ the measures with densities $p_0$ and the marginal $p_t(x_t)=\int p_t(x_t\mid x_0)\,d\mu_0(x_0)$, respectively, with respect to the Lebesgue measure. We consider two Hilbert spaces. The clean data space $\mathcal{H}_0=L^2(\mu_0)$, with inner product

$$\langle f,\,g\rangle_{\mu_0}=\mathbb{E}_{X_0\sim p_0}[f(X_0)\,g(X_0)],\tag{25}$$

and the noisy data space $\mathcal{H}_t=L^2(\mu_t)$, with inner product

$$\langle f,\,g\rangle_{\mu_t}=\mathbb{E}_{X_t\sim p_t}[f(X_t)\,g(X_t)].\tag{26}$$

This appendix is structured as follows.

- Appendix [A.1](https://arxiv.org/html/2605.28900#A1.SS1) defines the backward operator $T_t$ and the diffusion operator $T_t^{\ast}$, establishes the adjointness relation $\langle g,\,T_t f\rangle_{\mu_t}=\langle T_t^{\ast}g,\,f\rangle_{\mu_0}$ (Proposition [A.3](https://arxiv.org/html/2605.28900#A1.Thmtheorem3)), and derives the covariance operator $T_tT_t^{\ast}$.
- Appendix [A.2](https://arxiv.org/html/2605.28900#A1.SS2) shows that $T_t$ is Hilbert–Schmidt (and hence compact) under the compact-support assumption on $p_0$ (Theorem [A.4](https://arxiv.org/html/2605.28900#A1.Thmtheorem4)).
- Appendix [A.3](https://arxiv.org/html/2605.28900#A1.SS3) develops the singular value decomposition of $T_t$. It shows that the leading singular pair consists of the constant functions $\psi_{t,1}\equiv 1$, $\phi_{t,1}\equiv 1$ with $\sigma_{t,1}=1$ (Proposition [A.5](https://arxiv.org/html/2605.28900#A1.Thmtheorem5)), that $T_t$ is a contraction (Proposition [A.6](https://arxiv.org/html/2605.28900#A1.Thmtheorem6)), and that all non-trivial singular values vanish as $\bar{\alpha}_t\to 0$ (Proposition [A.7](https://arxiv.org/html/2605.28900#A1.Thmtheorem7)). It also establishes a variational characterization of the leading eigenspace (Theorem [A.9](https://arxiv.org/html/2605.28900#A1.Thmtheorem9)).
- Appendix [A.4](https://arxiv.org/html/2605.28900#A1.SS4) derives the spectral expansion of the guidance signal (Proposition [A.10](https://arxiv.org/html/2605.28900#A1.Thmtheorem10)) and bounds the $L^2$ error of the rank-$K$ truncation (Proposition [A.11](https://arxiv.org/html/2605.28900#A1.Thmtheorem11)).
- Appendix [A.5](https://arxiv.org/html/2605.28900#A1.SS5) provides a finite-sample error bound (Theorem [A.12](https://arxiv.org/html/2605.28900#A1.Thmtheorem12)) for the whitened cross-correlation loss used to learn the singular functions in §[4.3](https://arxiv.org/html/2605.28900#S4.SS3).
- Appendix [A.6](https://arxiv.org/html/2605.28900#A1.SS6) derives closed-form eigenfunctions for tractable priors: Gaussian (Proposition [A.13](https://arxiv.org/html/2605.28900#A1.Thmtheorem13)) and circle (Proposition [A.14](https://arxiv.org/html/2605.28900#A1.Thmtheorem14)).

### A.1 Conditional Operators

###### Definition A.1 (Backward operator).

The backward operator $T_t:\mathcal{H}_0\to\mathcal{H}_t$ is the conditional expectation of a function of the clean data $f(x_0)$ given a noisy observation $x_t$. For $f\in\mathcal{H}_0$,

$$(T_t f)(x_t):=\mathbb{E}_{X_0\sim p_t(\cdot\mid x_t)}[f(X_0)].\tag{27}$$

This operator formalizes the denoising: it retains components of $f$ that are inferable from $X_t$.

###### Definition A.2 (Diffusion operator).

The diffusion operator $T_t^{\ast}:\mathcal{H}_t\to\mathcal{H}_0$ is the conditional expectation of $g(x_t)$ given an initial state $x_0$. For $g\in\mathcal{H}_t$,

$$(T_t^{\ast}g)(x_0):=\mathbb{E}_{X_t\sim p_t(\cdot\mid x_0)}[g(X_t)].\tag{28}$$

The operator $T_t^{\ast}$ describes the evolution of $g$ under the corruption process.

###### Proposition A.3 (Adjointness).

The operators $T_t$ and $T_t^{\ast}$ are Hilbert adjoints. For all $f\in\mathcal{H}_0$ and for all $g\in\mathcal{H}_t$ we have

$$\langle g,\,T_t f\rangle_{\mu_t}=\langle T_t^{\ast}g,\,f\rangle_{\mu_0},\tag{29}$$

and $T_tT_t^{\ast}:\mathcal{H}_t\to\mathcal{H}_t$ is self-adjoint.

###### Proof.

By definition, we have

$$\begin{aligned}
\langle g,\,T_t f\rangle_{\mu_t}
&=\int g(x_t)\left(\int f(x_0)\,p_t(x_0\mid x_t)\,dx_0\right)d\mu_t(x_t)\\
&=\iint g(x_t)\,f(x_0)\,p_t(x_0\mid x_t)\,p_t(x_t)\,dx_0\,dx_t\\
&=\iint g(x_t)\,f(x_0)\,p_t(x_0,x_t)\,dx_0\,dx_t.
\end{aligned}\tag{30}$$

Similarly,

$$\begin{aligned}
\langle T_t^{\ast}g,\,f\rangle_{\mu_0}
&=\int\left(\int g(x_t)\,p_t(x_t\mid x_0)\,dx_t\right)f(x_0)\,d\mu_0(x_0)\\
&=\iint g(x_t)\,f(x_0)\,p_t(x_t\mid x_0)\,p_0(x_0)\,dx_t\,dx_0\\
&=\iint g(x_t)\,f(x_0)\,p_t(x_0,x_t)\,dx_0\,dx_t.
\end{aligned}\tag{31}$$

From Eqs. ([30](https://arxiv.org/html/2605.28900#A1.E30)) and ([31](https://arxiv.org/html/2605.28900#A1.E31)), we have that $\langle T_t^{\ast}g,\,f\rangle_{\mu_0}=\langle g,\,T_t f\rangle_{\mu_t}$. It follows that $T_t^{\ast}$ and $T_t$ are Hilbert adjoints. Since

$$\langle h,\,T_tT_t^{\ast}g\rangle_{\mu_t}=\langle T_t^{\ast}h,\,T_t^{\ast}g\rangle_{\mu_0}=\langle T_tT_t^{\ast}h,\,g\rangle_{\mu_t},\tag{32}$$

the operator $T_tT_t^{\ast}$ is self-adjoint in $\mathcal{H}_t$. ∎
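
As a sanity check on Proposition A.3, the adjointness identity can be verified numerically on a finite toy model, where the integrals become sums. The sketch below is not from the paper: it assumes hypothetical finite state spaces (4 clean states, 5 noisy states) with a random prior `p0` and forward kernel `P`, and the helper names `T`, `T_star`, and `inner` are illustrative.

```python
import random

random.seed(0)
n0, nt = 4, 5  # hypothetical sizes of the finite clean and noisy state spaces

def normalize(v):
    s = sum(v)
    return [x / s for x in v]

# Prior p0 over clean states and forward kernel P[i][j] = p_t(x_t = j | x_0 = i).
p0 = normalize([random.random() + 0.1 for _ in range(n0)])
P = [normalize([random.random() + 0.1 for _ in range(nt)]) for _ in range(n0)]

# Marginal p_t and posterior Q[j][i] = p_t(x_0 = i | x_t = j) via Bayes' rule.
pt = [sum(p0[i] * P[i][j] for i in range(n0)) for j in range(nt)]
Q = [[p0[i] * P[i][j] / pt[j] for i in range(n0)] for j in range(nt)]

def T(f):
    # Backward operator: (T f)(j) = E[f(X_0) | X_t = j].
    return [sum(Q[j][i] * f[i] for i in range(n0)) for j in range(nt)]

def T_star(g):
    # Diffusion operator: (T* g)(i) = E[g(X_t) | X_0 = i].
    return [sum(P[i][j] * g[j] for j in range(nt)) for i in range(n0)]

def inner(u, v, w):
    # Weighted inner product <u, v>_w = sum_k w[k] u[k] v[k].
    return sum(wk * uk * vk for wk, uk, vk in zip(w, u, v))

f = [random.gauss(0, 1) for _ in range(n0)]
g = [random.gauss(0, 1) for _ in range(nt)]
h = [random.gauss(0, 1) for _ in range(nt)]

lhs = inner(g, T(f), pt)        # <g, T f>_{mu_t}
rhs = inner(T_star(g), f, p0)   # <T* g, f>_{mu_0}

sa_lhs = inner(h, T(T_star(g)), pt)   # <h, T T* g>_{mu_t}
sa_rhs = inner(T(T_star(h)), g, pt)   # <T T* h, g>_{mu_t}
```

In this discrete setting both pairs agree up to floating-point rounding, because each side reduces to a sum against the joint weights `p0[i] * P[i][j]`, mirroring the role of the joint density in the proof.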

#### Covariance operator.

The covariance operator $T_tT_t^{\ast}$ acts on test functions $f\in\mathcal{H}_t$. By expanding the operator definitions, we have

$$\begin{aligned}
(T_tT_t^{\ast}f)(\tilde{x}_t)
&=\mathbb{E}_{X_0\sim p_t(\cdot\mid\tilde{x}_t)}\bigl[\mathbb{E}_{X_t\sim p_t(\cdot\mid X_0)}[f(X_t)]\bigr]\\
&=\iint f(x_t)\,p_t(x_0\mid\tilde{x}_t)\,p_t(x_t\mid x_0)\,dx_t\,dx_0\\
&=\int f(x_t)\left(\int p_t(x_0\mid\tilde{x}_t)\,p_t(x_t\mid x_0)\,dx_0\right)dx_t.
\end{aligned}\tag{33}$$

To express this as an integral operator with respect to the measure $\mu_t$, we multiply and divide by the marginal density $p_t(x_t)$:

$$\begin{aligned}
(T_tT_t^{\ast}f)(\tilde{x}_t)
&=\int f(x_t)\left(\int\frac{p_t(x_0\mid\tilde{x}_t)\,p_t(x_t\mid x_0)}{p_t(x_t)}\,dx_0\right)p_t(x_t)\,dx_t\\
&=\int f(x_t)\left(\int\frac{p_t(x_0\mid\tilde{x}_t)\,p_t(x_t\mid x_0)}{p_t(x_t)}\,dx_0\right)d\mu_t(x_t).
\end{aligned}\tag{34}$$

Applying Bayes' rule, we substitute $\frac{p_t(x_t\mid x_0)}{p_t(x_t)}=\frac{p_t(x_0\mid x_t)}{p_0(x_0)}$ to obtain

$$(T_tT_t^{\ast}f)(\tilde{x}_t)=\int f(x_t)\,\underbrace{\left(\int\frac{p_t(x_0\mid\tilde{x}_t)\,p_t(x_0\mid x_t)}{p_0(x_0)}\,dx_0\right)}_{=:\,k_t(x_t,\tilde{x}_t)}\,d\mu_t(x_t).\tag{35}$$

Hence, $T_tT_t^{\ast}$ admits a symmetric diffusion kernel $k_t$ on the noisy data manifold. Defined with respect to $\mu_t$, the kernel is

$$k_t(x_t,\tilde{x}_t)=\int\frac{p_t(x_0\mid x_t)\,p_t(x_0\mid\tilde{x}_t)}{p_0(x_0)}\,dx_0.\tag{36}$$

This kernel $k_t(x_t,\tilde{x}_t)$ represents the transition density, relative to $\mu_t$, of the coupled backward-forward process ($x_t\to x_0\to\tilde{x}_t$). It measures the connectivity between two noisy points $x_t$ and $\tilde{x}_t$. A large $t$ allows the kernel to capture the coarser, global structure of the data (Coifman and Lafon, [2006](https://arxiv.org/html/2605.28900#bib.bib9)). From Eq. ([36](https://arxiv.org/html/2605.28900#A1.E36)), $k_t$ is symmetric, i.e., $k_t(x_t,\tilde{x}_t)=k_t(\tilde{x}_t,x_t)$.
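
The kernel form of the covariance operator can be illustrated on a finite toy model. The following sketch is an assumption-laden illustration rather than the paper's code: it uses hypothetical finite state spaces (3 clean states, 6 noisy states), builds the discrete analogue of the kernel in Eq. (36), and compares applying $T_tT_t^{\ast}$ directly with integrating against the kernel under $\mu_t$. All names (`k`, `TTstar`, `row_mass`) are illustrative.

```python
import random

random.seed(1)
n0, nt = 3, 6  # hypothetical numbers of clean and noisy states

def normalize(v):
    s = sum(v)
    return [x / s for x in v]

p0 = normalize([random.random() + 0.1 for _ in range(n0)])                      # prior
P = [normalize([random.random() + 0.1 for _ in range(nt)]) for _ in range(n0)]  # p_t(j | i)
pt = [sum(p0[i] * P[i][j] for i in range(n0)) for j in range(nt)]               # marginal
Q = [[p0[i] * P[i][j] / pt[j] for i in range(n0)] for j in range(nt)]           # p_t(i | j)

# Discrete analogue of Eq. (36): k[j][m] = sum_i p_t(i | j) p_t(i | m) / p0(i).
k = [[sum(Q[j][i] * Q[m][i] / p0[i] for i in range(n0)) for m in range(nt)]
     for j in range(nt)]

def TTstar(f):
    # Apply T* first (average f over X_t given X_0), then T (average over X_0 given x_t).
    tstar_f = [sum(P[i][j] * f[j] for j in range(nt)) for i in range(n0)]
    return [sum(Q[m][i] * tstar_f[i] for i in range(n0)) for m in range(nt)]

f = [random.gauss(0, 1) for _ in range(nt)]
direct = TTstar(f)
via_kernel = [sum(f[j] * k[j][m] * pt[j] for j in range(nt)) for m in range(nt)]

sym_err = max(abs(k[j][m] - k[m][j]) for j in range(nt) for m in range(nt))
op_err = max(abs(a - b) for a, b in zip(direct, via_kernel))
# Integrating the kernel against mu_t in its first argument gives one for every m.
row_mass = [sum(k[j][m] * pt[j] for j in range(nt)) for m in range(nt)]
```

The unit `row_mass` reflects that the kernel is a transition density relative to $\mu_t$, which is equivalent to $T_tT_t^{\ast}$ mapping the constant function to itself.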

### A.2 Compactness

###### Theorem A.4 (Compactness).

Let $p_0$ have compact support $\mathcal{X}:=\operatorname{supp}(p_0)$ with $p_0>0$ on $\mathcal{X}$, and let the forward kernel be Gaussian, $p_t(x_t\mid x_0)=\mathcal{N}\bigl(x_t;\,\sqrt{\bar{\alpha}_t}\,x_0,\,(1-\bar{\alpha}_t)I\bigr)$. Then, for every $t>0$, the backward operator $T_t$ is Hilbert–Schmidt and thus compact.

###### Proof.

We write $T_t$ in integral-kernel form. Since $d\mu_0=p_0\,dx_0$, we have

$$(T_t f)(x_t)=\int_{\mathcal{X}}f(x_0)\,p_t(x_0\mid x_t)\,dx_0=\int_{\mathcal{X}}f(x_0)\,\underbrace{\frac{p_t(x_0\mid x_t)}{p_0(x_0)}}_{=:\,k_t(x_t,x_0)}\,d\mu_0(x_0).\tag{37}$$

Because $T_t:L^2(\mu_0)\to L^2(\mu_t)$ is an integral operator with kernel $k_t$, its Hilbert–Schmidt norm satisfies

$$\|T_t\|_{\mathrm{HS}}^2=\iint|k_t(x_t,x_0)|^2\,d\mu_0(x_0)\,d\mu_t(x_t).\tag{38}$$

Substituting $k_t=p_t(x_0\mid x_t)/p_0(x_0)$ and applying Bayes' rule, $p_t(x_0\mid x_t)=p_t(x_t\mid x_0)\,p_0(x_0)/p_t(x_t)$, yields

$$\begin{aligned}
\|T_t\|_{\mathrm{HS}}^2
&=\iint\frac{p_t(x_0\mid x_t)^2}{p_0(x_0)^2}\,d\mu_0(x_0)\,d\mu_t(x_t)\\
&=\iint\frac{p_t(x_t\mid x_0)^2}{p_t(x_t)^2}\,p_0(x_0)\,p_t(x_t)\,dx_0\,dx_t\\
&=\int_{\mathcal{X}}p_0(x_0)\left(\int_{\mathbb{R}^d}\frac{p_t(x_t\mid x_0)^2}{p_t(x_t)}\,dx_t\right)dx_0.
\end{aligned}\tag{39}$$

It therefore suffices to show that the inner integral is uniformly bounded for $x_0\in\mathcal{X}$.

#### Lower bound on $p_t(x_t)$.

Since $\mathcal{X}$ is compact, there exists $R>0$ such that $\|x_0'\|\leq R$ for all $x_0'\in\mathcal{X}$. By the triangle inequality, $\|x_t-\sqrt{\bar{\alpha}_t}\,x_0'\|\leq\|x_t\|+\sqrt{\bar{\alpha}_t}\,R$, and because the Gaussian density is decreasing in $\|x_t-\sqrt{\bar{\alpha}_t}\,x_0'\|^2$, we obtain

$$p_t(x_t\mid x_0')\geq\frac{1}{\bigl(2\pi(1-\bar{\alpha}_t)\bigr)^{d/2}}\,\exp\biggl(-\frac{\|x_t\|^2+2\sqrt{\bar{\alpha}_t}\,R\,\|x_t\|+\bar{\alpha}_tR^2}{2(1-\bar{\alpha}_t)}\biggr).\tag{40}$$

The right-hand side is independent of $x_0'$, so marginalizing gives the same lower bound for

$$p_t(x_t)=\int_{\mathcal{X}}p_t(x_t\mid x_0')\,p_0(x_0')\,dx_0'\geq\frac{1}{\bigl(2\pi(1-\bar{\alpha}_t)\bigr)^{d/2}}\,\exp\biggl(-\frac{\|x_t\|^2+2\sqrt{\bar{\alpha}_t}\,R\,\|x_t\|+\bar{\alpha}_tR^2}{2(1-\bar{\alpha}_t)}\biggr).\tag{41}$$

#### Bounding the ratio.

The squared numerator is

$$p_t(x_t\mid x_0)^2=\frac{1}{\bigl(2\pi(1-\bar{\alpha}_t)\bigr)^{d}}\,\exp\biggl(-\frac{\|x_t-\sqrt{\bar{\alpha}_t}\,x_0\|^2}{1-\bar{\alpha}_t}\biggr).\tag{42}$$

Expanding $\|x_t-\sqrt{\bar{\alpha}_t}\,x_0\|^2=\|x_t\|^2-2\sqrt{\bar{\alpha}_t}\,x_t^{\top}x_0+\bar{\alpha}_t\|x_0\|^2$ and combining with the lower bound ([41](https://arxiv.org/html/2605.28900#A1.E41)), the Gaussian prefactors simplify and the exponent becomes

$$\frac{-\|x_t\|^2+4\sqrt{\bar{\alpha}_t}\,x_t^{\top}x_0+2\sqrt{\bar{\alpha}_t}\,R\,\|x_t\|-2\bar{\alpha}_t\|x_0\|^2+\bar{\alpha}_tR^2}{2(1-\bar{\alpha}_t)}.\tag{43}$$

By Cauchy–Schwarz, $x_t^{\top}x_0\leq\|x_t\|\,\|x_0\|\leq R\,\|x_t\|$ for $x_0\in\mathcal{X}$, and $-2\bar{\alpha}_t\|x_0\|^2\leq 0$, so

$$\frac{p_t(x_t\mid x_0)^2}{p_t(x_t)}\leq\frac{1}{\bigl(2\pi(1-\bar{\alpha}_t)\bigr)^{d/2}}\,\exp\biggl(\frac{-\|x_t\|^2+6\sqrt{\bar{\alpha}_t}\,R\,\|x_t\|+\bar{\alpha}_tR^2}{2(1-\bar{\alpha}_t)}\biggr).\tag{44}$$

#### Completing the square.

Writing $u=\|x_t\|$ and $c=3\sqrt{\bar{\alpha}_t}\,R$, we have

$$-u^2+6\sqrt{\bar{\alpha}_t}\,R\,u+\bar{\alpha}_tR^2=-(u-c)^2+9\bar{\alpha}_tR^2+\bar{\alpha}_tR^2=-(u-c)^2+10\bar{\alpha}_tR^2.\tag{45}$$

To make the Gaussian integral tractable, we apply the bound $(u-c)^2\geq\tfrac{1}{2}u^2-c^2$, which follows from Young's inequality with $\epsilon=1/2$. This gives

$$-(u-c)^2+10\bar{\alpha}_tR^2\leq-\tfrac{1}{2}\|x_t\|^2+9\bar{\alpha}_tR^2+10\bar{\alpha}_tR^2=-\tfrac{1}{2}\|x_t\|^2+19\bar{\alpha}_tR^2.\tag{46}$$

Substituting into ([44](https://arxiv.org/html/2605.28900#A1.E44)),

$$\frac{p_t(x_t\mid x_0)^2}{p_t(x_t)}\leq\frac{1}{\bigl(2\pi(1-\bar{\alpha}_t)\bigr)^{d/2}}\,\exp\biggl(\frac{19\bar{\alpha}_tR^2}{2(1-\bar{\alpha}_t)}\biggr)\,\exp\biggl(-\frac{\|x_t\|^2}{4(1-\bar{\alpha}_t)}\biggr).\tag{47}$$
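
For readers who want the intermediate step, the elementary bound invoked before Eq. (46) can be checked in one line. The derivation below is a reader's sketch, not part of the original proof. It uses Young's inequality in the form $2ab\leq\epsilon a^2+b^2/\epsilon$ with $a=u$, $b=c$, and $\epsilon=1/2$:

$$2uc\leq\tfrac{1}{2}u^2+2c^2\quad\Longrightarrow\quad(u-c)^2=u^2-2uc+c^2\geq u^2-\tfrac{1}{2}u^2-2c^2+c^2=\tfrac{1}{2}u^2-c^2.$$

Equivalently, the gap $(u-c)^2-\bigl(\tfrac{1}{2}u^2-c^2\bigr)=\tfrac{1}{2}(u-2c)^2$ is non-negative. With $c^2=9\bar{\alpha}_tR^2$, this is where the constant $19\bar{\alpha}_tR^2=9\bar{\alpha}_tR^2+10\bar{\alpha}_tR^2$ in Eq. (46) comes from.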

#### Concluding the bound.

Integrating ([47](https://arxiv.org/html/2605.28900#A1.E47)) over $x_t\in\mathbb{R}^d$ yields the Gaussian integral

$$\int_{\mathbb{R}^d}\frac{p_t(x_t\mid x_0)^2}{p_t(x_t)}\,dx_t\leq\frac{\bigl(4\pi(1-\bar{\alpha}_t)\bigr)^{d/2}}{\bigl(2\pi(1-\bar{\alpha}_t)\bigr)^{d/2}}\,\exp\biggl(\frac{19\bar{\alpha}_tR^2}{2(1-\bar{\alpha}_t)}\biggr)=2^{d/2}\,\exp\biggl(\frac{19\bar{\alpha}_tR^2}{2(1-\bar{\alpha}_t)}\biggr)=:C(t,R).\tag{48}$$

The constant $C(t,R)$ is finite for every $t>0$ and depends only on $t$, $R$, and the dimension $d$. Substituting back into ([39](https://arxiv.org/html/2605.28900#A1.E39)),

$$\|T_t\|_{\mathrm{HS}}^2\leq C(t,R)\int_{\mathcal{X}}p_0(x_0)\,dx_0=C(t,R)<\infty.\tag{49}$$

Since $T_t$ has finite Hilbert–Schmidt norm, it is Hilbert–Schmidt and therefore compact. ∎

By proving compactness for $T_t$, we immediately have that the adjoint $T_t^{\ast}$ is compact. Consequently, the covariance operator $T_tT_t^{\ast}$, being the composition of compact operators, is itself compact.
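
The Hilbert–Schmidt bound can be probed numerically in one dimension. The sketch below is an illustration under stated assumptions, not the paper's experiment: it takes $d=1$, a hypothetical noise level `alpha_bar = 0.5`, and a uniform prior on $[-R,R]$ with $R=1$ approximated by equally weighted midpoints. It evaluates the double integral in Eq. (39) by Riemann sums and compares it with the constant $C(t,R)$ of Eq. (48).

```python
import math

alpha_bar = 0.5   # hypothetical noise level alpha_bar_t in (0, 1)
R = 1.0           # radius of the support of p0
var = 1.0 - alpha_bar

def forward_density(x_t, x_0):
    # Gaussian forward kernel N(x_t; sqrt(alpha_bar) x_0, 1 - alpha_bar) in d = 1.
    z = x_t - math.sqrt(alpha_bar) * x_0
    return math.exp(-z * z / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

# Prior: uniform on [-R, R], represented by equally weighted midpoints.
n_prior = 40
x0_grid = [-R + 2.0 * R * (i + 0.5) / n_prior for i in range(n_prior)]

# Quadrature grid for x_t on [-8, 8]; the densities are negligible outside.
dx = 0.01
xt_grid = [-8.0 + dx * j for j in range(1601)]

# Marginal p_t(x_t) = E_{x0 ~ p0}[p_t(x_t | x0)].
pt = [sum(forward_density(x, x0) for x0 in x0_grid) / n_prior for x in xt_grid]

def inner_integral(x0):
    # Riemann-sum estimate of the inner integral of p_t(x_t | x0)^2 / p_t(x_t) over x_t.
    return sum(forward_density(x, x0) ** 2 / p for x, p in zip(xt_grid, pt)) * dx

hs_sq = sum(inner_integral(x0) for x0 in x0_grid) / n_prior          # Eq. (39)
C = 2.0 ** 0.5 * math.exp(19.0 * alpha_bar * R ** 2 / (2.0 * var))   # Eq. (48), d = 1
```

In this toy setting `hs_sq` should fall between 1 (the contribution of the trivial singular value) and `C`. The constant is far from tight, which is harmless because the proof only needs finiteness.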

### A.3 Singular Value Decomposition

Since $T_t$ is compact (Theorem [A.4](https://arxiv.org/html/2605.28900#A1.Thmtheorem4)), it admits a singular value decomposition (SVD). For any $f\in L^2(\mu_0)$,

$$(T_t f)(x_t)=\sum_{k=1}^{\infty}\sigma_{t,k}\,\phi_{t,k}(x_t)\,\langle\psi_{t,k},\,f\rangle_{\mu_0},\tag{50}$$

where the convergence is in $L^2(\mu_t)$, $\{\psi_{t,k}\}_{k\geq 1}\subset L^2(\mu_0)$ and $\{\phi_{t,k}\}_{k\geq 1}\subset L^2(\mu_t)$ are orthonormal systems, and $\sigma_{t,1}\geq\sigma_{t,2}\geq\cdots\geq 0$ are the singular values. In particular,

$$T_t^{\ast}T_t\,\psi_{t,k}=\sigma_{t,k}^2\,\psi_{t,k},\qquad T_t^{\ast}\,\phi_{t,k}=\sigma_{t,k}\,\psi_{t,k}.\tag{51}$$

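
The relations in Eq. (51) can be made concrete in the smallest non-trivial case. The sketch below is a toy illustration, not the paper's procedure: it assumes a hypothetical clean space with only two states, so that the function orthogonal to the constant in $L^2(\mu_0)$ is known in closed form and no numerical SVD routine is needed. The forward kernel `P` is random and the names are illustrative.

```python
import math
import random

random.seed(2)
nt = 5  # hypothetical number of noisy states; the clean space has two states

def normalize(v):
    s = sum(v)
    return [x / s for x in v]

p0 = [0.3, 0.7]                                                                # prior
P = [normalize([random.random() + 0.1 for _ in range(nt)]) for _ in range(2)]  # p_t(j | i)
pt = [p0[0] * P[0][j] + p0[1] * P[1][j] for j in range(nt)]                    # marginal
Q = [[p0[i] * P[i][j] / pt[j] for i in range(2)] for j in range(nt)]           # p_t(i | j)

def T(f):
    return [Q[j][0] * f[0] + Q[j][1] * f[1] for j in range(nt)]

def T_star(g):
    return [sum(P[i][j] * g[j] for j in range(nt)) for i in range(2)]

def norm(u, w):
    return math.sqrt(sum(wk * uk * uk for wk, uk in zip(w, u)))

# Unit-norm function orthogonal to the constant in L2(mu_0).
psi2 = [math.sqrt(p0[1] / p0[0]), -math.sqrt(p0[0] / p0[1])]

T_psi2 = T(psi2)
sigma2 = norm(T_psi2, pt)             # second singular value
phi2 = [v / sigma2 for v in T_psi2]   # matching function in L2(mu_t), unit norm

lhs = T_star(phi2)                    # T* phi_2
rhs = [sigma2 * v for v in psi2]      # sigma_2 psi_2
gram = T_star(T_psi2)                 # T* T psi_2, expected to equal sigma_2^2 psi_2
```

Because $T_t^{\ast}T_t$ preserves the orthogonal complement of the constant, which is one-dimensional here, `psi2` is automatically an eigenfunction, and both identities in Eq. (51) hold up to rounding.
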
###### Proposition A.5 (Leading singular functions).

Let $\psi_{t,1}\equiv 1$ and $\phi_{t,1}\equiv 1$ denote the constant functions equal to one. Then $(\psi_{t,1},\,\phi_{t,1})$ is a singular pair of $T_t$ with singular value $\sigma_{t,1}=1$. That is,

$$T_t\,\psi_{t,1}=\phi_{t,1},\qquad T_t^{\ast}\,\phi_{t,1}=\psi_{t,1}.\tag{52}$$

###### Proof.

We verify three facts: that $T_t$ and $T_t^{\ast}$ each map $\mathbf{1}$ to $\mathbf{1}$, and that the singular functions have unit norm.

For every $x_t$,

$$(T_t\,\mathbf{1})(x_t)=\mathbb{E}[\mathbf{1}(X_0)\mid X_t=x_t]=\int_{\mathcal{X}}p_t(x_0\mid x_t)\,dx_0=1,\tag{53}$$

since $p_t(\cdot\mid x_t)$ is a probability density over $x_0$. Hence $T_t\,\mathbf{1}=\mathbf{1}$.

Similarly, for every $x_0$,

$$(T_t^{\ast}\,\mathbf{1})(x_0)=\mathbb{E}[\mathbf{1}(X_t)\mid X_0=x_0]=\int_{\mathbb{R}^d}p_t(x_t\mid x_0)\,dx_t=1,\tag{54}$$

since $p_t(\cdot\mid x_0)$ is a probability density over $x_t$. Hence $T_t^{\ast}\,\mathbf{1}=\mathbf{1}$.

The singular functions have unit norm in their respective spaces:

$$\|\psi_{t,1}\|_{\mu_0}^2=\mathbb{E}_{X_0\sim p_0}[\mathbf{1}^2]=1,\qquad\|\phi_{t,1}\|_{\mu_t}^2=\mathbb{E}_{X_t\sim p_t}[\mathbf{1}^2]=1.\tag{55}$$

Combining the above, $T_t\,\psi_{t,1}=\phi_{t,1}$ and $T_t^{\ast}\,\phi_{t,1}=\psi_{t,1}$, so $(\psi_{t,1},\,\phi_{t,1})$ is a singular pair with singular value $\sigma_{t,1}=1$. ∎

###### Proposition A.6 ($T_t$ is a contraction).

For every $f\in L^2(\mu_0)$, $\|T_t f\|_{\mu_t}\leq\|f\|_{\mu_0}$. In particular, $\sigma_{t,k}\leq 1$ for all $k\geq 1$.

###### Proof.

Fix $f\in L^2(\mu_0)$. Applying Jensen's inequality to the conditional expectation and then the tower property gives

$$\begin{aligned}
\|T_t f\|_{\mu_t}^2
&=\mathbb{E}_{X_t\sim p_t}\bigl[\mathbb{E}[f(X_0)\mid X_t]^2\bigr]\\
&\leq\mathbb{E}_{X_t\sim p_t}\bigl[\mathbb{E}[f(X_0)^2\mid X_t]\bigr]&&\text{(Jensen)}\\
&=\mathbb{E}_{X_0\sim p_0}[f(X_0)^2]&&\text{(tower property)}\\
&=\|f\|_{\mu_0}^2.
\end{aligned}$$

Hence $\|T_t\|_{\mathrm{op}}\leq 1$, and since $\|T_t\|_{\mathrm{op}}=\sigma_{t,1}\geq\sigma_{t,2}\geq\cdots$, it follows that $\sigma_{t,k}\leq 1$ for every $k\geq 1$. ∎
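
Propositions A.5 and A.6 are easy to probe on a finite toy model. The sketch below is a hypothetical illustration (6 clean states, 4 noisy states, random prior and kernel; none of these choices come from the paper). It checks that both operators fix the constant function and that the ratio $\|T_t f\|_{\mu_t}^2/\|f\|_{\mu_0}^2$ never exceeds one over random test functions.

```python
import random

random.seed(3)
n0, nt = 6, 4  # hypothetical sizes of the clean and noisy state spaces

def normalize(v):
    s = sum(v)
    return [x / s for x in v]

p0 = normalize([random.random() + 0.1 for _ in range(n0)])                      # prior
P = [normalize([random.random() + 0.1 for _ in range(nt)]) for _ in range(n0)]  # p_t(j | i)
pt = [sum(p0[i] * P[i][j] for i in range(n0)) for j in range(nt)]               # marginal
Q = [[p0[i] * P[i][j] / pt[j] for i in range(n0)] for j in range(nt)]           # p_t(i | j)

def T(f):
    return [sum(Q[j][i] * f[i] for i in range(n0)) for j in range(nt)]

def T_star(g):
    return [sum(P[i][j] * g[j] for j in range(nt)) for i in range(n0)]

def sq_norm(u, w):
    return sum(wk * uk * uk for wk, uk in zip(w, u))

T_one = T([1.0] * n0)            # expected: the constant 1 on noisy states
Tstar_one = T_star([1.0] * nt)   # expected: the constant 1 on clean states

# Ratio ||T f||^2_{mu_t} / ||f||^2_{mu_0} for random test functions f.
ratios = []
for _ in range(200):
    f = [random.gauss(0, 1) for _ in range(n0)]
    ratios.append(sq_norm(T(f), pt) / sq_norm(f, p0))
max_ratio = max(ratios)
```

Random test functions typically give ratios well below one; the value one is attained only along the constant direction, consistent with $\sigma_{t,1}=1$ being the largest singular value.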

###### Proposition A.7 (Non-trivial singular values vanish as $\bar{\alpha}_t\to 0$).

Let $p_0$ have compact support $\mathcal{X}$ and let $X_t=\sqrt{\bar{\alpha}_t}\,X_0+\sqrt{1-\bar{\alpha}_t}\,\epsilon$, with $\epsilon\sim\mathcal{N}(0,I)$. Then, for every $k\geq 2$,

$$\sigma_{t,k}\to 0\quad\text{as}\quad\bar{\alpha}_t\to 0.\tag{56}$$

###### Proof.

Since $\sigma_{t,1}=1$ with singular functions $\psi_{t,1}\equiv 1$, $\phi_{t,1}\equiv 1$ (Proposition [A.5](https://arxiv.org/html/2605.28900#A1.Thmtheorem5)), it suffices to show that the operator norm of $T_t$ restricted to $\mathbf{1}^{\perp}:=\{f\in L^2(\mu_0):\mathbb{E}_{p_0}[f]=0\}$ vanishes as $\bar{\alpha}_t\to 0$.

#### Rewriting the operator on $\mathbf{1}^{\perp}$.

Fix $f\in\mathbf{1}^{\perp}$, so that $\int f(x_0)\,p_0(x_0)\,dx_0=0$. Applying Bayes' rule, $p_t(x_0\mid x_t)=p_t(x_t\mid x_0)\,p_0(x_0)/p_t(x_t)$, and using the zero-mean condition to subtract the constant $1$ from the density ratio,

$$\begin{aligned}
(T_t f)(x_t)
&=\int_{\mathcal{X}}f(x_0)\,p_t(x_0\mid x_t)\,dx_0\\
&=\int_{\mathcal{X}}f(x_0)\,\frac{p_t(x_t\mid x_0)}{p_t(x_t)}\,p_0(x_0)\,dx_0\\
&=\int_{\mathcal{X}}f(x_0)\,\biggl(\frac{p_t(x_t\mid x_0)}{p_t(x_t)}-1\biggr)\,p_0(x_0)\,dx_0,
\end{aligned}\tag{57}$$

where the last step uses $\int f\,d\mu_0=0$.

#### Pointwise and integrated bound.

Applying Cauchy–Schwarz in $L^2(\mu_0)$ to ([57](https://arxiv.org/html/2605.28900#A1.E57)) gives, for each $x_t$,

$$\bigl|(T_t f)(x_t)\bigr|^2\leq\|f\|_{\mu_0}^2\int_{\mathcal{X}}\biggl(\frac{p_t(x_t\mid x_0)}{p_t(x_t)}-1\biggr)^{2}\,p_0(x_0)\,dx_0.\tag{58}$$

Integrating both sides against $\mu_t$, i.e., multiplying by $p_t(x_t)$ and integrating over $x_t$,

$$\|T_t f\|_{\mu_t}^2\leq\|f\|_{\mu_0}^2\int_{\mathcal{X}}p_0(x_0)\int_{\mathbb{R}^d}\frac{\bigl(p_t(x_t\mid x_0)-p_t(x_t)\bigr)^2}{p_t(x_t)}\,dx_t\,dx_0=\|f\|_{\mu_0}^2\;\mathbb{E}_{X_0\sim p_0}\bigl[\chi^2\bigl(p_t(\cdot\mid X_0)\,\big\|\,p_t\bigr)\bigr],\tag{59}$$

where $\chi^2(q\,\|\,r):=\int(q-r)^2/r$ denotes the chi-squared divergence.

#### Vanishing of the chi-squared divergence.

As $\bar{\alpha}_t\to 0$, the forward kernel $p_t(x_t\mid x_0)=\mathcal{N}\bigl(x_t;\,\sqrt{\bar{\alpha}_t}\,x_0,\,(1-\bar{\alpha}_t)\,I\bigr)$ converges to $\mathcal{N}(x_t;\,0,\,I)$. Since $\mathcal{X}$ is compact, this convergence is uniform over $x_0\in\mathcal{X}$: for every $x_t$,

$$\sup_{x_0\in\mathcal{X}}\biggl|\frac{p_t(x_t\mid x_0)}{p_t(x_t)}-1\biggr|\;\longrightarrow\;0\quad\text{as}\quad\bar{\alpha}_t\to 0,\tag{60}$$

because both the numerator and the denominator $p_t(x_t)=\int_{\mathcal{X}}p_t(x_t\mid x_0')\,p_0(x_0')\,dx_0'$ converge to the same standard Gaussian density. To pass from this pointwise statement to the integral, note that $\chi^2\bigl(p_t(\cdot\mid x_0)\,\|\,p_t\bigr)+1=\int p_t(x_t\mid x_0)^2/p_t(x_t)\,dx_t$, and that by ([47](https://arxiv.org/html/2605.28900#A1.E47)) the integrand is bounded, for $\bar{\alpha}_t\leq 1/2$, by $\pi^{-d/2}\exp(19R^2/2)\exp(-\|x_t\|^2/4)$, which is integrable and independent of $x_0$ and $\bar{\alpha}_t$. Dominated convergence then applies. Therefore,

$$\mathbb{E}_{X_0\sim p_0}\bigl[\chi^2\bigl(p_t(\cdot\mid X_0)\,\big\|\,p_t\bigr)\bigr]\;\longrightarrow\;0\quad\text{as}\quad\bar{\alpha}_t\to 0.\tag{61}$$

#### Conclusion.

Taking the supremum of ([59](https://arxiv.org/html/2605.28900#A1.E59)) over $f\in\mathbf{1}^{\perp}$ with $\|f\|_{\mu_0}=1$ gives

$$\sigma_{t,2}^2=\bigl\|T_t\big|_{\mathbf{1}^{\perp}}\bigr\|_{\mathrm{op}}^2\leq\mathbb{E}_{X_0\sim p_0}\bigl[\chi^2\bigl(p_t(\cdot\mid X_0)\,\big\|\,p_t\bigr)\bigr]\;\longrightarrow\;0.\tag{62}$$

Since $\sigma_{t,2}\geq\sigma_{t,3}\geq\cdots$, it follows that $\sigma_{t,k}\to 0$ for all $k\geq 2$. ∎
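
The decay of the second singular value can be seen numerically in a case where everything is computable by one-dimensional quadrature. The sketch below is an illustration under assumptions that differ from the proposition's setting: it uses a symmetric two-point prior on $\{-1,+1\}$ (a discrete stand-in for a compactly supported density) and hypothetical noise levels. For this prior the zero-mean unit-norm function is $f(x_0)=x_0$, its image under $T_t$ is the posterior mean $\tanh\bigl(\sqrt{\bar{\alpha}_t}\,x_t/(1-\bar{\alpha}_t)\bigr)$, and $\sigma_{t,2}^2=\|T_t f\|_{\mu_t}^2$.

```python
import math

def gauss(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def sigma2_sq_and_chi2(alpha_bar, dx=0.002, half_width=10.0):
    a, var = math.sqrt(alpha_bar), 1.0 - alpha_bar
    n = round(2.0 * half_width / dx)
    sigma2_sq = 0.0   # ||T f||^2 for f(x0) = x0, via the posterior mean tanh(a x / var)
    chi2 = 0.0        # E_{x0}[chi^2(p_t(. | x0) || p_t)], computed from its definition
    for j in range(n + 1):
        x = -half_width + j * dx
        p_plus, p_minus = gauss(x, a, var), gauss(x, -a, var)
        p_t = 0.5 * (p_plus + p_minus)
        if p_t == 0.0:
            continue  # both densities underflowed; the contribution is negligible
        sigma2_sq += p_t * math.tanh(a * x / var) ** 2 * dx
        chi2 += 0.5 * ((p_plus - p_t) ** 2 + (p_minus - p_t) ** 2) / p_t * dx
    return sigma2_sq, chi2

levels = [0.9, 0.5, 0.1, 0.01]   # hypothetical values of alpha_bar_t, decreasing
results = [sigma2_sq_and_chi2(ab) for ab in levels]
```

With a two-point prior the space $\mathbf{1}^{\perp}$ is one-dimensional, so the bound in Eq. (62) holds with equality and the two returned quantities coincide. Both shrink toward zero as `alpha_bar` decreases, matching the statement of the proposition.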

We conclude with the variational characterization of the SVD of $T_t$, which forms the basis of our learning algorithm. We first present the constrained formulation, based on the Courant–Fischer theorem, and then the unconstrained approach.

###### Lemma A.8 (Constrained variational characterization of the leading eigenspace).

Let $\zeta$ denote the joint distribution of two noisy views of the same clean sample, drawn independently given that sample,

$$\zeta(x_t,\tilde{x}_t):=\int_{\mathcal{X}}p_t(x_t\mid x_0)\,p_t(\tilde{x}_t\mid x_0)\,d\mu_0(x_0).\tag{63}$$

Then the top-$K$ left singular functions $\{\phi_{t,k}\}_{k=1}^{K}$ of $T_t$ solve

$$\max_{f_1,\ldots,f_K}\;\sum_{k=1}^{K}\mathbb{E}_{(X_t,\tilde{X}_t)\sim\zeta}\bigl[f_k(X_t)\,f_k(\tilde{X}_t)\bigr]\qquad\text{s.t.}\quad\langle f_i,\,f_j\rangle_{\mu_t}=\delta_{ij}.\tag{64}$$

The maximum value is $\sum_{k=1}^{K}\lambda_{t,k}$, where $\lambda_{t,k}=\sigma_{t,k}^2$. Given any maximizer $\{\phi_{t,k}\}_{k=1}^{K}$, the corresponding right singular functions are recovered as

$$\psi_{t,k}(x_0)=\frac{1}{\sigma_{t,k}}\,(T_t^{\ast}\,\phi_{t,k})(x_0)=\frac{1}{\sigma_{t,k}}\,\mathbb{E}\bigl[\phi_{t,k}(X_t)\mid X_0=x_0\bigr].\tag{65}$$

###### Proof\.

Since $T_{t}T_{t}^{\ast}:\mathcal{H}_{t}\to\mathcal{H}_{t}$ is compact (Theorem [A.4](https://arxiv.org/html/2605.28900#A1.Thmtheorem4)) and self-adjoint, the Courant–Fischer theorem gives

$$\max_{\begin{subarray}{c}f_{1},\ldots,f_{K}\in\mathcal{H}_{t}\\ \langle f_{i},f_{j}\rangle_{\mu_{t}}=\delta_{ij}\end{subarray}}\sum_{k=1}^{K}\langle f_{k},\,T_{t}T_{t}^{\ast}\,f_{k}\rangle_{\mu_{t}}=\sum_{k=1}^{K}\lambda_{t,k},\tag{66}$$

attained when $\mathrm{span}\{f_{1},\ldots,f_{K}\}=\mathrm{span}\{\phi_{t,1},\ldots,\phi_{t,K}\}$. It remains to show that the Rayleigh quotient equals the cross-view correlation.

#### Expanding the Rayleigh quotient\.

For $f\in\mathcal{H}_{t}$, we unwind the operator composition $T_{t}T_{t}^{\ast}$ by first expanding the outer operator $T_{t}$ (expectation over $X_{0}\mid X_{t}$), then the inner operator $T_{t}^{\ast}$ (expectation over $\tilde{X}_{t}\mid X_{0}$):

$$\begin{aligned}
\langle f,\,T_{t}T_{t}^{\ast}f\rangle_{\mu_{t}}
&=\mathbb{E}_{X_{t}\sim p_{t}}\!\bigl[f(X_{t})\,(T_{t}T_{t}^{\ast}f)(X_{t})\bigr]\\
&=\mathbb{E}_{X_{t}\sim p_{t}}\!\Bigl[f(X_{t})\,\mathbb{E}\bigl[(T_{t}^{\ast}f)(X_{0})\mid X_{t}\bigr]\Bigr] &&\text{(expanding $T_{t}$)}\\
&=\mathbb{E}_{X_{t}\sim p_{t}}\!\Bigl[f(X_{t})\,\mathbb{E}\bigl[\mathbb{E}[f(\tilde{X}_{t})\mid X_{0}]\;\big|\;X_{t}\bigr]\Bigr] &&\text{(expanding $T_{t}^{\ast}$)}\\
&=\mathbb{E}_{X_{t},\,X_{0},\,\tilde{X}_{t}}\!\bigl[f(X_{t})\,f(\tilde{X}_{t})\bigr] &&\text{(tower property)}\\
&=\mathbb{E}_{(X_{t},\tilde{X}_{t})\sim\zeta}\!\bigl[f(X_{t})\,f(\tilde{X}_{t})\bigr],
\end{aligned}\tag{67}$$

where the last equality follows because the marginal over $(X_{t},\tilde{X}_{t})$, obtained by integrating out $X_{0}$, is precisely $\zeta$. Substituting ([67](https://arxiv.org/html/2605.28900#A1.E67)) into ([66](https://arxiv.org/html/2605.28900#A1.E66)) yields the variational problem ([64](https://arxiv.org/html/2605.28900#A1.E64)).

#### Recovery of the right singular functions\.

The SVD relation $T_{t}^{\ast}\,\phi_{t,k}=\sigma_{t,k}\,\psi_{t,k}$ (cf. Eq. ([51](https://arxiv.org/html/2605.28900#A1.E51))) gives $\psi_{t,k}=\sigma_{t,k}^{-1}\,T_{t}^{\ast}\,\phi_{t,k}$, and expanding the definition of $T_{t}^{\ast}$ yields ([65](https://arxiv.org/html/2605.28900#A1.E65)). ∎
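The identity between the Rayleigh quotient and the cross-view correlation can be checked numerically. The following is a minimal sketch, assuming finite state spaces for $X_0$ and $X_t$ with a randomly drawn transition kernel (a toy stand-in for the diffusion forward kernel, not the setting of the paper's experiments). All names (`P`, `sample_view`, `rayleigh`, `cross`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n0, nt, B = 5, 7, 200_000                # toy state-space sizes and sample count

p0 = rng.dirichlet(np.ones(n0))          # prior over X_0
P = rng.dirichlet(np.ones(nt), size=n0)  # P[x0, xt] = p_t(xt | x0)
pt = p0 @ P                              # marginal of X_t
f = rng.normal(size=nt)                  # arbitrary test function on X_t

# Exact Rayleigh quotient <f, T_t T_t^* f>_{mu_t}.
Tstar_f = P @ f                          # (T_t^* f)(x0) = E[f(X_t) | X_0 = x0]
post = (p0[:, None] * P) / pt[None, :]   # post[x0, xt] = p(x0 | xt)
TTstar_f = post.T @ Tstar_f              # (T_t T_t^* f)(xt)
rayleigh = np.sum(pt * f * TTstar_f)

# Monte Carlo cross-view correlation E_zeta[f(X_t) f(X~_t)].
cdf = np.cumsum(P, axis=1)

def sample_view(x0):
    """Draw one noisy view per clean sample by inverse-CDF sampling."""
    u = rng.random(x0.shape[0])
    idx = (u[:, None] > cdf[x0]).sum(axis=1)
    return np.minimum(idx, nt - 1)       # guard against round-off in the CDF

x0 = rng.choice(n0, size=B, p=p0)
xt, xt_tilde = sample_view(x0), sample_view(x0)
cross = np.mean(f[xt] * f[xt_tilde])
```

The two views share the same `x0` but use independent noise, which is exactly the sampling scheme that defines $\zeta$ in (63); `cross` should agree with `rayleigh` up to Monte Carlo error.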

###### Theorem A.9 (Variational characterization).

For any $f=(f_{1},\dots,f_{K})^{\top}$ with $\mathbb{E}_{p_{t}}[f_{k}]=0$, define the $K\times K$ covariance matrix

$$\boldsymbol{\Sigma}_{t}(f):=\mathbb{E}_{p_{t}}\left[f(X_{t})\,f(X_{t})^{\top}\right]\succ 0\tag{68}$$

and the cross-covariance

$$\mathbf{C}_{t}(f):=\mathbb{E}_{\zeta}\left[f(X_{t})\,f(\tilde{X}_{t})^{\top}\right].\tag{69}$$

Then

$$\max_{f}\;\mathrm{Tr}\!\left(\mathbf{C}_{t}(f)\,\boldsymbol{\Sigma}_{t}(f)^{-1}\right)\;=\;\sum_{k=2}^{K+1}\sigma_{t,k}^{2}\tag{70}$$

and any maximizer $f^{\star}$ satisfies $\mathrm{span}\{f^{\star}_{k}\}_{k=1}^{K}=\mathrm{span}\{\phi_{t,k}\}_{k=2}^{K+1}$.

###### Proof\.

For any valid $f$, let $g:=\boldsymbol{\Sigma}_{t}(f)^{-1/2}f$. Then $\mathbb{E}_{p_{t}}[gg^{\top}]=\mathbf{I}$, so $g$ satisfies the $L^{2}(\mu_{t})$ orthonormality constraint of Lemma [A.8](https://arxiv.org/html/2605.28900#A1.Thmtheorem8). Using cyclicity of the trace, a direct calculation gives

$$\operatorname{Tr}\!\left(\mathbf{C}_{t}(f)\,\boldsymbol{\Sigma}_{t}(f)^{-1}\right)=\operatorname{Tr}\!\left(\boldsymbol{\Sigma}_{t}(f)^{-1/2}\mathbf{C}_{t}(f)\,\boldsymbol{\Sigma}_{t}(f)^{-1/2}\right)=\operatorname{Tr}(\mathbf{C}_{t}(g))=\sum_{k=1}^{K}\mathbb{E}_{\zeta}\!\left[g_{k}(X_{t})\,g_{k}(\tilde{X}_{t})\right].\tag{71}$$

Conversely, any $g$ satisfying the orthonormality constraint is itself admissible in the unconstrained problem (with $\boldsymbol{\Sigma}_{t}(g)=\mathbf{I}$), and both objectives take the same value on such $g$. The map $f\mapsto g$ is therefore a bijection between admissible points of the two problems that preserves the objective, so they share the same supremum. Because the zero-mean condition $\mathbb{E}_{p_{t}}[f_{k}]=0$ restricts the problem to $\mathbf{1}^{\perp}$ and thereby excludes the constant mode $\phi_{t,1}$, Lemma [A.8](https://arxiv.org/html/2605.28900#A1.Thmtheorem8) applied on this subspace shows that the common supremum equals $\sum_{k=2}^{K+1}\sigma_{t,k}^{2}$ and is attained when $\{g_{k}\}_{k=1}^{K}$ is any orthonormal basis of $\mathrm{span}\{\phi_{t,k}\}_{k=2}^{K+1}$. Since $f=\boldsymbol{\Sigma}_{t}(f)^{1/2}g$ is an invertible linear transformation, $\mathrm{span}\{f_{k}^{\star}\}_{k=1}^{K}=\mathrm{span}\{g_{k}\}_{k=1}^{K}=\mathrm{span}\{\phi_{t,k}\}_{k=2}^{K+1}$, as claimed. ∎
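Theorem A.9 can be sanity-checked on a toy problem. The sketch below assumes finite state spaces with a random transition kernel, so that $T_t$ becomes a matrix and its SVD is available exactly. It is an illustration under these assumptions, not the neural estimator used in the paper; the names `objective`, `F_star`, and `F_rand` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n0, nt, K = 6, 8, 3

p0 = rng.dirichlet(np.ones(n0))              # prior over X_0
P = rng.dirichlet(np.ones(nt), size=n0)      # P[x0, xt] = p_t(xt | x0)
pt = p0 @ P                                  # marginal of X_t

# Matrix of T_t in orthonormal coordinates: N[x0, xt] = p(x0, xt) / sqrt(p0 * pt).
N = np.sqrt(p0)[:, None] * P / np.sqrt(pt)[None, :]
U, S, Vt = np.linalg.svd(N, full_matrices=False)
phi = Vt.T / np.sqrt(pt)[:, None]            # phi[:, k]: left singular function k+1

zeta = P.T @ (p0[:, None] * P)               # joint law of two noisy views, Eq. (63)

def objective(F):
    """Tr(C_t(f) Sigma_t(f)^{-1}) for features F of shape (nt, K)."""
    Sigma = F.T @ (pt[:, None] * F)
    C = F.T @ zeta @ F
    return np.trace(np.linalg.solve(Sigma, C))

optimum = np.sum(S[1:K + 1] ** 2)            # sum of sigma_{t,k}^2 for k = 2..K+1

# Any invertible mixing of phi_2..phi_{K+1} attains the optimum.
A = rng.normal(size=(K, K))
F_star = phi[:, 1:K + 1] @ A

# A random zero-mean feature map cannot exceed it.
F_rand = rng.normal(size=(nt, K))
F_rand = F_rand - pt @ F_rand                # enforce E_{p_t}[f_k] = 0
```

The objective is invariant to the mixing matrix `A`, which is why the theorem identifies only the span of the maximizer and not the individual singular functions.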

### A.4 Spectral Guidance

###### Proposition A.10 (Noisy posterior via the SVD).

Let $h\in L^{2}(\mu_{0})$. Then

$$\mathbb{E}_{p(\cdot\mid x_{t})}[h(X_{0})]=\sum_{k=1}^{\infty}c_{t,k}\,\phi_{t,k}(x_{t}),\qquad c_{t,k}:=\mathbb{E}_{(X_{0},X_{t})\sim p(x_{0},x_{t})}\!\bigl[h(X_{0})\,\phi_{t,k}(X_{t})\bigr].\tag{72}$$

###### Proof\.

Since $\mathbb{E}[h(X_{0})\mid X_{t}=x_{t}]=(T_{t}h)(x_{t})$, the SVD expansion gives

$$\mathbb{E}[h(X_{0})\mid X_{t}=x_{t}]=\sum_{k=1}^{\infty}\sigma_{t,k}\,\phi_{t,k}(x_{t})\,\langle h,\,\psi_{t,k}\rangle_{\mu_{0}}.\tag{73}$$

We rewrite the inner product using the SVD relation $\psi_{t,k}=\frac{1}{\sigma_{t,k}}\,T_{t}^{\ast}\phi_{t,k}$:

$$\begin{aligned}
\sigma_{t,k}\,\langle h,\,\psi_{t,k}\rangle_{\mu_{0}}
&=\sigma_{t,k}\cdot\frac{1}{\sigma_{t,k}}\,\langle h,\,T_{t}^{\ast}\phi_{t,k}\rangle_{\mu_{0}}\\
&=\mathbb{E}_{X_{0}\sim p_{0}}\!\bigl[h(X_{0})\,(T_{t}^{\ast}\phi_{t,k})(X_{0})\bigr]\\
&=\mathbb{E}_{X_{0}\sim p_{0}}\!\Bigl[h(X_{0})\,\mathbb{E}\bigl[\phi_{t,k}(X_{t})\mid X_{0}\bigr]\Bigr]\\
&=\mathbb{E}_{(X_{0},X_{t})\sim p(x_{0},x_{t})}\!\bigl[h(X_{0})\,\phi_{t,k}(X_{t})\bigr]=:c_{t,k},
\end{aligned}\tag{74}$$

where the third line expands the definition of $T_{t}^{\ast}$ and the fourth applies the tower property. Substituting into ([73](https://arxiv.org/html/2605.28900#A1.E73)) yields the result. ∎
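As a small illustration of Proposition A.10, the sketch below assumes finite state spaces and takes $h$ to be the indicator of a hypothetical class of clean states, so that $\mathbb{E}[h(X_0)\mid X_t]$ is a noisy class posterior. The coefficients $c_{t,k}$ are computed from the joint distribution alone, as in (72). The setup and names (`joint`, `expansion`, `posterior`) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n0, nt = 6, 9

p0 = rng.dirichlet(np.ones(n0))
P = rng.dirichlet(np.ones(nt), size=n0)      # P[x0, xt] = p_t(xt | x0)
joint = p0[:, None] * P                      # p(x0, xt)
pt = joint.sum(axis=0)

# Left singular functions of T_t from the normalized joint matrix.
N = joint / np.sqrt(p0)[:, None] / np.sqrt(pt)[None, :]
U, S, Vt = np.linalg.svd(N, full_matrices=False)
phi = Vt.T / np.sqrt(pt)[:, None]            # phi[:, k] is orthonormal in L^2(mu_t)

# Guidance signal: indicator of a hypothetical class made of states {0, 1, 2}.
h = (np.arange(n0) < 3).astype(float)

# c_{t,k} = E_{(X0, Xt)}[h(X0) phi_k(Xt)], Eq. (72).
c = phi.T @ (joint.T @ h)

expansion = phi @ c                          # sum_k c_{t,k} phi_k(x_t)
posterior = (joint.T @ h) / pt               # E[h(X0) | Xt = x_t] computed directly
```

In this finite setting the sum in (72) has at most `n0` nonzero terms, so `expansion` reproduces `posterior` exactly rather than approximately.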

We next quantify the error incurred by truncating the SVD expansion at rank $K$. Define the rank-$K$ approximation

$$(T_{t,K}h)(x_{t}):=\sum_{k=1}^{K}\sigma_{t,k}\,\phi_{t,k}(x_{t})\,\langle h,\,\psi_{t,k}\rangle_{\mu_{0}},\tag{75}$$

and write $\lambda_{t,k}:=\sigma_{t,k}^{2}$ for the eigenvalues of $T_{t}T_{t}^{\ast}$.

###### Proposition A.11 (SVD truncation error).

Let $T_{t,K}$ be the rank-$K$ approximation of $T_{t}$. For any $h\in L^{2}(\mu_{0})$,

$$\|T_{t}h-T_{t,K}h\|_{\mu_{t}}^{2}\;\leq\;\lambda_{t,K+1}\,\|h\|_{\mu_{0}}^{2}.\tag{76}$$

###### Proof\.

By orthonormality of $\{\phi_{t,k}\}_{k\geq 1}$ in $L^{2}(\mu_{t})$, the squared error is the tail of the expansion

$$\|T_{t}h-T_{t,K}h\|_{\mu_{t}}^{2}=\sum_{k=K+1}^{\infty}\sigma_{t,k}^{2}\,\langle h,\,\psi_{t,k}\rangle_{\mu_{0}}^{2}.\tag{77}$$

Since the singular values are ordered, $\sigma_{t,k}^{2}\leq\sigma_{t,K+1}^{2}=\lambda_{t,K+1}$ for all $k\geq K+1$. Factoring out the leading term and applying Bessel's inequality,

$$\sum_{k=K+1}^{\infty}\sigma_{t,k}^{2}\,\langle h,\,\psi_{t,k}\rangle_{\mu_{0}}^{2}\;\leq\;\lambda_{t,K+1}\sum_{k=K+1}^{\infty}\langle h,\,\psi_{t,k}\rangle_{\mu_{0}}^{2}\;\leq\;\lambda_{t,K+1}\,\|h\|_{\mu_{0}}^{2}.\tag{78}$$

∎
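The truncation bound (76) can be verified for every rank on a toy operator. This is a minimal sketch assuming finite state spaces and a random transition kernel; `gaps` (a hypothetical name) stores the slack between the bound and the actual squared error for each $K$.

```python
import numpy as np

rng = np.random.default_rng(2)
n0, nt = 8, 10

p0 = rng.dirichlet(np.ones(n0))
P = rng.dirichlet(np.ones(nt), size=n0)      # P[x0, xt] = p_t(xt | x0)
joint = p0[:, None] * P
pt = joint.sum(axis=0)

N = joint / np.sqrt(p0)[:, None] / np.sqrt(pt)[None, :]
U, S, Vt = np.linalg.svd(N, full_matrices=False)
phi = Vt.T / np.sqrt(pt)[:, None]            # left singular functions (on x_t)
psi = U / np.sqrt(p0)[:, None]               # right singular functions (on x_0)

h = rng.normal(size=n0)                      # arbitrary h in L^2(mu_0)
coef = psi.T @ (p0 * h)                      # <h, psi_k>_{mu_0}
Th = (joint.T @ h) / pt                      # (T_t h)(x_t) = E[h(X0) | Xt = x_t]
norm_h2 = np.sum(p0 * h ** 2)                # ||h||^2_{mu_0}

gaps = []
for K in range(1, n0):
    ThK = phi[:, :K] @ (S[:K] * coef[:K])    # rank-K approximation, Eq. (75)
    err2 = np.sum(pt * (Th - ThK) ** 2)      # squared L^2(mu_t) error
    bound = S[K] ** 2 * norm_h2              # lambda_{t,K+1} ||h||^2 (S is 0-indexed)
    gaps.append(bound - err2)
gaps = np.array(gaps)
```

Every entry of `gaps` should be nonnegative, and the slack shrinks when $h$ is concentrated on the $(K+1)$-th right singular function, which is the case where the bound is tight.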

### A.5 Finite Sample Error

###### Theorem A.12 (Finite-sample error of the variational objective).

Let $f:\mathcal{X}\to\mathbb{R}^{K}$ with $\mathbb{E}_{p_{t}}[f(X_{t})]=0$ and $\|f(X_{t})\|_{2}\leq M$ almost surely. Assume the population covariance $\boldsymbol{\Sigma}_{t}(f):=\mathbb{E}_{p_{t}}\!\left[f(X_{t})\,f(X_{t})^{\top}\right]$ satisfies $\boldsymbol{\Sigma}_{t}(f)\succeq\lambda\,\mathbf{I}$ for some $\lambda>0$. Given $B$ i.i.d. pairs $\bigl\{(X_{t}^{(i)},\tilde{X}_{t}^{(i)})\bigr\}_{i=1}^{B}\sim\zeta$, define the symmetrized empirical estimators

$$\hat{\mathbf{C}}_{t}(f):=\frac{1}{B}\sum_{i=1}^{B}\tfrac{1}{2}\!\left(f(X_{t}^{(i)})f(\tilde{X}_{t}^{(i)})^{\top}+f(\tilde{X}_{t}^{(i)})f(X_{t}^{(i)})^{\top}\right),\qquad\hat{\boldsymbol{\Sigma}}_{t}(f):=\frac{1}{B}\sum_{i=1}^{B}f(X_{t}^{(i)})f(X_{t}^{(i)})^{\top},\tag{79}$$

and the population and empirical objectives

$$J_{t}(f):=\operatorname{Tr}\!\bigl(\mathbf{C}_{t}(f)\,\boldsymbol{\Sigma}_{t}(f)^{-1}\bigr),\qquad\hat{J}_{t}(f):=\operatorname{Tr}\!\bigl(\hat{\mathbf{C}}_{t}(f)\,\hat{\boldsymbol{\Sigma}}_{t}(f)^{-1}\bigr).\tag{80}$$

Then for any $\delta\in(0,1)$ and any $B$ satisfying $B\geq 32\,M^{4}\log(4K/\delta)/\lambda^{2}$, with probability at least $1-\delta$,

$$\bigl|\hat{J}_{t}(f)-J_{t}(f)\bigr|\;\leq\;\frac{4\,K\,M^{4}}{\lambda^{2}}\sqrt{\frac{\log(4K/\delta)}{B}}\;+\;\frac{16\,K\,M^{4}\log(4K/\delta)}{3\,\lambda^{2}\,B}.\tag{81}$$

###### Proof\.

Let $\mathbf{H}_{i}:=\tfrac{1}{2}\bigl(f(X_{t}^{(i)})f(\tilde{X}_{t}^{(i)})^{\top}+f(\tilde{X}_{t}^{(i)})f(X_{t}^{(i)})^{\top}\bigr)$ and $\mathbf{S}_{i}:=f(X_{t}^{(i)})f(X_{t}^{(i)})^{\top}$, both self-adjoint $K\times K$ matrices. By exchangeability of $(X_{t}^{(i)},\tilde{X}_{t}^{(i)})$ under $\zeta$, $\mathbb{E}[\mathbf{H}_{i}]=\mathbf{C}_{t}(f)$; trivially $\mathbb{E}[\mathbf{S}_{i}]=\boldsymbol{\Sigma}_{t}(f)$. Define centered summands $\mathbf{Y}_{i}:=\mathbf{H}_{i}-\mathbf{C}_{t}(f)$ and $\mathbf{T}_{i}:=\mathbf{S}_{i}-\boldsymbol{\Sigma}_{t}(f)$.

Since $\|f\|_{2}\leq M$ a.s., $\|\mathbf{H}_{i}\|,\|\mathbf{S}_{i}\|\leq M^{2}$, hence $\|\mathbf{Y}_{i}\|,\|\mathbf{T}_{i}\|\leq 2M^{2}$. For the variance proxies,

$$\mathbb{E}[\mathbf{Y}_{i}^{2}]=\mathbb{E}[\mathbf{H}_{i}^{2}]-\mathbf{C}_{t}^{2}\preceq\mathbb{E}[\mathbf{H}_{i}^{2}],\tag{82}$$

since $\mathbf{C}_{t}^{2}\succeq 0$. By Jensen, $\|\mathbb{E}[\mathbf{H}_{i}^{2}]\|\leq\mathbb{E}\bigl[\|\mathbf{H}_{i}\|^{2}\bigr]\leq M^{4}$, so $\|\mathbb{E}[\mathbf{Y}_{i}^{2}]\|\leq M^{4}$; the same bound applies to $\mathbb{E}[\mathbf{T}_{i}^{2}]$.

#### Concentration of $\hat{\mathbf{C}}_{t}$ and $\hat{\boldsymbol{\Sigma}}_{t}$.

Applying matrix Bernstein (Tropp, [2015](https://arxiv.org/html/2605.28900#bib.bib61)) to $\sum_{i}\mathbf{Y}_{i}$ in its two-sided form, then to $\sum_{i}\mathbf{T}_{i}$, and union-bounding over both estimators (with $\delta/2$ each), gives that with probability at least $1-\delta$,

$$\max\!\bigl(\|\hat{\mathbf{C}}_{t}-\mathbf{C}_{t}\|,\;\|\hat{\boldsymbol{\Sigma}}_{t}-\boldsymbol{\Sigma}_{t}\|\bigr)\;\leq\;\varepsilon_{B}\;:=\;M^{2}\!\left(\sqrt{\frac{2\log(4K/\delta)}{B}}+\frac{4\log(4K/\delta)}{3B}\right).\tag{83}$$

#### Perturbation of the trace\.

Write $\Delta_{C}:=\hat{\mathbf{C}}_{t}-\mathbf{C}_{t}$ and $\Delta_{\Sigma}:=\hat{\boldsymbol{\Sigma}}_{t}-\boldsymbol{\Sigma}_{t}$. The hypothesis $B\geq 32\,M^{4}\log(4K/\delta)/\lambda^{2}$ ensures $\varepsilon_{B}\leq\lambda/2$, so $\|\boldsymbol{\Sigma}_{t}^{-1}\Delta_{\Sigma}\|\leq 1/2$. The Neumann expansion of $(\boldsymbol{\Sigma}_{t}+\Delta_{\Sigma})^{-1}$ yields

$$\hat{\boldsymbol{\Sigma}}_{t}^{-1}=\boldsymbol{\Sigma}_{t}^{-1}-\boldsymbol{\Sigma}_{t}^{-1}\Delta_{\Sigma}\boldsymbol{\Sigma}_{t}^{-1}+\mathbf{R},\qquad\|\mathbf{R}\|\leq\frac{2\,\varepsilon_{B}^{2}}{\lambda^{3}}.\tag{84}$$

Substituting and isolating leading terms,

$$\hat{\mathbf{C}}_{t}\hat{\boldsymbol{\Sigma}}_{t}^{-1}-\mathbf{C}_{t}\boldsymbol{\Sigma}_{t}^{-1}=\Delta_{C}\,\boldsymbol{\Sigma}_{t}^{-1}\;-\;\mathbf{C}_{t}\,\boldsymbol{\Sigma}_{t}^{-1}\,\Delta_{\Sigma}\,\boldsymbol{\Sigma}_{t}^{-1}\;+\;\mathbf{E},\tag{85}$$

where the residual $\mathbf{E}$ collects the second-order cross terms together with $(\mathbf{C}_{t}+\Delta_{C})\,\mathbf{R}$. Using $\|\Delta_{C}\|,\|\Delta_{\Sigma}\|\leq\varepsilon_{B}$ and $\|\mathbf{C}_{t}\|\leq M^{2}$ gives $\|\mathbf{E}\|\leq 4\,M^{2}\varepsilon_{B}^{2}/\lambda^{3}$.

Taking trace and applying $|\operatorname{Tr}(A)|\leq K\,\|A\|_{\mathrm{op}}$ together with submultiplicativity of the operator norm,

$$\bigl|\hat{J}_{t}(f)-J_{t}(f)\bigr|\;\leq\;K\!\left(\frac{\|\Delta_{C}\|}{\lambda}+\frac{\|\mathbf{C}_{t}\|\,\|\Delta_{\Sigma}\|}{\lambda^{2}}+\frac{4\,M^{2}\varepsilon_{B}^{2}}{\lambda^{3}}\right).\tag{86}$$

Since $\mathbf{C}_{t}=\mathbb{E}[\mathbf{H}_{i}]$ with $\|\mathbf{H}_{i}\|\leq M^{2}$, we have $\|\mathbf{C}_{t}\|\leq M^{2}$. Substituting ([83](https://arxiv.org/html/2605.28900#A1.E83)) into ([86](https://arxiv.org/html/2605.28900#A1.E86)),

$$\bigl|\hat{J}_{t}(f)-J_{t}(f)\bigr|\;\leq\;\frac{K\,\varepsilon_{B}}{\lambda}+\frac{K\,M^{2}\,\varepsilon_{B}}{\lambda^{2}}+\frac{4\,K\,M^{2}\,\varepsilon_{B}^{2}}{\lambda^{3}}\;\leq\;\frac{2\,K\,M^{2}\,\varepsilon_{B}}{\lambda^{2}}\!\left(1+\frac{2\,\varepsilon_{B}}{\lambda}\right),\tag{87}$$

where the second inequality uses $\lambda\leq M^{2}$ (which follows from $\boldsymbol{\Sigma}_{t}\preceq M^{2}\,\mathbf{I}$). Expanding $\varepsilon_{B}$ via ([83](https://arxiv.org/html/2605.28900#A1.E83)) and absorbing the higher-order contributions into the second term yields the bound ([81](https://arxiv.org/html/2605.28900#A1.E81)). ∎
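To make the quantities in Theorem A.12 concrete, the sketch below implements the symmetrized estimators (79), the empirical objective (80), and the right-hand side of (81). The demonstration uses a toy bounded channel chosen purely for illustration (an assumption, not the diffusion setting): $X_0$ is uniform on $\{-1,+1\}^2$, each view flips every coordinate independently with probability `q`, and $f$ is the identity. In that case $\boldsymbol{\Sigma}_t=\mathbf{I}$, $\mathbf{C}_t=(1-2q)^2\,\mathbf{I}$, $M=\sqrt{2}$, and $\lambda=1$. All function names are hypothetical.

```python
import numpy as np

def empirical_objective(F, F_tilde):
    """J_hat = Tr(C_hat Sigma_hat^{-1}) from paired features of shape (B, K)."""
    B = F.shape[0]
    C_hat = 0.5 * (F.T @ F_tilde + F_tilde.T @ F) / B   # symmetrized, Eq. (79)
    Sigma_hat = F.T @ F / B
    # Tr(C Sigma^{-1}) = Tr(Sigma^{-1} C); solve avoids forming the inverse.
    return np.trace(np.linalg.solve(Sigma_hat, C_hat))

def min_batch_size(K, M, lam, delta):
    """Smallest B allowed by the hypothesis of Theorem A.12."""
    return 32 * M ** 4 * np.log(4 * K / delta) / lam ** 2

def deviation_bound(K, M, lam, delta, B):
    """Right-hand side of Eq. (81)."""
    L = np.log(4 * K / delta)
    return (4 * K * M ** 4 / lam ** 2) * np.sqrt(L / B) \
        + 16 * K * M ** 4 * L / (3 * lam ** 2 * B)

# Toy bounded channel (illustrative assumption).
rng = np.random.default_rng(0)
K, q, B, delta = 2, 0.2, 20_000, 0.05
M, lam = np.sqrt(2.0), 1.0

x0 = rng.choice([-1.0, 1.0], size=(B, K))
flip = lambda: np.where(rng.random((B, K)) < q, -1.0, 1.0)
F, F_tilde = x0 * flip(), x0 * flip()        # two independent noisy views

J_hat = empirical_objective(F, F_tilde)
J_true = K * (1 - 2 * q) ** 2                # Tr(C_t Sigma_t^{-1}) in closed form
bound = deviation_bound(K, M, lam, delta, B)
```

In this toy case the observed deviation `abs(J_hat - J_true)` is typically far below `bound`, which is expected since the theorem is a worst-case guarantee over all bounded feature maps.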

### A.6 Eigenfunctions for Tractable Priors

###### Proposition A.13 (Eigenfunctions of $T_{t}T_{t}^{\ast}$ for Gaussian priors).

Let $X_{0}\sim\mathcal{N}(\mathbf{0},\boldsymbol{\Sigma})$ with eigendecomposition $\boldsymbol{\Sigma}=\sum_{k=1}^{d}\rho_{k}\,\mathbf{u}_{k}\mathbf{u}_{k}^{\top}$, $\rho_{1}\geq\cdots\geq\rho_{d}>0$, and let the forward process be $X_{t}=a_{t}\,X_{0}+b_{t}\,\epsilon$ with $\epsilon\sim\mathcal{N}(\mathbf{0},\mathbf{I})$ (in the DDPM parameterization, $a_{t}=\sqrt{\bar{\alpha}_{t}}$ and $b_{t}=\sqrt{1-\bar{\alpha}_{t}}$). Then the linear functionals

$$\phi_{t,k}(\mathbf{x}_{t}):=\mathbf{u}_{k}^{\top}\mathbf{x}_{t},\qquad k=1,\ldots,d,\tag{88}$$

are eigenfunctions of the covariance operator $T_{t}T_{t}^{\ast}$, with eigenvalues

$$\lambda_{t,k}=\frac{a_{t}^{2}\,\rho_{k}}{a_{t}^{2}\,\rho_{k}+b_{t}^{2}}.\tag{89}$$

###### Proof\.

Since $\phi_{t,k}$ is linear, both conditional expectations in the composition $T_{t}T_{t}^{\ast}$ reduce to matrix operations.

#### Inner step ($T_{t}^{\ast}$).

Applying $T_{t}^{\ast}$ to $\phi_{t,k}$ and using linearity of conditional expectation,

$$(T_{t}^{\ast}\phi_{t,k})(\mathbf{x}_{0})=\mathbb{E}\bigl[\mathbf{u}_{k}^{\top}X_{t}\mid X_{0}=\mathbf{x}_{0}\bigr]=\mathbf{u}_{k}^{\top}\mathbb{E}[X_{t}\mid X_{0}=\mathbf{x}_{0}]=a_{t}\,\mathbf{u}_{k}^{\top}\mathbf{x}_{0}.\tag{90}$$

#### Outer step ($T_{t}$).

Applying $T_{t}$ to the result and using the standard Gaussian conditioning identity $\mathbb{E}[X_{0}\mid X_{t}=\mathbf{x}_{t}]=a_{t}\,\boldsymbol{\Sigma}\,(a_{t}^{2}\,\boldsymbol{\Sigma}+b_{t}^{2}\,\mathbf{I})^{-1}\mathbf{x}_{t}$,

$$\begin{aligned}
(T_{t}T_{t}^{\ast}\phi_{t,k})(\mathbf{x}_{t})
&=\mathbb{E}\bigl[a_{t}\,\mathbf{u}_{k}^{\top}X_{0}\mid X_{t}=\mathbf{x}_{t}\bigr]\\
&=a_{t}^{2}\,\mathbf{u}_{k}^{\top}\boldsymbol{\Sigma}\,(a_{t}^{2}\,\boldsymbol{\Sigma}+b_{t}^{2}\,\mathbf{I})^{-1}\,\mathbf{x}_{t}.
\end{aligned}\tag{91}$$

Since $\mathbf{u}_{k}$ is an eigenvector of both $\boldsymbol{\Sigma}$ (with eigenvalue $\rho_{k}$) and $(a_{t}^{2}\,\boldsymbol{\Sigma}+b_{t}^{2}\,\mathbf{I})^{-1}$ (with eigenvalue $(a_{t}^{2}\,\rho_{k}+b_{t}^{2})^{-1}$), this simplifies to

$$(T_{t}T_{t}^{\ast}\phi_{t,k})(\mathbf{x}_{t})=\frac{a_{t}^{2}\,\rho_{k}}{a_{t}^{2}\,\rho_{k}+b_{t}^{2}}\;\mathbf{u}_{k}^{\top}\mathbf{x}_{t}=\lambda_{t,k}\,\phi_{t,k}(\mathbf{x}_{t}).\tag{92}$$

∎
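The eigen-relation of Proposition A.13 reduces to a statement about matrices and can be checked exactly. The sketch below assumes the geometric spectrum $\rho_k = 40\cdot 0.7^{k-1}$ in $d=20$ dimensions, a random orthonormal eigenbasis, and the hypothetical noise level $\bar{\alpha}_t = 0.5$. By (91), $T_tT_t^{\ast}\phi_{t,k}$ is the linear functional with weight vector $a_t^2(a_t^2\boldsymbol{\Sigma}+b_t^2\mathbf{I})^{-1}\boldsymbol{\Sigma}\,\mathbf{u}_k$, which should equal $\lambda_{t,k}\mathbf{u}_k$.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 20
rho = 40.0 * 0.7 ** np.arange(d)                 # covariance eigenvalues rho_1..rho_d

# Random orthonormal eigenvectors u_k (columns of Q), for illustration.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
Sigma = (Q * rho) @ Q.T                          # Sigma = sum_k rho_k u_k u_k^T

abar = 0.5                                       # hypothetical value of alpha_bar_t
a2, b2 = abar, 1.0 - abar                        # a_t^2 and b_t^2 in DDPM form

# Weight vectors of T_t T_t^* phi_k for all k at once (column k <-> u_k).
W = a2 * np.linalg.solve(a2 * Sigma + b2 * np.eye(d), Sigma @ Q)

lam = a2 * rho / (a2 * rho + b2)                 # closed-form eigenvalues, Eq. (89)
```

Column `k` of `W` should coincide with `lam[k] * Q[:, k]`. Every eigenvalue lies strictly between 0 and 1 and inherits the ordering of the `rho` values.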

###### Proposition A.14 (Eigenfunctions for the uniform prior on $S^{1}$).

Let $p_{0}$ be the uniform distribution on the unit circle $S^{1}\subset\mathbb{R}^{2}$, parameterized as $\mathbf{x}_{0}(\theta)=(\cos\theta,\,\sin\theta)$ for $\theta\in[0,2\pi)$, and let $p(\mathbf{x}_{t}\mid\mathbf{x}_{0})=\mathcal{N}\!\bigl(\mathbf{x}_{t};\,\sqrt{\bar{\alpha}_{t}}\,\mathbf{x}_{0},\,(1-\bar{\alpha}_{t})\,\mathbf{I}_{2}\bigr)$. Then the eigenfunctions of $T_{t}^{\ast}T_{t}$ (equivalently, the right singular functions of $T_{t}$), expressed in angle coordinates, are the Fourier modes

$$\psi_{t,1}(\theta)=1,\qquad\psi_{t,2n}(\theta)=\sqrt{2}\,\cos(n\theta),\qquad\psi_{t,2n+1}(\theta)=\sqrt{2}\,\sin(n\theta),\qquad n=1,2,\ldots\tag{93}$$

###### Proof\.

We show that the kernel of $T_{t}^{\ast}T_{t}$ depends only on the angular difference, which identifies it as a convolution operator on $S^{1}$ and fixes the eigenfunctions as Fourier modes.

#### Kernel of $T_{t}^{\ast}T_{t}$.

In angle coordinates, the operator $T_{t}^{\ast}T_{t}:L^{2}(\mu_{0})\to L^{2}(\mu_{0})$ is an integral operator with kernel

$$k(\theta,\theta')=\int_{\mathbb{R}^{2}}\frac{p\bigl(\mathbf{x}_{t}\mid\mathbf{x}_{0}(\theta)\bigr)\,p\bigl(\mathbf{x}_{t}\mid\mathbf{x}_{0}(\theta')\bigr)}{p_{t}(\mathbf{x}_{t})}\,d\mathbf{x}_{t}.\tag{94}$$

#### Rotational invariance\.

We claim that $k(\theta+\alpha,\,\theta'+\alpha)=k(\theta,\theta')$ for every $\alpha$, so that $k(\theta,\theta')=\kappa_{t}(\theta-\theta')$. Let $\mathbf{R}_{\alpha}\in\operatorname{SO}(2)$ denote rotation by angle $\alpha$ in $\mathbb{R}^{2}$. Since $\mathbf{x}_{0}(\theta+\alpha)=\mathbf{R}_{\alpha}\,\mathbf{x}_{0}(\theta)$ by definition of the angle parameterization, the change of variables $\mathbf{x}_{t}=\mathbf{R}_{\alpha}\,\mathbf{u}$ (with $d\mathbf{x}_{t}=d\mathbf{u}$, as rotations preserve Lebesgue measure) gives

$$k(\theta+\alpha,\,\theta'+\alpha)=\int_{\mathbb{R}^{2}}\frac{p\bigl(\mathbf{R}_{\alpha}\mathbf{u}\mid\mathbf{R}_{\alpha}\mathbf{x}_{0}(\theta)\bigr)\;p\bigl(\mathbf{R}_{\alpha}\mathbf{u}\mid\mathbf{R}_{\alpha}\mathbf{x}_{0}(\theta')\bigr)}{p_{t}(\mathbf{R}_{\alpha}\mathbf{u})}\,d\mathbf{u}.\tag{95}$$

We verify each factor separately.

*Forward kernel.* Since $\mathbf{R}_{\alpha}$ is an isometry, $\|\mathbf{R}_{\alpha}\mathbf{u}-\sqrt{\bar{\alpha}_{t}}\,\mathbf{R}_{\alpha}\mathbf{v}\|=\|\mathbf{u}-\sqrt{\bar{\alpha}_{t}}\,\mathbf{v}\|$, and therefore

$$p(\mathbf{R}_{\alpha}\mathbf{u}\mid\mathbf{R}_{\alpha}\mathbf{x}_{0}(\theta))=\mathcal{N}\!\bigl(\mathbf{R}_{\alpha}\mathbf{u};\,\sqrt{\bar{\alpha}_{t}}\,\mathbf{R}_{\alpha}\mathbf{x}_{0}(\theta),\,(1-\bar{\alpha}_{t})\,\mathbf{I}_{2}\bigr)=p\bigl(\mathbf{u}\mid\mathbf{x}_{0}(\theta)\bigr).\tag{96}$$

*Marginal.* Using the isometry property and the uniformity of $p_{0}=1/(2\pi)$,

$$\begin{aligned}
p_{t}(\mathbf{R}_{\alpha}\mathbf{u})
&=\frac{1}{2\pi}\int_{0}^{2\pi}p\bigl(\mathbf{R}_{\alpha}\mathbf{u}\mid\mathbf{x}_{0}(\theta'')\bigr)\,d\theta''
=\frac{1}{2\pi}\int_{0}^{2\pi}p\bigl(\mathbf{R}_{\alpha}\mathbf{u}\mid\mathbf{R}_{\alpha}\,\mathbf{x}_{0}(\theta''-\alpha)\bigr)\,d\theta''\\
&=\frac{1}{2\pi}\int_{0}^{2\pi}p\bigl(\mathbf{u}\mid\mathbf{x}_{0}(\theta''-\alpha)\bigr)\,d\theta''=p_{t}(\mathbf{u}),
\end{aligned}\tag{97}$$

where the last step uses $2\pi$-periodicity of the integrand.

Substituting into ([95](https://arxiv.org/html/2605.28900#A1.E95)),

$$k(\theta+\alpha,\,\theta'+\alpha)=\int_{\mathbb{R}^{2}}\frac{p\bigl(\mathbf{u}\mid\mathbf{x}_{0}(\theta)\bigr)\,p\bigl(\mathbf{u}\mid\mathbf{x}_{0}(\theta')\bigr)}{p_{t}(\mathbf{u})}\,d\mathbf{u}=k(\theta,\theta').\tag{98}$$

Hence $k(\theta,\theta')=\kappa_{t}(\theta-\theta')$, with $\kappa_{t}$ defined by

$$\kappa_{t}(\Delta\theta):=\int_{\mathbb{R}^{2}}\frac{p\bigl(\mathbf{x}_{t}\mid\mathbf{x}_{0}(0)\bigr)\,p\bigl(\mathbf{x}_{t}\mid\mathbf{x}_{0}(\Delta\theta)\bigr)}{p_{t}(\mathbf{x}_{t})}\,d\mathbf{x}_{t}.\tag{99}$$

#### Convolution structure and eigenfunctions\.

Since $k(\theta,\theta')=\kappa_{t}(\theta-\theta')$, the operator $T_{t}^{\ast}T_{t}$ acts on $L^{2}(S^{1})$ as circular convolution:

$$(T_{t}^{\ast}T_{t}\,f)(\theta)=\int_{0}^{2\pi}\kappa_{t}(\theta-\theta')\,f(\theta')\,\frac{d\theta'}{2\pi}.\tag{100}$$

By the spectral theorem for convolution operators on $S^{1}$, the eigenfunctions are the Fourier modes $\{1,\,\cos(n\theta),\,\sin(n\theta)\}_{n\geq 1}$. Each $\cos(n\theta)$ and $\sin(n\theta)$ share the same eigenvalue $\lambda_{t,n}$ because $\kappa_{t}$ is real and even (the latter following from $k(\theta,\theta')=k(\theta',\theta)$ by symmetry of ([94](https://arxiv.org/html/2605.28900#A1.E94))). The $\sqrt{2}$ normalization in ([93](https://arxiv.org/html/2605.28900#A1.E93)) ensures $\|\psi_{t,k}\|_{\mu_{0}}=1$. ∎
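The convolution structure can be observed numerically by discretizing the kernel (94). The sketch below is an illustration under assumptions: the prior is replaced by `n` equally spaced angles, the integral over $\mathbb{R}^2$ by a polar quadrature whose angular nodes are a multiple of `n` (so the rotation by $2\pi/n$ maps the grid onto itself), and $\bar{\alpha}_t = 0.5$ is a hypothetical noise level. The names `kern`, `op`, and `mode_check` are not from the source.

```python
import numpy as np

abar = 0.5                        # hypothetical value of alpha_bar_t
n, dr = 32, 0.02                  # number of prior angles; radial step
m = 2 * n                         # angular quadrature nodes (a multiple of n)

theta = 2 * np.pi * np.arange(n) / n
x0 = np.stack([np.cos(theta), np.sin(theta)], axis=1)          # points on S^1

# Polar quadrature for x_t: midpoints in r on (0, 6), uniform angles.
r = (np.arange(300) + 0.5) * dr
ang = 2 * np.pi * np.arange(m) / m
R, A = np.meshgrid(r, ang, indexing="ij")
xt = np.stack([R * np.cos(A), R * np.sin(A)], axis=-1).reshape(-1, 2)
w = (R * dr * 2 * np.pi / m).reshape(-1)                       # weights r dr dphi

# lik[i, q] = p(x_t^q | x_0(theta_i)); pt is the marginal under the discrete prior.
d2 = ((xt[None, :, :] - np.sqrt(abar) * x0[:, None, :]) ** 2).sum(axis=-1)
lik = np.exp(-d2 / (2 * (1 - abar))) / (2 * np.pi * (1 - abar))
pt = lik.mean(axis=0)

kern = (lik * (w / pt)) @ lik.T   # k(theta_i, theta_j), Eq. (94)
op = kern / n                     # discretized T_t^* T_t (integration against dtheta'/2pi)

def mode_check(v):
    """Rayleigh quotient of v and the norm of the eigen-residual op v - lam v."""
    lam = v @ op @ v / (v @ v)
    return lam, np.linalg.norm(op @ v - lam * v)

lam_const, res_const = mode_check(np.ones(n))
lam_cos, res_cos = mode_check(np.sqrt(2) * np.cos(2 * theta))
lam_sin, res_sin = mode_check(np.sqrt(2) * np.sin(2 * theta))
```

Because the discretization respects the rotation symmetry, `kern` is circulant, so the sampled Fourier modes are eigenvectors up to floating-point error. The constant mode has eigenvalue close to 1, and the cosine and sine of the same frequency share an eigenvalue.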

## Appendix B Illustrative Examples

### B.1 Gaussian Prior

We demonstrate our spectral learning framework on a centered Gaussian prior, where Proposition [A.13](https://arxiv.org/html/2605.28900#A1.Thmtheorem13) gives closed-form expressions for both the eigenvalues and eigenfunctions of $T_{t}T_{t}^{\ast}$, allowing direct comparison with our estimates.

#### Setup\.

The prior $p_{0}$ is a centered Gaussian on $\mathbb{R}^{20}$ whose covariance spectrum has a geometric decay $\{40\cdot 0.7^{k-1}\}_{k=1}^{20}$. The forward process follows a linear DDPM schedule with $T=1000$ timesteps. We train the eigenfunction network $f_{\phi}$ (architecture and loss as described in §[4.3](https://arxiv.org/html/2605.28900#S4.SS3), with ridge $\xi=0.001$) to recover the leading $K=3$ non-trivial eigenfunctions of $T_{t}T_{t}^{\ast}$, excluding the constant mode (the eigenfunction $\phi_{t,1}\equiv 1$ with eigenvalue $1$).
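The ground-truth curves for this setup follow directly from the closed form (89). The sketch below computes them along the trajectory, assuming the common linear-schedule endpoints $\beta_1 = 10^{-4}$ and $\beta_T = 0.02$. These endpoints are an assumption, since the text specifies only that the schedule is linear with $T=1000$.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)        # assumed linear DDPM schedule endpoints
abar = np.cumprod(1.0 - betas)            # alpha_bar_t for t = 1..T

rho = 40.0 * 0.7 ** np.arange(20)         # covariance spectrum of the prior

# lambda[t, k] = a_t^2 rho_k / (a_t^2 rho_k + b_t^2), with a_t^2 = alpha_bar_t
# and b_t^2 = 1 - alpha_bar_t (Proposition A.13).
signal = abar[:, None] * rho[None, :]
lam = signal / (signal + (1.0 - abar)[:, None])

# Leading three non-trivial eigenvalues (indexed 2, 3, 4 once the constant
# mode with eigenvalue 1 is counted first).
top3 = lam[:, :3]
```

Under this schedule each curve starts near 1 at small $t$ and decays monotonically toward 0, which is the regime in which the eigengap is small early and the spectrum collapses late in the trajectory.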

#### Results\.

Fig. [11(a)](https://arxiv.org/html/2605.28900#A2.F11.sf1) reports four diagnostics over diffusion time. Panel (a) shows the leading three ground-truth eigenvalues $\lambda_{t,2},\lambda_{t,3},\lambda_{t,4}$ from Eq. ([89](https://arxiv.org/html/2605.28900#A1.E89)), and panel (b) the corresponding Monte-Carlo estimates produced by $f_{\phi}$; the two sets of curves match closely along the entire trajectory. Panel (c) plots the absolute residuals, which remain below $0.06$ and decay as the eigenvalues approach zero at large $t$. Panel (d) shows the mean cosine of the principal angles between the true leading eigenspace and our estimate: it rises from $0.97$ at small $t$ to essentially $1$ at large $t$. This is consistent with the structure of the operator: at small $t$, $T_{t}T_{t}^{\ast}$ is close to identity and the eigengap is small, so the leading eigenspace is weakly identified, while over the second half of the trajectory the cosine stays above $0.995$, indicating near-perfect subspace recovery.

(a) Spectral recovery on a centered Gaussian prior in $\mathbb{R}^{20}$. (a) Closed-form ground-truth eigenvalues of $T_{t}T_{t}^{\ast}$ from Proposition [A.13](https://arxiv.org/html/2605.28900#A1.Thmtheorem13). (b) Monte-Carlo estimates produced by $f_{\phi}$. (c) Absolute residuals between (a) and (b). (d) Mean cosine of the principal angles between the true and estimated leading $K=3$ eigenspaces.

### B.2 Eigenfunction Visualization

To build intuition for the singular functions learned by the spectral framework, we visualize the right singular functions ofTtT\_\{t\}on four simple priors: the unit circle𝒮1⊂ℝ2\\mathcal\{S\}^\{1\}\\subset\\mathbb\{R\}^\{2\}, the uniform square\[−0\.5,0\.5\]2\[\-0\.5,0\.5\]^\{2\}, the unit disk\{x∈ℝ2:‖x‖≤1\}\\\{x\\in\\mathbb\{R\}^\{2\}:\\\|x\\\|\\leq 1\\\}, and an annulus\{x∈ℝ2:rin≤‖x‖≤rout\}\\\{x\\in\\mathbb\{R\}^\{2\}:r\_\{\\mathrm\{in\}\}\\leq\\\|x\\\|\\leq r\_\{\\mathrm\{out\}\}\\\}\. In all cases, we train the eigenfunction network as described in §[4\.3](https://arxiv.org/html/2605.28900#S4.SS3)and visualize the learned right singular functionsψt,k\(x0\)\\psi\_\{t,k\}\(x\_\{0\}\)by coloring each samplex0x\_\{0\}from the prior according to the function valueψt,k\(x0\)\\psi\_\{t,k\}\(x\_\{0\}\)\.

For the circle prior, the support is a one-dimensional manifold, and the right singular functions of $T_t$ are the Fourier basis (Proposition [A.14](https://arxiv.org/html/2605.28900#A1.Thmtheorem14)). The learned modes, shown in the circle row of the figure below, recover this structure: $\psi_{t,2}$ exhibits two nodal points, and subsequent pairs display progressively higher-frequency oscillations along the circle (the constant eigenfunction is not shown). Modes appear in degenerate pairs, consistent with the rotational symmetry of the prior.
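As a small illustration of the degenerate pairs, the sketch below evaluates the non-constant Fourier modes on a uniform grid of the circle. The normalization, the ordering of cosine before sine, and the name `fourier_modes` are assumptions made for this example; learned modes would only match such a basis up to a rotation within each pair.

```python
import numpy as np

def fourier_modes(theta, num_pairs):
    """Non-constant Fourier modes on the circle, ordered by frequency."""
    modes = []
    for k in range(1, num_pairs + 1):
        # Each frequency k contributes a (cos, sin) pair with unit L2 norm
        # under the uniform distribution on the circle.
        modes.append(np.sqrt(2.0) * np.cos(k * theta))
        modes.append(np.sqrt(2.0) * np.sin(k * theta))
    return np.stack(modes, axis=1)

theta = np.linspace(0.0, 2.0 * np.pi, 4096, endpoint=False)
Psi = fourier_modes(theta, num_pairs=3)

# Empirical Gram matrix under the uniform prior: the modes are orthonormal.
gram = Psi.T @ Psi / theta.size

# Rotating the circle by delta mixes the two modes of a pair with each other.
delta = 0.3
Psi_rot = fourier_modes(theta + delta, num_pairs=3)
mixed = np.cos(delta) * Psi[:, 0] - np.sin(delta) * Psi[:, 1]
```

The last two lines show why the pairs are degenerate: a rotated mode stays inside the two-dimensional span of its pair, so any orthonormal basis of that span is an equally valid pair of singular functions.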

For the square prior, the learned modes, shown in the square row of the figure below, recover axis-aligned oscillatory patterns of increasing spatial frequency, akin to the Fourier basis.

For the disk prior, the support is a two-dimensional manifold with a circular boundary. The learned modes, shown in the disk row of the figure below, separate into radial and angular components: angular oscillations along $\theta$ are modulated by a radial envelope. The rotational symmetry of the prior again yields degenerate pairs of angular modes.

For the annulus prior, the learned modes, shown in the annulus row of the figure below, again factor into radial and angular components, but the inner boundary modifies the radial envelope relative to the disk, and the angular modes wrap continuously around the hole. Rotational symmetry continues to enforce degenerate pairs.

In all four cases, low\-frequency \(spatially smooth\) modes appear first, corresponding to large singular values, while high\-frequency modes appear later with diminishing singular values\.

(b) Learned singular functions on simple manifolds ($t=100$). Each row shows the leading $8$ right singular functions $\psi_{t,k}$, with samples from the prior colored by function value. On the unit circle, the modes recover the Fourier basis on $\mathcal{S}^1$ (Proposition [A.14](https://arxiv.org/html/2605.28900#A1.Thmtheorem14)). On the uniform square, they recover axis-aligned oscillations of increasing spatial frequency. On the disk, modes factor into a radial envelope and angular harmonics. On the annulus, the additional inner boundary modifies the radial envelope and the angular modes wrap around the hole. In all cases, spatially smooth modes appear first, and rotationally symmetric priors yield degenerate angular pairs.

## Appendix C Experiments

### C.1 Training Setup

We employed a consistent backbone architecture across all datasets. The networks $f_\phi$ are parameterized as ResNets, incorporating time modulation via FiLM blocks to condition on the diffusion timestep.
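For readers unfamiliar with FiLM, the following is a minimal NumPy sketch of feature-wise linear modulation by a timestep embedding. It is not the paper's architecture: the sinusoidal embedding, the single linear layer, the "scale around identity" parameterization, and all names and sizes are assumptions chosen for illustration.

```python
import numpy as np

def timestep_embedding(t, dim):
    """Sinusoidal embedding of a scalar timestep (an assumed choice)."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.sin(args), np.cos(args)])

class FiLM:
    """Feature-wise linear modulation: h -> gamma(t) * h + beta(t)."""

    def __init__(self, emb_dim, num_channels, rng):
        # One linear layer maps the time embedding to per-channel (gamma, beta).
        self.W = 0.02 * rng.standard_normal((emb_dim, 2 * num_channels))
        self.b = np.zeros(2 * num_channels)
        self.num_channels = num_channels

    def __call__(self, h, t_emb):
        # h: feature map of shape (C, H, W); t_emb: vector of shape (emb_dim,).
        params = t_emb @ self.W + self.b
        gamma = 1.0 + params[: self.num_channels]  # scale, centered at identity
        beta = params[self.num_channels :]         # shift
        return gamma[:, None, None] * h + beta[:, None, None]

rng = np.random.default_rng(0)
film = FiLM(emb_dim=64, num_channels=8, rng=rng)
h = rng.standard_normal((8, 4, 4))
t_emb = timestep_embedding(100, 64)
out = film(h, t_emb)
```

In a ResNet, such a block would typically be applied to the feature maps inside each residual block, so that the same convolutional weights can behave differently at different noise levels.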

#### Datasets and base models\.

- CIFAR-10. We trained $f_\phi$ with an output dimension of $K=512$ on the standard train split, comprising 50,000 $32\times 32$ images across 10 classes. For the diffusion backbone, we used the pre-trained unconditional DDPM pipeline `google/ddpm-cifar10-32`, available on HuggingFace.
- CelebA-HQ. We trained $f_\phi$ with an output dimension of $K=512$ on the full dataset of 30,000 images, resized to $256\times 256$. We used the pre-trained unconditional DDPM pipeline `google/ddpm-ema-celebahq-256`, available on HuggingFace.
- ImageNet. We trained $f_\phi$ with an output dimension of $K=2000$ on the training split comprising 1,281,167 images, resized to $64\times 64$. We used a $64\times 64$ improved-diffusion unconditional DDPM pipeline available on GitHub ([https://github.com/openai/improved-diffusion](https://github.com/openai/improved-diffusion)).

#### Optimization\.

Training was performed using the Adam optimizer, with the loss described in Algorithm [1](https://arxiv.org/html/2605.28900#alg1). We used an initial learning rate of $10^{-4}$, which was decayed exponentially by a factor of $0.995$ every epoch. The ridge parameter used to compute the batch whitening matrices $\mathbf{W}_t$ was set to $\xi=0.001$ in all experiments. Training on CIFAR-10 and CelebA-HQ was conducted on a single NVIDIA GPU with 48GB of memory (batch size of 2048), while ImageNet training was performed on 4 NVIDIA GPUs with 48GB of memory each (batch size of 4096 per GPU).
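The two numerical settings above can be sketched as follows. The learning-rate rule follows directly from the stated decay. The whitening function is only an assumed form, $\mathbf{W} = (\mathbf{F}^\top \mathbf{F}/B + \xi \mathbf{I})^{-1/2}$ for a batch of features $\mathbf{F}$; the exact definition of $\mathbf{W}_t$ is given by Algorithm 1 and may differ. All function names are hypothetical.

```python
import numpy as np

def learning_rate(epoch, base_lr=1e-4, decay=0.995):
    """Exponential decay: multiply the learning rate by `decay` once per epoch."""
    return base_lr * decay ** epoch

def ridge_whitening(features, xi=1e-3):
    """Assumed form of a ridge-regularized whitening matrix.

    features: array of shape (B, K), one row of network outputs per sample.
    Returns the symmetric matrix (F^T F / B + xi * I)^(-1/2).
    """
    B, K = features.shape
    cov = features.T @ features / B
    evals, evecs = np.linalg.eigh(cov + xi * np.eye(K))
    # V diag(evals^-1/2) V^T; the ridge keeps every eigenvalue at least xi.
    return (evecs / np.sqrt(evals)) @ evecs.T

rng = np.random.default_rng(0)
F = rng.standard_normal((2048, 16))          # toy batch, not real features
W = ridge_whitening(F, xi=1e-3)
whitened_cov = W @ (F.T @ F / F.shape[0]) @ W
whitened_evals = np.linalg.eigvalsh(whitened_cov)
```

Under this assumed form, the whitened covariance has eigenvalues $\lambda/(\lambda+\xi)$, so well-conditioned directions are mapped close to $1$ while near-degenerate directions are damped rather than amplified, which is the usual motivation for adding a ridge term.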

### C.2 Label Guidance

#### CIFAR\-10\.

We generated 2,650 samples per class with a DDIM sampler (100 timesteps, $\eta=1.0$). We used a guidance strength of $\kappa=10$ for Spectral Guidance, and the implementations and hyperparameters of all baselines from Ye et al. ([2024](https://arxiv.org/html/2605.28900#bib.bib34)). Guidance was evaluated across all 10 classes using an external ConvNeXt classifier ([https://huggingface.co/ahsanjavid/convnext-tiny-finetuned-cifar10](https://huggingface.co/ahsanjavid/convnext-tiny-finetuned-cifar10)). We report the average top-1 accuracy.
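To make the sampler setting concrete, the sketch below computes the timestep subsequence and the per-step noise scale of a DDIM sampler, using the standard DDIM expression $\sigma_t(\eta) = \eta \sqrt{(1-\bar\alpha_{t'})/(1-\bar\alpha_t)}\,\sqrt{1-\bar\alpha_t/\bar\alpha_{t'}}$, where $t'$ is the next (smaller) timestep. The linear beta schedule from $10^{-4}$ to $0.02$ over 1000 training steps and the uniform stride are assumptions, not values stated in this section.

```python
import numpy as np

def ddim_schedule(num_train_steps=1000, num_sample_steps=100, eta=1.0,
                  beta_start=1e-4, beta_end=0.02):
    """Timesteps and noise scales for a strided DDIM sampler."""
    # Assumed linear beta schedule (the common DDPM default).
    betas = np.linspace(beta_start, beta_end, num_train_steps)
    alphas_cumprod = np.cumprod(1.0 - betas)

    # Uniformly strided timesteps in descending order, e.g. 990, 980, ..., 0.
    stride = num_train_steps // num_sample_steps
    timesteps = np.arange(0, num_train_steps, stride)[::-1]

    a_t = alphas_cumprod[timesteps]
    # Cumulative alpha at the next (smaller) timestep; it is 1 after the final step.
    a_prev = np.append(alphas_cumprod[timesteps[1:]], 1.0)

    sigmas = eta * np.sqrt((1.0 - a_prev) / (1.0 - a_t)) * np.sqrt(1.0 - a_t / a_prev)
    return timesteps, sigmas

timesteps, sigmas_stochastic = ddim_schedule(eta=1.0)
_, sigmas_deterministic = ddim_schedule(eta=0.0)
```

With $\eta=1.0$, as used here, every step except the last injects fresh noise, so the sampler behaves like a strided ancestral (DDPM-style) sampler; $\eta=0$ would give the deterministic DDIM update.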

#### CelebA\-HQ\.

#### ImageNet\.

We evaluated guidance on 4 labels: kuvasz, hamster, bicycle-built-for-two, and nematode, following Ye et al. ([2024](https://arxiv.org/html/2605.28900#bib.bib34)). We generated 256 images per label with a DDIM sampler (400 timesteps, $\eta=1.0$). We used a guidance strength of $\kappa=30$ for Spectral Guidance. Guidance validity was evaluated with a pre-trained DeiT from HuggingFace ([https://huggingface.co/facebook/deit-small-patch16-224](https://huggingface.co/facebook/deit-small-patch16-224)). We report the average top-1 accuracy.
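The evaluation metric used throughout this subsection can be sketched in a few lines: each generated image is classified by the external model, and the prediction counts as correct when it matches the label used for guidance. The helper name and the toy predictions below are hypothetical; with an equal number of samples per label, as in these experiments, the per-label mean equals the overall accuracy.

```python
def average_top1_accuracy(predictions, targets):
    """Mean over target labels of the fraction of samples classified as that label."""
    per_label = {}
    for pred, target in zip(predictions, targets):
        hits, total = per_label.get(target, (0, 0))
        per_label[target] = (hits + int(pred == target), total + 1)
    accuracies = {label: hits / total for label, (hits, total) in per_label.items()}
    return sum(accuracies.values()) / len(accuracies), accuracies

# Toy example (made-up predictions, for illustration only).
targets = ["hamster", "hamster", "kuvasz", "kuvasz"]
predictions = ["hamster", "hamster", "kuvasz", "nematode"]
mean_acc, per_label_acc = average_top1_accuracy(predictions, targets)
```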

(c) CIFAR-10 label-conditioned samples generated via Spectral Guidance ($K=512$, $\xi=0.001$, $\kappa=10.0$).

(d) CelebA-HQ attribute-conditioned samples generated with Spectral Guidance ($K=512$, $\xi=0.001$, $\kappa=10.0$).

### C.3 CLIP Guidance

We benchmarked open-vocabulary guidance on CelebA-HQ using 15 natural language prompts, detailed in Table [4](https://arxiv.org/html/2605.28900#A3.T4). These prompts encompass attributes from the original dataset as well as new features such as pose and background. We used the TFG (Ye et al., [2024](https://arxiv.org/html/2605.28900#bib.bib34)) implementation of all baselines with the following hyperparameters, found via grid search:

- DPS. Guidance strength $=10.0$.
- LGD. Guidance strength $=400.0$; number of samples to estimate posterior: $10$.
- FreeDoM. Guidance strength $=20.0$; $N_{\text{recur}}=2$.
- MPGD. Guidance strength $=20.0$.
- UGD. Guidance strength $=20.0$; $N_{\text{iter}}=5$; $N_{\text{recur}}=1$.
- TFG. $\rho=30.0$; $\mu=1.0$; $N_{\text{recur}}=2$; $N_{\text{iter}}=5$; $\bar{\gamma}=0.001$; number of samples to estimate posterior: $1$.

We used a guidance strength of $\kappa=20.0$ for Spectral Guidance. For all methods, guidance was computed using CLIP ViT-B/32, via the cosine similarity between the predicted clean image and the target text prompt. We generated 100 images per prompt with a DDIM sampler (100 timesteps, $\eta=1.0$) and report the average VQAScore using `llava-v1.5-7b`.
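The guidance signal described above combines two standard ingredients: the predicted clean image $\hat{x}_0 = (x_t - \sqrt{1-\bar\alpha_t}\,\hat\epsilon)/\sqrt{\bar\alpha_t}$ and a cosine similarity between embeddings. The sketch below illustrates both with a random linear map standing in for the image encoder; it does not call CLIP, and all names, shapes, and values are assumptions for illustration.

```python
import numpy as np

def predict_clean_image(x_t, eps_pred, alpha_bar_t):
    """Standard DDPM estimate of x_0 from x_t and the predicted noise."""
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)

def cosine_similarity(u, v, eps=1e-8):
    """Cosine of the angle between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

rng = np.random.default_rng(0)
x0 = rng.standard_normal(48)                 # toy "image", flattened
noise = rng.standard_normal(48)
alpha_bar_t = 0.5
x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * noise

# With a perfect noise prediction, the clean-image estimate is exact.
x0_hat = predict_clean_image(x_t, noise, alpha_bar_t)

# Stand-in encoders (NOT CLIP): a random projection and a random text embedding.
projection = rng.standard_normal((48, 16))
text_embedding = rng.standard_normal(16)
score = cosine_similarity(x0_hat @ projection, text_embedding)
```

In the actual experiments the two embeddings would come from the CLIP image and text encoders, and the score would be maximized with respect to the sampling trajectory by whichever guidance method is being evaluated.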

### C.4 Mask Guidance

In the mask guidance experiment, we used 1,000 binary hair masks ($256\times 256$) from CelebAMask-HQ to guide the generation. Spectral Guidance uses the same model $f_\phi$ and precomputed singular functions $\{\boldsymbol{\Phi}_t\}_{t\in\mathcal{T}}$ from the CLIP and label guidance experiments on CelebA-HQ. In contrast, all baselines rely on the BiSeNet ([https://github.com/zllrunning/face-parsing.pytorch](https://github.com/zllrunning/face-parsing.pytorch)) segmentation model for guidance. We used the TFG (Ye et al., [2024](https://arxiv.org/html/2605.28900#bib.bib34)) implementation of all baselines with the following hyperparameters:

- DPS. Guidance strength $=3.0$.
- LGD. Guidance strength $=400.0$; number of samples to estimate posterior: $5$.
- FreeDoM. Guidance strength $=30.0$; $N_{\text{recur}}=1$.
- MPGD. Guidance strength $=30.0$.
- UGD. Guidance strength $=50.0$; $N_{\text{iter}}=5$; $N_{\text{recur}}=1$.
- TFG. $\rho=30.0$; $\mu=1.0$; $N_{\text{recur}}=1$; $N_{\text{iter}}=5$; $\bar{\gamma}=0.001$; number of samples to estimate posterior: $1$.

We generated one image per target mask with a DDIM sampler (100 timesteps, $\eta=1.0$). Guidance validity was evaluated by feeding the generated image to an independent face parsing model from HuggingFace ([https://huggingface.co/jonathandinu/face-parsing](https://huggingface.co/jonathandinu/face-parsing)) and computing the IoU between the detected hair mask and the target one. Fidelity was assessed via the $\log$ KID between the generated images and 1,000 random images from the dataset.
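The IoU used for guidance validity is the ratio of the intersection to the union of the two binary masks. A minimal sketch follows; the function name, the convention for two empty masks, and the toy rectangles are assumptions for illustration.

```python
import numpy as np

def mask_iou(pred_mask, target_mask):
    """Intersection over union of two binary masks of the same shape."""
    pred = np.asarray(pred_mask, dtype=bool)
    target = np.asarray(target_mask, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return float(np.logical_and(pred, target).sum() / union)

# Toy 256x256 masks: the detected region is the target shifted down by 32 rows.
target = np.zeros((256, 256), dtype=bool)
target[64:192, 64:192] = True
detected = np.zeros((256, 256), dtype=bool)
detected[96:224, 64:192] = True

iou = mask_iou(detected, target)  # intersection 96*128, union 160*128
```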

Table 4: Prompts used for CLIP Guidance.