Conditional Diffusion Under Linear Constraints: Langevin Mixing and Information-Theoretic Guarantees

arXiv cs.LG Papers

Summary

This paper analyzes zero-shot conditional sampling with pretrained diffusion models for linear inverse problems, providing information-theoretic guarantees and proposing a projected-Langevin initialization method.

arXiv:2605.05387v1 Announce Type: new


# Conditional Diffusion Under Linear Constraints: Langevin Mixing and Information-Theoretic Guarantees
Source: [https://arxiv.org/html/2605.05387](https://arxiv.org/html/2605.05387)
Ahmad Aghapour (aghapour@umich.edu), Erhan Bayraktar (erhan@umich.edu), Asaf Cohen (asafc@umich.edu)

Department of Mathematics, University of Michigan, Ann Arbor, MI 48109, USA

###### Abstract

We study zero\-shot conditional sampling with pretrained diffusion models for linear inverse problems, including inpainting and super\-resolution\. In these problems, the observation determines only part of the unknown signal\. The remaining degrees of freedom must be sampled according to the correct conditional data distribution\. Existing projection\-based samplers enforce measurement consistency by correcting the observed component during reverse diffusion\. However, measurement consistency alone does not determine how probability mass should be distributed along the feasible set, and this can lead to biased conditional samples\.

We analyze this issue through a normal–tangent decomposition of the score function\. For Gaussian noising, the observed\-direction score is exactly determined by the measurement; only the tangent conditional score is unknown\. We prove that the error from replacing this score by the unconditional tangent score is upper bounded by a dimension\-free conditional mutual information between observed and unobserved components\. This gives an information\-theoretic decomposition into initialization and pathwise score\-mismatch errors\. Motivated by the theory, we propose a projected\-Langevin initialization followed by guided reverse denoising, which outperforms a strong projection\-based baseline in inpainting and super\-resolution experiments\.

Keywords: diffusion models, inverse problems, Langevin dynamics, information-theoretic bounds, conditional sampling

## 1 Introduction

Diffusion models have become a standard tool for high-dimensional generative modeling. Given samples from a data distribution, a diffusion model learns the score of progressively noised versions of the data and then generates new samples by simulating a reverse-time denoising process (Song and Ermon, [2019](https://arxiv.org/html/2605.05387#bib.bib23); Ho et al., [2020](https://arxiv.org/html/2605.05387#bib.bib13); Song et al., [2020b](https://arxiv.org/html/2605.05387#bib.bib24), [a](https://arxiv.org/html/2605.05387#bib.bib22)). In many applications, however, generation is not unconditional. In image restoration, for example, one observes a corrupted image and wants to sample clean images that are both realistic under a pretrained image prior and consistent with the observation.

This paper studies such conditional sampling problems for noiseless linear observations. Let $Z\in\mathbb{R}^{d}$ denote the clean signal and suppose that

$$y=AZ,\qquad A\in\mathbb{R}^{m\times d}.$$

The goal is to sample from the conditional law

$$\operatorname{Law}(Z\mid AZ=y).$$

When $A$ has full row rank, the constraint $AZ=y$ defines an affine set. Writing

$$P_{\perp}:=A^{\top}(AA^{\top})^{-1}A,\qquad P_{\parallel}:=I-P_{\perp},$$

the projection $P_{\perp}$ extracts the component of the signal determined by the measurements, while $P_{\parallel}$ extracts the component in the null space of $A$. We refer to these as the normal and tangent components, respectively. Thus the observation fixes the normal component, whereas the tangent component contains the remaining degrees of freedom. In imaging problems, this formulation covers inpainting, super-resolution, deblurring, and other linear inverse problems.
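As an illustration of the two projections just defined, the following NumPy sketch builds $P_{\perp}$ and $P_{\parallel}$ from a measurement matrix and splits a signal into its normal and tangent parts. The helper name `projectors` and the small masking operator `A` are hypothetical choices made for this example; they are not taken from the paper.

```python
import numpy as np

def projectors(A):
    """Return (P_perp, P_par) for a full-row-rank A of shape (m, d)."""
    # Solve (A A^T) X = A instead of forming the inverse explicitly.
    P_perp = A.T @ np.linalg.solve(A @ A.T, A)
    P_par = np.eye(A.shape[1]) - P_perp
    return P_perp, P_par

# Hypothetical inpainting-style operator: observe coordinates 0 and 2 of a 4-vector.
A = np.eye(4)[[0, 2]]
P_perp, P_par = projectors(A)

z = np.array([1.0, 2.0, 3.0, 4.0])
normal = P_perp @ z    # part of z fixed by the measurement y = A z
tangent = P_par @ z    # part of z that the measurement cannot see
```

For a masking operator the normal part keeps the observed coordinates and the tangent part keeps the unobserved ones, which is the inpainting case mentioned above.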

A central difficulty is that measurement consistency and conditional sampling are not the same task. Measurement consistency only requires producing a sample $\hat{z}$ satisfying $A\hat{z}=y$. Conditional sampling requires more: among all feasible signals satisfying the measurement, samples should be distributed according to the true conditional law of the data. In the geometric language above, the normal component enforces feasibility, while the tangent component determines how probability mass is distributed along the feasible affine set.

Many zero-shot inverse-problem samplers based on pretrained diffusion models enforce the measurement by repeatedly correcting or projecting the sample in the observed directions. We call such methods projection-based because they use the known linear operator $A$ to replace, project, or analytically correct the normal component during reverse diffusion, while leaving the unobserved directions largely governed by the pretrained unconditional model. Methods such as denoising diffusion restoration models (DDRM) and the denoising diffusion null-space model (DDNM) are representative examples for linear inverse problems (Kawar et al., [2022](https://arxiv.org/html/2605.05387#bib.bib16); Wang et al., [2022](https://arxiv.org/html/2605.05387#bib.bib27)). These methods can be very effective at maintaining measurement consistency. However, correcting the normal component does not by itself determine the correct distribution in the tangent directions. As a result, a sample may satisfy $A\hat{z}=y$ while still being biased along the feasible manifold.

The goal of this paper is to understand and reduce this tangent\-space bias\. We work in the zero\-shot setting: the diffusion model is pretrained unconditionally, is not fine\-tuned for the observation, and conditioning is imposed only at inference time\. This setting is practically important because it allows a single generative prior to be reused across many inverse problems\. It is also theoretically revealing, because the only available learned object is the unconditional score\. The question is therefore: when can an unconditional score be used to approximate the conditional dynamics, and where does the error enter?

Our starting point is a normal–tangent decomposition of the conditional score. Under Gaussian noising, the normal component of the conditional score is available in closed form from the observation. In the variance-exploding normalization, if $B=P_{\perp}Z=b$, then

$$P_{\perp}s_{t}^{*,b}(x)=\frac{1}{t}P_{\perp}(b-x).$$

Thus the normal score is not the obstacle. The only unknown part is the tangent conditional score $P_{\parallel}s_{t}^{*,b}(x)$. Projection-based zero-shot samplers can therefore be viewed as replacing this unknown tangent conditional score by the pretrained unconditional tangent score $P_{\parallel}s_{t}(x)$. This view isolates the precise source of bias: the approximation is made along the feasible directions, not in the measured directions.

Motivated by this decomposition, we propose a two\-stage conditional sampler\. Rather than starting reverse diffusion from the highest\-noise distribution, we start from an intermediate noise level\. At this level, the noisy normal component can be sampled exactly under the constraint\. We then run projected underdamped Langevin dynamics on the corresponding affine slice, using the projected unconditional score to mix only in the tangent directions\. This produces an initialization that is already consistent with the noisy constraint and better adapted to the feasible slice\. From this initialization, we perform guided reverse denoising using the exact normal correction and the pretrained unconditional score in the tangent directions\.

The theoretical analysis follows the same decomposition\. We separate the total sampling error into two terms\. The first is an initialization error at the intermediate noise level, caused by approximating the true conditional marginal on the affine slice\. The second is a pathwise error accumulated during reverse denoising, caused by replacing the true conditional tangent score with the unconditional tangent score\. Our main pathwise result shows that this second error is controlled by a conditional mutual information between tangent and normal components\. Informally, zero\-shot tangent guidance is accurate when, at the chosen noise level, the remaining statistical dependence between the unobserved tangent component and the observed normal component is small\.

We further combine the pathwise bound with an initialization analysis\. Under a latent Gaussian\-mixture model, we obtain a terminal Kullback–Leibler \(KL\) bound consisting of an initialization term and a mutual\-information pathwise term\. Under an additional separation condition on the latent normal codebook, both terms become exponentially small in the separation\-to\-noise ratio\. These results identify regimes in which inference\-time conditioning with a fixed unconditional score can be accurate, and they also explain why tangent\-space ambiguity is the central obstruction\.

We evaluate the resulting sampler on standard linear imaging inverse problems using pretrained diffusion backbones and matched compute budgets. On inpainting and $8\times$ super-resolution across CelebA-HQ, LSUN Church, and ImageNet, the proposed method improves Learned Perceptual Image Patch Similarity (LPIPS) and Fréchet Inception Distance (FID) over a strong projection-based zero-shot baseline. The gains are largest in settings with greater unresolved tangent ambiguity, such as ImageNet and high-factor super-resolution, consistent with the role of tangent mixing in the analysis.

### 1.1 Related Work

Our work builds on score-based generative modeling and diffusion models (Song and Ermon, [2019](https://arxiv.org/html/2605.05387#bib.bib23); Ho et al., [2020](https://arxiv.org/html/2605.05387#bib.bib13); Song et al., [2020b](https://arxiv.org/html/2605.05387#bib.bib24), [a](https://arxiv.org/html/2605.05387#bib.bib22)). Conditional generation can be obtained by training conditional models, but many inverse problems require reusing a fixed unconditional model. Classifier guidance and classifier-free guidance modify the reverse process using additional conditional information (Dhariwal and Nichol, [2021](https://arxiv.org/html/2605.05387#bib.bib5); Ho and Salimans, [2022](https://arxiv.org/html/2605.05387#bib.bib12)). Image editing and restoration methods such as SDEdit, RePaint, and ILVR impose conditioning through noising, denoising, and resampling procedures (Meng et al., [2021](https://arxiv.org/html/2605.05387#bib.bib21); Lugmayr et al., [2022](https://arxiv.org/html/2605.05387#bib.bib20); Choi et al., [2021](https://arxiv.org/html/2605.05387#bib.bib2)).

Diffusion priors have also been widely used for inverse problems. Predictor–corrector samplers and likelihood-gradient corrections incorporate observations during sampling (Song et al., [2020b](https://arxiv.org/html/2605.05387#bib.bib24), [2021](https://arxiv.org/html/2605.05387#bib.bib25)). For linear inverse problems, DDRM and DDNM exploit the measurement operator to impose analytic updates or null-space corrections during reverse diffusion (Kawar et al., [2022](https://arxiv.org/html/2605.05387#bib.bib16); Wang et al., [2022](https://arxiv.org/html/2605.05387#bib.bib27)). Diffusion posterior sampling (DPS) extends posterior sampling ideas to more general noisy and nonlinear settings through likelihood-gradient guidance (Chung et al., [2022](https://arxiv.org/html/2605.05387#bib.bib3)). These methods demonstrate the strength of pretrained diffusion priors for restoration. Our focus is different: we analyze the specific tangent-score approximation that remains after the normal measurement component has been corrected.

Other approaches construct conditional samplers by changing the model or the underlying path measure. Reward-based fine-tuning and reinforcement-learning methods adapt a pretrained generator using task-specific feedback (Fan et al., [2023](https://arxiv.org/html/2605.05387#bib.bib9); Zhao et al., [2025](https://arxiv.org/html/2605.05387#bib.bib30); Uehara et al., [2024](https://arxiv.org/html/2605.05387#bib.bib26)). Doob's $h$-transform and diffusion-bridge methods provide principled path-space formulations of conditioning (Didi et al., [2023](https://arxiv.org/html/2605.05387#bib.bib6); Guo et al., [2026](https://arxiv.org/html/2605.05387#bib.bib11); Zhou et al., [2024](https://arxiv.org/html/2605.05387#bib.bib31)). These methods can be exact or asymptotically exact under suitable assumptions, but they typically require learning an additional object, solving a control problem, or fine-tuning the model. By contrast, we keep the unconditional score fixed and study what can be achieved by inference-time conditioning alone.

The Langevin initialization used here is related to constrained sampling. Projected Langevin methods sample on constrained domains or manifolds (Lamperski, [2021](https://arxiv.org/html/2605.05387#bib.bib17)). Underdamped Langevin dynamics can improve mixing relative to overdamped dynamics in some settings (Cheng et al., [2018](https://arxiv.org/html/2605.05387#bib.bib1)), and BAOAB discretizations are known for stable and low-bias behavior in the position marginal (Leimkuhler and Matthews, [2013](https://arxiv.org/html/2605.05387#bib.bib18)). In affine inverse problems, these methods are natural because, once the normal component is fixed, the remaining sampling problem lives in the tangent space.

Recent theory has begun to analyze conditional and zero-shot diffusion samplers, including asymptotically exact conditional samplers (Wu et al., [2023](https://arxiv.org/html/2605.05387#bib.bib28)), filtering-based posterior samplers for linear inverse problems (Dou and Song, [2024](https://arxiv.org/html/2605.05387#bib.bib7)), and score-mismatch analyses for zero-shot guidance (Liang et al., [2025](https://arxiv.org/html/2605.05387#bib.bib19)). Our contribution is complementary: we isolate the normal–tangent structure of affine conditioning and bound the error caused by using the unconditional tangent score in place of the conditional tangent score.

### 1.2 Contributions

The main contributions of this paper are as follows\.

First, we derive a normal–tangent decomposition of affine conditional diffusion. For Gaussian noising, the normal component of the conditional score is available exactly from the measurement and the noising process, while the tangent component is the only part not supplied by a pretrained unconditional score model. This decomposition motivates the surrogate dynamics in Section [2](https://arxiv.org/html/2605.05387#S2).

Second, we propose a zero\-shot conditional sampler that combines exact normal correction, projected underdamped Langevin mixing on an affine slice, and guided reverse denoising\. The Langevin phase is designed to initialize the sampler at an intermediate noise level with improved mixing in the unobserved tangent directions before the final denoising stage\.

Third, we prove a pathwise error bound for the guided reverse dynamics. In Theorem [4](https://arxiv.org/html/2605.05387#Thmtheorem4), the KL divergence between the ideal conditional path measure and the surrogate path measure is controlled by a conditional mutual information between the tangent and normal components. This gives an information-theoretic criterion for when replacing the conditional tangent score by the unconditional tangent score is accurate.

Fourth, we combine the pathwise estimate with an initialization analysis to obtain terminal KL guarantees. In Theorem [9](https://arxiv.org/html/2605.05387#Thmtheorem9), the terminal error separates into an initialization term and the mutual-information pathwise term. The resulting bound has no explicit dependence on the ambient dimension; its size is governed by the sensitivity of tangent conditionals and by the residual statistical dependence between observed and unobserved components. Under an additional separation condition on the latent normal codebook, Theorem [11](https://arxiv.org/html/2605.05387#Thmtheorem11) further gives an exponential small-error regime, in which both the initialization and pathwise contributions become exponentially small when the components of the Gaussian mixture model are well separated.

Finally, we evaluate the proposed sampler on linear imaging inverse problems. The experiments show that the algorithm outperforms previous zero-shot diffusion methods on inpainting and $8\times$ super-resolution under matched network-evaluation budgets.

### 1.3 Organization

Section [2](https://arxiv.org/html/2605.05387#S2) formulates affine conditional diffusion and derives the normal–tangent decomposition of the conditional reverse dynamics. Section [3](https://arxiv.org/html/2605.05387#S3) presents the Langevin–diffusion sampler. Section [4](https://arxiv.org/html/2605.05387#S4) reports experiments on inpainting and super-resolution. Section [5](https://arxiv.org/html/2605.05387#S5) gives the KL bounds and total error decomposition. Proofs and additional derivations are deferred to the appendix.

## 2 Methodology

We adopt the variance-exploding (VE) diffusion framework of Song et al. ([2020b](https://arxiv.org/html/2605.05387#bib.bib24)). Let the clean data be $Z\in\mathbb{R}^{d}$ with prior law $Z\sim p_{0}$. For diffusion time $t\in[0,T]$, the forward process is

$$X_{t}:=Z+W_{t},\tag{2.1}$$

where $\{W_{t}\}_{t\geq 0}$ is standard Brownian motion in $\mathbb{R}^{d}$ independent of $Z$. Hence

$$X_{t}\mid Z\sim\mathcal{N}(Z,\,tI_{d}),\tag{2.2}$$

and we write $p_{t}$ for the marginal density of $X_{t}$, with score $s_{t}(x):=\nabla_{x}\log p_{t}(x)$.

The time-reversal of ([2.1](https://arxiv.org/html/2605.05387#S2.E1)) yields the reverse-time stochastic differential equation (SDE) that generates *unconditional* samples from $p_{0}$. Using the reverse-time parameter $\tau:=T-t\in[0,T]$, this SDE can be written as

$$\mathrm{d}Y_{\tau}=s_{T-\tau}(Y_{\tau})\,\mathrm{d}\tau+\mathrm{d}\bar{W}_{\tau},\qquad Y_{0}\sim p_{T},\tag{2.3}$$

where $\bar{W}_{\tau}$ is a Brownian motion in reverse time. In practice, $s_{t}(x)$ is approximated by a neural network trained via score matching.

In this work we do not seek unconditional samples. Instead, we aim to sample from a conditional distribution under a linear constraint. Let $A\in\mathbb{R}^{m\times d}$ have full row rank and consider the affine constraint $AZ=y$. It is convenient to express the constraint through orthogonal projection onto the row space of $A$. Define

$$P_{\perp}:=A^{\top}(AA^{\top})^{-1}A,\qquad P_{\parallel}:=I-P_{\perp},$$

so that $P_{\perp}$ projects onto $\mathrm{range}(A^{\top})$ (normal space) and $P_{\parallel}$ onto $\ker(A)$ (tangent space). We encode the observation via the *level*

$$B:=P_{\perp}Z,\qquad b:=A^{\top}(AA^{\top})^{-1}y,$$

so that $AZ=y$ is equivalent to $B=b$, i.e., $Z$ lies on the affine set

$$\mathcal{M}(b):=\{x\in\mathbb{R}^{d}:\;P_{\perp}x=b\}.$$

(When $Z$ is supported on a countable codebook $\mathcal{C}\subset\mathbb{R}^{d}$, the level $B$ is supported on $P_{\perp}\mathcal{C}$; the development below does not otherwise rely on discreteness.)

Fix $b$ and let $p_{t}^{*,b}$ denote the conditional density of $X_{t}$ under $\operatorname{Law}(\,\cdot\,\mid P_{\perp}Z=b)$, with conditional score $s_{t}^{*,b}(x):=\nabla_{x}\log p_{t}^{*,b}(x)$. If $s_{t}^{*,b}$ were available, then the correct reverse-time dynamics that sample from $\operatorname{Law}(Z\mid P_{\perp}Z=b)$ would be

$$\mathrm{d}Y_{\tau}^{*,b}=s_{T-\tau}^{*,b}(Y_{\tau}^{*,b})\,\mathrm{d}\tau+\mathrm{d}\bar{W}_{\tau},\qquad Y_{0}^{*,b}\sim\operatorname{Law}(X_{T}\mid P_{\perp}Z=b).\tag{2.4}$$

The main obstacle is that $s_{t}^{*,b}$ is not directly learned by standard unconditional score training.

For Gaussian perturbations, Tweedie's formula expresses the conditional expectation of $Z$ given $X_{t}=x$ as

$$\mathbb{E}[Z\mid X_{t}=x]=x+t\,s_{t}(x).\tag{2.5}$$

A key observation is that, under the affine conditioning $B=b$, Tweedie's identity immediately yields a closed-form expression for the *normal* component of the conditional score: applying $P_{\perp}$ to ([2.5](https://arxiv.org/html/2605.05387#S2.E5)) under $\operatorname{Law}(\,\cdot\,\mid P_{\perp}Z=b)$ gives

$$P_{\perp}\mathbb{E}[Z\mid X_{t}=x,\,B=b]\;=\;P_{\perp}\big(x+t\,s_{t}^{*,b}(x)\big).$$

Since $P_{\perp}Z=b$ holds almost surely under $B=b$, the left-hand side equals $b$, and therefore

$$P_{\perp}s_{t}^{*,b}(x)=\frac{1}{t}\,P_{\perp}(b-x).\tag{2.6}$$

Thus only the tangent component $P_{\parallel}s_{t}^{*,b}$ remains unknown. Using $s_{t}^{*,b}=P_{\parallel}s_{t}^{*,b}+P_{\perp}s_{t}^{*,b}$ and substituting ([2.6](https://arxiv.org/html/2605.05387#S2.E6)) into ([2.4](https://arxiv.org/html/2605.05387#S2.E4)) yields the equivalent decomposition

$$\mathrm{d}Y_{\tau}^{*,b}=\Big(P_{\parallel}s_{T-\tau}^{*,b}(Y_{\tau}^{*,b})+\frac{1}{T-\tau}\,P_{\perp}\big(b-Y_{\tau}^{*,b}\big)\Big)\mathrm{d}\tau+\mathrm{d}\bar{W}_{\tau}.\tag{2.7}$$

This form makes the conditioning mechanism explicit: the process is driven toward the affine set $\mathcal{M}(b)$ by the normal drift, while the remaining tangent drift depends on the intractable conditional score.
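The closed form (2.6) can be checked numerically in a case where the conditional score is computable. The sketch below assumes a hypothetical finite codebook whose atoms all share the same normal component $b$, so that the law of $X_{t}$ given $P_{\perp}Z=b$ is a Gaussian mixture with an explicit score. The dimensions, weights, and atoms are arbitrary illustrative values, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, t = 5, 2, 0.7
A = rng.standard_normal((m, d))
P_perp = A.T @ np.linalg.solve(A @ A.T, A)
P_par = np.eye(d) - P_perp

# Hypothetical codebook: six atoms that differ only in their tangent part.
b = P_perp @ rng.standard_normal(d)
atoms = b + rng.standard_normal((6, d)) @ P_par
weights = rng.dirichlet(np.ones(6))

def conditional_score(x):
    """Score of sum_i w_i N(x; c_i, t I), the law of X_t given P_perp Z = b."""
    logits = np.log(weights) - np.sum((x - atoms) ** 2, axis=1) / (2 * t)
    r = np.exp(logits - logits.max())
    r /= r.sum()                    # posterior responsibilities of the atoms
    return (r @ atoms - x) / t      # Tweedie: (E[Z | X_t = x, B = b] - x) / t

x = rng.standard_normal(d)
lhs = P_perp @ conditional_score(x)   # normal part of the conditional score
rhs = P_perp @ (b - x) / t            # closed form (2.6)
```

The two vectors agree up to floating-point error regardless of the weights, since every atom has the same normal component; the tangent part of `conditional_score(x)` is the piece that depends on them.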

To obtain a practical sampler using only an unconditional score model, we keep the *exact* normal drift and approximate the unknown tangent term by the unconditional tangent score $P_{\parallel}s_{t}$. This yields the surrogate constrained reverse SDE

$$\mathrm{d}\hat{Y}_{\tau}^{\,b}=\Big(P_{\parallel}s_{T-\tau}(\hat{Y}_{\tau}^{\,b})+\frac{1}{T-\tau}\,P_{\perp}\big(b-\hat{Y}_{\tau}^{\,b}\big)\Big)\mathrm{d}\tau+\mathrm{d}\bar{W}_{\tau},\qquad\tau\in[0,T-t_{0}).\tag{2.8}$$

It is constrained in the sense that its normal drift explicitly forces $P_{\perp}\hat{Y}_{\tau}^{\,b}$ toward the prescribed level $b$, thereby steering the trajectory toward the affine manifold $\mathcal{M}(b)=\{x:\,P_{\perp}x=b\}$, while only the tangent component evolves according to the learned (unconditional) score. Equations ([2.7](https://arxiv.org/html/2605.05387#S2.E7)) and ([2.8](https://arxiv.org/html/2605.05387#S2.E8)) share the same (exact) normal component and differ only in the tangent score: $P_{\parallel}s^{*,b}$ versus $P_{\parallel}s$. In implementations, the factor $1/(T-\tau)=1/t$ is handled by stopping the integration at a small $t_{0}>0$ (equivalently $\tau_{\max}=T-t_{0}$) and applying a final denoising step.
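A minimal Euler–Maruyama sketch of the surrogate SDE (2.8) is given below, written in the forward variable $t=T-\tau$ and stopped at $t_{0}$. It is an illustration under stated assumptions rather than the paper's implementation: the function name `surrogate_reverse_sde`, the uniform time grid, and the stand-in Gaussian score (exact when $Z\sim\mathcal{N}(0,I)$) are all hypothetical, and the final denoising step is omitted.

```python
import numpy as np

def surrogate_reverse_sde(y, b, P_perp, score, t_start, t0, n_steps, rng):
    """Euler-Maruyama for (2.8), integrating noise level t from t_start down to t0."""
    P_par = np.eye(len(y)) - P_perp
    ts = np.linspace(t_start, t0, n_steps + 1)
    for t, t_next in zip(ts[:-1], ts[1:]):
        h = t - t_next                               # positive step in reverse time
        # Unconditional tangent score plus the exact normal drift (b - y) / t.
        drift = P_par @ score(y, t) + P_perp @ (b - y) / t
        y = y + h * drift + np.sqrt(h) * rng.standard_normal(len(y))
    return y

rng = np.random.default_rng(0)
A = np.array([[1.0, -1.0, 0.0]])
P_perp = A.T @ np.linalg.solve(A @ A.T, A)
b = P_perp @ np.array([2.0, 0.0, 0.0])               # level b = (1, -1, 0)
gaussian_score = lambda x, t: -x / (1.0 + t)         # stand-in: exact if Z ~ N(0, I)
y_init = b + rng.standard_normal(3)                  # illustrative start at t_start = 1
y_end = surrogate_reverse_sde(y_init, b, P_perp, gaussian_score, 1.0, 0.01, 200, rng)
```

Because the normal drift scales as $1/t$, the step size must stay small relative to the current noise level near $t_{0}$; a uniform grid is used here only for brevity.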

We do not integrate the surrogate reverse SDE ([2.8](https://arxiv.org/html/2605.05387#S2.E8)) over the full reverse-time horizon $\tau\in[0,T]$. The surrogate replaces the true conditional tangent score $P_{\parallel}s^{*,b}_{T-\tau}$ by the unconditional term $P_{\parallel}s_{T-\tau}$. If we start at $\tau=0$ (i.e., from the highest-noise marginal), this mismatch acts over a long interval during which the normal correction is weak because its strength scales as $1/(T-\tau)=1/t$. As a result, the trajectory can drift in tangent directions in a way that is inconsistent with the target conditional law, producing a bias that accumulates before the constraint becomes dominant at smaller noise.

To limit this accumulation, we start the surrogate reverse SDE only at an intermediate noise level $t^{*}\in(0,T-t_{0})$, equivalently at reverse time $\tau^{*}:=T-t^{*}$. Intuitively, $t^{*}$ is chosen so that, for the remaining reverse interval $\tau\in[\tau^{*},T-t_{0}]$ (i.e., forward times $t\in(0,t^{*}]$), using $P_{\parallel}s_{t}$ as a proxy for $P_{\parallel}s_{t}^{*,b}$ is acceptable, while the normal drift is already strong enough to enforce the constraint. What remains is that we cannot initialize the reverse SDE at $\tau^{*}$ from an arbitrary point: we need an initial state that is (approximately) distributed as the correct conditional marginal $\operatorname{Law}(X_{t^{*}}\mid P_{\perp}Z=b)$.

We construct such an initialization by combining an exact draw for the normal component with a tangent-space sampling step. Under the conditioning $B=b$ we have $P_{\perp}Z=b$ almost surely, hence

$$P_{\perp}X_{t^{*}}=P_{\perp}(Z+W_{t^{*}})=b+P_{\perp}W_{t^{*}},$$

so the normal component at time $t^{*}$ can be sampled explicitly by

$$x^{\perp}_{t^{*}}:=b+\sqrt{t^{*}}\,P_{\perp}\xi,\qquad\xi\sim\mathcal{N}(0,I_{d}),$$

which matches the exact law of $P_{\perp}X_{t^{*}}\mid P_{\perp}Z=b$. Conditional on this sampled normal component $x^{\perp}$, we sample a compatible tangent component by running Langevin dynamics restricted to the affine set $\mathcal{M}(x^{\perp}):=\{x:\;P_{\perp}x=x^{\perp}\}$, using only the projected (tangent) score at time $t^{*}$.

The following lemma shows that restricting a density to $\mathcal{M}(x^{\perp})$ simply projects its ambient score onto the tangent space.

###### Lemma 2

Let $A\in\mathbb{R}^{m\times d}$ have full row rank and let $C\in\mathbb{R}^{d\times(d-m)}$ have orthonormal columns spanning $\ker(A)$ ($C^{\top}C=I_{d-m}$, $CC^{\top}=P_{\parallel}$). Fix any $u_{0}\in\mathcal{M}(x^{\perp})$ and parametrize the affine set by $x=u_{0}+Cz^{\parallel}$ with $z^{\parallel}\in\mathbb{R}^{d-m}$. For any differentiable density $p:\mathbb{R}^{d}\to(0,\infty)$, define its restriction to $\mathcal{M}(x^{\perp})$ by $\pi(z^{\parallel})\propto p(u_{0}+Cz^{\parallel})$. Then, for all $z^{\parallel}$,

$$C\,\nabla_{z^{\parallel}}\log\pi(z^{\parallel})=P_{\parallel}\,\nabla_{x}\log p(x),\qquad x=u_{0}+Cz^{\parallel}.$$

*Proof.* Since the proportionality constant does not depend on $z^{\parallel}$, it disappears after taking logarithms and gradients. Thus

$$\log\pi(z^{\parallel})=\log p(u_{0}+Cz^{\parallel})+\text{const}.$$

Differentiating with respect to $z^{\parallel}$ and using the chain rule gives

$$\nabla_{z^{\parallel}}\log\pi(z^{\parallel})=C^{\top}\nabla_{x}\log p(x),\qquad x=u_{0}+Cz^{\parallel}.$$

Multiplying both sides by $C$, we obtain

$$C\,\nabla_{z^{\parallel}}\log\pi(z^{\parallel})=CC^{\top}\nabla_{x}\log p(x).$$

Because the columns of $C$ form an orthonormal basis of $\ker(A)$, we have

$$CC^{\top}=P_{\parallel}.$$

Therefore

$$C\,\nabla_{z^{\parallel}}\log\pi(z^{\parallel})=P_{\parallel}\nabla_{x}\log p(x),\qquad x=u_{0}+Cz^{\parallel},$$

which is exactly the claimed identity.

We use Lemma [2](https://arxiv.org/html/2605.05387#Thmtheorem2) with the time-$t^{*}$ marginal $p_{t^{*}}$ (and its learned score $s_{t^{*}}=\nabla\log p_{t^{*}}$). Starting from any point on $\mathcal{M}(x^{\perp}_{t^{*}})$, e.g.

$$y_{0}:=x^{\perp}_{t^{*}}+\sqrt{t^{*}}\,P_{\parallel}\xi,\qquad\xi\sim\mathcal{N}(0,I_{d}),$$

we run underdamped Langevin dynamics evolving only in tangent directions:

$$\begin{aligned}\mathrm{d}y_{s}&=v_{s}\,\mathrm{d}s,\\ \mathrm{d}v_{s}&=P_{\parallel}s_{t^{*}}(y_{s})\,\mathrm{d}s-\gamma P_{\parallel}v_{s}\,\mathrm{d}s+\sqrt{2\gamma}\,P_{\parallel}\,\mathrm{d}W_{s},\end{aligned}\tag{2.9}$$

while enforcing the constraint $P_{\perp}y_{s}\equiv x^{\perp}_{t^{*}}$ for all $s$ (equivalently, we project updates onto the tangent space). Let $\hat{Y}_{\tau^{*}}^{\,b}$ denote the resulting position $y_{s}$ after a prescribed Langevin time.
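The tangent-space dynamics (2.9) can be discretized in several ways. The sketch below uses a BAOAB splitting with every update projected by $P_{\parallel}$. BAOAB is an assumption made for this example, as are the function name `projected_baoab`, the step size, the friction $\gamma$, the number of steps, and the stand-in Gaussian score; a pretrained network evaluated at noise level $t^{*}$ would take the place of `score`.

```python
import numpy as np

def projected_baoab(y, P_par, score, step, gamma, n_steps, rng):
    """Underdamped Langevin (2.9) restricted to ker(A), using a BAOAB splitting."""
    v = P_par @ rng.standard_normal(len(y))       # velocity lives in the tangent space
    c1 = np.exp(-gamma * step)
    c2 = np.sqrt(1.0 - c1 ** 2)
    force = P_par @ score(y)
    for _ in range(n_steps):
        v = v + 0.5 * step * force                # B: half kick with the tangent score
        y = y + 0.5 * step * v                    # A: half position update
        v = c1 * v + c2 * (P_par @ rng.standard_normal(len(y)))  # O: exact OU step
        y = y + 0.5 * step * v                    # A: half position update
        force = P_par @ score(y)
        v = v + 0.5 * step * force                # B: half kick with the tangent score
    return y

rng = np.random.default_rng(0)
A = np.array([[1.0, -1.0]])
P_perp = A.T @ np.linalg.solve(A @ A.T, A)
P_par = np.eye(2) - P_perp
t_star, b = 0.25, np.zeros(2)
x_perp = b + np.sqrt(t_star) * (P_perp @ rng.standard_normal(2))  # exact normal draw
y0 = x_perp + np.sqrt(t_star) * (P_par @ rng.standard_normal(2))  # a point on the slice
score = lambda x: -x / (1.0 + t_star)             # stand-in: exact if Z ~ N(0, I)
y_mixed = projected_baoab(y0, P_par, score, step=0.05, gamma=1.0, n_steps=500, rng=rng)
```

Since the velocity and the noise are both projected, the position never leaves the slice: `P_perp @ y_mixed` equals `x_perp` up to rounding, which is the constraint stated after (2.9).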

This two-stage procedure induces an initialization law at time $t^{*}$ that matches the conditional normal marginal exactly and uses a tractable surrogate for the tangent conditional, namely

$$\hat{p}_{t^{*}}^{\,b}(x^{\perp},x^{\parallel})=p_{t^{*}}(x^{\parallel}\mid x^{\perp})\,p_{t^{*}}(x^{\perp}\mid P_{\perp}Z=b),\qquad x^{\perp}=P_{\perp}x,\ \ x^{\parallel}=P_{\parallel}x.$$

Here $p_{t^{*}}(x^{\perp}\mid P_{\perp}Z=b)$ is available in closed form because, under $B=b$, the forward process satisfies $X_{t^{*}}^{\perp}=b+W_{t^{*}}^{\perp}$, hence $X_{t^{*}}^{\perp}\sim\mathcal{N}(b,t^{*}P_{\perp})$. The remaining factor $p_{t^{*}}(x^{\parallel}\mid x^{\perp})$ is *not* conditioned on $B=b$; it is the *unconditional* tangent conditional induced by the pretrained model at noise level $t^{*}$. Equivalently, $\hat{p}_{t^{*}}^{\,b}$ is the distribution obtained by (i) drawing the correct noisy normal component under the constraint, and then (ii) drawing a tangent component that is compatible with that normal slice according to the unconditional time-$t^{*}$ marginal. This is exactly what the projected Langevin phase targets: it mixes along the affine set $\mathcal{M}(x^{\perp}_{t^{*}})$ using the projected score $P_{\parallel}s_{t^{*}}$, which is the score of $p_{t^{*}}(\cdot\mid x^{\perp})$ restricted to the manifold (Lemma [2](https://arxiv.org/html/2605.05387#Thmtheorem2)). Finally, we use $\hat{p}_{t^{*}}^{\,b}$ as the *initial distribution* for the surrogate reverse dynamics ([2.8](https://arxiv.org/html/2605.05387#S2.E8)) at reverse time $\tau^{*}=T-t^{*}$, i.e., $\hat{Y}_{\tau^{*}}^{\,b}\sim\hat{p}_{t^{*}}^{\,b}$.
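The identity of Lemma 2, on which the projected Langevin phase relies, can be verified by finite differences. The sketch below assumes a hypothetical smooth density (an unnormalized two-component Gaussian mixture) and builds the basis $C$ of $\ker(A)$ from a singular value decomposition. All names and numerical values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 4, 1
A = rng.standard_normal((m, d))
P_perp = A.T @ np.linalg.solve(A @ A.T, A)
P_par = np.eye(d) - P_perp
# Orthonormal basis of ker(A): the last d - m right singular vectors of A.
C = np.linalg.svd(A)[2][m:].T

# Hypothetical density p: unnormalized mixture of two unit-variance Gaussians.
mu = rng.standard_normal((2, d))
log_p = lambda x: np.log(np.sum(np.exp(-0.5 * np.sum((x - mu) ** 2, axis=1))))

def grad(f, x, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at x."""
    e = np.eye(len(x))
    return np.array([(f(x + eps * e[i]) - f(x - eps * e[i])) / (2 * eps)
                     for i in range(len(x))])

u0 = rng.standard_normal(d)                      # base point of the affine slice
z = rng.standard_normal(d - m)                   # tangent coordinates
lhs = C @ grad(lambda w: log_p(u0 + C @ w), z)   # C grad_z log pi(z)
rhs = P_par @ grad(log_p, u0 + C @ z)            # P_par grad_x log p(x)
```

The normalizing constant of the restricted density never appears, in line with the first step of the proof.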

*Toy mixture illustration.* We illustrate the tangent-bias mechanism on a simple prior in $\mathbb{R}^{2}$: a three-point mixture with atoms at $(1,1)$, $(-1,-1)$, and $(0,5)$ with weights $0.125:0.125:0.75$. We consider the linear constraint

$$x-y=0,\qquad\text{equivalently}\qquad AZ=0\ \ \text{with}\ \ A=\begin{bmatrix}1&-1\end{bmatrix},$$

so the conditional target is $\operatorname{Law}(Z\mid x-y=0)$, i.e., sampling on the diagonal affine set $\mathcal{M}(0)=\{(x,y)\in\mathbb{R}^{2}:\ x=y\}$.

We run the probability-flow ordinary differential equation (PF-ODE), the deterministic counterpart of the reverse-time sampler, from $\sigma_{\max}=20$ to $\sigma_{\min}=0.01$, with the identification $t=\sigma^{2}$. As shown in Figure [1](https://arxiv.org/html/2605.05387#S2.F1)(a), the unconstrained PF-ODE recovers the correct mixture.

Under naive projection-based guidance initialized at $\sigma_{\max}$, the constraint $x-y=0$ is enforced only through the analytic normal drift, while the tangent drift remains that of the unconditional score. At high noise, the unconditional score is dominated by the heavy $(0,5)$ component, and this dominant-mode tangent direction accumulates along the manifold $\mathcal{M}(0)$, distorting the conditional weights and smearing the low-mass modes toward the dominant cluster (Figure [1](https://arxiv.org/html/2605.05387#S2.F1)(b)). In contrast, our two-stage procedure runs a brief projected underdamped Langevin phase at $t^{*}=0.25$ restricted to $\ker(A)$ (i.e., motion tangent to $x-y=0$, cf. Lemma [2](https://arxiv.org/html/2605.05387#Thmtheorem2)), producing an initialization close to $\operatorname{Law}(X_{t^{*}}\mid x-y=0)$. Starting the surrogate reverse dynamics from $\tau^{*}=T-t^{*}$ then yields samples that remain consistent with the constraint and recover the intended mode structure (Figure [1](https://arxiv.org/html/2605.05387#S2.F1)(c)).

![Refer to caption](https://arxiv.org/html/2605.05387v1/figs/reverse_ve_diffusion_actual.png)

\(a\) Unconstrained PF\-ODE

![Refer to caption](https://arxiv.org/html/2605.05387v1/figs/reverse_ve_diffusion_naive_inverse.png)

\(b\) Naive projection guidance

![Refer to caption](https://arxiv.org/html/2605.05387v1/figs/reverse_ve_diffusion_langevin_our_method.png)

\(c\) Two\-stage \(ours\)

Figure 1: Three-point mixture in $\mathbb{R}^{2}$. (a) Unconstrained PF-ODE reproduces the prior mixture. (b) Naive projection-based guidance accumulates tangent drift dominated by the high-mass $(0,5)$ mode, biasing the conditional outcome. (c) Our projected Langevin initialization at $t^{*}$ followed by the surrogate reverse dynamics recovers constraint-consistent samples with the correct mode structure.
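The tangent-bias mechanism in the toy example can be checked numerically. The following is a minimal sketch, assuming Gaussian noising $X_t=Z+W_t$ with $W_t\sim\mathcal{N}(0,tI)$; the helper names `mixture_score` and `tangent_scores` are illustrative and do not come from the paper. It compares the tangent component of the unconditional score with that of the conditional score at the diagonal point $(1,1)$, once at $\sigma_{\max}=20$ and once at $t^{*}=0.25$.

```python
import numpy as np

# Toy prior from the text: atoms (1,1), (-1,-1), (0,5) with weights 0.125 : 0.125 : 0.75.
ATOMS = np.array([[1.0, 1.0], [-1.0, -1.0], [0.0, 5.0]])
WEIGHTS = np.array([0.125, 0.125, 0.75])

# Unit tangent vector of the constraint set x - y = 0.
TANGENT = np.array([1.0, 1.0]) / np.sqrt(2.0)


def mixture_score(x, t, atoms, weights):
    """Score of sum_k w_k N(atom_k, t I) evaluated at x, with t = sigma^2."""
    log_r = np.log(weights) - np.sum((atoms - x) ** 2, axis=1) / (2.0 * t)
    r = np.exp(log_r - log_r.max())    # subtract the max for numerical stability
    r /= r.sum()                       # posterior responsibility of each atom
    return (r @ atoms - x) / t


def tangent_scores(x, t):
    """Tangent component of the unconditional and of the conditional score at x."""
    uncond = mixture_score(x, t, ATOMS, WEIGHTS)
    # Conditioning on x - y = 0 keeps only the two diagonal atoms, equally weighted.
    cond = mixture_score(x, t, ATOMS[:2], np.array([0.5, 0.5]))
    return TANGENT @ uncond, TANGENT @ cond


x = np.array([1.0, 1.0])                              # a point on the diagonal
hi_uncond, hi_cond = tangent_scores(x, t=20.0 ** 2)   # sigma_max = 20
lo_uncond, lo_cond = tangent_scores(x, t=0.25)        # t* = 0.25
```

In this sketch the two tangent scores have opposite signs at high noise, because the off-manifold $(0,5)$ atom still carries most of the posterior weight, while at $t^{*}=0.25$ they agree to numerical precision.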
## 3 Algorithm and Implementation

Our conditional sampler is organized into three conceptually distinct steps. The design goal is to (i) initialize at an intermediate “safe” noise level $t^{*}$, (ii) mix efficiently *along* the affine constraint manifold, and (iii) complete denoising while enforcing the constraint through an exact normal drift. Figure [2](https://arxiv.org/html/2605.05387#S3.F2) provides a visual summary of the full pipeline. In Step 1 we move to the intermediate time $t^{*}$ and fix the *noisy* normal level so that $P_{\perp}X_{t^{*}}$ has the exact conditional law under $B=b$. In Step 2 we run a short phase of projected underdamped Langevin dynamics (BAOAB) on the corresponding affine set $\mathcal{M}(x^{\perp})$ to mix *in the tangent directions* while keeping the normal component fixed. This produces an initialization $\hat{Y}_{\tau^{*}}^{\,b}$ that is approximately distributed as $\operatorname{Law}(X_{t^{*}}\mid P_{\perp}Z=b)$ and is already well-mixed along $\ker(A)$, which reduces the accumulation of tangent-score mismatch at high noise. In Step 3 we integrate the surrogate guided reverse dynamics from $\tau^{*}=T-t^{*}$ to $T$, using the analytic normal drift to enforce the constraint and the pretrained score for the tangent drift during denoising.

*Step 1: Initialization for Langevin.* As in Section [2](https://arxiv.org/html/2605.05387#S2), we start the reverse-time procedure at the intermediate “safe” noise level $t^{*}=T-\tau^{*}$. Step 1 produces the *initial state for the projected Langevin phase* (Step 2). To do so, we select any clean feasible point $x_{0}\in\mathcal{M}(b)$ satisfying $P_{\perp}x_{0}=b$ (equivalently $Ax_{0}=y$). The choice of $x_{0}$ is not unique and does not affect feasibility; in practice one may take, for example, $x_{0}=b+P_{\parallel}\zeta$ with $\zeta\sim\mathcal{N}(0,I_{d})$ (Gaussian initialization in $\ker(A)$), or use a plug-in estimate (e.g., a pseudoinverse or any other fast reconstruction) and project it onto $\mathcal{M}(b)$.
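The feasible-point choice above, together with the noising step described next, can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: the matrix `A`, the dimensions, and the observation `y_obs` are hypothetical, and the sketch assumes that the normal level is obtained as $b=A^{+}y$, so that $P_{\perp}x=b$ is equivalent to $Ax=y$ for a full-row-rank $A$.

```python
import numpy as np

rng = np.random.default_rng(0)

d, m = 6, 2                        # hypothetical ambient and measurement dimensions
A = rng.standard_normal((m, d))    # hypothetical full-row-rank measurement matrix
y_obs = rng.standard_normal(m)     # hypothetical observation, y = A z

A_pinv = np.linalg.pinv(A)
P_perp = A_pinv @ A                # orthogonal projector onto the normal directions (row space of A)
P_par = np.eye(d) - P_perp         # orthogonal projector onto the tangent directions ker(A)

b = A_pinv @ y_obs                 # assumed clean normal level: P_perp x = b  <=>  A x = y_obs

zeta = rng.standard_normal(d)
x0 = b + P_par @ zeta              # feasible clean point with a Gaussian tangent component

t_star = 0.25                      # intermediate noise level (value borrowed from the toy example)
xi = rng.standard_normal(d)
y_init = x0 + np.sqrt(t_star) * xi # move x0 to time t*
b_noisy = P_perp @ y_init          # noisy normal level frozen during the Langevin phase
```

Note that `b_noisy` differs from `b` by exactly $\sqrt{t^{*}}\,P_{\perp}\xi$, which is the Gaussian normal fluctuation the forward process would add under the conditioning.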

We then move $x_{0}$ to time $t^{*}$ by adding a Gaussian perturbation,

$$y^{\tau^{*}}\;=\;x_{0}+\sqrt{T-\tau^{*}}\,\xi\;=\;x_{0}+\sqrt{t^{*}}\,\xi,\qquad\xi\sim\mathcal{N}(0,I_{d}).$$

Rather than enforcing the clean constraint level $b$ at this stage, we freeze the *noisy* normal component

$$b_{\mathrm{noisy}}\;:=\;P_{\perp}y^{\tau^{*}},$$

which is the correct forward-time stochastic normal level at time $t^{*}$ under the conditioning $B=b$. $y^{\tau^{*}}$ is then used to initialize the constrained BAOAB/underdamped Langevin dynamics on the affine set $\mathcal{M}(b_{\mathrm{noisy}})$ in Step 2.

*Step 2: Tangent BAOAB Langevin.* Next, we approximate the conditional marginal $\operatorname{Law}(X_{t^{*}}\mid P_{\perp}Z=b)$ by running *underdamped Langevin dynamics* restricted to the affine set $\mathcal{M}(b_{\mathrm{noisy}})$. We evolve a position–velocity pair $(y_{s},v_{s})$ using the projected dynamics in Equation ([2.9](https://arxiv.org/html/2605.05387#S2.E9)), so that both the deterministic “force” and the stochastic excitation act *only in tangent directions* $\ker(A)$. We discretize this SDE with the BAOAB splitting integrator, which decomposes the dynamics into three sub-operators that can be integrated in closed form.

The BAOAB split. Write the SDE as the sum of:

- *B (kick):* deterministic velocity update due to the force, $\dot{v}=P_{\parallel}s_{t^{*}}(y)$.
- *A (drift):* deterministic position update, $\dot{y}=v$.
- *O (Ornstein–Uhlenbeck):* stochastic friction/noise on velocity, $\mathrm{d}v=-\gamma P_{\parallel}v\,\mathrm{d}s+\sqrt{2\gamma}\,P_{\parallel}\mathrm{d}W_{s}$.

BAOAB applies these pieces in the symmetric order

$$\text{B/2}\;\rightarrow\;\text{A/2}\;\rightarrow\;\text{O}\;\rightarrow\;\text{A/2}\;\rightarrow\;\text{B/2},$$

which is time-reversible (in the deterministic limit) and is known to have excellent stability and low bias in the *configurational* (position) marginal.

With step size $\Delta s$, one iteration from $(y,v)$ proceeds as:

1. *B/2 (half kick):* update the velocity using the score force at the current position: $v\leftarrow v+\frac{\Delta s}{2}\,P_{\parallel}s_{t^{*}}(y)$.
2. *A/2 (half drift):* move the position forward using the current velocity: $y\leftarrow y+\frac{\Delta s}{2}\,v$.
3. *O (OU refresh):* apply friction and inject Gaussian noise directly in velocity. This step is exact because it is an Ornstein–Uhlenbeck process. Writing $c_{1}:=e^{-\gamma\Delta s}$ and $c_{2}:=\sqrt{1-e^{-2\gamma\Delta s}}$, we perform $v\leftarrow c_{1}v+c_{2}\,P_{\parallel}\xi$ with $\xi\sim\mathcal{N}(0,I_{d})$. Here $c_{1}$ contracts velocity (friction) and $c_{2}$ sets the noise amplitude; the projection $P_{\parallel}$ ensures that the OU excitation does not change the normal component.
4. *A/2 (half drift):* advance the position again: $y\leftarrow y+\frac{\Delta s}{2}\,v$.
5. *B/2 (half kick):* apply the remaining half force update: $v\leftarrow v+\frac{\Delta s}{2}\,P_{\parallel}s_{t^{*}}(y)$.
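The five sub-steps above translate almost line by line into code. The sketch below is a minimal illustration, not the authors' implementation: the function name `baoab_step`, the toy constraint $x-y=0$ in $\mathbb{R}^{2}$, and the closed-form Gaussian score standing in for the pretrained score network are all assumptions made so that the example is self-contained.

```python
import numpy as np


def baoab_step(y, v, score, P_par, ds, gamma, rng):
    """One projected BAOAB iteration (B/2, A/2, O, A/2, B/2) in the tangent space."""
    c1 = np.exp(-gamma * ds)
    c2 = np.sqrt(1.0 - np.exp(-2.0 * gamma * ds))
    v = v + 0.5 * ds * (P_par @ score(y))                       # B/2: half kick
    y = y + 0.5 * ds * v                                        # A/2: half drift
    v = c1 * v + c2 * (P_par @ rng.standard_normal(y.shape))    # O: exact OU refresh
    y = y + 0.5 * ds * v                                        # A/2: half drift
    v = v + 0.5 * ds * (P_par @ score(y))                       # B/2: half kick
    return y, v


# Toy setup (assumed): constraint x - y = 0 in R^2 and a standard Gaussian prior,
# whose noisy marginal at level t_star has score s(y) = -y / (1 + t_star).
A = np.array([[1.0, -1.0]])
P_perp = A.T @ A / (A @ A.T).item()
P_par = np.eye(2) - P_perp
t_star = 0.25
score = lambda y: -y / (1.0 + t_star)

rng = np.random.default_rng(0)
y0 = np.array([0.3, -0.7])
y, v = y0.copy(), np.zeros(2)
for _ in range(200):
    y, v = baoab_step(y, v, score, P_par, ds=0.05, gamma=1.0, rng=rng)
```

Because every kick and every noise draw is projected by `P_par` and the velocity starts at zero, the normal component `P_perp @ y` stays at its initial value up to floating-point error. The final half kick of one iteration and the first half kick of the next evaluate the score at the same position, so one of the two evaluations could be reused.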

After $K$ BAOAB iterations, we denote the resulting position by $\hat{Y}_{\tau^{*}}^{b}$. This state is well-mixed along $\ker(A)$ while remaining consistent with the forward-time noisy level set, making it a reliable initialization for the guided reverse denoising stage.

*Step 3: Guided Reverse Denoising.* Finally, starting from $\hat{Y}_{\tau^{*}}^{b}$, we integrate the guided reverse SDE in Equation ([2.8](https://arxiv.org/html/2605.05387#S2.E8)) from $\tau=\tau^{*}$ up to $T-t_{0}$.
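One simple way to integrate Step 3 is an Euler–Maruyama scheme for the surrogate dynamics $\mathrm{d}\hat{Y}_{\tau}=\big(P_{\parallel}s_{T-\tau}(\hat{Y}_{\tau})+\tfrac{1}{T-\tau}P_{\perp}(b-\hat{Y}_{\tau})\big)\mathrm{d}\tau+\mathrm{d}\bar{W}_{\tau}$. The sketch below is an assumption-laden illustration: the experiments in the paper use guided DDIM steps rather than this discretization, the function name `guided_reverse_em` is hypothetical, and a closed-form Gaussian score replaces the pretrained network. The code is parametrized by the forward time $t=T-\tau$, which decreases from $t^{*}$ to $t_{0}$.

```python
import numpy as np


def guided_reverse_em(y, b, score, P_par, P_perp, t_start, t_end, n_steps, rng):
    """Euler-Maruyama for the surrogate reverse SDE, written in forward time t = T - tau."""
    h = (t_start - t_end) / n_steps
    t = t_start
    for _ in range(n_steps):
        tangent_drift = P_par @ score(y, t)      # pretrained (unconditional) score, tangent part
        normal_drift = P_perp @ (b - y) / t      # analytic pull of the normal component toward b
        y = y + h * (tangent_drift + normal_drift) + np.sqrt(h) * rng.standard_normal(y.shape)
        t -= h
    return y


# Toy setup (assumed): constraint x - y = 0 in R^2, clean level b = 0, standard Gaussian prior.
A = np.array([[1.0, -1.0]])
P_perp = A.T @ A / (A @ A.T).item()
P_par = np.eye(2) - P_perp
b = np.zeros(2)
score = lambda y, t: -y / (1.0 + t)              # score of N(0, (1 + t) I)

rng = np.random.default_rng(0)
y_start = np.array([0.4, -0.3])                  # stands in for the Langevin output at t* = 0.25
y_final = guided_reverse_em(y_start, b, score, P_par, P_perp,
                            t_start=0.25, t_end=0.01, n_steps=480, rng=rng)
```

The step size is chosen so that $h/t<1$ on the whole grid, which keeps the normal contraction stable as $t$ approaches $t_{0}$.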

Algorithm [1](https://arxiv.org/html/2605.05387#alg1) summarizes these three steps in pseudocode.

Algorithm 1: Conditional Sampling via Affine BAOAB Initialization

1. Input: Clean measurement $b$, starting point $x_{0}\in\mathcal{M}(b)$, intermediate noise $t^{*}=T-\tau^{*}$, Langevin steps $K$, step size $\Delta s$, friction $\gamma$, score network $s_{t^{*}}$.
2. Step 1: Initialization for Langevin
3. $y^{\tau^{*}}\leftarrow x_{0}+\sqrt{T-\tau^{*}}\,\xi,\quad\xi\sim\mathcal{N}(0,I_{d})$
4. $b_{\text{noisy}}\leftarrow P_{\perp}y^{\tau^{*}}$ $\triangleright$ Target level set for the Langevin phase
5. $y\leftarrow y^{\tau^{*}},\quad v\leftarrow 0$
6. $c_{1}\leftarrow e^{-\gamma\Delta s},\quad c_{2}\leftarrow\sqrt{1-e^{-2\gamma\Delta s}}$
7. Step 2: Tangent BAOAB Langevin
8. for $k=1$ to $K$ do
9. $v\leftarrow v+\frac{\Delta s}{2}P_{\parallel}s_{t^{*}}(y)$ $\triangleright$ B: Half-step drift
10. $y\leftarrow y+\frac{\Delta s}{2}v$ $\triangleright$ A: Half-step position
11. $v\leftarrow c_{1}v+c_{2}P_{\parallel}\xi,\quad\xi\sim\mathcal{N}(0,I_{d})$ $\triangleright$ O: Projected noise injection
12. $y\leftarrow y+\frac{\Delta s}{2}v$ $\triangleright$ A: Half-step position
13. $v\leftarrow v+\frac{\Delta s}{2}P_{\parallel}s_{t^{*}}(y)$ $\triangleright$ B: Half-step drift
14. $y\leftarrow P_{\parallel}y+b_{\text{noisy}}$ $\triangleright$ Constraint: Maintain $P_{\perp}y=b_{\text{noisy}}$
15. end for
16. $\hat{Y}_{\tau^{*}}^{b}\leftarrow y$
17. Step 3: Guided Reverse Denoising
18. Evolve $\hat{Y}_{\tau^{*}}^{b}$ from $\tau=\tau^{*}$ to $T-t_{0}$ using the guided reverse SDE in Equation ([2.8](https://arxiv.org/html/2605.05387#S2.E8))
19. Return: Final conditional sample $\hat{z}$

![Refer to caption](https://arxiv.org/html/2605.05387v1/figs/crop_sample.png)

Input: Measurement $\bm{b}$ (Linear Constraint)

![Refer to caption](https://arxiv.org/html/2605.05387v1/figs/noisy_sample.png)

Step 1: Noisy Init $y^{\tau^{*}}$ (Forward Diffusion)

![Refer to caption](https://arxiv.org/html/2605.05387v1/figs/langavin_sample.png)

Step 2: Langevin State $\hat{Y}_{\tau^{*}}^{b}$ (BAOAB Tangent Mixing)

![Refer to caption](https://arxiv.org/html/2605.05387v1/figs/generated_sample.png)

Step 3: Final Sample $x_{0}$ (Generated Data)

Figure 2: Visual overview of the proposed sampling process. Step 1: diffuse the constrained input to the intermediate noise level $t^{*}$. Step 2: run projected BAOAB underdamped Langevin dynamics to mix along the affine constraint set while preserving the noisy normal level. Step 3: perform guided reverse denoising with exact normal correction to obtain the final sample.
## 4 Experiments

We evaluate the proposed Langevin-Conditioned Diffusion Model with BAOAB (LCDM-BAOAB) sampler on standard $256\times 256$ image inverse problems. LCDM-BAOAB uses the affine normal–tangent decomposition developed in the previous sections: it first performs projected BAOAB Langevin mixing in the tangent directions at an intermediate noise level, and then completes sampling by guided DDIM denoising with exact normal correction.

We test on three benchmarks: CelebA-HQ (Karras et al., [2018](https://arxiv.org/html/2605.05387#bib.bib15)), LSUN Church (Yu et al., [2015](https://arxiv.org/html/2605.05387#bib.bib29)), and ImageNet (Deng et al., [2009](https://arxiv.org/html/2605.05387#bib.bib4)). As the primary baseline, we use DDNM (Wang et al., [2022](https://arxiv.org/html/2605.05387#bib.bib27)), a strong zero-shot diffusion method for linear image inverse problems.

We compare with DDNM because it has been reported to meaningfully outperform earlier zero-shot conditional sampling and restoration methods for linear inverse problems. In particular, DDNM was introduced as a unified zero-shot framework for linear image restoration tasks such as super-resolution, inpainting, colorization, compressed sensing, and deblurring, and was shown to improve over prior zero-shot approaches including ILVR, RePaint, DDRM, and DPS (Choi et al., [2021](https://arxiv.org/html/2605.05387#bib.bib2); Lugmayr et al., [2022](https://arxiv.org/html/2605.05387#bib.bib20); Kawar et al., [2022](https://arxiv.org/html/2605.05387#bib.bib16); Wang et al., [2022](https://arxiv.org/html/2605.05387#bib.bib27); Chung et al., [2022](https://arxiv.org/html/2605.05387#bib.bib3)). Thus, DDNM provides a strong projection-based reference point for testing whether the additional tangent-space BAOAB Langevin initialization in LCDM-BAOAB yields measurable improvements under matched compute.

In Appendix [D](https://arxiv.org/html/2605.05387#A4), we show that, under the VP–DDPM $\varepsilon$-parameterization, the DDNM update is equivalent to using the effective score

$$\hat{s}_{t}^{\rm DDNM}(x_{t};y)=P_{\parallel}s_{t}(x_{t})+\frac{\alpha_{t}b-P_{\perp}x_{t}}{\sigma_{t}^{2}}.$$

Thus DDNM applies the analytic correction in the normal directions while retaining the pretrained unconditional score in the tangent directions. Our experiments therefore test whether explicitly mixing in the tangent directions through projected BAOAB Langevin dynamics improves over a strong zero-shot projection-based sampler.
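The effective score above splits cleanly into a tangent part and a normal part, which the following sketch makes explicit. It only evaluates the displayed formula and is not the DDNM codebase; the function name `ddnm_effective_score`, the toy projectors, and the numerical values of $\alpha_{t}$ and $\sigma_{t}$ are hypothetical.

```python
import numpy as np


def ddnm_effective_score(x_t, s_t, b, alpha_t, sigma_t, P_par, P_perp):
    """Tangent part from the unconditional score, normal part from the analytic correction."""
    return P_par @ s_t + (alpha_t * b - P_perp @ x_t) / sigma_t ** 2


# Toy setup (assumed): constraint x - y = 1 in R^2.
A = np.array([[1.0, -1.0]])
P_perp = A.T @ A / (A @ A.T).item()
P_par = np.eye(2) - P_perp
b = np.linalg.pinv(A) @ np.array([1.0])   # normal level in the range of P_perp

rng = np.random.default_rng(0)
x_t = rng.standard_normal(2)
s_t = rng.standard_normal(2)              # stands in for the network score s_t(x_t)
alpha_t, sigma_t = 0.8, 0.6               # hypothetical VP schedule values

s_eff = ddnm_effective_score(x_t, s_t, b, alpha_t, sigma_t, P_par, P_perp)
```

Projecting `s_eff` with `P_par` returns the unconditional tangent score unchanged, which is the component the Langevin phase is meant to improve upon.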

All experiments are performed in the zero-shot setting using pretrained $256\times 256$ diffusion backbones, with no task-specific fine-tuning. We report LPIPS and FID; lower values are better for both metrics.

All experiments are run under a matched budget of $100$ effective network function evaluations (NFEs). For DDNM, this corresponds to $100$ DDIM steps with $\eta=0.85$. For LCDM-BAOAB, the budget is split into $50$ projected BAOAB Langevin updates and $50$ guided DDIM denoising steps. In the $8\times$ super-resolution experiments, we use a $200$-point DDIM time grid and start LCDM-BAOAB from the $25\%$ point of this grid, so the guided reverse stage uses the final $50$ DDIM network evaluations. Thus all reported comparisons use the same total budget of $100$ NFEs. In the BAOAB phase, we cache and reuse the previous UNet output whenever possible, so that the effective number of score-network evaluations remains matched to DDNM.

The Langevin phase is introduced at a task-dependent discrete DDPM timestep, denoted $k_{\mathrm{mix}}$. This index refers to the implementation timestep of the pretrained DDPM sampler and should not be confused with the continuous safe-time parameter $t^{*}$ used in the theoretical analysis. For inpainting, we use $k_{\mathrm{mix}}=500$. For super-resolution, we use $k_{\mathrm{mix}}=250$, which corresponds to a later and higher-SNR point in the reverse trajectory. This choice reflects the different nature of the two inverse problems. In super-resolution, the main difficulty is recovering high-frequency detail from a heavily downsampled image, and tangent-space refinement is more stable once the iterate is closer to the data manifold. Therefore, for super-resolution, we perform BAOAB mixing later in the denoising trajectory than we do for masking tasks.

For super-resolution, we consider $8\times$ mean downsampling, mapping $32\times 32$ observations to $256\times 256$ images. Quantitative results are reported on $1000$ images per data set.

### 4.1 Inpainting Results

We first evaluate fixed\-mask inpainting\. The fixed mask is chosen differently across data sets\. On CelebA\-HQ, we mask a facial region, since reconstructing a semantically important part of a human face is substantially more challenging than filling an arbitrary patch\. On LSUN Church and ImageNet, we use the corresponding fixed square masks for those data sets\. This distinction is important when interpreting the CelebA\-HQ results, since the CelebA\-HQ mask targets a harder semantic completion problem\.

Table [1](https://arxiv.org/html/2605.05387#S4.T1) shows that LCDM-BAOAB consistently improves over DDNM on all three data sets. The improvement is modest on CelebA-HQ, but becomes larger on LSUN Church and especially on ImageNet. This trend is consistent with our hypothesis: as the data distribution becomes more diverse and the tangent space becomes more semantically ambiguous, projection-only guidance is more susceptible to tangent-space bias, and explicit tangent mixing becomes more beneficial.

Table 1: Fixed-mask inpainting results on $256\times 256$ benchmarks. Metrics are LPIPS $\downarrow$ and FID $\downarrow$.

To test whether the inpainting improvement persists beyond a single fixed corruption pattern, we also evaluate random-mask inpainting on $1000$ images per data set. For each image, we remove a $100\times 100$ square patch sampled at a random location. The random mask is tied deterministically to the image identity, so DDNM and LCDM-BAOAB are evaluated on exactly the same corrupted input for each image. This gives a paired comparison and removes any ambiguity about whether differences are caused by the sampler or by different mask locations.
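The paper does not say how the mask is tied to the image identity. One possible scheme, shown here purely as an assumption, seeds a random generator from a hash of a hypothetical image identifier string; the function names and the identifier `"img_0001"` are illustrative. For inpainting, the normal projector keeps the observed pixels and the tangent projector keeps the missing ones.

```python
import hashlib

import numpy as np


def mask_for_image(image_id: str, size: int = 256, patch: int = 100) -> np.ndarray:
    """Boolean mask of observed pixels with one square hole whose location depends only on image_id."""
    digest = hashlib.sha256(image_id.encode("utf-8")).digest()
    rng = np.random.default_rng(int.from_bytes(digest[:8], "big"))
    top = int(rng.integers(0, size - patch + 1))
    left = int(rng.integers(0, size - patch + 1))
    observed = np.ones((size, size), dtype=bool)
    observed[top:top + patch, left:left + patch] = False
    return observed


def project_normal(x: np.ndarray, observed: np.ndarray) -> np.ndarray:
    """P_perp for inpainting: keep observed pixels, zero out the hole."""
    return np.where(observed, x, 0.0)


def project_tangent(x: np.ndarray, observed: np.ndarray) -> np.ndarray:
    """P_par for inpainting: keep the hole, zero out observed pixels."""
    return np.where(observed, 0.0, x)


mask_a = mask_for_image("img_0001")
mask_b = mask_for_image("img_0001")     # same identifier, so the same mask for both samplers
x = np.random.default_rng(0).standard_normal((256, 256))
```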

Table [2](https://arxiv.org/html/2605.05387#S4.T2) shows that LCDM-BAOAB again improves over DDNM across all three data sets. The gains are smallest on CelebA-HQ, larger on LSUN Church, and largest on ImageNet. On ImageNet, LCDM-BAOAB improves FID from $29.00$ to $20.91$ and LPIPS from $0.1182$ to $0.0933$. These results show that tangent-space BAOAB mixing remains beneficial even when the missing region varies across images.

Table 2: Random-mask inpainting results on $1000$ images per data set. For each image, a $100\times 100$ square mask is sampled at a random location and shared across methods. Metrics are LPIPS $\downarrow$ and FID $\downarrow$.

![Refer to caption](https://arxiv.org/html/2605.05387v1/figs/Real_image.png)

(a) Real image

![Refer to caption](https://arxiv.org/html/2605.05387v1/figs/biased_DDNM.png)

(b) DDNM

![Refer to caption](https://arxiv.org/html/2605.05387v1/figs/lcdm_unbiased.png)

(c) LCDM-BAOAB

![Refer to caption](https://arxiv.org/html/2605.05387v1/figs/real_crop_imagnet.jpg)

(d) Real image

![Refer to caption](https://arxiv.org/html/2605.05387v1/figs/ddnm_crop_imagenet.jpg)

(e) DDNM

![Refer to caption](https://arxiv.org/html/2605.05387v1/figs/lcdm_imagenet.jpg)

(f) LCDM-BAOAB

Figure 3: Visual comparison for inpainting. DDNM can produce texture artifacts or semantically inconsistent completions, especially on ImageNet. LCDM-BAOAB produces cleaner and more coherent reconstructions while preserving measurement consistency.
### 4.2 Super-Resolution Results

We next evaluate $8\times$ super-resolution, where the observation is obtained by mean downsampling a $256\times 256$ image to $32\times 32$. This inverse problem is substantially more ill-posed than inpainting because most high-frequency information is removed by the forward operator. Consequently, the conditional distribution contains a large tangent-space ambiguity: many high-resolution images are consistent with the same low-resolution observation.
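For mean downsampling, the normal and tangent projectors have a simple form that avoids building any matrix. In the sketch below (single-channel images and the helper names `downsample` and `upsample` are assumptions for illustration), $A$ is block averaging, its pseudoinverse $A^{+}$ is pixel replication, and so $P_{\perp}x=A^{+}Ax$ replaces every $8\times 8$ block by its mean while $P_{\parallel}x=x-P_{\perp}x$ holds the zero-mean detail inside each block.

```python
import numpy as np


def downsample(x: np.ndarray, f: int = 8) -> np.ndarray:
    """Mean-pool an (H, W) image by a factor f; plays the role of the operator A."""
    h, w = x.shape
    return x.reshape(h // f, f, w // f, f).mean(axis=(1, 3))


def upsample(y: np.ndarray, f: int = 8) -> np.ndarray:
    """Replicate each pixel into an f x f block; equals the pseudoinverse of mean pooling."""
    return np.repeat(np.repeat(y, f, axis=0), f, axis=1)


rng = np.random.default_rng(0)
x = rng.standard_normal((256, 256))   # stands in for a high-resolution image

low = downsample(x)                   # 32 x 32 observation
x_perp = upsample(low)                # normal component P_perp x
x_par = x - x_perp                    # tangent component P_par x, invisible to A
```

Only $32\cdot 32=1024$ of the $256\cdot 256=65536$ degrees of freedom per channel are fixed by the observation, which is one way to see why the tangent ambiguity is so large for this task.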

For this task, we introduce the BAOAB Langevin phase at the discrete DDPM timestep $k_{\mathrm{mix}}=250$, later in the reverse trajectory than in the inpainting experiments. This higher-SNR starting point makes tangent refinement more stable and allows the sampler to recover fine-scale structure after the coarse image content has already been established.

Table [3](https://arxiv.org/html/2605.05387#S4.T3) shows that LCDM-BAOAB consistently improves over DDNM on all three data sets. The gains are largest on ImageNet, where the conditional ambiguity is strongest, and remain substantial on LSUN Church. These results support the central claim of the paper: enforcing the measurement in the normal directions is not sufficient for highly ill-posed linear inverse problems; additional tangent-space mixing can substantially improve perceptual and distributional quality.

Table 3: $8\times$ super-resolution results on $256\times 256$ benchmarks. Metrics are LPIPS $\downarrow$ and FID $\downarrow$.

![Refer to caption](https://arxiv.org/html/2605.05387v1/figs/real_super_res.jpg)

(a) Real image

![Refer to caption](https://arxiv.org/html/2605.05387v1/figs/ddnm_super_res.jpg)

(b) DDNM, $8\times$

![Refer to caption](https://arxiv.org/html/2605.05387v1/figs/lcdm_super_res.jpg)

(c) LCDM-BAOAB, $8\times$

Figure 4: $8\times$ super-resolution on ImageNet. DDNM tends to produce blurred or structurally inconsistent outputs, while LCDM-BAOAB recovers sharper edges and more realistic high-frequency detail.

## 5 Error Decomposition and Average KL Bounds

We now state quantitative guarantees for the discrepancy between the ideal conditional sampler and our practical procedure. The key idea is to align the analysis with the algorithmic structure: starting from a safe noise level $t^{*}$, our method (i) approximately initializes the reverse process at time $t^{*}$ using the two-stage normal/tangent construction, and (ii) then evolves via a guided reverse SDE whose normal drift is exact but whose tangent drift uses the unconditional score. Accordingly, the results below separate the total error into an *initialization* term at time $t^{*}$ and a *pathwise* term accumulated during the reverse evolution.

The first theorem bounds the pathwise KL divergence between the true conditional and surrogate reverse-time *path measures* in terms of conditional mutual information between tangent and normal components. We then introduce assumptions that control how the tangent conditional marginal varies with the level $b$, and use them to bound the average initialization error. Combining these ingredients yields an average terminal KL bound for the tangent marginal of the generated sample, together with sharper consequences under additional separation conditions on the admissible levels.

The error of our conditional sampling procedure has two conceptual contributions:

- (i) a *pathwise* error from using the unconditional tangent score in the guided reverse SDE instead of the true conditional tangent score;
- (ii) an *initialization* error at time $t^{*}$, because we only approximately sample from the true conditional marginal $\operatorname{Law}(X_{t^{*}}\mid P_{\perp}Z=b)$ using the two-stage procedure described above.

The next theorem controls the pathwise error in terms of conditional mutual information between tangent and normal components. To prove it, we only need a second moment bound on $Z$.

###### Assumption 5\.1

For $Z\sim p_{0}$, we have $\mathbb{E}\|Z\|^{2}<\infty$.

###### Theorem 4

Let Assumption [5.1](https://arxiv.org/html/2605.05387#S5.Thmassumption1) be in force. Fix $0\leq\tau^{*}<T-t_{0}$. For each level $b$, consider on $[\tau^{*},T-t_{0}]$ the ideal conditional reverse SDE ([2.7](https://arxiv.org/html/2605.05387#S2.E7)) and the surrogate constrained reverse SDE ([2.8](https://arxiv.org/html/2605.05387#S2.E8)), started from the same initial law at time $\tau^{*}$:

$$\begin{aligned}\mathrm{d}Y_{\tau}^{*,b}&=\Big(P_{\parallel}s_{T-\tau}^{*,b}(Y_{\tau}^{*,b})+\frac{1}{T-\tau}P_{\perp}(b-Y_{\tau}^{*,b})\Big)\,\mathrm{d}\tau+\mathrm{d}\bar{W}_{\tau},\\ \mathrm{d}\hat{Y}_{\tau}^{\,b}&=\Big(P_{\parallel}s_{T-\tau}(\hat{Y}_{\tau}^{\,b})+\frac{1}{T-\tau}P_{\perp}(b-\hat{Y}_{\tau}^{\,b})\Big)\,\mathrm{d}\tau+\mathrm{d}\bar{W}_{\tau}.\end{aligned}$$

Let $\mathbb{P}^{Y^{*,b}}$ and $\mathbb{P}^{\hat{Y}^{\,b}}$ denote the corresponding path measures on $[\tau^{*},T-t_{0}]$. Decompose the clean signal as

$$Z^{\parallel}:=P_{\parallel}Z,\qquad Z^{\perp}:=P_{\perp}Z.$$

Then

$$\mathbb{E}_{B}\Big[\mathrm{KL}\big(\mathbb{P}^{Y^{*,B}}\,\|\,\mathbb{P}^{\hat{Y}^{\,B}}\big)\Big]\leq I\big(Z^{\parallel};Z^{\perp}\mid X_{t^{*}}\big).$$

Moreover,

$$\mathbb{E}_{B}\Big[\mathrm{KL}\big(\mathbb{P}^{Y^{*,B}}\,\|\,\mathbb{P}^{\hat{Y}^{\,B}}\big)\Big]\geq I\big(Z^{\parallel};Z^{\perp}\mid X_{t^{*}}\big)-I\big(Z^{\parallel};Z^{\perp}\mid X_{t^{*}}^{\perp}\big)-I\big(Z^{\parallel};Z^{\perp}\mid X_{t_{0}}\big).$$
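To get a feel for the upper bound, the conditional mutual information can be computed in closed form in a simple Gaussian case. The sketch below is an illustration outside the paper's experiments: it assumes a two-dimensional Gaussian clean signal with correlation `rho` between the tangent coordinate (first axis) and the normal coordinate (second axis), and the function name `gaussian_conditional_mi` is hypothetical. Given $X_{t}=Z+W_{t}$, the posterior of $Z$ is Gaussian with covariance $(\Sigma^{-1}+I/t)^{-1}$, and the mutual information of a bivariate Gaussian with correlation $\rho_{\mathrm{post}}$ is $-\tfrac12\log(1-\rho_{\mathrm{post}}^{2})$.

```python
import numpy as np


def gaussian_conditional_mi(rho: float, t: float) -> float:
    """I(Z_par; Z_perp | X_t) in nats for Z ~ N(0, [[1, rho], [rho, 1]]) and X_t = Z + N(0, t I)."""
    prior_cov = np.array([[1.0, rho], [rho, 1.0]])
    post_cov = np.linalg.inv(np.linalg.inv(prior_cov) + np.eye(2) / t)
    rho_post = post_cov[0, 1] / np.sqrt(post_cov[0, 0] * post_cov[1, 1])
    return float(-0.5 * np.log(1.0 - rho_post ** 2))


rho = 0.8
noise_levels = [0.01, 0.25, 1.0, 10.0, 1e4]
mi_values = [gaussian_conditional_mi(rho, t) for t in noise_levels]
mi_unconditional = -0.5 * np.log(1.0 - rho ** 2)   # limit as t grows without bound
```

In this Gaussian case the bound increases with the noise level and approaches the unconditional mutual information at large $t$, so a smaller $t^{*}$ gives a smaller pathwise bound.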

Theorem [4](https://arxiv.org/html/2605.05387#Thmtheorem4) is stated for a general clean signal $Z$, and shows that the pathwise error is controlled by the conditional mutual information

$$I\big(Z^{\parallel};Z^{\perp}\mid X_{t^{*}}\big).$$

To make this quantity more concrete, we now specialize to a latent Gaussian-mixture model. This specialization is motivated by modern discrete-latent generative models for images, in which the observed image can be viewed as a structured latent code decoded into pixel space up to a small reconstruction error. Under this model, $Z$ is a Gaussian perturbation of a discrete latent variable $S$, and the dependence term in Theorem [4](https://arxiv.org/html/2605.05387#Thmtheorem4) can be compared to the corresponding latent dependence

$$I\big(S^{\parallel};S^{\perp}\mid X_{t^{*}}\big).$$

The next assumption and proposition formalize this reduction.

###### Assumption 5\.2

The clean signal $Z\in\mathbb{R}^{d}$ admits the representation

$$Z=S+\varepsilon N,$$

where $S$ is a discrete random vector taking values in a countable set $\mathcal{C}\subset\mathbb{R}^{d}$, $N\sim\mathcal{N}(0,I_{d})$, $N$ is independent of $S$, and $\varepsilon>0$. Equivalently, $Z$ follows a countable Gaussian mixture distribution whose components have means in $\mathcal{C}$ and common covariance $\varepsilon^{2}I_{d}$.

We write

$$S^{\parallel}:=P_{\parallel}S,\qquad S^{\perp}:=P_{\perp}S,$$

and

$$Z^{\parallel}:=P_{\parallel}Z,\qquad Z^{\perp}:=P_{\perp}Z.$$

In addition, we assume that the projected latent normal code has finite entropy,

$$H(S^{\perp})<\infty.$$

Whenever Rényi-entropy bounds are invoked, we further assume that the order-$1/2$ Rényi entropy is finite:

$$H_{1/2}(S^{\perp})<\infty.$$
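For a finite code set both entropies are easy to evaluate. The sketch below uses natural logarithms (an assumption about units) and hypothetical function names. As an example it takes the normal-code distribution induced by the three-point toy prior of Section 2 with $A=[1\ \ {-1}]$: the atoms $(1,1)$ and $(-1,-1)$ share the normal code $0$ with total mass $0.25$, and $(0,5)$ has its own code with mass $0.75$.

```python
import numpy as np


def shannon_entropy(p) -> float:
    """H(p) = -sum_i p_i log p_i, in nats, ignoring zero-probability entries."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))


def renyi_half_entropy(p) -> float:
    """Order-1/2 Renyi entropy, H_{1/2}(p) = 2 log sum_i sqrt(p_i), in nats."""
    p = np.asarray(p, dtype=float)
    return float(2.0 * np.log(np.sum(np.sqrt(p))))


normal_code_probs = [0.25, 0.75]            # toy example: two distinct normal codes
H = shannon_entropy(normal_code_probs)
H_half = renyi_half_entropy(normal_code_probs)

uniform = [0.25] * 4                        # both entropies equal log 4 for a uniform law
```

The order-$1/2$ Rényi entropy is never smaller than the Shannon entropy, so requiring it to be finite is the stronger of the two conditions.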

###### Proposition 6

Let Assumption [5.2](https://arxiv.org/html/2605.05387#S5.Thmassumption2) be in force. Then, for every $t\geq 0$,

$$I\big(Z^{\parallel};Z^{\perp}\mid X_{t}\big)\leq I\big(S^{\parallel};S^{\perp}\mid X_{t}\big).$$

In particular, at the safe time $t^{*}$,

$$I\big(Z^{\parallel};Z^{\perp}\mid X_{t^{*}}\big)\leq I\big(S^{\parallel};S^{\perp}\mid X_{t^{*}}\big).$$

Proof. Fix $t\geq 0$. Under Assumption [5.2](https://arxiv.org/html/2605.05387#S5.Thmassumption2),

$$Z=S+\varepsilon N,\qquad X_{t}=Z+W_{t}=S+\varepsilon N+W_{t},$$

where $N$ and $W_{t}$ are independent standard Gaussian noises. Since $P_{\parallel}$ and $P_{\perp}$ are orthogonal projections, the tangent and normal noise components are independent. Hence, for every regular conditional law given $X_{t}=x$,

$$\operatorname{Law}\big(Z^{\parallel},Z^{\perp}\mid S^{\parallel},S^{\perp},X_{t}=x\big)=\operatorname{Law}\big(Z^{\parallel}\mid S^{\parallel},X_{t}^{\parallel}=x^{\parallel}\big)\otimes\operatorname{Law}\big(Z^{\perp}\mid S^{\perp},X_{t}^{\perp}=x^{\perp}\big).$$

Thus, conditionally on $X_{t}=x$, the pair $(Z^{\parallel},Z^{\perp})$ is obtained from $(S^{\parallel},S^{\perp})$ by applying two separate conditionally independent channels: one from $S^{\parallel}$ to $Z^{\parallel}$, and one from $S^{\perp}$ to $Z^{\perp}$. Therefore, by the data-processing inequality for mutual information under product channels,

$$I_{\operatorname{Law}(\cdot\mid X_{t}=x)}\big(Z^{\parallel};Z^{\perp}\big)\leq I_{\operatorname{Law}(\cdot\mid X_{t}=x)}\big(S^{\parallel};S^{\perp}\big)$$

for $X_{t}$-almost every $x$. Integrating this inequality with respect to the law of $X_{t}$ gives

$$I\big(Z^{\parallel};Z^{\perp}\mid X_{t}\big)\leq I\big(S^{\parallel};S^{\perp}\mid X_{t}\big).$$

The statement at $t=t^{*}$ is the same inequality evaluated at the safe time.

We now formalize the assumptions needed to control the initialization error at time $t^{*}$ and to express the pathwise term in latent information-theoretic form.

###### Assumption 5\.3

Define

$$\mathcal{C}^{\perp}:=P_{\perp}\mathcal{C}.$$

For each $t\geq t_{0}$ and $c\in\mathcal{C}^{\perp}$, let

$$r_{t}^{c}:=\operatorname{Law}(X_{t}^{\parallel}\mid S^{\perp}=c).$$

We assume that for every $t\geq t_{0}$ there exists a finite constant $L_{t}<\infty$ such that for all $c_{1},c_{2}\in\mathcal{C}^{\perp}$,

$$\mathrm{KL}\left(r_{t}^{c_{1}}\,\middle\|\,r_{t}^{c_{2}}\right)\leq L_{t}\,\|c_{1}-c_{2}\|_{2}^{2}.\tag{5.1}$$
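As an illustrative special case (an assumption made here for concreteness, not a setting taken from the paper), suppose the tangent code is a deterministic function of the normal code, $S^{\parallel}=g(S^{\perp})$, where the hypothetical map $g$ is $\kappa$-Lipschitz on $\mathcal{C}^{\perp}$. Since $X_{t}=S+\varepsilon N+W_{t}$ with $W_{t}\sim\mathcal{N}(0,tI_{d})$, each $r_{t}^{c}$ is then a Gaussian on the tangent subspace with mean $g(c)$ and covariance $(t+\varepsilon^{2})P_{\parallel}$, and the KL divergence between two Gaussians with a common covariance gives

$$\mathrm{KL}\left(r_{t}^{c_{1}}\,\middle\|\,r_{t}^{c_{2}}\right)=\frac{\|g(c_{1})-g(c_{2})\|_{2}^{2}}{2(t+\varepsilon^{2})}\leq\frac{\kappa^{2}}{2(t+\varepsilon^{2})}\,\|c_{1}-c_{2}\|_{2}^{2}.$$

In this case (5.1) holds with $L_{t}=\kappa^{2}/\big(2(t+\varepsilon^{2})\big)$, and the product $L_{t}(t+\varepsilon^{2})=\kappa^{2}/2$ does not depend on $t$.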

We now state our main quantitative guarantee for the *terminal tangent marginal*. Recall that the ideal conditional reverse-time dynamics $\{Y_{\tau}^{*,b}\}_{\tau\in[\tau^{*},\,T-t_{0}]}$, initialized from the true conditional marginal at time $\tau^{*}$, and the practical surrogate procedure $\{\hat{Y}_{\tau}^{\,b}\}_{\tau\in[\tau^{*},\,T-t_{0}]}$, obtained by the two-stage initialization at time $t^{*}$ followed by the surrogate guided reverse dynamics, induce terminal tangent laws

$$\mu_{T-t_{0}}^{*,b}:=\operatorname{Law}\big(P_{\parallel}Y_{T-t_{0}}^{*,b}\big),\qquad\hat{\mu}_{T-t_{0}}^{\,b}:=\operatorname{Law}\big(P_{\parallel}\hat{Y}_{T-t_{0}}^{\,b}\big).$$

Our goal is to bound the averaged terminal discrepancy

$$\mathbb{E}_{B}\left[\mathrm{KL}\big(\mu_{T-t_{0}}^{*,B}\,\big\|\,\hat{\mu}_{T-t_{0}}^{\,B}\big)\right].$$

###### Theorem 9

Let Assumptions[5\.1](https://arxiv.org/html/2605.05387#S5.Thmassumption1),[5\.2](https://arxiv.org/html/2605.05387#S5.Thmassumption2), and[5\.3](https://arxiv.org/html/2605.05387#S5.Thmassumption3)be in force\. Let

$$H:=H(S^{\perp}),$$

and assume $H<\infty$. Fix a safe noise level $t^{*}\in(t_{0},T)$ (equivalently, $\tau^{*}=T-t^{*}$). Then

$$\mathbb{E}_{B}\left[\mathrm{KL}\big(\mu_{T-t_{0}}^{*,B}\,\big\|\,\hat{\mu}_{T-t_{0}}^{\,B}\big)\right]\leq 4L_{t^{*}}(t^{*}+\varepsilon^{2})\,H+I\big(S^{\parallel};\,S^{\perp}\mid X_{t^{*}}\big).\tag{5.2}$$

Theorem [9](https://arxiv.org/html/2605.05387#Thmtheorem9) applies without any geometric separation assumption on the set of admissible latent normal codes $\mathcal{C}^{\perp}$. In that general case, the noisy normal observation may remain ambiguous among several nearby latent codes. We now show that, if the admissible codes are uniformly separated, then such confusions become rare and both the initialization and pathwise contributions become exponentially small.

###### Assumption 5\.4

There exists $\delta>0$ such that for all distinct $c,\tilde{c}\in\mathcal{C}^{\perp}$,

$$\|c-\tilde{c}\|_{2}\geq\delta.$$

Assumption [5.4](https://arxiv.org/html/2605.05387#S5.Thmassumption4) enforces a minimum spacing between admissible latent normal codes. Since

$$X_{t^{*}}^{\perp}=S^{\perp}+\sqrt{t^{*}+\varepsilon^{2}}\,G,\qquad G\sim\mathcal{N}(0,I_{d}),$$

confusing the true latent code $c$ with a different code $\tilde{c}$ requires a Gaussian fluctuation of order at least $\delta$, which occurs with probability $\exp(-\Omega(\delta^{2}/(t^{*}+\varepsilon^{2})))$. This separation upgrades the Shannon-scale control above to exponentially small error bounds.
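The exponential rate can be made concrete in one dimension. The sketch below is illustrative only and rests on assumptions that are not taken from the paper: a scalar normal component, nearest-code decoding (so a confusion needs a fluctuation of at least $\delta/2$), and the hypothetical helper name `confusion_tail`. It compares the exact two-sided Gaussian tail with the standard bound $\operatorname{erfc}(x)\leq e^{-x^{2}}$, which gives the rate $\exp(-\delta^{2}/(8\sigma_{*}^{2}))$.

```python
import math

def confusion_tail(delta, sigma2):
    """Return (exact, bound) for P(|sigma * G| >= delta / 2), G ~ N(0, 1).

    Assumption of this sketch: 1-D codes with nearest-code decoding, so a
    confusion needs a fluctuation of at least half the spacing delta.
    """
    a = delta / (2.0 * math.sqrt(sigma2))           # threshold in standard deviations
    exact = math.erfc(a / math.sqrt(2.0))           # P(|G| >= a)
    bound = math.exp(-delta ** 2 / (8.0 * sigma2))  # exp(-a^2 / 2)
    return exact, bound

if __name__ == "__main__":
    for delta in (0.5, 1.0, 2.0, 4.0):
        exact, bound = confusion_tail(delta, sigma2=0.25)
        print(f"delta={delta:4.1f}  exact={exact:.3e}  bound={bound:.3e}")
```

Increasing `delta` at fixed `sigma2` drives both columns to zero, with the exact tail always below the exponential bound.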

###### Theorem 11

Let Assumptions[5\.1](https://arxiv.org/html/2605.05387#S5.Thmassumption1),[5\.2](https://arxiv.org/html/2605.05387#S5.Thmassumption2),[5\.3](https://arxiv.org/html/2605.05387#S5.Thmassumption3), and[5\.4](https://arxiv.org/html/2605.05387#S5.Thmassumption4)be in force\. Let

$$H_{1/2}:=H_{1/2}(S^{\perp})=2\log\sum_{c\in\mathcal{C}^{\perp}}\sqrt{p_{S^{\perp}}(c)},\qquad\sigma_{*}^{2}:=t^{*}+\varepsilon^{2},$$

and fix a safe noise level $t^{*}\in(t_{0},T)$. Then

$$\begin{aligned}\mathbb{E}_{B}\Big[\mathrm{KL}\big(p_{t^{*}}^{*,B}\,\big\|\,\hat{p}_{t^{*}}^{\,B}\big)+\mathrm{KL}\big(\mathbb{P}^{Y^{*,B}}\,\big\|\,\mathbb{P}^{\hat{Y}^{\,B}}\big)\Big]\leq\;&L_{t^{*}}\Big(\frac{\delta^{2}}{2}+4\sigma_{*}^{2}\Big)\exp\Big(H_{1/2}-\frac{\delta^{2}}{8\sigma_{*}^{2}}\Big)\\&+2\exp\Big(H_{1/2}-\frac{\delta^{2}}{8\sigma_{*}^{2}}\Big).\end{aligned}\tag{5.3}$$

Consequently, by data processing,

$$\mathbb{E}_{B}\Big[\mathrm{KL}\big(\mu_{T-t_{0}}^{*,B}\,\big\|\,\hat{\mu}_{T-t_{0}}^{\,B}\big)\Big]\leq\Big[L_{t^{*}}\Big(\frac{\delta^{2}}{2}+4\sigma_{*}^{2}\Big)+2\Big]\exp\Big(H_{1/2}-\frac{\delta^{2}}{8\sigma_{*}^{2}}\Big).\tag{5.4}$$
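To get a feel for the size of the right-hand side of (5.4), it can be evaluated numerically. The following is a minimal sketch under stated assumptions: the code distribution is passed as a finite list of probabilities, the function names `renyi_half_entropy` and `separated_bound` are hypothetical, and the numerical values of $L_{t^{*}}$, $\delta$ and $\sigma_{*}^{2}$ are placeholders rather than quantities reported in the paper.

```python
import math

def renyi_half_entropy(pmf):
    """H_{1/2}(p) = 2 * log(sum_c sqrt(p(c))), in nats."""
    return 2.0 * math.log(sum(math.sqrt(p) for p in pmf if p > 0.0))

def separated_bound(pmf, lipschitz, delta, sigma2):
    """Right-hand side of (5.4) for a finite code distribution.

    `lipschitz` plays the role of L_{t*} and `sigma2` of sigma_*^2;
    both are illustrative inputs, not values from the paper.
    """
    h_half = renyi_half_entropy(pmf)
    prefactor = lipschitz * (delta ** 2 / 2.0 + 4.0 * sigma2) + 2.0
    return prefactor * math.exp(h_half - delta ** 2 / (8.0 * sigma2))

if __name__ == "__main__":
    uniform = [1.0 / 16] * 16        # for a uniform law on K codes, H_{1/2} = log K
    for delta in (2.0, 4.0, 8.0):
        print(delta, separated_bound(uniform, lipschitz=1.0, delta=delta, sigma2=0.25))
```

The polynomial prefactor grows with `delta`, but the exponential factor dominates, so the bound becomes informative once $\delta^{2}/(8\sigma_{*}^{2})$ exceeds $H_{1/2}$ by a comfortable margin.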

Acknowledgments and Disclosure of Funding

Funding in direct support of this work: none\. Competing interests and additional revenues related to this work: the authors declare no competing interests\.

## Appendix AProof of Theorem[4](https://arxiv.org/html/2605.05387#Thmtheorem4)

We compare the true conditional reverse-time dynamics and the surrogate guided dynamics at the level of path measures. Since the two SDEs have the same diffusion coefficient and differ only in the tangent drift, the first task is to justify a Girsanov formula for their relative entropy. For this, in Lemma [13](https://arxiv.org/html/2605.05387#Thmtheorem13) we first prove that the posterior mean $x\mapsto\mathbb{E}[Z\mid X_{t}=x]$ has at most linear growth; via Tweedie’s formula, this implies the required linear-growth control on the two drifts. Then Lemma [14](https://arxiv.org/html/2605.05387#Thmtheorem14) converts the pathwise KL divergence into an integral of the squared drift gap along the true conditional path.

The next step is to rewrite this drift gap in statistical terms. Using Tweedie’s identity and projecting onto the tangent space, the drift difference becomes the difference between two posterior means of the tangent component $U=P_{\parallel}Z$: one conditioned on the noisy observation alone, and one conditioned on the noisy observation together with the normal component $B=P_{\perp}Z$. Averaging over the random level $B$ and applying the MMSE-gap identity turns the pathwise KL bound into an integral of conditional MMSE differences. The conditional I–MMSE lemma is then used to identify this integral with a difference of conditional mutual informations, yielding the upper bound in terms of $I(U;B\mid X_{t^{*}})$.

For the lower bound, the same MMSE representation is kept on the finite interval corresponding to $t\in[t_{0},t^{*}]$. One then isolates the error term involving the MMSE of the normal component $B$, and controls it by projecting the observation onto the normal subspace. The key observation is that, conditional on $U$, the parallel observation carries no information about $B$, so the relevant MMSE gap can be reduced to a Gaussian channel only in the normal directions. Applying the conditional I–MMSE identity once more to this reduced channel yields the correction term involving $I(U;B\mid X_{t^{*}}^{\perp})$, and this gives the stated lower bound.

###### Lemma 13

Let Assumption [5.1](https://arxiv.org/html/2605.05387#S5.Thmassumption1) be in force. Fix $t\geq 0$ and set

$$m_{t}(x):=\mathbb{E}[Z\mid X_{t}=x],\qquad x\in\mathbb{R}^{d}.$$

Then there exists a constant $C_{t}<\infty$ such that

$$|m_{t}(x)|\leq C_{t}(1+|x|),\qquad x\in\mathbb{R}^{d}.$$

In particular, the posterior mean $x\mapsto\mathbb{E}[Z\mid X_{t}=x]$ has at most linear growth.

Proof. Let $\mu:=\mathrm{Law}(Z)$, and let

$$\phi_{t}(u):=(2\pi t)^{-d/2}\exp\left(-\frac{|u|^{2}}{2t}\right),\qquad u\in\mathbb{R}^{d},$$

be the Gaussian kernel with covariance matrix $tI_{d}$. The law of $X_{t}$ admits density

$$p_{t}(x)=\int_{\mathbb{R}^{d}}\phi_{t}(x-z)\,\mu(dz),$$

and the conditional mean is given by

$$m_{t}(x)=\frac{\int_{\mathbb{R}^{d}}z\,\phi_{t}(x-z)\,\mu(dz)}{\int_{\mathbb{R}^{d}}\phi_{t}(x-z)\,\mu(dz)}.$$

Set

$$N(x):=\int_{\mathbb{R}^{d}}z\,\phi_{t}(x-z)\,\mu(dz),\qquad D(x):=\int_{\mathbb{R}^{d}}\phi_{t}(x-z)\,\mu(dz)=p_{t}(x).$$

Then

$$m_{t}(x)=\frac{N(x)}{D(x)}.$$

We shall prove that

$$|N(x)|\leq C_{t}(1+|x|)D(x),\qquad x\in\mathbb{R}^{d}.$$

Choose $R>0$ such that

$$a:=\mu(B(0,R))>0.$$

This is possible since $\mu$ is a probability measure. Fix $x\in\mathbb{R}^{d}$. We split the numerator into a near part and a far part:

$$N(x)=N_{1}(x)+N_{2}(x),$$

where

$$N_{1}(x):=\int_{\{|z|\leq 4(|x|+R)\}}z\,\phi_{t}(x-z)\,\mu(dz),$$

and

$$N_{2}(x):=\int_{\{|z|>4(|x|+R)\}}z\,\phi_{t}(x-z)\,\mu(dz).$$

On the set $\{|z|\leq 4(|x|+R)\}$ one has $|z|\leq 4(|x|+R)$, and therefore

$$|N_{1}(x)|\leq\int_{\{|z|\leq 4(|x|+R)\}}|z|\,\phi_{t}(x-z)\,\mu(dz)\leq 4(|x|+R)D(x).$$

Let $z\in\mathbb{R}^{d}$ satisfy $|z|>4(|x|+R)$, and let $u\in B(0,R)$. Then

$$|x-u|\leq|x|+|u|\leq|x|+R<\frac{|z|}{4},$$

while

$$|x-z|\geq|z|-|x|>|z|-\frac{|z|}{4}=\frac{3}{4}|z|.$$

Hence

$$|x-z|^{2}-|x-u|^{2}\geq\frac{9}{16}|z|^{2}-\frac{1}{16}|z|^{2}=\frac{1}{2}|z|^{2}.$$

Consequently,

$$\frac{\phi_{t}(x-z)}{\phi_{t}(x-u)}=\exp\left(-\frac{|x-z|^{2}-|x-u|^{2}}{2t}\right)\leq\exp\left(-\frac{|z|^{2}}{4t}\right).$$

Thus,

$$\phi_{t}(x-z)\leq e^{-|z|^{2}/(4t)}\,\phi_{t}(x-u),\qquad u\in B(0,R).$$

Integrating this inequality with respect to $\mu(du)$ over $B(0,R)$ gives

$$a\,\phi_{t}(x-z)\leq e^{-|z|^{2}/(4t)}\int_{B(0,R)}\phi_{t}(x-u)\,\mu(du)\leq e^{-|z|^{2}/(4t)}D(x),$$

and therefore

$$\phi_{t}(x-z)\leq a^{-1}e^{-|z|^{2}/(4t)}D(x).$$

Using this bound, we obtain

$$\begin{aligned}|N_{2}(x)|&\leq\int_{\{|z|>4(|x|+R)\}}|z|\,\phi_{t}(x-z)\,\mu(dz)\\&\leq a^{-1}D(x)\int_{\mathbb{R}^{d}}|z|e^{-|z|^{2}/(4t)}\,\mu(dz).\end{aligned}$$

Since the function $r\mapsto re^{-r^{2}/(4t)}$ is bounded on $[0,\infty)$, the quantity

$$C_{t,1}:=a^{-1}\int_{\mathbb{R}^{d}}|z|e^{-|z|^{2}/(4t)}\,\mu(dz)$$

is finite. Hence

$$|N_{2}(x)|\leq C_{t,1}D(x).$$

Combining the bounds for $N_{1}(x)$ and $N_{2}(x)$, we get

$$|N(x)|\leq\bigl(4(|x|+R)+C_{t,1}\bigr)D(x).$$

Dividing by $D(x)>0$, we conclude that

$$|m_{t}(x)|=\left|\frac{N(x)}{D(x)}\right|\leq 4|x|+4R+C_{t,1}.$$

Therefore there exists a finite constant $C_{t}$ (for instance $C_{t}:=\max\{4,\,4R+C_{t,1}\}$) such that

$$|m_{t}(x)|\leq C_{t}(1+|x|),\qquad x\in\mathbb{R}^{d}.$$

This completes the proof.
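The explicit constant produced by the proof, $4|x|+4R+C_{t,1}$, can be checked numerically on a toy prior. The sketch below is only a sanity check under assumptions of its own: a one-dimensional prior with finitely many atoms (chosen arbitrarily), a closed ball for $B(0,R)$, and hypothetical function names. It computes the posterior mean with a log-sum-exp style normalization and compares it with the bound.

```python
import math

def posterior_mean(x, atoms, weights, t):
    """m_t(x) = E[Z | X_t = x] for X_t = Z + sqrt(t) * G with a discrete 1-D prior."""
    logs = [math.log(w) - (x - z) ** 2 / (2.0 * t) for z, w in zip(atoms, weights)]
    top = max(logs)                              # subtract the max for stability
    probs = [math.exp(v - top) for v in logs]    # unnormalized posterior weights
    return sum(z * p for z, p in zip(atoms, probs)) / sum(probs)

def lemma13_bound(x, atoms, weights, t, radius):
    """4|x| + 4R + C_{t,1}, with a = mu(B(0, R)) taken over the closed ball."""
    a = sum(w for z, w in zip(atoms, weights) if abs(z) <= radius)
    c_t1 = sum(w * abs(z) * math.exp(-z * z / (4.0 * t))
               for z, w in zip(atoms, weights)) / a
    return 4.0 * abs(x) + 4.0 * radius + c_t1

if __name__ == "__main__":
    atoms, weights = [-3.0, 0.5, 2.0, 10.0], [0.2, 0.3, 0.4, 0.1]
    for x in (-20.0, -1.0, 0.0, 1.0, 20.0):
        m = posterior_mean(x, atoms, weights, t=1.0)
        b = lemma13_bound(x, atoms, weights, t=1.0, radius=1.0)
        print(f"x={x:6.1f}  |m_t(x)|={abs(m):.4f}  bound={b:.4f}")
```

For a compactly supported prior the posterior mean is trivially bounded, so this example only illustrates the constants; the lemma itself needs no support or moment assumption beyond what is stated.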

###### Lemma 14

Let $T>0$, let $\Omega=C([0,T];\mathbb{R}^{d})$ be endowed with the canonical filtration $(\mathcal{F}_{t})_{0\leq t\leq T}$, and let $X_{t}(\omega)=\omega(t)$ be the coordinate process.

Assume that $b,\beta:[0,T]\times\mathbb{R}^{d}\to\mathbb{R}^{d}$ are Borel measurable and satisfy

$$|b(t,x)|+|\beta(t,x)|\leq L(1+|x|),\qquad(t,x)\in[0,T]\times\mathbb{R}^{d},$$

for some constant $L>0$. Let $\sigma\in\mathbb{R}^{d\times d}$ be a constant invertible matrix, and let $\nu$ be a probability measure on $\mathbb{R}^{d}$ such that

$$\int_{\mathbb{R}^{d}}|x|^{2}\,\nu(dx)<\infty.$$

Suppose that $\mathbb{P}^{\beta}$ is a weak solution law of

$$dX_{t}=\beta(t,X_{t})\,dt+\sigma\,dW_{t},\qquad X_{0}\sim\nu.$$

Equivalently, under $\mathbb{P}^{\beta}$,

$$W_{t}^{\beta}:=\sigma^{-1}\Bigl(X_{t}-X_{0}-\int_{0}^{t}\beta(s,X_{s})\,ds\Bigr)$$

is a $d$-dimensional Brownian motion.

Define

$$\theta(t,x):=\sigma^{-1}(b-\beta)(t,x),$$

and

$$Z_{t}:=\exp\left(\int_{0}^{t}\theta(s,X_{s})\cdot dW_{s}^{\beta}-\frac{1}{2}\int_{0}^{t}|\theta(s,X_{s})|^{2}\,ds\right),\qquad 0\leq t\leq T.$$

Then the following hold:

1. $Z=(Z_{t})_{0\leq t\leq T}$ is a true $\mathbb{P}^{\beta}$-martingale;
2. the probability measure $\mathbb{P}^{b}$ on $(\Omega,\mathcal{F}_{T})$ defined by $\frac{d\mathbb{P}^{b}}{d\mathbb{P}^{\beta}}=Z_{T}$ is a weak solution law of $dX_{t}=b(t,X_{t})\,dt+\sigma\,dW_{t}$, $X_{0}\sim\nu$.

In particular,

$$\mathbb{P}^{b}\ll\mathbb{P}^{\beta}\quad\text{on }\mathcal{F}_{T},$$

with Radon–Nikodym derivative $Z_{T}$.

If, in addition, the martingale problem for $(b,\sigma,\nu)$ is well posed, then $\mathbb{P}^{b}$ is the unique weak solution law of the $b$-equation. If both martingale problems $(\beta,\sigma,\nu)$ and $(b,\sigma,\nu)$ are well posed, then $\mathbb{P}^{\beta}$ and $\mathbb{P}^{b}$ are equivalent on $\mathcal{F}_{T}$.

Proof. We divide the argument into several steps.

For each $n\in\mathbb{N}$, define the stopping time

$$\tau_{n}:=\inf\{t\in[0,T]:|X_{t}|\geq n\}\wedge T,$$

and set

$$Z_{t}^{(n)}:=Z_{t\wedge\tau_{n}}.$$

Since the process

$$\theta_{n}(t):=\theta(t,X_{t})\mathbf{1}_{\{t\leq\tau_{n}\}}$$

is bounded, the classical bounded-integrand version of Girsanov’s theorem implies that $Z^{(n)}$ is a true $\mathbb{P}^{\beta}$-martingale. Define a probability measure $\mathbb{Q}_{n}$ on $\mathcal{F}_{T}$ by

$$\frac{d\mathbb{Q}_{n}}{d\mathbb{P}^{\beta}}=Z_{T}^{(n)}.$$

Under $\mathbb{Q}_{n}$, the process

$$W_{t}^{(n)}:=W_{t}^{\beta}-\int_{0}^{t\wedge\tau_{n}}\theta(s,X_{s})\,ds$$

is a $d$-dimensional Brownian motion. Consequently,

$$\begin{aligned}X_{t\wedge\tau_{n}}&=X_{0}+\int_{0}^{t\wedge\tau_{n}}\beta(s,X_{s})\,ds+\sigma W_{t\wedge\tau_{n}}^{\beta}\\&=X_{0}+\int_{0}^{t\wedge\tau_{n}}\beta(s,X_{s})\,ds+\sigma\int_{0}^{t\wedge\tau_{n}}\theta(s,X_{s})\,ds+\sigma W_{t\wedge\tau_{n}}^{(n)}\\&=X_{0}+\int_{0}^{t\wedge\tau_{n}}b(s,X_{s})\,ds+\sigma W_{t\wedge\tau_{n}}^{(n)}.\end{aligned}$$

Thus, under $\mathbb{Q}_{n}$, the stopped coordinate process solves the $b$-equation up to $\tau_{n}$. Define

$$f_{n}(t):=E^{\mathbb{Q}_{n}}\Big[\sup_{0\leq u\leq t}|X_{u\wedge\tau_{n}}|^{2}\Big],\qquad 0\leq t\leq T.$$

Write

$$X_{t\wedge\tau_{n}}=X_{0}+A_{t}^{(n)}+M_{t}^{(n)},$$

where

$$A_{t}^{(n)}:=\int_{0}^{t\wedge\tau_{n}}b(s,X_{s})\,ds,\qquad M_{t}^{(n)}:=\sigma W_{t\wedge\tau_{n}}^{(n)}.$$

Using $(a+b+c)^{2}\leq 3(a^{2}+b^{2}+c^{2})$, we obtain

$$f_{n}(t)\leq 3E^{\mathbb{Q}_{n}}|X_{0}|^{2}+3E^{\mathbb{Q}_{n}}\Big[\sup_{u\leq t}|A_{u}^{(n)}|^{2}\Big]+3E^{\mathbb{Q}_{n}}\Big[\sup_{u\leq t}|M_{u}^{(n)}|^{2}\Big].$$

Since $Z_{0}^{(n)}=1$, the law of $X_{0}$ under $\mathbb{Q}_{n}$ is the same as under $\mathbb{P}^{\beta}$, namely $\nu$. Hence

$$E^{\mathbb{Q}_{n}}|X_{0}|^{2}=\int_{\mathbb{R}^{d}}|x|^{2}\,\nu(dx)<\infty.$$

For the drift term, the linear-growth assumption yields

$$|b(t,x)|\leq L(1+|x|),$$

so

$$\sup_{u\leq t}|A_{u}^{(n)}|\leq\int_{0}^{t}\mathbf{1}_{\{s\leq\tau_{n}\}}|b(s,X_{s})|\,ds\leq L\int_{0}^{t}\bigl(1+|X_{s\wedge\tau_{n}}|\bigr)\,ds.$$

By Cauchy–Schwarz,

$$\Big(\int_{0}^{t}\bigl(1+|X_{s\wedge\tau_{n}}|\bigr)\,ds\Big)^{2}\leq t\int_{0}^{t}\bigl(1+|X_{s\wedge\tau_{n}}|\bigr)^{2}\,ds\leq 2t\int_{0}^{t}\bigl(1+|X_{s\wedge\tau_{n}}|^{2}\bigr)\,ds.$$

Therefore,

$$E^{\mathbb{Q}_{n}}\Big[\sup_{u\leq t}|A_{u}^{(n)}|^{2}\Big]\leq 2L^{2}t\int_{0}^{t}\Bigl(1+E^{\mathbb{Q}_{n}}|X_{s\wedge\tau_{n}}|^{2}\Bigr)\,ds\leq 2L^{2}t\int_{0}^{t}\bigl(1+f_{n}(s)\bigr)\,ds.$$

For the martingale term, $M^{(n)}$ is a continuous $\mathbb{Q}_{n}$-martingale with quadratic variation

$$\langle M^{(n)}\rangle_{t}=\int_{0}^{t\wedge\tau_{n}}\sigma\sigma^{\top}\,ds.$$

Hence, by the Burkholder–Davis–Gundy inequality (Karatzas and Shreve, [2014](https://arxiv.org/html/2605.05387#bib.bib14)),

$$E^{\mathbb{Q}_{n}}\Big[\sup_{u\leq t}|M_{u}^{(n)}|^{2}\Big]\leq C_{\mathrm{BDG}}\,E^{\mathbb{Q}_{n}}\big[\mathrm{tr}\langle M^{(n)}\rangle_{t}\big]\leq C_{\mathrm{BDG}}\|\sigma\|_{\mathrm{HS}}^{2}\,t.$$

Combining the above bounds, we find constants $C_{0},C_{1}>0$, independent of $n$, such that

$$f_{n}(t)\leq C_{0}+C_{1}\int_{0}^{t}\bigl(1+f_{n}(s)\bigr)\,ds,\qquad 0\leq t\leq T.$$

By Gronwall’s lemma,

$$\sup_{n\geq 1}\sup_{0\leq t\leq T}f_{n}(t)\leq C_{T}$$

for some constant $C_{T}<\infty$ independent of $n$. In particular,

$$\sup_{n\geq 1}E^{\mathbb{Q}_{n}}\int_{0}^{T}|X_{s\wedge\tau_{n}}|^{2}\,ds\leq TC_{T}.$$

Under $\mathbb{Q}_{n}$,

$$dW_{t}^{\beta}=dW_{t}^{(n)}+\mathbf{1}_{\{t\leq\tau_{n}\}}\theta(t,X_{t})\,dt.$$

Substituting into the definition of $Z_{T}^{(n)}$ yields

$$\log Z_{T}^{(n)}=\int_{0}^{T\wedge\tau_{n}}\theta(s,X_{s})\cdot dW_{s}^{(n)}+\frac{1}{2}\int_{0}^{T\wedge\tau_{n}}|\theta(s,X_{s})|^{2}\,ds.$$

Taking expectations under $\mathbb{Q}_{n}$, the stochastic integral has mean zero, and therefore

$$E^{\mathbb{Q}_{n}}[\log Z_{T}^{(n)}]=\frac{1}{2}E^{\mathbb{Q}_{n}}\int_{0}^{T\wedge\tau_{n}}|\theta(s,X_{s})|^{2}\,ds.$$

Since $\theta(t,x)=\sigma^{-1}(b-\beta)(t,x)$ and both $b$ and $\beta$ have linear growth, there exists a constant $C>0$ such that

$$|\theta(t,x)|^{2}\leq C(1+|x|^{2}).$$

Hence

$$E^{\mathbb{Q}_{n}}[\log Z_{T}^{(n)}]\leq C\left(T+E^{\mathbb{Q}_{n}}\int_{0}^{T}|X_{s\wedge\tau_{n}}|^{2}\,ds\right)\leq C_{T}^{\prime}.$$

Moreover, by definition of $\mathbb{Q}_{n}$,

$$E^{\mathbb{P}^{\beta}}\big[Z_{T}^{(n)}\log Z_{T}^{(n)}\big]=E^{\mathbb{Q}_{n}}[\log Z_{T}^{(n)}].$$

Thus,

$$\sup_{n\geq 1}E^{\mathbb{P}^{\beta}}\big[Z_{T}^{(n)}\log Z_{T}^{(n)}\big]<\infty.$$

Since the function $x\mapsto x\log x$ is convex and grows superlinearly as $x\to\infty$, de la Vallée-Poussin’s criterion implies that the family $\{Z_{T}^{(n)}\}_{n\geq 1}$ is uniformly integrable (Durrett, [2019](https://arxiv.org/html/2605.05387#bib.bib8)).

Now $\tau_{n}\uparrow T$ $\mathbb{P}^{\beta}$-almost surely, and $Z$ is continuous, hence

$$Z_{T}^{(n)}\to Z_{T}\qquad\mathbb{P}^{\beta}\text{-a.s.}$$

Uniform integrability therefore implies

$$E^{\mathbb{P}^{\beta}}[Z_{T}]=\lim_{n\to\infty}E^{\mathbb{P}^{\beta}}[Z_{T}^{(n)}]=1.$$

Since $Z$ is a nonnegative local martingale, hence a supermartingale, with $E^{\mathbb{P}^{\beta}}[Z_{T}]=1=Z_{0}$, it follows that $Z$ is a true $\mathbb{P}^{\beta}$-martingale. Now define a probability measure $\mathbb{P}^{b}$ on $\mathcal{F}_{T}$ by

$$\frac{d\mathbb{P}^{b}}{d\mathbb{P}^{\beta}}=Z_{T}.$$

Since $Z$ is a true martingale, the classical Girsanov theorem applies and yields that

$$W_{t}^{b}:=W_{t}^{\beta}-\int_{0}^{t}\theta(s,X_{s})\,ds$$

is a Brownian motion under $\mathbb{P}^{b}$. Therefore,

$$\begin{aligned}X_{t}&=X_{0}+\int_{0}^{t}\beta(s,X_{s})\,ds+\sigma W_{t}^{\beta}\\&=X_{0}+\int_{0}^{t}\beta(s,X_{s})\,ds+\sigma\int_{0}^{t}\theta(s,X_{s})\,ds+\sigma W_{t}^{b}\\&=X_{0}+\int_{0}^{t}b(s,X_{s})\,ds+\sigma W_{t}^{b}.\end{aligned}$$

Thus $X$ solves

$$dX_{t}=b(t,X_{t})\,dt+\sigma\,dW_{t}^{b}$$

under $\mathbb{P}^{b}$.

It remains to identify the initial law. For every Borel set $A\subset\mathbb{R}^{d}$,

$$\mathbb{P}^{b}(X_{0}\in A)=E^{\mathbb{P}^{\beta}}\big[\mathbf{1}_{\{X_{0}\in A\}}Z_{T}\big].$$

Since $\mathbf{1}_{\{X_{0}\in A\}}$ is $\mathcal{F}_{0}$-measurable and $Z$ is a martingale with $Z_{0}=1$,

$$E^{\mathbb{P}^{\beta}}\big[\mathbf{1}_{\{X_{0}\in A\}}Z_{T}\big]=E^{\mathbb{P}^{\beta}}\Big[\mathbf{1}_{\{X_{0}\in A\}}E^{\mathbb{P}^{\beta}}[Z_{T}\mid\mathcal{F}_{0}]\Big]=E^{\mathbb{P}^{\beta}}\big[\mathbf{1}_{\{X_{0}\in A\}}\big]=\nu(A).$$

Hence $X_{0}\sim\nu$ under $\mathbb{P}^{b}$ as well. This proves that $\mathbb{P}^{b}$ is a weak solution law of

$$dX_{t}=b(t,X_{t})\,dt+\sigma\,dW_{t},\qquad X_{0}\sim\nu.$$

If the martingale problem for $(b,\sigma,\nu)$ is well posed, then the probability measure $\mathbb{P}^{b}$ constructed above coincides with the unique weak solution law of the $b$-equation. If both martingale problems $(\beta,\sigma,\nu)$ and $(b,\sigma,\nu)$ are well posed, then the same argument with $b$ and $\beta$ interchanged yields

$$\mathbb{P}^{\beta}\ll\mathbb{P}^{b},$$

and therefore

$$\mathbb{P}^{\beta}\sim\mathbb{P}^{b}.$$

This completes the proof\.
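The relative-entropy formula that this change of measure leads to, namely one half of the expected integrated squared drift gap $|\theta|^{2}$ under the reference law, has an exact discrete-time analogue for Euler chains, which makes it easy to check. The sketch below is an illustration under assumptions that are not from the paper: dimension one, $\sigma=1$, the specific linear drifts $\beta(x)=-x$ and $b(x)=-2x$, a deterministic start at $0$, and hypothetical function names. It uses the chain rule for KL together with the closed form $\mathrm{KL}(\mathcal{N}(m_{1},v)\,\|\,\mathcal{N}(m_{2},v))=(m_{1}-m_{2})^{2}/(2v)$.

```python
import math

def discrete_path_kl(horizon=1.0, n_steps=10_000, x0=0.0):
    """KL(P^beta || P^b) between Euler chains with beta(x) = -x, b(x) = -2x, sigma = 1."""
    dt = horizon / n_steps
    second_moment = x0 * x0          # E[X_k^2] under the beta-chain
    kl = 0.0
    for _ in range(n_steps):
        # One-step KL between N(x + beta(x) dt, dt) and N(x + b(x) dt, dt) equals
        # (b - beta)(x)^2 * dt / 2 = x^2 * dt / 2; average it over the beta-chain.
        kl += 0.5 * second_moment * dt
        # Exact recursion for X_{k+1} = (1 - dt) X_k + sqrt(dt) * xi, xi ~ N(0, 1).
        second_moment = (1.0 - dt) ** 2 * second_moment + dt
    return kl

def continuous_path_kl(horizon=1.0):
    """(1/2) * int_0^T E[X_s^2] ds for dX = -X dt + dW, X_0 = 0, E[X_s^2] = (1 - e^{-2s}) / 2."""
    return 0.25 * (horizon - (1.0 - math.exp(-2.0 * horizon)) / 2.0)

if __name__ == "__main__":
    print(discrete_path_kl(), continuous_path_kl())
```

As the step size shrinks, the discrete sum approaches the continuous-time integral, which is the quantity the Girsanov argument identifies with the path-space KL divergence.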

###### Lemma 15

Let $U$ be square-integrable and let $\mathcal{G}\subseteq\mathcal{H}$ be $\sigma$-fields. Then

$$\mathbb{E}\bigl[\|\mathbb{E}[U\mid\mathcal{H}]-\mathbb{E}[U\mid\mathcal{G}]\|_{2}^{2}\bigr]=\operatorname{mmse}(U\mid\mathcal{G})-\operatorname{mmse}(U\mid\mathcal{H}),$$

where $\operatorname{mmse}(U\mid\mathcal{G}):=\mathbb{E}\|U-\mathbb{E}[U\mid\mathcal{G}]\|_{2}^{2}$.

Proof. The identity is the Pythagorean theorem for orthogonal projections in $L^{2}$: conditional expectation given $\mathcal{G}$ is the orthogonal projection onto the subspace of $\mathcal{G}$-measurable functions. Since $U-\mathbb{E}[U\mid\mathcal{H}]$ is orthogonal to the $\mathcal{H}$-measurable difference $\mathbb{E}[U\mid\mathcal{H}]-\mathbb{E}[U\mid\mathcal{G}]$, we get $\operatorname{mmse}(U\mid\mathcal{G})=\operatorname{mmse}(U\mid\mathcal{H})+\mathbb{E}\|\mathbb{E}[U\mid\mathcal{H}]-\mathbb{E}[U\mid\mathcal{G}]\|_{2}^{2}$.
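The MMSE-gap identity can be verified exactly on a finite probability space, where $\sigma$-fields are generated by partitions. The sketch below is a toy check with made-up probabilities and values; the nested partitions `fine` (generating $\mathcal{H}$) and `coarse` (generating $\mathcal{G}$) and all function names are assumptions of the example.

```python
def cond_exp(values, prob, partition):
    """E[U | sigma(partition)], returned as a list indexed by outcome."""
    out = [0.0] * len(values)
    for block in partition:
        mass = sum(prob[i] for i in block)
        mean = sum(prob[i] * values[i] for i in block) / mass
        for i in block:
            out[i] = mean
    return out

def mean_sq(prob, first, second):
    """E[(first - second)^2] on the finite probability space."""
    return sum(p * (x - y) ** 2 for p, x, y in zip(prob, first, second))

prob = [0.1, 0.2, 0.15, 0.25, 0.2, 0.1]
u = [1.0, -2.0, 0.5, 3.0, -1.0, 4.0]
fine = [[0, 1], [2], [3, 4], [5]]     # generates H
coarse = [[0, 1, 2], [3, 4, 5]]       # generates G, a sub-sigma-field of H

e_h = cond_exp(u, prob, fine)
e_g = cond_exp(u, prob, coarse)
lhs = mean_sq(prob, e_h, e_g)                          # E|E[U|H] - E[U|G]|^2
rhs = mean_sq(prob, u, e_g) - mean_sq(prob, u, e_h)    # mmse(U|G) - mmse(U|H)
print(lhs, rhs)
```

The two printed numbers agree up to floating-point rounding, and the agreement fails in general if `coarse` is not a coarsening of `fine`, which is where the nesting $\mathcal{G}\subseteq\mathcal{H}$ enters.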

###### Lemma 16

Let $X\in\mathbb{R}^{d}$ be a random vector with $\mathbb{E}\|X\|_{2}^{2}<\infty$, let $S$ be an arbitrary random element, and let $N\sim\mathcal{N}(0,I_{d})$ be independent of $(X,S)$. For $\gamma>0$ define the Gaussian observation channel

$$Y_{\gamma}:=\sqrt{\gamma}\,X+N.$$

Then $I(X;Y_{\gamma}\mid S)$ is differentiable in $\gamma$ and

$$\frac{\mathrm{d}}{\mathrm{d}\gamma}\,I(X;Y_{\gamma}\mid S)=\frac{1}{2}\,\operatorname{mmse}(X\mid Y_{\gamma},S),$$

where

$$\operatorname{mmse}(X\mid Y_{\gamma},S):=\mathbb{E}\left[\big\|X-\mathbb{E}[X\mid Y_{\gamma},S]\big\|_{2}^{2}\right].$$

Proof. Fix $\gamma>0$. By disintegration,

$$I(X;Y_{\gamma}\mid S)=\int I(X;Y_{\gamma}\mid S=s)\,P_{S}(\mathrm{d}s).\tag{A.1}$$

For each $s$, conditional on $S=s$ the channel remains AWGN: $Y_{\gamma}=\sqrt{\gamma}\,X+N$ with $N\perp\!\!\!\perp X$ under $\operatorname{Law}(\cdot\mid S=s)$. Since $\mathbb{E}[\|X\|_{2}^{2}\mid S=s]<\infty$ for $P_{S}$-a.e. $s$, the (vector) I–MMSE identity of (Guo et al., [2005](https://arxiv.org/html/2605.05387#bib.bib10)) applied to the conditional input law $X\mid S=s$ yields

$$\frac{\mathrm{d}}{\mathrm{d}\gamma}I(X;Y_{\gamma}\mid S=s)=\frac{1}{2}\,\operatorname{mmse}(X\mid Y_{\gamma},S=s).\tag{A.2}$$

Moreover, for every $s$,

$$0\leq\operatorname{mmse}(X\mid Y_{\gamma},S=s)\leq\mathbb{E}[\|X\|_{2}^{2}\mid S=s],$$

because the MMSE is the minimum mean-squared error and is upper bounded by the MSE of the zero estimator. Since $\mathbb{E}\|X\|_{2}^{2}=\int\mathbb{E}[\|X\|_{2}^{2}\mid S=s]\,P_{S}(\mathrm{d}s)<\infty$, dominated convergence (Leibniz rule) allows differentiating under the integral in ([A.1](https://arxiv.org/html/2605.05387#A1.E1)), giving

$$\frac{\mathrm{d}}{\mathrm{d}\gamma}I(X;Y_{\gamma}\mid S)=\frac{1}{2}\int\operatorname{mmse}(X\mid Y_{\gamma},S=s)\,P_{S}(\mathrm{d}s).$$

Finally, by the law of total expectation and the definition of conditional MMSE,

$$\int\operatorname{mmse}(X\mid Y_{\gamma},S=s)\,P_{S}(\mathrm{d}s)=\mathbb{E}\left[\big\|X-\mathbb{E}[X\mid Y_{\gamma},S]\big\|_{2}^{2}\right]=\operatorname{mmse}(X\mid Y_{\gamma},S),$$

which proves the claim.
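In the simplest special case the identity can be checked in closed form. The sketch below assumes a scalar Gaussian input $X\sim\mathcal{N}(0,v)$ and trivial side information $S$ (both are assumptions of the example, as are the function names). In that case $I(X;Y_{\gamma})=\tfrac{1}{2}\log(1+\gamma v)$ and $\operatorname{mmse}(X\mid Y_{\gamma})=v/(1+\gamma v)$, and a central finite difference of the former is compared with half of the latter.

```python
import math

def mutual_info(gamma, var_x=1.0):
    """I(X; sqrt(gamma) X + N) in nats for scalar X ~ N(0, var_x), N ~ N(0, 1)."""
    return 0.5 * math.log(1.0 + gamma * var_x)

def mmse(gamma, var_x=1.0):
    """Minimum mean-squared error of X given sqrt(gamma) X + N in the same model."""
    return var_x / (1.0 + gamma * var_x)

def central_diff(func, point, step=1e-5):
    """Symmetric finite-difference approximation of func'(point)."""
    return (func(point + step) - func(point - step)) / (2.0 * step)

if __name__ == "__main__":
    for gamma in (0.1, 1.0, 5.0):
        print(gamma, central_diff(mutual_info, gamma), 0.5 * mmse(gamma))
```

The two columns agree to many digits; the lemma extends the same derivative relation to arbitrary square-integrable inputs and to conditioning on $S$.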

Proof of Theorem [4](https://arxiv.org/html/2605.05387#Thmtheorem4). Fix $b$. Let $Y^{*,b}$ and $\hat{Y}^{\,b}$ solve the two reverse-time SDEs on $[\tau^{*},T-t_{0}]$ in the theorem, started from the same law at time $\tau^{*}$ and with the same unit diffusion coefficient. Denote their drifts by $f_{\tau}^{*,b}$ and $\hat{f}_{\tau}^{\,b}$, respectively.

We first explain why these drifts have at most linear growth and why Lemma [14](https://arxiv.org/html/2605.05387#Thmtheorem14) applies. By Tweedie’s formula, for $t=T-\tau$,

$$\mathbb{E}[Z\mid X_{t}=x]=x+t\,s_{t}(x),\qquad\mathbb{E}[Z\mid X_{t}=x,B=b]=x+t\,s_{t}^{*,b}(x).$$

Hence

$$f_{\tau}^{*,b}(x)=P_{\parallel}s_{t}^{*,b}(x)+\frac{1}{t}P_{\perp}(b-x)=\frac{1}{t}\bigl(\mathbb{E}[Z\mid X_{t}=x,B=b]-x\bigr),$$

because under the conditioning $B=b$ we have $P_{\perp}\mathbb{E}[Z\mid X_{t}=x,B=b]=b$. Likewise,

$$\hat{f}_{\tau}^{\,b}(x)=P_{\parallel}s_{t}(x)+\frac{1}{t}P_{\perp}(b-x)=\frac{1}{t}\bigl(P_{\parallel}\mathbb{E}[Z\mid X_{t}=x]-P_{\parallel}x+P_{\perp}(b-x)\bigr).$$

By Lemma [13](https://arxiv.org/html/2605.05387#Thmtheorem13), applied to the law of $Z$ and to the conditional law of $Z$ given $B=b$ respectively, the maps $x\mapsto\mathbb{E}[Z\mid X_{t}=x]$ and $x\mapsto\mathbb{E}[Z\mid X_{t}=x,B=b]$ have at most linear growth. Since $t=T-\tau\in[t_{0},t^{*}]$ on $[\tau^{*},T-t_{0}]$, the factor $1/t$ is uniformly bounded by $1/t_{0}$. Therefore both $f_{\tau}^{*,b}$ and $\hat{f}_{\tau}^{\,b}$ satisfy a linear-growth bound of the form

$$|f_{\tau}^{*,b}(x)|+|\hat{f}_{\tau}^{\,b}(x)|\leq C_{b,t_{0}}(1+|x|),\qquad\tau\in[\tau^{*},T-t_{0}].$$

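Tweedie’s formula, used above to pass between scores and posterior means, can be sanity-checked on a toy prior. The sketch below assumes a one-dimensional clean signal $Z$ uniform on $\{-1,+1\}$ and the noising $X_{t}=Z+\sqrt{t}\,G$ (the prior, the finite-difference score and all function names are assumptions of the example). For this prior the posterior mean is $\tanh(x/t)$, and the code checks that it equals $x+t\,s_{t}(x)$.

```python
import math

def log_density(x, t):
    """log p_t(x), up to an additive constant, for Z uniform on {-1, +1}."""
    first = -(x - 1.0) ** 2 / (2.0 * t)
    second = -(x + 1.0) ** 2 / (2.0 * t)
    top = max(first, second)                      # log-sum-exp for stability
    return top + math.log(math.exp(first - top) + math.exp(second - top))

def score(x, t, step=1e-5):
    """Finite-difference approximation of s_t(x) = d/dx log p_t(x)."""
    return (log_density(x + step, t) - log_density(x - step, t)) / (2.0 * step)

def two_point_posterior_mean(x, t):
    """E[Z | X_t = x] for the two-point prior, in closed form."""
    return math.tanh(x / t)

if __name__ == "__main__":
    for x in (-2.0, -0.3, 0.0, 0.7, 3.0):
        print(x, x + 1.0 * score(x, 1.0), two_point_posterior_mean(x, 1.0))
```

The bounded posterior mean here is consistent with the linear-growth bound, which is all the Girsanov step requires of the drifts.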
Now $Y^{*,b}$ is already given as a weak solution law for the drift $f^{*,b}$, namely the ideal conditional reverse process. We therefore apply Lemma [14](https://arxiv.org/html/2605.05387#Thmtheorem14) with

$$\beta(\tau,x)=f_{\tau}^{*,b}(x),\qquad b(\tau,x)=\hat{f}_{\tau}^{\,b}(x),\qquad\sigma=I_{d},$$

and with initial law equal to the common law of $Y^{*,b}_{\tau^{*}}$ and $\hat{Y}^{\,b}_{\tau^{*}}$. The lemma yields existence of the surrogate weak solution law and the relative-entropy identity

$$\begin{aligned}\mathrm{KL}\big(\mathbb{P}^{Y^{*,b}}\,\big\|\,\mathbb{P}^{\hat{Y}^{\,b}}\big)&=\frac{1}{2}\,\mathbb{E}^{Y^{*,b}}\left[\int_{\tau^{*}}^{T-t_{0}}\big\|f_{\tau}^{*,b}(Y_{\tau}^{*,b})-\hat{f}_{\tau}^{\,b}(Y_{\tau}^{*,b})\big\|_{2}^{2}\,\mathrm{d}\tau\right]\\&\leq\frac{1}{2}\,\mathbb{E}^{Y^{*,b}}\left[\int_{\tau^{*}}^{T}\big\|f_{\tau}^{*,b}(Y_{\tau}^{*,b})-\hat{f}_{\tau}^{\,b}(Y_{\tau}^{*,b})\big\|_{2}^{2}\,\mathrm{d}\tau\right].\end{aligned}\tag{A.3}$$

Since $Y^{*,b}$ is the true reverse-time process of the conditional forward diffusion $\{X_{t}\}_{t\in[0,T]}$ under $B=b$, it is defined up to terminal reverse time $T$. Hence we may extend the integral from $[\tau^{*},T-t_{0}]$ to $[\tau^{*},T)$. By inspection, the two SDEs have the same normal drift, hence for all $x\in\mathbb{R}^{d}$,

$$f_{\tau}^{*,b}(x)-\hat{f}_{\tau}^{\,b}(x)=P_{\parallel}\Big(s_{t}^{*,b}(x)-s_{t}(x)\Big).\tag{A.4}$$

Now let $Z\sim p_{0}$ denote the clean signal, and decompose it as

$$U:=P_{\parallel}Z,\qquad B:=P_{\perp}Z.$$

Tweedie's formula yields, for every $x\in\mathbb{R}^{d}$,

$$\mathbb{E}[Z\mid X_{t}=x]=x+t\,s_{t}(x),\qquad\mathbb{E}[Z\mid X_{t}=x,\,B=b]=x+t\,s_{t}^{*,b}(x).$$

Projecting onto $\ker(A)$ and subtracting, we obtain

$$P_{\parallel}\!\big(s_{t}^{*,b}-s_{t}\big)(x)=\frac{1}{t}\Big(\mathbb{E}[U\mid X_{t}=x,\,B=b]-\mathbb{E}[U\mid X_{t}=x]\Big).\tag{A.5}$$

Setting $t=T-\tau$ and combining ([A.4](https://arxiv.org/html/2605.05387#A1.E4)) with ([A.5](https://arxiv.org/html/2605.05387#A1.E5)), we get

$$f_{\tau}^{*,b}(x)-\hat{f}_{\tau}^{\,b}(x)=\frac{1}{T-\tau}\Big(\mathbb{E}[U\mid X_{T-\tau}=x,\,B=b]-\mathbb{E}[U\mid X_{T-\tau}=x]\Big).$$
Plugging this identity into ([A.3](https://arxiv.org/html/2605.05387#A1.Ex91)), averaging over the random level $B$, and applying Lemma [15](https://arxiv.org/html/2605.05387#Thmtheorem15), yields

$$\mathbb{E}_{B}\!\Big[\mathrm{KL}\!\big(\mathbb{P}^{Y^{*,B}}\big\|\mathbb{P}^{\hat{Y}^{\,B}}\big)\Big]\leq\frac{1}{2}\int_{\tau^{*}}^{T}\frac{1}{(T-\tau)^{2}}\Big(\operatorname{mmse}(U\mid X_{T-\tau})-\operatorname{mmse}(U\mid X_{T-\tau},B)\Big)\,\mathrm{d}\tau.\tag{A.6}$$
Now define

$$\gamma:=\frac{1}{t}=\frac{1}{T-\tau},\qquad\tilde{X}_{\gamma}:=\sqrt{\gamma}\,X_{t}=\sqrt{\gamma}\,Z+\Xi,\qquad\Xi\sim\mathcal{N}(0,I_{d})\ \text{independent of }Z.$$

Since $X_{t}\mapsto\tilde{X}_{\gamma}$ is an invertible scaling, conditioning on $X_{t}$ is equivalent to conditioning on $\tilde{X}_{\gamma}$. Moreover,

$$\mathrm{d}\gamma=\frac{\mathrm{d}\tau}{(T-\tau)^{2}}.$$

Therefore ([A.6](https://arxiv.org/html/2605.05387#A1.E6)) becomes

$$\mathbb{E}_{B}\!\Big[\mathrm{KL}\!\big(\mathbb{P}^{Y^{*,B}}\big\|\mathbb{P}^{\hat{Y}^{\,B}}\big)\Big]\leq\frac{1}{2}\int_{\gamma^{*}}^{\infty}\Big(\operatorname{mmse}(U\mid\tilde{X}_{\gamma})-\operatorname{mmse}(U\mid\tilde{X}_{\gamma},B)\Big)\,\mathrm{d}\gamma,\tag{A.7}$$

where $\gamma^{*}:=1/(T-\tau^{*})$.

Define

$$\Phi(\gamma):=I(U;\tilde{X}_{\gamma})-I(U;\tilde{X}_{\gamma}\mid B).$$

Conditioning on $B$ turns $\tilde{X}_{\gamma}=\sqrt{\gamma}(U+B)+\Xi$ into an AWGN channel in $U$ with a known (measurable) shift, so Lemma [16](https://arxiv.org/html/2605.05387#Thmtheorem16) (with $X=U$, $S=B$ and $Y_{\gamma}=\tilde{X}_{\gamma}$) yields

$$\frac{\mathrm{d}}{\mathrm{d}\gamma}I(U;\tilde{X}_{\gamma}\mid B)=\frac{1}{2}\,\operatorname{mmse}(U\mid\tilde{X}_{\gamma},B).\tag{A.8}$$
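As a sanity check on the I-MMSE relation behind Lemma 16, the sketch below verifies the unconditional scalar identity $\frac{\mathrm{d}}{\mathrm{d}\gamma}I=\frac{1}{2}\operatorname{mmse}$ for a Gaussian input, where both sides are available in closed form. The Gaussian input, the function names, and the grid of $\gamma$ values are illustrative assumptions and are not part of the paper's setting.

```python
import numpy as np

def mutual_info_gaussian(gamma, var=1.0):
    """I(S; sqrt(gamma) * S + G) in nats for S ~ N(0, var), G ~ N(0, 1)."""
    return 0.5 * np.log1p(gamma * var)

def mmse_gaussian(gamma, var=1.0):
    """MMSE of S given sqrt(gamma) * S + G for S ~ N(0, var)."""
    return var / (1.0 + gamma * var)

gammas = np.linspace(0.1, 10.0, 50)
h = 1e-5
# Central finite difference of the mutual information with respect to gamma.
dI = (mutual_info_gaussian(gammas + h) - mutual_info_gaussian(gammas - h)) / (2 * h)
# Largest deviation between dI/dgamma and (1/2) * mmse over the grid.
max_gap = float(np.max(np.abs(dI - 0.5 * mmse_gaussian(gammas))))
print(max_gap)
```

The deviation should be at the level of finite-difference error only; the conditional version used in the proof applies the same identity after conditioning on the side information.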
Next, use the chain rule

$$I(Z;\tilde{X}_{\gamma})=I(U;\tilde{X}_{\gamma})+I(B;\tilde{X}_{\gamma}\mid U).\tag{A.9}$$

Since $\tilde{X}_{\gamma}=\sqrt{\gamma}Z+\Xi$ is an AWGN channel in $Z$, the I-MMSE identity gives

$$\frac{\mathrm{d}}{\mathrm{d}\gamma}I(Z;\tilde{X}_{\gamma})=\frac{1}{2}\,\operatorname{mmse}(Z\mid\tilde{X}_{\gamma}).$$

Also, given $U$, the observation $\tilde{X}_{\gamma}$ is an AWGN channel in $B$ with a known shift, so Lemma [16](https://arxiv.org/html/2605.05387#Thmtheorem16) (with $X=B$, $S=U$) yields

$$\frac{\mathrm{d}}{\mathrm{d}\gamma}I(B;\tilde{X}_{\gamma}\mid U)=\frac{1}{2}\,\operatorname{mmse}(B\mid\tilde{X}_{\gamma},U).$$

Differentiating ([A.9](https://arxiv.org/html/2605.05387#A1.E9)) and subtracting the last display from the derivative of $I(Z;\tilde{X}_{\gamma})$ gives

$$\frac{\mathrm{d}}{\mathrm{d}\gamma}I(U;\tilde{X}_{\gamma})=\frac{1}{2}\Big(\operatorname{mmse}(Z\mid\tilde{X}_{\gamma})-\operatorname{mmse}(B\mid\tilde{X}_{\gamma},U)\Big).\tag{A.10}$$

Because $U$ and $B$ live in orthogonal subspaces and $Z=U+B$,

$$\operatorname{mmse}(Z\mid\tilde{X}_{\gamma})=\operatorname{mmse}(U\mid\tilde{X}_{\gamma})+\operatorname{mmse}(B\mid\tilde{X}_{\gamma}),$$

hence ([A.10](https://arxiv.org/html/2605.05387#A1.E10)) becomes exactly

$$\frac{\mathrm{d}}{\mathrm{d}\gamma}I(U;\tilde{X}_{\gamma})=\frac{1}{2}\Big(\operatorname{mmse}(U\mid\tilde{X}_{\gamma})+\operatorname{mmse}(B\mid\tilde{X}_{\gamma})-\operatorname{mmse}(B\mid\tilde{X}_{\gamma},U)\Big).\tag{A.11}$$
Subtracting ([A.8](https://arxiv.org/html/2605.05387#A1.E8)) from ([A.11](https://arxiv.org/html/2605.05387#A1.E11)) yields

$$\frac{\mathrm{d}}{\mathrm{d}\gamma}\Phi(\gamma)=\frac{1}{2}\Big(\operatorname{mmse}(U\mid\tilde{X}_{\gamma})-\operatorname{mmse}(U\mid\tilde{X}_{\gamma},B)\Big)+\frac{1}{2}\Big(\operatorname{mmse}(B\mid\tilde{X}_{\gamma})-\operatorname{mmse}(B\mid\tilde{X}_{\gamma},U)\Big).\tag{A.12}$$

Insert ([A.12](https://arxiv.org/html/2605.05387#A1.E12)) into ([A.7](https://arxiv.org/html/2605.05387#A1.E7)) to obtain the exact decomposition

$$\mathbb{E}_{B}\!\Big[\mathrm{KL}\!\big(\mathbb{P}^{Y^{*,B}}\big\|\mathbb{P}^{\hat{Y}^{\,B}}\big)\Big]\leq\Big[\Phi(\gamma)\Big]_{\gamma=\gamma^{*}}^{\infty}-A,\tag{A.13}$$

where

$$A:=\frac{1}{2}\int_{\gamma^{*}}^{\infty}\Big(\operatorname{mmse}(B\mid\tilde{X}_{\gamma})-\operatorname{mmse}(B\mid\tilde{X}_{\gamma},U)\Big)\,\mathrm{d}\gamma.$$

By the orthogonality principle / law of total variance,

$$\operatorname{mmse}(B\mid\tilde{X}_{\gamma})-\operatorname{mmse}(B\mid\tilde{X}_{\gamma},U)=\mathbb{E}\!\left[\big\|\mathbb{E}[B\mid\tilde{X}_{\gamma},U]-\mathbb{E}[B\mid\tilde{X}_{\gamma}]\big\|_{2}^{2}\right]\geq 0,$$

so $A\geq 0$.
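For readers who want the intermediate step, the identity above can be sketched as a Pythagorean decomposition. The shorthand $\mathcal{G}:=\sigma(\tilde{X}_{\gamma})$ and $\mathcal{H}:=\sigma(\tilde{X}_{\gamma},U)$ is introduced here only for illustration, and the sketch assumes $\mathbb{E}\|B\|_{2}^{2}<\infty$.

```latex
\begin{aligned}
\operatorname{mmse}(B\mid\tilde{X}_{\gamma})
&=\mathbb{E}\big\|B-\mathbb{E}[B\mid\mathcal{G}]\big\|_{2}^{2}\\
&=\mathbb{E}\big\|B-\mathbb{E}[B\mid\mathcal{H}]\big\|_{2}^{2}
 +\mathbb{E}\big\|\mathbb{E}[B\mid\mathcal{H}]-\mathbb{E}[B\mid\mathcal{G}]\big\|_{2}^{2}\\
&\quad+2\,\mathbb{E}\big\langle B-\mathbb{E}[B\mid\mathcal{H}],\ \mathbb{E}[B\mid\mathcal{H}]-\mathbb{E}[B\mid\mathcal{G}]\big\rangle\\
&=\operatorname{mmse}(B\mid\tilde{X}_{\gamma},U)
 +\mathbb{E}\big\|\mathbb{E}[B\mid\mathcal{H}]-\mathbb{E}[B\mid\mathcal{G}]\big\|_{2}^{2}.
\end{aligned}
```

The cross term vanishes because $\mathbb{E}[B\mid\mathcal{H}]-\mathbb{E}[B\mid\mathcal{G}]$ is $\mathcal{H}$-measurable (as $\mathcal{G}\subseteq\mathcal{H}$) while $B-\mathbb{E}[B\mid\mathcal{H}]$ has zero conditional mean given $\mathcal{H}$.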

Using the identity $I(U;X)-I(U;X\mid B)=I(U;B)-I(U;B\mid X)$ (a direct consequence of the chain rule), we have

$$\Phi(\gamma)=I(U;B)-I(U;B\mid\tilde{X}_{\gamma}).$$

Then,

$$\Big[\Phi(\gamma)\Big]_{\gamma=\gamma^{*}}^{\infty}\leq I(U;B\mid\tilde{X}_{\gamma^{*}}).\tag{A.14}$$

Combining ([A.13](https://arxiv.org/html/2605.05387#A1.E13)) and ([A.14](https://arxiv.org/html/2605.05387#A1.E14)) gives

$$\mathbb{E}_{B}\!\Big[\mathrm{KL}\!\big(\mathbb{P}^{Y^{*,B}}\big\|\mathbb{P}^{\hat{Y}^{\,B}}\big)\Big]\leq I(U;B\mid\tilde{X}_{\gamma^{*}})-A\ \leq\ I(U;B\mid\tilde{X}_{\gamma^{*}}).\tag{A.15}$$
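The chain-rule identity $I(U;X)-I(U;X\mid B)=I(U;B)-I(U;B\mid X)$ used to rewrite $\Phi$ can be checked numerically on a finite alphabet. The sketch below does this for a random joint pmf; the discrete setting, the helper names `entropy` and `cond_mutual_info`, and the axis convention (axis 0 for $U$, axis 1 for $X$, axis 2 for $B$) are illustrative assumptions rather than anything from the paper.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (nats) of a pmf stored in an array of any shape."""
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def cond_mutual_info(p, a, b, c=()):
    """I(A; B | C) for a joint pmf array p; a, b, c are tuples of axis indices."""
    all_axes = set(range(p.ndim))

    def H(keep):
        drop = tuple(sorted(all_axes - set(keep)))
        return entropy(p.sum(axis=drop)) if drop else entropy(p)

    h_c = H(c) if c else 0.0
    return H(a + c) + H(b + c) - H(a + b + c) - h_c

rng = np.random.default_rng(0)
p = rng.random((3, 4, 5))   # axes: 0 -> U, 1 -> X, 2 -> B
p /= p.sum()

lhs = cond_mutual_info(p, (0,), (1,)) - cond_mutual_info(p, (0,), (1,), (2,))
rhs = cond_mutual_info(p, (0,), (2,)) - cond_mutual_info(p, (0,), (2,), (1,))
print(lhs, rhs)
```

Both sides expand to the same combination of joint entropies, so they should agree up to floating-point error.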
We now derive a lower bound for the left-hand side of ([A.7](https://arxiv.org/html/2605.05387#A1.E7)). It suffices to lower bound the truncated integral

$$\frac{1}{2}\int_{\gamma^{*}}^{\gamma_{\max}}\Big(\operatorname{mmse}(U\mid\tilde{X}_{\gamma})-\operatorname{mmse}(U\mid\tilde{X}_{\gamma},B)\Big)\,\mathrm{d}\gamma,$$

where $\gamma_{\max}=\frac{1}{t_{0}}$ corresponds to the original integration range $[\tau^{*},T-t_{0}]$. By the computation above, with the upper limit $\infty$ replaced by $\gamma_{\max}$, we only need to lower bound

$$I(U;B\mid\tilde{X}_{\gamma^{*}})-I(U;B\mid\tilde{X}_{\gamma_{\max}})-A_{\gamma_{\max}},$$

where

$$A_{\gamma_{\max}}:=\frac{1}{2}\int_{\gamma^{*}}^{\gamma_{\max}}\Big(\operatorname{mmse}(B\mid\tilde{X}_{\gamma})-\operatorname{mmse}(B\mid\tilde{X}_{\gamma},U)\Big)\,\mathrm{d}\gamma.\tag{A.16}$$

To control $A_{\gamma_{\max}}$ from above, decompose the observation into orthogonal components

$$\tilde{X}_{\gamma}^{\perp}:=P_{\perp}\tilde{X}_{\gamma},\qquad\tilde{X}_{\gamma}^{\parallel}:=P_{\parallel}\tilde{X}_{\gamma},\qquad\tilde{X}_{\gamma}=\tilde{X}_{\gamma}^{\perp}+\tilde{X}_{\gamma}^{\parallel}.$$

Since $\tilde{X}_{\gamma}=\sqrt{\gamma}(U+B)+\Xi$ with $\Xi\sim\mathcal{N}(0,I_{d})$ independent of $(U,B)$, we have

$$\tilde{X}_{\gamma}^{\perp}=\sqrt{\gamma}\,B+\Xi^{\perp},\qquad\tilde{X}_{\gamma}^{\parallel}=\sqrt{\gamma}\,U+\Xi^{\parallel},$$

where $\Xi^{\perp}:=P_{\perp}\Xi$ and $\Xi^{\parallel}:=P_{\parallel}\Xi$ are independent of each other and of $(U,B)$ (because they are orthogonal projections of a standard Gaussian).
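The independence of the two projected noise components can be illustrated numerically: for $\Xi\sim\mathcal{N}(0,I_{d})$ the cross-covariance of $P_{\perp}\Xi$ and $P_{\parallel}\Xi$ is $P_{\perp}P_{\parallel}^{\top}$, and jointly Gaussian vectors with zero cross-covariance are independent. The sketch below assumes, for illustration only, that $P_{\perp}=A^{+}A$ is the orthogonal projector onto the row space of a hypothetical measurement matrix `A` and that $P_{\parallel}=I-P_{\perp}$ projects onto $\ker(A)$; the dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 6, 2                       # ambient dimension, number of measurements (illustrative)
A = rng.standard_normal((m, d))   # hypothetical measurement matrix

P_perp = np.linalg.pinv(A) @ A    # orthogonal projector onto row(A) = ker(A)^perp
P_par = np.eye(d) - P_perp        # orthogonal projector onto ker(A)

# Cov(P_perp Xi, P_par Xi) = P_perp @ I @ P_par.T for Xi ~ N(0, I_d).
cross_cov = P_perp @ P_par.T
print(np.abs(cross_cov).max())
```

A vanishing cross-covariance is what allows the normal and tangent channels to be treated separately in the remainder of the argument.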

*Key claim:* conditioning on $U$, the parallel observation carries no information about $B$, hence

$$\operatorname{mmse}(B\mid\tilde{X}_{\gamma},U)=\operatorname{mmse}(B\mid\tilde{X}_{\gamma}^{\perp},U).\tag{A.17}$$

Indeed, given $U$, we can write $\tilde{X}_{\gamma}^{\parallel}=\sqrt{\gamma}\,U+\Xi^{\parallel}$ as a function of $U$ plus independent noise $\Xi^{\parallel}$, so $\tilde{X}_{\gamma}^{\parallel}\perp\!\!\!\perp(B,\tilde{X}_{\gamma}^{\perp})\mid U$. Therefore

$$\operatorname{Law}(B\mid U,\tilde{X}_{\gamma}^{\perp},\tilde{X}_{\gamma}^{\parallel})=\operatorname{Law}(B\mid U,\tilde{X}_{\gamma}^{\perp}),$$

which implies $\mathbb{E}[B\mid U,\tilde{X}_{\gamma}]=\mathbb{E}[B\mid U,\tilde{X}_{\gamma}^{\perp}]$ and thus ([A.17](https://arxiv.org/html/2605.05387#A1.E17)).

Next, by monotonicity of MMSE with respect to side information (conditioning on more cannot increase MMSE),

$$\operatorname{mmse}(B\mid\tilde{X}_{\gamma})\leq\operatorname{mmse}(B\mid\tilde{X}_{\gamma}^{\perp}),\tag{A.18}$$

since $\sigma(\tilde{X}_{\gamma}^{\perp})\subseteq\sigma(\tilde{X}_{\gamma})$.

Combining ([A.17](https://arxiv.org/html/2605.05387#A1.E17)) and ([A.18](https://arxiv.org/html/2605.05387#A1.E18)) yields the pointwise bound

$$\operatorname{mmse}(B\mid\tilde{X}_{\gamma})-\operatorname{mmse}(B\mid\tilde{X}_{\gamma},U)\ \leq\ \operatorname{mmse}(B\mid\tilde{X}_{\gamma}^{\perp})-\operatorname{mmse}(B\mid\tilde{X}_{\gamma}^{\perp},U).\tag{A.19}$$

Plugging ([A.19](https://arxiv.org/html/2605.05387#A1.E19)) into ([A.16](https://arxiv.org/html/2605.05387#A1.E16)), and using that the integrand on the right is nonnegative, gives $A_{\gamma_{\max}}\leq A^{\perp}$, where

$$A^{\perp}:=\frac{1}{2}\int_{\gamma^{*}}^{\infty}\Big(\operatorname{mmse}(B\mid\tilde{X}_{\gamma}^{\perp})-\operatorname{mmse}(B\mid\tilde{X}_{\gamma}^{\perp},U)\Big)\,\mathrm{d}\gamma.$$

Finally, $\tilde{X}_{\gamma}^{\perp}=\sqrt{\gamma}\,B+\Xi^{\perp}$ is a Gaussian channel for $B$ (in the normal subspace); applying Lemma [16](https://arxiv.org/html/2605.05387#Thmtheorem16) (after identifying an orthonormal basis of $\mathrm{range}(P_{\perp})$, if desired) gives

$$A^{\perp}=\Big[I(B;\tilde{X}_{\gamma}^{\perp})-I(B;\tilde{X}_{\gamma}^{\perp}\mid U)\Big]_{\gamma=\gamma^{*}}^{\infty}\leq I(U;B\mid\tilde{X}_{\gamma^{*}}^{\perp}).$$

Thus $A_{\gamma_{\max}}\leq I(U;B\mid\tilde{X}_{\gamma^{*}}^{\perp})$, completing the lower bound.

## Appendix B Proof of Theorem [9](https://arxiv.org/html/2605.05387#Thmtheorem9)

The proof follows the same decomposition as the sampler. At the safe time $t^{*}$, our initialization is not the exact conditional law $\operatorname{Law}(X_{t^{*}}\mid B=b)$, but a surrogate law obtained by sampling the correct noisy normal component and then sampling the tangent component from the unconditional slice given that normal observation. Thus the error splits into an initialization term at time $t^{*}$ and a pathwise term accumulated during the reverse dynamics. The pathwise part is already controlled by Theorem [4](https://arxiv.org/html/2605.05387#Thmtheorem4), so it remains to control the initialization discrepancy. For this, we write both the true and surrogate tangent laws as mixtures over the discrete latent normal code $S^{\perp}=P_{\perp}S$, reduce the resulting KL divergence to a posterior resampling error through a coupling inequality for mixture KL, and then bound that posterior resampling error by Shannon entropy using the Gaussian I-MMSE identity.

###### Lemma 17

Let $\{r_{c}\}_{c\in\mathcal{C}}$ be a family of probability measures on a measurable space $(E,\mathcal{E})$, where $\mathcal{C}$ is countable. Let $\alpha,\beta$ be probability mass functions on $\mathcal{C}$, and define

$$\mu:=\sum_{c\in\mathcal{C}}\alpha(c)\,r_{c},\qquad\nu:=\sum_{c\in\mathcal{C}}\beta(c)\,r_{c}.$$

Then

$$\mathrm{KL}(\mu\|\nu)\leq\inf_{\lambda\in\Gamma(\alpha,\beta)}\sum_{c,\tilde{c}\in\mathcal{C}}\lambda(c,\tilde{c})\,\mathrm{KL}(r_{c}\|r_{\tilde{c}}),$$

where $\Gamma(\alpha,\beta)$ denotes the set of couplings of $\alpha$ and $\beta$.

*Proof.* Fix any coupling $\lambda\in\Gamma(\alpha,\beta)$. Define two probability measures on $\mathcal{C}\times\mathcal{C}\times E$ by

$$\mathcal{P}_{\lambda}(c,\tilde{c},dx):=\lambda(c,\tilde{c})\,r_{c}(dx),\qquad Q_{\lambda}(c,\tilde{c},dx):=\lambda(c,\tilde{c})\,r_{\tilde{c}}(dx).$$

Their $E$-marginals are exactly $\mu$ and $\nu$. Therefore, by data processing under the projection $(c,\tilde{c},x)\mapsto x$,

$$\mathrm{KL}(\mu\|\nu)\leq\mathrm{KL}(\mathcal{P}_{\lambda}\|Q_{\lambda}).$$

Since $\mathcal{P}_{\lambda}$ and $Q_{\lambda}$ have the same $(c,\tilde{c})$-marginal $\lambda$, the chain rule for relative entropy gives

$$\mathrm{KL}(\mathcal{P}_{\lambda}\|Q_{\lambda})=\sum_{c,\tilde{c}}\lambda(c,\tilde{c})\,\mathrm{KL}(r_{c}\|r_{\tilde{c}}).$$

Taking the infimum over $\lambda\in\Gamma(\alpha,\beta)$ yields the claim.
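A small numerical illustration of Lemma 17 with the product coupling $\lambda=\alpha\otimes\beta$ (the coupling used later in the proof of Theorem 9) can be sketched as follows. The finite state space, the random components `r`, and the weights `alpha` and `beta` are all hypothetical choices made only for the check.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) in nats for strictly positive pmfs on a finite set."""
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(1)
n_codes, n_states = 4, 7
r = rng.random((n_codes, n_states)) + 0.1
r /= r.sum(axis=1, keepdims=True)           # r[c] plays the role of r_c
alpha = rng.dirichlet(np.ones(n_codes))     # mixture weights of mu
beta = rng.dirichlet(np.ones(n_codes))      # mixture weights of nu

mu, nu = alpha @ r, beta @ r                # the two mixtures
pair_kl = np.array([[kl(r[c], r[ct]) for ct in range(n_codes)]
                    for c in range(n_codes)])
# Right-hand side of the lemma evaluated at the product coupling alpha (x) beta.
product_bound = float(alpha @ pair_kl @ beta)
lhs = kl(mu, nu)
print(lhs, product_bound)
```

The product coupling is generally not optimal, so the gap between the two printed numbers can be large; the lemma only guarantees the inequality.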

###### Lemma 18

Let $S$ be a discrete random variable in $\mathbb{R}^{m}$ with $H(S)<\infty$, and let

$$X=S+\sigma G,\qquad G\sim\mathcal{N}(0,I_{m}),$$

with $G$ independent of $S$. Let $\tilde{S}$ be an independent posterior draw, i.e.

$$\tilde{S}\mid X\sim\operatorname{Law}(S\mid X),\qquad\tilde{S}\perp\!\!\!\perp S\mid X.$$

Then

$$\mathbb{E}\|S-\tilde{S}\|_{2}^{2}\leq 4\sigma^{2}H(S).$$

*Proof.* Conditional on $X$, the random variables $S$ and $\tilde{S}$ are i.i.d. with common law $\operatorname{Law}(S\mid X)$. Hence

$$\mathbb{E}\!\left[\|S-\tilde{S}\|_{2}^{2}\,\middle|\,X\right]=2\,\operatorname{tr}\!\big(\operatorname{Cov}(S\mid X)\big),$$

so

$$\mathbb{E}\|S-\tilde{S}\|_{2}^{2}=2\,\mathbb{E}\operatorname{tr}\!\big(\operatorname{Cov}(S\mid X)\big).\tag{B.1}$$

Set $\gamma:=1/\sigma^{2}$ and $Y_{\gamma}:=\sqrt{\gamma}\,S+G$. Since $X=\sigma Y_{\gamma}$, the observations $X$ and $Y_{\gamma}$ are equivalent, and

$$\mathbb{E}\operatorname{tr}\!\big(\operatorname{Cov}(S\mid X)\big)=\mathbb{E}\operatorname{tr}\!\big(\operatorname{Cov}(S\mid Y_{\gamma})\big)=:\operatorname{mmse}(\gamma).$$

For the Gaussian channel $Y_{\gamma}=\sqrt{\gamma}\,S+G$, the vector I-MMSE identity gives

$$I(S;Y_{\gamma})=\frac{1}{2}\int_{0}^{\gamma}\operatorname{mmse}(s)\,ds.$$

Since $\operatorname{mmse}(s)$ is nonincreasing in $s$,

$$I(S;Y_{\gamma})\geq\frac{\gamma}{2}\,\operatorname{mmse}(\gamma).$$

Because $S$ is discrete,

$$I(S;Y_{\gamma})\leq H(S).$$

Thus

$$\operatorname{mmse}(\gamma)\leq\frac{2H(S)}{\gamma}=2\sigma^{2}H(S).$$

Substituting into ([B.1](https://arxiv.org/html/2605.05387#A2.E1)) gives

$$\mathbb{E}\|S-\tilde{S}\|_{2}^{2}\leq 4\sigma^{2}H(S).$$
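The bound of Lemma 18 can be checked by quadrature in a scalar toy case. The sketch below computes $\mathbb{E}|S-\tilde{S}|^{2}=2\,\mathbb{E}\operatorname{Var}(S\mid X)$ on a grid and compares it with $4\sigma^{2}H(S)$, with entropy measured in nats. The three-point distribution, the noise levels, and the function name `resampling_error` are hypothetical choices for illustration.

```python
import numpy as np

def resampling_error(values, pmf, sigma, n_grid=20001):
    """E|S - S~|^2 = 2 E Var(S | X) for scalar X = S + sigma * G, via a Riemann sum."""
    lo, hi = values.min() - 10 * sigma, values.max() + 10 * sigma
    x = np.linspace(lo, hi, n_grid)
    dx = x[1] - x[0]
    # Joint density p(s, x) = pmf(s) * N(x; s, sigma^2), shape (n_values, n_grid).
    joint = pmf[:, None] * np.exp(-(x[None, :] - values[:, None]) ** 2 / (2 * sigma**2))
    joint /= np.sqrt(2 * np.pi) * sigma
    px = joint.sum(axis=0)                       # marginal density of X
    post = joint / np.maximum(px, 1e-300)        # posterior pmf of S given X = x
    mean = (post * values[:, None]).sum(axis=0)
    var = (post * values[:, None] ** 2).sum(axis=0) - mean**2
    return 2.0 * float(np.sum(px * var) * dx)

values = np.array([-1.0, 0.0, 2.0])
pmf = np.array([0.2, 0.5, 0.3])
entropy_nats = float(-np.sum(pmf * np.log(pmf)))   # H(S) in nats

sigmas = (0.1, 0.5, 1.0, 3.0)
errors = [resampling_error(values, pmf, s) for s in sigmas]
bounds = [4 * s**2 * entropy_nats for s in sigmas]
for s, e, b in zip(sigmas, errors, bounds):
    print(s, e, b)
```

In this toy case the bound is loose at both small and large $\sigma$, which is consistent with it being derived from the crude estimates $I\geq\frac{\gamma}{2}\operatorname{mmse}(\gamma)$ and $I\leq H(S)$.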
*Proof of Theorem [9](https://arxiv.org/html/2605.05387#Thmtheorem9).* Let

$$r_{t}^{c}:=\operatorname{Law}(X_{t}^{\parallel}\mid S^{\perp}=c),\qquad c\in\mathcal{C}^{\perp}.$$

Under Assumption [5.2](https://arxiv.org/html/2605.05387#S5.Thmassumption2),

$$Z=S+\varepsilon N,\qquad B=P_{\perp}Z=S^{\perp}+\varepsilon N^{\perp},\qquad X_{t}^{\perp}=B+W_{t}^{\perp}.$$

Since the tangent and normal noises are independent, conditional on $S^{\perp}$ the variable $X_{t}^{\parallel}$ is independent of both $B$ and $X_{t}^{\perp}$. Therefore, if

$$\pi_{b}(c):=\mathbb{P}(S^{\perp}=c\mid B=b),\qquad\pi_{x}(c):=\mathbb{P}(S^{\perp}=c\mid X_{t}^{\perp}=x),$$

then

$$\operatorname{Law}(X_{t}^{\parallel}\mid B=b)=\sum_{c\in\mathcal{C}^{\perp}}\pi_{b}(c)\,r_{t}^{c},\qquad\operatorname{Law}(X_{t}^{\parallel}\mid X_{t}^{\perp}=x)=\sum_{c\in\mathcal{C}^{\perp}}\pi_{x}(c)\,r_{t}^{c}.$$

Moreover, conditional on $B=b$, the normal component is $X_{t}^{\perp}=b+W_{t}^{\perp}$, and it is independent of $X_{t}^{\parallel}$. Hence the true conditional law and the surrogate initialization law factorize as

$$p_{t}^{*,b}(x^{\perp},x^{\parallel})=p_{t}(x^{\perp}\mid B=b)\,\mu_{t}^{b}(x^{\parallel}),\qquad\mu_{t}^{b}:=\sum_{c\in\mathcal{C}^{\perp}}\pi_{b}(c)\,r_{t}^{c},$$

and

$$\hat{p}_{t}^{\,b}(x^{\perp},x^{\parallel})=p_{t}(x^{\perp}\mid B=b)\,\nu_{t}^{x^{\perp}}(x^{\parallel}),\qquad\nu_{t}^{x}:=\sum_{c\in\mathcal{C}^{\perp}}\pi_{x}(c)\,r_{t}^{c}.$$

Therefore

$$\mathrm{KL}(p_{t}^{*,b}\|\hat{p}_{t}^{\,b})=\mathbb{E}\!\left[\mathrm{KL}(\mu_{t}^{b}\|\nu_{t}^{X_{t}^{\perp}})\,\middle|\,B=b\right].\tag{B.2}$$

Applying Lemma [17](https://arxiv.org/html/2605.05387#Thmtheorem17) and choosing the product coupling $\lambda=\pi_{b}\otimes\pi_{x}$, we obtain

$$\mathrm{KL}(\mu_{t}^{b}\|\nu_{t}^{x})\leq\sum_{c,\tilde{c}}\pi_{b}(c)\pi_{x}(\tilde{c})\,\mathrm{KL}(r_{t}^{c}\|r_{t}^{\tilde{c}}).$$

By Assumption [5.3](https://arxiv.org/html/2605.05387#S5.Thmassumption3),

$$\mathrm{KL}(r_{t}^{c}\|r_{t}^{\tilde{c}})\leq L_{t}\|c-\tilde{c}\|_{2}^{2},\qquad t\geq t_{0},$$

and hence

$$\mathrm{KL}(\mu_{t}^{b}\|\nu_{t}^{x})\leq L_{t}\sum_{c,\tilde{c}}\pi_{b}(c)\pi_{x}(\tilde{c})\|c-\tilde{c}\|_{2}^{2}.$$

Substituting into ([B.2](https://arxiv.org/html/2605.05387#A2.E2)) and averaging over $B$ gives

$$\mathbb{E}_{B}\!\left[\mathrm{KL}(p_{t}^{*,B}\|\hat{p}_{t}^{\,B})\right]\leq L_{t}\,\mathbb{E}\|S^{\perp}-\tilde{S}^{\perp}\|_{2}^{2},\tag{B.3}$$

where, conditional on $X_{t}^{\perp}$, the random variable $\tilde{S}^{\perp}$ is an independent draw from $\operatorname{Law}(S^{\perp}\mid X_{t}^{\perp})$.

Since

$$X_{t}^{\perp}=S^{\perp}+\varepsilon N^{\perp}+W_{t}^{\perp}=S^{\perp}+\sqrt{t+\varepsilon^{2}}\,G,\qquad G\sim\mathcal{N}(0,I_{m}),$$

Lemma [18](https://arxiv.org/html/2605.05387#Thmtheorem18) yields

$$\mathbb{E}\|S^{\perp}-\tilde{S}^{\perp}\|_{2}^{2}\leq 4(t+\varepsilon^{2})H(S^{\perp}).$$

Combining this with ([B.3](https://arxiv.org/html/2605.05387#A2.E3)) and setting $t=t^{*}$, we obtain

$$\mathbb{E}_{B}\!\left[\mathrm{KL}(p_{t^{*}}^{*,B}\|\hat{p}_{t^{*}}^{\,B})\right]\leq 4L_{t^{*}}(t^{*}+\varepsilon^{2})H(S^{\perp}).\tag{B.4}$$
For each $b$, let $\mathbb{P}^{*,b}$ be the path measure of the true conditional reverse SDE on $[\tau^{*},T-t_{0}]$, started from $p_{t^{*}}^{*,b}$, and let $\hat{\mathbb{P}}^{\,b}$ be the path measure of the surrogate reverse SDE on the same interval, started from $\hat{p}_{t^{*}}^{\,b}$. Let $\tilde{\mathbb{P}}^{\,b}$ denote the path measure obtained by running the surrogate reverse SDE from the true initial law $p_{t^{*}}^{*,b}$. Then

$$\mathrm{KL}(\mathbb{P}^{*,b}\|\hat{\mathbb{P}}^{\,b})=\mathrm{KL}(\mathbb{P}^{*,b}\|\tilde{\mathbb{P}}^{\,b})+\mathrm{KL}(p_{t^{*}}^{*,b}\|\hat{p}_{t^{*}}^{\,b}).$$

Averaging over $B$ yields

$$\mathbb{E}_{B}\!\big[\mathrm{KL}(\mathbb{P}^{*,B}\|\hat{\mathbb{P}}^{\,B})\big]=\mathbb{E}_{B}\!\big[\mathrm{KL}(\mathbb{P}^{*,B}\|\tilde{\mathbb{P}}^{\,B})\big]+\mathbb{E}_{B}\!\big[\mathrm{KL}(p_{t^{*}}^{*,B}\|\hat{p}_{t^{*}}^{\,B})\big].$$

By Theorem [4](https://arxiv.org/html/2605.05387#Thmtheorem4),

$$\mathbb{E}_{B}\!\big[\mathrm{KL}(\mathbb{P}^{*,B}\|\tilde{\mathbb{P}}^{\,B})\big]\leq I\!\big(Z^{\parallel};Z^{\perp}\mid X_{t^{*}}\big),$$

and by Proposition [6](https://arxiv.org/html/2605.05387#Thmtheorem6),

$$I\!\big(Z^{\parallel};Z^{\perp}\mid X_{t^{*}}\big)\leq I\!\big(S^{\parallel};S^{\perp}\mid X_{t^{*}}\big).$$

Together with ([B.4](https://arxiv.org/html/2605.05387#A2.E4)), this gives

$$\mathbb{E}_{B}\!\big[\mathrm{KL}(\mathbb{P}^{*,B}\|\hat{\mathbb{P}}^{\,B})\big]\leq 4L_{t^{*}}(t^{*}+\varepsilon^{2})H(S^{\perp})+I\!\big(S^{\parallel};S^{\perp}\mid X_{t^{*}}\big).$$
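The split of the path-space KL divergence into a pathwise term and an initialization term, used above, has a direct finite-state analogue that is easy to verify. The sketch below uses a hypothetical two-step Markov chain on five states: `K_true` and `K_sur` stand in for the true and surrogate dynamics, and `p0_true` and `p0_sur` for the two initial laws. None of these objects come from the paper.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) in nats for strictly positive arrays of equal shape."""
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(2)
n = 5
p0_true = rng.dirichlet(np.ones(n))          # true initial law
p0_sur = rng.dirichlet(np.ones(n))           # surrogate initial law
K_true = rng.dirichlet(np.ones(n), size=n)   # rows: true transition kernel
K_sur = rng.dirichlet(np.ones(n), size=n)    # rows: surrogate transition kernel

P_true = p0_true[:, None] * K_true    # true path law on (x0, x1)
P_hat = p0_sur[:, None] * K_sur       # surrogate dynamics, surrogate start
P_tilde = p0_true[:, None] * K_sur    # surrogate dynamics, true start

lhs = kl(P_true, P_hat)
rhs = kl(P_true, P_tilde) + kl(p0_true, p0_sur)
print(lhs, rhs)
```

The equality holds because the log-ratio of the path laws separates into a kernel part and an initial-law part, which is the same mechanism as in the continuous-time statement.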
Finally, the terminal tangent marginal is a measurable image of path space, so by data processing,

$$\mathbb{E}_{B}\!\left[\mathrm{KL}\!\big(\mu_{T-t_{0}}^{*,B}\,\|\,\hat{\mu}_{T-t_{0}}^{\,B}\big)\right]\leq\mathbb{E}_{B}\!\big[\mathrm{KL}(\mathbb{P}^{*,B}\|\hat{\mathbb{P}}^{\,B})\big].$$

This proves ([5.2](https://arxiv.org/html/2605.05387#S5.E2)).

## Appendix C Proof of Theorem [11](https://arxiv.org/html/2605.05387#Thmtheorem11)

The proof again separates initialization and pathwise contributions, but now the $\delta$-separation assumption upgrades both bounds from Shannon-scale control to exponential control. The initialization term is handled through the same mixture representation as above, followed by a posterior-resampling tail estimate for the effective normal Gaussian channel. The pathwise term is bounded through Theorem [4](https://arxiv.org/html/2605.05387#Thmtheorem4), the latent comparison proposition, and an exponential bound on the residual conditional entropy $H(S^{\perp}\mid X_{t^{*}}^{\perp})$.

###### Lemma 19

Let $S$ be a discrete random variable supported on a countable set $\mathcal{C}^{\perp}\subset\mathbb{R}^{m}$ with pmf $p_{S}$, and define

$$H_{1/2}(S):=2\log\sum_{c\in\mathcal{C}^{\perp}}\sqrt{p_{S}(c)}.$$

Fix $t>0$, and consider the Gaussian channel

$$X\mid(S=c)\sim\mathcal{N}(c,(t+\varepsilon^{2})I_{m}).$$

Let $p_{t}(x\mid c)$ denote this Gaussian density and let $p_{t}(c\mid x)$ be the posterior

$$p_{t}(c\mid x)=\frac{p_{S}(c)\,p_{t}(x\mid c)}{\sum_{u\in\mathcal{C}^{\perp}}p_{S}(u)\,p_{t}(x\mid u)}.$$

For each $c^{*}\in\mathcal{C}^{\perp}$, define the posterior-resampling kernel

$$K_{t}(c,c^{*}):=\int p_{t}(c\mid x)\,p_{t}(x\mid c^{*})\,dx.$$

Let $S^{*}\sim p_{S}$, and conditional on $S^{*}=c^{*}$, let $\tilde{S}$ have pmf $K_{t}(\cdot,c^{*})$. Set

$$R:=\|\tilde{S}-S^{*}\|_{2}.$$

Then for every $r\geq 0$,

$$\mathbb{P}(R\geq r)\leq\frac{1}{2}\exp\!\Big(H_{1/2}(S)-\frac{r^{2}}{8(t+\varepsilon^{2})}\Big).$$

*Proof.* Fix $c,c^{*}\in\mathcal{C}^{\perp}$ and $x\in\mathbb{R}^{m}$. By Bayes' rule,

$$p_{t}(c\mid x)\leq\frac{p_{S}(c)p_{t}(x\mid c)}{p_{S}(c)p_{t}(x\mid c)+p_{S}(c^{*})p_{t}(x\mid c^{*})}.$$

Writing $A:=p_{S}(c)p_{t}(x\mid c)$ and $B:=p_{S}(c^{*})p_{t}(x\mid c^{*})$, the inequality $A+B\geq 2\sqrt{AB}$ gives

$$\frac{A}{A+B}\leq\frac{1}{2}\sqrt{\frac{A}{B}}=\frac{1}{2}\sqrt{\frac{p_{S}(c)}{p_{S}(c^{*})}}\sqrt{\frac{p_{t}(x\mid c)}{p_{t}(x\mid c^{*})}}.$$

Multiplying by $p_{t}(x\mid c^{*})$ and integrating over $x$, we obtain

$$K_{t}(c,c^{*})\leq\frac{1}{2}\sqrt{\frac{p_{S}(c)}{p_{S}(c^{*})}}\int\sqrt{p_{t}(x\mid c)p_{t}(x\mid c^{*})}\,dx.$$

For isotropic Gaussians with covariance $(t+\varepsilon^{2})I_{m}$, the Hellinger affinity is

$$\int\sqrt{p_{t}(x\mid c)p_{t}(x\mid c^{*})}\,dx=\exp\!\Big(-\frac{\|c-c^{*}\|_{2}^{2}}{8(t+\varepsilon^{2})}\Big).$$

Hence

$$K_{t}(c,c^{*})\leq\frac{1}{2}\sqrt{\frac{p_{S}(c)}{p_{S}(c^{*})}}\exp\!\Big(-\frac{\|c-c^{*}\|_{2}^{2}}{8(t+\varepsilon^{2})}\Big).$$

Let

$$M:=\sum_{c\in\mathcal{C}^{\perp}}\sqrt{p_{S}(c)},\qquad M^{2}=e^{H_{1/2}(S)}.$$

Then

$$\mathbb{P}(R\geq r\mid S^{*}=c^{*})=\sum_{\|c-c^{*}\|\geq r}K_{t}(c,c^{*})\leq\frac{M}{2\sqrt{p_{S}(c^{*})}}e^{-r^{2}/(8(t+\varepsilon^{2}))}.$$

Averaging over $S^{*}\sim p_{S}$, and using $\sum_{c^{*}}p_{S}(c^{*})\cdot\frac{M}{2\sqrt{p_{S}(c^{*})}}=\frac{M^{2}}{2}$, yields

$$\mathbb{P}(R\geq r)\leq\frac{1}{2}\exp\!\Big(H_{1/2}(S)-\frac{r^{2}}{8(t+\varepsilon^{2})}\Big).$$
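The closed form of the Gaussian Hellinger affinity used in the proof above can be confirmed by quadrature in one dimension. In the sketch below, `var` plays the role of $t+\varepsilon^{2}$ and `c`, `c_star` are two arbitrary code points; the specific numbers and the helper names are illustrative assumptions.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def hellinger_affinity(c, c_star, var, n_grid=200001):
    """Riemann-sum approximation of the integral of sqrt(p(x|c) p(x|c*)) over x."""
    lo = min(c, c_star) - 12 * np.sqrt(var)
    hi = max(c, c_star) + 12 * np.sqrt(var)
    x = np.linspace(lo, hi, n_grid)
    integrand = np.sqrt(gaussian_pdf(x, c, var) * gaussian_pdf(x, c_star, var))
    return float(np.sum(integrand) * (x[1] - x[0]))

var = 0.7                 # stands in for t + eps^2
c, c_star = -0.5, 1.8     # two hypothetical code points
numeric = hellinger_affinity(c, c_star, var)
closed_form = float(np.exp(-(c - c_star) ** 2 / (8 * var)))
print(numeric, closed_form)
```

In $m$ dimensions the integral factorizes along an orthonormal basis containing the direction $c-c^{*}$, which reduces the general case to this scalar computation.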
###### Lemma 20

Assume Assumption [5.4](https://arxiv.org/html/2605.05387#S5.Thmassumption4) and $H_{1/2}(S^{\perp})<\infty$. Then

$$H(S^{\perp}\mid X_{t^{*}}^{\perp})\leq 2\exp\!\Big(H_{1/2}(S^{\perp})-\frac{\delta^{2}}{8(t^{*}+\varepsilon^{2})}\Big).$$

Proof. Write $\pi_{c}:=\mathbb{P}(S^{\perp}=c)$ for $c\in\mathcal{C}^{\perp}$, and fix $c^{*}\in\mathcal{C}^{\perp}$. Conditional on $S^{\perp}=c^{*}$,

$$
X_{t^{*}}^{\perp}=c^{*}+\sqrt{t^{*}+\varepsilon^{2}}\,G,\qquad G\sim\mathcal{N}(0,I_{m}).
$$

For $x\in\mathbb{R}^{m}$, let

$$
p_{x}(c):=\mathbb{P}(S^{\perp}=c\mid X_{t^{*}}^{\perp}=x).
$$

Then

$$
H(S^{\perp}\mid X_{t^{*}}^{\perp}=x)=\sum_{c\in\mathcal{C}^{\perp}}p_{x}(c)\log\frac{1}{p_{x}(c)}.
$$

For $c\neq c^{*}$, define

$$
l_{c}(x):=\frac{p_{x}(c)}{p_{x}(c^{*})},\qquad R(x):=\sum_{c\neq c^{*}}l_{c}(x).
$$

Then

$$
p_{x}(c^{*})=\frac{1}{1+R(x)},\qquad p_{x}(c)=\frac{l_{c}(x)}{1+R(x)}\quad(c\neq c^{*}),
$$

and therefore

$$
H(S^{\perp}\mid X_{t^{*}}^{\perp}=x)=\sum_{c\neq c^{*}}\frac{l_{c}(x)}{1+R(x)}\log\frac{1}{l_{c}(x)}+\log(1+R(x)).
$$

Using

$$
u\log\frac{1}{u}\leq\sqrt{u}\quad(0<u\leq 1),\qquad\log(1+v)\leq\sqrt{v}\quad(v\geq 0),
$$

together with $1/(1+R(x))\leq 1$ and $\sqrt{R(x)}\leq\sum_{c\neq c^{*}}\sqrt{l_{c}(x)}$, and discarding the nonpositive terms with $l_{c}(x)>1$, we obtain

$$
H(S^{\perp}\mid X_{t^{*}}^{\perp}=x)\leq 2\sum_{c\neq c^{*}}\sqrt{l_{c}(x)}.
$$
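The two elementary inequalities used in this step are easy to sanity-check numerically. The sketch below is not part of the paper: the helper name `check_elementary_inequalities` and the grid sizes are illustrative assumptions, and a grid scan is only a spot check, not a proof.

```python
import math


def check_elementary_inequalities(n_grid=10_000, v_max=1e4):
    """Return the largest observed value of (lhs - rhs) for each inequality.

    Inequality 1: u * log(1/u) <= sqrt(u)  for 0 < u <= 1.
    Inequality 2: log(1 + v)   <= sqrt(v)  for v >= 0.
    A non-positive return value means no violation was seen on the grid.
    """
    worst_u = -math.inf
    for i in range(1, n_grid + 1):
        u = i / n_grid  # u ranges over (0, 1]
        worst_u = max(worst_u, u * math.log(1.0 / u) - math.sqrt(u))

    worst_v = -math.inf
    for i in range(n_grid + 1):
        v = v_max * (i / n_grid) ** 4  # grid on [0, v_max], denser near 0
        worst_v = max(worst_v, math.log1p(v) - math.sqrt(v))

    return worst_u, worst_v


worst_u, worst_v = check_elementary_inequalities()
print(f"max of u*log(1/u) - sqrt(u): {worst_u:.3e}")
print(f"max of log(1+v) - sqrt(v):   {worst_v:.3e}")
```

Both maxima should come out non-positive on the grid, in line with the inequalities as stated.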
By Bayes' rule,

$$
l_{c}(x)=\frac{\pi_{c}}{\pi_{c^{*}}}\,\frac{\varphi_{\sigma_{*}^{2}}(x-c)}{\varphi_{\sigma_{*}^{2}}(x-c^{*})},\qquad\sigma_{*}^{2}:=t^{*}+\varepsilon^{2},
$$

where $\varphi_{\sigma_{*}^{2}}$ is the Gaussian density with covariance $\sigma_{*}^{2}I_{m}$. Writing $x=c^{*}+w$, we obtain

$$
\sqrt{l_{c}(x)}=\sqrt{\frac{\pi_{c}}{\pi_{c^{*}}}}\exp\!\left(-\frac{\|c-c^{*}\|_{2}^{2}}{4\sigma_{*}^{2}}+\frac{\langle w,c-c^{*}\rangle}{2\sigma_{*}^{2}}\right).
$$

Taking expectation over $w\sim\mathcal{N}(0,\sigma_{*}^{2}I_{m})$,

$$
\mathbb{E}\!\left[\sqrt{l_{c}(X_{t^{*}}^{\perp})}\,\middle|\,S^{\perp}=c^{*}\right]=\sqrt{\frac{\pi_{c}}{\pi_{c^{*}}}}\exp\!\left(-\frac{\|c-c^{*}\|_{2}^{2}}{8\sigma_{*}^{2}}\right).
$$

By $\delta$-separation,

$$
\|c-c^{*}\|_{2}\geq\delta\qquad(c\neq c^{*}),
$$

and hence

$$
\mathbb{E}\!\left[\sqrt{l_{c}(X_{t^{*}}^{\perp})}\,\middle|\,S^{\perp}=c^{*}\right]\leq\sqrt{\frac{\pi_{c}}{\pi_{c^{*}}}}\,e^{-\delta^{2}/(8\sigma_{*}^{2})}.
$$

Therefore

$$
\mathbb{E}\!\left[H(S^{\perp}\mid X_{t^{*}}^{\perp})\,\middle|\,S^{\perp}=c^{*}\right]\leq 2e^{-\delta^{2}/(8\sigma_{*}^{2})}\sum_{c\neq c^{*}}\sqrt{\frac{\pi_{c}}{\pi_{c^{*}}}}.
$$

Averaging over $S^{\perp}$ gives

$$
H(S^{\perp}\mid X_{t^{*}}^{\perp})\leq 2e^{-\delta^{2}/(8\sigma_{*}^{2})}\sum_{c^{*}}\sqrt{\pi_{c^{*}}}\sum_{c\neq c^{*}}\sqrt{\pi_{c}}\leq 2e^{-\delta^{2}/(8\sigma_{*}^{2})}\Big(\sum_{c\in\mathcal{C}^{\perp}}\sqrt{\pi_{c}}\Big)^{2}.
$$

Since

$$
\Big(\sum_{c\in\mathcal{C}^{\perp}}\sqrt{\pi_{c}}\Big)^{2}=e^{H_{1/2}(S^{\perp})},
$$

the claim follows.
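As an illustration of the lemma, the following sketch compares the conditional entropy with the bound on a toy one-dimensional codebook ($m=1$). Everything here is an assumption made for the example: the codebook, prior, noise level `sigma` (standing in for $\sigma_{*}=\sqrt{t^{*}+\varepsilon^{2}}$), and the helper names. The entropy is computed by trapezoidal integration over a grid rather than in closed form.

```python
import math


def gaussian_pdf(x, mean, sigma):
    """Density of N(mean, sigma^2) at x."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def conditional_entropy_1d(codebook, prior, sigma, n_grid=20_001, pad=10.0):
    """Approximate H(S | X) in nats for X = S + sigma * G via the trapezoid rule."""
    lo = min(codebook) - pad * sigma
    hi = max(codebook) + pad * sigma
    dx = (hi - lo) / (n_grid - 1)
    total = 0.0
    for i in range(n_grid):
        x = lo + i * dx
        joint = [p * gaussian_pdf(x, c, sigma) for c, p in zip(codebook, prior)]
        px = sum(joint)  # marginal density of X at x
        if px <= 0.0:
            continue
        h = 0.0  # entropy of the posterior p_x(.)
        for j in joint:
            if j > 0.0:
                post = j / px
                h -= post * math.log(post)
        weight = 0.5 if i in (0, n_grid - 1) else 1.0
        total += weight * px * h * dx
    return total


def lemma20_bound(prior, delta, sigma):
    """2 * exp(H_{1/2}(S) - delta^2 / (8 sigma^2)), with H_{1/2} = 2 log sum sqrt(pi_c)."""
    renyi_half = 2.0 * math.log(sum(math.sqrt(p) for p in prior))
    return 2.0 * math.exp(renyi_half - delta ** 2 / (8.0 * sigma ** 2))


# Hypothetical toy instance: four atoms with minimum separation delta = 3.
codebook = [0.0, 3.0, 6.0, 10.0]
prior = [0.4, 0.3, 0.2, 0.1]
delta = 3.0
sigma = 0.5

entropy = conditional_entropy_1d(codebook, prior, sigma)
bound = lemma20_bound(prior, delta, sigma)
print(f"H(S | X) ~ {entropy:.5f} nats, bound = {bound:.5f} nats")
```

Shrinking `sigma` relative to `delta` should drive both quantities toward zero, which matches the exponential dependence on $\delta^{2}/\sigma_{*}^{2}$ in the lemma.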

Proof. Let

$$
\sigma_{*}^{2}:=t^{*}+\varepsilon^{2},\qquad H_{1/2}:=H_{1/2}(S^{\perp}).
$$

As in the proof of Theorem [9](https://arxiv.org/html/2605.05387#Thmtheorem9), the true and surrogate tangent laws at time $t$ can be written as mixtures over $S^{\perp}$, and Lemma [17](https://arxiv.org/html/2605.05387#Thmtheorem17) therefore yields

$$
\mathrm{KL}(p_{t}^{*,b}\,\|\,\hat{p}_{t}^{\,b})\leq\sum_{c,\tilde{c}}\pi_{b}(c)\,\pi_{x}(\tilde{c})\,\mathrm{KL}(r_{t}^{c}\,\|\,r_{t}^{\tilde{c}}).
$$

Using Assumption [5.3](https://arxiv.org/html/2605.05387#S5.Thmassumption3) at time $t=t^{*}$, we get

$$
\mathrm{KL}(r_{t^{*}}^{c}\,\|\,r_{t^{*}}^{\tilde{c}})\leq L_{t^{*}}\|c-\tilde{c}\|_{2}^{2}.
$$

Hence

$$
\mathbb{E}_{B}\!\Big[\mathrm{KL}\!\big(p_{t^{*}}^{*,B}\,\|\,\hat{p}_{t^{*}}^{\,B}\big)\Big]\leq L_{t^{*}}\,\mathbb{E}\|{S^{*}}^{\perp}-\tilde{S}^{\perp}\|_{2}^{2},
$$

where ${S^{*}}^{\perp}\sim p_{S^{\perp}}$ and, conditional on ${S^{*}}^{\perp}=c^{*}$, the variable $\tilde{S}^{\perp}$ is drawn from the posterior-resampling kernel of the effective channel

$$
X_{t^{*}}^{\perp}\mid(S^{\perp}=c)\sim\mathcal{N}(c,\sigma_{*}^{2}I_{m}).
$$
Let

$$
R:=\|{S^{*}}^{\perp}-\tilde{S}^{\perp}\|_{2}.
$$

By Assumption [5.4](https://arxiv.org/html/2605.05387#S5.Thmassumption4), either $R=0$ or $R\geq\delta$. Therefore

$$
\mathbb{E}[R^{2}]=\int_{0}^{\infty}\mathbb{P}(R^{2}\geq s)\,ds=\int_{0}^{\delta^{2}}\mathbb{P}(R\geq\delta)\,ds+\int_{\delta^{2}}^{\infty}\mathbb{P}(R\geq\sqrt{s})\,ds.
$$

Applying Lemma [19](https://arxiv.org/html/2605.05387#Thmtheorem19),

$$
\mathbb{P}(R\geq r)\leq\frac{1}{2}\exp\!\Big(H_{1/2}-\frac{r^{2}}{8\sigma_{*}^{2}}\Big),
$$

we obtain

$$
\mathbb{E}[R^{2}]\leq\frac{\delta^{2}}{2}\exp\!\Big(H_{1/2}-\frac{\delta^{2}}{8\sigma_{*}^{2}}\Big)+\int_{\delta^{2}}^{\infty}\frac{1}{2}\exp\!\Big(H_{1/2}-\frac{s}{8\sigma_{*}^{2}}\Big)\,ds.
$$

Evaluating the integral yields

$$
\mathbb{E}[R^{2}]\leq\Big(\frac{\delta^{2}}{2}+4\sigma_{*}^{2}\Big)\exp\!\Big(H_{1/2}-\frac{\delta^{2}}{8\sigma_{*}^{2}}\Big),
$$

and therefore

$$
\mathbb{E}_{B}\!\Big[\mathrm{KL}\!\big(p_{t^{*}}^{*,B}\,\|\,\hat{p}_{t^{*}}^{\,B}\big)\Big]\leq L_{t^{*}}\Big(\frac{\delta^{2}}{2}+4\sigma_{*}^{2}\Big)\exp\!\Big(H_{1/2}-\frac{\delta^{2}}{8\sigma_{*}^{2}}\Big).
$$
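The integral evaluated in the bound on $\mathbb{E}[R^{2}]$ above is an elementary exponential integral. A worked version, using only symbols already defined in the proof:

```latex
\begin{aligned}
\int_{\delta^{2}}^{\infty}\frac{1}{2}\exp\!\Big(H_{1/2}-\frac{s}{8\sigma_{*}^{2}}\Big)\,ds
 &= \frac{1}{2}\,e^{H_{1/2}}\Big[-8\sigma_{*}^{2}\,e^{-s/(8\sigma_{*}^{2})}\Big]_{s=\delta^{2}}^{s=\infty} \\
 &= 4\sigma_{*}^{2}\exp\!\Big(H_{1/2}-\frac{\delta^{2}}{8\sigma_{*}^{2}}\Big).
\end{aligned}
```

Adding the first term, $\frac{\delta^{2}}{2}\exp\!\big(H_{1/2}-\frac{\delta^{2}}{8\sigma_{*}^{2}}\big)$, gives the prefactor $\frac{\delta^{2}}{2}+4\sigma_{*}^{2}$.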
For the pathwise term, Theorem [4](https://arxiv.org/html/2605.05387#Thmtheorem4) gives

$$
\mathbb{E}_{B}\!\Big[\mathrm{KL}\!\big(\mathbb{P}^{Y^{*,B}}\,\|\,\mathbb{P}^{\tilde{Y}^{\,B}}\big)\Big]\leq I\!\big(Z^{\parallel};Z^{\perp}\mid X_{t^{*}}\big).
$$

Using Proposition [6](https://arxiv.org/html/2605.05387#Thmtheorem6),

$$
I\!\big(Z^{\parallel};Z^{\perp}\mid X_{t^{*}}\big)\leq I\!\big(S^{\parallel};S^{\perp}\mid X_{t^{*}}\big).
$$

Since $S^{\perp}$ is discrete, this conditional mutual information is at most the conditional entropy of $S^{\perp}$; and since $X_{t^{*}}^{\perp}$ is a function of $X_{t^{*}}$, conditioning on it alone cannot decrease that entropy:

$$
I\!\big(S^{\parallel};S^{\perp}\mid X_{t^{*}}\big)\leq H(S^{\perp}\mid X_{t^{*}})\leq H(S^{\perp}\mid X_{t^{*}}^{\perp}).
$$

Lemma [20](https://arxiv.org/html/2605.05387#Thmtheorem20) now implies

$$
H(S^{\perp}\mid X_{t^{*}}^{\perp})\leq 2\exp\!\Big(H_{1/2}-\frac{\delta^{2}}{8\sigma_{*}^{2}}\Big).
$$

Hence

$$
\mathbb{E}_{B}\!\Big[\mathrm{KL}\!\big(\mathbb{P}^{Y^{*,B}}\,\|\,\mathbb{P}^{\tilde{Y}^{\,B}}\big)\Big]\leq 2\exp\!\Big(H_{1/2}-\frac{\delta^{2}}{8\sigma_{*}^{2}}\Big).
$$
Combining the initialization and pathwise bounds yields

$$
\mathbb{E}_{B}\!\Big[\mathrm{KL}\!\big(p_{t^{*}}^{*,B}\,\|\,\hat{p}_{t^{*}}^{\,B}\big)+\mathrm{KL}\!\big(\mathbb{P}^{Y^{*,B}}\,\|\,\mathbb{P}^{\hat{Y}^{\,B}}\big)\Big]\leq L_{t^{*}}\Big(\frac{\delta^{2}}{2}+4\sigma_{*}^{2}\Big)\exp\!\Big(H_{1/2}-\frac{\delta^{2}}{8\sigma_{*}^{2}}\Big)+2\exp\!\Big(H_{1/2}-\frac{\delta^{2}}{8\sigma_{*}^{2}}\Big),
$$

which is ([11](https://arxiv.org/html/2605.05387#S5.Ex44)).

Finally, the terminal tangent marginal is a measurable image of path space, so by data processing,

$$
\mathbb{E}_{B}\!\Big[\mathrm{KL}\!\big(\mu_{T-t_{0}}^{*,B}\,\|\,\hat{\mu}_{T-t_{0}}^{\,B}\big)\Big]\leq\mathbb{E}_{B}\!\Big[\mathrm{KL}\!\big(p_{t^{*}}^{*,B}\,\|\,\hat{p}_{t^{*}}^{\,B}\big)+\mathrm{KL}\!\big(\mathbb{P}^{Y^{*,B}}\,\|\,\mathbb{P}^{\hat{Y}^{\,B}}\big)\Big].
$$

This proves ([5.4](https://arxiv.org/html/2605.05387#S5.E4)).

## Appendix D DDPM implementation of the VP normal correction

This appendix records the VP/DDPM form of the normal correction used in the experiments. Consider the forward marginal

$$
X_{t}=\alpha_{t}Z+\sigma_{t}\xi,\qquad\xi\sim\mathcal{N}(0,I_{d}),
$$

and condition on $B=P_{\perp}Z=b$. Let $p_{t}^{*,b}$ be the density of $X_{t}\mid B=b$, with score $s_{t}^{*,b}=\nabla\log p_{t}^{*,b}$. The VP Tweedie identity gives

$$
s_{t}^{*,b}(x)=\frac{\alpha_{t}\,\mathbb{E}[Z\mid X_{t}=x,B=b]-x}{\sigma_{t}^{2}}.
$$

Projecting onto the normal space and using $P_{\perp}Z=b$ under the conditioning,

$$
P_{\perp}\mathbb{E}[Z\mid X_{t}=x,B=b]=b,
$$

we obtain

$$
P_{\perp}s_{t}^{*,b}(x)=\frac{\alpha_{t}b-P_{\perp}x}{\sigma_{t}^{2}}.
$$

Thus the normal correction used in the DDPM implementation is

$$
\frac{\alpha_{t}b-P_{\perp}x_{t}}{\sigma_{t}^{2}}.
$$
We next relate this expression to the DDNM-style projected denoising update. For a pretrained VP/DDPM model, the usual Tweedie denoiser is

$$
\hat{z}_{0}(x_{t})=\frac{x_{t}+\sigma_{t}^{2}s_{t}(x_{t})}{\alpha_{t}}.
$$

DDNM replaces the normal component of this denoised estimate by the observed level $b$:

$$
\tilde{z}_{0}(x_{t};y)=P_{\parallel}\hat{z}_{0}(x_{t})+b.
$$

The score associated with this projected denoiser is obtained by inverting Tweedie's formula:

$$
\hat{s}_{t}^{\rm DDNM}(x_{t};y)=\frac{\alpha_{t}\tilde{z}_{0}(x_{t};y)-x_{t}}{\sigma_{t}^{2}}.
$$

Substituting the expression for $\tilde{z}_{0}$,

$$
\begin{aligned}
\hat{s}_{t}^{\rm DDNM}(x_{t};y)
 &=\frac{\alpha_{t}P_{\parallel}\hat{z}_{0}(x_{t})+\alpha_{t}b-x_{t}}{\sigma_{t}^{2}}\\
 &=\frac{P_{\parallel}x_{t}+\sigma_{t}^{2}P_{\parallel}s_{t}(x_{t})+\alpha_{t}b-P_{\parallel}x_{t}-P_{\perp}x_{t}}{\sigma_{t}^{2}}\\
 &=P_{\parallel}s_{t}(x_{t})+\frac{\alpha_{t}b-P_{\perp}x_{t}}{\sigma_{t}^{2}}.
\end{aligned}
$$

Therefore the DDPM/DDIM implementation used in the experiments applies the closed-form VP normal correction together with the pretrained tangent score.
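The identity derived above can be checked numerically on a toy example. The sketch below is illustrative only and is not the authors' code: it assumes a coordinate-mask projector (as in inpainting), where `mask[i] == 1` marks an observed coordinate handled by $P_{\perp}$ and `mask[i] == 0` a free coordinate handled by $P_{\parallel}$. The vector `score` is a random stand-in for the output of a pretrained score network, and all function names are hypothetical.

```python
import random


def project(mask, v, keep):
    """Keep the coordinates of v where mask == keep and zero out the rest."""
    return [vi if m == keep else 0.0 for m, vi in zip(mask, v)]


def ddnm_score(x_t, score, b, mask, alpha_t, sigma_t):
    """Score obtained by projecting the Tweedie denoiser and inverting Tweedie."""
    var = sigma_t ** 2
    z_hat = [(x + var * s) / alpha_t for x, s in zip(x_t, score)]  # Tweedie denoiser
    z_par = project(mask, z_hat, keep=0)                           # P_par z_hat
    z_tilde = [zp + bi for zp, bi in zip(z_par, b)]                # P_par z_hat + b
    return [(alpha_t * z - x) / var for z, x in zip(z_tilde, x_t)]


def closed_form_score(x_t, score, b, mask, alpha_t, sigma_t):
    """Tangent score plus the closed-form VP normal correction."""
    var = sigma_t ** 2
    s_par = project(mask, score, keep=0)   # P_par s_t(x_t)
    x_perp = project(mask, x_t, keep=1)    # P_perp x_t
    return [sp + (alpha_t * bi - xp) / var for sp, bi, xp in zip(s_par, b, x_perp)]


rng = random.Random(0)
d = 8
mask = [1, 0, 1, 0, 0, 1, 0, 0]           # hypothetical observation pattern
alpha_t, sigma_t = 0.8, 0.6                # VP schedule point: alpha^2 + sigma^2 = 1
x_t = [rng.gauss(0.0, 1.0) for _ in range(d)]
score = [rng.gauss(0.0, 1.0) for _ in range(d)]   # stand-in for s_t(x_t)
z = [rng.gauss(0.0, 1.0) for _ in range(d)]       # stand-in clean signal
b = project(mask, z, keep=1)                      # b = P_perp z

lhs = ddnm_score(x_t, score, b, mask, alpha_t, sigma_t)
rhs = closed_form_score(x_t, score, b, mask, alpha_t, sigma_t)
max_gap = max(abs(a - c) for a, c in zip(lhs, rhs))
print(f"max |DDNM score - closed form| = {max_gap:.2e}")
```

The two expressions should agree up to floating-point rounding for any choice of `x_t`, `score`, and `b`, since the derivation is purely algebraic and does not depend on how the score was produced.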

## References

- Cheng et al. (2018) Xiang Cheng, Niladri S Chatterji, Peter L Bartlett, and Michael I Jordan. Underdamped langevin mcmc: A non-asymptotic analysis. In *Conference on learning theory*, pages 300–323. PMLR, 2018.
- Choi et al. (2021) Jooyoung Choi, Sungwon Kim, Yonghyun Jeong, Youngjune Gwon, and Sungroh Yoon. Ilvr: Conditioning method for denoising diffusion probabilistic models. *arXiv preprint arXiv:2108.02938*, 2021.
- Chung et al. (2022) Hyungjin Chung, Jeongsol Kim, Michael T Mccann, Marc L Klasky, and Jong Chul Ye. Diffusion posterior sampling for general noisy inverse problems. *arXiv preprint arXiv:2209.14687*, 2022.
- Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*, pages 248–255, 2009. doi:10.1109/CVPR.2009.5206848.
- Dhariwal and Nichol (2021) Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. *Advances in neural information processing systems*, 34:8780–8794, 2021.
- Didi et al. (2023) Kieran Didi, Francisco Vargas, Simon V Mathis, Vincent Dutordoir, Emile Mathieu, Urszula J Komorowska, and Pietro Lio. A framework for conditional diffusion modelling with applications in motif scaffolding for protein design. *arXiv preprint arXiv:2312.09236*, 2023.
- Dou and Song (2024) Zehao Dou and Yang Song. Diffusion posterior sampling for linear inverse problem solving: A filtering perspective. In *The Twelfth International Conference on Learning Representations*, 2024.
- Durrett (2019) Rick Durrett. *Probability: theory and examples*, volume 49. Cambridge university press, 2019.
- Fan et al. (2023) Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, Kangwook Lee, and Kimin Lee. Dpok: Reinforcement learning for fine-tuning text-to-image diffusion models. *Advances in Neural Information Processing Systems*, 36:79858–79885, 2023.
- Guo et al. (2005) Dongning Guo, Shlomo Shamai, and Sergio Verdú. Mutual information and minimum mean-square error in gaussian channels. *IEEE Transactions on Information Theory*, 51(4):1261–1282, 2005. doi:10.1109/TIT.2005.844072. URL [https://arxiv.org/abs/cs/0412108](https://arxiv.org/abs/cs/0412108).
- Guo et al. (2026) Zhengyi Guo, Wenpin Tang, and Renyuan Xu. Conditional diffusion guidance under hard constraint: A stochastic analysis approach. *arXiv preprint arXiv:2602.05533*, 2026.
- Ho and Salimans (2022) Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. *arXiv preprint arXiv:2207.12598*, 2022.
- Ho et al. (2020) Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. *Advances in neural information processing systems*, 33:6840–6851, 2020.
- Karatzas and Shreve (2014) Ioannis Karatzas and Steven Shreve. *Brownian motion and stochastic calculus*. springer, 2014.
- Karras et al. (2018) Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. In *International Conference on Learning Representations (ICLR)*, 2018. URL [https://arxiv.org/abs/1710.10196](https://arxiv.org/abs/1710.10196).
- Kawar et al. (2022) Bahjat Kawar, Michael Elad, Stefano Ermon, and Jiaming Song. Denoising diffusion restoration models. *Advances in neural information processing systems*, 35:23593–23606, 2022.
- Lamperski (2021) Andrew Lamperski. Projected stochastic gradient langevin algorithms for constrained sampling and non-convex learning. In *Conference on Learning Theory*, pages 2891–2937. PMLR, 2021.
- Leimkuhler and Matthews (2013) Benedict Leimkuhler and Charles Matthews. Robust and efficient configurational molecular sampling via langevin dynamics. *The Journal of chemical physics*, 138(17), 2013.
- Liang et al. (2025) Yuchen Liang, Peizhong Ju, Yingbin Liang, and Ness Shroff. Theory on score-mismatched diffusion models and zero-shot conditional samplers. In *The Thirteenth International Conference on Learning Representations*, 2025.
- Lugmayr et al. (2022) Andreas Lugmayr, Martin Danelljan, Antonio Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. Repaint: Inpainting using denoising diffusion probabilistic models. In *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)*, pages 11451–11461, 2022. doi:10.1109/CVPR52688.2022.01117. URL [https://arxiv.org/abs/2201.09865](https://arxiv.org/abs/2201.09865).
- Meng et al. (2021) Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Guided image synthesis and editing with stochastic differential equations. *arXiv preprint arXiv:2108.01073*, 2021.
- Song et al. (2020a) Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. *arXiv preprint arXiv:2010.02502*, 2020a.
- Song and Ermon (2019) Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. *Advances in neural information processing systems*, 32, 2019.
- Song et al. (2020b) Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. *arXiv preprint arXiv:2011.13456*, 2020b.
- Song et al. (2021) Yang Song, Liyue Shen, Lei Xing, and Stefano Ermon. Solving inverse problems in medical imaging with score-based generative models. *arXiv preprint arXiv:2111.08005*, 2021.
- Uehara et al. (2024) Masatoshi Uehara, Yulai Zhao, Tommaso Biancalani, and Sergey Levine. Understanding reinforcement learning-based fine-tuning of diffusion models: A tutorial and review. *arXiv preprint arXiv:2407.13734*, 2024.
- Wang et al. (2022) Yinhuai Wang, Jiwen Yu, and Jian Zhang. Zero-shot image restoration using denoising diffusion null-space model. *arXiv preprint arXiv:2212.00490*, 2022.
- Wu et al. (2023) Luhuan Wu, Brian Trippe, Christian Naesseth, David Blei, and John P Cunningham. Practical and asymptotically exact conditional sampling in diffusion models. *Advances in Neural Information Processing Systems*, 36:31372–31403, 2023.
- Yu et al. (2015) Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. *arXiv preprint arXiv:1506.03365*, 2015. URL [https://arxiv.org/abs/1506.03365](https://arxiv.org/abs/1506.03365).
- Zhao et al. (2025) Yulai Zhao, Masatoshi Uehara, Gabriele Scalia, Sunyuan Kung, Tommaso Biancalani, Sergey Levine, and Ehsan Hajiramezanali. Adding conditional control to diffusion models with reinforcement learning. In *The Thirteenth International Conference on Learning Representations*, 2025.
- Zhou et al. (2024) Linqi Zhou, Aaron Lou, Samar Khanna, and Stefano Ermon. Denoising diffusion bridge models. In *The Twelfth International Conference on Learning Representations*, 2024.
