Conditional Diffusion Under Linear Constraints: Langevin Mixing and Information-Theoretic Guarantees
Summary
This paper analyzes zero-shot conditional sampling with pretrained diffusion models for linear inverse problems, providing information-theoretic guarantees and proposing a projected-Langevin initialization method.
# Conditional Diffusion Under Linear Constraints: Langevin Mixing and Information-Theoretic Guarantees
Source: [https://arxiv.org/html/2605.05387](https://arxiv.org/html/2605.05387)
Ahmad Aghapour (aghapour@umich.edu), Erhan Bayraktar (erhan@umich.edu), Asaf Cohen (asafc@umich.edu)

Department of Mathematics, University of Michigan, Ann Arbor, MI 48109, USA
###### Abstract
We study zero\-shot conditional sampling with pretrained diffusion models for linear inverse problems, including inpainting and super\-resolution\. In these problems, the observation determines only part of the unknown signal\. The remaining degrees of freedom must be sampled according to the correct conditional data distribution\. Existing projection\-based samplers enforce measurement consistency by correcting the observed component during reverse diffusion\. However, measurement consistency alone does not determine how probability mass should be distributed along the feasible set, and this can lead to biased conditional samples\.
We analyze this issue through a normal–tangent decomposition of the score function\. For Gaussian noising, the observed\-direction score is exactly determined by the measurement; only the tangent conditional score is unknown\. We prove that the error from replacing this score by the unconditional tangent score is upper bounded by a dimension\-free conditional mutual information between observed and unobserved components\. This gives an information\-theoretic decomposition into initialization and pathwise score\-mismatch errors\. Motivated by the theory, we propose a projected\-Langevin initialization followed by guided reverse denoising, which outperforms a strong projection\-based baseline in inpainting and super\-resolution experiments\.
Keywords: diffusion models, inverse problems, Langevin dynamics, information-theoretic bounds, conditional sampling
## 1 Introduction
Diffusion models have become a standard tool for high-dimensional generative modeling. Given samples from a data distribution, a diffusion model learns the score of progressively noised versions of the data and then generates new samples by simulating a reverse-time denoising process (Song and Ermon, [2019](https://arxiv.org/html/2605.05387#bib.bib23); Ho et al., [2020](https://arxiv.org/html/2605.05387#bib.bib13); Song et al., [2020b](https://arxiv.org/html/2605.05387#bib.bib24), [a](https://arxiv.org/html/2605.05387#bib.bib22)). In many applications, however, generation is not unconditional. In image restoration, for example, one observes a corrupted image and wants to sample clean images that are both realistic under a pretrained image prior and consistent with the observation.
This paper studies such conditional sampling problems for noiseless linear observations. Let $Z\in\mathbb{R}^{d}$ denote the clean signal and suppose that

$$y=AZ,\qquad A\in\mathbb{R}^{m\times d}.$$

The goal is to sample from the conditional law

$$\operatorname{Law}(Z\mid AZ=y).$$

When $A$ has full row rank, the constraint $AZ=y$ defines an affine set. Writing

$$P_{\perp}:=A^{\top}(AA^{\top})^{-1}A,\qquad P_{\parallel}:=I-P_{\perp},$$

the projection $P_{\perp}$ extracts the component of the signal determined by the measurements, while $P_{\parallel}$ extracts the component in the null space of $A$. We refer to these as the normal and tangent components, respectively. Thus the observation fixes the normal component, whereas the tangent component contains the remaining degrees of freedom. In imaging problems, this formulation covers inpainting, super-resolution, deblurring, and other linear inverse problems.
A central difficulty is that measurement consistency and conditional sampling are not the same task. Measurement consistency only requires producing a sample $\hat{z}$ satisfying $A\hat{z}=y$. Conditional sampling requires more: among all feasible signals satisfying the measurement, samples should be distributed according to the true conditional law of the data. In the geometric language above, the normal component enforces feasibility, while the tangent component determines how probability mass is distributed along the feasible affine set.
Many zero-shot inverse-problem samplers based on pretrained diffusion models enforce the measurement by repeatedly correcting or projecting the sample in the observed directions. We call such methods projection-based because they use the known linear operator $A$ to replace, project, or analytically correct the normal component during reverse diffusion, while leaving the unobserved directions largely governed by the pretrained unconditional model. Methods such as denoising diffusion restoration models (DDRM) and the denoising diffusion null-space model (DDNM) are representative examples for linear inverse problems (Kawar et al., [2022](https://arxiv.org/html/2605.05387#bib.bib16); Wang et al., [2022](https://arxiv.org/html/2605.05387#bib.bib27)). These methods can be very effective at maintaining measurement consistency. However, correcting the normal component does not by itself determine the correct distribution in the tangent directions. As a result, a sample may satisfy $A\hat{z}=y$ while still being biased along the feasible manifold.
The goal of this paper is to understand and reduce this tangent\-space bias\. We work in the zero\-shot setting: the diffusion model is pretrained unconditionally, is not fine\-tuned for the observation, and conditioning is imposed only at inference time\. This setting is practically important because it allows a single generative prior to be reused across many inverse problems\. It is also theoretically revealing, because the only available learned object is the unconditional score\. The question is therefore: when can an unconditional score be used to approximate the conditional dynamics, and where does the error enter?
Our starting point is a normal–tangent decomposition of the conditional score. Under Gaussian noising, the normal component of the conditional score is available in closed form from the observation. In the variance-exploding normalization, if $B=P_{\perp}Z=b$, then

$$P_{\perp}s_{t}^{*,b}(x)=\frac{1}{t}P_{\perp}(b-x).$$

Thus the normal score is not the obstacle. The only unknown part is the tangent conditional score $P_{\parallel}s_{t}^{*,b}(x)$. Projection-based zero-shot samplers can therefore be viewed as replacing this unknown tangent conditional score by the pretrained unconditional tangent score $P_{\parallel}s_{t}(x)$. This view isolates the precise source of bias: the approximation is made along the feasible directions, not in the measured directions.
Motivated by this decomposition, we propose a two\-stage conditional sampler\. Rather than starting reverse diffusion from the highest\-noise distribution, we start from an intermediate noise level\. At this level, the noisy normal component can be sampled exactly under the constraint\. We then run projected underdamped Langevin dynamics on the corresponding affine slice, using the projected unconditional score to mix only in the tangent directions\. This produces an initialization that is already consistent with the noisy constraint and better adapted to the feasible slice\. From this initialization, we perform guided reverse denoising using the exact normal correction and the pretrained unconditional score in the tangent directions\.
The theoretical analysis follows the same decomposition\. We separate the total sampling error into two terms\. The first is an initialization error at the intermediate noise level, caused by approximating the true conditional marginal on the affine slice\. The second is a pathwise error accumulated during reverse denoising, caused by replacing the true conditional tangent score with the unconditional tangent score\. Our main pathwise result shows that this second error is controlled by a conditional mutual information between tangent and normal components\. Informally, zero\-shot tangent guidance is accurate when, at the chosen noise level, the remaining statistical dependence between the unobserved tangent component and the observed normal component is small\.
We further combine the pathwise bound with an initialization analysis\. Under a latent Gaussian\-mixture model, we obtain a terminal Kullback–Leibler \(KL\) bound consisting of an initialization term and a mutual\-information pathwise term\. Under an additional separation condition on the latent normal codebook, both terms become exponentially small in the separation\-to\-noise ratio\. These results identify regimes in which inference\-time conditioning with a fixed unconditional score can be accurate, and they also explain why tangent\-space ambiguity is the central obstruction\.
We evaluate the resulting sampler on standard linear imaging inverse problems using pretrained diffusion backbones and matched compute budgets. On inpainting and $8\times$ super-resolution across CelebA-HQ, LSUN Church, and ImageNet, the proposed method improves Learned Perceptual Image Patch Similarity (LPIPS) and Fréchet Inception Distance (FID) over a strong projection-based zero-shot baseline. The gains are largest in settings with greater unresolved tangent ambiguity, such as ImageNet and high-factor super-resolution, consistent with the role of tangent mixing in the analysis.
### 1.1 Related Work
Our work builds on score-based generative modeling and diffusion models (Song and Ermon, [2019](https://arxiv.org/html/2605.05387#bib.bib23); Ho et al., [2020](https://arxiv.org/html/2605.05387#bib.bib13); Song et al., [2020b](https://arxiv.org/html/2605.05387#bib.bib24), [a](https://arxiv.org/html/2605.05387#bib.bib22)). Conditional generation can be obtained by training conditional models, but many inverse problems require reusing a fixed unconditional model. Classifier guidance and classifier-free guidance modify the reverse process using additional conditional information (Dhariwal and Nichol, [2021](https://arxiv.org/html/2605.05387#bib.bib5); Ho and Salimans, [2022](https://arxiv.org/html/2605.05387#bib.bib12)). Image editing and restoration methods such as SDEdit, RePaint, and ILVR impose conditioning through noising, denoising, and resampling procedures (Meng et al., [2021](https://arxiv.org/html/2605.05387#bib.bib21); Lugmayr et al., [2022](https://arxiv.org/html/2605.05387#bib.bib20); Choi et al., [2021](https://arxiv.org/html/2605.05387#bib.bib2)).
Diffusion priors have also been widely used for inverse problems. Predictor–corrector samplers and likelihood-gradient corrections incorporate observations during sampling (Song et al., [2020b](https://arxiv.org/html/2605.05387#bib.bib24), [2021](https://arxiv.org/html/2605.05387#bib.bib25)). For linear inverse problems, DDRM and DDNM exploit the measurement operator to impose analytic updates or null-space corrections during reverse diffusion (Kawar et al., [2022](https://arxiv.org/html/2605.05387#bib.bib16); Wang et al., [2022](https://arxiv.org/html/2605.05387#bib.bib27)). Diffusion posterior sampling (DPS) extends posterior sampling ideas to more general noisy and nonlinear settings through likelihood-gradient guidance (Chung et al., [2022](https://arxiv.org/html/2605.05387#bib.bib3)). These methods demonstrate the strength of pretrained diffusion priors for restoration. Our focus is different: we analyze the specific tangent-score approximation that remains after the normal measurement component has been corrected.
Other approaches construct conditional samplers by changing the model or the underlying path measure. Reward-based fine-tuning and reinforcement-learning methods adapt a pretrained generator using task-specific feedback (Fan et al., [2023](https://arxiv.org/html/2605.05387#bib.bib9); Zhao et al., [2025](https://arxiv.org/html/2605.05387#bib.bib30); Uehara et al., [2024](https://arxiv.org/html/2605.05387#bib.bib26)). Doob’s $h$-transform and diffusion-bridge methods provide principled path-space formulations of conditioning (Didi et al., [2023](https://arxiv.org/html/2605.05387#bib.bib6); Guo et al., [2026](https://arxiv.org/html/2605.05387#bib.bib11); Zhou et al., [2024](https://arxiv.org/html/2605.05387#bib.bib31)). These methods can be exact or asymptotically exact under suitable assumptions, but they typically require learning an additional object, solving a control problem, or fine-tuning the model. By contrast, we keep the unconditional score fixed and study what can be achieved by inference-time conditioning alone.
The Langevin initialization used here is related to constrained sampling. Projected Langevin methods sample on constrained domains or manifolds (Lamperski, [2021](https://arxiv.org/html/2605.05387#bib.bib17)). Underdamped Langevin dynamics can improve mixing relative to overdamped dynamics in some settings (Cheng et al., [2018](https://arxiv.org/html/2605.05387#bib.bib1)), and BAOAB discretizations are known for stable and low-bias behavior in the position marginal (Leimkuhler and Matthews, [2013](https://arxiv.org/html/2605.05387#bib.bib18)). In affine inverse problems, these methods are natural because, once the normal component is fixed, the remaining sampling problem lives in the tangent space.
Recent theory has begun to analyze conditional and zero-shot diffusion samplers, including asymptotically exact conditional samplers (Wu et al., [2023](https://arxiv.org/html/2605.05387#bib.bib28)), filtering-based posterior samplers for linear inverse problems (Dou and Song, [2024](https://arxiv.org/html/2605.05387#bib.bib7)), and score-mismatch analyses for zero-shot guidance (Liang et al., [2025](https://arxiv.org/html/2605.05387#bib.bib19)). Our contribution is complementary: we isolate the normal–tangent structure of affine conditioning and bound the error caused by using the unconditional tangent score in place of the conditional tangent score.
### 1.2 Contributions
The main contributions of this paper are as follows\.
First, we derive a normal–tangent decomposition of affine conditional diffusion. For Gaussian noising, the normal component of the conditional score is available exactly from the measurement and the noising process, while the tangent component is the only part not supplied by a pretrained unconditional score model. This decomposition motivates the surrogate dynamics in Section [2](https://arxiv.org/html/2605.05387#S2).
Second, we propose a zero\-shot conditional sampler that combines exact normal correction, projected underdamped Langevin mixing on an affine slice, and guided reverse denoising\. The Langevin phase is designed to initialize the sampler at an intermediate noise level with improved mixing in the unobserved tangent directions before the final denoising stage\.
Third, we prove a pathwise error bound for the guided reverse dynamics. In Theorem [4](https://arxiv.org/html/2605.05387#Thmtheorem4), the KL divergence between the ideal conditional path measure and the surrogate path measure is controlled by a conditional mutual information between the tangent and normal components. This gives an information-theoretic criterion for when replacing the conditional tangent score by the unconditional tangent score is accurate.
Fourth, we combine the pathwise estimate with an initialization analysis to obtain terminal KL guarantees. In Theorem [9](https://arxiv.org/html/2605.05387#Thmtheorem9), the terminal error separates into an initialization term and the mutual-information pathwise term. The resulting bound has no explicit dependence on the ambient dimension; its size is governed by the sensitivity of tangent conditionals and by the residual statistical dependence between observed and unobserved components. Under an additional separation condition on the latent normal codebook, Theorem [11](https://arxiv.org/html/2605.05387#Thmtheorem11) further gives an exponential small-error regime, in which both the initialization and pathwise contributions become exponentially small when the components of the Gaussian mixture model are well separated.
Finally, we evaluate the proposed sampler on linear imaging inverse problems. The experiments show that the algorithm outperforms a strong projection-based zero-shot baseline on inpainting and $8\times$ super-resolution under matched network-evaluation budgets.
### 1.3 Organization
Section [2](https://arxiv.org/html/2605.05387#S2) formulates affine conditional diffusion and derives the normal–tangent decomposition of the conditional reverse dynamics. Section [3](https://arxiv.org/html/2605.05387#S3) presents the Langevin–diffusion sampler. Section [4](https://arxiv.org/html/2605.05387#S4) reports experiments on inpainting and super-resolution. Section [5](https://arxiv.org/html/2605.05387#S5) gives the KL bounds and total error decomposition. Proofs and additional derivations are deferred to the appendix.
## 2 Methodology
We adopt the variance-exploding (VE) diffusion framework of Song et al. ([2020b](https://arxiv.org/html/2605.05387#bib.bib24)). Let the clean data be $Z\in\mathbb{R}^{d}$ with prior law $Z\sim p_{0}$. For diffusion time $t\in[0,T]$, the forward process is

$$X_{t}:=Z+W_{t},\tag{2.1}$$

where $\{W_{t}\}_{t\geq 0}$ is standard Brownian motion in $\mathbb{R}^{d}$ independent of $Z$. Hence

$$X_{t}\mid Z\sim\mathcal{N}(Z,\,tI_{d}),\tag{2.2}$$

and we write $p_{t}$ for the marginal density of $X_{t}$, with score $s_{t}(x):=\nabla_{x}\log p_{t}(x)$.
The time-reversal of ([2.1](https://arxiv.org/html/2605.05387#S2.E1)) yields the reverse-time stochastic differential equation (SDE) that generates *unconditional* samples from $p_{0}$. Using the reverse-time parameter $\tau:=T-t\in[0,T]$, this SDE can be written as

$$\mathrm{d}Y_{\tau}=s_{T-\tau}(Y_{\tau})\,\mathrm{d}\tau+\mathrm{d}\bar{W}_{\tau},\qquad Y_{0}\sim p_{T},\tag{2.3}$$

where $\bar{W}_{\tau}$ is a Brownian motion in reverse time. In practice, $s_{t}(x)$ is approximated by a neural network trained via score matching.
In this work we do not seek unconditional samples. Instead, we aim to sample from a conditional distribution under a linear constraint. Let $A\in\mathbb{R}^{m\times d}$ have full row rank and consider the affine constraint $AZ=y$. It is convenient to express the constraint through orthogonal projection onto the row space of $A$. Define

$$P_{\perp}:=A^{\top}(AA^{\top})^{-1}A,\qquad P_{\parallel}:=I-P_{\perp},$$

so that $P_{\perp}$ projects onto $\mathrm{range}(A^{\top})$ (normal space) and $P_{\parallel}$ onto $\ker(A)$ (tangent space). We encode the observation via the *level*

$$B:=P_{\perp}Z,\qquad b:=A^{\top}(AA^{\top})^{-1}y,$$

so that $AZ=y$ is equivalent to $B=b$, i.e., $Z$ lies on the affine set

$$\mathcal{M}(b):=\{x\in\mathbb{R}^{d}:\;P_{\perp}x=b\}.$$

(When $Z$ is supported on a countable codebook $\mathcal{C}\subset\mathbb{R}^{d}$, the level $B$ is supported on $P_{\perp}\mathcal{C}$; the development below does not otherwise rely on discreteness.)
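The projectors and the level $b$ can be formed directly from $A$ and $y$. The following is a minimal NumPy sketch, not the authors' implementation; the helper names `make_projectors` and `level_from_observation` and the small inpainting-style mask are assumptions made for illustration.

```python
import numpy as np

def make_projectors(A):
    """Return (P_perp, P_par) for a full-row-rank A of shape (m, d)."""
    # P_perp = A^T (A A^T)^{-1} A, using a linear solve instead of an explicit inverse.
    P_perp = A.T @ np.linalg.solve(A @ A.T, A)
    P_par = np.eye(A.shape[1]) - P_perp
    return P_perp, P_par

def level_from_observation(A, y):
    """Return b = A^T (A A^T)^{-1} y, the normal-space encoding of y."""
    return A.T @ np.linalg.solve(A @ A.T, y)

# Hypothetical inpainting-style example: observe coordinates 0 and 2 of a signal in R^4.
A = np.eye(4)[[0, 2]]
z = np.array([1.0, -2.0, 3.0, 0.5])
y = A @ z
P_perp, P_par = make_projectors(A)
b = level_from_observation(A, y)
```

For a dense $d\times d$ projector this is only practical in small dimensions; for masks or downsampling operators one would normally apply $P_{\perp}$ and $P_{\parallel}$ as functions rather than matrices.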
Fix $b$ and let $p_{t}^{*,b}$ denote the conditional density of $X_{t}$ under $\operatorname{Law}(\,\cdot\,\mid P_{\perp}Z=b)$, with conditional score $s_{t}^{*,b}(x):=\nabla_{x}\log p_{t}^{*,b}(x)$. If $s_{t}^{*,b}$ were available, then the correct reverse-time dynamics that sample from $\operatorname{Law}(Z\mid P_{\perp}Z=b)$ would be

$$\mathrm{d}Y_{\tau}^{*,b}=s_{T-\tau}^{*,b}(Y_{\tau}^{*,b})\,\mathrm{d}\tau+\mathrm{d}\bar{W}_{\tau},\qquad Y_{0}^{*,b}\sim\operatorname{Law}(X_{T}\mid P_{\perp}Z=b).\tag{2.4}$$

The main obstacle is that $s_{t}^{*,b}$ is not directly learned by standard unconditional score training.
For Gaussian perturbations, Tweedie’s formula expresses the conditional expectation of $Z$ given $X_{t}=x$ as

$$\mathbb{E}[Z\mid X_{t}=x]=x+t\,s_{t}(x).\tag{2.5}$$

A key observation is that, under the affine conditioning $B=b$, Tweedie’s identity immediately yields a closed-form expression for the *normal* component of the conditional score: applying $P_{\perp}$ to ([2.5](https://arxiv.org/html/2605.05387#S2.E5)) under $\operatorname{Law}(\,\cdot\,\mid P_{\perp}Z=b)$ gives

$$P_{\perp}\mathbb{E}[Z\mid X_{t}=x,\,B=b]\;=\;P_{\perp}\big(x+t\,s_{t}^{*,b}(x)\big).$$

Since $P_{\perp}Z=b$ holds almost surely under $B=b$, the left-hand side equals $b=P_{\perp}b$, and therefore

$$P_{\perp}s_{t}^{*,b}(x)=\frac{1}{t}\,P_{\perp}(b-x).\tag{2.6}$$

Thus only the tangent component $P_{\parallel}s_{t}^{*,b}$ remains unknown. Using $s_{t}^{*,b}=P_{\parallel}s_{t}^{*,b}+P_{\perp}s_{t}^{*,b}$ and substituting ([2.6](https://arxiv.org/html/2605.05387#S2.E6)) into ([2.4](https://arxiv.org/html/2605.05387#S2.E4)) yields the equivalent decomposition

$$\mathrm{d}Y_{\tau}^{*,b}=\Big(P_{\parallel}s_{T-\tau}^{*,b}(Y_{\tau}^{*,b})+\frac{1}{T-\tau}\,P_{\perp}\big(b-Y_{\tau}^{*,b}\big)\Big)\mathrm{d}\tau+\mathrm{d}\bar{W}_{\tau}.\tag{2.7}$$

This form makes the conditioning mechanism explicit: the process is driven toward the affine set $\mathcal{M}(b)$ by the *normal drift*, while the remaining *tangent drift* depends on the intractable conditional score.
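The closed-form normal score (2.6) can be checked numerically when the conditional prior is a finite mixture of atoms that all share the same level $b$. The sketch below is illustrative only: the matrix `A`, the atoms, and the helper `mixture_score` are assumptions, and the score is evaluated through Tweedie’s formula (2.5) as $(\mathbb{E}[Z\mid X_t=x]-x)/t$.

```python
import numpy as np

def mixture_score(x, atoms, weights, t):
    """Score at x of sum_k w_k N(x; z_k, t I), via Tweedie's formula."""
    diffs = atoms - x                                   # rows are z_k - x
    logits = np.log(weights) - np.sum(diffs**2, axis=1) / (2.0 * t)
    r = np.exp(logits - logits.max())
    r /= r.sum()                                        # posterior responsibilities
    return (r @ diffs) / t                              # (E[Z | X_t = x] - x) / t

rng = np.random.default_rng(0)
A = np.array([[1.0, -1.0, 0.0]])                        # hypothetical measurement operator
P_perp = A.T @ np.linalg.solve(A @ A.T, A)
P_par = np.eye(3) - P_perp

# Atoms of the conditional prior: every atom satisfies P_perp z_k = b.
b = P_perp @ np.array([0.7, -0.3, 0.0])
atoms = b + rng.normal(size=(5, 3)) @ P_par
weights = np.full(5, 0.2)

t = 0.8
x = rng.normal(size=3)
lhs = P_perp @ mixture_score(x, atoms, weights, t)      # normal part of the conditional score
rhs = P_perp @ (b - x) / t                              # right-hand side of (2.6)
```

In this setting `lhs` and `rhs` agree up to floating-point error regardless of the tangent positions of the atoms, which is the content of (2.6).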
To obtain a practical sampler using only an unconditional score model, we keep the *exact* normal drift and approximate the unknown tangent term by the unconditional tangent score $P_{\parallel}s_{t}$. This yields the *surrogate constrained reverse SDE*

$$\mathrm{d}\hat{Y}_{\tau}^{\,b}=\Big(P_{\parallel}s_{T-\tau}(\hat{Y}_{\tau}^{\,b})+\frac{1}{T-\tau}\,P_{\perp}\big(b-\hat{Y}_{\tau}^{\,b}\big)\Big)\mathrm{d}\tau+\mathrm{d}\bar{W}_{\tau},\qquad\tau\in[0,T-t_{0}).\tag{2.8}$$

It is constrained in the sense that its normal drift explicitly forces $P_{\perp}\hat{Y}_{\tau}^{\,b}$ toward the prescribed level $b$, thereby steering the trajectory toward the affine manifold $\mathcal{M}(b)=\{x:\,P_{\perp}x=b\}$, while only the tangent component evolves according to the learned (unconditional) score. Equations ([2.7](https://arxiv.org/html/2605.05387#S2.E7)) and ([2.8](https://arxiv.org/html/2605.05387#S2.E8)) share the same (exact) normal component and differ only in the tangent score: $P_{\parallel}s^{*,b}$ versus $P_{\parallel}s$. In implementations, the factor $1/(T-\tau)=1/t$, which diverges as $t\to 0$, is handled by stopping the integration at a small $t_{0}>0$ (equivalently $\tau_{\max}=T-t_{0}$) and applying a final denoising step.
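One simple way to simulate (2.8) is an Euler–Maruyama scheme on a uniform grid in $t$. The sketch below is an assumption made for illustration (the paper's discretization is not specified in this section); `score_fn` stands in for the pretrained unconditional score, and the example uses the exact score of a standard Gaussian prior, for which $p_t=\mathcal{N}(0,(1+t)I)$.

```python
import numpy as np

def surrogate_reverse_step(y, t, dt, score_fn, P_perp, P_par, b, rng):
    """One Euler-Maruyama step of (2.8), moving from noise level t to t - dt."""
    drift = P_par @ score_fn(y, t) + P_perp @ (b - y) / t   # tangent score + exact normal drift
    noise = np.sqrt(dt) * rng.normal(size=y.shape)
    return y + drift * dt + noise

def run_surrogate(y_init, t_start, t0, n_steps, score_fn, P_perp, P_par, b, rng):
    """Integrate from t_start down to t0 > 0; the grid never reaches t = 0."""
    ts = np.linspace(t_start, t0, n_steps + 1)
    y = np.array(y_init, dtype=float)
    for t_hi, t_lo in zip(ts[:-1], ts[1:]):
        y = surrogate_reverse_step(y, t_hi, t_hi - t_lo, score_fn, P_perp, P_par, b, rng)
    return y

def score_fn(y, t):
    # Exact unconditional score for the illustrative prior Z ~ N(0, I).
    return -y / (1.0 + t)

rng = np.random.default_rng(0)
A = np.array([[1.0, 0.0]])
P_perp = A.T @ np.linalg.solve(A @ A.T, A)
P_par = np.eye(2) - P_perp
b = np.array([0.5, 0.0])

y_end = run_surrogate([0.5, 1.0], t_start=1.0, t0=1e-3, n_steps=1000,
                      score_fn=score_fn, P_perp=P_perp, P_par=P_par, b=b, rng=rng)
normal_error = np.linalg.norm(P_perp @ y_end - b)
```

The step size must stay small relative to the current $t$ near the end of the grid, since the normal drift has stiffness $1/t$; the final denoising step mentioned above is omitted here.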
We do not integrate the surrogate reverse SDE ([2.8](https://arxiv.org/html/2605.05387#S2.E8)) over the full reverse-time horizon $\tau\in[0,T]$. The surrogate replaces the true conditional tangent score $P_{\parallel}s^{*,b}_{T-\tau}$ by the unconditional term $P_{\parallel}s_{T-\tau}$. If we start at $\tau=0$ (i.e., from the highest-noise marginal), this mismatch acts over a long interval during which the normal correction is weak because its strength scales as $1/(T-\tau)=1/t$. As a result, the trajectory can drift in tangent directions in a way that is inconsistent with the target conditional law, producing a bias that accumulates before the constraint becomes dominant at smaller noise.
To limit this accumulation, we start the surrogate reverse SDE only at an intermediate noise level $t^{*}\in(t_{0},T)$, equivalently at reverse time $\tau^{*}:=T-t^{*}$. Intuitively, $t^{*}$ is chosen so that, for the remaining reverse interval $\tau\in[\tau^{*},T-t_{0}]$ (i.e., forward times $t\in[t_{0},t^{*}]$), using $P_{\parallel}s_{t}$ as a proxy for $P_{\parallel}s_{t}^{*,b}$ is acceptable, while the normal drift is already strong enough to enforce the constraint. What remains is that we cannot initialize the reverse SDE at $\tau^{*}$ from an arbitrary point: we need an initial state that is (approximately) distributed as the correct conditional marginal $\operatorname{Law}(X_{t^{*}}\mid P_{\perp}Z=b)$.
We construct such an initialization by combining an exact draw for the normal component with a tangent-space sampling step. Under the conditioning $B=b$ we have $P_{\perp}Z=b$ almost surely, hence

$$P_{\perp}X_{t^{*}}=P_{\perp}(Z+W_{t^{*}})=b+P_{\perp}W_{t^{*}},$$

so the normal component at time $t^{*}$ can be sampled explicitly by

$$x^{\perp}_{t^{*}}:=b+\sqrt{t^{*}}\,P_{\perp}\xi,\qquad\xi\sim\mathcal{N}(0,I_{d}),$$

which matches the exact law of $P_{\perp}X_{t^{*}}\mid P_{\perp}Z=b$. Conditional on this sampled normal component $x^{\perp}$, we sample a compatible tangent component by running Langevin dynamics restricted to the affine set $\mathcal{M}(x^{\perp}):=\{x:\;P_{\perp}x=x^{\perp}\}$, using only the projected (tangent) score at time $t^{*}$.
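The exact normal draw is a one-line operation. The following sketch, with a hypothetical helper `sample_noisy_normal` and an illustrative $A$, draws a batch of samples and lets one confirm that they have no tangent displacement from $b$ and covariance close to $t^{*}P_{\perp}$.

```python
import numpy as np

def sample_noisy_normal(b, P_perp, t_star, rng, n=1):
    """Draw n samples of b + sqrt(t_star) * P_perp @ xi with xi ~ N(0, I_d)."""
    xi = rng.normal(size=(n, b.shape[0]))
    # P_perp is symmetric, so right-multiplying rows by it applies the projection.
    return b + np.sqrt(t_star) * xi @ P_perp

A = np.array([[1.0, -1.0, 0.0]])                 # illustrative measurement operator
P_perp = A.T @ np.linalg.solve(A @ A.T, A)
P_par = np.eye(3) - P_perp
b = P_perp @ np.array([0.4, 0.0, 0.0])
t_star = 0.25

X = sample_noisy_normal(b, P_perp, t_star, np.random.default_rng(1), n=20000)
emp_cov = np.cov(X.T)                            # should be close to t_star * P_perp
```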
The following lemma shows that restricting a density to $\mathcal{M}(x^{\perp})$ simply projects its ambient score onto the tangent space.
###### Lemma 2
Let $A\in\mathbb{R}^{m\times d}$ have full row rank and let $C\in\mathbb{R}^{d\times(d-m)}$ have orthonormal columns spanning $\ker(A)$ ($C^{\top}C=I_{d-m}$, $CC^{\top}=P_{\parallel}$). Fix any $u_{0}\in\mathcal{M}(x^{\perp})$ and parametrize the affine set by $x=u_{0}+Cz^{\parallel}$ with $z^{\parallel}\in\mathbb{R}^{d-m}$. For any differentiable density $p:\mathbb{R}^{d}\to(0,\infty)$, define its restriction to $\mathcal{M}(x^{\perp})$ by $\pi(z^{\parallel})\propto p(u_{0}+Cz^{\parallel})$. Then, for all $z^{\parallel}$,

$$C\,\nabla_{z^{\parallel}}\log\pi(z^{\parallel})=P_{\parallel}\,\nabla_{x}\log p(x),\qquad x=u_{0}+Cz^{\parallel}.$$

*Proof.* Since the proportionality constant does not depend on $z^{\parallel}$, it disappears after taking logarithms and gradients. Thus

$$\log\pi(z^{\parallel})=\log p(u_{0}+Cz^{\parallel})+\text{const}.$$

Differentiating with respect to $z^{\parallel}$ and using the chain rule gives

$$\nabla_{z^{\parallel}}\log\pi(z^{\parallel})=C^{\top}\nabla_{x}\log p(x),\qquad x=u_{0}+Cz^{\parallel}.$$

Multiplying both sides by $C$, we obtain

$$C\,\nabla_{z^{\parallel}}\log\pi(z^{\parallel})=CC^{\top}\nabla_{x}\log p(x).$$

Because the columns of $C$ form an orthonormal basis of $\ker(A)$, we have

$$CC^{\top}=P_{\parallel}.$$

Therefore

$$C\,\nabla_{z^{\parallel}}\log\pi(z^{\parallel})=P_{\parallel}\nabla_{x}\log p(x),\qquad x=u_{0}+Cz^{\parallel},$$

which is exactly the claimed identity.

We use Lemma [2](https://arxiv.org/html/2605.05387#Thmtheorem2) with the time-$t^{*}$ marginal $p_{t^{*}}$ (and its learned score $s_{t^{*}}=\nabla\log p_{t^{*}}$). Starting from any point on $\mathcal{M}(x^{\perp}_{t^{*}})$, e.g.

$$y_{0}:=x^{\perp}_{t^{*}}+\sqrt{t^{*}}\,P_{\parallel}\xi,\qquad\xi\sim\mathcal{N}(0,I_{d}),$$

we run underdamped Langevin dynamics evolving only in tangent directions:

$$\begin{aligned}\mathrm{d}y_{s}&=v_{s}\,\mathrm{d}s,\\ \mathrm{d}v_{s}&=P_{\parallel}s_{t^{*}}(y_{s})\,\mathrm{d}s-\gamma P_{\parallel}v_{s}\,\mathrm{d}s+\sqrt{2\gamma}\,P_{\parallel}\,\mathrm{d}W_{s},\end{aligned}\tag{2.9}$$

while enforcing the constraint $P_{\perp}y_{s}\equiv x^{\perp}_{t^{*}}$ for all $s$ (equivalently, we project updates onto the tangent space). Let $\hat{Y}_{\tau^{*}}^{\,b}$ denote the resulting position $y_{s}$ after a prescribed Langevin time.
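A minimal discretization of (2.9) is sketched below. It uses a plain Euler-type update rather than a BAOAB splitting, which is a simplification chosen here for brevity and not necessarily the scheme used in the experiments; the function name, step size `h`, friction `gamma`, and the Gaussian stand-in score are all assumptions. Because the velocity starts in the tangent space and every increment is projected by $P_{\parallel}$, the normal component of the position never changes.

```python
import numpy as np

def projected_underdamped_langevin(y0, score_fn, P_par, gamma, h, n_steps, rng):
    """Euler-type discretization of (2.9); the position stays on the slice through y0."""
    y = np.array(y0, dtype=float)
    v = np.zeros_like(y)                       # zero velocity lies in the tangent space
    for _ in range(n_steps):
        xi = rng.normal(size=y.shape)
        force = P_par @ score_fn(y)            # projected (tangent) score
        v = v + h * (force - gamma * v) + np.sqrt(2.0 * gamma * h) * (P_par @ xi)
        y = y + h * v                          # v is tangent, so P_perp @ y is unchanged
    return y

rng = np.random.default_rng(0)
A = np.array([[1.0, -1.0, 0.0]])               # illustrative measurement operator
P_perp = A.T @ np.linalg.solve(A @ A.T, A)
P_par = np.eye(3) - P_perp
t_star = 0.25
b = P_perp @ np.array([0.4, 0.0, 0.0])

x_perp = b + np.sqrt(t_star) * (P_perp @ rng.normal(size=3))   # exact normal draw
y0 = x_perp + np.sqrt(t_star) * (P_par @ rng.normal(size=3))   # a point on the slice

def score_fn(y):
    # Stand-in for the learned score at t_star: exact score when Z ~ N(0, I).
    return -y / (1.0 + t_star)

y_end = projected_underdamped_langevin(y0, score_fn, P_par,
                                       gamma=2.0, h=0.01, n_steps=500, rng=rng)
```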
This two-stage procedure induces an initialization law at time $t^{*}$ that matches the conditional normal marginal exactly and uses a tractable surrogate for the tangent conditional, namely

$$\hat{p}_{t^{*}}^{\,b}(x^{\perp},x^{\parallel})=p_{t^{*}}(x^{\parallel}\mid x^{\perp})\,p_{t^{*}}(x^{\perp}\mid P_{\perp}Z=b),\qquad x^{\perp}=P_{\perp}x,\ \ x^{\parallel}=P_{\parallel}x.$$

Here $p_{t^{*}}(x^{\perp}\mid P_{\perp}Z=b)$ is available in closed form because, under $B=b$, the forward process satisfies $X_{t^{*}}^{\perp}=b+W_{t^{*}}^{\perp}$, hence $X_{t^{*}}^{\perp}\sim\mathcal{N}(b,t^{*}P_{\perp})$. The remaining factor $p_{t^{*}}(x^{\parallel}\mid x^{\perp})$ is *not* conditioned on $B=b$; it is the *unconditional* tangent conditional induced by the pretrained model at noise level $t^{*}$. Equivalently, $\hat{p}_{t^{*}}^{\,b}$ is the distribution obtained by (i) drawing the correct noisy normal component under the constraint, and then (ii) drawing a tangent component that is compatible with that normal slice according to the unconditional time-$t^{*}$ marginal. This is exactly what the projected Langevin phase targets: it mixes along the affine set $\mathcal{M}(x^{\perp}_{t^{*}})$ using the projected score $P_{\parallel}s_{t^{*}}$, which is the score of $p_{t^{*}}(\cdot\mid x^{\perp})$ restricted to the manifold (Lemma [2](https://arxiv.org/html/2605.05387#Thmtheorem2)). Finally, we use $\hat{p}_{t^{*}}^{\,b}$ as the *initial distribution* for the surrogate reverse dynamics ([2.8](https://arxiv.org/html/2605.05387#S2.E8)) at reverse time $\tau^{*}=T-t^{*}$, i.e., $\hat{Y}_{\tau^{*}}^{\,b}\sim\hat{p}_{t^{*}}^{\,b}$.
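The identity of Lemma 2, on which the projected Langevin phase relies, can be sanity-checked with finite differences. The sketch below uses an illustrative Gaussian mixture for $p$ and builds the null-space basis $C$ from the SVD of $A$; the mixture parameters, `u0`, and `z` are arbitrary choices made for this example.

```python
import numpy as np

def log_p(x, means, var):
    """Log-density, up to an additive constant, of an equal-weight isotropic Gaussian mixture."""
    sq = np.sum((means - x) ** 2, axis=1)
    return np.log(np.sum(np.exp(-sq / (2.0 * var))))

def grad_log_p(x, means, var):
    """Analytic ambient score of the same mixture."""
    sq = np.sum((means - x) ** 2, axis=1)
    r = np.exp(-sq / (2.0 * var))
    r /= r.sum()
    return (r @ (means - x)) / var

A = np.array([[1.0, -1.0, 0.0]])
m, d = A.shape
C = np.linalg.svd(A)[2][m:].T        # orthonormal columns spanning ker(A), shape (d, d - m)
P_par = C @ C.T

means = np.array([[1.0, 1.0, 0.0], [-1.0, 0.5, 2.0], [0.0, -1.0, -1.0]])
var = 0.7
u0 = np.array([0.3, -0.2, 0.1])      # any base point on the slice
z = np.array([0.4, -0.6])            # tangent coordinates
x = u0 + C @ z

# Central finite differences of log pi(z) = log p(u0 + C z) + const.
eps = 1e-5
grad_z = np.array([
    (log_p(u0 + C @ (z + eps * e), means, var)
     - log_p(u0 + C @ (z - eps * e), means, var)) / (2.0 * eps)
    for e in np.eye(d - m)
])
lhs = C @ grad_z                               # C * grad_z log pi
rhs = P_par @ grad_log_p(x, means, var)        # P_par * grad_x log p
```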
*Toy mixture illustration.* We illustrate the tangent-bias mechanism on a simple prior in $\mathbb{R}^{2}$: a three-point mixture with atoms at $(1,1)$, $(-1,-1)$, and $(0,5)$ with weights $0.125:0.125:0.75$. We consider the linear constraint

$$x-y=0,\qquad\text{equivalently}\qquad AZ=0\ \ \text{with}\ \ A=\begin{bmatrix}1&-1\end{bmatrix},$$

so the conditional target is $\operatorname{Law}(Z\mid x-y=0)$, i.e., sampling on the diagonal affine set $\mathcal{M}(0)=\{(x,y)\in\mathbb{R}^{2}:\ x=y\}$.
We run the probability-flow ordinary differential equation (PF-ODE), the deterministic counterpart of the reverse-time sampler, from $\sigma_{\max}=20$ to $\sigma_{\min}=0.01$, with the identification $t=\sigma^{2}$. As shown in Figure [1](https://arxiv.org/html/2605.05387#S2.F1)(a), the unconstrained PF-ODE recovers the correct mixture.
Under naive projection-based guidance initialized at $\sigma_{\max}$, the constraint $x-y=0$ is enforced only through the analytic normal drift, while the tangent drift remains that of the unconditional score. At high noise, the unconditional score is dominated by the heavy $(0,5)$ component, and this dominant-mode tangent direction accumulates along the manifold $\mathcal{M}(0)$, distorting the conditional weights and smearing the low-mass modes toward the dominant cluster (Figure [1](https://arxiv.org/html/2605.05387#S2.F1)(b)). In contrast, our two-stage procedure runs a brief projected underdamped Langevin phase at $t^{*}=0.25$ restricted to $\ker(A)$ (i.e., motion tangent to $x-y=0$, cf. Lemma [2](https://arxiv.org/html/2605.05387#Thmtheorem2)), producing an initialization close to $\operatorname{Law}(X_{t^{*}}\mid x-y=0)$. Starting the surrogate reverse dynamics from $\tau^{*}=T-t^{*}$ then yields samples that remain consistent with the constraint and recover the intended mode structure (Figure [1](https://arxiv.org/html/2605.05387#S2.F1)(c)).

Figure 1: Three-point mixture in $\mathbb{R}^{2}$. (a) Unconstrained PF-ODE: reproduces the prior mixture. (b) Naive projection guidance: accumulates tangent drift dominated by the high-mass $(0,5)$ mode, biasing the conditional outcome. (c) Two-stage (ours): projected Langevin initialization at $t^{*}$ followed by the surrogate reverse dynamics recovers constraint-consistent samples with the correct mode structure.
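The high-noise dominance of the $(0,5)$ mode can be checked numerically. The sketch below is an illustration, not the authors' code: it assumes the noised prior $p_{t}=\sum_{i}w_{i}\,\mathcal{N}(\mu_{i},tI_{2})$ and evaluates the component weights of the unconditional tangent conditional $p_{t}(x^{\parallel}\mid x^{\perp})$ on the slice $x-y=0$, which are proportional to $w_{i}\exp(-u_{i}^{2}/(2t))$ with $u_{i}$ the signed normal distance of atom $i$ from the diagonal. The helper name `slice_weights` is hypothetical.

```python
import numpy as np

# Three-point prior from the toy example: atoms and weights.
atoms = np.array([[1.0, 1.0], [-1.0, -1.0], [0.0, 5.0]])
weights = np.array([0.125, 0.125, 0.75])

# Unit normal to the diagonal {x = y}, i.e. A = [1, -1] normalized.
normal = np.array([1.0, -1.0]) / np.sqrt(2.0)


def slice_weights(t, level=0.0):
    """Mixture weights of p_t(x_parallel | x_perp) on the slice <normal, x> = level.

    Assumes p_t is the prior convolved with N(0, t I), so each component
    contributes w_i * N(level; <normal, mu_i>, t) to the slice.
    """
    offsets = atoms @ normal - level            # normal distance of each atom
    log_w = np.log(weights) - offsets**2 / (2.0 * t)
    log_w -= log_w.max()                        # stabilize before exponentiating
    w = np.exp(log_w)
    return w / w.sum()


high_noise = slice_weights(t=20.0**2)   # sigma_max = 20, with t = sigma^2
safe_noise = slice_weights(t=0.25)      # t* = 0.25
print("t = 400 :", high_noise)
print("t = 0.25:", safe_noise)
```

At $t=\sigma_{\max}^{2}$ the off-constraint atom still carries most of the slice mass, whereas at $t^{*}=0.25$ its share is negligible and the two feasible atoms split the mass evenly, which is the behavior the two-stage initialization relies on.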
## 3 Algorithm and Implementation
Our conditional sampler is organized into three conceptually distinct steps. The design goal is to (i) initialize at an intermediate “safe” noise level $t^{*}$, (ii) mix efficiently *along* the affine constraint manifold, and (iii) complete denoising while enforcing the constraint through an exact normal drift. Figure [2](https://arxiv.org/html/2605.05387#S3.F2) provides a visual summary of the full pipeline. In Step 1 we move to the intermediate time $t^{*}$ and fix the *noisy* normal level so that $P_{\perp}X_{t^{*}}$ has the exact conditional law under $B=b$. In Step 2 we run a short phase of projected underdamped Langevin dynamics (BAOAB) on the corresponding affine set $\mathcal{M}(x^{\perp})$ to mix *in the tangent directions* while keeping the normal component fixed. This produces an initialization $\hat{Y}_{\tau^{*}}^{\,b}$ that is approximately distributed as $\operatorname{Law}(X_{t^{*}}\mid P_{\perp}Z=b)$ and is already well-mixed along $\ker(A)$, which reduces the accumulation of tangent-score mismatch at high noise. In Step 3 we integrate the surrogate guided reverse dynamics from $\tau^{*}=T-t^{*}$ up to $T-t_{0}$, using the analytic normal drift to enforce the constraint and the pretrained score for the tangent drift during denoising.
*Step 1: Initialization for Langevin.* As in Section [2](https://arxiv.org/html/2605.05387#S2), we start the reverse-time procedure at the intermediate “safe” noise level $t^{*}=T-\tau^{*}$. Step 1 produces the *initial state for the projected Langevin phase* (Step 2). To do so, we select any clean feasible point $x_{0}\in\mathcal{M}(b)$ satisfying $P_{\perp}x_{0}=b$ (equivalently $Ax_{0}=y$). The choice of $x_{0}$ is not unique and does not affect feasibility; in practice one may take, for example, $x_{0}=b+P_{\parallel}\zeta$ with $\zeta\sim\mathcal{N}(0,I_{d})$ (Gaussian initialization in $\ker(A)$), or use a plug-in estimate (e.g., a pseudoinverse or any other fast reconstruction) and project it onto $\mathcal{M}(b)$.
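As a concrete illustration of the Gaussian initialization in $\ker(A)$, the following minimal sketch builds the projectors and a feasible point for a small dense operator. It assumes that $P_{\perp}=A^{+}A$ is the orthogonal projector onto the row space of $A$, that $P_{\parallel}=I-P_{\perp}$, and that $b=A^{+}y$; this is our reading of the notation, and the problem sizes are hypothetical. For image-scale operators one would apply $A$ and $A^{+}$ implicitly rather than form matrices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical small problem: d = 6 unknowns, m = 2 linear measurements.
d, m = 6, 2
A = rng.standard_normal((m, d))
y = rng.standard_normal(m)

A_pinv = np.linalg.pinv(A)
P_perp = A_pinv @ A             # orthogonal projector onto the row space of A
P_par = np.eye(d) - P_perp      # orthogonal projector onto ker(A)
b = A_pinv @ y                  # normal-space representative of the measurement

# Gaussian initialization inside ker(A): x0 = b + P_par @ zeta.
zeta = rng.standard_normal(d)
x0 = b + P_par @ zeta

print("A x0 - y      :", np.abs(A @ x0 - y).max())
print("P_perp x0 - b :", np.abs(P_perp @ x0 - b).max())
```

Any other choice of $\zeta$ gives a different feasible point with the same normal component, which is why the choice of $x_{0}$ does not affect feasibility.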
We then move $x_{0}$ to time $t^{*}$ by adding a Gaussian perturbation,

$$y^{\tau^{*}}\;=\;x_{0}+\sqrt{T-\tau^{*}}\,\xi\;=\;x_{0}+\sqrt{t^{*}}\,\xi,\qquad\xi\sim\mathcal{N}(0,I_{d}).$$

Rather than enforcing the clean constraint level $b$ at this stage, we freeze the *noisy* normal component

$$b_{\mathrm{noisy}}\;:=\;P_{\perp}y^{\tau^{*}},$$

which is the correct forward-time stochastic normal level at time $t^{*}$ under the conditioning $B=b$. The state $y^{\tau^{*}}$ is then used to initialize the constrained BAOAB/underdamped Langevin dynamics on the affine set $\mathcal{M}(b_{\mathrm{noisy}})$ in Step 2.

*Step 2: Tangent BAOAB Langevin.* Next, we approximate the conditional marginal $\operatorname{Law}(X_{t^{*}}\mid P_{\perp}Z=b)$ by running *underdamped Langevin dynamics* restricted to the affine set $\mathcal{M}(b_{\mathrm{noisy}})$. We evolve a position–velocity pair $(y_{s},v_{s})$ using the projected dynamics in Equation ([2.9](https://arxiv.org/html/2605.05387#S2.E9)), so that both the deterministic “force” and the stochastic excitation act *only in tangent directions* $\ker(A)$. We discretize this SDE with the BAOAB splitting integrator, which decomposes the dynamics into three sub-operators that can be integrated in closed form.
The BAOAB split. Write the SDE as the sum of:
- *B (kick):* deterministic velocity update due to the force, $\dot{v}=P_{\parallel}s_{t^{*}}(y)$.
- *A (drift):* deterministic position update, $\dot{y}=v$.
- *O (Ornstein–Uhlenbeck):* stochastic friction/noise on velocity, $\mathrm{d}v=-\gamma P_{\parallel}v\,\mathrm{d}s+\sqrt{2\gamma}\,P_{\parallel}\mathrm{d}W_{s}$.
BAOAB applies these pieces in the symmetric order

$$\text{B/2}\;\rightarrow\;\text{A/2}\;\rightarrow\;\text{O}\;\rightarrow\;\text{A/2}\;\rightarrow\;\text{B/2},$$

which is time-reversible (in the deterministic limit) and is known to have excellent stability and low bias in the *configurational* (position) marginal.
With step size $\Delta s$, one iteration from $(y,v)$ proceeds as:
1. *B/2 (half kick):* update the velocity using the score force at the current position: $v\leftarrow v+\frac{\Delta s}{2}\,P_{\parallel}s_{t^{*}}(y).$
2. *A/2 (half drift):* move the position forward using the current velocity: $y\leftarrow y+\frac{\Delta s}{2}\,v.$
3. *O (OU refresh):* apply friction and inject Gaussian noise directly in velocity. This step is exact because it is an Ornstein–Uhlenbeck process. Writing $c_{1}:=e^{-\gamma\Delta s}$ and $c_{2}:=\sqrt{1-e^{-2\gamma\Delta s}}$, we perform $v\leftarrow c_{1}v+c_{2}\,P_{\parallel}\xi$ with $\xi\sim\mathcal{N}(0,I_{d})$. Here $c_{1}$ contracts velocity (friction) and $c_{2}$ sets the noise amplitude; the projection $P_{\parallel}$ ensures that the OU excitation does not change the normal component.
4. *A/2 (half drift):* advance the position again: $y\leftarrow y+\frac{\Delta s}{2}\,v.$
5. *B/2 (half kick):* apply the remaining half force update: $v\leftarrow v+\frac{\Delta s}{2}\,P_{\parallel}s_{t^{*}}(y).$
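The five sub-steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the score is a stand-in callable (here the score $-x$ of a standard Gaussian), the projectors are dense matrices, and the function name `baoab_tangent` is hypothetical. The sketch also reuses the force computed at the end of one iteration as the first half kick of the next, so each iteration costs one score evaluation; the re-projection onto the level set is applied just before that evaluation so the cached force matches the stored position.

```python
import numpy as np


def baoab_tangent(y, score, P_par, b_noisy, n_steps, ds, gamma, rng):
    """Projected BAOAB underdamped Langevin on the set {P_perp y = b_noisy}."""
    c1 = np.exp(-gamma * ds)                       # friction contraction
    c2 = np.sqrt(1.0 - np.exp(-2.0 * gamma * ds))  # OU noise amplitude
    v = np.zeros_like(y)
    force = P_par @ score(y)
    for _ in range(n_steps):
        v = v + 0.5 * ds * force                   # B/2: half kick
        y = y + 0.5 * ds * v                       # A/2: half drift
        xi = rng.standard_normal(y.shape)
        v = c1 * v + c2 * (P_par @ xi)             # O: exact OU refresh in ker(A)
        y = y + 0.5 * ds * v                       # A/2: half drift
        y = P_par @ y + b_noisy                    # guard against round-off drift
        force = P_par @ score(y)                   # one score call per iteration
        v = v + 0.5 * ds * force                   # B/2: half kick
    return y


# Toy usage: standard Gaussian target with score(x) = -x (an assumption).
rng = np.random.default_rng(0)
d = 4
A = rng.standard_normal((1, d))
P_perp = np.linalg.pinv(A) @ A
P_par = np.eye(d) - P_perp
y0 = rng.standard_normal(d)
b_noisy = P_perp @ y0
y_mixed = baoab_tangent(y0, lambda x: -x, P_par, b_noisy,
                        n_steps=200, ds=0.1, gamma=1.0, rng=rng)
print("normal drift:", np.abs(P_perp @ y_mixed - b_noisy).max())
```

Because the force and the noise are both projected, the velocity stays in $\ker(A)$ and the normal component of the position is preserved up to floating-point error; the explicit re-projection only removes that residual.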
After $K$ BAOAB iterations, we denote the resulting position by $\hat{Y}_{\tau^{*}}^{b}$. This state is well-mixed along $\ker(A)$ while remaining consistent with the forward-time noisy level set, making it a reliable initialization for the guided reverse denoising stage.

*Step 3: Guided Reverse Denoising.* Finally, starting from $\hat{Y}_{\tau^{*}}^{b}$, we integrate the guided reverse SDE in Equation ([2.8](https://arxiv.org/html/2605.05387#S2.E8)) from $\tau=\tau^{*}$ up to $T-t_{0}$.
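For intuition, Step 3 can be sketched with a plain Euler–Maruyama discretization of the surrogate drift $P_{\parallel}s_{T-\tau}(Y)+\frac{1}{T-\tau}P_{\perp}(b-Y)$, the form stated in Theorem 4. The experiments use guided DDIM steps instead, so this is only an illustrative stand-in: the uniform time grid, the toy score $s_{t}(x)=-x/(1+t)$ of a standard Gaussian prior, and the name `guided_reverse_em` are all assumptions.

```python
import numpy as np


def guided_reverse_em(y, score, P_par, P_perp, b, T, tau_start, t0, n_steps, rng):
    """Euler-Maruyama for dY = (P_par s_{T-tau}(Y) + P_perp (b - Y)/(T - tau)) dtau + dW."""
    taus = np.linspace(tau_start, T - t0, n_steps + 1)
    for tau, tau_next in zip(taus[:-1], taus[1:]):
        h = tau_next - tau
        t = T - tau                                  # forward-time noise level
        tangent = P_par @ score(y, t)                # unconditional tangent score
        normal = (P_perp @ (b - y)) / t              # analytic pull toward level b
        y = y + h * (tangent + normal) + np.sqrt(h) * rng.standard_normal(y.shape)
    return y


# Toy usage in R^3 with the first coordinate observed (hypothetical setup).
rng = np.random.default_rng(0)
P_perp = np.diag([1.0, 0.0, 0.0])
P_par = np.eye(3) - P_perp
b = np.array([0.7, 0.0, 0.0])
T, t_star, t0 = 1.0, 0.25, 1e-4
y_init = b + np.sqrt(t_star) * rng.standard_normal(3)
z_hat = guided_reverse_em(y_init, lambda x, t: -x / (1.0 + t), P_par, P_perp, b,
                          T=T, tau_start=T - t_star, t0=t0, n_steps=500, rng=rng)
print("normal residual:", np.linalg.norm(P_perp @ z_hat - b))
```

The normal drift acts like a Brownian-bridge pull, so the normal component ends close to $b$ with a residual set by $t_{0}$ and the final step size; stopping at $T-t_{0}$ avoids the singular factor $1/(T-\tau)$.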
Algorithm [1](https://arxiv.org/html/2605.05387#alg1) summarizes these three steps in pseudocode.
Algorithm 1: Conditional Sampling via Affine BAOAB Initialization

1. Input: Clean measurement $b$, starting point $x_{0}\in\mathcal{M}(b)$, intermediate noise $t^{*}=T-\tau^{*}$, Langevin steps $K$, step size $\Delta s$, friction $\gamma$, score network $s_{t^{*}}$.
2. Step 1: Initialization for Langevin
3. $y^{\tau^{*}}\leftarrow x_{0}+\sqrt{T-\tau^{*}}\,\xi,\quad\xi\sim\mathcal{N}(0,I_{d})$
4. $b_{\text{noisy}}\leftarrow P_{\perp}y^{\tau^{*}}$ $\triangleright$ Target level set for the Langevin phase
5. $y\leftarrow y^{\tau^{*}},\quad v\leftarrow 0$
6. $c_{1}\leftarrow e^{-\gamma\Delta s},\quad c_{2}\leftarrow\sqrt{1-e^{-2\gamma\Delta s}}$
7. Step 2: Tangent BAOAB Langevin
8. for $k=1$ to $K$ do
9. $v\leftarrow v+\frac{\Delta s}{2}P_{\parallel}s_{t^{*}}(y)$ $\triangleright$ B: Half-step kick (velocity)
10. $y\leftarrow y+\frac{\Delta s}{2}v$ $\triangleright$ A: Half-step drift (position)
11. $v\leftarrow c_{1}v+c_{2}P_{\parallel}\xi,\quad\xi\sim\mathcal{N}(0,I_{d})$ $\triangleright$ O: Projected noise injection
12. $y\leftarrow y+\frac{\Delta s}{2}v$ $\triangleright$ A: Half-step drift (position)
13. $v\leftarrow v+\frac{\Delta s}{2}P_{\parallel}s_{t^{*}}(y)$ $\triangleright$ B: Half-step kick (velocity)
14. $y\leftarrow P_{\parallel}y+b_{\text{noisy}}$ $\triangleright$ Constraint: Maintain $P_{\perp}y=b_{\text{noisy}}$
15. end for
16. $\hat{Y}_{\tau^{*}}^{b}\leftarrow y$
17. Step 3: Guided Reverse Denoising
18. Evolve $\hat{Y}_{\tau^{*}}^{b}$ from $\tau=\tau^{*}$ to $T-t_{0}$ using the guided reverse SDE in Equation ([2.8](https://arxiv.org/html/2605.05387#S2.E8))
19. Return: Final conditional sample $\hat{z}$
Figure 2: Visual overview of the proposed sampling process. Pipeline shown: input measurement $\bm{b}$ (linear constraint) → Step 1, initialization for Langevin, giving the noisy init $y^{\tau^{*}}$ (forward diffusion) → Step 2, tangent BAOAB Langevin, giving the Langevin state $\hat{Y}_{\tau^{*}}^{b}$ (BAOAB tangent mixing) → Step 3, guided reverse denoising, giving the final sample $x_{0}$ (generated data). Step 1: diffuse the constrained input to the intermediate noise level $t^{*}$. Step 2: run projected BAOAB underdamped Langevin dynamics to mix along the affine constraint set while preserving the noisy normal level. Step 3: perform guided reverse denoising with exact normal correction to obtain the final sample.
## 4 Experiments
We evaluate the proposed Langevin-Conditioned Diffusion Model with BAOAB (LCDM-BAOAB) sampler on standard $256\times 256$ image inverse problems. LCDM-BAOAB uses the affine normal–tangent decomposition developed in the previous sections: it first performs projected BAOAB Langevin mixing in the tangent directions at an intermediate noise level, and then completes sampling by guided DDIM denoising with exact normal correction.
We test on three benchmarks: CelebA-HQ (Karras et al., [2018](https://arxiv.org/html/2605.05387#bib.bib15)), LSUN Church (Yu et al., [2015](https://arxiv.org/html/2605.05387#bib.bib29)), and ImageNet (Deng et al., [2009](https://arxiv.org/html/2605.05387#bib.bib4)). As the primary baseline, we use DDNM (Wang et al., [2022](https://arxiv.org/html/2605.05387#bib.bib27)), a strong zero-shot diffusion method for linear image inverse problems.
We compare with DDNM because it has been reported to meaningfully outperform earlier zero-shot conditional sampling and restoration methods for linear inverse problems. In particular, DDNM was introduced as a unified zero-shot framework for linear image restoration tasks such as super-resolution, inpainting, colorization, compressed sensing, and deblurring, and was shown to improve over prior zero-shot approaches including ILVR, RePaint, DDRM, and DPS (Choi et al., [2021](https://arxiv.org/html/2605.05387#bib.bib2); Lugmayr et al., [2022](https://arxiv.org/html/2605.05387#bib.bib20); Kawar et al., [2022](https://arxiv.org/html/2605.05387#bib.bib16); Wang et al., [2022](https://arxiv.org/html/2605.05387#bib.bib27); Chung et al., [2022](https://arxiv.org/html/2605.05387#bib.bib3)). Thus, DDNM provides a strong projection-based reference point for testing whether the additional tangent-space BAOAB Langevin initialization in LCDM-BAOAB yields measurable improvements under matched compute.
In Appendix [D](https://arxiv.org/html/2605.05387#A4), we show that, under the VP–DDPM $\varepsilon$-parameterization, the DDNM update is equivalent to using the effective score

$$\hat{s}_{t}^{\rm DDNM}(x_{t};y)=P_{\parallel}s_{t}(x_{t})+\frac{\alpha_{t}b-P_{\perp}x_{t}}{\sigma_{t}^{2}}.$$

Thus DDNM applies the analytic correction in the normal directions while retaining the pretrained unconditional score in the tangent directions. Our experiments therefore test whether explicitly mixing in the tangent directions through projected BAOAB Langevin dynamics improves over a strong zero-shot projection-based sampler.
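The normal/tangent split of this effective score is easy to see in code. The sketch below simply evaluates the displayed formula with dense projectors; the function name, the toy dimensions, and the values of $\alpha_{t}$ and $\sigma_{t}$ are hypothetical, and the pretrained score is replaced by a random vector.

```python
import numpy as np


def ddnm_effective_score(x_t, score_t, P_par, P_perp, b, alpha_t, sigma_t):
    """Effective score: tangent part from the model, normal part analytic."""
    tangent = P_par @ score_t                           # unconditional tangent score
    normal = (alpha_t * b - P_perp @ x_t) / sigma_t**2  # measurement-driven correction
    return tangent + normal


# Toy check in R^5 with a single measurement (hypothetical setup).
rng = np.random.default_rng(0)
d = 5
A = rng.standard_normal((1, d))
P_perp = np.linalg.pinv(A) @ A
P_par = np.eye(d) - P_perp
b = P_perp @ rng.standard_normal(d)      # b lies in the normal space
x_t = rng.standard_normal(d)
score_t = rng.standard_normal(d)         # stand-in for s_t(x_t)
alpha_t, sigma_t = 0.8, 0.6

s_hat = ddnm_effective_score(x_t, score_t, P_par, P_perp, b, alpha_t, sigma_t)
print("tangent part unchanged:", np.allclose(P_par @ s_hat, P_par @ score_t))
```

Projecting the result onto $\ker(A)$ returns exactly the model's tangent score, which is the component that neither the measurement nor the projection step can correct.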
All experiments are performed in the zero-shot setting using pretrained $256\times 256$ diffusion backbones, with no task-specific fine-tuning. We report LPIPS and FID; lower values are better for both metrics.
All experiments are run under a matched budget of $100$ effective network function evaluations (NFEs). For DDNM, this corresponds to $100$ DDIM steps with $\eta=0.85$. For LCDM-BAOAB, the budget is split into $50$ projected BAOAB Langevin updates and $50$ guided DDIM denoising steps. In the $8\times$ super-resolution experiments, we use a $200$-point DDIM time grid and start LCDM-BAOAB from the $25\%$ point of this grid, so the guided reverse stage uses the final $50$ DDIM network evaluations. Thus all reported comparisons use the same total budget of $100$ NFEs. In the BAOAB phase, we cache and reuse the previous UNet output whenever possible, so that the effective number of score-network evaluations remains matched to DDNM.
The Langevin phase is introduced at a task-dependent discrete DDPM timestep, denoted $k_{\mathrm{mix}}$. This index refers to the implementation timestep of the pretrained DDPM sampler and should not be confused with the continuous safe-time parameter $t^{*}$ used in the theoretical analysis. For inpainting, we use $k_{\mathrm{mix}}=500$. For super-resolution, we use $k_{\mathrm{mix}}=250$, which corresponds to a later and higher-SNR point in the reverse trajectory. This choice reflects the different nature of the two inverse problems. In super-resolution, the main difficulty is recovering high-frequency detail from a heavily downsampled image, and tangent-space refinement is more stable once the iterate is closer to the data manifold. Therefore, for super-resolution, we perform BAOAB mixing later in the denoising trajectory than we do for masking tasks.
For super-resolution, we consider $8\times$ mean downsampling, mapping $32\times 32$ observations to $256\times 256$ images. Quantitative results are reported on $1000$ images per data set.
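For the mean-downsampling operator the projectors have a simple matrix-free form: since each row of $A$ averages one disjoint $8\times 8$ block, $AA^{\top}=\tfrac{1}{64}I$ and the pseudoinverse $A^{+}=64A^{\top}$ just replicates each low-resolution pixel over its block. The sketch below illustrates this for a single-channel image; the function names are hypothetical and the paper's implementation may differ.

```python
import numpy as np


def downsample(x, f=8):
    """A: mean over disjoint f-by-f blocks of a 2-D array."""
    h, w = x.shape
    return x.reshape(h // f, f, w // f, f).mean(axis=(1, 3))


def pinv_upsample(y, f=8):
    """A^+: replicate each low-resolution pixel over its f-by-f block."""
    return np.repeat(np.repeat(y, f, axis=0), f, axis=1)


def project_normal(x, f=8):
    """P_perp x = A^+ A x (blockwise mean, broadcast back to full size)."""
    return pinv_upsample(downsample(x, f), f)


rng = np.random.default_rng(0)
x = rng.standard_normal((256, 256))
y = downsample(x)                       # 32 x 32 observation
x_par = x - project_normal(x)           # tangent part: zero mean on every block
print(y.shape, np.abs(downsample(x_par)).max())
```

The tangent component is exactly the within-block detail that the observation cannot see, which is why this task leaves so much freedom along $\ker(A)$.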
### 4.1 Inpainting Results
We first evaluate fixed-mask inpainting. The fixed mask is chosen differently across data sets. On CelebA-HQ, we mask a facial region, since reconstructing a semantically important part of a human face is substantially more challenging than filling an arbitrary patch. On LSUN Church and ImageNet, we use the corresponding fixed square masks for those data sets. This distinction is important when interpreting the CelebA-HQ results, since the CelebA-HQ mask targets a harder semantic completion problem.
Table [1](https://arxiv.org/html/2605.05387#S4.T1) shows that LCDM-BAOAB consistently improves over DDNM on all three data sets. The improvement is modest on CelebA-HQ, but becomes larger on LSUN Church and especially on ImageNet. This trend is consistent with our hypothesis: as the data distribution becomes more diverse and the tangent space becomes more semantically ambiguous, projection-only guidance is more susceptible to tangent-space bias, and explicit tangent mixing becomes more beneficial.
Table 1: Fixed-mask inpainting results on $256\times 256$ benchmarks. Metrics are LPIPS $\downarrow$ and FID $\downarrow$.

To test whether the inpainting improvement persists beyond a single fixed corruption pattern, we also evaluate random-mask inpainting on $1000$ images per data set. For each image, we remove a $100\times 100$ square patch sampled at a random location. The random mask is tied deterministically to the image identity, so DDNM and LCDM-BAOAB are evaluated on exactly the same corrupted input for each image. This gives a paired comparison and removes any ambiguity about whether differences are caused by the sampler or by different mask locations.
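One simple way to tie a random mask to the image identity is to derive the random seed from a hash of the image identifier. The paper does not specify its mechanism, so the sketch below is only one possible realization: the function name, the use of SHA-256, and the example identifier are all assumptions.

```python
import hashlib

import numpy as np


def random_square_mask(image_id, size=256, patch=100):
    """Boolean mask (True = observed) with one patch-by-patch hole whose
    location depends only on image_id."""
    digest = hashlib.sha256(image_id.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")       # stable across runs and methods
    rng = np.random.default_rng(seed)
    top = int(rng.integers(0, size - patch + 1))
    left = int(rng.integers(0, size - patch + 1))
    mask = np.ones((size, size), dtype=bool)
    mask[top:top + patch, left:left + patch] = False
    return mask


mask_a = random_square_mask("example_image_0001")   # hypothetical identifier
mask_b = random_square_mask("example_image_0001")
print(np.array_equal(mask_a, mask_b), int((~mask_a).sum()))
```

With such a mask, the inpainting projector acts pixelwise: $P_{\perp}x$ keeps the observed pixels and $P_{\parallel}x$ keeps the pixels inside the hole, so both samplers receive an identical corrupted input.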
Table [2](https://arxiv.org/html/2605.05387#S4.T2) shows that LCDM-BAOAB again improves over DDNM across all three data sets. The gains are smallest on CelebA-HQ, larger on LSUN Church, and largest on ImageNet. On ImageNet, LCDM-BAOAB improves FID from $29.00$ to $20.91$ and LPIPS from $0.1182$ to $0.0933$. These results show that tangent-space BAOAB mixing remains beneficial even when the missing region varies across images.
Table 2: Random-mask inpainting results on $1000$ images per data set. For each image, a $100\times 100$ square mask is sampled at a random location and shared across methods. Metrics are LPIPS $\downarrow$ and FID $\downarrow$.

Figure 3: Visual comparison for inpainting. Panels: (a) Real image, (b) DDNM, (c) LCDM-BAOAB, (d) Real image, (e) DDNM, (f) LCDM-BAOAB. DDNM can produce texture artifacts or semantically inconsistent completions, especially on ImageNet. LCDM-BAOAB produces cleaner and more coherent reconstructions while preserving measurement consistency.
### 4.2 Super-Resolution Results
We next evaluate $8\times$ super-resolution, where the observation is obtained by mean downsampling a $256\times 256$ image to $32\times 32$. This inverse problem is substantially more ill-posed than inpainting because most high-frequency information is removed by the forward operator. Consequently, the conditional distribution contains a large tangent-space ambiguity: many high-resolution images are consistent with the same low-resolution observation.
For this task, we introduce the BAOAB Langevin phase at the discrete DDPM timestep $k_{\mathrm{mix}}=250$, later in the reverse trajectory than in the inpainting experiments. This higher-SNR starting point makes tangent refinement more stable and allows the sampler to recover fine-scale structure after the coarse image content has already been established.
Table [3](https://arxiv.org/html/2605.05387#S4.T3) shows that LCDM-BAOAB consistently improves over DDNM on all three data sets. The gains are largest on ImageNet, where the conditional ambiguity is strongest, and remain substantial on LSUN Church. These results support the central claim of the paper: enforcing the measurement in the normal directions is not sufficient for highly ill-posed linear inverse problems; additional tangent-space mixing can substantially improve perceptual and distributional quality.
Table 3: $8\times$ super-resolution results on $256\times 256$ benchmarks. Metrics are LPIPS $\downarrow$ and FID $\downarrow$.

Figure 4: $8\times$ super-resolution on ImageNet. Panels: (a) Real image, (b) DDNM, $8\times$, (c) LCDM-BAOAB, $8\times$. DDNM tends to produce blurred or structurally inconsistent outputs, while LCDM-BAOAB recovers sharper edges and more realistic high-frequency detail.
## 5 Error Decomposition and Average KL Bounds
We now state quantitative guarantees for the discrepancy between the ideal conditional sampler and our practical procedure. The key idea is to align the analysis with the algorithmic structure: starting from a safe noise level $t^{*}$, our method (i) approximately initializes the reverse process at time $t^{*}$ using the two-stage normal/tangent construction, and (ii) then evolves via a guided reverse SDE whose normal drift is exact but whose tangent drift uses the unconditional score. Accordingly, the results below separate the total error into an *initialization* term at time $t^{*}$ and a *pathwise* term accumulated during the reverse evolution. The first theorem bounds the pathwise KL divergence between the true conditional and surrogate reverse-time *path measures* in terms of conditional mutual information between tangent and normal components. We then introduce assumptions that control how the tangent conditional marginal varies with the level $b$, and use them to bound the average initialization error. Combining these ingredients yields an average terminal KL bound for the tangent marginal of the generated sample, together with sharper consequences under additional separation conditions on the admissible levels. In summary, the error of our conditional sampling procedure has two conceptual contributions:
- (i) a *pathwise* error from using the unconditional tangent score in the guided reverse SDE instead of the true conditional tangent score;
- (ii) an *initialization* error at time $t^{*}$, because we only approximately sample from the true conditional marginal $\operatorname{Law}(X_{t^{*}}\mid P_{\perp}Z=b)$ using the two-stage procedure described above.

The next theorem controls the pathwise error in terms of conditional mutual information between tangent and normal components. To prove it, we only need a second moment bound on $Z$.
###### Assumption 5.1
For $Z\sim p_{0}$, we have $\mathbb{E}\|Z\|^{2}<\infty$.
###### Theorem 4
Let Assumption [5.1](https://arxiv.org/html/2605.05387#S5.Thmassumption1) be in force. Fix $0\leq\tau^{*}<T-t_{0}$. For each level $b$, consider on $[\tau^{*},T-t_{0}]$ the ideal conditional reverse SDE ([2.7](https://arxiv.org/html/2605.05387#S2.E7)) and the surrogate constrained reverse SDE ([2.8](https://arxiv.org/html/2605.05387#S2.E8)), started from the same initial law at time $\tau^{*}$:

$$\begin{aligned}\mathrm{d}Y_{\tau}^{*,b}&=\Big(P_{\parallel}s_{T-\tau}^{*,b}(Y_{\tau}^{*,b})+\frac{1}{T-\tau}P_{\perp}(b-Y_{\tau}^{*,b})\Big)\,\mathrm{d}\tau+\mathrm{d}\bar{W}_{\tau},\\ \mathrm{d}\hat{Y}_{\tau}^{\,b}&=\Big(P_{\parallel}s_{T-\tau}(\hat{Y}_{\tau}^{\,b})+\frac{1}{T-\tau}P_{\perp}(b-\hat{Y}_{\tau}^{\,b})\Big)\,\mathrm{d}\tau+\mathrm{d}\bar{W}_{\tau}.\end{aligned}$$

Let $\mathbb{P}^{Y^{*,b}}$ and $\mathbb{P}^{\hat{Y}^{\,b}}$ denote the corresponding path measures on $[\tau^{*},T-t_{0}]$. Decompose the clean signal as

$$Z^{\parallel}:=P_{\parallel}Z,\qquad Z^{\perp}:=P_{\perp}Z.$$

Then

$$\mathbb{E}_{B}\!\Big[\mathrm{KL}\!\big(\mathbb{P}^{Y^{*,B}}\,\|\,\mathbb{P}^{\hat{Y}^{\,B}}\big)\Big]\leq I\!\big(Z^{\parallel};Z^{\perp}\mid X_{t^{*}}\big).$$

Moreover,

$$\mathbb{E}_{B}\!\Big[\mathrm{KL}\!\big(\mathbb{P}^{Y^{*,B}}\,\|\,\mathbb{P}^{\hat{Y}^{\,B}}\big)\Big]\geq I\!\big(Z^{\parallel};Z^{\perp}\mid X_{t^{*}}\big)-I\!\big(Z^{\parallel};Z^{\perp}\mid X_{t^{*}}^{\perp}\big)-I\!\big(Z^{\parallel};Z^{\perp}\mid X_{t_{0}}\big).$$
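To get a feel for the upper bound, consider a hypothetical jointly Gaussian signal $Z\sim\mathcal{N}(0,\Sigma)$ in $\mathbb{R}^{2}$ with the first coordinate tangent and the second normal. This case is chosen only because the conditional mutual information has a closed form: the posterior of $Z$ given $X_{t}=Z+W_{t}$ is Gaussian with covariance $(\Sigma^{-1}+t^{-1}I)^{-1}$, independent of the observed value, so $I(Z^{\parallel};Z^{\perp}\mid X_{t})=-\tfrac12\log(1-\rho_{t}^{2})$ with $\rho_{t}$ the posterior correlation. The covariance below and the function name are assumptions for illustration.

```python
import numpy as np


def gaussian_cmi(Sigma, t):
    """I(Z1; Z2 | X_t) in nats for Z ~ N(0, Sigma), X_t = Z + N(0, t I), d = 2."""
    d = Sigma.shape[0]
    post = np.linalg.inv(np.linalg.inv(Sigma) + np.eye(d) / t)  # posterior covariance
    rho2 = post[0, 1] ** 2 / (post[0, 0] * post[1, 1])          # squared correlation
    return -0.5 * np.log1p(-rho2)


Sigma = np.array([[1.0, 0.8], [0.8, 1.0]])     # hypothetical prior covariance
prior_mi = -0.5 * np.log(1.0 - 0.8**2)         # I(Z1; Z2) without conditioning

for t in (0.25, 4.0, 400.0):
    print(f"t = {t:7.2f}   I(Z_par; Z_perp | X_t) = {gaussian_cmi(Sigma, t):.4f}")
print(f"prior mutual information          = {prior_mi:.4f}")
```

In this toy case the bound grows with $t$ toward the unconditional mutual information and shrinks as $t\to 0$, which matches the intuition that starting the surrogate dynamics at a lower noise level leaves less tangent/normal dependence to be mis-modeled.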
Theorem [4](https://arxiv.org/html/2605.05387#Thmtheorem4) is stated for a general clean signal $Z$, and shows that the pathwise error is controlled by the conditional mutual information

$$I\!\big(Z^{\parallel};Z^{\perp}\mid X_{t^{*}}\big).$$

To make this quantity more concrete, we now specialize to a latent Gaussian-mixture model. This specialization is motivated by modern discrete-latent generative models for images, in which the observed image can be viewed as a structured latent code decoded into pixel space up to a small reconstruction error. Under this model, $Z$ is a Gaussian perturbation of a discrete latent variable $S$, and the dependence term in Theorem [4](https://arxiv.org/html/2605.05387#Thmtheorem4) can be compared to the corresponding latent dependence

$$I\!\big(S^{\parallel};S^{\perp}\mid X_{t^{*}}\big).$$

The next assumption and proposition formalize this reduction.
###### Assumption 5.2
The clean signal $Z\in\mathbb{R}^{d}$ admits the representation

$$Z=S+\varepsilon N,$$

where $S$ is a discrete random vector taking values in a countable set $\mathcal{C}\subset\mathbb{R}^{d}$, $N\sim\mathcal{N}(0,I_{d})$, $N$ is independent of $S$, and $\varepsilon>0$. Equivalently, $Z$ follows a countable Gaussian mixture distribution whose components have means in $\mathcal{C}$ and common covariance $\varepsilon^{2}I_{d}$.
We write

$$S^{\parallel}:=P_{\parallel}S,\qquad S^{\perp}:=P_{\perp}S,$$

and

$$Z^{\parallel}:=P_{\parallel}Z,\qquad Z^{\perp}:=P_{\perp}Z.$$

In addition, we assume that the projected latent normal code has finite entropy,

$$H(S^{\perp})<\infty.$$

Whenever Rényi-entropy bounds are invoked, we further assume that the order-$1/2$ Rényi entropy is finite:

$$H_{1/2}(S^{\perp})<\infty.$$
###### Proposition 6
Let Assumption [5.2](https://arxiv.org/html/2605.05387#S5.Thmassumption2) be in force. Then, for every $t\geq 0$,

$$I\!\big(Z^{\parallel};Z^{\perp}\mid X_{t}\big)\leq I\!\big(S^{\parallel};S^{\perp}\mid X_{t}\big).$$

In particular, at the safe time $t^{*}$,

$$I\!\big(Z^{\parallel};Z^{\perp}\mid X_{t^{*}}\big)\leq I\!\big(S^{\parallel};S^{\perp}\mid X_{t^{*}}\big).$$
*Proof.* Fix $t\geq 0$. Under Assumption [5.2](https://arxiv.org/html/2605.05387#S5.Thmassumption2),

$$Z=S+\varepsilon N,\qquad X_{t}=Z+W_{t}=S+\varepsilon N+W_{t},$$

where $N\sim\mathcal{N}(0,I_{d})$ and $W_{t}\sim\mathcal{N}(0,tI_{d})$ are independent Gaussian noises. Since $P_{\parallel}$ and $P_{\perp}$ are orthogonal projections, the tangent and normal noise components are independent. Hence, for every regular conditional law given $X_{t}=x$,

$$\operatorname{Law}\!\big(Z^{\parallel},Z^{\perp}\mid S^{\parallel},S^{\perp},X_{t}=x\big)=\operatorname{Law}\!\big(Z^{\parallel}\mid S^{\parallel},X_{t}^{\parallel}=x^{\parallel}\big)\otimes\operatorname{Law}\!\big(Z^{\perp}\mid S^{\perp},X_{t}^{\perp}=x^{\perp}\big).$$

Thus, conditionally on $X_{t}=x$, the pair $(Z^{\parallel},Z^{\perp})$ is obtained from $(S^{\parallel},S^{\perp})$ by applying two separate conditionally independent channels: one from $S^{\parallel}$ to $Z^{\parallel}$, and one from $S^{\perp}$ to $Z^{\perp}$. Therefore, by the data-processing inequality for mutual information under product channels,

$$I_{\operatorname{Law}(\cdot\mid X_{t}=x)}\!\big(Z^{\parallel};Z^{\perp}\big)\leq I_{\operatorname{Law}(\cdot\mid X_{t}=x)}\!\big(S^{\parallel};S^{\perp}\big)$$

for $X_{t}$-almost every $x$. Integrating this inequality with respect to the law of $X_{t}$ gives

$$I\!\big(Z^{\parallel};Z^{\perp}\mid X_{t}\big)\leq I\!\big(S^{\parallel};S^{\perp}\mid X_{t}\big).$$

The statement at $t=t^{*}$ is the same inequality evaluated at the safe time.
We now formalize the assumptions needed to control the initialization error at time $t^{*}$ and to express the pathwise term in latent information-theoretic form.
###### Assumption 5.3
Define

$$\mathcal{C}^{\perp}:=P_{\perp}\mathcal{C}.$$

For each $t\geq t_{0}$ and $c\in\mathcal{C}^{\perp}$, let

$$r_{t}^{c}:=\operatorname{Law}(X_{t}^{\parallel}\mid S^{\perp}=c).$$

We assume that for every $t\geq t_{0}$ there exists a finite constant $L_{t}<\infty$ such that for all $c_{1},c_{2}\in\mathcal{C}^{\perp}$,

$$\mathrm{KL}\!\left(r_{t}^{c_{1}}\,\middle\|\,r_{t}^{c_{2}}\right)\leq L_{t}\,\|c_{1}-c_{2}\|_{2}^{2}.\tag{5.1}$$
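A simple special case shows what the constant $L_{t}$ can look like. Suppose, purely for illustration, that the tangent latent code is a linear function of the normal code, $S^{\parallel}=MS^{\perp}$ in suitable coordinates. Then $r_{t}^{c}$ is Gaussian with mean $Mc$ and covariance $(t+\varepsilon^{2})I$ in the tangent coordinates, the KL divergence is $\|M(c_{1}-c_{2})\|^{2}/(2(t+\varepsilon^{2}))$, and (5.1) holds with $L_{t}=\|M\|_{\mathrm{op}}^{2}/(2(t+\varepsilon^{2}))$. The matrix $M$, the dimensions, and the function names below are hypothetical.

```python
import numpy as np


def kl_isotropic_gaussians(m1, m2, var):
    """KL(N(m1, var I) || N(m2, var I)) = ||m1 - m2||^2 / (2 var)."""
    diff = np.asarray(m1) - np.asarray(m2)
    return float(diff @ diff) / (2.0 * var)


def lipschitz_constant(M, t, eps):
    """L_t for the linear-code special case: ||M||_op^2 / (2 (t + eps^2))."""
    return np.linalg.norm(M, 2) ** 2 / (2.0 * (t + eps**2))


rng = np.random.default_rng(0)
M = rng.standard_normal((3, 2))      # hypothetical map from normal to tangent codes
t, eps = 0.25, 0.1
L_t = lipschitz_constant(M, t, eps)

c1, c2 = rng.standard_normal(2), rng.standard_normal(2)
kl = kl_isotropic_gaussians(M @ c1, M @ c2, t + eps**2)
bound = L_t * float((c1 - c2) @ (c1 - c2))
print(f"KL = {kl:.4f} <= L_t * ||c1 - c2||^2 = {bound:.4f}")
```

In this special case $L_{t}$ decays like $1/(t+\varepsilon^{2})$, so the product $L_{t^{*}}(t^{*}+\varepsilon^{2})$ that multiplies the entropy term in Theorem 9 stays bounded.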
We now state our main quantitative guarantee for the *terminal tangent marginal*. Recall that the ideal conditional reverse-time dynamics $\{Y_{\tau}^{*,b}\}_{\tau\in[\tau^{*},\,T-t_{0}]}$, initialized from the true conditional marginal at time $\tau^{*}$, and the practical surrogate procedure $\{\hat{Y}_{\tau}^{\,b}\}_{\tau\in[\tau^{*},\,T-t_{0}]}$, obtained by the two-stage initialization at time $t^{*}$ followed by the surrogate guided reverse dynamics, induce terminal tangent laws

$$\mu_{T-t_{0}}^{*,b}:=\operatorname{Law}\!\big(P_{\parallel}Y_{T-t_{0}}^{*,b}\big),\qquad\hat{\mu}_{T-t_{0}}^{\,b}:=\operatorname{Law}\!\big(P_{\parallel}\hat{Y}_{T-t_{0}}^{\,b}\big).$$

Our goal is to bound the averaged terminal discrepancy

$$\mathbb{E}_{B}\!\left[\mathrm{KL}\!\big(\mu_{T-t_{0}}^{*,B}\,\big\|\,\hat{\mu}_{T-t_{0}}^{\,B}\big)\right].$$
###### Theorem 9
Let Assumptions [5.1](https://arxiv.org/html/2605.05387#S5.Thmassumption1), [5.2](https://arxiv.org/html/2605.05387#S5.Thmassumption2), and [5.3](https://arxiv.org/html/2605.05387#S5.Thmassumption3) be in force. Let

$$H:=H(S^{\perp}),$$

and assume $H<\infty$. Fix a safe noise level $t^{*}\in(t_{0},T)$ (equivalently, $\tau^{*}=T-t^{*}$). Then

$$\mathbb{E}_{B}\!\left[\mathrm{KL}\!\big(\mu_{T-t_{0}}^{*,B}\,\big\|\,\hat{\mu}_{T-t_{0}}^{\,B}\big)\right]\leq 4L_{t^{*}}(t^{*}+\varepsilon^{2})\,H+I\!\big(S^{\parallel};\,S^{\perp}\mid X_{t^{*}}\big).\tag{5.2}$$
Theorem[9](https://arxiv.org/html/2605.05387#Thmtheorem9)applies without any geometric separation assumption on the set of admissible latent normal codes𝒞⟂\\mathcal\{C\}^\{\\perp\}\. In that general case, the noisy normal observation may remain ambiguous among several nearby latent codes\. We now show that, if the admissible codes are uniformly separated, then such confusions become rare and both the initialization and pathwise contributions become exponentially small\.
###### Assumption 5\.4
There existsδ\>0\\delta\>0such that for all distinctc,c~∈𝒞⟂c,\\tilde\{c\}\\in\\mathcal\{C\}^\{\\perp\},
‖c−c~‖2≥δ\.\\\|c\-\\tilde\{c\}\\\|\_\{2\}\\geq\\delta\.
Assumption[5\.4](https://arxiv.org/html/2605.05387#S5.Thmassumption4)enforces a minimum spacing between admissible latent normal codes\. Since
Xt∗⟂=S⟂\+t∗\+ε2G,G∼𝒩\(0,Id\),X\_\{t^\{\*\}\}^\{\\perp\}=S^\{\\perp\}\+\\sqrt\{t^\{\*\}\+\\varepsilon^\{2\}\}\\,G,\\qquad G\\sim\\mathcal\{N\}\(0,I\_\{d\}\),confusing the true latent codeccwith a different codec~\\tilde\{c\}requires a Gaussian fluctuation of order at leastδ\\delta, which occurs with probabilityexp\(−Ω\(δ2/\(t∗\+ε2\)\)\)\\exp\(\-\\Omega\(\\delta^\{2\}/\(t^\{\*\}\+\\varepsilon^\{2\}\)\)\)\. This separation upgrades the Shannon\-scale control above to exponentially small error bounds\.
###### Theorem 11
Let Assumptions[5\.1](https://arxiv.org/html/2605.05387#S5.Thmassumption1),[5\.2](https://arxiv.org/html/2605.05387#S5.Thmassumption2),[5\.3](https://arxiv.org/html/2605.05387#S5.Thmassumption3), and[5\.4](https://arxiv.org/html/2605.05387#S5.Thmassumption4)be in force\. Let
H1/2:=H1/2\(S⟂\)=2log∑c∈𝒞⟂pS⟂\(c\),σ∗2:=t∗\+ε2,H\_\{1/2\}:=H\_\{1/2\}\(S^\{\\perp\}\)=2\\log\\sum\_\{c\\in\\mathcal\{C\}^\{\\perp\}\}\\sqrt\{p\_\{S^\{\\perp\}\}\(c\)\},\\qquad\\sigma\_\{\*\}^\{2\}:=t^\{\*\}\+\\varepsilon^\{2\},and fix a safe noise levelt∗∈\(t0,T\)t^\{\*\}\\in\(t\_\{0\},T\)\. Then
$$
\mathbb{E}_{B}\Big[\mathrm{KL}\big(p_{t^{*}}^{*,B}\,\|\,\hat{p}_{t^{*}}^{\,B}\big)+\mathrm{KL}\big(\mathbb{P}^{Y^{*,B}}\,\|\,\mathbb{P}^{\hat{Y}^{\,B}}\big)\Big]\leq L_{t^{*}}\Big(\frac{\delta^{2}}{2}+4\sigma_{*}^{2}\Big)\exp\Big(H_{1/2}-\frac{\delta^{2}}{8\sigma_{*}^{2}}\Big)+2\exp\Big(H_{1/2}-\frac{\delta^{2}}{8\sigma_{*}^{2}}\Big). \tag{5.3}
$$
Consequently, by data processing,
𝔼B\[KL\(μT−t0∗,B∥μ^T−t0B\)\]≤\[Lt∗\(δ22\+4σ∗2\)\+2\]exp\(H1/2−δ28σ∗2\)\.\\mathbb\{E\}\_\{B\}\\\!\\Big\[\\mathrm\{KL\}\\\!\\big\(\\mu\_\{T\-t\_\{0\}\}^\{\*,B\}\\,\\\|\\,\\hat\{\\mu\}\_\{T\-t\_\{0\}\}^\{\\,B\}\\big\)\\Big\]\\leq\\Big\[L\_\{t^\{\*\}\}\\Big\(\\frac\{\\delta^\{2\}\}\{2\}\+4\\sigma\_\{\*\}^\{2\}\\Big\)\+2\\Big\]\\exp\\\!\\Big\(H\_\{1/2\}\-\\frac\{\\delta^\{2\}\}\{8\\sigma\_\{\*\}^\{2\}\}\\Big\)\.\(5\.4\)
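To get a feel for how quickly the right-hand side of (5.4) decays with the separation $\delta$, the following minimal sketch evaluates it numerically. The code distribution `p_uniform`, the numerical values of `t_star` and `eps`, and the constant `L` standing in for $L_{t^{*}}$ are all hypothetical choices made for illustration; they are not taken from the paper.

```python
import numpy as np

def renyi_half_entropy(p):
    """Renyi entropy of order 1/2 in nats: 2 * log(sum_c sqrt(p(c)))."""
    p = np.asarray(p, dtype=float)
    return 2.0 * np.log(np.sum(np.sqrt(p)))

def separated_bound(p, delta, t_star, eps, L):
    """Right-hand side of (5.4).

    p      : probabilities of the admissible normal codes
    delta  : minimum separation between codes (Assumption 5.4)
    t_star : safe noise level t^*
    eps    : the epsilon entering sigma_*^2 = t^* + eps^2
    L      : user-supplied stand-in for L_{t^*} (hypothetical value)
    """
    sigma2 = t_star + eps**2
    h_half = renyi_half_entropy(p)
    prefactor = L * (delta**2 / 2.0 + 4.0 * sigma2) + 2.0
    return prefactor * np.exp(h_half - delta**2 / (8.0 * sigma2))

# Sixteen equally likely codes: H_{1/2} reduces to log(16).
p_uniform = np.full(16, 1.0 / 16)
bounds = [separated_bound(p_uniform, d, t_star=0.5, eps=0.1, L=1.0)
          for d in (4.0, 8.0, 16.0)]
print(bounds)
```

In this toy setting the polynomial prefactor grows with $\delta$, but the Gaussian factor $\exp(-\delta^{2}/(8\sigma_{*}^{2}))$ dominates, so the bound shrinks as the codes move apart.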
Acknowledgments and Disclosure of Funding
Funding in direct support of this work: none\. Competing interests and additional revenues related to this work: the authors declare no competing interests\.
## Appendix A Proof of Theorem [4](https://arxiv.org/html/2605.05387#Thmtheorem4)
We compare the true conditional reverse\-time dynamics and the surrogate guided dynamics at the level of path measures\. Since the two SDEs have the same diffusion coefficient and differ only in the tangent drift, the first task is to justify a Girsanov formula for their relative entropy\. For this, in Lemma[13](https://arxiv.org/html/2605.05387#Thmtheorem13)we first prove that the posterior meanx↦𝔼\[Z∣Xt=x\]x\\mapsto\\mathbb\{E\}\[Z\\mid X\_\{t\}=x\]has at most linear growth; via Tweedie’s formula, this implies the required linear\-growth control on the two drifts\. Then Lemma[14](https://arxiv.org/html/2605.05387#Thmtheorem14)converts the pathwise KL divergence into an integral of the squared drift gap along the true conditional path\.
The next step is to rewrite this drift gap in statistical terms\. Using Tweedie’s identity and projecting onto the tangent space, the drift difference becomes the difference between two posterior means of the tangent componentU=P∥ZU=P\_\{\\parallel\}Z: one conditioned on the noisy observation alone, and one conditioned on the noisy observation together with the normal componentB=P⟂ZB=P\_\{\\perp\}Z\. Averaging over the random levelBBand applying the MMSE\-gap identity turns the pathwise KL bound into an integral of conditional MMSE differences\. The conditional I–MMSE lemma is then used to identify this integral with a difference of conditional mutual informations, yielding the upper bound in terms ofI\(U;B∣Xt∗\)I\(U;B\\mid X\_\{t^\{\*\}\}\)\.
For the lower bound, the same MMSE representation is kept on the finite interval corresponding tot∈\[t0,t∗\]t\\in\[t\_\{0\},t^\{\*\}\]\. One then isolates the error term involving the MMSE of the normal componentBB, and controls it by projecting the observation onto the normal subspace\. The key observation is that, conditional onUU, the parallel observation carries no information aboutBB, so the relevant MMSE gap can be reduced to a Gaussian channel only in the normal directions\. Applying the conditional I–MMSE identity once more to this reduced channel yields the correction term involvingI\(U;B∣Xt∗⟂\)I\(U;B\\mid X\_\{t^\{\*\}\}^\{\\perp\}\), and this gives the stated lower bound\.
###### Lemma 13
Let Assumption [5.1](https://arxiv.org/html/2605.05387#S5.Thmassumption1) be in force. Fix $t\geq 0$ and set
mt\(x\):=𝔼\[Z∣Xt=x\],x∈ℝd\.m\_\{t\}\(x\):=\\mathbb\{E\}\[Z\\mid X\_\{t\}=x\],\\qquad x\\in\\mathbb\{R\}^\{d\}\.Then there exists a constantCt<∞C\_\{t\}<\\inftysuch that
$$|m_{t}(x)|\leq C_{t}(1+|x|),\qquad x\in\mathbb{R}^{d}.$$
In particular, the posterior mean $x\mapsto\mathbb{E}[Z\mid X_{t}=x]$ has at most linear growth.
Proof. Let $\mu:=\mathrm{Law}(Z)$, and let
ϕt\(u\):=\(2πt\)−d/2exp\(−\|u\|22t\),u∈ℝd,\\phi\_\{t\}\(u\):=\(2\\pi t\)^\{\-d/2\}\\exp\\\!\\left\(\-\\frac\{\|u\|^\{2\}\}\{2t\}\\right\),\\qquad u\\in\\mathbb\{R\}^\{d\},be the Gaussian kernel with covariance matrixtIdtI\_\{d\}\. The law ofXtX\_\{t\}admits density
pt\(x\)=∫ℝdϕt\(x−z\)μ\(dz\),p\_\{t\}\(x\)=\\int\_\{\\mathbb\{R\}^\{d\}\}\\phi\_\{t\}\(x\-z\)\\,\\mu\(dz\),and the conditional mean is given by
mt\(x\)=∫ℝdzϕt\(x−z\)μ\(dz\)∫ℝdϕt\(x−z\)μ\(dz\)\.m\_\{t\}\(x\)=\\frac\{\\int\_\{\\mathbb\{R\}^\{d\}\}z\\,\\phi\_\{t\}\(x\-z\)\\,\\mu\(dz\)\}\{\\int\_\{\\mathbb\{R\}^\{d\}\}\\phi\_\{t\}\(x\-z\)\\,\\mu\(dz\)\}\.Set
N\(x\):=∫ℝdzϕt\(x−z\)μ\(dz\),D\(x\):=∫ℝdϕt\(x−z\)μ\(dz\)=pt\(x\)\.N\(x\):=\\int\_\{\\mathbb\{R\}^\{d\}\}z\\,\\phi\_\{t\}\(x\-z\)\\,\\mu\(dz\),\\qquad D\(x\):=\\int\_\{\\mathbb\{R\}^\{d\}\}\\phi\_\{t\}\(x\-z\)\\,\\mu\(dz\)=p\_\{t\}\(x\)\.Then
mt\(x\)=N\(x\)D\(x\)\.m\_\{t\}\(x\)=\\frac\{N\(x\)\}\{D\(x\)\}\.
We shall prove that
\|N\(x\)\|≤Ct\(1\+\|x\|\)D\(x\),x∈ℝd\.\|N\(x\)\|\\leq C\_\{t\}\(1\+\|x\|\)D\(x\),\\qquad x\\in\\mathbb\{R\}^\{d\}\.
ChooseR\>0R\>0such that
a:=μ\(B\(0,R\)\)\>0\.a:=\\mu\(B\(0,R\)\)\>0\.This is possible sinceμ\\muis a probability measure\. Fixx∈ℝdx\\in\\mathbb\{R\}^\{d\}\. We split the numerator into a near part and a far part:
N\(x\)=N1\(x\)\+N2\(x\),N\(x\)=N\_\{1\}\(x\)\+N\_\{2\}\(x\),where
N1\(x\):=∫\{\|z\|≤4\(\|x\|\+R\)\}zϕt\(x−z\)μ\(dz\),N\_\{1\}\(x\):=\\int\_\{\\\{\|z\|\\leq 4\(\|x\|\+R\)\\\}\}z\\,\\phi\_\{t\}\(x\-z\)\\,\\mu\(dz\),and
N2\(x\):=∫\{\|z\|\>4\(\|x\|\+R\)\}zϕt\(x−z\)μ\(dz\)\.N\_\{2\}\(x\):=\\int\_\{\\\{\|z\|\>4\(\|x\|\+R\)\\\}\}z\\,\\phi\_\{t\}\(x\-z\)\\,\\mu\(dz\)\.
On the set\{\|z\|≤4\(\|x\|\+R\)\}\\\{\|z\|\\leq 4\(\|x\|\+R\)\\\}one has\|z\|≤4\(\|x\|\+R\)\|z\|\\leq 4\(\|x\|\+R\), and therefore
\|N1\(x\)\|≤∫\{\|z\|≤4\(\|x\|\+R\)\}\|z\|ϕt\(x−z\)μ\(dz\)≤4\(\|x\|\+R\)D\(x\)\.\|N\_\{1\}\(x\)\|\\leq\\int\_\{\\\{\|z\|\\leq 4\(\|x\|\+R\)\\\}\}\|z\|\\,\\phi\_\{t\}\(x\-z\)\\,\\mu\(dz\)\\leq 4\(\|x\|\+R\)D\(x\)\.
Letz∈ℝdz\\in\\mathbb\{R\}^\{d\}satisfy\|z\|\>4\(\|x\|\+R\)\|z\|\>4\(\|x\|\+R\), and letu∈B\(0,R\)u\\in B\(0,R\)\. Then
\|x−u\|≤\|x\|\+\|u\|≤\|x\|\+R<\|z\|4,\|x\-u\|\\leq\|x\|\+\|u\|\\leq\|x\|\+R<\\frac\{\|z\|\}\{4\},while
\|x−z\|≥\|z\|−\|x\|\>\|z\|−\|z\|4=34\|z\|\.\|x\-z\|\\geq\|z\|\-\|x\|\>\|z\|\-\\frac\{\|z\|\}\{4\}=\\frac\{3\}\{4\}\|z\|\.Hence
\|x−z\|2−\|x−u\|2≥916\|z\|2−116\|z\|2=12\|z\|2\.\|x\-z\|^\{2\}\-\|x\-u\|^\{2\}\\geq\\frac\{9\}\{16\}\|z\|^\{2\}\-\\frac\{1\}\{16\}\|z\|^\{2\}=\\frac\{1\}\{2\}\|z\|^\{2\}\.Consequently,
ϕt\(x−z\)ϕt\(x−u\)=exp\(−\|x−z\|2−\|x−u\|22t\)≤exp\(−\|z\|24t\)\.\\frac\{\\phi\_\{t\}\(x\-z\)\}\{\\phi\_\{t\}\(x\-u\)\}=\\exp\\\!\\left\(\-\\frac\{\|x\-z\|^\{2\}\-\|x\-u\|^\{2\}\}\{2t\}\\right\)\\leq\\exp\\\!\\left\(\-\\frac\{\|z\|^\{2\}\}\{4t\}\\right\)\.Thus,
ϕt\(x−z\)≤e−\|z\|2/\(4t\)ϕt\(x−u\),u∈B\(0,R\)\.\\phi\_\{t\}\(x\-z\)\\leq e^\{\-\|z\|^\{2\}/\(4t\)\}\\,\\phi\_\{t\}\(x\-u\),\\qquad u\\in B\(0,R\)\.Integrating this inequality with respect toμ\(du\)\\mu\(du\)overB\(0,R\)B\(0,R\)gives
aϕt\(x−z\)≤e−\|z\|2/\(4t\)∫B\(0,R\)ϕt\(x−u\)μ\(du\)≤e−\|z\|2/\(4t\)D\(x\),a\\,\\phi\_\{t\}\(x\-z\)\\leq e^\{\-\|z\|^\{2\}/\(4t\)\}\\int\_\{B\(0,R\)\}\\phi\_\{t\}\(x\-u\)\\,\\mu\(du\)\\leq e^\{\-\|z\|^\{2\}/\(4t\)\}D\(x\),and therefore
ϕt\(x−z\)≤a−1e−\|z\|2/\(4t\)D\(x\)\.\\phi\_\{t\}\(x\-z\)\\leq a^\{\-1\}e^\{\-\|z\|^\{2\}/\(4t\)\}D\(x\)\.Using this bound, we obtain
$$
\begin{aligned}
|N_{2}(x)| &\leq \int_{\{|z|>4(|x|+R)\}}|z|\,\phi_{t}(x-z)\,\mu(dz)\\
&\leq a^{-1}D(x)\int_{\mathbb{R}^{d}}|z|\,e^{-|z|^{2}/(4t)}\,\mu(dz).
\end{aligned}
$$
Since the function $r\mapsto re^{-r^{2}/(4t)}$ is bounded on $[0,\infty)$, the quantity
Ct,1:=a−1∫ℝd\|z\|e−\|z\|2/\(4t\)μ\(dz\)C\_\{t,1\}:=a^\{\-1\}\\int\_\{\\mathbb\{R\}^\{d\}\}\|z\|e^\{\-\|z\|^\{2\}/\(4t\)\}\\,\\mu\(dz\)is finite\. Hence
\|N2\(x\)\|≤Ct,1D\(x\)\.\|N\_\{2\}\(x\)\|\\leq C\_\{t,1\}D\(x\)\.
Combining the bounds forN1\(x\)N\_\{1\}\(x\)andN2\(x\)N\_\{2\}\(x\), we get
\|N\(x\)\|≤\(4\(\|x\|\+R\)\+Ct,1\)D\(x\)\.\|N\(x\)\|\\leq\\bigl\(4\(\|x\|\+R\)\+C\_\{t,1\}\\bigr\)D\(x\)\.Dividing byD\(x\)\>0D\(x\)\>0, we conclude that
\|mt\(x\)\|=\|N\(x\)D\(x\)\|≤4\|x\|\+4R\+Ct,1\.\|m\_\{t\}\(x\)\|=\\left\|\\frac\{N\(x\)\}\{D\(x\)\}\\right\|\\leq 4\|x\|\+4R\+C\_\{t,1\}\.Therefore there exists a finite constantCtC\_\{t\}such that
\|mt\(x\)\|≤Ct\(1\+\|x\|\),x∈ℝd\.\|m\_\{t\}\(x\)\|\\leq C\_\{t\}\(1\+\|x\|\),\\qquad x\\in\\mathbb\{R\}^\{d\}\.This completes the proof\.
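The explicit constant $4R+C_{t,1}$ produced by the proof can be sanity-checked numerically. The sketch below assumes a hypothetical finitely supported prior $\mu=\sum_i w_i\delta_{z_i}$ (the atoms, weights, $t$, and $R$ are arbitrary illustrative choices) and verifies $|m_{t}(x)|\leq 4|x|+4R+C_{t,1}$ at random test points.

```python
import numpy as np

def posterior_mean(x, atoms, weights, t):
    """m_t(x) = E[Z | X_t = x] for Z ~ sum_i weights[i] * delta_{atoms[i]}
    and X_t = Z + sqrt(t) * G, computed with a stabilized softmax."""
    log_w = np.log(weights) - np.sum((x - atoms) ** 2, axis=1) / (2.0 * t)
    log_w -= log_w.max()              # avoid underflow in the exponentials
    post = np.exp(log_w)
    post /= post.sum()
    return post @ atoms

def additive_constant(atoms, weights, t, R):
    """Return 4R + C_{t,1} from the proof, with a = mu(B(0, R))."""
    norms = np.linalg.norm(atoms, axis=1)
    a = weights[norms < R].sum()
    assert a > 0, "R must be large enough that the ball has positive mass"
    c_t1 = np.sum(weights * norms * np.exp(-norms**2 / (4.0 * t))) / a
    return 4.0 * R + c_t1

rng = np.random.default_rng(0)
atoms = rng.normal(size=(50, 3)) * 2.0
atoms[0] = 0.0                        # guarantees mu(B(0, R)) > 0
weights = rng.dirichlet(np.ones(50))
t, R = 0.7, 1.5
const = additive_constant(atoms, weights, t, R)

xs = rng.normal(size=(200, 3)) * 5.0
ok = all(
    np.linalg.norm(posterior_mean(x, atoms, weights, t))
    <= 4.0 * np.linalg.norm(x) + const
    for x in xs
)
print(ok)
```

For a finitely supported prior the posterior mean is trivially bounded, so this only illustrates the bookkeeping of the constants; the lemma itself covers priors with unbounded support.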
###### Lemma 14
LetT\>0T\>0, letΩ=C\(\[0,T\];ℝd\)\\Omega=C\(\[0,T\];\\mathbb\{R\}^\{d\}\)be endowed with the canonical filtration\(ℱt\)0≤t≤T\(\\mathcal\{F\}\_\{t\}\)\_\{0\\leq t\\leq T\}, and letXt\(ω\)=ω\(t\)X\_\{t\}\(\\omega\)=\\omega\(t\)be the coordinate process\.
Assume thatb,β:\[0,T\]×ℝd→ℝdb,\\beta:\[0,T\]\\times\\mathbb\{R\}^\{d\}\\to\\mathbb\{R\}^\{d\}are Borel measurable and satisfy
\|b\(t,x\)\|\+\|β\(t,x\)\|≤L\(1\+\|x\|\),\(t,x\)∈\[0,T\]×ℝd,\|b\(t,x\)\|\+\|\\beta\(t,x\)\|\\leq L\(1\+\|x\|\),\\qquad\(t,x\)\\in\[0,T\]\\times\\mathbb\{R\}^\{d\},for some constantL\>0L\>0\. Letσ∈ℝd×d\\sigma\\in\\mathbb\{R\}^\{d\\times d\}be a constant invertible matrix, and letν\\nube a probability measure onℝd\\mathbb\{R\}^\{d\}such that
∫ℝd\|x\|2ν\(dx\)<∞\.\\int\_\{\\mathbb\{R\}^\{d\}\}\|x\|^\{2\}\\,\\nu\(dx\)<\\infty\.
Suppose thatℙβ\\mathbb\{P\}^\{\\beta\}is a weak solution law of
dXt=β\(t,Xt\)dt\+σdWt,X0∼ν\.dX\_\{t\}=\\beta\(t,X\_\{t\}\)\\,dt\+\\sigma\\,dW\_\{t\},\\qquad X\_\{0\}\\sim\\nu\.Equivalently, underℙβ\\mathbb\{P\}^\{\\beta\},
$$W_{t}^{\beta}:=\sigma^{-1}\Bigl(X_{t}-X_{0}-\int_{0}^{t}\beta(s,X_{s})\,ds\Bigr)$$
is a $d$-dimensional Brownian motion.
Define
θ\(t,x\):=σ−1\(b−β\)\(t,x\),\\theta\(t,x\):=\\sigma^\{\-1\}\(b\-\\beta\)\(t,x\),and
Zt:=exp\(∫0tθ\(s,Xs\)⋅𝑑Wsβ−12∫0t\|θ\(s,Xs\)\|2𝑑s\),0≤t≤T\.Z\_\{t\}:=\\exp\\\!\\left\(\\int\_\{0\}^\{t\}\\theta\(s,X\_\{s\}\)\\cdot dW\_\{s\}^\{\\beta\}\-\\frac\{1\}\{2\}\\int\_\{0\}^\{t\}\|\\theta\(s,X\_\{s\}\)\|^\{2\}\\,ds\\right\),\\qquad 0\\leq t\\leq T\.
Then the following hold:
1. $Z=(Z_{t})_{0\leq t\leq T}$ is a true $\mathbb{P}^{\beta}$-martingale;
2. the probability measure $\mathbb{P}^{b}$ on $(\Omega,\mathcal{F}_{T})$ defined by $\frac{d\mathbb{P}^{b}}{d\mathbb{P}^{\beta}}=Z_{T}$ is a weak solution law of $dX_{t}=b(t,X_{t})\,dt+\sigma\,dW_{t},\quad X_{0}\sim\nu.$
In particular,
ℙb≪ℙβonℱT,\\mathbb\{P\}^\{b\}\\ll\\mathbb\{P\}^\{\\beta\}\\quad\\text\{on \}\\mathcal\{F\}\_\{T\},with Radon–Nikodym derivativeZTZ\_\{T\}\.
If, in addition, the martingale problem for\(b,σ,ν\)\(b,\\sigma,\\nu\)is well posed, thenℙb\\mathbb\{P\}^\{b\}is the unique weak solution law of thebb\-equation\. If both martingale problems\(β,σ,ν\)\(\\beta,\\sigma,\\nu\)and\(b,σ,ν\)\(b,\\sigma,\\nu\)are well posed, thenℙβ\\mathbb\{P\}^\{\\beta\}andℙb\\mathbb\{P\}^\{b\}are equivalent onℱT\\mathcal\{F\}\_\{T\}\.
Proof. We divide the argument into several steps.
For eachn∈ℕn\\in\\mathbb\{N\}, define the stopping time
τn:=inf\{t∈\[0,T\]:\|Xt\|≥n\}∧T,\\tau\_\{n\}:=\\inf\\\{t\\in\[0,T\]:\|X\_\{t\}\|\\geq n\\\}\\wedge T,and set
Zt\(n\):=Zt∧τn\.Z\_\{t\}^\{\(n\)\}:=Z\_\{t\\wedge\\tau\_\{n\}\}\.Since the process
θn\(t\):=θ\(t,Xt\)𝟏\{t≤τn\}\\theta\_\{n\}\(t\):=\\theta\(t,X\_\{t\}\)\\mathbf\{1\}\_\{\\\{t\\leq\\tau\_\{n\}\\\}\}is bounded, the classical bounded\-integrand version of Girsanov’s theorem implies thatZ\(n\)Z^\{\(n\)\}is a trueℙβ\\mathbb\{P\}^\{\\beta\}\-martingale\. Define a probability measureℚn\\mathbb\{Q\}\_\{n\}onℱT\\mathcal\{F\}\_\{T\}by
dℚndℙβ=ZT\(n\)\.\\frac\{d\\mathbb\{Q\}\_\{n\}\}\{d\\mathbb\{P\}^\{\\beta\}\}=Z\_\{T\}^\{\(n\)\}\.
Underℚn\\mathbb\{Q\}\_\{n\}, the process
$$W_{t}^{(n)}:=W_{t}^{\beta}-\int_{0}^{t\wedge\tau_{n}}\theta(s,X_{s})\,ds$$
is a $d$-dimensional Brownian motion. Consequently,
$$
\begin{aligned}
X_{t\wedge\tau_{n}} &= X_{0}+\int_{0}^{t\wedge\tau_{n}}\beta(s,X_{s})\,ds+\sigma W_{t\wedge\tau_{n}}^{\beta}\\
&= X_{0}+\int_{0}^{t\wedge\tau_{n}}\beta(s,X_{s})\,ds+\sigma\int_{0}^{t\wedge\tau_{n}}\theta(s,X_{s})\,ds+\sigma W_{t\wedge\tau_{n}}^{(n)}\\
&= X_{0}+\int_{0}^{t\wedge\tau_{n}}b(s,X_{s})\,ds+\sigma W_{t\wedge\tau_{n}}^{(n)}.
\end{aligned}
$$
Thus, under $\mathbb{Q}_{n}$, the stopped coordinate process solves the $b$-equation up to $\tau_{n}$. Define
$$f_{n}(t):=E^{\mathbb{Q}_{n}}\Big[\sup_{0\leq u\leq t}|X_{u\wedge\tau_{n}}|^{2}\Big],\qquad 0\leq t\leq T.$$
Write
Xt∧τn=X0\+At\(n\)\+Mt\(n\),X\_\{t\\wedge\\tau\_\{n\}\}=X\_\{0\}\+A\_\{t\}^\{\(n\)\}\+M\_\{t\}^\{\(n\)\},where
At\(n\):=∫0t∧τnb\(s,Xs\)𝑑s,Mt\(n\):=σWt∧τn\(n\)\.A\_\{t\}^\{\(n\)\}:=\\int\_\{0\}^\{t\\wedge\\tau\_\{n\}\}b\(s,X\_\{s\}\)\\,ds,\\qquad M\_\{t\}^\{\(n\)\}:=\\sigma W\_\{t\\wedge\\tau\_\{n\}\}^\{\(n\)\}\.Using\(a\+b\+c\)2≤3\(a2\+b2\+c2\)\(a\+b\+c\)^\{2\}\\leq 3\(a^\{2\}\+b^\{2\}\+c^\{2\}\), we obtain
fn\(t\)≤3Eℚn\|X0\|2\+3Eℚn\[supu≤t\|Au\(n\)\|2\]\+3Eℚn\[supu≤t\|Mu\(n\)\|2\]\.f\_\{n\}\(t\)\\leq 3E^\{\\mathbb\{Q\}\_\{n\}\}\|X\_\{0\}\|^\{2\}\+3E^\{\\mathbb\{Q\}\_\{n\}\}\\Big\[\\sup\_\{u\\leq t\}\|A\_\{u\}^\{\(n\)\}\|^\{2\}\\Big\]\+3E^\{\\mathbb\{Q\}\_\{n\}\}\\Big\[\\sup\_\{u\\leq t\}\|M\_\{u\}^\{\(n\)\}\|^\{2\}\\Big\]\.
SinceZ0\(n\)=1Z\_\{0\}^\{\(n\)\}=1, the law ofX0X\_\{0\}underℚn\\mathbb\{Q\}\_\{n\}is the same as underℙβ\\mathbb\{P\}^\{\\beta\}, namelyν\\nu\. Hence
Eℚn\|X0\|2=∫ℝd\|x\|2ν\(dx\)<∞\.E^\{\\mathbb\{Q\}\_\{n\}\}\|X\_\{0\}\|^\{2\}=\\int\_\{\\mathbb\{R\}^\{d\}\}\|x\|^\{2\}\\,\\nu\(dx\)<\\infty\.
For the drift term, the linear\-growth assumption yields
\|b\(t,x\)\|≤L\(1\+\|x\|\),\|b\(t,x\)\|\\leq L\(1\+\|x\|\),so
supu≤t\|Au\(n\)\|≤∫0t𝟏\{s≤τn\}\|b\(s,Xs\)\|𝑑s≤L∫0t\(1\+\|Xs∧τn\|\)𝑑s\.\\sup\_\{u\\leq t\}\|A\_\{u\}^\{\(n\)\}\|\\leq\\int\_\{0\}^\{t\}\\mathbf\{1\}\_\{\\\{s\\leq\\tau\_\{n\}\\\}\}\|b\(s,X\_\{s\}\)\|\\,ds\\leq L\\int\_\{0\}^\{t\}\\bigl\(1\+\|X\_\{s\\wedge\\tau\_\{n\}\}\|\\bigr\)\\,ds\.By Cauchy–Schwarz,
\(∫0t\(1\+\|Xs∧τn\|\)𝑑s\)2≤t∫0t\(1\+\|Xs∧τn\|\)2𝑑s≤2t∫0t\(1\+\|Xs∧τn\|2\)𝑑s\.\\Big\(\\int\_\{0\}^\{t\}\\bigl\(1\+\|X\_\{s\\wedge\\tau\_\{n\}\}\|\\bigr\)\\,ds\\Big\)^\{2\}\\leq t\\int\_\{0\}^\{t\}\\bigl\(1\+\|X\_\{s\\wedge\\tau\_\{n\}\}\|\\bigr\)^\{2\}\\,ds\\leq 2t\\int\_\{0\}^\{t\}\\bigl\(1\+\|X\_\{s\\wedge\\tau\_\{n\}\}\|^\{2\}\\bigr\)\\,ds\.Therefore,
Eℚn\[supu≤t\|Au\(n\)\|2\]≤2L2t∫0t\(1\+Eℚn\|Xs∧τn\|2\)𝑑s≤2L2t∫0t\(1\+fn\(s\)\)𝑑s\.E^\{\\mathbb\{Q\}\_\{n\}\}\\Big\[\\sup\_\{u\\leq t\}\|A\_\{u\}^\{\(n\)\}\|^\{2\}\\Big\]\\leq 2L^\{2\}t\\int\_\{0\}^\{t\}\\Bigl\(1\+E^\{\\mathbb\{Q\}\_\{n\}\}\|X\_\{s\\wedge\\tau\_\{n\}\}\|^\{2\}\\Bigr\)\\,ds\\leq 2L^\{2\}t\\int\_\{0\}^\{t\}\\bigl\(1\+f\_\{n\}\(s\)\\bigr\)\\,ds\.
For the martingale term,M\(n\)M^\{\(n\)\}is a continuousℚn\\mathbb\{Q\}\_\{n\}\-martingale with quadratic variation
⟨M\(n\)⟩t=∫0t∧τnσσ⊤𝑑s\.\\langle M^\{\(n\)\}\\rangle\_\{t\}=\\int\_\{0\}^\{t\\wedge\\tau\_\{n\}\}\\sigma\\sigma^\{\\top\}\\,ds\.Hence, by the Burkholder–Davis–Gundy inequality\(Karatzas and Shreve,[2014](https://arxiv.org/html/2605.05387#bib.bib14)\),
Eℚn\[supu≤t\|Mu\(n\)\|2\]≤CBDGEℚn\[tr⟨M\(n\)⟩t\]≤CBDG‖σ‖HS2t\.E^\{\\mathbb\{Q\}\_\{n\}\}\\Big\[\\sup\_\{u\\leq t\}\|M\_\{u\}^\{\(n\)\}\|^\{2\}\\Big\]\\leq C\_\{\\mathrm\{BDG\}\}\\,E^\{\\mathbb\{Q\}\_\{n\}\}\\big\[\\mathrm\{tr\}\\langle M^\{\(n\)\}\\rangle\_\{t\}\\big\]\\leq C\_\{\\mathrm\{BDG\}\}\\\|\\sigma\\\|\_\{\\mathrm\{HS\}\}^\{2\}\\,t\.Combining the above bounds, we find constantsC0,C1\>0C\_\{0\},C\_\{1\}\>0, independent ofnn, such that
fn\(t\)≤C0\+C1∫0t\(1\+fn\(s\)\)𝑑s,0≤t≤T\.f\_\{n\}\(t\)\\leq C\_\{0\}\+C\_\{1\}\\int\_\{0\}^\{t\}\\bigl\(1\+f\_\{n\}\(s\)\\bigr\)\\,ds,\\qquad 0\\leq t\\leq T\.By Gronwall’s lemma,
supn≥1sup0≤t≤Tfn\(t\)≤CT\\sup\_\{n\\geq 1\}\\sup\_\{0\\leq t\\leq T\}f\_\{n\}\(t\)\\leq C\_\{T\}for some constantCT<∞C\_\{T\}<\\inftyindependent ofnn\. In particular,
supn≥1Eℚn∫0T\|Xs∧τn\|2𝑑s≤TCT\.\\sup\_\{n\\geq 1\}E^\{\\mathbb\{Q\}\_\{n\}\}\\int\_\{0\}^\{T\}\|X\_\{s\\wedge\\tau\_\{n\}\}\|^\{2\}\\,ds\\leq TC\_\{T\}\.Underℚn\\mathbb\{Q\}\_\{n\},
dWtβ=dWt\(n\)\+𝟏\{t≤τn\}θ\(t,Xt\)dt\.dW\_\{t\}^\{\\beta\}=dW\_\{t\}^\{\(n\)\}\+\\mathbf\{1\}\_\{\\\{t\\leq\\tau\_\{n\}\\\}\}\\theta\(t,X\_\{t\}\)\\,dt\.Substituting into the definition ofZT\(n\)Z\_\{T\}^\{\(n\)\}yields
logZT\(n\)=∫0T∧τnθ\(s,Xs\)⋅𝑑Ws\(n\)\+12∫0T∧τn\|θ\(s,Xs\)\|2𝑑s\.\\log Z\_\{T\}^\{\(n\)\}=\\int\_\{0\}^\{T\\wedge\\tau\_\{n\}\}\\theta\(s,X\_\{s\}\)\\cdot dW\_\{s\}^\{\(n\)\}\+\\frac\{1\}\{2\}\\int\_\{0\}^\{T\\wedge\\tau\_\{n\}\}\|\\theta\(s,X\_\{s\}\)\|^\{2\}\\,ds\.Taking expectations underℚn\\mathbb\{Q\}\_\{n\}, the stochastic integral has mean zero, and therefore
Eℚn\[logZT\(n\)\]=12Eℚn∫0T∧τn\|θ\(s,Xs\)\|2𝑑s\.E^\{\\mathbb\{Q\}\_\{n\}\}\[\\log Z\_\{T\}^\{\(n\)\}\]=\\frac\{1\}\{2\}E^\{\\mathbb\{Q\}\_\{n\}\}\\int\_\{0\}^\{T\\wedge\\tau\_\{n\}\}\|\\theta\(s,X\_\{s\}\)\|^\{2\}\\,ds\.Sinceθ\(t,x\)=σ−1\(b−β\)\(t,x\)\\theta\(t,x\)=\\sigma^\{\-1\}\(b\-\\beta\)\(t,x\)and bothbbandβ\\betahave linear growth, there exists a constantC\>0C\>0such that
\|θ\(t,x\)\|2≤C\(1\+\|x\|2\)\.\|\\theta\(t,x\)\|^\{2\}\\leq C\(1\+\|x\|^\{2\}\)\.Hence
Eℚn\[logZT\(n\)\]≤C\(T\+Eℚn∫0T\|Xs∧τn\|2𝑑s\)≤CT′\.E^\{\\mathbb\{Q\}\_\{n\}\}\[\\log Z\_\{T\}^\{\(n\)\}\]\\leq C\\left\(T\+E^\{\\mathbb\{Q\}\_\{n\}\}\\int\_\{0\}^\{T\}\|X\_\{s\\wedge\\tau\_\{n\}\}\|^\{2\}\\,ds\\right\)\\leq C\_\{T\}^\{\\prime\}\.Moreover, by definition ofℚn\\mathbb\{Q\}\_\{n\},
Eℙβ\[ZT\(n\)logZT\(n\)\]=Eℚn\[logZT\(n\)\]\.E^\{\\mathbb\{P\}^\{\\beta\}\}\\big\[Z\_\{T\}^\{\(n\)\}\\log Z\_\{T\}^\{\(n\)\}\\big\]=E^\{\\mathbb\{Q\}\_\{n\}\}\[\\log Z\_\{T\}^\{\(n\)\}\]\.Thus,
$$\sup_{n\geq 1}E^{\mathbb{P}^{\beta}}\big[Z_{T}^{(n)}\log Z_{T}^{(n)}\big]<\infty.$$
Since the function $x\mapsto x\log x$ is convex on $[0,\infty)$ and grows superlinearly, that is, $(x\log x)/x\to\infty$ as $x\to\infty$, de la Vallée-Poussin's criterion implies that the family $\{Z_{T}^{(n)}\}_{n\geq 1}$ is uniformly integrable (Durrett, [2019](https://arxiv.org/html/2605.05387#bib.bib8)).
Now $\tau_{n}\uparrow T$ $\mathbb{P}^{\beta}$-almost surely (indeed $\tau_{n}=T$ for all large $n$, because continuous paths are bounded on $[0,T]$), and $Z$ is continuous, hence
$$Z_{T}^{(n)}\to Z_{T}\qquad\mathbb{P}^{\beta}\text{-a.s.}$$
Uniform integrability therefore implies
$$E^{\mathbb{P}^{\beta}}[Z_{T}]=\lim_{n\to\infty}E^{\mathbb{P}^{\beta}}[Z_{T}^{(n)}]=1.$$
Since $Z$ is a nonnegative local martingale, hence a supermartingale, with $E^{\mathbb{P}^{\beta}}[Z_{T}]=1=Z_{0}$, it follows that $Z$ is a true $\mathbb{P}^{\beta}$-martingale. Now define a probability measure $\mathbb{P}^{b}$ on $\mathcal{F}_{T}$ by
dℙbdℙβ=ZT\.\\frac\{d\\mathbb\{P\}^\{b\}\}\{d\\mathbb\{P\}^\{\\beta\}\}=Z\_\{T\}\.SinceZZis a true martingale, the classical Girsanov theorem applies and yields that
Wtb:=Wtβ−∫0tθ\(s,Xs\)𝑑sW\_\{t\}^\{b\}:=W\_\{t\}^\{\\beta\}\-\\int\_\{0\}^\{t\}\\theta\(s,X\_\{s\}\)\\,dsis a Brownian motion underℙb\\mathbb\{P\}^\{b\}\. Therefore,
$$
\begin{aligned}
X_{t} &= X_{0}+\int_{0}^{t}\beta(s,X_{s})\,ds+\sigma W_{t}^{\beta}\\
&= X_{0}+\int_{0}^{t}\beta(s,X_{s})\,ds+\sigma\int_{0}^{t}\theta(s,X_{s})\,ds+\sigma W_{t}^{b}\\
&= X_{0}+\int_{0}^{t}b(s,X_{s})\,ds+\sigma W_{t}^{b}.
\end{aligned}
$$
Thus $X$ solves
dXt=b\(t,Xt\)dt\+σdWtbdX\_\{t\}=b\(t,X\_\{t\}\)\\,dt\+\\sigma\\,dW\_\{t\}^\{b\}underℙb\\mathbb\{P\}^\{b\}\.
It remains to identify the initial law\. For every Borel setA⊂ℝdA\\subset\\mathbb\{R\}^\{d\},
ℙb\(X0∈A\)=Eℙβ\[𝟏\{X0∈A\}ZT\]\.\\mathbb\{P\}^\{b\}\(X\_\{0\}\\in A\)=E^\{\\mathbb\{P\}^\{\\beta\}\}\\big\[\\mathbf\{1\}\_\{\\\{X\_\{0\}\\in A\\\}\}Z\_\{T\}\\big\]\.Since𝟏\{X0∈A\}∈ℱ0\\mathbf\{1\}\_\{\\\{X\_\{0\}\\in A\\\}\}\\in\\mathcal\{F\}\_\{0\}andZZis a martingale withZ0=1Z\_\{0\}=1,
Eℙβ\[𝟏\{X0∈A\}ZT\]=Eℙβ\[𝟏\{X0∈A\}Eℙβ\[ZT∣ℱ0\]\]=Eℙβ\[𝟏\{X0∈A\}\]=ν\(A\)\.E^\{\\mathbb\{P\}^\{\\beta\}\}\\big\[\\mathbf\{1\}\_\{\\\{X\_\{0\}\\in A\\\}\}Z\_\{T\}\\big\]=E^\{\\mathbb\{P\}^\{\\beta\}\}\\Big\[\\mathbf\{1\}\_\{\\\{X\_\{0\}\\in A\\\}\}E^\{\\mathbb\{P\}^\{\\beta\}\}\[Z\_\{T\}\\mid\\mathcal\{F\}\_\{0\}\]\\Big\]=E^\{\\mathbb\{P\}^\{\\beta\}\}\\big\[\\mathbf\{1\}\_\{\\\{X\_\{0\}\\in A\\\}\}\\big\]=\\nu\(A\)\.HenceX0∼νX\_\{0\}\\sim\\nuunderℙb\\mathbb\{P\}^\{b\}as well\. This proves thatℙb\\mathbb\{P\}^\{b\}is a weak solution law of
dXt=b\(t,Xt\)dt\+σdWt,X0∼ν\.dX\_\{t\}=b\(t,X\_\{t\}\)\\,dt\+\\sigma\\,dW\_\{t\},\\qquad X\_\{0\}\\sim\\nu\.
If the martingale problem for\(b,σ,ν\)\(b,\\sigma,\\nu\)is well posed, then the probability measureℙb\\mathbb\{P\}^\{b\}constructed above coincides with the unique weak solution law of thebb\-equation\. If both martingale problems\(β,σ,ν\)\(\\beta,\\sigma,\\nu\)and\(b,σ,ν\)\(b,\\sigma,\\nu\)are well posed, then the same argument withbbandβ\\betainterchanged yields
ℙβ≪ℙb,\\mathbb\{P\}^\{\\beta\}\\ll\\mathbb\{P\}^\{b\},and therefore
ℙβ∼ℙb\.\\mathbb\{P\}^\{\\beta\}\\sim\\mathbb\{P\}^\{b\}\.
This completes the proof\.
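The change of measure in Lemma [14](https://arxiv.org/html/2605.05387#Thmtheorem14) can be illustrated with a small Monte Carlo experiment. The sketch below makes several assumptions that are not part of the paper: dimension one, $\sigma=1$, $X_{0}=0$, reference drift $\beta=0$, target drift $b(t,x)=-x$, and an Euler discretization with a fixed seed. It checks that the simulated density $Z_{T}$ has mean close to one, and that reweighting Brownian paths by $Z_{T}$ reproduces the second moment of the Ornstein-Uhlenbeck process $dX_{t}=-X_{t}\,dt+dW_{t}$.

```python
import numpy as np

def simulate_density(n_paths=20000, n_steps=200, T=0.5, seed=0):
    """Simulate dX = dW under P^beta (beta = 0, sigma = 1, X_0 = 0) together
    with the stochastic exponential Z_T for the target drift b(t, x) = -x."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x = np.zeros(n_paths)
    log_z = np.zeros(n_paths)
    for _ in range(n_steps):
        dw = rng.normal(scale=np.sqrt(dt), size=n_paths)
        theta = -x                         # theta = sigma^{-1} (b - beta)
        log_z += theta * dw - 0.5 * theta**2 * dt
        x += dw                            # reference dynamics under P^beta
    return x, np.exp(log_z)

T = 0.5
x_T, z_T = simulate_density(T=T)
mean_density = z_T.mean()                  # Monte Carlo estimate of E[Z_T]
reweighted_second_moment = np.mean(z_T * x_T**2)
ou_second_moment = (1.0 - np.exp(-2.0 * T)) / 2.0   # exact OU variance at T
print(mean_density, reweighted_second_moment, ou_second_moment)
```

Both comparisons are subject to Monte Carlo and discretization error, so they should agree only approximately.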
###### Lemma 15
LetUUbe square\-integrable and let𝒢⊆ℋ\\mathcal\{G\}\\subseteq\\mathcal\{H\}beσ\\sigma\-fields\. Then
𝔼\[∥𝔼\[U∣ℋ\]−𝔼\[U∣𝒢\]∥22\]=mmse\(U∣𝒢\)−mmse\(U∣ℋ\),\\mathbb\{E\}\\bigl\[\\\|\\mathbb\{E\}\[U\\mid\\mathcal\{H\}\]\-\\mathbb\{E\}\[U\\mid\\mathcal\{G\}\]\\\|\_\{2\}^\{2\}\\bigr\]=\\operatorname\{mmse\}\(U\\mid\\mathcal\{G\}\)\-\\operatorname\{mmse\}\(U\\mid\\mathcal\{H\}\),wheremmse\(U∣𝒢\):=𝔼∥U−𝔼\[U∣𝒢\]∥22\\operatorname\{mmse\}\(U\\mid\\mathcal\{G\}\):=\\mathbb\{E\}\\\|U\-\\mathbb\{E\}\[U\\mid\\mathcal\{G\}\]\\\|\_\{2\}^\{2\}\.
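Proof. Since $\mathcal{G}\subseteq\mathcal{H}$, the difference $\mathbb{E}[U\mid\mathcal{H}]-\mathbb{E}[U\mid\mathcal{G}]$ is $\mathcal{H}$-measurable and therefore orthogonal in $L^{2}$ to the residual $U-\mathbb{E}[U\mid\mathcal{H}]$. Expanding $U-\mathbb{E}[U\mid\mathcal{G}]=(U-\mathbb{E}[U\mid\mathcal{H}])+(\mathbb{E}[U\mid\mathcal{H}]-\mathbb{E}[U\mid\mathcal{G}])$ gives the identity, which is the Pythagorean theorem for orthogonal projections in $L^{2}$ (conditional expectation is the orthogonal projection onto the subspace of $\mathcal{G}$-measurable functions).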
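The MMSE-gap identity of Lemma [15](https://arxiv.org/html/2605.05387#Thmtheorem15) can be verified exactly on a finite probability space. In the sketch below, the sample space, the probabilities, the values of $U$, and the two nested partitions generating $\mathcal{G}\subseteq\mathcal{H}$ are all hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 12                                   # finite sample space {0, ..., n-1}
prob = rng.dirichlet(np.ones(n))         # probability of each outcome
U = rng.normal(size=(n, 2))              # an R^2-valued random variable
h_label = np.arange(n) // 2              # H: partition into 6 cells
g_label = h_label // 3                   # G: coarser partition into 2 cells

def cond_exp(U, prob, labels):
    """E[U | sigma(labels)], returned as an array indexed by outcome."""
    out = np.empty_like(U)
    for lab in np.unique(labels):
        cell = labels == lab
        out[cell] = (prob[cell, None] * U[cell]).sum(axis=0) / prob[cell].sum()
    return out

def mmse(U, prob, labels):
    """E || U - E[U | sigma(labels)] ||^2."""
    resid = U - cond_exp(U, prob, labels)
    return np.sum(prob * np.sum(resid**2, axis=1))

diff = cond_exp(U, prob, h_label) - cond_exp(U, prob, g_label)
lhs = np.sum(prob * np.sum(diff**2, axis=1))
rhs = mmse(U, prob, g_label) - mmse(U, prob, h_label)
print(lhs, rhs)
```

Because every cell of the finer partition sits inside one cell of the coarser one, the two printed numbers agree up to floating-point rounding.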
###### Lemma 16
LetX∈ℝdX\\in\\mathbb\{R\}^\{d\}be a random vector with𝔼‖X‖22<∞\\mathbb\{E\}\\\|X\\\|\_\{2\}^\{2\}<\\infty, letSSbe an arbitrary random element, and letN∼𝒩\(0,Id\)N\\sim\\mathcal\{N\}\(0,I\_\{d\}\)be independent of\(X,S\)\(X,S\)\. Forγ\>0\\gamma\>0define the Gaussian observation channel
Yγ:=γX\+N\.Y\_\{\\gamma\}\\;:=\\;\\sqrt\{\\gamma\}\\,X\+N\.ThenI\(X;Yγ∣S\)I\(X;Y\_\{\\gamma\}\\mid S\)is differentiable inγ\\gammaand
ddγI\(X;Yγ∣S\)=12mmse\(X∣Yγ,S\),\\frac\{\\mathrm\{d\}\}\{\\mathrm\{d\}\\gamma\}\\,I\(X;Y\_\{\\gamma\}\\mid S\)\\;=\\;\\frac\{1\}\{2\}\\,\\operatorname\{mmse\}\(X\\mid Y\_\{\\gamma\},S\),where
mmse\(X∣Yγ,S\):=𝔼\[∥X−𝔼\[X∣Yγ,S\]∥22\]\.\\operatorname\{mmse\}\(X\\mid Y\_\{\\gamma\},S\)\\;:=\\;\\mathbb\{E\}\\\!\\left\[\\big\\\|X\-\\mathbb\{E\}\[X\\mid Y\_\{\\gamma\},S\]\\big\\\|\_\{2\}^\{2\}\\right\]\.
Proof. Fix $\gamma>0$. By disintegration,
I\(X;Yγ∣S\)=∫I\(X;Yγ∣S=s\)PS\(ds\)\.I\(X;Y\_\{\\gamma\}\\mid S\)=\\int I\(X;Y\_\{\\gamma\}\\mid S=s\)\\,P\_\{S\}\(\\mathrm\{d\}s\)\.\(A\.1\)For eachss, conditional onS=sS=sthe channel remains AWGN:Yγ=γX\+NY\_\{\\gamma\}=\\sqrt\{\\gamma\}\\,X\+NwithN⟂⟂XN\\perp\\\!\\\!\\\!\\perp XunderLaw\(⋅∣S=s\)\\operatorname\{Law\}\(\\cdot\\mid S=s\)\. Since𝔼\[‖X‖22∣S=s\]<∞\\mathbb\{E\}\[\\\|X\\\|\_\{2\}^\{2\}\\mid S=s\]<\\inftyforPSP\_\{S\}\-a\.e\.ss, the \(vector\) I–MMSE identity of\(Guo et al\.,[2005](https://arxiv.org/html/2605.05387#bib.bib10)\)applied to the conditional input lawX∣S=sX\\mid S=syields
ddγI\(X;Yγ∣S=s\)=12mmse\(X∣Yγ,S=s\)\.\\frac\{\\mathrm\{d\}\}\{\\mathrm\{d\}\\gamma\}I\(X;Y\_\{\\gamma\}\\mid S=s\)=\\frac\{1\}\{2\}\\,\\operatorname\{mmse\}\(X\\mid Y\_\{\\gamma\},S=s\)\.\(A\.2\)Moreover, for everyss,
0≤mmse\(X∣Yγ,S=s\)≤𝔼\[‖X‖22∣S=s\],0\\leq\\operatorname\{mmse\}\(X\\mid Y\_\{\\gamma\},S=s\)\\leq\\mathbb\{E\}\[\\\|X\\\|\_\{2\}^\{2\}\\mid S=s\],because the MMSE is the minimum mean\-squared error and is upper bounded by the MSE of the zero estimator\. Since𝔼‖X‖22=∫𝔼\[‖X‖22∣S=s\]PS\(ds\)<∞\\mathbb\{E\}\\\|X\\\|\_\{2\}^\{2\}=\\int\\mathbb\{E\}\[\\\|X\\\|\_\{2\}^\{2\}\\mid S=s\]\\,P\_\{S\}\(\\mathrm\{d\}s\)<\\infty, dominated convergence \(Leibniz rule\) allows differentiating under the integral in \([A\.1](https://arxiv.org/html/2605.05387#A1.E1)\), giving
ddγI\(X;Yγ∣S\)=12∫mmse\(X∣Yγ,S=s\)PS\(ds\)\.\\frac\{\\mathrm\{d\}\}\{\\mathrm\{d\}\\gamma\}I\(X;Y\_\{\\gamma\}\\mid S\)=\\frac\{1\}\{2\}\\int\\operatorname\{mmse\}\(X\\mid Y\_\{\\gamma\},S=s\)\\,P\_\{S\}\(\\mathrm\{d\}s\)\.Finally, by the law of total expectation and the definition of conditional MMSE,
∫mmse\(X∣Yγ,S=s\)PS\(ds\)=𝔼\[∥X−𝔼\[X∣Yγ,S\]∥22\]=mmse\(X∣Yγ,S\),\\int\\operatorname\{mmse\}\(X\\mid Y\_\{\\gamma\},S=s\)\\,P\_\{S\}\(\\mathrm\{d\}s\)=\\mathbb\{E\}\\\!\\left\[\\big\\\|X\-\\mathbb\{E\}\[X\\mid Y\_\{\\gamma\},S\]\\big\\\|\_\{2\}^\{2\}\\right\]=\\operatorname\{mmse\}\(X\\mid Y\_\{\\gamma\},S\),which proves the claim\.
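A case where both sides of the conditional I–MMSE identity are available in closed form is a scalar input that is Gaussian given the side information. The sketch below assumes a hypothetical three-valued $S$ with $X\mid S=s\sim\mathcal{N}(0,v_{s})$, for which $I(X;Y_{\gamma}\mid S)=\sum_{s}p_{s}\,\tfrac12\log(1+\gamma v_{s})$ and $\operatorname{mmse}(X\mid Y_{\gamma},S)=\sum_{s}p_{s}\,v_{s}/(1+\gamma v_{s})$, and compares a finite-difference derivative of the former with half of the latter.

```python
import numpy as np

p_s = np.array([0.2, 0.5, 0.3])          # law of a discrete side variable S
var_s = np.array([0.5, 1.0, 4.0])        # X | S = s ~ N(0, var_s[s]), scalar

def cond_mutual_info(gamma):
    """I(X; Y_gamma | S) in nats for Y_gamma = sqrt(gamma) * X + N."""
    return np.sum(p_s * 0.5 * np.log1p(gamma * var_s))

def cond_mmse(gamma):
    """mmse(X | Y_gamma, S) for the same conditionally Gaussian input."""
    return np.sum(p_s * var_s / (1.0 + gamma * var_s))

gamma, h = 1.3, 1e-5
finite_diff = (cond_mutual_info(gamma + h) - cond_mutual_info(gamma - h)) / (2 * h)
print(finite_diff, 0.5 * cond_mmse(gamma))
```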
Proof of Theorem [4](https://arxiv.org/html/2605.05387#Thmtheorem4). Fix $b$. Let $Y^{*,b}$ and $\hat{Y}^{\,b}$ solve the two reverse-time SDEs on $[\tau^{*},T-t_{0}]$ in the theorem, started from the same law at time $\tau^{*}$ and with the same unit diffusion coefficient. Denote their drifts by $f_{\tau}^{*,b}$ and $\hat{f}_{\tau}^{\,b}$, respectively.
We first explain why these drifts have at most linear growth and why Lemma[14](https://arxiv.org/html/2605.05387#Thmtheorem14)applies\. By Tweedie’s formula, fort=T−τt=T\-\\tau,
𝔼\[Z∣Xt=x\]=x\+tst\(x\),𝔼\[Z∣Xt=x,B=b\]=x\+tst∗,b\(x\)\.\\mathbb\{E\}\[Z\\mid X\_\{t\}=x\]=x\+t\\,s\_\{t\}\(x\),\\qquad\\mathbb\{E\}\[Z\\mid X\_\{t\}=x,B=b\]=x\+t\\,s\_\{t\}^\{\*,b\}\(x\)\.Hence
fτ∗,b\(x\)=P∥st∗,b\(x\)\+1tP⟂\(b−x\)=1t\(𝔼\[Z∣Xt=x,B=b\]−x\),f\_\{\\tau\}^\{\*,b\}\(x\)=P\_\{\\parallel\}s\_\{t\}^\{\*,b\}\(x\)\+\\frac\{1\}\{t\}P\_\{\\perp\}\(b\-x\)=\\frac\{1\}\{t\}\\bigl\(\\mathbb\{E\}\[Z\\mid X\_\{t\}=x,B=b\]\-x\\bigr\),because under the conditioningB=bB=bwe haveP⟂𝔼\[Z∣Xt=x,B=b\]=bP\_\{\\perp\}\\mathbb\{E\}\[Z\\mid X\_\{t\}=x,B=b\]=b\. Likewise,
f^τb\(x\)=P∥st\(x\)\+1tP⟂\(b−x\)=1t\(P∥𝔼\[Z∣Xt=x\]−P∥x\+P⟂\(b−x\)\)\.\\hat\{f\}\_\{\\tau\}^\{\\,b\}\(x\)=P\_\{\\parallel\}s\_\{t\}\(x\)\+\\frac\{1\}\{t\}P\_\{\\perp\}\(b\-x\)=\\frac\{1\}\{t\}\\bigl\(P\_\{\\parallel\}\\mathbb\{E\}\[Z\\mid X\_\{t\}=x\]\-P\_\{\\parallel\}x\+P\_\{\\perp\}\(b\-x\)\\bigr\)\.By Lemma[13](https://arxiv.org/html/2605.05387#Thmtheorem13), the mapsx↦𝔼\[Z∣Xt=x\]x\\mapsto\\mathbb\{E\}\[Z\\mid X\_\{t\}=x\]andx↦𝔼\[Z∣Xt=x,B=b\]x\\mapsto\\mathbb\{E\}\[Z\\mid X\_\{t\}=x,B=b\]have at most linear growth\. Sincet=T−τ∈\[t0,t∗\]t=T\-\\tau\\in\[t\_\{0\},t^\{\*\}\]on\[τ∗,T−t0\]\[\\tau^\{\*\},T\-t\_\{0\}\], the factor1/t1/tis uniformly bounded by1/t01/t\_\{0\}\. Therefore bothfτ∗,bf\_\{\\tau\}^\{\*,b\}andf^τb\\hat\{f\}\_\{\\tau\}^\{\\,b\}satisfy a linear\-growth bound of the form
\|fτ∗,b\(x\)\|\+\|f^τb\(x\)\|≤Cb,t0\(1\+\|x\|\),τ∈\[τ∗,T−t0\]\.\|f\_\{\\tau\}^\{\*,b\}\(x\)\|\+\|\\hat\{f\}\_\{\\tau\}^\{\\,b\}\(x\)\|\\leq C\_\{b,t\_\{0\}\}\(1\+\|x\|\),\\qquad\\tau\\in\[\\tau^\{\*\},T\-t\_\{0\}\]\.
NowY∗,bY^\{\*,b\}is already given as a weak solution law for the driftf∗,bf^\{\*,b\}, namely the ideal conditional reverse process\. We therefore apply Lemma[14](https://arxiv.org/html/2605.05387#Thmtheorem14)with
β\(τ,x\)=fτ∗,b\(x\),b\(τ,x\)=f^τb\(x\),σ=Id,\\beta\(\\tau,x\)=f\_\{\\tau\}^\{\*,b\}\(x\),\\qquad b\(\\tau,x\)=\\hat\{f\}\_\{\\tau\}^\{\\,b\}\(x\),\\qquad\\sigma=I\_\{d\},and with initial law equal to the common law ofYτ∗∗,bY^\{\*,b\}\_\{\\tau^\{\*\}\}andY^τ∗b\\hat\{Y\}^\{\\,b\}\_\{\\tau^\{\*\}\}\. The lemma yields existence of the surrogate weak solution law and the relative\-entropy identity
$$
\begin{aligned}
\mathrm{KL}\big(\mathbb{P}^{Y^{*,b}}\,\big\|\,\mathbb{P}^{\hat{Y}^{\,b}}\big)
&=\frac{1}{2}\,\mathbb{E}^{Y^{*,b}}\left[\int_{\tau^{*}}^{T-t_{0}}\big\|f_{\tau}^{*,b}(Y_{\tau}^{*,b})-\hat{f}_{\tau}^{\,b}(Y_{\tau}^{*,b})\big\|_{2}^{2}\,\mathrm{d}\tau\right]\\
&\leq\frac{1}{2}\,\mathbb{E}^{Y^{*,b}}\left[\int_{\tau^{*}}^{T}\big\|f_{\tau}^{*,b}(Y_{\tau}^{*,b})-\hat{f}_{\tau}^{\,b}(Y_{\tau}^{*,b})\big\|_{2}^{2}\,\mathrm{d}\tau\right].
\end{aligned}
\tag{A.3}
$$
Since $Y^{*,b}$ is the true reverse-time process of the conditional forward diffusion $\{X_{t}\}_{t\in[0,T]}$ under $B=b$, it is defined up to terminal reverse time $T$. Hence we may extend the integral from $[\tau^{*},T-t_{0}]$ to $[\tau^{*},T)$. By inspection, the two SDEs have the same normal drift, hence for all $x\in\mathbb{R}^{d}$,
fτ∗,b\(x\)−f^τb\(x\)=P∥\(st∗,b\(x\)−st\(x\)\)\.f\_\{\\tau\}^\{\*,b\}\(x\)\-\\hat\{f\}\_\{\\tau\}^\{\\,b\}\(x\)=P\_\{\\parallel\}\\Big\(s\_\{t\}^\{\*,b\}\(x\)\-s\_\{t\}\(x\)\\Big\)\.\(A\.4\)
Now letZ∼p0Z\\sim p\_\{0\}denote the clean signal, and decompose it as
U:=P∥Z,B:=P⟂Z\.U:=P\_\{\\parallel\}Z,\\qquad B:=P\_\{\\perp\}Z\.
Tweedie’s formula yields, for everyx∈ℝdx\\in\\mathbb\{R\}^\{d\},
𝔼\[Z∣Xt=x\]=x\+tst\(x\),𝔼\[Z∣Xt=x,B=b\]=x\+tst∗,b\(x\)\.\\mathbb\{E\}\[Z\\mid X\_\{t\}=x\]=x\+t\\,s\_\{t\}\(x\),\\qquad\\mathbb\{E\}\[Z\\mid X\_\{t\}=x,\\,B=b\]=x\+t\\,s\_\{t\}^\{\*,b\}\(x\)\.Projecting ontoker\(A\)\\ker\(A\)and subtracting, we obtain
P∥\(st∗,b−st\)\(x\)=1t\(𝔼\[U∣Xt=x,B=b\]−𝔼\[U∣Xt=x\]\)\.P\_\{\\parallel\}\\\!\\big\(s\_\{t\}^\{\*,b\}\-s\_\{t\}\\big\)\(x\)=\\frac\{1\}\{t\}\\Big\(\\mathbb\{E\}\[U\\mid X\_\{t\}=x,\\,B=b\]\-\\mathbb\{E\}\[U\\mid X\_\{t\}=x\]\\Big\)\.\(A\.5\)Settingt=T−τt=T\-\\tauand combining \([A\.4](https://arxiv.org/html/2605.05387#A1.E4)\) with \([A\.5](https://arxiv.org/html/2605.05387#A1.E5)\), we get
fτ∗,b\(x\)−f^τb\(x\)=1T−τ\(𝔼\[U∣XT−τ=x,B=b\]−𝔼\[U∣XT−τ=x\]\)\.f\_\{\\tau\}^\{\*,b\}\(x\)\-\\hat\{f\}\_\{\\tau\}^\{\\,b\}\(x\)=\\frac\{1\}\{T\-\\tau\}\\Big\(\\mathbb\{E\}\[U\\mid X\_\{T\-\\tau\}=x,\\,B=b\]\-\\mathbb\{E\}\[U\\mid X\_\{T\-\\tau\}=x\]\\Big\)\.
Plugging this identity into \([A](https://arxiv.org/html/2605.05387#A1.Ex91)\), averaging over the random levelBB, and applying Lemma[15](https://arxiv.org/html/2605.05387#Thmtheorem15), yields
𝔼B\[KL\(ℙY∗,B∥ℙY^B\)\]≤12∫τ∗T1\(T−τ\)2\(mmse\(U∣XT−τ\)−mmse\(U∣XT−τ,B\)\)dτ\.\\mathbb\{E\}\_\{B\}\\\!\\Big\[\\mathrm\{KL\}\\\!\\big\(\\mathbb\{P\}^\{Y^\{\*,B\}\}\\big\\\|\\mathbb\{P\}^\{\\hat\{Y\}^\{\\,B\}\}\\big\)\\Big\]\\leq\\frac\{1\}\{2\}\\int\_\{\\tau^\{\*\}\}^\{T\}\\frac\{1\}\{\(T\-\\tau\)^\{2\}\}\\Big\(\\operatorname\{mmse\}\(U\\mid X\_\{T\-\\tau\}\)\-\\operatorname\{mmse\}\(U\\mid X\_\{T\-\\tau\},B\)\\Big\)\\,\\mathrm\{d\}\\tau\.\(A\.6\)
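Tweedie's formula, used above to turn score differences into differences of posterior means, can be sanity-checked numerically in a simple case. The sketch below assumes a scalar Gaussian prior $Z\sim\mathcal{N}(0,\text{prior\_var})$ and the noising $X_t=Z+\sqrt{t}\,\Xi$; the function names and the quadrature grid are illustrative choices, not part of the paper.

```python
import math


def gaussian_pdf(x, var):
    """Density of N(0, var) evaluated at x."""
    return math.exp(-x * x / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def posterior_mean_quadrature(x, t, prior_var, half_width=12.0, n=4001):
    """E[Z | X_t = x] by a Riemann sum, for Z ~ N(0, prior_var), X_t = Z + sqrt(t) * noise."""
    lo = -half_width * math.sqrt(prior_var)
    h = -2.0 * lo / (n - 1)
    num = den = 0.0
    for i in range(n):
        z = lo + i * h
        # Unnormalized posterior weight: prior density times Gaussian likelihood.
        w = gaussian_pdf(z, prior_var) * gaussian_pdf(x - z, t)
        num += z * w
        den += w
    return num / den


def tweedie_mean(x, t, prior_var):
    """x + t * s_t(x), where s_t is the score of the N(0, prior_var + t) marginal."""
    score = -x / (prior_var + t)
    return x + t * score
```

In this Gaussian case both functions should agree up to quadrature error, which is the one-dimensional instance of the identity $\mathbb{E}[Z\mid X_t=x]=x+t\,s_t(x)$.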
Now define
γ:=1t=1T−τ,X~γ:=γXt=γZ\+Ξ,Ξ∼𝒩\(0,Id\)independent ofZ\.\\gamma:=\\frac\{1\}\{t\}=\\frac\{1\}\{T\-\\tau\},\\qquad\\tilde\{X\}\_\{\\gamma\}:=\\sqrt\{\\gamma\}\\,X\_\{t\}=\\sqrt\{\\gamma\}\\,Z\+\\Xi,\\qquad\\Xi\\sim\\mathcal\{N\}\(0,I\_\{d\}\)\\ \\text\{independent of \}Z\.SinceXt↦X~γX\_\{t\}\\mapsto\\tilde\{X\}\_\{\\gamma\}is an invertible scaling, conditioning onXtX\_\{t\}is equivalent to conditioning onX~γ\\tilde\{X\}\_\{\\gamma\}\. Moreover,
dγ=dτ\(T−τ\)2\.\\mathrm\{d\}\\gamma=\\frac\{\\mathrm\{d\}\\tau\}\{\(T\-\\tau\)^\{2\}\}\.Therefore \([A\.6](https://arxiv.org/html/2605.05387#A1.E6)\) becomes
𝔼B\[KL\(ℙY∗,B∥ℙY^B\)\]≤12∫γ∗∞\(mmse\(U∣X~γ\)−mmse\(U∣X~γ,B\)\)dγ,\\mathbb\{E\}\_\{B\}\\\!\\Big\[\\mathrm\{KL\}\\\!\\big\(\\mathbb\{P\}^\{Y^\{\*,B\}\}\\big\\\|\\mathbb\{P\}^\{\\hat\{Y\}^\{\\,B\}\}\\big\)\\Big\]\\leq\\frac\{1\}\{2\}\\int\_\{\\gamma^\{\*\}\}^\{\\infty\}\\Big\(\\operatorname\{mmse\}\(U\\mid\\tilde\{X\}\_\{\\gamma\}\)\-\\operatorname\{mmse\}\(U\\mid\\tilde\{X\}\_\{\\gamma\},B\)\\Big\)\\,\\mathrm\{d\}\\gamma,\(A\.7\)whereγ∗:=1/\(T−τ∗\)\\gamma^\{\*\}:=1/\(T\-\\tau^\{\*\}\)\.
Define
Φ\(γ\):=I\(U;X~γ\)−I\(U;X~γ∣B\)\.\\Phi\(\\gamma\):=I\(U;\\tilde\{X\}\_\{\\gamma\}\)\-I\(U;\\tilde\{X\}\_\{\\gamma\}\\mid B\)\.
Conditioning onBBturnsX~γ=γ\(U\+B\)\+Ξ\\tilde\{X\}\_\{\\gamma\}=\\sqrt\{\\gamma\}\(U\+B\)\+\\Xiinto an AWGN channel inUUwith a known \(measurable\) shift, so Lemma[16](https://arxiv.org/html/2605.05387#Thmtheorem16)\(withX=UX=U,S=BS=BandYγ=X~γY\_\{\\gamma\}=\\tilde\{X\}\_\{\\gamma\}\) yields
ddγI\(U;X~γ∣B\)=12mmse\(U∣X~γ,B\)\.\\frac\{\\mathrm\{d\}\}\{\\mathrm\{d\}\\gamma\}I\(U;\\tilde\{X\}\_\{\\gamma\}\\mid B\)=\\frac\{1\}\{2\}\\,\\operatorname\{mmse\}\(U\\mid\\tilde\{X\}\_\{\\gamma\},B\)\.\(A\.8\)
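The I-MMSE relation behind this derivative can be checked in closed form in the simplest setting. The sketch below assumes a scalar Gaussian input of variance `var` with no side information, which is an illustrative special case rather than the conditional setting of the lemma; there, $I(\gamma)=\tfrac12\log(1+\gamma\,\mathrm{var})$ in nats and $\operatorname{mmse}(\gamma)=\mathrm{var}/(1+\gamma\,\mathrm{var})$.

```python
import math


def mutual_information(gamma, var):
    """I(U; sqrt(gamma) U + noise) in nats for U ~ N(0, var), unit-variance noise."""
    return 0.5 * math.log1p(gamma * var)


def mmse(gamma, var):
    """Minimum mean-square error of U given sqrt(gamma) U + noise."""
    return var / (1.0 + gamma * var)


def derivative_gap(gamma, var, h=1e-5):
    """|dI/dgamma - mmse/2|, with the derivative taken by central differences."""
    finite_diff = (mutual_information(gamma + h, var)
                   - mutual_information(gamma - h, var)) / (2.0 * h)
    return abs(finite_diff - 0.5 * mmse(gamma, var))
```

A gap near zero confirms the factor $\tfrac12$ and the signal-to-noise parametrization $\gamma=1/t$ used in the proof.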
Next, use the chain rule
$$I(Z;\tilde{X}_{\gamma})=I(U;\tilde{X}_{\gamma})+I(B;\tilde{X}_{\gamma}\mid U).\tag{A.9}$$
Since $\tilde{X}_{\gamma}=\sqrt{\gamma}\,Z+\Xi$ is an AWGN channel in $Z$, the I-MMSE identity gives
ddγI\(Z;X~γ\)=12mmse\(Z∣X~γ\)\.\\frac\{\\mathrm\{d\}\}\{\\mathrm\{d\}\\gamma\}I\(Z;\\tilde\{X\}\_\{\\gamma\}\)=\\frac\{1\}\{2\}\\,\\operatorname\{mmse\}\(Z\\mid\\tilde\{X\}\_\{\\gamma\}\)\.Also, givenUU, the observationX~γ\\tilde\{X\}\_\{\\gamma\}is an AWGN channel inBBwith a known shift, so Lemma[16](https://arxiv.org/html/2605.05387#Thmtheorem16)\(withX=BX=B,S=US=U\) yields
ddγI\(B;X~γ∣U\)=12mmse\(B∣X~γ,U\)\.\\frac\{\\mathrm\{d\}\}\{\\mathrm\{d\}\\gamma\}I\(B;\\tilde\{X\}\_\{\\gamma\}\\mid U\)=\\frac\{1\}\{2\}\\,\\operatorname\{mmse\}\(B\\mid\\tilde\{X\}\_\{\\gamma\},U\)\.Differentiating \([A\.9](https://arxiv.org/html/2605.05387#A1.E9)\) and subtracting the last display from the derivative ofI\(Z;X~γ\)I\(Z;\\tilde\{X\}\_\{\\gamma\}\)gives
ddγI\(U;X~γ\)=12\(mmse\(Z∣X~γ\)−mmse\(B∣X~γ,U\)\)\.\\frac\{\\mathrm\{d\}\}\{\\mathrm\{d\}\\gamma\}I\(U;\\tilde\{X\}\_\{\\gamma\}\)=\\frac\{1\}\{2\}\\Big\(\\operatorname\{mmse\}\(Z\\mid\\tilde\{X\}\_\{\\gamma\}\)\-\\operatorname\{mmse\}\(B\\mid\\tilde\{X\}\_\{\\gamma\},U\)\\Big\)\.\(A\.10\)BecauseUUandBBlive in orthogonal subspaces andZ=U\+BZ=U\+B,
mmse\(Z∣X~γ\)=mmse\(U∣X~γ\)\+mmse\(B∣X~γ\),\\operatorname\{mmse\}\(Z\\mid\\tilde\{X\}\_\{\\gamma\}\)=\\operatorname\{mmse\}\(U\\mid\\tilde\{X\}\_\{\\gamma\}\)\+\\operatorname\{mmse\}\(B\\mid\\tilde\{X\}\_\{\\gamma\}\),hence \([A\.10](https://arxiv.org/html/2605.05387#A1.E10)\) becomes exactly
ddγI\(U;X~γ\)=12\(mmse\(U∣X~γ\)\+mmse\(B∣X~γ\)−mmse\(B∣X~γ,U\)\)\.\\frac\{\\mathrm\{d\}\}\{\\mathrm\{d\}\\gamma\}I\(U;\\tilde\{X\}\_\{\\gamma\}\)=\\frac\{1\}\{2\}\\Big\(\\operatorname\{mmse\}\(U\\mid\\tilde\{X\}\_\{\\gamma\}\)\+\\operatorname\{mmse\}\(B\\mid\\tilde\{X\}\_\{\\gamma\}\)\-\\operatorname\{mmse\}\(B\\mid\\tilde\{X\}\_\{\\gamma\},U\)\\Big\)\.\(A\.11\)
Subtracting \([A\.8](https://arxiv.org/html/2605.05387#A1.E8)\) from \([A\.11](https://arxiv.org/html/2605.05387#A1.E11)\) yields
ddγΦ\(γ\)=12\(mmse\(U∣X~γ\)−mmse\(U∣X~γ,B\)\)\+12\(mmse\(B∣X~γ\)−mmse\(B∣X~γ,U\)\)\.\\frac\{\\mathrm\{d\}\}\{\\mathrm\{d\}\\gamma\}\\Phi\(\\gamma\)=\\frac\{1\}\{2\}\\Big\(\\operatorname\{mmse\}\(U\\mid\\tilde\{X\}\_\{\\gamma\}\)\-\\operatorname\{mmse\}\(U\\mid\\tilde\{X\}\_\{\\gamma\},B\)\\Big\)\+\\frac\{1\}\{2\}\\Big\(\\operatorname\{mmse\}\(B\\mid\\tilde\{X\}\_\{\\gamma\}\)\-\\operatorname\{mmse\}\(B\\mid\\tilde\{X\}\_\{\\gamma\},U\)\\Big\)\.\(A\.12\)Insert \([A\.12](https://arxiv.org/html/2605.05387#A1.E12)\) into \([A\.7](https://arxiv.org/html/2605.05387#A1.E7)\) to obtain the exact decomposition
𝔼B\[KL\(ℙY∗,B∥ℙY^B\)\]≤\[Φ\(γ\)\]γ=γ∗∞−A,\\mathbb\{E\}\_\{B\}\\\!\\Big\[\\mathrm\{KL\}\\\!\\big\(\\mathbb\{P\}^\{Y^\{\*,B\}\}\\big\\\|\\mathbb\{P\}^\{\\hat\{Y\}^\{\\,B\}\}\\big\)\\Big\]\\leq\\Big\[\\Phi\(\\gamma\)\\Big\]\_\{\\gamma=\\gamma^\{\*\}\}^\{\\infty\}\-A,\(A\.13\)where
A:=12∫γ∗∞\(mmse\(B∣X~γ\)−mmse\(B∣X~γ,U\)\)dγ\.A:=\\frac\{1\}\{2\}\\int\_\{\\gamma^\{\*\}\}^\{\\infty\}\\Big\(\\operatorname\{mmse\}\(B\\mid\\tilde\{X\}\_\{\\gamma\}\)\-\\operatorname\{mmse\}\(B\\mid\\tilde\{X\}\_\{\\gamma\},U\)\\Big\)\\,\\mathrm\{d\}\\gamma\.By the orthogonality principle / law of total variance,
mmse\(B∣X~γ\)−mmse\(B∣X~γ,U\)=𝔼\[∥𝔼\[B∣X~γ,U\]−𝔼\[B∣X~γ\]∥22\]≥0,\\operatorname\{mmse\}\(B\\mid\\tilde\{X\}\_\{\\gamma\}\)\-\\operatorname\{mmse\}\(B\\mid\\tilde\{X\}\_\{\\gamma\},U\)=\\mathbb\{E\}\\\!\\left\[\\big\\\|\\mathbb\{E\}\[B\\mid\\tilde\{X\}\_\{\\gamma\},U\]\-\\mathbb\{E\}\[B\\mid\\tilde\{X\}\_\{\\gamma\}\]\\big\\\|\_\{2\}^\{2\}\\right\]\\geq 0,soA≥0A\\geq 0\.
Using the identityI\(U;X\)−I\(U;X∣B\)=I\(U;B\)−I\(U;B∣X\)I\(U;X\)\-I\(U;X\\mid B\)=I\(U;B\)\-I\(U;B\\mid X\)\(a direct consequence of the chain rule\), we have
Φ\(γ\)=I\(U;B\)−I\(U;B∣X~γ\)\.\\Phi\(\\gamma\)=I\(U;B\)\-I\(U;B\\mid\\tilde\{X\}\_\{\\gamma\}\)\.Then,
\[Φ\(γ\)\]γ=γ∗∞≤I\(U;B∣X~γ∗\)\.\\Big\[\\Phi\(\\gamma\)\\Big\]\_\{\\gamma=\\gamma^\{\*\}\}^\{\\infty\}\\leq I\(U;B\\mid\\tilde\{X\}\_\{\\gamma^\{\*\}\}\)\.\(A\.14\)Combining \([A\.13](https://arxiv.org/html/2605.05387#A1.E13)\) and \([A\.14](https://arxiv.org/html/2605.05387#A1.E14)\) gives
𝔼B\[KL\(ℙY∗,B∥ℙY^B\)\]≤I\(U;B∣X~γ∗\)−A≤I\(U;B∣X~γ∗\)\.\\mathbb\{E\}\_\{B\}\\\!\\Big\[\\mathrm\{KL\}\\\!\\big\(\\mathbb\{P\}^\{Y^\{\*,B\}\}\\big\\\|\\mathbb\{P\}^\{\\hat\{Y\}^\{\\,B\}\}\\big\)\\Big\]\\leq I\(U;B\\mid\\tilde\{X\}\_\{\\gamma^\{\*\}\}\)\-A\\ \\leq\\ I\(U;B\\mid\\tilde\{X\}\_\{\\gamma^\{\*\}\}\)\.\(A\.15\)
We now derive a lower bound for the left-hand side of ([A.7](https://arxiv.org/html/2605.05387#A1.E7)). By the equality in (A.3), where the integral runs only up to $\tau=T-t_{0}$, this amounts to lower bounding
$$\frac{1}{2}\int_{\gamma^{*}}^{\gamma_{max}}\Big(\operatorname{mmse}(U\mid\tilde{X}_{\gamma})-\operatorname{mmse}(U\mid\tilde{X}_{\gamma},B)\Big)\,\mathrm{d}\gamma,$$
where $\gamma_{max}=\frac{1}{t_{0}}$. By the computation above, restricted to $[\gamma^{*},\gamma_{max}]$, it suffices to lower bound
$$I(U;B\mid\tilde{X}_{\gamma^{*}})-I(U;B\mid\tilde{X}_{\gamma_{max}})-A_{\gamma_{max}},$$
where
Aγmax:=12∫γ∗γmax\(mmse\(B∣X~γ\)−mmse\(B∣X~γ,U\)\)dγ\.A\_\{\\gamma\_\{max\}\}:=\\frac\{1\}\{2\}\\int\_\{\\gamma^\{\*\}\}^\{\\gamma\_\{max\}\}\\Big\(\\operatorname\{mmse\}\(B\\mid\\tilde\{X\}\_\{\\gamma\}\)\-\\operatorname\{mmse\}\(B\\mid\\tilde\{X\}\_\{\\gamma\},U\)\\Big\)\\,\\mathrm\{d\}\\gamma\.\(A\.16\)Decompose the observation into orthogonal components
X~γ⟂:=P⟂X~γ,X~γ∥:=P∥X~γ,X~γ=X~γ⟂\+X~γ∥\.\\tilde\{X\}\_\{\\gamma\}^\{\\perp\}:=P\_\{\\perp\}\\tilde\{X\}\_\{\\gamma\},\\qquad\\tilde\{X\}\_\{\\gamma\}^\{\\parallel\}:=P\_\{\\parallel\}\\tilde\{X\}\_\{\\gamma\},\\qquad\\tilde\{X\}\_\{\\gamma\}=\\tilde\{X\}\_\{\\gamma\}^\{\\perp\}\+\\tilde\{X\}\_\{\\gamma\}^\{\\parallel\}\.SinceX~γ=γ\(U\+B\)\+Ξ\\tilde\{X\}\_\{\\gamma\}=\\sqrt\{\\gamma\}\(U\+B\)\+\\XiwithΞ∼𝒩\(0,Id\)\\Xi\\sim\\mathcal\{N\}\(0,I\_\{d\}\)independent of\(U,B\)\(U,B\), we have
X~γ⟂=γB\+Ξ⟂,X~γ∥=γU\+Ξ∥,\\tilde\{X\}\_\{\\gamma\}^\{\\perp\}=\\sqrt\{\\gamma\}\\,B\+\\Xi^\{\\perp\},\\qquad\\tilde\{X\}\_\{\\gamma\}^\{\\parallel\}=\\sqrt\{\\gamma\}\\,U\+\\Xi^\{\\parallel\},whereΞ⟂:=P⟂Ξ\\Xi^\{\\perp\}:=P\_\{\\perp\}\\XiandΞ∥:=P∥Ξ\\Xi^\{\\parallel\}:=P\_\{\\parallel\}\\Xiare independent and independent of\(U,B\)\(U,B\)\(because they are orthogonal projections of a standard Gaussian\)\.
*Key claim:*conditioning onUU, the parallel observation carries no information aboutBB, hence
mmse\(B∣X~γ,U\)=mmse\(B∣X~γ⟂,U\)\.\\operatorname\{mmse\}\(B\\mid\\tilde\{X\}\_\{\\gamma\},U\)=\\operatorname\{mmse\}\(B\\mid\\tilde\{X\}\_\{\\gamma\}^\{\\perp\},U\)\.\(A\.17\)Indeed, givenUU, we can writeX~γ∥=γU\+Ξ∥\\tilde\{X\}\_\{\\gamma\}^\{\\parallel\}=\\sqrt\{\\gamma\}\\,U\+\\Xi^\{\\parallel\}as a function ofUUplus independent noiseΞ∥\\Xi^\{\\parallel\}, soX~γ∥⟂⟂\(B,X~γ⟂\)∣U\\tilde\{X\}\_\{\\gamma\}^\{\\parallel\}\\perp\\\!\\\!\\\!\\perp\(B,\\tilde\{X\}\_\{\\gamma\}^\{\\perp\}\)\\mid U\. Therefore
Law\(B∣U,X~γ⟂,X~γ∥\)=Law\(B∣U,X~γ⟂\),\\operatorname\{Law\}\(B\\mid U,\\tilde\{X\}\_\{\\gamma\}^\{\\perp\},\\tilde\{X\}\_\{\\gamma\}^\{\\parallel\}\)=\\operatorname\{Law\}\(B\\mid U,\\tilde\{X\}\_\{\\gamma\}^\{\\perp\}\),which implies𝔼\[B∣U,X~γ\]=𝔼\[B∣U,X~γ⟂\]\\mathbb\{E\}\[B\\mid U,\\tilde\{X\}\_\{\\gamma\}\]=\\mathbb\{E\}\[B\\mid U,\\tilde\{X\}\_\{\\gamma\}^\{\\perp\}\]and thus \([A\.17](https://arxiv.org/html/2605.05387#A1.E17)\)\.
Next, by monotonicity of MMSE with respect to side information \(conditioning on more cannot increase MMSE\),
mmse\(B∣X~γ\)≤mmse\(B∣X~γ⟂\),\\operatorname\{mmse\}\(B\\mid\\tilde\{X\}\_\{\\gamma\}\)\\leq\\operatorname\{mmse\}\(B\\mid\\tilde\{X\}\_\{\\gamma\}^\{\\perp\}\),\(A\.18\)sinceσ\(X~γ⟂\)⊆σ\(X~γ\)\\sigma\(\\tilde\{X\}\_\{\\gamma\}^\{\\perp\}\)\\subseteq\\sigma\(\\tilde\{X\}\_\{\\gamma\}\)\.
Combining \([A\.17](https://arxiv.org/html/2605.05387#A1.E17)\) and \([A\.18](https://arxiv.org/html/2605.05387#A1.E18)\) yields the*correct*pointwise bound
mmse\(B∣X~γ\)−mmse\(B∣X~γ,U\)≤mmse\(B∣X~γ⟂\)−mmse\(B∣X~γ⟂,U\)\.\\operatorname\{mmse\}\(B\\mid\\tilde\{X\}\_\{\\gamma\}\)\-\\operatorname\{mmse\}\(B\\mid\\tilde\{X\}\_\{\\gamma\},U\)\\ \\leq\\ \\operatorname\{mmse\}\(B\\mid\\tilde\{X\}\_\{\\gamma\}^\{\\perp\}\)\-\\operatorname\{mmse\}\(B\\mid\\tilde\{X\}\_\{\\gamma\}^\{\\perp\},U\)\.\(A\.19\)Plugging \([A\.19](https://arxiv.org/html/2605.05387#A1.E19)\) into \([A\.16](https://arxiv.org/html/2605.05387#A1.E16)\) givesAγmax≤A⟂A\_\{\\gamma\_\{max\}\}\\leq A^\{\\perp\}, where
A⟂:=12∫γ∗∞\(mmse\(B∣X~γ⟂\)−mmse\(B∣X~γ⟂,U\)\)𝑑γ\.A^\{\\perp\}:=\\frac\{1\}\{2\}\\int\_\{\\gamma^\{\*\}\}^\{\\infty\}\\Big\(\\operatorname\{mmse\}\(B\\mid\\tilde\{X\}\_\{\\gamma\}^\{\\perp\}\)\-\\operatorname\{mmse\}\(B\\mid\\tilde\{X\}\_\{\\gamma\}^\{\\perp\},U\)\\Big\)\\,d\\gamma\.
Finally,X~γ⟂=γB\+Ξ⟂\\tilde\{X\}\_\{\\gamma\}^\{\\perp\}=\\sqrt\{\\gamma\}\\,B\+\\Xi^\{\\perp\}is a Gaussian channel forBB\(in the normal subspace\); applying Lemma[16](https://arxiv.org/html/2605.05387#Thmtheorem16)\(after identifying an orthonormal basis ofrange\(P⟂\)\\mathrm\{range\}\(P\_\{\\perp\}\), if desired\) gives
$$A^{\perp}=\Big[I(B;\tilde{X}_{\gamma}^{\perp})-I(B;\tilde{X}_{\gamma}^{\perp}\mid U)\Big]_{\gamma=\gamma^{*}}^{\infty}\leq I(U;B\mid\tilde{X}_{\gamma^{*}}^{\perp}).$$
Thus $A_{\gamma_{max}}\leq I(U;B\mid\tilde{X}_{\gamma^{*}}^{\perp})$, completing the lower bound.
## Appendix B. Proof of Theorem [9](https://arxiv.org/html/2605.05387#Thmtheorem9)
The proof follows the same decomposition as the sampler\. At the safe timet∗t^\{\*\}, our initialization is not the exact conditional lawLaw\(Xt∗∣B=b\)\\operatorname\{Law\}\(X\_\{t^\{\*\}\}\\mid B=b\), but a surrogate law obtained by sampling the correct noisy normal component and then sampling the tangent component from the unconditional slice given that normal observation\. Thus the error splits into an initialization term at timet∗t^\{\*\}and a pathwise term accumulated during the reverse dynamics\. The pathwise part is already controlled by Theorem[4](https://arxiv.org/html/2605.05387#Thmtheorem4), so it remains to control the initialization discrepancy\. For this, we write both the true and surrogate tangent laws as mixtures over the discrete latent normal codeS⟂=P⟂SS^\{\\perp\}=P\_\{\\perp\}S, reduce the resulting KL divergence to a posterior resampling error through a coupling inequality for mixture KL, and then bound that posterior resampling error by Shannon entropy using the Gaussian I–MMSE identity\.
###### Lemma 17
Let\{rc\}c∈𝒞\\\{r\_\{c\}\\\}\_\{c\\in\\mathcal\{C\}\}be a family of probability measures on a measurable space\(E,ℰ\)\(E,\\mathcal\{E\}\), where𝒞\\mathcal\{C\}is countable\. Letα,β\\alpha,\\betabe probability mass functions on𝒞\\mathcal\{C\}, and define
μ:=∑c∈𝒞α\(c\)rc,ν:=∑c∈𝒞β\(c\)rc\.\\mu:=\\sum\_\{c\\in\\mathcal\{C\}\}\\alpha\(c\)\\,r\_\{c\},\\qquad\\nu:=\\sum\_\{c\\in\\mathcal\{C\}\}\\beta\(c\)\\,r\_\{c\}\.Then
KL\(μ∥ν\)≤infλ∈Γ\(α,β\)∑c,c~∈𝒞λ\(c,c~\)KL\(rc∥rc~\),\\mathrm\{KL\}\(\\mu\\\|\\nu\)\\leq\\inf\_\{\\lambda\\in\\Gamma\(\\alpha,\\beta\)\}\\sum\_\{c,\\tilde\{c\}\\in\\mathcal\{C\}\}\\lambda\(c,\\tilde\{c\}\)\\,\\mathrm\{KL\}\(r\_\{c\}\\\|r\_\{\\tilde\{c\}\}\),whereΓ\(α,β\)\\Gamma\(\\alpha,\\beta\)denotes the set of couplings ofα\\alphaandβ\\beta\.
ProofFix any couplingλ∈Γ\(α,β\)\\lambda\\in\\Gamma\(\\alpha,\\beta\)\. Define two probability measures on𝒞×𝒞×E\\mathcal\{C\}\\times\\mathcal\{C\}\\times Eby
𝒫λ\(c,c~,dx\):=λ\(c,c~\)rc\(dx\),Qλ\(c,c~,dx\):=λ\(c,c~\)rc~\(dx\)\.\\mathcal\{P\}\_\{\\lambda\}\(c,\\tilde\{c\},dx\):=\\lambda\(c,\\tilde\{c\}\)\\,r\_\{c\}\(dx\),\\qquad Q\_\{\\lambda\}\(c,\\tilde\{c\},dx\):=\\lambda\(c,\\tilde\{c\}\)\\,r\_\{\\tilde\{c\}\}\(dx\)\.TheirEE\-marginals are exactlyμ\\muandν\\nu\. Therefore, by data processing under the projection\(c,c~,x\)↦x\(c,\\tilde\{c\},x\)\\mapsto x,
KL\(μ∥ν\)≤KL\(𝒫λ∥Qλ\)\.\\mathrm\{KL\}\(\\mu\\\|\\nu\)\\leq\\mathrm\{KL\}\(\\mathcal\{P\}\_\{\\lambda\}\\\|Q\_\{\\lambda\}\)\.Since𝒫λ\\mathcal\{P\}\_\{\\lambda\}andQλQ\_\{\\lambda\}have the same\(c,c~\)\(c,\\tilde\{c\}\)\-marginalλ\\lambda, the chain rule for relative entropy gives
KL\(𝒫λ∥Qλ\)=∑c,c~λ\(c,c~\)KL\(rc∥rc~\)\.\\mathrm\{KL\}\(\\mathcal\{P\}\_\{\\lambda\}\\\|Q\_\{\\lambda\}\)=\\sum\_\{c,\\tilde\{c\}\}\\lambda\(c,\\tilde\{c\}\)\\,\\mathrm\{KL\}\(r\_\{c\}\\\|r\_\{\\tilde\{c\}\}\)\.Taking the infimum overλ∈Γ\(α,β\)\\lambda\\in\\Gamma\(\\alpha,\\beta\)yields the claim\.
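The coupling inequality of Lemma 17 can be illustrated on a finite example. The sketch below assumes three component laws on a three-point space and uses the product coupling $\lambda=\alpha\otimes\beta$ (the same coupling chosen later in the proof of Theorem 9); all numerical values are made up for illustration.

```python
import math


def kl(p, q):
    """KL(p || q) in nats for strictly positive q on a finite space."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def mixture(weights, components):
    """Pointwise mixture sum_c weights[c] * components[c]."""
    size = len(components[0])
    return [sum(w * r[i] for w, r in zip(weights, components)) for i in range(size)]


def product_coupling_bound(alpha, beta, components):
    """sum_{c, c~} alpha(c) beta(c~) KL(r_c || r_c~), i.e. the bound for lambda = alpha x beta."""
    return sum(a * b * kl(rc, rct)
               for a, rc in zip(alpha, components)
               for b, rct in zip(beta, components))


# Hypothetical component laws r_c and mixing weights alpha, beta.
components = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]]
alpha = [0.6, 0.3, 0.1]
beta = [0.2, 0.3, 0.5]

lhs = kl(mixture(alpha, components), mixture(beta, components))
rhs = product_coupling_bound(alpha, beta, components)
```

Here `lhs` is the mixture KL and `rhs` the coupled bound; other couplings of `alpha` and `beta` can only tighten `rhs` through the infimum.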
###### Lemma 18
Let $S$ be a discrete random variable in $\mathbb{R}^{m}$ with $H(S)<\infty$, and let
$$X=S+\sigma G,\qquad G\sim\mathcal{N}(0,I_{m}),$$
with $G$ independent of $S$. Let $\tilde{S}$ be an independent posterior draw, i.e.
$$\tilde{S}\mid X\sim\operatorname{Law}(S\mid X),\qquad\tilde{S}\perp S\mid X.$$
Then
$$\mathbb{E}\|S-\tilde{S}\|_{2}^{2}\leq 4\sigma^{2}H(S).$$
Proof. Conditional on $X$, the random variables $S$ and $\tilde{S}$ are i.i.d. with common law $\operatorname{Law}(S\mid X)$. Hence
$$\mathbb{E}\!\left[\|S-\tilde{S}\|_{2}^{2}\,\middle|\,X\right]=2\,\operatorname{tr}\!\big(\operatorname{Cov}(S\mid X)\big),$$
so
$$\mathbb{E}\|S-\tilde{S}\|_{2}^{2}=2\,\mathbb{E}\operatorname{tr}\!\big(\operatorname{Cov}(S\mid X)\big).\tag{B.1}$$
Set $\gamma:=1/\sigma^{2}$ and $Y_{\gamma}:=\sqrt{\gamma}\,S+G$. Since $X=\sigma Y_{\gamma}$, the observations $X$ and $Y_{\gamma}$ are equivalent, and
$$\mathbb{E}\operatorname{tr}\!\big(\operatorname{Cov}(S\mid X)\big)=\mathbb{E}\operatorname{tr}\!\big(\operatorname{Cov}(S\mid Y_{\gamma})\big)=:\operatorname{mmse}(\gamma).$$
For the Gaussian channel $Y_{\gamma}=\sqrt{\gamma}\,S+G$, the vector I–MMSE identity gives
$$I(S;Y_{\gamma})=\frac{1}{2}\int_{0}^{\gamma}\operatorname{mmse}(s)\,ds.$$
Since $\operatorname{mmse}(s)$ is nonincreasing in $s$,
$$I(S;Y_{\gamma})\geq\frac{\gamma}{2}\,\operatorname{mmse}(\gamma).$$
Because $S$ is discrete,
$$I(S;Y_{\gamma})\leq H(S).$$
Thus
$$\operatorname{mmse}(\gamma)\leq\frac{2H(S)}{\gamma}=2\sigma^{2}H(S).$$
Substituting into ([B.1](https://arxiv.org/html/2605.05387#A2.E1)) gives
$$\mathbb{E}\|S-\tilde{S}\|_{2}^{2}\leq 4\sigma^{2}H(S).$$
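The bound of Lemma 18 can be checked numerically for a small one-dimensional code. The sketch below assumes a three-point support with made-up probabilities and computes $2\,\mathbb{E}\operatorname{Var}(S\mid X)$ by a Riemann sum over $x$; the grid width and resolution are illustrative choices.

```python
import math


def normal_pdf(x, mean, sigma):
    """Density of N(mean, sigma^2) at x."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def resampling_error(points, probs, sigma, n=20001):
    """E|S - S~|^2 = 2 E Var(S | X) for X = S + sigma * G, by quadrature over x."""
    lo = min(points) - 10.0 * sigma
    hi = max(points) + 10.0 * sigma
    h = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        x = lo + i * h
        joint = [p * normal_pdf(x, s, sigma) for s, p in zip(points, probs)]
        px = sum(joint)
        if px == 0.0:
            continue
        mean = sum(s * j for s, j in zip(points, joint)) / px
        second = sum(s * s * j for s, j in zip(points, joint)) / px
        # Posterior variance at x, weighted by the marginal density of X.
        total += px * (second - mean * mean) * h
    return 2.0 * total


def shannon_entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)
```

For any noise level `sigma`, `resampling_error(points, probs, sigma)` should not exceed `4 * sigma**2 * shannon_entropy(probs)`; the bound is tightest at moderate noise and loose when `sigma` is large.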
Proof of Theorem [9](https://arxiv.org/html/2605.05387#Thmtheorem9). Let
rtc:=Law\(Xt∥∣S⟂=c\),c∈𝒞⟂\.r\_\{t\}^\{c\}:=\\operatorname\{Law\}\(X\_\{t\}^\{\\parallel\}\\mid S^\{\\perp\}=c\),\\qquad c\\in\\mathcal\{C\}^\{\\perp\}\.Under Assumption[5\.2](https://arxiv.org/html/2605.05387#S5.Thmassumption2),
Z=S\+εN,B=P⟂Z=S⟂\+εN⟂,Xt⟂=B\+Wt⟂\.Z=S\+\\varepsilon N,\\qquad B=P\_\{\\perp\}Z=S^\{\\perp\}\+\\varepsilon N^\{\\perp\},\\qquad X\_\{t\}^\{\\perp\}=B\+W\_\{t\}^\{\\perp\}\.Since the tangent and normal noises are independent, conditional onS⟂S^\{\\perp\}the variableXt∥X\_\{t\}^\{\\parallel\}is independent of bothBBandXt⟂X\_\{t\}^\{\\perp\}\. Therefore, if
πb\(c\):=ℙ\(S⟂=c∣B=b\),πx\(c\):=ℙ\(S⟂=c∣Xt⟂=x\),\\pi\_\{b\}\(c\):=\\mathbb\{P\}\(S^\{\\perp\}=c\\mid B=b\),\\qquad\\pi\_\{x\}\(c\):=\\mathbb\{P\}\(S^\{\\perp\}=c\\mid X\_\{t\}^\{\\perp\}=x\),then
Law\(Xt∥∣B=b\)=∑c∈𝒞⟂πb\(c\)rtc,Law\(Xt∥∣Xt⟂=x\)=∑c∈𝒞⟂πx\(c\)rtc\.\\operatorname\{Law\}\(X\_\{t\}^\{\\parallel\}\\mid B=b\)=\\sum\_\{c\\in\\mathcal\{C\}^\{\\perp\}\}\\pi\_\{b\}\(c\)\\,r\_\{t\}^\{c\},\\qquad\\operatorname\{Law\}\(X\_\{t\}^\{\\parallel\}\\mid X\_\{t\}^\{\\perp\}=x\)=\\sum\_\{c\\in\\mathcal\{C\}^\{\\perp\}\}\\pi\_\{x\}\(c\)\\,r\_\{t\}^\{c\}\.
Moreover, conditional onB=bB=b, the normal component isXt⟂=b\+Wt⟂X\_\{t\}^\{\\perp\}=b\+W\_\{t\}^\{\\perp\}, and it is independent ofXt∥X\_\{t\}^\{\\parallel\}\. Hence the true conditional law and the surrogate initialization law factorize as
pt∗,b\(x⟂,x∥\)=pt\(x⟂∣B=b\)μtb\(x∥\),μtb:=∑c∈𝒞⟂πb\(c\)rtc,p\_\{t\}^\{\*,b\}\(x^\{\\perp\},x^\{\\parallel\}\)=p\_\{t\}\(x^\{\\perp\}\\mid B=b\)\\,\\mu\_\{t\}^\{b\}\(x^\{\\parallel\}\),\\qquad\\mu\_\{t\}^\{b\}:=\\sum\_\{c\\in\\mathcal\{C\}^\{\\perp\}\}\\pi\_\{b\}\(c\)\\,r\_\{t\}^\{c\},and
p^tb\(x⟂,x∥\)=pt\(x⟂∣B=b\)νtx⟂\(x∥\),νtx:=∑c∈𝒞⟂πx\(c\)rtc\.\\hat\{p\}\_\{t\}^\{\\,b\}\(x^\{\\perp\},x^\{\\parallel\}\)=p\_\{t\}\(x^\{\\perp\}\\mid B=b\)\\,\\nu\_\{t\}^\{x^\{\\perp\}\}\(x^\{\\parallel\}\),\\qquad\\nu\_\{t\}^\{x\}:=\\sum\_\{c\\in\\mathcal\{C\}^\{\\perp\}\}\\pi\_\{x\}\(c\)\\,r\_\{t\}^\{c\}\.Therefore
KL\(pt∗,b∥p^tb\)=𝔼\[KL\(μtb∥νtXt⟂\)\|B=b\]\.\\mathrm\{KL\}\(p\_\{t\}^\{\*,b\}\\\|\\hat\{p\}\_\{t\}^\{\\,b\}\)=\\mathbb\{E\}\\\!\\left\[\\mathrm\{KL\}\(\\mu\_\{t\}^\{b\}\\\|\\nu\_\{t\}^\{X\_\{t\}^\{\\perp\}\}\)\\,\\middle\|\\,B=b\\right\]\.\(B\.2\)
Applying Lemma[17](https://arxiv.org/html/2605.05387#Thmtheorem17)and choosing the product couplingλ=πb⊗πx\\lambda=\\pi\_\{b\}\\otimes\\pi\_\{x\}, we obtain
KL\(μtb∥νtx\)≤∑c,c~πb\(c\)πx\(c~\)KL\(rtc∥rtc~\)\.\\mathrm\{KL\}\(\\mu\_\{t\}^\{b\}\\\|\\nu\_\{t\}^\{x\}\)\\leq\\sum\_\{c,\\tilde\{c\}\}\\pi\_\{b\}\(c\)\\pi\_\{x\}\(\\tilde\{c\}\)\\,\\mathrm\{KL\}\(r\_\{t\}^\{c\}\\\|r\_\{t\}^\{\\tilde\{c\}\}\)\.By Assumption[5\.3](https://arxiv.org/html/2605.05387#S5.Thmassumption3),
KL\(rtc∥rtc~\)≤Lt‖c−c~‖22,t≥t0,\\mathrm\{KL\}\(r\_\{t\}^\{c\}\\\|r\_\{t\}^\{\\tilde\{c\}\}\)\\leq L\_\{t\}\\\|c\-\\tilde\{c\}\\\|\_\{2\}^\{2\},\\qquad t\\geq t\_\{0\},and hence
KL\(μtb∥νtx\)≤Lt∑c,c~πb\(c\)πx\(c~\)‖c−c~‖22\.\\mathrm\{KL\}\(\\mu\_\{t\}^\{b\}\\\|\\nu\_\{t\}^\{x\}\)\\leq L\_\{t\}\\sum\_\{c,\\tilde\{c\}\}\\pi\_\{b\}\(c\)\\pi\_\{x\}\(\\tilde\{c\}\)\\\|c\-\\tilde\{c\}\\\|\_\{2\}^\{2\}\.Substituting into \([B\.2](https://arxiv.org/html/2605.05387#A2.E2)\) and averaging overBBgives
𝔼B\[KL\(pt∗,B∥p^tB\)\]≤Lt𝔼‖S⟂−S~⟂‖22,\\mathbb\{E\}\_\{B\}\\\!\\left\[\\mathrm\{KL\}\(p\_\{t\}^\{\*,B\}\\\|\\hat\{p\}\_\{t\}^\{\\,B\}\)\\right\]\\leq L\_\{t\}\\,\\mathbb\{E\}\\\|S^\{\\perp\}\-\\tilde\{S\}^\{\\perp\}\\\|\_\{2\}^\{2\},\(B\.3\)where, conditional onXt⟂X\_\{t\}^\{\\perp\}, the random variableS~⟂\\tilde\{S\}^\{\\perp\}is an independent draw fromLaw\(S⟂∣Xt⟂\)\\operatorname\{Law\}\(S^\{\\perp\}\\mid X\_\{t\}^\{\\perp\}\)\.
Since
Xt⟂=S⟂\+εN⟂\+Wt⟂=S⟂\+t\+ε2G,G∼𝒩\(0,Im\),X\_\{t\}^\{\\perp\}=S^\{\\perp\}\+\\varepsilon N^\{\\perp\}\+W\_\{t\}^\{\\perp\}=S^\{\\perp\}\+\\sqrt\{t\+\\varepsilon^\{2\}\}\\,G,\\qquad G\\sim\\mathcal\{N\}\(0,I\_\{m\}\),Lemma[18](https://arxiv.org/html/2605.05387#Thmtheorem18)yields
𝔼‖S⟂−S~⟂‖22≤4\(t\+ε2\)H\(S⟂\)\.\\mathbb\{E\}\\\|S^\{\\perp\}\-\\tilde\{S\}^\{\\perp\}\\\|\_\{2\}^\{2\}\\leq 4\(t\+\\varepsilon^\{2\}\)H\(S^\{\\perp\}\)\.Combining this with \([B\.3](https://arxiv.org/html/2605.05387#A2.E3)\) and settingt=t∗t=t^\{\*\}, we obtain
𝔼B\[KL\(pt∗∗,B∥p^t∗B\)\]≤4Lt∗\(t∗\+ε2\)H\(S⟂\)\.\\mathbb\{E\}\_\{B\}\\\!\\left\[\\mathrm\{KL\}\(p\_\{t^\{\*\}\}^\{\*,B\}\\\|\\hat\{p\}\_\{t^\{\*\}\}^\{\\,B\}\)\\right\]\\leq 4L\_\{t^\{\*\}\}\(t^\{\*\}\+\\varepsilon^\{2\}\)H\(S^\{\\perp\}\)\.\(B\.4\)
For eachbb, letℙ∗,b\\mathbb\{P\}^\{\*,b\}be the path measure of the true conditional reverse SDE on\[τ∗,T−t0\]\[\\tau^\{\*\},T\-t\_\{0\}\], started frompt∗∗,bp\_\{t^\{\*\}\}^\{\*,b\}, and letℙ^b\\hat\{\\mathbb\{P\}\}^\{\\,b\}be the path measure of the surrogate reverse SDE on the same interval, started fromp^t∗b\\hat\{p\}\_\{t^\{\*\}\}^\{\\,b\}\. Letℙ~b\\tilde\{\\mathbb\{P\}\}^\{\\,b\}denote the path measure obtained by running the surrogate reverse SDE from the true initial lawpt∗∗,bp\_\{t^\{\*\}\}^\{\*,b\}\. Then
KL\(ℙ∗,b∥ℙ^b\)=KL\(ℙ∗,b∥ℙ~b\)\+KL\(pt∗∗,b∥p^t∗b\)\.\\mathrm\{KL\}\(\\mathbb\{P\}^\{\*,b\}\\\|\\hat\{\\mathbb\{P\}\}^\{\\,b\}\)=\\mathrm\{KL\}\(\\mathbb\{P\}^\{\*,b\}\\\|\\tilde\{\\mathbb\{P\}\}^\{\\,b\}\)\+\\mathrm\{KL\}\(p\_\{t^\{\*\}\}^\{\*,b\}\\\|\\hat\{p\}\_\{t^\{\*\}\}^\{\\,b\}\)\.Averaging overBByields
𝔼B\[KL\(ℙ∗,B∥ℙ^B\)\]=𝔼B\[KL\(ℙ∗,B∥ℙ~B\)\]\+𝔼B\[KL\(pt∗∗,B∥p^t∗B\)\]\.\\mathbb\{E\}\_\{B\}\\\!\\big\[\\mathrm\{KL\}\(\\mathbb\{P\}^\{\*,B\}\\\|\\hat\{\\mathbb\{P\}\}^\{\\,B\}\)\\big\]=\\mathbb\{E\}\_\{B\}\\\!\\big\[\\mathrm\{KL\}\(\\mathbb\{P\}^\{\*,B\}\\\|\\tilde\{\\mathbb\{P\}\}^\{\\,B\}\)\\big\]\+\\mathbb\{E\}\_\{B\}\\\!\\big\[\\mathrm\{KL\}\(p\_\{t^\{\*\}\}^\{\*,B\}\\\|\\hat\{p\}\_\{t^\{\*\}\}^\{\\,B\}\)\\big\]\.By Theorem[4](https://arxiv.org/html/2605.05387#Thmtheorem4),
𝔼B\[KL\(ℙ∗,B∥ℙ~B\)\]≤I\(Z∥;Z⟂∣Xt∗\),\\mathbb\{E\}\_\{B\}\\\!\\big\[\\mathrm\{KL\}\(\\mathbb\{P\}^\{\*,B\}\\\|\\tilde\{\\mathbb\{P\}\}^\{\\,B\}\)\\big\]\\leq I\\\!\\big\(Z^\{\\parallel\};Z^\{\\perp\}\\mid X\_\{t^\{\*\}\}\\big\),and by Proposition[6](https://arxiv.org/html/2605.05387#Thmtheorem6),
I\(Z∥;Z⟂∣Xt∗\)≤I\(S∥;S⟂∣Xt∗\)\.I\\\!\\big\(Z^\{\\parallel\};Z^\{\\perp\}\\mid X\_\{t^\{\*\}\}\\big\)\\leq I\\\!\\big\(S^\{\\parallel\};S^\{\\perp\}\\mid X\_\{t^\{\*\}\}\\big\)\.Together with \([B\.4](https://arxiv.org/html/2605.05387#A2.E4)\), this gives
𝔼B\[KL\(ℙ∗,B∥ℙ^B\)\]≤4Lt∗\(t∗\+ε2\)H\(S⟂\)\+I\(S∥;S⟂∣Xt∗\)\.\\mathbb\{E\}\_\{B\}\\\!\\big\[\\mathrm\{KL\}\(\\mathbb\{P\}^\{\*,B\}\\\|\\hat\{\\mathbb\{P\}\}^\{\\,B\}\)\\big\]\\leq 4L\_\{t^\{\*\}\}\(t^\{\*\}\+\\varepsilon^\{2\}\)H\(S^\{\\perp\}\)\+I\\\!\\big\(S^\{\\parallel\};S^\{\\perp\}\\mid X\_\{t^\{\*\}\}\\big\)\.
Finally, the terminal tangent marginal is a measurable image of path space, so by data processing,
𝔼B\[KL\(μT−t0∗,B∥μ^T−t0B\)\]≤𝔼B\[KL\(ℙ∗,B∥ℙ^B\)\]\.\\mathbb\{E\}\_\{B\}\\\!\\left\[\\mathrm\{KL\}\\\!\\big\(\\mu\_\{T\-t\_\{0\}\}^\{\*,B\}\\,\\\|\\,\\hat\{\\mu\}\_\{T\-t\_\{0\}\}^\{\\,B\}\\big\)\\right\]\\leq\\mathbb\{E\}\_\{B\}\\\!\\big\[\\mathrm\{KL\}\(\\mathbb\{P\}^\{\*,B\}\\\|\\hat\{\\mathbb\{P\}\}^\{\\,B\}\)\\big\]\.This proves \([5\.2](https://arxiv.org/html/2605.05387#S5.E2)\)\.
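The additive splitting of the path-space KL into a pathwise term and an initialization term, used in the proof above, is an instance of the chain rule for relative entropy. The sketch below illustrates it on a hypothetical one-step Markov chain with two states, where the "true" and "surrogate" dynamics differ in both initial law and transition kernel; all numbers are made up.

```python
import math


def kl(p, q):
    """KL(p || q) in nats on a finite space with strictly positive q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def path_law(init, kernel):
    """Joint law of (x0, x1), flattened, for initial law `init` and transition `kernel`."""
    return [init[i] * kernel[i][j]
            for i in range(len(init))
            for j in range(len(kernel[i]))]


# Hypothetical true/surrogate initial laws and transition kernels.
p_true, p_surr = [0.7, 0.3], [0.4, 0.6]
k_true = [[0.9, 0.1], [0.2, 0.8]]
k_surr = [[0.6, 0.4], [0.5, 0.5]]

total = kl(path_law(p_true, k_true), path_law(p_surr, k_surr))
# Surrogate dynamics started from the true initial law (the analogue of P-tilde).
pathwise = kl(path_law(p_true, k_true), path_law(p_true, k_surr))
initial = kl(p_true, p_surr)
```

Up to floating-point error, `total` equals `pathwise + initial`, mirroring the decomposition into the term controlled by Theorem 4 and the initialization term bounded in (B.4).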
## Appendix C. Proof of Theorem [11](https://arxiv.org/html/2605.05387#Thmtheorem11)
The proof again separates initialization and pathwise contributions, but now theδ\\delta\-separation assumption upgrades both bounds from Shannon\-scale control to exponential control\. The initialization term is handled through the same mixture representation as above, followed by a posterior\-resampling tail estimate for the effective normal Gaussian channel\. The pathwise term is bounded through Theorem[4](https://arxiv.org/html/2605.05387#Thmtheorem4), the latent comparison proposition, and an exponential bound on the residual conditional entropyH\(S⟂∣Xt∗⟂\)H\(S^\{\\perp\}\\mid X\_\{t^\{\*\}\}^\{\\perp\}\)\.
###### Lemma 19
Let $S$ be a discrete random variable supported on a countable set $\mathcal{C}^{\perp}\subset\mathbb{R}^{m}$ with pmf $p_{S}$, and define
$$H_{1/2}(S):=2\log\sum_{c\in\mathcal{C}^{\perp}}\sqrt{p_{S}(c)}.$$
Fix $t>0$, and consider the Gaussian channel
X∣\(S=c\)∼𝒩\(c,\(t\+ε2\)Im\)\.X\\mid\(S=c\)\\sim\\mathcal\{N\}\(c,\(t\+\\varepsilon^\{2\}\)I\_\{m\}\)\.Letpt\(x∣c\)p\_\{t\}\(x\\mid c\)denote this Gaussian density and letpt\(c∣x\)p\_\{t\}\(c\\mid x\)be the posterior
pt\(c∣x\)=pS\(c\)pt\(x∣c\)∑u∈𝒞⟂pS\(u\)pt\(x∣u\)\.p\_\{t\}\(c\\mid x\)=\\frac\{p\_\{S\}\(c\)\\,p\_\{t\}\(x\\mid c\)\}\{\\sum\_\{u\\in\\mathcal\{C\}^\{\\perp\}\}p\_\{S\}\(u\)\\,p\_\{t\}\(x\\mid u\)\}\.For eachc∗∈𝒞⟂c^\{\*\}\\in\\mathcal\{C\}^\{\\perp\}, define the posterior\-resampling kernel
Kt\(c,c∗\):=∫pt\(c∣x\)pt\(x∣c∗\)𝑑x\.K\_\{t\}\(c,c^\{\*\}\):=\\int p\_\{t\}\(c\\mid x\)\\,p\_\{t\}\(x\\mid c^\{\*\}\)\\,dx\.LetS∗∼pSS^\{\*\}\\sim p\_\{S\}, and conditional onS∗=c∗S^\{\*\}=c^\{\*\}, letS~\\tilde\{S\}have pmfKt\(⋅,c∗\)K\_\{t\}\(\\cdot,c^\{\*\}\)\. Set
R:=‖S~−S∗‖2\.R:=\\\|\\tilde\{S\}\-S^\{\*\}\\\|\_\{2\}\.Then for everyr≥0r\\geq 0,
ℙ\(R≥r\)≤12exp\(H1/2\(S\)−r28\(t\+ε2\)\)\.\\mathbb\{P\}\(R\\geq r\)\\leq\\frac\{1\}\{2\}\\exp\\\!\\Big\(H\_\{1/2\}\(S\)\-\\frac\{r^\{2\}\}\{8\(t\+\\varepsilon^\{2\}\)\}\\Big\)\.
ProofFixc,c∗∈𝒞⟂c,c^\{\*\}\\in\\mathcal\{C\}^\{\\perp\}andx∈ℝmx\\in\\mathbb\{R\}^\{m\}\. By Bayes’ rule,
pt\(c∣x\)≤pS\(c\)pt\(x∣c\)pS\(c\)pt\(x∣c\)\+pS\(c∗\)pt\(x∣c∗\)\.p\_\{t\}\(c\\mid x\)\\leq\\frac\{p\_\{S\}\(c\)p\_\{t\}\(x\\mid c\)\}\{p\_\{S\}\(c\)p\_\{t\}\(x\\mid c\)\+p\_\{S\}\(c^\{\*\}\)p\_\{t\}\(x\\mid c^\{\*\}\)\}\.WritingA:=pS\(c\)pt\(x∣c\)A:=p\_\{S\}\(c\)p\_\{t\}\(x\\mid c\)andB:=pS\(c∗\)pt\(x∣c∗\)B:=p\_\{S\}\(c^\{\*\}\)p\_\{t\}\(x\\mid c^\{\*\}\), the inequalityA\+B≥2ABA\+B\\geq 2\\sqrt\{AB\}gives
AA\+B≤12AB=12pS\(c\)pS\(c∗\)pt\(x∣c\)pt\(x∣c∗\)\.\\frac\{A\}\{A\+B\}\\leq\\frac\{1\}\{2\}\\sqrt\{\\frac\{A\}\{B\}\}=\\frac\{1\}\{2\}\\sqrt\{\\frac\{p\_\{S\}\(c\)\}\{p\_\{S\}\(c^\{\*\}\)\}\}\\sqrt\{\\frac\{p\_\{t\}\(x\\mid c\)\}\{p\_\{t\}\(x\\mid c^\{\*\}\)\}\}\.Multiplying bypt\(x∣c∗\)p\_\{t\}\(x\\mid c^\{\*\}\)and integrating overxx, we obtain
Kt\(c,c∗\)≤12pS\(c\)pS\(c∗\)∫pt\(x∣c\)pt\(x∣c∗\)𝑑x\.K\_\{t\}\(c,c^\{\*\}\)\\leq\\frac\{1\}\{2\}\\sqrt\{\\frac\{p\_\{S\}\(c\)\}\{p\_\{S\}\(c^\{\*\}\)\}\}\\int\\sqrt\{p\_\{t\}\(x\\mid c\)p\_\{t\}\(x\\mid c^\{\*\}\)\}\\,dx\.For isotropic Gaussians with covariance\(t\+ε2\)Im\(t\+\\varepsilon^\{2\}\)I\_\{m\}, the Hellinger affinity is
∫pt\(x∣c\)pt\(x∣c∗\)𝑑x=exp\(−‖c−c∗‖228\(t\+ε2\)\)\.\\int\\sqrt\{p\_\{t\}\(x\\mid c\)p\_\{t\}\(x\\mid c^\{\*\}\)\}\\,dx=\\exp\\\!\\Big\(\-\\frac\{\\\|c\-c^\{\*\}\\\|\_\{2\}^\{2\}\}\{8\(t\+\\varepsilon^\{2\}\)\}\\Big\)\.Hence
Kt\(c,c∗\)≤12pS\(c\)pS\(c∗\)exp\(−‖c−c∗‖228\(t\+ε2\)\)\.K\_\{t\}\(c,c^\{\*\}\)\\leq\\frac\{1\}\{2\}\\sqrt\{\\frac\{p\_\{S\}\(c\)\}\{p\_\{S\}\(c^\{\*\}\)\}\}\\exp\\\!\\Big\(\-\\frac\{\\\|c\-c^\{\*\}\\\|\_\{2\}^\{2\}\}\{8\(t\+\\varepsilon^\{2\}\)\}\\Big\)\.Let
$$M:=\sum_{c\in\mathcal{C}^{\perp}}\sqrt{p_{S}(c)},\qquad M^{2}=e^{H_{1/2}(S)}.$$
Then
ℙ\(R≥r∣S∗=c∗\)=∑‖c−c∗‖≥rKt\(c,c∗\)≤M2pS\(c∗\)e−r2/\(8\(t\+ε2\)\)\.\\mathbb\{P\}\(R\\geq r\\mid S^\{\*\}=c^\{\*\}\)=\\sum\_\{\\\|c\-c^\{\*\}\\\|\\geq r\}K\_\{t\}\(c,c^\{\*\}\)\\leq\\frac\{M\}\{2\\sqrt\{p\_\{S\}\(c^\{\*\}\)\}\}e^\{\-r^\{2\}/\(8\(t\+\\varepsilon^\{2\}\)\)\}\.Averaging overS∗∼pSS^\{\*\}\\sim p\_\{S\}yields
ℙ\(R≥r\)≤12exp\(H1/2\(S\)−r28\(t\+ε2\)\)\.\\mathbb\{P\}\(R\\geq r\)\\leq\\frac\{1\}\{2\}\\exp\\\!\\Big\(H\_\{1/2\}\(S\)\-\\frac\{r^\{2\}\}\{8\(t\+\\varepsilon^\{2\}\)\}\\Big\)\.
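The closed form for the Hellinger affinity of two isotropic Gaussians with common covariance, which drives the exponent $r^{2}/(8(t+\varepsilon^{2}))$ in the proof above, can be verified numerically. The sketch below assumes the one-dimensional case, with `var` playing the role of $t+\varepsilon^{2}$; the function names and grid are illustrative.

```python
import math


def normal_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def hellinger_affinity_quadrature(c, c_star, var, n=20001):
    """Riemann-sum approximation of the integral of sqrt(p(x | c) p(x | c*))."""
    sd = math.sqrt(var)
    lo = min(c, c_star) - 12.0 * sd
    hi = max(c, c_star) + 12.0 * sd
    h = (hi - lo) / (n - 1)
    return h * sum(
        math.sqrt(normal_pdf(lo + i * h, c, var) * normal_pdf(lo + i * h, c_star, var))
        for i in range(n)
    )


def hellinger_affinity_closed_form(c, c_star, var):
    """exp(-|c - c*|^2 / (8 var)), the formula used in the proof."""
    return math.exp(-(c - c_star) ** 2 / (8.0 * var))
```

The two functions should agree to quadrature accuracy for any pair of centers, for example `c = 0.0`, `c_star = 3.0`, `var = 1.5`.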
###### Lemma 20
Assume Assumption[5\.4](https://arxiv.org/html/2605.05387#S5.Thmassumption4)andH1/2\(S⟂\)<∞H\_\{1/2\}\(S^\{\\perp\}\)<\\infty\. Then
H\(S⟂∣Xt∗⟂\)≤2exp\(H1/2\(S⟂\)−δ28\(t∗\+ε2\)\)\.H\(S^\{\\perp\}\\mid X\_\{t^\{\*\}\}^\{\\perp\}\)\\leq 2\\exp\\\!\\Big\(H\_\{1/2\}\(S^\{\\perp\}\)\-\\frac\{\\delta^\{2\}\}\{8\(t^\{\*\}\+\\varepsilon^\{2\}\)\}\\Big\)\.
ProofWriteπc:=ℙ\(S⟂=c\)\\pi\_\{c\}:=\\mathbb\{P\}\(S^\{\\perp\}=c\)forc∈𝒞⟂c\\in\\mathcal\{C\}^\{\\perp\}, and fixc∗∈𝒞⟂c^\{\*\}\\in\\mathcal\{C\}^\{\\perp\}\. Conditional onS⟂=c∗S^\{\\perp\}=c^\{\*\},
Xt∗⟂=c∗\+t∗\+ε2G,G∼𝒩\(0,Im\)\.X\_\{t^\{\*\}\}^\{\\perp\}=c^\{\*\}\+\\sqrt\{t^\{\*\}\+\\varepsilon^\{2\}\}\\,G,\\qquad G\\sim\\mathcal\{N\}\(0,I\_\{m\}\)\.Forx∈ℝmx\\in\\mathbb\{R\}^\{m\}, let
px\(c\):=ℙ\(S⟂=c∣Xt∗⟂=x\)\.p\_\{x\}\(c\):=\\mathbb\{P\}\(S^\{\\perp\}=c\\mid X\_\{t^\{\*\}\}^\{\\perp\}=x\)\.Then
H\(S⟂∣Xt∗⟂=x\)=∑c∈𝒞⟂px\(c\)log1px\(c\)\.H\(S^\{\\perp\}\\mid X\_\{t^\{\*\}\}^\{\\perp\}=x\)=\\sum\_\{c\\in\\mathcal\{C\}^\{\\perp\}\}p\_\{x\}\(c\)\\log\\frac\{1\}\{p\_\{x\}\(c\)\}\.Forc≠c∗c\\neq c^\{\*\}, define
lc\(x\):=px\(c\)px\(c∗\),R\(x\):=∑c≠c∗lc\(x\)\.l\_\{c\}\(x\):=\\frac\{p\_\{x\}\(c\)\}\{p\_\{x\}\(c^\{\*\}\)\},\\qquad R\(x\):=\\sum\_\{c\\neq c^\{\*\}\}l\_\{c\}\(x\)\.Then
px\(c∗\)=11\+R\(x\),px\(c\)=lc\(x\)1\+R\(x\)\(c≠c∗\),p\_\{x\}\(c^\{\*\}\)=\\frac\{1\}\{1\+R\(x\)\},\\qquad p\_\{x\}\(c\)=\\frac\{l\_\{c\}\(x\)\}\{1\+R\(x\)\}\\quad\(c\\neq c^\{\*\}\),and therefore
H\(S⟂∣Xt∗⟂=x\)=∑c≠c∗lc\(x\)1\+R\(x\)log1lc\(x\)\+log\(1\+R\(x\)\)\.H\(S^\{\\perp\}\\mid X\_\{t^\{\*\}\}^\{\\perp\}=x\)=\\sum\_\{c\\neq c^\{\*\}\}\\frac\{l\_\{c\}\(x\)\}\{1\+R\(x\)\}\\log\\frac\{1\}\{l\_\{c\}\(x\)\}\+\\log\(1\+R\(x\)\)\.Using
$$u\log\frac{1}{u}\leq\sqrt{u}\quad(0<u\leq 1),\qquad\log(1+v)\leq\sqrt{v}\quad(v\geq 0),$$
and discarding the nonpositive terms with $l_{c}(x)>1$, we obtain
H\(S⟂∣Xt∗⟂=x\)≤2∑c≠c∗lc\(x\)\.H\(S^\{\\perp\}\\mid X\_\{t^\{\*\}\}^\{\\perp\}=x\)\\leq 2\\sum\_\{c\\neq c^\{\*\}\}\\sqrt\{l\_\{c\}\(x\)\}\.
By Bayes' rule,

$$
l_{c}(x)=\frac{\pi_{c}}{\pi_{c^{*}}}\frac{\varphi_{\sigma_{*}^{2}}(x-c)}{\varphi_{\sigma_{*}^{2}}(x-c^{*})},\qquad\sigma_{*}^{2}:=t^{*}+\varepsilon^{2},
$$

where $\varphi_{\sigma_{*}^{2}}$ is the Gaussian density with covariance $\sigma_{*}^{2}I_{m}$. Writing $x=c^{*}+w$, we obtain

$$
\sqrt{l_{c}(x)}=\sqrt{\frac{\pi_{c}}{\pi_{c^{*}}}}\exp\!\left(-\frac{\|c-c^{*}\|_{2}^{2}}{4\sigma_{*}^{2}}+\frac{\langle w,c-c^{*}\rangle}{2\sigma_{*}^{2}}\right).
$$

Taking expectation over $w\sim\mathcal{N}(0,\sigma_{*}^{2}I_{m})$,

$$
\mathbb{E}\!\left[\sqrt{l_{c}(X_{t^{*}}^{\perp})}\,\middle|\,S^{\perp}=c^{*}\right]=\sqrt{\frac{\pi_{c}}{\pi_{c^{*}}}}\exp\!\left(-\frac{\|c-c^{*}\|_{2}^{2}}{8\sigma_{*}^{2}}\right).
$$

By $\delta$-separation,

$$
\|c-c^{*}\|_{2}\geq\delta\qquad(c\neq c^{*}),
$$

and hence

$$
\mathbb{E}\!\left[\sqrt{l_{c}(X_{t^{*}}^{\perp})}\,\middle|\,S^{\perp}=c^{*}\right]\leq\sqrt{\frac{\pi_{c}}{\pi_{c^{*}}}}\,e^{-\delta^{2}/(8\sigma_{*}^{2})}.
$$

Therefore

$$
\mathbb{E}\!\left[H(S^{\perp}\mid X_{t^{*}}^{\perp})\,\middle|\,S^{\perp}=c^{*}\right]\leq 2e^{-\delta^{2}/(8\sigma_{*}^{2})}\sum_{c\neq c^{*}}\sqrt{\frac{\pi_{c}}{\pi_{c^{*}}}}.
$$

Averaging over $S^{\perp}$ gives

$$
H(S^{\perp}\mid X_{t^{*}}^{\perp})\leq 2e^{-\delta^{2}/(8\sigma_{*}^{2})}\sum_{c^{*}}\sqrt{\pi_{c^{*}}}\sum_{c\neq c^{*}}\sqrt{\pi_{c}}\leq 2e^{-\delta^{2}/(8\sigma_{*}^{2})}\Big(\sum_{c\in\mathcal{C}^{\perp}}\sqrt{\pi_{c}}\Big)^{2}.
$$

Since

$$
\Big(\sum_{c\in\mathcal{C}^{\perp}}\sqrt{\pi_{c}}\Big)^{2}=e^{H_{1/2}(S^{\perp})},
$$

the claim follows.
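To make the resulting bound $H(S^{\perp}\mid X_{t^{*}}^{\perp})\leq 2\exp\big(H_{1/2}(S^{\perp})-\delta^{2}/(8\sigma_{*}^{2})\big)$ concrete, the sketch below evaluates both sides for a one-dimensional Gaussian mixture channel by trapezoidal quadrature. It is an illustration under stated assumptions: the dimension $m=1$, the atom locations, the prior weights, the grid parameters, and the function names are hypothetical choices rather than anything taken from the paper.

```python
import numpy as np


def renyi_half_entropy(priors):
    """Renyi entropy of order 1/2 in nats: 2 * log(sum_c sqrt(pi_c))."""
    priors = np.asarray(priors, dtype=float)
    return float(2.0 * np.log(np.sum(np.sqrt(priors))))


def conditional_entropy_1d(centers, priors, sigma, grid_size=40001, pad=12.0):
    """H(S | X) in nats for X = S + sigma * N(0, 1), by trapezoidal quadrature."""
    centers = np.asarray(centers, dtype=float)
    priors = np.asarray(priors, dtype=float)
    x = np.linspace(centers.min() - pad * sigma, centers.max() + pad * sigma, grid_size)
    # joint[i, k] = pi_k * N(x_i; c_k, sigma^2)
    gauss = np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2.0 * sigma**2))
    joint = priors[None, :] * gauss / np.sqrt(2.0 * np.pi * sigma**2)
    marginal = joint.sum(axis=1)
    posterior = joint / marginal[:, None]
    # Pointwise posterior entropy, using the convention 0 * log 0 = 0.
    safe = np.where(posterior > 0.0, posterior, 1.0)
    pointwise = -np.sum(posterior * np.log(safe), axis=1)
    integrand = pointwise * marginal
    return float(np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(x)))


def lemma_bound(centers, priors, sigma):
    """2 * exp(H_{1/2} - delta^2 / (8 sigma^2)), delta = minimal atom separation."""
    centers = np.asarray(centers, dtype=float)
    gaps = np.abs(centers[:, None] - centers[None, :])
    delta = gaps[gaps > 0.0].min()
    return float(2.0 * np.exp(renyi_half_entropy(priors) - delta**2 / (8.0 * sigma**2)))


centers, priors = [0.0, 3.0, 7.0], [0.5, 0.3, 0.2]  # hypothetical atoms and weights
for sigma in (0.4, 0.7, 1.0):
    h = conditional_entropy_1d(centers, priors, sigma)
    b = lemma_bound(centers, priors, sigma)
    print(f"sigma={sigma}: H(S|X)={h:.6f}  bound={b:.6f}")
```

For large noise levels the bound exceeds $\log|\mathcal{C}^{\perp}|$ and is vacuous; it becomes informative once $\delta^{2}/(8\sigma_{*}^{2})$ exceeds $H_{1/2}(S^{\perp})$ by a margin.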
Proof. Let

$$
\sigma_{*}^{2}:=t^{*}+\varepsilon^{2},\qquad H_{1/2}:=H_{1/2}(S^{\perp}).
$$

As in the proof of Theorem [9](https://arxiv.org/html/2605.05387#Thmtheorem9), the true and surrogate tangent laws at time $t$ can be written as mixtures over $S^{\perp}$, and Lemma [17](https://arxiv.org/html/2605.05387#Thmtheorem17) therefore yields

$$
\mathrm{KL}(p_{t}^{*,b}\|\hat{p}_{t}^{\,b})\leq\sum_{c,\tilde{c}}\pi_{b}(c)\pi_{x}(\tilde{c})\,\mathrm{KL}(r_{t}^{c}\|r_{t}^{\tilde{c}}).
$$

Using Assumption [5.3](https://arxiv.org/html/2605.05387#S5.Thmassumption3) at time $t=t^{*}$, we get

$$
\mathrm{KL}(r_{t^{*}}^{c}\|r_{t^{*}}^{\tilde{c}})\leq L_{t^{*}}\|c-\tilde{c}\|_{2}^{2}.
$$

Hence

$$
\mathbb{E}_{B}\!\Big[\mathrm{KL}\!\big(p_{t^{*}}^{*,B}\,\|\,\hat{p}_{t^{*}}^{\,B}\big)\Big]\leq L_{t^{*}}\,\mathbb{E}\|{S^{*}}^{\perp}-\tilde{S}^{\perp}\|_{2}^{2},
$$

where ${S^{*}}^{\perp}\sim p_{S^{\perp}}$, and conditional on ${S^{*}}^{\perp}=c^{*}$, the variable $\tilde{S}^{\perp}$ is drawn from the posterior-resampling kernel of the effective channel

$$
X_{t^{*}}^{\perp}\mid(S^{\perp}=c)\sim\mathcal{N}(c,\sigma_{*}^{2}I_{m}).
$$
Let

$$
R:=\|{S^{*}}^{\perp}-\tilde{S}^{\perp}\|_{2}.
$$

By Assumption [5.4](https://arxiv.org/html/2605.05387#S5.Thmassumption4), either $R=0$ or $R\geq\delta$. Therefore

$$
\mathbb{E}[R^{2}]=\int_{0}^{\infty}\mathbb{P}(R^{2}\geq s)\,ds=\int_{0}^{\delta^{2}}\mathbb{P}(R\geq\delta)\,ds+\int_{\delta^{2}}^{\infty}\mathbb{P}(R\geq\sqrt{s})\,ds.
$$

Applying Lemma [19](https://arxiv.org/html/2605.05387#Thmtheorem19),

$$
\mathbb{P}(R\geq r)\leq\frac{1}{2}\exp\!\Big(H_{1/2}-\frac{r^{2}}{8\sigma_{*}^{2}}\Big),
$$

we obtain

$$
\mathbb{E}[R^{2}]\leq\frac{\delta^{2}}{2}\exp\!\Big(H_{1/2}-\frac{\delta^{2}}{8\sigma_{*}^{2}}\Big)+\int_{\delta^{2}}^{\infty}\frac{1}{2}\exp\!\Big(H_{1/2}-\frac{s}{8\sigma_{*}^{2}}\Big)\,ds.
$$

Evaluating the integral yields

$$
\mathbb{E}[R^{2}]\leq\Big(\frac{\delta^{2}}{2}+4\sigma_{*}^{2}\Big)\exp\!\Big(H_{1/2}-\frac{\delta^{2}}{8\sigma_{*}^{2}}\Big),
$$

and therefore

$$
\mathbb{E}_{B}\!\Big[\mathrm{KL}\!\big(p_{t^{*}}^{*,B}\,\|\,\hat{p}_{t^{*}}^{\,B}\big)\Big]\leq L_{t^{*}}\Big(\frac{\delta^{2}}{2}+4\sigma_{*}^{2}\Big)\exp\!\Big(H_{1/2}-\frac{\delta^{2}}{8\sigma_{*}^{2}}\Big).
$$
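For completeness, the integral evaluation used in the bound on $\mathbb{E}[R^{2}]$ is elementary. Written out in the same notation, it reads:

$$
\int_{\delta^{2}}^{\infty}\frac{1}{2}\exp\!\Big(H_{1/2}-\frac{s}{8\sigma_{*}^{2}}\Big)\,ds
=\frac{1}{2}e^{H_{1/2}}\Big[-8\sigma_{*}^{2}\,e^{-s/(8\sigma_{*}^{2})}\Big]_{s=\delta^{2}}^{s=\infty}
=4\sigma_{*}^{2}\exp\!\Big(H_{1/2}-\frac{\delta^{2}}{8\sigma_{*}^{2}}\Big).
$$

Adding the first term $\frac{\delta^{2}}{2}\exp\!\big(H_{1/2}-\frac{\delta^{2}}{8\sigma_{*}^{2}}\big)$ gives the prefactor $\frac{\delta^{2}}{2}+4\sigma_{*}^{2}$.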
For the pathwise term, Theorem [4](https://arxiv.org/html/2605.05387#Thmtheorem4) gives

$$
\mathbb{E}_{B}\!\Big[\mathrm{KL}\!\big(\mathbb{P}^{Y^{*,B}}\|\mathbb{P}^{\tilde{Y}^{\,B}}\big)\Big]\leq I\!\big(Z^{\parallel};Z^{\perp}\mid X_{t^{*}}\big).
$$

Using Proposition [6](https://arxiv.org/html/2605.05387#Thmtheorem6),

$$
I\!\big(Z^{\parallel};Z^{\perp}\mid X_{t^{*}}\big)\leq I\!\big(S^{\parallel};S^{\perp}\mid X_{t^{*}}\big).
$$

Since $S^{\perp}$ is discrete, this conditional mutual information is at most the conditional entropy of $S^{\perp}$; and since $X_{t^{*}}^{\perp}$ is a function of $X_{t^{*}}$, conditioning on $X_{t^{*}}^{\perp}$ alone cannot decrease that entropy:

$$
I\!\big(S^{\parallel};S^{\perp}\mid X_{t^{*}}\big)\leq H(S^{\perp}\mid X_{t^{*}})\leq H(S^{\perp}\mid X_{t^{*}}^{\perp}).
$$

Lemma [20](https://arxiv.org/html/2605.05387#Thmtheorem20) now implies

$$
H(S^{\perp}\mid X_{t^{*}}^{\perp})\leq 2\exp\!\Big(H_{1/2}-\frac{\delta^{2}}{8\sigma_{*}^{2}}\Big).
$$

Hence

$$
\mathbb{E}_{B}\!\Big[\mathrm{KL}\!\big(\mathbb{P}^{Y^{*,B}}\|\mathbb{P}^{\tilde{Y}^{\,B}}\big)\Big]\leq 2\exp\!\Big(H_{1/2}-\frac{\delta^{2}}{8\sigma_{*}^{2}}\Big).
$$
Combining the initialization and pathwise bounds yields

$$
\begin{aligned}
\mathbb{E}_{B}\!\Big[\mathrm{KL}\!\big(p_{t^{*}}^{*,B}\,\|\,\hat{p}_{t^{*}}^{\,B}\big)+\mathrm{KL}\!\big(\mathbb{P}^{Y^{*,B}}\|\mathbb{P}^{\hat{Y}^{\,B}}\big)\Big]
&\leq L_{t^{*}}\Big(\frac{\delta^{2}}{2}+4\sigma_{*}^{2}\Big)\exp\!\Big(H_{1/2}-\frac{\delta^{2}}{8\sigma_{*}^{2}}\Big)\\
&\quad+2\exp\!\Big(H_{1/2}-\frac{\delta^{2}}{8\sigma_{*}^{2}}\Big),
\end{aligned}
$$

which is ([11](https://arxiv.org/html/2605.05387#S5.Ex44)).

Finally, the terminal tangent marginal is a measurable image of path space, so by data processing,

$$
\mathbb{E}_{B}\!\Big[\mathrm{KL}\!\big(\mu_{T-t_{0}}^{*,B}\,\|\,\hat{\mu}_{T-t_{0}}^{\,B}\big)\Big]\leq\mathbb{E}_{B}\!\Big[\mathrm{KL}\!\big(p_{t^{*}}^{*,B}\,\|\,\hat{p}_{t^{*}}^{\,B}\big)+\mathrm{KL}\!\big(\mathbb{P}^{Y^{*,B}}\|\mathbb{P}^{\hat{Y}^{\,B}}\big)\Big].
$$

This proves ([5.4](https://arxiv.org/html/2605.05387#S5.E4)).
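To see how the combined bound behaves as the separation $\delta$ grows relative to the effective noise level $\sigma_{*}$, the following sketch simply evaluates the two terms. The function names and the numerical values of $L_{t^{*}}$, $\sigma_{*}^{2}$, and $H_{1/2}$ are hypothetical placeholders chosen for illustration; they are not taken from the paper's experiments.

```python
import math


def initialization_bound(L_tstar, delta, sigma_sq, h_half):
    """L_{t*} * (delta^2 / 2 + 4 sigma^2) * exp(H_{1/2} - delta^2 / (8 sigma^2))."""
    prefactor = L_tstar * (delta**2 / 2.0 + 4.0 * sigma_sq)
    return prefactor * math.exp(h_half - delta**2 / (8.0 * sigma_sq))


def pathwise_bound(delta, sigma_sq, h_half):
    """2 * exp(H_{1/2} - delta^2 / (8 sigma^2))."""
    return 2.0 * math.exp(h_half - delta**2 / (8.0 * sigma_sq))


def total_bound(L_tstar, delta, sigma_sq, h_half):
    """Sum of the initialization and pathwise terms."""
    return initialization_bound(L_tstar, delta, sigma_sq, h_half) + pathwise_bound(
        delta, sigma_sq, h_half
    )


# Hypothetical setting: 10 equally likely atoms, so H_{1/2} = log 10.
h_half = math.log(10.0)
for delta in (2.0, 4.0, 8.0, 12.0):
    value = total_bound(L_tstar=1.0, delta=delta, sigma_sq=1.0, h_half=h_half)
    print(f"delta={delta:5.1f}  total bound={value:.3e}")
```

Both terms share the factor $\exp\big(H_{1/2}-\delta^{2}/(8\sigma_{*}^{2})\big)$, so the bound is small only when $\delta^{2}/(8\sigma_{*}^{2})$ dominates $H_{1/2}$.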
## Appendix D. DDPM implementation of the VP normal correction
This appendix records the VP/DDPM form of the normal correction used in the experiments. Consider the forward marginal

$$
X_{t}=\alpha_{t}Z+\sigma_{t}\xi,\qquad\xi\sim\mathcal{N}(0,I_{d}),
$$

and condition on $B=P_{\perp}Z=b$. Let $p_{t}^{*,b}$ be the density of $X_{t}\mid B=b$, with score $s_{t}^{*,b}=\nabla\log p_{t}^{*,b}$. The VP Tweedie identity gives

$$
s_{t}^{*,b}(x)=\frac{\alpha_{t}\mathbb{E}[Z\mid X_{t}=x,B=b]-x}{\sigma_{t}^{2}}.
$$

Projecting onto the normal space and using $P_{\perp}Z=b$ under the conditioning,

$$
P_{\perp}\mathbb{E}[Z\mid X_{t}=x,B=b]=b,
$$

we obtain

$$
P_{\perp}s_{t}^{*,b}(x)=\frac{\alpha_{t}b-P_{\perp}x}{\sigma_{t}^{2}}.
$$

Thus the normal correction used in the DDPM implementation is

$$
\frac{\alpha_{t}b-P_{\perp}x_{t}}{\sigma_{t}^{2}}.
$$
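As an illustration of this formula, the sketch below specializes to inpainting, where $P_{\perp}$ is assumed to act as an elementwise binary mask selecting the observed coordinates. The function name `vp_normal_correction`, the mask, and the values of $\alpha_{t}$ and $\sigma_{t}$ are hypothetical choices, not the paper's implementation.

```python
import numpy as np


def vp_normal_correction(x_t, b, mask, alpha_t, sigma_t):
    """Closed-form normal score (alpha_t * b - P_perp x_t) / sigma_t^2.

    Assumes P_perp is a coordinate projection given by a 0/1 mask and that
    b already lies in its range (b is zero on unobserved coordinates).
    """
    return (alpha_t * b - mask * x_t) / sigma_t**2


rng = np.random.default_rng(0)
z = rng.normal(size=8)                                    # clean signal (illustrative)
mask = np.array([1, 1, 0, 0, 1, 0, 0, 1], dtype=float)    # 1 = observed coordinate
b = mask * z                                              # observation b = P_perp z
alpha_t, sigma_t = 0.8, 0.6                               # hypothetical VP coefficients
x_t = alpha_t * z + sigma_t * rng.normal(size=8)          # forward-noised sample

correction = vp_normal_correction(x_t, b, mask, alpha_t, sigma_t)
print(correction)
```

On observed coordinates the correction is the score of $\mathcal{N}(\alpha_{t}b,\sigma_{t}^{2})$ evaluated at $x_{t}$, and it vanishes on unobserved coordinates, so it never touches the tangent directions.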
We next relate this expression to the DDNM-style projected denoising update. For a pretrained VP/DDPM model, the usual Tweedie denoiser is

$$
\hat{z}_{0}(x_{t})=\frac{x_{t}+\sigma_{t}^{2}s_{t}(x_{t})}{\alpha_{t}}.
$$

DDNM replaces the normal component of this denoised estimate by the observed level $b$:

$$
\tilde{z}_{0}(x_{t};y)=P_{\parallel}\hat{z}_{0}(x_{t})+b.
$$

The score associated with this projected denoiser is obtained by inverting Tweedie's formula:

$$
\hat{s}_{t}^{\rm DDNM}(x_{t};y)=\frac{\alpha_{t}\tilde{z}_{0}(x_{t};y)-x_{t}}{\sigma_{t}^{2}}.
$$

Substituting the expression for $\tilde{z}_{0}$,

$$
\begin{aligned}
\hat{s}_{t}^{\rm DDNM}(x_{t};y)
&=\frac{\alpha_{t}P_{\parallel}\hat{z}_{0}(x_{t})+\alpha_{t}b-x_{t}}{\sigma_{t}^{2}}\\
&=\frac{P_{\parallel}x_{t}+\sigma_{t}^{2}P_{\parallel}s_{t}(x_{t})+\alpha_{t}b-P_{\parallel}x_{t}-P_{\perp}x_{t}}{\sigma_{t}^{2}}\\
&=P_{\parallel}s_{t}(x_{t})+\frac{\alpha_{t}b-P_{\perp}x_{t}}{\sigma_{t}^{2}}.
\end{aligned}
$$

Therefore the DDPM/DDIM implementation used in the experiments applies the closed-form VP normal correction together with the pretrained tangent score.
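The algebraic identity above can be verified numerically for an arbitrary orthogonal projector. The following sketch is illustrative only: the random projector, the dimensions, and the random vector standing in for the pretrained score $s_{t}(x_{t})$ are assumptions, and the function names are hypothetical.

```python
import numpy as np


def ddnm_score(x_t, s_t, b, P_par, alpha_t, sigma_t):
    """Score obtained by inverting Tweedie on the DDNM-projected denoiser."""
    z_hat = (x_t + sigma_t**2 * s_t) / alpha_t      # Tweedie denoiser
    z_tilde = P_par @ z_hat + b                     # replace normal component by b
    return (alpha_t * z_tilde - x_t) / sigma_t**2


def split_score(x_t, s_t, b, P_par, P_perp, alpha_t, sigma_t):
    """Tangent part of the pretrained score plus the closed-form normal correction."""
    return P_par @ s_t + (alpha_t * b - P_perp @ x_t) / sigma_t**2


rng = np.random.default_rng(1)
d, m = 6, 2                                         # ambient and normal dimensions
Q, _ = np.linalg.qr(rng.normal(size=(d, m)))        # orthonormal basis of normal space
P_perp = Q @ Q.T
P_par = np.eye(d) - P_perp

alpha_t, sigma_t = 0.9, 0.5                         # hypothetical VP coefficients
x_t = rng.normal(size=d)
s_t = rng.normal(size=d)                            # stand-in for a pretrained score
b = P_perp @ rng.normal(size=d)                     # observation in the range of P_perp

lhs = ddnm_score(x_t, s_t, b, P_par, alpha_t, sigma_t)
rhs = split_score(x_t, s_t, b, P_par, P_perp, alpha_t, sigma_t)
max_abs_diff = float(np.max(np.abs(lhs - rhs)))
print("max |DDNM score - split score| =", max_abs_diff)
```

The two expressions agree up to floating-point error for any choice of `s_t`, which reflects that the identity is purely algebraic and does not depend on the quality of the pretrained score.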
## References
- Cheng et al. (2018) Xiang Cheng, Niladri S Chatterji, Peter L Bartlett, and Michael I Jordan. Underdamped langevin mcmc: A non-asymptotic analysis. In *Conference on learning theory*, pages 300–323. PMLR, 2018.
- Choi et al. (2021) Jooyoung Choi, Sungwon Kim, Yonghyun Jeong, Youngjune Gwon, and Sungroh Yoon. Ilvr: Conditioning method for denoising diffusion probabilistic models. *arXiv preprint arXiv:2108.02938*, 2021.
- Chung et al. (2022) Hyungjin Chung, Jeongsol Kim, Michael T Mccann, Marc L Klasky, and Jong Chul Ye. Diffusion posterior sampling for general noisy inverse problems. *arXiv preprint arXiv:2209.14687*, 2022.
- Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*, pages 248–255, 2009. doi:10.1109/CVPR.2009.5206848.
- Dhariwal and Nichol (2021) Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. *Advances in neural information processing systems*, 34:8780–8794, 2021.
- Didi et al. (2023) Kieran Didi, Francisco Vargas, Simon V Mathis, Vincent Dutordoir, Emile Mathieu, Urszula J Komorowska, and Pietro Lio. A framework for conditional diffusion modelling with applications in motif scaffolding for protein design. *arXiv preprint arXiv:2312.09236*, 2023.
- Dou and Song (2024) Zehao Dou and Yang Song. Diffusion posterior sampling for linear inverse problem solving: A filtering perspective. In *The Twelfth International Conference on Learning Representations*, 2024.
- Durrett (2019) Rick Durrett. *Probability: theory and examples*, volume 49. Cambridge university press, 2019.
- Fan et al. (2023) Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, Kangwook Lee, and Kimin Lee. Dpok: Reinforcement learning for fine-tuning text-to-image diffusion models. *Advances in Neural Information Processing Systems*, 36:79858–79885, 2023.
- Guo et al. (2005) Dongning Guo, Shlomo Shamai, and Sergio Verdú. Mutual information and minimum mean-square error in gaussian channels. *IEEE Transactions on Information Theory*, 51(4):1261–1282, 2005. doi:10.1109/TIT.2005.844072. URL [https://arxiv.org/abs/cs/0412108](https://arxiv.org/abs/cs/0412108).
- Guo et al. (2026) Zhengyi Guo, Wenpin Tang, and Renyuan Xu. Conditional diffusion guidance under hard constraint: A stochastic analysis approach. *arXiv preprint arXiv:2602.05533*, 2026.
- Ho and Salimans (2022) Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. *arXiv preprint arXiv:2207.12598*, 2022.
- Ho et al. (2020) Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. *Advances in neural information processing systems*, 33:6840–6851, 2020.
- Karatzas and Shreve (2014) Ioannis Karatzas and Steven Shreve. *Brownian motion and stochastic calculus*. Springer, 2014.
- Karras et al. (2018) Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. In *International Conference on Learning Representations (ICLR)*, 2018. URL [https://arxiv.org/abs/1710.10196](https://arxiv.org/abs/1710.10196).
- Kawar et al. (2022) Bahjat Kawar, Michael Elad, Stefano Ermon, and Jiaming Song. Denoising diffusion restoration models. *Advances in neural information processing systems*, 35:23593–23606, 2022.
- Lamperski (2021) Andrew Lamperski. Projected stochastic gradient langevin algorithms for constrained sampling and non-convex learning. In *Conference on Learning Theory*, pages 2891–2937. PMLR, 2021.
- Leimkuhler and Matthews (2013) Benedict Leimkuhler and Charles Matthews. Robust and efficient configurational molecular sampling via langevin dynamics. *The Journal of chemical physics*, 138(17), 2013.
- Liang et al. (2025) Yuchen Liang, Peizhong Ju, Yingbin Liang, and Ness Shroff. Theory on score-mismatched diffusion models and zero-shot conditional samplers. In *The Thirteenth International Conference on Learning Representations*, 2025.
- Lugmayr et al. (2022) Andreas Lugmayr, Martin Danelljan, Antonio Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. Repaint: Inpainting using denoising diffusion probabilistic models. In *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)*, pages 11451–11461, 2022. doi:10.1109/CVPR52688.2022.01117. URL [https://arxiv.org/abs/2201.09865](https://arxiv.org/abs/2201.09865).
- Meng et al. (2021) Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Guided image synthesis and editing with stochastic differential equations. *arXiv preprint arXiv:2108.01073*, 2021.
- Song et al. (2020a) Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. *arXiv preprint arXiv:2010.02502*, 2020a.
- Song and Ermon (2019) Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. *Advances in neural information processing systems*, 32, 2019.
- Song et al. (2020b) Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. *arXiv preprint arXiv:2011.13456*, 2020b.
- Song et al. (2021) Yang Song, Liyue Shen, Lei Xing, and Stefano Ermon. Solving inverse problems in medical imaging with score-based generative models. *arXiv preprint arXiv:2111.08005*, 2021.
- Uehara et al. (2024) Masatoshi Uehara, Yulai Zhao, Tommaso Biancalani, and Sergey Levine. Understanding reinforcement learning-based fine-tuning of diffusion models: A tutorial and review. *arXiv preprint arXiv:2407.13734*, 2024.
- Wang et al. (2022) Yinhuai Wang, Jiwen Yu, and Jian Zhang. Zero-shot image restoration using denoising diffusion null-space model. *arXiv preprint arXiv:2212.00490*, 2022.
- Wu et al. (2023) Luhuan Wu, Brian Trippe, Christian Naesseth, David Blei, and John P Cunningham. Practical and asymptotically exact conditional sampling in diffusion models. *Advances in Neural Information Processing Systems*, 36:31372–31403, 2023.
- Yu et al. (2015) Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. *arXiv preprint arXiv:1506.03365*, 2015. URL [https://arxiv.org/abs/1506.03365](https://arxiv.org/abs/1506.03365).
- Zhao et al. (2025) Yulai Zhao, Masatoshi Uehara, Gabriele Scalia, Sunyuan Kung, Tommaso Biancalani, Sergey Levine, and Ehsan Hajiramezanali. Adding conditional control to diffusion models with reinforcement learning. In *The Thirteenth International Conference on Learning Representations*, 2025.
- Zhou et al. (2024) Linqi Zhou, Aaron Lou, Samar Khanna, and Stefano Ermon. Denoising diffusion bridge models. In *The Twelfth International Conference on Learning Representations*, 2024.