A Mathematical Introduction to Diffusion Models

arXiv cs.LG 07/03/26, 04:00 AM Papers
Summary
This paper provides a proof-oriented introduction to diffusion models, covering Langevin dynamics, score-based models, discretization, discrete diffusion, and inference-time control, intended for graduate students.
arXiv:2607.01693v1 Announce Type: new Abstract: These notes give a proof-oriented introduction to diffusion models from the viewpoint of sampling, tracing a single arc from classical sampling dynamics to modern diffusion samplers, their error analysis, and inference-time control. Throughout, the material is layered into core definitions and identities proved in full, representative estimates proved under simplifying assumptions, and research-level theorems stated with a proof roadmap. The intended audience is beginning graduate students with a background in probability but no prior exposure to stochastic differential equations, stochastic numerics, or diffusion models.
# A Mathematical Introduction to Diffusion Models
Source: [https://arxiv.org/html/2607.01693](https://arxiv.org/html/2607.01693)
Jianfeng Lu

\(Date: July 2nd, 2026\)

###### Abstract\.

These notes give a proof\-oriented introduction to diffusion models from the viewpoint of sampling, tracing a single arc from classical sampling dynamics to modern diffusion samplers, their error analysis, and inference\-time control\. They proceed in five movements\. The first develops the sampling language and the Langevin toolkit—target distributions and discrepancies, Markov kernels, Fokker–Planck evolution, and entropy dissipation—and turns it into convergence guarantees for Langevin diffusion, the unadjusted Langevin algorithm, and its Metropolis\-adjusted correction\. The second builds continuous\-time score\-based diffusion models through Gaussian noising, Tweedie’s identity, the reverse\-time SDE, and the probability flow ODE, and revisits the same Gaussian channel through stochastic localization and Polchinski flow\. The third discretizes these continuous dynamics into implementable DDPM samplers— exact and Gaussian reverse kernels, the denoising\-score equivalence—and carries out a sampling\-error analysis that separates early stopping, KL telescoping, and score error, treated across Euler–Maruyama, Hessian control, and high\-accuracy first\-order rejection sampling\. The fourth develops discrete diffusion on finite state spaces, where continuous\-time Markov chains replace the noising SDE and the reverse kernel and its error analysis are recast in finite\-state form\. The fifth turns to inference\-time steering of a trained model: guidance, reward tilting, path\-space control, and inference\-time reinforcement learning\. Throughout, the material is layered into core definitions and identities proved in full, representative estimates proved under simplifying assumptions, and research\-level theorems stated with a proof roadmap\. The intended audience is beginning graduate students with a background in probability but no prior exposure to stochastic differential equations, stochastic numerics, or diffusion models\.

Lecture notes for the John Tukey Summer Graduate School on Mathematics of Generative Models at SLMath \(June 22nd, 2026 – July 2nd, 2026\), co\-organized and co\-taught with Eric Vanden\-Eijnden\. We thank SLMath for the hospitality and students in the summer school for helpful feedback\. The work is supported in part by the National Science Foundation via awards DMS\-2309378 and IIS\-2403276\.

###### Contents

1. [Introduction to Sampling and Langevin Dynamics](https://arxiv.org/html/2607.01693#S1)
2. [Convergence of Langevin Diffusion and ULA](https://arxiv.org/html/2607.01693#S2)
3. [Score-Based Diffusion Models](https://arxiv.org/html/2607.01693#S3)
4. [Stochastic Localization and Polchinski Flow](https://arxiv.org/html/2607.01693#S4)
5. [Discretizing Continuous Diffusion Models](https://arxiv.org/html/2607.01693#S5)
6. [Error Analysis for Diffusion Models](https://arxiv.org/html/2607.01693#S6)
7. [Discrete Diffusion Models](https://arxiv.org/html/2607.01693#S7)
8. [Guidance, Reward Tilting, and Inference-Time RL](https://arxiv.org/html/2607.01693#S8)
9. [Appendix A. Itô Calculus and Girsanov Theorem](https://arxiv.org/html/2607.01693#A1)
10. [Appendix B. Gaussian Toolbox](https://arxiv.org/html/2607.01693#A2)
11. [References](https://arxiv.org/html/2607.01693#bib)

### How to read these notes

The intended audience is beginning graduate students who have seen probability, linear algebra, and multivariable calculus, but may not have prior exposure to stochastic differential equations or diffusion models\. The notes therefore separate three levels of material\.

- *Core definitions and identities* are proved in detail. These run through every section and include the Fokker–Planck equation and entropy dissipation, Tweedie’s identity and the reverse SDE, the probability flow ODE, the exact and Gaussian DDPM reverse kernels, the denoising-score equivalence, the finite-state reverse kernel of discrete diffusion, and the reward-tilting identity.
- *Representative estimates* are proved under simplifying assumptions, to show a mechanism in its cleanest form. For example, we prove KL contraction under a log-Sobolev inequality, compute the ULA bias exactly for a Gaussian target, and bound the one-step discretization and score-error contributions to the diffusion-model KL.
- *Research-level theorems* are stated in simplified form with a proof roadmap. These include the ULA and MALA convergence guarantees, high-accuracy diffusion sampling via first-order rejection sampling, and the discrete-diffusion error analysis. The goal is not to reproduce every technical lemma of the original papers, but to make clear what each theorem is saying, why the assumptions appear, and how the proof is organized.

Two appendices collect the analytic tools used repeatedly: Itô calculus and the Girsanov change-of-measure formula in Appendix [A](https://arxiv.org/html/2607.01693#A1), and the Gaussian identities in Appendix [B](https://arxiv.org/html/2607.01693#A2). A reader comfortable with these may skip them and refer back as needed.

These notes are self-contained but deliberately selective, and they overlap with several excellent treatments that a reader may wish to consult in parallel. Yuansi Chen’s lecture notes for the ETH course *Computational and Statistical Aspects of Diffusion Models* \[[15](https://arxiv.org/html/2607.01693#bib.bib15)\] cover much of the same ground with a stronger emphasis on convergence proofs, guidance, and discrete diffusion, and are a useful companion to the later sections here. For the Langevin and log-concave sampling side, Chewi’s book \[[18](https://arxiv.org/html/2607.01693#bib.bib18)\] is the standard reference. For stochastic calculus and numerical analysis of SDEs and CTMCs, we refer to the book \[[21](https://arxiv.org/html/2607.01693#bib.bib21)\] by E, Li, and Vanden-Eijnden. Throughout, when a result first appeared in a research paper we cite the original; the lecture notes above are flagged here once, as a unified entry point to several of the topics that follow.

These notes do not try to survey the full modern generative-modeling landscape. We focus on diffusion and Langevin mechanisms for which reverse-time SDEs, probability-flow ODEs, score-estimation identities, and inference-time control are the central actors. The notes treat learned-score error as an input to sampler guarantees, but do not discuss statistical learning theory for estimating scores from finite data. We also do not give detailed treatments of likelihood-based normalizing flows, adversarial and variational models, autoregressive architectures, large-scale latent-diffusion engineering, or several newer continuous-time alternatives. For entry points to the latter, see flow matching \[[34](https://arxiv.org/html/2607.01693#bib.bib34)\], rectified flows \[[35](https://arxiv.org/html/2607.01693#bib.bib35)\], stochastic interpolants \[[1](https://arxiv.org/html/2607.01693#bib.bib1)\], consistency models \[[43](https://arxiv.org/html/2607.01693#bib.bib43)\], and recent one-step formulations such as mean flow \[[23](https://arxiv.org/html/2607.01693#bib.bib23)\]. These methods share much of the same language of probability paths, transport equations, denoisers, and ODE/SDE samplers.

### Minimal probability background

We use the following conventions repeatedly. If $X$ is a random variable, then $\operatorname{Law}(X)$ denotes its distribution. If $X$ has density $p$, expectations are written either as $\mathbb{E}[f(X)]$ or $\int f(x)p(x)\,\mathrm{d}x$. A conditional expectation $\mathbb{E}[Y\mid X=x]$ is best thought of as the best prediction of $Y$ as a function of the observed value $x$. For a Markov chain, $P(x,\mathrm{d}y)$ denotes the distribution of the next state given the current state $x$. For an SDE, all formal differentiations can be justified under smoothness and decay assumptions; the notes use this formal calculus as a way to keep the main ideas visible.

We often use the same symbol for a probability law and its density when the meaning is clear from context. For example, $p_{\mathsf{data}}$ may denote the data law, the density of that law, or the measure $p_{\mathsf{data}}(\mathrm{d}x)$. Similarly, $p_t$ may denote either $\operatorname{Law}(X_t)$ or its density. This common abuse of notation keeps the formulas readable, but when a density value is needed we write expressions such as $\nabla\log p_t(x)$.

### Recurring notation

The following symbols are used across several sections\. More local notation is introduced where it is needed\.

### Algorithm acronyms

The following acronyms name algorithms or algorithmic discretizations used in several sections\.

### AI usage disclosure

Large\-language\-model tools were used in the preparation of these notes to help with drafting, editing, reorganization, and consistency checks\. The mathematical content, exposition choices, references, and any remaining errors are the responsibility of the author\.

## 1. Introduction to Sampling and Langevin Dynamics

Sampling is the problem of producing random points that look as if they were drawn from a given distribution\. This is easy to state but often hard to do: the distribution may be known only up to a normalizing constant, or only through data, and drawing from it directly may be infeasible\. The strategy running through these notes is indirect\. Rather than sample the hard target in one shot, we build a random process that is easy to simulate and whose distribution gradually drifts toward the target, and we read off a sample once the process has run long enough\. Making this idea precise calls for one habit above all: instead of following a single trajectory, track the whole distribution as it moves through the algorithm\. This first section deliberately goes slowly, fixing that habit and the language we use to measure how far one distribution is from another\.

### 1.1. The sampling problem

Let $\pi$ be a probability distribution on $\mathbb{R}^d$. In most applications, $\pi$ is not available through exact samples. Instead we may have access to:

- the unnormalized density $\pi(x)\propto e^{-U(x)}$;
- the gradient $\nabla\log\pi(x)=-\nabla U(x)$;
- samples from a noisy distribution related to $\pi$.

The goal is to construct a random variable $\widehat{X}$ with law $\widehat{\pi}$ such that $D(\widehat{\pi},\pi)\leq\varepsilon$, where $D$ is an appropriate discrepancy. There is no single best choice of $D$. Let $\mu$ and $\nu$ be two probability laws on $\mathbb{R}^d$. We recall three basic discrepancies.

1. Total variation asks whether every event has nearly the right probability:

   $$D_{\mathsf{TV}}(\mu,\nu)=\sup_{A}\left\lvert\mu(A)-\nu(A)\right\rvert=\frac{1}{2}\int\left\lvert\frac{\mathrm{d}\mu}{\mathrm{d}\lambda}-\frac{\mathrm{d}\nu}{\mathrm{d}\lambda}\right\rvert\mathrm{d}\lambda.$$

   Here $\lambda$ is any measure such that both $\mu$ and $\nu$ are absolutely continuous with respect to $\lambda$, written $\mu,\nu\ll\lambda$; for example, we may take $\lambda=\mu+\nu$. The notation $\mathrm{d}\mu/\mathrm{d}\lambda$ denotes the Radon–Nikodym derivative, the density of $\mu$ with respect to $\lambda$, characterized by $\mu(A)=\int_{A}\frac{\mathrm{d}\mu}{\mathrm{d}\lambda}\,\mathrm{d}\lambda$ for measurable sets $A$. The same applies to $\mathrm{d}\nu/\mathrm{d}\lambda$, and the value of the total-variation distance does not depend on which such reference measure $\lambda$ is chosen.
2. Wasserstein-$2$ distance asks whether probability mass can be transported a short geometric distance:

   $$W_2^2(\mu,\nu)=\inf_{\gamma\in\Pi(\mu,\nu)}\int\left\lVert x-y\right\rVert^2\,\gamma(\mathrm{d}x,\mathrm{d}y).$$

   Here $\Pi(\mu,\nu)$ is the set of couplings of $\mu$ and $\nu$, meaning probability measures $\gamma$ on $\mathbb{R}^d\times\mathbb{R}^d$ whose first marginal is $\mu$ and whose second marginal is $\nu$.
3. KL divergence asks whether $\mu$ can be encoded efficiently using $\nu$ as a reference:

   $$D_{\mathsf{KL}}(\mu\|\nu)=\int\log\!\left(\frac{\mathrm{d}\mu}{\mathrm{d}\nu}\right)\mathrm{d}\mu,$$

   when $\mu$ is absolutely continuous with respect to $\nu$, and $+\infty$ otherwise. Note that the KL divergence is asymmetric: in general, $D_{\mathsf{KL}}(\mu\|\nu)\neq D_{\mathsf{KL}}(\nu\|\mu)$. In particular, KL is not a metric.

KL will often be the main bookkeeping divergence in this course. The other discrepancies translate this information into statements about events, geometric displacement, or weak test functions. The basic comparison begins with the Csiszár–Kullback–Pinsker inequality; see, for example, Bakry, Gentil, and Ledoux \[[6](https://arxiv.org/html/2607.01693#bib.bib6)\]:

$$D_{\mathsf{TV}}(\mu,\nu)\leq\sqrt{\frac{1}{2}D_{\mathsf{KL}}(\mu\|\nu)}.$$

Thus a KL guarantee immediately gives a TV guarantee, and hence also a weak test-function guarantee, since $\left\lvert\int f\,\mathrm{d}\mu-\int f\,\mathrm{d}\nu\right\rvert\leq 2\lVert f\rVert_{\infty}D_{\mathsf{TV}}(\mu,\nu)$ for bounded $f$. Another recurring rule is *data processing*: applying the same observation map to two random objects cannot increase KL. If $T$ is measurable and $T_{\#}\mu$ denotes the law of $T(X)$ for $X\sim\mu$, then

$$D_{\mathsf{KL}}(T_{\#}\mu\|T_{\#}\nu)\leq D_{\mathsf{KL}}(\mu\|\nu).$$

The same monotonicity holds for total variation. We use this principle repeatedly to pass from path-space comparisons to endpoint laws; a proof is given in Lemma [A.1](https://arxiv.org/html/2607.01693#A1.Thmtheorem1).
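As a quick sanity check of the two inequalities above, the following sketch evaluates TV and KL for two hand-picked laws on a four-point space and for their images under a coarse-graining map. The vectors `mu` and `nu`, the map `labels`, and all helper names are illustrative assumptions, not objects from the notes.

```python
import numpy as np

def tv(p, q):
    """Total variation between two probability vectors on the same finite set."""
    return 0.5 * np.abs(p - q).sum()

def kl(p, q):
    """KL(p || q) for probability vectors; +inf if p is not dominated by q."""
    mask = p > 0
    if np.any(q[mask] == 0):
        return np.inf
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def pushforward(p, labels, k):
    """Law of T(X) for X ~ p, where T sends state i to labels[i] in {0,...,k-1}."""
    return np.bincount(labels, weights=p, minlength=k)

# Hypothetical example laws on {0, 1, 2, 3}.
mu = np.array([0.5, 0.2, 0.2, 0.1])
nu = np.array([0.25, 0.25, 0.25, 0.25])

# Hypothetical observation map T: merge {0, 1} -> 0 and {2, 3} -> 1.
labels = np.array([0, 0, 1, 1])
mu_T = pushforward(mu, labels, 2)
nu_T = pushforward(nu, labels, 2)

tv_val, kl_val = tv(mu, nu), kl(mu, nu)
print("TV:", tv_val, " Pinsker bound:", np.sqrt(0.5 * kl_val))
print("KL before / after T:", kl_val, kl(mu_T, nu_T))
print("TV before / after T:", tv_val, tv(mu_T, nu_T))
```

The map `T` discards information (it forgets which state inside each merged pair occurred), which is why both discrepancies can only shrink.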

Wasserstein distance is more geometric. If $\mu$ and $\nu$ are supported on a set of diameter $R$, then a maximal coupling gives

$$W_2^2(\mu,\nu)\leq R^2 D_{\mathsf{TV}}(\mu,\nu)\leq R^2\sqrt{\frac{1}{2}D_{\mathsf{KL}}(\mu\|\nu)}.$$

Without a bounded-diameter or moment/functional-inequality assumption, there is no universal comparison in either direction: small $W_2$ need not imply small TV or KL, and small TV or KL need not control $W_2$ if a tiny amount of mass can move very far away.

A more powerful comparison is available relative to a fixed reference law. A probability law $\nu$ satisfies a Talagrand $T_2$ transport-entropy inequality with constant $C_T$ if

$$W_2^2(\mu,\nu)\leq 2C_T D_{\mathsf{KL}}(\mu\|\nu)\qquad\text{for all }\mu\ll\nu.$$

One standard route to this inequality is through log-Sobolev. In the convention of these notes, if $\nu$ satisfies

$$D_{\mathsf{KL}}(\mu\|\nu)\leq\frac{C_{\mathsf{LSI}}}{2}\operatorname{FI}(\mu\|\nu),$$

where

$$\operatorname{FI}(\mu\|\nu)=\int\left\lVert\nabla\log\frac{\mathrm{d}\mu}{\mathrm{d}\nu}\right\rVert^2\mathrm{d}\mu$$

when the derivative is well defined, then the Otto–Villani theorem gives the transport bound

$$W_2^2(\mu,\nu)\leq 2C_{\mathsf{LSI}}D_{\mathsf{KL}}(\mu\|\nu).$$

This is one reason log-Sobolev and transport inequalities appear naturally in sampling theory; see Villani \[[47](https://arxiv.org/html/2607.01693#bib.bib47)\] and Bakry, Gentil, and Ledoux \[[6](https://arxiv.org/html/2607.01693#bib.bib6)\].

###### Example (Three distances see different things).

Let $\mu=\delta_0$ and $\nu=\delta_\epsilon$ on $\mathbb{R}$. Then $D_{\mathsf{TV}}(\mu,\nu)=1$ for every $\epsilon\neq 0$, because the two point masses live on disjoint sets. On the other hand, $W_2(\mu,\nu)=\lvert\epsilon\rvert$ and $D_{\mathsf{KL}}(\mu\|\nu)=D_{\mathsf{KL}}(\nu\|\mu)=+\infty$. Thus TV and KL are too strict for comparing a point mass to a tiny displacement of itself, while Wasserstein sees that the two laws are geometrically close.

### 1.2. Markov chain Monte Carlo

Markov chain Monte Carlo (MCMC) turns sampling into the problem of running a Markov chain. Instead of drawing directly from $\pi$, we choose a transition rule whose repeated application should have $\pi$ as its equilibrium law. A Markov kernel $P$ maps a current state $x$ to a distribution $P(x,\cdot)$. If $\pi P=\pi$, then $\pi$ is invariant, and the ideal MCMC chain runs

$$X_{n+1}\sim P(X_n,\cdot),\qquad n=0,1,\ldots,$$

with the hope that $\operatorname{Law}(X_n)$ approaches $\pi$ as $n\to\infty$.

This viewpoint immediately separates two questions. First, is the transition kernel designed so that the right law is invariant? Second, how long must the chain run before it is close to equilibrium? For a genuine metric $D$, such as TV or $W_2$, a typical algorithmic decomposition is

$$D(\operatorname{Law}(X_n),\pi)\leq\underbrace{D(\operatorname{Law}(X_n),\pi_h)}_{\text{mixing to the algorithm's invariant law}}+\underbrace{D(\pi_h,\pi)}_{\text{bias from approximation}},$$

where $\pi_h$ is the invariant distribution of the implementable algorithm. This is a triangle-inequality argument, so it applies to genuine metrics but not to KL. In general there is no inequality of the form $D_{\mathsf{KL}}(\mu\|\rho)\leq D_{\mathsf{KL}}(\mu\|\nu)+D_{\mathsf{KL}}(\nu\|\rho)$.

This decomposition separates the error caused by not running the algorithm long enough from the error caused by using a transition rule whose invariant law is not exactly $\pi$. For Langevin sampling, that second error appears when a continuous diffusion is replaced by a time-stepping rule. The discretization may lead to an implementable chain with $\pi_h\neq\pi$. In exact MCMC methods, such as Metropolis-adjusted Langevin, the transition is constructed so that $\pi_h=\pi$.

###### Definition 1.1 (Invariant and reversible kernels).

A probability distribution $\pi$ is *invariant* for $P$ if

$$\int\pi(\mathrm{d}x)P(x,A)=\pi(A)\qquad\text{for every measurable set }A.$$

It is *reversible* for $P$ if the detailed balance identity

$$\pi(\mathrm{d}x)P(x,\mathrm{d}y)=\pi(\mathrm{d}y)P(y,\mathrm{d}x)$$

holds as measures on pairs $(x,y)$. Reversibility implies invariance by integrating both sides over $x$ and using $\int P(y,\mathrm{d}x)=1$.

#### Caution: invariance is not convergence\.

Invariance is only a fixed-point statement: if $X_0\sim\pi$, then one step of the chain still has law $\pi$. It does not by itself say that a chain started from another law will approach $\pi$. For example, the identity kernel $P(x,\cdot)=\delta_x$ leaves every distribution invariant, but it never mixes. Even uniqueness of the invariant law is not quite enough without excluding periodic behavior: on the two-point space $\{0,1\}$, the deterministic flip $0\mapsto 1$, $1\mapsto 0$ has the uniform distribution as an invariant reversible law, but a chain started from $0$ oscillates forever.

Thus detailed balance is a convenient way to verify invariance, not a convergence theorem. To justify MCMC, one also needs an ergodicity mechanism, such as irreducibility and aperiodicity in finite state spaces, or the appropriate Harris recurrence and minorization conditions in general state spaces. Quantitative sampling bounds then add still more structure: spectral gaps, conductance, contraction, or functional inequalities. See Meyn and Tweedie \[[37](https://arxiv.org/html/2607.01693#bib.bib37)\] for a classical treatment of these Markov-chain conditions.
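The two-point flip chain is small enough to run by hand. The sketch below checks invariance and detailed balance for the uniform law, then tracks the TV distance to it from a point-mass start. The lazy kernel `P_lazy` is an illustrative addition, not part of the notes; it is one standard way to remove periodicity.

```python
import numpy as np

P_flip = np.array([[0.0, 1.0],
                   [1.0, 0.0]])      # deterministic flip 0 -> 1, 1 -> 0
pi = np.array([0.5, 0.5])           # uniform law on {0, 1}

# Invariance: pi P = pi.
invariant = np.allclose(pi @ P_flip, pi)

# Detailed balance: pi_i P_ij = pi_j P_ji, i.e. the flow matrix is symmetric.
flow = pi[:, None] * P_flip
reversible = np.allclose(flow, flow.T)

def tv_trajectory(P, law0, n_steps):
    """TV distance to pi of Law(X_n) for n = 1..n_steps, started from law0."""
    law, out = law0.copy(), []
    for _ in range(n_steps):
        law = law @ P
        out.append(0.5 * np.abs(law - pi).sum())
    return np.array(out)

delta0 = np.array([1.0, 0.0])        # chain started at state 0
tv_flip = tv_trajectory(P_flip, delta0, 6)

# Hypothetical fix: the lazy chain stays put with probability 1/2.
P_lazy = 0.5 * (np.eye(2) + P_flip)
tv_lazy = tv_trajectory(P_lazy, delta0, 6)

print("invariant:", invariant, " reversible:", reversible)
print("TV to pi, flip chain:", tv_flip)
print("TV to pi, lazy chain:", tv_lazy)
```

The flip chain keeps the same TV distance at every step even though the uniform law is invariant and reversible, while the lazy chain reaches the uniform law immediately.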

A common design principle for MCMC is therefore to start one level above the implementable chain. First design a continuous-time Markov process whose invariant law is $\pi$ and whose convergence mechanism can be analyzed. Then discretize that process to obtain a computable Markov chain, possibly adding a Metropolis correction if one wants to remove discretization bias. Langevin dynamics is the canonical example of this principle: the continuous diffusion has $\pi$ as its invariant law, while ULA and MALA are two different ways to turn that diffusion into an algorithm.

### 1.3. Overdamped Langevin diffusion

Assume $\pi(x)\propto e^{-U(x)}$ with $U:\mathbb{R}^d\to\mathbb{R}$ smooth. The overdamped Langevin diffusion is the continuous-time Markov process

$$\mathrm{d}X_t=-\nabla U(X_t)\,\mathrm{d}t+\sqrt{2}\,\mathrm{d}B_t=\nabla\log\pi(X_t)\,\mathrm{d}t+\sqrt{2}\,\mathrm{d}B_t.\tag{1.1}$$

The drift $-\nabla U=\nabla\log\pi$ points toward regions where $\pi$ is larger, much as gradient descent moves toward lower potential. The diffusion coefficient in the Brownian term is tuned so that the ensemble keeps precisely the spread prescribed by $\pi$.

This is the basic difference between optimization and sampling. Optimization asks for a point $x_\star$ with small objective value, whereas sampling asks for a law. If $\pi$ is bimodal, with half of its mass near each of two separated modes, an optimizer may correctly return one mode, but a sampler must visit both modes with the correct frequencies.

There are two complementary ways to read ([1.1](https://arxiv.org/html/2607.01693#S1.E1)). Pathwise, each particle is pulled down the potential landscape by $-\nabla U$ while Brownian motion keeps the ensemble spread out. Distributionally, the density $q_t$ of $X_t$ evolves by a deterministic PDE. The corresponding Fokker–Planck equation is

$$\partial_t q_t=\nabla\cdot(q_t\nabla U)+\Delta q_t=\nabla\cdot\!\left(q_t\nabla\log\frac{q_t}{\pi}\right).\tag{1.2}$$

At this stage the point of ([1.2](https://arxiv.org/html/2607.01693#S1.E2)) is simply that the stochastic particle system has a deterministic law-level description. The SDE picture is good for intuition and simulation. The PDE picture is good for proving invariance and entropy decay.

To derive the Fokker–Planck equation, we recall Itô calculus in the form needed for our calculation; more details can be found in Appendix [A](https://arxiv.org/html/2607.01693#A1). If

$$\mathrm{d}X_t=b(X_t)\,\mathrm{d}t+\sigma\,\mathrm{d}B_t$$

with constant diffusion matrix $\sigma$, then for a smooth test function $\varphi$,

$$\mathrm{d}\varphi(X_t)=\left\langle\nabla\varphi(X_t),b(X_t)\right\rangle\mathrm{d}t+\frac{1}{2}\operatorname{Tr}\!\left(\sigma\sigma^{\top}\nabla^2\varphi(X_t)\right)\mathrm{d}t+\left\langle\nabla\varphi(X_t),\sigma\,\mathrm{d}B_t\right\rangle.$$

The last term is a martingale increment, so it disappears after taking expectations. For ([1.1](https://arxiv.org/html/2607.01693#S1.E1)), $b=-\nabla U$ and $\sigma=\sqrt{2}\,I$, so that $\frac{1}{2}\operatorname{Tr}(\sigma\sigma^{\top}\nabla^2\varphi)=\Delta\varphi$. Applying the preceding formula to ([1.1](https://arxiv.org/html/2607.01693#S1.E1)) gives

$$\frac{\mathrm{d}}{\mathrm{d}t}\mathbb{E}[\varphi(X_t)]=\mathbb{E}\left[-\left\langle\nabla U(X_t),\nabla\varphi(X_t)\right\rangle+\Delta\varphi(X_t)\right].$$

If $X_t$ has density $q_t$, then

$$\frac{\mathrm{d}}{\mathrm{d}t}\int\varphi\,q_t\,\mathrm{d}x=\int\left[-\left\langle\nabla U,\nabla\varphi\right\rangle+\Delta\varphi\right]q_t\,\mathrm{d}x.$$

Integrating by parts, assuming boundary terms vanish,

$$\int-q_t\left\langle\nabla U,\nabla\varphi\right\rangle\mathrm{d}x=\int\varphi\,\nabla\cdot(q_t\nabla U)\,\mathrm{d}x,\qquad\int q_t\,\Delta\varphi\,\mathrm{d}x=\int\varphi\,\Delta q_t\,\mathrm{d}x.$$

Since this holds for all test functions $\varphi$, we obtain

$$\partial_t q_t=\nabla\cdot(q_t\nabla U)+\Delta q_t.$$

Finally, $\pi\propto e^{-U}$ implies $\nabla U=-\nabla\log\pi$, and

$$\nabla\cdot(q_t\nabla U)+\Delta q_t=\nabla\cdot\left(q_t\nabla\log\frac{q_t}{\pi}\right).$$
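
The last identity can be checked in one line, as a short worked step under the assumption that $q_t$ is smooth and strictly positive. Using $q_t\nabla\log q_t=\nabla q_t$ and $\nabla\log\pi=-\nabla U$,

$$q_t\nabla\log\frac{q_t}{\pi}=q_t\nabla\log q_t-q_t\nabla\log\pi=\nabla q_t+q_t\nabla U,$$

and taking the divergence of both sides gives $\nabla\cdot\left(q_t\nabla\log\frac{q_t}{\pi}\right)=\Delta q_t+\nabla\cdot(q_t\nabla U)$.
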
The differential operator that appeared in the test-function calculation,

$$\mathcal{L}\varphi=-\left\langle\nabla U,\nabla\varphi\right\rangle+\Delta\varphi,$$

is called the infinitesimal generator of the Langevin diffusion. The Fokker–Planck equation is the adjoint equation on densities. In weak form, the calculation above says

$$\frac{\mathrm{d}}{\mathrm{d}t}\int\varphi\,q_t\,\mathrm{d}x=\int(\mathcal{L}\varphi)\,q_t\,\mathrm{d}x=\int\varphi\,(\mathcal{L}^{\ast}q_t)\,\mathrm{d}x,$$

so $\partial_t q_t=\mathcal{L}^{\ast}q_t$, where

$$\mathcal{L}^{\ast}q=\nabla\cdot(q\nabla U)+\Delta q.$$

###### Proposition 1.2 (Stationarity).

The target density $\pi\propto e^{-U}$ satisfies

$$\mathcal{L}^{\ast}\pi=0.$$

Consequently, if $q_0=\pi$, then the solution of the Fokker–Planck equation remains $q_t=\pi$ for all $t$. Thus $\pi$ is an invariant law of ([1.1](https://arxiv.org/html/2607.01693#S1.E1)).

###### Proof.

Since $\nabla\log\pi=-\nabla U$, we have $\nabla\pi=-\pi\nabla U$. Therefore

$$\mathcal{L}^{\ast}\pi=\nabla\cdot(\pi\nabla U)+\Delta\pi=\nabla\cdot(\pi\nabla U+\nabla\pi)=0.$$

Thus the right-hand side of the adjoint equation $\partial_t q_t=\mathcal{L}^{\ast}q_t$ vanishes at $q_t=\pi$, so the density $\pi$ is stationary. ∎
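
The stationarity identity $\mathcal{L}^{\ast}\pi=0$ can also be checked numerically in one dimension. The sketch below is a finite-difference illustration under assumptions of our own choosing: a double-well potential $U(x)=x^4/4-x^2/2$, a truncated grid on $[-3,3]$, and a Gaussian comparison density `q` that is not stationary for this potential.

```python
import numpy as np

x = np.linspace(-3.0, 3.0, 6001)     # grid with spacing dx = 1e-3
dx = x[1] - x[0]
U = x**4 / 4 - x**2 / 2              # hypothetical double-well potential
dU = x**3 - x                        # its derivative U'(x)

def normalize(w):
    """Rescale nonnegative grid values so that the Riemann sum equals 1."""
    return w / (w.sum() * dx)

def fp_rhs(q):
    """Finite-difference L* q = (q U')' + q'', written as d/dx of the flux q U' + q'."""
    flux = q * dU + np.gradient(q, dx)
    return np.gradient(flux, dx)

pi = normalize(np.exp(-U))           # target density, pi proportional to exp(-U)
q = normalize(np.exp(-x**2 / 2))     # a Gaussian, not stationary for this U

res_pi = np.max(np.abs(fp_rhs(pi)))  # near zero, up to discretization error
res_q = np.max(np.abs(fp_rhs(q)))    # order one
print("max |L* pi| =", res_pi)
print("max |L* q|  =", res_q)
```

Writing $\mathcal{L}^{\ast}q$ as the divergence of the flux $q\nabla U+\nabla q$ mirrors the proof: the flux itself vanishes at $q=\pi$.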

###### Example (Ornstein–Uhlenbeck process).

Take $\pi=\mathcal{N}(0,I)$, so $U(x)=\lVert x\rVert^2/2$. Langevin dynamics becomes

$$\mathrm{d}X_t=-X_t\,\mathrm{d}t+\sqrt{2}\,\mathrm{d}B_t.$$

The explicit solution is

$$X_t=e^{-t}X_0+\sqrt{2}\int_0^t e^{-(t-s)}\,\mathrm{d}B_s.$$

If $X_0$ is deterministic, then

$$X_t\sim\mathcal{N}\!\left(e^{-t}X_0,(1-e^{-2t})I\right).$$

This example is worth remembering: the mean contracts exponentially and the variance fills in to the target variance.
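
The Gaussian law above also gives an exact way to advance an Ornstein–Uhlenbeck particle over any time step, with no discretization error. The following Monte Carlo sketch, in one dimension and with illustrative parameter values that are not from the notes, composes ten exact steps and compares the empirical mean and variance at time $t=1$ with the closed form.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (assumptions for this sketch).
x0, t_final, n_steps, n_particles = 3.0, 1.0, 10, 200_000
h = t_final / n_steps

X = np.full(n_particles, x0)
for _ in range(n_steps):
    # Exact OU transition over time h: X_{t+h} | X_t ~ N(e^{-h} X_t, 1 - e^{-2h}).
    noise = rng.standard_normal(n_particles)
    X = np.exp(-h) * X + np.sqrt(1.0 - np.exp(-2.0 * h)) * noise

mean_exact = np.exp(-t_final) * x0          # e^{-t} X_0
var_exact = 1.0 - np.exp(-2.0 * t_final)    # 1 - e^{-2t}

print("mean: empirical", X.mean(), " exact", mean_exact)
print("var:  empirical", X.var(), " exact", var_exact)
```

Composing exact steps reproduces the time-$t$ law because the variances add up geometrically: $(1-e^{-2h})\sum_{k=0}^{n-1}e^{-2hk}=1-e^{-2nh}$.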

### 1.4. Entropy dissipation

Differentiating KL along the flow gives

$$\frac{\mathrm{d}}{\mathrm{d}t}D_{\mathsf{KL}}(q_t\|\pi)=-\int\left\lVert\nabla\log\frac{q_t}{\pi}\right\rVert^2 q_t\,\mathrm{d}x=-\operatorname{FI}(q_t\|\pi),\tag{1.3}$$

where $\operatorname{FI}(q\|\pi)$ is the relative Fisher information.

###### Proof of ([1.3](https://arxiv.org/html/2607.01693#S1.E3)).

We have

$$D_{\mathsf{KL}}(q_t\|\pi)=\int q_t\log\frac{q_t}{\pi}\,\mathrm{d}x.$$

Because $\int\partial_t q_t\,\mathrm{d}x=0$, differentiating gives

$$\frac{\mathrm{d}}{\mathrm{d}t}D_{\mathsf{KL}}(q_t\|\pi)=\int\partial_t q_t\,\log\frac{q_t}{\pi}\,\mathrm{d}x.$$

Using the equivalent form of the Fokker–Planck equation,

$$\partial_t q_t=\nabla\cdot\!\left(q_t\nabla\log\frac{q_t}{\pi}\right),$$

and integrating by parts,

$$\int\nabla\cdot\!\left(q_t\nabla\log\frac{q_t}{\pi}\right)\log\frac{q_t}{\pi}\,\mathrm{d}x=-\int q_t\left\lVert\nabla\log\frac{q_t}{\pi}\right\rVert^2\mathrm{d}x.$$

This is exactly $-\operatorname{FI}(q_t\|\pi)$. ∎

The identity just proved becomes a quantitative convergence estimate once it is combined with a functional inequality. If $\pi$ satisfies a log-Sobolev inequality

$$D_{\mathsf{KL}}(q\|\pi)\leq\frac{C_{\mathsf{LSI}}}{2}\operatorname{FI}(q\|\pi),$$

then ([1.3](https://arxiv.org/html/2607.01693#S1.E3)) implies

$$\frac{\mathrm{d}}{\mathrm{d}t}D_{\mathsf{KL}}(q_t\|\pi)\leq-\frac{2}{C_{\mathsf{LSI}}}D_{\mathsf{KL}}(q_t\|\pi).$$

By Grönwall’s inequality,

$$D_{\mathsf{KL}}(q_t\|\pi)\leq e^{-2t/C_{\mathsf{LSI}}}D_{\mathsf{KL}}(q_0\|\pi).$$

Thus the continuous-time process reduces KL error exponentially.

In particular, if the potential $U$ is $m$-strongly convex, that is $\nabla^2U\succeq mI$ with $m>0$, then by the Bakry–Émery criterion \[[6](https://arxiv.org/html/2607.01693#bib.bib6)\] the target satisfies the log-Sobolev inequality with $C_{\mathsf{LSI}}\leq 1/m$, and the estimate above becomes the clean exponential rate

$$D_{\mathsf{KL}}(q_t\|\pi)\leq e^{-2mt}D_{\mathsf{KL}}(q_0\|\pi).$$

The main lesson is the mechanism: KL divergence is a Lyapunov function, the relative Fisher information is its dissipation rate, and a structural inequality for the target converts dissipation into convergence\. Sampling arguments often follow this pattern: choose a discrepancy, compute its derivative along the dynamics, and use an inequality for the target law to turn the derivative into a rate\.
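The exponential rate can be checked by hand in a case where every law is Gaussian. The sketch below assumes a one-dimensional quadratic potential $U(x)=mx^{2}/2$ (so $\pi=\mathcal{N}(0,1/m)$) and a Gaussian initial law; the function names and numerical values are illustrative and not part of the notes.

```python
import math

def kl_gauss_to_target(mean, var, m):
    """KL( N(mean, var) || N(0, 1/m) ) in one dimension."""
    w = m * var
    return 0.5 * (w - 1.0 - math.log(w) + m * mean**2)

def langevin_marginal(mean0, var0, m, t):
    """Law at time t of dX = -m X dt + sqrt(2) dB started from N(mean0, var0)."""
    decay = math.exp(-m * t)
    return mean0 * decay, var0 * decay**2 + (1.0 - decay**2) / m

m, mean0, var0 = 2.0, 3.0, 0.1  # illustrative values (assumptions)
kl0 = kl_gauss_to_target(mean0, var0, m)
for t in [0.0, 0.1, 0.5, 1.0, 2.0]:
    mean_t, var_t = langevin_marginal(mean0, var0, m, t)
    kl_t = kl_gauss_to_target(mean_t, var_t, m)
    bound = math.exp(-2.0 * m * t) * kl0  # e^{-2mt} KL(q_0 || pi)
    print(f"t={t:4.1f}  KL={kl_t:.3e}  bound={bound:.3e}")
```

In this Gaussian case the mean contribution to the KL decays at exactly the rate $e^{-2mt}$, while the variance contribution decays at least as fast.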

The same identity also has a geometric reading. In optimal-transport language, overdamped Langevin is the $W_{2}$-gradient flow of $\operatorname{D_{\mathsf{KL}}}(\cdot\|\pi)$, and ([1.3](https://arxiv.org/html/2607.01693#S1.E3)) is its energy-dissipation identity; see Ambrosio, Gigli, and Savaré \[[3](https://arxiv.org/html/2607.01693#bib.bib3)\]. For background on diffusion semigroups, entropy dissipation, and log-Sobolev inequalities, see Bakry, Gentil, and Ledoux \[[6](https://arxiv.org/html/2607.01693#bib.bib6)\]; for the transport viewpoint behind Wasserstein geometry, see Villani \[[47](https://arxiv.org/html/2607.01693#bib.bib47)\].

## 2. Convergence of Langevin Diffusion and ULA

The previous section established continuous\-time convergence through the entropy identity and log\-Sobolev inequality\. We now ask what remains of that story after replacing the diffusion by an implementable Markov chain\. The main new issue is numerical error: a time\-stepping rule may converge as a Markov chain, but its invariant law need not be the target law\. For a modern treatment of log\-concave sampling algorithms, including Langevin algorithms and their Metropolis\-adjusted variants, see the excellent book by Chewi\[[18](https://arxiv.org/html/2607.01693#bib.bib18)\]\.

### 2.1. Unadjusted Langevin algorithm

Euler–Maruyama with step size $h>0$ gives the unadjusted Langevin algorithm (ULA)

$$X_{n+1}=X_{n}-h\nabla U(X_{n})+\sqrt{2h}\,\xi_{n},\qquad\xi_{n}\sim\mathcal{N}(0,I).\tag{2.1}$$

ULA is often the first sampler one considers because it is simple, gradient-based, and cheap. Its limitation is equally important: ([2.1](https://arxiv.org/html/2607.01693#S2.E1)) is not an exact transition of ([1.1](https://arxiv.org/html/2607.01693#S1.E1)), and in general does not preserve $\pi$.

It is tempting to view ULA as “almost Langevin” and therefore assume that it must have almost the right stationary law. Already for a Gaussian target one can see exactly what “almost” means. For small step size the bias is small, but it does not vanish at fixed $h$ no matter how long the chain is run. This is the difference between mixing error, which decreases with more iterations, and discretization bias, which is built into the transition rule itself.

###### Example (ULA bias for a standard Gaussian).

Let $U(x)=x^{2}/2$ in one dimension, so the target is $\pi=\mathcal{N}(0,1)$. ULA becomes the Markov chain

$$X_{n+1}=(1-h)X_{n}+\sqrt{2h}\,\xi_{n}.$$

Recall from the Gaussian toolbox, Lemma [B.1](https://arxiv.org/html/2607.01693#A2.Thmtheorem1), that affine transformations of Gaussians are Gaussian, and that independent Gaussian variances add. In stationarity, $X_{n}$ and $X_{n+1}$ have the same law. Thus, if the invariant law is centered Gaussian with variance $v_{h}$, then

$$X_{n}\sim\mathcal{N}(0,v_{h})\quad\Longrightarrow\quad X_{n+1}\sim\mathcal{N}\!\left(0,(1-h)^{2}v_{h}+2h\right),$$

because $\xi_{n}\sim\mathcal{N}(0,1)$ is independent of $X_{n}$. Equality of the stationary input and output variances gives the identity

$$v_{h}=(1-h)^{2}v_{h}+2h.$$

Solving,

$$v_{h}=\frac{2h}{1-(1-h)^{2}}=\frac{2}{2-h}.$$

Thus the invariant law of ULA is $\pi_{h}=\mathcal{N}\!\left(0,2/(2-h)\right)$, not $\pi=\mathcal{N}(0,1)$ unless $h=0$. For small $h$, the variance bias is $2/(2-h)-1=h/(2-h)=O(h)$. The same bias can also be expressed as divergences between the two invariant laws:

$$\operatorname{D_{\mathsf{KL}}}(\pi_{h}\|\pi)=\frac{1}{2}\left(v_{h}-1-\log v_{h}\right),\qquad\operatorname{D_{\mathsf{KL}}}(\pi\|\pi_{h})=\frac{1}{2}\left(\frac{1}{v_{h}}-1+\log v_{h}\right),$$

and

$$W_{2}^{2}(\pi_{h},\pi)=\left(\sqrt{v_{h}}-1\right)^{2}.$$

All three quantities are $h^{2}/16+O(h^{3})$ as $h\downarrow 0$. This example is the cleanest way to see why unadjusted discretization creates bias.
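The Gaussian example can be reproduced numerically. The following sketch iterates the variance recursion of the chain and evaluates the three closed-form discrepancies next to the leading term $h^{2}/16$; the helper names, the starting variance, and the iteration count are illustrative assumptions.

```python
import math

def ula_stationary_variance(h, n_iter=10_000, v0=5.0):
    """Iterate v <- (1-h)^2 v + 2h, the variance recursion of ULA for U(x) = x^2/2."""
    v = v0
    for _ in range(n_iter):
        v = (1.0 - h) ** 2 * v + 2.0 * h
    return v

def bias_metrics(h):
    """Closed-form discrepancies between pi_h = N(0, 2/(2-h)) and pi = N(0, 1)."""
    v = 2.0 / (2.0 - h)
    kl_fwd = 0.5 * (v - 1.0 - math.log(v))        # KL(pi_h || pi)
    kl_rev = 0.5 * (1.0 / v - 1.0 + math.log(v))  # KL(pi || pi_h)
    w2_sq = (math.sqrt(v) - 1.0) ** 2             # W_2^2(pi_h, pi)
    return v, kl_fwd, kl_rev, w2_sq

for h in [0.2, 0.1, 0.05]:
    v, kl_fwd, kl_rev, w2_sq = bias_metrics(h)
    print(f"h={h}: iterated v={ula_stationary_variance(h):.6f}, closed form v={v:.6f}")
    print(f"   KL(pi_h||pi)={kl_fwd:.3e}  KL(pi||pi_h)={kl_rev:.3e}  "
          f"W2^2={w2_sq:.3e}  h^2/16={h**2 / 16:.3e}")
```

Running the recursion longer does not move the limit toward 1: the gap is set by $h$ alone, which is the distinction between mixing error and discretization bias.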

### 2.2. One-step KL discretization error

Let $P_{h}(x,\cdot)$ be the exact Langevin transition for time $h$, and let $\widehat{P}_{h}(x,\cdot)$ be the ULA transition. To compare them in KL, interpolate one ULA step over the interval $[0,h]$ by

$$\mathrm{d}\widehat{X}_{s}=-\nabla U(x)\,\mathrm{d}s+\sqrt{2}\,\mathrm{d}B_{s},\qquad\widehat{X}_{0}=x.$$

The exact Langevin law started from the same point has drift field $y\mapsto-\nabla U(y)$. Since the KL below is $\operatorname{Law}(\widehat{X}_{[0,h]})$ relative to $\operatorname{Law}(X_{[0,h]})$, the Girsanov formula evaluates both drift fields along the first path, namely at $\widehat{X}_{s}$. Thus the formula reviewed in Appendix [A](https://arxiv.org/html/2607.01693#A1) gives the path-space comparison

$$\operatorname{D_{\mathsf{KL}}}\!\left(\operatorname{Law}(\widehat{X}_{[0,h]})\,\middle\|\,\operatorname{Law}(X_{[0,h]})\right)=\frac{1}{4}\int_{0}^{h}\mathbb{E}\left\lVert\nabla U(\widehat{X}_{s})-\nabla U(x)\right\rVert^{2}\,\mathrm{d}s.$$

By the data-processing inequality reviewed in Lemma [A.1](https://arxiv.org/html/2607.01693#A1.Thmtheorem1), the KL between the endpoint laws is no larger. If $\nabla U$ is $L$-Lipschitz, then

$$\operatorname{D_{\mathsf{KL}}}\!\left(\widehat{P}_{h}(x,\cdot)\,\middle\|\,P_{h}(x,\cdot)\right)\leq\frac{L^{2}}{4}\int_{0}^{h}\mathbb{E}\left\lVert\widehat{X}_{s}-x\right\rVert^{2}\,\mathrm{d}s.$$

Since $\widehat{X}_{s}=x-s\nabla U(x)+\sqrt{2}\,B_{s}$,

$$\mathbb{E}\left\lVert\widehat{X}_{s}-x\right\rVert^{2}=s^{2}\left\lVert\nabla U(x)\right\rVert^{2}+2d\,s,$$

and therefore

$$\operatorname{D_{\mathsf{KL}}}\!\left(\widehat{P}_{h}(x,\cdot)\,\middle\|\,P_{h}(x,\cdot)\right)\leq\frac{L^{2}}{4}\left(dh^{2}+\frac{h^{3}}{3}\left\lVert\nabla U(x)\right\rVert^{2}\right).\tag{2.2}$$

This is the KL analogue of a local truncation error: freezing the drift for one short interval costs order $dh^{2}$, plus a term depending on the local drift size.
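For a quadratic potential both one-step kernels are Gaussian, so the bound (2.2) can be compared with the exact endpoint KL. The sketch below assumes $U(x)=x^{2}/2$ in one dimension (so $L=d=1$ and $\nabla U(x)=x$); the exact Langevin step from $x$ is then $\mathcal{N}(xe^{-h},1-e^{-2h})$. Function names are illustrative.

```python
import math

def one_step_kl_bound(h, grad_norm_sq, L, d):
    """Right-hand side of the one-step estimate (2.2)."""
    return 0.25 * L**2 * (d * h**2 + h**3 / 3.0 * grad_norm_sq)

def kl_gauss_1d(m1, v1, m2, v2):
    """KL( N(m1, v1) || N(m2, v2) ) in one dimension."""
    return 0.5 * (v1 / v2 - 1.0 - math.log(v1 / v2) + (m1 - m2) ** 2 / v2)

def exact_one_step_kl(x, h):
    """KL(ULA step || exact Langevin step) from x, for U(x) = x^2/2."""
    ula_mean, ula_var = (1.0 - h) * x, 2.0 * h
    ou_mean, ou_var = x * math.exp(-h), 1.0 - math.exp(-2.0 * h)
    return kl_gauss_1d(ula_mean, ula_var, ou_mean, ou_var)

for x in [0.0, 1.0, 3.0]:
    for h in [0.01, 0.1, 0.5]:
        exact = exact_one_step_kl(x, h)
        bound = one_step_kl_bound(h, grad_norm_sq=x**2, L=1.0, d=1)
        print(f"x={x:3.1f} h={h:4.2f}  exact KL={exact:.3e}  bound (2.2)={bound:.3e}")
```

In this example the exact value sits just below the bound for small $h$, which suggests that the $dh^{2}$ scaling is not an artifact of the proof.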

### 2.3. Direct KL recursion for ULA

The one-step estimate explains the scale of the discretization error, but it is not by itself a convergence theorem to $\pi$. Indeed, a comparison such as $\operatorname{D_{\mathsf{KL}}}(\widehat{q}_{k}\|q_{kh})$ cannot simply be added to $\operatorname{D_{\mathsf{KL}}}(q_{kh}\|\pi)$, because KL has no triangle inequality. The modern KL-based analysis instead tracks the target-relative quantity $\operatorname{D_{\mathsf{KL}}}(\widehat{q}_{k}\|\pi)$ directly.

The following result is a translation of the KL theorem of Vempala and Wibisono \[[46](https://arxiv.org/html/2607.01693#bib.bib46)\] into our notation.

###### Theorem 2.1 (ULA convergence in KL).

Assume that $\pi(\mathrm{d}x)\propto e^{-U(x)}\,\mathrm{d}x$ satisfies the log-Sobolev inequality with constant $C_{\mathsf{LSI}}$ in the convention

$$\operatorname{D_{\mathsf{KL}}}(q\|\pi)\leq\frac{C_{\mathsf{LSI}}}{2}\operatorname{FI}(q\|\pi),$$

and that $\nabla U$ is $L$-Lipschitz. Let $\widehat{q}_{k}$ be the law of ULA ([2.1](https://arxiv.org/html/2607.01693#S2.E1)) with step size

$$0<h\leq\frac{1}{4C_{\mathsf{LSI}}L^{2}}.$$

If $\operatorname{D_{\mathsf{KL}}}(\widehat{q}_{0}\|\pi)<\infty$, then

$$\operatorname{D_{\mathsf{KL}}}(\widehat{q}_{k}\|\pi)\leq e^{-kh/C_{\mathsf{LSI}}}\operatorname{D_{\mathsf{KL}}}(\widehat{q}_{0}\|\pi)+8C_{\mathsf{LSI}}L^{2}dh.\tag{2.3}$$

No convexity is assumed in this theorem. The log-Sobolev inequality supplies the global mixing mechanism, while the Lipschitz-gradient assumption (a bounded Hessian when $U$ is twice differentiable) supplies the local control needed to discretize the diffusion.

Thus, to reach KL target $\operatorname{D_{\mathsf{KL}}}(\widehat{q}_{k}\|\pi)\leq\varepsilon^{2}$, it suffices to choose

$$h\lesssim\min\left\{\frac{1}{C_{\mathsf{LSI}}L^{2}},\frac{\varepsilon^{2}}{C_{\mathsf{LSI}}L^{2}d}\right\}$$

and run for

$$k\gtrsim\frac{C_{\mathsf{LSI}}}{h}\log\frac{\operatorname{D_{\mathsf{KL}}}(\widehat{q}_{0}\|\pi)}{\varepsilon^{2}}$$

iterations. Equivalently, the iteration complexity is

$$k=\widetilde{O}\!\left(\frac{C_{\mathsf{LSI}}^{2}L^{2}d}{\varepsilon^{2}}\right).$$

The dependence on the ambient dimension is linear in this KL analysis. By the Bakry–Émery criterion for log-Sobolev inequalities (see Bakry, Gentil, and Ledoux \[[6](https://arxiv.org/html/2607.01693#bib.bib6)\]), under $m$-strong convexity one has $C_{\mathsf{LSI}}\leq 1/m$, so the bound becomes $\widetilde{O}(\kappa^{2}d/\varepsilon^{2})$ complexity, where $\kappa=L/m$ is the condition number.
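One concrete way to turn the $\lesssim$ and $\gtrsim$ statements into numbers is to split the error budget evenly between the two terms of (2.3). The helper below is a sketch under that assumption; the constants 16 and 2 come from this particular split, not from the notes, and the function names are hypothetical.

```python
import math

def ula_schedule(c_lsi, L, d, eps, kl0):
    """Pick (h, k) so that the right-hand side of (2.3) is at most eps**2.

    Even split (an assumption): bias term <= eps^2/2, contraction term <= eps^2/2.
    """
    h = min(1.0 / (4.0 * c_lsi * L**2), eps**2 / (16.0 * c_lsi * L**2 * d))
    k = max(0, math.ceil(c_lsi / h * math.log(2.0 * kl0 / eps**2)))
    return h, k

def ula_kl_bound(c_lsi, L, d, h, k, kl0):
    """Right-hand side of (2.3)."""
    return math.exp(-k * h / c_lsi) * kl0 + 8.0 * c_lsi * L**2 * d * h

# Illustrative problem sizes (assumptions).
for eps in [0.3, 0.1, 0.03]:
    h, k = ula_schedule(c_lsi=1.0, L=2.0, d=100, eps=eps, kl0=50.0)
    print(f"eps={eps}: h={h:.3e}, k={k}, bound={ula_kl_bound(1.0, 2.0, 100, h, k, 50.0):.3e}")
```

Shrinking $\varepsilon$ by a factor of about three multiplies the iteration count by roughly ten, matching the $1/\varepsilon^{2}$ scaling up to the logarithm.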

###### Proof sketch\.

Fix one step and start from law $\widehat{q}_{k}$. Interpolate the ULA update by running, for elapsed time $t\in[0,h]$, the frozen-drift diffusion

$$\mathrm{d}\bar{X}_{t}=-\nabla U(\bar{X}_{0})\,\mathrm{d}t+\sqrt{2}\,\mathrm{d}B_{t},\qquad\bar{X}_{0}\sim\widehat{q}_{k}.$$

Write $\nu_{t}=\operatorname{Law}(\bar{X}_{t})$, so $\nu_{h}=\widehat{q}_{k+1}$. Differentiating $\operatorname{D_{\mathsf{KL}}}(\nu_{t}\|\pi)$ along this interpolated process gives the same negative Fisher-information term as in the continuous Langevin calculation, plus the error caused by freezing the drift:

$$\frac{\mathrm{d}}{\mathrm{d}t}\operatorname{D_{\mathsf{KL}}}(\nu_{t}\|\pi)=-\operatorname{FI}(\nu_{t}\|\pi)+\mathbb{E}\left\langle\nabla U(\bar{X}_{t})-\nabla U(\bar{X}_{0}),\nabla\log\frac{\nu_{t}}{\pi}(\bar{X}_{t})\right\rangle.$$

Young’s inequality gives, directly for the inner product above,

$$\begin{aligned}\mathbb{E}\left\langle\nabla U(\bar{X}_{t})-\nabla U(\bar{X}_{0}),\nabla\log\frac{\nu_{t}}{\pi}(\bar{X}_{t})\right\rangle&\leq\frac{1}{2}\mathbb{E}\left\lVert\nabla\log\frac{\nu_{t}}{\pi}(\bar{X}_{t})\right\rVert^{2}+\frac{1}{2}\mathbb{E}\left\lVert\nabla U(\bar{X}_{t})-\nabla U(\bar{X}_{0})\right\rVert^{2}\\&=\frac{1}{2}\operatorname{FI}(\nu_{t}\|\pi)+\frac{1}{2}\mathbb{E}\left\lVert\nabla U(\bar{X}_{t})-\nabla U(\bar{X}_{0})\right\rVert^{2}.\end{aligned}$$

By $L$-smoothness, $\left\lVert\nabla U(\bar{X}_{t})-\nabla U(\bar{X}_{0})\right\rVert\leq L\left\lVert\bar{X}_{t}-\bar{X}_{0}\right\rVert$, so

$$\frac{\mathrm{d}}{\mathrm{d}t}\operatorname{D_{\mathsf{KL}}}(\nu_{t}\|\pi)\leq-\frac{1}{2}\operatorname{FI}(\nu_{t}\|\pi)+\frac{L^{2}}{2}\mathbb{E}\left\lVert\bar{X}_{t}-\bar{X}_{0}\right\rVert^{2}.$$

The log-Sobolev inequality turns the first term into contraction: $\operatorname{FI}(\nu_{t}\|\pi)\geq 2C_{\mathsf{LSI}}^{-1}\operatorname{D_{\mathsf{KL}}}(\nu_{t}\|\pi)$. Thus

$$\frac{\mathrm{d}}{\mathrm{d}t}\operatorname{D_{\mathsf{KL}}}(\nu_{t}\|\pi)\leq-\frac{1}{C_{\mathsf{LSI}}}\operatorname{D_{\mathsf{KL}}}(\nu_{t}\|\pi)+\frac{L^{2}}{2}\mathbb{E}\left\lVert\bar{X}_{t}-\bar{X}_{0}\right\rVert^{2}.$$

It remains to bound the displacement of the frozen-drift interpolation. Since

$$\mathbb{E}\left\lVert\bar{X}_{t}-\bar{X}_{0}\right\rVert^{2}=t^{2}\,\mathbb{E}_{\widehat{q}_{k}}\left\lVert\nabla U\right\rVert^{2}+2d\,t,$$

the Brownian part contributes the local $dh^{2}$ term after integration over $t\in[0,h]$. For the drift part, use an optimal $W_{2}$ coupling between $\widehat{q}_{k}$ and $\pi$. Along such a coupling, smoothness gives $\left\lVert\nabla U(x)\right\rVert^{2}\leq 2\left\lVert\nabla U(y)\right\rVert^{2}+2L^{2}\left\lVert x-y\right\rVert^{2}$, and averaging gives

$$\mathbb{E}_{\widehat{q}_{k}}\left\lVert\nabla U\right\rVert^{2}\leq 2\,\mathbb{E}_{\pi}\left\lVert\nabla U\right\rVert^{2}+2L^{2}W_{2}^{2}(\widehat{q}_{k},\pi).$$

The first term is at most $2Ld$, because integration by parts under $\pi\propto e^{-U}$ gives $\mathbb{E}_{\pi}\left\lVert\nabla U\right\rVert^{2}=\mathbb{E}_{\pi}\Delta U\leq Ld$. The second term is bounded by Talagrand’s comparison $W_{2}^{2}(\widehat{q}_{k},\pi)\leq 2C_{\mathsf{LSI}}\operatorname{D_{\mathsf{KL}}}(\widehat{q}_{k}\|\pi)$ from Subsection [1.1](https://arxiv.org/html/2607.01693#S1.SS1). Hence the drift contribution is a lower-order $dh^{2}$ term plus a multiple of $C_{\mathsf{LSI}}L^{4}h^{3}\operatorname{D_{\mathsf{KL}}}(\widehat{q}_{k}\|\pi)$, and the stated step-size condition makes this last piece small enough to be absorbed into the contraction.

The resulting one-step inequality of Vempala and Wibisono \[[46](https://arxiv.org/html/2607.01693#bib.bib46)\], in our notation, is

$$\operatorname{D_{\mathsf{KL}}}(\widehat{q}_{k+1}\|\pi)\leq e^{-h/C_{\mathsf{LSI}}}\operatorname{D_{\mathsf{KL}}}(\widehat{q}_{k}\|\pi)+6L^{2}dh^{2}.$$

Iterating this recursion gives

$$\operatorname{D_{\mathsf{KL}}}(\widehat{q}_{k}\|\pi)\leq e^{-kh/C_{\mathsf{LSI}}}\operatorname{D_{\mathsf{KL}}}(\widehat{q}_{0}\|\pi)+6L^{2}dh^{2}\sum_{j=0}^{k-1}e^{-jh/C_{\mathsf{LSI}}}.$$

Finally, $\sum_{j=0}^{k-1}e^{-jh/C_{\mathsf{LSI}}}\leq(1-e^{-h/C_{\mathsf{LSI}}})^{-1}\leq 4C_{\mathsf{LSI}}/(3h)$ under the same small-step assumptions. The last term is therefore at most $8C_{\mathsf{LSI}}L^{2}dh$, which gives ([2.3](https://arxiv.org/html/2607.01693#S2.E3)). This is the discrete analogue of entropy dissipation: log-Sobolev supplies the contraction, while smoothness controls the Euler–Maruyama error. ∎
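The last step of the proof sketch, unrolling the one-step recursion into (2.3), can be checked numerically. The sketch below iterates the recursion with equality, which is the worst case it allows, and compares the result with the closed-form bound; the parameter values are illustrative assumptions chosen so that $h/C_{\mathsf{LSI}}\leq 1/4$.

```python
import math

def iterate_one_step_bound(kl0, c_lsi, L, d, h, k):
    """Unroll b_{j+1} = exp(-h/C) * b_j + 6 L^2 d h^2 for k steps, starting at kl0."""
    b = kl0
    for _ in range(k):
        b = math.exp(-h / c_lsi) * b + 6.0 * L**2 * d * h**2
    return b

def closed_form_bound(kl0, c_lsi, L, d, h, k):
    """Right-hand side of (2.3)."""
    return math.exp(-k * h / c_lsi) * kl0 + 8.0 * c_lsi * L**2 * d * h

c_lsi, L, d, kl0 = 1.0, 1.0, 10, 5.0   # illustrative values (assumptions)
h = 1.0 / (4.0 * c_lsi * L**2)         # largest step size allowed by Theorem 2.1
for k in [0, 1, 10, 100]:
    unrolled = iterate_one_step_bound(kl0, c_lsi, L, d, h, k)
    closed = closed_form_bound(kl0, c_lsi, L, d, h, k)
    print(f"k={k:3d}  unrolled={unrolled:.4f}  closed form (2.3)={closed:.4f}")
```

For large $k$ the unrolled value settles at $6L^{2}dh^{2}/(1-e^{-h/C_{\mathsf{LSI}}})$, slightly below the constant $8C_{\mathsf{LSI}}L^{2}dh$ obtained from the geometric-series estimate.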

Let us examine what Theorem 2.1 says. The dependence on dimension is linear, but the dependence on accuracy is polynomial: to make the KL error of order $\varepsilon^{2}$, ULA uses $\widetilde{O}(C_{\mathsf{LSI}}^{2}L^{2}d/\varepsilon^{2})$ steps. This happens because the bound ([2.3](https://arxiv.org/html/2607.01693#S2.E3)) contains a stationary bias term of order $h$ that does not decay with the number of iterations, so high accuracy forces a small step size. It is therefore natural to ask whether one can keep the Langevin proposal but remove the discretization bias by an exact correction. Classical Metropolis adjustment does precisely this when density ratios are available.

### 2.4. Metropolis-adjusted Langevin algorithm

The Metropolis-adjusted Langevin algorithm (MALA) corrects the ULA proposal with an accept-reject step. The distinction from the gradient-only setting is central: density evaluations enable the classical high-accuracy rejection ideas, whereas gradient information alone does not provide the density ratios those ideas require.

Let $\pi(\mathrm{d}x)\propto e^{-U(x)}\,\mathrm{d}x$. From a current state $x$, MALA first draws the ULA proposal

$$y\sim q_{h}(x,\cdot)=\mathcal{N}\!\left(x-h\nabla U(x),2hI\right).$$

It then accepts $y$ with probability

$$a_{h}(x,y)=1\wedge\frac{\pi(y)\,q_{h}(y,x)}{\pi(x)\,q_{h}(x,y)}.\tag{2.4}$$

If the proposal is rejected, the chain stays at $x$. The Metropolis ratio ([2.4](https://arxiv.org/html/2607.01693#S2.E4)) enforces detailed balance:

$$\pi(\mathrm{d}x)\,P_{h}^{\mathsf{MALA}}(x,\mathrm{d}y)=\pi(\mathrm{d}y)\,P_{h}^{\mathsf{MALA}}(y,\mathrm{d}x).$$

Consequently, $\pi$ is exactly stationary for MALA. This is the basic advantage over ULA: the discretization bias is removed, at the price of needing the density ratio $\pi(y)/\pi(x)$.

MALA is the cleanest example of a correction mechanism. The proposal uses only gradient information, just like ULA, but the accept-reject step asks whether the proposed move has the right probability under the target density. This single density-ratio check changes the invariant distribution from an approximation $\pi_{h}$ back to the exact target $\pi$.
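The proposal and the acceptance rule (2.4) translate into a few lines of code. The following is a minimal sketch, not a tuned implementation: it assumes NumPy, a user-supplied potential `U` and gradient `grad_U`, and a standard Gaussian target in two dimensions as the illustrative example. The acceptance ratio is evaluated in log form, where the normalizing constants of $\pi$ and $q_{h}$ cancel.

```python
import numpy as np

def mala_log_accept_ratio(x, y, U, grad_U, h):
    """log of pi(y) q_h(y, x) / (pi(x) q_h(x, y)) for the Gaussian proposal q_h."""
    def log_q(src, dst):  # log q_h(src, dst), up to a constant that cancels
        return -np.sum((dst - src + h * grad_U(src)) ** 2) / (4.0 * h)
    return (U(x) - U(y)) + log_q(y, x) - log_q(x, y)

def mala_step(x, U, grad_U, h, rng):
    """One MALA transition: ULA proposal followed by the accept-reject step (2.4)."""
    y = x - h * grad_U(x) + np.sqrt(2.0 * h) * rng.standard_normal(x.shape)
    log_a = min(0.0, mala_log_accept_ratio(x, y, U, grad_U, h))
    return (y, True) if np.log(rng.uniform()) < log_a else (x, False)

# Illustrative target (an assumption): standard Gaussian in d = 2.
U = lambda x: 0.5 * np.sum(x**2)
grad_U = lambda x: x

rng = np.random.default_rng(0)
x = np.zeros(2)
samples, n_accept = [], 0
for _ in range(5000):
    x, accepted = mala_step(x, U, grad_U, h=0.5, rng=rng)
    samples.append(x)
    n_accept += accepted
samples = np.array(samples)
print("acceptance rate:", n_accept / len(samples))
print("sample variance per coordinate:", samples.var(axis=0))
```

Only the difference $U(x)-U(y)$ enters the ratio, so the normalizing constant of $\pi$ is never needed; this is the sense in which MALA requires density ratios rather than densities.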

Exact stationarity, however, is only the invariance part of the story. To turn MALA into a quantitative sampling algorithm, one still has to bound how many accepted-or-rejected proposals are needed before the chain is close to $\pi$. The modern high-accuracy theory separates this analysis into two tasks: sampling efficiently once a good initialization is available, and producing that initialization in the first place. Both are phrased relative to a warm start. A standard warm-start condition is the following: an initial law $\mu_{0}$ is $M$-warm with respect to $\pi$ if

$$\mu_{0}(A)\leq M\pi(A)\qquad\text{for every measurable set }A,$$

or equivalently

$$\left\|\frac{\mathrm{d}\mu_{0}}{\mathrm{d}\pi}\right\|_{L^{\infty}(\pi)}\leq M.\tag{2.5}$$

Some warm-start preparation results use the finite-order Rényi version:

$$D_{q}(\mu_{0}\|\pi)=\frac{1}{q-1}\log\int\left(\frac{\mathrm{d}\mu_{0}}{\mathrm{d}\pi}\right)^{q}\,\mathrm{d}\pi=O(1)$$

for some $q>1$. Note that in the Rényi notation, ([2.5](https://arxiv.org/html/2607.01693#S2.E5)) can be written as $D_{\infty}(\mu_{0}\|\pi)\leq\log M$.
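A small Gaussian example makes the warmness constant concrete. The sketch below assumes $\pi=\mathcal{N}(0,I_{d})$ and an initialization $\mu_{0}=\mathcal{N}(0,s^{2}I_{d})$ with $0<s\leq 1$; these choices and the function names are illustrative and not taken from the notes. The density ratio is then maximized at the origin, giving $M=s^{-d}$.

```python
import numpy as np

def log_density_ratio(x, s):
    """log (d mu0 / d pi)(x) for mu0 = N(0, s^2 I) and pi = N(0, I)."""
    d = x.shape[-1]
    return -d * np.log(s) - 0.5 * (1.0 / s**2 - 1.0) * np.sum(x**2, axis=-1)

def log_warmness(s, d):
    """log M for the smallest valid M when s <= 1: the ratio peaks at x = 0."""
    return -d * np.log(s)

for d in [1, 10, 100]:
    print(f"d={d:3d}: log M = {log_warmness(0.9, d):.3f}")
```

Even for $s=0.9$ the constant $\log M$ grows linearly in $d$, which is one way to see why the logarithmic dependence on $M$ in warm-start mixing bounds matters.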

For the first task, fix a target total-variation accuracy $\varepsilon\in(0,1)$. Assuming an $M$-warm start, Wu, Schmidler, and Chen \[[51](https://arxiv.org/html/2607.01693#bib.bib51)\] prove that, for $m$-strongly log-concave and $L$-smooth targets in $\mathbb{R}^{d}$, MALA mixes in

$$O\!\left(\kappa\sqrt{d}\,\log^{3}\!\max\left\{\kappa,d,\frac{M}{\varepsilon}\right\}\right),\qquad\kappa=L/m,$$

iterations, up to universal constants. Chen and Gatmiry \[[17](https://arxiv.org/html/2607.01693#bib.bib17)\] extend this warm-start MALA picture under smoothness and isoperimetry, recovering the same leading $\kappa\sqrt{d}$ behavior in the strongly log-concave case, again with logarithmic dependence on $M/\varepsilon$.

For the second task, Altschuler and Chewi \[[2](https://arxiv.org/html/2607.01693#bib.bib2)\] show that kinetic Langevin, also called underdamped Langevin, can produce the required finite-order Rényi warm start with the same $\widetilde{O}(\sqrt{d})$ dimension dependence. At the level of dimension dependence, the high-accuracy picture is: use kinetic Langevin to prepare a warm start, then use MALA as the exact Metropolis-corrected sampler.

The discussion above also shows the limitation of classical adjusted MCMC\. MALA removes discretization bias when density evaluations are available, but its sharp convergence guarantees require structural assumptions such as strong log\-concavity, log\-Sobolev or isoperimetric inequalities, smoothness, and warm starts\. A general data distribution may be multimodal, singular, or available only through samples, so these assumptions are not a natural starting point\. This motivates a different question: if we are given data rather than an explicit target density, how should we design an algorithm to generate new samples?

## 3. Score-Based Diffusion Models

The previous section treated sampling from an explicit density: the algorithm could use $\nabla U$, and MALA could even use density ratios to remove discretization bias. Diffusion models begin from a different premise. The target law is represented by data, not by a tractable formula, so we do not try to run Langevin dynamics directly on the data distribution. Instead we add noise to the data, forming a path of smoother laws. At each positive noise level, the score of the noised law is the local vector field that guides a reverse dynamics back toward the data.

The denoising-score connection underlying this approach goes back in part to the DDPM formulation of Ho, Jain, and Abbeel \[[26](https://arxiv.org/html/2607.01693#bib.bib26)\]. The continuous-time SDE and probability-flow formulation used here follows Yang Song, Sohl-Dickstein, Kingma, Kumar, Ermon, and Poole \[[44](https://arxiv.org/html/2607.01693#bib.bib44)\]. The basic picture is simple: at time $t$, the noised distribution $p_{t}$ is a blurred version of the data, and $\nabla\log p_{t}(x)$ points toward nearby regions where this blurred density is larger. Diffusion models learn this time-indexed field of local denoising directions.

### 3.1. Continuous-time forward noising

The cleanest conceptual starting point is continuous time. Let $X_{0}\sim p_{\mathsf{data}}$ and let $(X_{t})_{t\in[0,T]}$ solve a forward noising SDE

$$\mathrm{d}X_{t}=f_{t}(X_{t})\,\mathrm{d}t+g_{t}\,\mathrm{d}B_{t},\qquad X_{0}\sim p_{\mathsf{data}}.\tag{3.1}$$

Here $B_{t}$ is Brownian motion, $f_{t}$ is a drift field, and $g_{t}$ is a scalar diffusion coefficient. We write $p_{t}$ for the density of $X_{t}$. Most diffusion models use linear Gaussian forward processes for which, for deterministic functions $a_{t}$ and $\sigma_{t}$,

$$X_{t}\mid X_{0}\sim\mathcal{N}\!\left(a_{t}X_{0},\sigma_{t}^{2}I\right).\tag{3.2}$$

###### Example (Two common continuous schedules).

For the variance-exploding heat flow, with $Z\sim\mathcal{N}(0,I)$ independent of $X_{0}$, one has in law

$$X_{t}=X_{0}+\sqrt{t}\,Z,\qquad p_{t}=p_{\mathsf{data}}*\mathcal{N}(0,tI),$$

with Gaussian transition kernel

$$X_{t}\mid X_{0}\sim\mathcal{N}\!\left(X_{0},tI\right).$$

For the Ornstein–Uhlenbeck or variance-preserving flow with constant rate,

$$\mathrm{d}X_{t}=-\frac{1}{2}X_{t}\,\mathrm{d}t+\mathrm{d}B_{t},$$

one has

$$X_{t}\mid X_{0}\sim\mathcal{N}\!\left(e^{-t/2}X_{0},(1-e^{-t})I\right).$$

Both examples fit ([3.2](https://arxiv.org/html/2607.01693#S3.E2)); only $a_{t}$ and $\sigma_{t}$ differ. The variance-preserving convention has $a_{t}^{2}+\sigma_{t}^{2}=1$, so the total marginal variance is preserved when the data have unit scale.
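Because (3.2) is an explicit Gaussian, forward noising needs no SDE solver: one evaluates $a_{t}$ and $\sigma_{t}$ and draws a single Gaussian. The sketch below encodes the two schedules of the example; the function names are illustrative assumptions.

```python
import numpy as np

def ve_schedule(t):
    """Variance-exploding heat flow: a_t = 1, sigma_t = sqrt(t)."""
    t = np.asarray(t, dtype=float)
    return np.ones_like(t), np.sqrt(t)

def vp_schedule(t):
    """Constant-rate OU (variance-preserving) flow: a_t = exp(-t/2), sigma_t^2 = 1 - exp(-t)."""
    t = np.asarray(t, dtype=float)
    return np.exp(-0.5 * t), np.sqrt(-np.expm1(-t))

def noise_sample(x0, t, schedule, rng):
    """Draw X_t given X_0 = x0 from (3.2) as a_t * x0 + sigma_t * Z."""
    a, sigma = schedule(t)
    return a * x0 + sigma * rng.standard_normal(np.shape(x0))

ts = np.linspace(0.0, 5.0, 6)
a, sigma = vp_schedule(ts)
print("a_t^2 + sigma_t^2 =", a**2 + sigma**2)  # variance-preserving identity
print(noise_sample(np.ones(3), 1.0, vp_schedule, np.random.default_rng(0)))
```

The use of `np.expm1` avoids cancellation in $1-e^{-t}$ for small $t$, where $\sigma_{t}$ is close to zero.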

The two scalar functions $a_{t}$ and $\sigma_{t}$ summarize the signal-to-noise ratio. The coefficient $a_{t}$ says how much of the original sample remains visible in the conditional mean, while $\sigma_{t}$ says how much independent Gaussian uncertainty has been added. Early in the forward process, $\sigma_{t}$ is small and the score may be complicated because it reflects the fine structure of the data. At large noise, $p_{t}$ is smoother and closer to a simple reference law, so the reverse sampler has an easier starting point.

### 3.2. Continuous-time Tweedie identity

The most natural way to undo the forward noising is to denoise: given a noisy observation $X_{t}=x$, estimate the clean sample that produced it, namely the posterior mean $\mathbb{E}[X_{0}\mid X_{t}=x]$. The object that encodes this denoising is the *score* of the noised law,

$$\mathsf{s}^{\star}_{t}(x)=\nabla\log p_{t}(x),\tag{3.3}$$

the true score at time $t$, approximated in practice by a learned model $\mathsf{s}_{t}(x)\approx\mathsf{s}^{\star}_{t}(x)$. The Gaussian marginal formula ([3.2](https://arxiv.org/html/2607.01693#S3.E2)) makes the link between the score and denoising precise through a continuous-time form of Tweedie’s identity:

$$\mathsf{s}^{\star}_{t}(x)=\frac{1}{\sigma_{t}^{2}}\mathbb{E}[a_{t}X_{0}-X_{t}\mid X_{t}=x].\tag{3.4}$$

Equivalently, whenever $a_{t}\neq 0$, define the optimal denoiser

$$\mathsf{D}^{\star}_{t}(x):=\mathbb{E}[X_{0}\mid X_{t}=x]=a_{t}^{-1}\bigl(x+\sigma_{t}^{2}\mathsf{s}^{\star}_{t}(x)\bigr).$$

A learned score $\mathsf{s}_{t}$ similarly determines a learned denoiser

$$\mathsf{D}_{t}(x):=a_{t}^{-1}\bigl(x+\sigma_{t}^{2}\mathsf{s}_{t}(x)\bigr).$$

Thus denoising and score estimation are two ways of describing the same posterior information.
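The identity can be verified in a case where everything is explicit. The sketch below assumes one-dimensional Gaussian data $p_{\mathsf{data}}=\mathcal{N}(\mu,\tau^{2})$, so that $p_{t}=\mathcal{N}(a_{t}\mu,a_{t}^{2}\tau^{2}+\sigma_{t}^{2})$; it computes the posterior mean once through the score and once through Gaussian conditioning. All names and numerical values are illustrative assumptions.

```python
import numpy as np

def gaussian_score(x, a, sigma, mu, tau):
    """Score of p_t = N(a*mu, a^2 tau^2 + sigma^2) when p_data = N(mu, tau^2)."""
    return -(x - a * mu) / (a**2 * tau**2 + sigma**2)

def denoiser_from_score(x, a, sigma, score):
    """Tweedie form of the denoiser: D_t(x) = (x + sigma^2 * s_t(x)) / a."""
    return (x + sigma**2 * score) / a

def posterior_mean(x, a, sigma, mu, tau):
    """E[X_0 | X_t = x] from Gaussian conditioning, computed without the score."""
    gain = a * tau**2 / (a**2 * tau**2 + sigma**2)
    return mu + gain * (x - a * mu)

a, sigma, mu, tau = np.exp(-0.5), np.sqrt(1.0 - np.exp(-1.0)), 2.0, 0.5
x = np.linspace(-3.0, 3.0, 7)
via_score = denoiser_from_score(x, a, sigma, gaussian_score(x, a, sigma, mu, tau))
print(np.max(np.abs(via_score - posterior_mean(x, a, sigma, mu, tau))))
```

The two computations agree up to rounding, which is the content of (3.4) in this Gaussian case.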

###### Proof.

The density of $X_{t}$ is the Gaussian mixture

$$p_{t}(x)=\int\frac{1}{(2\pi\sigma_{t}^{2})^{d/2}}\exp\left(-\frac{\left\lVert x-a_{t}x_{0}\right\rVert^{2}}{2\sigma_{t}^{2}}\right)p_{\mathsf{data}}(\mathrm{d}x_{0}).$$

Write the Gaussian kernel in this integral as

$$\varphi_{t}(x\mid x_{0})=\frac{1}{(2\pi\sigma_{t}^{2})^{d/2}}\exp\left(-\frac{\left\lVert x-a_{t}x_{0}\right\rVert^{2}}{2\sigma_{t}^{2}}\right).$$

For fixed $x_{0}$,

$$\nabla_{x}\varphi_{t}(x\mid x_{0})=\frac{a_{t}x_{0}-x}{\sigma_{t}^{2}}\,\varphi_{t}(x\mid x_{0}).$$

Thus, differentiating under the integral,

$$\nabla p_{t}(x)=\frac{1}{\sigma_{t}^{2}}\int(a_{t}x_{0}-x)\,\varphi_{t}(x\mid x_{0})\,p_{\mathsf{data}}(\mathrm{d}x_{0}).$$

On the other hand, Bayes’ rule gives the conditional law of the clean sample given the noisy observation:

$$\mathbb{P}(X_{0}\in\mathrm{d}x_{0}\mid X_{t}=x)=\frac{\varphi_{t}(x\mid x_{0})\,p_{\mathsf{data}}(\mathrm{d}x_{0})}{p_{t}(x)}.$$

Dividing the formula for $\nabla p_{t}(x)$ by $p_{t}(x)$ and recognizing this conditional law therefore gives

$$\nabla\log p_{t}(x)=\frac{1}{\sigma_{t}^{2}}\mathbb{E}[a_{t}X_{0}-x\mid X_{t}=x],$$

which is the same as ([3.4](https://arxiv.org/html/2607.01693#S3.E4)), since the conditioning sets $X_{t}=x$. ∎

This Tweedie identity explains why the score can be learned from empirical data. Fix $t>0$. Under the Gaussian corruption $X_{t}=a_{t}X_{0}+\sigma_{t}Z$, with $Z\sim\mathcal{N}(0,I)$ independent of $X_{0}$, the random vector inside this conditional expectation becomes

$$\frac{a_{t}X_{0}-X_{t}}{\sigma_{t}^{2}}=-\frac{Z}{\sigma_{t}}.$$

Thus the conditional mean of $-Z/\sigma_{t}$ given $a_{t}X_{0}+\sigma_{t}Z=x$ is exactly $\mathsf{s}^{\star}_{t}(x)$, which motivates the regression population loss:

$$\mathbb{E}\left\lVert\mathsf{s}_{t}(a_{t}X_{0}+\sigma_{t}Z)+\frac{Z}{\sigma_{t}}\right\rVert^{2}.$$

Replacing the expectation over $X_{0}\sim p_{\mathsf{data}}$ by an empirical average over data points, while resampling the Gaussian noise, gives the basic denoising score-matching objective.
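The regression loss can be estimated by Monte Carlo at a single noise level. The sketch below assumes toy Gaussian data in two dimensions so that the true score is available in closed form, and compares the loss of the true score with that of a deliberately shifted field. No training is performed; the setup, names, and sample sizes are illustrative assumptions.

```python
import numpy as np

def dsm_loss(score_fn, x0, a, sigma, rng):
    """Monte Carlo estimate of E || s_t(a X_0 + sigma Z) + Z / sigma ||^2."""
    z = rng.standard_normal(x0.shape)
    xt = a * x0 + sigma * z
    resid = score_fn(xt) + z / sigma
    return np.mean(np.sum(resid**2, axis=-1))

# Toy data (an assumption): p_data = N(mu, tau^2 I) in d = 2, OU schedule at t = 1.
mu, tau = 1.0, 0.5
a, sigma = np.exp(-0.5), np.sqrt(1.0 - np.exp(-1.0))
var_t = a**2 * tau**2 + sigma**2
true_score = lambda x: -(x - a * mu) / var_t
shifted_score = lambda x: true_score(x) + 0.5   # wrong by 0.5 in each coordinate

x0 = mu + tau * np.random.default_rng(0).standard_normal((50_000, 2))
# Reusing the noise seed gives both fields the same corruptions.
loss_true = dsm_loss(true_score, x0, a, sigma, np.random.default_rng(1))
loss_shifted = dsm_loss(shifted_score, x0, a, sigma, np.random.default_rng(1))
print(loss_true, loss_shifted)
```

The loss of the true score is not zero, because the target $-Z/\sigma_{t}$ is noisy for each individual corruption; what the true score minimizes is the excess, which for the shifted field is about $\mathbb{E}\lVert\mathsf{s}_{t}-\mathsf{s}^{\star}_{t}\rVert^{2}=0.5$ here.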

This is why the learned field has a denoising interpretation rather than being an arbitrary vector field. The regression target is noisy for each individual corruption, but by ([3.4](https://arxiv.org/html/2607.01693#S3.E4)) its conditional average points from the observation $x$ toward the rescaled posterior mean $a_{t}\mathsf{D}^{\star}_{t}(x)$ of the clean sample. Equivalently, the score records the Bayes correction encoded by the noised data distribution.

### 3.3. Continuous-time reverse SDE

The forward SDE ([3.1](https://arxiv.org/html/2607.01693#S3.E1)) is designed to move data toward a simple law. For sampling, the object we need is not a pathwise inverse of the Brownian motion. It is enough to construct a Markov process whose one-time marginals run through the same densities in the opposite order: if $(p_{t})_{0\leq t\leq T}$ are the forward noising marginals, then the reverse sampler $(Y^{\leftarrow}_{s})_{0\leq s\leq T}$ should satisfy

$$\operatorname{Law}(Y^{\leftarrow}_{s})=p_{T-s}\qquad\text{for every }s\in[0,T].$$

This requirement is deliberately distributional. It says that the reverse sampler has the right snapshots, not that it retraces individual forward Brownian paths. We therefore first match the evolution of densities. Set $q_{s}=p_{T-s}$ and write $t=T-s$. The forward density satisfies

$$\partial_{t}p_{t}=-\nabla\cdot(f_{t}p_{t})+\frac{1}{2}g_{t}^{2}\Delta p_{t}.$$

Therefore, with $t=T-s$,

$$\partial_{s}q_{s}=\nabla\cdot(f_{t}p_{t})-\frac{1}{2}g_{t}^{2}\Delta p_{t}.$$

The key step uses the identity $\Delta p_{t}=\nabla\cdot(p_{t}\nabla\log p_{t})$, which brings in precisely the score $\mathsf{s}^{\star}_{t}=\nabla\log p_{t}$ from ([3.3](https://arxiv.org/html/2607.01693#S3.E3)). Splitting $-\tfrac{1}{2}g_{t}^{2}\Delta p_{t}=-g_{t}^{2}\Delta p_{t}+\tfrac{1}{2}g_{t}^{2}\Delta p_{t}$ and applying the identity to the first piece rewrites the density evolution as

$$\partial_{s}q_{s}=-\nabla\cdot\left[\left(-f_{t}+g_{t}^{2}\mathsf{s}^{\star}_{t}\right)p_{t}\right]+\frac{1}{2}g_{t}^{2}\Delta p_{t}.$$

This is the Fokker–Planck equation for the reverse-time diffusion

$$\mathrm{d}Y^{\leftarrow}_{s}=\left[-f_{T-s}(Y^{\leftarrow}_{s})+g_{T-s}^{2}\,\mathsf{s}^{\star}_{T-s}(Y^{\leftarrow}_{s})\right]\mathrm{d}s+g_{T-s}\,\mathrm{d}B^{\leftarrow}_{s},\qquad Y^{\leftarrow}_{0}\sim p_{T}.\tag{3.5}$$

Here $B^{\leftarrow}_{s}$ is Brownian motion in the reverse sampling time. By construction, if $Y^{\leftarrow}_{0}\sim p_{T}$, then $Y^{\leftarrow}_{s}\sim p_{T-s}$ for every $s\in[0,T]$.

The drift in the reverse SDE has two pieces. The term $-f_{T-s}$ reverses the deterministic transport part of the forward dynamics. The extra term $g_{T-s}^2\,\mathsf{s}^{\star}_{T-s}$ is the score introduced in Subsection [3.2](https://arxiv.org/html/2607.01693#S3.SS2) as the denoising direction, scaled by $g_{T-s}^2$: it corrects the density evolution caused by diffusion by nudging each sample back toward regions of higher noised density, i.e. toward the posterior mean of the clean sample. This is the precise sense in which the denoiser of the previous subsection *is* the engine of the reverse dynamics. Replacing the true score $\mathsf{s}^{\star}_t$ by a learned score $\mathsf{s}_t$ yields the score-based reverse SDE sampler, the continuous object that DDPM-type algorithms discretize.

It is instructive to specialize \([3\.5](https://arxiv.org/html/2607.01693#S3.E5)\) to the two schedules of the running example; we will revisit the same two processes from the probability\-flow viewpoint in the next subsection\.

###### Example (Variance-exploding reverse SDE).

For $X_t = X_0 + \sqrt{t}\,Z$, we have $f_t = 0$ and $g_t = 1$, so the reverse SDE is

$$\mathrm{d}Y^{\leftarrow}_s = \mathsf{s}^{\star}_{T-s}(Y^{\leftarrow}_s)\,\mathrm{d}s + \mathrm{d}B^{\leftarrow}_s, \qquad Y^{\leftarrow}_0 \sim p_T.$$

The whole drift is the score: starting from a high-noise draw, the sampler repeatedly steps along the denoising direction $\mathsf{s}^{\star}_{T-s}$ while fresh Brownian noise is injected.

###### Example (Variance-preserving Ornstein–Uhlenbeck reverse SDE).

For

$$\mathrm{d}X_t = -\frac{1}{2} X_t\,\mathrm{d}t + \mathrm{d}B_t,$$

we have $f_t(x) = -\frac{1}{2}x$ and $g_t = 1$, so the reverse SDE is

$$\mathrm{d}Y^{\leftarrow}_s = \left[\frac{1}{2} Y^{\leftarrow}_s + \mathsf{s}^{\star}_{T-s}(Y^{\leftarrow}_s)\right]\mathrm{d}s + \mathrm{d}B^{\leftarrow}_s, \qquad Y^{\leftarrow}_0 \sim p_T.$$

The term $\frac{1}{2} Y^{\leftarrow}_s$ undoes the OU contraction toward the origin, while the score term $\mathsf{s}^{\star}_{T-s}$ supplies the denoising correction.

In practice the reverse SDE is run in discrete time, and this discretization is exactly DDPM. One fixes a grid in the sampling time $s$, replaces the true score $\mathsf{s}^{\star}_{T-s}$ by the learned model $\mathsf{s}_{T-s}$, and takes Euler–Maruyama steps of ([3.5](https://arxiv.org/html/2607.01693#S3.E5)): each step nudges the current sample along the denoising drift and then adds fresh Gaussian noise of the appropriate variance. This is the DDPM sampler of Ho, Jain, and Abbeel \[[26](https://arxiv.org/html/2607.01693#bib.bib26)\], the stochastic counterpart of the deterministic DDIM update we will read off from the probability-flow ODE in the next subsection. The detailed Gaussian transition kernels and the resulting discretization error are taken up in Section [5](https://arxiv.org/html/2607.01693#S5).
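The Euler–Maruyama recipe just described can be sketched for the variance-preserving OU example. The code assumes one-dimensional Gaussian data $\mathcal{N}(m, v)$, for which $p_t = \mathcal{N}(m e^{-t/2},\, v e^{-t} + 1 - e^{-t})$ and the score is available in closed form; that exact score stands in for a learned network, and the chain is started from the exact $p_T$ rather than from a standard Gaussian. The helper names and parameter values are hypothetical. This is a plain Euler–Maruyama discretization of ([3.5](https://arxiv.org/html/2607.01693#S3.E5)), not the specific DDPM parametrization analyzed later in the notes.

```python
import math
import random


def vp_marginal(t, mean0, var0):
    """Mean and variance of p_t for dX = -X/2 dt + dB with X_0 ~ N(mean0, var0)."""
    decay = math.exp(-t)
    return mean0 * math.sqrt(decay), var0 * decay + 1.0 - decay


def reverse_sde_em(n_particles, n_steps, T, mean0, var0, seed=0):
    """Euler-Maruyama steps of the reverse SDE, using the exact Gaussian score."""
    rng = random.Random(seed)
    h = T / n_steps
    sqrt_h = math.sqrt(h)
    mu_T, v_T = vp_marginal(T, mean0, var0)
    ys = [rng.gauss(mu_T, math.sqrt(v_T)) for _ in range(n_particles)]  # Y_0 ~ p_T
    for k in range(n_steps):
        mu, v = vp_marginal(T - k * h, mean0, var0)  # marginal at forward time t = T - s
        for i in range(n_particles):
            y = ys[i]
            score = -(y - mu) / v        # exact score of N(mu, v)
            drift = 0.5 * y + score      # -f + g^2 * score with f(x) = -x/2, g = 1
            ys[i] = y + h * drift + sqrt_h * rng.gauss(0.0, 1.0)
    return ys


samples = reverse_sde_em(3000, 300, 3.0, 2.0, 0.25)
mean_hat = sum(samples) / len(samples)
var_hat = sum((y - mean_hat) ** 2 for y in samples) / len(samples)
print("sample mean and variance:", mean_hat, var_hat)  # target law: N(2.0, 0.25)
```

Up to discretization bias and Monte Carlo error, the output samples should have the mean and variance of the data law.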

### 3.4. Probability flow ODE

There is a deterministic ODE whose one-time marginals are the same as those of the forward SDE ([3.1](https://arxiv.org/html/2607.01693#S3.E1)). This ODE is called the *probability flow ODE*. The probability flow ODE replaces Brownian randomness by a transport field that has the same effect on marginal densities. Define the velocity field

$$v_t(x) = f_t(x) - \frac{1}{2} g_t^2\,\mathsf{s}^{\star}_t(x) = f_t(x) - \frac{1}{2} g_t^2\,\nabla\log p_t(x). \tag{3.6}$$

The probability flow ODE is

$$\frac{\mathrm{d}X_t}{\mathrm{d}t} = v_t(X_t). \tag{3.7}$$

If $X_0 \sim p_0$, then the solution of ([3.7](https://arxiv.org/html/2607.01693#S3.E7)) has marginal density $p_t$ at every time $t$ for which the ODE is well posed.

###### Derivation from the Fokker–Planck equation.

The forward SDE ([3.1](https://arxiv.org/html/2607.01693#S3.E1)) has Fokker–Planck equation

$$\partial_t p_t = -\nabla\cdot(f_t p_t) + \frac{1}{2} g_t^2 \Delta p_t. \tag{3.8}$$

Use the identity

$$\Delta p_t = \nabla\cdot(\nabla p_t) = \nabla\cdot(p_t \nabla\log p_t) = \nabla\cdot(p_t\,\mathsf{s}^{\star}_t).$$

Then ([3.8](https://arxiv.org/html/2607.01693#S3.E8)) becomes

$$\partial_t p_t = -\nabla\cdot(f_t p_t) + \frac{1}{2} g_t^2\,\nabla\cdot(p_t\,\mathsf{s}^{\star}_t) = -\nabla\cdot\left[\left(f_t - \frac{1}{2} g_t^2\,\mathsf{s}^{\star}_t\right) p_t\right].$$

This is exactly the continuity equation

$$\partial_t p_t + \nabla\cdot(p_t v_t) = 0 \tag{3.9}$$

for the density transported by the deterministic flow $\dot{X}_t = v_t(X_t)$. Since $p_t$ satisfies this equation with initial condition $p_0$, the ODE flow pushes $p_0$ forward to $p_t$. ∎

###### Example (Variance-exploding probability flow).

For $X_t = X_0 + \sqrt{t}\,Z$, we have $f_t = 0$ and $g_t = 1$. The probability flow velocity is

$$v_t(x) = -\frac{1}{2}\nabla\log p_t(x).$$

The score points toward higher density, so $-\frac{1}{2}\nabla\log p_t$ pushes mass outward and reproduces the smoothing effect of the variance-exploding noising process. To generate samples, one integrates the same ODE backward in time, which reverses this velocity.

###### Example (Variance-preserving Ornstein–Uhlenbeck flow).

For

$$\mathrm{d}X_t = -\frac{1}{2} X_t\,\mathrm{d}t + \mathrm{d}B_t,$$

the probability flow velocity is

$$v_t(x) = -\frac{1}{2}x - \frac{1}{2}\nabla\log p_t(x).$$

The first term is the deterministic OU contraction toward the origin. The second term is the transport representation of the Brownian smoothing.

The reverse SDE and the probability flow ODE are two different machines that produce the same snapshots when the score is exact\. The SDE machine keeps injecting randomness; the ODE machine deterministically transports particles \(starting from random initial data\)\. Since many numerical and control questions depend on paths, not only snapshots, one should not freely replace one machine by the other without checking what quantity is being analyzed\.

For sampling, we use the same reverse-time convention as in ([3.5](https://arxiv.org/html/2607.01693#S3.E5)): set $s = T-t$, start from a high-noise draw with law $p_T$, and integrate from $s=0$ to $s=T$. The reverse-time probability-flow sampler is the ODE

$$\frac{\mathrm{d}Z^{\leftarrow}_s}{\mathrm{d}s} = -v_{T-s}(Z^{\leftarrow}_s) = -f_{T-s}(Z^{\leftarrow}_s) + \frac{1}{2} g_{T-s}^2\,\mathsf{s}^{\star}_{T-s}(Z^{\leftarrow}_s), \qquad Z^{\leftarrow}_0 \sim p_T.$$

By the same continuity-equation calculation, $Z^{\leftarrow}_s$ has law $p_{T-s}$ when the score is exact. Compare this with the reverse SDE drift in ([3.5](https://arxiv.org/html/2607.01693#S3.E5)): $-f_{T-s} + g_{T-s}^2\,\mathsf{s}^{\star}_{T-s}$. The DDIM sampler of Song, Meng, and Ermon \[[42](https://arxiv.org/html/2607.01693#bib.bib42)\] is the standard sampler on this probability-flow side: it uses the same trained denoising model as DDPM, but follows the deterministic update suggested by the ODE viewpoint.
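The reverse-time probability-flow ODE above can be integrated with a forward Euler scheme and compared against a closed form. The sketch assumes the variance-exploding schedule ($f = 0$, $g = 1$) and one-dimensional Gaussian data $\mathcal{N}(m, v)$, so that $\mathsf{s}^{\star}_t(x) = -(x-m)/(v+t)$ and the exact flow map is $Z^{\leftarrow}_T = m + (Z^{\leftarrow}_0 - m)\sqrt{v/(v+T)}$. The function names and numbers are illustrative, and plain Euler stepping is used here for simplicity; it is not the DDIM update itself.

```python
import math


def ve_score(x, t, mean0, var0):
    """Exact score of p_t = N(mean0, var0 + t) under the VE schedule."""
    return -(x - mean0) / (var0 + t)


def reverse_pf_ode(z0, n_steps, T, mean0, var0):
    """Euler integration of dZ/ds = -v_{T-s}(Z) = (1/2) * score_{T-s}(Z)."""
    h = T / n_steps
    z = z0
    for k in range(n_steps):
        t = T - k * h  # forward time matching sampling time s = k * h
        z += h * 0.5 * ve_score(z, t, mean0, var0)
    return z


T, mean0, var0 = 4.0, 2.0, 1.0
z0 = 5.0  # one high-noise starting point
z_end = reverse_pf_ode(z0, 1000, T, mean0, var0)
z_exact = mean0 + (z0 - mean0) * math.sqrt(var0 / (var0 + T))
print("Euler:", z_end, "closed form:", z_exact)
```

No noise is injected after the initial draw, so a fixed starting point always maps to the same output; all randomness of the sampler sits in $Z^{\leftarrow}_0 \sim p_T$.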

The deterministic nature of the ODE has an important consequence for likelihoods. Along a solution of $\dot{X}_t = v_t(X_t)$, the continuity equation ([3.9](https://arxiv.org/html/2607.01693#S3.E9)) implies

$$\frac{\mathrm{d}}{\mathrm{d}t}\log p_t(X_t) = -\nabla\cdot v_t(X_t). \tag{3.10}$$

Indeed,

$$\frac{\mathrm{d}}{\mathrm{d}t}\log p_t(X_t) = \partial_t\log p_t(X_t) + \left\langle v_t(X_t), \nabla\log p_t(X_t)\right\rangle,$$

and dividing ([3.9](https://arxiv.org/html/2607.01693#S3.E9)) by $p_t$ gives

$$\partial_t\log p_t + \left\langle v_t, \nabla\log p_t\right\rangle = -\nabla\cdot v_t.$$

Therefore, if the map $X_0 \mapsto X_T$ is obtained by integrating the probability flow ODE forward,

$$\log p_0(X_0) = \log p_T(X_T) + \int_0^T \nabla\cdot v_t(X_t)\,\mathrm{d}t.$$

This is the continuous normalizing-flow identity used to compute likelihoods from probability flow ODEs.
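The likelihood identity above can be tested in a setting where $\log p_0$ is known. The sketch assumes the variance-exploding schedule and one-dimensional Gaussian data $\mathcal{N}(m, v)$, for which $v_t(x) = (x-m)/(2(v+t))$ and $\nabla\cdot v_t = 1/(2(v+t))$. It integrates the probability flow ODE forward with Euler steps while accumulating the divergence; the function names and parameter values are hypothetical.

```python
import math


def log_normal_pdf(x, mean, var):
    """Log-density of N(mean, var) at x."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)


def log_likelihood_via_flow(x0, n_steps, T, mean0, var0):
    """Estimate log p_0(x0) as log p_T(X_T) + int_0^T div v_t(X_t) dt."""
    h = T / n_steps
    x, div_integral = x0, 0.0
    for k in range(n_steps):
        var_t = var0 + k * h                  # variance of p_t at t = k * h
        velocity = 0.5 * (x - mean0) / var_t  # v_t = -(1/2) * score for f = 0, g = 1
        divergence = 0.5 / var_t              # d/dx of the velocity field
        x += h * velocity                     # Euler step of the probability flow ODE
        div_integral += h * divergence        # left Riemann sum of the divergence
    return log_normal_pdf(x, mean0, var0 + T) + div_integral


x0, T, mean0, var0 = 3.0, 4.0, 2.0, 1.0
estimate = log_likelihood_via_flow(x0, 2000, T, mean0, var0)
exact = log_normal_pdf(x0, mean0, var0)
print("flow estimate:", estimate, "exact log p_0:", exact)
```

In higher dimensions the divergence is not available in closed form and is usually estimated, which this one-dimensional sketch does not attempt.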

## 4. Stochastic Localization and Polchinski Flow

The previous section introduced diffusion models through Gaussian noising. We started from data, followed the noised marginals $p_t$, learned the score $\nabla\log p_t$, and used that score to run either a reverse SDE or a probability-flow ODE. This section looks at the same Gaussian channel from two complementary angles: stochastic localization and Polchinski flow. All three viewpoints start from a clean random variable and the noisy observation (here and below we take the variance-exploding one)

$$X_t = X_{\star} + \sqrt{t}\,Z,$$

but they organize the information in different ways. Diffusion models emphasize reverse-time sampling. Stochastic localization asks a Bayesian question: as the noisy observation becomes more precise, how does the posterior law of the hidden signal $X_{\star}$ evolve? Polchinski flow asks a density-level question: as the noise level changes, how do the smoothed density $p_t$ and the effective potential $U_t = -\log p_t$ evolve?

These two changes of perspective are useful because they expose structure that is less visible from the reverse SDE alone\. In stochastic localization, the score appears as a posterior denoising correction\. In Polchinski flow, the score is the negative gradient of a coarse\-grained energy landscape\. Both viewpoints lead naturally to quantitative tools, especially covariance identities, that will be useful later\.

We begin by treating stochastic localization as an observation model for the same Gaussian channel. This lets us compare diffusion time with localization precision, derive the posterior law, and connect its mean to the score. We then record the martingale structure that makes localization useful. Finally, we return to the density $p_t$ itself and write the corresponding Polchinski flow for the effective potential $U_t$.

### 4.1. Stochastic localization

Let $X_{\star} \sim \mu$ be the signal, or clean data point. Here we write the law as $\mu$ rather than $p_{\mathsf{data}}$, to emphasize that the construction applies to a general distribution and not only to the data law. Start with the Gaussian smoothing channel

$$X_t = X_{\star} + \sqrt{t}\,Z, \qquad Z \sim \mathcal{N}(0, I).$$

Here $t$ is the noise variance, and $p_t = \operatorname{Law}(X_t)$ is the Gaussian convolution of the data law with covariance $tI$. As $t$ increases, the density becomes smoother; as $t$ decreases, the channel approaches the original data distribution.

Stochastic localization uses the same Gaussian channel, but parametrizes it by precision rather than variance\. Observe the same unknownX⋆X\_\{\\star\}through the continuous\-time Gaussian observation process

$$\mathrm{d}Y_u = X_{\star}\,\mathrm{d}u + \mathrm{d}W_u, \qquad Y_0 = 0. \tag{4.1}$$

Here $W_u$ is Brownian motion independent of $X_{\star}$. Conditional on $X_{\star} = x$, we have $Y_u \sim \mathcal{N}(ux, uI)$, and thus after a rescaling:

$$\bar{Y}_u := u^{-1} Y_u \sim \mathcal{N}\!\left(x, u^{-1} I\right).$$

Hence the localization time $u$ corresponds to the diffusion noise variance $t = u^{-1}$:

$$\bar{Y}_u = X_{\star} + \sqrt{t}\,Z, \qquad t = \frac{1}{u}.$$

The same noised observation can therefore be indexed either by variance $t$ or by precision $u$. Diffusion notation emphasizes smoothing as $t$ increases. Localization notation emphasizes Bayesian inference as $u$ increases and the observation becomes more informative. This precision parametrization is useful because the posterior mean and covariance then obey simple martingale identities, as we will discuss below.

We now write this posterior law explicitly. At precision $u$, the observation is $Y_u$, and the basic object is the conditional law of the hidden signal $X_{\star}$. Let $\mu_u(\cdot\mid y) = \operatorname{Law}(X_{\star}\mid Y_u = y)$ be the posterior law after observing $Y_u = y$. Bayes' rule gives

$$\mu_u(\mathrm{d}x\mid Y_u = y) = \frac{1}{Z_u(y)}\exp\left\{\left\langle y, x\right\rangle - \frac{u}{2}\left\lVert x\right\rVert^2\right\}\mu(\mathrm{d}x). \tag{4.2}$$

The normalizing constant is

$$Z_u(y) = \int \exp\left\{\left\langle y, x\right\rangle - \frac{u}{2}\left\lVert x\right\rVert^2\right\}\mu(\mathrm{d}x).$$

This is why the method is called localization: the posterior is the original measure tilted by a random linear field and penalized by a growing quadratic function. As $u$ grows, the posterior becomes increasingly concentrated near the hidden signal.

###### Derivation of ([4.2](https://arxiv.org/html/2607.01693#S4.E2)).

Given $X_{\star} = x$, the observation $Y_u$ has density proportional to

$$\exp\left(-\frac{\left\lVert y - ux\right\rVert^2}{2u}\right).$$

Expanding the square,

$$-\frac{\left\lVert y - ux\right\rVert^2}{2u} = -\frac{\left\lVert y\right\rVert^2}{2u} + \left\langle y, x\right\rangle - \frac{u}{2}\left\lVert x\right\rVert^2.$$

The first term is independent of $x$ and is absorbed into the normalizing constant. Multiplying by the prior measure $\mu(\mathrm{d}x)$ gives ([4.2](https://arxiv.org/html/2607.01693#S4.E2)). ∎
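The tilt formula ([4.2](https://arxiv.org/html/2607.01693#S4.E2)) is easy to evaluate when the prior has finitely many atoms. The sketch below assumes a hypothetical three-atom prior on the real line and compares the tilt with a direct application of Bayes' rule using the Gaussian likelihood $Y_u \mid X_{\star} = x \sim \mathcal{N}(ux, u)$; the two agree because they differ only by a factor independent of $x$. The function names, atoms, and weights are illustrative.

```python
import math


def posterior_tilt(y, u, atoms, prior):
    """Posterior weights from the exponential tilt exp(y*x - u*x^2/2) of the prior."""
    logw = [math.log(p) + y * x - 0.5 * u * x * x for x, p in zip(atoms, prior)]
    top = max(logw)  # subtract the maximum for numerical stability
    w = [math.exp(l - top) for l in logw]
    total = sum(w)
    return [wi / total for wi in w]


def posterior_bayes(y, u, atoms, prior):
    """Posterior weights from Bayes' rule with likelihood Y_u | X = x ~ N(u*x, u)."""
    w = [p * math.exp(-(y - u * x) ** 2 / (2.0 * u)) for x, p in zip(atoms, prior)]
    total = sum(w)
    return [wi / total for wi in w]


atoms, prior = [-1.0, 0.5, 2.0], [0.2, 0.5, 0.3]
for u in (0.5, 5.0, 50.0):
    y = 2.0 * u + 0.3 * math.sqrt(u)  # a typical observation when the hidden signal is 2.0
    print(u, [round(p, 4) for p in posterior_tilt(y, u, atoms, prior)])
```

As $u$ grows, the printed weights move onto the atom at $2.0$, which is the concentration described in the text.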

Now put the random observation back in and write

$$\mu_u = \mu_u(\cdot\mid Y_u), \qquad M_u = \int x\,\mu_u(\mathrm{d}x).$$

The original stochastic-localization viewpoint is that the random measure $(\mu_u)_{u\geq 0}$ itself evolves by a measure-valued SDE, which we now derive. Let $\mathcal{F}_u^{Y} = \sigma(Y_r : 0\leq r\leq u)$ and define the innovation process

$$I_u = Y_u - \int_0^u M_s\,\mathrm{d}s, \qquad \mathrm{d}I_u = \mathrm{d}Y_u - M_u\,\mathrm{d}u.$$

This subtracts the part of the next observation that is already predictable from the current posterior. Intuitively, $I_u$ records only the "surprise" in the observation stream: after conditioning on $\mathcal{F}_u^{Y}$, the drift of $\mathrm{d}Y_u$ is $M_u\,\mathrm{d}u$, so subtracting it leaves an increment with no predictable component and the same accumulated covariance as the original Brownian noise. More formally, one can check that $I_u$ is a continuous $\mathcal{F}_u^{Y}$-martingale whose accumulated covariance over $[0,u]$ is $uI$; Lévy's characterization then identifies it as a Brownian motion with respect to $\mathcal{F}_u^{Y}$. The posterior law satisfies

$$\mathrm{d}\mu_u(\mathrm{d}x) = \left\langle x - M_u, \mathrm{d}I_u\right\rangle\,\mu_u(\mathrm{d}x). \tag{4.3}$$

Equivalently, for every bounded test function $f$,

$$\mathrm{d}\int f(x)\,\mu_u(\mathrm{d}x) = \left\langle \int f(x)(x - M_u)\,\mu_u(\mathrm{d}x), \mathrm{d}I_u\right\rangle.$$

The same identity applies componentwise to vector-valued test functions when the relevant moments are finite. In particular, for every Borel set $A$, the process $u\mapsto \mu_u(A) = \Pr(X_{\star}\in A\mid \mathcal{F}_u^{Y})$ is a martingale:

$$\mathbb{E}[\mu_v(A)\mid \mathcal{F}_u^{Y}] = \mu_u(A), \qquad v\geq u.$$

Thus, in this formulation, stochastic localization is a measure-valued martingale whose random tilt concentrates the prior while preserving conditional expectations.

###### Proof of ([4.3](https://arxiv.org/html/2607.01693#S4.E3)).

It is enough to compute the differential of the normalized posterior weight. From ([4.2](https://arxiv.org/html/2607.01693#S4.E2)), along the random observation path,

$$\frac{\mu_u(\mathrm{d}x)}{\mu(\mathrm{d}x)} = \frac{1}{Z_u}\exp\left\{\left\langle Y_u, x\right\rangle - \frac{u}{2}\left\lVert x\right\rVert^2\right\}.$$

First, Itô's formula gives

$$
\begin{aligned}
\mathrm{d}\exp\left\{\left\langle Y_u, x\right\rangle - \frac{u}{2}\left\lVert x\right\rVert^2\right\}
&= \exp\left\{\left\langle Y_u, x\right\rangle - \frac{u}{2}\left\lVert x\right\rVert^2\right\}\left(\left\langle x, \mathrm{d}Y_u\right\rangle - \frac{1}{2}\left\lVert x\right\rVert^2\,\mathrm{d}u\right) \\
&\qquad + \frac{1}{2}\exp\left\{\left\langle Y_u, x\right\rangle - \frac{u}{2}\left\lVert x\right\rVert^2\right\}\left\lVert x\right\rVert^2\,\mathrm{d}u \\
&= \exp\left\{\left\langle Y_u, x\right\rangle - \frac{u}{2}\left\lVert x\right\rVert^2\right\}\left\langle x, \mathrm{d}Y_u\right\rangle.
\end{aligned}
$$

The middle line contains the Itô correction; it cancels the drift from the term $-\frac{u}{2}\left\lVert x\right\rVert^2$ in the exponent. Integrating this identity in $x$ gives

$$\frac{\mathrm{d}Z_u}{Z_u} = \left\langle M_u, \mathrm{d}Y_u\right\rangle, \qquad \mathrm{d}\log Z_u = \left\langle M_u, \mathrm{d}Y_u\right\rangle - \frac{1}{2}\left\lVert M_u\right\rVert^2\,\mathrm{d}u.$$

Therefore the log-density of $\mu_u$ with respect to $\mu$ satisfies

$$\mathrm{d}\log\frac{\mu_u(\mathrm{d}x)}{\mu(\mathrm{d}x)} = \left\langle x - M_u, \mathrm{d}Y_u\right\rangle - \frac{1}{2}\bigl(\left\lVert x\right\rVert^2 - \left\lVert M_u\right\rVert^2\bigr)\,\mathrm{d}u.$$

Applying Itô's formula once more, now to the exponential of this log-density, gives

$$\frac{\mathrm{d}\mu_u(\mathrm{d}x)}{\mu_u(\mathrm{d}x)} = \left\langle x - M_u, \mathrm{d}Y_u\right\rangle - \left\langle x - M_u, M_u\right\rangle\,\mathrm{d}u = \left\langle x - M_u, \mathrm{d}I_u\right\rangle.$$

This is exactly ([4.3](https://arxiv.org/html/2607.01693#S4.E3)); integrating against a bounded test function $f$ gives the weak form. ∎

### 4.2. Posterior means and martingale structure

The previous subsection constructed the posterior law and its measure-valued martingale equation. We now pass to its first two moments. First we freeze the observation and connect the posterior mean to the diffusion-model score; then we put the random observation back in and record the martingale dynamics.

For a deterministic observation $y$, define the posterior mean and covariance

$$m_u(y) = \mathbb{E}[X_{\star}\mid Y_u = y], \qquad \Sigma_u(y) = \operatorname{Cov}(X_{\star}\mid Y_u = y).$$

The posterior mean is exactly a denoiser. Indeed, since $\bar{Y}_u = Y_u/u = X_{\star} + u^{-1/2} Z$, a deterministic observation $y$ corresponds to the noised sample $y/u$, and Tweedie's identity ([3.4](https://arxiv.org/html/2607.01693#S3.E4)) gives

$$m_u(y) = \frac{y}{u} + \frac{1}{u}\nabla\log p_{1/u}\!\left(\frac{y}{u}\right), \tag{4.4}$$

where $p_t$ is the density of $X_{\star} + \sqrt{t}\,Z$. Equivalently,

$$\nabla\log p_{1/u}(y/u) = u\bigl(m_u(y) - y/u\bigr).$$

Thus learning a score is the same problem as learning the localization posterior mean.

###### Specializing Tweedie to the localization channel.

The rescaled observation $\bar{Y}_u = X_{\star} + u^{-1/2} Z$ is the forward channel $X_t = X_{\star} + \sqrt{t}\,Z$ at time $t = 1/u$, in the variance-exploding convention $a_t = 1$, $\sigma_t^2 = t = 1/u$. Thus the continuous-time Tweedie identity ([3.4](https://arxiv.org/html/2607.01693#S3.E4)) reads, in the current notation,

$$\nabla\log p_{1/u}(y/u) = u\,\mathbb{E}[X_{\star} - y/u\mid Y_u = y] = u\bigl(m_u(y) - y/u\bigr),$$

where we used $\mathbb{E}[X_{\star}\mid Y_u = y] = m_u(y)$. Solving for $m_u(y)$ then yields ([4.4](https://arxiv.org/html/2607.01693#S4.E4)). ∎

The covariance, in turn, is the linear response of this denoiser: differentiating the posterior tilt ([4.2](https://arxiv.org/html/2607.01693#S4.E2)) with respect to the observation gives

$$\nabla_y m_u(y) = \Sigma_u(y). \tag{4.5}$$

Combining this with Tweedie's identity ([4.4](https://arxiv.org/html/2607.01693#S4.E4)), and writing $t = 1/u$, gives the score-Jacobian identity

$$\nabla\mathsf{s}^{\star}_t(x) = t^{-2}\,\Sigma_{1/t}(x/t) - t^{-1} I. \tag{4.6}$$

Up to this time change and an affine rescaling, then, the Hessian of the noised log-density is precisely the posterior covariance of the localization process.
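As a sanity check, the three identities ([4.4](https://arxiv.org/html/2607.01693#S4.E4))–([4.6](https://arxiv.org/html/2607.01693#S4.E6)) can be verified by hand for a one-dimensional Gaussian prior $\mu = \mathcal{N}(0, \tau^2)$. This prior and the symbol $\tau$ are illustrative assumptions, not an example taken from the notes. Multiplying the tilt in ([4.2](https://arxiv.org/html/2607.01693#S4.E2)) by the Gaussian prior density gives a Gaussian posterior with precision $u + \tau^{-2}$, while $p_t = \mathcal{N}(0, \tau^2 + t)$, so that

$$
\begin{aligned}
m_u(y) &= \frac{\tau^2 y}{1 + u\tau^2}, \qquad \Sigma_u(y) = \frac{\tau^2}{1 + u\tau^2}, \qquad \mathsf{s}^{\star}_t(x) = -\frac{x}{\tau^2 + t}, \\
\frac{y}{u} + \frac{1}{u}\,\mathsf{s}^{\star}_{1/u}\!\left(\frac{y}{u}\right) &= \frac{y}{u} - \frac{y}{u\,(1 + u\tau^2)} = \frac{\tau^2 y}{1 + u\tau^2} = m_u(y), \\
\nabla_y m_u(y) &= \frac{\tau^2}{1 + u\tau^2} = \Sigma_u(y), \\
t^{-2}\,\Sigma_{1/t}(x/t) - t^{-1} &= \frac{\tau^2}{t\,(t + \tau^2)} - \frac{1}{t} = -\frac{1}{\tau^2 + t} = \nabla\mathsf{s}^{\star}_t(x).
\end{aligned}
$$

In this Gaussian case the posterior covariance does not depend on the observation, which is special to Gaussian priors; for a general $\mu$ it varies with $y$.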

Now return to the random localization path. We write $M_u$ and $\Sigma_u$ below for the random evaluations $m_u(Y_u)$ and $\Sigma_u(Y_u)$. The measure-valued equation ([4.3](https://arxiv.org/html/2607.01693#S4.E3)) then turns the fixed-observation posterior moments into stochastic dynamics.

Start with the mean. Applying the weak form of ([4.3](https://arxiv.org/html/2607.01693#S4.E3)) componentwise to the test function $f(x) = x$ gives, under standard integrability assumptions,

$$\mathrm{d}M_u = \left(\int x\,(x - M_u)^{\top}\mu_u(\mathrm{d}x)\right)\mathrm{d}I_u = \Sigma_u\,\mathrm{d}I_u. \tag{4.7}$$

Thus the mean moves only in response to the innovation Brownian motion, and the size and direction of that motion are governed by the current posterior covariance. In particular, the posterior mean is a martingale: before seeing the next small piece of data, the current posterior mean is the best prediction of the next posterior mean.

###### Why the posterior mean is a martingale.

First note that $M_u$, although defined by conditioning on the single observation $Y_u$, is unchanged if we condition on the entire history $\mathcal{F}_u^{Y} = \sigma(Y_r : 0\leq r\leq u)$. The reason is that $Y_u$ is a sufficient statistic for $X_{\star}$: by Girsanov's theorem (Appendix [A](https://arxiv.org/html/2607.01693#A1)), the likelihood of the path $(Y_r)_{0\leq r\leq u}$ under the signal value $X_{\star} = x$, relative to the Brownian motion, is

$$\exp\!\left(\int_0^u \left\langle x, \mathrm{d}Y_r\right\rangle - \frac{u}{2}\left\lVert x\right\rVert^2\right) = \exp\!\left(\left\langle x, Y_u\right\rangle - \frac{u}{2}\left\lVert x\right\rVert^2\right),$$

which depends on the path only through the endpoint $Y_u$. This is exactly the tilt appearing in ([4.2](https://arxiv.org/html/2607.01693#S4.E2)), so the posterior given the whole history equals the posterior given $Y_u$, and in particular

$$\mathbb{E}[X_{\star}\mid \mathcal{F}_u^{Y}] = \mathbb{E}[X_{\star}\mid Y_u] = M_u.$$

The martingale property now follows from the tower property. For $v\geq u$, since $\mathcal{F}_u^{Y}\subseteq \mathcal{F}_v^{Y}$,

$$\mathbb{E}[M_v\mid \mathcal{F}_u^{Y}] = \mathbb{E}\bigl[\mathbb{E}[X_{\star}\mid \mathcal{F}_v^{Y}]\mid \mathcal{F}_u^{Y}\bigr] = \mathbb{E}[X_{\star}\mid \mathcal{F}_u^{Y}] = M_u.$$

∎

The covariance, which sets the size of these mean fluctuations, has an evolution of its own. Unlike the mean, it carries a drift, and that drift is dissipative: in one dimension,

$$\frac{\mathrm{d}}{\mathrm{d}u}\mathbb{E}[\Sigma_u] = -\mathbb{E}[\Sigma_u^2] \leq 0,$$

and in multiple dimensions the trace satisfies

$$\frac{\mathrm{d}}{\mathrm{d}u}\mathbb{E}[\operatorname{Tr}\Sigma_u] = -\mathbb{E}[\operatorname{Tr}(\Sigma_u^2)] \leq 0. \tag{4.8}$$

Integrate ([4.8](https://arxiv.org/html/2607.01693#S4.E8)) over $u\in[0,\infty)$: the fundamental theorem of calculus gives

$$\int_0^{\infty}\mathbb{E}\operatorname{Tr}(\Sigma_u^2)\,\mathrm{d}u = \mathbb{E}\operatorname{Tr}\Sigma_0 - \lim_{u\to\infty}\mathbb{E}\operatorname{Tr}\Sigma_u \leq \mathbb{E}\operatorname{Tr}\Sigma_0,$$

where the inequality holds because $\operatorname{Tr}\Sigma_u \geq 0$. At $u = 0$ no observation has yet been made, so $\Sigma_0 = \operatorname{Cov}(X_{\star})$; hence if $\operatorname{Cov}(X_{\star}) \preceq I$,

$$\int_0^{\infty}\mathbb{E}\operatorname{Tr}(\Sigma_u^2)\,\mathrm{d}u \leq \mathbb{E}\operatorname{Tr}\Sigma_0 \leq d, \tag{4.9}$$

so the total squared covariance accumulated along the entire path is bounded by the dimension. This is the geometric content of localization: the random posterior measure steadily concentrates, while its mean moves by martingale increments whose size is set by the uncertainty that remains.

###### One-dimensional covariance calculation.

In one dimension the posterior variance splits into a second moment and the square of the mean,

$$\Sigma_u = \mathbb{E}[X_{\star}^2\mid Y_u] - M_u^2,$$

and we track the two pieces separately. For the second moment, the weak form of ([4.3](https://arxiv.org/html/2607.01693#S4.E3)) with test function $f(x) = x^2$ gives

$$\mathrm{d}\,\mathbb{E}[X_{\star}^2\mid Y_u] = \Bigl(\int x^2 (x - M_u)\,\mu_u(\mathrm{d}x)\Bigr)\mathrm{d}I_u = \Bigl(\mathbb{E}[X_{\star}^3\mid Y_u] - M_u\,\mathbb{E}[X_{\star}^2\mid Y_u]\Bigr)\mathrm{d}I_u,$$

where we used $\mathbb{E}[X_{\star}\mid Y_u] = M_u$. For the squared mean, $\mathrm{d}M_u = \Sigma_u\,\mathrm{d}I_u$ and Itô's formula give

$$\mathrm{d}(M_u^2) = 2 M_u \Sigma_u\,\mathrm{d}I_u + \Sigma_u^2\,\mathrm{d}u.$$

Subtracting the two identities,

$$\mathrm{d}\Sigma_u = \Bigl(\mathbb{E}[X_{\star}^3\mid Y_u] - M_u\,\mathbb{E}[X_{\star}^2\mid Y_u] - 2 M_u \Sigma_u\Bigr)\mathrm{d}I_u - \Sigma_u^2\,\mathrm{d}u.$$

The $\mathrm{d}I_u$ term is a martingale increment and has mean zero, so taking expectations leaves only the drift,

$$\frac{\mathrm{d}}{\mathrm{d}u}\mathbb{E}[\Sigma_u] = -\mathbb{E}[\Sigma_u^2].$$

In $d$ dimensions the same computation, now with the matrix-valued test function $f(x) = xx^{\top}$, gives the covariance SDE, in components,

$$\mathrm{d}(\Sigma_u)_{ij} = \sum_k \mathbb{E}\bigl[(X_{\star} - M_u)_i (X_{\star} - M_u)_j (X_{\star} - M_u)_k \mid Y_u\bigr]\,(\mathrm{d}I_u)_k - (\Sigma_u^2)_{ij}\,\mathrm{d}u. \tag{4.10}$$

The first term is a mean-zero martingale increment, so taking traces and expectations recovers ([4.8](https://arxiv.org/html/2607.01693#S4.E8)), $\frac{\mathrm{d}}{\mathrm{d}u}\mathbb{E}[\operatorname{Tr}\Sigma_u] = -\mathbb{E}[\operatorname{Tr}(\Sigma_u^2)]$. ∎
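The martingale property of the mean and the covariance budget ([4.8](https://arxiv.org/html/2607.01693#S4.E8)) can be checked by simulation. The sketch assumes a symmetric two-point prior $X_{\star} = \pm 1$ with equal weights (an illustrative choice), for which the tilt ([4.2](https://arxiv.org/html/2607.01693#S4.E2)) gives $M_u = \tanh(Y_u)$ and $\Sigma_u = 1 - \tanh^2(Y_u)$. Integrating ([4.8](https://arxiv.org/html/2607.01693#S4.E8)) over $[0, U]$ predicts $\mathbb{E}\Sigma_0 - \mathbb{E}\Sigma_U = \int_0^U \mathbb{E}[\Sigma_u^2]\,\mathrm{d}u$ with $\Sigma_0 = 1$. The function name and the path and step counts are hypothetical.

```python
import math
import random


def covariance_budget(n_paths, n_steps, U, seed=0):
    """Monte Carlo estimates of E[M_U], E[Sigma_U] and int_0^U E[Sigma_u^2] du."""
    rng = random.Random(seed)
    h = U / n_steps
    sqrt_h = math.sqrt(h)
    mean_final = 0.0
    sigma_final = 0.0
    budget = 0.0
    for _path in range(n_paths):
        x_star = rng.choice((-1.0, 1.0))  # hidden signal drawn from the two-point prior
        y = 0.0                           # Y_0 = 0
        for _step in range(n_steps):
            sigma = 1.0 - math.tanh(y) ** 2                 # posterior variance Sigma_u
            budget += h * sigma ** 2                        # left Riemann sum of Sigma_u^2
            y += x_star * h + sqrt_h * rng.gauss(0.0, 1.0)  # exact increment of dY = X du + dW
        m = math.tanh(y)                  # posterior mean M_U
        mean_final += m
        sigma_final += 1.0 - m * m
    return mean_final / n_paths, sigma_final / n_paths, budget / n_paths


mean_U, sigma_U, budget = covariance_budget(2000, 400, 4.0)
print("E[M_U] ~", mean_U)  # martingale started at M_0 = E[X_star] = 0
print("Sigma_0 - E[Sigma_U] ~", 1.0 - sigma_U, " integral of E[Sigma_u^2] ~", budget)
```

The two printed quantities on the last line should agree up to Monte Carlo and quadrature error, and both stay below $\Sigma_0 = 1$, in line with ([4.9](https://arxiv.org/html/2607.01693#S4.E9)) for $d = 1$.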

This martingale\-and\-covariance structure is what makes stochastic localization useful as an analytical tool\. Rather than bounding the original high\-dimensional measure directly, one embeds it in a random path of progressively tilted posteriors and argues along the path, where the mean is a martingale and the covariance dissipates\. Many static questions thereby reduce to estimating quantities averaged over the localization path, with the covariance identities \([4\.8](https://arxiv.org/html/2607.01693#S4.E8)\)–\([4\.9](https://arxiv.org/html/2607.01693#S4.E9)\) supplying the bookkeeping\.

The main early applications of stochastic localization \[[22](https://arxiv.org/html/2607.01693#bib.bib22)\] were in high-dimensional geometry. For an isotropic log-concave measure, the Kannan–Lovász–Simonovits (KLS) conjecture asks, in one formulation, for a dimension-free lower bound on the Cheeger isoperimetric constant. This problem is closely tied to spectral gaps, Poincaré inequalities, thin-shell estimates, and concentration of measure. Eldan introduced stochastic localization to relate thin-shell and spectral-gap/KLS bounds up to logarithmic factors \[[22](https://arxiv.org/html/2607.01693#bib.bib22)\]. Lee and Vempala developed the method further for isoperimetry, concentration, and mixing, improving the known KLS-type bounds for isotropic log-concave measures \[[32](https://arxiv.org/html/2607.01693#bib.bib32)\]. Later work of Chen obtained an almost constant lower bound for the KLS isoperimetric coefficient using related localization techniques \[[14](https://arxiv.org/html/2607.01693#bib.bib14)\]. Chen and Eldan also made the sampling connection explicit: their localization-schemes framework associates Markov chains to localized martingales of measures and proves mixing bounds by analyzing the localization process \[[16](https://arxiv.org/html/2607.01693#bib.bib16)\]. For our purposes, the lesson is that stochastic localization converts global geometric or sampling questions into estimates on the random posterior covariance process. The same covariance-budget viewpoint is what later reappears in analyses of diffusion models \[[38](https://arxiv.org/html/2607.01693#bib.bib38), [9](https://arxiv.org/html/2607.01693#bib.bib9)\].

### 4\.3\.Polchinski flow for the effective potential

The localization picture describes the Gaussian channel through posterior quantities. Polchinski flow describes the same channel at the level of the noised marginals. In diffusion-model language, this is the variance-exploding forward process from the previous section:

$$X_t = X_0 + \sqrt{t}\,Z,$$

whose marginal density is the Gaussian-smoothed density

$$p_t = p_0 * \mathcal{N}(0, tI).$$

It solves the heat equation

$$\partial_t p_t = \frac{1}{2}\Delta p_t. \tag{4.11}$$

Define the effective potential

$$U_t(x) = -\log p_t(x).$$

The diffusion-model score at noise level $t$ is therefore $\mathsf{s}^\star_t = \nabla\log p_t = -\nabla U_t$. Thus Polchinski flow is not a new sampling process; it is a way to track how the same score field evolves as the forward diffusion smooths the data distribution. The heat equation becomes the nonlinear PDE

$$\partial_t U_t = \frac{1}{2}\Delta U_t - \frac{1}{2}\lVert\nabla U_t\rVert^2. \tag{4.12}$$

This is the *finite-dimensional Polchinski equation*. It is the isotropic Gaussian-convolution version of the Polchinski renormalization flow \[[41](https://arxiv.org/html/2607.01693#bib.bib41), [7](https://arxiv.org/html/2607.01693#bib.bib7)\]. The density formulation is linear, but the effective potential formulation is nonlinear because taking logarithms turns smoothing into a viscous Hamilton–Jacobi equation.

###### Proof\.

Since $p_t = e^{-U_t}$,

$$\Delta p_t = \Delta(e^{-U_t}) = e^{-U_t}\bigl(\lVert\nabla U_t\rVert^2 - \Delta U_t\bigr).$$

Using ([4.11](https://arxiv.org/html/2607.01693#S4.E11)),

$$\partial_t U_t = -\frac{\partial_t p_t}{p_t} = -\frac{1}{2}\frac{\Delta p_t}{p_t} = \frac{1}{2}\Delta U_t - \frac{1}{2}\lVert\nabla U_t\rVert^2. \qquad ∎$$
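The Polchinski equation can be checked numerically on a closed-form example. The sketch below assumes one-dimensional Gaussian data $p_0 = \mathcal{N}(0, s^2)$, so that $p_t = \mathcal{N}(0, s^2 + t)$ and $U_t$ is an explicit quadratic; the residual of $\partial_t U_t = \tfrac12 U_t'' - \tfrac12 (U_t')^2$ is then evaluated by finite differences. The helper names and step sizes are illustrative choices.

```python
import math

# Sketch: finite-difference check of the 1D Polchinski equation
#   dU/dt = 0.5 * U'' - 0.5 * (U')^2
# Assumption: p_0 = N(0, s2), hence p_t = N(0, s2 + t).


def effective_potential(t: float, x: float, s2: float) -> float:
    """U_t(x) = -log p_t(x) for p_t = N(0, s2 + t)."""
    v = s2 + t
    return 0.5 * x * x / v + 0.5 * math.log(2.0 * math.pi * v)


def polchinski_residual(t: float, x: float, s2: float,
                        ht: float = 1e-4, hx: float = 1e-3) -> float:
    """Left side minus right side of the equation, via central differences."""
    U = effective_potential
    dU_dt = (U(t + ht, x, s2) - U(t - ht, x, s2)) / (2.0 * ht)
    dU_dx = (U(t, x + hx, s2) - U(t, x - hx, s2)) / (2.0 * hx)
    d2U_dx2 = (U(t, x + hx, s2) - 2.0 * U(t, x, s2) + U(t, x - hx, s2)) / hx**2
    return dU_dt - (0.5 * d2U_dx2 - 0.5 * dU_dx**2)


# Evaluate the residual at a few (t, x) pairs.
checks = [polchinski_residual(t, x, s2=1.0)
          for t, x in [(0.1, 0.0), (0.5, 1.0), (2.0, -2.0)]]
```

The residuals are at the level of finite-difference error, which is all this check can show; it does not replace the proof.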

Because the score is the negative gradient of the effective potential,

$$\mathsf{s}^\star_t(x) = \nabla\log p_t(x) = -\nabla U_t(x),$$

the score satisfies the viscous Burgers equation

$$\partial_t \mathsf{s}^\star_t = \frac{1}{2}\Delta\mathsf{s}^\star_t + \nabla\Bigl(\frac{1}{2}\lVert\mathsf{s}^\star_t\rVert^2\Bigr). \tag{4.13}$$

Diffusion models are usually trained by score matching, rather than by solving the Burgers equation. The equation nevertheless explains why regularity of the score improves at positive noise levels and why small-noise estimates are delicate: as $t \downarrow 0$, the smoothing that controls derivatives disappears.
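A worked Gaussian instance makes both points concrete. Assume, purely for illustration, that $p_0 = \mathcal{N}(0, \tau^2 I)$ on $\mathbb{R}^d$; the symbol $\tau$ is introduced here and is not part of the notes. Then $p_t = \mathcal{N}(0, (\tau^2 + t) I)$ and each term of the Burgers equation is explicit:

$$
\begin{aligned}
\mathsf{s}^\star_t(x) &= -\frac{x}{\tau^2 + t}, &
\partial_t \mathsf{s}^\star_t(x) &= \frac{x}{(\tau^2 + t)^2}, \\
\Delta \mathsf{s}^\star_t(x) &= 0, &
\nabla\Bigl(\tfrac{1}{2}\lVert \mathsf{s}^\star_t(x)\rVert^2\Bigr) &= \nabla\,\frac{\lVert x\rVert^2}{2(\tau^2 + t)^2} = \frac{x}{(\tau^2 + t)^2}.
\end{aligned}
$$

The two sides agree. The Lipschitz constant of the score is $1/(\tau^2 + t)$, so for nearly degenerate data ($\tau \to 0$) it grows like $1/t$ as $t \downarrow 0$, which is the small-noise difficulty in its simplest form.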

The Polchinski equation ([4.12](https://arxiv.org/html/2607.01693#S4.E12)) is useful because it shows what forward noising does to the energy landscape. Sharp wells, ridges, and nonsmooth features of $U_0 = -\log p_0$ are averaged out by convolution. At positive noise, $U_t$ is smoother and its gradient is better behaved. The reverse model is trying to follow this family of smoothed landscapes backward, where the geometry becomes progressively sharper.

This is the renormalization interpretation of diffusion models. Historically, renormalization entered statistical physics through the idea that one should describe a system at a chosen resolution and then track how the effective description changes when microscopic degrees of freedom are averaged out. Kadanoff's block-spin scaling picture \[[29](https://arxiv.org/html/2607.01693#bib.bib29)\] gave the physical image of coarse-graining near criticality. Wilson's renormalization group made this image into a systematic flow of effective theories and explained universality at critical points \[[48](https://arxiv.org/html/2607.01693#bib.bib48), [49](https://arxiv.org/html/2607.01693#bib.bib49)\]; the review of Wilson and Kogut \[[50](https://arxiv.org/html/2607.01693#bib.bib50)\] is the classical reference. Polchinski's exact renormalization group equation \[[41](https://arxiv.org/html/2607.01693#bib.bib41)\] then expressed this flow directly at the level of effective Lagrangians. The finite-dimensional equation ([4.12](https://arxiv.org/html/2607.01693#S4.E12)) is the diffusion-model shadow of the same idea.

Renormalization group language says that forward noising integrates out microscopic information. In image models this statement is metaphorical but useful: the data distribution is gradually blurred, high-frequency details become harder to distinguish, and the effective potential $U_t$ describes the log-density of the coarse-grained law. In field-theoretic applications the same idea can be literal: one chooses a covariance schedule that removes degrees of freedom by scale, and the corresponding effective action follows a Polchinski-type equation. In the work of Bauerschmidt and collaborators, this flow is used precisely in this rigorous sense: it is an evolution equation for scale-dependent effective potentials, so estimates along the flow turn renormalization into analytic control of stochastic dynamics and Gibbs measures \[[7](https://arxiv.org/html/2607.01693#bib.bib7)\].

### 4\.4\.A dictionary

The same Gaussian smoothing operation can now be summarized in three languages: diffusion dynamics, posterior localization, and renormalization by Gaussian coarse\-graining\.

## 5\.Discretizing Continuous Diffusion Models

We now pass from ideal continuous dynamics to algorithms\. A sampler must choose a finite time grid and replace each exact reverse transition by a computable transition kernel\. This section sets up the formulas needed for the numerical analysis\.

A useful way to keep the presentation organized is to separate the three implementation questions\. First, the reverse chain must be initialized from a simple high\-noise law rather than the exact terminal law\. Second, the reverse dynamics use learned scores or denoisers rather than the true ones\. Third, the continuous reverse dynamics, or equivalently the exact Bayes reverse kernels on a grid, must be discretized\. The analysis is built around exactly this trichotomy: initialization error, statistical score error, and discretization error\.

### 5\.1\.The DDPM grid, scores, and exact reverse kernels

Let $X_0 \sim p_{\mathsf{data}}$ on $\mathbb{R}^d$. Choose a time grid

$$0 = t_0 < t_1 < \cdots < t_K = T.$$

The goal is to replace the continuous noising process by a finite Markov chain with the same prescribed one-time marginals. Recall from ([3.2](https://arxiv.org/html/2607.01693#S3.E2)) that these marginals are

$$X_t \mid X_0 \sim \mathcal{N}(a_t X_0, \sigma_t^2 I).$$

On the grid, we write

$$X_k = X_{t_k}, \qquad a_k = a_{t_k}, \qquad \sigma_k = \sigma_{t_k}.$$

The grid schedule determines the one-step noising parameters. We choose a Gaussian forward transition

$$X_{k+1} \mid X_k \sim \mathcal{N}(\alpha_k X_k, \alpha_k^2 \eta_k I). \tag{5.1}$$

If

$$X_k = a_k X_0 + \sigma_k Z_k, \qquad Z_k \sim \mathcal{N}(0, I),$$

then ([5.1](https://arxiv.org/html/2607.01693#S5.E1)) is equivalently

$$X_{k+1} = \alpha_k X_k + \alpha_k\sqrt{\eta_k}\,\xi_k, \qquad \xi_k \sim \mathcal{N}(0, I),$$

and hence

$$X_{k+1} \mid X_0 \sim \mathcal{N}\bigl(\alpha_k a_k X_0, \alpha_k^2(\sigma_k^2 + \eta_k) I\bigr).$$

Matching this with the target marginal $X_{k+1} \mid X_0 \sim \mathcal{N}(a_{k+1} X_0, \sigma_{k+1}^2 I)$ gives

$$\alpha_k = \frac{a_{k+1}}{a_k}, \qquad \eta_k = \frac{\sigma_{k+1}^2}{\alpha_k^2} - \sigma_k^2.$$

The variance-preserving case has $a_k^2 + \sigma_k^2 = 1$. The variance-exploding case has $a_k \equiv 1$ and an increasing noise scale $\sigma_k^2 = t_k$.
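The matching formulas are easy to exercise on a concrete grid. The sketch below computes $\alpha_k$ and $\eta_k$ from grid values of $a_k$ and $\sigma_k^2$, then iterates the one-step transition to confirm that the prescribed marginals are reproduced. The particular schedule $a_t = e^{-t/2}$, $\sigma_t^2 = 1 - e^{-t}$ and the grid are assumptions chosen for illustration; the schedule is variance-preserving in the sense $a_t^2 + \sigma_t^2 = 1$.

```python
import math

# Sketch: one-step DDPM parameters from a schedule, and a consistency check.
# Assumption: VP schedule a_t = exp(-t/2), sigma_t^2 = 1 - exp(-t) on an
# arbitrary illustrative grid.


def one_step_params(a, sigma2):
    """Return (alpha, eta) with alpha_k = a_{k+1}/a_k and
    eta_k = sigma_{k+1}^2 / alpha_k^2 - sigma_k^2."""
    alpha = [a[k + 1] / a[k] for k in range(len(a) - 1)]
    eta = [sigma2[k + 1] / alpha[k] ** 2 - sigma2[k] for k in range(len(a) - 1)]
    return alpha, eta


def compose_marginals(alpha, eta):
    """Iterate X_{k+1} | X_k ~ N(alpha_k X_k, alpha_k^2 eta_k I) from X_0 and
    track the mean coefficient and variance of X_k | X_0."""
    coef, var = [1.0], [0.0]
    for al, et in zip(alpha, eta):
        coef.append(al * coef[-1])
        var.append(al ** 2 * (var[-1] + et))
    return coef, var


grid = [0.0, 0.1, 0.3, 0.7, 1.5]
a = [math.exp(-t / 2.0) for t in grid]
sigma2 = [1.0 - math.exp(-t) for t in grid]

alpha, eta = one_step_params(a, sigma2)
coef, var = compose_marginals(alpha, eta)
```

Here `coef` and `var` should coincide with `a` and `sigma2` up to rounding, and every `eta[k]` should be positive, which is the condition for the one-step kernel to be a genuine Gaussian.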

The reverse sampler has to invert one grid step: given a noisy point at level $k+1$, it needs the conditional law of the previous point at level $k$. This is where denoising and score information enter. Let $p_k$ denote the density of $X_k$. The true score at time $k$ is

$$\mathsf{s}^\star_k(x) = \nabla\log p_k(x).$$

The score determines the posterior mean of the clean sample. This is the grid-time specialization of the continuous-time Tweedie identity ([3.4](https://arxiv.org/html/2607.01693#S3.E4)), with $a_t, \sigma_t$ read off on the grid:

$$\mathsf{s}^\star_k(x) = \frac{1}{\sigma_k^2}\,\mathbb{E}[a_k X_0 - X_k \mid X_k = x]. \tag{5.2}$$

Equivalently, the grid-time version of the optimal denoiser from Subsection [3.2](https://arxiv.org/html/2607.01693#S3.SS2) is

$$\mathsf{D}^\star_k(x) := \mathbb{E}[X_0 \mid X_k = x] = a_k^{-1}\bigl(x + \sigma_k^2\,\mathsf{s}^\star_k(x)\bigr).$$

This is the grid-time notation corresponding to the posterior mean $m_u(y)$ from stochastic localization: in the variance-exploding normalization $\sigma_t^2 = t$, one has $\mathsf{D}^\star_t(x) = x + t\,\mathsf{s}^\star_t(x) = m_{1/t}(x/t)$. A score model $\mathsf{s}_k$ is equivalently a denoiser model

$$\mathsf{D}_k(x) := a_k^{-1}\bigl(x + \sigma_k^2\,\mathsf{s}_k(x)\bigr), \qquad \mathsf{D}_k(x) - \mathsf{D}^\star_k(x) = a_k^{-1}\sigma_k^2\bigl(\mathsf{s}_k(x) - \mathsf{s}^\star_k(x)\bigr).$$

In the variance-exploding normalization $a_t \equiv 1$ and $\sigma_t^2 = t$, the reverse drift can be written as

$$\mathsf{s}^\star_t(x) = \frac{\mathsf{D}^\star_t(x) - x}{t}.$$

Thus one may either think of the sampler as using a score field, or as using an optimal denoiser whose value is frozen over each discrete reverse step.
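The score and denoiser parametrizations can be compared directly when the data law is Gaussian. The sketch below assumes one-dimensional data $X_0 \sim \mathcal{N}(m, s^2)$, for which $p_k = \mathcal{N}(a_k m, a_k^2 s^2 + \sigma_k^2)$, and checks that the denoiser obtained from the score through Tweedie's identity equals the posterior mean computed by Gaussian conditioning. All function names are illustrative.

```python
# Sketch: score <-> denoiser conversion, checked on Gaussian data.
# Assumption: X_0 ~ N(m, s2) in 1D and X_k = a * X_0 + sqrt(sigma2) * Z.


def gaussian_score(x, a, sigma2, m, s2):
    """Exact score of p_k = N(a m, a^2 s2 + sigma2)."""
    return -(x - a * m) / (a * a * s2 + sigma2)


def denoiser_from_score(x, a, sigma2, score):
    """D_k(x) = (x + sigma_k^2 s_k(x)) / a_k."""
    return (x + sigma2 * score) / a


def score_from_denoiser(x, a, sigma2, denoised):
    """Inverse map: s_k(x) = (a_k D_k(x) - x) / sigma_k^2."""
    return (a * denoised - x) / sigma2


def posterior_mean(x, a, sigma2, m, s2):
    """E[X_0 | X_k = x] by direct Gaussian conditioning."""
    return m + a * s2 * (x - a * m) / (a * a * s2 + sigma2)


x, a, sigma2, m, s2 = 0.7, 0.8, 0.36, 1.5, 2.0
score = gaussian_score(x, a, sigma2, m, s2)
denoised = denoiser_from_score(x, a, sigma2, score)
```

Converting `denoised` back with `score_from_denoiser` returns the original score, so a network may be trained in either parametrization and converted at sampling time.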

Before approximating the reverse step with a score model, let us write the exact Bayes kernel in terms of the marginal density $p_k$. Let $P_k^{\leftarrow}(x' \to \cdot)$ be the density of the backward transition $\operatorname{Law}(X_k \mid X_{k+1} = x')$, and let $P_k^{\to}(x \to x')$ denote the forward transition density in ([5.1](https://arxiv.org/html/2607.01693#S5.E1)). For fixed $x$,

$$P_k^{\to}(x \to x') = (2\pi\alpha_k^2\eta_k)^{-d/2}\exp\Bigl(-\frac{\lVert x' - \alpha_k x\rVert^2}{2\alpha_k^2\eta_k}\Bigr).$$

Bayes' rule gives

$$P_k^{\leftarrow}(x' \to x) = \frac{p_k(x)\,P_k^{\to}(x \to x')}{p_{k+1}(x')}, \qquad p_{k+1}(x') = \int p_k(y)\,P_k^{\to}(y \to x')\,\mathrm{d}y.$$

So the exact reverse kernel is

$$P_k^{\leftarrow}(x' \to x) \propto p_k(x)\exp\Bigl(-\frac{\lVert x - \alpha_k^{-1}x'\rVert^2}{2\eta_k}\Bigr). \tag{5.3}$$

Thus each reverse step is a Bayes update: $p_k$ is the prior at the previous noise level, and the Gaussian factor is the likelihood of the rescaled observation $\alpha_k^{-1}x'$ under the channel $\alpha_k^{-1}X_{k+1} = X_k + \sqrt{\eta_k}\,\xi_k$. If $p_k$ were known as a density, ([5.3](https://arxiv.org/html/2607.01693#S5.E3)) could be used directly. In diffusion models, the learned object is only the score $\nabla\log p_k$, so the next step is to approximate this Bayes update locally.

### 5\.2\.The Gaussian reverse update

A common sampler replaces ([5.3](https://arxiv.org/html/2607.01693#S5.E3)) by a Gaussian transition. With the notation of this section,

$$\widehat{P}_k^{\leftarrow}(x' \to \cdot) = \mathcal{N}\bigl(\alpha_k^{-1}x' + \eta_k\alpha_k\,\mathsf{s}_{k+1}(x'),\ \eta_k I\bigr), \tag{5.4}$$

where we use the same step-size parameter $\eta_k$ for the Gaussian variance. This should be contrasted with the exact reverse kernel ([5.3](https://arxiv.org/html/2607.01693#S5.E3)), which need not be Gaussian because of the prior factor $p_k(x)$.

To see why the mean in ([5.4](https://arxiv.org/html/2607.01693#S5.E4)) has this form, apply Tweedie's identity to the single forward step itself. After rescaling the next observation,

$$\alpha_k^{-1}X_{k+1} = X_k + \sqrt{\eta_k}\,\xi_k.$$

Tweedie's identity for this channel gives

$$
\begin{aligned}
\mathbb{E}[X_k \mid X_{k+1} = x'] &= \alpha_k^{-1}x' + \alpha_k\eta_k\nabla\log\bigl(p_{k+1}(x')\bigr) \\
&= \alpha_k^{-1}x' + \alpha_k\eta_k\,\mathsf{s}^\star_{k+1}(x').
\end{aligned}
$$

Thus the center of ([5.4](https://arxiv.org/html/2607.01693#S5.E4)) is the exact posterior mean with $\mathsf{s}^\star_{k+1}$ replaced by the learned score $\mathsf{s}_{k+1}$. The remaining approximation is to replace the exact posterior covariance by the local Gaussian scale $\eta_k I$: when the prior density $p_k$ is smooth, the likelihood in ([5.3](https://arxiv.org/html/2607.01693#S5.E3)) confines $X_k$ to a ball of radius $\sqrt{\eta_k}$ around the observation, up to leading order.
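Both claims, exact mean and approximate covariance, can be seen in closed form when $p_k$ is Gaussian. The sketch below assumes a one-dimensional prior $p_k = \mathcal{N}(\mu_k, v_k)$, for which the exact Bayes kernel is itself Gaussian, and compares it with the Gaussian update evaluated with the exact score of $p_{k+1} = \mathcal{N}(\alpha_k\mu_k, \alpha_k^2(v_k + \eta_k))$. The numerical values are arbitrary illustrations.

```python
# Sketch: exact Bayes reverse kernel versus the Gaussian reverse update.
# Assumption: p_k = N(mu_k, v_k) in 1D, so the exact reverse kernel is Gaussian.


def exact_reverse(x_next, alpha, eta, mu_k, v_k):
    """Mean and variance of Law(X_k | X_{k+1} = x_next): a Gaussian prior
    times the likelihood of the rescaled observation x_next / alpha."""
    obs = x_next / alpha
    mean = (eta * mu_k + v_k * obs) / (v_k + eta)
    var = v_k * eta / (v_k + eta)
    return mean, var


def gaussian_update(x_next, alpha, eta, mu_k, v_k):
    """Mean and variance of the Gaussian update with the exact score of
    p_{k+1} = N(alpha mu_k, alpha^2 (v_k + eta))."""
    score = -(x_next - alpha * mu_k) / (alpha ** 2 * (v_k + eta))
    return x_next / alpha + eta * alpha * score, eta


x_next, alpha, mu_k, v_k = 0.3, 0.9, 1.0, 2.0
exact_mean, exact_var = exact_reverse(x_next, alpha, 0.05, mu_k, v_k)
approx_mean, approx_var = gaussian_update(x_next, alpha, 0.05, mu_k, v_k)

# Ratio of exact posterior variance to eta for shrinking step sizes.
variance_ratios = [exact_reverse(x_next, alpha, eta, mu_k, v_k)[1] / eta
                   for eta in (0.5, 0.05, 0.005)]
```

The means agree exactly, while the exact variance is $v_k\eta_k/(v_k + \eta_k) < \eta_k$; the ratio tends to one as $\eta_k \to 0$, which is the sense in which $\eta_k I$ is the leading-order covariance in this example.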

Let us also present a complementary derivation from the numerical SDE viewpoint. Choose a forward noising SDE on $[0, T]$ whose marginals agree with the channel $X_t \mid X_0 \sim \mathcal{N}(a_t X_0, \sigma_t^2 I)$. Assume the schedule is smooth, $a_t > 0$, with the usual normalization $a_0 = 1$ and $\sigma_0 = 0$, and define

$$\lambda_t = \frac{\dot{a}_t}{a_t}, \qquad f_t(x) = \lambda_t x, \qquad g_t^2 = \frac{\mathrm{d}}{\mathrm{d}t}\sigma_t^2 - 2\lambda_t\sigma_t^2.$$

Assume the last quantity is non-negative, so that $g_t$ is well defined. Then the linear SDE

$$\mathrm{d}X_t = \lambda_t X_t\,\mathrm{d}t + g_t\,\mathrm{d}B_t$$

has conditional mean $a_t X_0$ and conditional variance $\sigma_t^2 I$: the variance solves $\dot{v}_t = 2\lambda_t v_t + g_t^2$ with $v_0 = 0$, hence $v_t = \sigma_t^2$. Applying ([3.5](https://arxiv.org/html/2607.01693#S3.E5)) to this particular choice of $f_t$ and $g_t$ gives the reverse SDE

$$\mathrm{d}Y^{\leftarrow}_s = \bigl\{-\lambda_{T-s}Y^{\leftarrow}_s + g_{T-s}^2\nabla\log p_{T-s}(Y^{\leftarrow}_s)\bigr\}\,\mathrm{d}s + g_{T-s}\,\mathrm{d}B^{\leftarrow}_s.$$

Now apply Euler–Maruyama to this reverse SDE on the grid. For the reverse step from time $t_{k+1}$ to time $t_k$, write $h_k = t_{k+1} - t_k$ and start from $x'$ at noise level $t_{k+1}$. Freezing the coefficients and the score at the beginning of this reverse step, namely at $t_{k+1}$ in forward time, and using the grid shorthand $\lambda_{k+1} := \lambda_{t_{k+1}}$ and $g_{k+1} := g_{t_{k+1}}$, gives the update

$$x' - h_k\lambda_{k+1}x' + h_k g_{k+1}^2\,\mathsf{s}_{k+1}(x') + g_{k+1}\sqrt{h_k}\,\xi, \qquad \xi \sim \mathcal{N}(0, I).$$

The discrete parameters are exactly the grid versions of the same schedule:

$$\alpha_k = \frac{a_{k+1}}{a_k}, \qquad \eta_k = \frac{\sigma_{k+1}^2}{\alpha_k^2} - \sigma_k^2.$$

A Taylor expansion at $t_{k+1}$ gives

$$\alpha_k^{-1} = 1 - h_k\lambda_{k+1} + O(h_k^2), \qquad \alpha_k\eta_k = h_k g_{k+1}^2 + O(h_k^2),$$

so the Euler–Maruyama step agrees with the Gaussian reverse update ([5.4](https://arxiv.org/html/2607.01693#S5.E4)) to leading order in $h_k$. It is worth being explicit about the two step parameters here, since both recur in the error analysis: $h_k = t_{k+1} - t_k$ is the time increment of the reverse step, while $\eta_k$ is the variance parameter of the one-step Gaussian kernel ([5.1](https://arxiv.org/html/2607.01693#S5.E1)), whose covariance is $\alpha_k^2\eta_k I$. In a general schedule they are different quantities, related by $\alpha_k\eta_k = h_k g_{k+1}^2 + O(h_k^2)$, but they coincide, $\eta_k = h_k$, in the variance-exploding normalization ($\alpha_k = 1$, $g_t \equiv 1$) used in Section [6](https://arxiv.org/html/2607.01693#S6). The update ([5.4](https://arxiv.org/html/2607.01693#S5.E4)) can therefore be viewed either as a local approximation to the exact Bayes kernel or as an Euler–Maruyama discretization of the reverse SDE associated with the same Gaussian noising channel. The discretization analysis asks how much error is introduced by making this local replacement at every reverse step.
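The Taylor relations between the DDPM parameters and the Euler–Maruyama coefficients can be checked on an explicit schedule. The sketch below assumes the variance-preserving schedule $a_t = e^{-t/2}$, $\sigma_t^2 = 1 - e^{-t}$ (an illustrative choice), for which $\lambda_t = -\tfrac12$ and $g_t^2 = e^{-t} + (1 - e^{-t}) = 1$, and measures the two remainders for a sequence of shrinking steps.

```python
import math

# Sketch: compare DDPM step parameters with Euler-Maruyama coefficients.
# Assumption: VP schedule a_t = exp(-t/2), sigma_t^2 = 1 - exp(-t), for which
# lambda_t = -1/2 and g_t^2 = 1 at every t.
LAMBDA = -0.5
G2 = 1.0


def a(t):
    return math.exp(-t / 2.0)


def sigma2(t):
    return 1.0 - math.exp(-t)


def step_params(t_k, h):
    """alpha_k and eta_k for the grid step from t_k to t_k + h."""
    alpha = a(t_k + h) / a(t_k)
    eta = sigma2(t_k + h) / alpha ** 2 - sigma2(t_k)
    return alpha, eta


def taylor_errors(t_k, h):
    """Remainders |1/alpha_k - (1 - h lambda)| and |alpha_k eta_k - h g^2|."""
    alpha, eta = step_params(t_k, h)
    return abs(1.0 / alpha - (1.0 - h * LAMBDA)), abs(alpha * eta - h * G2)


errors = {h: taylor_errors(0.5, h) for h in (0.1, 0.05, 0.025)}
```

Halving $h$ should reduce the first remainder by roughly a factor of four, consistent with $O(h_k^2)$. For this particular schedule the second remainder happens to be even smaller, of order $h_k^3$, since $\alpha_k\eta_k = 2\sinh(h_k/2)$.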

## 6\.Error Analysis for Diffusion Models

The previous section isolated the local numerical question: each step has an exact Bayes kernel \([5\.3](https://arxiv.org/html/2607.01693#S5.E3)\), while a practical sampler uses an implementable approximation such as \([5\.4](https://arxiv.org/html/2607.01693#S5.E4)\)\. We now turn this local comparison into a global guarantee\. Beyond the early\-stopping gap between the smoothed target and the raw data law, the reverse\-chain error splits into the same three sources isolated in the previous section: how the chain is initialized, how accurately the score has been learned, and how accurately each reverse kernel is discretized\.

Before getting into the details, it is worth laying out the roadmap of the whole analysis; each piece is made precise in the subsections that follow. Throughout, $P_k^{\leftarrow}(x_{k+1} \to \cdot)$ denotes the exact reverse kernels ([5.3](https://arxiv.org/html/2607.01693#S5.E3)), and $\widehat{P}_k^{\leftarrow}(x_{k+1} \to \cdot)$ the implemented reverse kernels of a sampler. For each time $k$, the $L^2(p_k)$ score error is

$$\varepsilon_{k,\mathsf{score}}^2 = \mathbb{E}_{X_k \sim p_k}\lVert\mathsf{s}_k(X_k) - \mathsf{s}^\star_k(X_k)\rVert^2. \tag{6.1}$$

The implemented reverse chain is initialized at the high-noise endpoint with error

$$\operatorname{D_{\mathsf{KL}}}(p_K \,\|\, \widehat{p}_K) \leq \varepsilon_{\mathsf{init}}^2,$$

and the accumulated one-step kernel error splits into a pure discretization part and a score-error part,

$$\sum_{k=1}^{K-1}\mathbb{E}_{X_{k+1} \sim p_{k+1}}\operatorname{D_{\mathsf{KL}}}\bigl(P_k^{\leftarrow}(X_{k+1} \to \cdot) \,\big\|\, \widehat{P}_k^{\leftarrow}(X_{k+1} \to \cdot)\bigr) \leq \varepsilon_{\mathsf{disc}}^2 + c\sum_{k=1}^{K-1}\eta_k\,\varepsilon_{k+1,\mathsf{score}}^2.$$

Putting these together, the error decomposition we will establish reads

$$\operatorname{D_{\mathsf{KL}}}(p_1 \,\|\, \widehat{p}_1) \leq \varepsilon_{\mathsf{init}}^2 + \varepsilon_{\mathsf{disc}}^2 + c\sum_{k=1}^{K-1}\eta_k\,\varepsilon_{k+1,\mathsf{score}}^2, \tag{6.2}$$

with the early-stopping comparison between $p_1$ and $p_{\mathsf{data}}$ handled separately as a geometric smoothing estimate, naturally controlled in $W_2$.

This decomposition is mainly a bookkeeping statement\. Instead of comparing only the final samples, we compare the exact and implemented reverse chains as full paths\. Data processing then says that the KL between the final marginals is no larger than the path\-space KL\. The path\-space chain rule breaks this latter quantity into two pieces: the mismatch between the two initial laws at the high\-noise endpoint, and the sum of the one\-step KL errors accumulated along the reverse chain\.

The factor multiplying the score error has a simple origin. In Gaussian DDPM or Euler–Maruyama updates, the learned score changes the mean of the one-step Gaussian kernel by about $\eta_k(\mathsf{s}_{k+1} - \mathsf{s}^\star_{k+1})$. KL between Gaussians with the same covariance $\eta_k I$ divides the squared mean error by the variance, leaving a contribution of order $\eta_k\,\varepsilon_{k+1,\mathsf{score}}^2$. Thus score errors at times with larger reverse steps matter more, while errors on very small steps are naturally downweighted. Here $\eta_k$ is the one-step variance parameter of Section [5.1](https://arxiv.org/html/2607.01693#S5.SS1); in the variance-exploding chain analyzed in the rest of this section it is exactly the reverse-time step $h_k = t_{k+1} - t_k$, which is why these weights act as step sizes.
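The scaling in $\eta_k$ can be read off from the closed-form KL divergence between Gaussians. The sketch below works in one dimension and assumes $\alpha_k = 1$, so that a score error $\delta$ shifts the kernel mean by exactly $\eta_k\delta$; in this idealized equal-covariance setting the one-step KL is $\tfrac12\eta_k\delta^2$. The function names are illustrative.

```python
import math

# Sketch: one-step KL caused by a score error, in 1D with alpha_k = 1 (assumed).


def kl_gauss_1d(m1, v1, m2, v2):
    """KL( N(m1, v1) || N(m2, v2) ) for scalar Gaussians with variances v1, v2."""
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)


def one_step_score_kl(eta, score_err):
    """KL between N(m, eta) and N(m + eta * score_err, eta); m cancels."""
    return kl_gauss_1d(0.0, eta, eta * score_err, eta)


# The same pointwise score error costs less KL on smaller steps.
kl_values = {eta: one_step_score_kl(eta, score_err=3.0) for eta in (0.1, 0.01, 0.001)}
```

Averaging $\delta^2$ over $X_{k+1} \sim p_{k+1}$ turns the pointwise value into $\tfrac12\eta_k\,\varepsilon_{k+1,\mathsf{score}}^2$; the general statement in the text carries an unspecified constant $c$ because the compared kernels are not both Gaussian with the same covariance.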

The term $\varepsilon_{\mathsf{disc}}^2$ collects the error that would remain even with the exact score. For Euler–Maruyama this is the cost of freezing the score or denoiser during a step. The stochastic-localization argument controls that cost through posterior covariance budgets, and the high-accuracy correction later reduces it by sampling the local Gaussian tilt more faithfully. The subsections below unpack the pieces in this order: early stopping, KL telescoping, score-error contribution, Euler–Maruyama discretization and its localization refinement, and finally the high-accuracy correction.

### 6\.1\.Early stopping

The first issue is the endpoint of the reverse process. We stop at a small positive noise level and target the smoothed law $p_1$. This early-stopped law is the right object for the KL analysis: path-space chain rules and Girsanov-type estimates give direct control of $\operatorname{D_{\mathsf{KL}}}(p_1 \,\|\, \widehat{p}_1)$ for the sampler output $\widehat{p}_1$. The raw data distribution may be singular, supported on a lower-dimensional set, or otherwise mutually singular with any smooth sampler output, so it cannot generally be compared to $\widehat{p}_1$ in KL.

Thus we keep two error measurements separate. The early-stopping error is a geometric smoothing error and is controlled in $W_2$. The algorithmic error between $p_1$ and $\widehat{p}_1$ is controlled later in KL.

The $W_2$ estimate is immediate from the forward noising coupling. In the grid notation of the previous section, $X_1 = a_1 X_0 + \sigma_1 Z$, so coupling $X_1$ with the same clean sample $X_0$ gives

$$W_2^2(p_{\mathsf{data}}, p_1) \leq (1 - a_1)^2\,\mathbb{E}\lVert X_0\rVert^2 + \sigma_1^2 d. \tag{6.3}$$

Thus early stopping is harmless when $\sigma_1$ and $1 - a_1$ are small relative to the scale of the data.

###### Proof of the early\-stopping estimate\.

Couple $p_1$ and $p_{\mathsf{data}}$ through a shared clean data point: draw $X_0 \sim p_{\mathsf{data}}$ and an independent Gaussian $Z$, and form

$$X_1 = a_1 X_0 + \sigma_1 Z, \qquad Z \sim \mathcal{N}(0, I).$$

Then

$$X_1 - X_0 = (a_1 - 1)X_0 + \sigma_1 Z.$$

Using $\mathbb{E}Z = 0$ and independence of $Z$ and $X_0$, the cross term vanishes and

$$\mathbb{E}\lVert X_1 - X_0\rVert^2 = (1 - a_1)^2\,\mathbb{E}\lVert X_0\rVert^2 + \sigma_1^2\,\mathbb{E}\lVert Z\rVert^2 = (1 - a_1)^2\,\mathbb{E}\lVert X_0\rVert^2 + \sigma_1^2 d.$$

Since $W_2^2$ is the infimum over all couplings, it is bounded by the cost of this particular coupling. ∎
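The bound can be compared with an exact value in a case where $W_2$ is known. The sketch below assumes one-dimensional centered Gaussian data $\mathcal{N}(0, s^2)$, for which $p_1 = \mathcal{N}(0, a_1^2 s^2 + \sigma_1^2)$ and the squared Wasserstein distance between two centered Gaussians is the squared difference of their standard deviations. The schedule values are illustrative and use the variance-preserving choice $a_1 = e^{-t_1/2}$, $\sigma_1^2 = 1 - e^{-t_1}$.

```python
import math

# Sketch: early-stopping bound versus the exact W_2^2 for 1D Gaussian data.
# Assumptions: data N(0, s2), d = 1, VP schedule evaluated at t1 = 0.01.


def early_stopping_bound(a1, sigma1_sq, second_moment, d):
    """Right-hand side (1 - a1)^2 E|X_0|^2 + sigma_1^2 d."""
    return (1.0 - a1) ** 2 * second_moment + sigma1_sq * d


def w2_sq_centered_gaussians_1d(var_a, var_b):
    """Exact W_2^2 between N(0, var_a) and N(0, var_b) on the real line."""
    return (math.sqrt(var_a) - math.sqrt(var_b)) ** 2


s2, t1 = 4.0, 0.01
a1 = math.exp(-t1 / 2.0)
sigma1_sq = 1.0 - math.exp(-t1)

bound = early_stopping_bound(a1, sigma1_sq, second_moment=s2, d=1)
exact = w2_sq_centered_gaussians_1d(s2, a1 ** 2 * s2 + sigma1_sq)
```

In this example the exact distance is far below the bound, because the synchronous coupling used in the proof is not optimal: the optimal coupling rescales $X_0$ rather than adding independent noise. The bound is nonetheless the right tool for singular data, where no density-based comparison is available.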

### 6\.2\.KL telescoping transition error

The algorithmic comparison in this section is

$$\operatorname{D_{\mathsf{KL}}}(p_1 \,\|\, \widehat{p}_1),$$

where $p_1$ is the law produced by the exact reverse chain and $\widehat{p}_1$ is the law produced by the implemented chain, both stopped at the first positive noise level. We control this final KL by first comparing the two reverse chains as full paths and then projecting those paths to their endpoint.

The reason this helps is simple: a Markov chain path density is a product of an initial density and transition densities\. Taking a logarithm turns that product into a sum, and taking expectation gives an additive KL identity\. This is the basic telescoping mechanism behind the error decomposition: a long reverse sampling procedure is reduced to many one\-step reverse\-kernel comparisons\.

###### Lemma 6\.1\(KL chain rule for reverse Markov kernels\)\.

Let $\mu_K P_{K-1}\cdots P_1$ and $\nu_K Q_{K-1}\cdots Q_1$ be two path laws on $(X_K, \ldots, X_1)$, where $P_k(x_{k+1} \to \cdot)$ and $Q_k(x_{k+1} \to \cdot)$ transition from level $k+1$ to level $k$. Let $\mu_{k+1}$ be the marginal law of $X_{k+1}$ under the first path law. Then

$$\operatorname{D_{\mathsf{KL}}}(\mu_K P_{K-1}\cdots P_1 \,\|\, \nu_K Q_{K-1}\cdots Q_1) = \operatorname{D_{\mathsf{KL}}}(\mu_K \,\|\, \nu_K) + \sum_{k=1}^{K-1}\mathbb{E}_{X_{k+1} \sim \mu_{k+1}}\operatorname{D_{\mathsf{KL}}}\bigl(P_k(X_{k+1} \to \cdot) \,\big\|\, Q_k(X_{k+1} \to \cdot)\bigr).$$

###### Proof\.

Write the two path densities as

$$\mu_K(x_K)\prod_{k=1}^{K-1}P_k(x_{k+1} \to x_k), \qquad \nu_K(x_K)\prod_{k=1}^{K-1}Q_k(x_{k+1} \to x_k).$$

The log-ratio is therefore

$$\log\frac{\mu_K(x_K)}{\nu_K(x_K)} + \sum_{k=1}^{K-1}\log\frac{P_k(x_{k+1} \to x_k)}{Q_k(x_{k+1} \to x_k)}.$$

Take expectations under the first path law. The first summand gives the initial term $\operatorname{D_{\mathsf{KL}}}(\mu_K \,\|\, \nu_K)$. For the $k$th transition term, condition on $X_{k+1}$. Under the first path law, the conditional law of $X_k$ given $X_{k+1}$ is $P_k(X_{k+1} \to \cdot)$, while the law of $X_{k+1}$ itself is $\mu_{k+1}$. Thus

$$\mathbb{E}\log\frac{P_k(X_{k+1} \to X_k)}{Q_k(X_{k+1} \to X_k)} = \mathbb{E}_{X_{k+1} \sim \mu_{k+1}}\operatorname{D_{\mathsf{KL}}}\bigl(P_k(X_{k+1} \to \cdot) \,\big\|\, Q_k(X_{k+1} \to \cdot)\bigr).$$

Summing over $k$ gives the identity. The general measure-theoretic statement follows by replacing densities with Radon–Nikodym derivatives. ∎
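The chain rule can be confirmed by brute force on a small finite-state example, where path laws are finite tables. The sketch below uses two states and two transitions (so $K = 3$); the numerical entries are arbitrary, and the kernels are listed in the order in which they are applied, that is $P_{K-1}$ first. It compares the KL divergence computed by enumerating all paths with the right-hand side of the lemma.

```python
import itertools
import math

# Sketch: brute-force check of the KL chain rule on a two-state chain.
# All probabilities below are arbitrary illustrative values.


def kl(p, q):
    """KL divergence between two finite distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def path_kl(mu, P_list, nu, Q_list):
    """KL between path laws, by enumerating every path (x_K, ..., x_1)."""
    total = 0.0
    for path in itertools.product(range(len(mu)), repeat=len(P_list) + 1):
        p, q = mu[path[0]], nu[path[0]]
        for step, (P, Q) in enumerate(zip(P_list, Q_list)):
            p *= P[path[step]][path[step + 1]]
            q *= Q[path[step]][path[step + 1]]
        if p > 0:
            total += p * math.log(p / q)
    return total


def chain_rule_rhs(mu, P_list, nu, Q_list):
    """Initial KL plus expected one-step KLs under the first chain's marginals."""
    total, marg = kl(mu, nu), list(mu)
    n = len(mu)
    for P, Q in zip(P_list, Q_list):
        total += sum(marg[i] * kl(P[i], Q[i]) for i in range(n))
        marg = [sum(marg[i] * P[i][j] for i in range(n)) for j in range(n)]
    return total


mu, nu = [0.3, 0.7], [0.5, 0.5]
P_list = [[[0.9, 0.1], [0.2, 0.8]], [[0.6, 0.4], [0.3, 0.7]]]
Q_list = [[[0.8, 0.2], [0.25, 0.75]], [[0.5, 0.5], [0.4, 0.6]]]

lhs = path_kl(mu, P_list, nu, Q_list)
rhs = chain_rule_rhs(mu, P_list, nu, Q_list)
```

Note that the one-step terms are weighted by the marginals of the first chain only; the second chain's marginals never appear, which is why the score error in the decomposition is measured under the true $p_{k+1}$.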

###### Proposition 6\.2\(From local reverse\-kernel error to final error\)\.

Let the exact reverse chain start from $p_K$ and use the exact reverse kernels $P_k^{\leftarrow}(x_{k+1} \to \cdot)$ from ([5.3](https://arxiv.org/html/2607.01693#S5.E3)), $k = 1, \ldots, K-1$, producing final law $p_1$. Let an approximate reverse chain start from $\widehat{p}_K$ and use kernels $\widehat{P}_k^{\leftarrow}(x_{k+1} \to \cdot)$, producing final law $\widehat{p}_1$. Then

$$\operatorname{D_{\mathsf{KL}}}(p_1 \,\|\, \widehat{p}_1) \leq \operatorname{D_{\mathsf{KL}}}(p_K \,\|\, \widehat{p}_K) + \sum_{k=1}^{K-1}\mathbb{E}_{X_{k+1} \sim p_{k+1}}\operatorname{D_{\mathsf{KL}}}\bigl(P_k^{\leftarrow}(X_{k+1} \to \cdot) \,\big\|\, \widehat{P}_k^{\leftarrow}(X_{k+1} \to \cdot)\bigr).$$

###### Proof\.

Consider the exact path law of $(X_{K},\ldots,X_{1})$ and the approximate path law of $(\widehat{X}_{K},\ldots,\widehat{X}_{1})$. Apply Lemma [6.1](https://arxiv.org/html/2607.01693#S6.Thmtheorem1) to these path laws. The initial contribution is $\operatorname{D_{\mathsf{KL}}}(p_{K}\|\widehat{p}_{K})$, and the transition contribution at reverse step $k$ is

$$\mathbb{E}_{X_{k+1}\sim p_{k+1}}\operatorname{D_{\mathsf{KL}}}\!\left(P_{k}^{\leftarrow}(X_{k+1}\to\cdot)\middle\|\widehat{P}_{k}^{\leftarrow}(X_{k+1}\to\cdot)\right).$$

Thus the KL divergence between the two full path laws equals the right-hand side of the claimed bound. Projecting a path to its final coordinate is a measurable map, so data processing for KL gives the claimed bound on $\operatorname{D_{\mathsf{KL}}}(p_{1}\|\widehat{p}_{1})$. ∎

This proposition is the path\-space part of the error decomposition \([6\.2](https://arxiv.org/html/2607.01693#S6.E2)\)\. The remaining work is to estimate the one\-step terms in the sum\.
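The two steps of this argument, the chain rule on path space followed by projection to the final coordinate, can be checked numerically on a toy finite-state chain. The sketch below is illustrative only: the state-space size `n`, the chain length `K`, and the randomly drawn laws and kernels are assumptions made for the example, not objects from the notes.

```python
import numpy as np

def kl(p, q):
    """KL divergence between two strictly positive probability arrays of equal shape."""
    p, q = np.ravel(p), np.ravel(q)
    return float(np.sum(p * np.log(p / q)))

def random_law(rng, n):
    """Random strictly positive probability vector (toy initial law)."""
    v = rng.random(n) + 0.1
    return v / v.sum()

def random_kernel(rng, n):
    """Random strictly positive row-stochastic matrix: row x is the law of the next state."""
    m = rng.random((n, n)) + 0.1
    return m / m.sum(axis=1, keepdims=True)

def path_law(init, kernels):
    """Joint law of (x_K, ..., x_1): init(x_K) times the product of kernels[k-1][x_{k+1}, x_k]."""
    joint = init
    for kern in reversed(kernels):          # apply the kernel for step K-1 first, step 1 last
        joint = joint[..., None] * kern
    return joint

rng = np.random.default_rng(0)
n, K = 3, 4                                 # hypothetical sizes: 3 states, 3 reverse steps
p_K, q_K = random_law(rng, n), random_law(rng, n)
P = [random_kernel(rng, n) for _ in range(K - 1)]   # P[k-1] plays the role of the exact kernel at step k
Q = [random_kernel(rng, n) for _ in range(K - 1)]   # Q[k-1] plays the role of the approximate kernel

# Brute-force KL between the two full path laws.
path_kl = kl(path_law(p_K, P), path_law(q_K, Q))

# Chain rule: initial KL plus expected one-step KLs under the exact marginals.
chain_rule = kl(p_K, q_K)
p_marg, q_marg = p_K, q_K
for Pk, Qk in zip(reversed(P), reversed(Q)):
    chain_rule += sum(p_marg[x] * kl(Pk[x], Qk[x]) for x in range(n))
    p_marg, q_marg = p_marg @ Pk, q_marg @ Qk

# After the loop the marginals are the final laws; data processing bounds their KL.
final_kl = kl(p_marg, q_marg)
print(path_kl, chain_rule, final_kl)
```

In this toy setting the first two printed numbers agree up to rounding, and the third is no larger than the first, which is the content of the proposition.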

### 6.3. How score error contributes to KL

Proposition [6.2](https://arxiv.org/html/2607.01693#S6.Thmtheorem2) reduces the global KL error to local KL comparisons between the exact reverse kernel $P_{k}^{\leftarrow}$ and the implemented kernel $\widehat{P}_{k}^{\leftarrow}$. Each local comparison can contain several sources of error. One source is numerical: even with the exact score, the implemented transition may only approximate the Bayes reverse kernel. Another source is statistical: the sampler uses a learned score $\mathsf{s}$ in place of the true score $\mathsf{s}^{\star}$.

In the rest of this section we specialize the DDPM grid of Section [5.1](https://arxiv.org/html/2607.01693#S5.SS1) to the variance-exploding case ($\alpha_{k}=1$, $\sigma_{k}^{2}=t_{k}$), now run backward from a large noise level $T$ down to an early-stopping level $\delta>0$:

$$\delta=t_{1}<t_{2}<\cdots<t_{K}=T,\qquad h_{k}=t_{k+1}-t_{k}.$$

The reverse step indexed by $k$ moves from noise level $t_{k+1}$ to $t_{k}$, for $k=1,\ldots,K-1$, so the chain ends at $t_{1}=\delta$; the final law written $p_{1}$ in the earlier subsections is thus the early-stopped law $p_{\delta}$. The one-step variance parameter then reduces to the time increment, $\eta_{k}=h_{k}$, and we write $h_{k}$ from here on to emphasize that the reverse step is a genuine time step of an SDE discretization.

This subsection isolates the statistical part\. The implemented reverse step is the Gaussian update \([5\.4](https://arxiv.org/html/2607.01693#S5.E4)\) – equivalently, an Euler–Maruyama discretization of the reverse SDE – in which the score enters only through the mean of the kernel\. A score error therefore becomes a mean error, which we now quantify\.

To keep the notation separate, first fix the Gaussian discretization\. Let

$$P_{k}^{\mathsf{EM}}(x^{\prime}\to\cdot)=\mathcal{N}\!\left(x^{\prime}+h_{k}\mathsf{s}^{\star}_{k+1}(x^{\prime}),h_{k}I\right)\tag{6.4}$$

be the exact-score Euler kernel, and let

$$\widehat{P}_{k}^{\mathsf{EM}}(x^{\prime}\to\cdot)=\mathcal{N}\!\left(x^{\prime}+h_{k}\mathsf{s}_{k+1}(x^{\prime}),h_{k}I\right)\tag{6.5}$$

be the learned-score Euler kernel. These are not the exact Bayes reverse kernel $P_{k}^{\leftarrow}$; they are the true-score and learned-score versions of the same Gaussian approximation ([5.4](https://arxiv.org/html/2607.01693#S5.E4)). The following KL therefore isolates the statistical score error after the discretization has been fixed. For fixed $x^{\prime}$, applying Lemma [B.2](https://arxiv.org/html/2607.01693#A2.Thmtheorem2) gives

$$\operatorname{D_{\mathsf{KL}}}\!\left(P_{k}^{\mathsf{EM}}(x^{\prime}\to\cdot)\middle\|\widehat{P}_{k}^{\mathsf{EM}}(x^{\prime}\to\cdot)\right)=\frac{1}{2h_{k}}\left\lVert h_{k}\bigl(\mathsf{s}_{k+1}(x^{\prime})-\mathsf{s}^{\star}_{k+1}(x^{\prime})\bigr)\right\rVert^{2}.$$

Averaging over $X_{k+1}\sim p_{k+1}$ and using the definition of $\varepsilon_{k+1,\mathsf{score}}^{2}$ gives

$$\mathbb{E}_{X_{k+1}\sim p_{k+1}}\operatorname{D_{\mathsf{KL}}}\!\left(P_{k}^{\mathsf{EM}}(X_{k+1}\to\cdot)\middle\|\widehat{P}_{k}^{\mathsf{EM}}(X_{k+1}\to\cdot)\right)=\frac{h_{k}}{2}\varepsilon_{k+1,\mathsf{score}}^{2},\tag{6.6}$$

which is the score-estimation contribution in the error decomposition ([6.2](https://arxiv.org/html/2607.01693#S6.E2)), after the discretization choice has been separated, with $\eta_{k}=h_{k}$ and $c=\tfrac{1}{2}$.

Summing these one\-step terms over the reverse chain, the total score contribution has the form

$$\sum_{k=1}^{K-1}h_{k}\varepsilon_{k+1,\mathsf{score}}^{2},\tag{6.7}$$

the discrete analogue of the path-space estimate

$$\int_{\delta}^{T}\mathbb{E}_{X_{t}\sim p_{t}}\left\lVert\mathsf{s}_{t}(X_{t})-\mathsf{s}^{\star}_{t}(X_{t})\right\rVert^{2}\,\mathrm{d}t.$$

The weight $h_{k}$ is what makes this estimate informative: a score model need not be equally accurate at all times, since an error matters less when the sampler takes only a small step, while an error during a large reverse step has a larger effect.
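The one-step identity behind (6.6) is a statement about two Gaussians with the same covariance $h_{k}I$ and shifted means. The sketch below compares the standard closed-form Gaussian KL with the expression $\tfrac{h_{k}}{2}\lVert\mathsf{s}_{k+1}(x')-\mathsf{s}^{\star}_{k+1}(x')\rVert^{2}$. The dimension, the step size, and the random vectors standing in for the two scores are assumptions chosen for illustration.

```python
import numpy as np

def gaussian_kl(m1, S1, m2, S2):
    """KL( N(m1, S1) || N(m2, S2) ) for full covariance matrices (standard closed form)."""
    d = m1.shape[0]
    S2_inv = np.linalg.inv(S2)
    diff = m2 - m1
    _, logdet1 = np.linalg.slogdet(S1)
    _, logdet2 = np.linalg.slogdet(S2)
    return 0.5 * (np.trace(S2_inv @ S1) + diff @ S2_inv @ diff - d + logdet2 - logdet1)

rng = np.random.default_rng(0)
d, h = 5, 0.03                                   # hypothetical dimension and step size h_k
x_prime = rng.normal(size=d)                     # state at noise level t_{k+1}
s_true = rng.normal(size=d)                      # stand-in for the exact score at x'
s_learned = s_true + 0.2 * rng.normal(size=d)    # stand-in for the learned score at x'

# KL between the exact-score and learned-score Euler kernels (6.4) and (6.5).
kl_general = gaussian_kl(x_prime + h * s_true, h * np.eye(d),
                         x_prime + h * s_learned, h * np.eye(d))

# The same quantity written as a weighted score error.
kl_formula = 0.5 * h * np.sum((s_learned - s_true) ** 2)
print(kl_general, kl_formula)
```

Because the covariances coincide, only the mean term of the Gaussian KL survives, which is why the score error enters with weight $h_{k}/2$.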

### 6.4. Euler–Maruyama

We now turn to the numerical error of the simplest reverse-time integrator, the Euler–Maruyama scheme, in the variance-exploding setting fixed at the start of this section. Here the initialization term in the error decomposition ([6.2](https://arxiv.org/html/2607.01693#S6.E2)) is explicit. If the implemented reverse chain is initialized from $\mathcal{N}\!\left(0,TI\right)$, then

$$\varepsilon_{\mathsf{init}}^{2}:=\operatorname{D_{\mathsf{KL}}}(p_{T}\|\mathcal{N}\!\left(0,TI\right))\leq\int\operatorname{D_{\mathsf{KL}}}\!\left(\mathcal{N}\!\left(x,TI\right)\middle\|\mathcal{N}\!\left(0,TI\right)\right)p_{\mathsf{data}}(\mathrm{d}x)=\frac{\mathbb{E}\left\lVert X_{0}\right\rVert^{2}}{2T}.\tag{6.8}$$

Here we used the representation $p_{T}=\int\mathcal{N}\!\left(x,TI\right)\,p_{\mathsf{data}}(\mathrm{d}x)$ and convexity of KL in its first argument.
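For a concrete feel for (6.8), one can take Gaussian data, where the left-hand side is available in closed form. The sketch below assumes $p_{\mathsf{data}}=\mathcal{N}(0,\sigma^{2}I_{d})$, so that $p_{T}=\mathcal{N}(0,(\sigma^{2}+T)I_{d})$ and $\mathbb{E}\lVert X_{0}\rVert^{2}=d\sigma^{2}$. The function names and the numerical values are illustrative assumptions.

```python
import numpy as np

def init_kl_gaussian_data(d, sigma2, T):
    """Exact KL( N(0, (sigma2 + T) I_d) || N(0, T I_d) ) for the assumed Gaussian data law."""
    ratio = (sigma2 + T) / T
    return 0.5 * d * (ratio - 1.0 - np.log(ratio))

def init_bound(d, sigma2, T):
    """Right-hand side of (6.8): E||X_0||^2 / (2T), with E||X_0||^2 = d * sigma2."""
    return d * sigma2 / (2.0 * T)

d, sigma2 = 8, 1.0                      # hypothetical dimension and data variance
results = [(T, init_kl_gaussian_data(d, sigma2, T), init_bound(d, sigma2, T))
           for T in (1.0, 10.0, 100.0)]
for T, exact, bound in results:
    print(f"T={T:6.1f}  exact KL={exact:.5f}  bound={bound:.5f}")
```

In this example the exact KL decays like $T^{-2}$ while the bound decays like $T^{-1}$, so (6.8) is not tight for Gaussian data, but it holds without any structural assumption beyond a finite second moment.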

The other, and more substantial, contribution is the discretization error accumulated over the reverse steps, whose size is governed by the step lengths $h_{k}$. Choosing the grid well is therefore the central design question, and the relevant local scale against which to measure $h_{k}$ is the current noise level. Indeed, the reverse drift contains $\mathsf{s}^{\star}_{t}=(\mathsf{D}^{\star}_{t}-x)/t$, so the same absolute step size is much more aggressive near the early-stopping level than it is at large noise. A uniform grid would therefore force all steps to be as small as the lowest noise scale $\delta$, leading to a poor dependence on $T/\delta$. The standard non-uniform-grid remedy is to use a relative control, as in Chen, Lee, and Lu \[[12](https://arxiv.org/html/2607.01693#bib.bib12)\]:

$$h_{k}\leq\bar{h}\,t_{k+1},\qquad\bar{h}\leq\frac{1}{2}.\tag{6.9}$$

The canonical example is the geometric grid

$$t_{k}=(1-\bar{h})^{K-k}T,\qquad\delta=t_{1}=(1-\bar{h})^{K-1}T.\tag{6.10}$$

For this grid,

$$K\asymp\bar{h}^{-1}\log\frac{T}{\delta},\qquad\bar{h}\asymp\frac{\log(T/\delta)}{K}.$$
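The geometric grid (6.10) is easy to build explicitly. The sketch below solves $(1-\bar{h})^{K-1}T=\delta$ for $\bar{h}$ given $K$, constructs the noise levels, and checks that the relative control (6.9) holds with equality. The helper name `geometric_grid` and the values of $T$, $\delta$, $K$ are assumptions for illustration.

```python
import numpy as np

def geometric_grid(T, delta, K):
    """Noise levels t_1 < ... < t_K with t_1 = delta, t_K = T and constant ratio t_k / t_{k+1}."""
    h_bar = 1.0 - (delta / T) ** (1.0 / (K - 1))   # solves (1 - h_bar)^(K-1) * T = delta
    k = np.arange(1, K + 1)
    t = T * (1.0 - h_bar) ** (K - k)
    return t, h_bar

T, delta, K = 10.0, 1e-2, 50            # hypothetical horizon, early-stopping level, grid size
t, h_bar = geometric_grid(T, delta, K)
h = np.diff(t)                          # h_k = t_{k+1} - t_k for k = 1, ..., K-1

# Relative step size h_k / t_{k+1} is constant and equal to h_bar on this grid.
relative_steps = h / t[1:]
# Exact step count implied by the grid, to compare with K ~ log(T/delta) / h_bar.
steps_exact = np.log(T / delta) / (-np.log1p(-h_bar))
print(h_bar, relative_steps.min(), relative_steps.max(), steps_exact)
```

The absolute steps $h_{k}$ shrink in proportion to the noise level, which is what lets the grid reach $\delta$ in roughly $\bar{h}^{-1}\log(T/\delta)$ steps rather than $T/\delta$ steps.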
With this convention, the exact Bayes reverse kernel is

$$P_{k}^{\leftarrow}(x^{\prime}\to\mathrm{d}x)\propto_{x}p_{k}(x)\exp\!\left(-\frac{\left\lVert x-x^{\prime}\right\rVert^{2}}{2h_{k}}\right)\mathrm{d}x,\tag{6.11}$$

where $x^{\prime}$ is the state at noise level $t_{k+1}$. A first-order Gaussian approximation freezes the score at the beginning of the reverse step. This is the exact-score Euler kernel $P_{k}^{\mathsf{EM}}$ from ([6.4](https://arxiv.org/html/2607.01693#S6.E4)); the learned sampler uses $\widehat{P}_{k}^{\mathsf{EM}}$ from ([6.5](https://arxiv.org/html/2607.01693#S6.E5)), replacing $\mathsf{s}^{\star}_{k+1}$ by $\mathsf{s}_{k+1}$. The superscript $\mathsf{EM}$ is just a reminder that the score has been frozen during the step. In the general DDPM notation of ([5.4](https://arxiv.org/html/2607.01693#S5.E4)), these formulas are obtained by setting $\alpha_{k}=1$ and $\eta_{k}=h_{k}$.

There are two local error sources. The score-estimation part has already been isolated in the previous subsection: in the present variance-exploding normalization, ([6.6](https://arxiv.org/html/2607.01693#S6.E6)) contributes $h_{k}\varepsilon_{k+1,\mathsf{score}}^{2}$ up to constants. The new issue in this subsection is the numerical error caused by Euler–Maruyama itself. Thus we first compare the exact reverse kernel $P_{k}^{\leftarrow}$ with the exact-score Euler kernel $P_{k}^{\mathsf{EM}}$. This is the error made by freezing the true reverse drift during one time step.

###### Proposition 6.3 (Euler–Maruyama discretization error).

Suppose that, over reverse step $k$, the exact score field is Lipschitz at scale $L_{k}$ on the region where the process typically lies. Then the exact-score Euler kernel satisfies a one-step estimate of the form

$$\mathbb{E}_{X_{k+1}\sim p_{k+1}}\operatorname{D_{\mathsf{KL}}}\!\left(P_{k}^{\leftarrow}(X_{k+1}\to\cdot)\middle\|P_{k}^{\mathsf{EM}}(X_{k+1}\to\cdot)\right)\lesssim d\,L_{k}^{2}\,h_{k}^{2},$$

up to lower-order schedule factors.

###### Proof\.

We write the single reverse step in reverse time. Condition on the starting point $X_{k+1}=x^{\prime}$ at noise level $t_{k+1}$, and let $0\leq r\leq h_{k}$. The exact reverse diffusion on this interval has drift $\mathsf{s}^{\star}_{t_{k+1}-r}$:

$$\mathrm{d}Y^{\leftarrow}_{r}=\mathsf{s}^{\star}_{t_{k+1}-r}(Y^{\leftarrow}_{r})\,\mathrm{d}r+\mathrm{d}B^{\leftarrow}_{r},\qquad Y^{\leftarrow}_{0}=x^{\prime}.$$

Its endpoint law is $P_{k}^{\leftarrow}(x^{\prime}\to\cdot)$. The exact-score Euler–Maruyama interpolation freezes the drift at the beginning of the step:

$$\mathrm{d}\bar{Y}^{\leftarrow}_{r}=\mathsf{s}^{\star}_{k+1}(x^{\prime})\,\mathrm{d}r+\mathrm{d}B^{\leftarrow}_{r},\qquad\bar{Y}^{\leftarrow}_{0}=x^{\prime},$$

whose endpoint law is precisely $P_{k}^{\mathsf{EM}}(x^{\prime}\to\cdot)$. By the Girsanov KL formula in Appendix [A](https://arxiv.org/html/2607.01693#A1), followed by data processing (Lemma [A.1](https://arxiv.org/html/2607.01693#A1.Thmtheorem1)), the endpoint comparison is bounded by the path comparison

$$\operatorname{D_{\mathsf{KL}}}\!\left(P_{k}^{\leftarrow}(x^{\prime}\to\cdot)\middle\|P_{k}^{\mathsf{EM}}(x^{\prime}\to\cdot)\right)\leq\frac{1}{2}\mathbb{E}\int_{0}^{h_{k}}\left\lVert\mathsf{s}^{\star}_{t_{k+1}-r}(Y^{\leftarrow}_{r})-\mathsf{s}^{\star}_{k+1}(x^{\prime})\right\rVert^{2}\,\mathrm{d}r.$$

This display is the basic Euler discretization estimate: the whole cost is the cost of replacing the moving score along the ideal reverse path by the frozen score at the left endpoint.

Under the local Lipschitz bound, the score difference is controlled by the motion of the reverse path during the step, up to lower-order terms coming from the deterministic change of the noise level. Since the Brownian displacement over time $r$ has squared size of order $d\,r$, the local estimate is

$$\mathbb{E}\left\lVert\mathsf{s}^{\star}_{t_{k+1}-r}(Y^{\leftarrow}_{r})-\mathsf{s}^{\star}_{k+1}(x^{\prime})\right\rVert^{2}\lesssim L_{k}^{2}\,d\,r,$$

again suppressing schedule-dependent lower-order terms. Integrating in $r\in[0,h_{k}]$ gives

$$\frac{1}{2}L_{k}^{2}d\int_{0}^{h_{k}}r\,\mathrm{d}r\lesssim d\,L_{k}^{2}h_{k}^{2}.$$

Finally average this conditional bound over $X_{k+1}\sim p_{k+1}$. ∎

Proposition [6.3](https://arxiv.org/html/2607.01693#S6.Thmtheorem3) is only the exact-score discretization estimate. When the sampler uses $\mathsf{s}_{k+1}$ instead of $\mathsf{s}^{\star}_{k+1}$, the extra contribution is the statistical Gaussian mean-error term computed in ([6.6](https://arxiv.org/html/2607.01693#S6.E6)), which becomes $h_{k}\varepsilon_{k+1,\mathsf{score}}^{2}$, up to a constant factor, in the variance-exploding normalization.

Combining Proposition [6.3](https://arxiv.org/html/2607.01693#S6.Thmtheorem3), the score-error contribution ([6.6](https://arxiv.org/html/2607.01693#S6.E6)), and Proposition [6.2](https://arxiv.org/html/2607.01693#S6.Thmtheorem2) gives the standard Euler–Maruyama diffusion-sampling bound:

$$\operatorname{D_{\mathsf{KL}}}(p_{\delta}\|\widehat{p}_{\delta})\lesssim\varepsilon_{\mathsf{init}}^{2}+\sum_{k=1}^{K-1}d\,L_{k}^{2}h_{k}^{2}+\sum_{k=1}^{K-1}h_{k}\varepsilon_{k+1,\mathsf{score}}^{2}.\tag{6.12}$$

The three terms are initialization at the high-noise endpoint, Euler discretization, and score estimation. This is the KL telescoping estimate of Proposition [6.2](https://arxiv.org/html/2607.01693#S6.Thmtheorem2) with the local Gaussian calculation inserted, so it is the Euler–Maruyama specialization of the error decomposition ([6.2](https://arxiv.org/html/2607.01693#S6.E2)).
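The discretization term can be made fully explicit in a toy case. The sketch below assumes one-dimensional Gaussian data $p_{\mathsf{data}}=\mathcal{N}(0,\sigma^{2})$, for which $p_{t}=\mathcal{N}(0,\sigma^{2}+t)$ and the exact score is $\mathsf{s}^{\star}_{t}(x)=-x/(\sigma^{2}+t)$. The exact-score Euler chain is then linear and Gaussian, so its variance can be propagated exactly instead of being simulated. Initialization is taken from the exact law $p_{T}$ and the exact score is used, so only the Euler error remains. All names and parameter values are assumptions for illustration.

```python
import numpy as np

def em_final_variance_error(sigma2, T, delta, K):
    """Excess variance of exact-score Euler-Maruyama at the early-stopping level.

    Assumes 1D data N(0, sigma2), a geometric grid with K levels, and exact initialization.
    Returns Var(X_hat at t_1) - (sigma2 + delta).
    """
    ratio = (delta / T) ** (1.0 / (K - 1))            # t_k / t_{k+1} = 1 - h_bar
    t = T * ratio ** (K - np.arange(1, K + 1))        # t[0] = delta, ..., t[K-1] = T
    v = sigma2 + T                                    # variance of the exact law p_T
    for k in range(K - 2, -1, -1):                    # reverse step from t[k+1] down to t[k]
        h = t[k + 1] - t[k]
        a = 1.0 - h / (sigma2 + t[k + 1])             # mean map: x' + h * s*(x') = a * x'
        v = a * a * v + h                             # Gaussian kernel adds variance h
    return v - (sigma2 + t[0])

sigma2, T, delta = 1.0, 10.0, 1e-2                    # hypothetical values
err_coarse = em_final_variance_error(sigma2, T, delta, K=20)
err_fine = em_final_variance_error(sigma2, T, delta, K=200)
print(err_coarse, err_fine)
```

In this example each step injects an excess variance $h_{k}^{2}/(\sigma^{2}+t_{k+1})$, which is second order in the step size with the local Lipschitz constant $1/(\sigma^{2}+t_{k+1})$ of the score, so refining the grid by a factor of ten shrinks the final error by roughly the same factor.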

In fact, the discretization error in estimate ([6.12](https://arxiv.org/html/2607.01693#S6.E12)) can be sharpened, as the next subsection shows. Once initialization and score estimation have been separated, the exact-score Euler sampler pays only for freezing the true score during each reverse step. Let $t^{+}$ denote the next grid point at or above the current noise level, so $t^{+}=t_{k+1}$ for $t\in(t_{k},t_{k+1}]$. The discretization cost is measured by the pathwise quantity

$$\int_{\delta}^{T}\mathbb{E}\left\lVert\mathsf{s}^{\star}_{t}(X_{t})-\mathsf{s}^{\star}_{t^{+}}(X_{t^{+}})\right\rVert^{2}\,\mathrm{d}t,\tag{6.13}$$

up to deterministic schedule factors in more general schedules. This is the pathwise version of the finite-grid $\sum_{k}dL_{k}^{2}h_{k}^{2}$ term: Euler–Maruyama freezes the score at the high-noise endpoint $t^{+}$ of each reverse step, and the discretization error is governed by the change in the true score along the ideal reverse path as the noise time decreases from $t^{+}$ to $t$.

In the error analysis above, the passage from this score difference to the bound $dL_{k}^{2}h_{k}^{2}$ used a worst-case control: the score was bounded by a Lipschitz constant $L_{k}$, or equivalently by a pointwise Hessian bound for $\log p_{t}$. The factor $d$ comes from the Brownian displacement during one step; the potential loss comes from multiplying it by a uniform operator-norm bound on the Hessian. Under weak assumptions, this worst-case curvature can be much larger than the average curvature seen by the reverse process, especially near the early-stopping time. The next subsection replaces this pointwise control by an averaged posterior-covariance quantity, leading to the nearly linear-in-$d$ discretization bound.

### 6.5. Hessian control

The difficulty left by the previous subsection is the pathwise score difference in ([6.13](https://arxiv.org/html/2607.01693#S6.E13)). The aim of this subsection is to replace the worst-case Hessian estimate by the more averaged bound

$$\text{score-freezing cost}\;\lesssim\;\bar{h}\times\text{posterior-covariance budget}.$$

We first derive the local score-freezing estimate where the factor $\bar{h}$ appears, and then rewrite the remaining weighted quadratic variation in terms of posterior covariance.

For $t\in[t_{k},t_{k+1}]$, write $t^{+}=t_{k+1}$. The Girsanov score-freezing cost in ([6.13](https://arxiv.org/html/2607.01693#S6.E13)) is, up to the constant $\frac{1}{2}$ and schedule factors, the sum over grid intervals of

$$\int_{t_{k}}^{t_{k+1}}\mathbb{E}\left\lVert\mathsf{s}^{\star}_{t}(X_{t})-\mathsf{s}^{\star}_{t^{+}}(X_{t^{+}})\right\rVert^{2}\,\mathrm{d}t.$$
The useful variable is not the score itself but the optimal denoiser\. In the variance\-exploding normalization,

$$\mathsf{D}^{\star}_{t}(x):=\mathbb{E}[X_{0}\mid X_{t}=x]=x+t\mathsf{s}^{\star}_{t}(x),\qquad\mathsf{s}^{\star}_{t}(x)=\frac{\mathsf{D}^{\star}_{t}(x)-x}{t}.$$

The stochastic-localization martingale identity ([4.7](https://arxiv.org/html/2607.01693#S4.E7)), rewritten in the noise-time coordinate, says that for $t\leq t^{+}$,

$$\mathbb{E}\left\lVert\mathsf{D}^{\star}_{t}(X_{t})-\mathsf{D}^{\star}_{t^{+}}(X_{t^{+}})\right\rVert^{2}=\int_{t}^{t^{+}}\mathbb{E}\left\lVert\nabla\mathsf{D}^{\star}_{r}(X_{r})\right\rVert_{\mathrm{F}}^{2}\,\mathrm{d}r.$$

This is the replacement for a pointwise Hessian bound: it measures the actual quadratic variation of the denoiser along the noising path.

Now estimate one grid interval. Using $\mathsf{s}^{\star}_{t}=(\mathsf{D}^{\star}_{t}-x)/t$ and keeping only the curvature term explicitly, the score increment is bounded by the denoiser increment with the natural $t^{-2}$ weight:

$$\mathbb{E}\left\lVert\mathsf{s}^{\star}_{t}(X_{t})-\mathsf{s}^{\star}_{t^{+}}(X_{t^{+}})\right\rVert^{2}\lesssim\frac{1}{t^{2}}\mathbb{E}\left\lVert\mathsf{D}^{\star}_{t}(X_{t})-\mathsf{D}^{\star}_{t^{+}}(X_{t^{+}})\right\rVert^{2}.$$

Here we have omitted some error terms coming from the explicit factor $x/t$, which can be handled easily. Substituting the martingale identity and applying Fubini gives

$$\begin{aligned}
\int_{t_{k}}^{t_{k+1}}\mathbb{E}\left\lVert\mathsf{s}^{\star}_{t}(X_{t})-\mathsf{s}^{\star}_{t^{+}}(X_{t^{+}})\right\rVert^{2}\,\mathrm{d}t
&\lesssim\int_{t_{k}}^{t_{k+1}}\frac{1}{t^{2}}\int_{t}^{t_{k+1}}\mathbb{E}\left\lVert\nabla\mathsf{D}^{\star}_{r}(X_{r})\right\rVert_{\mathrm{F}}^{2}\,\mathrm{d}r\,\mathrm{d}t\\
&=\int_{t_{k}}^{t_{k+1}}\mathbb{E}\left\lVert\nabla\mathsf{D}^{\star}_{r}(X_{r})\right\rVert_{\mathrm{F}}^{2}\left(\int_{t_{k}}^{r}\frac{\mathrm{d}t}{t^{2}}\right)\mathrm{d}r\\
&=\int_{t_{k}}^{t_{k+1}}\mathbb{E}\left\lVert\nabla\mathsf{D}^{\star}_{r}(X_{r})\right\rVert_{\mathrm{F}}^{2}\,\frac{r-t_{k}}{r\,t_{k}}\,\mathrm{d}r.
\end{aligned}$$

Since $r-t_{k}\leq h_{k}$ and the mesh condition ([6.9](https://arxiv.org/html/2607.01693#S6.E9)) with $\bar{h}\leq 1/2$ gives $t_{k}\geq t_{k+1}/2$, we get

$$\int_{t_{k}}^{t_{k+1}}\mathbb{E}\left\lVert\mathsf{s}^{\star}_{t}(X_{t})-\mathsf{s}^{\star}_{t^{+}}(X_{t^{+}})\right\rVert^{2}\,\mathrm{d}t\lesssim\frac{h_{k}}{t_{k+1}}\int_{t_{k}}^{t_{k+1}}\frac{\mathbb{E}\left\lVert\nabla\mathsf{D}^{\star}_{r}(X_{r})\right\rVert_{\mathrm{F}}^{2}}{r}\,\mathrm{d}r.$$

This is the point at which $\bar{h}$ enters: the ratio $h_{k}/t_{k+1}$ is the relative step size on the interval, and ([6.9](https://arxiv.org/html/2607.01693#S6.E9)) says it is at most $\bar{h}$. Summing over $k$ therefore gives

$$\sum_{k}\int_{t_{k}}^{t_{k+1}}\mathbb{E}\left\lVert\mathsf{s}^{\star}_{t}(X_{t})-\mathsf{s}^{\star}_{t^{+}}(X_{t^{+}})\right\rVert^{2}\,\mathrm{d}t\lesssim\bar{h}\int_{\delta}^{T}\frac{\mathbb{E}\left\lVert\nabla\mathsf{D}^{\star}_{t}(X_{t})\right\rVert_{\mathrm{F}}^{2}}{t}\,\mathrm{d}t.$$
It remains to express the weighted quadratic variation in a more intrinsic form\. Let

$$\Sigma_{t}(x)=\operatorname{Cov}(X_{0}\mid X_{t}=x)$$

be the posterior covariance in the variance-exploding noising channel. The linear-response identity ([4.5](https://arxiv.org/html/2607.01693#S4.E5)) and the score-Jacobian identity ([4.6](https://arxiv.org/html/2607.01693#S4.E6)), translated to this notation, give

$$\nabla\mathsf{D}^{\star}_{t}(x)=I+t\nabla\mathsf{s}^{\star}_{t}(x)=t^{-1}\Sigma_{t}(x).$$

The covariance-dissipation identity ([4.8](https://arxiv.org/html/2607.01693#S4.E8)), $\frac{\mathrm{d}}{\mathrm{d}u}\mathbb{E}[\operatorname{Tr}\Sigma_{u}]=-\mathbb{E}[\operatorname{Tr}(\Sigma_{u}^{2})]$, after the change of variables $t=1/u$, gives

$$\mathbb{E}\left\lVert\nabla\mathsf{D}^{\star}_{t}(X_{t})\right\rVert_{\mathrm{F}}^{2}=\partial_{t}\,\mathbb{E}\operatorname{Tr}\Sigma_{t}(X_{t}).$$

Thus

$$\int_{\delta}^{T}\frac{\mathbb{E}\left\lVert\nabla\mathsf{D}^{\star}_{t}(X_{t})\right\rVert_{\mathrm{F}}^{2}}{t}\,\mathrm{d}t=\int_{\delta}^{T}\frac{\partial_{t}\,\mathbb{E}\operatorname{Tr}\Sigma_{t}(X_{t})}{t}\,\mathrm{d}t.$$

Integrating by parts,

$$\int_{\delta}^{T}\frac{\partial_{t}\,\mathbb{E}\operatorname{Tr}\Sigma_{t}(X_{t})}{t}\,\mathrm{d}t=\frac{\mathbb{E}\operatorname{Tr}\Sigma_{T}(X_{T})}{T}-\frac{\mathbb{E}\operatorname{Tr}\Sigma_{\delta}(X_{\delta})}{\delta}+\int_{\delta}^{T}\frac{\mathbb{E}\operatorname{Tr}\Sigma_{t}(X_{t})}{t^{2}}\,\mathrm{d}t.$$

The middle term is nonpositive, so the first and third terms give an upper bound. Following the notation of Chewi \[[18](https://arxiv.org/html/2607.01693#bib.bib18), Chapter 12\], define the covariance budget

$$\mathfrak{D}_{\delta,T}(p_{\mathsf{data}}):=\frac{\mathbb{E}_{X_{T}\sim p_{T}}\operatorname{Tr}\Sigma_{T}(X_{T})}{T}+\int_{\delta}^{T}\frac{\mathbb{E}_{X_{t}\sim p_{t}}\operatorname{Tr}\Sigma_{t}(X_{t})}{t^{2}}\,\mathrm{d}t.\tag{6.14}$$

Combining the previous displays gives the clean summary

$$\sum_{k}\int_{t_{k}}^{t_{k+1}}\mathbb{E}\left\lVert\mathsf{s}^{\star}_{t}(X_{t})-\mathsf{s}^{\star}_{t^{+}}(X_{t^{+}})\right\rVert^{2}\,\mathrm{d}t\lesssim\bar{h}\,\mathfrak{D}_{\delta,T}(p_{\mathsf{data}}).$$

This is the key identification: the discretization error is governed by the weighted quadratic variation of the optimal denoiser, and $\mathfrak{D}_{\delta,T}$ packages that variation after the DDPM time weights have been accounted for. A careful version of this estimate, together with the score-error term, gives

$$\operatorname{D_{\mathsf{KL}}}(p_{\delta}\|\widehat{p}_{\delta})\leq\varepsilon_{\mathsf{init}}^{2}+4\sum_{k=1}^{K-1}h_{k}\,\mathbb{E}_{p_{k+1}}\left\lVert\mathsf{s}_{k+1}-\mathsf{s}^{\star}_{k+1}\right\rVert^{2}+2\bar{h}\,\mathfrak{D}_{\delta,T}(p_{\mathsf{data}}),\tag{6.15}$$

where the middle term is the time-weighted score error. The important lesson is that $\mathfrak{D}_{\delta,T}(p_{\mathsf{data}})$ replaces a global Lipschitz constant for the score as the data-dependent factor in the discretization complexity.
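The identities used in this derivation can be sanity-checked in a case where everything is explicit. The sketch below assumes Gaussian data $p_{\mathsf{data}}=\mathcal{N}(0,\sigma^{2}I_{d})$, for which $\Sigma_{t}=\frac{\sigma^{2}t}{\sigma^{2}+t}I_{d}$ and $\nabla\mathsf{D}^{\star}_{t}=\Sigma_{t}/t=\frac{\sigma^{2}}{\sigma^{2}+t}I_{d}$. It checks the dissipation identity in noise time by finite differences and the integration by parts by quadrature. The helper names, the quadrature grid, and the parameter values are assumptions for illustration.

```python
import numpy as np

sigma2, d = 1.0, 4                # hypothetical data law p_data = N(0, sigma2 * I_d)
delta, T = 1e-2, 10.0

def trace_post_cov(t):
    """E Tr Sigma_t for the assumed Gaussian data: d * sigma2 * t / (sigma2 + t)."""
    return d * sigma2 * t / (sigma2 + t)

def denoiser_jac_frob2(t):
    """||grad D*_t||_F^2 with grad D*_t = sigma2 / (sigma2 + t) * I_d."""
    return d * (sigma2 / (sigma2 + t)) ** 2

def trapezoid(y, x):
    """Plain trapezoid rule, written out so the sketch does not depend on a NumPy version."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

t = np.geomspace(delta, T, 20001)  # log-spaced grid, finer near the early-stopping level

# Dissipation identity in noise time: ||grad D||_F^2 = d/dt E Tr Sigma_t (central differences).
eps = 1e-6
finite_diff = (trace_post_cov(t + eps) - trace_post_cov(t - eps)) / (2 * eps)
max_identity_gap = float(np.max(np.abs(finite_diff - denoiser_jac_frob2(t))))

# Integration by parts: weighted quadratic variation = budget (6.14) minus the boundary term at delta.
weighted_qv = trapezoid(denoiser_jac_frob2(t) / t, t)
budget = trace_post_cov(T) / T + trapezoid(trace_post_cov(t) / t**2, t)
by_parts = budget - trace_post_cov(delta) / delta
print(max_identity_gap, weighted_qv, by_parts, budget)
```

Dropping the nonpositive boundary term at $\delta$ is the only inequality in this step, so in this example the gap between `weighted_qv` and `budget` is exactly $\mathbb{E}\operatorname{Tr}\Sigma_{\delta}/\delta$.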

The remaining question is how large this covariance budget can be\. A useful ambient\-dimensional estimate comes directly from its posterior\-variance form: for any distribution with finite second moment,

$$\mathbb{E}\operatorname{Tr}\Sigma_{t}(X_{t})=\mathbb{E}\left\lVert X_{0}-\mathbb{E}[X_{0}\mid X_{t}]\right\rVert^{2}\leq\mathbb{E}\left\lVert X_{0}-X_{t}\right\rVert^{2}=d\,t,$$

because $X_{t}$ itself is an admissible estimator of $X_{0}$. Substituting this into ([6.14](https://arxiv.org/html/2607.01693#S6.E14)) gives

$$\mathfrak{D}_{\delta,T}(p_{\mathsf{data}})\leq d\left(1+\log\frac{T}{\delta}\right).$$

Combining this estimate with ([6.15](https://arxiv.org/html/2607.01693#S6.E15)) and the geometric-grid relation ([6.10](https://arxiv.org/html/2607.01693#S6.E10)), the exact-score discretization term is controlled by

$$O\!\left(\frac{d\log^{2}(T/\delta)}{K}\right).$$

Thus, when the initialization and score-estimation errors are also $O(\varepsilon^{2})$, it is enough to take

$$K=\widetilde{O}\!\left(\frac{d\log^{2}(T/\delta)}{\varepsilon^{2}}\right)$$

reverse steps. This is the nearly $d$-linear dependence proved by Benton, De Bortoli, Doucet, and Deligiannidis \[[9](https://arxiv.org/html/2607.01693#bib.bib9)\] under a finite second-moment assumption. Related KL guarantees under finite Fisher-information assumptions were obtained by Conforti, Durmus, and Gentiloni Silveri \[[19](https://arxiv.org/html/2607.01693#bib.bib19)\]. The covariance budget can be bounded more sharply when the data are intrinsically low-dimensional: entropy or covering-number estimates for the smoothed law $p_{\delta}$ replace the ambient dimension by an intrinsic dimension at scale $\sqrt{\delta}$.
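To see what this step count looks like numerically, the sketch below turns the discretization term $2\bar{h}\,\mathfrak{D}_{\delta,T}$ of (6.15) into a grid size, using the ambient bound $\mathfrak{D}_{\delta,T}\leq d(1+\log(T/\delta))$ and the geometric grid (6.10). It ignores the initialization and score-error terms, and the function names and parameter values are assumptions for illustration rather than a recommended schedule.

```python
import math

def covariance_budget_bound(d, T, delta):
    """Ambient-dimension bound d * (1 + log(T / delta)) on the covariance budget."""
    return d * (1.0 + math.log(T / delta))

def steps_for_discretization(d, T, delta, eps):
    """Grid size K on a geometric grid so that 2 * h_bar * (budget bound) <= eps^2."""
    B = covariance_budget_bound(d, T, delta)
    h_target = min(0.5, eps**2 / (2.0 * B))                 # also respects h_bar <= 1/2
    K = 1 + math.ceil(math.log(T / delta) / -math.log1p(-h_target))
    h_bar = 1.0 - (delta / T) ** (1.0 / (K - 1))            # actual relative step for this K
    return K, h_bar, B

T, delta, eps = 10.0, 1e-2, 0.5                             # hypothetical values
K_100, h_100, B_100 = steps_for_discretization(100, T, delta, eps)
K_200, h_200, B_200 = steps_for_discretization(200, T, delta, eps)
print(K_100, K_200, K_200 / K_100)
```

Doubling the dimension roughly doubles the required number of steps in this sketch, which is the nearly linear dependence on $d$ described above.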

### 6.6. First-order rejection sampling and high accuracy

The Hessian-control bound explains why Euler–Maruyama can be analyzed under weak assumptions, but it also exposes a limitation of the sampler itself. Euler–Maruyama does not try to sample the exact one-step Bayes kernel; it linearizes the log density and relies on the step size being small enough that the nonlinear remainder is negligible. Thus, as for ULA before the Metropolis correction, high accuracy is bought by a fine grid and hence by a polynomial dependence on the target accuracy. The natural MALA-like question is whether this dependence can be reduced to logarithmic, or at least polylogarithmic, dependence on $1/\varepsilon$.

Here the analogy with MALA also reveals the main obstruction. MALA obtains its correction from a log-density ratio. In a diffusion model, the one-step target involves the noised density $p_{k}$, but $p_{k}(x)$ and $\log p_{k}(x)$ are not available as numerical quantities; the learned object is only an approximation of the score $\mathsf{s}^{\star}_{k}=\nabla\log p_{k}$. Thus the question is sharper than simply applying a Metropolis step: can a score-only correction emulate the missing log-density-ratio test and correct the local Gaussian proposal toward the exact Bayes kernel?

For the variance-exploding chain, this question has a useful local structure. The exact backward kernel ([6.11](https://arxiv.org/html/2607.01693#S6.E11)) is not merely a Gaussian with a shifted mean; it is a Gaussian tilt of the form

$$p_{k}(x)\exp\!\left(-\frac{\left\lVert x-x^{\prime}\right\rVert^{2}}{2h_{k}}\right),\tag{6.16}$$

where $x^{\prime}$ is the state at noise level $t_{k+1}$. Euler–Maruyama corresponds to replacing this tilt by the Gaussian obtained from a first-order expansion of $\log p_{k}$. The high-accuracy idea is to sample the tilted law ([6.16](https://arxiv.org/html/2607.01693#S6.E16)) more faithfully while still using only first-order information.

This is where the algorithmic innovation enters. First-order rejection sampling, denoted FORS \[[11](https://arxiv.org/html/2607.01693#bib.bib11)\], replaces direct log-density-ratio evaluation by randomized first-order estimates. Although $\log p_{k}(x)$ and $\log p_{k}(y)$ are unavailable separately, their difference can be written as a path integral involving only the score: for any smooth path $\gamma$ with $\gamma_{0}=y$ and $\gamma_{1}=x$,

$$\log p_{k}(x)-\log p_{k}(y)=\int_{0}^{1}\left\langle\dot{\gamma}_{r},\mathsf{s}^{\star}_{k}(\gamma_{r})\right\rangle\,\mathrm{d}r.$$

Sampling a uniformly random time $r\in[0,1]$ along the path turns this integral into an unbiased estimate of the log-density difference. Thus, even though the endpoint log densities themselves are unavailable, score queries can produce a randomized estimate of exactly the log tilt that a rejection correction needs.
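The randomized path-integral estimate can be illustrated in one dimension, where both the log density and the score are available for comparison. The sketch below assumes the toy density $\tfrac12\mathcal{N}(-1,1)+\tfrac12\mathcal{N}(1,1)$, whose score is $-x+\tanh x$, and the straight path $\gamma_{r}=y+r(x-y)$. The density, the endpoints, and the sample size are assumptions for illustration and are not part of FORS as stated in the cited work.

```python
import numpy as np

def log_density(x):
    """Unnormalized log density of 0.5 N(-1,1) + 0.5 N(1,1); the constant cancels in differences."""
    return -0.5 * x**2 + np.log(np.cosh(x))

def score(x):
    """Exact score d/dx log p(x) of the toy mixture."""
    return -x + np.tanh(x)

def log_ratio_estimates(x, y, n, rng):
    """n independent unbiased estimates of log p(x) - log p(y) along the straight path."""
    r = rng.uniform(0.0, 1.0, size=n)         # uniformly random time on gamma_r = y + r (x - y)
    return (x - y) * score(y + r * (x - y))   # <gamma_dot_r, s(gamma_r)> in one dimension

rng = np.random.default_rng(0)
x, y = 2.0, -0.5                              # hypothetical endpoints
samples = log_ratio_estimates(x, y, 200_000, rng)
estimate = float(samples.mean())
exact = float(log_density(x) - log_density(y))
print(estimate, exact)
```

Each single sample uses one score query and is unbiased for the log-density difference, which is the property that the Poisson construction in the next paragraph relies on.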

There is still one conversion to make. Rejection sampling needs an acceptance weight proportional to the exponential of the log tilt, not just an estimate of the log tilt itself. Exponentiating an unbiased estimate would not usually give the right average, so FORS uses a Poisson product identity to build the exponential tilt from first-order random estimates. In abstract form, suppose a proposal is drawn from a law $Q$ and the desired law is the exponential tilt of $Q$ with Radon–Nikodym derivative proportional to $e^{w(x)}$. Assume that, given the proposal $x$, one can sample independent bounded random variables $W_1, W_2, \ldots$ with $\mathbb{E}[W_1 \mid x] = w(x)$. In the simplest bounded form, assume $|W_j| \leq R$ almost surely for a deterministic envelope $R$. Take $J \sim \mathsf{Poisson}(2R)$, drawn independently of the $W_j$, and accept a proposal $x$ with probability

$$\prod_{j=1}^{J} \frac{R + W_j}{2R}.$$

Each factor lies in $[0,1]$ because $|W_j| \leq R$. The probability generating function of a Poisson random variable makes the average acceptance factor proportional to $e^{w(x)}$. Hence the accepted proposals follow the exponentially tilted law. If $R = \Theta(1)$, the number of first-order queries is constant in expectation and logarithmic with high probability.

###### Poisson\-product calculation\.

Condition on the proposed point $x$ and write

$$A(x) = \mathbb{E}\left[\prod_{j=1}^{J} \frac{R + W_j}{2R} \,\middle|\, x\right].$$

Given $J$, the factors are independent and have conditional mean $(R + w(x))/(2R)$. Therefore

$$A(x) = \mathbb{E}\left[\left(\frac{R + w(x)}{2R}\right)^{J} \,\middle|\, x\right].$$

The generating function $\mathbb{E}[z^J] = e^{2R(z-1)}$ of $J \sim \mathsf{Poisson}(2R)$ gives

$$A(x) = \exp\!\left(2R\left(\frac{R + w(x)}{2R} - 1\right)\right) = e^{w(x) - R}.$$

The extra factor $e^{-R}$ is independent of $x$. Hence the joint law of an accepted proposal is proportional to $e^{w(x)} Q(\mathrm{d}x)$, which is exactly the exponential tilt. The number of score queries is $J$, so its mean is $2R$ and a standard Poisson tail bound gives logarithmic high-probability control when $R$ is bounded. ∎
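The identity $A(x) = e^{w(x) - R}$ can be checked by simulation. The sketch below is illustrative only: the value of `w`, the envelope `R`, and the uniform noise model for the estimates $W_j$ are assumptions chosen so that $\mathbb{E}[W_j] = w$ and $|W_j| \leq R$. The product of factors is realized as a sequence of independent coin flips, which accepts with the same probability.

```python
import numpy as np

def poisson_product_accept(sample_w, R, rng):
    """Accept with probability prod_{j<=J} (R + W_j) / (2R), where J ~ Poisson(2R)."""
    J = rng.poisson(2.0 * R)
    for _ in range(J):
        W = sample_w(rng)                          # bounded estimate with |W| <= R
        if rng.uniform() >= (R + W) / (2.0 * R):
            return False                           # one failed coin rejects the proposal
    return True                                    # empty product (J = 0) always accepts

rng = np.random.default_rng(1)
R, w = 1.0, 0.3
# Hypothetical noisy estimator: uniform on [w - 0.5, w + 0.5], so E[W] = w and |W| <= R.
sample_w = lambda g: w + g.uniform(-0.5, 0.5)

n = 100_000
rate = sum(poisson_product_accept(sample_w, R, rng) for _ in range(n)) / n
target = np.exp(w - R)                             # predicted acceptance probability
print(rate, target)
```

Stopping at the first failed coin does not change the acceptance probability and only reduces the number of estimates drawn.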

Thus FORS is the score-only analogue of a Metropolis correction. MALA uses density ratios to correct a Gaussian proposal; FORS uses randomized first-order estimates to correct a Gaussian tilt without evaluating the density. Instead of accepting the Gaussian linearization as the sampler, the algorithm corrects that local proposal toward the true Gaussian tilt.

In the notation of the error decomposition ([6.2](https://arxiv.org/html/2607.01693#S6.E2)), FORS changes the size of the exact-score algorithmic term $\varepsilon_{\mathsf{disc}}^2$. By sampling each local Gaussian tilt accurately enough, the accumulated local algorithmic error can be made $O(\varepsilon^2)$ with a number of reverse steps that is only polylogarithmic in $1/\varepsilon$, where $\varepsilon$ is the target accuracy. In a simplified ambient-dimension form, the result of Chen, Chewi, Daskalakis, and Rakhlin \[[11](https://arxiv.org/html/2607.01693#bib.bib11)\] says that, suppressing schedule constants and taking the initialization and local algorithmic budgets of order $\varepsilon^2$, there is a reverse sampler using score evaluations for which

$$D_{\mathsf{KL}}(p_\delta \,\|\, \widehat{p}_\delta) \lesssim \varepsilon^2 + \sum_{k=1}^{K-1} h_k\, \varepsilon_{k+1,\mathsf{score}}^2 \tag{6.17}$$

with

$$K = \widetilde{O}\!\left(d \operatorname{polylog}(1/\varepsilon)\right)$$

in a basic ambient-dimension reading. The full theorem contains sharper dimension measures and refinements under additional structure, but the key message here is simpler: the expensive dependence on the accuracy parameter is replaced by a polylogarithmic one. Together with the variance-exploding early-stopping estimate $W_2^2(p_{\mathsf{data}}, p_\delta) \leq \delta d$, this separates the two error measurements: the initial smoothing is controlled in $W_2$, while the implemented reverse chain is controlled in KL. Note that the same time-integrated score-error term appears in ([6.17](https://arxiv.org/html/2607.01693#S6.E17)): FORS improves the local sampling step, but it does not remove the need for an accurate learned score.

## 7. Discrete Diffusion Models

The development so far built and analyzed diffusion models on continuous state spaces\. We now turn to a different modeling question: what if the state space itself is not continuous? Reverse\-time Bayes rules, denoising objectives, and path\-space error decompositions survive on finite state spaces; Euclidean gradients and Brownian motion do not\. Discrete diffusion models therefore force us to state the sampling ideas in a form that does not rely on calculus\.

### 7.1. Forward and reverse kernels on a finite state space

Images in pixel space can often be treated as continuous, but language, symbolic sequences, molecules, graphs, and many scientific configurations are inherently discrete. If $x$ is a token sequence, an expression such as $x + \sqrt{t}\,Z$ is meaningless. There is no small Gaussian perturbation of the word “cat” inside the vocabulary. Instead of adding Brownian noise, a discrete diffusion model applies a sequence of Markov kernels on a finite state space.

The probabilistic skeleton is exactly the one we have used throughout: a forward chain that degrades the data toward a reference law, and a reverse chain recovered by Bayes’ rule. We therefore keep the same symbols as in the continuous case. The forward transition $P_k^{\to}$ and the backward transition $P_k^{\leftarrow}$ of Subsection [5.1](https://arxiv.org/html/2607.01693#S5.SS1) are now read as *stochastic matrices* on the finite set, rather than as transition densities on $\mathbb{R}^d$. The only genuine loss is the calculus: there is no gradient $\nabla \log p_k$, so the score must be replaced by a finite-difference object.

Let $\mathcal{X}$ be a finite state space. The data law $p_{\mathsf{data}}$ may be supported on a subset $\mathcal{X}_0 \subseteq \mathcal{X}$; we view it as a law on $\mathcal{X}$ by assigning zero mass outside $\mathcal{X}_0$. For ordinary categorical sequences one can take $\mathcal{X}_0 = \mathcal{X} = \mathcal{V}^L$, while for the masked diffusions treated below the noisy state space $\mathcal{X}$ also contains mask symbols. A discrete forward diffusion is a Markov chain

$$X_0 \sim p_{\mathsf{data}}, \qquad X_{k+1} \sim P_k^{\to}(X_k \to \cdot), \qquad k = 0, \ldots, K-1, \tag{7.1}$$

where each $P_k^{\to}$ is a stochastic matrix on $\mathcal{X}$:

$$P_k^{\to}(x \to y) \geq 0, \qquad \sum_{y \in \mathcal{X}} P_k^{\to}(x \to y) = 1.$$

Let $p_k$ be the law of $X_k$. In row-vector notation,

$$p_{k+1} = p_k P_k^{\to}, \qquad p_k = p_{\mathsf{data}} P_0^{\to} P_1^{\to} \cdots P_{k-1}^{\to}.$$

The forward kernels are chosen so that $p_K$ is close to a simple reference distribution $\pi_{\mathsf{ref}}$, such as all-mask tokens, independent uniform tokens, or a product distribution.

The design principle is the same as in continuous diffusion\. The forward process should be simple enough that we can sample from it and know its transition probabilities, and destructive enough that after many steps the distribution is close to a reference law\. The reverse process is then the learned part\. What changes is the nature of corruption: tokens are replaced, masked, or resampled rather than moved by small Gaussian increments\.

### 7.2. Exact reverse kernels and ratio scores

The exact reverse kernel is again obtained from Bayes’ rule. For fixed $x' \in \mathcal{X}$,

$$P_k^{\leftarrow}(x' \to x) = \mathbb{P}(X_k = x \mid X_{k+1} = x') = \frac{p_k(x)\, P_k^{\to}(x \to x')}{p_{k+1}(x')}. \tag{7.2}$$

This is the discrete analogue of ([5.3](https://arxiv.org/html/2607.01693#S5.E3)), in the same $P_k^{\leftarrow}$ notation: there the continuous backward kernel equalled $p_k$ times a Gaussian likelihood, renormalized; here it equals $p_k$ times a transition likelihood $P_k^{\to}(x \to x')$, renormalized.
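The push-forward of the marginal and the Bayes formula for the reverse kernel are both one-line matrix operations. The sketch below uses a random 4-state stochastic matrix and an arbitrary marginal; both are illustrative assumptions, not objects from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4                                           # size of the toy state space
P_fwd = rng.random((n, n)) + 0.1                # strictly positive entries
P_fwd /= P_fwd.sum(axis=1, keepdims=True)       # rows sum to one: a stochastic matrix
p_k = np.array([0.7, 0.2, 0.1, 0.0])            # zero mass is allowed in p_k

p_next = p_k @ P_fwd                            # row-vector push-forward p_{k+1} = p_k P_k
# Bayes rule: P_bwd[x_prime, x] = p_k[x] * P_fwd[x, x_prime] / p_next[x_prime]
P_bwd = (p_k[:, None] * P_fwd).T / p_next[:, None]

print(P_bwd.sum(axis=1))                        # each row is a probability vector
print(p_next @ P_bwd, p_k)                      # reversing the step recovers p_k
```

The second printout is the statement that the exact reverse kernel carries $p_{k+1}$ back to $p_k$; states with $p_k(x) = 0$ are never proposed by the reverse kernel.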

If we knew $p_k$ exactly, ([7.2](https://arxiv.org/html/2607.01693#S7.E2)) would give an exact reverse sampler. In learning, we approximate either the reverse kernel $P_k^{\leftarrow}(x' \to \cdot)$ directly or an object from which it can be computed, such as the posterior distribution of $X_0$ given $X_k$.

The reverse-kernel identity ([7.2](https://arxiv.org/html/2607.01693#S7.E2)) shows exactly what must be learned. The forward transition $P_k^{\to}$ is known by design, but the intermediate law $p_k$ contains information about the data distribution and is unknown. Learning a discrete diffusion model means learning enough about these intermediate laws to simulate the reverse Bayes kernels. One way to represent this missing information is through ratios of the unknown law. Indeed, for two candidate predecessors $x$ and $y$ of the same current state $x'$, the normalizing factor $p_{k+1}(x')$ in ([7.2](https://arxiv.org/html/2607.01693#S7.E2)) cancels:

$$\frac{P_k^{\leftarrow}(x' \to y)}{P_k^{\leftarrow}(x' \to x)} = \frac{p_k(y)\, P_k^{\to}(y \to x')}{p_k(x)\, P_k^{\to}(x \to x')}.$$

Since $P_k^{\to}$ is known, the unknown part is the same-time density ratio

$$\mathsf{s}^\star_k(x, y) = \frac{p_k(y)}{p_k(x)}, \qquad x, y \in \mathcal{X}. \tag{7.3}$$

In continuous space, the score tells us the infinitesimal log-density change when we move from $x$ to $x + \mathrm{d}x$. On a graph or finite set, the ratio $p_k(y)/p_k(x)$ encodes the finite log-density change $\log p_k(y) - \log p_k(x)$ when we move from $x$ to a neighboring state $y$. If the forward kernel permits only local changes, these ratios are enough to specify local reverse transition probabilities.

For many modern discrete diffusion models, however, the denoising posterior is the more convenient target:

$$\mathsf{D}^\star_k(x_0 \mid x) = \mathbb{P}(X_0 = x_0 \mid X_k = x), \qquad x_0 \in \mathcal{X}_0,\ x \in \mathcal{X}.$$

This is the discrete counterpart of the continuous posterior mean $\mathbb{E}[X_0 \mid X_k = x]$ in Tweedie’s identity.

The ratio-score and denoiser views are closely related by Bayes’ rule:

$$\mathsf{D}^\star_k(x_0 \mid x) = \frac{p_{\mathsf{data}}(x_0)\, \mathbb{P}(X_k = x \mid X_0 = x_0)}{p_k(x)}.$$

Thus comparing the posterior probabilities at two noisy states recovers the density ratio $p_k(y)/p_k(x)$ up to the known forward-likelihood ratio. The denoiser therefore contains the ratio-score information, but in a form that is often easier to learn by supervised clean-token prediction.
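The relation can be written out explicitly. As a short worked step, assuming a clean sequence $x_0$ with $p_{\mathsf{data}}(x_0) > 0$ whose forward likelihoods at both noisy states $x$ and $y$ are positive, dividing the Bayes formula at $x$ by the same formula at $y$ gives

$$\frac{\mathsf{D}^\star_k(x_0 \mid x)}{\mathsf{D}^\star_k(x_0 \mid y)} = \frac{\mathbb{P}(X_k = x \mid X_0 = x_0)}{\mathbb{P}(X_k = y \mid X_0 = x_0)} \cdot \frac{p_k(y)}{p_k(x)},$$

so that

$$\mathsf{s}^\star_k(x, y) = \frac{p_k(y)}{p_k(x)} = \frac{\mathsf{D}^\star_k(x_0 \mid x)}{\mathsf{D}^\star_k(x_0 \mid y)} \cdot \frac{\mathbb{P}(X_k = y \mid X_0 = x_0)}{\mathbb{P}(X_k = x \mid X_0 = x_0)}.$$

The second factor on the right is the known forward-likelihood ratio, and the right-hand side does not depend on which admissible $x_0$ is used.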

### 7.3. Masked diffusion

The cleanest example is masked diffusion for token sequences. Let the vocabulary be $\mathcal{V}$, and add a special mask symbol $[\mathsf{M}]$. Here $\mathcal{X}_0 = \mathcal{V}^L$, while the noisy state space is the augmented sequence space

$$\mathcal{X}_{\mathsf{M}} := (\mathcal{V} \cup \{[\mathsf{M}]\})^L.$$

For $x, y \in \mathcal{X}_{\mathsf{M}}$, write $x_i$ and $y_i$ for their $i$th coordinates. Define the forward step by

$$P_k^{\to}(x \to y) = \prod_{i=1}^{L} \begin{cases} 1 - \beta_k, & y_i = x_i,\ x_i \in \mathcal{V}, \\ \beta_k, & y_i = [\mathsf{M}],\ x_i \in \mathcal{V}, \\ 1, & x_i = y_i = [\mathsf{M}], \\ 0, & \text{otherwise}. \end{cases} \tag{7.4}$$

Thus an unmasked token either stays unchanged or becomes masked, and once a token is masked it remains masked. Variants may use coordinate-dependent masking rates, but the same sequence-level form applies.

Let

$$\bar{\alpha}_k = \prod_{s=0}^{k-1} (1 - \beta_s).$$

For each coordinate $i$, conditional on $X_{0,i} = a \in \mathcal{V}$,

$$\mathbb{P}(X_{k,i} = a \mid X_{0,i} = a) = \bar{\alpha}_k, \qquad \mathbb{P}(X_{k,i} = [\mathsf{M}] \mid X_{0,i} = a) = 1 - \bar{\alpha}_k.$$

This is analogous to the Gaussian formula $X_k \mid X_0 \sim \mathcal{N}\!\left(a_k X_0, \sigma_k^2 I\right)$: the scalar attenuation measures how much information about $X_0$ remains.
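The survival probability $\bar{\alpha}_k$ can be checked by simulating the masking chain. This is a minimal sketch: the integer code `MASK = -1`, the schedule `betas`, the vocabulary size, and the sequence length are all illustrative assumptions.

```python
import numpy as np

MASK = -1                                        # hypothetical integer code for [M]

def forward_mask_step(x, beta, rng):
    """One forward step: each unmasked token is masked independently with prob. beta."""
    hit = (x != MASK) & (rng.random(x.shape) < beta)
    return np.where(hit, MASK, x)                # masked tokens stay masked

rng = np.random.default_rng(0)
betas = np.array([0.1, 0.2, 0.3, 0.5])           # illustrative masking schedule
alpha_bar = np.cumprod(1.0 - betas)              # alpha_bar[k-1] = prod_{s<k} (1 - beta_s)

x0 = rng.integers(0, 5, size=(20000, 8))         # 20000 sequences, L = 8, |V| = 5
x = x0.copy()
survival = []
for beta in betas:
    x = forward_mask_step(x, beta, rng)
    survival.append(np.mean(x == x0))            # fraction of tokens still visible

print(np.round(survival, 3), alpha_bar)
```

The empirical survival fractions should track `alpha_bar`, and every coordinate of the final `x` is either the original token or `MASK`, which is the absorbing structure used in the reverse step.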

Masked diffusion is especially transparent because corruption is irreversible in the forward direction\. Once a token is masked, the forward process has forgotten its identity\. The reverse model must therefore use the surrounding context and the learned data distribution to infer plausible clean tokens\. This makes the Bayesian nature of denoising visible without any calculus\. At the coordinate level, the reverse step has only two cases\.

- If $X_{k+1,i} = a \in \mathcal{V}$ is unmasked, then necessarily $X_{k,i} = a$. There is nothing to sample at that coordinate.
- If $X_{k+1,i} = [\mathsf{M}]$, then $X_{k,i}$ may either already be masked or may be the original clean token. The reverse sampler must decide whether to unmask, and if so which token to place.

Specializing the reverse kernel ([7.2](https://arxiv.org/html/2607.01693#S7.E2)) to ([7.4](https://arxiv.org/html/2607.01693#S7.E4)) gives the sequence-level formula

$$P_k^{\leftarrow}(x' \to x) = \frac{p_k(x)}{p_{k+1}(x')}\, P_k^{\to}(x \to x'). \tag{7.5}$$

This is still a coordinatewise masking rule through the known factor $P_k^{\to}(x \to x')$, but the unknown data-dependent weight is the full sequence law $p_k(x)$. Thus the distribution of a newly unmasked token generally depends on the entire partially masked sequence.

### 7.4. Training objective for masked diffusion

The model is usually trained to predict masked clean tokens from a corrupted sequence. Let $M_k \subseteq \{1, \ldots, L\}$ be the set of masked coordinates at time $k$, and let $x_k$ be the masked sequence. A denoising model $\mathsf{D}_k(i, \cdot \mid x_k)$ outputs a distribution over vocabulary tokens for coordinate $i$. The population loss is

$$\mathsf{L}_{\mathsf{mask}}(\mathsf{D}) = \mathbb{E}\left[\sum_{i \in M_k} -\log \mathsf{D}_k(i, X_{0,i} \mid X_k)\right]. \tag{7.6}$$

This is cross-entropy with the clean token as the label. The population minimizer is the coordinate marginal of the posterior,

$$\mathsf{D}^\star_k(i, a \mid x_k) = \mathbb{P}(X_{0,i} = a \mid X_k = x_k).$$

Thus the cross-entropy objective is not an arbitrary language-modeling loss. It is the discrete analogue of denoising score matching. In the Gaussian case, the optimal prediction is a posterior mean and can be converted into a score by Tweedie’s identity. In the masked-token case, the optimal prediction is the full posterior distribution of the hidden clean token.

###### Proof\.

For fixed $(i, x_k, k)$, the conditional contribution to the loss is

$$\sum_{a \in \mathcal{V}} \mathbb{P}(X_{0,i} = a \mid X_k = x_k)\, \left[-\log \mathsf{D}_k(i, a \mid x_k)\right].$$

This is the cross-entropy between the true posterior distribution and the model distribution $\mathsf{D}_k(i, \cdot \mid x_k)$. It is minimized uniquely by setting the model distribution equal to the true posterior. ∎
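The uniqueness claim is Gibbs’ inequality, and it may help to see the step written out. With the shorthand (introduced here only for this remark) $q(a) = \mathbb{P}(X_{0,i} = a \mid X_k = x_k)$ and $\mathsf{D}(a) = \mathsf{D}_k(i, a \mid x_k)$, the conditional loss splits as

$$\sum_{a \in \mathcal{V}} q(a)\left[-\log \mathsf{D}(a)\right] = \underbrace{-\sum_{a \in \mathcal{V}} q(a) \log q(a)}_{\text{entropy of } q} + \underbrace{\sum_{a \in \mathcal{V}} q(a) \log \frac{q(a)}{\mathsf{D}(a)}}_{D_{\mathsf{KL}}(q \,\|\, \mathsf{D})}.$$

The entropy term does not depend on the model, and $D_{\mathsf{KL}}(q \,\|\, \mathsf{D}) \geq 0$ with equality if and only if $\mathsf{D} = q$, which identifies the minimizer.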

### 7.5. Sampling from a masked model

A trained denoiser turns the posterior predictions above into a reverse-time update rule. In the absorbing case the terminal state is the all-mask sequence. Starting from that state, a sampler repeatedly chooses a set $A \subseteq M_k$ of currently masked coordinates and fills those coordinates by drawing

$$\widehat{x}_i \sim \mathsf{D}_k(i, \cdot \mid x_k), \qquad i \in A.$$

Coordinates outside $A$ are left unchanged, unless the chosen algorithm deliberately remasks or resamples them. Thus the schedule is the rule that chooses $A$ and the time label $k$ at each stage of the reverse process.

This viewpoint makes the any-order character of masked diffusion explicit. The model is not tied to a left-to-right factorization: it is trained to denoise from partially masked contexts, so at inference time the sampler may reveal coordinates in any order. This order-agnostic view goes back to the order-agnostic training of NADE \[[45](https://arxiv.org/html/2607.01693#bib.bib45)\], and was tied directly to absorbing/masked diffusion by the autoregressive diffusion models of Hoogeboom et al. \[[28](https://arxiv.org/html/2607.01693#bib.bib28)\]. For an ordering or sequence of blocks $A_1, A_2, \ldots, A_m \subseteq \{1, \ldots, L\}$, the sampler updates the still-masked coordinates in $A_j$ using posterior predictions conditioned on the current partial sequence. Singleton blocks give an autoregressive-like sampler with a chosen order; larger blocks give parallel decoding; adaptive blocks chosen from model confidence give a coarse-to-fine or easy-to-hard reveal schedule.

The reason this freedom is legitimate is that training never fixes an order in the first place. The objective ([7.6](https://arxiv.org/html/2607.01693#S7.E6)) masks a *random* subset of coordinates and asks the model to predict the clean tokens at the masked positions from the unmasked ones. Across training examples, noise levels, and random masks, the masked set $M_k$ ranges over essentially all subsets of $\{1, \ldots, L\}$, so the model is implicitly trained to approximate the entire family of conditionals

$$\mathbb{P}\!\left(X_{0,i} = a \mid X_{k, M_k^c} = x_{M_k^c}\right), \qquad M_k \subseteq \{1, \ldots, L\},\ i \in M_k,$$

of a clean coordinate $i$ given the visible entries of the partially masked sequence. Two features of masked diffusion make this exact. First, the forward process only masks tokens, never alters them, so $X_{k, M_k^c} = X_{0, M_k^c}$: conditioning on the visible part of $X_k$ is the same as conditioning on the corresponding clean values. Second, whether a coordinate is masked is decided independently of the token values, so the masking pattern carries no information beyond which coordinates are observed. Hence the population optimum $\mathsf{D}^\star_k(i, \cdot \mid x_k)$ of ([7.6](https://arxiv.org/html/2607.01693#S7.E6)) is exactly the above conditional.

An inference order is therefore just a rule for choosing which conditional to query next. Revealing coordinate $i$ from the currently observed set $M_k^c$ uses $\mathbb{P}(X_{0,i} \mid X_{k, M_k^c} = x_{M_k^c})$, a member of the family the model already learned. Every order factorizes the joint into conditionals drawn from this one family, which is why any order is supported by the same trained model. The choice of block size, however, is still a sampling choice. If a block contains several coordinates and the sampler draws them independently from the same context, it is using a factorized approximation to the joint conditional for that block. The next two subsections make this distinction precise: first by writing the general KL accounting for approximate reverse kernels, and then by isolating the extra error created by parallel block updates.
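A toy sampler makes the role of the schedule concrete. The sketch below is illustrative, not an implementation from the literature: it assumes a two-coordinate binary data law with made-up probabilities, uses a brute-force `exact_denoiser` in place of a trained network, encodes the mask as `MASK = -1`, and ignores the time label because the exact posterior depends only on the visible entries.

```python
import itertools
import numpy as np

MASK = -1  # hypothetical integer code for the mask symbol [M]

def exact_denoiser(p_data, x):
    """Posterior marginals P(X_0[i] = a | visible entries of x), by enumeration."""
    L, V = p_data.ndim, p_data.shape[0]
    probs = np.zeros((L, V))
    for seq in itertools.product(range(V), repeat=L):
        if all(x[i] == MASK or x[i] == seq[i] for i in range(L)):
            for i in range(L):
                probs[i, seq[i]] += p_data[seq]
    return probs / probs.sum(axis=1, keepdims=True)

def sample(denoiser, L, blocks, rng):
    """Start from the all-mask sequence and reveal the coordinates block by block."""
    x = np.full(L, MASK)
    for A in blocks:
        probs = denoiser(x)                      # one denoiser call per block
        for i in A:                              # every i in A uses the same context
            x[i] = rng.choice(probs.shape[1], p=probs[i])
    return x

# Toy data law on {0, 1}^2 with strongly correlated coordinates (assumed numbers).
p_data = np.array([[0.45, 0.05], [0.05, 0.45]])
denoiser = lambda x: exact_denoiser(p_data, x)
rng = np.random.default_rng(0)

def agreement(blocks, n=4000):
    """Empirical frequency of the event {x[0] == x[1]} under a reveal schedule."""
    draws = [sample(denoiser, 2, blocks, rng) for _ in range(n)]
    return float(np.mean([d[0] == d[1] for d in draws]))

one_at_a_time = agreement([[0], [1]])            # two denoiser calls per sample
one_block = agreement([[0, 1]])                  # one denoiser call per sample
print(one_at_a_time, one_block)
```

Under the data law the two coordinates agree with probability 0.9. Singleton blocks reproduce this up to Monte Carlo error, while the single two-coordinate block draws the coordinates independently from their marginals and loses the correlation.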

### 7.6. Error analysis parallel to the continuous case

A schedule together with a denoiser defines approximate reverse kernels $\widehat{P}_k^{\leftarrow}(x' \to \cdot)$, while the exact kernels are $P_k^{\leftarrow}(x' \to \cdot)$ from ([7.2](https://arxiv.org/html/2607.01693#S7.E2)). Let $\widehat{p}_0$ be the final distribution generated by the approximate reverse chain, and let $\widehat{p}_K$ be the law used to initialize that chain at time $K$. A KL path-space argument gives

$$D_{\mathsf{KL}}(p_0 \,\|\, \widehat{p}_0) \leq D_{\mathsf{KL}}(p_K \,\|\, \widehat{p}_K) + \sum_{k=0}^{K-1} \mathbb{E}_{X_{k+1} \sim p_{k+1}}\, D_{\mathsf{KL}}\!\left(P_k^{\leftarrow}(X_{k+1} \to \cdot) \,\middle\|\, \widehat{P}_k^{\leftarrow}(X_{k+1} \to \cdot)\right). \tag{7.7}$$

This is the exact analogue of the continuous diffusion decomposition:

| continuous diffusion | discrete diffusion | meaning |
| --- | --- | --- |
| $p_K \approx \mathcal{N}(0, I)$ | $p_K \approx \pi_{\mathsf{ref}}$ | initialization at noise |
| $P_k^{\leftarrow} \propto p_k \times \text{Gaussian likelihood}$ | $P_k^{\leftarrow} \propto p_k \times P_k^{\to}$ | exact reverse kernel |
| $\mathsf{s}_k \approx \mathsf{s}^\star_k$ | $\widehat{P}_k^{\leftarrow} \approx P_k^{\leftarrow}$ or $\mathsf{D}_k \approx \mathsf{D}^\star_k$ | learned denoising information |
| $\sum_k \eta_k\, \varepsilon_{k,\mathsf{score}}^2$ | $\sum_k \mathbb{E}\, D_{\mathsf{KL}}(P_k^{\leftarrow} \Vert \widehat{P}_k^{\leftarrow})$ | statistical/model error |

The justification is verbatim the KL telescoping argument of Section [6](https://arxiv.org/html/2607.01693#S6): apply Lemma [6.1](https://arxiv.org/html/2607.01693#S6.Thmtheorem1) to the exact and approximate reverse path laws, then project the paths to their final state using data processing.

The decomposition ([7.7](https://arxiv.org/html/2607.01693#S7.E7)) isolates the same two error sources as in the continuous case: the initialization gap $D_{\mathsf{KL}}(p_K \,\|\, \widehat{p}_K)$ and the per-step reverse-kernel KL summed over the reverse steps. In absorbing masked diffusion, finiteness of these KL terms is already informative. If the sampler starts from the deterministic all-mask state, then $\widehat{p}_K = \delta_{[\mathsf{M}]^L}$, so the initialization term is finite only when the forward terminal law $p_K$ is also supported on the all-mask state. For the independent masking chain ([7.4](https://arxiv.org/html/2607.01693#S7.E4)), this requires $\bar{\alpha}_K = 0$, for instance through a terminal step with $\beta_{K-1} = 1$.

Once the supports are compatible so that the KL terms are finite, the second term in ([7.7](https://arxiv.org/html/2607.01693#S7.E7)) is tied to the quantity the model actually trains on. For masked diffusion the model does not learn the reverse kernel directly; it learns posterior clean-token distributions through the cross-entropy objective ([7.6](https://arxiv.org/html/2607.01693#S7.E6)). Schematically, we have the bound

$$\mathbb{E}\, D_{\mathsf{KL}}\!\left(P_k^{\leftarrow}(X_{k+1} \to \cdot) \,\middle\|\, \widehat{P}_k^{\leftarrow}(X_{k+1} \to \cdot)\right) \lesssim \mathbb{E} \sum_{i \in M_{k+1}} D_{\mathsf{KL}}\!\left(\mathsf{D}^\star_{k+1}(i, \cdot \mid X_{k+1}) \,\middle\|\, \mathsf{D}_{k+1}(i, \cdot \mid X_{k+1})\right).$$

The right-hand side is the cross-entropy regret of the posterior predictions, the part that training controls: it vanishes as the learned posterior $\mathsf{D}_{k+1}$ approaches the true posterior $\mathsf{D}^\star_{k+1}$. It is the *entire* per-step error when the sampler unmasks one coordinate at a time, because each single-coordinate reverse step queries a true conditional (as explained in Subsection [7.5](https://arxiv.org/html/2607.01693#S7.SS5)), so with the exact posterior the approximate reverse chain reproduces the exact one. With a perfect denoiser, single-coordinate decoding is therefore exact in any order. Its drawback is cost: revealing a single coordinate per step takes as many network evaluations as there are coordinates. This raises the question of the next subsection: can we instead reveal several coordinates at once, and what accuracy does that sacrifice?

### 7.7. Parallel masked inference

To avoid spending one network evaluation per coordinate, practical masked samplers often reveal a whole block of coordinates in a single reverse step. This can mean two different things. The simplest implementation is *factorized block decoding*: if $A$ is the set of coordinates to reveal from the current partially observed sequence $x_k$, it samples

$$a_A \sim \prod_{i \in A} \mathsf{D}_k(i, a_i \mid x_k).$$

The exact object, however, is the joint conditional law

$$\mathbb{P}(X_{0,A} = a_A \mid X_k = x_k).$$

Equivalently, for any ordering $A = \{i_1, \ldots, i_m\}$, the chain rule writes this joint conditional as

$$\prod_{\ell=1}^{m} \mathbb{P}\!\left(X_{0,i_\ell} = a_{i_\ell} \mid X_k = x_k,\ X_{0,i_1} = a_{i_1}, \ldots, X_{0,i_{\ell-1}} = a_{i_{\ell-1}}\right).$$

Thus a block can be sampled exactly by sequentially querying updated conditionals, but the product of one-coordinate posteriors from the same context is exact only under conditional independence. The gap between these two laws is the factorization error. The inference schedule therefore sets a speed–accuracy tradeoff: larger blocks shorten the reverse chain but introduce conditional-independence bias, while singleton blocks remove the bias at the cost of as many reverse evaluations as there are coordinates. Lavenant and Zanella \[[31](https://arxiv.org/html/2607.01693#bib.bib31)\] analyze this error directly for masked diffusions with factorized approximations, decomposing the relative-entropy error into learning and factorization terms and relating the optimal block-size schedule to an information profile of the data distribution.
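For a block of two coordinates the factorization error can be computed in closed form. The sketch below assumes a hypothetical 2-by-2 joint conditional (the numbers are invented for illustration) and compares the factorized law with the chain-rule law; for two coordinates the resulting KL divergence is the conditional mutual information of the pair.

```python
import numpy as np

# Hypothetical joint conditional of two masked coordinates given the context:
# rows index the token at the first coordinate, columns the token at the second.
joint = np.array([[0.45, 0.05],
                  [0.05, 0.45]])

m1, m2 = joint.sum(axis=1), joint.sum(axis=0)    # one-coordinate posteriors, same context
factorized = np.outer(m1, m2)                    # law of factorized block decoding

# Sequential decoding: first coordinate from m1, second from the updated conditional.
cond2_given1 = joint / m1[:, None]
sequential = m1[:, None] * cond2_given1          # chain rule recovers the joint

# Factorization error KL(joint || factorized), in nats.
kl = float(np.sum(joint * np.log(joint / factorized)))
print(kl)
```

Here `sequential` coincides with `joint`, so sequential decoding incurs no factorization error, whereas `kl` is strictly positive because the two coordinates are correlated given the context.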

This is not the only possible approach to parallel inference. The block sampler above is a factorized approximation to the joint conditional; a different question is whether one can parallelize the exact sequential conditional-sampling procedure itself. In an oracle model with access to exact conditional marginals of a target law on $\mathcal{V}^L$, Anari, Gao, and Rubinstein \[[5](https://arxiv.org/html/2607.01693#bib.bib5)\] show how to organize the queries to sample arbitrary product-space laws in $\widetilde{O}(L^{2/3})$ parallel time. Anari et al. \[[4](https://arxiv.org/html/2607.01693#bib.bib4)\] improve the parallel time to $\widetilde{O}(L^{1/2})$ using autospeculative rejection sampling. These results are best read as a corrected-parallel counterpart to the simple block update: they do not replace a joint conditional by independent marginals, but instead use parallel queries to simulate the sequential conditionals with a correction.

### 7.8. CTMC setup and learning

The discussion so far used a discrete\-time masked chain because it makes the Bayes rule and the denoising posterior easy to see\. For analyzing practical samplers, however, the sharper language is continuous time: it lets us describe the exact reverse dynamics as one coordinate jump at a time, and then view parallel updates as a numerical approximation to those dynamics\.

This is one reason that most modern discrete diffusion models are phrased not with the stochastic matrices $P_k^{\to}$ but as a *continuous-time Markov chain* (CTMC), the finite-state analogue of the forward SDE. The forward corruption is run as a CTMC whose rate matrix factorizes over coordinates, so each coordinate is noised independently. This has a decisive consequence for the reverse process: at any instant only one coordinate changes, because the probability that two independent coordinates jump in the same infinitesimal interval is $o(\mathrm{d}t)$. The exact reverse dynamics are therefore unambiguous and update a single coordinate at a time, and the parallel-update bias described above reappears in a controlled form, as the error of approximating these dynamics on a finite time grid (*$\tau$-leaping*). Continuous time thus separates the posterior/score error from the discretization error cleanly, produces the cleanest analogue of the score, and admits a flexible family of reverse samplers \[[10](https://arxiv.org/html/2607.01693#bib.bib10),[36](https://arxiv.org/html/2607.01693#bib.bib36)\].

A CTMC on a finite set $\mathcal{X}$ is generated by a time-dependent forward *rate matrix* $R_t^{\to}$ with

$$R_t^{\to}(x,y)\geq 0\ \ (y\neq x),\qquad R_t^{\to}(x,x)=-\sum_{y\neq x}R_t^{\to}(x,y).$$

Thus the off-diagonal entries are jump rates, and the diagonal entry makes each row sum to zero. Over an infinitesimal step,

$$\mathbb{P}(X_{t+\mathrm{d}t}=y\mid X_t=x)=\begin{cases}R_t^{\to}(x,y)\,\mathrm{d}t+o(\mathrm{d}t),&y\neq x,\\ 1+R_t^{\to}(x,x)\,\mathrm{d}t+o(\mathrm{d}t),&y=x.\end{cases}$$

Equivalently,

$$\mathbb{P}(X_{t+\mathrm{d}t}=y\mid X_t=x)=\delta_{xy}+R_t^{\to}(x,y)\,\mathrm{d}t+o(\mathrm{d}t).$$

If $p_t(x)=\mathbb{P}(X_t=x)$ is written as a row vector, then its evolution is

$$\partial_t p_t=p_tR_t^{\to},\qquad\partial_t p_t(y)=\sum_{x\in\mathcal{X}}p_t(x)R_t^{\to}(x,y).\tag{7.8}$$

This is the finite-state counterpart of the Fokker–Planck equation ([1.2](https://arxiv.org/html/2607.01693#S1.E2)): a rate matrix replaces the drift-and-diffusion operator acting on continuous densities.
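As a quick numerical illustration of the forward equation (7.8), the following Python sketch integrates $\partial_t p_t=p_tR^{\to}$ with explicit Euler steps. Everything specific here is an assumption made for illustration: a time-homogeneous uniform generator on a five-state space, and the helper names `uniform_rate_matrix` and `evolve_marginal`, which do not come from the notes.

```python
import numpy as np

def uniform_rate_matrix(n: int) -> np.ndarray:
    """Illustrative generator: jump to each other state at rate 1/n."""
    R = np.full((n, n), 1.0 / n)
    np.fill_diagonal(R, 0.0)
    np.fill_diagonal(R, -R.sum(axis=1))  # diagonal makes every row sum to zero
    return R

def evolve_marginal(p0: np.ndarray, R: np.ndarray, T: float, steps: int) -> np.ndarray:
    """Explicit Euler integration of d/dt p_t = p_t R, with p_t a row vector."""
    p = p0.astype(float).copy()
    dt = T / steps
    for _ in range(steps):
        p = p + dt * (p @ R)
    return p

n = 5
R = uniform_rate_matrix(n)
p0 = np.zeros(n)
p0[0] = 1.0                                # start from a point mass
pT = evolve_marginal(p0, R, T=8.0, steps=8000)
print(pT, pT.sum())                        # close to uniform, total mass stays 1
```

Because every row of the rate matrix sums to zero, each Euler step preserves total mass, and for this generator the marginal relaxes toward the uniform law at rate one.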

The rate matrix is also where the corruption family is specified. An *absorbing* (masked) generator sends each token to $[\mathsf{M}]$ at a schedule-dependent rate and is the continuous-time version of the masked diffusion of the previous subsections; a *uniform* generator pushes each token toward the uniform distribution over $\mathcal{V}$; other choices encode structured token graphs. Masked diffusion is therefore a special case of this CTMC picture.

The reverse process is again a CTMC, and its rates are fixed by a Bayes-ratio formula that is the time-continuous version of the reverse kernel ([7.2](https://arxiv.org/html/2607.01693#S7.E2)). For $y\neq x$,

$$R_t^{\leftarrow}(y,x)=R_t^{\to}(x,y)\,\frac{p_t(x)}{p_t(y)}.\tag{7.9}$$

With the diagonal chosen so that rows sum to zero, these rates are exactly the ones that track the same marginal in reverse time. Indeed, for each $y$ with $p_t(y)>0$,

$$\begin{aligned}(p_tR_t^{\leftarrow})(y)&=\sum_{x\neq y}p_t(x)R_t^{\leftarrow}(x,y)-p_t(y)\sum_{x\neq y}R_t^{\leftarrow}(y,x)\\&=\sum_{x\neq y}p_t(x)R_t^{\to}(y,x)\frac{p_t(y)}{p_t(x)}-p_t(y)\sum_{x\neq y}R_t^{\to}(x,y)\frac{p_t(x)}{p_t(y)}\\&=p_t(y)\sum_{x\neq y}R_t^{\to}(y,x)-\sum_{x\neq y}p_t(x)R_t^{\to}(x,y)\\&=-\partial_tp_t(y),\end{aligned}$$

where the last equality is the forward equation ([7.8](https://arxiv.org/html/2607.01693#S7.E8)) with the diagonal entry $R_t^{\to}(y,y)=-\sum_{x\neq y}R_t^{\to}(y,x)$ written out. Thus $-\partial_tp_t=p_tR_t^{\leftarrow}$, which is the forward equation for the chain when it is simulated from time $T$ down to time $0$. Starting the reverse CTMC from $p_T$ therefore gives marginal $p_t$ at every intermediate time. As in the continuous case, the forward rates $R_t^{\to}$ are known by design and the only unknown ingredient is the collection of same-time density ratios over neighboring states. We write the exact score as

$$\mathsf{s}^{\star}_t(x,y):=\frac{p_t(y)}{p_t(x)},\qquad y\neq x,\ R_t^{\to}(y,x)>0.$$

This is the continuous-time form of the ratio score ([7.3](https://arxiv.org/html/2607.01693#S7.E3)), the discrete analogue of $\nabla\log p_t$ that records the finite relative change of the noised law along each admissible move. Let $\mathsf{s}_t(x,y)>0$ be the learned prediction, $\mathsf{s}_t(x,y)\approx\mathsf{s}^{\star}_t(x,y)$. The orientation is chosen for reverse sampling: when the reverse chain is at $x$, a candidate predecessor $y$ is admissible exactly when the forward CTMC can jump from $y$ to $x$. With this convention the learned reverse rate is

$$\widehat{R}_t^{\leftarrow}(x,y)=R_t^{\to}(y,x)\,\mathsf{s}_t(x,y),\qquad y\neq x.$$

At first the training target looks inaccessible, since the ratio contains the unknown marginal law $p_t$. The key denoising identity is that this unknown ratio can be written as a posterior average of known forward likelihood ratios:

$$\mathsf{s}^{\star}_t(x,y)=\mathbb{E}\!\left[\frac{\mathbb{P}(X_t=y\mid X_0)}{\mathbb{P}(X_t=x\mid X_0)}\,\middle|\,X_t=x\right].\tag{7.10}$$

Thus training can sample clean data $X_0$, corrupt it to $X_t=x$, and use the known forward likelihood ratio inside the loss.
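Both the reverse-rate formula (7.9) and the denoising identity (7.10) can be checked numerically on a small example. The sketch below is only illustrative and rests on stated assumptions: a uniform forward generator $R^{\to}=U-I$ on four states (with $U$ the matrix whose entries all equal $1/n$), for which the transition kernel has the closed form $e^{-t}I+(1-e^{-t})U$, and a made-up data law `p0`.

```python
import numpy as np

n, t = 4, 0.7
U = np.full((n, n), 1.0 / n)
R_fwd = U - np.eye(n)                                  # forward rates; rows sum to zero
P0t = np.exp(-t) * np.eye(n) + (1 - np.exp(-t)) * U    # P0t[x0, x] = P(X_t = x | X_0 = x0)

p0 = np.array([0.5, 0.3, 0.15, 0.05])                  # hypothetical data law
pt = p0 @ P0t                                          # noised marginal p_t

# Exact ratio score: s_star[x, y] = p_t(y) / p_t(x).
s_star = pt[None, :] / pt[:, None]

# Reverse rates R_rev[x, y] = R_fwd[y, x] * s_star[x, y] for y != x,
# with the diagonal chosen so that rows sum to zero.
R_rev = R_fwd.T * s_star
np.fill_diagonal(R_rev, 0.0)
np.fill_diagonal(R_rev, -R_rev.sum(axis=1))

# Check of (7.9): p_t R_rev should equal -d/dt p_t = -(p_t R_fwd).
lhs = pt @ R_rev
rhs = -(pt @ R_fwd)

# Check of (7.10): posterior average of forward likelihood ratios.
posterior = (p0[:, None] * P0t) / pt[None, :]          # posterior[x0, x] = P(X_0 = x0 | X_t = x)
ratio = P0t[:, None, :] / P0t[:, :, None]              # ratio[x0, x, y] = P(y | x0) / P(x | x0)
s_denoise = np.einsum("ax,axy->xy", posterior, ratio)

print(np.max(np.abs(lhs - rhs)), np.max(np.abs(s_denoise - s_star)))
```

Both printed discrepancies should be at the level of floating-point round-off, since the two identities are exact.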

Score-entropy training \[[36](https://arxiv.org/html/2607.01693#bib.bib36)\] turns this identity into a supervised loss for positive ratios by first choosing a scalar discrepancy for one reverse edge. Let $\mathsf{s}^{\star}>0$ denote the exact ratio, let $\mathsf{s}>0$ be the learned prediction, and set $F(u)=-\log u$ on $u>0$. Score entropy uses the scaled Bregman divergence

$$\mathsf{s}^{\star}D_F(\mathsf{s},\mathsf{s}^{\star})=\mathsf{s}^{\star}\bigl(F(\mathsf{s})-F(\mathsf{s}^{\star})-F'(\mathsf{s}^{\star})(\mathsf{s}-\mathsf{s}^{\star})\bigr)=\mathsf{s}-\mathsf{s}^{\star}\log\mathsf{s}+\mathsf{s}^{\star}\log\mathsf{s}^{\star}-\mathsf{s}^{\star}.$$

This quantity is nonnegative, has derivative $1-\mathsf{s}^{\star}/\mathsf{s}$ with respect to $\mathsf{s}$, and is minimized at $\mathsf{s}=\mathsf{s}^{\star}$; note that it is defined only for positive ratios, so the learned $\mathsf{s}_t(x,y)$ must be constrained to remain positive. In the learning objective the terms $\mathsf{s}^{\star}\log\mathsf{s}^{\star}-\mathsf{s}^{\star}$ are independent of the learned score. Thus the model-dependent part of one edge is $\mathsf{s}_t(x,y)-\mathsf{s}^{\star}_t(x,y)\log\mathsf{s}_t(x,y)$. In training, the unknown ratio $\mathsf{s}^{\star}_t(x,y)$ is replaced by the computable forward likelihood ratio from the noising process. A schematic denoising score-entropy loss is

$$\mathbb{E}_{t,X_0}\sum_{x\in\mathcal{X}}\mathbb{P}(X_t=x\mid X_0)\sum_{y:\,R_t^{\to}(y,x)>0}w_t(x,y)\left[\mathsf{s}_t(x,y)-\frac{\mathbb{P}(X_t=y\mid X_0)}{\mathbb{P}(X_t=x\mid X_0)}\log\mathsf{s}_t(x,y)\right],\tag{7.11}$$

up to terms independent of the learned score. Here $w_t(x,y)\geq 0$ is a chosen weight, often the incoming forward rate $R_t^{\to}(y,x)$ or a multiple of it.
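The per-edge score-entropy term is easy to evaluate directly. The short sketch below (with illustrative names `score_entropy` and `model_part`, and an arbitrary target ratio of 2.5) scans a grid of positive predictions and confirms that the scaled Bregman divergence is nonnegative and minimized at the exact ratio, and that dropping the score-independent constant does not move the minimizer.

```python
import numpy as np

def score_entropy(s, s_star):
    """Scaled Bregman divergence s - s* log s + s* log s* - s* for positive ratios."""
    return s - s_star * np.log(s) + s_star * np.log(s_star) - s_star

def model_part(s, s_star):
    """Model-dependent part of the per-edge loss, s - s* log s."""
    return s - s_star * np.log(s)

s_star = 2.5                                   # arbitrary exact ratio for the demo
grid = np.linspace(0.05, 10.0, 2000)           # candidate positive predictions
values = score_entropy(grid, s_star)
argmin_full = grid[np.argmin(values)]
argmin_model = grid[np.argmin(model_part(grid, s_star))]
print(values.min(), argmin_full, argmin_model)
```

In a trained model the positivity constraint is usually enforced by parameterizing the ratio through an exponential or softplus output, which is a design choice outside the scope of this sketch.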

The reason this objective has the correct population target is exactly the denoising identity ([7.10](https://arxiv.org/html/2607.01693#S7.E10)) above. Since the displayed loss is affine in the likelihood ratio appearing in ([7.10](https://arxiv.org/html/2607.01693#S7.E10)), averaging over the unknown clean data simply replaces it by its conditional mean. The resulting edgewise risk is, up to constants, $\mathsf{s}_t(x,y)-\mathsf{s}^{\star}_t(x,y)\log\mathsf{s}_t(x,y)$, and is minimized at the exact score $\mathsf{s}_t(x,y)=\mathsf{s}^{\star}_t(x,y)$ on each reverse-admissible edge.

The special choice $w_t(x,y)=R_t^{\to}(y,x)$ gives this denoising loss its sampling interpretation. On the edge from the current state $x$ back to a possible predecessor $y$, the true reverse rate is $R_t^{\to}(y,x)\,\mathsf{s}^{\star}_t(x,y)$, while the learned reverse rate is $R_t^{\to}(y,x)\,\mathsf{s}_t(x,y)$. The corresponding per-edge contribution to the CTMC path-space relative entropy is (as will be shown in ([7.12](https://arxiv.org/html/2607.01693#S7.E12)))

$$R_t^{\to}(y,x)\left[\mathsf{s}_t(x,y)-\mathsf{s}^{\star}_t(x,y)+\mathsf{s}^{\star}_t(x,y)\log\frac{\mathsf{s}^{\star}_t(x,y)}{\mathsf{s}_t(x,y)}\right].$$

After dropping terms independent of the learned score, this is exactly the objective in ([7.11](https://arxiv.org/html/2607.01693#S7.E11)) with $w_t(x,y)=R_t^{\to}(y,x)$. Thus the rate-weighted choice matches the path-space error quantity used in the CTMC sampling analysis below.

### 7.9. CTMC sampling

Once the learned ratios have specified the reverse rates, the remaining problem is numerical simulation of the resulting jump process. $\tau$-leaping is a standard acceleration of the stochastic simulation algorithm for chemical reaction networks, introduced by Gillespie \[[24](https://arxiv.org/html/2607.01693#bib.bib24)\]. The idea is to group many small jumps over a short interval, while pretending that the jump rates are essentially constant during that interval.

For the learned reverse CTMC, keep the same forward-time grid convention as before,

$$0=t_0<t_1<\cdots<t_K=T,\qquad h_k=t_{k+1}-t_k.$$

The reverse sampler moves from $t_{k+1}$ down to $t_k$. If the current state is $x$ at time $t_{k+1}$, the simplest $\tau$-leap freezes the learned reverse rates at $(t_{k+1},x)$ and uses the first-order kernel

$$\mathbb{P}(\widehat{X}_{t_k}=z\mid\widehat{X}_{t_{k+1}}=x)=h_k\,\widehat{R}_{t_{k+1}}^{\leftarrow}(x,z)+O(h_k^2),\qquad z\neq x,$$

with the remaining probability assigned to staying at $x$. Equivalently, for product spaces one draws independent Poisson clocks $N_z\sim\mathsf{Poisson}\bigl(h_k\widehat{R}_{t_{k+1}}^{\leftarrow}(x,z)\bigr)$ for the admissible local moves and applies the proposed coordinate changes in parallel, with a fixed tie-breaking convention if several incompatible clocks ring. This is the discrete analogue of an Euler–Maruyama step: it replaces the time-varying reverse generator by a frozen one and, in parallel implementations, introduces a local independence error by allowing several coordinates to update during one step. We refer to \[[21](https://arxiv.org/html/2607.01693#bib.bib21), Chapter 13\] for a more detailed discussion of $\tau$-leaping and related algorithms.
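The first-order kernel can be sketched in a few lines of Python. The example is a toy under explicit assumptions, not the samplers used in practice: a single four-state coordinate (so there is no parallel-update issue), a uniform forward generator $U-I$, and exact reverse rates computed from a made-up data law `p0`, so that only the frozen-rate discretization error remains. The helper names are hypothetical.

```python
import numpy as np

def reverse_rate_matrix(t, p0):
    """Exact reverse rates for the toy uniform forward generator U - I."""
    n = len(p0)
    pt = 1.0 / n + np.exp(-t) * (p0 - 1.0 / n)      # closed-form noised marginal
    R = (1.0 / n) * pt[None, :] / pt[:, None]       # R[x, z] = R_fwd(z, x) p_t(z) / p_t(x)
    np.fill_diagonal(R, 0.0)
    np.fill_diagonal(R, -R.sum(axis=1))
    return R

def tau_leap_kernel(h, R):
    """First-order kernel: jump to z != x with probability h R[x, z], otherwise stay."""
    K = np.eye(R.shape[0]) + h * R
    if K.min() < 0.0:
        raise ValueError("step too large: h times the total outgoing rate exceeds 1")
    return K

def tau_leap_sample(x, grid, rate_fn, rng):
    """Simulate one path from t_K down to t_0, freezing rates at t_{k+1}."""
    for k in range(len(grid) - 2, -1, -1):
        h = grid[k + 1] - grid[k]
        row = tau_leap_kernel(h, rate_fn(grid[k + 1]))[x]
        x = rng.choice(len(row), p=row / row.sum())
    return x

p0 = np.array([0.5, 0.3, 0.15, 0.05])               # hypothetical data law
grid = np.linspace(0.0, 6.0, 601)                   # forward-time grid with h = 0.01
rate_fn = lambda t: reverse_rate_matrix(t, p0)

# Push the whole law through the same kernels to measure the error deterministically.
q = np.full(4, 0.25)                                # initialize from the uniform law
for k in range(len(grid) - 2, -1, -1):
    q = q @ tau_leap_kernel(grid[k + 1] - grid[k], rate_fn(grid[k + 1]))
tv_error = 0.5 * np.abs(q - p0).sum()

rng = np.random.default_rng(0)
x0 = tau_leap_sample(int(rng.integers(4)), grid, rate_fn, rng)
print(tv_error, x0)
```

Propagating the law with matrix products, rather than averaging many sampled paths, isolates the initialization and discretization error without Monte Carlo noise; halving the step size should roughly halve the discretization part.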

$\tau$-leaping is convenient, but it is still an approximation because the rates are frozen over a whole interval. A closely related exact construction is *uniformization*, originally introduced by Grassmann \[[25](https://arxiv.org/html/2607.01693#bib.bib25)\], applied to chemical reaction simulation by Beentjes and Baker \[[8](https://arxiv.org/html/2607.01693#bib.bib8)\], and applied to discrete diffusion models by Chen and Ying \[[13](https://arxiv.org/html/2607.01693#bib.bib13)\].

Let us describe how uniformization works. On the interval $[t_k,t_{k+1}]$, choose a clock rate

$$\Lambda_k\geq\sup_{\substack{s\in[t_k,t_{k+1}]\\ x\in\mathcal{X}}}\ \sum_{z\neq x}\widehat{R}_s^{\leftarrow}(x,z).$$

Starting from the state at time $t_{k+1}$, draw Poisson event times in $[t_k,t_{k+1}]$ with rate $\Lambda_k$ and process them in decreasing time. At an event time $s$, if the current state is $x$, jump to $z\neq x$ with probability $\widehat{R}_s^{\leftarrow}(x,z)/\Lambda_k$ and otherwise stay at $x$. The stay-put events are virtual jumps. Since the clock dominates all outgoing rates, this thinning construction has exactly the jump law of the learned reverse CTMC on the interval, without a frozen-rate approximation.
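The thinning construction can be sketched as follows. As before this is a toy under stated assumptions: one four-state coordinate, a uniform forward generator, exact reverse rates from a made-up data law `p0`, a single interval $[0,5]$, and a hand-picked constant clock rate that dominates the largest outgoing rate (4.75 at time zero for this `p0`). The function names are illustrative.

```python
import numpy as np

def toy_reverse_rates(s, x, p0):
    """Reverse rates out of state x at time s for the uniform forward generator U - I."""
    n = len(p0)
    ps = 1.0 / n + np.exp(-s) * (p0 - 1.0 / n)      # closed-form noised marginal
    rates = (1.0 / n) * ps / ps[x]                   # R_fwd(z, x) p_s(z) / p_s(x)
    rates[x] = 0.0
    return rates

def uniformization_interval(x, t_lo, t_hi, rates_from, clock_rate, rng):
    """Simulate the reverse CTMC from t_hi down to t_lo by Poisson thinning."""
    n_events = rng.poisson(clock_rate * (t_hi - t_lo))
    event_times = np.sort(rng.uniform(t_lo, t_hi, size=n_events))[::-1]  # decreasing time
    for s in event_times:
        probs = rates_from(s, x) / clock_rate
        stay = 1.0 - probs.sum()
        if stay < 0.0:
            raise ValueError("clock_rate does not dominate the outgoing rate")
        probs[x] = stay                              # virtual jump: the chain stays at x
        x = rng.choice(len(probs), p=probs)
    return x

p0 = np.array([0.5, 0.3, 0.15, 0.05])               # hypothetical data law
rng = np.random.default_rng(1)
rates_from = lambda s, x: toy_reverse_rates(s, x, p0)

samples = [
    uniformization_interval(int(rng.integers(4)), 0.0, 5.0, rates_from, 5.0, rng)
    for _ in range(1000)
]
empirical = np.bincount(samples, minlength=4) / len(samples)
print(empirical)                                     # should resemble p0 up to sampling noise
```

The rates are evaluated at the actual event time, so no frozen-rate error is introduced; the price is the number of Poisson events, which grows with the dominating clock rate.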

### 7.10. Error analysis for CTMC samplers

Having described the two CTMC simulation schemes, we now ask how their output laws differ from the desired data law. The numerical analysis can be organized in the same way as in Section [6](https://arxiv.org/html/2607.01693#S6). There are three contributions. First, the exact reverse process should start from $p_T$, whereas the sampler is initialized from some implementable law $\widehat{p}_T$. This gives the common initialization gap. Define

$$\varepsilon_{\mathsf{init}}^2:=D_{\mathsf{KL}}(p_T\,\|\,\widehat{p}_T).$$

Two common choices of $\widehat{p}_T$ should be interpreted differently. If $\widehat{p}_T$ is a point mass, then $\varepsilon_{\mathsf{init}}^2$ is mainly a support condition. In absorbing masked diffusion, for instance, one often takes $\widehat{p}_T=\delta_{[\mathsf{M}]^L}$; the KL is finite only if the forward terminal law $p_T$ is also supported on the all-mask state. With a finite integrated masking rate this may fail, so one should either force exact terminal absorption or measure the endpoint error in a weaker metric.

If $\widehat{p}_T$ is full-support, the same term is an ordinary initialization error. For example, if $\widehat{p}_T$ is uniform and the forward CTMC contracts KL to $\widehat{p}_T$ at rate $\rho$, then

$$\varepsilon_{\mathsf{init}}^2\leq e^{-\rho T}D_{\mathsf{KL}}(p_0\,\|\,\widehat{p}_T)\leq e^{-\rho T}\log|\mathcal{X}|.$$

On product token spaces $\mathcal{X}=\mathcal{V}^L$, this term is of order $L(\log|\mathcal{V}|)e^{-T}$ for the usual independent-coordinate corruption.

After initialization, fix a simulator and first imagine running that simulator with the true reverse rates, and only afterwards replace the true ratio score by the learned ratio score:

$$p_0\quad\longrightarrow\quad p_0^{\mathsf{num},\ast}\quad\longrightarrow\quad\widehat{p}_0^{\mathsf{num}}.$$

Here $p_0^{\mathsf{num},\ast}$ denotes the output law of the chosen numerical simulator when it is run with the true reverse rates. The first comparison isolates numerical error, while the second comparison isolates score-estimation error. Strictly speaking, KL has no triangle inequality, so the rigorous proof usually performs this split on path space, or directly in the log-intensity integrand, rather than by adding two marginal KL divergences. Nevertheless, this bookkeeping is useful for organizing the error analysis.

The score-estimation term has a path-space form analogous to Girsanov's theorem for diffusions. We write it in the same epsilon-squared convention as the continuous score error:

$$\varepsilon_{\mathsf{SE}}^2:=\int_0^T\mathbb{E}_{X_t\sim p_t}\sum_{y:\,R_t^{\to}(y,X_t)>0}R_t^{\to}(y,X_t)\left[\mathsf{s}_t(X_t,y)-\mathsf{s}^{\star}_t(X_t,y)+\mathsf{s}^{\star}_t(X_t,y)\log\frac{\mathsf{s}^{\star}_t(X_t,y)}{\mathsf{s}_t(X_t,y)}\right]\mathrm{d}t.\tag{7.12}$$

This is the same score-entropy Bregman loss that appears in the learning objective ([7.11](https://arxiv.org/html/2607.01693#S7.E11)), which explains why it is a natural training loss.

###### Proof of ([7.12](https://arxiv.org/html/2607.01693#S7.E12)).

Compare the exact reverse CTMC with rates $R_t^{\leftarrow}(x,y)=R_t^{\to}(y,x)\,\mathsf{s}^{\star}_t(x,y)$ to the learned reverse CTMC with rates $\widehat{R}_t^{\leftarrow}(x,y)=R_t^{\to}(y,x)\,\mathsf{s}_t(x,y)$, and assume the learned rates are positive wherever the exact rates are positive. Over a small interval of length $\mathrm{d}t$, condition on the current state $x$. Using the diagonal for the stay-put probability, the one-step KL is

$$\begin{aligned}&\sum_{y\neq x}R_t^{\leftarrow}(x,y)\,\mathrm{d}t\,\log\frac{R_t^{\leftarrow}(x,y)\,\mathrm{d}t}{\widehat{R}_t^{\leftarrow}(x,y)\,\mathrm{d}t}+\bigl(1+R_t^{\leftarrow}(x,x)\,\mathrm{d}t\bigr)\log\frac{1+R_t^{\leftarrow}(x,x)\,\mathrm{d}t}{1+\widehat{R}_t^{\leftarrow}(x,x)\,\mathrm{d}t}+o(\mathrm{d}t)\\&\qquad=\mathrm{d}t\sum_{y\neq x}\left[\widehat{R}_t^{\leftarrow}(x,y)-R_t^{\leftarrow}(x,y)+R_t^{\leftarrow}(x,y)\log\frac{R_t^{\leftarrow}(x,y)}{\widehat{R}_t^{\leftarrow}(x,y)}\right]+o(\mathrm{d}t).\end{aligned}$$

Substituting the two reverse rates cancels the common factor $R_t^{\to}(y,x)$ inside the logarithm and gives

$$\sum_{y:\,R_t^{\to}(y,x)>0}R_t^{\to}(y,x)\left[\mathsf{s}_t(x,y)-\mathsf{s}^{\star}_t(x,y)+\mathsf{s}^{\star}_t(x,y)\log\frac{\mathsf{s}^{\star}_t(x,y)}{\mathsf{s}_t(x,y)}\right]\mathrm{d}t+o(\mathrm{d}t).$$

The exact reverse process has marginal $p_t$ at noise time $t$. Averaging over $X_t\sim p_t$ and integrating over $t\in[0,T]$ gives ([7.12](https://arxiv.org/html/2607.01693#S7.E12)); if the two reverse chains start from different laws, the additional contribution is the initialization KL error. ∎
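The one-step expansion used in the proof can be checked numerically. The sketch below takes arbitrary positive rate vectors for the exact and learned chains out of a fixed state (the numbers are made up), forms the two one-step laws over a short interval, and compares the KL divergence per unit time with the Bregman-type sum.

```python
import numpy as np

def one_step_kl(r, r_hat, dt):
    """KL between the one-step laws [stay, jump_1, ..., jump_m] of the two chains."""
    p = np.concatenate([[1.0 - r.sum() * dt], r * dt])
    q = np.concatenate([[1.0 - r_hat.sum() * dt], r_hat * dt])
    return float(np.sum(p * np.log(p / q)))

def bregman_rate(r, r_hat):
    """Limit of the one-step KL divided by dt: sum of r_hat - r + r log(r / r_hat)."""
    return float(np.sum(r_hat - r + r * np.log(r / r_hat)))

r = np.array([0.4, 1.2, 0.7])        # exact outgoing rates (illustrative values)
r_hat = np.array([0.5, 1.0, 0.9])    # learned outgoing rates (illustrative values)
dt = 1e-5

lhs = one_step_kl(r, r_hat, dt) / dt
rhs = bregman_rate(r, r_hat)
print(lhs, rhs)                      # the two numbers agree up to an O(dt) correction
```

The stay-put term contributes the linear part $\sum_y(\widehat{R}-R)$, which is why the limit is a Bregman divergence rather than a plain weighted log-ratio.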

For $\tau$-leaping, the exact-score numerical law is not the true reverse CTMC law. On each interval $[t_k,t_{k+1}]$ it replaces the time-dependent true reverse rates by the frozen rates $R_{t_{k+1}}^{\leftarrow}$. Replacing these frozen true rates by $\widehat{R}_{t_{k+1}}^{\leftarrow}$ then adds the score-estimation part. Let $\bar{h}$ denote the mesh parameter controlling the reverse step sizes, and let $\bar{R}$ be an upper bound on the total outgoing reverse rate, $\sum_{z\neq x}R_s^{\leftarrow}(x,z)\leq\bar{R}$ on $[0,T]$. A stochastic-integral analysis of the combined comparison gives, under bounded-rate and regularity assumptions, a bound of the schematic form

$$D_{\mathsf{KL}}\!\left(p_0\,\middle\|\,\widehat{p}_0^{\tau}\right)\lesssim\varepsilon_{\mathsf{init}}^2+\varepsilon_{\mathsf{SE}}^2+\bar{R}^{\,2}\,\bar{h}\,T.\tag{7.13}$$

The three terms are the initialization error, the score-estimation error, and the exact-score $\tau$-leaping discretization error.

For product token spaces $\mathcal{X}=\mathcal{V}^L$, more recent work gives a sharper version of the same message with explicit dependence on the vocabulary size. If the learned ratios are clipped to $[B^{-1},B]$ and the reverse grid is controlled by mesh parameter $\bar{h}$, then the standard $\tau$-leaping sampler obeys, suppressing logarithmic factors and endpoint regularity constants,

$$D_{\mathsf{KL}}\!\left(p_0\,\middle\|\,\widehat{p}_0^{\tau}\right)\lesssim\varepsilon_{\mathsf{init}}^2+\varepsilon_{\mathsf{SE}}^2+\bar{h}\,L^2|\mathcal{V}|\,T,\tag{7.14}$$

as in the analysis of Liang et al. \[[33](https://arxiv.org/html/2607.01693#bib.bib33)\]. For the usual independent-coordinate corruption one may substitute $\varepsilon_{\mathsf{init}}^2\lesssim L(\log|\mathcal{V}|)e^{-T}$. Thus making the $\tau$-leaping discretization error of order $\varepsilon$ requires $\bar{h}$ of order $\widetilde{O}\bigl(\varepsilon/(L^2|\mathcal{V}|T)\bigr)$, and hence a deterministic grid with roughly $\widetilde{O}(L^2|\mathcal{V}|/\varepsilon)$ reverse steps in this analysis.

We can achieve high-accuracy sampling using uniformization. With the true ratio score, it simulates the true reverse CTMC exactly, so $p_0^{\mathsf{uni},\ast}=p_0$ if the reverse process is initialized from $p_T$. Thus, for the learned score, the error estimate of uniformization reduces to

$$D_{\mathsf{KL}}\!\left(p_0\,\middle\|\,\widehat{p}_0^{\mathrm{uni}}\right)\leq\varepsilon_{\mathsf{init}}^2+\varepsilon_{\mathsf{SE}}^2.\tag{7.15}$$

On product token spaces $\mathcal{X}=\mathcal{V}^L$ with the usual independent-coordinate corruption, one may substitute $\varepsilon_{\mathsf{init}}^2\lesssim L(\log|\mathcal{V}|)e^{-T}$ in ([7.15](https://arxiv.org/html/2607.01693#S7.E15)), giving $D_{\mathsf{KL}}(p_0\,\|\,\widehat{p}_0^{\mathrm{uni}})\lesssim L(\log|\mathcal{V}|)e^{-T}+\varepsilon_{\mathsf{SE}}^2$. Choosing $T\asymp\log\bigl(L\log|\mathcal{V}|/\varepsilon^2\bigr)$ and learning the ratio score so that $\varepsilon_{\mathsf{SE}}^2\lesssim\varepsilon^2$ gives $D_{\mathsf{KL}}(p_0\,\|\,\widehat{p}_0^{\mathrm{uni}})\lesssim\varepsilon^2$. Chen and Ying \[[13](https://arxiv.org/html/2607.01693#bib.bib13)\] further show that, with adaptive dominating rates, the expected number of Poisson events is nearly linear in the hypercube dimension. Their theorem is stated for $\{0,1\}^d$; for an alphabet of size $|\mathcal{V}|$, a fixed binary encoding uses $q=\lceil\log_2|\mathcal{V}|\rceil$ bits per token, so $d=Lq$. In token notation this gives an expected event count $\widetilde{O}\bigl(L\log|\mathcal{V}|\bigr)$ for $\varepsilon^2$ KL error.

## 8. Guidance, Reward Tilting, and Inference-Time RL

So far we have built diffusion samplers, continuous and discrete, that aim to reproduce the data distribution, or a slightly smoothed version of it. In many applications one instead wants a controlled modification of that base sampler: generate samples satisfying a condition, prefer high-reward outputs, or adapt a model at inference time. Guidance, reward tilting, and inference-time RL are all mechanisms for biasing the base model toward preferred outputs while staying close to the pretrained distribution, and their common mathematical language is the KL-regularized change of measure developed in this section.

### 8.1. Reward-tilted targets and KL-regularized optimization

Let $p_0$ denote the clean-output distribution produced by the pretrained sampler; we keep the diffusion convention that clean samples are written as $x_0$. At this point we specify only the desired law on clean outputs; the corresponding reverse-time dynamics will be derived below. Given a reward $r:\mathbb{R}^d\to\mathbb{R}$ and inverse temperature $\beta\geq 0$, define the exponentially tilted target

$$p_0^{\beta}(x_0)=\frac{1}{Z_\beta}\,p_0(x_0)\,e^{\beta r(x_0)}.\tag{8.1}$$

Here

$$Z_\beta=\int p_0(x_0)\,e^{\beta r(x_0)}\,\mathrm{d}x_0=\mathbb{E}_{X_0\sim p_0}e^{\beta r(X_0)}$$

is the normalizing constant, assumed finite. Equivalently, for any test function $f$,

$$\mathbb{E}_{p_0^{\beta}}f(X_0)=\frac{\mathbb{E}_{p_0}\!\left[f(X_0)\,e^{\beta r(X_0)}\right]}{\mathbb{E}_{p_0}e^{\beta r(X_0)}}.$$

Thus the target is obtained from the base model by reweighting samples according to their clean-output reward. The case $\beta=0$ recovers $p_0$, while larger $\beta$ places more mass on high-reward regions and therefore trades diversity under the base model for reward improvement.

Conditional generation fits the same form. If $y$ is an observation, class label, or prompt and $p(y\mid x_0)$ is the corresponding likelihood or compatibility model, then Bayes' rule gives

$$p_0(x_0\mid y)\propto p_0(x_0)\,p(y\mid x_0).$$

This is an exponential tilt with reward function $x_0\mapsto\log p(y\mid x_0)$ and $\beta=1$; a classifier score or learned preference model plays the same role when an explicit likelihood is unavailable. The inference-time problem is to modify the reverse sampler so that its clean-output law approximates such a tilted or conditional target while still reusing the pretrained base dynamics.

The same tilted law has a useful variational meaning. Among all distributions $q$ over final samples, it is the optimizer of

$$\sup_q\left\{\beta\,\mathbb{E}_qr-D_{\mathsf{KL}}(q\,\|\,p_0)\right\}.\tag{8.2}$$

Equivalently, if $\beta>0$,

$$p_0^{\beta}=\operatorname*{arg\,max}_q\left\{\mathbb{E}_qr-\frac{1}{\beta}D_{\mathsf{KL}}(q\,\|\,p_0)\right\}.$$

Thus $\beta$ controls the tradeoff between reward improvement and staying close to the base model. Large $\beta$ pushes hard toward high reward and risks mode collapse or reward hacking; small $\beta$ preserves the base distribution but gives weaker alignment.

###### Proof.

Let $p_0^{\beta}(x)=Z_\beta^{-1}p_0(x)e^{\beta r(x)}$. For any $q$,

$$D_{\mathsf{KL}}(q\,\|\,p_0^{\beta})=\int q(x)\log\frac{q(x)}{p_0(x)e^{\beta r(x)}/Z_\beta}\,\mathrm{d}x=D_{\mathsf{KL}}(q\,\|\,p_0)-\beta\,\mathbb{E}_qr+\log Z_\beta.$$

Rearranging,

$$\beta\,\mathbb{E}_qr-D_{\mathsf{KL}}(q\,\|\,p_0)=\log Z_\beta-D_{\mathsf{KL}}(q\,\|\,p_0^{\beta})\leq\log Z_\beta,$$

with equality iff $q=p_0^{\beta}$. ∎
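On a finite set the variational characterization can be verified directly. The sketch below uses a made-up four-point base law, reward vector, and inverse temperature; it checks that the tilted law attains the value $\log Z_\beta$ and that, for randomly drawn competitors $q$, the shortfall equals $D_{\mathsf{KL}}(q\,\|\,p_0^{\beta})$, as in the proof.

```python
import numpy as np

def kl(q, p):
    """KL divergence between probability vectors, with the convention 0 log 0 = 0."""
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

def objective(q, p0, r, beta):
    """KL-regularized reward: beta * E_q[r] - KL(q || p0)."""
    return beta * float(q @ r) - kl(q, p0)

p0 = np.array([0.4, 0.3, 0.2, 0.1])      # hypothetical base law
r = np.array([0.0, 1.0, -0.5, 2.0])      # hypothetical rewards
beta = 1.5

weights = p0 * np.exp(beta * r)
Z = weights.sum()
p_tilt = weights / Z                      # exponentially tilted target

best = objective(p_tilt, p0, r, beta)     # should equal log Z
rng = np.random.default_rng(0)
trials = rng.dirichlet(np.ones(4), size=1000)
gaps = np.array([best - objective(q, p0, r, beta) for q in trials])
kls = np.array([kl(q, p_tilt) for q in trials])
print(best, np.log(Z), gaps.min())
```

The identity between the gap and the KL divergence to the tilted law is what makes the maximizer unique.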

### 8.2. Guidance as score tilting

To run a diffusion sampler for the tilted target $p_0^{\beta}$ we need the score of its noised marginals, so the question is how that score differs from the base score we have already learned. Let $p_t$ and $p_t^{\beta}$ be the noised marginals of $p_0$ and $p_0^{\beta}$. Write $P_{0,t}^{\to}(x_0\to x)$ for the density of the clean-to-noisy marginal forward kernel $\operatorname{Law}(X_t\mid X_0=x_0)$. Because the reward acts only on the clean sample $x_0$, noising the tilted law gives

$$p_t^{\beta}(x)=\frac{1}{Z_\beta}\int P_{0,t}^{\to}(x_0\to x)\,e^{\beta r(x_0)}\,p_0(x_0)\,\mathrm{d}x_0=\frac{p_t(x)}{Z_\beta}\,h_t^{\beta}(x),\qquad h_t^{\beta}(x):=\mathbb{E}\!\left[e^{\beta r(X_0)}\mid X_t=x\right].$$

The noised tilted law is thus the base noised law reweighted by the *posterior tilt factor* $h_t^{\beta}$, which reads off, from the current noisy state, the expected exponential reward of the clean sample it will denoise to. Because this marginal is a product, its log-gradient splits into the base score plus a correction,

$$\nabla\log p_t^{\beta}(x)=\nabla\log p_t(x)+\nabla\log h_t^{\beta}(x),$$

so exact guidance merely adds the tilt gradient $\nabla\log h_t^{\beta}$ to the base score, and practical methods differ only in how they approximate it.
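The factorization and the resulting score decomposition can be checked on a one-dimensional toy problem. The sketch assumes, purely for illustration, a three-atom clean law with made-up rewards and the variance-exploding Gaussian kernel $X_t=X_0+\sqrt{t}\,Z$; the log-gradients are approximated by central finite differences rather than computed in closed form.

```python
import numpy as np

def gauss(x, mean, var):
    """Gaussian density N(x; mean, var)."""
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

atoms = np.array([-2.0, 0.5, 3.0])      # hypothetical clean samples
w = np.array([0.5, 0.3, 0.2])           # base weights p_0
reward = np.array([0.0, 1.0, 2.0])      # reward at each atom
beta, t = 0.8, 1.5
Z = np.sum(w * np.exp(beta * reward))   # normalizing constant Z_beta

def p_t(x):
    """Noised base marginal."""
    return np.sum(w * gauss(x, atoms, t))

def h_t(x):
    """Posterior tilt factor E[exp(beta r(X_0)) | X_t = x]."""
    post = w * gauss(x, atoms, t)
    post = post / post.sum()
    return np.sum(post * np.exp(beta * reward))

def p_t_tilted(x):
    """Noised marginal of the tilted clean law."""
    return np.sum(w * np.exp(beta * reward) / Z * gauss(x, atoms, t))

def dlog(f, x, eps=1e-5):
    """Central finite-difference approximation of d/dx log f(x)."""
    return (np.log(f(x + eps)) - np.log(f(x - eps))) / (2 * eps)

xs = np.linspace(-4.0, 5.0, 19)
product_gap = max(abs(p_t_tilted(x) - p_t(x) * h_t(x) / Z) for x in xs)
tilted_score = np.array([dlog(p_t_tilted, x) for x in xs])
guided_score = np.array([dlog(p_t, x) + dlog(h_t, x) for x in xs])
print(product_gap, np.max(np.abs(tilted_score - guided_score)))
```

In a real model neither the posterior nor the tilt factor is available in closed form, which is exactly why practical guidance methods approximate the tilt gradient.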

For conditional generation, the reward function is $x_0\mapsto\log p(y\mid x_0)$. The same posterior-tilt notation gives the noisy likelihood

$$h_t(x\mid y):=\mathbb{E}[p(y\mid X_0)\mid X_t=x].$$

This is the quantity a noisy classifier estimates. It is not the noised conditional density itself; rather, it tilts the base noisy density. If the condition is fixed, the noised conditional density is

$$p_t(x\mid y)\propto p_t(x)\,h_t(x\mid y),$$

where the missing normalizing constant is independent of $x$. Taking a log-gradient at noise level $t$ gives

$$\nabla\log p_t(x\mid y)=\nabla\log p_t(x)+\nabla\log h_t(x\mid y).\tag{8.3}$$

This is classifier guidance \[[20](https://arxiv.org/html/2607.01693#bib.bib20)\]: train or use a classifier on noisy inputs, then add its gradient to the unconditional score. Classifier-free guidance \[[27](https://arxiv.org/html/2607.01693#bib.bib27)\] estimates the same increment without a separate classifier. We use the already established score notation in the form

$$\mathsf{s}_t(x)\approx\nabla\log p_t(x),\qquad\mathsf{s}_t(x\mid y)\approx\nabla\log p_t(x\mid y).$$

Thus ([8.3](https://arxiv.org/html/2607.01693#S8.E3)) suggests

$$\mathsf{s}_t(x\mid y)-\mathsf{s}_t(x)\approx\nabla\log h_t(x\mid y).$$

With guidance strength $\beta\geq 0$, classifier-free guidance replaces the unconditional score by

$$\mathsf{s}_t(x)\quad\longmapsto\quad\mathsf{s}_t(x)+\beta\bigl(\mathsf{s}_t(x\mid y)-\mathsf{s}_t(x)\bigr).\tag{8.4}$$

The choice $\beta=1$ gives the conditional score estimate $\mathsf{s}_t(x\mid y)$, while $\beta>1$ extrapolates the conditional-score increment; empirically this often improves condition satisfaction, at the cost of moving farther from the base distribution. If the two learned scores were exact and $\beta=1$, the replacement ([8.4](https://arxiv.org/html/2607.01693#S8.E4)) would recover the conditional score in ([8.3](https://arxiv.org/html/2607.01693#S8.E3)). For $\beta\neq 1$, it instead gives the score of the noise-level power tilt $p_t(x)\,h_t(x\mid y)^{\beta}$, whereas the clean reward tilt ([8.1](https://arxiv.org/html/2607.01693#S8.E1)) would involve the posterior factor $h_t^{\beta}(x)=\mathbb{E}[p(y\mid X_0)^{\beta}\mid X_t=x]$. Powering the noisy likelihood is not the same as tilting by the noisy expectation of the powered clean likelihood, so the two coincide only at $\beta=1$.
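The gap between classifier-free guidance and the clean power tilt is visible in a fully Gaussian toy model, where every score is available in closed form. The model below is an assumption made only for illustration: $X_0\sim\mathcal{N}(0,1)$, an observation $y\mid X_0\sim\mathcal{N}(X_0,\sigma^2)$, and variance-exploding noising $X_t=X_0+\sqrt{t}\,Z$; the function names are hypothetical.

```python
import numpy as np

def cfg_score(s_uncond, s_cond, beta):
    """Classifier-free guidance combination (8.4)."""
    return s_uncond + beta * (s_cond - s_uncond)

# Toy Gaussian model (assumption): X_0 ~ N(0, 1), y | X_0 ~ N(X_0, sigma2),
# and X_t = X_0 + sqrt(t) Z.
sigma2, y, t = 0.5, 1.2, 0.3

def base_score(x):
    """Score of the noised base law N(0, 1 + t)."""
    return -x / (1.0 + t)

def clean_tilt_score(x, beta):
    """Score of the noised law of p_0(x0) p(y | x0)^beta, Gaussian in this model."""
    prec = 1.0 + beta / sigma2               # precision of the tilted clean law
    mean = (beta * y / sigma2) / prec        # mean of the tilted clean law
    return -(x - mean) / (1.0 / prec + t)

def cond_score(x):
    """Exact conditional score: the beta = 1 tilt."""
    return clean_tilt_score(x, 1.0)

xs = np.linspace(-3.0, 3.0, 13)
gap_beta1 = np.max(np.abs(cfg_score(base_score(xs), cond_score(xs), 1.0)
                          - clean_tilt_score(xs, 1.0)))
gap_beta2 = np.max(np.abs(cfg_score(base_score(xs), cond_score(xs), 2.0)
                          - clean_tilt_score(xs, 2.0)))
print(gap_beta1, gap_beta2)
```

With exact scores the two constructions agree at guidance strength one and separate at strength two for any positive noise level, which is the distinction drawn in the text between powering the noisy likelihood and noising the powered clean likelihood.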

For small tilts, the posterior tilt factor has a linear approximation\. Sincehtβ\(x\)=1\+βVt\(x\)\+O\(β2\)h\_\{t\}^\{\\beta\}\(x\)=1\+\\beta V\_\{t\}\(x\)\+O\(\\beta^\{2\}\)withVt\(x\)=𝔼\[r\(X0\)∣Xt=x\]V\_\{t\}\(x\)=\\mathbb\{E\}\[r\(X\_\{0\}\)\\mid X\_\{t\}=x\], we have

$$\nabla\log p_{t}^{\beta}(x)\approx\nabla\log p_{t}(x)+\beta\nabla V_{t}(x),\qquad V_{t}(x)=\mathbb{E}[r(X_{0})\mid X_{t}=x].\tag{8.5}$$

In the small-tilt regime, then, guidance perturbs the score by the gradient of the posterior expected reward $V_{t}$, so a reward model need only capture this first-order landscape. The simplification is that $V_{t}$ averages the reward $r$ itself, whereas the exact factor $h_{t}^{\beta}$ averages $e^{\beta r}$; the two agree only to first order in $\beta$. Exact guidance uses the full $h_{t}^{\beta}$, and the small-tilt form trades it for a cheaper reward-gradient correction.
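The first-order agreement between $h_t^\beta$ and $1+\beta V_t$ can be checked numerically at a single noisy point. The sketch below assumes a hypothetical three-point posterior for $X_0$ given $X_t=x$ with made-up rewards; it only illustrates that the gap between the exact factor and its linearization shrinks like $\beta^2$.

```python
import numpy as np

def tilt_factor(post, r, beta):
    """Exact factor h^beta = E[exp(beta * r(X0)) | X_t = x] for a discrete posterior."""
    return float(np.sum(post * np.exp(beta * r)))

def linearized_tilt(post, r, beta):
    """First-order surrogate 1 + beta * V with V = E[r(X0) | X_t = x]."""
    return 1.0 + beta * float(np.sum(post * r))

post = np.array([0.2, 0.5, 0.3])   # hypothetical posterior weights of X0 given X_t = x
r = np.array([-1.0, 0.0, 2.0])     # hypothetical rewards of the three clean states

gaps = {}
for beta in (0.1, 0.01):
    gaps[beta] = tilt_factor(post, r, beta) - linearized_tilt(post, r, beta)
    print(beta, gaps[beta])        # shrinking beta by 10 shrinks the gap by about 100
```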

### 8.3. Reward tilting as a Polchinski flow

Stepping back from the practical methods, the tilt factor $h_{t}^{\beta}$ ties guidance to the renormalization viewpoint of Section [4](https://arxiv.org/html/2607.01693#S4). Recall the variance-exploding normalization $X_{t}=X_{0}+\sqrt{t}\,Z$ used there, in which $p_{t}=p_{0}\ast\mathcal{N}(0,tI)$ and the effective potential $U_{t}=-\log p_{t}$ solves the finite-dimensional Polchinski equation ([4.12](https://arxiv.org/html/2607.01693#S4.E12)). Reading the integral that defines $h_{t}^{\beta}$ the other way, the *unnormalized* tilted marginal $h_{t}^{\beta}\,p_{t}=(e^{\beta r}p_{0})\ast\mathcal{N}(0,tI)$ is the same forward channel applied to the reward-reweighted data measure $e^{\beta r}p_{0}$ (noising the tilted data and tilting the noised data by $h_{t}^{\beta}$ coincide), so it is a heat flow and solves ([4.11](https://arxiv.org/html/2607.01693#S4.E11)) just as $p_{t}$ does. Dividing out $p_{t}$ leaves a forward Kolmogorov equation for the tilt factor,

$$\partial_{t}h_{t}^{\beta}=\tfrac{1}{2}\Delta h_{t}^{\beta}+\mathsf{s}^{\star}_{t}\cdot\nabla h_{t}^{\beta},\tag{8.6}$$

so $h_{t}^{\beta}$ is transported by the score-driven generator $\tfrac{1}{2}\Delta+\mathsf{s}^{\star}_{t}\cdot\nabla$. Equivalently, the tilted effective potential $U_{t}^{\beta}:=-\log(h_{t}^{\beta}\,p_{t})$ obeys the *same* Polchinski equation,

$$\partial_{t}U_{t}^{\beta}=\tfrac{1}{2}\Delta U_{t}^{\beta}-\tfrac{1}{2}\left\lVert\nabla U_{t}^{\beta}\right\rVert^{2},\tag{8.7}$$

only with the bare potential shifted from $U_{0}=-\log p_{0}$ to $U_{0}^{\beta}=-\log p_{0}-\beta r$. Reward tilting therefore runs the renormalization flow from a reward-shifted initial potential, and the guided score $-\nabla U_{t}^{\beta}=\mathsf{s}^{\star}_{t}+\nabla\log h_{t}^{\beta}$ is exactly the additive guidance correction of the previous subsection.

This is a reformulation, not a recipe. Solving the tilt equation ([8.6](https://arxiv.org/html/2607.01693#S8.E6)) or the Polchinski equation ([8.7](https://arxiv.org/html/2607.01693#S8.E7)) is no easier than evaluating the posterior expectation $h_{t}^{\beta}(x)=\mathbb{E}[e^{\beta r(X_{0})}\mid X_{t}=x]$ that defines it: the PDE carries the same computational difficulty as the conditional expectation.
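One case where every object in this subsection is available in closed form may help fix ideas. The following worked example is a sketch under assumptions that are not in the text: one dimension, Gaussian data $p_0=\mathcal{N}(0,\sigma^2)$, and a linear reward $r(x)=ax$ with a hypothetical slope $a$.

```latex
\begin{aligned}
e^{\beta r}p_0 &\propto \mathcal{N}(\beta a\sigma^2,\ \sigma^2),
\qquad
h_t^{\beta}\,p_t=(e^{\beta r}p_0)\ast\mathcal{N}(0,t)\propto\mathcal{N}(\beta a\sigma^2,\ \sigma^2+t),\\[2pt]
U_t^{\beta}(x) &= \frac{(x-\beta a\sigma^2)^2}{2(\sigma^2+t)}+\tfrac12\log(\sigma^2+t)+C,\\[2pt]
\partial_t U_t^{\beta} &= \frac{1}{2(\sigma^2+t)}-\frac{(x-\beta a\sigma^2)^2}{2(\sigma^2+t)^2}
=\tfrac12\,\partial_x^2 U_t^{\beta}-\tfrac12\,\bigl(\partial_x U_t^{\beta}\bigr)^2,\\[2pt]
-\partial_x U_t^{\beta}(x) &= \underbrace{-\frac{x}{\sigma^2+t}}_{\mathsf{s}_t^{\star}(x)}
+\underbrace{\frac{\beta a\sigma^2}{\sigma^2+t}}_{\partial_x\log h_t^{\beta}(x)} .
\end{aligned}
```

Here $C$ does not depend on $t$ or $x$. In this Gaussian, linear-reward case the guidance term equals $\beta\,\partial_x V_t$ with $V_t(x)=a\sigma^2 x/(\sigma^2+t)$, so the small-tilt correction (8.5) happens to be exact; for nonlinear rewards that coincidence is not expected.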

### 8.4. Path-space control and the Doob transform

The score-tilting, classifier-guidance, and Polchinski-flow descriptions of the previous subsections are all exact, but they all rest on the same intractable object, the posterior tilt $h_{t}^{\beta}$. To see how it is approximated in practice it helps to pass from marginals to whole reverse trajectories: on path space the tilt becomes a stochastic control problem, and its exact solution (the Doob transform of the base reverse chain) exposes precisely which conditional expectations must be estimated.

Let us write a continuous reverse sampler in generation time as

$$\mathrm{d}Y^{\leftarrow}_{s}=b_{s}(Y^{\leftarrow}_{s})\,\mathrm{d}s+\sigma_{s}\,\mathrm{d}B^{\leftarrow}_{s},\qquad 0\leq s\leq T,$$

and let $\mathbb{P}^{0}$ be its path law on full reverse trajectories $(Y^{\leftarrow}_{s})_{0\leq s\leq T}$, with clean output $Y^{\leftarrow}_{T}$.

Reward\-tilted sampling trades terminal reward against the cost of departing from this base law\. At the level of path laws it is the KL\-regularized optimization

$$\sup_{\mathbb{Q}}\left\{\beta\,\mathbb{E}_{\mathbb{Q}}\,r(Y^{\leftarrow}_{T})-\operatorname{D_{\mathsf{KL}}}(\mathbb{Q}\,\|\,\mathbb{P}^{0})\right\}=\log\mathbb{E}_{\mathbb{P}^{0}}e^{\beta r(Y^{\leftarrow}_{T})},\tag{8.8}$$

the path-space version of the Gibbs variational principle ([8.2](https://arxiv.org/html/2607.01693#S8.E2)). Its optimizer is the Gibbs path law

$$\frac{\mathrm{d}\mathbb{Q}^{\star}}{\mathrm{d}\mathbb{P}^{0}}\!\left((Y^{\leftarrow}_{s})_{0\leq s\leq T}\right)=\frac{e^{\beta r(Y^{\leftarrow}_{T})}}{\mathbb{E}_{\mathbb{P}^{0}}e^{\beta r(Y^{\leftarrow}_{T})}}.\tag{8.9}$$

###### Proof.

The proof is identical to the finite-dimensional reward-tilting proof. Define $\mathrm{d}\mathbb{Q}^{\star}\propto e^{\beta r(Y^{\leftarrow}_{T})}\,\mathrm{d}\mathbb{P}^{0}$. Then

$$\operatorname{D_{\mathsf{KL}}}(\mathbb{Q}\,\|\,\mathbb{Q}^{\star})=\operatorname{D_{\mathsf{KL}}}(\mathbb{Q}\,\|\,\mathbb{P}^{0})-\beta\,\mathbb{E}_{\mathbb{Q}}\,r(Y^{\leftarrow}_{T})+\log\mathbb{E}_{\mathbb{P}^{0}}e^{\beta r(Y^{\leftarrow}_{T})}.$$

Rearranging and using nonnegativity of KL proves ([8.8](https://arxiv.org/html/2607.01693#S8.E8)), with the supremum attained at $\mathbb{Q}^{\star}$. ∎
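Because the reward depends only on the terminal state, the variational identity can be sanity-checked on a finite set standing in for path space. The sketch below assumes a hypothetical three-point base law and reward; it verifies that the Gibbs law attains $\log\mathbb{E}_{\mathbb{P}^0}e^{\beta r}$ and that randomly drawn competitors do no better.

```python
import numpy as np

def gibbs_objective(q, p0, r, beta):
    """beta * E_q[r] - KL(q || p0) for distributions on a finite set."""
    mask = q > 0
    kl = np.sum(q[mask] * np.log(q[mask] / p0[mask]))
    return beta * np.sum(q * r) - kl

p0 = np.array([0.5, 0.3, 0.2])   # hypothetical base law of the clean output
r = np.array([0.0, 1.0, 3.0])    # hypothetical terminal rewards
beta = 0.7

q_star = p0 * np.exp(beta * r)
log_z = np.log(q_star.sum())     # log E_{P0} exp(beta * r)
q_star = q_star / q_star.sum()   # Gibbs law, proportional to exp(beta * r) * p0

# Random competitors on the simplex; none should beat the Gibbs law.
rng = np.random.default_rng(0)
others = rng.dirichlet(np.ones(3), size=1000)
best_other = max(gibbs_objective(q, p0, r, beta) for q in others)
print(gibbs_objective(q_star, p0, r, beta), log_z, best_other)
```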

To turn this into a sampler we express it as a control problem on the drift\. A controlled sampler changes only the drift,

$$\mathrm{d}Y^{\leftarrow}_{s}=\bigl(b_{s}(Y^{\leftarrow}_{s})+\sigma_{s}u_{s}(Y^{\leftarrow}_{s})\bigr)\,\mathrm{d}s+\sigma_{s}\,\mathrm{d}B^{\leftarrow}_{s},$$

with path law $\mathbb{P}^{u}$; using the same diffusion coefficient means the two path laws differ only through the drift. Under the usual absolute-continuity and integrability assumptions, the Girsanov KL formula in Appendix [A](https://arxiv.org/html/2607.01693#A1) gives

$$\operatorname{D_{\mathsf{KL}}}(\mathbb{P}^{u}\,\|\,\mathbb{P}^{0})=\frac{1}{2}\,\mathbb{E}_{\mathbb{P}^{u}}\int_{0}^{T}\left\lVert u_{s}(Y^{\leftarrow}_{s})\right\rVert^{2}\,\mathrm{d}s,\tag{8.10}$$

with $u_{s}(Y^{\leftarrow}_{s})$ a progressively measurable control, so the quadratic control energy is exactly the KL cost of steering the sampler away from its base law. Restricting $\mathbb{Q}$ to controlled diffusions $\mathbb{P}^{u}$ and substituting this cost into ([8.8](https://arxiv.org/html/2607.01693#S8.E8)) makes the Gibbs problem the equivalent stochastic control problem over drifts

$$\sup_{u}\,\mathbb{E}_{\mathbb{P}^{u}}\left[\beta r(Y^{\leftarrow}_{T})-\frac{1}{2}\int_{0}^{T}\left\lVert u_{s}(Y^{\leftarrow}_{s})\right\rVert^{2}\,\mathrm{d}s\right].\tag{8.11}$$

The two are the same optimization stated at the level of measures and of drifts.

Reading the optimizer ([8.9](https://arxiv.org/html/2607.01693#S8.E9)) back at the level of drifts recovers the slogan “guidance is control”: the optimal control’s drift increment is exactly the value-gradient term of the ideal guided reverse dynamics of Subsection [8.2](https://arxiv.org/html/2607.01693#S8.SS2), now in this generation-time parametrization. The KL cost ([8.10](https://arxiv.org/html/2607.01693#S8.E10)), which measured discretization error in Section [6](https://arxiv.org/html/2607.01693#S6), is here a deliberately chosen guidance budget, spent where it raises expected terminal reward most efficiently.

From here to the end of the section we adopt the discrete-time point of view, working with the time-discretized reverse chain $(X_{K},\ldots,X_{0})$ of Section [5](https://arxiv.org/html/2607.01693#S5) rather than the continuous SDE. The reason is that the objects we now build (the Doob transform below, the sequential Monte Carlo weights of Subsection [8.5](https://arxiv.org/html/2607.01693#S8.SS5), and the policies of Subsection [8.6](https://arxiv.org/html/2607.01693#S8.SS6)) all act step by step, and a finite chain of reverse kernels states them without stochastic-calculus bookkeeping; the continuous formulas above are recovered in the small-step limit. In this notation the clean output is indexed by $0$, so a terminal reward is written as $r(X_{0})$. The path-space optimizer has an explicit Markov-kernel form: each base reverse proposal is reweighted by the tilt factor of its continuations and normalized by the tilt factor at the current state. This kernel reweighting is the Doob transform of the base reverse chain.

###### Theorem 8.1 (Optimal path tilt and Doob transform).

Let $\mathbb{P}^{0}$ be a base reverse Markov chain on $(X_{K},X_{K-1},\ldots,X_{0})$ with initial law $p_{K}^{0}$ and reverse kernels $P_{k}^{0,\leftarrow}(x_{k+1}\to\mathrm{d}x_{k})$. For a terminal reward $r(X_{0})$, define

$$h_{k}(x_{k})=\mathbb{E}_{\mathbb{P}^{0}}\!\left[e^{\beta r(X_{0})}\mid X_{k}=x_{k}\right].$$

Then the optimizer of the path-space Gibbs problem ([8.8](https://arxiv.org/html/2607.01693#S8.E8)) is the tilted path law

$$\frac{\mathrm{d}\mathbb{Q}^{\star}}{\mathrm{d}\mathbb{P}^{0}}=\frac{e^{\beta r(X_{0})}}{\mathbb{E}_{\mathbb{P}^{0}}e^{\beta r(X_{0})}}.$$

Moreover, $\mathbb{Q}^{\star}$ is Markov, and its reverse kernels are

$$P_{k}^{\star,\leftarrow}(x_{k+1}\to\mathrm{d}x_{k})=P_{k}^{0,\leftarrow}(x_{k+1}\to\mathrm{d}x_{k})\,\frac{h_{k}(x_{k})}{h_{k+1}(x_{k+1})},\qquad h_{k+1}(x_{k+1})=\int h_{k}(y)\,P_{k}^{0,\leftarrow}(x_{k+1}\to\mathrm{d}y).\tag{8.12}$$

###### Proof\.

The variational optimality of the tilted path law follows from nonnegativity of KL. It remains to compute its kernels. Condition on the current reverse-time state $X_{k+1}=x_{k+1}$. Under the tilted law, the conditional distribution of the next state $X_{k}$ is proportional to the base conditional distribution multiplied by the expected future weight:

$$P_{k}^{\star,\leftarrow}(x_{k+1}\to\mathrm{d}x_{k})\propto P_{k}^{0,\leftarrow}(x_{k+1}\to\mathrm{d}x_{k})\,\mathbb{E}_{\mathbb{P}^{0}}\!\left[e^{\beta r(X_{0})}\mid X_{k}=x_{k}\right].$$

By definition, the expectation is $h_{k}(x_{k})$. The normalizing constant is

$$\int h_{k}(y)\,P_{k}^{0,\leftarrow}(x_{k+1}\to\mathrm{d}y)=\mathbb{E}_{\mathbb{P}^{0}}\!\left[e^{\beta r(X_{0})}\mid X_{k+1}=x_{k+1}\right]=h_{k+1}(x_{k+1}),$$

where the middle equality uses the Markov property. This gives the displayed Doob-transform kernel and shows that the tilted path law is again Markov. ∎
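On a finite state space the theorem can be carried out exactly with matrix products. The sketch below assumes randomly generated row-stochastic reverse kernels, a random base initial law, and a random reward (all hypothetical); it computes $h_k$ by the backward recursion, forms the kernels (8.12), and compares the resulting clean marginal with the reward-tilted base marginal.

```python
import numpy as np

def doob_transform(kernels, r, beta):
    """kernels[k][i, j] = P_k^{0,<-}(x_{k+1}=i -> x_k=j) for k = 0..K-1.
    Returns the value functions h[0..K] and the tilted kernels of (8.12)."""
    K = len(kernels)
    h = [np.exp(beta * np.asarray(r, dtype=float))]   # h_0 = exp(beta * r)
    for k in range(K):
        h.append(kernels[k] @ h[k])                   # h_{k+1}(i) = sum_j P_k(i, j) h_k(j)
    tilted = [kernels[k] * h[k][None, :] / h[k + 1][:, None] for k in range(K)]
    return h, tilted

rng = np.random.default_rng(0)
n, K, beta = 4, 3, 1.5
kernels = [rng.dirichlet(np.ones(n), size=n) for _ in range(K)]  # random row-stochastic matrices
p_K = rng.dirichlet(np.ones(n))                                   # hypothetical base initial law
r = rng.normal(size=n)                                            # hypothetical reward

h, tilted = doob_transform(kernels, r, beta)

# Propagate the base and tilted laws from k = K down to the clean state k = 0.
# The tilted chain starts from p_K * h_K / Z, the initial marginal of Q*.
base = p_K.copy()
tilt = p_K * h[K] / (p_K @ h[K])
for k in reversed(range(K)):
    base = base @ kernels[k]
    tilt = tilt @ tilted[k]

target = base * np.exp(beta * r) / (base @ np.exp(beta * r))
print(np.abs(tilt - target).max())
```

The rows of each tilted kernel sum to one precisely because of the recursion $h_{k+1}=\int h_k\,P_k^{0,\leftarrow}$, which is the normalization step in the proof.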

The value function $h_{k}$ is intractable, and the two ways of coping with this organize the rest of the section. One route uses an approximate value only as a proposal and removes its bias by reweighting, giving an *exact* sampler in the limit (Subsection [8.5](https://arxiv.org/html/2607.01693#S8.SS5)); the other approximates the value and accepts the resulting bias, the inference-time-RL view (Subsection [8.6](https://arxiv.org/html/2607.01693#S8.SS6)).

### 8.5. Feynman–Kac correction and sequential Monte Carlo

Theorem [8.1](https://arxiv.org/html/2607.01693#S8.Thmtheorem1) gives the exact guided sampler, but its kernels require the value functions $h_{k}(x_{k})=\mathbb{E}_{\mathbb{P}^{0}}[e^{\beta r(X_{0})}\mid X_{k}=x_{k}]$, which are exactly as unavailable as the posterior reward-to-go in the score picture. This subsection takes the first route of the fork above: use an approximate value only as a *proposal* and remove its bias by reweighting. That is the Feynman–Kac / sequential Monte Carlo route, and it is the mechanism behind inference-time scaling of guided samplers.

The name comes from the value function. Along the base reverse process, $h$ is a conditional expectation of a terminal functional, hence a martingale,

$$h_{k+1}(x_{k+1})=\int h_{k}(y)\,P_{k}^{0,\leftarrow}(x_{k+1}\to\mathrm{d}y),$$

the same identity that appeared in Theorem [8.1](https://arxiv.org/html/2607.01693#S8.Thmtheorem1). In continuous time it solves the backward equation $\partial_{s}h_{s}+\mathcal{L}_{s}h_{s}=0$ with terminal data $h=e^{\beta r}$, where $\mathcal{L}_{s}$ is the generator of the base reverse diffusion; this is the reverse-time companion of the noising-time equation ([8.6](https://arxiv.org/html/2607.01693#S8.E6)) for the same tilt factor in Subsection [8.3](https://arxiv.org/html/2607.01693#S8.SS3). At the clean end $k=0$ the state is $X_{0}$ itself, so the terminal value is exactly $h_{0}(x_{0})=e^{\beta r(x_{0})}$.

The key fact is that the entire telescoping of the Doob transform collapses to a single terminal weight\.

###### Proposition 8.2 (Untwisted reweighting).

Let $\mathbb{P}^{0}$ be the base reverse chain of Theorem [8.1](https://arxiv.org/html/2607.01693#S8.Thmtheorem1) and $\mathbb{Q}^{\star}$ the tilted path law with $\mathrm{d}\mathbb{Q}^{\star}/\mathrm{d}\mathbb{P}^{0}=e^{\beta r(X_{0})}/Z_{\beta}$. Then on trajectories,

$$\frac{\mathrm{d}\mathbb{Q}^{\star}}{\mathrm{d}\mathbb{P}^{0}}(x_{K},\ldots,x_{0})=\frac{e^{\beta r(x_{0})}}{Z_{\beta}},\qquad Z_{\beta}=\mathbb{E}_{\mathbb{P}^{0}}e^{\beta r(X_{0})}.$$

###### Proof\.

Write the tilted path law from its initial law and Doob kernels. Its initial law is $p_{K}^{\star}(\mathrm{d}x_{K})=p_{K}^{0}(\mathrm{d}x_{K})\,h_{K}(x_{K})/Z_{\beta}$, and its kernels are $P_{k}^{\star,\leftarrow}=P_{k}^{0,\leftarrow}\,h_{k}(x_{k})/h_{k+1}(x_{k+1})$. Multiplying,

$$\frac{\mathrm{d}\mathbb{Q}^{\star}}{\mathrm{d}\mathbb{P}^{0}}=\frac{h_{K}(x_{K})}{Z_{\beta}}\prod_{k=0}^{K-1}\frac{h_{k}(x_{k})}{h_{k+1}(x_{k+1})}=\frac{h_{K}(x_{K})}{Z_{\beta}}\cdot\frac{h_{0}(x_{0})}{h_{K}(x_{K})}=\frac{h_{0}(x_{0})}{Z_{\beta}},$$

and $h_{0}(x_{0})=e^{\beta r(x_{0})}$. ∎

This suggests the simplest exact algorithm: draw $N$ trajectories from the base model, weight each by $e^{\beta r(x_{0})}$, and resample. As $N\to\infty$ the weighted empirical law converges to $\mathbb{Q}^{\star}$, so the clean marginal converges to the tilted target $p_{0}^{\beta}$. The catch is variance: if the base model rarely produces high-reward outputs, almost all weight lands on a few trajectories and the effective sample size collapses.
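The weight-and-resample recipe and its variance problem can be sketched in a few lines. The example below assumes a stand-in for the base sampler (standard normal clean outputs) and a hypothetical reward $r(x)=x$; the helper names are illustrative. It uses the common effective-sample-size estimate $1/\sum_i w_i^2$ for normalized weights.

```python
import numpy as np

def normalized_weights(rewards, beta):
    """Self-normalized weights proportional to exp(beta * r(x_0)), computed stably."""
    logw = beta * np.asarray(rewards, dtype=float)
    w = np.exp(logw - logw.max())
    return w / w.sum()

def effective_sample_size(w):
    """ESS = 1 / sum_i w_i^2 for normalized weights w."""
    return 1.0 / np.sum(w ** 2)

def resample(samples, w, rng):
    """Multinomial resampling of the clean outputs according to w."""
    idx = rng.choice(len(samples), size=len(samples), p=w)
    return samples[idx]

rng = np.random.default_rng(0)
x0 = rng.normal(size=5000)     # stand-in for clean outputs of a base sampler
reward = lambda x: x           # hypothetical reward r(x) = x

for beta in (0.5, 3.0):
    w = normalized_weights(reward(x0), beta)
    # A larger beta concentrates the weight on few samples, so the ESS drops.
    print(beta, effective_sample_size(w), resample(x0, w, rng).mean())
```

For this toy base law the tilted target is a unit-variance Gaussian centered at $\beta$, so the resampled mean can be compared against $\beta$; at large $\beta$ few base samples reach that region and the estimate degrades.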

Sequential Monte Carlo fixes this by *twisting* the proposal with the approximate value and resampling along the way. Replace the base kernel by a proposal that follows an approximate Doob transform,

$$\widehat{P}_{k}^{\leftarrow}(x_{k+1}\to\mathrm{d}x_{k})=\frac{P_{k}^{0,\leftarrow}(x_{k+1}\to\mathrm{d}x_{k})\,\widehat{h}_{k}(x_{k})}{\widehat{g}_{k+1}(x_{k+1})},\qquad\widehat{g}_{k+1}(x_{k+1})=\int\widehat{h}_{k}(y)\,P_{k}^{0,\leftarrow}(x_{k+1}\to\mathrm{d}y),\tag{8.13}$$

where $\widehat{h}_{k}\approx h_{k}$ is any tractable value estimate (a classifier, a reward model, a learned value, or the linearized small-tilt value of ([8.5](https://arxiv.org/html/2607.01693#S8.E5))) with the exact terminal value $\widehat{h}_{0}=e^{\beta r}$. If the particles are initialized from the base noisy law $p_{K}^{0}$, they also carry the initial weight $\widehat{h}_{K}(X_{K})$; if one can instead sample from the twisted noisy law proportional to $p_{K}^{0}\,\widehat{h}_{K}$, this initial weight is constant. Running $N$ particles through $\widehat{P}_{k}^{\leftarrow}$ and carrying an incremental importance weight at each reverse step, then resampling when the effective sample size drops, gives a consistent estimator of $\mathbb{Q}^{\star}$.

At each reverse step the particle picks up an incremental importance weight: the ratio of the base transition weighted by the current value $\widehat{h}_{k}(x_{k})$ to the twisted proposal weighted by the next value $\widehat{h}_{k+1}(x_{k+1})$. Substituting the proposal ([8.13](https://arxiv.org/html/2607.01693#S8.E13)), this collapses to

$$\frac{P_{k}^{0,\leftarrow}(x_{k+1}\to\mathrm{d}x_{k})\,\widehat{h}_{k}(x_{k})}{\widehat{P}_{k}^{\leftarrow}(x_{k+1}\to\mathrm{d}x_{k})\,\widehat{h}_{k+1}(x_{k+1})}=\frac{\widehat{g}_{k+1}(x_{k+1})}{\widehat{h}_{k+1}(x_{k+1})}.$$

Thus the incremental weight is

$$w_{k}(x_{k+1})=\frac{\widehat{g}_{k+1}(x_{k+1})}{\widehat{h}_{k+1}(x_{k+1})}.\tag{8.14}$$

It measures the one-step mismatch between the value $\widehat{h}_{k+1}$ assumed at the current state and the value $\widehat{g}_{k+1}$ obtained by propagating $\widehat{h}_{k}$ one base step. If the twist is exact, $\widehat{h}=h$, then $\widehat{g}_{k+1}=h_{k+1}$ by the martingale identity and all incremental weights are one; the only nonconstant weight left is the initial twist weight if the sampler starts from $p_{K}^{0}$. All of the approximation error is thus pushed into nonuniform weights, which resampling removes in the large-$N$ limit; a good $\widehat{h}$ keeps the weights near uniform and the variance low.
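The twisted proposal (8.13) and the incremental weight (8.14) can be written out for one reverse step on a finite state space. The sketch below assumes a random row-stochastic base kernel and a random reward (both hypothetical), and contrasts the exact twist, where every weight equals one, with a deliberately perturbed twist.

```python
import numpy as np

def incremental_weights(kernel, h_hat_k, h_hat_k1):
    """Twisted proposal (8.13) and weight (8.14) for one reverse step.
    kernel[i, j] = P_k^{0,<-}(x_{k+1}=i -> x_k=j)."""
    g_hat = kernel @ h_hat_k                                # g_hat_{k+1}(i)
    proposal = kernel * h_hat_k[None, :] / g_hat[:, None]   # each row sums to one
    return proposal, g_hat / h_hat_k1                       # w_k as a function of x_{k+1}

rng = np.random.default_rng(1)
n, beta = 5, 1.0
kernel = rng.dirichlet(np.ones(n), size=n)   # hypothetical base reverse kernel
r = rng.normal(size=n)                       # hypothetical reward

h0 = np.exp(beta * r)    # exact terminal value h_0
h1 = kernel @ h0         # exact h_1 from the martingale identity

proposal, w_exact = incremental_weights(kernel, h0, h1)

# A deliberately perturbed twist at level k+1: the weights absorb the mismatch.
h1_rough = h1 * np.exp(0.3 * rng.normal(size=n))
_, w_rough = incremental_weights(kernel, h0, h1_rough)
print(w_exact, w_rough)
```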

For conditional generation, take the terminal reward to be $\log p(y\mid x_{0})$ with $\beta=1$, so that $e^{\beta r(x_{0})}=p(y\mid x_{0})$. Then $h_{k}(x_{k})=\mathbb{P}^{0}(y\mid X_{k}=x_{k})$, and a noisy classifier or conditional likelihood model gives a tractable $\widehat{h}_{k}$. When this twist is realized as a guidance gradient $\nabla\log\widehat{h}_{k}$ that only moves the particles, the proposal is exactly a classifier-guided sampler, and the weights above are the SMC correction that restores the conditional target in the large-particle limit.

### 8.6. Inference-time reinforcement learning

The previous subsections describe the ideal sampler for a reward tilt: if the posterior value functions $h_{k}$ were known exactly, the reverse kernels would be the Doob transforms in Theorem [8.1](https://arxiv.org/html/2607.01693#S8.Thmtheorem1). The setting here is genuinely inference-time: the pretrained reverse sampler stays fixed as the reference law, and a policy changes only how that sampler is run at test time. The RL problem is therefore over sampler choices during generation, not over retraining the base score model.

In a finite-horizon formulation, the reverse sampler takes actions during sampling. Let $\pi=(\pi_{k})$ be a possibly randomized policy: after observing $X_{k+1}$ at step $k$, it chooses an action $a_{k}\sim\pi_{k}(\cdot\mid X_{k+1})$, and the reverse step uses the corresponding kernel,

$$a_{k}\sim\pi_{k}(\cdot\mid X_{k+1}),\qquad X_{k}\sim P_{k}^{a_{k},\leftarrow}(X_{k+1}\to\cdot),\qquad a_{k}\in\mathcal{A}.$$

The action may be a guidance scale, a timestep choice, an injected-noise level, a rejection or resampling decision, or an additive drift correction. In a masked language diffusion sampler, it may choose how many tokens to unmask, which positions to update, or how sharply to sample from a token posterior. The terminal reward is measured only after the final object $X_{0}$ is produced.

The objective is the same KL\-regularized change of measure, restricted to the family of sampler modifications available at inference time:

$$J(\pi)=\mathbb{E}_{\pi}\left[r(X_{0})-\frac{1}{\beta}\sum_{k}\operatorname{D_{\mathsf{KL}}}\!\left(P_{k}^{a_{k},\leftarrow}(X_{k+1}\to\cdot)\,\middle\|\,P_{k}^{0,\leftarrow}(X_{k+1}\to\cdot)\right)\right].\tag{8.15}$$

Here $\beta>0$ is the same inverse temperature as in the variational principle ([8.2](https://arxiv.org/html/2607.01693#S8.E2)); larger $\beta$ means weaker KL regularization and stronger reward seeking. The KL term keeps the policy-induced sampler close to the pretrained sampler at each reverse step. This mirrors the variational identity: reward improvement is meaningful only relative to a reference distribution. Without this reference cost, policy optimization can exploit flaws in the reward model or collapse to a narrow set of high-scoring samples.

If the action class is rich enough to choose the whole next-step kernel, this optimization does not produce a new sampler: it recovers the Doob-transformed kernel of Theorem [8.1](https://arxiv.org/html/2607.01693#S8.Thmtheorem1), now reached by dynamic programming rather than by tilting the path law. To see the same object appear, define the regularized value-to-go from $x_{k}$ by

$$v_{k}(x_{k})=\sup_{\pi_{0},\ldots,\pi_{k-1}}\mathbb{E}_{\pi}\left[r(X_{0})-\frac{1}{\beta}\sum_{j=0}^{k-1}\operatorname{D_{\mathsf{KL}}}\!\left(P_{j}^{a_{j},\leftarrow}(X_{j+1}\to\cdot)\,\middle\|\,P_{j}^{0,\leftarrow}(X_{j+1}\to\cdot)\right)\,\middle|\,X_{k}=x_{k}\right],$$

with the empty-sum convention $v_{0}(x_{0})=r(x_{0})$. The local Gibbs variational principle gives

$$P_{k}^{\star,\leftarrow}(x_{k+1}\to\mathrm{d}x_{k})\propto P_{k}^{0,\leftarrow}(x_{k+1}\to\mathrm{d}x_{k})\exp\left(\beta v_{k}(x_{k})\right).$$

Writing $h_{k}(x_{k})=\exp(\beta v_{k}(x_{k}))$ makes this exactly the normalization in ([8.12](https://arxiv.org/html/2607.01693#S8.E12)): the same reweighting we already have, only now the value is presented as a Bellman value-to-go rather than a posterior tilt factor. The two letters are the two coordinates of one object: the multiplicative tilt factor $h_{k}$ that enters the Doob kernel and the SMC weights, and its log-domain *soft value* $v_{k}=\beta^{-1}\log h_{k}$, so that $\beta$ times its gradient (in the continuous picture) is added to the score, matching $\nabla\log h_{t}^{\beta}=\beta\nabla v_{t}$. The posterior expected reward $V_{t}$ of ([8.5](https://arxiv.org/html/2607.01693#S8.E5)) is the small-$\beta$ linearization of this soft value. Soft dynamic programming is therefore the finite-state computational form of the path-space tilt; what is genuinely new here is not the optimal kernel but how its value is obtained, since guidance methods approximate this value information while RL methods estimate it from sampled rewards.
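The correspondence $h_k=\exp(\beta v_k)$ can be checked by running the soft Bellman recursion $v_{k+1}(x)=\beta^{-1}\log\int e^{\beta v_k(y)}\,P_k^{0,\leftarrow}(x\to\mathrm{d}y)$ next to the linear recursion for $h_k$. The sketch below assumes random finite-state kernels and a random reward (hypothetical), and also compares the small-$\beta$ soft value with the base expected reward.

```python
import numpy as np

def soft_values(kernels, r, beta):
    """Soft Bellman recursion with v_0 = r:
    v_{k+1}(x) = (1/beta) * log sum_y P_k(x, y) * exp(beta * v_k(y))."""
    v = [np.asarray(r, dtype=float)]
    for P in kernels:
        z = beta * v[-1]
        m = z.max()                                   # log-sum-exp shift for stability
        v.append((m + np.log(P @ np.exp(z - m))) / beta)
    return v

rng = np.random.default_rng(2)
n, K, beta = 4, 3, 2.0
kernels = [rng.dirichlet(np.ones(n), size=n) for _ in range(K)]  # hypothetical base kernels
r = rng.normal(size=n)                                            # hypothetical reward

v = soft_values(kernels, r, beta)

# Linear recursions for the tilt factor h_k and the base expected reward V_k.
h, V = [np.exp(beta * r)], [r.copy()]
for P in kernels:
    h.append(P @ h[-1])
    V.append(P @ V[-1])

v_small = soft_values(kernels, r, 1e-5)   # small-beta soft value, close to V
print(np.abs(np.exp(beta * v[-1]) - h[-1]).max(), np.abs(v_small[-1] - V[-1]).max())
```

By Jensen's inequality the soft value dominates the expected reward at every level, and the two merge as $\beta\to 0$.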

For a parameterized policy $\pi_{\theta}$ with reverse-path density $p_{\theta}(x_{K},\ldots,x_{0})$, sampled rewards can be turned into updates by the likelihood-ratio identity

$$\nabla_{\theta}\,\mathbb{E}_{(X_{K},\ldots,X_{0})\sim p_{\theta}}[r(X_{0})]=\mathbb{E}_{(X_{K},\ldots,X_{0})\sim p_{\theta}}\left[r(X_{0})\,\nabla_{\theta}\log p_{\theta}(X_{K},\ldots,X_{0})\right].$$

In implementations one often subtracts a baseline from the reward to reduce variance, but that is not part of the change-of-measure identity. Along a reverse diffusion chain,

$$\log p_{\theta}(x_{K},\ldots,x_{0})=\log p_{\theta,K}(x_{K})+\sum_{k}\log p_{\theta}(x_{k}\mid x_{k+1}),$$

with the initial-law term disappearing from the gradient when the noisy initialization is fixed. Thus the update decomposes across the guidance decisions made along the trajectory. For the KL-regularized objective ([8.15](https://arxiv.org/html/2607.01693#S8.E15)), the return is replaced by the regularized return inside the brackets, with any explicit derivative of the KL penalty added when the transition family is differentiable.
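The likelihood-ratio identity can be verified exactly in the simplest setting. The sketch below assumes a one-step softmax policy over a handful of outcomes as a stand-in for the reverse chain (logits and rewards are hypothetical); it compares the score-function gradient with central finite differences and checks that a constant baseline leaves the expectation unchanged.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def expected_reward(theta, r):
    return float(softmax(theta) @ r)

def score_function_gradient(theta, r, baseline=0.0):
    """Exact E[(r - baseline) * grad_theta log p_theta] for a one-step softmax policy."""
    p = softmax(theta)
    grad_log_p = np.eye(len(theta)) - p[None, :]   # row i is grad_theta log p_theta(i)
    return (p * (r - baseline)) @ grad_log_p

rng = np.random.default_rng(4)
theta = rng.normal(size=5)   # hypothetical policy logits
r = rng.normal(size=5)       # hypothetical terminal rewards

grad = score_function_gradient(theta, r)
grad_baseline = score_function_gradient(theta, r, baseline=float(softmax(theta) @ r))

# Central finite differences of the expected reward, coordinate by coordinate.
eps = 1e-6
fd = np.array([
    (expected_reward(theta + eps * e, r) - expected_reward(theta - eps * e, r)) / (2 * eps)
    for e in np.eye(5)
])
print(np.abs(grad - fd).max(), np.abs(grad - grad_baseline).max())
```

In a sampled estimator the baseline changes the variance but not the mean, which is why it is absent from the identity itself.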

This perspective is also how to read current inference\-time RL methods\. One can combine the base sampler with a reward\-aligned proposal, spend particles on resampling and trajectory correction, or estimate a drift/Doob correction more directly\. These are different approximations to the same ideal controlled path law\. The useful message for these notes is not that one particular recent algorithm has replaced the others; it is that reward tilting, KL\-regularized control, and Doob transforms provide the common language for comparing ways of changing the sampler at inference time\.

### 8.7. Guidance and reward tilting in the discrete case

Everything in this section carries over almost unchanged to the discrete diffusion models of Section [7](https://arxiv.org/html/2607.01693#S7), because the target and objective never used continuous structure. The reward-tilted target ([8.1](https://arxiv.org/html/2607.01693#S8.E1)) and the KL-regularized variational principle ([8.2](https://arxiv.org/html/2607.01693#S8.E2)) are defined on any space; the Feynman–Kac/twisted-SMC correction of Subsection [8.5](https://arxiv.org/html/2607.01693#S8.SS5) reweights particles by the same potentials; and inference-time RL ([8.15](https://arxiv.org/html/2607.01693#S8.E15)) differentiates the trajectory likelihood (for a discrete chain, a sum $\sum_{k}\log P_{k,\theta}^{\leftarrow}(x_{k+1}\to x_{k})$ of log transition probabilities) with respect to $\theta$ rather than the state, so it transfers verbatim.

The one step that changes form is guidance. The optimal tilt is again the Doob $h$-transform of Theorem [8.1](https://arxiv.org/html/2607.01693#S8.Thmtheorem1), now of a jump process: with the soft value $v_{t}(x)=\beta^{-1}\log\mathbb{E}[\exp(\beta r(X_{0}))\mid X_{t}=x]$, the guided reverse rates are reweighted by value differences,

$$R_{t}^{\leftarrow,\mathrm{guided}}(y\to x)=R_{t}^{\leftarrow}(y\to x)\,\exp\!\big(\beta(v_{t}(x)-v_{t}(y))\big),$$

the exact analogue of adding $\beta\nabla v_{t}$ to the reverse drift, with the gradient replaced by a finite difference of $v_{t}$ across admissible moves, the same score-to-ratio substitution as in ([7.3](https://arxiv.org/html/2607.01693#S7.E3)). As before, $v_{t}$ is intractable and is approximated by a learned predictor of the reward, a predictor-free interpolation, or a Taylor expansion around the predicted clean token [[39](https://arxiv.org/html/2607.01693#bib.bib39)].
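The rate reweighting is a single elementwise operation on a rate matrix. The sketch below assumes a small random reverse rate matrix and a made-up soft value vector; resetting the diagonal so that rows sum to zero is the usual generator convention and is an assumption here, since the displayed formula only specifies the off-diagonal rates.

```python
import numpy as np

def guided_rates(R, v, beta):
    """Reweight off-diagonal reverse rates R[y, x] by exp(beta * (v[x] - v[y])).
    The diagonal is reset so that every row sums to zero (generator convention)."""
    G = R * np.exp(beta * (v[None, :] - v[:, None]))
    np.fill_diagonal(G, 0.0)
    np.fill_diagonal(G, -G.sum(axis=1))
    return G

rng = np.random.default_rng(3)
n, beta = 4, 1.2
R = rng.uniform(0.1, 1.0, size=(n, n))   # hypothetical jump rates y -> x
np.fill_diagonal(R, 0.0)
np.fill_diagonal(R, -R.sum(axis=1))      # make R a valid rate matrix
v = rng.normal(size=n)                   # hypothetical soft values v_t

G = guided_rates(R, v, beta)
print(G)
```

Moves toward higher soft value are sped up and moves toward lower value are slowed down, by the factor $h_t(x)/h_t(y)$ with $h_t=e^{\beta v_t}$.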

## Appendix A. Itô Calculus and Girsanov Theorem

This appendix collects the stochastic-calculus facts used in the notes. The statements below are written in the smooth, non-explosive setting in which the formal calculations in the text are valid. For rigorous hypotheses and proofs, see the texts of E, Li and Vanden-Eijnden [[21](https://arxiv.org/html/2607.01693#bib.bib21)], Karatzas and Shreve [[30](https://arxiv.org/html/2607.01693#bib.bib30)], or Øksendal [[40](https://arxiv.org/html/2607.01693#bib.bib40)].

### Itô formula, generator, and adjoint

Let $X_{t}$ solve

$$\mathrm{d}X_{t}=b_{t}(X_{t})\,\mathrm{d}t+\sigma_{t}(X_{t})\,\mathrm{d}B_{t},$$

where $B_{t}$ is an $m$-dimensional Brownian motion and $\sigma_{t}(x)\in\mathbb{R}^{d\times m}$. Write $a_{t}(x)=\sigma_{t}(x)\sigma_{t}(x)^{\top}$. For a smooth test function $\varphi$, Itô’s formula gives

$$\mathrm{d}\varphi(X_{t})=\left\langle\nabla\varphi(X_{t}),b_{t}(X_{t})\right\rangle\,\mathrm{d}t+\frac{1}{2}\operatorname{Tr}\!\left(a_{t}(X_{t})\nabla^{2}\varphi(X_{t})\right)\,\mathrm{d}t+\left\langle\nabla\varphi(X_{t}),\sigma_{t}(X_{t})\,\mathrm{d}B_{t}\right\rangle.$$

The last term is a martingale increment and has mean zero under the usual integrability assumptions. Thus

$$\frac{\mathrm{d}}{\mathrm{d}t}\mathbb{E}[\varphi(X_{t})]=\mathbb{E}[(\mathcal{L}_{t}\varphi)(X_{t})],$$

where the infinitesimal generator is

$$\mathcal{L}_{t}\varphi=\left\langle b_{t},\nabla\varphi\right\rangle+\frac{1}{2}\operatorname{Tr}(a_{t}\nabla^{2}\varphi).$$

If $X_{t}$ has density $q_{t}$, then the Fokker–Planck equation is the adjoint equation

$$\partial_{t}q_{t}=\mathcal{L}_{t}^{\ast}q_{t},$$

where, in coordinates,

$$\mathcal{L}_{t}^{\ast}q=-\nabla\cdot(b_{t}q)+\frac{1}{2}\sum_{i,j=1}^{d}\partial_{i}\partial_{j}\!\left((a_{t})_{ij}\,q\right).$$

For overdamped Langevin, $b=-\nabla U$ and $a=2I$, so this reduces to

∂tqt=∇⋅\(qt∇U\)\+Δqt\.\\partial\_\{t\}q\_\{t\}=\\nabla\\cdot\(q\_\{t\}\\nabla U\)\+\\Delta q\_\{t\}\.
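As a numerical sanity check on the overdamped Langevin Fokker–Planck equation, the sketch below verifies in one dimension that the Gibbs density $q \propto e^{-U}$ makes the right-hand side $\nabla\cdot(q\nabla U) + \Delta q$ vanish. The quartic potential, the evaluation points, and the finite-difference step are illustrative assumptions, not choices made in the notes.

```python
import math

def U(x):
    # Hypothetical test potential (an assumption of this sketch).
    return 0.25 * x**4 + 0.5 * x**2

def grad_U(x):
    return x**3 + x

def q(x):
    # Unnormalized Gibbs density exp(-U); the normalizing constant
    # does not affect whether the right-hand side vanishes.
    return math.exp(-U(x))

def flux(x, h):
    # q * U' + q', with q' approximated by a central difference.
    dq = (q(x + h) - q(x - h)) / (2.0 * h)
    return q(x) * grad_U(x) + dq

def fokker_planck_rhs(x, h=1e-3):
    # In 1D, div(q grad U) + Laplacian(q) = d/dx (q U' + q').
    return (flux(x + h, h) - flux(x - h, h)) / (2.0 * h)

residuals = [abs(fokker_planck_rhs(x)) for x in (-1.5, -0.5, 0.0, 0.7, 2.0)]
print(max(residuals))  # small, up to finite-difference error
```

The residual is zero only up to discretization error of order $h^2$, so the check is a consistency test rather than a proof of stationarity.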

### Girsanov’s theorem

Girsanov's theorem compares two diffusions that share the same initial law and the same diffusion coefficient but may carry different drifts. We fix the coefficient $\sqrt{2}\,I$, the Langevin convention used in the notes. Let $\mathbb{P}$ and $\mathbb{Q}$ be the laws on path space $C([0,T];\mathbb{R}^d)$ of

$$
\mathrm{d}X_t = b^{\mathbb{P}}_t(X_t)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}B_t \qquad\text{and}\qquad \mathrm{d}X_t = b^{\mathbb{Q}}_t(X_t)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}B_t.
$$

Under the usual absolute-continuity and integrability assumptions, the two laws are equivalent on $\mathcal{F}_T$, the terminal $\sigma$-algebra of the Brownian filtration, with Radon–Nikodym density

$$
\frac{\mathrm{d}\mathbb{P}}{\mathrm{d}\mathbb{Q}}\bigg|_{\mathcal{F}_T} = \exp\!\left(\frac{1}{2}\int_0^T \left\langle b^{\mathbb{P}}_t - b^{\mathbb{Q}}_t,\ \mathrm{d}X_t - b^{\mathbb{Q}}_t\,\mathrm{d}t \right\rangle - \frac{1}{4}\int_0^T \left\lVert b^{\mathbb{P}}_t - b^{\mathbb{Q}}_t \right\rVert^2 \mathrm{d}t\right), \tag{A.1}
$$

where all integrands are evaluated at $X_t$.

To obtain the KL divergence, take logarithms in ([A.1](https://arxiv.org/html/2607.01693#A1.E1)) and average under $\mathbb{P}$:

$$
\operatorname{D_{\mathsf{KL}}}(\mathbb{P}\,\|\,\mathbb{Q}) = \mathbb{E}_{\mathbb{P}}\!\left[\log\frac{\mathrm{d}\mathbb{P}}{\mathrm{d}\mathbb{Q}}\right] = \frac{1}{2}\mathbb{E}_{\mathbb{P}}\int_0^T \left\langle b^{\mathbb{P}}_t - b^{\mathbb{Q}}_t,\ \mathrm{d}X_t - b^{\mathbb{Q}}_t\,\mathrm{d}t \right\rangle - \frac{1}{4}\mathbb{E}_{\mathbb{P}}\int_0^T \left\lVert b^{\mathbb{P}}_t - b^{\mathbb{Q}}_t \right\rVert^2 \mathrm{d}t.
$$

The second term is already a plain expectation of a path integral, so it is left as is. For the first term, recall that under $\mathbb{P}$ we have $\mathrm{d}X_t - b^{\mathbb{Q}}_t\,\mathrm{d}t = (b^{\mathbb{P}}_t - b^{\mathbb{Q}}_t)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}B_t$, and

$$
\frac{1}{2}\mathbb{E}_{\mathbb{P}}\int_0^T \left\langle b^{\mathbb{P}}_t - b^{\mathbb{Q}}_t,\ \mathrm{d}X_t - b^{\mathbb{Q}}_t\,\mathrm{d}t \right\rangle = \frac{1}{2}\mathbb{E}_{\mathbb{P}}\int_0^T \left\lVert b^{\mathbb{P}}_t - b^{\mathbb{Q}}_t \right\rVert^2 \mathrm{d}t + \frac{1}{\sqrt{2}}\,\mathbb{E}_{\mathbb{P}}\int_0^T \left\langle b^{\mathbb{P}}_t - b^{\mathbb{Q}}_t,\ \mathrm{d}B_t \right\rangle.
$$

The expectation of the final integral vanishes, and we arrive at

$$
\operatorname{D_{\mathsf{KL}}}(\mathbb{P}\,\|\,\mathbb{Q}) = \frac{1}{4}\mathbb{E}_{\mathbb{P}}\int_0^T \left\lVert b^{\mathbb{P}}_t(X_t) - b^{\mathbb{Q}}_t(X_t) \right\rVert^2 \mathrm{d}t. \tag{A.2}
$$

So far the two processes share the same initial law. If instead they start from different initial laws, the density ([A.1](https://arxiv.org/html/2607.01693#A1.E1)) acquires the extra factor $\mathrm{d}\operatorname{Law}_{\mathbb{P}}(X_0)/\mathrm{d}\operatorname{Law}_{\mathbb{Q}}(X_0)$ at $t=0$, and the KL gains the corresponding initial term:

$$
\operatorname{D_{\mathsf{KL}}}(\mathbb{P}\,\|\,\mathbb{Q}) = \operatorname{D_{\mathsf{KL}}}\bigl(\operatorname{Law}_{\mathbb{P}}(X_0)\,\big\|\,\operatorname{Law}_{\mathbb{Q}}(X_0)\bigr) + \frac{1}{4}\mathbb{E}_{\mathbb{P}}\int_0^T \left\lVert b^{\mathbb{P}}_t(X_t) - b^{\mathbb{Q}}_t(X_t) \right\rVert^2 \mathrm{d}t.
$$
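The Girsanov KL formula can be checked by simulation in the simplest case. The sketch below assumes one dimension, constant drifts, and a common start at $X_0 = 0$; these choices, the function names, and the sample size are illustrative assumptions. With constant drifts the stochastic integral in the log-density collapses to a function of $X_T$, so $X_T$ can be sampled exactly under $\mathbb{P}$ and the Monte Carlo average of $\log(\mathrm{d}\mathbb{P}/\mathrm{d}\mathbb{Q})$ compared with $\tfrac14 (b^{\mathbb{P}} - b^{\mathbb{Q}})^2 T$.

```python
import math
import random

def path_kl_constant_drifts(b_p, b_q, T):
    # Girsanov KL with constant drifts: the integrand is (b_p - b_q)^2 for all t.
    return 0.25 * (b_p - b_q) ** 2 * T

def log_density_ratio(x0, xT, b_p, b_q, T):
    # Log of the Radon-Nikodym density for constant drifts:
    # (1/2) * delta * (X_T - X_0 - b_q T) - (1/4) * delta^2 * T.
    delta = b_p - b_q
    return 0.5 * delta * (xT - x0 - b_q * T) - 0.25 * delta**2 * T

def monte_carlo_kl(b_p, b_q, T, n_samples=50_000, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        # Exact draw of X_T under P from X_0 = 0: X_T = b_p T + sqrt(2 T) Z.
        xT = b_p * T + math.sqrt(2.0 * T) * rng.gauss(0.0, 1.0)
        total += log_density_ratio(0.0, xT, b_p, b_q, T)
    return total / n_samples

exact = path_kl_constant_drifts(1.0, 0.0, T=1.0)
estimate = monte_carlo_kl(1.0, 0.0, T=1.0)
print(exact, estimate)
```

In this special case the endpoint laws are $\mathcal{N}(b\,T, 2T)$, and their Gaussian KL is also $(b^{\mathbb{P}} - b^{\mathbb{Q}})^2 T / 4$, so passing from the path to its endpoint loses nothing here. For state-dependent drifts the endpoint KL is in general strictly smaller than the path KL.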
###### Lemma A.1 (Data processing).

Let $\mu$ and $\nu$ be probability laws on a measurable space, and let $T$ be a measurable map into another measurable space. Then

$$
\operatorname{D_{\mathsf{KL}}}(T_{\#}\mu\,\|\,T_{\#}\nu) \leq \operatorname{D_{\mathsf{KL}}}(\mu\,\|\,\nu) \qquad\text{and}\qquad \operatorname{D_{\mathsf{TV}}}(T_{\#}\mu, T_{\#}\nu) \leq \operatorname{D_{\mathsf{TV}}}(\mu, \nu).
$$

###### Proof\.

If $\mu$ is not absolutely continuous with respect to $\nu$, the KL bound is trivial because its right-hand side is $+\infty$. Otherwise let $Z = \mathrm{d}\mu/\mathrm{d}\nu$. Under $T_{\#}\nu$, the Radon–Nikodym derivative of $T_{\#}\mu$ with respect to $T_{\#}\nu$ is

$$
\mathbb{E}_{\nu}[Z \mid T].
$$

Since $z \mapsto z\log z$ is convex, Jensen's inequality for conditional expectations gives

$$
\operatorname{D_{\mathsf{KL}}}(T_{\#}\mu\,\|\,T_{\#}\nu) = \mathbb{E}_{\nu}\!\left[\mathbb{E}_{\nu}[Z \mid T]\log\mathbb{E}_{\nu}[Z \mid T]\right] \leq \mathbb{E}_{\nu}[Z\log Z] = \operatorname{D_{\mathsf{KL}}}(\mu\,\|\,\nu).
$$

For total variation, take the supremum over events $B$ in the target space:

$$
\left\lvert T_{\#}\mu(B) - T_{\#}\nu(B) \right\rvert = \left\lvert \mu(T^{-1}B) - \nu(T^{-1}B) \right\rvert \leq \operatorname{D_{\mathsf{TV}}}(\mu, \nu).
$$

∎

Thus, for any measurable map $F$ on path space,

$$
\operatorname{D_{\mathsf{KL}}}(F_{\#}\mathbb{P}^{u}\,\|\,F_{\#}\mathbb{P}^{0}) \leq \operatorname{D_{\mathsf{KL}}}(\mathbb{P}^{u}\,\|\,\mathbb{P}^{0}).
$$

Taking $F(\omega) = \omega_T$ gives the endpoint KL bound used in the one-step ULA calculation.
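A finite-state illustration of data processing: the sketch below pushes two laws on $\{0,1,2,3\}$ forward under a parity map and confirms that both KL and total variation can only decrease. The particular laws, the map, and the helper names are assumptions made for the example.

```python
import math

def kl(mu, nu):
    # KL divergence between discrete laws given as dicts state -> probability.
    total = 0.0
    for state, p in mu.items():
        if p > 0.0:
            if nu.get(state, 0.0) == 0.0:
                return math.inf  # mu is not absolutely continuous w.r.t. nu
            total += p * math.log(p / nu[state])
    return total

def tv(mu, nu):
    # Total variation as the supremum over events, i.e. half the L1 distance.
    states = set(mu) | set(nu)
    return 0.5 * sum(abs(mu.get(s, 0.0) - nu.get(s, 0.0)) for s in states)

def pushforward(law, T):
    # Law of T(X) when X is distributed according to `law`.
    out = {}
    for state, p in law.items():
        image = T(state)
        out[image] = out.get(image, 0.0) + p
    return out

def parity(state):
    return state % 2

mu = {0: 0.5, 1: 0.2, 2: 0.2, 3: 0.1}
nu = {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}

kl_before, kl_after = kl(mu, nu), kl(pushforward(mu, parity), pushforward(nu, parity))
tv_before, tv_after = tv(mu, nu), tv(pushforward(mu, parity), pushforward(nu, parity))
print(kl_before, kl_after)
print(tv_before, tv_after)
```

The parity map merges states on which the density ratio differs, so both inequalities are strict in this example.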

## Appendix B Gaussian Toolbox

This short toolbox collects facts used throughout the notes\. These are good warm\-up exercises for students who have not recently worked with Gaussian densities\.

###### Lemma B.1 (Affine Gaussian maps and sums).

Let $X \sim \mathcal{N}(m, \Sigma)$ in $\mathbb{R}^d$. For a matrix $A$ and vector $b$,

$$
AX + b \sim \mathcal{N}\!\left(Am + b,\ A\Sigma A^{\top}\right).
$$

If $X$ and $Y$ are independent Gaussians, then $X + Y$ is Gaussian and

$$
\operatorname{Cov}(X + Y) = \operatorname{Cov}(X) + \operatorname{Cov}(Y).
$$

In particular, in one dimension, if $X$ and $Y$ are independent and centered, then

$$
\operatorname{Var}(aX + bY) = a^2\operatorname{Var}(X) + b^2\operatorname{Var}(Y).
$$

###### Proof\.

The affine statement follows from the Gaussian characteristic function:

$$
\mathbb{E}\,e^{i\left\langle t, AX + b \right\rangle} = \exp\!\left(i\left\langle t, Am + b \right\rangle - \frac{1}{2}t^{\top}A\Sigma A^{\top}t\right).
$$

For independent Gaussians, characteristic functions multiply, so the means and covariances add. The one-dimensional variance formula is the corresponding special case. ∎
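The one-dimensional variance formula is easy to test by simulation. The following sketch draws independent centered Gaussians and compares the empirical variance of $aX + bY$ with $a^2\operatorname{Var}(X) + b^2\operatorname{Var}(Y)$; the coefficients, variances, seed, and sample size are arbitrary assumptions.

```python
import random

def sample_linear_combination(a, b, var_x, var_y, n_samples=100_000, seed=0):
    # Draw a*X + b*Y with X ~ N(0, var_x) and Y ~ N(0, var_y) independent.
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        x = rng.gauss(0.0, var_x ** 0.5)
        y = rng.gauss(0.0, var_y ** 0.5)
        samples.append(a * x + b * y)
    return samples

def empirical_variance(samples):
    mean = sum(samples) / len(samples)
    return sum((s - mean) ** 2 for s in samples) / len(samples)

a, b, var_x, var_y = 2.0, -1.0, 4.0, 9.0
predicted = a**2 * var_x + b**2 * var_y
empirical = empirical_variance(sample_linear_combination(a, b, var_x, var_y))
print(predicted, empirical)
```

The two numbers agree only up to Monte Carlo error, which shrinks like the inverse square root of the sample size.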

###### Lemma B.2 (KL between Gaussians with equal covariance).

For $m, \widehat{m} \in \mathbb{R}^d$ and positive definite $\Sigma$,

$$
\operatorname{D_{\mathsf{KL}}}\!\left(\mathcal{N}(m, \Sigma)\,\middle\|\,\mathcal{N}(\widehat{m}, \Sigma)\right) = \frac{1}{2}\left\lVert m - \widehat{m} \right\rVert_{\Sigma^{-1}}^2, \qquad \left\lVert v \right\rVert_{\Sigma^{-1}}^2 = v^{\top}\Sigma^{-1}v.
$$

In particular, if $\Sigma = \eta I$, the KL is $\left\lVert m - \widehat{m} \right\rVert^2/(2\eta)$.

###### Proof\.

The log density ratio is

$$
-\frac{1}{2}\left\lVert x - m \right\rVert_{\Sigma^{-1}}^2 + \frac{1}{2}\left\lVert x - \widehat{m} \right\rVert_{\Sigma^{-1}}^2.
$$

Taking expectation under $x \sim \mathcal{N}(m, \Sigma)$ cancels the covariance terms and leaves $\frac{1}{2}\left\lVert m - \widehat{m} \right\rVert_{\Sigma^{-1}}^2$. ∎
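The closed form can be compared with direct numerical integration in one dimension, where $\Sigma = \eta$ is a scalar. The sketch below uses a midpoint rule on a window of twelve standard deviations around $m$; the window, grid size, and parameter values are assumptions chosen for illustration.

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def kl_equal_variance_numeric(m, m_hat, eta, n_cells=20_000, width=12.0):
    # Midpoint rule for the integral of p(x) * log(p(x)/q(x)),
    # with p = N(m, eta) and q = N(m_hat, eta).
    lo = m - width * math.sqrt(eta)
    step = 2.0 * width * math.sqrt(eta) / n_cells
    total = 0.0
    for k in range(n_cells):
        x = lo + (k + 0.5) * step
        # Log density ratio written out directly to avoid log(0) in the tails.
        log_ratio = (-(x - m) ** 2 + (x - m_hat) ** 2) / (2.0 * eta)
        total += gaussian_pdf(x, m, eta) * log_ratio * step
    return total

def kl_equal_variance_closed_form(m, m_hat, eta):
    return (m - m_hat) ** 2 / (2.0 * eta)

numeric = kl_equal_variance_numeric(1.0, -0.5, 0.5)
closed = kl_equal_variance_closed_form(1.0, -0.5, 0.5)
print(numeric, closed)
```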

###### Lemma B.3 (Gaussian convolution score).

Let $X_t = X_0 + \sqrt{t}\,Z$, where $Z \sim \mathcal{N}(0, I)$ is independent of $X_0$. If $p_t$ is the density of $X_t$, then

$$
\nabla\log p_t(x) = \frac{1}{t}\left(\mathbb{E}[X_0 \mid X_t = x] - x\right).
$$

###### Proof\.

This is the special case $a_t = 1$, $\sigma_t^2 = t$ of the continuous-time Tweedie identity ([3.4](https://arxiv.org/html/2607.01693#S3.E4)), obtained there by differentiating the Gaussian mixture for $p_t$ and dividing by $p_t(x)$. ∎
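A concrete instance of the Gaussian convolution score identity: take $X_0$ uniform on $\{-1, +1\}$ in one dimension (an assumption made for this example). Then $p_t$ is an equal mixture of $\mathcal{N}(-1, t)$ and $\mathcal{N}(+1, t)$, the posterior mean is $\mathbb{E}[X_0 \mid X_t = x] = \tanh(x/t)$, and the identity predicts the score $(\tanh(x/t) - x)/t$. The sketch compares this with a finite-difference derivative of $\log p_t$.

```python
import math

def log_density(x, t):
    # log p_t(x) for the equal mixture of N(-1, t) and N(+1, t),
    # computed with a log-sum-exp for numerical stability.
    exponents = [-0.5 * (x - c) ** 2 / t for c in (-1.0, 1.0)]
    top = max(exponents)
    log_sum = top + math.log(sum(math.exp(v - top) for v in exponents))
    return log_sum - math.log(2.0) - 0.5 * math.log(2.0 * math.pi * t)

def posterior_mean(x, t):
    # E[X_0 | X_t = x] for the two-point prior on {-1, +1}.
    return math.tanh(x / t)

def score_from_posterior_mean(x, t):
    # Right-hand side of the identity: (E[X_0 | X_t = x] - x) / t.
    return (posterior_mean(x, t) - x) / t

def score_finite_difference(x, t, h=1e-5):
    # Central difference approximation of d/dx log p_t(x).
    return (log_density(x + h, t) - log_density(x - h, t)) / (2.0 * h)

max_gap = max(
    abs(score_from_posterior_mean(x, t) - score_finite_difference(x, t))
    for x in (-2.0, -0.3, 0.0, 0.4, 1.5)
    for t in (0.5, 2.0)
)
print(max_gap)
```

For small $t$ the posterior mean snaps to the nearer of the two atoms, so the score points toward that atom with strength of order $1/t$.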

## References

- \[1\] M. S. Albergo, N. M. Boffi, and E. Vanden-Eijnden. Stochastic interpolants: a unifying framework for flows and diffusions, 2023. [arXiv:2303.08797](https://arxiv.org/abs/2303.08797).
- \[2\] J. M. Altschuler and S. Chewi. Faster high-accuracy log-concave sampling via algorithmic warm starts. In IEEE Symposium on Foundations of Computer Science (FOCS), pages 2169–2176, 2023. [arXiv:2302.10249](https://arxiv.org/abs/2302.10249).
- \[3\] L. Ambrosio, N. Gigli, and G. Savaré. Gradient Flows in Metric Spaces and in the Space of Probability Measures. Lectures in Mathematics ETH Zürich. Birkhäuser, second edition, 2008.
- \[4\] N. Anari, C. Baronio, CJ Chen, A. Haqi, F. Koehler, A. Li, and T.-D. Vuong. Parallel sampling via autospeculation, 2025. [arXiv:2511.07869](https://arxiv.org/abs/2511.07869).
- \[5\] N. Anari, R. Gao, and A. Rubinstein. Parallel sampling via counting, 2024. [arXiv:2408.09442](https://arxiv.org/abs/2408.09442).
- \[6\] D. Bakry, I. Gentil, and M. Ledoux. Analysis and Geometry of Markov Diffusion Operators. Grundlehren der mathematischen Wissenschaften 348. Springer, 2014.
- \[7\] R. Bauerschmidt, T. Bodineau, and B. Dagallier. Stochastic dynamics and the Polchinski equation: an introduction. Probability Surveys, 21:200–290, 2024. [arXiv:2307.07619](https://arxiv.org/abs/2307.07619).
- \[8\] C. H. L. Beentjes and R. E. Baker. Uniformisation techniques for stochastic simulation of chemical reaction networks. The Journal of Chemical Physics, 150:154107, 2019. [arXiv:1811.00948](https://arxiv.org/abs/1811.00948).
- \[9\] J. Benton, V. De Bortoli, A. Doucet, and G. Deligiannidis. Nearly $d$-linear convergence bounds for diffusion models via stochastic localization. In International Conference on Learning Representations (ICLR), 2024. [arXiv:2308.03686](https://arxiv.org/abs/2308.03686).
- \[10\] A. Campbell, J. Benton, V. De Bortoli, T. Rainforth, G. Deligiannidis, and A. Doucet. A continuous time framework for discrete denoising models. In Advances in Neural Information Processing Systems 35, 2022. [arXiv:2205.14987](https://arxiv.org/abs/2205.14987).
- \[11\] F. Chen, S. Chewi, C. Daskalakis, and A. Rakhlin. High-accuracy sampling for diffusion models and log-concave distributions, 2026. [arXiv:2602.01338](https://arxiv.org/abs/2602.01338).
- \[12\] H. Chen, H. Lee, and J. Lu. Improved analysis of score-based generative modeling: user-friendly bounds under minimal smoothness assumptions, 2023. [arXiv:2211.01916](https://arxiv.org/abs/2211.01916).
- \[13\] H. Chen and L. Ying. Convergence analysis of discrete diffusion model: exact implementation through uniformization, 2024. [arXiv:2402.08095](https://arxiv.org/abs/2402.08095).
- \[14\] Y. Chen. An almost constant lower bound of the isoperimetric coefficient in the KLS conjecture. Geometric and Functional Analysis, 31:34–61, 2021. [arXiv:2011.13661](https://arxiv.org/abs/2011.13661).
- \[15\] Y. Chen. Computational and statistical aspects of diffusion models. Lecture notes, course 401-4634-24L, ETH Zürich, Spring 2026, 2026. [https://metaphor.ethz.ch/x/2026/fs/401-4634-24L/](https://metaphor.ethz.ch/x/2026/fs/401-4634-24L/).
- \[16\] Y. Chen and R. Eldan. Localization schemes: a framework for proving mixing bounds for Markov chains, 2022. [arXiv:2203.04163](https://arxiv.org/abs/2203.04163).
- \[17\] Y. Chen and K. Gatmiry. A simple proof of the mixing of Metropolis-adjusted Langevin algorithm under smoothness and isoperimetry, 2023. [arXiv:2304.04095](https://arxiv.org/abs/2304.04095).
- \[18\] S. Chewi. Log-concave sampling. Book draft, 2026. [https://chewisinho.github.io/](https://chewisinho.github.io/).
- \[19\] G. Conforti, A. Durmus, and M. Gentiloni Silveri. KL convergence guarantees for score diffusion models under minimal data assumptions, 2024. [arXiv:2308.12240](https://arxiv.org/abs/2308.12240).
- \[20\] P. Dhariwal and A. Nichol. Diffusion models beat GANs on image synthesis. In Advances in Neural Information Processing Systems 34, 2021. [arXiv:2105.05233](https://arxiv.org/abs/2105.05233).
- \[21\] W. E, T. Li, and E. Vanden-Eijnden. Applied Stochastic Analysis. Graduate Studies in Mathematics 199. American Mathematical Society, 2019.
- \[22\] R. Eldan. Thin shell implies spectral gap up to polylog via a stochastic localization scheme. Geometric and Functional Analysis, 23:532–569, 2013. [arXiv:1203.0893](https://arxiv.org/abs/1203.0893).
- \[23\] Z. Geng, M. Deng, X. Bai, J. Z. Kolter, and K. He. Mean flows for one-step generative modeling, 2025. [arXiv:2505.13447](https://arxiv.org/abs/2505.13447).
- \[24\] D. T. Gillespie. Approximate accelerated stochastic simulation of chemically reacting systems. The Journal of Chemical Physics, 115:1716–1733, 2001.
- \[25\] W. Grassmann. Transient solutions in Markovian queues. European Journal of Operational Research, 1(6):396–402, 1977.
- \[26\] J. Ho, A. N. Jain, and P. Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems 33, pages 6840–6851, 2020. [arXiv:2006.11239](https://arxiv.org/abs/2006.11239).
- \[27\] J. Ho and T. Salimans. Classifier-free diffusion guidance. NeurIPS 2021 Workshop on Deep Generative Models, 2022. [arXiv:2207.12598](https://arxiv.org/abs/2207.12598).
- \[28\] E. Hoogeboom, A. A. Gritsenko, J. Bastings, B. Poole, R. van den Berg, and T. Salimans. Autoregressive diffusion models. In International Conference on Learning Representations (ICLR), 2022. [arXiv:2110.02037](https://arxiv.org/abs/2110.02037).
- \[29\] L. P. Kadanoff. Scaling laws for Ising models near $T_c$. Physics Physique Fizika, 2:263–272, 1966.
- \[30\] I. Karatzas and S. E. Shreve. Brownian Motion and Stochastic Calculus. Graduate Texts in Mathematics 113. Springer, second edition, 1991.
- \[31\] H. Lavenant and G. Zanella. Error bounds and optimal schedules for masked diffusions with factorized approximations, 2025. [arXiv:2510.25544](https://arxiv.org/abs/2510.25544).
- \[32\] Y. T. Lee and S. S. Vempala. Eldan's stochastic localization and the KLS conjecture: isoperimetry, concentration and mixing, 2016. [arXiv:1612.01507](https://arxiv.org/abs/1612.01507).
- \[33\] Y. Liang, Y. Liang, L. Lai, and N. Shroff. Discrete diffusion models: novel analysis and new sampler guarantees, 2025. [arXiv:2509.16756](https://arxiv.org/abs/2509.16756).
- \[34\] Y. Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling. In International Conference on Learning Representations (ICLR), 2023. [arXiv:2210.02747](https://arxiv.org/abs/2210.02747).
- \[35\] X. Liu, C. Gong, and Q. Liu. Flow straight and fast: learning to generate and transfer data with rectified flow, 2022. [arXiv:2209.03003](https://arxiv.org/abs/2209.03003).
- \[36\] A. Lou, C. Meng, and S. Ermon. Discrete diffusion modeling by estimating the ratios of the data distribution. In International Conference on Machine Learning (ICML), PMLR 235, 2024. [arXiv:2310.16834](https://arxiv.org/abs/2310.16834).
- \[37\] S. P. Meyn and R. L. Tweedie. Markov Chains and Stochastic Stability. Cambridge University Press, second edition, 2009.
- \[38\] A. Montanari. Sampling, diffusions, and stochastic localization, 2023. [arXiv:2305.10690](https://arxiv.org/abs/2305.10690).
- \[39\] H. Nisonoff, J. Xiong, S. Allenspach, and J. Listgarten. Unlocking guidance for discrete state-space diffusion and flow models. In International Conference on Learning Representations (ICLR), 2025. [arXiv:2406.01572](https://arxiv.org/abs/2406.01572).
- \[40\] B. Øksendal. Stochastic Differential Equations: An Introduction with Applications. Springer, sixth edition, 2003.
- \[41\] J. Polchinski. Renormalization and effective lagrangians. Nuclear Physics B, 231:269–295, 1984.
- \[42\] J. Song, C. Meng, and S. Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations (ICLR), 2021. [arXiv:2010.02502](https://arxiv.org/abs/2010.02502).
- \[43\] Y. Song, P. Dhariwal, M. Chen, and I. Sutskever. Consistency models. In International Conference on Machine Learning (ICML), PMLR 202, 2023. [arXiv:2303.01469](https://arxiv.org/abs/2303.01469).
- \[44\] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations (ICLR), 2021. [arXiv:2011.13456](https://arxiv.org/abs/2011.13456).
- \[45\] B. Uria, I. Murray, and H. Larochelle. A deep and tractable density estimator. In International Conference on Machine Learning (ICML), 2014. [arXiv:1310.1757](https://arxiv.org/abs/1310.1757).
- \[46\] S. S. Vempala and A. Wibisono. Rapid convergence of the unadjusted Langevin algorithm: isoperimetry suffices. In Advances in Neural Information Processing Systems 32, 2019. [arXiv:1903.08568](https://arxiv.org/abs/1903.08568).
- \[47\] C. Villani. Optimal transport: old and new. Springer, 2009.
- \[48\] K. G. Wilson. Renormalization group and critical phenomena. I. Renormalization group and the Kadanoff scaling picture. Physical Review B, 4:3174–3183, 1971.
- \[49\] K. G. Wilson. Renormalization group and critical phenomena. II. Phase-space cell analysis of critical behavior. Physical Review B, 4:3184–3205, 1971.
- \[50\] K. G. Wilson and J. Kogut. The renormalization group and the $\epsilon$ expansion. Physics Reports, 12:75–199, 1974.
- \[51\] K. Wu, S. Schmidler, and Y. Chen. Minimax mixing time of the Metropolis-adjusted Langevin algorithm for log-concave sampling. Journal of Machine Learning Research, 23(270):1–63, 2022. [arXiv:2109.13055](https://arxiv.org/abs/2109.13055).
- \[52\] L. Wu, B. L. Trippe, C. A. Naesseth, D. M. Blei, and J. P. Cunningham. Practical and asymptotically exact conditional sampling in diffusion models. In Advances in Neural Information Processing Systems 36, 2023. [arXiv:2306.17775](https://arxiv.org/abs/2306.17775).