@docmilanfar: I really enjoyed the explainer for our recent paper on "Geometry of Noise" arXiv:2602.18428

X AI KOLs Timeline 06/17/26, 04:50 AM Papers

diffusion-models noise-conditioning theory geometry autonomous-models stability machine-learning

Summary

This paper provides a theoretical explanation for why diffusion models can generate clean samples without explicit noise-level conditioning, attributing it to high-dimensional geometry and analyzing why some model parameterizations succeed while others collapse.

I really enjoyed the explainer for our recent paper on "Geometry of Noise" arXiv:2602.18428 https://t.co/ghcwURvgv9

Original Article

Cached at: 06/17/26, 04:02 PM

Why diffusion models do not need noise conditioning

Source: https://intuitivepapers.ai/geometry-of-noise/ Diffusion · Theory

The Geometry of Noise: Why Diffusion Models Don’t Need Noise Conditioning

A diffusion model doesn’t need to be told how noisy its input is.

Standard diffusion models are handed the noise level at every step. Drop that input and the best ones keep working, because the geometry of high dimensions already encodes it. Which kind of model you drop it from decides whether generation survives.

Explaining the paper: The Geometry of Noise: Why Diffusion Models Don’t Need Noise Conditioning. Sahraee-Ardakan, Delbracio, Milanfar · Google · 2026 · arXiv:2602.18428

Every diffusion model is built around a noise-level input. For the right kind of model, it is redundant.

Every diffusion or flow model is given the noise level as an input. You show the model a noisy version of the data, and you also tell it how much noise is on that input: a single number, usually written $t$ or $\sigma$ (you can read $t$ as how far the noising process has run). The model uses that number to decide how hard to clean. The architecture injects the noise level into every layer, on the assumption that the model needs it. What the model should do at high noise (guess the rough shape of an image) is nothing like what it should do at low noise (sharpen the last few details). How could one network do both without being told how noisy its input is?

Recent work showed you can remove that input entirely. Sun and collaborators (Is noise conditioning necessary?) trained a single network that sees only the noisy image, never the noise level, and it still generates clean samples. Equilibrium Matching trains the field to keep the data as its fixed points, with no time index at all, so it is noise-blind by design. The literature calls this the autonomous (or noise-blind) setting: one static vector field $f(\mathbf{u})$, the same function of the input no matter what noise produced it. A single static field somehow has to serve both pure noise and nearly-clean data, and it has to stay stable right at the data, where the gradient diverges.

This paper, from Sahraee-Ardakan, Delbracio, and Milanfar at Google, is the theory of why blind models work. It answers two questions. How does a model know the noise level it was never told? And why do some blind models generate beautifully while others collapse into static? Geometry resolves the first; a stability race resolves the second. The argument runs in a few steps: what the conditioned model is, what happens when you remove the noise input, how high-dimensional geometry supplies the level, what landscape the blind model descends, and why one parameterization survives it while another shatters.

Conditioning on the noise level

The forward process is the easy half of any diffusion model: it takes a clean datapoint $\mathbf{x}$ and a noise level indexed by $t$, and mixes them into a noisy observation.

$$\mathbf{u}_t = a(t)\,\mathbf{x} + b(t)\,\boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I}) \tag{2}$$

The two schedule functions do the bookkeeping: $a(t)$ scales the signal and $b(t)$ scales the noise. Near $t \to 0$ you sit on clean data; crank $t$ up and the noise term swamps everything until the observation is featureless. Different model families pick different $a, b$. DDPM (the original denoising-diffusion model) shrinks the signal as it adds noise so the total variance stays pinned ($a^2+b^2=1$); EDM (the design from Karras et al.) keeps the signal at full scale while the noise term grows; flow matching (a straight line from data to noise) slides linearly with $a=1-t$, $b=t$. The ratio that summarizes how much signal survives is the signal-to-noise ratio (SNR):

$$\text{SNR}(t) = \frac{a^2(t)}{b^2(t)} \tag{3}$$
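To make the three schedules concrete, here is a minimal sketch that evaluates $a(t)$, $b(t)$ and the SNR of equation (3). The names `SCHEDULES` and `snr` are made up for this example, and the cosine form of the variance-preserving schedule is an assumption chosen only so that $a^2+b^2=1$; it is not DDPM's actual discrete schedule.

```python
import math

# Illustrative schedules returning (a(t), b(t)).
# Assumption: the cosine/sine pair is just one convenient way to get
# a^2 + b^2 = 1; it is not the discrete beta schedule DDPM really uses.
SCHEDULES = {
    "vp_ddpm_like": lambda t: (math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)),
    "edm_like": lambda t: (1.0, t),           # signal at full scale, noise grows
    "flow_matching": lambda t: (1.0 - t, t),  # straight line from data to noise
}

def snr(name, t):
    """Signal-to-noise ratio a(t)^2 / b(t)^2, as in equation (3)."""
    a, b = SCHEDULES[name](t)
    return (a * a) / (b * b)

for name in SCHEDULES:
    print(name, [round(snr(name, t), 3) for t in (0.1, 0.5, 0.9)])
```

In all three cases the SNR falls as $t$ grows, which is the only property the rest of the argument relies on.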

Training is a plain regression. You show the model a noisy $\mathbf{u}_t$ and ask it to predict a linear target $r = c(t)\mathbf{x} + d(t)\boldsymbol{\epsilon}$, scored by squared error:

$$\mathcal{L}(f) = \mathbb{E}_{\mathbf{x},\boldsymbol{\epsilon},t}\big[\,\lVert f(\mathbf{u}_t) - (c(t)\mathbf{x} + d(t)\boldsymbol{\epsilon}) \rVert^2\,\big] \tag{4}$$

Squared error has one minimizer: the average of the thing you are predicting, conditioned on everything you can see. Here that is the conditional mean of the target, given the observation and the noise level.

$$f^*_t(\mathbf{u}) = \mathbb{E}_{\mathbf{x},\boldsymbol{\epsilon}\mid\mathbf{u},t}\big[\,c(t)\mathbf{x} + d(t)\boldsymbol{\epsilon}\,\big] \tag{5}$$

This is a time-dependent vector field: a different arrow at every (position, noise level) pair. The four coefficients $(a,b,c,d)$ are the only difference between the famous models. DDPM predicts the noise $\boldsymbol{\epsilon}$ ($c=0$, $d=1$), EDM predicts the clean signal $\mathbf{x}$ ($c=1$, $d=0$), flow matching predicts a velocity, the straight-line direction from data toward noise ($c=-1$, $d=1$). The $t$ input is the switch that lets one network be a different function at each noise level.
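The $(c, d)$ table can be written down directly. The sketch below builds one training pair $(\mathbf{u}_t, r)$ for each target; `TARGETS` and `make_training_pair` are hypothetical names for this illustration, not anything from the paper's code.

```python
import numpy as np

# (c, d) target coefficients as listed in the text.
TARGETS = {
    "noise": (0.0, 1.0),      # DDPM: predict epsilon
    "signal": (1.0, 0.0),     # EDM: predict x
    "velocity": (-1.0, 1.0),  # flow matching: predict epsilon - x
}

def make_training_pair(x, eps, a, b, target):
    """Return (u_t, r): the noisy input and its regression target."""
    c, d = TARGETS[target]
    u = a * x + b * eps   # forward process, equation (2)
    r = c * x + d * eps   # linear target scored in equation (4)
    return u, r

rng = np.random.default_rng(0)
x = rng.normal(size=4)     # stand-in for a clean datapoint
eps = rng.normal(size=4)   # Gaussian noise
a, b = 0.7, 0.3            # flow-matching schedule at t = 0.3 (a = 1 - t, b = t)

u, r_velocity = make_training_pair(x, eps, a, b, "velocity")
print("u_t      :", u)
print("velocity :", r_velocity)
```

The same input $\mathbf{u}_t$ is paired with a different target under each parameterization; only the regression label changes.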

A switch seems necessary. At high noise the field should pull any point toward the blurry average of all the data, since that is the best guess when you can barely see. At low noise it should pin a point to the nearest sharp datapoint. Those are opposite fields, and $t$ is what tells the network which one to be. Drop $t$ and you seem to lose the switch. You do not.

Removing the noise input

With the noise level removed, the network sees only $\mathbf{u}$ and must output a single vector $f(\mathbf{u})$, the same function no matter what noise produced the input. The optimal such field is not mysterious. It is still the least-squares answer, but you can no longer condition on $t$ because you do not have it, so the best you can do is average over your uncertainty about it:

$$f^*(\mathbf{u}) = \mathbb{E}_{t\mid\mathbf{u}}\big[\,f^*_t(\mathbf{u})\,\big] \tag{6}$$

The weight in that average is the posterior over the noise level, $p(t\mid\mathbf{u})$: your belief about which $t$ produced this observation, by Bayes’ rule from the likelihood of seeing $\mathbf{u}$ at each level. So the optimal blind field is a posterior-weighted blend of all the conditioned fields. This is the law of iterated expectations: your best blind guess is the average of your informed guesses, each weighted by how likely its scenario is.

That blend is enough to generate. Below is the optimal blind field for a small five-point dataset, computed in closed form. The field never changes, and nothing tells it the noise level. Press play and a ring of pure-noise particles rides this one frozen field inward and settles exactly onto the data, where the field vanishes, then stops.

Figure 1 · one static field generates

A single time-invariant field $f^*(\mathbf{u})$, with no noise level fed in anywhere. Press play or scrub: noise particles follow this one unchanging field onto the data points and stop. The data are stable equilibria, where the field is zero. Autonomous generation: one frozen field carries noise to data.

How can an average over noise levels be the right field anywhere? At a given $\mathbf{u}$, if the posterior $p(t\mid\mathbf{u})$ is spread across many levels, then (6) blends a high-noise field (points to the center) with a low-noise field (points to a specific mode), and the average of two contradictory arrows should be meaningless. But the posterior is usually not spread at all.

Drag the probe below and watch $p(t\mid\mathbf{u})$ directly. Out in the void the curve is broad, placing most of its weight on large noise levels. Slide the probe onto a datapoint and the curve collapses to a spike at $t\to 0$, because an observation sitting on the data is indistinguishable from clean data.
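The posterior over the noise level is easy to compute for a small dataset. The following is a minimal sketch under stated assumptions: an EDM-style schedule ($a=1$, $b=\sigma$), five data points on the unit circle, and a uniform prior over a log-spaced grid of noise levels. The function names are made up for this example.

```python
import numpy as np

def log_likelihood(u, data, sigma):
    """log p(u | sigma) for an equal-weight mixture of N(x_k, sigma^2 I)."""
    dim = u.shape[0]
    sq = np.sum((u - data) ** 2, axis=1)        # squared distance to each x_k
    logits = -sq / (2.0 * sigma**2)
    m = logits.max()
    lse = m + np.log(np.exp(logits - m).sum())  # numerically stable log-sum-exp
    return lse - np.log(len(data)) - dim * np.log(sigma) - 0.5 * dim * np.log(2 * np.pi)

def noise_posterior(u, data, sigmas):
    """p(sigma | u) on a grid, assuming a uniform prior over the grid points."""
    logp = np.array([log_likelihood(u, data, s) for s in sigmas])
    w = np.exp(logp - logp.max())
    return w / w.sum()

angles = 2 * np.pi * np.arange(5) / 5
data = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # five points on the unit circle
sigmas = np.logspace(-2, 1, 200)                           # candidate noise levels

on_data = noise_posterior(data[0], data, sigmas)
far_away = noise_posterior(np.array([4.0, 4.0]), data, sigmas)
print("most likely sigma on a data point:", sigmas[on_data.argmax()])
print("most likely sigma far from data:  ", sigmas[far_away.argmax()])
```

On a data point the most likely level is the smallest one on the grid; far from the data it is a large one, so the probe's position alone carries information about the noise level.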

Figure 2 · the posterior over the noise level

The posterior $p(t\mid\mathbf{u})$ over the noise level, for a draggable probe. Far from the data it is broad and sits at large $t$; on a data point it collapses to a spike at $t \to 0$. The most likely level $\hat{t}$ is marked. The observation’s position encodes the noise level, even when the model is never told it.

When the posterior is a spike at some $\hat{t}$, the average in (6) reduces to the single conditioned field at $\hat{t}$, so the blind field equals the model that knew the noise level all along. To make that concrete it helps to rewrite (6) in terms of the denoiser $D^*_t(\mathbf{u}) = \mathbb{E}[\mathbf{x}\mid\mathbf{u},t]$, the best guess of the clean data at level $t$:

$$f^*(\mathbf{u}) = \mathbb{E}_{t\mid\mathbf{u}}\!\left[\,\frac{d(t)}{b(t)}\,\mathbf{u} + \Big(c(t) - \frac{d(t)a(t)}{b(t)}\Big) D^*_t(\mathbf{u})\,\right] \tag{7}$$

Every parameterization is some affine mix of “where you are” ($\mathbf{u}$) and “where the clean data probably is” ($D^*_t$). So the blind model is Bayes-optimal averaging, and it matches the informed model exactly when the posterior over the noise level is sharp. The real question is no longer whether the average is reasonable; it is when the average is sharp. That is a question about geometry.
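The claim that the blind field matches the informed one when the posterior is sharp can be checked on a toy. This sketch assumes signal prediction ($c=1$, $d=0$, so the field is the denoiser itself), an EDM-style schedule ($a=1$, $b=\sigma$), a uniform prior over a log-spaced grid, and a five-point dataset; all function names are hypothetical.

```python
import numpy as np

def denoiser(u, data, sigma):
    """Conditioned denoiser D*_sigma(u) = E[x | u, sigma] for a finite dataset."""
    logits = -np.sum((u - data) ** 2, axis=1) / (2.0 * sigma**2)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ data

def log_lik(u, data, sigma):
    """log p(u | sigma), dropping constants that do not depend on sigma."""
    logits = -np.sum((u - data) ** 2, axis=1) / (2.0 * sigma**2)
    m = logits.max()
    return m + np.log(np.exp(logits - m).sum()) - u.shape[0] * np.log(sigma)

def blind_denoiser(u, data, sigmas):
    """Equation (6) for signal prediction: posterior-weighted blend of D*_sigma."""
    logp = np.array([log_lik(u, data, s) for s in sigmas])
    post = np.exp(logp - logp.max())
    post /= post.sum()
    fields = np.stack([denoiser(u, data, s) for s in sigmas])
    return post @ fields, sigmas[post.argmax()]

angles = 2 * np.pi * np.arange(5) / 5
data = np.stack([np.cos(angles), np.sin(angles)], axis=1)
sigmas = np.logspace(-3, 1, 400)

u_near = data[0] + np.array([0.02, 0.0])   # probe very close to a datum
blind, sigma_map = blind_denoiser(u_near, data, sigmas)
informed = denoiser(u_near, data, sigma_map)
print("blind   :", blind)
print("informed:", informed, "at most likely sigma", sigma_map)
```

Near the data the two outputs nearly coincide. Moving the probe far from the data broadens the posterior, and the blend then departs from any single conditioned field.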

Reading the noise level off the geometry

In high dimensions, noise has a very predictable size, and a blind model relies on exactly that. Add Gaussian noise of level $\sigma$ to a point in $D$ dimensions and the noise vector’s length is almost exactly $\sigma\sqrt{D}$, with vanishing relative wiggle. (This is concentration of measure: a high-dimensional Gaussian puts essentially all its mass on a thin spherical shell, not near its center.)

So suppose the data does not fill the space but sits on a thin $d$-dimensional manifold (a curved lower-dimensional sheet) inside a large $D$-dimensional ambient space. The part of a noisy observation that sticks out off the manifold is pure noise in the remaining $D-d$ directions, so its length is about $\sigma\sqrt{D-d}$. That length is essentially the noise level. Measuring the distance $r$ from the observation to the manifold gives an estimate

$$\hat{\sigma} = \frac{r}{\sqrt{D-d}}, \qquad \text{spread shrinking like } \frac{1}{\sqrt{D-d}}.$$

The bigger the gap between the ambient dimension and the manifold’s intrinsic dimension, the sharper the estimate. Two different noise levels live on two shells of different radius; in low dimensions the shells are fat and overlap, so the level is ambiguous, and in high dimensions they are razor-thin and disjoint, so the level is read straight off the radius. Drag the dimension below and watch the two shells separate.
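The shrinking spread can be seen by sampling. This is a minimal sketch, assuming the manifold is a flat subspace spanned by the first $d$ coordinates so that the remaining $D-d$ coordinates of a noisy point are pure noise; `estimate_sigma` is a name invented for the example.

```python
import numpy as np

def estimate_sigma(sigma, D, d, n_samples, rng):
    """Estimate the noise level from the off-manifold residual length.

    Assumption for this sketch: the manifold is the flat subspace spanned by
    the first d coordinates, so the last D - d coordinates are pure noise.
    """
    noise = sigma * rng.normal(size=(n_samples, D))
    r = np.linalg.norm(noise[:, d:], axis=1)   # distance to the subspace
    return r / np.sqrt(D - d)

rng = np.random.default_rng(0)
sigma_true, d = 0.5, 2
for D in (8, 32, 128, 512):
    est = estimate_sigma(sigma_true, D, d, 2000, rng)
    print(f"D={D:4d}  mean={est.mean():.3f}  relative spread={est.std() / sigma_true:.3f}")
```

The mean stays close to the true level while the relative spread falls as $D-d$ grows, which is the separation of the two shells described above.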

Figure 3 · the blessing of dimensionality

The estimated noise level $\hat{\sigma} = r/\sqrt{D-d}$ for a low and a high true level, as the ambient dimension $D$ grows. The shaded overlap is the ambiguity. At $D = 8$ the two levels already separate; by $D = 128$ they are disjoint, so the geometry pins the noise level and the posterior $p(t\mid\mathbf{u})$ collapses. High dimensions make the blind average sharp.

The 1D plot above superimposes the two bells so you can watch their overlap shrink as $D$ grows. The shells themselves are a 3D-shaped picture: two concentric spheres of points around the single data point, fatter for high noise, thinner for low. Figure 4 draws the drawable case, $D=3$ with a single data point at the origin ($d=0$), and lets you rotate in any direction to see both spheres at once. The real model lives in hundreds of dimensions where the same spheres are paper-thin and far apart, exactly the regime the 1D bells reach as you crank $D$ up.

Figure 4 · the two shells, in the drawable case

A rotatable $D=3$ picture of a single data point with two shells of noisy observations around it. The teal inner shell sits at radius $\sigma_{\text{lo}}\sqrt{D-d}\approx 0.69$, the amber outer shell at $\sigma_{\text{hi}}\sqrt{D-d}\approx 1.73$. Drag in any direction to rotate; even at $D=3$ the two shells are visibly distinct spheres. In real high dimensions the shells thin out and pull apart at exactly the rate the bells in Figure 3 separate. For a higher-dimensional data manifold the same shells become tubes ($d=1$, a line) or thicker sheets ($d=2$); the math is identical, only the rendering changes.

Two caveats. The geometry only makes the noise level recoverable from the observation; whether a trained network actually exploits the distance $r$ to recover it is an empirical fact, and the answer (from Sun et al.) is that it does. And there is a second, stronger reason the posterior is sharp, one that needs no high dimension at all. As an observation approaches the data manifold it becomes indistinguishable from clean data, so the smallest noise levels dominate and $p(t\mid\mathbf{u})$ concentrates at $t\to 0$ by sheer proximity. That near-manifold collapse keeps generation stable at the end, regardless of dimension.

So the noise level is not lost when you stop feeding it in. It is encoded in the observation’s geometry, and the model recovers it from there. Wherever the posterior is sharp (which is almost everywhere) the blind field equals the informed field. But “equals a good field” is a statement about the target. Whether you can safely follow that field all the way down to the data is a separate question, and it has a sharp geometric obstruction.

An infinitely deep well

The paper’s reframing is that a blind model is not chasing a moving target at all. It is descending a single fixed landscape: the marginal energy. Define it as the negative log-likelihood of a noisy datapoint at some unknown noise level,

$$E_{\text{marg}}(\mathbf{u}) = -\log p(\mathbf{u}), \qquad p(\mathbf{u}) = \int p(\mathbf{u}\mid t)\,p(t)\,dt \tag{1}$$

where the integral averages the noisy-data density over a prior on the noise level. Generation is then rolling downhill on this one static potential toward where noisy data is most plausible, which is the clean data itself. A static field can generate because it is the gradient field of a static potential, so it needs no noise-level input, only a slope to follow downhill.

The slope is the posterior-averaged score. To write it in terms of the denoiser we use Tweedie’s formula, the empirical-Bayes identity (Robbins 1956, Efron 2011) that the conditional score points from your noisy point toward your best guess of the clean one:

$$\nabla_{\mathbf{u}} \log p(\mathbf{u}\mid t) = \frac{a(t)\,D^*_t(\mathbf{u}) - \mathbf{u}}{b(t)^2} \tag{10}$$
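Equation (10) can be checked numerically for a finite dataset, where both sides have closed forms. The sketch below compares a finite-difference score with the Tweedie expression in one dimension; the dataset, the values of $a$ and $b$, and the function names are arbitrary choices for illustration.

```python
import numpy as np

def log_p(u, data, a, b):
    """log p(u | t) up to a constant, for an equal-weight mixture of N(a x_k, b^2)."""
    logits = -(u - a * data) ** 2 / (2.0 * b**2)
    m = logits.max()
    return m + np.log(np.exp(logits - m).sum())

def denoiser(u, data, a, b):
    """D*_t(u) = E[x | u, t]: softmax-weighted average of the data points."""
    logits = -(u - a * data) ** 2 / (2.0 * b**2)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return float(w @ data)

data = np.array([-1.0, 0.5, 2.0])
a, b, u, h = 0.8, 0.6, 0.3, 1e-5

score_fd = (log_p(u + h, data, a, b) - log_p(u - h, data, a, b)) / (2 * h)
score_tweedie = (a * denoiser(u, data, a, b) - u) / b**2
print("finite-difference score:", score_fd)
print("Tweedie score          :", score_tweedie)
```

The two numbers agree to within finite-difference error, and the factor $a(t)$ multiplying the denoiser is needed for that agreement whenever $a \neq 1$.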

Averaging that over the posterior on $t$ gives the gradient of the marginal energy:

$$\nabla_{\mathbf{u}} E_{\text{marg}}(\mathbf{u}) = \mathbb{E}_{t\mid\mathbf{u}}\!\left[\,\frac{\mathbf{u} - a(t)\,D^*_t(\mathbf{u})}{b(t)^2}\,\right] \tag{11}$$

That gradient is where the problem sits. It carries a $1/b(t)^2$, and $b(t)\to 0$ as you near the data, so the marginal energy has an infinitely deep, infinitely steep well at every datapoint:

$$\lim_{\mathbf{u}\to\mathbf{x}_k} \lVert \nabla_{\mathbf{u}} E_{\text{marg}}(\mathbf{u}) \rVert = \infty \tag{12}$$

A neural network outputs finite vectors. A finite field cannot equal an infinite gradient. So either a blind model cannot represent the landscape near the data, exactly where generation finishes and matters most, or something is rescuing it. The 3D view makes this concrete: lift the data plane and the marginal energy becomes a surface with pits at every data point. Toggle between the energy itself, the gradient magnitude, and the preconditioned field, and rotate to see how each surface treats the data.

Figure 5 · the landscape, in three views

The same three data points laid out on a 2D plane, with the field drawn as a rotatable landscape. The energy view shows pits diving to negative infinity at every point (the floor is clipped). The gradient view shows tall spikes diverging upward at every point: the singularity the paper formalizes in equation (12). The preconditioned view shows the smooth bounded bowl the gain leaves behind, with the data sitting at the floor as a stable equilibrium. Drag to rotate in any direction. The qualitative shape is the paper’s claim; the constants are chosen for legibility.

The same math in one dimension makes the cancellation exact. Three data points sit on a line, and the three quantities involved (energy, raw gradient, and the field the model actually follows) are plotted as three curves. The raw gradient runs off the top of the chart while the bold teal product, the field $f^*$, stays finite and crosses zero at every datum.

Figure 6 · the singularity, and the gain that cancels it

A one-dimensional toy with three data points symmetric around zero. The faint amber curve is the marginal energy, plunging into a deep well at each datum. The red curve is the raw gradient $\|\nabla E\|$, which diverges at the data (the spikes are clipped; tagged →∞). The bold teal curve is the magnitude $\|f^*\|$ of the field the model follows. It stays bounded everywhere and vanishes at every data point, marking the three stable equilibria (teal dots at the baseline). The curve also dips to zero at the midpoints between adjacent data points; those are unstable equilibria where the conditional mean of the data equals $u$ by symmetry, harmless for the dynamics (the model would slide off either way). Drag the probe and watch the gain vanish at the same rate the gradient blows up.

The singularity cancels. The blind field decomposes into three pieces: a scaled copy of the energy gradient (the natural-gradient term, a gradient rescaled by a local metric), a transport correction, and a linear drift.

$$f^*(\mathbf{u}) = \underbrace{\overline{\lambda}(\mathbf{u})\,\nabla E_{\text{marg}}(\mathbf{u})}_{\text{natural gradient}} + \underbrace{\mathbb{E}_{t\mid\mathbf{u}}\big[(\lambda(t)-\overline{\lambda})(\nabla E_t - \nabla E_{\text{marg}})\big]}_{\text{transport correction}} + \underbrace{\overline{c}_{\text{scale}}(\mathbf{u})\,\mathbf{u}}_{\text{drift}} \tag{14}$$

Take the three terms in order. The first is the energy gradient scaled by an effective gain $\overline{\lambda}(\mathbf{u}) = \mathbb{E}_{t\mid\mathbf{u}}[\lambda(t)]$, where $\lambda(t) = \tfrac{b}{a}(da - cb)$. The second, the transport correction, measures how far the per-noise-level gradients sit from their own average; it is nonzero only when the posterior is spread across several levels, and it dies once the posterior concentrates on one. The third is a linear drift the noise schedule contributes. The gain cancels the singularity. Near the data $b\to 0$, so the $1/b^2$ in the gradient sends it to infinity, while $\lambda$ carries factors of $b$ that vanish at the matching rate, leaving the product $\overline{\lambda}\,\nabla E_{\text{marg}}$ finite. It is like descending an infinitely steep slope while shortening your stride toward zero at the same rate, so every step stays finite. The paper calls this a Riemannian gradient flow: the gain acts as a local metric, a position-dependent rescaling of distance, that turns the singular landscape into a smooth, finite-speed descent with a stable resting point at the data. Because the transport correction dies wherever the posterior concentrates (high dimension, or near the manifold), what is left is a clean preconditioned gradient flow.
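The form of the gain can be recovered by combining (7) and (10) at a single noise level. The algebra below is a sketch worked out from those two equations rather than quoted from the paper; writing $\nabla E_t = -\nabla_{\mathbf{u}}\log p(\mathbf{u}\mid t)$ and reading the coefficient $c/a$ as the per-level drift coefficient are assumptions of this sketch.

$$\begin{aligned}
\nabla E_t(\mathbf{u}) &= \frac{\mathbf{u} - a\,D^*_t(\mathbf{u})}{b^2}
\quad\Longrightarrow\quad
D^*_t(\mathbf{u}) = \frac{\mathbf{u} - b^2\,\nabla E_t(\mathbf{u})}{a}, \\
f^*_t(\mathbf{u}) &= \frac{d}{b}\,\mathbf{u} + \Big(c - \frac{d\,a}{b}\Big)\,\frac{\mathbf{u} - b^2\,\nabla E_t(\mathbf{u})}{a} \\
&= \frac{c}{a}\,\mathbf{u} + \frac{b}{a}\,(d\,a - c\,b)\,\nabla E_t(\mathbf{u})
= \frac{c}{a}\,\mathbf{u} + \lambda(t)\,\nabla E_t(\mathbf{u}).
\end{aligned}$$

Plugging in the three targets gives $\lambda = b$ for noise prediction ($c=0$, $d=1$), $\lambda = -b^2/a$ for signal prediction ($c=1$, $d=0$), and $\lambda = b/a$ for flow-matching velocity ($c=-1$, $d=1$, $a+b=1$). Each carries at least one factor of $b$, which is what offsets the $1/b^2$ in the gradient.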

The same decomposition holds at every point on the plane. The figure samples a grid of locations and, at each, draws the three terms of equation (14) head to tail: a teal natural-gradient arrow, then an amber transport-correction arrow, then a small drift arrow, with a dashed line showing the total. The data points sit inside amber rings; the chains shrink to zero as you approach them. Slide the posterior sharpness and watch the amber transport correction die away as the posterior concentrates; what is left is the teal natural-gradient flow plus the drift.

Figure 7 · the field, decomposed

Equation (14) on a 2D plane. At every grid point the three components of $f^*(\mathbf{u})$ are drawn head to tail: natural gradient (bounded, vanishes at data), transport correction (perpendicular, vanishes with posterior concentration and near data), and a small drift (linear in $\mathbf{u}$). The dashed line is their sum, the field $f^*$ the model actually follows. Toggle to isolate one term or watch the sum directly. Slide the posterior sharpness from broad to concentrated and watch the transport correction die away. The qualitative shape of each term is what the paper proves; the toy lets you see them combine.

So the infinite well never appears in the field the model learns. That field is bounded, with the data as a stable attractor, which is what you saw the particles settle into in Figure 1. Whether a real numerical sampler can follow that bounded field down to the data is a separate question.

Which parameterization survives

Having a bounded field is not the same as being able to follow it; that is the question of the sampling dynamics. You integrate an ODE driven by the field, and a bounded field divided by a vanishing noise scale can still make that ODE stiff (one that forces a numerical solver into vanishingly small steps) and explosive. The sampler integrates

$$\frac{d\mathbf{u}}{dt} = \mu(t)\,\mathbf{u} + \nu(t)\,f^*(\mathbf{u}) \tag{19}$$

where $\mu(t)$ is the schedule’s drift and $\nu(t)$ is the effective gain of the parameterization. One subtlety the paper is careful about: even though the network $f^*$ is never told the noise level, the sampler still uses $t$, because $\mu$ and $\nu$ depend on it. The architecture is autonomous; the integration schedule around it is not.

To judge stability, compare the blind sampler against an oracle that knows the true $t$ at every step. Subtracting the two cancels the shared drift term and leaves the error introduced purely by being blind:

$$\Delta\mathbf{v}(\mathbf{u},t) = \underbrace{|\nu(t)|}_{\text{gain}} \cdot \underbrace{\lVert f^*(\mathbf{u}) - f^*_t(\mathbf{u}) \rVert}_{\text{estimation error}} \tag{22}$$

Stability is a race as $t\to 0$. The estimation error falls (the posterior over the noise level concentrates), but the gain $\nu(t)$ may diverge. Which of the two wins decides the outcome, and the three standard targets land in three different places.

Noise prediction (DDPM, and DDIM, its deterministic-sampler variant) has gain $\nu\sim 1/b$, which diverges. Worse, its estimation error does not fall to zero. It floors at a positive “Jensen gap”: because $1/b$ is convex, the posterior average of $1/b$ exceeds $1/(\text{average } b)$, and that gap stays nonzero unless the posterior is a perfect spike. A diverging gain times a floored error goes to infinity. Blind noise prediction blows up.

Signal prediction (EDM) has an even worse gain, $\nu\sim 1/b^2$. But near a discrete data manifold the denoising error decays exponentially, like $e^{-C/b^2}$. Near a data point the denoiser is a softmax over the data points weighted by a Gaussian in distance, so once you sit close to one point every other point is suppressed by a Gaussian factor in its squared distance, and the error to the nearest point dies that fast. Exponential decay beats any polynomial blow-up, so the product goes to zero: a stronger gain singularity than noise prediction, yet still stable, because the error crashes faster than the gain climbs.

Velocity prediction (flow matching) has gain $\nu = 1$, flat. There is nothing to amplify, so the error stays bounded and the dynamics are inherently stable. Toggle the target below and watch the drift error $\Delta\mathbf{v}$ as the noise vanishes: only noise prediction runs to infinity.
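The three outcomes can be reproduced with a few lines of arithmetic. This is a toy sketch, not the paper's experiment: the signal-prediction error comes from an actual softmax denoiser on a two-point 1D dataset, while the noise and velocity error curves are hypothetical stand-ins (a constant floor and an error proportional to $b$) chosen only to match the scaling the text describes.

```python
import numpy as np

def nearest_point_error(u, data, b):
    """|D*_b(u) - nearest datum| for a softmax denoiser over a finite 1D dataset."""
    logits = -(u - data) ** 2 / (2.0 * b**2)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    nearest = data[np.argmin(np.abs(u - data))]
    return abs(float(w @ data) - nearest)

def jensen_gap(bs, weights):
    """E[1/b] - 1/E[b] under a posterior over noise scales; zero only for a spike."""
    return float(weights @ (1.0 / bs) - 1.0 / (weights @ bs))

print("Jensen gap, spread posterior:",
      jensen_gap(np.array([0.05, 0.1, 0.2]), np.array([0.25, 0.5, 0.25])))

data = np.array([0.0, 1.0])
u = 0.1                                 # probe close to the datum at 0
bs = np.array([0.4, 0.2, 0.1, 0.05])    # noise scale shrinking toward the data

signal_error = np.array([nearest_point_error(u, data, b) for b in bs])
noise_error = np.full_like(bs, 0.05)    # hypothetical positive floor
velocity_error = bs.copy()              # hypothetical error that shrinks with b

drift_error = {
    "noise": noise_error / bs,          # gain ~ 1/b
    "signal": signal_error / bs**2,     # gain ~ 1/b^2
    "velocity": velocity_error * 1.0,   # gain = 1
}
for name, vals in drift_error.items():
    print(f"{name:8s}", vals)
```

Reading each row left to right (noise scale shrinking), the noise product grows, the signal product collapses toward zero despite its steeper gain, and the velocity product never exceeds its starting value.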

Figure 8 · the stability race

The drift error Δv = gain × estimation error as t → 0, for each target. Faint curves show the gain rising and the error falling; the bold curve is their product. Noise prediction diverges (unstable). Signal and velocity stay bounded. The same blindness, opposite outcomes, set by the single coefficient ν(t).

This makes the title’s claim precise. You can drop the noise-level input, but only if your parameterization has a bounded gain (velocity) or a self-correcting one (signal). Velocity-based models are inherently safe; noise-prediction models are structurally broken when blind. The difference is one coefficient, $\nu(t)$: bounded for velocity, divergent for noise.

What collapses, what survives

On the CIFAR-10 numbers Sun et al. report, run blind, a noise-prediction DDIM collapses to FID 40.90, while blind flow matching and uEDM (both velocity) reach 2.61 and 2.23. FID is a distance between the real and generated image distributions where lower is better, and on CIFAR-10 anything around 2 to 3 is near the best published, while 40 means the samples are visibly broken. That roughly eighteen-fold gap is not a tuning artifact. It is the $1/b$ gain singularity turning ordinary estimation noise into garbage. The paper’s own runs on CIFAR-10, SVHN, and Fashion-MNIST show the same pattern by eye: blind noise-prediction images are dominated by high-frequency artifacts, while blind velocity images are sharp and match the conditioned models.

The dimensionality experiment makes the geometry visible. Take a 2D ring of data, embed it in a $D$-dimensional space, and sweep $D$. At $D=2$ both blind models struggle: the shells overlap, the posterior is ambiguous, and the samples are diffuse. At $D=8$ and $32$ velocity is already clean while noise prediction is scattered. At $D=128$ the concentration is so sharp that even the unstable noise-prediction model converges, because its estimation error finally crashes faster than its gain diverges. The blessing of dimensionality, start to finish.

So do diffusion models need noise conditioning? Not always: a velocity- or signal-based model can drop the explicit noise input, because the geometry supplies the level and the dynamics stay bounded, while a noise-prediction model effectively still needs it, because run blind it is structurally unstable. The clean argument leans on data living on a low-dimensional manifold inside a high-dimensional space, and the exponential-stability case for signal prediction assumes a discrete manifold, so the result is sharpest exactly where real image data tends to live.

Wherever the geometry fixes the noise level, the blind model matches the informed one, and whether the sampler reaches the data comes down to the gain coefficient. For the stable parameterizations, then, the noise-level input that every diffusion model carries was information the observation already held. The model was never told the level; it recovers it from the geometry of its own input.

Provenance: verified against primary literature

Tweedie / Robbins / Efron: The denoiser determines the conditional score (Robbins 1956, Efron 2011, Vincent 2011); the signal-scaling a(t) factor is essential.

Sun et al. (2025): The empirical result that blind models generate; the CIFAR-10 FID figures are their benchmark, which this paper explains.

EDM / Flow Matching / EqM: The unified affine schedule and the (a,b,c,d) targets for each model (Karras 2022; Lipman 2022; Wang & Du).

Concentration of measure: σ̂ = r/√(D−d) is an asymptotically unbiased noise-level estimate; its spread shrinks like 1/√(D−d) (Vershynin; Amari for natural gradient).

Correction: The title overstates the result. Noise conditioning is dispensable only for stable (velocity- or signal-based) parameterizations; a noise-prediction model trained blind is structurally unstable. The clean argument also assumes data on a low-dimensional manifold in a high-dimensional space, and the exponential-stability case for signal prediction assumes a discrete manifold.

Questions you might still have

If the network never sees the noise level, how does it know how much to denoise? In high dimensions the noise level is written into the geometry of the observation: the distance from the point to the data manifold fixes it (σ̂ = r/√(D−d)), so the posterior over the noise level collapses to a near-certain estimate. The model recovers the noise level from the geometry instead of being told it.

Then why does a blind DDPM fail while blind flow matching works? Stability is a race as the noise vanishes. Noise prediction has a gain that grows like 1/b and amplifies the residual uncertainty (a nonzero “Jensen gap”) into a blow-up; velocity prediction has a gain of exactly 1, so nothing amplifies the error. Same blindness, opposite fate.

Does this mean noise conditioning is useless? No. The claim is narrower than that: a velocity- or signal-based model can drop the explicit noise input because the geometry supplies it and the dynamics stay bounded. A noise-prediction model cannot. And the clean theory leans on data sitting on a low-dimensional manifold inside a high-dimensional space.

What is the marginal energy, exactly? It is −log p(u), where p(u) averages the noisy-data densities over every noise level. It is the single static landscape a blind model descends, and it has an infinitely deep well at every data point.

