Decoupled Latent Optimization of Diffusion Models for Full Waveform Inversion

arXiv cs.LG 06/15/26, 04:00 AM Papers
Summary
Introduces Decoupled Latent Optimization (DLO) for full waveform inversion, which relaxes latent optimization into a quadratic-penalty objective, outperforming classical and diffusion-based methods on benchmarks while preserving smoothed-velocity initialization.
arXiv:2606.14139v1 Announce Type: new
# Decoupled Latent Optimization of Diffusion Models for Full Waveform Inversion
Source: [https://arxiv.org/html/2606.14139](https://arxiv.org/html/2606.14139)
Zheng Ma \(corresponding author; e\-mail: zhengma@sjtu\.edu\.cn\)

School of Mathematical Sciences, Shanghai Jiao Tong University, Shanghai, China; CMA\-Shanghai, Shanghai Jiao Tong University, Shanghai, China

###### Abstract

Full waveform inversion \(FWI\) recovers subsurface velocity from seismic recordings by solving a severely ill\-posed, nonconvex PDE\-constrained optimization\. Classical regularizers stabilize the inversion but fail to reproduce realistic geological structures; recent diffusion\-prior methods improve realism at the cost of a fragile trade\-off between data fidelity and prior consistency\. We propose Decoupled Latent Optimization \(DLO\), which relaxes the standard latent\-optimization formulation into a quadratic\-penalty objective over an auxiliary physical variable and a latent variable\. The data\-fidelity gradient acts in physical space, the diffusion sampler contributes only through a decoded prior sample, and the standard smoothed\-velocity initialization of classical FWI is preserved\. On the OpenFWI benchmark, DLO outperforms classical regularizers and existing diffusion\-based methods under clean, noisy, and missing\-trace acquisitions\. The prior, trained on $70\times 70$ OpenFWI models, transfers directly to the Marmousi and Overthrust benchmarks, where DLO recovers intricate fault structures and remains robust to initialization smoothing and measurement noise\.

Keywords: Full Waveform Inversion; Diffusion Model; PDE\-governed Inverse Problem; Seismic Imaging

## 1 Introduction

Inverse problems governed by partial differential equations \(PDEs\) play a central role across science and engineering, encompassing medical imaging, fluid dynamics, and geophysical exploration \[[14](https://arxiv.org/html/2606.14139#bib.bib27),[47](https://arxiv.org/html/2606.14139#bib.bib31)\]\. A prototypical and particularly demanding instance is seismic inversion, which is widely used to reconstruct quantitative subsurface models from wavefield recordings collected at the surface and supports applications ranging from resource exploration to seismic\-hazard assessment and environmental monitoring \[[65](https://arxiv.org/html/2606.14139#bib.bib39),[39](https://arxiv.org/html/2606.14139#bib.bib33)\]\. Over the past decades, a number of methods have been developed to address this problem, including velocity analysis on stacked traces \[[6](https://arxiv.org/html/2606.14139#bib.bib42)\], traveltime tomography and migration\-based methods \[[64](https://arxiv.org/html/2606.14139#bib.bib43),[13](https://arxiv.org/html/2606.14139#bib.bib44)\], and Born\-approximation linearized inversion \[[23](https://arxiv.org/html/2606.14139#bib.bib45),[35](https://arxiv.org/html/2606.14139#bib.bib46)\]\. Among these methods, full\-waveform inversion \(FWI\) \[[46](https://arxiv.org/html/2606.14139#bib.bib47),[57](https://arxiv.org/html/2606.14139#bib.bib48),[24](https://arxiv.org/html/2606.14139#bib.bib49),[51](https://arxiv.org/html/2606.14139#bib.bib18)\] is set apart by its ability to recover high\-resolution subsurface structure, which it achieves by casting the inversion as a PDE\-constrained optimization that exploits the full phase and amplitude content of the wavefield to align the simulated and observed seismograms under the wave equation \[[51](https://arxiv.org/html/2606.14139#bib.bib18),[19](https://arxiv.org/html/2606.14139#bib.bib50)\]\.

The optimization problem underlying FWI is widely regarded as a challenging nonconvex problem, owing to the nonlinearity of its forward modeling and the ill\-posedness of the inverse problem itself\. The associated misfit landscape is strongly nonconvex and exhibits numerous local minima arising from phase ambiguity and limited subsurface illumination, leaving the inversion highly sensitive to inaccuracies in the initial velocity model, measurement noise, and incomplete acquisitions \[[51](https://arxiv.org/html/2606.14139#bib.bib18),[38](https://arxiv.org/html/2606.14139#bib.bib58),[62](https://arxiv.org/html/2606.14139#bib.bib59),[66](https://arxiv.org/html/2606.14139#bib.bib23),[63](https://arxiv.org/html/2606.14139#bib.bib34),[25](https://arxiv.org/html/2606.14139#bib.bib36)\]\. A particularly characteristic difficulty is cycle skipping: whenever the predicted and observed traces are misaligned by more than half a period, the gradient steers the simulated wavefield toward a match that is off by a whole cycle, so the optimization becomes trapped in an incorrect basin of attraction and converges to a spurious local minimum \[[20](https://arxiv.org/html/2606.14139#bib.bib57),[37](https://arxiv.org/html/2606.14139#bib.bib6),[34](https://arxiv.org/html/2606.14139#bib.bib56)\]\.

To mitigate these difficulties and improve robustness against imperfect observations, a variety of remedies have been explored\. Regularization\-based approaches retain the standard quadratic misfit and add an explicit penalty that constrains the velocity model, with Tikhonov regularization enforcing smoothness on the recovered field \[[4](https://arxiv.org/html/2606.14139#bib.bib5)\] and total\-variation regularization promoting piecewise\-constant structures that preserve sharp interfaces \[[18](https://arxiv.org/html/2606.14139#bib.bib2),[29](https://arxiv.org/html/2606.14139#bib.bib3),[1](https://arxiv.org/html/2606.14139#bib.bib4)\]\. By penalizing non\-physical features in the recovered velocity, these regularizers stabilize the optimization and improve convergence accuracy, and have accordingly become standard baselines in FWI\. A complementary line of work replaces the pointwise quadratic residual by misfit functionals that are less sensitive to small time and phase shifts, including multi\-scale frequency continuation \[[9](https://arxiv.org/html/2606.14139#bib.bib51)\], phase\-based and envelope\-based misfits \[[8](https://arxiv.org/html/2606.14139#bib.bib52),[10](https://arxiv.org/html/2606.14139#bib.bib53)\], adaptive waveform inversion \[[56](https://arxiv.org/html/2606.14139#bib.bib54)\], and geometry\-aware metrics based on optimal transport \[[61](https://arxiv.org/html/2606.14139#bib.bib55),[34](https://arxiv.org/html/2606.14139#bib.bib56)\]\.

Alongside these classical strategies, a substantial body of work has explored machine\-learning approaches to FWI\. Direct supervised mappings learn the inverse operator from synthetic pairs of seismograms and velocity models \[[58](https://arxiv.org/html/2606.14139#bib.bib7),[68](https://arxiv.org/html/2606.14139#bib.bib9),[67](https://arxiv.org/html/2606.14139#bib.bib8),[28](https://arxiv.org/html/2606.14139#bib.bib40)\]\. These networks achieve much higher computational efficiency than iterative optimization, and the training data provide implicit regularization for the ill\-posed inverse problem; together these properties yield strong empirical accuracy on FWI benchmarks\. Physics\-aware variants embed the wave equation directly into the learning process\. Physics\-informed neural networks penalize PDE residuals on collocation points \[[26](https://arxiv.org/html/2606.14139#bib.bib24),[32](https://arxiv.org/html/2606.14139#bib.bib26),[31](https://arxiv.org/html/2606.14139#bib.bib25)\], and related schemes reparameterize the velocity field as a neural network whose weights are updated by the FWI gradient \[[60](https://arxiv.org/html/2606.14139#bib.bib60)\]\. A more recent and particularly fruitful paradigm in PDE solving is operator learning, in which neural networks approximate function\-to\-function mappings between infinite\-dimensional spaces \[[30](https://arxiv.org/html/2606.14139#bib.bib10),[69](https://arxiv.org/html/2606.14139#bib.bib11)\]\. Analogous operator architectures have also been investigated for FWI \[[22](https://arxiv.org/html/2606.14139#bib.bib41)\]\. Despite these advances, generalization remains the central limitation of supervised approaches\. Their accuracy degrades on velocity structures unseen during training and on data contaminated by measurement noise, and the network typically must be retrained whenever the acquisition geometry changes\.

The rapid development of diffusion models in recent years has raised generative modeling to a new level and opened a new paradigm for solving inverse problems\. By learning the score field $\nabla_{\mathbf{x}}\log p_{t}(\mathbf{x})$ of a family of Gaussian\-corrupted marginals $\{p_{t}\}_{t\in[0,T]}$, diffusion models generate samples by solving the associated reverse\-time stochastic differential equation that transports a standard Gaussian back to the data distribution \[[21](https://arxiv.org/html/2606.14139#bib.bib12),[45](https://arxiv.org/html/2606.14139#bib.bib14),[43](https://arxiv.org/html/2606.14139#bib.bib13)\]\. From an inverse\-problem perspective, a pretrained diffusion model embeds a data\-driven prior into the inversion that provides stronger regularization than handcrafted alternatives and yields improved robustness to measurement noise and out\-of\-distribution observations\. A wide range of diffusion\-based algorithms has been developed for image\-domain inverse problems \[[27](https://arxiv.org/html/2606.14139#bib.bib28),[33](https://arxiv.org/html/2606.14139#bib.bib17),[16](https://arxiv.org/html/2606.14139#bib.bib30),[44](https://arxiv.org/html/2606.14139#bib.bib29),[12](https://arxiv.org/html/2606.14139#bib.bib62),[11](https://arxiv.org/html/2606.14139#bib.bib38)\], whereas their application to physical, PDE\-governed inverse problems remains under\-explored \[[42](https://arxiv.org/html/2606.14139#bib.bib16),[40](https://arxiv.org/html/2606.14139#bib.bib15),[54](https://arxiv.org/html/2606.14139#bib.bib35)\]\.
For FWI in particular, one line of work inserts FWI physical\-constraint steps into the diffusion sampling trajectory, including DiffusionFWI and its DiffusionILVR extension \[[52](https://arxiv.org/html/2606.14139#bib.bib1),[49](https://arxiv.org/html/2606.14139#bib.bib37)\] as well as DPS variants applied to the wave\-equation operator \[[48](https://arxiv.org/html/2606.14139#bib.bib64),[36](https://arxiv.org/html/2606.14139#bib.bib65)\]\. However, the imposed physical\-constraint steps are generally inconsistent with the Gaussian\-noised marginals visited by the sampling flow, and hyperparameter selection that balances data\-prior effectiveness against physical\-constraint enforcement becomes a central difficulty\. A second line of work embeds diffusion\-model\-based regularization terms directly into the FWI optimization objective \[[41](https://arxiv.org/html/2606.14139#bib.bib61),[59](https://arxiv.org/html/2606.14139#bib.bib63)\]\. These works collectively demonstrate that diffusion priors can deliver meaningful structural information for FWI; the form in which the prior signal is extracted from the pretrained model, and the way it is injected into the inversion, remain open design choices that motivate the present work\.

In this paper, we develop a general framework, *Decoupled Latent Optimization* \(DLO\), that embeds a pretrained diffusion model into PDE\-governed inverse problems as a regularization mechanism\. DLO introduces an auxiliary optimizable latent variable whose decoded sample approximates the prior projection of the physical variable, providing a diffusion\-prior regularization signal for the physical inversion\. The framework inherits the strong generalization to out\-of\-distribution structures characteristic of diffusion\-prior approaches\.

We validate DLO on large\-scale FWI benchmarks and compare it against classical regularizers and existing diffusion\-based methods\. On the OpenFWI benchmark \[[15](https://arxiv.org/html/2606.14139#bib.bib19)\], DLO consistently outperforms these baselines in reconstruction accuracy and remains robust under measurement noise and incomplete acquisitions\. We further apply the framework to the Marmousi \[[50](https://arxiv.org/html/2606.14139#bib.bib20)\] and Overthrust \[[2](https://arxiv.org/html/2606.14139#bib.bib21)\] benchmarks, which lie clearly outside the training distribution; the successful reconstructions on these models suggest that DLO scales to larger domains and adapts well to complex geological structures of the kind encountered in real field data\.

The remainder of the paper is organized as follows\. Section [2](https://arxiv.org/html/2606.14139#S2) formulates FWI, reviews diffusion generative models, and surveys existing diffusion\-based approaches for inverse problems\. Section [3](https://arxiv.org/html/2606.14139#S3) presents Decoupled Latent Optimization and analyzes its convergence properties\. Section [4](https://arxiv.org/html/2606.14139#S4) reports numerical experiments on the OpenFWI, Marmousi, and Overthrust benchmarks\. Section [5](https://arxiv.org/html/2606.14139#S5) concludes with a discussion of limitations and future directions\. Implementation details, hyperparameters, and additional experimental results are deferred to the appendices\.

## 2 Background and Preliminaries

### 2\.1 Full Waveform Inversion

Full waveform inversion \(FWI\) recovers the subsurface velocity field from seismic wavefield measurements recorded at the surface \[[51](https://arxiv.org/html/2606.14139#bib.bib18),[47](https://arxiv.org/html/2606.14139#bib.bib31)\]\. In the acoustic approximation, wave propagation in the subsurface domain $\Omega$ is governed by

$$
\left\{
\begin{aligned}
&\frac{1}{v(\mathbf{r})^{2}}\frac{\partial^{2}p(\mathbf{r},t)}{\partial t^{2}}=\nabla^{2}p(\mathbf{r},t)+s(\mathbf{r},t,\xi),\\
&p(\mathbf{r},0)=0,\\
&p_{t}(\mathbf{r},0)=0,
\end{aligned}
\right.
\tag{1}
$$

where $\mathbf{r}\in\Omega$ is the spatial coordinate, $t$ is time, $\nabla^{2}$ is the spatial Laplacian, $v(\mathbf{r})$ is the spatially varying subsurface velocity, and $p(\mathbf{r},t)$ is the pressure wavefield\. The source term $s(\mathbf{r},t,\xi)$ is typically a Ricker wavelet emitted from a point source at the surface, represented as follows:

$$
s(\mathbf{r},t,\xi)=s_{0}(t)\,\delta(\mathbf{r}-\xi),
\tag{2}
$$

with $s_{0}(t)=(1-2\pi^{2}f^{2}t^{2})\,e^{-\pi^{2}f^{2}t^{2}}$ being the amplitude of the Ricker wavelet with frequency $f$, $\delta$ being the Dirac delta function, and $\xi$ being the location of the source\. Following the OpenFWI framework, we use absorbing boundary layers to simulate wave propagation in an unbounded medium, suppressing artificial reflections from the domain boundaries\.
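The Ricker amplitude in Eq. (2) is simple enough to evaluate directly. The snippet below is a minimal NumPy sketch of $s_0(t)$; the function name `ricker` and the 15 Hz value are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def ricker(t, f):
    """Ricker amplitude s0(t) = (1 - 2 pi^2 f^2 t^2) exp(-pi^2 f^2 t^2) from Eq. (2)."""
    a = (np.pi * f * np.asarray(t, dtype=float)) ** 2
    return (1.0 - 2.0 * a) * np.exp(-a)


f = 15.0                                  # illustrative frequency in Hz (assumption)
t_zero = 1.0 / (np.sqrt(2.0) * np.pi * f)  # where 1 - 2 pi^2 f^2 t^2 vanishes
peak = ricker(0.0, f)                      # the wavelet peaks at t = 0 with value 1
```

The wavelet is even in $t$, equals 1 at $t=0$, and crosses zero at $t=\pm 1/(\sqrt{2}\,\pi f)$, which makes a quick sanity check for any implementation.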

In a typical surface acquisition the medium is probed by a set of sources $\{\xi_{i}\}_{i=1}^{n_{s}}$\. The wavefield is observed through a measurement operator $\mathcal{M}$ that measures the wavefield at the receiver points $\{\mathbf{r}_{j}\}_{j=1}^{n_{r}}$ placed along the surface at a shallow depth, $\mathcal{M}\,p(\mathbf{r},\cdot,t;v,\xi_{i})=\{p(\mathbf{r}_{j},t;v,\xi_{i})\}_{j=1}^{n_{r}}$\. For a source $\xi_{i}$, solving the wave equation \([1](https://arxiv.org/html/2606.14139#S2.E1)\) with velocity $v$ yields the wavefield $p(\mathbf{r},t;v,\xi_{i})$, whose measurement is the simulated record\. Collecting these records over all sources defines the forward operator

$$
\mathcal{F}(v)=\bigl\{\,p(\mathbf{r}_{j},t;v,\xi_{i})\,\bigr\}_{j=1,\dots,n_{r},\;i=1,\dots,n_{s}},\qquad t\in[0,T],
\tag{3}
$$

which maps a velocity model to the corresponding wavefield sampled at every source–receiver pair\. Generally, the observed data $\mathbf{d}_{\mathrm{obs}}$ are the wavefield generated by the true velocity model $v^{\ast}$, measured at the same receivers and corrupted by additive measurement noise,

$$
\mathbf{d}_{\mathrm{obs}}=\bigl\{\,p(\mathbf{r}_{j},t;v^{\ast},\xi_{i})+\eta_{i}(\mathbf{r}_{j},t)\,\bigr\}_{j=1,\dots,n_{r},\;i=1,\dots,n_{s}},\qquad\eta_{i}\sim\mathcal{N}(0,\sigma^{2}I),
\tag{4}
$$

where $\eta_{i}$ denotes the measurement noise at each receiver\. Fig\. [1](https://arxiv.org/html/2606.14139#S2.F1) illustrates this forward\-modeling and inversion setup\.
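To make the forward operator and the noise model of Eqs. (1)–(4) concrete, here is a deliberately simplified sketch of a single shot. It is not the paper's solver: it assumes one spatial dimension, a leapfrog finite-difference scheme, reflecting ends instead of the absorbing layers described above, and a delayed wavelet so the source starts near zero. All names and grid values (`forward_1d`, `dx`, `dt`, `nt`, the two-layer velocity) are hypothetical.

```python
import numpy as np


def ricker(t, f):
    """Ricker amplitude s0(t) from Eq. (2)."""
    a = (np.pi * f * t) ** 2
    return (1.0 - 2.0 * a) * np.exp(-a)


def forward_1d(v, src_idx, rec_idx, dx=10.0, dt=1e-3, nt=500, f=15.0):
    """One shot of (1/v^2) p_tt = p_xx + s with zero initial data (1D toy).

    Returns an array of shape (nt, len(rec_idx)) with p sampled at the receivers.
    """
    v = np.asarray(v, dtype=float)
    if v.max() * dt / dx > 1.0:
        raise ValueError("CFL condition violated: reduce dt or increase dx")
    t = np.arange(nt) * dt
    wavelet = ricker(t - 1.0 / f, f)       # delayed wavelet (assumption)
    p_prev = np.zeros_like(v)              # p at step k-1
    p = np.zeros_like(v)                   # p at step k; p = p_t = 0 initially
    record = np.zeros((nt, len(rec_idx)))
    for k in range(nt):
        record[k] = p[rec_idx]             # measurement operator: sample receivers
        lap = np.zeros_like(v)
        lap[1:-1] = (p[2:] - 2.0 * p[1:-1] + p[:-2]) / dx**2
        src = np.zeros_like(v)
        src[src_idx] = wavelet[k] / dx     # discrete point source (delta ~ 1/dx)
        p_next = 2.0 * p - p_prev + (v * dt) ** 2 * (lap + src)
        p_next[0] = p_next[-1] = 0.0       # reflecting ends, NOT absorbing layers
        p_prev, p = p, p_next
    return record


# Two-layer toy velocity model in m/s (hypothetical values).
v_true = np.full(101, 2000.0)
v_true[60:] = 3000.0
record = forward_1d(v_true, src_idx=5, rec_idx=[10, 20, 30])

# Eq. (4): add i.i.d. Gaussian noise to the clean record.
rng = np.random.default_rng(0)
sigma = 0.05 * np.abs(record).max()
d_obs = record + sigma * rng.standard_normal(record.shape)
```

Running the function once per source and stacking the records would give a toy analogue of $\mathcal{F}(v)$ in Eq. (3).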

![Refer to caption](https://arxiv.org/html/2606.14139v1/x1.png)

Figure 1: Forward modeling and full waveform inversion\. A velocity model \(left\) is excited by five surface sources A–E; the wave equation \([1](https://arxiv.org/html/2606.14139#S2.E1)\) propagates each source through the medium, and the resulting wavefield is recorded through time at surface receivers \(right; rows are sources A–E, columns show the shot gathers under clean, noisy, and missing\-trace acquisition\)\. FWI inverts this mapping to recover $v$ from the observed wavefield\.

Traditional FWI recovers the subsurface velocity $v$ by solving an optimization problem initialized from a rough velocity model \(typically a smoothed approximation of the true geology\):

$$
\min_{v}\;\bigl\|\mathcal{F}(v)-\mathbf{d}_{\mathrm{obs}}\bigr\|^{2}+\lambda\,R(v),
\tag{5}
$$

where the first term measures the misfit between the observed data $\mathbf{d}_{\mathrm{obs}}$ and the simulated data $\mathcal{F}(v)$, and the second term $R(v)$ is a regularization term weighted by $\lambda$ that constrains the solution to be physically plausible and improves the convergence of the inversion\. The gradient of the data misfit is usually computed via the adjoint\-state method \[[47](https://arxiv.org/html/2606.14139#bib.bib31)\]\.

Due to the nonlinear dependence of the wavefield on the velocity and the limited coverage of surface observations, FWI is a severely ill\-posed problem whose optimization landscape contains numerous local minima \(e\.g\., those caused by cycle skipping \[[37](https://arxiv.org/html/2606.14139#bib.bib6)\]\)\. Although regularization penalizes non\-physical solutions and improves convergence, performance remains sensitive to the choice of initial model, the strength and form of the regularization term, and measurement noise in the observations, making accurate recovery challenging in practice\.

While classical regularizers such as Tikhonov \[[4](https://arxiv.org/html/2606.14139#bib.bib5)\] and Total Variation \(TV\) \[[18](https://arxiv.org/html/2606.14139#bib.bib2),[29](https://arxiv.org/html/2606.14139#bib.bib3),[1](https://arxiv.org/html/2606.14139#bib.bib4)\] improve inversion stability, they often fail to produce geologically realistic models: Tikhonov tends to over\-smooth fine\-scale features, while TV introduces staircase artifacts at smooth transitions\. In recent years, deep\-learning\-based methods have emerged as a promising direction, leveraging large\-scale data to introduce data\-driven regularization that captures complex geological structures beyond the reach of handcrafted priors\.

### 2\.2 Diffusion Generative Models
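The contrasting behaviour of the two classical penalties can be seen on a one-dimensional toy profile. The sketch below assumes simple first-difference discretizations (squared differences for Tikhonov, absolute differences for TV) on a unit grid; the function names and the two test profiles are illustrative choices, not the paper's implementation.

```python
import numpy as np


def tikhonov_penalty(v):
    """Sum of squared first differences (a simple smoothness penalty)."""
    return float(np.sum(np.diff(v) ** 2))


def tv_penalty(v):
    """Sum of absolute first differences (1D total variation)."""
    return float(np.sum(np.abs(np.diff(v))))


# Two profiles with the same total jump of 1: a sharp interface and a smooth ramp.
step = np.concatenate([np.zeros(50), np.ones(50)])
ramp = np.linspace(0.0, 1.0, 100)

tik_step, tik_ramp = tikhonov_penalty(step), tikhonov_penalty(ramp)
tv_step, tv_ramp = tv_penalty(step), tv_penalty(ramp)
```

In this toy setting TV assigns the same cost (1) to the sharp step and the ramp, so it does not discourage sharp interfaces, whereas the Tikhonov cost of the step (1) is 99 times that of the ramp (1/99), which is why it favours smoothed-out transitions.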

A diffusion model defines a continuous family of Gaussian\-corrupted marginals $\{p_{t}\}_{t\in[0,T]}$ by progressively adding noise to clean samples $x_{0}\sim p_{\mathrm{data}}$\. Under the variance\-preserving \(VP\) corruption \[[21](https://arxiv.org/html/2606.14139#bib.bib12),[45](https://arxiv.org/html/2606.14139#bib.bib14)\],

$$
x_{t}=\sqrt{\bar{\alpha}_{t}}\,x_{0}+\sqrt{1-\bar{\alpha}_{t}}\,\epsilon,\qquad\epsilon\sim\mathcal{N}(0,I),
\tag{6}
$$

where $\bar{\alpha}_{t}\in(0,1]$ decreases monotonically from $\bar{\alpha}_{0}=1$ to $\bar{\alpha}_{T}\approx 0$\.

Eq\. \([6](https://arxiv.org/html/2606.14139#S2.E6)\) yields a sample $x_{t}\sim p_{t}$ at each noise level, and the family $\{p_{t}\}_{t\in[0,T]}$ can be regarded as the marginal distributions of a continuous\-time forward diffusion process described by the Itô stochastic differential equation \[[45](https://arxiv.org/html/2606.14139#bib.bib14)\]

$$
\mathrm{d}x=f(x,t)\,\mathrm{d}t+g(t)\,\mathrm{d}w,
\tag{7}
$$

where $f(\cdot,t)$ is the drift coefficient, $g(t)$ the diffusion coefficient, and $w$ a standard Wiener process\. The VP corruption \([6](https://arxiv.org/html/2606.14139#S2.E6)\) corresponds to the choice $f(x,t)=-\dfrac{1}{2}\beta(t)\,x$ and $g(t)=\sqrt{\beta(t)}$\. Here $\beta(t)$ is the continuous\-time noise\-rate schedule, related to the corruption coefficient $\bar{\alpha}_{t}$ in Eq\. \([6](https://arxiv.org/html/2606.14139#S2.E6)\) by $\beta(t)=-\dfrac{\dot{\bar{\alpha}}_{t}}{\bar{\alpha}_{t}}$, equivalently $\displaystyle\bar{\alpha}_{t}=\exp\left(-\int_{0}^{t}\beta(s)\,\mathrm{d}s\right)$, so that the two schedules encode the same noising process\. As $t\to T$ the signal\-to\-noise ratio of the marginal, $\dfrac{\bar{\alpha}_{t}}{1-\bar{\alpha}_{t}}$, decreases monotonically to zero, so that $p_{T}\approx\mathcal{N}(0,I)$ is a tractable Gaussian prior\.
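The relation between $\beta(t)$ and $\bar{\alpha}_t$ can be checked numerically. The sketch below assumes a linear schedule $\beta(t)=\beta_{\min}+(\beta_{\max}-\beta_{\min})\,t$ on $[0,1]$ with $\beta_{\min}=0.1$ and $\beta_{\max}=20$; these values and the function names are illustrative assumptions and are not stated in this section.

```python
import numpy as np


def beta(t, beta_min=0.1, beta_max=20.0):
    """Linear noise-rate schedule on t in [0, 1] (assumed form)."""
    return beta_min + (beta_max - beta_min) * t


def alpha_bar(t, beta_min=0.1, beta_max=20.0):
    """alpha_bar_t = exp(-int_0^t beta(s) ds), in closed form for the linear schedule."""
    return np.exp(-(beta_min * t + 0.5 * (beta_max - beta_min) * t**2))


def vp_corrupt(x0, t, rng):
    """Draw x_t from the VP corruption in Eq. (6)."""
    ab = alpha_bar(t)
    eps = rng.standard_normal(np.shape(x0))
    return np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps


t = np.linspace(0.0, 1.0, 1001)
ab = alpha_bar(t)
snr = ab[1:] / (1.0 - ab[1:])          # skip t = 0, where the SNR is infinite
beta_fd = -np.gradient(ab, t) / ab     # finite-difference check of beta = -d(ab)/dt / ab
```

Here `beta_fd` should agree with `beta(t)` up to discretization error, and `snr` should decrease monotonically towards zero as $t\to T=1$.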

A diffusion model generates samples through the reverse form of this SDE, gradually denoising a Gaussian sample $x_{T}\sim\mathcal{N}(0,I)$ back to the data distribution by integrating the reverse\-time SDE \[[3](https://arxiv.org/html/2606.14139#bib.bib69)\]

$$
\mathrm{d}x=\bigl[\,f(x,t)-g(t)^{2}\,\nabla_{x}\log p_{t}(x)\,\bigr]\mathrm{d}t+g(t)\,\mathrm{d}\bar{w},
\tag{8}
$$

where $\bar{w}$ is a standard Wiener process running backwards in time and $\mathrm{d}t$ is an infinitesimal negative timestep; the only unknown is the score $\nabla_{x}\log p_{t}(x)$ of each marginal\. We fit this score by score matching, realized by training a neural network $D_{\theta}(x_{t},t)$ to predict the clean sample $x_{0}$ from the noised observation $x_{t}$ through the VP denoising objective

$$
\min_{\theta}\;\mathbb{E}_{\,x_{0}\sim p_{\mathrm{data}},\;t\sim\mathcal{U}(0,T),\;\epsilon\sim\mathcal{N}(0,I)}\bigl[\,\omega(t)\,\|D_{\theta}(x_{t},t)-x_{0}\|^{2}\,\bigr],\quad x_{t}=\sqrt{\bar{\alpha}_{t}}\,x_{0}+\sqrt{1-\bar{\alpha}_{t}}\,\epsilon,
\tag{9}
$$

with $\omega(t)>0$ a weighting function; its optimum recovers $\mathbb{E}[x_{0}\mid x_{t}]$ and hence the score $\nabla_{x_{t}}\log p_{t}(x_{t})$ through Tweedie’s formula\.
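For reference, the link between the denoiser and the score can be written out explicitly. The identity below is the standard form of Tweedie's formula for the VP corruption in Eq. (6); it is included as a reading aid, with the trained network $D_\theta$ standing in for the exact posterior mean.

```latex
\nabla_{x_t}\log p_t(x_t)
  \;=\; \frac{\sqrt{\bar{\alpha}_t}\,\mathbb{E}[x_0\mid x_t]-x_t}{1-\bar{\alpha}_t}
  \;\approx\; \frac{\sqrt{\bar{\alpha}_t}\,D_\theta(x_t,t)-x_t}{1-\bar{\alpha}_t}.
```

It follows from differentiating $p_t(x_t)=\int \mathcal{N}\bigl(x_t;\sqrt{\bar{\alpha}_t}\,x_0,(1-\bar{\alpha}_t)I\bigr)\,p_{\mathrm{data}}(x_0)\,\mathrm{d}x_0$ under the integral sign.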

The sampling process is realized either by the reverse\-time SDE \([8](https://arxiv.org/html/2606.14139#S2.E8)\) itself or by its equivalent probability\-flow ODE derived from the Fokker–Planck equation \[[45](https://arxiv.org/html/2606.14139#bib.bib14)\], thereby transporting a standard Gaussian sample to the clean data distribution\. Here we consider the standard DDIM sampling process \[[43](https://arxiv.org/html/2606.14139#bib.bib13)\], a discretization of the probability\-flow ODE\. We introduce the notation $R_{\theta}\colon z\mapsto x_{0}$ to represent this deterministic mapping from the initial Gaussian latent $x_{T}=z\sim\mathcal{N}(0,I)$ to the clean sample $x_{0}$, defined by iterating

$$
x_{t-1}=\sqrt{\bar{\alpha}_{t-1}}\,D_{\theta}(x_{t},t)+\sqrt{1-\bar{\alpha}_{t-1}}\,\frac{x_{t}-\sqrt{\bar{\alpha}_{t}}\,D_{\theta}(x_{t},t)}{\sqrt{1-\bar{\alpha}_{t}}}.
\tag{10}
$$

We treat $R_{\theta}$ as a learned generator that transports the standard Gaussian to \(an approximation of\) the data distribution $p_{\mathrm{data}}$; every output $R_{\theta}(z)$ is by construction a sample of the learned prior\. For the specific theory and detailed derivations of diffusion models, we refer to Appendix [A](https://arxiv.org/html/2606.14139#A1)\.

### 2\.3 Diffusion Priors for Inverse Problems
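The deterministic map $R_\theta$ defined by Eq. (10) can be illustrated without a trained network. The sketch below swaps in an analytic denoiser: for scalar Gaussian data $x_0\sim\mathcal{N}(\mu,s^2)$ the posterior mean $\mathbb{E}[x_0\mid x_t]$ is available in closed form, so it plays the role of $D_\theta$. The data distribution, the linear schedule, the step count, and all names are assumptions made for illustration only.

```python
import numpy as np

MU, S = 3.0, 0.5  # toy data distribution N(MU, S^2) (assumption)


def alpha_bar(t, beta_min=0.1, beta_max=20.0):
    """alpha_bar_t for an assumed linear beta schedule on t in [0, 1]."""
    return np.exp(-(beta_min * t + 0.5 * (beta_max - beta_min) * t**2))


def denoiser(x_t, ab):
    """Exact E[x0 | x_t] for Gaussian data; a stand-in for the trained D_theta."""
    gain = np.sqrt(ab) * S**2 / (ab * S**2 + 1.0 - ab)
    return MU + gain * (x_t - np.sqrt(ab) * MU)


def ddim_sample(z, n_steps=200):
    """Deterministic DDIM map z -> x0 obtained by iterating Eq. (10)."""
    ts = np.linspace(1.0, 0.0, n_steps + 1)
    ab_grid = alpha_bar(ts)
    x = np.array(z, dtype=float)
    for ab_t, ab_prev in zip(ab_grid[:-1], ab_grid[1:]):
        x0_hat = denoiser(x, ab_t)
        eps_hat = (x - np.sqrt(ab_t) * x0_hat) / np.sqrt(1.0 - ab_t)
        x = np.sqrt(ab_prev) * x0_hat + np.sqrt(1.0 - ab_prev) * eps_hat
    return x


z = np.random.default_rng(0).standard_normal(20000)
x0 = ddim_sample(z)   # the same z always gives the same x0
```

With enough steps the outputs are distributed approximately as $\mathcal{N}(\mu,s^2)$, and the map is a fixed function of $z$, which is the property that latent-optimization methods exploit when they differentiate through the sampler.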

Diffusion models provide an efficient and high\-quality mechanism for sampling from the data prior\. Embedding this pretrained prior into the solution of inverse problems, however, requires carefully designed algorithms that specify how the prior signal is extracted from the pretrained model and how it interacts with the physics\-based data fidelity\. Existing approaches inject the diffusion prior into inversion in several broadly different ways\.

##### Physical constraint embedded in the reverse sampling process\.

A classical realization in image\-domain inverse problems is DPS \[[12](https://arxiv.org/html/2606.14139#bib.bib62)\] / $\Pi$GDM \[[44](https://arxiv.org/html/2606.14139#bib.bib29)\], which approximate the conditional score of the noisy marginal by Bayes’ rule,

$$
\nabla_{x_{t}}\log p_{t}\bigl(x_{t}\mid\mathbf{d}_{\mathrm{obs}}\bigr)=\nabla_{x_{t}}\log p_{t}(x_{t})+\nabla_{x_{t}}\log p_{t}\bigl(\mathbf{d}_{\mathrm{obs}}\mid x_{t}\bigr),
\tag{11}
$$

where the intractable likelihood term $\nabla_{x_{t}}\log p_{t}\bigl(\mathbf{d}_{\mathrm{obs}}\mid x_{t}\bigr)$ is approximated by evaluating the forward operator at the denoised estimate $\mathbb{E}[x_{0}\mid x_{t}]$\. A separate line of work, represented by DiffusionFWI \[[52](https://arxiv.org/html/2606.14139#bib.bib1)\] and DiffusionILVR \[[49](https://arxiv.org/html/2606.14139#bib.bib37)\], bypasses the conditional score: instead, an FWI gradient step \(or a low\-pass replacement\) is interleaved directly between successive denoising steps of the unconditional sampler\. To respect both the prior and the observations, these methods must balance the physical constraint against the diffusion prior at every sampling step\. However, this balance is fragile: the likelihood approximation is inaccurate, and the physics\-based constraint is inherently inconsistent with the noise\-perturbed marginals $\{p_{t}\}$ encountered during sampling\. As a result, the schemes are sensitive to the choice of guidance strategy and hyperparameters, particularly for nonlinear PDE\-governed inverse problems such as FWI\.
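As a reading aid, the likelihood approximation described above can be written out under the Gaussian noise model of Eq. (4). The expression below is the commonly used DPS-style surrogate rather than a formula taken from this paper, and practical implementations typically replace the factor $1/(2\sigma^2)$ with a tuned step size.

```latex
\nabla_{x_t}\log p_t\bigl(\mathbf{d}_{\mathrm{obs}}\mid x_t\bigr)
  \;\approx\;
  -\frac{1}{2\sigma^{2}}\,
  \nabla_{x_t}\bigl\|\mathcal{F}\bigl(D_\theta(x_t,t)\bigr)-\mathbf{d}_{\mathrm{obs}}\bigr\|^{2}.
```

Evaluating this gradient requires differentiating through both the denoiser and the forward operator at every sampling step, which is part of why the guidance weight is delicate for a nonlinear $\mathcal{F}$.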

##### Pretrained Denoiser as a Learned Regularizer

A second class of methods retains the physical\-space FWI objective and introduces the diffusion prior as a regularizer evaluated through the pretrained denoiser:

$$
\min_{v}\;\bigl\|\mathcal{F}(v)-\mathbf{d}_{\mathrm{obs}}\bigr\|^{2}+\lambda\,\mathcal{L}\bigl(v,\,\tilde{v}_{\theta}(v)\bigr),
\tag{12}
$$

where $\tilde{v}_{\theta}(v)$ is a prior velocity field associated with the current iterate $v$, computed from $v$ through one or more evaluations of the pretrained denoiser $D_{\theta}$; it can be read as a denoiser\-based prior estimate of $v$, and $\mathcal{L}$ penalizes deviation of $v$ from $\tilde{v}_{\theta}$, supplying a gradient that points toward the learned manifold\. Different instantiations of this framework differ in how $\tilde{v}_{\theta}(v)$ is constructed from $v$ and $D_{\theta}$ and in the choice of penalty $\mathcal{L}$\. For example, RED\-diff \[[33](https://arxiv.org/html/2606.14139#bib.bib17)\] and its wave\-equation counterpart RED\-DiffEq \[[41](https://arxiv.org/html/2606.14139#bib.bib61)\] corrupt the current iterate $v$ to a randomly drawn noise level $t$ with realization $\epsilon$ and read off a single denoiser output,

$$
\tilde{v}_{\theta}(v):=\;D_{\theta}\bigl(\sqrt{\bar{\alpha}_{t}}\,v+\sqrt{1-\bar{\alpha}_{t}}\,\epsilon,\;t\bigr),\qquad\epsilon\sim\mathcal{N}(0,I),
\tag{13}
$$

with $\mathcal{L}$ taken from a variational lower bound on the diffusion log\-likelihood\. A common characteristic of both approaches is that the optimized solution is not drawn from the diffusion prior\. As a result, the regularizing effect of the learned prior is indirect, and reconstruction quality can depend on the choice of regularization form, guidance schedule, and hyperparameters\.

##### Latent Optimization through the Sampler

Another line of work, applicable to inverse problems with a differentiable forward operator, replaces the unknown field by the latent input to the diffusion sampler and minimizes the data misfit through the composed map $\mathcal{F}\circ R_{\theta}$:

$$
\min_{z}\;\bigl\|\mathcal{F}(R_{\theta}(z))-\mathbf{d}_{\mathrm{obs}}\bigr\|^{2},
\tag{14}
$$

in which the sampler $R_{\theta}$ acts as a preconditioner that constrains every iterate $R_{\theta}(z)$ to lie on the learned prior manifold $\mathcal{M}=\bigl\{\,R_{\theta}(z)\mid z\sim\mathcal{N}(0,I)\,\bigr\}$\. This paradigm, pioneered by D\-Flow \[[5](https://arxiv.org/html/2606.14139#bib.bib66)\] and DMPlug \[[53](https://arxiv.org/html/2606.14139#bib.bib67)\] for image\-domain inverse problems, achieves strict prior consistency without requiring a hand\-designed regularizer or a guidance schedule\.

However, for physical inverse problems governed by partial differential equations such as FWI, directly applying the latent\-optimization formulation of Eq\. \([14](https://arxiv.org/html/2606.14139#S2.E14)\) faces two fundamental obstacles\. First, the forward operator $\mathcal{F}$ requires a PDE solve and is strongly nonlinear in the velocity; composing it with the sampler $R_{\theta}$ chains two nonlinear maps, and the resulting loss landscape combines the cycle\-skipping minima of classical FWI with the nonconvexity introduced by backpropagation through the PF\-ODE\. This introduces additional saddle points and destabilizes gradients\. Second, classical FWI relies on a physically informed initialization \(typically a smoothed version of the true velocity\) to land in a favorable basin of attraction\. The strict latent formulation, however, constrains every iterate to the learned prior manifold $\mathcal{M}$ and precludes such physical\-space initialization: the natural choice $z\sim\mathcal{N}(0,I)$ produces only random prior samples, yielding optimization that is sensitive to the initial seed\.

These challenges motivate the decoupled formulation we introduce in the next section\.

## 3 Decoupled Latent Optimization for FWI

To solve highly nonlinear and ill-posed inverse problems such as FWI, and to address these two obstacles, we develop *Decoupled Latent Optimization* (DLO).

We begin by observing that the latent-optimization problem ([14](https://arxiv.org/html/2606.14139#S2.E14)) is equivalent to the equality-constrained formulation

$$\min_{v,\,z}\;\bigl\|\mathcal{F}(v)-\mathbf{d}_{\mathrm{obs}}\bigr\|^{2}\quad\text{subject to}\quad v=R_{\theta}(z), \tag{15}$$

since substituting the constraint eliminates the auxiliary variable $v$ and recovers Eq. ([14](https://arxiv.org/html/2606.14139#S2.E14)) exactly. We then relax the hard constraint $v=R_{\theta}(z)$ by introducing a quadratic penalty term, yielding the decoupled objective

$$\min_{v,\,z}\;\bigl\|\mathcal{F}(v)-\mathbf{d}_{\mathrm{obs}}\bigr\|^{2}+\lambda\,\bigl\|v-R_{\theta}(z)\bigr\|^{2}, \tag{16}$$

where $\lambda>0$ is a fixed penalty parameter. Eq. ([16](https://arxiv.org/html/2606.14139#S3.E16)) is the standard quadratic penalty method \[[7](https://arxiv.org/html/2606.14139#bib.bib68)\] applied to the constrained problem ([15](https://arxiv.org/html/2606.14139#S3.E15)); a rigorous analysis of the relationship between the two formulations is given in Section [3.2](https://arxiv.org/html/2606.14139#S3.SS2).

This decoupling avoids gradient backpropagation through the composed PDE-solver–sampler chain and permits flexible physical-space initialization. The $z$-update searches for the projection of $v$ onto the prior manifold $\mathcal{M}$, while the physics-based optimization of $v$ is regularized by the prior velocity field $R_{\theta}(z)$.

In practice, we initialize $v$ with a Gaussian-smoothed version of the true velocity (the standard starting point in classical FWI) and draw $z^{(0)}\sim\mathcal{N}(0,I)$. The penalty parameter $\lambda$ is held fixed, avoiding the progressive ill-conditioning that arises when $\lambda$ is driven to infinity in classical penalty schemes.

### 3.1 Alternating Optimization

We minimize Eq. ([16](https://arxiv.org/html/2606.14139#S3.E16)) by alternating gradient descent on $v$ and $z$. At iteration $k$, the physical variable $v$ is updated by a gradient step on the loss

$$\mathcal{L}_{v}\bigl(v;\,z^{(k)}\bigr)=\bigl\|\mathcal{F}(v)-\mathbf{d}_{\mathrm{obs}}\bigr\|^{2}+\lambda\,\bigl\|v-R_{\theta}(z^{(k)})\bigr\|^{2}, \tag{17}$$

whose gradient with respect to $v$ is

$$\nabla_{v}\mathcal{L}_{v}=\nabla_{v}\bigl\|\mathcal{F}(v)-\mathbf{d}_{\mathrm{obs}}\bigr\|^{2}+2\lambda\bigl(v-R_{\theta}(z^{(k)})\bigr), \tag{18}$$

where the data-misfit gradient is computed efficiently by the adjoint-state method \[[47](https://arxiv.org/html/2606.14139#bib.bib31)\]. The $v$-update is then

$$v^{(k+1)}=v^{(k)}-\eta_{v}\,\nabla_{v}\mathcal{L}_{v}\bigl(v^{(k)};z^{(k)}\bigr). \tag{19}$$

Given the updated physical variable $v^{(k+1)}$, the latent variable $z$ is updated by minimizing the coupling loss

$$\mathcal{L}_{z}\bigl(z;\,v^{(k+1)}\bigr)=\bigl\|R_{\theta}(z)-v^{(k+1)}\bigr\|^{2}. \tag{20}$$
To compute the gradient of this objective, we backpropagate through the deterministic DDIM chain. Let the sampler be unrolled over the timesteps $0=t_{0}<t_{1}<\cdots<t_{n}=T$, traversed from $t_{n}$ down to $t_{0}$. Differentiating the DDIM update ([10](https://arxiv.org/html/2606.14139#S2.E10)) with respect to $x_{t_{i}}$ gives the per-step Jacobian

$$\frac{\partial x_{t_{i-1}}}{\partial x_{t_{i}}}=\Bigl(\sqrt{\bar{\alpha}_{t_{i-1}}}-\frac{\sqrt{\bar{\alpha}_{t_{i}}}\sqrt{1-\bar{\alpha}_{t_{i-1}}}}{\sqrt{1-\bar{\alpha}_{t_{i}}}}\Bigr)\frac{\partial D_{\theta}(x_{t_{i}},t_{i})}{\partial x_{t_{i}}}+\frac{\sqrt{1-\bar{\alpha}_{t_{i-1}}}}{\sqrt{1-\bar{\alpha}_{t_{i}}}}\,I. \tag{21}$$

The Jacobian of the full sampler $R_{\theta}: x_{T}\mapsto x_{0}$ is then the product of per-step Jacobians,

$$\frac{\partial R_{\theta}(z)}{\partial z}=\prod_{i=1}^{n}\frac{\partial x_{t_{i-1}}}{\partial x_{t_{i}}}, \tag{22}$$

which is evaluated by automatic differentiation through the unrolled computation graph. With a small number of sampling steps ($n=3$), optimizing the initial noise $z$ with this gradient effectively searches for the projection of the physical velocity $v$ onto the prior manifold $\mathcal{M}$. The $z$-update then reads

$$z^{(k+1)}=z^{(k)}-\eta_{z}\,\nabla_{z}\bigl\|R_{\theta}(z^{(k)})-v^{(k+1)}\bigr\|^{2}. \tag{23}$$

The full procedure is summarized in Algorithm [1](https://arxiv.org/html/2606.14139#alg1).
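The chain rule behind Eqs. (21) and (22) can be checked on a scalar toy problem. The sketch below is only illustrative: the schedule `ALPHA_BAR` and the `tanh` denoiser are hypothetical stand-ins (not the trained $D_{\theta}$), and the step uses the DDIM update form implied by Eq. (21). It accumulates the per-step Jacobians along the sampling trajectory so that the product can be compared with a finite-difference derivative of the sampler.

```python
import math

# Hypothetical cumulative schedule: ALPHA_BAR[i] plays the role of \bar{alpha}_{t_i},
# with index 0 nearly clean and index n nearly pure noise (n = 3 steps as in the text).
ALPHA_BAR = [0.9999, 0.6, 0.2, 0.01]


def denoiser(x, i):
    """Toy stand-in for D_theta(x_{t_i}, t_i); NOT a trained network."""
    return math.tanh(math.sqrt(ALPHA_BAR[i]) * x)


def denoiser_dx(x, i):
    """Analytic derivative of the toy denoiser with respect to x."""
    s = math.sqrt(ALPHA_BAR[i])
    return s * (1.0 - math.tanh(s * x) ** 2)


def ddim_step(x, i):
    """Deterministic DDIM step x_{t_i} -> x_{t_{i-1}}."""
    a_prev, a_cur = ALPHA_BAR[i - 1], ALPHA_BAR[i]
    ratio = math.sqrt(1.0 - a_prev) / math.sqrt(1.0 - a_cur)
    x0 = denoiser(x, i)
    return math.sqrt(a_prev) * x0 + ratio * (x - math.sqrt(a_cur) * x0)


def ddim_step_jacobian(x, i):
    """Scalar version of the per-step Jacobian in Eq. (21)."""
    a_prev, a_cur = ALPHA_BAR[i - 1], ALPHA_BAR[i]
    ratio = math.sqrt(1.0 - a_prev) / math.sqrt(1.0 - a_cur)
    return (math.sqrt(a_prev) - math.sqrt(a_cur) * ratio) * denoiser_dx(x, i) + ratio


def sampler_with_jacobian(z):
    """Run the unrolled sampler and accumulate the product of Eq. (22)."""
    x, jac = z, 1.0
    for i in range(len(ALPHA_BAR) - 1, 0, -1):
        jac *= ddim_step_jacobian(x, i)  # evaluate at the state before the step
        x = ddim_step(x, i)
    return x, jac


def sampler(z):
    return sampler_with_jacobian(z)[0]
```

In the vector-valued setting of the paper, each factor is a matrix and the product is never formed explicitly; reverse-mode automatic differentiation applies the transposed factors to the residual $R_{\theta}(z)-v$ one step at a time.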

Algorithm 1: Decoupled Latent Optimization (DLO) for FWI

Input: observed data $\mathbf{d}_{\mathrm{obs}}$; forward operator $\mathcal{F}$; pretrained DDIM sampler $R_{\theta}$ with schedule $\{t_{i}\}_{i=0}^{n}$; regularization weight $\lambda$; learning rates $\eta_{v},\eta_{z}$; smoothed initial velocity $v^{(0)}$; number of outer iterations $K$.

1. $z^{(0)}\sim\mathcal{N}(\mathbf{0},I)$ (latent initialization; sole source of randomness)
2. for $k=0,\dots,K-1$ do
3. $v^{(k+1)}\leftarrow v^{(k)}-\eta_{v}\,\nabla_{v}\mathcal{L}_{v}\bigl(v^{(k)};z^{(k)}\bigr)$ ($v$-step on Eq. ([17](https://arxiv.org/html/2606.14139#S3.E17)))
4. $z^{(k+1)}\leftarrow z^{(k)}-\eta_{z}\,\nabla_{z}\mathcal{L}_{z}\bigl(z^{(k)};v^{(k+1)}\bigr)$ ($z$-step on Eq. ([20](https://arxiv.org/html/2606.14139#S3.E20)))
5. end for
6. return $\hat{v}=v^{(K)}$
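The alternating structure of Algorithm 1 can be sketched on a scalar toy problem. Everything problem-specific here is a hypothetical stand-in: `forward` replaces the wave-equation operator $\mathcal{F}$, `sampler` replaces the DDIM sampler $R_{\theta}$, and the step sizes are chosen for this toy only. In a real FWI code the misfit gradient would come from the adjoint-state method and the sampler gradient from automatic differentiation.

```python
import math
import random


def forward(v):
    """Hypothetical stand-in for the PDE forward operator F."""
    return v * v


def forward_grad(v):
    """dF/dv for the toy operator (adjoint-state method in real FWI)."""
    return 2.0 * v


def sampler(z):
    """Hypothetical stand-in for R_theta; its range (1, 3) is the toy prior manifold."""
    return 2.0 + math.tanh(z)


def sampler_grad(z):
    """dR/dz for the toy sampler (autodiff through the DDIM chain in practice)."""
    return 1.0 - math.tanh(z) ** 2


def dlo(d_obs, v0, lam=1.0, eta_v=0.01, eta_z=0.5, num_iters=2000, seed=0):
    """Alternating gradient descent on the penalty objective of Eq. (16)."""
    z = random.Random(seed).gauss(0.0, 1.0)  # latent init: sole source of randomness
    v = v0                                    # physical-space (smoothed) init
    for _ in range(num_iters):
        # v-step, Eqs. (18)-(19): data-misfit gradient plus penalty gradient
        grad_v = 2.0 * (forward(v) - d_obs) * forward_grad(v) + 2.0 * lam * (v - sampler(z))
        v = v - eta_v * grad_v
        # z-step, Eqs. (20) and (23): pull the decoded sample toward the new v
        grad_z = 2.0 * (sampler(z) - v) * sampler_grad(z)
        z = z - eta_z * grad_z
    return v, z


v_true = 2.5
v_hat, z_hat = dlo(d_obs=forward(v_true), v0=2.0)
print(f"v_hat = {v_hat:.6f}, R(z_hat) = {sampler(z_hat):.6f}")
```

Note that the data term never sees the sampler and the sampler never sees the forward operator; the two are coupled only through the penalty residual $v-R_{\theta}(z)$.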

Since DDIM is the deterministic discretization of the probability-flow ODE, the reconstruction $\hat{v}$ is a deterministic function of the initial latent vector $z^{(0)}$. Drawing $z^{(0)}$ independently from $\mathcal{N}(0,I)$ therefore induces a distribution over data-consistent velocity models, which is the basis of the uncertainty quantification in Section [4.3](https://arxiv.org/html/2606.14139#S4.SS3).

A schematic overview of the DLO procedure is provided in Fig. [2](https://arxiv.org/html/2606.14139#S3.F2); implementation details, hyperparameter settings, and computational cost are given in Appendix [B](https://arxiv.org/html/2606.14139#A2).

![Refer to caption](https://arxiv.org/html/2606.14139v1/figures/DLOimage3.drawio.png)

Figure 2: An overview of Decoupled Latent Optimization for full waveform inversion. Top: the evolution of the physical velocity $v$ and the prior velocity $v_{\mathrm{gen}}=R_{\theta}(z)$ over the course of the inversion. Bottom: a detailed view of a single iteration, showing the alternating $v$-update and $z$-update.

### 3.2 Convergence of the Penalty Relaxation

The DLO formulation ([16](https://arxiv.org/html/2606.14139#S3.E16)) replaces the strictly constrained latent optimization ([14](https://arxiv.org/html/2606.14139#S2.E14)) with the penalty objective

$$\Phi_{\lambda}(v,z)=\bigl\|\mathcal{F}(v)-\mathbf{d}_{\mathrm{obs}}\bigr\|^{2}+\lambda\,\bigl\|v-R_{\theta}(z)\bigr\|^{2}, \tag{24}$$

which penalizes deviation of the physical velocity $v$ from the decoded prior sample $R_{\theta}(z)$. This section provides a rigorous justification for the penalty relaxation by establishing that, as $\lambda\to\infty$, the minimizers of $\Phi_{\lambda}$ converge to minimizers of the original constrained problem $\min_{v}\|\mathcal{F}(v)-\mathbf{d}_{\mathrm{obs}}\|^{2}$ subject to $v=R_{\theta}(z)$. The argument follows the standard quadratic penalty method framework \[[7](https://arxiv.org/html/2606.14139#bib.bib68)\].

We write the data-misfit functional as $\mathcal{J}(v)=\|\mathcal{F}(v)-\mathbf{d}_{\mathrm{obs}}\|^{2}$, which is continuous whenever the forward PDE operator $\mathcal{F}$ is well-posed. The DDIM sampler $R_{\theta}:\mathbb{R}^{d}\to\mathbb{R}^{n}$ is a composition of affine transforms and pretrained neural-network evaluations and is therefore continuous. The constrained problem and its optimal value are

$$\mathcal{J}^{*}=\inf_{v,\,z}\;\{\,\mathcal{J}(v)\mid v=R_{\theta}(z)\,\}=\inf_{z}\,\mathcal{J}(R_{\theta}(z)). \tag{25}$$

We assume the feasible set is nonempty and $\mathcal{J}^{*}>-\infty$.

###### Proposition 3.1 (Exact minimization).

Let $\{\lambda_{k}\}_{k=0}^{\infty}$ be a positive sequence with $\lambda_{k}\to\infty$, and for each $k$ let

$$(v^{k},\,z^{k})\in\operatorname*{arg\,min}_{v,\,z}\;\mathcal{J}(v)+\lambda_{k}\,\|v-R_{\theta}(z)\|^{2}. \tag{26}$$

Assume that a global minimizer exists for each $k$. Then any limit point $(\bar{v},\bar{z})$ of $\{(v^{k},z^{k})\}$ is a global minimizer of the constrained problem ([25](https://arxiv.org/html/2606.14139#S3.E25)), i.e. it satisfies $\bar{v}=R_{\theta}(\bar{z})$ and $\mathcal{J}(\bar{v})=\mathcal{J}^{*}$.

###### Proof.

Let $(\bar{v},\bar{z})$ be a limit point of $\{(v^{k},z^{k})\}$ and pass to a convergent subsequence $(v^{k},z^{k})\to(\bar{v},\bar{z})$. Since $(v^{k},z^{k})$ is a global minimizer of $\Phi_{\lambda_{k}}$,

$$\mathcal{J}(v^{k})+\lambda_{k}\,\|v^{k}-R_{\theta}(z^{k})\|^{2}\;\leq\;\mathcal{J}(v)+\lambda_{k}\,\|v-R_{\theta}(z)\|^{2}\qquad\forall\,(v,z). \tag{27}$$

Restricting the right-hand side to feasible pairs, for which $v=R_{\theta}(z)$ and the penalty term vanishes, yields $\mathcal{J}(v^{k})+\lambda_{k}\|v^{k}-R_{\theta}(z^{k})\|^{2}\leq\mathcal{J}(v)$ for every feasible $(v,z)$, and taking the infimum over all such pairs gives

$$\mathcal{J}(v^{k})+\lambda_{k}\,\|v^{k}-R_{\theta}(z^{k})\|^{2}\;\leq\;\mathcal{J}^{*}. \tag{28}$$

In particular, the nonnegativity of the penalty term implies $\mathcal{J}(v^{k})\leq\mathcal{J}^{*}$ for all $k$.

Rearranging ([28](https://arxiv.org/html/2606.14139#S3.E28)) yields the bound

$$0\;\leq\;\|v^{k}-R_{\theta}(z^{k})\|^{2}\;\leq\;\frac{\mathcal{J}^{*}-\mathcal{J}(v^{k})}{\lambda_{k}}. \tag{29}$$

Since $\mathcal{J}^{*}$ is finite and $\mathcal{J}(v^{k})\geq 0$ (it is a squared norm), the right-hand side is at most $\mathcal{J}^{*}/\lambda_{k}$, which tends to zero as $\lambda_{k}\to\infty$; hence $\|v^{k}-R_{\theta}(z^{k})\|\to 0$. By continuity of $R_{\theta}$, it follows that $\|\bar{v}-R_{\theta}(\bar{z})\|=0$, i.e. $\bar{v}=R_{\theta}(\bar{z})$; thus $(\bar{v},\bar{z})$ is feasible for the constrained problem ([25](https://arxiv.org/html/2606.14139#S3.E25)).

Finally, taking limits in $\mathcal{J}(v^{k})\leq\mathcal{J}^{*}$ and using the continuity of $\mathcal{J}$ yields $\mathcal{J}(\bar{v})\leq\mathcal{J}^{*}$, while feasibility gives $\mathcal{J}(\bar{v})\geq\mathcal{J}^{*}$ by definition of the infimum. Hence $\mathcal{J}(\bar{v})=\mathcal{J}^{*}$, and $(\bar{v},\bar{z})$ is a global minimizer of ([25](https://arxiv.org/html/2606.14139#S3.E25)). ∎

###### Proposition 3.2 (Inexact minimization).

Assume that $\mathcal{J}$ and $R_{\theta}$ are continuously differentiable. Let $\{\lambda_{k}\}$ be a sequence with $\lambda_{k}\to\infty$, and let $\{(v^{k},z^{k})\}$ be a sequence of approximate stationary points satisfying

$$\|\nabla\Phi_{\lambda_{k}}(v^{k},z^{k})\|\leq\varepsilon_{k},\qquad\varepsilon_{k}\to 0. \tag{30}$$

If $(v^{k},z^{k})\to(\bar{v},\bar{z})$, then

1. (i) $\bar{v}=R_{\theta}(\bar{z})$;
2. (ii) the vectors $\mu^{k}\coloneqq 2\lambda_{k}(v^{k}-R_{\theta}(z^{k}))$ converge to some $\bar{\mu}\in\mathbb{R}^{n}$, and $\nabla\mathcal{J}(\bar{v})+\bar{\mu}=0$.

###### Proof.

The joint gradient of $\Phi_{\lambda_{k}}$ with respect to $(v,z)$ is

$$\nabla\Phi_{\lambda_{k}}(v,z)=\begin{pmatrix}\nabla\mathcal{J}(v)+2\lambda_{k}(v-R_{\theta}(z))\\[4pt] -2\lambda_{k}\,\nabla R_{\theta}(z)^{\mathsf{T}}(v-R_{\theta}(z))\end{pmatrix}.$$

Define $\mu^{k}\coloneqq 2\lambda_{k}(v^{k}-R_{\theta}(z^{k}))$. Then condition ([30](https://arxiv.org/html/2606.14139#S3.E30)) implies

$$\nabla\mathcal{J}(v^{k})+\mu^{k}\to 0, \tag{31}$$

$$\nabla R_{\theta}(z^{k})^{\mathsf{T}}\mu^{k}\to 0. \tag{32}$$

From ([31](https://arxiv.org/html/2606.14139#S3.E31)) we have $\mu^{k}=-\nabla\mathcal{J}(v^{k})+o(1)$. Since $v^{k}\to\bar{v}$ and $\nabla\mathcal{J}$ is continuous, $\nabla\mathcal{J}(v^{k})\to\nabla\mathcal{J}(\bar{v})$, and therefore

$$\mu^{k}\to\bar{\mu}\coloneqq-\nabla\mathcal{J}(\bar{v}). \tag{33}$$

In particular, $\{\mu^{k}\}$ is bounded.

From the definition $\mu^{k}=2\lambda_{k}(v^{k}-R_{\theta}(z^{k}))$,

$$\|v^{k}-R_{\theta}(z^{k})\|=\frac{\|\mu^{k}\|}{2\lambda_{k}}\;\to\;0, \tag{34}$$

so by continuity of $R_{\theta}$ we obtain $\bar{v}=R_{\theta}(\bar{z})$, establishing (i).

Finally, since $R_{\theta}$ is $C^{1}$, $\nabla R_{\theta}$ is continuous; together with $z^{k}\to\bar{z}$ and $\mu^{k}\to\bar{\mu}$, we may take the limit in ([32](https://arxiv.org/html/2606.14139#S3.E32)) to obtain $\nabla R_{\theta}(\bar{z})^{\mathsf{T}}\bar{\mu}=0$, while ([31](https://arxiv.org/html/2606.14139#S3.E31)) gives $\nabla\mathcal{J}(\bar{v})+\bar{\mu}=0$. These are precisely the first-order necessary conditions for the constrained problem ([25](https://arxiv.org/html/2606.14139#S3.E25)). ∎
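Both propositions can be illustrated on a scalar example with a closed-form penalty minimizer. The setup is hypothetical and chosen only for tractability: $\mathcal{J}(v)=(v-2)^{2}$ and $R(z)=\sin z$, so the feasible set is $[-1,1]$, the constrained optimum is $v=1$, and $\mathcal{J}^{*}=1$. For any $\lambda>0$ the penalty minimizer has $\sin z=1$ and $v_{\lambda}=(2+\lambda)/(1+\lambda)$, so the constraint violation is $1/(1+\lambda)$ and the vector $\mu=2\lambda(v_{\lambda}-1)$ tends to $2=-\nabla\mathcal{J}(1)$.

```python
import math

TARGET = 2.0                   # hypothetical data target; J(v) = (v - TARGET)^2
J_STAR = (1.0 - TARGET) ** 2   # constrained optimum is v = 1, the edge of [-1, 1]


def penalty_objective(v, z, lam):
    """Toy version of Phi_lambda(v, z) in Eq. (24) with R(z) = sin(z)."""
    return (v - TARGET) ** 2 + lam * (v - math.sin(z)) ** 2


def penalty_minimizer(lam):
    """Closed-form global minimizer of the toy penalty objective (TARGET > 1)."""
    z = math.pi / 2.0                  # decoded value sin(z) = 1 is closest to v
    v = (TARGET + lam) / (1.0 + lam)   # stationary point of the quadratic in v
    return v, z


rows = []
for lam in (1.0, 10.0, 100.0, 1000.0):
    v, z = penalty_minimizer(lam)
    violation = abs(v - math.sin(z))            # ||v - R(z)||, bounded via Eq. (29)
    misfit = (v - TARGET) ** 2                  # J(v^k) <= J*, as in Eq. (28)
    multiplier = 2.0 * lam * (v - math.sin(z))  # mu^k of Proposition 3.2
    rows.append((lam, violation, misfit, multiplier))
    print(f"lambda={lam:7.1f}  violation={violation:.5f}  "
          f"J={misfit:.5f}  mu={multiplier:.5f}")
```

The same example also shows the effect of a fixed, moderate $\lambda$: the minimizer sits slightly off the manifold, trading a small constraint violation for a lower data misfit.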

Remark. Proposition [3.1](https://arxiv.org/html/2606.14139#S3.Thmproposition1) shows that the DLO penalty relaxation ([16](https://arxiv.org/html/2606.14139#S3.E16)) is asymptotically exact: driving $\lambda\to\infty$ recovers solutions of the original constrained latent-optimization problem ([14](https://arxiv.org/html/2606.14139#S2.E14)). In the DLO algorithm, $\lambda$ is held fixed rather than taken to infinity, so the penalty term acts as a soft constraint whose strength is calibrated by a single hyperparameter. The constant-$\lambda$ strategy avoids the progressive ill-conditioning that plagues classical penalty schemes when the penalty parameter grows without bound \[[7](https://arxiv.org/html/2606.14139#bib.bib68)\]. In practice, a moderate value of $\lambda$ is sufficient to steer the physical inversion toward the learned velocity manifold while preserving the numerical stability of the alternating gradient updates. Proposition [3.2](https://arxiv.org/html/2606.14139#S3.Thmproposition2) further shows that the limiting behaviour is preserved when the subproblems are solved only to approximate stationarity, as is the case for the finite-step alternating optimization used by DLO.

## 4 Numerical Experiments

We validate DLO on the OpenFWI benchmark \[[15](https://arxiv.org/html/2606.14139#bib.bib19)\] (Section [4.2](https://arxiv.org/html/2606.14139#S4.SS2)), examine its uncertainty quantification behaviour induced by the stochasticity of the latent sampler (Section [4.3](https://arxiv.org/html/2606.14139#S4.SS3)), and finally test its out-of-distribution generalization on the Marmousi \[[50](https://arxiv.org/html/2606.14139#bib.bib20)\] and Overthrust \[[2](https://arxiv.org/html/2606.14139#bib.bib21)\] models (Section [4.4](https://arxiv.org/html/2606.14139#S4.SS4)). Implementation details, including pretraining of the diffusion prior (Appendix [B.1](https://arxiv.org/html/2606.14139#A2.SS1)), the forward wave-equation solver (Appendix [B.2](https://arxiv.org/html/2606.14139#A2.SS2)), the hyperparameter settings (Appendix [B.3](https://arxiv.org/html/2606.14139#A2.SS3)), the protocols used to replicate every baseline (Appendix [B.4](https://arxiv.org/html/2606.14139#A2.SS4)), and the computational cost analysis (Appendix [B.5](https://arxiv.org/html/2606.14139#A2.SS5)), are deferred to the appendix.

### 4.1 Evaluation Metrics

We assess reconstruction quality with three standard metrics computed in the normalized $[-1,1]$ velocity domain: the root mean square error (RMSE), the mean absolute error (MAE), and the structural similarity index measure (SSIM). Let $\{x_{i}\}_{i=1}^{N}$ and $\{\hat{x}_{i}\}_{i=1}^{N}$ denote the ground-truth and recovered velocity models over the $N$-pixel spatial domain. RMSE and MAE measure point-wise discrepancies and are defined by

$$\mathrm{RMSE}=\sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_{i}-\hat{x}_{i})^{2}},\qquad\mathrm{MAE}=\frac{1}{N}\sum_{i=1}^{N}\bigl|x_{i}-\hat{x}_{i}\bigr|, \tag{35}$$

with RMSE emphasizing larger errors and MAE capturing average deviations. SSIM \[[55](https://arxiv.org/html/2606.14139#bib.bib32)\] compares structural and perceptual similarity between $x$ and $\hat{x}$; it is at most $1$, attained for identical images, and higher is better:

$$\mathrm{SSIM}(x,\hat{x})=\frac{(2\mu_{x}\mu_{\hat{x}}+C_{1})\,(2\sigma_{x\hat{x}}+C_{2})}{(\mu_{x}^{2}+\mu_{\hat{x}}^{2}+C_{1})\,(\sigma_{x}^{2}+\sigma_{\hat{x}}^{2}+C_{2})}, \tag{36}$$

where $\mu$ and $\sigma^{2}$ denote local means and variances, $\sigma_{x\hat{x}}$ the local covariance, and $C_{1},C_{2}$ small numerical-stability constants. Local statistics are evaluated on an $11\times 11$ sliding window and averaged over the image to give the final score. RMSE and MAE quantify point-wise velocity errors, whereas SSIM captures structural consistency; together they give complementary views of reconstruction quality.
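As a concrete reading of Eqs. (35) and (36), the following sketch computes RMSE, MAE, and the SSIM expression for a single window given as flat lists. The constants are assumptions, since the text does not state them: $C_{1}=(k_{1}L)^{2}$ and $C_{2}=(k_{2}L)^{2}$ with the commonly used $k_{1}=0.01$, $k_{2}=0.03$ and data range $L=2$ for the $[-1,1]$ domain. The $11\times 11$ sliding-window averaging described above is omitted for brevity.

```python
import math


def rmse(x, x_hat):
    """Root mean square error of Eq. (35) over flat pixel lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x))


def mae(x, x_hat):
    """Mean absolute error of Eq. (35) over flat pixel lists."""
    return sum(abs(a - b) for a, b in zip(x, x_hat)) / len(x)


def ssim_window(x, x_hat, data_range=2.0, k1=0.01, k2=0.03):
    """Eq. (36) on ONE window; k1, k2, data_range are assumed defaults."""
    n = len(x)
    c1 = (k1 * data_range) ** 2
    c2 = (k2 * data_range) ** 2
    mu_x = sum(x) / n
    mu_y = sum(x_hat) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in x_hat) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, x_hat)) / n
    numerator = (2.0 * mu_x * mu_y + c1) * (2.0 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


truth = [0.0, 0.5, -0.5, 1.0]
recon = [0.0, 0.5, -0.5, 0.0]
print(rmse(truth, recon), mae(truth, recon), ssim_window(truth, recon))
```

A single outlier pixel dominates RMSE more than MAE here, which is the sense in which RMSE emphasizes larger errors.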

### 4.2 Validation on OpenFWI Datasets

OpenFWI \[[15](https://arxiv.org/html/2606.14139#bib.bib19)\] is a collection of large-scale, multi-structural benchmark datasets that provide paired velocity models and simulated seismic recordings across a broad range of synthetic subsurface geologies, and it has become the standard testbed for diffusion-based FWI. We select four representative families from the benchmark whose velocity fields cover complementary geological regimes: FlatVel-B (FV-B), whose velocity fields consist of horizontally layered structures with sharp velocity contrasts across flat interfaces; FlatFault-B (FF-B), whose velocity fields contain horizontally layered backgrounds disrupted by planar faults that introduce piecewise discontinuities; CurveVel-B (CV-B), whose velocity fields exhibit curved, dipping layers with smooth lateral velocity variations; and CurveFault-B (CF-B), whose velocity fields combine curved layered structures with complex curved fault networks. We compare DLO against the following baselines: standard FWI without regularization, FWI with Tikhonov regularization \[[4](https://arxiv.org/html/2606.14139#bib.bib5)\], FWI with total variation (TV) \[[18](https://arxiv.org/html/2606.14139#bib.bib2)\], DiffusionFWI \[[52](https://arxiv.org/html/2606.14139#bib.bib1)\], and RED-DiffEq \[[41](https://arxiv.org/html/2606.14139#bib.bib61)\], the state-of-the-art among diffusion-based FWI methods.

For each family we independently train a variance-preserving diffusion model on its training set following DDPM \[[21](https://arxiv.org/html/2606.14139#bib.bib12)\], and reuse the trained model for every diffusion-based method evaluated on that family. The network architecture, noise schedule, and training hyperparameters are detailed in Section [B.1](https://arxiv.org/html/2606.14139#A2.SS1). Fig. [3](https://arxiv.org/html/2606.14139#S4.F3) shows ground-truth velocity models from each family alongside samples generated by the corresponding diffusion prior, confirming that the learned generators reproduce the geological characteristics of the four families.

![Refer to caption](https://arxiv.org/html/2606.14139v1/figures/datasets_vs_generated.png)

Figure 3: Comparison between velocity models from the training set (left) and unconditionally generated samples from each pretrained diffusion model (right).

#### 4.2.1 Clean Seismic Data

We evaluate every method on $100$ velocity–waveform pairs drawn from the test set of each family; all reported metrics (defined in Section [4.1](https://arxiv.org/html/2606.14139#S4.SS1)) are averages over these $100$ samples within the corresponding subdataset. All optimization-based methods are initialized from a Gaussian-smoothed ground-truth velocity with standard deviation $\sigma=10$ and optimized for $300$ iterations for a fair comparison. For DiffusionFWI \[[52](https://arxiv.org/html/2606.14139#bib.bib1)\], we follow its official repository settings: the reverse process starts from the intermediate noise level $t=100$ and performs $10$ FWI optimization steps between consecutive diffusion steps. For RED-DiffEq \[[41](https://arxiv.org/html/2606.14139#bib.bib61)\], we adopt its original hyperparameter settings and optimize for $300$ iterations. The numerical scheme of the acoustic wave-equation forward solver is detailed in Appendix [B.2](https://arxiv.org/html/2606.14139#A2.SS2), and the hyperparameter settings of all these experiments are discussed in Appendix [B.3](https://arxiv.org/html/2606.14139#A2.SS3).

Fig. [4](https://arxiv.org/html/2606.14139#S4.F4) shows representative reconstructions across the four families, while Fig. [5](https://arxiv.org/html/2606.14139#S4.F5) provides a quantitative comparison of the metrics. Unregularized FWI produces large errors with non-geological discontinuous artifacts that violate the layered structure of the true models. TV and Tikhonov regularization partially alleviate these instabilities, but each has drawbacks. Tikhonov over-smooths the reconstructions. TV preserves sharp boundaries and performs quantitatively well on datasets whose gradients are naturally sparse, such as FV-B, yet introduces visible artifacts elsewhere. Neither classical regularizer is capable of recovering the fine-scale detail of complex velocity models.

Among the diffusion-based methods, DiffusionFWI embeds the prior effectively on simple layered structures but lacks the capacity to capture complex faults and fine-scale details; on the curved-boundary families it exhibits visible artifacts, and its inversion quality is sensitive to the smoothing hyperparameters in the sampling trajectory. RED-DiffEq produces inversion results that are closer to the ground truth with fewer artifacts, particularly on the CV-B and CF-B families. DLO preserves the prior more stably, likewise captures discontinuous features such as faults, and more accurately characterizes deep structures far from the surface; it delivers the best metrics on CV-B, CF-B, and FF-B, and on FV-B it ranks second only to DiffusionFWI.

![Refer to caption](https://arxiv.org/html/2606.14139v1/figures/velocity_compare.png)

Figure 4: OpenFWI clean-data qualitative comparison. Representative reconstructions across CF-B, FV-B, FF-B, and CV-B for the ground truth, the smoothed initial model, and every baseline alongside DLO.

![Refer to caption](https://arxiv.org/html/2606.14139v1/figures/clean_combined.png)

Figure 5: OpenFWI clean-data quantitative comparison. Normalized MAE, RMSE, and SSIM averaged over $100$ test samples per family. Within each metric block, the $2\times 2$ grid arranges CV-B (top-left), CF-B (top-right), FF-B (bottom-left), and FV-B (bottom-right). Error bars indicate one standard deviation. Arrows next to each metric name indicate the preferred direction ($\downarrow$ lower is better, $\uparrow$ higher is better).

#### 4.2.2 Noisy Seismic Data

We next assess robustness to additive measurement noise. Following \[[41](https://arxiv.org/html/2606.14139#bib.bib61)\], we perturb the seismic data with independent Gaussian noise at three standard deviations, $\sigma\in\{0.1,\,0.3,\,0.5\}$, corresponding to average per-sample signal-to-noise ratios of $24.2$ dB, $14.6$ dB, and $10.2$ dB across the $400$ test samples. All methods share the same hyperparameter settings and initialization as in the clean-data scenario.
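The noise model can be sketched as follows. The SNR definition used here, $10\log_{10}(\|d\|^{2}/\|n\|^{2})$ per sample, is an assumption, as the text does not spell it out; the function names and the seed handling are likewise illustrative.

```python
import math
import random


def add_gaussian_noise(clean, sigma, seed=0):
    """Add i.i.d. N(0, sigma^2) noise to a flat list of data samples."""
    rng = random.Random(seed)
    return [c + rng.gauss(0.0, sigma) for c in clean]


def snr_db(clean, noisy):
    """Signal-to-noise ratio in dB (assumed definition: signal power / noise power)."""
    signal_power = sum(c * c for c in clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy))
    return 10.0 * math.log10(signal_power / noise_power)


# Hypothetical trace standing in for a normalized shot gather.
trace = [math.sin(0.1 * i) for i in range(200)]
for sigma in (0.1, 0.3, 0.5):
    print(sigma, round(snr_db(trace, add_gaussian_noise(trace, sigma)), 2))
```

Under this definition, scaling $\sigma$ by a factor $c$ lowers the SNR by $20\log_{10}c$ dB, that is about $9.5$ dB for $c=3$ and $14.0$ dB for $c=5$, which is in line with the gaps between the reported values of $24.2$, $14.6$, and $10.2$ dB.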

The results for the Gaussian-noise scenario are summarized in Fig. [6](https://arxiv.org/html/2606.14139#S4.F6). Unregularized FWI and Tikhonov-regularized FWI degrade sharply with increasing $\sigma$, developing pronounced non-physical artifacts in regions of low data sensitivity. TV regularization substantially suppresses these artifacts but introduces a clearly visible staircase pattern, particularly on the smooth-gradient families, so the trade-off between artifact suppression and structural fidelity is unfavorable.

Among the diffusion-based methods, RED-DiffEq and DLO are markedly robust to noise: the learned prior suppresses noise-induced fluctuations that lie outside the geological manifold, and their metrics degrade only mildly even at $\sigma=0.5$. DiffusionFWI, in contrast, performs competitively under clean data and low noise, but as the noise level increases its reconstructions develop large-area artifacts and non-prior features that deviate from the geological structures learned by the diffusion model. DLO achieves the best inversion metrics in the vast majority of scenarios and noise levels. Under noise-corrupted conditions, it continues to balance data fidelity against the learned prior, faithfully recovering fine-scale velocity details and accurately reconstructing layered and complex boundary structures.

\(a\)

![Refer to caption](https://arxiv.org/html/2606.14139v1/figures/gaussiannoise_panel.png)

\(b\)

![Refer to caption](https://arxiv.org/html/2606.14139v1/figures/noise_sweep.png)

Figure 6: OpenFWI noisy-data results (Gaussian noise). (a) Representative single-sample reconstructions on CF-B at three noise levels ($\sigma=0.10,\,0.30,\,0.50$); the leftmost column shows the noisy shot gather, the second column the ground-truth velocity, and the remaining columns the reconstructions from each method. (b) Normalized MAE, RMSE, and SSIM as a function of the Gaussian noise standard deviation, averaged over $100$ test samples per family.

#### 4.2.3 Seismic Data with Missing Traces

We finally evaluate the methods under incomplete acquisitions in which a fixed subset of receivers is removed from every shot gather. We consider three settings corresponding to *slight*, *half*, and *most-missing* acquisitions: $10$, $35$, and $60$ of the $70$ traces are removed, which discards $14.3\%$, $50.0\%$, and $85.7\%$ of the available recordings, respectively; the missing receiver indices are kept identical across shots to mimic a realistic sensor-failure scenario.
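A minimal sketch of this acquisition mask is given below. How the missing receiver indices are chosen is not stated in the text; drawing them once with a seeded random sample is an assumption made here for illustration, as are the function names.

```python
import random

NUM_RECEIVERS = 70


def make_trace_mask(num_missing, num_receivers=NUM_RECEIVERS, seed=0):
    """Return a 0/1 mask over receivers; the same mask is reused for every shot."""
    rng = random.Random(seed)
    missing = set(rng.sample(range(num_receivers), num_missing))
    return [0.0 if r in missing else 1.0 for r in range(num_receivers)]


def apply_mask(shot_gather, mask):
    """Zero out missing traces; shot_gather is a list of time rows over receivers."""
    return [[a * m for a, m in zip(row, mask)] for row in shot_gather]


for num_missing in (10, 35, 60):
    mask = make_trace_mask(num_missing)
    discarded = 100.0 * mask.count(0.0) / len(mask)
    print(f"{num_missing} of {NUM_RECEIVERS} traces removed: {discarded:.1f}% discarded")
```

In an inversion, the same mask would also be applied to the simulated data inside the misfit, so that only recorded traces contribute to the data-fidelity gradient.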

Performance under missing traces is reported in Fig. [7](https://arxiv.org/html/2606.14139#S4.F7). Almost every method degrades monotonically as the fraction of missing traces grows, but the diffusion-based methods are noticeably more robust than the non-prior baselines. RED-DiffEq and DLO retain a coherent prior-consistent structure in the slight- and half-missing regimes, and even in the most-missing setting they avoid the conspicuous artifacts seen in the other methods while preserving the dominant geological features. Standard FWI and the classical-regularization baselines, in contrast, exhibit pronounced artifacts and a global loss of fine-scale detail across the velocity field as soon as the acquisition is meaningfully sparsified. Taken together with the noisy-data experiments of Section [4.2.2](https://arxiv.org/html/2606.14139#S4.SS2.SSS2), these results demonstrate that DLO remains effective on the more realistic acquisitions encountered in practice, including both heavily sparsified and noise-contaminated recordings.

\(a\)

![Refer to caption](https://arxiv.org/html/2606.14139v1/figures/missingtrace_panel.png)

\(b\)

![Refer to caption](https://arxiv.org/html/2606.14139v1/figures/missing_sweep.png)

Figure 7: OpenFWI missing-trace results. (a) Representative single-sample reconstructions on CF-B with $10$, $35$, and $60$ of the $70$ receivers removed; the leftmost column shows the masked shot gather, the second column the ground-truth velocity, and the remaining columns the reconstructions from each method. (b) Normalized MAE, RMSE, and SSIM as a function of the number of missing traces, averaged over $100$ test samples per family.

### 4.3 Sensitivity to Latent Initialization

DLO is a deterministic method whose optimization result depends only on the choice of hyperparameters and the initialization of the physical velocity $v$ and the latent variable $z$. To evaluate the sensitivity of DLO to the random initialization of the latent variable, we conduct an experiment on the OpenFWI test set. For each sample, we generate $20$ random initializations $z^{(0)}\sim\mathcal{N}(0,I)$; we take the ensemble mean over the $20$ runs as the final prediction and report the standard deviation across runs as a measure of initialization-induced uncertainty.

![Refer to caption](https://arxiv.org/html/2606.14139v1/x2.png)

Figure 8: Sensitivity of DLO to the latent initialization $z^{(0)}$. Each row shows one representative sample per OpenFWI dataset. Columns (left to right): ground truth; ensemble mean over $20$ independent $z^{(0)}$ draws; per-pixel ensemble standard deviation; absolute error between ensemble mean and ground truth. Uncertainty is localized at geological boundaries, while the bulk velocity structure is robustly recovered across all initializations.

Fig. [8](https://arxiv.org/html/2606.14139#S4.F8) indicates that the per-pixel ensemble standard deviation is primarily concentrated along fault interfaces and geological discontinuities, as well as in regions where the absolute error is larger, such as deeper layers with limited seismic illumination. This spatial correspondence suggests that the initialization-induced uncertainty is positively correlated with the inversion error. To quantify the relationship, we compute the per-pixel Spearman and Pearson correlation coefficients between the ensemble standard deviation and the absolute error of the ensemble mean, separately for each of the $40$ test samples. Across all samples, the mean Spearman rank correlation is $0.6454$ and the mean Pearson correlation is $0.6282$, both significantly positive ($p<10^{-26}$, one-sample $t$-test against zero). All $40$ samples yield positive correlations: pixels where the ensemble varies more across independent runs are also pixels where the prediction tends to deviate more from the ground truth.

![Refer to caption](https://arxiv.org/html/2606.14139v1/x3.png)

Figure 9: Pixel-level ensemble standard deviation versus absolute error, aggregated over all test samples per dataset. The high-density region exhibits a clear positive trend across all four geological families, further corroborating the per-sample correlation results.

Fig. [9](https://arxiv.org/html/2606.14139#S4.F9) shows the pixel-level joint distribution of ensemble standard deviation and absolute error for each dataset. The high-density region consistently follows a positive trend, which corroborates at the pixel level that larger initialization-induced uncertainty indicates larger reconstruction error. Across all 40 test samples, the mean within-sample standard deviations of RMSE, MAE, and SSIM are 0.0218, 0.0146, and 0.0285, respectively. These results confirm that DLO yields consistent reconstructions across different initializations of $z^{(0)}$, while the positively correlated ensemble uncertainty provides a reliable indicator of local reconstruction quality that requires no additional computation beyond the ensemble itself.
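The within-sample spread reported above can be computed along the following lines: each run for a sample is scored against the ground truth, the standard deviation of the scores is taken across runs, and that spread is averaged over samples. The sketch is illustrative and rests on assumptions: the helper names are hypothetical, only RMSE and MAE are shown (SSIM would be handled the same way with an SSIM implementation of choice), and the velocity normalization behind the paper's metric values is not reproduced here.

```python
import numpy as np

def rmse(pred, truth):
    return float(np.sqrt(np.mean((pred - truth) ** 2)))

def mae(pred, truth):
    return float(np.mean(np.abs(pred - truth)))

def within_sample_std(runs, truth, metric):
    """Standard deviation of a scalar metric across the runs of one sample."""
    scores = np.array([metric(r, truth) for r in runs])
    return float(scores.std())

def mean_within_sample_std(all_runs, all_truths, metric):
    """Average the per-sample spread over the whole test set."""
    spreads = [within_sample_std(r, t, metric)
               for r, t in zip(all_runs, all_truths)]
    return float(np.mean(spreads))
```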

### 4.4 Validation on the Marmousi and Overthrust Benchmarks

To probe generalization beyond the OpenFWI training distribution, we extend the experiments to the Marmousi \[[50](https://arxiv.org/html/2606.14139#bib.bib20)\] and Overthrust \[[2](https://arxiv.org/html/2606.14139#bib.bib21)\] velocity models. Marmousi is a 2D synthetic acoustic model built after the Cuanza basin (Angola), with strongly dipping curved layers cut by a dense fault network; the SEG/EAGE Overthrust model represents a thrust-belt setting with overthrust sheets and high-velocity carbonate units beneath an erosional surface. Both targets differ markedly from the OpenFWI prior families, so these experiments test whether our method extends to substantially more complex geology and whether out-of-distribution inversion is feasible with a diffusion prior reused as-is.

Representative reconstructions at $\sigma_{\mathrm{init}}=20$ on clean data are shown in Figs. [10](https://arxiv.org/html/2606.14139#S4.F10) and [11](https://arxiv.org/html/2606.14139#S4.F11), and the corresponding metrics are the clean column of Figs. [12](https://arxiv.org/html/2606.14139#S4.F12) and [13](https://arxiv.org/html/2606.14139#S4.F13). DLO faithfully recovers the geological character of both targets, restoring fault structures, dipping interfaces, and the discontinuous stratigraphy that the classical regularizers blur away, and it attains the best MAE, RMSE, and SSIM among all methods on both benchmarks. RED-DiffEq substantially outperforms the classical baselines, confirming that the patched diffusion regularizer transfers meaningfully out of distribution, while DiffusionFWI lies between RED-DiffEq and the classical methods. We further test the dependence on the smoothed initial model by sweeping $\sigma_{\mathrm{init}}\in\{24,28,32\}$ on noise-free data, and DLO exhibits clear insensitivity to the initialization, retaining the best metrics across every $\sigma_{\mathrm{init}}$ on both Marmousi and Overthrust.

We next assess robustness to additive Gaussian measurement noise by perturbing the seismic data with $\sigma\in\{0.1,0.3,0.5\}$ at $\sigma_{\mathrm{init}}=20$ without per-condition retuning (Figs. [12](https://arxiv.org/html/2606.14139#S4.F12) and [13](https://arxiv.org/html/2606.14139#S4.F13)). The classical baselines degrade markedly with increasing noise, developing large-amplitude artifacts in regions of low data sensitivity. DiffusionFWI lies between the classical baselines and the prior-regularized methods on both benchmarks. DLO and RED-DiffEq both display strong noise robustness, with metric curves that stay nearly flat across the full noise range; DLO essentially maintains the same MAE, RMSE, and SSIM level as RED-DiffEq as $\sigma$ grows, while retaining its advantage at the lower noise levels. Taken together, these experiments demonstrate that DLO generalizes well across diverse geological scenarios and preserves high-fidelity reconstructions under the kinds of data corruption encountered in practice, underscoring its potential for real-world seismic imaging applications.

![Refer to caption](https://arxiv.org/html/2606.14139v1/figures/marmousi_clean_panel.png)

Figure 10: Marmousi qualitative comparison ($\sigma_{\mathrm{init}}=20$, noise-free). Ground truth, smoothed initial model, and reconstructions of every method on the $70\times 190$ Marmousi benchmark. All diffusion-based methods reuse the CurveFault-B prior trained on OpenFWI; DLO and RED-DiffEq operate on three overlapping $70\times 70$ patches and DiffusionFWI uses a sliding-window adaptation, while the classical baselines act on the full rectangular field.

![Refer to caption](https://arxiv.org/html/2606.14139v1/figures/overthrust_clean_panel.png)

Figure 11: Overthrust qualitative comparison ($\sigma_{\mathrm{init}}=20$, noise-free). Ground truth, smoothed initial model, and reconstructions of every method on the $70\times 190$ Overthrust benchmark, under the same setup as Fig. [10](https://arxiv.org/html/2606.14139#S4.F10).

![Refer to caption](https://arxiv.org/html/2606.14139v1/figures/marmousi_sigma_sweep.png)

![Refer to caption](https://arxiv.org/html/2606.14139v1/figures/marmousi_noise_sweep.png)

Figure 12: Marmousi robustness sweep. Normalized MAE, RMSE, and SSIM as a function of the initialization-smoothing kernel $\sigma_{\mathrm{init}}\in\{20,24,28,32\}$ on noise-free data (top) and as a function of the additive Gaussian-noise standard deviation $\sigma\in\{0,0.1,0.3,0.5\}$ at $\sigma_{\mathrm{init}}=20$ (bottom).

![Refer to caption](https://arxiv.org/html/2606.14139v1/figures/overthrust_sigma_sweep.png)

![Refer to caption](https://arxiv.org/html/2606.14139v1/figures/overthrust_noise_sweep.png)

Figure 13: Overthrust robustness sweep. Same sweep as Fig. [12](https://arxiv.org/html/2606.14139#S4.F12), for the Overthrust benchmark.

## 5 Conclusion

We proposed Decoupled Latent Optimization \(DLO\), a framework for embedding pretrained diffusion priors into PDE\-governed inverse problems\. DLO relaxes the standard latent\-optimization formulation into a quadratic\-penalty objective that decouples the physical inversion from the diffusion sampler\. This decoupling isolates the data\-fidelity gradient in physical space, removes the need to backpropagate through the composed PDE\-solver–sampler chain, and permits conventional physical\-space initialization\.

On the OpenFWI benchmark, DLO consistently outperforms classical regularizers and existing diffusion-based methods under clean, noisy, and missing-trace acquisitions. The prior, trained solely on the $70\times 70$ OpenFWI velocity models, transfers without modification to the substantially larger Marmousi and Overthrust benchmarks, where DLO recovers intricate faults and dipping layers and remains robust to initialization smoothing and measurement noise.

The decoupled latent\-optimization principle applies broadly to PDE\-governed inverse problems for which a pretrained generative prior of the unknown field is available\. Extending the framework to three\-dimensional geometries and to priors that capture richer geological complexity are natural directions for future work\.

## Acknowledgments

Zheng Ma is supported by NSFC Grant No. 12531016 and Beijing Institute of Applied Physics and Computational Mathematics funding HX02023-6. We also thank the Shanghai Institute for Mathematics and Interdisciplinary Sciences (SIMIS) for its financial support. This research was funded by SIMIS under grant number SIMIS-ID-2025-ST. The authors are grateful for the resources and facilities provided by SIMIS, which were essential for the completion of this work.

## References

- \[1\]H\. Aghamiry, A\. Gholami, and S\. Operto\(2018\)Hybrid tikhonov\+ total\-variation regularization for imaging large\-contrast media by full\-waveform inversion\.InSEG International Exposition and Annual Meeting,pp\. SEG–2018\.Cited by:[§1](https://arxiv.org/html/2606.14139#S1.p3.1),[§2\.1](https://arxiv.org/html/2606.14139#S2.SS1.p5.1)\.
- \[2\]F\. Aminzadeh, J\. Brac, and T\. Kunz\(1997\)3\-d salt and overthrust models, seg/eage 3\-d model\. ser\. 1\.Society of Exploration Geophysicists, Tulsa, OK\.Cited by:[§1](https://arxiv.org/html/2606.14139#S1.p7.1),[§4\.4](https://arxiv.org/html/2606.14139#S4.SS4.p1.1),[§4](https://arxiv.org/html/2606.14139#S4.p1.1)\.
- \[3\]B\. D\.O\. Anderson\(1982\)Reverse\-time diffusion equation models\.Stochastic Processes and their Applications12\(3\),pp\. 313–326\.External Links:ISSN 0304\-4149,[Document](https://dx.doi.org/10.1016/0304-4149%2882%2990051-5),[Link](https://www.sciencedirect.com/science/article/pii/0304414982900515)Cited by:[Appendix A](https://arxiv.org/html/2606.14139#A1.p2.19),[§2\.2](https://arxiv.org/html/2606.14139#S2.SS2.p3.1)\.
- \[4\]A\. Asnaashari, R\. Brossier, S\. Garambois, F\. Audebert, P\. Thore, and J\. Virieux\(2013\)Regularized seismic full waveform inversion with prior model information\.Geophysics78\(2\),pp\. R25–R36\.Cited by:[§1](https://arxiv.org/html/2606.14139#S1.p3.1),[§2\.1](https://arxiv.org/html/2606.14139#S2.SS1.p5.1),[§4\.2](https://arxiv.org/html/2606.14139#S4.SS2.p1.1)\.
- \[5\]H\. Ben\-Hamu, O\. Puny, I\. Gat, B\. Karrer, U\. Singer, and Y\. Lipman\(2024\)D\-flow: differentiating through flows for controlled generation\.arXiv preprint arXiv:2402\.14017\.Cited by:[§2\.3](https://arxiv.org/html/2606.14139#S2.SS3.SSS0.Px3.p1.4)\.
- \[6\]A\. Berkhout\(1997\)Pushing the limits of seismic imaging, part ii: integration of prestack migration, velocity estimation, and avo analysis\.Geophysics62\(3\),pp\. 954–969\.Cited by:[§1](https://arxiv.org/html/2606.14139#S1.p1.1)\.
- \[7\]D\. P\. Bertsekas\(1999\)Nonlinear programming\.2nd edition,Athena Scientific,Belmont, MA\.Cited by:[§3\.2](https://arxiv.org/html/2606.14139#S3.SS2.p1.6),[§3\.2](https://arxiv.org/html/2606.14139#S3.SS2.p3.4),[§3](https://arxiv.org/html/2606.14139#S3.p2.3)\.
- \[8\]E\. Bozdağ, J\. Trampert, and J\. Tromp\(2011\)Misfit functions for full waveform inversion based on instantaneous phase and envelope measurements\.Geophysical Journal International185\(2\),pp\. 845–870\.Cited by:[§1](https://arxiv.org/html/2606.14139#S1.p3.1)\.
- \[9\]C\. Bunks, F\. M\. Saleck, S\. Zaleski, and G\. Chavent\(1995\)Multiscale seismic waveform inversion\.Geophysics60\(5\),pp\. 1457–1473\.Cited by:[§1](https://arxiv.org/html/2606.14139#S1.p3.1)\.
- \[10\]B\. Chi, L\. Dong, and Y\. Liu\(2014\)Full waveform inversion method using envelope objective function without low frequency data\.Journal of Applied Geophysics109,pp\. 36–46\.Cited by:[§1](https://arxiv.org/html/2606.14139#S1.p3.1)\.
- \[11\]J\. Choi, S\. Kim, Y\. Jeong, Y\. Gwon, and S\. Yoon\(2021\)Ilvr: conditioning method for denoising diffusion probabilistic models\.arXiv preprint arXiv:2108\.02938\.Cited by:[§1](https://arxiv.org/html/2606.14139#S1.p5.2)\.
- \[12\]H\. Chung, J\. Kim, M\. T\. Mccann, M\. L\. Klasky, and J\. C\. Ye\(2022\)Diffusion posterior sampling for general noisy inverse problems\.arXiv preprint arXiv:2209\.14687\.Cited by:[§1](https://arxiv.org/html/2606.14139#S1.p5.2),[§2\.3](https://arxiv.org/html/2606.14139#S2.SS3.SSS0.Px1.p1.1)\.
- \[13\]F\. Clement, G\. Chavent, and S\. Gómez\(2001\)Migration\-based traveltime waveform inversion of 2\-d simple structures: a synthetic example\.Geophysics66\(3\),pp\. 845–860\.Cited by:[§1](https://arxiv.org/html/2606.14139#S1.p1.1)\.
- \[14\]M\. Dashti and A\. M\. Stuart\(2013\)The bayesian approach to inverse problems\.arXiv preprint arXiv:1302\.6989\.Cited by:[§1](https://arxiv.org/html/2606.14139#S1.p1.1)\.
- \[15\]C\. Deng, S\. Feng, H\. Wang, X\. Zhang, P\. Jin, Y\. Feng, Q\. Zeng, Y\. Chen, and Y\. Lin\(2022\)OpenFWI: large\-scale multi\-structural benchmark datasets for full waveform inversion\.Advances in Neural Information Processing Systems35,pp\. 6007–6020\.Cited by:[§B\.2](https://arxiv.org/html/2606.14139#A2.SS2.p1.2),[§1](https://arxiv.org/html/2606.14139#S1.p7.1),[§4\.2](https://arxiv.org/html/2606.14139#S4.SS2.p1.1),[§4](https://arxiv.org/html/2606.14139#S4.p1.1)\.
- \[16\]Z\. Dou and Y\. Song\(2024\)Diffusion posterior sampling for linear inverse problem solving: a filtering perspective\.InThe Twelfth International Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2606.14139#S1.p5.2)\.
- \[17\]B\. Efron\(2011\)Tweedie’s formula and selection bias\.Journal of the American Statistical Association106\(496\),pp\. 1602–1614\.Cited by:[Appendix A](https://arxiv.org/html/2606.14139#A1.p2.23)\.
- \[18\]E\. Esser, L\. Guasch, T\. van Leeuwen, A\. Y\. Aravkin, and F\. J\. Herrmann\(2018\)Total variation regularization strategies in full\-waveform inversion\.SIAM Journal on Imaging Sciences11\(1\),pp\. 376–406\.Cited by:[§1](https://arxiv.org/html/2606.14139#S1.p3.1),[§2\.1](https://arxiv.org/html/2606.14139#S2.SS1.p5.1),[§4\.2](https://arxiv.org/html/2606.14139#S4.SS2.p1.1)\.
- \[19\]A\. Fichtner\(2010\)Full seismic waveform modelling and inversion\.Springer Science & Business Media\.Cited by:[§1](https://arxiv.org/html/2606.14139#S1.p1.1)\.
- \[20\]O\. Gauthier, J\. Virieux, and A\. Tarantola\(1986\)Two\-dimensional nonlinear inversion of seismic waveforms; numerical results\.geophysics51\(7\),pp\. 1387–1403\.Cited by:[§1](https://arxiv.org/html/2606.14139#S1.p2.1)\.
- \[21\]J\. Ho, A\. Jain, and P\. Abbeel\(2020\)Denoising diffusion probabilistic models\.Advances in neural information processing systems33,pp\. 6840–6851\.Cited by:[Appendix A](https://arxiv.org/html/2606.14139#A1.p2.2),[§1](https://arxiv.org/html/2606.14139#S1.p5.2),[§2\.2](https://arxiv.org/html/2606.14139#S2.SS2.p1.2),[§4\.2](https://arxiv.org/html/2606.14139#S4.SS2.p2.1)\.
- \[22\]X\. Huang, F\. Wang, and T\. Alkhalifah\(2025\)Physics\-informed waveform inversion using pretrained wavefield neural operators\.IEEE Transactions on Geoscience and Remote Sensing\.Cited by:[§1](https://arxiv.org/html/2606.14139#S1.p4.1)\.
- \[23\]J\. Hudson and J\. Heritage\(1981\)The use of the born approximation in seismic scattering problems\.Geophysical Journal International66\(1\),pp\. 221–240\.Cited by:[§1](https://arxiv.org/html/2606.14139#S1.p1.1)\.
- \[24\]M\. Jakobsen and B\. Ursin\(2015\)Full waveform inversion in the frequency domain using direct iterative t\-matrix methods\.Journal of Geophysics and Engineering12\(3\),pp\. 400–418\.Cited by:[§1](https://arxiv.org/html/2606.14139#S1.p1.1)\.
- \[25\]A\. Jiao, H\. He, R\. Ranade, J\. Pathak, and L\. Lu\(2021\)One\-shot learning for solution operators of partial differential equations\.arXiv preprint arXiv:2104\.05512\.Cited by:[§1](https://arxiv.org/html/2606.14139#S1.p2.1)\.
- \[26\]G\. E\. Karniadakis, I\. G\. Kevrekidis, L\. Lu, P\. Perdikaris, S\. Wang, and L\. Yang\(2021\)Physics\-informed machine learning\.Nature Reviews Physics3\(6\),pp\. 422–440\.Cited by:[§1](https://arxiv.org/html/2606.14139#S1.p4.1)\.
- \[27\]B\. Kawar, M\. Elad, S\. Ermon, and J\. Song\(2022\)Denoising diffusion restoration models\.Advances in neural information processing systems35,pp\. 23593–23606\.Cited by:[§1](https://arxiv.org/html/2606.14139#S1.p5.2)\.
- \[28\]V\. Kazei, O\. Ovcharenko, P\. Plotnitskii, D\. Peter, X\. Zhang, and T\. Alkhalifah\(2021\)Mapping full seismic waveforms to vertical velocity profiles by deep learning\.Geophysics86\(5\),pp\. R711–R721\.Cited by:[§1](https://arxiv.org/html/2606.14139#S1.p4.1)\.
- \[29\]Y\. Lin and L\. Huang\(2014\)Acoustic\-and elastic\-waveform inversion using a modified total\-variation regularization scheme\.Geophysical Journal International200\(1\),pp\. 489–502\.Cited by:[§1](https://arxiv.org/html/2606.14139#S1.p3.1),[§2\.1](https://arxiv.org/html/2606.14139#S2.SS1.p5.1)\.
- \[30\]L\. Lu, P\. Jin, G\. Pang, Z\. Zhang, and G\. E\. Karniadakis\(2021\)Learning nonlinear operators via deeponet based on the universal approximation theorem of operators\.Nature machine intelligence3\(3\),pp\. 218–229\.Cited by:[§1](https://arxiv.org/html/2606.14139#S1.p4.1)\.
- \[31\]L\. Lu, X\. Meng, Z\. Mao, and G\. E\. Karniadakis\(2021\)DeepXDE: a deep learning library for solving differential equations\.SIAM review63\(1\),pp\. 208–228\.Cited by:[§1](https://arxiv.org/html/2606.14139#S1.p4.1)\.
- \[32\]L\. Lu, R\. Pestourie, W\. Yao, Z\. Wang, F\. Verdugo, and S\. G\. Johnson\(2021\)Physics\-informed neural networks with hard constraints for inverse design\.SIAM Journal on Scientific Computing43\(6\),pp\. B1105–B1132\.Cited by:[§1](https://arxiv.org/html/2606.14139#S1.p4.1)\.
- \[33\]M\. Mardani, J\. Song, J\. Kautz, and A\. Vahdat\(2024\)A variational perspective on solving inverse problems with diffusion models\.InInternational Conference on Learning Representations,Vol\.2024,pp\. 28027–28053\.Cited by:[§1](https://arxiv.org/html/2606.14139#S1.p5.2),[§2\.3](https://arxiv.org/html/2606.14139#S2.SS3.SSS0.Px2.p1.15)\.
- \[34\]L\. Métivier, R\. Brossier, Q\. Mérigot, E\. Oudet, and J\. Virieux\(2016\)Measuring the misfit between seismograms using an optimal transport distance: application to full waveform inversion\.Geophysical Supplements to the Monthly Notices of the Royal Astronomical Society205\(1\),pp\. 345–377\.Cited by:[§1](https://arxiv.org/html/2606.14139#S1.p2.1),[§1](https://arxiv.org/html/2606.14139#S1.p3.1)\.
- \[35\]K\. Muhumuza, M\. Jakobsen, T\. Luostari, and T\. Lähivaara\(2018\)Seismic monitoring of co2 injection using a distorted born t\-matrix approach in acoustic approximation\.J\. Seism\. Explor27,pp\. 403–431\.Cited by:[§1](https://arxiv.org/html/2606.14139#S1.p1.1)\.
- \[36\]J\. Peng, E\. Jiang, Z\. Ma, and X\. Yan\(2026\)Robust physics\-guided diffusion for full\-waveform inversion\.arXiv preprint arXiv:2603\.16393\.Cited by:[§1](https://arxiv.org/html/2606.14139#S1.p5.2)\.
- \[37\]A\. Pladys, R\. Brossier, Y\. Li, and L\. Métivier\(2021\)On cycle\-skipping and misfit function modification for full\-wave inversion: comparison of five recent approaches\.Geophysics86\(4\),pp\. R563–R587\.Cited by:[§1](https://arxiv.org/html/2606.14139#S1.p2.1),[§2\.1](https://arxiv.org/html/2606.14139#S2.SS1.p4.1)\.
- \[38\]R\. G\. Pratt\(1999\)Seismic waveform inversion in the frequency domain; part 1, theory and verification in a physical scale model\.Geophysics64\(3\),pp\. 888–901\.Cited by:[§1](https://arxiv.org/html/2606.14139#S1.p2.1)\.
- \[39\]M\. Sambridge and K\. Mosegaard\(2002\)Monte carlo methods in geophysical inverse problems\.Reviews of Geophysics40\(3\),pp\. 3–1\.Cited by:[§1](https://arxiv.org/html/2606.14139#S1.p1.1)\.
- \[40\]S\. Shan, P\. Wang, S\. Chen, J\. Liu, C\. Xu, and S\. Cai\(2026\)Pird: physics\-informed residual diffusion for flow field reconstruction\.Acta Mechanica Sinica42\(7\),pp\. 725259\.Cited by:[§1](https://arxiv.org/html/2606.14139#S1.p5.2)\.
- \[41\]S\. Shan, M\. Zhu, Y\. Lin, and L\. Lu\(2026\)Regularization by denoising diffusion models for solving inverse pde problems with application to full waveform inversion\.Communications Physics\.Cited by:[§B\.3](https://arxiv.org/html/2606.14139#A2.SS3.p1.13),[§B\.4](https://arxiv.org/html/2606.14139#A2.SS4.SSS0.Px3.p1.1),[§1](https://arxiv.org/html/2606.14139#S1.p5.2),[§2\.3](https://arxiv.org/html/2606.14139#S2.SS3.SSS0.Px2.p1.15),[§4\.2\.1](https://arxiv.org/html/2606.14139#S4.SS2.SSS1.p1.7),[§4\.2\.2](https://arxiv.org/html/2606.14139#S4.SS2.SSS2.p1.5),[§4\.2](https://arxiv.org/html/2606.14139#S4.SS2.p1.1)\.
- \[42\]D\. Shu, Z\. Li, and A\. B\. Farimani\(2023\)A physics\-informed diffusion model for high\-fidelity flow field reconstruction\.Journal of Computational Physics478,pp\. 111972\.Cited by:[§1](https://arxiv.org/html/2606.14139#S1.p5.2)\.
- \[43\]J\. Song, C\. Meng, and S\. Ermon\(2020\)Denoising diffusion implicit models\.arXiv preprint arXiv:2010\.02502\.Cited by:[Appendix A](https://arxiv.org/html/2606.14139#A1.p2.31),[§1](https://arxiv.org/html/2606.14139#S1.p5.2),[§2\.2](https://arxiv.org/html/2606.14139#S2.SS2.p4.3)\.
- \[44\]J\. Song, A\. Vahdat, M\. Mardani, and J\. Kautz\(2023\)Pseudoinverse\-guided diffusion models for inverse problems\.InInternational conference on learning representations,Cited by:[§1](https://arxiv.org/html/2606.14139#S1.p5.2),[§2\.3](https://arxiv.org/html/2606.14139#S2.SS3.SSS0.Px1.p1.1)\.
- \[45\]Y\. Song, J\. Sohl\-Dickstein, D\. P\. Kingma, A\. Kumar, S\. Ermon, and B\. Poole\(2020\)Score\-based generative modeling through stochastic differential equations\.arXiv preprint arXiv:2011\.13456\.Cited by:[Appendix A](https://arxiv.org/html/2606.14139#A1.p2.2),[Appendix A](https://arxiv.org/html/2606.14139#A1.p2.22),[§1](https://arxiv.org/html/2606.14139#S1.p5.2),[§2\.2](https://arxiv.org/html/2606.14139#S2.SS2.p1.2),[§2\.2](https://arxiv.org/html/2606.14139#S2.SS2.p2.2),[§2\.2](https://arxiv.org/html/2606.14139#S2.SS2.p4.3)\.
- \[46\]A\. Tarantola\(1984\)Inversion of seismic reflection data in the acoustic approximation\.Geophysics49\(8\),pp\. 1259–1266\.Cited by:[§1](https://arxiv.org/html/2606.14139#S1.p1.1)\.
- \[47\]A\. Tarantola\(2005\)Inverse problem theory and methods for model parameter estimation\.SIAM\.Cited by:[§B\.2](https://arxiv.org/html/2606.14139#A2.SS2.p1.2),[§1](https://arxiv.org/html/2606.14139#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.14139#S2.SS1.p1.1),[§2\.1](https://arxiv.org/html/2606.14139#S2.SS1.p3.5),[§3\.1](https://arxiv.org/html/2606.14139#S3.SS1.p1.6)\.
- \[48\]M\. H\. Taufik and T\. Alkhalifah\(2025\)Diffusion model\-based posterior sampling in full waveform inversion\.arXiv preprint arXiv:2512\.12797\.Cited by:[§1](https://arxiv.org/html/2606.14139#S1.p5.2)\.
- \[49\]M\. H\. Taufik and T\. Alkhalifah\(2025\)Wavenumber\-aware diffusion sampling to regularize multiparameter elastic full waveform inversion\.Geophysical Journal International240\(2\),pp\. 1215–1233\.Cited by:[§1](https://arxiv.org/html/2606.14139#S1.p5.2),[§2\.3](https://arxiv.org/html/2606.14139#S2.SS3.SSS0.Px1.p1.4)\.
- \[50\]R\. Versteeg\(1994\)The marmousi experience: velocity model determination on a synthetic complex data set\.The Leading Edge13\(9\),pp\. 927–936\.Cited by:[§1](https://arxiv.org/html/2606.14139#S1.p7.1),[§4\.4](https://arxiv.org/html/2606.14139#S4.SS4.p1.1),[§4](https://arxiv.org/html/2606.14139#S4.p1.1)\.
- \[51\]J\. Virieux and S\. Operto\(2009\)An overview of full\-waveform inversion in exploration geophysics\.GEOPHYSICS74\(6\)\.Cited by:[§1](https://arxiv.org/html/2606.14139#S1.p1.1),[§1](https://arxiv.org/html/2606.14139#S1.p2.1),[§2\.1](https://arxiv.org/html/2606.14139#S2.SS1.p1.1)\.
- \[52\]F\. Wang, X\. Huang, and T\. A\. Alkhalifah\(2023\)A prior regularized full waveform inversion using generative diffusion models\.IEEE transactions on geoscience and remote sensing61,pp\. 1–11\.Cited by:[§B\.3](https://arxiv.org/html/2606.14139#A2.SS3.p2.4),[§B\.4](https://arxiv.org/html/2606.14139#A2.SS4.SSS0.Px4.p1.1),[§1](https://arxiv.org/html/2606.14139#S1.p5.2),[§2\.3](https://arxiv.org/html/2606.14139#S2.SS3.SSS0.Px1.p1.4),[§4\.2\.1](https://arxiv.org/html/2606.14139#S4.SS2.SSS1.p1.7),[§4\.2](https://arxiv.org/html/2606.14139#S4.SS2.p1.1)\.
- \[53\]H\. Wang, X\. Zhang, T\. Li, Y\. Wan, T\. Chen, and J\. Sun\(2024\)Dmplug: a plug\-in method for solving inverse problems with diffusion models\.Advances in Neural Information Processing Systems37,pp\. 117881–117916\.Cited by:[§2\.3](https://arxiv.org/html/2606.14139#S2.SS3.SSS0.Px3.p1.4)\.
- \[54\]S\. Wang, Z\. Dou, S\. Shan, T\. Liu, and L\. Lu\(2026\)Fundiff: diffusion models over function spaces for physics\-informed generative modeling\.Nature Communications\.Cited by:[§1](https://arxiv.org/html/2606.14139#S1.p5.2)\.
- \[55\]Z\. Wang, A\. C\. Bovik, H\. R\. Sheikh, and E\. P\. Simoncelli\(2004\)Image quality assessment: from error visibility to structural similarity\.IEEE transactions on image processing13\(4\),pp\. 600–612\.Cited by:[§4\.1](https://arxiv.org/html/2606.14139#S4.SS1.p1.7)\.
- \[56\]M\. Warner and L\. Guasch\(2016\)Adaptive waveform inversion: theory\.Geophysics81\(6\),pp\. R429–R445\.Cited by:[§1](https://arxiv.org/html/2606.14139#S1.p3.1)\.
- \[57\]M\. Warner, A\. Ratcliffe, T\. Nangoo, J\. Morgan, A\. Umpleby, N\. Shah, V\. Vinje, I\. Štekl, L\. Guasch, C\. Win,et al\.\(2013\)Anisotropic 3d full\-waveform inversion\.Geophysics78\(2\),pp\. R59–R80\.Cited by:[§1](https://arxiv.org/html/2606.14139#S1.p1.1)\.
- \[58\]Y\. Wu and Y\. Lin\(2019\)InversionNet: an efficient and accurate data\-driven full waveform inversion\.IEEE Transactions on Computational Imaging6,pp\. 419–433\.Cited by:[§1](https://arxiv.org/html/2606.14139#S1.p4.1)\.
- \[59\]Y\. Xie, H\. Chauris, and N\. Desassis\(2025\)Diffusion prior as a direct regularization term for fwi\.arXiv preprint arXiv:2506\.10141\.Cited by:[§1](https://arxiv.org/html/2606.14139#S1.p5.2)\.
- \[60\]X\. Yan, K\. Wu, Z\. J\. Xu, and Z\. Ma\(2023\)An unsupervised deep learning approach for the wave equation inverse problem\.arXiv preprint arXiv:2311\.04531\.Cited by:[§1](https://arxiv.org/html/2606.14139#S1.p4.1)\.
- \[61\]Y\. Yang, B\. Engquist, J\. Sun, and B\. F\. Hamfeldt\(2018\)Application of optimal transport and the quadratic wasserstein metric to full\-waveform inversion\.Geophysics83\(1\),pp\. R43–R62\.Cited by:[§1](https://arxiv.org/html/2606.14139#S1.p3.1)\.
- \[62\]G\. Yao, N\. V\. da Silva, M\. Warner, D\. Wu, and C\. Yang\(2019\)Tackling cycle skipping in full\-waveform inversion with intermediate data\.Geophysics84\(3\),pp\. R411–R427\.Cited by:[§1](https://arxiv.org/html/2606.14139#S1.p2.1)\.
- \[63\]A\. Yazdani, L\. Lu, M\. Raissi, and G\. E\. Karniadakis\(2020\)Systems biology informed deep learning for inferring parameters and hidden dynamics\.PLoS computational biology16\(11\),pp\. e1007575\.Cited by:[§1](https://arxiv.org/html/2606.14139#S1.p2.1)\.
- \[64\]C\. A\. Zelt and R\. Smith\(1992\)Seismic traveltime inversion for 2\-d crustal velocity structure\.Geophysical journal international108\(1\),pp\. 16–34\.Cited by:[§1](https://arxiv.org/html/2606.14139#S1.p1.1)\.
- \[65\]Z\. Zhang, E\. Saygin, L\. He, and T\. Alkhalifah\(2021\)Rayleigh wave dispersion spectrum inversion across scales\.Surveys in Geophysics42\(6\),pp\. 1281–1303\.Cited by:[§1](https://arxiv.org/html/2606.14139#S1.p1.1)\.
- \[66\]Z\. Zhang, Z\. Wu, Z\. Wei, J\. Mei, R\. Huang, and P\. Wang\(2020\)FWI imaging: full\-wavefield imaging through full\-waveform inversion\.InSEG International Exposition and Annual Meeting,pp\. D031S027R004\.Cited by:[§1](https://arxiv.org/html/2606.14139#S1.p2.1)\.
- \[67\]Z\. Zhang and Y\. Lin\(2020\)Data\-driven seismic waveform inversion: a study on the robustness and generalization\.IEEE Transactions on Geoscience and Remote sensing58\(10\),pp\. 6900–6913\.Cited by:[§1](https://arxiv.org/html/2606.14139#S1.p4.1)\.
- \[68\]Z\. Zhang, Y\. Wu, Z\. Zhou, and Y\. Lin\(2019\)VelocityGAN: subsurface velocity image estimation using conditional adversarial networks\.In2019 IEEE Winter Conference on Applications of Computer Vision \(WACV\),pp\. 705–714\.Cited by:[§1](https://arxiv.org/html/2606.14139#S1.p4.1)\.
- \[69\]M\. Zhu, S\. Feng, Y\. Lin, and L\. Lu\(2023\)Fourier\-deeponet: fourier\-enhanced deep operator networks for full waveform inversion with improved accuracy, generalizability, and robustness\.Computer Methods in Applied Mechanics and Engineering416,pp\. 116300\.Cited by:[§1](https://arxiv.org/html/2606.14139#S1.p4.1)\.

## Appendix A Diffusion Model Preliminaries

The essential definitions of the variance-preserving marginal, the denoiser $D_{\theta}$, and the deterministic DDIM sampler $R_{\theta}$ are summarized in Section [2.2](https://arxiv.org/html/2606.14139#S2.SS2) of the main text. This appendix provides the complete mathematical derivations for reference: the stochastic differential equation, the probability-flow ODE, Tweedie's identity, and the DDPM training objective.

A diffusion model defines a continuous family of Gaussian-corrupted marginals $\{p_{t}\}_{t\in[0,T]}$ obtained by progressively adding noise to clean samples $x_{0}\sim p_{\mathrm{data}}$. Under the variance-preserving (VP) corruption \[[21](https://arxiv.org/html/2606.14139#bib.bib12),[45](https://arxiv.org/html/2606.14139#bib.bib14)\],

$$x_{t}=\sqrt{\bar{\alpha}_{t}}\,x_{0}+\sqrt{1-\bar{\alpha}_{t}}\,\epsilon,\qquad\epsilon\sim\mathcal{N}(0,I),\tag{37}$$

where $\bar{\alpha}_{t}\in(0,1]$ decreases monotonically from $\bar{\alpha}_{0}=1$ to $\bar{\alpha}_{T}\approx 0$, so $p_{T}\approx\mathcal{N}(0,I)$. Conditional on $x_{0}$, $x_{t}$ is Gaussian, so the marginal density of $x_{t}$ is the convolution of $p_{\mathrm{data}}$ with a rescaled Gaussian kernel,

$$p_{t}(x)=\int p_{\mathrm{data}}(x_{0})\,\mathcal{N}\!\bigl(x;\sqrt{\bar{\alpha}_{t}}\,x_{0},\,(1-\bar{\alpha}_{t})\,I\bigr)\,\mathrm{d}x_{0}.\tag{38}$$

Viewing $\bar{\alpha}_{t}$ as a smooth function of $t$, the family $p_{t}(x)$ satisfies a drift–diffusion equation in $x$ and $t$,

$$\partial_{t}p_{t}(x)=\tfrac{1}{2}\beta(t)\bigl[\nabla\!\cdot\!\bigl(x\,p_{t}(x)\bigr)+\nabla^{2}p_{t}(x)\bigr],\qquad\beta(t)=-\frac{\dot{\bar{\alpha}}_{t}}{\bar{\alpha}_{t}}.\tag{39}$$

The forward diffusion that induces these marginals is the Itô SDE with drift $-\tfrac{1}{2}\beta(t)\,x$ and diffusion coefficient $\sqrt{\beta(t)}$; sampling reverses this process, and by Anderson's theorem \[[3](https://arxiv.org/html/2606.14139#bib.bib69)\] the reverse-time dynamics that transport the standard Gaussian $p_{T}$ back to the data distribution $p_{\mathrm{data}}$ are governed by the sampling SDE

$$\mathrm{d}x=\Bigl[-\tfrac{1}{2}\beta(t)\,x-\beta(t)\,\nabla_{\!x}\log p_{t}(x)\Bigr]\mathrm{d}t+\sqrt{\beta(t)}\,\mathrm{d}\bar{w},\tag{40}$$

integrated backwards from $t=T$ to $t=0$, with $\bar{w}$ a reverse-time Wiener process. The Fokker–Planck equation of this sampling SDE yields exactly the same form as Eq. ([39](https://arxiv.org/html/2606.14139#A1.E39)), so it traverses the marginals ([38](https://arxiv.org/html/2606.14139#A1.E38)) in reverse. The same marginals are reproduced by the equivalent deterministic probability-flow ordinary differential equation \[[45](https://arxiv.org/html/2606.14139#bib.bib14)\]

$$\mathrm{d}x=\Bigl[-\tfrac{1}{2}\beta(t)\,x-\tfrac{1}{2}\beta(t)\,\nabla_{\!x}\log p_{t}(x)\Bigr]\mathrm{d}t,\tag{41}$$

which provides a deterministic sampler sharing the same marginals. Both the SDE and the ODE require the score $\nabla_{\!x}\log p_{t}(x)$ of every noisy marginal, which by Tweedie's identity \[[17](https://arxiv.org/html/2606.14139#bib.bib22)\] admits the conditional-expectation form

$$\nabla_{\!x_{t}}\log p_{t}(x_{t})=\frac{\sqrt{\bar{\alpha}_{t}}\,\mathbb{E}[x_{0}\mid x_{t}]-x_{t}}{1-\bar{\alpha}_{t}}.\tag{42}$$

The conditional expectation $\mathbb{E}[x_{0}\mid x_{t}]$ is approximated by a neural denoiser $D_{\theta}(x_{t},t)$ trained by the standard denoising objective

$$\min_{\theta}\; \mathbb{E}_{x_0\sim p_{\mathrm{data}},\; t\sim\mathcal{U}(0,T),\; \epsilon\sim\mathcal{N}(0,I)}\bigl[\,\omega(t)\,\|D_\theta(x_t,t)-x_0\|^{2}\,\bigr], \qquad x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \tag{43}$$

which, substituted into Eq. ([42](https://arxiv.org/html/2606.14139#A1.E42)), supplies a learned score for the SDE ([40](https://arxiv.org/html/2606.14139#A1.E40)) and the ODE ([41](https://arxiv.org/html/2606.14139#A1.E41)). We adopt the deterministic DDIM update \[[43](https://arxiv.org/html/2606.14139#bib.bib13)\], a discretization of the probability-flow ODE ([41](https://arxiv.org/html/2606.14139#A1.E41)),

$$x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\,D_\theta(x_t,t) + \sqrt{1-\bar{\alpha}_{t-1}}\,\frac{x_t - \sqrt{\bar{\alpha}_t}\,D_\theta(x_t,t)}{\sqrt{1-\bar{\alpha}_t}}, \tag{44}$$

which, iterated from $x_T = z \sim \mathcal{N}(0,I)$ down to $x_0$, defines a deterministic sampler $R_\theta : z \mapsto x_0$ that transports the standard Gaussian to (an approximation of) $p_{\mathrm{data}}$. We treat $R_\theta$ as the learned generator throughout the paper.
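The update in Eq. (44) can be sketched in a few lines of plain Python. This is an illustrative sketch, not the paper's implementation: the function names, the evenly spaced timestep selection, the convention $\bar{\alpha}_0 = 1$, and the `denoiser` callable (a stand-in for $D_\theta$) are all assumptions made for the example.

```python
import math

def linear_alpha_bar(T=1000, beta_1=1e-4, beta_T=2e-2):
    """Return alpha_bar[0..T] for a linear beta schedule; alpha_bar[0] = 1 by convention."""
    alpha_bar = [1.0]
    for t in range(1, T + 1):
        beta_t = beta_1 + (t - 1) / (T - 1) * (beta_T - beta_1)
        alpha_bar.append(alpha_bar[-1] * (1.0 - beta_t))
    return alpha_bar

def ddim_step(x_t, x0_hat, ab_t, ab_prev):
    """One deterministic DDIM update (Eq. 44), applied elementwise to lists of floats."""
    out = []
    for x, d in zip(x_t, x0_hat):
        # Noise implied by the current state and the denoiser's estimate of x_0.
        eps_hat = (x - math.sqrt(ab_t) * d) / math.sqrt(1.0 - ab_t)
        out.append(math.sqrt(ab_prev) * d + math.sqrt(1.0 - ab_prev) * eps_hat)
    return out

def ddim_sample(z, denoiser, alpha_bar, n_steps=3):
    """Map a latent z to x_0 with n_steps DDIM updates (hypothetical helper)."""
    T = len(alpha_bar) - 1
    # Evenly spaced timesteps T > ... > 0; the spacing rule is an assumption.
    ts = [round(T * (n_steps - k) / n_steps) for k in range(n_steps + 1)]
    x = list(z)
    for t, t_prev in zip(ts[:-1], ts[1:]):
        x = ddim_step(x, denoiser(x, t), alpha_bar[t], alpha_bar[t_prev])
    return x
```

Because the last update uses $\bar{\alpha}_0 = 1$, the noise term vanishes and the output equals the denoiser's final estimate of $x_0$. The whole map is deterministic in $z$, which is the property that lets $R_\theta$ serve as a generator.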

## Appendix B Implementation Details

### B.1 Diffusion Model Pretraining

For each of the four OpenFWI families (FV-B, FF-B, CV-B, CF-B) we independently pretrain a velocity-domain diffusion model on its training split. All four families share the same network architecture, noise schedule, and training protocol; the only difference between runs is the training data. Velocity fields are normalized by $y = (v_{\mathrm{m/s}} - 3000)/1500$, so the physical range $[1500, 4500]$ m/s maps to $[-1, 1]$. The native velocity resolution is $70\times70$; inside the network we apply a small amount of reflection padding to $72\times72$ at the input and slice the output back to $70\times70$, so the denoiser operates on a resolution that is cleanly divisible at every downsampling stage while the externally visible velocity field remains $70\times70$ throughout pretraining and inversion.
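A minimal sketch of the normalization and padding logic described above, in plain Python on nested lists. The helper names are hypothetical, and the symmetric split of the padding (one cell per side) is an assumption; the text only states that the padded size is $72\times72$.

```python
def normalize(v_ms):
    """Map velocity in m/s to the network range: 1500 -> -1, 4500 -> +1."""
    return (v_ms - 3000.0) / 1500.0

def denormalize(y):
    """Inverse of normalize."""
    return 1500.0 * y + 3000.0

def reflect_pad_1(grid):
    """Pad a 2D list by one cell per side using reflection (edge value not repeated)."""
    rows = [[r[1]] + list(r) + [r[-2]] for r in grid]
    return [list(rows[1])] + rows + [list(rows[-2])]

def crop_1(grid):
    """Remove the one-cell border added by reflect_pad_1."""
    return [list(r[1:-1]) for r in grid[1:-1]]

def stage_sizes(n, n_down=3):
    """Spatial size after each of n_down halvings; raises if a size is odd."""
    sizes = [n]
    for _ in range(n_down):
        if sizes[-1] % 2 != 0:
            raise ValueError(f"size {sizes[-1]} is not divisible by 2")
        sizes.append(sizes[-1] // 2)
    return sizes
```

With three halvings, `stage_sizes(72)` gives 72, 36, 18, 9, whereas 70 fails at the second halving because 35 is odd. This is the divisibility issue the padding avoids.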

##### Network architecture\.

The denoiser $\epsilon_\theta$ is a 2D U-Net (the Hugging Face Diffusers UNet2DModel) with one input and one output channel. It comprises four resolution stages with output channel widths $\{128, 256, 256, 512\}$ and three downsampling steps that take the padded input from $72\times72$ through $36\times36$ and $18\times18$ down to a $9\times9$ bottleneck, mirrored by three upsampling steps in the decoder. A self-attention block is inserted at the $18\times18$ stage in both the encoder and the decoder, with the remaining stages purely convolutional, and timestep conditioning is supplied to every residual block via a sinusoidal embedding.

##### Noise schedule\.

We use a discrete-time DDPM scheduler with $T = 1000$ training timesteps and a linear $\beta$ schedule from $\beta_1 = 10^{-4}$ to $\beta_T = 2\times10^{-2}$, yielding the VP marginals defined in Section [2.2](https://arxiv.org/html/2606.14139#S2.SS2) (see also Appendix [A](https://arxiv.org/html/2606.14139#A1) for the complete derivation). The full procedure for constructing the discrete schedule $\{\beta_t, \alpha_t, \bar{\alpha}_t\}_{t=1}^{T}$ from the two endpoint values is summarized in Algorithm [2](https://arxiv.org/html/2606.14139#alg2). The network is trained to predict the injected noise $\epsilon$ by the standard DDPM noise-prediction objective. The same schedule is reused at inference time by the deterministic DDIM sampler $R_\theta$ (Section [B.5](https://arxiv.org/html/2606.14139#A2.SS5)), with the number of deterministic steps reduced from $T = 1000$ to $n = 3$ inside the DLO loop.
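Appendix A writes the sampler in terms of a denoiser $D_\theta$, whereas the network here is trained to predict noise. Under the VP marginals $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$, the two parameterizations are related by the standard identity below. It is stated as a reading aid, on the assumption that the usual conversion is the one intended:

$$D_\theta(x_t,t) = \frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(x_t,t)}{\sqrt{\bar{\alpha}_t}}, \qquad \nabla_{x_t}\log p_t(x_t) \approx -\frac{\epsilon_\theta(x_t,t)}{\sqrt{1-\bar{\alpha}_t}}.$$

The second relation follows by substituting the first into Eq. (42): the numerator $\sqrt{\bar{\alpha}_t}\,D_\theta - x_t$ reduces to $-\sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta$, and one factor of $\sqrt{1-\bar{\alpha}_t}$ cancels against the denominator.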

Algorithm 2: Linear $\beta$ schedule (DDPM, VP).

Require: total steps $T = 1000$, start $\beta_1 = 10^{-4}$, end $\beta_T = 2\times10^{-2}$

1: for $t = 1$ to $T$ do

2: $\quad\beta_t \leftarrow \beta_1 + \dfrac{t-1}{T-1}\,(\beta_T - \beta_1)$

3: $\quad\alpha_t \leftarrow 1 - \beta_t$

4: $\quad\bar{\alpha}_t \leftarrow \prod_{s=1}^{t}\alpha_s$

5: end for

6: return $\{\beta_t, \alpha_t, \bar{\alpha}_t\}_{t=1}^{T}$
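Algorithm 2 translates directly into a short loop. The sketch below uses plain Python lists and a hypothetical function name; index `t - 1` of each returned list holds the value for timestep $t$.

```python
def linear_beta_schedule(T=1000, beta_1=1e-4, beta_T=2e-2):
    """Build (betas, alphas, alpha_bars) for t = 1..T following Algorithm 2."""
    betas, alphas, alpha_bars = [], [], []
    running_product = 1.0
    for t in range(1, T + 1):
        beta_t = beta_1 + (t - 1) / (T - 1) * (beta_T - beta_1)  # linear interpolation
        alpha_t = 1.0 - beta_t
        running_product *= alpha_t  # cumulative product of alpha_1 ... alpha_t
        betas.append(beta_t)
        alphas.append(alpha_t)
        alpha_bars.append(running_product)
    return betas, alphas, alpha_bars
```

Keeping a running product avoids recomputing $\prod_{s=1}^{t}\alpha_s$ from scratch at every step. With these endpoints $\bar{\alpha}_T$ ends up very close to zero, so $x_T$ is nearly pure Gaussian noise.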

##### Training hyperparameters\.

We optimize the noise-prediction loss with Adam at peak learning rate $10^{-4}$ following a cosine schedule with a 500-step linear warm-up. We train for 400 epochs on each family at batch size 32, which gives roughly 1,500 optimizer steps per epoch and a total of $\sim 6\times10^{5}$ update steps over the 48,000-velocity training set per family. Training and validation loss curves for the CF-B run are shown in Fig. [14](https://arxiv.org/html/2606.14139#A2.F14); the other three families exhibit qualitatively identical behaviour. After roughly $3\times10^{5}$ optimizer steps the loss begins to oscillate around a slowly decreasing plateau, a common signature of the stochastic noise-prediction objective in which the per-step target $\epsilon$ is resampled and the loss therefore retains an irreducible variance even once $\epsilon_\theta$ has effectively converged.
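The step counts and learning-rate profile above can be checked with a small sketch. The text specifies only the peak rate, the warm-up length, and the cosine shape; the linear ramp starting near zero and the decay to exactly zero at the final step are assumptions, as are the function names.

```python
import math

def steps_per_epoch(n_samples=48000, batch_size=32):
    """Optimizer steps needed to see every training sample once."""
    return math.ceil(n_samples / batch_size)

def lr_at(step, total_steps, peak_lr=1e-4, warmup_steps=500):
    """Learning rate at a 0-indexed step: linear warm-up, then cosine decay."""
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warm-up step.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

total_steps = 400 * steps_per_epoch()  # 400 epochs at 1,500 steps each
```

Here 48,000 / 32 = 1,500 steps per epoch and 400 × 1,500 = 600,000 total steps, matching the figures quoted in the text.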

![Refer to caption](https://arxiv.org/html/2606.14139v1/figures/ddpm_loss_curve.png)

Figure 14: Diffusion pretraining loss curve (CF-B). Training and validation noise-prediction loss versus optimizer step for the CurveFault-B diffusion model. The remaining three families produce qualitatively identical curves.

### B.2 Acoustic Wave-Equation Forward Solver

The forward operator $\mathcal{F}$ in Eq. ([5](https://arxiv.org/html/2606.14139#S2.E5)) numerically integrates the 2D acoustic wave equation ([1](https://arxiv.org/html/2606.14139#S2.E1)) with a Ricker wavelet source by a fourth-order finite-difference scheme in space and a leapfrog update in time, with a sponge absorbing layer of quadratic damping along the four exterior boundaries. Sources and receivers are uniformly distributed along the surface at fixed depths. The configuration follows the protocol used to generate the OpenFWI datasets \[[15](https://arxiv.org/html/2606.14139#bib.bib19)\], and the same parameters are reused for the Marmousi and Overthrust experiments, with the lateral extent enlarged to accommodate the wider models. Gradients of the data-misfit term with respect to the velocity model are computed by the adjoint-state method \[[47](https://arxiv.org/html/2606.14139#bib.bib31)\], which back-propagates the data residual through a single reverse-time solve of the same wave equation and contracts it with the stored forward wavefield. All parameters are summarized in Table [1](https://arxiv.org/html/2606.14139#A2.T1).
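As a rough illustration of the source signature and time discretization, the sketch below samples a 15 Hz Ricker wavelet on the solver's time grid and evaluates the Courant number for the fastest velocity in the normalized range. The peak delay `t0 = 1/f` and the unit amplitude are assumptions; the paper does not state them in this section.

```python
import math

def ricker(t, f=15.0, t0=None):
    """Ricker wavelet with peak frequency f (Hz), peaking at time t0 (s)."""
    if t0 is None:
        t0 = 1.0 / f  # assumed delay so the wavelet starts close to zero
    a = (math.pi * f * (t - t0)) ** 2
    return (1.0 - 2.0 * a) * math.exp(-a)

def courant_number(v_max, dt, dx):
    """Dimensionless ratio v * dt / dx that governs explicit time-stepping stability."""
    return v_max * dt / dx

dt, nt, dx = 0.001, 1000, 10.0                 # values from the forward-solver configuration
wavelet = [ricker(k * dt) for k in range(nt)]  # one source time function of length nt
c = courant_number(4500.0, dt, dx)             # 4500 m/s is the top of the normalized range
```

For these values the Courant number is 0.45, below the bound of about 0.61 (that is, $\sqrt{3/8}$) commonly quoted for a 2D scheme that is fourth-order in space and second-order in time. This suggests the configuration is comfortably stable, although the solver's exact stability criterion is not given here.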

Table 1: Forward-solver configuration. The same numerical scheme, source signature, and sponge boundary are used for both benchmark families; only the horizontal extent ($n_x$, $n_g$) changes between OpenFWI and the larger Marmousi/Overthrust models.

| Parameter | Symbol | OpenFWI | Marmousi/Overthrust | Unit |
| --- | --- | --- | --- | --- |
| Horizontal grid points | $n_x$ | 70 | 190 | - |
| Vertical grid points | $n_z$ | 70 | 70 | - |
| Grid spacing | $\Delta x$ | 10.0 | 10.0 | m |
| Time steps | $n_t$ | 1000 | 1000 | - |
| Time step size | $\Delta t$ | 0.001 | 0.001 | s |
| Recording time | $T$ | 1.0 | 1.0 | s |
| Source frequency (Ricker) | $f$ | 15.0 | 15.0 | Hz |
| Number of sources | $n_s$ | 5 | 5 | - |
| Number of receivers | $n_g$ | 70 | 190 | - |
| Source depth | $z_s$ | 10 | 10 | m |
| Receiver depth | $z_g$ | 10 | 10 | m |
| Boundary condition | - | Sponge layer (quadratic damping) | Sponge layer (quadratic damping) | - |
| ABC layer thickness | $n_{\mathrm{bc}}$ | 120 | 120 | grid points |
| Physical domain size | - | $700\times700$ | $700\times1900$ | m² |
| Seismic data shape | - | $(5, 1000, 70)$ | $(5, 1000, 190)$ | - |

### B.3 Hyperparameter Settings

To ensure a fair comparison, all methods share the same hyperparameters across all scenarios. For the data-misfit term, all methods minimize the $\ell_1$ norm $\|\mathcal{F}(v) - \mathbf{d}_{\mathrm{obs}}\|_1$ rather than the squared $\ell_2$ norm, as the $\ell_1$ loss provides greater robustness to outliers in seismic recordings. All optimization methods use the Adam optimizer with learning rate $\eta = 0.03$, run for 300 outer iterations. DLO uses the same optimizer for the velocity field $v$ as the other methods, plus an independent Adam optimizer for the latent variable $z$ with a constant learning rate $\eta_z = 0.02$. The regularization coefficients of Tikhonov ($\lambda = 0.01$), TV ($\lambda = 0.01$), and RED-DiffEq ($\lambda = 0.75$) follow the empirically tuned values reported in RED-DiffEq \[[41](https://arxiv.org/html/2606.14139#bib.bib61)\], and the manifold-tracking weight of DLO is fixed at $\lambda = 0.5$ throughout.
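The robustness argument for the $\ell_1$ misfit can be made concrete by comparing gradients with respect to the predicted data. The sketch below is a toy illustration on flat lists, with hypothetical function names; it is not the solver's misfit code.

```python
def l1_misfit(pred, obs):
    """Sum of absolute residuals."""
    return sum(abs(p - o) for p, o in zip(pred, obs))

def l1_subgradient(pred, obs):
    """Subgradient of the l1 misfit w.r.t. pred: the sign of each residual."""
    return [(p > o) - (p < o) for p, o in zip(pred, obs)]

def l2_gradient(pred, obs):
    """Gradient of the squared l2 misfit w.r.t. pred: twice each residual."""
    return [2.0 * (p - o) for p, o in zip(pred, obs)]

pred = [0.0, 0.0, 0.0]
obs = [0.1, -0.1, 100.0]  # last sample mimics a corrupted trace value
g1 = l1_subgradient(pred, obs)
g2 = l2_gradient(pred, obs)
```

Under $\ell_1$ every sample contributes a gradient of magnitude at most 1, so the corrupted sample carries the same weight as the clean ones. Under squared $\ell_2$ its gradient is a thousand times larger than the others and dominates the update.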

DiffusionFWI \[[52](https://arxiv.org/html/2606.14139#bib.bib1)\] follows a different optimization path, in which FWI gradient steps are interleaved between successive reverse-diffusion steps; we retain the protocol of the original code release, starting the reverse sampler at $t = 100$ on the $T = 1000$ schedule and inserting 10 Adam steps (learning rate $\eta = 0.01$) between consecutive denoising steps. Among the stabilization techniques provided in that repository (gradient smoothing, gradient normalization, and velocity-model Gaussian blur), our cross-validation across the four OpenFWI families indicates that velocity-model blur combined with gradient normalization most reliably improves performance, and we enable only these two in all reported DiffusionFWI results.

### B.4 Baseline Implementation Protocols

All baselines operate on velocity fields normalized to $[-1, 1]$ by $x = (v_{\mathrm{m/s}} - 3000)/1500$ and are initialized from a Gaussian-smoothed ground-truth model with standard deviation $\sigma_{\mathrm{init}} = 10$.

##### Tikhonov\.

We apply a first\-order Tikhonov regularizer that penalizes the squared gradient of the velocity model,

$$R_{\mathrm{Tikhonov}}(x) = \frac{1}{N}\sum_{i,j}\Bigl[(x_{i+1,j} - x_{i,j})^{2} + (x_{i,j+1} - x_{i,j})^{2}\Bigr], \tag{45}$$

where $N$ is the number of grid points. This penalty discourages abrupt variations and yields smooth solutions at the cost of fine-scale detail.

##### Total variation\.

We use anisotropic TV, which promotes piecewise\-constant structure and preserves sharp velocity discontinuities,

$$R_{\mathrm{TV}}(x) = \frac{1}{N}\sum_{i,j}\Bigl[\,|x_{i+1,j} - x_{i,j}| + |x_{i,j+1} - x_{i,j}|\,\Bigr], \tag{46}$$

which has a known tendency to introduce staircase artifacts in regions of smooth velocity gradient.
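Both regularizers are built from the same forward differences and differ only in how each difference is penalized. The sketch below evaluates Eqs. (45) and (46) on a nested list; the boundary treatment (differences that would reach outside the grid are omitted) is an assumption, and the function names are hypothetical.

```python
def forward_differences(x):
    """Return vertical and horizontal forward differences and the grid-point count N."""
    nz, nx = len(x), len(x[0])
    d_vert = [x[i + 1][j] - x[i][j] for i in range(nz - 1) for j in range(nx)]
    d_horz = [x[i][j + 1] - x[i][j] for i in range(nz) for j in range(nx - 1)]
    return d_vert, d_horz, nz * nx

def tikhonov(x):
    """First-order Tikhonov penalty (Eq. 45): mean of squared differences."""
    d_vert, d_horz, n = forward_differences(x)
    return (sum(d * d for d in d_vert) + sum(d * d for d in d_horz)) / n

def total_variation(x):
    """Anisotropic TV penalty (Eq. 46): mean of absolute differences."""
    d_vert, d_horz, n = forward_differences(x)
    return (sum(abs(d) for d in d_vert) + sum(abs(d) for d in d_horz)) / n

small_jump = [[0.0, 1.0], [0.0, 1.0]]
large_jump = [[0.0, 2.0], [0.0, 2.0]]
```

Doubling the jump from `small_jump` to `large_jump` doubles the TV penalty but quadruples the Tikhonov penalty. This is why Tikhonov smooths sharp interfaces more aggressively, while TV tolerates them.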

##### RED\-DiffEq\.

We use the denoiser-based regularizer of Eq. ([13](https://arxiv.org/html/2606.14139#S2.E13)), with the time-dependent weighting in the original formulation replaced by a fixed coefficient $\lambda$ following \[[41](https://arxiv.org/html/2606.14139#bib.bib61)\].

##### DiffusionFWI\.

We adapt the official implementation of \[[52](https://arxiv.org/html/2606.14139#bib.bib1)\] to our acoustic forward solver, replacing the original elastic engine while retaining the nested optimization protocol described in Section [B.3](https://arxiv.org/html/2606.14139#A2.SS3).

### B.5 Computational Cost

For stable optimization and computational efficiency, we instantiate the DDIM sampler with $n = 3$ deterministic steps as the prior velocity field generator. Table [2](https://arxiv.org/html/2606.14139#A2.T2) summarizes the per-iteration runtime breakdown of DLO. Compared to a standard FWI iteration (forward PDE solver plus adjoint gradient), optimization with the additional DDIM-related operations costs roughly 220% of a standard FWI iteration under our implementation.

Compared to classical FWI methods, DLO loads a pretrained diffusion model and therefore occupies additional GPU memory. Peak GPU memory usage during the DLO step is approximately 4600 MB, well below the capacity of modern GPUs. The additional computational cost and memory usage at this level are acceptable, given the inversion performance achieved by diffusion-based methods such as DLO.

Table 2: Per-iteration runtime breakdown of DLO. Measured on the OpenFWI $70\times70$ velocity grid and averaged over 100 iterations; $\pm$ denotes one standard deviation.