Difference of Convex Programming in the Wasserstein Space with Applications to MMD Optimization
Summary
This paper introduces a difference-of-convex programming framework in Wasserstein space for optimizing non-convex functionals over probability measures, with explicit decompositions for Maximum Mean Discrepancy and Energy Distance, and proves convergence of the lifted convex-concave procedure.
# Difference of Convex Programming in the Wasserstein Space with Applications to MMD Optimization
Source: [https://arxiv.org/html/2606.27767](https://arxiv.org/html/2606.27767)
Clément Bonet (CMAP, CNRS, Ecole Polytechnique, IP Paris; clement.bonet.mapp@polytechnique.edu), Pierre-Cyril Aubin-Frankowski (CERMICS, CNRS, ENPC, IP Paris; pierre-cyril.aubin@enpc.fr), Youssef Mroueh (IBM Research; mroueh@us.ibm.com)
###### Abstract
Optimizing functionals over the space of probability measures is now ubiquitous in machine learning\. A widely used approach is to perform the optimization directly over the Wasserstein space, but many objective functionals of practical interest are non\-convex along Wasserstein geodesics, making the analysis of standard first\-order methods challenging\. In this work, we study a class of objectives over the Wasserstein space that admit a difference\-of\-convex \(DC\) decomposition and we lift the classical convex\-concave procedure \(CCCP\) to this setting\. Under smoothness and strong convexity assumptions on the convex components of the decomposition, we prove almost stationarity along the iterates of the resulting algorithm\. Our main focus is on the Maximum Mean Discrepancy \(MMD\) and the Energy Distance \(ED\) functionals, for which we develop explicit Wasserstein DC decompositions, and establish local convergence of the scheme under mild assumptions\. Empirically, we show that well\-chosen DC decompositions yield faster and more stable convergence than Wasserstein gradient descent on these MMD objectives\.
### 1 Introduction
Optimizing over the space of probability measures is an important problem in machine learning, with applications ranging from variational inference (Blei et al., [2017](https://arxiv.org/html/2606.27767#bib.bib42); Lambert et al., [2022](https://arxiv.org/html/2606.27767#bib.bib52); Petit-Talamon et al., [2025](https://arxiv.org/html/2606.27767#bib.bib53)) to generative modeling (Deng et al., [2026](https://arxiv.org/html/2606.27767#bib.bib38); Turan and Ovsjanikov, [2026](https://arxiv.org/html/2606.27767#bib.bib39); Cao et al., [2026](https://arxiv.org/html/2606.27767#bib.bib40)), reinforcement learning (Zhang et al., [2018](https://arxiv.org/html/2606.27767#bib.bib54); Pfau et al., [2025](https://arxiv.org/html/2606.27767#bib.bib55)), optimization of neural networks (Mei et al., [2018](https://arxiv.org/html/2606.27767#bib.bib72); Chizat and Bach, [2018](https://arxiv.org/html/2606.27767#bib.bib73)) and modeling the dynamics of cells (Bunne et al., [2022](https://arxiv.org/html/2606.27767#bib.bib75); Terpin et al., [2024](https://arxiv.org/html/2606.27767#bib.bib78); Persiianov et al., [2026](https://arxiv.org/html/2606.27767#bib.bib77)). One prominent way to optimize over the space of probability measures is to equip it with the Wasserstein distance (Villani and others, [2009](https://arxiv.org/html/2606.27767#bib.bib79); Santambrogio, [2015](https://arxiv.org/html/2606.27767#bib.bib51)), and to discretize the associated Wasserstein gradient flows (Jordan et al., [1998](https://arxiv.org/html/2606.27767#bib.bib44); Wibisono, [2018](https://arxiv.org/html/2606.27767#bib.bib45)).
This makes it possible to design many optimization algorithms as counterparts of their Euclidean versions, such as gradient descent (Wibisono, [2018](https://arxiv.org/html/2606.27767#bib.bib45)), proximal point and gradient algorithms (Jordan et al., [1998](https://arxiv.org/html/2606.27767#bib.bib44); Salim et al., [2020](https://arxiv.org/html/2606.27767#bib.bib12)), coordinate descent (Xu and Li, [2026](https://arxiv.org/html/2606.27767#bib.bib25)) or mirror descent (Bonet et al., [2024](https://arxiv.org/html/2606.27767#bib.bib3); Sharrock et al., [2023](https://arxiv.org/html/2606.27767#bib.bib43)).
While these optimization methods have demonstrated good results in minimizing several objectives such as the Kullback-Leibler divergence (KL) (Wibisono, [2018](https://arxiv.org/html/2606.27767#bib.bib45)), $f$-divergences (Gao et al., [2019](https://arxiv.org/html/2606.27767#bib.bib63); Ansari et al., [2021](https://arxiv.org/html/2606.27767#bib.bib62); Liu et al., [2024](https://arxiv.org/html/2606.27767#bib.bib64)), the Energy Distance (Hertrich et al., [2024b](https://arxiv.org/html/2606.27767#bib.bib27)) or the Sliced-Wasserstein distance (Liutkus et al., [2019](https://arxiv.org/html/2606.27767#bib.bib46); Du et al., [2023](https://arxiv.org/html/2606.27767#bib.bib47); Bonet et al., [2025](https://arxiv.org/html/2606.27767#bib.bib48)), most of them are tailored for functionals that are convex along (generalized) geodesics. However, many popular functionals are known to be non-convex. This is the case, for instance, for the squared Wasserstein distance itself (Ambrosio et al., [2008](https://arxiv.org/html/2606.27767#bib.bib31), Chapter 9), the Sliced-Wasserstein distance (Bonnotte, [2013](https://arxiv.org/html/2606.27767#bib.bib50)), the KL with non-log-concave target (Luu et al., [2024](https://arxiv.org/html/2606.27767#bib.bib2)), the Energy Distance (Hertrich et al., [2024a](https://arxiv.org/html/2606.27767#bib.bib26)) or the Maximum Mean Discrepancy (MMD) (Arbel et al., [2019](https://arxiv.org/html/2606.27767#bib.bib22)).
An alternative to the convexity assumption is to consider Polyak-Łojasiewicz inequalities (Blanchet and Bolte, [2018](https://arxiv.org/html/2606.27767#bib.bib65); Liu et al., [2023](https://arxiv.org/html/2606.27767#bib.bib66); Zhu and Chen, [2025](https://arxiv.org/html/2606.27767#bib.bib74)). However, these are so far only known to hold in restrictive cases: trivially for the squared Wasserstein distance, for the KL *w.r.t.* a measure satisfying the log-Sobolev inequality (Villani and others, [2009](https://arxiv.org/html/2606.27767#bib.bib79), Chapter 21), for the Sliced-Wasserstein distance over Gaussians (Thurin et al., [2026](https://arxiv.org/html/2606.27767#bib.bib49)), and for the MMD with Coulomb kernels and smooth initializations (Boufadène and Vialard, [2025](https://arxiv.org/html/2606.27767#bib.bib100), Section 2).
For the MMD specifically, to address its non-convexity in the Wasserstein space, Arbel et al. ([2019](https://arxiv.org/html/2606.27767#bib.bib22)) proposed injecting noise into the gradient. However, tuning the amount of noise remains delicate. Gladin et al. ([2024](https://arxiv.org/html/2606.27767#bib.bib24)) instead performed the optimization of the MMD in the Wasserstein-MMD space, where the MMD is convex. Their method improves performance, but it requires changing the weights of the distribution, mixing two implicit steps in squared Wasserstein and MMD. More recently, Belhadji et al. ([2026](https://arxiv.org/html/2606.27767#bib.bib106)) optimized it in the Wasserstein Fisher-Rao space through a fixed-point algorithm, also changing weights. We propose instead a new particle-based algorithm on the Wasserstein space to handle non-convex functionals.
To achieve such a goal, a candidate class of analogous algorithms over $\mathbb{R}^d$ is the family of methods for objectives that can be written as a difference of convex functions (DC) (Le Thi and Pham Dinh, [2018](https://arxiv.org/html/2606.27767#bib.bib70); Pham Dinh and Le Thi, [2014](https://arxiv.org/html/2606.27767#bib.bib69)). On $\mathbb{R}^d$, this includes a large class of functions of interest, in particular twice continuously differentiable functions (Yuille and Rangarajan, [2001](https://arxiv.org/html/2606.27767#bib.bib8); Hiriart-Urruty, [1985](https://arxiv.org/html/2606.27767#bib.bib94)). One key algorithm of this family is the Convex-Concave Procedure (Yuille and Rangarajan, [2001](https://arxiv.org/html/2606.27767#bib.bib8)). These DC algorithms have already been used for many applications in machine learning, such as kernel selection (Argyriou et al., [2006](https://arxiv.org/html/2606.27767#bib.bib56)), clustering (Tao and others, [2014](https://arxiv.org/html/2606.27767#bib.bib82)), dictionary learning (Vo et al., [2015](https://arxiv.org/html/2606.27767#bib.bib83)), optimal transport (Tran et al., [2021](https://arxiv.org/html/2606.27767#bib.bib86)), or neural network optimization (Awasthi et al., [2024](https://arxiv.org/html/2606.27767#bib.bib84); Askarizadeh et al., [2024](https://arxiv.org/html/2606.27767#bib.bib85)), to name a few.
However, except for a few works that studied DC algorithms on Riemannian manifolds (Souza and Oliveira, [2015](https://arxiv.org/html/2606.27767#bib.bib89); Weber and Sra, [2023](https://arxiv.org/html/2606.27767#bib.bib17); Bergmann et al., [2024](https://arxiv.org/html/2606.27767#bib.bib87); Ferreira et al., [2026](https://arxiv.org/html/2606.27767#bib.bib88)), and recently on the Wasserstein space (Luu et al., [2024](https://arxiv.org/html/2606.27767#bib.bib2); Luu and Wang, [2026](https://arxiv.org/html/2606.27767#bib.bib104)), optimization algorithms for DC functionals have mostly been studied on Euclidean spaces.
##### Contributions\.
In this work, we focus on developing an optimization scheme on the Wasserstein space tailored for non-convex objectives that can be written as a difference of convex objectives. To do so, we introduce the *Wasserstein Convex-Concave Procedure* (WCCCP), and analyze its theoretical convergence, proving almost stationarity along its iterates. The closest work to ours is (Luu et al., [2024](https://arxiv.org/html/2606.27767#bib.bib2)), which used a Wasserstein Proximal Gradient algorithm (Salim et al., [2020](https://arxiv.org/html/2606.27767#bib.bib12)) to solve some DC problems, but restricted its applications to functionals whose concave part is a potential energy, *i.e.* linear. We argue that this is too restrictive to handle objectives such as Maximum Mean Discrepancies, which can be decomposed as a sum of two non-convex quadratic and linear terms. Thus, we deal with the more general case, where the concave part can be any differentiable functional over the Wasserstein space. Then, for several kernels, we show that with a well-chosen decomposition obtained by splitting the kernel, the introduced scheme can better optimize the MMD than Wasserstein Gradient Descent. In particular, we provide experiments on the Energy Distance and the MMD with Gaussian kernel.
##### Notation\.
We denote by $\mathcal{P}_2(\mathbb{R}^d)$ the space of probability distributions with finite second moments, and by $\mathcal{P}_{\mathrm{ac}}(\mathbb{R}^d)$ its restriction to measures that are absolutely continuous with respect to the Lebesgue measure. Given $\mu,\nu\in\mathcal{P}_2(\mathbb{R}^d)$, we denote by $\mathrm{W}_2^2(\mu,\nu)=\inf_{\gamma\in\Pi(\mu,\nu)}\int\|x-y\|_2^2\ \mathrm{d}\gamma(x,y)$ the squared Wasserstein distance, where $\Pi(\mu,\nu)$ is the set of couplings between $\mu$ and $\nu$, and $\Pi_o(\mu,\nu)$ is its subset of optimal couplings. The metric space $(\mathcal{P}_2(\mathbb{R}^d),\mathrm{W}_2)$ is called the Wasserstein space. For any $\mu\in\mathcal{P}_2(\mathbb{R}^d)$, we denote by $L^2(\mu)$ the Hilbert space of functions $f:\mathbb{R}^d\to\mathbb{R}^d$ such that $\int\|f(x)\|_2^2\ \mathrm{d}\mu(x)<\infty$, equipped with the norm $\|f\|_{L^2(\mu)}^2=\int\|f(x)\|_2^2\ \mathrm{d}\mu(x)$ and with inner product $\langle\cdot,\cdot\rangle_{L^2(\mu)}$. Given $\mathrm{T}:\mathbb{R}^d\to\mathbb{R}^d$, $\mathrm{T}_\#\mu\in\mathcal{P}_2(\mathbb{R}^d)$ is the pushforward measure of $\mu$ by $\mathrm{T}$.
### 2 Background
We begin by recalling a few facts on optimization on the Wasserstein space. More precisely, we recall the notion of Wasserstein gradient, of total convexity on the Wasserstein space, and some classical optimization schemes on $(\mathcal{P}_2(\mathbb{R}^d),\mathrm{W}_2)$. For more details about Wasserstein gradient flows, we refer to *e.g.* (Ambrosio et al., [2008](https://arxiv.org/html/2606.27767#bib.bib31); Lanzetti et al., [2025](https://arxiv.org/html/2606.27767#bib.bib30)). Then, we provide a brief introduction to the convex-concave procedure on Euclidean spaces.
##### Wasserstein gradient\.
Let $\mathcal{F}:\mathcal{P}_2(\mathbb{R}^d)\to\mathbb{R}$ be a functional. It admits a Wasserstein gradient $\nabla_{\mathrm{W}_2}\mathcal{F}(\mu)\in L^2(\mu)$ at $\mu\in\mathcal{P}_2(\mathbb{R}^d)$ if, for all $\nu\in\mathcal{P}_2(\mathbb{R}^d)$ and $\gamma\in\Pi_o(\mu,\nu)$, the following first-order Taylor expansion is satisfied (Bonnet, [2019](https://arxiv.org/html/2606.27767#bib.bib32); Lanzetti et al., [2025](https://arxiv.org/html/2606.27767#bib.bib30)):
$$\mathcal{F}(\nu)=\mathcal{F}(\mu)+\int\langle\nabla_{\mathrm{W}_2}\mathcal{F}(\mu)(x),y-x\rangle\ \mathrm{d}\gamma(x,y)+o\big(\mathrm{W}_2(\mu,\nu)\big).\tag{1}$$
When it exists, the Wasserstein gradient may not be unique in $L^2(\mu)$ in general. Nonetheless, there is only one gradient living in the tangent space $T_\mu\mathcal{P}_2(\mathbb{R}^d)\subset L^2(\mu)$, which is a Hilbert space. Hence, by Hilbert's decomposition theorem, any gradient can be decomposed as a part in $T_\mu\mathcal{P}_2(\mathbb{R}^d)$ and an orthogonal part $\xi(\mu)$ which satisfies $\int\langle\xi(\mu),y-x\rangle\ \mathrm{d}\gamma(x,y)=0$, see (Lanzetti et al., [2025](https://arxiv.org/html/2606.27767#bib.bib30), Proposition 2.11). Thus, without loss of generality, we always work with the unique $\nabla_{\mathrm{W}_2}\mathcal{F}(\mu)\in T_\mu\mathcal{P}_2(\mathbb{R}^d)$ for a Wasserstein differentiable functional, using the shorthand W-differentiable for such functionals.
Classical functionals from $\mathcal{P}_2(\mathbb{R}^d)$ to $\mathbb{R}$ include potential energies $\mathcal{V}(\mu)=\int\mathrm{V}\,\mathrm{d}\mu$ and interaction energies $\mathcal{W}(\mu)=\frac{1}{2}\iint\mathrm{W}(x-y)\ \mathrm{d}\mu(x)\mathrm{d}\mu(y)$ for $\mathrm{V},\mathrm{W}:\mathbb{R}^d\to\mathbb{R}$, with $\mathrm{W}$ symmetric. Both are W-differentiable provided $\mathrm{V}$ and $\mathrm{W}$ are differentiable and smooth enough (Lanzetti et al., [2025](https://arxiv.org/html/2606.27767#bib.bib30)), and their Wasserstein gradients read respectively $\nabla_{\mathrm{W}_2}\mathcal{V}(\mu)=\nabla\mathrm{V}$ and the convolution $\nabla_{\mathrm{W}_2}\mathcal{W}(\mu)=\nabla\mathrm{W}*\mu$. In this work, we will mostly focus on functionals obtained as a sum of interaction and potential energies, as they include in particular the Maximum Mean Discrepancy (Arbel et al., [2019](https://arxiv.org/html/2606.27767#bib.bib22)).
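For an empirical measure, these two gradient formulas become simple array operations. The following is a minimal NumPy sketch, assuming uniformly weighted particles and the illustrative quadratic choices $\mathrm{V}(x)=\mathrm{W}(x)=\tfrac{1}{2}\|x\|_2^2$; the function names and these choices are assumptions made for illustration and do not come from the paper.

```python
import numpy as np

def potential_gradient(x, grad_V):
    """Wasserstein gradient of V(mu) = int V dmu, evaluated at the particles: grad V(x_i)."""
    return grad_V(x)

def interaction_gradient(x, grad_W):
    """Wasserstein gradient of W(mu) = 1/2 iint W(x - y) dmu dmu at the particles:
    (grad W * mu)(x_i) = (1/n) sum_j grad W(x_i - x_j), for uniform weights 1/n."""
    diffs = x[:, None, :] - x[None, :, :]   # pairwise differences, shape (n, n, d)
    return grad_W(diffs).mean(axis=1)        # average over the index j

# Illustrative (assumed) choices: V(x) = W(x) = ||x||^2 / 2, so both gradients are the identity.
grad_V = lambda x: x
grad_W = lambda z: z

rng = np.random.default_rng(0)
particles = rng.normal(size=(5, 2))          # n = 5 particles in dimension d = 2
g_pot = potential_gradient(particles, grad_V)
g_int = interaction_gradient(particles, grad_W)
```

With this quadratic interaction kernel, the convolution reduces to centering: `g_int` equals each particle minus the empirical mean.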
##### Convexity in the Wasserstein space\.
Let $\mu\in\mathcal{P}_2(\mathbb{R}^d)$ and $\mathrm{T},\mathrm{S}\in L^2(\mu)$. Given $\phi:L^2(\mu)\to\mathbb{R}$ convex and Gâteaux differentiable, the Bregman divergence on $L^2(\mu)$ between $\mathrm{T}$ and $\mathrm{S}$ is defined as (Frigyik et al., [2008](https://arxiv.org/html/2606.27767#bib.bib33))
$$\mathrm{D}_\phi(\mathrm{T},\mathrm{S})=\phi(\mathrm{T})-\phi(\mathrm{S})-\langle\nabla\phi(\mathrm{S}),\mathrm{T}-\mathrm{S}\rangle_{L^2(\mu)}.\tag{2}$$
Given $\mathcal{F}:\mathcal{P}_2(\mathbb{R}^d)\to\mathbb{R}$ W-differentiable, we can define the Bregman divergence in $L^2(\mu)$ associated to the lifted functional $\mathrm{T}\mapsto\mathcal{F}(\mathrm{T}_\#\mu)$ as
$$\mathrm{D}_{\mathcal{F}}^{\mu}(\mathrm{T},\mathrm{S})=\mathcal{F}(\mathrm{T}_\#\mu)-\mathcal{F}(\mathrm{S}_\#\mu)-\langle\nabla_{\mathrm{W}_2}\mathcal{F}(\mathrm{S}_\#\mu)\circ\mathrm{S},\mathrm{T}-\mathrm{S}\rangle_{L^2(\mu)},\tag{3}$$
using the chain rule for the gradient of $\tilde{\mathcal{F}}_\mu:\mathrm{S}\mapsto\mathcal{F}(\mathrm{S}_\#\mu)$, which implies $\nabla\tilde{\mathcal{F}}_\mu(\mathrm{S})=\nabla_{\mathrm{W}_2}\mathcal{F}(\mathrm{S}_\#\mu)\circ\mathrm{S}$, see (Bonet et al., [2024](https://arxiv.org/html/2606.27767#bib.bib3), Proposition 1). Note that for $\mathcal{F}(\mu)=\int\tfrac{1}{2}\|\cdot\|_2^2\ \mathrm{d}\mu$, this reduces to $\mathrm{D}_{\mathcal{F}}^{\mu}(\mathrm{T},\mathrm{S})=\tfrac{1}{2}\|\mathrm{T}-\mathrm{S}\|_{L^2(\mu)}^2$. Let $\alpha\geq 0$. We say that $\mathcal{F}$ is $\alpha$-totally convex (Cavagnari et al., [2023](https://arxiv.org/html/2606.27767#bib.bib34); Tanaka, [2023](https://arxiv.org/html/2606.27767#bib.bib21); Parker, [2024](https://arxiv.org/html/2606.27767#bib.bib35)) if for all $\mu\in\mathcal{P}_2(\mathbb{R}^d)$ and $\mathrm{T},\mathrm{S}\in L^2(\mu)$, $\mathrm{D}_{\mathcal{F}}^{\mu}(\mathrm{T},\mathrm{S})\geq\frac{\alpha}{2}\|\mathrm{T}-\mathrm{S}\|_{L^2(\mu)}^2$. Equivalently, it satisfies
$$\forall t\in[0,1],\ \mathcal{F}\big(\big((1-t)\mathrm{T}+t\mathrm{S}\big)_\#\mu\big)\leq(1-t)\mathcal{F}(\mathrm{T}_\#\mu)+t\mathcal{F}(\mathrm{S}_\#\mu)-\alpha\frac{t(1-t)}{2}\|\mathrm{T}-\mathrm{S}\|_{L^2(\mu)}^2.\tag{4}$$
If this inequality holds only for $\mathrm{S}=\mathrm{Id}$ and for $\mathrm{T}$ the gradient of a convex function, this corresponds to the less restrictive notion of strong convexity along geodesics (Ambrosio et al., [2008](https://arxiv.org/html/2606.27767#bib.bib31)).
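The reduction of the Bregman divergence to the squared $L^2(\mu)$ distance for the quadratic potential energy can be checked in a few lines. The sketch below only uses definition (3) and the gradient formula for potential energies recalled above; it is a worked verification added for illustration, not taken from the paper.

```latex
% F(mu) = \int (1/2)||x||^2 dmu is a potential energy with V = (1/2)||.||^2,
% so \nabla_{W_2} F(mu) = \nabla V = Id, and F(T_# mu) = (1/2)||T||^2_{L^2(mu)}.
\begin{align*}
\mathrm{D}_{\mathcal{F}}^{\mu}(\mathrm{T},\mathrm{S})
  &= \tfrac{1}{2}\|\mathrm{T}\|_{L^2(\mu)}^2 - \tfrac{1}{2}\|\mathrm{S}\|_{L^2(\mu)}^2
     - \langle \mathrm{S}, \mathrm{T}-\mathrm{S}\rangle_{L^2(\mu)} \\
  &= \tfrac{1}{2}\|\mathrm{T}\|_{L^2(\mu)}^2 - \langle \mathrm{S}, \mathrm{T}\rangle_{L^2(\mu)}
     + \tfrac{1}{2}\|\mathrm{S}\|_{L^2(\mu)}^2 \\
  &= \tfrac{1}{2}\|\mathrm{T}-\mathrm{S}\|_{L^2(\mu)}^2 .
\end{align*}
```

In particular, this functional is $1$-totally convex, with equality in the defining inequality.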
##### Wasserstein Gradient Descent\.
Optimization on $L^2(\mu)$ and on $\mathcal{P}_2(\mathbb{R}^d)$ are very much intertwined in practice, see *e.g.* (Bonet et al., [2024](https://arxiv.org/html/2606.27767#bib.bib3); Dumont et al., [2026](https://arxiv.org/html/2606.27767#bib.bib36)). For instance, Wasserstein Gradient Descent (WGD) on a W-differentiable functional $\mathcal{F}:\mathcal{P}_2(\mathbb{R}^d)\to\mathbb{R}$, defined for all $k\geq 0$ and a step size $\tau>0$ by $\mu_{k+1}=\big(\mathrm{Id}-\tau\nabla_{\mathrm{W}_2}\mathcal{F}(\mu_k)\big)_\#\mu_k$, can be written at each iteration in two steps: first solving an optimization problem on $L^2(\mu_k)$ to get a map $\mathrm{T}_{k+1}\in L^2(\mu_k)$, then pushing forward $\mu_k$ by $\mathrm{T}_{k+1}$, *i.e.*
$$\begin{cases}\mathrm{T}_{k+1}=\operatorname{argmin}_{\mathrm{T}\in L^2(\mu_k)}\ \tfrac{1}{2\tau}\|\mathrm{T}-\mathrm{Id}\|_{L^2(\mu_k)}^2+\langle\nabla_{\mathrm{W}_2}\mathcal{F}(\mu_k),\mathrm{T}-\mathrm{Id}\rangle_{L^2(\mu_k)}\\ \mu_{k+1}=(\mathrm{T}_{k+1})_\#\mu_k.\end{cases}\tag{5}$$
It can be shown to converge if $\mathcal{F}$ is smooth along $t\mapsto\big((1-t)\mathrm{Id}+t\mathrm{T}_{k+1}\big)_\#\mu_k$ and convex along geodesics. Other first-order algorithms have been lifted from $\mathbb{R}^d$ to $\mathcal{P}_2(\mathbb{R}^d)$. For instance, replacing the squared $L^2$ distance in ([5](https://arxiv.org/html/2606.27767#S2.E5)) by a Bregman divergence ([2](https://arxiv.org/html/2606.27767#S2.E2)) allows lifting the Mirror Descent algorithm (Beck and Teboulle, [2003](https://arxiv.org/html/2606.27767#bib.bib68); Lu et al., [2018](https://arxiv.org/html/2606.27767#bib.bib67)) to $\mathcal{P}_2(\mathbb{R}^d)$ (Bonet et al., [2024](https://arxiv.org/html/2606.27767#bib.bib3)).
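On particles, the quadratic subproblem in (5) has the closed-form minimizer $\mathrm{T}_{k+1}=\mathrm{Id}-\tau\nabla_{\mathrm{W}_2}\mathcal{F}(\mu_k)$, so one WGD iteration is a single array update. The sketch below illustrates this on a hypothetical toy objective, the sum of a potential and an interaction energy with $\mathrm{V}=\mathrm{W}=\tfrac{1}{2}\|\cdot\|_2^2$; the objective, the step size and the function names are assumptions chosen for illustration, not the paper's experimental setup.

```python
import numpy as np

def objective(x):
    """Toy F(mu) = int ||x||^2/2 dmu + 1/2 iint ||x - y||^2/2 dmu dmu for uniform particles x."""
    second_moment = np.mean(np.sum(x**2, axis=1))
    mean_sq = np.sum(x.mean(axis=0) ** 2)
    # The interaction term equals (E||X||^2 - ||E X||^2) / 2 for this quadratic kernel.
    return 0.5 * second_moment + 0.5 * (second_moment - mean_sq)

def wasserstein_grad(x):
    """grad V(x_i) + (grad W * mu)(x_i) with V = W = ||.||^2 / 2."""
    return x + (x - x.mean(axis=0))

def wgd_step(x, tau):
    """Closed-form minimizer of the subproblem (5): T = Id - tau * gradient, applied per particle."""
    return x - tau * wasserstein_grad(x)

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, size=(50, 2))   # initial particles (arbitrary choice)
tau = 0.1                               # step size (arbitrary choice)
values = [objective(x)]
for _ in range(200):
    x = wgd_step(x, tau)
    values.append(objective(x))
```

For this convex and smooth toy objective the particles contract toward the origin and `values` decreases to zero; non-convex objectives such as the MMD do not come with this guarantee.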
##### The Convex\-Concave Procedure\.
A function $f:\mathbb{R}^d\to\mathbb{R}$ is DC if it can be written as the difference of two convex functions, *i.e.* if there exist two convex functions $f^+,f^-$ such that $f=f^+-f^-$. Every $C^1$ function with Lipschitz gradient is DC (Hiriart-Urruty, [1985](https://arxiv.org/html/2606.27767#bib.bib94), Section II). To minimize such functions, a popular algorithm is the Convex-Concave Procedure (CCCP) (Yuille and Rangarajan, [2001](https://arxiv.org/html/2606.27767#bib.bib8)). This algorithm amounts to linearizing the concave part around the current iterate $x_k\in\mathbb{R}^d$, which by convexity of $f^-$ gives the lower bound
$$\forall x\in\mathbb{R}^d,\ f^-(x)\geq f^-(x_k)+\langle\nabla f^-(x_k),x-x_k\rangle,\tag{6}$$
which entails an upper bound on $f=f^+-f^-$, and leads to the majorization-minimization scheme
$$x_{k+1}=\operatorname{argmin}_x\ f^+(x)-f^-(x_k)-\langle\nabla f^-(x_k),x-x_k\rangle,\quad\forall k\geq 0.\tag{7}$$
When both $f^+$ and $f^-$ are differentiable, the iterates satisfy $\nabla f^+(x_{k+1})=\nabla f^-(x_k)$ by the first-order conditions. This algorithm belongs to the more general family of Difference-of-Convex algorithms (DCA) (Pham Dinh and Le Thi, [2014](https://arxiv.org/html/2606.27767#bib.bib69); Le Thi and Pham Dinh, [2018](https://arxiv.org/html/2606.27767#bib.bib70)), and is related to different optimization algorithms including Frank-Wolfe (Yurtsever and Sra, [2022](https://arxiv.org/html/2606.27767#bib.bib18)), the Mirror and Bregman proximal descent (Faust et al., [2023](https://arxiv.org/html/2606.27767#bib.bib1)) or the Proximal gradient algorithm (Rotaru et al., [2025](https://arxiv.org/html/2606.27767#bib.bib71)).
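As a concrete instance of the update (7), consider the one-dimensional double-well function $f(x)=\tfrac{1}{4}x^4-\tfrac{1}{2}x^2$ with the DC split $f^+(x)=\tfrac{1}{4}x^4$ and $f^-(x)=\tfrac{1}{2}x^2$. This example is an assumption chosen for illustration and does not appear in the paper. The first-order condition $\nabla f^+(x_{k+1})=\nabla f^-(x_k)$ then reads $x_{k+1}^3=x_k$, so each CCCP step is a cube root.

```python
import numpy as np

def f_plus(x):
    return x**4 / 4          # convex part

def f_minus(x):
    return x**2 / 2          # convex part that is subtracted

def f(x):
    return f_plus(x) - f_minus(x)

def cccp_step(x_k):
    # Solve grad f_plus(x_next) = grad f_minus(x_k), i.e. x_next**3 = x_k.
    return np.cbrt(x_k)

x = 0.1                      # arbitrary positive starting point
values = [f(x)]
for _ in range(50):
    x = cccp_step(x)
    values.append(f(x))
```

Starting from any positive point, the iterates increase toward the stationary point $x=1$, and the objective values in `values` are non-increasing, as expected from the majorization-minimization structure.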
The first convergence analysis of CCCP focused on obtaining asymptotic convergence, showing that it converges towards a stationary point under some assumptions (Tao and Le Thi, [1997](https://arxiv.org/html/2606.27767#bib.bib20); Lanckriet and Sriperumbudur, [2009](https://arxiv.org/html/2606.27767#bib.bib19)). Then, several works such as (Yurtsever and Sra, [2022](https://arxiv.org/html/2606.27767#bib.bib18); Abbaszadehpeivasti et al., [2024](https://arxiv.org/html/2606.27767#bib.bib16); Faust et al., [2023](https://arxiv.org/html/2606.27767#bib.bib1)) derived non-asymptotic convergence rates. In particular, the algorithm was shown to converge at rate $O(1/k)$ in terms of the squared norm of the gradient. More recently, (Oikonomidis et al., [2025](https://arxiv.org/html/2606.27767#bib.bib9)) provided an analysis of CCCP under a generalized convexity perspective, though in finite dimensions. Linear rates were also derived under Polyak-Łojasiewicz inequalities adapted to DC functions (Abbaszadehpeivasti et al., [2024](https://arxiv.org/html/2606.27767#bib.bib16); Faust et al., [2023](https://arxiv.org/html/2606.27767#bib.bib1); Oikonomidis et al., [2025](https://arxiv.org/html/2606.27767#bib.bib9); Niu, [2026](https://arxiv.org/html/2606.27767#bib.bib7)).
### 3 Wasserstein CCCP
In this section, we first introduce the Wasserstein Convex-Concave Procedure (WCCCP) to minimize difference-of-convex functions on the Wasserstein space, together with our assumptions. Then we provide a theoretical analysis in several settings, including the convex and non-convex ones. Finally, we discuss how we can implement these schemes in practice. All the proofs are deferred to Appendix [F](https://arxiv.org/html/2606.27767#A6).
#### 3.1 Convex-Concave Procedure in the Wasserstein Space
We focus on the problem of minimizing $\mathcal{F}:\mathcal{P}_2(\mathbb{R}^d)\to\mathbb{R}$, where $\mathcal{F}$ can be decomposed as
$$\mathcal{F}(\mu)=\mathcal{F}^+(\mu)-\mathcal{F}^-(\mu),\quad\forall\mu\in\mathcal{P}_2(\mathbb{R}^d),\tag{8}$$
with $\mathcal{F}^+,\mathcal{F}^-:\mathcal{P}_2(\mathbb{R}^d)\to\mathbb{R}$ both totally convex. Moreover, we assume that $\mathcal{F}^-$ is W-differentiable, and will assume $\mathcal{F}^+$ W-differentiable on a case-by-case basis. In contrast to (Luu et al., [2024](https://arxiv.org/html/2606.27767#bib.bib2)), we do not restrict $\mathcal{F}^-$ to be a potential energy.
Let $\mu\in\mathcal{P}_2(\mathbb{R}^d)$. Since $\mathcal{F}^-$ is totally convex, for any $\mathrm{T}\in L^2(\mu)$,
$$\mathrm{D}_{\mathcal{F}^-}^{\mu}(\mathrm{T},\mathrm{Id})\geq 0\iff\mathcal{F}^-(\mathrm{T}_\#\mu)\geq\mathcal{F}^-(\mu)+\langle\nabla_{\mathrm{W}_2}\mathcal{F}^-(\mu),\mathrm{T}-\mathrm{Id}\rangle_{L^2(\mu)}.\tag{9}$$
Hence, we have the following upper bound on $\mathcal{F}$:
$$\mathcal{F}(\mathrm{T}_\#\mu)=\mathcal{F}^+(\mathrm{T}_\#\mu)-\mathcal{F}^-(\mathrm{T}_\#\mu)\leq\mathcal{F}^+(\mathrm{T}_\#\mu)-\mathcal{F}^-(\mu)-\langle\nabla_{\mathrm{W}_2}\mathcal{F}^-(\mu),\mathrm{T}-\mathrm{Id}\rangle_{L^2(\mu)}.\tag{10}$$
We define the Wasserstein Convex-Concave Procedure (WCCCP) as the majorization-minimization based on the upper bound in ([10](https://arxiv.org/html/2606.27767#S3.E10)) at each iteration $k\geq 0$, *i.e.* given $\mu_0\in\mathcal{P}_2(\mathbb{R}^d)$,
$$\begin{aligned}\mathrm{T}_{k+1}&=\operatorname{argmin}_{\mathrm{T}\in L^2(\mu_k)}\ \mathrm{J}(\mathrm{T})\coloneqq\mathcal{F}^+(\mathrm{T}_\#\mu_k)-\langle\nabla_{\mathrm{W}_2}\mathcal{F}^-(\mu_k),\mathrm{T}-\mathrm{Id}\rangle_{L^2(\mu_k)}\\ \mu_{k+1}&=(\mathrm{T}_{k+1})_\#\mu_k.\end{aligned}\tag{11}$$
This extends CCCP (Yuille and Rangarajan, [2001](https://arxiv.org/html/2606.27767#bib.bib8)) to the Wasserstein space, CCCP being recovered when $\mathcal{F}^+$ and $\mathcal{F}^-$ are potential energies. Restricting the optimization to measures of the form $\mathrm{T}_\#\mu_k$ for $\mathrm{T}\in L^2(\mu_k)$ is without loss of generality in two important cases: *i)* as done in our experiments, for empirical target and initial distributions with the same number of particles; *ii)* if $\mu_k\in\mathcal{P}_{\mathrm{ac}}(\mathbb{R}^d)$, as by Brenier's theorem (Brenier, [1991](https://arxiv.org/html/2606.27767#bib.bib81)), there always exists an OT map between $\mu_k$ and any $\nu\in\mathcal{P}_2(\mathbb{R}^d)$. While for greater generality we could instead optimize over couplings with first marginal $\mu_k$, the subproblems on $L^2(\mu_k)$ are more tractable and reflect practical implementations.
We assume existence and uniqueness in ([11](https://arxiv.org/html/2606.27767#S3.E11)) for simplicity of exposition. Sufficient conditions are that $\mathrm{T}\mapsto\mathcal{F}^+(\mathrm{T}_\#\mu_k)$ is strictly convex, coercive and lower semicontinuous over $L^2(\mu_k)$, for all $k$. We now study the theoretical convergence of the WCCCP in several settings.
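To make the scheme (11) concrete, the sketch below runs it on particles for a hypothetical toy decomposition that is not one of the paper's MMD or Energy Distance splits: $\mathcal{F}^+$ is the potential energy with $\mathrm{V}(x)=\tfrac{1}{4}\|x\|_2^4$ and $\mathcal{F}^-$ is the interaction energy with $\mathrm{W}(z)=\tfrac{1}{2}\|z\|_2^2$, so the concave part is not a potential energy. Under these assumptions the subproblem separates across particles and its first-order condition $\|t_i\|_2^2\,t_i=g_i$, with $g_i=\nabla_{\mathrm{W}_2}\mathcal{F}^-(\mu_k)(x_i)$, can be solved in closed form. All function names are illustrative.

```python
import numpy as np

def F_plus(x):
    """Potential energy with V(x) = ||x||^4 / 4 (convex), for uniform particles x."""
    return np.mean(np.sum(x**2, axis=1) ** 2) / 4

def F_minus(x):
    """Interaction energy 1/2 iint ||x - y||^2/2 dmu dmu = (E||X||^2 - ||E X||^2) / 2."""
    return (np.mean(np.sum(x**2, axis=1)) - np.sum(x.mean(axis=0) ** 2)) / 2

def grad_F_minus(x):
    """(grad W * mu)(x_i) = x_i - mean for W(z) = ||z||^2 / 2."""
    return x - x.mean(axis=0)

def wcccp_step(x):
    """Minimize (1/n) sum_i [ ||t_i||^4/4 - <g_i, t_i - x_i> ] over the new positions t_i.
    The optimality condition ||t_i||^2 t_i = g_i gives t_i = g_i / ||g_i||^(2/3)."""
    g = grad_F_minus(x)
    denom = np.cbrt(np.linalg.norm(g, axis=1, keepdims=True)) ** 2
    scale = np.divide(1.0, denom, out=np.zeros_like(denom), where=denom > 0)
    return g * scale

rng = np.random.default_rng(0)
x0 = rng.normal(size=(100, 2))          # initial particles (arbitrary choice)
x = x0.copy()
values = [F_plus(x) - F_minus(x)]
for _ in range(100):
    x = wcccp_step(x)                   # push the empirical measure forward by T_{k+1}
    values.append(F_plus(x) - F_minus(x))
```

In this toy case the recorded objective values are non-increasing, in line with the majorization (10). For the kernel-based decompositions studied in the paper, the subproblem generally has no such closed form and would need an inner solver.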
#### 3.2 Theoretical Analysis in the Non-Convex Case
We first observe that in general, similarly to the analysis of (Faust et al., [2023](https://arxiv.org/html/2606.27767#bib.bib1)) in the Euclidean case, ([11](https://arxiv.org/html/2606.27767#S3.E11)) is equivalent to both a Mirror Descent and a Bregman Proximal Descent in the Wasserstein space on the objective $\mathcal{F}$, with Bregman potential respectively $\mathcal{F}^+$ and $\mathcal{F}^-$, and with step size $\tau=1$.
###### Proposition 1\.
Let $\mu_k\in\mathcal{P}_2(\mathbb{R}^d)$ for some $k\geq 0$. ([11](https://arxiv.org/html/2606.27767#S3.E11)) is equivalent to (Bregman Proximal Descent)
$$\mathrm{T}_{k+1}=\operatorname{argmin}_{\mathrm{T}\in L^2(\mu_k)}\ \mathrm{D}_{\mathcal{F}^-}^{\mu_k}(\mathrm{T},\mathrm{Id})+\mathcal{F}(\mathrm{T}_\#\mu_k),\tag{12}$$
and, if $\mathcal{F}^+$ is W-differentiable, to (Mirror Descent)
$$\mathrm{T}_{k+1}=\operatorname{argmin}_{\mathrm{T}\in L^2(\mu_k)}\ \mathrm{D}_{\mathcal{F}^+}^{\mu_k}(\mathrm{T},\mathrm{Id})+\langle\nabla_{\mathrm{W}_2}\mathcal{F}(\mu_k),\mathrm{T}-\mathrm{Id}\rangle_{L^2(\mu_k)}.\tag{13}$$
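Both equivalences can be seen by expanding the Bregman divergences with $\mathrm{S}=\mathrm{Id}$: each objective differs from $\mathrm{J}(\mathrm{T})$ in (11) only by a constant independent of $\mathrm{T}$. The following is a short sketch of that computation, assuming for the second identity that $\nabla_{\mathrm{W}_2}\mathcal{F}=\nabla_{\mathrm{W}_2}\mathcal{F}^+-\nabla_{\mathrm{W}_2}\mathcal{F}^-$; the complete proof is the one deferred to the paper's appendix.

```latex
% Bregman Proximal Descent (12): the two F^-(T_# mu_k) terms cancel.
\begin{align*}
\mathrm{D}_{\mathcal{F}^-}^{\mu_k}(\mathrm{T},\mathrm{Id}) + \mathcal{F}(\mathrm{T}_\#\mu_k)
 &= \mathcal{F}^-(\mathrm{T}_\#\mu_k) - \mathcal{F}^-(\mu_k)
    - \langle \nabla_{\mathrm{W}_2}\mathcal{F}^-(\mu_k), \mathrm{T}-\mathrm{Id}\rangle_{L^2(\mu_k)}
    + \mathcal{F}^+(\mathrm{T}_\#\mu_k) - \mathcal{F}^-(\mathrm{T}_\#\mu_k) \\
 &= \mathrm{J}(\mathrm{T}) - \mathcal{F}^-(\mu_k).
\end{align*}
% Mirror Descent (13): the two inner products with grad F^+(mu_k) cancel.
\begin{align*}
\mathrm{D}_{\mathcal{F}^+}^{\mu_k}(\mathrm{T},\mathrm{Id})
   + \langle \nabla_{\mathrm{W}_2}\mathcal{F}(\mu_k), \mathrm{T}-\mathrm{Id}\rangle_{L^2(\mu_k)}
 &= \mathcal{F}^+(\mathrm{T}_\#\mu_k) - \mathcal{F}^+(\mu_k)
    - \langle \nabla_{\mathrm{W}_2}\mathcal{F}^+(\mu_k), \mathrm{T}-\mathrm{Id}\rangle_{L^2(\mu_k)} \\
 &\quad + \langle \nabla_{\mathrm{W}_2}\mathcal{F}^+(\mu_k) - \nabla_{\mathrm{W}_2}\mathcal{F}^-(\mu_k),
    \mathrm{T}-\mathrm{Id}\rangle_{L^2(\mu_k)} \\
 &= \mathrm{J}(\mathrm{T}) - \mathcal{F}^+(\mu_k).
\end{align*}
```

Since the subtracted constants do not depend on $\mathrm{T}$, the three problems share the same minimizers.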
These equivalences allow us to provide a convergence rate in the case where $\mathcal{F}$ satisfies convexity assumptions relative to $\mathcal{F}^+$ or $\mathcal{F}^-$, leveraging the convergence analysis of Mirror Descent (Bonet et al., [2024](https://arxiv.org/html/2606.27767#bib.bib3)) and of Bregman Proximal Descent. We refer to Appendix [B](https://arxiv.org/html/2606.27767#A2) for these results and we focus only on the non-convex case here.
If both $\mathcal{F}^+,\mathcal{F}^-$ are W-differentiable, then we can take the first-order conditions in ([11](https://arxiv.org/html/2606.27767#S3.E11)), which yield the equivalent update
$$\nabla_{\mathrm{W}_2}\mathcal{F}^+(\mu_{k+1})\circ\mathrm{T}_{k+1}=\nabla_{\mathrm{W}_2}\mathcal{F}^-(\mu_k).\tag{14}$$
Leveraging this result, we obtain the following relation between $\mathcal{F}(\mu_{k+1})$ and $\mathcal{F}(\mu_k)$ involving Bregman divergences with Bregman potential $\mathcal{F}^+$ and $\mathcal{F}^-$.
###### Proposition 2\.
We have for all $k\geq 0$
$$\mathcal{F}(\mu_{k+1})=\mathcal{F}(\mu_k)-\mathrm{D}_{\mathcal{F}^-}^{\mu_k}(\mathrm{T}_{k+1},\mathrm{Id})-\mathcal{D}^k_{\mathcal{F}^+}\tag{15}$$
where $\mathcal{D}^k_{\mathcal{F}^+}\coloneqq\mathcal{F}(\mu_k)-\min_{\mathrm{T}\in L^2(\mu_k)}\big\{\mathcal{F}^+(\mathrm{T}_\#\mu_k)-\mathcal{F}^-(\mu_k)-\langle\nabla_{\mathrm{W}_2}\mathcal{F}^-(\mu_k),\mathrm{T}-\mathrm{Id}\rangle_{L^2(\mu_k)}\big\}$ and
$$\mathcal{D}^k_{\mathcal{F}^+}=\mathcal{F}^+(\mu_k)-\mathcal{F}^+(\mu_{k+1})-\langle\nabla_{\mathrm{W}_2}\mathcal{F}^-(\mu_k),\mathrm{Id}-\mathrm{T}_{k+1}\rangle_{L^2(\mu_k)}.\tag{16}$$
Assume that $\mathcal{F}^+$ is W-differentiable. Then, for all $k\geq 0$, $\mathcal{D}^k_{\mathcal{F}^+}=\mathrm{D}_{\mathcal{F}^+}^{\mu_k}(\mathrm{Id},\mathrm{T}_{k+1})$, hence
$$\mathcal{F}(\mu_{k+1})=\mathcal{F}(\mu_k)-\mathrm{D}_{\mathcal{F}^-}^{\mu_k}(\mathrm{T}_{k+1},\mathrm{Id})-\mathrm{D}_{\mathcal{F}^+}^{\mu_k}(\mathrm{Id},\mathrm{T}_{k+1}).\tag{17}$$
The term $\mathcal{D}^k_{\mathcal{F}^+}$, also used in the non-smooth and Euclidean case in (Abbaszadehpeivasti et al., [2024](https://arxiv.org/html/2606.27767#bib.bib16), eq. (15)) or (Yurtsever and Sra, [2022](https://arxiv.org/html/2606.27767#bib.bib18)), is merely a proxy for $\mathrm{D}_{\mathcal{F}^+}^{\mu_k}(\mathrm{Id},\mathrm{T}_{k+1})$ when $\mathcal{F}^+$ is possibly not W-differentiable. Note that $\mathcal{D}^k_{\mathcal{F}^+}\geq 0$: taking $\mathrm{T}=\mathrm{Id}$ shows that the min is at most $\mathcal{F}^+(\mu_k)-\mathcal{F}^-(\mu_k)=\mathcal{F}(\mu_k)$. Hence, ([15](https://arxiv.org/html/2606.27767#S3.E15)) implies that $\mathcal{F}$ is non-increasing along the WCCCP scheme. The condition $\mathcal{D}^k_{\mathcal{F}^+}=0$ can thus be used as a termination criterion, as it implies that $\mu_k$ is a critical point of $\mathcal{F}$.
###### Proposition 3\.
Assume that $\mathcal{F}^+$ is W-differentiable. Then $\mathcal{D}^k_{\mathcal{F}^+}=0$ implies that $\nabla_{\mathrm{W}_2}\mathcal{F}(\mu_k)=0$.
##### Sublinear rates\.
We now derive sublinear rates for the non-convex case. All subsequent results build on the following immediate bound (obtained by telescoping), which holds for an arbitrary sequence and actually corresponds to the more involved (Faust et al., [2023](https://arxiv.org/html/2606.27767#bib.bib1), Theorem 5),
$$\min_{k\in\{0,\dots,K-1\}}\ \mathcal{F}(\mu'_k)-\mathcal{F}(\mu'_{k+1})\leq\frac{\mathcal{F}(\mu'_0)-\mathcal{F}(\mu'_K)}{K},\quad\forall(\mu'_k)_{k\geq 0},\,K\geq 1.\tag{18}$$
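For convenience, the telescoping argument behind (18) can be spelled out in one line: the minimum of $K$ numbers is at most their average, and the sum of consecutive differences collapses.

$$\min_{k\in\{0,\dots,K-1\}}\big(\mathcal{F}(\mu'_k)-\mathcal{F}(\mu'_{k+1})\big)\leq\frac{1}{K}\sum_{k=0}^{K-1}\big(\mathcal{F}(\mu'_k)-\mathcal{F}(\mu'_{k+1})\big)=\frac{\mathcal{F}(\mu'_0)-\mathcal{F}(\mu'_K)}{K}.$$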
In the next proposition, we show a sublinear convergence rate for the termination criterion $\mathcal{D}^k_{\mathcal{F}^+}$, in accordance with results over $\mathbb{R}^d$ from (Yurtsever and Sra, [2022](https://arxiv.org/html/2606.27767#bib.bib18); Abbaszadehpeivasti et al., [2024](https://arxiv.org/html/2606.27767#bib.bib16); Faust et al., [2023](https://arxiv.org/html/2606.27767#bib.bib1)).
###### Proposition 4\.
Assume that $\mathcal{F}$ is bounded from below. Then, for all $K\geq 1$,
$$0\leq\min_{k\in\{0,\dots,K-1\}}\mathcal{D}^k_{\mathcal{F}^+}\leq\frac{\mathcal{F}(\mu_0)-\inf\mathcal{F}}{K}.\tag{19}$$
Assuming that either $\mathcal{F}^+$ or $\mathcal{F}^-$ is strongly convex along iterates, we can also obtain a sublinear convergence result for the squared distance between iterates and for the norm of the W-gradient, which guarantees that at least one iterate of WCCCP is almost a stationary point.
###### Proposition 5\.
Let $\alpha^+,\alpha^-\geq 0$ such that $\alpha^++\alpha^->0$. Assume $\mathcal{F}^+$ to be W-differentiable and that $\mathcal{F}^+$ and $\mathcal{F}^-$ are respectively $\alpha^+$- and $\alpha^-$-totally convex. Then for all $K\geq 1$,
$$\min_{0\leq k\leq K-1}\ \mathrm{W}_2^2(\mu_k,\mu_{k+1})\leq\min_{0\leq k\leq K-1}\ \|\mathrm{T}_{k+1}-\mathrm{Id}\|_{L^2(\mu_k)}^2\leq\frac{2}{\alpha^++\alpha^-}\,\frac{\mathcal{F}(\mu_0)-\mathcal{F}(\mu_K)}{K}.\tag{20}$$
Furthermore, if $\nabla_{\mathrm{W}_2}\mathcal{F}^+$ satisfies the Lipschitz condition
$$\|\nabla_{\mathrm{W}_2}\mathcal{F}^+(\mu_{k+1})\circ\mathrm{T}_{k+1}-\nabla_{\mathrm{W}_2}\mathcal{F}^+(\mu_k)\|_{L^2(\mu_k)}\leq L\|\mathrm{T}_{k+1}-\mathrm{Id}\|_{L^2(\mu_k)}\quad\forall k\geq 0,\tag{21}$$
then
$$\min_{0\leq k\leq K-1}\ \|\nabla_{\mathrm{W}_2}\mathcal{F}(\mu_k)\|_{L^2(\mu_k)}^2\leq\frac{2L^2}{\alpha^++\alpha^-}\,\frac{\mathcal{F}(\mu_0)-\mathcal{F}(\mu_K)}{K}.\tag{22}$$
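As a reading aid, here is a short sketch of how (20) and (22) can be traced back to (14), (17) and (18). It rests on two assumptions that are not restated in this section: that $\alpha$-total convexity gives the lower bound $\mathrm{D}^{\mu_k}(\mathrm{T},\mathrm{S})\geq\frac{\alpha}{2}\|\mathrm{T}-\mathrm{S}\|_{L^2(\mu_k)}^2$ on the corresponding Bregman divergence, and that $\mathcal{F}^-$ is W-differentiable so that (14) applies.

$$\mathcal{F}(\mu_k)-\mathcal{F}(\mu_{k+1})=\mathrm{D}_{\mathcal{F}^-}^{\mu_k}(\mathrm{T}_{k+1},\mathrm{Id})+\mathrm{D}_{\mathcal{F}^+}^{\mu_k}(\mathrm{Id},\mathrm{T}_{k+1})\geq\frac{\alpha^++\alpha^-}{2}\|\mathrm{T}_{k+1}-\mathrm{Id}\|_{L^2(\mu_k)}^2,$$

$$\|\nabla_{\mathrm{W}_2}\mathcal{F}(\mu_k)\|_{L^2(\mu_k)}=\|\nabla_{\mathrm{W}_2}\mathcal{F}^+(\mu_k)-\nabla_{\mathrm{W}_2}\mathcal{F}^+(\mu_{k+1})\circ\mathrm{T}_{k+1}\|_{L^2(\mu_k)}\leq L\|\mathrm{T}_{k+1}-\mathrm{Id}\|_{L^2(\mu_k)}.$$

Taking the minimum over $k$ in the first line and applying (18) to the iterates gives the second inequality of (20); the first inequality of (20) holds because $(\mathrm{Id},\mathrm{T}_{k+1})_\#\mu_k$ is a coupling between $\mu_k$ and $\mu_{k+1}$. Squaring the second line, where (14) was used to replace $\nabla_{\mathrm{W}_2}\mathcal{F}^-(\mu_k)$, and combining it with (20) gives (22).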
#### 3.3 Computing WCCCP
To solve the WCCCP scheme ([11](https://arxiv.org/html/2606.27767#S3.E11)), we need to minimize $\mathrm{J}$ on $L^2(\mu_k)$ at each iteration. First, we observe that if $\mathcal{F}^+$ is convex along curves of the form $t\mapsto\big((1-t)\mathrm{T}+t\mathrm{S}\big)_\#\mu_k$ for any $\mathrm{S},\mathrm{T}\in L^2(\mu_k)$, then $\mathrm{J}$ is convex on $L^2(\mu_k)$ as $\mathrm{D}_{\mathcal{F}^+}^{\mu_k}(\mathrm{T},\mathrm{S})\geq 0$, see [Section 2](https://arxiv.org/html/2606.27767#S2).
If $\mathcal{F}^+$ is W-differentiable, then $\mathrm{T}\mapsto\mathcal{F}^+(\mathrm{T}_\#\mu_k)$ is Fréchet differentiable and, taking the first-order conditions in ([11](https://arxiv.org/html/2606.27767#S3.E11)), we get the equivalent update ([14](https://arxiv.org/html/2606.27767#S3.E14)). If $\mathcal{F}^+(\mu)=\int\mathrm{V}\,\mathrm{d}\mu$ with $\mathrm{V}$ strictly convex, then $\mathrm{T}_{k+1}=\nabla\mathrm{V}^*\circ\nabla_{\mathrm{W}_2}\mathcal{F}^-(\mu_k)$ with $\mathrm{V}^*(x)=\sup_y\langle x,y\rangle-\mathrm{V}(y)$ the Legendre transform of $\mathrm{V}$. For more general functionals, the update is implicit and, to the best of our knowledge, its solution is not available in closed form.
Nonetheless, since $\mathrm{J}$ is convex on $L^2(\mu_k)$, under the Lipschitz condition ([21](https://arxiv.org/html/2606.27767#S3.E21)) (corresponding to $L$-smoothness of $\mathrm{T}\mapsto\mathcal{F}^+(\mathrm{T}_\#\mu_k)$ for all $k\geq 0$), we can perform a gradient descent on $L^2(\mu_k)$, which is of the form, for $0<\tau\leq 1/L$ and $\tilde{\mathrm{T}}_k^0=\mathrm{Id}$,
$$\forall\ell\geq 0,\ \tilde{\mathrm{T}}_k^{\ell+1}=\tilde{\mathrm{T}}_k^{\ell}-\tau\big(\nabla_{\mathrm{W}_2}\mathcal{F}^+\big((\tilde{\mathrm{T}}_k^{\ell})_\#\mu_k\big)\circ\tilde{\mathrm{T}}_k^{\ell}-\nabla_{\mathrm{W}_2}\mathcal{F}^-(\mu_k)\big).\tag{23}$$
Then, we approximate the solution of ([11](https://arxiv.org/html/2606.27767#S3.E11)) by $\tilde{\mathrm{T}}_{k+1}=\tilde{\mathrm{T}}_k^{M}$ after $M$ inner steps, and set $\mu_{k+1}=(\tilde{\mathrm{T}}_{k+1})_\#\mu_k$. If $\mu_k^\ell=(\tilde{\mathrm{T}}_k^\ell)_\#\mu_k=\frac{1}{n}\sum_{i=1}^n\delta_{x_i^{k,\ell}}$, then the update translates to the particles as
$$\forall k\geq 0,\ \ell\geq 0,\ x_i^{k,\ell+1}=x_i^{k,\ell}-\tau\big(\nabla_{\mathrm{W}_2}\mathcal{F}^+(\mu_k^\ell)(x_i^{k,\ell})-\nabla_{\mathrm{W}_2}\mathcal{F}^-(\mu_k)(x_i^{k,0})\big).\tag{24}$$
In practice, one can follow this procedure for any $\tau$ small enough, and we run the algorithm for $0<k\leq K$ outer steps ([11](https://arxiv.org/html/2606.27767#S3.E11)), and, at each $k$, $0<\ell\leq M$ inner steps of ([24](https://arxiv.org/html/2606.27767#S3.E24)) to minimize $\mathrm{J}$. Alternatively, one could rely on root-finding algorithms, such as Newton's method (Zolter et al., [2020](https://arxiv.org/html/2606.27767#bib.bib10)).
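The particle update (24) can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function `wcccp_outer_step`, its arguments, and the toy functionals used in the check ($\mathcal{F}^+(\mu)=\int\tfrac12\|x\|_2^2\,\mathrm{d}\mu$ and $\mathcal{F}^-(\mu)=\int\langle b,x\rangle\,\mathrm{d}\mu$) are assumptions made here for illustration and do not come from the paper. For this toy pair, the first-order condition (14) reads $\mathrm{T}_{k+1}(x)=b$, so every particle should end up at $b$.

```python
import numpy as np

def wcccp_outer_step(x, grad_plus, grad_minus, tau, n_inner):
    """One outer WCCCP step on particles, using the inner gradient descent (24).

    x          : (n, d) array, particles of mu_k
    grad_plus  : callable z -> (n, d), W-gradient of F^+ at the empirical
                 measure of z, evaluated at each particle of z
    grad_minus : callable x -> (n, d), W-gradient of F^- at mu_k
    """
    g = grad_minus(x)      # evaluated once at mu_k, frozen during the inner loop
    z = x.copy()           # inner iterate, initialised at T = Id
    for _ in range(n_inner):
        z = z - tau * (grad_plus(z) - g)
    return z

# Toy check (hypothetical functionals, chosen so the answer is known):
# F^+(mu) = int |x|^2 / 2 dmu has W-gradient x, F^-(mu) = int <b, x> dmu has W-gradient b.
b = np.array([1.0, -2.0])
rng = np.random.default_rng(0)
x0 = rng.normal(size=(5, 2))
x1 = wcccp_outer_step(
    x0,
    grad_plus=lambda z: z,
    grad_minus=lambda x: np.broadcast_to(b, x.shape),
    tau=0.5,      # satisfies tau <= 1/L with L = 1 for this toy F^+
    n_inner=60,
)
```

Freezing `g` outside the inner loop mirrors the fact that $\nabla_{\mathrm{W}_2}\mathcal{F}^-(\mu_k)$ is evaluated at the starting positions $x_i^{k,0}$ in (24).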
#### 3.4 Connection with Wasserstein Proximal Gradient Algorithm
Luu et al. ([2024](https://arxiv.org/html/2606.27767#bib.bib2)) proposed to solve the DC problem on $\mathcal{P}_2(\mathbb{R}^d)$ by using a Wasserstein Proximal Gradient Descent scheme (Salim et al., [2020](https://arxiv.org/html/2606.27767#bib.bib12)), alternating between a gradient descent step on $-\mathcal{F}^-$ and a JKO step on $\mathcal{F}^+$, *i.e.* using
$$\begin{cases}\nu_{k+1}=\big(\mathrm{Id}+\tau\nabla_{\mathrm{W}_2}\mathcal{F}^-(\mu_k)\big)_\#\mu_k\\ \mu_{k+1}=\operatorname{argmin}_{\mu\in\mathcal{P}_2(\mathbb{R}^d)}\ \frac{1}{2\tau}\mathrm{W}_2^2(\mu,\nu_{k+1})+\mathcal{F}^+(\mu).\end{cases}\tag{25}$$
In [Proposition 6](https://arxiv.org/html/2606.27767#Thmproposition6), we show that this is equivalent to minimizing an upper bound similar to that of WCCCP ([10](https://arxiv.org/html/2606.27767#S3.E10)), with an additional quadratic cost.
###### Proposition 6\.
Assume $\mu_0\in\mathcal{P}_{\mathrm{ac}}(\mathbb{R}^d)$ and that $\mu_k\in\mathcal{P}_{\mathrm{ac}}(\mathbb{R}^d)$ implies $\nu_{k+1}\in\mathcal{P}_{\mathrm{ac}}(\mathbb{R}^d)$. Then ([25](https://arxiv.org/html/2606.27767#S3.E25)) is equivalent to
$$\begin{cases}\tilde{\mathrm{T}}_{k+1}=\underset{\mathrm{T}\in L^2(\mu_k)}{\operatorname{argmin}}\ \mathcal{F}^+(\mathrm{T}_\#\mu_k)-\langle\nabla_{\mathrm{W}_2}\mathcal{F}^-(\mu_k),\mathrm{T}-\mathrm{Id}\rangle_{L^2(\mu_k)}+\frac{1}{2\tau}\|\mathrm{T}-\mathrm{Id}\|_{L^2(\mu_k)}^2\\ \mu_{k+1}=(\tilde{\mathrm{T}}_{k+1})_\#\mu_k.\end{cases}\tag{26}$$
Hence, this scheme can be seen as a lifting of CCCP algorithms with additional quadratic terms (Sun et al., [2003](https://arxiv.org/html/2606.27767#bib.bib37)), which have been used in particular to deal with the stochastic setting (Nitanda and Suzuki, [2017](https://arxiv.org/html/2606.27767#bib.bib13); Xu et al., [2019](https://arxiv.org/html/2606.27767#bib.bib14); Chayti and Jaggi, [2025](https://arxiv.org/html/2606.27767#bib.bib15)). Note that it would also be possible to use a Bregman divergence on $L^2(\mu_k)$ instead, which would be equivalent to a Bregman proximal gradient scheme, see *e.g.* (Bonet et al., [2024](https://arxiv.org/html/2606.27767#bib.bib3), Appendix F). Furthermore, the WCCCP algorithm can be seen as minimizing in ([12](https://arxiv.org/html/2606.27767#S3.E12)) a regularization $\mathcal{F}(\mathrm{T}_\#\mu_k)+\mathrm{D}_{\mathcal{F}^-}^{\mu_k}(\mathrm{T},\mathrm{Id})$ of $\mathcal{F}(\mathrm{T}_\#\mu_k)$, and ([26](https://arxiv.org/html/2606.27767#S3.E26)) hence corresponds to replacing $\mathcal{F}(\mathrm{T}_\#\mu_k)$ by $\mathcal{F}(\mathrm{T}_\#\mu_k)+\frac{1}{2\tau}\|\mathrm{T}-\mathrm{Id}\|_{L^2(\mu_k)}^2$ in our analysis.
Luu et al. ([2024](https://arxiv.org/html/2606.27767#bib.bib2)) focused on functionals whose concave part is a potential energy, *i.e.* $\mathcal{F}^-(\mu)=\int\mathrm{V}^-\mathrm{d}\mu$ for $\mathrm{V}^-:\mathbb{R}^d\to\mathbb{R}$ a convex function. This includes a large class of functionals, and in particular the KL divergence with a possibly non-log-concave target if the potential of the target admits a DC decomposition $\mathrm{V}=\mathrm{V}^+-\mathrm{V}^-$. Our theory in [Section 3](https://arxiv.org/html/2606.27767#S3) would also cover these objectives, up to replacing the gradient by the unique subgradient of the negative entropy in the tangent space, as discussed in (Ambrosio et al., [2008](https://arxiv.org/html/2606.27767#bib.bib31), Chapter 10), and verifying that the measure stays regular enough at each iteration, which can be enforced through a regularization (Xu and Li, [2025](https://arxiv.org/html/2606.27767#bib.bib101)) to avoid failures (Xu and Li, [2024](https://arxiv.org/html/2606.27767#bib.bib102)). In the next section, we focus instead on the Maximum Mean Discrepancy, which can be decomposed as a DC function whose concave part is the sum of an interaction energy and a potential energy.
### 4 DC Decomposition for the Maximum Mean Discrepancy
We now focus on finding a DC decomposition of the Maximum Mean Discrepancy (MMD) (Gretton et al., [2012](https://arxiv.org/html/2606.27767#bib.bib23)). Given a kernel $k:\mathbb{R}^d\times\mathbb{R}^d\to\mathbb{R}$, the squared MMD is defined as
$$\forall\mu,\nu\in\mathcal{P}_2(\mathbb{R}^d),\ \mathrm{MMD}_k^2(\mu,\nu)=\iint k(x,y)\ \mathrm{d}(\mu-\nu)(x)\mathrm{d}(\mu-\nu)(y).\tag{27}$$
It is well known that the squared MMD can be decomposed as the sum of an interaction energy and a potential energy (Arbel et al., [2019](https://arxiv.org/html/2606.27767#bib.bib22)), *i.e.*
$$\mathcal{F}(\mu)=\frac{1}{2}\mathrm{MMD}_k^2(\mu,\nu)=\frac{1}{2}\iint k(x,y)\ \mathrm{d}\mu(x)\mathrm{d}\mu(y)+\int\mathrm{V}\ \mathrm{d}\mu+c(\nu),\tag{28}$$
with $\mathrm{V}(x)=-\int k(x,y)\ \mathrm{d}\nu(y)$ and $c(\nu)=\frac{1}{2}\iint k(x,y)\ \mathrm{d}\nu(x)\mathrm{d}\nu(y)$. The first term is an interaction term and the second a potential. This objective is in general not (geodesically) convex, but only semi-convex (Arbel et al., [2019](https://arxiv.org/html/2606.27767#bib.bib22), Proposition 5), *i.e.* $\lambda$-totally convex with $\lambda\in\mathbb{R}$. Moreover, the performance of WGD when minimizing it depends heavily on the kernel, as observed in *e.g.* (Arbel et al., [2019](https://arxiv.org/html/2606.27767#bib.bib22); Korba et al., [2021](https://arxiv.org/html/2606.27767#bib.bib90); Hertrich et al., [2024b](https://arxiv.org/html/2606.27767#bib.bib27)).
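The decomposition (28) can be checked numerically on empirical measures. The sketch below, with hypothetical helper names and a Gaussian kernel chosen purely for illustration, compares the direct expansion of definition (27) with the interaction, potential and constant terms of (28); both use the plain sample averages (diagonal terms included).

```python
import numpy as np

def gaussian_kernel(a, b, h=1.0):
    """Gram matrix k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2h))."""
    sq_dist = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * h))

def half_mmd2_direct(x, y, kernel):
    """0.5 * MMD^2 from definition (27), expanded on empirical measures."""
    return 0.5 * (kernel(x, x).mean() - 2.0 * kernel(x, y).mean() + kernel(y, y).mean())

def half_mmd2_decomposed(x, y, kernel):
    """0.5 * MMD^2 as interaction + potential + constant, following (28)."""
    interaction = 0.5 * kernel(x, x).mean()
    potential = np.mean(-kernel(x, y).mean(axis=1))  # average of V(x_i) over particles
    constant = 0.5 * kernel(y, y).mean()             # c(nu)
    return interaction + potential + constant

rng = np.random.default_rng(0)
x = rng.normal(size=(30, 2))          # particles of mu
y = rng.normal(size=(40, 2)) + 1.0    # samples of nu
direct = half_mmd2_direct(x, y, gaussian_kernel)
decomposed = half_mmd2_decomposed(x, y, gaussian_kernel)
```

The two values agree up to floating-point error, since (28) is only a regrouping of the terms of (27).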
For an $L$-smooth kernel $k$, *i.e.* satisfying $\|\nabla k(x,y)-\nabla k(x',y')\|_2^2\leq L\big(\|x-x'\|_2^2+\|y-y'\|_2^2\big)$ for all $x,x',y,y'\in\mathbb{R}^d$, Luu et al. ([2024](https://arxiv.org/html/2606.27767#bib.bib2), Appendix A.2) proposed to use, for $\alpha\geq L$,
$$\mathcal{F}^+(\mu)=\alpha\int\|x\|_2^2\ \mathrm{d}\mu(x)+\frac{1}{2}\iint k(x,y)\ \mathrm{d}\mu(x)\mathrm{d}\mu(y)+c(\nu),\quad\mathcal{F}^-(\mu)=\int\big(\alpha\|x\|_2^2-\mathrm{V}(x)\big)\ \mathrm{d}\mu(x)\tag{29}$$
as a DC decomposition of ([28](https://arxiv.org/html/2606.27767#S4.E28)). Instead, we propose to obtain a Difference-of-Convex function for this objective by decomposing the kernel itself. For this, we focus on translation-invariant kernels.
##### MMD with translation\-invariant kernel\.
A large class of useful kernels are the translation-invariant ones, *i.e.* those of the form $k(x,y)=\psi(x-y)$ for some symmetric function $\psi:\mathbb{R}^d\to\mathbb{R}$ (Sriperumbudur et al., [2011](https://arxiv.org/html/2606.27767#bib.bib57); Muandet et al., [2017](https://arxiv.org/html/2606.27767#bib.bib58)). This class includes, among others, the Riesz (a.k.a. negative distance) kernel with $\psi(z)=-\|z\|_2$, the Gaussian kernel with $\psi(z)=e^{-\|z\|_2^2/(2h)}$, and the inverse multiquadric kernel with $\psi(z)=(c^2+\|z\|_2^2)^{-\alpha}$ and $\alpha>1$ (Muandet et al., [2017](https://arxiv.org/html/2606.27767#bib.bib58)). Assuming that $\psi$ admits a DC decomposition $\psi=\psi^+-\psi^-$, we obtain the following DC decomposition of the MMD.
###### Proposition 7\.
Let $k$ be a translation-invariant kernel of the form $k(x,y)=\psi(x-y)$ for all $x,y\in\mathbb{R}^d$, with $\psi$ admitting a DC decomposition $\psi=\psi^+-\psi^-$, $\psi^+,\psi^-$ being respectively $\alpha^+$- and $\alpha^-$-convex with $\alpha^+,\alpha^-\geq 0$. Then ([28](https://arxiv.org/html/2606.27767#S4.E28)) admits the DC decomposition $\mathcal{F}=\mathcal{F}^+-\mathcal{F}^-$ where for all $\mu\in\mathcal{P}_2(\mathbb{R}^d)$,
$$\left\{\begin{array}{ll}\mathcal{F}^+(\mu)=\tfrac{1}{2}\iint\psi^+(x-y)\ \mathrm{d}\mu(x)\mathrm{d}\mu(y)+\int\mathrm{V}^-\mathrm{d}\mu+c(\nu),&\mathrm{V}^-(\cdot)=\int\psi^-(\cdot-y)\ \mathrm{d}\nu(y),\\ \mathcal{F}^-(\mu)=\tfrac{1}{2}\iint\psi^-(x-y)\ \mathrm{d}\mu(x)\mathrm{d}\mu(y)+\int\mathrm{V}^+\mathrm{d}\mu,&\mathrm{V}^+(\cdot)=\int\psi^+(\cdot-y)\ \mathrm{d}\nu(y),\end{array}\right.\tag{30}$$
and $\mathcal{F}^+,\mathcal{F}^-$ are respectively $\alpha^-$- and $\alpha^+$-totally convex. Note that $\psi^+,\psi^-$ can always be chosen symmetric as, by symmetry of $\psi$, $2\psi(x)=\psi(x)+\psi(-x)=\psi^+(x)+\psi^+(-x)-\psi^-(x)-\psi^-(-x)$. They are also locally Lipschitz since they are convex and have full domain.
We now focus on the subclass of radial kernels, *i.e.* those for which there exists $q:\mathbb{R}_+\to\mathbb{R}$ such that $\psi(z)=q(\|z\|_2^2)$. For such kernels, DC decompositions can be obtained via a Jordan decomposition
$$\begin{cases}q_+(x)=q(0)+(q'(0)+A)x+\int_0^x(x-t)\max\big(0,q''(t)\big)\,\mathrm{d}t,\\ q_-(x)=Ax-\int_0^x(x-t)\min\big(0,q''(t)\big)\,\mathrm{d}t,\end{cases}\tag{31}$$
whenever these integrals of the second derivative are computable, and taking $A\geq\max(0,-q'(0))$ to ensure that $q_+$ and $q_-$ are nondecreasing (hence $\psi_+$ and $\psi_-$ are convex). Alternatively, if $q$ is analytic, *i.e.* $q(x)=\sum_{i\in\mathbb{N}}a_ix^i$, then we can set $I^+\coloneqq\{i\,|\,a_i\geq 0\}$ and $I^-\coloneqq\{i\,|\,a_i\leq 0\}$, $q_+(x)=\sum_{i\in I^+}a_ix^i$, $q_-(x)=-\sum_{i\in I^-}a_ix^i$, so that $q=q_+-q_-$ with both parts having nonnegative coefficients. These two choices are explored for the Gaussian kernel (Gauss Jordan vs cosh/sinh) in [Section 5](https://arxiv.org/html/2606.27767#S5), and their forms are detailed in [Table 1](https://arxiv.org/html/2606.27767#S4.T1).
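Both constructions can be sanity-checked numerically on the Gaussian profile $q(s)=e^{-s/(2h)}$. The sketch below is illustrative only: the helper names, the trapezoidal quadrature and the choices $h=1$, $A=1/(2h)$ are assumptions made here. It evaluates (31) by quadrature and checks that $q_+-q_-$ recovers $q$, then does the same for the power-series split, which for the Gaussian gives $q_+(s)=\cosh(s/(2h))$ and $q_-(s)=\sinh(s/(2h))$.

```python
import numpy as np

def _trapezoid(values, grid):
    """Composite trapezoidal rule on a 1D grid."""
    return float(np.sum(0.5 * (values[1:] + values[:-1]) * np.diff(grid)))

def jordan_split(q0, dq0, d2q, A, x, n_grid=4001):
    """Evaluate (q_+(x), q_-(x)) from the Jordan decomposition (31) by quadrature.

    q0, dq0 : values q(0) and q'(0)
    d2q     : callable returning q''(t) on an array of t
    A       : slope offset, to be taken >= max(0, -q'(0))
    """
    t = np.linspace(0.0, x, n_grid)
    curvature = d2q(t)
    q_plus = q0 + (dq0 + A) * x + _trapezoid((x - t) * np.maximum(0.0, curvature), t)
    q_minus = A * x - _trapezoid((x - t) * np.minimum(0.0, curvature), t)
    return q_plus, q_minus

h = 1.0
q = lambda s: np.exp(-s / (2.0 * h))
d2q = lambda s: np.exp(-s / (2.0 * h)) / (4.0 * h**2)
A = 1.0 / (2.0 * h)                      # equals max(0, -q'(0)) for the Gaussian

s = 3.0
q_plus, q_minus = jordan_split(q(0.0), -1.0 / (2.0 * h), d2q, A, s)
jordan_gap = abs((q_plus - q_minus) - q(s))

# Power-series split of the Gaussian profile: even terms give cosh, odd terms give sinh.
series_gap = abs(np.cosh(s / (2.0 * h)) - np.sinh(s / (2.0 * h)) - q(s))
```

Since $q''>0$ for the Gaussian profile, the Jordan split here reduces to $q_-(s)=As$ and $q_+(s)=q(s)+As$; a profile whose second derivative changes sign would exercise both the `max` and the `min` branches.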
Let $\Omega=\mathbb{R}^d$ or a compact convex subset and $S_*=\sup_{x,y\in\Omega}\ \|x-y\|_2$. Define, for $\psi(z)=q(\|z\|_2^2)$,
$$\underline{\lambda}[q]\coloneqq\inf_{0\leq s\leq S_*}\min\bigl\{2q'(s),\,2q'(s)+4sq''(s)\bigr\},\quad\overline{\Lambda}[q]\coloneqq\sup_{0\leq s\leq S_*}\ \max\bigl\{2q'(s),\,2q'(s)+4sq''(s)\bigr\}.$$
Requiring $\underline{\lambda}[q_\pm]\geq 0$ and finite $\overline{\Lambda}[q_\pm]$, quantities which are related to bounds on the Hessian of $\mathcal{F}^\pm$, gives a sufficient condition for a DC decomposition of $q$ and therefore of the MMD. Owing to this decomposition we can apply [Proposition 7](https://arxiv.org/html/2606.27767#Thmproposition7) and all the convergence results of [Section 3](https://arxiv.org/html/2606.27767#S3).
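For a given pair $q_\pm$, the two quantities $\underline{\lambda}[q]$ and $\overline{\Lambda}[q]$ can be estimated on a grid. The sketch below does this for the cosh/sinh split of the Gaussian profile; the helper `curvature_bounds`, the bandwidth $h=1$ and the upper limit `s_max=4.0` are arbitrary choices made here for illustration, and a grid evaluation only approximates the exact infimum and supremum.

```python
import numpy as np

def curvature_bounds(dq, d2q, s_max, n_grid=10001):
    """Grid estimates of (lambda_lower[q], Lambda_upper[q]) on [0, s_max].

    dq, d2q : callables returning q'(s) and q''(s) on an array of s
    """
    s = np.linspace(0.0, s_max, n_grid)
    first = 2.0 * dq(s)
    second = first + 4.0 * s * d2q(s)
    lower = float(np.minimum(first, second).min())
    upper = float(np.maximum(first, second).max())
    return lower, upper

h = 1.0
# q_+(s) = cosh(s / (2h)): derivative sinh/(2h), second derivative cosh/(4h^2)
lam_plus, Lam_plus = curvature_bounds(
    lambda s: np.sinh(s / (2.0 * h)) / (2.0 * h),
    lambda s: np.cosh(s / (2.0 * h)) / (4.0 * h**2),
    s_max=4.0,
)
# q_-(s) = sinh(s / (2h)): derivative cosh/(2h), second derivative sinh/(4h^2)
lam_minus, Lam_minus = curvature_bounds(
    lambda s: np.cosh(s / (2.0 * h)) / (2.0 * h),
    lambda s: np.sinh(s / (2.0 * h)) / (4.0 * h**2),
    s_max=4.0,
)
```

On this example the lower bounds are attained at $s=0$, giving $\underline{\lambda}[q_+]=0$ and $\underline{\lambda}[q_-]=1/h$, while the upper bounds grow with `s_max` and are finite only on a bounded range.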
###### Proposition 8\.
Let $q\in C^2([0,S_*])$. Assume there exist $q_\pm\in C^2([0,S_*])$ such that $q=q_+-q_-$. If $\underline{\lambda}[q_+],\underline{\lambda}[q_-]\geq 0$, then the translation-invariant kernel $k(x,y)=\psi(x-y)=q(\|x-y\|_2^2)$ satisfies the assumptions of [Proposition 7](https://arxiv.org/html/2606.27767#Thmproposition7) with $\psi_\pm(z)=q_\pm(\|z\|_2^2)$, $\alpha^\pm=\underline{\lambda}[q_\pm]$. Moreover, if $\overline{\Lambda}[q_+],\overline{\Lambda}[q_-]<\infty$, for all $k\geq 0$, the Lipschitz condition in ([21](https://arxiv.org/html/2606.27767#S3.E21)) holds for $L=\sqrt{2}\cdot\overline{\Lambda}[q_+]+\overline{\Lambda}[q_-]<\infty$. Hence, if $\underline{\lambda}[q_+]+\underline{\lambda}[q_-]>0$,
1. for $S_*=\infty$, *i.e.* $\Omega=\mathbb{R}^d$, WCCCP leads to an almost stationary measure;
2. for $S_*<\infty$, *i.e.* $\Omega$ a compact and convex subset of $\mathbb{R}^d$, almost stationarity ([22](https://arxiv.org/html/2606.27767#S3.E22)) of WCCCP holds under the additional condition that the iterates $(\mu_k)_k$ remain in $\Omega$.
We refer to [Table 1](https://arxiv.org/html/2606.27767#S4.T1) for decompositions of the Gaussian and (smoothed) Riesz kernels satisfying [Proposition 8](https://arxiv.org/html/2606.27767#Thmproposition8). For more discussion about kernels satisfying [Proposition 8](https://arxiv.org/html/2606.27767#Thmproposition8), we refer to Appendix [D](https://arxiv.org/html/2606.27767#A4). In Appendix [D.3](https://arxiv.org/html/2606.27767#A4.SS3), we also provide sufficient conditions under which the WCCCP scheme converges locally towards a critical point of the MMD in the compact (Prop. [D.14](https://arxiv.org/html/2606.27767#Thmproposition14)) and non-compact cases (Prop. [D.15](https://arxiv.org/html/2606.27767#Thmproposition15)).
Table 1: DC decompositions satisfying Prop. [8](https://arxiv.org/html/2606.27767#Thmproposition8), i.e., $k(x,y)=q_+(s)-q_-(s)$, where $s=\|x-y\|_2^2$.

Figure 1: Convergence of WGD, WCCCP and FB to minimize $\mathcal{F}(\mu)=\tfrac{1}{2}\mathrm{ED}(\mu,\nu)$ with $\nu$ a uniform distribution over the spiral and cat shapes.



Figure 2: Convergence of WGD and WCCCP to minimize $\mathcal{F}(\mu)=\tfrac{1}{2}\mathrm{ED}(\mu,\nu)$ with $\nu$ samples from CIFAR10.
### 5 Numerical Experiments
We now apply the WCCCP algorithm to the Energy Distance and to the MMD with Gaussian kernel, and compare its performance with Wasserstein Gradient Descent (WGD) and with the Wasserstein Proximal Gradient scheme ([25](https://arxiv.org/html/2606.27767#S3.E25)) proposed in (Luu et al., [2024](https://arxiv.org/html/2606.27767#bib.bib2)), denoted Forward-Backward (FB). For more numerical experiments and implementation details, we refer to Appendix [E](https://arxiv.org/html/2606.27767#A5). Code is available at [https://github.com/clbonet/Wasserstein_Convex_Concave_Procedure](https://github.com/clbonet/Wasserstein_Convex_Concave_Procedure).
##### Energy distance.
The Energy distance (ED) (Sejdinovic et al., [2013](https://arxiv.org/html/2606.27767#bib.bib28)) corresponds to ([28](https://arxiv.org/html/2606.27767#S4.E28)) induced by $\psi(z)=-\|z\|_2$, *i.e.* $k(x,y)=-\|x-y\|_2$ and, for $c(\nu)=-\frac{1}{2}\iint\|x-y\|_2\ \mathrm{d}\nu(x)\mathrm{d}\nu(y)$,
$$\frac{1}{2}\mathrm{ED}(\mu,\nu)=-\frac{1}{2}\iint\|x-y\|_2\ \mathrm{d}\mu(x)\mathrm{d}\mu(y)+\int\mathrm{V}\ \mathrm{d}\mu+c(\nu),\quad\mathrm{V}(\cdot)=\int\|\cdot-y\|_2\ \mathrm{d}\nu(y).\tag{32}$$
While this functional is not convex, its Wasserstein gradient flow behaves well (Chizat et al., [2026](https://arxiv.org/html/2606.27767#bib.bib29)) and has shown good results in several machine learning applications (Altekrüger et al., [2023](https://arxiv.org/html/2606.27767#bib.bib60); Hagemann et al., [2024](https://arxiv.org/html/2606.27767#bib.bib61); Hertrich et al., [2024b](https://arxiv.org/html/2606.27767#bib.bib27),[a](https://arxiv.org/html/2606.27767#bib.bib26); Geuter et al., [2025](https://arxiv.org/html/2606.27767#bib.bib59)). Nonetheless, it can be naturally decomposed as a DC functional using [Proposition 7](https://arxiv.org/html/2606.27767#Thmproposition7) with $\psi^+=0$ and $\psi^-(z)=\|z\|_2$. Thus, we propose to apply the WCCCP algorithm to minimize it.
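A minimal particle sketch of WCCCP for the Energy distance under the decomposition $\psi^+=0$, $\psi^-(z)=\|z\|_2$ is given below. With this choice, $\mathcal{F}^+$ reduces to the potential energy of $\mathrm{V}^-(x)=\int\|x-y\|_2\,\mathrm{d}\nu(y)$ plus the constant $c(\nu)$, and $\mathcal{F}^-$ to the interaction energy $\tfrac12\iint\|x-y\|_2\,\mathrm{d}\mu(x)\mathrm{d}\mu(y)$. The function names, the step size `tau=0.1`, the iteration counts, the toy Gaussian data and the `eps` guard at coinciding points (where the norm is not differentiable) are all assumptions made for illustration; this is not the authors' implementation.

```python
import numpy as np

def mean_unit_directions(a, b, eps=1e-12):
    """For each a_i, the average over j of (a_i - b_j) / ||a_i - b_j||.

    Coinciding points contribute zero thanks to the eps guard.
    """
    diff = a[:, None, :] - b[None, :, :]
    norm = np.linalg.norm(diff, axis=-1, keepdims=True)
    return (diff / np.maximum(norm, eps)).mean(axis=1)

def half_energy_distance(x, y):
    """0.5 * ED between the empirical measures of x and y, following (32)."""
    def mean_dist(a, b):
        return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1).mean()
    return -0.5 * mean_dist(x, x) + mean_dist(x, y) - 0.5 * mean_dist(y, y)

def wcccp_energy_distance(x, y, n_outer=30, n_inner=50, tau=0.1):
    """Inexact WCCCP: each outer step runs n_inner gradient steps of the form (24)."""
    for _ in range(n_outer):
        g = mean_unit_directions(x, x)   # W-gradient of F^- at mu_k, frozen below
        z = x.copy()
        for _ in range(n_inner):
            # gradient of V^- at the current inner iterate, minus the frozen term
            z = z - tau * (mean_unit_directions(z, y) - g)
        x = z
    return x

rng = np.random.default_rng(0)
y = rng.normal(size=(40, 2)) + 3.0       # hypothetical target samples
x0 = rng.normal(size=(40, 2))            # hypothetical initial particles
x_final = wcccp_energy_distance(x0, y)
before = half_energy_distance(x0, y)
after = half_energy_distance(x_final, y)
```

Because $\mathrm{V}^-$ is not smooth at the target samples, the fixed-step inner loop is only a heuristic solver here; on this toy example the objective value `after` ends up below `before`.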
On Figures 1 and [2](https://arxiv.org/html/2606.27767#S4.F2), we minimize the Energy distance with respect to $\nu_n$, an empirical distribution of $n=500$ samples drawn uniformly over spiral or cat shapes, and from CIFAR10 (Krizhevsky et al., [2009](https://arxiv.org/html/2606.27767#bib.bib99)), respectively. In both cases, we start from an empirical distribution of $n$ samples of $\mu_0=\mathcal{N}(0,I_d)$, and use the step size $\tau=1$ for WGD and FB, as WCCCP can be seen as a Mirror Descent algorithm with $\tau=1$. This allows comparing the three schemes in the same setting, assuming access to an oracle for WCCCP and FB. We add in [Figure E.7](https://arxiv.org/html/2606.27767#A5.F7) a comparison with different step sizes for WGD and the same computational budget for WCCCP, *i.e.* with the same number of gradient evaluations. Results were averaged over 100 different source and target samples on Figure 1 and over 5 on [Figure 2](https://arxiv.org/html/2606.27767#S4.F2). For both FB and WCCCP, the inner optimization of $\mathrm{J}$ is performed via gradient descent, possibly with momentum. On Figure 1, we display the number of outer iterations, focusing on the behaviour in objective values (FB and WCCCP having $M=50$ extra inner iterations). On [Figure 2](https://arxiv.org/html/2606.27767#S4.F2) instead, to be fair on the computational time (1h30 for WGD, 1h15 for WCCCP on an Nvidia V100 GPU) for $x\in\mathbb{R}^d$ and $d\approx\text{3K}$, in total 200K iterations were performed for WGD against 40K outer iterations for WCCCP, and we show snapshots every 20K iterations for WGD, and every 4K iterations for WCCCP. Even with this rescaling, the convergence of WCCCP remains much faster.
Figure 3: Optimization of $\mathcal{F}(\mu)=\frac{1}{2}\mathrm{MMD}_k^2(\mu,\nu)$ for $\nu$ a Gaussian target and $k$ the Gaussian kernel. (Left) Evolution of the squared MMD along the flow. (Right) Trajectories of the particles over time. The initial particles are in blue and the final particles in red.

##### MMD with Gaussian kernel.
Another natural translation-invariant kernel is the Gaussian kernel $k(x,y)=e^{-\|x-y\|_2^2/(2h)}$, for which $q(s)=e^{-s/(2h)}$. However, minimizing the MMD with this kernel has proven challenging, as its convergence depends strongly on the value of the bandwidth $h$ (Arbel et al., [2019](https://arxiv.org/html/2606.27767#bib.bib22)). On the one hand, we can use the decomposition ([29](https://arxiv.org/html/2606.27767#S4.E29)) proposed by Luu et al. ([2024](https://arxiv.org/html/2606.27767#bib.bib2)) with optimal choice $\alpha=\tfrac{1}{h}$ (see [Example D.1](https://arxiv.org/html/2606.27767#Thmexample1)). On the other hand, we can use the DC decomposition from [Proposition 8](https://arxiv.org/html/2606.27767#Thmproposition8) with $q_+,q_-$ chosen either with the Jordan decomposition ([31](https://arxiv.org/html/2606.27767#S4.E31)) or the algebraic one for which $q_+(2hs)=\cosh(s)$, $q_-(2hs)=\sinh(s)$. Their properties are detailed in [Table 1](https://arxiv.org/html/2606.27767#S4.T1) and derived in Appendix [D](https://arxiv.org/html/2606.27767#A4).
We compare on [Figure 3](https://arxiv.org/html/2606.27767#S5.F3) the minimization of the squared MMD with Gaussian kernel and bandwidth $h=10$ between WGD and WCCCP with these three decompositions\. We use $n=500$ particles initially sampled from $\mu_0=\mathcal{N}(5\mathbb{1}_2,I_2)$ and set the target as an empirical distribution $\nu_n=\frac{1}{n}\sum_{i=1}^{n}\delta_{y_i}$ with $y_i$ sampled independently from $\nu=\mathcal{N}(0,\Sigma)$ with $\Sigma=\bigl(\begin{smallmatrix}1&0.5\\ 0.5&1\end{smallmatrix}\bigr)$\. The results are averaged over 25 runs with different source and target samples\. The inner problems of the WCCCP scheme with $\cosh/\sinh$ are solved using gradient descent with momentum, with $m=0.9$ and step size $\tau=5\cdot 10^{-4}$ for $M=250$ iterations, while for the WCCCP scheme with Jordan decomposition, we use gradient descent with $\tau=0.1$ and $M=500$ iterations\. In particular, the Jordan decomposition is smoother and less prone to numerical instabilities than the $\cosh/\sinh$ one\. We observe that WGD and WCCCP with the decomposition \([29](https://arxiv.org/html/2606.27767#S4.E29)\) get stuck in a local minimum where some samples drift away from the target distribution, while WCCCP with the $\cosh/\sinh$ and Jordan decompositions converges much better\. We note that the Jordan decomposition can still sometimes get stuck in local minima, which are nonetheless better than those reached by WGD\. We hypothesize that this is due to the inexact solver used at each iteration of \([11](https://arxiv.org/html/2606.27767#S3.E11)\)\. On the other hand, all the samples of WCCCP with $\cosh/\sinh$ appear to eventually converge, instead of being sent away, but at a slower rate\.
We also compare the results with the FB algorithm in Appendix[E](https://arxiv.org/html/2606.27767#A5), which also showcases results that depend heavily on the choice of DC decomposition\.
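The WGD baseline in the Gaussian\-kernel experiment amounts to explicit gradient steps on the particles\. The following NumPy sketch evaluates $\frac{1}{2}\mathrm{MMD}_k^2$ and its Wasserstein gradient on particles, under our own naming and with an illustrative step size and iteration count \(the text does not report these for WGD in this experiment\)\. It does not implement WCCCP, whose inner problem depends on the decompositions \(29\) and \(31\), which are not restated in this section\.

```python
import numpy as np


def gaussian_kernel(x, y, h):
    # k(x, y) = exp(-||x - y||^2 / (2h)) for all pairs of rows of x and y.
    sq = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * h))


def half_mmd2(x, y, h):
    # F = 1/2 MMD^2 between the empirical measures of x and y (V-statistic).
    return 0.5 * (gaussian_kernel(x, x, h).mean()
                  - 2.0 * gaussian_kernel(x, y, h).mean()
                  + gaussian_kernel(y, y, h).mean())


def wasserstein_grad(x, y, h):
    # Wasserstein gradient of F at each particle x_i:
    #   mean_l grad_1 k(x_i, x_l) - mean_j grad_1 k(x_i, y_j),
    # where grad_1 k(x, z) = -(x - z) / h * k(x, z).
    def mean_grad(z):
        k = gaussian_kernel(x, z, h)            # shape (n, m)
        diff = x[:, None, :] - z[None, :, :]    # shape (n, m, d)
        return -(k[:, :, None] * diff).mean(axis=1) / h
    return mean_grad(x) - mean_grad(y)


def wgd(x0, y, h, tau, n_steps):
    # Explicit (forward) Euler steps on the particles.
    x = x0.copy()
    for _ in range(n_steps):
        x = x - tau * wasserstein_grad(x, y, h)
    return x


rng = np.random.default_rng(0)
n, h = 500, 10.0
x0 = rng.normal(size=(n, 2)) + 5.0                       # N(5 * 1_2, I_2)
sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
y = rng.multivariate_normal(np.zeros(2), sigma, size=n)  # N(0, Sigma)
x_final = wgd(x0, y, h, tau=1.0, n_steps=100)            # illustrative values
print(half_mmd2(x0, y, h), half_mmd2(x_final, y, h))
```

For an empirical measure with $n$ particles, the Wasserstein gradient at $x_i$ equals $n$ times the Euclidean partial derivative of the objective with respect to $x_i$, which gives a simple finite\-difference check of `wasserstein_grad`\.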
### 6 Conclusion
In this work, we lifted the convex\-concave procedure to the Wasserstein space, and analyzed its convergence in the convex and non\-convex settings\. Then, we used it to minimize the Maximum Mean Discrepancy, and showed improved performance over Wasserstein Gradient Descent for the negative distance and Gaussian kernels\. Nonetheless, the improved convergence of WCCCP strongly depends on the choice of the DC decomposition of the kernel\. Future work will therefore focus on better understanding the impact of different DC decompositions on the performance, with the goal of designing more effective, automatic, and adaptive decomposition strategies \(Ahmadi and Hall, [2018](https://arxiv.org/html/2606.27767#bib.bib4)\)\. On the theoretical side, taking into account the inexact solvers used for the inner optimization problems would also be a natural avenue for future work\.
### Acknowledgments and Disclosure of Funding
This work was granted access to the HPC resources of IDRIS under the allocation 2025\-AD011015891R1 made by GENCI\. CB’s work was supported by the Ecole Polytechnique Foundation as part of its campaign “Servir la Science”, and by the French National Research Agency \(ANR\) through the France 2030 program under the MacLeOD project \(ANR\-25\-PEIA\-0005\)\.
### References
- H\. Abbaszadehpeivasti, E\. de Klerk, and M\. Zamani \(2024\)On the rate of convergence of the difference\-of\-convex algorithm \(DCA\)\.Journal of Optimization Theory and Applications202\(1\),pp\. 475–496\.Cited by:[Appendix A](https://arxiv.org/html/2606.27767#A1.p3.1),[§2](https://arxiv.org/html/2606.27767#S2.SS0.SSS0.Px4.p2.1),[§3\.2](https://arxiv.org/html/2606.27767#S3.SS2.SSS0.Px1.p2.2),[§3\.2](https://arxiv.org/html/2606.27767#S3.SS2.p4.9)\.
- A\. A\. Ahmadi and G\. Hall \(2018\)DC Decomposition of Nonconvex Polynomials with Algebraic Techniques\.Mathematical Programming169\(1\),pp\. 69–94\.Cited by:[Appendix A](https://arxiv.org/html/2606.27767#A1.p1.1),[§6](https://arxiv.org/html/2606.27767#S6.p1.1)\.
- F\. Altekrüger, J\. Hertrich, and G\. Steidl \(2023\)Neural Wasserstein Gradient Flows for Discrepancies with Riesz Kernels\.InProceedings of the 40th International Conference on Machine Learning,A\. Krause, E\. Brunskill, K\. Cho, B\. Engelhardt, S\. Sabato, and J\. Scarlett \(Eds\.\),Proceedings of Machine Learning Research, Vol\.202,pp\. 664–690\.Cited by:[§5](https://arxiv.org/html/2606.27767#S5.p2.5)\.
- L\. Ambrosio, N\. Gigli, and G\. Savaré \(2008\)Gradient Flows: in Metric Spaces and in the Space of Probability Measures\.Springer\.Cited by:[Appendix B](https://arxiv.org/html/2606.27767#A2.SS0.SSS0.Px1.p2.11),[Appendix D](https://arxiv.org/html/2606.27767#A4.p1.16),[§F\.7](https://arxiv.org/html/2606.27767#A6.SS7.p4.6),[§F\.7](https://arxiv.org/html/2606.27767#A6.SS7.p5.2),[§1](https://arxiv.org/html/2606.27767#S1.p2.1),[§2](https://arxiv.org/html/2606.27767#S2.SS0.SSS0.Px2.p1.21),[§2](https://arxiv.org/html/2606.27767#S2.p1.1),[§3\.4](https://arxiv.org/html/2606.27767#S3.SS4.p3.3)\.
- A\. F\. Ansari, M\. L\. Ang, and H\. Soh \(2021\)Refining Deep Generative Models via Discriminator Gradient Flow\.InInternational Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2606.27767#S1.p2.1)\.
- M\. Arbel, A\. Korba, A\. Salim, and A\. Gretton \(2019\)Maximum Mean Discrepancy Gradient Flow\.Advances in Neural Information Processing Systems32\.Cited by:[§E\.2](https://arxiv.org/html/2606.27767#A5.SS2.SSS0.Px2.p1.2),[§1](https://arxiv.org/html/2606.27767#S1.p2.1),[§1](https://arxiv.org/html/2606.27767#S1.p3.1),[§2](https://arxiv.org/html/2606.27767#S2.SS0.SSS0.Px1.p2.10),[§4](https://arxiv.org/html/2606.27767#S4.p1.5),[§4](https://arxiv.org/html/2606.27767#S4.p1.6),[§5](https://arxiv.org/html/2606.27767#S5.p4.6)\.
- A\. Argyriou, R\. Hauser, C\. A\. Micchelli, and M\. Pontil \(2006\)A DC\-Programming Algorithm for Kernel Selection\.InProceedings of the 23rd international conference on Machine learning,pp\. 41–48\.Cited by:[§1](https://arxiv.org/html/2606.27767#S1.p4.2)\.
- M\. Askarizadeh, A\. Morsali, S\. Tofigh, and K\. K\. Nguyen \(2024\)Convex\-Concave Programming: An Effective Alternative for Optimizing Shallow Neural Networks\.IEEE Transactions on Emerging Topics in Computational Intelligence9\(4\),pp\. 2894–2907\.Cited by:[§1](https://arxiv.org/html/2606.27767#S1.p4.2)\.
- P\. Awasthi, A\. Mao, M\. Mohri, and Y\. Zhong \(2024\)DC\-programming for neural network optimizations\.Journal of Global Optimization,pp\. 1–17\.Cited by:[§1](https://arxiv.org/html/2606.27767#S1.p4.2)\.
- H\. H\. Bauschke, J\. Bolte, and M\. Teboulle \(2017\)A descent lemma beyond Lipschitz gradient continuity: first\-order methods revisited and applications\.Mathematics of Operations Research42\(2\),pp\. 330–348\.Cited by:[Appendix B](https://arxiv.org/html/2606.27767#A2.SS0.SSS0.Px1.p1.12)\.
- A\. Beck and M\. Teboulle \(2003\)Mirror Descent and Nonlinear Projected Subgradient Methods for Convex Optimization\.Operations Research Letters31\(3\),pp\. 167–175\.Cited by:[§2](https://arxiv.org/html/2606.27767#S2.SS0.SSS0.Px3.p1.16)\.
- A\. Belhadji, D\. Sharp, and Y\. Marzouk \(2026\)Weighted Quantization Using MMD: From Mean Field to Mean Shift via Gradient Flows\.InThe 29th International Conference on Artificial Intelligence and Statistics,Cited by:[§1](https://arxiv.org/html/2606.27767#S1.p3.1)\.
- R\. Bergmann, O\. P\. Ferreira, E\. M\. Santos, and J\. C\. O\. Souza \(2024\)The Difference of Convex Algorithm on Hadamard Manifolds\.Journal of Optimization Theory and Applications201\(1\),pp\. 221–251\.Cited by:[§1](https://arxiv.org/html/2606.27767#S1.p4.2)\.
- A\. Blanchet and J\. Bolte \(2018\)A family of functional inequalities: Łojasiewicz inequalities and displacement convex functions\.Journal of Functional Analysis275\(7\),pp\. 1650–1673\.Cited by:[Appendix A](https://arxiv.org/html/2606.27767#A1.p3.1),[§1](https://arxiv.org/html/2606.27767#S1.p2.1)\.
- D\. M\. Blei, A\. Kucukelbir, and J\. D\. McAuliffe \(2017\)Variational Inference: A Review for Statisticians\.Journal of the American statistical Association112\(518\),pp\. 859–877\.Cited by:[§1](https://arxiv.org/html/2606.27767#S1.p1.1)\.
- I\. M\. Bomze and M\. Locatelli \(2004\)Undominated d\.c\. Decompositions of Quadratic Functions and Applications to Branch\-and\-Bound Approaches\.Computational Optimization and Applications28\(2\),pp\. 227–245\.Cited by:[Appendix A](https://arxiv.org/html/2606.27767#A1.p1.1)\.
- C\. Bonet, L\. Drumetz, and N\. Courty \(2025\)Sliced\-Wasserstein Distances and Flows on Cartan\-Hadamard Manifolds\.Journal of Machine Learning Research26\(32\),pp\. 1–76\.Cited by:[§1](https://arxiv.org/html/2606.27767#S1.p2.1)\.
- C\. Bonet, T\. Uscidda, A\. David, P\. Aubin\-Frankowski, and A\. Korba \(2024\)Mirror and Preconditioned Gradient Descent in Wasserstein Space\.InThirty\-eight Conference on Neural Information Processing Systems,Cited by:[Appendix B](https://arxiv.org/html/2606.27767#A2.SS0.SSS0.Px1.p1.7),[Appendix B](https://arxiv.org/html/2606.27767#A2.SS0.SSS0.Px2.p1.3),[Appendix B](https://arxiv.org/html/2606.27767#A2.SS0.SSS0.Px2.p1.6),[Appendix B](https://arxiv.org/html/2606.27767#A2.SS0.SSS0.Px3.1.p1.1),[Appendix B](https://arxiv.org/html/2606.27767#A2.SS0.SSS0.Px3.p1.4),[Appendix B](https://arxiv.org/html/2606.27767#A2.p1.1),[§F\.9](https://arxiv.org/html/2606.27767#A6.SS9.p3.9),[§1](https://arxiv.org/html/2606.27767#S1.p1.1),[§2](https://arxiv.org/html/2606.27767#S2.SS0.SSS0.Px2.p1.19),[§2](https://arxiv.org/html/2606.27767#S2.SS0.SSS0.Px3.p1.10),[§2](https://arxiv.org/html/2606.27767#S2.SS0.SSS0.Px3.p1.16),[§3\.2](https://arxiv.org/html/2606.27767#S3.SS2.p2.3),[§3\.4](https://arxiv.org/html/2606.27767#S3.SS4.p2.5)\.
- B\. Bonnet \(2019\)A Pontryagin Maximum Principle in Wasserstein Spaces for Constrained Optimal Control Problems\.ESAIM: Control, Optimisation and Calculus of Variations25,pp\. 52\.Cited by:[§2](https://arxiv.org/html/2606.27767#S2.SS0.SSS0.Px1.p1.5)\.
- N\. Bonnotte \(2013\)Unidimensional and Evolution Methods for Optimal Transportation\.Ph\.D\. Thesis,Université Paris Sud\-Paris XI; Scuola normale superiore \(Pise, Italie\)\.Cited by:[§1](https://arxiv.org/html/2606.27767#S1.p2.1)\.
- S\. Boufadène and F\. Vialard \(2025\)On the global convergence of Wasserstein gradient flow of the Coulomb discrepancy\.SIAM Journal on Mathematical Analysis57\(4\),pp\. 4556–4587\.Cited by:[§1](https://arxiv.org/html/2606.27767#S1.p2.1)\.
- Y\. Brenier \(1991\)Polar Factorization and Monotone Rearrangement of Vector\-Valued Functions\.Communications on pure and applied mathematics44\(4\),pp\. 375–417\.Cited by:[§3\.1](https://arxiv.org/html/2606.27767#S3.SS1.p4.9)\.
- C\. Bunne, L\. Papaxanthos, A\. Krause, and M\. Cuturi \(2022\)Proximal Optimal Transport Modeling of Population Dynamics\.InInternational Conference on Artificial Intelligence and Statistics,pp\. 6511–6528\.Cited by:[§1](https://arxiv.org/html/2606.27767#S1.p1.1)\.
- J\. Cao, Z\. Wei, and Y\. Liu \(2026\)Gradient Flow Drifting: Generative Modeling via Wasserstein Gradient Flows of KDE\-Approximated Divergences\.arXiv preprint arXiv:2603\.10592\.Cited by:[§1](https://arxiv.org/html/2606.27767#S1.p1.1)\.
- G\. Cavagnari, G\. Savaré, and G\. E\. Sodini \(2023\)A Lagrangian approach to totally dissipative evolutions in Wasserstein spaces\.arXiv preprint arXiv:2305\.05211\.Cited by:[Appendix B](https://arxiv.org/html/2606.27767#A2.SS0.SSS0.Px1.p2.11),[§2](https://arxiv.org/html/2606.27767#S2.SS0.SSS0.Px2.p1.19)\.
- E\. M\. Chayti and M\. Jaggi \(2025\)Stochastic Difference\-of\-Convex Optimization with Momentum\.arXiv preprint arXiv:2510\.17503\.Cited by:[§3\.4](https://arxiv.org/html/2606.27767#S3.SS4.p2.5)\.
- L\. Chizat and F\. Bach \(2018\)On the Global Convergence of Gradient Descent for Over\-Parameterized Models using Optimal Transport\.Advances in neural information processing systems31\.Cited by:[§1](https://arxiv.org/html/2606.27767#S1.p1.1)\.
- L\. Chizat, M\. Colombo, R\. Colombo, and X\. Fernández\-Real \(2026\)Quantitative Convergence of Wasserstein Gradient Flows of Kernel Mean Discrepancies\.arXiv preprint arXiv:2603\.01977\.Cited by:[§5](https://arxiv.org/html/2606.27767#S5.p2.5)\.
- M\. Deng, H\. Li, T\. Li, Y\. Du, and K\. He \(2026\)Generative Modeling via Drifting\.arXiv preprint arXiv:2602\.04770\.Cited by:[§1](https://arxiv.org/html/2606.27767#S1.p1.1)\.
- C\. Du, T\. Li, T\. Pang, S\. Yan, and M\. Lin \(2023\)Nonparametric Generative Modeling with Conditional Sliced\-Wasserstein Flows\.InProceedings of the 40th International Conference on Machine Learning,A\. Krause, E\. Brunskill, K\. Cho, B\. Engelhardt, S\. Sabato, and J\. Scarlett \(Eds\.\),Proceedings of Machine Learning Research, Vol\.202,pp\. 8565–8584\.Cited by:[§1](https://arxiv.org/html/2606.27767#S1.p2.1)\.
- T\. Dumont, T\. Lacombe, and F\. Vialard \(2026\)Learning Monge maps by lifting and constraining Wasserstein gradient flows\.arXiv preprint arXiv:2603\.25182\.Cited by:[§2](https://arxiv.org/html/2606.27767#S2.SS0.SSS0.Px3.p1.10)\.
- O\. Faust, H\. Fawzi, and J\. Saunderson \(2023\)A Bregman Divergence View on the Difference\-of\-Convex Algorithm\.InInternational Conference on Artificial Intelligence and Statistics,pp\. 3427–3439\.Cited by:[Appendix A](https://arxiv.org/html/2606.27767#A1.p3.1),[Appendix C](https://arxiv.org/html/2606.27767#A3.p2.1),[Appendix C](https://arxiv.org/html/2606.27767#A3.p4.1),[§2](https://arxiv.org/html/2606.27767#S2.SS0.SSS0.Px4.p1.9),[§2](https://arxiv.org/html/2606.27767#S2.SS0.SSS0.Px4.p2.1),[§3\.2](https://arxiv.org/html/2606.27767#S3.SS2.SSS0.Px1.p1.1),[§3\.2](https://arxiv.org/html/2606.27767#S3.SS2.SSS0.Px1.p2.2),[§3\.2](https://arxiv.org/html/2606.27767#S3.SS2.p1.4)\.
- O\. Ferreira, D\. Gonçalves, M\. Louzeiro, S\. Németh, and J\. Zhu \(2026\)A subdifferential characterization via Busemann functions and applications to DC optimization on Hadamard manifolds\.arXiv preprint arXiv:2602\.20931\.Cited by:[§1](https://arxiv.org/html/2606.27767#S1.p4.2)\.
- B\. A\. Frigyik, S\. Srivastava, and M\. R\. Gupta \(2008\)Functional Bregman Divergence\.In2008 IEEE International Symposium on Information Theory,pp\. 1681–1685\.Cited by:[§2](https://arxiv.org/html/2606.27767#S2.SS0.SSS0.Px2.p1.6)\.
- Y\. Gao, Y\. Jiao, Y\. Wang, Y\. Wang, C\. Yang, and S\. Zhang \(2019\)Deep Generative Learning via Variational Gradient Flow\.InProceedings of the 36th International Conference on Machine Learning,K\. Chaudhuri and R\. Salakhutdinov \(Eds\.\),Proceedings of Machine Learning Research, Vol\.97,pp\. 2093–2101\.Cited by:[§1](https://arxiv.org/html/2606.27767#S1.p2.1)\.
- J\. Geuter, C\. Bonet, A\. Korba, and D\. Alvarez\-Melis \(2025\)DDEQs: Distributional Deep Equilibrium Models through Wasserstein Gradient Flows\.InThe 28th International Conference on Artificial Intelligence and Statistics,Cited by:[§5](https://arxiv.org/html/2606.27767#S5.p2.5)\.
- E\. Gladin, P\. Dvurechensky, A\. Mielke, and J\. Zhu \(2024\)Interaction\-Force Transport Gradient Flows\.Advances in Neural Information Processing Systems37,pp\. 14484–14508\.Cited by:[§E\.2](https://arxiv.org/html/2606.27767#A5.SS2.SSS0.Px2.p1.2),[§E\.2](https://arxiv.org/html/2606.27767#A5.SS2.SSS0.Px2.p2.6),[§1](https://arxiv.org/html/2606.27767#S1.p3.1)\.
- A\. Gretton, K\. M\. Borgwardt, M\. J\. Rasch, B\. Schölkopf, and A\. Smola \(2012\)A Kernel Two\-sample Test\.The journal of machine learning research13\(1\),pp\. 723–773\.Cited by:[§4](https://arxiv.org/html/2606.27767#S4.p1.1)\.
- P\. Hagemann, J\. Hertrich, F\. Altekrüger, R\. Beinert, J\. Chemseddine, and G\. Steidl \(2024\)Posterior Sampling Based on Gradient Flows of the MMD with Negative Distance Kernel\.InThe Twelfth International Conference on Learning Representations,Cited by:[§5](https://arxiv.org/html/2606.27767#S5.p2.5)\.
- J\. Hertrich, M\. Gräf, R\. Beinert, and G\. Steidl \(2024a\)Wasserstein Steepest Descent Flows of Discrepancies with Riesz Kernels\.Journal of Mathematical Analysis and Applications531\(1\),pp\. 127829\.Cited by:[§1](https://arxiv.org/html/2606.27767#S1.p2.1),[§5](https://arxiv.org/html/2606.27767#S5.p2.5)\.
- J\. Hertrich, C\. Wald, F\. Altekrüger, and P\. Hagemann \(2024b\)Generative Sliced MMD Flows with Riesz Kernels\.InThe Twelfth International Conference on Learning Representations,Cited by:[§E\.1](https://arxiv.org/html/2606.27767#A5.SS1.p1.15),[§E\.2](https://arxiv.org/html/2606.27767#A5.SS2.SSS0.Px2.p1.2),[§1](https://arxiv.org/html/2606.27767#S1.p2.1),[§4](https://arxiv.org/html/2606.27767#S4.p1.5),[§5](https://arxiv.org/html/2606.27767#S5.p2.5)\.
- J\. Hiriart\-Urruty \(1985\)Generalized Differentiability, Duality and Optimization for Problems Dealing with Differences of Convex Functions\.InConvexity and Duality in Optimization,J\. Ponstein \(Ed\.\),Lecture Notes in Economics and Mathematical Systems, Vol\.256,pp\. 37–70\.Cited by:[§1](https://arxiv.org/html/2606.27767#S1.p4.2),[§2](https://arxiv.org/html/2606.27767#S2.SS0.SSS0.Px4.p1.5)\.
- R\. Jordan, D\. Kinderlehrer, and F\. Otto \(1998\)The variational formulation of the Fokker–Planck equation\.SIAM journal on mathematical analysis29\(1\),pp\. 1–17\.Cited by:[§1](https://arxiv.org/html/2606.27767#S1.p1.1)\.
- A\. Korba, P\. Aubin\-Frankowski, S\. Majewski, and P\. Ablin \(2021\)Kernel Stein Discrepancy Descent\.InInternational Conference on Machine Learning,pp\. 5719–5730\.Cited by:[§4](https://arxiv.org/html/2606.27767#S4.p1.5)\.
- A\. Krizhevsky, G\. Hinton,et al\.\(2009\)Learning Multiple Layers of Features from Tiny Images\.Cited by:[§5](https://arxiv.org/html/2606.27767#S5.p3.10)\.
- M\. Lambert, S\. Chewi, F\. Bach, S\. Bonnabel, and P\. Rigollet \(2022\)Variational Inference via Wasserstein Gradient Flows\.Advances in Neural Information Processing Systems35,pp\. 14434–14447\.Cited by:[§1](https://arxiv.org/html/2606.27767#S1.p1.1)\.
- G\. Lanckriet and B\. K\. Sriperumbudur \(2009\)On the Convergence of the Concave\-Convex Procedure\.Advances in neural information processing systems22\.Cited by:[§2](https://arxiv.org/html/2606.27767#S2.SS0.SSS0.Px4.p2.1)\.
- N\. Lanzetti, S\. Bolognani, and F\. Dörfler \(2025\)First\-Order Conditions for Optimization in the Wasserstein Space\.SIAM Journal on Mathematics of Data Science7\(1\),pp\. 274–300\.Cited by:[§2](https://arxiv.org/html/2606.27767#S2.SS0.SSS0.Px1.p1.11),[§2](https://arxiv.org/html/2606.27767#S2.SS0.SSS0.Px1.p1.5),[§2](https://arxiv.org/html/2606.27767#S2.SS0.SSS0.Px1.p2.10),[§2](https://arxiv.org/html/2606.27767#S2.p1.1)\.
- H\. A\. Le Thi and T\. Pham Dinh \(2018\)DC programming and DCA: thirty years of developments\.Mathematical Programming169\(1\),pp\. 5–68\.Cited by:[§1](https://arxiv.org/html/2606.27767#S1.p4.2),[§2](https://arxiv.org/html/2606.27767#S2.SS0.SSS0.Px4.p1.9)\.
- L\. Liu, M\. B\. Majka, and Ł\. Szpruch \(2023\)Polyak–Łojasiewicz inequality on the space of measures and convergence of mean\-field birth\-death processes\.Applied Mathematics & Optimization87\(3\),pp\. 48\.Cited by:[Appendix A](https://arxiv.org/html/2606.27767#A1.p3.1),[§1](https://arxiv.org/html/2606.27767#S1.p2.1)\.
- S\. Liu, J\. Yu, J\. Simons, M\. Yi, and M\. Beaumont \(2024\)Minimizing $f$\-Divergences by Interpolating Velocity Fields\.InProceedings of the 41st International Conference on Machine Learning,R\. Salakhutdinov, Z\. Kolter, K\. Heller, A\. Weller, N\. Oliver, J\. Scarlett, and F\. Berkenkamp \(Eds\.\),Proceedings of Machine Learning Research, Vol\.235,pp\. 32308–32331\.Cited by:[§1](https://arxiv.org/html/2606.27767#S1.p2.1)\.
- A\. Liutkus, U\. Simsekli, S\. Majewski, A\. Durmus, and F\. Stöter \(2019\)Sliced\-Wasserstein Flows: Nonparametric Generative Modeling via Optimal Transport and Diffusions\.InInternational Conference on machine learning,pp\. 4104–4113\.Cited by:[§1](https://arxiv.org/html/2606.27767#S1.p2.1)\.
- H\. Lu, R\. M\. Freund, and Y\. Nesterov \(2018\)Relatively Smooth Convex Optimization by First\-Order Methods, and Applications\.SIAM Journal on Optimization28\(1\),pp\. 333–354\.Cited by:[Appendix B](https://arxiv.org/html/2606.27767#A2.SS0.SSS0.Px1.p1.12),[§2](https://arxiv.org/html/2606.27767#S2.SS0.SSS0.Px3.p1.16)\.
- H\. P\. H\. Luu and Z\. Wang \(2026\)DC\-LA: Difference\-of\-Convex Langevin Algorithm\.arXiv preprint arXiv:2601\.22932\.Cited by:[§1](https://arxiv.org/html/2606.27767#S1.p4.2)\.
- H\. P\. H\. Luu, H\. Yu, B\. Williams, P\. Mikkola, M\. Hartmann, K\. Puolamäki, and A\. Klami \(2024\)Non\-geodesically\-convex optimization in the Wasserstein space\.Advances in Neural Information Processing Systems37,pp\. 16772–16809\.Cited by:[Appendix C](https://arxiv.org/html/2606.27767#A3.p5.1),[§D\.1](https://arxiv.org/html/2606.27767#A4.SS1.SSS0.Px1.p2.4),[§D\.1](https://arxiv.org/html/2606.27767#A4.SS1.p5.2),[§E\.1](https://arxiv.org/html/2606.27767#A5.SS1.SSS0.Px1.p1.19),[§E\.2](https://arxiv.org/html/2606.27767#A5.SS2.SSS0.Px2.p7.2),[§1](https://arxiv.org/html/2606.27767#S1.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2606.27767#S1.p2.1),[§1](https://arxiv.org/html/2606.27767#S1.p4.2),[§3\.1](https://arxiv.org/html/2606.27767#S3.SS1.p1.6),[§3\.4](https://arxiv.org/html/2606.27767#S3.SS4.p1.3),[§3\.4](https://arxiv.org/html/2606.27767#S3.SS4.p3.3),[§4](https://arxiv.org/html/2606.27767#S4.p2.5),[§5](https://arxiv.org/html/2606.27767#S5.p1.1),[§5](https://arxiv.org/html/2606.27767#S5.p4.6)\.
- S\. Mei, A\. Montanari, and P\. Nguyen \(2018\)A Mean Field View of the Landscape of Two\-Layer Neural Networks\.Proceedings of the National Academy of Sciences115\(33\),pp\. E7665–E7671\.Cited by:[§1](https://arxiv.org/html/2606.27767#S1.p1.1)\.
- K\. Muandet, K\. Fukumizu, B\. Sriperumbudur, and B\. Schölkopf \(2017\)Kernel Mean Embedding of Distributions: A Review and Beyond\.Foundations and Trends® in Machine Learning10\(1\-2\),pp\. 1–141\.Cited by:[§4](https://arxiv.org/html/2606.27767#S4.SS0.SSS0.Px1.p1.8)\.
- A\. Nitanda and T\. Suzuki \(2017\)Stochastic Difference of Convex Algorithm and its Application to Training Deep Boltzmann Machines\.InArtificial intelligence and statistics,pp\. 470–478\.Cited by:[§3\.4](https://arxiv.org/html/2606.27767#S3.SS4.p2.5)\.
- Y\. Niu, H\. A\. Le Thi, and D\. T\. Pham \(2024\)On Difference\-of\-SOS and Difference\-of\-Convex\-SOS Decompositions for Polynomials\.SIAM Journal on Optimization34\(2\),pp\. 1852–1878\.Cited by:[Appendix A](https://arxiv.org/html/2606.27767#A1.p1.1)\.
- Y\. Niu \(2026\)Continuous\-Time Dynamics of the Difference\-of\-Convex Algorithm\.arXiv preprint arXiv:2604\.06926\.Cited by:[Appendix A](https://arxiv.org/html/2606.27767#A1.p3.1),[§2](https://arxiv.org/html/2606.27767#S2.SS0.SSS0.Px4.p2.1)\.
- K\. Oikonomidis, E\. Laude, and P\. Patrinos \(2025\)Forward\-backward splitting under the light of generalized convexity\.arXiv preprint arXiv:2503\.18098\.Cited by:[Appendix A](https://arxiv.org/html/2606.27767#A1.p3.1),[Appendix C](https://arxiv.org/html/2606.27767#A3.p1.2),[Appendix C](https://arxiv.org/html/2606.27767#A3.p2.1),[Appendix C](https://arxiv.org/html/2606.27767#A3.p4.1),[§2](https://arxiv.org/html/2606.27767#S2.SS0.SSS0.Px4.p2.1),[Definition C\.1](https://arxiv.org/html/2606.27767#Thmdefinition1.p1.5.5)\.
- G\. Parker \(2024\)Some convexity criteria for differentiable functions on the 2\-Wasserstein space\.Bulletin of the London Mathematical Society56\(5\),pp\. 1839–1858\.Cited by:[Appendix B](https://arxiv.org/html/2606.27767#A2.SS0.SSS0.Px1.p2.11),[§2](https://arxiv.org/html/2606.27767#S2.SS0.SSS0.Px2.p1.19)\.
- M\. Persiianov, J\. Chen, P\. Mokrov, A\. Tyurin, E\. Burnaev, and A\. Korotin \(2026\)Learning of Population Dynamics: Inverse Optimization Meets JKO Scheme\.InThe Fourteenth International Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2606.27767#S1.p1.1)\.
- M\. Petit\-Talamon, M\. Lambert, and A\. Korba \(2025\)Variational Inference with Mixtures of Isotropic Gaussians\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,Cited by:[§1](https://arxiv.org/html/2606.27767#S1.p1.1)\.
- D\. Pfau, I\. Davies, D\. L\. Borsa, J\. G\. M\. Araújo, B\. D\. Tracey, and H\. van Hasselt \(2025\)Wasserstein Policy Optimization\.InForty\-second International Conference on Machine Learning,Cited by:[§1](https://arxiv.org/html/2606.27767#S1.p1.1)\.
- T\. Pham Dinh and H\. A\. Le Thi \(2014\)Recent advances in DC programming and DCA\.Transactions on computational intelligence XIII,pp\. 1–37\.Cited by:[§1](https://arxiv.org/html/2606.27767#S1.p4.2),[§2](https://arxiv.org/html/2606.27767#S2.SS0.SSS0.Px4.p1.9)\.
- T\. Rotaru, P\. Patrinos, and F\. Glineur \(2025\)Tight Analysis of Difference\-of\-Convex Algorithm \(DCA\) Improves Convergence Rates for Proximal Gradient Descent\.arXiv preprint arXiv:2503\.04486\.Cited by:[§2](https://arxiv.org/html/2606.27767#S2.SS0.SSS0.Px4.p1.9)\.
- N\. Rux, M\. Quellmalz, and G\. Steidl \(2026\)Smoothed distance kernels for MMDs and applications in Wasserstein gradient flows\.Advances in Computational Mathematics52\(2\),pp\. 24\.Cited by:[§E\.1](https://arxiv.org/html/2606.27767#A5.SS1.p1.15)\.
- A\. Salim, A\. Korba, and G\. Luise \(2020\)The Wasserstein Proximal Gradient Algorithm\.Advances in Neural Information Processing Systems33,pp\. 12356–12366\.Cited by:[§1](https://arxiv.org/html/2606.27767#S1.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2606.27767#S1.p1.1),[§3\.4](https://arxiv.org/html/2606.27767#S3.SS4.p1.3)\.
- F\. Santambrogio \(2015\)Optimal Transport for Applied Mathematicians\.Vol\.55,Springer\.Cited by:[§1](https://arxiv.org/html/2606.27767#S1.p1.1)\.
- D\. Sejdinovic, B\. Sriperumbudur, A\. Gretton, and K\. Fukumizu \(2013\)Equivalence of distance\-based and RKHS\-based statistics in hypothesis testing\.The annals of statistics,pp\. 2263–2291\.Cited by:[§E\.1](https://arxiv.org/html/2606.27767#A5.SS1.p1.1),[§5](https://arxiv.org/html/2606.27767#S5.p2.3)\.
- L\. Sharrock, L\. Mackey, and C\. Nemeth \(2023\)Learning Rate Free Sampling in Constrained Domains\.Advances in Neural Information Processing Systems36,pp\. 65380–65415\.Cited by:[§1](https://arxiv.org/html/2606.27767#S1.p1.1)\.
- J\. d\. O\. Souza and P\. R\. Oliveira \(2015\)A proximal point algorithm for DC fuctions on Hadamard manifolds\.Journal of Global Optimization63\(4\),pp\. 797–810\.Cited by:[§1](https://arxiv.org/html/2606.27767#S1.p4.2)\.
- B\. K\. Sriperumbudur, K\. Fukumizu, and G\. R\. Lanckriet \(2011\)Universality, Characteristic Kernels and RKHS Embedding of Measures\.Journal of Machine Learning Research12\(7\)\.Cited by:[§4](https://arxiv.org/html/2606.27767#S4.SS0.SSS0.Px1.p1.8)\.
- W\. Sun, R\. J\. Sampaio, and M\. Candido \(2003\)Proximal point algorithm for minimization of DC function\.Journal of computational Mathematics,pp\. 451–462\.Cited by:[§3\.4](https://arxiv.org/html/2606.27767#S3.SS4.p2.5)\.
- K\. Tanaka \(2023\)Accelerated gradient descent method for functionals of probability measures by new convexity and smoothness based on transport maps\.arXiv preprint arXiv:2305\.05127\.Cited by:[Appendix B](https://arxiv.org/html/2606.27767#A2.SS0.SSS0.Px1.p2.11),[§2](https://arxiv.org/html/2606.27767#S2.SS0.SSS0.Px2.p1.19)\.
- P\. D\. Tao and H\. A\. Le Thi \(1997\)Convex analysis approach to DC programming: theory, algorithms and applications\.Acta mathematica vietnamica22\(1\),pp\. 289–355\.Cited by:[§2](https://arxiv.org/html/2606.27767#S2.SS0.SSS0.Px4.p2.1)\.
- P\. D\. Taoet al\.\(2014\)New and efficient DCA based algorithms for minimum sum\-of\-squares clustering\.Pattern Recognition47\(1\),pp\. 388–401\.Cited by:[§1](https://arxiv.org/html/2606.27767#S1.p4.2)\.
- A\. Terpin, N\. Lanzetti, M\. Gadea, and F\. Dorfler \(2024\)Learning diffusion at lightspeed\.InThe Thirty\-eighth Annual Conference on Neural Information Processing Systems,Cited by:[§1](https://arxiv.org/html/2606.27767#S1.p1.1)\.
- G\. Thurin, C\. Boyer, and K\. Nadjahi \(2026\)Convergence Rates for Distribution Matching with Sliced Optimal Transport\.arXiv preprint arXiv:2602\.10691\.Cited by:[§1](https://arxiv.org/html/2606.27767#S1.p2.1)\.
- Q\. H\. Tran, H\. Janati, I\. Redko, R\. Flamary, and N\. Courty \(2021\)Factored couplings in multi\-marginal optimal transport via difference of convex programming\.arXiv preprint arXiv:2110\.00629\.Cited by:[§1](https://arxiv.org/html/2606.27767#S1.p4.2)\.
- E\. Turan and M\. Ovsjanikov \(2026\)Generative Drifting is Secretly Score Matching: a Spectral and Variational Perspective\.arXiv preprint arXiv:2603\.09936\.Cited by:[§1](https://arxiv.org/html/2606.27767#S1.p1.1)\.
- C\. Villaniet al\.\(2009\)Optimal Transport: Old and New\.Vol\.338,Springer\.Cited by:[§1](https://arxiv.org/html/2606.27767#S1.p1.1),[§1](https://arxiv.org/html/2606.27767#S1.p2.1)\.
- X\. T\. Vo, H\. An Le Thi, T\. P\. Dinh, and T\. B\. T\. Nguyen \(2015\)DC Programming and DCA for Dictionary Learning\.InComputational Collective Intelligence: 7th International Conference, ICCCI 2015, Madrid, Spain, September 21\-23, 2015, Proceedings, Part I,pp\. 295–304\.Cited by:[§1](https://arxiv.org/html/2606.27767#S1.p4.2)\.
- M\. Weber and S\. Sra \(2023\)Global optimality for Euclidean CCCP under Riemannian convexity\.InInternational Conference on Machine Learning,pp\. 36790–36803\.Cited by:[§1](https://arxiv.org/html/2606.27767#S1.p4.2)\.
- A\. Wibisono \(2018\)Sampling as optimization in the space of measures: The Langevin dynamics as a composite optimization problem\.InConference on learning theory,pp\. 2093–3027\.Cited by:[§1](https://arxiv.org/html/2606.27767#S1.p1.1),[§1](https://arxiv.org/html/2606.27767#S1.p2.1)\.
- Y\. Xu and Q\. Li \(2024\)Forward\-Euler time\-discretization for Wasserstein gradient flows can be wrong\.arXiv preprint arXiv:2406\.08209\.Cited by:[§3\.4](https://arxiv.org/html/2606.27767#S3.SS4.p3.3)\.
- Y\. Xu and Q\. Li \(2025\)Forward Euler for Wasserstein Gradient Flows: Breakdown and Regularization\.arXiv preprint arXiv:2509\.13260\.Cited by:[§3\.4](https://arxiv.org/html/2606.27767#S3.SS4.p3.3)\.
- Y\. Xu and Q\. Li \(2026\)Random Coordinate Descent on the Wasserstein Space of Probability Measures\.arXiv preprint arXiv:2604\.01606\.Cited by:[§1](https://arxiv.org/html/2606.27767#S1.p1.1)\.
- Y\. Xu, Q\. Qi, Q\. Lin, R\. Jin, and T\. Yang \(2019\)Stochastic Optimization for DC Functions and Non\-smooth Non\-convex Regularizers with Non\-Asymptotic Convergence\.InInternational conference on machine learning,pp\. 6942–6951\.Cited by:[§3\.4](https://arxiv.org/html/2606.27767#S3.SS4.p2.5)\.
- A\. L\. Yuille and A\. Rangarajan \(2001\)The Concave\-Convex Procedure \(CCCP\)\.Advances in neural information processing systems14\.Cited by:[§1](https://arxiv.org/html/2606.27767#S1.p4.2),[§2](https://arxiv.org/html/2606.27767#S2.SS0.SSS0.Px4.p1.5),[§3\.1](https://arxiv.org/html/2606.27767#S3.SS1.p4.9)\.
- A\. Yurtsever and S\. Sra \(2022\)CCCP is Frank\-Wolfe in disguise\.Advances in Neural Information Processing Systems35,pp\. 35352–35364\.Cited by:[§2](https://arxiv.org/html/2606.27767#S2.SS0.SSS0.Px4.p1.9),[§2](https://arxiv.org/html/2606.27767#S2.SS0.SSS0.Px4.p2.1),[§3\.2](https://arxiv.org/html/2606.27767#S3.SS2.SSS0.Px1.p2.2),[§3\.2](https://arxiv.org/html/2606.27767#S3.SS2.p4.9)\.
- R\. Zhang, C\. Chen, C\. Li, and L\. Carin \(2018\)Policy Optimization as Wasserstein Gradient Flows\.InInternational Conference on machine learning,pp\. 5737–5746\.Cited by:[§1](https://arxiv.org/html/2606.27767#S1.p1.1)\.
- S\. Zhu and X\. Chen \(2025\)Convergence Analysis of the Wasserstein Proximal Algorithm beyond Geodesic Convexity\.arXiv preprint arXiv:2501\.14993\.Cited by:[§1](https://arxiv.org/html/2606.27767#S1.p2.1)\.
- Z\. Kolter, D\. Duvenaud, and M\. Johnson \(2020\)Deep Implicit Layers \- Neural ODEs, Deep Equilibrium Models, and Beyond\.Neurips 2020 Tutorial\.Cited by:[§3\.3](https://arxiv.org/html/2606.27767#S3.SS3.p3.16)\.
## Appendix
The appendix is organized as follows\. In Appendix [A](https://arxiv.org/html/2606.27767#A1), we discuss some limitations of the paper\. In Appendix [B](https://arxiv.org/html/2606.27767#A2), we detail the theoretical analysis of the WCCCP algorithm in the convex case, leveraging the mirror and Bregman proximal descent formulations\. In Appendix [C](https://arxiv.org/html/2606.27767#A3), we discuss the derivations of Polyak\-Łojasiewicz inequalities for DC functionals\. In Appendix [D](https://arxiv.org/html/2606.27767#A4), we provide the full theoretical analysis of DC decompositions of the MMD\. In Appendix [E](https://arxiv.org/html/2606.27767#A5), we include more details about the numerical experiments\. Finally, in Appendix [F](https://arxiv.org/html/2606.27767#A6), we state all the proofs\.
### Appendix A Limitations
Our empirical findings show that the improved convergence of WCCCP on MMD strongly depends on the choice of the DC decomposition of the kernel $k$. Our current strategy is restricted to a few simple decompositions of the kernel and does not take into account the geometry of the optimization problem. Hence, future work could focus on developing automatic and adaptive ways to find DC decompositions of the kernel better suited to the problem of minimizing the MMD, *e.g.* based on algebraic decompositions and polynomial DC decompositions minimizing a suitable objective [Bomze and Locatelli, [2004](https://arxiv.org/html/2606.27767#bib.bib5), Ahmadi and Hall, [2018](https://arxiv.org/html/2606.27767#bib.bib4), Niu et al., [2024](https://arxiv.org/html/2606.27767#bib.bib6)].
This work also focused on DC decompositions of the MMD, which can be written as a sum of simple functionals, namely potential and interaction energies. For the MMD with the ubiquitous radial kernels, we were thus able to derive DC decompositions based on decompositions of functions from $\mathbb{R}$ to $\mathbb{R}$. Future work could focus on deriving DC decompositions of more complex functionals.
On the theoretical side, as highlighted in [Section 1](https://arxiv.org/html/2606.27767#S1), there are versions of the Polyak-Łojasiewicz (PL) inequality for both the Wasserstein space [Blanchet and Bolte, [2018](https://arxiv.org/html/2606.27767#bib.bib65), Liu et al., [2023](https://arxiv.org/html/2606.27767#bib.bib66)] and DC functions [Abbaszadehpeivasti et al., [2024](https://arxiv.org/html/2606.27767#bib.bib16), Faust et al., [2023](https://arxiv.org/html/2606.27767#bib.bib1), Oikonomidis et al., [2025](https://arxiv.org/html/2606.27767#bib.bib9), Niu, [2026](https://arxiv.org/html/2606.27767#bib.bib7)]. So far, our framework does not include them despite preliminary research. We discuss reasons for this in [Appendix C](https://arxiv.org/html/2606.27767#A3).
We also did not take into account that we actually solve each inner scheme approximately, which leads to a gap between theory and practice\.
### Appendix B Theoretical Analysis of WCCCP in the Convex Case
We focus in this section on the case where $\mathcal{F}=\mathcal{F}^{+}-\mathcal{F}^{-}$ is also convex along a curve of interest, which we detail below. For this, we leverage [Proposition 1](https://arxiv.org/html/2606.27767#Thmproposition1), where we showed that ([11](https://arxiv.org/html/2606.27767#S3.E11)) is equivalent to both a mirror descent and a Bregman proximal descent in the Wasserstein space. Consequently, on the one hand, we inherit the results from [Bonet et al., [2024](https://arxiv.org/html/2606.27767#bib.bib3)] for mirror descent, and on the other hand we derive novel convergence results for Bregman proximal descent.
##### Relative convexity and smoothness\.
First, we need to introduce the notions of relative convexity and smoothness. Let $\alpha,\beta\geq 0$. Following [Bonet et al., [2024](https://arxiv.org/html/2606.27767#bib.bib3)], we say that $\mathcal{F}$ is $\alpha$-convex relative to $\mathcal{G}:\mathcal{P}_{2}(\mathbb{R}^{d})\to\mathbb{R}$ along $t\mapsto\mu_{t}=\big((1-t)\mathrm{T}+t\mathrm{S}\big)_{\#}\mu$ if $\mathrm{D}_{\mathcal{F}}^{\mu}(\mathrm{T},\mathrm{S})\geq\alpha\mathrm{D}_{\mathcal{G}}^{\mu}(\mathrm{T},\mathrm{S})$. Equivalently, $\mathcal{F}-\alpha\mathcal{G}$ is convex along this curve, and

$$\forall t\in[0,1],\ \mathcal{F}(\mu_{t})\leq(1-t)\mathcal{F}(\mathrm{T}_{\#}\mu)+t\mathcal{F}(\mathrm{S}_{\#}\mu)-\alpha t(1-t)\mathrm{D}_{\mathcal{G}}^{\mu}(\mathrm{T},\mathrm{S}). \tag{33}$$

Likewise, $\mathcal{F}$ is $\beta$-smooth relative to $\mathcal{G}$ along this curve if $\mathrm{D}_{\mathcal{F}}^{\mu}(\mathrm{T},\mathrm{S})\leq\beta\mathrm{D}_{\mathcal{G}}^{\mu}(\mathrm{T},\mathrm{S})$. These definitions lift the notions of relative convexity and smoothness [Lu et al., [2018](https://arxiv.org/html/2606.27767#bib.bib67), Bauschke et al., [2017](https://arxiv.org/html/2606.27767#bib.bib91)] to $\mathcal{P}_{2}(\mathbb{R}^{d})$ for differentiable functionals.
If the convexity holds for all $\mathrm{T},\mathrm{S}\in L^{2}(\mu)$, $\mu\in\mathcal{P}_{2}(\mathbb{R}^{d})$ and $\mathcal{G}(\mu)=\int\tfrac{1}{2}\|\cdot\|_{2}^{2}\,\mathrm{d}\mu$, then we say that $\mathcal{F}$ is $\alpha$-totally convex [Cavagnari et al., [2023](https://arxiv.org/html/2606.27767#bib.bib34), Tanaka, [2023](https://arxiv.org/html/2606.27767#bib.bib21), Parker, [2024](https://arxiv.org/html/2606.27767#bib.bib35)]. If convexity only holds for $\mathrm{T}=\mathrm{Id}$ and $\mathrm{S}$ the gradient of a convex function, it coincides with strong convexity along geodesics [Ambrosio et al., [2008](https://arxiv.org/html/2606.27767#bib.bib31)]. Convexity along generalized geodesics corresponds to both $\mathrm{T}$ and $\mathrm{S}$ being gradients of convex functions. Examples of totally convex functionals include potential and interaction energies, provided $\mathrm{V},\mathrm{W}$ are convex, lower semi-continuous and have a negative part with quadratic growth [Ambrosio et al., [2008](https://arxiv.org/html/2606.27767#bib.bib31), Section 9.3]. Note also that all three notions of convexity are equivalent for continuous functionals (for $d\geq 2$) [Cavagnari et al., [2023](https://arxiv.org/html/2606.27767#bib.bib34), Parker, [2024](https://arxiv.org/html/2606.27767#bib.bib35)].
##### Bregman Wasserstein distance\.
Similarly to [Bonet et al., [2024](https://arxiv.org/html/2606.27767#bib.bib3)], let us introduce the Bregman Optimal Transport problem $\mathrm{W}_{\phi}$ associated with a Wasserstein differentiable functional $\phi:\mathcal{P}_{2}(\mathbb{R}^{d})\to\mathbb{R}$, defined for $\mu,\nu\in\mathcal{P}_{2}(\mathbb{R}^{d})$ as

$$\mathrm{W}_{\phi}(\nu,\mu)\coloneqq\inf_{\gamma\in\Pi(\nu,\mu)}\ \phi(\nu)-\phi(\mu)-\int\langle\nabla_{\mathrm{W}_{2}}\phi(\mu)(y),x-y\rangle\ \mathrm{d}\gamma(x,y). \tag{34}$$

By [Bonet et al., [2024](https://arxiv.org/html/2606.27767#bib.bib3), Proposition 15], if $\mu\in\mathcal{P}_{\mathrm{ac}}(\mathbb{R}^{d})$, then there exists $\mathrm{T}\in L^{2}(\mu)$ such that $\mathrm{W}_{\phi}(\nu,\mu)=\mathrm{D}_{\phi}^{\mu}(\mathrm{T},\mathrm{Id})$, *i.e.* this problem admits an optimal transport map.
##### Convergence results\.
We now provide a first convergence result for $\mathcal{F}$ smooth and strongly convex relative to $\mathcal{F}^{+}$, relying on the mirror descent formulation ([13](https://arxiv.org/html/2606.27767#S3.E13)) and the results of [Bonet et al., [2024](https://arxiv.org/html/2606.27767#bib.bib3)]. More precisely, $\mathcal{F}$ needs to be smooth along iterates and convex along what would be the analog of geodesics on the space $(\mathcal{P}_{2}(\mathbb{R}^{d}),\mathrm{W}_{\mathcal{F}^{+}})$.
###### Proposition B\.9\.
Let $\mathcal{F}:\mathcal{P}_{2}(\mathbb{R}^{d})\to\mathbb{R}$ be a functional admitting a DC decomposition $\mathcal{F}=\mathcal{F}^{+}-\mathcal{F}^{-}$ with both $\mathcal{F}^{+},\mathcal{F}^{-}$ Wasserstein differentiable on $\mathcal{P}_{2}(\mathbb{R}^{d})$. Let $\nu\in\mathcal{P}_{2}(\mathbb{R}^{d})$, $0\leq\alpha\leq\beta\leq 1$ and $(\mathrm{T}_{k})_{k\geq 1}$, $(\mu_{k})_{k\geq 0}$ be the iterates of ([11](https://arxiv.org/html/2606.27767#S3.E11)). Assume that for all $k\geq 0$, $\mu_{k}\in\mathcal{P}_{\mathrm{ac}}(\mathbb{R}^{d})$, and denote $\mathrm{T}^{\mu_{k},\nu}=\operatorname{argmin}_{\mathrm{T}\in L^{2}(\mu_{k})}\ \mathrm{D}_{\mathcal{F}^{+}}^{\mu_{k}}(\mathrm{T},\mathrm{Id})$, which exists as $\mu_{k}\in\mathcal{P}_{\mathrm{ac}}(\mathbb{R}^{d})$. Furthermore, assume that $\mathcal{F}$ is $\beta$-smooth along $t\mapsto\big((1-t)\mathrm{Id}+t\mathrm{T}_{k+1}\big)_{\#}\mu_{k}$ and $\mathcal{F}$ is $\alpha$-convex relative to $\mathcal{F}^{+}$ along $t\mapsto\big((1-t)\mathrm{Id}+t\mathrm{T}^{\mu_{k},\nu}\big)_{\#}\mu_{k}$. Then, for all $k\geq 1$,

$$\mathcal{F}(\mu_{k})-\mathcal{F}(\nu)\leq\frac{\alpha}{(1-\alpha)^{-k}-1}\mathrm{W}_{\mathcal{F}^{+}}(\nu,\mu_{0})\leq\frac{1-\alpha}{k}\mathrm{W}_{\mathcal{F}^{+}}(\nu,\mu_{0}). \tag{35}$$

Moreover, if $\alpha>0$, taking $\nu=\mu^{*}$ the minimizer of $\mathcal{F}$, we obtain a linear rate, *i.e.* for all $k\geq 0$,

$$\mathrm{W}_{\mathcal{F}^{+}}(\mu^{*},\mu_{k})\leq(1-\alpha)^{k}\mathrm{W}_{\mathcal{F}^{+}}(\mu^{*},\mu_{0}). \tag{36}$$
###### Proof\.
The assumptions imply that we can use\[Bonetet al\.,[2024](https://arxiv.org/html/2606.27767#bib.bib3), Proposition 4\]for the mirror descent formulation \([13](https://arxiv.org/html/2606.27767#S3.E13)\)\. ∎
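The second inequality in (35) is an instance of Bernoulli's inequality. As a quick check, under the assumption $0<\alpha<1$ (the boundary cases are not covered by this sketch):

```latex
\[
(1-\alpha)^{-k}=\Big(1+\frac{\alpha}{1-\alpha}\Big)^{k}\;\geq\;1+\frac{k\alpha}{1-\alpha}
\quad\Longrightarrow\quad
\frac{\alpha}{(1-\alpha)^{-k}-1}\;\leq\;\frac{\alpha(1-\alpha)}{k\alpha}=\frac{1-\alpha}{k}.
\]
```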
Assuming conditions analogous to those for the convergence of the JKO scheme, adapted to the Bregman Proximal Gradient Descent ([12](https://arxiv.org/html/2606.27767#S3.E12)), we obtain the following linear convergence rate.
###### Proposition B\.10\.
Let $\mathcal{F}:\mathcal{P}_{2}(\mathbb{R}^{d})\to\mathbb{R}$ be a functional admitting a DC decomposition $\mathcal{F}=\mathcal{F}^{+}-\mathcal{F}^{-}$ with both $\mathcal{F}^{+},\mathcal{F}^{-}$ Wasserstein differentiable on $\mathcal{P}_{2}(\mathbb{R}^{d})$. Let $\mu^{*}\in\mathcal{P}_{2}(\mathbb{R}^{d})$ be the minimizer of $\mathcal{F}$, $\alpha\geq 0$ and $(\mathrm{T}_{k})_{k\geq 1}$, $(\mu_{k})_{k\geq 0}$ be the iterates of ([11](https://arxiv.org/html/2606.27767#S3.E11)). Assume that for all $k\geq 0$, $\mu_{k}\in\mathcal{P}_{\mathrm{ac}}(\mathbb{R}^{d})$, and denote $\mathrm{T}_{k}^{*}=\operatorname{argmin}_{\mathrm{T}\in L^{2}(\mu_{k})}\ \mathrm{D}_{\mathcal{F}^{+}}^{\mu_{k}}(\mathrm{T},\mathrm{Id})$, which exists as $\mu_{k}\in\mathcal{P}_{\mathrm{ac}}(\mathbb{R}^{d})$. Furthermore, assume that $\mathcal{F}$ is $\alpha$-convex relative to $\mathcal{F}^{-}$ along $t\mapsto\big((1-t)\mathrm{T}_{k}^{*}+t\mathrm{T}_{k+1}\big)_{\#}\mu_{k}$. Then, for all $k\geq 0$,

$$\mathrm{W}_{\mathcal{F}^{-}}(\mu^{*},\mu_{k})\leq\left(\frac{1}{1+\alpha}\right)^{k}\mathrm{W}_{\mathcal{F}^{-}}(\mu^{*},\mu_{0}),\quad\text{and}\quad\mathcal{F}(\mu_{k+1})-\mathcal{F}(\mu^{*})\leq\left(\frac{1}{1+\alpha}\right)^{k}\mathrm{W}_{\mathcal{F}^{-}}(\mu^{*},\mu_{0}). \tag{37}$$
###### Proof\.
See[Section˜F\.9](https://arxiv.org/html/2606.27767#A6.SS9)\. ∎
The convergence results are given in the geometry induced by the Bregman divergence with potential given byℱ−\\mathcal\{F\}^\{\-\}\.
### Appendix C Enquiry on Polyak-Łojasiewicz Inequality for DC Functionals
An analog of the DC PL inequality of [Oikonomidis et al., [2025](https://arxiv.org/html/2606.27767#bib.bib9), Definition 6.1], written using only the Bregman divergences of $\mathcal{F}^{+}$ and $\mathcal{F}^{-}$, reads as follows.
###### Definition C\.1\.
We say that $\mathcal{F}=\mathcal{F}^{+}-\mathcal{F}^{-}$ satisfies the Wasserstein DC PL inequality, analogous to [Oikonomidis et al., [2025](https://arxiv.org/html/2606.27767#bib.bib9), Definition 6.1], if there exist $\eta_{1}\geq 0$, $\eta_{2}\geq 0$ such that $\eta_{1}+\eta_{2}>0$ and for all $\mu\in\mathcal{P}_{2}(\mathbb{R}^{d})$,

$$\eta_{1}\big(\mathcal{F}(\mu)-\inf\mathcal{F}\big)+\eta_{2}\big(\mathcal{F}(\bar{\mathrm{T}}_{\#}\mu)-\inf\mathcal{F}\big)\leq\mathrm{D}_{\mathcal{F}^{-}}^{\mu}(\bar{\mathrm{T}},\mathrm{Id})+\mathrm{D}_{\mathcal{F}^{+}}^{\mu}(\mathrm{Id},\bar{\mathrm{T}}) \tag{38}$$

where $\bar{\mathrm{T}}$ solves WCCCP, *e.g.* in its version ([12](https://arxiv.org/html/2606.27767#S3.E12)), giving $\bar{\mathrm{T}}=\operatorname{argmin}_{\mathrm{T}\in L^{2}(\mu)}\ \mathrm{D}_{\mathcal{F}^{-}}^{\mu}(\mathrm{T},\mathrm{Id})+\mathcal{F}(\mathrm{T}_{\#}\mu)$.
Under this ideal condition, we show that we have the following linear convergence rate, analogously to\[Faustet al\.,[2023](https://arxiv.org/html/2606.27767#bib.bib1), Lemma 1\]and\[Oikonomidiset al\.,[2025](https://arxiv.org/html/2606.27767#bib.bib9), Theorem 6\.2\]\.
###### Proposition C\.11\.
Assume $\mathcal{F}$ satisfies the Wasserstein DC PL inequality ([38](https://arxiv.org/html/2606.27767#A3.E38)). Then, for all $k\geq 0$,

$$\mathcal{F}(\mu_{k})-\inf\mathcal{F}\leq\left(\frac{1-\eta_{1}}{1+\eta_{2}}\right)^{k}\big(\mathcal{F}(\mu_{0})-\inf\mathcal{F}\big). \tag{39}$$

If $\eta_{1}\geq 1$, then we can set $\eta_{1}=1$ and we have convergence in one step.
###### Proof\.
Since $\mu_{k+1}=(\mathrm{T}_{k+1})_{\#}\mu_{k}$, ([38](https://arxiv.org/html/2606.27767#A3.E38)) gives

$$\begin{aligned}\eta_{1}\big(\mathcal{F}(\mu_{k})-\inf\mathcal{F}\big)+\eta_{2}\big(\mathcal{F}(\mu_{k+1})-\inf\mathcal{F}\big)&\leq\mathrm{D}_{\mathcal{F}^{-}}^{\mu_{k}}(\mathrm{T}_{k+1},\mathrm{Id})+\mathrm{D}_{\mathcal{F}^{+}}^{\mu_{k}}(\mathrm{Id},\mathrm{T}_{k+1})\\&=\mathcal{F}(\mu_{k})-\mathcal{F}(\mu_{k+1}),\end{aligned}$$

where the equality is ([17](https://arxiv.org/html/2606.27767#S3.E17)). Rearranging, we obtain

$$\mathcal{F}(\mu_{k+1})-\inf\mathcal{F}\leq\frac{1-\eta_{1}}{1+\eta_{2}}\big(\mathcal{F}(\mu_{k})-\inf\mathcal{F}\big)\leq\left(\frac{1-\eta_{1}}{1+\eta_{2}}\right)^{k+1}\big(\mathcal{F}(\mu_{0})-\inf\mathcal{F}\big). \tag{40}$$

∎
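The rearrangement step can be spelled out as follows, writing $g_{k}$ for the optimality gap (a shorthand introduced here for illustration, not used in the paper):

```latex
\[
g_{k}\coloneqq\mathcal{F}(\mu_{k})-\inf\mathcal{F},\qquad
\eta_{1}g_{k}+\eta_{2}g_{k+1}\;\leq\;\mathcal{F}(\mu_{k})-\mathcal{F}(\mu_{k+1})=g_{k}-g_{k+1}
\;\Longrightarrow\;
(1+\eta_{2})\,g_{k+1}\;\leq\;(1-\eta_{1})\,g_{k}.
\]
```

Dividing by $1+\eta_{2}>0$ and iterating over $k$ gives the geometric bound.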
Although [Oikonomidis et al., [2025](https://arxiv.org/html/2606.27767#bib.bib9), Lemma 6.3] shows in the Euclidean case that [Faust et al., [2023](https://arxiv.org/html/2606.27767#bib.bib1), Definition 1] implies [Oikonomidis et al., [2025](https://arxiv.org/html/2606.27767#bib.bib9), Definition 6.1], the latter is difficult to check directly for a given functional and is more of a target inequality to establish. We thus want to argue that [Faust et al., [2023](https://arxiv.org/html/2606.27767#bib.bib1), Definition 1] is a more promising candidate for a DC PL inequality, and the conditions under which it holds have been discussed in more detail, *e.g.* under strong convexity. However, this definition rests upon Fenchel duality on $\mathbb{R}^{d}$, for which there is no publicly available counterpart on the Wasserstein space at the time of this submission. Consequently, it is unclear for now whether [Definition C.1](https://arxiv.org/html/2606.27767#Thmdefinition1) holds for functionals with strongly convex DC decompositions.
While [Luu et al., [2024](https://arxiv.org/html/2606.27767#bib.bib2), Theorems 4 and 5] do rest upon a Łojasiewicz inequality, the latter is unrelated to Bregman divergences. Luu et al. [[2024](https://arxiv.org/html/2606.27767#bib.bib2)] instead achieve their linear rates using the extra Wasserstein regularization appearing in ([26](https://arxiv.org/html/2606.27767#S3.E26)). While we discussed the similarity of the two approaches in [Section 3.4](https://arxiv.org/html/2606.27767#S3.SS4), one cannot use their theory without the Wasserstein regularization, as otherwise their constants become vacuous.
### Appendix D DC Theory for MMD
Let $k$ be a translation-invariant kernel of the form $k(x,y)=\psi(x-y)$ for all $x,y\in\mathbb{R}^{d}$, with $\psi$ admitting a DC decomposition $\psi=\psi^{+}-\psi^{-}$. Then by [Proposition 7](https://arxiv.org/html/2606.27767#Thmproposition7), the squared MMD ([28](https://arxiv.org/html/2606.27767#S4.E28)) admits the DC decomposition $\mathcal{F}=\mathcal{F}^{+}-\mathcal{F}^{-}$ where for all $\mu\in\mathcal{P}_{2}(\mathbb{R}^{d})$,

$$\left\{\begin{array}{ll}\mathcal{F}^{+}(\mu)=\tfrac{1}{2}\iint\psi^{+}(x-y)\ \mathrm{d}\mu(x)\mathrm{d}\mu(y)+\int\mathrm{V}^{-}\mathrm{d}\mu+c,&\mathrm{V}^{-}(\cdot)=\int\psi^{-}(\cdot-y)\ \mathrm{d}\nu(y),\\ \mathcal{F}^{-}(\mu)=\tfrac{1}{2}\iint\psi^{-}(x-y)\ \mathrm{d}\mu(x)\mathrm{d}\mu(y)+\int\mathrm{V}^{+}\mathrm{d}\mu,&\mathrm{V}^{+}(\cdot)=\int\psi^{+}(\cdot-y)\ \mathrm{d}\nu(y),\end{array}\right. \tag{41}$$

and $\mathcal{F}^{+},\mathcal{F}^{-}$ are both totally convex. Moreover, if $\psi^{+}$ and $\psi^{-}$ are respectively $\alpha^{+}\geq 0$ and $\alpha^{-}\geq 0$ strongly convex, then $\mathcal{F}^{+}$ is $\alpha^{-}$-totally convex and $\mathcal{F}^{-}$ is $\alpha^{+}$-totally convex, as they inherit the strong convexity of the potential while the interaction terms are only convex, see [Ambrosio et al., [2008](https://arxiv.org/html/2606.27767#bib.bib31), Section 9.3].
We will now focus on radial kernels. Let $\Omega\subset\mathbb{R}^{d}$ be a nonempty compact convex set. Consider the geodesically convex set $\mathcal{H}=\mathcal{P}_{2}(\Omega)$. Let

$$S_{*}\coloneqq\sup_{x,y\in\Omega}\|x-y\|_{2}. \tag{42}$$

We consider in what follows:

$$\psi^{\pm}(z):=q_{\pm}(\|z\|_{2}^{2}),\quad z\in\Omega-\Omega\coloneqq\{x-y\,|\,x,y\in\Omega\}, \tag{43}$$

where $q_{\pm}\in C^{2}([0,S_{*}])$. The MMD functional $\mathcal{F}$ is therefore defined as a function of $q_{\pm}$.
#### D.1 Strong Convexity of $\mathcal{F}^{-}$ and $\mathcal{F}^{+}$ for Radial Kernels
We first study the Hessian of radial functions, in order to derive conditions for $\psi^{\pm}$ to be (strongly) convex and to be able to apply [Proposition 7](https://arxiv.org/html/2606.27767#Thmproposition7).
Let $q\in C^{2}([0,S_{*}])$ and define:

$$\underline{\lambda}[q]\coloneqq\inf_{0\leq s\leq S_{*}}\min\bigl\{2q'(s),\,2q'(s)+4sq''(s)\bigr\},\quad\overline{\Lambda}[q]\coloneqq\sup_{0\leq s\leq S_{*}}\max\bigl\{2q'(s),\,2q'(s)+4sq''(s)\bigr\}. \tag{44}$$

We use the shorthands $\lambda_{\pm}:=\underline{\lambda}[q_{\pm}]$ and $\Lambda_{\pm}:=\overline{\Lambda}[q_{\pm}]$. First, we compute the Hessian of $\psi(z)=q(\|z\|_{2}^{2})$ and bound its eigenvalues using $\underline{\lambda}[q]$ and $\overline{\Lambda}[q]$.
###### Lemma D.1 (Radial Hessian Bounds).
Let $\psi(z)=q(\|z\|_{2}^{2})$ for $z\in\Omega-\Omega$. Then

$$\nabla^{2}\psi(z)=2q'(\|z\|_{2}^{2})I_{d}+4q''(\|z\|_{2}^{2})zz^{\top}. \tag{45}$$

Hence, we have the following bounds on $\nabla^{2}\psi$:

$$\underline{\lambda}[q]I_{d}\preceq\nabla^{2}\psi(z)\preceq\overline{\Lambda}[q]I_{d},\quad z\in\Omega-\Omega. \tag{46}$$

If $\underline{\lambda}[q]\geq 0$, then $\psi$ is convex on $\Omega-\Omega$.
###### Proof\.
See[Section˜F\.10](https://arxiv.org/html/2606.27767#A6.SS10)\. ∎
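As a numerical sanity check of (45), the sketch below compares the closed-form Hessian with a finite-difference Hessian and with the eigenvalues that (45) implies, namely $2q'(s)$ with multiplicity $d-1$ and $2q'(s)+4sq''(s)$, where $s=\|z\|_{2}^{2}$. The profile $q(s)=e^{-s}$, the test point `z`, and all function names are illustrative assumptions, not part of the paper.

```python
import numpy as np

# Illustrative radial profile (an assumption): q(s) = exp(-s)
q = lambda s: np.exp(-s)
dq = lambda s: -np.exp(-s)
d2q = lambda s: np.exp(-s)

def psi(z):
    """Radial function psi(z) = q(||z||^2)."""
    return q(z @ z)

def radial_hessian(z):
    """Closed-form Hessian (45) of psi(z) = q(||z||^2)."""
    s = z @ z
    return 2.0 * dq(s) * np.eye(z.size) + 4.0 * d2q(s) * np.outer(z, z)

def fd_hessian(f, z, eps=1e-4):
    """Central finite-difference Hessian, used only as an independent check."""
    d = z.size
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei, ej = np.eye(d)[i] * eps, np.eye(d)[j] * eps
            H[i, j] = (f(z + ei + ej) - f(z + ei - ej)
                       - f(z - ei + ej) + f(z - ei - ej)) / (4 * eps**2)
    return H

z = np.array([0.6, -0.8, 0.5])
s = z @ z
H = radial_hessian(z)
eigs = np.linalg.eigvalsh(H)  # ascending order
# Eigenvalues implied by (45): 2q'(s) (d-1 times) and 2q'(s) + 4 s q''(s)
predicted = np.sort([2 * dq(s)] * (z.size - 1) + [2 * dq(s) + 4 * s * d2q(s)])
fd_err = np.max(np.abs(H - fd_hessian(psi, z)))
```

Here $q'<0$, so the smallest eigenvalue is negative and this particular $\psi$ is not convex, which is why a DC decomposition of $q$ is needed for the Gaussian kernel.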
Now, building on the previous lemma, we deduce a sufficient condition under which $\psi$ is convex.
###### Lemma D.2 (Sufficient Condition for Convexity).
Assume $q\in C^{2}([0,S_{*}])$ is such that $q'(s)\geq 0$ and $q''(s)\geq 0$ for all $s\in[0,S_{*}]$. Then $z\mapsto\psi(z)=q(\|z\|_{2}^{2})$ is convex on $\Omega-\Omega$, and $\overline{\Lambda}[q]\geq\underline{\lambda}[q]\geq 0$.
###### Proof of Lemma [D.2](https://arxiv.org/html/2606.27767#Thmlemma2).
We always have $\overline{\Lambda}[q]\geq\underline{\lambda}[q]$, and under these conditions both $2q'(s)$ and $2q'(s)+4sq''(s)$ are nonnegative on $[0,S_{*}]$, so that $\underline{\lambda}[q]\geq 0$. ∎
This lemma provides sufficient conditions to get a DC decomposition of the form ([41](https://arxiv.org/html/2606.27767#A4.E41)) by [Proposition 7](https://arxiv.org/html/2606.27767#Thmproposition7).
###### Corollary D.1 (Strong Convexity of DC decomposition in MMD).
Let $k:\mathbb{R}^{d}\times\mathbb{R}^{d}\to\mathbb{R}$ be a radial kernel, *i.e.* such that $k(x,y)=q(\|x-y\|_{2}^{2})$, where $q$ admits a decomposition $q=q_{+}-q_{-}$, with $q_{+}$ and $q_{-}$ satisfying the conditions of [Lemma D.2](https://arxiv.org/html/2606.27767#Thmlemma2). Then, considering $\mathcal{F}^{+},\mathcal{F}^{-}$ as defined in ([41](https://arxiv.org/html/2606.27767#A4.E41)), we have on $\mathcal{P}_{2}(\Omega)$ that
1. $\mathcal{F}^{+}$ is $\alpha^{+}$-strongly totally convex with $\alpha^{+}=\lambda_{-}$;
2. $\mathcal{F}^{-}$ is $\alpha^{-}$-strongly totally convex with $\alpha^{-}=\lambda_{+}$.
###### Proof\.
We apply[Proposition˜7](https://arxiv.org/html/2606.27767#Thmproposition7)\. ∎
Building on these two lemmas, we show that several usual kernels admit such decompositions. We begin by discussing several DC decompositions for the Gaussian kernel that we can use for MMD. The first one is based on a remark from [Luu et al., [2024](https://arxiv.org/html/2606.27767#bib.bib2)], and the other ones are based on observing that it is a radial kernel with $q(t)=e^{-t/(2h)}$. Hence, we simply need to find a DC decomposition $q=q_{+}-q_{-}$ which satisfies the assumptions in [Lemma D.2](https://arxiv.org/html/2606.27767#Thmlemma2).
##### Gaussian kernel\.
Recall that the Gaussian kernel is a radial kernel with $q(t)=e^{-\alpha t}$, $\alpha\geq 0$ (often taken as $\alpha=1/(2h)$, $h$ being the bandwidth).
Luu et al. [[2024](https://arxiv.org/html/2606.27767#bib.bib2)] observed in their Appendix A.2 that for $k$ differentiable and $L$-smooth, *i.e.* satisfying $\|\nabla k(x,y)-\nabla k(x',y')\|_{2}^{2}\leq L\big(\|x-x'\|_{2}^{2}+\|y-y'\|_{2}^{2}\big)$ for all $x,x',y,y'\in\mathbb{R}^{d}$, a DC decomposition of ([28](https://arxiv.org/html/2606.27767#S4.E28)) is given by

$$\mathcal{F}^{+}(\mu)=\alpha\int\|\cdot\|_{2}^{2}\ \mathrm{d}\mu+\frac{1}{2}\iint k(x,y)\ \mathrm{d}\mu(x)\mathrm{d}\mu(y)+c(\nu),\quad\mathcal{F}^{-}(\mu)=\int\big(\alpha\|x\|_{2}^{2}-\mathrm{V}(x)\big)\ \mathrm{d}\mu(x) \tag{47}$$

for any $\alpha\geq L$. In fact, the next example specifies the best choice of $\alpha$ for the Gaussian kernel.
###### Example D\.1\.
For the Gaussian kernel $k(x,y)=e^{-\|x-y\|_{2}^{2}/(2h)}$, let us find the best constant $L$. We have $\nabla_{x}k(x,y)=-\frac{1}{h}e^{-\|x-y\|_{2}^{2}/(2h)}(x-y)$ and

$$\nabla_{x}^{2}k(x,y)=\frac{1}{h^{2}}e^{-\|x-y\|_{2}^{2}/(2h)}(x-y)(x-y)^{\top}-\frac{1}{h}e^{-\|x-y\|_{2}^{2}/(2h)}I_{d}=\frac{1}{h}e^{-\|x-y\|_{2}^{2}/(2h)}\Big(\frac{1}{h}(x-y)(x-y)^{\top}-I_{d}\Big).$$

Its eigenvalues are $\lambda_{0}=-\frac{1}{h}e^{-\|x-y\|_{2}^{2}/(2h)}$ and $\lambda_{1}=\big(\frac{1}{h}\|x-y\|_{2}^{2}-1\big)\frac{1}{h}e^{-\|x-y\|_{2}^{2}/(2h)}$. Let $t=\frac{1}{h}\|x-y\|_{2}^{2}$, then the operator norm of the Hessian is $\|\nabla_{x}^{2}k(x,y)\|_{\mathrm{op}}=\frac{1}{h}e^{-t/2}\max(1,|t-1|)$. The maximum in $t$ is attained at $t=0$, thus $\|\nabla_{x}^{2}k(x,y)\|_{\mathrm{op}}\leq\frac{1}{h}$. Thus, we can use $\alpha=\frac{1}{h}$ in ([47](https://arxiv.org/html/2606.27767#A4.E47)).
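The operator-norm formula and the location of its maximum can be checked numerically. The sketch below is an illustration only: the bandwidth `h`, the displacement `diff`, the grid, and the function names are assumptions chosen for the check.

```python
import numpy as np

def gaussian_hessian_x(diff, h):
    """Hessian in x of k(x, y) = exp(-||x - y||^2 / (2h)), with diff = x - y."""
    g = np.exp(-(diff @ diff) / (2.0 * h))
    return (g / h) * (np.outer(diff, diff) / h - np.eye(diff.size))

def opnorm_formula(t, h):
    """(1/h) * exp(-t/2) * max(1, |t - 1|), with t = ||x - y||^2 / h."""
    return np.exp(-t / 2.0) * np.maximum(1.0, np.abs(t - 1.0)) / h

h = 0.5                                   # illustrative bandwidth (assumption)
diff = np.array([0.3, -1.1, 0.7])         # illustrative displacement x - y
t0 = (diff @ diff) / h

# Spectral norm of the explicit Hessian versus the closed-form expression
spectral = np.linalg.norm(gaussian_hessian_x(diff, h), 2)

# Locate the maximiser of the closed-form expression on a grid of t >= 0
t_grid = np.linspace(0.0, 50.0, 500001)
values = opnorm_formula(t_grid, h)
t_star = t_grid[np.argmax(values)]
```

On this grid the maximum is reached at `t_star = 0` with value `1/h`, in line with the bound used in the example.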
Figure D.4: Plot of two possible DC decompositions of $t\mapsto q(t^{2})=e^{-t^{2}}$ that we use to get the DC decomposition of MMD with Gaussian kernel. (a) Left: the $\cosh/\sinh$ decomposition, $q_{+}(t)=\cosh(t)$, $q_{-}(t)=\sinh(t)$. (b) Right: the decomposition based on the sign of the Hessian, $q_{+}(t)=e^{-\alpha t}+\alpha t$, $q_{-}(t)=\alpha t$.

Another natural DC decomposition is based on the observation that $e^{-t}=\cosh(t)-\sinh(t)$. We show in the next lemma that this gives a valid decomposition for [Proposition 7](https://arxiv.org/html/2606.27767#Thmproposition7).
###### Lemma D.3 (DC decomposition of the Gaussian Kernel based on $\cosh/\sinh$).
For $x,y\in\Omega$ and $\alpha>0$, a Gaussian kernel $k(x,y)=\exp(-\alpha\|x-y\|_{2}^{2})$ admits a DC decomposition as follows: $k(x,y)=\psi^{+}(x-y)-\psi^{-}(x-y)$, where $\psi^{\pm}(z)=q_{\pm}(\|z\|_{2}^{2})$ with

$$q_{+}(s)=\cosh(\alpha s),\quad q_{-}(s)=\sinh(\alpha s). \tag{48}$$

For $s\in[0,S_{*}]$, the maximum and minimum eigenvalues of the Hessian satisfy

$$\lambda_{+}=0,\quad\Lambda_{+}=2\alpha\sinh(\alpha S_{*})+4\alpha^{2}S_{*}\cosh(\alpha S_{*}), \tag{49}$$

and

$$\lambda_{-}=2\alpha,\quad\Lambda_{-}=2\alpha\cosh(\alpha S_{*})+4\alpha^{2}S_{*}\sinh(\alpha S_{*}). \tag{50}$$
###### Proof\.
See[Section˜F\.11](https://arxiv.org/html/2606.27767#A6.SS11)\. ∎
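The closed forms (49) and (50) can be compared with a direct grid evaluation of the bounds in (44). This is a rough sketch: the helper `radial_bounds`, the values of `alpha` and `s_max` (standing in for $\alpha$ and $S_{*}$), and the grid resolution are assumptions made for illustration.

```python
import numpy as np

def radial_bounds(dq, d2q, s_max, n_grid=10001):
    """Grid approximation of (44): extremes of 2q'(s) and 2q'(s) + 4 s q''(s)."""
    s = np.linspace(0.0, s_max, n_grid)
    a = 2.0 * dq(s)
    b = a + 4.0 * s * d2q(s)
    return float(np.minimum(a, b).min()), float(np.maximum(a, b).max())

alpha, s_max = 0.7, 2.0   # illustrative values (assumptions)

# q_+(s) = cosh(alpha s) and q_-(s) = sinh(alpha s) recombine into exp(-alpha s)
s = np.linspace(0.0, s_max, 101)
recon_err = np.max(np.abs(np.cosh(alpha * s) - np.sinh(alpha * s)
                          - np.exp(-alpha * s)))

# Bounds for q_+ (derivatives of cosh) and q_- (derivatives of sinh)
lam_p, Lam_p = radial_bounds(lambda s: alpha * np.sinh(alpha * s),
                             lambda s: alpha**2 * np.cosh(alpha * s), s_max)
lam_m, Lam_m = radial_bounds(lambda s: alpha * np.cosh(alpha * s),
                             lambda s: alpha**2 * np.sinh(alpha * s), s_max)

# Closed forms (49) and (50)
Lam_p_closed = (2 * alpha * np.sinh(alpha * s_max)
                + 4 * alpha**2 * s_max * np.cosh(alpha * s_max))
Lam_m_closed = (2 * alpha * np.cosh(alpha * s_max)
                + 4 * alpha**2 * s_max * np.sinh(alpha * s_max))
```

Both upper bounds grow exponentially with `alpha * s_max`, which illustrates why this decomposition is stated on a compact set.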
Applying [Proposition 7](https://arxiv.org/html/2606.27767#Thmproposition7), we can deduce that $\mathcal{F}^{+}$ is $2\alpha$-totally convex with $2\alpha>0$, while $\mathcal{F}^{-}$ is only totally convex. Hence, we can apply [Proposition 5](https://arxiv.org/html/2606.27767#Thmproposition5) and obtain convergence of WCCCP towards a stationary point at a sublinear rate.
Next, we also consider working with $q(t)=e^{-\alpha t}$ directly. Its derivative is $q'(t)=-\alpha e^{-\alpha t}$ and its second derivative is $q''(t)=\alpha^{2}e^{-\alpha t}$. It is convex, but it does not satisfy the assumptions of [Lemma D.2](https://arxiv.org/html/2606.27767#Thmlemma2) as $q'<0$. However, by adding a large enough linear term, we can obtain a DC decomposition of $q$ satisfying the assumptions of [Lemma D.2](https://arxiv.org/html/2606.27767#Thmlemma2). For this, we apply ([31](https://arxiv.org/html/2606.27767#S4.E31)) with $A=\alpha$.
###### Lemma D.4.
Let $q:t\mapsto e^{-\alpha t}$ for $\alpha>0$. Then the DC decomposition obtained by ([31](https://arxiv.org/html/2606.27767#S4.E31)) is $q_{+}(t)=e^{-\alpha t}+\alpha t$ and $q_{-}(t)=\alpha t$ for all $t\in\mathbb{R}$.
###### Proof.
We have for all $t\in\mathbb{R}$, $q(t)=e^{-\alpha t}$, hence $q'(t)=-\alpha e^{-\alpha t}$ and $q''(t)=\alpha^{2}e^{-\alpha t}\geq 0$. Setting $A=\max(0,-q'(0))=\alpha$,

$$q_{-}(t)=\alpha t-\int_{0}^{t}(t-s)\min\big(0,q''(s)\big)\ \mathrm{d}s=\alpha t, \tag{51}$$

and

$$q_{+}(t)=q(t)+q_{-}(t)=e^{-\alpha t}+\alpha t. \tag{52}$$

∎
Note that for this decomposition, the definition of the bounds $\lambda$ does not need to be restricted to a compact set.
###### Proposition D.12 (DC decomposition of the Gaussian Kernel based on ([31](https://arxiv.org/html/2606.27767#S4.E31))).
For $x,y\in\mathbb{R}^{d}$, $\alpha>0$, a Gaussian kernel $k(x,y)=e^{-\alpha\|x-y\|_{2}^{2}}$ has a DC decomposition $k(x,y)=q_{+}(\|x-y\|_{2}^{2})-q_{-}(\|x-y\|_{2}^{2})$ with, for all $s\in[0,+\infty)$,

$$q_{+}(s)=e^{-\alpha s}+\alpha s,\quad q_{-}(s)=\alpha s. \tag{53}$$

The maximum and minimum eigenvalues of the Hessian satisfy

$$\lambda_{+}=0,\quad\Lambda_{+}=2\alpha(1+2e^{-\frac{3}{2}}), \tag{54}$$

and

$$\lambda_{-}=2\alpha,\quad\Lambda_{-}=2\alpha. \tag{55}$$
###### Proof\.
See[Section˜F\.12](https://arxiv.org/html/2606.27767#A6.SS12)\. ∎
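The constants in (54) can be checked numerically. The sketch below is only an illustration and not part of the proof: it assumes that the Hessian of $z\mapsto q_+(\|z\|_2^2)$ has eigenvalues $2q_+'(s)$ and $2q_+'(s)+4sq_+''(s)$ at $s=\|z\|_2^2$, and it scans a grid of values of $s$. The function names, the value of `alpha` and the grid are arbitrary choices made for this example.

```python
import numpy as np

def jordan_gaussian_parts(s, alpha):
    """q_+ and q_- from (53), evaluated at s = ||x - y||^2 >= 0."""
    q_plus = np.exp(-alpha * s) + alpha * s
    q_minus = alpha * s
    return q_plus, q_minus

def hessian_eigs_q_plus(s, alpha):
    """Eigenvalues of the Hessian of z -> q_+(||z||^2) at ||z||^2 = s.

    The Hessian is 2 q'(s) I + 4 q''(s) z z^T, so its eigenvalues are
    2 q'(s) (tangential) and 2 q'(s) + 4 s q''(s) (radial).
    """
    dq = alpha * (1.0 - np.exp(-alpha * s))
    d2q = alpha**2 * np.exp(-alpha * s)
    return 2.0 * dq, 2.0 * dq + 4.0 * s * d2q

alpha = 0.7  # arbitrary illustrative value
s = np.linspace(0.0, 50.0 / alpha, 200_001)
q_plus, q_minus = jordan_gaussian_parts(s, alpha)
tangential, radial = hessian_eigs_q_plus(s, alpha)

# Reconstruction error of q = q_+ - q_- on the grid.
recon_err = np.max(np.abs(q_plus - q_minus - np.exp(-alpha * s)))
# Smallest and largest eigenvalues seen on the grid.
lam_plus = min(tangential.min(), radial.min())
Lam_plus = max(tangential.max(), radial.max())
# Closed-form value of Lambda_+ stated in (54).
Lam_plus_closed_form = 2.0 * alpha * (1.0 + 2.0 * np.exp(-1.5))
```

On such a grid, `lam_plus` is attained at $s=0$ and `Lam_plus` near $\alpha s=3/2$, which is where the factor $e^{-3/2}$ in (54) comes from.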
##### Smoothed Riesz Kernel\.
We now deal with the negative distance \(Riesz\) kernelk\(x,y\)=−‖x−y‖2k\(x,y\)=\-\\\|x\-y\\\|\_\{2\}\. Since it is not differentiable inx=yx=y, we instead focus on a smoothed version of it, defined askε\(x,y\)=−ε\+‖x−y‖22k\_\{\\varepsilon\}\(x,y\)=\-\\sqrt\{\\varepsilon\+\\\|x\-y\\\|\_\{2\}^\{2\}\}\. In this case, we haveq\(t\)=−t\+εq\(t\)=\-\\sqrt\{t\+\\varepsilon\}for allt∈ℝt\\in\\mathbb\{R\}\.
###### Lemma D\.5\(DC decomposition of the smoothed Riesz Kernel\)\.
Forε\>0\\varepsilon\>0, definekε\(x,y\)=−ε\+‖x−y‖22k\_\{\\varepsilon\}\(x,y\)=\-\\sqrt\{\\varepsilon\+\\\|x\-y\\\|\_\{2\}^\{2\}\}\. The smoothed Riesz kernel has a DC decomposition as follows:k\(x−y\)=ψ\+\(x−y\)−ψ−\(x−y\)k\(x\-y\)=\\psi^\{\+\}\(x\-y\)\-\\psi^\{\-\}\(x\-y\), whereψ±\(z\)=q±\(‖z‖22\)\\psi^\{\\pm\}\(z\)=q\_\{\\pm\}\(\\\|z\\\|\_\{2\}^\{2\}\)with
q\+\(s\)=0,q−\(s\)=ε\+s\.q\_\{\+\}\(s\)=0,\\quad q\_\{\-\}\(s\)=\\sqrt\{\\varepsilon\+s\}\.\(56\)Fors∈\[0,S∗\]s\\in\[0,S\_\{\*\}\]the maximum and minimum eigenvalues satisfy
λ\+=0,Λ\+=0,λ−=ε\(ε\+S∗\)3/2,Λ−=1ε\.\\lambda\_\{\+\}=0,\\quad\\Lambda\_\{\+\}=0,\\quad\\lambda\_\{\-\}=\\frac\{\\varepsilon\}\{\(\\varepsilon\+S\_\{\*\}\)^\{3/2\}\},\\quad\\Lambda\_\{\-\}=\\frac\{1\}\{\\sqrt\{\\varepsilon\}\}\.\(57\)
###### Proof\.
See[Section˜F\.13](https://arxiv.org/html/2606.27767#A6.SS13)\. ∎
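As a sanity check of (57), the following sketch samples points $z$ with $\|z\|_2^2\in[0,S_*]$ and compares the eigenvalues of the Hessian of $\psi^-(z)=\sqrt{\varepsilon+\|z\|_2^2}$ with the stated bounds. It uses the closed-form Hessian $\frac{1}{\sqrt{\varepsilon+\|z\|^2}}I-\frac{zz^\top}{(\varepsilon+\|z\|^2)^{3/2}}$; the values of `eps`, `S_star`, the dimension and the sample size are assumptions chosen for illustration only.

```python
import numpy as np

def psi_minus_hessian(z, eps):
    """Closed-form Hessian of psi^-(z) = sqrt(eps + ||z||^2)."""
    r2 = float(z @ z)
    root = np.sqrt(eps + r2)
    return np.eye(z.size) / root - np.outer(z, z) / root**3

eps, S_star = 0.1, 4.0   # illustrative values; S_star bounds ||z||^2
rng = np.random.default_rng(0)

eig_min, eig_max = np.inf, -np.inf
for _ in range(2000):
    z = rng.normal(size=3)
    # Rescale z so that ||z||^2 is uniform in [0, S_star].
    z *= np.sqrt(rng.uniform(0.0, S_star)) / np.linalg.norm(z)
    eigs = np.linalg.eigvalsh(psi_minus_hessian(z, eps))  # ascending order
    eig_min, eig_max = min(eig_min, eigs[0]), max(eig_max, eigs[-1])

lam_minus = eps / (eps + S_star) ** 1.5   # lower bound stated in (57)
Lam_minus = 1.0 / np.sqrt(eps)            # upper bound stated in (57)
```

The radial eigenvalue $\varepsilon/(\varepsilon+s)^{3/2}$ is the smallest one and decreases with $s$, while the tangential eigenvalue $1/\sqrt{\varepsilon+s}$ is largest at $s=0$, which matches the two bounds.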
##### Rational Quadratic Kernel\.
We now turn to the rational quadratic kernelk\(x,y\)=ψ\(x−y\),k\(x,y\)=\\psi\(x\-y\),whereψ\(z\)=1\(c2\+‖z‖22\)α,α≥1\\psi\(z\)=\\frac\{1\}\{\(c^\{2\}\+\\\|z\\\|\_\{2\}^\{2\}\)^\{\\alpha\}\},\\alpha\\geq 1\.
###### Lemma D\.6\(Rational Quadratic Kernel DC decomposition based on \([31](https://arxiv.org/html/2606.27767#S4.E31)\)\)\.
Forx,y∈ℝdx,y\\in\\mathbb\{R\}^\{d\},α≥1\\alpha\\geq 1, a rational quadratic kernel has a DC decompositionk\(x,y\)=q\+\(‖x−y‖22\)−q−\(‖x−y‖22\)k\(x,y\)=q\_\{\+\}\(\\\|x\-y\\\|\_\{2\}^\{2\}\)\-q\_\{\-\}\(\\\|x\-y\\\|\_\{2\}^\{2\}\)with, for alls∈\[0,\+∞\)s\\in\[0,\+\\infty\),
q\+\(s\)=1\(c2\+s\)α\+αc−2\(α\+1\)s,q−\(s\)=αc−2\(α\+1\)s\.q\_\{\+\}\(s\)=\\frac\{1\}\{\(c^\{2\}\+s\)^\{\\alpha\}\}\+\\alpha c^\{\-2\(\\alpha\+1\)\}s,\\quad\\quad q\_\{\-\}\(s\)=\\alpha c^\{\-2\(\\alpha\+1\)\}s\.\(58\)The maximum and minimum eigenvalues of the Hessian satisfy:
λ\+=0,Λ\+=f\(s∗\),λ−=2αc−2\(α\+1\),Λ−=2αc−2\(α\+1\),\\lambda\_\{\+\}=0,\\quad\\Lambda\_\{\+\}=f\(s^\{\*\}\),\\quad\\lambda\_\{\-\}=2\\alpha c^\{\-2\(\\alpha\+1\)\},\\quad\\Lambda\_\{\-\}=2\\alpha c^\{\-2\(\\alpha\+1\)\},\(59\)wheref\(s\)=2q\+′\(s\)\+4sq\+′′\(s\)f\(s\)=2q^\{\\prime\}\_\{\+\}\(s\)\+4sq^\{\\prime\\prime\}\_\{\+\}\(s\)ands∗=6c24\(α\+2\)−6s^\{\*\}=\\frac\{6c^\{2\}\}\{4\(\\alpha\+2\)\-6\}\.
###### Proof\.
See[Section˜F\.14](https://arxiv.org/html/2606.27767#A6.SS14)\. ∎
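The location $s^*$ in (59) can be verified on a grid. The sketch below evaluates $f(s)=2q_+'(s)+4sq_+''(s)$ for the $q_+$ of (58), using derivatives computed by hand, and compares the grid maximizer with the closed-form $s^*$. The values of `alpha` and `c`, the grid and the function name are illustrative assumptions.

```python
import numpy as np

def radial_eig_q_plus(s, alpha, c):
    """f(s) = 2 q_+'(s) + 4 s q_+''(s) for the q_+ of (58)."""
    dq = alpha * c ** (-2.0 * (alpha + 1.0)) - alpha * (c**2 + s) ** (-alpha - 1.0)
    d2q = alpha * (alpha + 1.0) * (c**2 + s) ** (-alpha - 2.0)
    return 2.0 * dq + 4.0 * s * d2q

alpha, c = 2.0, 1.5  # illustrative values with alpha >= 1
s_star = 6.0 * c**2 / (4.0 * (alpha + 2.0) - 6.0)  # closed form from (59)

s_grid = np.linspace(0.0, 20.0 * s_star, 400_001)
f_grid = radial_eig_q_plus(s_grid, alpha, c)
s_argmax = s_grid[np.argmax(f_grid)]   # grid maximizer of f
f_min = f_grid.min()                   # attained at s = 0, where f vanishes
```

Differentiating $f$ gives $f'(s)\propto 6(c^2+s)-4(\alpha+2)s$, whose root is exactly the $s^*$ of (59); the grid search is only a cross-check of that computation.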
#### D\.2Lipschitz Continuity of∇W2ℱ\+\\nabla\_\{\\mathrm\{W\}\_\{2\}\}\\mathcal\{F\}^\{\+\}and Stationarity
We now show that for a DC decomposition based on radial kernels as in [Lemma D.2](https://arxiv.org/html/2606.27767#Thmlemma2), the Wasserstein gradient $\nabla_{\mathrm{W}_2}\mathcal{F}^{+}$ satisfies a Lipschitz condition, which corresponds to the assumption in [Proposition 5](https://arxiv.org/html/2606.27767#Thmproposition5).
###### Proposition D\.13\(Lipschitz Continuity of∇W2ℱ\+\\nabla\_\{\\mathrm\{W\}\_\{2\}\}\\mathcal\{F\}^\{\+\}\)\.
Assumeλ±≥0\\lambda\_\{\\pm\}\\geq 0andΛ±<∞\\Lambda\_\{\\pm\}<\\infty\. Letσ,μ∈𝒫2\(Ω\)\\sigma,\\mu\\in\\mathcal\{P\}\_\{2\}\(\\Omega\)be two a\.c measures, considerT\\mathrm\{T\}such thatσ=T\#μ\\sigma=\\mathrm\{T\}\_\{\\\#\}\\mu, then we have
‖∇W2ℱ\+\(σ\)∘T−∇W2ℱ\+\(μ\)‖L2\(μ\)≤L‖T−Id‖L2\(μ\),\\\|\\nabla\_\{\\mathrm\{W\}\_\{2\}\}\\mathcal\{F\}^\{\+\}\(\\sigma\)\\circ\\mathrm\{T\}\-\\nabla\_\{\\mathrm\{W\}\_\{2\}\}\\mathcal\{F\}^\{\+\}\(\\mu\)\\\|\_\{L^\{2\}\(\\mu\)\}\\leq L\\\|\\mathrm\{T\}\-\\mathrm\{Id\}\\\|\_\{L^\{2\}\(\\mu\)\},\(60\)withL=2Λ\+\+Λ−L=\\sqrt\{2\}\\Lambda\_\{\+\}\+\\Lambda\_\{\-\}\.
###### Proof\.
See[Section˜F\.15](https://arxiv.org/html/2606.27767#A6.SS15)\. ∎
Using this proposition, we can apply [Proposition 5](https://arxiv.org/html/2606.27767#Thmproposition5) to obtain a sublinear rate on the minimum of the squared norm of the Wasserstein gradient along the iterates of the scheme.
###### Theorem D\.1\(WCCCP with MMD leads to a Stationary Measure\)\.
Letν∈𝒫2\(Ω\),\\nu\\in\\mathcal\{P\}\_\{2\}\(\\Omega\),whereΩ\\Omegais a compact and convex nonempty set inℝd\\mathbb\{R\}^\{d\}\. Consider\(μk\)k≥0\(\\mu\_\{k\}\)\_\{k\\geq 0\}the WCCCP iterates \([11](https://arxiv.org/html/2606.27767#S3.E11)\) for the MMD functionalℱ\\mathcal\{F\}with a radial translation\-invariant kernel that admits a DC decomposition:k\(x,y\)=ψ\+\(x−y\)−ψ−\(x−y\)k\(x,y\)=\\psi^\{\+\}\(x\-y\)\-\\psi^\{\-\}\(x\-y\)\. Assumeλ\+,λ−≥0\\lambda\_\{\+\},\\lambda\_\{\-\}\\geq 0,λ\+\+λ−\>0\\lambda\_\{\+\}\+\\lambda\_\{\-\}\>0and0<Λ±<∞0<\\Lambda\_\{\\pm\}<\\infty, and WCCCP iterates belong to𝒫2\(Ω\)\\mathcal\{P\}\_\{2\}\(\\Omega\), with support inΩ\\Omega, then these iterates satisfy, for allK≥1K\\geq 1,
min0≤k≤K−1‖∇W2ℱ\(μk\)‖L2\(μk\)2≤2\(2Λ\+\+Λ−\)2λ\+\+λ−\(ℱ\(μ0\)−ℱ\(μK\)\)K\.\\min\_\{0\\leq k\\leq K\-1\}\\ \\\|\\nabla\_\{\\mathrm\{W\}\_\{2\}\}\\mathcal\{F\}\(\\mu\_\{k\}\)\\\|^\{2\}\_\{L^\{2\}\(\\mu\_\{k\}\)\}\\leq\\frac\{2\(\\sqrt\{2\}\\Lambda\_\{\+\}\+\\Lambda\_\{\-\}\)^\{2\}\}\{\\lambda\_\{\+\}\+\\lambda\_\{\-\}\}\\frac\{\\big\(\\mathcal\{F\}\(\\mu\_\{0\}\)\-\\mathcal\{F\}\(\\mu\_\{K\}\)\\big\)\}\{K\}\.\(61\)
###### Proof\.
See[Section˜F\.16](https://arxiv.org/html/2606.27767#A6.SS16)\. ∎
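To get a feeling for the constant in (61), the right-hand side can be evaluated for given eigenvalue bounds. The helper below is a hypothetical convenience function, not something defined in the paper; the example plugs in the constants of (54) and (55), under the assumption that they are the $\lambda_\pm,\Lambda_\pm$ meant in the theorem, with $\alpha=0.05$ and an arbitrary value of the gap $\mathcal{F}(\mu_0)-\mathcal{F}(\mu_K)$.

```python
import math

def wcccp_stationarity_bound(lam_plus, Lam_plus, lam_minus, Lam_minus, gap, K):
    """Right-hand side of (61) after K outer iterations.

    `gap` stands for F(mu_0) - F(mu_K). Since F = MMD^2 / 2 >= 0,
    F(mu_0) is a valid (cruder) upper bound on it.
    """
    if lam_plus + lam_minus <= 0:
        raise ValueError("Theorem D.1 needs lam_plus + lam_minus > 0")
    L = math.sqrt(2.0) * Lam_plus + Lam_minus
    return 2.0 * L**2 / (lam_plus + lam_minus) * gap / K

# Constants of the Jordan decomposition of the Gaussian kernel, (54)-(55).
alpha = 0.05  # illustrative value
bound = wcccp_stationarity_bound(
    lam_plus=0.0,
    Lam_plus=2.0 * alpha * (1.0 + 2.0 * math.exp(-1.5)),
    lam_minus=2.0 * alpha,
    Lam_minus=2.0 * alpha,
    gap=1.0,   # arbitrary value of F(mu_0) - F(mu_K)
    K=1000,
)
```

The bound scales as $1/K$, so doubling the number of outer iterations halves the guaranteed value of the smallest squared gradient norm.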
#### D\.3Critical points and Local Convergence
In the following we analyze the conditions under which we obtain local convergence of WCCCP for the MMD functional with base spaceΩ\\Omega\. All the statements are in the weak topology and assume a continuous kernelkk\.
First, we consider a condition on the initialization *w.r.t.* the critical value gap as follows:
###### Assumption 1\.
The setΩ\\Omegais convex and compact and the following holds true:
1. *Strong Convexity & Lipschitz Continuity.* $\mathcal{F}^{+}$ and $\mathcal{F}^{-}$ are totally strongly convex, $\nabla_{\mathrm{W}_2}\mathcal{F}^{+}$ is Lipschitz continuous, and $\mu\mapsto\|\nabla_{\mathrm{W}_2}\mathcal{F}(\mu)\|_{L^2(\mu)}$ is lower semicontinuous;
2. *Critical-value gap.* Setting $c_*\coloneqq\inf\{\mathcal{F}(\mu):\|\nabla_{\mathrm{W}_2}\mathcal{F}(\mu)\|_{L^2(\mu)}=0,\ \mu\neq\nu\}$, the initialization $\mu_0$ satisfies $\mathcal{F}(\mu_0)<c_*$.
Under Assumption[1](https://arxiv.org/html/2606.27767#Thmassumption1), we have the following local convergence result, interpreted as: if one starts with a near optimal initialization, by descent, there must be convergence\. The conditionℱ\(μ0\)<c∗\\mathcal\{F\}\(\\mu\_\{0\}\)<c\_\{\*\}entails thatν\\nuis an isolated critical point\.
###### Proposition D\.14\(Local Convergence of CCP for MMD under Critical Value Gap Assumption\)\.
Under Assumption[1](https://arxiv.org/html/2606.27767#Thmassumption1), the sequence produced by WCCCP converges toν\\nu, i\.e\.,μk→ν\\mu\_\{k\}\\to\\nu\.
###### Proof\.
Let $(\mu_k)_k$ be the sequence produced by WCCCP. The compactness of $\Omega$ implies the compactness of $\mathcal{P}_2(\Omega)$, hence we can extract a subsequence $\mu_{k_j}\to\mu^*$ in $\mathrm{W}_2$. Using Theorem [D.1](https://arxiv.org/html/2606.27767#Thmtheorem1) (satisfied under the first point in Assumption [1](https://arxiv.org/html/2606.27767#Thmassumption1)), $\|\nabla_{\mathrm{W}_2}\mathcal{F}(\mu_{k_j})\|_{L^2(\mu_{k_j})}\to 0$. By lower semicontinuity of $\mu\mapsto\|\nabla_{\mathrm{W}_2}\mathcal{F}(\mu)\|_{L^2(\mu)}$, we have $\|\nabla_{\mathrm{W}_2}\mathcal{F}(\mu^*)\|_{L^2(\mu^*)}=0$ and $\mu^*$ is then a critical point. Since under our assumptions WCCCP guarantees descent of values, $\mathcal{F}$ is continuous along the subsequence (the kernel $k$ is continuous and $\Omega$ is compact), and under Assumption [1](https://arxiv.org/html/2606.27767#Thmassumption1) we have $\mathcal{F}(\mu_0)<c_*$, we obtain:
$$\mathcal{F}(\mu^*)\leq\mathcal{F}(\mu_0)<c_*.$$
By the definition of $c_*$, the limit $\mu^*$ must be equal to $\nu$ ($\mu^*=\nu$). Hence every convergent subsequence of $(\mu_k)_k$ has limit $\nu$ and, by compactness of $\mathcal{P}_2(\Omega)$, we conclude that $\mu_k\to\nu$. ∎
The following proposition does not require compactness but assumes local quadratic growth of the MMD and uniqueness of critical points in a local neighborhood ofν\\nu, and it guarantees local convergence:
###### Proposition D\.15\(Local Convergence of CCP under Local Quadratic Growth for the MMD\)\.
Assume there exists $r>0$ such that over the ball
$$B_r(\nu)=\Big\{\mu\,\Big|\,\mathrm{W}_2(\mu,\nu)\leq r\Big\},$$
1. $\mathcal{F}$ has local quadratic growth at $\nu$, i.e., there exists $c>0$ such that for all $\mu\in B_r(\nu)$: $\mathcal{F}(\mu)\geq c\,\mathrm{W}_2^2(\mu,\nu)$;
2. $\nu$ is the unique critical point in $B_r(\nu)$;
3. the initialization $\mu_0$ satisfies: $\mathcal{F}(\mu_0)<cr^2$.
Then under the assumptions of Theorem[D\.1](https://arxiv.org/html/2606.27767#Thmtheorem1)we have local convergence of WCCCP iterates:μk→ν\\mu\_\{k\}\\to\\nu\.
###### Proof\.
Assume local quadratic growth of $\mathcal{F}$ in $B_r(\nu)$, meaning that for $\mu\in B_r(\nu)$:
$$\mathcal{F}(\mu)-\mathcal{F}(\nu)\geq c\,\mathrm{W}_2^2(\mu,\nu),$$
and since $\mathcal{F}(\nu)=0$ we have
$$\mathcal{F}(\mu)\geq c\,\mathrm{W}_2^2(\mu,\nu).$$
On the boundary ($\mathrm{W}_2^2(\mu,\nu)=r^2$) we have therefore
$$\mathcal{F}(\mu)\geq cr^2.$$

Consider an initialization $\mu_0$ such that $\mathcal{F}(\mu_0)<cr^2$. Since WCCCP is descending under our assumptions, we have for every iterate $\mu_k$
$$\mathcal{F}(\mu_k)\leq\mathcal{F}(\mu_0)<cr^2,$$
and hence by local quadratic growth
$$c\,\mathrm{W}_2^2(\mu_k,\nu)\leq\mathcal{F}(\mu_k)\leq\mathcal{F}(\mu_0)<cr^2.$$
From this we conclude that
$$\mathrm{W}_2^2(\mu_k,\nu)<r^2,$$
and all iterates remain in the ball of radius $r$ around $\nu$ without touching the boundary.

By the compactness of the Wasserstein ball we can extract a subsequence $\mu_{k_j}\to\mu^*$. Under the assumptions of Theorem [D.1](https://arxiv.org/html/2606.27767#Thmtheorem1) we have $\|\nabla_{\mathrm{W}_2}\mathcal{F}(\mu_{k_j})\|_{L^2(\mu_{k_j})}\to 0$. Hence, by continuity we have $\|\nabla_{\mathrm{W}_2}\mathcal{F}(\mu^*)\|_{L^2(\mu^*)}=0$ and $\mu^*$ is a critical point. Since $\nu$ is the unique critical point in that neighborhood, $\mu^*=\nu$. ∎
### Appendix ENumerical Applications on MMD




Figure E\.5:Convergence of the Wasserstein Gradient Descent \(WGD\), Forward\-Backward \(FB\) and Wasserstein Convex\-Concave Procedure \(WCCCP\) on the squared MMD with kernelk\(x,y\)=−‖x−y‖2k\(x,y\)=\-\\\|x\-y\\\|\_\{2\}\(Left\) and particles of WCCCP along the scheme \(Right\)\.

Figure E.6: Optimization of $\mathcal{F}(\mu)=\frac{1}{2}\mathrm{ED}^2(\mu,\nu)$ for $\nu$ a Gaussian target (Top) and a Gaussian mixture (Bottom). (Left) Evolution of the squared Energy distance along the flow. (Right) Trajectories of the particles over time. The initial particles are in blue and the final particles in red.

We now detail the experiments of [Section 4](https://arxiv.org/html/2606.27767#S4), as well as extra experiments. We first detail the experiments on the Energy Distance, and then focus on MMD with the Gaussian kernel. All the numerical experiments are run on an NVIDIA V100 GPU.
#### E\.1Energy Distance
We recall that the Energy distance\[Sejdinovicet al\.,[2013](https://arxiv.org/html/2606.27767#bib.bib28)\]is of the form, forμ,ν∈𝒫2\(ℝd\)\\mu,\\nu\\in\\mathcal\{P\}\_\{2\}\(\\mathbb\{R\}^\{d\}\),
ED\(μ,ν\)=−∬‖x−y‖2d\(μ−ν\)\(x\)d\(μ−ν\)\(y\),\\mathrm\{ED\}\(\\mu,\\nu\)=\-\\iint\\\|x\-y\\\|\_\{2\}\\ \\mathrm\{d\}\(\\mu\-\\nu\)\(x\)\\mathrm\{d\}\(\\mu\-\\nu\)\(y\),\(62\)which by[Proposition˜7](https://arxiv.org/html/2606.27767#Thmproposition7), for a fixed targetν\\nu, can be decomposed as12ED\(μ,ν\)=ℱ\+\(μ\)−ℱ−\(μ\)\\tfrac\{1\}\{2\}\\mathrm\{ED\}\(\\mu,\\nu\)=\\mathcal\{F\}^\{\+\}\(\\mu\)\-\\mathcal\{F\}^\{\-\}\(\\mu\)where
$$\mathcal{F}^{+}(\mu)=\iint\|x-y\|_2\ \mathrm{d}\nu(y)\mathrm{d}\mu(x)+c(\nu),\quad\mathcal{F}^{-}(\mu)=\frac{1}{2}\iint\|x-y\|_2\ \mathrm{d}\mu(x)\mathrm{d}\mu(y),\tag{63}$$
with $c(\nu)=-\frac{1}{2}\iint\|x-y\|_2\ \mathrm{d}\nu(x)\mathrm{d}\nu(y)$. The functional $\mathcal{F}^{+}$ is a potential energy with $V(x)=\int\|x-y\|_2\ \mathrm{d}\nu(y)$ and $\mathcal{F}^{-}$ an interaction energy, and both are totally convex. If we use the smoothed Riesz kernel $k(x,y)=-\sqrt{\varepsilon+\|x-y\|_2^2}$ and assume that $\mu,\nu\in\mathcal{P}_2(\Omega)$ for $\Omega$ a compact convex set, then by [Lemma D.5](https://arxiv.org/html/2606.27767#Thmlemma5) and [Corollary D.1](https://arxiv.org/html/2606.27767#Thmcorollary1), $\mathcal{F}^{+}$ is also $\alpha$-totally convex with $\alpha=\frac{\varepsilon}{(\varepsilon+S_*)^{\frac{3}{2}}}$ and $S_*=\sup_{x,y\in\Omega}\ \|x-y\|_2$. In our experiments, we mostly use the non-smooth Riesz kernel $k(x,y)=-\|x-y\|_2$, as it works well in practice \[Hertrich et al., [2024b](https://arxiv.org/html/2606.27767#bib.bib27)\]. Nonetheless, some smoothed versions based on convolutions have also been shown to have more favorable theoretical properties \[Rux et al., [2026](https://arxiv.org/html/2606.27767#bib.bib98)\].
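For empirical measures, the decomposition (63) can be checked directly on particles. The sketch below is an illustration under stated assumptions: it uses plug-in (V-statistic) estimates that include the diagonal terms, and the sample sizes, the shift of the target and the function names are hypothetical choices.

```python
import numpy as np

def mean_pairwise_distance(a, b):
    """Average of ||a_i - b_j||_2 over all pairs (diagonal included)."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff**2).sum(axis=-1)).mean()

def energy_distance(x, z):
    """Plug-in estimate of (62) for empirical measures on particles x and z."""
    return (2.0 * mean_pairwise_distance(x, z)
            - mean_pairwise_distance(x, x)
            - mean_pairwise_distance(z, z))

def dc_parts(x, z):
    """F^+ (potential energy plus c(nu)) and F^- (interaction energy) from (63)."""
    c_nu = -0.5 * mean_pairwise_distance(z, z)
    f_plus = mean_pairwise_distance(x, z) + c_nu
    f_minus = 0.5 * mean_pairwise_distance(x, x)
    return f_plus, f_minus

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))          # particles of mu
z = rng.normal(size=(200, 2)) + 3.0    # particles of nu, shifted for illustration
f_plus, f_minus = dc_parts(x, z)
half_ed = 0.5 * energy_distance(x, z)  # should match f_plus - f_minus
```

Expanding the double integral in (62) against $\mu-\nu$ gives the cross term twice and the two self-interaction terms once each, which is why `half_ed` and `f_plus - f_minus` agree up to rounding.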
##### Experiments on shapes\.
For the experiment of [Figure 2](https://arxiv.org/html/2606.27767#S4.F2), we compare the Wasserstein gradient descent ([5](https://arxiv.org/html/2606.27767#S2.E5)), the Forward-Backward scheme ([25](https://arxiv.org/html/2606.27767#S3.E25)) from \[Luu et al., [2024](https://arxiv.org/html/2606.27767#bib.bib2)\], and our scheme WCCCP ([11](https://arxiv.org/html/2606.27767#S3.E11)). We use a spatial discretization as described in [Section 3.3](https://arxiv.org/html/2606.27767#S3.SS3), *i.e.*, we first sample $n=500$ particles from $\mu_0=\mathcal{N}(0,I_2)$, and $n$ independent uniform particles from the target shape to obtain the target distribution $\nu_n$. Then, at each iteration, we compute the map $\mathrm{T}_{k+1}$ and apply it to move each of the particles. For WGD and FB, we use the step size $\tau=1$. For WCCCP and FB, we solve each inner optimization problem by gradient descent with momentum $m=0.9$, for $M=50$ iterations with step size $0.1$. We report in [Figure 2](https://arxiv.org/html/2606.27767#S4.F2) the values of the objective $\mathrm{ED}^2(\mu_k,\nu_n)$ for the cat and spiral shapes as a function of the number of outer iterations (hence WGD is much faster per iteration, but it still converges to a local minimum and does not improve further). We performed the experiment 100 times, and report the average values with the standard deviation to get confidence intervals. We add in [Figure E.5](https://arxiv.org/html/2606.27767#A5.F5) the results for the heart and disk shapes, as well as the particles of WCCCP at iterations 0, 10 and 500.
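The particle scheme described above can be sketched as follows. This is not the authors' implementation: it assumes that the WCCCP step ([11](https://arxiv.org/html/2606.27767#S3.E11)) minimizes $\mathcal{F}^{+}(\mathrm{T}_\#\mu_k)-\langle\nabla_{\mathrm{W}_2}\mathcal{F}^{-}(\mu_k),\mathrm{T}-\mathrm{Id}\rangle_{L^2(\mu_k)}$ over the particle positions, it uses a small smoothing constant `EPS` inside the square root to avoid dividing by zero, and it replaces the shape targets by Gaussian samples since no shape data is available here. All function names, the number of particles and the number of outer iterations are hypothetical.

```python
import numpy as np

EPS = 1e-3  # smoothing inside the square root, avoids division by zero

def mean_smoothed_directions(a, b):
    """For each a_i, the mean over j of (a_i - b_j) / sqrt(EPS + ||a_i - b_j||^2)."""
    diff = a[:, None, :] - b[None, :, :]
    norm = np.sqrt(EPS + (diff**2).sum(axis=-1, keepdims=True))
    return (diff / norm).mean(axis=1)

def energy_distance(x, z):
    """Plug-in estimate of the energy distance between two particle clouds."""
    def mean_dist(a, b):
        diff = a[:, None, :] - b[None, :, :]
        return np.sqrt((diff**2).sum(axis=-1)).mean()
    return 2.0 * mean_dist(x, z) - mean_dist(x, x) - mean_dist(z, z)

def wcccp_step(x, z, inner_iters=50, step=0.1, momentum=0.9):
    """One outer WCCCP iteration on particles.

    The gradient of F^- is frozen at the current particles x. The convex
    surrogate y -> F^+(y) - <grad F^-(x), y - x> is then minimized by
    gradient descent with heavy-ball momentum, started at y = x.
    Per-particle gradients are used (the common 1/n factor is dropped).
    """
    grad_minus = mean_smoothed_directions(x, x)   # frozen concave part
    y, velocity = x.copy(), np.zeros_like(x)
    for _ in range(inner_iters):
        grad = mean_smoothed_directions(y, z) - grad_minus
        velocity = momentum * velocity - step * grad
        y = y + velocity
    return y

rng = np.random.default_rng(0)
n = 100
x0 = rng.normal(size=(n, 2)) + 5.0   # source particles, shifted for illustration
z = rng.normal(size=(n, 2))          # stand-in target samples (not a shape)

x = x0.copy()
for _ in range(5):                   # a few outer iterations only
    x = wcccp_step(x, z)

ed_initial, ed_final = energy_distance(x0, z), energy_distance(x, z)
```

The inner loop uses the values $M=50$, step size $0.1$ and momentum $0.9$ quoted above as defaults; WGD would instead take a single explicit step along the full gradient at each outer iteration.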
In [Figure E.6](https://arxiv.org/html/2606.27767#A5.F6), we performed the same experiment with target samples from a Gaussian $\nu=\mathcal{N}(0,\Sigma)$ for $\Sigma=\bigl(\begin{smallmatrix}1&0.5\\ 0.5&1\end{smallmatrix}\bigr)$ (top), and with target samples from a mixture of 3 Gaussians with equal weights, means $m_1=(0,0)$, $m_2=(3,-1)$ and $m_3=(1,4)$, and covariances $\Sigma_1=\bigl(\begin{smallmatrix}1&0.5\\ 0.5&2\end{smallmatrix}\bigr)$, $\Sigma_2=I_2$ and $\Sigma_3=\bigl(\begin{smallmatrix}3&0.5\\ 0.5&1\end{smallmatrix}\bigr)$ (bottom). For both, we use the same initial distribution $\mu_0=\mathcal{N}(5,I_2)$ and the same hyperparameters as in the shape experiment.
We also report in [Figure E.7](https://arxiv.org/html/2606.27767#A5.F7) the convergence for the cat target uniform distribution, with the exact same total number of iterations for WCCCP and WGD, and with different step sizes for WGD. We observe that for $\tau=0.1$, WGD converges better but slower than for $\tau=1$. With the same computational budget for WCCCP with $M=50$, $K=400$, we observe that the algorithm converges in similar computational time.
Figure E.7: Convergence of WGD (with different step sizes and $K$ = 20K) and WCCCP with the same computational budget, *i.e.* $M=50$ and $K=400$ (each iteration thus corresponding to one inner step).



Figure E\.8:Samples along the scheme of WGD \(Top\) and WCCCP \(Bottom\) withℱ\(μ\)=12ED2\(μ,ν\)\\mathcal\{F\}\(\\mu\)=\\frac\{1\}\{2\}\\mathrm\{ED\}^\{2\}\(\\mu,\\nu\)as objective\. On CIFAR10, we plot samples every 2K iterations for WCCCP and every 10K iterations for WGD\. On MNIST, we plot samples every 1K iterations for WCCCP and every 5K iterations for WGD\.

Figure E.9: Evolution of $\mathcal{F}(\mu)=\mathrm{ED}^2(\mu,\nu)$ along the WCCCP and WGD schemes for $\nu$ composed of samples of CIFAR10 (Left) and of MNIST (Right). The results are averaged over 5 different runs with different samples of the source and target distributions.
##### Experiments on images\.
In [Figure 2](https://arxiv.org/html/2606.27767#S4.F2), we performed the same experiment with target samples from the CIFAR10 dataset, whose images are of size $3\times 32\times 32$. More precisely, we sampled 50 points per class, and hence also worked with $n=500$ particles. We started the flows from $\mu_0=\mathcal{N}(0,I_d)$. We compared WGD with step size $\tau=1$ and WCCCP with a gradient descent to solve the inner optimization problem with $\tau=1$ and $20$ iterations. We ran WGD for 200K iterations and WCCCP for 40K iterations, with 20 iterations to solve each of the subproblems ([11](https://arxiv.org/html/2606.27767#S3.E11)). We choose to use a different number of iterations to be fair in comparing the two methods, as WCCCP solves each inner problem in closed-form, and is thus less computationally expensive. With an NVIDIA V100 GPU, WCCCP took about 1h15 while WGD took 1h30. In [Figure 2](https://arxiv.org/html/2606.27767#S4.F2), we plot samples along the scheme of WGD every 20K iterations, and samples along the scheme of WCCCP every 4K iterations. Below, we plot the value of the loss across iterations, where we rescaled the abscissa for WCCCP to match the abscissa of WGD, as done for the image samples. We observe overall a much faster convergence of WCCCP compared to WGD on this challenging high-dimensional dataset. In [Figure E.8](https://arxiv.org/html/2606.27767#A5.F8), we add more samples along the schemes of WGD and WCCCP, every 2K iterations for WCCCP and 10K iterations for WGD.
We also performed the experiment on MNIST and report the results on[Figure˜E\.8](https://arxiv.org/html/2606.27767#A5.F8)\. Here, the samples are reported every 1K iterations for WCCCP and 5K iterations for WGD\. The schemes took about 18 minutes to run for both WGD and WCCCP\.
We show on[Figure˜E\.9](https://arxiv.org/html/2606.27767#A5.F9)the evolution of the Energy distance along the scheme, averaged over 5 runs with different samples of the source and of the target\. We observe that on the MNIST experiment, both schemes converge, but WCCCP converges faster \(even though the iterations are rescaled\)\. On the CIFAR10 experiment, even after 200K iterations, WGD is very far from converging, whereas WCCCP has already converged\. Hence, WCCCP can be promising to accelerate convergence in high dimensions \(hered=3×32×32=3072d=3\\times 32\\times 32=3072\)\.
#### E\.2MMD with Gaussian Kernel
We now focus on MMD with Gaussian kernelk\(x,y\)=e−‖x−y‖22/\(2h\)k\(x,y\)=e^\{\-\\\|x\-y\\\|\_\{2\}^\{2\}/\(2h\)\}for a bandwidthh\>0h\>0\.


Figure E.10: Loss for one run for MMD with Gaussian kernel, and Gaussian target (Left) and Gaussian mixture target (Right).

Figure E.11: Optimization of $\mathcal{F}(\mu)=\frac{1}{2}\mathrm{MMD}_k^2(\mu,\nu)$ for $\nu$ a Gaussian mixture target and $k$ the Gaussian kernel. (Left) Evolution of the squared MMD along the flow. (Right) Trajectories of the particles over time. The initial particles are in blue and the final particles in red.

##### Decomposition based on the radial kernel.
By[Corollary˜D\.1](https://arxiv.org/html/2606.27767#Thmcorollary1), we can have a DC decomposition of the squared MMD if we findq\+,q−q\_\{\+\},q\_\{\-\}satisfyingq\+′,q−′,q\+′′,q−′′≥0q\_\{\+\}^\{\\prime\},q\_\{\-\}^\{\\prime\},q\_\{\+\}^\{\\prime\\prime\},q\_\{\-\}^\{\\prime\\prime\}\\geq 0andq=q\+−q−q=q\_\{\+\}\-q\_\{\-\}\. We discuss here two natural decompositions ofq:t↦e−t/\(2h\)q:t\\mapsto e^\{\-t/\(2h\)\}\.
The first one is based on the remark that $e^{-z}=\cosh(z)-\sinh(z)$ since $\cosh(z)=\big(e^{z}+e^{-z}\big)/2$ and $\sinh(z)=\big(e^{z}-e^{-z}\big)/2$, and hence $q_+(t)=\cosh\big(t/(2h)\big)$, $q_-(t)=\sinh\big(t/(2h)\big)$. This is based on the algebraic decomposition described in [Section 4](https://arxiv.org/html/2606.27767#S4), as $\cosh(t)=\sum_{k\ \mathrm{even}}\frac{t^{k}}{k!}$ and $\sinh(t)=\sum_{k\ \mathrm{odd}}\frac{t^{k}}{k!}$. This decomposition is valid to apply [Proposition 7](https://arxiv.org/html/2606.27767#Thmproposition7), as shown in [Lemma D.3](https://arxiv.org/html/2606.27767#Thmlemma3). We show this decomposition in [Figure D.4a](https://arxiv.org/html/2606.27767#A4.F4.sf1). We note that both $q_+$ and $q_-$ tend to take very large values, which might be prone to numerical instabilities, in particular for small bandwidths.
The second one is based on decomposing the second derivative: $q_+$ is taken as the function whose second derivative is the non-negative part of $q''$, and $q_-$ as the function whose second derivative is the opposite of the non-positive part of $q''$. This can be done using the Jordan decomposition ([31](https://arxiv.org/html/2606.27767#S4.E31)). The exact formula gives $q_+(s)=e^{-\alpha s}+\alpha s$ and $q_-(s)=\alpha s$, here with $\alpha=1/(2h)$. This also satisfies [Proposition 7](https://arxiv.org/html/2606.27767#Thmproposition7), as shown in [Proposition D.12](https://arxiv.org/html/2606.27767#Thmproposition12). We plot in [Figure D.4b](https://arxiv.org/html/2606.27767#A4.F4.sf2) the resulting DC decomposition. We observe here that this decomposition does not blow up, and should be numerically more stable.
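The difference in magnitude between the two decompositions is easy to see in floating point. The sketch below evaluates both pairs $(q_+,q_-)$ for a deliberately small bandwidth; the bandwidth `h`, the squared distances in `t` and the function names are illustrative assumptions, and the script only reports magnitudes, it does not run any optimization.

```python
import numpy as np

def cosh_sinh_parts(t, h):
    """q_+ = cosh(t / 2h), q_- = sinh(t / 2h)."""
    return np.cosh(t / (2.0 * h)), np.sinh(t / (2.0 * h))

def jordan_parts(t, h):
    """q_+ = exp(-alpha t) + alpha t, q_- = alpha t, with alpha = 1 / (2h)."""
    alpha = 1.0 / (2.0 * h)
    return np.exp(-alpha * t) + alpha * t, alpha * t

t = np.array([1.0, 25.0, 100.0, 400.0])   # squared distances ||x - y||^2
h = 0.25                                  # a deliberately small bandwidth

with np.errstate(over="ignore"):          # let cosh/sinh overflow to inf quietly
    cs_plus, cs_minus = cosh_sinh_parts(t, h)
jd_plus, jd_minus = jordan_parts(t, h)

for row in zip(t, cs_plus, jd_plus):
    print("t = %6.1f   cosh part = %.3e   Jordan part = %.3e" % row)
```

In double precision, $\cosh$ overflows once its argument exceeds roughly $710$, whereas the Jordan parts grow only linearly in $t$, which is consistent with the stability remark above.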


Figure E\.12:Evolution ofℱ\(μ\)=12MMDk2\(μ,ν\)\\mathcal\{F\}\(\\mu\)=\\frac\{1\}\{2\}\\mathrm\{MMD\}^\{2\}\_\{k\}\(\\mu,\\nu\)forν\\nua Gaussian target \(Left\) and a Gaussian mixture target \(Right\) over iterations, comparing WGD, WCCCP and FB with thecosh/sinh\\cosh/\\sinhdecomposition and the DC decomposition \([47](https://arxiv.org/html/2606.27767#A4.E47)\)\. We observe that both FB and WCCCP work well with thecosh/sinh\\cosh/\\sinhdecomposition\.
##### Experiments\.
The Wasserstein gradient descent on $\mathcal{F}(\mu)=\frac{1}{2}\mathrm{MMD}_k^2(\mu,\nu)$ with $k(x,y)=e^{-\|x-y\|_2^2/(2h)}$ is known to heavily depend on the choice of the bandwidth \[Arbel et al., [2019](https://arxiv.org/html/2606.27767#bib.bib22), Hertrich et al., [2024b](https://arxiv.org/html/2606.27767#bib.bib27)\], and might not converge well in practice. In particular, on simple examples such as a single Gaussian distribution, Arbel et al. \[[2019](https://arxiv.org/html/2606.27767#bib.bib22)\], Gladin et al. \[[2024](https://arxiv.org/html/2606.27767#bib.bib24)\] observed that some particles get stuck far from the mode, and never converge. In our experiments, we compare several DC decompositions in the WCCCP algorithm and test which of them yield better convergence.
For this, we choose the same setting as \[Gladin et al., [2024](https://arxiv.org/html/2606.27767#bib.bib24)\]. We first take $n=500$ samples from $\mu_0=\mathcal{N}(5,I_2)$ to get the initial distribution. Then, we take $n=500$ samples from the target distribution $\nu$. We experiment with 2 choices for $\nu$. The first one is a Gaussian $\nu=\mathcal{N}(0,\Sigma)$ with
$$\Sigma=\begin{pmatrix}1&0.5\\ 0.5&1\end{pmatrix}.\tag{64}$$
The second is a mixture of 3 Gaussians with uniform weights, means $m_1=(0,0)$, $m_2=(3,-1)$, $m_3=(1,4)$ and covariances
$$\Sigma_1=\begin{pmatrix}1&0.5\\ 0.5&1\end{pmatrix},\quad\Sigma_2=I_2,\quad\Sigma_3=\begin{pmatrix}3&0.5\\ 0.5&1\end{pmatrix}.\tag{65}$$
Moreover, the bandwidth is fixed to $h=10$.
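The sampling step of this setting can be written in a few lines. The sketch below is only an illustration of the setup: the seed, the helper name `sample_gaussian_mixture` and the reading of $\mathcal{N}(5,I_2)$ as a Gaussian with mean vector $(5,5)$ are assumptions, not details confirmed by the text.

```python
import numpy as np

def sample_gaussian_mixture(rng, n, means, covs, weights):
    """Draw n points from a Gaussian mixture by first sampling component labels."""
    labels = rng.choice(len(means), size=n, p=weights)
    samples = np.empty((n, means.shape[1]))
    for k, (mean, cov) in enumerate(zip(means, covs)):
        mask = labels == k
        samples[mask] = rng.multivariate_normal(mean, cov, size=int(mask.sum()))
    return samples

rng = np.random.default_rng(0)   # the seed is an arbitrary choice
n = 500

# Source: N(5, I_2), reading the mean as the vector (5, 5) (an assumption).
source = rng.multivariate_normal(np.full(2, 5.0), np.eye(2), size=n)

# Gaussian target with the covariance of (64).
sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
target_gaussian = rng.multivariate_normal(np.zeros(2), sigma, size=n)

# Mixture target with the means and covariances of (65), uniform weights.
means = np.array([[0.0, 0.0], [3.0, -1.0], [1.0, 4.0]])
covs = np.array([
    [[1.0, 0.5], [0.5, 1.0]],
    [[1.0, 0.0], [0.0, 1.0]],
    [[3.0, 0.5], [0.5, 1.0]],
])
weights = np.full(3, 1.0 / 3.0)
target_mixture = sample_gaussian_mixture(rng, n, means, covs, weights)

h = 10.0  # bandwidth of the Gaussian kernel exp(-||x - y||^2 / (2h))
```

Averaging over several sets of initial and target samples then amounts to repeating this block with different seeds.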
In each experiment, we use for WGD and FB a stepsizeτ=1\\tau=1\. For FB and WCCCP, we optimize the inner problem with a gradient descent with momentumm=0\.9m=0\.9, step sizeτ=5⋅10−4\\tau=5\\cdot 10^\{\-4\}and for 250 iterations\.
In [Figure 3](https://arxiv.org/html/2606.27767#S5.F3), we show the evolution of the loss over 200K iterations of the algorithms, comparing WGD with WCCCP for the DC decomposition ([47](https://arxiv.org/html/2606.27767#A4.E47)), the DC decomposition of the radial kernel based on $\cosh$ and $\sinh$ as presented in [Lemma D.3](https://arxiv.org/html/2606.27767#Thmlemma3), and the one based on the Jordan decomposition presented in [Proposition D.12](https://arxiv.org/html/2606.27767#Thmproposition12). We show the evolution of the objective over the iterations and average the results over 100 different sets of initial and target samples.
We observe that the two radial decompositions perform better than WGD and the DC decomposition ([47](https://arxiv.org/html/2606.27767#A4.E47)). The $\cosh/\sinh$ decomposition seems to converge more consistently than the one based on the Jordan decomposition, even though it takes longer to converge. Indeed, the Jordan decomposition converges much faster, but also shows a higher variance, and sometimes, particles get stuck away from the target mode, which does not seem to happen with the $\cosh/\sinh$ decomposition.
We also show in [Figure E.11](https://arxiv.org/html/2606.27767#A5.F11) the result of the same experiment on the Gaussian mixture target. On this target, the Jordan decomposition outperforms all the others. In [Figure E.10](https://arxiv.org/html/2606.27767#A5.F10), we add the loss for one run and $n=50$. We observe that WCCCP with the Jordan decomposition can plateau during training, and seems to have several regimes of convergence for the Gaussian target. Moreover, it is much faster in its first iterations than the $\cosh/\sinh$ decomposition.
In [Figure E.12](https://arxiv.org/html/2606.27767#A5.F12), we plot the evolution of the loss for the Gaussian and Gaussian mixture targets, comparing WGD with WCCCP and the Forward-Backward (FB) algorithm studied in \[Luu et al., [2024](https://arxiv.org/html/2606.27767#bib.bib2)\]. We compare WCCCP and FB with two DC decompositions: the one based on the DC decomposition of the radial kernel with $\cosh/\sinh$ as presented in [Lemma D.3](https://arxiv.org/html/2606.27767#Thmlemma3) and the one based on the DC decomposition ([47](https://arxiv.org/html/2606.27767#A4.E47)). We observe that the performance of both WCCCP and FB heavily depends on the choice of the DC decomposition. In particular, the $\cosh/\sinh$ decomposition seems to work well for both algorithms, while the iterates obtained with the decomposition ([47](https://arxiv.org/html/2606.27767#A4.E47)) get stuck in a local minimum, as WGD does. We add in [Figure F.13](https://arxiv.org/html/2606.27767#A6.F13) and [Figure F.14](https://arxiv.org/html/2606.27767#A6.F14) the evolution of the particles over the flows for WGD, and for WCCCP and FB with both decompositions.
### Appendix FProofs
#### F\.1Proof of[Proposition˜1](https://arxiv.org/html/2606.27767#Thmproposition1)
Letk≥0k\\geq 0,μk∈𝒫2\(ℝd\)\\mu\_\{k\}\\in\\mathcal\{P\}\_\{2\}\(\\mathbb\{R\}^\{d\}\)\. On one hand, we have for anyT∈L2\(μk\)\\mathrm\{T\}\\in L^\{2\}\(\\mu\_\{k\}\), settingJ\\mathrm\{J\}as the r\.h\.s\. of \([10](https://arxiv.org/html/2606.27767#S3.E10)\),
$$\begin{aligned}
\mathrm{J}(\mathrm{T})&\coloneqq\mathcal{F}^{+}(\mathrm{T}_{\#}\mu_k)-\mathcal{F}^{-}(\mu_k)-\langle\nabla_{\mathrm{W}_2}\mathcal{F}^{-}(\mu_k),\mathrm{T}-\mathrm{Id}\rangle_{L^2(\mu_k)}\\
&=\mathcal{F}^{+}(\mathrm{T}_{\#}\mu_k)-\mathcal{F}^{+}(\mu_k)-\langle\nabla_{\mathrm{W}_2}\mathcal{F}^{+}(\mu_k),\mathrm{T}-\mathrm{Id}\rangle_{L^2(\mu_k)}\\
&\qquad+\mathcal{F}^{+}(\mu_k)+\langle\nabla_{\mathrm{W}_2}\mathcal{F}^{+}(\mu_k),\mathrm{T}-\mathrm{Id}\rangle_{L^2(\mu_k)}\\
&\qquad-\mathcal{F}^{-}(\mu_k)-\langle\nabla_{\mathrm{W}_2}\mathcal{F}^{-}(\mu_k),\mathrm{T}-\mathrm{Id}\rangle_{L^2(\mu_k)}\\
&=\mathrm{D}_{\mathcal{F}^{+}}^{\mu_k}(\mathrm{T},\mathrm{Id})+\mathcal{F}^{+}(\mu_k)-\mathcal{F}^{-}(\mu_k)+\langle\nabla_{\mathrm{W}_2}\mathcal{F}^{+}(\mu_k)-\nabla_{\mathrm{W}_2}\mathcal{F}^{-}(\mu_k),\mathrm{T}-\mathrm{Id}\rangle_{L^2(\mu_k)}\\
&=\mathrm{D}_{\mathcal{F}^{+}}^{\mu_k}(\mathrm{T},\mathrm{Id})+\mathcal{F}(\mu_k)+\langle\nabla_{\mathrm{W}_2}\mathcal{F}(\mu_k),\mathrm{T}-\mathrm{Id}\rangle_{L^2(\mu_k)}.
\end{aligned}\tag{66}$$
Hence, ([11](https://arxiv.org/html/2606.27767#S3.E11)) is equivalent to a mirror descent on $\mathcal{F}$ with geometry induced by the Bregman divergence with Bregman potential $\mathcal{F}^{+}$.
On the other hand, we also have, for any $\mathrm{T} \in L^2(\mu_k)$,

$$
\begin{aligned}
\mathrm{J}(\mathrm{T}) &= \mathcal{F}^{+}(\mathrm{T}_{\#}\mu_k) - \mathcal{F}^{-}(\mu_k) - \langle \nabla_{\mathrm{W}_2}\mathcal{F}^{-}(\mu_k), \mathrm{T} - \mathrm{Id}\rangle_{L^2(\mu_k)} \\
&= \mathcal{F}^{+}(\mathrm{T}_{\#}\mu_k) - \mathcal{F}^{-}(\mathrm{T}_{\#}\mu_k) + \mathcal{F}^{-}(\mathrm{T}_{\#}\mu_k) - \mathcal{F}^{-}(\mu_k) - \langle \nabla_{\mathrm{W}_2}\mathcal{F}^{-}(\mu_k), \mathrm{T} - \mathrm{Id}\rangle_{L^2(\mu_k)} \\
&= \mathcal{F}(\mathrm{T}_{\#}\mu_k) + \mathrm{D}_{\mathcal{F}^{-}}^{\mu_k}(\mathrm{T}, \mathrm{Id}).
\end{aligned}
\tag{67}
$$

Hence, ([11](https://arxiv.org/html/2606.27767#S3.E11)) is also equivalent to a Bregman proximal step on $\mathcal{F}$, with geometry induced by the Bregman divergence with Bregman potential $\mathcal{F}^{-}$.
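The two rewritings (66) and (67) have a direct finite-dimensional analog, which can be checked numerically. The sketch below is only an illustration on the real line: the functions `f_plus` and `f_minus` are hypothetical convex choices, not taken from the paper, and the Wasserstein gradients and Bregman divergences are replaced by their Euclidean counterparts.

```python
import numpy as np

# Hypothetical 1D DC pair (an assumption for illustration): f = f_plus - f_minus.
def f_plus(x):  return 0.25 * x**4 + 0.5 * x**2
def g_plus(x):  return x**3 + x          # derivative of f_plus
def f_minus(x): return x**2
def g_minus(x): return 2.0 * x           # derivative of f_minus

def f(x): return f_plus(x) - f_minus(x)
def g(x): return g_plus(x) - g_minus(x)

def bregman(phi, grad_phi, t, s):
    """Bregman divergence D_phi(t, s) = phi(t) - phi(s) - phi'(s) (t - s)."""
    return phi(t) - phi(s) - grad_phi(s) * (t - s)

def surrogate(t, x):
    """CCCP surrogate J(t): f_plus(t) minus the linearization of f_minus at x."""
    return f_plus(t) - f_minus(x) - g_minus(x) * (t - x)

def mirror_view(t, x):
    """Euclidean analog of (66): D_{f+}(t, x) + f(x) + f'(x) (t - x)."""
    return bregman(f_plus, g_plus, t, x) + f(x) + g(x) * (t - x)

def prox_view(t, x):
    """Euclidean analog of (67): f(t) + D_{f-}(t, x)."""
    return f(t) + bregman(f_minus, g_minus, t, x)

rng = np.random.default_rng(0)
x = rng.normal()
ts = rng.normal(size=100)
gap_mirror = np.max(np.abs(surrogate(ts, x) - mirror_view(ts, x)))
gap_prox = np.max(np.abs(surrogate(ts, x) - prox_view(ts, x)))
print(gap_mirror, gap_prox)  # both should be at floating-point noise level
```

Both gaps vanish up to rounding, since the three expressions are algebraically identical; only the grouping of terms changes the interpretation.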
Figure F.13: Evolution of the particles over the scheme for the Gaussian target.

Figure F.14: Evolution of the particles over the scheme for the Gaussian mixture target.
#### F\.2Proof of[Proposition˜2](https://arxiv.org/html/2606.27767#Thmproposition2)
$$
\begin{aligned}
\mathcal{F}(\mu_{k+1}) &= \mathcal{F}(\mu_k) + \mathcal{F}^{+}(\mu_{k+1}) - \mathcal{F}^{+}(\mu_k) - \mathcal{F}^{-}(\mu_{k+1}) + \mathcal{F}^{-}(\mu_k) \\
&= \mathcal{F}(\mu_k) - \mathrm{D}_{\mathcal{F}^{-}}^{\mu_k}(\mathrm{T}_{k+1}, \mathrm{Id}) + \mathcal{F}^{+}(\mu_{k+1}) - \mathcal{F}^{+}(\mu_k) - \langle \nabla_{\mathrm{W}_2}\mathcal{F}^{-}(\mu_k), \mathrm{T}_{k+1} - \mathrm{Id}\rangle_{L^2(\mu_k)} \\
&= \mathcal{F}(\mu_k) - \mathrm{D}_{\mathcal{F}^{-}}^{\mu_k}(\mathrm{T}_{k+1}, \mathrm{Id}) - \mathcal{D}^{k}_{\mathcal{F}^{+}}.
\end{aligned}
\tag{68}
$$

If $\mathcal{F}^{+}$ is W-differentiable, observe that by the first order conditions in ([11](https://arxiv.org/html/2606.27767#S3.E11)), $\nabla_{\mathrm{W}_2}\mathcal{F}^{+}(\mu_{k+1}) \circ \mathrm{T}_{k+1} = \nabla_{\mathrm{W}_2}\mathcal{F}^{-}(\mu_k)$, and thus $\mathcal{D}^{k}_{\mathcal{F}^{+}} = \mathrm{D}_{\mathcal{F}^{+}}^{\mu_k}(\mathrm{Id}, \mathrm{T}_{k+1})$.
#### F\.3Proof of[Proposition˜3](https://arxiv.org/html/2606.27767#Thmproposition3)
If $\mathcal{D}^{k}_{\mathcal{F}^{+}} = 0$, then $\mathrm{Id} \in \operatorname{argmin}_{\mathrm{T} \in L^2(\mu_k)}\ \mathcal{F}^{+}(\mathrm{T}_{\#}\mu_k) - \mathcal{F}^{-}(\mu_k) - \langle \nabla_{\mathrm{W}_2}\mathcal{F}^{-}(\mu_k), \mathrm{T} - \mathrm{Id}\rangle_{L^2(\mu_k)}$. Hence the first order condition gives $0 = \nabla_{\mathrm{W}_2}\mathcal{F}^{+}(\mathrm{Id}_{\#}\mu_k) \circ \mathrm{Id} - \nabla_{\mathrm{W}_2}\mathcal{F}^{-}(\mu_k) = \nabla_{\mathrm{W}_2}\mathcal{F}(\mu_k)$.
#### F\.4Proof of[Proposition˜4](https://arxiv.org/html/2606.27767#Thmproposition4)
We use a telescoping sum on ([15](https://arxiv.org/html/2606.27767#S3.E15)), discarding $\mathrm{D}_{\mathcal{F}^{-}}^{\mu_k}(\mathrm{T}_{k+1}, \mathrm{Id})$ (nonnegative by convexity of $\mathcal{F}^{-}$), and then use the observation ([18](https://arxiv.org/html/2606.27767#S3.E18)).
#### F\.5Proof of[Proposition˜5](https://arxiv.org/html/2606.27767#Thmproposition5)
Since $\mathcal{F}^{+}$ and $\mathcal{F}^{-}$ are respectively $\alpha^{+}$- and $\alpha^{-}$-convex along iterates with respect to $\mu \mapsto \int \tfrac{1}{2}\|\cdot\|_2^2\ \mathrm{d}\mu$, we have $\mathrm{D}_{\mathcal{F}^{+}}^{\mu_k}(\mathrm{Id}, \mathrm{T}_{k+1}) \geq \frac{\alpha^{+}}{2}\|\mathrm{T}_{k+1} - \mathrm{Id}\|_{L^2(\mu_k)}^2$ and $\mathrm{D}_{\mathcal{F}^{-}}^{\mu_k}(\mathrm{T}_{k+1}, \mathrm{Id}) \geq \frac{\alpha^{-}}{2}\|\mathrm{T}_{k+1} - \mathrm{Id}\|_{L^2(\mu_k)}^2$. Thus, for all $k \in \{0, \dots, K-1\}$,

$$
\begin{aligned}
\frac{\alpha^{+} + \alpha^{-}}{2}\|\mathrm{T}_{k+1} - \mathrm{Id}\|_{L^2(\mu_k)}^2 &\leq \mathrm{D}_{\mathcal{F}^{+}}^{\mu_k}(\mathrm{Id}, \mathrm{T}_{k+1}) + \mathrm{D}_{\mathcal{F}^{-}}^{\mu_k}(\mathrm{T}_{k+1}, \mathrm{Id}) \\
&\stackrel{(17)}{=} \mathcal{F}(\mu_k) - \mathcal{F}(\mu_{k+1}).
\end{aligned}
\tag{69}
$$

Here the equality is ([17](https://arxiv.org/html/2606.27767#S3.E17)). Hence, by ([18](https://arxiv.org/html/2606.27767#S3.E18)),

$$
\begin{aligned}
\min_{k \in \{0, \dots, K-1\}}\ \|\mathrm{T}_{k+1} - \mathrm{Id}\|_{L^2(\mu_k)}^2 &\leq \frac{2}{\alpha^{+} + \alpha^{-}} \min_{k \in \{0, \dots, K-1\}}\ \big(\mathcal{F}(\mu_k) - \mathcal{F}(\mu_{k+1})\big) \\
&\leq \frac{2}{\alpha^{+} + \alpha^{-}} \cdot \frac{\mathcal{F}(\mu_0) - \mathcal{F}(\mu_K)}{K}.
\end{aligned}
\tag{70}
$$
Under the additional assumption that $\|\nabla_{\mathrm{W}_2}\mathcal{F}^{+}(\mu_{k+1}) \circ \mathrm{T}_{k+1} - \nabla_{\mathrm{W}_2}\mathcal{F}^{+}(\mu_k)\|_{L^2(\mu_k)} \leq L\|\mathrm{T}_{k+1} - \mathrm{Id}\|_{L^2(\mu_k)}$, the first order condition ([14](https://arxiv.org/html/2606.27767#S3.E14)), *i.e.* $\nabla_{\mathrm{W}_2}\mathcal{F}^{-}(\mu_k) = \nabla_{\mathrm{W}_2}\mathcal{F}^{+}(\mu_{k+1}) \circ \mathrm{T}_{k+1}$, gives

$$
\begin{aligned}
\|\nabla_{\mathrm{W}_2}\mathcal{F}(\mu_k)\|_{L^2(\mu_k)} &= \|\nabla_{\mathrm{W}_2}\mathcal{F}^{+}(\mu_k) - \nabla_{\mathrm{W}_2}\mathcal{F}^{-}(\mu_k)\|_{L^2(\mu_k)} \\
&= \|\nabla_{\mathrm{W}_2}\mathcal{F}^{+}(\mu_k) - \nabla_{\mathrm{W}_2}\mathcal{F}^{+}(\mu_{k+1}) \circ \mathrm{T}_{k+1}\|_{L^2(\mu_k)} \\
&\leq L\|\mathrm{T}_{k+1} - \mathrm{Id}\|_{L^2(\mu_k)}.
\end{aligned}
\tag{71}
$$

Hence, combining this with the previous result,

$$
\min_{k \in \{0, \dots, K-1\}}\ \|\nabla_{\mathrm{W}_2}\mathcal{F}(\mu_k)\|_{L^2(\mu_k)}^2 \leq \frac{2L^2}{\alpha^{+} + \alpha^{-}} \cdot \frac{\mathcal{F}(\mu_0) - \mathcal{F}(\mu_K)}{K}.
\tag{72}
$$
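The per-step descent (69) and the rate (70) can be observed on a Euclidean toy problem. The following is a minimal sketch under stated assumptions: the DC pair `f_plus`, `f_minus` is a hypothetical choice (not from the paper), iterates are points of $\mathbb{R}^3$ rather than measures, and the CCCP step is available in closed form because `f_plus` is quadratic.

```python
import numpy as np

# Strong convexity constants of the illustrative f_plus and f_minus below.
ALPHA_PLUS, ALPHA_MINUS = 2.0, 1.0

def f_plus(x):
    return np.dot(x, x)                      # ||x||^2, 2-strongly convex

def f_minus(x):
    # (1/2)||x||^2 plus a convex term, hence 1-strongly convex.
    return 0.5 * np.dot(x, x) + np.sqrt(1.0 + np.dot(x, x))

def grad_f_minus(x):
    return x + x / np.sqrt(1.0 + np.dot(x, x))

def f(x):
    return f_plus(x) - f_minus(x)

def cccp_step(x):
    # First order condition: grad f_plus(x_next) = grad f_minus(x),
    # i.e. 2 * x_next = grad f_minus(x).
    return 0.5 * grad_f_minus(x)

K = 50
xs = [np.array([3.0, -2.0, 1.0])]
for _ in range(K):
    xs.append(cccp_step(xs[-1]))

values = np.array([f(x) for x in xs])
steps_sq = np.array([np.sum((xs[k + 1] - xs[k]) ** 2) for k in range(K)])

# Euclidean analog of (69): each step decreases f by at least
# (alpha_plus + alpha_minus) / 2 times the squared step length.
descent_ok = np.all(
    values[:-1] - values[1:] >= 0.5 * (ALPHA_PLUS + ALPHA_MINUS) * steps_sq - 1e-12
)
# Euclidean analog of (70): the smallest squared step is bounded by the average decrease.
rate_bound = 2.0 / (ALPHA_PLUS + ALPHA_MINUS) * (values[0] - values[-1]) / K
print(descent_ok, steps_sq.min(), rate_bound)
```

The bound only controls the best iterate among the first $K$, which is why the check compares `steps_sq.min()` with `rate_bound` rather than the last step.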
#### F\.6Proof of[Proposition˜6](https://arxiv.org/html/2606.27767#Thmproposition6)
Let $\mathrm{S}_{k+1} \coloneqq \mathrm{Id} + \tau\nabla_{\mathrm{W}_2}\mathcal{F}^{-}(\mu_k)$. First, notice that since $\nu_{k+1} \in \mathcal{P}_{\mathrm{ac}}(\mathbb{R}^d)$ by assumption, ([25](https://arxiv.org/html/2606.27767#S3.E25)) is equivalent to

$$
\left\{
\begin{array}{ll}
\mathrm{S}_{k+1} = \operatorname{argmin}_{\mathrm{S} \in L^2(\mu_k)}\ \frac{1}{2\tau}\|\mathrm{S} - \mathrm{Id}\|_{L^2(\mu_k)}^2 - \langle \nabla_{\mathrm{W}_2}\mathcal{F}^{-}(\mu_k), \mathrm{S} - \mathrm{Id}\rangle_{L^2(\mu_k)}, & \nu_{k+1} = (\mathrm{S}_{k+1})_{\#}\mu_k, \\
\mathrm{T}_{k+1} = \operatorname{argmin}_{\mathrm{T} \in L^2(\nu_{k+1})}\ \frac{1}{2\tau}\|\mathrm{T} - \mathrm{Id}\|_{L^2(\nu_{k+1})}^2 + \mathcal{F}^{+}(\mathrm{T}_{\#}\nu_{k+1}), & \mu_{k+1} = (\mathrm{T}_{k+1})_{\#}\nu_{k+1}.
\end{array}
\right.
\tag{73}
$$

The first equation is obtained from the first order condition, and the second by observing that $\mathrm{T}_{k+1}$ is necessarily an OT map between $\nu_{k+1}$ and $\mu_{k+1}$. Indeed, as $\nu_{k+1} \in \mathcal{P}_{\mathrm{ac}}(\mathbb{R}^d)$, by Brenier's theorem there exists a unique OT map between $\nu_{k+1}$ and $\mu_{k+1}$. If $\mathrm{T}_{k+1}$ does not satisfy $\|\mathrm{T}_{k+1} - \mathrm{Id}\|_{L^2(\nu_{k+1})}^2 = \mathrm{W}_2^2(\mu_{k+1}, \nu_{k+1})$, then we can replace $\mathrm{T}_{k+1}$ by the OT map $\mathrm{T}_{\nu_{k+1}}^{\mu_{k+1}}$, which leaves $\mathcal{F}^{+}\big((\mathrm{T}_{\nu_{k+1}}^{\mu_{k+1}})_{\#}\nu_{k+1}\big) = \mathcal{F}^{+}(\mu_{k+1})$ unchanged but gives a strictly lower transport cost.
Let $J(\mathrm{T}) = \mathcal{F}^{+}(\mathrm{T}_{\#}\mu_k) - \langle \nabla_{\mathrm{W}_2}\mathcal{F}^{-}(\mu_k), \mathrm{T} - \mathrm{Id}\rangle_{L^2(\mu_k)} + \frac{1}{2\tau}\|\mathrm{T} - \mathrm{Id}\|_{L^2(\mu_k)}^2$. The gradient of $J$ is

$$
\begin{aligned}
\nabla J(\mathrm{T}) &= \nabla_{\mathrm{W}_2}\mathcal{F}^{+}(\mathrm{T}_{\#}\mu_k) \circ \mathrm{T} - \nabla_{\mathrm{W}_2}\mathcal{F}^{-}(\mu_k) + \frac{1}{\tau}(\mathrm{T} - \mathrm{Id}) \\
&= \nabla_{\mathrm{W}_2}\mathcal{F}^{+}(\mathrm{T}_{\#}\mu_k) \circ \mathrm{T} + \frac{1}{\tau}\big(\mathrm{T} - (\mathrm{Id} + \tau\nabla_{\mathrm{W}_2}\mathcal{F}^{-}(\mu_k))\big) \\
&= \nabla_{\mathrm{W}_2}\mathcal{F}^{+}(\mathrm{T}_{\#}\mu_k) \circ \mathrm{T} + \frac{1}{\tau}\big(\mathrm{T} - \mathrm{S}_{k+1}\big).
\end{aligned}
\tag{74}
$$

Taking the first order conditions,

$$
\nabla J(\tilde{\mathrm{T}}_{k+1}) = 0 \iff \tilde{\mathrm{T}}_{k+1} = \operatorname{argmin}_{\tilde{\mathrm{T}} \in L^2(\mu_k)}\ \mathcal{F}^{+}(\tilde{\mathrm{T}}_{\#}\mu_k) + \frac{1}{2\tau}\|\tilde{\mathrm{T}} - \mathrm{S}_{k+1}\|_{L^2(\mu_k)}^2.
\tag{75}
$$
Now let us show that $\tilde{\mathrm{T}}_{k+1} = \mathrm{T}_{k+1} \circ \mathrm{S}_{k+1}$, *i.e.*

$$
\min_{\tilde{\mathrm{T}} \in L^2(\mu_k)}\ \mathcal{F}^{+}(\tilde{\mathrm{T}}_{\#}\mu_k) + \frac{1}{2\tau}\|\tilde{\mathrm{T}} - \mathrm{S}_{k+1}\|_{L^2(\mu_k)}^2 = \min_{\mathrm{T} \in L^2(\nu_{k+1})}\ \mathcal{F}^{+}(\mathrm{T}_{\#}\nu_{k+1}) + \frac{1}{2\tau}\|\mathrm{T} - \mathrm{Id}\|_{L^2(\nu_{k+1})}^2.
\tag{76}
$$

On one hand, notice by a change of variables, since $\nu_{k+1} = (\mathrm{S}_{k+1})_{\#}\mu_k$, that for all $\mathrm{T} \in L^2(\nu_{k+1})$,

$$
\frac{1}{2\tau}\|\mathrm{T} - \mathrm{Id}\|_{L^2(\nu_{k+1})}^2 + \mathcal{F}^{+}(\mathrm{T}_{\#}\nu_{k+1}) = \frac{1}{2\tau}\|\mathrm{T} \circ \mathrm{S}_{k+1} - \mathrm{S}_{k+1}\|_{L^2(\mu_k)}^2 + \mathcal{F}^{+}\big((\mathrm{T} \circ \mathrm{S}_{k+1})_{\#}\mu_k\big).
\tag{77}
$$

Since $\{\mathrm{T} \circ \mathrm{S}_{k+1},\ \mathrm{T} \in L^2(\nu_{k+1})\} \subset L^2(\mu_k)$,

$$
\begin{aligned}
&\min_{\mathrm{T} \in L^2(\nu_{k+1})}\ \frac{1}{2\tau}\|\mathrm{T} - \mathrm{Id}\|_{L^2(\nu_{k+1})}^2 + \mathcal{F}^{+}(\mathrm{T}_{\#}\nu_{k+1}) \\
&\quad= \min_{\mathrm{T} \in L^2(\nu_{k+1})}\ \frac{1}{2\tau}\|\mathrm{T} \circ \mathrm{S}_{k+1} - \mathrm{S}_{k+1}\|_{L^2(\mu_k)}^2 + \mathcal{F}^{+}\big((\mathrm{T} \circ \mathrm{S}_{k+1})_{\#}\mu_k\big) \\
&\quad\geq \min_{\mathrm{T} \in L^2(\mu_k)}\ \frac{1}{2\tau}\|\mathrm{T} - \mathrm{S}_{k+1}\|_{L^2(\mu_k)}^2 + \mathcal{F}^{+}(\mathrm{T}_{\#}\mu_k) \\
&\quad\stackrel{(75)}{=} \frac{1}{2\tau}\|\tilde{\mathrm{T}}_{k+1} - \mathrm{S}_{k+1}\|_{L^2(\mu_k)}^2 + \mathcal{F}^{+}\big((\tilde{\mathrm{T}}_{k+1})_{\#}\mu_k\big).
\end{aligned}
\tag{78}
$$

Now, suppose by contradiction that this inequality is strict, *i.e.*

$$
\min_{\mathrm{T} \in L^2(\nu_{k+1})}\ \frac{1}{2\tau}\|\mathrm{T} - \mathrm{Id}\|_{L^2(\nu_{k+1})}^2 + \mathcal{F}^{+}(\mathrm{T}_{\#}\nu_{k+1}) > \frac{1}{2\tau}\|\tilde{\mathrm{T}}_{k+1} - \mathrm{S}_{k+1}\|_{L^2(\mu_k)}^2 + \mathcal{F}^{+}\big((\tilde{\mathrm{T}}_{k+1})_{\#}\mu_k\big).
\tag{79}
$$

Define $\eta = (\tilde{\mathrm{T}}_{k+1})_{\#}\mu_k$. Since $\nu_{k+1} \in \mathcal{P}_{\mathrm{ac}}(\mathbb{R}^d)$, there exists an OT map $\mathrm{T}^{\eta} \in L^2(\nu_{k+1})$ such that $(\mathrm{T}^{\eta})_{\#}\nu_{k+1} = \eta$ and $\mathrm{W}_2^2(\eta, \nu_{k+1}) = \|\mathrm{T}^{\eta} - \mathrm{Id}\|_{L^2(\nu_{k+1})}^2$. Notice that $(\tilde{\mathrm{T}}_{k+1}, \mathrm{S}_{k+1})_{\#}\mu_k \in \Pi(\eta, \nu_{k+1})$, hence $\|\tilde{\mathrm{T}}_{k+1} - \mathrm{S}_{k+1}\|_{L^2(\mu_k)}^2 \geq \mathrm{W}_2^2(\eta, \nu_{k+1})$ and

$$
\begin{aligned}
\min_{\mathrm{T} \in L^2(\nu_{k+1})}\ \frac{1}{2\tau}\|\mathrm{T} - \mathrm{Id}\|_{L^2(\nu_{k+1})}^2 + \mathcal{F}^{+}(\mathrm{T}_{\#}\nu_{k+1}) &> \frac{1}{2\tau}\|\tilde{\mathrm{T}}_{k+1} - \mathrm{S}_{k+1}\|_{L^2(\mu_k)}^2 + \mathcal{F}^{+}\big((\tilde{\mathrm{T}}_{k+1})_{\#}\mu_k\big) \\
&\geq \frac{1}{2\tau}\|\mathrm{T}^{\eta} - \mathrm{Id}\|_{L^2(\nu_{k+1})}^2 + \mathcal{F}^{+}\big((\mathrm{T}^{\eta})_{\#}\nu_{k+1}\big).
\end{aligned}
\tag{80}
$$

However, $\mathrm{T}^{\eta} \in L^2(\nu_{k+1})$ is admissible in the minimization on the left-hand side, which is a contradiction. Therefore, we necessarily have equality in ([76](https://arxiv.org/html/2606.27767#A6.E76)) and $\tilde{\mathrm{T}}_{k+1} = \mathrm{T}_{k+1} \circ \mathrm{S}_{k+1}$.
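The splitting just proved has a familiar Euclidean counterpart: an explicit step on the concave part followed by a proximal step on the convex part. The sketch below is an illustration under assumptions that are not in the paper: `f_plus` is the quadratic $\tfrac{A}{2}\|t\|^2$ (so its proximal map is a rescaling), `f_minus` is $\sum_i \log\cosh(x_i)$, and points of $\mathbb{R}^3$ replace measures. It checks that the forward-backward output is a critical point of the regularized surrogate $J$.

```python
import numpy as np

TAU, A = 0.3, 2.0   # step size and curvature of the illustrative f_plus

def grad_f_plus(t):
    return A * t                      # f_plus(t) = (A/2) ||t||^2

def grad_f_minus(x):
    return np.tanh(x)                 # f_minus(x) = sum_i log cosh(x_i), convex

def grad_J(t, x):
    """Gradient of J(t) = f_plus(t) - <grad f_minus(x), t - x> + ||t - x||^2 / (2 tau)."""
    return grad_f_plus(t) - grad_f_minus(x) + (t - x) / TAU

def forward_backward(x):
    s = x + TAU * grad_f_minus(x)     # forward (explicit) step, analog of S_{k+1}
    return s / (1.0 + TAU * A)        # backward step: prox of tau * f_plus at s

x = np.array([1.5, -0.7, 0.2])
t_next = forward_backward(x)
residual = np.linalg.norm(grad_J(t_next, x))
print(residual)  # should be at floating-point noise level
```

In the Wasserstein setting the composition $\mathrm{T}_{k+1} \circ \mathrm{S}_{k+1}$ plays the role of `forward_backward`, with the proximal step taken from the intermediate measure $\nu_{k+1}$.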
#### F\.7Proof of[Proposition˜7](https://arxiv.org/html/2606.27767#Thmproposition7)
We start with two preliminary lemmas\.
###### Lemma F\.7\.
Let $\nu \in \mathcal{P}_2(\mathbb{R}^d)$ and $\alpha \geq 0$. Let $\psi : \mathbb{R}^d \to \mathbb{R}$ be an $\alpha$-strongly convex function that is $\nu$-integrable. Then $\mathrm{V} : x \mapsto \int \psi(x - y)\ \mathrm{d}\nu(y)$ is $\alpha$-strongly convex.
###### Proof\.
Let $x_0, x_1 \in \mathbb{R}^d$ and $t \in [0, 1]$. Then, using $(x_0 - y) - (x_1 - y) = x_0 - x_1$ in the strong convexity inequality for $\psi$,

$$
\begin{aligned}
\mathrm{V}\big((1-t)x_0 + tx_1\big) &= \int \psi\big((1-t)x_0 + tx_1 - y\big)\ \mathrm{d}\nu(y) \\
&= \int \psi\big((1-t)(x_0 - y) + t(x_1 - y)\big)\ \mathrm{d}\nu(y) \\
&\leq (1-t)\int \psi(x_0 - y)\ \mathrm{d}\nu(y) + t\int \psi(x_1 - y)\ \mathrm{d}\nu(y) - \frac{\alpha t(1-t)}{2}\|x_0 - x_1\|_2^2 \\
&= (1-t)\mathrm{V}(x_0) + t\mathrm{V}(x_1) - \frac{\alpha t(1-t)}{2}\|x_0 - x_1\|_2^2.
\end{aligned}
\tag{81}
$$

∎
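Lemma F.7 can be sanity-checked on an empirical measure. The sketch below is only an illustration: `psi` is a hypothetical $\alpha$-strongly convex function (a quadratic plus a convex term), $\nu$ is replaced by 200 Gaussian samples, and the strong convexity inequality of $\mathrm{V}$ is tested at random triples $(x_0, x_1, t)$.

```python
import numpy as np

ALPHA = 0.8
rng = np.random.default_rng(1)
ys = rng.normal(size=(200, 3))        # samples defining an empirical measure nu

def psi(z):
    # ALPHA-strongly convex: quadratic part plus the convex term sqrt(1 + ||z||^2).
    sq = np.sum(z ** 2, axis=-1)
    return 0.5 * ALPHA * sq + np.sqrt(1.0 + sq)

def V(x):
    # V(x) = integral of psi(x - y) d nu(y), here an average over the samples.
    return np.mean(psi(x[None, :] - ys))

def convexity_gap(x0, x1, t):
    """Right-hand side minus left-hand side of the strong convexity inequality for V."""
    rhs = (1 - t) * V(x0) + t * V(x1) - 0.5 * ALPHA * t * (1 - t) * np.sum((x0 - x1) ** 2)
    return rhs - V((1 - t) * x0 + t * x1)

gaps = [convexity_gap(rng.normal(size=3), rng.normal(size=3), rng.uniform())
        for _ in range(500)]
min_gap = min(gaps)
print(min_gap)  # nonnegative up to rounding if V is ALPHA-strongly convex
```

A random test of this kind cannot prove the lemma; it only fails to find a counterexample, which is consistent with the statement.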
###### Lemma F\.8\.
Let $\nu \in \mathcal{P}_2(\mathbb{R}^d)$. Let $\psi : \mathbb{R}^d \to \mathbb{R}$ be $\nu$-integrable and define $\mathrm{V} : x \mapsto \int \psi(x - y)\ \mathrm{d}\nu(y)$. Assume there exist $a, b \in \mathbb{R}$ such that $\psi(x) \geq -a - b\|x\|_2^2$ for all $x \in \mathbb{R}^d$. Then, there exist $a', b' \in \mathbb{R}$ such that $\mathrm{V}(x) \geq -a' - b'\|x\|_2^2$ for all $x \in \mathbb{R}^d$.
###### Proof\.
Let $x \in \mathbb{R}^d$. Then

$$
\mathrm{V}(x) = \int \psi(x - y)\ \mathrm{d}\nu(y) \geq -a - b\int \|x - y\|_2^2\ \mathrm{d}\nu(y).
\tag{82}
$$

If $b \geq 0$, we can use that $\int \|x - y\|_2^2\ \mathrm{d}\nu(y) \leq 2\|x\|_2^2 + 2\int \|y\|_2^2\ \mathrm{d}\nu(y)$, and thus we have the result for $a' = a + 2b\int \|y\|_2^2\ \mathrm{d}\nu(y)$ and $b' = 2b$. If $b < 0$, then $\mathrm{V}(x) \geq -a$ and we can use $a' = a$, $b' = 0$. ∎
We now move to proving [Proposition 7](https://arxiv.org/html/2606.27767#Thmproposition7). First, let us focus on the potential term of MMD, *i.e.*

$$
\mathcal{V}(\mu) = \int \mathrm{V}\ \mathrm{d}\mu, \quad \mathrm{V}(\cdot) = -\int k(\cdot, y)\ \mathrm{d}\nu(y).
\tag{83}
$$

Since for all $x, y \in \mathbb{R}^d$, $k(x, y) = \psi(x - y) = \psi^{+}(x - y) - \psi^{-}(x - y)$, we can rewrite $\mathrm{V}$ as, for all $x \in \mathbb{R}^d$,

$$
\begin{aligned}
\mathrm{V}(x) &= -\int \psi(x - y)\ \mathrm{d}\nu(y) \\
&= -\int \big(\psi^{+}(x - y) - \psi^{-}(x - y)\big)\ \mathrm{d}\nu(y) \\
&= \int \psi^{-}(x - y)\ \mathrm{d}\nu(y) - \int \psi^{+}(x - y)\ \mathrm{d}\nu(y) \\
&= \mathrm{V}^{-}(x) - \mathrm{V}^{+}(x).
\end{aligned}
\tag{84}
$$

Moreover, as $\psi^{-}$ and $\psi^{+}$ are respectively $\alpha^{-}$- and $\alpha^{+}$-strongly convex and locally Lipschitz, $\mathrm{V}^{-}$ and $\mathrm{V}^{+}$ are continuous and respectively $\alpha^{-}$- and $\alpha^{+}$-strongly convex by [Lemma F.7](https://arxiv.org/html/2606.27767#Thmlemma7).

We now show that $\psi^{-}$ and $\psi^{+}$ are bounded below by a negative quadratic. Since they are convex with full domain, they have nonempty subdifferentials everywhere, in particular at $0$. Take for instance $p \in \partial\psi^{-}(0)$, so that $\psi^{-}(x) \geq \psi^{-}(0) + \langle p, x\rangle$. If $p = 0$, then $\psi^{-}$ is bounded below. If $p \neq 0$, then, using the Cauchy-Schwarz inequality and $\|x\|_2 \leq 1 + \|x\|_2^2$,

$$
\psi^{-}(x) \geq \psi^{-}(0) + \langle p, x\rangle \geq \psi^{-}(0) - \|p\|_2\|x\|_2 \geq \psi^{-}(0) - \|p\|_2(1 + \|x\|_2^2).
\tag{85}
$$

The same reasoning applies to $\psi^{+}$. Consequently, there exist $a^{+}, a^{-}, b^{+}, b^{-} \in \mathbb{R}$ such that $\psi^{+}(\cdot) \geq -a^{+} - b^{+}\|\cdot\|_2^2$ and $\psi^{-}(\cdot) \geq -a^{-} - b^{-}\|\cdot\|_2^2$, and by [Lemma F.8](https://arxiv.org/html/2606.27767#Thmlemma8) the negative parts of $\mathrm{V}^{-}$ and $\mathrm{V}^{+}$ have at most quadratic growth.

Hence, by [Ambrosio et al., [2008](https://arxiv.org/html/2606.27767#bib.bib31), Proposition 9.3.2], $\mathcal{V}$ can be decomposed as a difference of two strongly totally convex potential energies $\mathcal{V} = \mathcal{V}^{-} - \mathcal{V}^{+}$, where $\mathcal{V}^{-}(\mu) = \int \mathrm{V}^{-}\ \mathrm{d}\mu$ is $\alpha^{-}$-totally convex and $\mathcal{V}^{+}(\mu) = \int \mathrm{V}^{+}\ \mathrm{d}\mu$ is $\alpha^{+}$-totally convex.
Similarly, the interaction energy term $\mathcal{W}(\mu) = \frac{1}{2}\iint k(x, y)\ \mathrm{d}\mu(x)\mathrm{d}\mu(y)$ can be decomposed as a difference of totally convex interaction energies $\mathcal{W} = \mathcal{W}^{+} - \mathcal{W}^{-}$ (by [Ambrosio et al., [2008](https://arxiv.org/html/2606.27767#bib.bib31), Proposition 9.3.5]) with

$$
\mathcal{W}^{+}(\mu) = \frac{1}{2}\iint \psi^{+}(x - y)\ \mathrm{d}\mu(x)\mathrm{d}\mu(y), \quad \mathcal{W}^{-}(\mu) = \frac{1}{2}\iint \psi^{-}(x - y)\ \mathrm{d}\mu(x)\mathrm{d}\mu(y).
\tag{86}
$$

Defining $\mathcal{F}^{+}(\mu) = \mathcal{W}^{+}(\mu) + \mathcal{V}^{-}(\mu) + c$ and $\mathcal{F}^{-}(\mu) = \mathcal{W}^{-}(\mu) + \mathcal{V}^{+}(\mu)$, we have for all $\mu \in \mathcal{P}_2(\mathbb{R}^d)$,

$$
\mathcal{F}(\mu) = \mathcal{F}^{+}(\mu) - \mathcal{F}^{-}(\mu),
\tag{87}
$$

where $\mathcal{F}^{+}$ is $\alpha^{-}$-totally convex and $\mathcal{F}^{-}$ is $\alpha^{+}$-totally convex, each being the sum of a convex term (the interaction energy) and a strongly convex term (the potential energy).
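The bookkeeping of signs in (87) is easy to get wrong, so a particle check may help. The sketch below makes several assumptions that are not taken from the proof above: $\mathcal{F}$ is taken to be $\tfrac{1}{2}\mathrm{MMD}^2$ with $c = \tfrac{1}{2}\iint k\ \mathrm{d}\nu\mathrm{d}\nu$, the kernel is Gaussian with bandwidth `SIGMA`, and the split is the hypothetical choice $\psi^{-}(z) = \|z\|_2^2 / (2\sigma^2)$, $\psi^{+} = \psi + \psi^{-}$ (here $\psi^{+}$ is convex because the Hessian of the Gaussian profile is bounded below by $-\sigma^{-2} I_d$). Measures are replaced by empirical averages over particles.

```python
import numpy as np

SIGMA = 1.0
rng = np.random.default_rng(2)
xs = rng.normal(size=(30, 2))             # particles of mu
ys = rng.normal(loc=1.0, size=(40, 2))    # particles of the target nu

def sq_dists(a, b):
    """Matrix of squared Euclidean distances ||a_i - b_j||^2."""
    return np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)

# All profiles are written as functions of s = ||z||^2.
def psi(s):        # Gaussian kernel profile
    return np.exp(-s / (2 * SIGMA**2))

def psi_minus(s):  # strongly convex quadratic part
    return s / (2 * SIGMA**2)

def psi_plus(s):   # psi + psi_minus, convex for the Gaussian kernel
    return psi(s) + psi_minus(s)

def interaction(fn, a):
    """(1/2) * double integral of fn(x - y) d mu(x) d mu(y), for an empirical mu."""
    return 0.5 * np.mean(fn(sq_dists(a, a)))

def cross(fn, a, b):
    """Double integral of fn(x - y) d mu(x) d nu(y), for empirical mu and nu."""
    return np.mean(fn(sq_dists(a, b)))

c = interaction(psi, ys)
F = interaction(psi, xs) - cross(psi, xs, ys) + c                   # (1/2) MMD^2
F_plus = interaction(psi_plus, xs) + cross(psi_minus, xs, ys) + c   # W^+ + V^- + c
F_minus = interaction(psi_minus, xs) + cross(psi_plus, xs, ys)      # W^- + V^+
print(F, F_plus - F_minus)  # the two numbers should agree
```

Note how the potential terms swap roles: $\mathcal{V}^{-}$ enters $\mathcal{F}^{+}$ and $\mathcal{V}^{+}$ enters $\mathcal{F}^{-}$, because $\mathrm{V}$ carries a minus sign in front of the kernel.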
#### F\.8Proof of[Proposition˜8](https://arxiv.org/html/2606.27767#Thmproposition8)
The assumption $\underline{\lambda}[q_{+}], \underline{\lambda}[q_{-}] \geq 0$ allows us to deduce that $\psi^{+} : z \mapsto q_{+}(\|z\|_2^2)$ and $\psi^{-} : z \mapsto q_{-}(\|z\|_2^2)$ are respectively $\underline{\lambda}[q_{+}]$- and $\underline{\lambda}[q_{-}]$-strongly convex. Hence, we can apply [Proposition 7](https://arxiv.org/html/2606.27767#Thmproposition7) and obtain that $\mathcal{F}^{+}$ is $\underline{\lambda}[q_{-}]$-totally convex and $\mathcal{F}^{-}$ is $\underline{\lambda}[q_{+}]$-totally convex.
Then applying[Proposition˜D\.13](https://arxiv.org/html/2606.27767#Thmproposition13), we obtain the second result\.
#### F\.9Proof of[Proposition˜B\.10](https://arxiv.org/html/2606.27767#Thmproposition10)
We recall that for all $k \geq 0$,

$$
\mathrm{T}_{k+1} = \operatorname{argmin}_{\mathrm{T} \in L^2(\mu_k)}\ \mathcal{F}(\mathrm{T}_{\#}\mu_k) + \mathrm{D}_{\mathcal{F}^{-}}^{\mu_k}(\mathrm{T}, \mathrm{Id}).
\tag{88}
$$

Taking the first order conditions, we obtain

$$
\begin{aligned}
&\nabla_{\mathrm{W}_2}\mathcal{F}(\mu_{k+1}) \circ \mathrm{T}_{k+1} + \nabla_{\mathrm{W}_2}\mathcal{F}^{-}(\mu_{k+1}) \circ \mathrm{T}_{k+1} - \nabla_{\mathrm{W}_2}\mathcal{F}^{-}(\mu_k) = 0 \\
&\iff \nabla_{\mathrm{W}_2}\mathcal{F}(\mu_{k+1}) \circ \mathrm{T}_{k+1} = -\big(\nabla_{\mathrm{W}_2}\mathcal{F}^{-}(\mu_{k+1}) \circ \mathrm{T}_{k+1} - \nabla_{\mathrm{W}_2}\mathcal{F}^{-}(\mu_k)\big).
\end{aligned}
\tag{89}
$$
Notice that $\mathrm{T}_{k+1} = \operatorname{argmin}_{\mathrm{T}_{\#}\mu_k = \mu_{k+1}}\ \mathrm{D}_{\mathcal{F}^{-}}^{\mu_k}(\mathrm{T}, \mathrm{Id})$, and define $\mathrm{T}_k^{*} = \operatorname{argmin}_{\mathrm{T}_{\#}\mu_k = \mu^{*}}\ \mathrm{D}_{\mathcal{F}^{-}}^{\mu_k}(\mathrm{T}, \mathrm{Id})$. Both exist since, by assumption, $\mu_k \in \mathcal{P}_{\mathrm{ac}}(\mathbb{R}^d)$ for all $k \geq 0$.

By hypothesis, $\mathcal{F}$ is $\alpha$-convex relative to $\mathcal{F}^{-}$ along $t \mapsto \big((1-t)\mathrm{T}_k^{*} + t\mathrm{T}_{k+1}\big)_{\#}\mu_k$, thus we have

$$
\begin{aligned}
&\mathrm{D}_{\mathcal{F}}^{\mu_k}(\mathrm{T}_k^{*}, \mathrm{T}_{k+1}) \geq \alpha\mathrm{D}_{\mathcal{F}^{-}}^{\mu_k}(\mathrm{T}_k^{*}, \mathrm{T}_{k+1}) \\
&\iff \mathcal{F}(\mu^{*}) - \mathcal{F}(\mu_{k+1}) - \langle \nabla_{\mathrm{W}_2}\mathcal{F}(\mu_{k+1}) \circ \mathrm{T}_{k+1}, \mathrm{T}_k^{*} - \mathrm{T}_{k+1}\rangle_{L^2(\mu_k)} \geq \alpha\mathrm{D}_{\mathcal{F}^{-}}^{\mu_k}(\mathrm{T}_k^{*}, \mathrm{T}_{k+1}).
\end{aligned}
\tag{90}
$$

By definition of $\mu^{*}$, $\mathcal{F}(\mu^{*}) - \mathcal{F}(\mu_{k+1}) \leq 0$. Using the first order conditions ([89](https://arxiv.org/html/2606.27767#A6.E89)), we get the inequality

$$
\langle \nabla_{\mathrm{W}_2}\mathcal{F}^{-}(\mu_{k+1}) \circ \mathrm{T}_{k+1} - \nabla_{\mathrm{W}_2}\mathcal{F}^{-}(\mu_k), \mathrm{T}_k^{*} - \mathrm{T}_{k+1}\rangle_{L^2(\mu_k)} \geq \alpha\mathrm{D}_{\mathcal{F}^{-}}^{\mu_k}(\mathrm{T}_k^{*}, \mathrm{T}_{k+1}).
\tag{91}
$$

By the 3-point equality (see [Bonet et al., [2024](https://arxiv.org/html/2606.27767#bib.bib3), Lemma 28]) applied with $\mathrm{T} := \mathrm{T}_{k+1}$, $\mathrm{S} := \mathrm{T}_k^{*}$ and $\mathrm{U} := \mathrm{Id}$,

$$
\begin{aligned}
&\langle \nabla_{\mathrm{W}_2}\mathcal{F}^{-}(\mu_{k+1}) \circ \mathrm{T}_{k+1} - \nabla_{\mathrm{W}_2}\mathcal{F}^{-}(\mu_k), \mathrm{T}_k^{*} - \mathrm{T}_{k+1}\rangle_{L^2(\mu_k)} \\
&\quad= \mathrm{D}_{\mathcal{F}^{-}}^{\mu_k}(\mathrm{T}_k^{*}, \mathrm{Id}) - \mathrm{D}_{\mathcal{F}^{-}}^{\mu_k}(\mathrm{T}_k^{*}, \mathrm{T}_{k+1}) - \mathrm{D}_{\mathcal{F}^{-}}^{\mu_k}(\mathrm{T}_{k+1}, \mathrm{Id}).
\end{aligned}
\tag{92}
$$

Plugging ([92](https://arxiv.org/html/2606.27767#A6.E92)) into ([91](https://arxiv.org/html/2606.27767#A6.E91)), we get

$$
\mathrm{D}_{\mathcal{F}^{-}}^{\mu_k}(\mathrm{T}_k^{*}, \mathrm{Id}) - \mathrm{D}_{\mathcal{F}^{-}}^{\mu_k}(\mathrm{T}_{k+1}, \mathrm{Id}) \geq (\alpha + 1)\mathrm{D}_{\mathcal{F}^{-}}^{\mu_k}(\mathrm{T}_k^{*}, \mathrm{T}_{k+1}).
\tag{93}
$$

Using that $\mathcal{F}^{-}$ is convex along $t \mapsto \big((1-t)\mathrm{Id} + t\mathrm{T}_{k+1}\big)_{\#}\mu_k$, we get that $\mathrm{D}_{\mathcal{F}^{-}}^{\mu_k}(\mathrm{T}_{k+1}, \mathrm{Id}) \geq 0$. Using the definition of $\mathrm{T}_k^{*}$, we have $\mathrm{D}_{\mathcal{F}^{-}}^{\mu_k}(\mathrm{T}_k^{*}, \mathrm{Id}) = \mathrm{W}_{\mathcal{F}^{-}}(\mu^{*}, \mu_k)$, and since $\gamma = (\mathrm{T}_k^{*}, \mathrm{T}_{k+1})_{\#}\mu_k \in \Pi(\mu^{*}, \mu_{k+1})$, we also have $\mathrm{W}_{\mathcal{F}^{-}}(\mu^{*}, \mu_{k+1}) \leq \mathrm{D}_{\mathcal{F}^{-}}^{\mu_k}(\mathrm{T}_k^{*}, \mathrm{T}_{k+1})$. Thus, we obtain by induction

$$
\mathrm{W}_{\mathcal{F}^{-}}(\mu^{*}, \mu_{k+1}) \leq \mathrm{D}_{\mathcal{F}^{-}}^{\mu_k}(\mathrm{T}_k^{*}, \mathrm{T}_{k+1}) \leq \frac{1}{1+\alpha}\mathrm{W}_{\mathcal{F}^{-}}(\mu^{*}, \mu_k) \leq \left(\frac{1}{1+\alpha}\right)^{k+1}\mathrm{W}_{\mathcal{F}^{-}}(\mu^{*}, \mu_0).
\tag{94}
$$
Moreover, by definition ofTk\+1\\mathrm\{T\}\_\{k\+1\},
ℱ\(μk\+1\)\+Dℱ−μk\(Tk\+1,Id\)≤ℱ\(μ∗\)\+Dℱ−μk\(Tk∗,Id\)\.\\mathcal\{F\}\(\\mu\_\{k\+1\}\)\+\\mathrm\{D\}\_\{\\mathcal\{F\}^\{\-\}\}^\{\\mu\_\{k\}\}\(\\mathrm\{T\}\_\{k\+1\},\\mathrm\{Id\}\)\\leq\\mathcal\{F\}\(\\mu^\{\*\}\)\+\\mathrm\{D\}\_\{\\mathcal\{F\}^\{\-\}\}^\{\\mu\_\{k\}\}\(\\mathrm\{T\}\_\{k\}^\{\*\},\\mathrm\{Id\}\)\.\(95\)Hence,
ℱ\(μk\+1\)−ℱ\(μ∗\)≤Wℱ−\(μ∗,μk\)≤\(11\+α\)kWℱ−\(μ∗,μ0\)\.\\mathcal\{F\}\(\\mu\_\{k\+1\}\)\-\\mathcal\{F\}\(\\mu^\{\*\}\)\\leq\\mathrm\{W\}\_\{\\mathcal\{F\}^\{\-\}\}\(\\mu^\{\*\},\\mu\_\{k\}\)\\leq\\left\(\\frac\{1\}\{1\+\\alpha\}\\right\)^\{k\}\\mathrm\{W\}\_\{\\mathcal\{F\}^\{\-\}\}\(\\mu^\{\*\},\\mu\_\{0\}\)\.\(96\)
#### F.10 Proof of [Lemma D.1](https://arxiv.org/html/2606.27767#Thmlemma1)
Let $s = \|z\|_2^2$ and recall that $\psi(z) = q(\|z\|_2^2)$. Differentiating, we have $\nabla\psi(z) = 2q'(s)z$, hence differentiating a second time we obtain
$$\nabla^2\psi(z) = 2q'(s) I_d + 4q''(s) zz^\top. \tag{97}$$
Fix $z$ and let $e \perp z$ with $\|e\|_2 = 1$. Then $zz^\top e = 0$, and
$$\nabla^2\psi(z) e = 2q'(s) e, \tag{98}$$
so $e$ is a tangential eigenvector with eigenvalue $2q'(s)$. We can find $d-1$ linearly independent vectors orthogonal to $z$, hence $2q'(s)$ is an eigenvalue with multiplicity $d-1$. Now consider the radial direction $v = \frac{z}{\|z\|_2}$ (when $z \neq 0$). We have
$$\begin{aligned}
\nabla^2\psi(z) v &= \left(2q'(s) I_d + 4q''(s) zz^\top\right) v \\
&= 2q'(s) v + 4q''(s) zz^\top \frac{z}{\|z\|_2} \\
&= 2q'(s) v + 4q''(s) \|z\|_2\, z \\
&= 2q'(s) v + 4q''(s) \|z\|_2^2\, v \\
&= \big(2q'(s) + 4q''(s) s\big) v.
\end{aligned} \tag{99}$$
Hence $v$ is an eigenvector of $\nabla^2\psi(z)$ with eigenvalue $2q'(s) + 4q''(s)s$, of multiplicity 1. When $z = 0$, the Hessian reduces to $2q'(0) I_d$, which agrees with both eigenvalue formulas at $s = 0$. We have thus identified the full spectrum of $\nabla^2\psi(z)$, and the smallest eigenvalue is
$$\min\bigl\{2q'(s),\, 2q'(s) + 4sq''(s)\bigr\}, \tag{100}$$
and the largest eigenvalue is
$$\max\bigl\{2q'(s),\, 2q'(s) + 4sq''(s)\bigr\}. \tag{101}$$
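The eigenvalue computation can be checked numerically. The following is a minimal sketch, not part of the paper: it assumes the Gaussian profile $q(s) = e^{-\alpha s}$ with an arbitrary illustrative value of $\alpha$ and a hand-picked point $z \in \mathbb{R}^3$, builds the Hessian from the closed form (97), and verifies the tangential and radial eigenvalue relations.

```python
import math

ALPHA = 0.7  # hypothetical parameter, chosen only for this illustration

def q1(s):
    """q'(s) for the assumed profile q(s) = exp(-ALPHA * s)."""
    return -ALPHA * math.exp(-ALPHA * s)

def q2(s):
    """q''(s) for the same profile."""
    return ALPHA ** 2 * math.exp(-ALPHA * s)

def hessian(z):
    """Closed form (97): 2 q'(s) I_d + 4 q''(s) z z^T with s = ||z||^2."""
    s = sum(c * c for c in z)
    d = len(z)
    return [[2.0 * q1(s) * (i == j) + 4.0 * q2(s) * z[i] * z[j]
             for j in range(d)] for i in range(d)]

def matvec(m, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

z = [1.0, 2.0, -0.5]
s = sum(c * c for c in z)
v = [c / math.sqrt(s) for c in z]                        # radial unit vector
e = [2.0 / math.sqrt(5.0), -1.0 / math.sqrt(5.0), 0.0]   # unit vector orthogonal to z

H = hessian(z)
tangential = 2.0 * q1(s)                  # eigenvalue in (98)
radial = 2.0 * q1(s) + 4.0 * s * q2(s)    # eigenvalue in (99)

# Residuals of H e = tangential * e and H v = radial * v
err_tan = max(abs(a - tangential * b) for a, b in zip(matvec(H, e), e))
err_rad = max(abs(a - radial * b) for a, b in zip(matvec(H, v), v))
print(err_tan, err_rad)
```

Both residuals should be at the level of floating-point round-off; any other smooth profile $q$ can be substituted by editing `q1` and `q2`.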
#### F.11 Proof of [Lemma D.3](https://arxiv.org/html/2606.27767#Thmlemma3)
Recall that $k(x,y) = \psi(x-y)$, where $\psi(x-y) = \exp(-\alpha\|x-y\|_2^2)$. Let $q(s) = \exp(-\alpha s)$. Then $q(s) = q_+(s) - q_-(s)$, where
$$q_+(s) = \cosh(\alpha s), \quad q_-(s) = \sinh(\alpha s). \tag{102}$$
Taking the derivatives, we have
$$q_+'(s) = \alpha\sinh(\alpha s), \quad q_+''(s) = \alpha^2\cosh(\alpha s), \tag{103}$$
and
$$q_-'(s) = \alpha\cosh(\alpha s), \quad q_-''(s) = \alpha^2\sinh(\alpha s). \tag{104}$$
For $s \geq 0$, $\sinh(\alpha s) \geq 0$ and $\cosh(\alpha s) \geq 1$, so $q_\pm'(s) \geq 0$ and $q_\pm''(s) \geq 0$. Therefore $z \mapsto \psi_\pm(z) = q_\pm(\|z\|_2^2)$ are convex on $\Omega - \Omega$ by [Lemma D.2](https://arxiv.org/html/2606.27767#Thmlemma2), and we have a DC decomposition of the Gaussian kernel.
On $[0, S^*]$ we have
$$\lambda_+ = \inf_{s\in[0,S^*]} \min\big(2\alpha\sinh(\alpha s),\ 2\alpha\sinh(\alpha s) + 4\alpha^2 s\cosh(\alpha s)\big) = \inf_{s\in[0,S^*]} 2\alpha\sinh(\alpha s) = 0, \tag{105}$$
and
$$\begin{aligned}
\Lambda_+ &= \sup_{s\in[0,S^*]} \max\big(2\alpha\sinh(\alpha s),\ 2\alpha\sinh(\alpha s) + 4\alpha^2 s\cosh(\alpha s)\big) \\
&= \sup_{s\in[0,S^*]} 2\alpha\sinh(\alpha s) + 4\alpha^2 s\cosh(\alpha s) \\
&= 2\alpha\sinh(\alpha S^*) + 4\alpha^2 S^*\cosh(\alpha S^*),
\end{aligned} \tag{106}$$
since the function inside the supremum is non-decreasing in $s$. Similarly we have
$$\lambda_- = 2\alpha, \quad \Lambda_- = 2\alpha\cosh(\alpha S^*) + 4\alpha^2 S^*\sinh(\alpha S^*). \tag{107}$$
#### F.12 Proof of [Proposition D.12](https://arxiv.org/html/2606.27767#Thmproposition12)
Recall that $k(x,y) = \psi(x-y)$, where $\psi(x-y) = e^{-\alpha\|x-y\|_2^2}$. Let $q(s) = e^{-\alpha s}$. Then $q(s) = q_+(s) - q_-(s)$ where, for all $s \geq 0$,
$$q_+(s) = e^{-\alpha s} + \alpha s, \quad q_-(s) = \alpha s. \tag{108}$$
Taking the derivatives, we get
$$q_+'(s) = \alpha(1 - e^{-\alpha s}), \quad q_+''(s) = \alpha^2 e^{-\alpha s}, \tag{109}$$
and
$$q_-'(s) = \alpha, \quad q_-''(s) = 0. \tag{110}$$
Hence, for $s \geq 0$, $q_\pm'(s) \geq 0$ and $q_\pm''(s) \geq 0$, and thus $z \mapsto \psi_\pm(z) = q_\pm(\|z\|_2^2)$ are convex by [Lemma D.2](https://arxiv.org/html/2606.27767#Thmlemma2).
Moreover, we get the following minimal and maximal eigenvalues,
$$\lambda_+ = \inf_{s\in[0,+\infty)} \min\big\{2\alpha(1 - e^{-\alpha s}),\ 2\alpha(1 - e^{-\alpha s}) + 4s\alpha^2 e^{-\alpha s}\big\} = \inf_{s\in[0,+\infty)} 2\alpha(1 - e^{-\alpha s}) = 0, \tag{111}$$
and
$$\begin{aligned}
\Lambda_+ &= \sup_{s\in[0,+\infty)} \max\big\{2\alpha(1 - e^{-\alpha s}),\ 2\alpha(1 - e^{-\alpha s}) + 4s\alpha^2 e^{-\alpha s}\big\} \\
&= \sup_{s\in[0,+\infty)} f(s), \qquad f(s) := 2\alpha(1 - e^{-\alpha s}) + 4s\alpha^2 e^{-\alpha s} = 2\alpha + 2\alpha(2s\alpha - 1)e^{-\alpha s}.
\end{aligned} \tag{112}$$
The derivative of $f$ is $f'(s) = 4\alpha^2 e^{-\alpha s} - 2\alpha^2(2s\alpha - 1)e^{-\alpha s} = 2\alpha^2 e^{-\alpha s}(3 - 2s\alpha)$, which vanishes if and only if $s = \frac{3}{2\alpha}$, is positive for $s < \frac{3}{2\alpha}$ and negative for $s > \frac{3}{2\alpha}$. Moreover, $f''(s) = -2\alpha^3 e^{-\alpha s}(3 - 2s\alpha) - 4\alpha^3 e^{-\alpha s} = -2\alpha^3 e^{-\alpha s}(5 - 2s\alpha) \leq 0 \iff s \leq \frac{5}{2\alpha}$. Hence $s = \frac{3}{2\alpha}$ is the maximizer, and
$$\Lambda_+ = 2\alpha\big(1 + 2e^{-\frac{3}{2}}\big). \tag{113}$$
Similarly, $\lambda_- = \Lambda_- = 2\alpha$.
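As a sanity check on the closed form for $\Lambda_+$, the following sketch (an illustration only, with an arbitrary hypothetical value of $\alpha$ and a truncated grid standing in for $[0,+\infty)$) compares a brute-force maximum of the radial eigenvalue $2q_+'(s) + 4sq_+''(s)$ with $2\alpha(1 + 2e^{-3/2})$.

```python
import math

ALPHA = 1.3  # hypothetical kernel parameter, for illustration only

def f(s):
    """Radial eigenvalue 2 q_+'(s) + 4 s q_+''(s) for q_+(s) = exp(-ALPHA s) + ALPHA s."""
    return (2 * ALPHA * (1 - math.exp(-ALPHA * s))
            + 4 * s * ALPHA ** 2 * math.exp(-ALPHA * s))

# Brute-force maximum on a grid over [0, 40]; f decays back to 2*ALPHA beyond its peak
grid = [i * 1e-3 for i in range(0, 40001)]
grid_max = max(f(s) for s in grid)

s_star = 3 / (2 * ALPHA)                            # stationary point of f
closed_form = 2 * ALPHA * (1 + 2 * math.exp(-1.5))  # claimed value of Lambda_+
print(grid_max, f(s_star), closed_form)
```

The grid maximum should sit just below the closed form, with a gap governed by the grid step.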
#### F.13 Proof of [Lemma D.5](https://arxiv.org/html/2606.27767#Thmlemma5)
The derivatives of $q_-$ give, for all $s \geq 0$,
$$q_-'(s) = \frac{1}{2\sqrt{\varepsilon+s}}, \quad q_-''(s) = -\frac{1}{4(\varepsilon+s)^{3/2}}. \tag{114}$$
For the minimum eigenvalue, since $q_-''(s) \leq 0$, we have
$$\begin{aligned}
\min\bigl\{2q_-'(s),\ 2q_-'(s) + 4sq_-''(s)\bigr\} &= 2q_-'(s) + 4sq_-''(s) \\
&= \frac{1}{\sqrt{\varepsilon+s}} - \frac{s}{(s+\varepsilon)^{3/2}} \\
&= \frac{\varepsilon}{(s+\varepsilon)^{3/2}}.
\end{aligned} \tag{115}$$
This quantity is decreasing in $s$. Therefore, for $s \in [0, S_*]$, the infimum is attained at $s = S_*$ and
$$\lambda_- = \frac{\varepsilon}{(S_*+\varepsilon)^{3/2}}. \tag{116}$$
For the maximum eigenvalue we have
$$\max\bigl\{2q_-'(s),\ 2q_-'(s) + 4sq_-''(s)\bigr\} = 2q_-'(s) = \frac{1}{\sqrt{\varepsilon+s}}, \tag{117}$$
which is also decreasing in $s$. Therefore, for $s \in [0, S_*]$, the supremum is attained at $s = 0$ and
$$\Lambda_- = \frac{1}{\sqrt{\varepsilon}}. \tag{118}$$
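The two constants can be cross-checked on a grid. This sketch is only an illustration: it uses the derivatives stated in (114) directly, and the values of $\varepsilon$ and $S_*$ below are hypothetical.

```python
import math

EPS, S_MAX = 0.1, 4.0   # hypothetical values of epsilon and S_*

def q1(s):
    """q_-'(s) as given in (114)."""
    return 1.0 / (2.0 * math.sqrt(EPS + s))

def q2(s):
    """q_-''(s) as given in (114)."""
    return -1.0 / (4.0 * (EPS + s) ** 1.5)

def eigenvalues(s):
    """Tangential and radial Hessian eigenvalues at squared radius s."""
    return 2 * q1(s), 2 * q1(s) + 4 * s * q2(s)

grid = [S_MAX * i / 10000 for i in range(10001)]   # includes s = 0 and s = S_*
lam_grid = min(min(eigenvalues(s)) for s in grid)
Lam_grid = max(max(eigenvalues(s)) for s in grid)

lam_closed = EPS / (S_MAX + EPS) ** 1.5   # claimed lambda_-
Lam_closed = 1.0 / math.sqrt(EPS)         # claimed Lambda_-
print(lam_grid, lam_closed, Lam_grid, Lam_closed)
```

Note how $\lambda_-$ degrades as $S_*$ grows while $\Lambda_-$ blows up as $\varepsilon \to 0$, so both parameters affect the conditioning of this component.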
#### F.14 Proof of [Lemma D.6](https://arxiv.org/html/2606.27767#Thmlemma6)
The kernel $\psi(z) = \frac{1}{(c^2 + \|z\|_2^2)^\alpha}$ corresponds to $q(s) = \frac{1}{(c^2+s)^\alpha}$ with $\alpha \geq 1$. We have, for $s \geq 0$,
$$q'(s) = -\alpha\left(c^2+s\right)^{-\alpha-1}, \quad q''(s) = \alpha(\alpha+1)\left(c^2+s\right)^{-\alpha-2}. \tag{119}$$
Hence we have $q'(s) < 0$ and $q''(s) > 0$, and thus $q$ is convex on $s \geq 0$, but $\psi$ is not convex since $q'(s) < 0$. Set $A = \max\big(0, -q'(0)\big) = \alpha c^{-2(\alpha+1)}$, hence
$$q_-(s) = As - \int_0^s (s-t)\min\big(0, q''(t)\big)\ \mathrm{d}t = \alpha c^{-2(\alpha+1)} s, \tag{120}$$
and
$$q_+(s) = q(s) + q_-(s) = \frac{1}{(c^2+s)^\alpha} + \alpha c^{-2(\alpha+1)} s. \tag{121}$$
Now turning to the minimum eigenvalue of the Hessian, we have
$$\underline{\lambda}[q_+] \coloneqq \inf_{s\geq 0} \min\bigl\{2q_+'(s),\ 2q_+'(s) + 4sq_+''(s)\bigr\}, \quad \underline{\lambda}[q_-] \coloneqq \inf_{s\geq 0} \min\bigl\{2q_-'(s),\ 2q_-'(s) + 4sq_-''(s)\bigr\}. \tag{122}$$
By construction $q_+'(s) \geq 0$ and $q_+''(s) \geq 0$, hence $q_+'$ is non-decreasing and $\underline{\lambda}[q_+] = \inf_{s\geq 0} 2q_+'(s) = 2q_+'(0) = 2(-A + A) = 0$. On the other hand, $q_-'(s) = \alpha c^{-2(\alpha+1)}$ and $q_-''(s) = 0$, hence $\underline{\lambda}[q_-] = 2\alpha c^{-2(\alpha+1)}$. For the maximum eigenvalue of the Hessian we have likewise
$$\overline{\Lambda}[q_+] = \sup_{s\geq 0} \max\bigl\{2q_+'(s),\ 2q_+'(s) + 4sq_+''(s)\bigr\} = \sup_{s\geq 0}\ 2q_+'(s) + 4sq_+''(s). \tag{123}$$
Let us denote, for $s \geq 0$,
$$f(s) = 2q_+'(s) + 4sq_+''(s), \quad f'(s) = 6q_+''(s) + 4sq_+'''(s). \tag{124}$$
We have
$$\begin{aligned}
f'(s) &= 6\alpha(\alpha+1)(c^2+s)^{-\alpha-2} - 4\alpha(\alpha+1)(\alpha+2)s(c^2+s)^{-\alpha-3} \\
&= \alpha(\alpha+1)(c^2+s)^{-\alpha-2}\left(6 - 4(\alpha+2)\frac{s}{c^2+s}\right).
\end{aligned} \tag{125}$$
Let $s^*$ be such that
$$6 - 4(\alpha+2)\frac{s^*}{c^2+s^*} = 0, \tag{126}$$
equivalently
$$s^* = \frac{6c^2}{4(\alpha+2) - 6} = \frac{3c^2}{2\alpha+1} \geq 0. \tag{127}$$
We have $f'(s) \geq 0$ for $s \in [0, s^*]$ and $f'(s) \leq 0$ for $s \in [s^*, +\infty)$, hence $f(s^*)$ is the global supremum. The result for $\overline{\Lambda}[q_-]$ is immediate.
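The location of the maximizer can be confirmed numerically. The sketch below is illustrative only: the exponent $\alpha$ and offset $c$ are hypothetical values, and a finite grid stands in for $[0,+\infty)$.

```python
import math

ALPHA, C = 1.5, 0.8   # hypothetical exponent (>= 1) and offset c

A = ALPHA * C ** (-2 * (ALPHA + 1))   # A = -q'(0)

def f(s):
    """2 q_+'(s) + 4 s q_+''(s) with q_+(s) = (c^2 + s)^(-alpha) + A s."""
    q1 = -ALPHA * (C ** 2 + s) ** (-ALPHA - 1) + A
    q2 = ALPHA * (ALPHA + 1) * (C ** 2 + s) ** (-ALPHA - 2)
    return 2 * q1 + 4 * s * q2

s_star = 6 * C ** 2 / (4 * (ALPHA + 2) - 6)   # closed-form stationary point (127)

grid = [i * 1e-3 for i in range(0, 20001)]    # s in [0, 20]
grid_max = max(f(s) for s in grid)
print(s_star, f(s_star), grid_max)
```

Evaluating `f(s_star)` then gives a numerical value for $\overline{\Lambda}[q_+]$ without expanding the closed-form expression by hand.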
#### F.15 Proof of [Proposition D.13](https://arxiv.org/html/2606.27767#Thmproposition13)
Recall that, based on the DC decomposition of [Proposition 7](https://arxiv.org/html/2606.27767#Thmproposition7), $\mathcal{F}^+ = \mathcal{W}^+ + \mathcal{V}^- + c$ where
$$\mathcal{W}^+(\mu) = \frac{1}{2}\iint \psi^+(x-y)\ \mathrm{d}\mu(x)\mathrm{d}\mu(y), \quad \mathcal{V}^-(\mu) = \int \mathrm{V}^-(x)\ \mathrm{d}\mu(x), \tag{128}$$
and
$$\mathrm{V}^-(x) = \int \psi^-(x-y)\ \mathrm{d}\nu(y). \tag{129}$$
As $\mathcal{W}^+$ is an interaction energy and $\mathcal{V}^-$ a potential energy, their Wasserstein gradients at $\mu \in \mathcal{P}_2(\Omega)$ read, for all $x \in \mathbb{R}^d$,
$$\begin{aligned}
\nabla_{\mathrm{W}_2}\mathcal{W}^+(\mu)(x) &= \nabla(\psi^+ * \mu)(x), \\
\nabla_{\mathrm{W}_2}\mathcal{V}^-(\mu)(x) &= \nabla\mathrm{V}^-(x) = \nabla(\psi^- * \nu)(x).
\end{aligned} \tag{130}$$
Hence, the Wasserstein gradient of $\mathcal{F}^+$ at $\mu$ is, for all $x \in \mathbb{R}^d$,
$$\begin{aligned}
\nabla_{\mathrm{W}_2}\mathcal{F}^+(\mu)(x) &= \nabla_{\mathrm{W}_2}\mathcal{W}^+(\mu)(x) + \nabla_{\mathrm{W}_2}\mathcal{V}^-(\mu)(x) \\
&= \nabla(\psi^+ * \mu)(x) + \nabla(\psi^- * \nu)(x).
\end{aligned} \tag{131}$$
Let $\mathrm{T} \in L^2(\mu)$ be such that $\mathrm{T}_{\#}\mu = \sigma$. Define $a, b: \mathbb{R}^d \to \mathbb{R}^d$ as
$$a(x) \coloneqq \nabla(\psi^+ * \mu)(x) - \nabla(\psi^+ * \sigma)\big(\mathrm{T}(x)\big), \quad b(x) \coloneqq \nabla(\psi^- * \nu)(x) - \nabla(\psi^- * \nu)\big(\mathrm{T}(x)\big). \tag{132}$$
Then, we have the following relation between the Wasserstein gradients of $\mathcal{F}^+$ at $\mu$ and $\sigma$:
$$\nabla_{\mathrm{W}_2}\mathcal{F}^+(\mu) - \nabla_{\mathrm{W}_2}\mathcal{F}^+(\sigma)\circ\mathrm{T} = a + b. \tag{133}$$
Bounding the term $\|a\|_{L^2(\mu)}$. Looking at the term $a$, we have
$$\begin{aligned}
a(x) &= \int \nabla\psi^+(x-y)\ \mathrm{d}\mu(y) - \int \nabla\psi^+\big(\mathrm{T}(x) - z\big)\ \mathrm{d}\sigma(z) \\
&= \int \left(\nabla\psi^+(x-y) - \nabla\psi^+\big(\mathrm{T}(x) - \mathrm{T}(y)\big)\right)\ \mathrm{d}\mu(y),
\end{aligned} \tag{134}$$
since $\mathrm{T}_{\#}\mu = \sigma$. By assumption, $\nabla\psi^+$ is $\Lambda_+$-Lipschitz on $\Omega - \Omega$, and hence we have
$$\|a(x)\|_2 \leq \Lambda_+ \int \|(x-y) - (\mathrm{T}(x) - \mathrm{T}(y))\|_2\ \mathrm{d}\mu(y). \tag{135}$$
Now, by Jensen's inequality and rearranging terms, we obtain
$$\|a(x)\|_2^2 \leq \Lambda_+^2 \int \|(x - \mathrm{T}(x)) - (y - \mathrm{T}(y))\|_2^2\ \mathrm{d}\mu(y). \tag{136}$$
Integrating in $x$ *w.r.t.* $\mu$,
$$\int \|a(x)\|_2^2\ \mathrm{d}\mu(x) \leq \Lambda_+^2 \iint \|(x - \mathrm{T}(x)) - (y - \mathrm{T}(y))\|_2^2\ \mathrm{d}\mu(x)\mathrm{d}\mu(y). \tag{137}$$
Let us now provide an intermediary lemma before bounding $a$.
###### Lemma F.9.
Let $H: \Omega \to \mathbb{R}^d$ with $H \in L^2(\mu)$. We have
$$\iint \|H(x) - H(y)\|_2^2\ \mathrm{d}\mu(x)\mathrm{d}\mu(y) = 2\int \|H(x)\|_2^2\ \mathrm{d}\mu(x) - 2\left\|\int H(x)\ \mathrm{d}\mu(x)\right\|_2^2. \tag{138}$$
###### Proof of Lemma [F.9](https://arxiv.org/html/2606.27767#Thmlemma9).
Note that
$$\begin{aligned}
\iint \|H(x) - H(y)\|_2^2\ \mathrm{d}\mu(x)\mathrm{d}\mu(y) &= \iint \big(\|H(x)\|_2^2 + \|H(y)\|_2^2 - 2\langle H(x), H(y)\rangle\big)\ \mathrm{d}\mu(x)\mathrm{d}\mu(y) \\
&= 2\int \|H(x)\|_2^2\ \mathrm{d}\mu(x) - 2\left\|\int H(x)\ \mathrm{d}\mu(x)\right\|_2^2.
\end{aligned} \tag{139}$$
∎
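The identity of Lemma F.9 is the usual "pairwise distances equal twice the variance" relation, and it is easy to test on an empirical measure. The sketch below assumes $\mu$ is uniform on $n$ atoms and uses random vectors as the values of $H$ at those atoms; all data are synthetic.

```python
import random

random.seed(0)
n, d = 50, 3
# Values H(x_i) at the n atoms of a uniform empirical measure mu (synthetic data)
H = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]

def sq_norm(u):
    return sum(c * c for c in u)

# Left-hand side: double integral of ||H(x) - H(y)||^2 against mu x mu
lhs = sum(sq_norm([a - b for a, b in zip(hx, hy)]) for hx in H for hy in H) / n ** 2

# Right-hand side: 2 * E||H||^2 - 2 * ||E[H]||^2
mean_H = [sum(h[k] for h in H) / n for k in range(d)]
rhs = 2 * sum(sq_norm(h) for h in H) / n - 2 * sq_norm(mean_H)
print(lhs, rhs)
```

Because the subtracted mean term is non-negative, the left-hand side never exceeds $2\int\|H\|_2^2\,\mathrm{d}\mu$, which is the form in which the lemma is used next.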
Using Lemma [F.9](https://arxiv.org/html/2606.27767#Thmlemma9) in Equation ([137](https://arxiv.org/html/2606.27767#A6.E137)) for $H(x) = x - \mathrm{T}(x)$, we obtain
$$\begin{aligned}
\int \|a(x)\|_2^2\ \mathrm{d}\mu(x) &\leq 2\Lambda_+^2 \int \|x - \mathrm{T}(x)\|_2^2\ \mathrm{d}\mu(x) - 2\Lambda_+^2 \left\|\int (x - \mathrm{T}(x))\ \mathrm{d}\mu(x)\right\|_2^2 \\
&\leq 2\Lambda_+^2 \|\mathrm{T} - \mathrm{Id}\|_{L^2(\mu)}^2.
\end{aligned} \tag{140}$$
Therefore we have
$$\|a\|_{L^2(\mu)} \leq \sqrt{2}\,\Lambda_+ \|\mathrm{T} - \mathrm{Id}\|_{L^2(\mu)}. \tag{141}$$
Bounding the term $\|b\|_{L^2(\mu)}$. Note that $\nabla\mathrm{V}^-$ is $\Lambda_-$-Lipschitz since $\nabla^2\psi^- \preceq \Lambda_- I_d$ and hence, for all $x \in \Omega$,
$$\nabla^2\mathrm{V}^-(x) = \int \nabla^2\psi^-(x-y)\ \mathrm{d}\nu(y) \preceq \Lambda_- I_d. \tag{142}$$
Hence
$$\|b(x)\|_2 = \|\nabla\mathrm{V}^-(x) - \nabla\mathrm{V}^-(\mathrm{T}(x))\|_2 \leq \Lambda_- \|x - \mathrm{T}(x)\|_2. \tag{143}$$
Squaring and integrating in $x$ *w.r.t.* $\mu$, we finally have
$$\|b\|_{L^2(\mu)} \leq \Lambda_- \|\mathrm{T} - \mathrm{Id}\|_{L^2(\mu)}. \tag{144}$$
Final bound. Combining Equations ([141](https://arxiv.org/html/2606.27767#A6.E141)) and ([144](https://arxiv.org/html/2606.27767#A6.E144)), we obtain
$$\begin{aligned}
\|\nabla_{\mathrm{W}_2}\mathcal{F}^+(\mu) - \nabla_{\mathrm{W}_2}\mathcal{F}^+(\sigma)\circ\mathrm{T}\|_{L^2(\mu)} &= \|a + b\|_{L^2(\mu)} \\
&\leq \|a\|_{L^2(\mu)} + \|b\|_{L^2(\mu)} \\
&\leq \left(\sqrt{2}\,\Lambda_+ + \Lambda_-\right) \|\mathrm{T} - \mathrm{Id}\|_{L^2(\mu)}.
\end{aligned} \tag{145}$$
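The final bound can be probed on particles. The following is a minimal sketch under stated assumptions, not the authors' implementation: it takes the Gaussian-kernel decomposition of Proposition D.12 ($\psi^+(z) = e^{-\alpha\|z\|_2^2} + \alpha\|z\|_2^2$, $\psi^-(z) = \alpha\|z\|_2^2$, hence $\Lambda_+ = 2\alpha(1 + 2e^{-3/2})$ and $\Lambda_- = 2\alpha$), uniform empirical measures for $\mu$ and $\nu$, and a randomly perturbed map $\mathrm{T}$ defined on the atoms of $\mu$. All names, sizes, and parameter values are hypothetical.

```python
import math
import random

random.seed(1)
ALPHA, n, d = 0.5, 30, 2          # hypothetical kernel parameter and sizes

def grad_psi_plus(z):
    """Gradient of psi^+(z) = exp(-ALPHA ||z||^2) + ALPHA ||z||^2, i.e. 2 q_+'(s) z."""
    s = sum(c * c for c in z)
    return [2 * ALPHA * (1 - math.exp(-ALPHA * s)) * c for c in z]

def grad_psi_minus(z):
    """Gradient of psi^-(z) = ALPHA ||z||^2."""
    return [2 * ALPHA * c for c in z]

def conv_grad(grad, x, atoms):
    """Gradient of (psi * rho)(x) for the uniform empirical measure rho on `atoms`."""
    out = [0.0] * len(x)
    for y in atoms:
        g = grad([a - b for a, b in zip(x, y)])
        out = [o + gi / len(atoms) for o, gi in zip(out, g)]
    return out

X = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]   # atoms of mu
Y = [[random.gauss(2, 1) for _ in range(d)] for _ in range(n)]   # atoms of nu
TX = [[c + 0.3 * random.gauss(0, 1) for c in x] for x in X]      # T(x_i): atoms of sigma

def grad_F_plus(x, atoms):
    """Wasserstein gradient (131) of F^+ at the empirical measure on `atoms`."""
    gp = conv_grad(grad_psi_plus, x, atoms)
    gm = conv_grad(grad_psi_minus, x, Y)
    return [p + m for p, m in zip(gp, gm)]

lhs_sq = 0.0    # squared L2(mu) norm of grad F^+(mu) - grad F^+(sigma) o T
disp_sq = 0.0   # squared L2(mu) norm of T - Id
for x, tx in zip(X, TX):
    diff = [u - v for u, v in zip(grad_F_plus(x, X), grad_F_plus(tx, TX))]
    lhs_sq += sum(c * c for c in diff) / n
    disp_sq += sum((a - b) ** 2 for a, b in zip(x, tx)) / n

L = math.sqrt(2) * 2 * ALPHA * (1 + 2 * math.exp(-1.5)) + 2 * ALPHA
lhs, rhs = math.sqrt(lhs_sq), L * math.sqrt(disp_sq)
print(lhs, rhs)
```

In such experiments the left-hand side is typically well below the right-hand side, since the constant $L$ is a worst-case Lipschitz bound.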
#### F.16 Proof of [Theorem D.1](https://arxiv.org/html/2606.27767#Thmtheorem1)
In Wasserstein CCCP we have $\nabla_{\mathrm{W}_2}\mathcal{F}^+(\mu_{k+1})\circ\mathrm{T}_{k+1} = \nabla_{\mathrm{W}_2}\mathcal{F}^-(\mu_k)$, and hence
$$\begin{aligned}
\nabla_{\mathrm{W}_2}\mathcal{F}(\mu_k) &= \nabla_{\mathrm{W}_2}\mathcal{F}^+(\mu_k) - \nabla_{\mathrm{W}_2}\mathcal{F}^-(\mu_k) \\
&= \nabla_{\mathrm{W}_2}\mathcal{F}^+(\mu_k) - \nabla_{\mathrm{W}_2}\mathcal{F}^+(\mu_{k+1})\circ\mathrm{T}_{k+1}.
\end{aligned} \tag{146}$$
Hence we have
$$\begin{aligned}
\|\nabla_{\mathrm{W}_2}\mathcal{F}(\mu_k)\|_{L^2(\mu_k)} &= \|\nabla_{\mathrm{W}_2}\mathcal{F}^+(\mu_k) - \nabla_{\mathrm{W}_2}\mathcal{F}^+(\mu_{k+1})\circ\mathrm{T}_{k+1}\|_{L^2(\mu_k)} \\
&\leq L \|\mathrm{T}_{k+1} - \mathrm{Id}\|_{L^2(\mu_k)},
\end{aligned} \tag{147}$$
where we used [Proposition D.13](https://arxiv.org/html/2606.27767#Thmproposition13) with $\mu = \mu_k$, $\sigma = \mu_{k+1}$, $\mathrm{T} = \mathrm{T}_{k+1}$ and $L = \sqrt{2}\,\Lambda_+ + \Lambda_-$. Since $\lambda_\pm \geq 0$ and $\lambda_+ + \lambda_- > 0$, we can apply [Proposition 5](https://arxiv.org/html/2606.27767#Thmproposition5) with $\alpha_\pm = \lambda_\pm$. Squaring inequality ([147](https://arxiv.org/html/2606.27767#A6.E147)), taking the minimum over $k$ on both sides, and applying [Proposition 5](https://arxiv.org/html/2606.27767#Thmproposition5), we finally have
$$\begin{aligned}
\min_{0\leq k\leq K-1}\ \|\nabla_{\mathrm{W}_2}\mathcal{F}(\mu_k)\|_{L^2(\mu_k)}^2 &\leq L^2 \min_{0\leq k\leq K-1}\ \|\mathrm{T}_{k+1} - \mathrm{Id}\|_{L^2(\mu_k)}^2 \\
&\leq \frac{2L^2}{\alpha_+ + \alpha_-}\, \frac{\mathcal{F}(\mu_0) - \mathcal{F}(\mu_K)}{K}.
\end{aligned} \tag{148}$$
As $K \to \infty$, the right-hand side vanishes provided $\mathcal{F}$ is bounded from below, so the smallest gradient norm along the iterates tends to zero, i.e. the iterates become almost stationary for $\mathcal{F}$.